digitalmars.D - Issues with debugging GC-related crashes #2
- Matthias Klumpp (106/106) Apr 16 2018 Hi!
- Matthias Klumpp (5/13) Apr 16 2018 Another thing to mention is that the software uses LMDB[1] and
- Kagamin (2/5) Apr 17 2018 What do you use destructors for?
- Kagamin (3/3) Apr 17 2018 Other stuff to try:
- Matthias Klumpp (52/55) Apr 17 2018 I haven't tried that yet (next on my todo list), if I do run the
- Kagamin (8/8) Apr 18 2018 You can call GC.collect at some points in the program to see if
- Matthias Klumpp (34/42) Apr 18 2018 I already do that, and indeed I get crashes. I could throw those
- Johannes Pfau (16/29) Apr 18 2018 The important point to note here is that this is not one of these 'GC
- kinke (8/10) Apr 18 2018 Interesting, but I don't think it applies here. Both start and
- Matthias Klumpp (8/20) Apr 18 2018 size_t memSize = pooltable.maxAddr - minAddr;
- Johannes Pfau (16/37) Apr 18 2018 I see. Then I'd try to debug where the range originally comes from, try
- Johannes Pfau (7/17) Apr 19 2018 Of course, if this is a GC pool / heap range adding breakpoints in the
- Johannes Pfau (11/28) Apr 19 2018 Having a quick look at https://github.com/ldc-developers/druntime/blob/
- Kagamin (25/32) Apr 19 2018 If big LMDB mapping causes a problem, try a test like this:
- Kagamin (2/2) Apr 19 2018 foreach(i;0..10000)
- Matthias Klumpp (43/76) Apr 19 2018 I tried something similar, with no effect.
- kinke (6/42) Apr 19 2018 You probably already figured that the new Fiber seems to be
- Matthias Klumpp (39/83) Apr 19 2018 Jup, I did that already, it just took a really long time to run
- Matthias Klumpp (3/6) Apr 19 2018 I forgot to mention that, the error code was 12, ENOMEM, so this
- Dmitry Olshansky (8/19) Apr 19 2018 I think the order of operations is wrong, here is an example from
- Matthias Klumpp (10/30) Apr 20 2018 Indeed! It's also the only place where this is shuffled around,
- Matthias Klumpp (21/47) Apr 20 2018 Turns out that was indeed the case! I created a small testcase
- Dmitry Olshansky (5/21) Apr 23 2018 Partly dumb luck on my part since I opened hashmap file first
- Matthias Klumpp (7/30) Apr 18 2018 Just to be sure, I applied your patch, but unfortunately I still
- Kagamin (5/11) Apr 19 2018 Can you narrow down the earliest point at which it starts to
- Kagamin (4/20) Apr 19 2018 As a workaround:
- kinke (8/11) Apr 18 2018 Speaking for LDC, none are, they all need to be enabled
- Matthias Klumpp (7/18) Apr 18 2018 Yeah... Maybe making a CI build with "enable all the things"
- Matthias Klumpp (88/94) Apr 18 2018 No luck...
- negi (12/13) Apr 18 2018 This reminds me of (otherwise unrelated) problems I had involving
- Kagamin (4/11) Apr 20 2018 Indeed, this is iteration over Treap!Range used to store ranges
Hi! I am developing a software called AppStream Generator in D, which is the default way of Debian and Ubuntu (and Arch Linux) to produce metadata for their software center applications. D is working well for that purpose now, and - except for high memory usage - there are no issues on Debian. On Ubuntu, however, the software regularly crashes when the GC tries to mark a memory range that is not accessible to it (likely already freed). The software is compiled using LDC 1.8.0, and uses D language bindings for C libraries generated by gir-to-d[1] as well as the EMSI containers library[2]. All of these are loaded as shared libraries. You can find the source-code of appstream-generator on Github[3]. The code uses std.typecons.scoped occasionally, does no GC allocations in destructors and does nothing to mess with the GC in general. There are a few calls to GC.add/removeRoot in the gir-to-d generated code (ObjectG.d), but those are very unlikely to cause issues (removing them did yield the same crash, and the same code is used by more projects). Running the tool under gdb yields backtraces like: ``` Thread 1 "appstream-gener" received signal SIGSEGV, Segmentation fault. 0x00007ffff5121168 in _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>) at gc.d:1990 1990 gc.d: No such file or directory. (gdb) bt full _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>) at gc.d:1990 p = 0xe256e <error: Cannot access memory at address 0xe256e> p1 = 0x7fcf4d721010 p2 = 0x7fcf4e321010 stackPos = 0 stack = {{pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x30b28ac000 <error: Cannot access memory at address 0x30b28ac000>}, {pbot = 0x7fcf45721000 "`&<\365\377\177", ptop = 0x3b <error: Cannot access memory at address 0x3b>}, {pbot = 0x0, ptop = 0x7fcf4f6f3000 "are/icons/Moka/16x16/apps/AdobeReader12.png\n/usr/share/icons/Moka/16x16/apps/AdobeReader8.png\n/usr/share/icons/Moka/16x16/apps/AdobeReader9.png\n/usr/share/icons/Moka/16x16/apps/Blender.pn \n/usr/share/"...}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x30b28ac000 <error: Cannot access memory at address 0x30b28ac000>}, {pbot = 0x7fcf45721000 "`&<\365\377\177", ptop = 0x3b <error: Cannot access memory at address 0x3b>}, {pbot = 0x1083c00 "0V\001\340\337\177", ptop = 0x0}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x18 <error: Cannot access memory at address 0x18>}, {pbot = 0x16 <error: Cannot access memory at address 0x16>, ptop = 0x146a650 ""}, {pbot = 0x0, ptop = 0x7fcf4f68c000 "256x256/apps/homebank.png\n/usr/share/icons/Moka/256x256/apps/hp-logo.png\n/usr/share/icons/Moka/256x256/apps/hugin.png\n/usr/share/icons/Moka/256x256/apps/hydrogen.png\n/usr/share/icons/Mok /256x256/apps"...}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x30b28ac000 <error: Cannot access memory at address 0x30b28ac000>}, {pbot = 0x7fcf45721000 "`&<\365\377\177", ptop = 0x3b <error: Cannot access memory at address 0x3b>}, {pbot = 0x1083c00 "0V\001\340\337\177", ptop = 0x7fcf4f6bc000 "ons/Moka/48x48/places/distributor-logo-mageia.png\n/usr/share/icons/Moka/48x48/places/distributor-logo-mandriva.png\n/usr/share/icons/Moka/48x48/places/distributor-logo-manjaro.png\n/usr/sh 
re/icons/Moka"...}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x18 <error: Cannot access memory at address 0x18>}, {pbot = 0x16 <error: Cannot access memory at address 0x16>, ptop = 0x146a650 ""}, {pbot = 0x0, ptop = 0x7fcf4f466000 "/opera-extension.svg\n/usr/share/icons/Numix/64/mimetypes/package-gdebi.svg\n/usr/share/icons/Numix/64/mimetypes/package-x-generic.svg\n/usr/share/icons/Numix/64/mimetypes/package.svg\n/usr/ hare/icons/Nu"...}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x30b28ac000 <error: Cannot access memory at address 0x30b28ac000>}, {pbot = 0x7fcf45721000 "`&<\365\377\177", ptop = 0x3b <error: Cannot access memory at address 0x3b>}, {pbot = 0x1083c00 "0V\001\340\337\177", ptop = 0x7fcf4f01e000 "pirus-Adapta-Nokto/16x16/actions/upcomingevents-amarok.svg\n/usr/share/icons/Papirus-Adapta-Nokto/16x16/actions/upindicator.svg\n/usr/share/icons/Papirus-Adapta-Nokto/16x16/actions/upload-m dia.svg\n/usr"...}, {pbot = 0x1 <error: Cannot access memory at address 0x1>, ptop = 0x30b28ac000 <error: Cannot access memory at address 0x30b28ac000>}, {pbot = 0x7fcf45721000 "`&<\365\377\177", ptop = 0x3b <error: Cannot access memory at address 0x3b>}, {pbot = 0x1083c00 "0V\001\340\337\177", ptop = 0x7fdfd8faa000 "icons/ContrastHigh/32x32/status/user-offline.png\n/usr/share/icons/ContrastHigh/32x32/status/user-status-pending.png\n/usr/share/icons/ContrastHigh/32x32/status/user-trash-full.png\n/usr/sh re/icons/Cont"...}, {pbot = 0x75671e0 "P", ptop = 0x75671e0 "P"}, {pbot = 0x75671a0 "\020\203\244\004", ptop = 0x7fffffffbc00 "s\f"}, {pbot = 0x0, ptop = 0x7567420 "P"}, {pbot = 0x7567420 "P", ptop = 0xc735e0 ""}, {pbot = 0x1 <error: Cannot access memory at address 0x1>, ptop = 0xc73 <error: Cannot access memory at address 0xc73>}, {pbot = 0xc735e <error: Cannot access memory at address 0xc735e>, ptop = 0xc735e0 ""}, {pbot = 0x17 <error: Cannot access memory at address 0x17>, ptop = 0x18 <error: Cannot access memory at address 0x18>}, {pbot = 0x16 <error: Cannot access memory at address 0x16>, ptop = 0x146a650 ""}, {pbot = 0x0, ptop = 0x7568230 "P"}, {pbot = 0x7568230 "P", ptop = 0x7568230 "P"}, {pbot = 0x75681f0 "\220\202\337\006", ptop = 0x7fffffffbc90 "\300\274\377\377\377\177"}} pcache = 0 pools = 0x1083c00 highpool = 59 minAddr = 0x7fcf45721000 "`&<\365\377\177" memSize = 209153867776 base = 0x17 <error: Cannot access memory at address 0x17> top = 0xe256e0 "" _D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm1 gcinterface5RangeZi (__applyArg0=...) 
at gc.d:2188
range = {pbot = 0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop = 0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>, ti = 0x0}
this = 0x8635d0: {rootsLock = {impl = {val = 1, contention = 0 '\000'}}, rangesLock = {impl = {val = 1, contention = 0 '\000'}}, roots = {root = 0x0, rand48 = {rng_state = 8187282149633}}, ranges = {root = 0x703d2d0, rand48 = {rng_state = 637908263724}}, log = false, disabled = 0, pooltable = {pools = 0x1083c00, npools = 60, _minAddr = 0x7fcf45721000 "`&<\365\377\177", _maxAddr = 0x7ffff7fcd000 "\327\207\017+"}, bucket = {0x7fdeebfaf6f0, 0x7fdeebfff480, 0x7fdeebffa200, 0x7fdeebffb880, 0x7fdeebffcc00, 0x0, 0x7fdeebffec00, 0x7fdeebfed800}, smallCollectThreshold = 494324, largeCollectThreshold = 320094, usedSmallPages = 507904, usedLargePages = 290132, mappedPages = 813954, toscan = {_length = 0, _p = 0x7ffff7ebd000, _cap = 4096}}
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZ9__lambda2MFNbKxSQCpQCpQCfZi (e=...) at treap.d:47
dg = {context = 0x7fffffffc140 "\320\065\206", funcptr = 0x7ffff5121d10 <_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi>}
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMFNbKxSQDiQDiQCyZiZi (node=0x7568700, dg=...) at treap.d:221
result = 0
```

See https://paste.debian.net/1020595/ and https://paste.debian.net/1020596/ for long backtraces (and https://paste.debian.net/1020597/ for a short version).

For reasons unknown, this issue only happens on Ubuntu, and only occasionally - frequently enough to make the software impossible to use, but not reproducibly enough that running Dustmite on the code would make sense. Given that the code does nothing (that I am aware of) that would mess with the GC, I am pretty much out of ideas by now and have started to assume a bug in LDC or the D GC in general. Does anyone have an idea what is going on here? Is there anything more to try to find the root cause of the issue and figure out whether there is a bug (and where to report it)? The only major difference between Ubuntu and Debian in terms of how things are compiled is that Ubuntu enables --as-needed linker options, which doesn't seem relevant here.

I would be happy about any help with figuring out what this issue actually is!

Regards,
    Matthias

[1]: https://github.com/gtkd-developers/gir-to-d
[2]: https://github.com/dlang-community/containers
[3]: https://github.com/ximion/appstream-generator
Apr 16 2018
On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
> [...] The code uses std.typecons.scoped occasionally, does no GC allocations in destructors and does nothing to mess with the GC in general. There are a few calls to GC.add/removeRoot in the gir-to-d generated code (ObjectG.d), but those are very unlikely to cause issues (removing them did yield the same crash, and the same code is used by more projects). [...]

Another thing to mention is that the software uses LMDB[1] and mmaps huge amounts of data into memory (gigabyte range). Not sure if that information is relevant at all though.

[1]: https://symas.com/lmdb/technical/
Apr 16 2018
On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
> The code uses std.typecons.scoped occasionally, does no GC allocations in destructors and does nothing to mess with the GC in general.

What do you use destructors for?
Apr 17 2018
Other stuff to try:
1. run the application compiled on Debian against Ubuntu libs
2. can you mix dependencies from Debian and Ubuntu?
Apr 17 2018
On Tuesday, 17 April 2018 at 08:23:07 UTC, Kagamin wrote:
> Other stuff to try:
> 1. run the application compiled on Debian against Ubuntu libs
> 2. can you mix dependencies from Debian and Ubuntu?

I haven't tried that yet (next on my todo list), but if I run the program compiled with AddressSanitizer on Debian, I do get errors like:
```
AddressSanitizer:DEADLYSIGNAL
=================================================================
==25964==ERROR: AddressSanitizer: SEGV on unknown address 0x7fac8db3f800 (pc 0x7fac9c433430 bp 0x000000000008 sp 0x7ffc92be3dd0 T0)
==25964==The signal is caused by a READ memory access.
_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa142f)
_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa1a2f)
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ad4)
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ac6)
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ac6)
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMFNbKxSQDiQDiQCyZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7ac6)
_D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZi (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xc7a51)
_D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0x9ef26)
_D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs18fullCollectNoStackMFNbZ2goFNbPSQEaQEaQDyQEj3GcxZmTQvZQDfMFNbKQBgZm (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0x9f226)
(/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa35d0)
(/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xb1ab2)
_D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZv (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xb1e65)
(/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xb1d0b)
(/lib/x86_64-linux-gnu/libc.so.6+0x21a86)
(/home/matthias/Development/AppStream/generator/build/src/asgen/appstream-generator+0xba1d9)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/usr/lib/x86_64-linux-gnu/libdruntime-ldc-shared.so.78+0xa142f) in _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv
==25964==ABORTING
```
So I don't think this bug is actually limited to Ubuntu, it just shows up there more often for some reason.
Apr 17 2018
You can call GC.collect at some points in the program to see if it triggers the crash:
https://dlang.org/library/core/memory/gc.collect.html

If you link against a debug druntime, the GC can check invariants for the correctness of its structures. There's a number of debugging options for the GC, though I'm not sure which ones are enabled in the default debug build of druntime:
https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1388
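Something like this, sprinkled at suspected points (names purely illustrative) - if GC metadata or a registered range is already invalid, the crash happens right inside the forced collection:
```
import core.memory : GC;

// call at suspected points in the program; a crash here narrows
// down the window in which the GC state went bad
void gcCheckpoint(string where)
{
    import std.stdio : writeln;
    writeln("GC checkpoint: ", where);
    GC.collect();   // force a full collection
    GC.minimize();  // optionally return free pools to the OS
}
```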
Apr 18 2018
On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:You can call GC.collect at some points in the program to see if they can trigger the crashI already do that, and indeed I get crashes. I could throw those calls into every function though, or make a minimal pool size, maybe that yields something...https://dlang.org/library/core/memory/gc.collect.html If you link against debug druntime, GC can check invariants for correctness of its structures. There's a number of debugging options for GC, though not sure which ones are enabled in default debug build of druntime: https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1388I get compile errors for the INVARIANT option, and I don't actually know how to deal with those properly: ``` src/gc/impl/conservative/gc.d(1396): Error: shared mutable method core.internal.spinlock.SpinLock.lock is not callable using a shared const object src/gc/impl/conservative/gc.d(1396): Consider adding const or inout to core.internal.spinlock.SpinLock.lock src/gc/impl/conservative/gc.d(1403): Error: shared mutable method core.internal.spinlock.SpinLock.unlock is not callable using a shared const object src/gc/impl/conservative/gc.d(1403): Consider adding const or inout to core.internal.spinlock.SpinLock.unlock ``` Commenting out the locks (eww!!) yields no change in behavior though. The crashes always appear in https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/gc.d#L1990 Meanwhile, I also tried to reproduce the crash locally in a chroot, with no result. All libraries used between the machine where the crashes occur and my local machine were 100% identical, the only differences I am aware of are obviously the hardware (AWS cloud vs. home workstation) and the Linux kernel (4.4.0 vs 4.15.0) The crash happens when built with LDC or DMD, that doesn't influence the result. Copying over a binary from the working machine to the crashing one also results in the same errors. I am completely out of ideas here. Since I think I can rule out a hardware fault at Amazon, I don't even know what else would make sense to try.
Apr 18 2018
On Wed, 18 Apr 2018 17:40:56 +0000, Matthias Klumpp wrote:
> The crashes always appear in
> https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/gc.d#L1990

The important point to note here is that this is not one of these 'GC collected something because it was not reachable' bugs. A crash in the GC mark routine means it somehow scans an invalid address range. Actually, I've seen this before...

> Meanwhile, I also tried to reproduce the crash locally in a chroot, with no result. All libraries used on the machine where the crashes occur and on my local machine were 100% identical; the only differences I am aware of are the hardware (AWS cloud vs. home workstation) and the Linux kernel (4.4.0 vs 4.15.0). The crash happens whether built with LDC or DMD, so the compiler doesn't influence the result. Copying over a binary from the working machine to the crashing one also results in the same errors.

Actually this sounds very familiar:
https://github.com/D-Programming-GDC/GDC/pull/236
It took us quite some time to reduce and debug this:
https://github.com/D-Programming-GDC/GDC/pull/236/commits/5021b8d031fcacac52ee43d83508a5d2856606cd

So I wondered why I couldn't find this in the upstream druntime code. Turns out our pull request has never been merged...
https://github.com/dlang/druntime/pull/1678

--
Johannes
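For context, the core idea of that patch was roughly this (a sketch from memory, not the literal diff):
```
// Conservative scanning walks a range word by word. If a registered
// range is not pointer-aligned, the last (partial) word read can touch
// memory past ptop - so the scan bounds get clamped first:
void* alignUp(void* p)
{
    enum m = (void*).sizeof - 1;
    return cast(void*)((cast(size_t)p + m) & ~cast(size_t)m);
}

void* alignDown(void* p)
{
    enum m = (void*).sizeof - 1;
    return cast(void*)(cast(size_t)p & ~cast(size_t)m);
}
// mark() then only iterates over [alignUp(pbot), alignDown(ptop)).
```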
Apr 18 2018
On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
> Actually this sounds very familiar:
> https://github.com/D-Programming-GDC/GDC/pull/236

Interesting, but I don't think it applies here. Both start and end addresses are 16-byte aligned, and both cannot be accessed according to the stack trace (`pbot=0x7fcf4d721010 <error: Cannot access memory at address 0x7fcf4d721010>, ptop=0x7fcf4e321010 <error: Cannot access memory at address 0x7fcf4e321010>`).

That's quite interesting too: `memSize = 209153867776`. Don't know what exactly it is, but it's a pretty large number (~194 GB).
Apr 18 2018
On Wednesday, 18 April 2018 at 22:12:12 UTC, kinke wrote:
> That's quite interesting too: `memSize = 209153867776`. Don't know what exactly it is, but it's a pretty large number (~194 GB).

size_t memSize = pooltable.maxAddr - minAddr;
(https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982)

That wouldn't make sense for a pool size... The machine this is running on has 16G memory; at the time of the crash the software was using ~2.1G memory, with 130G virtual memory due to LMDB memory mapping (I wonder what happens if I reduce that...)
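For reference, the numbers are consistent with that formula: the first backtrace shows pooltable._minAddr = 0x7fcf45721000 and _maxAddr = 0x7ffff7fcd000, and 0x7ffff7fcd000 - 0x7fcf45721000 = 0x30b28ac000 = 209153867776 bytes (~194.8 GiB). So memSize is just the span between the lowest and the highest GC pool, which here presumably has the huge LMDB mapping (among other things) sitting in the middle.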
Apr 18 2018
On Wed, 18 Apr 2018 22:24:13 +0000, Matthias Klumpp wrote:
> On Wednesday, 18 April 2018 at 22:12:12 UTC, kinke wrote:
>> [...] That's quite interesting too: `memSize = 209153867776`. Don't know what exactly it is, but it's a pretty large number (~194 GB).
>
> size_t memSize = pooltable.maxAddr - minAddr;
> (https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982)
> That wouldn't make sense for a pool size...
> The machine this is running on has 16G memory; at the time of the crash the software was using ~2.1G memory, with 130G virtual memory due to LMDB memory mapping (I wonder what happens if I reduce that...)

I see. Then I'd try to debug where the range originally comes from; try adding breakpoints in _d_dso_registry, registerGCRanges and similar functions here:
https://github.com/dlang/druntime/blob/master/src/rt/sections_elf_shared.d#L421

Generally if you produced a crash in gdb it should be reproducible if you restart the program in gdb. So once you have a crash, you should be able to restart the program and look at the _dso_registry and see the same addresses somewhere. If you then think you see memory corruption somewhere you could also use read or write watchpoints.

But just to be sure: you're not adding any GC ranges manually, right?

You could also try to compare the GC range to the address range layout in /proc/$PID/maps.

--
Johannes
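For the last point, something like `cat /proc/$(pidof appstream-generator)/maps` (or `info proc mappings` from within gdb) right before the crash should show whether the faulting range 0x7fcf4d721010 .. 0x7fcf4e321010 is still part of any live mapping at all.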
Apr 18 2018
On Thu, 19 Apr 2018 06:33:27 +0000, Johannes Pfau wrote:
> Generally if you produced a crash in gdb it should be reproducible if you restart the program in gdb. So once you have a crash, you should be able to restart the program and look at the _dso_registry and see the same addresses somewhere. If you then think you see memory corruption somewhere you could also use read or write watchpoints.
>
> But just to be sure: you're not adding any GC ranges manually, right?
> You could also try to compare the GC range to the address range layout in /proc/$PID/maps.

Of course, if this is a GC pool / heap range, adding breakpoints in the sections code won't be useful. Then I'd try to add a write watchpoint on pooltable.minAddr / maxAddr, restart the program in gdb and see where / why the values are set.

--
Johannes
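In gdb that would be something along the lines of `watch -l gcx.pooltable._minAddr` (the exact expression depends on what is in scope at your breakpoint); gdb then stops at every write and prints the old and new value, which should point at the code producing the bogus bounds.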
Apr 19 2018
On Thu, 19 Apr 2018 07:04:14 +0000, Johannes Pfau wrote:
> Of course, if this is a GC pool / heap range, adding breakpoints in the sections code won't be useful. Then I'd try to add a write watchpoint on pooltable.minAddr / maxAddr, restart the program in gdb and see where / why the values are set.

Having a quick look at https://github.com/ldc-developers/druntime/blob/ldc/src/gc/pooltable.d: the GC seems to allocate multiple pools using malloc, but only keeps track of one minimum/maximum address for all pools. Now if there's some other memory area malloced in between these pools, you will end up with a huge memory block. When this gets scanned and any of the memory in between the GC pools is protected, you might see the GC crash.

However, I don't really know anything about the GC code, so some GC expert would have to confirm this.

--
Johannes
Apr 19 2018
On Wednesday, 18 April 2018 at 22:24:13 UTC, Matthias Klumpp wrote:
> size_t memSize = pooltable.maxAddr - minAddr;
> (https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L1982)
> That wouldn't make sense for a pool size...
> The machine this is running on has 16G memory; at the time of the crash the software was using ~2.1G memory, with 130G virtual memory due to LMDB memory mapping (I wonder what happens if I reduce that...)

If the big LMDB mapping causes a problem, try a test like this:
---
import core.memory;

void testLMDB()
{
    //how do you use it?
}

void test1()
{
    void*[][] a;
    foreach(i;0..100000)a~=new void*[10000];
    void*[][] b;
    foreach(i;0..100000)b~=new void*[10000];
    b=null;
    GC.collect();
    testLMDB();
    GC.collect();
    foreach(i;0..100000)a~=new void*[10000];
    foreach(i;0..100000)b~=new void*[10000];
    b=null;
    GC.collect();
}
---
Apr 19 2018
On Thursday, 19 April 2018 at 08:30:45 UTC, Kagamin wrote:
> If the big LMDB mapping causes a problem, try a test like this:
> [...]

I tried something similar, with no effect.

Something that maybe is relevant though: I occasionally get the following SIGABRT crash in the tool on machines which have the SIGSEGV crash:
```
Thread 53 "appstream-gener" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fdfe98d4700 (LWP 7326)]
0x00007ffff5040428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
../sysdeps/unix/sysv/linux/raise.c:54
ulong) (this=0x7fde0758a680, guardPageSize=4096, sz=20480) at src/core/thread.d:4606
_D4core6thread5Fiber6__ctorMFNbDFZvmmZCQBlQBjQBf (this=0x7fde0758a680, guardPageSize=4096, sz=16384, dg=...) at src/core/thread.d:4134
_D3std11concurrency__T9GeneratorTAyaZQp6__ctorMFDFZvZCQCaQBz__TQBpTQBiZQBx (this=0x7fde0758a680, dg=...) at /home/ubuntu/dtc/dmd/generated/linux/debug/64/../../../../../druntime/import/core/thread.d:4126
_D5asgen8handlers11iconhandler5Theme21matchingIconFilenamesMFAyaSQCl5utils9ImageSizebZC3std11concurrency__T9GeneratorTQCfZQp (this=0x7fdea2747800, relaxedScalingRules=true, size=..., iname=...) at ../src/asgen/handlers/iconhandler.d:196
_D5asgen8handlers11iconhandler11IconHandler21possibleIconFilenamesMFAyaSQCs5utils9ImageSizebZ9__lambda4MFZv (this=0x7fde0752bd00) at ../src/asgen/handlers/iconhandler.d:392
(this=0x7fde07528580) at src/core/thread.d:4436
src/core/thread.d:3665
```
This is in the constructor of a std.concurrency.Generator:
    auto gen = new Generator!string (...)
I am not sure what to make of this yet though... This goes into DRuntime territory that I actually hoped to never have to deal with, as much as I apparently now need to.
Apr 19 2018
On Thursday, 19 April 2018 at 17:01:48 UTC, Matthias Klumpp wrote:
> Something that maybe is relevant though: I occasionally get the following SIGABRT crash in the tool on machines which have the SIGSEGV crash:
> [...]

You probably already figured that the new Fiber seems to be allocating its 16KB stack, with an additional 4KB guard page at its bottom, via a 20KB mmap() call. The abort seems to be triggered by mprotect() returning -1, i.e., a failure to disallow all access to the guard page; so checking `errno` should help.
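The pattern in question looks roughly like this (a simplified sketch of the guard-page setup, not druntime's actual code):
```
import core.stdc.errno : errno;
import core.stdc.stdio : printf;
import core.sys.posix.sys.mman;

void* allocFiberStack(size_t sz, size_t guardPageSize)
{
    // one mapping for guard page + stack (e.g. 4 KB + 16 KB = 20 KB)
    void* p = mmap(null, sz + guardPageSize,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p is MAP_FAILED)
        assert(0, "mmap failed");

    // make the bottom page inaccessible; stack overflows then fault here
    if (mprotect(p, guardPageSize, PROT_NONE) == -1)
    {
        printf("mprotect failed, errno=%d\n", errno); // the abort seen above
        assert(0);
    }
    return cast(ubyte*)p + guardPageSize; // usable stack starts above the guard
}
```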
Apr 19 2018
On Thursday, 19 April 2018 at 18:45:41 UTC, kinke wrote:On Thursday, 19 April 2018 at 17:01:48 UTC, Matthias Klumpp wrote:Jup, I did that already, it just took a really long time to run because when I made the change to print errno I also enabled detailed GC profiling (via the PRINTF* debug options). Enabling the INVARIANT option for the GC is completely broken by the way, I enforced the compile to work by casting to shared, with the result of the GC locking up forever at the start of the program. Anyway, I think for a chance I actually produced some useful information via the GC debug options: Given the following crash: ``` _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., ptop=0x7fdfce7fc010, pbot=0x7fdfcdbfc010) at src/gc/impl/conservative/gc.d:1990 p1 = 0x7fdfcdbfc010 p2 = 0x7fdfce7fc010 stackPos = 0 [...] ``` The scanned range seemed fairly odd to me, so I searched for it in the (very verbose!) GC debug output, which yielded: ``` 235.244445: 0xc4f090.Gcx::addRange(0x8264230, 0x8264270) 235.244460: 0xc4f090.Gcx::addRange(0x7fdfcdbfc010, 0x7fdfce7fc010) 235.253861: 0xc4f090.Gcx::addRange(0x8264300, 0x8264340) 235.253873: 0xc4f090.Gcx::addRange(0x8264390, 0x82643d0) ``` So, something is calling addRange explicitly there, causing the GC to scan a range that it shouldn't scan. Since my code doesn't add ranges to the GC, and I looked at the generated code from girtod/GtkD and it very much looks fine to me, I am currently looking into EMSI containers[1] as the possible culprit. That library being the issue would also make perfect sense, because this issue started to appear with such a frequency only after containers were added (there was a GC-related crash before, but that might have been a different one). So, I will look into that addRange call next. [1]: https://github.com/dlang-community/containersSomething that maybe is relevant though: I occasionally get the following SIGABRT crash in the tool on machines which have the SIGSEGV crash: ``` Thread 53 "appstream-gener" received signal SIGABRT, Aborted. [Switching to Thread 0x7fdfe98d4700 (LWP 7326)] 0x00007ffff5040428 in __GI_raise (sig=sig entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt ../sysdeps/unix/sysv/linux/raise.c:54 ulong) (this=0x7fde0758a680, guardPageSize=4096, sz=20480) at src/core/thread.d:4606 _D4core6thread5Fiber6__ctorMFNbDFZvmmZCQBlQBjQBf (this=0x7fde0758a680, guardPageSize=4096, sz=16384, dg=...) at src/core/thread.d:4134 _D3std11concurrency__T9GeneratorTAyaZQp6__ctorMFDFZvZCQCaQBz__TQBpTQBiZQBx (this=0x7fde0758a680, dg=...) at /home/ubuntu/dtc/dmd/generated/linux/debug/64/../../../../../druntime/import/core/thread.d:4126 _D5asgen8handlers11iconhandler5Theme21matchingIconFilenamesMFAyaSQCl5utils9ImageSizebZC3std11concurrency _T9GeneratorTQCfZQp (this=0x7fdea2747800, relaxedScalingRules=true, size=..., iname=...) at ../src/asgen/handlers/iconhandler.d:196 _D5asgen8handlers11iconhandler11IconHandler21possibleIconFilenamesMFAyaSQCs5utils9Image izebZ9__lambda4MFZv (this=0x7fde0752bd00) at ../src/asgen/handlers/iconhandler.d:392 (this=0x7fde07528580) at src/core/thread.d:4436 src/core/thread.d:3665 ```You probably already figured that the new Fiber seems to be allocating its 16KB-stack, with an additional 4 KB guard page at its bottom, via a 20 KB mmap() call. The abort seems to be triggered by mprotect() returning -1, i.e., a failure to disallow all access to the the guard page; so checking `errno` should help.
Apr 19 2018
On Friday, 20 April 2018 at 00:11:25 UTC, Matthias Klumpp wrote:
> [...] Jup, I did that already, it just took a really long time to run because when I made the change to print errno [...]

I forgot to mention: the error code was 12, ENOMEM, so this is actually likely not a relevant issue after all.
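(For context: per the mprotect(2) man page, ENOMEM is also returned when changing protections would split a mapping and push the process over its mapping limit - vm.max_map_count, 65530 by default - not only when the system is out of memory. With one stack mapping plus a PROT_NONE guard page per fiber, and the huge LMDB map on top, approaching that limit is at least plausible; `sysctl vm.max_map_count` shows the current value.)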
Apr 19 2018
On Friday, 20 April 2018 at 00:11:25 UTC, Matthias Klumpp wrote:
> Jup, I did that already, it just took a really long time to run because when I made the change to print errno I also enabled detailed GC profiling (via the PRINTF* debug options). Enabling the INVARIANT option for the GC is completely broken by the way; I forced the compile to work by casting to shared, with the result of the GC locking up forever at the start of the program.
> [...]
> [1]: https://github.com/dlang-community/containers

I think the order of operations is wrong, here is an example from containers:

    allocator.dispose(buckets);
    static if (useGC)
        GC.removeRange(buckets.ptr);

If the GC triggers between dispose and removeRange, it will likely segfault.
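Presumably the fix is just to swap the two statements, so the range is unregistered before the memory goes away (sketch):
```
static if (useGC)
    GC.removeRange(buckets.ptr); // unregister first...
allocator.dispose(buckets);      // ...then free the memory
```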
Apr 19 2018
On Friday, 20 April 2018 at 05:32:32 UTC, Dmitry Olshansky wrote:
> I think the order of operations is wrong, here is an example from containers:
>
>     allocator.dispose(buckets);
>     static if (useGC)
>         GC.removeRange(buckets.ptr);
>
> If the GC triggers between dispose and removeRange, it will likely segfault.

Indeed! It's also the only place where this is shuffled around; all other parts of the containers library do this properly.

The thing I wonder about though is that the crash usually appeared in an explicit GC.collect() call when the application was not running multiple threads. At that point, the GC - as far as I know - couldn't have triggered after the buckets were disposed of and before the ranges were removed. But maybe I am wrong with that assumption; this crash would be explained perfectly by that bug.
Apr 20 2018
On Friday, 20 April 2018 at 18:30:30 UTC, Matthias Klumpp wrote:
> Indeed! It's also the only place where this is shuffled around; all other parts of the containers library do this properly.
> [...]
> But maybe I am wrong with that assumption; this crash would be explained perfectly by that bug.

Turns out that was indeed the case! I created a small testcase which managed to very reliably reproduce the issue on all machines that I tested it on. After reordering the dispose/removeRange, the crashes went away completely. I submitted a pull request to the containers library to fix this issue:
https://github.com/dlang-community/containers/pull/107

I will also try to get the patch into the components in Debian and Ubuntu, so we may have a chance of updating the software center metadata for Ubuntu before 18.04 LTS releases next week. Since asgen uses HashMaps for pretty much everything, and most of the time with GC-managed elements, this should improve the stability of the application greatly.

Thanks a lot for the help in debugging this, I learned a lot about DRuntime internals in the process. Also, it is no exaggeration to say that the appstream-generator project would not be written in D (there was a Rust prototype once...) and I would probably not be using D as much (or at all) without the helpful community around it. Thank you :-)
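For the record, the failure mode boils down to something like this (a hedged reconstruction, not the actual testcase I submitted):
```
import core.memory : GC;
import core.stdc.stdlib : free, malloc;

void main()
{
    enum len = 12 * 1024 * 1024;  // large: glibc serves this via mmap
    void* p = malloc(len);
    GC.addRange(p, len);          // GC now scans [p, p+len) on collections

    free(p);                      // the large allocation is munmap'ed here
    GC.collect();                 // scans the dead range -> possible SIGSEGV
    GC.removeRange(p);            // correct call, but too late
}
```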
Apr 20 2018
On Friday, 20 April 2018 at 19:32:24 UTC, Matthias Klumpp wrote:
> Turns out that was indeed the case! I created a small testcase which managed to very reliably reproduce the issue on all machines that I tested it on. After reordering the dispose/removeRange, the crashes went away completely. I submitted a pull request to the containers library to fix this issue:
> https://github.com/dlang-community/containers/pull/107
> [...]
> Thanks a lot for the help in debugging this, I learned a lot about DRuntime internals in the process.

Partly dumb luck on my part, since I opened the hashmap file first just to see if there are some mistakes in GC.add/removeRange, and it was a hit. I just assumed it was wrong everywhere else ;)

Glad it was that simple. Thanks for fixing it for good.
Apr 23 2018
On Wednesday, 18 April 2018 at 20:36:03 UTC, Johannes Pfau wrote:
> [...] Actually this sounds very familiar:
> https://github.com/D-Programming-GDC/GDC/pull/236
> It took us quite some time to reduce and debug this:
> https://github.com/D-Programming-GDC/GDC/pull/236/commits/5021b8d031fcacac52ee43d83508a5d2856606cd
> So I wondered why I couldn't find this in the upstream druntime code. Turns out our pull request has never been merged...
> https://github.com/dlang/druntime/pull/1678

Just to be sure, I applied your patch, but unfortunately I still get the same result...

On Wednesday, 18 April 2018 at 20:38:20 UTC, negi wrote:
> This reminds me of (otherwise unrelated) problems I had involving Linux 4.15. If you feel out of ideas, I suggest you take a look at the kernels. It might be that Ubuntu is turning some security-related knob in a different direction than Debian. Or it might be some bug in 4.15 (I found it to be quite buggy, especially during the first few point releases; 4.15 was the first upstream release including large amounts of Meltdown/Spectre-related work).

All the crashes are happening on a 4.4 kernel though... I am currently pondering digging out a 4.4 kernel here to see if that makes me reproduce the crash locally...
Apr 18 2018
On Wednesday, 18 April 2018 at 17:40:56 UTC, Matthias Klumpp wrote:
> I already do that, and indeed I get crashes. I could throw those calls into every function though, or set a minimal pool size, maybe that yields something...

Can you narrow down the earliest point at which it starts to crash? That might identify whether something in particular causes the crash.
Apr 19 2018
On Wednesday, 18 April 2018 at 17:40:56 UTC, Matthias Klumpp wrote:
> I get compile errors for the INVARIANT option, and I don't actually know how to deal with those properly:
> [...]
> Commenting out the locks (eww!!) yields no change in behavior though.

As a workaround:

    (cast(shared)rangesLock).lock();
Apr 19 2018
On Wednesday, 18 April 2018 at 10:15:49 UTC, Kagamin wrote:
> There's a number of debugging options for the GC, though I'm not sure which ones are enabled in the default debug build of druntime

Speaking for LDC, none are; they all need to be enabled explicitly. There's a whole bunch of them (https://github.com/dlang/druntime/blob/master/src/gc/impl/conservative/gc.d#L20-L31), so enabling most of them would surely help in tracking this down, but it's most likely still going to be very tedious. I'm not really surprised that there are compilation errors when enabling the debug options; that's a likely fate of untested code, unfortunately.

If possible, I'd give static linking a try.
Apr 18 2018
On Wednesday, 18 April 2018 at 18:55:48 UTC, kinke wrote:
> Speaking for LDC, none are; they all need to be enabled explicitly. [...] I'm not really surprised that there are compilation errors when enabling the debug options; that's a likely fate of untested code, unfortunately.

Yeah... Maybe making a CI build with "enable all the things" makes sense to combat that...

> If possible, I'd give static linking a try.

I tried that, with at least linking druntime and Phobos statically. I did not, however, link all the things statically. That is something to try (at least statically linking all the D libraries).
Apr 18 2018
On Wednesday, 18 April 2018 at 20:40:52 UTC, Matthias Klumpp wrote:[...]No luck... ``` _D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv (this=..., ptop=0x7fcf6a11b010, pbot=0x7fcf6951b010) at src/gc/impl/conservative/gc.d:1990 p1 = 0x7fcf6951b010 p2 = 0x7fcf6a11b010 stackPos = 0 stack = {{pbot = 0x7fffffffcc60, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87b4118, ptop = 0x87b4118}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fffffffcca0, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87af258, ptop = 0x87af258}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fffffffcce0, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87af158, ptop = 0x87af158}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fffffffcd20, ptop = 0x7f15af <_D2gc4impl12conservativeQw3Gcx4markMFNbNlPvQcZv+1403>}, {pbot = 0xc22bf0 <_D2gc6configQhSQnQm6Config>, ptop = 0xc4cd28}, {pbot = 0x87af0d8, ptop = 0x87af0d8}, {pbot = 0x0, ptop = 0xc4cda0}, {pbot = 0x7fdf6b265000, ptop = 0x69b96a0}, {pbot = 0x28, ptop = 0x7fcf5951b000}, {pbot = 0x309eab7000, ptop = 0x7fdf6b265000}, {pbot = 0x0, ptop = 0x0}, {pbot = 0x1381d00, ptop = 0x1c}, {pbot = 0x1d, ptop = 0x1c}, {pbot = 0x1a44100, ptop = 0x1a4410}, {pbot = 0x1a44, ptop = 0x4}, {pbot = 0x7fdf6b355000, ptop = 0x69b96a0}, {pbot = 0x28, ptop = 0x7fcf5951b000}, {pbot = 0x309eab7000, ptop = 0x4ac0}, {pbot = 0x4a, ptop = 0x0}, {pbot = 0x1381d00, ptop = 0x1c}, {pbot = 0x1d, ptop = 0x1c}, {pbot = 0x4ac00, ptop = 0x4ac0}, {pbot = 0x4a, ptop = 0x4}} pcache = 0 pools = 0x69b96a0 highpool = 40 minAddr = 0x7fcf5951b000 memSize = 208820465664 base = 0xaef0 top = 0xae p = 0x4618770 pool = 0x0 low = 110859936 high = 40 mid = 140528533483520 offset = 208820465664 biti = 8329709 pn = 142275872 bin = 1 offsetBase = 0 next = 0xc4cc80 next = {pbot = 0x7fffffffcbe0, ptop = 0x7f19ed <_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi+57>} __r292 = 0x7fffffffd320 __key293 = 8376632 rng = 0x0: <error reading variable> _D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm1 gcinterface5RangeZi (this=0x7fffffffd360, __applyArg0=...) at src/gc/impl/conservative/gc.d:2188 range = {pbot = 0x7fcf6951b010, ptop = 0x7fcf6a11b010, ti = 0x0} _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZ9__lambd 2MFNbKxSQCpQCpQCfZi (this=0x7fffffffd320, e=...) 
at src/rt/util/container/treap.d:47 _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeM FNbKxSQDiQDiQCyZiZi (dg=..., node=0x80396c0) at src/rt/util/container/treap.d:221 result = 0 _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeM FNbKxSQDiQDiQCyZiZi (dg=..., node=0x87c8140) at src/rt/util/container/treap.d:224 result = 0 _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeM FNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000950) at src/rt/util/container/treap.d:218 result = 16844032 _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeM FNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000a50) at src/rt/util/container/treap.d:218 result = 0 _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeM FNbKxSQDiQDiQCyZiZi (dg=..., node=0x7fdfc8000c50) at src/rt/util/container/treap.d:218 result = 0 [etc...] src/core/memory.d:207 (this=0x7ffff7ee13c0) at ../src/asgen/engine.d:122 ```If possible, I'd give static linking a try.I tried that, with at least linking druntime and phobos statically. I did not, however, link all the things statically. That is something to try (at least statically linking all the D libraries).
Apr 18 2018
On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
> [...]

This reminds me of (otherwise unrelated) problems I had involving Linux 4.15. If you feel out of ideas, I suggest you take a look at the kernels. It might be that Ubuntu is turning some security-related knob in a different direction than Debian. Or it might be some bug in 4.15 (I found it to be quite buggy, especially during the first few point releases; 4.15 was the first upstream release including large amounts of Meltdown/Spectre-related work).
Apr 18 2018
On Monday, 16 April 2018 at 16:36:48 UTC, Matthias Klumpp wrote:
> _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf7opApplyMFNbMDFNbKQBtZiZ9__lambda2MFNbKxSQCpQCpQCfZi (e=...) at treap.d:47
> dg = {context = 0x7fffffffc140 "\320\065\206", funcptr = 0x7ffff5121d10 <_D2gc4impl12conservativeQw3Gcx7markAllMFNbbZ14__foreachbody3MFNbKSQCm11gcinterface5RangeZi>}
> _D2rt4util9container5treap__T5TreapTS2gc11gcinterface5RangeZQBf13opApplyHelperFNbxPSQDeQDeQDcQCv__TQCsTQCpZQDa4NodeMFNbKxSQDiQDiQCyZiZi (node=0x7568700, dg=...) at treap.d:221

Indeed, this is the iteration over the Treap!Range used to store ranges added with the addRange method.
https://github.com/ldc-developers/druntime/blob/ldc/src/gc/impl/conservative/gc.d#L2182
Apr 20 2018