digitalmars.D.learn - How to debug (potential) GC bugs?
- Matthias Klumpp (39/39) Sep 25 2016 Hello!
- Marco Leise (23/30) Sep 26 2016 If you pass callbacks into the C code, make sure they never
- Guillaume Piolat (20/44) Sep 27 2016 There is no way the GC scans memory allocated with malloc (unless
- Kapps (14/54) Sep 27 2016 First, make sure any C threads calling D code use
- Kagamin (4/4) Sep 29 2016 Does it crash only in rt_finalize2? It calls the class
- Matthias Klumpp (24/28) Sep 30 2016 Thank you all for the good advice! I do none of those things in
- Kagamin (10/18) Oct 03 2016 `grep "~this" *.d` gives nothing? It can be a struct with
- Kagamin (4/4) Oct 03 2016 If it's heap corruption, GC has debugging option -debug=SENTINEL
- Martin Nowak (6/10) Oct 07 2016 We actually did change druntime recently to no longer fail when
- Kagamin (3/6) Oct 03 2016 Oh, wait, what do you mean by crashing?
- Ilya Yaroshenko (4/11) Oct 04 2016 Probably related issue:
- Martin Nowak (2/4) Oct 07 2016 Crashes in a finalizer, likely not related to the dead-lock bug.
- Johannes Pfau (34/56) Oct 07 2016 Can you get the GDC & LDC phobos versions?
Hello!

I am working together with others on the D-based appstream-generator[1] project, which generates software metadata for "software centers" and other package-manager functionality on Linux distributions, and is used by default on Debian, Ubuntu and Arch Linux.

For Ubuntu, some modifications to the code were needed, and apparently for them the code is currently crashing in the GC collection thread: http://paste.debian.net/840490/

The project runs a lot of stuff in parallel and uses the GC (if the extraction is a few seconds slower due to the GC being active, it doesn't matter much). We also link against a lot of 3rd-party libraries and use a large amount of existing C code in the project.

So, I would like to know the following things:

1) Is there any caveat when linking to C libraries and using the GC in a project? So far, it seems to be working well, but there have been a few cases where I was suspicious about the GC actually doing something to malloc'ed stuff or C structs present in the bindings.

2) How can one debug issues like the one mentioned above properly? Since it seems to happen in the GC and doesn't give me information on where to start searching for the issue, I am a bit lost.

3) The tool seems to leak memory somewhere and OOMs pretty quickly on some machines. All the stuff using C code frees resources properly though, and using Valgrind on the project is a pain due to large amounts of data being mmapped. I worked around this a while back, but then the GC interfered with Valgrind, making the information less useful. Is there any information on how to find memory leaks, or e.g. large structs the GC cannot free because something is still holding a needless reference to them?

Unfortunately I can't reproduce the crash from 2) myself, it only seems to happen at Ubuntu (but Ubuntu is using some different codepaths too).

Any insights would be highly appreciated!
Cheers,
Matthias

[1]: https://github.com/ximion/appstream-generator
Sep 25 2016
On Sun, 25 Sep 2016 16:23:11 +0000, Matthias Klumpp <matthias tenstral.net> wrote:
> So, I would like to know the following things:
> 1) Is there any caveat when linking to C libraries and using the GC in a project? So far, it seems to be working well, but there have been a few cases where I was suspicious about the GC actually doing something to malloc'ed stuff or C structs present in the bindings.

If you pass callbacks into the C code, make sure they never throw. Stack unwinding and exception handling generally don't work across language boundaries.

A tracing garbage collector starts with the assumption that all the memory that it allocated is no longer reachable and then scans the known memory for any pointers to allocations that falsify this assumption. What you malloc'ed is unknown to the GC and won't be scanned. Should you ever have GC memory pointers in your malloc'ed stuff, then you need to call GC.addRange() to make those pointers keep the allocations alive. Otherwise you will get a "use after free" error: data corruption or access violations. A simple case would be a string that you constructed in D and store in C as a pointer. The GC can automatically scan the stack and any globals/statics on the D side, but that's about it.

I know of no tools similar to valgrind specially designed to debug the D GC. You can plug into the GC API and keep track of the allocation sizes, i.e. write a proxy GC.

--
Marco
Sep 26 2016
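Marco's point about GC.addRange() can be sketched as follows. This is a minimal illustration, not code from appstream-generator: the Entry struct and the createEntry/destroyEntry helpers are hypothetical names, and the block assumes a malloc'ed record that stores a pointer to a string built on the D side.

```d
import core.memory : GC;
import core.stdc.stdlib : free, malloc;

// Hypothetical C-style record that lives in malloc'ed memory
// but holds a pointer into the GC heap.
struct Entry
{
    char* name;     // points at GC-allocated memory
    size_t length;
}

Entry* createEntry(string name)
{
    auto e = cast(Entry*) malloc(Entry.sizeof);
    if (e is null)
        return null;
    *e = Entry.init;                 // no stale bits for the GC to misread
    GC.addRange(e, Entry.sizeof);    // from now on the GC scans this block

    auto copy = name.dup;            // GC allocation
    e.name = copy.ptr;
    e.length = copy.length;
    return e;
}

void destroyEntry(Entry* e)
{
    if (e is null)
        return;
    GC.removeRange(e);               // unregister before freeing
    free(e);
}

unittest
{
    auto e = createEntry("appstream");
    scope (exit) destroyEntry(e);
    GC.collect();
    assert(e.name[0 .. e.length] == "appstream");
}
```

Registering the range before the GC pointer is stored avoids a window in which a collection on another thread could miss it. Pairing every addRange with a removeRange keeps the registered ranges from piling up.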
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp wrote:
> [...]
> 1) Is there any caveat when linking to C libraries and using the GC in a project? So far, it seems to be working well, but there have been a few cases where I was suspicious about the GC actually doing something to malloc'ed stuff or C structs present in the bindings.

There is no way the GC scans memory allocated with malloc (unless you tell it to) or used in the bindings. A caveat is that if you are called from C (not your case), you must initialize the runtime, and attach/detach threads. The GC could well stop threads that are currently in the C code if they were registered to the runtime.

> 2) How can one debug issues like the one mentioned above properly? Since it seems to happen in the GC and doesn't give me information on where to start searching for the issue, I am a bit lost.

There can be multiple reasons.
- The GC is collecting some object that is unreachable from its POV while you are actually still using it.
- The GC is calling destructors that should not be called by the GC, and they perform illegal operations. Usually this is solved by using deterministic destruction instead and never relying on a destructor called by the GC.
- The GC tries to stop threads that don't exist anymore or are not interruptible.

My advice is to have a fully deterministic tree of objects, like a C++ program, and Google for "GC-proof resource class" in case you are using classes.
Sep 27 2016
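The "deterministic destruction" advice above can be sketched like this. It is only one possible shape of the idea, assuming a class that wraps a C resource. NativeBuffer, close and isOpen are hypothetical names, and a malloc'ed buffer stands in for a real C handle.

```d
import core.stdc.stdlib : free, malloc;

// Hypothetical wrapper around a C resource (here just a malloc'ed buffer).
final class NativeBuffer
{
    private void* handle;

    this(size_t size)
    {
        handle = malloc(size);
    }

    // Explicit, idempotent release that callers invoke deterministically.
    void close() @nogc nothrow
    {
        if (handle !is null)
        {
            free(handle);
            handle = null;
        }
    }

    bool isOpen() const @nogc nothrow
    {
        return handle !is null;
    }

    // Safety net only: it touches no other GC-managed object, and
    // @nogc makes the compiler reject GC allocations in here.
    ~this() @nogc nothrow
    {
        close();
    }
}

unittest
{
    auto buf = new NativeBuffer(64);
    {
        scope (exit) buf.close();   // released here, not by the collector
        assert(buf.isOpen);
    }
    assert(!buf.isOpen);
}
```

Because close() is idempotent, it does no harm if the collector later runs the destructor on an object that was already closed explicitly.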
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp wrote:
> [...]

First, make sure any C threads calling D code use Thread.attachThis (thread_attachThis maybe?). Otherwise the GC will not suspend those threads during a collection which will cause crashes. I'd guess this is your issue.

Second, tell the GC of non-GC memory that has pointers to GC memory by using GC.addRange / GC.addRoot as needed. Make sure to remove them once the non-GC memory is deallocated as well, otherwise you'll get memory leaks. The GC collector is also conservative, not precise, so false positives are possible. If you're using 64 bit programs, this shouldn't be much of an issue though.

Finally, make sure you're not doing any GC allocations in dtors.
Sep 27 2016
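The first point above, combined with the earlier advice that callbacks must never throw, might look like this in practice. This is a sketch under assumptions: onWorkItem is a hypothetical callback that some C library invokes on a thread it created itself, and the counter increment stands in for real D work. thread_attachThis and thread_detachThis are the druntime functions from core.thread.

```d
import core.thread : Thread, thread_attachThis, thread_detachThis;

// Hypothetical callback, invoked by C code on a thread the D runtime
// may not know about.
extern (C) void onWorkItem(void* userData) nothrow
{
    // Attach only if the runtime has not registered this thread yet.
    immutable attachedHere = Thread.getThis() is null;
    try
    {
        if (attachedHere)
            thread_attachThis();

        auto counter = cast(int*) userData;
        ++*counter;   // stand-in for real D work, which may allocate
    }
    catch (Exception e)
    {
        // Swallowed on purpose: exceptions must not unwind into C frames.
    }
    if (attachedHere)
        thread_detachThis();
}
```

Attaching and detaching on every call is the simplest variant to show. For a long-lived C thread it may be cheaper to attach once when the thread starts and detach once before it exits.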
Does it crash only in rt_finalize2? It calls the class destructor, and the destructor must not allocate or touch the GC in any way because the GC doesn't yet support allocation during collection.
Sep 29 2016
On Thursday, 29 September 2016 at 09:56:34 UTC, Kagamin wrote:
> Does it crash only in rt_finalize2? It calls the class destructor, and the destructor must not allocate or touch GC in any way because the GC doesn't yet support allocation during collection.

Thank you all for the good advice! I do none of those things in my code though...

Unfortunately, to have deterministic memory management I would essentially need to develop GC-less, and would lose classes. This means many nice features of D aren't available, e.g. I couldn't use interfaces (AFAIK they don't work on structs) or constraints.

Strangely, after switching from the GDC compiler to the LDC compiler, all crashes observed at Ubuntu are gone. So, this problem is:
A) A compiler / DRuntime bug, or
B) A bug in my code (not) triggered by a certain compiler / DRuntime

For the excessive memory usage, I have no idea yet - the GC not freeing its memory pool on exit is quite bad for Valgrinding the code. Memory consumption has improved recently after I stopped re-opening an LMDB database twice in the same process from multiple threads, which is not supported by LMDB. I haven't done longer runs yet, so I am not sure if that really was the problem (seems unlikely, but you never know...).

Cheers,
Matthias
Sep 30 2016
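For the memory-usage question, one low-tech option is to log the GC's own heap statistics at interesting points in a run. The sketch below assumes a druntime recent enough to provide GC.stats() in core.memory, which may not be true of the 2016-era GDC discussed in this thread. reportHeap is a hypothetical helper name.

```d
import core.memory : GC;
import std.stdio : writefln;

// Print how much of the GC heap is in use at a labelled point in the program.
void reportHeap(string label)
{
    auto s = GC.stats();
    writefln("%s: used=%s bytes, free=%s bytes", label, s.usedSize, s.freeSize);
}

unittest
{
    reportHeap("start");
    auto data = new ubyte[](4 * 1024 * 1024);
    reportHeap("after allocating 4 MiB");
    data = null;
    GC.collect();
    reportHeap("after collect");
}
```

If the "used" figure keeps growing across processing batches even after a forced GC.collect(), something is probably still referencing the data, or a conservative false pointer is pinning it. If it stays flat while the process size grows, the leak is more likely on the C side.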
On Saturday, 1 October 2016 at 00:06:05 UTC, Matthias Klumpp wrote:
> I do none of those things in my code though...

`grep "~this" *.d` gives nothing? It can be a struct with a destructor stored in a class. Can you observe the error? Try to set a breakpoint at onInvalidMemoryOperationError https://github.com/dlang/druntime/blob/master/src/core/exception.d#L559 and see what stack leads to it.

> Unfortunately for having deterministic memory management, I would essentially need to develop GC-less, and would lose classes. This means many nice features of D aren't available, e.g. I couldn't use interfaces (AFAIK they don't work on structs) or constraints.

Not necessarily. You only need to dispose of the resources in time.

> Strangely after switching from the GDC compiler to the LDC compiler, all crashes observed at Ubuntu are gone.

That doesn't sound good.
Oct 03 2016
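The remark that timely disposal does not require giving up classes can be illustrated with std.typecons.scoped from Phobos, which places a class instance in the enclosing scope and runs its destructor when that scope ends. Extractor and ArchiveExtractor are hypothetical names, and the live counter exists only to make the destruction visible.

```d
import std.typecons : scoped;

interface Extractor
{
    int extract();
}

// Hypothetical class; counts live instances to make destruction visible.
final class ArchiveExtractor : Extractor
{
    static int live;

    this() { ++live; }
    ~this() { --live; }

    int extract() { return 42; }
}

unittest
{
    {
        auto ex = scoped!ArchiveExtractor();
        Extractor iface = ex;            // still usable through the interface
        assert(iface.extract() == 42);
        assert(ArchiveExtractor.live == 1);
    }   // destructor runs here, not during a GC cycle
    assert(ArchiveExtractor.live == 0);
}
```

The interface reference must not outlive the scope, since the object is destroyed at the closing brace. For objects with longer lifetimes, an explicit close or dispose method called by the owner is the usual alternative.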
If it's heap corruption, the GC has a debugging option, -debug=SENTINEL, for buffer overrun checks. Also, that particular stack trace shows that the object being destroyed is allocated in bin 512, i.e. its size is between 256 and 512 bytes.
Oct 03 2016
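The "bin 512" observation relies on the collector rounding small allocations up to fixed size classes. A quick way to see which block size the running druntime actually uses for a given request is GC.sizeOf. blockSizeFor is a hypothetical helper, and the exact set of bins differs between druntime versions, so no particular output is assumed here.

```d
import core.memory : GC;

// Returns the size of the block the GC hands out for a request of
// `requested` bytes. The block is freed again before returning.
size_t blockSizeFor(size_t requested)
{
    auto p = GC.malloc(requested);
    scope (exit) GC.free(p);
    return GC.sizeOf(p);
}

unittest
{
    // The block is at least as large as the request; how much larger
    // depends on the bin sizes of the druntime in use.
    assert(blockSizeFor(300) >= 300);
}
```

Comparing the result with the instance sizes of the project's classes can narrow down which type the crashing finalizer belongs to.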
On Saturday, 1 October 2016 at 00:06:05 UTC, Matthias Klumpp wrote:
> So, this problem is:
> A) A compiler / DRuntime bug, or
> B) A bug in my code (not) triggered by a certain compiler / DRuntime

We actually did change druntime recently to no longer fail when using GC.free from a finalizer (will get ignored now). Maybe that's what fixed it for you w/ a newer version, but at a quick glance I haven't seen any freeing code in destructors.
Oct 07 2016
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp wrote:
> For Ubuntu, some modifications on the code were needed, and apparently for them the code is currently crashing in the GC collection thread: http://paste.debian.net/840490/

Oh, wait, what do you mean by crashing?
Oct 03 2016
On Sunday, 25 September 2016 at 16:23:11 UTC, Matthias Klumpp wrote:
> Hello! I am working together with others on the D-based appstream-generator[1] project, which is generating software metadata for "software centers" and other package-manager functionality on Linux distributions, and is used by default on Debian, Ubuntu and Arch Linux. [...]

Probably related issue: https://issues.dlang.org/show_bug.cgi?id=15939
Oct 04 2016
On Tuesday, 4 October 2016 at 08:14:37 UTC, Ilya Yaroshenko wrote:
> Probably related issue: https://issues.dlang.org/show_bug.cgi?id=15939

Crashes in a finalizer, likely not related to the dead-lock bug.
Oct 07 2016
On Sun, 25 Sep 2016 16:23:11 +0000, Matthias Klumpp <matthias tenstral.net> wrote:
> Hello! I am working together with others on the D-based appstream-generator[1] project, which is generating software metadata for "software centers" and other package-manager functionality on Linux distributions, and is used by default on Debian, Ubuntu and Arch Linux. For Ubuntu, some modifications on the code were needed, and apparently for them the code is currently crashing in the GC collection thread: http://paste.debian.net/840490/ The project is running a lot of stuff in parallel and is using the GC (if the extraction is a few seconds slower due to the GC being active, it doesn't matter much). [...]
> 2) How can one debug issues like the one mentioned above properly? Since it seems to happen in the GC and doesn't give me information on where to start searching for the issue, I am a bit lost.

Can you get the GDC & LDC phobos versions? We added shared library support in 2.068 which replaced much of GDC-specific backported GC/TLS code with the standard upstream implementation. So using a recent 2.068 GDC could help. Judging from the stack trace you're probably using a 2.067 phobos:
https://github.com/D-Programming-GDC/GDC/blob/722cf5670d927ef6182bf1b72765a64ca0fde693/libphobos/libdruntime/rt/lifetime.d#L1423

Here's some advice for debugging such a problem: The memory layout is usually deterministic when restarting the app in gdb with the run command. So you can do this:

    gdb app

Then get the value of p when the app crashed, in the posted stack trace 0x7fdfae368000

Should now break whenever the object is collected, so you can check if it is collected twice. You can also use next to step until you get the classinfo in c and then print the classinfo contents:

    print c

You can also use write breakpoints to find data corruption: find the value of pc:

then disable the old breakpoint & run from start

This should now break when data is written to the location.

(The commands might not be 100% correct ;-)
Oct 07 2016