digitalmars.D - GC hangs on spinlock in ConservativeGC.runLocked
- klasbo (48/48) Feb 20 2023 This is in a vibe.d application, on Ubuntu server 22.04.
- Steven Schveighoffer (5/13) Feb 20 2023 This bug was fixed in 2.102.0:
- klasbo (17/21) Feb 20 2023 I was using 2.100, so that would explain it then.
- Steven Schveighoffer (31/40) Feb 20 2023 So it was quite a subtle bug.
- klasbo (7/19) Feb 20 2023 So at least I will now get a proper error (presumably
This is in a vibe.d application, on Ubuntu server 22.04. Presumably something isn't unlocking the gcLock, but I have no idea what. See GDB backtrace below. Ideas?

And is there some way to get stack traces for any other vibe-tasks in a running process? Could this be caused by a non-clean exit of a vibe task? It's hard to know if this is me, vibe.d, or druntime...

```
(clock_id=clock_id entry=0, flags=flags entry=0, req=0x7f7d0e6763d0, rem=0x7f7d0e6763e0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
core.thread.osthread.Thread.sleep(core.time.Duration) ()
core.internal.spinlock.SpinLock.yield(ulong) shared ()
shared ()
core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo)) ()
core.internal.gc.impl.conservative.gc.ConservativeGC.malloc(ulong, uint, const(TypeInfo)) ()
()
core.internal.gc.impl.conservative.gc.Gcx.fullcollect(bool, bool, bool) ()
core.internal.gc.impl.conservative.gc.Gcx.bigAlloc(ulong, ref ulong, uint, const(TypeInfo)) ()
core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo)) ()
core.internal.gc.impl.conservative.gc.ConservativeGC.qalloc(ulong, uint, scope const(TypeInfo)) ()
_D3std5array__T8AppenderTAyaZQo13ensureAddableMFmZ9__lambda9MFNaNbNeZS4core6memory8BlkInfo_ (__capture=0x7f7d0e676a90, __HID4=0x7f7d0e676a60) at /usr/local/dlang/dmd-2.100.2/linux/bin64/../../src/phobos/std/array.d:3576
[rest of trace omitted]
```
Feb 20 2023
On 2/20/23 8:01 AM, klasbo wrote:

> This is in a vibe.d application, on Ubuntu server 22.04. Presumably something isn't unlocking the gcLock, but I have no idea what. See GDB backtrace below. Ideas? And is there some way to get stack traces for any other vibe-tasks in a running process? Could this be caused by a non-clean exit of a vibe task?

This bug was fixed in 2.102.0:

https://dlang.org/changelog/2.102.0.html#druntime.nogc-traceinfo

The actual bug:

https://issues.dlang.org/show_bug.cgi?id=22616

-Steve
Feb 20 2023
On Monday, 20 February 2023 at 16:31:18 UTC, Steven Schveighoffer wrote:

> This bug was fixed in 2.102.0:
> https://dlang.org/changelog/2.102.0.html#druntime.nogc-traceinfo
>
> The actual bug:
> https://issues.dlang.org/show_bug.cgi?id=22616
>
> -Steve

I was using 2.100, so that would explain it then.

I don't quite understand the internals of the GC well enough to "get" what happens here. Is it that an assert in sweep() triggers, and this[1] scope(failure) in fullcollect() would re-trigger a GC collection when it tries to allocate the trace info (which is now fixed, since trace info allocation is nogc)?

Or the more important (for me) question: is this part of "normal" GC control flow (why would assert(freedPages < usedPages) trigger? This is beyond the limit of my GC understanding!), or is there still/also something broken on my end that I have to look for?

[1] https://github.com/dlang/dmd/blob/3457018e34ab2333de22a6285622d3190268d5e0/druntime/src/core/internal/gc/impl/conservative/gc.d#L3147

Thank you for the answer, and (to all involved) the work in fixing it!
Feb 20 2023
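[For reference on the question above: a `scope(failure)` block runs when its enclosing scope is exited by a thrown Throwable, and the Throwable then continues to propagate afterwards; the compiler effectively lowers it to a catch-and-rethrow. A minimal standalone sketch of just that behavior (nothing here is druntime code):]

```d
import std.stdio : writeln;

void work()
{
    // Runs only if this scope is exited via a thrown Throwable; the
    // Throwable then keeps propagating (effectively a rethrow).
    scope(failure) writeln("scope(failure) ran");
    throw new Error("boom");
}

void main()
{
    try
        work();
    catch (Error e)
        writeln("propagated to main: ", e.msg); // the Error was not swallowed
}
```

[This prints "scope(failure) ran" followed by "propagated to main: boom" — the cleanup runs, but the Error still escapes, which is the rethrow path discussed in the next reply.]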
On 2/20/23 4:28 PM, klasbo wrote:

> I don't quite understand the internals of the GC well enough to "get" what happens here. Is it that an assert in sweep() triggers, and this[1] scope(failure) in fullcollect() would re-trigger a GC collection when it tries to allocate the trace info (which is now fixed, since trace info allocation is nogc)?

So it was quite a subtle bug.

The throw handler will look at the `info` member of the Throwable (the trace info), and if it is null, will attempt to allocate a trace info. If you threw inside the GC (i.e. when the GC is collecting), this means you *cannot* allocate. The trace info allocator sneakily was using the GC, and so it had a case where it checked the `inFinalizer` flag, and if that was true, it would simply not allocate trace info (leaving it at null).

In the GC collection routine, there is a mechanism that catches `Exception`, and if so, handles it properly by throwing a FinalizerError (this has its `info` set to `SuppressTraceInfo`, which prevents further trace info allocations). But `Error` is *not* caught. It leaks all the way out to a `scope(failure)` statement. Now, inside this `scope(failure)` statement, the `inFinalizer` flag is set to false. What does this do?

a. Catches the Error (which, remember, still has the null trace info).
b. Sets the flag to false.
c. *Rethrows the Error*. This sees a null trace info, and tries to allocate.
d. Allocating tries to take the spinlock, which is still locked, and deadlocks.

The whole reason we don't allow allocating inside the finalizer is because of this deadlock!

The solution is to allocate the trace info with C `malloc`, and free it with `free`. The trace info was always simply a list of stack frame addresses, and so trivial to allocate/free.

> Or the more important (for me) question: is this part of "normal" GC control flow (why would assert(freedPages < usedPages) trigger? This is beyond the limit of my GC understanding!), or is there still/also something broken on my end that I have to look for?

The error itself is something that isn't addressed with this fix. So I should clarify that the *deadlock* was fixed, not your original root cause. Please try with the latest compiler, and elaborate further if you still can't figure it out and/or file a bug!

-Steve
Feb 20 2023
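[To make the sequence a–d concrete, here is a minimal, self-contained model of the deadlock as described: a non-reentrant spinlock, a collection routine that throws while holding it, and an error path that takes the same lock again. All names — `NaiveLock`, `allocTraceInfo`, `collect` — are hypothetical stand-ins for gcLock, the trace info allocator, and Gcx.fullcollect, not druntime's actual code.]

```d
import core.atomic : atomicStore, cas;

struct NaiveLock
{
    private shared bool locked;

    void lock() shared
    {
        // Spin until acquired. Non-reentrant: locking again from the
        // same thread spins forever -- the hang seen in the backtrace.
        while (!cas(&locked, false, true)) { }
    }

    void unlock() shared
    {
        atomicStore(locked, false);
    }
}

shared NaiveLock gcLock; // stands in for druntime's GC spinlock

// Models the (pre-fix) trace info allocator, which went through the GC
// and therefore through the same lock.
void allocTraceInfo()
{
    gcLock.lock(); // the lock is still held by collect(): hangs here
    scope(exit) gcLock.unlock();
}

// Models fullcollect: runs with the lock held; an Error escapes to a
// scope(failure) whose rethrow path ends up allocating trace info.
void collect()
{
    gcLock.lock();
    scope(failure) allocTraceInfo(); // models steps a-d above
    throw new Error("freedPages < usedPages"); // models the failing assert
}

void main()
{
    collect(); // never returns: spins forever inside allocTraceInfo
}
```

[Running this hangs exactly the way the original backtrace shows: the rethrow path re-enters the lock that the collection already holds.]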
On Monday, 20 February 2023 at 22:11:47 UTC, Steven Schveighoffer wrote:

> On 2/20/23 4:28 PM, klasbo wrote:
>> Or the more important (for me) question: is this part of "normal" GC control flow (why would assert(freedPages < usedPages) trigger? This is beyond the limit of my GC understanding!), or is there still/also something broken on my end that I have to look for?
>
> The error itself is something that isn't addressed with this fix. So I should clarify that the *deadlock* was fixed, not your original root cause. Please try with the latest compiler, and elaborate further if you still can't figure it out and/or file a bug!
>
> -Steve

So at least I will now get a proper error (presumably InvalidMemoryOperationError?) and its stack trace when/if-ever this happens again. Or, if I am (un)lucky, this will be "handled" by vibe and only show up in any logs I enable.

Thanks again.
Feb 20 2023
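[For completeness, a sketch of the shape of the fix described upthread: since the trace is just a flat list of frame addresses, it can live on the C heap and be built without touching the GC. The names here — `CHeapTrace`, `capture`, `release` — are illustrative assumptions, not druntime's actual API.]

```d
// Illustrative only -- not druntime's implementation. The point is
// that capture() is @nogc: it is safe to run even while the GC's
// lock is held, which is what breaks the deadlock cycle.
import core.stdc.stdlib : calloc, free;

struct CHeapTrace
{
    void** frames; // raw return addresses
    size_t length;

    // Copy the captured frame addresses into C-heap memory.
    static CHeapTrace capture(scope void*[] addrs) @nogc nothrow
    {
        CHeapTrace t;
        t.frames = cast(void**) calloc(addrs.length, (void*).sizeof);
        if (t.frames is null)
            return t; // allocation failed: degrade to an empty trace
        foreach (i, a; addrs)
            t.frames[i] = a;
        t.length = addrs.length;
        return t;
    }

    // Freeing is likewise GC-free.
    void release() @nogc nothrow
    {
        free(frames);
        frames = null;
        length = 0;
    }
}
```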