digitalmars.D - GC hangs on spinlock in ConservativeGC.runLocked
- klasbo (48/48) Feb 20 2023 This is in a vibe.d application, on Ubuntu server 22.04.
- Steven Schveighoffer (5/13) Feb 20 2023 This bug was fixed in 2.102.0:
- klasbo (17/21) Feb 20 2023 I was using 2.100, so that would explain it then.
- Steven Schveighoffer (31/40) Feb 20 2023 So it was quite a subtle bug.
- klasbo (7/19) Feb 20 2023 So at least I will now get a proper error (presumably
This is in a vibe.d application, on Ubuntu server 22.04. Presumably something isn't unlocking the gcLock, but I have no idea what. See GDB backtrace below. Ideas?

And is there some way to get stack traces for any other vibe-tasks in a running process? Could this be caused by a non-clean exit of a vibe task? It's hard to know if this is me, vibe.d, or druntime...

```
(clock_id=clock_id entry=0, flags=flags entry=0, req=0x7f7d0e6763d0, rem=0x7f7d0e6763e0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
core.thread.osthread.Thread.sleep(core.time.Duration) ()
core.internal.spinlock.SpinLock.yield(ulong) shared ()
shared ()
core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo)) ()
core.internal.gc.impl.conservative.gc.ConservativeGC.malloc(ulong, uint, const(TypeInfo)) ()
()
core.internal.gc.impl.conservative.gc.Gcx.fullcollect(bool, bool, bool) ()
core.internal.gc.impl.conservative.gc.Gcx.bigAlloc(ulong, ref ulong, uint, const(TypeInfo)) ()
core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), core.internal.gc.impl.conservative.gc.mallocTime, core.internal.gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo)) ()
core.internal.gc.impl.conservative.gc.ConservativeGC.qalloc(ulong, uint, scope const(TypeInfo)) ()
_D3std5array__T8AppenderTAyaZQo13ensureAddableMFmZ9__lambda9MFNaNbNeZS4core6memory8BlkInfo_ (__capture=0x7f7d0e676a90, __HID4=0x7f7d0e676a60) at /usr/local/dlang/dmd-2.100.2/linux/bin64/../../src/phobos/std/array.d:3576
[rest of trace omitted]
```
Feb 20 2023
On 2/20/23 8:01 AM, klasbo wrote:

> This is in a vibe.d application, on Ubuntu server 22.04. Presumably something isn't unlocking the gcLock, but I have no idea what. See GDB backtrace below. Ideas? And is there some way to get stack traces for any other vibe-tasks in a running process? Could this be caused by a non-clean exit of a vibe task?

This bug was fixed in 2.102.0:

https://dlang.org/changelog/2.102.0.html#druntime.nogc-traceinfo

The actual bug:

https://issues.dlang.org/show_bug.cgi?id=22616

-Steve
Feb 20 2023
On Monday, 20 February 2023 at 16:31:18 UTC, Steven Schveighoffer wrote:

> This bug was fixed in 2.102.0:
> https://dlang.org/changelog/2.102.0.html#druntime.nogc-traceinfo
>
> The actual bug:
> https://issues.dlang.org/show_bug.cgi?id=22616
>
> -Steve

I was using 2.100, so that would explain it then.

I don't quite understand the internals of the GC well enough to "get" what happens here. Is it that an assert in sweep() triggers, and this[1] scope(failure) in fullcollect() would re-trigger a GC collection when it tries to allocate the trace info (which is now fixed, since trace info allocation is nogc)?

Or the more important (for me) question: is this part of "normal" GC control flow (why would assert(freedPages < usedPages) trigger? This is beyond the limit of my GC understanding!), or is there still/also something broken on my end that I have to look for?

[1] https://github.com/dlang/dmd/blob/3457018e34ab2333de22a6285622d3190268d5e0/druntime/src/core/internal/gc/impl/conservative/gc.d#L3147

Thank you for the answer, and (to all involved) the work in fixing it!
Feb 20 2023
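[For reference on the question above: a `scope(failure)` block runs when its enclosing scope is exited by a thrown Throwable, and the Throwable then continues to propagate afterwards; the compiler effectively lowers it to a catch-and-rethrow. A minimal standalone sketch of just that behavior (nothing here is druntime code):]

```d
import std.stdio : writeln;

void work()
{
    // Runs only if this scope is exited via a thrown Throwable; the
    // Throwable then keeps propagating (effectively a rethrow).
    scope(failure) writeln("scope(failure) ran");
    throw new Error("boom");
}

void main()
{
    try
        work();
    catch (Error e)
        writeln("propagated to main: ", e.msg); // the Error was not swallowed
}
```

[This prints "scope(failure) ran" followed by "propagated to main: boom" — the cleanup runs, but the Error still escapes, which is the rethrow path discussed in the next reply.]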
On 2/20/23 4:28 PM, klasbo wrote:

> I don't quite understand the internals of the GC well enough to "get" what happens here. Is it that an assert in sweep() triggers, and this[1] scope(failure) in fullcollect() would re-trigger a GC collection when it tries to allocate the trace info (which is now fixed, since trace info allocation is nogc)?

So it was quite a subtle bug.

The throw handler will look at the `info` member of the Throwable (the trace info), and if it is null, will attempt to allocate a trace info. If you threw inside the GC (i.e. when the GC is collecting), this means you *cannot* allocate. The trace info allocator sneakily was using the GC, and so it had a case where it checked the `inFinalizer` flag, and if that was true, it would simply not allocate trace info (leaving it at null).

In the GC collection routine, there is a mechanism that catches `Exception`, and if so, handles it properly by throwing a FinalizerError (this has its `info` set to `SuppressTraceInfo`, which prevents further trace info allocations). But `Error` is *not* caught. It leaks all the way out to a `scope(failure)` statement. Now, inside this `scope(failure)` statement, the `inFinalizer` flag is set to false. What does this do?

a. Catches the Error (which, remember, still has the null trace info).
b. Sets the flag to false.
c. *Rethrows the Error*. This sees a null trace info, and tries to allocate.
d. Allocating tries to take the spinlock, which is still locked, and deadlocks.

The whole reason we don't allow allocating inside the finalizer is because of this deadlock!

The solution is to allocate the trace info with C `malloc`, and free it with `free`. The trace info was always simply a list of stack frame addresses, and so trivial to allocate/free.

> Or the more important (for me) question: is this part of "normal" GC control flow (why would assert(freedPages < usedPages) trigger? This is beyond the limit of my GC understanding!), or is there still/also something broken on my end that I have to look for?

The error itself is something that isn't addressed with this fix. So I should clarify that the *deadlock* was fixed, not your original root cause. Please try with the latest compiler, and elaborate further if you still can't figure it out and/or file a bug!

-Steve
Feb 20 2023
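[To make the sequence a–d concrete, here is a minimal, self-contained model of the deadlock as described: a non-reentrant spinlock, a collection routine that throws while holding it, and an error path that takes the same lock again. All names — `NaiveLock`, `allocTraceInfo`, `collect` — are hypothetical stand-ins for gcLock, the trace info allocator, and Gcx.fullcollect, not druntime's actual code.]

```d
import core.atomic : atomicStore, cas;

struct NaiveLock
{
    private shared bool locked;

    void lock() shared
    {
        // Spin until acquired. Non-reentrant: locking again from the
        // same thread spins forever -- the hang seen in the backtrace.
        while (!cas(&locked, false, true)) { }
    }

    void unlock() shared
    {
        atomicStore(locked, false);
    }
}

shared NaiveLock gcLock; // stands in for druntime's GC spinlock

// Models the (pre-fix) trace info allocator, which went through the GC
// and therefore through the same lock.
void allocTraceInfo()
{
    gcLock.lock(); // the lock is still held by collect(): hangs here
    scope(exit) gcLock.unlock();
}

// Models fullcollect: runs with the lock held; an Error escapes to a
// scope(failure) whose rethrow path ends up allocating trace info.
void collect()
{
    gcLock.lock();
    scope(failure) allocTraceInfo(); // models steps a-d above
    throw new Error("freedPages < usedPages"); // models the failing assert
}

void main()
{
    collect(); // never returns: spins forever inside allocTraceInfo
}
```

[Running this hangs exactly the way the original backtrace shows: the rethrow path re-enters the lock that the collection already holds.]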
On Monday, 20 February 2023 at 22:11:47 UTC, Steven Schveighoffer wrote:

> On 2/20/23 4:28 PM, klasbo wrote:
>> Or the more important (for me) question: is this part of "normal" GC control flow (why would assert(freedPages < usedPages) trigger? This is beyond the limit of my GC understanding!), or is there still/also something broken on my end that I have to look for?
>
> The error itself is something that isn't addressed with this fix. So I should clarify that the *deadlock* was fixed, not your original root cause. Please try with the latest compiler, and elaborate further if you still can't figure it out and/or file a bug!
>
> -Steve

So at least I will now get a proper error (presumably InvalidMemoryOperationError?) and its stack trace when/if-ever this happens again. Or, if I am (un)lucky, this will be "handled" by vibe and only show up in any logs I enable.

Thanks again.
Feb 20 2023
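[For completeness, a sketch of the shape of the fix described upthread: since the trace is just a flat list of frame addresses, it can live on the C heap and be built without touching the GC. The names here — `CHeapTrace`, `capture`, `release` — are illustrative assumptions, not druntime's actual API.]

```d
// Illustrative only -- not druntime's implementation. The point is
// that capture() is @nogc: it is safe to run even while the GC's
// lock is held, which is what breaks the deadlock cycle.
import core.stdc.stdlib : calloc, free;

struct CHeapTrace
{
    void** frames; // raw return addresses
    size_t length;

    // Copy the captured frame addresses into C-heap memory.
    static CHeapTrace capture(scope void*[] addrs) @nogc nothrow
    {
        CHeapTrace t;
        t.frames = cast(void**) calloc(addrs.length, (void*).sizeof);
        if (t.frames is null)
            return t; // allocation failed: degrade to an empty trace
        foreach (i, a; addrs)
            t.frames[i] = a;
        t.length = addrs.length;
        return t;
    }

    // Freeing is likewise GC-free.
    void release() @nogc nothrow
    {
        free(frames);
        frames = null;
        length = 0;
    }
}
```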