digitalmars.D - GC issue? List.pool overwritten by allocated object

Denis Feklushkin <feklushkin.denis gmail.com> writes:
Hi!

It seems I have encountered a bug that is hard to understand and
fix without knowledge of the GC internals. But I have some code
that reproduces the problem well. I made a branch so that
everyone can try it (see below).

This is ordinary (not very beautiful, yes) code that I write for
fun. While running, it creates and destroys various objects;
everything is as usual, it does nothing strange, no manipulations
with the GC except collect() called once or twice, and I also
often call destroy(). There is also no multithreading, but the
Vulkan API is used and it implicitly creates threads. On a
successful run the code displays a window with two rotating
pictures.

For small objects my code regularly and deterministically gets
into a situation where at some point the value of the
core.internal.gc.impl.conservative.gc.List.pool pointer is
overwritten by garbage. Using gdb I tracked that after the
corresponding List.pool is created and written, at some point this
piece of memory is overwritten by a newly allocated D object. As
a result, the garbage value of List.pool is used at the next
gc.Gcx.smallAlloc() call and SIGSEGV occurs.

(For tracking I used gdb option "set scheduler-locking on" - it 
seems that this is what makes List* address the same every time, 
which makes debugging much easier.)

I tried to turn on --d-debug=INVARIANT --d-debug=SENTINEL
--d-debug=MEMSTOMP for druntime. All of these options confirm the
problem. Sometimes the issue shifts to a newly added GC invariant
(as an assert error) or to an assert(*sentinel_pre(p) ==
SENTINEL_PRE) error, or the problem manifests itself not
immediately after launch but after a few seconds of the
application's operation, when allocating an object. But it still
repeats every time - that is, this is not a heisenbug.

Perhaps all this is the result of an error somewhere else, which
results in this behavior. That is, maybe some of my code (or
third-party code) corrupts something that affects allocation? But
it seems that I do not do any hacks, any manipulations with
pointers, etc.

Everything is reproduced on DMD and LDC. I use LDC for debugging 
because it is easy to switch between different druntimes in it.

I couldn't reduce the code to isolate the issue. So here is how to
reproduce:

$ git clone --branch=move_to_ldc2 https://github.com/denizzzka/pukan.git
(ensure you are on commit f7e5293cdeb14da911bc337e281378b92ca39f25)
$ cd pukan #important!
$ dub run

So far I have tested my code only on Linux, so it might not work on
Windows at all.
The issue is reproducible with the druntime supplied with:
DMD64 D Compiler v2.111.0
LDC 1.40.1
May 12
"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 13/05/2025 3:31 AM, Denis Feklushkin wrote:
 Perhaps all this is the result of an error somewhere else, which results
 in this behavior. That is, maybe some of my code (or third-party code)
 corrupts something that affects allocation? But it seems that I do not do
 any hacks, any manipulations with pointers, etc.
This is pretty heavily used code in druntime, my immediate thought is what is your code doing to cause this (I didn't see anything obvious)? Try using ldc's address sanitizer to see if that finds something. Otherwise, try divide and conquer to narrow what triggers it down to a single statement.
May 12
Denis Feklushkin <feklushkin.denis gmail.com> writes:
On Monday, 12 May 2025 at 15:40:01 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 On 13/05/2025 3:31 AM, Denis Feklushkin wrote:
 Perhaps all this is the result of an error somewhere else,
 which results in this behavior. That is, maybe some of my code
 (or third-party code) corrupts something that affects
 allocation? But it seems that I do not do any hacks, any
 manipulations with pointers, etc.
This is pretty heavily used code in druntime, my immediate thought is what is your code doing to cause this (I didn't see anything obvious)?
Yes, of course I understand perfectly well. And it seems to me that I am not doing anything "reprehensible". The failure occurs when the code simply allocates with the "new" keyword, which internally calls the GC's smallAlloc(). It fails at different points, depending on compilation options, compiled-in debug facilities, sanitizers, etc. And if the class whose allocation causes the error is manually made "heavy" by adding a 64kB field, another class allocation simply causes the same error. Maybe somewhere after destroy() I successfully write something into a field of a destroyed object, and this corrupts internal GC data? I'll try to remove all destroy() calls right now.
 Try using ldc's address sanitizer to see if that finds 
 something.
Nothing; the sanitizer only highlights the point where the pointer to the pool is broken:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==36969==ERROR: AddressSanitizer: SEGV on unknown address 0x000100000006 (pc 0x55c3ba20eac3 bp 0x523000005500 sp 0x7ffd7cb0fcf0 T0)
==36969==The signal is caused by a READ memory access.
_D4core8internal2gc4impl12conservativeQw3Gcx10smallAllocMFNbmKmkxC8TypeInfoZPv (/home/denizzz/Dev/pukan3D/pukan+0x33fac3) (BuildId: 25246214f82ed318a32cc136c8e965179f4dcad3)

0x000100000006 is a garbage value, placed by the wrong allocation at swapchain.d:92: s = new SyncFramesInFlight(device, commandBuffers[i]); (if the debug configuration described in the original message is used):
May 12
"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 13/05/2025 6:22 AM, Denis Feklushkin wrote:
     Try using ldc's address sanitizer to see if that finds something.
 
 Nothing; the sanitizer only highlights the point where the pointer to the
 pool is broken:
 
   AddressSanitizer:DEADLYSIGNAL
   =================================================================
   ==36969==ERROR: AddressSanitizer: SEGV on unknown address 0x000100000006 (pc 0x55c3ba20eac3 bp 0x523000005500 sp 0x7ffd7cb0fcf0 T0)
   ==36969==The signal is caused by a READ memory access.
   _D4core8internal2gc4impl12conservativeQw3Gcx10smallAllocMFNbmKmkxC8TypeInfoZPv (/home/denizzz/Dev/pukan3D/pukan+0x33fac3) (BuildId: 25246214f82ed318a32cc136c8e965179f4dcad3)
 
 0x000100000006 is a garbage value, placed by the wrong allocation at
 swapchain.d:92: s = new SyncFramesInFlight(device, commandBuffers[i]);
 (if the debug configuration described in the original message is used):
This is useful information; now you can minimize your code down to what causes it. I suggest throwing DustMite at it and looking for that segfault. https://github.com/CyberShadow/DustMite
May 12
Denis Feklushkin <feklushkin.denis gmail.com> writes:
On Monday, 12 May 2025 at 18:22:16 UTC, Denis Feklushkin wrote:

 Maybe somewhere after destroy() I successfully write something
 into a field of a destroyed object, and this corrupts internal GC
 data? I'll try to remove all destroy() calls right now.
I removed all destroy() calls - nothing changed.
May 12
Steven Schveighoffer <schveiguy gmail.com> writes:
On Monday, 12 May 2025 at 18:22:16 UTC, Denis Feklushkin wrote:
 On Monday, 12 May 2025 at 15:40:01 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 On 13/05/2025 3:31 AM, Denis Feklushkin wrote:
 Perhaps all this is the result of an error somewhere else,
 which results in this behavior. That is, maybe some of my code
 (or third-party code) corrupts something that affects
 allocation? But it seems that I do not do any hacks, any
 manipulations with pointers, etc.
This is pretty heavily used code in druntime, my immediate thought is what is your code doing to cause this (I didn't see anything obvious)?
Yes, of course I understand perfectly well. And it seems to me that I am not doing anything "reprehensible".
The "reprehensible" thing that almost always causes GC issues is use after free because you are interacting with C memory. I have not diagnosed the specific issue, but you are very much using some C libs to do complicated things. I literally just fixed a bug at work that existed for 3 years because a GC object was being freed slightly early. Issue was -- we were using C memory that was owned by a GC object that was no longer referenced. GC runs -- destructor frees memory -- use after free. Not saying this isn't some latent GC bug that has existed for a while. But the good news is that it's repeatable, so it should be possible to track down.
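
To make the failure mode described above concrete, here is a minimal D sketch. It is not taken from the pukan code base, and `CBuffer`, `OwnedSlice` and `makeSlice` are hypothetical names. A GC-managed class owns C memory and frees it in its destructor, so code that keeps only the raw slice must also keep the owner reachable:

```d
import core.memory : GC;
import core.stdc.stdlib : calloc, free;

/// GC-managed owner of a C allocation; the destructor releases it.
final class CBuffer
{
    private ubyte* ptr;
    private size_t len;

    this(size_t n)
    {
        ptr = cast(ubyte*) calloc(n, 1);
        assert(ptr !is null, "calloc failed");
        len = n;
    }

    ~this()
    {
        // Runs when the GC finalizes an owner that is no longer referenced.
        free(ptr);
        ptr = null;
    }

    ubyte[] bytes() { return ptr[0 .. len]; }
}

/// Keeps the owner next to the slice, so the GC still sees a reference
/// to the owner for as long as the slice is in use.
struct OwnedSlice
{
    CBuffer owner;
    ubyte[] bytes;
}

OwnedSlice makeSlice(size_t n)
{
    auto buf = new CBuffer(n);
    return OwnedSlice(buf, buf.bytes);
}

void main()
{
    auto s = makeSlice(16);
    GC.collect();   // the owner is still referenced through s.owner
    s.bytes[0] = 1; // the C allocation has not been freed
    assert(s.owner !is null && s.bytes.length == 16);
}
```

Returning only `buf.bytes` from `makeSlice` would compile just as well, but then nothing the GC scans refers to the `CBuffer`, and a later collection may run the destructor while the slice is still in use.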
 The failure occurs when the code simply allocates with the "new"
 keyword, which internally calls the GC's smallAlloc(). It fails
 at different points, depending on compilation options,
 compiled-in debug facilities, sanitizers, etc. And if the class
 whose allocation causes the error is manually made "heavy" by
 adding a 64kB field, another class allocation simply causes the
 same error.
Having errors very much points at the problem happening *before* `new` is called. If it's not always failing in the same spot, that sounds a lot like memory corruption. And very often the corruption happens long before the explosion.
 Maybe somewhere after destroy() I successfully write something
 into a field of a destroyed object, and this corrupts internal GC
 data? I'll try to remove all destroy() calls right now.
First thing I would rule out is C memory being used to refer to GC objects. Focus on places where C memory is allocated, especially with things like callbacks + data pointer.

Another cause, as I mentioned above, is using a GC object to manage C memory, and then forgetting the GC object but remembering the C memory.

-Steve
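
A minimal sketch of the callback-plus-data-pointer case, assuming a C library that stores the user pointer in memory the GC does not scan. `CSlot`, `Payload`, `pinForC` and `unpinFromC` are hypothetical names for illustration; `GC.addRoot` and `GC.removeRoot` are the druntime calls from `core.memory`:

```d
import core.memory : GC;
import core.stdc.stdlib : malloc, free;

// Hypothetical stand-in for C-side state holding a user-data pointer.
// It lives in malloc'ed memory, which the GC does not scan.
struct CSlot
{
    void* userData;
}

class Payload
{
    int value = 42;
}

/// Stores `obj` in a C-side slot and registers it as a GC root, so it
/// stays alive while only C memory refers to it.
CSlot* pinForC(Payload obj)
{
    auto slot = cast(CSlot*) malloc(CSlot.sizeof);
    assert(slot !is null, "malloc failed");
    GC.addRoot(cast(void*) obj);
    slot.userData = cast(void*) obj;
    return slot;
}

/// Removes the root again and releases the C-side slot.
void unpinFromC(CSlot* slot)
{
    GC.removeRoot(slot.userData);
    free(slot);
}

void main()
{
    auto slot = pinForC(new Payload); // no D reference is kept
    GC.collect();                     // the object is a root, so it survives
    assert((cast(Payload) slot.userData).value == 42);
    unpinFromC(slot);
}
```

Without the `GC.addRoot` call the object may still happen to survive a collection, because the conservative GC can find stale copies of the pointer on the stack or in registers, which is one reason this kind of bug can stay hidden for a long time.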
May 12
Denis Feklushkin <feklushkin.denis gmail.com> writes:
First of all, I want to thank everyone for their help. And, yes -
I forgot to check the obvious things before I went deep into the GC.

On Monday, 12 May 2025 at 21:29:10 UTC, Steven Schveighoffer 
wrote:

 Having errors very much points at the problem happening 
 *before* `new` is called. If it's not always failing in the 
 same spot, that sounds a lot like memory corruption. And very 
 often the corruption happens long before the explosion.
 First thing I would rule out is C memory being used to refer to 
 GC objects. Focus on places where C memory is allocated, 
 especially with things like callbacks + data pointer.

 Another cause, as I mentioned above, is using a GC object to 
 manage C memory, and then forgetting the GC object but 
 remembering the C memory.

 -Steve
So far I have done two things in this direction:

1. I called GC.disable() at the start of main()
2. destroy() was removed from the code

It seems like this should eliminate the probability of use after free and of C code referring to D objects? Nothing has changed, the issue is still here.

dustmite needs a lot of time - I launched it but I'm still waiting
May 13
Denis Feklushkin <feklushkin.denis gmail.com> writes:
On Tuesday, 13 May 2025 at 10:08:02 UTC, Denis Feklushkin wrote:

 It seems like this should eliminate the probability of use after
 free and of C code referring to D objects?
More precisely, the probability of damage to internal GC structures.
May 13
"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 13/05/2025 10:08 PM, Denis Feklushkin wrote:
 dustmite needs a lot of time - I launched it but I'm still waiting
What I normally do is help dustmite. Run it for a little bit, dup the file system, remove some cycles or dependent usage of a variable, run it again. It can't always break chains, so it needs a bit of help.
May 13
Denis Feklushkin <feklushkin.denis gmail.com> writes:
On Tuesday, 13 May 2025 at 10:14:17 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 On 13/05/2025 10:08 PM, Denis Feklushkin wrote:
 dustmite needs a lot of time - I launched it but I'm still 
 waiting
What I normally do is help dustmite. Run it for a little bit, dup the file system, remove some cycles or dependent usage of a variable, run it again. It can't always break chains, so it needs a bit of help.
Yes, I have already done some of this. At first I launched "dub dustmite" (for the whole night), which, it seems, works only on the whole source tree including dependencies, and that was too much work.
May 13
Denis Feklushkin <feklushkin.denis gmail.com> writes:
On Tuesday, 13 May 2025 at 10:21:12 UTC, Denis Feklushkin wrote:

I added a simple `debug(PRINTF)` section right after the druntime
allocator call. It raises an error if the newly allocated memory
intersects with the already allocated internal bucket List
structures. I hope I didn't make a mistake in this code?

```d
auto p = runLocked!(mallocNoSync, mallocTime, numMallocs)(size, bits, localAllocSize, ti);

debug(PRINTF)
{
     outer:
     foreach(List* firstList; gcx.bucket)
     {
         List* curr = firstList;
         while(curr !is null)
         {
             void* p_end = cast(ubyte*) p + localAllocSize;
             void* curr_end = cast(ubyte*) curr + List.sizeof;

             const bool notIntersects = ((p < curr && p_end < curr) || (p > curr_end && p_end > curr_end));

             if(!notIntersects)
             {
                 printf("%p - allocated into bucket List value, located on %p: firstList.pool=%p curr.pool=%p\n",
                     p, curr, firstList.pool, curr.pool);

                 assert(false);
                 break outer;
             }

             curr = curr.next;
         }
     }
}
```

Druntime was built as a debug version with INVARIANT, MEMSTOMP
and PRINTF enabled.

Then this snippet was run against the rebuilt druntime (do not
forget to replace the path to the new druntime in ldc2.conf):
```d
/+ dub.sdl:
	name "issue"
+/
// How to run: dub run --single app.d

class C {}

void main()
{
     new C;
}
```

```
 dub run --single app.d --compiler=ldc2
Starting Performing "debug" build using ldc2 for x86_64.
Building issue ~master: building configuration [application]
Linking issue
Running issue
_d_newclass(ci = 0x56496398c350, app.C)
0x5649a1312c90.Gcx::addRange(0x564963985940, 0x564963994718)
GC::malloc(gcx = 0x5649a1312c90, size = 16 bits = 2, ti = app.C) => p = 0x7fa30e5cb000
0x7fa30e5cb000 - allocated into bucket List value, located on 0x7fa30e5cb010: firstList.pool=0x5649a1313fa0 curr.pool=0x5649a1313fa0
core.exception.AssertError core/internal/gc/impl/conservative/gc.d(505): Assertion failure
----------------
core/runtime.d:831 [0x564963942d45]
core/lifetime.d:126 [0x56496394234c]
core/runtime.d:753 [0x564963942d0e]
core/runtime.d:773 [0x564963942640]
rt/dmain2.d:241 [0x564963920f30]
rt/deh.d:47 [0x564963949b9e]
rt/dwarfeh.d:347 [0x564963921ac2]
core/exception.d:569 [0x564963936a05]
core/exception.d:808 [0x564963936444]
core/internal/gc/impl/conservative/gc.d:505 [0x5649639502f3]
core/internal/gc/proxy.d:156 [0x56496393cf70]
core/internal/gc/impl/proto/gc.d:101 [0x5649639604fb]
core/internal/gc/proxy.d:156 [0x56496393cf70]
rt/lifetime.d:130 [0x5649639235fe]
app.d:10 [0x56496391a7af]
rt/dmain2.d:520 [0x56496392169c]
rt/dmain2.d:474 [0x5649639214b2]
rt/dmain2.d:520 [0x5649639215ba]
rt/dmain2.d:474 [0x5649639214b2]
rt/dmain2.d:545 [0x564963921372]
rt/dmain2.d:333 [0x564963921040]
/home/denizzz/ldc2_standalone/bin/../import/core/internal/entrypoint.d:42 [0x56496391a7f1]
??:? [0x7fa30e6f6ca7]
??:? __libc_start_main [0x7fa30e6f6d64]
??:? [0x56496391a6d0]
GC.fullCollect()
processing GC Marks, (nil)
rt_finalize2(p = 0x5649a1312c20)
Error Program exited with code 1
```

Am I making an obvious mistake somewhere?
May 13
kinke <noone nowhere.com> writes:
On Tuesday, 13 May 2025 at 18:30:34 UTC, Denis Feklushkin wrote:
 I hope I didn't make a mistake in this code?
The intersection logic is wrong, treating adjacency as intersection. Try this: `const bool intersects = (p_end > curr && p < curr_end)`.
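
A small sketch of the difference, using the addresses from the log above. `overlaps` and `originalCheck` are illustrative names, and `listSize` is an assumed stand-in for `List.sizeof`. Ranges are treated as half-open, so a block that ends exactly where the next one starts does not count as an intersection:

```d
import std.stdio : writeln;

/// True when the half-open ranges [aStart, aEnd) and [bStart, bEnd)
/// share at least one byte.
bool overlaps(ulong aStart, ulong aEnd, ulong bStart, ulong bEnd)
    pure nothrow @nogc @safe
{
    return aEnd > bStart && aStart < bEnd;
}

/// The check as posted in the thread: it also fires when the ranges
/// only touch (pEnd == curr).
bool originalCheck(ulong p, ulong pEnd, ulong curr, ulong currEnd)
    pure nothrow @nogc @safe
{
    const bool notIntersects = (p < curr && pEnd < curr)
        || (p > currEnd && pEnd > currEnd);
    return !notIntersects;
}

void main()
{
    enum ulong p = 0x7fa30e5cb000;    // new allocation, from the log
    enum ulong curr = 0x7fa30e5cb010; // List node, from the log
    enum ulong allocSize = 16;        // size = 16 in the log
    enum ulong listSize = 16;         // assumption: stand-in for List.sizeof

    writeln("original:  ", originalCheck(p, p + allocSize, curr, curr + listSize));
    writeln("half-open: ", overlaps(p, p + allocSize, curr, curr + listSize));
}
```

With these values the allocation ends at 0x7fa30e5cb010, exactly where the List node begins, so the original check reports a hit while the half-open check does not.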
May 13