digitalmars.D - GC issue? List.pool overwritten by allocated object

Denis Feklushkin <feklushkin.denis gmail.com> writes:

Hi!

It seems I have encountered a bug that is hard to understand and 
fix without knowledge of the GC internals. But I have some code 
that reproduces the problem well. I made a branch so that 
everyone can try it (see below).

This is ordinary (not very beautiful, yes) code that I write for 
fun. During a run it creates and destroys various objects; 
everything is as usual, it does nothing strange, no manipulations 
with the GC except collect() called once or twice, and I also 
often call destroy(). There is also no multithreading, but the 
Vulkan API is used and it implicitly creates threads. On a 
successful run the code displays a window with two rotating 
pictures.

For small objects my code regularly and deterministically gets 
into a situation where at some point the value of the 
core.internal.gc.impl.conservative.gc.List.pool pointer is 
overwritten by garbage. Using gdb I tracked that after the 
appropriate List.pool is created and written, at some time this 
piece of memory is overwritten by a newly allocated D object. As 
a result, the garbage value of List.pool is used at the next 
gc.Gcx.smallAlloc() call and SIGSEGV occurs.

(For tracking I used gdb option "set scheduler-locking on" - it 
seems that this is what makes List* address the same every time, 
which makes debugging much easier.)

I tried to turn on --d-debug=INVARIANT --d-debug=SENTINEL 
--d-debug=MEMSTOMP for druntime. All these options confirm the 
problem. Sometimes the issue shifts either to a newly added GC 
invariant (as an assert error) or to an 
assert(*sentinel_pre(p) == SENTINEL_PRE) error, or the problem 
manifests itself not immediately after launch, but after a few 
seconds of the application's operation, when allocating an 
object. But it still repeats every time - that is, this is not a 
heisenbug.

Perhaps all this is the result of an error somewhere else, which 
results in this behavior. That is, some of my code (or 
third-party code) corrupts something that affects allocation? But 
it seems that I do not do any hacks, any manipulations with 
pointers, etc.

Everything is reproduced on DMD and LDC. I use LDC for debugging 
because it is easy to switch between different druntimes in it.

I couldn't reduce the code to isolate the issue. So here is how 
to reproduce:

$ git clone --branch=move_to_ldc2 https://github.com/denizzzka/pukan.git
(ensure you are on commit f7e5293cdeb14da911bc337e281378b92ca39f25)
$ cd pukan #important!
$ dub run

For now I have tested my code only on Linux, so it might not work 
on Windows at all.
The issue is reproducible on the druntime supplied with:
DMD64 D Compiler v2.111.0
LDC 1.40.1

May 12

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 13/05/2025 3:31 AM, Denis Feklushkin wrote:
 Perhaps all this is the result of an error somewhere else, which results 
 in this behavior. That is, if some my code (or third party) corrupts 
 something that affects to allocation? But it seems that I do not do any 
 hacks, any manipulations with pointers, etc.

This is pretty heavily used code in druntime, my immediate thought is 
what is your code doing to cause this (I didn't see anything obvious)?

Try using ldc's address sanitizer to see if that finds something.

Otherwise try dividing and conquering to find what triggers it down to 
the statement.

May 12

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Monday, 12 May 2025 at 15:40:01 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 On 13/05/2025 3:31 AM, Denis Feklushkin wrote:
 Perhaps all this is the result of an error somewhere else, 
 which results in this behavior. That is, if some my code (or 
 third party) corrupts something that affects to allocation? 
 But it seems that I do not do any hacks, any manipulations 
 with pointers, etc.

 This is pretty heavily used code in druntime, my immediate 
 thought is what is your code doing to cause this (I didn't see 
 anything obvious)?

Yes, of course I understand perfectly well. And it seems to me 
that I am not doing anything "reprehensible".

The failure occurs when code simply allocates with the "new" 
keyword, which internally calls the GC's smallAlloc(). It fails 
at different points, depending on compilation options, 
compiled-in debug facilities, sanitizers, etc. And if the class 
whose allocation causes the error is manually moved into the 
"heavy" category by adding a 64 kB field, another class 
allocation simply causes the same error.

Maybe somewhere after destroy() I successfully write something 
into a destroyed object's field and this corrupts internal GC 
data? I'll try to remove all destroy() calls right now.

 Try using ldc's address sanitizer to see if that finds 
 something.

Nothing; the sanitizer only highlights the point where the 
pointer to the pool is already broken:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==36969==ERROR: AddressSanitizer: SEGV on unknown address 
0x000100000006 (pc 0x55c3ba20eac3 bp 0x523000005500 sp 
0x7ffd7cb0fcf0 T0)
==36969==The signal is caused by a READ memory access.

_D4core8internal2gc4impl12conservativeQw3Gcx10smallAllocMFNbmKmkxC8TypeInfoZPv
(/home/denizzz/Dev/pukan3D/pukan+0x33fac3) (BuildId:
25246214f82ed318a32cc136c8e965179f4dcad3)

0x000100000006 is a garbage value, placed by a wrong allocation 
at swapchain.d:92:
s = new SyncFramesInFlight(device, commandBuffers[i]);
(if the debug configuration described in the original message is 
used).

May 12

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 13/05/2025 6:22 AM, Denis Feklushkin wrote:
     Try using ldc's address sanitizer to see if that finds something.
 
 Nothing, sanitizer only highlights point where is pointer to pool is broken:
 
 
   AddressSanitizer:DEADLYSIGNAL
 
 ==36969==ERROR: AddressSanitizer: SEGV on unknown address 0x000100000006 
 (pc 0x55c3ba20eac3 bp 0x523000005500 sp 0x7ffd7cb0fcf0 T0) ==36969==The 

 _D4core8internal2gc4impl12conservativeQw3Gcx10smallAllocMFNbmKmkxC8TypeInfoZPv
(/home/denizzz/Dev/pukan3D/pukan+0x33fac3) (BuildId:
25246214f82ed318a32cc136c8e965179f4dcad3)
 
 0x000100000006 is garbage value, placed by wrong allocation at 
 swapchain.d:92: s = new SyncFramesInFlight(device, commandBuffers[i]); 
 (if used debug configuration described in origin message):

This is useful information, now you can minify your code to what causes it.

I suggest throwing dustmite at it, and looking for that segfault.

https://github.com/CyberShadow/DustMite

May 12

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Monday, 12 May 2025 at 18:22:16 UTC, Denis Feklushkin wrote:

 Maybe somewhere after destroy() I successfully write something 
 into destroyed object field and this corrupts internal GC data? 
 I'll try to remove everything destroy() calls right now

Removed all destroy() calls - nothing changed

May 12

Steven Schveighoffer <schveiguy gmail.com> writes:

On Monday, 12 May 2025 at 18:22:16 UTC, Denis Feklushkin wrote:
 On Monday, 12 May 2025 at 15:40:01 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 On 13/05/2025 3:31 AM, Denis Feklushkin wrote:
 Perhaps all this is the result of an error somewhere else, 
 which results in this behavior. That is, if some my code (or 
 third party) corrupts something that affects to allocation? 
 But it seems that I do not do any hacks, any manipulations 
 with pointers, etc.

 This is pretty heavily used code in druntime, my immediate 
 thought is what is your code doing to cause this (I didn't see 
 anything obvious)?

 Yes, of course I understand perfectly well. And it seems to me 
 that I am not doing anything "reprehensible".

The "reprehensible" thing that almost always causes GC issues is 
use after free because you are interacting with C memory. I have 
not diagnosed the specific issue, but you are very much using 
some C libs to do complicated things.

I literally just fixed a bug at work that existed for 3 years 
because a GC object was being freed slightly early. Issue was -- 
we were using C memory that was owned by a GC object that was no 
longer referenced. GC runs -- destructor frees memory -- use 
after free.
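
The pattern described above can be sketched in a few lines. This is only an illustration under assumptions: `CBuffer`, `BufferHandle` and `makeBuffer` are hypothetical names and are not taken from the code discussed in this thread. The point is that the raw C pointer should travel together with a reference to the GC object that owns it.

```d
import core.stdc.stdlib : free, malloc;

/// Hypothetical wrapper: a GC-managed object that owns a C allocation.
final class CBuffer
{
    int* data;

    this(size_t count)
    {
        data = cast(int*) malloc(count * int.sizeof);
        assert(data !is null, "malloc failed");
    }

    ~this()
    {
        // Runs whenever the GC finalizes the object, possibly while a raw
        // copy of `data` is still in use somewhere else.
        free(data);
        data = null;
    }
}

/// Keeps the owner next to the raw pointer, so the GC sees the buffer as
/// in use for as long as this struct is reachable.
struct BufferHandle
{
    CBuffer owner; // this reference is what prevents early finalization
    int* raw;
}

BufferHandle makeBuffer(size_t count)
{
    auto owner = new CBuffer(count);
    return BufferHandle(owner, owner.data);
}

void main()
{
    auto handle = makeBuffer(4);
    handle.raw[0] = 42;
    assert(handle.owner.data is handle.raw);
    assert(handle.raw[0] == 42);
}
```

If only `raw` were stored and `owner` dropped, a later collection could run the destructor and leave `raw` dangling, which matches the use-after-free scenario from the post.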

Not saying this isn't some latent GC bug that has existed for a 
while. But the good news is that it's repeatable, so it should be 
possible to track down.

 Failure causes when code simply allocates by the "new" keyword, 
 which internally calls GC's smallAlloc(). Fails on different 
 points, depending on compilation options, compiled-in debug 
 facilities, sanitizers, etc. And if class what allocation 
 causes error manually moved into "heavy" by adding 64kB size 
 field just another class allocation causes same error.

Having errors very much points at the problem happening *before* 
`new` is called. If it's not always failing in the same spot, 
that sounds a lot like memory corruption. And very often the 
corruption happens long before the explosion.

 Maybe somewhere after destroy() I successfully write something 
 into destroyed object field and this corrupts internal GC data? 
 I'll try to remove everything destroy() calls right now

First thing I would rule out is C memory being used to refer to 
GC objects. Focus on places where C memory is allocated, 
especially with things like callbacks + data pointer.
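
A minimal sketch of the "callbacks + data pointer" case, assuming a C library that stores an opaque user pointer in its own memory. `Listener`, `CRegistration`, `registerListener` and `unregisterListener` are hypothetical stand-ins; only `GC.addRoot`, `GC.removeRoot` and `GC.collect` are real druntime calls.

```d
import core.memory : GC;
import core.stdc.stdlib : free, malloc;

/// Hypothetical D object handed to a C library as opaque callback data.
final class Listener
{
    int hits;
}

/// Stand-in for a C-side registration record. It lives in malloc'ed
/// memory, which the GC does not scan.
struct CRegistration
{
    void* userData;
}

CRegistration* registerListener(Listener l)
{
    auto reg = cast(CRegistration*) malloc(CRegistration.sizeof);
    assert(reg !is null, "malloc failed");
    reg.userData = cast(void*) l;
    // Tell the GC that the object is referenced from memory it cannot see.
    GC.addRoot(reg.userData);
    return reg;
}

void unregisterListener(CRegistration* reg)
{
    GC.removeRoot(reg.userData);
    free(reg);
}

/// Shape of a typical C callback: it receives only the opaque pointer.
extern (C) void onEvent(void* userData)
{
    auto l = cast(Listener) userData;
    l.hits++;
}

void main()
{
    auto reg = registerListener(new Listener);
    GC.collect(); // the listener stays alive because it is a registered root
    onEvent(reg.userData);
    assert((cast(Listener) reg.userData).hits == 1);
    unregisterListener(reg);
}
```

Without the `GC.addRoot` call the only reference to the listener would sit in C memory, so a collection between registration and callback could free it.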

Another cause, as I mentioned above, is using a GC object to 
manage C memory, and then forgetting the GC object but 
remembering the C memory.

-Steve

May 12

Denis Feklushkin <feklushkin.denis gmail.com> writes:

First of all, I want to thank everyone for their help. And, yes - 
I forgot to check obvious things before I was deep into GC

On Monday, 12 May 2025 at 21:29:10 UTC, Steven Schveighoffer 
wrote:

 Having errors very much points at the problem happening 
 *before* `new` is called. If it's not always failing in the 
 same spot, that sounds a lot like memory corruption. And very 
 often the corruption happens long before the explosion.

 First thing I would rule out is C memory being used to refer to 
 GC objects. Focus on places where C memory is allocated, 
 especially with things like callbacks + data pointer.

 Another cause, as I mentioned above, is using a GC object to 
 manage C memory, and then forgetting the GC object but 
 remembering the C memory.

 -Steve

So far I have done two things in this direction:

1. I called GC.disable() at start of main()
2. destroy() was removed from the code

It seems like this should eliminate the possibility of use after 
free and of C code referring to D objects? Nothing has changed; 
the issue is still here.
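
For reference, a small sketch of what step 1 looks like in isolation. `collectionsDuring` is a hypothetical helper; it relies on `GC.profileStats().numCollections` to count collections. Note that the druntime documentation says `GC.disable()` only suppresses automatic collections and that a collection may still happen when the implementation considers it necessary, for example when memory runs out, so this is a diagnostic aid rather than a guarantee.

```d
import core.memory : GC;

/// Runs `work` with automatic collections disabled and returns how many
/// collections were performed in the meantime (hypothetical helper).
size_t collectionsDuring(scope void delegate() work)
{
    GC.disable();
    scope (exit) GC.enable(); // every disable() needs a matching enable()

    const before = GC.profileStats().numCollections;
    work();
    return GC.profileStats().numCollections - before;
}

void main()
{
    const n = collectionsDuring({
        foreach (i; 0 .. 1_000)
        {
            auto buf = new ubyte[](1024);
            buf[0] = 1;
        }
    });
    assert(n == 0);
}
```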

dustmite needs a lot of time - I launched it but I'm still waiting

May 13

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Tuesday, 13 May 2025 at 10:08:02 UTC, Denis Feklushkin wrote:

 It seems like this should eliminate probability of use after 
 freeing and referring from C to D objects?

More precisely, the possibility of damage to internal GC structures

May 13

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 13/05/2025 10:08 PM, Denis Feklushkin wrote:
 dustmite needs a lot of time - I launched it but I'm still waiting

What I normally do is help dustmite.

Run it for a little bit, dup the file system, remove some cycles or 
dependent usage of a variable, run it again.

It can't always break chains, so it needs a bit of help.

May 13

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Tuesday, 13 May 2025 at 10:14:17 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
 On 13/05/2025 10:08 PM, Denis Feklushkin wrote:
 dustmite needs a lot of time - I launched it but I'm still 
 waiting

 What I normally do is help dustmite.

 Run it for a little bit, dup the file system, remove some 
 cycles or dependent usage of a variable, run it again.

 It can't always break chains, so it needs a bit of help.

Yes, I already did some of this.

At first (for the whole night) I ran "dub dustmite", which (it 
seems) works only on the whole source tree including 
dependencies, and that was too much work for it.

May 13

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Tuesday, 13 May 2025 at 10:21:12 UTC, Denis Feklushkin wrote:

I added a simple `debug(PRINTF)` section right after the druntime 
allocator call. It raises an error if newly allocated memory 
intersects with already allocated internal bucket List 
structures. I hope I didn't make a mistake in this code?

```d
auto p = runLocked!(mallocNoSync, mallocTime, numMallocs)(size,
    bits, localAllocSize, ti);

debug(PRINTF)
{
     outer:
     foreach(List* firstList; gcx.bucket)
     {
         List* curr = firstList;
         while(curr !is null)
         {
             void* p_end = cast(ubyte*) p + localAllocSize;
             void* curr_end = cast(ubyte*) curr + List.sizeof;

             const bool notIntersects = ((p < curr && p_end < curr)
                 || (p > curr_end && p_end > curr_end));

             if(!notIntersects)
             {
                 printf("%p - allocated into bucket List value, located on %p: firstList.pool=%p curr.pool=%p\n",
                     p, curr, firstList.pool, curr.pool);

                 assert(false);
                 break outer;
             }

             curr = curr.next;
         }
     }
}
```

Druntime was built as a debug version with INVARIANT, MEMSTOMP 
and PRINTF enabled.

Then this snippet was used with the compiled druntime (do not 
forget to replace the path to the new druntime in ldc2.conf):
```d
/+ dub.sdl:
	name "issue"
+/
// How to run: dub run --single app.d

class C {}

void main()
{
     new C;
}
```

```
 dub run --single app.d --compiler=ldc2

     Starting Performing "debug" build using ldc2 for x86_64.
     Building issue ~master: building configuration [application]
      Linking issue
      Running issue
_d_newclass(ci = 0x56496398c350, app.C)
0x5649a1312c90.Gcx::addRange(0x564963985940, 0x564963994718)
GC::malloc(gcx = 0x5649a1312c90, size = 16 bits = 2, ti = app.C)
   => p = 0x7fa30e5cb000
0x7fa30e5cb000 - allocated into bucket List value, located on 
0x7fa30e5cb010: firstList.pool=0x5649a1313fa0 
curr.pool=0x5649a1313fa0
core.exception.AssertError core/internal/gc/impl/conservative/gc.d(505):
Assertion failure
----------------
core/runtime.d:831 [0x564963942d45]
core/lifetime.d:126 [0x56496394234c]
core/runtime.d:753 [0x564963942d0e]
core/runtime.d:773 [0x564963942640]
rt/dmain2.d:241 [0x564963920f30]
rt/deh.d:47 [0x564963949b9e]
rt/dwarfeh.d:347 [0x564963921ac2]
core/exception.d:569 [0x564963936a05]
core/exception.d:808 [0x564963936444]
core/internal/gc/impl/conservative/gc.d:505 [0x5649639502f3]
core/internal/gc/proxy.d:156 [0x56496393cf70]
core/internal/gc/impl/proto/gc.d:101 [0x5649639604fb]
core/internal/gc/proxy.d:156 [0x56496393cf70]
rt/lifetime.d:130 [0x5649639235fe]
app.d:10 [0x56496391a7af]
rt/dmain2.d:520 [0x56496392169c]
rt/dmain2.d:474 [0x5649639214b2]
rt/dmain2.d:520 [0x5649639215ba]
rt/dmain2.d:474 [0x5649639214b2]
rt/dmain2.d:545 [0x564963921372]
rt/dmain2.d:333 [0x564963921040]
/home/denizzz/ldc2_standalone/bin/../import/core/internal/entrypoint.d:42
[0x56496391a7f1]
??:? [0x7fa30e6f6ca7]
??:? __libc_start_main [0x7fa30e6f6d64]
??:? [0x56496391a6d0]
GC.fullCollect()
processing GC Marks, (nil)
rt_finalize2(p = 0x5649a1312c20)
Error Program exited with code 1
```

Am I making an obvious mistake somewhere?

May 13

kinke <noone nowhere.com> writes:

On Tuesday, 13 May 2025 at 18:30:34 UTC, Denis Feklushkin wrote:
 I hope I didn't make a mistake in this code?

The intersection logic is wrong, treating adjacency as 
intersection. Try this: `const bool intersects = (p_end > curr && 
p < curr_end)`.
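
The suggested test can be tried on its own. The sketch below uses a hypothetical free function `overlaps` and a made-up 48-byte layout; it treats both ranges as half-open, so a 16-byte allocation that ends exactly where a List node begins (as in the log above) is adjacent, not overlapping.

```d
/// True when the half-open byte ranges [aStart, aEnd) and [bStart, bEnd)
/// share at least one byte. Ranges that merely touch do not overlap.
bool overlaps(const(void)* aStart, const(void)* aEnd,
              const(void)* bStart, const(void)* bEnd) pure nothrow @nogc
{
    return aEnd > bStart && aStart < bEnd;
}

void main()
{
    ubyte[48] mem;
    const(ubyte)* base = mem.ptr;

    // Allocation [0, 16) directly followed by a node at [16, 32): adjacent.
    assert(!overlaps(base, base + 16, base + 16, base + 32));
    // One byte longer and the allocation really reaches into the node.
    assert(overlaps(base, base + 17, base + 16, base + 32));
    // Allocation placed right after the node: adjacent again.
    assert(!overlaps(base + 32, base + 48, base + 16, base + 32));
    // Allocation fully inside the node.
    assert(overlaps(base + 20, base + 24, base + 16, base + 32));
}
```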

May 13

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Tuesday, 13 May 2025 at 19:12:19 UTC, kinke wrote:
 On Tuesday, 13 May 2025 at 18:30:34 UTC, Denis Feklushkin wrote:
 I hope I didn't make a mistake in this code?

 The intersection logic is wrong, treating adjacency as 
 intersection. Try this: `const bool intersects = (p_end > curr 
 && p < curr_end)`.

I changed it to this and everything worked out:

((p < curr && p_end <= curr) || (p >= curr_end && p_end >= 
curr_end));

It seems to be correct: both borders of p should lie on the same 
side of the curr range
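
Whether this form agrees with the shorter `intersects` expression can be checked by brute force. The sketch below models addresses as plain integers (an assumption made only to keep the search small) and compares the two predicates for every pair of non-empty ranges below a limit; `equivalentUpTo` is a hypothetical helper name.

```d
/// The shorter form: [p, pEnd) and [curr, currEnd) share at least one byte.
bool intersects(size_t p, size_t pEnd, size_t curr, size_t currEnd)
{
    return pEnd > curr && p < currEnd;
}

/// The revised check from the post: both borders of p are on the same side.
bool notIntersects(size_t p, size_t pEnd, size_t curr, size_t currEnd)
{
    return (p < curr && pEnd <= curr) || (p >= currEnd && pEnd >= currEnd);
}

/// Compares the two predicates for all non-empty ranges with borders
/// in 0 .. limit. Returns false as soon as they fail to be opposites.
bool equivalentUpTo(size_t limit)
{
    foreach (p; 0 .. limit)
    foreach (pEnd; p + 1 .. limit + 1)
    foreach (curr; 0 .. limit)
    foreach (currEnd; curr + 1 .. limit + 1)
        if (intersects(p, pEnd, curr, currEnd)
            == notIntersects(p, pEnd, curr, currEnd))
            return false;
    return true;
}

void main()
{
    assert(equivalentUpTo(12));
    // An empty range located exactly at curr is the one case where the two
    // forms disagree: neither "intersects" nor "notIntersects" holds.
    assert(!intersects(4, 4, 4, 8) && !notIntersects(4, 4, 4, 8));
}
```

So for real allocations, which always have a non-zero size, the two formulations are interchangeable; the shorter one is simply easier to read.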

May 13

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Wednesday, 14 May 2025 at 06:54:42 UTC, Denis Feklushkin wrote:

 ((p < curr && p_end <= curr) || (p >= curr_end && p_end >= 
 curr_end));

The idea of such a check failed because List pointers to other 
lists are sometimes overwritten by garbage, and the issue just 
moves to the curr.next access.

I also see that there are no TLS sections of any kind in 
libvulkan.so.

But I don't understand why malloc() can give intersecting 
allocations in this case. Any ideas?

May 14

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Wednesday, 14 May 2025 at 08:00:06 UTC, Denis Feklushkin wrote:

 But I don't understand why malloc() can give intersecting 
 allocations in this case. Any ideas?

The malloc function could either be thread-safe or thread-unsafe. 
Both are not reentrant:

Malloc operates on a global heap, and it's possible that two 
different invocations of malloc that happen at the same time, 
return the same memory block. (The 2nd malloc call should happen 
before an address of the chunk is fetched, but the chunk is not 
marked as unavailable). This violates the postcondition of 
malloc, so this implementation would not be re-entrant.

https://stackoverflow.com/a/3941563

Okay, I think the question can be considered closed

May 14

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Wednesday, 14 May 2025 at 09:11:13 UTC, Denis Feklushkin wrote:

 Okay, I think the question can be considered closed


However, I am still on this issue! :(
```
ERROR: AddressSanitizer: SEGV on unknown address 0x000100000006
```
I tried all 4 available TLS models: global-dynamic, 
local-dynamic, initial-exec, local-exec. But I didn't build 
druntime with these models - only the resulting binary.

Valgrind says that the memory block returned by malloc() has 
never been allocated dynamically:
```
$ valgrind --tool=memcheck ./pukan
[...]
==1218062==  Address 0x100000006 is not stack'd, malloc'd or 
(recently) free'd
```

`$fs_base` (a-la TLS pointer reported by GDB) is 
0x00007ffff7a50b40. And all other allocated and used values of my 
D code lie near this value.

I also found that the problem with access to the 0x100000006 
pointer is quite common. And, it seems, it is always 
thread-related:

https://bbs.archlinux.org/viewtopic.php?id=210363 - here I 
couldn't find out how they solved the problem
https://github.com/gluster/glusterfs/issues/2971 - crashed while 
`ltcmalloc` library init/fini related functions are called in two 
different threads during a library loaded/unloaded. The process is 
getting crashed during access of tls variables in heap profiler 
api

Adding fuel to the fire is the fact that the same `vulkan` 
library works for me without (any known) problems [in Danny 
Arends' project](https://github.com/DannyArends/DImGui), which 
uses SDL2 instead of glfw. The `vulkan` library itself is loaded 
in the same way in both projects - by linking during the build 
process.

May 16

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Friday, 16 May 2025 at 10:42:36 UTC, Denis Feklushkin wrote:
 On Wednesday, 14 May 2025 at 09:11:13 UTC, Denis Feklushkin 
 wrote:

 Okay, I think the question can be considered closed


 However, I am still on this issue! :(

I still think this may be a druntime issue. And it's probably 
not about TLS.

I discovered the [rr](https://github.com/rr-debugger/rr) tool, 
which allows one to quickly record an execution and replay it 
repeatably in gdb (gdb's built-in recording works very slowly). 
So now there's no need to run gdb many times and carefully 
examine everything. `rr` is available in Debian, but that version 
doesn't work with my code (some kind of tick counting error, 
seemingly because of the video driver used), while a 
self-compiled one works fine.

So, after replaying and rewinding a few times I clearly see the 
following.

I made sure that malloc switches to separate "arenas" as soon as 
threads appear - this mechanism is built into glibc and is 
enabled automatically when a second pthread is created.

I also tried replacing the `free(void*)` symbol with my own empty 
stub to make sure that nothing was ever freed and nobody got an 
in-use piece of memory again. It didn't help.

The Vulkan library quite legitimately allocates some memory for 
its needs and uses it, and this memory contains the piece of 
memory where the issue occurs. I don't know why Valgrind answered 
(evasively) that this memory had not been allocated before.

Next, when executing on the D side, the GC's pool of small 
allocations (of size 32) is exhausted. And then some magic 
happens in the gc.d code using recoverPool near 
`SmallObjectPool.allocPage()`, which I do not fully understand. 
(Obviously, this is necessary to reuse memory that was 
previously allocated.)

As a result, a new `List` is formed without a `malloc()` call. 
This list contains a pointer to some pool. Apparently, this 
memory is taken from a previously used pool. But at the same 
time, the memory that this pointer points to looks as if it has 
never been touched by any D code. I haven't figured out why this 
is so yet. Perhaps there is some error in calculating pointers.

Also, inside `allocPage`, the execution flow gets to the 
line:
```
void* p = baseAddr + pn * PAGESIZE;
```
but at the same time baseAddr == 0xf0f0f0f0f0f0f0f0 (the result 
of MEMSTOMP)

May 16

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Friday, 16 May 2025 at 20:26:44 UTC, Denis Feklushkin wrote:

 where the issue occurs. I don't know why Valgrind answered 
 (evasively) that this memory had not been allocated before.

Because it's not an address, it's just some internal data 
accidentally saved into `List.pool` by `Vulkan`

May 16

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Friday, 16 May 2025 at 20:26:44 UTC, Denis Feklushkin wrote:
 As a result, a new `List` is formed without `malloc()` call. 
 This list contains a pointer to the some pool. Apparently, this 
 memory is taken from a previously used pool. But at the same 
 time, the memory that this pointer points to looks as has never 
 been touched by any D code.

...and the Vulkan library writes into this memory as if it were 
its own allocated memory

May 16

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Friday, 16 May 2025 at 22:55:33 UTC, Denis Feklushkin wrote:
 On Friday, 16 May 2025 at 20:26:44 UTC, Denis Feklushkin wrote:
 As a result, a new `List` is formed without `malloc()` call. 
 This list contains a pointer to the some pool. Apparently, 
 this memory is taken from a previously used pool. But at the 
 same time, the memory that this pointer points to looks as has 
 never been touched by any D code.

 ...and into this memory writes Vulkan library, as into its own 
 allocated memory

I'm really tired of researching this issue. Maybe someone else 
is also interested?

Just made a branch with latest dirty debug changes:
```
git clone --branch=manual_reduce git@github.com:denizzzka/pukan.git
```
commit: 34dff13e76bb6ffbe9053eb8cad8f8f33a850b94

May 17

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Saturday, 17 May 2025 at 20:31:22 UTC, Denis Feklushkin wrote:

 I'm really tired of researching this issue. Maybe someone else 
 also interested?

 Just made a branch with latest dirty debug changes:
 ```
 git clone --branch=manual_reduce git@github.com:denizzzka/pukan.git
 ```
 commit: 34dff13e76bb6ffbe9053eb8cad8f8f33a850b94

I managed to reduce the GC calls to several thousand (yes!) 
small `GC.malloc()`/`GC.free()` calls and got rid of the 
third-party libraries (vulkan, etc). Actually, I just recorded 
all allocations/deallocations that my D code makes and then 
trimmed them a bit while the error was still reproducible. I hope 
that this is not a problem in the approach itself.

Sample now looks like one file ([ZIP archive 
link](https://github.com/denizzzka/pukan/raw/f3bcf6a22201eac2092c9e08ef9f01176e10d25d/issue_sample.zip)):
```d
/+ dub.sdl:
	name "issue"
+/
// How to run: dub run --single code.d

import core.memory: GC;

auto gc_malloc(T...)(T a)
{
     auto r = GC.malloc(a);
     assert(r !is null);
     return r;
}

auto gc_free(T...)(T a) => GC.free(a);

void main() {

version(linux)
version(DigitalMars)
{
     import etc.linux.memoryerror;
     registerMemoryAssertHandler();
}

void* ptr_0x7f5b360f3008 = gc_malloc(72, 0x1);
void* ptr_0x7f5b360f4008 = gc_malloc(8, 0x0);
void* ptr_0x7f5b360f5008 = gc_malloc(24, 0xa);
[...]
void* ptr_0x7f5b3611b968 = gc_malloc(12, 0x0);
void* ptr_0x7f5b3611b988 = gc_malloc(12, 0x0);

}

```

After compiling with DMD v2.111.0, execution returns:
```
 dub run --single code.d --compiler=dmd

     Starting Performing "debug" build using dmd for x86_64.
     Building issue ~master: building configuration [application]
      Linking issue
      Running issue
core.exception.AssertError@/usr/include/dmd/druntime/import/etc/linux/memoryerror.d(415): segmentation fault: null pointer read/write operation
----------------
??:? _d_assert_msg [0x55f779816710]
/usr/include/dmd/druntime/import/etc/linux/memoryerror.d:415 extern (C) nothrow @nogc void etc.linux.memoryerror.registerMemoryAssertHandler!().registerMemoryAssertHandler()._d_handleSignalAssert(int, core.sys.posix.signal.siginfo_t*, void*) [0x55f7798165f3]
??:? [0x7fdfed618def]
??:? rt_finalize2 [0x55f77981d75b]
??:? rt_finalizeFromGC [0x55f7798486ba]
??:? nothrow ulong 
core.internal.gc.impl.conservative.gc.Gcx.sweep() [0x55f77983e478]
??:? nothrow ulong 
core.internal.gc.impl.conservative.gc.Gcx.fullcollect(bool, bool) 
[0x55f77983f5a5]
??:? nothrow ulong core.internal.gc.impl.conservative.gc.ConservativeGC.runLocked!(core.internal.gc.impl.conservative.gc.ConservativeGC.fullCollect().go(core.internal.gc.impl.conservative.gc.Gcx*), core.internal.gc.impl.conservative.gc.Gcx*).runLocked(ref core.internal.gc.impl.conservative.gc.Gcx*) [0x55f7798442e2]
??:? nothrow ulong 
core.internal.gc.impl.conservative.gc.ConservativeGC.fullCollect()
[0x55f77983ba9f]
??:? nothrow void 
core.internal.gc.impl.conservative.gc.ConservativeGC.collect() 
[0x55f77983ba7d]
??:? gc_term [0x55f7798280c7]
??:? rt_term [0x55f77981d002]
??:? void rt.dmain2._d_run_main2(char[][], ulong, extern (C) int 
function(char[][])*).runAll() [0x55f779816d60]
??:? void rt.dmain2._d_run_main2(char[][], ulong, extern (C) int 
function(char[][])*).tryExec(scope void delegate()) 
[0x55f779816c49]
??:? _d_run_main2 [0x55f779816bb2]
??:? _d_run_main [0x55f77981699b]
/usr/include/dmd/druntime/import/core/internal/entrypoint.d:29 
main [0x55f779816485]
??:? [0x7fdfed602ca7]
??:? __libc_start_main [0x7fdfed602d64]
??:? _start [0x55f779801670]
Error Program exited with code 1
```

May 18

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Sunday, 18 May 2025 at 12:32:23 UTC, Denis Feklushkin wrote:
 On Saturday, 17 May 2025 at 20:31:22 UTC, Denis Feklushkin 
 wrote:

 I'm really tired of researching this issue. Maybe someone else 
 also interested?

 Just made a branch with latest dirty debug changes:
 ```
 git clone --branch=manual_reduce git@github.com:denizzzka/pukan.git
 ```
 commit: 34dff13e76bb6ffbe9053eb8cad8f8f33a850b94

 I managed to reduce the GC calls to several thousands (yes!) 
 small `GC.malloc()`/`GC.free()` calls and get rid of 
 third-party libraries (vulkan, etc).

This was the wrong approach. The SIGSEGV is caused by FINALIZE 
attr bits on some of the GC.malloc() calls without class info 
actually being specified


That's it, I have no other ideas

May 18

Steven Schveighoffer <schveiguy gmail.com> writes:

On Sunday, 18 May 2025 at 15:49:21 UTC, Denis Feklushkin wrote:
 This was the wrong approach. The SIGSEGV is caused by FINALIZE 
 attr bits on some of the GC.malloc() calls without class info 
 actually being specified

Oof, yes. The gc_malloc calls with the FINALIZE bit set need 
either a class object to be filled in, or a struct finalizer 
supplied via the TypeInfo (you are not supplying any to the 
calls).
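
One way to replay recorded allocations without tripping over this is to mask the finalizer-related bits before calling `GC.malloc`. The following is only a sketch: `replayAttrs` and `replayMalloc` are hypothetical helpers, and dropping the bits changes what the replay exercises (no finalizers run), so it may also hide the original bug.

```d
import core.memory : GC;

/// Attribute bits that make the GC run a finalizer when a block is freed.
enum uint finalizerBits = GC.BlkAttr.FINALIZE | GC.BlkAttr.STRUCTFINAL;

/// Strips the finalizer-related bits from recorded attributes, so that a
/// raw block can be replayed without a class instance or a TypeInfo.
uint replayAttrs(uint recorded) pure nothrow @nogc @safe
{
    return recorded & ~finalizerBits;
}

/// Hypothetical replay helper for one recorded allocation.
void* replayMalloc(size_t size, uint recordedAttrs)
{
    auto p = GC.malloc(size, replayAttrs(recordedAttrs));
    assert(p !is null);
    return p;
}

void main()
{
    assert(replayAttrs(0x1) == 0);   // FINALIZE alone is dropped
    assert(replayAttrs(0xa) == 0xa); // NO_SCAN | APPENDABLE is kept

    auto p = replayMalloc(72, 0x1);
    assert((GC.getAttr(p) & GC.BlkAttr.FINALIZE) == 0);
    GC.free(p);
}
```

In the reduced sample, calls such as `gc_malloc(72, 0x1)` carry the FINALIZE bit, so at collection time the GC tries to finalize a block that does not contain a valid object, which is consistent with the crash inside `rt_finalize2` in the trace.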

 That's it, I have no other ideas

Memory problems suck. Finding out why something did something 
after the fact is nearly impossible.

In all my experience with the GC, and I've had a lot over the 
last year, these problems are extremely difficult to find.

Please send me an email, maybe we can do some kind of session to 
try and find the problems. I have very good current knowledge of 
the GC, but I'm not going to be able to understand your program 
without help.

-Steve

May 18

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Monday, 12 May 2025 at 21:29:10 UTC, Steven Schveighoffer 
wrote:

 Yes, of course I understand perfectly well. And it seems to me 
 that I am not doing anything "reprehensible".

 The "reprehensible" thing that almost always causes GC issues 
 is use after free because you are interacting with C memory.

I just added `GC.collect()` before the lines that caused the 
SIGSEGV and all was fixed. Is that what you meant?

If so, I don't understand the nature of this error.

I feel uncomfortable about all this: if it fixes the problem, 
then why? If it doesn't, then there must be a bug somewhere that 
is causing it and collect() just masks it.

[Commit](https://github.com/denizzzka/pukan/commit/a60d0487ba8c2a4ec5a7f
8394fc5cd01753e17b) that fixes(or not?) issue

May 18

Steven Schveighoffer <schveiguy gmail.com> writes:

On Sunday, 18 May 2025 at 19:10:18 UTC, Denis Feklushkin wrote:
 On Monday, 12 May 2025 at 21:29:10 UTC, Steven Schveighoffer 
 wrote:

 Yes, of course I understand perfectly well. And it seems to 
 me that I am not doing anything "reprehensible".

 The "reprehensible" thing that almost always causes GC issues 
 is use after free because you are interacting with C memory.

 I just added `GC.collect()` before the lines that caused the 
 SIGSEGV and all was fixed. Is that what you meant?

No, I mean that almost always a GC problem is caused by using 
memory that the GC cannot see.

So things are collected before they are unreferenced.

However, GC.disable at the start should fix it, and you've said 
that doesn't. So that sounds more like a straight buffer overflow 
or other issue.
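
"Memory that the GC cannot see" usually means a block obtained from the C allocator that holds references to GC objects. A minimal sketch of making such a block visible with `GC.addRange`, assuming a hypothetical `Node` class and helper `nodesSurviveCollection`:

```d
import core.memory : GC;
import core.stdc.stdlib : calloc, free;

final class Node
{
    int value;
    this(int v) { value = v; }
}

/// Stores references to GC objects in a C-allocated table, forces a
/// collection, and checks that the objects are still intact.
bool nodesSurviveCollection(int count)
{
    // Node.sizeof is the size of a class reference, not of the instance.
    auto table = cast(Node*) calloc(count, Node.sizeof);
    assert(table !is null, "calloc failed");

    // Without this call the GC would not scan the table and could free
    // the nodes while the table still points at them.
    GC.addRange(table, count * Node.sizeof);
    scope (exit)
    {
        GC.removeRange(table);
        free(table);
    }

    foreach (i; 0 .. count)
        table[i] = new Node(i);

    GC.collect();

    foreach (i; 0 .. count)
        if (table[i].value != i)
            return false;
    return true;
}

void main()
{
    assert(nodesSurviveCollection(8));
}
```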

 If so, I don't understand the nature of this error

 I feel uncomfortable about all this: if it fixes the problem, 
 then why? If it doesn't, then there must be a bug somewhere that 
 is causing it and collect() just masks it.

I would guess it is the latter.

-Steve

May 18

Ali Çehreli <acehreli yahoo.com> writes:

On 5/12/25 8:31 AM, Denis Feklushkin wrote:

 Vulkan API is used and it implicitly creates threads.

Do those threads call back to D code that allocate from the GC? If so, 
the GC must be aware of the threads to be able to suspend them during a 
collection.

I had to call thread_attachThis() to do that in a past project:

   https://dlang.org/library/core/thread/osthread/thread_attach_this.html

However, it was not clear whether or when to make a corresponding call 
to thread_detachThis(). If Vulkan threads disappear on their own, your 
only chance for a call to thread_detachThis() may be right before 
returning from your D callback function.
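
A rough sketch of that attach/detach pattern, assuming a POSIX system and using a raw pthread as a stand-in for a thread created by a C library. `foreignThread` is a hypothetical callback; `thread_attachThis` and `thread_detachThis` are the druntime functions mentioned above.

```d
import core.sys.posix.pthread;
import core.thread : thread_attachThis, thread_detachThis;

/// Entry point of a thread that druntime did not create. It registers
/// itself with the runtime before touching the GC and unregisters on exit.
extern (C) void* foreignThread(void* arg)
{
    thread_attachThis();
    scope (exit) thread_detachThis();

    auto data = new int[](16); // GC allocation made on the foreign thread
    data[] = 42;
    *cast(int*) arg = data[0];
    return null;
}

void main()
{
    int result;
    pthread_t tid;
    const rc = pthread_create(&tid, null, &foreignThread, &result);
    assert(rc == 0, "pthread_create failed");
    pthread_join(tid, null);
    assert(result == 42);
}
```

Detaching right before the callback returns follows the suggestion in the post; whether that is the right moment for a long-lived library thread depends on how often the library calls back on the same thread.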

Ali

May 13

Denis Feklushkin <feklushkin.denis gmail.com> writes:

On Wednesday, 14 May 2025 at 04:26:08 UTC, Ali Çehreli wrote:
 On 5/12/25 8:31 AM, Denis Feklushkin wrote:

 Vulkan API is used and it implicitly creates threads.

 Do those threads call back to D code that allocate from the GC? 
 If so, the GC must be aware of the threads to be able to 
 suspend them during a collection.

There is no such thing in my code (it is possible with Vulkan, 
but I removed this code from the test build).

But I am almost sure that the problem is in the Vulkan lib: when 
a Vulkan VkDevice object is created, about 30 threads are 
implicitly created by the Vulkan library and something goes wrong

May 13
