
digitalmars.D - Problem with GC and address/leak sanitizer

reply Luís Marques <luis luismarques.eu> writes:
I have a program where the GC *seems* to be overwriting memory 
still in use and corrupting data.

Here's the code. It's massively reduced from the original 
program. It's hard to reduce it further because minor changes can 
prevent the problem from triggering. I'll explain below the 
important parts.

```d
import std.stdio;

struct S {
     int check;
     S* next;
     int[4] data;
}

int main(string[] args) {
     void*[] allocs;
     enum bad_iter = 268;
     for (int n = 0; n < bad_iter+1; n++) {
         allocs.length = 0;
         auto x = "                   ";
         x ~= ' ';

         int[10][] ts;
         for(int i = 0; i < 21; i++) {
             ts.length++;
         }

         S head;
         S* s = &head;
         if (n == bad_iter) {
             n = bad_iter; // convenient line to set a breakpoint only for the last iteration
         }
         for(int i = 0; i < 8; i++) {
             auto ns = new S;
             ns.check = 1; // set test value here
             s.next = ns;
             s = ns;
         }
         s = head.next; // get the first S allocated this iteration
         if (s.check != 1) { // check test value here
             writefln("check=%d", s.check);
             return -1;
         }

         new int[10];
         allocs ~= null;
         new size_t[3];
     }
     return 0;
}
```

The important part is the following. On each iteration we create 
8 instances of S. For each S value, we set its `check` field to 
1. Then we check the value of that field (for the first instance 
of S). When compiled with the address sanitizer, we observe it's 
been corrupted and it's no longer 1.

Am I doing something incorrectly in the code? AFAIK I'm 
respecting the rules required by the GC. Maybe there's a silly 
bug I overlooked?

Tested with LDC 1.40.0 on x86_64 Linux:


```
$ ldc2 app.d -fsanitize=address -g --frame-pointer=all && ./app
BUG
check=-337690816
$
```

By setting a watchpoint on the address of the field, I see that 
the code that writes to `check` is part of the GC implementation. 
Here's the backtrace:

```


libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw3Gcx15recoverNextPageMFNbEQCmQC
QCeQCeQCcQCn4BinsZb + 348

libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw3Gcx10smallAllocMF
bmKmkxC8TypeInfoZPv + 776

libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCsQCqQCkQCkQCiQCtQBy12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQFaQEyQEsQEsQEqQFb10mallocTimelS_DQGiQGgQGaQGaQFyQGj10numMallocslTmTkTmTxQDlZQF
MFNbKmKkKmKxQEeZQDx + 89

libdruntime-ldc-shared.so.110`_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZ
QDd6memory8BlkInfo_ + 83

libdruntime-ldc-shared.so.110`gc_qalloc + 28

app`_D4core8lifetime__T11_d_newitemTTS3app1SZQwFNaNbNeZPQt at 
lifetime.d:2837:5

0x00007fffffffe438) at app.d:28:13

libdruntime-ldc-shared.so.110`_D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv
+ 77

libdruntime-ldc-shared.so.110`_d_run_main2 + 407

libdruntime-ldc-shared.so.110`_d_run_main + 141

argv=0x00007fffffffe728) at entrypoint.d:42:17

libc.so.6`__libc_start_call_main(main=(app`main at 
entrypoint.d:39), argc=1, argv=0x00007fffffffe728) at 
libc_start_call_main.h:58:16

libc.so.6`__libc_start_main_impl(main=(app`main at 
entrypoint.d:39), argc=1, argv=0x00007fffffffe728, 
init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, 
stack_end=0x00007fffffffe718) at libc-start.c:360:3

```

There is a subsequent write to that memory location in the leak 
sanitizer and LSan complains:

`==4056526==LeakSanitizer has encountered a fatal error.`  
(though usually this message isn't flushed)

I assume the original problem was caused by the GC and ASan/LSan 
are just subsequent victims, but it's hard to be sure. 
Apparently, LSan is automatically enabled for Linux when ASan is 
used. Although the ASan documentation says that LSan "can be 
enabled using `ASAN_OPTIONS=detect_leaks=1` on macOS", setting 
that to 0 didn't seem to disable it, so I couldn't test with ASan 
but not LSan.

Any ideas of what might be going on?
Feb 15
next sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On Saturday, 15 February 2025 at 23:31:42 UTC, Luís Marques wrote:
 I have a program where the GC *seems* to be overwriting memory 
 still in use and corrupting data.
Do I understand that this corruption is happening only with address sanitizer turned on?

I don't see any red flags here, though I'm assuming a lot of these weird random things you are doing (like appending a space to a string every loop) are essential to making the thing fail? It's possible these are tickling GC patterns that cause problems, or it's possible it's tickling bugs in code generation that might prevent the GC from seeing memory!

Writing to the "check" field might be because a gc cycle ran, and that item was incorrectly collected, and now the gc is writing to it because it thinks that data is fair game to use. The writing is probably not the problem, the problem is the previous collection of that data.

I have learned a lot of tricks when implementing the new GC, and when faced with problems like this, it's super-difficult to figure out how to properly find the problem. One technique I used is to fork after scanning, but before collection, and put that forked process to sleep. If the failure happens, then I can gdb into the forked process and see what state the GC was in, including the entire graph of memory, and I could see how a piece of memory is or isn't referenced. This is a tedious process, and requires a lot of knowledge and patience.

If this is indeed a problem with the GC, it's going to be tough to track down. If it's a problem with the codegen, then probably also difficult, but this function is small enough, that maybe someone can look at the assembly and verify that it's doing the right thing? I don't know.
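
For readers unfamiliar with the fork trick described above, here is a minimal sketch of the idea in D. It is not the instrumentation actually used while developing the new GC: `parkSnapshot` is a hypothetical helper name, and the place where it would be called inside the collector (after marking, before sweeping) is an assumption taken from the description in this post.

```d
import core.stdc.stdio : fprintf, stderr;
import core.sys.posix.signal : kill, SIGKILL;
import core.sys.posix.sys.types : pid_t;
import core.sys.posix.sys.wait : waitpid;
import core.sys.posix.unistd : fork, getpid, pause;

/// Hypothetical debugging helper: fork the process and park the child so that
/// its memory stays frozen as a snapshot of the moment of the call.
/// Returns the child's pid in the parent, or -1 if fork failed.
pid_t parkSnapshot()
{
    const pid_t pid = fork();
    if (pid == 0)
    {
        // Child: announce ourselves, then sleep until a debugger attaches
        // or the process is killed.
        fprintf(stderr, "snapshot parked, attach with: gdb -p %d\n", getpid());
        for (;;)
            pause();
    }
    return pid; // parent continues normally
}

void main()
{
    const child = parkSnapshot();
    assert(child > 0, "fork failed");

    // A real session would keep running here until the corruption shows up,
    // then inspect `child` with a debugger. This demo just cleans up.
    kill(child, SIGKILL);
    int status;
    waitpid(child, &status, 0);
}
```

After `fork`, the child contains only the calling thread and a copy-on-write image of the parent's memory, so the parked process can be examined at leisure without being disturbed by the parent's later allocations.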
Feb 15
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Is this an issue with the new GC, or the old one?
Feb 17
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
 Is this an issue with the new GC, or the old one?
Old gc -Steve
Feb 17
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 2/17/2025 8:55 PM, Steven Schveighoffer wrote:
 On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
 Is this an issue with the new GC, or the old one?
Old gc
I'm a little surprised to see a problem crop up after 25 years of continuous use? Perhaps try an older release of the compiler?
Feb 17
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 18/02/2025 6:22 PM, Walter Bright wrote:
 On 2/17/2025 8:55 PM, Steven Schveighoffer wrote:
 On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
 Is this an issue with the new GC, or the old one?
Old gc
I'm a little surprised to see a problem crop up after 25 years of continuous use? Perhaps try an older release of the compiler?
This has already been solved, see Johan's comments. ASAN with a fake stack isn't operating properly.
Feb 17
parent Walter Bright <newshound2 digitalmars.com> writes:
On 2/17/2025 9:26 PM, Richard (Rikki) Andrew Cattermole wrote:
 This has already been solved, see Johan's comments.
Excellent! Carry on...
Feb 19
prev sibling parent reply Johan <j j.nl> writes:
On Saturday, 15 February 2025 at 23:31:42 UTC, Luís Marques wrote:
 Tested with LDC 1.40.0 on x86_64 Linux:


 ```

 $ ldc2 app.d -fsanitize=address -g --frame-pointer=all && ./app 

 check=-337690816
 $
 ```
Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that FakeStack is not enabled? (`detect_stack_use_after_return=false`)

(And also test with a slightly older LDC, with a different LLVM version, to see if it is a new issue or not.)

-Johan
Feb 16
parent reply Luís Marques <luis luismarques.eu> writes:
On Sunday, 16 February 2025 at 20:18:18 UTC, Johan wrote:
 Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that 
 FakeStack is not enabled?  
 (`detect_stack_use_after_return=false`)
The fake stack allocator is enabled. If I disable it via `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no longer reproduces.

According to [1], integrating Fake Stack with GC requires special consideration. What's the status of ASan / fake stack support in LDC? (was it supposed to work, to be disabled by default, etc. ...?)

Thanks!

[1] https://github.com/google/sanitizers/wiki/AddressSanitizerUseAfterReturn#garbage-collection
Feb 16
next sibling parent Luís Marques <luis luismarques.eu> writes:
On Sunday, 16 February 2025 at 21:48:31 UTC, Luís Marques wrote:
 The fake stack allocator is enabled. If I disable it via 
 `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no 
 longer reproduces.
I meant =0, of course.
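
Besides the environment variable, ASan also reads default options from a user-defined C function named `__asan_default_options`. The sketch below shows how that hook could be defined from D to switch the fake stack off by default. It has not been verified against LDC in this thread; the D-side name `asanDefaultOptions` is arbitrary, and whether the function needs to be excluded from instrumentation is left open.

```d
import std.string : fromStringz;

// ASan calls the C symbol `__asan_default_options` early at startup, if the
// program defines it. pragma(mangle) gives the D function that symbol name.
pragma(mangle, "__asan_default_options")
extern (C) const(char)* asanDefaultOptions() nothrow @nogc
{
    // String literals in D are zero-terminated, so this is a valid C string.
    return "detect_stack_use_after_return=0";
}

void main()
{
    // Sanity check that the hook returns the intended option string.
    assert(fromStringz(asanDefaultOptions()) == "detect_stack_use_after_return=0");
}
```

Values given in the `ASAN_OPTIONS` environment variable still take precedence over these defaults, so the workaround can be overridden per run.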
Feb 16
prev sibling parent reply Johan <j j.nl> writes:
On Sunday, 16 February 2025 at 21:48:31 UTC, Luís Marques wrote:
 On Sunday, 16 February 2025 at 20:18:18 UTC, Johan wrote:
 Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that 
 FakeStack is not enabled?  
 (`detect_stack_use_after_return=false`)
The fake stack allocator is enabled. If I disable it via `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no longer reproduces. According to [1], integrating Fake Stack with GC requires special consideration. What's the status of ASan / fake stack support in LDC? (was it supposed to work, to be disabled by default, etc. ...?)
FakeStack allocates (!) space for stack variables, and points to that "fake stack" memory with a pointer in actual CPU stack memory. This means that the stack variables are now no longer in memory that is scanned by the GC.

The fix for that, of course, is to include all FakeStacks in the GC scanning [1a][1b]. This used to work, but somehow does not work anymore since LDC 2.100 (I perhaps have forgotten about this and just noticed it). [2]

You are very welcome to help investigate why it is no longer working! [3] is an interesting test case of how code should work.

-Johan

[1a] https://github.com/ldc-developers/druntime/compare/d6b328be91db63aff979f584b0d1def0f746d730...1d938e0b7f668b099f9fa694b135c82ef13dec59
[1b] https://github.com/ldc-developers/ldc/pull/3888
[2] https://github.com/ldc-developers/ldc/blob/d3f065816ec7d420f370e4c95c6000eb78187e25/tests/sanitizers/asan_fakestack_GC.d#L3
[3] https://github.com/llvm/llvm-project/blob/main/compiler-rt/test/asan/TestCases/Posix/gc-test.cpp
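
The effect described in this post can be illustrated with a toy model. This is not druntime or ASan code: addresses are plain numbers, `collectRoots` and `survives` are made-up names, and the "GC" only checks whether an object's address appears among the scanned words. In the real runtimes, the fake frames are located through ASan's interface functions `__asan_get_current_fake_stack` and `__asan_addr_is_in_fake_stack`; here a simple associative array stands in for them.

```d
import std.algorithm.searching : canFind;

alias Addr = size_t;

/// Collect the words a conservative scan would treat as roots.
/// `fakeFrames` maps the address of each fake frame to the words stored in it.
Addr[] collectRoots(const Addr[] cpuStack,
                    const Addr[][Addr] fakeFrames,
                    bool scanFakeFrames)
{
    Addr[] roots = cpuStack.dup;
    if (scanFakeFrames)
    {
        foreach (word; cpuStack)
            if (auto frame = word in fakeFrames) // word points at a fake frame
                roots ~= (*frame).dup;           // so scan that frame as well
    }
    return roots;
}

/// In this model an object survives only if some root word equals its address.
bool survives(Addr object, const(Addr)[] roots)
{
    return roots.canFind(object);
}

void main()
{
    enum Addr firstS = 0x1000;    // GC-allocated S, referenced by `head.next`
    enum Addr fakeFrame = 0x9000; // fake frame that holds the local `head`

    Addr[] cpuStack = [fakeFrame]; // the CPU stack only points at the frame
    Addr[][Addr] fakeFrames;
    fakeFrames[fakeFrame] = [firstS];

    // Scanning only the CPU stack misses the object: it would be collected.
    assert(!survives(firstS, collectRoots(cpuStack, fakeFrames, false)));
    // Scanning the fake frame too keeps it alive.
    assert(survives(firstS, collectRoots(cpuStack, fakeFrames, true)));
}
```

This mirrors the reduced program at the top of the thread, where `head` is a local and `head.next` is the only reference to the first `S` allocated in each iteration.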
Feb 16
parent reply Luís Marques <luis luismarques.eu> writes:
On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
 You are very welcome to help investigate why it is no longer 
 working!
Sure, I'll have a look. Thanks.
Feb 16
parent reply Luís Marques <luis luismarques.eu> writes:
On Sunday, 16 February 2025 at 22:40:58 UTC, Luís Marques wrote:
 On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
 This used to work, but somehow does not work anymore since LDC 
 2.100 (I perhaps have forgotten about this and just noticed 
 it). [2]

 You are very welcome to help investigate why it is no longer 
 working!
Sure, I'll have a look. Thanks.
I don't think this broke with D 2.100. For instance, LDC 1.29.0 is based on 2.099.1 and exhibits the same problem. Even older LDC versions don't trip on this exact program but they do output AddressSanitizer CHECK failures.

```
==4108825==AddressSanitizer CHECK failed: /home/vsts/work/1/s/compiler-rt/lib/sanitizer_common/sanitizer_linux_libcdep.cpp:556 "((*tls_addr + *tls_size)) <= ((*stk_addr + *stk_size))" (0x78927e371080, 0x78927e371000)
```

I'll need some time to dig through the IR, the GC, etc. If you are going to look at this, please let me know, to avoid duplicate efforts.
Feb 17
parent Johan <j j.nl> writes:
On Monday, 17 February 2025 at 21:56:29 UTC, Luís Marques wrote:
 On Sunday, 16 February 2025 at 22:40:58 UTC, Luís Marques wrote:
 On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
 This used to work, but somehow does not work anymore since 
 LDC 2.100 (I perhaps have forgotten about this and just 
 noticed it). [2]

 You are very welcome to help investigate why it is no longer 
 working!
Sure, I'll have a look. Thanks.
I don't think this broke with D 2.100. For instance, LDC 1.29.0 is based on 2.099.1 and exhibits the same problem. Even older LDC versions don't trip on this exact program but they do output AddressSanitizer CHECK failures.

```
==4108825==AddressSanitizer CHECK failed: /home/vsts/work/1/s/compiler-rt/lib/sanitizer_common/sanitizer_linux_libcdep.cpp:556 "((*tls_addr + *tls_size)) <= ((*stk_addr + *stk_size))" (0x78927e371080, 0x78927e371000)
```

I'll need some time to dig through the IR, the GC, etc.
It is likely related to LLVM version. Did you already check that? Possibly a subtle change in API.
 If you are going to look at this, please let me know, to avoid 
 duplicate efforts.
Not soon, no. -Johan
Feb 18