digitalmars.D - Problem with GC and address/leak sanitizer
- Luís Marques (115/115) Feb 15 I have a program where the GC *seems* to be overwriting memory
- Steven Schveighoffer (28/30) Feb 15 Do I understand that this corruption is happening only with
- Walter Bright (1/1) Feb 17 Is this an issue with the new GC, or the old one?
- Steven Schveighoffer (3/4) Feb 17 Old gc
- Walter Bright (3/7) Feb 17 I'm a little surprised to see a problem crop up after 25 years of contin...
- Richard (Rikki) Andrew Cattermole (3/13) Feb 17 This has already been solved, see Johan's comments.
- Walter Bright (2/3) Feb 19 Excellent! Carry on...
- Johan (6/14) Feb 16 Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that
- Luís Marques (11/14) Feb 16 The fake stack allocator is enabled. If I disable it via
- Luís Marques (2/5) Feb 16 I meant =0, of course.
- Johan (20/31) Feb 16 FakeStack allocates (!) space for stack variables, and points to
- Luís Marques (2/4) Feb 16 Sure, I'll have a look. Thanks.
- Luís Marques (12/20) Feb 17 I don't think this broke with the D 2.100. For instance, LDC
- Johan (5/26) Feb 18 It is likely related to LLVM version. Did you already check that?
I have a program where the GC *seems* to be overwriting memory still in use and corrupting data.

Here's the code. It's massively reduced from the original program. It's hard to reduce it further because minor changes can prevent the problem from triggering. I'll explain the important parts below.

```d
import std.stdio;

struct S {
    int check;
    S* next;
    int[4] data;
}

int main(string[] args) {
    void*[] allocs;
    enum bad_iter = 268;
    for (int n = 0; n < bad_iter+1; n++) {
        allocs.length = 0;
        auto x = " ";
        x ~= ' ';
        int[10][] ts;
        for(int i = 0; i < 21; i++) {
            ts.length++;
        }

        S head;
        S* s = &head;
        if (n == bad_iter) {
            n = bad_iter; // convenient line to set a breakpoint only for the last iteration
        }

        for(int i = 0; i < 8; i++) {
            auto ns = new S;
            ns.check = 1; // set test value here
            s.next = ns;
            s = ns;
        }

        s = head.next; // get the first S allocated this iteration
        if (s.check != 1) { // check test value here
            writefln("check=%d", s.check);
            return -1;
        }

        new int[10];
        allocs ~= null;
        new size_t[3];
    }
    return 0;
}
```

The important part is the following. On each iteration we create 8 instances of S. For each S value, we set its `check` field to 1. Then we check the value of that field (for the first instance of S). When compiled with the address sanitizer, we observe that it has been corrupted and is no longer 1.

Am I doing something incorrectly in the code? AFAIK I'm respecting the rules required by the GC. Maybe there's a silly bug I overlooked?

Tested with LDC 1.40.0 on x86_64 Linux:

```
$ ldc2 app.d -fsanitize=address -g --frame-pointer=all && ./app
check=-337690816
$
```

By setting a watchpoint on the address of the field, I see that the code that writes to `check` is part of the GC implementation.
Here's the backtrace:

```
libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw3Gcx15recoverNextPageMFNbEQCmQC QCeQCeQCcQCn4BinsZb + 348
libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw3Gcx10smallAllocMF bmKmkxC8TypeInfoZPv + 776
libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCsQCqQCkQCkQCiQCtQBy12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQFaQEyQEsQEsQEqQFb10mallocTimelS_DQGiQGgQGaQGaQFyQGj10numMallocslTmTkTmTxQDlZQF MFNbKmKkKmKxQEeZQDx + 89
libdruntime-ldc-shared.so.110`_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZ QDd6memory8BlkInfo_ + 83
libdruntime-ldc-shared.so.110`gc_qalloc + 28
app`_D4core8lifetime__T11_d_newitemTTS3app1SZQwFNaNbNeZPQt at lifetime.d:2837:5
0x00007fffffffe438) at app.d:28:13
libdruntime-ldc-shared.so.110`_D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv + 77
libdruntime-ldc-shared.so.110`_d_run_main2 + 407
libdruntime-ldc-shared.so.110`_d_run_main + 141
argv=0x00007fffffffe728) at entrypoint.d:42:17
libc.so.6`__libc_start_call_main(main=(app`main at entrypoint.d:39), argc=1, argv=0x00007fffffffe728) at libc_start_call_main.h:58:16
libc.so.6`__libc_start_main_impl(main=(app`main at entrypoint.d:39), argc=1, argv=0x00007fffffffe728, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe718) at libc-start.c:360:3
```

There is a subsequent write to that memory location in the leak sanitizer and LSan complains: `==4056526==LeakSanitizer has encountered a fatal error.` (though usually this message isn't flushed).

I assume the original problem was caused by the GC and ASan/LSan are just subsequent victims, but it's hard to be sure.

Apparently, LSan is automatically enabled for Linux when ASan is used.
Although the ASan documentation says that LSan "can be enabled using `ASAN_OPTIONS=detect_leaks=1` on macOS", setting that to 0 didn't seem to disable it, so I couldn't test with ASan but not LSan. Any ideas of what might be going on?
Feb 15
On Saturday, 15 February 2025 at 23:31:42 UTC, Luís Marques wrote:
> I have a program where the GC *seems* to be overwriting memory still in use and corrupting data.

Do I understand that this corruption is happening only with address sanitizer turned on?

I don't see any red flags here, though I'm assuming a lot of these weird random things you are doing (like appending a space to a string every loop) are essential to making the thing fail? It's possible these are tickling GC patterns that cause problems, or it's possible it's tickling bugs in code generation that might prevent the GC from seeing memory!

Writing to the "check" field might be because a gc cycle ran, and that item was incorrectly collected, and now the gc is writing to it because it thinks that data is fair game to use. The writing is probably not the problem, the problem is the previous collection of that data.

I have learned a lot of tricks when implementing the new GC, and when faced with problems like this, it's super-difficult to figure out how to properly find the problem. One technique I used is to fork after scanning, but before collection, and put that forked process to sleep. If the failure happens, then I can gdb into the forked process and see what state the GC was in, including the entire graph of memory, and I could see how a piece of memory is or isn't referenced. This is a tedious process, and requires a lot of knowledge and patience.

If this is indeed a problem with the GC, it's going to be tough to track down. If it's a problem with the codegen, then probably also difficult, but this function is small enough, that maybe someone can look at the assembly and verify that it's doing the right thing? I don't know.
Feb 15
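The fork-and-sleep trick described above can be sketched roughly as follows. This is only an illustration under assumptions: `snapshotForDebugger` is a hypothetical helper name, and in a real investigation the call would sit inside the GC, after marking and before sweeping, rather than in `main`. The child is a copy-on-write snapshot of the parent's memory that just sleeps until a debugger attaches.

```d
import core.sys.posix.signal : kill, SIGKILL;
import core.sys.posix.sys.types : pid_t;
import core.sys.posix.sys.wait : waitpid;
import core.sys.posix.unistd : fork, pause;
import std.stdio : writefln;

/// Hypothetical helper: fork a frozen copy of the current process.
/// Returns the child's pid in the parent (or -1 if fork failed);
/// never returns in the child.
pid_t snapshotForDebugger()
{
    pid_t child = fork();
    if (child == 0)
    {
        // Child: memory is a snapshot of the parent at the fork point.
        // Sleep forever so a debugger can attach and inspect it.
        for (;;)
            pause();
    }
    return child;
}

void main()
{
    pid_t child = snapshotForDebugger();
    if (child < 0)
    {
        writefln("fork failed");
        return;
    }
    writefln("snapshot process %d is sleeping; attach with: gdb -p %d", child, child);

    // In a real session the snapshot would be kept alive until the failure
    // has been examined. This demo simply discards it again.
    kill(child, SIGKILL);
    int status;
    waitpid(child, &status, 0);
}
```

Each snapshot costs a process, so in practice one would probably keep only the most recent one and kill the previous snapshot on every new collection.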
Is this an issue with the new GC, or the old one?
Feb 17
On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
> Is this an issue with the new GC, or the old one?

Old gc

-Steve
Feb 17
On 2/17/2025 8:55 PM, Steven Schveighoffer wrote:
> On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
>> Is this an issue with the new GC, or the old one?
>
> Old gc

I'm a little surprised to see a problem crop up after 25 years of continuous use? Perhaps try an older release of the compiler?
Feb 17
On 18/02/2025 6:22 PM, Walter Bright wrote:
> On 2/17/2025 8:55 PM, Steven Schveighoffer wrote:
>> On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:
>>> Is this an issue with the new GC, or the old one?
>>
>> Old gc
>
> I'm a little surprised to see a problem crop up after 25 years of continuous use? Perhaps try an older release of the compiler?

This has already been solved, see Johan's comments. ASAN with a fake stack isn't operating properly.
Feb 17
On 2/17/2025 9:26 PM, Richard (Rikki) Andrew Cattermole wrote:
> This has already been solved, see Johan's comments.

Excellent! Carry on...
Feb 19
On Saturday, 15 February 2025 at 23:31:42 UTC, Luís Marques wrote:
> Tested with LDC 1.40.0 on x86_64 Linux:
> ```
> $ ldc2 app.d -fsanitize=address -g --frame-pointer=all && ./app
> check=-337690816
> $
> ```

Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that FakeStack is not enabled? (`detect_stack_use_after_return=false`)

(And also test with a little bit older LDC, with different LLVM version, to see if it is a new issue or not)

-Johan
Feb 16
On Sunday, 16 February 2025 at 20:18:18 UTC, Johan wrote:
> Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that FakeStack is not enabled? (`detect_stack_use_after_return=false`)

The fake stack allocator is enabled. If I disable it via `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no longer reproduces.

According to [1], integrating Fake Stack with GC requires special consideration. What's the status of ASan / fake stack support in LDC? (was it supposed to work, to be disabled by default, etc. ...?)

Thanks!

[1] https://github.com/google/sanitizers/wiki/AddressSanitizerUseAfterReturn#garbage-collection
Feb 16
On Sunday, 16 February 2025 at 21:48:31 UTC, Luís Marques wrote:
> The fake stack allocator is enabled. If I disable it via `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no longer reproduces.

I meant =0, of course.
Feb 16
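As an alternative to setting the environment variable on every run, the same default could probably be baked into the binary through ASan's `__asan_default_options` hook. This is a sketch under assumptions, not something tested in this thread: it assumes the hook is honoured by the ASan runtime that LDC links, and that an `ASAN_OPTIONS` value set in the environment still takes precedence over it.

```d
import std.stdio : writeln;
import std.string : fromStringz;

// ASan looks up this user-defined hook at startup to obtain default flags.
// It may run very early, so it should stay trivial: no GC allocation and
// no work beyond returning a string literal.
extern (C) const(char)* __asan_default_options() nothrow @nogc
{
    // Assumption: turning off the fake stack sidesteps the problem,
    // as reported above for ASAN_OPTIONS=detect_stack_use_after_return=0.
    return "detect_stack_use_after_return=0";
}

void main()
{
    // Without -fsanitize=address this is just an ordinary function,
    // so the built-in default can be printed for a sanity check.
    writeln("built-in ASan defaults: ", __asan_default_options().fromStringz);
}
```

Running the result with `ASAN_OPTIONS=verbosity=1`, as suggested earlier in the thread, should show whether the fake stack really ended up disabled.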
On Sunday, 16 February 2025 at 21:48:31 UTC, Luís Marques wrote:
> On Sunday, 16 February 2025 at 20:18:18 UTC, Johan wrote:
>> Can you run with `ASAN_OPTIONS=verbosity=1` and make sure that FakeStack is not enabled? (`detect_stack_use_after_return=false`)
>
> The fake stack allocator is enabled. If I disable it via `ASAN_OPTIONS=detect_stack_use_after_return=1` the problem no longer reproduces.
>
> According to [1], integrating Fake Stack with GC requires special consideration. What's the status of ASan / fake stack support in LDC? (was it supposed to work, to be disabled by default, etc. ...?)

FakeStack allocates (!) space for stack variables, and points to that "fake stack" memory with a pointer in actual CPU stack memory. This means that the stack variables are now no longer in memory that is scanned by the GC.

The fix for that, of course, is to include all FakeStacks in the GC scanning [1a][1b]. This used to work, but somehow does not work anymore since LDC 2.100 (I perhaps have forgotten about this and just noticed it). [2]

You are very welcome to help investigate why it is no longer working! [3] is an interesting test case of how code should work.

-Johan

[1a] https://github.com/ldc-developers/druntime/compare/d6b328be91db63aff979f584b0d1def0f746d730...1d938e0b7f668b099f9fa694b135c82ef13dec59
[1b] https://github.com/ldc-developers/ldc/pull/3888
[2] https://github.com/ldc-developers/ldc/blob/d3f065816ec7d420f370e4c95c6000eb78187e25/tests/sanitizers/asan_fakestack_GC.d#L3
[3] https://github.com/llvm/llvm-project/blob/main/compiler-rt/test/asan/TestCases/Posix/gc-test.cpp
Feb 16
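To make the scanning problem concrete, here is a small simulation of what "include all FakeStacks in the GC scanning" means. It is not LDC's druntime code: `scanWithFakeFrames` and `FrameLookup` are hypothetical names, and the fake frame is faked with `malloc` so the example runs without ASan. Under ASan, the lookup would presumably wrap the interface functions named in the comment.

```d
import core.stdc.stdlib : free, malloc;
import std.stdio : writeln;

// With -fsanitize=address, the lookup below would be backed by these ASan
// interface functions (from sanitizer/asan_interface.h):
//   void* __asan_get_current_fake_stack();
//   void* __asan_addr_is_in_fake_stack(void* fakeStack, void* addr,
//                                      void** beg, void** end);

/// Hypothetical callback: if `addr` lies inside a fake frame, report its bounds.
alias FrameLookup = bool delegate(void* addr, out void* beg, out void* end);

/// Hypothetical conservative scan: visit every word in [lo, hi); when a word
/// points into a fake frame, visit the words of that frame as well.
void scanWithFakeFrames(void** lo, void** hi, FrameLookup lookup,
                        void delegate(void* candidate) visit)
{
    for (void** p = lo; p < hi; ++p)
    {
        void* word = *p;
        visit(word);
        void* beg, end;
        if (lookup(word, beg, end))
            for (void** q = cast(void**) beg; q < cast(void**) end; ++q)
                visit(*q);
    }
}

void main()
{
    int target = 42; // stand-in for a GC object

    // Simulated fake frame: off-stack memory holding the only pointer to `target`.
    void** fakeFrame = cast(void**) malloc(2 * (void*).sizeof);
    scope (exit) free(fakeFrame);
    fakeFrame[0] = &target;
    fakeFrame[1] = null;

    // Simulated real stack: it only holds a pointer to the fake frame.
    void*[1] realStack = [cast(void*) fakeFrame];

    bool lookup(void* addr, out void* beg, out void* end)
    {
        void* fbeg = fakeFrame, fend = fakeFrame + 2;
        if (addr < fbeg || addr >= fend)
            return false;
        beg = fbeg;
        end = fend;
        return true;
    }

    bool found;
    scanWithFakeFrames(realStack.ptr, realStack.ptr + 1, &lookup,
                       (void* c) { if (c is &target) found = true; });
    writeln(found ? "target reachable" : "target missed");
}
```

A scanner that only walked `realStack` would never see `&target`, which matches the symptom in the original post: an object referenced only from a fake frame looks unreachable and gets collected.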
On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
> You are very welcome to help investigate why it is no longer working!

Sure, I'll have a look. Thanks.
Feb 16
On Sunday, 16 February 2025 at 22:40:58 UTC, Luís Marques wrote:
> On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
>> This used to work, but somehow does not work anymore since LDC 2.100 (I perhaps have forgotten about this and just noticed it). [2] You are very welcome to help investigate why it is no longer working!
>
> Sure, I'll have a look. Thanks.

I don't think this broke with the D 2.100. For instance, LDC 1.29.0 is based on 2.099.1 and exhibits the same problem. Even older LDC versions don't trip on this exact program but they do output AddressSanitizer CHECK failures.

```
==4108825==AddressSanitizer CHECK failed: /home/vsts/work/1/s/compiler-rt/lib/sanitizer_common/sanitizer_linux_libcdep.cpp:556 "((*tls_addr + *tls_size)) <= ((*stk_addr + *stk_size))" (0x78927e371080, 0x78927e371000)
```

I'll need some time to dig through the IR, the GC, etc. If you are going to look at this, please let me know, to avoid duplicate efforts.
Feb 17
On Monday, 17 February 2025 at 21:56:29 UTC, Luís Marques wrote:
> On Sunday, 16 February 2025 at 22:40:58 UTC, Luís Marques wrote:
>> On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
>>> This used to work, but somehow does not work anymore since LDC 2.100 (I perhaps have forgotten about this and just noticed it). [2] You are very welcome to help investigate why it is no longer working!
>>
>> Sure, I'll have a look. Thanks.
>
> I don't think this broke with the D 2.100. For instance, LDC 1.29.0 is based on 2.099.1 and exhibits the same problem. Even older LDC versions don't trip on this exact program but they do output AddressSanitizer CHECK failures.
> ```
> ==4108825==AddressSanitizer CHECK failed: /home/vsts/work/1/s/compiler-rt/lib/sanitizer_common/sanitizer_linux_libcdep.cpp:556 "((*tls_addr + *tls_size)) <= ((*stk_addr + *stk_size))" (0x78927e371080, 0x78927e371000)
> ```
> I'll need some time to dig through the IR, the GC, etc.

It is likely related to LLVM version. Did you already check that? Possibly a subtle change in API.

> If you are going to look at this, please let me know, to avoid duplicate efforts.

Not soon, no.

-Johan
Feb 18