digitalmars.D.bugs - [Issue 11981] New: unittest 'host' deadlock
- d-bugmail puremagic.com (68/68) Jan 23 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (11/11) Feb 05 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (57/57) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (9/9) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (7/9) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (63/63) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (57/57) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (12/12) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (7/9) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (10/15) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (9/13) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (9/9) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (32/32) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (9/9) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (11/11) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (7/8) Feb 08 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (17/19) Feb 09 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (9/14) Feb 09 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (9/9) Feb 09 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (11/16) Feb 09 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (11/11) Feb 10 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (17/17) Feb 13 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
- d-bugmail puremagic.com (6/6) Feb 17 2014 https://d.puremagic.com/issues/show_bug.cgi?id=11981
https://d.puremagic.com/issues/show_bug.cgi?id=11981 Summary: unittest 'host' deadlock Product: D Version: D2 Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: druntime AssignedTo: nobody puremagic.com ReportedBy: braddr puremagic.com --- The deadlock is extremely rare, but I managed to remember to capture the stacktraces of an example before killing the process this time. It looks like the suspend signal happened while in the alloc library was in use and the subsequently called a function that needed to alloc. I'll bet that tls_get_addr_tail isn't legal to call from a signal handler. Filed as OS 'all', but it's really all posix quite likely. Thread 1 (Thread 0x2b40440adf40 (LWP 5616)): /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so Thread 2 (Thread 0x2b4045836700 (LWP 5620)): ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 size=<optimized out>, a_tsd=<optimized out>) at arena.c:838 dl-tls.c:742 /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so size=<optimized out>, a_tsd=<optimized out>) at arena.c:838 attr=0x2b4045835e20) at pthread_getattr_np.c:167 /home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so pthread_create.c:311 ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Jan 23 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 safety0ff.bugz <safety0ff.bugz gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |safety0ff.bugz gmail.com 19:28:43 PST --- See also: http://d.puremagic.com/issues/show_bug.cgi?id=10351 and http://forum.dlang.org/thread/lqavnshdhofruqpryjwy forum.dlang.org -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 05 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 Stanislav Blinov <stanislav.blinov gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |stanislav.blinov gmail.com 16:33:53 PST --- Don't know how useful this could be but... This code: import std.concurrency; import core.memory; import core.thread; import core.atomic; import std.stdio; // Reduced from host unittest version(all) // version(none) avoids deadlock { shared uint tctor, tdtor; static this() { auto v = atomicOp!"+="(tctor,1); writefln(" new thread %s", v); } static ~this() { auto v = atomicOp!"+="(tdtor,1); writefln(" cleanup %s", v); } } void main() { enum uint numThreads = 16; long i = 0; while(true) { writefln("Starting run %s", i); shared uint finished; foreach(immutable t; 0..numThreads) { spawn((shared uint* f) { atomicOp!"+="(*f, 1); }, &finished); } long n = 0; while (atomicLoad!(MemoryOrder.acq)(finished) != numThreads) { writefln("suspend %s", n); thread_suspendAll(); writefln("resume %s", n); thread_resumeAll(); ++n; } thread_joinAll(); writefln("Run %s done", i); ++i; } } deadlocks in every single run on my machine (though randomly in each run). $uname -a x86_64 x86_64 GNU/Linux $ldd --version ldd (Ubuntu EGLIBC 2.17-0ubuntu5.1) 2.17 -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 17:07:19 PST --- That program deadlocks due to the writeln between thread_suspendAll and thread_resumeAll (removing it removes the deadlock.) A notable characteristic about the GC related deadlock is that one thread is/should be stuck in thread_suspendAll. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 17:16:28 PST ---That program deadlocks due to the writeln between thread_suspendAll and thread_resumeAll (removing it removes the deadlock.)Yes, I've just noticed. My bad :( Will try to look more into original unittest. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 18:07:14 PST --- Ok, I've built druntime with -g -debug and built a modified host.c that runs infinitely to eventually hit that deadlock: --- host.c 2014-02-09 01:45:36.399539870 +0400 +++ host_.c 2014-02-09 05:40:45.885080082 +0400 -5,6 +5,7 int main(int argc, char* argv[]) { +while(1) { const size_t pathlen = strrchr(argv[0], '/') - argv[0] + 1; char *name = malloc(pathlen + sizeof("plugin1.so")); memcpy(name, argv[0], pathlen); -46,5 +47,6 assert(dlclose(plugin1) == 0); free(name); +} return EXIT_SUCCESS; } Here's the stacktrace: Thread 4 (Thread 0x7ffff6a05700 (LWP 16188)): ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 dl-tls.c:742 sp=0x7ffff6a046c0) at src/core/thread.d:1069 src/core/thread.d:2085 src/core/thread.d:411 pthread_create.c:311 ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 Thread 1 (Thread 0x7ffff7fc8740 (LWP 16182)): src/core/thread.d:2244 src/gc/gc.d:2389 src/gc/gc.d:1156 Reposting machine details: $uname -a x86_64 x86_64 GNU/Linux $ldd --version ldd (Ubuntu EGLIBC 2.17-0ubuntu5.1) 2.17 -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 18:15:46 PST --- Another run, this time replaced GC.collect() with thread_suspendAll()/thread_resumeAll() in plugin.d: --- plugin.d 2014-02-09 01:52:48.944214965 +0400 +++ plugin_.d 2014-02-09 06:11:44.677437500 +0400 -21,7 +21,9 assert(tdtor >= 0); // test some runtime functionality launchThread(); - GC.collect(); + //GC.collect(); + thread_suspendAll(); + thread_resumeAll(); joinThread(); } catch (Throwable) stack trace: Thread 19 (Thread 0x7ffff6a05700 (LWP 16354)): ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 size=<optimized out>, a_tsd=<optimized out>) at arena.c:838 dl-tls.c:742 sp=0x7ffff6a04640) at src/core/thread.d:1069 src/core/thread.d:2085 src/core/thread.d:411 size=<optimized out>, a_tsd=<optimized out>) at arena.c:838 attr=0x7ffff6a04d88) at pthread_getattr_np.c:167 pthread_create.c:311 ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 Thread 1 (Thread 0x7ffff7fc8740 (LWP 16333)): src/core/thread.d:2244 -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 18:24:18 PST --- Here's the sequence of events I suspect causes this: 3 Threads: Thread 2 acquires malloc internal lock Thread 1 causes a GC collection and therefore thread_suspendAll Thread 3 enters the signal handler for suspendAll, which somehow calls tls_get_addr_tail, which calls malloc, and since Thread 2 has the lock, it causes a deadlock. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 18:30:16 PST ---Here's the sequence of events I suspect causes this: 3 Threads:But there are only 2, unless I am mistaken? -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 18:35:18 PST ---I assumed there were other threads because the stacktraces were for Threads 1, 4 and 19. On a separate note: Async-safe TLS access is scheduled for glibc 2.19 (upcoming release.) -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------Here's the sequence of events I suspect causes this: 3 Threads:But there are only 2, unless I am mistaken?
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 18:40:06 PST ---I assumed there were other threads because the stacktraces were for Threads 1, 4 and 19.There are only 2: main and one spawned by launchThread(). Looks like the spawned one locks itself up when calling malloc inside suspend handler. What does it allocate there?..On a separate note: Async-safe TLS access is scheduled for glibc 2.19 (upcoming release.)-- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 19:09:56 PST --- Oh, I see. SIGUSR1 is being sent to a thread while it's being constructed. Either getStackBottom() or rt.tlsgc.init() call malloc(). When SIGUSR1 is received it calls Thread.getThis(), and that TLS reference had not yet been allocated, so it calls malloc() again and locks. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 20:23:11 PST --- So the failing sequence boils down to this: Main: | t: | // t.start(): | | slock.lock(); | m_isRunning = true; | pthread_create(); | // at this point, | // t.isRunning == true | add(this); | // at this point, | // t is subject to | // thread_suspendAll | slock.unlock(); | | // thread_suspendAll(): | slock.lock(); | | // suspend(t): | // thread_entryPoint(): pthread_kill(); | getStackBottom(); | // or rt.tlsgc.init() | // or (maybe?) Thread.setThis() At this point, t *may* be inside malloc(). It catches SIGUSR1, thread_suspendHandler() is called. Eventually, the handler gets to Thread.getThis(), which calls malloc() again. If t already was inside malloc(), it locks. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 Stanislav Blinov <stanislav.blinov gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- OS/Version|All |Linux Severity|normal |major -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 Stanislav Blinov <stanislav.blinov gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Keywords| |pull 22:37:21 PST --- Not pretty, but possible solution: https://github.com/D-Programming-Language/druntime/pull/718 -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 23:25:08 PST ---Not pretty, but possible solution:The thread start up code looks very race-prone, I hadn't looked at it before. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 Martin Nowak <code dawg.eu> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |code dawg.euThe thread start up code looks very race-prone, I hadn't looked at it before.Yeah, pretty subtle, I'd really like to get rid of the Fiber context stack for example, but it's not trivial. One reason is that we're using suspend/resume on OSX and Windows but signals on Posix.Async-safe TLS access is scheduled for glibc 2.19 (upcomingrelease.) As I wrote on the pull request, a correct fix would be to let Thread.sm_this use pthread_getspecific again. https://github.com/D-Programming-Language/druntime/pull/718#discussion_r9568348 -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 09 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 03:13:27 PST ---Looks like they've delayed async-safe TLS until a later release. I'm not familiar with pthread_getspecific, but as usual I'm learning as I go to try to help with druntime issues. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------Async-safe TLS access is scheduled for glibc 2.19 (upcomingrelease.) As I wrote on the pull request, a correct fix would be to let Thread.sm_this use pthread_getspecific again. https://github.com/D-Programming-Language/druntime/pull/718#discussion_r9568348
Feb 09 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 07:03:20 PST --- I'm starting to wonder if it's really posix-specific. The issue only manifests when using shared libraries (second malloc call is made from dl-tls.c). Can Windows DLL TLS handling be susceptible to this too? Is it currently possible to replicate that unittest for Windows? -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 09 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981I'm starting to wonder if it's really posix-specific. The issue only manifests when using shared libraries (second malloc call is made from dl-tls.c).Yep, because the TLS segment for loaded shared libraries is allocated lazily, while the TLS segments the executable and linked shared libraries are allocated eagerly.Can Windows DLL TLS handling be susceptible to this too? Is it currently possible to replicate that unittest for Windows?The same problem cannot happen on OSX or Windows, because we're using suspend API's there, i.e. no signal handler code that does the TLS access is run. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 09 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 Commits pushed to master at https://github.com/D-Programming-Language/druntime https://github.com/D-Programming-Language/druntime/commit/25b68995b280a5e02a758aa0e66aa25cc412da50 Fix for issue 11981 https://github.com/D-Programming-Language/druntime/commit/4ce7f1d36008ec9c00f4ec8d3a1b38fc2ef3452f Fix for issue 11981 - deadlock during thread initialization on Posix -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 10 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 Brad Roberts <braddr puremagic.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED PST --- I'll go ahead and mark this resolved. Given it's relative rarity it's hard to state conclusively that it's fixed, but it certainly sounds like it might be based on the change description. That said, we need to be MUCH more careful about what code is executed in the context of signal handlers as it's a very narrow set of valid apis. I'll bet that we're technically violating the list in a number of ways and are just getting away with it most of the time. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 13 2014
https://d.puremagic.com/issues/show_bug.cgi?id=11981 *** Issue 11630 has been marked as a duplicate of this issue. *** -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 17 2014