www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 11981] New: unittest 'host' deadlock

reply d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981

           Summary: unittest 'host' deadlock
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: druntime
        AssignedTo: nobody puremagic.com
        ReportedBy: braddr puremagic.com



---
The deadlock is extremely rare, but I managed to remember to capture the
stacktraces of an example before killing the process this time.  It looks like
the suspend signal happened while in the alloc library was in use and the
subsequently called a function that needed to alloc.  I'll bet that
tls_get_addr_tail isn't legal to call from a signal handler.  Filed as OS
'all', but it's really all posix quite likely.

Thread 1 (Thread 0x2b40440adf40 (LWP 5616)):


/home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so

/home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so

/home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so

/home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so

/home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so

/home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so



Thread 2 (Thread 0x2b4045836700 (LWP 5620)):

../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95



size=<optimized out>, a_tsd=<optimized out>) at arena.c:838



dl-tls.c:742

/home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so

/home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so

/home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so



size=<optimized out>, a_tsd=<optimized out>) at arena.c:838


attr=0x2b4045835e20) at pthread_getattr_np.c:167

/home/braddr/sandbox/d/d-tester/client/pull-869498-Linux_64_64/druntime/lib/libdruntime-linux64.so

pthread_create.c:311

../sysdeps/unix/sysv/linux/x86_64/clone.S:113

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jan 23 2014
next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981


safety0ff.bugz <safety0ff.bugz gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |safety0ff.bugz gmail.com



19:28:43 PST ---
See also: http://d.puremagic.com/issues/show_bug.cgi?id=10351 and
http://forum.dlang.org/thread/lqavnshdhofruqpryjwy forum.dlang.org

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 05 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981


Stanislav Blinov <stanislav.blinov gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |stanislav.blinov gmail.com



16:33:53 PST ---
Don't know how useful this could be but...

This code:

import std.concurrency;
import core.memory;
import core.thread;
import core.atomic;
import std.stdio;

// Reduced from host unittest
version(all) // version(none) avoids deadlock
{
shared uint tctor, tdtor;
static this() { auto v = atomicOp!"+="(tctor,1); writefln("  new thread %s",
v); }
static ~this() { auto v = atomicOp!"+="(tdtor,1); writefln("  cleanup %s", v);
}
}

void main()
{

    enum uint numThreads = 16;
    long i = 0;    
    while(true)
    {
        writefln("Starting run %s", i);
        shared uint finished;
        foreach(immutable t; 0..numThreads) 
        {
            spawn((shared uint* f) { atomicOp!"+="(*f, 1); }, &finished);
        }
        long n = 0;
        while (atomicLoad!(MemoryOrder.acq)(finished) != numThreads)
        {
            writefln("suspend %s", n);
            thread_suspendAll();
            writefln("resume %s", n);
            thread_resumeAll();
            ++n;
        }
        thread_joinAll();
        writefln("Run %s done", i);
        ++i;
    }
}


deadlocks in every single run on my machine (though randomly in each run).

$uname -a


x86_64 x86_64 GNU/Linux

$ldd --version

ldd (Ubuntu EGLIBC 2.17-0ubuntu5.1) 2.17

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




17:07:19 PST ---
That program deadlocks due to the writeln between thread_suspendAll and
thread_resumeAll (removing it removes the deadlock.)

A notable characteristic about the GC related deadlock is that one thread
is/should be stuck in thread_suspendAll.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




17:16:28 PST ---

 That program deadlocks due to the writeln between thread_suspendAll and
 thread_resumeAll (removing it removes the deadlock.)
Yes, I've just noticed. My bad :( Will try to look more into original unittest. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




18:07:14 PST ---
Ok, I've built druntime with -g -debug and built a modified host.c that runs
infinitely to eventually hit that deadlock:

--- host.c      2014-02-09 01:45:36.399539870 +0400
+++ host_.c      2014-02-09 05:40:45.885080082 +0400

   -5,6 +5,7   

 int main(int argc, char* argv[])
 {
+while(1) {
     const size_t pathlen = strrchr(argv[0], '/') - argv[0] + 1;
     char *name = malloc(pathlen + sizeof("plugin1.so"));
     memcpy(name, argv[0], pathlen);
   -46,5 +47,6   
     assert(dlclose(plugin1) == 0);

     free(name);
+}
     return EXIT_SUCCESS;
 }

Here's the stacktrace:

Thread 4 (Thread 0x7ffff6a05700 (LWP 16188)):

../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95




dl-tls.c:742

sp=0x7ffff6a046c0) at src/core/thread.d:1069

src/core/thread.d:2085

src/core/thread.d:411




pthread_create.c:311

../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7ffff7fc8740 (LWP 16182)):


src/core/thread.d:2244


src/gc/gc.d:2389

src/gc/gc.d:1156






Reposting machine details:

$uname -a


x86_64 x86_64 GNU/Linux

$ldd --version

ldd (Ubuntu EGLIBC 2.17-0ubuntu5.1) 2.17

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




18:15:46 PST ---
Another run, this time replaced GC.collect() with
thread_suspendAll()/thread_resumeAll() in plugin.d:

--- plugin.d    2014-02-09 01:52:48.944214965 +0400
+++ plugin_.d    2014-02-09 06:11:44.677437500 +0400
   -21,7 +21,9   
         assert(tdtor >= 0);
         // test some runtime functionality
         launchThread();
-        GC.collect();
+        //GC.collect();
+        thread_suspendAll();
+        thread_resumeAll();
         joinThread();
     }
     catch (Throwable)

stack trace:

Thread 19 (Thread 0x7ffff6a05700 (LWP 16354)):

../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95



size=<optimized out>, a_tsd=<optimized out>) at arena.c:838



dl-tls.c:742

sp=0x7ffff6a04640) at src/core/thread.d:1069

src/core/thread.d:2085

src/core/thread.d:411



size=<optimized out>, a_tsd=<optimized out>) at arena.c:838


attr=0x7ffff6a04d88) at pthread_getattr_np.c:167


pthread_create.c:311

../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7ffff7fc8740 (LWP 16333)):


src/core/thread.d:2244




-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




18:24:18 PST ---
Here's the sequence of events I suspect causes this:
3 Threads:
Thread 2 acquires malloc internal lock
Thread 1 causes a GC collection and therefore thread_suspendAll
Thread 3 enters the signal handler for suspendAll, which somehow calls
tls_get_addr_tail, which calls malloc, and since Thread 2 has the lock, it
causes a deadlock.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




18:30:16 PST ---

 Here's the sequence of events I suspect causes this:
 3 Threads:
But there are only 2, unless I am mistaken? -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




18:35:18 PST ---


 Here's the sequence of events I suspect causes this:
 3 Threads:
But there are only 2, unless I am mistaken?
I assumed there were other threads because the stacktraces were for Threads 1, 4 and 19. On a separate note: Async-safe TLS access is scheduled for glibc 2.19 (upcoming release.) -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




18:40:06 PST ---


 I assumed there were other threads because the stacktraces were for Threads 1,
 4 and 19.
There are only 2: main and one spawned by launchThread(). Looks like the spawned one locks itself up when calling malloc inside suspend handler. What does it allocate there?..
 On a separate note: Async-safe TLS access is scheduled for glibc 2.19 (upcoming
 release.)
-- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




19:09:56 PST ---
Oh, I see. SIGUSR1 is being sent to a thread while it's being constructed. 

Either getStackBottom() or rt.tlsgc.init() call malloc(). When SIGUSR1 is
received it calls Thread.getThis(), and that TLS reference had not yet been
allocated, so it calls malloc() again and locks.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




20:23:11 PST ---
So the failing sequence boils down to this:

Main:                      | t:
                           |
// t.start():              |
                           |
slock.lock();              |
m_isRunning = true;        |
pthread_create();          |
// at this point,          |
// t.isRunning == true     |
add(this);                 |
// at this point,          |
// t is subject to         |
// thread_suspendAll       |
slock.unlock();            |
                           |
// thread_suspendAll():    |
slock.lock();              |
                           |
// suspend(t):             | // thread_entryPoint():
pthread_kill();            | getStackBottom(); 
                           | // or rt.tlsgc.init()
                           | // or (maybe?) Thread.setThis()

At this point, t *may* be inside malloc(). It catches SIGUSR1,
thread_suspendHandler() is called. Eventually, the handler gets to
Thread.getThis(), which calls malloc() again. If t already was inside malloc(),
it locks.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981


Stanislav Blinov <stanislav.blinov gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         OS/Version|All                         |Linux
           Severity|normal                      |major


-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981


Stanislav Blinov <stanislav.blinov gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |pull



22:37:21 PST ---
Not pretty, but possible solution:

https://github.com/D-Programming-Language/druntime/pull/718

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




23:25:08 PST ---

 Not pretty, but possible solution:
The thread start up code looks very race-prone, I hadn't looked at it before. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 08 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981


Martin Nowak <code dawg.eu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |code dawg.eu




 The thread start up code looks very race-prone, I hadn't looked at it before.
Yeah, pretty subtle, I'd really like to get rid of the Fiber context stack for example, but it's not trivial. One reason is that we're using suspend/resume on OSX and Windows but signals on Posix.
 Async-safe TLS access is scheduled for glibc 2.19 (upcoming
release.) As I wrote on the pull request, a correct fix would be to let Thread.sm_this use pthread_getspecific again. https://github.com/D-Programming-Language/druntime/pull/718#discussion_r9568348 -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 09 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




03:13:27 PST ---

 Async-safe TLS access is scheduled for glibc 2.19 (upcoming
release.) As I wrote on the pull request, a correct fix would be to let Thread.sm_this use pthread_getspecific again. https://github.com/D-Programming-Language/druntime/pull/718#discussion_r9568348
Looks like they've delayed async-safe TLS until a later release. I'm not familiar with pthread_getspecific, but as usual I'm learning as I go to try to help with druntime issues. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 09 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




07:03:20 PST ---
I'm starting to wonder if it's really posix-specific. The issue only manifests
when using shared libraries (second malloc call is made from dl-tls.c).

Can Windows DLL TLS handling be susceptible to this too? Is it currently
possible to replicate that unittest for Windows?

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 09 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981





 I'm starting to wonder if it's really posix-specific. The issue only manifests
 when using shared libraries (second malloc call is made from dl-tls.c).
 
Yep, because the TLS segment for loaded shared libraries is allocated lazily, while the TLS segments the executable and linked shared libraries are allocated eagerly.
 Can Windows DLL TLS handling be susceptible to this too? Is it currently
 possible to replicate that unittest for Windows?
The same problem cannot happen on OSX or Windows, because we're using suspend API's there, i.e. no signal handler code that does the TLS access is run. -- Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Feb 09 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




Commits pushed to master at https://github.com/D-Programming-Language/druntime

https://github.com/D-Programming-Language/druntime/commit/25b68995b280a5e02a758aa0e66aa25cc412da50
Fix for issue 11981

https://github.com/D-Programming-Language/druntime/commit/4ce7f1d36008ec9c00f4ec8d3a1b38fc2ef3452f


Fix for issue 11981 - deadlock during thread initialization on Posix

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 10 2014
prev sibling next sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981


Brad Roberts <braddr puremagic.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED



PST ---
I'll go ahead and mark this resolved.  Given it's relative rarity it's hard to
state conclusively that it's fixed, but it certainly sounds like it might be
based on the change description.

That said, we need to be MUCH more careful about what code is executed in the
context of signal handlers as it's a very narrow set of valid apis.  I'll bet
that we're technically violating the list in a number of ways and are just
getting away with it most of the time.

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 13 2014
prev sibling parent d-bugmail puremagic.com writes:
https://d.puremagic.com/issues/show_bug.cgi?id=11981




*** Issue 11630 has been marked as a duplicate of this issue. ***

-- 
Configure issuemail: https://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 17 2014