digitalmars.D.bugs - [Issue 5488] New: Spawned threads hang in a way that suggests allocation or gc issue
- d-bugmail puremagic.com (36/36) Jan 25 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
- d-bugmail puremagic.com (18/18) Jan 31 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
- d-bugmail puremagic.com (9/9) Jan 31 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
- d-bugmail puremagic.com (15/15) Jan 31 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
- d-bugmail puremagic.com (93/93) Jan 31 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
- d-bugmail puremagic.com (7/7) Feb 02 2011 http://d.puremagic.com/issues/show_bug.cgi?id=5488
- d-bugmail puremagic.com (14/14) Apr 22 2012 http://d.puremagic.com/issues/show_bug.cgi?id=5488
http://d.puremagic.com/issues/show_bug.cgi?id=5488

           Summary: Spawned threads hang in a way that suggests allocation
                    or gc issue
           Product: D
           Version: D2
          Platform: x86
        OS/Version: Mac OS X
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: adam_conner_sax yahoo.com

--- 20:05:49 PST ---
Created an attachment (id=882)
code to demonstrate the issue described above

The attached program hangs more often than not during the second set of spawns
(using dmd 2.051 on OSX). The thread functions do nothing but allocate a large
array and then exit. In one case the array is an Array!double (from
std.container) and in the other it is a built-in double[]. In the second case,
a large enough array will cause the program to hang.

Sean Kelly has already done some investigating; quoting from his responses:

1) This one is weird, and doesn't appear related to 4307. One of the threads
(thread A) is in a GC collection and blocked trying to acquire the mutex
protecting the global thread list within thread_resumeAll. Another thread
(thread B) is also blocked trying to acquire this mutex for other reasons. My
best guess is that pthread_mutex in OSX is trying to give ownership of the
lock to thread B, and since thread B is suspended it effectively blocks thread
A from acquiring it to resume execution after the GC cycle.

2) After some testing, it looks like I was right. I have a fix for this, but
it's far from ideal (though the diff is small): require everything but
thread_resumeAll to acquire two locks in sequence, while thread_resumeAll only
acquires the second. I'll try to come up with something better.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jan 25 2011
http://d.puremagic.com/issues/show_bug.cgi?id=5488

Sean Kelly <sean invisibleduck.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |sean invisibleduck.org
         AssignedTo|nobody puremagic.com        |sean invisibleduck.org

--- This one is weird, and doesn't appear related to 4307. One of the threads
(thread A) is in a GC collection and blocked trying to acquire the mutex
protecting the global thread list within thread_resumeAll. Another thread
(thread B) is also blocked trying to acquire this mutex for other reasons. My
best guess is that pthread_mutex in OSX is trying to give ownership of the
lock to thread B, and since thread B is suspended it effectively blocks thread
A from acquiring it to resume execution after the GC cycle.
Jan 31 2011
http://d.puremagic.com/issues/show_bug.cgi?id=5488

--- After some testing, it looks like I was right. I have a fix for this, but
it's far from ideal (though the diff is small): require everything but
thread_resumeAll to acquire two locks in sequence, while thread_resumeAll only
acquires the second. I'll try to come up with something better.
Jan 31 2011
http://d.puremagic.com/issues/show_bug.cgi?id=5488

--- Okay, I decided to use a Mutex instead of a built-in object monitor for
locking the thread list, because this allows me to lock in thread_suspendAll()
and hold the lock until thread_resumeAll() completes. This also allows me to
remove some busy waits I'd added to Thread.add() to avoid adding a thread or
context while a GC cycle was in progress. Much neater, and in theory it solves
everything.

That said, I'm still seeing a rare occasional deadlock in the attached app.
This one appears to be different, however, and the near-complete lack of
usable debug info in DMD binaries on OSX is complicating figuring this one
out. I'll add some printfs and hope that turns up something.
Jan 31 2011
http://d.puremagic.com/issues/show_bug.cgi?id=5488

--- I've confirmed that this new deadlock isn't caused by the GC thread being
blocked while acquiring the global thread list mutex, so I've fixed the issue
this ticket was created for.

However, it's starting to look like the Mach thread_suspend() call doesn't
play well with POSIX mutexes. What I think is happening is that a thread is
blocked on the GC mutex when a collection occurs. The collection completes and
the mutex is released, but the thread being given the lock is slow to resume
and is missing the signal meant to notify it that the lock is free. This is
all conjecture based on stack traces (included below) and some printfs to
confirm that core.thread isn't involved, but it seems reasonable. If true,
though, it could mean that the mechanism used to stop and restart the world
during a GC run on OSX is fundamentally unsound. I'll see about confirming the
cause and go from there.

0x984d6142 in semaphore_wait_signal_trap ()
(gdb) bt
  D3std11concurrency38__T6_spawnTkTkTS3std11concurrency3TidZ6_spawnFbPFkkS3std11concurrency3TidZvkkS3std11concurrency3TidZS3std11concurrency3Tid ()
  D3std11concurrency37__T5spawnTkTkTS3std11concurrency3TidZ5spawnFPFkkS3std11concurrency3TidZvkkS3std11concurrency3TidZS3std11concurrency3Tid ()
(gdb) thread 2
[Switching to thread 2 (process 114)]
0x0000512e in D6object12__T5clearTdZ5clearFKdZv ()
(gdb) thread 3
[Switching to thread 3 (process 114)]
0x0000512e in D6object12__T5clearTdZ5clearFKdZv ()
(gdb) thread 4
[Switching to thread 4 (process 114)]
0xffff07b6 in __memcpy ()
(gdb) thread 5
[Switching to thread 5 (process 114)]
0x984d6142 in semaphore_wait_signal_trap ()
(gdb) bt
  D3std11concurrency36__T4ListTS3std11concurrency7MessageZ4List3putMFS3std11concurrency7MessageZv ()
  D3std11concurrency10MessageBox3putMFKS3std11concurrency7MessageZv ()
  D3std11concurrency33__T5_sendTS3std11concurrency3TidZ5_sendFE3std11concurrency7MsgTypeS3std11concurrency3TidS3std11concurrency3TidZv ()
(gdb) thread 6
[Switching to thread 6 (process 114)]
0x984d6142 in semaphore_wait_signal_trap ()
(gdb) bt
Jan 31 2011
http://d.puremagic.com/issues/show_bug.cgi?id=5488

--- Okay, all issues related to this appear to have been fixed and the changes
checked in.
Feb 02 2011
http://d.puremagic.com/issues/show_bug.cgi?id=5488

SomeDude <lovelydear mailmetrash.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lovelydear mailmetrash.com

--- PDT ---
Is the problem solved on Mac OSX? This test runs on Win32 with 2.059 as long
as the process takes less than 1.3 GB of RAM on my machine, i.e. no problem
with nThreads = 40, but it hangs if nThreads = 50 (probably because it can't
allocate any more RAM). If multiplier is reduced to 10_000, it runs fine with
nThreads = 100.
Apr 22 2012