
digitalmars.D - Windows multi-threading performance issues on multi-core systems only

reply Dan <dsstruthers yahoo.com> writes:
I have a question regarding a performance issue I am seeing on multicore Windows
systems. I am creating many threads to do parallel tasks, and on multicore
Windows systems the performance is abysmal. If I use task manager to set the
processor affinity to a single CPU, the program runs as I would expect.
Without that, it takes about 10 times as long to complete.

Am I doing something wrong?  I have tried DMD 2.0.37 and DMD 1.0.53 with the
same results, running the binary on both a dual-core P4 and a newer Core2 duo.

Any help is greatly appreciated!
Dec 14 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Dan (dsstruthers yahoo.com)'s article
 I have a question regarding performance issue I am seeing on multicore Windows
systems. I am creating many threads to do parallel tasks, and on multicore Windows systems the performance is abysmal. If I use task manager to set the processor affinity to a single CPU, the program runs as I would expect. Without that, it takes about 10 times as long to complete.
 Am I doing something wrong?  I have tried DMD 2.0.37 and DMD 1.0.53 with the
same results, running the binary on both a dual-core P4 and a newer Core2 duo.
 Any help is greatly appreciated!
I've seen this happen before. Without knowing the details of your code, my best
guess is that you're getting a lot of contention for the GC lock. (It could also
be some other lock, but if it were, there's a good chance you'd already know it
because it wouldn't be hidden.)

The current GC design isn't very multithreading-friendly yet. It requires a lock
on every allocation. Furthermore, the array append operator (~=) currently takes
the GC lock on **every append** to query the GC for info about the memory block
that the array points to. There's been plenty of talk about what should be done
to eliminate this, but nothing has been implemented so far.

Assuming I am right about why your code is so slow, here's how to deal with it:

1. Cut down on unnecessary memory allocations. Use structs instead of classes
   where it makes sense.

2. Try to stack allocate stuff. alloca is your friend.

3. Pre-allocate arrays if you know ahead of time how long they're supposed to
   be. If you don't know how long they're supposed to be, use std.array.Appender
   (in D2) for now until a better solution gets implemented. Never use ~= in
   multithreaded code that gets executed a lot.
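To make (3) concrete, here is a minimal sketch of the unknown-length case with
std.array.Appender (the function name and the filtering condition are made up
for the example; the point is put() instead of ~=):

import std.array : appender;

// Build an array whose final length isn't known up front. Appender keeps its
// own capacity bookkeeping, so it doesn't have to query the GC about the
// memory block on every element the way ~= currently does.
uint[] evensBelow(uint limit) {
    auto app = appender!(uint[])();
    for (uint i = 0; i < limit; i++) {
        if (i % 2 == 0)
            app.put(i);
    }
    return app.data;
}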
Dec 14 2009
next sibling parent zsxxsz <zhengshuxin hexun.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 == Quote from Dan (dsstruthers yahoo.com)'s article
 I have a question regarding performance issue I am seeing on multicore Windows
systems. I am creating many threads to do parallel tasks, and on multicore Windows systems the performance is abysmal. If I use task manager to set the processor affinity to a single CPU, the program runs as I would expect. Without that, it takes about 10 times as long to complete.
 Am I doing something wrong?  I have tried DMD 2.0.37 and DMD 1.0.53 with the
same results, running the binary on both a dual-core P4 and a newer Core2 duo.
 Any help is greatly appreciated!
I've seen this happen before. Without knowing the details of your code, my best
guess is that you're getting a lot of contention for the GC lock. (It could also
be some other lock, but if it were, there's a good chance you'd already know it
because it wouldn't be hidden.)

The current GC design isn't very multithreading-friendly yet. It requires a lock
on every allocation. Furthermore, the array append operator (~=) currently takes
the GC lock on **every append** to query the GC for info about the memory block
that the array points to. There's been plenty of talk about what should be done
to eliminate this, but nothing has been implemented so far.

Assuming I am right about why your code is so slow, here's how to deal with it:

1. Cut down on unnecessary memory allocations. Use structs instead of classes
   where it makes sense.

2. Try to stack allocate stuff. alloca is your friend.

3. Pre-allocate arrays if you know ahead of time how long they're supposed to
   be. If you don't know how long they're supposed to be, use std.array.Appender
   (in D2) for now until a better solution gets implemented. Never use ~= in
   multithreaded code that gets executed a lot.
Yes, I've seen this before, too. But in my multi-threaded code the allocations
can't be avoided, so D's GC should improve its multi-threaded performance.
Dec 15 2009
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 14 Dec 2009 21:18:28 -0500, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Dan (dsstruthers yahoo.com)'s article
 I have a question regarding performance issue I am seeing on multicore  
 Windows
systems. I am creating many threads to do parallel tasks, and on multicore Windows systems the performance is abysmal. If I use task manager to set the processor affinity to a single CPU, the program runs as I would expect. Without that, it takes about 10 times as long to complete.
 Am I doing something wrong?  I have tried DMD 2.0.37 and DMD 1.0.53  
 with the
same results, running the binary on both a dual-core P4 and a newer Core2 duo.
 Any help is greatly appreciated!
I've seen this happen before. Without knowing the details of your code, my best
guess is that you're getting a lot of contention for the GC lock. (It could also
be some other lock, but if it were, there's a good chance you'd already know it
because it wouldn't be hidden.)

The current GC design isn't very multithreading-friendly yet. It requires a lock
on every allocation. Furthermore, the array append operator (~=) currently takes
the GC lock on **every append** to query the GC for info about the memory block
that the array points to. There's been plenty of talk about what should be done
to eliminate this, but nothing has been implemented so far.
I would suspect something else. I would expect actually that in an
allocation-heavy design, running on multiple cores should be at *least* as fast
as running on a single core. He also only has 2 cores. For splitting the
parallel tasks to 2 cores to take 10x longer is very alarming. I would suspect
application design before the GC in this case.

If it's a fundamental D issue, then we need to fix it ASAP, especially since D2
is supposed to be (among other things) an upgrade for multi-core.

Maybe I'm wrong, is there a good test case to prove it is worse on multiple
cores?

-Steve
Dec 15 2009
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 Maybe I'm wrong, is there a good test case to prove it is worse on
 multiple cores?
 -Steve
This is a synthetic, extreme corner case benchmark, but I think it hammers home
the point that such problems are at least plausible. Tested on a Core 2 Quad
with -O -inline -release:

import std.stdio, std.perf, core.thread;

void main() {
    writeln("Set affinity, then press enter.");
    readln();

    auto pc = new PerformanceCounter;
    pc.start;

    enum nThreads = 4;
    auto threads = new Thread[nThreads];
    foreach(ref thread; threads) {
        thread = new Thread(&doAppending);
        thread.start();
    }

    foreach(thread; threads) {
        thread.join();
    }

    pc.stop;
    writeln(pc.milliseconds);
}

void doAppending() {
    uint[] arr;
    foreach(i; 0..1_000_000) {
        arr ~= i;
    }
}

Timing with affinity for all 4 CPUs:  28390 milliseconds

Timing with affinity for only 1 CPU:  533 milliseconds
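Incidentally, if anyone would rather not click around in Task Manager, the
affinity can also be pinned from inside the program. A rough sketch, with the
two Win32 calls declared by hand so it doesn't matter which Windows bindings
your compiler ships (pinToOneCpu is just a name for the example):

extern (Windows) {
    alias void* HANDLE;
    HANDLE GetCurrentProcess();
    int SetProcessAffinityMask(HANDLE hProcess, size_t dwProcessAffinityMask);
}

// Restrict the whole process to CPU 0, so the single-CPU timing can be taken
// without touching Task Manager.
void pinToOneCpu() {
    SetProcessAffinityMask(GetCurrentProcess(), 1);
}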
Dec 15 2009
next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2009-12-15 10:28:37 -0500, dsimcha <dsimcha yahoo.com> said:

 void doAppending() {
     uint[] arr;
     foreach(i; 0..1_000_000) {
         arr ~= i;
     }
 }
For comparison's sake, you might want to compare that with malloc/realloc:

import core.stdc.stdlib : realloc;

void doAppending() {
    uint* arr = null;
    foreach(i; 0..1_000_000) {
        arr = cast(uint*)realloc(arr, uint.sizeof * (i+1));
        arr[i] = i;
    }
    // leak arr
}

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 15 2009
prev sibling next sibling parent reply Dan <dsstruthers yahoo.com> writes:
dsimcha Wrote:

 == Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 Maybe I'm wrong, is there a good test case to prove it is worse on
 multiple cores?
 -Steve
This is a synthetic, extreme corner case benchmark, but I think it hammers home
the point that such problems are at least plausible. Tested on a Core 2 Quad
with -O -inline -release:

import std.stdio, std.perf, core.thread;

void main() {
    writeln("Set affinity, then press enter.");
    readln();

    auto pc = new PerformanceCounter;
    pc.start;

    enum nThreads = 4;
    auto threads = new Thread[nThreads];
    foreach(ref thread; threads) {
        thread = new Thread(&doAppending);
        thread.start();
    }

    foreach(thread; threads) {
        thread.join();
    }

    pc.stop;
    writeln(pc.milliseconds);
}

void doAppending() {
    uint[] arr;
    foreach(i; 0..1_000_000) {
        arr ~= i;
    }
}

Timing with affinity for all 4 CPUs:  28390 milliseconds

Timing with affinity for only 1 CPU:  533 milliseconds
My code does do considerable array appending, and I see exactly the same issue as dsimcha points out above. I would expect it is GC related, but why for multiple cores only, I cannot fathom. Thanks for the repro, dsimcha! My code snippet would not have been as straight-forward.
Dec 15 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Dan (dsstruthers yahoo.com)'s article
 My code does do considerable array appending, and I see exactly the same issue
as dsimcha points out above. I would expect it is GC related, but why for multiple cores only, I cannot fathom.
 Thanks for the repro, dsimcha!  My code snippet would not have been as
straight-forward.

Two reasons:

1. When GC info is queried, the last query is cached for (relatively) efficient
   array appending. However, this cache is not thread-local. Therefore, if
   you're appending to arrays in two different threads simultaneously, they'll
   both keep evicting each other's cached GC info, causing it to have to be
   looked up again and again.

2. Every array append requires a lock acquisition. This is much more expensive
   if there's contention.

Bottom line: Array appending in multithreaded code is **horribly** broken and
I'm glad it's being brought to people's attention. For a temporary fix,
pre-allocate or use std.array.Appender.
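To illustrate the "pre-allocate" part of the fix, the doAppending loop from the
benchmark above could be rewritten like this (just a sketch of the workaround,
not the code that produced the timings):

void doAppendingPrealloc() {
    // One allocation (and one GC lock acquisition) per thread, then plain
    // stores: no per-append lock and no block-info lookups to thrash.
    auto arr = new uint[1_000_000];
    foreach(i; 0..1_000_000) {
        arr[i] = i;
    }
}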
Dec 15 2009
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 15 Dec 2009 12:23:01 -0500, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Dan (dsstruthers yahoo.com)'s article
 My code does do considerable array appending, and I see exactly the  
 same issue
as dsimcha points out above. I would expect it is GC related, but why for multiple cores only, I cannot fathom.
 Thanks for the repro, dsimcha!  My code snippet would not have been as
straight-forward.

Two reasons:

1. When GC info is queried, the last query is cached for (relatively) efficient
   array appending. However, this cache is not thread-local. Therefore, if
   you're appending to arrays in two different threads simultaneously, they'll
   both keep evicting each other's cached GC info, causing it to have to be
   looked up again and again.

2. Every array append requires a lock acquisition. This is much more expensive
   if there's contention.

Bottom line: Array appending in multithreaded code is **horribly** broken and
I'm glad it's being brought to people's attention. For a temporary fix,
pre-allocate or use std.array.Appender.
Yes, but why does multiple cores make the problem worse? If it's the lock, then
I'd expect just locking in multiple threads without any appending does worse on
multiple cores than on a single core. If it's the lookup, why does it take
longer to lookup on multiple cores? The very idea that multiple cores makes
threading code *slower* goes against everything I've ever heard about
multi-core and threads.

I agree array appending for multithreaded code is not as efficient as it is
when you use a dedicated append-friendly object, but it's a compromise between
efficiency of appending (which arguably is not that common) and efficiency for
everything else (slicing, passing to a function, etc). I expect that very soon
we will have efficient appending with thread local caching of lookups, but the
multi-core thing really puzzles me.

-Steve
Dec 15 2009
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 Yes, but why does multiple cores make the problem worse?  If it's the
 lock, then I'd expect just locking in multiple threads without any
 appending does worse on multiple cores than on a single core.
It does.

import std.stdio, std.perf, core.thread;

void main() {
    writeln("Set affinity, then press enter.");
    readln();

    auto pc = new PerformanceCounter;
    pc.start;

    enum nThreads = 4;
    auto threads = new Thread[nThreads];
    foreach(ref thread; threads) {
        thread = new Thread(&doStuff);
        thread.start();
    }

    foreach(thread; threads) {
        thread.join();
    }

    pc.stop;
    writeln(pc.milliseconds);
}

void doStuff() {
    foreach(i; 0..1_000_000) {
        synchronized {}
    }
}

Timing with affinity for all CPUs:  20772 ms.

Timing with affinity for 1 CPU:  156 ms.

Heavy lock contention **kills** multithreaded code because not only do you
serialize everything, but the OS has to perform a context switch on every
contention. I posted about a year ago that using spinlocks in the GC massively
sped things up, at least in synthetic benchmarks, if you have heavy contention
and multiple cores. See
http://www.digitalmars.com/d/archives/digitalmars/D/More_on_GC_spinlocks_80485.html .
However, this post largely got ignored.
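For reference, the spinlock idea is nothing fancy: busy-wait on an atomic flag
so that a short, contended critical section never pays for a kernel transition
and a context switch. A bare-bones sketch using core.atomic (this is an
illustration, not the code from that old post and not what the GC actually
uses):

import core.atomic;

struct SpinLock {
    private shared bool locked;

    void lock() {
        // Spin in user space until the flag is acquired. For critical
        // sections as short as the GC's, this avoids descheduling the
        // thread on every contended acquire.
        while (!cas(&locked, false, true)) {}
    }

    void unlock() {
        atomicStore(locked, false);
    }
}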
 If it's the
 lookup, why does it take longer to lookup on multiple cores?
Because appending to multiple arrays simultaneously (whether on single or multiple cores) causes each array's append to evict the other array's append from the GC block info cache. If you set the affinity to only 1 CPU, this only happens once per context switch.
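To picture why, here is a conceptual sketch of the cache being described (an
illustration only, not the actual druntime source):

// Think of the runtime as keeping roughly one shared cache entry:
struct BlockInfo {
    void*  base;   // start of the memory block backing an array
    size_t size;   // its usable capacity
}
__gshared BlockInfo lastLookup;   // one entry, shared by every thread

// Each ~= checks lastLookup first. With two threads appending to two
// different arrays, each append overwrites the entry the other thread just
// cached, so nearly every append falls back to the slow lookup under the
// global GC lock. Pinned to one CPU, a thread keeps the entry warm for its
// whole timeslice and only loses it at a context switch.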
Dec 15 2009
prev sibling parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
dsimcha <dsimcha yahoo.com> wrote:

 This is a synthetic, extreme corner case benchmark, but I think it  
 hammers home
 the point that such problems are at least plausible.  Tested on a Core 2  
 Quad with
 -O -inline -release:
 Timing with affinity for all 4 CPUs:  28390 milliseconds

 Timing with affinity for only 1 CPU:  533 milliseconds
Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales
roughly inverse linearly with number of threads:

163ms for 1,
364ms for 2,
886ms for 4

This is quite different from your numbers, though.

--
Simen
Dec 15 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 dsimcha <dsimcha yahoo.com> wrote:
 This is a synthetic, extreme corner case benchmark, but I think it
 hammers home
 the point that such problems are at least plausible.  Tested on a Core 2
 Quad with
 -O -inline -release:
 Timing with affinity for all 4 CPUs:  28390 milliseconds

 Timing with affinity for only 1 CPU:  533 milliseconds
Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales
roughly inverse linearly with number of threads:

163ms for 1,
364ms for 2,
886ms for 4

This is quite different from your numbers, though.
Yea, forgot to mention my numbers were on Win XP. Maybe Windows 7 critical sections are better implemented or something. Can a few other people with a variety of OS's run this benchmark and post their numbers?
Dec 15 2009
next sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
dsimcha wrote:
 == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 dsimcha <dsimcha yahoo.com> wrote:
 Timing with affinity for all 4 CPUs:  28390 milliseconds

 Timing with affinity for only 1 CPU:  533 milliseconds
Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales roughly inverse linearly with number of threads: 163ms for 1, 364ms for 2, 886ms for 4
Maybe you mean core 2 quad?
 This is quite different from your numbers, though.
Yea, forgot to mention my numbers were on Win XP. Maybe Windows 7 critical sections are better implemented or something. Can a few other people with a variety of OS's run this benchmark and post their numbers?
Core 2 duo laptop, 1.8GHz, XP Pro:

  860ms   1 core
11369ms   2 cores
Dec 15 2009
parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
On Wed, 16 Dec 2009 03:01:54 +0100, Sergey Gromov <snake.scaly gmail.com>  
wrote:

 dsimcha wrote:
 == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 dsimcha <dsimcha yahoo.com> wrote:
 Timing with affinity for all 4 CPUs:  28390 milliseconds

 Timing with affinity for only 1 CPU:  533 milliseconds
Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales roughly inverse linearly with number of threads: 163ms for 1, 364ms for 2, 886ms for 4
Maybe you mean core 2 quad?
Nope. Tried with 4 threads to see what the difference was. -- Simen
Dec 15 2009
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-12-15 19:49:43 -0500, dsimcha <dsimcha yahoo.com> said:

 == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It
 scales roughly inverse linearly with number of threads:
 163ms for 1,
 364ms for 2,
 886ms for 4
 This is quite different from your numbers, though.
Yea, forgot to mention my numbers were on Win XP. Maybe Windows 7 critical sections are better implemented or something. Can a few other people with a variety of OS's run this benchmark and post their numbers?
Core 2 Duo / Mac OS X 10.6 / 4 threads:

Crystal:~ mifo$ ./test
Set affinity, then press enter.

Bus error

Runs for about 18 seconds, then crashes. At first glance, it looks as if the
Thread class is broken and for some reason I get a null dereference when a
thread finishes. Great!

Anyway, I've done some sampling on the program while it runs, and each of the
worker threads spends about 85% of its time inside _d_monitorenter and 11% in
_d_monitorleave soon after starting the program, which later becomes 88% and 7%
respectively soon before the program finishes.

The funny thing is that if I just bypass the GC like this:

import core.stdc.stdlib : realloc;

void doAppending() {
    uint* arr = null;
    foreach(i; 0..1_000_000) {
        arr = cast(uint*)realloc(arr, uint.sizeof * (i+1));
        arr[i] = i;
    }
    // leak arr
}

it finishes (I mean crashes) in less than half a second. So it looks like
realloc does a much better job of locking its data structure than the GC.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 15 2009
parent reply Jacob Carlborg <doob me.com> writes:
On 12/16/09 03:44, Michel Fortin wrote:
 On 2009-12-15 19:49:43 -0500, dsimcha <dsimcha yahoo.com> said:

 == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It
 scales roughly inverse linearly with number of threads:
 163ms for 1,
 364ms for 2,
 886ms for 4
 This is quite different from your numbers, though.
Yea, forgot to mention my numbers were on Win XP. Maybe Windows 7 critical sections are better implemented or something. Can a few other people with a variety of OS's run this benchmark and post their numbers?
Core 2 Duo / Mac OS X 10.6 / 4 threads:

Crystal:~ mifo$ ./test
Set affinity, then press enter.

Bus error

Runs for about 18 seconds, then crashes. At first glance, it looks as if the
Thread class is broken and for some reason I get a null dereference when a
thread finishes. Great!

Anyway, I've done some sampling on the program while it runs, and each of the
worker threads spends about 85% of its time inside _d_monitorenter and 11% in
_d_monitorleave soon after starting the program, which later becomes 88% and 7%
respectively soon before the program finishes.

The funny thing is that if I just bypass the GC like this:

import core.stdc.stdlib : realloc;

void doAppending() {
    uint* arr = null;
    foreach(i; 0..1_000_000) {
        arr = cast(uint*)realloc(arr, uint.sizeof * (i+1));
        arr[i] = i;
    }
    // leak arr
}

it finishes (I mean crashes) in less than half a second. So it looks like
realloc does a much better job of locking its data structure than the GC.
It runs fine on Mac OS X 10.5 with dmd 2.037.
Dec 16 2009
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-12-16 07:08:30 -0500, Jacob Carlborg <doob me.com> said:

 On 12/16/09 03:44, Michel Fortin wrote:
 Core 2 Duo / Mac OS X 10.6 / 4 threads:
 
 Crystal:~ mifo$ ./test
 Set affinity, then press enter.
 
 Bus error
 
 Runs for about 18 seconds, then crashes. At first glance, it looks as if
 the Thread class is broken and for some reason I get a null dereference
 when a thread finishes. Great!
It runs fine on Mac OS X 10.5 with dmd 2.037.
Then that would be specific to 10.6. I've filed a bug:
http://d.puremagic.com/issues/show_bug.cgi?id=3619

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 16 2009
parent Sean Kelly <sean invisibleduck.org> writes:
Michel Fortin Wrote:

 On 2009-12-16 07:08:30 -0500, Jacob Carlborg <doob me.com> said:
 
 On 12/16/09 03:44, Michel Fortin wrote:
 Core 2 Duo / Mac OS X 10.6 / 4 threads:
 
 Crystal:~ mifo$ ./test
 Set affinity, then press enter.
 
 Bus error
 
 Runs for about 18 seconds, then crashes. At first glance, it looks as if
 the Thread class is broken and for some reason I get a null dereference
 when a thread finishes. Great!
It runs fine on Mac OS X 10.5 with dmd 2.037.
Then that would be specific to 10.6. I've filed a bug:
http://d.puremagic.com/issues/show_bug.cgi?id=3619
I haven't changed the thread code in eons, so I'm currently attributing this to either the change in how static arrays are passed or some other compiler issue. I'm using 10.6 myself, so I'll give it a look.
Dec 16 2009
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Steven Schveighoffer Wrote:
 
 I would suspect something else.  I would expect actually that in an  
 allocation-heavy design, running on multiple cores should be at *least* as  
 fast as running on a single core.  He also only has 2 cores.  For  
 splitting the parallel tasks to 2 cores to take 10x longer is very  
 alarming.  I would suspect application design before the GC in this case.   
 If it's a fundamental D issue, then we need to fix it ASAP, especially  
 since D2 is supposed to be (among other things) an upgrade for multi-core.
Assuming he's right, the only thing I can come up with that makes even the remotest sense is that SuspendThread is ridiculously slow for threads on other cores, but effectively a no-op for threads on the same core. But the Boehm GC works the same way as far as I know, and if it had such problems I'm sure we'd have heard about it. It also doesn't seem reasonable that the Windows kernel team would be able to keep something so broken in place.
Dec 15 2009
parent retard <re tard.com.invalid> writes:
Tue, 15 Dec 2009 11:15:04 -0500, Sean Kelly wrote:

 Steven Schveighoffer Wrote:
 
 I would suspect something else.  I would expect actually that in an
 allocation-heavy design, running on multiple cores should be at *least*
 as fast as running on a single core.  He also only has 2 cores.  For
 splitting the parallel tasks to 2 cores to take 10x longer is very
 alarming.  I would suspect application design before the GC in this
 case. If it's a fundamental D issue, then we need to fix it ASAP,
 especially since D2 is supposed to be (among other things) an upgrade
 for multi-core.
Assuming he's right, the only thing I can come up with that makes even the remotest sense is that SuspendThread is ridiculously slow for threads on other cores, but effectively a no-op for threads on the same core. But the Boehm GC works the same way as far as I know, and if it had such problems I'm sure we'd have heard about it. It also doesn't seem reasonable that the Windows kernel team would be able to keep something so broken in place.
FWIW, memory allocations decrease the throughput on the great Sun Hotspot GC too: http://java.sun.com/docs/hotspot/gc5.0/ in the user code.
Dec 15 2009