digitalmars.D - Windows multi-threading performance issues on multi-core systems only
- Dan (3/3) Dec 14 2009 I have a question regarding performance issue I am seeing on multicore W...
- dsimcha (23/26) Dec 14 2009 systems. I am creating many threads to do parallel tasks, and on multic...
- zsxxsz (3/29) Dec 15 2009 Yes, I've seen this before, too. But in my muti-threads, the alloc opera...
- Steven Schveighoffer (11/41) Dec 15 2009 I would suspect something else. I would expect actually that in an
- dsimcha (30/33) Dec 15 2009 This is a synthetic, extreme corner case benchmark, but I think it hamme...
- Michel Fortin (14/20) Dec 15 2009 For comparison's sake, you might want to compare that with malloc/reallo...
- Dan (3/46) Dec 15 2009 My code does do considerable array appending, and I see exactly the same...
- dsimcha (15/17) Dec 15 2009 as dsimcha points out above. I would expect it is GC related, but why f...
- Steven Schveighoffer (14/40) Dec 15 2009 Yes, but why does multiple cores make the problem worse? If it's the
- dsimcha (38/43) Dec 15 2009 It does.
- Simen kjaeraas (9/16) Dec 15 2009 Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It
- dsimcha (4/19) Dec 15 2009 Yea, forgot to mention my numbers were on Win XP. Maybe Windows 7 criti...
- Sergey Gromov (5/20) Dec 15 2009 Core 2 duo laptop, 1.8GHz, XP Pro:
- Simen kjaeraas (5/17) Dec 15 2009 Nope. Tried with 4 threads to see what the difference was.
- Michel Fortin (28/39) Dec 15 2009 Core 2 Duo / Mac OS X 10.6 / 4 threads:
- Jacob Carlborg (2/39) Dec 16 2009 It runs fine on Mac OS X 10.5 with dmd 2.037.
- Michel Fortin (7/20) Dec 16 2009 Then that would be specific to 10.6. I've filled a bug.
- Sean Kelly (2/21) Dec 16 2009 I haven't changed the thread code in eons, so I'm currently attributing ...
- Sean Kelly (2/10) Dec 15 2009 Assuming he's right, the only thing I can come up with that makes even t...
- retard (5/23) Dec 15 2009 FWIW, memory allocations decrease the throughput on the great Sun Hotspo...
I have a question regarding a performance issue I am seeing on multicore Windows
systems. I am creating many threads to do parallel tasks, and on multicore Windows
systems the performance is abysmal. If I use task manager to set the processor
affinity to a single CPU, the program runs as I would expect. Without that, it
takes about 10 times as long to complete.

Am I doing something wrong? I have tried DMD 2.0.37 and DMD 1.0.53 with the same
results, running the binary on both a dual-core P4 and a newer Core2 duo.

Any help is greatly appreciated!
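For what it's worth, the same pinning can be done from code instead of through
Task Manager. A rough sketch using the Win32 affinity API, with the
extern(Windows) declarations written out by hand (whether your druntime already
declares them depends on the version, and the helper function name is made up):

// Sketch only: pin the current process to CPU 0 via the Win32 API, which is
// what the Task Manager "Set Affinity..." dialog does under the hood.
// Hand-written declarations: HANDLE = void*, BOOL = int, DWORD_PTR = size_t.
extern (Windows)
{
    void* GetCurrentProcess();
    int   SetProcessAffinityMask(void* hProcess, size_t dwProcessAffinityMask);
}

// Hypothetical helper, not part of any library.
void pinToFirstCpu()
{
    // Bit mask of allowed CPUs; 0x1 means "CPU 0 only".
    SetProcessAffinityMask(GetCurrentProcess(), 0x1);
}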
Dec 14 2009
== Quote from Dan (dsstruthers yahoo.com)'s article
> I have a question regarding performance issue I am seeing on multicore Windows
> systems. I am creating many threads to do parallel tasks, and on multicore
> Windows systems the performance is abysmal. If I use task manager to set the
> processor affinity to a single CPU, the program runs as I would expect.
> Without that, it takes about 10 times as long to complete.
> Am I doing something wrong? I have tried DMD 2.0.37 and DMD 1.0.53 with the
> same results, running the binary on both a dual-core P4 and a newer Core2 duo.
> Any help is greatly appreciated!

I've seen this happen before.  Without knowing the details of your code, my best
guess is that you're getting a lot of contention for the GC lock.  (It could also
be some other lock, but if it were, there's a good chance you'd already know it
because it wouldn't be hidden.)  The current GC design isn't very
multithreading-friendly yet.  It requires a lock on every allocation.
Furthermore, the array append operator (~=) currently takes the GC lock on
**every append** to query the GC for info about the memory block that the array
points to.  There's been plenty of talk about what should be done to eliminate
this, but nothing has been implemented so far.

Assuming I am right about why your code is so slow, here's how to deal with it:

1.  Cut down on unnecessary memory allocations.  Use structs instead of classes
where it makes sense.

2.  Try to stack allocate stuff.  alloca is your friend.

3.  Pre-allocate arrays if you know ahead of time how long they're supposed to
be.  If you don't know how long they're supposed to be, use std.array.Appender
(in D2) for now until a better solution gets implemented.  Never use ~= in
multithreaded code that gets executed a lot.
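A minimal sketch of point 3 above: pre-allocate when the final length is known,
otherwise build the array through std.array.Appender instead of ~=. The helper
names are made up for illustration, and Appender's exact API has shifted between
Phobos releases; the appender!() helper form shown here is the later one.

import std.array : appender;

// Known final length: allocate once, then write by index, so there is no
// per-element trip through the GC's append machinery.
uint[] buildKnownLength(size_t n)
{
    auto arr = new uint[n];
    foreach (i; 0 .. n)
        arr[i] = cast(uint) i;
    return arr;
}

// Unknown final length: Appender tracks capacity itself instead of asking
// the GC about the block on every ~=.
uint[] buildUnknownLength(size_t n)
{
    auto app = appender!(uint[])();
    foreach (i; 0 .. n)
        app.put(cast(uint) i);
    return app.data;
}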
Dec 14 2009
== Quote from dsimcha (dsimcha yahoo.com)'s article
> == Quote from Dan (dsstruthers yahoo.com)'s article
>> I have a question regarding performance issue I am seeing on multicore Windows
>> systems. I am creating many threads to do parallel tasks, and on multicore
>> Windows systems the performance is abysmal. If I use task manager to set the
>> processor affinity to a single CPU, the program runs as I would expect.
>> Without that, it takes about 10 times as long to complete.
>> Am I doing something wrong? I have tried DMD 2.0.37 and DMD 1.0.53 with the
>> same results, running the binary on both a dual-core P4 and a newer Core2 duo.
>> Any help is greatly appreciated!
> I've seen this happen before.  Without knowing the details of your code, my
> best guess is that you're getting a lot of contention for the GC lock.  (It
> could also be some other lock, but if it were, there's a good chance you'd
> already know it because it wouldn't be hidden.)  The current GC design isn't
> very multithreading-friendly yet.  It requires a lock on every allocation.
> Furthermore, the array append operator (~=) currently takes the GC lock on
> **every append** to query the GC for info about the memory block that the
> array points to.  There's been plenty of talk about what should be done to
> eliminate this, but nothing has been implemented so far.
> Assuming I am right about why your code is so slow, here's how to deal with it:
> 1.  Cut down on unnecessary memory allocations.  Use structs instead of classes
> where it makes sense.
> 2.  Try to stack allocate stuff.  alloca is your friend.
> 3.  Pre-allocate arrays if you know ahead of time how long they're supposed to
> be.  If you don't know how long they're supposed to be, use std.array.Appender
> (in D2) for now until a better solution gets implemented.  Never use ~= in
> multithreaded code that gets executed a lot.

Yes, I've seen this before, too. But in my multi-threaded code the alloc
operations can't be avoided, so D's GC should improve its performance for
multiple threads.
Dec 15 2009
On Mon, 14 Dec 2009 21:18:28 -0500, dsimcha <dsimcha yahoo.com> wrote:

> == Quote from Dan (dsstruthers yahoo.com)'s article
>> I have a question regarding performance issue I am seeing on multicore Windows
>> systems. I am creating many threads to do parallel tasks, and on multicore
>> Windows systems the performance is abysmal. If I use task manager to set the
>> processor affinity to a single CPU, the program runs as I would expect.
>> Without that, it takes about 10 times as long to complete.
>> Am I doing something wrong? I have tried DMD 2.0.37 and DMD 1.0.53 with the
>> same results, running the binary on both a dual-core P4 and a newer Core2 duo.
>> Any help is greatly appreciated!
> I've seen this happen before.  Without knowing the details of your code, my
> best guess is that you're getting a lot of contention for the GC lock.  (It
> could also be some other lock, but if it were, there's a good chance you'd
> already know it because it wouldn't be hidden.)  The current GC design isn't
> very multithreading-friendly yet.  It requires a lock on every allocation.
> Furthermore, the array append operator (~=) currently takes the GC lock on
> **every append** to query the GC for info about the memory block that the
> array points to.  There's been plenty of talk about what should be done to
> eliminate this, but nothing has been implemented so far.

I would suspect something else.  I would expect actually that in an
allocation-heavy design, running on multiple cores should be at *least* as fast
as running on a single core.  He also only has 2 cores.  For splitting the
parallel tasks to 2 cores to take 10x longer is very alarming.  I would suspect
application design before the GC in this case.  If it's a fundamental D issue,
then we need to fix it ASAP, especially since D2 is supposed to be (among other
things) an upgrade for multi-core.

Maybe I'm wrong, is there a good test case to prove it is worse on multiple
cores?

-Steve
Dec 15 2009
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
> Maybe I'm wrong, is there a good test case to prove it is worse on multiple
> cores?
> -Steve

This is a synthetic, extreme corner case benchmark, but I think it hammers home
the point that such problems are at least plausible.  Tested on a Core 2 Quad
with -O -inline -release:

import std.stdio, std.perf, core.thread;

void main() {
    writeln("Set affinity, then press enter.");
    readln();

    auto pc = new PerformanceCounter;
    pc.start;

    enum nThreads = 4;
    auto threads = new Thread[nThreads];
    foreach(ref thread; threads) {
        thread = new Thread(&doAppending);
        thread.start();
    }

    foreach(thread; threads) {
        thread.join();
    }

    pc.stop;
    writeln(pc.milliseconds);
}

void doAppending() {
    uint[] arr;
    foreach(i; 0..1_000_000) {
        arr ~= i;
    }
}

Timing with affinity for all 4 CPUs:  28390 milliseconds
Timing with affinity for only 1 CPU:  533 milliseconds
Dec 15 2009
On 2009-12-15 10:28:37 -0500, dsimcha <dsimcha yahoo.com> said:

> void doAppending() {
>     uint[] arr;
>     foreach(i; 0..1_000_000) {
>         arr ~= i;
>     }
> }

For comparison's sake, you might want to compare that with malloc/realloc:

import core.stdc.stdlib;  // for realloc

void doAppending() {
    uint* arr = null;
    foreach(i; 0..1_000_000) {
        arr = cast(uint*) realloc(arr, (uint*).sizeof * (i+1));
        arr[i] = i;
    }
    // leak arr
}

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 15 2009
dsimcha Wrote:

> == Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
>> Maybe I'm wrong, is there a good test case to prove it is worse on multiple
>> cores?
>> -Steve
>
> This is a synthetic, extreme corner case benchmark, but I think it hammers home
> the point that such problems are at least plausible.  Tested on a Core 2 Quad
> with -O -inline -release:
>
> import std.stdio, std.perf, core.thread;
>
> void main() {
>     writeln("Set affinity, then press enter.");
>     readln();
>
>     auto pc = new PerformanceCounter;
>     pc.start;
>
>     enum nThreads = 4;
>     auto threads = new Thread[nThreads];
>     foreach(ref thread; threads) {
>         thread = new Thread(&doAppending);
>         thread.start();
>     }
>
>     foreach(thread; threads) {
>         thread.join();
>     }
>
>     pc.stop;
>     writeln(pc.milliseconds);
> }
>
> void doAppending() {
>     uint[] arr;
>     foreach(i; 0..1_000_000) {
>         arr ~= i;
>     }
> }
>
> Timing with affinity for all 4 CPUs:  28390 milliseconds
> Timing with affinity for only 1 CPU:  533 milliseconds

My code does do considerable array appending, and I see exactly the same issue
as dsimcha points out above.  I would expect it is GC related, but why for
multiple cores only, I cannot fathom.

Thanks for the repro, dsimcha!  My code snippet would not have been as
straight-forward.
Dec 15 2009
== Quote from Dan (dsstruthers yahoo.com)'s article
> My code does do considerable array appending, and I see exactly the same issue
> as dsimcha points out above.  I would expect it is GC related, but why for
> multiple cores only, I cannot fathom.
> Thanks for the repro, dsimcha!  My code snippet would not have been as
> straight-forward.

Two reasons:

1.  When GC info is queried, the last query is cached for (relatively) efficient
array appending.  However, this cache is not thread-local.  Therefore, if you're
appending to arrays in two different threads simultaneously, they'll both keep
evicting each other's cached GC info, causing it to have to be looked up again
and again.

2.  Every array append requires a lock acquisition.  This is much more expensive
if there's contention.

Bottom line:  Array appending in multithreaded code is **horribly** broken and
I'm glad it's being brought to people's attention.  For a temporary fix,
pre-allocate or use std.array.Appender.
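To make point 1 concrete, here is a toy model, not the actual druntime code and
with invented names, of a single globally shared "last block info" slot. Two
threads appending to two different arrays keep overwriting the slot, so nearly
every append falls back to the slow lookup taken under the global GC lock:

// Toy model only: NOT the real GC implementation.
struct BlockInfo
{
    void*  base;
    size_t size;
}

__gshared void*     cachedKey;   // one slot shared by every thread (the problem)
__gshared BlockInfo cachedInfo;

BlockInfo queryBlockInfo(void* p)
{
    if (p is cachedKey)
        return cachedInfo;            // fast path: last query was for this block

    // Slow path: stands in for the real lookup done while holding the GC lock.
    auto info = BlockInfo(p, 0);

    cachedKey  = p;                   // evicts whatever another thread cached
    cachedInfo = info;
    return info;
}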
Dec 15 2009
On Tue, 15 Dec 2009 12:23:01 -0500, dsimcha <dsimcha yahoo.com> wrote:

> == Quote from Dan (dsstruthers yahoo.com)'s article
>> My code does do considerable array appending, and I see exactly the same issue
>> as dsimcha points out above.  I would expect it is GC related, but why for
>> multiple cores only, I cannot fathom.
>> Thanks for the repro, dsimcha!  My code snippet would not have been as
>> straight-forward.
>
> Two reasons:
>
> 1.  When GC info is queried, the last query is cached for (relatively)
> efficient array appending.  However, this cache is not thread-local.
> Therefore, if you're appending to arrays in two different threads
> simultaneously, they'll both keep evicting each other's cached GC info,
> causing it to have to be looked up again and again.
>
> 2.  Every array append requires a lock acquisition.  This is much more
> expensive if there's contention.
>
> Bottom line:  Array appending in multithreaded code is **horribly** broken and
> I'm glad it's being brought to people's attention.  For a temporary fix,
> pre-allocate or use std.array.Appender.

Yes, but why does multiple cores make the problem worse?  If it's the lock, then
I'd expect just locking in multiple threads without any appending does worse on
multiple cores than on a single core.  If it's the lookup, why does it take
longer to lookup on multiple cores?

The very idea that multiple cores makes threading code *slower* goes against
everything I've ever heard about multi-core and threads.

I agree array appending for multithreaded code is not as efficient as it is when
you use a dedicated append-friendly object, but it's a compromise between
efficiency of appending (which arguably is not that common) and efficiency for
everything else (slicing, passing to a function, etc).  I expect that very soon
we will have efficient appending with thread local caching of lookups, but the
multi-core thing really puzzles me.

-Steve
Dec 15 2009
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
> Yes, but why does multiple cores make the problem worse?  If it's the lock,
> then I'd expect just locking in multiple threads without any appending does
> worse on multiple cores than on a single core.

It does.

import std.stdio, std.perf, core.thread;

void main() {
    writeln("Set affinity, then press enter.");
    readln();

    auto pc = new PerformanceCounter;
    pc.start;

    enum nThreads = 4;
    auto threads = new Thread[nThreads];
    foreach(ref thread; threads) {
        thread = new Thread(&doStuff);
        thread.start();
    }

    foreach(thread; threads) {
        thread.join();
    }

    pc.stop;
    writeln(pc.milliseconds);
}

void doStuff() {
    foreach(i; 0..1_000_000) {
        synchronized {}
    }
}

Timing with affinity for all CPUs:  20772 ms.
Timing with affinity for 1 CPU:  156 ms.

Heavy lock contention **kills** multithreaded code b/c not only do you serialize
everything, but the OS has to perform a context switch on every contention.  I
posted about a year ago that using spinlocks in the GC massively sped things up,
at least in synthetic benchmarks, if you have heavy contention and multiple
cores.  See
http://www.digitalmars.com/d/archives/digitalmars/D/More_on_GC_spinlocks_80485.html
.  However, this post largely got ignored.

> If it's the lookup, why does it take longer to lookup on multiple cores?

Because appending to multiple arrays simultaneously (whether on single or
multiple cores) causes each array's append to evict the other array's append
from the GC block info cache.  If you set the affinity to only 1 CPU, this only
happens once per context switch.
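For reference, a minimal spinlock sketch in the spirit of that experiment. This
is not the druntime GC code; the SpinLock name is invented, and it relies on the
present-day core.atomic API (cas, atomicStore), which may differ in older
runtimes. The point is that a waiting thread spins for a few cycles instead of
paying for a kernel context switch on every contended acquisition:

import core.atomic : cas, atomicStore;

struct SpinLock
{
    private shared size_t locked;   // 0 = free, 1 = held

    void lock()
    {
        // Busy-wait instead of blocking in the kernel; for very short critical
        // sections under heavy contention this avoids a context switch per acquire.
        while (!cas(&locked, cast(size_t) 0, cast(size_t) 1))
        { /* spin */ }
    }

    void unlock()
    {
        atomicStore(locked, cast(size_t) 0);
    }
}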
Dec 15 2009
dsimcha <dsimcha yahoo.com> wrote:

> This is a synthetic, extreme corner case benchmark, but I think it hammers home
> the point that such problems are at least plausible.  Tested on a Core 2 Quad
> with -O -inline -release:
>
> Timing with affinity for all 4 CPUs:  28390 milliseconds
> Timing with affinity for only 1 CPU:  533 milliseconds

Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales
roughly inverse linearly with number of threads:

163ms for 1, 364ms for 2, 886ms for 4

This is quite different from your numbers, though.

-- 
Simen
Dec 15 2009
== Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
> dsimcha <dsimcha yahoo.com> wrote:
>> This is a synthetic, extreme corner case benchmark, but I think it hammers
>> home the point that such problems are at least plausible.  Tested on a Core 2
>> Quad with -O -inline -release:
>> Timing with affinity for all 4 CPUs:  28390 milliseconds
>> Timing with affinity for only 1 CPU:  533 milliseconds
> Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales
> roughly inverse linearly with number of threads:
> 163ms for 1, 364ms for 2, 886ms for 4
> This is quite different from your numbers, though.

Yea, forgot to mention my numbers were on Win XP.  Maybe Windows 7 critical
sections are better implemented or something.  Can a few other people with a
variety of OS's run this benchmark and post their numbers?
Dec 15 2009
dsimcha wrote:
> == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article

Maybe you mean core 2 quad?

>> dsimcha <dsimcha yahoo.com> wrote:
>>> Timing with affinity for all 4 CPUs:  28390 milliseconds
>>> Timing with affinity for only 1 CPU:  533 milliseconds
>> Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales
>> roughly inverse linearly with number of threads:
>> 163ms for 1, 364ms for 2, 886ms for 4

Core 2 duo laptop, 1.8GHz, XP Pro:

860ms   1 core
11369ms 2 cores

>> This is quite different from your numbers, though.
> Yea, forgot to mention my numbers were on Win XP.  Maybe Windows 7 critical
> sections are better implemented or something.  Can a few other people with a
> variety of OS's run this benchmark and post their numbers?
Dec 15 2009
On Wed, 16 Dec 2009 03:01:54 +0100, Sergey Gromov <snake.scaly gmail.com> wrote:

> dsimcha wrote:
>> == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
>
> Maybe you mean core 2 quad?
>
>>> dsimcha <dsimcha yahoo.com> wrote:
>>>> Timing with affinity for all 4 CPUs:  28390 milliseconds
>>>> Timing with affinity for only 1 CPU:  533 milliseconds
>>> Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales
>>> roughly inverse linearly with number of threads:
>>> 163ms for 1, 364ms for 2, 886ms for 4

Nope. Tried with 4 threads to see what the difference was.

-- 
Simen
Dec 15 2009
On 2009-12-15 19:49:43 -0500, dsimcha <dsimcha yahoo.com> said:

> == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
>> Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales
>> roughly inverse linearly with number of threads:
>> 163ms for 1, 364ms for 2, 886ms for 4
>> This is quite different from your numbers, though.
> Yea, forgot to mention my numbers were on Win XP.  Maybe Windows 7 critical
> sections are better implemented or something.  Can a few other people with a
> variety of OS's run this benchmark and post their numbers?

Core 2 Duo / Mac OS X 10.6 / 4 threads:

Crystal:~ mifo$ ./test
Set affinity, then press enter.

Bus error

Runs for about 18 seconds, then crashes. At first glance, it looks as if the
Thread class is broken and for some reason I get a null dereference when a
thread finishes. Great!

Anyway, I've done some sampling on the program while it runs, and each of the
worker threads spends about 85% of its time inside _d_monitorenter and 11% in
_d_monitorleave soon after starting the program, which later becomes 88% and 7%
respectively soon before the program finishes.

The funny thing is that if I just bypass the GC like this:

void doAppending() {
    uint* arr = null;
    foreach(i; 0..1_000_000) {
        arr = cast(uint*)realloc(arr, (uint*).sizeof * (i+1));
        arr[i] = i;
    }
    // leak arr
}

it finishes (I mean crashes) in less than half a second. So it looks like
realloc does a much better job at locking its data structure than the GC.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 15 2009
On 12/16/09 03:44, Michel Fortin wrote:

> On 2009-12-15 19:49:43 -0500, dsimcha <dsimcha yahoo.com> said:
>> == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
>>> Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales
>>> roughly inverse linearly with number of threads:
>>> 163ms for 1, 364ms for 2, 886ms for 4
>>> This is quite different from your numbers, though.
>> Yea, forgot to mention my numbers were on Win XP.  Maybe Windows 7 critical
>> sections are better implemented or something.  Can a few other people with a
>> variety of OS's run this benchmark and post their numbers?
>
> Core 2 Duo / Mac OS X 10.6 / 4 threads:
>
> Crystal:~ mifo$ ./test
> Set affinity, then press enter.
>
> Bus error
>
> Runs for about 18 seconds, then crashes. At first glance, it looks as if the
> Thread class is broken and for some reason I get a null dereference when a
> thread finishes. Great!
>
> Anyway, I've done some sampling on the program while it runs, and each of the
> worker threads spends about 85% of its time inside _d_monitorenter and 11% in
> _d_monitorleave soon after starting the program, which later becomes 88% and
> 7% respectively soon before the program finishes.
>
> The funny thing is that if I just bypass the GC like this:
>
> void doAppending() {
>     uint* arr = null;
>     foreach(i; 0..1_000_000) {
>         arr = cast(uint*)realloc(arr, (uint*).sizeof * (i+1));
>         arr[i] = i;
>     }
>     // leak arr
> }
>
> it finishes (I mean crashes) in less than half a second. So it looks like
> realloc does a much better job at locking its data structure than the GC.

It runs fine on Mac OS X 10.5 with dmd 2.037.
Dec 16 2009
On 2009-12-16 07:08:30 -0500, Jacob Carlborg <doob me.com> said:

> On 12/16/09 03:44, Michel Fortin wrote:
>> Core 2 Duo / Mac OS X 10.6 / 4 threads:
>>
>> Crystal:~ mifo$ ./test
>> Set affinity, then press enter.
>>
>> Bus error
>>
>> Runs for about 18 seconds, then crashes. At first glance, it looks as if the
>> Thread class is broken and for some reason I get a null dereference when a
>> thread finishes. Great!
>
> It runs fine on Mac OS X 10.5 with dmd 2.037.

Then that would be specific to 10.6. I've filed a bug.

http://d.puremagic.com/issues/show_bug.cgi?id=3619

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 16 2009
Michel Fortin Wrote:

> On 2009-12-16 07:08:30 -0500, Jacob Carlborg <doob me.com> said:
>> On 12/16/09 03:44, Michel Fortin wrote:
>>> Core 2 Duo / Mac OS X 10.6 / 4 threads:
>>>
>>> Crystal:~ mifo$ ./test
>>> Set affinity, then press enter.
>>>
>>> Bus error
>>>
>>> Runs for about 18 seconds, then crashes. At first glance, it looks as if the
>>> Thread class is broken and for some reason I get a null dereference when a
>>> thread finishes. Great!
>> It runs fine on Mac OS X 10.5 with dmd 2.037.
>
> Then that would be specific to 10.6. I've filed a bug.
>
> http://d.puremagic.com/issues/show_bug.cgi?id=3619

I haven't changed the thread code in eons, so I'm currently attributing this to
either the change in how static arrays are passed or some other compiler issue.
I'm using 10.6 myself, so I'll give it a look.
Dec 16 2009
Steven Schveighoffer Wrote:

> I would suspect something else.  I would expect actually that in an
> allocation-heavy design, running on multiple cores should be at *least* as
> fast as running on a single core.  He also only has 2 cores.  For splitting
> the parallel tasks to 2 cores to take 10x longer is very alarming.  I would
> suspect application design before the GC in this case.  If it's a fundamental
> D issue, then we need to fix it ASAP, especially since D2 is supposed to be
> (among other things) an upgrade for multi-core.

Assuming he's right, the only thing I can come up with that makes even the
remotest sense is that SuspendThread is ridiculously slow for threads on other
cores, but effectively a no-op for threads on the same core.  But the Boehm GC
works the same way as far as I know, and if it had such problems I'm sure we'd
have heard about it.  It also doesn't seem reasonable that the Windows kernel
team would be able to keep something so broken in place.
Dec 15 2009
Tue, 15 Dec 2009 11:15:04 -0500, Sean Kelly wrote:

> Steven Schveighoffer Wrote:
>> I would suspect something else.  I would expect actually that in an
>> allocation-heavy design, running on multiple cores should be at *least* as
>> fast as running on a single core.  He also only has 2 cores.  For splitting
>> the parallel tasks to 2 cores to take 10x longer is very alarming.  I would
>> suspect application design before the GC in this case.  If it's a fundamental
>> D issue, then we need to fix it ASAP, especially since D2 is supposed to be
>> (among other things) an upgrade for multi-core.
> Assuming he's right, the only thing I can come up with that makes even the
> remotest sense is that SuspendThread is ridiculously slow for threads on other
> cores, but effectively a no-op for threads on the same core.  But the Boehm GC
> works the same way as far as I know, and if it had such problems I'm sure we'd
> have heard about it.  It also doesn't seem reasonable that the Windows kernel
> team would be able to keep something so broken in place.

FWIW, memory allocations decrease the throughput on the great Sun Hotspot GC
too: http://java.sun.com/docs/hotspot/gc5.0/

in the user code.
Dec 15 2009