
digitalmars.D - Windows multi-threading performance issues on multi-core systems only

reply Dan <dsstruthers yahoo.com> writes:
I have a question regarding a performance issue I am seeing on multicore Windows
systems. I am creating many threads to do parallel tasks, and on multicore
Windows systems the performance is abysmal. If I use task manager to set the
processor affinity to a single CPU, the program runs as I would expect.
Without that, it takes about 10 times as long to complete.

Am I doing something wrong?  I have tried DMD 2.0.37 and DMD 1.0.53 with the
same results, running the binary on both a dual-core P4 and a newer Core2 duo.

Any help is greatly appreciated!
Dec 14 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Dan (dsstruthers yahoo.com)'s article
 I have a question regarding performance issue I am seeing on multicore Windows
systems. I am creating many threads to do parallel tasks, and on multicore Windows systems the performance is abysmal. If I use task manager to set the processor affinity to a single CPU, the program runs as I would expect. Without that, it takes about 10 times as long to complete.
 Am I doing something wrong?  I have tried DMD 2.0.37 and DMD 1.0.53 with the
same results, running the binary on both a dual-core P4 and a newer Core2 duo.
 Any help is greatly appreciated!
I've seen this happen before. Without knowing the details of your code, my best
guess is that you're getting a lot of contention for the GC lock. (It could also
be some other lock, but if it were, there's a good chance you'd already know it
because it wouldn't be hidden.)

The current GC design isn't very multithreading-friendly yet. It requires a lock
on every allocation. Furthermore, the array append operator (~=) currently takes
the GC lock on **every append** to query the GC for info about the memory block
that the array points to. There's been plenty of talk about what should be done
to eliminate this, but nothing has been implemented so far.

Assuming I am right about why your code is so slow, here's how to deal with it:

1. Cut down on unnecessary memory allocations. Use structs instead of classes
   where it makes sense.

2. Try to stack allocate stuff. alloca is your friend.

3. Pre-allocate arrays if you know ahead of time how long they're supposed to
   be. If you don't know how long they're supposed to be, use std.array.Appender
   (in D2) for now until a better solution gets implemented. Never use ~= in
   multithreaded code that gets executed a lot.
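To make (3) concrete, here is a minimal sketch of the unknown-length case with
std.array.Appender (the function name and the filtering condition are made up
for the example; the point is put() instead of ~=):

import std.array : appender;

// Build an array whose final length isn't known up front. Appender keeps its
// own capacity bookkeeping, so it doesn't have to query the GC about the
// memory block on every element the way ~= currently does.
uint[] evensBelow(uint limit) {
    auto app = appender!(uint[])();
    for (uint i = 0; i < limit; i++) {
        if (i % 2 == 0)
            app.put(i);
    }
    return app.data;
}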
Dec 14 2009
next sibling parent zsxxsz <zhengshuxin hexun.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 == Quote from Dan (dsstruthers yahoo.com)'s article
 I have a question regarding performance issue I am seeing on multicore Windows
systems. I am creating many threads to do parallel tasks, and on multicore Windows systems the performance is abysmal. If I use task manager to set the processor affinity to a single CPU, the program runs as I would expect. Without that, it takes about 10 times as long to complete.
 Am I doing something wrong?  I have tried DMD 2.0.37 and DMD 1.0.53 with the
same results, running the binary on both a dual-core P4 and a newer Core2 duo.
 Any help is greatly appreciated!
I've seen this happen before. Without knowing the details of your code, my best
guess is that you're getting a lot of contention for the GC lock. (It could also
be some other lock, but if it were, there's a good chance you'd already know it
because it wouldn't be hidden.)

The current GC design isn't very multithreading-friendly yet. It requires a lock
on every allocation. Furthermore, the array append operator (~=) currently takes
the GC lock on **every append** to query the GC for info about the memory block
that the array points to. There's been plenty of talk about what should be done
to eliminate this, but nothing has been implemented so far.

Assuming I am right about why your code is so slow, here's how to deal with it:

1. Cut down on unnecessary memory allocations. Use structs instead of classes
   where it makes sense.

2. Try to stack allocate stuff. alloca is your friend.

3. Pre-allocate arrays if you know ahead of time how long they're supposed to
   be. If you don't know how long they're supposed to be, use std.array.Appender
   (in D2) for now until a better solution gets implemented. Never use ~= in
   multithreaded code that gets executed a lot.
Yes, I've seen this before, too. But in my multi-threaded code the allocations
can't be avoided, so D's GC should improve its multi-threaded performance.
Dec 15 2009
prev sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 14 Dec 2009 21:18:28 -0500, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Dan (dsstruthers yahoo.com)'s article
 I have a question regarding performance issue I am seeing on multicore  
 Windows
systems. I am creating many threads to do parallel tasks, and on multicore Windows systems the performance is abysmal. If I use task manager to set the processor affinity to a single CPU, the program runs as I would expect. Without that, it takes about 10 times as long to complete.
 Am I doing something wrong?  I have tried DMD 2.0.37 and DMD 1.0.53  
 with the
same results, running the binary on both a dual-core P4 and a newer Core2 duo.
 Any help is greatly appreciated!
I've seen this happen before. Without knowing the details of your code, my best
guess is that you're getting a lot of contention for the GC lock. (It could also
be some other lock, but if it were, there's a good chance you'd already know it
because it wouldn't be hidden.)

The current GC design isn't very multithreading-friendly yet. It requires a lock
on every allocation. Furthermore, the array append operator (~=) currently takes
the GC lock on **every append** to query the GC for info about the memory block
that the array points to. There's been plenty of talk about what should be done
to eliminate this, but nothing has been implemented so far.
I would suspect something else. I would expect actually that in an
allocation-heavy design, running on multiple cores should be at *least* as fast
as running on a single core. He also only has 2 cores. For splitting the
parallel tasks to 2 cores to take 10x longer is very alarming. I would suspect
application design before the GC in this case.

If it's a fundamental D issue, then we need to fix it ASAP, especially since D2
is supposed to be (among other things) an upgrade for multi-core.

Maybe I'm wrong, is there a good test case to prove it is worse on multiple
cores?

-Steve
Dec 15 2009
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 Maybe I'm wrong, is there a good test case to prove it is worse on
 multiple cores?
 -Steve
This is a synthetic, extreme corner case benchmark, but I think it hammers home
the point that such problems are at least plausible. Tested on a Core 2 Quad
with -O -inline -release:

import std.stdio, std.perf, core.thread;

void main() {
    writeln("Set affinity, then press enter.");
    readln();

    auto pc = new PerformanceCounter;
    pc.start;

    enum nThreads = 4;
    auto threads = new Thread[nThreads];
    foreach(ref thread; threads) {
        thread = new Thread(&doAppending);
        thread.start();
    }

    foreach(thread; threads) {
        thread.join();
    }

    pc.stop;
    writeln(pc.milliseconds);
}

void doAppending() {
    uint[] arr;
    foreach(i; 0..1_000_000) {
        arr ~= i;
    }
}

Timing with affinity for all 4 CPUs:  28390 milliseconds

Timing with affinity for only 1 CPU:  533 milliseconds
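Incidentally, if anyone would rather not click around in Task Manager, the
affinity can also be pinned from inside the program. A rough sketch, with the
two Win32 calls declared by hand so it doesn't matter which Windows bindings
your compiler ships (pinToOneCpu is just a name for the example):

extern (Windows) {
    alias void* HANDLE;
    HANDLE GetCurrentProcess();
    int SetProcessAffinityMask(HANDLE hProcess, size_t dwProcessAffinityMask);
}

// Restrict the whole process to CPU 0, so the single-CPU timing can be taken
// without touching Task Manager.
void pinToOneCpu() {
    SetProcessAffinityMask(GetCurrentProcess(), 1);
}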
Dec 15 2009
next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2009-12-15 10:28:37 -0500, dsimcha <dsimcha yahoo.com> said:

 void doAppending() {
     uint[] arr;
     foreach(i; 0..1_000_000) {
         arr ~= i;
     }
 }
For comparison's sake, you might want to compare that with malloc/realloc:

import core.stdc.stdlib : realloc;

void doAppending() {
    uint* arr = null;
    foreach(i; 0..1_000_000) {
        arr = cast(uint*)realloc(arr, uint.sizeof * (i+1));
        arr[i] = i;
    }
    // leak arr
}

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 15 2009
prev sibling next sibling parent reply Dan <dsstruthers yahoo.com> writes:
dsimcha Wrote:

 == Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 Maybe I'm wrong, is there a good test case to prove it is worse on
 multiple cores?
 -Steve
This is a synthetic, extreme corner case benchmark, but I think it hammers home
the point that such problems are at least plausible. Tested on a Core 2 Quad
with -O -inline -release:

import std.stdio, std.perf, core.thread;

void main() {
    writeln("Set affinity, then press enter.");
    readln();

    auto pc = new PerformanceCounter;
    pc.start;

    enum nThreads = 4;
    auto threads = new Thread[nThreads];
    foreach(ref thread; threads) {
        thread = new Thread(&doAppending);
        thread.start();
    }

    foreach(thread; threads) {
        thread.join();
    }

    pc.stop;
    writeln(pc.milliseconds);
}

void doAppending() {
    uint[] arr;
    foreach(i; 0..1_000_000) {
        arr ~= i;
    }
}

Timing with affinity for all 4 CPUs:  28390 milliseconds

Timing with affinity for only 1 CPU:  533 milliseconds
My code does do considerable array appending, and I see exactly the same issue as dsimcha points out above. I would expect it is GC related, but why for multiple cores only, I cannot fathom. Thanks for the repro, dsimcha! My code snippet would not have been as straight-forward.
Dec 15 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Dan (dsstruthers yahoo.com)'s article
 My code does do considerable array appending, and I see exactly the same issue
as dsimcha points out above. I would expect it is GC related, but why for multiple cores only, I cannot fathom.
 Thanks for the repro, dsimcha!  My code snippet would not have been as
straight-forward.

Two reasons:

1. When GC info is queried, the last query is cached for (relatively) efficient
   array appending. However, this cache is not thread-local. Therefore, if
   you're appending to arrays in two different threads simultaneously, they'll
   both keep evicting each other's cached GC info, causing it to have to be
   looked up again and again.

2. Every array append requires a lock acquisition. This is much more expensive
   if there's contention.

Bottom line: Array appending in multithreaded code is **horribly** broken and
I'm glad it's being brought to people's attention. For a temporary fix,
pre-allocate or use std.array.Appender.
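To illustrate the "pre-allocate" part of the fix, the doAppending loop from the
benchmark above could be rewritten like this (just a sketch of the workaround,
not the code that produced the timings):

void doAppendingPrealloc() {
    // One allocation (and one GC lock acquisition) per thread, then plain
    // stores: no per-append lock and no block-info lookups to thrash.
    auto arr = new uint[1_000_000];
    foreach(i; 0..1_000_000) {
        arr[i] = i;
    }
}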
Dec 15 2009
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Tue, 15 Dec 2009 12:23:01 -0500, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Dan (dsstruthers yahoo.com)'s article
 My code does do considerable array appending, and I see exactly the  
 same issue
as dsimcha points out above. I would expect it is GC related, but why for multiple cores only, I cannot fathom.
 Thanks for the repro, dsimcha!  My code snippet would not have been as
straight-forward.

Two reasons:

1. When GC info is queried, the last query is cached for (relatively) efficient
   array appending. However, this cache is not thread-local. Therefore, if
   you're appending to arrays in two different threads simultaneously, they'll
   both keep evicting each other's cached GC info, causing it to have to be
   looked up again and again.

2. Every array append requires a lock acquisition. This is much more expensive
   if there's contention.

Bottom line: Array appending in multithreaded code is **horribly** broken and
I'm glad it's being brought to people's attention. For a temporary fix,
pre-allocate or use std.array.Appender.
Yes, but why does multiple cores make the problem worse? If it's the lock, then
I'd expect just locking in multiple threads without any appending does worse on
multiple cores than on a single core. If it's the lookup, why does it take
longer to lookup on multiple cores? The very idea that multiple cores makes
threading code *slower* goes against everything I've ever heard about
multi-core and threads.

I agree array appending for multithreaded code is not as efficient as it is
when you use a dedicated append-friendly object, but it's a compromise between
efficiency of appending (which arguably is not that common) and efficiency for
everything else (slicing, passing to a function, etc). I expect that very soon
we will have efficient appending with thread local caching of lookups, but the
multi-core thing really puzzles me.

-Steve
Dec 15 2009
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 Yes, but why does multiple cores make the problem worse?  If it's the
 lock, then I'd expect just locking in multiple threads without any
 appending does worse on multiple cores than on a single core.
It does.

import std.stdio, std.perf, core.thread;

void main() {
    writeln("Set affinity, then press enter.");
    readln();

    auto pc = new PerformanceCounter;
    pc.start;

    enum nThreads = 4;
    auto threads = new Thread[nThreads];
    foreach(ref thread; threads) {
        thread = new Thread(&doStuff);
        thread.start();
    }

    foreach(thread; threads) {
        thread.join();
    }

    pc.stop;
    writeln(pc.milliseconds);
}

void doStuff() {
    foreach(i; 0..1_000_000) {
        synchronized {}
    }
}

Timing with affinity for all CPUs:  20772 ms.

Timing with affinity for 1 CPU:  156 ms.

Heavy lock contention **kills** multithreaded code because not only do you
serialize everything, but the OS has to perform a context switch on every
contention. I posted about a year ago that using spinlocks in the GC massively
sped things up, at least in synthetic benchmarks, if you have heavy contention
and multiple cores. See
http://www.digitalmars.com/d/archives/digitalmars/D/More_on_GC_spinlocks_80485.html .
However, this post largely got ignored.
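For reference, the spinlock idea is nothing fancy: busy-wait on an atomic flag
so that a short, contended critical section never pays for a kernel transition
and a context switch. A bare-bones sketch using core.atomic (this is an
illustration, not the code from that old post and not what the GC actually
uses):

import core.atomic;

struct SpinLock {
    private shared bool locked;

    void lock() {
        // Spin in user space until the flag is acquired. For critical
        // sections as short as the GC's, this avoids descheduling the
        // thread on every contended acquire.
        while (!cas(&locked, false, true)) {}
    }

    void unlock() {
        atomicStore(locked, false);
    }
}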
 If it's the
 lookup, why does it take longer to lookup on multiple cores?
Because appending to multiple arrays simultaneously (whether on single or multiple cores) causes each array's append to evict the other array's append from the GC block info cache. If you set the affinity to only 1 CPU, this only happens once per context switch.
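To picture why, here is a conceptual sketch of the cache being described (an
illustration only, not the actual druntime source):

// Think of the runtime as keeping roughly one shared cache entry:
struct BlockInfo {
    void*  base;   // start of the memory block backing an array
    size_t size;   // its usable capacity
}
__gshared BlockInfo lastLookup;   // one entry, shared by every thread

// Each ~= checks lastLookup first. With two threads appending to two
// different arrays, each append overwrites the entry the other thread just
// cached, so nearly every append falls back to the slow lookup under the
// global GC lock. Pinned to one CPU, a thread keeps the entry warm for its
// whole timeslice and only loses it at a context switch.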
Dec 15 2009
prev sibling parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
dsimcha <dsimcha yahoo.com> wrote:

 This is a synthetic, extreme corner case benchmark, but I think it  
 hammers home
 the point that such problems are at least plausible.  Tested on a Core 2  
 Quad with
 -O -inline -release:
 Timing with affinity for all 4 CPUs:  28390 milliseconds

 Timing with affinity for only 1 CPU:  533 milliseconds
Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales
roughly inverse linearly with number of threads:

163ms for 1,
364ms for 2,
886ms for 4

This is quite different from your numbers, though.

--
Simen
Dec 15 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 dsimcha <dsimcha yahoo.com> wrote:
 This is a synthetic, extreme corner case benchmark, but I think it
 hammers home
 the point that such problems are at least plausible.  Tested on a Core 2
 Quad with
 -O -inline -release:
 Timing with affinity for all 4 CPUs:  28390 milliseconds

 Timing with affinity for only 1 CPU:  533 milliseconds
Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales
roughly inverse linearly with number of threads:

163ms for 1,
364ms for 2,
886ms for 4

This is quite different from your numbers, though.
Yea, forgot to mention my numbers were on Win XP. Maybe Windows 7 critical sections are better implemented or something. Can a few other people with a variety of OS's run this benchmark and post their numbers?
Dec 15 2009
next sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
dsimcha wrote:
 == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 dsimcha <dsimcha yahoo.com> wrote:
 Timing with affinity for all 4 CPUs:  28390 milliseconds

 Timing with affinity for only 1 CPU:  533 milliseconds
Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales roughly inverse linearly with number of threads: 163ms for 1, 364ms for 2, 886ms for 4
Maybe you mean core 2 quad?
 This is quite different from your numbers, though.
Yea, forgot to mention my numbers were on Win XP. Maybe Windows 7 critical sections are better implemented or something. Can a few other people with a variety of OS's run this benchmark and post their numbers?
Core 2 duo laptop, 1.8GHz, XP Pro:

  860ms   1 core
11369ms   2 cores
Dec 15 2009
parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
On Wed, 16 Dec 2009 03:01:54 +0100, Sergey Gromov <snake.scaly gmail.com>  
wrote:

 dsimcha wrote:
 == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 dsimcha <dsimcha yahoo.com> wrote:
 Timing with affinity for all 4 CPUs:  28390 milliseconds

 Timing with affinity for only 1 CPU:  533 milliseconds
Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It scales roughly inverse linearly with number of threads: 163ms for 1, 364ms for 2, 886ms for 4
Maybe you mean core 2 quad?
Nope. Tried with 4 threads to see what the difference was. -- Simen
Dec 15 2009
prev sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-12-15 19:49:43 -0500, dsimcha <dsimcha yahoo.com> said:

 == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It
 scales roughly inverse linearly with number of threads:
 163ms for 1,
 364ms for 2,
 886ms for 4
 This is quite different from your numbers, though.
Yea, forgot to mention my numbers were on Win XP. Maybe Windows 7 critical sections are better implemented or something. Can a few other people with a variety of OS's run this benchmark and post their numbers?
Core 2 Duo / Mac OS X 10.6 / 4 threads:

Crystal:~ mifo$ ./test
Set affinity, then press enter.

Bus error

Runs for about 18 seconds, then crashes. At first glance, it looks as if the
Thread class is broken and for some reason I get a null dereference when a
thread finishes. Great!

Anyway, I've done some sampling on the program while it runs, and each of the
worker threads spends about 85% of its time inside _d_monitorenter and 11% in
_d_monitorleave soon after starting the program, which later becomes 88% and 7%
respectively soon before the program finishes.

The funny thing is that if I just bypass the GC like this:

import core.stdc.stdlib : realloc;

void doAppending() {
    uint* arr = null;
    foreach(i; 0..1_000_000) {
        arr = cast(uint*)realloc(arr, uint.sizeof * (i+1));
        arr[i] = i;
    }
    // leak arr
}

it finishes (I mean crashes) in less than half a second. So it looks like
realloc does a much better job of locking its data structure than the GC.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 15 2009
parent reply Jacob Carlborg <doob me.com> writes:
On 12/16/09 03:44, Michel Fortin wrote:
 On 2009-12-15 19:49:43 -0500, dsimcha <dsimcha yahoo.com> said:

 == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
 Tested this on a Core 2 Duo, same options. OS is Windows 7, 64bit. It
 scales roughly inverse linearly with number of threads:
 163ms for 1,
 364ms for 2,
 886ms for 4
 This is quite different from your numbers, though.
Yea, forgot to mention my numbers were on Win XP. Maybe Windows 7 critical sections are better implemented or something. Can a few other people with a variety of OS's run this benchmark and post their numbers?
Core 2 Duo / Mac OS X 10.6 / 4 threads:

Crystal:~ mifo$ ./test
Set affinity, then press enter.

Bus error

Runs for about 18 seconds, then crashes. At first glance, it looks as if the
Thread class is broken and for some reason I get a null dereference when a
thread finishes. Great!

Anyway, I've done some sampling on the program while it runs, and each of the
worker threads spends about 85% of its time inside _d_monitorenter and 11% in
_d_monitorleave soon after starting the program, which later becomes 88% and 7%
respectively soon before the program finishes.

The funny thing is that if I just bypass the GC like this:

import core.stdc.stdlib : realloc;

void doAppending() {
    uint* arr = null;
    foreach(i; 0..1_000_000) {
        arr = cast(uint*)realloc(arr, uint.sizeof * (i+1));
        arr[i] = i;
    }
    // leak arr
}

it finishes (I mean crashes) in less than half a second. So it looks like
realloc does a much better job of locking its data structure than the GC.
It runs fine on Mac OS X 10.5 with dmd 2.037.
Dec 16 2009
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2009-12-16 07:08:30 -0500, Jacob Carlborg <doob me.com> said:

 On 12/16/09 03:44, Michel Fortin wrote:
 Core 2 Duo / Mac OS X 10.6 / 4 threads:
 
 Crystal:~ mifo$ ./test
 Set affinity, then press enter.
 
 Bus error
 
 Runs for about 18 seconds, then crashes. At first glance, it looks as if
 the Thread class is broken and for some reason I get a null dereference
 when a thread finishes. Great!
It runs fine on Mac OS X 10.5 with dmd 2.037.
Then that would be specific to 10.6. I've filed a bug:
http://d.puremagic.com/issues/show_bug.cgi?id=3619

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Dec 16 2009
parent Sean Kelly <sean invisibleduck.org> writes:
Michel Fortin Wrote:

 On 2009-12-16 07:08:30 -0500, Jacob Carlborg <doob me.com> said:
 
 On 12/16/09 03:44, Michel Fortin wrote:
 Core 2 Duo / Mac OS X 10.6 / 4 threads:
 
 Crystal:~ mifo$ ./test
 Set affinity, then press enter.
 
 Bus error
 
 Runs for about 18 seconds, then crashes. At first glance, it looks as if
 the Thread class is broken and for some reason I get a null dereference
 when a thread finishes. Great!
It runs fine on Mac OS X 10.5 with dmd 2.037.
Then that would be specific to 10.6. I've filed a bug:
http://d.puremagic.com/issues/show_bug.cgi?id=3619
I haven't changed the thread code in eons, so I'm currently attributing this to either the change in how static arrays are passed or some other compiler issue. I'm using 10.6 myself, so I'll give it a look.
Dec 16 2009
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Steven Schveighoffer Wrote:
 
 I would suspect something else.  I would expect actually that in an  
 allocation-heavy design, running on multiple cores should be at *least* as  
 fast as running on a single core.  He also only has 2 cores.  For  
 splitting the parallel tasks to 2 cores to take 10x longer is very  
 alarming.  I would suspect application design before the GC in this case.   
 If it's a fundamental D issue, then we need to fix it ASAP, especially  
 since D2 is supposed to be (among other things) an upgrade for multi-core.
Assuming he's right, the only thing I can come up with that makes even the remotest sense is that SuspendThread is ridiculously slow for threads on other cores, but effectively a no-op for threads on the same core. But the Boehm GC works the same way as far as I know, and if it had such problems I'm sure we'd have heard about it. It also doesn't seem reasonable that the Windows kernel team would be able to keep something so broken in place.
Dec 15 2009
parent retard <re tard.com.invalid> writes:
Tue, 15 Dec 2009 11:15:04 -0500, Sean Kelly wrote:

 Steven Schveighoffer Wrote:
 
 I would suspect something else.  I would expect actually that in an
 allocation-heavy design, running on multiple cores should be at *least*
 as fast as running on a single core.  He also only has 2 cores.  For
 splitting the parallel tasks to 2 cores to take 10x longer is very
 alarming.  I would suspect application design before the GC in this
 case. If it's a fundamental D issue, then we need to fix it ASAP,
 especially since D2 is supposed to be (among other things) an upgrade
 for multi-core.
Assuming he's right, the only thing I can come up with that makes even the remotest sense is that SuspendThread is ridiculously slow for threads on other cores, but effectively a no-op for threads on the same core. But the Boehm GC works the same way as far as I know, and if it had such problems I'm sure we'd have heard about it. It also doesn't seem reasonable that the Windows kernel team would be able to keep something so broken in place.
FWIW, memory allocations decrease the throughput on the great Sun Hotspot GC too: http://java.sun.com/docs/hotspot/gc5.0/ in the user code.
Dec 15 2009