digitalmars.D.learn - Concurrency and program speed
Hello all,

I'm in need of some guidance regarding std.concurrency. Before writing further, I should add that I'm an almost complete novice where concurrency is concerned, in general and particularly with D: I've written a few programs that made use of std.parallelism, but that's about it.

In this case, there's a strong need to use std.concurrency, because the functions that will be run in parallel involve generating substantial quantities of random numbers. AFAICS std.parallelism just isn't safe for that, in a statistical sense (no idea how it might slow things down in terms of shared access to a common rndGen).

Now, I'm not naive enough to believe that using n threads will simply result in the program runtime being divided by n. However, the results I'm getting with some simple test code (attached) are curious, and I'd like to understand better what's going on.

The program is simple enough:

    foreach(i; iota(n))
        spawn(&randomFunc, m);

... where randomFunc is a function that generates and sums m different random numbers. For speed comparison one can instead do:

    foreach(i; iota(n))
        randomFunc(m);

... with m = 100_000_000 being chosen in my case.

Setting n = 2 on my 4-core laptop, the sequential case runs in about 4 s; the concurrent version using spawn() runs in about 2.2 s (the total "user" time reported for the sequential and concurrent programs is about 4 s and about 4.3 s respectively). So, roughly half the runtime, as you might expect.

Setting n = 3, the sequential case runs in about 6 s (surprise!), the concurrent version in about 3 s (with about 8.1 s of "user" time recorded). In other words, the concurrent runtime is still only half that of the sequential version, even though there's no shared data and the CPU can comfortably accommodate the 3 threads at full speed. (In fact 270% CPU usage is recorded, but even that should translate into a faster program.)

Setting n = 4, the sequential case runs in 8 s, the concurrent in about 3.8 s (with 14.8 s of "user" time recorded), with 390% CPU usage.

In other words, it doesn't seem possible to get more than about a 2x speedup on my system from using concurrency, even though there should not be any data races or other factors that might explain slower performance. I didn't expect speed / n, but I did expect something a little better than this -- so can anyone suggest what might be going on here? (Unfortunately, I don't have a system with a greater number of cores on which to test with greater numbers of threads.)

The times reported here are for programs compiled with GDC, but using LDC or DMD produces similar behaviour.

Can anyone advise?

Thanks & best wishes,

    -- Joe
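For concreteness, here's a minimal self-contained sketch of the kind of test described above -- the body of randomFunc (summing uniform doubles) is an assumption on my part, not the attached code:

    import std.concurrency, std.random, std.range, std.stdio;

    // Stand-in for the attached randomFunc: generate and sum m random numbers.
    void randomFunc(size_t m)
    {
        real total = 0.0;
        foreach (i; iota(m))
            total += uniform(0.0, 1.0);   // uses the thread-local rndGen
        writeln("sum = ", total);
    }

    void main()
    {
        enum size_t m = 100_000_000;
        enum n = 3;

        // Concurrent version: one spawned thread per call.
        // (The runtime joins the spawned threads before the program exits.)
        foreach (i; iota(n))
            spawn(&randomFunc, m);

        // Sequential version, for comparison:
        // foreach (i; iota(n))
        //     randomFunc(m);
    }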
Feb 28 2013
The timings on my (4 core) machine are:

    sequential: 7.7 s
    concurrent: 2.1 s

Which is about what one would expect.

One thing that could be contributing to the timings you are seeing is Turbo Boost (http://en.wikipedia.org/wiki/Intel_Turbo_Boost). Some mobile processors have large differences (for example, 2.1 GHz vs 3.1 GHz) between their normal and maximal Turbo Boost frequencies.
Feb 28 2013
On Thursday, 28 February 2013 at 14:15:57 UTC, Joseph Rushton Wakeling wrote:

> In other words, it doesn't seem possible to get more than about a 2x
> speedup on my system from using concurrency, even though there should
> not be any data races or other factors that might explain slower
> performance.

I've just run your example with DMD 2.061 on an Intel Core2 Quad (Q8200, 2.33 GHz) and got 36 s in sequential mode and 11-12 s in parallel (n=4). So it's a ~3x speedup.

Two ideas come to mind:

1) Amdahl's law. If there is some locking inside the random number generator, you will never get a linear speedup.

2) More probable: the "4 cores" of your laptop might be just 2 cores with hyperthreading. That's not an honest 4 cores, and they won't give you a 4x speedup anywhere.
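To put rough numbers on the Amdahl's law point -- the 90% parallel fraction below is just an illustrative assumption, not a measurement of the RNG:

    import std.stdio;

    void main()
    {
        // Amdahl's law: speedup = 1 / ((1 - p) + p / n),
        // where p is the parallelisable fraction and n the number of cores.
        double p = 0.9;   // assumed parallel fraction
        foreach (n; [2, 3, 4])
            writefln("n = %s: speedup = %.2f", n, 1.0 / ((1.0 - p) + p / n));
    }

That prints roughly 1.82, 2.50 and 3.08 -- even a small serial portion (e.g. a lock taken on every random number) caps the speedup well below n.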
Mar 01 2013
On 03/01/2013 11:43 AM, thedeemon wrote:

> Two ideas come to mind:
> 1) Amdahl's law. If there is some locking inside the random number
> generator, you will never get a linear speedup.

I had some concerns here. rndGen is supposed to be thread-safe, which was incidentally why I was using std.concurrency; I'm not sure whether the same safety applies with std.parallelism.

> 2) More probable: the "4 cores" of your laptop might be just 2 cores
> with hyperthreading. That's not an honest 4 cores, and they won't give
> you a 4x speedup anywhere.

Yes, this turns out to be the case, so it's the most likely cause. Your results and jerro's suggest that with a greater number of true cores, an appropriate speedup results.

Thanks to both of you for the thoughts and advice!
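On the rndGen point: as far as I understand it, std.random's rndGen is thread-local, so each spawned thread gets its own independently seeded generator and there is no shared lock to contend on. A minimal sketch of making that independence explicit (the worker name and seeding scheme here are just illustrative, not the code from my test):

    import std.concurrency, std.random, std.stdio;

    // Give each spawned thread its own explicitly seeded generator,
    // rather than relying on the implicit thread-local rndGen.
    void worker(uint seed)
    {
        auto rng = Random(seed);          // private generator for this thread
        writeln(uniform(0.0, 1.0, rng));
    }

    void main()
    {
        foreach (i; 0 .. 3)
            spawn(&worker, unpredictableSeed);
    }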
Mar 01 2013