digitalmars.D.learn - Scalability in std.parallelism

"Nordlöw" <per.nordlow gmail.com> writes:
```In the test code given below using std.parallelism I get
some interesting results:

when compiled as

dmd -release -noboundscheck -O -inline -w -wi -wi
~/Work/justd/t_parallelism.d -oft_parallelism

My scalability measurements say the following:

3.14159 took 221[ms]
3.14159 took 727[ms]
Speedup 3.28959
-5.80829e+09 took 33[ms]
-5.80829e+09 took 201[ms]
Speedup 6.09091

Why do I get a larger speedup for a simpler map function?
Shouldn't it be the opposite?
I've always read that the more calculations I perform on each
memory access the better the speedup...

Anyhow the speedups are great!

Sample code follows:

import std.algorithm, std.parallelism, std.range, std.datetime,
std.stdio;

void test1 () {
    immutable n = 100_000_000;
    immutable delta = 1.0 / n;

    auto piTerm(int i) {
        immutable x = (i - 0.5) * delta;
        return delta / (1.0 + x*x);
    }

    auto nums = n.iota.map!piTerm; // lazy range of terms
    StopWatch sw;

    sw.reset();
    sw.start();
    immutable pi = 4.0*taskPool.reduce!"a+b"(nums); // parallel reduction
    sw.stop();
    immutable ms = sw.peek().msecs;
    writeln(pi, " took ", ms, "[ms]");

    sw.reset();
    sw.start();
    immutable pi_ = 4.0*std.algorithm.reduce!"a+b"(nums); // serial reduction
    sw.stop();
    immutable ms_ = sw.peek().msecs;
    writeln(pi_, " took ", ms_, "[ms]");

    writeln("Speedup ", cast(real)ms_ / ms);
}

auto square(T)(T i) @safe pure nothrow { return i*i; }

void test2 () {
    immutable n = 100_000_000;
    immutable delta = 1.0 / n;

    auto nums = n.iota.map!square; // lazy range of squares (note: ints)
    StopWatch sw;

    sw.reset();
    sw.start();
    immutable pi = 4.0*taskPool.reduce!"a+b"(nums); // parallel reduction
    sw.stop();
    immutable ms = sw.peek().msecs;
    writeln(pi, " took ", ms, "[ms]");

    sw.reset();
    sw.start();
    immutable pi_ = 4.0*std.algorithm.reduce!"a+b"(nums); // serial reduction
    sw.stop();
    immutable ms_ = sw.peek().msecs;
    writeln(pi_, " took ", ms_, "[ms]");

    writeln("Speedup ", cast(real)ms_ / ms);
}

void main () {
    test1();
    test2();
}
```
Feb 22 2014
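A side note on the second test's output: `reduce!"a+b"` accumulates in the range's element type, and `n.iota.map!square` is a range of `int`, so the running sum of squares wraps around in 32-bit arithmetic, which is how a value like -5.80829e+09 can appear. A minimal sketch of the effect (a reconstruction, not code from the thread, with `n` shrunk from 100_000_000 so it runs instantly):

```d
import std.algorithm : map, reduce;
import std.range : iota;
import std.stdio : writeln;

void main() {
    immutable n = 100_000;

    // reduce!"a+b" accumulates in the element type: a range of int wraps
    // around once the running sum exceeds int.max (~2.1e9).
    auto intSum = reduce!"a+b"(n.iota.map!(i => i * i));

    // Mapping through double keeps the true value (every partial sum here
    // stays below 2^53, so the double sum is exact):
    auto dblSum = reduce!"a+b"(n.iota.map!(i => cast(double)i * i));

    writeln(intSum); // wrapped, wrong
    writeln(dblSum); // 3.33328e+14
}
```

In test2 the wrapped `int` sum is additionally scaled by 4.0 before printing, hence the -5.80829e+09 in the measurements above.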
Ali Çehreli <acehreli yahoo.com> writes:
```On 02/22/2014 08:21 AM, "Nordlöw" wrote:

In the test code given below using std.parallelism I get some
interesting results:

when compiled as

dmd -release -noboundscheck -O -inline -w -wi -wi
~/Work/justd/t_parallelism.d -oft_parallelism

My scalability measurements say the following:

3.14159 took 221[ms]
3.14159 took 727[ms]
Speedup 3.28959
-5.80829e+09 took 33[ms]
-5.80829e+09 took 201[ms]
Speedup 6.09091

On my quad-core Intel I get the following (I have two actual cores, four hyperthreads):

3.14159 took 441[ms]
3.14159 took 878[ms]
Speedup 1.99093
-5.80829e+09 took 98[ms]
-5.80829e+09 took 328[ms]
Speedup 3.34694

I am not an expert at all, but it looks like the first test cannot take advantage of hyperthreading, at least to some degree:

auto piTerm(int i) {
    immutable x = (i - 0.5) * delta;
    return delta / (1.0 + x*x);
}

auto square(T)(T i) @safe pure nothrow { return i*i; }

Ali
```
Feb 23 2014
"Per Nordlöw" <per.nordlow gmail.com> writes:
``` I am not an expert at all, but it looks like the first test cannot take advantage of hyperthreading, at least to some degree.

Yes, that is my hypothesis as well :)

Thx
```
Feb 24 2014
Russel Winder <russel winder.org.uk> writes:
```On Sun, 2014-02-23 at 22:55 -0800, Ali Çehreli wrote:
[…]
On my quad-core Intel I get the following (I have two actual cores, four hyperthreads):

3.14159 took 441[ms]
3.14159 took 878[ms]
Speedup 1.99093
-5.80829e+09 took 98[ms]
-5.80829e+09 took 328[ms]
Speedup 3.34694

I am not an expert at all, but it looks like the first test cannot take advantage of hyperthreading, at least to some degree.

I'm fairly sure I don't agree with this hypothesis. Two cores with
hyperthreads generally means a maximum speed up of 2 with optimized
native code. So for me the first one looks fine whereas the second one
seems to indicate a problem in std.parallelism.

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
```
Feb 24 2014
"thedeemon" <dlang thedeemon.com> writes:
```On Monday, 24 February 2014 at 14:34:14 UTC, Russel Winder wrote:

Two cores with hyperthreads generally means a maximum speed up
of 2 with optimized native code.

Not true. If the code is not trivial and the threads are not
executing exactly the same instructions (i.e. they can do some
search where the number of operations depends on the data), then
2 cores x 2 hyperthreads can easily provide more than a 2x speed
up (but far from 4x, of course). I see it very often in my video
processing code.
```
Feb 25 2014
Russel Winder <russel winder.org.uk> writes:
```On Tue, 2014-02-25 at 12:33 +0000, thedeemon wrote:
On Monday, 24 February 2014 at 14:34:14 UTC, Russel Winder wrote:

Two cores with hyperthreads generally means a maximum speed up
of 2 with optimized native code.

Not true. If the code is not trivial and the threads are not
executing exactly the same instructions (i.e. they can do some
search where the number of operations depends on the data), then
2 cores x 2 hyperthreads can easily provide more than a 2x speed
up (but far from 4x, of course). I see it very often in my video
processing code.

I suspect the issue here is how compute-intensive the code is:
are there cache-line misses, are there requests out to memory,
etc.; i.e. non-trivial. My observation and gross
over-simplification stems from CPU-bound jobs with very localized
data, no need for memory writes, and definitely no I/O. This
leaves no opportunity for the hyperthreads to contribute
anything.

I would guess that your video processing uses a cache-friendly
(streaming?) algorithm, so that one hyperthread can operate on
data already in cache whilst the other gets more data into the
cache. This could easily give a >2x speed up on a 2-core,
2-hyperthread machine if the data chunks are of a suitable size
and the memory reads and writes are in good rhythm with the
calculation done on the data.

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
```
Feb 25 2014
"Stanislav Blinov" <stanislav.blinov gmail.com> writes:
```Don't rely on dmd when making raw performance tests.

On my machine (i3-2100, two cores):

dmd2 -O -release -inline

3.14159 took 368[ms]
3.14159 took 713[ms]
Speedup 1.9375
-5.80829e+09 took 61[ms]
-5.80829e+09 took 201[ms]
Speedup 3.29508

ldc2 -O3 -release

3.14159 took 360[ms]
3.14159 took 718[ms]
Speedup 1.99444
-5.80829e+09 took 0[ms]
-5.80829e+09 took 0[ms]
Speedup -nan

ldc2 -O3 -release -vectorize -vectorize-loops

3.14159 took 193[ms]
3.14159 took 721[ms]
Speedup 3.73575
-5.80829e+09 took 0[ms]
-5.80829e+09 took 0[ms]
Speedup -nan
```
Feb 24 2014
"safety0ff" <safety0ff.dev gmail.com> writes:
```On Saturday, 22 February 2014 at 16:21:21 UTC, Nordlöw wrote:
In the test code given below using std.parallelism I get
some interesting results:

Don't forget that "n.iota.map" returns a lazily evaluated range.
std.parallelism might have to convert the lazy range to a random
access range (i.e. an array) before it can schedule the work.

If I add ".array" after the map call (e.g. auto nums =
n.iota.map!piTerm.array;)
I get numbers closer to the ideal for test2.

Now we compare the differences between test1 and test2:
test1 is reducing doubles and test2 is reducing ints.

I believe that the reason for the difference in speed up is that
hyperthreads can contend for shared resources in the CPU (e.g.
cache and FPU).

On my computer, forcing the nums to be a range of doubles in
test2 causes the speed up to drop to approximately the same as
test1.

Regards.
```
Feb 24 2014
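safety0ff's `.array` suggestion, applied to the π test, might look like the following sketch (a reconstruction, not code from the thread; `n` is reduced from 100_000_000 so it runs quickly):

```d
import std.algorithm : map;
import std.array : array;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writeln;

void main() {
    immutable n = 1_000_000;
    immutable delta = 1.0 / n;

    // Evaluate the lazy map eagerly into an array, so taskPool.reduce
    // receives a random-access range it can slice across worker threads:
    auto nums = n.iota.map!((int i) {
        immutable x = (i - 0.5) * delta;
        return delta / (1.0 + x * x);
    }).array;

    immutable pi = 4.0 * taskPool.reduce!"a+b"(nums);
    writeln(pi); // close to 3.14159
}
```

The lazy range (without `.array`) also works, but per the observation above the eager conversion brings the measured speedup closer to the ideal.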
Russel Winder <russel winder.org.uk> writes:
```On Sat, 2014-02-22 at 16:21 +0000, "Nordlöw" wrote:
In the test code given below using std.parallelism I get
some interesting results:

when compiled as

dmd -release -noboundscheck -O -inline -w -wi -wi
~/Work/justd/t_parallelism.d -oft_parallelism

My scalability measurements say the following:

3.14159 took 221[ms]
3.14159 took 727[ms]
Speedup 3.28959
-5.80829e+09 took 33[ms]
-5.80829e+09 took 201[ms]
Speedup 6.09091

Why do I get a larger speedup for a simpler map function?
Shouldn't it be the opposite?

I'm not sure just now, but it is an interesting issue.  On my 7 year old
twin 2.33GHz Xeon, I get:

3.14159 took 128[ms]
3.14159 took 860[ms]
Speedup 6.71875
-5.80829e+09 took 28[ms]
-5.80829e+09 took 302[ms]
Speedup 10.7857

I've always read that the more calculations I perform on each
memory access the better the speedup...

Not necessarily, it depends on the caches and lots of other fun things.

Anyhow the speedups are great!

As anyone involved in native-code CPU-bound benchmarking will tell
you, hyperthreads are generally a waste of time. So I'd expect a
"flat out" speed up of 4 on your machine, 8 on mine. Virtual
machines, e.g. JVM and PVM, can sometimes do interesting things
that let the hyperthreads contribute.

Sample code follows:

Hey, I recognize the core of this code, it's my π by quadrature example.
Sadly π is not well approximated by -5.80829e+09 :-)

I shall tinker with this a bit as I want to understand why the speed
up is greater than the number of cores. It implies overhead in one of
the algorithms.

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
```
Feb 24 2014