digitalmars.D.learn - Parallel processing and further use of output
- Zoidberg (17/17) Sep 26 2015 I've run into an issue, which I guess could be resolved easily,
- John Colvin (12/29) Sep 26 2015 Here's a correct version:
- Zoidberg (1/12) Sep 26 2015 Thanks! Works fine. So "shared" and "atomic" is a must?
- Russel Winder via Digitalmars-d-learn (22/36) Sep 28 2015 Yes and no. But mostly no. If you have to do this as an explicit
- anonymous (23/43) Sep 26 2015 Definitely a race, yeah. You need to prevent two += operations happening...
- Meta (3/7) Sep 26 2015 Is this valid syntax? I've never seen synchronized used like this
- Zoidberg (4/11) Sep 26 2015 Atomic worked perfectly and reasonably fast. "Synchronized" may
- anonymous (7/14) Sep 26 2015 I'm sure it's valid.
- Russel Winder via Digitalmars-d-learn (17/19) Sep 28 2015 Rough and ready anecdotal evidence would indicate that this is a
- John Colvin (4/13) Sep 28 2015 It would be really great if someone knowledgable did a full
- Russel Winder via Digitalmars-d-learn (16/21) Sep 28 2015 Indeed, I would love to be able to do this. However I don't have time
- Russel Winder via Digitalmars-d-learn (44/44) Sep 28 2015 As a single data point:
- John Colvin (3/29) Sep 28 2015 Pretty much as expected. Locks are slow, shared accumulators
- Russel Winder via Digitalmars-d-learn (20/24) Sep 28 2015 Quite. Dataflow is where the parallel action is. (Except for those
- Jay Norwood (7/7) Sep 26 2015 std.parallelism.reduce documentation provides an example of a
- Jay Norwood (6/6) Sep 26 2015 btw, on my corei5, in debug build,
- Jay Norwood (8/8) Sep 26 2015 This is a work-around to get a ulong result without having the
- John Colvin (4/12) Sep 26 2015 or
- Russel Winder via Digitalmars-d-learn (14/23) Sep 28 2015 Not needed as reduce can take an initial value that sets the type of
- Russel Winder via Digitalmars-d-learn (16/26) Sep 28 2015 Which may or may not already have been fixed, or…
- Jay Norwood (5/8) Sep 28 2015 https://issues.dlang.org/show_bug.cgi?id=14832
I've run into an issue, which I guess could be resolved easily, if I knew how...

[CODE]
ulong i = 0;

foreach (f; parallel(iota(1, 1000000+1)))
{
    i += f;
}
thread_joinAll();
i.writeln;
[/CODE]

It's basically an example which adds all the numbers from 1 to 1000000 and should therefore give 500000500000. Running the above code gives 205579930677; leaving out "thread_joinAll()", the output is 210161213519. I suspect there's some sort of data race. Any hint how to get this straight?
Sep 26 2015
On Saturday, 26 September 2015 at 12:18:16 UTC, Zoidberg wrote:
> I've run into an issue, which I guess could be resolved easily, if I knew how...
>
> [CODE]
> ulong i = 0;
>
> foreach (f; parallel(iota(1, 1000000+1)))
> {
>     i += f;
> }
> thread_joinAll();
> i.writeln;
> [/CODE]
>
> It's basically an example which adds all the numbers from 1 to 1000000 and should therefore give 500000500000. Running the above code gives 205579930677; leaving out "thread_joinAll()", the output is 210161213519. I suspect there's some sort of data race. Any hint how to get this straight?

Here's a correct version:

import std.parallelism, std.range, std.stdio, core.atomic;

void main()
{
    shared ulong i = 0;

    foreach (f; parallel(iota(1, 1000000+1)))
    {
        i.atomicOp!"+="(f);
    }

    i.writeln;
}
Sep 26 2015
> Here's a correct version:
>
> import std.parallelism, std.range, std.stdio, core.atomic;
>
> void main()
> {
>     shared ulong i = 0;
>
>     foreach (f; parallel(iota(1, 1000000+1)))
>     {
>         i.atomicOp!"+="(f);
>     }
>
>     i.writeln;
> }

Thanks! Works fine. So "shared" and "atomic" is a must?
Sep 26 2015
On Sat, 2015-09-26 at 12:32 +0000, Zoidberg via Digitalmars-d-learn wrote:
> > Here's a correct version:
> >
> > import std.parallelism, std.range, std.stdio, core.atomic;
> >
> > void main()
> > {
> >     shared ulong i = 0;
> >
> >     foreach (f; parallel(iota(1, 1000000+1)))
> >     {
> >         i.atomicOp!"+="(f);
> >     }
> >
> >     i.writeln;
> > }
>
> Thanks! Works fine. So "shared" and "atomic" is a must?

Yes and no. But mostly no. If you have to do this as an explicit
iteration (very 1970s) then yes, to avoid doing things wrong, you have
to ensure the update to the shared mutable state is atomic.

A more modern (1930s/1950s) way of doing things is to use implicit
iteration – something Java, C++, etc. are all getting into more and
more. You should use a reduce call. People have previously mentioned:

    taskPool.reduce!"a + b"(iota(1UL,1000001))

which I would suggest has to be seen as the best way of writing this
algorithm.

--
Russel.
Sep 28 2015
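For context, a complete program along the lines Russel suggests might look
like the sketch below (it is not taken verbatim from the thread; the
explicit 0UL seed is an addition that sets the accumulator type to ulong
and sidesteps the iota(1UL, ...) compile error discussed further down):

import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // Each worker folds its own chunk of the range, then the partial
    // sums are combined; no shared counter, no locks, no atomics.
    auto sum = taskPool.reduce!"a + b"(0UL, iota(1, 1_000_001));
    writeln(sum); // 500000500000
}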
On Saturday 26 September 2015 14:18, Zoidberg wrote:
> I've run into an issue, which I guess could be resolved easily, if I knew how...
>
> [CODE]
> ulong i = 0;
>
> foreach (f; parallel(iota(1, 1000000+1)))
> {
>     i += f;
> }
> thread_joinAll();
> i.writeln;
> [/CODE]
>
> It's basically an example which adds all the numbers from 1 to 1000000 and should therefore give 500000500000. Running the above code gives 205579930677; leaving out "thread_joinAll()", the output is 210161213519. I suspect there's some sort of data race. Any hint how to get this straight?

Definitely a race, yeah. You need to prevent two += operations happening
concurrently.

You can use core.atomic.atomicOp!"+=" instead of plain +=:

----
shared ulong i = 0;
foreach (f; parallel(iota(1, 1000000+1)))
{
    import core.atomic: atomicOp;
    i.atomicOp!"+="(f);
}
----

i is shared because atomicOp requires a shared variable. I'm not sure
what the implications of that are, if any.

Alternatively, you could use `synchronized`:

----
ulong i = 0;
foreach (f; parallel(iota(1, 1000000+1)))
{
    synchronized i += f;
}
----

I'm pretty sure atomicOp is faster, though.
Sep 26 2015
On Saturday, 26 September 2015 at 12:33:45 UTC, anonymous wrote:
> foreach (f; parallel(iota(1, 1000000+1)))
> {
>     synchronized i += f;
> }

Is this valid syntax? I've never seen synchronized used like this before.
Sep 26 2015
On Saturday, 26 September 2015 at 13:09:54 UTC, Meta wrote:
> On Saturday, 26 September 2015 at 12:33:45 UTC, anonymous wrote:
> > foreach (f; parallel(iota(1, 1000000+1)))
> > {
> >     synchronized i += f;
> > }
>
> Is this valid syntax? I've never seen synchronized used like this before.

Atomic worked perfectly and reasonably fast. "Synchronized" may work as
well, but I had to abort the execution prior to finishing because it
seemed horribly slow.
Sep 26 2015
On Saturday, 26 September 2015 at 13:09:54 UTC, Meta wrote:
> On Saturday, 26 September 2015 at 12:33:45 UTC, anonymous wrote:
> > foreach (f; parallel(iota(1, 1000000+1)))
> > {
> >     synchronized i += f;
> > }
>
> Is this valid syntax? I've never seen synchronized used like this before.

I'm sure it's valid. A mutex is created for that instance of
synchronized. I.e., only one thread can execute that piece of code at a
time. If you're missing the braces, they're optional for single
statements, as usual.

http://dlang.org/statement.html#SynchronizedStatement
Sep 26 2015
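To illustrate the statement forms being described, here is a sketch (the
Counter class and its use are invented for the example, not taken from the
thread): the brace-less form locks a hidden mutex tied to that one
statement, while synchronized (obj) locks a particular object's monitor,
and the braces are optional in both cases.

import std.parallelism : parallel;
import std.range : iota;
import std.stdio : writeln;

// Invented for the example: an object whose monitor we can lock.
class Counter
{
    ulong total;
}

void main()
{
    ulong i = 0;
    auto counter = new Counter;

    foreach (f; parallel(iota(1, 1_000_001)))
    {
        // Brace-less form, as in the snippet quoted above: all threads
        // contend on one mutex created for this statement.
        synchronized i += f;

        // Object form: locks counter's monitor for the block.
        synchronized (counter)
        {
            counter.total += f;
        }
    }

    writeln(i);             // 500000500000
    writeln(counter.total); // 500000500000
}

As Zoidberg observed, this is correct but much slower than the atomic or
reduce versions, since every iteration takes a lock (two, in this sketch).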
On Sat, 2015-09-26 at 14:33 +0200, anonymous via Digitalmars-d-learn wrote:
> […]
> I'm pretty sure atomicOp is faster, though.

Rough and ready anecdotal evidence would indicate that this is a
reasonable statement, by quite a long way. However a proper benchmark is
needed for statistical significance.

On the other hand, std.parallelism.taskPool.reduce surely has to be the
correct way of expressing the algorithm?

--
Russel.
Sep 28 2015
On Monday, 28 September 2015 at 11:31:33 UTC, Russel Winder wrote:
> On Sat, 2015-09-26 at 14:33 +0200, anonymous via Digitalmars-d-learn wrote:
> > […]
> > I'm pretty sure atomicOp is faster, though.
>
> Rough and ready anecdotal evidence would indicate that this is a
> reasonable statement, by quite a long way. However a proper benchmark is
> needed for statistical significance.
>
> On the other hand, std.parallelism.taskPool.reduce surely has to be the
> correct way of expressing the algorithm?

It would be really great if someone knowledgeable did a full review of
std.parallelism to find out the answer, hint, hint... :)
Sep 28 2015
On Mon, 2015-09-28 at 11:38 +0000, John Colvin via Digitalmars-d-learn wrote:
> […]
>
> It would be really great if someone knowledgeable did a full review of
> std.parallelism to find out the answer, hint, hint... :)

Indeed, I would love to be able to do this. However I don't have time in
the next few months to do this on a volunteer basis, and no-one is paying
money whereby this review could happen as a side effect. Sad, but…

--
Russel.
Sep 28 2015
As a single data point:

====================== anonymous_fix.d ==========
500000500000

real    0m0.168s
user    0m0.200s
sys     0m0.380s

====================== colvin_fix.d ==========
500000500000

real    0m0.036s
user    0m0.124s
sys     0m0.000s

====================== norwood_reduce.d ==========
500000500000

real    0m0.009s
user    0m0.020s
sys     0m0.000s

====================== original.d ==========
218329750363

real    0m0.024s
user    0m0.076s
sys     0m0.000s

Original is the original, not entirely slow, but broken :-). anonymous_fix
is anonymous's synchronized-keyword version, slow. colvin_fix is John
Colvin's use of atomicOp, correct but only ok-ish on speed. Jay Norwood
first proposed the reduce answer on the list; I amended it a tiddly bit,
but clearly it is a resounding speed winner.

I guess we need a benchmark framework that can run these 100 times,
taking processor times, and then do the statistics on them. Most people
would assume normal distribution of results and do mean/std deviation
and median.

--
Russel.
Sep 28 2015
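A very small harness in the spirit of what Russel asks for might look like
the sketch below (assumptions: the std.datetime.stopwatch module of a
reasonably recent Phobos, and reduceVersion standing in for whichever
candidate is being measured; it reports only min and mean rather than full
statistics):

import std.algorithm : minElement, sum;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writefln;

// Stand-in candidate: the reduce-based version benchmarked above.
ulong reduceVersion()
{
    return taskPool.reduce!"a + b"(0UL, iota(1, 1_000_001));
}

void bench(string name, ulong function() candidate, size_t runs = 100)
{
    auto usecs = new double[](runs);
    foreach (r; 0 .. runs)
    {
        auto sw = StopWatch(AutoStart.yes);
        immutable result = candidate();
        sw.stop();
        usecs[r] = sw.peek.total!"usecs";
        assert(result == 500_000_500_000); // sanity-check the answer
    }
    writefln("%s: min %.0f usecs, mean %.0f usecs",
             name, usecs.minElement, usecs.sum / runs);
}

void main()
{
    bench("taskPool.reduce", &reduceVersion);
}

Collecting the raw timings like this also makes it easy to compute the
median and standard deviation afterwards.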
On Monday, 28 September 2015 at 12:18:28 UTC, Russel Winder wrote:
> As a single data point:
>
> ====================== anonymous_fix.d ==========
> 500000500000
>
> real    0m0.168s
> user    0m0.200s
> sys     0m0.380s
>
> ====================== colvin_fix.d ==========
> 500000500000
>
> real    0m0.036s
> user    0m0.124s
> sys     0m0.000s
>
> ====================== norwood_reduce.d ==========
> 500000500000
>
> real    0m0.009s
> user    0m0.020s
> sys     0m0.000s
>
> ====================== original.d ==========
> 218329750363
>
> real    0m0.024s
> user    0m0.076s
> sys     0m0.000s
>
> Original is the original, not entirely slow, but broken :-). anonymous_fix
> is anonymous's synchronized-keyword version, slow. colvin_fix is John
> Colvin's use of atomicOp, correct but only ok-ish on speed. Jay Norwood
> first proposed the reduce answer on the list; I amended it a tiddly bit,
> but clearly it is a resounding speed winner.

Pretty much as expected. Locks are slow, shared accumulators suck, much
better to write to thread local and then merge.
Sep 28 2015
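The "write to thread local and then merge" pattern John describes can be
spelled directly with std.parallelism's workerLocalStorage; a sketch,
assuming the API details are remembered correctly (taskPool.reduce does
essentially the same thing for you internally):

import std.algorithm : sum;
import std.parallelism : parallel, taskPool;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // One ulong accumulator per worker thread, each starting at 0.
    auto partials = taskPool.workerLocalStorage(0UL);

    foreach (f; parallel(iota(1, 1_000_001)))
    {
        // Each thread updates only its own copy: no locks, no atomics.
        partials.get += f;
    }

    // Merge the per-worker partial sums sequentially at the end.
    writeln(sum(partials.toRange)); // 500000500000
}

This is consistent with the timings above: the reduce and worker-local
versions avoid the per-iteration synchronization that slows down the
atomic and synchronized variants.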
On Mon, 2015-09-28 at 12:46 +0000, John Colvin via Digitalmars-d-learn wrote:
> […]
>
> Pretty much as expected. Locks are slow, shared accumulators suck, much
> better to write to thread local and then merge.

Quite. Dataflow is where the parallel action is. (Except for those
writing concurrency and parallelism libraries.) Anyone doing concurrency
and parallelism with shared memory multi-threading, locks, synchronized,
mutexes, etc. is doing it wrong.

This has been known since the 1970s, but the programming community got
sidetracked by lack of abstraction (*) for a couple of decades.

(*) I blame C, C++ and Java. And programmers who programmed before (or
worse, without) thinking.

--
Russel.
Sep 28 2015
std.parallelism.reduce documentation provides an example of a parallel sum.

This works:
auto sum3 = taskPool.reduce!"a + b"(iota(1.0,1000001.0));

This results in a compile error:
auto sum3 = taskPool.reduce!"a + b"(iota(1UL,1000001UL));

I believe there was discussion of this problem recently ...
Sep 26 2015
btw, on my corei5, in debug build,

reduce (using double): 11msec
non_parallel: 37msec
parallel with atomicOp: 123msec

so, that is the reason for using parallel reduce, assuming the ulong
range thing will get fixed.
Sep 26 2015
This is a work-around to get a ulong result without having the ulong as
the range variable.

ulong getTerm(int i)
{
    return i;
}
auto sum4 = taskPool.reduce!"a + b"(std.algorithm.map!getTerm(iota(1000000001)));
Sep 26 2015
On Saturday, 26 September 2015 at 17:20:34 UTC, Jay Norwood wrote:
> This is a work-around to get a ulong result without having the ulong as
> the range variable.
>
> ulong getTerm(int i)
> {
>     return i;
> }
> auto sum4 = taskPool.reduce!"a + b"(std.algorithm.map!getTerm(iota(1000000001)));

or

auto sum4 = taskPool.reduce!"a + b"(0UL, iota(1_000_000_001));

works for me
Sep 26 2015
On Sat, 2015-09-26 at 17:20 +0000, Jay Norwood via Digitalmars-d-learn wrote:
> This is a work-around to get a ulong result without having the ulong as
> the range variable.
>
> ulong getTerm(int i)
> {
>     return i;
> }
> auto sum4 = taskPool.reduce!"a + b"(std.algorithm.map!getTerm(iota(1000000001)));

Not needed, as reduce can take an initial value that sets the type of the
template. See previous email.

--
Russel.
Sep 28 2015
On Sat, 2015-09-26 at 15:56 +0000, Jay Norwood via Digitalmars-d-learn wrote:
> std.parallelism.reduce documentation provides an example of a parallel sum.
>
> This works:
> auto sum3 = taskPool.reduce!"a + b"(iota(1.0,1000001.0));
>
> This results in a compile error:
> auto sum3 = taskPool.reduce!"a + b"(iota(1UL,1000001UL));
>
> I believe there was discussion of this problem recently ...

Which may or may not already have been fixed, or…

On the other hand:

    taskPool.reduce!"a + b"(1UL, iota(1000001));

seems to work fine.

--
Russel.
Sep 28 2015
On Saturday, 26 September 2015 at 15:56:54 UTC, Jay Norwood wrote:
> This results in a compile error:
> auto sum3 = taskPool.reduce!"a + b"(iota(1UL,1000001UL));
>
> I believe there was discussion of this problem recently ...

https://issues.dlang.org/show_bug.cgi?id=14832
https://issues.dlang.org/show_bug.cgi?id=6446

Looks like the problem has been reported a couple of times. I probably
saw the discussion of the 8/22 bug.
Sep 28 2015