digitalmars.D - std.parallel_algorithm
- dsimcha (11/11) May 22 2011 Here's some very early work in progress on std.parallel_algorithm:
- Russel Winder (25/39) May 22 2011 thm.d
- dsimcha (5/29) May 23 2011 Yeah, I meant to mention that when I was talking about how much of a
- Jonathan M Davis (11/26) May 22 2011 Well, unless you intend to try and have functions which choose whether t...
- Andrej Mitrovic (4/4) May 22 2011 David, do you have some automated way of adding new modules to the
- Andrej Mitrovic (2/2) May 22 2011 Russell, get parallelism.d from here:
- Andrej Mitrovic (3/3) May 22 2011 Anyway here's some results on a quad-core AMD:
- Russel Winder (14/16) May 23 2011 Too many ls in there but I'll assume it's me ;-)
- Jonathan M Davis (4/12) May 23 2011 That's what David said in his initial post. He had to make additional ch...
- Russel Winder (15/27) May 23 2011 nges=20
- dsimcha (22/38) May 23 2011 Interesting. Thanks. Unfortunately it doesn't look like merge and
Here's some very early work in progress on std.parallel_algorithm:

https://github.com/dsimcha/parallel_algorithm/blob/master/parallel_algorithm.d

I haven't compiled the docs yet because it's too much of a WIP and doesn't even work without a few modifications I made to std.parallelism. So far I've got parallel merge, sort and dot product.

One issue is that most of these algorithms require pretty fine-grained parallelism, and the break-even point depends on a lot of things, like the details of the hardware. How should we go about documenting/solving this? Is it reasonable to always just punt the issue of finding the break-even point, and whether one is above or below it, to the user of the library? If not, what is a reasonable solution?
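For context, here is a rough sketch of the kind of break-even decision being discussed, written against the public std.parallelism API. The function name `maybeParallelDot` and the `serialThreshold` parameter are illustrative, not taken from the actual module:

```d
import std.numeric : dotProduct;
import std.parallelism : taskPool;
import std.range : iota;

/// Illustrative wrapper: fall back to the serial algorithm below a
/// user-supplied break-even threshold, since task overhead dominates
/// for small inputs.
double maybeParallelDot(const(double)[] a, const(double)[] b,
                        size_t serialThreshold = 10_000)
{
    assert(a.length == b.length);
    if (a.length < serialThreshold)
        return dotProduct(a, b); // plain serial std.numeric version

    // One coarse chunk per worker thread (plus the main thread), so
    // task overhead is paid only a handful of times, not per element.
    immutable nChunks = taskPool.size + 1;
    immutable chunkLen = (a.length + nChunks - 1) / nChunks;
    auto partials = new double[](nChunks);

    foreach (i; taskPool.parallel(iota(nChunks), 1))
    {
        immutable lo = i * chunkLen;
        immutable hi = lo + chunkLen < a.length ? lo + chunkLen : a.length;
        partials[i] = lo < hi ? dotProduct(a[lo .. hi], b[lo .. hi]) : 0.0;
    }

    double sum = 0;
    foreach (p; partials)
        sum += p;
    return sum;
}
```

Punting to the user then amounts to exposing something like `serialThreshold` and documenting that its useful value is hardware-dependent; with the threshold forced down to 1, even tiny inputs take the parallel path, which is exactly where the overhead makes it a loss.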
May 22 2011
David,

On Mon, 2011-05-23 at 00:00 -0400, dsimcha wrote:
> Here's some very early work in progress on std.parallel_algorithm:
>
> https://github.com/dsimcha/parallel_algorithm/blob/master/parallel_algorithm.d
>
> I haven't compiled the docs yet because it's too much of a WIP and
> doesn't even work without a few modifications I made to std.parallelism.
> So far I've got parallel merge, sort and dot product.

Does this require a version of std.parallelism that is not in the standard Phobos?

+ dmd -m32 -lib std/parallel_algorithm.d
std/parallel_algorithm.d(379): Error: class std.parallelism.TaskPool member defaultWorkUnitSize is not accessible
+ dmd -m64 -lib std/parallel_algorithm.d
std/parallel_algorithm.d(379): Error: class std.parallelism.TaskPool member defaultWorkUnitSize is not accessible

--
Russel.
Dr Russel Winder       t: +44 20 7585 2200    voip: sip:russel.winder ekiga.net
41 Buckmaster Road     m: +44 7770 465 077    xmpp: russel russel.org.uk
London SW11 1EN, UK    w: www.russel.org.uk   skype: russel_winder
May 22 2011
On 5/23/2011 1:23 AM, Russel Winder wrote:
> Does this require a version of std.parallelism that is not in the
> standard Phobos?

Yeah, I meant to mention that when I was talking about how much of a work in progress it is. I didn't expect anyone to actually want to test it this soon. You need the slightly modified one at:

https://github.com/dsimcha/phobos/blob/master/std/parallelism.d
May 23 2011
On 2011-05-22 21:00, dsimcha wrote:
> One issue is that most of these algorithms require pretty fine-grained
> parallelism, and the break-even point depends on a lot of things, like
> the details of the hardware. How should we go about documenting/solving
> this? Is it reasonable to always just punt the issue of finding the
> break-even point, and whether one is above or below it, to the user of
> the library? If not, what is a reasonable solution?

Well, unless you intend to try and have functions which choose whether they're parallel or not at runtime based on what they're given, I'd fully expect the choice between the parallel and non-parallel versions to be up to the programmer using them, and if that's the case, it's completely up to them to determine which is more efficient given their particular use case. At best, you could give some sort of information on where you expect the break-even point to be, and if it's highly system-dependent (as it sounds like it is), then I don't see how you could do even that. So, really, I'd expect it to be completely up to the programmer.

- Jonathan M Davis
May 22 2011
David, do you have some automated way of adding new modules to the makefile? It's a bit of a drag to manually do this. ;) Also, make sucks. I made a typo at line 429 and it reported an error at line 542. Gah!
May 22 2011
Russell, get parallelism.d from here: https://github.com/dsimcha/phobos/tree/master/std
May 22 2011
Anyway here's some results on a quad-core AMD: https://gist.github.com/986322 Was too lazy to calculate an average out of N runs. :)
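The averaging over N runs that got skipped here is only a few lines with a stopwatch. A possible sketch (the helper name `avgTime` is illustrative, not code from the gist; note that `StopWatch` was in std.datetime in 2011, while current releases have it in std.datetime.stopwatch):

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;

/// Illustrative helper: average wall-clock time of one call to `fun`
/// over `n` runs, to smooth out run-to-run noise in benchmarks.
Duration avgTime(alias fun)(size_t n)
{
    assert(n > 0);
    auto sw = StopWatch(AutoStart.yes); // starts timing immediately
    foreach (i; 0 .. n)
        fun();
    return sw.peek / n; // total elapsed time divided by run count
}
```

Timing the whole batch once and dividing, rather than timing each run separately, also keeps the stopwatch overhead out of the per-run figure.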
May 22 2011
On Mon, 2011-05-23 at 08:16 +0200, Andrej Mitrovic wrote:
> Russell, get parallelism.d from here:
> https://github.com/dsimcha/phobos/tree/master/std

Too many ls in there but I'll assume it's me ;-)

So the implication is that std.parallelism has changed since the 2.053 release and that std.parallel_algorithm relies on this change?

--
Russel.
May 23 2011
On 2011-05-23 02:07, Russel Winder wrote:
> So the implication is that std.parallelism has changed since the 2.053
> release and that std.parallel_algorithm relies on this change?

That's what David said in his initial post. He had to make additional changes to std.parallelism for std.parallel_algorithm to work.

- Jonathan M Davis
May 23 2011
Jonathan,

On Mon, 2011-05-23 at 02:15 -0700, Jonathan M Davis wrote:
> That's what David said in his initial post. He had to make additional
> changes to std.parallelism for std.parallel_algorithm to work.

Ah, thanks for the reminder. That'll teach me to skim read, get the URL, act, and fail to actually note important information.

--
Russel.
May 23 2011
On 5/23/2011 6:10 AM, Russel Winder wrote:
> Ah, thanks for the reminder. That'll teach me to skim read, get the
> URL, act, and fail to actually note important information.

Interesting. Thanks. Unfortunately it doesn't look like merge and dotProduct scale to a quad core very well, because I tuned these to be very close to the break-even point on a dual core. This is part of my concern: in the worst case, stuff that's very close to break-even on a dual or quad core can get **slower** when run on a 6- or 8-core machine. This is just a fact of life.

Eventually, I'd like to push the break-even point for std.parallelism a little lower, and I know roughly how to do it (mainly by implementing work stealing instead of a simple FIFO queue). However, there are tons of nagging little implementation issues, and there would still always be a break-even point, even if it's lower.

Right now the break-even point seems to be on the order of a few microseconds per task, which on modern hardware equates to ~10,000 clock cycles. This is good enough for a lot of practical purposes, but obviously not nano-parallelism. I've made improving this a fairly low priority. Given the implementation difficulties, the fact that it's fairly easy to take full advantage of multicore hardware without getting down to this granularity (usually, if you need more performance, you're executing some expensive algorithm in a loop, and you can just parallelize the outer loop and be done with it), and the fact that there's plenty of other useful hacking jobs in the D ecosystem, the benefits-to-effort ratio doesn't seem that good.
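The "parallelize the outer loop" pattern described above is the standard std.parallelism use case, and a sketch makes the granularity argument concrete. Here `expensiveTransform` is a made-up stand-in for any costly per-item algorithm, not anything from the library:

```d
import std.math : sqrt;
import std.parallelism : parallel;

/// Stand-in for an expensive per-element algorithm; each call should
/// cost well above the ~10,000-cycle break-even point for parallelism
/// to pay off.
double expensiveTransform(double x)
{
    double s = 0;
    foreach (k; 1 .. 1_000)
        s += sqrt(x + k);
    return s;
}

void processAll(double[] data)
{
    // Only the outer loop is parallel; each loop body runs serially on
    // one worker, so task overhead is amortized over the whole body.
    foreach (ref x; parallel(data))
        x = expensiveTransform(x);
}
```

At this coarseness the FIFO queue's overhead is negligible, which is why work stealing only matters for much finer-grained tasks.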
May 23 2011