digitalmars.D.learn - std.parallelism curious results
- flamencofantasy (54/54) Oct 05 2014 Hello,
- Artem Tarasov (3/3) Oct 05 2014 Welcome to the world of multithreading.
- Russel Winder via Digitalmars-d-learn (30/95) Oct 05 2014 -----BEGIN PGP SIGNED MESSAGE-----
- Sativa (44/44) Oct 05 2014 Two problems, one, you should create your threads outside the
- Ali Çehreli (26/31) Oct 05 2014 Reducing the number of threads is key. However, unlike what others said,...
- Sativa (39/39) Oct 05 2014 On Sunday, 5 October 2014 at 21:25:39 UTC, Ali Çehreli wrote:
- Ali Çehreli (10/12) Oct 05 2014 The for loop condition is executed at every iteration and division is an...
- Sativa (13/26) Oct 05 2014 Yes, it is a common problem when doing a computation in a for
- flamencofantasy (5/5) Oct 05 2014 Thanks everyone for the replies.
Hello,

I am summing up the first 1 billion integers in parallel and in a single thread and I'm observing some curious results:

parallel sum : 499999999500000000, elapsed 102833 ms
single thread sum : 499999999500000000, elapsed 1667 ms

The parallel version is 60+ times slower on my i7-3770K CPU. I think that may be due to the CPU constantly flushing and reloading the caches in the parallel version, but I don't know for sure.

Here is the D code:

shared ulong sum = 0;
ulong iter = 1_000_000_000UL;
StopWatch sw;
sw.start();
foreach(i; parallel(iota(0, iter))) {
    atomicOp!"+="(sum, i);
}
sw.stop();
writefln("parallel sum : %s, elapsed %s ms", sum, sw.peek().msecs);

sum = 0;
sw.reset();
sw.start();
for (ulong i = 0; i < iter; ++i) {
    sum += i;
}
sw.stop();
writefln("single thread sum : %s, elapsed %s ms", sum, sw.peek().msecs);

For comparison, the equivalent C# code below gives

parallel sum : 499999999500000000, elapsed 20320 ms
single thread sum : 499999999500000000, elapsed 1901 ms

which is strange on the exact same CPU.

long sum = 0;
long iter = 1000000000L;
var sw = Stopwatch.StartNew();
Parallel.For(0, iter, i => {
    Interlocked.Add(ref sum, i);
});
Console.WriteLine("parallel sum : {0}, elapsed {1} ms", sum, sw.ElapsedMilliseconds);

sum = 0;
sw = Stopwatch.StartNew();
for (long i = 0; i < iter; ++i) {
    sum += i;
}
Console.WriteLine("single thread sum : {0}, elapsed {1} ms", sum, sw.ElapsedMilliseconds);

Thoughts?
Oct 05 2014
Welcome to the world of multithreading. You have just discovered that atomic operations are performance killers, congratulations on this.
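To get a feel for how large that per-operation cost is even without any cross-thread contention, here is a minimal single-threaded sketch (not from the thread; the 100 million iteration count is arbitrary) that times a plain += against atomicOp!"+=" on a shared variable, using the same StopWatch approach as the original post:

import core.atomic : atomicOp;
import std.datetime : StopWatch;
import std.stdio : writefln;

void main()
{
    enum n = 100_000_000UL;

    // Plain accumulation: an ordinary add, the running total can live in a register.
    ulong plain = 0;
    StopWatch sw;
    sw.start();
    foreach (i; 0 .. n)
        plain += i;
    sw.stop();
    writefln("plain +=  : %s, elapsed %s ms", plain, sw.peek().msecs);

    // Atomic accumulation: every iteration is a locked read-modify-write on memory,
    // even though only one thread is running.
    shared ulong total = 0;
    sw.reset();
    sw.start();
    foreach (i; 0 .. n)
        atomicOp!"+="(total, i);
    sw.stop();
    writefln("atomicOp  : %s, elapsed %s ms", total, sw.peek().msecs);
}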
Oct 05 2014
On 05/10/14 15:27, flamencofantasy via Digitalmars-d-learn wrote:
> Hello, I am summing up the first 1 billion integers in parallel and in a
> single thread and I'm observing some curious results;

I am fairly certain that your use of "parallel for" introduces quite a lot of threads other than your "master" one.

> parallel sum : 499999999500000000, elapsed 102833 ms
> single thread sum : 499999999500000000, elapsed 1667 ms
>
> The parallel version is 60+ times slower on my i7-3770K CPU. I think that
> may be due to the CPU constantly flushing and reloading the caches in the
> parallel version but I don't know for sure.

I would bet there are cache problems, but far more likely that the core problem is all the thread activity and in particular all the synchronization.

> Here is the D code:
>
> shared ulong sum = 0;
> ulong iter = 1_000_000_000UL;
> StopWatch sw;
> sw.start();
> foreach(i; parallel(iota(0, iter))) {
>     atomicOp!"+="(sum, i);
> }

Well that will be the problem then, lots and lots of synchronization with the billion tasks you have set up. I am highly surprised this is only 60 times slower than sequential!

> [...]
>
> Parallel.For(0, iter, i => {
>     Interlocked.Add(ref sum, i);
> });

It would appear that the C#/.NET machinery for doing an atomic add in this (somewhat perverse) context is relatively much more efficient than that of D. There is almost certainly a useful benchmark test that can come of this for the std.parallelism implementation (if only I had a few cycles to get really stuck in to a review and analysis of the module :-(

> [...]
>
> Thoughts?

--
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Oct 05 2014
Two problems. One, you should create your threads outside the stopwatch; it is not generally a fair comparison in the real world and it throws off the results for short tasks.

Second, you are creating one thread per integer, and this is bad. Do you really want to create 1B threads when you only have probably 4 cores?

Below, 4 threads are used. Each thread adds up 1/4 of the integers, so it is like 4 threads each adding up 250M integers. The speed, compared to a single thread adding up 250M integers, shows how much the parallelism costs per thread.

import std.stdio, std.parallelism, std.datetime, std.range, core.atomic;

void main()
{
    StopWatch sw;
    shared ulong sum1 = 0, sum2 = 0, sum3 = 0, time1, time2, time3;

    auto numThreads = 4;
    ulong iter = numThreads*100000UL;
    auto thds = parallel(iota(0, iter, iter/numThreads));

    sw.start();
    foreach(i; thds)
    {
        ulong s = 0;
        for(ulong k = 0; k < iter/numThreads; k++) { s += k; }
        s += i*iter/numThreads;
        atomicOp!"+="(sum1, s);
    }
    sw.stop();
    time1 = sw.peek().usecs;

    sw.reset();
    sw.start();
    for (ulong i = 0; i < iter; ++i) { sum2 += i; }
    sw.stop();
    time2 = sw.peek().usecs;

    writefln("parallel sum : %s, elapsed %s us", sum1, time1);
    writefln("single thread sum : %s, elapsed %s us", sum2, time2);
    writefln("Efficiency : %s%%", 100*time2/time1);
}

http://dpaste.dzfl.pl/bfda7bb2e2b7

Some results:

parallel sum : 79999800000, elapsed 3356 us
single thread sum : 79999800000, elapsed 1984 us
Efficiency : 59%

(Not sure all the code is correct; the point is you were creating 1B threads with 1B atomic operations, the worst possible comparison one can do between single and multi-threaded tests.)
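Another way to get per-thread partial sums without doing the partitioning arithmetic by hand is std.parallelism's workerLocalStorage, which gives each worker thread its own accumulator; only the final combination step reads them all. A minimal sketch along those lines (the 100 million iteration count is arbitrary, not taken from the thread):

import std.parallelism : parallel, taskPool;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    enum ulong iter = 100_000_000;

    // One ulong accumulator per worker thread (plus one for the main thread).
    auto partial = taskPool.workerLocalStorage(0UL);

    foreach (i; parallel(iota(0UL, iter)))
        partial.get += i;          // each thread updates only its own slot, no atomics

    ulong sum = 0;
    foreach (s; partial.toRange)   // combine the per-thread partial sums afterwards
        sum += s;

    writefln("sum : %s", sum);
}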
Oct 05 2014
On 10/05/2014 07:27 AM, flamencofantasy wrote:
> I am summing up the first 1 billion integers in parallel and in a single
> thread and I'm observing some curious results;
>
> parallel sum : 499999999500000000, elapsed 102833 ms
> single thread sum : 499999999500000000, elapsed 1667 ms
>
> The parallel version is 60+ times slower

Reducing the number of threads is key. However, unlike what others said, parallel() does not use that many threads. By default, TaskPool objects are constructed with 'totalCPUs - 1' worker threads, and all of parallel()'s iterations are executed on those few threads.

The main problem here is the use of atomicOp, which necessarily synchronizes the whole process.

Something like the following takes advantage of parallelism and reduces the execution time by half on my machine (4 cores (2 actual ones, hyperthreaded)):

ulong adder(ulong beg, ulong end)
{
    ulong localSum = 0;
    foreach (i; beg .. end) {
        localSum += i;
    }
    return localSum;
}

enum totalTasks = 10;

foreach(i; parallel(iota(0, totalTasks))) {
    ulong beg = i * iter / totalTasks;
    ulong end = beg + iter / totalTasks;
    atomicOp!"+="(sum, adder(beg, end));
}

Ali
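std.parallelism can also do this chunked, local-sum reduction itself: taskPool.reduce splits the range into work units, accumulates within each work unit, and only combines the partial results at the end, so no per-element atomics are needed. A minimal sketch of the same billion-element sum (timings will of course vary by machine):

import std.datetime : StopWatch;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    enum ulong iter = 1_000_000_000;

    StopWatch sw;
    sw.start();
    // Seed 0UL, then "a + b" is applied within each work unit and across the
    // per-work-unit results; no shared counter is touched inside the loop.
    auto sum = taskPool.reduce!"a + b"(0UL, iota(0UL, iter));
    sw.stop();

    writefln("parallel reduce sum : %s, elapsed %s ms", sum, sw.peek().msecs);
}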
Oct 05 2014
On Sunday, 5 October 2014 at 21:25:39 UTC, Ali Çehreli wrote:

import std.stdio, std.cstream, std.parallelism, std.datetime, std.range, core.atomic;

void main()
{
    StopWatch sw;
    shared ulong sum1 = 0;
    ulong sum2 = 0, sum3 = 0, time1, time2, time3;

    enum numThreads = 4; // If numThreads is a variable then it significantly slows down the process

    ulong iter = 1000000L;
    // Force iter to be a multiple of the number of threads so we can partition uniformly
    iter = numThreads*cast(ulong)(iter/numThreads);

    auto thds = parallel(iota(0, cast(uint)iter, cast(uint)(iter/numThreads)));

    sw.reset();
    sw.start();
    foreach(i; thds)
    {
        ulong s = 0;
        for(ulong k = 0; k < iter/numThreads; k++) { s += k; }
        s += i*iter/numThreads;
        atomicOp!"+="(sum1, s);
    }
    sw.stop();
    time1 = sw.peek().usecs;

    sw.reset();
    sw.start();
    for (ulong i = 0; i < iter; ++i) { sum2 += i; }
    sw.stop();
    time2 = sw.peek().usecs;

    writefln("parallel sum : %s, elapsed %s us", sum1, time1);
    writefln("single thread sum : %s, elapsed %s us", sum2, time2);
    if (time1 > 0) writefln("Efficiency : %s%%", 100*time2/time1);

    din.getc();
}

Playing around with the code above, it seems that whether numThreads is an enum or a variable significantly affects the execution time (from below 100% efficiency to above 100%).

Results on a 4 core laptop with release builds, when numThreads is an enum:

parallel sum : 499999500000, elapsed 2469 us
single thread sum : 499999500000, elapsed 8054 us
Efficiency : 326%

when numThreads is an int:

parallel sum : 499999500000, elapsed 21762 us
single thread sum : 499999500000, elapsed 8033 us
Efficiency : 36%
Oct 05 2014
On 10/05/2014 02:40 PM, Sativa wrote:
> foreach(i; thds)
> {
>     ulong s = 0;
>     for(ulong k = 0; k < iter/numThreads; k++)

The for loop condition is executed at every iteration, and division is an expensive operation. Apparently, the compiler does some optimization when the divisor is known at compile time: being 4, it is just a shift of 2 bits. Try something like 5; it is slow even for enum.

This solves the problem:

    const end = iter/numThreads;
    for(ulong k = 0; k < end; k++) {

Ali
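As a self-contained illustration of that point (hypothetical numbers and names, not from the thread), the two loops below differ only in whether the bound is recomputed in the condition. With a run-time divisor, the first form can pay for a division on every iteration unless the optimizer hoists it; the second hoists it explicitly, once:

import std.datetime : StopWatch;
import std.stdio : writefln;

void main()
{
    ulong iter = 100_000_000;
    int numThreads = 5;   // run-time value: the divide cannot become a compile-time shift

    ulong s1 = 0;
    StopWatch sw;
    sw.start();
    for (ulong k = 0; k < iter/numThreads; k++)   // bound re-evaluated each iteration
        s1 += k;
    sw.stop();
    writefln("divide in condition : %s, %s ms", s1, sw.peek().msecs);

    ulong s2 = 0;
    sw.reset();
    sw.start();
    const end = iter/numThreads;                  // hoisted: divide once
    for (ulong k = 0; k < end; k++)
        s2 += k;
    sw.stop();
    writefln("hoisted bound       : %s, %s ms", s2, sw.peek().msecs);
}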
Oct 05 2014
On Sunday, 5 October 2014 at 21:53:23 UTC, Ali Çehreli wrote:
> On 10/05/2014 02:40 PM, Sativa wrote:
>> foreach(i; thds)
>> {
>>     ulong s = 0;
>>     for(ulong k = 0; k < iter/numThreads; k++)
>
> The for loop condition is executed at every iteration, and division is an
> expensive operation. Apparently, the compiler does some optimization when
> the divisor is known at compile time: being 4, it is just a shift of 2
> bits. Try something like 5; it is slow even for enum.
>
> This solves the problem:
>
>     const end = iter/numThreads;
>     for(ulong k = 0; k < end; k++) {

Yes, it is a common problem when a for loop does a computation on its bounds. Most of the time they are constant for the loop, but the compiler computes them every iteration. When doing a simple sum (when the loop does not do much), that becomes expensive, since it is comparable to what is happening inside the loop. It's surprising just how slow it makes it, though.

One can't really make numThreads a constant in the real world, though, as it wouldn't be optimal (unless one had a version for each possible number of threads). Obviously one can just move the computation outside the loop. I would expect better results if the loops actually did some real work.
Oct 05 2014
Thanks everyone for the replies. I wasn't sure how std.parallelism operated, but I thought it would launch at most a number of threads equal to the number of cores on the machine, just as Ali confirmed, and similar to what Windows' thread pool does.
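For anyone who wants to check this on their own machine, std.parallelism exposes the relevant numbers directly. A small sketch that prints them (these are real std.parallelism symbols; the output obviously depends on the machine):

import std.parallelism : taskPool, totalCPUs, defaultPoolThreads;
import std.stdio : writefln;

void main()
{
    // Number of logical CPUs the runtime sees.
    writefln("totalCPUs          : %s", totalCPUs);

    // Worker threads in the default pool: totalCPUs - 1 unless overridden,
    // since the thread that submits the work also executes work units.
    writefln("defaultPoolThreads : %s", defaultPoolThreads);
    writefln("taskPool.size      : %s", taskPool.size);
}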
Oct 05 2014