digitalmars.D.learn - Concurrency and program speed
Hello all,

I'm in need of some guidance regarding std.concurrency. Before writing further, I should add that I'm an almost complete novice where concurrency is concerned, in general and particularly with D: I've written a few programs that made use of std.parallelism, but that's about it.

In this case, there's a strong need to use std.concurrency, because the functions that will be run in parallel involve generating substantial quantities of random numbers. AFAICS std.parallelism just isn't safe for that, in a statistical sense (no idea how it might slow things down in terms of shared access to a common rndGen).

Now, I'm not naive enough to believe that using n threads will simply result in the program runtime being divided by n. However, the results I'm getting with some simple test code (attached) are curious, and I'd like to understand better what's going on.

The program is simple enough:

    foreach(i; iota(n))
        spawn(&randomFunc, m);

... where randomFunc is a function that generates and sums m different random numbers. For speed comparison one can instead do:

    foreach(i; iota(n))
        randomFunc(m);

... with m = 100_000_000 being chosen in my case.

Setting n = 2 on my 4-core laptop, the sequential case runs in about 4 s; the concurrent version using spawn() runs in about 2.2 s (the total "user" time reported for the sequential and concurrent programs is about 4 s and about 4.3 s respectively). So, roughly half the runtime, as you might expect.

Setting n = 3, the sequential case runs in about 6 s (surprise!), the concurrent version in about 3 s (with about 8.1 s of "user" time recorded). In other words, the concurrent runtime is still only half that of the sequential version, even though there's no shared data and the CPU can comfortably accommodate the 3 threads at full speed. (In fact 270% CPU usage is recorded, but even that should translate into a faster program.)

Setting n = 4, the sequential case runs in 8 s, the concurrent in about 3.8 s (with 14.8 s of "user" time recorded), with 390% CPU usage.

In other words, it doesn't seem possible to get more than about a 2x speedup on my system from using concurrency, even though there should not be any data races or other factors that might explain slower performance. I didn't expect speed / n, but I did expect something a little better than this -- so can anyone suggest what might be going on here? (Unfortunately, I don't have a system with a greater number of cores on which to test with greater numbers of threads.)

The times reported here are for programs compiled with GDC, but using LDC or DMD produces similar behaviour.

Can anyone advise?

Thanks & best wishes,

    -- Joe
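For concreteness, here's a minimal self-contained sketch of the kind of test described above -- the body of randomFunc (summing uniform doubles) is an assumption on my part, not the attached code:

    import std.concurrency, std.random, std.range, std.stdio;

    // Stand-in for the attached randomFunc: generate and sum m random numbers.
    void randomFunc(size_t m)
    {
        real total = 0.0;
        foreach (i; iota(m))
            total += uniform(0.0, 1.0);   // uses the thread-local rndGen
        writeln("sum = ", total);
    }

    void main()
    {
        enum size_t m = 100_000_000;
        enum n = 3;

        // Concurrent version: one spawned thread per call.
        // (The runtime joins the spawned threads before the program exits.)
        foreach (i; iota(n))
            spawn(&randomFunc, m);

        // Sequential version, for comparison:
        // foreach (i; iota(n))
        //     randomFunc(m);
    }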
Feb 28 2013
The timings on my (4 core) machine are:

    sequential: 7.7 s
    concurrent: 2.1 s

Which is about what one would expect.

One thing that could be contributing to the timings you are seeing is Turbo Boost (http://en.wikipedia.org/wiki/Intel_Turbo_Boost). Some mobile processors have large differences (for example, 2.1 GHz vs 3.1 GHz) between their normal and maximal Turbo Boost frequencies.
Feb 28 2013
On Thursday, 28 February 2013 at 14:15:57 UTC, Joseph Rushton Wakeling wrote:

> In other words, it doesn't seem possible to get more than about a 2x
> speedup on my system from using concurrency, even though there should
> not be any data races or other factors that might explain slower
> performance.

I've just run your example with DMD 2.061 on an Intel Core2 Quad (Q8200, 2.33 GHz) and got 36 s in sequential mode and 11-12 s in parallel (n=4). So it's a ~3x speedup.

Two ideas come to mind:

1) Amdahl's law. If there is some locking inside the random number generator, you will never get a linear speedup.

2) More probable: the "4 cores" of your laptop might be just 2 cores with hyperthreading. That's not an honest 4 cores, and they won't give you a 4x speedup anywhere.
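To put rough numbers on the Amdahl's law point -- the 90% parallel fraction below is just an illustrative assumption, not a measurement of the RNG:

    import std.stdio;

    void main()
    {
        // Amdahl's law: speedup = 1 / ((1 - p) + p / n),
        // where p is the parallelisable fraction and n the number of cores.
        double p = 0.9;   // assumed parallel fraction
        foreach (n; [2, 3, 4])
            writefln("n = %s: speedup = %.2f", n, 1.0 / ((1.0 - p) + p / n));
    }

That prints roughly 1.82, 2.50 and 3.08 -- even a small serial portion (e.g. a lock taken on every random number) caps the speedup well below n.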
Mar 01 2013
On 03/01/2013 11:43 AM, thedeemon wrote:

> Two ideas come to mind:
> 1) Amdahl's law. If there is some locking inside the random number
> generator, you will never get a linear speedup.

I had some concerns here. rndGen is supposed to be thread-safe, which was incidentally why I was using std.concurrency; I'm not sure whether the same safety applies with std.parallelism.

> 2) More probable: the "4 cores" of your laptop might be just 2 cores
> with hyperthreading. That's not an honest 4 cores, and they won't give
> you a 4x speedup anywhere.

Yes, this turns out to be the case, so it's the most likely cause. Your results and jerro's suggest that with a greater number of true cores, an appropriate speedup results.

Thanks to both of you for the thoughts and advice!
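On the rndGen point: as far as I understand it, std.random's rndGen is thread-local, so each spawned thread gets its own independently seeded generator and there is no shared lock to contend on. A minimal sketch of making that independence explicit (the worker name and seeding scheme here are just illustrative, not the code from my test):

    import std.concurrency, std.random, std.stdio;

    // Give each spawned thread its own explicitly seeded generator,
    // rather than relying on the implicit thread-local rndGen.
    void worker(uint seed)
    {
        auto rng = Random(seed);          // private generator for this thread
        writeln(uniform(0.0, 1.0, rng));
    }

    void main()
    {
        foreach (i; 0 .. 3)
            spawn(&worker, unpredictableSeed);
    }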
Mar 01 2013