digitalmars.D.learn - multithread/concurrency/parallel methods and performance
- SrMordred (19/19) Feb 18 2018 I´m experimenting with threads and related recently.
- Jordan Wilson (7/28) Feb 18 2018 It may be due to thread local storage:
- Nicholas Wilson (25/46) Feb 18 2018 As SIZE=1024*1024 (i.e. not much, possibly well within L2 cache
- SrMordred (11/18) Feb 19 2018 That 'concurrency overhead' is what i´m not getting.
- Dmitry Olshansky (6/27) Feb 18 2018 The operation is trivial and dataset is rather small. In such
- SrMordred (10/13) Feb 19 2018 Yes, absolutely right :)
- Dmitry Olshansky (20/33) Feb 19 2018 Yeah, the world is ugly place where trivial math sometimes
I'm experimenting with threads and related topics recently. (I've just started, so there may be some terrible mistakes here.)

With this base work:

    foreach(i ; 0 .. SIZE)
    {
        results[i] = values1[i] * values2[i];
    }

and then with three other methods (parallel, spawn and Thread), these were my results:

_base        : 456 ms and 479 us
_parallel    : 331 ms, 324 us, and 4 hnsecs
_concurrency : 367 ms, 348 us, and 2 hnsecs
_thread      : 369 ms, 565 us, and 3 hnsecs

(code here: https://run.dlang.io/is/2pdmmk )

All methods show only minor speedup gains; I was expecting a lot more. Since I have 7 cores, I expected something below 100 ms. I'm not seeing false sharing in this case, or am I wrong?

If someone can expand on this, I'll be grateful. Thanks!
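[Editor's note: for reference, a minimal D sketch of three of the variants being compared (the base loop, std.parallelism's parallel foreach, and a std.concurrency spawn version). The exact benchmark is at the run.dlang.io link above; the __gshared arrays, sizes and chunking scheme here are assumptions, and the core.thread.Thread variant is omitted since it divides the range the same way as the spawn one.]

    import std.parallelism : parallel, totalCPUs;
    import std.concurrency : spawn, send, receiveOnly, Tid, thisTid;
    import std.range : iota;

    enum SIZE = 1024 * 1024;

    // __gshared so the spawned workers can see the same arrays; the original
    // benchmark may declare these differently (e.g. as locals or TLS globals).
    __gshared float[] values1, values2, results;

    void baseWork()
    {
        foreach (i; 0 .. SIZE)
            results[i] = values1[i] * values2[i];
    }

    void parallelWork()
    {
        // std.parallelism splits the index range across a cached thread pool.
        foreach (i; iota(SIZE).parallel)
            results[i] = values1[i] * values2[i];
    }

    void spawnWork()
    {
        // Hand-rolled chunking: one spawned worker per core, each owning a slice.
        immutable chunk = SIZE / totalCPUs;
        foreach (t; 0 .. totalCPUs)
        {
            immutable lo = t * chunk;
            immutable hi = (t == totalCPUs - 1) ? SIZE : lo + chunk;
            spawn((uint lo_, uint hi_, Tid owner) {
                foreach (i; lo_ .. hi_)
                    results[i] = values1[i] * values2[i];
                owner.send(true);        // tell the spawning thread we are done
            }, lo, hi, thisTid);
        }
        foreach (t; 0 .. totalCPUs)
            receiveOnly!bool();          // "join": wait for every worker
    }

    void main()
    {
        values1 = new float[SIZE];
        values2 = new float[SIZE];
        results = new float[SIZE];
        values1[] = 2.0f;
        values2[] = 3.0f;

        baseWork();
        parallelWork();
        spawnWork();
    }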
Feb 18 2018
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
> I'm experimenting with threads and related topics recently. [...] All methods show only minor speedup gains; I was expecting a lot more.

It may be due to thread-local storage:

https://tour.dlang.org/tour/en/multithreading/thread-local-storage
https://dlang.org/articles/migrate-to-shared.html

I'm not sure though, as I don't know how your results array is initialised.

Jordan
Feb 18 2018
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
> [...] All methods show only minor speedup gains; I was expecting a lot more. Since I have 7 cores, I expected something below 100 ms.

As SIZE = 1024*1024 (i.e. not much, possibly well within L2 cache for 32 bit), it may be that dealing with the concurrency overhead adds a significant amount of overhead. Also, the run.dlang.io link has no -O flag and thus no optimisations.

Without -O I get:

_base        : 323 ms, 92 μs, and 6 hnsecs
_parallel    : 276 ms, 649 μs, and 3 hnsecs
_concurrency : 221 ms, 931 μs, and 7 hnsecs
_thread      : 212 ms, 277 μs, and 3 hnsecs

With it I get:

_base        : 150 ms, 728 μs, and 5 hnsecs
_parallel    : 120 ms, 78 μs, and 5 hnsecs
_concurrency : 134 ms, 787 μs, and 4 hnsecs
_thread      : 129 ms, 476 μs, and 2 hnsecs

With SIZE = 16*1024*1024, without -O I get:

_base        : 5 secs, 835 ms, 240 μs, and 9 hnsecs
_parallel    : 4 secs, 802 ms, 279 μs, and 8 hnsecs
_concurrency : 2 secs, 133 ms, 685 μs, and 3 hnsecs
_thread      : 2 secs, 108 ms, 860 μs, and 9 hnsecs

With SIZE = 16*1024*1024, with -O I get:

_base        : 2 secs, 502 ms, 523 μs, and 4 hnsecs
_parallel    : 1 sec, 769 ms, 945 μs, and 3 hnsecs
_concurrency : 1 sec, 362 ms, 747 μs, and 1 hnsec
_thread      : 1 sec, 335 ms, 720 μs, and 1 hnsec
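[Editor's note: a sketch of a timing harness that produces Durations in the format quoted above, using std.datetime.stopwatch; the exact measurement code in the linked benchmark may differ.]

    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.stdio : writeln;

    void main()
    {
        enum SIZE = 16 * 1024 * 1024;     // try several sizes, as suggested
        auto values1 = new float[SIZE];
        auto values2 = new float[SIZE];
        auto results = new float[SIZE];
        values1[] = 2.0f;
        values2[] = 3.0f;

        auto sw = StopWatch(AutoStart.yes);
        foreach (i; 0 .. SIZE)
            results[i] = values1[i] * values2[i];
        sw.stop();
        writeln("_base : ", sw.peek);     // a Duration prints as "N ms, N μs, ..."
    }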
Feb 18 2018
On Monday, 19 February 2018 at 05:49:54 UTC, Nicholas Wilson wrote:
> As SIZE = 1024*1024 (i.e. not much, possibly well within L2 cache for 32 bit), it may be that dealing with the concurrency overhead adds a significant amount of overhead.

That 'concurrency overhead' is what I'm not getting. Since the array is big, dividing it across 6 or 7 cores will not thrash L1, since the chunks are very far from each other, right? Or is L2 cache thrashing also a problem in this case?

> _base        : 150 ms, 728 μs, and 5 hnsecs
> _parallel    : 120 ms, 78 μs, and 5 hnsecs
> _concurrency : 134 ms, 787 μs, and 4 hnsecs
> _thread      : 129 ms, 476 μs, and 2 hnsecs

Yes, on my PC I was using -release. Yet, 150 ms for 1 core versus 120-134 ms for X cores. Shouldn't it be way faster? I'm trying to understand where the overhead is, and whether it is possible to get rid of it (perfect thread scaling).
Feb 19 2018
On Sunday, 18 February 2018 at 17:54:58 UTC, SrMordred wrote:
> All methods show only minor speedup gains; I was expecting a lot more. Since I have 7 cores, I expected something below 100 ms. I'm not seeing false sharing in this case, or am I wrong?

The operation is trivial and the dataset is rather small. In such cases SIMD with e.g. array ops is the way to go:

    result[] = values[] * values2[];

Parallelism gets more interesting with more expensive operations. You may also try bigger sizes, or both.
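[Editor's note: a self-contained version of the array-op suggestion above; variable names follow the original post rather than the linked benchmark.]

    void main()
    {
        enum SIZE = 1024 * 1024;
        auto values1 = new float[SIZE];
        auto values2 = new float[SIZE];
        auto results = new float[SIZE];
        values1[] = 2.0f;
        values2[] = 3.0f;

        // Whole-array element-wise multiply: same result as the foreach loop,
        // but the compiler/runtime is free to vectorise it with SIMD.
        results[] = values1[] * values2[];
    }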
Feb 18 2018
On Monday, 19 February 2018 at 05:54:53 UTC, Dmitry Olshansky wrote:
> The operation is trivial and the dataset is rather small. In such cases SIMD with e.g. array ops is the way to go: result[] = values[] * values2[];

Yes, absolutely right :)

I made a simple example to understand why the threads are not scaling the way I thought they would. I imagined that, if one core's work is done in 200 ms, the same work on 4 cores will be done in 50 ms plus some overhead, since they are working on separate blocks of memory, with no need for synchronisation and without false sharing, etc. (at least I think I don't have that problem here).
Feb 19 2018
On Monday, 19 February 2018 at 14:57:22 UTC, SrMordred wrote:
> Yes, absolutely right :)
>
> I made a simple example to understand why the threads are not scaling the way I thought they would.

Yeah, the world is an ugly place where trivial math sometimes doesn't work. I suggest to:
- run with different numbers of threads, from 1 to n
- vary sizes from 100k to 10m

It's 6/7 ~ 0.86 ms, which is a good deal smaller than a CPU timeslice. In essence, a single core runs fast b/c it doesn't wait for all the others to complete via join, easily burning its quota in one go. In MT I bet some of the overhead comes from not all threads finishing (and starting) at once, so the joins block in the kernel. You could run your MT code with strace to see if it hits the futex call or some such; if it does, that's where you are getting delays. (That's assuming you are on Linux.)

The std.parallelism version is a bit faster b/c I think it caches the created thread pool, so you don't start threads anew on each run.

> I imagined that, if one core's work is done in 200 ms, the same work on 4 cores will be done in 50 ms plus some overhead, since they are working on separate blocks of memory, with no need for synchronisation and without false sharing, etc. (at least I think I don't have that problem here).

If you had a long queue of small tasks like that, and you don't wait to join all threads until absolutely required, you get near-perfect scalability (unless hitting other bottlenecks like RAM).
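[Editor's note: a sketch of the suggested experiment, sweeping both worker-thread count and problem size with an explicit std.parallelism TaskPool so the pool is created once per thread count; the sizes and output format are illustrative, not taken from the thread.]

    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.parallelism : TaskPool, totalCPUs;
    import std.range : iota;
    import std.stdio : writefln;

    void main()
    {
        foreach (size; [100_000, 1_000_000, 10_000_000])
        {
            auto a = new float[size];
            auto b = new float[size];
            auto r = new float[size];
            a[] = 2.0f;
            b[] = 3.0f;

            foreach (nWorkers; 1 .. totalCPUs + 1)
            {
                auto pool = new TaskPool(nWorkers);   // nWorkers worker threads
                scope (exit) pool.finish(true);       // wait for them on the way out

                auto sw = StopWatch(AutoStart.yes);
                foreach (i; pool.parallel(iota(size)))
                    r[i] = a[i] * b[i];
                sw.stop();
                writefln("size=%s workers=%s : %s", size, nWorkers, sw.peek);
            }
        }
    }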
Feb 19 2018