digitalmars.D - Running Phobos unit tests in threads: I have data
- Atila Neves (71/71) May 03 2014 So I tried using unit-threaded to run Phobos unit tests again and
- Atila Neves (6/78) May 03 2014 I turned off all output to check. It was still slower with
- Rikki Cattermole (2/99) May 03 2014 Out of curiosity are you on Windows?
- Atila Neves (4/5) May 03 2014 No, Arch Linux 64-bit. I also just noticed a glaring threading
- Rikki Cattermole (3/8) May 03 2014 I'm surprised. Threads should be cheap on Linux. Something funky
- Walter Bright (2/3) May 03 2014 No doubt: http://www.youtube.com/watch?v=aZcbDESaxhY
- Dicebot (2/12) May 05 2014 Threads are never cheap.
- Brad Anderson (3/16) May 05 2014 Regarding this, I found this talk interesting:
- Atila Neves (11/11) May 03 2014 Ok, so I went and added __traits(getUnitTests) to unit-threaded.
- Atila Neves (55/55) May 03 2014 I can reproduce the slower-with-threads issue without using my
- Walter Bright (6/9) May 03 2014 I haven't investigated this, but my suspicions are:
- Atila Neves (6/18) May 03 2014 In the current measurements probably since the whole run takes
- Dmitry Olshansky (6/19) May 03 2014 Try different batch size:
- Atila Neves (13/21) May 03 2014 So as to not have thread creation be disproportionately
- Atila Neves (8/31) May 03 2014 gdc gave _very_ different results. I had to use different modules
- Atila Neves (18/55) May 03 2014 Same thing with unit_threaded on Phobos, 3x faster even without
- Andrei Alexandrescu (2/8) May 03 2014 Sounds like a severe bug in dmd or dependents. -- Andrei
- Atila Neves (7/20) May 04 2014 Seems like it. Just to be sure I swapped ld.gold for ld.bfd and
- Andrei Alexandrescu (2/17) May 04 2014 The simpler the better. -- Andrei
- Atila Neves (2/28) May 05 2014
- safety0ff (8/21) May 04 2014 This reminds me of when I was parallelizing a project euler
- Atila Neves (6/28) May 05 2014 Funny you should say that, a friend of mine tried porting a
- Orvid King via Digitalmars-d (9/39) May 05 2014 Going to take a wild guess, but as core.atomic.casImpl will never be
- Iain Buclaw via Digitalmars-d (5/13) May 05 2014 Aye, and atomic intrinsics though they may be, it could even be
- Andrei Alexandrescu (4/5) May 03 2014 [snip]
- Atila Neves (5/10) May 03 2014 I'm using parallel and taskPool from std.parallelism. I was under
- Russel Winder via Digitalmars-d (16/19) May 04 2014 There is a default, related to the number of cores the OS thinks there
- Atila Neves (6/26) May 04 2014 Like I mentioned afterwards, I tried a different number of
- Russel Winder via Digitalmars-d (10/13) May 04 2014 If you can create a small example of the problem, and I can remember how
- Andrei Alexandrescu (2/9) May 04 2014 This is an awesome offer, Russel. Thanks! -- Andrei
- Joseph Rushton Wakeling via Digitalmars-d (3/8) May 04 2014 Yup. That bit me with a new laptop the first time I tried parallel prog...
So I tried using unit-threaded to run Phobos unit tests again and had problems (which I'll look into later) with its compile-time reflection. Then I realised I was an idiot since I don't need to reflect on anything: all Phobos tests are in unittest blocks, so all I need to do is include them in the build and unit-threaded will run them for me.

I tried a basic sanity check by running them in one thread only with the -s option and got a segfault, and a failing test before that. None of this should happen, and I'll be taking a look at that as well. But I carried on by removing the troublesome modules from the build. These turned out to be:

std.datetime (fails)
std.process (fails and causes the segfault)
std.stdio (fails)

All the others pass in single-threaded mode. After this I tried using threads and std.parallelism failed, so I took that away from the build as well.

Another thing to mention: although the tests are running in threads, each module's unit tests run as one test, because the getUnitTests __traits wasn't available when I wrote the library (and since then I wasn't interested in using it). So tests only interleave with other modules, not with each other.

Running in one thread took 39 +/- 1 seconds. Running in 8 threads took... ~41 seconds. Oops.

I noticed some tests take a lot longer, so I tried removing those. They were:

std.file
std.conv
std.regex
std.random
std.container
std.xml
std.utf
std.numeric
std.uuid
std.exception

I also removed any modules that were likely to be problematic, like std.concurrency and std.socket. With the reduced sample size the results were:

1 thread: ~1.9s
8 threads: 4.1s +/- 0.2

So the whole threading thing isn't looking so great. Or at least not how I implemented it.

This got me thinking about my own projects. The tests run so fast I never really paid attention to how fast they were running. I compared running the unit tests in Cerealed in one or more threads and got the same result: running in one thread was faster.

I have to look to be sure, but maybe the bottleneck is output, as in actually printing the results to the screen. I had to jump through a few hoops to make sure the output wasn't interleaved, and in the end decided to have one thread be responsible for that, with the tests sending it output messages.

For reference, I copied all of the std/*.d modules into a local std directory, compiled all of them with dmd -unittest -c, then used this as the build command:

dmd -unittest -I~/coding/d/unit-threaded/source ut.d std/algorithm.o std/array.o std/ascii.o std/base64.o std/bigint.o std/bitmanip.o std/compiler.o std/complex.o std/container.o std/cstream.o std/csv.o std/demangle.o std/encoding.o std/format.o std/functional.o std/getopt.o std/json.o std/math.o std/mathspecial.o std/metastrings.o std/mmfile.o std/numeric.o std/outbuffer.o std/range.o std/signals.o std/stdint.o std/stdiobase.o std/stream.o std/string.o std/syserror.o std/system.o std/traits.o std/typecons.o std/typelist.o std/typetuple.o std/uri.o std/variant.o std/zip.o std/zlib.o libunit-threaded.a -ofphobos_ut

I got libunit-threaded.a by running "dub build" in the root directory of unit-threaded.

I might just implement a random order option now. Hmm.

Atila
May 03 2014
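The single-writer output scheme described above (tests send messages to one thread that owns the console) might look roughly like this minimal sketch using std.concurrency; the names are hypothetical and it is not unit-threaded's actual implementation:

import std.concurrency : receive, send, spawn, Tid;
import std.conv : text;
import std.stdio : writeln;

struct Done {}

// The only thread that ever touches stdout, so test output never interleaves.
void outputThread()
{
    bool running = true;
    while(running)
    {
        receive(
            (string msg) { writeln(msg); },
            (Done d)     { running = false; }
        );
    }
}

void main()
{
    Tid writer = spawn(&outputThread);

    // Tests (placeholders here) send their results instead of printing directly.
    foreach(i; 0 .. 4)
        writer.send(text("test ", i, " passed"));

    writer.send(Done());
}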
I turned off all output to check. It was still slower with multiple threads. That was the only "weird" thing I was doing I could think of as the cause. Otherwise it's just a foreach(test; tests.parallel) { test(); }.

Atila

On Saturday, 3 May 2014 at 11:54:55 UTC, Atila Neves wrote:
[snip]
May 03 2014
On Saturday, 3 May 2014 at 12:08:56 UTC, Atila Neves wrote:
[snip]

Out of curiosity are you on Windows?
May 03 2014
Out of curiosity are you on Windows?

No, Arch Linux 64-bit. I also just noticed a glaring threading bug in my code as well that somehow's never turned up. This is not a good day.

Atila
May 03 2014
On Saturday, 3 May 2014 at 12:24:59 UTC, Atila Neves wrote:
No, Arch Linux 64-bit. I also just noticed a glaring threading bug in my code as well that somehow's never turned up.

I'm surprised. Threads should be cheap on Linux. Something funky is definitely going on I bet.
May 03 2014
On 5/3/2014 5:26 AM, Rikki Cattermole wrote:
Something funky is definitely going on I bet.

No doubt: http://www.youtube.com/watch?v=aZcbDESaxhY
May 03 2014
On Saturday, 3 May 2014 at 12:26:13 UTC, Rikki Cattermole wrote:
I'm surprised. Threads should be cheap on Linux. Something funky is definitely going on I bet.

Threads are never cheap.
May 05 2014
On Monday, 5 May 2014 at 17:56:11 UTC, Dicebot wrote:
Threads are never cheap.

Regarding this, I found this talk interesting:
https://www.youtube.com/watch?v=KXuZi9aeGTw
May 05 2014
Ok, so I went and added __traits(getUnitTests) to unit-threaded. That way each unittest block is its own test case. I registered these modules in std to run: array, ascii, base64, bigint, bitmanip, concurrency, container, cstream. On the good news front, they all passed even though they were running concurrently. On the bad news front, single-threaded operation was still faster (0.22s vs 0.28s). I still don't know why. I fixed my concurrency bug, now I'm using taskPool.amap. Atila
May 03 2014
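For reference, a minimal sketch of the taskPool.amap approach mentioned above; the TestFunction array and the runOne/runAll names are hypothetical, not unit-threaded's real API:

import std.parallelism : taskPool;

alias TestFunction = void function();

// Wraps one test so a failure becomes a false result instead of escaping.
bool runOne(TestFunction test)
{
    try
    {
        test();
        return true;
    }
    catch(Throwable)
    {
        return false;
    }
}

// amap maps runOne over the tests on the default pool and returns a bool[]
// of per-test results, preserving the order of the input array.
bool[] runAll(TestFunction[] tests)
{
    return taskPool.amap!runOne(tests);
}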
I can reproduce the slower-with-threads issue without using my library. I've included the source file below and would like to know if other people see the same thing.

The Phobos modules are all called "ustd" because I couldn't/didn't know how to get this to work otherwise. So I copied the std/*.d files to a directory called ustd and changed their module declarations. Silly, but it works. I'd love to know how to do this properly.

With this file, I consistently get faster times with "-s" (for single-threaded) than without (multi-threaded):

import std.parallelism;
import std.getopt;
import ustd.array;
import ustd.ascii;
import ustd.base64;
import ustd.bigint;
import ustd.bitmanip;
import ustd.concurrency;
import ustd.container;
import ustd.cstream;

alias TestFunction = void function();

auto getTests(Modules...)() {
    TestFunction[] tests;
    foreach(mod; Modules) {
        foreach(test; __traits(getUnitTests, mod)) {
            tests ~= &test;
        }
    }
    return tests;
}

void main(string[] args) {
    bool single;
    getopt(args,
           "single|s", &single
    );

    enum tests = getTests!(
        ustd.array,
        ustd.ascii,
        ustd.base64,
        ustd.bigint,
        ustd.bitmanip,
        ustd.concurrency,
        ustd.container,
        ustd.cstream,
    );

    if(single) {
        foreach(test; tests) {
            test();
        }
    } else {
        foreach(test; tests.parallel) {
            test();
        }
    }
}
May 03 2014
On 5/3/2014 10:22 AM, Atila Neves wrote:
I can reproduce the slower-with-threads issue without using my library. I've included the source file below and would like to know if other people see the same thing.

I haven't investigated this, but my suspicions are:

1. thread creation/destruction is dominating the times.
2. since very few of the unittests block, there is no speed advantage from having more threads than cores.

If you limit the number of threads to the number of cores on your machine, you might see a speedup.
May 03 2014
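For anyone wanting to try Walter's second suggestion directly, a minimal sketch that pins the pool to the OS-reported core count; the test array is hypothetical:

import std.parallelism : TaskPool, totalCPUs;

alias TestFunction = void function();

void runOnSizedPool(TestFunction[] tests)
{
    // One worker per CPU the OS reports, instead of the default pool's size.
    auto pool = new TaskPool(totalCPUs);
    scope(exit) pool.finish(true); // drain remaining work, then shut the pool down

    foreach(test; pool.parallel(tests))
        test();
}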
On Saturday, 3 May 2014 at 18:26:37 UTC, Walter Bright wrote:
I haven't investigated this, but my suspicions are:
1. thread creation/destruction is dominating the times.

In the current measurements probably, since the whole run takes less than a second. But the first ones I did were dozens of seconds long, so I don't think so.

2. since very few of the unittests block, there is no speed advantage from having more threads than cores. If you limit the number of threads to the number of cores on your machine, you might see a speedup.

Like I mentioned above, unless I'm mistaken taskPool should be using a correct number of threads for my machine already.
May 03 2014
03-May-2014 21:22, Atila Neves wrote:
I can reproduce the slower-with-threads issue without using my library.
[snip]
    } else {
        foreach(test; tests.parallel) {

Try different batch size: test.parallel(1), test.parallel(2) etc.

-- 
Dmitry Olshansky
May 03 2014
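Spelling out Dmitry's suggestion: std.parallelism's parallel takes an optional work unit size, so the batch granularity can be tuned. A minimal sketch, with a hypothetical test array:

import std.parallelism : parallel;

alias TestFunction = void function();

void runBatched(TestFunction[] tests, size_t workUnitSize)
{
    // workUnitSize is how many elements each task pulls off the range at once:
    // larger batches amortise scheduling overhead, smaller ones balance load
    // better when individual test durations vary a lot.
    foreach(test; tests.parallel(workUnitSize))
        test();
}

// e.g. runBatched(tests, 1); runBatched(tests, 100);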
Try different batch size: test.parallel(1), test.parallel(2) etc.

So as to not have thread creation be disproportionately represented, I repeated the module list over and over again, making the number of tests run equal to 9990. This takes 5s on my machine to run in one thread and 12s in multiple.

Here are the things I tried:

1. Created my own TaskPool so I could decide how many threads to use
2. Changed the batch size in parallel from 1 to 10 to 100 to 1000
3. Explicitly spawned two threads and told each to do a foreach on half of the tests (sketched below)

None of them made it go any faster. I had similar results using unit-threaded on my own projects. This is weird.

Atila
May 03 2014
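A minimal sketch of option 3 above, two explicitly spawned threads each running half the tests; the function-pointer array is hypothetical and this is not how unit-threaded does it:

import core.thread : thread_joinAll;
import std.concurrency : spawn;

alias TestFunction = void function();

// Each spawned worker runs its half of the tests sequentially.
void runSlice(immutable(TestFunction)[] slice)
{
    foreach(test; slice)
        test();
}

void runInTwoThreads(immutable(TestFunction)[] tests)
{
    immutable mid = tests.length / 2;
    spawn(&runSlice, tests[0 .. mid]);
    spawn(&runSlice, tests[mid .. $]);
    thread_joinAll(); // block until both workers have finished
}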
gdc gave _very_ different results. I had to use different modules because at some point tests started failing, but with gdc the threaded version runs ~3x faster.

On my own unit-threaded benchmarks, running the UTs for Cerealed over and over again was only slightly slower with threads than without. With dmd the threaded version was nearly 3x slower.

Atila

On Saturday, 3 May 2014 at 21:14:29 UTC, Atila Neves wrote:
[snip]
May 03 2014
Same thing with unit_threaded on Phobos, 3x faster even without repeating the modules (0.1s vs 0.3s). Since the example is shorter than the other one, I'll post it here in case anyone else wants to try:

import unit_threaded.runner;

int main(string[] args) {
    return args.runTests!(
        "ustd.array",
        "ustd.ascii",
        "ustd.base64",
        "ustd.bigint",
        "ustd.bitmanip",
        "ustd.concurrency",
        "ustd.container",
        "ustd.cstream",
    );
}

On Saturday, 3 May 2014 at 21:42:13 UTC, Atila Neves wrote:
[snip]
May 03 2014
On 5/3/14, 2:42 PM, Atila Neves wrote:
gdc gave _very_ different results. I had to use different modules because at some point tests started failing, but with gdc the threaded version runs ~3x faster.
[snip]

Sounds like a severe bug in dmd or dependents. -- Andrei
May 03 2014
On Saturday, 3 May 2014 at 22:46:03 UTC, Andrei Alexandrescu wrote:
Sounds like a severe bug in dmd or dependents. -- Andrei

Seems like it. Just to be sure I swapped ld.gold for ld.bfd and the problem was still there. I'm not entirely sure how to file this bug: with just my simple example above?

Atila
May 04 2014
On 5/4/14, 1:44 AM, Atila Neves wrote:
I'm not entirely sure how to file this bug: with just my simple example above?

The simpler the better. -- Andrei
May 04 2014
https://issues.dlang.org/show_bug.cgi?id=12708

On Sunday, 4 May 2014 at 16:07:30 UTC, Andrei Alexandrescu wrote:
The simpler the better. -- Andrei
May 05 2014
On Saturday, 3 May 2014 at 22:46:03 UTC, Andrei Alexandrescu wrote:
Sounds like a severe bug in dmd or dependents. -- Andrei

This reminds me of when I was parallelizing a Project Euler solution: atomic access was so much slower on DMD that it made performance worse than the single-threaded version for one stage of the program. I know that std.parallelism does make use of core.atomic under the hood, so this may be a factor when using DMD.
May 04 2014
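A rough illustration of the effect safety0ff describes, with hypothetical functions rather than code from the actual solution: summing via a shared, atomically updated counter versus letting std.parallelism combine per-worker partial sums.

import core.atomic : atomicLoad, atomicOp;
import std.parallelism : parallel, taskPool;
import std.range : iota;

// Contended version: every iteration does an atomic read-modify-write on a
// shared counter, so the cost of the atomic operation dominates the loop.
ulong sumAtomic(uint n)
{
    shared ulong total = 0;
    foreach(i; iota(n).parallel)
        atomicOp!"+="(total, i);
    return atomicLoad(total);
}

// No shared state in the loop: each worker accumulates a partial sum and the
// partials are combined once at the end.
ulong sumReduced(uint n)
{
    return taskPool.reduce!"a + b"(iota(0uL, n));
}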
On Sunday, 4 May 2014 at 17:01:23 UTC, safety0ff wrote:
This reminds me of when I was parallelizing a Project Euler solution: atomic access was so much slower on DMD that it made performance worse than the single-threaded version for one stage of the program.
[snip]

Funny you should say that, a friend of mine tried porting a lock-free algorithm of his from Java to D a few weeks ago. The D version ran 3 orders of magnitude slower. Then I tried gdc and ldc on his code: ldc produced code running at around 80% of the speed of the Java version, gdc was around 30%. But dmd...
May 05 2014
Going to take a wild guess, but as core.atomic.casImpl will never be inlined anywhere with DMD, due to its inline assembly, you have the cost of building and destroying a stack frame, the cost of passing the args in, moving them into registers, saving potentially trashed registers, etc. every time it even attempts to acquire a lock, and the GC uses a single global lock for just about everything. As you can imagine, I suspect this is far from optimal, and, if I remember right, GDC uses intrinsics for the atomic operations.

On 5/5/14, Atila Neves via Digitalmars-d <digitalmars-d puremagic.com> wrote:
Funny you should say that, a friend of mine tried porting a lock-free algorithm of his from Java to D a few weeks ago. The D version ran 3 orders of magnitude slower.
[snip]
May 05 2014
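For context, a generic sketch of the kind of CAS retry loop whose per-call overhead is being discussed here; it is not the actual druntime or GC code:

import core.atomic : atomicLoad, cas;

// A typical compare-and-swap retry loop. When cas() is an out-of-line call
// rather than an inlined instruction, every attempt (including every failed
// retry under contention) pays the full call overhead described above.
void atomicAdd(ref shared ulong counter, ulong amount)
{
    ulong old;
    do
    {
        old = atomicLoad(counter);
    }
    while(!cas(&counter, old, old + amount));
}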
On 5 May 2014 19:07, Orvid King via Digitalmars-d <digitalmars-d puremagic.com> wrote:
[snip]
...if I remember right, GDC uses intrinsics for the atomic operations.

Aye, and atomic intrinsics though they may be, it could even be improved by switching over to C++ atomic intrinsics, which map directly to core.atomics. :)
May 05 2014
On 5/3/14, 4:54 AM, Atila Neves wrote:
So I tried using unit-threaded to run Phobos unit tests
[snip]

Thanks. Are you using thread pooling (a limited number of threads, e.g. 1.5 * cores, running all unittests)? -- Andrei
May 03 2014
On Saturday, 3 May 2014 at 18:16:52 UTC, Andrei Alexandrescu wrote:
Thanks. Are you using thread pooling (a limited number of threads, e.g. 1.5 * cores, running all unittests)? -- Andrei

I'm using parallel and taskPool from std.parallelism. I was under the impression it gave me a ready-to-use pool with as many threads as I have cores.
May 03 2014
On Sat, 2014-05-03 at 19:37 +0000, Atila Neves via Digitalmars-d wrote:
[…]
I'm using parallel and taskPool from std.parallelism. I was under the impression it gave me a ready-to-use pool with as many threads as I have cores.

There is a default, related to the number of cores the OS thinks there is (*), but you can also set the number manually. std.parallelism could do with some work to make it better than it already is.

(*) Physical cores are not necessarily the number reported by the OS due to hyperthreads. A quad core with no hyperthreads and a dual core with two hyperthreads per core both get reported as four-processor systems. However, if you benchmark them you get very, very different performance characteristics.

-- 
Russel.
May 04 2014
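A small sketch of the knobs Russel mentions in std.parallelism: querying what the OS reports and setting the default pool size manually (the value 4 is only an example):

import std.parallelism : defaultPoolThreads, taskPool, totalCPUs;
import std.stdio : writeln;

void main()
{
    // What the OS reports, hyperthreads included.
    writeln("totalCPUs: ", totalCPUs);

    // Override the default pool's worker count; this has to happen before
    // taskPool is first used, since the pool is created lazily.
    defaultPoolThreads = 4;

    writeln("default pool workers: ", taskPool.size);
}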
Like I mentioned afterwards, I tried a different number of threads. On my machine, at least, std.parallelism.totalCPUs returns 8, the number of virtual cores. As it should.

Atila

On Sunday, 4 May 2014 at 07:49:51 UTC, Russel Winder via Digitalmars-d wrote:
There is a default, related to the number of cores the OS thinks there is (*), but you can also set the number manually.
[snip]
May 04 2014
On Sun, 2014-05-04 at 08:47 +0000, Atila Neves via Digitalmars-d wrote:
Like I mentioned afterwards, I tried a different number of threads. On my machine, at least, std.parallelism.totalCPUs returns 8, the number of virtual cores. As it should.

If you can create a small example of the problem, and I can remember how to run std.parallelism as a separate module, I can try and take a look at this later next week.

-- 
Russel.
May 04 2014
On 5/4/14, 3:06 AM, Russel Winder via Digitalmars-d wrote:
If you can create a small example of the problem, and I can remember how to run std.parallelism as a separate module, I can try and take a look at this later next week.

This is an awesome offer, Russel. Thanks! -- Andrei
May 04 2014
On 04/05/14 09:49, Russel Winder via Digitalmars-d wrote:
(*) Physical cores are not necessarily the number reported by the OS due to hyperthreads.
[snip]

Yup. That bit me with a new laptop the first time I tried parallel programming with D :-)
May 04 2014