digitalmars.D - Running Phobos unit tests in threads: I have data

Atila Neves (71/71) May 03 2014 So I tried using unit-threaded to run Phobos unit tests again and

Atila Neves (6/78) May 03 2014 I turned off all output to check. It was still slower with

Rikki Cattermole (2/99) May 03 2014 Out of curiosity are you on Windows?

Atila Neves (4/5) May 03 2014 No, Arch Linux 64-bit. I also just noticed a glaring threading

Rikki Cattermole (3/8) May 03 2014 I'm surprised. Threads should be cheap on Linux. Something funky

Walter Bright (2/3) May 03 2014 No doubt: http://www.youtube.com/watch?v=aZcbDESaxhY
Dicebot (2/12) May 05 2014 Threads are never cheap.

Brad Anderson (3/16) May 05 2014 Regarding this, I found this talk interesting:

Atila Neves (11/11) May 03 2014 Ok, so I went and added __traits(getUnitTests) to unit-threaded.

Atila Neves (55/55) May 03 2014 I can reproduce the slower-with-threads issue without using my

Walter Bright (6/9) May 03 2014 I haven't investigated this, but my suspicions are:

Atila Neves (6/18) May 03 2014 In the current measurements probably since the whole run takes

Dmitry Olshansky (6/19) May 03 2014 Try different batch size:

Atila Neves (13/21) May 03 2014 So as to not have thread creation be disproportionately

Atila Neves (8/31) May 03 2014 gdc gave _very_ different results. I had to use different modules

Atila Neves (18/55) May 03 2014 Same thing with unit_threaded on Phobos, 3x faster even without
Andrei Alexandrescu (2/8) May 03 2014 Sounds like a severe bug in dmd or dependents. -- Andrei

Atila Neves (7/20) May 04 2014 Seems like it. Just to be sure I swapped ld.gold for ld.bfd and

Andrei Alexandrescu (2/17) May 04 2014 The simpler the better. -- Andrei

Atila Neves (2/28) May 05 2014

safety0ff (8/21) May 04 2014 This reminds me of when I was parallelizing a project euler

Atila Neves (6/28) May 05 2014 Funny you should say that, a friend of mine tried porting a

Orvid King via Digitalmars-d (9/39) May 05 2014 Going to take a wild guess, but as core.atomic.casImpl will never be
Iain Buclaw via Digitalmars-d (5/13) May 05 2014 Aye, and atomic intrinsics though they may be, it could even be

Andrei Alexandrescu (4/5) May 03 2014 [snip]

Atila Neves (5/10) May 03 2014 I'm using parallel and taskPool from std.parallelism. I was under

Russel Winder via Digitalmars-d (16/19) May 04 2014 There is a default, related to the number of cores the OS thinks there

Atila Neves (6/26) May 04 2014 Like I mentioned afterwards, I tried a different number of

Russel Winder via Digitalmars-d (10/13) May 04 2014 If you can create a small example of the problem, and I can remember how

Andrei Alexandrescu (2/9) May 04 2014 This is an awesome offer, Russel. Thanks! -- Andrei

Joseph Rushton Wakeling via Digitalmars-d (3/8) May 04 2014 Yup. That bit me with a new laptop the first time I tried parallel prog...

"Atila Neves" <atila.neves gmail.com> writes:

So I tried using unit-threaded to run Phobos unit tests again and 
had problems (which I'll look into later) with its compile-time 
reflection. Then I realised I was an idiot since I don't need to 
reflect on anything: all Phobos tests are in unittest blocks so 
all I need to do is include them in the build and unit-threaded 
will run them for me.

I tried a basic sanity check by running them in one thread only 
with the -s option and got a segfault, and a failing test before 
that. None of this should happen, and I'll be taking a look at 
that as well.

But I carried on by removing the troublesome modules from the 
build. These turned out to be:

std.datetime (fails)
std.process (fails and causes the segfault)
std.stdio (fails)

All the others pass in single threaded mode. After this I tried 
using threads and std.parallelism failed, so I took that away 
from the build as well.

Another thing to mention is that although the tests are running 
in threads, since when I wrote the library the getUnitTests 
__traits wasn't available (and since then I wasn't interested in 
using it), each module's unit tests run as one test. So they only 
interleave with other modules, not with each other.

Running in one thread took 39 +/- 1 seconds.
Running in 8 threads took... ~41 seconds.

Oops. I noticed some tests take a lot longer so I tried removing 
those. They were:

std.file
std.conv
std.regex
std.random
std.container
std.xml
std.utf
std.numeric
std.uuid
std.exception

I also removed any modules that were likely to be problematic 
like std.concurrency and std.socket. With the reduced sample size 
the results were:

1 thread: ~1.9s
8 threads: 4.1s +/- 0.2

So the whole threading thing isn't looking so great. Or at least 
not how I implemented it. This got me thinking about my own 
projects. The tests run so fast I never really paid attention to 
how fast they were running. I compared running the unit tests in 
Cerealed in one or more threads and got the same result: running 
in one thread was faster.

I have to look to be sure but maybe the bottleneck is output. As 
in actually printing the results to the screen. I had to jump 
through a few hoops to make sure the output wasn't interleaved, 
and in the end decided to have one thread be responsible for 
that, with the tests sending it output messages.

For reference, I copied all of the std/*.d modules into a local 
std directory, compiled all of them with dmd -unittest -c, then 
used this as the build command:

dmd -unittest -I~/coding/d/unit-threaded/source ut.d 
std/algorithm.o std/array.o std/ascii.o std/base64.o std/bigint.o 
std/bitmanip.o std/compiler.o std/complex.o std/container.o 
std/cstream.o std/csv.o std/demangle.o std/encoding.o 
std/format.o std/functional.o std/getopt.o std/json.o std/math.o 
std/mathspecial.o std/metastrings.o std/mmfile.o std/numeric.o 
std/outbuffer.o std/range.o  std/signals.o  std/stdint.o 
std/stdiobase.o std/stream.o std/string.o std/syserror.o 
std/system.o std/traits.o std/typecons.o std/typelist.o 
std/typetuple.o std/uri.o std/variant.o std/zip.o std/zlib.o  
libunit-threaded.a -ofphobos_ut

I got libunit-threaded.a by running "dub build" in the root 
directory of unit-threaded.

I might just implement a random order option now. Hmm.

Atila

May 03 2014

"Atila Neves" <atila.neves gmail.com> writes:

I turned off all output to check. It was still slower with 
multiple threads. That was the only "weird" thing I was doing I 
could think of as the cause. Otherwise it's just a foreach(test; 
tests.parallel) { test(); }.

Atila

On Saturday, 3 May 2014 at 11:54:55 UTC, Atila Neves wrote:
 So I tried using unit-threaded to run Phobos unit tests again

 [snip]

May 03 2014

"Rikki Cattermole" <alphaglosined gmail.com> writes:

On Saturday, 3 May 2014 at 12:08:56 UTC, Atila Neves wrote:
 I turned off all output to check. It was still slower with 
 multiple threads. That was the only "weird" thing I was doing I 
 could think of as the cause. Otherwise it's just a 
 foreach(test; tests.parallel) { test(); }.

 Atila

 [snip]


Out of curiosity are you on Windows?

May 03 2014

"Atila Neves" <atila.neves gmail.com> writes:

 Out of curiosity are you on Windows?

No, Arch Linux 64-bit. I also just noticed a glaring threading 
bug in my code as well that somehow's never turned up. This is 
not a good day.

Atila

May 03 2014

"Rikki Cattermole" <alphaglosined gmail.com> writes:

On Saturday, 3 May 2014 at 12:24:59 UTC, Atila Neves wrote:
 Out of curiosity are you on Windows?

 No, Arch Linux 64-bit. I also just noticed a glaring threading 
 bug in my code as well that somehow's never turned up. This is 
 not a good day.

 Atila

I'm surprised. Threads should be cheap on Linux. Something funky 
is definitely going on I bet.

May 03 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 5/3/2014 5:26 AM, Rikki Cattermole wrote:
 Something funky is definitely going on I bet.

No doubt: http://www.youtube.com/watch?v=aZcbDESaxhY

May 03 2014

"Dicebot" <public dicebot.lv> writes:

On Saturday, 3 May 2014 at 12:26:13 UTC, Rikki Cattermole wrote:
 On Saturday, 3 May 2014 at 12:24:59 UTC, Atila Neves wrote:
 Out of curiosity are you on Windows?

 No, Arch Linux 64-bit. I also just noticed a glaring threading 
 bug in my code as well that somehow's never turned up. This is 
 not a good day.

 Atila

 I'm surprised. Threads should be cheap on Linux. Something 
 funky is definitely going on I bet.

Threads are never cheap.

May 05 2014

"Brad Anderson" <eco gnuk.net> writes:

On Monday, 5 May 2014 at 17:56:11 UTC, Dicebot wrote:
 On Saturday, 3 May 2014 at 12:26:13 UTC, Rikki Cattermole wrote:
 On Saturday, 3 May 2014 at 12:24:59 UTC, Atila Neves wrote:
 Out of curiosity are you on Windows?

 No, Arch Linux 64-bit. I also just noticed a glaring 
 threading bug in my code as well that somehow's never turned 
 up. This is not a good day.

 Atila

 I'm surprised. Threads should be cheap on Linux. Something 
 funky is definitely going on I bet.

 Threads are never cheap.

Regarding this, I found this talk interesting: 
https://www.youtube.com/watch?v=KXuZi9aeGTw

May 05 2014

"Atila Neves" <atila.neves gmail.com> writes:

Ok, so I went and added __traits(getUnitTests) to unit-threaded. 
That way each unittest block is its own test case. I registered 
these modules in std to run:

array, ascii, base64, bigint, bitmanip, concurrency, container, 
cstream.

On the good news front, they all passed even though they were 
running concurrently.

On the bad news front, single-threaded operation was still faster 
(0.22s vs 0.28s). I still don't know why.

I fixed my concurrency bug, now I'm using taskPool.amap.


Atila

May 03 2014

"Atila Neves" <atila.neves gmail.com> writes:

I can reproduce the slower-with-threads issue without using my 
library. I've included the source file below and would like to 
know if other people see the same thing.

The Phobos modules are all called "ustd" because I 
couldn't/didn't know how to get this to work otherwise. So I 
copied the std/*.d files to a directory called ustd and changed 
their module declarations. Silly but it works. I'd love to know 
how to do this properly.

With this file, I consistently get faster times with "-s" (for 
single-threaded) than without (multi-threaded):


import std.parallelism;
import std.getopt;


import ustd.array;
import ustd.ascii;
import ustd.base64;
import ustd.bigint;
import ustd.bitmanip;
import ustd.concurrency;
import ustd.container;
import ustd.cstream;


alias TestFunction = void function();

auto getTests(Modules...)() {
     TestFunction[] tests;
     foreach(mod; Modules) {
         foreach(test; __traits(getUnitTests, mod)) {
             tests ~= &test;
         }
     }
     return tests;
}



void main(string[] args) {
     bool single;
     getopt(args,
            "single|s", &single
         );

     enum tests = getTests!(
         ustd.array,
         ustd.ascii,
         ustd.base64,
         ustd.bigint,
         ustd.bitmanip,
         ustd.concurrency,
         ustd.container,
         ustd.cstream,
         );

     if(single) {
         foreach(test; tests) {
             test();
         }
     } else {
         foreach(test; tests.parallel) {
             test();
         }
     }
}

May 03 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 5/3/2014 10:22 AM, Atila Neves wrote:
 I can reproduce the slower-with-threads issue without using my library. I've
 included the source file below and would like to know if other people see the
 same thing.

I haven't investigated this, but my suspicions are:

1. thread creation/destruction is dominating the times.

2. since very few of the unittests block, there is no speed advantage from
having more threads than cores. If you limit the number of threads to the
number of cores on your machine, you might see a speedup.
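
A rough way to check the first suspicion is to time how long it takes to start and join a thread that does nothing. This is only a sketch: `noop` and `threadRoundTripUsecs` are names made up for illustration, and `std.datetime.stopwatch` assumes a reasonably recent Phobos (in 2014 `StopWatch` lived in `std.datetime`).

```d
import core.thread : Thread;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Hypothetical empty thread body, so only creation and teardown are timed.
void noop() {}

// Average microseconds needed to start and join one do-nothing thread.
double threadRoundTripUsecs(int count)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. count)
    {
        auto t = new Thread(&noop);
        t.start();
        t.join();
    }
    return sw.peek.total!"usecs" / cast(double) count;
}

void main()
{
    writefln("start + join: %.1f us per thread", threadRoundTripUsecs(1000));
}
```

If the per-thread figure is tiny compared with the total run time, thread creation alone probably does not explain the slowdown.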

May 03 2014

"Atila Neves" <atila.neves gmail.com> writes:

On Saturday, 3 May 2014 at 18:26:37 UTC, Walter Bright wrote:
 On 5/3/2014 10:22 AM, Atila Neves wrote:
 I can reproduce the slower-with-threads issue without using my 
 library. I've
 included the source file below and would like to know if other 
 people see the
 same thing.

 I haven't investigated this, but my suspicions are:

 1. thread creation/destruction is dominating the times.

In the current measurements probably since the whole run takes 
less than a second. But the first ones I did were dozens of 
seconds long, so I don't think so.

 2. since very few of the unittests block, there is no speed 
 advantage from having more threads than cores. If you limit the 
 number of threads to the number of cores on your machine, you 
 might see a speedup.

Like I mentioned above, unless I'm mistaken taskPool should be 
using a correct number of threads for my machine already.
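
For anyone who wants to check what the global pool is actually doing on their machine, something like the following should print the relevant numbers. It is a sketch, not taken from unit-threaded; it only reads `totalCPUs`, `defaultPoolThreads` and `taskPool.size` from std.parallelism.

```d
import std.parallelism : defaultPoolThreads, taskPool, totalCPUs;
import std.stdio : writeln;

void main()
{
    // Logical CPUs as reported by the OS (hyperthreads are counted too).
    writeln("totalCPUs:          ", totalCPUs);

    // Worker threads the lazily created global taskPool will get.
    // The documented default is totalCPUs - 1.
    writeln("defaultPoolThreads: ", defaultPoolThreads);

    // Worker threads in the global pool. The thread that runs the
    // parallel foreach also executes work and is not counted here.
    writeln("taskPool.size:      ", taskPool.size);
}
```

Assigning to `defaultPoolThreads` before the first use of `taskPool` changes how many workers the global pool starts with.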

May 03 2014

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

03-May-2014 21:22, Atila Neves пишет:
 I can reproduce the slower-with-threads issue without using my library.
 I've included the source file below and would like to know if other
 people see the same thing.

 The Phobos modules are all called "ustd" because I couldn't/didn't know
 how to get this to work otherwise. So I copied the std/*.d files to a
 directory called ustd and changed their module declarations. Silly but
 it works. I'd love to know how to do this properly.

[snip]

      if(single) {
          foreach(test; tests) {
              test();
          }
      } else {
          foreach(test; tests.parallel) {

Try different batch size:
tests.parallel(1), tests.parallel(2) etc.



-- 
Dmitry Olshansky

May 03 2014

"Atila Neves" <atila.neves gmail.com> writes:

     if(single) {
         foreach(test; tests) {
             test();
         }
     } else {
         foreach(test; tests.parallel) {

 Try different batch size:
 tests.parallel(1), tests.parallel(2) etc.

So as to not have thread creation be disproportionately 
represented, I repeated the module list over and over again, 
making the number of tests run equal to 9990. This takes 5s on my 
machine to run in one thread and 12s in multiple threads. Here 
are the things I tried:

1. Created my own TaskPool so I could decide how many threads to 
use
2. Changed the batch size in parallel from 1 to 10 to 100 to 1000
3. Explicitly spawn two threads and tell each to do a foreach on 
half of the tests


None of them made it go any faster. I had similar results using 
unit-threaded on my own projects. This is weird.
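
In case anyone wants to repeat the first two experiments, they can be sketched roughly like this. This is not the code I ran: `fakeTest`, `sink`, `timeSerial` and `timeParallel` are made-up names, `fakeTest` is a stand-in for a real unittest block, and `std.datetime.stopwatch` assumes a recent Phobos.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.parallelism : TaskPool;
import std.stdio : writefln;

alias TestFunction = void function();

// Module-level variables are thread-local by default in D, so the
// workers do not share this and the fake test has no contention.
ulong sink;

// Hypothetical stand-in for one unittest block: a little CPU-bound work.
void fakeTest()
{
    ulong sum;
    foreach (ulong i; 0 .. 10_000)
        sum += i * i;
    sink = sum;
}

// Runs all tests in the calling thread; returns elapsed milliseconds.
long timeSerial(TestFunction[] tests)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (test; tests)
        test();
    return sw.peek.total!"msecs";
}

// Runs all tests on a private pool with `workers` worker threads, handing
// out `workUnitSize` tests at a time; returns elapsed milliseconds.
long timeParallel(TestFunction[] tests, size_t workers, size_t workUnitSize)
{
    auto pool = new TaskPool(workers);
    scope (exit) pool.finish(); // let the worker threads terminate

    auto sw = StopWatch(AutoStart.yes);
    foreach (test; pool.parallel(tests, workUnitSize))
        test();
    return sw.peek.total!"msecs";
}

void main()
{
    TestFunction[] tests;
    foreach (_; 0 .. 9990)
        tests ~= &fakeTest;

    writefln("serial: %s ms", timeSerial(tests));
    foreach (workUnitSize; [1, 10, 100, 1000])
        writefln("4 workers, work unit %4s: %s ms", workUnitSize,
                 timeParallel(tests, 4, workUnitSize));
}
```

Since the fake test touches no shared state, a slowdown that shows up with real unittest blocks but not here would point at something the tests themselves do (allocation, for example) rather than at the pool.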

Atila

May 03 2014

"Atila Neves" <atila.neves gmail.com> writes:

gdc gave _very_ different results. I had to use different modules 
because at some point tests started failing, but with gdc the 
threaded version runs ~3x faster.

On my own unit-threaded benchmarks, running the UTs for Cerealed 
over and over again was only slightly slower with threads than 
without. With dmd the threaded version was nearly 3x slower.

Atila

On Saturday, 3 May 2014 at 21:14:29 UTC, Atila Neves wrote:
 [snip]

May 03 2014

"Atila Neves" <atila.neves gmail.com> writes:

Same thing with unit_threaded on Phobos, 3x faster even without 
repeating the modules (0.1s vs 0.3s). Since the example is 
shorter than the other one, I'll post it here in case anyone else 
wants to try:

import unit_threaded.runner;

int main(string[] args) {
     return args.runTests!(
         "ustd.array",
         "ustd.ascii",
         "ustd.base64",
         "ustd.bigint",
         "ustd.bitmanip",
         "ustd.concurrency",
         "ustd.container",
         "ustd.cstream",
         );
}


On Saturday, 3 May 2014 at 21:42:13 UTC, Atila Neves wrote:
 [snip]

May 03 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/3/14, 2:42 PM, Atila Neves wrote:
 gdc gave _very_ different results. I had to use different modules
 because at some point tests started failing, but with gdc the threaded
 version runs ~3x faster.

 On my own unit-threaded benchmarks, running the UTs for Cerealed over
 and over again was only slightly slower with threads than without. With
 dmd the threaded version was nearly 3x slower.

Sounds like a severe bug in dmd or dependents. -- Andrei

May 03 2014

"Atila Neves" <atila.neves gmail.com> writes:

On Saturday, 3 May 2014 at 22:46:03 UTC, Andrei Alexandrescu 
wrote:
 On 5/3/14, 2:42 PM, Atila Neves wrote:
 gdc gave _very_ different results. I had to use different 
 modules
 because at some point tests started failing, but with gdc the 
 threaded
 version runs ~3x faster.

 On my own unit-threaded benchmarks, running the UTs for 
 Cerealed over
 and over again was only slightly slower with threads than 
 without. With
 dmd the threaded version was nearly 3x slower.

 Sounds like a severe bug in dmd or dependents. -- Andrei

Seems like it. Just to be sure I swapped ld.gold for ld.bfd and 
the problem was still there.

I'm not entirely sure how to file this bug: with just my simple 
example above?

Atila

May 04 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/4/14, 1:44 AM, Atila Neves wrote:
 On Saturday, 3 May 2014 at 22:46:03 UTC, Andrei Alexandrescu wrote:
 On 5/3/14, 2:42 PM, Atila Neves wrote:
 gdc gave _very_ different results. I had to use different modules
 because at some point tests started failing, but with gdc the threaded
 version runs ~3x faster.

 On my own unit-threaded benchmarks, running the UTs for Cerealed over
 and over again was only slightly slower with threads than without. With
 dmd the threaded version was nearly 3x slower.

 Sounds like a severe bug in dmd or dependents. -- Andrei

 Seems like it. Just to be sure I swapped ld.gold for ld.bfd and the
 problem was still there.

 I'm not entirely sure how to file this bug: with just my simple example
 above?

The simpler the better. -- Andrei

May 04 2014

"Atila Neves" <atila.neves gmail.com> writes:

https://issues.dlang.org/show_bug.cgi?id=12708

On Sunday, 4 May 2014 at 16:07:30 UTC, Andrei Alexandrescu wrote:
 [snip]

May 05 2014

"safety0ff" <safety0ff.dev gmail.com> writes:

On Saturday, 3 May 2014 at 22:46:03 UTC, Andrei Alexandrescu 
wrote:
 On 5/3/14, 2:42 PM, Atila Neves wrote:
 gdc gave _very_ different results. I had to use different 
 modules
 because at some point tests started failing, but with gdc the 
 threaded
 version runs ~3x faster.

 On my own unit-threaded benchmarks, running the UTs for 
 Cerealed over
 and over again was only slightly slower with threads than 
 without. With
 dmd the threaded version was nearly 3x slower.

 Sounds like a severe bug in dmd or dependents. -- Andrei

This reminds me of when I was parallelizing a project euler 
solution: atomic access was so much slower on DMD that it made 
performance worse than the single threaded version for one stage 
of the program.

I know that std.parallelism does make use of core.atomic under 
the hood, so this may be a factor when using DMD.
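
A minimal sketch of the kind of comparison I mean, in case someone wants to try it across compilers. It is not the Project Euler code and does not reproduce what std.parallelism does internally; `counter` and `atomicCount` are names invented for this example, and `std.datetime.stopwatch` assumes a recent Phobos.

```d
import core.atomic : atomicLoad, atomicOp, atomicStore;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.parallelism : parallel;
import std.range : iota;
import std.stdio : writefln;

shared long counter;

// Performs n atomic increments on the same variable, either from the
// calling thread alone or from the global task pool, and returns the
// final count. In the parallel case every worker contends for `counter`.
long atomicCount(int n, bool inParallel)
{
    atomicStore(counter, 0L);
    if (inParallel)
    {
        foreach (i; parallel(iota(n)))
            atomicOp!"+="(counter, 1);
    }
    else
    {
        foreach (i; 0 .. n)
            atomicOp!"+="(counter, 1);
    }
    return atomicLoad(counter);
}

void main()
{
    enum n = 1_000_000;

    auto sw = StopWatch(AutoStart.yes);
    atomicCount(n, false);
    immutable serialMs = sw.peek.total!"msecs";

    sw.reset();
    atomicCount(n, true);
    immutable parallelMs = sw.peek.total!"msecs";

    writefln("atomic increments, one thread: %s ms", serialMs);
    writefln("atomic increments, task pool:  %s ms", parallelMs);
}
```

Building the same file with dmd, gdc and ldc and comparing the two timings would show whether atomics behave differently between them.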

May 04 2014

"Atila Neves" <atila.neves gmail.com> writes:

On Sunday, 4 May 2014 at 17:01:23 UTC, safety0ff wrote:
 On Saturday, 3 May 2014 at 22:46:03 UTC, Andrei Alexandrescu 
 wrote:
 On 5/3/14, 2:42 PM, Atila Neves wrote:
 gdc gave _very_ different results. I had to use different 
 modules
 because at some point tests started failing, but with gdc the 
 threaded
 version runs ~3x faster.

 On my own unit-threaded benchmarks, running the UTs for 
 Cerealed over
 and over again was only slightly slower with threads than 
 without. With
 dmd the threaded version was nearly 3x slower.

 Sounds like a severe bug in dmd or dependents. -- Andrei

 This reminds me of when I was parallelizing a project euler 
 solution: atomic access was so much slower on DMD that it made 
 performance worse than the single threaded version for one 
 stage of the program.

 I know that std.parallelism does make use of core.atomic under 
 the hood, so this may be a factor when using DMD.

Funny you should say that, a friend of mine tried porting a 
lock-free algorithm of his from Java to D a few weeks ago. The D 
version ran 3 orders of magnitude slower. Then I tried gdc and 
ldc on his code. ldc produced code running at around 80% of the 
speed of the Java version, gdc was around 30%. But dmd...

May 05 2014

Orvid King via Digitalmars-d <digitalmars-d puremagic.com> writes:

Going to take a wild guess, but as core.atomic.casImpl will never be
inlined anywhere with DMD, due to its inline assembly, you have the
cost of building and destroying a stack frame, the cost of passing the
args in, moving them into registers, saving potentially trashed
registers, etc. every time it even attempts to acquire a lock, and the
GC uses a single global lock for just about everything. As you can
imagine, I suspect this is far from optimal, and, if I remember right,
GDC uses intrinsics for the atomic operations.

On 5/5/14, Atila Neves via Digitalmars-d <digitalmars-d puremagic.com> wrote:
 [snip]

May 05 2014

Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 5 May 2014 19:07, Orvid King via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 Going to take a wild guess, but as core.atomic.casImpl will never be
 inlined anywhere with DMD, due to its inline assembly, you have the
 cost of building and destroying a stack frame, the cost of passing the
 args in, moving them into registers, saving potentially trashed
 registers, etc. every time it even attempts to acquire a lock, and the
 GC uses a single global lock for just about everything. As you can
 imagine, I suspect this is far from optimal, and, if I remember right,
 GDC uses intrinsics for the atomic operations.

Aye, and atomic intrinsics though they may be, it could even be
improved by switching over to C++ atomic intrinsics, which map
directly to core.atomic.  :)
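
A minimal sketch of how one might measure the overhead discussed above, assuming a reasonably recent Phobos (std.datetime.stopwatch). It times a compare-and-swap retry loop built on core.atomic.cas against a plain increment. The function names casIncrement and plainIncrement and the iteration count are hypothetical, chosen only for illustration, and an optimizing compiler may collapse the plain loop entirely, so treat the output as a rough indication rather than a benchmark.

```d
import core.atomic : atomicLoad, cas;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

/// Increments `counter` `n` times using a compare-and-swap retry loop.
/// (Hypothetical helper, not taken from the thread.)
void casIncrement(ref shared int counter, int n)
{
    foreach (_; 0 .. n)
    {
        int seen;
        do
        {
            seen = atomicLoad(counter);
        } while (!cas(&counter, seen, seen + 1)); // retry if another thread won
    }
}

/// Increments `counter` `n` times with no synchronisation at all.
void plainIncrement(ref int counter, int n)
{
    foreach (_; 0 .. n)
        ++counter;
}

void main()
{
    enum n = 10_000_000; // arbitrary iteration count
    shared int atomicCounter;
    int plainCounter;

    auto sw = StopWatch(AutoStart.yes);
    casIncrement(atomicCounter, n);
    immutable casTime = sw.peek;

    sw.reset();
    plainIncrement(plainCounter, n);
    immutable plainTime = sw.peek;

    writefln("cas loop:   %s", casTime);
    writefln("plain loop: %s", plainTime);
}
```

Building the same file with dmd, gdc and ldc and comparing the first line of output would show whether the gap between compilers is in the atomic operation itself.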

May 05 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/3/14, 4:54 AM, Atila Neves wrote:
 So I tried using unit-threaded to run Phobos unit tests

[snip]

Thanks. Are you using thread pooling (a limited number of threads e.g. 
1.5 * cores running all unittests)? -- Andrei

May 03 2014

"Atila Neves" <atila.neves gmail.com> writes:

On Saturday, 3 May 2014 at 18:16:52 UTC, Andrei Alexandrescu 
wrote:
 On 5/3/14, 4:54 AM, Atila Neves wrote:
 So I tried using unit-threaded to run Phobos unit tests

 [snip]

 Thanks. Are you using thread pooling (a limited number of 
 threads e.g. 1.5 * cores running all unittests)? -- Andrei

I'm using parallel and taskPool from std.parallelism. I was under 
the impression it gave me a ready-to-use pool with as many 
threads as I have cores.
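
A small sketch for checking that impression on a given machine. As far as the std.parallelism documentation goes, the global taskPool is created with defaultPoolThreads workers, which defaults to totalCPUs - 1 because the thread that calls parallel also does work. The helper squareAll is a hypothetical name used only for illustration.

```d
import std.parallelism : defaultPoolThreads, parallel, taskPool, totalCPUs;
import std.stdio : writeln;

/// Squares each element in place using the global taskPool.
/// (Hypothetical helper for illustration.)
void squareAll(int[] values)
{
    foreach (ref v; parallel(values))
        v *= v;
}

void main()
{
    // Logical CPUs as reported by the OS (hyperthreads included).
    writeln("totalCPUs:          ", totalCPUs);
    // Worker count used when the global pool is first created.
    writeln("defaultPoolThreads: ", defaultPoolThreads);
    // Actual number of worker threads in the global pool.
    writeln("taskPool.size:      ", taskPool.size);

    auto values = [1, 2, 3, 4, 5];
    squareAll(values);
    writeln(values);
}
```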

May 03 2014

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Sat, 2014-05-03 at 19:37 +0000, Atila Neves via Digitalmars-d wrote:
[…]
 I'm using parallel and taskPool from std.parallelism. I was under 
 the impression it gave me a ready-to-use pool with as many 
 threads as I have cores.

There is a default, related to the number of cores the OS thinks there
are (*), but you can also set the number manually.  std.parallelism could
do with some work to make it better than it already is.


(*) Physical cores are not necessarily the number reported by the OS due
to core hyperthreads. Quad core no hyperthreads, and dual core, two
hyperthreads per core, both get reported as four processor systems.
However if you benchmark them you get very, very different performance
characteristics.
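
A minimal sketch of setting the number manually, assuming a private TaskPool is acceptable instead of the global one. The function sumOfSquares and the worker count of 2 are hypothetical, picked only to illustrate the call; on a hyperthreaded machine one might try the physical core count here and compare timings.

```d
import std.parallelism : TaskPool;
import std.stdio : writeln;

/// Sums i*i for i in 0 .. n using a private pool with `workers` threads.
/// (Hypothetical helper for illustration.)
long sumOfSquares(int n, uint workers)
{
    auto pool = new TaskPool(workers);
    scope (exit) pool.finish(); // let the worker threads terminate

    auto partial = new long[](n);
    foreach (i, ref p; pool.parallel(partial))
        p = cast(long)(i * i);

    long total = 0;
    foreach (p; partial)
        total += p;
    return total;
}

void main()
{
    // Two workers is an arbitrary choice, not a recommendation.
    writeln(sumOfSquares(1_000, 2));
}
```

Alternatively, assigning to std.parallelism.defaultPoolThreads before the first use of taskPool changes the size of the global pool.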

-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

May 04 2014

"Atila Neves" <atila.neves gmail.com> writes:

Like I mentioned afterwards, I tried a different number of 
threads. On my machine, at least, std.parallelism.totalCPUs 
returns 8, the number of virtual cores. As it should.

Atila

On Sunday, 4 May 2014 at 07:49:51 UTC, Russel Winder via 
Digitalmars-d wrote:
 On Sat, 2014-05-03 at 19:37 +0000, Atila Neves via 
 Digitalmars-d wrote:
 […]
 I'm using parallel and taskPool from std.parallelism. I was 
 under the impression it gave me a ready-to-use pool with as 
 many threads as I have cores.

 There is a default, related to the number of cores the OS thinks there
 are (*), but you can also set the number manually.  std.parallelism could
 do with some work to make it better than it already is.


 (*) Physical cores are not necessarily the number reported by the OS due
 to core hyperthreads. Quad core no hyperthreads, and dual core, two
 hyperthreads per core, both get reported as four processor systems.
 However if you benchmark them you get very, very different performance
 characteristics.

May 04 2014

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Sun, 2014-05-04 at 08:47 +0000, Atila Neves via Digitalmars-d wrote:
 Like I mentioned afterwards, I tried a different number of 
 threads. On my machine, at least, std.parallelism.totalCPUs 
 returns 8, the number of virtual cores. As it should.

If you can create a small example of the problem, and I can remember how
to run std.parallelism as a separate module, I can try and take a look
at this later next week.

-- 
Russel.

May 04 2014

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 5/4/14, 3:06 AM, Russel Winder via Digitalmars-d wrote:
 On Sun, 2014-05-04 at 08:47 +0000, Atila Neves via Digitalmars-d wrote:
 Like I mentioned afterwards, I tried a different number of
 threads. On my machine, at least, std.parallelism.totalCPUs
 returns 8, the number of virtual cores. As it should.

 If you can create a small example of the problem, and I can remember how
 to run std.parallelism as a separate module, I can try and take a look
 at this later next week.

This is an awesome offer, Russel. Thanks! -- Andrei

May 04 2014

Joseph Rushton Wakeling via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 04/05/14 09:49, Russel Winder via Digitalmars-d wrote:
 (*) Physical cores are not necessarily the number reported by the OS due
 to core hyperthreads. Quad core no hyperthreads, and dual core, two
 hyperthreads per core, both get reported as four processor systems.
 However if you benchmark them you get very, very different performance
 characteristics.

Yup.  That bit me with a new laptop the first time I tried parallel programming 
with D :-)

May 04 2014
