
digitalmars.D.learn - Error running concurrent process and storing results in array

reply data pulverizer <data.pulverizer gmail.com> writes:
I have been using std.parallelism and that has worked quite 
nicely but it is not fully utilising all the cpu resources in my 
computation, so I thought it could be good to run it concurrently 
to see if I can get better performance. However I am very new to 
std.concurrency and the baby version of the code I am trying to 
run:

```
void main()
{
   import std.concurrency;
   import std.stdio: writeln;

   void process(double x, double y, long i, shared(double[]) z)
   {
     z[i] = x*y;
   }
   long n = 100;
   shared(double[]) z = new double[n];
   for(long i = 0; i < n; ++i)
   {
     spawn(&process, cast(double)(i), cast(double)(i + 1), i, z);
   }
   writeln("z: ", z);
}
```


Elicits the following error:

```
onlineapp.d(14): Error: template std.concurrency.spawn cannot 
deduce function from argument types !()(void delegate(double x, 
double y, long i, shared(double[]) z) pure nothrow @nogc @safe, 
double, double, long, shared(double[])), candidates are:
/dlang/dmd/linux/bin64/../../src/phobos/std/concurrency.d(460):
      spawn(F, T...)(F fn, T args)
   with F = void delegate(double, double, long, shared(double[])) 
pure nothrow @nogc @safe,
        T = (double, double, long, shared(double[]))
   must satisfy the following constraint:
        isSpawnable!(F, T)
May 05 2020
next sibling parent reply Mathias LANG <geod24 gmail.com> writes:
On Wednesday, 6 May 2020 at 03:25:41 UTC, data pulverizer wrote:
 [...]
The problem here is that `process` is a delegate, not a function. The compiler *should* know it's a function, but for some reason it does not. Making the function static, or moving it outside of the scope of main, will fix it.

For reference, this will spawn 100 threads to do a simple computation, so it's probably not what you would want, I expect. But I suppose this is just example code and the underlying computation is much more expensive?
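Something like this minimal sketch of the `static` variant (untested) should satisfy the `isSpawnable` constraint, since `&process` is then a plain function pointer rather than a delegate:

```
import std.concurrency;
import std.stdio : writeln;

void main()
{
    // static: no hidden context pointer, so &process is a
    // void function(...) that spawn can deduce
    static void process(double x, double y, long i, shared(double[]) z)
    {
        z[i] = x*y;
    }

    long n = 100;
    shared(double[]) z = new double[n];
    for(long i = 0; i < n; ++i)
        spawn(&process, cast(double)(i), cast(double)(i + 1), i, z);
    writeln("z: ", z); // may print before the threads have finished
}
```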
May 05 2020
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 03:33:12 UTC, Mathias LANG wrote:
 On Wednesday, 6 May 2020 at 03:25:41 UTC, data pulverizer wrote:
 [...]
The problem here is that `process` is a delegate, not a function. The compiler *should* know it's a function, but for some reason it does not. Making the function static, or moving it outside of the scope of main, will fix it.
I moved the `process` function out of main and it is now running but it prints out:

```
z: [nan, 2, nan, 12, 20, nan, nan, nan, nan, 90, nan, 132, nan, 
nan, 210, nan, nan, nan, nan, nan, nan, nan, nan, nan, 600, nan, 
nan, nan, nan, nan, 930, 992, 1056, nan, 1190, nan, nan, nan, 
nan, nan, 1640, 1722, nan, nan, nan, nan, nan, nan, nan, nan, 
nan, nan, nan, nan, nan, 3080, nan, nan, 3422, 3540, nan, nan, 
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
nan, 8010, nan, nan, nan, nan, nan, nan, 9312, nan, nan, 9900]
```

Is there something I need to do to wait for each thread to finish computation?
 For reference, this will spawn 100 threads to do a simple 
 computation so probably not what you would want, I expect. But 
 I suppose this is just example code and the underlying 
 computation is much more expensive ?
Yes, that's exactly what I want. The actual computation I'm running is much more expensive and much larger. It shouldn't matter if I have like 100_000_000 threads, should it? The threads should just be queued until the cpu works on them? Thanks
May 05 2020
next sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 5/5/20 8:41 PM, data pulverizer wrote:
 On Wednesday, 6 May 2020 at 03:33:12 UTC, Mathias LANG wrote:
 On Wednesday, 6 May 2020 at 03:25:41 UTC, data pulverizer wrote:
 Is there something I need to do to wait for each thread to finish
 computation?
thread_joinAll(). I have an example here:

http://ddili.org/ders/d.en/concurrency.html#ix_concurrency.thread_joinAll

Although I understand that you're experimenting with std.concurrency, I want to point out that there is also std.parallelism, which may be better suited in many cases. Again, here are some examples:

http://ddili.org/ders/d.en/parallelism.html

Ali
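A minimal sketch of how thread_joinAll() fits the code from this thread (assuming `process` has been moved out of main as Mathias suggested; untested):

```
import std.concurrency : spawn;
import core.thread : thread_joinAll;
import std.stdio : writeln;

void process(double x, double y, long i, shared(double[]) z)
{
    z[i] = x*y;
}

void main()
{
    long n = 100;
    shared(double[]) z = new double[n];
    foreach(long i; 0 .. n)
        spawn(&process, cast(double)(i), cast(double)(i + 1), i, z);
    thread_joinAll(); // block until every spawned thread has finished
    writeln("z: ", z); // all slots are now filled, no more nans
}
```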
May 05 2020
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 03:56:04 UTC, Ali Çehreli wrote:
 On 5/5/20 8:41 PM, data pulverizer wrote:
 On Wednesday, 6 May 2020 at 03:33:12 UTC, Mathias LANG wrote:
 On Wednesday, 6 May 2020 at 03:25:41 UTC, data pulverizer
wrote:
 Is there something I need to do to wait for each thread to
finish
 computation?
thread_joinAll(). I have an example here: http://ddili.org/ders/d.en/concurrency.html#ix_concurrency.thread_joinAll
This worked nicely, thank you very much.
 ... I want to point out that there is also std.parallelism, 
 which may be better suited in many cases.
I actually started off using std.parallelism and it worked well but the CPU usage on all the threads was less than half on my system monitor, meaning there is more performance to be wrung out of my computer, which is why I am now looking into spawn. When you suggested using thread_joinAll() I saw that it is in the `core.thread.osthread` module. It might be shaving the yak at this point but I have tried using `Thread` instead of `spawn`:

```
void process(double x, double y, long i, shared(double[]) z)
{
  z[i] = x*y;
}

void main()
{
  import core.thread.osthread;
  import std.stdio: writeln;

  long n = 100;
  shared(double[]) z = new double[n];
  for(long i = 0; i < n; ++i)
  {
    auto proc = (){
      process(cast(double)(i), cast(double)(i + 1), i, z);
      return;
    };
    new Thread(&proc).start();
  }
  thread_joinAll();
  writeln("z: ", z);
}
```

and I am getting the following error:

```
onlineapp.d(20): Error: none of the overloads of this are callable 
using argument types (void delegate() @system*), candidates are:
/dlang/dmd/linux/bin64/../../src/druntime/import/core/thread/osthread.d(646):
      core.thread.osthread.Thread.this(void function() fn, ulong sz = 0LU)
/dlang/dmd/linux/bin64/../../src/druntime/import/core/thread/osthread.d(671):
      core.thread.osthread.Thread.this(void delegate() dg, ulong sz = 0LU)
/dlang/dmd/linux/bin64/../../src/druntime/import/core/thread/osthread.d(1540):
      core.thread.osthread.Thread.this(ulong sz = 0LU)
```
May 05 2020
parent reply drug <drug2004 bk.ru> writes:
06.05.2020 07:25, data pulverizer writes:
 On Wednesday, 6 May 2020 at 03:56:04 UTC, Ali Çehreli wrote:
 On 5/5/20 8:41 PM, data pulverizer wrote:
 On Wednesday, 6 May 2020 at 03:33:12 UTC, Mathias LANG wrote:
 On Wednesday, 6 May 2020 at 03:25:41 UTC, data pulverizer
wrote:
 Is there something I need to do to wait for each thread to
finish
 computation?
thread_joinAll(). I have an example here: http://ddili.org/ders/d.en/concurrency.html#ix_concurrency.thread_joinAll
This worked nicely thank you very much
 ... I want to point out that there is also std.parallelism, which may 
 be better suited in many cases.
 I actually started off using std.parallelism and it worked well 
 but the CPU usage on all the threads was less than half on my 
 system monitor, meaning there is more performance to be wrung 
 out of my computer, which is why I am now looking into spawn. 
 When you suggested using thread_joinAll() I saw that it is in 
 the `core.thread.osthread` module. It might be shaving the yak 
 at this point but I have tried using `Thread` instead of `spawn`:

 ```
 void process(double x, double y, long i, shared(double[]) z)
 {
   z[i] = x*y;
 }

 void main()
 {
   import core.thread.osthread;
   import std.stdio: writeln;

   long n = 100;
   shared(double[]) z = new double[n];
   for(long i = 0; i < n; ++i)
   {
     auto proc = (){
       process(cast(double)(i), cast(double)(i + 1), i, z);
       return;
     };
proc is already a delegate, so &proc is a pointer to the delegate; just pass `proc` itself
      new Thread(&proc).start();
    }
    thread_joinAll();
    writeln("z: ", z);
 }
 ```
 and I am getting the following error:
 
 ```
 onlineapp.d(20): Error: none of the overloads of this are callable using 
 argument types (void delegate() @system*), candidates are:
 /dlang/dmd/linux/bin64/../../src/druntime/import/core/thread/osthread.d(646): 
      
 core.thread.osthread.Thread.this(void function() fn, ulong sz = 0LU)
 /dlang/dmd/linux/bin64/../../src/druntime/import/core/thread/osthread.d(671): 
      
 core.thread.osthread.Thread.this(void delegate() dg, ulong sz = 0LU)
 /dlang/dmd/linux/bin64/../../src/druntime/import/core/thread/osthread.d(1540):
       
 core.thread.osthread.Thread.this(ulong sz = 0LU)
 ```
 
 
 
May 05 2020
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 05:44:47 UTC, drug wrote:
 proc is already a delegate, so &proc is a pointer to the 
 delegate, just pass a `proc` itself
Thanks, done that but now getting a range violation on z which was not there before.

```
core.exception.RangeError@onlineapp.d(3): Range violation
----------------
??:? _d_arrayboundsp [0x55de2d83a6b5]
onlineapp.d:3 void onlineapp.process(double, double, long, shared(double[])) [0x55de2d8234fd]
onlineapp.d:16 void onlineapp.main().__lambda1() [0x55de2d823658]
??:? void core.thread.osthread.Thread.run() [0x55de2d83bdf9]
??:? thread_entryPoint [0x55de2d85303d]
??:? [0x7fc1d6088668]
```
May 05 2020
parent reply drug <drug2004 bk.ru> writes:
06.05.2020 09:24, data pulverizer writes:
 On Wednesday, 6 May 2020 at 05:44:47 UTC, drug wrote:
 proc is already a delegate, so &proc is a pointer to the delegate, 
 just pass a `proc` itself
 Thanks, done that but now getting a range violation on z which 
 was not there before.

 ```
 core.exception.RangeError@onlineapp.d(3): Range violation
 ----------------
 ??:? _d_arrayboundsp [0x55de2d83a6b5]
 onlineapp.d:3 void onlineapp.process(double, double, long, shared(double[])) [0x55de2d8234fd]
 onlineapp.d:16 void onlineapp.main().__lambda1() [0x55de2d823658]
 ??:? void core.thread.osthread.Thread.run() [0x55de2d83bdf9]
 ??:? thread_entryPoint [0x55de2d85303d]
 ??:? [0x7fc1d6088668]
 ```
Confirmed. I think that's because the `proc` delegate captures the `i` variable of the `for` loop. I managed to get rid of the range violation by using `foreach`:

```
foreach(i; 0..n) // instead of for(long i = 0; i < n; ++i)
```

I guess that the `proc` delegate can't capture the `i` var of a `foreach` loop, so the range violation doesn't happen.

You use the `proc` delegate to pass arguments to the `process` function. I would recommend for this purpose deriving a class from Thread. Then you can pass the arguments in the ctor of the derived class like:

```
foreach(long i; 0..n)
    new DerivedThread(cast(double)(i), cast(double)(i + 1), i, z).start();
thread_joinAll();
```

Not tested example of a derived thread:

```
class DerivedThread : Thread
{
    this(double x, double y, long i, shared(double[]) z)
    {
        this.x = x;
        this.y = y;
        this.i = i;
        this.z = z;
        super(&run);
    }

private:
    void run()
    {
        process(x, y, i, z);
    }

    double x, y;
    long i;
    shared(double[]) z;
}
```
May 05 2020
next sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 06:49:13 UTC, drug wrote:
 ... Then you can pass the arguments in ctor of the derived 
 class like:
 ```
 foreach(long i; 0..n)
     new DerivedThread(cast(double)(i), cast(double)(i + 1), i, z).start();
 thread_joinAll();
 ```

 not tested example of derived thread
 ```
 class DerivedThread : Thread
 {
     this(double x, double y, long i, shared(double[]) z)
     {
         this.x = x;
         this.y = y;
         this.i = i;
         this.z = z;
         super(&run);
     }
 private:
     void run()
     {
          process(x, y, i, z);
     }
     double x, y;
     long i;
     shared(double[]) z;
 }
 ```
Thanks. Now working.
May 06 2020
prev sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 5/6/20 2:49 AM, drug wrote:
 06.05.2020 09:24, data pulverizer writes:
 On Wednesday, 6 May 2020 at 05:44:47 UTC, drug wrote:
 proc is already a delegate, so &proc is a pointer to the delegate, 
 just pass a `proc` itself
 Thanks, done that but now getting a range violation on z which 
 was not there before.

 ```
 core.exception.RangeError@onlineapp.d(3): Range violation
 ----------------
 ??:? _d_arrayboundsp [0x55de2d83a6b5]
 onlineapp.d:3 void onlineapp.process(double, double, long, shared(double[])) [0x55de2d8234fd]
 onlineapp.d:16 void onlineapp.main().__lambda1() [0x55de2d823658]
 ??:? void core.thread.osthread.Thread.run() [0x55de2d83bdf9]
 ??:? thread_entryPoint [0x55de2d85303d]
 ??:? [0x7fc1d6088668]
 ```
 Confirmed. I think that's because the `proc` delegate captures 
 the `i` variable of the `for` loop. I managed to get rid of the 
 range violation by using `foreach`:

 ```
 foreach(i; 0..n) // instead of for(long i = 0; i < n; ++i)
 ```

 I guess that the `proc` delegate can't capture the `i` var of a 
 `foreach` loop, so the range violation doesn't happen.
foreach over a range of integers is lowered to an equivalent for loop, so that was not the problem. Indeed, D does not capture individual for loop contexts, only the context of the entire function.
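A minimal sketch of that function-level capture (hypothetical values, untested): every delegate created in the loop sees the same `i` slot, so they all observe its final value:

```
import std.stdio : writeln;

void main()
{
    int delegate()[] dgs;
    for(int i = 0; i < 3; ++i)
        dgs ~= () => i; // all three delegates share main's single i

    foreach(dg; dgs)
        writeln(dg()); // prints 3, 3, 3, not 0, 1, 2
}
```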
 
 you use `proc` delegate to pass arguments to `process` function. I would 
 recommend for this purpose to derive a class from class Thread. Then you 
 can pass the arguments in ctor of the derived class like:
 ```
 foreach(long i; 0..n)
      new DerivedThread(cast(double)(i), cast(double)(i + 1), i, z).start(); 
 thread_joinAll();
 ```
This is why it works: you are capturing the value manually while in the loop itself. Another way to do this is to create a new capture context:

```
foreach(long i; 0 .. n)
{
    auto proc = (val => {
        process(cast(double)(val), cast(double)(val + 1), val, z);
    })(i);
    ...
}
```

-Steve
May 06 2020
parent reply drug <drug2004 bk.ru> writes:
06.05.2020 16:57, Steven Schveighoffer writes:
 ```
 foreach(i; 0..n) // instead of for(long i = 0; i < n; ++i)
 ```
 I guess that the `proc` delegate can't capture the `i` var of a 
 `foreach` loop, so the range violation doesn't happen.
foreach over a range of integers is lowered to an equivalent for loop, so that was not the problem.
I was surprised, but the `foreach` version does not have the range violation, so there is a difference between `foreach` and `for` loops. I did not try DerivedThread at all, I only suggested it to avoid variable capture. I just replaced `for` with `foreach` and the range violation was gone. Probably this is an implementation detail.
May 06 2020
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 5/6/20 2:29 PM, drug wrote:
 06.05.2020 16:57, Steven Schveighoffer writes:
 ```
 foreach(i; 0..n) // instead of for(long i = 0; i < n; ++i)
 ```
 I guess that the `proc` delegate can't capture the `i` var of a 
 `foreach` loop, so the range violation doesn't happen.
foreach over a range of integers is lowered to an equivalent for loop, so that was not the problem.
 I was surprised, but the `foreach` version does not have the 
 range violation, so there is a difference between `foreach` and 
 `for` loops. I did not try DerivedThread at all, I only 
 suggested it to avoid variable capture. I just replaced `for` 
 with `foreach` and the range violation was gone. Probably this 
 is an implementation detail.
Ah yes, because foreach(i; 0 .. n) actually uses a hidden variable to iterate, and assigns it to i each time through the loop. It used to just use i for iteration, but then you could play tricks by adjusting i. So the equivalent for loop would be:

```
for(int _i = 0; _i < n; ++_i)
{
    auto i = _i; // this won't be executed after _i is out of range
    ... // foreach body
}
```

So the problem would not be a range error, but just random i's coming through to the various threads ;) Very interesting!

-Steve
May 06 2020
prev sibling parent reply Mathias LANG <geod24 gmail.com> writes:
On Wednesday, 6 May 2020 at 03:41:11 UTC, data pulverizer wrote:
 Is there something I need to do to wait for each thread to 
 finish computation?
Yeah, you need to synchronize so that your main thread waits on all the other threads to finish. Look up `Thread.join`.
 Yes, that's exactly what I want the actual computation I'm 
 running is much more expensive and much larger. It shouldn't 
 matter if I have like 100_000_000 threads should it? The 
 threads should just be queued until the cpu works on it?
It does matter quite a bit. Each thread has its own resources allocated to it, and some parts of the language will need to interact with *all* threads, e.g. the GC. In general, if you want to parallelize something, you should aim to have as many threads as you have cores. Having 100M threads will mean you have to do a lot of context switches. You might want to look up the difference between tasks and threads.
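A minimal sketch of the task-based version using `std.parallelism` (sizes from this thread; untested): a fixed pool of workers, one per core by default, consumes the queued tasks instead of one thread per item:

```
import std.parallelism : task, taskPool, totalCPUs;
import std.stdio : writeln;

void process(double x, double y, long i, shared(double[]) z)
{
    z[i] = x*y;
}

void main()
{
    long n = 100;
    shared(double[]) z = new double[n];
    writeln("logical cores: ", totalCPUs);

    foreach(long i; 0 .. n)
        taskPool.put(task!process(cast(double)(i), cast(double)(i + 1), i, z));

    taskPool.finish(true); // block until the queue has been drained
    writeln("z: ", z);
}
```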
May 05 2020
next sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 04:04:14 UTC, Mathias LANG wrote:
 On Wednesday, 6 May 2020 at 03:41:11 UTC, data pulverizer wrote:
 Yes, that's exactly what I want the actual computation I'm 
 running is much more expensive and much larger. It shouldn't 
 matter if I have like 100_000_000 threads should it? The 
 threads should just be queued until the cpu works on it?
It does matter quite a bit. Each thread has its own resources allocated to it, and some part of the language will need to interact with *all* threads, e.g. the GC. In general, if you want to parallelize something, you should aim to have as many threads as you have cores. Having 100M threads will mean you have to do a lot of context switches. You might want to look up the difference between tasks and threads.
Sorry, I meant 10_000 not 100_000_000. I squared the number by mistake because I'm calculating a 10_000 x 10_000 matrix; it's only 10_000 tasks, so 1 task does 10_000 calculations. The actual bit of code I'm parallelising is here:

```
auto calculateKernelMatrix(T)(AbstractKernel!(T) K, Matrix!(T) data)
{
  long n = data.ncol;
  auto mat = new Matrix!(T)(n, n);
  foreach(j; taskPool.parallel(iota(n)))
  {
    auto arrj = data.refColumnSelect(j).array;
    for(long i = j; i < n; ++i)
    {
      mat[i, j] = K.kernel(data.refColumnSelect(i).array, arrj);
      mat[j, i] = mat[i, j];
    }
  }
  return mat;
}
```

At the moment this code is running a little bit faster than threaded simd optimised Julia code, but as I said in an earlier reply to Ali, when I look at my system monitor I can see that all the D threads are active and running at ~ 40% usage, meaning that they are mostly doing nothing. The Julia code runs all threads at 100% and is still a tiny bit slower, so my (maybe incorrect?) assumption is that I could get more performance from D.

The method `refColumnSelect(j).array` is (trying to) reference a column from a matrix (a 1D array with computed index referencing) which I select from the matrix using:

```
return new Matrix!(T)(data[startIndex..(startIndex + nrow)], [nrow, 1]);
```

If I use the above code, am I wrong in assuming that the sliced data (T[]) is referenced rather than copied? So that if I do:

```
auto myData = data[5 .. 10];
```

myData is referencing elements [5..10] of data and not creating a new array with elements data[5..10] copied?
May 05 2020
next sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 04:52:30 UTC, data pulverizer wrote:
 myData is referencing elements [5..10] of data and not creating 
 a new array with elements data[5..10] copied?
Just checked this and can confirm that the data is not being copied so that is not the source of cpu idling: https://ddili.org/ders/d.en/slices.html
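The kind of quick check I mean, for anyone following along (a minimal sketch with made-up numbers):

```
import std.stdio : writeln;

void main()
{
    auto data = [0.0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
    auto myData = data[5 .. 10]; // a slice: pointer + length into data
    myData[0] = 42;              // writes through to the original array
    writeln(data[5]);            // prints 42, so nothing was copied
}
```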
May 05 2020
prev sibling parent reply drug <drug2004 bk.ru> writes:
06.05.2020 07:52, data pulverizer writes:
 [...]
General advice - try to avoid using `array` and `new` in hot code. Memory allocation is slow in general, except if you use carefully crafted custom memory allocators. And that can easily be the reason for 40% cpu usage, because the cores are waiting for the memory subsystem.
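A minimal sketch of the idea with made-up sizes: allocate once outside the hot loop and reuse the buffer, instead of allocating per iteration:

```
import std.stdio : writeln;

void main()
{
    enum n = 1_000;

    // Allocating inside the loop (new double[n] per iteration) would
    // hit the GC n times; hoisting it out costs one allocation total.
    auto buf = new double[n];
    double total = 0;
    foreach(i; 0 .. n)
    {
        buf[] = cast(double) i; // reuse the same storage every pass
        total += buf[0];
    }
    writeln(total); // 499500
}
```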
May 05 2020
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 05:50:23 UTC, drug wrote:
 General advice - try to avoid using `array` and `new` in hot 
 code. Memory allocation is slow in general, except if you use 
 carefully crafted custom memory allocators. And that can easily 
 be the reason for 40% cpu usage, because the cores are waiting 
 for the memory subsystem.
I changed the Matrix object from class to struct and timing went from about 19 seconds with ldc2 and flags `-O5` to 13.69 seconds, but CPU usage is still at ~ 40%, still using `taskPool.parallel(iota(n))`. The `.array` method is my own method on the Matrix object that just returns the internal data array, so it shouldn't copy. Julia is now at about 34 seconds (D was at about 30 seconds while just using dmd with no optimizations). To make things more interesting I also did an implementation in Chapel, which is now at around 9 seconds with the `--fast` flag.
May 05 2020
parent reply drug <drug2004 bk.ru> writes:
06.05.2020 09:43, data pulverizer writes:
 [...]
Things are really interesting. So there is space to improve performance by 2.5 times :-)

Yes, `array` is smart enough: if you call it on another array it is a no-op.

What does `--fast` mean in Chapel? Did you try `--fast-math` in ldc? I don't know if -O5 uses this flag.
May 05 2020
next sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 06:54:07 UTC, drug wrote:
 Things are really interesting. So there is space to improve 
 performance by 2.5 times :-)
 Yes, `array` is smart enough: if you call it on another array 
 it is a no-op.
 What does `--fast` mean in Chapel? Did you try `--fast-math` in 
 ldc? I don't know if -O5 uses this flag.
I tried `--fast-math` in ldc but it didn't make any difference; the documentation of `--fast` in Chapel says "Disable checks; optimize/specialize".
May 06 2020
next sibling parent data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 07:27:19 UTC, data pulverizer wrote:
 [...]
I tried `--fast-math` in ldc but it didn't make any difference the documentation of `--fast` in Chapel says "Disable checks; optimize/specialize".
Just tried removing the boundscheck and got 1.5 seconds in D!
May 06 2020
prev sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 07:27:19 UTC, data pulverizer wrote:
 [...]
I tried `--fast-math` in ldc but it didn't make any difference the documentation of `--fast` in Chapel says "Disable checks; optimize/specialize".
Just tried removing the boundscheck and got 1.5 seconds in D!
May 06 2020
next sibling parent reply drug <drug2004 bk.ru> writes:
06.05.2020 10:42, data pulverizer writes:
 On Wednesday, 6 May 2020 at 07:27:19 UTC, data pulverizer wrote:
 [...]
I tried `--fast-math` in ldc but it didn't make any difference the documentation of `--fast` in Chapel says "Disable checks; optimize/specialize".
Just tried removing the boundscheck and got 1.5 seconds in D!
Congrats! It looks like a thriller! What about CPU usage? The same 40%?
May 06 2020
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 07:47:59 UTC, drug wrote:
 06.05.2020 10:42, data pulverizer writes:
 [...]
Just tried removing the boundscheck and got 1.5 seconds in D!
 Congrats! It looks like a thriller! What about CPU usage? The same 40%?
CPU usage now revs up and almost has time to touch 100% before the process is finished! Interestingly, using `--boundscheck=off` without `--ffast-math` gives a timing of around 4 seconds, whereas using `--ffast-math` without `--boundscheck=off` made no difference; having both gives us the 1.5 seconds. As Jacob Carlborg suggested I tried adding `-mcpu=native -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto` but I didn't see any difference.

Current Julia time is still around 35 seconds even when using @inbounds @simd and running julia -O3 --check-bounds=no, but I'll probably need to run the code by the Julia community to see whether it can be further optimized; it's pretty interesting to see D so far in front. Interestingly, when I attempt to switch off the garbage collector in Julia, the process gets killed because my computer runs out of memory (I have over 26 GB of memory free), whereas in D the memory I'm using barely registers (max 300MB); it uses even less than Chapel (max 500MB), which doesn't use much at all. It's exactly the same computation; D and Julia's timings were similar before the GC optimization and compiler flag magic in D.
May 06 2020
parent reply drug <drug2004 bk.ru> writes:
06.05.2020 11:18, data pulverizer writes:
 
 [...]
What is the current D time? It would be really nice if you wrote a summary of your research.
May 06 2020
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 08:28:41 UTC, drug wrote:
 What is the current D time? ...
Current Times:

D:      ~ 1.5 seconds
Chapel: ~ 9 seconds
Julia:  ~ 35 seconds
 That would be really nice if you make the resume of your 
 research.
Yes, I'll do a blog or something on GitHub and link it. Thanks for all your help.
May 06 2020
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2020-05-06 12:23, data pulverizer wrote:

 Yes, I'll do a blog or something on GitHub and link it.
It would be nice if you could get it published on the Dlang blog [1]. One usually gets paid for that. Contact Mike Parker.

[1] https://blog.dlang.org

-- 
/Jacob Carlborg
May 06 2020
next sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 17:31:39 UTC, Jacob Carlborg wrote:
 On 2020-05-06 12:23, data pulverizer wrote:

 Yes, I'll do a blog or something on GitHub and link it.
 It would be nice if you could get it published on the Dlang 
 blog [1]. One usually gets paid for that. Contact Mike Parker.

 [1] https://blog.dlang.org
I'm definitely open to publishing it on the Dlang blog; getting paid would be nice. I've just done a full reconciliation of the outputs from D and Chapel with Julia's output and they're all the same. In the calculation I used 32-bit floats to minimise memory consumption, and I was also working with the 10,000-image MNIST data set (t10k-images-idx3-ubyte.gz) http://yann.lecun.com/exdb/mnist/ rather than randomly generated data.

The -O3 to -O5 optimization on the ldc compiler is instrumental in bringing the times down; going with -O2 based optimization even with the other flags gives us ~ 13 seconds for the 10,000 dataset rather than the very nice 1.5 seconds.

As an idea of how kernel matrix computations scale: the file "train-images-idx3-ubyte.gz" contains 60,000 images. Julia performs that kernel matrix calculation in 1340 seconds while D performs it in 163 seconds, which is not really in line with the first timing; since the work grows quadratically with the number of images I'd expect around 1.5*36 = 54 seconds. Chapel performs it in 357 seconds, approximately in line with the original. The new kernel matrix consumes about 14 GB of memory, which is why I chose to use 32-bit floats: to give me an opportunity to do the kernel matrix calculation on my laptop, which currently has 31GB of RAM.
May 06 2020
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 23:10:05 UTC, data pulverizer wrote:
 The -O3 to -O5 optimization on the ldc compiler is instrumental 
 in bringing the times down; going with -O2 based optimization 
 even with the other flags gives us ~ 13 seconds for the 10,000 
 dataset rather than the very nice 1.5 seconds.
What is the difference between -O2 and -O3 ldc2 compiler optimizations?
May 06 2020
parent Jacob Carlborg <doob me.com> writes:
On 2020-05-07 02:17, data pulverizer wrote:

 What is the difference between -O2 and -O3 ldc2 compiler optimizations?
`--help` says -O2 is "Good optimizations" and -O3 "Aggressive optimizations". Not very specific.

-- 
/Jacob Carlborg
May 11 2020
prev sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 17:31:39 UTC, Jacob Carlborg wrote:
 On 2020-05-06 12:23, data pulverizer wrote:

 Yes, I'll do a blog or something on GitHub and link it.
It would be nice if you could get it published on the Dlang blog [1]. One usually get paid for that. Contact Mike Parker. [1] https://blog.dlang.org
Started uploading the code and writing the article for this. The code for each language can be run; see the script.x files in each folder for details and timings.

https://github.com/dataPulverizer/KernelMatrixBenchmark

Thanks
May 21 2020
parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 21 May 2020 at 07:38:45 UTC, data pulverizer wrote:
 Started uploading the code and writing the article for this. 
 The code for each language can be run, see the script.x files 
 in each folder for details and timings.

 https://github.com/dataPulverizer/KernelMatrixBenchmark

 Thanks
First draft of the article is done. I welcome comments. I love writing D code but I wanted the article to be as "fair and balanced" as possible. Don't you just love that phrase? Also the article kind of morphed into a more general discussion of the programming languages.
May 21 2020
prev sibling next sibling parent drug <drug2004 bk.ru> writes:
06.05.2020 13:23, data pulverizer writes:
 On Wednesday, 6 May 2020 at 08:28:41 UTC, drug wrote:
 What is the current D time? ...
 Current Times:

 D:      ~ 1.5 seconds
 Chapel: ~ 9 seconds
 Julia:  ~ 35 seconds
Oh, I'm impressed. I thought that the D time had been decreased by 1.5 seconds, but it is 1.5 seconds!
 It would be really nice if you wrote a summary of your research.
Yes, I'll do a blog or something on GitHub and link it. Thanks for all your help.
You're welcome! Helping others helps me too.
May 06 2020
prev sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 10:23:17 UTC, data pulverizer wrote:
 D:      ~ 1.5 seconds
This is going to sound absurd, but can we do even better? If none of the optimizations we have so far uses SIMD, maybe we can get even better performance by using it. I think I need to go and read a SIMD primer.
May 06 2020
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 7 May 2020 at 02:06:32 UTC, data pulverizer wrote:
 On Wednesday, 6 May 2020 at 10:23:17 UTC, data pulverizer wrote:
 D:      ~ 1.5 seconds
This is going to sound absurd but can we do even better? If none of the optimizations we have so far is using simd maybe we can get even better performance by using it. I think I need to go and read a simd primer.
After running the Julia code by the Julia community they made some changes (using views rather than passing copies of the array) and their time has come down to ~ 2.5 seconds. The plot thickens.
May 07 2020
next sibling parent reply drug <drug2004 bk.ru> writes:
07.05.2020 17:49, data pulverizer writes:
 On Thursday, 7 May 2020 at 02:06:32 UTC, data pulverizer wrote:
 On Wednesday, 6 May 2020 at 10:23:17 UTC, data pulverizer wrote:
 D:      ~ 1.5 seconds
This is going to sound absurd but can we do even better? If none of the optimizations we have so far is using simd maybe we can get even better performance by using it. I think I need to go and read a simd primer.
After running the Julia code by the Julia community they made some changes (using views rather than passing copies of the array) and their time has come down to ~ 2.5 seconds. The plot thickens.
That's a good sign, because I was afraid that the super short D time was the result of a wrong benchmark (too good to be true). I'm glad the D time really is both great and real. Your blog post definitely will be very interesting.
May 07 2020
parent data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 7 May 2020 at 15:41:12 UTC, drug wrote:
 07.05.2020 17:49, data pulverizer writes:
 On Thursday, 7 May 2020 at 02:06:32 UTC, data pulverizer wrote:
 On Wednesday, 6 May 2020 at 10:23:17 UTC, data pulverizer 
 wrote:
 D:      ~ 1.5 seconds
After running the Julia code by the Julia community they made some changes (using views rather than passing copies of the array) and their time has come down to ~ 2.5 seconds. The plot thickens.
 That's a good sign, because I was afraid that the super short D 
 time was the result of a wrong benchmark (too good to be true). 
 I'm glad the D time really is both great and real. Your blog 
 post definitely will be very interesting.
Don't worry, the full code will be released so that it can be inspected by anyone interested before the blog is published. Now working on a Nim version for even more comparison. Can't wait to find out how everything compares.
May 07 2020
prev sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Thursday, 7 May 2020 at 14:49:43 UTC, data pulverizer wrote:
 After running the Julia code by the Julia community they made 
 some changes (using views rather than passing copies of the 
 array) and their time has come down to ~ 2.5 seconds. The plot 
 thickens.
I've run the Chapel code past the Chapel programming language people and they've brought the time down to ~ 6.5 seconds. I've disallowed calling BLAS because I'm looking at the performance of the programming language implementations rather than their ability to call other libraries. So far the times are looking like this:

D:      ~ 1.5 seconds
Julia:  ~ 2.5 seconds
Chapel: ~ 6.5 seconds

I've been working on the Nim benchmark and have written a little set of byte order functions for big -> little endian conversion (https://gist.github.com/dataPulverizer/744fadf8924ae96135fc600ac86c7060), which was fun and has the ntoh, hton, and so forth functions that can be applied to any basic type. Now writing a little matrix type in the same vein as the D matrix type I wrote, and then I'll do the easy bit, which is writing the kernel matrix algorithm itself.

In the end I'll run the benchmark on data of various sizes. Currently I'm just running it on the (10,000 x 784) data set which outputs a (10,000 x 10,000) matrix. I'll end up running (5,000 x 784), (10,000 x 784), (20,000 x 784), (30,000 x 784), (40,000 x 784), (50,000 x 784), and (60,000 x 784). Ideally I'd measure each one 100 times and plot confidence intervals, but I'll have to settle for measuring each one 3 times and taking an average, otherwise it will take too much time. I don't think that D will have it its own way for all the data sizes; from what I can see, Julia may do better at the largest data set, maybe SIMD will be a factor there.

The data set sizes are not randomly chosen. In many common data science tasks, maybe > 90% of what data scientists currently work on, people work with data sets in this range or even smaller; the big data stuff is much less common unless you're working for Google (FANGs) or a specialist startup. I remember running a kernel cluster in often used "data science" languages (none of which I'm benchmarking here) and it wasn't done after an hour, then hung and crashed; I implemented something in Julia and it was done in a minute. Calculating kernel matrices is the cornerstone of many kernel-based machine learning libraries: kernel PCA, kernel clustering, SVM and so on. It's a pretty important thing to calculate and shows the potential of these languages in the data science field. I think an article like this is valid for people that implement numerical libraries. I'm also hoping to throw in C++ by way of comparison.
May 08 2020
parent reply data pulverizer <data.pulverizer gmail.com> writes:
On Friday, 8 May 2020 at 13:36:22 UTC, data pulverizer wrote:
 ...I've disallowed calling BLAS because I'm looking at the 
 performance of the programming language implementations rather 
 than it's ability to call other libraries.
Also, BLAS is of limited use for most of the kernel functions. It's very useful for DotKernel but not as much for the others (there are many such functions) and of no use for some, which is another contributory factor to why I chose kernel matrix calculations: you can't always call a library, sometimes you just need to write performant code.
May 08 2020
parent reply wjoe <invalid example.com> writes:
On Friday, 8 May 2020 at 13:43:40 UTC, data pulverizer wrote:
 [...] I also chose kernel matrix calculations, you can't always 
 call a library, sometimes you just need to write performant 
 code.
Aren't kernel function calls suffering a context switch, though?
May 13 2020
parent data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 13 May 2020 at 15:13:50 UTC, wjoe wrote:
 On Friday, 8 May 2020 at 13:43:40 UTC, data pulverizer wrote:
 [...] I also chose kernel matrix calculations, you can't 
 always call a library, sometimes you just need to write 
 performant code.
Aren't kernel function calls suffering a context switch though ?
Why would they?
May 14 2020
prev sibling parent reply WebFreak001 <d.forum webfreak.org> writes:
On Wednesday, 6 May 2020 at 07:42:44 UTC, data pulverizer wrote:
 On Wednesday, 6 May 2020 at 07:27:19 UTC, data pulverizer wrote:
 [...]
Just tried removing the boundscheck and got 1.5 seconds in D!
Cool! But before getting too excited I would recommend you to also run tests that check the resulting data is even still correct before you keep this in, if you haven't done this already!

If you feel like it, I would recommend you to write up a small blog article on what you learned about improving the performance of hot code like this. Maybe simply write a post on reddit or make a full blog or something.

Ultimately: all the smart suggestions in here should probably be aggregated. More benchmarks and more blog articles always help discoverability.
May 06 2020
parent data pulverizer <data.pulverizer gmail.com> writes:
On Wednesday, 6 May 2020 at 07:57:46 UTC, WebFreak001 wrote:
 On Wednesday, 6 May 2020 at 07:42:44 UTC, data pulverizer wrote:
 On Wednesday, 6 May 2020 at 07:27:19 UTC, data pulverizer 
 wrote:
 Just tried removing the boundscheck and got 1.5 seconds in D!
 Cool! But before getting too excited I would recommend you to 
 also run tests that check the resulting data is even still 
 correct before you keep this in, if you haven't done this 
 already!
Yes, I've been outputting portions of the result which is a 10_000 x 10_000 matrix but it's definitely a good idea to do a full reconciliation of the outputs from all the languages.
 If you feel like it, I would recommend you to write up some 
 small blog article what you learned about how to improve 
 performance of hot code like this. Maybe simply write a post on 
 reddit or make a full blog or something.
I'll probably do a blog on GitHub and it can be linked on reddit.
 Ultimately: all the smart suggestions in here should probably 
 be aggregated. More benchmarks and more blog articles always 
 help the discoverability then.
Definitely. Julia has a very nice performance optimization section that makes things easy to start with: https://docs.julialang.org/en/v1/manual/performance-tips/index.html. It helps a lot to start getting your code speedy before you ask for help from the community.
May 06 2020
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2020-05-06 08:54, drug wrote:

 Do you try `--fast-math` in ldc? Don't know if 05 use this flag
Try the following flags as well: `-mcpu=native -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`

-- 
/Jacob Carlborg
May 06 2020
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2020-05-06 06:04, Mathias LANG wrote:

 In general, if you want to parallelize something, you should aim to have 
 as many threads as you have cores.
That should be _logical_ cores. If the CPU supports hyper threading it can run two threads per core. -- /Jacob Carlborg
May 06 2020
prev sibling next sibling parent drug <drug2004 bk.ru> writes:
06.05.2020 06:25, data pulverizer writes:
 
 ```
 onlineapp.d(14): Error: template std.concurrency.spawn cannot deduce 
 function from argument types !()(void delegate(double x, double y, long 
 i, shared(double[]) z) pure nothrow @nogc @safe, double, double, long, 
 shared(double[])), candidates are:
 /dlang/dmd/linux/bin64/../../src/phobos/std/concurrency.d(460):      
 spawn(F, T...)(F fn, T args)
    with F = void delegate(double, double, long, shared(double[])) pure 
 nothrow @nogc @safe,
         T = (double, double, long, shared(double[]))
    must satisfy the following constraint:
         isSpawnable!(F, T)
 ```
 
I think the problem is in the `process` attributes (the error message you posted is strange, is it the full message?). Make your `process` function a template to let the compiler deduce its attributes, or set them manually.
May 05 2020
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2020-05-06 05:25, data pulverizer wrote:
 I have been using std.parallelism and that has worked quite nicely but 
 it is not fully utilising all the cpu resources in my computation
If you happen to be using macOS, I know that when std.parallelism checks how many cores the computer has, it checks physical cores instead of logical cores. That could be a reason, if you're running macOS.

-- 
/Jacob Carlborg
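If that bites, a minimal workaround sketch (the core count here is hypothetical) is to size the pool yourself via `defaultPoolThreads` before `taskPool` is first used:

```
import std.parallelism : defaultPoolThreads, taskPool;
import std.stdio : writeln;

void main()
{
    // Suppose the machine has 8 logical cores but only 4 physical
    // ones get reported. Setting this before the first use of taskPool
    // overrides the default pool size (totalCPUs - 1).
    defaultPoolThreads = 8;

    writeln(taskPool.size); // 8 worker threads (plus the main thread)
}
```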
May 06 2020