digitalmars.D.learn - parallel is slower than serial


Yura <yuriy.min gmail.com> writes:

Dear All,

I am trying to make some simple code run in parallel. The parallel 
version works and gives the same result as the serial one, but it is slower.

First, the parallel features I am using:

import core.thread: Thread;
import std.range;
import std.parallelism:parallel;
import std.parallelism:taskPool;
import std.parallelism:totalCPUs;

// Then, I have an array of structures

shared Sphere [] dot;

// Each Sphere is

struct Sphere {
   string El;
   double x;
   double y;
   double z;
   double S;
   double Z;
   double V;
}

// Then for each Sphere, i.e. dot[i]
// I need to do some arithmetics with itself and other dots
// I have only parallelized the inner loop, i is fixed.

// parallel loop
auto I = std.range.iota(0,dot.length);
shared double [] Ai;
Ai.length = dot.length;
foreach (j;parallel(I)) {
   Ai[j] = GETAij (i, j, dot[i], dot[j]);
}

for (auto j=0;j<Ai.length;j++) {
   A = A ~ Ai[j];
}

// the function GETAij

// this is the function to calculate Aij cells
// in parallel

double GETAij (ulong i, ulong j, Sphere dot_i, Sphere dot_j) {
   double Aij;
   if (i == j) {
     Aij = 1.0694*pow((4*pi/dot_i.S),0.5);
   }
   else {
     Aij = 1/(distDD(dot_i,dot_j));
   }
   return Aij;
}

double distDD (Sphere A, Sphere B) {
   double dx2 = (A.x-B.x)*(A.x-B.x);
   double dy2 = (A.y-B.y)*(A.y-B.y);
   double dz2 = (A.z-B.z)*(A.z-B.z);
   double d = pow((dx2 + dy2 + dz2),0.5);
   return d;
}

What am I doing wrong? Are there any advanced options for the ldc2 
compiler that I should use? Many thanks in advance!

Oct 18 2022

Guillaume Piolat <first.last spam.org> writes:

On Tuesday, 18 October 2022 at 11:56:30 UTC, Yura wrote:
 What am I doing wrong?

Your tasks are way too small.
To gain anything from OS threads, you must think of tasks that 
take on the order of milliseconds rather than less than 0.1 ms.
Otherwise you will just pay extra in synchronization costs.

Oct 18 2022

Ali Çehreli <acehreli yahoo.com> writes:

On 10/18/22 06:24, Guillaume Piolat wrote:

 To gain anything from OS threads, you must think of tasks that take on
 the order of milliseconds rather than less than 0.1 ms.
 Otherwise you will just pay extra in synchronization costs.

In other words, the OP can adjust the work unit size. It is in the official 
documentation, but I also mention it on slide 72 of the section that 
starts at the following point:

   https://youtu.be/dRORNQIB2wA?t=1327

Ali

Oct 18 2022
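
A minimal sketch of the work unit size adjustment mentioned above, assuming the two-argument form `parallel(range, workUnitSize)` from `std.parallelism`. The names `cheapWork` and `fillParallel` and the chunk size of 10_000 are hypothetical and chosen only for illustration; a suitable value has to be found by measurement.

```D
import std.math : sqrt;
import std.parallelism : parallel;
import std.range : iota;

// Hypothetical stand-in for a very cheap per-element computation.
double cheapWork(size_t j)
{
    return 1.0 / sqrt(cast(double)(j + 1));
}

// Fill `result` in parallel. Each task handed to a worker thread covers
// `workUnitSize` consecutive indices, so larger values mean fewer,
// longer-running tasks and less synchronization per element.
void fillParallel(double[] result, size_t workUnitSize)
{
    foreach (j; parallel(iota(result.length), workUnitSize))
        result[j] = cheapWork(j);
}

void main()
{
    auto result = new double[](1_000_000);
    fillParallel(result, 10_000); // 100 work units of 10_000 elements each
    assert(result[0] == 1.0);
    assert(result[3] == 0.5);
}
```

With work this cheap, a serial loop may still win; the point is only that the size of each task is something the caller can control.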

Siarhei Siamashka <siarhei.siamashka gmail.com> writes:

On Tuesday, 18 October 2022 at 11:56:30 UTC, Yura wrote:
 ```D
 // Then for each Sphere, i.e. dot[i]
 // I need to do some arithmetics with itself and other dots
 // I have only parallelized the inner loop, i is fixed.
 ```

It's usually a much better idea to parallelize the outer loop. 
Even OpenMP tutorials explain this: 
https://ppc.cs.aalto.fi/ch3/nested/ (check the "collapse it into 
one loop" suggestion from it).

 ```D
 for (auto j=0;j<Ai.length;j++) {
   A = A ~ Ai[j];
 }
 ```

This way of appending to an array is very slow, because `A ~ Ai[j]` 
allocates a new array and copies all of `A` on every iteration. 
`A ~= Ai[j];` is much faster. And even better would be `A ~= Ai;` 
instead of the whole loop.

Oct 18 2022
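
A rough sketch of both suggestions above: the outer loop is the parallel one, each task fills one row of a preallocated array, and a single bulk `~=` replaces element-by-element appending. `Point`, `dist` and `buildMatrix` are hypothetical simplifications of the original `Sphere` code, and the diagonal value of 0.0 is a placeholder, not the formula from the original post.

```D
import std.math : sqrt;
import std.parallelism : parallel;
import std.range : iota;

// Simplified, hypothetical stand-in for the Sphere struct.
struct Point
{
    double x, y, z;
}

double dist(Point a, Point b)
{
    double dx = a.x - b.x;
    double dy = a.y - b.y;
    double dz = a.z - b.z;
    return sqrt(dx * dx + dy * dy + dz * dz);
}

// Build the n x n matrix in row-major order; each parallel task owns one row.
double[] buildMatrix(const Point[] dots)
{
    size_t n = dots.length;
    auto a = new double[](n * n);      // preallocated, nothing is appended
    foreach (i; parallel(iota(n)))     // the outer loop runs in parallel
    {
        auto row = a[i * n .. (i + 1) * n];
        foreach (j; 0 .. n)            // the inner loop stays serial
            row[j] = (i == j) ? 0.0 : 1.0 / dist(dots[i], dots[j]);
    }
    return a;
}

void main()
{
    auto dots = [Point(0.0, 0.0, 0.0), Point(0.0, 0.0, 2.0), Point(0.0, 4.0, 0.0)];
    auto a = buildMatrix(dots);
    assert(a.length == 9);
    assert(a[1] == 0.5 && a[3] == 0.5);   // distance 2 in both directions

    double[] all;
    all ~= a;   // one bulk append instead of `all = all ~ a[j]` in a loop
    assert(all.length == 9);
}
```

Because every task writes only to its own row, no locking is needed, and each task does n distance computations instead of one.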

Yura <yuriy.min gmail.com> writes:

Thank you, folks, for your hints and suggestions!

Indeed, I rewrote the code, and it is now substantially faster and 
parallelizes well.

Instead of making only the inner loop parallel, I made both loops 
parallel. For that I had to convert the 2d index into a 1d index, and 
then back to 2d. Essentially I had to calculate each element Aij of the 
matrix, and then I put everything into a 1d array.

And yes, A = A ~ Aij was very slow; to avoid it I had to use the 2d 
-> 1d mapping. I will check your solution as well, as I like it 
too.

The more I use the D Language, the more I like it.
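
The 2d -> 1d mapping described above could look roughly like the following sketch. It assumes row-major order (`k = i * n + j`); `cell` and `fillFlat` are hypothetical names, and `cell` is only a stand-in for the real per-element computation (`GETAij` in the original post).

```D
import std.parallelism : parallel;
import std.range : iota;

// Hypothetical stand-in for the per-cell computation.
double cell(size_t i, size_t j)
{
    return cast(double)(i * 10 + j);
}

// Compute all n * n cells in a single parallel loop over a flat index k.
double[] fillFlat(size_t n)
{
    auto a = new double[](n * n);
    foreach (k; parallel(iota(n * n)))
    {
        size_t i = k / n;   // 1d -> 2d: row index
        size_t j = k % n;   // 1d -> 2d: column index
        a[k] = cell(i, j);  // element (i, j) is stored at k = i * n + j
    }
    return a;
}

void main()
{
    auto a = fillFlat(4);
    assert(a.length == 16);
    assert(a[1 * 4 + 2] == 12.0);   // row 1, column 2
    assert(a[3 * 4 + 0] == 30.0);   // row 3, column 0
}
```

A single flat loop gives the scheduler n * n independent iterations to distribute, and the result array is written in place, so no appending is needed.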

On Tuesday, 18 October 2022 at 16:07:22 UTC, Siarhei Siamashka 
wrote:
 On Tuesday, 18 October 2022 at 11:56:30 UTC, Yura wrote:
 ```D
 // Then for each Sphere, i.e. dot[i]
 // I need to do some arithmetics with itself and other dots
 // I have only parallelized the inner loop, i is fixed.
 ```

 It's usually a much better idea to parallelize the outer loop. 
 Even OpenMP tutorials explain this: 
 https://ppc.cs.aalto.fi/ch3/nested/ (check the "collapse it 
 into one loop" suggestion from it).

 ```D
 for (auto j=0;j<Ai.length;j++) {
   A = A ~ Ai[j];
 }
 ```

 This way of appending to an array is very slow and `A ~= 
 Ai[j];` is much faster. And even better would be `A ~= Ai;` 
 instead of the whole loop.

Oct 18 2022
