digitalmars.D.learn - No of threads
- Vino (Dec 19 2017)
- codephantom (Dec 19 2017)
- rikki cattermole (Dec 19 2017)
- Vino (Dec 19 2017)
- Ali Çehreli (Dec 19 2017)
- Vino (Dec 20 2017)
- Temtaime (Dec 20 2017)
- Ali Çehreli (Dec 20 2017)
- Vino (Dec 21 2017)
- codephantom (Dec 20 2017)
- Vino (Dec 21 2017)
Hi All,

Request your help in clarifying the below. As per the documentation:

foreach (d; taskPool.parallel(xxx)) : the total number of threads that will be created is total CPUs - 1 (2 processors with 6 cores each: 11 threads).

foreach (d; taskPool.parallel(xxx, 1)) : the total number of threads that will be created is total CPUs - 1 (2 processors with 6 cores each: 12 threads).

So if I increase the parallel value to any number, what is the number of threads that will be created?

foreach (d; taskPool.parallel(xxx, 20)) : On Windows 2008, whatever value is set for parallel, the total number of threads does not increase beyond 12. I am not sure whether this is correct; can anyone explain?

From,
Vino.B
Dec 19 2017
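The counts asked about above can be checked directly: std.parallelism exposes both the logical CPU count and the default pool's worker count. A minimal sketch (the printed values depend on the machine):

```d
import std.parallelism : defaultPoolThreads, totalCPUs;
import std.stdio : writeln;

void main()
{
    // The default taskPool is created lazily with totalCPUs - 1 worker
    // threads; the thread running a parallel foreach also executes loop
    // bodies, so up to totalCPUs threads touch elements in total.
    writeln("logical CPUs:      ", totalCPUs);
    writeln("default pool size: ", defaultPoolThreads);
}
```

Note that defaultPoolThreads can also be assigned to before the default pool is first used, which changes how many workers taskPool gets.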
On Tuesday, 19 December 2017 at 10:24:47 UTC, Vino wrote:
> foreach (d; taskPool.parallel(xxx, 20)) : As in Windows 2008, whatever value is set for parallel, the total number of threads does not increase more than 12. So not sure if this is correct; can anyone explain?

Something to do with your cacheLineSize, perhaps?
Dec 19 2017
On 19/12/2017 11:03 AM, codephantom wrote:
> On Tuesday, 19 December 2017 at 10:24:47 UTC, Vino wrote:
>> foreach (d; taskPool.parallel(xxx, 20)) : As in Windows 2008, whatever value is set for parallel, the total number of threads does not increase more than 12.
>
> Something to do with your cacheLineSize, perhaps?

The size of the cache line should be 64 bytes on pretty much all 32/64-bit x86 CPUs. My suspicion is that TaskPool is limiting itself on purpose (based on the code I read).
Dec 19 2017
On Tuesday, 19 December 2017 at 11:03:27 UTC, codephantom wrote:
> Something to do with your cacheLineSize, perhaps?

There are other processes running on the same server which use 200+ threads, which means the server is capable of running more than 200 threads, so I suspect it is something to do with TaskPool.

From,
Vino.B
Dec 19 2017
On 12/19/2017 02:24 AM, Vino wrote:
> foreach (d; taskPool.parallel(xxx)) : The total number of threads that will be created is total CPUs - 1 (2 processors with 6 cores: 11 threads)
>
> foreach (d; taskPool.parallel(xxx, 1)) : The total number of threads that will be created is total CPUs - 1 (2 processors with 6 cores: 12 threads)

That parameter is workUnitSize, meaning the number of elements each thread will process per work unit. So, when you set it to 100, each thread works on 100 elements before it goes back to pick up more elements. Experiment with different values to find out which is fastest for your workload. If each element takes a very short amount of time to work on, you need larger values, because you don't want to stop a happy thread that's chugging along on elements. It really depends on each program, so try different values.

> foreach (d; taskPool.parallel(xxx, 20)) : As in Windows 2008, whatever value is set for parallel, the total number of threads does not increase more than 12.

taskPool is just for convenience. You need to create your own TaskPool if you want more threads:

    import std.parallelism;
    import core.thread;
    import std.range;

    void main() {
        auto t = new TaskPool(20);

        foreach (d; t.parallel(100.iota)) {
            // ...
        }

        Thread.sleep(5.seconds);

        t.finish();
    }

Now there are 20 + 1 (main) threads.

Ali
Dec 19 2017
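The "try different values" advice above can be turned into a quick experiment. A sketch (the element count and the candidate sizes are arbitrary, not from this thread):

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writefln;

void main()
{
    // Time the same parallel loop with several work unit sizes; the
    // fastest one is the right setting for this particular workload.
    foreach (workUnitSize; [1, 16, 100, 1_000])
    {
        auto sw = StopWatch(AutoStart.yes);
        foreach (i; taskPool.parallel(iota(1_000_000), workUnitSize))
        {
            // stand-in for real per-element work
        }
        writefln("workUnitSize %6s took %s", workUnitSize, sw.peek);
    }
}
```

For cheap loop bodies like this one, larger sizes usually win because the threads spend less time synchronizing on the shared work queue.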
On Tuesday, 19 December 2017 at 18:42:01 UTC, Ali Çehreli wrote:
> taskPool is just for convenience. You need to create your own TaskPool if you want more threads: [...]
>
> Now there are 20 + 1 (main) threads.

Hi Ali,

Thank you very much. Below are the observations. Our program calculates the size of folders, and we don't see any improvement in execution speed between the two tests below; are we missing something? Basically, we expected the total execution time for Test 2 to be the time taken to calculate the size of the biggest folder plus a few additional minutes; the biggest folder is 604 GB. Memory usage is just 12 MB, whereas the server has 65 GB and hardly 30%-40% is used at any given point in time, so there is no memory constraint.

Test 1:

    foreach (d; taskPool.parallel(dFiles[], 1))
        auto SdFiles = Array!ulong(dirEntries(d, SpanMode.depth)
            .map!(a => a.size).fold!((a, b) => a + b)(x))[]
            .filter!(a => a > Size);

Execution time is 26 mins with 11 + 1 (main) threads and 1 element per work unit.

Test 2:

    auto TL = dFiles.length;
    auto TP = new TaskPool(TL);
    foreach (d; TP.parallel(dFiles[], 1))
        auto SdFiles = Array!ulong(dirEntries(d, SpanMode.depth)
            .map!(a => a.size).fold!((a, b) => a + b)(x))[]
            .filter!(a => a > Size);
    Thread.sleep(5.seconds);
    TP.finish();

Execution time is 27 mins with 153 + 1 (main) threads and 1 element per work unit.

From,
Vino.B
Dec 20 2017
On Wednesday, 20 December 2017 at 13:41:06 UTC, Vino wrote:
> Our program is used to calculate the size of folders, and we don't see any improvements in the execution speed from the tests below; are we missing something? [...]

GC collection stops the world, so there's no gain.
Dec 20 2017
On 12/20/2017 05:41 AM, Vino wrote:
> auto TL = dFiles.length;
> auto TP = new TaskPool(TL);

I assume dFiles is large. So, that's a lot of threads there.

> foreach (d; TP.parallel(dFiles[], 1))

You tried with larger work unit sizes, right? More importantly, I think all these threads are working on the same disk. If the access is serialized by the OS or a lower entity, then all threads necessarily wait for each other, making the whole exercise serial.

> Thread.sleep(5.seconds);

You don't need that at all. I had left it in there just to give me a chance to examine the number of threads the program was using.

Ali
Dec 20 2017
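The serialization point above can be tested directly by timing the same folder-size computation serially and in parallel; if the storage serializes access, the two timings come out close. A sketch (folderSize and the command-line handling are illustrative, not the code from this thread):

```d
import std.algorithm : filter, map, sum;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.file : SpanMode, dirEntries;
import std.parallelism : taskPool;
import std.stdio : writeln;

// Total size of the regular files below dir.
ulong folderSize(string dir)
{
    return dirEntries(dir, SpanMode.depth)
        .filter!(e => e.isFile)
        .map!(e => e.size)
        .sum;
}

void main(string[] args)
{
    auto dirs = args[1 .. $];  // folders passed on the command line

    auto sw = StopWatch(AutoStart.yes);
    ulong serialTotal = 0;
    foreach (d; dirs)
        serialTotal += folderSize(d);
    writeln("serial:   ", sw.peek);

    sw.reset();
    // amap spreads the folders over the pool's worker threads; if the
    // NFS mount serializes the metadata reads, this will not be much
    // faster than the serial loop above.
    auto parallelTotal = taskPool.amap!folderSize(dirs).sum;
    writeln("parallel: ", sw.peek);

    assert(serialTotal == parallelTotal);
}
```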
On Wednesday, 20 December 2017 at 17:31:20 UTC, Ali Çehreli wrote:
> I think all these threads are working on the same disk. If the access is serialized by the OS or a lower entity, then all threads necessarily wait for each other, making the whole exercise serial.

Hi Ali,

Below are the answers. The file system that is scanned to find the folder sizes is a NetApp file system mapped on Windows 2008. The file system is exported using NFS v3, so you are right that the disk access is serialized. The folders come from 2 NetApp file systems; file system 1 has 76 folders and file system 2 has 77 folders.

> You don't need that at all. I had left it in there just to give me a chance to examine the number of threads the program was using.

We have not updated our main code yet; it was a test that we performed on a test server.

From,
Vino.B
Dec 21 2017
On Wednesday, 20 December 2017 at 13:41:06 UTC, Vino wrote:
> Our program is used to calculate the size of the folders, and we don't see any improvements in the execution speed from the tests below; are we missing something? [...]

Are you running this over the network, or on (each) server that contains the actual folders?
Dec 20 2017
On Thursday, 21 December 2017 at 00:32:50 UTC, codephantom wrote:
> Are you running this over the network, or on (each) server that contains the actual folders?

Hi,

Yes, the file system used is a NetApp file system mapped on a Windows server.

From,
Vino.B
Dec 21 2017