digitalmars.D.learn - Create many objects using threads
- Caslav Sabani (15/15) May 05 2014 Hi,
- Ali Çehreli (17/25) May 05 2014 1) If it has to be a single array, meaning that all of the objects are
- Ali Çehreli (63/78) May 05 2014 Here is an example:
- Caslav Sabani (6/6) May 05 2014 Hi Ali,
- Ali Çehreli (6/10) May 05 2014 The .parallel in the foreach loop makes the body of the loop be executed...
- Kapps (6/22) May 05 2014 I could be wrong here, but I think that the GC actually blocks
- Ali Çehreli (34/38) May 05 2014 I did:
- Caslav Sabani (6/6) May 05 2014 Hi all,
- Ali Çehreli (4/6) May 05 2014 Not at all! That statement can be true only in certain programs. :)
- hardcoremore (6/13) May 06 2014 But what does exactly means that Garbage Collector blocks? What
- Ali Çehreli (10/14) May 06 2014 I know this much: The current GC that comes in D runtime is a
- Kapps (71/110) May 06 2014 Huh, that's a much, much, higher impact than I'd expected.
- Kapps (31/101) May 06 2014 I tried with using an allocator that never releases memory,
Hi,

I have just started to learn D. It's a great language.

I am trying to achieve the following, but I am not sure whether it is possible or whether it should be done at all:

I want to have one array in which I will store about 100000 objects. I want to use 4 threads, where each thread creates 25000 objects and stores them in that array. All 4 threads should work in parallel, because I have a 4-core processor, for example. I do not care in which order the objects are created, and the objects do not need to be aware of one another. I just need them stored in the array.

Can threading help in creating many objects at once?

Note that I am a beginner at working with threads, so any help is welcome :)

Thanks
May 05 2014
On 05/05/2014 10:14 AM, Caslav Sabani wrote:

> I want to have one array where I will store like 100000 objects.
>
> But I want to use 4 threads where each thread will create 25000 objects
> and store them in array above mentioned.

1) If it has to be a single array, meaning that all of the objects are in consecutive memory, you can create the array and give four slices of it to the four tasks. To do that, you can either create a proper D array filled with objects with .init values, or you can allocate any type of memory and create the objects in place there.

2) If it doesn't have to be a single array, you can have the four tasks create four separate arrays. You can then use them as a single range by std.range.chain. This option allows you to have a single array as well.

I would like to give examples of those methods later. Gotta go now... :)

> And all 4 threads should be working in parallel because I have 4 core
> processor for example. I do not care in which order objects are created nor
> objects should be aware of one another. I just need them stored in array.
>
> Can threading help in creating many objects at once?
>
> Note that I am beginner at working with threads so any help is welcome :)
>
> Thanks

I recommend looking at std.parallelism first and then std.concurrency. Here are two chapters that may be helpful:

http://ddili.org/ders/d.en/parallelism.html
http://ddili.org/ders/d.en/concurrency.html

Ali
May 05 2014
On 05/05/2014 10:25 AM, Ali Çehreli wrote:

> On 05/05/2014 10:14 AM, Caslav Sabani wrote:
>
> > I want to have one array where I will store like 100000 objects.
> >
> > But I want to use 4 threads where each thread will create 25000 objects
> > and store them in array above mentioned.
>
> 1) If it has to be a single array, meaning that all of the objects are
> in consecutive memory, you can create the array and give four slices of
> it to the four tasks. To do that, you can either create a proper D array
> filled with objects with .init values; or you can allocate any type of
> memory and create objects in place there.

Here is an example:

    import std.stdio;
    import std.parallelism;
    import std.range;    // iota
    import core.thread;
    import std.conv;

    enum elementCount = 8;
    size_t elementPerThread;

    static this ()
    {
        assert((elementCount % totalCPUs) == 0,
               "Cannot distribute tasks to cores evenly");

        elementPerThread = elementCount / totalCPUs;
    }

    void main()
    {
        auto arr = new int[](elementCount);

        // .parallel runs the loop body on the worker threads of taskPool
        foreach (i; iota(totalCPUs).parallel) {
            const beg = i * elementPerThread;
            const end = beg + elementPerThread;
            arr[beg .. end] = i.to!int;
        }

        thread_joinAll();    // (I don't think this is necessary with std.parallelism)

        writeln(arr);    // [0, 0, 1, 1, 2, 2, 3, 3] when totalCPUs == 4
    }

> 2) If it doesn't have to be a single array, you can have the four tasks
> create four separate arrays. You can then use them as a single range by
> std.range.chain.

That is a lie. :) chain would work, but it would have to know the total number of cores at compile time. Instead, joiner or join can be used:

    import std.stdio;
    import std.parallelism;
    import std.algorithm;    // joiner
    import std.range;        // iota, array, join
    import core.thread;

    enum elementCount = 8;
    size_t elementPerThread;

    static this ()
    {
        assert((elementCount % totalCPUs) == 0,
               "Cannot distribute tasks to cores evenly");

        elementPerThread = elementCount / totalCPUs;
    }

    void main()
    {
        auto arr = new int[][](totalCPUs);

        // Each iteration appends only to its own inner array
        foreach (i; iota(totalCPUs).parallel) {
            foreach (e; 0 .. elementPerThread) {
                arr[i] ~= i;
            }
        }

        thread_joinAll();    // (I don't think this is necessary with std.parallelism)

        writeln(arr);           // [[0, 0], [1, 1], [2, 2], [3, 3]] when totalCPUs == 4

        writeln(arr.joiner);    // [0, 0, 1, 1, 2, 2, 3, 3]

        auto arr2 = arr.joiner.array;
        static assert(is (typeof(arr2) == int[]));
        writeln(arr2);          // [0, 0, 1, 1, 2, 2, 3, 3]

        auto arr3 = arr.join;
        static assert(is (typeof(arr3) == int[]));
        writeln(arr3);          // [0, 0, 1, 1, 2, 2, 3, 3]
    }

> This option allows you to have a single array as well.

arr2 and arr3 above are examples of that.

Ali
May 05 2014
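The examples above fill the array with ints. A minimal sketch of the same single-array idea with class instances, which is closer to the original question, might look like the following. The names `Item` and `createItems` are made up for illustration and do not appear in the thread; the sketch assumes that every loop iteration writes only to its own slot.

```d
import std.parallelism : parallel;
import std.stdio : writeln;

// Hypothetical stand-in for the poster's objects.
class Item
{
    size_t id;
    this(size_t id) { this.id = id; }
}

// Allocates the array once, then lets the default task pool construct
// one object per slot. Each iteration touches only its own slot, so the
// array itself needs no locking.
Item[] createItems(size_t count)
{
    auto items = new Item[](count);
    foreach (i, ref slot; parallel(items))
    {
        slot = new Item(i);
    }
    return items;
}

void main()
{
    auto items = createItems(100_000);
    writeln(items.length, " objects; last id = ", items[$ - 1].id);
}
```

Whether this is any faster than a plain loop depends on how much work each constructor does, so it is worth timing both versions.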
Hi Ali,

Thanks for your reply. But I am struggling to understand, from your example, where the code that creates or spawns a new thread is. How do you create a new thread and fill the array with objects instantiated in that thread?

Thanks
May 05 2014
On 05/05/2014 01:38 PM, Caslav Sabani wrote:

> I am struggling to understand from your example where is the code that
> creates or spawns new thread.

The .parallel in the foreach loop makes the body of the loop be executed in parallel.

> How do you create new thread and fill array with instantiated objects in
> that thread?

It is automatic in that example, but you can create threads explicitly with std.concurrency or core.thread as well.

Ali
May 05 2014
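For readers who want to see the explicit route mentioned above, here is a rough sketch using core.thread directly. `Item`, `makeWorker` and `createWithThreads` are hypothetical names chosen for this illustration. The delegate is built in a helper function so that every thread gets its own copy of the slice and starting index.

```d
import core.thread : Thread;
import std.stdio : writeln;

// Hypothetical stand-in for the poster's objects.
class Item
{
    size_t id;
    this(size_t id) { this.id = id; }
}

// Returns a delegate that fills one slice of the shared array.
void delegate() makeWorker(Item[] slice, size_t firstId)
{
    return () {
        foreach (i, ref slot; slice)
            slot = new Item(firstId + i);
    };
}

// Starts threadCount threads; each one creates perThread objects in
// its own, non-overlapping part of the array.
Item[] createWithThreads(size_t threadCount, size_t perThread)
{
    auto items = new Item[](threadCount * perThread);
    Thread[] workers;

    foreach (t; 0 .. threadCount)
    {
        immutable begin = t * perThread;
        auto worker = new Thread(makeWorker(items[begin .. begin + perThread], begin));
        worker.start();
        workers ~= worker;
    }

    // Wait until every thread has finished before using the array.
    foreach (worker; workers)
        worker.join();

    return items;
}

void main()
{
    auto items = createWithThreads(4, 25_000);
    writeln(items.length, " objects created");
}
```

This does by hand what the parallel foreach does automatically; the slices do not overlap, so the threads never write to the same element.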
On Monday, 5 May 2014 at 17:14:54 UTC, Caslav Sabani wrote:

> Hi, I have just started to learn D. Its a great language. I am trying to
> achieve the following but I am not sure is it possible or should be done
> at all: I want to have one array where I will store like 100000 objects.
> But I want to use 4 threads where each thread will create 25000 objects
> and store them in array above mentioned. And all 4 threads should be
> working in parallel because I have 4 core processor for example. I do not
> care in which order objects are created nor objects should be aware of
> one another. I just need them stored in array. Can threading help in
> creating many objects at once? Note that I am beginner at working with
> threads so any help is welcome :) Thanks

I could be wrong here, but I think that the GC actually blocks when creating objects, and thus multiple threads creating instances would not provide a significant speedup, possibly even a slowdown.

You'd want to benchmark this to be certain it helps.
May 05 2014
On 05/05/2014 02:38 PM, Kapps wrote:

> I think that the GC actually blocks when creating objects, and thus
> multiple threads creating instances would not provide a significant
> speedup, possibly even a slowdown.

Wow! That is the case. :)

> You'd want to benchmark this to be certain it helps.

I did:

    import std.range;
    import std.parallelism;

    class C
    {}

    void foo()
    {
        auto c = new C;
    }

    void main(string[] args)
    {
        enum totalElements = 10_000_000;

        // Any command line argument selects the parallel version
        if (args.length > 1) {
            foreach (i; iota(totalElements).parallel) {
                foo();
            }

        } else {
            foreach (i; iota(totalElements)) {
                foo();
            }
        }
    }

Typical run on my system for "-O -noboundscheck -inline":

    $ time ./deneme parallel

    real    0m4.236s
    user    0m4.325s
    sys     0m9.795s

    $ time ./deneme

    real    0m0.753s
    user    0m0.748s
    sys     0m0.003s

Ali
May 05 2014
Hi all,

Thanks for your reply. So basically, using threads in D to create multiple instances of a class is actually slower.

But what exactly does it mean that the garbage collector blocks? What does it block, and in which way?

Thanks
May 05 2014
On 05/05/2014 04:32 PM, Caslav Sabani wrote:

> So basically using threads in D for creating multiple instances of class
> is actually slower.

Not at all! That statement can be true only in certain programs. :)

Ali
May 05 2014
On Tuesday, 6 May 2014 at 03:26:52 UTC, Ali Çehreli wrote:

> On 05/05/2014 04:32 PM, Caslav Sabani wrote:
>
> > So basically using threads in D for creating multiple instances of
> > class is actually slower.
>
> Not at all! That statement can be true only in certain programs. :)
>
> Ali

But what exactly does it mean that the garbage collector blocks? What does it block, and in which way?

And can I use threads to create multiple instances faster, or is that just not possible?

Thanks
May 06 2014
On 05/06/2014 05:46 AM, hardcoremore wrote:

> But what does exactly means that Garbage Collector blocks? What does it
> blocks and in which way?

I know this much: The current GC that comes in D runtime is a single-threaded GC (aka "a stop-the-world GC"), meaning that all threads are stopped when the GC is running a garbage collection cycle.

> And can I use threads to create multiple instance faster or that is just
> not possible?

My example program, which did nothing but construct objects on the GC heap, cannot be an indicator of the performance of all multi-threaded programs. In real programs there will be computation-intensive parts; there will be parts blocked on I/O; etc. There is no way of knowing without measuring.

Ali
May 06 2014
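One way to experiment with the effect of collection cycles described above is to postpone them during an allocation-heavy phase. The following is only a sketch under stated assumptions: `createBatch` is a hypothetical helper, and `GC.disable` from core.memory only asks the runtime to skip automatic collections (the runtime may still collect if it runs out of memory). It does nothing about any locking inside the allocator itself, which is the effect Kapps suspects.

```d
import core.memory : GC;
import std.stdio : writeln;

class C {}

// Creates count objects while automatic collections are postponed.
C[] createBatch(size_t count)
{
    auto objects = new C[](count);

    GC.disable();              // ask the runtime to skip automatic collections
    scope (exit) GC.enable();  // restore the normal behaviour on the way out

    foreach (ref slot; objects)
        slot = new C;

    return objects;
}

void main()
{
    auto batch = createBatch(1_000_000);
    writeln(batch.length, " objects created");
}
```

Timing this against the plain version on the target machine is the only way to tell whether collection cycles were a significant part of the cost.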
On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:

> On 05/05/2014 02:38 PM, Kapps wrote:
>
> > I think that the GC actually blocks when creating objects, and thus
> > multiple threads creating instances would not provide a significant
> > speedup, possibly even a slowdown.
>
> Wow! That is the case. :)
>
> > You'd want to benchmark this to be certain it helps.
>
> I did:
>
> [snip]

Huh, that's a much, much, higher impact than I'd expected. I tried with GDC as well (the one in Debian stable, which is unfortunately still 2.055...) and got similar results. I also tried creating only totalCPUs threads and having each of them create NUM_ELEMENTS / totalCPUs objects rather than risking that each creation was a task, and it still seems to be the same.

Using malloc and emplace instead of new, results are about 50% faster for single-threaded and ~3-4 times faster for multi-threaded (4 cpu 8 thread machine, Linux 64-bit). The multi-threaded version is still twice as slow though. On my Windows laptop (with the program compiled for 32-bit), it did not make a significant difference and the multi-threaded version is still 4 times slower.

That being said, I think most malloc implementations, while being thread-safe, usually use locks or do not scale well.

Code:

    import std.range;
    import std.parallelism;
    import std.datetime;
    import std.stdio;
    import core.stdc.stdlib;
    import std.conv;

    class C {}

    void foo() {
        //auto c = new C;
        // Construct the object in malloc'ed memory instead of on the GC heap
        enum size = __traits(classInstanceSize, C);
        void[] mem = malloc(size)[0..size];
        emplace!C(mem);
    }

    void createFoos(size_t count) {
        foreach(i; 0 .. count) {
            foo();
        }
    }

    void main(string[] args) {
        StopWatch sw = StopWatch(AutoStart.yes);
        enum totalElements = 10_000_000;
        if (args.length <= 1) {
            // Single-threaded
            foreach (i; iota(totalElements)) {
                foo();
            }
        } else if(args[1] == "tasks") {
            // Parallel foreach over all elements
            foreach (i; parallel(iota(totalElements))) {
                foo();
            }
        } else if(args[1] == "parallel") {
            // One task per CPU, each creating an equal share
            for(int i = 0; i < totalCPUs; i++) {
                taskPool.put(task(&createFoos, totalElements / totalCPUs));
            }
            taskPool.finish(true);
        } else
            writeln("Unknown argument '", args[1], "'.");
        sw.stop();
        writeln(cast(Duration)sw.peek);
    }

Results (Linux 64-bit):

    shardsoft:~$ dmd -O -inline -release test.d
    shardsoft:~$ ./test
    552 ms, 729 μs, and 7 hnsecs
    shardsoft:~$ ./test
    532 ms, 139 μs, and 5 hnsecs
    shardsoft:~$ ./test tasks
    1 sec, 171 ms, 126 μs, and 4 hnsecs
    shardsoft:~$ ./test tasks
    1 sec, 38 ms, 468 μs, and 6 hnsecs
    shardsoft:~$ ./test parallel
    1 sec, 146 ms, 738 μs, and 2 hnsecs
    shardsoft:~$ ./test parallel
    1 sec, 268 ms, 195 μs, and 3 hnsecs
May 06 2014
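The benchmark above never releases the malloc'ed objects, which is fine for a timing test. A sketch of the full lifecycle for a class instance that lives outside the GC heap could look like this. `createUnmanaged` and `destroyUnmanaged` are hypothetical helper names, and the class here carries a single int field purely for illustration.

```d
import core.stdc.stdlib : free, malloc;
import std.conv : emplace;
import std.stdio : writeln;

class C
{
    int value;
    this(int value) { this.value = value; }
}

// Constructs a C in malloc'ed memory; the GC neither allocates nor
// frees this object.
C createUnmanaged(int value)
{
    enum size = __traits(classInstanceSize, C);
    void* raw = malloc(size);
    if (raw is null)
        return null;    // out of memory; the caller must check
    return emplace!C(raw[0 .. size], value);
}

// Runs the destructor (if any) and returns the memory to the C heap.
void destroyUnmanaged(C obj)
{
    if (obj is null)
        return;
    destroy(obj);
    free(cast(void*) obj);
}

void main()
{
    auto c = createUnmanaged(42);
    writeln(c.value);
    destroyUnmanaged(c);
}
```

If such an object held references to GC-allocated data, the memory would also have to be registered with the GC (for example with GC.addRange from core.memory) so that the data is not collected while still in use; the sketch avoids that by storing only an int.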
On Tuesday, 6 May 2014 at 15:56:11 UTC, Kapps wrote:

> On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:
>
> > On 05/05/2014 02:38 PM, Kapps wrote:
> >
> > > I think that the GC actually blocks when creating objects, and thus
> > > multiple threads creating instances would not provide a significant
> > > speedup, possibly even a slowdown.
> >
> > Wow! That is the case. :)
> >
> > > You'd want to benchmark this to be certain it helps.
> >
> > I did:
> >
> > [snip]
>
> Huh, that's a much, much, higher impact than I'd expected. I tried with
> GDC as well (the one in Debian stable, which is unfortunately still
> 2.055...) and got similar results. I also tried creating only totalCPUs
> threads and having each of them create NUM_ELEMENTS / totalCPUs objects
> rather than risking that each creation was a task, and it still seems to
> be the same.
>
> snip

I tried using an allocator that never releases memory, rounds up to a power of 2, and is lock-free. The results are quite a bit better.

    shardsoft:~$ ./test
    1 sec, 47 ms, 474 μs, and 4 hnsecs
    shardsoft:~$ ./test
    1 sec, 43 ms, 588 μs, and 2 hnsecs
    shardsoft:~$ ./test tasks
    692 ms, 769 μs, and 8 hnsecs
    shardsoft:~$ ./test tasks
    692 ms, 686 μs, and 8 hnsecs
    shardsoft:~$ ./test parallel
    691 ms, 856 μs, and 9 hnsecs
    shardsoft:~$ ./test parallel
    690 ms, 22 μs, and 3 hnsecs

I get similar results on my laptop (which is much faster than the results I got on it using DMD's malloc):

    test
    1 sec, 125 ms, and 847 μs
    test
    1 sec, 125 ms, 741 μs, and 6 hnsecs
    test tasks
    556 ms, 613 μs, and 8 hnsecs
    test tasks
    552 ms and 287 μs
    test parallel
    554 ms, 542 μs, and 6 hnsecs
    test parallel
    551 ms, 514 μs, and 9 hnsecs

Code: http://pastie.org/9146326

Unfortunately it doesn't compile with the ancient version of gdc available in Debian, so I couldn't test with that. The results should be quite a bit better since core.atomic would be faster. And frankly, I'm not sure if the allocator actually works properly, but it's just for testing purposes anyways.
May 06 2014
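The allocator used for those numbers is only available through the pastie link, so the following is not Kapps' code. It is a minimal sketch of what an allocator with the three stated properties (never releases memory, rounds sizes up to a power of 2, lock-free) might look like, built on `atomicOp` from core.atomic. All names (`roundUpPow2`, `arenaInit`, `arenaAlloc`) are made up for this illustration, and requested sizes are assumed to be far below size_t.max / 2.

```d
import core.atomic : atomicOp;
import core.stdc.stdlib : malloc;

// Smallest power of two that is >= n (1 for n == 0).
// Assumes n is far below size_t.max / 2.
size_t roundUpPow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

__gshared ubyte* arenaBase;      // start of the one big block
__gshared size_t arenaCapacity;  // size of that block in bytes
shared size_t arenaUsed;         // bytes handed out so far

// Reserves the backing memory once, before any threads allocate.
bool arenaInit(size_t capacity)
{
    arenaBase = cast(ubyte*) malloc(capacity);
    arenaCapacity = arenaBase is null ? 0 : capacity;
    return arenaBase !is null;
}

// Claims a region with a single atomic add; no lock is taken and
// nothing is ever given back. Returns null when the arena is full.
void[] arenaAlloc(size_t size)
{
    immutable step = roundUpPow2(size);
    immutable end = atomicOp!"+="(arenaUsed, step);
    if (end > arenaCapacity)
        return null;
    immutable begin = end - step;
    return arenaBase[begin .. begin + size];
}

void main()
{
    if (!arenaInit(1024 * 1024))
        return;
    auto mem = arenaAlloc(24);   // occupies 32 bytes of the arena
    assert(mem.length == 24);
}
```

Because every allocation is one atomic addition on a shared counter, threads never wait on a lock, although they still contend for the cache line that holds the counter. A real allocator would also have to think about alignment for mixed sizes and about what happens once the arena is exhausted.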