digitalmars.D.learn - benchmark on binary trees
- Alex (38/38) Dec 04 2015 Hi everybody,
- CraigDillabaugh (4/9) Dec 04 2015 If you have access to LDC or GDC they typically produce
- anonymous (18/33) Dec 04 2015 [...]
- Alex (19/42) Dec 04 2015 Yes, I missed this, sorry. The main part of the question was
- anonymous (6/9) Dec 04 2015 You forgot -inline.
- JR (5/7) Dec 05 2015 [...]
- Alex (32/32) Dec 06 2015 Ok, lets conclude this post for now. Did some comparison runs
- Rikki Cattermole (5/36) Dec 06 2015 Why is TreeNode not final?
- Alex (12/16) Dec 07 2015 This is an interesting hint! Just after adding final the program
- visitor (11/15) Dec 06 2015 Hello, interesting exercise for me to learn about allocators :-)
Hi everybody, this is going to be a learning-by-doing benchmark test post. Found this "game" on the web http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=binarytrees and wanted to experiment myself, so I tried to reimplement some code in D. This: http://dpaste.dzfl.pl/4d8e0ceb867c and this: http://dpaste.dzfl.pl/2fd3841d85ef by reimplementing the C++ version. Besides the questions of improving both of them, which are important but only because I would... maybe... at some point... upload my code. For some DLang advertising effect ;) The learning-effect questions are the following:

1. I reimplemented the C++ version hoping it would be faster. This was not the case. Why? This is a big question, as there could be just some silly mistakes of the kind "do not use this call, use this one, and your program will be twice as fast", up to some mistakes of a general kind like "why were you so foolish, and didn't write the whole code in just one line".

2. I failed trying to implement some kind of parallelism in the second version. I tried something like

    auto depthind = iota(min_depth, max_depth+1, 2);
    foreach(dummy_i, d; taskPool.parallel(depthind))

for the for-loop in the main function, but then the program never ends. Do you have a hint what was wrong?

3. The compilation was done by: dmd -O -release -boundscheck=off [filename.d]. Is there anything else to improve performance significantly?

4. ... corner, so I natively think in terms of the first approach. Can one generally make the statement that for D one of the approaches above will always be faster than the other?
Dec 04 2015
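For reference, a minimal sketch of the kind of parallel depth loop attempted in question 2, written so that it terminates (`checkAtDepth` is a hypothetical stand-in for the benchmark's per-depth work, not the posted code):

```d
import std.array : array;
import std.parallelism : taskPool;
import std.range : iota;
import std.stdio : writefln;

// Hypothetical stand-in for the per-depth tree check of the benchmark:
// node count of a full binary tree of the given depth.
long checkAtDepth(int depth)
{
    return (1L << (depth + 1)) - 1;
}

void main()
{
    enum minDepth = 4, maxDepth = 10;
    auto depths = iota(minDepth, maxDepth + 1, 2).array;
    auto results = new long[depths.length];

    // Each iteration writes only to its own slot, so no locking is
    // needed; the parallel foreach joins all workers before moving on.
    foreach (i, d; taskPool.parallel(depths))
        results[i] = checkAtDepth(d);

    foreach (i, d; depths)
        writefln("depth %s: check = %s", d, results[i]);
}
```

One common cause of a parallel loop "never ending" is a data race or an accidental shared accumulator; indexing a pre-sized array by the loop counter, as above, sidesteps that.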
On Friday, 4 December 2015 at 14:06:26 UTC, Alex wrote:
> Hi everybody, this is going to be a learning by doing benchmark test post. [clip]
> 3. The compilation was done by: dmd -O -release -boundscheck=off [filename.d] Is there anything else to improve performance significantly?

If you have access to LDC or GDC they typically produce significantly faster code.
Dec 04 2015
On 04.12.2015 15:06, Alex wrote:
> [...] it would be faster. This was not the case. Why?

Why did you expect the C++ inspired version to be faster? Just because the original was written in C++? From a quick skim the two versions seem to be pretty much identical, aside from names and struct vs class. Names don't make a difference of course. It would be easier to compare the two versions if you used the same names, though. The differences between structs on the heap and classes are subtle. It's not surprising that they don't have an impact here. If there are substantial differences between the two versions, please point them out.

> 2. I failed trying to implement some kind of parallelism in the second version. I tried something like
>
>     auto depthind = iota(min_depth, max_depth+1, 2);
>     foreach(dummy_i, d; taskPool.parallel(depthind))
>
> for the for-loop in the main function, but then, the program never ends. Do you have a hint what was wrong?

Works for me. Maybe show the exact full program you tried, and tell what compiler version you're using.

> 3. The compilation was done by: dmd -O -release -boundscheck=off [filename.d] Is there anything else to improve performance significantly?

The other compilers, ldc and gdc, usually produce faster code than dmd.

> [...] corner, so I natively think in terms of the first approach. Can one generally make the statement that for D one of the approaches above will always be faster than the other?

Just reiterating what I said re the first question: I don't really see a difference. If you think there is, please point it out. Or if you're not sure, feel free to ask about specific parts.
Dec 04 2015
On Friday, 4 December 2015 at 19:31:22 UTC, anonymous wrote:
> Why did you expect the C++ inspired version to be faster? [...] If there are substantial differences between the two versions, please point them out.

Yes, I missed this, sorry. The main part of the question was probably about the class and struct difference. I thought handling with structs and pointers would be faster than with classes.

> Works for me. Maybe show the exact full program you tried, and tell what compiler version you're using.

Ok, this was strange, but I found the crux. The correct question is: why is the parallel version slower than the sequential one? If you set int n = 14 in the main function, the parallel version is MUCH slower than the sequential one. On my machine, 7x slower. Shouldn't it be the other way round?

> The other compilers, ldc and gdc, usually produce faster code than dmd.

Thanks for the hint! As ldc doesn't have the experimental part of the includes, I compared on the first version. The result was: the program compiled with ldc2, same params, was 5% slower... nothing crucial, I think...

> Just reiterating what I said re the first question: I don't really see a difference. If you think there is, please point it out. Or if you're not sure, feel free to ask about specific parts.

Yeah... so the answer here for me is that I can stay with my way of ...
Dec 04 2015
On 04.12.2015 21:30, Alex wrote:
> Yes, I missed this, sorry. The main part of the question was probably about the class and struct difference. I thought handling with structs and pointers would be faster than with classes.

When you use a struct directly, without going through a pointer, that can be significantly faster than using a class. But structs through pointers are pretty much the same as classes, performance-wise.

> Why is the parallel version slower than the sequential one? If you set int n = 14 in the main function the parallel version is MUCH slower than the sequential one. On my machine, 7x slower. Shouldn't it be the other way round?

I don't know what's going on here. You're allocating a lot of `TreeNode`s, though. That's probably not very parallelizable. The GC could also play a role.

> As ldc doesn't have the experimental part of the includes, I compared on the first version. The result was: the program compiled with ldc2, same params, was 5% slower... nothing crucial, I think...

"The experimental part" is std.experimental.allocator, right? I may be wrong here, but I think the default allocator is essentially just `new`. So that wouldn't give you any performance boost.

> Yeah... so the answer here for me is that I can stay with my way of ...

In this case, I think you're fine. Generally, be aware that D doesn't shine when you create lots of throw-away objects. The GC can't compete there. If you translate code from those languages too closely, performance may be worse.
Dec 04 2015
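The struct-through-pointer vs class point can be made concrete: both handles are a single machine word, so per-node traversal cost is the same. A sketch with illustrative names (compile-time checks only, assuming a typical 64-bit layout):

```d
struct SNode
{
    SNode* left, right; // used through a pointer: one indirection per hop
}

class CNode
{
    CNode left, right;  // a class reference is also just a pointer
}

void main()
{
    // Both handles are pointer-sized on common platforms.
    static assert((SNode*).sizeof == (void*).sizeof);
    static assert(CNode.sizeof == (void*).sizeof); // .sizeof of a class type is the reference size

    // The class *instance* carries extra header words (vtable pointer,
    // monitor slot), but that doesn't change per-hop traversal cost.
    static assert(__traits(classInstanceSize, CNode) >= SNode.sizeof);
}
```

The real difference only appears when a struct is used by value (on the stack or embedded in another aggregate), which removes the indirection entirely.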
On Friday, 4 December 2015 at 23:23:37 UTC, anonymous wrote:
> I don't know what's going on here. You're allocating a lot of `TreeNode`s, though. That's probably not very parallelizable. The GC could also play a role.

Found and tried out the -vgc option... Is there a way to deactivate the GC, if it stands in the way?

> In this case, I think you're fine. Generally, be aware that D doesn't shine when you create lots of throw-away objects. If you translate code from those languages too closely, performance may be worse.

Yes, I thought in the same direction. That's why I tried to reimplement the c++ version. The idea was: as I can't compete directly, I could try to compete by applying another approach. I don't try to write something which competes with c++ either (I would have to take c++, then? ;) ), but something which clearly outperforms the languages with a virtual machine... Tried the -inline option... no time win...
Dec 04 2015
On 05.12.2015 01:40, Alex wrote:
> Found and tried out the -vgc option... Is there a way to deactivate the GC, if it stands in the way?

You can call core.memory.GC.disable to disable automatic collections, and .enable to turn them on again.

> The idea was: as I can't compete directly, I could try to compete by applying another approach.

Your C++ inspired version still allocates via the GC, though. If that eats performance, then you'd have to mirror more closely what the C++ version actually does. It most probably doesn't use a GC. I presume this is the C++ version you took as inspiration: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6 That uses a boost::object_pool for the nodes. Assuming that that's being used for a reason, you could probably get a performance boost by doing something similar. I'm not really familiar with std.experimental.allocator; maybe there's something like object_pool in there. Otherwise, you'd have to implement it yourself.

Generally, I think most of the time you can write a program in D that's as fast as one written in C++. But you may have to give up on some convenience, and the libraries may not be there to support you.
Dec 05 2015
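The GC.disable suggestion looks like this in practice. A minimal sketch: allocations still land on the GC heap, they just aren't collected until collections are re-enabled.

```d
import core.memory : GC;

void main()
{
    GC.disable();            // suspend automatic collection cycles
    scope(exit) GC.enable(); // restore normal behavior when done

    // Allocation-heavy work now runs without collection pauses, at the
    // cost of monotonically growing heap usage; GC.collect() can still
    // be called manually at convenient points.
    auto nodes = new int[1000];
    nodes[] = 1;
}
```

For a benchmark that builds and discards many short-lived trees, this trades memory footprint for throughput.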
On 04.12.2015 15:06, Alex wrote:
> 3. The compilation was done by: dmd -O -release -boundscheck=off [filename.d] Is there anything else to improve performance significantly?

You forgot -inline. By the way, I'm not a fan of using -boundscheck=off as a general optimization flag. It undermines @safe, and as the documentation says, it should be used with caution. http://dlang.org/dmd-windows.html#switch-boundscheck
Dec 04 2015
On Friday, 4 December 2015 at 14:06:26 UTC, Alex wrote:
> [...] hoping it would be faster. This was not the case. Why?
> [...] Is there anything else to improve performance significantly?

Profile. Profile profile profile. Callgrind. Find bottlenecks instead of guessing them.
Dec 05 2015
Ok, let's conclude this post for now. Did some comparison runs with the original C++ code; missed this at the beginning. The results so far are as follows.

Here is the original code in C++: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=gpp&id=6 With modifications to be able to run it on my mac os machine, this results in the code available here: http://pastebin.com/G5cYESdX Compilation done with

    g++ -c -pipe -O3 -fomit-frame-pointer -march=native -fopenmp main.cpp -o main.o && g++ main.o -o main.run -fopenmp -lboost_system

Here you can find the last version of my D code: http://dpaste.dzfl.pl/8c8ab00699b5 Compilation done with

    dmd -release -O -boundscheck=off -inline main.d

Time comparison, just with time ./main, yields

for C++:
real 0m8.255s
user 0m7.342s
sys 0m0.905s

for D:
real 0m35.356s
user 0m35.149s
sys 0m0.154s

so the overall factor is roughly 5. Thanks for commenting, everyone! If anybody has further ideas, all of them would be appreciated :) The original site is not interested in any further languages to be tested, so my experiment ends here for now...
Dec 06 2015
On 06/12/15 9:35 PM, Alex wrote:
> [...]

Why is TreeNode not final? Also, yours does not use threads in any way. If you cannot add multithreading in D, remove it from the C++ version before comparing. They are not equal.
Dec 06 2015
On Sunday, 6 December 2015 at 08:45:10 UTC, Rikki Cattermole wrote:
> Why is TreeNode not final?

This is an interesting hint! Just after adding final, the program takes two seconds less... This is roughly 5%. Do you have other hints of this kind? ;)

> Also, yours does not use threads in any way. If you cannot add multithreading in D, remove it from the C++ version before comparing. They are not equal.

Yes... I'm aware of this discrepancy, and I tried how the time statistics differ with and without the #pragma statement. With the statement it is half a second slower than without. This seems strange to me, and I don't have a clue why it is so. But as this doesn't change very much for the D/C++ comparison, and I think the correct goal is to have a parallel version in D, I left the #pragma statement in.
Dec 07 2015
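Why `final` helps, sketched on a minimal node type (the names loosely mirror the benchmark; `bottomUp` and `check` are illustrative, not the posted code): with no possible subclass, calls to `check` can be devirtualized and inlined instead of going through the vtable on every node visit.

```d
// `final` guarantees no subclass can override check(), so the compiler
// may devirtualize (and then inline) the call at each node visit.
final class TreeNode
{
    TreeNode left, right;

    this(TreeNode l = null, TreeNode r = null)
    {
        left = l;
        right = r;
    }

    int check() const
    {
        // Leaf nodes count as 1; inner nodes add both subtrees.
        if (left is null)
            return 1;
        return 1 + left.check() + right.check();
    }
}

// Build a full binary tree of the given depth.
TreeNode bottomUp(int depth)
{
    if (depth <= 0)
        return new TreeNode;
    return new TreeNode(bottomUp(depth - 1), bottomUp(depth - 1));
}

void main()
{
    assert(bottomUp(2).check() == 7); // a full depth-2 tree has 7 nodes
}
```

Marking individual methods `final` (rather than the whole class) gives the same effect per method.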
On Sunday, 6 December 2015 at 08:35:33 UTC, Alex wrote:
> Thanks for commenting, everyone! If anybody has further ideas, all of them would be appreciated :) [...]

Hello, interesting exercise for me to learn about allocators :-) I managed to parallelize the code, reaching similar performance, in terms of speed, as the non-parallel version: http://dpaste.dzfl.pl/6c3e6edcff59 BUT it consumes insane memory, don't try with an argument of more than 17!!! So either I'm doing something stupid (in the parallel code, I guess) or the FreeList allocator is totally not the right tool... (both?) Some light-shedding would be really appreciated, thanks.
Dec 06 2015
On Sunday, 6 December 2015 at 12:23:52 UTC, visitor wrote:
> Hello, interesting exercise for me to learn about allocators :-)

Nice to know a novice can inspire someone :)

> I managed to parallelize the code, reaching similar performance, in terms of speed, as the non-parallel version: http://dpaste.dzfl.pl/6c3e6edcff59

Cool! This is what I looked for!

> BUT it consumes insane memory, don't try with an argument of more than 17!!! So either I'm doing something stupid (in the parallel code, I guess) or the FreeList allocator is totally not the right tool... (both?)

I assume the allocator itself is something that is not really needed in this case. Maybe there is a more straightforward approach to the problem, even a simpler one than in all the versions on the benchgame site, but I don't see it right now. And with the allocator attempt I had a chance to experiment with the experimental module and to write a very quick copy of a program which I want to have... But your solution is really impressive; it reduces the factor from 5 to 3 immediately :) And I'm going to read your code carefully...
Dec 07 2015
On Monday, 7 December 2015 at 10:55:25 UTC, Alex wrote:
> [...]

I've got more speed improvement with "taskPool.parallel(depthind, 2)" in the parallel foreach loop: the second argument is the work unit size (2, on a quad core, gave the best results for me). Also, using "FreeList!(Mallocator, Tree_node.sizeof)" directly, without wrapping it in an allocatorObject, gives a speed improvement (with changes to the makeTree method). I took inspiration from the C code; they use memory pool management, like anonymous already pointed out for the c++ version, which I think could (must?) be achieved with allocators. To gain speed I think it's a key point: no GC!! The FreeList allocator appears (to me) as a good candidate for this, but as I'm new to this, I'm sure I'm not doing it the right way! I tried the exact same FreeList allocator but backed with the GCAllocator (not the Mallocator used in my code); then memory consumption is very good but of course it's slow! I tried a lot of other allocators, variations on the presented code, but memory management is awful :(
Dec 07 2015
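A sketch of the direct (non-wrapped) FreeList use described above. This only loosely mirrors the posted code: `Tree_node`, `makeTree`, and `check` are illustrative reconstructions, and the thread-locality comment is an assumption about the observed memory blow-up, not a confirmed diagnosis.

```d
import std.experimental.allocator : make;
import std.experimental.allocator.building_blocks.free_list : FreeList;
import std.experimental.allocator.mallocator : Mallocator;

struct Tree_node
{
    Tree_node* left, right;
}

// A free list over Mallocator, specialized to exactly Tree_node.sizeof:
// freed nodes are kept on the list and recycled, bypassing the GC.
// Note: as a module-level non-shared variable this is thread-local, so
// each worker thread keeps its own hoard of free nodes -- one plausible
// source of high memory use in parallel runs.
FreeList!(Mallocator, Tree_node.sizeof) nodeAllocator;

Tree_node* makeTree(int depth)
{
    auto node = nodeAllocator.make!Tree_node; // fields default to null
    if (depth > 0)
    {
        node.left  = makeTree(depth - 1);
        node.right = makeTree(depth - 1);
    }
    return node;
}

int check(const Tree_node* node)
{
    if (node.left is null)
        return 1;
    return 1 + check(node.left) + check(node.right);
}

void main()
{
    assert(check(makeTree(2)) == 7); // a full depth-2 tree has 7 nodes
}
```

Compared with wrapping in allocatorObject, the direct instantiation avoids a virtual dispatch on every allocate/deallocate, which matters in a loop this hot.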
Using the Apache Portable Runtime (APR) like in the C version, http://dpaste.dzfl.pl/6ca8b5ffd6dc, works like a charm: 2.061s on my machine! If the file name is binarytrees.d:

    dmd -w -inline -O -release -I/usr/include/apr-1.0 -L/usr/lib/x86_64-linux-gnu/libapr-1.so -of"binarytrees" "binarytrees.d"

Measured with "time ./binarytrees 20" on the command line:

real 0m2.061s
user 0m8.156s
sys 0m0.181s

The C version gives:

real 0m1.892s
user 0m9.087s
sys 0m0.188s

Not bad!!! :-D Still, I would like to know what I'm doing wrong with the allocator version, besides the fact that the apr tools are doing more sophisticated work than my naive approach :-D ?
Dec 08 2015
The C++ version:

real 0m3.587s
user 0m9.211s
sys 0m7.341s
Dec 08 2015
A version with apr (like in the c version), http://dpaste.dzfl.pl/68c0157225e7, compiled with ldc is indeed a bit faster on average:

real 0m1.999s
user 0m9.810s
sys 0m0.148s

btw the Rust version is even faster than my little bit outdated gcc (4.9). Latest try with allocators: http://dpaste.dzfl.pl/86b9b3c4ad71 swallows huge memory, slow! What am I doing wrong? (a concurrency problem?)
Dec 09 2015
Ok, I have a working version (memory is nice, twice the speed of the non-parallel one): http://dpaste.dzfl.pl/504a652c6c47

real 0m14.427s
user 1m19.347s
sys 0m0.124s

I've got similar performance, without allocators, using malloc and free directly; I had to recursively deallocate, though... Still far from competing with the fastest. Any advice? I guess the FreeList allocator is not the better tool here?
Dec 11 2015