digitalmars.D - Perlin noise benchmark speed
- Nick Treleaven (13/13) Jun 20 2014 Hi,
- Nick Treleaven (4/6) Jun 20 2014 Also, it does appear to be using the correct compiler flags (at least
- David Nadlinger (8/15) Jun 20 2014 -release is missing, although that probably isn't playing a big
- MrSmith (2/18) Jun 20 2014 struct can be used instead of class
- Robert Schadek via Digitalmars-d (2/17) Jun 20 2014 I converted Noise2DContext into a struct, I gone add some more to my pat...
- Robert Schadek via Digitalmars-d (2/8) Jun 20 2014 I added some final pure @safe stuff
- David Nadlinger (8/9) Jun 20 2014 Thanks. As a general comment, I'd be careful with suggesting the
- dennis luehring (3/16) Jun 20 2014 write, printf etc. performance is benchmarked also - so not clear
- dennis luehring (9/32) Jun 20 2014 using perf with 10 is maybe too small to give good avarge result infos
- Mattcoder (6/9) Jun 20 2014 Indeed and using Windows (At least 8), the size of command-window
- David Nadlinger (14/22) Jun 20 2014 Before I wrote the above, I briefly ran the benchmark on my local
- Ary Borenszweig (4/18) Jun 20 2014 I just tried it with ldc and it's faster (faster than Go, slower than
- bearophile (8/10) Jun 20 2014 This should be compiled with LDC2, it's more idiomatic and a
- bearophile (3/4) Jun 20 2014 Sorry for the awful tabs.
- bearophile (7/7) Jun 20 2014 So this is the best so far version:
- Mattcoder (10/12) Jun 20 2014 Just one note, with the last version of DMD:
- Mattcoder (8/21) Jun 20 2014 Sorry, I forgot this:
- bearophile (5/9) Jun 20 2014 If you remove the calls to floor, you are avoiding the main
- bearophile (5/6) Jun 20 2014 Yes, I know, at the top of the file I have specified it's for
- bearophile (6/6) Jun 20 2014 If I add this import in Noise2DContext.getGradients the run-time
- whassup (2/8) Jun 20 2014
- JR (6/12) Jun 20 2014 Was just about to post that if I cheat and replace usage of
- dennis luehring (4/15) Jun 20 2014 it does not makes sense to "optmized" this example more and more - it
- Mattcoder (4/7) Jun 20 2014 Oh please, let him continue, I'm really learning a lot with these
- bearophile (11/13) Jun 20 2014 But the original code is not fast. So someone has to find what's
- dennis luehring (5/18) Jun 20 2014 as long as you find out its a library thing
- David Nadlinger (9/37) Jun 21 2014 bearophile's work is very valuable regardless of what the cause
- bearophile (4/5) Jun 20 2014 And a simple benchmark for D ranges/parallelism:
- bearophile (5/6) Jun 20 2014 And a simple benchmark for D ranges/parallelism:
- weaselcat (20/35) Mar 23 2015 I saw this thread when searching for something on the site, been
- Iain Buclaw via Digitalmars-d (6/45) Mar 23 2015 I'd suspect stdc.math to be SSE3/SSE4 optimised assembly, where as std.m...
Hi,

A Perlin noise benchmark was quoted in this reddit thread:
http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr

It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:
https://github.com/nsf/pnoise#readme

I initially wondered about std.random, but got this response:

"Yeah, but std.random is not used in that benchmark, it just initializes 256 random vectors and permutates 256 sequential integers. What spins in a loop is just plain FP math and array read/writes. I'm sure it can be done faster, maybe D compilers are bad at automatic inlining or something."

Obviously this is only one person's benchmark, but I wondered if people would like to check their code and suggest reasons for the speed deficit.
Jun 20 2014
On 20/06/2014 13:32, Nick Treleaven wrote:
> It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:

Also, it does appear to be using the correct compiler flags (at least for dmd):
https://github.com/nsf/pnoise/blob/master/compile.bash
Jun 20 2014
On Friday, 20 June 2014 at 12:34:55 UTC, Nick Treleaven wrote:
> On 20/06/2014 13:32, Nick Treleaven wrote:
>> It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:
> Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash

-release is missing, although that probably isn't playing a big role here. Another minor issue is that Noise2DContext isn't final, making the calls to get virtual. This shouldn't cause such a big difference, though. Hopefully somebody can investigate this more closely.

David
Jun 20 2014
On Friday, 20 June 2014 at 12:56:46 UTC, David Nadlinger wrote:
> On Friday, 20 June 2014 at 12:34:55 UTC, Nick Treleaven wrote:
>> On 20/06/2014 13:32, Nick Treleaven wrote:
>>> It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:
>> Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash
> -release is missing, although that probably isn't playing a big role here. Another minor issue is that Noise2DContext isn't final, making the calls to get virtual. This shouldn't cause such a big difference, though. Hopefully somebody can investigate this more closely. David

A struct can be used instead of a class.
Jun 20 2014
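To make the two suggestions above concrete (marking the method final, or using a struct instead of a class), here is a minimal sketch. The types CtxClass, CtxFinal and CtxStruct and their placeholder get bodies are hypothetical stand-ins, not the benchmark's actual Noise2DContext. In D, class methods are virtual by default, while final methods that override nothing and struct methods are not; the static asserts check this at compile time.

```d
// Hypothetical stand-ins for Noise2DContext; the bodies are placeholders.
class CtxClass
{
    // Class methods are virtual by default in D.
    float get(float x, float y) { return x + y; }
}

class CtxFinal
{
    // final and overriding nothing: the call does not need virtual dispatch.
    final float get(float x, float y) { return x + y; }
}

struct CtxStruct
{
    // Struct methods are never virtual.
    float get(float x, float y) { return x + y; }
}

static assert( __traits(isVirtualMethod, CtxClass.get));
static assert(!__traits(isVirtualMethod, CtxFinal.get));
static assert(!__traits(isVirtualMethod, CtxStruct.get));

void main()
{
    auto c = new CtxClass;
    auto f = new CtxFinal;
    CtxStruct s;
    // All three compute the same value; only the dispatch differs.
    assert(c.get(1, 2) == 3 && f.get(1, 2) == 3 && s.get(1, 2) == 3);
}
```

Whether this matters for the benchmark's timing is a separate question; the sketch only shows what the language guarantees about dispatch.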
On 06/20/2014 02:56 PM, David Nadlinger via Digitalmars-d wrote:
> On Friday, 20 June 2014 at 12:34:55 UTC, Nick Treleaven wrote:
>> On 20/06/2014 13:32, Nick Treleaven wrote:
>>> It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:
>> Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash
> -release is missing, although that probably isn't playing a big role here. Another minor issue is that Noise2DContext isn't final, making the calls to get virtual. This shouldn't cause such a big difference, though. Hopefully somebody can investigate this more closely. David

I converted Noise2DContext into a struct; I'm going to add some more to my patch.
Jun 20 2014
On 06/20/2014 02:34 PM, Nick Treleaven via Digitalmars-d wrote:
> On 20/06/2014 13:32, Nick Treleaven wrote:
>> It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:
> Also, it does appear to be using the correct compiler flags (at least for dmd): https://github.com/nsf/pnoise/blob/master/compile.bash

I added some final pure @safe stuff
Jun 20 2014
On Friday, 20 June 2014 at 13:20:16 UTC, Robert Schadek via Digitalmars-d wrote:
> I added some final pure @safe stuff

Thanks. As a general comment, I'd be careful with suggesting the use of pure/@safe/… for performance improvements in microbenchmarks. While it is certainly good D style to use them wherever possible, it might lead people less familiar with D to believe that fast D code needs a lot of annotations.

David
Jun 20 2014
On 20.06.2014 14:32, Nick Treleaven wrote:
> Hi,
> A Perlin noise benchmark was quoted in this reddit thread:
> http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr
> It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:
> https://github.com/nsf/pnoise#readme
> I initially wondered about std.random, but got this response:
> "Yeah, but std.random is not used in that benchmark, it just initializes 256 random vectors and permutates 256 sequential integers. What spins in a loop is just plain FP math and array read/writes. I'm sure it can be done faster, maybe D compilers are bad at automatic inlining or something."
> Obviously this is only one person's benchmark, but I wondered if people would like to check their code and suggest reasons for the speed deficit.

The performance of write, printf, etc. is benchmarked as well, so it is not clear whether pnoise is super-fast but write is super-slow, etc.
Jun 20 2014
On 20.06.2014 15:14, dennis luehring wrote:
> On 20.06.2014 14:32, Nick Treleaven wrote:
>> Hi,
>> A Perlin noise benchmark was quoted in this reddit thread:
>> http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr
>> It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:
>> https://github.com/nsf/pnoise#readme
>> I initially wondered about std.random, but got this response:
>> "Yeah, but std.random is not used in that benchmark, it just initializes 256 random vectors and permutates 256 sequential integers. What spins in a loop is just plain FP math and array read/writes. I'm sure it can be done faster, maybe D compilers are bad at automatic inlining or something."
>> Obviously this is only one person's benchmark, but I wondered if people would like to check their code and suggest reasons for the speed deficit.
> The performance of write, printf, etc. is benchmarked as well, so it is not clear whether pnoise is super-fast but write is super-slow, etc.

Using perf with 10 is maybe too small a sample to give good average results, and runtime startup etc. is also measured, so it is not clear what is slower.

These benchmarks should be separated into 3 parts:
- runtime startup
- pure pnoise
- result output (needed only once for verification; returning dummy output would fit better to test the pnoise speed)

Are array bounds checks active?
Jun 20 2014
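The separation suggested above (timing the computation apart from the output) could be sketched as follows. This is only an illustration under assumptions: compute is a hypothetical placeholder for the noise kernel, and the timing uses Phobos' std.datetime.stopwatch module from current D releases, not anything taken from the benchmark's source.

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

// Hypothetical placeholder for the noise kernel: some FP work in a loop.
float compute(size_t n)
{
    float acc = 0;
    foreach (i; 0 .. n)
        acc += (i % 7) * 0.5f;
    return acc;
}

void main()
{
    // Phase 1: time only the computation.
    auto sw = StopWatch(AutoStart.yes);
    immutable result = compute(1_000_000);
    sw.stop();
    immutable computeTime = sw.peek;

    // Phase 2: time only the output of the result.
    sw.reset();
    sw.start();
    writefln("result: %s", result);
    sw.stop();

    writefln("compute: %s, output: %s", computeTime, sw.peek);
}
```

Runtime startup is not covered by this sketch; it would have to be measured from outside the process.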
On Friday, 20 June 2014 at 13:14:04 UTC, dennis luehring wrote:
> The performance of write, printf, etc. is benchmarked as well, so it is not clear whether pnoise is super-fast but write is super-slow, etc.

Indeed, and on Windows (at least 8), the size of the command window (CMD) affects the result drastically. For example, running this test with the console maximized takes 2.58s, while the same test in a small window takes 2.11s!

Matheus.
Jun 20 2014
On Friday, 20 June 2014 at 13:46:26 UTC, Mattcoder wrote:
> On Friday, 20 June 2014 at 13:14:04 UTC, dennis luehring wrote:
>> The performance of write, printf, etc. is benchmarked as well, so it is not clear whether pnoise is super-fast but write is super-slow, etc.
> Indeed, and on Windows (at least 8), the size of the command window (CMD) affects the result drastically. For example, running this test with the console maximized takes 2.58s, while the same test in a small window takes 2.11s!

Before I wrote the above, I briefly ran the benchmark on my local (OS X) machine, and verified that the bulk of the time is indeed spent in the noise calculation loop (with stdout piped into /dev/null). Still, the LDC-compiled code is only about half as fast as the Clang-compiled version, and there is no good reason why it should be.

My new guess is a difference in inlining heuristics (note also that the Rust version uses inlining hints). The big difference between GCC and Clang might be a hint that the performance drop is caused by a rather minute difference in optimizer tuning.

Thus, we really need somebody to sit down with a profiler/disassembler and figure out what is going on.

David
Jun 20 2014
On 6/20/14, 9:32 AM, Nick Treleaven wrote:
> Hi,
> A Perlin noise benchmark was quoted in this reddit thread:
> http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr
> It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:
> https://github.com/nsf/pnoise#readme
> I initially wondered about std.random, but got this response:
> "Yeah, but std.random is not used in that benchmark, it just initializes 256 random vectors and permutates 256 sequential integers. What spins in a loop is just plain FP math and array read/writes. I'm sure it can be done faster, maybe D compilers are bad at automatic inlining or something."
> Obviously this is only one person's benchmark, but I wondered if people would like to check their code and suggest reasons for the speed deficit.

I just tried it with ldc and it's faster (faster than Go, slower than Nimrod). But this is still slower than other languages. And other languages keep the array bounds check on...
Jun 20 2014
Nick Treleaven:
> A Perlin noise benchmark was quoted in this reddit thread:
> http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr

This should be compiled with LDC2, it's more idiomatic and a little faster than the original D version:
http://dpaste.dzfl.pl/8d2ff04b62d3

I have already seen that if I inline Noise2DContext.get in the main manually the program gets faster (but not yet fast enough).

Bye,
bearophile
Jun 20 2014
> http://dpaste.dzfl.pl/8d2ff04b62d3

Sorry for the awful tabs.

Bye,
bearophile
Jun 20 2014
So this is the best version so far:
http://dpaste.dzfl.pl/8dae9b359f27

I don't show the version with the manually inlined function.

(I have also seen that GCC generates slightly faster code on my CPU if I don't use SSE registers.)

Bye,
bearophile
Jun 20 2014
On Friday, 20 June 2014 at 16:02:56 UTC, bearophile wrote:
> So this is the best version so far:
> http://dpaste.dzfl.pl/8dae9b359f27

Just one note, with the latest version of DMD:

    dmd -O -noboundscheck -inline -release pnoise.d
    pnoise.d(42): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'
    pnoise.d(43): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'

Matheus.
Jun 20 2014
On Friday, 20 June 2014 at 18:29:35 UTC, Mattcoder wrote:
> On Friday, 20 June 2014 at 16:02:56 UTC, bearophile wrote:
>> So this is the best version so far:
>> http://dpaste.dzfl.pl/8dae9b359f27
> Just one note, with the latest version of DMD:
>
>     dmd -O -noboundscheck -inline -release pnoise.d
>     pnoise.d(42): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'
>     pnoise.d(43): Error: pure function 'pnoise.Noise2DContext.getGradients' cannot call impure function 'core.stdc.math.floor'
>
> Matheus.

Sorry, I forgot this: besides the error above, which for now I'm working around with:

    immutable float x0f = cast(int)x; //x.floor;
    immutable float y0f = cast(int)y; //y.floor;

just to compile, your version here is twice as fast as the original one.

Matheus.
Jun 20 2014
Mattcoder:
> Besides the error above, which for now I'm working around with:
>
>     immutable float x0f = cast(int)x; //x.floor;
>     immutable float y0f = cast(int)y; //y.floor;
>
> just to compile,

If you remove the calls to floor, you are avoiding the main problem to fix.

Bye,
bearophile
Jun 20 2014
Mattcoder:
> Just one note, with the latest version of DMD:

Yes, I know; at the top of the file I have specified that it's for ldc2.

Bye,
bearophile
Jun 20 2014
If I add this import in Noise2DContext.getGradients the run-time decreases a lot (I am now just two times slower than gcc with -Ofast):

    import core.stdc.math: floor;

Bye,
bearophile
Jun 20 2014
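For readers unfamiliar with function-local imports, here is a small sketch of what the import above changes. The function names floorPhobos and floorC are hypothetical; the sketch only shows which floor the unqualified call resolves to, and makes no claim about the speed of either.

```d
import std.math; // module-level import: makes std.math.floor visible

// Calls the Phobos implementation through its fully qualified name.
float floorPhobos(float x)
{
    return std.math.floor(x);
}

float floorC(float x)
{
    // The selective local import declares `floor` in this scope, so the
    // unqualified call below resolves to the C library's double floor(double).
    import core.stdc.math : floor;
    return cast(float) floor(x);
}

void main()
{
    // Both implementations agree on the result; only the callee differs.
    foreach (x; [2.5f, -2.5f, 0.0f, -3.0f])
        assert(floorPhobos(x) == floorC(x));
}
```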
GO BEAROPHILE YOU CAN DO IT

On Friday, 20 June 2014 at 15:24:38 UTC, bearophile wrote:
> If I add this import in Noise2DContext.getGradients the run-time decreases a lot (I am now just two times slower than gcc with -Ofast):
>
>     import core.stdc.math: floor;
>
> Bye,
> bearophile
Jun 20 2014
On Friday, 20 June 2014 at 15:24:38 UTC, bearophile wrote:
> If I add this import in Noise2DContext.getGradients the run-time decreases a lot (I am now just two times slower than gcc with -Ofast):
>
>     import core.stdc.math: floor;
>
> Bye,
> bearophile

Was just about to post that if I cheat and replace usage of floor(x) with cast(float)cast(int)x, ldc2 is almost down to gcc speeds (119.6ms average over 100 full executions vs gcc 102.7ms).

It stood out in the callgraph. Because profiling before optimizing.
Jun 20 2014
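A short sketch of why the cast counts as cheating: cast(int) truncates toward zero, while floor rounds toward negative infinity, so the two differ for negative non-integers. The helpers truncCast and fastFloor are hypothetical names, and fastFloor is one possible correction that assumes the input fits in an int; whether the benchmark ever feeds negative values to floor is not stated in the thread.

```d
import core.stdc.math : floorf;

// The "cheat": truncates toward zero.
float truncCast(float x)
{
    return cast(float) cast(int) x;
}

// Hypothetical corrected variant; assumes x fits in an int.
float fastFloor(float x)
{
    immutable t = cast(float) cast(int) x;
    // Truncation rounded a negative non-integer up, so step back down.
    return (t > x) ? t - 1.0f : t;
}

void main()
{
    // Truncation and floor disagree for negative non-integers.
    assert(truncCast(-2.7f) == -2.0f);
    assert(floorf(-2.7f) == -3.0f);

    // The corrected variant matches floorf on these samples.
    foreach (x; [2.7f, -2.7f, -3.0f, 0.0f])
        assert(fastFloor(x) == floorf(x));
}
```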
On 20.06.2014 17:09, bearophile wrote:
> Nick Treleaven:
>> A Perlin noise benchmark was quoted in this reddit thread:
>> http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr
> This should be compiled with LDC2, it's more idiomatic and a little faster than the original D version:
> http://dpaste.dzfl.pl/8d2ff04b62d3
> I have already seen that if I inline Noise2DContext.get in the main manually the program gets faster (but not yet fast enough).
> Bye,
> bearophile

It does not make sense to "optimize" this example more and more; it should be fast with the original version (except for the missing finals on the virtuals).
Jun 20 2014
On Friday, 20 June 2014 at 18:32:22 UTC, dennis luehring wrote:
> It does not make sense to "optimize" this example more and more; it should be fast with the original version (except for the missing finals on the virtuals).

Oh please, let him continue, I'm really learning a lot with these optimizations.

Matheus.
Jun 20 2014
dennis luehring:
> It does not make sense to "optimize" this example more and more; it should be fast with the original version

But the original code is not fast. So someone has to find what's broken. I have shown one of the broken parts to fix (floor on ldc2). Also, the original code is not written in a fully idiomatic way, partly because unfortunately today the "lazy" way to write D code is not always the best/right way (example: you have to add a ton of immutable/const and annotations, because immutability is not the default), so a code fix is good.

Bye,
bearophile
Jun 20 2014
On 20.06.2014 22:44, bearophile wrote:
> dennis luehring:
>> It does not make sense to "optimize" this example more and more; it should be fast with the original version
> But the original code is not fast. So someone has to find what's broken. I have shown one of the broken parts to fix (floor on ldc2). Also, the original code is not written in a fully idiomatic way, partly because unfortunately today the "lazy" way to write D code is not always the best/right way (example: you have to add a ton of immutable/const and annotations, because immutability is not the default), so a code fix is good.
> Bye,
> bearophile

As long as you find out it's a library thing.

The C version is the fastest without any annotations or immutable/const, so what's the problem with D here? It can't (shouldn't) be that one needs to work on or change such simple code that much to reach C speed.
Jun 20 2014
On Saturday, 21 June 2014 at 05:00:25 UTC, dennis luehring wrote:
> On 20.06.2014 22:44, bearophile wrote:
>> dennis luehring:
>>> It does not make sense to "optimize" this example more and more; it should be fast with the original version
>> But the original code is not fast. So someone has to find what's broken. I have shown one of the broken parts to fix (floor on ldc2). Also, the original code is not written in a fully idiomatic way, partly because unfortunately today the "lazy" way to write D code is not always the best/right way (example: you have to add a ton of immutable/const and annotations, because immutability is not the default), so a code fix is good.
>> Bye,
>> bearophile
> As long as you find out it's a library thing.
> The C version is the fastest without any annotations or immutable/const, so what's the problem with D here? It can't (shouldn't) be that one needs to work on or change such simple code that much to reach C speed.

bearophile's work is very valuable regardless of what the cause is, as it provides a pretty decent hint of what could be improved for anybody investigating the issue.

This is not to say that we wouldn't need to fix our compilers (in end user terms, i.e. compiler + standard library) to make those examples fast – zero-cost abstractions are one of the main strengths of D.

David
Jun 21 2014
Nick Treleaven:
> A Perlin noise benchmark was quoted in this reddit thread:

And a simple benchmark for D ranges/parallelism:

Bye,
bearophile
Jun 20 2014
Nick Treleaven:
> A Perlin noise benchmark was quoted in this reddit thread:

And a simple benchmark for D ranges/parallelism:
http://www.reddit.com/r/programming/comments/28mub4/clash_of_the_lambdas_comparing_lambda_performance/

Bye,
bearophile
Jun 20 2014
On Friday, 20 June 2014 at 12:32:39 UTC, Nick Treleaven wrote:
> Hi,
> A Perlin noise benchmark was quoted in this reddit thread:
> http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr
> It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:
> https://github.com/nsf/pnoise#readme
> I initially wondered about std.random, but got this response:
> "Yeah, but std.random is not used in that benchmark, it just initializes 256 random vectors and permutates 256 sequential integers. What spins in a loop is just plain FP math and array read/writes. I'm sure it can be done faster, maybe D compilers are bad at automatic inlining or something."
> Obviously this is only one person's benchmark, but I wondered if people would like to check their code and suggest reasons for the speed deficit.

I saw this thread when searching for something on the site; it's been a few months since anyone posted. I fixed the D flags, and gdc is now about 15% faster than the second fastest in the benchmark (C - gcc), which obviously puts D in first.

Some notes:
- LDC is missing _tons_ of inline opportunities, killing it in comparison to GDC. I think GDC inlined pretty much everything. LDC is about 50% slower.
- Also, AFAICT there's no fast-math switch for LDC (enabling this for GDC might actually be compromising it though : ) ).
- I think LDC turns the floor in std.math into the same as the stdc one, but GDC does not.
- std.math.floor is still abysmally slow. I thought it was because it was still using reals, but that does not seem to be the case.
- GDC slows to a crawl (10-20x slower) if you replace the stdc floor with the one in std.math (just remove the alias).

I thought this might be interesting to someone (i.e., LDC/GDC folks or phobos math folks).

bye.
Mar 23 2015
I'd suspect stdc.math to be SSE3/SSE4 optimised assembly, whereas std.math uses a very generic (works on almost every float format) implementation that is at least 'pure'.

Iain.

On 24 Mar 2015 00:30, "weaselcat via Digitalmars-d" <digitalmars-d puremagic.com> wrote:
> On Friday, 20 June 2014 at 12:32:39 UTC, Nick Treleaven wrote:
>> Hi,
>> A Perlin noise benchmark was quoted in this reddit thread:
>> http://www.reddit.com/r/rust/comments/289enx/c0de517e_where_is_my_c_replacement/cibn6sr
>> It apparently shows the 3 main D compilers producing slower code than Go, Rust, gcc, clang, Nimrod:
>> https://github.com/nsf/pnoise#readme
>> I initially wondered about std.random, but got this response:
>> "Yeah, but std.random is not used in that benchmark, it just initializes 256 random vectors and permutates 256 sequential integers. What spins in a loop is just plain FP math and array read/writes. I'm sure it can be done faster, maybe D compilers are bad at automatic inlining or something."
>> Obviously this is only one person's benchmark, but I wondered if people would like to check their code and suggest reasons for the speed deficit.
>
> I saw this thread when searching for something on the site; it's been a few months since anyone posted. I fixed the D flags, and gdc is now about 15% faster than the second fastest in the benchmark (C - gcc), which obviously puts D in first.
>
> Some notes:
> - LDC is missing _tons_ of inline opportunities, killing it in comparison to GDC. I think GDC inlined pretty much everything. LDC is about 50% slower.
> - Also, AFAICT there's no fast-math switch for LDC (enabling this for GDC might actually be compromising it though : ) ).
> - I think LDC turns the floor in std.math into the same as the stdc one, but GDC does not.
> - std.math.floor is still abysmally slow. I thought it was because it was still using reals, but that does not seem to be the case.
> - GDC slows to a crawl (10-20x slower) if you replace the stdc floor with the one in std.math (just remove the alias).
>
> I thought this might be interesting to someone (i.e., LDC/GDC folks or phobos math folks).
>
> bye.
Mar 23 2015