digitalmars.D - I wonder how fast we'd do
- Andrei Alexandrescu (1/1) May 27 2019 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.ht...
- Nicholas Wilson (4/5) May 27 2019 The DMD numbers will be crap because it doesn't autovectorise.
- Mike Franklin (8/9) May 27 2019 Here's a comparison between LDC, GCC 9.1, and CLang 8.0 in
- Mike Franklin (5/13) May 27 2019 And DMD with -O -release to get rid of bounds checking:
- Walter Bright (2/3) May 28 2019 It did unroll the loop, though!
- Mike Franklin (3/9) May 27 2019 Forgot the link. oops. https://godbolt.org/z/Unj9Kk
- Uknown (21/22) May 27 2019 I tested 3 D variants :
- Marco de Wild (6/29) May 27 2019 When the blog post released I wrote a few benchmarks.
- KnightMare (11/12) May 28 2019 I doubgt about results on my machine:
- KnightMare (1/5) May 28 2019 my CPU is i7-3615QM (Ivy Bridge, no AVX2)
- Marco de Wild (12/48) May 28 2019 Should have been 20 ms of course.
- KnightMare (5/10) May 28 2019 all code was skipped coz results was unused.
- KnightMare (5/6) May 28 2019 I'm confused by the accuracy of measurements in steps of 16/17ms:
- Atila Neves (9/10) May 28 2019 By using a D driver to call extern C implementations, I get 27ms
- Guillaume Piolat (9/13) May 28 2019 This really isn't _that_ surprising.
- Nicholas Wilson (8/22) May 28 2019 Indeed, the only thing that usually has any effect is aliasing
- Atila Neves (5/19) May 29 2019 Sure. I wasn't surprised the loop versions were all the same,
- H. S. Teoh (11/26) May 30 2019 They should *not* pay a performance penalty, otherwise I'd stop using
- Walter Bright (3/6) May 28 2019 I'm not surprised. First off, because of inlining, etc., all are transfo...
- H. S. Teoh (7/16) May 28 2019 Does dmd unroll loops yet? That appears to be a major cause of
- Walter Bright (2/5) May 29 2019 https://digitalmars.com/d/archives/digitalmars/D/I_wonder_how_fast_we_d_...
- KnightMare (31/32) May 28 2019 totally for:
- KnightMare (18/21) May 28 2019 people explained why code is being thrown away
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

The DMD numbers will be crap because it doesn't autovectorise. LDC should be the same as clang and GDC should be the same as gcc.
May 27 2019
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

Here's a comparison between LDC, GCC 9.1, and Clang 8.0 in Compiler Explorer. The GCC code looks awfully compact. The LDC and Clang code look quite similar.

I tried to test at https://explore.dgnu.org/ also, but when I attempted, the service was down :(

Mike
May 27 2019
On Tuesday, 28 May 2019 at 05:17:09 UTC, Mike Franklin wrote:
> Here's a comparison between LDC, GCC 9.1, and Clang 8.0 in Compiler Explorer. The GCC code looks awfully compact. The LDC and Clang code look quite similar. I tried to test at https://explore.dgnu.org/ also, but when I attempted, the service was down :(

And DMD with -O -release to get rid of bounds checking:

https://run.dlang.io/is/4QmXkO

Doesn't look as good as the others.

Mike
May 27 2019
On 5/27/2019 10:20 PM, Mike Franklin wrote:
> Doesn't look as good as the others.

It did unroll the loop, though!
May 28 2019
On Tuesday, 28 May 2019 at 05:17:09 UTC, Mike Franklin wrote:
> Here's a comparison between LDC, GCC 9.1, and Clang 8.0 in Compiler Explorer. The GCC code looks awfully compact. The LDC and Clang code look quite similar.

Forgot the link. Oops.

https://godbolt.org/z/Unj9Kk

Mike
May 27 2019
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

I tested 3 D variants:

---ver1.d
double sum = 0.0;
for (int i = 0; i < values.length; i++)
{
    double v = values[i] * values[i];
    sum += v;
}
return sum;

---ver2.d
double sum = 0.0;
foreach (v; values)
    sum += v * v;
return sum;

---ver3.d
import std.algorithm : sum;
auto squares = new double[values.length];
squares[] = values[] * values[];
return squares.sum;

All 3 were the exact same with LDC.
https://run.dlang.io/is/6pjEud
May 27 2019
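[The three fragments above are not compilable on their own; a minimal self-contained sketch of the same three variants, with function names and the `main` driver being mine rather than from the linked snippet, could look like this:]

```d
import std.algorithm : sum;
import std.stdio : writeln;

// ver1: explicit index-based loop
double sumSquaresFor(const double[] values)
{
    double total = 0.0;
    for (int i = 0; i < values.length; i++)
    {
        double v = values[i] * values[i];
        total += v;
    }
    return total;
}

// ver2: foreach over the slice
double sumSquaresForeach(const double[] values)
{
    double total = 0.0;
    foreach (v; values)
        total += v * v;
    return total;
}

// ver3: array-wise multiply into a temporary, then std.algorithm.sum
double sumSquaresArrayOp(const double[] values)
{
    auto squares = new double[values.length];
    squares[] = values[] * values[];
    return squares.sum;
}

void main()
{
    const double[] values = [1.0, 2.0, 3.0, 4.0];
    writeln(sumSquaresFor(values));     // 30
    writeln(sumSquaresForeach(values)); // 30
    writeln(sumSquaresArrayOp(values)); // 30
}
```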
On Tuesday, 28 May 2019 at 05:20:14 UTC, Uknown wrote:
> I tested 3 D variants [...] All 3 were the exact same with LDC.
> https://run.dlang.io/is/6pjEud

When the blog post was released I wrote a few benchmarks. Surprisingly, using values.map!(x => x*x).sum was the fastest (faster than v1). It got to around 20 us on my machine.
May 27 2019
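[For completeness, Marco's map-based variant as a self-contained sketch; the function name and driver are mine, not from his benchmark:]

```d
import std.algorithm : map, sum;
import std.stdio : writeln;

// The range-pipeline variant: a lazy map over the slice,
// folded with std.algorithm.sum (which uses pairwise summation
// for floating point).
double sumSquaresMap(const double[] values)
{
    return values.map!(x => x * x).sum;
}

void main()
{
    const double[] values = [1.0, 2.0, 3.0, 4.0];
    writeln(sumSquaresMap(values)); // 30
}
```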
> https://run.dlang.io/is/6pjEud

I doubt the results on my machine:

t1 : 42 ms, 858 ╬╝s, and 7 hnsecs   1.06672e+09
t2 : 42 ms, 647 ╬╝s, and 8 hnsecs   1.06672e+09
t3 : 0 hnsecs                       0

// without printing the results, LDC drops the calcs entirely
// ldc2 -release -O3 times2.d
writeln("t1 : ", t1 / n, " \t", r1/n );
writeln("t2 : ", t2 / n, " \t", r2/n );
writeln("t3 : ", t3 / n, " \t", r3/n );

offtopic: 123╬╝s better change to 123us (for Windows, sure)
May 28 2019
> I doubt the results on my machine:
> t1 : 42 ms, 858 ╬╝s, and 7 hnsecs   1.06672e+09
> t2 : 42 ms, 647 ╬╝s, and 8 hnsecs   1.06672e+09
> t3 : 0 hnsecs                       0

my CPU is i7-3615QM (Ivy Bridge, no AVX2)
May 28 2019
On Tuesday, 28 May 2019 at 05:54:07 UTC, Marco de Wild wrote:
> When the blog post released I wrote a few benchmarks. Surprisingly, using values.map!(x => x*x).sum was the fastest (faster than v1). It got around to 20 us on my machine.

Should have been 20 ms of course.

https://run.dlang.io/is/Fpg8Iw

21 ms, 387 μs, and 7 hnsecs (map)
32 ms, 191 μs, and 1 hnsec (foreach)
32 ms, 183 μs, and 8 hnsecs (for)

However, recompiling it with LDC (to reproduce the exact compile flags) gives exactly the opposite result *facepalm*, bumping the map to 40 ms:

41 ms, 792 μs, and 7 hnsecs
30 ms and 893 μs
31 ms, 76 μs, and 6 hnsecs
May 28 2019
> When the blog post released I wrote a few benchmarks. Surprisingly, using values.map!(x => x*x).sum was the fastest (faster than v1). It got around to 20 us on my machine.

All the code was skipped because the results were unused. Code that uses the results: https://pastebin.com/j0T0MRmA

d_with_sum was still skipped because of https://forum.dlang.org/thread/wxesgcjznvwpdwpnxnej forum.dlang.org

After that, all results come out at XXms.
May 28 2019
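[The dead-code elimination described above is the classic benchmarking pitfall: if a computed result is never used, an optimizer like LDC's may legally delete the whole loop, yielding "0 hnsecs". A minimal sketch of the fix (function name mine) is to make the result observable, e.g. by printing it:]

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;

double sumSquares(const double[] values)
{
    double total = 0.0;
    foreach (v; values)
        total += v * v;
    return total;
}

void main()
{
    auto values = new double[1024 * 1024];
    values[] = 1.5;

    auto sw = StopWatch(AutoStart.yes);
    immutable result = sumSquares(values);
    sw.stop();

    // Printing the result makes the computation observable, so the
    // optimizer cannot discard the loop as dead code.
    writeln("sum = ", result, " in ", sw.peek);
}
```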
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

I'm confused by the accuracy of measurements in steps of 16/17ms: 17ms, 34ms, 75ms, 98ms, 128ms... sse(2) should get the same numbers (most of the compilers are based on LLVM)
May 28 2019
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

By using a D driver to call extern(C) implementations, I get 27ms for everything here:

https://github.com/atilaneves/blog-obvious

Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
May 28 2019
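[A sketch of what such a D-driver-calling-C setup looks like. For brevity both halves live in one D file here; in a real setup (and presumably in the linked repo) the implementation would be in a separate .c file, compiled separately, with only the extern(C) declaration on the D side:]

```d
import std.stdio : writeln;

// C-ABI function. In a real two-language setup this body would be in a
// .c file, and the D side would only carry the declaration:
//   extern(C) double sum_squares(const(double)* values, size_t len);
extern(C) double sum_squares(const(double)* values, size_t len)
{
    double total = 0.0;
    foreach (i; 0 .. len)
        total += values[i] * values[i];
    return total;
}

void main()
{
    double[] values = [1.0, 2.0, 3.0];
    // D slices carry .ptr and .length, matching the C signature directly.
    writeln(sum_squares(values.ptr, values.length)); // 14
}
```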
On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
> Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.

This really isn't _that_ surprising. Once properly optimized, native code is the same speed for every input language. C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance. Barring anomalies like bounds checks or integer overflow checks.

Comparisons of backends would be much more interesting, but drive less interest on Internet forums.
May 28 2019
On Tuesday, 28 May 2019 at 14:20:30 UTC, Guillaume Piolat wrote:
> This really isn't _that_ surprising. Once properly optimized, native code is the same speed for every input language. C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance. Barring anomalies like bounds checks or integer overflow checks.

Indeed, the only thing that usually has any effect is aliasing rules, and the occasional convincing of the code generator to do non-temporal ops.

The REAL power of the frontend language is to make the optimiser aware that the code is redundant, 'cause there's no faster code than no code at all. That's why I'm really excited to see what we could use MLIR[1] for in LDC.

[1]: https://github.com/tensorflow/mlir/
May 28 2019
On Tuesday, 28 May 2019 at 14:20:30 UTC, Guillaume Piolat wrote:
> This really isn't _that_ surprising. Once properly optimized, native code is the same speed for every input language. C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance. Barring anomalies like bounds checks or integer overflow checks.

Sure. I wasn't surprised the loop versions were all the same; it'd be weird if they weren't. I was surprised that the algorithm/range/iterator versions didn't pay a performance penalty!
May 29 2019
On Wed, May 29, 2019 at 09:00:15AM +0000, Atila Neves via Digitalmars-d wrote:
> Sure. I wasn't surprised the loop versions were all the same, it'd be weird if they weren't. I was surprised that the algorithm/range/iterator versions didn't pay a performance penalty!
[...]

They should *not* pay a performance penalty, otherwise I'd stop using them right away! They are supposed to be a nicer, higher-level way of writing loops (without most of the gotchas, boilerplate, and unreusability), but at the bottom they should translate to basically exactly the same thing as writing out the loops manually. I expect nothing less from a modern optimizing compiler.

T

-- 
Turning your clock 15 minutes ahead won't cure lateness---you're just making time go faster!
May 30 2019
On 5/28/2019 2:49 AM, Atila Neves wrote:
> Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.

I'm not surprised. First off, because of inlining, etc., all are transformed into a simple loop. Then, the auto-vectorizer does the rest.
May 28 2019
On Tue, May 28, 2019 at 10:11:43AM -0700, Walter Bright via Digitalmars-d wrote:
> I'm not surprised. First off, because of inlining, etc., all are transformed into a simple loop. Then, the auto-vectorizer does the rest.

Does dmd unroll loops yet? That appears to be a major cause of suboptimal codegen in dmd, last time I checked. Would be nice to improve this.

T

-- 
Some ideas are so stupid that only intellectuals could believe them. -- George Orwell
May 28 2019
On 5/28/2019 5:22 PM, H. S. Teoh wrote:
> Does dmd unroll loops yet? That appears to be a major cause of suboptimal codegen in dmd, last time I checked. Would be nice to improve this.

https://digitalmars.com/d/archives/digitalmars/D/I_wonder_how_fast_we_d_do_327428.html#N327436
May 29 2019
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu wrote:
> https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html

totally for: https://pastebin.com/j0T0MRmA
small changes to Marco de Wild's code
Windows Server 2019, i7-3615QM, DMD 2.086.0, LDC 1.16.0-b1

C:\content\downloadz\dlang>ldc2 -release -O3 -mattr=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=42 ms, 714 ╬╝s, and 9 hnsecs   r=10922666154674544967680
t2=42 ms and 614 ╬╝s              r=10922666154674544967680
t3=0 hnsecs                      r=0
t4=42 ms, 474 ╬╝s, and 8 hnsecs   r=10922666154674544967680

C:\content\downloadz\dlang>dmd -release -O -mcpu=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=141 ms, 263 ╬╝s, and 5 hnsecs  r=10922666154673907433000
t2=143 ms, 128 ╬╝s, and 9 hnsecs  r=10922666154673907433000
t3=1 hnsec                       r=0
t4=491 ms, 829 ╬╝s, and 9 hnsecs  r=10922666154673907433000

1) different sums with DMD and LDC (probably fast-math, don't know)

2) t3=0 for d_with_sum. Let's see the assembler from LDC (-output-s):

.def _D6times210d_with_sumFNaNbNfAdZd;
.scl 2;
.type 32;
.endef
.section .text,"xr",discard,_D6times210d_with_sumFNaNbNfAdZd
.globl _D6times210d_with_sumFNaNbNfAdZd
.p2align 4, 0x90
_D6times210d_with_sumFNaNbNfAdZd:
    vxorps %xmm0, %xmm0, %xmm0
    retq
// this means "return 0"? cool optimization

3) for Windows, better change "╬╝s" to "us" (when /SUBSYSTEM:CONSOLE)
May 28 2019
> https://pastebin.com/j0T0MRmA
> Windows Server 2019, i7-3615QM, DMD 2.086.0, LDC 1.16.0-b1
> 2) t3=0 for d_with_sum. Let's see the assembler from LDC (-output-s):

people explained why the code is being thrown away:
https://forum.dlang.org/thread/wxesgcjznvwpdwpnxnej forum.dlang.org

I have these results now:

C:\content\downloadz\dlang>ldc2 -release -O3 -mattr=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=42 ms, 929 ╬╝s, and 1 hnsec    r=109226661546_74544967680
t2=42 ms and 578 ╬╝s              r=109226661546_74544967680
t3=333 ms, 539 ╬╝s, and 3 hnsecs  r=109226661546_66672259072
t4=42 ms, 631 ╬╝s, and 9 hnsecs   r=109226661546_74544967680

C:\content\downloadz\dlang>dmd -release -O -mcpu=avx times2.d
C:\content\downloadz\dlang>times2.exe
core.exception.OutOfMemoryError src\core\exception.d(702): Memory allocation failed
----------------

I have 16GB RAM, 8GB free (by Task Manager).
double[32M].sizeof = 256MB
Removing the attrs from d_with_sum & main changed nothing.
A little bit strange.
May 28 2019