
digitalmars.D - I wonder how fast we'd do

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
May 27
next sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
The DMD numbers will be crap because it doesn't autovectorise. LDC should be the same as clang and GDC should be the same as gcc.
May 27
prev sibling next sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
Here's a comparison between LDC, GCC 9.1, and Clang 8.0 in Compiler Explorer. The GCC code looks awfully compact. The LDC and Clang code look quite similar. I tried to test at https://explore.dgnu.org/ as well, but the service was down when I attempted it :( Mike
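For context, the function being compared is presumably something along the lines of the blog post's sum-of-squares loop; a minimal sketch (the name is hypothetical, not necessarily what was pasted into Compiler Explorer):

```d
// Straightforward sum of squares; this is the loop the autovectoriser
// is expected to turn into SIMD code at -O2/-O3.
double sumOfSquares(const(double)[] values)
{
    double sum = 0.0;
    foreach (v; values)
        sum += v * v;
    return sum;
}
```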
May 27
next sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Tuesday, 28 May 2019 at 05:17:09 UTC, Mike Franklin wrote:
 On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
 wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
Here's a comparison between LDC, GCC 9.1, and Clang 8.0 in Compiler Explorer. The GCC code looks awfully compact. The LDC and Clang code look quite similar. I tried to test at https://explore.dgnu.org/ as well, but the service was down when I attempted it :(
And DMD with -O -release to get rid of bounds checking: https://run.dlang.io/is/4QmXkO Doesn't look as good as the others. Mike
May 27
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2019 10:20 PM, Mike Franklin wrote:
 Doesn't look as good as the others.
It did unroll the loop, though!
May 28
prev sibling parent Mike Franklin <slavo5150 yahoo.com> writes:
On Tuesday, 28 May 2019 at 05:17:09 UTC, Mike Franklin wrote:
 On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
 wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
Here's a comparison between LDC, GCC 9.1, and Clang 8.0 in Compiler Explorer. The GCC code looks awfully compact. The LDC and Clang code look quite similar.
Forgot the link. oops. https://godbolt.org/z/Unj9Kk Mike
May 27
prev sibling next sibling parent reply Uknown <sireeshkodali1 gmail.com> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
I tested 3 D variants:

---ver1.d
double sum = 0.0;
for (int i = 0; i < values.length; i++)
{
    double v = values[i] * values[i];
    sum += v;
}

---ver2.d
double sum = 0.0;
foreach (v; values)
    sum += v * v;
return sum;

---ver3.d
import std.algorithm : sum;
auto squares = new double[values.length];
squares[] = values[] * values[];
return squares.sum;

All 3 were the exact same with LDC. https://run.dlang.io/is/6pjEud
May 27
parent reply Marco de Wild <mdwild sogyo.nl> writes:
On Tuesday, 28 May 2019 at 05:20:14 UTC, Uknown wrote:
 On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
 wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
I tested 3 D variants:

---ver1.d
double sum = 0.0;
for (int i = 0; i < values.length; i++)
{
    double v = values[i] * values[i];
    sum += v;
}

---ver2.d
double sum = 0.0;
foreach (v; values)
    sum += v * v;
return sum;

---ver3.d
import std.algorithm : sum;
auto squares = new double[values.length];
squares[] = values[] * values[];
return squares.sum;

All 3 were the exact same with LDC. https://run.dlang.io/is/6pjEud
When the blog post was released I wrote a few benchmarks. Surprisingly, using values.map!(x => x*x).sum was the fastest (faster than v1). It got to around 20 us on my machine.
May 27
next sibling parent reply KnightMare <black80 bk.ru> writes:
 https://run.dlang.io/is/6pjEud
I doubt the results on my machine:

t1 : 42 ms, 858 μs, and 7 hnsecs    1.06672e+09
t2 : 42 ms, 647 μs, and 8 hnsecs    1.06672e+09
t3 : 0 hnsecs   0

Without printing the results, LDC drops the calculations entirely:

// ldc2 -release -O3 times2.d
writeln("t1 : ", t1 / n, " \t", r1/n );
writeln("t2 : ", t2 / n, " \t", r2/n );
writeln("t3 : ", t3 / n, " \t", r3/n );

offtopic: "123μs" is better printed as "123us" (at least on Windows consoles).
May 28
parent KnightMare <black80 bk.ru> writes:
 I doubt the results on my machine:
 t1 : 42 ms, 858 μs, and 7 hnsecs       1.06672e+09
 t2 : 42 ms, 647 μs, and 8 hnsecs       1.06672e+09
 t3 : 0 hnsecs   0
my CPU is i7-3615QM (Ivy Bridge, no AVX2)
May 28
prev sibling next sibling parent Marco de Wild <mdwild sogyo.nl> writes:
On Tuesday, 28 May 2019 at 05:54:07 UTC, Marco de Wild wrote:
 On Tuesday, 28 May 2019 at 05:20:14 UTC, Uknown wrote:
 On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
 wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
I tested 3 D variants:

---ver1.d
double sum = 0.0;
for (int i = 0; i < values.length; i++)
{
    double v = values[i] * values[i];
    sum += v;
}

---ver2.d
double sum = 0.0;
foreach (v; values)
    sum += v * v;
return sum;

---ver3.d
import std.algorithm : sum;
auto squares = new double[values.length];
squares[] = values[] * values[];
return squares.sum;

All 3 were the exact same with LDC. https://run.dlang.io/is/6pjEud
When the blog post was released I wrote a few benchmarks. Surprisingly, using values.map!(x => x*x).sum was the fastest (faster than v1). It got to around 20 us on my machine.
Should have been 20 ms of course. https://run.dlang.io/is/Fpg8Iw

21 ms, 387 μs, and 7 hnsecs  (map)
32 ms, 191 μs, and 1 hnsec   (foreach)
32 ms, 183 μs, and 8 hnsecs  (for)

However, recompiling it locally with LDC (to reproduce the exact compile flags) gives exactly the opposite result *facepalm*, bumping the map version to over 40 ms:

41 ms, 792 μs, and 7 hnsecs
30 ms and 893 μs
31 ms, 76 μs, and 6 hnsecs
May 28
prev sibling parent KnightMare <black80 bk.ru> writes:
 When the blog post released I wrote a few benchmarks. 
 Surprisingly, using

 values.map!(x => x*x).sum

 was the fastest (faster than v1). It got around to 20 us on my 
 machine.
All the code was skipped because its results were unused. Code that actually uses the results: https://pastebin.com/j0T0MRmA. d_with_sum was still skipped, for the reason explained in https://forum.dlang.org/thread/wxesgcjznvwpdwpnxnej forum.dlang.org. With the results used, everything comes in at XXms.
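The pattern being described here — making the benchmark's results observable so the optimiser cannot discard the work — can be sketched like this (names are hypothetical, not the pastebin code):

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

double squareSum(const(double)[] values)
{
    double sum = 0.0;
    foreach (v; values)
        sum += v * v;
    return sum;
}

void main()
{
    auto values = new double[1024];
    values[] = 1.5;

    double result = 0.0;            // accumulate into a live variable...
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. 100)
        result += squareSum(values);
    sw.stop();

    // ...and print it, so the compiler must actually perform the work
    writeln(sw.peek, "\t", result);
}
```

If `result` were never printed (or otherwise used), LDC is entitled to delete the whole loop, which is exactly the t3=0 effect seen above.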
May 28
prev sibling next sibling parent KnightMare <black80 bk.ru> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
I'm puzzled by the precision of the measurements: they land in steps of roughly 16/17 ms (17 ms, 34 ms, 75 ms, 98 ms, 128 ms...), which looks like timer granularity. SSE(2) code should produce the same numbers everywhere (most of the compilers are based on LLVM).
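Steps of ~16 ms are the classic signature of a coarse system timer. One way to rule that out (a sketch, not the blog's code; the workload is a placeholder) is to time with core.time.MonoTime, D's monotonic high-resolution clock:

```d
import core.time : Duration, MonoTime;
import std.stdio : writeln;

Duration timeIt(scope void delegate() work)
{
    // MonoTime has hnsec (100 ns) precision, unaffected by the ~16 ms
    // granularity of some wall-clock timers on Windows.
    immutable start = MonoTime.currTime;
    work();
    return MonoTime.currTime - start;
}

void main()
{
    double sum = 0.0;
    auto elapsed = timeIt({
        foreach (i; 0 .. 1_000_000)
            sum += i * 0.5;
    });
    writeln(elapsed.total!"usecs", " us\t", sum);
}
```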
May 28
prev sibling next sibling parent reply Atila Neves <atila.neves gmail.com> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
By using a D driver to call extern C implementations, I get 27ms for everything here: https://github.com/atilaneves/blog-obvious Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
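The driver approach presumably looks something like the following sketch. In the repo the implementations live in separate C/C++/Rust files and are only declared in D; here the function is written in D with extern(C) linkage so the sketch stays self-contained and runnable, and the name is hypothetical:

```d
import std.stdio : writeln;

// Hypothetical stand-in for an implementation that would normally be
// compiled from C and linked in; extern(C) gives it the C ABI either way.
extern(C) double sum_squares(const(double)* p, size_t n)
{
    double s = 0.0;
    foreach (i; 0 .. n)
        s += p[i] * p[i];
    return s;
}

void main()
{
    auto values = new double[4];
    values[] = 2.0;
    // The D driver hands over a pointer/length pair, since C knows
    // nothing about D slices.
    writeln(sum_squares(values.ptr, values.length));
}
```

Because every implementation is called through the same C ABI from the same driver, the timing harness itself cannot favour one language over another.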
May 28
next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
 Much to my surprise, C, C++, D and Rust all had the same 
 performance as each other, independently of whether C++, D and 
 Rust used ranges/algorithm/streams or plain loops. All done 
 with -O2, all LLVM.
This really isn't _that_ surprising. Once properly optimized, native code is the same speed for every input language. C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance. Barring anomalies like bounds check or integer overflow checks. Comparisons of backends would be much more interesting, but drive less interest on Internet forums.
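The bounds-check anomaly mentioned above is the one cost D's naive loop pays unless compiled with -release or -boundscheck=off. For illustration, a hypothetical sketch of sidestepping it in source instead of via flags:

```d
// Indexing through .ptr bypasses the bounds check, at the cost of @safe-ty;
// the usual route is simply compiling with -release / -boundscheck=off.
double sumSquaresUnchecked(const(double)[] values) @system
{
    double s = 0.0;
    const p = values.ptr;
    foreach (i; 0 .. values.length)
        s += p[i] * p[i];
    return s;
}
```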
May 28
next sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Tuesday, 28 May 2019 at 14:20:30 UTC, Guillaume Piolat wrote:
 On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
 Much to my surprise, C, C++, D and Rust all had the same 
 performance as each other, independently of whether C++, D and 
 Rust used ranges/algorithm/streams or plain loops. All done 
 with -O2, all LLVM.
This really isn't _that_ surprising. Once properly optimized, native code is the same speed for every input language. C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance. Barring anomalies like bounds check or integer overflow checks. Comparisons of backends would be much more interesting, but drive less interest on Internet forums.
Indeed, the only thing that usually has any effect is aliasing rules, and occasionally convincing the code generator to use non-temporal ops. The REAL power of the frontend language is making the optimiser aware that code is redundant, because there's no faster code than no code at all. That's why I'm really excited to see what we could use MLIR[1] for in LDC.

[1]: https://github.com/tensorflow/mlir/
May 28
prev sibling parent reply Atila Neves <atila.neves gmail.com> writes:
On Tuesday, 28 May 2019 at 14:20:30 UTC, Guillaume Piolat wrote:
 On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
 Much to my surprise, C, C++, D and Rust all had the same 
 performance as each other, independently of whether C++, D and 
 Rust used ranges/algorithm/streams or plain loops. All done 
 with -O2, all LLVM.
This really isn't _that_ surprising. Once properly optimized, native code is the same speed for every input language. C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance. Barring anomalies like bounds check or integer overflow checks. Comparisons of backends would be much more interesting, but drive less interest on Internet forums.
Sure. I wasn't surprised the loop versions were all the same, it'd be weird if they weren't. I was surprised that the algorithm/range/iterator versions didn't pay a performance penalty!
May 29
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 29, 2019 at 09:00:15AM +0000, Atila Neves via Digitalmars-d wrote:
 On Tuesday, 28 May 2019 at 14:20:30 UTC, Guillaume Piolat wrote:
[...]
 This really isn't _that_ surprising.
 
 Once properly optimized, native code is the same speed for every
 input language.
 C, C++, D and Rust all have a "no room below" ethic in most cases,
 so you end up with the very same performance. Barring anomalies like
 bounds check or integer overflow checks.
 
 Comparisons of backends would be much more interesting, but drive
 less interest on Internet forums.
Sure. I wasn't surprised the loop versions were all the same, it'd be weird if they weren't. I was surprised that the algorithm/range/iterator versions didn't pay a performance penalty!
They should *not* pay a performance penalty, otherwise I'd stop using them right away! They are supposed to be a nicer, higher-level way of writing loops (without most of the gotchas, boilerplate, and unreusability), but at the bottom they should translate to basically exactly the same thing as writing out the loops manually. I expect nothing less from a modern optimizing compiler. T -- Turning your clock 15 minutes ahead won't cure lateness---you're just making time go faster!
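The equivalence being described — the range pipeline boiling down to the same loop — can be illustrated with a sketch (assuming the usual Phobos map/sum; the thread's observation is that LDC at -O2 compiles both styles to the same code):

```d
import std.algorithm : map, sum;

double withLoop(const(double)[] values)
{
    double s = 0.0;
    foreach (v; values)
        s += v * v;
    return s;
}

double withRanges(const(double)[] values)
{
    // Lazy range pipeline; after inlining there is no abstraction left
    // for the optimiser to pay for.
    return values.map!(x => x * x).sum;
}

void main()
{
    auto values = [1.0, 2.0, 3.0];
    assert(withLoop(values) == withRanges(values));
}
```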
May 30
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/28/2019 2:49 AM, Atila Neves wrote:
 Much to my surprise, C, C++, D and Rust all had the same performance as each 
 other, independently of whether C++, D and Rust used ranges/algorithm/streams
or 
 plain loops. All done with -O2, all LLVM.
I'm not surprised. First off, because of inlining, etc., all are transformed into a simple loop. Then, the auto-vectorizer does the rest.
May 28
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2019 at 10:11:43AM -0700, Walter Bright via Digitalmars-d wrote:
 On 5/28/2019 2:49 AM, Atila Neves wrote:
 Much to my surprise, C, C++, D and Rust all had the same performance
 as each other, independently of whether C++, D and Rust used
 ranges/algorithm/streams or plain loops. All done with -O2, all
 LLVM.
I'm not surprised. First off, because of inlining, etc., all are transformed into a simple loop. Then, the auto-vectorizer does the rest.
Does dmd unroll loops yet? That appears to be a major cause of suboptimal codegen in dmd, last time I checked. Would be nice to improve this. T -- Some ideas are so stupid that only intellectuals could believe them. -- George Orwell
May 28
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/28/2019 5:22 PM, H. S. Teoh wrote:
 Does dmd unroll loops yet? That appears to be a major cause of
 suboptimal codegen in dmd, last time I checked. Would be nice to improve
 this.
https://digitalmars.com/d/archives/digitalmars/D/I_wonder_how_fast_we_d_do_327428.html#N327436
May 29
prev sibling parent reply KnightMare <black80 bk.ru> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
Full results for https://pastebin.com/j0T0MRmA (small changes to Marco de Wild's code).
Windows Server 2019, i7-3615QM, DMD 2.086.0, LDC 1.16.0-b1

C:\content\downloadz\dlang>ldc2 -release -O3 -mattr=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=42 ms, 714 μs, and 9 hnsecs   r=10922666154674544967680
t2=42 ms and 614 μs              r=10922666154674544967680
t3=0 hnsecs                      r=0
t4=42 ms, 474 μs, and 8 hnsecs   r=10922666154674544967680

C:\content\downloadz\dlang>dmd -release -O -mcpu=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=141 ms, 263 μs, and 5 hnsecs  r=10922666154673907433000
t2=143 ms, 128 μs, and 9 hnsecs  r=10922666154673907433000
t3=1 hnsec                       r=0
t4=491 ms, 829 μs, and 9 hnsecs  r=10922666154673907433000

1) DMD and LDC produce different sums (probably fast-math, I don't know)
2) t3=0 for d_with_sum. Let's see the LDC assembler (-output-s):

.def _D6times210d_with_sumFNaNbNfAdZd; .scl 2; .type 32; .endef
.section .text,"xr",discard,_D6times210d_with_sumFNaNbNfAdZd
.globl _D6times210d_with_sumFNaNbNfAdZd
.p2align 4, 0x90
_D6times210d_with_sumFNaNbNfAdZd:
    vxorps %xmm0, %xmm0, %xmm0
    retq    // this means "return 0"? cool optimization

3) for Windows it's better to print "us" instead of "μs" (when /SUBSYSTEM:CONSOLE)
May 28
parent KnightMare <black80 bk.ru> writes:
 https://pastebin.com/j0T0MRmA
 Windows Server 2019, i7-3615QM, DMD 2.086.0, LDC 1.16.0-b1
 2) t3=0 for d_with_sum. lets see assembler for LDC (-output-s):
People explained why the code is being thrown away:
https://forum.dlang.org/thread/wxesgcjznvwpdwpnxnej forum.dlang.org

I now get these results:

C:\content\downloadz\dlang>ldc2 -release -O3 -mattr=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=42 ms, 929 μs, and 1 hnsec    r=109226661546_74544967680
t2=42 ms and 578 μs              r=109226661546_74544967680
t3=333 ms, 539 μs, and 3 hnsecs  r=109226661546_66672259072
t4=42 ms, 631 μs, and 9 hnsecs   r=109226661546_74544967680

C:\content\downloadz\dlang>dmd -release -O -mcpu=avx times2.d
C:\content\downloadz\dlang>times2.exe
core.exception.OutOfMemoryError src\core\exception.d(702): Memory allocation failed
----------------

I have 16 GB RAM, 8 GB free (by Task Manager), and double[32M].sizeof = 256 MB. Removing the attributes from d_with_sum and main changed nothing. A little strange.
May 28