
digitalmars.D - I wonder how fast we'd do

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
May 27
next sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
The DMD numbers will be crap because it doesn't autovectorise. LDC should be the same as clang and GDC should be the same as gcc.
May 27
prev sibling next sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
Here's a comparison between LDC, GCC 9.1, and Clang 8.0 in Compiler Explorer. The GCC code looks awfully compact. The LDC and Clang code look quite similar. I tried to test at https://explore.dgnu.org/ as well, but the service was down when I attempted it :( Mike
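For context, the function being compared is presumably something along the lines of the blog post's sum-of-squares loop; a minimal sketch (the name is hypothetical, not necessarily what was pasted into Compiler Explorer):

```d
// Straightforward sum of squares; this is the loop the autovectoriser
// is expected to turn into SIMD code at -O2/-O3.
double sumOfSquares(const(double)[] values)
{
    double sum = 0.0;
    foreach (v; values)
        sum += v * v;
    return sum;
}
```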
May 27
next sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Tuesday, 28 May 2019 at 05:17:09 UTC, Mike Franklin wrote:
 On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
 wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
Here's a comparison between LDC, GCC 9.1, and Clang 8.0 in Compiler Explorer. The GCC code looks awfully compact. The LDC and Clang code look quite similar. I tried to test at https://explore.dgnu.org/ as well, but the service was down when I attempted it :(
And DMD with -O -release to get rid of bounds checking: https://run.dlang.io/is/4QmXkO Doesn't look as good as the others. Mike
May 27
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/27/2019 10:20 PM, Mike Franklin wrote:
 Doesn't look as good as the others.
It did unroll the loop, though!
May 28
prev sibling parent Mike Franklin <slavo5150 yahoo.com> writes:
On Tuesday, 28 May 2019 at 05:17:09 UTC, Mike Franklin wrote:
 On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
 wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
Here's a comparison between LDC, GCC 9.1, and Clang 8.0 in Compiler Explorer. The GCC code looks awfully compact. The LDC and Clang code look quite similar.
Forgot the link. oops. https://godbolt.org/z/Unj9Kk Mike
May 27
prev sibling next sibling parent reply Uknown <sireeshkodali1 gmail.com> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
I tested 3 D variants:

---ver1.d
double sum = 0.0;
for (int i = 0; i < values.length; i++)
{
    double v = values[i] * values[i];
    sum += v;
}

---ver2.d
double sum = 0.0;
foreach (v; values)
    sum += v * v;
return sum;

---ver3.d
import std.algorithm : sum;
auto squares = new double[values.length];
squares[] = values[] * values[];
return squares.sum;

All 3 were the exact same with LDC. https://run.dlang.io/is/6pjEud
May 27
parent reply Marco de Wild <mdwild sogyo.nl> writes:
On Tuesday, 28 May 2019 at 05:20:14 UTC, Uknown wrote:
 On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
 wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
I tested 3 D variants:

---ver1.d
double sum = 0.0;
for (int i = 0; i < values.length; i++)
{
    double v = values[i] * values[i];
    sum += v;
}

---ver2.d
double sum = 0.0;
foreach (v; values)
    sum += v * v;
return sum;

---ver3.d
import std.algorithm : sum;
auto squares = new double[values.length];
squares[] = values[] * values[];
return squares.sum;

All 3 were the exact same with LDC. https://run.dlang.io/is/6pjEud
When the blog post was released I wrote a few benchmarks. Surprisingly, using values.map!(x => x*x).sum was the fastest (faster than v1). It got to around 20 us on my machine.
May 27
next sibling parent reply KnightMare <black80 bk.ru> writes:
 https://run.dlang.io/is/6pjEud
I doubt the results on my machine:

t1 : 42 ms, 858 μs, and 7 hnsecs    1.06672e+09
t2 : 42 ms, 647 μs, and 8 hnsecs    1.06672e+09
t3 : 0 hnsecs   0

Without printing the results, LDC drops the calculations entirely:

// ldc2 -release -O3 times2.d
writeln("t1 : ", t1 / n, " \t", r1/n );
writeln("t2 : ", t2 / n, " \t", r2/n );
writeln("t3 : ", t3 / n, " \t", r3/n );

offtopic: "123μs" is better printed as "123us" (at least on Windows consoles).
May 28
parent KnightMare <black80 bk.ru> writes:
 I doubt the results on my machine:
 t1 : 42 ms, 858 μs, and 7 hnsecs       1.06672e+09
 t2 : 42 ms, 647 μs, and 8 hnsecs       1.06672e+09
 t3 : 0 hnsecs   0
my CPU is i7-3615QM (Ivy Bridge, no AVX2)
May 28
prev sibling next sibling parent Marco de Wild <mdwild sogyo.nl> writes:
On Tuesday, 28 May 2019 at 05:54:07 UTC, Marco de Wild wrote:
 On Tuesday, 28 May 2019 at 05:20:14 UTC, Uknown wrote:
 On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
 wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
I tested 3 D variants:

---ver1.d
double sum = 0.0;
for (int i = 0; i < values.length; i++)
{
    double v = values[i] * values[i];
    sum += v;
}

---ver2.d
double sum = 0.0;
foreach (v; values)
    sum += v * v;
return sum;

---ver3.d
import std.algorithm : sum;
auto squares = new double[values.length];
squares[] = values[] * values[];
return squares.sum;

All 3 were the exact same with LDC. https://run.dlang.io/is/6pjEud
When the blog post was released I wrote a few benchmarks. Surprisingly, using values.map!(x => x*x).sum was the fastest (faster than v1). It got to around 20 us on my machine.
Should have been 20 ms of course. https://run.dlang.io/is/Fpg8Iw

21 ms, 387 μs, and 7 hnsecs  (map)
32 ms, 191 μs, and 1 hnsec   (foreach)
32 ms, 183 μs, and 8 hnsecs  (for)

However, recompiling it locally with LDC (to reproduce the exact compile flags) gives exactly the opposite result *facepalm*, bumping the map version to over 40 ms:

41 ms, 792 μs, and 7 hnsecs
30 ms and 893 μs
31 ms, 76 μs, and 6 hnsecs
May 28
prev sibling parent KnightMare <black80 bk.ru> writes:
 When the blog post released I wrote a few benchmarks. 
 Surprisingly, using

 values.map!(x => x*x).sum

 was the fastest (faster than v1). It got around to 20 us on my 
 machine.
All the code was skipped because its results were unused. Code that actually uses the results: https://pastebin.com/j0T0MRmA. d_with_sum was still skipped, for the reason explained in https://forum.dlang.org/thread/wxesgcjznvwpdwpnxnej forum.dlang.org. With the results used, everything comes in at XXms.
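The pattern being described here — making the benchmark's results observable so the optimiser cannot discard the work — can be sketched like this (names are hypothetical, not the pastebin code):

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

double squareSum(const(double)[] values)
{
    double sum = 0.0;
    foreach (v; values)
        sum += v * v;
    return sum;
}

void main()
{
    auto values = new double[1024];
    values[] = 1.5;

    double result = 0.0;            // accumulate into a live variable...
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. 100)
        result += squareSum(values);
    sw.stop();

    // ...and print it, so the compiler must actually perform the work
    writeln(sw.peek, "\t", result);
}
```

If `result` were never printed (or otherwise used), LDC is entitled to delete the whole loop, which is exactly the t3=0 effect seen above.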
May 28
prev sibling next sibling parent KnightMare <black80 bk.ru> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
I'm puzzled by the precision of the measurements: they land in steps of roughly 16/17 ms (17 ms, 34 ms, 75 ms, 98 ms, 128 ms...), which looks like timer granularity. SSE(2) code should produce the same numbers everywhere (most of the compilers are based on LLVM).
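Steps of ~16 ms are the classic signature of a coarse system timer. One way to rule that out (a sketch, not the blog's code; the workload is a placeholder) is to time with core.time.MonoTime, D's monotonic high-resolution clock:

```d
import core.time : Duration, MonoTime;
import std.stdio : writeln;

Duration timeIt(scope void delegate() work)
{
    // MonoTime has hnsec (100 ns) precision, unaffected by the ~16 ms
    // granularity of some wall-clock timers on Windows.
    immutable start = MonoTime.currTime;
    work();
    return MonoTime.currTime - start;
}

void main()
{
    double sum = 0.0;
    auto elapsed = timeIt({
        foreach (i; 0 .. 1_000_000)
            sum += i * 0.5;
    });
    writeln(elapsed.total!"usecs", " us\t", sum);
}
```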
May 28
prev sibling next sibling parent reply Atila Neves <atila.neves gmail.com> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
By using a D driver to call extern C implementations, I get 27ms for everything here: https://github.com/atilaneves/blog-obvious Much to my surprise, C, C++, D and Rust all had the same performance as each other, independently of whether C++, D and Rust used ranges/algorithm/streams or plain loops. All done with -O2, all LLVM.
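The driver approach presumably looks something like the following sketch. In the repo the implementations live in separate C/C++/Rust files and are only declared in D; here the function is written in D with extern(C) linkage so the sketch stays self-contained and runnable, and the name is hypothetical:

```d
import std.stdio : writeln;

// Hypothetical stand-in for an implementation that would normally be
// compiled from C and linked in; extern(C) gives it the C ABI either way.
extern(C) double sum_squares(const(double)* p, size_t n)
{
    double s = 0.0;
    foreach (i; 0 .. n)
        s += p[i] * p[i];
    return s;
}

void main()
{
    auto values = new double[4];
    values[] = 2.0;
    // The D driver hands over a pointer/length pair, since C knows
    // nothing about D slices.
    writeln(sum_squares(values.ptr, values.length));
}
```

Because every implementation is called through the same C ABI from the same driver, the timing harness itself cannot favour one language over another.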
May 28
next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
 Much to my surprise, C, C++, D and Rust all had the same 
 performance as each other, independently of whether C++, D and 
 Rust used ranges/algorithm/streams or plain loops. All done 
 with -O2, all LLVM.
This really isn't _that_ surprising. Once properly optimized, native code is the same speed for every input language. C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance. Barring anomalies like bounds check or integer overflow checks. Comparisons of backends would be much more interesting, but drive less interest on Internet forums.
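The bounds-check anomaly mentioned above is the one cost D's naive loop pays unless compiled with -release or -boundscheck=off. For illustration, a hypothetical sketch of sidestepping it in source instead of via flags:

```d
// Indexing through .ptr bypasses the bounds check, at the cost of @safe-ty;
// the usual route is simply compiling with -release / -boundscheck=off.
double sumSquaresUnchecked(const(double)[] values) @system
{
    double s = 0.0;
    const p = values.ptr;
    foreach (i; 0 .. values.length)
        s += p[i] * p[i];
    return s;
}
```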
May 28
next sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Tuesday, 28 May 2019 at 14:20:30 UTC, Guillaume Piolat wrote:
 On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
 Much to my surprise, C, C++, D and Rust all had the same 
 performance as each other, independently of whether C++, D and 
 Rust used ranges/algorithm/streams or plain loops. All done 
 with -O2, all LLVM.
This really isn't _that_ surprising. Once properly optimized, native code is the same speed for every input language. C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance. Barring anomalies like bounds check or integer overflow checks. Comparisons of backends would be much more interesting, but drive less interest on Internet forums.
Indeed, the only thing that usually has any effect is aliasing rules, and occasionally convincing the code generator to use non-temporal ops. The REAL power of the frontend language is making the optimiser aware that code is redundant, because there's no faster code than no code at all. That's why I'm really excited to see what we could use MLIR[1] for in LDC.

[1]: https://github.com/tensorflow/mlir/
May 28
prev sibling parent reply Atila Neves <atila.neves gmail.com> writes:
On Tuesday, 28 May 2019 at 14:20:30 UTC, Guillaume Piolat wrote:
 On Tuesday, 28 May 2019 at 09:49:26 UTC, Atila Neves wrote:
 Much to my surprise, C, C++, D and Rust all had the same 
 performance as each other, independently of whether C++, D and 
 Rust used ranges/algorithm/streams or plain loops. All done 
 with -O2, all LLVM.
This really isn't _that_ surprising. Once properly optimized, native code is the same speed for every input language. C, C++, D and Rust all have a "no room below" ethic in most cases, so you end up with the very same performance. Barring anomalies like bounds check or integer overflow checks. Comparisons of backends would be much more interesting, but drive less interest on Internet forums.
Sure. I wasn't surprised the loop versions were all the same, it'd be weird if they weren't. I was surprised that the algorithm/range/iterator versions didn't pay a performance penalty!
May 29
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, May 29, 2019 at 09:00:15AM +0000, Atila Neves via Digitalmars-d wrote:
 On Tuesday, 28 May 2019 at 14:20:30 UTC, Guillaume Piolat wrote:
[...]
 This really isn't _that_ surprising.
 
 Once properly optimized, native code is the same speed for every
 input language.
 C, C++, D and Rust all have a "no room below" ethic in most cases,
 so you end up with the very same performance. Barring anomalies like
 bounds check or integer overflow checks.
 
 Comparisons of backends would be much more interesting, but drive
 less interest on Internet forums.
Sure. I wasn't surprised the loop versions were all the same, it'd be weird if they weren't. I was surprised that the algorithm/range/iterator versions didn't pay a performance penalty!
They should *not* pay a performance penalty, otherwise I'd stop using them right away! They are supposed to be a nicer, higher-level way of writing loops (without most of the gotchas, boilerplate, and unreusability), but at the bottom they should translate to basically exactly the same thing as writing out the loops manually. I expect nothing less from a modern optimizing compiler. T -- Turning your clock 15 minutes ahead won't cure lateness---you're just making time go faster!
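The equivalence being described — the range pipeline boiling down to the same loop — can be illustrated with a sketch (assuming the usual Phobos map/sum; the thread's observation is that LDC at -O2 compiles both styles to the same code):

```d
import std.algorithm : map, sum;

double withLoop(const(double)[] values)
{
    double s = 0.0;
    foreach (v; values)
        s += v * v;
    return s;
}

double withRanges(const(double)[] values)
{
    // Lazy range pipeline; after inlining there is no abstraction left
    // for the optimiser to pay for.
    return values.map!(x => x * x).sum;
}

void main()
{
    auto values = [1.0, 2.0, 3.0];
    assert(withLoop(values) == withRanges(values));
}
```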
May 30
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/28/2019 2:49 AM, Atila Neves wrote:
 Much to my surprise, C, C++, D and Rust all had the same performance as each 
 other, independently of whether C++, D and Rust used ranges/algorithm/streams
or 
 plain loops. All done with -O2, all LLVM.
I'm not surprised. First off, because of inlining, etc., all are transformed into a simple loop. Then, the auto-vectorizer does the rest.
May 28
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, May 28, 2019 at 10:11:43AM -0700, Walter Bright via Digitalmars-d wrote:
 On 5/28/2019 2:49 AM, Atila Neves wrote:
 Much to my surprise, C, C++, D and Rust all had the same performance
 as each other, independently of whether C++, D and Rust used
 ranges/algorithm/streams or plain loops. All done with -O2, all
 LLVM.
I'm not surprised. First off, because of inlining, etc., all are transformed into a simple loop. Then, the auto-vectorizer does the rest.
Does dmd unroll loops yet? That appears to be a major cause of suboptimal codegen in dmd, last time I checked. Would be nice to improve this. T -- Some ideas are so stupid that only intellectuals could believe them. -- George Orwell
May 28
parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/28/2019 5:22 PM, H. S. Teoh wrote:
 Does dmd unroll loops yet? That appears to be a major cause of
 suboptimal codegen in dmd, last time I checked. Would be nice to improve
 this.
https://digitalmars.com/d/archives/digitalmars/D/I_wonder_how_fast_we_d_do_327428.html#N327436
May 29
prev sibling parent reply KnightMare <black80 bk.ru> writes:
On Tuesday, 28 May 2019 at 04:38:32 UTC, Andrei Alexandrescu 
wrote:
 https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
Full results for https://pastebin.com/j0T0MRmA (small changes to Marco de Wild's code).
Windows Server 2019, i7-3615QM, DMD 2.086.0, LDC 1.16.0-b1

C:\content\downloadz\dlang>ldc2 -release -O3 -mattr=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=42 ms, 714 μs, and 9 hnsecs   r=10922666154674544967680
t2=42 ms and 614 μs              r=10922666154674544967680
t3=0 hnsecs                      r=0
t4=42 ms, 474 μs, and 8 hnsecs   r=10922666154674544967680

C:\content\downloadz\dlang>dmd -release -O -mcpu=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=141 ms, 263 μs, and 5 hnsecs  r=10922666154673907433000
t2=143 ms, 128 μs, and 9 hnsecs  r=10922666154673907433000
t3=1 hnsec                       r=0
t4=491 ms, 829 μs, and 9 hnsecs  r=10922666154673907433000

1) DMD and LDC produce different sums (probably fast-math, I don't know)
2) t3=0 for d_with_sum. Let's see the LDC assembler (-output-s):

.def _D6times210d_with_sumFNaNbNfAdZd; .scl 2; .type 32; .endef
.section .text,"xr",discard,_D6times210d_with_sumFNaNbNfAdZd
.globl _D6times210d_with_sumFNaNbNfAdZd
.p2align 4, 0x90
_D6times210d_with_sumFNaNbNfAdZd:
    vxorps %xmm0, %xmm0, %xmm0
    retq    // this means "return 0"? cool optimization

3) for Windows it's better to print "us" instead of "μs" (when /SUBSYSTEM:CONSOLE)
May 28
parent KnightMare <black80 bk.ru> writes:
 https://pastebin.com/j0T0MRmA
 Windows Server 2019, i7-3615QM, DMD 2.086.0, LDC 1.16.0-b1
 2) t3=0 for d_with_sum. lets see assembler for LDC (-output-s):
People explained why the code is being thrown away:
https://forum.dlang.org/thread/wxesgcjznvwpdwpnxnej forum.dlang.org

I now get these results:

C:\content\downloadz\dlang>ldc2 -release -O3 -mattr=avx times2.d
C:\content\downloadz\dlang>times2.exe
t1=42 ms, 929 μs, and 1 hnsec    r=109226661546_74544967680
t2=42 ms and 578 μs              r=109226661546_74544967680
t3=333 ms, 539 μs, and 3 hnsecs  r=109226661546_66672259072
t4=42 ms, 631 μs, and 9 hnsecs   r=109226661546_74544967680

C:\content\downloadz\dlang>dmd -release -O -mcpu=avx times2.d
C:\content\downloadz\dlang>times2.exe
core.exception.OutOfMemoryError src\core\exception.d(702): Memory allocation failed
----------------

I have 16 GB RAM, 8 GB free (by Task Manager), and double[32M].sizeof = 256 MB. Removing the attributes from d_with_sum and main changed nothing. A little strange.
May 28