
digitalmars.D.learn - Request for timings

reply bearophile <bearophileHUGS lycos.com> writes:
This is the nbody benchmark of the Shootout site:
http://shootout.alioth.debian.org/u32/performance.php?test=nbody

The fastest version is a Fortran one, probably thanks to vector operations that
allow better SIMD vectorization.

This is the C++ version:
http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1

C++ version compiled with:
g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x
The input parameter is n = 3_000_000 (but use a larger number, like 10 million
or more, if the run time is too short).

First D2 version (serial):
http://codepad.org/AdRSm2wP

Second D2 version, three times slower because of the vector ops, more similar to the
Fortran version:
http://codepad.org/7O3mz9en

Is someone willing to take two (or more) timings with the LDC 2 compiler (I have
only LDC 1, besides DMD)? I'd like to know how long the first D version takes to
run *compared* to the C++ version :-) If you time the second D2 version too,
even better.

Bye and thank you,
bearophile
Nov 24 2011
next sibling parent reply Andrea Fontana <advmail katamail.com> writes:
Hmm, reading the code I found that I can gain about 10% by writing:

                    imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
                    imd mag = dt / (dSquared * dSquared^^2);

Around line 115. Give it a try...
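A minimal standalone sketch (mine, not from the linked program; the sample values
are made up) to check that this rewrite computes the same mag value as the
dt / (dSquared * sqrt(dSquared)) form used in the original code:

import std.math;
import std.stdio;

void main() {
    // Arbitrary sample values standing in for one body pair.
    double dx = 1.3, dy = -0.7, dz = 2.1, dt = 0.01;

    // Original form: one sqrt and one multiplication in the divisor.
    double s = dx ^^ 2 + dy ^^ 2 + dz ^^ 2;
    double magOld = dt / (s * sqrt(s));

    // Suggested rewrite: sqrt first, then dSquared * dSquared^^2 == dSquared^^3.
    double dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
    double magNew = dt / (dSquared * dSquared ^^ 2);

    // Both compute dt / (dx^2 + dy^2 + dz^2)^1.5, so they should agree closely.
    assert(abs(magOld - magNew) < 1e-12);
    writeln(magOld, " ", magNew);
}

Whether the compiler produces faster code for one form or the other is a separate
question; see the timings later in the thread.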

Nov 25 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrea Fontana:

 Hmm reading code i verified that i can gain a 10% writing:
 
                     imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
                     imd mag = dt / (dSquared * dSquared^^2);
 
 Around line 115. Give it a try... 
My version performs a sqrt and one multiplication, while your version performs one sqrt and two multiplications. On my PC my version is faster.

Bye,
bearophile
Nov 25 2011
parent reply Andrea Fontana <advmail katamail.com> writes:
Really? DMD version and system? (Here: DMD from git, Ubuntu 11.10, 64-bit.)

./app 50000000

Your version:
real	0m14.674s

My version:
real	0m13.644s

Your version:
imd dSquared = dx ^^ 2 + dy ^^ 2 + dz ^^ 2;
imd mag = dt / (dSquared * sqrt(dSquared));

My version:
imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
imd mag = dt / (dSquared * dSquared^^2);  // That is: dt / (dSquared^^3);

Probably variable evaluation works better in my example?

Btw:

imd dSquared = sqrt(dx*dx + dy*dy + dz*dz);
imd mag = dt / (dSquared*dSquared*dSquared);

real	0m13.574s
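For anyone who wants to reproduce this micro-comparison outside the full nbody
program, here is a rough standalone harness (mine; N and the fake inputs are
arbitrary) built on core.time.MonoTime:

import core.time : MonoTime;
import std.math;
import std.stdio : writefln;

enum N = 50_000_000;  // arbitrary iteration count, adjust to taste

double variantA(double dx, double dy, double dz, double dt) {
    immutable dSquared = dx * dx + dy * dy + dz * dz;
    return dt / (dSquared * sqrt(dSquared));      // original form
}

double variantB(double dx, double dy, double dz, double dt) {
    immutable d = sqrt(dx * dx + dy * dy + dz * dz);
    return dt / (d * d * d);                      // sqrt-first form
}

void run(string name, double function(double, double, double, double) f) {
    auto start = MonoTime.currTime;
    double acc = 0.0;
    foreach (i; 0 .. N)
        acc += f(1.0 + i % 7, 2.0, 3.0, 0.01);    // varying input defeats constant folding
    writefln("%-22s %s  (checksum %s)", name, MonoTime.currTime - start, acc);
}

void main() {
    run("dt / (s * sqrt(s)):", &variantA);
    run("dt / d^^3:", &variantB);
}

An isolated loop like this will not necessarily reproduce what happens inside the
full nbody inner loop, where inlining and the surrounding code change the picture.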


Nov 25 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrea Fontana Wrote:

Yes, really :-) Timings taken with DMD 2.056+ on a 32-bit Vista OS.

Bye,
bearophile
Nov 25 2011
parent reply Andrea Fontana <advmail katamail.com> writes:
Maybe Jerry could test my edits and check the timings...

About "^^": have you tried to benchmark

1/(x ^^ n) vs x ^^ -n ?  (with n>0)

On my setup the second version is very, very slow.
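A rough sketch of that micro-benchmark (mine; the loop count and the choice n = 3
are arbitrary), in case someone wants to reproduce it:

import core.time : MonoTime;
import std.math;
import std.stdio : writefln;

enum SIZE = 20_000_000;  // arbitrary iteration count

void main() {
    double acc1 = 0.0, acc2 = 0.0;

    auto t0 = MonoTime.currTime;
    foreach (i; 1 .. SIZE) {
        double x = i;
        acc1 += 1.0 / (x ^^ 3);   // 1/(x ^^ n), with n > 0
    }
    auto t1 = MonoTime.currTime;
    foreach (i; 1 .. SIZE) {
        double x = i;
        acc2 += x ^^ -3;          // x ^^ -n
    }
    auto t2 = MonoTime.currTime;

    writefln("1/(x^^n): %s  (sum %s)", t1 - t0, acc1);
    writefln("x^^-n:    %s  (sum %s)", t2 - t1, acc2);
}

Comparing the assembly the compiler emits for the two loops should show whether
the negative exponent ends up as a library pow call.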


Nov 26 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Andrea Fontana:

 About "^^": have you tried to benchmark
 
 1/(x ^^ n) vs x ^^ -n ?  (with n>0)
 
 On my setup second version is very very slow.
Take a look at the produced assembly code :-) Maybe x ^^ n is rewritten as simple code, while x ^^ -n calls the pow function.

Bye,
bearophile
Nov 26 2011
prev sibling parent reply Jerry <jlquinn optonline.net> writes:
bearophile <bearophileHUGS lycos.com> writes:

 This is the nbody benchmark of the Shootout site:
 http://shootout.alioth.debian.org/u32/performance.php?test=nbody

 The fastest version is a Fortran one, probably thanks to vector operations that
allow better SIMD vectorization.

 This is the C++ version:
 http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1

 C++ version compiled with:
 g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x
 The input parameter is n = 3_000_000 (but use a larger number, like 10 million
 or more, if the run time is too short).
All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2. I built with
both D and C++ enabled so the back end would be the same.

jlquinn wyvern:~/d/tests$ ~/gcc/gdc/bin/g++ -O3 -fomit-frame-pointer -march=native -lm -mfpmath=sse -msse3 --std=c++0x nbody.cc -o nbody_c++
jlquinn wyvern:~/d/tests$ time ./nbody_c++ 50000000
-0.169075164
-0.169059907

real	0m10.209s
user	0m10.180s
sys	0m0.010s
 First D2 version (serial):
 http://codepad.org/AdRSm2wP
~/gcc/gdc/bin/gdc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 -frelease nbody.d -o nbody_d
jlquinn wyvern:~/d/tests$ time ./nbody_d 50000000
-0.169075164
-0.169059907

real	0m9.830s
user	0m9.820s
sys	0m0.000s

jlquinn wyvern:~/d/tests$ dmd -O -release nbody.d
jlquinn wyvern:~/d/tests$ time ./nbody_d 50000000
-0.169075164
-0.169059907

real	0m9.828s
user	0m9.830s
sys	0m0.000s
 Second D2 version, three times slower thanks to vector ops, more similar to
the Fortran version:
 http://codepad.org/7O3mz9en
~/gcc/gdc/bin/gdc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 -frelease nbody2.d -o nbody2_d
jlquinn wyvern:~/d/tests$ time ./nbody2_d 50000000
-0.169075164
-0.169059907

real	0m26.805s
user	0m26.760s
sys	0m0.020s

jlquinn wyvern:~/d/tests$ dmd -O -release nbody2.d
jlquinn wyvern:~/d/tests$ time ./nbody2_d 50000000
-0.169075164
-0.169059907

real	0m26.777s
user	0m26.760s
sys	0m0.000s
 Is someone willing to take two (or more) timings using LDC 2 compiler (I have
LDC1 only, beside DMD)? I'd like to know how much time it takes to run the
first D version *compared* to the C++ version :-) If you time the second D2
version too, then it's better.

 Bye and thank you,
 bearophile
So the upshot seems to be that DMD and GDC generate similar code for this test. Both D compilers also generate slightly faster code than the C++ version, so either the D front end is doing a slightly better optimization job, or your first version is slightly more efficient code.

Jerry
Nov 25 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Thank you for your surprising timings, Jerry.

 All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2.  I built
 with both D and C++ enabled so the back end would be the same.
Is your system a 64-bit one?
 So, the upshot seems like DMD and GDC generate similar code for this test.
This is an uncommon thing, especially on 32-bit systems.
 And both D compilers generate slightly faster code than the C++
 version, therefore the D front end is doing a slightly better
 optimization job, or your first version is slightly more efficient code.
D1 code is often a bit slower than similar C++ code, but in this case I think D2 has allowed me to specify more semantics, and that has produced a faster program. The static foreach I have used in that D2 code not only looks cleaner than those C++0x template tricks, its assembly output is better too. And the first D2 program is not even the fastest possible: that second D2 program is slow today, but it contains more semantics that hopefully someday will allow it to become faster than the first one, and about as fast as that Fortran version.

This code currently doesn't compile (no vector ^^, no vector sqrt, no good sum function):

immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));

It is currently implemented like this:

double[NPAIR] distance = void;
foreach (i; Iota!(0, NPAIR))
    distance[i] = sqrt(r[i][0] ^^ 2 + r[i][1] ^^ 2 + r[i][2] ^^ 2);

Here those square roots are parallelizable: the compiler is allowed to use the SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5 instructions. With the ymm registers of AVX, the VSQRTPD instruction (intrinsic _mm256_sqrt_pd in lesser languages) does 4 double square roots at a time. But maybe its starting location needs to be aligned to 16 bytes (not currently supported syntax):

align(16) immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));

Bye,
bearophile
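For readers who have not seen the Iota trick used above: roughly, it builds a
compile-time sequence of integers, so the foreach is unrolled at compile time.
A sketch of one possible definition (the actual one in the linked code may differ
in details):

import std.meta : AliasSeq;

// Compile-time sequence of the integers in [start, stop).  A foreach over it
// is a "static foreach": the body is unrolled and every index is a
// compile-time constant, which helps the optimizer and the vectorizer.
template Iota(int start, int stop) {
    static if (start >= stop)
        alias Iota = AliasSeq!();
    else
        alias Iota = AliasSeq!(start, Iota!(start + 1, stop));
}

// Usage, mirroring the distance loop above:
//   foreach (i; Iota!(0, NPAIR))
//       distance[i] = sqrt(r[i][0] ^^ 2 + r[i][1] ^^ 2 + r[i][2] ^^ 2);

(std.meta.AliasSeq is the current name of what std.typetuple called TypeTuple in
the D2 of that era.)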
Nov 26 2011
parent bearophile <bearophileHUGS lycos.com> writes:
 Here those square roots are parallelizable: the compiler is allowed to use the
SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5 instructions.
With the ymm registers of AVX, the VSQRTPD instruction (intrinsic _mm256_sqrt_pd
in lesser languages) does 4 double square roots at a time. But maybe its starting
location needs to be aligned to 16 bytes (not currently supported syntax):
The 32-bit assembly produced by the Intel Fortran compiler on that code is heavily optimized and fully inlined:
http://codepad.org/h1ilZWVu

It uses only serial square roots (sqrtsd), so the performance improvement has other causes that I don't know. This also probably means the Fortran version is not the fastest version possible.

Bye,
bearophile
Nov 27 2011