digitalmars.D.learn - Request for timings
- bearophile (15/15) Nov 24 2011 This is the nbody benchmark of the Shootout site:
- Andrea Fontana (13/35) Nov 25 2011 Hmm, reading the code I verified that I can gain about 10% by writing:
- bearophile (4/10) Nov 25 2011 My version performs a sqrt and one multiplication, while your version pe...
- Andrea Fontana (21/34) Nov 25 2011 Really? DMD version and system? (here: dmd from git, Ubuntu 11.10 64-bit)
- bearophile (4/4) Nov 25 2011 Andrea Fontana Wrote:
- Andrea Fontana (5/9) Nov 26 2011 Maybe Jerry could test my edits and check for timing...
- bearophile (5/10) Nov 26 2011 Take a look at the produced assembly code :-)
- Jerry (43/59) Nov 25 2011 All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2. I built
- bearophile (15/21) Nov 26 2011 This is an uncommon thing, especially on 32-bit systems.
- bearophile (5/6) Nov 27 2011 The 32bit assembly produced by the Intel Fortran compiler on that code, ...
This is the nbody benchmark of the Shootout site:
http://shootout.alioth.debian.org/u32/performance.php?test=nbody

The fastest version is a Fortran one, probably thanks to vector operations that allow a better SIMD vectorization.

This is the C++ version:
http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gpp&id=1

C++ version compiled with:
g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x

An input parameter is n = 3_000_000 (but use a larger number, like 10 million or more, if the run time is too short).

First D2 version (serial): http://codepad.org/AdRSm2wP

Second D2 version, three times slower because of the vector ops, more similar to the Fortran version: http://codepad.org/7O3mz9en

Is someone willing to take two (or more) timings using the LDC 2 compiler (I have only LDC 1, besides DMD)? I'd like to know how much time it takes to run the first D version *compared* to the C++ version :-) If you also time the second D2 version, even better.

Bye and thank you,
bearophile
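For reference, since the codepad links may no longer resolve, below is a minimal D sketch of the pairwise inner loop the rest of the thread argues about. It is not the actual codepad source: the struct layout, the names Body, bodies and advance, the small main, and the assumption that "imd" is an alias for immutable(double) are all illustrative; the thread itself only shows the two lines near "line 115".

    // Sketch only, not the actual codepad program. "imd" is assumed to be
    // an alias for immutable(double), matching how it is used in the thread.
    import std.math;
    import std.stdio;

    alias imd = immutable(double);

    struct Body {
        double x, y, z, vx, vy, vz, mass;
    }

    void advance(Body[] bodies, in double dt) {
        // Update velocities from every distinct pair of bodies.
        foreach (i, ref bi; bodies) {
            foreach (ref bj; bodies[i + 1 .. $]) {
                imd dx = bi.x - bj.x;
                imd dy = bi.y - bj.y;
                imd dz = bi.z - bj.z;

                // The two lines the replies below modify ("around line 115"):
                imd dSquared = dx ^^ 2 + dy ^^ 2 + dz ^^ 2;
                imd mag = dt / (dSquared * sqrt(dSquared));

                bi.vx -= dx * bj.mass * mag;
                bi.vy -= dy * bj.mass * mag;
                bi.vz -= dz * bj.mass * mag;
                bj.vx += dx * bi.mass * mag;
                bj.vy += dy * bi.mass * mag;
                bj.vz += dz * bi.mass * mag;
            }
        }
        // Then advance the positions.
        foreach (ref b; bodies) {
            b.x += dt * b.vx;
            b.y += dt * b.vy;
            b.z += dt * b.vz;
        }
    }

    void main() {
        // Two made-up bodies, just to exercise the loop.
        auto bodies = [Body(0, 0, 0, 0, 0, 0, 1.0),
                       Body(1, 0, 0, 0, 1, 0, 1e-3)];
        foreach (_; 0 .. 1000)
            advance(bodies, 0.01);
        writeln(bodies[0].x, " ", bodies[1].x);
    }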
Nov 24 2011
Hmm, reading the code I verified that I can gain about 10% by writing:

    imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
    imd mag = dt / (dSquared * dSquared^^2);

around line 115. Give it a try...

On Thu, 24/11/2011 at 22.37 -0500, bearophile wrote:
> This is the nbody benchmark of the Shootout site:
> http://shootout.alioth.debian.org/u32/performance.php?test=nbody
> [...]
Nov 25 2011
Andrea Fontana:
> Hmm, reading the code I verified that I can gain about 10% by writing:
>
>     imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
>     imd mag = dt / (dSquared * dSquared^^2);
>
> Around line 115. Give it a try...

My version performs a sqrt and one multiplication, while your version performs one sqrt and two multiplications. On my PC my version is faster.

Bye,
bearophile
Nov 25 2011
Really? DMD version and system? (Here: dmd from git, Ubuntu 11.10 64-bit.)

./app 50000000

Your version: real 0m14.674s
My version:   real 0m13.644s

Your version:

    imd dSquared = dx ^^ 2 + dy ^^ 2 + dz ^^ 2;
    imd mag = dt / (dSquared * sqrt(dSquared));

My version:

    imd dSquared = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
    imd mag = dt / (dSquared * dSquared^^2); // That is: dt / (dSquared^^3);

Probably variable evaluation works better in my example?

Btw:

    imd dSquared = sqrt(dx*dx + dy*dy + dz*dz);
    imd mag = dt / (dSquared*dSquared*dSquared);

real 0m13.574s

On Fri, 25/11/2011 at 08.14 -0500, bearophile wrote:
> Andrea Fontana:
>> Hmm, reading the code I verified that I can gain about 10% by writing: [...]
>
> My version performs a sqrt and one multiplication, while your version
> performs one sqrt and two multiplications. On my PC my version is faster.
>
> Bye,
> bearophile
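For readers who want to reproduce this comparison without the full nbody program, here is a hedged micro-benchmark sketch of just the three formulations above. It is not from the thread: the constants, the loop shape, and the module std.datetime.stopwatch (in 2011 StopWatch lived in plain std.datetime) are assumptions, and the loop index is fed into dx so the compiler cannot hoist the whole expression out of the loop.

    // Minimal sketch comparing the three distance formulas discussed above.
    // Not part of the original benchmark; names and numbers are illustrative.
    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.math;
    import std.stdio : writefln;

    void main() {
        enum N = 50_000_000;
        immutable double dy = 2.2, dz = 3.3, dt = 0.01;
        double acc;
        auto sw = StopWatch(AutoStart.yes);

        // bearophile's version: sum of squares, then d2 * sqrt(d2).
        acc = 0;
        sw.reset();
        foreach (i; 0 .. N) {
            immutable double dx = 1.0 + i * 1e-9;  // defeat loop-invariant hoisting
            immutable d2 = dx ^^ 2 + dy ^^ 2 + dz ^^ 2;
            acc += dt / (d2 * sqrt(d2));
        }
        writefln("d2 * sqrt(d2):      %s ms (%s)", sw.peek.total!"msecs", acc);

        // Andrea's version: sqrt first, then cube via ^^.
        acc = 0;
        sw.reset();
        foreach (i; 0 .. N) {
            immutable double dx = 1.0 + i * 1e-9;
            immutable d = sqrt(dx ^^ 2 + dy ^^ 2 + dz ^^ 2);
            acc += dt / (d * d ^^ 2);
        }
        writefln("sqrt then d * d^^2: %s ms (%s)", sw.peek.total!"msecs", acc);

        // Andrea's "btw" variant: sqrt first, then plain multiplications.
        acc = 0;
        sw.reset();
        foreach (i; 0 .. N) {
            immutable double dx = 1.0 + i * 1e-9;
            immutable d = sqrt(dx * dx + dy * dy + dz * dz);
            acc += dt / (d * d * d);
        }
        writefln("sqrt then d*d*d:    %s ms (%s)", sw.peek.total!"msecs", acc);
    }

The accumulator is printed so the compiler cannot discard the loops entirely; results will of course vary by compiler and back end, which is the point of the thread.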
Nov 25 2011
Andrea Fontana Wrote:

Yes, really :-) Timings taken with DMD 2.056+, 32-bit Vista OS.

Bye,
bearophile
Nov 25 2011
Maybe Jerry could test my edits and check the timings...

About "^^": have you tried to benchmark 1/(x ^^ n) vs x ^^ -n? (With n > 0.) On my setup the second version is very, very slow.

== Quoted from the article by bearophile (bearophileHUGS lycos.com):
> Andrea Fontana Wrote:
>
> Yes, really :-) Timings taken with DMD 2.056+, 32-bit Vista OS.
>
> Bye,
> bearophile
Nov 26 2011
Andrea Fontana:
> About "^^": have you tried to benchmark 1/(x ^^ n) vs x ^^ -n? (With n > 0.)
> On my setup the second version is very, very slow.

Take a look at the produced assembly code :-) Maybe x ^^ n is rewritten as simple code, while x ^^ -n calls the pow function.

Bye,
bearophile
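To check that hypothesis concretely, here is a small hedged sketch, not from the thread, that times 1/(x ^^ 3) against x ^^ -3 (N, the loop shape, and the chosen exponent 3 are illustrative assumptions). Inspecting the assembly the compiler emits for the two loops is the way to confirm whether the positive literal exponent is turned into multiplications while the negative one goes through std.math.pow.

    // Sketch only: times the two spellings Andrea asks about. The loop index
    // feeds x so the expression is not loop-invariant.
    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.math;
    import std.stdio : writefln;

    void main() {
        enum N = 50_000_000;
        double sink;
        auto sw = StopWatch(AutoStart.yes);

        sink = 0;
        sw.reset();
        foreach (i; 1 .. N) {
            immutable double x = 1.0 + i * 1e-9;
            sink += 1.0 / (x ^^ 3);   // reciprocal of a positive power
        }
        writefln("1 / (x ^^ 3): %s ms (%s)", sw.peek.total!"msecs", sink);

        sink = 0;
        sw.reset();
        foreach (i; 1 .. N) {
            immutable double x = 1.0 + i * 1e-9;
            sink += x ^^ -3;          // negative literal exponent
        }
        writefln("x ^^ -3:      %s ms (%s)", sw.peek.total!"msecs", sink);
    }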
Nov 26 2011
bearophile <bearophileHUGS lycos.com> writes:
> This is the nbody benchmark of the Shootout site:
> http://shootout.alioth.debian.org/u32/performance.php?test=nbody
> [...]
> C++ version compiled with:
> g++ -Ofast -fomit-frame-pointer -march=native -mfpmath=sse -msse3 --std=c++0x

All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2. I built with both D and C++ enabled so the back end would be the same.

jlquinn wyvern:~/d/tests$ ~/gcc/gdc/bin/g++ -O3 -fomit-frame-pointer -march=native -lm -mfpmath=sse -msse3 --std=c++0x nbody.cc -o nbody_c++
jlquinn wyvern:~/d/tests$ time ./nbody_c++ 50000000
-0.169075164
-0.169059907

real    0m10.209s
user    0m10.180s
sys     0m0.010s

> First D2 version (serial): http://codepad.org/AdRSm2wP

~/gcc/gdc/bin/gdc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 -frelease nbody.d -o nbody_d
jlquinn wyvern:~/d/tests$ time ./nbody_d 50000000
-0.169075164
-0.169059907

real    0m9.830s
user    0m9.820s
sys     0m0.000s

jlquinn wyvern:~/d/tests$ dmd -O -release nbody.d
jlquinn wyvern:~/d/tests$ time ./nbody_d 50000000
-0.169075164
-0.169059907

real    0m9.828s
user    0m9.830s
sys     0m0.000s

> Second D2 version, three times slower because of the vector ops, more
> similar to the Fortran version: http://codepad.org/7O3mz9en

~/gcc/gdc/bin/gdc -O3 -fomit-frame-pointer -march=native -mfpmath=sse -msse3 -frelease nbody2.d -o nbody2_d
jlquinn wyvern:~/d/tests$ time ./nbody2_d 50000000
-0.169075164
-0.169059907

real    0m26.805s
user    0m26.760s
sys     0m0.020s

jlquinn wyvern:~/d/tests$ dmd -O -release nbody2.d
jlquinn wyvern:~/d/tests$ time ./nbody2_d 50000000
-0.169075164
-0.169059907

real    0m26.777s
user    0m26.760s
sys     0m0.000s

> Is someone willing to take two (or more) timings using the LDC 2 compiler
> [...]

So the upshot seems to be that DMD and GDC generate similarly fast code for this test. And both D compilers generate slightly faster code than the C++ version, so either the D front end is doing a slightly better optimization job, or your first version is slightly more efficient code.

Jerry
Nov 25 2011
Thank you for your surprising timings, Jerry.

> All timings done with gdc 0.30 using dmd 2.055 and gcc 4.6.2. I built
> with both D and C++ enabled so the back end would be the same.

Is your system a 64-bit one?

> So the upshot seems to be that DMD and GDC generate similarly fast code
> for this test.

This is an uncommon thing, especially on 32-bit systems.

> And both D compilers generate slightly faster code than the C++ version,
> so either the D front end is doing a slightly better optimization job, or
> your first version is slightly more efficient code.

D1 code is often a bit slower than similar C++ code, but in this case I think D2 has allowed me to specify more semantics, and that has produced a faster program. The static foreach I have used in that D2 code not only looks cleaner than those C++0x template tricks, the assembly output is better too.

And the first D2 program is not even the fastest possible: that second D2 program is slow today, but it contains more semantics that hopefully will someday allow it to become faster than the first one, and about as fast as that Fortran version.

This code currently doesn't compile (no vector ^^, no vector sqrt, no good sum function):

    immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));

It is currently implemented like this:

    double[NPAIR] distance = void;
    foreach (i; Iota!(0, NPAIR))
        distance[i] = sqrt(r[i][0] ^^ 2 + r[i][1] ^^ 2 + r[i][2] ^^ 2);

Here those square roots are parallelizable: the compiler is allowed to use the SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5 instructions. With the ymm registers of AVX the VSQRTPD instruction (intrinsic _mm256_sqrt_pd in lesser languages) does 4 double square roots at a time. But maybe its starting location needs to be aligned to 16 bytes (not currently supported syntax):

    align(16) immutable double[NPAIR] distance = sqrt(sum(r[] ^^ 2, dim=0));

Bye,
bearophile
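For readers who have not seen the compile-time Iota idiom referenced above, here is a hedged sketch of it; the actual codepad program is not available, so this is only an assumed formulation (today built on AliasSeq from std.meta, which in 2011 was TypeTuple from std.typetuple), and NPAIR and computeDistances are illustrative names.

    // Sketch of the Iota!(a, b) compile-time sequence used for the unrolled
    // foreach above. Not the actual benchmark source.
    import std.math;
    import std.meta : AliasSeq;   // TypeTuple in std.typetuple, circa 2011

    template Iota(int start, int stop) {
        static if (start >= stop)
            alias Iota = AliasSeq!();
        else
            alias Iota = AliasSeq!(start, Iota!(start + 1, stop));
    }

    enum NPAIR = 10;   // 5 bodies -> 10 distinct pairs, as in the benchmark

    void computeDistances(ref const double[3][NPAIR] r,
                          ref double[NPAIR] distance) {
        // The foreach below is unrolled at compile time, so every index is a
        // constant and the ten square roots are independent of each other,
        // which is what leaves the compiler free to pair them into sqrtpd.
        foreach (i; Iota!(0, NPAIR))
            distance[i] = sqrt(r[i][0] ^^ 2 + r[i][1] ^^ 2 + r[i][2] ^^ 2);
    }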
Nov 26 2011
> Here those square roots are parallelizable: the compiler is allowed to use
> the SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5
> instructions. [...]

The 32-bit assembly produced by the Intel Fortran compiler on that code is heavily optimized and fully inlined: http://codepad.org/h1ilZWVu

It uses only serial square roots (sqrtsd), so the performance improvement has other causes that I don't know. This also probably means the Fortran version is not the fastest version possible.

Bye,
bearophile
Nov 27 2011