digitalmars.D.learn - N-body bench
- bearophile (8/8) Jan 24 2014 If someone is willing to test LDC2 with a known benchmark,
- Jerry (5/9) Jan 28 2014 Just playing with the C++ version in gcc 4.7.3, I see a significant
- Stanislav Blinov (4/12) Jan 29 2014 Hmm.. How would one use core.simd with LDC2? It doesn't seem to
- bearophile (6/9) Jan 29 2014 I don't know if this is useful for you, but here I wrote a basic
- Stanislav Blinov (8/17) Jan 29 2014 I meant how to make it compile with ldc2? I've translated the
- bearophile (9/16) Jan 29 2014 Usually for me ldc2 works with simd. Perhaps you have to show us
- Stanislav Blinov (10/27) Jan 29 2014 It's a direct translation of that C++ code:
- Stanislav Blinov (3/3) Jan 29 2014 Regarding dmd it looks awfully similar to this:
- Stanislav Blinov (18/18) Jan 30 2014 On Wednesday, 29 January 2014 at 18:05:41 UTC, Stanislav Blinov
- Stanislav Blinov (23/23) Jan 30 2014 Ok, didn't need to wait for the weekend :)
- Stanislav Blinov (17/17) Jan 30 2014 On Thursday, 30 January 2014 at 14:17:16 UTC, Stanislav Blinov
- bearophile (7/9) Jan 30 2014 Is the latest link shown the last version?
- Stanislav Blinov (5/12) Jan 30 2014 No. In toDouble2() on line 13:
- bearophile (17/18) Jan 30 2014 Yes. The older version of LDC2 doesn't even compile the code. I
- Stanislav Blinov (7/16) Jan 30 2014 That won't compile with dmd (at least, with 2.064.2): it expects
- bearophile (7/14) Jan 30 2014 First let me try to fiddle with the code some more :-)
- bearophile (4/8) Jan 30 2014 Few more changes, but this version still lacks the toDouble2:
- Stanislav Blinov (5/5) Jan 30 2014 On Thursday, 30 January 2014 at 18:29:42 UTC, bearophile wrote:
- bearophile (10/12) Jan 30 2014 All my versions of ldc2 don't even accept -vectorize :-)
- bearophile (14/16) Jan 30 2014 It's a very silly problem for a statically typed language. The D
- Stanislav Blinov (20/23) Jan 30 2014 I agree.
- bearophile (35/41) Jan 30 2014 That should be impossible, as I remember from my old profilings
- Stanislav Blinov (17/53) Jan 30 2014 :)
- bearophile (6/10) Jan 30 2014 If a function takes no time to run, and you tweak it, your
- Stanislav Blinov (11/17) Jan 30 2014 Thanks.
- bearophile (13/19) Jan 30 2014 You seem to have a quite recent CPU, as the G++ code contains
- Stanislav Blinov (18/21) Jan 30 2014 Hmm...
- bearophile (7/23) Jan 30 2014 Now the ldc2-compile runs in 4 seconds, this sounds correct. If
- bearophile (49/49) Jan 30 2014 Since my post someone has added a Fortran version based on the
- Stanislav Blinov (10/13) Jan 30 2014 Yup, I saw it. They're cheating, they almost don't have to
- Stanislav Blinov (2/2) Jan 30 2014 Gah! G'Kar moment...
If someone is willing to test LDC2 with a known benchmark, there's this one:
http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

A reformatted C++11 version, good as a starting point for a D translation:
http://codepad.org/4mOHW0fz

Bye,
bearophile
Jan 24 2014
"bearophile" <bearophileHUGS lycos.com> writes:
> If someone is willing to test LDC2 with a known benchmark, there's this one:
> http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody
>
> A reformatted C++11 version, good as a starting point for a D translation:
> http://codepad.org/4mOHW0fz

Just playing with the C++ version in gcc 4.7.3, I see a significant speedup by using -funroll-loops. You might want to make sure that's enabled.

Jerry
Jan 28 2014
On Friday, 24 January 2014 at 15:56:26 UTC, bearophile wrote:
> If someone is willing to test LDC2 with a known benchmark, there's this one:
> http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody
>
> A reformatted C++11 version, good as a starting point for a D translation:
> http://codepad.org/4mOHW0fz
>
> Bye,
> bearophile

Hmm.. How would one use core.simd with LDC2? It doesn't seem to define D_SIMD. Or should I go for builtins?
Jan 29 2014
Stanislav Blinov:
> Hmm.. How would one use core.simd with LDC2? It doesn't seem to define D_SIMD. Or should I go for builtins?

I don't know if this is useful for you, but here I wrote a basic usage example of SIMD in ldc2 (second D entry):
http://rosettacode.org/wiki/Four_bits_adder#D

Bye,
bearophile
Jan 29 2014
On Wednesday, 29 January 2014 at 16:43:35 UTC, bearophile wrote:
> Stanislav Blinov:
>> Hmm.. How would one use core.simd with LDC2? It doesn't seem to define D_SIMD. Or should I go for builtins?
> I don't know if this is useful for you, but here I wrote a basic usage example of SIMD in ldc2 (second D entry):
> http://rosettacode.org/wiki/Four_bits_adder#D
>
> Bye,
> bearophile

I meant how to make it compile with ldc2? I've translated the code, it compiles and works with dmd (although it segfaults in -release mode for some reason, probably a bug somewhere). But with ldc2:

nbody.d(68): Error: undefined identifier __simd
nbody.d(68): Error: undefined identifier XMM

Those are needed for that sqrt reciprocal call.
Jan 29 2014
Stanislav Blinov:
> I meant how to make it compile with ldc2? I've translated the code, it compiles and works with dmd (although it segfaults in -release mode for some reason, probably a bug somewhere). But with ldc2:
>
> nbody.d(68): Error: undefined identifier __simd
> nbody.d(68): Error: undefined identifier XMM
>
> Those are needed for that sqrt reciprocal call.

Usually for me ldc2 works with simd. Perhaps you have to show us the code, ask for help in the ldc newsgroup, or ask for help in the #ldc IRC channel.

Regarding dmd with -release, I suggest you minimize the code and put the problem in Bugzilla. Benchmarks are also useful to find and fix compiler bugs.

Bye,
bearophile
Jan 29 2014
On Wednesday, 29 January 2014 at 16:54:54 UTC, bearophile wrote:
> Stanislav Blinov:
>> I meant how to make it compile with ldc2? I've translated the code, it compiles and works with dmd (although it segfaults in -release mode for some reason, probably a bug somewhere). But with ldc2:
>>
>> nbody.d(68): Error: undefined identifier __simd
>> nbody.d(68): Error: undefined identifier XMM
>>
>> Those are needed for that sqrt reciprocal call.
> Usually for me ldc2 works with simd. Perhaps you have to show us the code, ask for help in the ldc newsgroup, or ask for help in the #ldc IRC channel.

It's a direct translation of that C++ code:
http://dpaste.dzfl.pl/89517fd0bf8fa

This line:

    distance = __simd(XMM.CVTPS2PD, __simd(XMM.RSQRTPS, __simd(XMM.CVTPD2PS, dsquared)));

The XMM enum and __simd functions are defined only when the D_SIMD version is set. ldc2 doesn't seem to set this, unless I'm missing some sort of compiler switch.

> Regarding dmd with -release, I suggest you minimize the code and put the problem in Bugzilla. Benchmarks are also useful to find and fix compiler bugs.

I'm already onto it :)
Jan 29 2014
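For anyone following along, here is a minimal sketch of how a translation might guard the D_SIMD-only path and still run elsewhere. The names hasDSimd and approxInvSqrt are made up for illustration and are not taken from the pasted code. The fallback is plain scalar code that mirrors the narrow, reciprocal square root, widen chain; it is an assumption about what a portable stand-in could look like, not what dmd or ldc2 emit.

```d
import std.math : sqrt;
import std.stdio : writeln;

// D_SIMD is the predefined version identifier discussed in the thread.
// hasDSimd is a hypothetical helper that records whether it is defined.
version (D_SIMD)
    enum bool hasDSimd = true;
else
    enum bool hasDSimd = false;

// Scalar stand-in for the CVTPD2PS -> RSQRTPS -> CVTPS2PD chain:
// narrow each lane to float, take 1/sqrt in single precision, widen again.
// RSQRTPS is only an estimate; this fallback does a real float sqrt and
// division, so its results are at least as accurate as the hardware estimate.
double[2] approxInvSqrt(const double[2] dsquared)
{
    double[2] result;
    foreach (i, d; dsquared)
        result[i] = 1.0f / sqrt(cast(float) d);
    return result;
}

void main()
{
    writeln("D_SIMD defined: ", hasDSimd);
    writeln(approxInvSqrt([4.0, 16.0]));
}
```

A real port would put the __simd expression in a version (D_SIMD) branch and something like the function above (or an LDC builtin) in the else branch.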
Regarding dmd it looks awfully similar to this: http://d.puremagic.com/issues/show_bug.cgi?id=9449 I'd need to do some more runs though.
Jan 29 2014
On Wednesday, 29 January 2014 at 18:05:41 UTC, Stanislav Blinov wrote:

Yep, doesn't seem to be simd-related:

    struct S(T) { T v1, v2; }

    void main() {
        alias T = double; // integrals and float are ok :\

        version (workaround) {
            S!T[1] p = void;
        } else {
            S!T[1] p;
        }
    }

Anyway, here's the revised (and bugfixed :o)) code, if anyone's interested:
http://dpaste.dzfl.pl/52d9e1fdc0fd

On my machine, dmd -release -O -inline -noboundscheck is only 6 times slower than that C++ version :D

I'll try to get around to making it work with ldc on the weekend.
Jan 30 2014
Ok, didn't need to wait for the weekend :)

Looks like both dmd and ldc don't optimize slice operations yet, had to revert to loops (shaved off ~1.5 seconds for ldc, ~9 seconds for dmd). Also, my local pull of ldc had some issues with to!int(string), reverted that to atoi :)

Here's the code:
http://dpaste.dzfl.pl/4b6df0771696

C++ version compiled with the provided flags.
dmd -release -O -inline -noboundscheck
ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops

Here are the results on my machine (i3 2100 3.1GHz):

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.20 real, 5.18 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:07.84 real, 7.82 user, 0.00 sys, 1324 kb, 99% cpu

time ./nbody-dmd 50000000:
-0.169075164
-0.169059907
0:23.35 real, 23.29 user, 0.00 sys, 1184 kb, 99% cpu
Jan 30 2014
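To make the "slice operations vs. loops" remark concrete, here is a small sketch of the two forms being compared. The function names are hypothetical and this is not the code from the paste; it only shows that the array-operation form and the hand-written loop compute the same thing, so the choice between them is purely about what the compiler generates.

```d
import std.stdio : writeln;

// Array-operation ("slice") form: concise, but per the thread the 2014
// compilers did not turn this into tight code for fixed-size arrays.
void subScaledSlice(ref double[3] v, const double[3] r, double k)
{
    v[] -= r[] * k;
}

// Explicit loop form: same arithmetic, element by element.
void subScaledLoop(ref double[3] v, const double[3] r, double k)
{
    foreach (i; 0 .. 3)
        v[i] -= r[i] * k;
}

void main()
{
    double[3] a = [1.0, 2.0, 3.0];
    double[3] b = a;
    immutable double[3] r = [0.5, 0.25, 0.125];

    subScaledSlice(a, r, 2.0);
    subScaledLoop(b, r, 2.0);

    assert(a == b); // both forms give identical results
    writeln(a);
}
```

Whether the slice form is still slower depends on the compiler version, so it is worth timing both before rewriting anything.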
On Thursday, 30 January 2014 at 14:17:16 UTC, Stanislav Blinov wrote:

Forgot one slice assignment in toDouble2(). Now the results are more interesting:

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.20 real, 5.18 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:05.94 real, 5.92 user, 0.00 sys, 1320 kb, 99% cpu

time ./nbody-dmd 50000000:
-0.169075164
-0.169059907
0:19.62 real, 19.57 user, 0.00 sys, 1188 kb, 99% cpu

:)
Jan 30 2014
Stanislav Blinov:
> Forgot one slice assignment in toDouble2(). Now the results are more interesting:

Is the latest link shown the last version?

I need the 0.13.0-alpha1 to compile the code. I am seeing a significant performance difference between C++ and D-ldc2.

Bye,
bearophile
Jan 30 2014
On Thursday, 30 January 2014 at 15:37:24 UTC, bearophile wrote:
> Stanislav Blinov:
>> Forgot one slice assignment in toDouble2(). Now the results are more interesting:
> Is the latest link shown the last version?

No. In toDouble2() on line 13, replace

    result.array = args[0]

with

    result.array[0] = args[0];
    result.array[1] = args[0];

> I need the 0.13.0-alpha1 to compile the code. I am seeing a significant performance difference between C++ and D-ldc2.

You mean with your current version of ldc?
Jan 30 2014
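The pasted toDouble2() is not reproduced in the thread, so the following is only a guess at its shape: a helper that fills a core.simd double2 lane by lane through .array, which is the element-wise style the fix above switches to. It assumes a target where core.simd defines double2 (for example x86_64); the one-argument overload that broadcasts a single value is likewise an assumption.

```d
import core.simd;
import std.stdio : writeln;

// Hypothetical reconstruction: build a double2 from two scalars by writing
// each lane through .array, avoiding the whole-array (slice) assignment.
double2 toDouble2(double a, double b)
{
    double2 result = void; // both lanes are assigned below
    result.array[0] = a;
    result.array[1] = b;
    return result;
}

// Broadcast variant: the same value in both lanes.
double2 toDouble2(double a)
{
    return toDouble2(a, a);
}

void main()
{
    auto v = toDouble2(1.5, 2.5);
    auto s = v + toDouble2(1.0); // lane-wise vector addition
    writeln(s.array);
}
```

A runtime helper like this sidesteps the "constants as initializers" restriction mentioned later in the thread for dmd 2.064.2.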
Stanislav Blinov:
> You mean with your current version of ldc?

Yes. The older version of LDC2 doesn't even compile the code. I need to use 0.13.0-alpha1.

Your D code with small changes:
http://codepad.org/xqqScd42

Asm generated by G++ for the advance function (the one that uses most of the run time):
http://codepad.org/tApRNsVy

Asm generated by ldc2:
http://codepad.org/jKSJcOAZ

With N = 5_000_000 my timings on an old CPU are 2.23 seconds for ldc2 and 1.83 seconds for g++. So there's some performance difference.

I have tried to manually unroll the loop in the D code, but I see worse performance. I'll try some more later.

Bye,
bearophile
Jan 30 2014
On Thursday, 30 January 2014 at 16:53:22 UTC, bearophile wrote:
> Yes. The older version of LDC2 doesn't even compile the code. I need to use 0.13.0-alpha1.

Hmm.

> Your D code with small changes:
> http://codepad.org/xqqScd42

That won't compile with dmd (at least, with 2.064.2): it expects constants as initializers for vectors. :( That's why I rolled up that toDouble2() function.

> With N = 5_000_000 my timings on an old CPU are 2.23 seconds for ldc2 and 1.83 seconds for g++. So there's some performance difference.

What about 50_000_000?

> I have tried to unroll manually the loop in the D code, but I see worse performance. I'll try some more later.

I'm also fiddling :)
Jan 30 2014
Stanislav Blinov:
> That won't compile with dmd (at least, with 2.064.2): it expects constants as initializers for vectors. :( That's why I rolled up that toDouble2() function.

I see. Then probably I will have to put it back...

>> With N = 5_000_000 my timings on an old CPU are 2.23 seconds for ldc2 and 1.83 seconds for g++. So there's some performance difference.
> What about 50_000_000?

First let me try to fiddle with the code some more :-)

Once done, this should go somewhere (like the wiki) as a simple example of SIMD usage in D.

Bye,
bearophile
Jan 30 2014
Stanislav Blinov:
> That won't compile with dmd (at least, with 2.064.2): it expects constants as initializers for vectors. :( That's why I rolled up that toDouble2() function.

Few more changes, but this version still lacks the toDouble2:
http://codepad.org/SpMprWym

Bye,
bearophile
Jan 30 2014
On Thursday, 30 January 2014 at 18:29:42 UTC, bearophile wrote:

I see you're compiling with

ldmd2 -wi -O -release -inline -noboundscheck nbody.d

Try

ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops
Jan 30 2014
Stanislav Blinov:
> ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops

All my versions of ldc2 don't even accept -vectorize :-)

ldc2: Unknown command line argument '-vectorize'. Try: 'ldc2 -help'
ldc2: Did you mean '-vectorize-slp'?

And -vectorize-loops should be active by default on recent versions of ldc2 (including V.0.12.1), and indeed I see no performance difference in using it.

Bye,
bearophile
Jan 30 2014
Stanislav Blinov:
> Looks like both dmd and ldc don't optimize slice operations yet, had to revert to loops

It's a very silly problem for a statically typed language. The D type system knows the static length of those arrays, but it doesn't use such information.

(Similarly, several algorithms in Phobos force you to throw away this very precious compile-time information by requiring dynamic arrays as input.)

I have just suggested a fix for ldc2:
http://forum.dlang.org/thread/qeytzeqnygxpocywyifp forum.dlang.org

I have had similar enhancement requests in Bugzilla for some time:
https://d.puremagic.com/issues/show_bug.cgi?id=10523
https://d.puremagic.com/issues/show_bug.cgi?id=10305

Bye,
bearophile
Jan 30 2014
On Thursday, 30 January 2014 at 18:43:02 UTC, bearophile wrote:
> It's a very silly problem for a statically typed language. The D type system knows the static length of those arrays, but it doesn't use such information.

I agree.

Unrolling everything except the loop in energy() seems to have squeezed the bits needed to outperform c++, at least on my machine :)

http://dpaste.dzfl.pl/45e98e476daf

(I'm sticking to atoi because my copy of ldc seems to have an issue in std.conv).

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.15 real, 5.14 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:04.41 real, 4.40 user, 0.00 sys, 1308 kb, 99% cpu

time ./nbody-dmd 50000000:
-0.169075164
-0.169059907
0:15.39 real, 15.34 user, 0.00 sys, 1192 kb, 99% cpu
Jan 30 2014
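As an illustration of what "unrolling" a fixed-length loop can look like in D, here is a sketch that uses static foreach, a language feature newer than the compilers in this thread (the thread's code used TypeTuple/Iota tricks for the same effect). The function applyPair and its parameters are hypothetical; it applies the velocity update for one pair of bodies with the three-component loop expanded at compile time.

```d
import std.stdio : writeln;

// Velocity update for one pair of bodies. The static foreach body is
// instantiated three times at compile time, so no runtime loop remains.
void applyPair(ref double[3] vi, ref double[3] vj, const double[3] r,
               double massI, double massJ, double mag)
{
    static foreach (k; 0 .. 3)
    {
        vi[k] -= r[k] * massJ * mag;
        vj[k] += r[k] * massI * mag;
    }
}

void main()
{
    double[3] vi = [0.0, 0.0, 0.0];
    double[3] vj = [0.0, 0.0, 0.0];
    immutable double[3] r = [1.0, 2.0, 3.0];

    applyPair(vi, vj, r, 2.0, 4.0, 0.5);

    writeln(vi, " ", vj);
}
```

Whether this beats what the optimizer already does with a plain foreach is something to measure per compiler; the timings above are for the thread's own hand-unrolled code, not for this sketch.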
Stanislav Blinov:
> Unrolling everything except the loop in energy() seems to have squeezed the bits needed to outperform c++, at least on my machine :)

That should be impossible, as I remember from my old profilings that energy() should use only an irrelevant amount of run time.

> http://dpaste.dzfl.pl/45e98e476daf

While I benchmark some variants of this program I am seeing a large variety of problems, limitations, bugs and regressions.

Your latest D code crashes my ldc2 V.0.12.1, while 0.13.0-alpha1 compiles it.

My older version of your D code runs with both compiler versions, but V.0.12.1 generates faster code.

Plus you can't make those double2 immutable, you can't use vector ops (because of performance, and also because they aren't nothrow in V.0.12.1).

I was also experimenting with (note the align):

    align(16) struct Body {
        double[3] x, v;
        double mass;
    }

    struct NBodySystem {
    private:
        __gshared static Body[5] bodies = [
            // Sun.
            Body([0., 0., 0.], [0., 0., 0.], solarMass),
            ...

But this improves the code for V.0.12.1 and worsens it for 0.13.0-alpha1.

Also I think the __gshared is ignored in V.0.12.1, but this bug could be fixed in more recent versions of ldc2.

> (I'm sticking to atoi because my copy of ldc seems to have an issue in std.conv).

My version seems to use to!() correctly.

If ldc2 developers are reading this thread there is enough strange stuff here to give one or two headaches :-)

Now I don't know what "final" version should I keep of this program :-)

Bye,
bearophile
Jan 30 2014
On Thursday, 30 January 2014 at 21:04:06 UTC, bearophile wrote:
> Stanislav Blinov:
>> Unrolling everything except the loop in energy() seems to have squeezed the bits needed to outperform c++, at least on my machine :)
> That should be impossible, as I remember from my old profilings that energy() should use only an irrelevant amount of run time.

I meant that if I unroll it, it's not irrelevant anymore :)

> While I benchmark some variants of this program I am seeing a large variety of problems, limitations, bugs and regressions...

:)

> Your latest D code crashes my ldc2 V.0.12.1, while 0.13.0-alpha1 compiles it.

:))

> My older version of your D code runs with both compiler versions, but V.0.12.1 generates faster code.

:)))

> Plus you can't make those double2 immutable, you can't use vector ops (because of performance, and also because they aren't nothrow in V.0.12.1).

Well, not being able to make them immutable is not *that* big of a problem now, is it? What would be actually cool to have are those slice operations.

> I was also experimenting with (note the align):
>
>     align(16) struct Body {
>         double[3] x, v;
>         double mass;
>     }
>
>     struct NBodySystem {
>     private:
>         __gshared static Body[5] bodies = [
>             // Sun.
>             Body([0., 0., 0.], [0., 0., 0.], solarMass),

Yeah... I've even thrown away that filler in the latest version :o)

> But this improves the code for V.0.12.1 and worsens it for 0.13.0-alpha1.

%|

>> (I'm sticking to atoi because my copy of ldc seems to have an issue in std.conv).
> My version seems to use to!() correctly.

I'm using the git head (704ab3, last commit Sun Jan 26 00:00:21). I haven't tried the release yet.

> If ldc2 developers are reading this thread there is enough strange stuff here to give one or two headaches :-)

Indeed.

> Now I don't know what "final" version should I keep of this program :-)

I was going to compare the asm listings, but C++ seems to have unrolled and inlined the outer loop right inside main(), and now I'm slightly lost in it :)
Jan 30 2014
Stanislav Blinov:
> I meant that if I unroll it, it's not irrelevant anymore :)

If a function takes no time to run, and you tweak it, your program is not supposed to go faster.

> I was going to compare the asm listings, but C++ seems to have unrolled and inlined the outer loop right inside main(), and now I'm slightly lost in it :)

Try using -fkeep-inline-functions.

Bye,
bearophile
Jan 30 2014
On Thursday, 30 January 2014 at 21:33:38 UTC, bearophile wrote:
> If a function takes no time to run, and you tweak it, your program is not supposed to go faster.

Right.

>> I was going to compare the asm listings, but C++ seems to have unrolled and inlined the outer loop right inside main(), and now I'm slightly lost in it :)
> Try using -fkeep-inline-functions.

Thanks.

G++: http://codepad.org/oOZQw1VQ
LDC: http://codepad.org/5nHoZL1k

LDC basically generated something that I can only call "one straight *whoooosh*"... This reminds me of Andrei's talk at (last year's?) GoingNative ("more instructions is not always slower code").
Jan 30 2014
Stanislav Blinov:
> G++: http://codepad.org/oOZQw1VQ
> LDC: http://codepad.org/5nHoZL1k

You seem to have a quite recent CPU, as the G++ code contains instructions like vmovsd. So you can try to do the same with ldc2, and use AVX or AVX2. There are the switches:

-march=<string> - Architecture to generate code for:
-mattr=<a1,+a2,-a3,...> - Target specific attributes (-mattr=help for details)
-mcpu=<cpu-name> - Target a specific cpu type (-mcpu=help for details)

> LDC basically generated something that I can only call "one straight *whoooosh*"...

:-)

Bye,
bearophile
Jan 30 2014
On Thursday, 30 January 2014 at 21:54:17 UTC, bearophile wrote:
> You seem to have a quite recent CPU,

An aging i3?

> as the G++ code contains instructions like vmovsd. So you can try to do the same with ldc2, and use AVX or AVX2.

Hmm... This is getting a bit silly now. I must have some compile switches for g++ wrong:

g++ -Ofast -fkeep-inline-functions -fomit-frame-pointer -march=native -mfpmath=sse -mavx -mssse3 -flto --std=c++11 -fopenmp nbody.cpp -o nbody-cpp

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.09 real, 5.07 user, 0.00 sys, 1140 kb, 99% cpu

ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops -ofnbody-ldc -mattr=+avx,+ssse3 nbody.d

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:04.02 real, 4.01 user, 0.00 sys, 1304 kb, 99% cpu
Jan 30 2014
Stanislav Blinov:
> An aging i3?

My CPU is older, it doesn't support AVX2 and AVX.

> This is getting a bit silly now. I must have some compile switches for g++ wrong:
>
> g++ -Ofast -fkeep-inline-functions -fomit-frame-pointer -march=native -mfpmath=sse -mavx -mssse3 -flto --std=c++11 -fopenmp nbody.cpp -o nbody-cpp
>
> time ./nbody-cpp 50000000:
> -0.169075164
> -0.169059907
> 0:05.09 real, 5.07 user, 0.00 sys, 1140 kb, 99% cpu
>
> ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops -ofnbody-ldc -mattr=+avx,+ssse3 nbody.d
>
> time ./nbody-ldc 50000000:
> -0.169075164
> -0.169059907
> 0:04.02 real, 4.01 user, 0.00 sys, 1304 kb, 99% cpu

Now the ldc2-compiled version runs in 4 seconds, this sounds correct. If you have paid for a CPU with AVX2 or AVX, it's right to use that :-)

Bye,
bearophile
Jan 30 2014
Since my post someone has added a Fortran version based on the algorithm used in the C++11 code. It's a little faster than the C++11 code and it's much nicer looking:

http://benchmarksgame.alioth.debian.org/u32/program.php?test=nbody&lang=ifc&id=5

    pure subroutine advance(tstep, x, v, mass)
      real*8, intent(in) :: tstep
      real*8, dimension(4,nb), intent(inout) :: x, v
      real*8, dimension(nb), intent(in) :: mass

      real*8 :: r(4,N),mag(N)
      real*8 :: distance, d2
      integer :: i, j, m

      m = 1
      do i = 1, nb
         do j = i + 1, nb
            r(1,m) = x(1,i) - x(1,j)
            r(2,m) = x(2,i) - x(2,j)
            r(3,m) = x(3,i) - x(3,j)
            m = m + 1
         end do
      end do

      do m = 1, N
         d2 = r(1,m)**2 + r(2,m)**2 + r(3,m)**2
         distance = 1/sqrt(real(d2))
         distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
         !distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
         mag(m) = tstep * distance**3
      end do

      m = 1
      do i = 1, nb
         do j = i + 1, nb
            v(1,i) = v(1,i) - r(1,m) * mass(j) * mag(m)
            v(2,i) = v(2,i) - r(2,m) * mass(j) * mag(m)
            v(3,i) = v(3,i) - r(3,m) * mass(j) * mag(m)

            v(1,j) = v(1,j) + r(1,m) * mass(i) * mag(m)
            v(2,j) = v(2,j) + r(2,m) * mass(i) * mag(m)
            v(3,j) = v(3,j) + r(3,m) * mass(i) * mag(m)

            m = m + 1
         end do
      end do

      do i = 1, nb
         x(1,i) = x(1,i) + tstep * v(1,i)
         x(2,i) = x(2,i) + tstep * v(2,i)
         x(3,i) = x(3,i) + tstep * v(3,i)
      end do
    end subroutine advance

Bye,
bearophile
Jan 30 2014
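For comparison, here is a rough scalar D transcription of that Fortran advance routine. It is a sketch under stated assumptions, not any of the versions benchmarked in this thread: N is taken to be the number of distinct pairs, nb * (nb - 1) / 2 (the Fortran excerpt does not show its definition), the arrays hold 3 components instead of the padded 4, and the main function with its test positions and masses is invented for the demo.

```d
import std.math : abs, sqrt;
import std.stdio : writeln;

enum nb = 5;                // number of bodies, as in the benchmark
enum N = nb * (nb - 1) / 2; // assumed: number of distinct pairs

void advance(double tstep, ref double[3][nb] x, ref double[3][nb] v,
             const double[nb] mass)
{
    double[3][N] r; // displacement per pair
    double[N] mag;  // tstep / |r|^3 per pair

    // Pass 1: displacement vector for every pair (i, j) with i < j.
    size_t m = 0;
    foreach (i; 0 .. nb)
        foreach (j; i + 1 .. nb)
        {
            foreach (k; 0 .. 3)
                r[m][k] = x[i][k] - x[j][k];
            ++m;
        }

    // Pass 2: single-precision 1/sqrt as a first guess, refined by one
    // Newton-Raphson step, as in the Fortran (second step commented out there).
    foreach (p; 0 .. N)
    {
        immutable d2 = r[p][0] * r[p][0] + r[p][1] * r[p][1] + r[p][2] * r[p][2];
        double distance = 1.0 / sqrt(cast(float) d2);
        distance *= 1.5 - 0.5 * d2 * distance * distance;
        mag[p] = tstep * distance * distance * distance;
    }

    // Pass 3: equal and opposite velocity updates for each pair.
    m = 0;
    foreach (i; 0 .. nb)
        foreach (j; i + 1 .. nb)
        {
            foreach (k; 0 .. 3)
            {
                v[i][k] -= r[m][k] * mass[j] * mag[m];
                v[j][k] += r[m][k] * mass[i] * mag[m];
            }
            ++m;
        }

    // Pass 4: move the bodies.
    foreach (i; 0 .. nb)
        foreach (k; 0 .. 3)
            x[i][k] += tstep * v[i][k];
}

void main()
{
    double[3][nb] x, v;
    double[nb] mass;
    foreach (i; 0 .. nb)
    {
        x[i] = [1.0 * i, 2.0 * i, 0.5 * i * i]; // made-up positions
        v[i][] = 0.0;
        mass[i] = i + 1.0;                      // made-up masses
    }

    advance(0.01, x, v, mass);

    // Total momentum should stay (numerically) zero.
    double[3] p = [0.0, 0.0, 0.0];
    foreach (i; 0 .. nb)
        foreach (k; 0 .. 3)
            p[k] += mass[i] * v[i][k];
    writeln(p);
    assert(abs(p[0]) < 1e-12 && abs(p[1]) < 1e-12 && abs(p[2]) < 1e-12);
}
```

Splitting the work into separate passes over flat per-pair arrays is what lets a vectorizing compiler handle the middle loop without explicit SSE code, which is presumably why the Fortran version needs none.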
On Thursday, 30 January 2014 at 22:45:45 UTC, bearophile wrote:
> Since my post someone has added a Fortran version based on the algorithm used in the C++11 code. It's a little faster than the C++11 code and it's much nicer looking:

Yup, I saw it. They're cheating, they almost don't have to explicitly handle any SSE business :o) I'm wondering how our little code could perform on that machine. It looks nice too, by the way:

http://dpaste.dzfl.pl/a81a475bbcf6

I've rearranged some bits, brought back to!int (turned out there weren't any issues, it's just that ldc generated errors regarding to! when there were other compiler errors %\), replaced TypeTuples with your Iota... the works :)
Jan 30 2014
Gah! G'Kar moment... http://dpaste.dzfl.pl/203d237d7413
Jan 30 2014