digitalmars.D.learn - N-body bench

"bearophile" <bearophileHUGS lycos.com> writes:

If someone is willing to test LDC2 with a known benchmark, 
there's this one:

http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

A reformatted C++11 version, good as a starting point for a D 
translation:
http://codepad.org/4mOHW0fz

Bye,
bearophile

Jan 24 2014

Jerry <jlquinn optonline.net> writes:

"bearophile" <bearophileHUGS lycos.com> writes:

 If someone is willing to test LDC2 with a known benchmark, there's this one:

 http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

 A reformatted C++11 version good as start point for a D translation:
 http://codepad.org/4mOHW0fz

Just playing with the C++ version in gcc 4.7.3, I see a significant
speedup by using -funroll-loops.  You might want to make sure that's
enabled.

Jerry

Jan 28 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Friday, 24 January 2014 at 15:56:26 UTC, bearophile wrote:
 If someone is willing to test LDC2 with a known benchmark, 
 there's this one:

 http://benchmarksgame.alioth.debian.org/u32/performance.php?test=nbody

 A reformatted C++11 version good as start point for a D 
 translation:
 http://codepad.org/4mOHW0fz

 Bye,
 bearophile

Hmm.. How would one use core.simd with LDC2? It doesn't seem to 
define D_SIMD.
Or should I go for builtins?

Jan 29 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 Hmm.. How would one use core.simd with LDC2? It doesn't seem to 
 define D_SIMD.
 Or should I go for builtins?

I don't know if this is useful for you, but here I wrote a basic 
usage example of SIMD in ldc2 (second D entry):
http://rosettacode.org/wiki/Four_bits_adder#D

Bye,
bearophile

Jan 29 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Wednesday, 29 January 2014 at 16:43:35 UTC, bearophile wrote:
 Stanislav Blinov:

 Hmm.. How would one use core.simd with LDC2? It doesn't seem 
 to define D_SIMD.
 Or should I go for builtins?

 I don't know if this is useful for you, but here I wrote a 
 basic usage example of SIMD in ldc2 (second D entry):
 http://rosettacode.org/wiki/Four_bits_adder#D

 Bye,
 bearophile

I meant how to make it compile with ldc2? I've translated the 
code, it compiles and works with dmd (although segfaults in 
-release mode for some reason, probably a bug somewhere).

But with ldc2:

nbody.d(68): Error: undefined identifier __simd
nbody.d(68): Error: undefined identifier XMM

those are needed for that sqrt reciprocal call.

Jan 29 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 I meant how to make it compile with ldc2? I've translated the 
 code, it compiles and works with dmd (although segfaults in 
 -release mode for some reason, probably a bug somewhere).

 But with ldc2:

 nbody.d(68): Error: undefined identifier __simd
 nbody.d(68): Error: undefined identifier XMM

 those are needed for that sqrt reciprocal call.

Usually ldc2 works with simd for me. Perhaps you have to show us 
the code, ask for help in the ldc newsgroup, or ask for help in 
the #ldc IRC channel.

Regarding dmd with -release, I suggest you minimize the code 
and put the problem in Bugzilla. Benchmarks are also useful to 
find and fix compiler bugs.

Bye,
bearophile

Jan 29 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Wednesday, 29 January 2014 at 16:54:54 UTC, bearophile wrote:
 Stanislav Blinov:

 I meant how to make it compile with ldc2? I've translated the 
 code, it compiles and works with dmd (although segfaults in 
 -release mode for some reason, probably a bug somewhere).

 But with ldc2:

 nbody.d(68): Error: undefined identifier __simd
 nbody.d(68): Error: undefined identifier XMM

 those are needed for that sqrt reciprocal call.

 Usually for me ldc2 works with simd. Perhaps you have to show 
 us the code, ask for help in the ldc newsgroup, or ask for help 
 in the #ldc IRC channel.

It's a direct translation of that C++ code:

http://dpaste.dzfl.pl/89517fd0bf8fa

This line:

distance = __simd(XMM.CVTPS2PD, __simd(XMM.RSQRTPS, 
__simd(XMM.CVTPD2PS, dsquared)));

The XMM enum and __simd functions are defined only when D_SIMD 
version is set. ldc2 doesn't seem to set this, unless I'm missing 
some sort of compiler switch.
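
For illustration only, here is a minimal sketch of how code could guard against this. It is not taken from the paste above: the names hasDmdSimd and rsqrtScalar are made up for this example. The idea is to test the D_SIMD version identifier at compile time and keep a plain scalar fallback for compilers that do not define it.

```d
import std.math : sqrt;

// True only when the compiler defines the D_SIMD version identifier,
// which (per the post above) is what gates __simd and the XMM enum.
version (D_SIMD)
    enum bool hasDmdSimd = true;
else
    enum bool hasDmdSimd = false;

// Hypothetical scalar fallback for the reciprocal square root:
// full double precision, no SIMD intrinsics required.
double rsqrtScalar(double x)
{
    return 1.0 / sqrt(x);
}
```

A caller could then branch with `static if (hasDmdSimd)` and only reference __simd and XMM inside the true branch, so the other branch still compiles where those symbols are missing.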

 Regarding dmd with -release, I suggest you to minimize the code 
 and put the problem in Bugzilla. Benchmarks are also useful to 
 find and fix compiler bugs.

I'm already onto it :)

Jan 29 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

Regarding dmd it looks awfully similar to this:

http://d.puremagic.com/issues/show_bug.cgi?id=9449

I'd need to do some more runs though.

Jan 29 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Wednesday, 29 January 2014 at 18:05:41 UTC, Stanislav Blinov 
wrote:

Yep, doesn't seem to be simd-related:

struct S(T) { T v1, v2; }

void main() {
	alias T = double; // integrals and float are ok :\
	version	(workaround) {
		S!T[1] p = void;
	} else {
		S!T[1] p;
	}
}

Anyway, here's the revised (and bugfixed :o)) code, if anyone's 
interested:

http://dpaste.dzfl.pl/52d9e1fdc0fd

On my machine, dmd -release -O -inline -noboundscheck is only 6 
times slower than that C++ version :D

I'll try to get around to making it work with ldc on the weekend.

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

Ok, didn't need to wait for the weekend :)

Looks like both dmd and ldc don't optimize slice operations yet, 
had to revert to loops (shaved off ~1.5 seconds for ldc, ~9 
seconds for dmd). Also, my local pull of ldc had some issues with 
to!int(string), reverted that to atoi :)
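
To make the slice-versus-loop change concrete, here is a small sketch of the two forms. The function names are invented for this example and the real code in the paste may be organized differently. Both update a 3-element position from a velocity and a time step.

```d
// Array (slice) operation: one statement over the whole static array.
void advanceSlice(ref double[3] x, const double[3] v, double dt)
{
    x[] += v[] * dt;
}

// Explicit loop over the same three elements, with a trip count
// that is known at compile time.
void advanceLoop(ref double[3] x, const double[3] v, double dt)
{
    foreach (i; 0 .. 3)
        x[i] += v[i] * dt;
}
```

Both functions compute the same result. The timings above only say that the loop form was faster with the compilers used in this thread.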

Here's the code:

http://dpaste.dzfl.pl/4b6df0771696

C++ version compiled with the provided flags.

dmd -release -O -inline -noboundscheck

ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops

Here are the results on my machine (i3 2100  3.1GHz):

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.20 real, 5.18 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:07.84 real, 7.82 user, 0.00 sys, 1324 kb, 99% cpu

time ./nbody-dmd 50000000:
-0.169075164
-0.169059907
0:23.35 real, 23.29 user, 0.00 sys, 1184 kb, 99% cpu

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Thursday, 30 January 2014 at 14:17:16 UTC, Stanislav Blinov 
wrote:

Forgot one slice assignment in toDouble2(). Now the results are 
more interesting:

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.20 real, 5.18 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:05.94 real, 5.92 user, 0.00 sys, 1320 kb, 99% cpu

time ./nbody-dmd 50000000:
-0.169075164
-0.169059907
0:19.62 real, 19.57 user, 0.00 sys, 1188 kb, 99% cpu

:)

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 Forgot one slice assignment in toDouble2(). Now the results are 
 more interesting:

Is the latest link shown the last version?

I need the 0.13.0-alpha1 to compile the code.
I am seeing a significant performance difference between C++ and 
D-ldc2.

Bye,
bearophile

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Thursday, 30 January 2014 at 15:37:24 UTC, bearophile wrote:
 Stanislav Blinov:

 Forgot one slice assignment in toDouble2(). Now the results 
 are more interesting:

 Is the latest link shown the last version?

No. In toDouble2() on line 13:

replace result.array = args[0]

with result.array[0] = args[0]; result.array[1] = args[0];
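
As a standalone sketch of that replacement (the helper name splat is hypothetical; the paste's toDouble2() takes its arguments differently), writing each lane of a double2 separately looks roughly like this. It assumes a target where core.simd defines double2.

```d
import core.simd;

// Hypothetical helper: copy one scalar into both lanes of a double2,
// writing each lane on its own instead of using a slice assignment.
double2 splat(double value)
{
    double2 result;
    result.array[0] = value;
    result.array[1] = value;
    return result;
}
```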

 I need the 0.13.0-alpha1 to compile the code.
 I am seeing a significant performance difference between C++ 
 and D-ldc2.

You mean with your current version of ldc?

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 You mean with your current version of ldc?

Yes. The older version of LDC2 doesn't even compile the code. I 
need to use 0.13.0-alpha1.

Your D code with small changes:
http://codepad.org/xqqScd42

Asm generated by G++ for the advance function (that is the one 
that uses most of the run time):
http://codepad.org/tApRNsVy

Asm generated by ldc2:
http://codepad.org/jKSJcOAZ

With N = 5_000_000 my timings on an old CPU are 2.23 seconds for 
ldc2 and 1.83 seconds for g++. So there's some performance 
difference.

I have tried to unroll manually the loop in the D code, but I see 
worse performance. I'll try some more later.

Bye,
bearophile

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Thursday, 30 January 2014 at 16:53:22 UTC, bearophile wrote:

 Yes. The older version of LDC2 doesn't even compile the code. I 
 need to use 0.13.0-alpha1.

Hmm.

 Your D code with small changes:
 http://codepad.org/xqqScd42

That won't compile with dmd (at least, with 2.064.2): it expects 
constants as initializers for vectors. :( That's why I rolled up 
that toDouble2() function.

 With N = 5_000_000 my timings on an old CPU are 2.23 seconds 
 for ldc2 and 1.83 seconds for g++. So there's some performance 
 difference.

What about 50_000_000?

 I have tried to unroll manually the loop in the D code, but I 
 see worse performance. I'll try some more later.

I'm also fiddling :)

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 That won't compile with dmd (at least, with 2.064.2): it 
 expects constants as initializers for vectors. :( That's why I 
 rolled up that toDouble2() function.

I see. Then probably I will have to put it back...


 With N = 5_000_000 my timings on an old CPU are 2.23 seconds 
 for ldc2 and 1.83 seconds for g++. So there's some performance 
 difference.

 What about 50_000_000?

First let me try to fiddle with the code some more :-)

Once done, this should go somewhere (like the wiki) as a simple 
example of SIMD usage in D.

Bye,
bearophile

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

 Stanislav Blinov:

 That won't compile with dmd (at least, with 2.064.2): it 
 expects constants as initializers for vectors. :( That's why I 
 rolled up that toDouble2() function.


A few more changes, but this version still lacks toDouble2:
http://codepad.org/SpMprWym

Bye,
bearophile

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Thursday, 30 January 2014 at 18:29:42 UTC, bearophile wrote:

I see you're compiling with

ldmd2 -wi -O -release -inline -noboundscheck nbody.d

Try

ldc2 -release -O3 -disable-boundscheck -vectorize -vectorize-loops

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 ldc2 -release -O3 -disable-boundscheck -vectorize 
 -vectorize-loops

All my versions of ldc2 don't even accept -vectorize :-)

ldc2: Unknown command line argument '-vectorize'.  Try: 'ldc2 -help'
ldc2: Did you mean '-vectorize-slp'?

And -vectorize-loops should be active by default on recent 
versions of ldc2 (including V.0.12.1), and indeed I see no 
performance difference when using it.

Bye,
bearophile

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 Looks like both dmd and ldc don't optimize slice operations 
 yet, had to revert to loops

It's a very silly problem for a statically typed language. The D 
type system knows the static length of those arrays, but it 
doesn't use such information.
(Similarly, several algorithms in Phobos force you to throw away 
this very precious compile-time information by requiring dynamic 
arrays as input.)

I have just suggested a fix for ldc2:
http://forum.dlang.org/thread/qeytzeqnygxpocywyifp forum.dlang.org

I have had similar enhancement requests in Bugzilla for some time:
https://d.puremagic.com/issues/show_bug.cgi?id=10523
https://d.puremagic.com/issues/show_bug.cgi?id=10305

Bye,
bearophile

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Thursday, 30 January 2014 at 18:43:02 UTC, bearophile wrote:

 It's a very silly problem for a statically typed language. The 
 D type system knows the static length of those arrays, but it 
 doesn't use such information.

I agree.


Unrolling everything except the loop in energy() seems to have 
squeezed the bits needed to outperform c++, at least on my machine 
:)

http://dpaste.dzfl.pl/45e98e476daf

(I'm sticking to atoi because my copy of ldc seems to have an 
issue in std.conv).

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.15 real, 5.14 user, 0.00 sys, 532 kb, 99% cpu

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:04.41 real, 4.40 user, 0.00 sys, 1308 kb, 99% cpu

time ./nbody-dmd 50000000:
-0.169075164
-0.169059907
0:15.39 real, 15.34 user, 0.00 sys, 1192 kb, 99% cpu

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 Unrolling everything except the loop in energy() seems to have 
 squeezed the bits needed to outperform c++, at least on my 
 machine :)

That should be impossible, as I remember from my old profilings 
that energy() should use only an irrelevant amount of run time.


 http://dpaste.dzfl.pl/45e98e476daf

While I benchmark some variants of this program I am seeing a 
large variety of problems, limitations, bugs and regressions.

Your latest D code crashes my ldc2 V.0.12.1, while 0.13.0-alpha1 
compiles it. My older version of your D code runs with both 
compiler versions, but V.0.12.1 generates faster code.

Plus you can't make those double2 immutable, you can't use vector 
ops (because of performance, and also because they aren't nothrow 
in V.0.12.1).

I was also experimenting with (note the align):

align(16) struct Body {
     double[3] x, v;
     double mass;
}

struct NBodySystem {
private:
     __gshared static Body[5] bodies = [
         // Sun.
         Body([0., 0., 0.],
              [0., 0., 0.],
              solarMass),
...

But this improves the code for V.0.12.1 and worsens it for 
0.13.0-alpha1.


Also I think the __gshared is ignored in V.0.12.1, but this bug 
could be fixed in more recent versions of ldc2.


 (I'm sticking to atoi because my copy of ldc seems to have an 
 issue in std.conv).

My version seems to use to!() correctly.

If ldc2 developers are reading this thread there is enough 
strange stuff here to give one or two headaches :-)

Now I don't know what "final" version should I keep of this 
program :-)

Bye,
bearophile

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Thursday, 30 January 2014 at 21:04:06 UTC, bearophile wrote:
 Stanislav Blinov:

 Unrolling everything except the loop in energy() seems to have 
 squeezed the bits needed to outperform c++, at least on my 
 machine :)

 That should be impossible, as I remember from my old profilings 
 that energy() should use only an irrelevant amount of run time.

I meant that if I unroll it, it's not irrelevant anymore :)

 While I benchmark some variants of this program I am seeing a 
 large variety of problems, limitations, bugs and regressions...

:)

 Your latest D code crashes my ldc2 V.0.12.1, while 0.13.0-alpha1 
 compiles it.

:))

 My older version of your D code runs with both compiler 
 versions, but V.0.12.1 generates faster code.

:)))

 Plus you can't make those double2 immutable, you can't use 
 vector ops (because of performance, and also because they 
 aren't nothrow in V.0.12.1).

Well, not being able to make them immutable is not *that* big of 
a problem now, is it? What would be actually cool to have are 
those slice operations.

 I was also experimenting with (note the align):

 align(16) struct Body {
     double[3] x, v;
     double mass;
 }

 struct NBodySystem {
 private:
     __gshared static Body[5] bodies = [
         // Sun.
         Body([0., 0., 0.],
              [0., 0., 0.],
              solarMass),

Yeah... I've even thrown away that filler in the latest version 
:o)

 But this improves the code for V.0.12.1 and worsens it for 
 0.13.0-alpha1.

%|

 (I'm sticking to atoi because my copy of ldc seems to have an 
 issue in std.conv).

 My version seems to use to!() correctly.

I'm using the git head (704ab3, last commit Sun Jan 26 00:00:21). 
I haven't tried the release yet.

 If ldc2 developers are reading this thread there is enough 
 strange stuff here to give one or two headaches :-)

Indeed.

 Now I don't know what "final" version should I keep of this 
 program :-)

I was going to compare the asm listings, but C++ seems to have 
unrolled and inlined the outer loop right inside main(), and now 
I'm slightly lost in it :)

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 I meant that if I unroll it, it's not irrelevant anymore :)

If a function takes no time to run, and you tweak it, your 
program is not supposed to go faster.


 I was going to compare the asm listings, but C++ seems to have 
 unrolled and inlined the outer loop right inside main(), and 
 now I'm slightly lost in it :)

Try using -fkeep-inline-functions.

Bye,
bearophile

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Thursday, 30 January 2014 at 21:33:38 UTC, bearophile wrote:

 If a function takes no time to run, and you tweak it, your 
 program is not supposed to go faster.

Right.

 I was going to compare the asm listings, but C++ seems to have 
 unrolled and inlined the outer loop right inside main(), and 
 now I'm slightly lost in it :)

 Try using -fkeep-inline-functions.

Thanks.

G++:
http://codepad.org/oOZQw1VQ

LDC:
http://codepad.org/5nHoZL1k


LDC basically generated something that I can only call "one 
straight *whoooosh*"... This reminds me of Andrei's talk at (last 
year's?) GoingNative ("more instructions is not always slower 
code").

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 G++:
 http://codepad.org/oOZQw1VQ

 LDC:
 http://codepad.org/5nHoZL1k

You seem to have a quite recent CPU, as the G++ code contains 
instructions like vmovsd. So you can try to do the same with 
ldc2, and use AVX or AVX2.

These are the switches:

-march=<string>            - Architecture to generate code for:
-mattr=<a1,+a2,-a3,...>    - Target specific attributes (-mattr=help for details)
-mcpu=<cpu-name>           - Target a specific cpu type (-mcpu=help for details)


 LDC basically generated something that I can only call "one 
 straight *whoooosh*"...

:-)

Bye,
bearophile

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Thursday, 30 January 2014 at 21:54:17 UTC, bearophile wrote:

 You seem to have a quite recent CPU,

An aging i3?

 as the G++ code contains instructions like vmovsd. So you can 
 try to do the same with ldc2, and use AVX or AVX2.

Hmm...


This is getting a bit silly now. I must have some compile 
switches for g++ wrong:

g++ -Ofast -fkeep-inline-functions -fomit-frame-pointer 
-march=native -mfpmath=sse -mavx -mssse3 -flto --std=c++11 
-fopenmp nbody.cpp -o nbody-cpp

time ./nbody-cpp 50000000:
-0.169075164
-0.169059907
0:05.09 real, 5.07 user, 0.00 sys, 1140 kb, 99% cpu

ldc2 -release -O3 -disable-boundscheck -vectorize 
-vectorize-loops -ofnbody-ldc -mattr=+avx,+ssse3 nbody.d

time ./nbody-ldc 50000000:
-0.169075164
-0.169059907
0:04.02 real, 4.01 user, 0.00 sys, 1304 kb, 99% cpu

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Stanislav Blinov:

 An aging i3?

My CPU is older; it supports neither AVX nor AVX2.


 This is getting a bit silly now. I must have some compile 
 switches for g++ wrong:

 g++ -Ofast -fkeep-inline-functions -fomit-frame-pointer 
 -march=native -mfpmath=sse -mavx -mssse3 -flto --std=c++11 
 -fopenmp nbody.cpp -o nbody-cpp

 time ./nbody-cpp 50000000:
 -0.169075164
 -0.169059907
 0:05.09 real, 5.07 user, 0.00 sys, 1140 kb, 99% cpu

 ldc2 -release -O3 -disable-boundscheck -vectorize 
 -vectorize-loops -ofnbody-ldc -mattr=+avx,+ssse3 nbody.d

 time ./nbody-ldc 50000000:
 -0.169075164
 -0.169059907
 0:04.02 real, 4.01 user, 0.00 sys, 1304 kb, 99% cpu

Now the ldc2-compiled program runs in 4 seconds, which sounds 
correct. If you have paid for a CPU with AVX or AVX2, it's right 
to use it :-)

Bye,
bearophile

Jan 30 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Since my post someone has added a Fortran version based on the 
algorithm used in the C++11 code. It's a little faster than the 
C++11 code and it's much nicer looking:
http://benchmarksgame.alioth.debian.org/u32/program.php?test=nbody&lang=ifc&id=5


pure subroutine advance(tstep, x, v, mass)
   real*8, intent(in) :: tstep
   real*8, dimension(4,nb), intent(inout) :: x, v
   real*8, dimension(nb), intent(in) :: mass
   real*8 :: r(4,N),mag(N)

   real*8 :: distance, d2
   integer :: i, j, m
   m = 1
   do i = 1, nb
      do j = i + 1, nb
         r(1,m) = x(1,i) - x(1,j)
         r(2,m) = x(2,i) - x(2,j)
         r(3,m) = x(3,i) - x(3,j)
         m = m + 1
      end do
   end do

   do m = 1, N
      d2 = r(1,m)**2 + r(2,m)**2 + r(3,m)**2
      distance = 1/sqrt(real(d2))
      distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
      !distance = distance * (1.5d0 - 0.5d0 * d2 * distance * distance)
      mag(m) = tstep * distance**3
   end do

   m = 1
   do i = 1, nb
      do j = i + 1, nb
         v(1,i) = v(1,i) - r(1,m) * mass(j) * mag(m)
         v(2,i) = v(2,i) - r(2,m) * mass(j) * mag(m)
         v(3,i) = v(3,i) - r(3,m) * mass(j) * mag(m)

         v(1,j) = v(1,j) + r(1,m) * mass(i) * mag(m)
         v(2,j) = v(2,j) + r(2,m) * mass(i) * mag(m)
         v(3,j) = v(3,j) + r(3,m) * mass(i) * mag(m)

         m = m + 1
      end do
   end do

   do i = 1, nb
      x(1,i) = x(1,i) + tstep * v(1,i)
      x(2,i) = x(2,i) + tstep * v(2,i)
      x(3,i) = x(3,i) + tstep * v(3,i)
   end do
   end subroutine advance
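
The middle loop above starts from a single-precision 1/sqrt and then refines it. As a hedged D sketch of that one step (the function name rsqrtRefined is made up, and this is not code from any of the pastes): applying Newton-Raphson to f(y) = 1/y^2 - d2 gives the update y <- y * (1.5 - 0.5 * d2 * y * y), which is the expression used in the Fortran code.

```d
import std.math : sqrt;

// Hypothetical helper: approximate 1/sqrt(d2) in single precision,
// then sharpen it with one Newton-Raphson step done in double.
double rsqrtRefined(double d2)
{
    // Cheap estimate, analogous to 1/sqrt(real(d2)) in the Fortran.
    double y = 1.0f / sqrt(cast(float) d2);
    // Newton-Raphson step for f(y) = 1/y^2 - d2.
    y = y * (1.5 - 0.5 * d2 * y * y);
    return y;
}
```

Each such step roughly squares the relative error of the estimate, which is presumably why the second, commented-out refinement line can be left disabled when the starting estimate is already good to single precision.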


Bye,
bearophile

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

On Thursday, 30 January 2014 at 22:45:45 UTC, bearophile wrote:
 Since my post someone has added a Fortran version based on the 
 algorithm used in the C++11 code. It's a little faster than the 
 C++11 code and it's much nicer looking:

Yup, I saw it. They're cheating, they almost don't have to 
explicitly handle any SSE business :o) I'm wondering how our 
little code could perform on that machine.

It looks nice too, by the way:

http://dpaste.dzfl.pl/a81a475bbcf6

I've rearranged some bits, brought back to!int (it turned out 
there weren't any issues, it's just that ldc generated errors 
regarding to! when there were other compiler errors %\), replaced 
TypeTuples with your Iota... the works :)

Jan 30 2014

"Stanislav Blinov" <stanislav.blinov gmail.com> writes:

Gah! G'Kar moment...

http://dpaste.dzfl.pl/203d237d7413

Jan 30 2014
