digitalmars.D.learn - Speed of math function atan: comparison D and C++

reply J-S Caux <js gmail.com> writes:
I'm considering shifting a large existing C++ codebase into D 
(it's a scientific code making much use of functions like atan, 
log etc).

I've compared the raw speed of atan between C++ (Apple LLVM 
version 7.3.0 (clang-703.0.29)) and D (dmd v2.079.0, also ldc2 
1.7.0) by doing long loops of such functions.

I can't get the D to run faster than about half the speed of C++.

Are there benchmarks for such scientific functions published 
somewhere?
Mar 04 2018
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 05/03/2018 6:35 PM, J-S Caux wrote:
 I'm considering shifting a large existing C++ codebase into D (it's a 
 scientific code making much use of functions like atan, log etc).
 
 I've compared the raw speed of atan between C++ (Apple LLVM version 
 7.3.0 (clang-703.0.29)) and D (dmd v2.079.0, also ldc2 1.7.0) by doing 
 long loops of such functions.
 
 I can't get the D to run faster than about half the speed of C++.
 
 Are there benchmarks for such scientific functions published somewhere
Gonna need to disassemble and compare them. atan should work out to only be a few instructions (inline assembly) from what I've looked at in the source. Also you should post the code you used for each.
Mar 04 2018
next sibling parent reply J-S Caux <js gmail.com> writes:
On Monday, 5 March 2018 at 05:40:09 UTC, rikki cattermole wrote:
 On 05/03/2018 6:35 PM, J-S Caux wrote:
 I'm considering shifting a large existing C++ codebase into D 
 (it's a scientific code making much use of functions like 
 atan, log etc).
 
 I've compared the raw speed of atan between C++ (Apple LLVM 
 version 7.3.0 (clang-703.0.29)) and D (dmd v2.079.0, also ldc2 
 1.7.0) by doing long loops of such functions.
 
 I can't get the D to run faster than about half the speed of 
 C++.
 
 Are there benchmarks for such scientific functions published 
 somewhere
Gonna need to disassemble and compare them. atan should work out to only be a few instructions (inline assembly) from what I've looked at in the source. Also you should post the code you used for each.
So the codes are trivial, simply some check of raw speed:

  double x = 0.0;
  for (int a = 0; a < 1000000000; ++a) x += atan(1.0/(1.0 + sqrt(1.0 + a)));

for C++ and

  double x = 0.0;
  for (int a = 0; a < 1_000_000_000; ++a) x += atan(1.0/(1.0 + sqrt(1.0 + a)));

for D. The C++ executable takes 40 seconds; the D executable takes 68 seconds.
Mar 04 2018
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 05/03/2018 7:01 PM, J-S Caux wrote:
 On Monday, 5 March 2018 at 05:40:09 UTC, rikki cattermole wrote:
 On 05/03/2018 6:35 PM, J-S Caux wrote:
 I'm considering shifting a large existing C++ codebase into D (it's a 
 scientific code making much use of functions like atan, log etc).

 I've compared the raw speed of atan between C++ (Apple LLVM version 
 7.3.0 (clang-703.0.29)) and D (dmd v2.079.0, also ldc2 1.7.0) by 
 doing long loops of such functions.

 I can't get the D to run faster than about half the speed of C++.

 Are there benchmarks for such scientific functions published somewhere
Gonna need to disassemble and compare them. atan should work out to only be a few instructions (inline assembly) from what I've looked at in the source. Also you should post the code you used for each.
So the codes are trivial, simply some check of raw speed:

  double x = 0.0;
  for (int a = 0; a < 1000000000; ++a) x += atan(1.0/(1.0 + sqrt(1.0 + a)));

for C++ and

  double x = 0.0;
  for (int a = 0; a < 1_000_000_000; ++a) x += atan(1.0/(1.0 + sqrt(1.0 + a)));

for D. The C++ executable takes 40 seconds; the D executable takes 68 seconds.
Yes, but that doesn't show me how you benchmarked.
Mar 04 2018
prev sibling next sibling parent reply Uknown <sireeshkodali1 gmail.com> writes:
On Monday, 5 March 2018 at 06:01:27 UTC, J-S Caux wrote:
 On Monday, 5 March 2018 at 05:40:09 UTC, rikki cattermole wrote:
 On 05/03/2018 6:35 PM, J-S Caux wrote:
 I'm considering shifting a large existing C++ codebase into D 
 (it's a scientific code making much use of functions like 
 atan, log etc).
 
 I've compared the raw speed of atan between C++ (Apple LLVM 
 version 7.3.0 (clang-703.0.29)) and D (dmd v2.079.0, also 
 ldc2 1.7.0) by doing long loops of such functions.
 
 I can't get the D to run faster than about half the speed of 
 C++.
 
 Are there benchmarks for such scientific functions published 
 somewhere
Gonna need to disassemble and compare them. atan should work out to only be a few instructions (inline assembly) from what I've looked at in the source. Also you should post the code you used for each.
So the codes are trivial, simply some check of raw speed:

  double x = 0.0;
  for (int a = 0; a < 1000000000; ++a) x += atan(1.0/(1.0 + sqrt(1.0 + a)));

for C++ and

  double x = 0.0;
  for (int a = 0; a < 1_000_000_000; ++a) x += atan(1.0/(1.0 + sqrt(1.0 + a)));

for D. The C++ executable takes 40 seconds; the D executable takes 68 seconds.
Depending on your platform, the size of `double` could be different between C++ and D. Could you check that the size and precision are indeed the same?

Also, the benchmark method is just as important as the benchmark code. Did you use DMD or LDC as the D compiler? In this case it shouldn't matter, but try with LDC if you haven't. Also ensure that you've used the right flags: `-release -inline -O`.

If the D version is still slower, you could try using the C version of the function. Simply change `import std.math: atan;` to `import core.stdc.math: atan;` [0]

[0]: https://dlang.org/phobos/core_stdc_math.html#.atan
Mar 05 2018
parent reply J-S Caux <js gmail.com> writes:
On Monday, 5 March 2018 at 09:48:49 UTC, Uknown wrote:

 Depending on your platform, the size of `double` could be 
 different between C++ and D. Could you check that the size and 
 precision are indeed the same?
 Also, benchmark method is just as important as benchmark code. 
 Did you use DMD or LDC as the D compiler? In this case it 
 shouldn't matter, but try with LDC if you haven't. Also ensure 
 that you've used the right flags:
 `-release -inline -O`.

 If the D version is still slower, you could try using the C 
 version of the function
 Simply change `import std.math: atan;` to `core.stdc.math: 
 atan;` [0]

 [0]: https://dlang.org/phobos/core_stdc_math.html#.atan
Thanks all for the info.

I've tested these two very basic representative codes:
https://www.dropbox.com/s/b5o4i8h43qh1saf/test.cc?dl=0
https://www.dropbox.com/s/zsaikhdoyun3olk/test.d?dl=0

Results:

C++:
g++ (Apple LLVM version 7.3.0): 9.5 secs
g++ (GCC 7.1.0): 10.7 secs

D:
dmd: 35.5 secs
dmd -release -inline -O: 29.5 secs
ldc2: 34.4 secs
ldc2 -release -O: 31.5 secs

But now, using the core.stdc.math atan as per Uknown's suggestion:

D:
dmd: 9 secs
dmd -release -inline -O: 6.8 secs
ldc2: 10 secs
ldc2 -release -O: 6.5 secs   <- best

So indeed the difference is between the `std.math` atan and the `core.stdc.math` atan. Thanks Uknown! Just knowing this trick could make the difference between me and other scientists switching over to D...

But now comes the question: can the D fundamental maths functions be propped up to be as fast as the C ones?
Mar 05 2018
next sibling parent bauss <jj_1337 live.dk> writes:
On Monday, 5 March 2018 at 18:39:21 UTC, J-S Caux wrote:
 But now comes the question: can the D fundamental maths 
 functions be propped up to be as fast as the C ones?
Probably, if someone takes the time to look at the bottlenecks.
Mar 05 2018
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Mar 05, 2018 at 06:39:21PM +0000, J-S Caux via Digitalmars-d-learn
wrote:
[...]
 I've tested these two very basic representative codes:
 https://www.dropbox.com/s/b5o4i8h43qh1saf/test.cc?dl=0
 https://www.dropbox.com/s/zsaikhdoyun3olk/test.d?dl=0
 
 Results:
 
 C++:
 g++ (Apple LLVM version 7.3.0):  9.5 secs
 g++ (GCC 7.1.0):  10.7 secs
 
 D:
 dmd :  35.5 secs
 dmd -release -inline -O : 29.5 secs
 ldc2 :  34.4 secs
 ldc2 -release -O : 31.5 secs
 
 But now: using the core.stdc.math atan as per Uknown's suggestion:
 D:
 dmd:  9 secs
 dmd -release -inline -O :  6.8 secs
 ldc2 : 10 secs
 ldc2 -release -O :  6.5 secs   <- best
 
 So indeed the difference is between the `std.math atan` versus the
 `core.stdc.math atan`. Thanks Uknown! Just knowing this trick could
 make the difference between me and other scientists switching over to
 D...
 
 But now comes the question: can the D fundamental maths functions be
 propped up to be as fast as the C ones?
Walter has been adamant that we should always compute std.math.* functions with the `real` type, which on x86 maps to the non-IEEE 80-bit floats. However, 80-bit floats have been deprecated for a while now, and pretty much nobody cares to improve their performance on newer CPUs, focusing instead on SSE/MMX performance with 64-bit doubles. People have been clamoring for using 64-bit doubles by default rather than 80-bit floats, but so far Walter has refused to budge.

But perhaps this time, we might have a strong case for pushing this into D. IMO, it has been long overdue. I filed an issue for this:

	https://issues.dlang.org/show_bug.cgi?id=18559

If you have any additional relevant information, please post it there so that we can build a strong case to convince Walter about this issue.

T

-- 
Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.
Mar 05 2018
next sibling parent reply bachmeier <no spam.net> writes:
On Monday, 5 March 2018 at 20:11:06 UTC, H. S. Teoh wrote:

 Walter has been adamant that we should always compute 
 std.math.* functions with the `real` type, which on x86 maps to 
 the non-IEEE 80-bit floats.  However, 80-bit floats have been 
 deprecated for a while now, and pretty much nobody cares to 
 improve their performance on newer CPUs, focusing instead on 
 SSE/MMX performance with 64-bit doubles.  People have been 
 clamoring for using 64-bit doubles by default rather than 
 80-bit floats, but so far Walter has refused to budge.
I wonder if Ilya has worked on any of this for Mir.
Mar 05 2018
parent jmh530 <john.michael.hall gmail.com> writes:
On Monday, 5 March 2018 at 21:05:19 UTC, bachmeier wrote:
 I wonder if Ilya has worked on any of this for Mir.
Mir has sin and cos, but that's it. It looks like they use LLVM intrinsics on LDC and then fall back to Phobos's implementation.
Mar 05 2018
prev sibling next sibling parent reply =?iso-8859-1?Q?Robert_M._M=FCnch?= <robert.muench saphirion.com> writes:
On 2018-03-05 20:11:06 +0000, H. S. Teoh said:

 Walter has been adamant that we should always compute std.math.*
 functions with the `real` type, which on x86 maps to the non-IEEE 80-bit
 floats.  However, 80-bit floats have been deprecated for a while now,
Hi, do you have a reference for this? I can't believe this, as the 80-bit floats are pretty important for a lot of optimization algorithms. We use them all the time and they're absolutely necessary.
 and pretty much nobody cares to improve their performance on newer CPUs,
Really?
 focusing instead on SSE/MMX performance with 64-bit doubles.  People
 have been clamoring for using 64-bit doubles by default rather than
 80-bit floats, but so far Walter has refused to budge.
IMO this is all driven by the GPU/AI hype that just (seems) to be happy with rough precision.

-- 
Robert M. Münch
http://www.saphirion.com
smarter | better | faster
Mar 05 2018
next sibling parent reply J-S Caux <js gmail.com> writes:
On Tuesday, 6 March 2018 at 07:12:57 UTC, Robert M. Münch wrote:
 On 2018-03-05 20:11:06 +0000, H. S. Teoh said:

 Walter has been adamant that we should always compute 
 std.math.*
 functions with the `real` type, which on x86 maps to the 
 non-IEEE 80-bit
 floats.  However, 80-bit floats have been deprecated for a 
 while now,
Hi, do you have a reference for this? I can't believe this, as the 80-bit floats are pretty important for a lot of optimization algorithms. We use them all the time and they're absolutely necessary.
 and pretty much nobody cares to improve their performance on 
 newer CPUs,
Really?
 focusing instead on SSE/MMX performance with 64-bit doubles.  
 People
 have been clamoring for using 64-bit doubles by default rather 
 than
 80-bit floats, but so far Walter has refused to budge.
IMO this is all driven by the GPU/AI hype that just (seems) to be happy with rough precision.
Speaking for myself, the reason I haven't made the switch from C++ to D many years ago for all my scientific work is that for many computations, 64-bit precision is certainly sufficient, and the performance I could get out of D (a factor of 4 to 6 slower in my tests) was simply insufficient.

Now, with Uknown's trick of using the C math functions, I can reconsider. It's a bit of a "patch" but at least it works.

In an ideal world, I'd like the language I use to:
- have double-precision arithmetic with equal performance to C/C++
- have all basic mathematical functions implemented, including for complex types
- *big bonus*: have the ability to do extended-precision arithmetic (integer, but most importantly (complex) floating-point) on the fly if I so wish, without having to rely on external libraries.

C++ was always fine, with external libraries for extended precision, but D is so much more pleasant to use. Many of my colleagues are switching to e.g. Julia despite the performance costs, because it is by design a very maths/science-friendly language. D is however much closer to a whole stack of existing codebases, so switching to it would involve much less extensive refactoring.
Mar 06 2018
parent Uknown <sireeshkodali1 gmail.com> writes:
On Tuesday, 6 March 2018 at 08:20:05 UTC, J-S Caux wrote:
 On Tuesday, 6 March 2018 at 07:12:57 UTC, Robert M. Münch wrote:
 On 2018-03-05 20:11:06 +0000, H. S. Teoh said:
[snip]

 Now, with Uknown's trick of using the C math functions, I can 
 reconsider. It's a bit of a "patch" but at least it works.
I'm glad I could help!
 In an ideal world, I'd like the language I use to:
 - have double-precision arithmetic with equal performance to 
 C/C++
 - have all basic mathematical functions implemented, including 
 for complex types
 - *big bonus*: have the ability to do extended-precision 
 arithmetic (integer, but most importantly (complex) 
 floating-point) on-the-fly if I so wish, without having to rely 
 on external libraries.
D has std.complex and built-in complex types, just like C [0][1]. I modified the Mandelbrot generator on Wikipedia to use D's std.complex and didn't have too much of an issue with performance. [2] Also, std.bigint and mir might be of interest to you. [3]
 C++ was always fine, with external libraries for extended 
 precision, but D is so much more pleasant to use. Many of my 
 colleagues are switching to e.g. Julia despite the performance 
 costs, because it is by design a very maths/science-friendly 
 language. D is however much closer to a whole stack of existing 
 codebases, so switching to it would involve much less extensive 
 refactoring.
There's a good chance D can interface with those libraries you mentioned...

[0]: https://dlang.org/phobos/std_complex.html
[1]: https://dlang.org/phobos/core_stdc_complex.html
[2]: https://github.com/Sirsireesh/Khoj-2017/blob/master/Mandelbrot-set/mandelbrot.d
[3]: https://github.com/libmir
Mar 06 2018
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Mar 06, 2018 at 08:12:57AM +0100, Robert M. Münch via
Digitalmars-d-learn wrote:
 On 2018-03-05 20:11:06 +0000, H. S. Teoh said:
 
 Walter has been adamant that we should always compute std.math.*
 functions with the `real` type, which on x86 maps to the non-IEEE
 80-bit floats.  However, 80-bit floats have been deprecated for a
 while now,
Hi, do you have a reference for this? I can't believe this, as the 80-bit floats are pretty important for a lot of optimization algorithms. We use them all the time and they're absolutely necessary.
[...]

http://www.zdnet.com/article/nvidia-de-optimizes-physx-for-the-cpu/?tag=nl.e539

Quotation:

	Intel started discouraging the use of x87 with the introduction of
	the P4 in late 2000. AMD deprecated x87 since the K8 in 2003, as
	x86-64 is defined with SSE2 support; VIA's C7 has supported SSE2
	since 2005. In 64-bit versions of Windows, x87 is deprecated for
	user-mode, and prohibited entirely in kernel-mode. Pretty much
	everyone in the industry has recommended SSE over x87 since 2005 and
	there are no reasons to use x87, unless software has to run on an
	embedded Pentium or 486.

I'm not advocating for getting *rid* of 80-bit float support, but only to make it *optional* rather than the default, as currently done in std.math.

T

-- 
Once bitten, twice cry...
Mar 06 2018
parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 6 March 2018 at 17:51:54 UTC, H. S. Teoh wrote:
 [snip]

 I'm not advocating for getting *rid* of 80-bit float support, 
 but only to make it *optional* rather than the default, as 
 currently done in std.math.


 T
Aren't there two issues?

1) std.math functions that cast to real to perform calculations,
2) the compiler sometimes converts things to real in the background when people don't want it to.

Number 1 seems straightforward to fix: introduce new versions of the std.math functions for float/double, and the user can cast to real if the additional accuracy is necessary. Number 2 would require a compiler switch, I imagine.
Mar 06 2018
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Mar 06, 2018 at 06:05:59PM +0000, jmh530 via Digitalmars-d-learn wrote:
 On Tuesday, 6 March 2018 at 17:51:54 UTC, H. S. Teoh wrote:
 [snip]
 
 I'm not advocating for getting *rid* of 80-bit float support, but
 only to make it *optional* rather than the default, as currently
 done in std.math.
[...]
 Aren't there two issues: 1) std.math functions that cast to real to
 perform calculations, 2) the compiler sometimes converts things to
 real in the background when people don't want it to.
 
 Number 1 seems straightforward to fix. Introduce new versions of the
 std.math functions for float/double and the user can cast to real if
 the additional accuracy is necessary.
The fix itself may be straightforward, but how to do it without breaking tons of existing code and provoking user backlash is the tricky part.
 Number 2 would require a compiler switch, I imagine.
It may not always be the compiler's fault. In the case of x87, it's the hardware itself that internally promotes to 80-bit and truncates later. IIRC, the original intent was that user code would only deal with 64-bit, and the 80-bit stuff would only happen inside the x87 (C, for example, does not provide direct access to this type, except via vendor extensions). However, due to the necessity of being able to save intermediate computational states, there are instructions that can load/extract 80-bit intermediate values to/from the x87, and eventually people ended up just using these instructions for working with the 80-bit type directly.

You can suppress the compiler from issuing these instructions, but 64-bit doubles may still be internally converted by the hardware to 80-bit intermediate values during computation. But I suppose you could force the compiler to use SSE instructions for double operations instead of x87; then it would bypass the 80-bit intermediate values completely.

T

-- 
Being able to learn is a great learning; being able to unlearn is a greater learning.
Mar 06 2018
parent jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 6 March 2018 at 18:41:15 UTC, H. S. Teoh wrote:
 The fix itself may be straightforward, but how to do it without 
 breaking tons of existing code and provoking user backlash is 
 the tricky part.
 [snip]
Ah, I see what you're saying. People may be depending on the extra accuracy for these functions. Would just require something like:

double sin(double x) @safe pure nothrow @nogc
{
    version (FP_Math)
    {
        // double sin implementation
    }
    else
    {
        return sin(cast(real) x);
    }
}
Mar 06 2018
prev sibling parent Andrea Fontana <nospam example.com> writes:
On Monday, 5 March 2018 at 20:11:06 UTC, H. S. Teoh wrote:
 Walter has been adamant that we should always compute 
 std.math.* functions with the `real` type
 T
I don't understand why atan(float) returns real and atan(double) returns real too. If I'm working with float, why does it return a real?

If you want to compute with real that's OK, but shouldn't it be T atan(T) rather than real atan(T)? I'm missing something.

Andrea
Mar 06 2018
prev sibling next sibling parent Johan Engelen <j j.nl> writes:
On Monday, 5 March 2018 at 06:01:27 UTC, J-S Caux wrote:
 On Monday, 5 March 2018 at 05:40:09 UTC, rikki cattermole wrote:
 On 05/03/2018 6:35 PM, J-S Caux wrote:
 I'm considering shifting a large existing C++ codebase into D 
 (it's a scientific code making much use of functions like 
 atan, log etc).
 
 I've compared the raw speed of atan between C++ (Apple LLVM 
 version 7.3.0 (clang-703.0.29)) and D (dmd v2.079.0, also 
 ldc2 1.7.0) by doing long loops of such functions.
 
 I can't get the D to run faster than about half the speed of 
 C++.
  double x = 0.0;
  for (int a = 0; a < 1000000000; ++a) x += atan(1.0/(1.0 + sqrt(1.0 + a)));

 for C++ and

  double x = 0.0;
  for (int a = 0; a < 1_000_000_000; ++a) x += atan(1.0/(1.0 + sqrt(1.0 + a)));

 for D. C++ exec takes 40 seconds, D exec takes 68 seconds.
The performance problem with this code is that LDC does not yet do cross-module inlining by default. GDC does. If you pass `-enable-cross-module-inlining` to LDC, things should be faster. In particular, std.math.sqrt is not inlined although it is profitable to do so (it becomes one machine instruction). Things become worse when using core.stdc.math.sqrt, because there is no implementation source available: no inlining possible.

Another problem is that std.math.atan(double) just calls std.math.atan(real). Calculations are more expensive on platforms where real == 80 bits (i.e. x86), and that's not solvable with a compiler flag. What it takes is someone to write the double and float versions of atan (and the other math functions), but it requires someone with the right knowledge to do it.

Your tests (and reporting about them) are much appreciated. Please do file bug reports for these things. Perhaps you can take a stab at implementing double versions of the functions you need?

cheers,
  Johan
Mar 05 2018
prev sibling parent psychoticRabbit <meagain meagain.com> writes:
On Monday, 5 March 2018 at 06:01:27 UTC, J-S Caux wrote:
 So the codes are trivial, simply some check of raw speed:

   double x = 0.0;
   for (int a = 0; a < 1000000000; ++a) x += atan(1.0/(1.0 + 
 sqrt(1.0 + a)));

 for C++ and

   double x = 0.0;
   for (int a = 0; a < 1_000_000_000; ++a) x += atan(1.0/(1.0 + 
 sqrt(1.0 + a)));

 for D. C++ exec takes 40 seconds, D exec takes 68 seconds.
Should a be an int? Make it a double ;-)
Mar 05 2018
prev sibling parent Era Scarecrow <rtcvb32 yahoo.com> writes:
On Monday, 5 March 2018 at 05:40:09 UTC, rikki cattermole wrote:
 atan should work out to only be a few instructions (inline 
 assembly) from what I've looked at in the source.

 Also you should post the code you used for each.
Should be 3-4 instructions: load the input to the FPU (optional? depends on whether it already has the value loaded), atan, fwait (optional?), retrieve the value.

Offhand, from what I remember, FPU instructions run in their own separate space, should more or less take up only a few cycles by themselves, and run in parallel with the CPU code. At which point, if the code is running at half the speed of C++'s, that probably means bad optimization elsewhere, or even the control settings for the FPU.

I really haven't looked that in depth into the FPU stuff since about 2000...
Mar 04 2018
prev sibling parent Marc <jckj33 gmail.com> writes:
On Monday, 5 March 2018 at 05:35:28 UTC, J-S Caux wrote:
 I'm considering shifting a large existing C++ codebase into D 
 (it's a scientific code making much use of functions like atan, 
 log etc).

 I've compared the raw speed of atan between C++ (Apple LLVM 
 version 7.3.0 (clang-703.0.29)) and D (dmd v2.079.0, also ldc2 
 1.7.0) by doing long loops of such functions.

 I can't get the D to run faster than about half the speed of 
 C++.

 Are there benchmarks for such scientific functions published 
 somewhere?
What compiler flags did you use to compile the C++ and D versions?
Mar 05 2018