
digitalmars.D.ldc - Performance issue with fastmath and vectorization

reply dextorious <dextorious gmail.com> writes:
As part of slowly learning the basics of programming in D, I 
ported some of my fluid dynamics code from C++ to D and quickly 
noticed a rather severe performance degradation by a factor of 
2-3x. I've narrowed it down to a simple representative benchmark 
of virtually identical C++ and D code.

The D version: http://pastebin.com/Rs9CUA5j
The C++ code:  http://pastebin.com/XzStHXA2

I compile the D code using the latest beta release on GitHub, 
using the compiler switches -release -O5 -mcpu=haswell 
-boundscheck=off. The C++ version is compiled using Clang 3.9.0 
with the switches -std=c++14 -Ofast -fno-exceptions -fno-rtti 
-flto -ffast-math -march=native, which is my usual configuration 
for numerical code.

On my Haswell i7-4710HQ machine the C++ version runs in 
~10ms/iteration while the D code takes 25ms. Comparing profiler 
output with the generated assembly code quickly reveals the 
reason - while Clang fully unrolls the inner loop and uses FMA 
instructions wherever possible, the inner loop assembly produced 
by LDC looks like this:

   0.24 │6c0:   vmovss (%r15,%rbp,4),%xmm4
   1.03 │       vmovss (%r12,%rbp,4),%xmm5
   3.51 │       add    $0x4,%rdi
   6.96 │       add    $0x4,%rax
   1.04 │6d4:   vmulss (%rax,%rcx,1),%xmm4,%xmm4
   4.66 │       vmulss (%rax,%rdx,1),%xmm5,%xmm5
   8.44 │       vaddss %xmm4,%xmm5,%xmm4
   1.09 │       vmulss %xmm0,%xmm4,%xmm5
   3.73 │       vmulss %xmm4,%xmm5,%xmm4
   7.48 │       vsubss %xmm3,%xmm4,%xmm4
   1.13 │       vmulss %xmm1,%xmm4,%xmm4
   2.00 │       vaddss %xmm2,%xmm5,%xmm5
   3.46 │       vmovss 0x0(%r13,%rbp,4),%xmm6
   7.85 │       vmulss (%rax,%rsi,1),%xmm6,%xmm6
   2.50 │       vaddss %xmm4,%xmm5,%xmm4
   6.49 │       vmulss %xmm4,%xmm6,%xmm4
  25.48 │       vmovss %xmm4,(%rdi)
   8.26 │       cmp    $0x20,%rax
   0.00 │     ↑ jne    6c0

Am I doing something blatantly wrong here or have I run into a 
compiler limitation? Is there anything short of using intrinsics 
or calling C/C++ code I can do here to get to performance parity?

Also, while on the subject, is there a way to force LDC to apply 
the relaxed floating point model to the entire program, rather 
than individual functions (the equivalent of --fast-math)?
Nov 11 2016
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 12/11/2016 1:03 PM, dextorious wrote:
 As part of slowly learning the basics of programming in D, I ported some
 of my fluid dynamics code from C++ to D and quickly noticed a rather
 severe performance degradation by a factor of 2-3x. I've narrowed it
 down to a simple representative benchmark of virtually identical C++ and
 D code.

 The D version: http://pastebin.com/Rs9CUA5j
 The C++ code:  http://pastebin.com/XzStHXA2

 I compile the D code using the latest beta release on GitHub, using the
 compiler switches -release -O5 -mcpu=haswell -boundscheck=off. The C++
 version is compiled using Clang 3.9.0 with the switches -std=c++14
 -Ofast -fno-exceptions -fno-rtti -flto -ffast-math -march=native, which
 is my usual configuration for numerical code.

 On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration
 while the D code takes 25ms. Comparing profiler output with the
 generated assembly code quickly reveals the reason - while Clang fully
 unrolls the inner loop and uses FMA instructions wherever possible, the
 inner loop assembly produced by LDC looks like this:

   0.24 │6c0:   vmovss (%r15,%rbp,4),%xmm4
   1.03 │       vmovss (%r12,%rbp,4),%xmm5
   3.51 │       add    $0x4,%rdi
   6.96 │       add    $0x4,%rax
   1.04 │6d4:   vmulss (%rax,%rcx,1),%xmm4,%xmm4
   4.66 │       vmulss (%rax,%rdx,1),%xmm5,%xmm5
   8.44 │       vaddss %xmm4,%xmm5,%xmm4
   1.09 │       vmulss %xmm0,%xmm4,%xmm5
   3.73 │       vmulss %xmm4,%xmm5,%xmm4
   7.48 │       vsubss %xmm3,%xmm4,%xmm4
   1.13 │       vmulss %xmm1,%xmm4,%xmm4
   2.00 │       vaddss %xmm2,%xmm5,%xmm5
   3.46 │       vmovss 0x0(%r13,%rbp,4),%xmm6
   7.85 │       vmulss (%rax,%rsi,1),%xmm6,%xmm6
   2.50 │       vaddss %xmm4,%xmm5,%xmm4
   6.49 │       vmulss %xmm4,%xmm6,%xmm4
  25.48 │       vmovss %xmm4,(%rdi)
   8.26 │       cmp    $0x20,%rax
   0.00 │     ↑ jne    6c0

 Am I doing something blatantly wrong here or have I run into a compiler
 limitation? Is there anything short of using intrinsics or calling C/C++
 code I can do here to get to performance parity?

 Also, while on the subject, is there a way to force LDC to apply the
 relaxed floating point model to the entire program, rather than
 individual functions (the equivalent of --fast-math)?
Just a thought but try this:

import ldc.attributes : fastmath;

void compute_neq(float[] neq,
                 const float[] ux,
                 const float[] uy,
                 const float[] rho,
                 const float[] ex,
                 const float[] ey,
                 const float[] w,
                 const size_t N) @fastmath {
    foreach(idx; 0 .. N*N) {
        float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];

        foreach(q; 0 .. 9) {
            float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
            float tmp = 1.0f + eu + 0.5f * eu * eu - 1.5f * usqr;
            tmp *= w[q] * rho[idx];
            neq[idx * 9 + q] = tmp;
        }
    }
}

It may not make any difference since it is semantically the same but I thought at the very least rewriting it to be a bit more idiomatic may help.
Nov 11 2016
parent deXtoRious <dextorious gmail.com> writes:
On Saturday, 12 November 2016 at 03:30:47 UTC, rikki cattermole 
wrote:
 Just a thought but try this:

 void compute_neq(float[] neq,
                  const float[] ux,
                  const float[] uy,
                  const float[] rho,
                  const float[] ex,
                  const float[] ey,
                  const float[] w,
                  const size_t N) @fastmath {
     foreach(idx; 0 .. N*N) {
         float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];

         foreach(q; 0 .. 9) {
             float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
             float tmp = 1.0f + eu + 0.5f * eu * eu - 1.5f * usqr;
             tmp *= w[q] * rho[idx];
             neq[idx * 9 + q] = tmp;
         }
     }
 }

 It may not make any difference since it is semantically the 
 same but I thought at the very least rewriting it to be a bit 
 more idiomatic may help.
That's how I originally wrote the code; I then reverted to the C++ style for the comparison, to keep the two versions as identical as possible and to rule out the style making any difference. As expected, it doesn't.
Nov 12 2016
prev sibling next sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
 As part of slowly learning the basics of programming in D, I 
 ported some of my fluid dynamics code from C++ to D and quickly 
 noticed a rather severe performance degradation by a factor of 
 2-3x. I've narrowed it down to a simple representative 
 benchmark of virtually identical C++ and D code.

 [...]
You can apply attributes to a whole file with:

import ldc.attributes : fastmath;

@fastmath:

void IamFastMath(){}
void SoAmI(){}

I don't know about whole-program. I got some improvements with -vectorize-loops and from making the stencil array static and passing it by ref, but I couldn't get it to unroll the inner loop.
Nov 11 2016
parent deXtoRious <dextorious gmail.com> writes:
On Saturday, 12 November 2016 at 07:38:16 UTC, Nicholas Wilson 
wrote:
 On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
 As part of slowly learning the basics of programming in D, I 
 ported some of my fluid dynamics code from C++ to D and 
 quickly noticed a rather severe performance degradation by a 
 factor of 2-3x. I've narrowed it down to a simple 
 representative benchmark of virtually identical C++ and D code.

 [...]
 You can apply attributes to a whole file with:

 import ldc.attributes : fastmath;

 @fastmath:

 void IamFastMath(){}
 void SoAmI(){}

 I don't know about whole-program. I got some improvements with 
 -vectorize-loops and from making the stencil array static and 
 passing it by ref, but I couldn't get it to unroll the inner loop.
Isn't -vectorize-loops already enabled by the other flags? Simply adding it doesn't seem to make a difference to the inner loop assembly for me. I'll try passing a static array by ref, which should slightly improve the function call performance, but I'd be surprised if it actually lets the compiler properly vectorize the inner loop or fully unroll it.
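For reference, a rough sketch of the signature change I have in mind (untested; it assumes the attribute comes from ldc.attributes and the body stays exactly as in the earlier compute_neq example):

import ldc.attributes : fastmath;

void compute_neq(float[] neq,
                 const float[] ux,
                 const float[] uy,
                 const float[] rho,
                 ref const float[9] ex,   // stencil arrays as static arrays,
                 ref const float[9] ey,   // passed by ref to avoid copying
                 ref const float[9] w,
                 const size_t N) @fastmath {
    // body unchanged from the earlier version
}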
Nov 12 2016
prev sibling next sibling parent reply LiNbO3 <nosp m.please> writes:
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
 On my Haswell i7-4710HQ machine the C++ version runs in 
 ~10ms/iteration while the D code takes 25ms. Comparing profiler 
 output with the generated assembly code quickly reveals the 
 reason - while Clang fully unrolls the inner loop and uses FMA 
 instructions wherever possible, the inner loop assembly 
 produced by LDC looks like this:
Compiling your code on the godbolt service (https://d.godbolt.org/) with the same set of flags you used, I do see FMA instructions being used.
Nov 12 2016
parent reply deXtoRious <dextorious gmail.com> writes:
On Saturday, 12 November 2016 at 09:45:29 UTC, LiNbO3 wrote:
 On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
 On my Haswell i7-4710HQ machine the C++ version runs in 
 ~10ms/iteration while the D code takes 25ms. Comparing 
 profiler output with the generated assembly code quickly 
 reveals the reason - while Clang fully unrolls the inner loop 
 and uses FMA instructions wherever possible, the inner loop 
 assembly produced by LDC looks like this:
By compiling your code with the same set of flags you used on the godbolt (https://d.godbolt.org/) service I do see the FMA instructions being used.
There are three vfmadd231ss in the entire assembly, but none of them are in the inner loop. The presence of any FMA instructions at all does show that the compiler properly accepts the -mcpu switch, but it doesn't seem to recognize the opportunities present in the inner loop. The assembly generated by the godbolt service seems largely identical to the one I got on my local machine.
Nov 12 2016
parent reply Johan Engelen <j j.nl> writes:
On Saturday, 12 November 2016 at 10:27:53 UTC, deXtoRious wrote:
 There are three vfmadd231ss in the entire assembly, but none of 
 them are in the inner loop. The presence of any FMA 
 instructions at all does show that the compiler properly 
 accepts the -mcpu switch, but it doesn't seem to recognize the 
 opportunities present in the inner loop.
Does the C++ need `__restrict__` for the parameters to get the assembly you want?
 The assembly generated by the godbolt service seems largely 
 identical to the one I got on my local machine.
It is easier for the discussion if you paste godbolt.org links btw, so we don't have to recreate them ourselves ;-)

-Johan
Nov 12 2016
parent reply deXtoRious <dextorious gmail.com> writes:
On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen 
wrote:
 On Saturday, 12 November 2016 at 10:27:53 UTC, deXtoRious wrote:
 There are three vfmadd231ss in the entire assembly, but none 
 of them are in the inner loop. The presence of any FMA 
 instructions at all does show that the compiler properly 
 accepts the -mcpu switch, but it doesn't seem to recognize the 
 opportunities present in the inner loop.
Does the C++ need `__restrict__` for the parameters to get the assembly you want?
In this case, it doesn't seem to make any difference. I habitually use __restrict__ wherever possible in HPC code, but nowadays Clang/GCC are often smart enough to make the inference regardless. On that note, I was under the impression that D arrays came with a no-aliasing assumption. If that's not the case, is there a way to achieve the equivalent of __restrict__ in D?
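To illustrate the aliasing question: two plain slices can legally overlap, so without some restrict-like guarantee the optimizer has to assume the worst. A minimal sketch:

// writes through `a` may change what `b` sees, because slices can overlap
void scale(float[] a, const(float)[] b) {
    foreach (i; 0 .. a.length)
        a[i] += 2.0f * b[i];
}

void caller(float[] buf) {
    scale(buf[1 .. $], buf[0 .. $ - 1]);   // overlapping views, perfectly legal
}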
 The assembly generated by the godbolt service seems largely 
 identical to the one I got on my local machine.
It is easier for the discussion if you paste godbolt.org links btw, so we don't have to manually do it ourselves ;-) -Johan
Will do. :) By the way, I posted that issue on GH: https://github.com/ldc-developers/ldc/issues/1874
Nov 12 2016
parent reply Johan Engelen <j j.nl> writes:
On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious wrote:
 On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen 
 wrote:
 Does the C++ need `__restrict__` for the parameters to get the 
 assembly you want?
In this case, it doesn't seem to make any difference.
That's good news, because there is currently no way to add that to LDC code, afaik.

Hope you can try to cut more of these things from the example so it's easier to figure out why things are different (e.g. is -Ofast needed, or is -O3 enough?).

Thanks!

cheers,
  Johan
Nov 12 2016
parent reply deXtoRious <dextorious gmail.com> writes:
On Saturday, 12 November 2016 at 11:04:59 UTC, Johan Engelen 
wrote:
 On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious wrote:
 On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen 
 wrote:
 Does the C++ need `__restrict__` for the parameters to get 
 the assembly you want?
In this case, it doesn't seem to make any difference.
That's good news, because there is currently no way to add that to LDC code, afaik.
I hope it's somewhere on the roadmap for the future, as it does still make a measurable difference in some cases.
 Hope you can try to cut more of these things from the example 
 so it's easier to figure out why things are different.  (e.g. 
 is -Ofast needed, or is -O3 enough?)

 Thanks!

 cheers,
   Johan
-Ofast is also there out of habit; it doesn't make a meaningful difference for a benchmark as simple as this. Other switches like -fno-rtti, -fno-exceptions and even -flto can also be dropped: simply using -O3 -march=native -ffast-math is sufficient to outperform LDC by 2.5x, losing only about 10% of the best C++ performance and producing essentially the same unrolled, FMA-enabled assembly with very minor changes.
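For concreteness, the reduced invocations being compared look roughly like this (source file names are just placeholders):

clang++ -std=c++14 -O3 -march=native -ffast-math bench.cpp -o bench_cpp
ldc2 -release -O5 -mcpu=haswell -boundscheck=off bench.d -of=bench_d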
Nov 12 2016
parent reply Johan Engelen <j j.nl> writes:
On Saturday, 12 November 2016 at 11:16:16 UTC, deXtoRious wrote:
 On Saturday, 12 November 2016 at 11:04:59 UTC, Johan Engelen 
 wrote:
 On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious 
 wrote:
 On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen 
 wrote:
 Does the C++ need `__restrict__` for the parameters to get 
 the assembly you want?
In this case, it doesn't seem to make any difference.
That's good news, because there is currently no way to add that to LDC code, afaik.
I hope it's somewhere on the roadmap for the future, as it does still make a measurable difference in some cases.
Can you file an issue for that too? (Ideas in forum posts get lost instantly.) Make sure you add an (as small as possible) testcase that shows a clear difference in codegen for the C++ with and without __restrict__, and the correspondingly worse codegen for the D code, which has no way to express it.

It may be relatively easy to implement in LDC, but I don't think many people know the intricacies of C's restrict. Examples of the effect it has on the assembly (Clang C++) help a lot towards getting it implemented.
 -Ofast is also there out of habit, doesn't make a meaningful 
 difference for a benchmark as simple as this. Other switches, 
 like -fno-rtti, -fno-exceptions and even -flto can also be 
 dropped, simply using -O3 -march=native -ffast-math is 
 sufficient to outperform LDC by 2.5x, losing only about 10% 
 from the best C++ performance and producing essentially the 
 same unrolled FMA-enabled assembly with very minor changes.
OK great. I think you ran into a compiler limitation somehow, so make sure you submit the simplified example/testcase on GH! ;) (The simpler you can make it, the better.)

Btw, for benchmarking, you should mark the `compute_neq` function with "weak linkage", so that the compiler won't do inter-procedural optimization for the call to `compute_neq` in `main` (@weak for LDC; Clang probably something like __attribute__((weak))).
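On the D side that would look roughly like this (untested sketch; it assumes @weak is importable from ldc.attributes alongside @fastmath):

import ldc.attributes : fastmath, weak;

// weak linkage stops the optimizer from assuming this is the only definition,
// so the call in main() is not specialized away during IPO
@weak @fastmath
void compute_neq(/* same parameter list as before */) {
    // ...
}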
Nov 12 2016
next sibling parent deXtoRious <dextorious gmail.com> writes:
On Saturday, 12 November 2016 at 12:11:35 UTC, Johan Engelen 
wrote:
 On Saturday, 12 November 2016 at 11:16:16 UTC, deXtoRious wrote:
 On Saturday, 12 November 2016 at 11:04:59 UTC, Johan Engelen 
 wrote:
 On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious 
 wrote:
 On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen 
 wrote:
 Does the C++ need `__restrict__` for the parameters to get 
 the assembly you want?
In this case, it doesn't seem to make any difference.
That's good news, because there is currently no way to add that to LDC code, afaik.
I hope it's somewhere on the roadmap for the future, as it does still make a measurable difference in some cases.
Can you file an issue for that too? (ideas in forum posts get lost instantly) Make sure you add an (as small as possible) testcase that shows a clear difference in codegen (with/without for C++), and with worse codegen with D code without it. It may be relatively easy to implement it in LDC, but I don't think many people know the intricacies of C's restrict. With examples of the effect it has on assembly (clang C++) helps a lot towards getting it implemented.
 -Ofast is also there out of habit, doesn't make a meaningful 
 difference for a benchmark as simple as this. Other switches, 
 like -fno-rtti, -fno-exceptions and even -flto can also be 
 dropped, simply using -O3 -march=native -ffast-math is 
 sufficient to outperform LDC by 2.5x, losing only about 10% 
 from the best C++ performance and producing essentially the 
 same unrolled FMA-enabled assembly with very minor changes.
OK great. I think you ran into a compiler limitation somehow, so make sure you submit the simplified example/testcase on GH! ;) (the simpler you can make it, the better) Btw, for benchmarking, you should mark the `compute_neq` function as "weak linkage", such that the compiler is not going to do inter-procedural optimization for the call to `compute_neq` in `main`. (@weak for LDC, clang probably something like __attribute__((weak)))
Okay, I'll clean up the code and post an issue on GH later today; hopefully someone can figure out where the discrepancy comes from. I'll also file a separate issue / feature request for restrict afterwards, once I write up a representative test case that highlights the performance impact. Thanks for your help! The ability to get quick responses on compiler issues like this is really encouraging me to write more high-performance code in D.
Nov 12 2016
prev sibling parent reply deXtoRious <dextorious gmail.com> writes:
Okay, so I've done some further experimentation with rather 
peculiar results. On the bright side, I'm now fairly sure this 
isn't an outright bug in the compiler. On the flip side, however, 
I'm quite confused about the results.

For the record, here are the current versions of the benchmark in 
godbolt:
D:   https://godbolt.org/g/B8gosP
C++: https://godbolt.org/g/DWjQrV

Apparently, LDC can be coaxed to use FMA instructions after all. 
It seems that with __attribute__((__weak__)) Clang produces code 
that is essentially identical to the D binary; both run in about 
19ms on my machine. When I remove __attribute__((__weak__)) and 
make the compute_neq function static void rather than simply 
void, Clang further unrolls the inner loop and uses a number of 
optimized load/store instructions that increase the performance 
by a huge margin - down to about 7ms. As for LDC, adding or 
removing @weak and static also has a major impact on the 
generated code and therefore on the performance.

I have not found any way to make LDC perform the same 
optimizations as Clang's best case (simply static void, no weak 
attribute) and have run out of ideas. Furthermore, I have no idea 
why the aforementioned changes in the function declaration affect 
both optimizers in this way, or whether finer control over 
vectorization/loop unrolling is possible in LDC. Any thoughts?
Nov 12 2016
next sibling parent reply Johan Engelen <j j.nl> writes:
On Saturday, 12 November 2016 at 15:44:28 UTC, deXtoRious wrote:
 I have not found any way to make LDC perform the same 
 optimizations as Clang's best case (simply static void, no weak 
 attribute) and have run out of ideas. Furthermore, I have no 
 idea why the aforementioned changes in the function declaration 
 affect both optimizers in this way, or whether finer 
 control over vectorization/loop unrolling is possible in LDC. 
 Any thoughts?
I think that perhaps when inlining the @fastmath function, some optimization attributes are lost somehow and the inlined code is not optimized as much (you'd have to specify @fastmath on main too). It'd be easier to compare with -ffast-math, I guess ;-)

A look at the generated LLVM IR may provide some clues.
Nov 12 2016
parent reply deXtoRious <dextorious gmail.com> writes:
On Saturday, 12 November 2016 at 16:29:20 UTC, Johan Engelen 
wrote:
 On Saturday, 12 November 2016 at 15:44:28 UTC, deXtoRious wrote:
 I have not found any way to make LDC perform the same 
 optimizations as Clang's best case (simply static void, no 
 weak attribute) and have run out of ideas. Furthermore, I have 
 no idea why the aforementioned changes in the function 
 declaration affect both optimizers in this way, or whether 
 finer control over vectorization/loop unrolling is possible in 
 LDC. Any thoughts?
I think that perhaps when inlining the fastmath function, some optimization attributes are lost somehow and the inlined code is not optimized as much (you'd have to specify fastmath on main too). It'd be easier to compare with -ffast-math I guess ;-) A look at the generated LLVM IR may provide some clues.
I tried putting @fastmath on main as well; it makes no difference whatsoever (identical generated assembly). Apart from the weirdness with @weak/static making way more difference than I would intuitively expect, it seems the major factor preventing performance parity with Clang is the more conservative loop optimization. Is there a way, similar to #pragma unroll in Clang, to tell LDC to try to unroll the inner loop?
Nov 12 2016
parent reply Johan Engelen <j j.nl> writes:
On Saturday, 12 November 2016 at 16:40:27 UTC, deXtoRious wrote:
 
 I tried putting @fastmath on main as well, it makes no 
 difference whatsoever (identical generated assembly).
Yeah I saw it too. It's a bit strange.
 Apart from the weirdness with weak/static making way more 
 difference than I would intuitively expect,
I am also surprised, but: adding `static` in C++ makes it a fully private function, which does not need to be emitted as a standalone symbol at all (and isn't in your case, because it is fully inlined). I added `pragma(inline, true)` to the D function hoping to get a similar effect.
 it seems the major factor preventing performance parity with 
 Clang is the conservative loop optimizations. Is there a way, 
 similar to #pragma unroll in Clang, to tell LDC to try to 
 unroll the inner loop?
There isn't at the moment. We need a mechanism to tag statements with such metadata. In LLVM IR, this is what you'd want: http://llvm.org/docs/LangRef.html#llvm-loop

I am not enough of a D expert to come up with a good way to do this. Perhaps David can help come up with a solution? Good stuff for another GitHub issue! ;-)
Nov 12 2016
parent reply Kagamin <spam here.lot> writes:
On Saturday, 12 November 2016 at 18:55:19 UTC, Johan Engelen 
wrote:
 I am not enough of a D expert to come up with a good way to do 
 this.
The spec says a pragma can be applied to statements: https://dlang.org/spec/pragma.html
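For example, something along these lines already parses today; pragma(inline) is just the closest existing pragma that takes effect inside a function body, and a future unroll hint could hang off the same grammar by being attached to the loop statement itself:

void compute(float[] a) {
    pragma(inline, true);    // a pragma used as a statement in the body
    foreach (q; 0 .. 9)      // an unroll pragma could be attached to this
        a[q] *= 2.0f;        // statement the same way
}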
Nov 14 2016
parent Johan Engelen <j j.nl> writes:
On Monday, 14 November 2016 at 10:45:08 UTC, Kagamin wrote:
 On Saturday, 12 November 2016 at 18:55:19 UTC, Johan Engelen 
 wrote:
 I am not enough of a D expert to come up with a good way to do 
 this.
Spec says pragma can be applied to statements: https://dlang.org/spec/pragma.html
Excellent.
Nov 14 2016
prev sibling parent Kagamin <spam here.lot> writes:
On Saturday, 12 November 2016 at 15:44:28 UTC, deXtoRious wrote:
 I have not found any way to make LDC perform the same 
 optimizations as Clang's best case (simply static void, no weak 
 attribute) and have run out of ideas. Furthermore, I have no 
 idea why the aforementioned changes in the function declaration 
 affect both optimizers in this way, or whether finer 
 control over vectorization/loop unrolling is possible in LDC. 
 Any thoughts?
LDC can compile to LLVM bitcode; you can then generate object code from it with llc and its various options. AFAIK, the LLVM equivalent of a private symbol is the "hidden" visibility attribute; you can try to apply that, though LTO shouldn't be affected by it.
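A rough sketch of that workflow (file names are illustrative; the llc options can then be varied freely):

ldc2 -output-bc -release -O3 -mcpu=haswell -boundscheck=off compute_neq.d
llc -O3 -mcpu=haswell compute_neq.bc -o compute_neq.s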
Nov 14 2016
prev sibling parent reply Johan Engelen <j j.nl> writes:
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
 Also, while on the subject, is there a way to force LDC to 
 apply the relaxed floating point model to the entire program, 
 rather than individual functions (the equivalent of 
 --fast-math)?
Not yet. If you really think this has value, please file an issue for it on GH. It will be easy to add it to LDC. -Johan
Nov 12 2016
parent deXtoRious <dextorious gmail.com> writes:
On Saturday, 12 November 2016 at 10:10:44 UTC, Johan Engelen 
wrote:
 On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
 Also, while on the subject, is there a way to force LDC to 
 apply the relaxed floating point model to the entire program, 
 rather than individual functions (the equivalent of 
 --fast-math)?
Not yet. If you really think this has value, please file an issue for it on GH. It will be easy to add it to LDC. -Johan
Will do. The syntax Nicholas Wilson mentioned previously does make it easier to apply the attribute to multiple functions at once, but numerical code is often written under the uniform assumption of the relaxed floating-point model, and in those cases it seems much more appropriate to set it at the compiler level. It would also simplify benchmarking.
Nov 12 2016