digitalmars.D.ldc - Performance issue with fastmath and vectorization
- dextorious (44/44) Nov 11 2016 As part of slowly learning the basics of programming in D, I
- rikki cattermole (22/64) Nov 11 2016 Just a thought but try this:
- deXtoRious (6/30) Nov 12 2016 That's how I originally wrote the code, then reverted to the
- Nicholas Wilson (9/15) Nov 11 2016 you can apply attributes to whole files with
- deXtoRious (8/24) Nov 12 2016 Isn't -vectorize-loops already enabled by the other flags? Simply
- LiNbO3 (4/10) Nov 12 2016 By compiling your code with the same set of flags you used on the
- deXtoRious (8/18) Nov 12 2016 There are three vfmadd231ss in the entire assembly, but none of
- Johan Engelen (6/13) Nov 12 2016 Does the C++ need `__restrict__` for the parameters to get the
- deXtoRious (12/26) Nov 12 2016 In this case, it doesn't seem to make any difference. It is
- Johan Engelen (9/15) Nov 12 2016 That's good news, because there is currently no way to add that
- deXtoRious (11/27) Nov 12 2016 I hope it's somewhere on the roadmap for the future, as it does
- Johan Engelen (19/42) Nov 12 2016 Can you file an issue for that too? (ideas in forum posts get
- deXtoRious (10/53) Nov 12 2016 Okay, I'll clean up the code and post an issue on GH later today,
- deXtoRious (24/24) Nov 12 2016 Okay, so I've done some further experimentation with rather
- Johan Engelen (7/14) Nov 12 2016 I think that perhaps when inlining the fastmath function, some
- deXtoRious (9/24) Nov 12 2016 I tried putting @fastmath on main as well, it makes no difference
- Johan Engelen (13/22) Nov 12 2016 I am also surprised but: adding `static` in C++ makes it a fully
- Kagamin (4/6) Nov 14 2016 Spec says pragma can be applied to statements:
- Johan Engelen (2/8) Nov 14 2016 Excellent.
- Kagamin (5/12) Nov 14 2016 LDC can compile to llvm bitcode, you can then generate object
- Johan Engelen (5/9) Nov 12 2016 Not yet. If you really think this has value, please file an issue
- deXtoRious (8/18) Nov 12 2016 Will do. The syntax Nicholas Wilson mentioned previously does
As part of slowly learning the basics of programming in D, I ported some
of my fluid dynamics code from C++ to D and quickly noticed a rather
severe performance degradation by a factor of 2-3x. I've narrowed it down
to a simple representative benchmark of virtually identical C++ and D
code.

The D version: http://pastebin.com/Rs9CUA5j
The C++ code: http://pastebin.com/XzStHXA2

I compile the D code using the latest beta release on GitHub, using the
compiler switches -release -O5 -mcpu=haswell -boundscheck=off. The C++
version is compiled using Clang 3.9.0 with the switches -std=c++14 -Ofast
-fno-exceptions -fno-rtti -flto -ffast-math -march=native, which is my
usual configuration for numerical code.

On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration
while the D code takes 25ms. Comparing profiler output with the generated
assembly code quickly reveals the reason - while Clang fully unrolls the
inner loop and uses FMA instructions wherever possible, the inner loop
assembly produced by LDC looks like this:

 0.24 │6c0:   vmovss (%r15,%rbp,4),%xmm4
 1.03 │       vmovss (%r12,%rbp,4),%xmm5
 3.51 │       add    $0x4,%rdi
 6.96 │       add    $0x4,%rax
 1.04 │6d4:   vmulss (%rax,%rcx,1),%xmm4,%xmm4
 4.66 │       vmulss (%rax,%rdx,1),%xmm5,%xmm5
 8.44 │       vaddss %xmm4,%xmm5,%xmm4
 1.09 │       vmulss %xmm0,%xmm4,%xmm5
 3.73 │       vmulss %xmm4,%xmm5,%xmm4
 7.48 │       vsubss %xmm3,%xmm4,%xmm4
 1.13 │       vmulss %xmm1,%xmm4,%xmm4
 2.00 │       vaddss %xmm2,%xmm5,%xmm5
 3.46 │       vmovss 0x0(%r13,%rbp,4),%xmm6
 7.85 │       vmulss (%rax,%rsi,1),%xmm6,%xmm6
 2.50 │       vaddss %xmm4,%xmm5,%xmm4
 6.49 │       vmulss %xmm4,%xmm6,%xmm4
25.48 │       vmovss %xmm4,(%rdi)
 8.26 │       cmp    $0x20,%rax
 0.00 │     ↑ jne    6c0

Am I doing something blatantly wrong here or have I run into a compiler
limitation? Is there anything short of using intrinsics or calling C/C++
code I can do here to get to performance parity? Also, while on the
subject, is there a way to force LDC to apply the relaxed floating point
model to the entire program, rather than individual functions (the
equivalent of --fast-math)?
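For reference, the per-function form of the relaxed model in LDC looks
roughly like this (a minimal sketch assuming the @fastmath attribute from
ldc.attributes; the function below is illustrative, not taken from the
pastebin):

import ldc.attributes : fastmath;

// Relaxed floating-point semantics apply only inside this function, so
// the multiply-add below is eligible for FMA contraction and
// reassociation.
@fastmath
void axpy1(float[] a, const float k) {
    foreach (i; 0 .. a.length)
        a[i] = a[i] * k + 1.0f;
}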
Nov 11 2016
On 12/11/2016 1:03 PM, dextorious wrote:
> As part of slowly learning the basics of programming in D, I ported some
> of my fluid dynamics code from C++ to D and quickly noticed a rather
> severe performance degradation by a factor of 2-3x. I've narrowed it
> down to a simple representative benchmark of virtually identical C++ and
> D code. [...]

Just a thought but try this:

@fastmath
void compute_neq(float[] neq, const float[] ux, const float[] uy,
                 const float[] rho, const float[] ex, const float[] ey,
                 const float[] w, const size_t N) {
    foreach(idx; 0 .. N*N) {
        float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];
        foreach(q; 0 .. 9) {
            float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
            float tmp = 1.0f + eu + 0.5f * eu * eu - 1.5f * usqr;
            tmp *= w[q] * rho[idx];
            neq[idx * 9 + q] = tmp;
        }
    }
}

It may not make any difference since it is semantically the same, but I
thought at the very least rewriting it to be a bit more idiomatic may
help.
Nov 11 2016
On Saturday, 12 November 2016 at 03:30:47 UTC, rikki cattermole wrote:
> Just a thought but try this: [...] It may not make any difference since
> it is semantically the same, but I thought at the very least rewriting
> it to be a bit more idiomatic may help.

That's how I originally wrote the code, then reverted to the C++-style for
the comparison to make the code as identical as possible and make sure it
doesn't make any difference. As expected, it doesn't.
Nov 12 2016
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
> As part of slowly learning the basics of programming in D, I ported some
> of my fluid dynamics code from C++ to D and quickly noticed a rather
> severe performance degradation by a factor of 2-3x. I've narrowed it
> down to a simple representative benchmark of virtually identical C++ and
> D code. [...]

You can apply attributes to whole files with

@fastmath:
void IamFastMath(){}
void SoAmI(){}

Don't know about whole program.

I got some improvements with -vectorize-loops and making the stencil array
static and passing by ref. I couldn't get it to unroll the inner loop
though.
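Spelled out with the import it needs, the file-level form is roughly as
follows (a sketch assuming LDC's ldc.attributes module; the colon form
applies the attribute to every declaration that follows it in the module):

import ldc.attributes;

// Everything declared below this line picks up @fastmath.
@fastmath:

void IamFastMath() {}
void SoAmI() {}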
Nov 11 2016
On Saturday, 12 November 2016 at 07:38:16 UTC, Nicholas Wilson wrote:
> You can apply attributes to whole files with @fastmath: [...] Don't know
> about whole program. I got some improvements with -vectorize-loops and
> making the stencil array static and passing by ref. I couldn't get it to
> unroll the inner loop though.

Isn't -vectorize-loops already enabled by the other flags? Simply adding
it doesn't seem to make a difference to the inner loop assembly for me.
I'll try passing a static array by ref, which should slightly improve the
function call performance, but I'd be surprised if it actually lets the
compiler properly vectorize the inner loop or fully unroll it.
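For reference, a sketch of the signature change being discussed. It
assumes the three stencil arrays ex, ey and w hold the usual nine lattice
entries, as the 0 .. 9 loop suggests; with fixed-size float[9] parameters
passed by ref, their lengths become compile-time constants and no slice
bounds are carried into the loop.

import ldc.attributes : fastmath;

@fastmath
void compute_neq(float[] neq, const float[] ux, const float[] uy,
                 const float[] rho, ref const float[9] ex,
                 ref const float[9] ey, ref const float[9] w,
                 const size_t N) {
    foreach (idx; 0 .. N * N) {
        float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];
        foreach (q; 0 .. 9) {
            float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
            float tmp = 1.0f + eu + 0.5f * eu * eu - 1.5f * usqr;
            tmp *= w[q] * rho[idx];
            neq[idx * 9 + q] = tmp;
        }
    }
}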
Nov 12 2016
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
> On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration
> while the D code takes 25ms. Comparing profiler output with the
> generated assembly code quickly reveals the reason - while Clang fully
> unrolls the inner loop and uses FMA instructions wherever possible, the
> inner loop assembly produced by LDC looks like this: [...]

By compiling your code with the same set of flags you used on the godbolt
(https://d.godbolt.org/) service, I do see the FMA instructions being
used.
Nov 12 2016
On Saturday, 12 November 2016 at 09:45:29 UTC, LiNbO3 wrote:
> By compiling your code with the same set of flags you used on the
> godbolt (https://d.godbolt.org/) service, I do see the FMA instructions
> being used.

There are three vfmadd231ss in the entire assembly, but none of them are
in the inner loop. The presence of any FMA instructions at all does show
that the compiler properly accepts the -mcpu switch, but it doesn't seem
to recognize the opportunities present in the inner loop. The assembly
generated by the godbolt service seems largely identical to the one I got
on my local machine.
Nov 12 2016
On Saturday, 12 November 2016 at 10:27:53 UTC, deXtoRious wrote:
> There are three vfmadd231ss in the entire assembly, but none of them are
> in the inner loop. The presence of any FMA instructions at all does show
> that the compiler properly accepts the -mcpu switch, but it doesn't seem
> to recognize the opportunities present in the inner loop.

Does the C++ need `__restrict__` for the parameters to get the assembly
you want?

> The assembly generated by the godbolt service seems largely identical to
> the one I got on my local machine.

It is easier for the discussion if you paste godbolt.org links btw, so we
don't have to manually do it ourselves ;-)

-Johan
Nov 12 2016
On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen wrote:
> Does the C++ need `__restrict__` for the parameters to get the assembly
> you want?

In this case, it doesn't seem to make any difference. It is habitual for
me to use __restrict__ whenever possible in HPC code, but very often
Clang/GCC are smart enough nowadays to make the inference regardless. On
that note, I was under the impression that D arrays included the no
aliasing assumption. If that's not the case, is there a way to achieve the
equivalent of __restrict__ in D?

> It is easier for the discussion if you paste godbolt.org links btw, so
> we don't have to manually do it ourselves ;-)

Will do. :) By the way, I posted that issue on GH:
https://github.com/ldc-developers/ldc/issues/1874
Nov 12 2016
On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious wrote:
> On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen wrote:
>> Does the C++ need `__restrict__` for the parameters to get the assembly
>> you want?
>
> In this case, it doesn't seem to make any difference.

That's good news, because there is currently no way to add that to LDC
code, afaik.

Hope you can try to cut more of these things from the example so it's
easier to figure out why things are different. (e.g. is -Ofast needed, or
is -O3 enough?)

Thanks!
cheers,
  Johan
Nov 12 2016
On Saturday, 12 November 2016 at 11:04:59 UTC, Johan Engelen wrote:
> That's good news, because there is currently no way to add that to LDC
> code, afaik.

I hope it's somewhere on the roadmap for the future, as it does still make
a measurable difference in some cases.

> Hope you can try to cut more of these things from the example so it's
> easier to figure out why things are different. (e.g. is -Ofast needed,
> or is -O3 enough?)

-Ofast is also there out of habit, doesn't make a meaningful difference
for a benchmark as simple as this. Other switches, like -fno-rtti,
-fno-exceptions and even -flto, can also be dropped; simply using -O3
-march=native -ffast-math is sufficient to outperform LDC by 2.5x, losing
only about 10% from the best C++ performance and producing essentially the
same unrolled FMA-enabled assembly with very minor changes.
Nov 12 2016
On Saturday, 12 November 2016 at 11:16:16 UTC, deXtoRious wrote:
> I hope it's somewhere on the roadmap for the future, as it does still
> make a measurable difference in some cases.

Can you file an issue for that too? (ideas in forum posts get lost
instantly) Make sure you add an (as small as possible) testcase that shows
a clear difference in codegen for C++ with/without it, and worse codegen
for the D code without it. It may be relatively easy to implement it in
LDC, but I don't think many people know the intricacies of C's restrict.
Showing examples of the effect it has on assembly (clang C++) helps a lot
towards getting it implemented.

> -Ofast is also there out of habit, doesn't make a meaningful difference
> for a benchmark as simple as this. Other switches, like -fno-rtti,
> -fno-exceptions and even -flto, can also be dropped; simply using -O3
> -march=native -ffast-math is sufficient to outperform LDC by 2.5x [...]

OK great. I think you ran into a compiler limitation somehow, so make sure
you submit the simplified example/testcase on GH! ;) (the simpler you can
make it, the better)

Btw, for benchmarking, you should mark the `compute_neq` function as
"weak linkage", such that the compiler is not going to do inter-procedural
optimization for the call to `compute_neq` in `main`. (@weak for LDC,
clang probably something like __attribute__((weak)))
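For illustration, a sketch of what the weak-linkage setup might look like
on the D side, using a toy kernel in place of compute_neq and assuming the
@weak attribute exported by ldc.attributes:

import ldc.attributes : fastmath, weak;

// Weak linkage tells the optimizer the definition may be replaced at link
// time, so the call below stays a real call instead of being inlined into
// main and specialized for the benchmark's constant arguments.
@weak @fastmath
void saxpy(float[] y, const float[] x, const float a) {
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}

void main() {
    auto x = new float[1024];
    auto y = new float[1024];
    x[] = 1.0f;
    y[] = 2.0f;
    foreach (_; 0 .. 1000)
        saxpy(y, x, 0.5f);
}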
Nov 12 2016
On Saturday, 12 November 2016 at 12:11:35 UTC, Johan Engelen wrote:
> Can you file an issue for that too? (ideas in forum posts get lost
> instantly) [...] Btw, for benchmarking, you should mark the
> `compute_neq` function as "weak linkage" [...]

Okay, I'll clean up the code and post an issue on GH later today,
hopefully someone can figure out where the discrepancy comes from. I'll
also file a separate issue / feature request for restrict afterwards, once
I write up a representative test case that highlights the performance
impact.

Thanks for your help! The ability to get quick responses on compiler
issues like this is really encouraging me to write more high performance
code in D.
Nov 12 2016
Okay, so I've done some further experimentation with rather peculiar
results. On the bright side, I'm now fairly sure this isn't an outright
bug in the compiler. On the flip side, however, I'm quite confused about
the results. For the record, here are the current versions of the
benchmark in godbolt:
D: https://godbolt.org/g/B8gosP
C++: https://godbolt.org/g/DWjQrV

Apparently, LDC can be coaxed to use FMA instructions after all. It seems
that with __attribute__((__weak__)) Clang produces code that is
essentially identical to the D binary; both run in about 19ms on my
machine. When I remove __attribute__((__weak__)) and make the compute_neq
function static void rather than simply void, Clang further unrolls the
inner loop and uses a number of optimized load/store instructions that
increase the performance by a huge margin - down to about 7ms.

As for LDC, adding/removing @weak and static also has a major impact on
the generated code and therefore the performance. I have not found any way
to make LDC perform the same optimizations as Clang's best case (simply
static void, no weak attribute) and have run out of ideas. Furthermore, I
have no idea why the aforementioned changes in the function declaration
affect both optimizers in this way, or whether finer control over
vectorization/loop unrolling is possible in LDC. Any thoughts?
Nov 12 2016
On Saturday, 12 November 2016 at 15:44:28 UTC, deXtoRious wrote:
> I have not found any way to make LDC perform the same optimizations as
> Clang's best case (simply static void, no weak attribute) and have run
> out of ideas. Furthermore, I have no idea why the aforementioned changes
> in the function declaration affect both optimizers in this way, or
> whether finer control over vectorization/loop unrolling is possible in
> LDC. Any thoughts?

I think that perhaps when inlining the @fastmath function, some
optimization attributes are lost somehow and the inlined code is not
optimized as much (you'd have to specify @fastmath on main too). It'd be
easier to compare with -ffast-math I guess ;-)

A look at the generated LLVM IR may provide some clues.
Nov 12 2016
On Saturday, 12 November 2016 at 16:29:20 UTC, Johan Engelen wrote:
> I think that perhaps when inlining the @fastmath function, some
> optimization attributes are lost somehow and the inlined code is not
> optimized as much (you'd have to specify @fastmath on main too). [...]

I tried putting @fastmath on main as well, it makes no difference
whatsoever (identical generated assembly). Apart from the weirdness with
weak/static making way more difference than I would intuitively expect, it
seems the major factor preventing performance parity with Clang is the
conservative loop optimizations. Is there a way, similar to #pragma unroll
in Clang, to tell LDC to try to unroll the inner loop?
Nov 12 2016
On Saturday, 12 November 2016 at 16:40:27 UTC, deXtoRious wrote:
> I tried putting @fastmath on main as well, it makes no difference
> whatsoever (identical generated assembly).

Yeah I saw it too. It's a bit strange.

> Apart from the weirdness with weak/static making way more difference
> than I would intuitively expect,

I am also surprised but: adding `static` in C++ makes it a fully private
function, which does not need to be emitted as such (and isn't in your
case, because it is fully inlined). I added `pragma(inline, true)` to the
D function to get a similar effect, I hoped.

> it seems the major factor preventing performance parity with Clang is
> the conservative loop optimizations. Is there a way, similar to #pragma
> unroll in Clang, to tell LDC to try to unroll the inner loop?

There isn't at the moment. We need a mechanism to tag statements with such
metadata. In LLVM IR, this is what you'd want:
http://llvm.org/docs/LangRef.html#llvm-loop

I am not enough of a D expert to come up with a good way to do this.
Perhaps David can help come up with a solution?

Good stuff for another Github issue! ;-)
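For illustration, a sketch of the pragma(inline, true) placement mentioned
above, on a toy function rather than the real kernel:

import ldc.attributes : fastmath;

// In statement form, pragma(inline, true) applies to the enclosing
// function and asks the compiler to always inline it at call sites,
// roughly approximating what a fully private (static) function gets in
// the C++ version.
@fastmath
void scale(float[] a, const float k) {
    pragma(inline, true);
    foreach (i; 0 .. a.length)
        a[i] *= k;
}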
Nov 12 2016
On Saturday, 12 November 2016 at 18:55:19 UTC, Johan Engelen wrote:
> I am not enough of a D expert to come up with a good way to do this.

Spec says pragma can be applied to statements:
https://dlang.org/spec/pragma.html
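For illustration, a minimal sketch of the statement form the spec allows.
pragma(msg, ...) is a standard pragma that the grammar already accepts in
statement position; "LDC_loop_unroll" below is a purely hypothetical name,
shown only to indicate where a vendor-defined loop pragma could attach to
the loop that follows it.

void relax(float[] a, const float omega) {
    pragma(msg, "compiling relax");    // real: a PragmaStatement
    // pragma(LDC_loop_unroll, 9);     // hypothetical vendor pragma
    foreach (i; 0 .. a.length)
        a[i] *= omega;
}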
Nov 14 2016
On Monday, 14 November 2016 at 10:45:08 UTC, Kagamin wrote:
> Spec says pragma can be applied to statements:
> https://dlang.org/spec/pragma.html

Excellent.
Nov 14 2016
On Saturday, 12 November 2016 at 15:44:28 UTC, deXtoRious wrote:
> I have not found any way to make LDC perform the same optimizations as
> Clang's best case (simply static void, no weak attribute) and have run
> out of ideas. [...]

LDC can compile to llvm bitcode; you can then generate object code from it
with llc and its various options. AFAIK, the LLVM equivalent of a private
symbol is the "hidden" attribute, you can try to apply that, though LTO
shouldn't be affected by it.
Nov 14 2016
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
> Also, while on the subject, is there a way to force LDC to apply the
> relaxed floating point model to the entire program, rather than
> individual functions (the equivalent of --fast-math)?

Not yet. If you really think this has value, please file an issue for it
on GH. It will be easy to add it to LDC.

-Johan
Nov 12 2016
On Saturday, 12 November 2016 at 10:10:44 UTC, Johan Engelen wrote:
> Not yet. If you really think this has value, please file an issue for it
> on GH. It will be easy to add it to LDC.

Will do. The syntax Nicholas Wilson mentioned previously does make it
easier to apply the attribute to multiple functions at once, but in many
cases numerical code is written with the uniform assumption of the relaxed
floating point model, so it seems much more appropriate to set it at the
compiler level in those cases. It would also simplify benchmarking.
Nov 12 2016