digitalmars.D.ldc - Performance issue with fastmath and vectorization
- dextorious (44/44) Nov 11 2016 As part of slowly learning the basics of programming in D, I
- rikki cattermole (22/64) Nov 11 2016 Just a thought but try this:
- deXtoRious (6/30) Nov 12 2016 That's how I originally wrote the code, then reverted to the
- Nicholas Wilson (9/15) Nov 11 2016 you can apply attributes to whole files with
- deXtoRious (8/24) Nov 12 2016 Isn't -vectorize-loops already enabled by the other flags? Simply
- LiNbO3 (4/10) Nov 12 2016 By compiling your code with the same set of flags you used on the
- deXtoRious (8/18) Nov 12 2016 There are three vfmadd231ss in the entire assembly, but none of
- Johan Engelen (6/13) Nov 12 2016 Does the C++ need `__restrict__` for the parameters to get the
- deXtoRious (12/26) Nov 12 2016 In this case, it doesn't seem to make any difference. It is
- Johan Engelen (9/15) Nov 12 2016 That's good news, because there is currently no way to add that
- deXtoRious (11/27) Nov 12 2016 I hope it's somewhere on the roadmap for the future, as it does
- Johan Engelen (19/42) Nov 12 2016 Can you file an issue for that too? (ideas in forum posts get
- deXtoRious (10/53) Nov 12 2016 Okay, I'll clean up the code and post an issue on GH later today,
- deXtoRious (24/24) Nov 12 2016 Okay, so I've done some further experimentation with rather
- Johan Engelen (7/14) Nov 12 2016 I think that perhaps when inlining the fastmath function, some
- deXtoRious (9/24) Nov 12 2016 I tried putting @fastmath on main as well, it makes no difference
- Johan Engelen (13/22) Nov 12 2016 I am also surprised but: adding `static` in C++ makes it a fully
- Kagamin (4/6) Nov 14 2016 Spec says pragma can be applied to statements:
- Johan Engelen (2/8) Nov 14 2016 Excellent.
- Kagamin (5/12) Nov 14 2016 LDC can compile to llvm bitcode, you can then generate object
- Johan Engelen (5/9) Nov 12 2016 Not yet. If you really think this has value, please file an issue
- deXtoRious (8/18) Nov 12 2016 Will do. The syntax Nicholas Wilson mentioned previously does
As part of slowly learning the basics of programming in D, I ported some
of my fluid dynamics code from C++ to D and quickly noticed a rather
severe performance degradation by a factor of 2-3x. I've narrowed it down
to a simple representative benchmark of virtually identical C++ and D
code.

The D version: http://pastebin.com/Rs9CUA5j
The C++ code: http://pastebin.com/XzStHXA2

I compile the D code using the latest beta release on GitHub, using the
compiler switches -release -O5 -mcpu=haswell -boundscheck=off. The C++
version is compiled using Clang 3.9.0 with the switches -std=c++14 -Ofast
-fno-exceptions -fno-rtti -flto -ffast-math -march=native, which is my
usual configuration for numerical code.

On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration
while the D code takes 25ms. Comparing profiler output with the generated
assembly code quickly reveals the reason - while Clang fully unrolls the
inner loop and uses FMA instructions wherever possible, the inner loop
assembly produced by LDC looks like this:

 0.24 │6c0:   vmovss (%r15,%rbp,4),%xmm4
 1.03 │       vmovss (%r12,%rbp,4),%xmm5
 3.51 │       add    $0x4,%rdi
 6.96 │       add    $0x4,%rax
 1.04 │6d4:   vmulss (%rax,%rcx,1),%xmm4,%xmm4
 4.66 │       vmulss (%rax,%rdx,1),%xmm5,%xmm5
 8.44 │       vaddss %xmm4,%xmm5,%xmm4
 1.09 │       vmulss %xmm0,%xmm4,%xmm5
 3.73 │       vmulss %xmm4,%xmm5,%xmm4
 7.48 │       vsubss %xmm3,%xmm4,%xmm4
 1.13 │       vmulss %xmm1,%xmm4,%xmm4
 2.00 │       vaddss %xmm2,%xmm5,%xmm5
 3.46 │       vmovss 0x0(%r13,%rbp,4),%xmm6
 7.85 │       vmulss (%rax,%rsi,1),%xmm6,%xmm6
 2.50 │       vaddss %xmm4,%xmm5,%xmm4
 6.49 │       vmulss %xmm4,%xmm6,%xmm4
25.48 │       vmovss %xmm4,(%rdi)
 8.26 │       cmp    $0x20,%rax
 0.00 │     ↑ jne    6c0

Am I doing something blatantly wrong here or have I run into a compiler
limitation? Is there anything short of using intrinsics or calling C/C++
code I can do here to get to performance parity? Also, while on the
subject, is there a way to force LDC to apply the relaxed floating point
model to the entire program, rather than individual functions (the
equivalent of --fast-math)?
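For reference, the per-function form of the relaxed model in LDC looks
roughly like this (a minimal sketch assuming the @fastmath attribute from
ldc.attributes; the function below is illustrative, not taken from the
pastebin):

import ldc.attributes : fastmath;

// Relaxed floating-point semantics apply only inside this function, so
// the multiply-add below is eligible for FMA contraction and
// reassociation.
@fastmath
void axpy1(float[] a, const float k) {
    foreach (i; 0 .. a.length)
        a[i] = a[i] * k + 1.0f;
}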
Nov 11 2016
On 12/11/2016 1:03 PM, dextorious wrote:
> As part of slowly learning the basics of programming in D, I ported some
> of my fluid dynamics code from C++ to D and quickly noticed a rather
> severe performance degradation by a factor of 2-3x. I've narrowed it
> down to a simple representative benchmark of virtually identical C++ and
> D code. [...]

Just a thought but try this:

@fastmath
void compute_neq(float[] neq, const float[] ux, const float[] uy,
                 const float[] rho, const float[] ex, const float[] ey,
                 const float[] w, const size_t N) {
    foreach(idx; 0 .. N*N) {
        float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];
        foreach(q; 0 .. 9) {
            float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
            float tmp = 1.0f + eu + 0.5f * eu * eu - 1.5f * usqr;
            tmp *= w[q] * rho[idx];
            neq[idx * 9 + q] = tmp;
        }
    }
}

It may not make any difference since it is semantically the same, but I
thought at the very least rewriting it to be a bit more idiomatic may
help.
Nov 11 2016
On Saturday, 12 November 2016 at 03:30:47 UTC, rikki cattermole wrote:
> Just a thought but try this: [...] It may not make any difference since
> it is semantically the same, but I thought at the very least rewriting
> it to be a bit more idiomatic may help.

That's how I originally wrote the code, then reverted to the C++-style for
the comparison to make the code as identical as possible and make sure it
doesn't make any difference. As expected, it doesn't.
Nov 12 2016
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
> As part of slowly learning the basics of programming in D, I ported some
> of my fluid dynamics code from C++ to D and quickly noticed a rather
> severe performance degradation by a factor of 2-3x. I've narrowed it
> down to a simple representative benchmark of virtually identical C++ and
> D code. [...]

You can apply attributes to whole files with

@fastmath:
void IamFastMath(){}
void SoAmI(){}

Don't know about whole program.

I got some improvements with -vectorize-loops and making the stencil array
static and passing by ref. I couldn't get it to unroll the inner loop
though.
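Spelled out with the import it needs, the file-level form is roughly as
follows (a sketch assuming LDC's ldc.attributes module; the colon form
applies the attribute to every declaration that follows it in the module):

import ldc.attributes;

// Everything declared below this line picks up @fastmath.
@fastmath:

void IamFastMath() {}
void SoAmI() {}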
Nov 11 2016
On Saturday, 12 November 2016 at 07:38:16 UTC, Nicholas Wilson wrote:
> You can apply attributes to whole files with @fastmath: [...] Don't know
> about whole program. I got some improvements with -vectorize-loops and
> making the stencil array static and passing by ref. I couldn't get it to
> unroll the inner loop though.

Isn't -vectorize-loops already enabled by the other flags? Simply adding
it doesn't seem to make a difference to the inner loop assembly for me.
I'll try passing a static array by ref, which should slightly improve the
function call performance, but I'd be surprised if it actually lets the
compiler properly vectorize the inner loop or fully unroll it.
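For reference, a sketch of the signature change being discussed. It
assumes the three stencil arrays ex, ey and w hold the usual nine lattice
entries, as the 0 .. 9 loop suggests; with fixed-size float[9] parameters
passed by ref, their lengths become compile-time constants and no slice
bounds are carried into the loop.

import ldc.attributes : fastmath;

@fastmath
void compute_neq(float[] neq, const float[] ux, const float[] uy,
                 const float[] rho, ref const float[9] ex,
                 ref const float[9] ey, ref const float[9] w,
                 const size_t N) {
    foreach (idx; 0 .. N * N) {
        float usqr = ux[idx] * ux[idx] + uy[idx] * uy[idx];
        foreach (q; 0 .. 9) {
            float eu = 3.0f * (ex[q] * ux[idx] + ey[q] * uy[idx]);
            float tmp = 1.0f + eu + 0.5f * eu * eu - 1.5f * usqr;
            tmp *= w[q] * rho[idx];
            neq[idx * 9 + q] = tmp;
        }
    }
}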
Nov 12 2016
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
> On my Haswell i7-4710HQ machine the C++ version runs in ~10ms/iteration
> while the D code takes 25ms. Comparing profiler output with the
> generated assembly code quickly reveals the reason - while Clang fully
> unrolls the inner loop and uses FMA instructions wherever possible, the
> inner loop assembly produced by LDC looks like this: [...]

By compiling your code with the same set of flags you used on the godbolt
(https://d.godbolt.org/) service, I do see the FMA instructions being
used.
Nov 12 2016
On Saturday, 12 November 2016 at 09:45:29 UTC, LiNbO3 wrote:
> By compiling your code with the same set of flags you used on the
> godbolt (https://d.godbolt.org/) service, I do see the FMA instructions
> being used.

There are three vfmadd231ss in the entire assembly, but none of them are
in the inner loop. The presence of any FMA instructions at all does show
that the compiler properly accepts the -mcpu switch, but it doesn't seem
to recognize the opportunities present in the inner loop. The assembly
generated by the godbolt service seems largely identical to the one I got
on my local machine.
Nov 12 2016
On Saturday, 12 November 2016 at 10:27:53 UTC, deXtoRious wrote:
> There are three vfmadd231ss in the entire assembly, but none of them are
> in the inner loop. The presence of any FMA instructions at all does show
> that the compiler properly accepts the -mcpu switch, but it doesn't seem
> to recognize the opportunities present in the inner loop.

Does the C++ need `__restrict__` for the parameters to get the assembly
you want?

> The assembly generated by the godbolt service seems largely identical to
> the one I got on my local machine.

It is easier for the discussion if you paste godbolt.org links btw, so we
don't have to manually do it ourselves ;-)

-Johan
Nov 12 2016
On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen wrote:
> Does the C++ need `__restrict__` for the parameters to get the assembly
> you want?

In this case, it doesn't seem to make any difference. It is habitual for
me to use __restrict__ whenever possible in HPC code, but very often
Clang/GCC are smart enough nowadays to make the inference regardless. On
that note, I was under the impression that D arrays included the no
aliasing assumption. If that's not the case, is there a way to achieve the
equivalent of __restrict__ in D?

> It is easier for the discussion if you paste godbolt.org links btw, so
> we don't have to manually do it ourselves ;-)

Will do. :) By the way, I posted that issue on GH:
https://github.com/ldc-developers/ldc/issues/1874
Nov 12 2016
On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious wrote:
> On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen wrote:
>> Does the C++ need `__restrict__` for the parameters to get the assembly
>> you want?
>
> In this case, it doesn't seem to make any difference.

That's good news, because there is currently no way to add that to LDC
code, afaik.

Hope you can try to cut more of these things from the example so it's
easier to figure out why things are different. (e.g. is -Ofast needed, or
is -O3 enough?)

Thanks!
cheers,
  Johan
Nov 12 2016
On Saturday, 12 November 2016 at 11:04:59 UTC, Johan Engelen wrote:
> That's good news, because there is currently no way to add that to LDC
> code, afaik.

I hope it's somewhere on the roadmap for the future, as it does still make
a measurable difference in some cases.

> Hope you can try to cut more of these things from the example so it's
> easier to figure out why things are different. (e.g. is -Ofast needed,
> or is -O3 enough?)

-Ofast is also there out of habit, doesn't make a meaningful difference
for a benchmark as simple as this. Other switches, like -fno-rtti,
-fno-exceptions and even -flto, can also be dropped; simply using -O3
-march=native -ffast-math is sufficient to outperform LDC by 2.5x, losing
only about 10% from the best C++ performance and producing essentially the
same unrolled FMA-enabled assembly with very minor changes.
Nov 12 2016
On Saturday, 12 November 2016 at 11:16:16 UTC, deXtoRious wrote:
> I hope it's somewhere on the roadmap for the future, as it does still
> make a measurable difference in some cases.

Can you file an issue for that too? (ideas in forum posts get lost
instantly) Make sure you add an (as small as possible) testcase that shows
a clear difference in codegen for C++ with/without it, and worse codegen
for the D code without it. It may be relatively easy to implement it in
LDC, but I don't think many people know the intricacies of C's restrict.
Showing examples of the effect it has on assembly (clang C++) helps a lot
towards getting it implemented.

> -Ofast is also there out of habit, doesn't make a meaningful difference
> for a benchmark as simple as this. Other switches, like -fno-rtti,
> -fno-exceptions and even -flto, can also be dropped; simply using -O3
> -march=native -ffast-math is sufficient to outperform LDC by 2.5x [...]

OK great. I think you ran into a compiler limitation somehow, so make sure
you submit the simplified example/testcase on GH! ;) (the simpler you can
make it, the better)

Btw, for benchmarking, you should mark the `compute_neq` function as
"weak linkage", such that the compiler is not going to do inter-procedural
optimization for the call to `compute_neq` in `main`. (@weak for LDC,
clang probably something like __attribute__((weak)))
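For illustration, a sketch of what the weak-linkage setup might look like
on the D side, using a toy kernel in place of compute_neq and assuming the
@weak attribute exported by ldc.attributes:

import ldc.attributes : fastmath, weak;

// Weak linkage tells the optimizer the definition may be replaced at link
// time, so the call below stays a real call instead of being inlined into
// main and specialized for the benchmark's constant arguments.
@weak @fastmath
void saxpy(float[] y, const float[] x, const float a) {
    foreach (i; 0 .. y.length)
        y[i] += a * x[i];
}

void main() {
    auto x = new float[1024];
    auto y = new float[1024];
    x[] = 1.0f;
    y[] = 2.0f;
    foreach (_; 0 .. 1000)
        saxpy(y, x, 0.5f);
}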
Nov 12 2016
On Saturday, 12 November 2016 at 12:11:35 UTC, Johan Engelen wrote:
> Can you file an issue for that too? (ideas in forum posts get lost
> instantly) [...] Btw, for benchmarking, you should mark the
> `compute_neq` function as "weak linkage" [...]

Okay, I'll clean up the code and post an issue on GH later today,
hopefully someone can figure out where the discrepancy comes from. I'll
also file a separate issue / feature request for restrict afterwards, once
I write up a representative test case that highlights the performance
impact.

Thanks for your help! The ability to get quick responses on compiler
issues like this is really encouraging me to write more high performance
code in D.
Nov 12 2016
Okay, so I've done some further experimentation with rather peculiar
results. On the bright side, I'm now fairly sure this isn't an outright
bug in the compiler. On the flip side, however, I'm quite confused about
the results. For the record, here are the current versions of the
benchmark in godbolt:
D: https://godbolt.org/g/B8gosP
C++: https://godbolt.org/g/DWjQrV

Apparently, LDC can be coaxed to use FMA instructions after all. It seems
that with __attribute__((__weak__)) Clang produces code that is
essentially identical to the D binary; both run in about 19ms on my
machine. When I remove __attribute__((__weak__)) and make the compute_neq
function static void rather than simply void, Clang further unrolls the
inner loop and uses a number of optimized load/store instructions that
increase the performance by a huge margin - down to about 7ms.

As for LDC, adding/removing @weak and static also has a major impact on
the generated code and therefore the performance. I have not found any way
to make LDC perform the same optimizations as Clang's best case (simply
static void, no weak attribute) and have run out of ideas. Furthermore, I
have no idea why the aforementioned changes in the function declaration
affect both optimizers in this way, or whether finer control over
vectorization/loop unrolling is possible in LDC. Any thoughts?
Nov 12 2016
On Saturday, 12 November 2016 at 15:44:28 UTC, deXtoRious wrote:
> I have not found any way to make LDC perform the same optimizations as
> Clang's best case (simply static void, no weak attribute) and have run
> out of ideas. Furthermore, I have no idea why the aforementioned changes
> in the function declaration affect both optimizers in this way, or
> whether finer control over vectorization/loop unrolling is possible in
> LDC. Any thoughts?

I think that perhaps when inlining the @fastmath function, some
optimization attributes are lost somehow and the inlined code is not
optimized as much (you'd have to specify @fastmath on main too). It'd be
easier to compare with -ffast-math I guess ;-)

A look at the generated LLVM IR may provide some clues.
Nov 12 2016
On Saturday, 12 November 2016 at 16:29:20 UTC, Johan Engelen wrote:
> I think that perhaps when inlining the @fastmath function, some
> optimization attributes are lost somehow and the inlined code is not
> optimized as much (you'd have to specify @fastmath on main too). [...]

I tried putting @fastmath on main as well, it makes no difference
whatsoever (identical generated assembly). Apart from the weirdness with
weak/static making way more difference than I would intuitively expect, it
seems the major factor preventing performance parity with Clang is the
conservative loop optimizations. Is there a way, similar to #pragma unroll
in Clang, to tell LDC to try to unroll the inner loop?
Nov 12 2016
On Saturday, 12 November 2016 at 16:40:27 UTC, deXtoRious wrote:
> I tried putting @fastmath on main as well, it makes no difference
> whatsoever (identical generated assembly).

Yeah I saw it too. It's a bit strange.

> Apart from the weirdness with weak/static making way more difference
> than I would intuitively expect,

I am also surprised but: adding `static` in C++ makes it a fully private
function, which does not need to be emitted as such (and isn't in your
case, because it is fully inlined). I added `pragma(inline, true)` to the
D function to get a similar effect, I hoped.

> it seems the major factor preventing performance parity with Clang is
> the conservative loop optimizations. Is there a way, similar to #pragma
> unroll in Clang, to tell LDC to try to unroll the inner loop?

There isn't at the moment. We need a mechanism to tag statements with such
metadata. In LLVM IR, this is what you'd want:
http://llvm.org/docs/LangRef.html#llvm-loop

I am not enough of a D expert to come up with a good way to do this.
Perhaps David can help come up with a solution?

Good stuff for another Github issue! ;-)
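For illustration, a sketch of the pragma(inline, true) placement mentioned
above, on a toy function rather than the real kernel:

import ldc.attributes : fastmath;

// In statement form, pragma(inline, true) applies to the enclosing
// function and asks the compiler to always inline it at call sites,
// roughly approximating what a fully private (static) function gets in
// the C++ version.
@fastmath
void scale(float[] a, const float k) {
    pragma(inline, true);
    foreach (i; 0 .. a.length)
        a[i] *= k;
}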
Nov 12 2016
On Saturday, 12 November 2016 at 18:55:19 UTC, Johan Engelen wrote:
> I am not enough of a D expert to come up with a good way to do this.

Spec says pragma can be applied to statements:
https://dlang.org/spec/pragma.html
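For illustration, a minimal sketch of the statement form the spec allows.
pragma(msg, ...) is a standard pragma that the grammar already accepts in
statement position; "LDC_loop_unroll" below is a purely hypothetical name,
shown only to indicate where a vendor-defined loop pragma could attach to
the loop that follows it.

void relax(float[] a, const float omega) {
    pragma(msg, "compiling relax");    // real: a PragmaStatement
    // pragma(LDC_loop_unroll, 9);     // hypothetical vendor pragma
    foreach (i; 0 .. a.length)
        a[i] *= omega;
}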
Nov 14 2016
On Monday, 14 November 2016 at 10:45:08 UTC, Kagamin wrote:
> Spec says pragma can be applied to statements:
> https://dlang.org/spec/pragma.html

Excellent.
Nov 14 2016
On Saturday, 12 November 2016 at 15:44:28 UTC, deXtoRious wrote:
> I have not found any way to make LDC perform the same optimizations as
> Clang's best case (simply static void, no weak attribute) and have run
> out of ideas. [...]

LDC can compile to llvm bitcode; you can then generate object code from it
with llc and its various options. AFAIK, the LLVM equivalent of a private
symbol is the "hidden" attribute, you can try to apply that, though LTO
shouldn't be affected by it.
Nov 14 2016
On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
> Also, while on the subject, is there a way to force LDC to apply the
> relaxed floating point model to the entire program, rather than
> individual functions (the equivalent of --fast-math)?

Not yet. If you really think this has value, please file an issue for it
on GH. It will be easy to add it to LDC.

-Johan
Nov 12 2016
On Saturday, 12 November 2016 at 10:10:44 UTC, Johan Engelen wrote:
> Not yet. If you really think this has value, please file an issue for it
> on GH. It will be easy to add it to LDC.

Will do. The syntax Nicholas Wilson mentioned previously does make it
easier to apply the attribute to multiple functions at once, but in many
cases numerical code is written with the uniform assumption of the relaxed
floating point model, so it seems much more appropriate to set it at the
compiler level in those cases. It would also simplify benchmarking.
Nov 12 2016