digitalmars.D.ldc - Operator overloading leads to bad code optimization

claptrap <clap trap.com> writes:

Just a simple function to split a bezier in two.

Using "-O3"

LDC, operator version: 84 instructions.
LDC, hand-expanded math: 49 instructions.

It seems something as simple as this should be better optimised? 
Or am I missing something?

https://godbolt.org/z/4h9vob3Yo
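
The source behind the link is not reproduced in this thread. As a rough sketch of the kind of code being compared, assuming a cubic curve split at t = 0.5 with de Casteljau's algorithm: the `Point` and `Bezier` types and the exact expressions below are hypothetical, and only the names `split` (operator version) and `split2` (hand-expanded version) appear elsewhere in the discussion.

```d
// Hypothetical reconstruction, not the code behind the Godbolt link.
struct Point
{
    float x, y;

    // Point + Point
    Point opBinary(string op : "+")(Point rhs) const
    {
        return Point(x + rhs.x, y + rhs.y);
    }

    // Point * scalar
    Point opBinary(string op : "*")(float s) const
    {
        return Point(x * s, y * s);
    }
}

struct Bezier
{
    Point p0, p1, p2, p3; // assumed cubic: four control points
}

// Operator version: midpoints via the overloaded + and *.
void split(const ref Bezier c, out Bezier left, out Bezier right)
{
    const ab  = (c.p0 + c.p1) * 0.5f;
    const bc  = (c.p1 + c.p2) * 0.5f;
    const cd  = (c.p2 + c.p3) * 0.5f;
    const abc = (ab + bc) * 0.5f;
    const bcd = (bc + cd) * 0.5f;
    const mid = (abc + bcd) * 0.5f;
    left  = Bezier(c.p0, ab, abc, mid);
    right = Bezier(mid, bcd, cd, c.p3);
}

// Hand-expanded version: the same arithmetic written per component.
void split2(const ref Bezier c, out Bezier left, out Bezier right)
{
    const abx  = (c.p0.x + c.p1.x) * 0.5f;
    const aby  = (c.p0.y + c.p1.y) * 0.5f;
    const bcx  = (c.p1.x + c.p2.x) * 0.5f;
    const bcy  = (c.p1.y + c.p2.y) * 0.5f;
    const cdx  = (c.p2.x + c.p3.x) * 0.5f;
    const cdy  = (c.p2.y + c.p3.y) * 0.5f;
    const abcx = (abx + bcx) * 0.5f;
    const abcy = (aby + bcy) * 0.5f;
    const bcdx = (bcx + cdx) * 0.5f;
    const bcdy = (bcy + cdy) * 0.5f;
    const midx = (abcx + bcdx) * 0.5f;
    const midy = (abcy + bcdy) * 0.5f;
    left  = Bezier(c.p0, Point(abx, aby), Point(abcx, abcy), Point(midx, midy));
    right = Bezier(Point(midx, midy), Point(bcdx, bcdy), Point(cdx, cdy), c.p3);
}

unittest
{
    auto c = Bezier(Point(0, 0), Point(0, 4), Point(4, 4), Point(4, 0));
    Bezier l1, r1, l2, r2;
    split(c, l1, r1);
    split2(c, l2, r2);
    assert(l1.p3.x == 2 && l1.p3.y == 3); // shared midpoint of the curve
    assert(l1 == l2 && r1 == r2);         // both variants agree on this input
}
```

In this sketch both functions perform the same floating-point operations in the same order, so any difference in instruction count would come from how the optimizer handles the operator calls rather than from the math itself.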

In fact there are quite a few places where it looks like completely 
redundant code is left in, e.g.:

movss   dword ptr [rsp - 24], xmm1
movss   xmm0, dword ptr [rip + .LCPI4_0]
mulss   xmm1, xmm0
movss   dword ptr [rsp - 24], xmm1

movss   dword ptr [rsp - 24], xmm2
mulss   xmm2, xmm0
movss   dword ptr [rsp - 24], xmm2

Dec 03 2021

max haughton <maxhaton gmail.com> writes:

On Friday, 3 December 2021 at 21:24:07 UTC, claptrap wrote:
 Just a simple function to split a bezier in two.

 Using "-O3"

 LDC the operator version is 84 instructions
 LDC the hand expanded math is 49 instructions.

 It seems something as simple as this should be better 
 optimised? Or am I missing something?

 https://godbolt.org/z/4h9vob3Yo

 In fact there's quite a few bits where it looks like completely 
 redundant code is left in? Eg...

 movss   dword ptr [rsp - 24], xmm1
 movss   xmm0, dword ptr [rip + .LCPI4_0]
 mulss   xmm1, xmm0
 movss   dword ptr [rsp - 24], xmm1

 movss   dword ptr [rsp - 24], xmm2
 mulss   xmm2, xmm0
 movss   dword ptr [rsp - 24], xmm2

This is (to me at least) an odd one. Maybe there's a 
pass-ordering issue here leading to bad code.

Seems like GCC does not have this issue.

Dec 05 2021

kinke <noone nowhere.com> writes:

On Sunday, 5 December 2021 at 21:38:55 UTC, max haughton wrote:
 On Friday, 3 December 2021 at 21:24:07 UTC, claptrap wrote:
 Just a simple function to split a bezier in two.

 Using "-O3"

 LDC the operator version is 84 instructions
 LDC the hand expanded math is 49 instructions.

 It seems something as simple as this should be better 
 optimised? Or am I missing something?

 https://godbolt.org/z/4h9vob3Yo
 [...]

 [...]
 Seems like GCC does not have this issue.

With gdc v11.1, I count 69 instructions for split and 51 for 
split2 (59 with -O3). So I guess there's a semantic difference 
here with the slightly changed evaluation order (2D addition 
before scaling).

With `alias Point = __vector(float[2])`, split is reduced to 28 
instructions: https://godbolt.org/z/7ffebjaz8
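
For readers who do not follow the link, a minimal sketch of what `split` could look like with that alias. The `Bezier` struct, the cubic shape, and the explicit `half` vector are assumptions for illustration, not the linked code, and 2-element float vectors are assumed to need LDC or GDC.

```d
// Sketch only; assumes LDC or GDC, which accept 2-element float vectors.
alias Point = __vector(float[2]);

struct Bezier
{
    Point p0, p1, p2, p3; // assumed cubic: four control points
}

// Each + and * below operates on both coordinates at once.
void split(Bezier c, out Bezier left, out Bezier right)
{
    Point half = [0.5f, 0.5f];
    Point ab  = (c.p0 + c.p1) * half;
    Point bc  = (c.p1 + c.p2) * half;
    Point cd  = (c.p2 + c.p3) * half;
    Point abc = (ab + bc) * half;
    Point bcd = (bc + cd) * half;
    Point mid = (abc + bcd) * half;
    left  = Bezier(c.p0, ab, abc, mid);
    right = Bezier(mid, bcd, cd, c.p3);
}

unittest
{
    Point a = [0.0f, 0.0f];
    Point b = [0.0f, 4.0f];
    Point d = [4.0f, 4.0f];
    Point e = [4.0f, 0.0f];
    Bezier l, r;
    split(Bezier(a, b, d, e), l, r);
    // .array exposes the vector as a static array for element access.
    assert(l.p3.array[0] == 2.0f && l.p3.array[1] == 3.0f);
}
```

Since the x and y lanes travel together in one register, no per-component operator calls are left for the inliner to deal with.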

Dec 05 2021

ClapTrap <clap trap.com> writes:

On Sunday, 5 December 2021 at 23:36:21 UTC, kinke wrote:
 On Sunday, 5 December 2021 at 21:38:55 UTC, max haughton wrote:
 On Friday, 3 December 2021 at 21:24:07 UTC, claptrap wrote:
 Just a simple function to split a bezier in two.

 Using "-O3"

 LDC the operator version is 84 instructions
 LDC the hand expanded math is 49 instructions.

 It seems something as simple as this should be better 
 optimised? Or am I missing something?

 https://godbolt.org/z/4h9vob3Yo
 [...]

 [...]
 Seems like GCC does not have this issue.

 With gdc v11.1, I count 69 instructions for split and 51 for 
 split2 (59 with -O3). So I guess there's a semantic difference 
 here with the slightly changed evaluation order (2D addition 
 before scaling).

gdc v11.1 doesn't inline the operator calls when I try it. If you 
try an earlier version, 10.2, it does, which reduces it to 48 
instructions.

 With `alias Point = __vector(float[2])`, split is reduced to 28 
 instructions: https://godbolt.org/z/7ffebjaz8

Wow, that's awesome!

Dec 05 2021

max haughton <maxhaton gmail.com> writes:

On Monday, 6 December 2021 at 00:38:18 UTC, ClapTrap wrote:
 On Sunday, 5 December 2021 at 23:36:21 UTC, kinke wrote:
 On Sunday, 5 December 2021 at 21:38:55 UTC, max haughton wrote:
 On Friday, 3 December 2021 at 21:24:07 UTC, claptrap wrote:
 Just a simple function to split a bezier in two.

 Using "-O3"

 LDC the operator version is 84 instructions
 LDC the hand expanded math is 49 instructions.

 It seems something as simple as this should be better 
 optimised? Or am I missing something?

 https://godbolt.org/z/4h9vob3Yo
 [...]

 [...]
 Seems like GCC does not have this issue.

 With gdc v11.1, I count 69 instructions for split and 51 for 
 split2 (59 with -O3). So I guess there's a semantic difference 
 here with the slightly changed evaluation order (2D addition 
 before scaling).

 gdc v11.1 doesn't inline the operator calls when I try it, if 
 you try an earlier version 10.2 it does which reduces it to 48 
 instructions

 With `alias Point = __vector(float[2])`, split is reduced to 
 28 instructions: https://godbolt.org/z/7ffebjaz8

 Wow, that's awesome!

To make GCC inline properly without LTO you can use 
`-fwhole-program`.

Maybe Iain also has a flag that restores the old template 
behaviour.

These kinds of wacky phase-ordering issues (I assume) are why I am 
slightly distrustful of GDC's post-inlining decision.

Dec 05 2021

Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:

On Monday, 6 December 2021 at 00:41:06 UTC, max haughton wrote:
 These kinds of wacky phase ordering (I assume) issues is why I 
 am slightly distrustful of GDC post-inlining decision.

Multiplying with 0.5 only affects the exponent, but the add could 
overflow/underflow. Maybe that is wacky for D since it specifies 
[...] a shortcut for a set of options. If I specify -O or -O3 I 
would expect the same options as gcc. Otherwise people will claim 
that C++ is faster?
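
A small sketch of why the order of adding and scaling is not interchangeable for floats. The function names are made up for illustration, and the expected behaviour assumes single-precision intermediates, which D does not guarantee.

```d
// Add first, then scale: the intermediate sum can overflow.
float addThenScale(float a, float b)
{
    return (a + b) * 0.5f;
}

// Scale first, then add: each product is half the size,
// so the sum stays in range for these inputs.
float scaleThenAdd(float a, float b)
{
    return a * 0.5f + b * 0.5f;
}

unittest
{
    // Halving float.max is exact, and the two halves sum back to float.max.
    assert(scaleThenAdd(float.max, float.max) == float.max);

    // With single-precision intermediates the sum overflows to infinity.
    // D allows intermediates to be kept at higher precision, in which case
    // the result is float.max instead, so both outcomes are accepted here.
    const r = addThenScale(float.max, float.max);
    assert(r == float.infinity || r == float.max);
}
```

Because the two forms can give different results, a compiler that respects strict floating-point semantics cannot freely rewrite one into the other.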

Dec 06 2021

max haughton <maxhaton gmail.com> writes:

On Monday, 6 December 2021 at 11:55:08 UTC, Ola Fosheim Grøstad 
wrote:
 On Monday, 6 December 2021 at 00:41:06 UTC, max haughton wrote:
 These kinds of wacky phase ordering (I assume) issues is why I 
 am slightly distrustful of GDC post-inlining decision.

 Multiplying with 0.5 only affects the exponent, but the add 
 could overflow/underflow. Maybe that is wacky for D since it 
 specifies [...] a shortcut for a set of options. If I specify 
 -O or -O3 I would expect the same options as gcc. Otherwise 
 people will claim that C++ is faster?

My sentence was referring to Iain's decision to refuse to inline 
templates (i.e. defer to LTO). It makes it harder to work out what 
the compiler is going to do / is doing.

Dec 07 2021
