digitalmars.D.ldc - Operator overloading leads to bad code optimization

claptrap <clap trap.com> writes:
Just a simple function to split a bezier in two.

Compiled with "-O3":

LDC, operator version: 84 instructions
LDC, hand-expanded math: 49 instructions

It seems something as simple as this should be optimised better.
Or am I missing something?

https://godbolt.org/z/4h9vob3Yo

In fact, there are quite a few places where it looks like completely
redundant code is left in, e.g.:

movss   dword ptr [rsp - 24], xmm1
movss   xmm0, dword ptr [rip + .LCPI4_0]
mulss   xmm1, xmm0
movss   dword ptr [rsp - 24], xmm1

and

movss   dword ptr [rsp - 24], xmm2
mulss   xmm2, xmm0
movss   dword ptr [rsp - 24], xmm2
Dec 03 2021
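For readers who cannot open the Compiler Explorer link, the comparison is roughly of the following shape. This is only a sketch: the thread confirms the names `Point`, `split` and `split2`, but the `Bezier` struct, the quadratic (three-point) curve, the `out` parameters and the exact expressions are assumptions, not the original source.

```d
// Hypothetical reconstruction, NOT the code behind the godbolt link.
struct Point
{
    float x, y;

    // Component-wise addition of two points.
    Point opBinary(string op)(Point rhs) const
        if (op == "+")
    {
        return Point(x + rhs.x, y + rhs.y);
    }

    // Scaling by a scalar.
    Point opBinary(string op)(float s) const
        if (op == "*")
    {
        return Point(x * s, y * s);
    }
}

// Assumed layout: a quadratic curve with three control points.
struct Bezier
{
    Point p0, p1, p2;
}

// Operator version: split at t = 0.5 by repeated midpoints (de Casteljau).
void split(const Bezier src, out Bezier left, out Bezier right)
{
    const a = (src.p0 + src.p1) * 0.5f;
    const b = (src.p1 + src.p2) * 0.5f;
    const m = (a + b) * 0.5f;
    left = Bezier(src.p0, a, m);
    right = Bezier(m, b, src.p2);
}

// Hand-expanded version: the same math written out per component.
void split2(const Bezier src, out Bezier left, out Bezier right)
{
    const ax = (src.p0.x + src.p1.x) * 0.5f;
    const ay = (src.p0.y + src.p1.y) * 0.5f;
    const bx = (src.p1.x + src.p2.x) * 0.5f;
    const by = (src.p1.y + src.p2.y) * 0.5f;
    const mx = (ax + bx) * 0.5f;
    const my = (ay + by) * 0.5f;
    left = Bezier(src.p0, Point(ax, ay), Point(mx, my));
    right = Bezier(Point(mx, my), Point(bx, by), src.p2);
}
```

Both functions compute the same values; the question in the thread is only why the optimizer produces noticeably more instructions for the first form.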
max haughton <maxhaton gmail.com> writes:
On Friday, 3 December 2021 at 21:24:07 UTC, claptrap wrote:
 Just a simple function to split a bezier in two.

 Using "-O3"

 LDC the operator version is 84 instructions
 LDC the hand expanded math is 49 instructions.

 It seems something as simple as this should be better 
 optimised? Or am I missing something?

 https://godbolt.org/z/4h9vob3Yo

 In fact there's quite a few bits where it looks like completely 
 redundant code is left in? Eg...

 123 movss   dword ptr [rsp - 24], xmm1
 124 movss   xmm0, dword ptr [rip + .LCPI4_0]
 125 mulss   xmm1, xmm0
 126 movss   dword ptr [rsp - 24], xmm1


 137 movss   dword ptr [rsp - 24], xmm2
 138 mulss   xmm2, xmm0
 139 movss   dword ptr [rsp - 24], xmm2
This is (to me at least) an odd one. Maybe there's a pass-ordering issue here leading to bad code. Seems like GCC does not have this issue.
Dec 05 2021
kinke <noone nowhere.com> writes:
On Sunday, 5 December 2021 at 21:38:55 UTC, max haughton wrote:
[...] Seems like GCC does not have this issue.
With gdc v11.1, I count 69 instructions for split and 51 for split2 (59 with -O3). So I guess there's a semantic difference here with the slightly changed evaluation order (2D addition before scaling). With `alias Point = __vector(float[2])`, split is reduced to 28 instructions: https://godbolt.org/z/7ffebjaz8
Dec 05 2021
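A minimal sketch of what the `__vector(float[2])` variant might look like. Only the alias itself comes from the post above; the `Bezier` struct, the function body and the explicit `half` constant are assumptions for illustration. It needs a compiler that supports 64-bit vector types, such as the LDC build used in the thread.

```d
// Only this alias is taken from the thread; everything else is a hypothetical sketch.
alias Point = __vector(float[2]);

// Assumed layout: a quadratic curve with three control points.
struct Bezier
{
    Point p0, p1, p2;
}

// Midpoint split (de Casteljau at t = 0.5) using built-in vector arithmetic.
void split(const Bezier src, out Bezier left, out Bezier right)
{
    const Point half = 0.5f; // scalar initializer is broadcast to both lanes
    const Point a = (src.p0 + src.p1) * half;
    const Point b = (src.p1 + src.p2) * half;
    const Point m = (a + b) * half;
    left = Bezier(src.p0, a, m);
    right = Bezier(m, b, src.p2);
}
```

Because x and y now live in one SIMD register, each add or multiply handles both components at once, which is presumably where most of the instruction savings come from.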
ClapTrap <clap trap.com> writes:
On Sunday, 5 December 2021 at 23:36:21 UTC, kinke wrote:
 With gdc v11.1, I count 69 instructions for split and 51 for split2 (59 with -O3). So I guess there's a semantic difference here with the slightly changed evaluation order (2D addition before scaling).
gdc v11.1 doesn't inline the operator calls when I try it; an earlier version, 10.2, does, which reduces it to 48 instructions.
 With `alias Point = __vector(float[2])`, split is reduced to 28 
 instructions: https://godbolt.org/z/7ffebjaz8
Wow, that's awesome!
Dec 05 2021
max haughton <maxhaton gmail.com> writes:
On Monday, 6 December 2021 at 00:38:18 UTC, ClapTrap wrote:
 gdc v11.1 doesn't inline the operator calls when I try it; an earlier version, 10.2, does, which reduces it to 48 instructions.
 [...]
To make GCC inline properly without LTO, you can use `-fwhole-program`. Maybe Iain also has a flag that restores the old template behaviour. These kinds of wacky phase-ordering (I assume) issues are why I am slightly distrustful of GDC's post-inlining decision.
Dec 05 2021
Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Monday, 6 December 2021 at 00:41:06 UTC, max haughton wrote:
 These kinds of wacky phase ordering (I assume) issues is why I 
 am slightly distrustful of GDC post-inlining decision.
Multiplying by 0.5 only affects the exponent, but the add could overflow/underflow. Maybe that is wacky for D, since it is a shortcut for a set of options. If I specify -O or -O3, I would expect the same options as gcc. Otherwise people will claim that C++ is faster?
Dec 06 2021
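The overflow point can be illustrated with a small sketch. The two helper functions below are hypothetical, not code from the thread. They show why a compiler cannot freely rewrite "add, then scale" into "scale, then add" under strict floating-point semantics: the two forms can give different results near the top of the float range.

```d
// Add first, then scale: the sum itself can exceed float.max.
float addThenHalve(float a, float b)
{
    const float sum = a + b;
    return sum * 0.5f;
}

// Scale first, then add: halving a normal float only lowers its exponent,
// so the midpoint of two large values stays finite.
float halveThenAdd(float a, float b)
{
    return a * 0.5f + b * 0.5f;
}
```

With both arguments equal to `float.max`, `halveThenAdd` returns `float.max`. `addThenHalve` typically returns infinity when the arithmetic is done in single precision, although D allows intermediates to be kept at higher precision, so the exact outcome may depend on the compiler and target.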
max haughton <maxhaton gmail.com> writes:
On Monday, 6 December 2021 at 11:55:08 UTC, Ola Fosheim Grøstad wrote:
 On Monday, 6 December 2021 at 00:41:06 UTC, max haughton wrote:
 These kinds of wacky phase ordering (I assume) issues is why I 
 am slightly distrustful of GDC post-inlining decision.
 Multiplying by 0.5 only affects the exponent, but the add could overflow/underflow. Maybe that is wacky for D, since it is a shortcut for a set of options. If I specify -O or -O3, I would expect the same options as gcc. Otherwise people will claim that C++ is faster?
My sentence was referring to Iain's decision to refuse to inline templates (i.e. defer to LTO). It makes it harder to work out what the compiler is going to do / is doing.
Dec 07 2021