digitalmars.D - poor codegen for abs(), and div by literal?

reply NaN <divide by.zero> writes:
Given the following code...


module ohreally;

import std.math;

float foo(float y, float x)
{
     float ax = fabs(x);
     float ay = fabs(y);
     return ax*ay/3.142f;
}


the compiler outputs the following for the function body, 
(excluding prolog and epilogue code)...

    movss   -010h[RBP],XMM0
    movss   -8[RBP],XMM1
    fld     float ptr -010h[RBP]
    fabs
    fstp    qword ptr -020h[RBP]
    movsd   XMM0,-020h[RBP]
    cvtsd2ss        XMM0,XMM0
    fld     float ptr -8[RBP]
    fabs
    fstp    qword ptr -020h[RBP]
    movsd   XMM1,-020h[RBP]

    cvtsd2ss        XMM2,XMM1
    mulss   XMM0,XMM2
    movss   XMM3,FLAT:.rodata[00h][RIP]
    divss   XMM0,XMM3

So to do the abs(), it stores to memory from XMM reg, loads into 
x87 FPU regs, does the abs with the old FPU instruction, then for 
some reason stores the result as a double, loads that back into 
an XMM, converts it back to single.

And the div by 3.142f, is there a reason it can't be converted to 
a multiply? I know I can coax the multiply by doing 
*(1.0f/3.142f) instead, but I wondered if there's some reasoning 
behind why it's not done automatically?

Is any of this worth adding to the bug tracker?
Feb 07
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d wrote:
 Given the following code...
[...]
 the compiler outputs the following for the function body, (excluding
 prolog and epilogue code)...
[...]
 So to do the abs(), it stores to memory from XMM reg, loads into x87
 FPU regs, does the abs with the old FPU instruction, then for some
 reason stores the result as a double, loads that back into an XMM,
 converts it back to single.
Which compiler are you using? For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd. It's well-known that dmd codegen tends to lag behind ldc/gdc as far as efficiency / optimization is concerned. These days, I don't even look at dmd output anymore when I'm looking for performance. IME, dmd consistently produces code that's about 20-30% slower than ldc or gdc produced code, sometimes even as high as 40%.
 And the div by 3.142f, is there a reason it can't be converted to a
 multiply?  I know I can coax the multiply by doing *(1.0f/3.142f)
 instead, but I wondered if there's some reasoning behind why it's not
 done automatically?
 
 Is any of this worth adding to the bug tracker?
If this problem is specific to dmd, you can post a bug against dmd, I suppose, but I wouldn't hold my breath for dmd codegen to significantly improve in the near future. Walter is far too overloaded with other language issues to do significant work on the optimizer at the moment.

OTOH, when it comes to floating-point operations, the optimizer's hands may be tied because of IEEE 754 dictated semantics. There may be some corner cases where multiplying rather than dividing may produce different results, and therefore the optimizer is not free to simply substitute one for the other, even if in this case it works fine. You may need to spell it out yourself if what you want is a multiply rather than a divide.

T

--
IBM = I'll Buy Microsoft!
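The corner cases mentioned above can be probed with a small test program. The following is only a sketch: the helper name `countMismatches` and the sample range are made up for illustration, and the count for 3.142f depends on the compiler and target, so no particular number is claimed. Dividing by a power of two is the one case where the reciprocal is exact and both forms must agree.

```d
import std.stdio : writefln;

/// Hypothetical helper: counts how many of the integers 1 .. n, taken as
/// floats, give a different result for x / d than for x * (1 / d).
size_t countMismatches(float d, size_t n)
{
    immutable float r = 1.0f / d; // reciprocal, rounded to float
    size_t mismatches = 0;
    foreach (i; 1 .. n + 1)
    {
        immutable float x = cast(float) i;
        immutable float q = x / d; // what the source asks for
        immutable float p = x * r; // what the optimizer would substitute
        if (q != p)
            ++mismatches;
    }
    return mismatches;
}

void main()
{
    // Power-of-two divisor: 0.25f is exact, so both forms agree for every x.
    writefln("d = 4:      %s mismatches", countMismatches(4.0f, 100_000));
    // 3.142f has no exact reciprocal in float, so some inputs may differ.
    writefln("d = 3.142f: %s mismatches", countMismatches(3.142f, 100_000));
}
```

Any non-zero count for 3.142f means the substitution changes observable results, which is why a strictly conforming optimizer leaves the divide alone unless told otherwise.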
Feb 07
next sibling parent reply kinke <noone nowhere.com> writes:
On Thursday, 7 February 2019 at 23:35:36 UTC, H. S. Teoh wrote:
 On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d
 So to do the abs(), it stores to memory from XMM reg, loads 
 into x87 FPU regs, does the abs with the old FPU instruction, 
 then for some reason stores the result as a double, loads that 
 back into an XMM, converts it back to single.
Which compiler are you using? For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd.
Or just open std.math to see the simple reason for the old FPU being used:

```
real fabs(real x) @safe pure nothrow @nogc { pragma(inline, true); return core.math.fabs(x); }
//FIXME
///ditto
double fabs(double x) @safe pure nothrow @nogc { return fabs(cast(real) x); }
//FIXME
///ditto
float fabs(float x) @safe pure nothrow @nogc { return fabs(cast(real) x); }
```

Just one of many functions still operating with `real` precision only.
Feb 07
next sibling parent NaN <divide by.zero> writes:
On Friday, 8 February 2019 at 00:09:55 UTC, kinke wrote:
 On Thursday, 7 February 2019 at 23:35:36 UTC, H. S. Teoh wrote:
 On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d
 So to do the abs(), it stores to memory from XMM reg, loads 
 into x87 FPU regs, does the abs with the old FPU instruction, 
 then for some reason stores the result as a double, loads 
 that back into an XMM, converts it back to single.
Which compiler are you using? For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd.
Or just open std.math to see the simple reason for the old FPU being used:
I'm embarrassed to admit I did look at the source and didn't spot that they were all being upcast to real precision.
Feb 07
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Feb 08, 2019 at 12:09:55AM +0000, kinke via Digitalmars-d wrote:
[...]
 Or just open std.math to see the simple reason for the old FPU being
 used:
 
 ```
 real fabs(real x) @safe pure nothrow @nogc { pragma(inline, true);
 return core.math.fabs(x); }
 //FIXME
 ///ditto
 double fabs(double x) @safe pure nothrow @nogc { return fabs(cast(real) x);
 }
 //FIXME
 ///ditto
 float fabs(float x) @safe pure nothrow @nogc { return fabs(cast(real) x); }
 ```
 
 Just one of many functions still operating with `real` precision only.
Ugh. Not this again. :-( Didn't somebody clean up std.math recently to add double/float overloads? Or was that limited to only a few functions?

This really needs to be fixed sooner rather than later. It's an embarrassment to D for anyone who cares about floating-point performance.

T

--
This is not a sentence.
Feb 07
prev sibling parent NaN <divide by.zero> writes:
On Thursday, 7 February 2019 at 23:35:36 UTC, H. S. Teoh wrote:
 On Thu, Feb 07, 2019 at 11:15:08PM +0000, NaN via Digitalmars-d 
 wrote:
 Given the following code...
[...]
 the compiler outputs the following for the function body, 
 (excluding
 prolog and epilogue code)...
[...]
 So to do the abs(), it stores to memory from XMM reg, loads 
 into x87 FPU regs, does the abs with the old FPU instruction, 
 then for some reason stores the result as a double, loads that 
 back into an XMM, converts it back to single.
Which compiler are you using? For performance / codegen quality issues, I highly recommend looking at the output of ldc or gdc, rather than dmd. It's well-known that dmd codegen tends to lag behind ldc/gdc as far as efficiency / optimization is concerned. These days, I don't even look at dmd output anymore when I'm looking for performance. IME, dmd consistently produces code that's about 20-30% slower than ldc or gdc produced code, sometimes even as high as 40%.
I use LDC primarily; it just wasn't inlining the fabs calls, so I figured I would check what DMD was doing, and that was screwy in a different way. I wasn't sure if it was something worth reporting, but it looks like it's a known issue from what kinke posted.
Feb 07
prev sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
 Given the following code...


 module ohreally;

 import std.math;

 float foo(float y, float x)
 {
     float ax = fabs(x);
     float ay = fabs(y);
     return ax*ay/3.142f;
 }


 the compiler outputs the following for the function body, 
 (excluding prolog and epilogue code)...

    movss   -010h[RBP],XMM0
    movss   -8[RBP],XMM1
    fld     float ptr -010h[RBP]
    fabs
    fstp    qword ptr -020h[RBP]
    movsd   XMM0,-020h[RBP]
    cvtsd2ss        XMM0,XMM0
    fld     float ptr -8[RBP]
    fabs
    fstp    qword ptr -020h[RBP]
    movsd   XMM1,-020h[RBP]

    cvtsd2ss        XMM2,XMM1
    mulss   XMM0,XMM2
    movss   XMM3,FLAT:.rodata[00h][RIP]
    divss   XMM0,XMM3

 So to do the abs(), it stores to memory from XMM reg, loads 
 into x87 FPU regs, does the abs with the old FPU instruction, 
 then for some reason stores the result as a double, loads that 
 back into an XMM, converts it back to single.

 And the div by 3.142f, is there a reason it can't be converted 
 to a multiply? I know I can coax the multiply by doing 
 *(1.0f/3.142f) instead, but I wondered if there's some 
 reasoning behind why it's not done automatically?

 Is any of this worth adding to the bug tracker?
Essentially the problem is that fabs() is always done with the FPU, so with DMD the values always trip between the SSE registers, the stack of temporaries and the FPU registers. But fabs() doesn't have to be done in extended precision; it's not like the trigonometric operations after all, it's just about a **single bit**...

On amd64, fabs (for single and double) could be done using SSE **only**. I don't know how compiler intrinsics work, but the SSE version is not a single instruction. It's either 3 (generate a mask + logical AND) or 2 (left shift by 1, right shift by 1 to clear the sign).

An SSE-only version would be something more like:

    pcmpeqd xmm2, xmm2
    psrld xmm2, 01h
    andps xmm0, xmm2
    andps xmm1, xmm2
    mulss xmm0, xmm1
    mulss xmm0, dword ptr [<address of constant>]
    ret

In iasm (note: sadly the constant cannot be set to a static immutable):

    extern(C) float foo2(float y, float x, const float z = 1.0f / 3.142f)
    {
        asm pure nothrow
        {
            naked;
            pcmpeqd XMM3, XMM3;
            psrld XMM3, 1;
            andps XMM0, XMM3;
            andps XMM1, XMM3;
            mulss XMM0, XMM1;
            mulss XMM0, XMM2;
            ret;
        }
    }

LDC2 does almost that, except that the logical AND is in a subprogram:

    push rax
    movss dword ptr [rsp+04h], xmm1
    call 000000000045A020h
    movss dword ptr [rsp], xmm0
    movss xmm0, dword ptr [rsp+04h]
    call 000000000045A020h
    mulss xmm0, dword ptr [rsp]
    mulss xmm0, dword ptr [<address of constant>]
    pop rax
    ret

    000000000049DA20h:
    andps xmm0, dqword ptr [<address of mask>]
    ret

To come back to the bug of the "tripping values": it's known, and it can even happen when the FPU is not used [1].

[1] https://issues.dlang.org/show_bug.cgi?id=17965
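For readers who prefer to see the "single bit" idea without assembly, here is a rough D sketch of the two variants described above, working on the raw bits of a float. The names `fabsMask` and `fabsShift` are invented for this illustration and are not part of Phobos; whether a given compiler reduces either of them to a single `andps` is not guaranteed.

```d
import std.math : signbit;

/// Mask variant: AND the raw bits with 0x7FFF_FFFF, the same mask that
/// pcmpeqd (all ones) followed by psrld by 1 builds in a register.
float fabsMask(float x)
{
    union Bits { float f; uint u; }
    Bits b;
    b.f = x;
    b.u &= uint.max >>> 1;
    return b.f;
}

/// Shift variant: push the sign bit out on the left, then shift back.
float fabsShift(float x)
{
    union Bits { float f; uint u; }
    Bits b;
    b.f = x;
    b.u = (b.u << 1) >>> 1;
    return b.f;
}

void main()
{
    assert(fabsMask(-1.5f) == 1.5f);
    assert(fabsShift(-1.5f) == 1.5f);
    assert(fabsMask(2.25f) == 2.25f);
    // Negative zero only differs from positive zero in the sign bit.
    assert(!signbit(fabsMask(-0.0f)));
    assert(!signbit(fabsShift(-0.0f)));
}
```

Both functions only touch the top bit, so infinities keep their magnitude and NaN payloads are preserved.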
Feb 08
next sibling parent Basile B. <b2.temp gmx.com> writes:
On Saturday, 9 February 2019 at 03:28:41 UTC, Basile B. wrote:
 On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
 Is any of this worth adding to the bug tracker?
I think so : https://issues.dlang.org/show_bug.cgi?id=19663
Feb 09
prev sibling parent reply NaN <divide by.zero> writes:
On Saturday, 9 February 2019 at 03:28:41 UTC, Basile B. wrote:
 On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
 Given the following code...
 LDC2 does almost that, except that the logical AND is in a 
 subprogram:

   push rax
   movss dword ptr [rsp+04h], xmm1
   call 000000000045A020h
   movss dword ptr [rsp], xmm0
   movss xmm0, dword ptr [rsp+04h]
   call 000000000045A020h
   mulss xmm0, dword ptr [rsp]
   mulss xmm0, dword ptr [<address of constant>]
   pop rax
   ret
What flags are you passing LDC? I can't get it to convert the division into a multiply by its inverse unless I specifically change /3.142f to *(1.0f/3.142f).

And FWIW I'm using...

    float fabs(float x) // needed because math.fabs is not being inlined
    {
        uint tmp = *(cast(uint*) &x) & 0x7fffffff;
        float f = *(cast(float*) &tmp);
        return f;
    }

That compiles down to a single "andps" instruction and is inlined if it's in the same module. I tried cross-module inlining as suggested in the LDC forum, but it caused my program to hang.
Feb 09
next sibling parent kinke <noone nowhere.com> writes:
On Saturday, 9 February 2019 at 15:14:47 UTC, NaN wrote:
 What flags are you passing LDC? I can't get it to convert the 
 division into a multiply by its inverse unless I specifically 
 change /3.142f to *(1.0f/3.142f).
Use `-ffast-math`. If you need more fine-grained control (e.g., `enable-unsafe-fp-math`), dare to invoke `ldc2 --help-hidden`.
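As an alternative to a global flag, LDC also ships a `@fastmath` attribute in `ldc.attributes` that relaxes floating-point semantics for a single function. The sketch below assumes that attribute is available in your LDC version; the stand-in `struct fastmath` is purely hypothetical glue so the file still compiles with other compilers, where it has no effect. Whether the divide actually becomes a multiply still depends on the optimization level.

```d
import std.math : fabs;
import std.stdio : writeln;

version (LDC)
    import ldc.attributes : fastmath; // LDC-specific attribute
else
    struct fastmath {} // inert stand-in (assumption): plain UDA, no effect

// With LDC and optimizations on, @fastmath permits transformations such as
// replacing the divide by a multiply with the reciprocal.
@fastmath
float foo(float y, float x)
{
    float ax = fabs(x);
    float ay = fabs(y);
    return ax * ay / 3.142f;
}

void main()
{
    writeln(foo(-2.0f, 3.0f));
}
```

Scoping the relaxation to one function keeps strict IEEE 754 behaviour everywhere else in the program, which can matter if other code relies on NaN or infinity handling.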
Feb 09
prev sibling parent Basile B. <b2.temp gmx.com> writes:
On Saturday, 9 February 2019 at 15:14:47 UTC, NaN wrote:
 On Saturday, 9 February 2019 at 03:28:41 UTC, Basile B. wrote:
 [...]
   [...]
 What flags are you passing LDC? I can't get it to convert the division into a multiply by its inverse unless I specifically change /3.142f to *(1.0f/3.142f).

 And FWIW I'm using...

     float fabs(float x) // needed because math.fabs is not being inlined
     {
         uint tmp = *(cast(uint*) &x) & 0x7fffffff;
         float f = *(cast(float*) &tmp);
         return f;
     }

 That compiles down to a single "andps" instruction and is inlined if it's in the same module. I tried cross-module inlining as suggested in the LDC forum, but it caused my program to hang.
Oh sorry, I got the inverse multiply manually. I was more focused on the x87 instruction than the last part of the Q.
Feb 09