digitalmars.D.learn - Easiest way to use FMA instruction

Ben Jones <fake fake.fake> writes:

What's the easiest way to use the FMA instruction (fused multiply-add, 
which has nice rounding properties)? The FMA function in Phobos just 
computes a*b + c, which will round twice.

Do any of the intrinsics libraries include this?  Should I write 
my own inline ASM?
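
A small sketch of the double rounding in question, assuming the C runtime's `fma` (reachable through `core.stdc.math`) is correctly rounded. The helper names `mulThenAdd` and `fusedMulAdd` are made up for illustration. What the plain expression prints depends on the compiler and target, since D permits intermediate results to be kept at higher precision.

```d
import core.stdc.math : cfma = fma; // C99 fma from the C runtime (assumed correctly rounded)
import std.stdio : writefln;

/// Multiply, then add, as two separate operations (may round twice).
double mulThenAdd(double a, double b, double c)
{
    return a * b + c;
}

/// Fused multiply-add through the C library (rounds once).
double fusedMulAdd(double a, double b, double c)
{
    return cfma(a, b, c);
}

void main()
{
    // The exact product a*b is 1 - 2^-54, which does not fit in a double.
    double a = 1.0 + 0x1p-27;
    double b = 1.0 - 0x1p-27;
    double c = -1.0;

    // %a prints the values as hexadecimal floats, so every bit is visible.
    writefln("a*b + c      = %a", mulThenAdd(a, b, c));
    writefln("fma(a, b, c) = %a", fusedMulAdd(a, b, c));
}
```

If the product is rounded to double before the addition, the first line shows zero, while the fused version keeps the -2^-54 term.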

Jan 09 2020

Ben Jones <fake fake.fake> writes:

On Thursday, 9 January 2020 at 20:57:10 UTC, Ben Jones wrote:
 What's the easiest way to use the FMA instruction (fused 
 multiply add that has nice rounding properties)?  The FMA 
 function in Phobos just does a*b +c  which will round twice.

 Do any of the intrinsics libraries include this?  Should I 
 write my own inline ASM?

This seems to work with DMD, but seems fragile:

`
/// Returns round(a*b + c), computed as if in infinite precision and
/// rounded once at the end.
double fma(double a, double b, double c) @safe pure @nogc nothrow {
     asm @safe pure @nogc nothrow {
         naked;
         vfmadd231sd XMM0, XMM1, XMM2;
         ret;
     }
}
`

Jan 09 2020

Johan <j j.nl> writes:

On Thursday, 9 January 2020 at 22:50:37 UTC, Ben Jones wrote:
 On Thursday, 9 January 2020 at 20:57:10 UTC, Ben Jones wrote:
 What's the easiest way to use the FMA instruction (fused 
 multiply add that has nice rounding properties)?  The FMA 
 function in Phobos just does a*b +c  which will round twice.

 Do any of the intrinsics libraries include this?  Should I 
 write my own inline ASM?



Why do you want to use the FMA instruction?

If for performance:
Inline assembly is generally very bad for performance as it 
disables inlining and the compiler probably does not understand 
the instruction itself (hence cannot combine it with other 
optimizations). In this case you don't necessarily need the FMA 
instruction (instead you want whatever instruction is fastest), 
so you shouldn't force the compiler to use that instruction. Have 
a look at https://github.com/AuburnSounds/intel-intrinsics, although FMA 
is not supported there yet.


If only for the rounding behavior:
Then indeed you need to force the compiler to use the FMA 
instruction (also for non-optimized code, so you cannot rely on the 
optimizer). Inline assembly is one solution. GDC and LDC provide a 
better inline assembly method that preserves, among other things, 
inlining potential and doesn't require hardcoded ABI details.
For LDC:
```
double fma(double a, double b, double c)
{
     import ldc.llvmasm;
     return __irEx!(
              `declare double @llvm.fma.f64(double %a, double %b, double %c)`,
              `%r = call double @llvm.fma.f64(double %0, double %1, double %2)
               ret double %r`,
              "",
              double, double, double, double)(a,b,c);
}
```
See https://wiki.dlang.org/LDC_inline_IR, although it is a little 
outdated; see https://github.com/ldc-developers/ldc/issues/3271


cheers,
  Johan

Jan 09 2020

Johan <j j.nl> writes:

On Friday, 10 January 2020 at 00:02:52 UTC, Johan wrote:
 
 For LDC:
 ```
 double fma(double a, double b, double c)
 {
     import ldc.llvmasm;
     return __irEx!(
              `declare double @llvm.fma.f64(double %a, double %b, double %c)`,
              `%r = call double @llvm.fma.f64(double %0, double %1, double %2)
               ret double %r`,
              "",
              double, double, double, double)(a,b,c);
 }
 ```

You have to tell LDC that you are compiling for a CPU that has 
FMA capability (otherwise it will insert a call to a "fma" 
runtime library function that most likely you are not linking 
with). For example, "-mattr=fma" or "-mcpu=skylake".
https://d.godbolt.org/z/ddwORl

Or you add it only for the "fma" function, using
```
import ldc.attributes;
@target("fma") double fma(double a, double b, double c) ...
```
https://d.godbolt.org/z/-X7FnC
https://wiki.dlang.org/LDC-specific_language_changes#.40.28ldc.attributes.target.28.22feature.22.29.29

cheers,
   Johan

Jan 09 2020

Ben Jones <fake fake.fake> writes:

On Friday, 10 January 2020 at 00:08:44 UTC, Johan wrote:
 On Friday, 10 January 2020 at 00:02:52 UTC, Johan wrote:
 [...]

 You have to tell LDC that you are compiling for a CPU that has 
 FMA capability (otherwise it will insert a call to a "fma" 
 runtime library function that most likely you are not linking 
 with). For example, "-mattr=fma" or "-mcpu=skylake".
 https://d.godbolt.org/z/ddwORl

 Or you add it only for the "fma" function, using
 ```
 import ldc.attributes;
 @target("fma") double fma(double a, double b, double c) ...
 ```
 https://d.godbolt.org/z/-X7FnC
 https://wiki.dlang.org/LDC-specific_language_changes#.40.28ldc.attributes.target.28.22feature.22.29.29

 cheers,
   Johan

I need it for the rounding behavior.  Thanks for the pointers, 
that's very helpful.

Jan 09 2020

kinke <kinke gmx.net> writes:

On Friday, 10 January 2020 at 00:02:52 UTC, Johan wrote:
 For LDC:
 [...]

Simpler variant:

```
import ldc.intrinsics;
...
const result = llvm_fma(a, b, c);
```

This LLVM intrinsic is also used in LDC's Phobos for 
std.math.fma(); unfortunately, upstream Phobos just has a 
`real`-version, so the float/double versions aren't enabled yet: 
https://github.com/ldc-developers/phobos/blob/26d14c1a292267a32ce64fa7f219acc3d3cca274/std/math.d#L8370-L8376
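
Putting the pieces of this thread together, a wrapper might look like the sketch below. It assumes LDC on x86-64 for the intrinsic path and falls back to the C runtime's `fma` on other compilers. The name `fusedMulAdd` is hypothetical and not part of any library.

```d
import std.stdio : writefln;

version (LDC)
{
    import ldc.attributes : target;
    import ldc.intrinsics : llvm_fma;

    // Enable the FMA feature for this one function only (assumes an
    // x86-64 target), so no global -mattr or -mcpu flag is needed.
    @target("fma")
    double fusedMulAdd(double a, double b, double c)
    {
        return llvm_fma(a, b, c);
    }
}
else
{
    import core.stdc.math : cfma = fma;

    // Fallback for other compilers: the C99 fma from the C runtime.
    double fusedMulAdd(double a, double b, double c)
    {
        return cfma(a, b, c);
    }
}

void main()
{
    // The exact value of a*b + c here is -2^-54.
    double a = 1.0 + 0x1p-27;
    double b = 1.0 - 0x1p-27;
    writefln("%a", fusedMulAdd(a, b, -1.0));
}
```

Keeping the LDC-specific imports inside `version (LDC)` lets the same file build with DMD or GDC, at the cost of a library call instead of a single instruction there.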

Jan 10 2020
