digitalmars.D.ldc - How to prevent optimizer from reordering stuff?

Dan Olson <zans.is.for.cans yahoo.com> writes:

While tracking down std.math problems for ARM, I find that the optimizer
will reorder instructions to get FPSCR flags before the divide
operation.

Is there a way to force instruction ordering here?  I tried the
llvm_memory_fence, but it doesn't do the job.

real zero = 0.0;

void foo()
{
    import std.math, std.c.stdio, ldc.llvmasm;

    real x = 1.0 / zero;

    auto f = __asm!uint("vmrs $0, fpscr", "=r");
    IeeeFlags flags = ieeeFlags();
    printf("%f, %u %d\n", x, f, flags.divByZero);
}

Compiled with -O -mtriple=thumbv7-apple-ios, you can see that vdiv is
after both my inline asm and std.math ieeeFlags().

	vldr	d8, [r0]
	@ InlineAsm Start
	vmrs	r4, fpscr
	@ InlineAsm End
	mov	r0, r5
	blx	__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags

	mov	r0, r5
	vdiv.f64	d8, d16, d8

What to do?
--
Dan

Mar 14 2015

"David Nadlinger" <code klickverbot.at> writes:

On Saturday, 14 March 2015 at 18:42:45 UTC, Dan Olson wrote:
 While tracking down std.math problems for ARM, I find that the
 optimizer will reorder instructions to get FPSCR flags before the
 divide operation.

IIRC FP flag/mode support is a tricky topic in LLVM in general, 
but this specific problem seems weird. What are the attributes 
for __D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags in the IR? The 
optimizer should never move code across arbitrary function calls…

David

Mar 14 2015

Dan Olson <zans.is.for.cans yahoo.com> writes:

"David Nadlinger" <code klickverbot.at> writes:

 On Saturday, 14 March 2015 at 18:42:45 UTC, Dan Olson wrote:
 While tracking down std.math problems for ARM, I find that the optimizer
 will reorder instructions to get FPSCR flags before the divide
 operation.

 IIRC FP flag/mode support is a tricky topic in LLVM in general, but
 this specific problem seems weird. What are the attributes for
 __D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags in the IR? The
 optimizer should never move code across arbitrary function calls…

 David

Hi David.

I don't see any attributes for that function.  I will just paste
some of the -output-ll results since nothing sticks out to me.

declare fastcc void @_D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags(
%std.math.IeeeFlags* noalias sret)

define fastcc void @_D10unittester3fooFZv() {
  %flags = alloca %std.math.IeeeFlags, align 4
  %1 = load double* @_D10unittester4zeroe, align 8
  %2 = fdiv double 1.000000e+00, %1

  call fastcc void @_D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags(
%std.math.IeeeFlags* noalias sret %flags)
  %tmp = call fastcc i1
@_D3std4math9IeeeFlags9divByZeroMFNdZb(%std.math.IeeeFlags* %flags)
  %4 = zext i1 %tmp to i32
  %tmp1 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([11 x i8]*
@.str12, i32 0, i32 0), double %2, i32 %3, i32 %4)
  ret void
}

The only guess I have right now for this is from:

http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042e/IHI0042E_aapcs.pdf

  The FPSCR is the only status register that may be accessed by
  conforming code. It is a global register with the following
  properties:

  - The condition code bits (28-31), the cumulative saturation (QC) bit
    (27) and the cumulative exception-status bits (0-4) are not
    preserved across a public interface.

  (snip)

Maybe that means the compiler can treat the FPSCR state from my vdiv.f64
as undefined across function call boundaries, so ordering should not
matter?

Mar 14 2015

David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

Hi Dan,

On 03/14/2015 09:20 PM, Dan Olson via digitalmars-d-ldc wrote:
 I don't see any attributes for that function.  I will just paste
 some of the -output-ll results since nothing sticks out to me.

Yeah, seems like everything is in order (no pun intended) after the main 
IR-level optimizer. This suggests that the reordering happens on the 
target-specific optimization or instruction selection level. I suppose 
you could try disabling codegen optimizations if you wanted to 
investigate this further.

 Maybe that means the compiler can treat the FPSCR state from my vdiv.f64
 as undefined across function call boundaries, so ordering should not
 matter?

This seems like a reasonable guess. Did you try asking on the LLVM IRC 
channel or mailing list? Depending on the outcome (i.e. if the ABI is 
really to be interpreted that way), we should probably discuss its 
implications for D's FP handling strategy on the main D mailing lists.

Best,
David

Mar 15 2015

Dan Olson <zans.is.for.cans yahoo.com> writes:

David Nadlinger via digitalmars-d-ldc <digitalmars-d-ldc puremagic.com> writes:

 Hi Dan,

 On 03/14/2015 09:20 PM, Dan Olson via digitalmars-d-ldc wrote:
 I don't see any attributes for that function.  I will just paste
 some of the -output-ll results since nothing sticks out to me.

 Yeah, seems like everything is in order (no pun intended) after the
 main IR-level optimizer. This suggests that the reordering happens on
 the target-specific optimization or instruction selection level. I
 suppose you could try disabling codegen optimizations if you wanted to
 investigate this further.

It is a good puzzle.  For what it is worth, clang does the same thing
with similar code.

 Maybe that means the compiler can treat the FPSCR state from my vdiv.f64
 as undefined across function call boundaries, so ordering should not
 matter?

 This seems like a reasonable guess. Did you try asking on the LLVM IRC
 channel or mailing list? Depending on the outcome (i.e. if the ABI is
 really to be interpreted that way), we should probably discuss its
 implications for D's FP handling strategy on the main D mailing lists.

I have not asked elsewhere yet.  I'm going to explore the problem a bit
more, then ask.

Mar 15 2015

Dan Olson <zans.is.for.cans yahoo.com> writes:

Ok, I have stumbled into an old problem it seems.

C99 invented "#pragma STDC FENV_ACCESS ON" to prevent the optimizer from
reordering instructions that affect the floating-point environment.  See
note [2] here:

http://en.wikipedia.org/wiki/C99#Example

And clang (LLVM) does not support this pragma:

https://llvm.org/bugs/show_bug.cgi?id=10409

The workaround in C is to use volatile variables to force ordering.

And one more reference:

http://wiki.musl-libc.org/wiki/Mathematical_Library#Fenv_and_error_handling

Mar 15 2015

Dan Olson <zans.is.for.cans yahoo.com> writes:

Dan Olson <zans.is.for.cans yahoo.com> writes:

 While tracking down std.math problems for ARM, I find that the optimizer
 will reorder instructions to get FPSCR flags before the divide
 operation.

 Is there a way to force instruction ordering here?  I tried the
 llvm_memory_fence, but it doesn't do the job.

 real zero = 0.0;

 void foo()
 {
     import std.math, std.c.stdio, ldc.llvmasm;

     real x = 1.0 / zero;

     auto f = __asm!uint("vmrs $0, fpscr", "=r");
     IeeeFlags flags = ieeeFlags();
     printf("%f, %u %d\n", x, f, flags.divByZero);
 }

 Compiled with -O -mtriple=thumbv7-apple-ios, you can see that vdiv is
 after both my inline asm and std.math ieeeFlags().

 	vldr	d8, [r0]
 	@ InlineAsm Start
 	vmrs	r4, fpscr
 	@ InlineAsm End
 	mov	r0, r5
 	blx	__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags

 	mov	r0, r5
 	vdiv.f64	d8, d16, d8

 What to do?

I have a solution.  At least it is a start.  Passing the result of
the floating-point operation as an argument to an empty inline asm gives
the correct ordering.  Unlike the C volatile trick (FORCE_EVAL macro),
it doesn't do any unnecessary stores.

For my use, I wrapped the inline asm in a function "use()" that is
specific to ARM because of the 'w' constraint.  I am thinking it could
be named FORCE_EVAL to align with what is in the Linux libm and then
made general for other D CPU targets.

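// Empty inline asm that takes x as an input operand, so x has to be
// computed before this point.
void use(T)(T x) @nogc nothrow
{
    import ldc.llvmasm;      // LDC's __asm
    import std.traits;
    static if (isFloatingPoint!(T))
        __asm("", "w", x);   // ARM FP register
    else
        __asm("", "r", x);   // general-purpose register
}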


Compile as before (-O), but with use(x).

real zero = 0.0;

void foo()
{
    import std.math, std.c.stdio, ldc.llvmasm;
    
    real x = 1.0 / zero;
    use(x);

    // get float flags in an ARM-specific way
    auto f = __asm!uint("vmrs $0, fpscr", "=r");
    // get float flags the D way
    IeeeFlags flags = ieeeFlags();
    printf("%f, %u %d\n", x, f, flags.divByZero);
}

Now vdiv.f64 happens before all the flag fetching.



	vldr	d17, [r0]
	mov	r0, r5
	vdiv.f64	d8, d16, d17          <------ yeah!
	@ InlineAsm Start
	@ InlineAsm End
	@ InlineAsm Start
	vmrs	r4, fpscr
	@ InlineAsm End
	blx	__D3std4math9ieeeFlagsFNdZS3std4math9IeeeFlags
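
For comparison, the same empty-asm idea can be sketched in GNU C (GCC or Clang extended asm). This is only an illustration under stated assumptions. The names use_double, divisor and divide_then_pin are hypothetical. The constraint letters are the GCC machine constraints as I understand them: "w" for ARM VFP and AArch64 FP registers, "x" for x86 SSE registers. Any other target falls back to the volatile store.

```c
/* Hypothetical helper mirroring the D use() above: an empty asm statement
   that consumes x as an input operand in a floating-point register, so x
   must have been computed before this point. No instructions are emitted. */
static inline void use_double(double x)
{
#if (defined(__arm__) && defined(__ARM_FP)) || defined(__aarch64__)
    __asm__ volatile("" : : "w"(x));   /* "w": VFP / FP-SIMD register */
#elif defined(__x86_64__)
    __asm__ volatile("" : : "x"(x));   /* "x": SSE register */
#else
    volatile double sink = x;          /* fallback: force a volatile store */
    (void)sink;
#endif
}

/* volatile so the division is not folded at compile time */
static volatile double divisor = 0.0;

/* Divides, pins the result with use_double(), then returns it. Code that
   reads the FP status flags would go after the use_double() call. */
double divide_then_pin(void)
{
    double q = 1.0 / divisor;
    use_double(q);
    return q;
}
```

The design choice is the same as in the D version: the asm statement acts as an ordering point that the value has to reach, without the store to memory that the volatile approach costs.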

--
Dan

Mar 15 2015
