
digitalmars.D.ldc - inline asm in inlined function / ECX clobbered / stack frame / naked

reply James Blachly <james.blachly gmail.com> writes:
I am posting here instead of learn because it is my understanding that 
DMD will not inline a fn that has inline asm, and because one of my 
questions relates to why LLVM is unsafely(?) using ECX.

Problem Statement: I am writing a faster round-up-to-nearest-power-of-2 
function. An inline asm [1] version using the BSR instruction is about twice 
as fast as the classic bit-twiddling sequence (--(x), (x)|=(x)>>1, (x)|=(x)>>2, 
(x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x)). [2]
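
For reference, that bit-twiddling sequence as a plain D function (named 
kroundup32 here to match the benchmarks later in this thread):

uint kroundup32(uint x)
{
    // Round up to the next power of 2 by smearing the highest set bit
    // of x-1 rightward, then adding 1.
    x -= 1;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x + 1;
}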

I know about core.bitop.bsr and std.math.nextPow2, which uses it. My asm 
code block is 2.5x faster than the codegen for (2 << bsr(x)), which 
surprises me...
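
Concretely, the bsr-based comparison function is essentially this (it 
reappears below as bitop_bsr):

uint bitop_bsr(uint x)
{
    import core.bitop : bsr;
    // bsr(x - 1) is the index of the highest set bit of x - 1;
    // shifting 2 left by that index rounds x up to the next power of 2.
    return x <= 2 ? x : 2 << bsr(x - 1);
}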

Summary: I've inlined a function consisting mostly of inline assembly. 
In the initial approach ECX is clobbered, but LDC/LLVM does not recognize 
this and still uses it as the caller's loop counter. In approach 2, I push 
RCX onto the stack; LDC/LLVM then sees me use RCX(?) and uses EDX instead, 
but now the inline asm reference to the stack frame variable (x[EBP] or 
[EBP + x] -- side note, should that be RBP?) is off by 8 bytes. I can 
rescue that by manually writing [EBP + x + 8], which seems hinky and wrong. 
By moving the PUSH until after EAX is loaded from the stack, all is in order.

Questions at bottom.

The results below were produced with ldc2 -release -O2 (version 1.15.0).

Link to compiler explorer: https://godbolt.org/z/uLaIS5

Approach 1:
uint roundup32(uint x)
{
    pragma(inline, true);
    version(LDC) pragma(LDC_allow_inline);

    if (x <= 2) return x;
    asm
    {
        mov EAX, x[EBP];
        sub EAX, 1;
        bsr ECX, EAX;   // ecx = y = msb(x-1)
        mov EAX, 2;
        shl EAX, CL;    // return (2 << y)
    }    // returns EAX
}

Result 1:
This works well _WHEN FN NOT INLINED_.

When inlined, my code clobbers ECX, which the calling function was using 
to track its loop iteration.

.LBB1_1:
         mov     dword ptr [rsp + 8], ecx	; x param == loop iter
         mov     eax, dword ptr [rsp + 8]	; x param
         sub     eax, 1
         bsr     ecx, eax
         mov     eax, 2
         shl     eax, cl			; inline asm block ends here
         mov     eax, eax		; alignment?
         add     rbx, rax		; rbx == accumulator
         add     ecx, 1		; next round for loop, but ecx was chgd

         jne     .LBB1_1

So, perhaps I can just save ECX.

Approach 2:
Same, but PUSH RCX / POP RCX at beginning and end of the asm block.
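
In code, the asm block from approach 1 becomes (a sketch; only the 
PUSH/POP lines are new):

asm
{
    push RCX;           // save the caller's RCX
    mov EAX, x[EBP];
    sub EAX, 1;
    bsr ECX, EAX;       // ecx = y = msb(x-1)
    mov EAX, 2;
    shl EAX, CL;        // return (2 << y)
    pop RCX;            // restore the caller's RCX
}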

Result 2:
The compiler detects that I have used RCX and uses EDX as the loop counter 
instead. New problem: since I have pushed onto the stack, the param x is 
now at offset rsp + 16, but the generated code still believes it is at rsp + 8:

.LBB1_1:
         mov     dword ptr [rsp + 8], edx	; x param == loop iter
         push    rcx				; approach 2
         mov     eax, dword ptr [rsp + 8]	; looking for x here :-(
         sub     eax, 1
         bsr     ecx, eax
         mov     eax, 2
         shl     eax, cl
         pop     rcx			; inline asm block ends here
         mov     eax, eax		; alignment?
         add     rbx, rax		; rbx == accumulator
         add     edx, 1		; next round for loop

         jne     .LBB1_1

So now because we are looking in the wrong spot on the stack, the 
function has the wrong value in eax.

Approach 3:
Cheat, because I know RCX was pushed, and change the local reference to 
x from [RBP + x] to [RBP + x + 8].

Result 3:
This works, but feels wrong.

Final, successful approach:
I can just move the PUSH RCX until after EAX has been loaded from [rsp + 8], 
and not worry about the altered stack frame.
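
For reference, the resulting asm block (a sketch; it matches the #APP 
region in the assembly dump further down):

asm
{
    mov EAX, x[EBP];    // load x while the stack frame is still un-shifted
    sub EAX, 1;
    push RCX;           // save the caller's RCX only after x has been read
    bsr ECX, EAX;       // ecx = y = msb(x-1)
    mov EAX, 2;
    shl EAX, CL;        // return (2 << y)
    pop RCX;            // restore RCX
}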



Questions:
0. Should I quit using DMD style assembly and use LLVM style?

1. gcc inline asm has the capability to specify in/out values and what's 
clobbered (see the sketch after these questions for the kind of annotation 
I mean). I assume this is a deficiency of DMD-style inline asm?

2. LDC/LLVM can inline this function, which eliminates the prologue/epilogue, 
but it still has a local stack frame holding the "passed" value x. 
What is going on? This must be typical behavior when a function is inlined, 
yes?

3. Even though it is essentially a naked function, I cannot add `naked;`, 
because then I can only access variables by name from the global scope. 
Why, given that when it is inlined I can still access the "passed" x?
3b. Is an inlined function de facto "naked", meaning there is no need for 
this keyword?

4a. How can I notify LDC/LLVM that ECX is being used? OR: why does it not 
notice in approach 1, but does appear to in approach 2?
4b. Cdecl says ECX is a caller-saved register [3]; does this go out the 
window when the function is inlined? In that case, how can I be sure it is 
safe to use EAX, ECX, or EDX in an inline asm block in a function that 
will be inlined?
(Interestingly, when the function is not inlined, EBX is used to track the 
loop iteration, since the compiler knows ECX could be clobbered.)

5. Is the final solution of moving the PUSH RCX until after EAX is loaded 
from the stack the correct approach?
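
(For context on questions 1 and 4a: this is the kind of clobber annotation 
I am after. A minimal sketch using LDC's LLVM-style inline asm -- the 
helper name shift2left is made up; the "~{ecx}" element of the constraint 
string is what declares ECX as clobbered:)

uint shift2left(uint amount)
{
    import ldc.llvmasm : __asm;
    // "={eax}" : result is returned in EAX
    // "{eax}"  : input is passed in EAX
    // "~{ecx}" : ECX is clobbered by this block
    return __asm!uint(
        `movl %eax, %ecx
         movl $$2, %eax
         shll %cl, %eax`,
        "={eax},{eax},~{ecx}", amount);
}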


This was quite lengthy; as a novice, **I am very grateful for your expert 
help.**

James


References:
[1] DMD style assembly: https://dlang.org/spec/iasm.html

[2] You can read more about round up optimization here: 
http://locklessinc.com/articles/next_pow2/ (although note that in this 
very old article, the bitshift trick was the fastest whereas on modern 
processors the BSR instruction method is faster)

[3] https://www.agner.org/optimize/calling_conventions.pdf
May 05 2019
parent reply kinke <noone nowhere.com> writes:
On Monday, 6 May 2019 at 03:09:38 UTC, James Blachly wrote:
 I know about core.bitop.bsr and std.math.nextPow2 which uses 
 it. My asm code block is 2.5x faster than codegen for (2 << 
 bsr(x)) which surprises me...
Sorry, but I'll just focus on that, and not on the asm questions. The reason is simple: I discourage anyone from going down to asm level if it can be avoided.

So, I have:

pragma(inline, true)
uint roundup32(uint x)
{
    import core.bitop;
    //if (x <= 2) return x;
    return 2u << bsr(x - 1);
}

`ldc2 -mtriple=x86_64-linux-gnu -O -output-s foo.d` (AT&T syntax...):

_D3foo9roundup32FkZk:
        addl    $-1, %edi
        bsrl    %edi, %ecx
        xorl    $31, %ecx
        xorb    $31, %cl
        movl    $2, %eax
        shll    %cl, %eax
        retq

I can't believe that's 2.5x slower than your almost identical asm block. And that code is portable, not just OS- and ABI-independent, but also architecture-wise. 1000x better than inline asm IMO.
May 06 2019
parent reply kinke <noone nowhere.com> writes:
Adding `-mcpu=haswell` to the LDC command-line, the generated asm 
is:

	addl	$-1, %edi
	lzcntl	%edi, %eax
	xorb	$31, %al
	movl	$2, %ecx
	shlxl	%eax, %ecx, %eax

[Add `-mcpu=native` to tune code-gen for *your* CPU.]
May 06 2019
parent reply James Blachly <james.blachly gmail.com> writes:
On 5/6/19 3:55 PM, kinke wrote:
 Adding `-mcpu=haswell` to the LDC command-line, the generated asm is:
 
      addl    $-1, %edi
      lzcntl    %edi, %eax
      xorb    $31, %al
      movl    $2, %ecx
      shlxl    %eax, %ecx, %eax
 
 [Add `-mcpu=native` to tune code-gen for *your* CPU.]
Thanks kinke. Agree, I am all for portability. This is my first stab at asm. The speed is not really the point; this is really me asking for help understanding compiler internals and fundamentals of inline assembly.

Nevertheless, running each of the 3 algorithms (asm, the famous bitwise-OR trick for this problem, and core.bitop.bsr) 2^28 times, I get the following results:

assembly version:
Elapsed msec: 564
Sum: 48038395756849835
Stopwatch is reset: 0

kroundup32:
Elapsed msec: 776
Sum: 48038395756849835
Stopwatch is reset: 0

bitop_bsr:
Elapsed msec: 1299
Sum: 48038395756849835

-mcpu=native:
The performance of bitop_bsr is markedly improved when using -mcpu=native on linux (but only marginally better on MacOS), which is really surprising(???), but it is still slower than the hand asm: about 2.25x the runtime, down from 2.5x on ivybridge (Mac) and 2.9x on sandybridge (linux). Maybe because I do not have the latest and greatest processors? Assembly at bottom. The two xor instructions after bsr are unnecessary(?) and perhaps a contributor to slowdown.

Anyway, mostly I wanted help understanding register safety and proper use of inline asm.

Of note, looking at the .s output, on MacOS the inline asm {} block is #NO_APP. Since this is compiler-emitted I would have expected it to be the same across platforms -- or is this determined by LLVM version?

James

Assembly with -O2 -release -mcpu=native. Note that these are the non-inlined versions. Also not sure why roundup32(uint x) has so much more prologue than bitop_bsr(uint x).

_D9asminline9bitop_bsrFkZk:
        .cfi_startproc
        cmpl    $2, %edi
        ja      .LBB0_2
        movl    %edi, %eax
        retq
.LBB0_2:
        addl    $-1, %edi
        bsrl    %edi, %ecx
        xorl    $31, %ecx
        xorb    $31, %cl
        movl    $2, %eax
        shll    %cl, %eax
        retq

---

_D9asminline9roundup32FkZk:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movl    %edi, -4(%rbp)
        cmpl    $2, %edi
        ja      .LBB1_2
        movl    %edi, %eax
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
.LBB1_2:
        .cfi_def_cfa %rbp, 16
        #APP
        movl    -4(%rbp), %eax
        subl    $1, %eax
        pushq   %rcx
        bsrl    %eax, %ecx
        movl    $2, %eax
        shll    %cl, %eax
        popq    %rcx
        #NO_APP
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
May 06 2019
parent reply kinke <noone nowhere.com> writes:
On Tuesday, 7 May 2019 at 01:20:43 UTC, James Blachly wrote:
 The performance of bitop_bsr is markedly improved when using 
 -mcpu=native on linux (but only marginally better on MacOS) 
 which is really surprising(???) but still slower than the hand 
 asm, about 2.25x the runtime from 2.5x on ivybridge (Mac) and 
 2.9x on sandybridge (linux).
I still couldn't believe this, so benchmarked myself. First observation: you didn't re-start() the StopWatch after resetting it; the next stop() will then measure the elapsed time since it was originally started (!).

I modified the code a bit, switching to LLVM inline asm (highly recommended if DMD compatibility isn't that important) and adding a nextPow2-based version:

```
enum MAXITER = 1 << 28;

pragma(inline, true):

uint roundup32(uint x)
{
    if (x <= 2) return x;

    import ldc.llvmasm;
    return __asm!uint(
        `bsrl %eax, %ecx
         #xorl $$31, %ecx
         #xorb $$31, %cl
         movl $$2, %eax
         shll %cl, %eax`,
        "={eax},{eax},~{ecx}", x - 1);
}

uint bitop_bsr(uint x)
{
    import core.bitop;
    return x <= 2 ? x : 2 << bsr(x - 1);
}

uint nextPow2(uint x)
{
    import std.math;
    return x <= 2 ? x : std.math.nextPow2(x - 1);
}

uint kroundup32(uint x)
{
    x -= 1;
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    return x + 1;
}

void main()
{
    static void benchmark(alias Func)()
    {
        import std.datetime.stopwatch;
        import std.stdio;

        ulong sum;
        auto sw = StopWatch(AutoStart.yes);
        for(uint i; i < MAXITER; i++)
            sum += Func(i);
        sw.stop();

        writeln(__traits(identifier, Func), ":\t", sw.peek.total!"msecs", "ms\t", sum);
    }

    benchmark!roundup32();
    benchmark!bitop_bsr();
    benchmark!nextPow2();
    benchmark!kroundup32();
}
```

My results with an Intel Ivy Bridge CPU (`ldc2 -O -run asm.d`):

roundup32:   254ms   48038395756849835
bitop_bsr:   386ms   48038395756849835
nextPow2:    381ms   48038395756849835
kroundup32:  326ms   48038395756849835

Observation 1: the inline asm version saving the 2 XORs is indeed faster, but by about 50% and not more than twice as fast. This was surprising to me; checking https://gmplib.org/~tege/x86-timing.pdf, BSR should take a surprisingly low 3 cycles vs. 5 cycles with the 2 additional dependent XORs on my Ivy Bridge.

Observation 2: std.math.nextPow2() is fine (equivalent to bitop_bsr).

Observation 3: kroundup32 is faster than nextPow2/bitop_bsr. With `-mcpu=native`, the first 3 timings are unchanged in my case, but the kroundup32 runtime shrinks to ~140ms. By then, it was clear that auto-vectorization must be interfering, and a look at the (inlined) asm confirmed it. Adding `-disable-loop-vectorization` to the cmdline, the kroundup32 time is exactly the same as nextPow2/bitop_bsr, ~380ms, independent from `-mcpu=native`.

The x86 timings PDF also shows that nextPow2 should be faster than the inline asm version on an AMD Zen CPU with tuned codegen, as BSR takes 4 cycles there, and LZCNT+XOR only 2 cycles. So whether inline asm really pays off in this case is highly CPU/codegen specific. It kills auto-vectorization for sure though, while nextPow2 may be vectorizable for some non-x86 targets. If the real-world loop body is vectorizable, then kroundup32 may be significantly faster than the other versions...
May 08 2019
parent reply kinke <noone nowhere.com> writes:
I should probably add that the inline asm version can be 
simplified to a much more readable:

return x <= 2 ? x : 2 << __asm!uint("bsrl $1, $0", "=r,r", x - 1);

Check out 
http://llvm.org/docs/LangRef.html#inline-asm-constraint-string if 
interested in LLVM inline assembly.
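
For completeness, wrapped into a full function that one-liner would look 
roughly like this (a sketch; the roundup32 name is reused from earlier in 
the thread):

pragma(inline, true)
uint roundup32(uint x)
{
    import ldc.llvmasm : __asm;
    // "=r,r": result in any general-purpose register, input in any
    // general-purpose register; LLVM chooses them, so no manual
    // save/restore of specific registers is needed.
    return x <= 2 ? x : 2 << __asm!uint("bsrl $1, $0", "=r,r", x - 1);
}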
May 08 2019
parent reply James Blachly <james.blachly gmail.com> writes:
On 5/8/19 11:42 AM, kinke wrote:
 I should probably add that the inline asm version can be simplified to a 
 much more readable:
 
 return x <= 2 ? x : 2 << __asm!uint("bsrl $1, $0", "=r,r", x - 1);
 
 Check out http://llvm.org/docs/LangRef.html#inline-asm-constraint-string 
 if interested in LLVM inline assembly.
Kinke,

Thank you for your extensive work and detailed answers. I will definitely take your advice and move to LLVM-style inline asm.

Regarding my last unanswered (I think) question about why LDC/LLVM was not saving ECX when inlining the naked function: since the problem is not evident when using the LLVM inline asm, this appears to be a bug related to the combination of function inlining and DMD-style asm.
May 28 2019
parent kinke <noone nowhere.com> writes:
On Tuesday, 28 May 2019 at 20:45:35 UTC, James Blachly wrote:
 Thank you for your extensive work and detailed answers. I will 
 definitely take your advice and move to LLVM style inline asm.
You're welcome.
 Regarding my last unanswered(I think) question about why 
 LDC/LLVM was not saving ECX when inlining the naked function:

 since problem not evident when using the LLVM inline asm, this 
 is appears to be a bug related to combination of function 
 inlining and DMD style asm.
1. Your function wasn't naked ('essentially naked' != naked, there's a big difference; e.g., have a look at the LLVM IR).

2. Functions with DMD-style inline asm cannot be inlined (basically due to its lacking flexibility), and we normally bail out when applying `pragma(inline, true)`. You successfully circumvented that error by explicitly using `pragma(LDC_allow_inline)` (I can't think of a good use case for that pragma OTOH).
May 29 2019