
digitalmars.D.ldc - inline asm in inlined function / ECX clobbered / stack frame / naked

reply James Blachly <james.blachly gmail.com> writes:
I am posting here instead of learn because it is my understanding that 
DMD will not inline a fn that has inline asm, and because one of my 
questions relates to why LLVM is unsafely(?) using ECX.

Problem Statement: I am writing a faster round-up-to-nearest-power-of-2 
function. An inline asm [1] version using the BSR instruction is about twice 
as fast as the classic bit-twiddling sequence (--(x), (x)|=(x)>>1, (x)|=(x)>>2, 
(x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x)). [2]
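
For reference, that bit-twiddling sequence as a plain D function (named 
kroundup32 here to match the benchmarks later in this thread):

uint kroundup32(uint x)
{
    // Round up to the next power of 2 by smearing the highest set bit
    // of x-1 rightward, then adding 1.
    x -= 1;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x + 1;
}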

I know about core.bitop.bsr and std.math.nextPow2, which uses it. My asm 
code block is 2.5x faster than the codegen for (2 << bsr(x)), which 
surprises me...
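
Concretely, the bsr-based comparison function is essentially this (it 
reappears below as bitop_bsr):

uint bitop_bsr(uint x)
{
    import core.bitop : bsr;
    // bsr(x - 1) is the index of the highest set bit of x - 1;
    // shifting 2 left by that index rounds x up to the next power of 2.
    return x <= 2 ? x : 2 << bsr(x - 1);
}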

Summary: I've inlined a function consisting mostly of inline assembly. 
In the initial approach ECX is clobbered, but LDC/LLVM does not recognize 
this and still uses it as the caller's loop counter. In approach 2, I push 
RCX onto the stack; LDC/LLVM then sees me use RCX(?) and uses EDX instead, 
but now the inline asm reference to the stack frame variable (x[EBP] or 
[EBP + x] -- side note, should that be RBP?) is off by 8 bytes. I can 
rescue that by manually writing [EBP + x + 8], which seems hinky and wrong. 
By moving the PUSH until after EAX is loaded from the stack, all is in order.

Questions at bottom.

The results below were produced with ldc2 -release -O2 (version 1.15.0).

Link to compiler explorer: https://godbolt.org/z/uLaIS5

Approach 1:
uint roundup32(uint x)
{
    pragma(inline, true);
    version(LDC) pragma(LDC_allow_inline);

    if (x <= 2) return x;
    asm
    {
        mov EAX, x[EBP];
        sub EAX, 1;
        bsr ECX, EAX;   // ecx = y = msb(x-1)
        mov EAX, 2;
        shl EAX, CL;    // return (2 << y)
    }    // returns EAX
}

Result 1:
This works well _WHEN FN NOT INLINED_.

When inlined, my code clobbers ECX, which the calling function was using 
to track its loop iteration.

.LBB1_1:
         mov     dword ptr [rsp + 8], ecx	; x param == loop iter
         mov     eax, dword ptr [rsp + 8]	; x param
         sub     eax, 1
         bsr     ecx, eax
         mov     eax, 2
         shl     eax, cl			; inline asm block ends here
         mov     eax, eax		; alignment?
         add     rbx, rax		; rbx == accumulator
         add     ecx, 1		; next round for loop, but ecx was chgd

         jne     .LBB1_1

So, perhaps I can just save ECX.

Approach 2:
Same, but PUSH RCX / POP RCX at beginning and end of the asm block.
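
In code, the asm block from approach 1 becomes (a sketch; only the 
PUSH/POP lines are new):

asm
{
    push RCX;           // save the caller's RCX
    mov EAX, x[EBP];
    sub EAX, 1;
    bsr ECX, EAX;       // ecx = y = msb(x-1)
    mov EAX, 2;
    shl EAX, CL;        // return (2 << y)
    pop RCX;            // restore the caller's RCX
}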

Result 2:
The compiler detects that I have used RCX and uses EDX as the loop counter 
instead. New problem: since I have pushed onto the stack, the param x is 
now at offset rsp + 16, but the generated code still believes it is at rsp + 8:

.LBB1_1:
         mov     dword ptr [rsp + 8], edx	; x param == loop iter
         push    rcx				; approach 2
         mov     eax, dword ptr [rsp + 8]	; looking for x here :-(
         sub     eax, 1
         bsr     ecx, eax
         mov     eax, 2
         shl     eax, cl
         pop     rcx			; inline asm block ends here
         mov     eax, eax		; alignment?
         add     rbx, rax		; rbx == accumulator
         add     edx, 1		; next round for loop

         jne     .LBB1_1

So now because we are looking in the wrong spot on the stack, the 
function has the wrong value in eax.

Approach 3:
Cheat, because I know RCX was pushed, and change the local reference to 
x from [RBP + x] to [RBP + x + 8].

Result 3:
This works, but feels wrong.

Final, successful approach:
I can just move the PUSH RCX until after EAX has been loaded from [rsp + 8], 
and not worry about the altered stack frame.
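
For reference, the resulting asm block (a sketch; it matches the #APP 
region in the assembly dump further down):

asm
{
    mov EAX, x[EBP];    // load x while the stack frame is still un-shifted
    sub EAX, 1;
    push RCX;           // save the caller's RCX only after x has been read
    bsr ECX, EAX;       // ecx = y = msb(x-1)
    mov EAX, 2;
    shl EAX, CL;        // return (2 << y)
    pop RCX;            // restore RCX
}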



Questions:
0. Should I quit using DMD style assembly and use LLVM style?

1. gcc inline asm has the capability to specify in/out values and what's 
clobbered (see the sketch after these questions for the kind of annotation 
I mean). I assume this is a deficiency of DMD-style inline asm?

2. LDC/LLVM can inline this function, which eliminates the prologue/epilogue, 
but it still has a local stack frame holding the "passed" value x. 
What is going on? This must be typical behavior when a function is inlined, 
yes?

3. Even though it is essentially a naked function, I cannot add `naked;`, 
because then I can only access variables by name from the global scope. 
Why, given that when it is inlined I can still access the "passed" x?
3b. Is an inlined function de facto "naked", meaning there is no need for 
this keyword?

4a. How can I notify LDC/LLVM that ECX is being used? OR: why does it not 
notice in approach 1, but does appear to in approach 2?
4b. Cdecl says ECX is a caller-saved register [3]; does this go out the 
window when the function is inlined? In that case, how can I be sure it is 
safe to use EAX, ECX, or EDX in an inline asm block in a function that 
will be inlined?
(Interestingly, when the function is not inlined, EBX is used to track the 
loop iteration, since the compiler knows ECX could be clobbered.)

5. Is the final solution of moving the PUSH RCX until after EAX is loaded 
from the stack the correct approach?
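
(For context on questions 1 and 4a: this is the kind of clobber annotation 
I am after. A minimal sketch using LDC's LLVM-style inline asm -- the 
helper name shift2left is made up; the "~{ecx}" element of the constraint 
string is what declares ECX as clobbered:)

uint shift2left(uint amount)
{
    import ldc.llvmasm : __asm;
    // "={eax}" : result is returned in EAX
    // "{eax}"  : input is passed in EAX
    // "~{ecx}" : ECX is clobbered by this block
    return __asm!uint(
        `movl %eax, %ecx
         movl $$2, %eax
         shll %cl, %eax`,
        "={eax},{eax},~{ecx}", amount);
}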


This was quite lengthy; as a novice, **I am very grateful for your expert 
help.**

James


References:
[1] DMD style assembly: https://dlang.org/spec/iasm.html

[2] You can read more about round up optimization here: 
http://locklessinc.com/articles/next_pow2/ (although note that in this 
very old article, the bitshift trick was the fastest whereas on modern 
processors the BSR instruction method is faster)

[3] https://www.agner.org/optimize/calling_conventions.pdf
May 05 2019
parent reply kinke <noone nowhere.com> writes:
On Monday, 6 May 2019 at 03:09:38 UTC, James Blachly wrote:
 I know about core.bitop.bsr and std.math.nextPow2 which uses 
 it. My asm code block is 2.5x faster than codegen for (2 << 
 bsr(x)) which surprises me...
Sorry, but I'll just focus on that, and not on the asm questions. The reason is simple: I discourage anyone from going down to asm level if it can be avoided.

So, I have:

pragma(inline, true)
uint roundup32(uint x)
{
    import core.bitop;
    //if (x <= 2) return x;
    return 2u << bsr(x - 1);
}

`ldc2 -mtriple=x86_64-linux-gnu -O -output-s foo.d` (AT&T syntax...):

_D3foo9roundup32FkZk:
        addl    $-1, %edi
        bsrl    %edi, %ecx
        xorl    $31, %ecx
        xorb    $31, %cl
        movl    $2, %eax
        shll    %cl, %eax
        retq

I can't believe that's 2.5x slower than your almost identical asm block. And that code is portable, not just OS- and ABI-independent, but also architecture-wise. 1000x better than inline asm IMO.
May 06 2019
parent reply kinke <noone nowhere.com> writes:
Adding `-mcpu=haswell` to the LDC command-line, the generated asm 
is:

	addl	$-1, %edi
	lzcntl	%edi, %eax
	xorb	$31, %al
	movl	$2, %ecx
	shlxl	%eax, %ecx, %eax

[Add `-mcpu=native` to tune code-gen for *your* CPU.]
May 06 2019
parent reply James Blachly <james.blachly gmail.com> writes:
On 5/6/19 3:55 PM, kinke wrote:
 Adding `-mcpu=haswell` to the LDC command-line, the generated asm is:
 
      addl    $-1, %edi
      lzcntl    %edi, %eax
      xorb    $31, %al
      movl    $2, %ecx
      shlxl    %eax, %ecx, %eax
 
 [Add `-mcpu=native` to tune code-gen for *your* CPU.]
Thanks kinke. Agree, I am all for portability. This is my first stab at asm. The speed is not really the point; this is really me asking for help understanding compiler internals and fundamentals of inline assembly.

Nevertheless, running each of the 3 algorithms (asm, the famous bitwise-OR trick for this problem, and core.bitop.bsr) 2^28 times, I get the following results:

assembly version:
Elapsed msec: 564
Sum: 48038395756849835
Stopwatch is reset: 0

kroundup32:
Elapsed msec: 776
Sum: 48038395756849835
Stopwatch is reset: 0

bitop_bsr:
Elapsed msec: 1299
Sum: 48038395756849835

-mcpu=native:
The performance of bitop_bsr is markedly improved when using -mcpu=native on linux (but only marginally better on MacOS), which is really surprising(???), but it is still slower than the hand asm: about 2.25x the runtime, down from 2.5x on ivybridge (Mac) and 2.9x on sandybridge (linux). Maybe because I do not have the latest and greatest processors? Assembly at bottom. The two xor instructions after bsr are unnecessary(?) and perhaps a contributor to slowdown.

Anyway, mostly I wanted help understanding register safety and proper use of inline asm.

Of note, looking at the .s output, on MacOS the inline asm {} block is #NO_APP. Since this is compiler-emitted I would have expected it to be the same across platforms -- or is this determined by LLVM version?

James

Assembly with -O2 -release -mcpu=native. Note that these are the non-inlined versions. Also not sure why roundup32(uint x) has so much more prologue than bitop_bsr(uint x).

_D9asminline9bitop_bsrFkZk:
        .cfi_startproc
        cmpl    $2, %edi
        ja      .LBB0_2
        movl    %edi, %eax
        retq
.LBB0_2:
        addl    $-1, %edi
        bsrl    %edi, %ecx
        xorl    $31, %ecx
        xorb    $31, %cl
        movl    $2, %eax
        shll    %cl, %eax
        retq

---

_D9asminline9roundup32FkZk:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        movl    %edi, -4(%rbp)
        cmpl    $2, %edi
        ja      .LBB1_2
        movl    %edi, %eax
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
.LBB1_2:
        .cfi_def_cfa %rbp, 16
        #APP
        movl    -4(%rbp), %eax
        subl    $1, %eax
        pushq   %rcx
        bsrl    %eax, %ecx
        movl    $2, %eax
        shll    %cl, %eax
        popq    %rcx
        #NO_APP
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
May 06 2019
parent reply kinke <noone nowhere.com> writes:
On Tuesday, 7 May 2019 at 01:20:43 UTC, James Blachly wrote:
 The performance of bitop_bsr is markedly improved when using 
 -mcpu=native on linux (but only marginally better on MacOS) 
 which is really surprising(???) but still slower than the hand 
 asm, about 2.25x the runtime from 2.5x on ivybridge (Mac) and 
 2.9x on sandybridge (linux).
I still couldn't believe this, so benchmarked myself. First observation: you didn't re-start() the StopWatch after resetting it; the next stop() will then measure the elapsed time since it was originally started (!).

I modified the code a bit, switching to LLVM inline asm (highly recommended if DMD compatibility isn't that important) and adding a nextPow2-based version:

```
enum MAXITER = 1 << 28;

pragma(inline, true):

uint roundup32(uint x)
{
    if (x <= 2) return x;

    import ldc.llvmasm;
    return __asm!uint(
        `bsrl %eax, %ecx
         #xorl $$31, %ecx
         #xorb $$31, %cl
         movl $$2, %eax
         shll %cl, %eax`,
        "={eax},{eax},~{ecx}", x - 1);
}

uint bitop_bsr(uint x)
{
    import core.bitop;
    return x <= 2 ? x : 2 << bsr(x - 1);
}

uint nextPow2(uint x)
{
    import std.math;
    return x <= 2 ? x : std.math.nextPow2(x - 1);
}

uint kroundup32(uint x)
{
    x -= 1;
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    return x + 1;
}

void main()
{
    static void benchmark(alias Func)()
    {
        import std.datetime.stopwatch;
        import std.stdio;

        ulong sum;
        auto sw = StopWatch(AutoStart.yes);
        for(uint i; i < MAXITER; i++)
            sum += Func(i);
        sw.stop();

        writeln(__traits(identifier, Func), ":\t", sw.peek.total!"msecs", "ms\t", sum);
    }

    benchmark!roundup32();
    benchmark!bitop_bsr();
    benchmark!nextPow2();
    benchmark!kroundup32();
}
```

My results with an Intel Ivy Bridge CPU (`ldc2 -O -run asm.d`):

roundup32:   254ms   48038395756849835
bitop_bsr:   386ms   48038395756849835
nextPow2:    381ms   48038395756849835
kroundup32:  326ms   48038395756849835

Observation 1: the inline asm version saving the 2 XORs is indeed faster, but by about 50% and not more than twice as fast. This was surprising to me; checking https://gmplib.org/~tege/x86-timing.pdf, BSR should take a surprisingly low 3 cycles vs. 5 cycles with the 2 additional dependent XORs on my Ivy Bridge.

Observation 2: std.math.nextPow2() is fine (equivalent to bitop_bsr).

Observation 3: kroundup32 is faster than nextPow2/bitop_bsr. With `-mcpu=native`, the first 3 timings are unchanged in my case, but the kroundup32 runtime shrinks to ~140ms. By then, it was clear that auto-vectorization must be interfering, and a look at the (inlined) asm confirmed it. Adding `-disable-loop-vectorization` to the cmdline, the kroundup32 time is exactly the same as nextPow2/bitop_bsr, ~380ms, independent from `-mcpu=native`.

The x86 timings PDF also shows that nextPow2 should be faster than the inline asm version on an AMD Zen CPU with tuned codegen, as BSR takes 4 cycles there, and LZCNT+XOR only 2 cycles. So whether inline asm really pays off in this case is highly CPU/codegen specific. It kills auto-vectorization for sure though, while nextPow2 may be vectorizable for some non-x86 targets. If the real-world loop body is vectorizable, then kroundup32 may be significantly faster than the other versions...
May 08 2019
parent reply kinke <noone nowhere.com> writes:
I should probably add that the inline asm version can be 
simplified to a much more readable:

return x <= 2 ? x : 2 << __asm!uint("bsrl $1, $0", "=r,r", x - 1);

Check out 
http://llvm.org/docs/LangRef.html#inline-asm-constraint-string if 
interested in LLVM inline assembly.
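
For completeness, wrapped into a full function that one-liner would look 
roughly like this (a sketch; the roundup32 name is reused from earlier in 
the thread):

pragma(inline, true)
uint roundup32(uint x)
{
    import ldc.llvmasm : __asm;
    // "=r,r": result in any general-purpose register, input in any
    // general-purpose register; LLVM chooses them, so no manual
    // save/restore of specific registers is needed.
    return x <= 2 ? x : 2 << __asm!uint("bsrl $1, $0", "=r,r", x - 1);
}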
May 08 2019
parent reply James Blachly <james.blachly gmail.com> writes:
On 5/8/19 11:42 AM, kinke wrote:
 I should probably add that the inline asm version can be simplified to a 
 much more readable:
 
 return x <= 2 ? x : 2 << __asm!uint("bsrl $1, $0", "=r,r", x - 1);
 
 Check out http://llvm.org/docs/LangRef.html#inline-asm-constraint-string 
 if interested in LLVM inline assembly.
Kinke,

Thank you for your extensive work and detailed answers. I will definitely take your advice and move to LLVM-style inline asm.

Regarding my last unanswered (I think) question about why LDC/LLVM was not saving ECX when inlining the naked function: since the problem is not evident when using the LLVM inline asm, this appears to be a bug related to the combination of function inlining and DMD-style asm.
May 28 2019
parent kinke <noone nowhere.com> writes:
On Tuesday, 28 May 2019 at 20:45:35 UTC, James Blachly wrote:
 Thank you for your extensive work and detailed answers. I will 
 definitely take your advice and move to LLVM style inline asm.
You're welcome.
 Regarding my last unanswered(I think) question about why 
 LDC/LLVM was not saving ECX when inlining the naked function:

 since problem not evident when using the LLVM inline asm, this 
 is appears to be a bug related to combination of function 
 inlining and DMD style asm.
1. Your function wasn't naked ('essentially naked' != naked, there's a big difference; e.g., have a look at the LLVM IR).

2. Functions with DMD-style inline asm cannot be inlined (basically due to its lacking flexibility), and we normally bail out when applying `pragma(inline, true)`. You successfully circumvented that error by explicitly using `pragma(LDC_allow_inline)` (I can't think of a good use case for that pragma OTOH).
May 29 2019