www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.ldc - Switch codegen.

reply claptrap <clap trap.com> writes:
Given code like this...

```d
const int x;
while(...)
{
     switch (x)
     {
         case 0: whatever; break;
         case 1: whatever; break;
         case 2: whatever; break;
         case 3: whatever; break;
         case 4: whatever; break;
     }
}
```

It generates a jump table if there's at least 4 or 5 cases, but 
at the start of the switch it generates 6 instructions (x64), 
check the bounds, lookup / calc the destination, and the actual 
jump. Like this...

```d
         lea     r13, [rip + .LJTI0_0]
         cmp     r14d, 7
         ja      .LBB0_11
         movsxd  rax, dword ptr [r13 + 4*r12]
         add     rax, r13
         jmp     rax
```

So the point is if the switch is in a loop and x is constant, 
most of that could be done ahead, outside the loop. It could 
actually be reduced to a single indirect jump inside the loop. It 
seems there is a LLVM IR instruction that may be useful for this, 
"indirectbr", it's a sort of computed goto.

https://llvm.org/docs/LangRef.html#indirectbr-instruction

Is this at all feasible? If so how would this be done? In the 
actual IR codegen or as an optimization pass? I willing to do the 
work if it's possible.
Aug 07 2023
next sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:
 Given code like this...

 ```d
 const int x;
 while(...)
 {
     switch (x)
     {
         case 0: whatever; break;
         case 1: whatever; break;
         case 2: whatever; break;
         case 3: whatever; break;
         case 4: whatever; break;
     }
 }
 ```

 It generates a jump table if there's at least 4 or 5 cases, but 
 at the start of the switch it generates 6 instructions (x64), 
 check the bounds, lookup / calc the destination, and the actual 
 jump. Like this...

 ```d
         lea     r13, [rip + .LJTI0_0]
         cmp     r14d, 7
         ja      .LBB0_11
         movsxd  rax, dword ptr [r13 + 4*r12]
         add     rax, r13
         jmp     rax
 ```

 So the point is if the switch is in a loop and x is constant, 
 most of that could be done ahead, outside the loop. It could 
 actually be reduced to a single indirect jump inside the loop. 
 It seems there is a LLVM IR instruction that may be useful for 
 this, "indirectbr", it's a sort of computed goto.

 https://llvm.org/docs/LangRef.html#indirectbr-instruction

 Is this at all feasible? If so how would this be done? In the 
 actual IR codegen or as an optimization pass? I willing to do 
 the work if it's possible.
As I've heard good feedback about that alternative to LLVM `switch` and also because I'm also interested to implement that in another compiler, I've made a comparison of both way here : https://godbolt.org/z/G41P88EY7 (D code at the end as comments) It seems that this would only work well for `final switch`es (no default block) over enum members. The main problem is that you need to build a constant array made of `blockaddress`, meaning that 1. if the first member value is not 0 then you'll need to apply an offset for the index used to select the right `blockaddress`. (Source 1, line 13) 2. if there are gaps between members, you'll need to use a dummy block, containing `unreachable`, to fill them. A note about the fact that I think that this would only works for enums: `final switch` over a value of an integral type is currently a lie [(see issue 6060)](https://issues.dlang.org/show_bug.cgi?id=6060), there's still a fallback with a runtime error. Otherwise, more concretly in LDC, the work would have to be done [here](https://github.com/ldc-developers/ldc/blob/cf32bac668540f3dd022493b3ba89b598a485dfb/gen/s atements.cpp#L953). At first glance you'll need to define a flag, similarly to the already existing `useSwitchInst` (used for the infamous case where `case`s are not compile-time constants), let's say `useIndirectBrInst`. To conclude, that seems faisable even if that would only be beneficial for a few specific cases.
Aug 13 2023
parent claptrap <clap trap.com> writes:
On Sunday, 13 August 2023 at 10:45:11 UTC, Basile B. wrote:
 As I've heard good feedback about that alternative to LLVM 
 `switch` and also because I'm also interested to implement that 
 in another compiler, I've made a comparison of both way here : 
 https://godbolt.org/z/G41P88EY7 (D code at the end as comments)

 It seems that this would only work well for `final switch`es 
 (no default block) over enum members. The main problem is that 
 you need to build a constant array made of `blockaddress`, 
 meaning that

 1. if the first member value is not 0 then you'll need to apply 
 an offset for the index used to select the right 
 `blockaddress`. (Source 1, line 13)
 2. if there are gaps between members, you'll need to use a 
 dummy block, containing `unreachable`, to fill them.

 A note about the fact that I think that this would only works 
 for enums: `final switch` over a value of an integral type is 
 currently a lie [(see issue 
 6060)](https://issues.dlang.org/show_bug.cgi?id=6060), there's 
 still a fallback with a runtime error.

 Otherwise, more concretly in LDC, the work would have to be 
 done 
 [here](https://github.com/ldc-developers/ldc/blob/cf32bac668540f3dd022493b3ba89b598a485dfb/gen/s
atements.cpp#L953). At first glance you'll need to define a flag, similarly to
the already existing `useSwitchInst` (used for the infamous case where `case`s
are not compile-time constants), let's say `useIndirectBrInst`.

 To conclude, that seems faisable even if that would only be 
 beneficial for a few specific cases.
From what I can tell even "final switch" generates a bounds check, it does a compare and then a ja to the topmost case, i guess it's "better to at least jump to known code even if its the wrong code than it is to jump into the unknown. Im not sure if this is the IR generated by LDC, or what LLVM does. Anyway I'm going to download the source and play around with it see what I can figure out. Thanks.
Aug 13 2023
prev sibling next sibling parent reply Johan <j j.nl> writes:
I'm quite certain that if `x` is constant, the bounds check and 
table lookup will be hoisted out of the loop, indeed leading to 
just an indirect jump inside the loop.
Of course, this requires running the optimizer (-Ox, x>0).

-Johan


On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:
 Given code like this...

 ```d
 const int x;
 while(...)
 {
     switch (x)
     {
         case 0: whatever; break;
         case 1: whatever; break;
         case 2: whatever; break;
         case 3: whatever; break;
         case 4: whatever; break;
     }
 }
 ```

 It generates a jump table if there's at least 4 or 5 cases, but 
 at the start of the switch it generates 6 instructions (x64), 
 check the bounds, lookup / calc the destination, and the actual 
 jump. Like this...

 ```d
         lea     r13, [rip + .LJTI0_0]
         cmp     r14d, 7
         ja      .LBB0_11
         movsxd  rax, dword ptr [r13 + 4*r12]
         add     rax, r13
         jmp     rax
 ```

 So the point is if the switch is in a loop and x is constant, 
 most of that could be done ahead, outside the loop. It could 
 actually be reduced to a single indirect jump inside the loop. 
 It seems there is a LLVM IR instruction that may be useful for 
 this, "indirectbr", it's a sort of computed goto.

 https://llvm.org/docs/LangRef.html#indirectbr-instruction

 Is this at all feasible? If so how would this be done? In the 
 actual IR codegen or as an optimization pass? I willing to do 
 the work if it's possible.
Aug 13 2023
parent reply claptrap <clap trap.com> writes:
On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:
 I'm quite certain that if `x` is constant, the bounds check and 
 table lookup will be hoisted out of the loop, indeed leading to 
 just an indirect jump inside the loop.
 Of course, this requires running the optimizer (-Ox, x>0).
I've tried as many variations as I can figure and it always generates this... ```D cmp r14d, 3 ja .LBB0_9 movsxd rax, dword ptr [r12 + 4*r15] add rax, r12 jmp rax ``` https://d.godbolt.org/z/jzoevxoG4 That's with O2, with O3 I cant make head nor tail of the code it generates, it's some pretty insane optimization. With my own code I use "asm int 3" to breakpoint in the function, so I can look at the disassembly even with O3 optimization, using visual studio, and it still generates those 5 instructions.
Aug 13 2023
parent reply Johan <j j.nl> writes:
On Sunday, 13 August 2023 at 18:57:42 UTC, claptrap wrote:
 On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:
 I'm quite certain that if `x` is constant, the bounds check 
 and table lookup will be hoisted out of the loop, indeed 
 leading to just an indirect jump inside the loop.
 Of course, this requires running the optimizer (-Ox, x>0).
I've tried as many variations as I can figure and it always generates this... ```D cmp r14d, 3 ja .LBB0_9 movsxd rax, dword ptr [r12 + 4*r15] add rax, r12 jmp rax ``` https://d.godbolt.org/z/jzoevxoG4 That's with O2, with O3 I cant make head nor tail of the code it generates, it's some pretty insane optimization.
It's easier to look at LLVM IR in this case. https://d.godbolt.org/z/b6b3TTeMW 1) By providing the implementation of `bar`, you are giving the optimizer extra information. It may not inline `bar`, but it _will_ make use of what it knows about the result of bar. That's why at -O3, there is insane optimization going on. Simply remove the implementation of `bar` for more insight (make it a function declaration, not a definition). 2) Indeed at -O2, the optimizer deems it non-profitable or does not see that it can hoist the load out of the for loop. At -O3 it does hoist it out, like you wanted. cheers, Johan
Aug 13 2023
parent reply claptrap <clap trap.com> writes:
On Sunday, 13 August 2023 at 20:52:18 UTC, Johan wrote:
 On Sunday, 13 August 2023 at 18:57:42 UTC, claptrap wrote:
 On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:
 I'm quite certain that if `x` is constant, the bounds check 
 and table lookup will be hoisted out of the loop, indeed 
 leading to just an indirect jump inside the loop.
 Of course, this requires running the optimizer (-Ox, x>0).
I've tried as many variations as I can figure and it always generates this... ```D cmp r14d, 3 ja .LBB0_9 movsxd rax, dword ptr [r12 + 4*r15] add rax, r12 jmp rax ``` https://d.godbolt.org/z/jzoevxoG4 That's with O2, with O3 I cant make head nor tail of the code it generates, it's some pretty insane optimization.
It's easier to look at LLVM IR in this case. https://d.godbolt.org/z/b6b3TTeMW 1) By providing the implementation of `bar`, you are giving the optimizer extra information. It may not inline `bar`, but it _will_ make use of what it knows about the result of bar. That's why at -O3, there is insane optimization going on. Simply remove the implementation of `bar` for more insight (make it a function declaration, not a definition). 2) Indeed at -O2, the optimizer deems it non-profitable or does not see that it can hoist the load out of the for loop. At -O3 it does hoist it out, like you wanted. cheers, Johan
It hasnt hoisted the load out, it's transformed it from a loop containing a switch, to a switch containing 5 loops. The switch statement is never branched back to. That does work for my case, probably the loop is too large, even with O3, the switch statement is still... 00007FF7C703075F movsxd rdi,dword ptr [rdx+rsi*4] 00007FF7C7030763 add rdi,rdx 00007FF7C7030766 jmp rdi And it still checks the switch index inside the loop, it has moved it a bit earlier in the code. It seems it can tell that some of the code before the switch has no affect if the switch is bypassed.
Aug 13 2023
parent claptrap <clap trap.com> writes:
On Sunday, 13 August 2023 at 22:55:12 UTC, claptrap wrote:
 On Sunday, 13 August 2023 at 20:52:18 UTC, Johan wrote:
See.. https://d.godbolt.org/z/E3KaoP7o9 if you have enough extra stuff in the loop it goes back to a loop containing a switch, and 5 instructions for the switch.
Aug 13 2023
prev sibling parent claptrap <clap trap.com> writes:
On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:

I've given up looking into this for a number of reasons. Its 
taking me a long time to understand the code and it seems 
IndirectBr is different enough from switch that there's a lot of 
extra work that would need to be done. The case destinations need 
to be basic blocks and I cant figure how to get that from the 
switch cases. Plus it looks like you needs a way to build a 
lookup table with the addresses of the case blocks in. You then 
look up the target in that and pass it to IndirectBr. A lot of 
stuff I couldn't figure out.

And I was looking at this in the switch statement IR gen, so it 
still wouldn't help if the goal was to pull the address 
calculation & bounds check out of an enclosing loop.

It seems like its an optimization that should be done by the 
backend anyway. I mean the LLVM IR is SSA, so the address calc & 
check should be safe to move to where the condition variable is 
defined? If you switch(X), you can move the address lookup to 
where X is defined? Maybe?
Sep 01 2023