digitalmars.D.ldc - Switch codegen.
- claptrap (36/36) Aug 07 2023 Given code like this...
- Basile B. (23/59) Aug 13 2023 As I've heard good feedback about that alternative to LLVM
- claptrap (9/32) Aug 13 2023 From what I can tell even "final switch" generates a bounds
- Johan (6/42) Aug 13 2023 I'm quite certain that if `x` is constant, the bounds check and
- claptrap (16/20) Aug 13 2023 I've tried as many variations as I can figure and it always
- claptrap (18/18) Sep 01 2023 On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:
Given code like this... ```d const int x; while(...) { switch (x) { case 0: whatever; break; case 1: whatever; break; case 2: whatever; break; case 3: whatever; break; case 4: whatever; break; } } ``` It generates a jump table if there's at least 4 or 5 cases, but at the start of the switch it generates 6 instructions (x64), check the bounds, lookup / calc the destination, and the actual jump. Like this... ```d lea r13, [rip + .LJTI0_0] cmp r14d, 7 ja .LBB0_11 movsxd rax, dword ptr [r13 + 4*r12] add rax, r13 jmp rax ``` So the point is if the switch is in a loop and x is constant, most of that could be done ahead, outside the loop. It could actually be reduced to a single indirect jump inside the loop. It seems there is a LLVM IR instruction that may be useful for this, "indirectbr", it's a sort of computed goto. https://llvm.org/docs/LangRef.html#indirectbr-instruction Is this at all feasible? If so how would this be done? In the actual IR codegen or as an optimization pass? I willing to do the work if it's possible.
Aug 07 2023
On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:Given code like this... ```d const int x; while(...) { switch (x) { case 0: whatever; break; case 1: whatever; break; case 2: whatever; break; case 3: whatever; break; case 4: whatever; break; } } ``` It generates a jump table if there's at least 4 or 5 cases, but at the start of the switch it generates 6 instructions (x64), check the bounds, lookup / calc the destination, and the actual jump. Like this... ```d lea r13, [rip + .LJTI0_0] cmp r14d, 7 ja .LBB0_11 movsxd rax, dword ptr [r13 + 4*r12] add rax, r13 jmp rax ``` So the point is if the switch is in a loop and x is constant, most of that could be done ahead, outside the loop. It could actually be reduced to a single indirect jump inside the loop. It seems there is a LLVM IR instruction that may be useful for this, "indirectbr", it's a sort of computed goto. https://llvm.org/docs/LangRef.html#indirectbr-instruction Is this at all feasible? If so how would this be done? In the actual IR codegen or as an optimization pass? I willing to do the work if it's possible.As I've heard good feedback about that alternative to LLVM `switch` and also because I'm also interested to implement that in another compiler, I've made a comparison of both way here : https://godbolt.org/z/G41P88EY7 (D code at the end as comments) It seems that this would only work well for `final switch`es (no default block) over enum members. The main problem is that you need to build a constant array made of `blockaddress`, meaning that 1. if the first member value is not 0 then you'll need to apply an offset for the index used to select the right `blockaddress`. (Source 1, line 13) 2. if there are gaps between members, you'll need to use a dummy block, containing `unreachable`, to fill them. A note about the fact that I think that this would only works for enums: `final switch` over a value of an integral type is currently a lie [(see issue 6060)](https://issues.dlang.org/show_bug.cgi?id=6060), there's still a fallback with a runtime error. Otherwise, more concretly in LDC, the work would have to be done [here](https://github.com/ldc-developers/ldc/blob/cf32bac668540f3dd022493b3ba89b598a485dfb/gen/s atements.cpp#L953). At first glance you'll need to define a flag, similarly to the already existing `useSwitchInst` (used for the infamous case where `case`s are not compile-time constants), let's say `useIndirectBrInst`. To conclude, that seems faisable even if that would only be beneficial for a few specific cases.
Aug 13 2023
On Sunday, 13 August 2023 at 10:45:11 UTC, Basile B. wrote:As I've heard good feedback about that alternative to LLVM `switch` and also because I'm also interested to implement that in another compiler, I've made a comparison of both way here : https://godbolt.org/z/G41P88EY7 (D code at the end as comments) It seems that this would only work well for `final switch`es (no default block) over enum members. The main problem is that you need to build a constant array made of `blockaddress`, meaning that 1. if the first member value is not 0 then you'll need to apply an offset for the index used to select the right `blockaddress`. (Source 1, line 13) 2. if there are gaps between members, you'll need to use a dummy block, containing `unreachable`, to fill them. A note about the fact that I think that this would only works for enums: `final switch` over a value of an integral type is currently a lie [(see issue 6060)](https://issues.dlang.org/show_bug.cgi?id=6060), there's still a fallback with a runtime error. Otherwise, more concretly in LDC, the work would have to be done [here](https://github.com/ldc-developers/ldc/blob/cf32bac668540f3dd022493b3ba89b598a485dfb/gen/s atements.cpp#L953). At first glance you'll need to define a flag, similarly to the already existing `useSwitchInst` (used for the infamous case where `case`s are not compile-time constants), let's say `useIndirectBrInst`. To conclude, that seems faisable even if that would only be beneficial for a few specific cases.From what I can tell even "final switch" generates a bounds check, it does a compare and then a ja to the topmost case, i guess it's "better to at least jump to known code even if its the wrong code than it is to jump into the unknown. Im not sure if this is the IR generated by LDC, or what LLVM does. Anyway I'm going to download the source and play around with it see what I can figure out. Thanks.
Aug 13 2023
I'm quite certain that if `x` is constant, the bounds check and table lookup will be hoisted out of the loop, indeed leading to just an indirect jump inside the loop. Of course, this requires running the optimizer (-Ox, x>0). -Johan On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:Given code like this... ```d const int x; while(...) { switch (x) { case 0: whatever; break; case 1: whatever; break; case 2: whatever; break; case 3: whatever; break; case 4: whatever; break; } } ``` It generates a jump table if there's at least 4 or 5 cases, but at the start of the switch it generates 6 instructions (x64), check the bounds, lookup / calc the destination, and the actual jump. Like this... ```d lea r13, [rip + .LJTI0_0] cmp r14d, 7 ja .LBB0_11 movsxd rax, dword ptr [r13 + 4*r12] add rax, r13 jmp rax ``` So the point is if the switch is in a loop and x is constant, most of that could be done ahead, outside the loop. It could actually be reduced to a single indirect jump inside the loop. It seems there is a LLVM IR instruction that may be useful for this, "indirectbr", it's a sort of computed goto. https://llvm.org/docs/LangRef.html#indirectbr-instruction Is this at all feasible? If so how would this be done? In the actual IR codegen or as an optimization pass? I willing to do the work if it's possible.
Aug 13 2023
On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:I'm quite certain that if `x` is constant, the bounds check and table lookup will be hoisted out of the loop, indeed leading to just an indirect jump inside the loop. Of course, this requires running the optimizer (-Ox, x>0).I've tried as many variations as I can figure and it always generates this... ```D cmp r14d, 3 ja .LBB0_9 movsxd rax, dword ptr [r12 + 4*r15] add rax, r12 jmp rax ``` https://d.godbolt.org/z/jzoevxoG4 That's with O2, with O3 I cant make head nor tail of the code it generates, it's some pretty insane optimization. With my own code I use "asm int 3" to breakpoint in the function, so I can look at the disassembly even with O3 optimization, using visual studio, and it still generates those 5 instructions.
Aug 13 2023
On Sunday, 13 August 2023 at 18:57:42 UTC, claptrap wrote:On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:It's easier to look at LLVM IR in this case. https://d.godbolt.org/z/b6b3TTeMW 1) By providing the implementation of `bar`, you are giving the optimizer extra information. It may not inline `bar`, but it _will_ make use of what it knows about the result of bar. That's why at -O3, there is insane optimization going on. Simply remove the implementation of `bar` for more insight (make it a function declaration, not a definition). 2) Indeed at -O2, the optimizer deems it non-profitable or does not see that it can hoist the load out of the for loop. At -O3 it does hoist it out, like you wanted. cheers, JohanI'm quite certain that if `x` is constant, the bounds check and table lookup will be hoisted out of the loop, indeed leading to just an indirect jump inside the loop. Of course, this requires running the optimizer (-Ox, x>0).I've tried as many variations as I can figure and it always generates this... ```D cmp r14d, 3 ja .LBB0_9 movsxd rax, dword ptr [r12 + 4*r15] add rax, r12 jmp rax ``` https://d.godbolt.org/z/jzoevxoG4 That's with O2, with O3 I cant make head nor tail of the code it generates, it's some pretty insane optimization.
Aug 13 2023
On Sunday, 13 August 2023 at 20:52:18 UTC, Johan wrote:On Sunday, 13 August 2023 at 18:57:42 UTC, claptrap wrote:It hasnt hoisted the load out, it's transformed it from a loop containing a switch, to a switch containing 5 loops. The switch statement is never branched back to. That does work for my case, probably the loop is too large, even with O3, the switch statement is still... 00007FF7C703075F movsxd rdi,dword ptr [rdx+rsi*4] 00007FF7C7030763 add rdi,rdx 00007FF7C7030766 jmp rdi And it still checks the switch index inside the loop, it has moved it a bit earlier in the code. It seems it can tell that some of the code before the switch has no affect if the switch is bypassed.On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:It's easier to look at LLVM IR in this case. https://d.godbolt.org/z/b6b3TTeMW 1) By providing the implementation of `bar`, you are giving the optimizer extra information. It may not inline `bar`, but it _will_ make use of what it knows about the result of bar. That's why at -O3, there is insane optimization going on. Simply remove the implementation of `bar` for more insight (make it a function declaration, not a definition). 2) Indeed at -O2, the optimizer deems it non-profitable or does not see that it can hoist the load out of the for loop. At -O3 it does hoist it out, like you wanted. cheers, JohanI'm quite certain that if `x` is constant, the bounds check and table lookup will be hoisted out of the loop, indeed leading to just an indirect jump inside the loop. Of course, this requires running the optimizer (-Ox, x>0).I've tried as many variations as I can figure and it always generates this... ```D cmp r14d, 3 ja .LBB0_9 movsxd rax, dword ptr [r12 + 4*r15] add rax, r12 jmp rax ``` https://d.godbolt.org/z/jzoevxoG4 That's with O2, with O3 I cant make head nor tail of the code it generates, it's some pretty insane optimization.
Aug 13 2023
On Sunday, 13 August 2023 at 22:55:12 UTC, claptrap wrote:On Sunday, 13 August 2023 at 20:52:18 UTC, Johan wrote:See.. https://d.godbolt.org/z/E3KaoP7o9 if you have enough extra stuff in the loop it goes back to a loop containing a switch, and 5 instructions for the switch.
Aug 13 2023
On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:I've given up looking into this for a number of reasons. Its taking me a long time to understand the code and it seems IndirectBr is different enough from switch that there's a lot of extra work that would need to be done. The case destinations need to be basic blocks and I cant figure how to get that from the switch cases. Plus it looks like you needs a way to build a lookup table with the addresses of the case blocks in. You then look up the target in that and pass it to IndirectBr. A lot of stuff I couldn't figure out. And I was looking at this in the switch statement IR gen, so it still wouldn't help if the goal was to pull the address calculation & bounds check out of an enclosing loop. It seems like its an optimization that should be done by the backend anyway. I mean the LLVM IR is SSA, so the address calc & check should be safe to move to where the condition variable is defined? If you switch(X), you can move the address lookup to where X is defined? Maybe?
Sep 01 2023