digitalmars.D.ldc - Switch codegen.

claptrap (36/36) Aug 07 2023 Given code like this...

Basile B. (23/59) Aug 13 2023 As I've heard good feedback about that alternative to LLVM

claptrap (9/32) Aug 13 2023 From what I can tell even "final switch" generates a bounds

Johan (6/42) Aug 13 2023 I'm quite certain that if `x` is constant, the bounds check and

claptrap (16/20) Aug 13 2023 I've tried as many variations as I can figure and it always

Johan (14/31) Aug 13 2023 It's easier to look at LLVM IR in this case.

claptrap (13/48) Aug 13 2023 It hasnt hoisted the load out, it's transformed it from a loop

claptrap (5/6) Aug 13 2023 See..

claptrap (18/18) Sep 01 2023 On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:

claptrap <clap trap.com> writes:

Given code like this...

```d
const int x;
while(...)
{
     switch (x)
     {
         case 0: whatever; break;
         case 1: whatever; break;
         case 2: whatever; break;
         case 3: whatever; break;
         case 4: whatever; break;
     }
}
```

It generates a jump table if there's at least 4 or 5 cases, but 
at the start of the switch it generates 6 instructions (x64), 
check the bounds, lookup / calc the destination, and the actual 
jump. Like this...

```d
         lea     r13, [rip + .LJTI0_0]
         cmp     r14d, 7
         ja      .LBB0_11
         movsxd  rax, dword ptr [r13 + 4*r12]
         add     rax, r13
         jmp     rax
```

So the point is if the switch is in a loop and x is constant, 
most of that could be done ahead, outside the loop. It could 
actually be reduced to a single indirect jump inside the loop. It 
seems there is a LLVM IR instruction that may be useful for this, 
"indirectbr", it's a sort of computed goto.

https://llvm.org/docs/LangRef.html#indirectbr-instruction

Is this at all feasible? If so how would this be done? In the 
actual IR codegen or as an optimization pass? I willing to do the 
work if it's possible.

Aug 07 2023

Basile B. <b2.temp gmx.com> writes:

On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:
 Given code like this...

 ```d
 const int x;
 while(...)
 {
     switch (x)
     {
         case 0: whatever; break;
         case 1: whatever; break;
         case 2: whatever; break;
         case 3: whatever; break;
         case 4: whatever; break;
     }
 }
 ```

 It generates a jump table if there's at least 4 or 5 cases, but 
 at the start of the switch it generates 6 instructions (x64), 
 check the bounds, lookup / calc the destination, and the actual 
 jump. Like this...

 ```d
         lea     r13, [rip + .LJTI0_0]
         cmp     r14d, 7
         ja      .LBB0_11
         movsxd  rax, dword ptr [r13 + 4*r12]
         add     rax, r13
         jmp     rax
 ```

 So the point is if the switch is in a loop and x is constant, 
 most of that could be done ahead, outside the loop. It could 
 actually be reduced to a single indirect jump inside the loop. 
 It seems there is a LLVM IR instruction that may be useful for 
 this, "indirectbr", it's a sort of computed goto.

 https://llvm.org/docs/LangRef.html#indirectbr-instruction

 Is this at all feasible? If so how would this be done? In the 
 actual IR codegen or as an optimization pass? I willing to do 
 the work if it's possible.

As I've heard good feedback about that alternative to LLVM 
`switch` and also because I'm also interested to implement that 
in another compiler, I've made a comparison of both way here : 
https://godbolt.org/z/G41P88EY7 (D code at the end as comments)

It seems that this would only work well for `final switch`es (no 
default block) over enum members. The main problem is that you 
need to build a constant array made of `blockaddress`, meaning 
that

1. if the first member value is not 0 then you'll need to apply 
an offset for the index used to select the right `blockaddress`. 
(Source 1, line 13)
2. if there are gaps between members, you'll need to use a dummy 
block, containing `unreachable`, to fill them.

A note about the fact that I think that this would only works for 
enums: `final switch` over a value of an integral type is 
currently a lie [(see issue 
6060)](https://issues.dlang.org/show_bug.cgi?id=6060), there's 
still a fallback with a runtime error.

Otherwise, more concretly in LDC, the work would have to be done 
[here](https://github.com/ldc-developers/ldc/blob/cf32bac668540f3dd022493b3ba89b598a485dfb/gen/s
atements.cpp#L953). At first glance you'll need to define a flag, similarly to
the already existing `useSwitchInst` (used for the infamous case where `case`s
are not compile-time constants), let's say `useIndirectBrInst`.

To conclude, that seems faisable even if that would only be 
beneficial for a few specific cases.

Aug 13 2023

claptrap <clap trap.com> writes:

On Sunday, 13 August 2023 at 10:45:11 UTC, Basile B. wrote:
As I've heard good feedback about that alternative to LLVM
`switch` and also because I'm also interested to implement that
in another compiler, I've made a comparison of both way here :
https://godbolt.org/z/G41P88EY7 (D code at the end as comments)

It seems that this would only work well for `final switch`es
(no default block) over enum members. The main problem is that
you need to build a constant array made of `blockaddress`,
meaning that

1. if the first member value is not 0 then you'll need to apply
an offset for the index used to select the right
`blockaddress`. (Source 1, line 13)
2. if there are gaps between members, you'll need to use a
dummy block, containing `unreachable`, to fill them.

A note about the fact that I think that this would only works
for enums: `final switch` over a value of an integral type is
currently a lie [(see issue
6060)](https://issues.dlang.org/show_bug.cgi?id=6060), there's
still a fallback with a runtime error.

Otherwise, more concretly in LDC, the work would have to be
done
[here](https://github.com/ldc-developers/ldc/blob/cf32bac668540f3dd022493b3ba89b598a485dfb/gen/s
atements.cpp#L953). At first glance you'll need to define a flag, similarly to
the already existing `useSwitchInst` (used for the infamous case where `case`s
are not compile-time constants), let's say `useIndirectBrInst`.

To conclude, that seems faisable even if that would only be
beneficial for a few specific cases.

From what I can tell even "final switch" generates a bounds
check, it does a compare and then a ja to the topmost case, i
guess it's "better to at least jump to known code even if its the
wrong code than it is to jump into the unknown. Im not sure if
this is the IR generated by LDC, or what LLVM does.

Anyway I'm going to download the source and play around with it
see what I can figure out.

Thanks.

Aug 13 2023

Johan <j j.nl> writes:

I'm quite certain that if `x` is constant, the bounds check and 
table lookup will be hoisted out of the loop, indeed leading to 
just an indirect jump inside the loop.
Of course, this requires running the optimizer (-Ox, x>0).

-Johan


On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:
 Given code like this...

 ```d
 const int x;
 while(...)
 {
     switch (x)
     {
         case 0: whatever; break;
         case 1: whatever; break;
         case 2: whatever; break;
         case 3: whatever; break;
         case 4: whatever; break;
     }
 }
 ```

 It generates a jump table if there's at least 4 or 5 cases, but 
 at the start of the switch it generates 6 instructions (x64), 
 check the bounds, lookup / calc the destination, and the actual 
 jump. Like this...

 ```d
         lea     r13, [rip + .LJTI0_0]
         cmp     r14d, 7
         ja      .LBB0_11
         movsxd  rax, dword ptr [r13 + 4*r12]
         add     rax, r13
         jmp     rax
 ```

 So the point is if the switch is in a loop and x is constant, 
 most of that could be done ahead, outside the loop. It could 
 actually be reduced to a single indirect jump inside the loop. 
 It seems there is a LLVM IR instruction that may be useful for 
 this, "indirectbr", it's a sort of computed goto.

 https://llvm.org/docs/LangRef.html#indirectbr-instruction

 Is this at all feasible? If so how would this be done? In the 
 actual IR codegen or as an optimization pass? I willing to do 
 the work if it's possible.

Aug 13 2023

claptrap <clap trap.com> writes:

On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:
 I'm quite certain that if `x` is constant, the bounds check and 
 table lookup will be hoisted out of the loop, indeed leading to 
 just an indirect jump inside the loop.
 Of course, this requires running the optimizer (-Ox, x>0).

I've tried as many variations as I can figure and it always 
generates this...

```D
     cmp     r14d, 3
     ja      .LBB0_9
     movsxd  rax, dword ptr [r12 + 4*r15]
     add     rax, r12
     jmp     rax
```

https://d.godbolt.org/z/jzoevxoG4

That's with O2, with O3 I cant make head nor tail of the code it 
generates, it's some pretty insane optimization.

With my own code I use "asm int 3" to breakpoint in the function, 
so I can look at the disassembly even with O3 optimization, using 
visual studio, and it still generates those 5 instructions.

Aug 13 2023

Johan <j j.nl> writes:

On Sunday, 13 August 2023 at 18:57:42 UTC, claptrap wrote:
 On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:
 I'm quite certain that if `x` is constant, the bounds check 
 and table lookup will be hoisted out of the loop, indeed 
 leading to just an indirect jump inside the loop.
 Of course, this requires running the optimizer (-Ox, x>0).

 I've tried as many variations as I can figure and it always 
 generates this...

 ```D
     cmp     r14d, 3
     ja      .LBB0_9
     movsxd  rax, dword ptr [r12 + 4*r15]
     add     rax, r12
     jmp     rax
 ```

 https://d.godbolt.org/z/jzoevxoG4

 That's with O2, with O3 I cant make head nor tail of the code 
 it generates, it's some pretty insane optimization.

It's easier to look at LLVM IR in this case.
https://d.godbolt.org/z/b6b3TTeMW

1)  By providing the implementation of `bar`, you are giving the 
optimizer extra information. It may not inline `bar`, but it 
_will_ make use of what it knows about the result of bar. That's 
why at -O3, there is insane optimization going on. Simply remove 
the implementation of `bar` for more insight (make it a function 
declaration, not a definition).

2) Indeed at -O2, the optimizer deems it non-profitable or does 
not see that it can hoist the load out of the for loop.  At -O3 
it does hoist it out, like you wanted.

cheers,
   Johan

Aug 13 2023

claptrap <clap trap.com> writes:

On Sunday, 13 August 2023 at 20:52:18 UTC, Johan wrote:
 On Sunday, 13 August 2023 at 18:57:42 UTC, claptrap wrote:
 On Sunday, 13 August 2023 at 13:13:30 UTC, Johan wrote:
 I'm quite certain that if `x` is constant, the bounds check 
 and table lookup will be hoisted out of the loop, indeed 
 leading to just an indirect jump inside the loop.
 Of course, this requires running the optimizer (-Ox, x>0).

 I've tried as many variations as I can figure and it always 
 generates this...

 ```D
     cmp     r14d, 3
     ja      .LBB0_9
     movsxd  rax, dword ptr [r12 + 4*r15]
     add     rax, r12
     jmp     rax
 ```

 https://d.godbolt.org/z/jzoevxoG4

 That's with O2, with O3 I cant make head nor tail of the code 
 it generates, it's some pretty insane optimization.

 It's easier to look at LLVM IR in this case.
 https://d.godbolt.org/z/b6b3TTeMW

 1)  By providing the implementation of `bar`, you are giving 
 the optimizer extra information. It may not inline `bar`, but 
 it _will_ make use of what it knows about the result of bar. 
 That's why at -O3, there is insane optimization going on. 
 Simply remove the implementation of `bar` for more insight 
 (make it a function declaration, not a definition).

 2) Indeed at -O2, the optimizer deems it non-profitable or does 
 not see that it can hoist the load out of the for loop.  At -O3 
 it does hoist it out, like you wanted.

 cheers,
   Johan

It hasnt hoisted the load out, it's transformed it from a loop 
containing a switch, to a switch containing 5 loops. The switch 
statement is never branched back to.

That does work for my case, probably the loop is too large, even 
with O3, the switch statement is still...

00007FF7C703075F  movsxd      rdi,dword ptr [rdx+rsi*4]
00007FF7C7030763  add         rdi,rdx
00007FF7C7030766  jmp         rdi

And it still checks the switch index inside the loop, it has 
moved it a bit earlier in the code. It seems it can tell that 
some of the code before the switch has no affect if the switch is 
bypassed.

Aug 13 2023

claptrap <clap trap.com> writes:

On Sunday, 13 August 2023 at 22:55:12 UTC, claptrap wrote:
 On Sunday, 13 August 2023 at 20:52:18 UTC, Johan wrote:

See..

https://d.godbolt.org/z/E3KaoP7o9

if you have enough extra stuff in the loop it goes back to a loop 
containing a switch, and 5 instructions for the switch.

Aug 13 2023

claptrap <clap trap.com> writes:

On Monday, 7 August 2023 at 20:22:06 UTC, claptrap wrote:

I've given up looking into this for a number of reasons. Its 
taking me a long time to understand the code and it seems 
IndirectBr is different enough from switch that there's a lot of 
extra work that would need to be done. The case destinations need 
to be basic blocks and I cant figure how to get that from the 
switch cases. Plus it looks like you needs a way to build a 
lookup table with the addresses of the case blocks in. You then 
look up the target in that and pass it to IndirectBr. A lot of 
stuff I couldn't figure out.

And I was looking at this in the switch statement IR gen, so it 
still wouldn't help if the goal was to pull the address 
calculation & bounds check out of an enclosing loop.

It seems like its an optimization that should be done by the 
backend anyway. I mean the LLVM IR is SSA, so the address calc & 
check should be safe to move to where the condition variable is 
defined? If you switch(X), you can move the address lookup to 
where X is defined? Maybe?

Sep 01 2023

D Programming

C/C++ Programming

Other

digitalmars.D.ldc - Switch codegen.