digitalmars.D.ldc - Suboptimal dynamic array operands

z <z z.com> writes:

For reference: https://run.dlang.io/is/FULu3x (select LDC, click 
ASM, then press Ctrl+F and type `main` to find the relevant code).
The compiler options are `-O -mcpu=native -enable-no-infs-fp-math 
-enable-unsafe-fp-math -enable-ipra -tailcallopt -release`.

When performing `a[] *op*= b[]` or `foreach (i, aa; a) { a[i] -= 
b[i]; }` operations, LDC generates slower code than it should.
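
For readers who want something to paste into the playground, here is a minimal sketch of the two forms being compared, with `-=` standing in for `*op*=`. The function names `subArrayOp` and `subForeach` are made up for illustration and are not taken from the linked program.

```d
import std.stdio : writeln;

// Array-operation form: element-wise a[i] -= b[i] over the whole slice.
void subArrayOp(float[] a, const(float)[] b)
{
    assert(a.length == b.length);
    a[] -= b[];
}

// Explicit loop form, indexing both slices with the same counter.
void subForeach(float[] a, const(float)[] b)
{
    assert(a.length == b.length);
    foreach (i; 0 .. a.length)
        a[i] -= b[i];
}

void main()
{
    auto a = [5.0f, 6.0f, 7.0f];
    auto b = [1.0f, 2.0f, 3.0f];
    subArrayOp(a, b);
    writeln(a); // [4, 4, 4]
    subForeach(a, b);
    writeln(a); // [3, 2, 1]
}
```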

The generated code always appears to be a switch loop which 
operates on packets of 32, 4 or 1 values (the packet sizes vary 
from program to program) and jumps to the appropriate case 
depending on the number of values left to process.
The code for the 32- and 1-sized packets is OK in my program, but 
the in-between size (4 here, although I've seen it happen with 8) 
always uses unrolled scalar `v*op*ss` instructions instead of the 
packed versions (`VSUBPS` here).
In my testing, naively patching in the appropriate SIMD 
equivalent through a debugger and jumping to the end of the 
switch case gives an observable performance gain (5-10% of total 
program time in the worst case, where the array's .length is < 32).
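
To make the described structure concrete, here is a rough D model of the control flow as I read it from the description above. It is only a sketch and not actual compiler output: the name `subModel` is hypothetical, and the packet sizes 32, 4 and 1 are simply the ones observed in this program.

```d
// Hypothetical model of the described control flow, not real LDC output.
void subModel(float[] a, const(float)[] b)
{
    assert(a.length == b.length);
    immutable n = a.length;
    size_t i = 0;

    // Wide path: 32 floats per iteration (packed instructions in the real code).
    for (; i + 32 <= n; i += 32)
        a[i .. i + 32][] -= b[i .. i + 32][];

    // Middle path: 4 floats per iteration, written here as four scalar
    // operations to mirror the unrolled vsubss sequence the post complains about.
    for (; i + 4 <= n; i += 4)
    {
        a[i]     -= b[i];
        a[i + 1] -= b[i + 1];
        a[i + 2] -= b[i + 2];
        a[i + 3] -= b[i + 3];
    }

    // Tail: one float at a time.
    for (; i < n; ++i)
        a[i] -= b[i];
}

void main()
{
    // 39 = 32 + 4 + 3, so all three paths run.
    auto a = new float[](39);
    auto b = new float[](39);
    a[] = 10.0f;
    b[] = 4.0f;
    subModel(a, b);
    foreach (x; a)
        assert(x == 6.0f);
}
```

The complaint is about the middle loop: its four scalar operations could be a single packed subtraction.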

I'm not sure if this is related, but I've also seen code output 
where the faulty case kept redundantly reloading the same 
pointers into registers as if each iteration were the first.
An example:
```nasm
mov rsi, [rsp+30]
mov rdi, [rsp+78]
mov rsi, [rsi+138]
vmovss xmm0, [rdi+rdx*4]
vsubss xmm0, xmm0, [rsi+rdx*4]
vmovss [rdi+rdx*4], xmm0
; repeat this for 3 more iterations: the pointer loads are exactly
; the same, while the float accesses use an additional
; +(unroll_i*float.sizeof) offset
```

So my question is: how can I get LDC/LLVM to generate the proper 
code?

Thanks.

Jun 20 2021

kinke <noone nowhere.com> writes:

On Monday, 21 June 2021 at 01:12:33 UTC, z wrote:
> When performing `a[] *op*= b[]` or `foreach (i, aa; a) { a[i] -= 
> b[i]; }` operations, LDC generates slower code than it should.
>
> The generated code always appears to be a switch loop which 
> operates on packets of 32, 4 or 1 values (the packet sizes vary 
> from program to program) and jumps to the appropriate case 
> depending on the number of values left to process.
> The code for the 32- and 1-sized packets is OK in my program, 
> but the in-between size (4 here, although I've seen it happen 
> with 8) always uses unrolled scalar `v*op*ss` instructions 
> instead of the packed versions (`VSUBPS` here).

It looks like both loop unrolling and auto-vectorization expect a 
higher iteration/element count by default, and LDC currently 
doesn't have a way to fine-tune these parameters on a 
per-function/loop basis (e.g., via pragmas), only global LLVM 
cmdline options.

`@restrict` doesn't help much here either for telling the optimizer 
that the opaque slices don't overlap (opaque because the GC 
allocation is an opaque druntime function call).
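
For context, a no-alias annotation on raw pointer parameters might look like the sketch below. It is LDC-only and rests on an assumption: that `ldc.attributes` provides a `restrict` UDA that can be applied to pointer parameters. The function name `subNoAlias` is hypothetical and this is not the code from the linked example.

```d
// LDC-specific: this module does not exist in DMD or GDC.
// Assumption: `restrict` is a parameter UDA that marks the pointer as non-aliasing.
import ldc.attributes : restrict;

void subNoAlias(@restrict float* a, @restrict const(float)* b, size_t n)
{
    foreach (i; 0 .. n)
        a[i] -= b[i];
}

void main()
{
    auto a = [5.0f, 6.0f];
    auto b = [1.0f, 2.0f];
    // The caller promises that the two buffers do not overlap.
    subNoAlias(a.ptr, b.ptr, a.length);
    assert(a == [4.0f, 4.0f]);
}
```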

Using vector types explicitly improves things but imposes 
restrictions on lengths and alignment.
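
As a minimal sketch of the explicit-vector approach, assuming an x86-64 target where `core.simd`'s `float4` is available: the data is stored as `float4[]`, so the element count must be a multiple of 4 and the storage must be suitably aligned, which is the restriction mentioned above. The name `subVec` is hypothetical.

```d
import core.simd;

// Each element is a 4-lane vector, so one subtraction covers 4 floats.
void subVec(float4[] a, const(float4)[] b)
{
    assert(a.length == b.length);
    foreach (i; 0 .. a.length)
        a[i] -= b[i];
}

void main()
{
    float4 x = [8.0f, 8.0f, 8.0f, 8.0f];
    float4 y = [1.0f, 2.0f, 3.0f, 4.0f];

    // Arrays of float4: 2 vectors = 8 floats.
    auto a = new float4[](2);
    auto b = new float4[](2);
    a[] = x;
    b[] = y;

    subVec(a, b);
    assert(a[0].array == [7.0f, 6.0f, 5.0f, 4.0f]);
    assert(a[1].array == [7.0f, 6.0f, 5.0f, 4.0f]);
}
```

Lengths that are not a multiple of 4 would need a separate scalar tail or padding.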

See https://d.godbolt.org/z/r746o3Ya5 for boilerplate-free 
assembly.

Jun 20 2021
