digitalmars.D.ldc - toy windowing auto-vec miss
Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: simple but not trivial operand gather.

https://godbolt.org/z/ox1vvxd8s

Compile-time, target-adaptive manual __vector-ization is an answer here if you have no access to SIMT, so not a show stopper, but the code is less readable.

I'm not sure what the data parallel future should look like wrt language/IR, but I'm pretty sure we can do better than praying that the auto-vectorizer can dig patterns out of for loops, or throwing ourselves on the manual vectorization grenade, repeatedly.
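For readers who don't open the godbolt link, a hypothetical stand-in for the kind of "simple but not trivial operand gather" loop in question (not the actual godbolt source) looks roughly like this:

```d
// Hypothetical sketch only, not the godbolt kernel: each output element
// gathers two neighbouring (strided) inputs, which is easy to write but
// forces the auto-vectorizer to prove the accesses can become wide loads
// plus shuffles.
void downmix(const(short)* src, short* dst, size_t n)
{
    foreach (i; 0 .. n)
        dst[i] = cast(short)((src[2 * i] + src[2 * i + 1]) / 2);
}
```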
Nov 06 2022
This might be a bit naive, but ldc's output is about a quarter smaller and it uses significantly fewer jumps. Is gdc actually faster?
Nov 07 2022
On Monday, 7 November 2022 at 09:56:13 UTC, rikki cattermole wrote:
> This might be a bit naive, but ldc's output is about a quarter smaller and it uses significantly fewer jumps. Is gdc actually faster?

If you have long enough inputs, yes. A vectorized version overcomes the instruction stream overhead quickly, after which the performance advantage trends to N/1.

As you imply, measurement trumps in-one's-head modelling. I'll measure and report on the exact toy code later today, but real-world code with the same "simple but not trivial" operand pattern, involving Bayer/CFA data, has been measured and the performance gap verified. For that code the workaround was manual __vector-ization and use of a shuffle intrinsic.
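As a rough illustration of the manual __vector route (a hypothetical core.simd sketch, not the actual Bayer/CFA code, which additionally needs a shuffle intrinsic to de-interleave the colour planes):

```d
import core.simd;

// Hypothetical sketch: element-wise average of two float streams, four
// lanes per iteration, using explicit __vector (float4) operands.
void average4(const(float4)* a, const(float4)* b, float4* dst, size_t n)
{
    immutable float4 half = 0.5f;        // scalar is broadcast to all lanes
    foreach (i; 0 .. n)
        dst[i] = (a[i] + b[i]) * half;   // maps directly to vector add/mul
}
```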
Nov 07 2022
On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: simple but not trivial operand gather. https://godbolt.org/z/ox1vvxd8s
>
> Compile-time, target-adaptive manual __vector-ization is an answer here if you have no access to SIMT, so not a show stopper, but the code is less readable.
>
> I'm not sure what the data parallel future should look like wrt language/IR, but I'm pretty sure we can do better than praying that the auto-vectorizer can dig patterns out of for loops, or throwing ourselves on the manual vectorization grenade, repeatedly.

My "grenade" phrasing above was fun to write but overly dramatic. Manual __vector-ization is more tedious than dangerous, and ldc/gdc give you quite a bit of help there, including 1) __vector types and 2) compile-time max vector length introspection. Also, auto-vectorization *does* work nicely against simple/and-or conditioned inputs/outputs.

I believe there is a lot more to be had in the programmer-friendly-data-parallelism department, perhaps involving a (major) pivot to MLIR, but I give my considered thanks to those involved in providing what is already the best option in that arena from my point of view. Introspection, __vector, auto-vec, dcompute, ... it's a potent toolkit.
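For anyone curious what 2) can look like, here is a minimal hypothetical sketch of compile-time width selection (one possible probe; real code might also key off predefined versions such as D_AVX2):

```d
import core.simd;

// Hypothetical sketch: pick the widest float vector type the target will
// accept, falling back to scalar if SIMD types are unavailable.
static if (__traits(compiles, __vector(float[16])))
    alias F = __vector(float[16]);   // e.g. 512-bit targets
else static if (__traits(compiles, __vector(float[8])))
    alias F = __vector(float[8]);    // e.g. 256-bit targets
else static if (__traits(compiles, __vector(float[4])))
    alias F = __vector(float[4]);    // baseline 128-bit SIMD
else
    alias F = float;                 // scalar fallback

enum lanes = F.sizeof / float.sizeof;   // lane count, usable in static loops
```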
Nov 07 2022
On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: simple but not trivial operand gather. https://godbolt.org/z/ox1vvxd8s

Don't have time to dive deeper, but I found that removing `restrict` results in vectorized instructions with LDC (don't know if it is faster, just that they appear in the ASM).

-Johan
Nov 07 2022
On Monday, 7 November 2022 at 16:49:24 UTC, Johan wrote:
> On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
>> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: simple but not trivial operand gather. https://godbolt.org/z/ox1vvxd8s
>
> Don't have time to dive deeper, but I found that removing `restrict` results in vectorized instructions with LDC (don't know if it is faster, just that they appear in the ASM).
>
> -Johan

That's very interesting. This is the first time I've heard of restrict making things worse wrt auto-vectorization. From what I've seen in other experiments, restrict frequently provides a minor benefit (code size reduction) while occasionally enabling vectorization of otherwise complex dependency graphs.

Thanks for the heads up.
Nov 07 2022
On Monday, 7 November 2022 at 18:14:44 UTC, Bruce Carneal wrote:
> On Monday, 7 November 2022 at 16:49:24 UTC, Johan wrote:
>> On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
>>> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: simple but not trivial operand gather. https://godbolt.org/z/ox1vvxd8s
>>
>> Don't have time to dive deeper, but I found that removing `restrict` results in vectorized instructions with LDC (don't know if it is faster, just that they appear in the ASM).
>>
>> -Johan
>
> That's very interesting. This is the first time I've heard of restrict making things worse wrt auto-vectorization. From what I've seen in other experiments, restrict frequently provides a minor benefit (code size reduction) while occasionally enabling vectorization of otherwise complex dependency graphs.

Yeah, this is an LLVM bug. If you're interested in digging around a bit further, you can look at how the individual optimization passes change the IR code: https://godbolt.org/z/e9nqPfeKn

The loop vectorization pass does nothing for the `restrict` case. Note that the input for that pass is slightly different: the `restrict` case has a more complex forbody.preheader and 3 phi nodes in the for body (compared to 1 in the non-restrict case).

-Johan
Nov 07 2022