digitalmars.D.ldc - toy windowing auto-vec miss
Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: simple but not trivial operand gather.

https://godbolt.org/z/ox1vvxd8s

Compile-time, target-adaptive manual __vector-ization is an answer here if you have no access to SIMT, so not a show stopper, but the code is less readable.

I'm not sure what the data parallel future should look like wrt language/IR, but I'm pretty sure we can do better than praying that the auto-vectorizer can dig patterns out of for loops, or throwing ourselves on the manual vectorization grenade, repeatedly.
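For readers who don't open the godbolt link, a hypothetical stand-in for the kind of "simple but not trivial operand gather" loop in question (not the actual godbolt source) looks roughly like this:

```d
// Hypothetical sketch only, not the godbolt kernel: each output element
// gathers two neighbouring (strided) inputs, which is easy to write but
// forces the auto-vectorizer to prove the accesses can become wide loads
// plus shuffles.
void downmix(const(short)* src, short* dst, size_t n)
{
    foreach (i; 0 .. n)
        dst[i] = cast(short)((src[2 * i] + src[2 * i + 1]) / 2);
}
```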
Nov 06 2022
This might be a bit naive, but ldc's output is about a quarter smaller and it uses significantly fewer jumps. Is gdc actually faster?
Nov 07 2022
On Monday, 7 November 2022 at 09:56:13 UTC, rikki cattermole wrote:
> This might be a bit naive, but ldc's output is about a quarter smaller and it uses significantly fewer jumps. Is gdc actually faster?

If you have long enough inputs, yes. A vectorized version overcomes the instruction stream overhead quickly, after which the performance advantage trends to N/1.

As you imply, measurement trumps in-one's-head modelling. I'll measure and report on the exact toy code later today, but real-world code with the same "simple but not trivial" operand pattern, involving Bayer/CFA data, has been measured and the performance gap verified. For that code the workaround was manual __vector-ization and use of a shuffle intrinsic.
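As a rough illustration of the manual __vector route (a hypothetical core.simd sketch, not the actual Bayer/CFA code, which additionally needs a shuffle intrinsic to de-interleave the colour planes):

```d
import core.simd;

// Hypothetical sketch: element-wise average of two float streams, four
// lanes per iteration, using explicit __vector (float4) operands.
void average4(const(float4)* a, const(float4)* b, float4* dst, size_t n)
{
    immutable float4 half = 0.5f;        // scalar is broadcast to all lanes
    foreach (i; 0 .. n)
        dst[i] = (a[i] + b[i]) * half;   // maps directly to vector add/mul
}
```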
Nov 07 2022
On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: simple but not trivial operand gather. https://godbolt.org/z/ox1vvxd8s
>
> Compile-time, target-adaptive manual __vector-ization is an answer here if you have no access to SIMT, so not a show stopper, but the code is less readable.
>
> I'm not sure what the data parallel future should look like wrt language/IR, but I'm pretty sure we can do better than praying that the auto-vectorizer can dig patterns out of for loops, or throwing ourselves on the manual vectorization grenade, repeatedly.

My "grenade" phrasing above was fun to write but overly dramatic. Manual __vector-ization is more tedious than dangerous, and ldc/gdc give you quite a bit of help there, including 1) __vector types and 2) compile-time max vector length introspection. Also, auto-vectorization *does* work nicely against simple/and-or conditioned inputs/outputs.

I believe there is a lot more to be had in the programmer-friendly-data-parallelism department, perhaps involving a (major) pivot to MLIR, but I give my considered thanks to those involved in providing what is already the best option in that arena from my point of view. Introspection, __vector, auto-vec, dcompute, ... it's a potent toolkit.
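For anyone curious what 2) can look like, here is a minimal hypothetical sketch of compile-time width selection (one possible probe; real code might also key off predefined versions such as D_AVX2):

```d
import core.simd;

// Hypothetical sketch: pick the widest float vector type the target will
// accept, falling back to scalar if SIMD types are unavailable.
static if (__traits(compiles, __vector(float[16])))
    alias F = __vector(float[16]);   // e.g. 512-bit targets
else static if (__traits(compiles, __vector(float[8])))
    alias F = __vector(float[8]);    // e.g. 256-bit targets
else static if (__traits(compiles, __vector(float[4])))
    alias F = __vector(float[4]);    // baseline 128-bit SIMD
else
    alias F = float;                 // scalar fallback

enum lanes = F.sizeof / float.sizeof;   // lane count, usable in static loops
```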
Nov 07 2022
On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: simple but not trivial operand gather. https://godbolt.org/z/ox1vvxd8s

Don't have time to dive deeper, but I found that removing `restrict` results in vectorized instructions with LDC (don't know if it is faster, just that they appear in the ASM).

-Johan
Nov 07 2022
On Monday, 7 November 2022 at 16:49:24 UTC, Johan wrote:
> On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
>> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: simple but not trivial operand gather. https://godbolt.org/z/ox1vvxd8s
>
> Don't have time to dive deeper, but I found that removing `restrict` results in vectorized instructions with LDC (don't know if it is faster, just that they appear in the ASM).
>
> -Johan

That's very interesting. This is the first time I've heard of restrict making things worse wrt auto-vectorization. From what I've seen in other experiments, restrict frequently provides a minor benefit (code size reduction) while occasionally enabling vectorization of otherwise complex dependency graphs.

Thanks for the heads up.
Nov 07 2022
On Monday, 7 November 2022 at 18:14:44 UTC, Bruce Carneal wrote:
> On Monday, 7 November 2022 at 16:49:24 UTC, Johan wrote:
>> On Monday, 7 November 2022 at 01:59:03 UTC, Bruce Carneal wrote:
>>> Here's a simple godbolt example of one of the areas in which gdc solidly outperforms ldc wrt auto-vectorization: simple but not trivial operand gather. https://godbolt.org/z/ox1vvxd8s
>>
>> Don't have time to dive deeper, but I found that removing `restrict` results in vectorized instructions with LDC (don't know if it is faster, just that they appear in the ASM).
>>
>> -Johan
>
> That's very interesting. This is the first time I've heard of restrict making things worse wrt auto-vectorization. From what I've seen in other experiments, restrict frequently provides a minor benefit (code size reduction) while occasionally enabling vectorization of otherwise complex dependency graphs.

Yeah, this is an LLVM bug. If you're interested in digging around a bit further, you can look at how the individual optimization passes change the IR code: https://godbolt.org/z/e9nqPfeKn

The loop vectorization pass does nothing for the `restrict` case. Note that the input for that pass is slightly different: the `restrict` case has a more complex forbody.preheader and 3 phi nodes in the for body (compared to 1 in the non-restrict case).

-Johan
Nov 07 2022