digitalmars.D.learn - auto vectorization notes
- Bruce Carneal (17/17) Mar 23 2020 When speeds are equivalent, or very close, I usually prefer auto
- Crayo List (5/22) Mar 27 2020 auto vectorization is bad because you never know if your code
- Bruce Carneal (12/21) Mar 27 2020 Yes, that's a downside, you have to measure your performance
- Crayo List (13/34) Mar 28 2020 This is not true! The idea of ispc is to write portable code that
- Bruce Carneal (25/48) Mar 28 2020 There are many waypoints on the readability <==> performance
When speeds are equivalent, or very close, I usually prefer auto vectorized code to explicit SIMD/__vector code as it's easier to read. (on the downside you have to guard against compiler code-gen performance regressions)

One oddity I've noticed is that I sometimes need to use pragma(inline, *false*) in order to get ldc to "do the right thing". Apparently the compiler sees the costs/benefits differently in the standalone context.

More widely known techniques that have gotten people over the serial/SIMD hump include:

1) simplified indexing relationships
2) known count inner loops (chunkify)
3) static foreach blocks (manual inlining that the compiler "gathers")

I'd be interested to hear from others regarding their auto vectorization and __vector experiences. What has worked and what hasn't worked in your performance sensitive dlang code?
Mar 23 2020
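[Editor's note: for illustration only, a minimal sketch in D of techniques 2) and 3) from the post above, together with the pragma(inline, false) trick it mentions. The function name, chunk size, and build flags are assumptions, not code from the thread, and whether the loop actually vectorizes still has to be verified against your compiler and target, e.g. by inspecting the generated asm.]

    // saxpy_chunked.d -- e.g.: ldc2 -O3 -mcpu=native -release saxpy_chunked.d
    // Known-count inner loop ("chunkify") plus a static foreach block that the
    // optimizer can gather into straight-line, vectorizable code.
    import std.stdio : writeln;

    enum chunk = 16;            // fixed trip count for the inner block

    pragma(inline, false)       // sometimes needed to get the standalone codegen you want
    void saxpyChunk(float[] dst, const(float)[] x, const(float)[] y, float a)
    {
        // assumes dst.length is a multiple of chunk; remainder handling omitted
        for (size_t base = 0; base < dst.length; base += chunk)
        {
            static foreach (lane; 0 .. chunk)   // compile-time unrolled, branch-free body
            {
                dst[base + lane] = a * x[base + lane] + y[base + lane];
            }
        }
    }

    void main()
    {
        enum n = 1024;          // multiple of chunk by construction
        auto x = new float[n], y = new float[n], dst = new float[n];
        x[] = 1.5f;
        y[] = 2.0f;
        saxpyChunk(dst, x, y, 3.0f);
        writeln(dst[0], " ", dst[$ - 1]);   // 6.5 6.5
    }

The static foreach body is plain per-element code; the compiler sees a fixed-count, branch-free block it can turn into SIMD, but that outcome remains target- and flag-dependent.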
On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:
> When speeds are equivalent, or very close, I usually prefer auto vectorized code to explicit SIMD/__vector code as it's easier to read. (on the downside you have to guard against compiler code-gen performance regressions)
>
> One oddity I've noticed is that I sometimes need to use pragma(inline, *false*) in order to get ldc to "do the right thing". Apparently the compiler sees the costs/benefits differently in the standalone context.
>
> More widely known techniques that have gotten people over the serial/SIMD hump include:
>
> 1) simplified indexing relationships
> 2) known count inner loops (chunkify)
> 3) static foreach blocks (manual inlining that the compiler "gathers")
>
> I'd be interested to hear from others regarding their auto vectorization and __vector experiences. What has worked and what hasn't worked in your performance sensitive dlang code?

auto vectorization is bad because you never know if your code will get vectorized next time you make some change to it and recompile.

Just use: https://ispc.github.io/
Mar 27 2020
On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:
> On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:
>> [snip] (on the downside you have to guard against compiler code-gen performance regressions)
> auto vectorization is bad because you never know if your code will get vectorized next time you make some change to it and recompile. Just use: https://ispc.github.io/

Yes, that's a downside: you have to measure your performance sensitive code if you change it *or* change compilers or targets. Explicit SIMD code, ispc or other, isn't as readable or composable or vanilla portable, but it certainly is performance predictable.

I find SIMT code readability better than SIMD but a little worse than auto-vectorizable kernels. Hugely better performance, though, for less effort than SIMD if your platform supports it.

Is anyone actively using dcompute (SIMT enabler)? Unless I hear bad things I'll try it down the road, as neither going back to CUDA nor "forward" to the SycL-verse appeals.
Mar 27 2020
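[Editor's note: a hedged sketch of what the explicit __vector route mentioned above looks like in D via core.simd, for contrast with the auto-vectorized style. It assumes a target that supports float4 (128-bit SIMD) and 16-byte-aligned slices; the names and the remainder handling are illustrative, not from the thread, and production code would use unaligned load/store intrinsics for the general case.]

    // explicit_simd.d -- e.g.: ldc2 -O2 -mcpu=native explicit_simd.d
    // Explicit SIMD: the vector operations are spelled out, so codegen does not
    // depend on whether the auto-vectorizer fires.
    import core.simd;               // float4 = __vector(float[4]), if the target supports it
    import std.stdio : writeln;

    void saxpyVec(float[] dst, const(float)[] x, const(float)[] y, float a)
    {
        const float4 va = a;        // scalar broadcast to all four lanes
        size_t i = 0;
        for (; i + 4 <= dst.length; i += 4)
        {
            // assumes 16-byte aligned data (GC float arrays are, here)
            float4 vx = *cast(const(float4)*)(x.ptr + i);
            float4 vy = *cast(const(float4)*)(y.ptr + i);
            *cast(float4*)(dst.ptr + i) = va * vx + vy;
        }
        for (; i < dst.length; ++i) // scalar tail
            dst[i] = a * x[i] + y[i];
    }

    void main()
    {
        auto x = new float[10], y = new float[10], dst = new float[10];
        x[] = 2.0f;
        y[] = 1.0f;
        saxpyVec(dst, x, y, 4.0f);
        writeln(dst);               // all elements 9
    }

The trade-off discussed in the thread is visible here: the loads, stores, and lane width are fixed in the source, which makes performance predictable but the code less readable and less portable than the auto-vectorized version.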
On Saturday, 28 March 2020 at 06:56:14 UTC, Bruce Carneal wrote:
> On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:
>> On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:
>>> [snip] (on the downside you have to guard against compiler code-gen performance regressions)
>> auto vectorization is bad because you never know if your code will get vectorized next time you make some change to it and recompile. Just use: https://ispc.github.io/
> Yes, that's a downside: you have to measure your performance sensitive code if you change it *or* change compilers or targets. Explicit SIMD code, ispc or other, isn't as readable or composable or vanilla portable, but it certainly is performance predictable.

This is not true! The idea of ispc is to write portable code that will vectorize predictably based on the target CPU. The object file/binary is not portable, if that is what you meant. Also, I find it readable.

> I find SIMT code readability better than SIMD but a little worse than auto-vectorizable kernels. Hugely better performance, though, for less effort than SIMD if your platform supports it.

Again, I don't think this is true. Unless I am misunderstanding you, SIMT and SIMD are not mutually exclusive, and if you need performance then you must use both. Also, depending on the workload and processor, SIMD may be much more effective than SIMT.
Mar 28 2020
On Saturday, 28 March 2020 at 18:01:37 UTC, Crayo List wrote:
> On Saturday, 28 March 2020 at 06:56:14 UTC, Bruce Carneal wrote:
>> On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:
>>> On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:
>> Explicit SIMD code, ispc or other, isn't as readable or composable or vanilla portable, but it certainly is performance predictable. [snip]
> This is not true! The idea of ispc is to write portable code that will vectorize predictably based on the target CPU. The object file/binary is not portable, if that is what you meant. Also, I find it readable.

There are many waypoints on the readability <==> performance axis. If ispc works for you along that axis, great!

>> I find SIMT code readability better than SIMD but a little worse than auto-vectorizable kernels. Hugely better performance, though, for less effort than SIMD if your platform supports it.
> Again, I don't think this is true. Unless I am misunderstanding you, SIMT and SIMD are not mutually exclusive, and if you need performance then you must use both. Also, depending on the workload and processor, SIMD may be much more effective than SIMT.

SIMD might become part of the solution under the hood for a number of reasons including: ease of deployment, programmer familiarity, PCIe xfer overheads, kernel launch overhead, memory subsystem suitability, existing code base issues, ...

SIMT works for me in high throughput situations where it's hard to "take a log" on the problem. SIMD, in auto-vectorizable or more explicit form, works in others. Combinations can be useful, but most of the work I've come in contact with splits pretty clearly along the memory bandwidth divide (SIMT on one side, SIMD/CPU on the other). Others need a plus-up in arithmetic horsepower. The more extreme the requirements, the more attractive SIMT appears. (Hence my excitement about dcompute possibly expanding the dlang performance envelope with much less cognitive load than CUDA/OpenCL/SycL/...)

On the readability front, I find per-lane programming, even with the current thread-divergence caveats, easier to reason about wrt correctness and performance predictability than other approaches. Apparently your mileage does vary.

When you have chosen SIMD, whether ispc or other, over SIMT, what drove the decision? Performance? Ease of programming to reach a target speed?
Mar 28 2020
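[Editor's note: a rough sketch of the per-lane (SIMT) style discussed above, written against dcompute. The module layout, the @compute/@kernel attributes, and the GlobalPointer/GlobalIndex names follow dcompute's published saxpy-style examples as best recalled and may differ between versions; treat this as an assumption-laden illustration rather than working code, and consult the dcompute documentation for the host-side setup and launch.]

    // Hypothetical dcompute kernel (per-lane / SIMT style), device-side only.
    // Each work-item/thread handles one element; there are no explicit lanes.
    @compute(CompileFor.deviceOnly) module kernels;

    import ldc.dcompute;          // @kernel, @compute, CompileFor, GlobalPointer (assumed)
    import dcompute.std.index;    // GlobalIndex (assumed)

    @kernel void saxpy(GlobalPointer!float res,
                       float alpha,
                       GlobalPointer!float x,
                       GlobalPointer!float y,
                       size_t n)
    {
        auto i = GlobalIndex.x;            // this work-item's element
        if (i >= n) return;                // guard the ragged tail
        res[i] = alpha * x[i] + y[i];      // straight per-element code
    }

The per-lane readability point made in the thread is visible here: the kernel body reads like ordinary scalar code, while the width and scheduling are supplied by the runtime rather than spelled out in the source.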