digitalmars.D.ldc - 512 bit static array to vector
- Bruce Carneal (11/11) Jun 19 2022 Here's a comparison between ldc and gdc converting static arrays
- kinke (12/16) Jun 19 2022 Different results (actually using a 512-bit move, not 2x256) with
- Bruce Carneal (14/33) Jun 19 2022 An unexpected choice there given that x86-64-v4 requires
- Bruce Carneal (10/17) Jun 19 2022 Note that llvm/ldc chooses a 512 bit 2 instruction ld/st sequence
Here's a comparison between ldc and gdc converting static arrays to 512-bit vectors: https://godbolt.org/z/8jxafh76W

A few observations:

1) LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.
2) LDC emits worse code for the cleaner .array assignment than for the union hack.
3) LDC fabricates non-HW __vectors so is(someVector) has diminished CT utility.

Is improving LLVM/LDC wrt any of the above relatively simple?
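A minimal sketch of the kind of conversion code being compared (a reconstruction, not the actual godbolt source; the a2vArray/a2vUnion names follow the later posts):

```d
// Convert a double[8] static array to a 512-bit __vector, two ways.
alias double8 = __vector(double[8]); // non-HW width on targets below AVX-512

// 1) The cleaner .array assignment.
double8 a2vArray(const double[8] a)
{
    double8 v;
    v.array = a;
    return v;
}

// 2) The union hack.
double8 a2vUnion(const double[8] a)
{
    union U { double[8] arr; double8 vec; }
    U u;
    u.arr = a;     // write the array member...
    return u.vec;  // ...and read back the vector member
}
```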
Jun 19 2022
On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
> 1) LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.

Different results (actually using a 512-bit move, not 2x256) with `-mattr=avx512bw`. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

The biggest difference with gdc is an ABI one - gdc returns the vector directly in an AVX512 register, whereas LDC returns it indirectly (sret return - the caller passes a pointer to its pre-allocated result). That's a limitation of the frontend's https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting extended for broader vectors...).

> 3) LDC fabricates non-HW __vectors so is(someVector) has diminished CT utility.

I consider that useful, e.g., it allows using a `double4` without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.
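A small sketch of what that buys in practice (illustrative code, not from the thread):

```d
// Vectors wider than the target's HW registers still compile and work;
// the backend legalizes the operations (e.g. splitting a 512-bit op into
// 2x256 or 4x128), so generic code needn't branch on CPU features.
alias double8 = __vector(double[8]); // accepted even when targeting plain x86-64

double8 addBroadcast(double8 v, double s)
{
    double8 b = s; // scalar broadcast into all lanes
    return v + b;  // element-wise add, lowered to whatever the target supports
}
```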
Jun 19 2022
On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:
> On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
>> 1) LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.
> Different results (actually using a 512-bit move, not 2x256) with `-mattr=avx512bw`. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

An unexpected choice there given that x86-64-v4 requires avx512bw. Still, glad to hear that the narrower specialization works well. It would be part of any multi-target binary where the programmer was concerned about maximum width-sensitive performance across the widest range of machines.

> The biggest difference with gdc is an ABI one - gdc returns the vector directly in an AVX512 register, whereas LDC returns it indirectly (sret return - the caller passes a pointer to its pre-allocated result). That's a limitation of the frontend's https://github.com/dlang/dmd/blob/master/src/dmd/argtypes_sysv_x64.d, which supports 256-bit vectors but no 512-bit ones (the SysV ABI keeps getting extended for broader vectors...).

So, IIUC, gdc and ldc are not interoperable currently but will be once the frontend is updated?

>> 3) LDC fabricates non-HW __vectors so is(someVector) has diminished CT utility.
> I consider that useful, e.g., it allows using a `double4` without having to consider CPU limitations. I think the compiler should expose a trait for the largest supported vector size instead.

I agree. Perhaps a template: maxISAVectorLengthFor(T). If we're getting fancy we could do: maxMicroarchVectorLengthFor(T). Even better if these work correctly in multi-target compilation scenarios and for the expanding set of types (f16, bf16, other?). Having both variants could be useful when targeting split/paired architectures, as AMD is fond of lately, or the SVE/RVV machines.
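A hedged sketch of what a library-level stand-in for maxISAVectorLengthFor(T) could look like today, using only the standard D_SIMD/D_AVX/D_AVX2 version identifiers (the name follows the proposal above; no such trait exists yet, and the lack of an AVX-512 version identifier is part of why a real compiler trait would be better):

```d
// Hypothetical library approximation - a real compiler trait could answer
// this per-target (and per-microarch) far more precisely.
version (D_AVX2)      enum size_t maxHWVectorBytes = 32; // 256-bit int + float
else version (D_AVX)  enum size_t maxHWVectorBytes = 32; // 256-bit float
else version (D_SIMD) enum size_t maxHWVectorBytes = 16; // 128-bit SSE baseline
else                  enum size_t maxHWVectorBytes = 0;  // no guaranteed SIMD

// Largest HW-backed lane count for a given element type (0 if none).
enum size_t maxISAVectorLengthFor(T) = maxHWVectorBytes / T.sizeof;

pragma(msg, "max HW double lanes: ", maxISAVectorLengthFor!double);
```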
Jun 19 2022
On Sunday, 19 June 2022 at 16:56:17 UTC, kinke wrote:
> On Sunday, 19 June 2022 at 14:13:45 UTC, Bruce Carneal wrote:
>> 1) LDC requires more instructions at 512 bits. At 256 (x86-64-v3) they're the same.
> Different results (actually using a 512-bit move, not 2x256) with `-mattr=avx512bw`. I guess LLVM makes performance assumptions for the provided CPU and prefers 256-bit instructions.

Note that llvm/ldc chooses a 512-bit, 2-instruction ld/st sequence for a2vUnion given x86-64-v4 as the target, but goes for a 256-bit-wide, 4-instruction ld/st sequence in a2vArray. As you note, -mattr=avx512bw forces a2vArray into the 2-instruction form, but apparently some difference in the IR presented to LLVM enables the choice of the shorter sequence for a2vUnion in either case? Just curious.

Thanks for having taken a look, and for highlighting the workaround (specify avx512bw explicitly).
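For per-function rather than whole-module control, LDC's @target attribute (from ldc.attributes) might be another way to apply that workaround - a sketch under the assumption that its feature strings cover avx512bw and that it nudges LLVM into the same 512-bit selection as the global -mattr flag (worth verifying on godbolt):

```d
import ldc.attributes : target;

alias double8 = __vector(double[8]);

// Assumption: enabling avx512bw on just this function is enough for LLVM
// to pick the 512-bit ld/st sequence here, as -mattr=avx512bw does globally.
@target("avx512bw")
double8 a2vArray512(const double[8] a)
{
    double8 v;
    v.array = a;
    return v;
}
```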
Jun 19 2022