digitalmars.D - Any SIMD experts?
- Martin Nowak (16/16) Dec 08 2014 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time.
- Martin Nowak (12/21) Dec 08 2014 This is for the most performance critical instructions during GC
- Etienne (11/20) Dec 08 2014 Are you using it for the binary search part (find pool) ?
- Martin Nowak (3/4) Dec 08 2014 Nope the binary search is surprisingly fine, but I had it on my list of
- "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= (3/6) Dec 09 2014 Consider rewriting the code to use non-branching comparison, i.e.
- John Colvin (2/19) Dec 08 2014 I don't quite understand what you mean by "save one conditional".
- Martin Nowak (12/13) Dec 08 2014 Instead of
- ponce (10/27) Dec 08 2014 Using the fact that a negative 64-bit integer is also a negative
- ponce (2/4) Dec 08 2014 Oops. This assumption is wrong.
- Martin Nowak (2/3) Dec 08 2014 It won't unless you allocate more than 4GB of RAM.
- John Colvin (22/39) Dec 08 2014 Well gcc gives me:
- John Colvin (4/52) Dec 08 2014 To conceptually get what it's doing here, the trick is that it's
- Martin Nowak (2/5) Dec 08 2014 All too easy, but would've taken me a pen and paper to realize :).
- John Colvin (4/13) Dec 08 2014 I wouldn't have seen it so easily but I've spent the last few
- Martin Nowak (7/8) Dec 08 2014 Tried that with dmd, it gave me.
- John Colvin (14/22) Dec 09 2014 It's unfortunate it can't work it out automatically like gcc can.
- John Colvin (3/29) Dec 09 2014 which of course Kenji already has a pull for, less than 3 hours
- Martin Nowak (2/3) Dec 10 2014 It's right on time, it's right on time
- John Colvin (6/11) Dec 10 2014 Still doesn't work though:
- ponce (15/32) Dec 08 2014 Another solution, in SSE pseudo-code:
- ponce (3/4) Dec 08 2014 Correction:
I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time. ulong2 vval = [v0, v1]; ulong2 vlow = [low, low]; ulong2 vhigh = [high, high]; int res = PMOVMSKB(vval >= vlow & vval < vhigh); I figured out sort of a solution, but it seems way too complicated, because there is only signed comparison. Usually (scalar) I'd use this, which makes use of unsigned wrap to safe one conditional immutable size = cast(ulong)(vhigh - vlow); if (cast(ulong)(v0 - vlow) < size) {} if (cast(ulong)(v1 - vlow) < size) {} over if (v0 >= vlow && v0 < vhigh) {} Maybe this can be used on SIMD too (saturated sub or so)? -Martin
Dec 08 2014
On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:Usually (scalar) I'd use this, which makes use of unsigned wrap to safe one conditional immutable size = cast(ulong)(vhigh - vlow); if (cast(ulong)(v0 - vlow) < size) {} if (cast(ulong)(v1 - vlow) < size) {} over if (v0 >= vlow && v0 < vhigh) {} Maybe this can be used on SIMD too (saturated sub or so)? -MartinThis is for the most performance critical instructions during GC marking. If we can come up with some good SIMD this will result in a good speedup. Yesterday I was surprised to learn that my unsigned wrap trick actually slowed down some GC benchmarks by 3%. The branch predictor had more trouble with the single branch because that resulted in a fifty-fifty chance. There is some correlation between the 2 branch bounds checks and one of them could be predicted fairly well, resulting in a better combined result. Always profile!
Dec 08 2014
On 2014-12-08 11:44 AM, Martin Nowak wrote:This is for the most performance critical instructions during GC marking. If we can come up with some good SIMD this will result in a good speedup. Yesterday I was surprised to learn that my unsigned wrap trick actually slowed down some GC benchmarks by 3%. The branch predictor had more trouble with the single branch because that resulted in a fifty-fifty chance. There is some correlation between the 2 branch bounds checks and one of them could be predicted fairly well, resulting in a better combined result. Always profile!Are you using it for the binary search part (find pool) ? http://cs.nyu.edu/~lerner/spring12/Preso06-SIMDTree.pdf The savings are would be best when the tree (pool info?) is under 64KB, savings being up to 30%. It could be worth doing it. Here's a quick solution: http://stackoverflow.com/questions/20616605/using-simd-avx-sse-for-tree-traversal Most of the code circulating on the web uses the _mm256 or _mm128 SIMD format, which I ported to multiple platforms already and I can share it through boost license too if you want: https://github.com/etcimon/botan/blob/master/source/botan/utils/simd/immintrin.d
Dec 08 2014
On 12/08/2014 06:21 PM, Etienne wrote:Are you using it for the binary search part (find pool) ?Nope the binary search is surprisingly fine, but I had it on my list of subjects for quite a while too.
Dec 08 2014
On Monday, 8 December 2014 at 16:44:37 UTC, Martin Nowak wrote:actually slowed down some GC benchmarks by 3%. The branch predictor had more trouble with the single branch because that resulted in a fifty-fifty chance. There is some correlationConsider rewriting the code to use non-branching comparison, i.e. use the condition test result directly in an expression or cmov.
Dec 09 2014
On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time. ulong2 vval = [v0, v1]; ulong2 vlow = [low, low]; ulong2 vhigh = [high, high]; int res = PMOVMSKB(vval >= vlow & vval < vhigh); I figured out sort of a solution, but it seems way too complicated, because there is only signed comparison. Usually (scalar) I'd use this, which makes use of unsigned wrap to safe one conditional immutable size = cast(ulong)(vhigh - vlow); if (cast(ulong)(v0 - vlow) < size) {} if (cast(ulong)(v1 - vlow) < size) {} over if (v0 >= vlow && v0 < vhigh) {} Maybe this can be used on SIMD too (saturated sub or so)? -MartinI don't quite understand what you mean by "save one conditional".
Dec 08 2014
On 12/08/2014 05:55 PM, John Colvin wrote:I don't quite understand what you mean by "save one conditional".Instead of if (v >= low && v < high) it's if (cast(size_t)(v - low) < size) So there is only one branch. There are other ways to eliminate one branch, for example this. if ((v >= low) & (v < high)) But the interesting point was that the second branch actually helped the branch predictor. Because it seems to know that when the first branch is true the second will likely be true as well, respectively if it was false the 2nd one will be as well.
Dec 08 2014
Using the fact that a negative 64-bit integer is also a negative 32-bit integer when truncated, you could use 2x SSE2 32-bit comparison. - one for [v0 , v1] - [vlow, vlow] compared to [0, 0] (result should be >= 0, signed comparison) - one for [v0 , v1] - [vhigh, vhigh] compared to [0, 0] (result should be < 0, signed comparison) Then apply logical and on the masks. This doesn't work if vhigh - vlow spans a too large area. On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time. ulong2 vval = [v0, v1]; ulong2 vlow = [low, low]; ulong2 vhigh = [high, high]; int res = PMOVMSKB(vval >= vlow & vval < vhigh); I figured out sort of a solution, but it seems way too complicated, because there is only signed comparison. Usually (scalar) I'd use this, which makes use of unsigned wrap to safe one conditional immutable size = cast(ulong)(vhigh - vlow); if (cast(ulong)(v0 - vlow) < size) {} if (cast(ulong)(v1 - vlow) < size) {} over if (v0 >= vlow && v0 < vhigh) {} Maybe this can be used on SIMD too (saturated sub or so)? -Martin
Dec 08 2014
On Monday, 8 December 2014 at 16:57:20 UTC, ponce wrote:Using the fact that a negative 64-bit integer is also a negative 32-bit integer when truncated,Oops. This assumption is wrong.
Dec 08 2014
On 12/08/2014 05:57 PM, ponce wrote:This doesn't work if vhigh - vlow spans a too large area.It won't unless you allocate more than 4GB of RAM.
Dec 08 2014
On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time. ulong2 vval = [v0, v1]; ulong2 vlow = [low, low]; ulong2 vhigh = [high, high]; int res = PMOVMSKB(vval >= vlow & vval < vhigh); I figured out sort of a solution, but it seems way too complicated, because there is only signed comparison. Usually (scalar) I'd use this, which makes use of unsigned wrap to safe one conditional immutable size = cast(ulong)(vhigh - vlow); if (cast(ulong)(v0 - vlow) < size) {} if (cast(ulong)(v1 - vlow) < size) {} over if (v0 >= vlow && v0 < vhigh) {} Maybe this can be used on SIMD too (saturated sub or so)? -MartinWell gcc gives me: typedef unsigned long ulong4 __attribute__ ((vector_size (32))); ulong4 foo(ulong4 a, ulong4 l, ulong4 h) { return (a >= l) & (a < h); } foo(unsigned long __vector, unsigned long __vector, unsigned long __vector): vmovdqa .LC0(%rip), %ymm3 vpsubq %ymm3, %ymm0, %ymm0 vpsubq %ymm3, %ymm2, %ymm2 vpsubq %ymm3, %ymm1, %ymm1 vpcmpgtq %ymm0, %ymm2, %ymm2 vpcmpgtq %ymm0, %ymm1, %ymm1 vpandn %ymm2, %ymm1, %ymm0 ret .LC0: .quad -9223372036854775808 .quad -9223372036854775808 .quad -9223372036854775808 .quad -9223372036854775808
Dec 08 2014
On Monday, 8 December 2014 at 17:05:09 UTC, John Colvin wrote:On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:To conceptually get what it's doing here, the trick is that it's offsetting the values so as to simulate unsigned comparisons using signed instructions.I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time. ulong2 vval = [v0, v1]; ulong2 vlow = [low, low]; ulong2 vhigh = [high, high]; int res = PMOVMSKB(vval >= vlow & vval < vhigh); I figured out sort of a solution, but it seems way too complicated, because there is only signed comparison. Usually (scalar) I'd use this, which makes use of unsigned wrap to safe one conditional immutable size = cast(ulong)(vhigh - vlow); if (cast(ulong)(v0 - vlow) < size) {} if (cast(ulong)(v1 - vlow) < size) {} over if (v0 >= vlow && v0 < vhigh) {} Maybe this can be used on SIMD too (saturated sub or so)? -MartinWell gcc gives me: typedef unsigned long ulong4 __attribute__ ((vector_size (32))); ulong4 foo(ulong4 a, ulong4 l, ulong4 h) { return (a >= l) & (a < h); } foo(unsigned long __vector, unsigned long __vector, unsigned long __vector): vmovdqa .LC0(%rip), %ymm3 vpsubq %ymm3, %ymm0, %ymm0 vpsubq %ymm3, %ymm2, %ymm2 vpsubq %ymm3, %ymm1, %ymm1 vpcmpgtq %ymm0, %ymm2, %ymm2 vpcmpgtq %ymm0, %ymm1, %ymm1 vpandn %ymm2, %ymm1, %ymm0 ret .LC0: .quad -9223372036854775808 .quad -9223372036854775808 .quad -9223372036854775808 .quad -9223372036854775808
Dec 08 2014
On 12/08/2014 06:20 PM, John Colvin wrote:To conceptually get what it's doing here, the trick is that it's offsetting the values so as to simulate unsigned comparisons using signed instructions.All too easy, but would've taken me a pen and paper to realize :).
Dec 08 2014
On Monday, 8 December 2014 at 23:40:21 UTC, Martin Nowak wrote:On 12/08/2014 06:20 PM, John Colvin wrote:I wouldn't have seen it so easily but I've spent the last few weeks doing discrete Fourier analysis, so I've got very used to spotting modulo arithmetic tricks of exactly this kind.To conceptually get what it's doing here, the trick is that it's offsetting the values so as to simulate unsigned comparisons using signed instructions.All too easy, but would've taken me a pen and paper to realize :).
Dec 08 2014
On 12/08/2014 06:05 PM, John Colvin wrote:Well gcc gives me:Tried that with dmd, it gave me. bug.d(5): Error: incompatible types for ((a) >= (l)): '__vector(ulong[4])' and '__vector(ulong[4])' bug.d(5): Error: incompatible types for ((a) < (h)): '__vector(ulong[4])' and '__vector(ulong[4])' Yours looks better.
Dec 08 2014
On Monday, 8 December 2014 at 22:10:09 UTC, Martin Nowak wrote:On 12/08/2014 06:05 PM, John Colvin wrote:It's unfortunate it can't work it out automatically like gcc can. Anyway, you can write it out manually: auto foo(ulong2 a, ulong2 l, ulong h) { static long2 shift = long.min; //-2^63 long2 aS = cast(long2)(a - shift); long2 lS = cast(long2)(l - shift); long2 hS = cast(long2)(h - shift); return (hS > aS) & (lS > aS); } but that doesn't actually compile, the compiler just sits in semantic3 forever (see https://issues.dlang.org/show_bug.cgi?id=13841)Well gcc gives me:Tried that with dmd, it gave me. bug.d(5): Error: incompatible types for ((a) >= (l)): '__vector(ulong[4])' and '__vector(ulong[4])' bug.d(5): Error: incompatible types for ((a) < (h)): '__vector(ulong[4])' and '__vector(ulong[4])' Yours looks better.
Dec 09 2014
On Tuesday, 9 December 2014 at 12:36:29 UTC, John Colvin wrote:On Monday, 8 December 2014 at 22:10:09 UTC, Martin Nowak wrote:which of course Kenji already has a pull for, less than 3 hours later :)On 12/08/2014 06:05 PM, John Colvin wrote:It's unfortunate it can't work it out automatically like gcc can. Anyway, you can write it out manually: auto foo(ulong2 a, ulong2 l, ulong h) { static long2 shift = long.min; //-2^63 long2 aS = cast(long2)(a - shift); long2 lS = cast(long2)(l - shift); long2 hS = cast(long2)(h - shift); return (hS > aS) & (lS > aS); } but that doesn't actually compile, the compiler just sits in semantic3 forever (see https://issues.dlang.org/show_bug.cgi?id=13841)Well gcc gives me:Tried that with dmd, it gave me. bug.d(5): Error: incompatible types for ((a) >= (l)): '__vector(ulong[4])' and '__vector(ulong[4])' bug.d(5): Error: incompatible types for ((a) < (h)): '__vector(ulong[4])' and '__vector(ulong[4])' Yours looks better.
Dec 09 2014
On 12/09/2014 05:22 PM, John Colvin wrote:which of course Kenji already has a pull for, less than 3 hours later :)It's right on time, it's right on time
Dec 10 2014
On Wednesday, 10 December 2014 at 15:53:40 UTC, Martin Nowak wrote:On 12/09/2014 05:22 PM, John Colvin wrote:Still doesn't work though: https://issues.dlang.org/show_bug.cgi?id=13852 you can't even use core.imd.__simd because it doesn't have PCMPGTQ https://issues.dlang.org/show_bug.cgi?id=8047which of course Kenji already has a pull for, less than 3 hours later :)It's right on time, it's right on time
Dec 10 2014
On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time. ulong2 vval = [v0, v1]; ulong2 vlow = [low, low]; ulong2 vhigh = [high, high]; int res = PMOVMSKB(vval >= vlow & vval < vhigh); I figured out sort of a solution, but it seems way too complicated, because there is only signed comparison. Usually (scalar) I'd use this, which makes use of unsigned wrap to safe one conditional immutable size = cast(ulong)(vhigh - vlow); if (cast(ulong)(v0 - vlow) < size) {} if (cast(ulong)(v1 - vlow) < size) {} over if (v0 >= vlow && v0 < vhigh) {} Maybe this can be used on SIMD too (saturated sub or so)? -MartinAnother solution, in SSE pseudo-code: values <- [vval0, vval1] // two value to bound-check, would be 4 in AVX low <- [vlow, vlow] high <- [vhigh, vhigh] subl <- values - low // using the PSUBQ instruction (SSE2) subh <- values - high // using the PSUBQ instruction (SSE2) mask <- subh andnot subl // using the PANDN instruction (SSE2) // only logical shift is available for 64-bit integers until AVX-512 // using PSRLQ to shift the sign bit to the right (SSE2) vresult <- mask >> 31 (for each dword) Now, each 64-bit word in vresult contains exactly 1 if bound were checked, or 0 else.
Dec 08 2014
On Monday, 8 December 2014 at 22:37:08 UTC, ponce wrote:vresult <- mask >> 31 (for each dword)Correction: vresult <- mask >> 63 (for each qword)
Dec 08 2014