www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Any SIMD experts?

reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time.

ulong2 vval = [v0, v1];
ulong2 vlow = [low, low];
ulong2 vhigh = [high, high];

int res = PMOVMSKB(vval >= vlow & vval < vhigh);

I figured out sort of a solution, but it seems way too complicated, 
because there is only signed comparison.

Usually (scalar) I'd use this, which makes use of unsigned wrap to safe 
one conditional

immutable size = cast(ulong)(vhigh - vlow);
if (cast(ulong)(v0 - vlow) < size) {}
if (cast(ulong)(v1 - vlow) < size) {}

over

if (v0 >= vlow && v0 < vhigh) {}

Maybe this can be used on SIMD too (saturated sub or so)?

-Martin
Dec 08 2014
next sibling parent reply "Martin Nowak" <code dawg.eu> writes:
On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 Usually (scalar) I'd use this, which makes use of unsigned wrap 
 to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin
This is for the most performance critical instructions during GC marking. If we can come up with some good SIMD this will result in a good speedup. Yesterday I was surprised to learn that my unsigned wrap trick actually slowed down some GC benchmarks by 3%. The branch predictor had more trouble with the single branch because that resulted in a fifty-fifty chance. There is some correlation between the 2 branch bounds checks and one of them could be predicted fairly well, resulting in a better combined result. Always profile!
Dec 08 2014
next sibling parent reply Etienne <etcimon gmail.com> writes:
On 2014-12-08 11:44 AM, Martin Nowak wrote:
 This is for the most performance critical instructions during GC marking.
 If we can come up with some good SIMD this will result in a good speedup.

 Yesterday I was surprised to learn that my unsigned wrap trick actually
 slowed down some GC benchmarks by 3%. The branch predictor had more
 trouble with the single branch because that resulted in a fifty-fifty
 chance. There is some correlation between the 2 branch bounds checks and
 one of them could be predicted fairly well, resulting in a better
 combined result.
 Always profile!
Are you using it for the binary search part (find pool) ? http://cs.nyu.edu/~lerner/spring12/Preso06-SIMDTree.pdf The savings are would be best when the tree (pool info?) is under 64KB, savings being up to 30%. It could be worth doing it. Here's a quick solution: http://stackoverflow.com/questions/20616605/using-simd-avx-sse-for-tree-traversal Most of the code circulating on the web uses the _mm256 or _mm128 SIMD format, which I ported to multiple platforms already and I can share it through boost license too if you want: https://github.com/etcimon/botan/blob/master/source/botan/utils/simd/immintrin.d
Dec 08 2014
parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 12/08/2014 06:21 PM, Etienne wrote:
 Are you using it for the binary search part (find pool) ?
Nope the binary search is surprisingly fine, but I had it on my list of subjects for quite a while too.
Dec 08 2014
prev sibling parent "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:
On Monday, 8 December 2014 at 16:44:37 UTC, Martin Nowak wrote:
 actually slowed down some GC benchmarks by 3%. The branch 
 predictor had more trouble with the single branch because that 
 resulted in a fifty-fifty chance. There is some correlation
Consider rewriting the code to use non-branching comparison, i.e. use the condition test result directly in an expression or cmov.
Dec 09 2014
prev sibling next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
 a time.

 ulong2 vval = [v0, v1];
 ulong2 vlow = [low, low];
 ulong2 vhigh = [high, high];

 int res = PMOVMSKB(vval >= vlow & vval < vhigh);

 I figured out sort of a solution, but it seems way too 
 complicated, because there is only signed comparison.

 Usually (scalar) I'd use this, which makes use of unsigned wrap 
 to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin
I don't quite understand what you mean by "save one conditional".
Dec 08 2014
parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 12/08/2014 05:55 PM, John Colvin wrote:
 I don't quite understand what you mean by "save one conditional".
Instead of if (v >= low && v < high) it's if (cast(size_t)(v - low) < size) So there is only one branch. There are other ways to eliminate one branch, for example this. if ((v >= low) & (v < high)) But the interesting point was that the second branch actually helped the branch predictor. Because it seems to know that when the first branch is true the second will likely be true as well, respectively if it was false the 2nd one will be as well.
Dec 08 2014
prev sibling next sibling parent reply "ponce" <contact gam3sfrommars.fr> writes:
Using the fact that a negative 64-bit integer is also a negative 
32-bit integer when truncated, you could use 2x SSE2 32-bit 
comparison.
- one for [v0 , v1] - [vlow, vlow] compared to [0, 0] (result 
should be >= 0, signed comparison)
- one for [v0 , v1] - [vhigh, vhigh] compared to [0, 0] (result 
should be < 0, signed comparison)

Then apply logical and on the masks.
This doesn't work if vhigh - vlow spans a too large area.

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
 a time.

 ulong2 vval = [v0, v1];
 ulong2 vlow = [low, low];
 ulong2 vhigh = [high, high];

 int res = PMOVMSKB(vval >= vlow & vval < vhigh);

 I figured out sort of a solution, but it seems way too 
 complicated, because there is only signed comparison.

 Usually (scalar) I'd use this, which makes use of unsigned wrap 
 to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin
Dec 08 2014
next sibling parent "ponce" <contact gam3sfrommars.fr> writes:
On Monday, 8 December 2014 at 16:57:20 UTC, ponce wrote:
 Using the fact that a negative 64-bit integer is also a 
 negative 32-bit integer when truncated,
Oops. This assumption is wrong.
Dec 08 2014
prev sibling parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 12/08/2014 05:57 PM, ponce wrote:
 This doesn't work if vhigh - vlow spans a too large area.
It won't unless you allocate more than 4GB of RAM.
Dec 08 2014
prev sibling next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
 a time.

 ulong2 vval = [v0, v1];
 ulong2 vlow = [low, low];
 ulong2 vhigh = [high, high];

 int res = PMOVMSKB(vval >= vlow & vval < vhigh);

 I figured out sort of a solution, but it seems way too 
 complicated, because there is only signed comparison.

 Usually (scalar) I'd use this, which makes use of unsigned wrap 
 to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin
Well gcc gives me: typedef unsigned long ulong4 __attribute__ ((vector_size (32))); ulong4 foo(ulong4 a, ulong4 l, ulong4 h) { return (a >= l) & (a < h); } foo(unsigned long __vector, unsigned long __vector, unsigned long __vector): vmovdqa .LC0(%rip), %ymm3 vpsubq %ymm3, %ymm0, %ymm0 vpsubq %ymm3, %ymm2, %ymm2 vpsubq %ymm3, %ymm1, %ymm1 vpcmpgtq %ymm0, %ymm2, %ymm2 vpcmpgtq %ymm0, %ymm1, %ymm1 vpandn %ymm2, %ymm1, %ymm0 ret .LC0: .quad -9223372036854775808 .quad -9223372036854775808 .quad -9223372036854775808 .quad -9223372036854775808
Dec 08 2014
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 8 December 2014 at 17:05:09 UTC, John Colvin wrote:
 On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) 
 at a time.

 ulong2 vval = [v0, v1];
 ulong2 vlow = [low, low];
 ulong2 vhigh = [high, high];

 int res = PMOVMSKB(vval >= vlow & vval < vhigh);

 I figured out sort of a solution, but it seems way too 
 complicated, because there is only signed comparison.

 Usually (scalar) I'd use this, which makes use of unsigned 
 wrap to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin
Well gcc gives me: typedef unsigned long ulong4 __attribute__ ((vector_size (32))); ulong4 foo(ulong4 a, ulong4 l, ulong4 h) { return (a >= l) & (a < h); } foo(unsigned long __vector, unsigned long __vector, unsigned long __vector): vmovdqa .LC0(%rip), %ymm3 vpsubq %ymm3, %ymm0, %ymm0 vpsubq %ymm3, %ymm2, %ymm2 vpsubq %ymm3, %ymm1, %ymm1 vpcmpgtq %ymm0, %ymm2, %ymm2 vpcmpgtq %ymm0, %ymm1, %ymm1 vpandn %ymm2, %ymm1, %ymm0 ret .LC0: .quad -9223372036854775808 .quad -9223372036854775808 .quad -9223372036854775808 .quad -9223372036854775808
To conceptually get what it's doing here, the trick is that it's offsetting the values so as to simulate unsigned comparisons using signed instructions.
Dec 08 2014
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 12/08/2014 06:20 PM, John Colvin wrote:
 To conceptually get what it's doing here, the trick is that it's
 offsetting the values so as to simulate unsigned comparisons using
 signed instructions.
All too easy, but would've taken me a pen and paper to realize :).
Dec 08 2014
parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 8 December 2014 at 23:40:21 UTC, Martin Nowak wrote:
 On 12/08/2014 06:20 PM, John Colvin wrote:
 To conceptually get what it's doing here, the trick is that 
 it's
 offsetting the values so as to simulate unsigned comparisons 
 using
 signed instructions.
All too easy, but would've taken me a pen and paper to realize :).
I wouldn't have seen it so easily but I've spent the last few weeks doing discrete Fourier analysis, so I've got very used to spotting modulo arithmetic tricks of exactly this kind.
Dec 08 2014
prev sibling parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 12/08/2014 06:05 PM, John Colvin wrote:
 Well gcc gives me:
Tried that with dmd, it gave me. bug.d(5): Error: incompatible types for ((a) >= (l)): '__vector(ulong[4])' and '__vector(ulong[4])' bug.d(5): Error: incompatible types for ((a) < (h)): '__vector(ulong[4])' and '__vector(ulong[4])' Yours looks better.
Dec 08 2014
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 8 December 2014 at 22:10:09 UTC, Martin Nowak wrote:
 On 12/08/2014 06:05 PM, John Colvin wrote:
 Well gcc gives me:
Tried that with dmd, it gave me. bug.d(5): Error: incompatible types for ((a) >= (l)): '__vector(ulong[4])' and '__vector(ulong[4])' bug.d(5): Error: incompatible types for ((a) < (h)): '__vector(ulong[4])' and '__vector(ulong[4])' Yours looks better.
It's unfortunate it can't work it out automatically like gcc can. Anyway, you can write it out manually: auto foo(ulong2 a, ulong2 l, ulong h) { static long2 shift = long.min; //-2^63 long2 aS = cast(long2)(a - shift); long2 lS = cast(long2)(l - shift); long2 hS = cast(long2)(h - shift); return (hS > aS) & (lS > aS); } but that doesn't actually compile, the compiler just sits in semantic3 forever (see https://issues.dlang.org/show_bug.cgi?id=13841)
Dec 09 2014
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 9 December 2014 at 12:36:29 UTC, John Colvin wrote:
 On Monday, 8 December 2014 at 22:10:09 UTC, Martin Nowak wrote:
 On 12/08/2014 06:05 PM, John Colvin wrote:
 Well gcc gives me:
Tried that with dmd, it gave me. bug.d(5): Error: incompatible types for ((a) >= (l)): '__vector(ulong[4])' and '__vector(ulong[4])' bug.d(5): Error: incompatible types for ((a) < (h)): '__vector(ulong[4])' and '__vector(ulong[4])' Yours looks better.
It's unfortunate it can't work it out automatically like gcc can. Anyway, you can write it out manually: auto foo(ulong2 a, ulong2 l, ulong h) { static long2 shift = long.min; //-2^63 long2 aS = cast(long2)(a - shift); long2 lS = cast(long2)(l - shift); long2 hS = cast(long2)(h - shift); return (hS > aS) & (lS > aS); } but that doesn't actually compile, the compiler just sits in semantic3 forever (see https://issues.dlang.org/show_bug.cgi?id=13841)
which of course Kenji already has a pull for, less than 3 hours later :)
Dec 09 2014
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 12/09/2014 05:22 PM, John Colvin wrote:
 which of course Kenji already has a pull for, less than 3 hours later :)
It's right on time, it's right on time
Dec 10 2014
parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Wednesday, 10 December 2014 at 15:53:40 UTC, Martin Nowak 
wrote:
 On 12/09/2014 05:22 PM, John Colvin wrote:
 which of course Kenji already has a pull for, less than 3 
 hours later :)
It's right on time, it's right on time
Still doesn't work though: https://issues.dlang.org/show_bug.cgi?id=13852 you can't even use core.imd.__simd because it doesn't have PCMPGTQ https://issues.dlang.org/show_bug.cgi?id=8047
Dec 10 2014
prev sibling parent reply "ponce" <contact gam3sfrommars.fr> writes:
On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
 a time.

 ulong2 vval = [v0, v1];
 ulong2 vlow = [low, low];
 ulong2 vhigh = [high, high];

 int res = PMOVMSKB(vval >= vlow & vval < vhigh);

 I figured out sort of a solution, but it seems way too 
 complicated, because there is only signed comparison.

 Usually (scalar) I'd use this, which makes use of unsigned wrap 
 to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin
Another solution, in SSE pseudo-code: values <- [vval0, vval1] // two value to bound-check, would be 4 in AVX low <- [vlow, vlow] high <- [vhigh, vhigh] subl <- values - low // using the PSUBQ instruction (SSE2) subh <- values - high // using the PSUBQ instruction (SSE2) mask <- subh andnot subl // using the PANDN instruction (SSE2) // only logical shift is available for 64-bit integers until AVX-512 // using PSRLQ to shift the sign bit to the right (SSE2) vresult <- mask >> 31 (for each dword) Now, each 64-bit word in vresult contains exactly 1 if bound were checked, or 0 else.
Dec 08 2014
parent "ponce" <contact gam3sfrommars.fr> writes:
On Monday, 8 December 2014 at 22:37:08 UTC, ponce wrote:
   vresult <- mask >> 31 (for each dword)
Correction: vresult <- mask >> 63 (for each qword)
Dec 08 2014