digitalmars.D - Any SIMD experts?

Martin Nowak (16/16) Dec 08 2014 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time.

Martin Nowak (12/21) Dec 08 2014 This is for the most performance critical instructions during GC

Etienne (11/20) Dec 08 2014 Are you using it for the binary search part (find pool) ?

Martin Nowak (3/4) Dec 08 2014 Nope the binary search is surprisingly fine, but I had it on my list of

"Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= (3/6) Dec 09 2014 Consider rewriting the code to use non-branching comparison, i.e.

John Colvin (2/19) Dec 08 2014 I don't quite understand what you mean by "save one conditional".

Martin Nowak (12/13) Dec 08 2014 Instead of

ponce (10/27) Dec 08 2014 Using the fact that a negative 64-bit integer is also a negative

ponce (2/4) Dec 08 2014 Oops. This assumption is wrong.
Martin Nowak (2/3) Dec 08 2014 It won't unless you allocate more than 4GB of RAM.

John Colvin (22/39) Dec 08 2014 Well gcc gives me:

John Colvin (4/52) Dec 08 2014 To conceptually get what it's doing here, the trick is that it's

Martin Nowak (2/5) Dec 08 2014 All too easy, but would've taken me a pen and paper to realize :).

John Colvin (4/13) Dec 08 2014 I wouldn't have seen it so easily but I've spent the last few

Martin Nowak (7/8) Dec 08 2014 Tried that with dmd, it gave me.

John Colvin (14/22) Dec 09 2014 It's unfortunate it can't work it out automatically like gcc can.

John Colvin (3/29) Dec 09 2014 which of course Kenji already has a pull for, less than 3 hours

Martin Nowak (2/3) Dec 10 2014 It's right on time, it's right on time

John Colvin (6/11) Dec 10 2014 Still doesn't work though:

ponce (15/32) Dec 08 2014 Another solution, in SSE pseudo-code:

ponce (3/4) Dec 08 2014 Correction:

Martin Nowak <code+news.digitalmars dawg.eu> writes:

I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at a time.

ulong2 vval = [v0, v1];
ulong2 vlow = [low, low];
ulong2 vhigh = [high, high];

int res = PMOVMSKB(vval >= vlow & vval < vhigh);

I figured out sort of a solution, but it seems way too complicated, 
because there is only signed comparison.

Usually (scalar) I'd use this, which makes use of unsigned wrap to safe 
one conditional

immutable size = cast(ulong)(vhigh - vlow);
if (cast(ulong)(v0 - vlow) < size) {}
if (cast(ulong)(v1 - vlow) < size) {}

over

if (v0 >= vlow && v0 < vhigh) {}

Maybe this can be used on SIMD too (saturated sub or so)?

-Martin

Dec 08 2014

"Martin Nowak" <code dawg.eu> writes:

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 Usually (scalar) I'd use this, which makes use of unsigned wrap 
 to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin

This is for the most performance critical instructions during GC 
marking.
If we can come up with some good SIMD this will result in a good 
speedup.

Yesterday I was surprised to learn that my unsigned wrap trick 
actually slowed down some GC benchmarks by 3%. The branch 
predictor had more trouble with the single branch because that 
resulted in a fifty-fifty chance. There is some correlation 
between the 2 branch bounds checks and one of them could be 
predicted fairly well, resulting in a better combined result.
Always profile!

Dec 08 2014

Etienne <etcimon gmail.com> writes:

On 2014-12-08 11:44 AM, Martin Nowak wrote:
 This is for the most performance critical instructions during GC marking.
 If we can come up with some good SIMD this will result in a good speedup.

 Yesterday I was surprised to learn that my unsigned wrap trick actually
 slowed down some GC benchmarks by 3%. The branch predictor had more
 trouble with the single branch because that resulted in a fifty-fifty
 chance. There is some correlation between the 2 branch bounds checks and
 one of them could be predicted fairly well, resulting in a better
 combined result.
 Always profile!

Are you using it for the binary search part (find pool) ?

http://cs.nyu.edu/~lerner/spring12/Preso06-SIMDTree.pdf

The savings are would be best when the tree (pool info?) is under 64KB, 
savings being up to 30%. It could be worth doing it. Here's a quick 
solution:

http://stackoverflow.com/questions/20616605/using-simd-avx-sse-for-tree-traversal

Most of the code circulating on the web uses the _mm256 or _mm128 SIMD 
format, which I ported to multiple platforms already and I can share it 
through boost license too if you want:

https://github.com/etcimon/botan/blob/master/source/botan/utils/simd/immintrin.d

Dec 08 2014

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 12/08/2014 06:21 PM, Etienne wrote:
 Are you using it for the binary search part (find pool) ?

Nope the binary search is surprisingly fine, but I had it on my list of 
subjects for quite a while too.

Dec 08 2014

"Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:

On Monday, 8 December 2014 at 16:44:37 UTC, Martin Nowak wrote:
 actually slowed down some GC benchmarks by 3%. The branch 
 predictor had more trouble with the single branch because that 
 resulted in a fifty-fifty chance. There is some correlation

Consider rewriting the code to use non-branching comparison, i.e. 
use the condition test result directly in an expression or cmov.

Dec 09 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
 a time.

 ulong2 vval = [v0, v1];
 ulong2 vlow = [low, low];
 ulong2 vhigh = [high, high];

 int res = PMOVMSKB(vval >= vlow & vval < vhigh);

 I figured out sort of a solution, but it seems way too 
 complicated, because there is only signed comparison.

 Usually (scalar) I'd use this, which makes use of unsigned wrap 
 to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin

I don't quite understand what you mean by "save one conditional".

Dec 08 2014

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 12/08/2014 05:55 PM, John Colvin wrote:
 I don't quite understand what you mean by "save one conditional".

Instead of
   if (v >= low && v < high)
it's
   if (cast(size_t)(v - low) < size)

So there is only one branch.
There are other ways to eliminate one branch, for example this.

   if ((v >= low) & (v < high))

But the interesting point was that the second branch actually helped the 
branch predictor. Because it seems to know that when the first branch is 
true the second will likely be true as well, respectively if it was 
false the 2nd one will be as well.

Dec 08 2014

"ponce" <contact gam3sfrommars.fr> writes:

Using the fact that a negative 64-bit integer is also a negative 
32-bit integer when truncated, you could use 2x SSE2 32-bit 
comparison.
- one for [v0 , v1] - [vlow, vlow] compared to [0, 0] (result 
should be >= 0, signed comparison)
- one for [v0 , v1] - [vhigh, vhigh] compared to [0, 0] (result 
should be < 0, signed comparison)

Then apply logical and on the masks.
This doesn't work if vhigh - vlow spans a too large area.

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
 a time.

 ulong2 vval = [v0, v1];
 ulong2 vlow = [low, low];
 ulong2 vhigh = [high, high];

 int res = PMOVMSKB(vval >= vlow & vval < vhigh);

 I figured out sort of a solution, but it seems way too 
 complicated, because there is only signed comparison.

 Usually (scalar) I'd use this, which makes use of unsigned wrap 
 to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin

Dec 08 2014

"ponce" <contact gam3sfrommars.fr> writes:

On Monday, 8 December 2014 at 16:57:20 UTC, ponce wrote:
 Using the fact that a negative 64-bit integer is also a 
 negative 32-bit integer when truncated,

Oops. This assumption is wrong.

Dec 08 2014

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 12/08/2014 05:57 PM, ponce wrote:
 This doesn't work if vhigh - vlow spans a too large area.

It won't unless you allocate more than 4GB of RAM.

Dec 08 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
 a time.

 ulong2 vval = [v0, v1];
 ulong2 vlow = [low, low];
 ulong2 vhigh = [high, high];

 int res = PMOVMSKB(vval >= vlow & vval < vhigh);

 I figured out sort of a solution, but it seems way too 
 complicated, because there is only signed comparison.

 Usually (scalar) I'd use this, which makes use of unsigned wrap 
 to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin

Well gcc gives me:

typedef unsigned long ulong4 __attribute__ ((vector_size (32)));

ulong4 foo(ulong4 a, ulong4 l, ulong4 h)
{
     return (a >= l) & (a < h);
}


foo(unsigned long __vector, unsigned long __vector, unsigned long 
__vector):
	vmovdqa	.LC0(%rip), %ymm3
	vpsubq	%ymm3, %ymm0, %ymm0
	vpsubq	%ymm3, %ymm2, %ymm2
	vpsubq	%ymm3, %ymm1, %ymm1
	vpcmpgtq	%ymm0, %ymm2, %ymm2
	vpcmpgtq	%ymm0, %ymm1, %ymm1
	vpandn	%ymm2, %ymm1, %ymm0
	ret
.LC0:
	.quad	-9223372036854775808
	.quad	-9223372036854775808
	.quad	-9223372036854775808
	.quad	-9223372036854775808

Dec 08 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 8 December 2014 at 17:05:09 UTC, John Colvin wrote:
 On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) 
 at a time.

 ulong2 vval = [v0, v1];
 ulong2 vlow = [low, low];
 ulong2 vhigh = [high, high];

 int res = PMOVMSKB(vval >= vlow & vval < vhigh);

 I figured out sort of a solution, but it seems way too 
 complicated, because there is only signed comparison.

 Usually (scalar) I'd use this, which makes use of unsigned 
 wrap to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin

 Well gcc gives me:

 typedef unsigned long ulong4 __attribute__ ((vector_size (32)));

 ulong4 foo(ulong4 a, ulong4 l, ulong4 h)
 {
     return (a >= l) & (a < h);
 }


 foo(unsigned long __vector, unsigned long __vector, unsigned 
 long __vector):
 	vmovdqa	.LC0(%rip), %ymm3
 	vpsubq	%ymm3, %ymm0, %ymm0
 	vpsubq	%ymm3, %ymm2, %ymm2
 	vpsubq	%ymm3, %ymm1, %ymm1
 	vpcmpgtq	%ymm0, %ymm2, %ymm2
 	vpcmpgtq	%ymm0, %ymm1, %ymm1
 	vpandn	%ymm2, %ymm1, %ymm0
 	ret
 .LC0:
 	.quad	-9223372036854775808
 	.quad	-9223372036854775808
 	.quad	-9223372036854775808
 	.quad	-9223372036854775808

To conceptually get what it's doing here, the trick is that it's 
offsetting the values so as to simulate unsigned comparisons 
using signed instructions.

Dec 08 2014

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 12/08/2014 06:20 PM, John Colvin wrote:
 To conceptually get what it's doing here, the trick is that it's
 offsetting the values so as to simulate unsigned comparisons using
 signed instructions.

All too easy, but would've taken me a pen and paper to realize :).

Dec 08 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 8 December 2014 at 23:40:21 UTC, Martin Nowak wrote:
 On 12/08/2014 06:20 PM, John Colvin wrote:
 To conceptually get what it's doing here, the trick is that 
 it's
 offsetting the values so as to simulate unsigned comparisons 
 using
 signed instructions.

 All too easy, but would've taken me a pen and paper to realize 
 :).

I wouldn't have seen it so easily but I've spent the last few 
weeks doing discrete Fourier analysis, so I've got very used to 
spotting modulo arithmetic tricks of exactly this kind.

Dec 08 2014

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 12/08/2014 06:05 PM, John Colvin wrote:
 Well gcc gives me:

Tried that with dmd, it gave me.

bug.d(5): Error: incompatible types for ((a) >= (l)): 
'__vector(ulong[4])' and '__vector(ulong[4])'
bug.d(5): Error: incompatible types for ((a) < (h)): 
'__vector(ulong[4])' and '__vector(ulong[4])'

Yours looks better.

Dec 08 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Monday, 8 December 2014 at 22:10:09 UTC, Martin Nowak wrote:
 On 12/08/2014 06:05 PM, John Colvin wrote:
 Well gcc gives me:

 Tried that with dmd, it gave me.

 bug.d(5): Error: incompatible types for ((a) >= (l)): 
 '__vector(ulong[4])' and '__vector(ulong[4])'
 bug.d(5): Error: incompatible types for ((a) < (h)): 
 '__vector(ulong[4])' and '__vector(ulong[4])'

 Yours looks better.

It's unfortunate it can't work it out automatically like gcc can.

Anyway, you can write it out manually:

auto foo(ulong2 a, ulong2 l, ulong h)
{
     static long2 shift = long.min; //-2^63
     long2 aS = cast(long2)(a - shift);
     long2 lS = cast(long2)(l - shift);
     long2 hS = cast(long2)(h - shift);
     return (hS > aS) & (lS > aS);
}

but that doesn't actually compile, the compiler just sits in 
semantic3 forever (see 
https://issues.dlang.org/show_bug.cgi?id=13841)

Dec 09 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Tuesday, 9 December 2014 at 12:36:29 UTC, John Colvin wrote:
 On Monday, 8 December 2014 at 22:10:09 UTC, Martin Nowak wrote:
 On 12/08/2014 06:05 PM, John Colvin wrote:
 Well gcc gives me:

 Tried that with dmd, it gave me.

 bug.d(5): Error: incompatible types for ((a) >= (l)): 
 '__vector(ulong[4])' and '__vector(ulong[4])'
 bug.d(5): Error: incompatible types for ((a) < (h)): 
 '__vector(ulong[4])' and '__vector(ulong[4])'

 Yours looks better.

 It's unfortunate it can't work it out automatically like gcc 
 can.

 Anyway, you can write it out manually:

 auto foo(ulong2 a, ulong2 l, ulong h)
 {
     static long2 shift = long.min; //-2^63
     long2 aS = cast(long2)(a - shift);
     long2 lS = cast(long2)(l - shift);
     long2 hS = cast(long2)(h - shift);
     return (hS > aS) & (lS > aS);
 }

 but that doesn't actually compile, the compiler just sits in 
 semantic3 forever (see 
 https://issues.dlang.org/show_bug.cgi?id=13841)

which of course Kenji already has a pull for, less than 3 hours 
later :)

Dec 09 2014

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 12/09/2014 05:22 PM, John Colvin wrote:
 which of course Kenji already has a pull for, less than 3 hours later :)

It's right on time, it's right on time

Dec 10 2014

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Wednesday, 10 December 2014 at 15:53:40 UTC, Martin Nowak 
wrote:
 On 12/09/2014 05:22 PM, John Colvin wrote:
 which of course Kenji already has a pull for, less than 3 
 hours later :)

 It's right on time, it's right on time

Still doesn't work though: 
https://issues.dlang.org/show_bug.cgi?id=13852

you can't even use core.imd.__simd because it doesn't have 
PCMPGTQ https://issues.dlang.org/show_bug.cgi?id=8047

Dec 10 2014

"ponce" <contact gam3sfrommars.fr> writes:

On Monday, 8 December 2014 at 16:32:50 UTC, Martin Nowak wrote:
 I want to do bounds checking of 2 (4 on avx) ulongs (64-bit) at 
 a time.

 ulong2 vval = [v0, v1];
 ulong2 vlow = [low, low];
 ulong2 vhigh = [high, high];

 int res = PMOVMSKB(vval >= vlow & vval < vhigh);

 I figured out sort of a solution, but it seems way too 
 complicated, because there is only signed comparison.

 Usually (scalar) I'd use this, which makes use of unsigned wrap 
 to safe one conditional

 immutable size = cast(ulong)(vhigh - vlow);
 if (cast(ulong)(v0 - vlow) < size) {}
 if (cast(ulong)(v1 - vlow) < size) {}

 over

 if (v0 >= vlow && v0 < vhigh) {}

 Maybe this can be used on SIMD too (saturated sub or so)?

 -Martin

Another solution, in SSE pseudo-code:

   values <- [vval0, vval1] // two value to bound-check, would be 
4 in AVX

   low <- [vlow, vlow]
   high <- [vhigh, vhigh]

   subl <- values - low // using the PSUBQ instruction (SSE2)
   subh <- values - high // using the PSUBQ instruction (SSE2)
   mask <- subh andnot subl // using the PANDN instruction (SSE2)

   // only logical shift is available for 64-bit integers until 
AVX-512
   // using PSRLQ to shift the sign bit to the right (SSE2)
   vresult <- mask >> 31 (for each dword)


Now, each 64-bit word in vresult contains exactly 1 if bound were 
checked, or 0 else.

Dec 08 2014

"ponce" <contact gam3sfrommars.fr> writes:

On Monday, 8 December 2014 at 22:37:08 UTC, ponce wrote:
   vresult <- mask >> 31 (for each dword)

Correction:

    vresult <- mask >> 63 (for each qword)

Dec 08 2014

D Programming

C/C++ Programming

Other

digitalmars.D - Any SIMD experts?