
digitalmars.D.announce - SIMD implementation of dot-product. Benchmarks

reply "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

Ilya
Aug 17 2013
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 17 August 2013 at 18:50:15 UTC, Ilya Yaroshenko 
wrote:
 http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

 Ilya
Nice, that's a good speedup. BTW: -march=native automatically implies -mtune=native
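For instance (assuming GDC; the file name and the flags other than -march=native are illustrative):

  gdc -O3 -frelease -march=native dotproduct.d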
Aug 17 2013
parent reply "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
 BTW: -march=native automatically implies -mtune=native
Thanks, I'll remove mtune)
Aug 17 2013
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko 
wrote:
 BTW: -march=native automatically implies -mtune=native
Thanks, I'll remove mtune)
It would be really interesting if you could try writing the same code in C, both a scalar version and a version using gcc's vector intrinsics, to allow us to compare performance and identify areas for D to improve.
Aug 17 2013
parent reply "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:
 On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko 
 wrote:
 BTW: -march=native automatically implies -mtune=native
Thanks, I'll remove mtune)
It would be really interesting if you could try writing the same code in C, both a scalar version and a version using gcc's vector intrinsics, to allow us to compare performance and identify areas for D to improve.
I am lazy ) I have looked at the assembler code:

float, scalar (main loop):

.L191:
        vmovss  xmm1, DWORD PTR [rsi+rax*4]
        vfmadd231ss     xmm0, xmm1, DWORD PTR [rcx+rax*4]
        add     rax, 1
        cmp     rax, rdi
        jne     .L191

float, vector (main loop):

.L2448:
        vmovups ymm5, YMMWORD PTR [rax]
        sub     rax, -128
        sub     r11, -128
        vmovups ymm4, YMMWORD PTR [r11-128]
        vmovups ymm6, YMMWORD PTR [rax-96]
        vmovups ymm7, YMMWORD PTR [r11-96]
        vfmadd231ps     ymm3, ymm5, ymm4
        vmovups ymm8, YMMWORD PTR [rax-64]
        vmovups ymm9, YMMWORD PTR [r11-64]
        vfmadd231ps     ymm0, ymm6, ymm7
        vmovups ymm10, YMMWORD PTR [rax-32]
        vmovups ymm11, YMMWORD PTR [r11-32]
        cmp     rdi, rax
        vfmadd231ps     ymm2, ymm8, ymm9
        vfmadd231ps     ymm1, ymm10, ymm11
        ja      .L2448

float, vector (full): https://gist.github.com/9il/6258443

It is pretty optimized)

____
Best regards
Ilya
Aug 17 2013
parent reply Manu <turkeyman gmail.com> writes:
movups is not good. It'll be a lot faster (and portable) if you use movaps.

Process looks something like:
  * do the first few from a[0] until a's alignment interval as scalar
  * load the left of b's aligned pair
  * loop for each aligned vector in a
    - load a[n..n+4] aligned
    - load the right of b's pair
    - combine left~right and shift left to match elements against a
    - left = right
  * perform stragglers as scalar

Your benchmark is probably misleading too, because I suspect you are
passing directly alloc-ed arrays into the function (which are 16 byte
aligned).
movups will be significantly slower if the pointers supplied are not 16
byte aligned.
Also, results vary significantly between chip manufacturers and revisions.
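A minimal D sketch of that structure, for the simpler case where a and b happen to share the same 16-byte offset (the function name dotAligned is hypothetical, and the combine-and-shift step for mismatched offsets is deliberately left out):

import core.simd;

float dotAligned(const(float)[] a, const(float)[] b)
{
    assert(a.length == b.length);
    size_t i = 0;
    float head = 0;

    // 1) scalar head: walk until a.ptr + i is 16-byte aligned
    while (i < a.length && (cast(size_t)(a.ptr + i) & 15) != 0)
    {
        head += a[i] * b[i];
        ++i;
    }

    // 2) aligned vector body: the dereference compiles to an aligned
    //    load (movaps), so this step assumes b shares a's offset
    float4 acc = 0;
    for (; i + 4 <= a.length; i += 4)
    {
        float4 va = *cast(const(float4)*)(a.ptr + i);
        float4 vb = *cast(const(float4)*)(b.ptr + i);
        acc += va * vb;
    }

    // 3) scalar stragglers
    float tail = 0;
    for (; i < a.length; ++i)
        tail += a[i] * b[i];

    return head + acc.array[0] + acc.array[1]
                + acc.array[2] + acc.array[3] + tail;
}

To exercise the misaligned paths in a benchmark, pass offset slices such as dotAligned(a[1 .. $], b[1 .. $]) rather than freshly allocated arrays, which are typically 16-byte aligned.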


On 18 August 2013 14:55, Ilya Yaroshenko <ilyayaroshenko gmail.com> wrote:

 On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:

 On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko wrote:

 BTW: -march=native automatically implies -mtune=native

 Thanks, I'll remove mtune)
It would be really interesting if you could try writing the same code in C, both a scalar version and a version using gcc's vector intrinsics, to allow us to compare performance and identify areas for D to improve.
I am lazy ) I have looked at the assembler code:

float, scalar (main loop):

.L191:
        vmovss  xmm1, DWORD PTR [rsi+rax*4]
        vfmadd231ss     xmm0, xmm1, DWORD PTR [rcx+rax*4]
        add     rax, 1
        cmp     rax, rdi
        jne     .L191

float, vector (main loop):

.L2448:
        vmovups ymm5, YMMWORD PTR [rax]
        sub     rax, -128
        sub     r11, -128
        vmovups ymm4, YMMWORD PTR [r11-128]
        vmovups ymm6, YMMWORD PTR [rax-96]
        vmovups ymm7, YMMWORD PTR [r11-96]
        vfmadd231ps     ymm3, ymm5, ymm4
        vmovups ymm8, YMMWORD PTR [rax-64]
        vmovups ymm9, YMMWORD PTR [r11-64]
        vfmadd231ps     ymm0, ymm6, ymm7
        vmovups ymm10, YMMWORD PTR [rax-32]
        vmovups ymm11, YMMWORD PTR [r11-32]
        cmp     rdi, rax
        vfmadd231ps     ymm2, ymm8, ymm9
        vfmadd231ps     ymm1, ymm10, ymm11
        ja      .L2448

float, vector (full): https://gist.github.com/9il/6258443

It is pretty optimized)

____
Best regards
Ilya
Aug 17 2013
next sibling parent "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
 movups is not good. It'll be a lot faster (and portable) if you 
 use movaps.

 Process looks something like:
   * do the first few from a[0] until a's alignment interval as 
 scalar
   * load the left of b's aligned pair
   * loop for each aligned vector in a
     - load a[n..n+4] aligned
     - load the right of b's pair
     - combine left~right and shift left to match elements 
 against a
     - left = right
   * perform stragglers as scalar

 Your benchmark is probably misleading too, because I suspect 
 you are
 passing directly alloc-ed arrays into the function (which are 
 16 byte
 aligned).
 movups will be significantly slower if the pointers supplied 
 are not 16
 byte aligned.
 Also, results vary significantly between chip manufacturers and 
 revisions.
I'll try =). Thank you very much!
Aug 17 2013
prev sibling parent reply "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:
 movups is not good. It'll be a lot faster (and portable) if you 
 use movaps.

 Process looks something like:
   * do the first few from a[0] until a's alignment interval as 
 scalar
   * load the left of b's aligned pair
   * loop for each aligned vector in a
     - load a[n..n+4] aligned
     - load the right of b's pair
     - combine left~right and shift left to match elements 
 against a
     - left = right
   * perform stragglers as scalar

 Your benchmark is probably misleading too, because I suspect 
 you are
 passing directly alloc-ed arrays into the function (which are 
 16 byte
 aligned).
 movups will be significantly slower if the pointers supplied 
 are not 16
 byte aligned.
 Also, results vary significantly between chip manufacturers and 
 revisions.
I have tried to write a fast implementation with aligned loads:

1. I have no idea how to shift (rotate) a 32-byte AVX vector without the XOP instruction set (XOP is available only on AMD).

2. I have tried to use one vmovaps and [one vmovups]/[two vinsertf128] with 16-byte aligned arrays (previously iterating over a). It works slower than two vmovups (because of loop tricks). Now I have 300 lines of slow dotProduct code =)

4. The condition for small arrays works well.

I think it is better to use:

1. vmovups, if it is available, with a condition for small arrays
2. a version like the one from Phobos if vmovups is not available
3. a special version for small static-size arrays

I think the version for static-size arrays can easily be done for Phobos; processors can unroll such code. And a dot product optimized for complex numbers can be done too.

Best regards
Ilya
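The last suggestion in the list above (a special version for small static-size arrays) could look roughly like this sketch (the template name dotStatic is hypothetical); with the trip count known at compile time, the compiler is free to fully unroll and vectorize the loop:

F dotStatic(F, size_t N)(ref const F[N] a, ref const F[N] b)
{
    F s = 0;
    foreach (i; 0 .. N) // trip count is a compile-time constant
        s += a[i] * b[i];
    return s;
}

// usage: F and N are deduced from the static array type
// float[4] x = [1, 2, 3, 4], y = [4, 3, 2, 1];
// float r = dotStatic(x, y);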
Aug 24 2013
parent Manu <turkeyman gmail.com> writes:
On 25 August 2013 01:01, Ilya Yaroshenko <ilyayaroshenko gmail.com> wrote:

 On Sunday, 18 August 2013 at 05:26:00 UTC, Manu wrote:

 movups is not good. It'll be a lot faster (and portable) if you use
 movaps.

 Process looks something like:
   * do the first few from a[0] until a's alignment interval as scalar
   * load the left of b's aligned pair
   * loop for each aligned vector in a
     - load a[n..n+4] aligned
     - load the right of b's pair
     - combine left~right and shift left to match elements against a
     - left = right
   * perform stragglers as scalar

 Your benchmark is probably misleading too, because I suspect you are
 passing directly alloc-ed arrays into the function (which are 16 byte
 aligned).
 movups will be significantly slower if the pointers supplied are not 16
 byte aligned.
 Also, results vary significantly between chip manufacturers and revisions.
I have tried to write a fast implementation with aligned loads:

1. I have no idea how to shift (rotate) a 32-byte AVX vector without the XOP instruction set (XOP is available only on AMD).

2. I have tried to use one vmovaps and [one vmovups]/[two vinsertf128] with 16-byte aligned arrays (previously iterating over a). It works slower than two vmovups (because of loop tricks). Now I have 300 lines of slow dotProduct code =)
This of course depends largely on your processor too. What processor/revision? There is a massive difference between vendors and revisions.
 4. The condition for small arrays works well.
Did you try putting this path selection logic in an outer function that the compiler can inline? Where's your new code with the movaps solution?
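A sketch of that outer-function shape (all names and the cutoff value are hypothetical; dotSimd stands in for a vector kernel such as the one discussed above):

float dotScalar(const(float)[] a, const(float)[] b); // plain scalar loop (not shown)
float dotSimd(const(float)[] a, const(float)[] b);   // vector kernel (not shown)

// Thin wrapper: a cheap length test the compiler can inline into
// callers, so small arrays skip the SIMD setup entirely.
float dot(const(float)[] a, const(float)[] b)
{
    enum simdThreshold = 32; // illustrative cutoff; needs tuning per CPU
    return a.length < simdThreshold ? dotScalar(a, b) : dotSimd(a, b);
}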
Aug 25 2013
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Ilya Yaroshenko:
 http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
From the blog post:
 Compile fast_math code from other program separately and then 
 link it. This is easy solution. However this is a step back to 
 C.<
 To introduce a  fast_math attribute. This is hard to realize. 
 But I hope this will be done for future compilers.<
One solution is to copy one of the features of Lisp, that is, to offer an annotation that specifies different compilation switches for individual functions. It has been present in GNU C for some time too: http://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html#Function-Specific-Option-Pragmas

Bye,
bearophile
Aug 17 2013
prev sibling next sibling parent reply Manu <turkeyman gmail.com> writes:
It doesn't look like you account for alignment.
This is basically not-portable (I doubt unaligned loads in this context are
faster than performing scalar operations), and possibly inefficient on x86
too.
To make it account for potentially random alignment will be awkward, but it
might be possible to do efficiently.


On 18 August 2013 04:50, Ilya Yaroshenko <ilyayaroshenko gmail.com> wrote:

 http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

 Ilya
Aug 17 2013
parent reply "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
 It doesn't look like you account for alignment.
 This is basically not-portable (I doubt unaligned loads in this 
 context are
 faster than performing scalar operations), and possibly 
 inefficient on x86
 too.
dotProduct uses unaligned loads (__builtin_ia32_loadups256, __builtin_ia32_loadupd256) and it is up to 21 times faster than the trivial scalar version. Why are unaligned loads non-portable and inefficient?
 To make it account for potentially random alignment will be 
 awkward, but it
 might be possible to do efficiently.
Did you mean to use unaligned loads, or to prepare the data for aligned loads at the beginning of the function?
Aug 17 2013
parent reply Manu <turkeyman gmail.com> writes:
On 18 August 2013 14:39, Ilya Yaroshenko <ilyayaroshenko gmail.com> wrote:

 On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:

 It doesn't look like you account for alignment.
 This is basically not-portable (I doubt unaligned loads in this context
 are
 faster than performing scalar operations), and possibly inefficient on x86
 too.
dotProduct uses unaligned loads (__builtin_ia32_loadups256, __builtin_ia32_loadupd256) and it is up to 21 times faster than the trivial scalar version. Why are unaligned loads non-portable and inefficient?
x86 is the only arch that can perform an unaligned load. And even on x86
(many implementations) it's not very efficient.

 To make it account for potentially random alignment will be awkward, but it
 might be possible to do efficiently.
 Did you mean to use unaligned loads, or to prepare the data for aligned loads at the beginning of the function?
I mean to only use aligned loads, in whatever way that happens to work out. The hard case is when the 2 arrays have different start offsets. Otherwise you need to wrap your code in a version(x86) block.
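A sketch of such a version block (X86 and X86_64 are D's real predefined version identifiers; the kernel names are hypothetical):

float dotSimdUnaligned(const(float)[] a, const(float)[] b); // movups-based kernel (not shown)
float dotScalar(const(float)[] a, const(float)[] b);        // portable fallback (not shown)

float dot(const(float)[] a, const(float)[] b)
{
    version (X86_64)
        return dotSimdUnaligned(a, b); // unaligned loads are fine here
    else version (X86)
        return dotSimdUnaligned(a, b);
    else
        return dotScalar(a, b);        // other archs: no unaligned loads
}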
Aug 17 2013
parent "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
On Sunday, 18 August 2013 at 05:07:12 UTC, Manu wrote:
 On 18 August 2013 14:39, Ilya Yaroshenko 
 <ilyayaroshenko gmail.com> wrote:

 On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:

 It doesn't look like you account for alignment.
 This is basically not-portable (I doubt unaligned loads in 
 this context
 are
 faster than performing scalar operations), and possibly 
 inefficient on x86
 too.
dotProduct uses unaligned loads (__builtin_ia32_loadups256, __builtin_ia32_loadupd256) and it is up to 21 times faster than the trivial scalar version. Why are unaligned loads non-portable and inefficient?
x86 is the only arch that can perform an unaligned load. And even on x86 (many implementations) it's not very efficient.
:(
  To make it account for potentially random alignment will be 
 awkward, but it
 might be possible to do efficiently.
Did you mean to use unaligned loads, or to prepare the data for aligned loads at the beginning of the function?
I mean to only use aligned loads, in whatever way that happens to work out. The hard case is when the 2 arrays have different start offsets. Otherwise you need to wrap your code in a version(x86) block.
Thanks!
Aug 17 2013
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/17/13 11:50 AM, Ilya Yaroshenko wrote:
 http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html


 Ilya
The images never load for me, all I see is some "Request timed out" stripes after the text.

Typo: Ununtu

Andrei
Aug 18 2013
parent reply "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
On Sunday, 18 August 2013 at 16:32:33 UTC, Andrei Alexandrescu 
wrote:
 On 8/17/13 11:50 AM, Ilya Yaroshenko wrote:
 http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html


 Ilya
The images never load for me, all I see is some "Request timed out" stripes after the text.
I have changed the interactive charts to png images. Does it work?
 Typo: Ununtu


 Andrei
Aug 18 2013
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/18/13 10:24 AM, Ilya Yaroshenko wrote:
 On Sunday, 18 August 2013 at 16:32:33 UTC, Andrei Alexandrescu wrote:
 On 8/17/13 11:50 AM, Ilya Yaroshenko wrote:
 http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html



 Ilya
The images never load for me, all I see is some "Request timed out" stripes after the text.
I have changed the interactive charts to png images. Does it work?
Yes, thanks.

Andrei
Aug 18 2013
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
 On 8/17/13 11:50 AM, Ilya Yaroshenko wrote:
 http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html
http://www.reddit.com/r/programming/comments/1ktue0/benchmarking_a_simd_implementation_of_dot_product/

Andrei
Aug 21 2013
prev sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
On 17 August 2013 19:50, Ilya Yaroshenko <ilyayaroshenko gmail.com> wrote:
 http://spiceandmath.blogspot.ru/2013/08/simd-implementation-of-dot-product_17.html

 Ilya
Having a quick flick through the simd.d source, I see LDC's and GDC's implementations couldn't be any more wildly different... (LDC's doesn't even look like D code thanks to pragma LDC_inline_ir :)

Thumbs up on marking functions as pure - I should really document somewhere how gcc builtins are fleshed out to the gcc.builtins module sometime...

--
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
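As a minimal illustration of that pure marking (a hypothetical scalar example, not the gist's actual code):

// pure + nothrow let the compiler assume no side effects or
// exceptions across calls to this function
float dotScalar(const(float)[] a, const(float)[] b) pure nothrow
{
    float s = 0;
    foreach (i; 0 .. a.length)
        s += a[i] * b[i];
    return s;
}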
Aug 18 2013