digitalmars.D.announce - DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Apologies for the delay, we're moving and things are a bit hectic.

reddit: 
http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/

twitter: https://twitter.com/D_Programming/status/347433981928693760

hackernews: https://news.ycombinator.com/item?id=5907624

facebook: https://www.facebook.com/dlang.org/posts/659747567372261

youtube: http://youtube.com/watch?v=q_39RnxtkgM


Andrei

Jun 19 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

 http://youtube.com/watch?v=q_39RnxtkgM

Very nice.

- - - - - - - - - - - - - - - - - - -

Slide 3:

 In practise, say we have iterative code like this:
 
 int data[100];
 
 for(int i = 0; i < data.length; ++i) {
   data[i] += 10; }

For code like that in D we have vector ops:

int[100] data;
data[] += 10;


Regarding vector ops: currently they are written with handwritten 
asm that uses SIMD where possible. Once std.simd is in good shape 
I think the array ops can be rewritten (and completed in their 
missing parts) using a higher level style of coding.
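
As a minimal sketch of the vector-op style mentioned above (the helper name doubleAndAdd is made up for illustration, and whether SIMD is actually used depends on the compiler and runtime):

```d
import std.stdio : writeln;

/// Returns a new array r with r[i] = a[i] * 2 + b[i], using one array operation.
/// (Hypothetical helper, not part of Phobos or std.simd.)
int[] doubleAndAdd(const int[] a, const int[] b)
{
    assert(a.length == b.length, "array ops need equal lengths");
    auto r = new int[a.length];
    r[] = a[] * 2 + b[];   // element-wise expression over whole arrays
    return r;
}

void main()
{
    int[8] data = [1, 2, 3, 4, 5, 6, 7, 8];
    data[] += 10;          // in-place, element-wise add of a scalar

    int[8] ones = 1;       // every element initialised to 1
    writeln(doubleAndAdd(data[], ones[])); // [23, 25, 27, 29, 31, 33, 35, 37]
}
```

The array-op form states the whole-array intent once and leaves the choice of instructions to the implementation.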

- - - - - - - - - - - - - - - - - - -

Slide 22:

 Comparisons:
 Full suite of comparisons Can produce bit-masks, or boolean 
 'any'/'all' logic.

Maybe a little compiler support (for the syntax) would help 
here.

- - - - - - - - - - - - - - - - - - -

Slide 26:

 Always pass vectors by value.

Unfortunately it seems a bad idea to give a warning if you pass 
one of those by reference.

- - - - - - - - - - - - - - - - - - -

Slide 27:

 3. Use ‘leaf’ functions where possible.

I am not sure how useful it would be to enforce leaf functions 
with a @leaf annotation.

- - - - - - - - - - - - - - - - - - -

Slide 32:

 Experiment with prefetching?

Do D intrinsics offer instructions to perform prefetching?

- - - - - - - - - - - - - - - - - - -

LDC2 supports SIMD on Windows32 too.

So for this code:


void main() {
     alias double2 = __vector(double[2]);
     auto a = new double[200];
     auto b = cast(double2[])a;
     double2 tens = [10.0, 10.0];
     b[] += tens;
}


LDC2 compiles it to:

	movl	$200, 4(%esp)
	movl	$__D11TypeInfo_Ad6__initZ, (%esp)
	calll	__d_newarrayiT
	movl	%edx, %esi
	movl	%eax, (%esp)
	movl	$16, 8(%esp)
	movl	$8, 4(%esp)
	calll	__d_array_cast_len
	testl	%eax, %eax
	je	LBB0_3
	movapd	LCPI0_0, %xmm0
	.align	16, 0x90
LBB0_2:
	movapd	(%esi), %xmm1
	addpd	%xmm0, %xmm1
	movapd	%xmm1, (%esi)
	addl	$16, %esi
	decl	%eax
	jne	LBB0_2
LBB0_3:
	xorl	%eax, %eax
	addl	$12, %esp
	popl	%esi
	ret


It uses addpd, which works on two doubles at the same time.

- - - - - - - - - - - - - - - - - - -

The Reddit thread contains a link to this page, a compiler for a 
C variant from Intel that's optimized for SIMD:
http://ispc.github.io/

Some of the syntax of ispc:

- - - - - -

The first of these statements is cif, indicating an if statement 
that is expected to be coherent. The usage of cif in code is just 
the same as if:

cif (x < y) {
     ...
} else {
     ...
}

cif provides a hint to the compiler that you expect that most of 
the executing SPMD programs will all have the same result for the 
if condition.

Along similar lines, cfor, cdo, and cwhile check to see if all 
program instances are running at the start of each loop 
iteration; if so, they can run a specialized code path that has 
been optimized for the "all on" execution mask case.
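
To make the idea concrete, here is a rough D sketch of what a "coherent" branch amounts to. It is only an illustration in plain D, not ispc's implementation: the gang width of 4, the name coherentSelect, and the branch bodies are all assumptions.

```d
import std.stdio : writeln;

enum size_t lanes = 4; // hypothetical gang width

/// Per lane: r = (x < y) ? y - x : 0, with fast paths when all lanes agree.
float[lanes] coherentSelect(const float[lanes] x, const float[lanes] y)
{
    // Evaluate the condition for every lane and count the lanes that are "on".
    bool[lanes] mask;
    size_t on;
    foreach (i; 0 .. lanes)
    {
        mask[i] = x[i] < y[i];
        if (mask[i])
            ++on;
    }

    float[lanes] r;
    if (on == lanes)
        r[] = y[] - x[];        // "all on": only the then-branch, no masking
    else if (on == 0)
        r[] = 0.0f;             // "all off": only the else-branch
    else
        foreach (i; 0 .. lanes) // divergent: blend the two results per lane
            r[i] = mask[i] ? y[i] - x[i] : 0.0f;
    return r;
}

void main()
{
    float[lanes] x = [1.0f, 2.0f, 3.0f, 4.0f];
    float[lanes] y = [5.0f, 6.0f, 7.0f, 8.0f];
    float[lanes] z = [0.0f, 6.0f, 0.0f, 8.0f];
    writeln(coherentSelect(x, y)); // coherent case
    writeln(coherentSelect(x, z)); // divergent case
}
```

The hint only pays off when the uniform paths are taken most of the time; otherwise the extra check is pure overhead.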

- - - - - -

foreach_tiled(y = y0 ... y1, x = 0 ... w,
               u = 0 ... nsubsamples, v = 0 ... nsubsamples) {
     float du = (float)u * invSamples, dv = (float)v * invSamples;

- - - - - -

I'll take a better look at ispc.

Bye,
bearophile

Jun 20 2013

Manu <turkeyman gmail.com> writes:

On 20 June 2013 21:58, bearophile <bearophileHUGS lycos.com> wrote:

 Andrei Alexandrescu:

  http://youtube.com/watch?v=q_39RnxtkgM

 Very nice.

 - - - - - - - - - - - - - - - - - - -

 Slide 3:

  In practise, say we have iterative code like this:
 int data[100];

 for(int i = 0; i < data.length; ++i) {
   data[i] += 10; }

 For code like that in D we have vector ops:

 int[100] data;
 data[] += 10;


 Regarding vector ops: currently they are written with handwritten asm that
 uses SIMD where possible. Once std.simd is in good shape I think the array
 ops can be rewritten (and completed in their missing parts) using a higher
 level style of coding.

I was trying to illustrate a process. Not so much a comment on D array
syntax.
The problem with auto-SIMD applied to array operations is that D doesn't
guarantee that arrays are aligned, nor that their lengths are multiples of
'N' elements, which means the compiler loses the opportunity to make the
assumptions that make the biggest performance difference.
They must be aligned, and multiples of N elements. By using explicit SIMD
types, you're forced to adhere to those rules as a programmer, and the
compiler can optimise properly.
You take on the responsibility to handle mis-alignment and stragglers as
the programmer, and perhaps make less conservative choices.
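
A minimal sketch of what "handling stragglers yourself" can look like. To stay compilable on any target it uses plain 4-element slices instead of __vector types; N = 4, isAligned16 and addScalar are illustrative names, not std.simd API.

```d
import std.stdio : writeln;

enum size_t N = 4; // lanes in a hypothetical 128-bit float vector

/// True if the slice starts on a 16-byte boundary.
bool isAligned16(const(float)[] a)
{
    return (cast(size_t) a.ptr & 15) == 0;
}

/// Adds k to every element: whole N-wide chunks first, then the stragglers.
void addScalar(float[] data, float k)
{
    immutable size_t bodyLen = data.length - data.length % N;

    // Body: fixed-width chunks, the part an explicit SIMD type would cover.
    for (size_t i = 0; i < bodyLen; i += N)
        data[i .. i + N] += k;

    // Tail: fewer than N leftover elements, handled one at a time.
    foreach (ref x; data[bodyLen .. $])
        x += k;
}

void main()
{
    auto data = new float[10];
    data[] = 1.0f;
    addScalar(data, 10.0f);
    writeln(data);

    // Slicing off the first element shifts the start by 4 bytes, so a slice
    // of an aligned array is not necessarily aligned itself.
    writeln(isAligned16(data), " ", isAligned16(data[1 .. $]));
}
```

If the length is padded to a multiple of N up front, the tail loop (and its unpredictable branch) disappears entirely.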

- - - - - - - - - - - - - - - - - - -
 Slide 22:

  Comparisons:
 Full suite of comparisons Can produce bit-masks, or boolean 'any'/'all'
 logic.

 Maybe a little of compiler support (for the syntax) will help here.

Well, each is a valid comparison in different situations. I'm not sure how
syntax could clearly select the one you want.
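
For illustration, the three flavours of comparison can be sketched in plain D like this. The function names are hypothetical and this is not the std.simd interface; it only shows how a per-lane mask differs from the 'any'/'all' reductions.

```d
import std.algorithm.searching : all, any;
import std.stdio : writeln;

/// Lane-wise a < b as a mask: all bits set where true, zero where false.
uint[4] lessThanMask(const float[4] a, const float[4] b)
{
    uint[4] mask;
    foreach (i; 0 .. 4)
        mask[i] = a[i] < b[i] ? uint.max : 0;
    return mask;
}

/// True if a < b holds in at least one lane.
bool anyLess(const float[4] a, const float[4] b)
{
    auto m = lessThanMask(a, b);
    return m[].any!(lane => lane != 0);
}

/// True if a < b holds in every lane.
bool allLess(const float[4] a, const float[4] b)
{
    auto m = lessThanMask(a, b);
    return m[].all!(lane => lane != 0);
}

void main()
{
    float[4] a = [1.0f, 5.0f, 3.0f, 7.0f];
    float[4] b = [2.0f, 4.0f, 6.0f, 8.0f];
    writeln(anyLess(a, b)); // true
    writeln(allLess(a, b)); // false: lane 1 fails
}
```

The mask form feeds branch-free selects, while 'any'/'all' collapse the result into a single boolean for ordinary control flow, which is why one syntax cannot easily stand for both.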

- - - - - - - - - - - - - - - - - - -
 Slide 26:

  Always pass vectors by value.

 Unfortunately it seems a bad idea to give a warning if you pass one of
 those by reference.

And I don't think it should. Passing by ref isn't 'wrong', you just
shouldn't do it if you care about performance.

- - - - - - - - - - - - - - - - - - -
 Slide 27:

  3. Use ‘leaf’ functions where possible.

 I am not sure how much good it is to enforce leaf functions with a  leaf
 annotation.

I don't think it would be useful. It should only be considered a general
rule when people are very specifically considering performance above all
else.
It's just a very important detail to be aware of when optimising your code,
particularly so when you're dealing with maths code (often involving simd).

- - - - - - - - - - - - - - - - - - -
 Slide 32:

  Experiment with prefetching?

 Are D intrinsics offering instructions to perform prefetching?

Well, GCC does at least. If you're worried about performance at this level,
you're probably already using GCC :)

- - - - - - - - - - - - - - - - - - -
 LDC2 supports SIMD on Windows32 too.

 So for this code:


 void main() {
     alias double2 = __vector(double[2]);
     auto a = new double[200];
     auto b = cast(double2[])a;
     double2 tens = [10.0, 10.0];
     b[] += tens;
 }


 LDC2 compiles it to:

         movl    $200, 4(%esp)
         movl    $__D11TypeInfo_Ad6__initZ, (%esp)
         calll   __d_newarrayiT
         movl    %edx, %esi
         movl    %eax, (%esp)
         movl    $16, 8(%esp)
         movl    $8, 4(%esp)
         calll   __d_array_cast_len
         testl   %eax, %eax
         je      LBB0_3
         movapd  LCPI0_0, %xmm0
         .align  16, 0x90
 LBB0_2:
         movapd  (%esi), %xmm1
         addpd   %xmm0, %xmm1
         movapd  %xmm1, (%esi)
         addl    $16, %esi
         decl    %eax
         jne     LBB0_2
 LBB0_3:
         xorl    %eax, %eax
         addl    $12, %esp
         popl    %esi
         ret


 It uses addpd that works with two doubles at the same time.

Sure... did I say this wasn't supported somewhere? Sorry if I gave that
impression.

- - - - - - - - - - - - - - - - - - -
 The Reddit thread contains a link to this page, a compiler for a C variant
 from Intel that's optimized for SIMD:
 http://ispc.github.io/

 Some of the syntax of ispc:

 - - - - - -

 The first of these statements is cif, indicating an if statement that is
 expected to be coherent. The usage of cif in code is just the same as if:

 cif (x < y) {
     ...
 } else {
     ...
 }

 cif provides a hint to the compiler that you expect that most of the
 executing SPMD programs will all have the same result for the if condition.
 Along similar lines, cfor, cdo, and cwhile check to see if all program
 instances are running at the start of each loop iteration; if so, they can
 run a specialized code path that has been optimized for the "all on"
 execution mask case.

This is interesting. I didn't know about this.

Jun 20 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Manu:

 They must be aligned, and multiples of N elements.

The D GC currently allocates them 16-byte aligned (but if you 
slice the array you can lose some alignment). On some new CPUs 
the penalty for misalignment is small.

You often have "n" values, where n is variable. If n is large 
enough and you are using D vector ops, the handling of the head 
and tail doesn't waste too much time. If you have very few values 
it's much better to use the SIMD code.


 Well, each are valid comparisons in different situations. I'm 
 not sure how syntax could clearly select the one you want.

Maybe later we'll look for some syntax sugar for this.


 Are D intrinsics offering instructions to perform prefetching?

 Well, GCC does at least. If you're worried about performance at 
 this level, you're probably already using GCC :)

I think D SIMD programmers will expect something functionally 
like __builtin_prefetch to be available in D too:
http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fprefetch-3396

Thank you,
bye,
bearophile

Jun 20 2013

Manu <turkeyman gmail.com> writes:

On 21 June 2013 00:03, bearophile <bearophileHUGS lycos.com> wrote:

 Manu:

  They must be aligned, and multiples of N elements.

 The D GC currently allocates them 16-byte aligned (but if you slice the
 array you can lose some alignment). On some new CPUs the penalty for
 misalignment is small.

Yes, the GC allocates 16-byte aligned memory, this is good. It's critical
actually. But if the data types themselves weren't aligned, then the alloc
alignment would be lost as soon as they were used in structs.

You'll notice I made a point of focusing on _portable_ simd. It's true,
some new chips can deal with it at virtually no additional cost, but they
lose nothing by aligning their data regardless, and you can run on anything.
I hope that people write libraries that can run well on anything, not just
their architecture of choice. The guidelines I presented, if followed, will
give you good performance on all architectures.
They're not even very inconvenient.

If your point is about auto-vectorisation being much simpler without the
alignment restrictions, this is true. But again, I'm talking about portable
and RELIABLE implementations, that is, the programmer should know that SIMD
was used effectively, and not have to hope the optimiser was able to do a
good job. Make these guidelines second nature, and you'll foster a habit of
writing portable code even if you don't intend to do so personally. Someone
somewhere may want to use your library...

 You often have "n" values, where n is variable. If n is large enough and
 you are using D vector ops, the handling of the head and tail doesn't waste
 too much time. If you have very few values it's much better to use the SIMD
 code.

See my later slides about branch predictability. When you need to handle
stragglers on the head or tail, then you've introduced 2 sources of
unpredictability (and also bloated your code).
If the arrays are very long, this may be okay as you say, but if they're
not it becomes significant.

But there is a new issue that appears; if the output array is not the same
as the input array, then you have a new mis-alignment where the bases of
the 2 arrays might not share the same alignment, and you can't do a simd
load from one and store to the other without a series of corrective shifts
and merges, which will effectively result in similar code to my un-aligned
load demonstration.

So the case where this is reliable is:
* long data array
* output array is the same as the input array (overwrites the input?)

I don't consider that reliable, and I don't think special-case awareness
of those criteria is any easier than carefully/deliberately using SIMD in
the first place.
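
As a small illustration of the input/output problem, this sketch checks whether two slices share the same offset within a 16-byte block, which is the condition under which one head/tail split works for both. offset16 and sameAlignment are hypothetical helper names, and the 16-byte width is an assumption about the target's vector size.

```d
import std.stdio : writeln;

/// Byte offset of a pointer within its 16-byte block.
size_t offset16(const(void)* p)
{
    return cast(size_t) p & 15;
}

/// True when src and dst start at the same offset within a 16-byte block,
/// so the same head/tail split lines both of them up for aligned access.
bool sameAlignment(const(float)[] src, const(float)[] dst)
{
    return offset16(src.ptr) == offset16(dst.ptr);
}

void main()
{
    float[32] buf = 0;
    writeln(sameAlignment(buf[0 .. 8], buf[16 .. 24])); // true: 64 bytes apart
    writeln(sameAlignment(buf[0 .. 8], buf[17 .. 25])); // false: 68 bytes apart
}
```

When the check fails, no amount of peeling aligns both arrays at once, and the loop needs unaligned accesses or the shift-and-merge sequence described above.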

 Well, each are valid comparisons in different situations. I'm not sure how
 syntax could clearly select the one you want.

 Maybe later we'll look for some syntax sugar for this.

I'm definitely curious... but I'm not sure it's necessary.

 Are D intrinsics offering instructions to perform prefetching?

 Well, GCC does at least. If you're worried about performance at this
 level, you're probably already using GCC :)

 I think D SIMD programmers will expect something functionally like
 __builtin_prefetch to be available in D too:
 http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fprefetch-3396

Yup, I toyed with the idea of adding it to std.simd, but I didn't think it
fit there.

Jun 20 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Manu:

 This is interesting. I didn't know about this.

An important question here is: what semantics does that language 
have that D lacks (and that are useful for the optimizer)? Is it 
possible, or worthwhile, to add them?

Bye,
bearophile

Jun 23 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Manu:

 This is interesting. I didn't know about this.

I have taken a look at this page:
https://github.com/ispc/ispc

There is a free compiler binary for various operating systems:
http://ispc.github.io/downloads.html

I have tried the Windows compiler on some of the given examples 
of code, and it works! And the resulting asm is excellent. Normal 
compilers for the usual languages can't produce asm that is even 
nearly as good.

To try the code of the examples I compile like this:

ispc.exe --emit-asm stencil.ispc -o stencil.s

Or even like this to see the AVX2 asm instructions:

ispc.exe --target=avx2 --emit-asm stencil.ispc -o stencil.s



As an example, it compiles a function like this:


static void
stencil_step(uniform int x0, uniform int x1,
              uniform int y0, uniform int y1,
              uniform int z0, uniform int z1,
              uniform int Nx, uniform int Ny, uniform int Nz,
              uniform const float coef[4], uniform const float vsq[],
              uniform const float Ain[], uniform float Aout[]) {
     const uniform int Nxy = Nx * Ny;

     foreach (z = z0 ... z1, y = y0 ... y1, x = x0 ... x1) {
         int index = (z * Nxy) + (y * Nx) + x;
#define A_cur(x, y, z) Ain[index + (x) + ((y) * Nx) + ((z) * Nxy)]
#define A_next(x, y, z) Aout[index + (x) + ((y) * Nx) + ((z) * Nxy)]
         float div = coef[0] * A_cur(0, 0, 0) +
             coef[1] * (A_cur(+1, 0, 0) + A_cur(-1, 0, 0) +
                        A_cur(0, +1, 0) + A_cur(0, -1, 0) +
                        A_cur(0, 0, +1) + A_cur(0, 0, -1)) +
             coef[2] * (A_cur(+2, 0, 0) + A_cur(-2, 0, 0) +
                        A_cur(0, +2, 0) + A_cur(0, -2, 0) +
                        A_cur(0, 0, +2) + A_cur(0, 0, -2)) +
             coef[3] * (A_cur(+3, 0, 0) + A_cur(-3, 0, 0) +
                        A_cur(0, +3, 0) + A_cur(0, -3, 0) +
                        A_cur(0, 0, +3) + A_cur(0, 0, -3));

         A_next(0, 0, 0) = 2 * A_cur(0, 0, 0) - A_next(0, 0, 0) +
             vsq[index] * div;
     }
}



To asm like (using SSE4):




	movups	(%r14,%rdi), %xmm0
	mulps	%xmm6, %xmm5

	movq	424(%rsp), %rdx
	movups	(%rdx,%r15), %xmm6
	mulps	%xmm3, %xmm2

	mulps	%xmm11, %xmm3
	addps	%xmm2, %xmm3
	addps	%xmm5, %xmm3
	addps	%xmm4, %xmm0
	movslq	%eax, %rax
	movups	(%r14,%rax), %xmm2
	addps	%xmm0, %xmm2
	movslq	%ebp, %rax
	movups	(%r14,%rax), %xmm0
	addps	%xmm2, %xmm0
	mulps	%xmm9, %xmm0
	addps	%xmm3, %xmm0
	mulps	%xmm6, %xmm0
	addps	%xmm1, %xmm0
	movups	%xmm0, (%r11,%r15)
	jl	.LBB0_5


Depth=2
	movq	%r13, %rbp

	movq	424(%rsp), %r9

	movq	%r11, %rbx
	jge	.LBB0_257


Depth=2
	movd	%r8d, %xmm0

	imull	400(%rsp), %r15d

	paddd	.LCPI0_1(%rip), %xmm0
	movdqa	%xmm8, %xmm1
	pcmpgtd	%xmm0, %xmm1
	movmskps	%xmm1, %edi

	leal	(%r8,%r15), %r10d



Or using AVX2 (this is not exactly the equivalent piece of code; 
the asm generated with AVX2 is usually significantly shorter):


	movslq	%edx, %rdx
	vaddps	(%rax,%rdx), %ymm5, %ymm5

	leal	(%rdx,%r15), %edx
	movslq	%edx, %rdx
	vaddps	(%rax,%rdx), %ymm5, %ymm5
	vaddps	%ymm11, %ymm6, %ymm9
	vaddps	%ymm10, %ymm8, %ymm10
	vmovups	(%rax,%rdi), %ymm12
	addl	$32, %r15d
	vmovups	(%r11,%rcx), %ymm6
	movq	400(%rsp), %rdx
	vbroadcastss	12(%rdx), %ymm7
	vbroadcastss	8(%rdx), %ymm8
	vbroadcastss	(%rdx), %ymm11
	vaddps	%ymm12, %ymm10, %ymm10

	vbroadcastss	4(%rdx), %ymm12
	vmulps	%ymm9, %ymm12, %ymm9
	vfmadd213ps	%ymm9, %ymm3, %ymm11
	vfmadd213ps	%ymm11, %ymm8, %ymm10
	vfmadd213ps	%ymm10, %ymm5, %ymm7
	vfmadd213ps	%ymm4, %ymm6, %ymm7
	vmovups	%ymm7, (%r8,%rcx)
	jl	.LBB0_8

%partial_inner_all_outer.us

Depth=2

	movq	400(%rsp), %r10
	jge	.LBB0_6


Depth=2
	vmovd	%r14d, %xmm3
	vbroadcastss	%xmm3, %ymm3

	leal	(%rcx,%r15), %ecx
	vpaddd	%ymm2, %ymm3, %ymm3




Sometimes it gives performance warnings:

rt.ispc:257:9: Performance Warning: Scatter required to store 
value.
         image[offset] = ray.maxt;
         ^^^^^^^^^^^^^

rt.ispc:258:9: Performance Warning: Scatter required to store 
value.
         id[offset] = ray.hitId;
         ^^^^^^^^^^


A slightly larger example, with this function from a little 
ray-tracer:


static bool TriIntersect(const uniform Triangle &tri, Ray &ray) {
     uniform float3 p0 = { tri.p[0][0], tri.p[0][1], tri.p[0][2] };
     uniform float3 p1 = { tri.p[1][0], tri.p[1][1], tri.p[1][2] };
     uniform float3 p2 = { tri.p[2][0], tri.p[2][1], tri.p[2][2] };
     uniform float3 e1 = p1 - p0;
     uniform float3 e2 = p2 - p0;

     float3 s1 = Cross(ray.dir, e2);
     float divisor = Dot(s1, e1);
     bool hit = true;

     if (divisor == 0.)
         hit = false;
     float invDivisor = 1.f / divisor;

     // Compute first barycentric coordinate
     float3 d = ray.origin - p0;
     float b1 = Dot(d, s1) * invDivisor;
     if (b1 < 0. || b1 > 1.)
         hit = false;

     // Compute second barycentric coordinate
     float3 s2 = Cross(d, e1);
     float b2 = Dot(ray.dir, s2) * invDivisor;
     if (b2 < 0. || b1 + b2 > 1.)
         hit = false;

     // Compute _t_ to intersection point
     float t = Dot(e2, s2) * invDivisor;
     if (t < ray.mint || t > ray.maxt)
         hit = false;

     if (hit) {
         ray.maxt = t;
         ray.hitId = tri.id;
     }
     return hit;
}


The (more or less) complete asm with AVX2 for that function:



 "TriIntersect___REFs[_c_unTriangle]REFs[vyRay]"

     subq    $248, %rsp










     vmovss  (%rcx), %xmm0
     vmovss  16(%rcx), %xmm2
     vinsertps   $16, 4(%rcx), %xmm0, %xmm0
     vinsertps   $32, 8(%rcx), %xmm0, %xmm0
     vinsertf128 $1, %xmm0, %ymm0, %ymm10
     vmovaps (%rdx), %ymm0
     vmovaps 32(%rdx), %ymm3
     vmovaps 64(%rdx), %ymm1
     vmovaps 96(%rdx), %ymm9
     vpbroadcastd    .LCPI1_0(%rip), %ymm12
     vxorps  %ymm4, %ymm4, %ymm4
     vpermps %ymm10, %ymm4, %ymm4
     vsubps  %ymm4, %ymm0, %ymm0
     vpermps %ymm10, %ymm12, %ymm4
     vsubps  %ymm4, %ymm3, %ymm13

     vpbroadcastd    .LCPI1_1(%rip), %ymm8
     vpermps %ymm10, %ymm8, %ymm3
     vsubps  %ymm3, %ymm1, %ymm6
     vmovss  32(%rcx), %xmm1
     vinsertps   $16, 36(%rcx), %xmm1, %xmm1
     vinsertps   $32, 40(%rcx), %xmm1, %xmm1
     vinsertf128 $1, %xmm0, %ymm1, %ymm1
     vsubps  %ymm10, %ymm1, %ymm15
     vbroadcastss    %xmm15, %ymm3

     vmovaps 128(%rdx), %ymm11
     vpermps %ymm15, %ymm12, %ymm5
     vmulps  %ymm3, %ymm11, %ymm1
     vmovaps %ymm3, %ymm4
     vmovaps %ymm5, %ymm7
     vfmsub213ps %ymm1, %ymm9, %ymm7
     vinsertps   $16, 20(%rcx), %xmm2, %xmm1
     vinsertps   $32, 24(%rcx), %xmm1, %xmm1
     vinsertf128 $1, %xmm0, %ymm1, %ymm1
     vsubps  %ymm10, %ymm1, %ymm1
     vpermps %ymm1, %ymm12, %ymm14
     vpermps %ymm1, %ymm8, %ymm12
     vmulps  %ymm6, %ymm14, %ymm3
     vmovaps %ymm13, %ymm2
     vfmsub213ps %ymm3, %ymm12, %ymm2
     vbroadcastss    %xmm1, %ymm1
     vmulps  %ymm0, %ymm12, %ymm3
     vmovaps %ymm6, %ymm13
     vfmsub213ps %ymm3, %ymm1, %ymm13
     vmulps  %ymm13, %ymm11, %ymm3
     vmovaps %ymm2, %ymm10
     vfmadd213ps %ymm3, %ymm9, %ymm10
     vpermps %ymm15, %ymm8, %ymm8
     vmulps  %ymm8, %ymm9, %ymm3
     vmovaps 160(%rdx), %ymm9
     vmulps  %ymm5, %ymm9, %ymm15
     vfmsub213ps %ymm15, %ymm8, %ymm11
     vmovaps %ymm4, %ymm15
     vfmsub213ps %ymm3, %ymm9, %ymm15
     vmulps  %ymm15, %ymm14, %ymm4
     vmovaps %ymm11, %ymm3
     vfmadd213ps %ymm4, %ymm1, %ymm3
     vfmadd213ps %ymm3, %ymm7, %ymm12

     vmulps  %ymm4, %ymm15, %ymm3
     vfmadd213ps %ymm3, %ymm0, %ymm11
     vfmadd213ps %ymm11, %ymm6, %ymm7
     vmulps  %ymm4, %ymm1, %ymm1
     vfmsub213ps %ymm1, %ymm14, %ymm0
     vbroadcastss    .LCPI1_2(%rip), %ymm1
     vrcpps  %ymm12, %ymm3
     vmovaps %ymm12, %ymm4
     vfnmadd213ps    %ymm1, %ymm3, %ymm4
     vmulps  %ymm4, %ymm3, %ymm4
     vxorps  %ymm14, %ymm14, %ymm14
     vcmpeqps    %ymm14, %ymm12, %ymm1
     vcmpunordps %ymm14, %ymm12, %ymm3
     vorps   %ymm1, %ymm3, %ymm11
     vmulps  %ymm13, %ymm5, %ymm1
     vmulps  %ymm4, %ymm7, %ymm5

     vfmadd213ps %ymm1, %ymm3, %ymm2
     vbroadcastss    .LCPI1_3(%rip), %ymm1
     vcmpnleps   %ymm1, %ymm5, %ymm3
     vcmpnleps   %ymm5, %ymm14, %ymm6
     vorps   %ymm3, %ymm6, %ymm6
     vbroadcastss    .LCPI1_0(%rip), %ymm3
     vblendvps   %ymm11, %ymm14, %ymm3, %ymm7
     vfmadd213ps %ymm10, %ymm0, %ymm9
     vmovaps (%r8), %ymm3
     vmovmskps   %ymm3, %eax
     leaq    352(%rdx), %r8
     cmpl    $255, %eax
     vmulps  %ymm9, %ymm4, %ymm9
     vaddps  %ymm9, %ymm5, %ymm10
     vmovups 320(%rdx), %ymm5
     vfmadd213ps %ymm2, %ymm8, %ymm0
     vblendvps   %ymm6, %ymm14, %ymm7, %ymm6
     vpcmpeqd    %ymm2, %ymm2, %ymm2
     vcmpnleps   %ymm1, %ymm10, %ymm1
     vcmpnleps   %ymm9, %ymm14, %ymm7
     vorps   %ymm1, %ymm7, %ymm1
     vblendvps   %ymm1, %ymm14, %ymm6, %ymm6
     vmulps  %ymm0, %ymm4, %ymm1
     vcmpnleps   352(%rdx), %ymm1, %ymm0
     vcmpnleps   %ymm1, %ymm5, %ymm4
     vorps   %ymm0, %ymm4, %ymm0
     vblendvps   %ymm0, %ymm14, %ymm6, %ymm0
     vpcmpeqd    %ymm14, %ymm0, %ymm4
     vpxor   %ymm2, %ymm4, %ymm2
     je  .LBB1_1

     vpand   %ymm3, %ymm2, %ymm2

     vmovmskps   %ymm2, %eax
     testl   %eax, %eax
     je  .LBB1_3

     vmaskmovps  %ymm1, %ymm2, (%r8)
     vpbroadcastd    48(%rcx), %ymm1
     vmaskmovps  %ymm1, %ymm2, 384(%rdx)











     addq    $248, %rsp
     ret


Using SSE4 the function asm starts like this:


 "TriIntersect___REFs[_c_unTriangle]REFs[vyRay]"

	subq	$248, %rsp










	movss	(%rcx), %xmm1
	movss	16(%rcx), %xmm0
	insertps	$16, 4(%rcx), %xmm1
	insertps	$32, 8(%rcx), %xmm1
	movss	32(%rcx), %xmm7
	insertps	$16, 36(%rcx), %xmm7
	insertps	$32, 40(%rcx), %xmm7
	subps	%xmm1, %xmm7


	movaps	(%rdx), %xmm4
	movaps	16(%rdx), %xmm12
	movaps	32(%rdx), %xmm5
	movaps	48(%rdx), %xmm11
	subps	%xmm3, %xmm12
	subps	%xmm2, %xmm5


	subps	%xmm2, %xmm4


	movaps	%xmm11, %xmm2
	mulps	%xmm3, %xmm2
	movaps	%xmm3, %xmm6


	movaps	%xmm11, %xmm9
	mulps	%xmm3, %xmm9
	movaps	%xmm3, %xmm10
	insertps	$16, 20(%rcx), %xmm0
	insertps	$32, 24(%rcx), %xmm0
	subps	%xmm1, %xmm0


	movdqa	%xmm3, %xmm1
	movdqa	%xmm3, %xmm14
	mulps	%xmm5, %xmm1

...



Even the LDC2 compiler doesn't get anywhere close to such 
efficient use of SIMD instructions. And the ispc compiler is 
also able to spread the work across multiple cores. I think D is 
meant to be used for similar numerical code too, so perhaps the 
small number of ideas contained in this very C-like language are 
worth stealing and adding to D.

Bye,
bearophile

Jul 12 2013

Sönke Ludwig <sludwig outerproduct.org> writes:

On 20.06.2013 13:58, bearophile wrote:
 
 The Reddit thread contains a link to this page, a compiler for a C
 variant from Intel that's optimized for SIMD:
 http://ispc.github.io/
 

Since you mention that, I developed a similar compiler/language in
parallel to Intel at the time. The main differences were an implicit
approach to handling the main loop, it didn't expose the SIMD target
through explicit indices like ISPC did at the beginning, and could
target GPUs in addition to outputting SIMD code. It was primarily used
for high performance image processing on a commercial application that
needed a safe CPU fallback path.

Although I didn't have the time to implement many comprehensive
optimization techniques (apart from some basic ones and from what LLVM
provides) the results were quite impressive, depending of course on how
well a program lends itself to the SPMD->SIMD transformation. There are
some benchmarks at the end of the thesis:

http://outerproduct.org/research/msc-thesis-slurp.pdf

Unfortunately, at this point it is purely academic because the
copyright to the source code has been left with my former employer and
now Google. It's a bit of a pity as it was filling a certain niche that
nobody else did (fortunately this has become less of an issue with the
ubiquitous distribution of shader capable GPUs and improving drivers
from a certain GPU vendor).

Jun 20 2013

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On Wed, 19 Jun 2013 15:25:29 -0400
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 
 reddit: 
 http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/
 

A bit late, but torrents/links up:
http://semitwist.com/download/misc/dconf2013/

Jun 20 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/21/13 12:38 AM, Nick Sabalausky wrote:
 On Wed, 19 Jun 2013 15:25:29 -0400
 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org>  wrote:
 reddit:
 http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/

 A bit late, but torrents/links up:
 http://semitwist.com/download/misc/dconf2013/

Thanks for this work. I'll be late with torrents for the last two talks 
until I get to a broadband connection.

Andrei

Jun 21 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/19/13 12:25 PM, Andrei Alexandrescu wrote:
 Apologies for the delay, we're moving and things are a bit hectic.

 reddit:
 http://www.reddit.com/r/programming/comments/1go9ky/dconf_2013_effective_simd_for_modern/


 twitter: https://twitter.com/D_Programming/status/347433981928693760

 hackernews: https://news.ycombinator.com/item?id=5907624

 facebook: https://www.facebook.com/dlang.org/posts/659747567372261

 youtube: http://youtube.com/watch?v=q_39RnxtkgM


 Andrei

Now available in HD: https://archive.org/details/dconf2013-day03-talk05


Andrei

Jun 24 2013
