## digitalmars.D - Vectorization and more

• bearophile (69/69) Apr 13 2010 Some musings about vectorizations.
```Some musings about vectorizations.
The benchmarks of the Shootout site are sometimes useful, because they are well
chosen to cover a wide variety of computations and common kinds of kernels.

One of those benchmarks is the n-body; this is one of the faster
implementations, in C:
http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gcc&id=5

Profiling the n-body benchmark shows that most of the time is spent in the
two nested loops of the advance() function (inside them there is a sqrt that
is critical).

This is a summary of the code of the C implementation. Its increased
performance compared to the other versions comes from packing x and y in one SSE
register, and z plus an empty value in another; this allows three operations
(on x, y and z, or on the velocities) to be done in two instructions instead of three:

typedef double v2df __attribute__ ((vector_size(2 * sizeof(double))));

struct planet {
    v2df xy;
    v2df z0;       /* z and zero */
    v2df vxvy;
    v2df vz00;     /* vz and zero */
    v2df massmass; /* the mass in both components */
};

...
    for (j = i + 1; j < NBODIES; j++) {
        b2 = &(bodies[j]);
        dxdy = b->xy - b2->xy;
        dz00 = b->z0 - b2->z0;
        /* distance2 and dtdt are computed here (elided in this summary): */
        /* dx*dx+dy*dy | dz*dz */
        /* dx*dx+dy*dy+dz*dz | dx*dx+dy*dy+dz*dz */
        magmag = dtdt / (__builtin_ia32_sqrtpd(distance2) * distance2);
        dxdy *= magmag;
        dz00 *= magmag;
        b->vxvy -= dxdy * b2->massmass;
        b->vz00 -= dz00 * b2->massmass;
        b2->vxvy += dxdy * b->massmass;
        b2->vz00 += dz00 * b->massmass;
    }
}
...
}

Note how the C code performs two sqrts in parallel, even if only one sqrt is
necessary, because this helps the compiler keep the values in the SSE
registers, avoiding a waste of time (but the CPU will perform two sqrts, so it
will waste some electricity. This is not the kind of optimization you want on
a netbook that has to save as much power as possible. Contrary to what Chris
Lattner thinks, I believe a compilation switch like -Oe could exist, to compile
a program with the purpose of saving energy at runtime).

Note 2: in practice I have seen that LDC, from D code, is able to produce a
binary about as fast as this C one, with no explicit use of SSE registers (and
no array ops) :-)

Later I created a D version of the n-body that uses array ops (as expected
it's a little slower than the naive version, because array ops are not
refined yet in D). This is a summary of the code:

alias double[3] Tri;

struct Body {
    Tri xyz = [0, 0, 0];
    Tri v = [0, 0, 0];
    double mass = 0.0;
    ...
}

foreach (idx, ref i; bodies) {
    double im = i.mass;
    foreach (ref j; bodies[idx + 1 .. $]) {
        double jm = j.mass;
        Tri d = void;
        d[] = i.xyz[] - j.xyz[];

        double distance = sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
        double mag = dt / (distance * distance * distance);

        i.v[] -= d[] * jm * mag;
        j.v[] += d[] * im * mag;
    }
}

foreach (ref i; bodies)
    i.xyz[] += dt * i.v[];
}

Future D implementations will surely be able to use an SSE register to store a
double[2] (and a double[4] in 256-bit CPU registers, etc.).

But to me the problem here seems a little different: given that Tri array,
present two times inside the Body struct, will D compilers be able to use two
SSE registers (with a padding item) to implement an operation like this, done
on arrays of length 3?

d[] = i.xyz[] - j.xyz[];

I think the LLVM devs will eventually add vector ops to LLVM, so LLVM will have to
face both:
1) the need to pad registers when the arrays aren't as long as an integral
number of SSE registers;
2) the need to copy a scalar value into all slots of an SSE register, to keep
its operations as fast as possible when mixed with vector ops. (Later I might
ask Lattner about native vector ops in LLVM. This could be really useful for LDC.)

Bye,
bearophile
```
Apr 13 2010