digitalmars.D - Vectorization and more
- bearophile (69/69) Apr 13 2010 Some musings about vectorizations.
Some musings about vectorization. The benchmarks of the Shootout site are sometimes useful, because they are chosen to cover a wide variety of computations and kinds of common kernels. One of those benchmarks is the n-body; this is one of the faster implementations, in C:
http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gcc&id=5

Profiling of the n-body benchmark shows that most of the time is spent in the two nested loops of the advance() function (inside them there is a sqrt that is critical). This is a summary of the code of the C implementation; its increased performance compared to the other versions comes from packing x and y in one SSE register, and z plus an empty value in another. This allows it to do three operations (on x, y and z, or on the speeds) in two instructions instead of three:

typedef double v2df __attribute__ (( vector_size(2*sizeof(double)) ));

struct planet {
    v2df xy;
    v2df z0;       /* z and zero */
    v2df vxvy;
    v2df vz00;     /* vz and zero */
    v2df massmass; /* the mass in both components */
};

static void advance(int q) {
    ...
        for (j = i + 1; j < NBODIES; j++) {
            b2 = &(bodies[j]);
            dxdy = b->xy - b2->xy;
            dz00 = b->z0 - b2->z0;
            /* dx*dx+dy*dy | dz*dz */
            tsquare = __builtin_ia32_haddpd(dxdy*dxdy, dz00*dz00);
            /* dx*dx+dy*dy+dz*dz | dx*dx+dy*dy+dz*dz */
            distance2 = __builtin_ia32_haddpd(tsquare, tsquare);
            magmag = dtdt / (__builtin_ia32_sqrtpd(distance2) * distance2);
            dxdy *= magmag;
            dz00 *= magmag;
            b->vxvy -= dxdy * b2->massmass;
            b->vz00 -= dz00 * b2->massmass;
            b2->vxvy += dxdy * b->massmass;
            b2->vz00 += dz00 * b->massmass;
        }
    ...
}

Note how the C code performs two sqrt in parallel, even if only one sqrt is necessary, because this helps the compiler keep the values in the SSE registers, avoiding a waste of time (but the CPU will perform two sqrt, so it will waste some electricity. This is not the kind of optimization you want for a netbook that has to save as much power as possible.
Contrary to what Chris Lattner thinks, I believe a compilation switch like -Oe could exist, to compile a program with the purpose of saving energy at runtime).

Note2: in practice I have seen that LDC with D code is able to produce a binary about as fast as this C one, with no explicit use of SSE registers (and no array ops) :-)

Later I have created a D version of the n-body that uses array ops (and as expected it's a little slower than the naive version, because array ops are not refined yet in D). This is a summary of the code:

alias double[3] Tri;

struct Body {
    Tri xyz = [0,0,0];
    Tri v = [0,0,0];
    double mass = 0.0;
    ...
}

void advance(double dt) {
    foreach (idx, ref i; bodies) {
        double im = i.mass;
        foreach (ref j; bodies[idx + 1 .. length]) {
            double jm = j.mass;
            Tri d = void;
            d[] = i.xyz[] - j.xyz[];
            double distance = sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
            double mag = dt / (distance * distance * distance);
            i.v[] -= d[] * jm * mag;
            j.v[] += d[] * im * mag;
        }
    }

    foreach (ref i; bodies)
        i.xyz[] += dt * i.v[];
}

Future D implementations will surely be able to use an SSE register to store a double[2] (and a double[4] in 256-bit CPU registers, etc.). But to me the problem here seems a little different: given that Tri array, present two times inside the Body struct, will D compilers be able to use two SSE registers (with a padding item) to implement an operation like this, done among arrays of length 3?

d[] = i.xyz[] - j.xyz[];

I think LLVM devs will eventually add vector ops to LLVM, so LLVM will have to face both: 1) the need to pad registers when the arrays aren't as long as an integral number of SSE registers; 2) the need to copy scalar values into all slots of an SSE register to keep its operations as fast as possible when mixed with vector ops. (Later I might ask Lattner about native vector ops in LLVM. This can be really useful for LDC.)

Bye,
bearophile