digitalmars.D - Good dotProduct
- bearophile (8/8) Jun 25 2010 I have added to Bugzilla a small enhancement request for Phobos:
- bearophile (98/99) Jun 27 2010 My first try; writing X86+SSE asm is still a pain for me:
- bearophile (44/44) Jun 29 2010 This is the fastest I've managed; further loop unrolling is not useful (for ...
- Andrei Alexandrescu (5/6) Jun 29 2010 [snip]
- bearophile (20/22) Jun 29 2010 LLVM compiler can't unroll loops that loop a number of times known at co...
- bearophile (1/2) Jun 29 2010 I meant at run-time, sorry.
- bearophile (13/13) Jun 29 2010 There is one more thing that I have forgotten. If you have code like:
- Andrei Alexandrescu (5/23) Jun 29 2010 No need to worry about that. Unaligned arrays can be handled simply by
- bearophile (52/52) Jun 29 2010 A version for floats. A version for reals can't be done with SSE* regist...
- Andrei Alexandrescu (5/11) Jun 29 2010 You already have a loop at the end that takes care of the stray
- bearophile (8/11) Jun 29 2010 Unfortunately things aren't that simple, you need a pre-loop and a post-...
- Andrei Alexandrescu (4/19) Jun 29 2010 Oh I see. Yah, it looks like both before and after loops are needed.
I have added to Bugzilla a small enhancement request for Phobos:
http://d.puremagic.com/issues/show_bug.cgi?id=4393

In my opinion this can be a critical operation, so it deserves as much care as the array ops. Re-implementing this many times is bad; better to implement it well once and put it in Phobos (as a specialization of std.numeric.dotProduct). I am now able to write some working SSE* code, but I am not an expert yet. So if someone is willing to write it (or even offer it, if already written), I think Andrei will be willing to add it to the std.numeric module. std.numeric.dotProduct looks almost as efficient as D code gets.

So what I'd like added to Phobos are three (or six) efficient asm routines:
- dotProduct between two dynamic arrays of doubles (optionally floats too, if you want).
- Optionally dotProduct between two double[3] and two double[4] (optionally floats too, if you want) (later this code can be moved inside a small vect struct).

Bye and thank you,
bearophile
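[For reference, a minimal usage sketch of the existing std.numeric.dotProduct that the request asks to specialize; this example is illustrative, not part of the original post:]

import std.numeric: dotProduct;

void main() {
    double[] a = [1.0, 2.0, 3.0];
    double[] b = [4.0, 5.0, 6.0];
    double d = dotProduct(a, b); // 1*4 + 2*5 + 3*6 = 32
    assert(d == 32.0);
}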
Jun 25 2010
> I am now able to write some working SSE* code, but I am not an expert yet. So if someone is willing to write it (or even offer it, if already written), I think Andrei will be willing to add it to the std.numeric module.

My first try; writing X86+SSE asm is still a pain for me:

import std.c.stdio: printf;
import std.date: getUTCtime, ticksPerSecond;
import std.math: abs;
import std.numeric: dotProduct;
import std.contracts: enforce;

double clock() {
    return cast(double)getUTCtime() / ticksPerSecond;
}

double dot1(double[] arr1, double[] arr2) {
    assert(arr1.length == arr2.length);
    double tot = 0.0;
    foreach (i, e1; arr1)
        tot += e1 * arr2[i];
    return tot;
}

double dot2(double[] a, double[] b) {
    size_t len = a.length;
    assert(len == b.length, "dot(): the two array lengths differ.");
    if (len == 0)
        return 0.0;
    if (len == 1)
        return a[0] * b[0];

    double tot = void;
    double* a_ptr = a.ptr;
    double* b_ptr = b.ptr;
    assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
           "dot(): the first array is not aligned to 16 bytes.");
    assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
           "dot(): the second array is not aligned to 16 bytes.");
    len = (len / 2) * 2 * double.sizeof;

    asm {
        mov EAX, a_ptr;       // EAX = a_ptr
        mov ECX, b_ptr;       // ECX = b_ptr
        mov EDI, len;         // EDI = len
        xorpd XMM0, XMM0;     // XMM0[0,1] = 0.0, 0.0 (tot0, tot1)
        xor EDX, EDX;         // EDX = 0 (loop counter)

        align 8;
    LOOP_START:
        movapd XMM1, [EAX + EDX]; // XMM1[0,1] = *(EAX + EDX)[0,1]
        mulpd XMM1, [ECX + EDX];  // XMM1[0,1] *= *(ECX + EDX)[0,1]
        add EDX, 16;              // EDX += 16
        addpd XMM0, XMM1;         // XMM0[0,1] += XMM1[0,1]
        cmp EDI, EDX;
        jne LOOP_START;

        // XMM0[0] = XMM0[0] + XMM0[1]
        movapd XMM1, XMM0;    // XMM1[0,1] = XMM0[0,1]
        unpckhpd XMM1, XMM1;  // XMM1[0] = XMM1[1]
        addsd XMM0, XMM1;     // XMM0[0] += XMM1[0]
        movsd tot, XMM0;      // tot = XMM0[0]
    }

    if (a.length & 1)
        return tot + (a[$-1] * b[$-1]);
    else
        return tot;
}

void main() {
    enum int NLOOPS = 100_000;

    double[] a1 = [0.97939860007980784, 0.15654832543818165, 0.20876449456836543,
                   0.13588707622872687, 0.56737250028542408, 0.60890755422949261,
                   0.72503629431774808, 0.52283227344759831, 0.82107581846425648,
                   0.9280027000111094, 0.65371212119615851, 0.99162440025345067,
                   0.93134413423287143, 0.95319320272812291, 0.82373984947977308,
                   0.09382106227964937, 0.9914424038875832, 0.80601047736119313,
                   0.023619286739061329, 0.82167558946575081];
    a1 ~= a1;
    double[] a2 = [0.37967009318734157, 0.49961633977084308, 0.063379452228570665,
                   0.016573170529015635, 0.32720445135092779, 0.90558380684677242,
                   0.59689617644678783, 0.22590204202286546, 0.13701998912150426,
                   0.21786382155943662, 0.74110776773633547, 0.62437487889391807,
                   0.41013869338412479, 0.047723768990690196, 0.98567658092497179,
                   0.19281802583215202, 0.7937119206931792, 0.34128113271035532,
                   0.90960739148505643, 0.01852954991914102];
    a2 ~= a2;
    assert(a1.length == a2.length);

    double t0, t1;

    t0 = clock();
    double total0 = 0.0;
    foreach (i; 0 .. NLOOPS)
        foreach (len; 0 .. a1.length+1)
            total0 += dotProduct(a1[0 .. len], a2[0 .. len]);
    t1 = clock();
    printf("dotProduct: %.3f %f\n", t1-t0, total0);

    t0 = clock();
    double total1 = 0.0;
    foreach (i; 0 .. NLOOPS)
        foreach (len; 0 .. a1.length+1)
            total1 += dot1(a1[0 .. len], a2[0 .. len]);
    t1 = clock();
    printf("dot1: %.3f %f\n", t1-t0, total1);
    enforce(abs(total0 - total1) < 0.001);

    t0 = clock();
    double total2 = 0.0;
    foreach (i; 0 .. NLOOPS)
        foreach (len; 0 .. a1.length+1)
            total2 += dot2(a1[0 .. len], a2[0 .. len]);
    t1 = clock();
    printf("dot2: %.3f %f\n", t1-t0, total2);
    enforce(abs(total0 - total2) < 0.001);
}

Suggestions welcome.
This dot2() seems to work, but it can be improved in several ways:
- it needs to work when a and/or b are not aligned to 16 bytes;
- the loop can be unrolled once or more;
- maybe the non-asm code can be improved a bit;
- an SSE detection branch can be added, if it's not too slow.

dot2() is slower than std.numeric.dotProduct for small arrays, so a test can be added inside to detect short arrays and use normal D code on them (a sketch follows below). Unrolling its loop once can make it a little faster still for longish arrays.

Is this not valid?
movapd XMM1, [EAX + EDX * 16];

Bye,
bearophile
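[A minimal sketch of the short-array dispatch idea; the wrapper name dot2a and the cutoff value are placeholders, not measured:]

import std.numeric: dotProduct;

double dot2a(double[] a, double[] b) {
    // Below some cutoff the asm setup cost dominates, so fall back
    // to the plain D implementation. 32 is a guessed threshold.
    enum size_t CUTOFF = 32;
    if (a.length < CUTOFF)
        return dotProduct(a, b);
    return dot2(a, b); // the SSE2 version above
}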
Jun 27 2010
This is the fastest I've managed; further loop unrolling is not useful (for arrays of about 1400 doubles it's about 2.9 times faster than std.numeric.dotProduct on a recent Celeron CPU):

double dot5(double[] a, double[] b) {
    size_t len = a.length;
    assert(len == b.length, "dot(): the two array lengths differ.");
    if (len == 0)
        return 0.0;
    assert(len % 4 == 0, "dot(): length must be a multiple of 4.");

    double tot = void;
    double* a_ptr = a.ptr;
    double* b_ptr = b.ptr;
    assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
           "dot(): the first array is not aligned to 16 bytes.");
    assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
           "dot(): the second array is not aligned to 16 bytes.");
    len = (len / 4) * 32; // 4 doubles (32 bytes) per iteration

    asm {
        mov EAX, a_ptr;
        mov ECX, b_ptr;
        mov EDI, len;
        xorpd XMM0, XMM0;
        xorpd XMM3, XMM3;
        xor EDX, EDX;

        align 8;
    LOOP_START:
        movapd XMM1, [EAX + EDX];
        mulpd XMM1, [ECX + EDX];
        movapd XMM2, [EAX + EDX + 16];
        mulpd XMM2, [ECX + EDX + 16];
        addpd XMM0, XMM1;
        add EDX, 32;
        addpd XMM3, XMM2;
        cmp EDI, EDX;
        jne LOOP_START;

        addpd XMM0, XMM3;

        // XMM0[0] += XMM0[1]
        movapd XMM1, XMM0;
        unpckhpd XMM1, XMM1;
        addsd XMM0, XMM1;
        movsd tot, XMM0;
    }

    return tot;
}

Bye,
bearophile
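[A quick usage check, as a sketch; dot5 requires both arrays 16-byte aligned and a length that is a multiple of 4, and as the thread notes below, D2 heap allocations did not guarantee that alignment at the time:]

void main() {
    auto a = new double[400]; // assumed 16-byte aligned; not guaranteed
    auto b = new double[400];
    a[] = 0.5;
    b[] = 2.0;
    assert(dot5(a, b) == 400.0); // 400 * (0.5 * 2.0)
}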
Jun 29 2010
bearophile wrote:
> This is the fastest I've managed; further loop unrolling is not useful (for arrays of about 1400 doubles it's about 2.9 times faster than std.numeric.dotProduct on a recent Celeron CPU):
[snip]

That's pretty cool! Do we have your permission to replace dotProduct with yours?

Andrei
Jun 29 2010
Andrei:
> That's pretty cool!

The LLVM compiler can't unroll loops that iterate a number of times known at compile time (I have an open LLVM bug report on this), and even when this bug gets fixed, LLVM isn't able to vectorize yet, so it doesn't use SIMD instructions like movapd, mulpd, etc. that work on two doubles at the same time (unless you use intrinsic instructions like _mm_load_pd, which are currently not available with DMD but can be used with LDC). So this routine is an improvement even over the asm produced by LDC from D code.

But there are some other things you need to consider:

dot5() works with arrays of doubles only; it's not too hard to write a version for floats, and if you want I can write it given some time (I am slow at writing asm). A version for reals is probably less important.

dot5() uses the movapd instruction, so it requires the doubles to be aligned to 16 bytes. Currently it just asserts if the data is not aligned, but such an assert is of little use. This problem can be solved in several ways: one is adding a bit of code (in D or asm) at the beginning to process the first unaligned doubles if present. This branch slows the code down a bit (such a slowdown can be felt only for small arrays, of course). There are other solutions, none of them perfect. Currently in D2 big dynamic arrays are not aligned to 16 bytes, but Steven Schveighoffer will fix it:
http://d.puremagic.com/issues/show_bug.cgi?id=4400
Even when Steven has fixed the alignment to 16 bytes, you just need to slice the array like [1..$] to get an unaligned array. The type system can come to the rescue: an AlignedArray (a subtype of array) can be invented that guarantees alignment to 16 bytes even for its slices (so it just refuses to be sliced on bounds that break alignment); then inside the dotProduct() code a static if can test for such an array and avoid the runtime test (branch) for 16-byte alignment. Note that future CPUs may drop the requirement for 16-byte alignment. If that happens, AlignedArray becomes useless.

That dot5() seems to work, but it needs unittests.

The loop of dot5() is unrolled once, and each instruction works on two doubles, so essentially this code is unrolled four times. This means dot5() works only if the length of the input arrays is a multiple of 4. A bit of extra trailing or leading code (in asm or D, but D is probably enough) needs to be added to perform the mul+sum on the few pairs not managed by the main asm loop (see the sketch at the end of this post).

Maybe the D code inside dot5() can be improved a bit.

movapd is an SSE2 instruction, so an SSE2 detection branch can be added; this is very easy, you can test it with the sse2 function from the arraydouble module of druntime:

private import core.cpuid;
bool sse2() {
    return cpuid == 3 && core.cpuid.sse2();
}

The small problem is of course that this is a runtime test, and it can slow down the code a bit when the input arrays are small. It's not easy to move this test from runtime to compile time (you'd compile a module with different versions of the function... this can require self-modifying code to change the "static" start address of the right function...). Some compiler support can improve this situation.

Although dot5() is almost three times faster than std.numeric.dotProduct for aligned arrays of about 1400 doubles, for small or very small arrays it's slower. So another runtime test (an if) can be added inside the function to switch to a different algorithm (probably written in pure D) that works efficiently on very short arrays.

> Do we have your permission to replace dotProduct with yours?

Of course :-) Improving std.numeric was my purpose from the beginning. But as you see, dot5() isn't yet a replacement for dotProduct; some more work needs to be done even if dot5() contains no bugs.

Bye,
bearophile
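[A minimal sketch of the trailing-elements idea above; the wrapper name dot5Any is hypothetical, and it assumes the arrays are 16-byte aligned so the multiple-of-4 prefix stays valid for dot5:]

double dot5Any(double[] a, double[] b) {
    assert(a.length == b.length);
    // Feed the largest prefix whose length is a multiple of 4
    // to the fast asm kernel, then finish the stray pairs in D.
    size_t n = a.length & ~cast(size_t)0b11;
    double tot = n ? dot5(a[0 .. n], b[0 .. n]) : 0.0;
    foreach (i; n .. a.length)
        tot += a[i] * b[i];
    return tot;
}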
Jun 29 2010
> The LLVM compiler can't unroll loops that iterate a number of times known at compile time

I meant at run-time, sorry.
Jun 29 2010
There is one more thing that I have forgotten. If you have code like:

void main() {
    double[8] a = 0.5;
    double[8] b = 2.0;
    double r = dot5(a, b);
}

then you can't be sure that 'a' and 'b' are aligned to 16 bytes (I hope they are aligned to 8). Some time ago I suggested the simple syntax:

align(16) double[8] b = 2.0;

but things are not so easy:
http://d.puremagic.com/issues/show_bug.cgi?id=2278

Bye,
bearophile
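[One possible workaround, as a sketch not from the original post; the helper name alignedTail is hypothetical, and it only works for heap buffers you can over-allocate:]

// Returns the largest tail of buf whose .ptr is 16-byte aligned.
double[] alignedTail(double[] buf) {
    size_t pad = (16 - (cast(size_t)buf.ptr & 15)) & 15; // bytes to skip
    assert(pad % double.sizeof == 0); // heap doubles are 8-byte aligned
    return buf[pad / double.sizeof .. $];
}

// Usage: over-allocate by one element, then slice.
// double[] a = alignedTail(new double[8 + 1])[0 .. 8];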
Jun 29 2010
bearophile wrote:
> There is one more thing that I have forgotten. If you have code like:
>
> void main() {
>     double[8] a = 0.5;
>     double[8] b = 2.0;
>     double r = dot5(a, b);
> }
>
> then you can't be sure that 'a' and 'b' are aligned to 16 bytes (I hope they are aligned to 8). Some time ago I suggested the simple syntax:
>
> align(16) double[8] b = 2.0;
>
> but things are not so easy:
> http://d.puremagic.com/issues/show_bug.cgi?id=2278
>
> Bye,
> bearophile

No need to worry about that. Unaligned arrays can be handled simply by taking care of the stray elements by hand. The bulk of the loop will run at full speed.

Andrei
Jun 29 2010
A version for floats. A version for reals can't be done with SSE* registers.

This loop is unrolled twice, and each SSE register holds 4 floats, so it performs 8 mul+adds per loop iteration. Again this code is slower for shorter arrays, but not by much. A version of the code with no unrolling (that performs only 4 mul+adds per iteration) is a little better for shorter arrays (to create it you just need to change UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3, and add only 16 to EDX each iteration). The alignment asserts can be replaced by a loop that performs the unaligned mul+adds and then advances len, a_ptr and b_ptr to the remaining aligned elements.

float dot(float[] a, float[] b) {
    enum size_t UNROLL_MASK = 0b111;
    assert(a.length == b.length, "dot(): the two array lengths differ.");

    typeof(a[0]) tot = void;
    auto a_ptr = a.ptr;
    auto b_ptr = b.ptr;
    assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
           "dot(): the first array is not aligned to 16 bytes.");
    assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
           "dot(): the second array is not aligned to 16 bytes.");
    size_t len = (a.length & (~UNROLL_MASK)) * a[0].sizeof;

    if (len) {
        asm {
            mov EAX, a_ptr;
            mov ECX, b_ptr;
            mov EDI, len;
            xorps XMM0, XMM0;
            xorps XMM3, XMM3;
            xor EDX, EDX;

            align 8;
        LOOP_START:
            movaps XMM1, [EAX + EDX];
            mulps XMM1, [ECX + EDX];
            movaps XMM2, [EAX + EDX + 16];
            mulps XMM2, [ECX + EDX + 16];
            add EDX, 32;
            addps XMM0, XMM1;
            cmp EDI, EDX;
            addps XMM3, XMM2;
            jne LOOP_START;

            addps XMM0, XMM3;

            // XMM0[0] = XMM0[0] + XMM0[1] + XMM0[2] + XMM0[3]
            movhlps XMM1, XMM0;
            addps XMM0, XMM1;
            pshufd XMM1, XMM0, 0b01_01_01_01;
            addss XMM0, XMM1;
            movss tot, XMM0;
        }
    } else
        tot = 0.0;

    size_t left = a.length & UNROLL_MASK;
    for (size_t i = left; i > 0; i--)
        tot += a[$ - i] * b[$ - i];

    return tot;
}

On the same Celeron it takes about 0.88 clock ticks for each mul+add when the two arrays are 1400 floats long (they fit in the L1 cache; if they get longer and don't fit in L1 it slows down a little). I think better code or better CPUs can go as low as 0.5, but I and/or my CPU aren't that good yet. Its usage of cache lines is not optimal, I think. I don't think cache prefetching instructions can help this code.

Bye,
bearophile
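[A quick sanity check against a naive loop, as a sketch; it assumes the allocator happens to return 16-byte-aligned arrays, which, as noted earlier in the thread, D2 did not guarantee at the time:]

import std.math: abs;

void main() {
    auto a = new float[1400];
    auto b = new float[1400];
    foreach (i, ref e; a) e = i * 0.001f;
    foreach (i, ref e; b) e = 1.0f / (i + 1);

    float naive = 0.0f;
    foreach (i; 0 .. a.length)
        naive += a[i] * b[i];

    assert(abs(dot(a, b) - naive) < 1e-3f);
}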
Jun 29 2010
bearophile wrote:
> A version for floats. A version for reals can't be done with SSE* registers.
>
> This loop is unrolled twice, and each SSE register holds 4 floats, so it performs 8 mul+adds per loop iteration. Again this code is slower for shorter arrays, but not by much. A version of the code with no unrolling (that performs only 4 mul+adds per iteration) is a little better for shorter arrays (to create it you just need to change UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3, and add only 16 to EDX each iteration). The alignment asserts can be replaced by a loop that performs the unaligned mul+adds and then advances len, a_ptr and b_ptr to the remaining aligned elements.

You already have a loop at the end that takes care of the stray elements. Why not move it to the beginning to take care of the stray elements _and_ unaligned elements in one shot?

Andrei
Jun 29 2010
Andrei Alexandrescu:
> You already have a loop at the end that takes care of the stray elements. Why not move it to the beginning to take care of the stray elements _and_ unaligned elements in one shot?

Unfortunately things aren't that simple; you need a pre-loop and a post-loop. That asm block can process only aligned values in groups of 8 floats. So if you have an unaligned array of N items that starts and ends before and after a block of processable items, you need to process both head and tail separately.

a floats:    ** **** **** **** **** ***
Loop blocks:    xxxxxxxxx xxxxxxxxx xxxxxxxxx xxxxxxxxx
16b aligned: ____ ____ ____ ____ ____ ____ ____ ____

If it's still not clear, I can create an example.

Bye,
bearophile
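[A minimal sketch of the pre-loop/post-loop structure in D; the wrapper name dotAny is hypothetical, and it assumes both arrays share the same misalignment, otherwise the aligned kernel can't be used at all:]

float dotAny(float[] a, float[] b) {
    assert(a.length == b.length);
    float tot = 0.0f;

    // Pre-loop: consume pairs until a.ptr + i reaches a 16-byte boundary.
    size_t i = 0;
    while (i < a.length && (cast(size_t)(a.ptr + i) & 15) != 0) {
        tot += a[i] * b[i];
        i++;
    }

    // Aligned middle: the largest multiple-of-8 chunk for the asm kernel.
    size_t n = (a.length - i) & ~cast(size_t)0b111;
    if (n)
        tot += dot(a[i .. i + n], b[i .. i + n]);

    // Post-loop: the remaining tail, at most 7 pairs.
    foreach (j; i + n .. a.length)
        tot += a[j] * b[j];

    return tot;
}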
Jun 29 2010
bearophile wrote:
> Andrei Alexandrescu:
>> You already have a loop at the end that takes care of the stray elements. Why not move it to the beginning to take care of the stray elements _and_ unaligned elements in one shot?
>
> Unfortunately things aren't that simple; you need a pre-loop and a post-loop. That asm block can process only aligned values in groups of 8 floats. So if you have an unaligned array of N items that starts and ends before and after a block of processable items, you need to process both head and tail separately.
>
> a floats:    ** **** **** **** **** ***
> Loop blocks:    xxxxxxxxx xxxxxxxxx xxxxxxxxx xxxxxxxxx
> 16b aligned: ____ ____ ____ ____ ____ ____ ____ ____
>
> If it's still not clear, I can create an example.
>
> Bye,
> bearophile

Oh I see. Yah, it looks like both before and after loops are needed. Liquid fuel rocket, ion engine, and parachute.

Andrei
Jun 29 2010