www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Re: Good dotProduct

reply bearophile <bearophileHUGS lycos.com> writes:
This is the faster I've done, further loop unrolling is not useful (for arrays
of about 1400 doubles it's about 2.9 times faster than std.numeric.dotProduct
on a recent Celeron CPU):

double dot5(double[] a, double[] b) {
    size_t len = a.length;
    assert(len == b.length, "dot(): the two array lengths differ.");
    if (len == 0)
        return 0.0;
    assert(len % 4 == 0, "dot: length must be a multiple of 4");

    double tot = void;
    double* a_ptr = a.ptr;
    double* b_ptr = b.ptr;

    assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
           "dot(): the first array is not aligned to 16 bytes.");
    assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
           "dot(): the second array is not aligned to 16 bytes.");

    len = (len / 4) * 32;

    asm {
        mov EAX, a_ptr;
        mov ECX, b_ptr;
        mov EDI, len;
        xorpd XMM0, XMM0;
        xorpd XMM3, XMM3;
        xor EDX, EDX;

        align 8;
    LOOP_START:
        movapd XMM1, [EAX + EDX];
        mulpd XMM1,  [ECX + EDX];
        movapd XMM2, [EAX + EDX + 16];
        mulpd XMM2,  [ECX + EDX + 16];
        addpd XMM0, XMM1;
        add EDX, 32;
        addpd XMM3, XMM2;
        cmp EDI, EDX;
        jne LOOP_START;

        addpd XMM0, XMM3;

        // XMM0[0] += XMM0[1]
        movapd  XMM1, XMM0;
        unpckhpd XMM1, XMM1;
        addsd XMM0, XMM1;

        movsd tot, XMM0;
    }

    return tot;
}

Bye,
bearophile
Jun 29 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 This is the faster I've done, further loop unrolling is not useful (for arrays
of about 1400 doubles it's about 2.9 times faster than std.numeric.dotProduct
on a recent Celeron CPU):

That's pretty cool! Do we have your permission to replace dotProduct with yours? Andrei
Jun 29 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei:

That's pretty cool!<

LLVM compiler can't unroll loops that loop a number of times known at compile time (I have an open LLVM bug report on this), and even when this bug gets fixed, then LLVM isn't able to vectorize yet, so it doesn't use SIMD instructions like movapd, mulpd etc that work with two doubles at the same time (unless you use intrinsics instructions like _mm_load_pd that are currently not available with DMD but can be used with LDC too). So this routine is an improvement even on the asm produced by LDC from D code. But there are some other things you need to consider: dot5() works with arrays of doubles only; but it's not too much hard to write a version for floats, if you want I can write it in some time (I am slow in writing asm). A version for reals is probably less important. dot5() uses the movapd instruction, so it requires the doubles to be aligned to 16 bytes. Currently it just asserts if data is not aligned, but such assert is kind of useless. This problem can be solved in three ways: adding a bit of code (in D or asm) in the beginning to use the first unaligned double if present. This branch slows down code a bit (such slow down can be felt only for small arrays, of course). Otherwise there are other solutions, none of them is perfect. Currently in D2 big dynamic arrays are not aligned to 16 bytes, but Steven Schveighoffer will fix it: http://d.puremagic.com/issues/show_bug.cgi?id=4400 Even when Steven has fixed the alignment to 16 bytes, you just need to slice the array like [1..$] to have an unaligned array. The type system can come to the rescue: an AlignedArray (subtype of array) can be invented that guarantees alignment to 16 bytes even for its slices (so it just refuses to be sliced on odd indexed bounds), then inside the dotProduct() code a static if can test for such array and avoid the runtime test (branch) for 16 byte alignment. Note that future CPUs can drop the requirement for 16 bytes alignments. If this happens then AlignedArray can become useless. That dot5() seems to work, but it needs unittests. The loop of dot5() is unrolled once, and each instruction works on two doubles, so essentially this code is unrolled four times. This means dot5() works only if the length of the input arrays is a multiple of 4. A bit of extra trailing or leading code (in asm or D, but D is probably enough) needs to be added to perform the mul+sum in the few pairs not managed by the main asm loop. Maybe the D code inside dot5() can be improved a bit. movapd is a SSE2 instruction, so a SSE2 detection branch can be added, this is very easy, you can test is with the sse2 function from the arraydouble module of the druntime: private import core.cpuid; bool sse2() { return cpuid == 3 && core.cpuid.sse2(); } The small problem is of course that this is a runtime test, this can slow down code a bit when input arrays a small. It's not easy to move this test from runtime to compile-time (to compile a module with different versions of the function... this can require self-modifying code to change the "static" start address of the right function... Some compiler support can improve this situation. Despite dot5() is almost three times faster than std.numeric.dotProduct for aligned arrays of about 1400 doubles, for small or very small arrays it's slower. So another runtime test (if) can be added inside the function to switch to a different algorithm (probably purely written in D) that works efficiently on very short arrays.
Do we have your permission to replace dotProduct with yours?<

Of course :-) Improving std.numerics was my purpose from the beginning. But as you see dot5() isn't yet a replacement for dotProduct, some more work needs to be done even if dot() contains no bugs. Bye, bearophile
Jun 29 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
 LLVM compiler can't unroll loops that loop a number of times known at compile
time

I meant at run-time, sorry.
Jun 29 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
There is one more thing that I have forgotten. If you have code like:

void main() {
    double[8] a = 0.5;
    double[8] b = 2.0;    
    double r = dot5(a, b);
}

then you can't be sure that 'a' and 'b' are aligned to 16 bytes (I hope they
are aligned to 8).

Time ago I have suggested the simple syntax:
align(16) double[8] b = 2.0;    

but things are not so easy:
http://d.puremagic.com/issues/show_bug.cgi?id=2278

Bye,
bearophile
Jun 29 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 There is one more thing that I have forgotten. If you have code like:
 
 void main() {
     double[8] a = 0.5;
     double[8] b = 2.0;    
     double r = dot5(a, b);
 }
 
 then you can't be sure that 'a' and 'b' are aligned to 16 bytes (I hope they
are aligned to 8).
 
 Time ago I have suggested the simple syntax:
 align(16) double[8] b = 2.0;    
 
 but things are not so easy:
 http://d.puremagic.com/issues/show_bug.cgi?id=2278
 
 Bye,
 bearophile

No need to worry about that. Unaligned arrays can be handled simply by taking care of the stray elements by hand. The bulk of the loop will run at full speed. Andrei
Jun 29 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
A version for floats. A version for reals can't be done with SSE* registers.
This loop is unrolled two times, and each SSE2 register keeps 4 floats, so it
performs 8 mul+add each cycle. Again this code is slower for shorter arrays,
but not much.

A version of the code with no unrolling (that performs only 4 mul+add each
cycle) is a little better for shorter arrays (to create it you just need to
change UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3 and add
only 16 to EDX each loop).

The asserts assert((cast(size_t)... can be replaced by a loop that performs the
unaligned muls+adds and then changes len, a_ptr and b_ptr to the remaining
aligned ones.


float dot(float[] a, float[] b) {
    enum size_t UNROLL_MASK = 0b111;
    assert(a.length == b.length, "dot(): the two array lengths differ.");

    typeof(a[0]) tot = void;
    auto a_ptr = a.ptr;
    auto b_ptr = b.ptr;

    assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
           "dot(): the first array is not aligned to 16 bytes.");
    assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
           "dot(): the second array is not aligned to 16 bytes.");

    size_t len = (a.length & (~UNROLL_MASK)) * a[0].sizeof;

    if (len) {
        asm {
            mov EAX, a_ptr;
            mov ECX, b_ptr;
            mov EDI, len;
            xorps XMM0, XMM0;
            xorps XMM3, XMM3;
            xor EDX, EDX;

            align 8;
          LOOP_START:
            movaps XMM1, [EAX + EDX];
            mulps XMM1,  [ECX + EDX];
            movaps XMM2, [EAX + EDX + 16];
            mulps XMM2,  [ECX + EDX + 16];
            add EDX, 32;
            addps XMM0, XMM1;
            cmp EDI, EDX;
            addps XMM3, XMM2;
            jne LOOP_START;

            addps XMM0, XMM3;

            // XMM0[0] = XMM0[0] + XMM0[1] + XMM0[2] + XMM0[3]
            movhlps XMM1, XMM0;
            addps XMM0, XMM1;
            pshufd XMM1, XMM0, 0b01_01_01_01;
            addss XMM0, XMM1;

            movss tot, XMM0;
        }
    } else
        tot = 0.0;

    size_t left = a.length & UNROLL_MASK;
    for (size_t i = left; i > 0; i--)
        tot += a[$ - i] * b[$ - i];
    return tot;
}


On the same Celeron it takes about 0.88 ticks for each mul+add when the two
arrays are 1400 floats long (they fit in L1 cache, if they gets longer and
don't fit in L1 it slows down a little). I think better code or better CPUs can
go as low as 0.5, but I and/or my CPU aren't that good yet. Its usage of cache
lines is not optimal, I think. I don't think cache prefetching instructions can
help this code.

Bye,
bearophile
Jun 29 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 A version for floats. A version for reals can't be done with SSE* registers.
 This loop is unrolled two times, and each SSE2 register keeps 4 floats, so it
performs 8 mul+add each cycle. Again this code is slower for shorter arrays,
but not much.
 
 A version of the code with no unrolling (that performs only 4 mul+add each
cycle) is a little better for shorter arrays (to create it you just need to
change UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3 and add
only 16 to EDX each loop).
 
 The asserts assert((cast(size_t)... can be replaced by a loop that performs
the unaligned muls+adds and then changes len, a_ptr and b_ptr to the remaining
aligned ones.

You already have a loop at the end that takes care of the stray elements. Why not move it to the beginning to take care of the stray elements _and_ unaligned elements in one shot? Andrei
Jun 29 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 You already have a loop at the end that takes care of the stray 
 elements. Why not move it to the beginning to take care of the stray 
 elements _and_ unaligned elements in one shot?

Unfortunately things aren't that simple, you need a pre-loop and a post-loop. That asm block can process only aligned values in groups of 8 floats. So if you have an unaligned array of N items, that starts and ends before and after a block of processable items, you need to process both head and tail separately. a floats: ** **** **** **** **** *** Loop blocks: xxxxxxxxx xxxxxxxxx xxxxxxxxx xxxxxxxxx 16b aligned: ____ ____ ____ ____ ____ ____ ____ ____ if you don't understand still, I can create an example. Bye, bearophile
Jun 29 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 You already have a loop at the end that takes care of the stray 
 elements. Why not move it to the beginning to take care of the stray 
 elements _and_ unaligned elements in one shot?

Unfortunately things aren't that simple, you need a pre-loop and a post-loop. That asm block can process only aligned values in groups of 8 floats. So if you have an unaligned array of N items, that starts and ends before and after a block of processable items, you need to process both head and tail separately. a floats: ** **** **** **** **** *** Loop blocks: xxxxxxxxx xxxxxxxxx xxxxxxxxx xxxxxxxxx 16b aligned: ____ ____ ____ ____ ____ ____ ____ ____ if you don't understand still, I can create an example. Bye, bearophile

Oh I see. Yah, it looks like both before and after loops are needed. Liquid fuel rocket, ion engine, and parachute. Andrei
Jun 29 2010