digitalmars.D - Re: Good dotProduct

bearophile <bearophileHUGS lycos.com> Jun 29 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jun 29 2010

bearophile <bearophileHUGS lycos.com> Jun 29 2010

bearophile <bearophileHUGS lycos.com> Jun 29 2010

bearophile <bearophileHUGS lycos.com> Jun 29 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jun 29 2010

bearophile <bearophileHUGS lycos.com> Jun 29 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jun 29 2010

bearophile <bearophileHUGS lycos.com> Jun 29 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> Jun 29 2010

bearophile <bearophileHUGS lycos.com> writes:

This is the faster I've done, further loop unrolling is not useful (for arrays
of about 1400 doubles it's about 2.9 times faster than std.numeric.dotProduct
on a recent Celeron CPU):

double dot5(double[] a, double[] b) {
    size_t len = a.length;
    assert(len == b.length, "dot(): the two array lengths differ.");
    if (len == 0)
        return 0.0;
    assert(len % 4 == 0, "dot: length must be a multiple of 4");

    double tot = void;
    double* a_ptr = a.ptr;
    double* b_ptr = b.ptr;

    assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
           "dot(): the first array is not aligned to 16 bytes.");
    assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
           "dot(): the second array is not aligned to 16 bytes.");

    len = (len / 4) * 32;

    asm {
        mov EAX, a_ptr;
        mov ECX, b_ptr;
        mov EDI, len;
        xorpd XMM0, XMM0;
        xorpd XMM3, XMM3;
        xor EDX, EDX;

        align 8;
    LOOP_START:
        movapd XMM1, [EAX + EDX];
        mulpd XMM1,  [ECX + EDX];
        movapd XMM2, [EAX + EDX + 16];
        mulpd XMM2,  [ECX + EDX + 16];
        addpd XMM0, XMM1;
        add EDX, 32;
        addpd XMM3, XMM2;
        cmp EDI, EDX;
        jne LOOP_START;

        addpd XMM0, XMM3;

        // XMM0[0] += XMM0[1]
        movapd  XMM1, XMM0;
        unpckhpd XMM1, XMM1;
        addsd XMM0, XMM1;

        movsd tot, XMM0;
    }

    return tot;
}

Bye,
bearophile

Jun 29 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 This is the faster I've done, further loop unrolling is not useful (for arrays
of about 1400 doubles it's about 2.9 times faster than std.numeric.dotProduct
on a recent Celeron CPU):



That's pretty cool! Do we have your permission to replace dotProduct 
with yours?

Andrei

Jun 29 2010

bearophile <bearophileHUGS lycos.com> writes:

Andrei:

That's pretty cool!<


LLVM compiler can't unroll loops that loop a number of times known at compile
time (I have an open LLVM bug report on this), and even when this bug gets
fixed, then LLVM isn't able to vectorize yet, so it doesn't use SIMD
instructions like movapd, mulpd etc that work with two doubles at the same time
(unless you use intrinsics instructions like _mm_load_pd that are currently not
available with DMD but can be used with LDC too). So this routine is an
improvement even on the asm produced by LDC from D code.

But there are some other things you need to consider:

dot5() works with arrays of doubles only; but it's not too much hard to write a
version for floats, if you want I can write it in some time (I am slow in
writing asm). A version for reals is probably less important.

dot5() uses the movapd instruction, so it requires the doubles to be aligned to
16 bytes. Currently it just asserts if data is not aligned, but such assert is
kind of useless. This problem can be solved in three ways: adding a bit of code
(in D or asm) in the beginning to use the first unaligned double if present.
This branch slows down code a bit (such slow down can be felt only for small
arrays, of course). Otherwise there are other solutions, none of them is
perfect. 

Currently in D2 big dynamic arrays are not aligned to 16 bytes, but Steven
Schveighoffer will fix it:
http://d.puremagic.com/issues/show_bug.cgi?id=4400

Even when Steven has fixed the alignment to 16 bytes, you just need to slice
the array like [1..$] to have an unaligned array. The type system can come to
the rescue: an AlignedArray (subtype of array) can be invented that guarantees
alignment to 16 bytes even for its slices (so it just refuses to be sliced on
odd indexed bounds), then inside the dotProduct() code a static if can test for
such array and avoid the runtime test (branch) for 16 byte alignment.

Note that future CPUs can drop the requirement for 16 bytes alignments. If this
happens then AlignedArray can become useless.

That dot5() seems to work, but it needs unittests.

The loop of dot5() is unrolled once, and each instruction works on two doubles,
so essentially this code is unrolled four times. This means dot5() works only
if the length of the input arrays is a multiple of 4. A bit of extra trailing
or leading code (in asm or D, but D is probably enough) needs to be added to
perform the mul+sum in the few pairs not managed by the main asm loop.

Maybe the D code inside dot5() can be improved a bit.

movapd is a SSE2 instruction, so a SSE2 detection branch can be added, this is
very easy, you can test is with the sse2 function from the arraydouble module
of the druntime:
private import core.cpuid;
bool sse2() { return cpuid == 3 && core.cpuid.sse2(); }
The small problem is of course that this is a runtime test, this can slow down
code a bit when input arrays a small. It's not easy to move this test from
runtime to compile-time (to compile a module with different versions of the
function... this can require self-modifying code to change the "static" start
address of the right function... Some compiler support can improve this
situation.

Despite dot5() is almost three times faster than std.numeric.dotProduct for
aligned arrays of about 1400 doubles, for small or very small arrays it's
slower. So another runtime test (if) can be added inside the function to switch
to a different algorithm (probably purely written in D) that works efficiently
on very short arrays.


Do we have your permission to replace dotProduct with yours?<


Of course :-) Improving std.numerics was my purpose from the beginning. But as
you see dot5() isn't yet a replacement for dotProduct, some more work needs to
be done even if dot() contains no bugs.

Bye,
bearophile

Jun 29 2010

bearophile <bearophileHUGS lycos.com> writes:

 LLVM compiler can't unroll loops that loop a number of times known at compile
time


I meant at run-time, sorry.

Jun 29 2010

bearophile <bearophileHUGS lycos.com> writes:

There is one more thing that I have forgotten. If you have code like:

void main() {
    double[8] a = 0.5;
    double[8] b = 2.0;    
    double r = dot5(a, b);
}

then you can't be sure that 'a' and 'b' are aligned to 16 bytes (I hope they
are aligned to 8).

Time ago I have suggested the simple syntax:
align(16) double[8] b = 2.0;    

but things are not so easy:
http://d.puremagic.com/issues/show_bug.cgi?id=2278

Bye,
bearophile

Jun 29 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 There is one more thing that I have forgotten. If you have code like:
 
 void main() {
     double[8] a = 0.5;
     double[8] b = 2.0;    
     double r = dot5(a, b);
 }
 
 then you can't be sure that 'a' and 'b' are aligned to 16 bytes (I hope they
are aligned to 8).
 
 Time ago I have suggested the simple syntax:
 align(16) double[8] b = 2.0;    
 
 but things are not so easy:
 http://d.puremagic.com/issues/show_bug.cgi?id=2278
 
 Bye,
 bearophile


No need to worry about that. Unaligned arrays can be handled simply by 
taking care of the stray elements by hand. The bulk of the loop will run 
at full speed.

Andrei

Jun 29 2010

bearophile <bearophileHUGS lycos.com> writes:

A version for floats. A version for reals can't be done with SSE* registers.
This loop is unrolled two times, and each SSE2 register keeps 4 floats, so it
performs 8 mul+add each cycle. Again this code is slower for shorter arrays,
but not much.

A version of the code with no unrolling (that performs only 4 mul+add each
cycle) is a little better for shorter arrays (to create it you just need to
change UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3 and add
only 16 to EDX each loop).

The asserts assert((cast(size_t)... can be replaced by a loop that performs the
unaligned muls+adds and then changes len, a_ptr and b_ptr to the remaining
aligned ones.


float dot(float[] a, float[] b) {
    enum size_t UNROLL_MASK = 0b111;
    assert(a.length == b.length, "dot(): the two array lengths differ.");

    typeof(a[0]) tot = void;
    auto a_ptr = a.ptr;
    auto b_ptr = b.ptr;

    assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
           "dot(): the first array is not aligned to 16 bytes.");
    assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
           "dot(): the second array is not aligned to 16 bytes.");

    size_t len = (a.length & (~UNROLL_MASK)) * a[0].sizeof;

    if (len) {
        asm {
            mov EAX, a_ptr;
            mov ECX, b_ptr;
            mov EDI, len;
            xorps XMM0, XMM0;
            xorps XMM3, XMM3;
            xor EDX, EDX;

            align 8;
          LOOP_START:
            movaps XMM1, [EAX + EDX];
            mulps XMM1,  [ECX + EDX];
            movaps XMM2, [EAX + EDX + 16];
            mulps XMM2,  [ECX + EDX + 16];
            add EDX, 32;
            addps XMM0, XMM1;
            cmp EDI, EDX;
            addps XMM3, XMM2;
            jne LOOP_START;

            addps XMM0, XMM3;

            // XMM0[0] = XMM0[0] + XMM0[1] + XMM0[2] + XMM0[3]
            movhlps XMM1, XMM0;
            addps XMM0, XMM1;
            pshufd XMM1, XMM0, 0b01_01_01_01;
            addss XMM0, XMM1;

            movss tot, XMM0;
        }
    } else
        tot = 0.0;

    size_t left = a.length & UNROLL_MASK;
    for (size_t i = left; i > 0; i--)
        tot += a[$ - i] * b[$ - i];
    return tot;
}


On the same Celeron it takes about 0.88 ticks for each mul+add when the two
arrays are 1400 floats long (they fit in L1 cache, if they gets longer and
don't fit in L1 it slows down a little). I think better code or better CPUs can
go as low as 0.5, but I and/or my CPU aren't that good yet. Its usage of cache
lines is not optimal, I think. I don't think cache prefetching instructions can
help this code.

Bye,
bearophile

Jun 29 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 A version for floats. A version for reals can't be done with SSE* registers.
 This loop is unrolled two times, and each SSE2 register keeps 4 floats, so it
performs 8 mul+add each cycle. Again this code is slower for shorter arrays,
but not much.
 
 A version of the code with no unrolling (that performs only 4 mul+add each
cycle) is a little better for shorter arrays (to create it you just need to
change UNROLL_MASK to 0b11, remove all the operations on XMM2 and XMM3 and add
only 16 to EDX each loop).
 
 The asserts assert((cast(size_t)... can be replaced by a loop that performs
the unaligned muls+adds and then changes len, a_ptr and b_ptr to the remaining
aligned ones.


You already have a loop at the end that takes care of the stray 
elements. Why not move it to the beginning to take care of the stray 
elements _and_ unaligned elements in one shot?

Andrei

Jun 29 2010

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:
 You already have a loop at the end that takes care of the stray 
 elements. Why not move it to the beginning to take care of the stray 
 elements _and_ unaligned elements in one shot?


Unfortunately things aren't that simple, you need a pre-loop and a post-loop.
That asm block can process only aligned values in groups of 8 floats. So if you
have an unaligned array of N items, that starts and ends before and after a
block of processable items, you need to process both head and tail separately.

a floats:           ** **** **** **** **** ***
Loop blocks: xxxxxxxxx xxxxxxxxx xxxxxxxxx xxxxxxxxx
16b aligned: ____ ____ ____ ____ ____ ____ ____ ____

if you don't understand still, I can create an example.

Bye,
bearophile

Jun 29 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 You already have a loop at the end that takes care of the stray 
 elements. Why not move it to the beginning to take care of the stray 
 elements _and_ unaligned elements in one shot?


 Unfortunately things aren't that simple, you need a pre-loop and a post-loop.
That asm block can process only aligned values in groups of 8 floats. So if you
have an unaligned array of N items, that starts and ends before and after a
block of processable items, you need to process both head and tail separately.
 
 a floats:           ** **** **** **** **** ***
 Loop blocks: xxxxxxxxx xxxxxxxxx xxxxxxxxx xxxxxxxxx
 16b aligned: ____ ____ ____ ____ ____ ____ ____ ____
 
 if you don't understand still, I can create an example.
 
 Bye,
 bearophile


Oh I see. Yah, it looks like both before and after loops are needed. 
Liquid fuel rocket, ion engine, and parachute.

Andrei

Jun 29 2010

D Programming

C/C++ Programming

Other

digitalmars.D - Re: Good dotProduct