digitalmars.D.bugs - [Issue 4393] New: Very good dotProduct

d-bugmail puremagic.com (33/33) Jun 25 2010 http://d.puremagic.com/issues/show_bug.cgi?id=4393

d-bugmail puremagic.com (10/10) Jan 09 2011 http://d.puremagic.com/issues/show_bug.cgi?id=4393
d-bugmail puremagic.com (95/95) Apr 28 2011 http://d.puremagic.com/issues/show_bug.cgi?id=4393
d-bugmail puremagic.com (7/7) Apr 28 2011 http://d.puremagic.com/issues/show_bug.cgi?id=4393
d-bugmail puremagic.com (15/15) Apr 28 2011 http://d.puremagic.com/issues/show_bug.cgi?id=4393
d-bugmail puremagic.com (20/20) Apr 28 2011 http://d.puremagic.com/issues/show_bug.cgi?id=4393

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=4393

           Summary: Very good dotProduct
           Product: D
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: bearophile_hugs eml.cc



This is really an enhancement request.

In many programs I compute dot product among two arrays of doubles (often
inside another loop. So it's positive std.numeric.dotProduct uses assert
instead of enforce). This is sometimes the most time-critical operation of a
whole small numerical program.

So I'd like the dot product among two arrays of doubles to be specialized with
an asm routine (that uses SSE registers). In my opinion this algorithm is
simple enough and this product is critical enough to justify this asm code
added to Phobos (a disadvantage is that D front-end doesn't inline functions
that contain asm) (as array operations in druntime are often in asm).

Two more specialized routines can be added somewhere (if necessary elsewhere,
where a small vector struct is defined in Phobos) to perform dotProduct among
fixed-sized arrays of 3 or 4 doubles. This often another performance-critical
operation.

See for example:
http://www.bealto.com/mp-dot_sse.html
http://www.gamedev.net/community/forums/topic.asp?topic_id=380764

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Jun 25 2010

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=4393


Andrei Alexandrescu <andrei metalanguage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |andrei metalanguage.com
         AssignedTo|nobody puremagic.com        |andrei metalanguage.com


-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Jan 09 2011

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=4393




Some code I have written for arrays of floats:


float dot(float[] a, float[] b) {
    enum size_t UNROLL_MASK = 0b111;
    assert(a.length == b.length, "dot(): the two array lengths differ.");

    typeof(a[0]) tot = void;
    auto a_ptr = a.ptr;
    auto b_ptr = b.ptr;

    assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
           "dot(): the first array is not aligned to 16 bytes.");
    assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
           "dot(): the second array is not aligned to 16 bytes.");

    size_t len = (a.length & (~UNROLL_MASK)) * a[0].sizeof;

    if (len) {
        asm {
            mov EAX, a_ptr;
            mov ECX, b_ptr;
            mov EDI, len;
            xorps XMM0, XMM0;
            xorps XMM3, XMM3;
            xor EDX, EDX;

            align 8;
          LOOP_START:
            movaps XMM1, [EAX + EDX];
            mulps XMM1,  [ECX + EDX];
            movaps XMM2, [EAX + EDX + 16];
            mulps XMM2,  [ECX + EDX + 16];
            add EDX, 32;
            addps XMM0, XMM1;
            cmp EDI, EDX;
            addps XMM3, XMM2;
            jne LOOP_START;

            addps XMM0, XMM3;

            // XMM0[0] = XMM0[0] + XMM0[1] + XMM0[2] + XMM0[3]
            movhlps XMM1, XMM0;
            addps XMM0, XMM1;
            pshufd XMM1, XMM0, 0b01_01_01_01;
            addss XMM0, XMM1;

            movss tot, XMM0;
        }
    } else
        tot = 0.0;

    size_t left = a.length & UNROLL_MASK;
    for (size_t i = left; i > 0; i--)
        tot += a[$ - i] * b[$ - i];
    return tot;
}



And for arrays of doubles:

import std.c.stdio;

void main() {
    double[] A = [0.7644108908809033, 0.96458177717869509,
                  0.51166055832523683, 0.098840360055908461,
                  0.40813780586233483, 0.32285857447088551,
                  0.59137950751990331, 0.59518178287473289];
    double[] B = [0.28061374331187983, 0.85036446787626108,
                  0.52498124274748326, 0.84170745998075014,
                  0.55819169392258683, 0.62586351111477134,
                  0.43021720539707864, 0.652708603473523];

    // >>> sum(a*b for a,b in zip(A, B))
    // 2.4593435789602185

    if (A.length % 4 != 0) throw new Error("");
    double tot1 = void,
           tot2 = void;
    auto a_ptr = cast(double*)A.ptr;
    auto b_ptr = cast(double*)B.ptr;
    size_t len = A.length * double.sizeof;

    asm {
        mov EAX, a_ptr;
        mov ECX, b_ptr;
        mov EDI, len;
        xorps XMM0, XMM0;
        xorps XMM3, XMM3;
        xor EDX, EDX;

        align 8;
      LOOP_START:
        movapd XMM1, [EAX + EDX];
        mulpd XMM1,  [ECX + EDX];
        movapd XMM2, [EAX + EDX + 16];
        mulpd XMM2,  [ECX + EDX + 16];
        add EDX, 32;
        addpd XMM0, XMM1;
        cmp EDI, EDX;
        addpd XMM3, XMM2;
        jne LOOP_START;

        addpd XMM0, XMM3; // XMM0 += XMM3

        movhpd tot1, XMM0;
        movlpd tot2, XMM0;
    }

    printf("%.15lf\n", tot1 + tot2);
}



See bug 5880 too.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Apr 28 2011

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=4393




20:10:55 PDT ---
Nice work. Got some benchmarks? Also, I wonder what version flag I should use
to guard the assembler implementation. Don?

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Apr 28 2011

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=4393


David Simcha <dsimcha yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dsimcha yahoo.com



Looks rather interesting.  Dot products are sufficiently universally useful
that, if this is significantly faster than the current std.numeric
implementation, inclusion is definitely justified.  However, you'd greatly
increase the chances of inclusion if you created some simple benchmarks
(nothing fancy, just the obvious ones) and unit tests and submitted this as a
pull request on Github.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Apr 28 2011

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=4393


Don <clugdbug yahoo.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |clugdbug yahoo.com.au



Did you test this on Intel, or AMD? Blas1 code is generally limited by memory
access, and AMD has two load ports, so it has different bottlenecks and in
these operations always does better than Intel.

See also a discussion at:
http://www.bealto.com/mp-dot_sse.html
(I've talked to Eric Bealto before, he's happy for Phobos to use any of his
stuff if we see anything we want). It's a bit misleading, though, because above
a certain length, you become dominated by cache effects, so I don't know if
unrolling by 4 is actually worthwhile in practice.

I also have some optimized x87 dot product code (AMD 32 bit CPUs don't have
SSE2, so they still need x87 for doubles).

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Apr 28 2011

D Programming

C/C++ Programming

Other

digitalmars.D.bugs - [Issue 4393] New: Very good dotProduct