www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 18627] New: std.complex is a lot slower than builtin complex

https://issues.dlang.org/show_bug.cgi?id=18627

          Issue ID: 18627
           Summary: std.complex is a lot slower than builtin complex types
                    at number crunching
           Product: D
           Version: D2
          Hardware: x86
                OS: Windows
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: phobos
          Assignee: nobody puremagic.com
          Reporter: aliloko gmail.com

BENCHMARK:

Please consider the following program:

It's just number crunching, there is no complex exponential or such function
calls.
The FFT function features complex addition, multiplication and division.

--- main.d


import std.string;
import std.datetime;
import std.datetime.stopwatch : benchmark, StopWatch;
import std.complex;
import std.stdio;
import std.math;

void main()
{
    cfloat[] A = new cfloat[1024];
    cdouble[] B = new cdouble[1024];
    Complex!float[] C = new Complex!float[1024];
    Complex!double[] D = new Complex!double[1024];
    foreach(i; 0..1024)
    {
        // Initialize with something
        A[i] = i + 0i;
        B[i] = i + 0i;
        C[i] = Complex!float(i, 0);
        D[i] = Complex!double(i, 0);
    }

    void fA()
    {
        bench!(float, cfloat, true)(A);
    }

    void fB()
    {
        bench!(double, cdouble, true)(B);
    }

    void fC()
    {
        bench!(float, Complex!float, false)(C);
    }

    void fD()
    {
        bench!(double, Complex!double, false)(D);
    }

    auto r = benchmark!(fA, fB, fC, fD)(1000);
    Duration fAResult = r[0];
    Duration fBResult = r[1];
    Duration fCResult = r[2];
    Duration fDResult = r[3];
    writefln("With cfloat: %s", fAResult);
    writefln("With cdouble: %s", fBResult);
    writefln("With Complex!float: %s", fCResult);
    writefln("With Complex!double: %s", fDResult);
}

void bench(T, ComplexType, bool usingBuiltinTypes)(ComplexType[] array)
{
    // forward then reverse FFT
    FFT!(T, ComplexType, true, usingBuiltinTypes)(array);
    FFT!(T, ComplexType, false, usingBuiltinTypes)(array);
}

// just for the benchmark purpose
private void FFT(T, ComplexType, bool direction, bool
usingBuiltinTypes)(ComplexType[] buffer) pure nothrow  nogc
{
    int size = cast(int)(buffer.length);
    int m = 10;

    ComplexType* pbuffer = buffer.ptr;

    // do the bit reversal
    int i2 = cast(int)size / 2;
    int j = 0;
    for (int i = 0; i < size - 1; ++i)
    {
        if (i < j)
        {
            auto tmp = pbuffer[i];
            pbuffer[i] = pbuffer[j];
            pbuffer[j] = tmp;
        }

        int k = i2;
        while(k <= j)
        {
            j = j - k;
            k = k / 2;
        }
        j += k;
    }

    // compute the FFT
    static if (usingBuiltinTypes)
    {
        ComplexType c = -1+0i;
    }
    else
    {
        ComplexType c = -1;
    }
    int l2 = 1;
    for (int l = 0; l < m; ++l)
    {
        int l1 = l2;
        l2 = l2 * 2;
        static if (usingBuiltinTypes)
        {
            ComplexType u = 1+0i;
        }
        else
        {
            ComplexType u = 1;
        }
        for (int j2 = 0; j2 < l1; ++j2)
        {
            int i = j2;
            while (i < size)
            {
                int i1 = i + l1;
                ComplexType t1 = u * pbuffer[i1];
                pbuffer[i1] = pbuffer[i] - t1;
                pbuffer[i] += t1;
                i += l2;
            }
            u = u * c;
        }

        T newImag = sqrt((1 - c.re) / 2);
        static if (direction)
            newImag = -newImag;
        T newReal = sqrt((1 + c.re) / 2);

        static if (usingBuiltinTypes)
        {            
            c = newReal + 1.0fi * newImag;
        }
        else
        {
            c = ComplexType(newReal, newImag);
        }
    }

    // scaling when doing the reverse transformation, to avoid being multiplied
by size
    static if (!direction)
    {
        T divider = 1 / cast(T)size;
        for (int i = 0; i < size; ++i)
        {
            pbuffer[i] = pbuffer[i] * divider;
        }
    }
}




BENCHMARKS

- With DMD v2.079.0, Windows:

$ rdmd -O -inline -release main.d

With cfloat: 1 sec, 187 ms, 225 ╬╝s, and 5 hnsecs
With cdouble: 1 sec, 315 ms, 385 ╬╝s, and 7 hnsecs
With Complex!float: 2 secs, 451 ms, 319 ╬╝s, and 2 hnsecs  (!)
With Complex!double: 11 secs, 539 ms, and 96 ╬╝s  (!!!)

$ rdmd -O -inline -release main.d -m64

With cfloat: 1 sec, 196 ms, 312 ╬╝s, and 7 hnsecs
With cdouble: 1 sec, 295 ms, 927 ╬╝s, and 9 hnsecs
With Complex!float: 2 secs, 505 ms, 536 ╬╝s, and 4 hnsecs (!)
With Complex!double: 11 secs, 590 ms, 574 ╬╝s, and 4 hnsecs (!!!)

- With LDC 1.8.0 32-bit:

$ ldc2.exe -O3 -release -enable-inlining main.d

With cfloat: 564 ms, 669 us, and 9 hnsecs
With cdouble: 535 ms, 570 us, and 2 hnsecs
With Complex!float: 563 ms, 809 us, and 4 hnsecs
With Complex!double: 537 ms, 289 us, and 6 hnsecs

- With LDC 1.8.0 64-bit:

$ ldc2.exe -O3 -release -enable-inlining main.d

With cfloat: 577 ms, 472 us, and 5 hnsecs
With cdouble: 554 ms, 357 us, and 6 hnsecs
With Complex!float: 588 ms, 463 us, and 9 hnsecs
With Complex!double: 558 ms, 881 us, and 7 hnsecs



COMMENTARY

We are looking to a guaranteed 2x to 10x regression when using complex number +
DMD and complex arithmetic.

Latest LDC seems to have fixed this discrepancy but I'm still using an older
LDC where the difference is 2x in favour of builtin complexes.

This will definately make my life harder and I don't understand why the rug is
pulled below builtin complex numbers.

While I do use another FFT optimized with LDC intrinsics, there is a lot of
complex number crunching in phase vocoders, filtering, and many signal
processing tasks. I find the builtin complex in D incredibly well suited to
this.

This is a competive avantage, even for readability. Please reconsider the
deprecation.

--
Mar 18 2018