digitalmars.D.bugs - [Issue 18627] New: std.complex is a lot slower than builtin complex
- d-bugmail puremagic.com (195/195) Mar 18 2018 https://issues.dlang.org/show_bug.cgi?id=18627
https://issues.dlang.org/show_bug.cgi?id=18627 Issue ID: 18627 Summary: std.complex is a lot slower than builtin complex types at number crunching Product: D Version: D2 Hardware: x86 OS: Windows Status: NEW Severity: enhancement Priority: P1 Component: phobos Assignee: nobody puremagic.com Reporter: aliloko gmail.com BENCHMARK: Please consider the following program: It's just number crunching, there is no complex exponential or such function calls. The FFT function features complex addition, multiplication and division. --- main.d import std.string; import std.datetime; import std.datetime.stopwatch : benchmark, StopWatch; import std.complex; import std.stdio; import std.math; void main() { cfloat[] A = new cfloat[1024]; cdouble[] B = new cdouble[1024]; Complex!float[] C = new Complex!float[1024]; Complex!double[] D = new Complex!double[1024]; foreach(i; 0..1024) { // Initialize with something A[i] = i + 0i; B[i] = i + 0i; C[i] = Complex!float(i, 0); D[i] = Complex!double(i, 0); } void fA() { bench!(float, cfloat, true)(A); } void fB() { bench!(double, cdouble, true)(B); } void fC() { bench!(float, Complex!float, false)(C); } void fD() { bench!(double, Complex!double, false)(D); } auto r = benchmark!(fA, fB, fC, fD)(1000); Duration fAResult = r[0]; Duration fBResult = r[1]; Duration fCResult = r[2]; Duration fDResult = r[3]; writefln("With cfloat: %s", fAResult); writefln("With cdouble: %s", fBResult); writefln("With Complex!float: %s", fCResult); writefln("With Complex!double: %s", fDResult); } void bench(T, ComplexType, bool usingBuiltinTypes)(ComplexType[] array) { // forward then reverse FFT FFT!(T, ComplexType, true, usingBuiltinTypes)(array); FFT!(T, ComplexType, false, usingBuiltinTypes)(array); } // just for the benchmark purpose private void FFT(T, ComplexType, bool direction, bool usingBuiltinTypes)(ComplexType[] buffer) pure nothrow nogc { int size = cast(int)(buffer.length); int m = 10; ComplexType* pbuffer = buffer.ptr; // do the bit reversal int i2 = cast(int)size / 2; int j = 0; for (int i = 0; i < size - 1; ++i) { if (i < j) { auto tmp = pbuffer[i]; pbuffer[i] = pbuffer[j]; pbuffer[j] = tmp; } int k = i2; while(k <= j) { j = j - k; k = k / 2; } j += k; } // compute the FFT static if (usingBuiltinTypes) { ComplexType c = -1+0i; } else { ComplexType c = -1; } int l2 = 1; for (int l = 0; l < m; ++l) { int l1 = l2; l2 = l2 * 2; static if (usingBuiltinTypes) { ComplexType u = 1+0i; } else { ComplexType u = 1; } for (int j2 = 0; j2 < l1; ++j2) { int i = j2; while (i < size) { int i1 = i + l1; ComplexType t1 = u * pbuffer[i1]; pbuffer[i1] = pbuffer[i] - t1; pbuffer[i] += t1; i += l2; } u = u * c; } T newImag = sqrt((1 - c.re) / 2); static if (direction) newImag = -newImag; T newReal = sqrt((1 + c.re) / 2); static if (usingBuiltinTypes) { c = newReal + 1.0fi * newImag; } else { c = ComplexType(newReal, newImag); } } // scaling when doing the reverse transformation, to avoid being multiplied by size static if (!direction) { T divider = 1 / cast(T)size; for (int i = 0; i < size; ++i) { pbuffer[i] = pbuffer[i] * divider; } } } BENCHMARKS - With DMD v2.079.0, Windows: $ rdmd -O -inline -release main.d With cfloat: 1 sec, 187 ms, 225 ╬╝s, and 5 hnsecs With cdouble: 1 sec, 315 ms, 385 ╬╝s, and 7 hnsecs With Complex!float: 2 secs, 451 ms, 319 ╬╝s, and 2 hnsecs (!) With Complex!double: 11 secs, 539 ms, and 96 ╬╝s (!!!) $ rdmd -O -inline -release main.d -m64 With cfloat: 1 sec, 196 ms, 312 ╬╝s, and 7 hnsecs With cdouble: 1 sec, 295 ms, 927 ╬╝s, and 9 hnsecs With Complex!float: 2 secs, 505 ms, 536 ╬╝s, and 4 hnsecs (!) With Complex!double: 11 secs, 590 ms, 574 ╬╝s, and 4 hnsecs (!!!) - With LDC 1.8.0 32-bit: $ ldc2.exe -O3 -release -enable-inlining main.d With cfloat: 564 ms, 669 us, and 9 hnsecs With cdouble: 535 ms, 570 us, and 2 hnsecs With Complex!float: 563 ms, 809 us, and 4 hnsecs With Complex!double: 537 ms, 289 us, and 6 hnsecs - With LDC 1.8.0 64-bit: $ ldc2.exe -O3 -release -enable-inlining main.d With cfloat: 577 ms, 472 us, and 5 hnsecs With cdouble: 554 ms, 357 us, and 6 hnsecs With Complex!float: 588 ms, 463 us, and 9 hnsecs With Complex!double: 558 ms, 881 us, and 7 hnsecs COMMENTARY We are looking to a guaranteed 2x to 10x regression when using complex number + DMD and complex arithmetic. Latest LDC seems to have fixed this discrepancy but I'm still using an older LDC where the difference is 2x in favour of builtin complexes. This will definately make my life harder and I don't understand why the rug is pulled below builtin complex numbers. While I do use another FFT optimized with LDC intrinsics, there is a lot of complex number crunching in phase vocoders, filtering, and many signal processing tasks. I find the builtin complex in D incredibly well suited to this. This is a competive avantage, even for readability. Please reconsider the deprecation. --
Mar 18 2018