digitalmars.D - Naive node.js faster than naive LDC2?
- James Lu (16/16) Aug 21 2020 Code:
- James Lu (4/20) Aug 21 2020 With the double type:
- James Lu (6/34) Aug 21 2020 Bonus: Direct translation of Dlang to Node.js, Node.js faster
- H. S. Teoh (11/18) Aug 21 2020 Using a class for Complex (and a non-final one at that!!) introduces
- MoonlightSentinel (2/3) Aug 21 2020 Maybe try `creal`?
- James Lu (5/23) Aug 21 2020 I showed with and without class. V8's analyzer might be superior
- H. S. Teoh (228/233) Aug 22 2020 Sorry, I missed that the first time round. But how come your struct
- James Lu (5/13) Aug 22 2020 If you had read all three of my original self-replies9, you would
- Arun (4/21) Aug 21 2020 He mentioned this in his first post.
- H. S. Teoh (49/53) Aug 21 2020 [...]
- aberba (4/10) Aug 22 2020 Or maybe DMD is not trying to win any performance context... just
- bachmeier (5/21) Aug 21 2020 I have no desire to dig into it myself, but I'll just note that
- H. S. Teoh (22/26) Aug 22 2020 As described in my other post, my analysis of James' code reveals the
- Andrei Alexandrescu (16/46) Aug 22 2020 First off this is a great accomplishment of the V8 team. That's the
- James Lu (3/16) Aug 22 2020 Interesting perspective.
- jmh530 (4/21) Aug 22 2020 It would be interesting to see a write-up where LDC's dynamic
- kinke (9/12) Aug 22 2020 Yes, and that's because there's only a `real` version for fmax.
- Arjan (2/14) Aug 22 2020 Sound to me like a fantastic SAOC project.
- Andrei Alexandrescu (5/21) Aug 22 2020 Ow, do we still suffer from that? Sigh.
- Avrina (5/27) Aug 22 2020 Cos, sin, tan, asin, acos, atan, etc.. There's still more,
- Andrei Alexandrescu (5/35) Aug 22 2020 1. Linear time for small n is fine and does not affect the argument.
- Avrina (10/51) Aug 22 2020 What argument?
- Andrei Alexandrescu (2/52) Aug 22 2020 You seem to derive good enjoyment out of making unkind comments.
- Avrina (4/64) Aug 24 2020 I am an observer of truth, if you don't like the truth, look
- kinke (6/9) Aug 22 2020 These functions are a lot more involved indeed; I've taken care
- bachmeier (6/34) Aug 24 2020 Well, I'm not going to debate the possibility that D will
Code: https://gist.github.com/CrazyPython/364f11465dab90d611ecc81490682680

LDC 1.23.0 (Installed from dlang.org)
ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast
Node v14.40 (V8 8.1.307.31)

Dlang trials: 2957 2560 2048 (Average: 2521)
Node.JS trials: 1988 2567 1863 (Average: 2139)

Notes:
- I had to reinstall Dlang from the install script
- I was initially confused why -mtune=native didn't work, and had to read documentation. Would have been nice if the compiler told me -mcpu=native was what I needed.
- I skipped -march=native. Did not find information on the wiki https://wiki.dlang.org/Using_LDC
- Node.js compiles faster and uses a compilation cache

Mandatory citation: https://github.com/brion/mandelbrot-shootout
Aug 21 2020
On Friday, 21 August 2020 at 23:10:53 UTC, James Lu wrote:
> [...]

With the double type:

Node: 2211 2574 2306
Dlang: 2520 1891 1676
Aug 21 2020
On Friday, 21 August 2020 at 23:14:12 UTC, James Lu wrote:
> [...]

Bonus: Direct translation of Dlang to Node.js, Node.js faster
https://gist.github.com/CrazyPython/8bafd16837ec8ad4c5a638b9d305fc96

Dlang: 4076 3622 2934 (3544 average)
Node.js: 2624 2334 2316 (2424 average)

LDC2 is 46% slower!
Aug 21 2020
On Fri, Aug 21, 2020 at 11:22:27PM +0000, James Lu via Digitalmars-d wrote:
[...]
> Bonus: Direct translation of Dlang to Node.js, Node.js faster
> https://gist.github.com/CrazyPython/8bafd16837ec8ad4c5a638b9d305fc96
> Dlang: 4076 3622 2934 (3544 average)
> Node.js: 2624 2334 2316 (2424 average)
> LDC2 is 46% slower!

Using a class for Complex (and a non-final one at that!!) introduces tons of allocation overhead per iteration, plus virtual function call overhead. You should be using a struct instead. I betcha this one change will make a big difference in performance.

Also, what's the command you're using to compile the program? If you're doing a performance comparison, you should specify -O2 or -O3.

T

-- 
Knowledge is that area of ignorance that we arrange and classify. -- Ambrose Bierce
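To illustrate the suggestion (a hypothetical sketch, not code from the gist): as a struct, Complex gets value semantics, so the hot loop works entirely on stack data with statically bound, inlineable methods:

```d
// Hypothetical sketch: as a struct, Complex is a value type, so the
// inner loop performs no GC allocation and no virtual dispatch.
struct Complex
{
    double x, y;

    // In-place multiply; statically bound and trivially inlineable.
    void mul(const Complex other)
    {
        const newX = x * other.x - y * other.y;
        const newY = x * other.y + y * other.x;
        x = newX;
        y = newY;
    }
}

unittest
{
    auto c = Complex(5, 3);
    c.mul(Complex(4, 2));
    assert(c.x == 14 && c.y == 22);
}
```

Alternatively, marking the class and its methods `final` would at least remove the virtual dispatch, though the per-iteration allocation would remain.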
Aug 21 2020
On Friday, 21 August 2020 at 23:49:44 UTC, H. S. Teoh wrote:
> You should be using a struct instead.

Maybe try `creal`?
Aug 21 2020
On Friday, 21 August 2020 at 23:49:44 UTC, H. S. Teoh wrote:
> Using a class for Complex (and a non-final one at that!!) introduces
> tons of allocation overhead per iteration, plus virtual function call
> overhead. You should be using a struct instead. I betcha this one
> change will make a big difference in performance.
>
> Also, what's the command you're using to compile the program? If
> you're doing a performance comparison, you should specify -O2 or -O3.

I showed with and without class. V8's analyzer might be superior to LDC's in removing the allocation overhead.

I used the same compilation flags as the original:
ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast
Aug 21 2020
On Sat, Aug 22, 2020 at 12:21:50AM +0000, James Lu via Digitalmars-d wrote:
[...]
> I showed with and without class.

Sorry, I missed that the first time round. But how come your struct version uses `real` but your class version uses `double`? It's well-known that `real` is slow because it uses x87 instructions, as opposed to the SSE/etc. instructions that `double` would use.

> V8's analyzer might be superior to LDC's in removing the allocation
> overhead.
>
> I used the same compilation flags as the original:
> ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast

I found that with -ffast-math --fp-contract=fast, performance doubled. Since James' original struct version uses real, I decided to do a comparison between real and double in addition to class vs. struct:

class version, with real:
ldc2 -d-version=useClass -d-version=useReal -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-class-native
8 secs, 201 ms, 286 μs, and 9 hnsecs
8 secs, 153 ms, 617 μs, and 9 hnsecs
8 secs, 205 ms, 966 μs, and 6 hnsecs

class version, with double:
ldc2 -d-version=useClass -d-version=useDouble -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-class-native
4 secs, 177 ms, 842 μs, and 3 hnsecs
4 secs, 297 ms, 899 μs, and 6 hnsecs
4 secs, 221 ms, 916 μs, and 7 hnsecs

struct version, with real:
ldc2 -d-version=useStruct -d-version=useReal -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-struct-native
3 secs, 191 ms, 21 μs, and 4 hnsecs
3 secs, 223 ms, 692 μs, and 9 hnsecs
3 secs, 210 ms, 429 μs, and 2 hnsecs

struct version, with double:
ldc2 -d-version=useStruct -d-version=useDouble -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-struct-native
2 secs, 659 ms, 309 μs, and 2 hnsecs
2 secs, 654 ms, 96 μs, and 3 hnsecs
2 secs, 630 ms, 84 μs, and 4 hnsecs

As you can see, using struct vs class grants almost double the performance. Using double with struct instead of real with struct gives a 17% improvement.

The difference between struct and class is not surprising; allocations are slow in general, and D generally does not do very much optimization of allocations. Node.js being Javascript-based, and Javascript being object-heavy, it's not surprising that more object lifetime analysis would be applied to optimize allocations.

I do feel James' struct implementation was flawed, though, because of using real instead of double, real being known to be slow on modern hardware. (Also, comparing struct + real to class + double seems a bit like comparing apples and oranges.) My original modification of James' code uses struct + double, and comparing that with struct + real showed a 17% degradation upon switching to real.

As a further step, I profiled the program and found that most of the time was being spent calling the C math library's fmax() function (which involves an expensive PIC indirection, not to mention lack of inlining). Writing a naïve version of fmax() in D gave the following numbers:

struct version, with double + custom fmax function:
ldc2 -d-version=useStruct -d-version=useDouble -d-version=customFmax -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-struct-native
1 sec, 567 ms, 219 μs, and 6 hnsecs
1 sec, 557 ms, 762 μs, and 7 hnsecs
1 sec, 574 ms, 657 μs, and 7 hnsecs

This represents a whopping 40% improvement over the version calling the C library's fmax function. I wonder how this last version compares with the Node.js performance?

Code, for full disclosure (basically copy-n-pasted from James' code, with minor modifications for testing struct vs class, real vs double):

------------------------------------
import std.stdio;
import std.math;
import core.time;

version(useReal)   alias Num = real;
version(useDouble) alias Num = double;

version(customFmax)
Num fmax(Num x, Num y) { return (x < y) ? y : x; }

version(useStruct)
{
    struct Complex
    {
        Num x;
        Num y;

        this(A)(A px, A py)
        {
            this.x = px;
            this.y = py;
        }

        unittest
        {
            auto complex = Complex(2, 2);
            assert(complex.x == 2 && complex.y == 2);
        }

        auto abs() const
        {
            return fmax(this.x * this.x, this.y * this.y);
        }

        void add(T)(const T other)
        {
            this.x += other.x;
            this.y += other.y;
        }

        void mul(T)(const T other)
        {
            auto newX = this.x * other.x - this.y * other.y;
            auto newY = this.x * other.y + this.y * other.x;
            this.x = newX;
            this.y = newY;
        }
    }

    unittest
    {
        auto c = Complex(5, 3);
        c.mul(Complex(4, 2));
        assert(c.x == 14 && c.y == 22);
    }

    unittest
    {
        auto org = Complex(0, 0);
        org.add(Complex(3, 3));
        assert(org.x == 3 && org.y == 3);
    }

    auto iterate_mandelbrot(const Complex c, const int maxIters)
    {
        auto z = Complex(0, 0);
        for (int i = 0; i < maxIters; i++)
        {
            if (z.abs() >= 2.0) { return i; }
            z.mul(z);
            z.add(c);
        }
        return maxIters;
    }

    const x0 = -2.5, x1 = 1, y0 = -1, y1 = 1;
    const cols = 72, rows = 24;
    const maxIters = 1000000;

    void main()
    {
        auto now = MonoTime.currTime;
        for (Num row = 0; row < rows; row++)
        {
            const y = (row / rows) * (y1 - y0) + y0;
            char[] str;
            for (Num col = 0; col < cols; col++)
            {
                // Num is needed here because otherwise "/" does integer division
                const x = (col / cols) * (x1 - x0) + x0;
                auto c = Complex(x, y);
                auto iters = iterate_mandelbrot(c, maxIters);
                if (iters == 0) { str ~= '.'; }
                else if (iters == 1) { str ~= '%'; }
                else if (iters == 2) { str ~= ' '; }
                else if (iters == maxIters) { str ~= ' '; }
                else { }
            }
            str.writeln;
        }
        writeln(MonoTime.currTime - now);
    }
}

version(useClass)
{
    class Complex
    {
        Num x;
        Num y;

        this(A)(A px, A py)
        {
            this.x = px;
            this.y = py;
        }

        unittest
        {
            auto complex = new Complex(2, 2);
            assert(complex.x == 2 && complex.y == 2);
        }

        auto abs() const
        {
            return fmax(this.x * this.x, this.y * this.y);
        }

        void add(T)(const T other)
        {
            this.x += other.x;
            this.y += other.y;
        }

        void mul(T)(const T other)
        {
            auto newX = this.x * other.x - this.y * other.y;
            auto newY = this.x * other.y + this.y * other.x;
            this.x = newX;
            this.y = newY;
        }
    }

    unittest
    {
        auto c = new Complex(5, 3);
        c.mul(new Complex(4, 2));
        assert(c.x == 14 && c.y == 22);
    }

    unittest
    {
        auto org = new Complex(0, 0);
        org.add(new Complex(3, 3));
        assert(org.x == 3 && org.y == 3);
    }

    auto iterate_mandelbrot(const Complex c, const int maxIters)
    {
        auto z = new Complex(0, 0);
        for (int i = 0; i < maxIters; i++)
        {
            if (z.abs() >= 2.0) { return i; }
            z.mul(z);
            z.add(c);
        }
        return maxIters;
    }

    const x0 = -2.5, x1 = 1, y0 = -1, y1 = 1;
    const cols = 72, rows = 24;
    const maxIters = 1000000;

    void main()
    {
        auto now = MonoTime.currTime;
        for (Num row = 0; row < rows; row++)
        {
            const y = (row / rows) * (y1 - y0) + y0;
            char[] str;
            for (Num col = 0; col < cols; col++)
            {
                // Num is needed here because otherwise "/" does integer division
                const x = (col / cols) * (x1 - x0) + x0;
                auto c = new Complex(x, y);
                auto iters = iterate_mandelbrot(c, maxIters);
                if (iters == 0) { str ~= '.'; }
                else if (iters == 1) { str ~= '%'; }
                else if (iters == 2) { str ~= ' '; }
                else if (iters == maxIters) { str ~= ' '; }
                else { }
            }
            str.writeln;
        }
        writeln(MonoTime.currTime - now);
    }
}
------------------------------------

T

-- 
MAS = Mana Ada Sistem?
Aug 22 2020
On Saturday, 22 August 2020 at 16:01:52 UTC, H. S. Teoh wrote:
> Sorry, I missed that the first time round. But how come your struct
> version uses `real` but your class version uses `double`? It's
> well-known that `real` is slow because it uses x87 instructions, as
> opposed to the SSE/etc. instructions that `double` would use.

If you had read all three of my original self-replies, you would have noticed I also did a version with struct + class. Perhaps I need to learn to keep them all in the same post.
Aug 22 2020
On Friday, 21 August 2020 at 23:49:44 UTC, H. S. Teoh wrote:
> Also, what's the command you're using to compile the program? If
> you're doing a performance comparison, you should specify -O2 or -O3.

He mentioned this in his first post:

LDC 1.23.0 (Installed from dlang.org)
ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast
Aug 21 2020
On Fri, Aug 21, 2020 at 04:49:44PM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
> Using a class for Complex (and a non-final one at that!!) introduces
> tons of allocation overhead per iteration, plus virtual function call
> overhead. You should be using a struct instead. I betcha this one
> change will make a big difference in performance.
[...]

OK, so I copied the code and changed the class to struct, and compared the results. Both versions are compiled with ldc2 -O3.

class version:
7 secs, 125 ms, 608 μs, and 9 hnsecs
7 secs, 155 ms, 328 μs, and 6 hnsecs
7 secs, 158 ms, 966 μs, and 4 hnsecs

struct version:
6 secs, 55 ms, 140 μs, and 4 hnsecs
6 secs, 125 ms, 974 μs, and 5 hnsecs
6 secs, 126 ms, 945 μs, and 4 hnsecs

For performance comparisons, take the best of n (because the others are merely measuring more system noise). This represents about a 15% performance increase in switching to struct instead of class.

I thought it might make a difference to optimize for my CPU with -mcpu=native, so here are the numbers:

class version:
7 secs, 100 ms, 602 μs, and 6 hnsecs
7 secs, 100 ms, 437 μs, and 7 hnsecs
7 secs, 121 ms, 594 μs, and 4 hnsecs

struct version:
6 secs, 73 ms, 534 μs, and 3 hnsecs
5 secs, 662 ms, 626 μs, and 5 hnsecs
6 secs, 103 ms, 871 μs, and 2 hnsecs

Again taking the best of 3, that's about a 20% performance increase in changing from class to struct.

Just for laughs, I tested with dmd -O -inline:

class version:
7 secs, 255 ms, 748 μs, and 5 hnsecs
7 secs, 249 ms, 683 μs, and 9 hnsecs
7 secs, 593 ms, 847 μs, and 8 hnsecs

struct version:
7 secs, 646 ms, 685 μs, and 5 hnsecs
7 secs, 618 ms, 642 μs, and 7 hnsecs
7 secs, 606 ms, 85 μs, and 4 hnsecs

Surprisingly, the class version does *better* than the struct version when compiled with dmd. (Wow, is dmd codegen *that* bad that it outweighs even class allocation overhead?? :-D) But both are worse than even the class version with ldc2 -O3 (even without -mcpu=native).

So yeah. I wouldn't trust dmd with a 10-foot pole when it comes to runtime performance. The struct version compiled with `ldc2 -O3 -mcpu=native` beats the struct version compiled with dmd by a 26% margin. That's pretty sad.

T

-- 
An imaginary friend squared is a real enemy.
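The best-of-n methodology above can be sketched as a small helper (hypothetical code, not part of the benchmark):

```d
// Hypothetical sketch of a best-of-n timing helper: run the workload
// n times and keep the minimum, since the extra time in the slower
// runs is mostly system noise.
import core.time : Duration, MonoTime;

Duration bestOfN(void delegate() workload, int n = 3)
{
    Duration best = Duration.max;
    foreach (i; 0 .. n)
    {
        const start = MonoTime.currTime;
        workload();
        const elapsed = MonoTime.currTime - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```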
Aug 21 2020
On Saturday, 22 August 2020 at 00:10:43 UTC, H. S. Teoh wrote:
> Surprisingly, the class version does *better* than the struct version
> when compiled with dmd. (Wow, is dmd codegen *that* bad that it
> outweighs even class allocation overhead?? :-D)

Or maybe DMD is not trying to win any performance contest... just focusing on fast compilation for quick prototyping. Something you wouldn't get otherwise without DMD.
Aug 22 2020
On Friday, 21 August 2020 at 23:10:53 UTC, James Lu wrote:
> [...]

I have no desire to dig into it myself, but I'll just note that if you check the CLBG, you'll see that it's not hard to write C and C++ programs for this benchmark that are many times slower than Node JS. The worst of them takes seven times longer to run.
Aug 21 2020
On Sat, Aug 22, 2020 at 02:08:40AM +0000, bachmeier via Digitalmars-d wrote:
[...]
> I have no desire to dig into it myself, but I'll just note that if you
> check the CLBG, you'll see that it's not hard to write C and C++
> programs for this benchmark that are many times slower than Node JS.
> The worst of them takes seven times longer to run.

As described in my other post, my analysis of James' code reveals the following issues:

1) Using class instead of struct;
2) Using real instead of double;
3) std.math.fmax calling the C library (involving a PIC indirection to a shared library, as opposed to inlineable native D code).

Addressing all 3 issues yielded a 67% improvement (from class + real + C fmax -> struct + double + native fmax), or a 37% improvement (from class + double + C fmax -> struct + double + native fmax). I don't have a Node.js environment, though, so I can't make a direct comparison with the optimized D version.

I will note, though, that (1) and (2) are well-known performance issues; I'm surprised that this was not taken into account in the original comparison. (3) is something worth considering for std.math -- yes, it's troublesome to have to maintain D versions of these functions, but they *could* make a potentially big performance impact by being inlineable in hot inner loops.

T

-- 
Computers shouldn't beep through the keyhole.
Aug 22 2020
On 8/22/20 12:15 PM, H. S. Teoh wrote:
> As described in my other post, my analysis of James' code reveals the
> following issues:
>
> 1) Using class instead of struct;
> 2) Using real instead of double;
> 3) std.math.fmax calling the C library (involving a PIC indirection
>    to a shared library, as opposed to inlineable native D code).
>
> Addressing all 3 issues yielded a 67% improvement (from class + real
> + C fmax -> struct + double + native fmax), or a 37% improvement
> (from class + double + C fmax -> struct + double + native fmax). I
> don't have a Node.js environment, though, so I can't make a direct
> comparison with the optimized D version.

First off, this is a great accomplishment of the V8 team. That's the remarkable part.

Second, this is in many ways not new. JIT optimizers are great at optimizing numeric code and loops. With regularity ever since shortly after Java was invented, "Java is faster than C" claims have been backed by benchmarks involving numeric code and loops. For such, the JIT can do as good a job as, and sometimes even better than, a traditional compiler.

The way a low-level language competes is by offering you ways to tune code to rival/surpass JIT performance if you need to. (Of course, that means the programmer is spending time on that, which is a minus.) I have no doubt the D mandelbrot code can be made at least as fast as the node.js code.

The reality of large applications involves things like structures and indirections and such, for which data layout is important. I think JITs have a way to go in that area.
Aug 22 2020
On Saturday, 22 August 2020 at 16:38:42 UTC, Andrei Alexandrescu wrote:
> [...]

Interesting perspective.
Aug 22 2020
On Saturday, 22 August 2020 at 16:38:42 UTC, Andrei Alexandrescu wrote:
> [snip]
> I have no doubt the D mandelbrot code can be made at least as fast as
> the node.js code. The reality of large applications involves things
> like structures and indirections and such, for which data layout is
> important. I think JITs have a way to go in that area.

It would be interesting to see a write-up where LDC's dynamic compile can speed up D code in the same way.
Aug 22 2020
On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
> 3) std.math.fmax calling the C library (involving a PIC indirection
>    to a shared library, as opposed to inlineable native D code).

Yes, and that's because there's only a `real` version for fmax. If upstream Phobos had proper double/float overloads, we could uncomment the LDC-specific implementations using LLVM intrinsics, which use (obviously much faster) SSE instructions:
https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798

Number crunching in D could be significantly accelerated if the people interested in it showed some love for std.math, but we've had this topic for years.
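Roughly, such overloads would look like this (a sketch only; the actual code behind the link may differ, and `llvm_maxnum` from `ldc.intrinsics` is an assumption here):

```d
// Rough sketch of per-type fmax overloads for LDC. llvm_maxnum lowers
// to an SSE maxss/maxsd-style sequence instead of a call through the
// PLT into the C runtime. Names/signatures are assumptions, not the
// actual Phobos code.
version (LDC)
{
    import ldc.intrinsics : llvm_maxnum;

    float  fmax(float x, float y)   { return llvm_maxnum(x, y); }
    double fmax(double x, double y) { return llvm_maxnum(x, y); }
    real   fmax(real x, real y)     { return llvm_maxnum(x, y); }
}
```

With per-type overloads, a call like `fmax(x * x, y * y)` on doubles stays in SSE registers and can be inlined into the hot loop, which is exactly the cost H. S. Teoh measured above.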
Aug 22 2020
On Saturday, 22 August 2020 at 17:40:11 UTC, kinke wrote:
> Number crunching in D could be significantly accelerated if the
> people interested in it showed some love for std.math, but we've had
> this topic for years.

Sounds to me like a fantastic SAOC project.
Aug 22 2020
On 8/22/20 1:40 PM, kinke wrote:
> Yes, and that's because there's only a `real` version for fmax. If
> upstream Phobos had proper double/float overloads, we could uncomment
> the LDC-specific implementations using LLVM intrinsics, which use
> (obviously much faster) SSE instructions:
> https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798

Ow, do we still suffer from that? Sigh.

https://github.com/dlang/phobos/pull/7604/files

It's 10 minutes of work - as much as writing a couple of posts, and much more satisfactory.
Aug 22 2020
On Saturday, 22 August 2020 at 19:20:34 UTC, Andrei Alexandrescu wrote:
> Ow, do we still suffer from that? Sigh.
>
> https://github.com/dlang/phobos/pull/7604/files
>
> It's 10 minutes of work - as much as writing a couple of posts, and
> much more satisfactory.

Cos, sin, tan, asin, acos, atan, etc. There's still more; putting in the actual work that std.math needs is going to take more than 10 mins.
Aug 22 2020
On 8/22/20 3:34 PM, Avrina wrote:
> Cos, sin, tan, asin, acos, atan, etc. There's still more; putting in
> the actual work that std.math needs is going to take more than 10
> mins.

1. Linear time for small n is fine and does not affect the argument.
2. Incremental is still fine.
3. Work has actually been done by Nick Wilson in https://github.com/dlang/phobos/pull/7463.
Aug 22 2020
On Saturday, 22 August 2020 at 21:20:36 UTC, Andrei Alexandrescu wrote:
> 1. Linear time for small n is fine and does not affect the argument.

What argument?

> 2. Incremental is still fine.

It can introduce subtle bugs and problems with precision, flip-flopping between float, double, and real. If it is done all at once, it will only happen once, not every time someone feels like spending 10 mins doing a little bit of work to change one function. It really shouldn't have been implemented this way in the first place.

> 3. Work has actually been done by Nick Wilson in
> https://github.com/dlang/phobos/pull/7463.

A dead pull request? Not unusual.
Aug 22 2020
On 8/22/20 6:09 PM, Avrina wrote:
> [...]
> A dead pull request? Not unusual.

You seem to derive good enjoyment out of making unkind comments.
Aug 22 2020
On Sunday, 23 August 2020 at 02:18:19 UTC, Andrei Alexandrescu wrote:
> [...]
> You seem to derive good enjoyment out of making unkind comments.

I am an observer of truth; if you don't like the truth, look away, as
that seems to be commonplace here anyway.
Aug 24 2020
On Saturday, 22 August 2020 at 19:34:42 UTC, Avrina wrote:
> Cos, sin, tan, asin, acos, atan, etc.. There's still more, putting in
> the actual work that std.math needs is going to take more than 10 mins.

These functions are indeed a lot more involved; I took care of a few of
them some years ago, porting from the Cephes C library, see
https://github.com/dlang/phobos/pull/6272. The main difficulty is the
need to support 4 floating-point formats: single, double, x87 extended,
and quadruple precision.
Aug 22 2020
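Ports of Cephes kernels typically reduce to one generic evaluation routine plus a per-precision coefficient table for each of the four formats. A rough sketch of that shape; the name `evalPoly` and the coefficients are made up for illustration, not real minimax constants from Cephes:

```d
// Horner-scheme polynomial evaluation, generic over float/double/real.
// Each precision supplies its own coefficient table, tuned so the
// approximation error stays below that format's epsilon.
T evalPoly(T)(T x, const T[] coefficients)
{
    // coefficients[i] is the coefficient of x^i; walk from the highest
    // degree down, multiplying and adding (Horner's rule).
    T r = coefficients[$ - 1];
    foreach_reverse (c; coefficients[0 .. $ - 1])
        r = r * x + c;
    return r;
}

void main()
{
    // Evaluate 1 + 2x + 3x^2 at x = 2: 1 + 4 + 12 = 17
    assert(evalPoly(2.0, [1.0, 2.0, 3.0]) == 17.0);
}
```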
On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
> On Sat, Aug 22, 2020 at 02:08:40AM +0000, bachmeier via Digitalmars-d wrote:
> [...]
>> I have no desire to dig into it myself, but I'll just note that if you
>> check the CLBG, you'll see that it's not hard to write C and C++
>> programs for this benchmark that are many times slower than Node JS.
>> The worst of them takes seven times longer to run.
>
> As described in my other post, my analysis of James' code reveals the
> following issues:
>
> 1) Using class instead of struct;
> 2) Using real instead of double;
> 3) std.math.fmax calling the C library (involving a PIC indirection to
>    a shared library as opposed to inlineable native D code).
>
> Addressing all 3 issues yielded a 67% improvement (from class + real +
> C fmax -> struct + double + native fmax), or 37% improvement (from
> class + double + C fmax -> struct + double + native fmax). I don't have
> a Node.js environment, though, so I can't make a direct comparison with
> the optimized D version.
>
> I will note, though, that (1) and (2) are well-known performance
> issues; I'm surprised that this was not taken into account in the
> original comparison. (3) is something worth considering for std.math --
> yes it's troublesome to have to maintain D versions of these functions,
> but they *could* make a potentially big performance impact by being
> inlineable in hot inner loops.

Well, I'm not going to debate the possibility that D will outperform
Node with appropriate optimization. It's clear from the C and C++
timings, though, that you can't just type out a C, C++, or D program and
expect to beat Node - it's going to take some care and knowledge that
you really don't need with Node.
Aug 24 2020
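For reference, the first two fixes in the analysis quoted above amount to turning a GC-allocated class with real fields into a plain value type with double fields. A hedged sketch of what that looks like; this is not James' actual benchmark code, just the general shape of the struct-based version:

```d
// Value-type complex number: no per-node GC allocation, no virtual
// dispatch, and double instead of x87 real keeps the arithmetic in
// SSE registers.
struct Complex
{
    double re = 0.0;
    double im = 0.0;

    // (a+bi) + (c+di) = (a+c) + (b+d)i
    Complex opBinary(string op : "+")(Complex rhs) const
    {
        return Complex(re + rhs.re, im + rhs.im);
    }

    // (a+bi) * (c+di) = (ac-bd) + (ad+bc)i
    Complex opBinary(string op : "*")(Complex rhs) const
    {
        return Complex(re * rhs.re - im * rhs.im,
                       re * rhs.im + im * rhs.re);
    }
}

void main()
{
    auto z = Complex(1, 2) * Complex(3, 4);  // (1+2i)(3+4i) = -5+10i
    assert(z.re == -5 && z.im == 10);
    auto s = Complex(1, 1) + Complex(2, 3);
    assert(s.re == 3 && s.im == 4);
}
```

Because the struct is copied by value and never escapes, hot loops over millions of these values touch no allocator at all, which is where the bulk of the reported improvement came from.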