
digitalmars.D - Naive node.js faster than naive LDC2?

reply James Lu <jamtlu gmail.com> writes:
Code: 
https://gist.github.com/CrazyPython/364f11465dab90d611ecc81490682680

LDC 1.23.0 (Installed from dlang.org)

ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast

Node v14.40 (V8 8.1.307.31)

Dlang trials: 2957 2560 2048 Average: 2521
Node.JS trials: 1988 2567 1863 Average: 2139

Notes:

  - I had to reinstall Dlang from the install script
  - I was initially confused why -mtune=native didn't work, and 
had to read documentation. Would have been nice if the compiler 
told me -mcpu=native was what I needed.
  - I skipped -march=native. Did not find information on the wiki 
https://wiki.dlang.org/Using_LDC
  - Node.js compiles faster and uses a compilation cache

Mandatory citation: https://github.com/brion/mandelbrot-shootout
Aug 21 2020
next sibling parent reply James Lu <jamtlu gmail.com> writes:
On Friday, 21 August 2020 at 23:10:53 UTC, James Lu wrote:
 Code: 
 https://gist.github.com/CrazyPython/364f11465dab90d611ecc81490682680

 LDC 1.23.0 (Installed from dlang.org)

 ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast

 Node v14.40 (V8 8.1.307.31)

 Dlang trials: 2957 2560 2048 Average: 2521
 Node.JS trials: 1988 2567 1863 Average: 2139

 Notes:

  - I had to reinstall Dlang from the install script
  - I was initially confused why -mtune=native didn't work, and 
 had to read documentation. Would have been nice if the compiler 
 told me -mcpu=native was what I needed.
  - I skipped -march=native. Did not find information on the 
 wiki https://wiki.dlang.org/Using_LDC
  - Node.js compiles faster and uses a compilation cache

 Mandatory citation: https://github.com/brion/mandelbrot-shootout
With the double type: Node: 2211 2574 2306 Dlang: 2520 1891 1676
Aug 21 2020
parent reply James Lu <jamtlu gmail.com> writes:
On Friday, 21 August 2020 at 23:14:12 UTC, James Lu wrote:
 On Friday, 21 August 2020 at 23:10:53 UTC, James Lu wrote:
 Code: 
 https://gist.github.com/CrazyPython/364f11465dab90d611ecc81490682680

 LDC 1.23.0 (Installed from dlang.org)

 ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast

 Node v14.40 (V8 8.1.307.31)

 Dlang trials: 2957 2560 2048 Average: 2521
 Node.JS trials: 1988 2567 1863 Average: 2139

 Notes:

  - I had to reinstall Dlang from the install script
  - I was initially confused why -mtune=native didn't work, and 
 had to read documentation. Would have been nice if the 
 compiler told me -mcpu=native was what I needed.
  - I skipped -march=native. Did not find information on the 
 wiki https://wiki.dlang.org/Using_LDC
  - Node.js compiles faster and uses a compilation cache

 Mandatory citation: 
 https://github.com/brion/mandelbrot-shootout
With the double type: Node: 2211 2574 2306 Dlang: 2520 1891 1676
Bonus: Direct translation of Dlang to Node.js, Node.js faster:
https://gist.github.com/CrazyPython/8bafd16837ec8ad4c5a638b9d305fc96

Dlang: 4076 3622 2934 (3544 average)
Node.js: 2624 2334 2316 (2424 average)

LDC2 is 46% slower!
Aug 21 2020
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 21, 2020 at 11:22:27PM +0000, James Lu via Digitalmars-d wrote:
[...]
 Bonus: Direct translation of Dlang to Node.js, Node.js faster
 https://gist.github.com/CrazyPython/8bafd16837ec8ad4c5a638b9d305fc96
 
 Dlang: 4076 3622 2934 (3544 average)
 Node.js: 2624 2334 2316 (2424 average)
 
 LDC2 is 46% slower!
Using a class for Complex (and a non-final one at that!!) introduces tons of allocation overhead per iteration, plus virtual function call overhead. You should be using a struct instead. I betcha this one change will make a big difference in performance.

Also, what's the command you're using to compile the program? If you're doing performance comparison, you should specify -O2 or -O3.

T

-- 
Knowledge is that area of ignorance that we arrange and classify. -- Ambrose Bierce
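For illustration, a minimal sketch of the suggested change, using a trimmed-down, hypothetical Complex (not the full benchmark code):

```d
// Hedged sketch: a value-type Complex lives on the stack, so the hot
// loop does no per-iteration GC allocation and no virtual dispatch.
import std.stdio;

struct Complex   // struct = value type; a class here would need `new`
{
    double x, y;

    void mul(const Complex other)
    {
        const newX = x * other.x - y * other.y;
        const newY = x * other.y + y * other.x;
        x = newX;
        y = newY;
    }
}

void main()
{
    auto z = Complex(5, 3);
    z.mul(Complex(4, 2));   // no `new`, no heap traffic
    writeln(z.x, " ", z.y); // prints: 14 22
}
```

(If a class really is needed, marking it `final` at least devirtualizes its method calls.)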
Aug 21 2020
next sibling parent MoonlightSentinel <moonlightsentinel disroot.org> writes:
On Friday, 21 August 2020 at 23:49:44 UTC, H. S. Teoh wrote:
 You should be using a struct instead.
Maybe try `creal`?
Aug 21 2020
prev sibling next sibling parent reply James Lu <jamtlu gmail.com> writes:
On Friday, 21 August 2020 at 23:49:44 UTC, H. S. Teoh wrote:
 On Fri, Aug 21, 2020 at 11:22:27PM +0000, James Lu via 
 Digitalmars-d wrote: [...]
 Bonus: Direct translation of Dlang to Node.js, Node.js faster 
 https://gist.github.com/CrazyPython/8bafd16837ec8ad4c5a638b9d305fc96
 
 Dlang: 4076 3622 2934 (3544 average)
 Node.js: 2624 2334 2316 (2424 average)
 
 LDC2 is 46% slower!
Using a class for Complex (and a non-final one at that!!) introduces tons of allocation overhead per iteration, plus virtual function call overhead. You should be using a struct instead. I betcha this one change will make a big difference in performance.

Also, what's the command you're using to compile the program? If you're doing performance comparison, you should specify -O2 or -O3.

T
I showed with and without class. V8's analyzer might be superior to LDC's in removing the allocation overhead.

I used the same compilation flags as the original:

ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast
Aug 21 2020
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Aug 22, 2020 at 12:21:50AM +0000, James Lu via Digitalmars-d wrote:
[...]
 I showed with and without class.
Sorry, I missed that the first time round. But how come your struct version uses `real` but your class version uses `double`? It's well-known that `real` is slow because it uses x87 instructions, as opposed to the SSE/etc. instructions that `double` would use.
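For illustration, a hedged sketch of the difference (illustrative names, not code from the thread): on x86-64, `real` is the 80-bit x87 type, while `double` lets the backend keep the math in SSE registers.

```d
// Illustrative sketch: a single alias decides the float width used
// throughout; `real` routes through x87, `double` through SSE on x86-64.
alias Num = double; // switch to `real` to take the slower x87 path

Num absSquared(Num x, Num y)
{
    return x * x + y * y;
}

unittest
{
    assert(absSquared(3.0, 4.0) == 25.0);
}
```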
 V8's analyzer might be superior to LDC's in removing the allocation
 overhead. I used the same compilation flags as the original:
 
 ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast
I found that with -ffast-math --fp-contract=fast, performance doubled. Since James' original struct version uses real, I decided to do a comparison between real and double in addition to class vs. struct:

class version, with real:

	ldc2 -d-version=useClass -d-version=useReal -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-class-native

	8 secs, 201 ms, 286 μs, and 9 hnsecs
	8 secs, 153 ms, 617 μs, and 9 hnsecs
	8 secs, 205 ms, 966 μs, and 6 hnsecs

class version, with double:

	ldc2 -d-version=useClass -d-version=useDouble -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-class-native

	4 secs, 177 ms, 842 μs, and 3 hnsecs
	4 secs, 297 ms, 899 μs, and 6 hnsecs
	4 secs, 221 ms, 916 μs, and 7 hnsecs

struct version, with real:

	ldc2 -d-version=useStruct -d-version=useReal -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-class-native

	3 secs, 191 ms, 21 μs, and 4 hnsecs
	3 secs, 223 ms, 692 μs, and 9 hnsecs
	3 secs, 210 ms, 429 μs, and 2 hnsecs

struct version, with double:

	ldc2 -d-version=useStruct -d-version=useDouble -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-class-native

	2 secs, 659 ms, 309 μs, and 2 hnsecs
	2 secs, 654 ms, 96 μs, and 3 hnsecs
	2 secs, 630 ms, 84 μs, and 4 hnsecs

As you can see, using struct vs. class grants almost double the performance. Using double with struct instead of real with struct gives a 17% improvement.

The difference between struct and class is not surprising; allocations are slow in general, and D generally does not do very much optimization of allocations. Node.js being Javascript-based, and Javascript being object-heavy, it's not surprising that more object lifetime analysis would be applied to optimize allocations.

I do feel James' struct implementation was flawed, though, because of using real instead of double, real being known to be slow on modern hardware. (Also, comparing struct + real to class + double seems a bit like comparing apples and oranges.) My original modification of James' code uses struct + double, and comparing that with struct + real showed a 17% degradation upon switching to real.

As a further step, I profiled the program and found that most of the time was being spent calling the C math library's fmax() function (which involves an expensive PIC indirection, not to mention lack of inlining). Writing a naïve version of fmax() in D gave the following numbers:

struct version, with double + custom fmax function:

	ldc2 -d-version=useStruct -d-version=useDouble -d-version=customFmax -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-struct-native

	1 sec, 567 ms, 219 μs, and 6 hnsecs
	1 sec, 557 ms, 762 μs, and 7 hnsecs
	1 sec, 574 ms, 657 μs, and 7 hnsecs

This represents a whopping 40% improvement over the version calling the C library's fmax function. I wonder how this last version compares with the Node.js performance?

Code, for full disclosure (basically copy-n-pasted from James' code, with minor modifications for testing struct vs. class, real vs. double):

------------------------------------
import std.stdio;
import std.math;
import core.time;

version(useReal)   alias Num = real;
version(useDouble) alias Num = double;

version(customFmax)
Num fmax(Num x, Num y) { return (x < y) ? y : x; }

version(useStruct)
{
    struct Complex
    {
        Num x;
        Num y;

        this(A)(A px, A py)
        {
            this.x = px;
            this.y = py;
        }

        unittest
        {
            auto complex = Complex(2, 2);
            assert(complex.x == 2 && complex.y == 2);
        }

        auto abs() const
        {
            return fmax(this.x * this.x, this.y * this.y);
        }

        void add(T)(const T other)
        {
            this.x += other.x;
            this.y += other.y;
        }

        void mul(T)(const T other)
        {
            auto newX = this.x * other.x - this.y * other.y;
            auto newY = this.x * other.y + this.y * other.x;
            this.x = newX;
            this.y = newY;
        }
    }

    unittest
    {
        auto c = Complex(5, 3);
        c.mul(Complex(4, 2));
        assert(c.x == 14 && c.y == 22);
    }

    unittest
    {
        auto org = Complex(0, 0);
        org.add(Complex(3, 3));
        assert(org.x == 3 && org.y == 3);
    }

    auto iterate_mandelbrot(const Complex c, const int maxIters)
    {
        auto z = Complex(0, 0);
        for (int i = 0; i < maxIters; i++)
        {
            if (z.abs() >= 2.0) { return i; }
            z.mul(z);
            z.add(c);
        }
        return maxIters;
    }

    const x0 = -2.5, x1 = 1, y0 = -1, y1 = 1;
    const cols = 72, rows = 24;
    const maxIters = 1000000;

    void main()
    {
        auto now = MonoTime.currTime;
        for (Num row = 0; row < rows; row++)
        {
            const y = (row / rows) * (y1 - y0) + y0;
            char[] str;
            for (Num col = 0; col < cols; col++)
            {
                // Num is needed here because otherwise "/" does integer division
                const x = (col / cols) * (x1 - x0) + x0;
                auto c = Complex(x, y);
                auto iters = iterate_mandelbrot(c, maxIters);
                if (iters == 0) { str ~= '.'; }
                else if (iters == 1) { str ~= '%'; }
                else if (iters == 2) { str ~= ' '; }
                else if (iters == maxIters) { str ~= ' '; }
                else { }
            }
            str.writeln;
        }
        writeln(MonoTime.currTime - now);
    }
}

version(useClass)
{
    class Complex
    {
        Num x;
        Num y;

        this(A)(A px, A py)
        {
            this.x = px;
            this.y = py;
        }

        unittest
        {
            auto complex = new Complex(2, 2);
            assert(complex.x == 2 && complex.y == 2);
        }

        auto abs() const
        {
            return fmax(this.x * this.x, this.y * this.y);
        }

        void add(T)(const T other)
        {
            this.x += other.x;
            this.y += other.y;
        }

        void mul(T)(const T other)
        {
            auto newX = this.x * other.x - this.y * other.y;
            auto newY = this.x * other.y + this.y * other.x;
            this.x = newX;
            this.y = newY;
        }
    }

    unittest
    {
        auto c = new Complex(5, 3);
        c.mul(new Complex(4, 2));
        assert(c.x == 14 && c.y == 22);
    }

    unittest
    {
        auto org = new Complex(0, 0);
        org.add(new Complex(3, 3));
        assert(org.x == 3 && org.y == 3);
    }

    auto iterate_mandelbrot(const Complex c, const int maxIters)
    {
        auto z = new Complex(0, 0);
        for (int i = 0; i < maxIters; i++)
        {
            if (z.abs() >= 2.0) { return i; }
            z.mul(z);
            z.add(c);
        }
        return maxIters;
    }

    const x0 = -2.5, x1 = 1, y0 = -1, y1 = 1;
    const cols = 72, rows = 24;
    const maxIters = 1000000;

    void main()
    {
        auto now = MonoTime.currTime;
        for (Num row = 0; row < rows; row++)
        {
            const y = (row / rows) * (y1 - y0) + y0;
            char[] str;
            for (Num col = 0; col < cols; col++)
            {
                // Num is needed here because otherwise "/" does integer division
                const x = (col / cols) * (x1 - x0) + x0;
                auto c = new Complex(x, y);
                auto iters = iterate_mandelbrot(c, maxIters);
                if (iters == 0) { str ~= '.'; }
                else if (iters == 1) { str ~= '%'; }
                else if (iters == 2) { str ~= ' '; }
                else if (iters == maxIters) { str ~= ' '; }
                else { }
            }
            str.writeln;
        }
        writeln(MonoTime.currTime - now);
    }
}
------------------------------------

T

-- 
MAS = Mana Ada Sistem?
Aug 22 2020
parent James Lu <jamtlu gmail.com> writes:
On Saturday, 22 August 2020 at 16:01:52 UTC, H. S. Teoh wrote:
 Sorry, I missed that the first time round.  But how come your 
 struct version uses `real` but your class version uses 
 `double`?  It's well-known that `real` is slow because it uses 
 x87 instructions, as opposed to the SSE/etc. instructions that 
 `double` would use.
On Friday, 21 August 2020 at 23:14:12 UTC, James Lu wrote:
 With the double type:

 Node: 2211 2574 2306
 Dlang: 2520 1891 1676
If you had read all three of my original self-replies, you would have noticed I also did versions with both struct and class. Perhaps I need to learn to keep them all in the same post.
Aug 22 2020
prev sibling parent Arun <aruncxy gmail.com> writes:
On Friday, 21 August 2020 at 23:49:44 UTC, H. S. Teoh wrote:
 On Fri, Aug 21, 2020 at 11:22:27PM +0000, James Lu via 
 Digitalmars-d wrote: [...]
 Bonus: Direct translation of Dlang to Node.js, Node.js faster 
 https://gist.github.com/CrazyPython/8bafd16837ec8ad4c5a638b9d305fc96
 
 Dlang: 4076 3622 2934 (3544 average)
 Node.js: 2624 2334 2316 (2424 average)
 
 LDC2 is 46% slower!
Using a class for Complex (and a non-final one at that!!) introduces tons of allocation overhead per iteration, plus virtual function call overhead. You should be using a struct instead. I betcha this one change will make a big difference in performance.

Also, what's the command you're using to compile the program? If you're doing performance comparison, you should specify -O2 or -O3.
He mentioned this in his first post:

LDC 1.23.0 (Installed from dlang.org)

ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast
Aug 21 2020
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Aug 21, 2020 at 04:49:44PM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
 Using a class for Complex (and a non-final one at that!!) introduces
 tons of allocation overhead per iteration, plus virtual function call
 overhead.  You should be using a struct instead.  I betcha this one
 change will make a big difference in performance.
[...] OK, so I copied the code and changed the class to struct, and compared the results. Both versions are compiled with ldc2 -O3.

class version:

	7 secs, 125 ms, 608 μs, and 9 hnsecs
	7 secs, 155 ms, 328 μs, and 6 hnsecs
	7 secs, 158 ms, 966 μs, and 4 hnsecs

struct version:

	6 secs, 55 ms, 140 μs, and 4 hnsecs
	6 secs, 125 ms, 974 μs, and 5 hnsecs
	6 secs, 126 ms, 945 μs, and 4 hnsecs

For performance comparisons, take the best of n (because the others are merely measuring more system noise). This represents about a 15% performance increase from switching to struct instead of class.

I thought it might make a difference to optimize for my CPU with -mcpu=native, so here are the numbers:

class version:

	7 secs, 100 ms, 602 μs, and 6 hnsecs
	7 secs, 100 ms, 437 μs, and 7 hnsecs
	7 secs, 121 ms, 594 μs, and 4 hnsecs

struct version:

	6 secs, 73 ms, 534 μs, and 3 hnsecs
	5 secs, 662 ms, 626 μs, and 5 hnsecs
	6 secs, 103 ms, 871 μs, and 2 hnsecs

Again taking the best of 3, that's about a 20% performance increase from changing class to struct.

Just for laughs, I tested with dmd -O -inline:

class version:

	7 secs, 255 ms, 748 μs, and 5 hnsecs
	7 secs, 249 ms, 683 μs, and 9 hnsecs
	7 secs, 593 ms, 847 μs, and 8 hnsecs

struct version:

	7 secs, 646 ms, 685 μs, and 5 hnsecs
	7 secs, 618 ms, 642 μs, and 7 hnsecs
	7 secs, 606 ms, 85 μs, and 4 hnsecs

Surprisingly, the class version does *better* than the struct version when compiled with dmd. (Wow, is dmd codegen *that* bad that it outweighs even class allocation overhead?? :-D) But both are worse than even the class version with ldc2 -O3 (even without -mcpu=native).

So yeah. I wouldn't trust dmd with a 10-foot pole when it comes to runtime performance. The struct version compiled with `ldc2 -O3 -mcpu=native` beats the struct version compiled with dmd by a 26% margin. That's pretty sad.

T

-- 
An imaginary friend squared is a real enemy.
Aug 21 2020
parent aberba <karabutaworld gmail.com> writes:
On Saturday, 22 August 2020 at 00:10:43 UTC, H. S. Teoh wrote:
 On Fri, Aug 21, 2020 at 04:49:44PM -0700, H. S. Teoh via 
 Digitalmars-d wrote: [...]
 
Surprisingly, the class version does *better* than the struct version when compiled with dmd. (Wow, is dmd codegen *that* bad that it outweighs even class allocation overhead?? :-D)
Or maybe DMD is not trying to win any performance contest... just focusing on fast compilation for quick prototyping. Something you wouldn't get otherwise without DMD.
Aug 22 2020
prev sibling parent reply bachmeier <no spam.net> writes:
On Friday, 21 August 2020 at 23:10:53 UTC, James Lu wrote:
 Code: 
 https://gist.github.com/CrazyPython/364f11465dab90d611ecc81490682680

 LDC 1.23.0 (Installed from dlang.org)

 ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast

 Node v14.40 (V8 8.1.307.31)

 Dlang trials: 2957 2560 2048 Average: 2521
 Node.JS trials: 1988 2567 1863 Average: 2139

 Notes:

  - I had to reinstall Dlang from the install script
  - I was initially confused why -mtune=native didn't work, and 
 had to read documentation. Would have been nice if the compiler 
 told me -mcpu=native was what I needed.
  - I skipped -march=native. Did not find information on the 
 wiki https://wiki.dlang.org/Using_LDC
  - Node.js compiles faster and uses a compilation cache

 Mandatory citation: https://github.com/brion/mandelbrot-shootout
I have no desire to dig into it myself, but I'll just note that if you check the CLBG, you'll see that it's not hard to write C and C++ programs for this benchmark that are many times slower than Node JS. The worst of them takes seven times longer to run.
Aug 21 2020
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Aug 22, 2020 at 02:08:40AM +0000, bachmeier via Digitalmars-d wrote:
[...]
 I have no desire to dig into it myself, but I'll just note that if you
 check the CLBG, you'll see that it's not hard to write C and C++
 programs for this benchmark that are many times slower than Node JS.
 The worst of them takes seven times longer to run.
As described in my other post, my analysis of James' code reveals the following issues:

1) Using class instead of struct;

2) Using real instead of double;

3) std.math.fmax calling the C library (involving a PIC indirection to a shared library as opposed to inlineable native D code).

Addressing all 3 issues yielded a 67% improvement (from class + real + C fmax -> struct + double + native fmax), or a 37% improvement (from class + double + C fmax -> struct + double + native fmax). I don't have a Node.js environment, though, so I can't make a direct comparison with the optimized D version.

I will note, though, that (1) and (2) are well-known performance issues; I'm surprised that this was not taken into account in the original comparison. (3) is something worth considering for std.math -- yes, it's troublesome to have to maintain D versions of these functions, but they *could* make a potentially big performance impact by being inlineable in hot inner loops.

T

-- 
Computers shouldn't beep through the keyhole.
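The custom fmax in (3) is tiny; a hedged sketch (name is illustrative, and it deliberately skips the NaN handling that C99 fmax guarantees):

```d
// Naive, inlineable max for doubles: no call through the PLT into
// libm, so the optimizer can inline it into the hot loop.
// Unlike C99 fmax, NaN operands are not special-cased here.
pragma(inline, true)
double fmaxNaive(double x, double y)
{
    return (x < y) ? y : x;
}

unittest
{
    assert(fmaxNaive(1.5, 2.5) == 2.5);
    assert(fmaxNaive(-1.0, -2.0) == -1.0);
}
```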
Aug 22 2020
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/22/20 12:15 PM, H. S. Teoh wrote:
 On Sat, Aug 22, 2020 at 02:08:40AM +0000, bachmeier via Digitalmars-d wrote:
 [...]
 I have no desire to dig into it myself, but I'll just note that if you
 check the CLBG, you'll see that it's not hard to write C and C++
 programs for this benchmark that are many times slower than Node JS.
 The worst of them takes seven times longer to run.
As described in my other post, my analysis of James' code reveals the following issues:

1) Using class instead of struct;

2) Using real instead of double;

3) std.math.fmax calling the C library (involving a PIC indirection to a shared library as opposed to inlineable native D code).

Addressing all 3 issues yielded a 67% improvement (from class + real + C fmax -> struct + double + native fmax), or a 37% improvement (from class + double + C fmax -> struct + double + native fmax). I don't have a Node.js environment, though, so I can't make a direct comparison with the optimized D version.

I will note, though, that (1) and (2) are well-known performance issues; I'm surprised that this was not taken into account in the original comparison. (3) is something worth considering for std.math -- yes, it's troublesome to have to maintain D versions of these functions, but they *could* make a potentially big performance impact by being inlineable in hot inner loops.
First off, this is a great accomplishment of the V8 team. That's the remarkable part.

Second, this is in many ways not new. JIT optimizers are great at optimizing numeric code and loops. With regularity ever since shortly after Java was invented, "Java is faster than C" claims have been backed by benchmarks involving numeric code and loops. For such code, the JIT can do as good a job as, and sometimes even better than, a traditional compiler.

The way a low-level language competes is by offering you ways to tune code to rival/surpass JIT performance if you need to. (Of course, that means the programmer is spending time on that, which is a minus.) I have no doubt the D mandelbrot code can be made at least as fast as the node.js code.

The reality of large applications involves things like structures and indirections and such, for which data layout is important. I think JITs have a way to go in that area.
Aug 22 2020
next sibling parent James Lu <jamtlu gmail.com> writes:
On Saturday, 22 August 2020 at 16:38:42 UTC, Andrei Alexandrescu 
wrote:
 On 8/22/20 12:15 PM, H. S. Teoh wrote:
 [...]
Second, this is in many ways not new. JIT optimizers are great at optimizing numeric code and loops. With regularity ever since shortly after Java was invented, "Java is faster than C" claims have been backed by benchmarks involving numeric code and loops. For such code, the JIT can do as good a job as, and sometimes even better than, a traditional compiler.

The way a low-level language competes is by offering you ways to tune code to rival/surpass JIT performance if you need to. (Of course, that means the programmer is spending time on that, which is a minus.) I have no doubt the D mandelbrot code can be made at least as fast as the node.js code.
Interesting perspective.
Aug 22 2020
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Saturday, 22 August 2020 at 16:38:42 UTC, Andrei Alexandrescu 
wrote:
 [snip]

 First off this is a great accomplishment of the V8 team. That's 
 the remarkable part.

 Second, this is in many ways not new. JIT optimizers are great 
 at optimizing numeric code and loops. With regularity ever 
 since shortly after Java was invented, "Java is faster than C" 
 claims have been backed by benchmarks involving numeric code 
 and loops. For such, the JIT can do as good a job as, and 
 sometimes even better than, a traditional compiler.

 The way of a low-level language competes is by offering you 
 ways to tune code to rival/surpass JIT performance if you need 
 to. (Of course, that means the programmer is spending time on 
 that, which is a minus.) I have no doubt the D mandelbrot code 
 can be made at least as fast as the node.js code.

 The reality of large applications involves things like 
 structures and indirections and such, for which data layout is 
 important. I think JITs have a way to go in that area.
It would be interesting to see a write-up where LDC's dynamic compile can speed up D code in the same way.
Aug 22 2020
prev sibling next sibling parent reply kinke <noone nowhere.com> writes:
On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
 3) std.math.fmax calling the C library (involving a PIC 
 indirection to a
 shared library as opposed to inlineable native D code).
Yes, and that's because there's only a `real` version for fmax. If upstream Phobos had proper double/float overloads, we could uncomment the LDC-specific implementations using LLVM intrinsics, which use (obviously much faster) SSE instructions:

https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798

Number crunching in D could be significantly accelerated if the people interested in it showed some love for std.math, but we've had this topic for years.
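For illustration, a sketch of what per-type overloads might look like (illustrative only, not the actual Phobos or LDC code; the LDC branch linked above relies on LLVM's maxnum intrinsics instead of plain D source):

```d
// Illustrative: one template instantiation per float width, so float
// and double math can stay in SSE registers instead of widening to
// 80-bit real.
F fmax(F)(const F x, const F y)
    if (is(F == float) || is(F == double) || is(F == real))
{
    // C99 semantics: if exactly one operand is NaN, return the other.
    if (x != x) return y;
    if (y != y) return x;
    return (x < y) ? y : x;
}

unittest
{
    assert(fmax(1.0f, 2.0f) == 2.0f);     // float overload
    assert(fmax(double.nan, 3.0) == 3.0); // NaN yields the other operand
}
```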
Aug 22 2020
next sibling parent Arjan <arjan ask.me.to> writes:
On Saturday, 22 August 2020 at 17:40:11 UTC, kinke wrote:
 On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
 3) std.math.fmax calling the C library (involving a PIC 
 indirection to a
 shared library as opposed to inlineable native D code).
Yes, and that's because there's only a `real` version for fmax. If upstream Phobos had proper double/float overloads, we could uncomment the LDC-specific implementations using LLVM intrinsics, which use (obviously much faster) SSE instructions:

https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798

Number crunching in D could be significantly accelerated if the people interested in it showed some love for std.math, but we've had this topic for years.
Sound to me like a fantastic SAOC project.
Aug 22 2020
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/22/20 1:40 PM, kinke wrote:
 On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
 3) std.math.fmax calling the C library (involving a PIC indirection to a
 shared library as opposed to inlineable native D code).
Yes, and that's because there's only a `real` version for fmax. If upstream Phobos had proper double/float overloads, we could uncomment the LDC-specific implementations using LLVM intrinsics, which use (obviously much faster) SSE instructions:

https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798

Number crunching in D could be significantly accelerated if the people interested in it showed some love for std.math, but we've had this topic for years.
Ow, do we still suffer from that? Sigh.

https://github.com/dlang/phobos/pull/7604/files

It's 10 minutes of work - as much as writing a couple of posts, and much more satisfactory.
Aug 22 2020
parent reply Avrina <avrina12309412342 gmail.com> writes:
On Saturday, 22 August 2020 at 19:20:34 UTC, Andrei Alexandrescu 
wrote:
 On 8/22/20 1:40 PM, kinke wrote:
 On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
 3) std.math.fmax calling the C library (involving a PIC 
 indirection to a
 shared library as opposed to inlineable native D code).
Yes, and that's because there's only a `real` version for fmax. If upstream Phobos had proper double/float overloads, we could uncomment the LDC-specific implementations using LLVM intrinsics, which use (obviously much faster) SSE instructions: https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798 Number crunching in D could be significantly accelerated if the people interested in it showed some love for std.math, but we've had this topic for years.
Ow, do we still suffer from that? Sigh. https://github.com/dlang/phobos/pull/7604/files It's 10 minutes of work - as much as writing a couple of posts, and much more satisfactory.
Cos, sin, tan, asin, acos, atan, etc. There's still more; putting in the actual work that std.math needs is going to take more than 10 mins.
Aug 22 2020
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/22/20 3:34 PM, Avrina wrote:
 On Saturday, 22 August 2020 at 19:20:34 UTC, Andrei Alexandrescu wrote:
 On 8/22/20 1:40 PM, kinke wrote:
 On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
 3) std.math.fmax calling the C library (involving a PIC indirection 
 to a
 shared library as opposed to inlineable native D code).
Yes, and that's because there's only a `real` version for fmax. If upstream Phobos had proper double/float overloads, we could uncomment the LDC-specific implementations using LLVM intrinsics, which use (obviously much faster) SSE instructions:

https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798

Number crunching in D could be significantly accelerated if the people interested in it showed some love for std.math, but we've had this topic for years.
Ow, do we still suffer from that? Sigh. https://github.com/dlang/phobos/pull/7604/files It's 10 minutes of work - as much as writing a couple of posts, and much more satisfactory.
Cos, sin, tan, asin, acos, atan, etc.. There's still more, putting in the actual work that std.math needs is going to take more than 10 mins.
1. Linear time for small n is fine and does not affect the argument.

2. Incremental is still fine.

3. Work has actually been done by Nick Wilson in https://github.com/dlang/phobos/pull/7463.
Aug 22 2020
parent reply Avrina <avrina12309412342 gmail.com> writes:
On Saturday, 22 August 2020 at 21:20:36 UTC, Andrei Alexandrescu 
wrote:
 On 8/22/20 3:34 PM, Avrina wrote:
 On Saturday, 22 August 2020 at 19:20:34 UTC, Andrei 
 Alexandrescu wrote:
 On 8/22/20 1:40 PM, kinke wrote:
 On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh 
 wrote:
 3) std.math.fmax calling the C library (involving a PIC 
 indirection to a
 shared library as opposed to inlineable native D code).
Yes, and that's because there's only a `real` version for fmax. If upstream Phobos had proper double/float overloads, we could uncomment the LDC-specific implementations using LLVM intrinsics, which use (obviously much faster) SSE instructions: https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798 Number crunching in D could be significantly accelerated if the people interested in it showed some love for std.math, but we've had this topic for years.
Ow, do we still suffer from that? Sigh. https://github.com/dlang/phobos/pull/7604/files It's 10 minutes of work - as much as writing a couple of posts, and much more satisfactory.
Cos, sin, tan, asin, acos, atan, etc.. There's still more, putting in the actual work that std.math needs is going to take more than 10 mins.
1. Linear time for small n is fine and does not affect the argument.
What argument?
 2. Incremental is still fine.
It can introduce subtle bugs and problems with precision, flip-flopping between float, double, and real. If it is done all at once, it will only happen once, not every time someone feels like spending 10 mins doing a little bit of work to change one function. It really shouldn't have been implemented this way in the first place.
 3. Work has actually be done by Nick Wilson in 
 https://github.com/dlang/phobos/pull/7463.
A dead pull request? Not unusual.
Aug 22 2020
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/22/20 6:09 PM, Avrina wrote:
 On Saturday, 22 August 2020 at 21:20:36 UTC, Andrei Alexandrescu wrote:
 On 8/22/20 3:34 PM, Avrina wrote:
 On Saturday, 22 August 2020 at 19:20:34 UTC, Andrei Alexandrescu wrote:
 On 8/22/20 1:40 PM, kinke wrote:
 On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
 3) std.math.fmax calling the C library (involving a PIC 
 indirection to a
 shared library as opposed to inlineable native D code).
Yes, and that's because there's only a `real` version for fmax. If upstream Phobos had proper double/float overloads, we could uncomment the LDC-specific implementations using LLVM intrinsics, which use (obviously much faster) SSE instructions:

https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798

Number crunching in D could be significantly accelerated if the people interested in it showed some love for std.math, but we've had this topic for years.
Ow, do we still suffer from that? Sigh. https://github.com/dlang/phobos/pull/7604/files It's 10 minutes of work - as much as writing a couple of posts, and much more satisfactory.
Cos, sin, tan, asin, acos, atan, etc. There's still more; putting in the actual work that std.math needs is going to take more than 10 minutes.
1. Linear time for small n is fine and does not affect the argument.
What argument?
 2. Incremental is still fine.
It can introduce subtle bugs and precision problems, flip-flopping between float, double, and real. If it is done all at once, that disruption happens only once, not every time someone feels like spending 10 minutes doing a little bit of work to change one function. It really shouldn't have been implemented this way in the first place.
 3. Work has actually been done by Nick Wilson in 
 https://github.com/dlang/phobos/pull/7463.
A dead pull request? Not unusual.
You seem to derive good enjoyment out of making unkind comments.
Aug 22 2020
parent Avrina <avrina12309412342 gmail.com> writes:
On Sunday, 23 August 2020 at 02:18:19 UTC, Andrei Alexandrescu 
wrote:
 On 8/22/20 6:09 PM, Avrina wrote:
 On Saturday, 22 August 2020 at 21:20:36 UTC, Andrei 
 Alexandrescu wrote:
 On 8/22/20 3:34 PM, Avrina wrote:
 On Saturday, 22 August 2020 at 19:20:34 UTC, Andrei 
 Alexandrescu wrote:
 On 8/22/20 1:40 PM, kinke wrote:
 On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh 
 wrote:
 3) std.math.fmax calling the C library (involving a PIC 
 indirection to a
 shared library as opposed to inlineable native D code).
Yes, and that's because there's only a `real` version for fmax. If upstream Phobos had proper double/float overloads, we could uncomment the LDC-specific implementations using LLVM intrinsics, which use (obviously much faster) SSE instructions: https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798 Number crunching in D could be significantly accelerated if the people interested in it showed some love for std.math, but we've had this topic for years.
Ow, do we still suffer from that? Sigh. https://github.com/dlang/phobos/pull/7604/files It's 10 minutes of work - as much as writing a couple of posts, and much more satisfactory.
Cos, sin, tan, asin, acos, atan, etc. There's still more; putting in the actual work that std.math needs is going to take more than 10 minutes.
1. Linear time for small n is fine and does not affect the argument.
What argument?
 2. Incremental is still fine.
It can introduce subtle bugs and precision problems, flip-flopping between float, double, and real. If it is done all at once, that disruption happens only once, not every time someone feels like spending 10 minutes doing a little bit of work to change one function. It really shouldn't have been implemented this way in the first place.
 3. Work has actually been done by Nick Wilson in 
 https://github.com/dlang/phobos/pull/7463.
A dead pull request? Not unusual.
You seem to derive good enjoyment out of making unkind comments.
I am an observer of truth. If you don't like the truth, look away, as seems to be commonplace here anyway.
Aug 24 2020
prev sibling parent kinke <noone nowhere.com> writes:
On Saturday, 22 August 2020 at 19:34:42 UTC, Avrina wrote:
 Cos, sin, tan, asin, acos, atan, etc. There's still more; 
 putting in the actual work that std.math needs is going to take 
 more than 10 minutes.
These functions are a lot more involved indeed; I took care of a few of them some years ago, porting from the Cephes C library, see https://github.com/dlang/phobos/pull/6272. The main difficulty is the need to support 4 floating-point formats: single, double, x87 extended, and quadruple precision.
Aug 22 2020
prev sibling parent bachmeier <no spam.net> writes:
On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
 On Sat, Aug 22, 2020 at 02:08:40AM +0000, bachmeier via 
 Digitalmars-d wrote: [...]
 I have no desire to dig into it myself, but I'll just note 
 that if you check the CLBG, you'll see that it's not hard to 
 write C and C++ programs for this benchmark that are many 
 times slower than Node JS. The worst of them takes seven times 
 longer to run.
As described in my other post, my analysis of James' code reveals the following issues:

1) Using class instead of struct;
2) Using real instead of double;
3) std.math.fmax calling the C library (involving a PIC indirection to a shared library as opposed to inlineable native D code).

Addressing all 3 issues yielded a 67% improvement (from class + real + C fmax -> struct + double + native fmax), or a 37% improvement (from class + double + C fmax -> struct + double + native fmax). I don't have a Node.js environment, though, so I can't make a direct comparison with the optimized D version.

I will note, though, that (1) and (2) are well-known performance issues; I'm surprised that this was not taken into account in the original comparison. (3) is something worth considering for std.math -- yes it's troublesome to have to maintain D versions of these functions, but they *could* make a potentially big performance impact by being inlineable in hot inner loops.
Well, I'm not going to debate the possibility that D will outperform Node with appropriate optimization. It's clear from the C and C++ timings, though, that you can't just type out a C, C++, or D program and expect to beat Node - it's going to take some care and knowledge that you really don't need with Node.
Aug 24 2020