digitalmars.D - Naive node.js faster than naive LDC2?
- James Lu (16/16) Aug 21 2020 Code:
- James Lu (4/20) Aug 21 2020 With the double type:
- James Lu (6/34) Aug 21 2020 Bonus: Direct translation of Dlang to Node.js, Node.js faster
- H. S. Teoh (11/18) Aug 21 2020 Using a class for Complex (and a non-final one at that!!) introduces
- MoonlightSentinel (2/3) Aug 21 2020 Maybe try `creal`?
- James Lu (5/23) Aug 21 2020 I showed with and without class. V8's analyzer might be superior
- H. S. Teoh (228/233) Aug 22 2020 Sorry, I missed that the first time round. But how come your struct
- James Lu (5/13) Aug 22 2020 If you had read all three of my original self-replies9, you would
- Arun (4/21) Aug 21 2020 He mentioned this in his first post.
- H. S. Teoh (49/53) Aug 21 2020 [...]
- aberba (4/10) Aug 22 2020 Or maybe DMD is not trying to win any performance context... just
- bachmeier (5/21) Aug 21 2020 I have no desire to dig into it myself, but I'll just note that
- H. S. Teoh (22/26) Aug 22 2020 As described in my other post, my analysis of James' code reveals the
- Andrei Alexandrescu (16/46) Aug 22 2020 First off this is a great accomplishment of the V8 team. That's the
- James Lu (3/16) Aug 22 2020 Interesting perspective.
- jmh530 (4/21) Aug 22 2020 It would be interesting to see a write-up where LDC's dynamic
- kinke (9/12) Aug 22 2020 Yes, and that's because there's only a `real` version for fmax.
- Arjan (2/14) Aug 22 2020 Sound to me like a fantastic SAOC project.
- Andrei Alexandrescu (5/21) Aug 22 2020 Ow, do we still suffer from that? Sigh.
- Avrina (5/27) Aug 22 2020 Cos, sin, tan, asin, acos, atan, etc.. There's still more,
- Andrei Alexandrescu (5/35) Aug 22 2020 1. Linear time for small n is fine and does not affect the argument.
- Avrina (10/51) Aug 22 2020 What argument?
- Andrei Alexandrescu (2/52) Aug 22 2020 You seem to derive good enjoyment out of making unkind comments.
- Avrina (4/64) Aug 24 2020 I am an observer of truth, if you don't like the truth, look
- kinke (6/9) Aug 22 2020 These functions are a lot more involved indeed; I've taken care
- bachmeier (6/34) Aug 24 2020 Well, I'm not going to debate the possibility that D will
Code: https://gist.github.com/CrazyPython/364f11465dab90d611ecc81490682680

LDC 1.23.0 (Installed from dlang.org)
ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast
Node v14.40 (V8 8.1.307.31)

Dlang trials: 2957 2560 2048 (Average: 2521)
Node.JS trials: 1988 2567 1863 (Average: 2139)

Notes:
- I had to reinstall Dlang from the install script
- I was initially confused why -mtune=native didn't work, and had to read documentation. Would have been nice if the compiler told me -mcpu=native was what I needed.
- I skipped -march=native. Did not find information on the wiki https://wiki.dlang.org/Using_LDC
- Node.js compiles faster and uses a compilation cache

Mandatory citation: https://github.com/brion/mandelbrot-shootout
Aug 21 2020
On Friday, 21 August 2020 at 23:10:53 UTC, James Lu wrote:
> [...]

With the double type:

Node: 2211 2574 2306
Dlang: 2520 1891 1676
Aug 21 2020
On Friday, 21 August 2020 at 23:14:12 UTC, James Lu wrote:
> [...]

Bonus: Direct translation of Dlang to Node.js, Node.js faster
https://gist.github.com/CrazyPython/8bafd16837ec8ad4c5a638b9d305fc96

Dlang: 4076 3622 2934 (3544 average)
Node.js: 2624 2334 2316 (2424 average)

LDC2 is 46% slower!
Aug 21 2020
On Fri, Aug 21, 2020 at 11:22:27PM +0000, James Lu via Digitalmars-d wrote:
[...]
> Bonus: Direct translation of Dlang to Node.js, Node.js faster
> https://gist.github.com/CrazyPython/8bafd16837ec8ad4c5a638b9d305fc96
> Dlang: 4076 3622 2934 (3544 average)
> Node.js: 2624 2334 2316 (2424 average)
> LDC2 is 46% slower!

Using a class for Complex (and a non-final one at that!!) introduces tons of allocation overhead per iteration, plus virtual function call overhead. You should be using a struct instead. I betcha this one change will make a big difference in performance.

Also, what's the command you're using to compile the program? If you're doing a performance comparison, you should specify -O2 or -O3.

T

-- 
Knowledge is that area of ignorance that we arrange and classify. -- Ambrose Bierce
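To illustrate the suggestion (a hypothetical sketch, not code from the gist): as a struct, Complex gets value semantics, so the hot loop works entirely on stack data with statically bound, inlineable methods:

```d
// Hypothetical sketch: as a struct, Complex is a value type, so the
// inner loop performs no GC allocation and no virtual dispatch.
struct Complex
{
    double x, y;

    // In-place multiply; statically bound and trivially inlineable.
    void mul(const Complex other)
    {
        const newX = x * other.x - y * other.y;
        const newY = x * other.y + y * other.x;
        x = newX;
        y = newY;
    }
}

unittest
{
    auto c = Complex(5, 3);
    c.mul(Complex(4, 2));
    assert(c.x == 14 && c.y == 22);
}
```

Alternatively, marking the class and its methods `final` would at least remove the virtual dispatch, though the per-iteration allocation would remain.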
Aug 21 2020
On Friday, 21 August 2020 at 23:49:44 UTC, H. S. Teoh wrote:
> You should be using a struct instead.

Maybe try `creal`?
Aug 21 2020
On Friday, 21 August 2020 at 23:49:44 UTC, H. S. Teoh wrote:
> Using a class for Complex (and a non-final one at that!!) introduces
> tons of allocation overhead per iteration, plus virtual function call
> overhead. You should be using a struct instead. I betcha this one
> change will make a big difference in performance.
>
> Also, what's the command you're using to compile the program? If
> you're doing a performance comparison, you should specify -O2 or -O3.

I showed with and without class. V8's analyzer might be superior to LDC's in removing the allocation overhead.

I used the same compilation flags as the original:
ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast
Aug 21 2020
On Sat, Aug 22, 2020 at 12:21:50AM +0000, James Lu via Digitalmars-d wrote:
[...]
> I showed with and without class.

Sorry, I missed that the first time round. But how come your struct version uses `real` but your class version uses `double`? It's well-known that `real` is slow because it uses x87 instructions, as opposed to the SSE/etc. instructions that `double` would use.

> V8's analyzer might be superior to LDC's in removing the allocation
> overhead.
>
> I used the same compilation flags as the original:
> ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast

I found that with -ffast-math --fp-contract=fast, performance doubled. Since James' original struct version uses real, I decided to do a comparison between real and double in addition to class vs. struct:

class version, with real:
ldc2 -d-version=useClass -d-version=useReal -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-class-native
8 secs, 201 ms, 286 μs, and 9 hnsecs
8 secs, 153 ms, 617 μs, and 9 hnsecs
8 secs, 205 ms, 966 μs, and 6 hnsecs

class version, with double:
ldc2 -d-version=useClass -d-version=useDouble -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-class-native
4 secs, 177 ms, 842 μs, and 3 hnsecs
4 secs, 297 ms, 899 μs, and 6 hnsecs
4 secs, 221 ms, 916 μs, and 7 hnsecs

struct version, with real:
ldc2 -d-version=useStruct -d-version=useReal -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-struct-native
3 secs, 191 ms, 21 μs, and 4 hnsecs
3 secs, 223 ms, 692 μs, and 9 hnsecs
3 secs, 210 ms, 429 μs, and 2 hnsecs

struct version, with double:
ldc2 -d-version=useStruct -d-version=useDouble -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-struct-native
2 secs, 659 ms, 309 μs, and 2 hnsecs
2 secs, 654 ms, 96 μs, and 3 hnsecs
2 secs, 630 ms, 84 μs, and 4 hnsecs

As you can see, using struct vs class grants almost double the performance. Using double with struct instead of real with struct gives a 17% improvement.

The difference between struct and class is not surprising; allocations are slow in general, and D generally does not do very much optimization of allocations. Node.js being Javascript-based, and Javascript being object-heavy, it's not surprising that more object lifetime analysis would be applied to optimize allocations.

I do feel James' struct implementation was flawed, though, because of using real instead of double, real being known to be slow on modern hardware. (Also, comparing struct + real to class + double seems a bit like comparing apples and oranges.) My original modification of James' code uses struct + double, and comparing that with struct + real showed a 17% degradation upon switching to real.

As a further step, I profiled the program and found that most of the time was being spent calling the C math library's fmax() function (which involves an expensive PIC indirection, not to mention lack of inlining). Writing a naïve version of fmax() in D gave the following numbers:

struct version, with double + custom fmax function:
ldc2 -d-version=useStruct -d-version=useDouble -d-version=customFmax -g -ffast-math --fp-contract=fast -mcpu=native -O3 test.d -of=test-struct-native
1 sec, 567 ms, 219 μs, and 6 hnsecs
1 sec, 557 ms, 762 μs, and 7 hnsecs
1 sec, 574 ms, 657 μs, and 7 hnsecs

This represents a whopping 40% improvement over the version calling the C library's fmax function. I wonder how this last version compares with the Node.js performance?

Code, for full disclosure (basically copy-n-pasted from James' code, with minor modifications for testing struct vs class, real vs double):

------------------------------------
import std.stdio;
import std.math;
import core.time;

version(useReal)   alias Num = real;
version(useDouble) alias Num = double;

version(customFmax)
Num fmax(Num x, Num y) { return (x < y) ? y : x; }

version(useStruct)
{
    struct Complex
    {
        Num x;
        Num y;

        this(A)(A px, A py)
        {
            this.x = px;
            this.y = py;
        }

        unittest
        {
            auto complex = Complex(2, 2);
            assert(complex.x == 2 && complex.y == 2);
        }

        auto abs() const
        {
            return fmax(this.x * this.x, this.y * this.y);
        }

        void add(T)(const T other)
        {
            this.x += other.x;
            this.y += other.y;
        }

        void mul(T)(const T other)
        {
            auto newX = this.x * other.x - this.y * other.y;
            auto newY = this.x * other.y + this.y * other.x;
            this.x = newX;
            this.y = newY;
        }
    }

    unittest
    {
        auto c = Complex(5, 3);
        c.mul(Complex(4, 2));
        assert(c.x == 14 && c.y == 22);
    }

    unittest
    {
        auto org = Complex(0, 0);
        org.add(Complex(3, 3));
        assert(org.x == 3 && org.y == 3);
    }

    auto iterate_mandelbrot(const Complex c, const int maxIters)
    {
        auto z = Complex(0, 0);
        for (int i = 0; i < maxIters; i++)
        {
            if (z.abs() >= 2.0) { return i; }
            z.mul(z);
            z.add(c);
        }
        return maxIters;
    }

    const x0 = -2.5, x1 = 1, y0 = -1, y1 = 1;
    const cols = 72, rows = 24;
    const maxIters = 1000000;

    void main()
    {
        auto now = MonoTime.currTime;
        for (Num row = 0; row < rows; row++)
        {
            const y = (row / rows) * (y1 - y0) + y0;
            char[] str;
            for (Num col = 0; col < cols; col++)
            {
                // Num is needed here because otherwise "/" does integer division
                const x = (col / cols) * (x1 - x0) + x0;
                auto c = Complex(x, y);
                auto iters = iterate_mandelbrot(c, maxIters);
                if (iters == 0) { str ~= '.'; }
                else if (iters == 1) { str ~= '%'; }
                else if (iters == 2) { str ~= ' '; }
                else if (iters == maxIters) { str ~= ' '; }
                else { }
            }
            str.writeln;
        }
        writeln(MonoTime.currTime - now);
    }
}

version(useClass)
{
    class Complex
    {
        Num x;
        Num y;

        this(A)(A px, A py)
        {
            this.x = px;
            this.y = py;
        }

        unittest
        {
            auto complex = new Complex(2, 2);
            assert(complex.x == 2 && complex.y == 2);
        }

        auto abs() const
        {
            return fmax(this.x * this.x, this.y * this.y);
        }

        void add(T)(const T other)
        {
            this.x += other.x;
            this.y += other.y;
        }

        void mul(T)(const T other)
        {
            auto newX = this.x * other.x - this.y * other.y;
            auto newY = this.x * other.y + this.y * other.x;
            this.x = newX;
            this.y = newY;
        }
    }

    unittest
    {
        auto c = new Complex(5, 3);
        c.mul(new Complex(4, 2));
        assert(c.x == 14 && c.y == 22);
    }

    unittest
    {
        auto org = new Complex(0, 0);
        org.add(new Complex(3, 3));
        assert(org.x == 3 && org.y == 3);
    }

    auto iterate_mandelbrot(const Complex c, const int maxIters)
    {
        auto z = new Complex(0, 0);
        for (int i = 0; i < maxIters; i++)
        {
            if (z.abs() >= 2.0) { return i; }
            z.mul(z);
            z.add(c);
        }
        return maxIters;
    }

    const x0 = -2.5, x1 = 1, y0 = -1, y1 = 1;
    const cols = 72, rows = 24;
    const maxIters = 1000000;

    void main()
    {
        auto now = MonoTime.currTime;
        for (Num row = 0; row < rows; row++)
        {
            const y = (row / rows) * (y1 - y0) + y0;
            char[] str;
            for (Num col = 0; col < cols; col++)
            {
                // Num is needed here because otherwise "/" does integer division
                const x = (col / cols) * (x1 - x0) + x0;
                auto c = new Complex(x, y);
                auto iters = iterate_mandelbrot(c, maxIters);
                if (iters == 0) { str ~= '.'; }
                else if (iters == 1) { str ~= '%'; }
                else if (iters == 2) { str ~= ' '; }
                else if (iters == maxIters) { str ~= ' '; }
                else { }
            }
            str.writeln;
        }
        writeln(MonoTime.currTime - now);
    }
}
------------------------------------

T

-- 
MAS = Mana Ada Sistem?
Aug 22 2020
On Saturday, 22 August 2020 at 16:01:52 UTC, H. S. Teoh wrote:
> Sorry, I missed that the first time round. But how come your struct
> version uses `real` but your class version uses `double`? It's
> well-known that `real` is slow because it uses x87 instructions, as
> opposed to the SSE/etc. instructions that `double` would use.

If you had read all three of my original self-replies, you would have noticed I also did a version with struct + class. Perhaps I need to learn to keep them all in the same post.
Aug 22 2020
On Friday, 21 August 2020 at 23:49:44 UTC, H. S. Teoh wrote:
> Also, what's the command you're using to compile the program? If
> you're doing a performance comparison, you should specify -O2 or -O3.

He mentioned this in his first post:

LDC 1.23.0 (Installed from dlang.org)
ldc2 -release -mcpu=native -O3 -ffast-math --fp-contract=fast
Aug 21 2020
On Fri, Aug 21, 2020 at 04:49:44PM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
> Using a class for Complex (and a non-final one at that!!) introduces
> tons of allocation overhead per iteration, plus virtual function call
> overhead. You should be using a struct instead. I betcha this one
> change will make a big difference in performance.
[...]

OK, so I copied the code and changed the class to struct, and compared the results. Both versions are compiled with ldc2 -O3.

class version:
7 secs, 125 ms, 608 μs, and 9 hnsecs
7 secs, 155 ms, 328 μs, and 6 hnsecs
7 secs, 158 ms, 966 μs, and 4 hnsecs

struct version:
6 secs, 55 ms, 140 μs, and 4 hnsecs
6 secs, 125 ms, 974 μs, and 5 hnsecs
6 secs, 126 ms, 945 μs, and 4 hnsecs

For performance comparisons, take the best of n (because the others are merely measuring more system noise). This represents about a 15% performance increase in switching to struct instead of class.

I thought it might make a difference to optimize for my CPU with -mcpu=native, so here are the numbers:

class version:
7 secs, 100 ms, 602 μs, and 6 hnsecs
7 secs, 100 ms, 437 μs, and 7 hnsecs
7 secs, 121 ms, 594 μs, and 4 hnsecs

struct version:
6 secs, 73 ms, 534 μs, and 3 hnsecs
5 secs, 662 ms, 626 μs, and 5 hnsecs
6 secs, 103 ms, 871 μs, and 2 hnsecs

Again taking the best of 3, that's about a 20% performance increase in changing from class to struct.

Just for laughs, I tested with dmd -O -inline:

class version:
7 secs, 255 ms, 748 μs, and 5 hnsecs
7 secs, 249 ms, 683 μs, and 9 hnsecs
7 secs, 593 ms, 847 μs, and 8 hnsecs

struct version:
7 secs, 646 ms, 685 μs, and 5 hnsecs
7 secs, 618 ms, 642 μs, and 7 hnsecs
7 secs, 606 ms, 85 μs, and 4 hnsecs

Surprisingly, the class version does *better* than the struct version when compiled with dmd. (Wow, is dmd codegen *that* bad that it outweighs even class allocation overhead?? :-D) But both are worse than even the class version with ldc2 -O3 (even without -mcpu=native).

So yeah. I wouldn't trust dmd with a 10-foot pole when it comes to runtime performance. The struct version compiled with `ldc2 -O3 -mcpu=native` beats the struct version compiled with dmd by a 26% margin. That's pretty sad.

T

-- 
An imaginary friend squared is a real enemy.
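The best-of-n methodology above can be sketched as a small helper (hypothetical code, not part of the benchmark):

```d
// Hypothetical sketch of a best-of-n timing helper: run the workload
// n times and keep the minimum, since the extra time in the slower
// runs is mostly system noise.
import core.time : Duration, MonoTime;

Duration bestOfN(void delegate() workload, int n = 3)
{
    Duration best = Duration.max;
    foreach (i; 0 .. n)
    {
        const start = MonoTime.currTime;
        workload();
        const elapsed = MonoTime.currTime - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```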
Aug 21 2020
On Saturday, 22 August 2020 at 00:10:43 UTC, H. S. Teoh wrote:
> Surprisingly, the class version does *better* than the struct version
> when compiled with dmd. (Wow, is dmd codegen *that* bad that it
> outweighs even class allocation overhead?? :-D)

Or maybe DMD is not trying to win any performance contest... just focusing on fast compilation for quick prototyping. Something you wouldn't get otherwise without DMD.
Aug 22 2020
On Friday, 21 August 2020 at 23:10:53 UTC, James Lu wrote:
> [...]

I have no desire to dig into it myself, but I'll just note that if you check the CLBG, you'll see that it's not hard to write C and C++ programs for this benchmark that are many times slower than Node JS. The worst of them takes seven times longer to run.
Aug 21 2020
On Sat, Aug 22, 2020 at 02:08:40AM +0000, bachmeier via Digitalmars-d wrote:
[...]
> I have no desire to dig into it myself, but I'll just note that if you
> check the CLBG, you'll see that it's not hard to write C and C++
> programs for this benchmark that are many times slower than Node JS.
> The worst of them takes seven times longer to run.

As described in my other post, my analysis of James' code reveals the following issues:

1) Using class instead of struct;
2) Using real instead of double;
3) std.math.fmax calling the C library (involving a PIC indirection to a shared library, as opposed to inlineable native D code).

Addressing all 3 issues yielded a 67% improvement (from class + real + C fmax -> struct + double + native fmax), or a 37% improvement (from class + double + C fmax -> struct + double + native fmax). I don't have a Node.js environment, though, so I can't make a direct comparison with the optimized D version.

I will note, though, that (1) and (2) are well-known performance issues; I'm surprised that this was not taken into account in the original comparison. (3) is something worth considering for std.math -- yes, it's troublesome to have to maintain D versions of these functions, but they *could* make a potentially big performance impact by being inlineable in hot inner loops.

T

-- 
Computers shouldn't beep through the keyhole.
Aug 22 2020
On 8/22/20 12:15 PM, H. S. Teoh wrote:
> As described in my other post, my analysis of James' code reveals the
> following issues:
>
> 1) Using class instead of struct;
> 2) Using real instead of double;
> 3) std.math.fmax calling the C library (involving a PIC indirection
>    to a shared library, as opposed to inlineable native D code).
>
> Addressing all 3 issues yielded a 67% improvement (from class + real
> + C fmax -> struct + double + native fmax), or a 37% improvement
> (from class + double + C fmax -> struct + double + native fmax). I
> don't have a Node.js environment, though, so I can't make a direct
> comparison with the optimized D version.

First off, this is a great accomplishment of the V8 team. That's the remarkable part.

Second, this is in many ways not new. JIT optimizers are great at optimizing numeric code and loops. With regularity ever since shortly after Java was invented, "Java is faster than C" claims have been backed by benchmarks involving numeric code and loops. For such, the JIT can do as good a job as, and sometimes even better than, a traditional compiler.

The way a low-level language competes is by offering you ways to tune code to rival/surpass JIT performance if you need to. (Of course, that means the programmer is spending time on that, which is a minus.) I have no doubt the D mandelbrot code can be made at least as fast as the node.js code.

The reality of large applications involves things like structures and indirections and such, for which data layout is important. I think JITs have a way to go in that area.
Aug 22 2020
On Saturday, 22 August 2020 at 16:38:42 UTC, Andrei Alexandrescu wrote:
> [...]

Interesting perspective.
Aug 22 2020
On Saturday, 22 August 2020 at 16:38:42 UTC, Andrei Alexandrescu wrote:
> [snip]
> I have no doubt the D mandelbrot code can be made at least as fast as
> the node.js code. The reality of large applications involves things
> like structures and indirections and such, for which data layout is
> important. I think JITs have a way to go in that area.

It would be interesting to see a write-up where LDC's dynamic compile can speed up D code in the same way.
Aug 22 2020
On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
> 3) std.math.fmax calling the C library (involving a PIC indirection
>    to a shared library, as opposed to inlineable native D code).

Yes, and that's because there's only a `real` version for fmax. If upstream Phobos had proper double/float overloads, we could uncomment the LDC-specific implementations using LLVM intrinsics, which use (obviously much faster) SSE instructions:
https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798

Number crunching in D could be significantly accelerated if the people interested in it showed some love for std.math, but we've had this topic for years.
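Roughly, such overloads would look like this (a sketch only; the actual code behind the link may differ, and `llvm_maxnum` from `ldc.intrinsics` is an assumption here):

```d
// Rough sketch of per-type fmax overloads for LDC. llvm_maxnum lowers
// to an SSE maxss/maxsd-style sequence instead of a call through the
// PLT into the C runtime. Names/signatures are assumptions, not the
// actual Phobos code.
version (LDC)
{
    import ldc.intrinsics : llvm_maxnum;

    float  fmax(float x, float y)   { return llvm_maxnum(x, y); }
    double fmax(double x, double y) { return llvm_maxnum(x, y); }
    real   fmax(real x, real y)     { return llvm_maxnum(x, y); }
}
```

With per-type overloads, a call like `fmax(x * x, y * y)` on doubles stays in SSE registers and can be inlined into the hot loop, which is exactly the cost H. S. Teoh measured above.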
Aug 22 2020
On Saturday, 22 August 2020 at 17:40:11 UTC, kinke wrote:
> Number crunching in D could be significantly accelerated if the
> people interested in it showed some love for std.math, but we've had
> this topic for years.

Sounds to me like a fantastic SAOC project.
Aug 22 2020
On 8/22/20 1:40 PM, kinke wrote:
> Yes, and that's because there's only a `real` version for fmax. If
> upstream Phobos had proper double/float overloads, we could uncomment
> the LDC-specific implementations using LLVM intrinsics, which use
> (obviously much faster) SSE instructions:
> https://github.com/ldc-developers/phobos/blob/1366c7d5be65def916f030785fc1f1833342497d/std/math.d#L7785-L7798

Ow, do we still suffer from that? Sigh.

https://github.com/dlang/phobos/pull/7604/files

It's 10 minutes of work - as much as writing a couple of posts, and much more satisfactory.
Aug 22 2020
On Saturday, 22 August 2020 at 19:20:34 UTC, Andrei Alexandrescu wrote:
> Ow, do we still suffer from that? Sigh.
>
> https://github.com/dlang/phobos/pull/7604/files
>
> It's 10 minutes of work - as much as writing a couple of posts, and
> much more satisfactory.

Cos, sin, tan, asin, acos, atan, etc. There's still more; putting in the actual work that std.math needs is going to take more than 10 mins.
Aug 22 2020
On 8/22/20 3:34 PM, Avrina wrote:
> Cos, sin, tan, asin, acos, atan, etc. There's still more; putting in
> the actual work that std.math needs is going to take more than 10
> mins.

1. Linear time for small n is fine and does not affect the argument.
2. Incremental is still fine.
3. Work has actually been done by Nick Wilson in https://github.com/dlang/phobos/pull/7463.
Aug 22 2020
On Saturday, 22 August 2020 at 21:20:36 UTC, Andrei Alexandrescu wrote:
> 1. Linear time for small n is fine and does not affect the argument.

What argument?

> 2. Incremental is still fine.

It can introduce subtle bugs and problems with precision, flip-flopping between float, double, and real. If it is done all at once, it will only happen once, not every time someone feels like spending 10 mins doing a little bit of work to change one function. It really shouldn't have been implemented this way in the first place.

> 3. Work has actually been done by Nick Wilson in
> https://github.com/dlang/phobos/pull/7463.

A dead pull request? Not unusual.
Aug 22 2020
On 8/22/20 6:09 PM, Avrina wrote:
> [...]
> A dead pull request? Not unusual.

You seem to derive good enjoyment out of making unkind comments.
Aug 22 2020
On Sunday, 23 August 2020 at 02:18:19 UTC, Andrei Alexandrescu wrote:
> [...]
> You seem to derive good enjoyment out of making unkind comments.

I am an observer of truth; if you don't like the truth, look away, as
that seems to be commonplace here anyway.
Aug 24 2020
On Saturday, 22 August 2020 at 19:34:42 UTC, Avrina wrote:
> Cos, sin, tan, asin, acos, atan, etc.. There's still more, putting in
> the actual work that std.math needs is going to take more than 10 mins.

These functions are indeed a lot more involved; I took care of a few of
them some years ago, porting from the Cephes C library, see
https://github.com/dlang/phobos/pull/6272. The main difficulty is the
need to support 4 floating-point formats: single, double, x87 extended,
and quadruple precision.
Aug 22 2020
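Ports of Cephes kernels typically reduce to one generic evaluation routine plus a per-precision coefficient table for each of the four formats. A rough sketch of that shape; the name `evalPoly` and the coefficients are made up for illustration, not real minimax constants from Cephes:

```d
// Horner-scheme polynomial evaluation, generic over float/double/real.
// Each precision supplies its own coefficient table, tuned so the
// approximation error stays below that format's epsilon.
T evalPoly(T)(T x, const T[] coefficients)
{
    // coefficients[i] is the coefficient of x^i; walk from the highest
    // degree down, multiplying and adding (Horner's rule).
    T r = coefficients[$ - 1];
    foreach_reverse (c; coefficients[0 .. $ - 1])
        r = r * x + c;
    return r;
}

void main()
{
    // Evaluate 1 + 2x + 3x^2 at x = 2: 1 + 4 + 12 = 17
    assert(evalPoly(2.0, [1.0, 2.0, 3.0]) == 17.0);
}
```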
On Saturday, 22 August 2020 at 16:15:14 UTC, H. S. Teoh wrote:
> On Sat, Aug 22, 2020 at 02:08:40AM +0000, bachmeier via Digitalmars-d wrote:
> [...]
>> I have no desire to dig into it myself, but I'll just note that if you
>> check the CLBG, you'll see that it's not hard to write C and C++
>> programs for this benchmark that are many times slower than Node JS.
>> The worst of them takes seven times longer to run.
>
> As described in my other post, my analysis of James' code reveals the
> following issues:
>
> 1) Using class instead of struct;
> 2) Using real instead of double;
> 3) std.math.fmax calling the C library (involving a PIC indirection to
>    a shared library as opposed to inlineable native D code).
>
> Addressing all 3 issues yielded a 67% improvement (from class + real +
> C fmax -> struct + double + native fmax), or 37% improvement (from
> class + double + C fmax -> struct + double + native fmax). I don't have
> a Node.js environment, though, so I can't make a direct comparison with
> the optimized D version.
>
> I will note, though, that (1) and (2) are well-known performance
> issues; I'm surprised that this was not taken into account in the
> original comparison. (3) is something worth considering for std.math --
> yes it's troublesome to have to maintain D versions of these functions,
> but they *could* make a potentially big performance impact by being
> inlineable in hot inner loops.

Well, I'm not going to debate the possibility that D will outperform
Node with appropriate optimization. It's clear from the C and C++
timings, though, that you can't just type out a C, C++, or D program and
expect to beat Node - it's going to take some care and knowledge that
you really don't need with Node.
Aug 24 2020
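For reference, the first two fixes in the analysis quoted above amount to turning a GC-allocated class with real fields into a plain value type with double fields. A hedged sketch of what that looks like; this is not James' actual benchmark code, just the general shape of the struct-based version:

```d
// Value-type complex number: no per-node GC allocation, no virtual
// dispatch, and double instead of x87 real keeps the arithmetic in
// SSE registers.
struct Complex
{
    double re = 0.0;
    double im = 0.0;

    // (a+bi) + (c+di) = (a+c) + (b+d)i
    Complex opBinary(string op : "+")(Complex rhs) const
    {
        return Complex(re + rhs.re, im + rhs.im);
    }

    // (a+bi) * (c+di) = (ac-bd) + (ad+bc)i
    Complex opBinary(string op : "*")(Complex rhs) const
    {
        return Complex(re * rhs.re - im * rhs.im,
                       re * rhs.im + im * rhs.re);
    }
}

void main()
{
    auto z = Complex(1, 2) * Complex(3, 4);  // (1+2i)(3+4i) = -5+10i
    assert(z.re == -5 && z.im == 10);
    auto s = Complex(1, 1) + Complex(2, 3);
    assert(s.re == 3 && s.im == 4);
}
```

Because the struct is copied by value and never escapes, hot loops over millions of these values touch no allocator at all, which is where the bulk of the reported improvement came from.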