
digitalmars.D - Why Java (server VM) is faster than D?

"aki" <aki google.com> writes:
When I was trying to port a Java program to D,
I noticed that Java is faster than D.
I wrote a simple benchmark, shown below,
and I was shocked by the result.

Test results on Win8 64-bit (time in seconds, smaller is better):
Java(1.8.0,64bit,server): 0.677
C++(MS vs2013): 2.141

D(DMD 2.067.1): 2.448
D(GDC 4.9.2/2.066): 2.481
Java(1.8.0,32bit,client): 3.060

Does anyone know the magic of Java?

Thanks, Aki.

---

test program for D lang:
import std.datetime;
import std.stdio;
class Foo {
	int i = 0;
	void bar() {}
};
class SubFoo : Foo {
	override void bar() {
		i = i * 3 + 1;
	}
};
int test(Foo obj, int repeat) {
	for (int r = 0; r<repeat; ++r) {
		obj.bar();
	}
	return obj.i;
}
void main() {
	auto stime = Clock.currTime();
	int repeat = 1000 * 1000 * 1000;
	int ret = test(new SubFoo(), repeat);
	double time = (Clock.currTime() - stime).total!"msecs" / 1000.0;
	writefln("time=%5.3f, ret=%d", time, ret);
}

test program for Java:
class Foo {
	public int i = 0;
	public void bar() {}
};
class SubFoo extends Foo {
	public void bar() {
		i = i * 3 + 1;
	}
};
public class Main {
	public static int test(Foo obj, int repeat) {
		for (int r = 0; r<repeat; ++r) {
			obj.bar();
		}
		return obj.i;
	}
	public static void main(String[] args) {
		long stime = System.currentTimeMillis();
		int repeat = 1000 * 1000 * 1000;
		int ret = test(new SubFoo(), repeat);
		double time = (System.currentTimeMillis() - stime) / 1000.0;
		System.out.printf("time=%5.3f, ret=%d", time, ret);
	}
}
Aug 03 2015
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 3 August 2015 at 16:27:39 UTC, aki wrote:
 When I was trying to port some Java program to D,
 I noticed Java is faster than D.
 I made a simple bench mark test as follows.
 Then, I was shocked with the result.

 [...]
What compilation flags?
Aug 03 2015
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Aug-2015 19:27, aki wrote:
 When I was trying to port some Java program to D,
 I noticed Java is faster than D.
 I made a simple bench mark test as follows.
 Then, I was shocked with the result.

 test results on Win8 64bit (smaller is better)
 Java(1.8.0,64bit,server): 0.677
 C++(MS vs2013): 2.141

 D(DMD 2.067.1): 2.448
 D(GDC 4.9.2/2.066): 2.481
 Java(1.8.0,32bit,client): 3.060

 Does anyone know the magic of Java?

 Thanks, Aki.
Devirtualization? HotSpot is fairly aggressive in that regard.

-- Dmitry Olshansky
Aug 03 2015
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/3/15 12:31 PM, Dmitry Olshansky wrote:
Devirtualization? HotSpot is fairly aggressive in that regard.
Yeah, I think that's it. Virtual calls cannot be inlined by the D compiler, but can be inlined by HotSpot. You can fix this by making the derived class final, or marking the method final, and always using a reference to the derived type. If you still need virtual dispatch, you will have to accept lower performance.

-Steve
Aug 03 2015
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 3 August 2015 at 16:41:42 UTC, Steven Schveighoffer 
wrote:
 On 8/3/15 12:31 PM, Dmitry Olshansky wrote:
Devirtualization? HotSpot is fairly aggressive in that regard.
Yeah, I think that's it. virtual calls cannot be inlined by the D compiler, but could be inlined by hotspot. You can fix this by making the derived class final, or marking the method final, and always using a reference to the derived type. If you need virtualization still, you will have to deal with lower performance. -Steve
Yup. I get very similar numbers to aki for his version, but changing two lines:

	final class SubFoo : Foo {

	int test(F)(F obj, int repeat) {

or less generally:

	int test(SubFoo obj, int repeat) {

gets me down to 0.182s with ldc on OS X.
Aug 03 2015
"Adam D. Ruppe" <destructionator gmail.com> writes:
On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
 gets me down to 0.182s with ldc on OS X
Yeah, I tried dmd with final and didn't get a difference, but gdc with final (and -frelease, which is very important for max speed here, since without it the method calls are surrounded by various assertions) got similar speed to the hand-written one too.
Aug 03 2015
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 3 August 2015 at 16:53:30 UTC, Adam D. Ruppe wrote:
 On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
 gets me down to 0.182s with ldc on OS X
Yeah, I tried dmd with the final and didn't get a difference but gdc with final (and -frelease, very important for max speed here since without it the method calls are surrounded by various assertions) and got similar speed to the hand written one too.
ouch, yeah those assertions cause me a 30x slowdown!
Aug 03 2015
"aki" <aki google.com> writes:
On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
 changing two lines:
 final class SubFoo : Foo {
 int test(F)(F obj, int repeat) {
I tried it. DMD shows no change, while GDC gets an acceptable score.

D(DMD 2.067.1): 2.445
D(GDC 4.9.2/2.066): 0.928

Now I have a hint for how to improve the code by hand. Thanks, John.

But the original Java code that I'm porting is about 10,000 lines of code, and the performance differs by about a factor of 3. Yes! Java is 3 times faster than D in my app. I hope future DMD/GDC compilers will do similar optimization automatically, not by hand.

Aki.
Aug 03 2015
"Etienne Cimon" <etcimon gmail.com> writes:
On Monday, 3 August 2015 at 17:33:30 UTC, aki wrote:
 On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
 changing two lines:
 final class SubFoo : Foo {
 int test(F)(F obj, int repeat) {
I tried it. DMD is no change, while GDC gets acceptable score. D(DMD 2.067.1): 2.445 D(GDC 4.9.2/2.066): 0.928 Now I got a hint how to improve the code by hand. Thanks, John. But the original Java code that I'm porting is about 10,000 lines of code. And the performance is about 3 times different. Yes! Java is 3 times faster than D in my app. I hope the future DMD/GDC compiler will do the similar optimization automatically, not by hand. Aki.
LLVM might be able to achieve Java's optimization for your use case using profile-guided optimization. In principle, it's hard to choose which function to inline without the function call counts, but LLVM has a back-end with sampling support.

http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

Whether or not this is or will be available soon for D in LDC is a different matter.
Aug 03 2015
Justin Whear <justin economicmodeling.com> writes:
Java being fastest at running Java-style code is not too surprising.  My 
guess is that Java is "hotspot" inlining the calls to `bar`, getting rid 
of the dynamic dispatch overhead.  I think that for real systems D will 
generally beat out Java across the board, but not if the D version is a 
straight up transliteration of the Java--expect Java to be the best at 
running Java code.
Aug 03 2015
Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 3 August 2015 at 18:27, aki via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 When I was trying to port some Java program to D,
 I noticed Java is faster than D.
 I made a simple bench mark test as follows.
 Then, I was shocked with the result.

 test results on Win8 64bit (smaller is better)
 Java(1.8.0,64bit,server): 0.677
 C++(MS vs2013): 2.141

 D(DMD 2.067.1): 2.448
 D(GDC 4.9.2/2.066): 2.481
 Java(1.8.0,32bit,client): 3.060

 Does anyone know the magic of Java?

 Thanks, Aki.
I have read somewhere (or maybe heard) that the Java VM is able to cache and possibly remove/inline dynamic dispatches on the fly. This is a clear win for VM languages over natively compiled ones.

Iain.
Aug 03 2015
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 3 August 2015 at 16:27:39 UTC, aki wrote:
 When I was trying to port some Java program to D,
 I noticed Java is faster than D.
 I made a simple bench mark test as follows.
 Then, I was shocked with the result.

 test results on Win8 64bit (smaller is better)
 Java(1.8.0,64bit,server): 0.677
 C++(MS vs2013): 2.141

 D(DMD 2.067.1): 2.448
 D(GDC 4.9.2/2.066): 2.481
 Java(1.8.0,32bit,client): 3.060

 Does anyone know the magic of Java?

 Thanks, Aki.

Not surprising. The virtual function call takes almost all of the time and the JVM will be devirtualising it. If you want to call tiny virtual functions in tight loops, use a VM. That said, it's a bit disappointing that the devirtualisation doesn't happen at compile-time after inlining for a simple case like this.
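
For anyone wondering what "devirtualising" amounts to here, the following is a rough hand-written Java sketch of the idea, not actual HotSpot output. `Foo`, `SubFoo`, `bar` and `test` mirror aki's test program; the class `GuardedInlineSketch` and the method `testGuarded` are made-up names for illustration. The assumption is that a JIT which has only ever seen `SubFoo` at the call site can check the class once, run an inlined copy of `bar`, and keep the ordinary virtual call as the fallback.

```java
class Foo {
    public int i = 0;
    public void bar() {}
}

class SubFoo extends Foo {
    @Override
    public void bar() {
        i = i * 3 + 1;
    }
}

class GuardedInlineSketch {
    // Plain version from the benchmark: one virtual call per iteration.
    static int test(Foo obj, int repeat) {
        for (int r = 0; r < repeat; ++r) {
            obj.bar();
        }
        return obj.i;
    }

    // Hypothetical hand-written imitation of a guarded inline:
    // check the exact class once, then run the body of SubFoo.bar directly.
    static int testGuarded(Foo obj, int repeat) {
        if (obj.getClass() == SubFoo.class) {
            int i = obj.i;            // work on a local copy of the field
            for (int r = 0; r < repeat; ++r) {
                i = i * 3 + 1;        // inlined body of SubFoo.bar
            }
            obj.i = i;                // write the result back once
            return i;
        }
        return test(obj, repeat);     // unknown class: keep the virtual call
    }

    public static void main(String[] args) {
        int a = test(new SubFoo(), 1000);
        int b = testGuarded(new SubFoo(), 1000);
        System.out.println(a == b);   // both paths compute the same value
    }
}
```

An ahead-of-time compiler could in principle emit the same kind of guard, but it has to guess which class to specialise for, whereas the VM can look at what the running program actually passed in.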
Aug 03 2015
"Adam D. Ruppe" <destructionator gmail.com> writes:
You can try a few potential optimizations in the D version 
yourself and see if they make a difference.

Devirtualization has a very small impact. Test this by making 
`test` take `SubFoo` and making `bar` final, or making `bar` a 
stand-alone function.

That's not it.

Inlining alone doesn't make a huge difference either - test this 
by copy/pasting the `bar` method body to the test function.

But we can see a *huge* difference if we inline AND make the data 
local:

int test(SubFoo obj, int repeat) {
         int i = obj.i; // local variable copy
         for (int r = 0; r < repeat; ++r) {
                 //obj.bar();
                 i = i * 3 + 1; // do the math on the local
         }
         obj.i = i; // save it back to the object so the outside world sees the same result
         return obj.i;
}



That cuts the time to less than half of the next-fastest version 
on my computer.

So I suspect the JVM is able to figure out that the `i` member is 
being used and puts it in a hot cache instead of accessing it 
indirectly through the object, just like I did by hand there.

I betcha if the loop ran 5 times, it would be no different, but 
the JVM realizes after hundreds of iterations that there's a huge 
optimization potential there and rewrites the code at that point, 
making it faster for the next million runs.
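
One way to check that guess is to time several rounds inside the same JVM process instead of a single run. The sketch below is only an assumption about how one might measure it: `WarmupSketch` and `timeRounds` are made-up names, `Foo`, `SubFoo` and `test` are copied from aki's program, and no particular numbers are claimed. If the JIT rewrites the loop partway through, the first round would be expected to come out slower than the later ones.

```java
class Foo {
    public int i = 0;
    public void bar() {}
}

class SubFoo extends Foo {
    @Override
    public void bar() {
        i = i * 3 + 1;
    }
}

class WarmupSketch {
    static int test(Foo obj, int repeat) {
        for (int r = 0; r < repeat; ++r) {
            obj.bar();
        }
        return obj.i;
    }

    // Runs the benchmark loop `rounds` times in one process and
    // returns the elapsed wall-clock time of each round in milliseconds.
    static long[] timeRounds(int rounds, int repeat) {
        long[] millis = new long[rounds];
        for (int k = 0; k < rounds; ++k) {
            long start = System.nanoTime();
            int ret = test(new SubFoo(), repeat);
            millis[k] = (System.nanoTime() - start) / 1_000_000;
            // Printing ret keeps the result observable, so the loop is not dead code.
            System.out.printf("round %d: %d ms (ret=%d)%n", k, millis[k], ret);
        }
        return millis;
    }

    public static void main(String[] args) {
        timeRounds(5, 100 * 1000 * 1000);
    }
}
```

Running it with different `repeat` values (5, a few hundred, a few million) would show whether short loops really see no benefit.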
Aug 03 2015
"John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 3 August 2015 at 16:47:14 UTC, Adam D. Ruppe wrote:
 You can try a few potential optimizations in the D version 
 yourself and see if it makes a difference.

 Devirtualization has a very small impact. Test this by making 
 `test` take `SubFoo` and making `bar` final, or making `bar` a 
 stand-alone function.

 That's not it.
Making SubFoo a final class and test take SubFoo gives a >10x speedup for me.
Aug 03 2015
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/3/15 12:50 PM, John Colvin wrote:
 On Monday, 3 August 2015 at 16:47:14 UTC, Adam D. Ruppe wrote:
 You can try a few potential optimizations in the D version yourself
 and see if it makes a difference.

 Devirtualization has a very small impact. Test this by making `test`
 take `SubFoo` and making `bar` final, or making `bar` a stand-alone
 function.

 That's not it.
Making SubFoo a final class and test take SubFoo gives a >10x speedup for me.
Let's make sure we're all comparing apples to apples here.

FWIW, I suspect inlining is the most significant improvement, and that is impossible for virtual functions in D.

ALSO, make SURE you are compiling in release mode, so you aren't calling a virtual invariant function before/after every call.

-Steve
Aug 03 2015
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Aug-2015 19:54, Steven Schveighoffer wrote:
 On 8/3/15 12:50 PM, John Colvin wrote:
 On Monday, 3 August 2015 at 16:47:14 UTC, Adam D. Ruppe wrote:
 You can try a few potential optimizations in the D version yourself
 and see if it makes a difference.

 Devirtualization has a very small impact. Test this by making `test`
 take `SubFoo` and making `bar` final, or making `bar` a stand-alone
 function.

 That's not it.
Making SubFoo a final class and test take SubFoo gives a >10x speedup for me.
Let's make sure we're all comparing apples to apples here. FWIW, I suspect the inlining to be the most significant improvement, which is impossible for virtual functions in D.
Should be trivial in this particular case. You just keep the original virtual call where it cannot be deduced.
 ALSO, make SURE you are compiling in release mode, so you aren't calling
 a virtual invariant function before/after every call.
This one is critical. Actually, why do we have an extra call for a trivial null check on any object that doesn't even have an invariant?

-- Dmitry Olshansky
Aug 03 2015
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 8/3/15 12:59 PM, Dmitry Olshansky wrote:
 On 03-Aug-2015 19:54, Steven Schveighoffer wrote:
 ALSO, make SURE you are compiling in release mode, so you aren't calling
 a virtual invariant function before/after every call.
This one is critical. Actually why do we have an extra call for trivial null-check on any object that doesn't even have invariant?
Actually, the call to the invariant should be avoidable if the object doesn't have one. It should be easy to check the vtable pointer to see if it points at the "default" invariant (which does nothing).

-Steve
Aug 03 2015
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Aug-2015 20:05, Steven Schveighoffer wrote:
 On 8/3/15 12:59 PM, Dmitry Olshansky wrote:
 On 03-Aug-2015 19:54, Steven Schveighoffer wrote:
 ALSO, make SURE you are compiling in release mode, so you aren't calling
 a virtual invariant function before/after every call.
This one is critical. Actually why do we have an extra call for trivial null-check on any object that doesn't even have invariant?
Actually, that the call to the invariant should be avoidable if the object doesn't have one. It should be easy to check the vtable pointer to see if it points at the "default" invariant (which does nothing).
https://issues.dlang.org/show_bug.cgi?id=14865 -- Dmitry Olshansky
Aug 03 2015
"Adam D. Ruppe" <destructionator gmail.com> writes:
On Monday, 3 August 2015 at 16:50:42 UTC, John Colvin wrote:
 Making SubFoo a final class and test take SubFoo gives a >10x 
 speedup for me.
Right, gdc and ldc will do the aggressive inlining and local data optimizations automatically once they are able to devirtualize the calls (at least when you use the -O flags).

dmd, however, even with -inline, doesn't make the local copy of the variable - it disassembles to this:

08098740 <_D1l4testFC1l6SubFooiZi>:
 8098740:       55                      push   ebp
 8098741:       8b ec                   mov    ebp,esp
 8098743:       89 c1                   mov    ecx,eax
 8098745:       53                      push   ebx
 8098746:       31 d2                   xor    edx,edx
 8098748:       8b 5d 08                mov    ebx,DWORD PTR [ebp+0x8]
 809874b:       56                      push   esi
 809874c:       85 c9                   test   ecx,ecx
 809874e:       7e 0f                   jle    809875f <_D1l4testFC1l6SubFooiZi+0x1f>
 8098750:       8b 43 08                mov    eax,DWORD PTR [ebx+0x8]
 8098753:       8d 74 40 01             lea    esi,[eax+eax*2+0x1]
 8098757:       42                      inc    edx
 8098758:       89 73 08                mov    DWORD PTR [ebx+0x8],esi
 809875b:       39 ca                   cmp    edx,ecx
 809875d:       7c f1                   jl     8098750 <_D1l4testFC1l6SubFooiZi+0x10>
 809875f:       8b 43 08                mov    eax,DWORD PTR [ebx+0x8]
 8098762:       5e                      pop    esi
 8098763:       5b                      pop    ebx
 8098764:       5d                      pop    ebp
 8098765:       c2 04 00                ret    0x4

There's no call in there, but there is still indirect memory access for the variable, so it doesn't get the caching benefits of the stack.

It isn't news that dmd's optimizer is pretty bad next to... well, pretty much everyone else nowadays, whether gdc, ldc, or Java, but it is sometimes nice to take a look at why.

The biggest magic of Java IMO here is being CPU cache friendly!
Aug 03 2015