digitalmars.D - Why Java (server VM) is faster than D?
- aki (64/64) Aug 03 2015 When I was trying to port some Java program to D,
- John Colvin (2/7) Aug 03 2015 What compilation flags?
- Dmitry Olshansky (4/17) Aug 03 2015 Devirtualization? HotSpot is fairly aggressive in that regard.
- Steven Schveighoffer (7/26) Aug 03 2015 Yeah, I think that's it. virtual calls cannot be inlined by the D
- John Colvin (9/37) Aug 03 2015 Yup. I get very similar numbers to aki for his version, but
- Adam D. Ruppe (5/6) Aug 03 2015 Yeah, I tried dmd with the final and didn't get a difference but
- John Colvin (2/9) Aug 03 2015 ouch, yeah those assertions cause me a 30x slowdown!
- aki (13/16) Aug 03 2015 I tried it. DMD is no change, while GDC gets acceptable score.
- Etienne Cimon (8/24) Aug 03 2015 LLVM might be able to do achieve Java's optimization for your use
- Justin Whear (6/6) Aug 03 2015 Java being fastest at running Java-style code is not too surprising. My...
- Iain Buclaw via Digitalmars-d (6/19) Aug 03 2015 I have read somewhere (or maybe heard) that Java VM is able to cache and
- John Colvin (7/72) Aug 03 2015 Not surprising. The virtual function call takes almost all of the
- Adam D. Ruppe (30/30) Aug 03 2015 You can try a few potential optimizations in the D version
- John Colvin (3/9) Aug 03 2015 Making SubFoo a final class and test take SubFoo gives a >10x
- Steven Schveighoffer (7/18) Aug 03 2015 Let's make sure we're all comparing apples to apples here.
- Dmitry Olshansky (7/25) Aug 03 2015 Should be trivial in this particular case. You just keep the original
- Steven Schveighoffer (5/10) Aug 03 2015 Actually, that the call to the invariant should be avoidable if the
- Dmitry Olshansky (4/14) Aug 03 2015 https://issues.dlang.org/show_bug.cgi?id=14865
- Adam D. Ruppe (40/42) Aug 03 2015 Right, gdc and ldc will the the aggressive inlining and local
When I was trying to port some Java program to D, I noticed Java is faster than D. I made a simple bench mark test as follows. Then, I was shocked with the result.

test results on Win8 64bit (smaller is better)

  Java(1.8.0,64bit,server):  0.677
  C++(MS vs2013):            2.141
  D(DMD 2.067.1):            2.448
  D(GDC 4.9.2/2.066):        2.481
  Java(1.8.0,32bit,client):  3.060

Does anyone know the magic of Java?

Thanks, Aki.

---

test program for D lang:

import std.datetime;
import std.stdio;

class Foo {
    int i = 0;
    void bar() {}
};

class SubFoo : Foo {
    override void bar() {
        i = i * 3 + 1;
    }
};

int test(Foo obj, int repeat) {
    for (int r = 0; r<repeat; ++r) {
        obj.bar();
    }
    return obj.i;
}

void main() {
    auto stime = Clock.currTime();
    int repeat = 1000 * 1000 * 1000;
    int ret = test(new SubFoo(), repeat);
    double time = (Clock.currTime() - stime).total!"msecs" / 1000.0;
    writefln("time=%5.3f, ret=%d", time, ret);
}

test program for Java:

class Foo {
    public int i = 0;
    public void bar() {}
};

class SubFoo extends Foo {
    public void bar() {
        i = i * 3 + 1;
    }
};

public class Main {
    public static int test(Foo obj, int repeat) {
        for (int r = 0; r<repeat; ++r) {
            obj.bar();
        }
        return obj.i;
    }

    public static void main(String[] args) {
        long stime = System.currentTimeMillis();
        int repeat = 1000 * 1000 * 1000;
        int ret = test(new SubFoo(), repeat);
        double time = (System.currentTimeMillis() - stime) / 1000.0;
        System.out.printf("time=%5.3f, ret=%d", time, ret);
    }
}
Aug 03 2015
On Monday, 3 August 2015 at 16:27:39 UTC, aki wrote:
> When I was trying to port some Java program to D, I noticed Java is faster than D. I made a simple bench mark test as follows. Then, I was shocked with the result. [...]

What compilation flags?
Aug 03 2015
On 03-Aug-2015 19:27, aki wrote:
> When I was trying to port some Java program to D, I noticed Java is faster than D. I made a simple bench mark test as follows. Then, I was shocked with the result.
>
> test results on Win8 64bit (smaller is better)
>   Java(1.8.0,64bit,server):  0.677
>   C++(MS vs2013):            2.141
>   D(DMD 2.067.1):            2.448
>   D(GDC 4.9.2/2.066):        2.481
>   Java(1.8.0,32bit,client):  3.060
>
> Does anyone know the magic of Java?
>
> Thanks, Aki.

Devirtualization? HotSpot is fairly aggressive in that regard.

--
Dmitry Olshansky
Aug 03 2015
On 8/3/15 12:31 PM, Dmitry Olshansky wrote:
> On 03-Aug-2015 19:27, aki wrote:
>> When I was trying to port some Java program to D, I noticed Java is faster than D. [...]
>> Does anyone know the magic of Java?
>
> Devirtualization? HotSpot is fairly aggressive in that regard.

Yeah, I think that's it. virtual calls cannot be inlined by the D compiler, but could be inlined by hotspot. You can fix this by making the derived class final, or marking the method final, and always using a reference to the derived type.

If you need virtualization still, you will have to deal with lower performance.

-Steve
Aug 03 2015
On Monday, 3 August 2015 at 16:41:42 UTC, Steven Schveighoffer wrote:
> On 8/3/15 12:31 PM, Dmitry Olshansky wrote:
>> On 03-Aug-2015 19:27, aki wrote:
>>> [...]
>>
>> Devirtualization? HotSpot is fairly aggressive in that regard.
>
> Yeah, I think that's it. virtual calls cannot be inlined by the D compiler, but could be inlined by hotspot. You can fix this by making the derived class final, or marking the method final, and always using a reference to the derived type.
>
> If you need virtualization still, you will have to deal with lower performance.
>
> -Steve

Yup. I get very similar numbers to aki for his version, but changing two lines:

    final class SubFoo : Foo {
    int test(F)(F obj, int repeat) {

or less generally:

    int test(SubFoo obj, int repeat) {

gets me down to 0.182s with ldc on OS X
Aug 03 2015
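For readers who want to try the two-line change in context, here is a minimal sketch of the whole program with both edits applied. It is only an illustration built from the benchmark in the first post: the small repeat count and the second call through a `Foo` reference are additions for demonstration, and whether the call actually gets inlined still depends on the compiler and flags used.

```d
import std.stdio;

class Foo
{
    int i = 0;
    void bar() {}
}

// `final` means no further subclass can override bar(), so a call made
// through a SubFoo reference can be bound at compile time.
final class SubFoo : Foo
{
    override void bar() { i = i * 3 + 1; }
}

// Templated on the static type of the argument: test(new SubFoo(), n)
// instantiates test!SubFoo, so obj.bar() is not reached through Foo.
int test(F)(F obj, int repeat)
{
    foreach (r; 0 .. repeat)
        obj.bar();
    return obj.i;
}

void main()
{
    // Small repeat count so the result can be checked by hand:
    // i goes 1, 4, 13, 40, 121.
    writeln(test(new SubFoo(), 5));

    // A Foo reference still compiles, but instantiates test!Foo,
    // which keeps the ordinary virtual call.
    Foo base = new SubFoo();
    writeln(test(base, 5));
}
```

Both calls print 121; only the way `bar` is dispatched differs. The template keeps `test` usable with any class that has `bar` and `i`, at the cost of one instantiation per static type.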
On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
> gets me down to 0.182s with ldc on OS X

Yeah, I tried dmd with the final and didn't get a difference but gdc with final (and -frelease, very important for max speed here since without it the method calls are surrounded by various assertions) got similar speed to the hand-written one too.
Aug 03 2015
On Monday, 3 August 2015 at 16:53:30 UTC, Adam D. Ruppe wrote:
> On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
>> gets me down to 0.182s with ldc on OS X
>
> Yeah, I tried dmd with the final and didn't get a difference but gdc with final (and -frelease, very important for max speed here since without it the method calls are surrounded by various assertions) [...]

ouch, yeah those assertions cause me a 30x slowdown!
Aug 03 2015
On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
> changing two lines:
>
>     final class SubFoo : Foo {
>     int test(F)(F obj, int repeat) {

I tried it. DMD is no change, while GDC gets acceptable score.

  D(DMD 2.067.1):      2.445
  D(GDC 4.9.2/2.066):  0.928

Now I got a hint how to improve the code by hand. Thanks, John.

But the original Java code that I'm porting is about 10,000 lines of code. And the performance is about 3 times different. Yes! Java is 3 times faster than D in my app. I hope the future DMD/GDC compiler will do the similar optimization automatically, not by hand.

Aki.
Aug 03 2015
On Monday, 3 August 2015 at 17:33:30 UTC, aki wrote:
> On Monday, 3 August 2015 at 16:47:58 UTC, John Colvin wrote:
>> changing two lines:
>>
>>     final class SubFoo : Foo {
>>     int test(F)(F obj, int repeat) {
>
> I tried it. DMD is no change, while GDC gets acceptable score. [...]
>
> But the original Java code that I'm porting is about 10,000 lines of code. And the performance is about 3 times different. Yes! Java is 3 times faster than D in my app. I hope the future DMD/GDC compiler will do the similar optimization automatically, not by hand.
>
> Aki.

LLVM might be able to achieve Java's optimization for your use case using profile-guided optimization. In principle, it's hard to choose which function to inline without the function call counts, but LLVM has a back-end with sampling support.

http://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

Whether or not this is or will be available soon for D in LDC is a different matter.
Aug 03 2015
Java being fastest at running Java-style code is not too surprising. My guess is that Java is "hotspot" inlining the calls to `bar`, getting rid of the dynamic dispatch overhead. I think that for real systems D will generally beat out Java across the board, but not if the D version is a straight up transliteration of the Java--expect Java to be the best at running Java code.
Aug 03 2015
On 3 August 2015 at 18:27, aki via Digitalmars-d < digitalmars-d puremagic.com> wrote:
> When I was trying to port some Java program to D, I noticed Java is faster than D. I made a simple bench mark test as follows. Then, I was shocked with the result. [...]
>
> Does anyone know the magic of Java?
>
> Thanks, Aki.

I have read somewhere (or maybe heard) that the Java VM is able to cache and possibly remove/inline dynamic dispatches on the fly. This is a clear win for VM languages over natively compiled ones.

Iain.
Aug 03 2015
On Monday, 3 August 2015 at 16:27:39 UTC, aki wrote:
> When I was trying to port some Java program to D, I noticed Java is faster than D. I made a simple bench mark test as follows. Then, I was shocked with the result.
> [...]

Not surprising. The virtual function call takes almost all of the time and the JVM will be devirtualising it. If you want to call tiny virtual functions in tight loops, use a VM.

That said, it's a bit disappointing that the devirtualisation doesn't happen at compile-time after inlining for a simple case like this.
Aug 03 2015
You can try a few potential optimizations in the D version yourself and see if it makes a difference.

Devirtualization has a very small impact. Test this by making `test` take `SubFoo` and making `bar` final, or making `bar` a stand-alone function. That's not it.

Inlining alone doesn't make a huge difference either - test this by copy/pasting the `bar` method body to the test function.

But we can see a *huge* difference if we inline AND make the data local:

int test(SubFoo obj, int repeat) {
    int i = obj.i; // local variable copy
    for (int r = 0; r<repeat; ++r) {
        //obj.bar();
        i = i * 3 + 1; // do the math on the local
    }
    obj.i = i; // save it back to the object so same result to the outside world
    return obj.i;
}

That cuts the time to less than 1/2 on my computer from the other fastest version.

So I suspect the JVM is able to figure out that the `i` member is being used and puts it in a hot cache instead of accessing it indirectly through the object, just like I did by hand there.

I betcha if the loop ran 5 times, it would be no different, but the JVM realizes after hundreds of iterations that there's a huge optimization potential there and rewrites the code at that point, making it faster for the next million runs.
Aug 03 2015
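A quick way to check that the hand-hoisted loop really is a drop-in replacement, and to time it against the virtual-call version on your own machine, might look like the sketch below. The names `testVirtual`, `testLocal` and the `msecs` helper are made up for this illustration, the repeat count is arbitrary, and the timings will vary with compiler and flags.

```d
import core.time : MonoTime;
import std.stdio;

class Foo { int i = 0; void bar() {} }
final class SubFoo : Foo { override void bar() { i = i * 3 + 1; } }

// Baseline: every iteration goes through a Foo reference.
int testVirtual(Foo obj, int repeat)
{
    foreach (r; 0 .. repeat)
        obj.bar();
    return obj.i;
}

// Hand-hoisted variant: do the math on a local and store it back once.
int testLocal(SubFoo obj, int repeat)
{
    int i = obj.i;
    foreach (r; 0 .. repeat)
        i = i * 3 + 1;
    obj.i = i;
    return obj.i;
}

// Hypothetical helper: run a delegate once, return elapsed milliseconds.
long msecs(scope void delegate() work)
{
    auto start = MonoTime.currTime;
    work();
    return (MonoTime.currTime - start).total!"msecs";
}

void main()
{
    enum repeat = 100_000_000;
    int a, b;
    auto tVirtual = msecs({ a = testVirtual(new SubFoo(), repeat); });
    auto tLocal = msecs({ b = testLocal(new SubFoo(), repeat); });

    // Signed int overflow wraps in D, so both variants should report the
    // same ret value even after the arithmetic has overflowed.
    writefln("virtual: %d ms, ret=%d", tVirtual, a);
    writefln("local:   %d ms, ret=%d", tLocal, b);
}
```

The two `ret` values should match; any difference in the timings is then down to dispatch, inlining and where `i` lives during the loop.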
On Monday, 3 August 2015 at 16:47:14 UTC, Adam D. Ruppe wrote:
> You can try a few potential optimizations in the D version yourself and see if it makes a difference.
>
> Devirtualization has a very small impact. Test this by making `test` take `SubFoo` and making `bar` final, or making `bar` a stand-alone function. That's not it.

Making SubFoo a final class and test take SubFoo gives a >10x speedup for me.
Aug 03 2015
On 8/3/15 12:50 PM, John Colvin wrote:
> On Monday, 3 August 2015 at 16:47:14 UTC, Adam D. Ruppe wrote:
>> You can try a few potential optimizations in the D version yourself and see if it makes a difference.
>>
>> Devirtualization has a very small impact. Test this by making `test` take `SubFoo` and making `bar` final, or making `bar` a stand-alone function. That's not it.
>
> Making SubFoo a final class and test take SubFoo gives a >10x speedup for me.

Let's make sure we're all comparing apples to apples here.

FWIW, I suspect the inlining to be the most significant improvement, which is impossible for virtual functions in D.

ALSO, make SURE you are compiling in release mode, so you aren't calling a virtual invariant function before/after every call.

-Steve
Aug 03 2015
On 03-Aug-2015 19:54, Steven Schveighoffer wrote:
> On 8/3/15 12:50 PM, John Colvin wrote:
>> On Monday, 3 August 2015 at 16:47:14 UTC, Adam D. Ruppe wrote:
>>> [...]
>>
>> Making SubFoo a final class and test take SubFoo gives a >10x speedup for me.
>
> Let's make sure we're all comparing apples to apples here.
>
> FWIW, I suspect the inlining to be the most significant improvement, which is impossible for virtual functions in D.

Should be trivial in this particular case. You just keep the original virtual call where it cannot be deduced.

> ALSO, make SURE you are compiling in release mode, so you aren't calling a virtual invariant function before/after every call.

This one is critical. Actually why do we have an extra call for trivial null-check on any object that doesn't even have invariant?

--
Dmitry Olshansky
Aug 03 2015
On 8/3/15 12:59 PM, Dmitry Olshansky wrote:
> On 03-Aug-2015 19:54, Steven Schveighoffer wrote:
>> ALSO, make SURE you are compiling in release mode, so you aren't calling a virtual invariant function before/after every call.
>
> This one is critical. Actually why do we have an extra call for trivial null-check on any object that doesn't even have invariant?

Actually, the call to the invariant should be avoidable if the object doesn't have one. It should be easy to check the vtable pointer to see if it points at the "default" invariant (which does nothing).

-Steve
Aug 03 2015
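To see the hidden invariant calls that release mode removes, a small sketch along these lines can help. The `Counter` class and the `invariantCalls` variable are hypothetical names used only for this illustration, and the class deliberately declares an invariant so the calls become visible; the posts above are about the cost paid even by classes that declare none.

```d
import std.stdio;

// Hypothetical counter, only here to make the hidden calls visible.
int invariantCalls;

class Counter
{
    int i = 0;

    // Checked around public member function calls when contracts are
    // compiled in; compiled out entirely in release builds.
    invariant()
    {
        ++invariantCalls;
        assert(i >= 0);
    }

    void bump() { i = i + 1; }
}

void main()
{
    auto c = new Counter();
    foreach (r; 0 .. 3)
        c.bump();

    writefln("i=%d, invariant calls=%d", c.i, invariantCalls);
}
```

Built with default flags, the reported number of invariant calls is greater than zero; built with `-release` (dmd) or `-frelease` (gdc), it stays at 0, because the checks are not compiled in at all.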
On 03-Aug-2015 20:05, Steven Schveighoffer wrote:
> On 8/3/15 12:59 PM, Dmitry Olshansky wrote:
>> On 03-Aug-2015 19:54, Steven Schveighoffer wrote:
>>> ALSO, make SURE you are compiling in release mode, so you aren't calling a virtual invariant function before/after every call.
>>
>> This one is critical. Actually why do we have an extra call for trivial null-check on any object that doesn't even have invariant?
>
> [...] It should be easy to check the vtable pointer to see if it points at the "default" invariant (which does nothing).

https://issues.dlang.org/show_bug.cgi?id=14865

--
Dmitry Olshansky
Aug 03 2015
On Monday, 3 August 2015 at 16:50:42 UTC, John Colvin wrote:
> Making SubFoo a final class and test take SubFoo gives a >10x speedup for me.

Right, gdc and ldc will do the aggressive inlining and local data optimizations automatically once they are able to devirtualize the calls (at least when you use the -O flags).

dmd, however, even with -inline, doesn't make the local copy of the variable - it disassembles to this:

08098740 <_D1l4testFC1l6SubFooiZi>:
 8098740:  55           push   ebp
 8098741:  8b ec        mov    ebp,esp
 8098743:  89 c1        mov    ecx,eax
 8098745:  53           push   ebx
 8098746:  31 d2        xor    edx,edx
 8098748:  8b 5d 08     mov    ebx,DWORD PTR [ebp+0x8]
 809874b:  56           push   esi
 809874c:  85 c9        test   ecx,ecx
 809874e:  7e 0f        jle    809875f <_D1l4testFC1l6SubFooiZi+0x1f>
 8098750:  8b 43 08     mov    eax,DWORD PTR [ebx+0x8]
 8098753:  8d 74 40 01  lea    esi,[eax+eax*2+0x1]
 8098757:  42           inc    edx
 8098758:  89 73 08     mov    DWORD PTR [ebx+0x8],esi
 809875b:  39 ca        cmp    edx,ecx
 809875d:  7c f1        jl     8098750 <_D1l4testFC1l6SubFooiZi+0x10>
 809875f:  8b 43 08     mov    eax,DWORD PTR [ebx+0x8]
 8098762:  5e           pop    esi
 8098763:  5b           pop    ebx
 8098764:  5d           pop    ebp
 8098765:  c2 04 00     ret    0x4

There's no call in there, but there is still indirect memory access for the variable, so it doesn't get the caching benefits of the stack.

It isn't news that dmd's optimizer is pretty bad next to.... well, pretty much everyone else nowadays, whether gdc, ldc, or Java, but it is sometimes nice to take a look at why.

The biggest magic of Java IMO here is being CPU cache friendly!
Aug 03 2015