digitalmars.D - It seems like DMD doesn't optimize tail recursion very well
- Akhropov (59/59) Jun 09 2006 Looking through tests on http://shootout.alioth.debian.org/ website I no...
- James Pelcis (64/64) Jun 09 2006 You seem to be right. Just to make a fairer test, I made both versions
- sailormoontw (20/20) Jun 09 2006 It takes twice the time for D to do the same job...
- pragma (17/21) Jun 09 2006 I can confirm. Using your source code, comparing DMC to DMD, I got:
- Deewiant (2/16) Jun 09 2006 Looking at your numbers, D was actually precisely 313657318 clock cycles...
- pragma (3/19) Jun 09 2006 Oh, my bad. Thanks for catching that. :)
- AKhropov (21/45) Jun 09 2006 Did your turn maximum optimization on?
- pragma (45/46) Jun 09 2006 As a matter of fact, I did not. Sure enough, with "-O -inline release"
- Bruno Medeiros (5/56) Jun 11 2006 What are those NOPs for, if I may ask?
- Jarrett Billingsley (5/6) Jun 11 2006 Usually for alignment, so that the instructions pair better in the pipel...
- James Dunne (6/18) Jun 11 2006 A MOV reg, reg instruction is encoded as a single byte, so to pad it
- pragma (3/7) Jun 12 2006 I inserted them manually into the routine. They're there so I could eas...
- pragma (141/141) Jun 09 2006 Attached are the ASM dump of the Ack() function as outlined elsewhere in...
- michal.minich gmail.com (11/152) Jun 09 2006 High level functions like ENTER LEVAVE are relic of old effort of trying...
- Andrei Khropov (89/90) Jun 09 2006 It is more efficient compared to DMC only.
- Gregor Richards (7/7) Jun 09 2006 $ gdmd ack.d -O -release -inline -ofack-gdc
- Dave (21/28) Jun 09 2006 I get the following. If dmd is reverted back to v0.139, it does much
Looking through tests on http://shootout.alioth.debian.org/ website I noticed that the only tests where DMD is inferior to GCC are where recursion is seriously involved (binary trees and recursive). My own experiments with ackermann function confirmed that: C++: --------------------------------------------------------------- #include <iostream> #include <windows.h> // for GetTickCount int Ack(int a, int b) { if (a == 0) return b + 1; else if (b == 0) return Ack(a - 1, 1); else return Ack(a - 1, Ack(a, b - 1)); } int main() { DWORD start = GetTickCount(); std::cout << Ack(3, 11) << std::endl; DWORD stop = GetTickCount(); std::cout << stop-start << " ms elapsed.\n"; return 0; } ------------------------------------------------------------- D: ------------------------------------------------------------- private import std.perf, std.stdio; int Ack(int a, int b) { if (a == 0) return b + 1; else if (b == 0) return Ack(a - 1, 1); else return Ack(a - 1, Ack(a, b - 1)); } void main() { HighPerformanceCounter t = new HighPerformanceCounter(); t.start(); Ack(3, 11); t.stop(); writefln("%d ms elapsed.", t.milliseconds()); } --------------------------------------------------------------- I tried MSVC++ 8, MinGW GCC 3.4.2 and DMD 0.157 cl ack.cpp /Ox /Feack-cpp.exe g++ ack.cpp -O99 -oack-gpp.exe dmd ack.d -O -release -inline -ofack-d.exe Here are the results (averaged over 10 times): 1) MS C++ : 830.5 ms 3) GCC : 990.4 ms 5) DMD : 2247.3 ms Am I right? What could be done with this? -- AKhropov
Jun 09 2006
You seem to be right. Just to make a fairer test, I made both versions use the same timing code and did a DMC vs. DMD. I'm not the best of coders though, so it might not be perfect. ------------------------------------------ C++: ------------------------------------------ #include <iostream.h> int Ack(int a, int b) { if (a == 0) return b + 1; else if (b == 0) return Ack(a - 1, 1); else return Ack(a - 1, Ack(a, b - 1)); } long long getCount() { asm { rdtsc; ret; } } int main() { long long start = getCount(); Ack(3, 11); long long stop = getCount(); cout << stop-start << " clock cycles elapsed.\n"; return 0; } ------------------------------------------ D: ------------------------------------------ private import std.stdio; int Ack(int a, int b) { if (a == 0) return b + 1; else if (b == 0) return Ack(a - 1, 1); else return Ack(a - 1, Ack(a, b - 1)); } long getCount() { asm { rdtsc; ret; } } void main() { long start = getCount(); Ack(3, 11); long stop = getCount(); writefln("%d clock cycles elapsed.", stop - start); } ------------------------------------------ My results (averaged and rounded) were 5,575,010,530 for C++ and 8,339,840,041 for D. Since it was with the same linker, optimizer, etc., this is a bad sign. Is the problem in the code or the compiler?
Jun 09 2006
It takes twice the time for D to do the same job... but at least it's better than Ruby, Common Lisp, and Haskell these 3 all failed...either stack overflow or halted. Ruby: def Ack(a,b) return b + 1 if a == 0 return Ack(a-1, 1) if b == 0 Ack(a-1, Ack(a, b-1)) end Common Lisp: (defun ack (a b) (cond ((equal a 0) (1+ b)) ((equal b 0) (ack (1- a) 1)) (t (ack (1- a) (ack a (1- b)))))) Haskell: ack :: Int -> Int -> Int ack 0 b = b+1 ack a 0 = ack (a-1) 1 ack a b = ack (a-1) (ack a (b-1))
Jun 09 2006
In article <e6c2kn$m48$1 digitaldaemon.com>, James Pelcis says...My results (averaged and rounded) were 5,575,010,530 for C++ and 8,339,840,041 for D. Since it was with the same linker, optimizer, etc., this is a bad sign. Is the problem in the code or the compiler?I can confirm. Using your source code, comparing DMC to DMD, I got: C:\home\pragma\src\test>ackcpp.exe 7982698959 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 10881160506 clock cycles elapsed. Now on a whim, I added "extern(C)" to the definition of Ack() like so:extern(C) int Ack(int a, int b){ /* ... */ }And got this: C:\home\pragma\src\test>ackcpp.exe 7371540535 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 7057883217 clock cycles elapsed. With that change D is still lagging, but by *significantly* less - so there's another factor here. So there is some kind of drawback to using the D callspec with recursion. My guess is that exploring the ASM using obj2asm would help show why. - EricAnderton at yahoo
Jun 09 2006
pragma wrote:Now on a whim, I added "extern(C)" to the definition of Ack() like so:Looking at your numbers, D was actually precisely 313657318 clock cycles _faster_.extern(C) int Ack(int a, int b){ /* ... */ }And got this: C:\home\pragma\src\test>ackcpp.exe 7371540535 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 7057883217 clock cycles elapsed. With that change D is still lagging, but by *significantly* less - so there's another factor here.
Jun 09 2006
In article <e6c6o2$qba$1 digitaldaemon.com>, Deewiant says...pragma wrote:Oh, my bad. Thanks for catching that. :) - EricAnderton at yahooNow on a whim, I added "extern(C)" to the definition of Ack() like so:Looking at your numbers, D was actually precisely 313657318 clock cycles _faster_.extern(C) int Ack(int a, int b){ /* ... */ }And got this: C:\home\pragma\src\test>ackcpp.exe 7371540535 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 7057883217 clock cycles elapsed. With that change D is still lagging, but by *significantly* less - so there's another factor here.
Jun 09 2006
pragma wrote:In article <e6c2kn$m48$1 digitaldaemon.com>, James Pelcis says...Did your turn maximum optimization on? Here are my results with James' code: --- compiler flags: ----- dmc akk2.cpp -o -oakk2-cpp.exe dmd akk2.d -O -release -inline -ofakk2-d.exe ------------------------- 1) both variants without "extern (C)" C++: 4379147218 clock cycles elapsed. D : 3669672891 clock cycles elapsed. 2) both variants with "extern (C)" (well, "extern "C"" in C++ case) C++: 4414772278 clock cycles elapsed. D : 4394787775 clock cycles elapsed. So, turning on "extern (C)" actually made it slower and approximately the same speed as C++. So, I think the problem is in the back-end (which is the same for DMD and DMC AFAIK ). It would be interesting to hear from guys using GDC with GCC back-end. -- AKhropovMy results (averaged and rounded) were 5,575,010,530 for C++ and 8,339,840,041 for D. Since it was with the same linker, optimizer, etc., this is a bad sign. Is the problem in the code or the compiler?I can confirm. Using your source code, comparing DMC to DMD, I got: C:\home\pragma\src\test>ackcpp.exe 7982698959 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 10881160506 clock cycles elapsed. Now on a whim, I added "extern(C)" to the definition of Ack() like so:extern(C) int Ack(int a, int b){ /* ... */ }And got this: C:\home\pragma\src\test>ackcpp.exe 7371540535 clock cycles elapsed. C:\home\pragma\src\test>ackd.exe 7057883217 clock cycles elapsed.
Jun 09 2006
In article <e6c9mm$v9q$1 digitaldaemon.com>, AKhropov says...Did your turn maximum optimization on?As a matter of fact, I did not. Sure enough, with "-O -inline release" specified, the enter/leave instructions dissapear and are replaced with their push/pop/mov counterparts (see attached). :) - Eric SUB_L00402010: push ebp mov ebp,esp nop nop nop cmp dword ptr [ebp+08h],00000000h jnz L00402022 mov eax,[ebp+0Ch] inc eax pop ebp retn ;------------------------------------------------------------------------------ L00402022: cmp dword ptr [ebp+0Ch],00000000h jnz L00402039 push 00000001h mov eax,[ebp+08h] dec eax push eax call SUB_L00402010 add esp,00000008h pop ebp retn ;------------------------------------------------------------------------------ L00402039: mov eax,[ebp+0Ch] dec eax push eax push [ebp+08h] call SUB_L00402010 add esp,00000008h push eax mov eax,[ebp+08h] dec eax push eax call SUB_L00402010 add esp,00000008h pop ebp retn
Jun 09 2006
pragma wrote:In article <e6c9mm$v9q$1 digitaldaemon.com>, AKhropov says...What are those NOPs for, if I may ask? -- Bruno Medeiros - CS/E student http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#DDid your turn maximum optimization on?As a matter of fact, I did not. Sure enough, with "-O -inline release" specified, the enter/leave instructions dissapear and are replaced with their push/pop/mov counterparts (see attached). :) - Eric SUB_L00402010: push ebp mov ebp,esp nop nop nop cmp dword ptr [ebp+08h],00000000h jnz L00402022 mov eax,[ebp+0Ch] inc eax pop ebp retn ;------------------------------------------------------------------------------ L00402022: cmp dword ptr [ebp+0Ch],00000000h jnz L00402039 push 00000001h mov eax,[ebp+08h] dec eax push eax call SUB_L00402010 add esp,00000008h pop ebp retn ;------------------------------------------------------------------------------ L00402039: mov eax,[ebp+0Ch] dec eax push eax push [ebp+08h] call SUB_L00402010 add esp,00000008h push eax mov eax,[ebp+08h] dec eax push eax call SUB_L00402010 add esp,00000008h pop ebp retn
Jun 11 2006
"Bruno Medeiros" <brunodomedeirosATgmail SPAM.com> wrote in message news:e6i5n5$kok$2 digitaldaemon.com...What are those NOPs for, if I may ask?Usually for alignment, so that the instructions pair better in the pipeline, though three nops in a row seems kind of odd, unless that's some kind of technique I've never seen before.
Jun 11 2006
Jarrett Billingsley wrote:"Bruno Medeiros" <brunodomedeirosATgmail SPAM.com> wrote in message news:e6i5n5$kok$2 digitaldaemon.com...A MOV reg, reg instruction is encoded as a single byte, so to pad it gets three NOPs. Correct me if mistaken! -- Regards, James DunneWhat are those NOPs for, if I may ask?Usually for alignment, so that the instructions pair better in the pipeline, though three nops in a row seems kind of odd, unless that's some kind of technique I've never seen before.
Jun 11 2006
In article <e6i5n5$kok$2 digitaldaemon.com>, Bruno Medeiros says...What are those NOPs for, if I may ask?I inserted them manually into the routine. They're there so I could easily find the routine in the dump. ;)-- Bruno Medeiros - CS/E student http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Jun 12 2006
Attached are the ASM dump of the Ack() function as outlined elsewhere in this thread. I was actually suprised that the extern(C) D version of the Ack() function was byte-for-byte identical to the plain C/CPP version of the same code. The plain extern(D) version was the only run that created a different routine, and hence, the wildly different timings. To my eye, the codegen in D is not only shorter, but more efficent. The only thing I'm not too sure on is the use of enter/leave instead of push/pop/mov for maintaining up the function's stack frame. While I have no solid evidence, I'd be willing to bet that these two instructions are what's holding things up. My timings are on an Intel chip. Anyone out there with an AMD machine willing to try this on? - Eric // dissasembler listing for ackd.exe // int Ack(int a, int b) SUB_L00402010: enter 0004h,00h mov [ebp-04h],eax nop nop nop cmp dword ptr [ebp+08h],00000000h jnz L00402025 inc eax leave retn 0004h ;------------------------------------------------------------------------------ L00402025: cmp dword ptr [ebp-04h],00000000h jnz L0040203E mov eax,[ebp+08h] dec eax push eax mov eax,00000001h call SUB_L00402010 leave retn 0004h ;------------------------------------------------------------------------------ L0040203E: mov ecx,[ebp+08h] lea edx,[ecx-01h] push edx push ecx mov eax,[ebp-04h] dec eax call SUB_L00402010 call SUB_L00402010 leave retn 0004h // dissasembler listing for ackd.exe // extern(C) int Ack(int a, int b) SUB_L00402010: push ebp mov ebp,esp push ebx nop nop nop cmp dword ptr [ebp+08h],00000000h jnz L00402024 mov eax,[ebp+0Ch] inc eax pop ebx pop ebp retn ;------------------------------------------------------------------------------ L00402024: cmp dword ptr [ebp+0Ch],00000000h jnz L0040203C push 00000001h mov ecx,[ebp+08h] dec ecx push ecx call SUB_L00402010 add esp,00000008h pop ebx pop ebp retn ;------------------------------------------------------------------------------ L0040203C: mov edx,[ebp+0Ch] dec edx push edx push [ebp+08h] call SUB_L00402010 add esp,00000008h push eax mov ebx,[ebp+08h] dec ebx push ebx call SUB_L00402010 add esp,00000008h pop ebx pop ebp retn // dissasembler listing for ackcpp.exe // int Ack(int a, int b) SUB_L00402010: push ebp mov ebp,esp push ebx nop nop nop cmp dword ptr [ebp+08h],00000000h jnz L00402024 mov eax,[ebp+0Ch] inc eax pop ebx pop ebp retn ;------------------------------------------------------------------------------ L00402024: cmp dword ptr [ebp+0Ch],00000000h jnz L0040203C push 00000001h mov ecx,[ebp+08h] dec ecx push ecx call SUB_L00402010 add esp,00000008h pop ebx pop ebp retn ;------------------------------------------------------------------------------ L0040203C: mov edx,[ebp+0Ch] dec edx push edx push [ebp+08h] call SUB_L00402010 add esp,00000008h push eax mov ebx,[ebp+08h] dec ebx push ebx call SUB_L00402010 add esp,00000008h pop ebx pop ebp retn
Jun 09 2006
High level functions like ENTER LEVAVE are relic of old effort of trying to make machine code / asm human programmable. In fact, the current processors are going the opposite way, divigin instruction into micro-inctructions. There is better alternate way by using push / mov / pop functions instead; and seems the companies are not trying to optimize this high level instructions anymore. You can see for yourself at: http://webster.cs.ucr.edu/AoA/DOS/ch12/CH12-3.html#HEADING3-73 http://www.goof.com/pcg/doc/pentium-optxref.html http://www.google.com/search?hl=en&lr=&q=enter+leave+push+pop+mov+instruction+cycles+pentium+intel+amd&btnG=Search -M In article <e6cauc$10g3$1 digitaldaemon.com>, pragma says...Attached are the ASM dump of the Ack() function as outlined elsewhere in this thread. I was actually suprised that the extern(C) D version of the Ack() function was byte-for-byte identical to the plain C/CPP version of the same code. The plain extern(D) version was the only run that created a different routine, and hence, the wildly different timings. To my eye, the codegen in D is not only shorter, but more efficent. The only thing I'm not too sure on is the use of enter/leave instead of push/pop/mov for maintaining up the function's stack frame. While I have no solid evidence, I'd be willing to bet that these two instructions are what's holding things up. My timings are on an Intel chip. Anyone out there with an AMD machine willing to try this on? - Eric // dissasembler listing for ackd.exe // int Ack(int a, int b) SUB_L00402010: enter 0004h,00h mov [ebp-04h],eax nop nop nop cmp dword ptr [ebp+08h],00000000h jnz L00402025 inc eax leave retn 0004h ;------------------------------------------------------------------------------ L00402025: cmp dword ptr [ebp-04h],00000000h jnz L0040203E mov eax,[ebp+08h] dec eax push eax mov eax,00000001h call SUB_L00402010 leave retn 0004h ;------------------------------------------------------------------------------ L0040203E: mov ecx,[ebp+08h] lea edx,[ecx-01h] push edx push ecx mov eax,[ebp-04h] dec eax call SUB_L00402010 call SUB_L00402010 leave retn 0004h // dissasembler listing for ackd.exe // extern(C) int Ack(int a, int b) SUB_L00402010: push ebp mov ebp,esp push ebx nop nop nop cmp dword ptr [ebp+08h],00000000h jnz L00402024 mov eax,[ebp+0Ch] inc eax pop ebx pop ebp retn ;------------------------------------------------------------------------------ L00402024: cmp dword ptr [ebp+0Ch],00000000h jnz L0040203C push 00000001h mov ecx,[ebp+08h] dec ecx push ecx call SUB_L00402010 add esp,00000008h pop ebx pop ebp retn ;------------------------------------------------------------------------------ L0040203C: mov edx,[ebp+0Ch] dec edx push edx push [ebp+08h] call SUB_L00402010 add esp,00000008h push eax mov ebx,[ebp+08h] dec ebx push ebx call SUB_L00402010 add esp,00000008h pop ebx pop ebp retn // dissasembler listing for ackcpp.exe // int Ack(int a, int b) SUB_L00402010: push ebp mov ebp,esp push ebx nop nop nop cmp dword ptr [ebp+08h],00000000h jnz L00402024 mov eax,[ebp+0Ch] inc eax pop ebx pop ebp retn ;------------------------------------------------------------------------------ L00402024: cmp dword ptr [ebp+0Ch],00000000h jnz L0040203C push 00000001h mov ecx,[ebp+08h] dec ecx push ecx call SUB_L00402010 add esp,00000008h pop ebx pop ebp retn ;------------------------------------------------------------------------------ L0040203C: mov edx,[ebp+0Ch] dec edx push edx push [ebp+08h] call SUB_L00402010 add esp,00000008h push eax mov ebx,[ebp+08h] dec ebx push ebx call SUB_L00402010 add esp,00000008h pop ebx pop ebp retn
Jun 09 2006
pragma wrote:To my eye, the codegen in D is not only shorter, but more efficent.It is more efficient compared to DMC only. My point was that DMD codegen is substantially less efficient _in this particular case_ than widespread C++ compilers like MSVC++ or GCC while being on a par with them (or superior) in the other cases. It's even slower than MS Here is the assembler output (ack and main) for my original program (output was removed from the original program to make the listing clearer) in GCC 3.4.2 (in MinGW) : ---------------------------------------------------------- __Z3Ackii: pushl %ebp movl %esp, %ebp pushl %ebx subl $8, %esp movl 8(%ebp), %edx movl 12(%ebp), %eax .p2align 4,,15 L12: testl %edx, %edx je L7 L14: testl %eax, %eax jne L4 decl %edx movl $1, %eax testl %edx, %edx jne L14 L7: addl $8, %esp incl %eax popl %ebx popl %ebp ret .p2align 4,,7 L4: movl %edx, (%esp) leal -1(%edx), %ebx decl %eax movl %eax, 4(%esp) call __Z3Akkii movl %ebx, %edx jmp L12 .def ___main; .scl 2; .type 32; .endef .section .rdata,"dr" LC0: .ascii " ms elapsed.\12\0" .text .align 2 .p2align 4,,15 .globl _main .def _main; .scl 2; .type 32; .endef _main: pushl %ebp movl $16, %eax movl %esp, %ebp pushl %ebx subl $20, %esp andl $-16, %esp call __alloca call ___main call _GetTickCount 0 movl $3, (%esp) movl $10, %ecx movl %eax, %ebx movl %ecx, 4(%esp) call __Z3Ackii movl %eax, 4(%esp) movl $2, (%esp) call __Z3Ackii call _GetTickCount 0 movl $__ZSt4cout, (%esp) subl %ebx, %eax movl %eax, 4(%esp) call __ZNSolsEm movl %eax, (%esp) movl $LC0, %edx movl %edx, 4(%esp) call __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc movl -4(%ebp), %ebx xorl %eax, %eax leave ret ---------------------------------------------------------- It seems like GCC was clever enough to somehow split the calculation in two calls. -- AKhropov
Jun 09 2006
$ gdmd ack.d -O -release -inline -ofack-gdc $ ./ack-gdc 703 ms elapsed. $ dmd ack.d -O -release -inline -ofack-dmd gcc ack.o -o ack-dmd -m32 -lphobos -lpthread -lm $ ./ack-dmd 2373 ms elapsed.
Jun 09 2006
Gregor Richards wrote:$ gdmd ack.d -O -release -inline -ofack-gdc $ ./ack-gdc 703 ms elapsed. $ dmd ack.d -O -release -inline -ofack-dmd gcc ack.o -o ack-dmd -m32 -lphobos -lpthread -lm $ ./ack-dmd 2373 ms elapsed.I get the following. If dmd is reverted back to v0.139, it does much better, for whatever that is worth. GDC v0.17 w/ 4.01 backend: real 0m0.828s user 0m0.824s sys 0m0.000s DMD v0.160: real 0m2.628s user 0m2.572s sys 0m0.000s DMD v0.139: real 0m1.131s user 0m1.124s sys 0m0.004s
Jun 09 2006