digitalmars.D - Low hanging fruit for optimizing loops
- Juan Manuel Cabo (118/118) Jun 07 2013 Given the recent surge in interest for performance, I dusted
- Walter Bright (4/7) Jun 07 2013 It's great that you're doing this. You can track it down further by usin...
- dennis luehring (4/11) Jun 07 2013 or even nicer :) :)
- Juan Manuel Cabo (56/64) Jun 07 2013 Thanks!!
- Walter Bright (2/4) Jun 08 2013 Thanks, this is great information.
- dennis luehring (2/6) Jun 08 2013 sadly not for x64 :(
- Walter Bright (2/10) Jun 08 2013 I don't understand your comment - the code example was for 64 bit code.
- dennis luehring (3/14) Jun 08 2013 ida freeware is x86 only - so my "even nicer disassembly" just don't
- Adam D. Ruppe (4/5) Jun 08 2013 You know what would rok? If it could output code that you could
- Andrei Alexandrescu (3/8) Jun 08 2013 Wow!
- Marco Leise (5/5) Jun 07 2013 GCC even cosiders cache line length and CPU cache size.
- Walter Bright (2/5) Jun 08 2013 http://d.puremagic.com/issues/show_bug.cgi?id=10301
Given the recent surge in interest for performance, I dusted off a small test that I made long ago and determined myself to find the cause of the performance difference. I tested the linear version of fibonacci both in DMD and in C with GCC. Without optimization switches, I'm happy to see that the D version is faster. But when using the switches, the C version takes 30% less time. I'm including the dissasembly here. The asm functions are very very close to each other. Both loops are only 6 instructions long. So I think it is a low hanging fruit because IMO the speed difference is either because the GCC loop jump is 64bit aligned, or because of the order of the instructions (there is an extra instruction between the loop increment and the loop comparison, giving an opportunity for parallelization). OS: Kubuntu Linux 12.04 64bit CPU: 2.1GHz - AMD Athlon(tm) 64 X2 Dual Core Processor 4000+ Switches: dmd -O -inline -noboundscheck -release dtest.d gcc -O3 -o ctest ctest.c Times: D: 1 sec, 430 ms, 207 μs, and 4 hnsecs C: 940 ms D version: --------- import std.stdio; import std.datetime; int fibw(int n) { //Linear Fibonacci int a = 1; int b = 1; for (int i=2; i <= n; ++i) { int sum = a + b; a = b; b = sum; } return b; } void main() { auto start = Clock.currTime(); int r = fibw(1000_000_000); auto elapsed = Clock.currTime() - start; writeln(r); writeln(elapsed); } C Version: --------- #include <stdio.h> #include <time.h> int fibw(int n) { //Linear Fibonacci int a = 1; int b = 1; int i; for (i=2; i <= n; ++i) { int sum = a + b; a = b; b = sum; } return b; } int main() { clock_t start = clock(); int r = fibw(1000*1000*1000); clock_t elapsed = clock() - start; printf("%d\n", r); printf("%d ms\n", (int)(elapsed * 1000 / CLOCKS_PER_SEC)); return 0; } D Version DISASM: ---------------- 00000000004681d0 <_D5dtest4fibwFiZi>: 4681d0: 55 push rbp 4681d1: 48 8b ec mov rbp,rsp 4681d4: 48 89 f9 mov rcx,rdi 4681d7: be 01 00 00 00 mov esi,0x1 4681dc: b8 01 00 00 00 mov eax,0x1 4681e1: ba 02 00 00 00 mov edx,0x2 4681e6: 83 f9 02 cmp ecx,0x2 4681e9: 7c 0f jl 4681fa <_D5dtest4fibwFiZi+0x2a> ; LOOP JUMP ---> 4681eb: 8d 3c 06 lea edi,[rsi+rax*1] 4681ee: 48 89 c6 mov rsi,rax 4681f1: 48 89 f8 mov rax,rdi 4681f4: ff c2 inc edx 4681f6: 39 ca cmp edx,ecx 4681f8: 7e f1 jle 4681eb <_D5dtest4fibwFiZi+0x1b> ; LOOP END 4681fa: 5d pop rbp 4681fb: c3 ret C Version DISASM: ---------------- 0000000000400860 <fibw>: 400860: 83 ff 01 cmp edi,0x1 400863: b8 01 00 00 00 mov eax,0x1 400868: 7e 1c jle 400886 <fibw+0x26> 40086a: ba 02 00 00 00 mov edx,0x2 40086f: b9 01 00 00 00 mov ecx,0x1 ; NOTICE THE nop (64bit alignment??): 400874: 0f 1f 40 00 nop DWORD PTR [rax+0x0] ; LOOP JUMP --> 400878: 8d 34 01 lea esi,[rcx+rax*1] 40087b: 83 c2 01 add edx,0x1 40087e: 89 c1 mov ecx,eax ; REORDERED cmp 400880: 39 d7 cmp edi,edx 400882: 89 f0 mov eax,esi 400884: 7d f2 jge 400878 <fibw+0x18> ; LOOP END 400886: f3 c3 repz ret 400888: 90 nop 400889: 90 nop both files were dissasembled with: objdump -M intel -d --jm
Jun 07 2013
On 6/7/2013 9:15 PM, Juan Manuel Cabo wrote:Given the recent surge in interest for performance, I dusted off a small test that I made long ago and determined myself to find the cause of the performance difference.It's great that you're doing this. You can track it down further by using inline assembler and trying different instruction sequences. Also, obj2asm gives nicer disassembly :-)
Jun 07 2013
Am 08.06.2013 07:11, schrieb Walter Bright:On 6/7/2013 9:15 PM, Juan Manuel Cabo wrote:or even nicer :) :) ida pro freeware http://out7.hex-rays.com/files/idafree50.exeGiven the recent surge in interest for performance, I dusted off a small test that I made long ago and determined myself to find the cause of the performance difference.It's great that you're doing this. You can track it down further by using inline assembler and trying different instruction sequences. Also, obj2asm gives nicer disassembly :-)
Jun 07 2013
On Saturday, 8 June 2013 at 05:11:11 UTC, Walter Bright wrote:On 6/7/2013 9:15 PM, Juan Manuel Cabo wrote:Thanks!! I now used inline assembler, and can confidently say that the difference is because of the alignment. Changing the order of the cmp relative to the increment didn't do anything. Adding the right amount of 'nop' makes it run in 957 ms, 921 μs, and 4 hnsecs But if I overshoot it, or miss one, it goes back to 1 sec, 438 ms, and 544 μs Also, I couldn't use this instruction in D's asm{} 0f 1f 40 00 nop DWORD PTR [rax+0x0] and obj2asm doesn't dissasemble it (it just puts "0f1f" and gives incorrent asm for the next few instructions). I'm now not entirely sure that aligning loop jumps would be worthwhile though. They would have to be "leaf" loops because any call made inside the loop would overshadow the benefits (I was looping millons of times in my test). Anyway, here is the new source: import std.stdio; import std.datetime; int fiba(int n) { asm { naked; push RBP; mov RBP,RSP; mov RCX,RDI; mov ESI,0x1; mov EAX,0x1; mov EDX,0x2; cmp ECX,0x2; jl EXIT_LOOP; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; nop; LOOP_START: lea EDI,[RSI+RAX*1]; mov RSI,RAX; mov RAX,RDI; inc EDX; cmp EDX,ECX; jle LOOP_START; EXIT_LOOP: pop RBP; ret; } } void main() { auto start = Clock.currTime(); int r = fiba(1000_000_000); auto elapsed = Clock.currTime() - start; writeln(r); writeln(elapsed); } --jmGiven the recent surge in interest for performance, I dusted off a small test that I made long ago and determined myself to find the cause of the performance difference.It's great that you're doing this. You can track it down further by using inline assembler and trying different instruction sequences. Also, obj2asm gives nicer disassembly :-)
Jun 07 2013
On 6/7/2013 11:11 PM, Juan Manuel Cabo wrote:I now used inline assembler, and can confidently say that the difference is because of the alignment.Thanks, this is great information.
Jun 08 2013
Am 08.06.2013 09:28, schrieb Walter Bright:On 6/7/2013 11:11 PM, Juan Manuel Cabo wrote:sadly not for x64 :(I now used inline assembler, and can confidently say that the difference is because of the alignment.Thanks, this is great information.
Jun 08 2013
On 6/8/2013 12:37 AM, dennis luehring wrote:Am 08.06.2013 09:28, schrieb Walter Bright:I don't understand your comment - the code example was for 64 bit code.On 6/7/2013 11:11 PM, Juan Manuel Cabo wrote:sadly not for x64 :(I now used inline assembler, and can confidently say that the difference is because of the alignment.Thanks, this is great information.
Jun 08 2013
Am 08.06.2013 19:26, schrieb Walter Bright:On 6/8/2013 12:37 AM, dennis luehring wrote:ida freeware is x86 only - so my "even nicer disassembly" just don't work in this caseAm 08.06.2013 09:28, schrieb Walter Bright:I don't understand your comment - the code example was for 64 bit code.On 6/7/2013 11:11 PM, Juan Manuel Cabo wrote:sadly not for x64 :(I now used inline assembler, and can confidently say that the difference is because of the alignment.Thanks, this is great information.
Jun 08 2013
On Saturday, 8 June 2013 at 05:11:11 UTC, Walter Bright wrote:Also, obj2asm gives nicer disassembly :-)You know what would rok? If it could output code that you could copy/paste directly into D's iasm without having to go back and add semicolons, change 0eh to 0x0e, etc.
Jun 08 2013
On 6/8/13 1:53 PM, Adam D. Ruppe wrote:On Saturday, 8 June 2013 at 05:11:11 UTC, Walter Bright wrote:Wow! AndreiAlso, obj2asm gives nicer disassembly :-)You know what would rok? If it could output code that you could copy/paste directly into D's iasm without having to go back and add semicolons, change 0eh to 0x0e, etc.
Jun 08 2013
GCC even cosiders cache line length and CPU cache size. This alignment improvement should consistently work on x86/amd64. -- Marco
Jun 07 2013
On 6/7/2013 9:15 PM, Juan Manuel Cabo wrote:So I think it is a low hanging fruit because IMO the speed difference is either because the GCC loop jump is 64bit aligned,http://d.puremagic.com/issues/show_bug.cgi?id=10301
Jun 08 2013