digitalmars.D - Compilation of a numerical kernel
- bearophile (73/73) Jun 27 2010
Recently I have seen some work from Don about floating point optimization in DMD:

http://d.puremagic.com/issues/show_bug.cgi?id=4380
http://d.puremagic.com/issues/show_bug.cgi?id=4383

so maybe he is interested in this too.

This test program contains the inner loop of a small program. It is one of the hottest spots and determines the performance of the whole program, so even though it is just three lines of code it needs to be optimized well by the compiler (the D code can also be modified to unroll the loop a few times).

// D code
double foo(double[] arr1, double[] arr2) {
    double diff = 0.0;
    for (int i; i < arr1.length; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}
void main() {}

D code compiled by DMD, optimized build:

L38:
    fld  qword ptr [EDX*8][ECX]
    fsub qword ptr [EDX*8][EBX]
    inc  EDX
    cmp  EDX,058h[ESP]
    fstp qword ptr 014h[ESP]
    fld  qword ptr 014h[ESP]
    fmul ST,ST(0)
    fadd qword ptr 4[ESP]
    fstp qword ptr 4[ESP]
    jb   L38

D code compiled by LDC, optimized build:

.LBB13_5:
    movsd (%edi,%ecx,8), %xmm1
    subsd (%eax,%ecx,8), %xmm1
    incl  %ecx
    cmpl  %esi, %ecx
    mulsd %xmm1, %xmm1
    addsd %xmm1, %xmm0
    jne   .LBB13_5

The asm produced by DMD is not efficient (note the redundant fstp/fld round trip through the stack); it is not just a matter of SSE register usage.
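The manual unrolling mentioned above might look like the following sketch (mine, not from the post; the name foo_unrolled2 is hypothetical, and it assumes len is even). Written in C in the same style as the translation below; the two independent accumulators break the serial dependency on the single running sum, which helps both the x87 and the SSE code:

```c
/* Sketch: the kernel unrolled by two, with two independent partial
   sums.  Assumes len is a multiple of 2 (no remainder handling). */
double foo_unrolled2(const double *arr1, const double *arr2, int len) {
    double diff0 = 0.0, diff1 = 0.0;  /* independent accumulators */
    int i;
    for (i = 0; i < len; i += 2) {
        double a = arr1[i]     - arr2[i];
        double b = arr1[i + 1] - arr2[i + 1];
        diff0 += a * a;
        diff1 += b * b;
    }
    return diff0 + diff1;             /* combine partial sums once */
}
```

Note that this changes the order of the floating point additions, so a compiler is only allowed to do it automatically under relaxed FP rules (e.g. GCC's -ffast-math).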
I have translated it to C to see how GCC compiles it without SSE:

// C code
double foo(double* arr1, double* arr2, int len) {
    double diff = 0.0;
    int i;
    for (i = 0; i < len; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}

C code compiled with GCC 4.5 (32 bit):

L3:
    fldl  (%ecx,%eax,8)
    fsubl (%ebx,%eax,8)
    incl  %eax
    fmul  %st(0), %st
    cmpl  %edx, %eax
    faddp %st, %st(1)
    jne   L3

And this is an example of how a compiler can compile it on 64 bit: unrolled once and working on two doubles in each SSE instruction, which is equivalent to a 4X unroll:

Modified C code compiled with GCC (64 bit):

L3:
    movapd (%rcx,%rax), %xmm1
    subpd  (%rdx,%rax), %xmm1
    movapd %xmm1, %xmm0
    mulpd  %xmm1, %xmm0
    addpd  %xmm0, %xmm2
    movapd 16(%rcx,%rax), %xmm0
    subpd  16(%rdx,%rax), %xmm0
    addq   $32, %rax
    mulpd  %xmm1, %xmm0
    cmpq   %r8, %rax
    addpd  %xmm0, %xmm3
    jne    L3

Cache prefetch instructions can't help much here, because the memory access pattern is plain sequential.

Bye,
bearophile