digitalmars.D - Performance updates
- bearophile (128/128) Nov 13 2008 In this post I show a few things I have found/collected in the last weeks ...
- Bill Baxter (3/130) Nov 13 2008 *subliminal chant: L D C L D C L D C .....*
- Moritz Warning (5/14) Nov 13 2008 I have noticed the same problem with foreach.
- bearophile (5/8) Nov 13 2008 I haven't succeeded in compiling code with ldc yet. But I have tried the l...
- Bill Baxter (7/13) Nov 13 2008 That's a bummer to hear that. But the good point of LLVM is that it's
- Bill Baxter (7/20) Nov 13 2008 One thing to be clear about when talking about the speed of foreach is
- Moritz Warning (2/25) Nov 13 2008 Nobody would use "for"-loops with opApply in real world code. :P
- Bill Baxter (4/29) Nov 13 2008 Er, right, which is why using a for() loop is faster than foreach() in
In this post I show a few things I have found/collected in the last weeks related to the performance of code compiled with DMD. I have added two of them to the list of slow things (as tiny benchmarks) I keep:
http://www.fantascienza.net/leonardo/js/slow_d.zip

------------------------

This is D code that just performs many integer divisions (by 7, known at compile time):

int divideBySeven(int x) {
    return x / 7;
}

void main() {
    int i = int.max;
    int r;
    while (i--)
        r = divideBySeven(i);
    printf("%d\n", r);
}

The same code in C:

#include "limits.h"
#include "stdio.h"

int divideBySeven(int x) {
    return x / 7;
}

int main() {
    int i = INT_MAX;
    int r;
    while (i--)
        r = divideBySeven(i);
    printf("%d\n", r);
    return 0;
}

I have compiled the C code with MinGW (based on GCC 4.2.1) with -O3 -s, and the D code with DMD 1.035 with -O -release -inline; the CPU is a Core2 at 2 GHz. Timing results:

intdiv:
    C: 5.04 s
    D: 39.32 s

The asm generated by DMD for the function divideBySeven():

    mov ECX,7
    cdq
    idiv ECX

The asm generated by GCC for the function divideBySeven():

    pushl %ebp
    movl %esp, %ebp
    movl 8(%ebp), %ecx
    pushl %ebx
    movl $-1840700269, %ebx
    movl %ebx, %eax
    popl %ebx
    imull %ecx
    popl %ebp
    leal (%edx,%ecx), %eax
    sarl $2, %eax
    sarl $31, %ecx
    subl %ecx, %eax

--------------------------

This is a scanning benchmark, in C:

#include "stdio.h"

int main() {
    int counter0 = 0, counter1 = 0, counter2 = 0, counter3 = 0;
    int i = 300000000;

    while (i--) { // 0.63 s
        if (i % 4 == 0) {
            counter0++;
        } else if (i % 4 == 1) {
            counter1++;
        } else if (i % 4 == 2) {
            counter2++;
        } else {
            counter3++;
        }
    }

    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
    return 0;
}

The equivalent D code:

import std.stdio: printf;

int main() {
    int counter0, counter1, counter2, counter3;
    int i = 300_000_000;

    while (i--) { // 1.23 s
        if (i % 4 == 0) {
            counter0++;
        } else if (i % 4 == 1) {
            counter1++;
        } else if (i % 4 == 2) {
            counter2++;
        } else {
            counter3++;
        }
    }

    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
    return 0;
}

Timings:
Scan (i = 300_000_000):
    C: 0.63 s
    D: 1.23 s

I can offer the asm code on request.

------------------------

The D1 docs strongly suggest using foreach whenever possible, avoiding the less handy for(), so for almost a year I have assumed foreach is as fast as for(), but these two versions show differences:

20.66 seconds:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=1

My classless version, 25.91 seconds:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=2

That second version uses the following code in the most critical spot of the program:

void advance(TyPlanets bodies, double dt) {
    foreach (idx, ref b; bodies) {
        double bm = b.mass;
        foreach (ref b2; bodies[idx + 1 .. length]) {

Replacing the foreach loops with for loops like the following gives a significant performance increase; the code runs in about 19.5 seconds (estimated timing on the Shootout computer; I can give you exact timings on my PC if you want):

static void advance(TyPlanets bodies, double dt) {
    for (int idx; idx < bodies.length; idx++) {
        auto b = &(bodies[idx]);
        double bm = b.mass;
        for (int j = idx + 1; j < bodies.length; j++) {
            auto b2 = &(bodies[j]);

(Note that I haven't just replaced foreach with for: I have also removed the slicing bodies[idx+1..$], which slows the code down a little too, but I have seen that even with only the slice removed the code is still slower.)

------------------------

Looking at the assembly produced by the Intel compiler from C code that contains many floating point operations, I can see how short and slick it is. I have also read a few articles that show various ways to manage the floating point stack as efficiently as possible. Some of those methods are slow or quite slow, but produce more efficient code. Using those refined methods for all the FP operations of a large program may lead to overly long compilation times.
So profile-guided optimization may be used to find the hot parts of the code, optimize the FP operations in them especially well, and compile all the other FP parts with a faster compilation method. GCC also has the "hot" function attribute, which lets programmers manually mark the most critical functions:
http://gcc.gnu.org/onlinedocs/gcc-4.3.0//gcc/Function-Attributes.html

In D, in theory, something like this could be used:

optimize { ... }

(Like static if { ... }, it doesn't create a new scope.) It tells the compiler that a piece of code using lots of FP operations is speed critical, so the compiler can compile it with a rather slower floating point stack allocation algorithm, like the ones I have read about.

Bye,
bearophile
Nov 13 2008
On Thu, Nov 13, 2008 at 8:19 PM, bearophile <bearophileHUGS lycos.com> wrote:
> In this post I show few things I have found/collected in the last weeks
> related to the performance of the code compiled with DMD.
[..]

*subliminal chant: L D C L D C L D C .....*

--bb
Nov 13 2008
On Thu, 13 Nov 2008 06:19:27 -0500, bearophile wrote:
> In this post I show few things I have found/collected in the last weeks
> related to the performance of the code compiled with DMD.
[..]
> The D1 docs strongly suggest to use foreach every time it's possible,
> avoiding to use the less handy for(), so for almost a year I have assumed
> foreach is as fast as the for() but this two versions shows differences:

I have noticed the same problem with foreach.
A for loop is often faster. :/

I haven't tried ldc yet..
Nov 13 2008
Moritz Warning:
> I have noticed the same problem with foreach.
> A for loop is often faster. :/

You generally notice a difference only in the innermost, most critical loops, so I think it's okay to use foreach() in 80-90% of the code of a program.

> I haven't tried ldc yet..

I haven't succeeded in compiling code with ldc yet. But I have tried the latest LLVM with the C code of the nbody benchmark of the Shootout, and GCC (MinGW) 4.2.1 produces code about twice as fast on my Core2. That's even worse than the D code compiled with DMD.

Bye,
bearophile
Nov 13 2008
On Fri, Nov 14, 2008 at 7:59 AM, bearophile <bearophileHUGS lycos.com> wrote:
> I haven't succeeded in compiling code with ldc yet. But I have tried the
> latest LLVM with the C code of the nbody benchmark of the Shootout, and
> GCC (MinGW) 4.2.1 produces code about twice as fast on my Core2. That's
> even worse than the D code compiled with DMD.

That's a bummer to hear. But the good point of LLVM is that it's on an
upward trend. With DMD only Walter has the ability to work on backend
codegen improvements, and we all know how much time he has available for
such things. Eventually I believe LLVM will leave DMD in the dust, and
probably GCC too.

--bb
Nov 13 2008
On Fri, Nov 14, 2008 at 7:49 AM, Moritz Warning <moritzwarning web.de> wrote:
> On Thu, 13 Nov 2008 06:19:27 -0500, bearophile wrote:
>> The D1 docs strongly suggest to use foreach every time it's possible,
>> avoiding to use the less handy for() [..]
>
> I have noticed the same problem with foreach.
> A for loop is often faster. :/

One thing to be clear about when talking about the speed of foreach is
whether you're using it on a built-in array or on a user type with an
opApply. The former is expected to be as fast as for(); the latter is
definitely not, because it has to call a delegate each time around the loop.

--bb
Nov 13 2008
On Fri, 14 Nov 2008 08:48:53 +0900, Bill Baxter wrote:
> One thing to be clear about when talking about the speed of foreach is
> whether you're using it on a built-in array or on a user type with an
> opApply. The former is expected to be as fast as for(), the latter is
> definitely not because it has to call a delegate each time around the
> loop.

Nobody would use "for"-loops with opApply in real world code. :P
Nov 13 2008
On Fri, Nov 14, 2008 at 9:06 AM, Moritz Warning <moritzwarning web.de> wrote:
> On Fri, 14 Nov 2008 08:48:53 +0900, Bill Baxter wrote:
>> One thing to be clear about when talking about the speed of foreach is
>> whether you're using it on a built-in array or on a user type with an
>> opApply. [..]
>
> Nobody would use "for"-loops with opApply in real world code. :P

Er, right, which is why using a for() loop is faster than foreach() in
that case. Sorry if I wasn't clear about that.

--bb
Nov 13 2008