digitalmars.D - How to tune numerical D? (matrix multiplication is faster in g++ vs gdc)
- J (186/186) Mar 03 2013 Dear D pros,
- John Colvin (21/209) Mar 03 2013 First things first:
- bearophile (4/9) Mar 03 2013 Nope, what matters is the total program runtime.
- John Colvin (4/13) Mar 04 2013 The performance of the multiplication loops and the performance
- bearophile (10/14) Mar 04 2013 If you want to improve the D compiler, druntime, etc, then I
- John Colvin (17/29) Mar 04 2013 I disagree. Information about which parts of the code are running
- Walter Bright (4/22) Mar 04 2013 I agree with John. Correctly interpreting benchmark results will enable ...
- bearophile (4/4) Mar 03 2013 Your benchmark code updated to D2:
- bearophile (7/9) Mar 03 2013 Sorry, this line:
- bearophile (4/4) Mar 03 2013 So this should be better:
- bearophile (5/5) Mar 03 2013 Generally for such matrix benchmarks if you chose the compilation
- J (203/207) Mar 03 2013 @bearophile: Thank you! Unfortunately the
- Manu (12/39) Mar 03 2013 Using dynamic arrays of dynamic arrays that way is pretty poor form
- J (3/19) Mar 04 2013 That's a really good point. I wonder if there is a canonical
- J (90/92) Mar 04 2013 I'm not sure if they are the recommended/best practice for matrix
- Lars T. Kyllingstad (26/44) Mar 05 2013 I'm the author of SciD. It's great that you found it useful! :)
- bearophile (11/28) Mar 04 2013 A bit better version:
- John Colvin (3/10) Mar 04 2013 As far as I know it's just a way of automating the allocation, it
- John Colvin (2/6) Mar 04 2013 http://dpaste.dzfl.pl/ is back online btw
- jerro (18/22) Mar 04 2013 You can make it much faster even without really changing the
- jerro (13/37) Mar 04 2013 forgot to set m3's elements to zero before adding to them:
- J (38/41) Mar 04 2013 Thanks Jerro. You made me realize that help from the experts
- Timon Gehr (10/30) Mar 05 2013 This is a little faster for 2000x2000 matrices and typical cache size:
- Walter Bright (2/3) Mar 04 2013 Using obj2asm will get you much nicer looking assembler.
- Rob T (14/14) Mar 03 2013 I suggest that you move this line
- Andrei Alexandrescu (6/13) Mar 03 2013 You're measuring the speed of a couple of tight loops. The smallest
- J (22/26) Mar 03 2013 Thanks Andrei!
- Walter Bright (6/35) Mar 04 2013 One difference that jumps out at me is you have extra variables and ref ...
Dear D pros,

As a fan of D, I was hoping to be able to get similar results as this fellow on Stack Overflow, by noting his tuning steps: http://stackoverflow.com/questions/5142366/how-fast-is-d-compared-to-c

Sadly however, when I pull out a simple matrix multiplication benchmark from the old language shootout (back when it had D), it is disturbingly slower in D when pitted against C++. Details? I ran with a very recent gdc (gcc 4.7.2, gdc at commit b8f5c22b0e7afa7e68a287ed788597e783540063), and the exact same gcc C++ compiler. How would I tune this to be more competitive? I'm comparing gdc vs g++ both built using the exact same gcc-4.7.2 back end, so it has to be something in the front end. I've disabled GC after the matrices are made in D, so that doesn't explain it. What is going on? I'm hoping I'm making a silly, naive, obvious beginner mistake, but could that be? I'm not sure how to apply the 'in' argument advice given on Stack Overflow; if that is the answer, could someone summarise the best practice for 'in' use? Thank you!

- J

$ g++ --version    # shows: g++ (GCC) 4.7.2
$ uname -a         # ... UTC 2010 x86_64 GNU/Linux

$ g++ -O3 matrix.cpp -ocppmatrix
$ time ./cppmatrix
-1015380632 859379360 -367726792 -1548829944
real 1m31.941s   user 1m31.920s   sys 0m0.010s
$ time ./cppmatrix
-1015380632 859379360 -367726792 -1548829944
real 1m32.068s   user 1m32.010s   sys 0m0.050s

$ gdmd -O -inline -release -noboundscheck -m64 matrix.d -ofdmatrix
$ time ./dmatrix
-1015380632 859379360 -367726792 -1548829944
real 2m10.677s   user 2m10.650s   sys 0m0.020s
$ time ./dmatrix
-1015380632 859379360 -367726792 -1548829944
real 2m12.664s   user 2m12.600s   sys 0m0.030s

Exact same backend compiler. Passing -O3 through to the backend, it goes slower--

$ gdmd -O -q,-O3 -inline -release -noboundscheck -m64 matrix.d -ofdmatrix
$ time ./dmatrix
-1015380632 859379360 -367726792 -1548829944
real 2m17.107s   user 2m17.080s   sys 0m0.020s

Java, for comparison (source not shown; it's the same source as all of these, the historical http://shootout.alioth.debian.org/ code from when D was in the shootout):

$ time java matrix
-1015380632 859379360 -367726792 -1548829944
real 2m23.739s   user 2m23.650s   sys 0m0.130s

Slightly bigger matrix? SIZE = 2500 results: 25% slower in D.

$ time ./cpp.O3.matrix
-1506465222 -119774408 -1600478274 1285663906
real 3m1.340s   user 3m1.290s   sys 0m0.040s
$ time ./dmatrix
-1506465222 -119774408 -1600478274 1285663906
real 4m2.109s   user 4m2.050s   sys 0m0.050s

//////// D version

import core.memory;
import std.stdio, std.string, std.array, std.conv;

const int SIZE = 2000;

int main(string[] args)
{
    int i, n = args.length > 1 ? to!int(args[1]) : 1;
    int[][] m1 = mkmatrix(SIZE, SIZE);
    int[][] m2 = mkmatrix(SIZE, SIZE);
    int[][] mm = mkmatrix(SIZE, SIZE);
    GC.disable;
    for (i = 0; i < n; i++) {
        mmult(m1, m2, mm);
    }
    writefln("%d %d %d %d", mm[0][0], mm[2][3], mm[3][2], mm[4][4]);
    return 0;
}

int[][] mkmatrix(int rows, int cols)
{
    int[][] m;
    int count = 1;
    m.length = rows;
    foreach (ref int[] mi; m) {
        mi.length = cols;
        foreach (ref int mij; mi) {
            mij = count++;
        }
    }
    return (m);
}

void mmult(int[][] m1, int[][] m2, int[][] m3)
{
    foreach (int i, int[] m1i; m1) {
        foreach (int j, ref int m3ij; m3[i]) {
            int val;
            foreach (int k, int[] m2k; m2) {
                val += m1i[k] * m2k[j];
            }
            m3ij = val;
        }
    }
}

////// C++ version

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIZE 2000

int **mkmatrix(int rows, int cols)
{
    int i, j, count = 1;
    int **m = (int **) malloc(rows * sizeof(int *));
    for (i = 0; i < rows; i++) {
        m[i] = (int *) malloc(cols * sizeof(int));
        for (j = 0; j < cols; j++) {
            m[i][j] = count++;
        }
    }
    return (m);
}

void zeromatrix(int rows, int cols, int **m)
{
    int i, j;
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            m[i][j] = 0;
}

void freematrix(int rows, int **m)
{
    while (--rows > -1) {
        free(m[rows]);
    }
    free(m);
}

int **mmult(int rows, int cols, int **m1, int **m2, int **m3)
{
    int i, j, k, val;
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            val = 0;
            for (k = 0; k < cols; k++) {
                val += m1[i][k] * m2[k][j];
            }
            m3[i][j] = val;
        }
    }
    return (m3);
}

int main(int argc, char *argv[])
{
    int i, n = ((argc == 2) ? atoi(argv[1]) : 1);
    int **m1 = mkmatrix(SIZE, SIZE);
    int **m2 = mkmatrix(SIZE, SIZE);
    int **mm = mkmatrix(SIZE, SIZE);
    for (i = 0; i < n; i++) {
        mm = mmult(SIZE, SIZE, m1, m2, mm);
    }
    printf("%d %d %d %d\n", mm[0][0], mm[2][3], mm[3][2], mm[4][4]);
    freematrix(SIZE, m1);
    freematrix(SIZE, m2);
    freematrix(SIZE, mm);
    return (0);
}
Mar 03 2013
On Monday, 4 March 2013 at 03:48:45 UTC, J wrote:
> [original post and benchmark source snipped]

First things first: You're not just timing the multiplication, you're timing the memory allocation as well. I suggest using http://dlang.org/phobos/std_datetime.html#StopWatch to do some proper timings in D.

Also, there is a semi-documented multi-dimensional array allocation syntax that is very neat. See here a simplified version of mkmatrix using it:

int[][] mkmatrix(size_t rows, size_t cols)
{
    int[][] m = new int[][](rows, cols);
    int count = 1;
    foreach (ref mi; m)
        foreach (ref mij; mi)
            mij = count++;
    return (m);
}

However, I have found myself that D is slower than C for these sorts of intense numerical things. The assembly code should show why quite easily.
Mar 03 2013
John Colvin:First things first: You're not just timing the multiplication, you're timing the memory allocation as well. I suggest using http://dlang.org/phobos/std_datetime.html#StopWatch to do some proper timings in DNope, what matters is the total program runtime. Bye, bearophile
Mar 03 2013
On Monday, 4 March 2013 at 04:15:41 UTC, bearophile wrote:John Colvin:The performance of the multiplication loops and the performance of the allocation are separate issues and should be measured as such, especially if one wants to make meaningful optimisations.First things first: You're not just timing the multiplication, you're timing the memory allocation as well. I suggest using http://dlang.org/phobos/std_datetime.html#StopWatch to do some proper timings in DNope, what matters is the total program runtime. Bye, bearophile
Mar 04 2013
John Colvin:
> The performance of the multiplication loops and the performance of the allocation are separate issues and should be measured as such, especially if one wants to make meaningful optimisations.

If you want to improve the D compiler, druntime, etc., then I agree you have to separate the variables and test them one at a time. But if you are comparing languages+runtimes+libraries, then it's better not to cheat, and to test the whole running (warmed) time.

> http://dpaste.dzfl.pl/ is back online btw

dpaste is too JavaScript-heavy for very quick pasting. codepad is much lighter if I don't want to run D code.

Bye,
bearophile
Mar 04 2013
On Monday, 4 March 2013 at 15:44:40 UTC, bearophile wrote:
> If you want to improve the D compiler, druntime, etc., then I agree you have to separate the variables and test them one at a time. But if you are comparing languages+runtimes+libraries, then it's better not to cheat, and to test the whole running (warmed) time.

I disagree. Information about which parts of the code are running fast and which are running slow is critical to optimisation. If you don't know whether it's the D memory allocation that's slow or the D multiplication loops, you're trying to optimise blindfolded. Even if all you're doing is a comparison, it's a lot more useful to know *where* the slowdown is happening so that you can make a meaningful analysis of the results.

Enter a strange example: I found that malloced multi-dim arrays were considerably faster to iterate over and assign to than D GC slices, even with the GC disabled after allocation and bounds checks turned off. If I hadn't bothered to do separate timings of the allocation and iteration, I would never have noticed this effect and instead written it off as purely "malloc is faster at allocating than the GC".
Mar 04 2013
On 3/4/2013 9:00 AM, John Colvin wrote:On Monday, 4 March 2013 at 15:44:40 UTC, bearophile wrote:I agree with John. Correctly interpreting benchmark results will enable you to tune your application to run much faster, even without any changes to the compiler or runtime library.John Colvin:I disagree. Information about which parts of the code are running fast and which are running slow is critical to optimisation. If you don't know whether it's the D memory allocation that's slow or the D multiplication loops, you're trying to optimise blindfolded.The performance of the multiplication loops and the performance of the allocation are separate issues and should be measured as such, especially if one wants to make meaningful optimisations.If you want to improve the D compiler, druntime, etc, then I agree you have to separate the variables and test them one at a time. But if you are comparing languages+runtimes+libraries then it's better to not cheat, and test the whole running (warmed) time. Bye, bearophile
Mar 04 2013
Your benchmark code updated to D2: http://codepad.org/WMgu6XQG Bye, bearophile
Mar 03 2013
On Monday, 4 March 2013 at 04:12:18 UTC, bearophile wrote:Your benchmark code updated to D2: http://codepad.org/WMgu6XQGSorry, this line: enum size_t SIZE = 200; Should be: enum size_t SIZE = 2_000; Bye, bearophile
Mar 03 2013
So this should be better: http://codepad.org/B5b4uyBM Bye, bearophile
Mar 03 2013
Generally for such matrix benchmarks, if you choose the compilation flags really well (including link-time optimization!), I've seen that with LDC you get "good enough" timings. Bye, bearophile
Mar 03 2013
On Monday, 4 March 2013 at 04:22:01 UTC, bearophile wrote:So this should be better: http://codepad.org/B5b4uyBM Bye, bearophilebearophile: Thank you! Unfortunately the http://codepad.org/B5b4uyBM code runs a bit *slower* than the original D code. Yikes! $ gdmd -O -inline -release -noboundscheck -m64 bear.d -ofdbear $ time ./dbear -1015380632 859379360 -367726792 -1548829944 real 2m36.971s user 2m36.910s sys 0m0.030s $ time ./dbear -1015380632 859379360 -367726792 -1548829944 real 2m34.425s user 2m34.370s sys 0m0.020s John Colvin: here is the disassembly of mmult() in both languages. Unfortunately I'm not literate in x86_64 assembly. Perhaps the problem is obvious to you? All I can really tell is that the g++ version is shorter. The memory allocation, when timed separately (comment out mmult), is less than 60 msec for either version, so I don't think its a memory issue, although it could be caching issue since the matrix layouts are different. (gdb) disas /m _D6matrix5mmultFAAiAAiAAiZv Dump of assembler code for function _D6matrix5mmultFAAiAAiAAiZv: 56 void mmult(int[][] m1, int[][] m2, int[][] m3) 0x00000000004352a0 <+0>: push %r15 0x00000000004352a5 <+5>: push %r14 0x00000000004352a7 <+7>: push %r13 0x00000000004352a9 <+9>: push %r12 0x00000000004352ab <+11>: mov %rdx,%r12 0x00000000004352ae <+14>: push %rbp 0x00000000004352af <+15>: mov %rsi,%rbp 0x00000000004352b2 <+18>: push %rbx 0x00000000004352b3 <+19>: mov %r9,-0x40(%rsp) 0x00000000004352b8 <+24>: mov %rdi,-0x10(%rsp) 0x00000000004352bd <+29>: mov %rsi,-0x8(%rsp) 0x00000000004352c2 <+34>: mov %rdx,-0x20(%rsp) 0x00000000004352c7 <+39>: mov %rcx,-0x18(%rsp) 0x00000000004352cc <+44>: mov %r8,-0x30(%rsp) 0x00000000004352d1 <+49>: mov %r9,-0x28(%rsp) 0x00000000004352dc <+60>: add $0x1,%rdi 0x00000000004352e0 <+64>: lea 0x1(%rdx),%rdx 0x00000000004352e4 <+68>: mov $0x1,%r15d 0x00000000004352ea <+74>: mov %rdi,-0x38(%rsp) 0x0000000000435315 <+117>: add $0x1,%r13 0x0000000000435319 <+121>: mov $0x1,%r11d 
0x0000000000435330 <+144>: mov $0x1,%esi 57 { 58 foreach(int i, int[] m1i; m1) 0x00000000004352a2 <+2>: test %rdi,%rdi 0x00000000004352d6 <+54>: je 0x4353aa <_D6matrix5mmultFAAiAAiAAiZv+266> 0x00000000004352ef <+79>: xor %esi,%esi 0x00000000004352f1 <+81>: mov %rsi,%rax 0x00000000004352f4 <+84>: shl $0x4,%rax 0x00000000004352f8 <+88>: mov 0x8(%rbp,%rax,1),%r10 0x0000000000435398 <+248>: cmp -0x38(%rsp),%rax 0x000000000043539d <+253>: mov %r15,%rsi 0x00000000004353a0 <+256>: je 0x4353aa <_D6matrix5mmultFAAiAAiAAiZv+266> 0x00000000004353a2 <+258>: mov %rax,%r15 0x00000000004353a5 <+261>: jmpq 0x4352f1 <_D6matrix5mmultFAAiAAiAAiZv+81> 59 { 60 foreach(int j, ref int m3ij; m3[i]) 0x00000000004352fd <+93>: add -0x40(%rsp),%rax 0x0000000000435302 <+98>: mov (%rax),%r13 0x0000000000435305 <+101>: mov 0x8(%rax),%r14 0x0000000000435309 <+105>: test %r13,%r13 0x000000000043530c <+108>: je 0x435394 <_D6matrix5mmultFAAiAAiAAiZv+244> 0x0000000000435312 <+114>: xor %r9d,%r9d 0x000000000043531f <+127>: shl $0x2,%r9 0x0000000000435326 <+134>: lea (%r14,%r9,1),%rbx 0x000000000043536c <+204>: mov %r11,%r9 0x000000000043536f <+207>: cmp %r13,%rax 0x0000000000435372 <+210>: je 0x435394 <_D6matrix5mmultFAAiAAiAAiZv+244> 0x0000000000435374 <+212>: shl $0x2,%r9 0x000000000043537b <+219>: mov %rax,%r11 0x000000000043537e <+222>: lea (%r14,%r9,1),%rbx 0x000000000043538a <+234>: mov %r11,%r9 0x000000000043538f <+239>: cmp %r13,%rax 0x0000000000435392 <+242>: jne 0x435374 <_D6matrix5mmultFAAiAAiAAiZv+212> 0x0000000000435394 <+244>: lea 0x1(%r15),%rax 61 { 62 int val; 0x0000000000435337 <+151>: xor %edi,%edi 0x0000000000435339 <+153>: jmp 0x435343 <_D6matrix5mmultFAAiAAiAAiZv+163> 0x000000000043533b <+155>: nopl 0x0(%rax,%rax,1) 0x0000000000435388 <+232>: xor %edi,%edi 63 foreach(int k, int[] m2k; m2) 0x0000000000435323 <+131>: test %r12,%r12 0x000000000043532a <+138>: je 0x435384 <_D6matrix5mmultFAAiAAiAAiZv+228> 0x000000000043532c <+140>: nopl 0x0(%rax) 0x0000000000435335 <+149>: xor 
%eax,%eax 0x0000000000435340 <+160>: mov %r8,%rsi 0x0000000000435343 <+163>: mov %rax,%r8 0x000000000043534a <+170>: shl $0x4,%r8 0x000000000043535e <+190>: cmp %rdx,%r8 0x0000000000435361 <+193>: mov %rsi,%rax 0x0000000000435364 <+196>: jne 0x435340 <_D6matrix5mmultFAAiAAiAAiZv+160> 0x0000000000435366 <+198>: lea 0x1(%r11),%rax 0x0000000000435378 <+216>: test %r12,%r12 0x0000000000435382 <+226>: jne 0x435330 <_D6matrix5mmultFAAiAAiAAiZv+144> 0x0000000000435384 <+228>: lea 0x1(%r11),%rax 64 { 65 val += m1i[k] * m2k[j]; 0x0000000000435346 <+166>: mov (%r10,%rax,4),%eax 0x000000000043534e <+174>: mov 0x8(%rcx,%r8,1),%r8 0x0000000000435353 <+179>: imul (%r8,%r9,1),%eax 0x0000000000435358 <+184>: lea 0x1(%rsi),%r8 0x000000000043535c <+188>: add %eax,%edi 66 } 67 m3ij = val; 0x000000000043536a <+202>: mov %edi,(%rbx) 0x000000000043538d <+237>: mov %edi,(%rbx) 68 } 69 } 70 } 0x00000000004353aa <+266>: pop %rbx 0x00000000004353ab <+267>: pop %rbp 0x00000000004353ac <+268>: pop %r12 0x00000000004353ae <+270>: pop %r13 0x00000000004353b0 <+272>: pop %r14 0x00000000004353b2 <+274>: pop %r15 0x00000000004353b4 <+276>: retq 0x00000000004353b5: data32 nopw %cs:0x0(%rax,%rax,1) End of assembler dump. 
(gdb) (gdb) disas /m mmult Dump of assembler code for function mmult(int, int, int**, int**, int**): 36 int **mmult(int rows, int cols, int **m1, int **m2, int **m3) { 0x0000000000400a10 <+0>: push %r14 0x0000000000400a14 <+4>: push %r13 0x0000000000400a16 <+6>: mov %r8,%r13 0x0000000000400a19 <+9>: push %r12 0x0000000000400a1b <+11>: push %rbp 0x0000000000400a1c <+12>: push %rbx 0x0000000000400a1d <+13>: mov %edi,%ebx 0x0000000000400a21 <+17>: lea -0x1(%rsi),%eax 0x0000000000400a24 <+20>: mov %rdx,%r12 0x0000000000400a27 <+23>: xor %ebp,%ebp 0x0000000000400a29 <+25>: lea 0x4(,%rax,4),%rdi 0x0000000000400a40 <+48>: xor %r9d,%r9d 0x0000000000400a43 <+51>: xor %r11d,%r11d 0x0000000000400a46 <+54>: nopw %cs:0x0(%rax,%rax,1) 37 int i, j, k, val; 38 for (i=0; i<rows; i++) { 0x0000000000400a12 <+2>: test %edi,%edi 0x0000000000400a1f <+15>: jle 0x400a7e <mmult(int, int, int**, int**, int**)+110> 0x0000000000400a7a <+106>: cmp %ebp,%ebx 0x0000000000400a7c <+108>: jg 0x400a31 <mmult(int, int, int**, int**, int**)+33> 39 for (j=0; j<cols; j++) { 0x0000000000400a31 <+33>: test %esi,%esi 0x0000000000400a33 <+35>: jle 0x400a76 <mmult(int, int, int**, int**, int**)+102> 0x0000000000400a35 <+37>: mov (%r12,%rbp,8),%r8 0x0000000000400a39 <+41>: mov 0x0(%r13,%rbp,8),%rdx 0x0000000000400a3e <+46>: xor %eax,%eax 0x0000000000400a71 <+97>: cmp %rdi,%rax 0x0000000000400a74 <+100>: jne 0x400a40 <mmult(int, int, int**, int**, int**)+48> 0x0000000000400a76 <+102>: add $0x1,%rbp 40 val = 0; 41 for (k=0; k<cols; k++) { 0x0000000000400a64 <+84>: cmp %r9d,%esi 0x0000000000400a67 <+87>: jg 0x400a50 <mmult(int, int, int**, int**, int**)+64> 42 val += m1[i][k] * m2[k][j]; 0x0000000000400a50 <+64>: mov (%rcx,%r9,8),%r14 0x0000000000400a54 <+68>: mov (%r8,%r9,4),%r10d 0x0000000000400a58 <+72>: add $0x1,%r9 0x0000000000400a5c <+76>: imul (%r14,%rax,1),%r10d 0x0000000000400a61 <+81>: add %r10d,%r11d 43 } 44 m3[i][j] = val; 0x0000000000400a69 <+89>: mov %r11d,(%rdx,%rax,1) 0x0000000000400a6d <+93>: 
add $0x4,%rax 45 } 46 } 47 return(m3); 48 } 0x0000000000400a7e <+110>: pop %rbx 0x0000000000400a7f <+111>: pop %rbp 0x0000000000400a80 <+112>: pop %r12 0x0000000000400a82 <+114>: mov %r13,%rax 0x0000000000400a85 <+117>: pop %r13 0x0000000000400a87 <+119>: pop %r14 0x0000000000400a89 <+121>: retq End of assembler dump. (gdb)
Mar 03 2013
On 4 March 2013 14:50, J <private private-dont-email-dont-spam.com> wrote:
> [benchmark timings and disassembly snipped]

Using dynamic arrays of dynamic arrays that way is pretty poor form regardless of the language. You should really use a single-dimensional array:

int matrix[SIZE*SIZE];

And index via: matrix[y*SIZE+x]
(You can pretty this up in various ways, if this is too unsightly.)

On 4 March 2013 14:02, John Colvin <john.loughran.colvin gmail.com> wrote:
> int[][] m = new int[][](rows, cols);

Does D support proper square arrays this way? Or does it just automate allocation of the inner arrays? Does it allocate all the associated memory in one block?
Mar 03 2013
On Monday, 4 March 2013 at 05:07:10 UTC, Manu wrote:
> Using dynamic arrays of dynamic arrays that way is pretty poor form regardless of the language. You should really use a single-dimensional array: int matrix[SIZE*SIZE]; And index via: matrix[y*SIZE+x] (You can pretty this up in various ways, if this is too unsightly.)
>
> > int[][] m = new int[][](rows, cols);
>
> Does D support proper square arrays this way? Or does it just automate allocation of the inner arrays? Does it allocate all the associated memory in one block?

That's a really good point. I wonder if there is a canonical matrix that would be preferred?
Mar 04 2013
On Monday, 4 March 2013 at 08:02:46 UTC, J wrote:
> That's a really good point. I wonder if there is a canonical matrix that would be preferred?

I'm not sure if they are the recommended/best practice for matrix handling in D at the moment (please advise if they are not), but with a little searching, I found that the SciD library has nice matrices and MatrixViews (column-major storage, LAPACK compatible).

Now I like MatrixViews because they let me beat the original (clearly non-optimal) C matrix multiplication by a couple seconds, and the D code with operator overloading in place makes matrix multiplication look elegant. One shootout benchmark down, eleven to go. :-)

- J

p.s. Am I right in concluding there are no multimethods (multiple dispatch) in D... it seemed a little awkward to have to wrap the MatrixView in a new struct solely in order to overload multiplication. Is there a better way that I've missed?

# beating the C/malloc/shootout based code that ran in 1m32sec:
$ time ./scid_matrix_test
-1015380632 859379360 -367726792 -1548829944
real 1m27.967s   user 1m27.930s   sys 0m0.030s
$ time ./scid_matrix_test
-1015380632 859379360 -367726792 -1548829944
real 1m28.259s   user 1m28.230s   sys 0m0.020s

module scid_matrix_test;

// compile: gdmd -O -release -inline -noboundscheck scid_matrix_test.d
//   -L-L/home/you/pkg/scid/scid/ -L-lscid -L-L/usr/lib/x86_64-linux-gnu/
//   -L-lgfortran -L-lblas -L-llapack -L-L.

import scid.matrix;
import std.stdio, std.string, std.array, std.conv;

const int SIZE = 2000;

// Doesn't seem to have multiple dispatch / multimethods, so
// I guess we have to wrap MatrixView to implement opBinary!"*"(lhs,rhs)
struct Multipliable(T)
{
    MatrixView!(T) _m;
    alias _m this;

    this(MatrixView!(T) src) { _m = src; }

    Multipliable!(T) opBinary(string op : "*")(ref Multipliable rhs)
        if (op == "*")
    {
        auto r = Multipliable!int(matrix!int(_m.rows, rhs.cols));
        return mmult(this, rhs, r);
    }
}

Multipliable!(int) mkmatrix(int rows, int cols)
{
    auto m = Multipliable!int();
    m._m = matrix!int(rows, cols);
    int count = 1;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            m[i, j] = count++;
        }
    }
    return (m);
}

Multipliable!(int) mmult(ref Multipliable!(int) m1, ref Multipliable!(int) m2,
                         ref Multipliable!(int) m3)
{
    int i, j, k, val;
    ulong rows = m1.rows;
    ulong cols = m1.cols;
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            val = 0;
            for (k = 0; k < cols; k++) {
                val += m1[i, k] * m2[k, j];
            }
            m3[i, j] = val;
        }
    }
    return (m3);
}

void main()
{
    auto m1 = mkmatrix(SIZE, SIZE);
    auto m2 = mkmatrix(SIZE, SIZE);
    auto mm = mkmatrix(SIZE, SIZE);
    mm = m1 * m2;
    writefln("%d %d %d %d", mm[0,0], mm[2,3], mm[3,2], mm[4,4]);
}
Mar 04 2013
On Monday, 4 March 2013 at 12:28:25 UTC, J wrote:
> I'm not sure if they are the recommended/best practice for matrix handling in D at the moment (please advise if they are not), but with a little searching, I found that the SciD library has nice matrices and MatrixViews (column-major storage, LAPACK compatible). [...]

I'm the author of SciD. It's great that you found it useful! :)

When I wrote scid.matrix and scid.linalg, it was because I needed some linear algebra functionality for a project at work. I knew I wouldn't have the time to write a full-featured linear algebra library, so I intentionally gave it a minimalistic design, and only included the stuff I needed. That's also why I used the name "MatrixView" rather than "Matrix"; it was only supposed to be a convenient way to view an array as a matrix, and not a full matrix type.

Later, Cristi Cobzarenco forked my library for Google Summer of Code 2011, with David Simcha as his mentor. They removed almost everything but the linear algebra modules, and redesigned those from scratch. So basically, there is pretty much nothing left of my code in their library. ;) I don't think they ever completed the project, but I believe parts of it are usable. You'll find it here: https://github.com/cristicbz/scid

AFAIK, their goal was to use expression templates in order to transform D expressions like

    x*A*B + y*C

where A, B and C are matrices, and x and y are scalars, into as few optimised BLAS calls as possible -- e.g., a single GEMM call in the example above. I don't know how far they got on this, though.

Lars
Mar 05 2013
J wrote:bearophile: Thank you! Unfortunately the http://codepad.org/B5b4uyBM code runs a bit *slower* than the original D code. Yikes! $ gdmd -O -inline -release -noboundscheck -m64 bear.d -ofdbear $ time ./dbear -1015380632 859379360 -367726792 -1548829944 real 2m36.971s user 2m36.910s sys 0m0.030s $ time ./dbear -1015380632 859379360 -367726792 -1548829944 real 2m34.425s user 2m34.370s sys 0m0.020sA bit better version: http://codepad.org/jhbYxEgU I think this code is good compared to the original (there are better algorithms). So maybe the introduced inefficiencies are small compiler problems worth finding and reporting. ------------------ Manu:Does D support proper square array's this way? Or does it just automate allocation of the inner arrays? Does it allocate all the associated memory in one block?Maybe you should take a look at druntime code. Bye, bearophile
Mar 04 2013
On Monday, 4 March 2013 at 14:59:21 UTC, bearophile wrote:Manu:As far as I know it's just a way of automating the allocation, it expands out to the necessary loops.Does D support proper square array's this way? Or does it just automate allocation of the inner arrays? Does it allocate all the associated memory in one block?Maybe you should take a look at druntime code. Bye, bearophile
Mar 04 2013
On Monday, 4 March 2013 at 14:59:21 UTC, bearophile wrote:
> A bit better version:
> http://codepad.org/jhbYxEgU
>
> Bye,
> bearophile

http://dpaste.dzfl.pl/ is back online btw
Mar 04 2013
> A bit better version:
> http://codepad.org/jhbYxEgU
>
> I think this code is good compared to the original (there are
> better algorithms).

You can make it much faster even without really changing the
algorithm, just by reversing the order of the inner two loops like
this:

void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3)
pure nothrow
{
    foreach (immutable i; 0 .. m1.length)
        foreach (immutable k; 0 .. m2[0].length)
            foreach (immutable j; 0 .. m3[0].length)
                m3[i][j] += m1[i][k] * m2[k][j];
}

This makes the code much more cache friendly (because now you aren't
iterating any matrix by column in the inner loop) and also allows
the compiler to do auto vectorization. matrixMult2() takes 2.6
seconds on my machine and matrixMult() takes 72 seconds (both
compiled with gdmd -O -inline -release -noboundscheck -mavx).

This isn't really relevant to the comparison with C++ in this
thread, I just thought it may be useful for anyone writing matrix
code.
Mar 04 2013
On Monday, 4 March 2013 at 15:46:50 UTC, jerro wrote:
> You can make it much faster even without really changing the
> algorithm, just by reversing the order of the inner two loops.
> [...]

I forgot to set m3's elements to zero before adding to them:

void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3)
pure nothrow
{
    foreach (immutable i; 0 .. m1.length)
    {
        m3[i][] = 0;
        foreach (immutable k; 0 .. m2[0].length)
            foreach (immutable j; 0 .. m3[0].length)
                m3[i][j] += m1[i][k] * m2[k][j];
    }
}

This does not make the function noticeably slower.
Mar 04 2013
On Monday, 4 March 2013 at 15:57:42 UTC, jerro wrote:
> matrixMult2() takes 2.6 seconds on my machine and matrixMult()
> takes 72 seconds (both compiled with gdmd -O -inline -release
> -noboundscheck -mavx).

Thanks Jerro. You made me realize that help from the experts could
be quite useful. I plugged in a call to the BLAS matrix multiply
routine, which SciD conveniently binds. The result? My 2000x2000
matrix multiply went from 98 seconds down to 1.8 seconds. It's just
hilariously faster to use 20 years of numerical experts' optimized
code than to try to write your own.

// Screaming fast version - uses BLAS for 50x speedup over naive code.
Multipliable!(T) mmult2(T)(ref Multipliable!(T) m1,
                           ref Multipliable!(T) m2,
                           ref Multipliable!(T) m3)
{
    m3.array[] = 0;
    assert(m1.cols == m2.rows);
    char ntran = 'N';
    double one = 1.0;
    double zero = 0.0;
    int nrow = cast(int)m1.rows;
    int ncol = cast(int)m1.cols;
    int mcol = cast(int)m2.cols;
    scid.bindings.blas.blas.dgemm_(&ntran,       // transa
                                   &ntran,       // transb
                                   &nrow,        // m
                                   &mcol,        // n
                                   &ncol,        // k
                                   &one,         // alpha
                                   m1.array.ptr, // A
                                   &nrow,        // lda
                                   m2.array.ptr, // B
                                   &ncol,        // ldb
                                   &zero,        // beta
                                   m3.array.ptr, // C
                                   &nrow,        // ldc
                                   nrow,         // transa_len
                                   ncol);        // transb_len
    return m3;
}
Mar 04 2013
On 03/04/2013 04:46 PM, jerro wrote:
> You can make it much faster even without really changing the
> algorithm, just by reversing the order of the inner two loops.
> [...]

This is a little faster for 2000x2000 matrices and typical cache
size:

void matrixMult(in int[][] m1, in int[][] m2, int[][] m3)
{
    enum block = 50; // optimal parameter depends on hardware
    foreach (ib; iota(0, m1.length, block))
        foreach (kb; iota(0, m2[0].length, block))
            foreach (i; iota(ib, min(m1.length, ib + block)))
                foreach (k; iota(kb, min(m2[0].length, kb + block)))
                    m3[i][] += m1[i][k] * m2[k][];
}
Mar 05 2013
On 3/3/2013 8:50 PM, J wrote:
> Dump of assembler code for function _D6matrix5mmultFAAiAAiAAiZv:

Using obj2asm will get you much nicer looking assembler.
Mar 04 2013
I suggest that you move this line

GC.disable;

to the first line. I don't see how you are doing your timings, so
that part is a wild card.

Also note that when the GC is re-enabled it can add a significant
amount of time to the tests. You are not explicitly re-enabling the
GC, but I don't know if the GC kicks back in as part of program
termination; it probably shouldn't, but we lack precise
documentation and cannot be certain.

I would take timings immediately after the GC is disabled and again
just prior to program termination to see what effects the GC may or
may not have on the results.

--rt
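A sketch of that measurement setup, using the current `std.datetime.stopwatch` API (in 2013-era Phobos the `StopWatch` type lived directly in `std.datetime`):

```d
import core.memory : GC;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio;

void main()
{
    GC.disable();             // first line, before any allocation
    scope (exit) GC.enable(); // restore the collector on the way out

    auto sw = StopWatch(AutoStart.yes);
    // ... benchmark body goes here; stand-in work below ...
    long sum = 0;
    foreach (i; 0 .. 1_000_000)
        sum += i;
    sw.stop();

    writeln(sum, " computed in ", sw.peek.total!"msecs", " ms");
}
```

With the GC disabled for the whole run, the timed region measures only the benchmark itself, not collection pauses triggered by allocations elsewhere.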
Mar 03 2013
On 3/3/13 10:48 PM, J wrote:
> Dear D pros,
>
> As a fan of D, I was hoping to be able to get similar results as
> this fellow on stack overflow, by noting his tuning steps:
>
> http://stackoverflow.com/questions/5142366/how-fast-is-d-compared-to-c
>
> Sadly however, when I pull out a simple matrix multiplication
> benchmark from the old language shootout (back when it had D), it
> is disturbingly slower in D when pitted against C++.

You're measuring the speed of a couple of tight loops. The smallest
differences in codegen between them will be on the radar. Use
straight for loops or "foreach (i; 0 .. limit)" for those loops; the
foreach forms you currently use may introduce slight differences in
codegen.

Andrei
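The suggested `foreach (i; 0 .. limit)` form can be sketched like this (assuming rectangular matrices, as in the benchmark):

```d
void mmult(in int[][] m1, in int[][] m2, int[][] m3) pure nothrow
{
    foreach (i; 0 .. m1.length)          // rows of m1
        foreach (j; 0 .. m2[0].length)   // columns of m2
        {
            int val = 0;
            foreach (k; 0 .. m2.length)  // shared dimension
                val += m1[i][k] * m2[k][j];
            m3[i][j] = val;
        }
}

void main()
{
    import std.stdio;

    auto m1 = [[1, 2], [3, 4]];
    auto m2 = [[5, 6], [7, 8]];
    auto m3 = [[0, 0], [0, 0]];
    mmult(m1, m2, m3);
    assert(m3 == [[19, 22], [43, 50]]);
    writeln("ok");
}
```

This generates the same plain counted loops as the C version, with no per-iteration slice or element variables for the optimizer to reason about.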
Mar 03 2013
On Monday, 4 March 2013 at 04:49:20 UTC, Andrei Alexandrescu wrote:
> You're measuring the speed of a couple of tight loops. The
> smallest differences in codegen between them will be on the radar.
> Use straight for loops or "foreach (i; 0 .. limit)" for those
> loops...

Thanks Andrei! I validated your analysis by doing a straight port of
the C code to D, even using the same memory layout (matching malloc
calls). The port was trivial, which was very reassuring. As
evidence, I had to tweak only 8 lines of the C to make it compile
under gdc. The mmult() function in the D version remained identical
to that in the C version.

Even more reassuring was that the performance of the resulting D
matched the C to within 1% tolerance (about 200-500 msec slower on
the D side; presumably due to runtime init).

$ time ./gdc_compiled_ported_cpp_version
-1015380632 859379360 -367726792 -1548829944
real 1m32.418s
user 1m32.370s
sys 0m0.020s
$

It's a great feeling, knowing the bare metal is available if need
be. Thanks guys!

-J
Mar 03 2013
On 3/3/2013 7:48 PM, J wrote:
> void mmult(int[][] m1, int[][] m2, int[][] m3)
> {
>     foreach (int i, int[] m1i; m1)
>     {
>         foreach (int j, ref int m3ij; m3[i])
>         {
>             int val;
>             foreach (int k, int[] m2k; m2)
>             {
>                 val += m1i[k] * m2k[j];
>             }
>             m3ij = val;
>         }
>     }
> }
> [...]
> ////// C++ version
> int **mmult(int rows, int cols, int **m1, int **m2, int **m3)
> {
>     int i, j, k, val;
>     for (i = 0; i < rows; i++)
>     {
>         for (j = 0; j < cols; j++)
>         {
>             val = 0;
>             for (k = 0; k < cols; k++)
>             {
>                 val += m1[i][k] * m2[k][j];
>             }
>             m3[i][j] = val;
>         }
>     }
>     return m3;
> }

One difference that jumps out at me is that you have extra variables
and ref types in the D version, and in the C++ version you have
"cached" the row & column loop limits. (I.e. the C++ version assumes
a rectangular matrix, while the D one allows a (presumably)
different length for each row.)
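For comparison, here is a sketch of a D version in the same shape as the C++ one, with the loop limits passed in and cached in locals; this mirrors the C++ code above rather than any canonical D style:

```d
// Rectangular-matrix multiply with cached limits, C-style.
void mmult(int rows, int cols, int[][] m1, int[][] m2, int[][] m3)
    pure nothrow
{
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
        {
            int val = 0;
            for (int k = 0; k < cols; k++)
                val += m1[i][k] * m2[k][j];
            m3[i][j] = val;
        }
}

void main()
{
    import std.stdio;

    auto m1 = [[1, 2], [3, 4]];
    auto m2 = [[5, 6], [7, 8]];
    auto m3 = [[0, 0], [0, 0]];
    mmult(2, 2, m1, m2, m3);
    assert(m3 == [[19, 22], [43, 50]]);
    writeln("ok");
}
```

Because the limits are plain ints rather than per-row `.length` lookups, the compiler sees the same loop structure it would in the C++ version.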
Mar 04 2013