
digitalmars.D - How to tune numerical D? (matrix multiplication is faster in g++ vs gdc)

reply "J" <private private-dont-email-dont-spam.com> writes:
Dear D pros,

As a fan of D, I was hoping to be able to get similar results as 
this fellow on stack overflow, by noting his tuning steps;
http://stackoverflow.com/questions/5142366/how-fast-is-d-compared-to-c

Sadly however, when I pull out a simple matrix multiplication 
benchmark from the old language shootout (back when it had D), it 
is disturbingly slower in D when pitted against C++.

Details? I ran with very recent gdc (gcc 4.7.2, gdc built on the 
4.7.2 branch at commit 
b8f5c22b0e7afa7e68a287ed788597e783540063), and the exact same gcc 
c++ compiler.

How would I tune this to be more competitive?  I'm comparing gdc 
vs g++ both built using the exact same gcc-4.7.2 back end, so it 
has to be something in the front end.  I've disabled GC after the 
matrices are made in D, so that doesn't explain it.

What is going on?  I'm hoping I'm making a silly, naive, obvious 
beginner mistake, but could that be?  I'm not sure how to apply 
the 'in' argument advice given on stackoverflow; if that is the 
answer, could someone summarise the best practice for 'in' use?

Thank you!

- J

$ g++ --version #shows: g++ (GCC) 4.7.2
$ uname -a

02:41:37 UTC 2010 x86_64 GNU/Linux



$ g++  -O3 matrix.cpp -ocppmatrix
$ time ./cppmatrix
-1015380632 859379360 -367726792 -1548829944

real    1m31.941s
user    1m31.920s
sys 0m0.010s
$ time ./cppmatrix
-1015380632 859379360 -367726792 -1548829944

real    1m32.068s
user    1m32.010s
sys 0m0.050s




$ gdmd -O -inline -release -noboundscheck -m64 matrix.d -ofdmatrix
$ time ./dmatrix
-1015380632 859379360 -367726792 -1548829944

real    2m10.677s
user    2m10.650s
sys 0m0.020s
$
$ time ./dmatrix
-1015380632 859379360 -367726792 -1548829944

real    2m12.664s
user    2m12.600s
sys 0m0.030s




D comes out about 40% slower, despite the exact same backend compiler.


Passing -O3 through to the gcc backend actually goes slower--

$ gdmd -O -q,-O3 -inline -release -noboundscheck -m64 matrix.d 
-ofdmatrix
$ time ./dmatrix
-1015380632 859379360 -367726792 -1548829944

real    2m17.107s
user    2m17.080s
sys 0m0.020s
jaten afarm:~/tmp$



(Java, for comparison; source not shown; it's the same source as 
all of these: the historical 
http://shootout.alioth.debian.org/ code from when D was in the 
shootout.)

$ time java matrix
-1015380632 859379360 -367726792 -1548829944

real    2m23.739s
user    2m23.650s
sys 0m0.130s
$


Slightly bigger matrix?

SIZE = 2500 results: 25% slower in D

$ time ./cpp.O3.matrix
-1506465222 -119774408 -1600478274 1285663906

real    3m1.340s
user    3m1.290s
sys 0m0.040s

$ time ./dmatrix
-1506465222 -119774408 -1600478274 1285663906

real    4m2.109s
user    4m2.050s
sys 0m0.050s


//////// D version

import core.memory;

import std.stdio, std.string, std.array, std.conv;

const int SIZE = 2000;

int main(string[] args)
{
     int i, n = args.length > 1 ? to!int(args[1]) : 1;

     int[][] m1 = mkmatrix(SIZE,SIZE);
     int[][] m2 = mkmatrix(SIZE,SIZE);
     int[][] mm = mkmatrix(SIZE,SIZE);

     GC.disable;

     for (i=0; i<n; i++) {
         mmult(m1, m2, mm);
     }

     writefln("%d %d %d %d",mm[0][0],mm[2][3],mm[3][2],mm[4][4]);

     return 0;
}

int[][] mkmatrix(int rows, int cols)
{
     int[][] m;
     int count = 1;

     m.length = rows;
     foreach(ref int[] mi; m)
     {
         mi.length = cols;
         foreach(ref int mij; mi)
         {
             mij = count++;
         }
     }

     return(m);
}

void mmult(int[][] m1, int[][] m2, int[][] m3)
{
     foreach(int i, int[] m1i; m1)
     {
         foreach(int j, ref int m3ij; m3[i])
         {
             int val;
             foreach(int k, int[] m2k; m2)
             {
                 val += m1i[k] * m2k[j];
             }
             m3ij = val;
         }
     }
}

////// C++ version

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIZE 2000

int **mkmatrix(int rows, int cols) {
    int i, j, count = 1;
    int **m = (int **) malloc(rows * sizeof(int *));
    for (i=0; i<rows; i++) {
        m[i] = (int *) malloc(cols * sizeof(int));
        for (j=0; j<cols; j++) {
            m[i][j] = count++;
        }
    }
    return(m);
}

void zeromatrix(int rows, int cols, int **m) {
    int i, j;
    for (i=0; i<rows; i++)
        for (j=0; j<cols; j++)
            m[i][j] = 0;
}

void freematrix(int rows, int **m) {
    while (--rows > -1) { free(m[rows]); }
    free(m);
}

int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
    int i, j, k, val;
    for (i=0; i<rows; i++) {
        for (j=0; j<cols; j++) {
            val = 0;
            for (k=0; k<cols; k++) {
                val += m1[i][k] * m2[k][j];
            }
            m3[i][j] = val;
        }
    }
    return(m3);
}

int main(int argc, char *argv[]) {
    int i, n = ((argc == 2) ? atoi(argv[1]) : 1);

    int **m1 = mkmatrix(SIZE, SIZE);
    int **m2 = mkmatrix(SIZE, SIZE);
    int **mm = mkmatrix(SIZE, SIZE);

    for (i=0; i<n; i++) {
        mm = mmult(SIZE, SIZE, m1, m2, mm);
    }
    printf("%d %d %d %d\n", mm[0][0], mm[2][3], mm[3][2], mm[4][4]);

    freematrix(SIZE, m1);
    freematrix(SIZE, m2);
    freematrix(SIZE, mm);
    return(0);
}
Mar 03 2013
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 4 March 2013 at 03:48:45 UTC, J wrote:
 [...]
First things first: you're not just timing the multiplication, 
you're timing the memory allocation as well. I suggest using 
http://dlang.org/phobos/std_datetime.html#StopWatch to do some 
proper timings in D.

Also, there is a semi-documented multi-dimensional array 
allocation syntax that is very neat. Here is a simplified version 
of mkmatrix using it:

int[][] mkmatrix(size_t rows, size_t cols)
{
    int[][] m = new int[][](rows, cols);
    int count = 1;
    foreach(ref mi; m)
        foreach(ref mij; mi)
            mij = count++;
    return m;
}

However, I have found myself that D is slower than C for these 
sorts of intense numerical things. The assembly code should show 
why quite easily.
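For example, a minimal sketch of such a timing harness (reusing 
the mkmatrix and mmult definitions from the original post, and 
SIZE as defined there; StopWatch lives in std.datetime):

import core.memory;
import std.datetime : StopWatch;
import std.stdio;

void main()
{
    auto m1 = mkmatrix(SIZE, SIZE);
    auto m2 = mkmatrix(SIZE, SIZE);
    auto mm = mkmatrix(SIZE, SIZE);
    GC.disable;

    StopWatch sw;
    sw.start();              // time only the multiplication...
    mmult(m1, m2, mm);
    sw.stop();               // ...not the allocation above
    writefln("mmult alone: %s ms", sw.peek().msecs);
}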
Mar 03 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
John Colvin:

 First things first:
 You're not just timing the multiplication, you're timing the 
 memory allocation as well. I suggest using 
 http://dlang.org/phobos/std_datetime.html#StopWatch to do some 
 proper timings in D
Nope, what matters is the total program runtime.

Bye,
bearophile
Mar 03 2013
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 4 March 2013 at 04:15:41 UTC, bearophile wrote:
 John Colvin:

 First things first:
 You're not just timing the multiplication, you're timing the 
 memory allocation as well. I suggest using 
 http://dlang.org/phobos/std_datetime.html#StopWatch to do some 
 proper timings in D
 Nope, what matters is the total program runtime.

 Bye,
 bearophile
The performance of the multiplication loops and the performance of the allocation are separate issues and should be measured as such, especially if one wants to make meaningful optimisations.
Mar 04 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
John Colvin:

 The performance of the multiplication loops and the performance 
 of the allocation are separate issues and should be measured as 
 such, especially if one wants to make meaningful optimisations.
If you want to improve the D compiler, druntime, etc, then I agree you have to separate the variables and test them one at a time. But if you are comparing languages+runtimes+libraries then it's better to not cheat, and test the whole running (warmed) time.
 http://dpaste.dzfl.pl/ is back online btw
dpaste is too much javascript-heavy for a very quick pasting. 
codepad is much lighter if I don't want to run D code.

Bye,
bearophile
Mar 04 2013
parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 4 March 2013 at 15:44:40 UTC, bearophile wrote:
 John Colvin:

 The performance of the multiplication loops and the 
 performance of the allocation are separate issues and should 
 be measured as such, especially if one wants to make 
 meaningful optimisations.
If you want to improve the D compiler, druntime, etc, then I 
agree you have to separate the variables and test them one at a 
time. But if you are comparing languages+runtimes+libraries then 
it's better to not cheat, and test the whole running (warmed) 
time.

Bye,
bearophile
I disagree. Information about which parts of the code are running 
fast and which are running slow is critical to optimisation. If 
you don't know whether it's the D memory allocation that's slow 
or the D multiplication loops, you're trying to optimise 
blindfolded.

Even if all you're doing is a comparison, it's a lot more useful 
to know *where* the slowdown is happening so that you can make a 
meaningful analysis of the results.

Enter a strange example: I found that malloced multi-dim arrays 
were considerably faster to iterate over and assign to than D gc 
slices, even with the gc disabled after allocation and bounds 
checks turned off. If I hadn't bothered to do separate timings of 
the allocation and iteration, I would never have noticed this 
effect and instead written it off as purely "malloc is faster at 
allocating than the GC".
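For illustration, a minimal sketch of that kind of malloc-backed 
allocation in D (mkmatrixMalloc is a hypothetical name; this is 
not the exact code used for the measurements described above):

import core.stdc.stdlib : malloc;

int[][] mkmatrixMalloc(size_t rows, size_t cols)
{
    // One malloc for the row table and one per row; slicing the
    // raw memory lets the rest of the code index it like
    // ordinary D arrays.
    int[][] m = (cast(int[]*) malloc(rows * (int[]).sizeof))[0 .. rows];
    foreach (ref row; m)
        row = (cast(int*) malloc(cols * int.sizeof))[0 .. cols];
    return m;
}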
Mar 04 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/4/2013 9:00 AM, John Colvin wrote:
 On Monday, 4 March 2013 at 15:44:40 UTC, bearophile wrote:
 John Colvin:

 The performance of the multiplication loops and the performance of the
 allocation are separate issues and should be measured as such, especially if
 one wants to make meaningful optimisations.
If you want to improve the D compiler, druntime, etc, then I 
agree you have to separate the variables and test them one at a 
time. But if you are comparing languages+runtimes+libraries then 
it's better to not cheat, and test the whole running (warmed) 
time.

Bye,
bearophile
I disagree. Information about which parts of the code are running fast and which are running slow is critical to optimisation. If you don't know whether it's the D memory allocation that's slow or the D multiplication loops, you're trying to optimise blindfolded.
I agree with John. Correctly interpreting benchmark results will enable you to tune your application to run much faster, even without any changes to the compiler or runtime library.
Mar 04 2013
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Your benchmark code updated to D2:

http://codepad.org/WMgu6XQG

Bye,
bearophile
Mar 03 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
On Monday, 4 March 2013 at 04:12:18 UTC, bearophile wrote:
 Your benchmark code updated to D2:

 http://codepad.org/WMgu6XQG
Sorry, this line:

    enum size_t SIZE = 200;

should be:

    enum size_t SIZE = 2_000;

Bye,
bearophile
Mar 03 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
So this should be better:

http://codepad.org/B5b4uyBM

Bye,
bearophile
Mar 03 2013
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Generally for such matrix benchmarks, if you choose the compilation 
flags really well (including link-time optimization!) I've seen 
that with LDC you get "good enough" timings.

Bye,
bearophile
Mar 03 2013
prev sibling parent reply "J" <private private-dont-email-dont-spam.com> writes:
On Monday, 4 March 2013 at 04:22:01 UTC, bearophile wrote:
 So this should be better:

 http://codepad.org/B5b4uyBM

 Bye,
 bearophile
bearophile: Thank you! Unfortunately the http://codepad.org/B5b4uyBM 
code runs a bit *slower* than the original D code. Yikes!

$ gdmd -O -inline -release -noboundscheck -m64 bear.d -ofdbear
$ time ./dbear
-1015380632 859379360 -367726792 -1548829944

real    2m36.971s
user    2m36.910s
sys 0m0.030s
$ time ./dbear
-1015380632 859379360 -367726792 -1548829944

real    2m34.425s
user    2m34.370s
sys 0m0.020s

John Colvin: here is the disassembly of mmult() in both languages. 
Unfortunately I'm not literate in x86_64 assembly. Perhaps the 
problem is obvious to you? All I can really tell is that the g++ 
version is shorter.

The memory allocation, when timed separately (comment out mmult), 
is less than 60 msec for either version, so I don't think it's a 
memory issue, although it could be a caching issue since the 
matrix layouts are different.

(gdb) disas /m _D6matrix5mmultFAAiAAiAAiZv
Dump of assembler code for function _D6matrix5mmultFAAiAAiAAiZv:
56      void mmult(int[][] m1, int[][] m2, int[][] m3)
   0x00000000004352a0 <+0>:     push   %r15
   0x00000000004352a5 <+5>:     push   %r14
   0x00000000004352a7 <+7>:     push   %r13
   0x00000000004352a9 <+9>:     push   %r12
   0x00000000004352ab <+11>:    mov    %rdx,%r12
   0x00000000004352ae <+14>:    push   %rbp
   0x00000000004352af <+15>:    mov    %rsi,%rbp
   0x00000000004352b2 <+18>:    push   %rbx
   0x00000000004352b3 <+19>:    mov    %r9,-0x40(%rsp)
   0x00000000004352b8 <+24>:    mov    %rdi,-0x10(%rsp)
   0x00000000004352bd <+29>:    mov    %rsi,-0x8(%rsp)
   0x00000000004352c2 <+34>:    mov    %rdx,-0x20(%rsp)
   0x00000000004352c7 <+39>:    mov    %rcx,-0x18(%rsp)
   0x00000000004352cc <+44>:    mov    %r8,-0x30(%rsp)
   0x00000000004352d1 <+49>:    mov    %r9,-0x28(%rsp)
   0x00000000004352dc <+60>:    add    $0x1,%rdi
   0x00000000004352e0 <+64>:    lea    0x1(%rdx),%rdx
   0x00000000004352e4 <+68>:    mov    $0x1,%r15d
   0x00000000004352ea <+74>:    mov    %rdi,-0x38(%rsp)
   0x0000000000435315 <+117>:   add    $0x1,%r13
   0x0000000000435319 <+121>:   mov    $0x1,%r11d
   0x0000000000435330 <+144>:   mov    $0x1,%esi

57      {
58           foreach(int i, int[] m1i; m1)
   0x00000000004352a2 <+2>:     test   %rdi,%rdi
   0x00000000004352d6 <+54>:    je     0x4353aa <_D6matrix5mmultFAAiAAiAAiZv+266>
   0x00000000004352ef <+79>:    xor    %esi,%esi
   0x00000000004352f1 <+81>:    mov    %rsi,%rax
   0x00000000004352f4 <+84>:    shl    $0x4,%rax
   0x00000000004352f8 <+88>:    mov    0x8(%rbp,%rax,1),%r10
   0x0000000000435398 <+248>:   cmp    -0x38(%rsp),%rax
   0x000000000043539d <+253>:   mov    %r15,%rsi
   0x00000000004353a0 <+256>:   je     0x4353aa <_D6matrix5mmultFAAiAAiAAiZv+266>
   0x00000000004353a2 <+258>:   mov    %rax,%r15
   0x00000000004353a5 <+261>:   jmpq   0x4352f1 <_D6matrix5mmultFAAiAAiAAiZv+81>

59      {
60           foreach(int j, ref int m3ij; m3[i])
   0x00000000004352fd <+93>:    add    -0x40(%rsp),%rax
   0x0000000000435302 <+98>:    mov    (%rax),%r13
   0x0000000000435305 <+101>:   mov    0x8(%rax),%r14
   0x0000000000435309 <+105>:   test   %r13,%r13
   0x000000000043530c <+108>:   je     0x435394 <_D6matrix5mmultFAAiAAiAAiZv+244>
   0x0000000000435312 <+114>:   xor    %r9d,%r9d
   0x000000000043531f <+127>:   shl    $0x2,%r9
   0x0000000000435326 <+134>:   lea    (%r14,%r9,1),%rbx
   0x000000000043536c <+204>:   mov    %r11,%r9
   0x000000000043536f <+207>:   cmp    %r13,%rax
   0x0000000000435372 <+210>:   je     0x435394 <_D6matrix5mmultFAAiAAiAAiZv+244>
   0x0000000000435374 <+212>:   shl    $0x2,%r9
   0x000000000043537b <+219>:   mov    %rax,%r11
   0x000000000043537e <+222>:   lea    (%r14,%r9,1),%rbx
   0x000000000043538a <+234>:   mov    %r11,%r9
   0x000000000043538f <+239>:   cmp    %r13,%rax
   0x0000000000435392 <+242>:   jne    0x435374 <_D6matrix5mmultFAAiAAiAAiZv+212>
   0x0000000000435394 <+244>:   lea    0x1(%r15),%rax

61           {
62               int val;
   0x0000000000435337 <+151>:   xor    %edi,%edi
   0x0000000000435339 <+153>:   jmp    0x435343 <_D6matrix5mmultFAAiAAiAAiZv+163>
   0x000000000043533b <+155>:   nopl   0x0(%rax,%rax,1)
   0x0000000000435388 <+232>:   xor    %edi,%edi

63               foreach(int k, int[] m2k; m2)
   0x0000000000435323 <+131>:   test   %r12,%r12
   0x000000000043532a <+138>:   je     0x435384 <_D6matrix5mmultFAAiAAiAAiZv+228>
   0x000000000043532c <+140>:   nopl   0x0(%rax)
   0x0000000000435335 <+149>:   xor    %eax,%eax
   0x0000000000435340 <+160>:   mov    %r8,%rsi
   0x0000000000435343 <+163>:   mov    %rax,%r8
   0x000000000043534a <+170>:   shl    $0x4,%r8
   0x000000000043535e <+190>:   cmp    %rdx,%r8
   0x0000000000435361 <+193>:   mov    %rsi,%rax
   0x0000000000435364 <+196>:   jne    0x435340 <_D6matrix5mmultFAAiAAiAAiZv+160>
   0x0000000000435366 <+198>:   lea    0x1(%r11),%rax
   0x0000000000435378 <+216>:   test   %r12,%r12
   0x0000000000435382 <+226>:   jne    0x435330 <_D6matrix5mmultFAAiAAiAAiZv+144>
   0x0000000000435384 <+228>:   lea    0x1(%r11),%rax

64               {
65                   val += m1i[k] * m2k[j];
   0x0000000000435346 <+166>:   mov    (%r10,%rax,4),%eax
   0x000000000043534e <+174>:   mov    0x8(%rcx,%r8,1),%r8
   0x0000000000435353 <+179>:   imul   (%r8,%r9,1),%eax
   0x0000000000435358 <+184>:   lea    0x1(%rsi),%r8
   0x000000000043535c <+188>:   add    %eax,%edi

66               }
67               m3ij = val;
   0x000000000043536a <+202>:   mov    %edi,(%rbx)
   0x000000000043538d <+237>:   mov    %edi,(%rbx)

68           }
69      }
70      }
   0x00000000004353aa <+266>:   pop    %rbx
   0x00000000004353ab <+267>:   pop    %rbp
   0x00000000004353ac <+268>:   pop    %r12
   0x00000000004353ae <+270>:   pop    %r13
   0x00000000004353b0 <+272>:   pop    %r14
   0x00000000004353b2 <+274>:   pop    %r15
   0x00000000004353b4 <+276>:   retq
   0x00000000004353b5:  data32 nopw %cs:0x0(%rax,%rax,1)
End of assembler dump.
(gdb)

(gdb) disas /m mmult
Dump of assembler code for function mmult(int, int, int**, int**, int**):
36      int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
   0x0000000000400a10 <+0>:     push   %r14
   0x0000000000400a14 <+4>:     push   %r13
   0x0000000000400a16 <+6>:     mov    %r8,%r13
   0x0000000000400a19 <+9>:     push   %r12
   0x0000000000400a1b <+11>:    push   %rbp
   0x0000000000400a1c <+12>:    push   %rbx
   0x0000000000400a1d <+13>:    mov    %edi,%ebx
   0x0000000000400a21 <+17>:    lea    -0x1(%rsi),%eax
   0x0000000000400a24 <+20>:    mov    %rdx,%r12
   0x0000000000400a27 <+23>:    xor    %ebp,%ebp
   0x0000000000400a29 <+25>:    lea    0x4(,%rax,4),%rdi
   0x0000000000400a40 <+48>:    xor    %r9d,%r9d
   0x0000000000400a43 <+51>:    xor    %r11d,%r11d
   0x0000000000400a46 <+54>:    nopw   %cs:0x0(%rax,%rax,1)

37           int i, j, k, val;
38           for (i=0; i<rows; i++) {
   0x0000000000400a12 <+2>:     test   %edi,%edi
   0x0000000000400a1f <+15>:    jle    0x400a7e <mmult(int, int, int**, int**, int**)+110>
   0x0000000000400a7a <+106>:   cmp    %ebp,%ebx
   0x0000000000400a7c <+108>:   jg     0x400a31 <mmult(int, int, int**, int**, int**)+33>

39           for (j=0; j<cols; j++) {
   0x0000000000400a31 <+33>:    test   %esi,%esi
   0x0000000000400a33 <+35>:    jle    0x400a76 <mmult(int, int, int**, int**, int**)+102>
   0x0000000000400a35 <+37>:    mov    (%r12,%rbp,8),%r8
   0x0000000000400a39 <+41>:    mov    0x0(%r13,%rbp,8),%rdx
   0x0000000000400a3e <+46>:    xor    %eax,%eax
   0x0000000000400a71 <+97>:    cmp    %rdi,%rax
   0x0000000000400a74 <+100>:   jne    0x400a40 <mmult(int, int, int**, int**, int**)+48>
   0x0000000000400a76 <+102>:   add    $0x1,%rbp

40               val = 0;
41               for (k=0; k<cols; k++) {
   0x0000000000400a64 <+84>:    cmp    %r9d,%esi
   0x0000000000400a67 <+87>:    jg     0x400a50 <mmult(int, int, int**, int**, int**)+64>

42               val += m1[i][k] * m2[k][j];
   0x0000000000400a50 <+64>:    mov    (%rcx,%r9,8),%r14
   0x0000000000400a54 <+68>:    mov    (%r8,%r9,4),%r10d
   0x0000000000400a58 <+72>:    add    $0x1,%r9
   0x0000000000400a5c <+76>:    imul   (%r14,%rax,1),%r10d
   0x0000000000400a61 <+81>:    add    %r10d,%r11d

43               }
44               m3[i][j] = val;
   0x0000000000400a69 <+89>:    mov    %r11d,(%rdx,%rax,1)
   0x0000000000400a6d <+93>:    add    $0x4,%rax

45           }
46           }
47           return(m3);
48      }
   0x0000000000400a7e <+110>:   pop    %rbx
   0x0000000000400a7f <+111>:   pop    %rbp
   0x0000000000400a80 <+112>:   pop    %r12
   0x0000000000400a82 <+114>:   mov    %r13,%rax
   0x0000000000400a85 <+117>:   pop    %r13
   0x0000000000400a87 <+119>:   pop    %r14
   0x0000000000400a89 <+121>:   retq
End of assembler dump.
(gdb)
Mar 03 2013
next sibling parent reply Manu <turkeyman gmail.com> writes:
On 4 March 2013 14:50, J <private private-dont-email-dont-spam.com> wrote:

 On Monday, 4 March 2013 at 04:22:01 UTC, bearophile wrote:

 So this should be better:

 http://codepad.org/B5b4uyBM

 Bye,
 bearophile
 bearophile: Thank you! Unfortunately the http://codepad.org/B5b4uyBM
 code runs a bit *slower* than the original D code. Yikes!

 [...]

 The memory allocation, when timed separately (comment out mmult),
 is less than 60 msec for either version, so I don't think it's a
 memory issue, although it could be a caching issue since the
 matrix layouts are different.
Using dynamic arrays of dynamic arrays that way is pretty poor form 
regardless of the language.

You should really use a single-dimensional array:
  int matrix[SIZE*SIZE];
And index via:
  matrix[y*SIZE+x]

(You can pretty this up in various ways, if this is too unsightly.)

On 4 March 2013 14:02, John Colvin <john.loughran.colvin gmail.com> wrote:
     int[][] m = new int[][](rows, cols);
Does D support proper square arrays this way? Or does it just 
automate allocation of the inner arrays? Does it allocate all the 
associated memory in one block?
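For what it's worth, a minimal sketch of that single-block layout 
in D (Matrix and mmultFlat are hypothetical names, not code from 
this thread):

struct Matrix
{
    int[] data;
    size_t rows, cols;

    this(size_t r, size_t c)
    {
        rows = r;
        cols = c;
        data = new int[r * c];   // one contiguous allocation
    }

    // m[i, j] maps to data[i*cols + j]
    ref int opIndex(size_t i, size_t j) { return data[i * cols + j]; }
}

void mmultFlat(ref Matrix m1, ref Matrix m2, ref Matrix m3)
{
    foreach (i; 0 .. m1.rows)
        foreach (j; 0 .. m2.cols)
        {
            int val = 0;
            foreach (k; 0 .. m1.cols)
                val += m1[i, k] * m2[k, j];
            m3[i, j] = val;
        }
}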
Mar 03 2013
next sibling parent reply "J" <private private-dont-email-dont-spam.com> writes:
On Monday, 4 March 2013 at 05:07:10 UTC, Manu wrote:
 Using dynamic arrays of dynamic arrays that way is pretty poor 
 form regardless of the language.

 You should really use single dimensional array:
   int matrix[SIZE*SIZE];
 And index via:
   matrix[y*SIZE+x]

 (You can pretty this up in various ways, if this is too 
 unsightly)


 On 4 March 2013 14:02, John Colvin 
 <john.loughran.colvin gmail.com> wrote:

     int[][] m = new int[][](rows, cols);
Does D support proper square array's this way? Or does it just automate allocation of the inner arrays? Does it allocate all the associated memory in one block?
That's a really good point. I wonder if there is a canonical matrix representation that would be preferred?
Mar 04 2013
parent reply "J" <private private-dont-email-dont-spam.com> writes:
On Monday, 4 March 2013 at 08:02:46 UTC, J wrote:
 That's a really good point. I wonder if there is a canonical 
 matrix that would be preferred?
I'm not sure if they are the recommended/best practice for matrix 
handling in D at the moment (please advise if they are not), but 
with a little searching, I found that the SciD library has nice 
matrices and MatrixViews (column-major storage, LAPACK compatible).

Now I like MatrixViews because they let me beat the original 
(clearly non-optimal) C matrix multiplication by a couple seconds, 
and the D code with operator overloading in place makes matrix 
multiplication look elegant.

One shootout benchmark down, eleven to go. :-)

- J

p.s. Am I right in concluding there are no multimethods (multiple 
dispatch) in D... it seemed a little awkward to have to wrap the 
MatrixView in a new struct solely in order to overload 
multiplication. Is there a better way that I've missed?

#beating the C/malloc/shootout based code that ran in 1m32sec:

$ time ./scid_matrix_test
-1015380632 859379360 -367726792 -1548829944

real    1m27.967s
user    1m27.930s
sys 0m0.030s
$ time ./scid_matrix_test
-1015380632 859379360 -367726792 -1548829944

real    1m28.259s
user    1m28.230s
sys 0m0.020s
$

module scid_matrix_test;
// compile: gdmd -O -release -inline -noboundscheck scid_matrix_test.d
//   -L-L/home/you/pkg/scid/scid/ -L-lscid
//   -L-L/usr/lib/x86_64-linux-gnu/ -L-lgfortran -L-lblas -L-llapack
//   -L-L.

import scid.matrix;
import std.stdio, std.string, std.array, std.conv;

const int SIZE = 2000;

// Doesn't seem to have multiple dispatch / multimethods, so
// I guess we have to wrap MatrixView to implement opBinary!"*"(lhs,rhs)
struct Multipliable(T)
{
    MatrixView!(T) _m;
    alias _m this;

    this(MatrixView!(T) src) { _m = src; }

    Multipliable!(T) opBinary(string op : "*")(ref Multipliable rhs)
    {
        auto r = Multipliable!int(matrix!int(_m.rows, rhs.cols));
        return mmult(this, rhs, r);
    }
}

Multipliable!(int) mkmatrix(int rows, int cols)
{
    auto m = Multipliable!int();
    m._m = matrix!int(rows, cols);
    int count = 1;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            m[i,j] = count++;
        }
    }
    return m;
}

Multipliable!(int) mmult(ref Multipliable!(int) m1,
                         ref Multipliable!(int) m2,
                         ref Multipliable!(int) m3)
{
    int i, j, k, val;
    ulong rows = m1.rows;
    ulong cols = m1.cols;
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            val = 0;
            for (k = 0; k < cols; k++) {
                val += m1[i,k] * m2[k,j];
            }
            m3[i,j] = val;
        }
    }
    return m3;
}

void main()
{
    auto m1 = mkmatrix(SIZE,SIZE);
    auto m2 = mkmatrix(SIZE,SIZE);
    auto mm = mkmatrix(SIZE,SIZE);

    mm = m1 * m2;

    writefln("%d %d %d %d", mm[0,0], mm[2,3], mm[3,2], mm[4,4]);
}
Mar 04 2013
parent "Lars T. Kyllingstad" <public kyllingen.net> writes:
On Monday, 4 March 2013 at 12:28:25 UTC, J wrote:
 On Monday, 4 March 2013 at 08:02:46 UTC, J wrote:
 That's a really good point. I wonder if there is a canonical 
 matrix that would be preferred?
I'm not sure if they are the recommended/best practice for matrix handling in D at the moment (please advise if they are not), but with a little searching, I found that the SciD library has nice matrices and MatrixViews (column-major storage, LAPACK compatible).

[...]

p.s. Am I right in concluding there are no multimethods (multiple dispatch) in D... it seemed a little awkward to have to wrap the MatrixView in a new struct solely in order to overload multiplication. Is there a better way that I've missed?
I'm the author of SciD. It's great that you found it useful! :)

When I wrote scid.matrix and scid.linalg, it was because I needed 
some linear algebra functionality for a project at work. I knew I 
wouldn't have the time to write a full-featured linear algebra 
library, so I intentionally gave it a minimalistic design, and only 
included the stuff I needed. That's also why I used the name 
"MatrixView" rather than "Matrix"; it was only supposed to be a 
convenient way to view an array as a matrix, and not a full matrix 
type.

Later, Cristi Cobzarenco forked my library for Google Summer of 
Code 2011, with David Simcha as his mentor. They removed almost 
everything but the linear algebra modules, and redesigned those 
from scratch. So basically, there is pretty much nothing left of my 
code in their library. ;) I don't think they ever completed the 
project, but I believe parts of it are usable. You'll find it here:

https://github.com/cristicbz/scid

AFAIK, their goal was to use expression templates in order to 
transform D expressions like

    x*A*B + y*C

where A, B and C are matrices, and x and y are scalars, into as few 
optimised BLAS calls as possible -- e.g., a single GEMM call in the 
example above. I don't know how far they got on this, though.

Lars
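For reference, BLAS GEMM computes C = alpha*A*B + beta*C in a 
single call, so an expression like x*A*B + y*C maps onto one dgemm 
with alpha = x and beta = y (with C serving as both input and 
output operand); that is the collapse such an expression-template 
layer would aim for.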
Mar 05 2013
prev sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
J wrote:

  bearophile: Thank you!  Unfortunately the 
 http://codepad.org/B5b4uyBM code runs a bit *slower* than the 
 original D code. Yikes!

 $  gdmd -O -inline -release -noboundscheck -m64 bear.d -ofdbear
 $ time ./dbear
 -1015380632 859379360 -367726792 -1548829944

 real    2m36.971s
 user    2m36.910s
 sys 0m0.030s
 $ time ./dbear
 -1015380632 859379360 -367726792 -1548829944

 real    2m34.425s
 user    2m34.370s
 sys 0m0.020s
A bit better version:
http://codepad.org/jhbYxEgU

I think this code is good compared to the original (there are 
better algorithms). So maybe the introduced inefficiencies are 
small compiler problems worth finding and reporting.

------------------

Manu:
 Does D support proper square array's this way? Or does it just 
 automate allocation of the inner arrays?
 Does it allocate all the associated memory in one block?
Maybe you should take a look at druntime code.

Bye,
bearophile
Mar 04 2013
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 4 March 2013 at 14:59:21 UTC, bearophile wrote:
 Manu:

 Does D support proper square array's this way? Or does it just 
 automate allocation of the inner arrays?
 Does it allocate all the associated memory in one block?
Maybe you should take a look at druntime code.

Bye,
bearophile
As far as I know it's just a way of automating the allocation; it expands out to the necessary loops.
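If a single contiguous block is what's wanted, here is a sketch of 
one way to get it while keeping int[][] indexing 
(mkmatrixContiguous is a hypothetical helper, not something from 
druntime):

int[][] mkmatrixContiguous(size_t rows, size_t cols)
{
    int[] storage = new int[rows * cols];   // one GC allocation
    auto m = new int[][](rows);
    foreach (i, ref row; m)
        row = storage[i * cols .. (i + 1) * cols]; // row views into it
    return m;
}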
Mar 04 2013
prev sibling next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Monday, 4 March 2013 at 14:59:21 UTC, bearophile wrote:
 A bit better version:
 http://codepad.org/jhbYxEgU
 Bye,
 bearophile
http://dpaste.dzfl.pl/ is back online btw
Mar 04 2013
prev sibling parent reply "jerro" <a a.com> writes:
 A bit better version:
 http://codepad.org/jhbYxEgU

 I think this code is good compared to the original (there are 
 better algorithms).
You can make it much faster even without really changing the 
algorithm, just by reversing the order of the inner two loops like 
this:

void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3) pure nothrow
{
    foreach (immutable i; 0 .. m1.length)
        foreach (immutable k; 0 .. m2[0].length)
            foreach (immutable j; 0 .. m3[0].length)
                m3[i][j] += m1[i][k] * m2[k][j];
}

This makes the code much more cache friendly (because now you 
aren't iterating any matrix by column in the inner loop) and also 
allows the compiler to do auto vectorization.

matrixMult2() takes 2.6 seconds on my machine and matrixMult() 
takes 72 seconds (both compiled with gdmd -O -inline -release 
-noboundscheck -mavx).

This isn't really relevant to the comparison with C++ in this 
thread; I just thought it may be useful for anyone writing matrix 
code.
Mar 04 2013
next sibling parent reply "jerro" <a a.com> writes:
On Monday, 4 March 2013 at 15:46:50 UTC, jerro wrote:
 A bit better version:
 http://codepad.org/jhbYxEgU

 I think this code is good compared to the original (there are 
 better algorithms).
[...]
Forgot to set m3's elements to zero before adding to them:

void matrixMult2(in int[][] m1, in int[][] m2, int[][] m3) pure nothrow
{
    foreach (immutable i; 0 .. m1.length)
    {
        m3[i][] = 0;
        foreach (immutable k; 0 .. m2[0].length)
            foreach (immutable j; 0 .. m3[0].length)
                m3[i][j] += m1[i][k] * m2[k][j];
    }
}

This does not make the function noticeably slower.
Mar 04 2013
parent "J" <not_listed not.not.listed> writes:
On Monday, 4 March 2013 at 15:57:42 UTC, jerro wrote:
 matrixMul2() takes 2.6 seconds on my machine and 
 matrixMul()takes 72 seconds (both compiled with  gdmd -O 
 -inline -release -noboundscheck -mavx).
Thanks Jerro. You made me realize that help from the experts could 
be quite useful. I plugged in a call to the BLAS matrix multiply 
routine, which SciD conveniently binds.

The result? My 2000x2000 matrix multiply went from 98 seconds down 
to 1.8 seconds. It's just hilariously faster to use 20 years of 
numerical experts' optimized code than to try to write your own.

// screaming fast version - uses BLAS for 50x speedup over naive code.
Multipliable!(T) mmult2(T)(ref Multipliable!(T) m1,
                           ref Multipliable!(T) m2,
                           ref Multipliable!(T) m3)
{
    m3.array[] = 0;
    assert(m1.cols == m2.rows);

    char ntran = 'N';
    double one = 1.0;
    double zero = 0.0;
    int nrow = cast(int)m1.rows;
    int ncol = cast(int)m1.cols;
    int mcol = cast(int)m2.cols;

    scid.bindings.blas.blas.dgemm_(&ntran,       // transa
                                   &ntran,       // transb
                                   &nrow,        // m
                                   &mcol,        // n
                                   &ncol,        // k
                                   &one,         // alpha
                                   m1.array.ptr, // A
                                   &nrow,        // lda
                                   m2.array.ptr, // B
                                   &ncol,        // ldb
                                   &zero,        // beta
                                   m3.array.ptr, // C
                                   &nrow,        // ldc
                                   nrow,         // transa_len
                                   ncol);        // transb_len
    return m3;
}
Mar 04 2013
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 03/04/2013 04:46 PM, jerro wrote:
 A bit better version:
 http://codepad.org/jhbYxEgU

 I think this code is good compared to the original (there are better
 algorithms).
[...]
This is a little faster for 2000x2000 matrices and typical cache 
sizes:

import std.algorithm, std.range;

void matrixMult(in int[][] m1, in int[][] m2, int[][] m3)
{
    enum block = 50; // optimal parameter depends on hardware
    foreach (ib; iota(0, m1.length, block))
        foreach (kb; iota(0, m2[0].length, block))
            foreach (i; iota(ib, min(m1.length, ib + block)))
                foreach (k; iota(kb, min(m2[0].length, kb + block)))
                    m3[i][] += m1[i][k] * m2[k][];
}
Mar 05 2013
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/3/2013 8:50 PM, J wrote:
 Dump of assembler code for function _D6matrix5mmultFAAiAAiAAiZv:
Using obj2asm will get you much nicer looking assembler.
Mar 04 2013
prev sibling next sibling parent "Rob T" <alanb ucora.com> writes:
I suggest that you move this line

GC.disable;

to the first line.
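That is, sketched with the names from the original post (mkmatrix 
and mmult as posted there):

import core.memory;

void main()
{
    GC.disable;   // before any allocation happens

    auto m1 = mkmatrix(SIZE, SIZE);
    auto m2 = mkmatrix(SIZE, SIZE);
    auto mm = mkmatrix(SIZE, SIZE);

    mmult(m1, m2, mm);
}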

I don't see how you are doing your timings, so that part is a wild 
card.

Also note that when the GC is re-enabled it can add a significant 
amount of time to the tests. You are not explicitly re-enabling 
the GC, but I don't know if the GC kicks back in as part of program 
termination; it probably shouldn't, but we lack precise 
documentation and cannot be certain.

I would test timings immediately after the GC is disabled and 
prior to program termination to see what affects the GC may or 
may not have on the timings.

--rt
Mar 03 2013
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/3/13 10:48 PM, J wrote:
 Dear D pros,

 As a fan of D, I was hoping to be able to get similar results as this
 fellow on stack overflow, by noting his tuning steps;
 http://stackoverflow.com/questions/5142366/how-fast-is-d-compared-to-c

 Sadly however, when I pull out a simple matrix multiplication benchmark
 from the old language shootout (back when it had D), it is disturbingly
 slower in D when pit against C++.
You're measuring the speed of a couple of tight loops. The smallest 
differences in codegen between them will be on the radar.

Use straight for loops or "foreach (i; 0 .. limit)" for those 
loops; the foreach forms you currently use may introduce slight 
differences in codegen.

Andrei
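For illustration, a sketch of mmult rewritten in that style (a 
guess at the intended shape, not code from this thread):

void mmult(int[][] m1, int[][] m2, int[][] m3)
{
    foreach (i; 0 .. m1.length)
    {
        int[] m1i = m1[i];
        int[] m3i = m3[i];
        foreach (j; 0 .. m3i.length)
        {
            int val = 0;
            foreach (k; 0 .. m2.length)
                val += m1i[k] * m2[k][j];
            m3i[j] = val;
        }
    }
}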
Mar 03 2013
parent "J" <private private-dont-email-dont-spam.com> writes:
On Monday, 4 March 2013 at 04:49:20 UTC, Andrei Alexandrescu 
wrote:
 You're measuring the speed of a couple of tight loops. The 
 smallest differences in codegen between them will be on the 
 radar. Use straight for loops or "foreach (i; 0 .. limit)" for 
 those loops...
Thanks Andrei! I validated your analysis by doing a straight port 
of the C code to D, even using the same memory layout (matching 
malloc calls). The port was trivial, which was very reassuring. As 
evidence, I had to tweak only 8 lines of the C to make it compile 
under gdc. The mmult() function in the D version remained identical 
to that in the C version.

Even more reassuring was that the performance of the resulting D 
matched the C to within 1% tolerance (about 200-500 msec slower on 
the D side; presumably due to runtime init).

$ time ./gdc_compiled_ported_cpp_version
-1015380632 859379360 -367726792 -1548829944

real    1m32.418s
user    1m32.370s
sys 0m0.020s
$

It's a great feeling, knowing the bare metal is available if need 
be. Thanks guys!

-J
Mar 03 2013
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/3/2013 7:48 PM, J wrote:
 void mmult(int[][] m1, int[][] m2, int[][] m3)
 {
      foreach(int i, int[] m1i; m1)
      {
          foreach(int j, ref int m3ij; m3[i])
          {
              int val;
              foreach(int k, int[] m2k; m2)
              {
                  val += m1i[k] * m2k[j];
              }
              m3ij = val;
          }
      }
 }
[...]
 ////// C++ version
 int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
      int i, j, k, val;
      for (i=0; i<rows; i++) {
          for (j=0; j<cols; j++) {
              val = 0;
              for (k=0; k<cols; k++) {
                  val += m1[i][k] * m2[k][j];
              }
              m3[i][j] = val;
          }
      }
      return(m3);
 }
One difference that jumps out at me is you have extra variables and ref types in the D version, and in the C++ version you have "cached" the row & column loop limits. (I.e. the C++ version assumes a rectangular matrix, while the D one has a (presumably) different length for each column.)
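For illustration, a sketch of the D loop with the limits cached the 
same way the C++ version caches them (this assumes rectangular 
matrices, as the C++ code does; it is not code from the thread):

void mmult(int[][] m1, int[][] m2, int[][] m3)
{
    immutable rows = m1.length;     // cached once, like the C++ version
    immutable cols = m2[0].length;
    for (size_t i = 0; i < rows; i++)
    {
        for (size_t j = 0; j < cols; j++)
        {
            int val = 0;
            for (size_t k = 0; k < cols; k++)
                val += m1[i][k] * m2[k][j];
            m3[i][j] = val;
        }
    }
}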
Mar 04 2013