digitalmars.D - Matrix mul
- bearophile (95/95) Nov 22 2008 While writing code that works on matrices I have found something curious...
- Andrei Alexandrescu (10/28) Nov 22 2008 [snip]
- bearophile (6/10) Nov 22 2008 The array bounds aren't controlled, the code is compiled with -O -releas...
- Bill Baxter (9/17) Nov 22 2008 This is why I prefer to call on an optimized BLAS lib for all my large
- bearophile (5/8) Nov 22 2008 Can you or someone else run that little D code, so you can tell me if my...
- BCS (6/14) Nov 22 2008 BLAS may well be the most tested, optimized and benchmarked code in the ...
- Bill Baxter (6/20) Nov 23 2008 Exactly. That's why I haven't spent too much time benchmarking it.
- bearophile (4/7) Nov 23 2008 Performing many benchmarks teaches you that it's better not assume too m...
- Bill Baxter (13/18) Nov 23 2008 I do find all your benchmark postings interesting. I'd heard the DMD
- bearophile (6/11) Nov 23 2008 Thank you, that's a lot of difference.
- BCS (2/8) Nov 23 2008 It wouldn't suprise me if Don wrote it!
- Sergey Gromov (8/10) Nov 22 2008 Tested your code using DMD 2.019. You are right. The #1 is about 30
- Andrei Alexandrescu (6/19) Nov 23 2008 I think what happens is that the speed of FP arithmetic depends on the
- Sergey Gromov (3/11) Nov 23 2008 I'm always forgetting that an uninitialized double is NaN... That
- Andrei Alexandrescu (4/10) Nov 23 2008 Oh sorry for the confusion; I thought the assembler was hand-written,
- Don (9/10) Nov 22 2008 Here's what I think is going on. AFAIK, D hasn't got any special code
- Tom S (11/24) Nov 22 2008 Nah, it's about NaNs :)
- Bill Baxter (3/29) Nov 23 2008 Doh! The NaNs strike again.
- bearophile (82/86) Nov 23 2008 You are right, and I'm a stupid. Most of my benchmarks have similar sill...
- Andrei Alexandrescu (6/29) Nov 23 2008 I swear I wrote my previous message before reading this :o). Damn you're...
- Tom S (8/37) Nov 23 2008 Damn, you're right too ;) I always forget whether malloc or calloc
- Andrei Alexandrescu (6/40) Nov 23 2008 In my experience malloc returns zeroed memory around the start of an
While writing code that works on matrices I have found something curious, so I have written the following little benchmark. As usual, keep your eyes open for possible bugs and mistakes of mine:

import std.conv: toInt;
import std.c.stdlib: rand, malloc;
const int RAND_MAX = short.max; // RAND_MAX seems missing from std.c.stdlib

void main(string[] args) {
    int i, j, k;
    int N = args.length == 2 ? toInt(args[1]) : 400;

    // alloc A, B, C
    static if (1) { // Version 1: GC-allocated jagged arrays
        auto A = new double[][](N, N);
        auto B = new double[][](N, N);
        auto C = new double[][](N, N);
    } else { // Version 2: malloc'd C-style arrays, one row at a time
        double** A = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            A[i] = cast(double*)malloc(N * (double).sizeof);
        double** B = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            B[i] = cast(double*)malloc(N * (double).sizeof);
        double** C = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            C[i] = cast(double*)malloc(N * (double).sizeof);
    }

    // init mats randomly
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            A[i][j] = cast(double)rand() / RAND_MAX;
            B[i][j] = cast(double)rand() / RAND_MAX;
        }

    // C = A * B, naive way
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][k] += A[i][j] * B[j][k];
}

Timings, Core2 2 GHz, DMD v1.036:

N = 100:  Version 1: 0.17 s   Version 2: 0.03 s
N = 400:  Version 1: 9.25 s   Version 2: 0.26 s
N = 512:  Version 1: 19.3 s   Version 2: 1.0 s

--------------------------

Note there are much faster ways to perform A * B:

// Alternative mul code
double[] bj = new double[N];
for (int j = 0; j < N; j++) {
    for (int k = 0; k < N; k++)
        bj[k] = B[k][j];
    for (int i = 0; i < N; i++) {
        double[] ai = A[i];
        double s = 0;
        for (int k = 0; k < N; k++)
            s += ai[k] * bj[k];
        C[i][j] = s;
    }
}

--------------------------------

The asm of the two inner loops (the second one loads array ptr/length pairs from the stack, so it belongs to the dynamic array version):

L164:   mov ECX,[EAX]
        fld qword ptr [ESI*8][ECX]
        mov ECX,0[EBP]
        fmul qword ptr [EBX*8][ECX]
        mov ECX,[EDX]
        fadd qword ptr [EBX*8][ECX]
        fstp qword ptr [EBX*8][ECX]
        inc EBX
        cmp EBX,EDI
        jl  L164

L12B:   mov EDX,4[EBX]
        mov EAX,[EBX]
        fld qword ptr [EDI*8][EDX]
        mov ECX,010h[ESP]
        mov EDX,4[ECX]
        mov EAX,[ECX]
        fmul qword ptr [ESI*8][EDX]
        mov ECX,014h[ESP]
        mov EDX,4[ECX]
        mov EAX,[ECX]
        fadd qword ptr [ESI*8][EDX]
        fstp qword ptr [ESI*8][EDX]
        inc ESI
        cmp ESI,EBP
        jl  L12B

Note: the alternative mul code, written in C and compiled with GCC, runs about twice as fast again compared to Version 2.

Bye,
bearophile
Nov 22 2008
bearophile wrote:
> While writing code that works on matrices I have found something curious, so I have written the following little benchmark. As usual, keep your eyes open for possible bugs and mistakes of mine:
> [snip]

This is yet another proof that bounds checking can cost a lot. Although the looping code looks the same, in the case of using arrays there are plenty of bounds checks going on. My guess is that if you turn that off, the differences won't be as large (or even detectable for certain ranges of N).

> Note there are much faster ways to perform A * B:
> [snip]

Probably blocking will bring even more mileage (but again that depends on N).

Andrei
Nov 22 2008
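For illustration, here is a minimal sketch of the loop blocking Andrei mentions above (not code from the thread; the tile size is arbitrary and would need tuning per machine):

const int TILE = 64; // arbitrary; pick so three tiles fit comfortably in cache

// C += A * B with the i/k/j loops tiled, so each block of A, B and C
// stays cache-resident while it is being reused. Assumes square N x N
// matrices and a pre-zeroed C.
void blockedMul(double[][] A, double[][] B, double[][] C, int N) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE) {
                int iTop = ii + TILE < N ? ii + TILE : N;
                int kTop = kk + TILE < N ? kk + TILE : N;
                int jTop = jj + TILE < N ? jj + TILE : N;
                for (int i = ii; i < iTop; i++)
                    for (int k = kk; k < kTop; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jTop; j++)
                            C[i][j] += a * B[k][j];
                    }
            }
}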
Andrei Alexandrescu:
> My guess is that if you turn that off, the differences won't be as large (or even detectable for certain ranges of N).

Array bounds aren't checked: the code is compiled with -O -release -inline. Do you see any bounds checks in the asm code at the bottom of my post?

> Probably blocking will bring even more mileage (but again that depends on N).

Yes, blocking may help. And using SSE instructions may help some more. The end result may be a hundred or more times faster than the naive code in D :-)

Bye,
bearophile
Nov 22 2008
On Sun, Nov 23, 2008 at 8:29 AM, bearophile <bearophileHUGS lycos.com> wrote:
> Andrei Alexandrescu:
>> My guess is that if you turn that off, the differences won't be as large (or even detectable for certain ranges of N).
>
> Array bounds aren't checked: the code is compiled with -O -release -inline. Do you see any bounds checks in the asm code at the bottom of my post?
>
>> Probably blocking will bring even more mileage (but again that depends on N).
>
> Yes, blocking may help. And using SSE instructions may help some more. The end result may be a hundred or more times faster than the naive code in D :-)

This is why I prefer to call on an optimized BLAS lib for all my large matrix multiplication needs. All that nonsense is already taken care of. And it's compiled with GCC, which has better floating point optimization to begin with. I haven't done any benchmarks, though. :-) Might be interesting to try out my MinGW-compiled ATLAS BLAS matrix mult against the numbers you're getting there.

--bb
Nov 22 2008
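For reference, calling a C BLAS from D needs only an extern (C) declaration; a sketch against the standard CBLAS interface (assumes an implementation such as ATLAS is linked in, and matrices stored as single contiguous row-major arrays):

// standard CBLAS enums and the dgemm prototype from cblas.h
enum CBLAS_ORDER     { RowMajor = 101, ColMajor = 102 }
enum CBLAS_TRANSPOSE { NoTrans = 111, Trans = 112 }

extern (C) void cblas_dgemm(CBLAS_ORDER order,
    CBLAS_TRANSPOSE transA, CBLAS_TRANSPOSE transB,
    int m, int n, int k,
    double alpha, double* a, int lda, double* b, int ldb,
    double beta, double* c, int ldc);

// C = A * B for N x N matrices (alpha = 1, beta = 0)
void blasMul(double[] A, double[] B, double[] C, int N) {
    cblas_dgemm(CBLAS_ORDER.RowMajor,
                CBLAS_TRANSPOSE.NoTrans, CBLAS_TRANSPOSE.NoTrans,
                N, N, N, 1.0, A.ptr, N, B.ptr, N, 0.0, C.ptr, N);
}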
Bill Baxter Wrote:
> I haven't done any benchmarks, though. :-)

I see. But using something because it's supposed to be faster, without performing actual performance comparisons, looks a little strange to me :-)

> Might be interesting to try out my MinGW-compiled ATLAS BLAS matrix mult against the numbers you're getting there.

Can you or someone else run that little D code, so you can tell me if my timings are right?

Bye,
bearophile
Nov 22 2008
Reply to bearophile,
> Bill Baxter Wrote:
>> I haven't done any benchmarks, though. :-)
>
> I see. But using something because it's supposed to be faster, without performing actual performance comparisons, looks a little strange to me :-)

BLAS may well be the most tested, optimized and benchmarked code in the world. It is highly unlikely that anything will be faster on hardware anyone is likely to run across. Given that the vast bulk of supercomputer time is spent on matrix math, it should be fair to say that BLAS is probably the most executed code on the planet.
Nov 22 2008
On Sun, Nov 23, 2008 at 12:25 PM, BCS <ao pathlink.com> wrote:
> BLAS may well be the most tested, optimized and benchmarked code in the world. It is highly unlikely that anything will be faster on hardware anyone is likely to run across. Given that the vast bulk of supercomputer time is spent on matrix math, it should be fair to say that BLAS is probably the most executed code on the planet.

Exactly. That's why I haven't spent too much time benchmarking it. It would be quite surprising if something I wrote in D outperformed the ATLAS SSE3 optimized BLAS implementation. (Though it would not be so surprising if something Don wrote managed to outperform it. :-) )

--bb
Nov 23 2008
Bill Baxter:
> Exactly. That's why I haven't spent too much time benchmarking it. It would be quite surprising if something I wrote in D outperformed the ATLAS SSE3 optimized BLAS implementation.

Performing many benchmarks teaches you that it's better not to assume too many things. Nature and computers often find ways to surprise you :-)

Bye,
bearophile
Nov 23 2008
On Sun, Nov 23, 2008 at 6:15 PM, bearophile <bearophileHUGS lycos.com> wrote:
> Bill Baxter:
>> Exactly. That's why I haven't spent too much time benchmarking it. It would be quite surprising if something I wrote in D outperformed the ATLAS SSE3 optimized BLAS implementation.
>
> Performing many benchmarks teaches you that it's better not to assume too many things. Nature and computers often find ways to surprise you :-)

I do find all your benchmark postings interesting. I'd heard the DMD fp codegen wasn't so great, but I'm quite convinced now. In this particular case I just haven't had any performance problems with my setup yet, so I haven't felt the need to investigate. I needed various LAPACK routines anyway, and LAPACK depends on BLAS, so BLAS is just sitting there. I might as well use it.

Anyway, I gave it a try, and it does multiply the 400x400 matrices about 6.7x faster than the fastest result from the naive mult implementation in your benchmark. However it should be noted that BLAS only accepts contiguous arrays, so I only tried it on your V2 allocation strategy that allocates one big array.

--bb
Nov 23 2008
Bill Baxter:
> I do find all your benchmark postings interesting.

Most of them are buggy, in the beginning... The main problem is that they often do nothing real, so there's no practical result to test. In the future I'll add better tests to avoid the most silly errors of mine.

> Anyway, I gave it a try, and it does multiply the 400x400 matrices about 6.7x faster than the fastest result from the naive mult implementation in your benchmark.

Thank you, that's a lot of difference. In my original post I have also given a block of code to multiply the matrices in a less naive way; you may test that too, if you have the time and you want to.

Bye,
bearophile
Nov 23 2008
Reply to Bill,
> Exactly. That's why I haven't spent too much time benchmarking it. It would be quite surprising if something I wrote in D outperformed the ATLAS SSE3 optimized BLAS implementation. (Though it would not be so surprising if something Don wrote managed to outperform it. :-) )
>
> --bb

It wouldn't surprise me if Don wrote it!
Nov 23 2008
Sat, 22 Nov 2008 20:14:37 -0500, bearophile wrote:
> Can you or someone else run that little D code, so you can tell me if my timings are right?

Tested your code using DMD 2.019. You are right. The #1 is about 30 times slower than #2 here. The difference due to the GC I would say is around 4%--even though I didn't call hasNoPointers() anywhere. The really weird part is, if I comment out the "init mats randomly" loop, it gets even slower. I don't get it.
Nov 22 2008
Sergey Gromov wrote:
> The really weird part is, if I comment out the "init mats randomly" loop, it gets even slower. I don't get it.

I think what happens is that the speed of FP arithmetic depends on the values of the operands. Whenever a NaN is involved, everything gets a lot slower (tested that). Also, I believe (without having tested it) that multiplying with zero is faster than multiplying nonzero numbers.

Andrei
Nov 23 2008
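The NaN effect is easy to reproduce; a minimal sketch (my own, not from the thread; timing via C's clock() from std.c.time, and relying on the fact that on the x87 FPU that DMD targets, NaN operands fall into a slow microcoded path):

import std.stdio: writefln;
import std.c.time: clock, CLOCKS_PER_SEC;

void main() {
    const int N = 20_000_000;
    double[] normal = new double[N];
    normal[] = 1.0;                 // ordinary values
    double[] nans = new double[N];  // D default-initializes doubles to NaN

    double sink = 0.0;
    auto t0 = clock();
    for (int i = 0; i < N; i++) sink += normal[i];
    writefln("normal: %s s", cast(double)(clock() - t0) / CLOCKS_PER_SEC);

    sink = 0.0;
    t0 = clock();
    for (int i = 0; i < N; i++) sink += nans[i]; // every add sees a NaN
    writefln("NaN:    %s s (sink=%s)", cast(double)(clock() - t0) / CLOCKS_PER_SEC, sink);
}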
Sun, 23 Nov 2008 07:33:16 -0600, Andrei Alexandrescu wrote:
> I think what happens is that the speed of FP arithmetic depends on the values of the operands. Whenever a NaN is involved, everything gets a lot slower (tested that).

I'm always forgetting that an uninitialized double is NaN... That should be the reason for at least this slowdown.
Nov 23 2008
bearophile wrote:
> Array bounds aren't checked: the code is compiled with -O -release -inline. Do you see any bounds checks in the asm code at the bottom of my post?

Oh, sorry for the confusion; I thought the assembler was hand-written, not generated.

Andrei
Nov 23 2008
bearophile wrote:
> While writing code that works on matrices I have found something curious...

Here's what I think is going on. AFAIK, D hasn't got any special code for initializing jagged arrays. So

auto A = new double[][](N, N);

involves N+1 memory allocations. As well as being slow, this is going to give horrible locality of reference in the memory allocation. You'll suffer cache misses on almost every array access! I don't think this has anything to do with bounds checking.
Nov 22 2008
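The N+1 allocations can be avoided without giving up double[][]; a sketch (the helper name is mine) that carves all the rows out of one block, much like the NewVoidCMatrix helper bearophile posts later in the thread:

// Two allocations total: one for the payload, one for the row slices.
// Rows end up adjacent in memory, so row-to-row access stays local.
double[][] contiguousMatrix(int n) {
    double[] payload = new double[n * n];
    double[][] rows = new double[][](n);
    for (int i = 0; i < n; i++)
        rows[i] = payload[i * n .. (i + 1) * n];
    return rows;
}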
Don wrote:
> Here's what I think is going on. AFAIK, D hasn't got any special code for initializing jagged arrays. So auto A = new double[][](N, N); involves N+1 memory allocations. As well as being slow, this is going to give horrible locality of reference in the memory allocation. You'll suffer cache misses on almost every array access! I don't think this has anything to do with bounds checking.

Nah, it's about NaNs :) The 'init mats randomly' loop doesn't touch C at all, thus all the later additions leave C at NaN, causing lots of FP exceptions. We actually had something like this happen in Deadlock a few times in the animation subsystem. Brought the framerate down to a single digit :D

--
Tomasz Stachowiak
http://h3.team0xf.com/
h3/h3r3tic on #D freenode
Nov 22 2008
Doh! The NaNs strike again.

--bb

On Sun, Nov 23, 2008 at 3:40 PM, Tom S <h3r3tic remove.mat.uni.torun.pl> wrote:
> Nah, it's about NaNs :) The 'init mats randomly' loop doesn't touch C at all, thus all the later additions leave C at NaN, causing lots of FP exceptions. We actually had something like this happen in Deadlock a few times in the animation subsystem. Brought the framerate down to a single digit :D
Nov 23 2008
Tom S:
> Nah, it's about NaNs :) The 'init mats randomly' loop doesn't touch C at all, thus all the later additions leave C at NaN, causing lots of FP exceptions.

You are right, and that was stupid of me. Most of my benchmarks have similar silly bugs at the beginning. Thank you. The new code:

import std.conv: toInt;
import std.c.stdlib: rand, malloc;
const int RAND_MAX = short.max; // RAND_MAX seems missing from std.c.stdlib

T[][] NewVoidCMatrix(T)(int nr, int nc) {
    // Part of the code by Marius Muja <mariusm cs.ubc.ca>
    assert(nr > 0, "NewVoidCMatrix: nr must be > 0.");
    assert(nc > 0, "NewVoidCMatrix: nc must be > 0.");
    void* mem = cast(void*)malloc(nr * (T[]).sizeof + nr * nc * T.sizeof);
    T[]* index = cast(T[]*)mem;
    T* mat = cast(T*)(mem + nr * (T[]).sizeof);
    for (int i = 0; i < nr; ++i) {
        index[i] = mat[0 .. nc];
        mat += nc;
    }
    return index[0 .. nr];
}

void main(string[] args) {
    int i, j, k;
    int N = args.length == 2 ? toInt(args[1]) : 400;

    // alloc A, B, C
    static if (0) { // V1: GC-allocated jagged arrays
        auto A = new double[][](N, N);
        auto B = new double[][](N, N);
        auto C = new double[][](N, N);
    }
    static if (0) { // V2: one malloc'd block, rows sliced out of it
        auto A = NewVoidCMatrix!(double)(N, N);
        auto B = NewVoidCMatrix!(double)(N, N);
        auto C = NewVoidCMatrix!(double)(N, N);
    }
    static if (1) { // V3: C-style arrays, one malloc per row
        double** A = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            A[i] = cast(double*)malloc(N * (double).sizeof);
        double** B = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            B[i] = cast(double*)malloc(N * (double).sizeof);
        double** C = cast(double**)malloc(N * (double*).sizeof);
        for (i = 0; i < N; i++)
            C[i] = cast(double*)malloc(N * (double).sizeof);
    }
    static if (0) { // V4: C-style arrays, one contiguous malloc'd block
        double** A = cast(double**)malloc(N * (double*).sizeof);
        A[0] = cast(double*)malloc(N * N * (double).sizeof);
        for (i = 1; i < N; i++)
            A[i] = A[0] + i * N;
        double** B = cast(double**)malloc(N * (double*).sizeof);
        B[0] = cast(double*)malloc(N * N * (double).sizeof);
        for (i = 1; i < N; i++)
            B[i] = B[0] + i * N;
        double** C = cast(double**)malloc(N * (double*).sizeof);
        C[0] = cast(double*)malloc(N * N * (double).sizeof);
        for (i = 1; i < N; i++)
            C[i] = C[0] + i * N;
    }

    // init mats randomly (and zero C this time)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            A[i][j] = cast(double)rand() / RAND_MAX;
            B[i][j] = cast(double)rand() / RAND_MAX;
            C[i][j] = 0.0;
        }

    // C = A * B, naive way
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                C[i][k] += A[i][j] * B[j][k];
}

I have added two more versions for Don, to show the difference in speed due to the spreading of memory caused by malloc. It's visible but not significant. Timings, N = 400:

V1: 0.42 s
V2: 0.41 s (repeatable difference of 0.01 s)
V3: 0.25 s
V4: 0.25 s

Now the difference in speed has to come from something else. It may come from DMD not optimizing dynamic array usage as well as C arrays. A test I haven't done yet is to change the representation of the dynamic arrays: instead of allocating an array of structs that contain len+ptr, I can allocate two "parallel arrays", one with only the lengths and the other with just the pointers. I think this may increase the locality of reference of the pointers, maybe leading to performance similar to the C arrays.

Bye,
bearophile
Nov 23 2008
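A sketch of the "parallel arrays" representation described in the last paragraph (hypothetical; the struct and names are mine): one shared length plus a dense block of row pointers, instead of N (length, pointer) pairs:

import std.c.stdlib: malloc;

struct Matrix {
    size_t cols;     // the single shared row length
    double*[] rows;  // just the pointers, packed contiguously
}

Matrix parallelMatrix(int nr, int nc) {
    Matrix m;
    m.cols = nc;
    m.rows = new double*[nr];
    // all elements in one contiguous block, as in V4 above
    double* payload = cast(double*)malloc(nr * nc * double.sizeof);
    for (int i = 0; i < nr; i++)
        m.rows[i] = payload + i * nc;
    return m;
}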
Tom S wrote:
> Nah, it's about NaNs :) The 'init mats randomly' loop doesn't touch C at all, thus all the later additions leave C at NaN, causing lots of FP exceptions. We actually had something like this happen in Deadlock a few times in the animation subsystem. Brought the framerate down to a single digit :D

I swear I wrote my previous message before reading this :o). Damn, you're right; I hadn't seen the bug in initialization. But since the fast version uses malloc, isn't that memory garbage (possibly, but not necessarily, zero)?

Andrei
Nov 23 2008
Andrei Alexandrescu wrote:
> I swear I wrote my previous message before reading this :o). Damn, you're right; I hadn't seen the bug in initialization. But since the fast version uses malloc, isn't that memory garbage (possibly, but not necessarily, zero)?

Damn, you're right too ;) I always forget whether malloc or calloc initializes to zero. According to my tests, DMD/Tango's malloc returns mostly zeroes in bearophile's test anyway.

--
Tomasz Stachowiak
http://h3.team0xf.com/
h3/h3r3tic on #D freenode
Nov 23 2008
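For the record (standard C semantics, unchanged by D's bindings): malloc leaves the block uninitialized while calloc zero-fills it, so any zeroes coming back from malloc are an accident of freshly mapped pages:

import std.c.stdlib: malloc, calloc;

void demo(int n) {
    // contents indeterminate: zero on fresh OS pages, garbage on reuse
    double* a = cast(double*)malloc(n * double.sizeof);
    // guaranteed all-zero bits, which reads as 0.0 for IEEE doubles
    double* b = cast(double*)calloc(n, double.sizeof);
}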
Tom S wrote:
> Damn, you're right too ;) I always forget whether malloc or calloc initializes to zero. According to my tests, DMD/Tango's malloc returns mostly zeroes in bearophile's test anyway.

In my experience malloc returns zeroed memory around the start of an application (because the OS hands out freshly zeroed pages for security reasons). Later on, as memory is reused, there are no more zeroes - it's whatever you left in there.

Andrei
Nov 23 2008