digitalmars.D.learn - Optimize my code =)
- Robin (109/109) Feb 14 2014 Hiho,
- Craig Dillabaugh (5/12) Feb 14 2014 I am no expert at optimizing D code, so this is a bit of a shot
- John Colvin (5/19) Feb 14 2014 Why would that make a difference here? They're (almost) identical
- Craig Dillabaugh (9/32) Feb 14 2014 Well, in my defense I did start with a disclaimer. For some
- bearophile (4/7) Feb 14 2014 It's the same thing.
- Xinok (8/17) Feb 14 2014 Not quite. Setting length will copy over the existing contents of
- bearophile (32/80) Feb 14 2014 Also try "final class" or struct in your benchmark. And try to
- Chris Cain (23/40) Feb 14 2014 Are you sure you need a class here? If you're not using
- bearophile (4/5) Feb 14 2014 minimallyInitializedArray should be used, because it's safer.
- thedeemon (4/6) Feb 14 2014 First of all make sure you it's not virtual, otherwise each
- Kapps (34/106) Feb 14 2014 Try with and without -noboundscheck. Sometimes using it seems to
- Nick Sabalausky (16/36) Feb 14 2014 A matrix is just plain-old-data, so use a struct, you don't need a class...
- bearophile (14/24) Feb 14 2014 But the row-column bounds of the matrix are not the same as the
- Robin (104/104) Feb 15 2014 Hiho,
- Stanislav Blinov (22/62) Feb 15 2014 Make them value types. If you're using dynamic arrays for
- bearophile (13/16) Feb 15 2014 Many of the things you say and write are still wrong or confused.
- bearophile (7/7) Feb 15 2014 In the meantime also take a look at the matrix mul code I've
- francesco cattoglio (7/10) Feb 16 2014 Well, actually, sometimes squeezing as much performance as you
- bearophile (9/16) Feb 16 2014 Unless you know a language well, you will fail squeezing that.
- Francesco Cattoglio (8/12) Feb 16 2014 Sorry, but I fail to be amazed by what huge piles of money thrown
- Robin (85/85) Feb 16 2014 Hiho,
- Stanislav Blinov (14/29) Feb 16 2014 What swap assignment, what move constructor? :)
- bearophile (21/39) Feb 16 2014 D is not a small language, it contains many features, but
- Robin (8/8) Feb 17 2014 Hiho,
- John Colvin (37/45) Feb 17 2014 A few quick points:
- bearophile (80/151) Feb 17 2014 I think it's better to improve your code iteratively. First
- Robin (74/74) Feb 17 2014 Hiho,
- bearophile (63/115) Feb 17 2014 The code looks better.
- Nick Sabalausky (18/20) Feb 17 2014 You should get rid of the .dup's in the constructors and postblits. You
- Stanislav Blinov (39/39) Feb 17 2014 Okay, here we go...
- Stanislav Blinov (5/8) Feb 17 2014 Actually rvalue opAssign would be even better :)
- bearophile (5/7) Feb 17 2014 I like a constructor(s) like that because it catches bugs like:
- Stanislav Blinov (2/7) Feb 18 2014 Hmmm... yeah, ok, not completely unnecessary :)
- bearophile (7/9) Feb 17 2014 Few small things should still be improved, but it's an
- Stanislav Blinov (3/10) Feb 17 2014 To me as well. I haven't yet tried to dig deep though.
- bearophile (12/16) Feb 18 2014 I have compiled your code with (a single module, 32 bit Windows):
- Stanislav Blinov (41/57) Feb 18 2014 Interesting. For me (on 64 bit) multiplicationTest is in the same
- Stanislav Blinov (14/14) Feb 18 2014 ...And if I define opEquals as it was made by Robin, i.e. like
- Kapps (7/11) Feb 18 2014 LDC is probably detecting that you're never actually using the
- Robin (101/101) Feb 18 2014 Hiho,
- bearophile (22/45) Feb 18 2014 I guess Andrei doesn't agree with you (and move semantics in
- Stanislav Blinov (38/67) Feb 18 2014 Matrix(this) not compiling when 'this' is const is a bug. That's
- Robin (48/48) Feb 20 2014 Hiho,
- bearophile (11/17) Feb 20 2014 I am not sure of the causes of this, but the D GG was improved a
- John Colvin (2/50) Feb 21 2014 what flags did you use?
- Robin (5/5) Feb 22 2014 Hiho,
- bearophile (6/10) Feb 18 2014 Physics teaches us that those experimental measures are expressed
- Kapps (10/25) Feb 17 2014 While it is cleaner, I remember array operations (such as data[]
- bearophile (20/23) Feb 17 2014 They are not being "fixed". But for arrays as large as the ones
- Nick Sabalausky (4/13) Feb 17 2014 I think you've misunderstood me. I was only explaining D's class vs
- Nick Sabalausky (31/43) Feb 15 2014 Like in other languages, "final" means "cannot inherit from this". Since...
- Chris Williams (20/41) Feb 17 2014 Your foreach here is unnecessary. You can zero out an array using
Hiho, I am fairly new to the D programming language and still reading through the awesome online book at http://ddili.org/ders/d.en/index.html. However, I think it is missing some things or I am missing glasses and overlooked these parts in the book.^^ Currently I am trying to write a very simple Matrix library for some minor linear algebra operations such as calculating the determinant, some simple decompositions etc. I am aware that this probably has been done already and I am mainly focusing on learning D with this project. =) I already wrote a similar Matrix library for Java and C++ with more or less the same algorithms in order to benchmark these languages. As I am very new to D I instantly ran into certain optimization issues. E.g. the simple matrix multiplication based on my Java implementation requires about 1.5 secs for multiplying two 1000x1000 matrices, however my D implementation requires 39 seconds without compiler optimizations and about 14 seconds with -inline, -O, -noboundscheck and -release. So I am sure that I just missed some possible optimization routines for D and that I could miss entire patterns used in D to write more optimized code - especially when looking at the const and immutability concepts. I won't paste the whole code, just the important parts. class Matrix(T = double) { private T[] data; private Dimension dim; } This is my class with its templated data as a one-dimensional array (as I don't like jagged arrays) and a dimension (which is a struct). The dimension object has some util functionality such as getting the total size or mapping (row, col) to a one-dimensional index besides the getters for rows and columns. this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; } This is one of the constructors. 
Its task is just to create a new Matrix instance filled with what is 0 for the templated value - don't know if there is a better approach for this. I experienced that floating point values are sadly initialized with nan which isn't what I wanted -> double.init = nan. I am using opIndex and opIndexAssign in order to access and assign the matrix values: T opIndex(size_t row, size_t col) const { immutable size_t i = this.dim.offset(row, col); if (i >= this.dim.size) { // TODO - have to learn exception handling in D first. :P } return this.data[i]; } Which calls: size_t size() @property const { return this.rows * this.cols; } I think and hope that this is getting optimized via inlining. =) This works similarly for opIndexAssign. The function I am mainly benchmarking is the simple matrix multiplication where one of the multiplied matrices is transposed first in order to improve the cache hit ratio. Matrix opMul(ref const(Matrix) other) { if (this.dim.rows != other.dim.cols || this.dim.cols != other.dim.rows) { // TODO - still have to learn exception handling first ... } auto m = new Matrix(this.dim.rows, other.dim.cols); auto s = new Matrix(other).transposeAssign(); size_t r1, r2, c; T sum; for (r1 = 0; r1 < this.dim.rows; ++r1) { for (r2 = 0; r2 < other.dim.rows; ++r2) { sum = to!T(0); for (c = 0; c < this.dim.cols; ++c) { sum += this[r1, c] * s[r2, c]; } m[r1, r2] = sum; } } return m; } I am aware that there are faster algorithms for matrix multiplication but I am mainly learning how to write efficient D code in general and not based on any algorithm. This is more or less the same algorithm I am using with Java and C++. Both require about 1.5 secs (after Java warm-up) to perform this task for two 1000x1000 matrices. D compiled with DMD takes about 14 seconds with all (known-to-me) optimize flags activated. 
(listed above) I wanted to make s an immutable matrix in order to hopefully improve performance via this change, however I wasn't technically able to do this to be honest. Besides that I can't find a way to make use of move semantics like in C++. For example an rvalue move constructor or a move assignment would be a very nice thing for many tasks. I am also not quite sure if it is normal in D to implement an idup function, as slices have, which creates an immutable copy of your object. And are there important things I have to do while implementing an idup method for this matrix class? Another nice thing to know would be if it is possible to initialize an array before it is default initialized with T.init, where T is the type of the array's fields. In C++ e.g. there is no default initialization, which is nice if you have to initialize every single field anyway. E.g. in a Matrix.random() method which creates a matrix with random values. There it is unnecessary that the (sometimes huge) array is completely initialized with the type's init value. Thanks in advance! Robin
Feb 14 2014
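For readers comparing across languages, the transposed-multiply loop Robin describes can be sketched in C++ roughly as follows. This is an illustrative sketch only; the names `Mat` and `mulTransposed` are invented here and are not from Robin's actual Java/C++ code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat row-major matrix, as in the post (1D storage, no jagged arrays).
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    double at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Multiply after transposing the right-hand operand, so both inner-loop
// reads walk memory sequentially (the cache-hit-ratio idea in the post).
Mat mulTransposed(const Mat& a, const Mat& b) {
    assert(a.cols == b.rows);
    Mat bt(b.cols, b.rows);                     // transpose of b
    for (std::size_t r = 0; r < b.rows; ++r)
        for (std::size_t c = 0; c < b.cols; ++c)
            bt.at(c, r) = b.at(r, c);
    Mat m(a.rows, b.cols);
    for (std::size_t r1 = 0; r1 < a.rows; ++r1)
        for (std::size_t r2 = 0; r2 < bt.rows; ++r2) {
            double sum = 0;
            for (std::size_t c = 0; c < a.cols; ++c)
                sum += a.at(r1, c) * bt.at(r2, c);
            m.at(r1, r2) = sum;
        }
    return m;
}
```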
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }I am no expert at optimizing D code, so this is a bit of a shot in the dark, but does your speed improve at all if you replace:this.data = new T[this.dim.size];with: this.data.length = this.dim.size
Feb 14 2014
On Friday, 14 February 2014 at 16:40:31 UTC, Craig Dillabaugh wrote:On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:Why would that make a difference here? They're (almost) identical are they not? Certainly the body of the work, the allocation itself, is the same.this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }I am no expert at optimizing D code, so this is a bit of a shot in the dark, but does your speed improve at all if you replace:this.data = new T[this.dim.size];with: this.data.length = this.dim.size
Feb 14 2014
On Friday, 14 February 2014 at 16:47:32 UTC, John Colvin wrote:On Friday, 14 February 2014 at 16:40:31 UTC, Craig Dillabaugh wrote:Well, in my defense I did start with a disclaimer. For some reason I thought using .length was the proper way to size arrays. It is likely faster if you are resizing an existing array (especially downward), but in this case new memory is being allocated anyway. In fact I ran some tests (just allocating a matrix over and over) and it seems using new is consistently faster than using .length (by about a second for 50000 matrices). Shows how much I know.On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:Why would that make a difference here? They're (almost) identical are they not? Certainly the body of the work, the allocation itself, is the same.this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }I am no expert at optimizing D code, so this is a bit of a shot in the dark, but does your speed improve at all if you replace:this.data = new T[this.dim.size];with: this.data.length = this.dim.size
Feb 14 2014
Craig Dillabaugh:It's the same thing. Bye, bearophilethis.data = new T[this.dim.size];with: this.data.length = this.dim.size
Feb 14 2014
On Friday, 14 February 2014 at 16:56:29 UTC, bearophile wrote:Craig Dillabaugh:Not quite. Setting length will copy over the existing contents of the array. Using new simply sets every element to .init. Granted, this.data is empty meaning there's nothing to copy over, so there's a negligible overhead which may be optimized out by the compiler anyways. There's also the uninitializedArray function:It's the same thing. Bye, bearophilethis.data = new T[this.dim.size];with: this.data.length = this.dim.size
Feb 14 2014
Robin:class Matrix(T = double) { private T[] data; private Dimension dim; }Also try "final class" or struct in your benchmark. And try to use the ldc2 compiler for performance benchmarks. Perhaps dim is better const, unless you want to change the shape of the matrix.this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }Better: this(in size_t rows, in size_t cols) pure nothrow { this.dim = Dimension(rows, cols); this.data = minimallyInitializedArray!(typeof(data))(this.dim.size); this.data[] = to!T(0); }I experienced that floating point values are sadly initialized with nan which isn't what I wanted -> double.init = nan.This is actually a good feature of the D language :-)T opIndex(size_t row, size_t col) const { immutable size_t i = this.dim.offset(row, col); if (i >= this.dim.size) { // TODO - have to learn exception handling in D first. :P }Look in the D docs for contract programming. Also add the pure nothrow attributes.Which calls: size_t size() @property const { return this.rows * this.cols; } I think and hope that this is getting optimized via inlining. =)Currently class methods are virtual, and currently D compilers are not inlining virtual calls much. The solution is to use a final class, final methods, etc.This works similarly for opIndexAssign. The function I am mainly benchmarking is the simple matrix multiplication where one of the multiplied matrices is transposed first in order to improve cache hit ratio.Transposing the whole matrix before the multiplication is not efficient. Better to do it more lazily.auto m = new Matrix(this.dim.rows, other.dim.cols); auto s = new Matrix(other).transposeAssign(); size_t r1, r2, c; T sum; for (r1 = 0; r1 < this.dim.rows; ++r1) { for (r2 = 0; r2 < other.dim.rows; ++r2) {Better to define r1, r2 inside the for. 
Or often even better to use a foreach with an interval.D compiled with DMD takes about 14 seconds with all (known-to-me) optimize flag activated. (listed above)DMD is not efficient with floating point values. Try the ldc2 or gdc compilers (I suggest ldc2).I wanted to make s an immutable matrix in order to hopefully improve performance via this change,Most times immutability worsens performance, unless you are passing data across threads.however I wasn't technically able to do this to be honest.It's not too hard to use immutability in D.Besides that I can't find a way how to make a use of move semantics like in C++. For example a rvalue-move-constructor or a move-assign would be a very nice thing for many tasks.That adds too much complexity to the language.Another nice thing to know would be if it is possible to initialize an array before it is default initialized with T.init where T is the type of the array's fields. In C++ e.g. there is no default initialization which is nice if you have to initialize every single field anyway. E.g. in a Matrix.random() method which creates a matrix with random values. There it is unnecessary that the (sometimes huge) array is completely initialized with the type's init value.It's done by the function that I have used above, from the std.array module. Bye, bearophile
Feb 14 2014
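bearophile's point about minimallyInitializedArray - don't pay for a full T.init pass when every element is about to be overwritten anyway - has a close C++ analogue. A hedged sketch (the function name `fillSequential` and its values are invented for illustration):

```cpp
#include <cstddef>
#include <vector>

// Reserve capacity only (a single allocation, no element initialisation),
// then construct each element with its real value via push_back.
std::vector<double> fillSequential(std::size_t n) {
    std::vector<double> v;
    v.reserve(n);                        // no default-initialisation here
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(0.5 * double(i));    // illustrative values
    return v;
}
```

This matches the Matrix.random() scenario from the original post: the array is filled exactly once, so the default-initialisation pass is pure waste.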
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:This is my class with its templated data as a one-dimensional array (as I don't like jagged arrays) and a dimension (which is a struct). The dimension object has some util functionality such as getting the total size or mapping (row, col) to a one-dimensional index besides the getters for rows and columns.Are you sure you need a class here? If you're not using inheritance, structs can be much faster.this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }You don't need to use std.conv.to here (ditto for when you initialize sum later). This should work and is clearer: foreach(ref T element; this.data) element = 0; (You can do `double item = 0;` and it knows enough to store the double equivalent of 0) In the case of initializing sum, I'm not sure the compiler is folding `sum = to!T(0);` thus you could be getting a lot of overhead from that. Using std.conv.to is fine but it does actually do checks in the conversion process, so it can be expensive in tight loops. Since you're just using a constant 0 here, there's no reason to use std.conv.to.The function I am mainly benchmarking is the simple matrix multiplication where one of the multiplied matrices is transposed first in order to improve cache hit ratio.Did you benchmark first to make sure this actually improves performance? It seems to me like doing the transpose would incur the same cache misses that you'd get in the actual use. If you were caching it and using it multiple times, this would probably be more beneficial. General tip for optimizing: benchmark before and after every "optimization" you do. 
I've been surprised to find some of my ideas for optimizations slowed things down before.Another nice thing to know would be if it is possible to initialize an array before it is default initialized with T.init where T is the type of the array's fields.
Feb 14 2014
Chris Cain:minimallyInitializedArray should be used, because it's safer. Bye, bearophile
Feb 14 2014
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:class Matrix(T = double) { T opIndex(size_t row, size_t col) const {First of all make sure it's not virtual, otherwise each element access will cost you enough to make it 10x slower than Java.
Feb 14 2014
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:As I am very new to D I instantly ran into certain optimization issues. E.g. the simple matrix multiplication based on my Java implementation requires about 1.5 secs for multiplying two 1000x1000 matrices, however my D implementation requires 39 seconds without compiler optimizations and about 14 seconds with -inline, -O, -noboundscheck and -release. So I am sure that I just missed some possible optimization routines for D and that I could miss entire patterns used in D to write more optimized code - especially when looking at the const and immutability concepts.Try with and without -noboundscheck. Sometimes using it seems to make things slower, which is odd. Also, compiling for 64-bit may help significantly. When you compile for 32-bit, you won't get the benefits of SIMD instructions as they're not guaranteed to be supported by the CPU. Since Java runs a JIT, it would use these SIMD instructions. Also, you'll get significantly faster code if you use LDC or GDC.class Matrix(T = double) { private T[] data; private Dimension dim; }This should be a final class or a struct; you don't need virtual methods.This is my class with its templated data as a one-dimensional array (as I don't like jagged arrays) and a dimension (which is a struct). The dimension object has some util functionality such as getting the total size or mapping (row, col) to a one-dimensional index besides the getters for rows and columns. this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }You don't need to!T for setting to 0; just use plain 0. Using to!T here will probably result in overhead. Note that it may be evaluated every time you use 'nil', so using a constant will help.T opIndex(size_t row, size_t col) const { immutable size_t i = this.dim.offset(row, col); if (i >= this.dim.size) { // TODO - have to learn exception handling in D first. 
:P } return this.data[i]; }You're using -noboundscheck then manually implementing bounds checks, and probably less efficiently than the compiler does. Either change this to an assert, or remove it. In either case, the compiler will do its own built-in bounds checking. This call is especially expensive since you're doing entirely virtual methods because you specified class instead of final class or using a struct. None of your Matrix calls will be inlined.Which calls: size_t size() @property const { return this.rows * this.cols; } I think and hope that this is getting optimized via inlining. =) This works similarly for opIndexAssign.Not unless you use final class / struct.The function I am mainly benchmarking is the simple matrix multiplication where one of the multiplied matrices is transposed first in order to improve cache hit ratio. Matrix opMul(ref const(Matrix) other) { if (this.dim.rows != other.dim.cols || this.dim.cols != other.dim.rows) { // TODO - still have to learn exception handling first ... } auto m = new Matrix(this.dim.rows, other.dim.cols); auto s = new Matrix(other).transposeAssign(); size_t r1, r2, c; T sum; for (r1 = 0; r1 < this.dim.rows; ++r1) { for (r2 = 0; r2 < other.dim.rows; ++r2) { sum = to!T(0); for (c = 0; c < this.dim.cols; ++c) { sum += this[r1, c] * s[r2, c]; } m[r1, r2] = sum; } } return m; }These allocations may potentially hurt. Also, again, the virtual calls here are a huge hit. Using to!T(0) is also a waste of performance, as you can just do straight 0.I am aware that there are faster algorithms for matrix multiplication but I am mainly learning how to write efficient D code in general and not based on any algorithm.Using SIMD instructions manually would of course be much, much, faster, but I understand that it defeats the point somewhat and is much more ugly.This is more or less the same algorithm I am using with Java and C++. Both require about 1.5 secs (after Java warm-up) to perform this task for two 1000x1000 matrices. 
D compiled with DMD takes about 14 seconds with all (known-to-me) optimize flag activated. (listed above)If you used LDC or GDC 64-bit with the changes above, I'd guess it would be similar. I wouldn't expect D to out-perform Java much for this particular scenario, there's very little going on and the JIT should be able to optimize it quite well with SIMD stuff.I wanted to make s an immutable matrix in order to hopefully improve performance via this change, however I wasn't technically able to do this to be honest.I don't think this would improve performance.
Feb 14 2014
On 2/14/2014 11:00 AM, Robin wrote:class Matrix(T = double) { private T[] data; private Dimension dim; }A matrix is just plain-old-data, so use a struct, you don't need a class. A struct will be much more lightweight: A struct doesn't normally involve memory allocations like a class does, and you'll get better data locality and less indirection, even compared to a final class.I am using opIndex and opIndexAssign in order to access and assign the matrix values: T opIndex(size_t row, size_t col) const { immutable size_t i = this.dim.offset(row, col); if (i >= this.dim.size) { // TODO - have to learn exception handling in D first. :P } return this.data[i]; }No need for the bounds check. D already does bounds checks automatically (unless you compile with -noboundscheck, but the whole *point* of that flag is to disable bounds checks.) But that said, I don't know whether the compiler might already be optimizing out your bounds check anyway. So try it and profile, see what happens.Another nice thing to know would be if it is possible to initialize an array before it is default initialized with T.init where T is the type of the array's fields. In C++ e.g. there is no default initialization which is nice if you have to initialize every single field anyway. E.g. in a Matrix.random() method which creates a matrix with random values. There it is unnecessary that the (sometimes huge) array is completely initialized with the type's init value.You can opt-out of the default initialization with a void initializer: http://dlang.org/declaration.html#VoidInitializer Although to be honest I forget how to do that for arrays, and the functions other people already suggested for creating/initing your array probably already do that anyway.
Feb 14 2014
Nick Sabalausky:But the row-column bounds of the matrix are not the same as the 1D bounds of the 1D array. You can have out-of-column-bounds and still be inside the 1D bounds. So that test should go in the precondition and it should test row and col separately: T opIndex(in size_t row, in size_t col) const pure nothrow in { assert(row < nRows); assert(col < nCols); } body { return data[dim.offset(row, col)]; } Bye, bearophileT opIndex(size_t row, size_t col) const { immutable size_t i = this.dim.offset(row, col); if (i >= this.dim.size) { // TODO - have to learn exception handling in D first. :P } return this.data[i]; }No need for the bounds check. D already does bounds checks automatically
Feb 14 2014
Hiho, wow, first of all: this community is awesome - so many kind and interesting answers. =) With your help I was able to achieve a performance boost for several different operations. Some benchmarks: Allocation of 5 10,000 x 10,000 matrices in a row: Before: ~8.2 seconds After: ~2.3 seconds (with minimallyInitializedArray!) Multiplication of two 1000x1000 matrices: Before: ~14.8 seconds After: ~4.3 seconds However, I think there is still much potential to further optimize this code. Let me tell you what changes I have performed on the code so far ... Matrix is still a class but I changed it to a final class, preventing matrix methods from being virtual. Dimension is now a final struct (don't know if 'final' affects structs in any way though ...). This mainly gave the multiplication a huge performance boost. When converting the Matrix from class to struct the multiplication even dropped from ~4.3 seconds to about 3.6 seconds. However, I am currently not sure if I want matrices to be structs (value types). Besides that I tried to add nothrow and pure as attributes to every possible method. However, as it turned out I wasn't able to add the pure modifier to any method as I always called an impure method from it (as stated by the compiler). This actually made sense most of the time and I think the only pure methods now are the constructor methods of the Dimension struct. What may still speed up things? In my tests it turned out that the simple code: auto m1 = Matrix!double.random(1000, 1000, -10, 10, 0.25); auto m2 = Matrix!double.random(1000, 1000, -10, 10, 0.25); auto m3 = m1 * m2; called the normal copy constructor. This is sad as it would be a huge performance boost if it made use of move semantics. (In the C++ matrix code this scenario would actually call the move assignment operator for matrix m3, which is much, much faster than copying.) But I haven't figured out yet how to use move semantics in D with class objects. 
Or is that just possible with struct value types? I have also tried the LDC compiler. However, it gave me some strange bugs. E.g. it didn't like the following line: ref Matrix transpose() const { return new Matrix(this).transposeAssign(); } And it forced me to change it to: ref Matrix transpose() const { auto m = new Matrix(this); m.transposeAssign(); return m; } Which is kind of ugly ... I hope that this is getting fixed soon, as I assume it is correct D code because DMD is capable of compiling it correctly. Some of you suggested that transposing the matrix before the multiplication takes place must be much slower than doing without the transposition. In my former Java and C++ programs I have already tested which strategy is faster and it turned out that cache efficiency DOES matter of course. There are also papers explaining why a transposed matrix multiplication may be faster than one without transposing. In the end this is just a test suite and there is already an even faster approach to matrix multiplication which runs in O(n^2.8) instead of the O(n^3) of my current simple solution. And with the power of OpenCL (or similar) one could even lift a matrix multiplication to the GPU and boom.^^ But the current task is to find generally optimized code for the D language - this thread has already helped me a lot! I wasn't aware that D is actually capable of lazy evaluations other than the if-pseudo-lazy-evaluation, which is kind of cool. However, I still have to look that up in order to maybe find a use for it in this code. SIMD instructions also sound extremely cool but a little bit too complex regarding the fact that I am fairly new to D and still learning the basics of this language. In the end I wanted to mention something which I found very irritating. 
Bearophile stated correctly that my preconditions for the index operators are all wrong and corrected my code with some smooth additions: T opIndex(in size_t row, in size_t col) const nothrow in { assert(row < nRows); assert(col < nCols); } body { return data[dim.offset(row, col)]; } The in and body statements are cool as far as I understand what they are for. However, in the benchmarks they had a clear and noticeable negative impact on the matrix multiplication, which rose to ~8 seconds from ~4 seconds with his code compared to mine. When leaving out the in and body statement blocks and using only one normal block as follows, it stayed at the 4 second duration for that task, and so I think that the compiler won't optimize things in an 'in' code block - is that right? T opIndex(in size_t row, in size_t col) const nothrow { assert(row < nRows); assert(col < nCols); return data[dim.offset(row, col)]; } Thanks again for all your helpful comments and thanks in advance - I am eagerly looking forward to your future comments! =) Robin
Feb 15 2014
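The assert-in-the-body pattern Robin lands on has a direct C++ counterpart: an assert documents the row/column bounds and is compiled out when NDEBUG is defined, much like D's contracts (and asserts) disappear under -release. A hedged illustration, with invented names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rough C++ analogue of the checked opIndex: bounds are asserted
// separately for row and column, and both checks vanish in release
// builds (compile with -DNDEBUG).
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double at(std::size_t row, std::size_t col) const {
        assert(row < rows);   // gone when NDEBUG is defined
        assert(col < cols);
        return data[row * cols + col];
    }
};
```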
On Saturday, 15 February 2014 at 23:06:27 UTC, Robin wrote:Matrix is still a class but I changed it to a final class preventing matrix methods to be virtual. Dimension is now a final struct (don't know if 'final' is affecting structs in any way tough ...).It doesn't. It may even be disallowed someday, when it's fixed :)This mainly gave the multiplication a huge performance boost. When converting the Matrix to a struct from class the multiplication even lowered from ~4.3 seconds to about 3.6 seconds. However, I am currently not sure if I want matrices to be structs (value types).Make them value types. If you're using dynamic arrays for storage, you're already using GC plenty, no need for additional allocations (and see below for copy and move semantics). Don't forget a postblit though.(In the C++ matrix codes this scenario would actually call move assignment operator for matrix m3 which is much much faster than copying.)D performs return value optimization: it moves result whenever it can.But I haven't figured out yet how to use move semantics in D with class objects. Or is that just possible with struct value types?Yup, it "just works". In most cases, anyway. But move semantics in D differ from C++: D doesn't have rvalue references, and thus the compiler only performs a move when it can prove that value will no longer be used (there's a lengthy description in TDPL but I'm too lazy now to fetch it for exact citation :) ). For explicit moves, there's std.algorithm.move(), though AFAIK it's underimplemented at the moment.I have also tried the LDC compiler. However, it gave me some strange bugs. E.g. it didn't like the following line: ref Matrix transpose() const { return new Matrix(this).transposeAssign(); } And it forced me to change it to: ref Matrix transpose() const { auto m = new Matrix(this); m.transposeAssign(); return m; } Which is kind of ugly ... 
I hope that this is getting fixed soon, as I imply it as correct D code because the DMD is capable to compile this correctly.Yes, this should be allowed. But you could also just write (new Matrix(this)).transposeAssign() :) Also, there's no point in ref return when you're returning a reference to class instance.T opIndex(in size_t row, in size_t col) const nothrow in { assert(row < nRows); assert(col < nCols); } body { return data[dim.offset(row, col)]; } The in and body statements are cool as far as I realize what they are for. ...so I think that the compiler won't optimize things in an 'in' code block - is that right?in/out contracts are debug creatures anyway, they're not present in -release builds.
Feb 15 2014
Robin:Thanks again for all your helpful comments and thanks in advance - I am eagerly looking forward to your future comments! =)Many of the things you say and write are still wrong or confused. Usually the hard optimization of code is one of the last things you learn about a language, because to do it you need a very clear picture of what the language is doing in every spot (this is true for most languages, like C, C++, Haskell, Python, etc). And you are not yet at this point. Tomorrow I will answer some things in your post, but in the meantime if you want you can show your whole D program, and tomorrow I'll try to make it faster for LDC2 and explain it some. Bye, bearophile
Feb 15 2014
In the meantime also take a look at the matrix mul code I've written here: http://rosettacode.org/wiki/Matrix_multiplication#D In the first entry you see no whole transposition up front. Storing the whole matrix in a 1D array is probably more efficient. Bye, bearophile
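The 1D-array idea can be sketched roughly like this (a hypothetical minimal type assuming row-major layout - not the Rosetta Code version itself):

```d
import std.stdio;

// Minimal matrix stored in a single 1D array, row-major.
struct Mat {
    double[] data;
    size_t rows, cols;

    this(size_t r, size_t c) {
        rows = r; cols = c;
        data = new double[](r * c);
        data[] = 0.0;
    }

    // Index (r, c) maps to r * cols + c in the flat array.
    ref double opIndex(size_t r, size_t c) {
        return data[r * cols + c];
    }
}

// Multiply using a transposed copy of b so both inner operands
// are scanned sequentially (cache-friendly).
Mat mul(ref Mat a, ref Mat b) {
    assert(a.cols == b.rows);
    auto t = Mat(b.cols, b.rows);
    foreach (r; 0 .. b.rows)
        foreach (c; 0 .. b.cols)
            t[c, r] = b[r, c];

    auto result = Mat(a.rows, b.cols);
    foreach (r; 0 .. a.rows) {
        foreach (c; 0 .. b.cols) {
            double sum = 0.0;
            foreach (k; 0 .. a.cols)
                sum += a[r, k] * t[c, k]; // both walk contiguous memory
            result[r, c] = sum;
        }
    }
    return result;
}

void main() {
    auto a = Mat(2, 2);
    a[0, 0] = 1; a[0, 1] = 2; a[1, 0] = 3; a[1, 1] = 4;
    auto b = Mat(2, 2);
    b[0, 0] = 5; b[0, 1] = 6; b[1, 0] = 7; b[1, 1] = 8;
    auto c = mul(a, b);
    writeln(c.data); // [19, 22, 43, 50]
}
```

The single allocation per matrix and the purely additive index arithmetic are what make the flat layout attractive compared to an array of rows.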
Feb 15 2014
On Sunday, 16 February 2014 at 01:25:13 UTC, bearophile wrote:Many of the things you say and write are still wrong or confused. Usually the hard optimization of code is one the last things you learn about a languageWell, actually, sometimes squeezing as much performance as you can from a test case can be a way to find out if a given language checks all the boxes and can be used to solve your problems. I must admit I'm shocked by the poor performance of D here. But I also know one HAS to try LDC or GDC, or those numbers are really meaningless.
Feb 16 2014
francesco cattoglio:Well, actually, sometimes squeezing as much performance as you can from a test case can be a way to find out if a given language checks all the boxes and can be used to solve your problems.Unless you know a language well, you will fail at squeezing that. This is true even in Python.I must admit I'm shocked by the poor performance of D here. But I also know one HAS to try LDC or GDC, or those numbers are really meaningless.Be instead amazed at the sometimes near-C++ performance levels they have pushed Java to :-) And I think D is doing well in a benchmark like this; I guess with ldc2 you can go about as fast as C++. Bye, bearophile
Feb 16 2014
On Sunday, 16 February 2014 at 09:53:08 UTC, bearophile wrote:Be instead amazed of the sometimes near-C++ performance levels they have pushed Java to :-)Sorry, but I fail to be amazed by what huge piles of money thrown to a project for 10+ years can achieve. Those kind of achievements are boring :PAnd I think D is doing well in a benchmark like this, I guess with ldc2 you can go about as fast as C++.I'm sure that other "small benchmarks" on computational-intensive code achieved C++ speed in the past, so if we can't get C++-like speed here, it's either a problem with the code or some serious performance regression.
Feb 16 2014
Hiho, thanks again for your answers! I don't know where to start ... Stanislav Blinov: I am not currently aware of the "move semantics" in D or what it is called there. However, I am not quite sure if D does an equally good job as C++ in determining if a value can be moved or not. I have made some simple tests which showed me that the swap-assignment and the move-constructor are never called in D in code where C++ would certainly use them; instead D just copies values which were in fact rvalues ... (Maybe I just don't get the whole thing behind the scenes - and I am really hoping I do.) My statement that the "in" and "body" block of the assignment operator slows down the whole code isn't true anymore. Don't know what fixed it though. bearophile: Yeah, you are completely right - I am super confused at the moment. I always thought that D would be much easier to learn than C++, as all the people always say that C++ is the most complex language. After what I have learned so far, D seems to be much more complex, which isn't bad in general, but the learning curve doesn't feel so good atm as I am mainly confused by many things instead of getting what happens behind the scenes. (e.g. move semantics, the whole army of modifiers, immutability which results in zero performance boost in non-parallel environments; the fact that classes are by-ref and structs are by-val is also confusing from time to time, to be honest.) When I started to write the C++ matrix library, after having written the Java matrix library to find out what both languages can achieve, I also had to start learning C++. I bought a book, and after 2-3 weeks I got the concepts and everything went as expected, and nothing seemed to be hidden behind the scenes from me, the programmer, which was in fact a nice experience when trying to optimize the code until it was as fast as the Java code or even faster. D seems to be more problematic when it comes to learning how to optimize the code. 
To summarize, I am still a beginner and learn this language as it has got cool, new and advanced features, it has not so many legacy issues (like C++), it unites object-oriented, functional as well as imperative and parallel programming paradigms (still amazed by that fact!), runs native and isn't bound to any VM - that's why I am learning D! =) I would love to see how much performance a D veteran like you is able to grab out of my code. Just tell me where I shall upload it and I will do so. Don't expect too much as I am still a beginner and as this project has just started. I didn't want to implement the whole functionality before I know how to do it with a good result. ;-) The matrix multiplication you have posted has some cool features in it. I especially like the pure one with all that functional stuff. :D Nick Sabalausky: Don't be shocked at the bad performance of a beginner D programmer. At least I am not shocked yet that my D implementation isn't performing as well as the highly optimized C++11 implementation or the JIT-optimized Java implementation. In general one cannot compare the results of a native compilation optimization, which shall run on different hardware, and a JIT optimization, which is able to pull out every single optimization routine because it knows the exact system it compiles for. There are also LDC2 and GDC which should improve the general performance quite a bit. However, what is true is that it is (at least for me) harder to learn how to write efficient code for D than it was for C++. And keep in mind that my Java implementation (which runs nearly as fast as the C++ implementation) is written just to fit for performance. I had to do dirty things (against the general Java mentality) in order to achieve this performance. Besides that, I couldn't implement a generic solution in Java which wasn't five to ten times slower than without generics, as Java generics force boxing at runtime. 
So all in all Java is only good in performance if you don't use it as Java (purely object-oriented etc.) ... but it works! :D All of you: What I have done since the last time: I have changed my mind - matrix is now a value type. However, I wasn't sure about this step because I planned to implement dense and sparse matrices which could inherit from a basic matrix implementation and thus would need to be classes. I have also removed that nasty "final struct" statement.^^ However, benchmarks have stayed the same since the last post. Hopefully I haven't forgotten to say anything important ... Robin
Feb 16 2014
On Sunday, 16 February 2014 at 23:47:08 UTC, Robin wrote:I am not currently aware of the "move semantics" in D or what it is called there. However, I am not quite sure if D does an equally good job as C++ in determining if a value can be moved or not. I have made some simple tests which showed me that the swap-assignment and the move-constructor are never called in D at code where C++ would certainly take them, instead D just copies values which were in fact rvalues ... (Maybe I just don't get the whole thing behind the scenes - and I am really hoping it.)What swap assignment, what move constructor? :) Here are the rules for moving rvalues in D: 1) All anonymous rvalues are moved; 2) Named temporaries that are returned from functions are moved; 3) That's it :) No other guarantees are offered (i.e. compilers may or may not try to be smart in other scenarios). Sometimes when you want an explicit move you can use std.algorithm.move function. But keep in mind that its current implementation is incomplete (i.e. it doesn't handle immutability properly). This code illustrates it all: http://dpaste.dzfl.pl/c050c9fccbb2All of you: What I have done since the last time: I have changed my mind - matrix is now a value type. However, I wasn't sure about this step because I planned to implement dense and sparse matrices which could inherit from a basic matrix implementation in thus required to be classes.Templates, perhaps?
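Those two rules can be made visible with a small experiment (a hypothetical struct S; a postblit counter shows which initializations are true copies and which are moves):

```d
import std.stdio;

// Hypothetical struct whose postblit counts true copies,
// so moves (which skip the postblit) become observable.
struct S {
    static int copies;
    int[] payload;
    this(this) { ++copies; }
}

S make() {
    auto tmp = S([1, 2, 3]); // named temporary...
    return tmp;              // ...returned from a function: moved (rule 2)
}

void main() {
    S a = make();       // no postblit: the named return value was moved
    S b = S([4, 5, 6]); // no postblit: anonymous rvalue, moved (rule 1)
    S c = a;            // `a` is still usable afterwards, so this IS a copy
    writeln("copies: ", S.copies); // prints "copies: 1"
}
```

Only the lvalue initialization of `c` triggers the postblit; both "rvalue" initializations are moves, matching the rules above.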
Feb 16 2014
Robin:I always thought that D would be much easier to learn than C++ as all the people always say that C++ is the most complex language. After what I have learned so far D seems to be much more complex, which isn't bad in general, but the learning curve doesn't feel so good atm as I am mainly confused by many things instead of getting what happens behind the scenes.D is not a small language, it contains many features, but compared to C++ it has fewer corner cases, fewer traps, and by default it's often safer than C++. In C++ you need to take care of many details if you want to write correct code. Writing D code should be faster and safer than C++.immutability which results in zero performance boost in non-parallel environments,A good D compiler like ldc2 is sometimes able to better optimize code that uses immutable/const. But immutability is also for the programmer: to make the code simpler to reason about, and safer.the fact that classes are by-ref and structs are by-val is also confusing from time to time to be honest.)D seems to be more problematic when it comes to learning how to optimize the code.Perhaps because D also focuses a little more on correctness of the code. You see this even more in the Ada language.I would love to see how much performance a D veteran like you is able to grab out of my code. Just tell me where I shall upload it and I will do so.I guess your code is composed of just one or two small D modules, so you can post it here: http://dpaste.dzfl.pl/ You can also upload the Java code there, to help me compare and to know what you were trying to write :-)However, what is true is, that it is (at least for me) harder to learn how to write efficient code for D than it was to learn it for C++.But I think on average it's a little harder to write correct code in C++ compared to D. Bye, bearophile
Feb 16 2014
Hiho, I currently haven't got enough time to respond to all what have been said since my last post. However, here is the simple code: http://dpaste.dzfl.pl/3df8694eb47d Thanks in advance! I am really looking forward to your editing! =) Robin
Feb 17 2014
On Monday, 17 February 2014 at 09:48:49 UTC, Robin wrote:Hiho, I currently haven't got enough time to respond to all what have been said since my last post. However, here is the simple code: http://dpaste.dzfl.pl/3df8694eb47d Thanks in advance! I am really looking forward to your editing! =) RobinA few quick points:
1) foreach loops are nice, use them
2) D has neat array-ops that can simplify your code for == and -= etc... e.g.

/**
 * Subtracts all entries of this matrix by the given other matrix.
 */
ref Matrix opSubAssign(ref const(Matrix) other) nothrow
in {
    assert(this.dim == other.dim);
} body {
    this.data[] -= other.data[];
    return this;
}

3) Although your generic indexing is nice, a lot of operations that you have implemented have access patterns that allow faster indexing without a multiplication, e.g.

/**
 * Creates a new identity matrix of the size specified by the given
 * rows and columns.
 */
static Matrix identity(size_t rows, size_t cols) nothrow {
    auto m = Matrix(rows, cols);
    size_t index = 0;
    foreach (i; 0 .. m.dim.min) {
        m.data[index] = 1;
        index += cols + 1;
    }
    return m;
}
Feb 17 2014
Robin:http://dpaste.dzfl.pl/3df8694eb47dI think it's better to improve your code iteratively. First starting with small simple things:this(size_t rows, size_t cols) nothrow pure {=> this(in size_t rows, in size_t cols) nothrow pure {this(ref const(Dimension) that) nothrow pure {=> this(const ref Dimension that) nothrow pure {ref Dimension opAssign(ref Dimension rhs) nothrow pure {Is this method necessary?bool opEquals(ref const(Dimension) other) nothrow const {Not necessary.size_t min() property nothrow const { if (this.rows <= this.cols) return this.rows; else return this.cols; }=> (I have changed its name to avoid confusion with std.algorithm.min): property size_t minSide() const pure nothrow { return (rows <= cols) ? rows : cols; }size_t size() property nothrow const {=> property size_t size() const pure nothrow {size_t offset(size_t row, size_t col) nothrow const {=> size_t offset(in size_t row, in size_t col) const pure nothrow {this(in size_t rows, in size_t cols, bool initialize = true) nothrow pure { this.dim = Dimension(rows, cols); this.data = minimallyInitializedArray!(typeof(this.data))(rows * cols); if (initialize) this.data[] = to!T(0); }=> this(in size_t rows, in size_t cols, in bool initialize = true) nothrow pure { ... if (initialize) this.data[] = 0;/** * */Sometimes I prefer to use just one line of code to reduce wasted vertical space: /// Move constructor. From rvalue.this(ref const(Matrix) that) {=> this(const ref Matrix that) pure {this.data = that.data.dup;Currently array.dup is not nothrow. This will eventually be fixed.ref Matrix transpose() const { return Matrix(this).transposeAssign();=> ref Matrix transpose() const pure nothrow { return Matrix(this).transposeAssign;ref Matrix transposeAssign() nothrow { size_t r, c; for (r = 0; r < this.dim.rows; ++r) { for (c = 0; c < r; ++c) { std.algorithm.swap( this.data[this.dim.offset(r, c)], this.data[this.dim.offset(c, r)]); } } return this; }Use immutable foreach loops. 
And even when you use for loops always define the loop variable inside the for: "for (size_t r = 0; ...". Also in general I suggest you to write a little less code. A first try (not tested), but probably this can be done with much less multiplications: ref Matrix transposeAssign() pure nothrow { foreach (immutable r; 0 .. dim.rows) { foreach (immutable c; 0 .. r) { swap(data[dim.offset(r, c)], data[dim.offset(c, r)]); } } return this; }Matrix opMul(ref const(Matrix) other)Never use old-style operator overloading in new D code. Use only the new operator overloading style.auto s = Matrix(other).transposeAssign();=>auto s = Matrix(other).transposeAssign;size_t r1, r2, c;Ditto.foreach (ref T element; this.data) { element *= scalar; }Try to use: data[] *= scalar;for (auto i = 0; i < this.dim.size; ++i) { this.data[i] += other.data[i]; }Try to use: data[] += other.data[];for (auto i = 0; i < this.dim.size; ++i) { this.data[i] -= other.data[i]; }Try to use: data[] -= other.data[];bool opEquals(ref const(Matrix) other) nothrow const { if (this.dim != other.dim) { return false; } for (auto i = 0; i < this.dim.size; ++i) { if (this.data[i] != other.data[i]) return false; } return true; }Try (in general try to write less code): return dim == other.dim && data == other.data;auto stringBuilder = std.array.appender!string;=> Appender!string result;stringBuilder.put("Matrix: " ~ to!string(this.dim.rows) ~ " x " ~ to!string(this.dim.cols)=> result ~= text("Matrix: ", dim.rows, " x ", dim.colsauto m = Matrix(rows, cols); for (auto i = 0; i < m.dim.min; ++i) { m[i, i] = 1; }See the comment by John Colvin.auto generator = Random(unpredictableSeed);What if I want to fill the matrix deterministically, with a given seed? In general it's better to not add a random matrix method to your matrix struct. 
Better to add an external free function that uses UFCS and is more flexible and simpler to reason about.writeln("\tTime required: " ~ to!string(secs) ~ " secs, " ~ to!string(msecs) ~ " msecs");Use .text to stringify, but here you don't need to stringify at all: => writeln("\tTime required: ", secs, " secs, ", msecs, " msecs");sw.reset();=> sw.reset;void multiplicationTest() {In D there is also unittest {}. A benchmark is not exactly a unit test, so do as you prefer. Bye, bearophile
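The "new style" operator overloading mentioned in this review means the templated opBinary/opOpAssign hooks instead of the D1-era opMul/opAdd names; here is a minimal sketch (a hypothetical Vec type, not the poster's Matrix):

```d
// Hypothetical array-backed vector demonstrating new-style operator
// overloading: one template covers several operators via string mixins.
struct Vec {
    double[] data;

    // +, - and element-wise * between two Vecs (new-style overloading).
    Vec opBinary(string op)(in Vec other) const pure
        if (op == "+" || op == "-" || op == "*")
    {
        auto result = Vec(data.dup);
        mixin("result.data[] " ~ op ~ "= other.data[];");
        return result;
    }

    // Scalar compound assignment: v *= 2.0
    ref Vec opOpAssign(string op)(in double scalar) pure nothrow
        if (op == "*")
    {
        mixin("data[] " ~ op ~ "= scalar;");
        return this;
    }
}

void main() {
    auto a = Vec([1.0, 2.0, 3.0]);
    auto b = Vec([4.0, 5.0, 6.0]);
    auto c = a + b; // lowered to a.opBinary!"+"(b)
    c *= 2.0;       // lowered to c.opOpAssign!"*"(2.0)
    assert(c.data == [10.0, 14.0, 18.0]);
}
```

The template constraint decides which operators exist, and the compiler rewrites `a + b` into the template call, so one body serves several operators without duplicated code.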
Feb 17 2014
Hiho, thank you for your code improvements and suggestions. I really like the foreach loop in D as well as the slight (but existing) performance boost over conventional for loops. =) Another success of the changes I have made is that I have managed to further improve the matrix multiplication performance from 3.6 seconds for two 1000x1000 matrices to 1.9 seconds, which is already very close to Java and C++ with about 1.3 - 1.5 seconds. The key to victory was pointer arithmetic, as I noticed that I have used it in the C++ implementation, too. xD The toString implementation has improved its performance slightly due to the changes you have mentioned above: 1.37 secs -> 1.29 secs I have also adjusted all operator overloadings to the "new style" - I just hadn't known about that "new style" until then - thanks! I will just post the whole code again so that you can see what I have changed. Keep in mind that I am still using DMD as compiler and thus performance may still rise once I use another compiler! All in all I am very happy with the code analysis and its improvements! However, there were some strange things of which I am very confused ... void allocationTest() { writeln("allocationTest ..."); sw.start(); auto m1 = Matrix!double(10000, 10000); { auto m2 = Matrix!double(10000, 10000); } { auto m2 = Matrix!double(10000, 10000); } { auto m2 = Matrix!double(10000, 10000); } //{ auto m2 = Matrix!double(10000, 10000); } sw.stop(); printBenchmarks(); } This is the most confusing code snippet. I have just changed the whole allocation for all m1 and m2 from new Matrix!double (on heap) to Matrix!double (on stack) and the performance dropped significantly - the benchmarked time rose from 2.3 seconds to over 25 seconds!! Now look at the code above. When I leave it as it is now, the code requires about 2.9 seconds runtime; however, when enabling the currently commented-out line the code takes 14 to 25 seconds longer! mind blown ... 
0.o This is extremely confusing as I allocate these matrices on the stack, and since I have allocated them within their own scoped blocks they should instantly release their memory again, so that no memory consumption takes place for more than 2 matrices at the same time. This just wasn't the case as far as I have tested it. Another strange thing was that the new opEquals implementation: bool opEquals(const ref Matrix other) const pure nothrow { if (this.dim != other.dim) { return false; } foreach (immutable i; 0 .. this.dim.size) { if (this.data[i] != other.data[i]) return false; } return true; } is actually about 20% faster than the one you have suggested, with the single line of "return (this.dim == other.dim && this.data[] == other.data[])". The last thing I haven't quite understood is that I tried to replace auto t = Matrix(other).transposeAssign(); in the matrix multiplication algorithm with its shorter and clearer form auto t = other.transpose(); // sorry for the nasty '()', but I like them! :/ This however gave me wonderful segmentation faults at runtime while using the matrix multiplication ... And here is the complete and improved code: http://dpaste.dzfl.pl/7f8610efa82b Thanks in advance for helping me! =) Robin
Feb 17 2014
Robin:The key to victory was pointer arithmetics as I notices that I have used them in the C++ implementation, too. xDPerhaps with LDC2 it's not necessary.I will just post the whole code again so that you can see what I have changed.The code looks better. There's no need to put Dimension in another module. In D modules contain related stuff, unlike Java. Also feel free to use some free-standing functions. With UFCS they get used in the same way, and they help making classes/structs simpler. Some of your imports could be moved to more local scopes, instead of being all at module-level.result ~= to!string(this[r, c]);=> result ~= this[r, c].text;writeln("\tTime required: " ~ to!string(secs) ~ " secs, " ~ to!string(msecs) ~ " msecs");=> writeln("\tTime required: ", secs, " secs, ", msecs, " msecs");ref Matrix opOpAssign(string s)(in T scalar) pure nothrow if (s == "*") {=> ref Matrix opOpAssign(string op)(in T scalar) pure nothrow if (op == "*") { Also you have two functions with code like: this.data[] op= scalar; You can define a single template (untested): /** * Adds or subtracts all entries of this matrix by the given other matrix. */ ref Matrix opOpAssign(string op)(const ref Matrix other) pure nothrow if (op == "+" || op == "-") in { assert(this.dim == other.dim); } body { mixin("this.data[] " ~ op ~ "= other.data[];"); return this; }Matrix opBinary(string s)(const ref Matrix other) const pure if (s == "*")Given that on a 32 bit system a Matrix is just 16 bytes, it could be better to not accept the argument by ref, and avoid one more level of indirection: Matrix opBinary(string s)(in Matrix other) const pure if (s == "*")However, there were some strange things of which I am very confused ... 
void allocationTest() { writeln("allocationTest ..."); sw.start(); auto m1 = Matrix!double(10000, 10000); { auto m2 = Matrix!double(10000, 10000); } { auto m2 = Matrix!double(10000, 10000); } { auto m2 = Matrix!double(10000, 10000); } //{ auto m2 = Matrix!double(10000, 10000); } sw.stop(); printBenchmarks(); } This is the most confusing code snippet. I have just changed the whole allocation for all m1 and m2 from new Matrix!double (on heap) to Matrix!double (on stack)The matrix data is always on the heap. What ends on the stack is a very limited amount of stuff.This is extremely confusing as I allocate these matrices on the stack and since I have allocated them within their own scoped-block they should instantly release their memoryYou are mistaken. minimallyInitializedArray allocates memory on the GC heap (and there's not enough space on the stack to allocate 10000^2 doubles). In both D and Java the deallocation of memory managed by the GC is not deterministic, so it's not immediately released at scope exit. Also unlike the Java GC, the D GC is less refined, and by design it is currently not precise. With such large arrays often there are _inbound_ pointers that keep the memory alive, especially on 32 bit systems. So perhaps your problems are caused by the GC. You can have deterministic memory management in your struct-based matrices, but you have to allocate the memory manually (from the GC or probably better from the C heap) and free it in the struct destructor using RAII.Another strange thing was that the new opEquals implementation: bool opEquals(const ref Matrix other) const pure nothrow { if (this.dim != other.dim) { return false; } foreach (immutable i; 0 .. this.dim.size) { if (this.data[i] != other.data[i]) return false; } return true; } is actually about 20% faster than the one you have suggested. 
With the single line of "return (this.dim == other.dim && this.data[] == other.data[]).I think this small performance bug is fixed in dmd 2.065, which is currently in beta3.The last thing I haven't quite understood is that I tried to replace auto t = Matrix(other).transposeAssign(); in the matrix multiplication algorithm with its shorter and clearer form auto t = other.transpose(); // sorry for the nasty '()', but I like them! :/ This however gave me wonderful segmentation faults on runtime while using the matrix multiplication ...This looks like a null-related bug. I'll benchmark your code a little, but I think I don't have as much RAM as you. Bye, bearophile
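The manual-allocation route suggested above (C heap plus a destructor) might look roughly like this - a hedged sketch with a hypothetical CMatrix type; a real implementation would need more care, e.g. around assignment:

```d
import core.stdc.stdlib : malloc, free;
import core.exception : onOutOfMemoryError;

// Hypothetical matrix with deterministic storage: the data lives on the
// C heap and is freed in the destructor, so scope exit releases it
// immediately instead of waiting for a GC cycle.
struct CMatrix {
    double* data;
    size_t rows, cols;

    this(size_t r, size_t c) {
        rows = r; cols = c;
        data = cast(double*) malloc(r * c * double.sizeof);
        if (data is null) onOutOfMemoryError();
        data[0 .. r * c] = 0.0;
    }

    // Postblit: deep-copy so two structs never share (and double-free) data.
    this(this) {
        auto copy = cast(double*) malloc(rows * cols * double.sizeof);
        if (copy is null) onOutOfMemoryError();
        copy[0 .. rows * cols] = data[0 .. rows * cols];
        data = copy;
    }

    ~this() { free(data); } // deterministic release at scope exit

    ref double opIndex(size_t r, size_t c) { return data[r * cols + c]; }
}

void main() {
    {
        auto m = CMatrix(1000, 1000);
        m[0, 0] = 42;
    } // memory is freed right here, no GC involved
}
```

With this layout the allocation benchmark above would no longer depend on when the GC decides to run a collection.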
Feb 17 2014
On 2/17/2014 4:56 PM, Robin wrote:And here is the complete and improved code: http://dpaste.dzfl.pl/7f8610efa82bYou should get rid of the .dup's in the constructors and postblits. You don't want big arrays ever getting accidentally allocated and copied behind your back when you're only trying to pass things around. Better to provide an explicit .dup function in your Matrix class (just like how D's dynamic arrays work) for when you *intentionally* want a full duplicate. Also, unlike classes, when you're copying a struct there's no need to copy each field individually. Just copy the struct as a whole. In fact, copy constructors and overloading assignments from the same type are generally not needed at all: They aren't going to be any faster than just using D's default behavior of memory-blitting the struct to the new location *if* a copy is actually even needed at all. By providing explicit copy constructors and overloading assignments from the same type, you're forcing the compiler to use a potentially-slower method every time. Finally, and I could be wrong on this part, but I doubt those move constructors are really needed in D. See the top answer here: http://stackoverflow.com/questions/4200190/does-d-have-something-akin-to-c0xs-move-semantics
Feb 17 2014
Okay, here we go...
1) Don't use upper case letters in module names (http://dlang.org/module.html#ModuleDeclaration)
2) As has already been suggested, if you're targeting raw performance, don't use the GC. You can always malloc and free your storage. Using the C heap has certain implications, but they're not important right now.
3) Ditch the extra constructors, they're completely unnecessary. For the matrix you only need your non-trivial ctor, postblit and lvalue assignment operator.
4) This: ref Matrix transpose() const { return Matrix(this).transposeAssign(); } just doesn't make any sense. You're returning a reference to a temporary.
5) Use scoped imports, they're good :)
6) Use writefln for formatted output, it's good :)
With the above suggestions your code transforms into this: http://dpaste.dzfl.pl/9d7feeab59f6 And here are the timings on my machine:
$ rdmd -O -release -inline -noboundscheck main.d
allocationTest ... Time required: 1 sec, 112 ms, 827 μs, and 3 hnsecs
multiplicationTest ... Time required: 1 sec, 234 ms, 417 μs, and 8 hnsecs
toStringTest ... Time required: 998 ms, 16 μs, and 2 hnsecs
transposeTest ... Time required: 813 ms, 947 μs, and 3 hnsecs
scalarMultiplicationTest ... Time required: 105 ms, 828 μs, and 5 hnsecs
matrixAddSubTest ... Time required: 240 ms and 384 μs
matrixEqualsTest ... Time required: 244 ms, 249 μs, and 8 hnsecs
identityMatrixTest ... Time required: 249 ms, 897 μs, and 4 hnsecs
LDC yields roughly the same times.
Feb 17 2014
On Tuesday, 18 February 2014 at 03:32:48 UTC, Stanislav Blinov wrote:3) Ditch extra constructors, they're completely unnecessary. For matrix you only need your non-trivial ctor, postblit and lvalue assignment operator.Actually rvalue opAssign would be even better :) Also, I forgot to remove it in the code, but that explicit ctor for Dimension is completely unnecessary too.
Feb 17 2014
Stanislav Blinov:that explicit ctor for Dimension is completely unnecessary too.I like a constructor(s) like that because it catches bugs like: auto d = Dimension(5); Bye, bearophile
Feb 17 2014
On Tuesday, 18 February 2014 at 07:45:18 UTC, bearophile wrote:Stanislav Blinov:Hmmm... yeah, ok, not completely unnecessary :)that explicit ctor for Dimension is completely unnecessary too.I like a constructor(s) like that because it catches bugs like: auto d = Dimension(5);
Feb 18 2014
Stanislav Blinov:http://dpaste.dzfl.pl/9d7feeab59f6Few small things should still be improved, but it's an improvement. Perhaps it needs to use a reference counting from Phobos.LDC yields roughly the same times.This is surprising. Bye, bearophile
Feb 17 2014
On Tuesday, 18 February 2014 at 07:50:23 UTC, bearophile wrote:Stanislav Blinov:COW for matrices? Aw, come on... :)http://dpaste.dzfl.pl/9d7feeab59f6Few small things should still be improved, but it's an improvement. Perhaps it needs to use a reference counting from Phobos.To me as well. I haven't yet tried to dig deep though.LDC yields roughly the same times.This is surprising.
Feb 17 2014
Stanislav Blinov:I have compiled your code with (a single module, 32 bit Windows): dmd -wi -vcolumns -O -release -inline -noboundscheck matrix3.d ldmd2 -wi -O -release -inline -noboundscheck matrix3.d DMD: multiplicationTest ... Time required: 4 secs, 665 ms, 452 μs, and 2 hnsecs LDC2: multiplicationTest ... Time required: 2 secs, 522 ms, and 747 μs Bye, bearophileTo me as well. I haven't yet tried to dig deep though.LDC yields roughly the same times.This is surprising.
Feb 18 2014
On Tuesday, 18 February 2014 at 08:50:23 UTC, bearophile wrote:Stanislav Blinov:Interesting. For me (on 64 bit) multiplicationTest is in the same league for DMD and LDC. (I'm still using dmd 2.064.2 as my main dmd install btw). But what is up with those first and last tests in LDC?.. $ rdmd -wi -O -release -inline -noboundscheck main.d allocationTest ... Time required: 1 sec, 165 ms, 766 μs, and 1 hnsec multiplicationTest ... Time required: 1 sec, 232 ms, 287 μs, and 4 hnsecs toStringTest ... Time required: 997 ms, 217 μs, and 1 hnsec transposeTest ... Time required: 859 ms, 152 μs, and 5 hnsecs scalarMultiplicationTest ... Time required: 105 ms, 366 μs, and 5 hnsecs matrixAddSubTest ... Time required: 241 ms, 705 μs, and 8 hnsecs matrixEqualsTest ... Time required: 243 ms, 534 μs, and 8 hnsecs identityMatrixTest ... Time required: 260 ms, 411 μs, and 4 hnsec $ ldmd2 -wi -O -release -inline -noboundscheck main.d neolab/core/dimension.d neolab/core/matrix.d -ofmain-ldc && main-ldc allocationTest ... Time required: 2 hnsecs (o_O) multiplicationTest ... Time required: 1 sec, 138 ms, 628 μs, and 2 hnsecs toStringTest ... Time required: 704 ms, 937 μs, and 5 hnsecs transposeTest ... Time required: 859 ms, 413 μs, and 5 hnsecs scalarMultiplicationTest ... Time required: 103 ms, 250 μs, and 2 hnsecs matrixAddSubTest ... Time required: 226 ms, 955 μs, and 3 hnsecs matrixEqualsTest ... Time required: 470 ms and 186 μs identityMatrixTest ... Time required: 4 hnsecs (o_O)I have compiled your code with (a single module, 32 bit Windows): dmd -wi -vcolumns -O -release -inline -noboundscheck matrix3.d ldmd2 -wi -O -release -inline -noboundscheck matrix3.d DMD: multiplicationTest ... Time required: 4 secs, 665 ms, 452 ╬╝s, and 2 hnsecs LDC2: multiplicationTest ... Time required: 2 secs, 522 ms, and 747 ╬╝sTo me as well. I haven't yet tried to dig deep though.LDC yields roughly the same times.This is surprising.
Feb 18 2014
...And if I define opEquals as it was made by Robin, i.e. like this: bool opEquals(ref const Matrix other) const pure nothrow { version (all) { if (dim != other.dim) return false; foreach(immutable i, const ref e; data) if (e != other.data[i]) return false; return true; } else { return (dim == other.dim) && (data[] == other.data[]); } } then the relevant benchmark on LDC flattens near zero time too x.x
Feb 18 2014
On Tuesday, 18 February 2014 at 09:05:33 UTC, Stanislav Blinov wrote:allocationTest ... Time required: 2 hnsecs (o_O) identityMatrixTest ... Time required: 4 hnsecs (o_O)LDC is probably detecting that you're never actually using the results of the operation and that none of the steps have side-effects, so it's optimizing the call out. One way of avoiding this is to do something like writeln the result of an operation.
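One way to keep the optimizer honest, as suggested, is to feed each benchmark result into a visible side effect; here is a sketch with a hypothetical workload standing in for the real tests:

```d
import std.stdio;

// A sink the optimizer can't see through: a global store is a side effect,
// so the benchmarked work can't be eliminated as dead code.
__gshared double sink = 0;

double workload() {
    double s = 0;
    foreach (i; 1 .. 1_000_001)
        s += 1.0 / i; // harmonic sum as a stand-in for the real benchmark body
    return s;
}

void main() {
    auto r = workload();
    sink += r;         // the result is "used": the loop must be kept
    writeln(sink > 0); // or simply print (part of) the result
}
```

Without the `sink +=` (or the writeln), an aggressive backend like LDC may legally delete the whole loop, which would explain the 2-4 hnsec timings above.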
Feb 18 2014
Hiho, I am happy to see that I could encourage so many people to discuss this topic, to not only give me interesting answers to my questions but also to analyse and evaluate features and performance of D. I have fixed the allocation performance problem via a custom destructor method which manually notifies the GC that its data is garbage, so that the GC can free it in the next cycle and no longer has to determine whether it is still in use. This extremely sped up the allocation tests but increased overall runtime performance only by a very slight amount of time, because memory is now freed continuously. Nick Sabalausky: Why should I remove .dup from the copy constructor? In my opinion this is important to keep both matrices (source and copy) independent from each other. The suggested COW feature for matrices sounds interesting but also weird. I have to read more about that. Move semantics in D are needed and exist; however, I think that the D compiler isn't as good as the C++ compiler at determining what is moveable and what is not. Another thing which is hopefully a bug in the current language implementation of D is the strange behaviour of the transpose method with the only line of code: return Matrix(this).transposeAssign(); In C++, for example, this compiles without any problems and results in correct transposed matrix copies - this works due to the existence of move semantics in C++ and is one of the coolest features since C++11, which simplified code in many cases enormously for value types just like structs in D. I have run several tests in order to find out when the move assign or the move constructors are called in D, and whenever I expected them to be called the copy-constructor or the postblit was called instead, which was frustrating imo. Perhaps I still haven't quite understood the things behind the scenes in D, but it can't be the solution to always copy the whole data whenever I could instead have a move of the whole data on the heap. 
Besides that, one suggestion which came up was that I could insert the Dimension module into the Matrix module as they are semantically working together. However, I find it better to have many smaller code snippets instead of fewer bigger ones, and that's why I keep them both separated.

I also gave scoped imports a try and hoped that they would reduce my executable file and perhaps increase the performance of my program, neither of which was true -> confused. Instead I now have more lines of code and do not see instantly what dependencies the module itself has. So what is the point of scoped imports?

The mixin keyword is also nice to have, but it feels similar to a dirty C macro to me, where text replacement with lower abstraction (unlike templates) takes place. Of course I am wrong and you will teach me why, but atm I have strange feelings about implementing code with mixins. In this special case: perhaps it isn't a wise decision to merge addition with subtraction, and perhaps I can find faster ways to do that which involve more differences between both actions, which would require splitting both methods up again. (Theoretical, but it could be.)

Another weird thing is that result ~= text(tabStr, this[r, c]) in the toString method is much slower than the two following lines of code: result ~= tabStr; result ~= to!string(this[r, c]); Does anybody have an answer to this?

I am now thinking that I know the most important things about how to write ordinary (not super) efficient D code - laugh at me if you want :D - and will continue extending this benchmarking library, and whenever I feel bad about a certain algorithm's performance I will knock here in this thread again, you know. :P

At the end of my post I just want to summarize the benchmark history for the matrix multiplication, as I think it is funny: (All tests for two 1000x1000 matrices!)

- The bloody first implementation of the matrix multiplication (which worked) required about 39 seconds to finish.
- Then I finally found out the optimizing flags for DMD, and the multiplication performance roughly doubled, down to about 14 seconds.
- I created an account here, and thanks to your massive help the matrix multiplication required only about 5 seconds shortly after, due to better const, pure and nothrow usage.
- Through the shift from class to struct, some optimizations in memory layout, further usage improvements of const, pure and nothrow, as well as several array features and foreach loops, the algorithm's performance rose once again and required about 3.7 seconds.
- The last notable optimization was the implementation based on pointer arithmetic, which again improved the performance from 3.7 seconds to roughly 2 seconds.

Given this development I see that there is still a possibility that we could beat the 1.5 seconds from Java or even the 1.3 seconds from C++! (Based on my machine: 64-bit Arch Linux, dual-core 2.2 GHz, 4 GB RAM.)

There are still many ways to further improve the performance, for example by using LDC on certain hardware, parallelism, and perhaps by implementing COW with no GC dependencies. And of course I may be missing many other possible optimization features of D. I myself can say that I have learned a lot, and that's the most important thing above everything else for me here. Thank you all for this very interesting conversation! You - the community - are a part of what makes D a great language. =)

Robin
Feb 18 2014
Robin:the existance of move semantics in C++ and is one of the coolest features since C++11 which increased and simplified codes in many cases enormously for value types just as structs in D.I guess Andrei doesn't agree with you (and move semantics in C++11 is quite hard to understand).I also gave scoped imports a try and hoped that they were able to reduce my executable file and perhaps increase the performance of my program, none of which was true -> confused. Instead I now have more lines of code and do not see instantly what dependencies the module as itself has. So what is the point in scoped imports?Scoped imports in general can't increase performance. Their main point is to avoid importing modules that are needed only by templated code. So if you don't instantiate the template, the linker works less and the binary is usually smaller (no moduleinfo, etc).Another weird thing is that the result ~= text(tabStr, this[r, c]) in the toString method is much slower than the two following lines of code: result ~= tabStr; result ~= to!string(this[r, c]); Does anybody have an answer to this?It doesn't look too weird. In the first case you are allocating and creating larger strings. But I don't think matrix printing is a bottleneck in a program.- Then I have finally found out the optimizing commands for the DMDThis is a small but common problem. Perhaps worth fixing.There are still many ways to further improve the performance. For examply by using LDCLatest stable and unstable versions of LDC2, try them: https://github.com/ldc-developers/ldc/releases/tag/v0.12.1 https://github.com/ldc-developers/ldc/releases/tag/v0.13.0-alpha1 on certain hardwares, paralellism and perhaps by implementing COW with no GC dependencies. And of course I may miss many other possible optimization features of D.Matrix multiplication can be improved a lot by tiling the matrix (or better, using a cache-oblivious algorithm), using SSE/AVX2, using multiple cores, etc. 
As starting point you can try to use std.parallelism. It could speed up your code on 4 cores with a very limited amount of added code. Bye, bearophile
Feb 18 2014
On Tuesday, 18 February 2014 at 23:36:12 UTC, Robin wrote:Another thing which is hopefully a bug in the current language implementation of D is the strange behaviour of the transpose method with the only line of code: return Matrix(this).transposeAssign();Matrix(this) not compiling when 'this' is const is a bug. That's why I had to replace it with assignment. Postblit still has some unresolved problems. However, you should note that you have a bigger bug in that transposeAssign() returns a reference :) So unless transpose() returns Matrix (instead of ref Matrix), that's a problem.In C++ for example this compiles without any problems and results in correct transposed matrix copies - this works due to the existance of move semantics in C++ and is one of the coolest features since C++11 which increased and simplified codes in many cases enormously for value types just as structs in D.The fact that it compiles in C++ is a problem (this illustrates your initial implementation of transpose() translated into C++):

struct Matrix {
    // D: ref Matrix transpose() const;
    Matrix& transpose() const { return Matrix(*this).transposeAssign(); }
    // D: ref Matrix transposeAssign();
    Matrix& transposeAssign() { return *this; }
};

int main(int argc, char** argv) {
    Matrix m1;
    Matrix& m2 = m1.transpose(); // ooops! m2 refers to a dead temporary
}

I have ran several tests in order to test when the move assign or the move constructors are called in D and whenever I expected them to be called the copy-constructor or the postblit was called instead which was frustating imo.I've posted a complete illustration of how D moves rvalues; browse this thread back a bit.I also gave scoped imports a try and hoped that they were able to reduce my executable file and perhaps increase the performance of my program, none of which was true -> confused. Instead I now have more lines of code and do not see instantly what dependencies the module as itself has. So what is the point in scoped imports?They don't pollute the outer scope with symbols. 
If you just need to call a couple of functions from std.random inside *one* function, there's no need to pull names from std.random into the whole module.The mixin keyword is also nice to have but it feels similar to a dirty C-macro to me where text replacement with lower abstraction (unlike templates) takes place. Of course I am wrongYes you are :) It's not dirty and it's certainly not a macro. It's a great feature for generating code at compile time. For examples just browse through Phobos (e.g. std.functional), or take a look at the code in this thread: http://forum.dlang.org/thread/mailman.158.1391156715.13884.digitalmars-d@puremagic.com Maybe it's your syntax highlighting that throws you off? Then use q{} instead of "" :)Another weird thing is that the result ~= text(tabStr, this[r, c]) in the toString method is much slower than the two following lines of code: result ~= tabStr; result ~= to!string(this[r, c]); Does anybody have an answer to this?Probably due to the extra allocation in text(tabStr, this[r,c]), since it will perform ~ under the hood, while the appender already has storage.
Feb 18 2014
Hiho, here are the results of both compilers (DMD and LDC2) on my system:

LDC:
=========================================
allocationTest ... Time required: 5 secs, 424 msecs
multiplicationTest ... Time required: 1 secs, 744 msecs
toStringTest ... Time required: 0 secs, 974 msecs
transposeTest ... Time required: 0 secs, 998 msecs
scalarMultiplicationTest ... Time required: 0 secs, 395 msecs
matrixAddSubTest ... Time required: 0 secs, 794 msecs
matrixEqualsTest ... Time required: 0 secs, 0 msecs
identityMatrixTest ... Time required: 0 secs, 393 msecs
=========================================

DMD:
=========================================
allocationTest ... Time required: 3 secs, 161 msecs
multiplicationTest ... Time required: 2 secs, 6 msecs
toStringTest ... Time required: 1 secs, 365 msecs
transposeTest ... Time required: 1 secs, 45 msecs
scalarMultiplicationTest ... Time required: 0 secs, 405 msecs
matrixAddSubTest ... Time required: 0 secs, 809 msecs
matrixEqualsTest ... Time required: 0 secs, 430 msecs
identityMatrixTest ... Time required: 0 secs, 350 msecs
=========================================

All in all I must say that I am more pleased with the DMD results, as I am kind of worried about the LDC allocation test performance ... I also had to rewrite parts of the code, as some functions just weren't "pure" or "nothrow", such as the whole thing around this.data[] += other.data[].

Robin
Feb 20 2014
Robin:All in all I must say that I am more pleased with the DMD results as I am kind of worried about the LDC allocation test performance ...I am not sure of the causes of this, but the D GC was improved a little in the last two compiler versions.I also had to rewrite parts of the codes as some functions just weren't "pure" or "nothrow" such as the whole things around this.data[] += other.data[].This is because LDC2 is one version of the language behind. It's a temporary problem. Your multiplicationTest timings with LDC2 seem not to have reached the performance of your C++ code yet. You can take a look at the asm output of both. Perhaps in the D code you have some allocation you are not doing in the C++ code. Bye, bearophile
Feb 20 2014
On Thursday, 20 February 2014 at 21:32:10 UTC, Robin wrote:Hiho, here are the results of both compilers (DMD and LDC2) on my system: LDC: ========================================= allocationTest ... Time required: 5 secs, 424 msecs multiplicationTest ... Time required: 1 secs, 744 msecs toStringTest ... Time required: 0 secs, 974 msecs transposeTest ... Time required: 0 secs, 998 msecs scalarMultiplicationTest ... Time required: 0 secs, 395 msecs matrixAddSubTest ... Time required: 0 secs, 794 msecs matrixEqualsTest ... Time required: 0 secs, 0 msecs identityMatrixTest ... Time required: 0 secs, 393 msecs ========================================= DMD: ========================================= allocationTest ... Time required: 3 secs, 161 msecs multiplicationTest ... Time required: 2 secs, 6 msecs toStringTest ... Time required: 1 secs, 365 msecs transposeTest ... Time required: 1 secs, 45 msecs scalarMultiplicationTest ... Time required: 0 secs, 405 msecs matrixAddSubTest ... Time required: 0 secs, 809 msecs matrixEqualsTest ... Time required: 0 secs, 430 msecs identityMatrixTest ... Time required: 0 secs, 350 msecs ========================================= All in all I must say that I am more pleased with the DMD results as I am kind of worried about the LDC allocation test performance ... I also had to rewrite parts of the codes as some functions just weren't "pure" or "nothrow" such as the whole things around this.data[] += other.data[]. Robinwhat flags did you use?
Feb 21 2014
Hiho, as I used the ldmd wrapper for ldc2 I was able to use the same arguments given via the cmdfile. These were: -release -noboundscheck -O and -inline. Robin
Feb 22 2014
Stanislav Blinov:allocationTest ... Time required: 1 sec, 112 ms, 827 μs, and 3 hnsecs multiplicationTest ... Time required: 1 sec, 234 ms, 417 μs, and 8 hnsecsPhysics teaches us that those experimental measures are expressed with excessive precision. For such benchmarks it's wiser to write "1.11" seconds. Bye, bearophile
Feb 18 2014
On Monday, 17 February 2014 at 11:34:43 UTC, bearophile wrote:While it is cleaner, I remember array operations (such as data[] -= other.data[]) being significantly slower than doing it yourself. I don't know if this has been fixed. Perhaps using -m64 helps this. Also, I stress again that if you aren't compiling for 64-bit (-m64), it's very likely to give a significant improvement (because the optimizer will probably convert some of your operations into SIMD operations). Using LDC / GDC would also give a significant improvement.foreach (ref T element; this.data) { element *= scalar; }Try to use: data[] *= scalar;for (auto i = 0; i < this.dim.size; ++i) { this.data[i] += other.data[i]; }Try to use: data[] += other.data[];for (auto i = 0; i < this.dim.size; ++i) { this.data[i] -= other.data[i]; }Try to use: data[] -= other.data[];
Feb 17 2014
Kapps:I remember array operations (such as data[] -= other.data[]) being significantly slower than doing it yourself. I don't know if this has been fixed.They are not being "fixed". But for arrays as large as the ones here, they are fast enough. --------------------- I have tested the code (all in a single module) using the latest dmd and ldc2 (beta versions in both cases), compiling with: dmd -O -release -inline -noboundscheck matrix.d ldmd2 -O -release -inline -noboundscheck matrix.d Best timings: LDC2: multiplicationTest ... Time required: 2 secs, 522 msecs DMD: multiplicationTest ... Time required: 4 secs, 724 msecs Using link-time optimization, or using AVX/AVX2 on modern CPUs (via the compiler switch for AVX), could improve the LDC2 timings a little more. Bye, bearophile
Feb 17 2014
On 2/16/2014 6:47 PM, Robin wrote:Nick Sabalausky: Don't be shocked at the bad performance of a beginner D programmer. At least I am not shocked yet that my D implementation isn't performing as well as the highly optimized C++11 implementation or the JIT-optimized Java implementation. In general one can not compare the results of a native compilation optimization which shall run on different hardwares and a JIT optimization which is able to pull out every single optimization routine because it knows the system and all its components to compile at.I think you've misunderstood me. I was only explaining D's class vs struct system, and why in D (as opposed to Java) structs are better than classes in this particular case.
Feb 17 2014
On 2/15/2014 6:06 PM, Robin wrote:Matrix is still a class but I changed it to a final class preventing matrix methods to be virtual. Dimension is now a final struct (don't know if 'final' is affecting structs in any way tough ...). This mainly gave the multiplication a huge performance boost.Like in other languages, "final" means "cannot inherit from this". Since structs, by definition, don't support inheritance anyway, "final" isn't really applicable - they're inherently "final" no matter what.When converting the Matrix to a struct from class the multiplication even lowered from ~4.3 seconds to about 3.6 seconds. However, I am currently not sure if I want matrices to be structs (value types).They really should be structs. There's basically no benefit to making them classes. Classes are for when you need inheritance and don't need performance. Java, from what I understand, does a reasonable job compensating for the inherent inefficiency of classes by going to great lengths to have absolute top-of-the-line class-oriented optimizations and GC and strip out all the class underpinnings when possible (AIUI). Keep in mind too, the data inside dynamic arrays is already by reference. Arrays in D are basically like this: struct Array(T) { size_t length; T* ptr; } So even when your matrix is a struct, the actual values inside the matrix are still by reference (because it's through that pointer). So you're still not actually copying the data inside the matrix as you pass it around. You're only copying the dimension, length and pointer. And those are copied via a straight memcpy, not "one field at a time" as I've heard C++ does (though I could be wrong, it's been forever since I did much C++). And if you need to pass a struct by reference, there's always still "ref". Class really just isn't the right thing for a matrix, especially if you're paying attention to performance.I have also tried the LDC compiler. However, it gave me some strange bugs. E.g. 
it didn't like the following line: ref Matrix transpose() const { return new Matrix(this).transposeAssign(); }Doing that without parens around the "new Matrix(this)" is a very new feature, so I'm not entirely surprised LDC doesn't have it yet. Like Stanislav said, you can just add parens for now.
Feb 15 2014
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }Your foreach here is unnecessary. You can zero out an array using the syntax: this.data[] = 0; // probably compiles out as memset() And while I would expect to!T(0) to compile down to something with no particular logic, it is still a function call, not a macro, so you're better off letting the compiler figure out the best conversion from 0 for type T than using a library function for it.I think and hope that this is getting optimized via inlining. =) This works similar for opIndexAssign.From complaints that I have seen, very little gets inlined at the moment.Matrix opMul(ref const(Matrix) other) { if (this.dim.rows != other.dim.cols || this.dim.cols != other.dim.rows) { // TODO - still have to learn exception handling first ... } auto m = new Matrix(this.dim.rows, other.dim.cols); auto s = new Matrix(other).transposeAssign();Since you don't return s, it's probably better not to allocate an instance. I would also be curious to see your constructor which copies one matrix into a new one.Besides that I can't find a way how to make a use of move semantics like in C++. For example a rvalue-move-constructor or a move-assign would be a very nice thing for many tasks.Do you mean something like this? this.data[] = other.data[]; // probably compiles out as memcpy() this.data = other.data.dup; // different syntax, but should be the sameAnother nice thing to know would be if it is possible to initialize an array before it is default initialized with T.init where T is the type of the array's fields.I believe so, but I would have to scour about to find the syntax. Hopefully, someone else knows.
Feb 17 2014