digitalmars.D.learn - Optimize my code =)
- Robin (109/109) Feb 14 2014 Hiho,
- Craig Dillabaugh (5/12) Feb 14 2014 I am no expert at optimizing D code, so this is a bit of a shot
- John Colvin (5/19) Feb 14 2014 Why would that make a difference here? They're (almost) identical
- Craig Dillabaugh (9/32) Feb 14 2014 Well, in my defense I did start with a disclaimer. For some
- bearophile (4/7) Feb 14 2014 It's the same thing.
- Xinok (8/17) Feb 14 2014 Not quite. Setting length will copy over the existing contents of
- bearophile (32/80) Feb 14 2014 Also try "final class" or struct in your benchmark. And try to
- Chris Cain (23/40) Feb 14 2014 Are you sure you need a class here? If you're not using
- bearophile (4/5) Feb 14 2014 minimallyInitializedArray should be used, because it's safer.
- thedeemon (4/6) Feb 14 2014 First of all make sure you it's not virtual, otherwise each
- Kapps (34/106) Feb 14 2014 Try with and without -noboundscheck. Sometimes using it seems to
- Nick Sabalausky (16/36) Feb 14 2014 A matrix is just plain-old-data, so use a struct, you don't need a class...
- bearophile (14/24) Feb 14 2014 But the row-column bounds of the matrix are not the same as the
- Robin (104/104) Feb 15 2014 Hiho,
- Stanislav Blinov (22/62) Feb 15 2014 Make them value types. If you're using dynamic arrays for
- bearophile (13/16) Feb 15 2014 Many of the things you say and write are still wrong or confused.
- bearophile (7/7) Feb 15 2014 In the meantime also take a look at the matrix mul code I've
- francesco cattoglio (7/10) Feb 16 2014 Well, actually, sometimes squeezing as much performance as you
- bearophile (9/16) Feb 16 2014 Unless you know a language well, you will fail squeezing that.
- Francesco Cattoglio (8/12) Feb 16 2014 Sorry, but I fail to be amazed by what huge piles of money thrown
- Robin (85/85) Feb 16 2014 Hiho,
- Stanislav Blinov (14/29) Feb 16 2014 What swap assignment, what move constructor? :)
- bearophile (21/39) Feb 16 2014 D is not a small language, it contains many features, but
- Robin (8/8) Feb 17 2014 Hiho,
- John Colvin (37/45) Feb 17 2014 A few quick points:
- bearophile (80/151) Feb 17 2014 I think it's better to improve your code iteratively. First
- Robin (74/74) Feb 17 2014 Hiho,
- bearophile (63/115) Feb 17 2014 The code looks better.
- Nick Sabalausky (18/20) Feb 17 2014 You should get rid of the .dup's in the constructors and postblits. You
- Stanislav Blinov (39/39) Feb 17 2014 Okay, here we go...
- Stanislav Blinov (5/8) Feb 17 2014 Actually rvalue opAssign would be even better :)
- bearophile (5/7) Feb 17 2014 I like a constructor(s) like that because it catches bugs like:
- Stanislav Blinov (2/7) Feb 18 2014 Hmmm... yeah, ok, not completely unnecessary :)
- bearophile (7/9) Feb 17 2014 Few small things should still be improved, but it's an
- Stanislav Blinov (3/10) Feb 17 2014 To me as well. I haven't yet tried to dig deep though.
- bearophile (12/16) Feb 18 2014 I have compiled your code with (a single module, 32 bit Windows):
- Stanislav Blinov (41/57) Feb 18 2014 Interesting. For me (on 64 bit) multiplicationTest is in the same
- Stanislav Blinov (14/14) Feb 18 2014 ...And if I define opEquals as it was made by Robin, i.e. like
- Kapps (7/11) Feb 18 2014 LDC is probably detecting that you're never actually using the
- Robin (101/101) Feb 18 2014 Hiho,
- bearophile (22/45) Feb 18 2014 I guess Andrei doesn't agree with you (and move semantics in
- Stanislav Blinov (38/67) Feb 18 2014 Matrix(this) not compiling when 'this' is const is a bug. That's
- Robin (48/48) Feb 20 2014 Hiho,
- bearophile (11/17) Feb 20 2014 I am not sure of the causes of this, but the D GG was improved a
- John Colvin (2/50) Feb 21 2014 what flags did you use?
- Robin (5/5) Feb 22 2014 Hiho,
- bearophile (6/10) Feb 18 2014 Physics teaches us that those experimental measures are expressed
- Kapps (10/25) Feb 17 2014 While it is cleaner, I remember array operations (such as data[]
- bearophile (20/23) Feb 17 2014 They are not being "fixed". But for arrays as large as the ones
- Nick Sabalausky (4/13) Feb 17 2014 I think you've misunderstood me. I was only explaining D's class vs
- Nick Sabalausky (31/43) Feb 15 2014 Like in other languages, "final" means "cannot inherit from this". Since...
- Chris Williams (20/41) Feb 17 2014 Your foreach here is unnecessary. You can zero out an array using
Hiho, I am fairly new to the D programming language and still reading through the awesome online book at http://ddili.org/ders/d.en/index.html. However, I think it is missing some things or I am missing glasses and overlooked these parts in the book.^^ Currently I am trying to write a very simple Matrix library for some minor linear algebra operations such as calculating the determinant, some simple decompositions etc. I am aware that this probably has been done already and I am mainly focusing on learning D with this project. =) I already wrote a similar Matrix library for Java and C++ with more or less the same algorithms in order to benchmark these languages. As I am very new to D I instantly ran into certain optimization issues. E.g. the simple matrix multiplication based on my Java implementation requires about 1.5 secs for multiplying two 1000x1000 matrices, however my D implementation requires 39 seconds without compiler optimizations and about 14 seconds with -inline, -O, -noboundscheck and -release. So I am sure that I just missed some possible optimization routines for D and that I could miss entire patterns used in D to write more optimized code - especially when looking at the const and immutability concepts. I won't paste the whole code, just the important parts. class Matrix(T = double) { private T[] data; private Dimension dim; } This is my class with its templated data as a one-dimensional array (as I don't like jagged arrays) and a dimension (which is a struct). The dimension object has some util functionality such as getting the total size or mapping (row, col) to a one-dimensional index besides the getters for rows and columns. this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; } This is one of the constructors. 
Its task is just to create a new Matrix instance filled with what is 0 for the templated value - don't know if there is a better approach for this. I experienced that floating point values are sadly initialized with nan which isn't what I wanted -> double.init = nan. I am using opIndex and opIndexAssign in order to access and assign the matrix values: T opIndex(size_t row, size_t col) const { immutable size_t i = this.dim.offset(row, col); if (i >= this.dim.size) { // TODO - have to learn exception handling in D first. :P } return this.data[i]; } Which calls: size_t size() @property const { return this.rows * this.cols; } I think and hope that this is getting optimized via inlining. =) This works similarly for opIndexAssign. The function I am mainly benchmarking is the simple matrix multiplication where one of the multiplied matrices is transposed first in order to improve the cache hit ratio. Matrix opMul(ref const(Matrix) other) { if (this.dim.rows != other.dim.cols || this.dim.cols != other.dim.rows) { // TODO - still have to learn exception handling first ... } auto m = new Matrix(this.dim.rows, other.dim.cols); auto s = new Matrix(other).transposeAssign(); size_t r1, r2, c; T sum; for (r1 = 0; r1 < this.dim.rows; ++r1) { for (r2 = 0; r2 < other.dim.rows; ++r2) { sum = to!T(0); for (c = 0; c < this.dim.cols; ++c) { sum += this[r1, c] * s[r2, c]; } m[r1, r2] = sum; } } return m; } I am aware that there are faster algorithms for matrix multiplication but I am mainly learning how to write efficient D code in general and not based on any algorithm. This is more or less the same algorithm I am using with Java and C++. Both require about 1.5 secs (after Java warm-up) to perform this task for two 1000x1000 matrices. D compiled with DMD takes about 14 seconds with all (known-to-me) optimize flags activated. 
(listed above) I wanted to make s an immutable matrix in order to hopefully improve performance via this change, however I wasn't technically able to do this to be honest. Besides that I can't find a way to make use of move semantics like in C++. For example an rvalue move constructor or a move assignment would be a very nice thing for many tasks. I am also not quite sure if it is normal in D to implement an idup function, as slices have, which creates an immutable copy of your object. And are there important things I have to do while implementing an idup method for this matrix class? Another nice thing to know would be if it is possible to initialize an array before it is default initialized with T.init, where T is the type of the array's fields. In C++ e.g. there is no default initialization, which is nice if you have to initialize every single field anyway. E.g. in a Matrix.random() method which creates a matrix with random values. There it is unnecessary that the (sometimes huge) array is completely initialized with the type's init value. Thanks in advance! Robin
Feb 14 2014
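For readers comparing across languages, the transposed-multiply loop Robin describes can be sketched in C++ roughly as follows. This is an illustrative sketch only; the names `Mat` and `mulTransposed` are invented here and are not from Robin's actual Java/C++ code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat row-major matrix, as in the post (1D storage, no jagged arrays).
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    double at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Multiply after transposing the right-hand operand, so both inner-loop
// reads walk memory sequentially (the cache-hit-ratio idea in the post).
Mat mulTransposed(const Mat& a, const Mat& b) {
    assert(a.cols == b.rows);
    Mat bt(b.cols, b.rows);                     // transpose of b
    for (std::size_t r = 0; r < b.rows; ++r)
        for (std::size_t c = 0; c < b.cols; ++c)
            bt.at(c, r) = b.at(r, c);
    Mat m(a.rows, b.cols);
    for (std::size_t r1 = 0; r1 < a.rows; ++r1)
        for (std::size_t r2 = 0; r2 < bt.rows; ++r2) {
            double sum = 0;
            for (std::size_t c = 0; c < a.cols; ++c)
                sum += a.at(r1, c) * bt.at(r2, c);
            m.at(r1, r2) = sum;
        }
    return m;
}
```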
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }I am no expert at optimizing D code, so this is a bit of a shot in the dark, but does your speed improve at all if you replace:this.data = new T[this.dim.size];with: this.data.length = this.dim.size
Feb 14 2014
On Friday, 14 February 2014 at 16:40:31 UTC, Craig Dillabaugh wrote:On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:Why would that make a difference here? They're (almost) identical are they not? Certainly the body of the work, the allocation itself, is the same.this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }I am no expert at optimizing D code, so this is a bit of a shot in the dark, but does your speed improve at all if you replace:this.data = new T[this.dim.size];with: this.data.length = this.dim.size
Feb 14 2014
On Friday, 14 February 2014 at 16:47:32 UTC, John Colvin wrote:On Friday, 14 February 2014 at 16:40:31 UTC, Craig Dillabaugh wrote:Well, in my defense I did start with a disclaimer. For some reason I thought using .length was the proper way to size arrays. It is likely faster if you are resizing an existing array (especially downward), but in this case new memory is being allocated anyway. In fact I ran some tests (just allocating a matrix over and over) and it seems using new is consistently faster than using .length (by about a second for 50000 matrices). Shows how much I know.On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:Why would that make a difference here? They're (almost) identical are they not? Certainly the body of the work, the allocation itself, is the same.this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }I am no expert at optimizing D code, so this is a bit of a shot in the dark, but does your speed improve at all if you replace:this.data = new T[this.dim.size];with: this.data.length = this.dim.size
Feb 14 2014
Craig Dillabaugh:It's the same thing. Bye, bearophilethis.data = new T[this.dim.size];with: this.data.length = this.dim.size
Feb 14 2014
On Friday, 14 February 2014 at 16:56:29 UTC, bearophile wrote:Craig Dillabaugh:Not quite. Setting length will copy over the existing contents of the array. Using new simply sets every element to .init. Granted, this.data is empty meaning there's nothing to copy over, so there's a negligible overhead which may be optimized out by the compiler anyways. There's also the uninitializedArray function:It's the same thing. Bye, bearophilethis.data = new T[this.dim.size];with: this.data.length = this.dim.size
Feb 14 2014
Robin:class Matrix(T = double) { private T[] data; private Dimension dim; }Also try "final class" or struct in your benchmark. And try to use the ldc2 compiler for performance benchmarks. Perhaps dim is better const, unless you want to change the shape of the matrix.this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }Better: this(in size_t rows, in size_t cols) pure nothrow { this.dim = Dimension(rows, cols); this.data = minimallyInitializedArray!(typeof(data))(this.dim.size); this.data[] = to!T(0); }I experienced that floating point values are sadly initialized with nan which isn't what I wanted -> double.init = nan.This is actually a good feature of the D language :-)T opIndex(size_t row, size_t col) const { immutable size_t i = this.dim.offset(row, col); if (i >= this.dim.size) { // TODO - have to learn exception handling in D first. :P }Look in the D docs for contract programming. Also add the pure nothrow attributes.Which calls: size_t size() @property const { return this.rows * this.cols; } I think and hope that this is getting optimized via inlining. =)Currently class methods are virtual, and currently D compilers are not inlining virtual calls much. The solution is to use a final class, final methods, etc.This works similarly for opIndexAssign. The function I am mainly benchmarking is the simple matrix multiplication where one of the multiplied matrices is transposed first in order to improve cache hit ratio.Transposing the whole matrix before the multiplication is not efficient. Better to do it more lazily.auto m = new Matrix(this.dim.rows, other.dim.cols); auto s = new Matrix(other).transposeAssign(); size_t r1, r2, c; T sum; for (r1 = 0; r1 < this.dim.rows; ++r1) { for (r2 = 0; r2 < other.dim.rows; ++r2) {Better to define r1, r2 inside the for. 
Or often even better to use a foreach with an interval.D compiled with DMD takes about 14 seconds with all (known-to-me) optimize flag activated. (listed above)DMD is not efficient with floating point values. Try the ldc2 or gdc compilers (I suggest ldc2).I wanted to make s an immutable matrix in order to hopefully improve performance via this change,Most times immutability worsens performance, unless you are passing data across threads.however I wasn't technically able to do this to be honest.It's not too hard to use immutability in D.Besides that I can't find a way how to make a use of move semantics like in C++. For example a rvalue-move-constructor or a move-assign would be a very nice thing for many tasks.That adds too much complexity to the language.Another nice thing to know would be if it is possible to initialize an array before it is default initialized with T.init where T is the type of the array's fields. In C++ e.g. there is no default initialization which is nice if you have to initialize every single field anyway. E.g. in a Matrix.random() method which creates a matrix with random values. There it is unnecessary that the (sometimes huge) array is completely initialized with the type's init value.It's done by the function that I have used above, from the std.array module. Bye, bearophile
Feb 14 2014
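bearophile's point about minimallyInitializedArray - don't pay for a full T.init pass when every element is about to be overwritten anyway - has a close C++ analogue. A hedged sketch (the function name `fillSequential` and its values are invented for illustration):

```cpp
#include <cstddef>
#include <vector>

// Reserve capacity only (a single allocation, no element initialisation),
// then construct each element with its real value via push_back.
std::vector<double> fillSequential(std::size_t n) {
    std::vector<double> v;
    v.reserve(n);                        // no default-initialisation here
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(0.5 * double(i));    // illustrative values
    return v;
}
```

This matches the Matrix.random() scenario from the original post: the array is filled exactly once, so the default-initialisation pass is pure waste.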
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:This is my class with its templated data as a one-dimensional array (as I don't like jagged arrays) and a dimension (which is a struct). The dimension object has some util functionality such as getting the total size or mapping (row, col) to a one-dimensional index besides the getters for rows and columns.Are you sure you need a class here? If you're not using inheritance, structs can be much faster.this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }You don't need to use std.conv.to here (ditto for when you initialize sum later). This should work and is clearer: foreach(ref T element; this.data) element = 0; (You can do `double item = 0;` and it knows enough to store the double equivalent of 0) In the case of initializing sum, I'm not sure the compiler is folding `sum = to!T(0);` thus you could be getting a lot of overhead from that. Using std.conv.to is fine but it does actually do checks in the conversion process, so it can be expensive in tight loops. Since you're just using a constant 0 here, there's no reason to use std.conv.to.The function I am mainly benchmarking is the simple matrix multiplication where one of the multiplied matrices is transposed first in order to improve cache hit ratio.Did you benchmark first to make sure this actually improves performance? It seems to me like doing the transpose would incur the same cache misses that you'd get in the actual use. If you were caching it and using it multiple times, this would probably be more beneficial. General tip for optimizing: benchmark before and after every "optimization" you do. 
I've been surprised to find some of my ideas for optimizations slowed things down before.Another nice thing to know would be if it is possible to initialize an array before it is default initialized with T.init where T is the type of the array's fields.
Feb 14 2014
Chris Cain:minimallyInitializedArray should be used, because it's safer. Bye, bearophile
Feb 14 2014
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:class Matrix(T = double) { T opIndex(size_t row, size_t col) const {First of all make sure it's not virtual, otherwise each element access will cost you enough to make it 10x slower than Java.
Feb 14 2014
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:As I am very new to D I instantly ran into certain optimization issues. E.g. the simple matrix multiplication based on my Java implementation requires about 1.5 secs for multiplying two 1000x1000 matrices, however my D implementation requires 39 seconds without compiler optimizations and about 14 seconds with -inline, -O, -noboundscheck and -release. So I am sure that I just missed some possible optimization routines for D and that I could miss entire patterns used in D to write more optimized code - especially when looking at the const and immutability concepts.Try with and without -noboundscheck. Sometimes using it seems to make things slower, which is odd. Also, compiling for 64-bit may help significantly. When you compile for 32-bit, you won't get the benefits of SIMD instructions as they're not guaranteed to be supported by the CPU. Since Java runs a JIT, it would use these SIMD instructions. Also, you'll get significantly faster code if you use LDC or GDC.class Matrix(T = double) { private T[] data; private Dimension dim; }This should be a final class or a struct; you don't need virtual methods.This is my class with its templated data as a one-dimensional array (as I don't like jagged arrays) and a dimension (which is a struct). The dimension object has some util functionality such as getting the total size or mapping (row, col) to a one-dimensional index besides the getters for rows and columns. this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }You don't need to!T for setting to 0; just use plain 0. Using to!T here will probably result in overhead. Note that it may be evaluated every time you use 'nil', so using a constant will help.T opIndex(size_t row, size_t col) const { immutable size_t i = this.dim.offset(row, col); if (i >= this.dim.size) { // TODO - have to learn exception handling in D first. 
:P } return this.data[i]; }You're using -noboundscheck then manually implementing bounds checks, and probably less efficiently than the compiler does. Either change this to an assert, or remove it. In either case, the compiler will do its own built-in bounds checking. This call is especially expensive since you're doing entirely virtual methods because you specified class instead of final class or using a struct. None of your Matrix calls will be inlined.Which calls: size_t size() @property const { return this.rows * this.cols; } I think and hope that this is getting optimized via inlining. =) This works similarly for opIndexAssign.Not unless you use final class / struct.The function I am mainly benchmarking is the simple matrix multiplication where one of the multiplied matrices is transposed first in order to improve cache hit ratio. Matrix opMul(ref const(Matrix) other) { if (this.dim.rows != other.dim.cols || this.dim.cols != other.dim.rows) { // TODO - still have to learn exception handling first ... } auto m = new Matrix(this.dim.rows, other.dim.cols); auto s = new Matrix(other).transposeAssign(); size_t r1, r2, c; T sum; for (r1 = 0; r1 < this.dim.rows; ++r1) { for (r2 = 0; r2 < other.dim.rows; ++r2) { sum = to!T(0); for (c = 0; c < this.dim.cols; ++c) { sum += this[r1, c] * s[r2, c]; } m[r1, r2] = sum; } } return m; }These allocations may potentially hurt. Also, again, the virtual calls here are a huge hit. Using to!T(0) is also a waste of performance, as you can just do straight 0.I am aware that there are faster algorithms for matrix multiplication but I am mainly learning how to write efficient D code in general and not based on any algorithm.Using SIMD instructions manually would of course be much, much, faster, but I understand that it defeats the point somewhat and is much more ugly.This is more or less the same algorithm I am using with Java and C++. Both require about 1.5 secs (after Java warm-up) to perform this task for two 1000x1000 matrices. 
D compiled with DMD takes about 14 seconds with all (known-to-me) optimize flag activated. (listed above)If you used LDC or GDC 64-bit with the changes above, I'd guess it would be similar. I wouldn't expect D to out-perform Java much for this particular scenario, there's very little going on and the JIT should be able to optimize it quite well with SIMD stuff.I wanted to make s an immutable matrix in order to hopefully improve performance via this change, however I wasn't technically able to do this to be honest.I don't think this would improve performance.
Feb 14 2014
On 2/14/2014 11:00 AM, Robin wrote:class Matrix(T = double) { private T[] data; private Dimension dim; }A matrix is just plain-old-data, so use a struct, you don't need a class. A struct will be much more lightweight: A struct doesn't normally involve memory allocations like a class does, and you'll get better data locality and less indirection, even compared to a final class.I am using opIndex and opIndexAssign in order to access and assign the matrix values: T opIndex(size_t row, size_t col) const { immutable size_t i = this.dim.offset(row, col); if (i >= this.dim.size) { // TODO - have to learn exception handling in D first. :P } return this.data[i]; }No need for the bounds check. D already does bounds checks automatically (unless you compile with -noboundscheck, but the whole *point* of that flag is to disable bounds checks.) But that said, I don't know whether the compiler might already be optimizing out your bounds check anyway. So try it and profile, see what happens.Another nice thing to know would be if it is possible to initialize an array before it is default initialized with T.init where T is the type of the array's fields. In C++ e.g. there is no default initialization which is nice if you have to initialize every single field anyway. E.g. in a Matrix.random() method which creates a matrix with random values. There it is unnecessary that the (sometimes huge) array is completely initialized with the type's init value.You can opt-out of the default initialization with a void initializer: http://dlang.org/declaration.html#VoidInitializer Although to be honest I forget how to do that for arrays, and the functions other people already suggested for creating/initing your array probably already do that anyway.
Feb 14 2014
Nick Sabalausky:But the row-column bounds of the matrix are not the same as the 1D bounds of the 1D array. You can have out-of-column-bounds and still be inside the 1D bounds. So that test should go in the precondition and it should test row and col separately: T opIndex(in size_t row, in size_t col) const pure nothrow in { assert(row < nRows); assert(col < nCols); } body { return data[dim.offset(row, col)]; } Bye, bearophileT opIndex(size_t row, size_t col) const { immutable size_t i = this.dim.offset(row, col); if (i >= this.dim.size) { // TODO - have to learn exception handling in D first. :P } return this.data[i]; }No need for the bounds check. D already does bounds checks automatically
Feb 14 2014
Hiho, wow, first of all: this community is awesome - so many kind and interesting answers. =) With your help I was able to achieve a performance boost for several different operations. Some benchmarks: Allocation of 5 10,000 x 10,000 matrices in a row: Before: ~8.2 seconds After: ~2.3 seconds (with minimallyInitializedArray!) Multiplication of two 1000x1000 matrices: Before: ~14.8 seconds After: ~4.3 seconds However, I think there is still much potential to further optimize this code. Let me tell you what changes I have performed on the code so far ... Matrix is still a class but I changed it to a final class, preventing matrix methods from being virtual. Dimension is now a final struct (don't know if 'final' affects structs in any way though ...). This mainly gave the multiplication a huge performance boost. When converting the Matrix from class to struct the multiplication even dropped from ~4.3 seconds to about 3.6 seconds. However, I am currently not sure if I want matrices to be structs (value types). Besides that I tried to add nothrow and pure as attributes to every possible method. However, as it turned out I wasn't able to add the pure modifier to any method as I always called an impure method from it (as stated by the compiler). This actually made sense most of the time and I think the only pure methods now are the constructor methods of the Dimension struct. What may still speed up things? In my tests it turned out that the simple code: auto m1 = Matrix!double.random(1000, 1000, -10, 10, 0.25); auto m2 = Matrix!double.random(1000, 1000, -10, 10, 0.25); auto m3 = m1 * m2; called the normal copy constructor. This is sad as it would be a huge performance boost if it made use of move semantics. (In the C++ matrix code this scenario would actually call the move assignment operator for matrix m3, which is much, much faster than copying.) But I haven't figured out yet how to use move semantics in D with class objects. 
Or is that just possible with struct value types? I have also tried the LDC compiler. However, it gave me some strange bugs. E.g. it didn't like the following line: ref Matrix transpose() const { return new Matrix(this).transposeAssign(); } And it forced me to change it to: ref Matrix transpose() const { auto m = new Matrix(this); m.transposeAssign(); return m; } Which is kind of ugly ... I hope that this is getting fixed soon, as I assume it is correct D code because DMD is capable of compiling it correctly. Some of you suggested that transposing the matrix before the multiplication takes place must be much slower than doing without the transposition. In my former Java and C++ programs I have already tested which strategy is faster and it turned out that cache efficiency DOES matter of course. There are also papers explaining why a transposed matrix multiplication may be faster than one without transposing. In the end this is just a test suite and there is already an even faster approach to matrix multiplication which runs in O(n^2.8) instead of the O(n^3) of my current simple solution. And with the power of OpenCL (or similar) one could even lift a matrix multiplication to the GPU and boom.^^ But the current task is to find generally optimized code for the D language - this thread has already helped me a lot! I wasn't aware that D is actually capable of lazy evaluations other than the if-pseudo-lazy-evaluation, which is kind of cool. However, I still have to look that up in order to maybe find a use for it in this code. SIMD instructions also sound extremely cool but a little bit too complex regarding the fact that I am fairly new to D and still learning the basics of this language. In the end I wanted to mention something which I found very irritating. 
Bearophile stated correctly that my preconditions for the index operators are all wrong and corrected my code with some smooth additions: T opIndex(in size_t row, in size_t col) const nothrow in { assert(row < nRows); assert(col < nCols); } body { return data[dim.offset(row, col)]; } The in and body statements are cool as far as I understand what they are for. However, in the benchmarks they had a clear and noticeable negative impact on the matrix multiplication, which rose to ~8 seconds from ~4 seconds with his code compared to mine. When leaving out the in and body statement blocks and using only one normal block as follows, it stayed at the 4 second duration for that task, and so I think that the compiler won't optimize things in an 'in' code block - is that right? T opIndex(in size_t row, in size_t col) const nothrow { assert(row < nRows); assert(col < nCols); return data[dim.offset(row, col)]; } Thanks again for all your helpful comments and thanks in advance - I am eagerly looking forward to your future comments! =) Robin
Feb 15 2014
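The assert-in-the-body pattern Robin lands on has a direct C++ counterpart: an assert documents the row/column bounds and is compiled out when NDEBUG is defined, much like D's contracts (and asserts) disappear under -release. A hedged illustration, with invented names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rough C++ analogue of the checked opIndex: bounds are asserted
// separately for row and column, and both checks vanish in release
// builds (compile with -DNDEBUG).
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double at(std::size_t row, std::size_t col) const {
        assert(row < rows);   // gone when NDEBUG is defined
        assert(col < cols);
        return data[row * cols + col];
    }
};
```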
On Saturday, 15 February 2014 at 23:06:27 UTC, Robin wrote:Matrix is still a class but I changed it to a final class preventing matrix methods to be virtual. Dimension is now a final struct (don't know if 'final' is affecting structs in any way tough ...).It doesn't. It may even be disallowed someday, when it's fixed :)This mainly gave the multiplication a huge performance boost. When converting the Matrix to a struct from class the multiplication even lowered from ~4.3 seconds to about 3.6 seconds. However, I am currently not sure if I want matrices to be structs (value types).Make them value types. If you're using dynamic arrays for storage, you're already using GC plenty, no need for additional allocations (and see below for copy and move semantics). Don't forget a postblit though.(In the C++ matrix codes this scenario would actually call move assignment operator for matrix m3 which is much much faster than copying.)D performs return value optimization: it moves result whenever it can.But I haven't figured out yet how to use move semantics in D with class objects. Or is that just possible with struct value types?Yup, it "just works". In most cases, anyway. But move semantics in D differ from C++: D doesn't have rvalue references, and thus the compiler only performs a move when it can prove that value will no longer be used (there's a lengthy description in TDPL but I'm too lazy now to fetch it for exact citation :) ). For explicit moves, there's std.algorithm.move(), though AFAIK it's underimplemented at the moment.I have also tried the LDC compiler. However, it gave me some strange bugs. E.g. it didn't like the following line: ref Matrix transpose() const { return new Matrix(this).transposeAssign(); } And it forced me to change it to: ref Matrix transpose() const { auto m = new Matrix(this); m.transposeAssign(); return m; } Which is kind of ugly ... 
I hope that this is getting fixed soon, as I imply it as correct D code because the DMD is capable to compile this correctly.Yes, this should be allowed. But you could also just write (new Matrix(this)).transposeAssign() :) Also, there's no point in ref return when you're returning a reference to class instance.T opIndex(in size_t row, in size_t col) const nothrow in { assert(row < nRows); assert(col < nCols); } body { return data[dim.offset(row, col)]; } The in and body statements are cool as far as I realize what they are for. ...so I think that the compiler won't optimize things in an 'in' code block - is that right?in/out contracts are debug creatures anyway, they're not present in -release builds.
Feb 15 2014
Robin:Thanks again for all your helpful comments and thanks in advance - I am eagerly looking forward to your future comments! =)Many of the things you say and write are still wrong or confused. Usually the hard optimization of code is one of the last things you learn about a language, because to do it you need a very clear picture of what the language is doing in every spot (this is true for most languages, like C, C++, Haskell, Python, etc). And you are not yet at this point. Tomorrow I will answer some things in your post, but in the meantime if you want you can show your whole D program, and tomorrow I'll try to make it faster for LDC2 and explain it some. Bye, bearophile
Feb 15 2014
In the meantime also take a look at the matrix mul code I've written here: http://rosettacode.org/wiki/Matrix_multiplication#D In the first entry you see no whole transposition up front. Storing the whole matrix in a 1D array is probably more efficient. Bye, bearophile
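The 1D-array idea can be sketched roughly like this (a hypothetical minimal type assuming row-major layout - not the Rosetta Code version itself):

```d
import std.stdio;

// Minimal matrix stored in a single 1D array, row-major.
struct Mat {
    double[] data;
    size_t rows, cols;

    this(size_t r, size_t c) {
        rows = r; cols = c;
        data = new double[](r * c);
        data[] = 0.0;
    }

    // Index (r, c) maps to r * cols + c in the flat array.
    ref double opIndex(size_t r, size_t c) {
        return data[r * cols + c];
    }
}

// Multiply using a transposed copy of b so both inner operands
// are scanned sequentially (cache-friendly).
Mat mul(ref Mat a, ref Mat b) {
    assert(a.cols == b.rows);
    auto t = Mat(b.cols, b.rows);
    foreach (r; 0 .. b.rows)
        foreach (c; 0 .. b.cols)
            t[c, r] = b[r, c];

    auto result = Mat(a.rows, b.cols);
    foreach (r; 0 .. a.rows) {
        foreach (c; 0 .. b.cols) {
            double sum = 0.0;
            foreach (k; 0 .. a.cols)
                sum += a[r, k] * t[c, k]; // both walk contiguous memory
            result[r, c] = sum;
        }
    }
    return result;
}

void main() {
    auto a = Mat(2, 2);
    a[0, 0] = 1; a[0, 1] = 2; a[1, 0] = 3; a[1, 1] = 4;
    auto b = Mat(2, 2);
    b[0, 0] = 5; b[0, 1] = 6; b[1, 0] = 7; b[1, 1] = 8;
    auto c = mul(a, b);
    writeln(c.data); // [19, 22, 43, 50]
}
```

The single allocation per matrix and the purely additive index arithmetic are what make the flat layout attractive compared to an array of rows.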
Feb 15 2014
On Sunday, 16 February 2014 at 01:25:13 UTC, bearophile wrote:Many of the things you say and write are still wrong or confused. Usually the hard optimization of code is one the last things you learn about a languageWell, actually, sometimes squeezing as much performance as you can from a test case can be a way to find out if a given language checks all the boxes and can be used to solve your problems. I must admit I'm shocked by the poor performance of D here. But I also know one HAS to try LDC or GDC, or those numbers are really meaningless.
Feb 16 2014
francesco cattoglio:Well, actually, sometimes squeezing as much performance as you can from a test case can be a way to find out if a given language checks all the boxes and can be used to solve your problems.Unless you know a language well, you will fail at squeezing that. This is true even in Python.I must admit I'm shocked by the poor performance of D here. But I also know one HAS to try LDC or GDC, or those numbers are really meaningless.Be instead amazed at the sometimes near-C++ performance levels they have pushed Java to :-) And I think D is doing well in a benchmark like this; I guess with ldc2 you can go about as fast as C++. Bye, bearophile
Feb 16 2014
On Sunday, 16 February 2014 at 09:53:08 UTC, bearophile wrote:Be instead amazed of the sometimes near-C++ performance levels they have pushed Java to :-)Sorry, but I fail to be amazed by what huge piles of money thrown to a project for 10+ years can achieve. Those kind of achievements are boring :PAnd I think D is doing well in a benchmark like this, I guess with ldc2 you can go about as fast as C++.I'm sure that other "small benchmarks" on computational-intensive code achieved C++ speed in the past, so if we can't get C++-like speed here, it's either a problem with the code or some serious performance regression.
Feb 16 2014
Hiho, thanks again for your answers! I don't know where to start ... Stanislav Blinov: I am not currently aware of the "move semantics" in D or what it is called there. However, I am not quite sure if D does an equally good job as C++ in determining if a value can be moved or not. I have made some simple tests which showed me that the swap-assignment and the move-constructor are never called in D in code where C++ would certainly use them; instead D just copies values which were in fact rvalues ... (Maybe I just don't get the whole thing behind the scenes - and I am really hoping I do.) My statement that the "in" and "body" block of the assignment operator slows down the whole code isn't true anymore. Don't know what fixed it though. bearophile: Yeah, you are completely right - I am super confused at the moment. I always thought that D would be much easier to learn than C++, as all the people always say that C++ is the most complex language. After what I have learned so far, D seems to be much more complex, which isn't bad in general, but the learning curve doesn't feel so good atm as I am mainly confused by many things instead of getting what happens behind the scenes. (e.g. move semantics, the whole army of modifiers, immutability which results in zero performance boost in non-parallel environments; the fact that classes are by-ref and structs are by-val is also confusing from time to time, to be honest.) When I started to write the C++ matrix library, after having written the Java matrix library to find out what both languages can achieve, I also had to start learning C++. I bought a book, and after 2-3 weeks I got the concepts and everything went as expected, and nothing seemed to be hidden behind the scenes from me, the programmer, which was in fact a nice experience when trying to optimize the code until it was as fast as the Java code or even faster. D seems to be more problematic when it comes to learning how to optimize the code. 
To summarize, I am still a beginner and learn this language as it has got cool, new and advanced features, it has not so many legacy issues (like C++), it unites object-oriented, functional as well as imperative and parallel programming paradigms (still amazed by that fact!), runs native and isn't bound to any VM - that's why I am learning D! =) I would love to see how much performance a D veteran like you is able to grab out of my code. Just tell me where I shall upload it and I will do so. Don't expect too much as I am still a beginner and as this project has just started. I didn't want to implement the whole functionality before I know how to do it with a good result. ;-) The matrix multiplication you have posted has some cool features in it. I especially like the pure one with all that functional stuff. :D Nick Sabalausky: Don't be shocked at the bad performance of a beginner D programmer. At least I am not shocked yet that my D implementation isn't performing as well as the highly optimized C++11 implementation or the JIT-optimized Java implementation. In general one cannot compare the results of a native compilation optimization, which shall run on different hardware, and a JIT optimization, which is able to pull out every single optimization routine because it knows the exact system it compiles for. There are also LDC2 and GDC which should improve the general performance quite a bit. However, what is true is that it is (at least for me) harder to learn how to write efficient code for D than it was for C++. And keep in mind that my Java implementation (which runs nearly as fast as the C++ implementation) is written just to fit for performance. I had to do dirty things (against the general Java mentality) in order to achieve this performance. Besides that, I couldn't implement a generic solution in Java which wasn't five to ten times slower than without generics, as Java generics force boxing at runtime. 
So all in all Java is only good in performance if you don't use it as Java (purely object-oriented etc.) ... but it works! :D All of you: What I have done since the last time: I have changed my mind - matrix is now a value type. However, I wasn't sure about this step because I planned to implement dense and sparse matrices which could inherit from a basic matrix implementation and thus would need to be classes. I have also removed that nasty "final struct" statement.^^ However, benchmarks have stayed the same since the last post. Hopefully I haven't forgotten to say anything important ... Robin
Feb 16 2014
On Sunday, 16 February 2014 at 23:47:08 UTC, Robin wrote:I am not currently aware of the "move semantics" in D or what it is called there. However, I am not quite sure if D does an equally good job as C++ in determining if a value can be moved or not. I have made some simple tests which showed me that the swap-assignment and the move-constructor are never called in D at code where C++ would certainly take them, instead D just copies values which were in fact rvalues ... (Maybe I just don't get the whole thing behind the scenes - and I am really hoping it.)What swap assignment, what move constructor? :) Here are the rules for moving rvalues in D: 1) All anonymous rvalues are moved; 2) Named temporaries that are returned from functions are moved; 3) That's it :) No other guarantees are offered (i.e. compilers may or may not try to be smart in other scenarios). Sometimes when you want an explicit move you can use std.algorithm.move function. But keep in mind that its current implementation is incomplete (i.e. it doesn't handle immutability properly). This code illustrates it all: http://dpaste.dzfl.pl/c050c9fccbb2All of you: What I have done since the last time: I have changed my mind - matrix is now a value type. However, I wasn't sure about this step because I planned to implement dense and sparse matrices which could inherit from a basic matrix implementation in thus required to be classes.Templates, perhaps?
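Those two rules can be made visible with a small experiment (a hypothetical struct S; a postblit counter shows which initializations are true copies and which are moves):

```d
import std.stdio;

// Hypothetical struct whose postblit counts true copies,
// so moves (which skip the postblit) become observable.
struct S {
    static int copies;
    int[] payload;
    this(this) { ++copies; }
}

S make() {
    auto tmp = S([1, 2, 3]); // named temporary...
    return tmp;              // ...returned from a function: moved (rule 2)
}

void main() {
    S a = make();       // no postblit: the named return value was moved
    S b = S([4, 5, 6]); // no postblit: anonymous rvalue, moved (rule 1)
    S c = a;            // `a` is still usable afterwards, so this IS a copy
    writeln("copies: ", S.copies); // prints "copies: 1"
}
```

Only the lvalue initialization of `c` triggers the postblit; both "rvalue" initializations are moves, matching the rules above.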
Feb 16 2014
Robin:I always thought that D would be much easier to learn than C++ as all the people always say that C++ is the most complex language. After what I have learned so far D seems to be much more complex, which isn't bad in general, but the learning curve doesn't feel so good atm as I am mainly confused by many things instead of getting what happens behind the scenes.D is not a small language, it contains many features, but compared to C++ it has fewer corner cases, fewer traps, and by default it's often safer than C++. In C++ you need to take care of many details if you want to write correct code. Writing D code should be faster and safer than C++.immutability which results in zero performance boost in non-parallel environments,A good D compiler like ldc2 is sometimes able to better optimize code that uses immutable/const. But immutability is also for the programmer: to make the code simpler to reason about, and safer.the fact that classes are by-ref and structs are by-val is also confusing from time to time to be honest.)D seems to be more problematic when it comes to learning how to optimize the code.Perhaps because D also focuses a little more on correctness of the code. You see this even more in the Ada language.I would love to see how much performance a D veteran like you is able to grab out of my code. Just tell me where I shall upload it and I will do so.I guess your code is composed of just one or two small D modules, so you can post it here: http://dpaste.dzfl.pl/ You can also upload the Java code there, to help me compare and to know what you were trying to write :-)However, what is true is, that it is (at least for me) harder to learn how to write efficient code for D than it was to learn it for C++.But I think on average it's a little harder to write correct code in C++ compared to D. Bye, bearophile
Feb 16 2014
Hiho, I currently haven't got enough time to respond to all what have been said since my last post. However, here is the simple code: http://dpaste.dzfl.pl/3df8694eb47d Thanks in advance! I am really looking forward to your editing! =) Robin
Feb 17 2014
On Monday, 17 February 2014 at 09:48:49 UTC, Robin wrote:Hiho, I currently haven't got enough time to respond to all what have been said since my last post. However, here is the simple code: http://dpaste.dzfl.pl/3df8694eb47d Thanks in advance! I am really looking forward to your editing! =) RobinA few quick points:
1) foreach loops are nice, use them
2) D has neat array-ops that can simplify your code for == and -= etc... e.g.

/**
 * Subtracts all entries of this matrix by the given other matrix.
 */
ref Matrix opSubAssign(ref const(Matrix) other) nothrow
in {
    assert(this.dim == other.dim);
} body {
    this.data[] -= other.data[];
    return this;
}

3) Although your generic indexing is nice, a lot of operations that you have implemented have access patterns that allow faster indexing without a multiplication, e.g.

/**
 * Creates a new identity matrix of the size specified by the given
 * rows and columns.
 */
static Matrix identity(size_t rows, size_t cols) nothrow {
    auto m = Matrix(rows, cols);
    size_t index = 0;
    foreach (i; 0 .. m.dim.min) {
        m.data[index] = 1;
        index += cols + 1;
    }
    return m;
}
Feb 17 2014
Robin:http://dpaste.dzfl.pl/3df8694eb47dI think it's better to improve your code iteratively. First starting with small simple things:this(size_t rows, size_t cols) nothrow pure {=> this(in size_t rows, in size_t cols) nothrow pure {this(ref const(Dimension) that) nothrow pure {=> this(const ref Dimension that) nothrow pure {ref Dimension opAssign(ref Dimension rhs) nothrow pure {Is this method necessary?bool opEquals(ref const(Dimension) other) nothrow const {Not necessary.size_t min() property nothrow const { if (this.rows <= this.cols) return this.rows; else return this.cols; }=> (I have changed its name to avoid confusion with std.algorithm.min): property size_t minSide() const pure nothrow { return (rows <= cols) ? rows : cols; }size_t size() property nothrow const {=> property size_t size() const pure nothrow {size_t offset(size_t row, size_t col) nothrow const {=> size_t offset(in size_t row, in size_t col) const pure nothrow {this(in size_t rows, in size_t cols, bool initialize = true) nothrow pure { this.dim = Dimension(rows, cols); this.data = minimallyInitializedArray!(typeof(this.data))(rows * cols); if (initialize) this.data[] = to!T(0); }=> this(in size_t rows, in size_t cols, in bool initialize = true) nothrow pure { ... if (initialize) this.data[] = 0;/** * */Sometimes I prefer to use just one line of code to reduce wasted vertical space: /// Move constructor. From rvalue.this(ref const(Matrix) that) {=> this(const ref Matrix that) pure {this.data = that.data.dup;Currently array.dup is not nothrow. This will eventually be fixed.ref Matrix transpose() const { return Matrix(this).transposeAssign();=> ref Matrix transpose() const pure nothrow { return Matrix(this).transposeAssign;ref Matrix transposeAssign() nothrow { size_t r, c; for (r = 0; r < this.dim.rows; ++r) { for (c = 0; c < r; ++c) { std.algorithm.swap( this.data[this.dim.offset(r, c)], this.data[this.dim.offset(c, r)]); } } return this; }Use immutable foreach loops. 
And even when you use for loops always define the loop variable inside the for: "for (size_t r = 0; ...". Also in general I suggest you to write a little less code. A first try (not tested), but probably this can be done with much less multiplications: ref Matrix transposeAssign() pure nothrow { foreach (immutable r; 0 .. dim.rows) { foreach (immutable c; 0 .. r) { swap(data[dim.offset(r, c)], data[dim.offset(c, r)]); } } return this; }Matrix opMul(ref const(Matrix) other)Never use old-style operator overloading in new D code. Use only the new operator overloading style.auto s = Matrix(other).transposeAssign();=>auto s = Matrix(other).transposeAssign;size_t r1, r2, c;Ditto.foreach (ref T element; this.data) { element *= scalar; }Try to use: data[] *= scalar;for (auto i = 0; i < this.dim.size; ++i) { this.data[i] += other.data[i]; }Try to use: data[] += other.data[];for (auto i = 0; i < this.dim.size; ++i) { this.data[i] -= other.data[i]; }Try to use: data[] -= other.data[];bool opEquals(ref const(Matrix) other) nothrow const { if (this.dim != other.dim) { return false; } for (auto i = 0; i < this.dim.size; ++i) { if (this.data[i] != other.data[i]) return false; } return true; }Try (in general try to write less code): return dim == other.dim && data == other.data;auto stringBuilder = std.array.appender!string;=> Appender!string result;stringBuilder.put("Matrix: " ~ to!string(this.dim.rows) ~ " x " ~ to!string(this.dim.cols)=> result ~= text("Matrix: ", dim.rows, " x ", dim.colsauto m = Matrix(rows, cols); for (auto i = 0; i < m.dim.min; ++i) { m[i, i] = 1; }See the comment by John Colvin.auto generator = Random(unpredictableSeed);What if I want to fill the matrix deterministically, with a given seed? In general it's better to not add a random matrix method to your matrix struct. 
Better to add an external free function that uses UFCS and is more flexible and simpler to reason about.writeln("\tTime required: " ~ to!string(secs) ~ " secs, " ~ to!string(msecs) ~ " msecs");Use .text to stringify, but here you don't need to stringify at all: => writeln("\tTime required: ", secs, " secs, ", msecs, " msecs");sw.reset();=> sw.reset;void multiplicationTest() {In D there is also unittest {}. A benchmark is not exactly a unit test, so do as you prefer. Bye, bearophile
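The "new style" operator overloading mentioned in this review means the templated opBinary/opOpAssign hooks instead of the D1-era opMul/opAdd names; here is a minimal sketch (a hypothetical Vec type, not the poster's Matrix):

```d
// Hypothetical array-backed vector demonstrating new-style operator
// overloading: one template covers several operators via string mixins.
struct Vec {
    double[] data;

    // +, - and element-wise * between two Vecs (new-style overloading).
    Vec opBinary(string op)(in Vec other) const pure
        if (op == "+" || op == "-" || op == "*")
    {
        auto result = Vec(data.dup);
        mixin("result.data[] " ~ op ~ "= other.data[];");
        return result;
    }

    // Scalar compound assignment: v *= 2.0
    ref Vec opOpAssign(string op)(in double scalar) pure nothrow
        if (op == "*")
    {
        mixin("data[] " ~ op ~ "= scalar;");
        return this;
    }
}

void main() {
    auto a = Vec([1.0, 2.0, 3.0]);
    auto b = Vec([4.0, 5.0, 6.0]);
    auto c = a + b; // lowered to a.opBinary!"+"(b)
    c *= 2.0;       // lowered to c.opOpAssign!"*"(2.0)
    assert(c.data == [10.0, 14.0, 18.0]);
}
```

The template constraint decides which operators exist, and the compiler rewrites `a + b` into the template call, so one body serves several operators without duplicated code.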
Feb 17 2014
Hiho, thank you for your code improvements and suggestions. I really like the foreach loop in D as well as the slight (but existing) performance boost over conventional for loops. =) Another success of the changes I have made is that I have managed to further improve the matrix multiplication performance from 3.6 seconds for two 1000x1000 matrices to 1.9 seconds, which is already very close to Java and C++ with about 1.3 - 1.5 seconds. The key to victory was pointer arithmetic, as I noticed that I have used it in the C++ implementation, too. xD The toString implementation has improved its performance slightly due to the changes you have mentioned above: 1.37 secs -> 1.29 secs I have also adjusted all operator overloadings to the "new style" - I just hadn't known about that "new style" until then - thanks! I will just post the whole code again so that you can see what I have changed. Keep in mind that I am still using DMD as compiler and thus performance may still rise once I use another compiler! All in all I am very happy with the code analysis and its improvements! However, there were some strange things of which I am very confused ... void allocationTest() { writeln("allocationTest ..."); sw.start(); auto m1 = Matrix!double(10000, 10000); { auto m2 = Matrix!double(10000, 10000); } { auto m2 = Matrix!double(10000, 10000); } { auto m2 = Matrix!double(10000, 10000); } //{ auto m2 = Matrix!double(10000, 10000); } sw.stop(); printBenchmarks(); } This is the most confusing code snippet. I have just changed the whole allocation for all m1 and m2 from new Matrix!double (on heap) to Matrix!double (on stack) and the performance dropped significantly - the benchmarked time rose from 2.3 seconds to over 25 seconds!! Now look at the code above. When I leave it as it is now, the code requires about 2.9 seconds runtime; however, when enabling the currently commented-out line the code takes 14 to 25 seconds longer! mind blown ... 
0.o This is extremely confusing as I allocate these matrices on the stack, and since I have allocated them within their own scoped blocks they should instantly release their memory again, so that no memory consumption takes place for more than 2 matrices at the same time. This just wasn't the case as far as I have tested it. Another strange thing was that the new opEquals implementation: bool opEquals(const ref Matrix other) const pure nothrow { if (this.dim != other.dim) { return false; } foreach (immutable i; 0 .. this.dim.size) { if (this.data[i] != other.data[i]) return false; } return true; } is actually about 20% faster than the one you have suggested, with the single line of "return (this.dim == other.dim && this.data[] == other.data[])". The last thing I haven't quite understood is that I tried to replace auto t = Matrix(other).transposeAssign(); in the matrix multiplication algorithm with its shorter and clearer form auto t = other.transpose(); // sorry for the nasty '()', but I like them! :/ This however gave me wonderful segmentation faults at runtime while using the matrix multiplication ... And here is the complete and improved code: http://dpaste.dzfl.pl/7f8610efa82b Thanks in advance for helping me! =) Robin
Feb 17 2014
Robin:The key to victory was pointer arithmetics as I notices that I have used them in the C++ implementation, too. xDPerhaps with LDC2 it's not necessary.I will just post the whole code again so that you can see what I have changed.The code looks better. There's no need to put Dimension in another module. In D modules contain related stuff, unlike Java. Also feel free to use some free-standing functions. With UFCS they get used in the same way, and they help making classes/structs simpler. Some of your imports could be moved to more local scopes, instead of being all at module-level.result ~= to!string(this[r, c]);=> result ~= this[r, c].text;writeln("\tTime required: " ~ to!string(secs) ~ " secs, " ~ to!string(msecs) ~ " msecs");=> writeln("\tTime required: ", secs, " secs, ", msecs, " msecs");ref Matrix opOpAssign(string s)(in T scalar) pure nothrow if (s == "*") {=> ref Matrix opOpAssign(string op)(in T scalar) pure nothrow if (op == "*") { Also you have two functions with code like: this.data[] op= scalar; You can define a single template (untested): /** * Adds or subtracts all entries of this matrix by the given other matrix. */ ref Matrix opOpAssign(string op)(const ref Matrix other) pure nothrow if (op == "+" || op == "-") in { assert(this.dim == other.dim); } body { mixin("this.data[] " ~ op ~ "= other.data[];"); return this; }Matrix opBinary(string s)(const ref Matrix other) const pure if (s == "*")Given that on a 32 bit system a Matrix is just 16 bytes, it could be better to not accept the argument by ref, and avoid one more level of indirection: Matrix opBinary(string s)(in Matrix other) const pure if (s == "*")However, there were some strange things of which I am very confused ... 
void allocationTest() { writeln("allocationTest ..."); sw.start(); auto m1 = Matrix!double(10000, 10000); { auto m2 = Matrix!double(10000, 10000); } { auto m2 = Matrix!double(10000, 10000); } { auto m2 = Matrix!double(10000, 10000); } //{ auto m2 = Matrix!double(10000, 10000); } sw.stop(); printBenchmarks(); } This is the most confusing code snippet. I have just changed the whole allocation for all m1 and m2 from new Matrix!double (on heap) to Matrix!double (on stack)The matrix data is always on the heap. What ends on the stack is a very limited amount of stuff.This is extremely confusing as I allocate these matrices on the stack and since I have allocated them within their own scoped-block they should instantly release their memoryYou are mistaken. minimallyInitializedArray allocates memory on the GC heap (and there's not enough space on the stack to allocate 10000^2 doubles). In both D and Java the deallocation of memory managed by the GC is not deterministic, so it's not immediately released at scope exit. Also unlike the Java GC, the D GC is less refined, and by design it is currently not precise. With such large arrays often there are _inbound_ pointers that keep the memory alive, especially on 32 bit systems. So perhaps your problems are caused by the GC. You can have deterministic memory management in your struct-based matrices, but you have to allocate the memory manually (from the GC or probably better from the C heap) and free it in the struct destructor using RAII.Another strange thing was that the new opEquals implementation: bool opEquals(const ref Matrix other) const pure nothrow { if (this.dim != other.dim) { return false; } foreach (immutable i; 0 .. this.dim.size) { if (this.data[i] != other.data[i]) return false; } return true; } is actually about 20% faster than the one you have suggested. 
With the single line of "return (this.dim == other.dim && this.data[] == other.data[]).I think this small performance bug is fixed in dmd 2.065, which is currently in beta3.The last thing I haven't quite understood is that I tried to replace auto t = Matrix(other).transposeAssign(); in the matrix multiplication algorithm with its shorter and clearer form auto t = other.transpose(); // sorry for the nasty '()', but I like them! :/ This however gave me wonderful segmentation faults on runtime while using the matrix multiplication ...This looks like a null-related bug. I'll benchmark your code a little, but I think I don't have as much RAM as you. Bye, bearophile
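The manual-allocation route suggested above (C heap plus a destructor) might look roughly like this - a hedged sketch with a hypothetical CMatrix type; a real implementation would need more care, e.g. around assignment:

```d
import core.stdc.stdlib : malloc, free;
import core.exception : onOutOfMemoryError;

// Hypothetical matrix with deterministic storage: the data lives on the
// C heap and is freed in the destructor, so scope exit releases it
// immediately instead of waiting for a GC cycle.
struct CMatrix {
    double* data;
    size_t rows, cols;

    this(size_t r, size_t c) {
        rows = r; cols = c;
        data = cast(double*) malloc(r * c * double.sizeof);
        if (data is null) onOutOfMemoryError();
        data[0 .. r * c] = 0.0;
    }

    // Postblit: deep-copy so two structs never share (and double-free) data.
    this(this) {
        auto copy = cast(double*) malloc(rows * cols * double.sizeof);
        if (copy is null) onOutOfMemoryError();
        copy[0 .. rows * cols] = data[0 .. rows * cols];
        data = copy;
    }

    ~this() { free(data); } // deterministic release at scope exit

    ref double opIndex(size_t r, size_t c) { return data[r * cols + c]; }
}

void main() {
    {
        auto m = CMatrix(1000, 1000);
        m[0, 0] = 42;
    } // memory is freed right here, no GC involved
}
```

With this layout the allocation benchmark above would no longer depend on when the GC decides to run a collection.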
Feb 17 2014
On 2/17/2014 4:56 PM, Robin wrote:And here is the complete and improved code: http://dpaste.dzfl.pl/7f8610efa82bYou should get rid of the .dup's in the constructors and postblits. You don't want big arrays ever getting accidentally allocated and copied behind your back when you're only trying to pass things around. Better to provide an explicit .dup function in your Matrix class (just like how D's dynamic arrays work) for when you *intentionally* want a full duplicate. Also, unlike classes, when you're copying a struct there's no need to copy each field individually. Just copy the struct as a whole. In fact, copy constructors and overloading assignments from the same type are generally not needed at all: They aren't going to be any faster than just using D's default behavior of memory-blitting the struct to the new location *if* a copy is actually even needed at all. By providing explicit copy constructors and overloading assignments from the same type, you're forcing the compiler to use a potentially-slower method every time. Finally, and I could be wrong on this part, but I doubt those move constructors are really needed in D. See the top answer here: http://stackoverflow.com/questions/4200190/does-d-have-something-akin-to-c0xs-move-semantics
Feb 17 2014
Okay, here we go...
1) Don't use upper case letters in module names (http://dlang.org/module.html#ModuleDeclaration)
2) As has already been suggested, if you're targeting raw performance, don't use the GC. You can always malloc and free your storage. Using the C heap has certain implications, but they're not important right now.
3) Ditch the extra constructors, they're completely unnecessary. For the matrix you only need your non-trivial ctor, postblit and lvalue assignment operator.
4) This: ref Matrix transpose() const { return Matrix(this).transposeAssign(); } just doesn't make any sense. You're returning a reference to a temporary.
5) Use scoped imports, they're good :)
6) Use writefln for formatted output, it's good :)
With the above suggestions your code transforms into this: http://dpaste.dzfl.pl/9d7feeab59f6 And here are the timings on my machine:
$ rdmd -O -release -inline -noboundscheck main.d
allocationTest ... Time required: 1 sec, 112 ms, 827 μs, and 3 hnsecs
multiplicationTest ... Time required: 1 sec, 234 ms, 417 μs, and 8 hnsecs
toStringTest ... Time required: 998 ms, 16 μs, and 2 hnsecs
transposeTest ... Time required: 813 ms, 947 μs, and 3 hnsecs
scalarMultiplicationTest ... Time required: 105 ms, 828 μs, and 5 hnsecs
matrixAddSubTest ... Time required: 240 ms and 384 μs
matrixEqualsTest ... Time required: 244 ms, 249 μs, and 8 hnsecs
identityMatrixTest ... Time required: 249 ms, 897 μs, and 4 hnsecs
LDC yields roughly the same times.
Feb 17 2014
On Tuesday, 18 February 2014 at 03:32:48 UTC, Stanislav Blinov wrote:3) Ditch extra constructors, they're completely unnecessary. For matrix you only need your non-trivial ctor, postblit and lvalue assignment operator.Actually rvalue opAssign would be even better :) Also, I forgot to remove it in the code, but that explicit ctor for Dimension is completely unnecessary too.
Feb 17 2014
Stanislav Blinov:that explicit ctor for Dimension is completely unnecessary too.I like a constructor(s) like that because it catches bugs like: auto d = Dimension(5); Bye, bearophile
Feb 17 2014
On Tuesday, 18 February 2014 at 07:45:18 UTC, bearophile wrote:Stanislav Blinov:Hmmm... yeah, ok, not completely unnecessary :)that explicit ctor for Dimension is completely unnecessary too.I like a constructor(s) like that because it catches bugs like: auto d = Dimension(5);
Feb 18 2014
Stanislav Blinov:http://dpaste.dzfl.pl/9d7feeab59f6Few small things should still be improved, but it's an improvement. Perhaps it needs to use a reference counting from Phobos.LDC yields roughly the same times.This is surprising. Bye, bearophile
Feb 17 2014
On Tuesday, 18 February 2014 at 07:50:23 UTC, bearophile wrote:Stanislav Blinov:COW for matrices? Aw, come on... :)http://dpaste.dzfl.pl/9d7feeab59f6Few small things should still be improved, but it's an improvement. Perhaps it needs to use a reference counting from Phobos.To me as well. I haven't yet tried to dig deep though.LDC yields roughly the same times.This is surprising.
Feb 17 2014
Stanislav Blinov:I have compiled your code with (a single module, 32 bit Windows): dmd -wi -vcolumns -O -release -inline -noboundscheck matrix3.d ldmd2 -wi -O -release -inline -noboundscheck matrix3.d DMD: multiplicationTest ... Time required: 4 secs, 665 ms, 452 μs, and 2 hnsecs LDC2: multiplicationTest ... Time required: 2 secs, 522 ms, and 747 μs Bye, bearophileTo me as well. I haven't yet tried to dig deep though.LDC yields roughly the same times.This is surprising.
Feb 18 2014
On Tuesday, 18 February 2014 at 08:50:23 UTC, bearophile wrote:Stanislav Blinov:Interesting. For me (on 64 bit) multiplicationTest is in the same league for DMD and LDC. (I'm still using dmd 2.064.2 as my main dmd install btw). But what is up with those first and last tests in LDC?.. $ rdmd -wi -O -release -inline -noboundscheck main.d allocationTest ... Time required: 1 sec, 165 ms, 766 μs, and 1 hnsec multiplicationTest ... Time required: 1 sec, 232 ms, 287 μs, and 4 hnsecs toStringTest ... Time required: 997 ms, 217 μs, and 1 hnsec transposeTest ... Time required: 859 ms, 152 μs, and 5 hnsecs scalarMultiplicationTest ... Time required: 105 ms, 366 μs, and 5 hnsecs matrixAddSubTest ... Time required: 241 ms, 705 μs, and 8 hnsecs matrixEqualsTest ... Time required: 243 ms, 534 μs, and 8 hnsecs identityMatrixTest ... Time required: 260 ms, 411 μs, and 4 hnsec $ ldmd2 -wi -O -release -inline -noboundscheck main.d neolab/core/dimension.d neolab/core/matrix.d -ofmain-ldc && main-ldc allocationTest ... Time required: 2 hnsecs (o_O) multiplicationTest ... Time required: 1 sec, 138 ms, 628 μs, and 2 hnsecs toStringTest ... Time required: 704 ms, 937 μs, and 5 hnsecs transposeTest ... Time required: 859 ms, 413 μs, and 5 hnsecs scalarMultiplicationTest ... Time required: 103 ms, 250 μs, and 2 hnsecs matrixAddSubTest ... Time required: 226 ms, 955 μs, and 3 hnsecs matrixEqualsTest ... Time required: 470 ms and 186 μs identityMatrixTest ... Time required: 4 hnsecs (o_O)I have compiled your code with (a single module, 32 bit Windows): dmd -wi -vcolumns -O -release -inline -noboundscheck matrix3.d ldmd2 -wi -O -release -inline -noboundscheck matrix3.d DMD: multiplicationTest ... Time required: 4 secs, 665 ms, 452 ╬╝s, and 2 hnsecs LDC2: multiplicationTest ... Time required: 2 secs, 522 ms, and 747 ╬╝sTo me as well. I haven't yet tried to dig deep though.LDC yields roughly the same times.This is surprising.
Feb 18 2014
...And if I define opEquals as it was made by Robin, i.e. like this: bool opEquals(ref const Matrix other) const pure nothrow { version (all) { if (dim != other.dim) return false; foreach(immutable i, const ref e; data) if (e != other.data[i]) return false; return true; } else { return (dim == other.dim) && (data[] == other.data[]); } } then the relevant benchmark on LDC flattens near zero time too x.x
Feb 18 2014
On Tuesday, 18 February 2014 at 09:05:33 UTC, Stanislav Blinov wrote:allocationTest ... Time required: 2 hnsecs (o_O) identityMatrixTest ... Time required: 4 hnsecs (o_O)LDC is probably detecting that you're never actually using the results of the operation and that none of the steps have side-effects, so it's optimizing the call out. One way of avoiding this is to do something like writeln the result of an operation.
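One way to keep the optimizer honest, as suggested, is to feed each benchmark result into a visible side effect; here is a sketch with a hypothetical workload standing in for the real tests:

```d
import std.stdio;

// A sink the optimizer can't see through: a global store is a side effect,
// so the benchmarked work can't be eliminated as dead code.
__gshared double sink = 0;

double workload() {
    double s = 0;
    foreach (i; 1 .. 1_000_001)
        s += 1.0 / i; // harmonic sum as a stand-in for the real benchmark body
    return s;
}

void main() {
    auto r = workload();
    sink += r;         // the result is "used": the loop must be kept
    writeln(sink > 0); // or simply print (part of) the result
}
```

Without the `sink +=` (or the writeln), an aggressive backend like LDC may legally delete the whole loop, which would explain the 2-4 hnsec timings above.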
Feb 18 2014
Hiho, I am happy to see that I could encourage so many people to discuss this topic, to not only give me interesting answers to my questions but also to analyse and evaluate features and performance of D. I have fixed the allocation performance problem via a custom destructor method which manually notifies the GC that its data is garbage, so that the GC can free it in the next cycle and no longer has to determine whether it is still in use. This extremely sped up the allocation tests but increased overall runtime performance only by a very slight amount of time, because memory is now freed continuously. Nick Sabalausky: Why should I remove .dup from the copy constructor? In my opinion this is important to keep both matrices (source and copy) independent from each other. The suggested COW feature for matrices sounds interesting but also weird. I have to read more about that. Move semantics in D are needed and exist; however, I think that the D compiler isn't as good as the C++ compiler at determining what is moveable and what is not. Another thing which is hopefully a bug in the current language implementation of D is the strange behaviour of the transpose method with the only line of code: return Matrix(this).transposeAssign(); In C++, for example, this compiles without any problems and results in correct transposed matrix copies - this works due to the existence of move semantics in C++ and is one of the coolest features since C++11, which simplified code in many cases enormously for value types just like structs in D. I have run several tests in order to find out when the move assign or the move constructors are called in D, and whenever I expected them to be called the copy-constructor or the postblit was called instead, which was frustrating imo. Perhaps I still haven't quite understood the things behind the scenes in D, but it can't be the solution to always copy the whole data whenever I could instead have a move of the whole data on the heap. 
Besides that, one suggestion which came up was that I could insert the Dimension module into the Matrix module as they are semantically working together. However, I find it better to have many smaller code snippets instead of fewer bigger ones, and that's why I keep them both separated.

I also gave scoped imports a try and hoped that they would reduce my executable file and perhaps increase the performance of my program, neither of which was true -> confused. Instead I now have more lines of code and do not see instantly what dependencies the module itself has. So what is the point of scoped imports?

The mixin keyword is also nice to have, but it feels similar to a dirty C macro to me, where text replacement with lower abstraction (unlike templates) takes place. Of course I am wrong and you will teach me why, but atm I have strange feelings about implementing code with mixins. In this special case: perhaps it isn't a wise decision to merge addition with subtraction, and perhaps I can find faster ways to do that which involve more differences between both actions, which would require splitting both methods up again. (Theoretical, but it could be.)

Another weird thing is that result ~= text(tabStr, this[r, c]) in the toString method is much slower than the two following lines of code: result ~= tabStr; result ~= to!string(this[r, c]); Does anybody have an answer to this?

I am now thinking that I know the most important things about how to write ordinary (not super) efficient D code - laugh at me if you want :D - and will continue extending this benchmarking library, and whenever I feel bad about a certain algorithm's performance I will knock here in this thread again, you know. :P

At the end of my post I just want to summarize the benchmark history for the matrix multiplication, as I think it is funny: (All tests for two 1000x1000 matrices!)

- The bloody first implementation of the matrix multiplication (which worked) required about 39 seconds to finish.
- Then I finally found out the optimizing flags for DMD, and the multiplication performance roughly doubled, down to about 14 seconds.
- I created an account here, and thanks to your massive help the matrix multiplication required only about 5 seconds shortly after, due to better const, pure and nothrow usage.
- Through the shift from class to struct, some optimizations in memory layout, further usage improvements of const, pure and nothrow, as well as several array features and foreach loops, the algorithm's performance rose once again and required about 3.7 seconds.
- The last notable optimization was the implementation based on pointer arithmetic, which again improved the performance from 3.7 seconds to roughly 2 seconds.

Given this development I see that there is still a possibility that we could beat the 1.5 seconds from Java or even the 1.3 seconds from C++! (Based on my machine: 64-bit Arch Linux, dual-core 2.2 GHz, 4 GB RAM.)

There are still many ways to further improve the performance, for example by using LDC on certain hardware, parallelism, and perhaps by implementing COW with no GC dependencies. And of course I may be missing many other possible optimization features of D. I myself can say that I have learned a lot, and that's the most important thing above everything else for me here. Thank you all for this very interesting conversation! You - the community - are a part of what makes D a great language. =)

Robin
Feb 18 2014
Robin:the existance of move semantics in C++ and is one of the coolest features since C++11 which increased and simplified codes in many cases enormously for value types just as structs in D.I guess Andrei doesn't agree with you (and move semantics in C++11 is quite hard to understand).I also gave scoped imports a try and hoped that they were able to reduce my executable file and perhaps increase the performance of my program, none of which was true -> confused. Instead I now have more lines of code and do not see instantly what dependencies the module as itself has. So what is the point in scoped imports?Scoped imports in general can't increase performance. Their main point is to avoid importing modules that are needed only by templated code. So if you don't instantiate the template, the linker works less and the binary is usually smaller (no moduleinfo, etc).Another weird thing is that the result ~= text(tabStr, this[r, c]) in the toString method is much slower than the two following lines of code: result ~= tabStr; result ~= to!string(this[r, c]); Does anybody have an answer to this?It doesn't look too weird. In the first case you are allocating and creating larger strings. But I don't think matrix printing is a bottleneck in a program.- Then I have finally found out the optimizing commands for the DMDThis is a small but common problem. Perhaps worth fixing.There are still many ways to further improve the performance. For examply by using LDCLatest stable and unstable versions of LDC2, try them: https://github.com/ldc-developers/ldc/releases/tag/v0.12.1 https://github.com/ldc-developers/ldc/releases/tag/v0.13.0-alpha1 on certain hardwares, paralellism and perhaps by implementing COW with no GC dependencies. And of course I may miss many other possible optimization features of D.Matrix multiplication can be improved a lot by tiling the matrix (or better, using a cache-oblivious algorithm), using SSE/AVX2, using multiple cores, etc. 
As starting point you can try to use std.parallelism. It could speed up your code on 4 cores with a very limited amount of added code. Bye, bearophile
Feb 18 2014
On Tuesday, 18 February 2014 at 23:36:12 UTC, Robin wrote:Another thing which is hopefully a bug in the current language implementation of D is the strange behaviour of the transpose method with the only line of code: return Matrix(this).transposeAssign();Matrix(this) not compiling when 'this' is const is a bug. That's why I had to replace it with assignment. Postblit still has some unresolved problems. However, you should note that you have a bigger bug in that transposeAssign() returns a reference :) So unless transpose() returns Matrix (instead of ref Matrix), that's a problem.In C++ for example this compiles without any problems and results in correct transposed matrix copies - this works due to the existance of move semantics in C++ and is one of the coolest features since C++11 which increased and simplified codes in many cases enormously for value types just as structs in D.The fact that it compiles in C++ is a problem (this illustrates your initial implementation of transpose() translated into C++):

struct Matrix {
    // D: ref Matrix transpose() const;
    Matrix& transpose() const { return Matrix(*this).transposeAssign(); }
    // D: ref Matrix transposeAssign();
    Matrix& transposeAssign() { return *this; }
};

int main(int argc, char** argv) {
    Matrix m1;
    Matrix& m2 = m1.transpose(); // ooops! m2 refers to a dead temporary
}

I have ran several tests in order to test when the move assign or the move constructors are called in D and whenever I expected them to be called the copy-constructor or the postblit was called instead which was frustating imo.I've posted a complete illustration of how D moves rvalues; browse this thread back a bit.I also gave scoped imports a try and hoped that they were able to reduce my executable file and perhaps increase the performance of my program, none of which was true -> confused. Instead I now have more lines of code and do not see instantly what dependencies the module as itself has. So what is the point in scoped imports?They don't pollute the outer scope with symbols. 
If you just need to call a couple of functions from std.random inside *one* function, there's no need to pull names from std.random into the whole module.The mixin keyword is also nice to have but it feels similar to a dirty C-macro to me where text replacement with lower abstraction (unlike templates) takes place. Of course I am wrongYes you are :) It's not dirty and it's certainly not a macro. It's a great feature for generating code at compile time. For examples just browse through Phobos (e.g. std.functional), or take a look at the code in this thread: http://forum.dlang.org/thread/mailman.158.1391156715.13884.digitalmars-d@puremagic.com Maybe it's your syntax highlighting that throws you off? Then use q{} instead of "" :)Another weird thing is that the result ~= text(tabStr, this[r, c]) in the toString method is much slower than the two following lines of code: result ~= tabStr; result ~= to!string(this[r, c]); Does anybody have an answer to this?Probably due to the extra allocation in text(tabStr, this[r,c]), since it will perform ~ under the hood, while the appender already has storage.
Feb 18 2014
Hiho, here are the results of both compilers (DMD and LDC2) on my system:

LDC:
=========================================
allocationTest ... Time required: 5 secs, 424 msecs
multiplicationTest ... Time required: 1 secs, 744 msecs
toStringTest ... Time required: 0 secs, 974 msecs
transposeTest ... Time required: 0 secs, 998 msecs
scalarMultiplicationTest ... Time required: 0 secs, 395 msecs
matrixAddSubTest ... Time required: 0 secs, 794 msecs
matrixEqualsTest ... Time required: 0 secs, 0 msecs
identityMatrixTest ... Time required: 0 secs, 393 msecs
=========================================

DMD:
=========================================
allocationTest ... Time required: 3 secs, 161 msecs
multiplicationTest ... Time required: 2 secs, 6 msecs
toStringTest ... Time required: 1 secs, 365 msecs
transposeTest ... Time required: 1 secs, 45 msecs
scalarMultiplicationTest ... Time required: 0 secs, 405 msecs
matrixAddSubTest ... Time required: 0 secs, 809 msecs
matrixEqualsTest ... Time required: 0 secs, 430 msecs
identityMatrixTest ... Time required: 0 secs, 350 msecs
=========================================

All in all I must say that I am more pleased with the DMD results, as I am kind of worried about the LDC allocation test performance ... I also had to rewrite parts of the code, as some functions just weren't "pure" or "nothrow", such as the whole thing around this.data[] += other.data[].

Robin
Feb 20 2014
Robin:All in all I must say that I am more pleased with the DMD results as I am kind of worried about the LDC allocation test performance ...I am not sure of the causes of this, but the D GC was improved a little in the last two compiler versions.I also had to rewrite parts of the codes as some functions just weren't "pure" or "nothrow" such as the whole things around this.data[] += other.data[].This is because LDC2 is one version of the language behind. It's a temporary problem. Your multiplicationTest timings with LDC2 seem not to have reached the performance of your C++ code yet. You can take a look at the asm output of both. Perhaps in the D code you have some allocation you are not doing in the C++ code. Bye, bearophile
Feb 20 2014
On Thursday, 20 February 2014 at 21:32:10 UTC, Robin wrote:Hiho, here are the results of both compilers (DMD and LDC2) on my system: LDC: ========================================= allocationTest ... Time required: 5 secs, 424 msecs multiplicationTest ... Time required: 1 secs, 744 msecs toStringTest ... Time required: 0 secs, 974 msecs transposeTest ... Time required: 0 secs, 998 msecs scalarMultiplicationTest ... Time required: 0 secs, 395 msecs matrixAddSubTest ... Time required: 0 secs, 794 msecs matrixEqualsTest ... Time required: 0 secs, 0 msecs identityMatrixTest ... Time required: 0 secs, 393 msecs ========================================= DMD: ========================================= allocationTest ... Time required: 3 secs, 161 msecs multiplicationTest ... Time required: 2 secs, 6 msecs toStringTest ... Time required: 1 secs, 365 msecs transposeTest ... Time required: 1 secs, 45 msecs scalarMultiplicationTest ... Time required: 0 secs, 405 msecs matrixAddSubTest ... Time required: 0 secs, 809 msecs matrixEqualsTest ... Time required: 0 secs, 430 msecs identityMatrixTest ... Time required: 0 secs, 350 msecs ========================================= All in all I must say that I am more pleased with the DMD results as I am kind of worried about the LDC allocation test performance ... I also had to rewrite parts of the codes as some functions just weren't "pure" or "nothrow" such as the whole things around this.data[] += other.data[]. Robinwhat flags did you use?
Feb 21 2014
Hiho, as I used the ldmd wrapper for ldc2 I was able to use the same arguments given via the cmdfile. These were: -release -noboundscheck -O and -inline. Robin
Feb 22 2014
Stanislav Blinov:allocationTest ... Time required: 1 sec, 112 ms, 827 μs, and 3 hnsecs multiplicationTest ... Time required: 1 sec, 234 ms, 417 μs, and 8 hnsecsPhysics teaches us that those experimental measures are expressed with excessive precision. For such benchmarks it's wiser to write "1.11" seconds. Bye, bearophile
Feb 18 2014
On Monday, 17 February 2014 at 11:34:43 UTC, bearophile wrote:While it is cleaner, I remember array operations (such as data[] -= other.data[]) being significantly slower than doing it yourself. I don't know if this has been fixed. Perhaps using -m64 helps this. Also, I stress again that if you aren't compiling for 64-bit (-m64), it's very likely to give a significant improvement (because the optimizer will probably convert some of your operations into SIMD operations). Using LDC / GDC would also give a significant improvement.foreach (ref T element; this.data) { element *= scalar; }Try to use: data[] *= scalar;for (auto i = 0; i < this.dim.size; ++i) { this.data[i] += other.data[i]; }Try to use: data[] += other.data[];for (auto i = 0; i < this.dim.size; ++i) { this.data[i] -= other.data[i]; }Try to use: data[] -= other.data[];
Feb 17 2014
Kapps:I remember array operations (such as data[] -= other.data[]) being significantly slower than doing it yourself. I don't know if this has been fixed.They are not being "fixed". But for arrays as large as the ones here, they are fast enough. --------------------- I have tested the code (all in a single module) using the latest dmd and ldc2 (beta versions in both cases), compiling with: dmd -O -release -inline -noboundscheck matrix.d ldmd2 -O -release -inline -noboundscheck matrix.d Best timings: LDC2: multiplicationTest ... Time required: 2 secs, 522 msecs DMD: multiplicationTest ... Time required: 4 secs, 724 msecs Using link-time optimization, or using AVX/AVX2 on modern CPUs (via the compiler switch for AVX), could improve the LDC2 timings a little more. Bye, bearophile
Feb 17 2014
On 2/16/2014 6:47 PM, Robin wrote:Nick Sabalausky: Don't be shocked at the bad performance of a beginner D programmer. At least I am not shocked yet that my D implementation isn't performing as well as the highly optimized C++11 implementation or the JIT-optimized Java implementation. In general one can not compare the results of a native compilation optimization which shall run on different hardwares and a JIT optimization which is able to pull out every single optimization routine because it knows the system and all its components to compile at.I think you've misunderstood me. I was only explaining D's class vs struct system, and why in D (as opposed to Java) structs are better than classes in this particular case.
Feb 17 2014
On 2/15/2014 6:06 PM, Robin wrote:Matrix is still a class but I changed it to a final class preventing matrix methods to be virtual. Dimension is now a final struct (don't know if 'final' is affecting structs in any way tough ...). This mainly gave the multiplication a huge performance boost.Like in other languages, "final" means "cannot inherit from this". Since structs, by definition, don't support inheritance anyway, "final" isn't really applicable - they're inherently "final" no matter what.When converting the Matrix to a struct from class the multiplication even lowered from ~4.3 seconds to about 3.6 seconds. However, I am currently not sure if I want matrices to be structs (value types).They really should be structs. There's basically no benefit to making them classes. Classes are for when you need inheritance and don't need performance. Java, from what I understand, does a reasonable job compensating for the inherent inefficiency of classes by going to great lengths to have absolute top-of-the-line class-oriented optimizations and GC and strip out all the class underpinnings when possible (AIUI). Keep in mind too, the data inside dynamic arrays is already by reference. Arrays in D are basically like this: struct Array(T) { size_t length; T* ptr; } So even when your matrix is a struct, the actual values inside the matrix are still by reference (because it's through that pointer). So you're still not actually copying the data inside the matrix as you pass it around. You're only copying the dimension, length and pointer. And those are copied via a straight memcpy, not "one field at a time" as I've heard C++ does (though I could be wrong, it's been forever since I did much C++). And if you need to pass a struct by reference, there's always still "ref". Class really just isn't the right thing for a matrix, especially if you're paying attention to performance.I have also tried the LDC compiler. However, it gave me some strange bugs. E.g. 
it didn't like the following line: ref Matrix transpose() const { return new Matrix(this).transposeAssign(); }Doing that without parens around the "new Matrix(this)" is a very new feature, so I'm not entirely surprised LDC doesn't have it yet. Like Stanislav said, you can just add parens for now.
Feb 15 2014
On Friday, 14 February 2014 at 16:00:09 UTC, Robin wrote:this(size_t rows, size_t cols) { this.dim = Dimension(rows, cols); this.data = new T[this.dim.size]; enum nil = to!T(0); foreach(ref T element; this.data) element = nil; }Your foreach here is unnecessary. You can zero out an array using the syntax: this.data[] = 0; // probably compiles out as memset() And while I would expect to!T(0) to compile down to something with no particular logic, it is still a function call, not a macro, so you're better off letting the compiler figure out the best conversion from 0 for type T than using a library function for it.I think and hope that this is getting optimized via inlining. =) This works similar for opIndexAssign.From complaints that I have seen, very little gets inlined at the moment.Matrix opMul(ref const(Matrix) other) { if (this.dim.rows != other.dim.cols || this.dim.cols != other.dim.rows) { // TODO - still have to learn exception handling first ... } auto m = new Matrix(this.dim.rows, other.dim.cols); auto s = new Matrix(other).transposeAssign();Since you don't return s, it's probably better not to allocate an instance. I would also be curious to see your constructor which copies one matrix into a new one.Besides that I can't find a way how to make a use of move semantics like in C++. For example a rvalue-move-constructor or a move-assign would be a very nice thing for many tasks.Do you mean something like this? this.data[] = other.data[]; // probably compiles out as memcpy() this.data = other.data.dup; // different syntax, but should be the sameAnother nice thing to know would be if it is possible to initialize an array before it is default initialized with T.init where T is the type of the array's fields.I believe so, but I would have to scour about to find the syntax. Hopefully, someone else knows.
Feb 17 2014