digitalmars.D - Performance of loops

Chris (41/41) Apr 24 2015 I tested the performance of three types of loops (see code

John Colvin (4/45) Apr 24 2015 dmd's optimiser isn't great. GDC creates identical assembly for

Chris (3/60) Apr 24 2015 Yeah, I figure. However, the gap is a serious issue for dmd, even

"Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> (6/6) Apr 24 2015 Most of the time is taken up by the array's allocation. The

Chris (3/9) Apr 24 2015 I've tried it, and the loops indeed show similar results. What is

Steven Schveighoffer (22/32) Apr 24 2015 enum of an array literal is like pasting the array literal wherever the

Chris (3/41) Apr 24 2015 Thanks for the clarification.

Baz (8/49) Apr 24 2015 You should temper the results because these kinds of tests are

Chris (6/67) Apr 27 2015 Yes, I am aware of this. It's hard to come up with meaningful

"Chris" <wendlec tcd.ie> writes:

I tested the performance of three types of loops (see code 
below). It turns out that the fastest loop is the "plainLoop". 
Unless my examples are completely screwed up, the difference 
between "plainLoop" and the other two loops is gigantic (e.g.):

9 ms, 149 μs, and 4 hnsecs  // foreach (const ref w)
9 ms, 77 μs, and 8 hnsecs   // foreach (ref w)
1 ms, 183 μs, and 6 hnsecs  // foreach (w)

with -release -inline -O -boundscheck=off

8 ms, 492 μs, and 3 hnsecs
8 ms, 287 μs, and 1 hnsec
341 μs and 2 hnsecs

[compiler dmd v2.067.0, Linux Ubuntu, 64bit]


import std.datetime : benchmark, Duration;
import std.string : format;
import std.conv : to;
import std.stdio : writeln;

enum {
   string[] words = ["Hello", "world", "Ola", "mundo"],
}

void main() {
   auto result = benchmark!(constLoop, refLoop, 
plainLoop)(100_000);
   writeln(to!Duration(result[0]));
   writeln(to!Duration(result[1]));
   writeln(to!Duration(result[2]));

}

void constLoop() {
   size_t cnt;
   foreach (const ref w; words)
     cnt += w.length;
}

void refLoop() {
   size_t cnt;
   foreach (ref w; words)
     cnt += w.length;
}

void plainLoop() {
   size_t cnt;
   foreach (w; words)
     cnt += w.length;
}

Apr 24 2015

"John Colvin" <john.loughran.colvin gmail.com> writes:

On Friday, 24 April 2015 at 11:00:23 UTC, Chris wrote:
 I tested the performance of three types of loops (see code 
 below). It turns out that the fastest loop is the "plainLoop". 
 Unless my examples are completely screwed up, the difference 
 between "plainLoop" and the other two loops is gigantic (e.g.):

 9 ms, 149 μs, and 4 hnsecs  // foreach (const ref w)
 9 ms, 77 μs, and 8 hnsecs   // foreach (ref w)
 1 ms, 183 μs, and 6 hnsecs  // foreach (w)

 with -release -inline -O -boundscheck=off

 8 ms, 492 μs, and 3 hnsecs
 8 ms, 287 μs, and 1 hnsec
 341 μs and 2 hnsecs

 [compiler dmd v2.067.0, Linux Ubuntu, 64bit]


 import std.datetime : benchmark, Duration;
 import std.string : format;
 import std.conv : to;
 import std.stdio : writeln;

 enum {
   string[] words = ["Hello", "world", "Ola", "mundo"],
 }

 void main() {
   auto result = benchmark!(constLoop, refLoop, 
 plainLoop)(100_000);
   writeln(to!Duration(result[0]));
   writeln(to!Duration(result[1]));
   writeln(to!Duration(result[2]));

 }

 void constLoop() {
   size_t cnt;
   foreach (const ref w; words)
     cnt += w.length;
 }

 void refLoop() {
   size_t cnt;
   foreach (ref w; words)
     cnt += w.length;
 }

 void plainLoop() {
   size_t cnt;
   foreach (w; words)
     cnt += w.length;
 }

dmd's optimiser isn't great. GDC creates identical assembly for 
all three functions.

Rule of thumb: if you want high performance, use GDC or LDC.

Apr 24 2015

"Chris" <wendlec tcd.ie> writes:

On Friday, 24 April 2015 at 11:39:14 UTC, John Colvin wrote:
 On Friday, 24 April 2015 at 11:00:23 UTC, Chris wrote:
 I tested the performance of three types of loops (see code 
 below). It turns out that the fastest loop is the "plainLoop". 
 Unless my examples are completely screwed up, the difference 
 between "plainLoop" and the other two loops is gigantic (e.g.):

 9 ms, 149 μs, and 4 hnsecs  // foreach (const ref w)
 9 ms, 77 μs, and 8 hnsecs   // foreach (ref w)
 1 ms, 183 μs, and 6 hnsecs  // foreach (w)

 with -release -inline -O -boundscheck=off

 8 ms, 492 μs, and 3 hnsecs
 8 ms, 287 μs, and 1 hnsec
 341 μs and 2 hnsecs

 [compiler dmd v2.067.0, Linux Ubuntu, 64bit]


 import std.datetime : benchmark, Duration;
 import std.string : format;
 import std.conv : to;
 import std.stdio : writeln;

 enum {
  string[] words = ["Hello", "world", "Ola", "mundo"],
 }

 void main() {
  auto result = benchmark!(constLoop, refLoop, 
 plainLoop)(100_000);
  writeln(to!Duration(result[0]));
  writeln(to!Duration(result[1]));
  writeln(to!Duration(result[2]));

 }

 void constLoop() {
  size_t cnt;
  foreach (const ref w; words)
    cnt += w.length;
 }

 void refLoop() {
  size_t cnt;
  foreach (ref w; words)
    cnt += w.length;
 }

 void plainLoop() {
  size_t cnt;
  foreach (w; words)
    cnt += w.length;
 }

 dmd's optimiser isn't great. GDC creates identical assembly for 
 all three functions.

 Rule of thumb: if you want high performance, use GDC or LDC.

Yeah, I figure. However, the gap is a serious issue for dmd, even 
if it's only a reference compiler.

Apr 24 2015

"Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:

Most of the time is taken up by the array's allocation. The 
optimizers of both DMD and LDC evidently doesn't optimize it away 
when you use `ref`, even though it could in theory.

Remove the `enum` and just use a normal global variable, and you 
will get more or less identical times for all three loops. LDC 
even turns then into empty functions (didn't check for DMD).

Apr 24 2015

"Chris" <wendlec tcd.ie> writes:

On Friday, 24 April 2015 at 11:53:56 UTC, Marc Schütz wrote:
 Most of the time is taken up by the array's allocation. The 
 optimizers of both DMD and LDC evidently doesn't optimize it 
 away when you use `ref`, even though it could in theory.

 Remove the `enum` and just use a normal global variable, and 
 you will get more or less identical times for all three loops. 
 LDC even turns then into empty functions (didn't check for DMD).

I've tried it, and the loops indeed show similar results. What is 
it about `enum` that slows it down?

Apr 24 2015

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 4/24/15 9:13 AM, Chris wrote:
 On Friday, 24 April 2015 at 11:53:56 UTC, Marc Schütz wrote:
 Most of the time is taken up by the array's allocation. The optimizers
 of both DMD and LDC evidently doesn't optimize it away when you use
 `ref`, even though it could in theory.

 Remove the `enum` and just use a normal global variable, and you will
 get more or less identical times for all three loops. LDC even turns
 then into empty functions (didn't check for DMD).

 I've tried it, and the loops indeed show similar results. What is it
 about `enum` that slows it down?

enum of an array literal is like pasting the array literal wherever the 
enum lives.

So for example, your const loop:

void constLoop() {
   size_t cnt;
   foreach (const ref w; words)
     cnt += w.length;
}

Is really like saying:

void constLoop() {
   size_t cnt;
   foreach (const ref w; ["Hello", "world", "Ola", "mundo"])
     cnt += w.length;
}

And note that in D right now, an array literal is NOT stored in the data 
segment. It's generated every time you use it (with a call to 
_d_newarray, which allocates on the GC).

There was a push a while back to make array literals immutable and 
placed in the data segment. But it never panned out, would break too 
much code.

-Steve

Apr 24 2015

"Chris" <wendlec tcd.ie> writes:

On Friday, 24 April 2015 at 13:55:49 UTC, Steven Schveighoffer 
wrote:
 On 4/24/15 9:13 AM, Chris wrote:
 On Friday, 24 April 2015 at 11:53:56 UTC, Marc Schütz wrote:
 Most of the time is taken up by the array's allocation. The 
 optimizers
 of both DMD and LDC evidently doesn't optimize it away when 
 you use
 `ref`, even though it could in theory.

 Remove the `enum` and just use a normal global variable, and 
 you will
 get more or less identical times for all three loops. LDC 
 even turns
 then into empty functions (didn't check for DMD).

 I've tried it, and the loops indeed show similar results. What 
 is it
 about `enum` that slows it down?

 enum of an array literal is like pasting the array literal 
 wherever the enum lives.

 So for example, your const loop:

 void constLoop() {
   size_t cnt;
   foreach (const ref w; words)
     cnt += w.length;
 }

 Is really like saying:

 void constLoop() {
   size_t cnt;
   foreach (const ref w; ["Hello", "world", "Ola", "mundo"])
     cnt += w.length;
 }

 And note that in D right now, an array literal is NOT stored in 
 the data segment. It's generated every time you use it (with a 
 call to _d_newarray, which allocates on the GC).

 There was a push a while back to make array literals immutable 
 and placed in the data segment. But it never panned out, would 
 break too much code.

 -Steve

Thanks for the clarification.

Apr 24 2015

"Baz" <bb.temp gmx.com> writes:

On Friday, 24 April 2015 at 11:00:23 UTC, Chris wrote:
 I tested the performance of three types of loops (see code 
 below). It turns out that the fastest loop is the "plainLoop". 
 Unless my examples are completely screwed up, the difference 
 between "plainLoop" and the other two loops is gigantic (e.g.):

 9 ms, 149 μs, and 4 hnsecs  // foreach (const ref w)
 9 ms, 77 μs, and 8 hnsecs   // foreach (ref w)
 1 ms, 183 μs, and 6 hnsecs  // foreach (w)

 with -release -inline -O -boundscheck=off

 8 ms, 492 μs, and 3 hnsecs
 8 ms, 287 μs, and 1 hnsec
 341 μs and 2 hnsecs

 [compiler dmd v2.067.0, Linux Ubuntu, 64bit]


 import std.datetime : benchmark, Duration;
 import std.string : format;
 import std.conv : to;
 import std.stdio : writeln;

 enum {
   string[] words = ["Hello", "world", "Ola", "mundo"],
 }

 void main() {
   auto result = benchmark!(constLoop, refLoop, 
 plainLoop)(100_000);
   writeln(to!Duration(result[0]));
   writeln(to!Duration(result[1]));
   writeln(to!Duration(result[2]));

 }

 void constLoop() {
   size_t cnt;
   foreach (const ref w; words)
     cnt += w.length;
 }

 void refLoop() {
   size_t cnt;
   foreach (ref w; words)
     cnt += w.length;
 }

 void plainLoop() {
   size_t cnt;
   foreach (w; words)
     cnt += w.length;
 }

You should temper the results because these kinds of tests are 
known not to represent well the reality. In a real program, the 
loops performances are affected by the instruction cache, the 
memory cache, the complexity of the operations inside the loop 
(nested TEST/CMP). Actually this kind of test represents an 
ideal, unreallisctic situation, that will never happend in a 
"IRL" program.

Apr 24 2015

"Chris" <wendlec tcd.ie> writes:

On Friday, 24 April 2015 at 13:27:18 UTC, Baz wrote:
 On Friday, 24 April 2015 at 11:00:23 UTC, Chris wrote:
 I tested the performance of three types of loops (see code 
 below). It turns out that the fastest loop is the "plainLoop". 
 Unless my examples are completely screwed up, the difference 
 between "plainLoop" and the other two loops is gigantic (e.g.):

 9 ms, 149 μs, and 4 hnsecs  // foreach (const ref w)
 9 ms, 77 μs, and 8 hnsecs   // foreach (ref w)
 1 ms, 183 μs, and 6 hnsecs  // foreach (w)

 with -release -inline -O -boundscheck=off

 8 ms, 492 μs, and 3 hnsecs
 8 ms, 287 μs, and 1 hnsec
 341 μs and 2 hnsecs

 [compiler dmd v2.067.0, Linux Ubuntu, 64bit]


 import std.datetime : benchmark, Duration;
 import std.string : format;
 import std.conv : to;
 import std.stdio : writeln;

 enum {
  string[] words = ["Hello", "world", "Ola", "mundo"],
 }

 void main() {
  auto result = benchmark!(constLoop, refLoop, 
 plainLoop)(100_000);
  writeln(to!Duration(result[0]));
  writeln(to!Duration(result[1]));
  writeln(to!Duration(result[2]));

 }

 void constLoop() {
  size_t cnt;
  foreach (const ref w; words)
    cnt += w.length;
 }

 void refLoop() {
  size_t cnt;
  foreach (ref w; words)
    cnt += w.length;
 }

 void plainLoop() {
  size_t cnt;
  foreach (w; words)
    cnt += w.length;
 }

 You should temper the results because these kinds of tests are 
 known not to represent well the reality. In a real program, the 
 loops performances are affected by the instruction cache, the 
 memory cache, the complexity of the operations inside the loop 
 (nested TEST/CMP). Actually this kind of test represents an 
 ideal, unreallisctic situation, that will never happend in a 
 "IRL" program.

Yes, I am aware of this. It's hard to come up with meaningful 
tests in no time. At least this one helped me to understand that 
`enum` can be very harmful performance wise.

Unfortunately, I cannot easily extract loops and logic of my real 
programs to put on the forum. Too complicated.

Apr 27 2015

D Programming

C/C++ Programming

Other

digitalmars.D - Performance of loops