
digitalmars.D.learn - Array fill performance differences between for, foreach, slice

reply data pulverizer <data.pulverizer gmail.com> writes:
I've observed large differences in timing performance while 
filling arrays using different methods (for vs foreach vs arr[] = 
x) and don't know why. I've looked at array.d 
(https://github.com/dlang/dmd/blob/9792735c82ac997d11d7fe6c3d6c604389b3f5bd/src/dmd/root/array.d) but I'm still none the wiser.

Here is an example:

```
/* fill.d */

import std.stdio: writeln;
import std.typecons: Tuple, tuple;
import std.algorithm.iteration: mean;
import std.algorithm.iteration: sum;
import std.datetime.stopwatch: AutoStart, StopWatch;


/* Benchmarking Function */
auto bench(alias fun, string units = "msecs",
           ulong minN = 10, bool doPrint = false)(ulong n, string msg = "")
{
   auto times = new double[n];
   auto sw = StopWatch(AutoStart.no);
   for(ulong i = 0; i < n; ++i)
   {
     sw.start();
     fun();
     sw.stop();
     times[i] = cast(double)sw.peek.total!units;
     sw.reset();
   }
   double ave = mean(times);
   double sd = 0;

   if(n >= minN)
   {
     for(ulong i = 0; i < n; ++i)
       sd += (times[i] - ave)^^2;
     sd /= (n - 1);
     sd ^^= 0.5;
   }else{
     sd = double.nan;
   }

   static if(doPrint)
     writeln(msg ~ "Mean time(" ~ units ~ "): ", ave,
             ", Standard Deviation: ", sd);

   return tuple!("mean", "sd")(ave, sd);
}

/* Fill Functions */
auto fill_for(alias x, ulong n)()
{
   alias T = typeof(x);
   auto arr = new T[n];

   for(ulong i = 0; i < n; ++i)
     arr[i] = x;

   return arr;
}

auto fill_foreach(alias x, ulong n)()
{
   alias T = typeof(x);
   auto arr = new T[n];

   foreach(ref el; arr)
     el = x;

   return arr;
}

auto fill_slice(alias x, ulong n)()
{
   alias T = typeof(x);
   auto arr = new T[n];

   arr[] = x;

   return arr;
}


void main()
{
   double x = 42;

   bench!(fill_slice!(x, 100_000), "usecs", 10, true)(100, "Slice: ");
   bench!(fill_foreach!(x, 100_000), "usecs", 10, true)(100, "Foreach: ");
   bench!(fill_for!(x, 100_000), "usecs", 10, true)(100, "For: ");
}

/*
$ dmd fill.d && ./fill
Slice: Mean time(usecs): 87.38, Standard Deviation: 54.1542
Foreach: Mean time(usecs): 179.9, Standard Deviation: 41.4109
For: Mean time(usecs): 245.81, Standard Deviation: 53.0798

$ dmd --version
DMD64 D Compiler v2.090.1
...
*/
```

It would be great to know why there are such large performance differences between these approaches, and to see where opSliceAssign (or the equivalent method) for D's native arrays is implemented. Playing with `-boundscheck` made no difference to the relative performance. Thanks.
Mar 31 2020
next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 3/31/20 5:30 PM, data pulverizer wrote:
 I've observed large differences in timing performance while filling 
 arrays using different methods (for vs foreach vs arr[] = x) and don't 
 know why. I've looked at array.d 
 (https://github.com/dlang/dmd/blob/9792735c82ac997d11d7fe6c3d6c604389b3f5bd/src/dmd/root/array.d)
 but I'm still none the wiser.
for: you are indexing an array on each step, incurring a bounds check, and also doing a pointer calculation on each access.

foreach: Better, iterates a pointer over the whole array, so it's going to be faster.

slice assign: Best, this uses whatever tricks are possible in the given architecture. Most likely memset, or memcpy. These are insanely optimized functions, which is why they perform so well.

-Steve
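
As a rough illustration of the three shapes being compared (the function names are made up and this is not the compiler's actual lowering, just a sketch of the idea):

```
/* Illustrative sketch only -- made-up names, not the compiler's
   actual lowering of the three variants. */
void fillIndexed(double[] arr, double x)
{
    // index on every step: pointer calculation plus a bounds check per access
    for (size_t i = 0; i < arr.length; ++i)
        arr[i] = x;
}

void fillPointer(double[] arr, double x)
{
    // roughly what foreach(ref el; arr) does: walk a pointer, no bounds check
    for (double* p = arr.ptr, end = arr.ptr + arr.length; p < end; ++p)
        *p = x;
}

void fillSlice(double[] arr, double x)
{
    // handed off to an optimized block-fill routine in the runtime/libc
    arr[] = x;
}
```

With optimizations on and bounds checks removed, the indexed version can end up looking much like the pointer walk.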
Mar 31 2020
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2020-03-31 23:30, data pulverizer wrote:

 $ dmd fill.d && ./fill
You have not enabled optimizations. You should compile with `-O -release -inline` to enable all optimizations. Without optimizations I get numbers like these:

Slice: Mean time(usecs): 92.91, Standard Deviation: 49.8002
Foreach: Mean time(usecs): 183.95, Standard Deviation: 30.749
For: Mean time(usecs): 239.74, Standard Deviation: 30.753

With optimizations turned on I get quite varying results:

Slice: Mean time(usecs): 96.84, Standard Deviation: 52.5676
Foreach: Mean time(usecs): 95.84, Standard Deviation: 21.9244
For: Mean time(usecs): 106.73, Standard Deviation: 31.2545

Slice: Mean time(usecs): 192.03, Standard Deviation: 115.971
Foreach: Mean time(usecs): 177.06, Standard Deviation: 128.012
For: Mean time(usecs): 89.14, Standard Deviation: 25.681

Slice: Mean time(usecs): 97.74, Standard Deviation: 53.2058
Foreach: Mean time(usecs): 79.72, Standard Deviation: 11.0088
For: Mean time(usecs): 80.54, Standard Deviation: 12.06

If you care about performance you should really compile using LDC (with `-O5 -release -flto=full -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`), which usually produces much better code:

Slice: Mean time(usecs): 50.58, Standard Deviation: 46.2621
Foreach: Mean time(usecs): 53.92, Standard Deviation: 19.8039
For: Mean time(usecs): 39.89, Standard Deviation: 7.80041

Slice: Mean time(usecs): 76.62, Standard Deviation: 73.0014
Foreach: Mean time(usecs): 49.63, Standard Deviation: 24.5672
For: Mean time(usecs): 40.02, Standard Deviation: 7.67388

--
/Jacob Carlborg
Mar 31 2020
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
 You have not enabled optimizations. You should compile with `-O 
 -release -inline` to enable all optimizations.
-release should *never* be used. You're trading memory safety and security holes for a few microseconds of execution time. Not worth the risk.

For realistic code, try to make it fast without using that switch; there are plenty of other ways, like isolated @trusted sections as needed. If you have significant cost from asserts, you might turn them off with `-check=assert=off` to leave them out of the compile without affecting the other safety features.
 If you care about performance you should really compile using 
 LDC (with `-O5 -release -flto=full
Again, forget the -release switch. The rest is cool, though.
Apr 01 2020
parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Wednesday, 1 April 2020 at 12:22:48 UTC, Adam D. Ruppe wrote:
 On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg 
 wrote:
 You have not enabled optimizations. You should compile with 
 `-O -release -inline` to enable all optimizations.
 -release should *never* be used. You're trading memory safety and 
 security holes for a few microseconds of execution time. Not worth 
 the risk.
It is nice that bounds checks remain in place when using -release and the code is @safe.
Apr 01 2020
parent Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 1 April 2020 at 15:04:44 UTC, Jesse Phillips wrote:
 It is nice that bounds checks remain in place when using 
 -release and the code is @safe.
Yeah, it used to be even worse than it is now, but it is still a terrible switch that should NEVER be used. There are always better ways, and we should stop telling people to use it.

The newer `-check` and `-boundscheck` switches give more fine-grained control if you need it (and again, you never should, since you can control this on a per-expression basis with @trusted and the .ptr attribute).
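
A minimal sketch of that per-expression control (my own example, assuming a loop where the index is already known to be in range):

```
// Sketch only: stay @safe overall, but skip the bounds check for one
// access via a small @trusted lambda over .ptr.
@safe double sum(const double[] arr)
{
    double total = 0;
    foreach (i; 0 .. arr.length)
        total += (() @trusted => arr.ptr[i])(); // unchecked access, scoped
                                                // to this one expression
    return total;
}
```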
Apr 01 2020
prev sibling parent lithium iodate <whatdoiknow doesntexist.net> writes:
On Wednesday, 1 April 2020 at 06:48:09 UTC, Jacob Carlborg wrote:
 If you care about performance you should really compile using 
 LDC (with `-O5 -release -flto=full 
 -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`), which usually 
 produces much better code:

 Slice: Mean time(usecs): 50.58, Standard Deviation: 46.2621
 Foreach: Mean time(usecs): 53.92, Standard Deviation: 19.8039
 For: Mean time(usecs): 39.89, Standard Deviation: 7.80041

 Slice: Mean time(usecs): 76.62, Standard Deviation: 73.0014
 Foreach: Mean time(usecs): 49.63, Standard Deviation: 24.5672
 For: Mean time(usecs): 40.02, Standard Deviation: 7.67388
The benchmark is highly flawed; the results for LDC are completely meaningless. LDC produces identical code for every case, and the filling procedures are all elided.
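
One way to guard a benchmark against that (a sketch of my own, not from this thread) is to make each result observable, so the optimizer cannot treat the fill as dead code:

```
import std.stdio : writeln;

__gshared double sink = 0; // hypothetical accumulator the results flow into

double[] fill_slice_used(double x, size_t n)
{
    auto arr = new double[n];
    arr[] = x;
    return arr;
}

void main()
{
    foreach (i; 0 .. 100)
        sink += fill_slice_used(42.0, 100_000)[$ - 1]; // result is consumed,
                                                       // so the fill survives
    writeln(sink); // printing keeps the accumulator itself live
}
```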
Apr 01 2020
prev sibling parent reply data pulverizer <data.pulverizer gmail.com> writes:
Thanks for all the suggestions made so far. I am still interested in looking at the implementation details of the slice assign `arr[] = x`, which I can't seem to find. Before I made my initial post, I tried doing a `memcpy` and `memmove` under a `for` loop, but it did not change the performance or reach the same kind of performance as the slice assignment, so I didn't bother to mention them; I haven't tried them with the suggested compiler flags, though.

Steven Schveighoffer also suggested using `memset` (as well as `memcpy`). Please correct me if I am wrong, but it looks as if `memset` can only write from an `int`-sized source, and I need the source size to be any potential size (T).
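
For what it's worth, `memset` fills with a single repeated byte, so it is only directly usable when every byte of the pattern is the same. One hedged workaround sketch (my own, not something suggested in the thread) for an arbitrary element type T is to seed the first element and grow the filled region with `memcpy`:

```
import core.stdc.string : memcpy;

// Sketch only: fill arr with x by doubling the already-filled prefix.
// Plain blitting, so only appropriate for types without postblits or
// copy constructors.
void fillDoubling(T)(T[] arr, T x)
{
    if (arr.length == 0) return;
    arr[0] = x;
    size_t filled = 1;
    while (filled < arr.length)
    {
        const size_t chunk = filled <= arr.length - filled
            ? filled
            : arr.length - filled;
        memcpy(arr.ptr + filled, arr.ptr, chunk * T.sizeof);
        filled += chunk;
    }
}
```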

     
----------------------------------------------------------------------

On a related aside, I noticed that the timings were reduced across the board, so much so that the initial slice time halved, when initialising with:

```
auto arr = (cast(T*)GC.malloc(T.sizeof*n, GC.BlkAttr.NO_SCAN | 
GC.BlkAttr.APPENDABLE))[0..n];
```

Instead of:

```
auto arr = new T[n];
```

If I am filling the array with something anyway, for example `arr[] = x`, is there anything wrong with this approach? Should I care if the memory is not "scanned"? I have no idea what that means, but I assume it's some kind of check that takes time and is switched off by `GC.BlkAttr.NO_SCAN`. Are there any downsides to this that I should be aware of?

I noticed that `GC.malloc()` is based on `gc_malloc()`, which gives the bit-mask option that makes it faster than `core.stdc.stdlib: malloc`. Is `gc_malloc` OS dependent? I can't find it in the standard C library; the only reference I found for it is [here](https://linux.die.net/man/3/gc), and it is named slightly differently but appears to be the same function. In `core.memory` it is specified by an `extern (C)` declaration (https://github.com/dlang/druntime/blob/master/src/core/memory.d), so I guess it must be somewhere on my system?

Thanks
Apr 01 2020
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/1/20 11:23 AM, data pulverizer wrote:
 Thanks for all the suggestions made so far. I am still interested in 
 looking at the implementation details of the slice assign `arr[] = x` 
 which I can't seem to find. Before I made my initial post, I tried doing 
 a `memcpy` and `memmove` under a `for` loop but it did not change the 
 performance or get the same kind of performance as the initial slice 
 performance so I didn't bother to mention them, I haven't tried it with 
 the suggested compiler flags though.
Using the disassembly on run.dlang.io, it says it's using __memsetDouble.
 
  StevenSchveighoffer also suggested using `memset` (as well as `memcpy`) 
 please correct me if I am wrong but it looks as if `memset` can only 
 write from an `int` sized source and I need the source size to be any 
 potential size (T).
Again, the compiler uses whatever tools are available. It might be memset, it might be something else. In the case of your code, it's using __memsetDouble, which I have no idea where it's defined (probably libc).
 ----------------------------------------------------------------------
 
 On a related aside I noticed that the timing was reduced across the 
 board so much so that the initial slice time halved when initialising with:
 
 ```
 auto arr = (cast(T*)GC.malloc(T.sizeof*n, GC.BlkAttr.NO_SCAN | 
 GC.BlkAttr.APPENDABLE))[0..n];
 ```
 
 Instead of:
 
 ```
 auto arr = new T[n];
 ```
What this means is, don't scan the block for pointers during a GC collect cycle. If you have pointers in your T, this is a very bad idea. Not only that, but this does not initialize the appendable data at the end of the block.

In addition, GC.malloc just zero-initializes the data. If you do new T[n], and T has an initializer, it's going to be a lot more expensive. If you are going to use this, remove the GC.BlkAttr.APPENDABLE.

In the case of double, it is initialized to NaN. This could explain the difference in timing.
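
As a hedged sketch of that distinction (my own illustration): `new T[n]` runs T's default initialisation (NaN for double), whereas the approaches below hand back memory without that step, so they only make sense if every element is overwritten immediately.

```
import core.memory : GC;
import std.array : uninitializedArray;

// Sketch only. NO_SCAN is kept because double contains no pointers;
// APPENDABLE is dropped per the advice above.
double[] viaGCMalloc(size_t n)
{
    auto p = cast(double*) GC.malloc(double.sizeof * n, GC.BlkAttr.NO_SCAN);
    return p[0 .. n]; // no double.init (NaN) initialisation has been run
}

// Phobos helper with the same intent: allocate without default initialisation.
double[] viaUninitialized(size_t n)
{
    return uninitializedArray!(double[])(n);
}
```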
 
 I noticed that `GC.malloc()` is based on `gc_malloc()` which gives the 
 bit mask option that makes it  faster than `core.stdc.stdlib: malloc`. 
 Is `gc_malloc` OS dependent? I can't find it in the standard C library, 
 the only reference I found for it is 
 [here](https://linux.die.net/man/3/gc) and it is named slightly 
 differently but appears to be the same function. In `core.memory`, it is 
 specified by the `extern (C)` declaration 
 (https://github.com/dlang/druntime/blob/master/src/core/memory.d) so I 
 guess it must be somewhere on my system?
It's in the D garbage collector, here:

https://github.com/dlang/druntime/blob/2eec30b35bab308a37298331353bdce5fee1b657/src/gc/proxy.d#L166

extern(C) functions can be implemented in D. The major difference between standard D functions and extern(C) is that the latter does not do name mangling.

-Steve
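
A tiny illustration of that mangling point (a sketch with hypothetical names):

```
// An extern(C) function written in D keeps its plain C symbol name;
// a regular D function gets a mangled one.
extern (C) int my_c_add(int a, int b) { return a + b; }

int myDAdd(int a, int b) { return a + b; }

pragma(msg, my_c_add.mangleof); // prints: my_c_add
pragma(msg, myDAdd.mangleof);   // prints a mangled name such as _D3foo6myDAddFiiZi
```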
Apr 01 2020
parent data pulverizer <data.pulverizer gmail.com> writes:
In all honesty, till now I haven't really thought deeply about 
memory allocation; I just assumed that malloc, free, and so on 
were low-level operating system functions and that was that. 
I've heard people in the D community talk about garbage 
collection and memory allocation, but I didn't think it was 
something I had to worry about.

A lot of the work I do is statistics and data science based, 
where speed is often important; calculations on arrays involve a 
lot of array instantiation, moving, and copying of elements, and 
so forth. For all of that, the speed of memory operations 
matters. The fact that I could get those kinds of operations to 
run faster using D's GC.malloc and different types of iteration 
is significant. Today it hit me that I actually need to study 
memory allocation, and possibly garbage collection, as topics in 
their own right - not so that I can implement a garbage 
collector, but so that I can write faster code.

I know to everyone here that's just basic stuff, but it's a bit 
of a revelation for me. :-)
Apr 08 2020