
digitalmars.D - Array Appending and DRuntime

dsimcha <dsimcha yahoo.com> writes:
For a while now, I've had a suspicion that the builtin array appending has
gotten slower.  The recent talk about array appenders here brought that to a
head, so I decided to test it and find out.  Here's the test program I used:

import std.stdio, std.perf;

void main() {
    scope pc = new PerformanceCounter;
    pc.start;
    uint[] foo;
    foreach(i; 0..1_000_000) {
        foo ~= i;
    }
    pc.stop;
    writeln(pc.milliseconds);
}

Results:

DMD 2.019 (Last release before druntime):  42 milliseconds.
DMD 2.020 (First release with druntime):  ~1000 milliseconds.
DMD 2.029 (Current version):  ~1000 milliseconds.
DMD 2.029 (Replacing ~= with the Appender struct):  18 milliseconds.
DMD 2.029 (Replacing builtin array with rangeextra.TNew):  19 milliseconds.

I think it's abundantly clear that I wasn't just imagining things:  something
changed between 2.019 and 2.020 that made the builtin ~= about 25x slower.  It
is probably related to druntime, since that was the only major change between
those two releases.  If we could put this back to the way it was in 2.019, it
would drastically reduce the need for more complicated solutions.  Here's some
speculation about what might have happened:

1.  Since Appender and TNew both use the builtin array's reallocation scheme
and are still fast, it's not a reallocation scheme issue.
2.  gcx.d appears to cache the last block size query.  This means that
repeatedly querying the same block in a single-threaded program (where
thread_needLock() returns false and no lock is necessary) is very fast.  This
is true in both the old Phobos GC and the druntime GC.  I wonder if this was
somehow bypassed by the ~= operator when druntime was integrated with DMD in
its early days.  (A sketch of the idea follows.)
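
To make the caching idea concrete, here is a minimal sketch of such a
single-entry "last block queried" cache.  The names and structure here are
hypothetical, not the real gcx.d code:

// Hypothetical stand-in for the GC's real (slow) block size lookup.
size_t slowLookup(size_t p) { return 16; }

size_t lastPtr;   // single-entry cache: the block queried most recently
size_t lastSize;  // and its cached size

size_t blockSize(size_t p) {
    if (p == lastPtr)
        return lastSize;        // hit: no lock, no walk of GC data structures
    lastPtr  = p;
    lastSize = slowLookup(p);   // miss: fall back to the full lookup
    return lastSize;
}

void main() {
    // Appending to one array queries the same block over and over:
    // everything after the first call is a cache hit.
    blockSize(0x1000); blockSize(0x1000); blockSize(0x1000);
    // Appending to two arrays alternates between blocks, so a
    // single-entry cache misses on every call.
    blockSize(0x1000); blockSize(0x2000); blockSize(0x1000);
}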
Apr 25 2009
dsimcha <dsimcha yahoo.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 2.  gcx.d appears to cache the last block size query.  This means that
 repeatedly querying the same block in a single threaded program (where
 thread_needLock() returns false and no lock is necessary) is very fast.  This
 is true in both the old Phobos GC and the druntime GC.  I wonder if this was
 somehow bypassed by the ~= operator when druntime was integrated with DMD in
 its early days.
On further examination, this is clearly somehow related to caching.  Here is a
very similar test program, one that appends to two arrays instead of one:

import std.stdio, std.perf;

void main() {
    scope pc = new PerformanceCounter;
    pc.start;
    uint[] foo, bar;
    foreach(i; 0..1_000_000) {
        foo ~= i;
        bar ~= i;
    }
    pc.stop;
    writeln(pc.milliseconds);
}

Timings:

DMD 2.019:  ~1800 ms
DMD 2.029:  ~2300 ms  (Note:  Still slower, but not by as much, even in absolute terms)
DMD 2.029 (Using Appender instead of ~=):  49 ms

By appending to two arrays, we screw up the caching scheme, hence the much
poorer performance.  However, most use cases probably involve appending to only
one array at a time.  Since this is a clear regression, I'll file a bug report.
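
If the single-entry cache idea above is right, the effect should also show up
when querying block info directly.  A hedged sketch, assuming druntime's
core.memory.GC.query goes through the same cached lookup path (it may not;
this is a sketch, not a proof):

import std.stdio, std.perf;
import core.memory;

void main() {
    auto a = new uint[](1000);
    auto b = new uint[](1000);

    scope same = new PerformanceCounter;
    same.start;
    foreach (i; 0 .. 1_000_000)
        GC.query(a.ptr);       // same block every time: should mostly hit
    same.stop;

    scope alternating = new PerformanceCounter;
    alternating.start;
    foreach (i; 0 .. 500_000) {
        GC.query(a.ptr);       // alternating between two blocks: a
        GC.query(b.ptr);       // single-entry cache misses on every call
    }
    alternating.stop;

    writeln("same block:   ", same.milliseconds, " ms");
    writeln("alternating:  ", alternating.milliseconds, " ms");
}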
Apr 25 2009
bearophile <bearophileHUGS lycos.com> writes:
dsimcha:
 DMD 2.019 (Last release before druntime):  42 milliseconds.
 DMD 2.020 (First release with druntime):  ~1000 milliseconds.
 DMD 2.029 (Current version):  ~1000 milliseconds.
 DMD 2.029 (Replacing ~= with the Appender struct):  18 milliseconds.
 DMD 2.029 (Replacing builtin array with rangeextra.TNew):  19 milliseconds.
I have done some more timings.  Appending 100_000_000 uints, in seconds:

DMD 1.042:                7.51
DMD 1.042, ArrayBuilder:  1.81
DMD 2.029:                a lot
DMD 2.029, Appender:      22.83

I don't like the user interface of Appender.

---------------
The code I have used:

void main() {
    uint[] foo;
    for (int i; i < 100_000_000; i++)
        foo ~= i;
}
---------------
import d.builders: ArrayBuilder;

void main() {
    ArrayBuilder!(uint) b;
    for (int i; i < 100_000_000; i++)
        b ~= i;
    uint[] arr = b.toarray;
}
---------------
import std.array: appender;

void main() {
    uint[] arr;
    auto app = appender(&arr);
    for (int i; i < 100_000_000; i++)
        app.put(i);
    auto result = app.data;
}

Bye,
bearophile
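
For comparison, the plain ~=-style interface preferred above is easy to
sketch.  The following is a hypothetical builder, not the actual
d.builders.ArrayBuilder:

struct SimpleBuilder(T) {
    private T[] buf;      // backing storage, grown geometrically
    private size_t len;   // number of elements actually appended

    // D 2.029-era operator overload name for ~= .
    void opCatAssign(T item) {
        if (len == buf.length)
            buf.length = buf.length ? buf.length * 2 : 16; // amortized O(1)
        buf[len++] = item;
    }

    T[] toarray() { return buf[0 .. len]; }
}

void main() {
    SimpleBuilder!(uint) b;
    for (int i; i < 1_000_000; i++)
        b ~= i;
    uint[] arr = b.toarray;
}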
Apr 25 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 dsimcha:
 DMD 2.019 (Last release before druntime):  42 milliseconds.
 DMD 2.020 (First release with druntime):  ~1000 milliseconds.
 DMD 2.029 (Current version):  ~1000 milliseconds.
 DMD 2.029 (Replacing ~= with the Appender struct):  18 milliseconds.
 DMD 2.029 (Replacing builtin array with rangeextra.TNew):  19 milliseconds.
 I have done some more timings.  Appending 100_000_000 uints, in seconds:

 DMD 1.042:                7.51
 DMD 1.042, ArrayBuilder:  1.81
 DMD 2.029:                a lot
 DMD 2.029, Appender:      22.83
I'm not sure what machine you're testing on, but on my (crappy old) laptop
your Appender example completes in 4.95 seconds.  Are you sure you are
measuring the right thing?

Andrei
Apr 25 2009
bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 I'm not sure on what machine you test, but on my (crappy old) laptop, 
 your Appender example completes in 4.95 seconds. Are you sure you are 
 measuring the right thing?
My purpose is to have a good standard library and a good language, and errors
of mine are always possible; benchmarks are tricky things.

I have redone the tests.  The CPU is unloaded, no other computation is going
on; the system is WinXP with 2 GB RAM and a 2 GHz Core 2.  The new timings
are no less than 22.6 seconds, repeatedly.  As the running time I use a
timing command similar to the "time" of Linux, which gives the total running
time of the program (final deallocations included).

I think Appender can be improved some.  You can also try DMD1 to compare
timings; it's not a fully dead language yet.

Bye,
bearophile
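
One way to separate the append loop from program teardown is to time inside
the program as well.  A sketch using the same std.perf counter as the first
post; it measures only the loop, whereas an external "time"-style command
also counts startup and the final deallocation of the ~400 MB array:

import std.stdio, std.perf;

void main() {
    scope pc = new PerformanceCounter;
    pc.start;
    uint[] foo;
    for (int i; i < 100_000_000; i++)
        foo ~= i;
    pc.stop;
    // Only the loop is measured here; final GC/heap teardown is not.
    writeln(pc.milliseconds, " ms");
}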
Apr 25 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 I'm not sure on what machine you test, but on my (crappy old)
 laptop, your Appender example completes in 4.95 seconds. Are you
 sure you are measuring the right thing?
 My purpose is to have a good standard library and a good language, and
 errors of mine are always possible; benchmarks are tricky things.

 I have redone the tests.  The CPU is unloaded, no other computation is
 going on; the system is WinXP with 2 GB RAM and a 2 GHz Core 2.  The new
 timings are no less than 22.6 seconds, repeatedly.  As the running time I
 use a timing command similar to the "time" of Linux, which gives the total
 running time of the program (final deallocations included).

 I think Appender can be improved some.
I can't improve what I can't measure.  On my system, liberally loaded with a
variety of programs and weaker than yours (1.8 GHz/512 MB laptop), I get 4.8
seconds with -O and 5.3 seconds without.  The major difference is the OS
(Ubuntu on my laptop), but I am puzzled, as Appender doesn't assume a lot
about the OS.  I time with Unix time, and I copied the code straight from
your post.  Please advise.

Andrei
Apr 25 2009
dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 bearophile wrote:
 Andrei Alexandrescu:
 I'm not sure on what machine you test, but on my (crappy old)
 laptop, your Appender example completes in 4.95 seconds. Are you
 sure you are measuring the right thing?
 My purpose is to have a good standard library and a good language, and
 errors of mine are always possible; benchmarks are tricky things.

 I have redone the tests.  The CPU is unloaded, no other computation is
 going on; the system is WinXP with 2 GB RAM and a 2 GHz Core 2.  The new
 timings are no less than 22.6 seconds, repeatedly.  As the running time I
 use a timing command similar to the "time" of Linux, which gives the total
 running time of the program (final deallocations included).

 I think Appender can be improved some.
 I can't improve what I can't measure.  On my system, liberally loaded with
 a variety of programs and weaker than yours (1.8 GHz/512 MB laptop), I get
 4.8 seconds with -O and 5.3 seconds without.  The major difference is the
 OS (Ubuntu on my laptop), but I am puzzled, as Appender doesn't assume a
 lot about the OS.  I time with Unix time, and I copied the code straight
 from your post.  Please advise.

 Andrei
I think I can actually explain this.  For some reason (I don't know why) the
Linux GC is less susceptible to false pointers than the Windows GC.  You can
prove this by playing with associative arrays and seeing how big an
associative array has to get before there is a high probability that the GC
won't free it.  (See bugzilla 2105.)  I get timings more similar to
bearophile's on my WinXP box.

If blocks of memory keep getting hit with false pointers on reallocation
(which they do when arrays get this big, at least on Windows), and you have
to keep allocating more memory from the OS, etc., of course it's going to be
slower.

As for why the Linux GC is less susceptible to false pointers, I really don't
know; it's just an empirical observation.  I would guess that maybe Linux
allocates from the top of the address space, where false pointers are less
frequent.  Other than that, I would suggest using only 1,000,000 or so
elements, both because 100,000,000 is just plain unrealistic in most
real-world use cases and because you avoid false pointer issues that way.
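
To illustrate the false-pointer mechanism, here is a hedged sketch.  The
falsePointer below is constructed deliberately, whereas real false pointers
are accidental bit patterns, and whether the block actually survives depends
on the addresses the GC hands out:

import std.stdio;
import core.memory;

void main() {
    auto big = new uint[](100_000_000); // huge block: spans a wide address
                                        // range, so random word values are
                                        // more likely to "point" into it
    size_t falsePointer = cast(size_t) big.ptr + 1234; // an integer that
                                        // merely looks like an interior pointer
    big = null;      // drop the real reference...
    GC.collect();    // ...but a conservative scan of the stack may still see
                     // falsePointer's bit pattern and keep the block alive
    writeln(falsePointer); // keep the variable live across the collection
}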
Apr 25 2009