
digitalmars.D - Array Appending and DRuntime

dsimcha <dsimcha yahoo.com> writes:
For a while now, I've had a suspicion that the builtin array appending has
gotten slower.  The recent talk about array appenders here brought that to a
head, so I decided to test it and find out.  Here's the test program I used:

import std.stdio, std.perf;

void main() {
    scope pc = new PerformanceCounter;
    pc.start;
    uint[] foo;
    foreach(i; 0..1_000_000) {
        foo ~= i;
    }
    pc.stop;
    writeln(pc.milliseconds);
}

Results:

DMD 2.019 (Last release before druntime):  42 milliseconds.
DMD 2.020 (First release with druntime):  ~1000 milliseconds.
DMD 2.029 (Current version):  ~1000 milliseconds.
DMD 2.029 (Replacing ~= with the Appender struct):  18 milliseconds.
DMD 2.029 (Replacing builtin array with rangeextra.TNew):  19 milliseconds.

I think it's abundantly clear that I wasn't just imagining things:  something
changed between 2.019 and 2.020 that made the builtin ~= about 25x slower.  It
is probably related to druntime, since that was the only major change between
those two releases.  If we could put this back to the way it was in 2.019, it
would drastically reduce the need for more complicated solutions.  Here's some
speculation about what might have happened:

1.  Since Appender and TNew both use the builtin array's reallocation scheme
and are still fast, it's not a reallocation scheme issue.
2.  gcx.d appears to cache the last block size query.  This means that
repeatedly querying the same block in a single-threaded program (where
thread_needLock() returns false and no lock is necessary) is very fast.  This
is true in both the old Phobos GC and the druntime GC.  I wonder if this was
somehow bypassed by the ~= operator when druntime was integrated with DMD in
its early days.  (A sketch of the idea follows.)
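
To make the caching idea concrete, here is a minimal sketch of such a
single-entry "last block queried" cache.  The names and structure here are
hypothetical, not the real gcx.d code:

// Hypothetical stand-in for the GC's real (slow) block size lookup.
size_t slowLookup(size_t p) { return 16; }

size_t lastPtr;   // single-entry cache: the block queried most recently
size_t lastSize;  // and its cached size

size_t blockSize(size_t p) {
    if (p == lastPtr)
        return lastSize;        // hit: no lock, no walk of GC data structures
    lastPtr  = p;
    lastSize = slowLookup(p);   // miss: fall back to the full lookup
    return lastSize;
}

void main() {
    // Appending to one array queries the same block over and over:
    // everything after the first call is a cache hit.
    blockSize(0x1000); blockSize(0x1000); blockSize(0x1000);
    // Appending to two arrays alternates between blocks, so a
    // single-entry cache misses on every call.
    blockSize(0x1000); blockSize(0x2000); blockSize(0x1000);
}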
Apr 25 2009
dsimcha <dsimcha yahoo.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 2.  gcx.d appears to cache the last block size query.  This means that
 repeatedly querying the same block in a single threaded program (where
 thread_needLock() returns false and no lock is necessary) is very fast.  This
 is true in both the old Phobos GC and the druntime GC.  I wonder if this was
 somehow bypassed by the ~= operator when druntime was integrated with DMD in
 its early days.
On further examination, this is clearly somehow related to caching.  Here is a
very similar test program, one that appends to two arrays instead of one:

import std.stdio, std.perf;

void main() {
    scope pc = new PerformanceCounter;
    pc.start;
    uint[] foo, bar;
    foreach(i; 0..1_000_000) {
        foo ~= i;
        bar ~= i;
    }
    pc.stop;
    writeln(pc.milliseconds);
}

Timings:

DMD 2.019:  ~1800 ms
DMD 2.029:  ~2300 ms  (Note:  Still slower, but not by as much, even in absolute terms)
DMD 2.029 (Using Appender instead of ~=):  49 ms

By appending to two arrays, we screw up the caching scheme, hence the much
poorer performance.  However, most use cases probably involve appending to only
one array at a time.  Since this is a clear regression, I'll file a bug report.
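
If the single-entry cache idea above is right, the effect should also show up
when querying block info directly.  A hedged sketch, assuming druntime's
core.memory.GC.query goes through the same cached lookup path (it may not;
this is a sketch, not a proof):

import std.stdio, std.perf;
import core.memory;

void main() {
    auto a = new uint[](1000);
    auto b = new uint[](1000);

    scope same = new PerformanceCounter;
    same.start;
    foreach (i; 0 .. 1_000_000)
        GC.query(a.ptr);       // same block every time: should mostly hit
    same.stop;

    scope alternating = new PerformanceCounter;
    alternating.start;
    foreach (i; 0 .. 500_000) {
        GC.query(a.ptr);       // alternating between two blocks: a
        GC.query(b.ptr);       // single-entry cache misses on every call
    }
    alternating.stop;

    writeln("same block:   ", same.milliseconds, " ms");
    writeln("alternating:  ", alternating.milliseconds, " ms");
}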
Apr 25 2009
bearophile <bearophileHUGS lycos.com> writes:
dsimcha:
 DMD 2.019 (Last release before druntime):  42 milliseconds.
 DMD 2.020 (First release with druntime):  ~1000 milliseconds.
 DMD 2.029 (Current version):  ~1000 milliseconds.
 DMD 2.029 (Replacing ~= with the Appender struct):  18 milliseconds.
 DMD 2.029 (Replacing builtin array with rangeextra.TNew):  19 milliseconds.
I have done some more timings.  Appending 100_000_000 uints, in seconds:

DMD 1.042:                7.51
DMD 1.042, ArrayBuilder:  1.81
DMD 2.029:                a lot
DMD 2.029, Appender:      22.83

I don't like the user interface of Appender.

---------------
The code I have used:

void main() {
    uint[] foo;
    for (int i; i < 100_000_000; i++)
        foo ~= i;
}
---------------
import d.builders: ArrayBuilder;

void main() {
    ArrayBuilder!(uint) b;
    for (int i; i < 100_000_000; i++)
        b ~= i;
    uint[] arr = b.toarray;
}
---------------
import std.array: appender;

void main() {
    uint[] arr;
    auto app = appender(&arr);
    for (int i; i < 100_000_000; i++)
        app.put(i);
    auto result = app.data;
}

Bye,
bearophile
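
For comparison, the plain ~=-style interface preferred above is easy to
sketch.  The following is a hypothetical builder, not the actual
d.builders.ArrayBuilder:

struct SimpleBuilder(T) {
    private T[] buf;      // backing storage, grown geometrically
    private size_t len;   // number of elements actually appended

    // D 2.029-era operator overload name for ~= .
    void opCatAssign(T item) {
        if (len == buf.length)
            buf.length = buf.length ? buf.length * 2 : 16; // amortized O(1)
        buf[len++] = item;
    }

    T[] toarray() { return buf[0 .. len]; }
}

void main() {
    SimpleBuilder!(uint) b;
    for (int i; i < 1_000_000; i++)
        b ~= i;
    uint[] arr = b.toarray;
}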
Apr 25 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 dsimcha:
 DMD 2.019 (Last release before druntime):  42 milliseconds.
 DMD 2.020 (First release with druntime):  ~1000 milliseconds.
 DMD 2.029 (Current version):  ~1000 milliseconds.
 DMD 2.029 (Replacing ~= with the Appender struct):  18 milliseconds.
 DMD 2.029 (Replacing builtin array with rangeextra.TNew):  19 milliseconds.
 I have done some more timings.  Appending 100_000_000 uints, in seconds:

 DMD 1.042:                7.51
 DMD 1.042, ArrayBuilder:  1.81
 DMD 2.029:                a lot
 DMD 2.029, Appender:      22.83
I'm not sure what machine you're testing on, but on my (crappy old) laptop
your Appender example completes in 4.95 seconds.  Are you sure you are
measuring the right thing?

Andrei
Apr 25 2009
bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 I'm not sure on what machine you test, but on my (crappy old) laptop, 
 your Appender example completes in 4.95 seconds. Are you sure you are 
 measuring the right thing?
My purpose is to have a good standard library and a good language, and errors
of mine are always possible; benchmarks are tricky things.

I have redone the tests.  The CPU is unloaded, no other computation is going
on; the system is WinXP with 2 GB RAM and a 2 GHz Core 2.  The new timings
are no less than 22.6 seconds, repeatedly.  As the running time I use a
timing command similar to the "time" of Linux, which gives the total running
time of the program (final deallocations included).

I think Appender can be improved some.  You can also try DMD1 to compare
timings; it's not a fully dead language yet.

Bye,
bearophile
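
One way to separate the append loop from program teardown is to time inside
the program as well.  A sketch using the same std.perf counter as the first
post; it measures only the loop, whereas an external "time"-style command
also counts startup and the final deallocation of the ~400 MB array:

import std.stdio, std.perf;

void main() {
    scope pc = new PerformanceCounter;
    pc.start;
    uint[] foo;
    for (int i; i < 100_000_000; i++)
        foo ~= i;
    pc.stop;
    // Only the loop is measured here; final GC/heap teardown is not.
    writeln(pc.milliseconds, " ms");
}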
Apr 25 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 I'm not sure on what machine you test, but on my (crappy old)
 laptop, your Appender example completes in 4.95 seconds. Are you
 sure you are measuring the right thing?
 My purpose is to have a good standard library and a good language, and
 errors of mine are always possible; benchmarks are tricky things.

 I have redone the tests.  The CPU is unloaded, no other computation is
 going on; the system is WinXP with 2 GB RAM and a 2 GHz Core 2.  The new
 timings are no less than 22.6 seconds, repeatedly.  As the running time I
 use a timing command similar to the "time" of Linux, which gives the total
 running time of the program (final deallocations included).

 I think Appender can be improved some.
I can't improve what I can't measure.  On my system, liberally loaded with a
variety of programs and weaker than yours (1.8 GHz/512 MB laptop), I get 4.8
seconds with -O and 5.3 seconds without.  The major difference is the OS
(Ubuntu on my laptop), but I am puzzled, as Appender doesn't assume a lot
about the OS.  I time with Unix time, and I copied the code straight from
your post.  Please advise.

Andrei
Apr 25 2009
dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 bearophile wrote:
 Andrei Alexandrescu:
 I'm not sure on what machine you test, but on my (crappy old)
 laptop, your Appender example completes in 4.95 seconds. Are you
 sure you are measuring the right thing?
 My purpose is to have a good standard library and a good language, and
 errors of mine are always possible; benchmarks are tricky things.

 I have redone the tests.  The CPU is unloaded, no other computation is
 going on; the system is WinXP with 2 GB RAM and a 2 GHz Core 2.  The new
 timings are no less than 22.6 seconds, repeatedly.  As the running time I
 use a timing command similar to the "time" of Linux, which gives the total
 running time of the program (final deallocations included).

 I think Appender can be improved some.
 I can't improve what I can't measure.  On my system, liberally loaded with
 a variety of programs and weaker than yours (1.8 GHz/512 MB laptop), I get
 4.8 seconds with -O and 5.3 seconds without.  The major difference is the
 OS (Ubuntu on my laptop), but I am puzzled, as Appender doesn't assume a
 lot about the OS.  I time with Unix time, and I copied the code straight
 from your post.  Please advise.

 Andrei
I think I can actually explain this.  For some reason (I don't know why) the
Linux GC is less susceptible to false pointers than the Windows GC.  You can
prove this by playing with associative arrays and seeing how big an
associative array has to get before there is a high probability that the GC
won't free it.  (See bugzilla 2105.)  I get timings more similar to
bearophile's on my WinXP box.

If blocks of memory keep getting hit with false pointers on reallocation
(which they do when arrays get this big, at least on Windows), and you have
to keep allocating more memory from the OS, etc., of course it's going to be
slower.

As for why the Linux GC is less susceptible to false pointers, I really don't
know; it's just an empirical observation.  I would guess that maybe Linux
allocates from the top of the address space, where false pointers are less
frequent.  Other than that, I would suggest using only 1,000,000 or so
elements, both because 100,000,000 is just plain unrealistic in most
real-world use cases and because you avoid false pointer issues that way.
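
To illustrate the false-pointer mechanism, here is a hedged sketch.  The
falsePointer below is constructed deliberately, whereas real false pointers
are accidental bit patterns, and whether the block actually survives depends
on the addresses the GC hands out:

import std.stdio;
import core.memory;

void main() {
    auto big = new uint[](100_000_000); // huge block: spans a wide address
                                        // range, so random word values are
                                        // more likely to "point" into it
    size_t falsePointer = cast(size_t) big.ptr + 1234; // an integer that
                                        // merely looks like an interior pointer
    big = null;      // drop the real reference...
    GC.collect();    // ...but a conservative scan of the stack may still see
                     // falsePointer's bit pattern and keep the block alive
    writeln(falsePointer); // keep the variable live across the collection
}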
Apr 25 2009