digitalmars.D - Array Appending and DRuntime
- dsimcha (35/35) Apr 25 2009 For a while now, I've had a suspicion that the builtin array appending h...
- dsimcha (22/28) Apr 25 2009 On further examination, this is clearly somehow related to caching. Her...
- bearophile (34/39) Apr 25 2009 I have done some more timings:
- Andrei Alexandrescu (5/19) Apr 25 2009 I'm not sure on what machine you test, but on my (crappy old) laptop,
- bearophile (8/11) Apr 25 2009 My purpose is to have a good standard library and a good language.
- Andrei Alexandrescu (8/24) Apr 25 2009 I can't improve what I can't measure. On my system, liberally loaded
- dsimcha (16/40) Apr 25 2009 I think I can actually explain this. For some reason (I don't know why)...
For a while now, I've had a suspicion that the builtin array appending has gotten slower lately. The talk about array appenders here lately has brought that to a head, so I decided to test it and find out. Here's the test program I used: import std.stdio, std.perf; void main() { scope pc = new PerformanceCounter; pc.start; uint[] foo; foreach(i; 0..1_000_000) { foo ~= i; } pc.stop; writeln(pc.milliseconds); } Results: DMD 2.019 (Last release before druntime): 42 milliseconds. DMD 2.020 (First release with druntime): ~1000 milliseconds. DMD 2.029 (Current version): ~1000 milliseconds. DMD 2.029 (Replacing ~= with the Appender struct): 18 milliseconds. DMD 2.029 (Replacing builtin array with rangeextra.TNew): 19 milliseconds. I think it's abundantly clear that I wasn't just going crazy and something changed between 2.019 and 2.020 that made the builtin ~= about 25x slower. It probably is related to druntime because this was the only major change between 2.019 and 2.020. If we could put this back to the way it was in 2.019, this would drastically reduce the need for more complicated solutions. Here's some speculation about what might have happened: 1. Since Appender and TNew both use the builtin array's reallocation scheme, it's not a reallocation scheme issue. 2. gcx.d appears to cache the last block size query. This means that repeatedly querying the same block in a single threaded program (where thread_needLock() returns false and no lock is necessary) is very fast. This is true in both the old Phobos GC and the druntime GC. I wonder if this was somehow bypassed by the ~= operator when druntime was integrated with DMD in its early days.
Apr 25 2009
== Quote from dsimcha (dsimcha yahoo.com)'s article2. gcx.d appears to cache the last block size query. This means that repeatedly querying the same block in a single threaded program (where thread_needLock() returns false and no lock is necessary) is very fast. This is true in both the old Phobos GC and the druntime GC. I wonder if this was somehow bypassed by the ~= operator when druntime was integrated with DMD in its early days.On further examination, this is clearly somehow related to caching. Here is a very similar test program, that appends to tow arrays instead of one: import std.stdio, std.perf; void main() { scope pc = new PerformanceCounter; pc.start; uint[] foo, bar; foreach(i; 0..1_000_000) { foo ~= i; bar ~= i; } pc.stop; writeln(pc.milliseconds); } Timings: DMD 2.019: ~1800 ms DMD 2.029: ~2300 ms (Note: Still slower but not by as much even in absolute terms) DMD 2.029 (Using Appender instead of ~=): 49 ms By appending to two arrays, we screw up the caching scheme, hence much poorer performance. However most use cases probably involve appending to only one array at a time. Since this is a clear regression, I'll file a bug report.
Apr 25 2009
dsimcha:DMD 2.019 (Last release before druntime): 42 milliseconds. DMD 2.020 (First release with druntime): ~1000 milliseconds. DMD 2.029 (Current version): ~1000 milliseconds. DMD 2.029 (Replacing ~= with the Appender struct): 18 milliseconds. DMD 2.029 (Replacing builtin array with rangeextra.TNew): 19 milliseconds.I have done some more timings: Timings, appending 100_000_000 uints, seconds: DMD 1.042: 7.51 DMD 1.042, ArrayBuilder: 1.81 DMD 2.029: a lot DMD 2.029, Appender: 22.83 I don't like the user interface of Appender. --------------- The code I have used: void main() { uint[] foo; for (int i; i < 100_000_000; i++) foo ~= i; } --------------- import d.builders: ArrayBuilder; void main() { ArrayBuilder!(uint) b; for (int i; i < 100_000_000; i++) b ~= i; uint[] arr = b.toarray; } --------------- import std.array: appender; void main() { char[] arr; auto app = appender(&arr); for (int i; i < 100_000_000; i++) app.put(i); auto result = app.data; } Bye, bearophile
Apr 25 2009
bearophile wrote:dsimcha:I'm not sure on what machine you test, but on my (crappy old) laptop, your Appender example completes in 4.95 seconds. Are you sure you are measuring the right thing? AndreiDMD 2.019 (Last release before druntime): 42 milliseconds. DMD 2.020 (First release with druntime): ~1000 milliseconds. DMD 2.029 (Current version): ~1000 milliseconds. DMD 2.029 (Replacing ~= with the Appender struct): 18 milliseconds. DMD 2.029 (Replacing builtin array with rangeextra.TNew): 19 milliseconds.I have done some more timings: Timings, appending 100_000_000 uints, seconds: DMD 1.042: 7.51 DMD 1.042, ArrayBuilder: 1.81 DMD 2.029: a lot DMD 2.029, Appender: 22.83
Apr 25 2009
Andrei Alexandrescu:I'm not sure on what machine you test, but on my (crappy old) laptop, your Appender example completes in 4.95 seconds. Are you sure you are measuring the right thing?My purpose is to have a good standard library and a good language. Errors of mine are always possible. Benchmarks are tricky things. I have redone the tests, the CPU is unloaded, no computation is going on, the system is WinXP with 2 GB RAM, 2 GHz Core 2. New timings give no less than 22.6 seconds, repeated. As running time I use a timing command similar to the "time" of Linux, that gives the total running time of the program (final deallocations too). I think Appender can be improved some. You can also try DMD1, to compare timings, it's not a fully dead language yet. Bye, bearophile
Apr 25 2009
bearophile wrote:Andrei Alexandrescu:I can't improve what I can't measure. On my system, liberally loaded with a variety of programs, weaker than yours (1.8GHz/512MB laptop) I get 4.8s with -O and 5.3s without. The major difference is the OS (Ubuntu on my laptop), but I am puzzled as Appender doesn't assume a lot about the OS. I time with Unix time. I copied the code straight from your post. Please advise. AndreiI'm not sure on what machine you test, but on my (crappy old) laptop, your Appender example completes in 4.95 seconds. Are you sure you are measuring the right thing?My purpose is to have a good standard library and a good language. Errors of mine are always possible. Benchmarks are tricky things. I have redone the tests, the CPU is unloaded, no computation is going on, the system is WinXP with 2 GB RAM, 2 GHz Core 2. New timings give no less than 22.6 seconds, repeated. As running time I use a timing command similar to the "time" of Linux, that gives the total running time of the program (final deallocations too). I think Appender can be improved some.
Apr 25 2009
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articlebearophile wrote:I think I can actually explain this. For some reason (I don't know why) the Linux GC is less susceptible to false pointers than the Windows GC. You can prove this by playing with associative arrays and seeing how big an associative array has to get before there is a high probability that the GC won't free it. (See bugzilla 2105.) I get timings more similar to Bearophile on my WinXP box. If blocks of memory keep getting hit with false pointers on reallocation (which they do when arrays get this big, at least on Windows), and you have to allocate more memory from the OS, etc., of course it's going to be slower. As for why the Linux GC is less susceptible to false pointers, I really don't know. It's just an empirical observation. I would guess that maybe Linux allocates from the top of the address space, where false pointers are less frequent. Other than that, I would suggest only using 1,000,000 or so elements both because 100,000,000 is just plain unrealistic in most real-world use cases and because you avoid false pointer issues that way.Andrei Alexandrescu:I can't improve what I can't measure. On my system, liberally loaded with a variety of programs, weaker than yours (1.8GHz/512MB laptop) I get 4.8s with -O and 5.3s without. The major difference is the OS (Ubuntu on my laptop), but I am puzzled as Appender doesn't assume a lot about the OS. I time with Unix time. I copied the code straight from your post. Please advise. AndreiI'm not sure on what machine you test, but on my (crappy old) laptop, your Appender example completes in 4.95 seconds. Are you sure you are measuring the right thing?My purpose is to have a good standard library and a good language. Errors of mine are always possible. Benchmarks are tricky things. I have redone the tests, the CPU is unloaded, no computation is going on, the system is WinXP with 2 GB RAM, 2 GHz Core 2. New timings give no less than 22.6 seconds, repeated. As running time I use a timing command similar to the "time" of Linux, that gives the total running time of the program (final deallocations too). I think Appender can be improved some.
Apr 25 2009