
digitalmars.D - Let's avoid range specializations - make ranges great again!

reply Seb <seb wilzba.ch> writes:
In Phobos (especially in std.algorithm) a lot of specializations 
between a RandomAccessRange and an InputRange are used even 
though only an InputRange is required.
The reason for this is solely performance - imho we should start 
to tackle this and remove such specializations. Let's create 
better code!
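As a minimal sketch of the pattern being discussed (hypothetical names and helper, not actual Phobos code): a generic overload constrained to input ranges next to a specialized overload that uses indexing for random-access ranges.

```d
import std.range.primitives;

// Generic version: works for any input range, but every step goes
// through the range API (empty/front/popFront).
size_t countEven(R)(R r)
    if (isInputRange!R && !isRandomAccessRange!R)
{
    size_t n;
    for (; !r.empty; r.popFront)
        if (r.front % 2 == 0) ++n;
    return n;
}

// Specialized version: plain indexing, which optimizers typically
// lower to a pointer-bumping loop.
size_t countEven(R)(R r)
    if (isRandomAccessRange!R && hasLength!R)
{
    size_t n;
    foreach (i; 0 .. r.length)
        if (r[i] % 2 == 0) ++n;
    return n;
}
```

In an ideal world both overloads would optimize to the same machine code and the second one would be redundant.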

To quote David from the recent PR 
(https://github.com/dlang/phobos/pull/4265), where adding an 
identity specialization achieves a 5-10x speedup for 
{min,max}Element:

 I think that we have a larger issue at hand here. Ranges are of 
 prime strategic importance for D as an accessible way of 
 composing complex operations out of reusable primitives. A 
 large part of this comes from the fact that they are supposed 
 to be zero-cost abstractions – after the compiler optimiser has 
 had its way with a piece of code, we expect it to be equivalent 
 to a hand-written loop.
1) Should it matter whether generic range looping or "specialized" r[i] access is used?

Answer: for dmd it does (7x slower); ldc does the job (nearly) correctly.

```
dmd -release -O -boundscheck=off test_looping.d && ./test_looping

random_access  685 ms, 351 μs, and 9 hnsecs
foreach        685 ms, 888 μs, and 4 hnsecs
foreach_ref    685 ms, 654 μs, and 4 hnsecs
for_range_api  7 secs, 211 ms, 530 μs, and 3 hnsecs
```

```
ldc -release -O3 -boundscheck=off test_looping.d && ./test_looping

random_access  142 ms and 88 μs
foreach        142 ms, 8 μs, and 3 hnsecs
foreach_ref    142 ms, 400 μs, and 9 hnsecs
for_range_api  143 ms, 394 μs, and 1 hnsec
```

Benchmarking code: https://github.com/wilzbach/perf-ranges/blob/master/test_looping.d
Bug report for DMD: https://issues.dlang.org/show_bug.cgi?id=16120
(Tested with both ldc 0.17 and 1.0)

2) Should it matter whether I use r.save or r[i..$]?

Answer: it does for dmd and ldc.

```
dmd -release -O -boundscheck=off test_save.d && ./test_save

random_access  424 ms and 72 μs
foreach_random 420 ms, 754 μs, and 5 hnsecs
for_range_api  2 secs, 562 ms, 415 μs, and 5 hnsecs
```

```
ldc -release -O3 -boundscheck=off test_save.d && ./test_save

random_access  245 ms, 713 μs, and 4 hnsecs
foreach_random 241 ms, 533 μs, and 1 hnsec
for_range_api  349 ms, 582 μs, and 9 hnsecs
```

Benchmarking code: https://github.com/wilzbach/perf-ranges/blob/master/test_save.d

Open question: how can we make DMD/LDC smarter, so that they inline range primitives and the promise about the zero-cost abstractions is fulfilled?
Jun 04 2016
next sibling parent David Nadlinger <code klickverbot.at> writes:
On Saturday, 4 June 2016 at 17:35:23 UTC, Seb wrote:
 ldc does the job (nearly) correctly.
Scratch the "nearly" – at least you can't infer that from the performance results alone; the differences are well within the run-to-run noise.
 dmd -release -O -boundscheck=off test_looping.d && 
 ./test_looping
You seem to be missing `-inline` here, although it doesn't seem to influence the results (DMD 2.071.0/OS X x86_64).
 2) Should it matter whether I use r.save or r[i..$]?

 Answer: it does for dmd and ldc.

 […]

 Open question: how can we make DMD/LDC smarter, so that they 
 inline range primitives […]?
Is this really the conclusion to draw from this benchmark?

First of all, unless I'm mistaken, the three implementations are not equivalent – f_for seems to ignore the first element by doing an extra `popFront()`. Fixing that, the implementations seem to be essentially equivalent in terms of execution time (LDC master, i7-4980HQ 2.8GHz):

```
0 17 secs, 840 ms, 770 μs, and 9 hnsecs
1 16 secs, 680 ms, 80 μs, and 6 hnsecs
2 17 secs, 635 ms, 548 μs, and 1 hnsec
```

Even though this is using a larger number of iterations for illustrative purposes, the difference still appears to be within the run-to-run noise.

Comparing the x86_64 assembly produced for 1 and 2, everything is inlined correctly in both cases. The "reversed" loop structure does lead to a slightly higher number of instructions in the loop, though, which might or might not be measurable:

---
__D9test_save17__T9f_foreachTAmZ9f_foreachFNaNbNiNfAmmZAm:
    xor eax, eax
    test rsi, rsi
    je LBB11_4
    mov rcx, rdx
    .align 4, 0x90
LBB11_2:
    cmp qword ptr [rcx], rdi
    je LBB11_5
    inc rax
    add rcx, 8
    cmp rsi, rax
    ja LBB11_2
LBB11_4:
    mov rax, rsi
    ret
LBB11_5:
    sub rsi, rax
    mov rax, rsi
    mov rdx, rcx
    ret

__D9test_save13__T5f_forTAmZ5f_forFNaNbNiNfAmmZAm:
    test rsi, rsi
    je LBB13_4
    mov rax, rsi
    dec rax
    mov rcx, rdx
    add rcx, 8
    .align 4, 0x90
LBB13_2:
    cmp qword ptr [rcx - 8], rdi
    je LBB13_4
    mov rdx, rcx
    mov rsi, rax
    dec rax
    add rcx, 8
    cmp rax, -1
    jne LBB13_2
LBB13_4:
    mov rax, rsi
    ret
---

— David
Jun 04 2016
prev sibling next sibling parent Era Scarecrow <rtcvb32 yahoo.com> writes:
On Saturday, 4 June 2016 at 17:35:23 UTC, Seb wrote:
 In Phobos (especially in std.algorithm) a lot of specializations 
 between a RandomAccessRange and an InputRange are used even 
 though only an InputRange is required. The reason for this is 
 solely performance - imho we should start to tackle this and 
 remove such specializations. Let's create better code!
 The reason for this is solely performance - we should <snip> 
 remove such specializations.
Performance will always be a driving factor. But then again let's not forget an array is not a range, and isn't treated as one unless the primitives are present (which by default they are not). Arrays in their basic forms are also very common, easy, fairly cheap, etc. Same as slices. I see no reason to drop array based specializations, unless they fail to produce good code.
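The point about arrays only becoming ranges once the primitives are in scope can be shown in a few lines: std.range.primitives supplies front/popFront/empty for slices as free functions, used via UFCS.

```d
// A built-in slice has no front/popFront members of its own;
// importing std.range.primitives provides them as free functions.
import std.range.primitives;

void main()
{
    int[] a = [1, 2, 3];
    assert(a.front == 1);  // calls front(int[]) via UFCS
    a.popFront();          // shrinks the slice in place
    assert(a == [2, 3]);
    assert(!a.empty);
}
```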
Jun 04 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/4/16 7:35 PM, Seb wrote:
 In Phobos (especially in std.algorithm) a lot of specialization between
 a RandomAccessRange and an InputRange are used even though only an
 InputRange is required.
 The reason for this is solely performance - imho we should start to
 tackle this and remove such specializations. Let's create better code!
Clearly improving range-specific optimizations will "lift all boats". But specializing code for specific range kinds is in principle a good thing. See e.g. the recent thread about nextPermutation - we should not be ashamed to specialize it for random-access ranges. Andrei
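A common way to specialize without multiplying overloads (a sketch with a hypothetical function, not the actual nextPermutation code) is a single entry point that branches on the range kind with static if:

```d
import std.range.primitives;

// One public entry point; the fast path is chosen at compile time.
auto sumRange(R)(R r)
    if (isInputRange!R)
{
    ElementType!R total = 0;
    static if (isRandomAccessRange!R && hasLength!R)
    {
        // Fast path: direct indexing.
        foreach (i; 0 .. r.length)
            total += r[i];
    }
    else
    {
        // Generic path: range API only.
        for (; !r.empty; r.popFront)
            total += r.front;
    }
    return total;
}
```

This keeps the specialization local and easy to delete once the optimizers make it unnecessary.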
Jun 05 2016