digitalmars.D - Ranges, Performance, and opApply
- Kapps (33/33) Oct 30 2011 As it stands right now, ranges are quite inefficient. Phobos is designed...
- Andrei Alexandrescu (13/19) Oct 30 2011 The test is flawed in a subtle way and no conclusion can be drawn from i...
- Martin Nowak (12/33) Oct 30 2011 No it does not.
- Andrei Alexandrescu (5/42) Oct 30 2011 Indeed you're right. In the quiescent state both implementations call
- dsimcha (2/5) Oct 30 2011 Just for future reference, LDC **DOES** sometimes inline opApply bodies.
- Martin Nowak (2/11) Oct 30 2011 Right it should have been 'is mostly not inlined'.
- Steven Schveighoffer (15/24) Oct 31 2011 The compiler should almost always be able to inline opApply, as the code...
- Martin Nowak (44/70) Oct 31 2011 Inlining the body would necessitate one parameterized function
- Kapps (8/46) Oct 31 2011 Thanks for the explanation guys. That definitely makes me feel better
As it stands right now, ranges are quite inefficient. Phobos is designed with an emphasis on ranges, including in performance-sensitive code such as std.algorithm. No matter how optimized the range is, it will always be slower than the equivalent opApply function (and, as far as I can tell, offers no benefit for plain iteration).

The fastest way of iterating over anything currently is foreach over an array. It beats even for(size_t i = 0; i < length; i++) by a decent bit. The lowest-level range underlying a chain of ranges is almost always an array (that is, if you have a range wrapping a range ... wrapping a range, the innermost one is likely an array or backed by an array). By implementing an opApply for these ranges that calls opApply on the wrapped range or wrapped array (or other source), you can roughly double the performance of iterating (for many iterations over a small body). The compiler-generated opApply cannot make this optimization, as it does not know what the underlying data for the range is.

Creating an opApply can be done on a per-range basis for ranges that are used in performance-sensitive code (such as countUntil). If the manual opApply foreach's over the input range, and the input range has a similar manual opApply, there are fair performance benefits. If the underlying range does not have a manual opApply, there will still be some performance benefit, and it does not break code in any way. If all ranges being used have opApply and the innermost range is an array, the performance benefits are very significant.

As an example, I made a test of 5 different ways of implementing std.algorithm.filter, plus two tests of another filter stacked on top of the output of the first one: one with a manual opApply, and one without. Basically, the manual opApply was over twice as fast. The code and results can be found at http://pastie.org/2781723.

Ultimately, nothing is as fast as just doing the foreach yourself, but that's to be expected. The difference can, however, easily be made less extreme than it currently is. So, should Phobos ranges attempt to have a manual opApply where they are feasibly being used in performance-sensitive code?
Oct 30 2011
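[A minimal sketch of the proposal above, with hypothetical names. A filter wrapper that, in addition to its range primitives, defines a manual opApply iterating the underlying array directly; note that when a type has both range primitives and opApply, foreach prefers opApply.]

----
import std.stdio;

struct MyFilter(alias pred)  // hypothetical; std.algorithm.filter's real
{                            // implementation differs
    int[] source; // innermost storage; in the simplest case a plain array

    // Manual opApply: the backend loop is a plain array foreach, instead
    // of an empty/front/popFront call triple per element. Range
    // primitives (empty/front/popFront) are omitted here for brevity.
    int opApply(int delegate(ref int) dg)
    {
        foreach (ref e; source)
        {
            if (!pred(e)) continue;
            if (auto r = dg(e)) return r; // propagate break/return
        }
        return 0;
    }
}

void main()
{
    auto r = MyFilter!(a => a % 2 == 0)([1, 2, 3, 4, 5, 6]);
    int sum;
    foreach (e; r) // dispatches to opApply
        sum += e;
    writeln(sum);  // 2 + 4 + 6 = 12
}
----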
On 10/30/11 3:02 AM, Kapps wrote:
> As it stands right now, ranges are quite inefficient. Phobos is
> designed with an emphasis on ranges, including in performance-sensitive
> code such as std.algorithm. No matter how optimized the range is, it
> will always be slower than the equivalent opApply function (and, as far
> as I can tell, offers no benefit for just iteration).

The test is flawed in a subtle way and no conclusion can be drawn from it. The test filters a checkerboard distribution (one element satisfies the filter, the next doesn't). As such, the loop on lines 26-29 will always test exactly two times to take one step. In contrast, the loop at lines 75-81 is integrated and only makes the minimum amount of tests. Calling the delegate is expensive, but there are only half as many calls and the saved tests compensate for that.

I believe there is nothing inherent in the design of ranges that makes them generally less efficient than various alternatives. Yes, there will be corner cases where internal iteration will be better, but overall I don't think they form a large fraction of all cases.

Andrei
Oct 30 2011
On Sun, 30 Oct 2011 15:42:36 +0100, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> The test is flawed in a subtle way and no conclusion can be drawn from
> it. The test filters a checkerboard distribution (one element
> satisfies/the next doesn't satisfy the filter). As such, the loop on
> lines 26-29 will always test exactly two times to take one step. In
> contrast, the loop at lines 75-81 is integrated and only makes the
> minimum amount of tests. Calling the delegate is expensive, but there
> are only half as many calls and the saved tests compensate for that.

No it does not. In fact the opApply version does ONE additional check at loop start, which doesn't matter for 2.5M iterations.

> I believe there is nothing inherent in the design of ranges that makes
> them generally less efficient than various alternatives. Yes, there
> will be the corner cases where internal iteration will be better, but
> overall I don't think they form a large fraction of all cases.

The test still doesn't measure anything but spurious effects of different inlining decisions. I get a ~10% hit for range foreach over opApply. When looping without a predicate, popFront is inlined and the result is a ~10% hit for opApply. As the opApply body can never be inlined, it is a worse choice in general.

martin
Oct 30 2011
On 10/30/11 11:45 AM, Martin Nowak wrote:
> No it does not. In fact the opApply version does ONE additional check
> at loop start, which doesn't matter for 2.5M iterations.

Indeed you're right. In the quiescent state both implementations call the support functions the same number of times. Thanks.

> As the opApply body can never be inlined it is a worse choice in
> general.

That was my intuition too.

Andrei
Oct 30 2011
On 10/30/2011 3:09 PM, Andrei Alexandrescu wrote:
>> As the opApply body can never be inlined it is a worse choice in
>> general.
>
> That was my intuition too.

Just for future reference, LDC **DOES** sometimes inline opApply bodies.
Oct 30 2011
On Sun, 30 Oct 2011 20:23:59 +0100, dsimcha <dsimcha yahoo.com> wrote:
> Just for future reference, LDC **DOES** sometimes inline opApply
> bodies.

Right, it should have been 'is mostly not inlined'.
Oct 30 2011
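[For context on the inlining discussion: foreach over a type with opApply is lowered by the compiler to a call passing the loop body as a delegate, roughly as in this hand-written sketch (not actual compiler output). Unless the optimizer inlines both opApply and the delegate, every iteration pays an indirect call.]

----
import std.stdio;

struct Numbers
{
    int[] data;

    int opApply(int delegate(ref int) dg)
    {
        foreach (ref e; data)
            if (auto r = dg(e)) return r; // non-zero: break/goto/return
        return 0;
    }
}

void main()
{
    auto n = Numbers([1, 2, 3]);

    // What you write:
    int sum1;
    foreach (x; n)
        sum1 += x;

    // Roughly what it is rewritten to: the loop body becomes a delegate
    // closing over main's stack frame, invoked through opApply.
    int sum2;
    n.opApply((ref int x) { sum2 += x; return 0; });

    writeln(sum1, " ", sum2); // both 6
}
----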
On Sun, 30 Oct 2011 15:23:59 -0400, dsimcha <dsimcha yahoo.com> wrote:
> Just for future reference, LDC **DOES** sometimes inline opApply
> bodies.

The compiler should almost always be able to inline opApply, as the code for the opApply body is always available. There are a few exceptions, such as when opApply is not final, or when it's recursive. I wonder if even in these cases some sort of "virtual inlining", such as pushing the code onto the stack and avoiding function calls, would be possible (speaking from naivety; maybe this does nothing). Being able to exploit the fact that a delegate literal is always fully available would be nice.

Indeed, opApply should beat the pants off of ranges when the range primitives are virtual functions. But ranges should win when inlining is possible in some cases. There are always going to be use cases for opApply that ranges cannot duplicate (such as recursion), and vice versa (such as parallel iteration).

-Steve
Oct 31 2011
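[A sketch of the recursion use case mentioned above: in-order iteration over a binary tree via opApply. Doing the same as a range requires an explicit stack; with opApply the call stack does the bookkeeping. The Node class and tree shape here are invented for illustration; note that as a non-final class method this opApply is also one of the cases that resists inlining.]

----
import std.stdio;

class Node
{
    int value;
    Node left, right;

    this(int v, Node l = null, Node r = null)
    {
        value = v; left = l; right = r;
    }

    // In-order traversal: left subtree, this node, right subtree.
    int opApply(int delegate(ref int) dg)
    {
        if (left)
            if (auto r = left.opApply(dg)) return r;
        if (auto r = dg(value)) return r;
        if (right)
            if (auto r = right.opApply(dg)) return r;
        return 0;
    }
}

void main()
{
    //     2
    //    / \
    //   1   3
    auto root = new Node(2, new Node(1), new Node(3));
    foreach (v; root)
        write(v, ' '); // 1 2 3
    writeln();
}
----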
On Mon, 31 Oct 2011 13:56:21 +0100, Steven Schveighoffer <schveiguy yahoo.com> wrote:
> The compiler should almost always be able to inline opApply, as the
> code for the opApply body is always available. There are few
> exceptions, such as when opApply is not final, or when it's recursive.
> [...]
> Being able to exploit the fact that a delegate literal is always fully
> available would be nice.

Inlining the body would necessitate one parameterized function per caller. Quite a lot of code, and it doesn't even have defined mangling, does it? It's a different topic when both the opApply function and the body are inlined at the call site. Also, when inlining the body it is much more difficult to assign registers to variables from the calling stack, unless you can prove that nobody can refer to them.

I think a better approach is to revive the templated 'opApply(alias fbody)' idea as an alternative to the delegate-based one. But there are currently too many issues to do this. Constructing similar functionality requires quite some C++-ish gymnastics, partly due to bugs and for reasons discussed here: http://d.puremagic.com/issues/show_bug.cgi?id=5710.

----
import std.range, std.stdio;

struct Context
{
    // doesn't even work as a static method
    alias Context_forEach forEach;
    uint[] data;
}

// doesn't work at module scope when erroneously prefixed with static ???
/*static*/ void Context_forEach(alias fbody)(ref Context self)
{
    foreach (e; self.data)
        fbody(e);
}

void main()
{
    size_t sum;
    auto ctx = Context(array(iota(0U, 500U)));
    Context.forEach!( (a) { sum += a; } )(ctx);
    writeln(sum);
}
----

martin
Oct 31 2011
Thanks for the explanation, guys. That definitely makes me feel better about using ranges in my code (though for plain iteration opApply does seem easier to implement, since it forces no naming convention beyond a single operator). When replacing modulus 2 with modulus 100, the results are indeed very close, proving that the performance issue is not related to ranges vs opApply. Which is rather surprising to me, actually. Optimization ftw, I guess. :P

On 30/10/2011 10:45 AM, Martin Nowak wrote:
> No it does not. In fact the opApply version does ONE additional check
> at loop start, which doesn't matter for 2.5M iterations.
> [...]
> The test still doesn't measure anything but spurious effects of
> different inline decisions. I get a ~10% hit for range foreach over
> opApply. When looping without predicate than popFront is inlined and
> the result is a ~10% hit for opApply. As the opApply body can never be
> inlined it is a worse choice in general.
>
> martin
Oct 31 2011