digitalmars.D - Tricky semantics of ranges & potentially numerous Phobos bugs
- H. S. Teoh (50/50) Oct 15 2012 In an effort to locate the suspected Phobos bug that I ran into in my
- H. S. Teoh (55/98) Oct 16 2012 I think a better solution is to somehow differentiate between ranges
- Andrei Alexandrescu (4/5) Oct 17 2012 isTransient!R is exactly the same thing as isInputRange!R &&
- monarch_dodra (6/11) Oct 17 2012 Technically, (isTransient!R) is just a subset of (isInputRange!R
- Andrei Alexandrescu (3/16) Oct 17 2012 Depends on implementation - popFront may actually wipe the character.
- monarch_dodra (5/25) Oct 17 2012 "byDChar" (if it existed), *could* be a perfectly valid
- Jonathan M Davis (37/42) Oct 17 2012 In what way? There is nothing anywhere that states that an input range's...
- monarch_dodra (9/15) Oct 18 2012 That's already a pretty big usage :) The thing is you just can't
- H. S. Teoh (30/48) Oct 17 2012 It's perfectly possible to implement joiner, chain, find, count, cmp,
- Don Clugston (5/49) Oct 18 2012 Is it actually orthogonal? Is it possible for a forward range to be
- monarch_dodra (63/68) Oct 18 2012 Save just means the range can save its position. If it is
- H. S. Teoh (9/23) Oct 18 2012 [...]
- Andrei Alexandrescu (6/10) Oct 17 2012 Such issues do happen, are relatively rare, and are virtually untested
- David Nadlinger (7/12) Oct 17 2012 If an abstraction encourages use in a way which leads to
- Andrei Alexandrescu (7/15) Oct 17 2012 When I designed input ranges vs. forward ranges there were long
In an effort to locate the suspected Phobos bug that I ran into in my new Cartesian product implementation, I looked over the code for std.algorithm joiner more carefully, and noticed the following: Given a range of ranges ror, it assigns ror.front to a struct member and then calls ror.popFront() immediately. Then as the user calls the joined range's popFront, it successively pops its local copy of ror.front until it's empty, whereupon it assigns the new ror.front to the local copy again, and so on. While this works for array of arrays, it *doesn't* work for ranges that overwrite ror.front when ror.popFront() is called. For example, it wouldn't work for stdin.byLine(), because the underlying buffer it reused. Here's the proof: Code: // Filename: test2.d import std.algorithm; import std.stdio; void main() { auto lines = stdin.byLine(); writeln(lines.joiner); } Compile & test: dmd test2.d (echo abc; echo def; echo ghi) | ./test2 defghighi As you can see, the output is mangled, because byLine() has reused the line buffer before joiner.Result got to it. Long story short: saving a local copy of range.front and then calling range.popFront() may _invalidate_ the copy. So basically, either you need a forward range and use .save to keep track of old values of range.front, or you have to refrain from calling popFront() until you're well and truly done with the current value of .front. While not respecting this will work with many common ranges, it will also fail in subtle ways when given other ranges. The scary thing is, I see similar code like this all over Phobos. Does this mean that most of std.algorithm may need to be revised to address this issue? At the very least, it would seem that a code audit is in order to weed out this particular issue. :-/ T -- A linguistics professor was lecturing to his class one day. "In English," he said, "A double negative forms a positive. In some languages, though, such as Russian, a double negative is still a negative. However, there is no language wherein a double positive can form a negative." A voice from the back of the room piped up, "Yeah, yeah."
Oct 15 2012
On Tue, Oct 16, 2012 at 07:28:38PM +0200, Jonathan M Davis wrote:On Tuesday, October 16, 2012 10:13:57 H. S. Teoh wrote:[...]ByLine is perfectly valid range insofar as you realize that it's likely to go completely south if you use it in any way that could involve keeping front around after popFront has been called, which means that anything which relies on keeping front around isn't going to work. So, it's a range, but it's essentially an unsafe one (though I'm not sure that it's an un- safe one). So, it's fine that ByLine is a range as long as we're willing to put up with it not working with a lot of range-based functions because of its abnormal behavior. But I don't think that it's at all reasonable for range-based functions in general to not be able to rely on front returning the same type every time or on its value disappearing or becoming invalid in some way after a call to popFront. That's completely untenable IMHO.I think a better solution is to somehow differentiate between ranges whose .front value is permanent (arrays), and ranges whose .front values are transient (e.g. byLine, but we shouldn't single it out because there are other occasions where you may *want* to have a range that doesn't call .dup every time -- say a range over all permutations of an array that modifies the array in-place). Perhaps mark ranges with an .isTransient property (which can be an enum so it doesn't waste any space), and range-based functions that require non-transient ranges will refuse to work with a transient range. Or alternatively, switch to a different implementation that takes transience into account (which may be slower, etc., but the important thing is to do it right -- after all, D's motto is safe by default, unsafe if you ask for it).Ranges _can_ define semantics which violate that, but they have to make it clear that they do so that programmers using them realize that they may not work right with a lot of range-based functions (which potentially makes it so that it they really shouldn't have been ranges in the first place).I think this approach of what amounts to guessing which ranges are safe/unsafe with which functions is what's untenable. We all know that people don't read documentation, or at least, not thoroughly. Something like this is too easy to miss, and bugs will slip into code unnoticed for a long time, and then explode in your face. It's unsafe by default, which goes against D's philosophy. I think the problem is better addressed by an .isTransient property that can be selected by templates at compile-time. I agree that requiring *every* range function to expect .front to mutate upon calling .popFront() is unreasonable, but *some* functions *can* be written in a way that they will still work. Marking ranges with transient .front values will allow functions that *can* take transient ranges do so, while leaving other functions accepting only non-transient ranges. That way, if somebody tries to pass a transient range to a template that only works with a non-transient range, they will know at compile-time instead of getting a nasty surprise later on.[...] I don't like the idea that some ranges _may_ or _may not_ work with some functions. That's just too unreliable, and it's too easy to make mistakes. Let's put in an isTransient property on the unsafe ranges and let the templates' signature constraints enforce passing the right kind of range. On Tue, Oct 16, 2012 at 10:58:23AM -0700, David Gileadi wrote: [...]So what is (or should be) the semantic design of ranges? Let's work out a precise definition so that we have something to build on.As far as front (or back) goes, range-based functions should be able to rely on 1. front returning the exact same value on every call until popFront has been called (though there's no guarantee that front won't have to be recalculated on each call, so assigning the result of front to a local variable is advisable for efficiency if front must be used multiple times before a call to popFront). 2. the result of front continuing to be valid and unchanged after popFront has been called if it was assigned to a variable. Any range is free to violate this, but because range-based functions are free to rely on it, such ranges risk not working correctly with many range-based functions and must be used with caution.As a D dabbler, it seems wrong to me that a built-in range like byLine, which is often used in example code, should have weird side effects with the built-in methods in std.algorithm. IMO this needs to be fixed one way or another, and a mere documentation change is likely not enough for casual D users.I agree. We should add an isTransient property to at least the built-in ranges in Phobos, so that at least those ranges will always work properly (or complain loudly at compile-time if passed to the wrong function). User-defined ranges need to be marked too, but that is up to the user. I don't think this part can be statically checked (unless you want to enforce it using the type system by making non-transient ranges return immutable from .front -- but that may be impractical in some cases). So this has to be documented, and users will have the option of marking their own ranges and get the benefit of compile-time compatibility checks. But at the very least, the Phobos ranges must always work correctly, or refuse to work at compile-time. Leaving things up to convention (via documentation) is not a good idea. T -- ASCII stupid question, getty stupid ANSI.
Oct 16 2012
On 10/16/12 3:07 PM, H. S. Teoh wrote:Perhaps mark ranges with an .isTransient propertyisTransient!R is exactly the same thing as isInputRange!R && !isForwardRange!R. Andrei
Oct 17 2012
On Wednesday, 17 October 2012 at 17:15:14 UTC, Andrei Alexandrescu wrote:On 10/16/12 3:07 PM, H. S. Teoh wrote:Technically, (isTransient!R) is just a subset of (isInputRange!R && !isForwardRange!R) "byDChar" (if it existed), would be a perfectly valid non-transient input only range.Perhaps mark ranges with an .isTransient propertyisTransient!R is exactly the same thing as isInputRange!R && !isForwardRange!R. Andrei
Oct 17 2012
On 10/17/12 1:29 PM, monarch_dodra wrote:On Wednesday, 17 October 2012 at 17:15:14 UTC, Andrei Alexandrescu wrote:Depends on implementation - popFront may actually wipe the character. AndreiOn 10/16/12 3:07 PM, H. S. Teoh wrote:Technically, (isTransient!R) is just a subset of (isInputRange!R && !isForwardRange!R) "byDChar" (if it existed), would be a perfectly valid non-transient input only range.Perhaps mark ranges with an .isTransient propertyisTransient!R is exactly the same thing as isInputRange!R && !isForwardRange!R. Andrei
Oct 17 2012
On Wednesday, 17 October 2012 at 18:13:42 UTC, Andrei Alexandrescu wrote:On 10/17/12 1:29 PM, monarch_dodra wrote:"byDChar" (if it existed), *could* be a perfectly valid non-transient input only range, if the implementation returns the character by value. :)On Wednesday, 17 October 2012 at 17:15:14 UTC, Andrei Alexandrescu wrote:Depends on implementation - popFront may actually wipe the character. AndreiOn 10/16/12 3:07 PM, H. S. Teoh wrote:Technically, (isTransient!R) is just a subset of (isInputRange!R && !isForwardRange!R) "byDChar" (if it existed), would be a perfectly valid non-transient input only range.Perhaps mark ranges with an .isTransient propertyisTransient!R is exactly the same thing as isInputRange!R && !isForwardRange!R. Andrei
Oct 17 2012
On Wednesday, October 17, 2012 13:15:12 Andrei Alexandrescu wrote:On 10/16/12 3:07 PM, H. S. Teoh wrote:In what way? There is nothing anywhere that states that an input range's front could be invalidated upon a call to popFront. There's nothing inherent in the fact that range can't be saved that makes it so that front must be invalidated upon a call to popFront. You could easily define an input range of bytes over a file where front is perfectly valid after a call to popFront, because it's a value type. You could also define ByLine so that its front is perfectly valid after a call to popFront. It's just that it requires that a new buffer be allocated instead of reusing the buffer. That has nothing to do with the fact that its incapable of copying its state such that you could have multiple copies of it being iterated over separately. I don't see anything in the concept or definition of input ranges which implies that front would be invalidated by a call to popFront. I'm increasingly convinced that input ranges which are not forward ranges are useless for pretty much anything other than foreach. Far too much requires that you be able to save the current state - and most stuff _inherently_ requires it such that it's not simply a question of implementing the function differently. And adding even further restrictions on input ranges just makes it worse. It actually wouldn't hurt my feelings one whit if we got rid of the idea of input ranges entirely. It's perfectly possible to implement ranges like ByLine as forward ranges. It just requires a bit more work. But they'd be _way_ more useful if they were. In my experience the only time that you don't need to dup what ByLine or ByChunk gives you is when all you need is foreach, and if that's really the case, then opApply can be used for foreach, and ByLine and ByChunk can be defined in ways that actually allow them to not only not invalidate the result of previous front calls but to actually be full-on forward ranges, making them actually useful for things beyond foreach. Regardless, there's nothing in how input ranges are currently defined which indicates that front would ever be invalidated for _any_ type of range, and ByLine and ByChunk are pretty much the only ranges I've ever seen which invalidate previous calls to front. So, I don't see how you could think that they're anything but abnormal. And if you really want to argue that whether front can be invalidated or not is somehow part of the difference between an input range and a forward range, then the documentation on that needs to make that _very_ clear, and it's going to be that much worse to deal with input ranges which aren't forward ranges. - Jonathan M DavisPerhaps mark ranges with an .isTransient propertyisTransient!R is exactly the same thing as isInputRange!R && !isForwardRange!R.
Oct 17 2012
On Wednesday, 17 October 2012 at 19:56:08 UTC, Jonathan M Davis wrote:[SNIP] I'm increasingly convinced that input ranges which are not forward ranges are useless for pretty much anything other than foreach. [SNIP] - Jonathan M DavisThat's already a pretty big usage :) The thing is you just can't "do" anything with them, but that *is* the design. Just read them once to place them into another container. The fact they interface with, say "array", or "appender", or copy, makes the interface convenient. The fact that byLine will choke on a call to "array", IMO, has nothing to do with it being a forward range.
Oct 18 2012
On Wed, Oct 17, 2012 at 12:55:56PM -0700, Jonathan M Davis wrote: [...]I'm increasingly convinced that input ranges which are not forward ranges are useless for pretty much anything other than foreach. Far too much requires that you be able to save the current state - and most stuff _inherently_ requires it such that it's not simply a question of implementing the function differently.It's perfectly possible to implement joiner, chain, find, count, cmp, equal, until, filter, map, reduce, without assuming that the value returned by .front is persistent. Just to name a few. In fact, it's even possible to implement cartesianProduct in which one of the ranges is an input range. I'd hardly call that useless.And adding even further restrictions on input ranges just makes it worse. It actually wouldn't hurt my feelings one whit if we got rid of the idea of input ranges entirely.The motivating example for input ranges, at least according to TDPL, is find(). There's nothing about find() that precludes non-forward input ranges. A lot would be missing from the usefulness of ranges if we were forced to only use forward ranges. [...]Regardless, there's nothing in how input ranges are currently defined which indicates that front would ever be invalidated for _any_ type of range, and ByLine and ByChunk are pretty much the only ranges I've ever seen which invalidate previous calls to front. So, I don't see how you could think that they're anything but abnormal.I can think of quite a few situations in which it's useful to not assume that the return value of .front is persistent, which I've already mentioned before: in-place array permutation, reused buffers for complex computations, etc..And if you really want to argue that whether front can be invalidated or not is somehow part of the difference between an input range and a forward range, then the documentation on that needs to make that _very_ clear, and it's going to be that much worse to deal with input ranges which aren't forward ranges.[...] I think I'm not so sure about Andrei's lumping input ranges with persistent return values from .front together with forward ranges. Some algorithms, like findAdjacent, do not need a forward range, but they do need a persistent .front. I do not like the idea of artificially limiting the scope of findAdjacent just because you can't assume input ranges' .front returns a persistent value. Like somebody else mentioned, whether .front is transient or not is orthogonal to whether the range is an input range or a forward range. There can be ranges whose .front is persistent, but they can't be forward ranges for practical reasons. T -- Having a smoking section in a restaurant is like having a peeing section in a swimming pool. -- Edward Burr
Oct 17 2012
On 17/10/12 23:41, H. S. Teoh wrote:On Wed, Oct 17, 2012 at 12:55:56PM -0700, Jonathan M Davis wrote: [...]Is it actually orthogonal? Is it possible for a forward range to be transient? Or is it an intermediate concept? TransientInputRange -> NonTransientInputRange -> ForwardRangeI'm increasingly convinced that input ranges which are not forward ranges are useless for pretty much anything other than foreach. Far too much requires that you be able to save the current state - and most stuff _inherently_ requires it such that it's not simply a question of implementing the function differently.It's perfectly possible to implement joiner, chain, find, count, cmp, equal, until, filter, map, reduce, without assuming that the value returned by .front is persistent. Just to name a few. In fact, it's even possible to implement cartesianProduct in which one of the ranges is an input range. I'd hardly call that useless.And adding even further restrictions on input ranges just makes it worse. It actually wouldn't hurt my feelings one whit if we got rid of the idea of input ranges entirely.The motivating example for input ranges, at least according to TDPL, is find(). There's nothing about find() that precludes non-forward input ranges. A lot would be missing from the usefulness of ranges if we were forced to only use forward ranges. [...]Regardless, there's nothing in how input ranges are currently defined which indicates that front would ever be invalidated for _any_ type of range, and ByLine and ByChunk are pretty much the only ranges I've ever seen which invalidate previous calls to front. So, I don't see how you could think that they're anything but abnormal.I can think of quite a few situations in which it's useful to not assume that the return value of .front is persistent, which I've already mentioned before: in-place array permutation, reused buffers for complex computations, etc..And if you really want to argue that whether front can be invalidated or not is somehow part of the difference between an input range and a forward range, then the documentation on that needs to make that _very_ clear, and it's going to be that much worse to deal with input ranges which aren't forward ranges.[...] I think I'm not so sure about Andrei's lumping input ranges with persistent return values from .front together with forward ranges. Some algorithms, like findAdjacent, do not need a forward range, but they do need a persistent .front. I do not like the idea of artificially limiting the scope of findAdjacent just because you can't assume input ranges' .front returns a persistent value. Like somebody else mentioned, whether .front is transient or not is orthogonal to whether the range is an input range or a forward range. There can be ranges whose .front is persistent, but they can't be forward ranges for practical reasons.
Oct 18 2012
On Thursday, 18 October 2012 at 07:09:04 UTC, Don Clugston wrote:On 17/10/12 23:41, H. S. Teoh wrote: Is it actually orthogonal? Is it possible for a forward range to be transient? Or is it an intermediate concept? TransientInputRange -> NonTransientInputRange -> ForwardRangeSave just means the range can save its position. If it is returning via a buffer, Forward of not, it is going to be transient. Take this forward range, that returns the strings "A", "B" and "C" ad infinitum: //---- enum _ABC = "ABC"; struct ABC { char[1] buf = _ABC[0]; size_t i; enum empty = false; property char[] front(){return buf;} void popFront() { ++i; buf[0] = _ABC[i%3]; } property ABC save() { return this; } } //---- This is a perfectly valid range, which you can save, but the returned string is transient: //---- void main() { ABC abc; writeln("Printing 10 elements: "); abc.take(10).writeln('\n'); writeln("Duplicating range"); auto abc2 = abc.save; abc.popFront; foreach(v; zip(abc, abc2).take(5)) write("[", v[0], ", ", v[1], "]"); writeln('\n'); writeln("Prnting two consecutive elements:"); auto first = abc.front; abc.popFront(); auto second = abc.front; writeln("[", first, ", ", second, "]"); } //---- Produces: //---- Printing 10 elements: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"] Duplicating range [B, A][C, B][A, C][B, A][C, B] Prnting two consecutive elements: [C, C] //---- As you can see, you can perfectly iterate. You can perfectly save the range. The saved range can be used to backtrack. But if you attempt to store two consecutive fronts, things don't go well. The same holds true for a Random Access range BTW. Iteration and transient-ness of returned value are orthogonal concepts
Oct 18 2012
On Thu, Oct 18, 2012 at 09:09:03AM +0200, Don Clugston wrote:On 17/10/12 23:41, H. S. Teoh wrote:[...][...] What about a range over all permutations of an array, that modifies the array in-place? It can be a forward range by having .save copy the current state of the array, but .front is transient nonetheless. T -- Life is too short to run proprietary software. -- Bdale GarbeeI think I'm not so sure about Andrei's lumping input ranges with persistent return values from .front together with forward ranges. Some algorithms, like findAdjacent, do not need a forward range, but they do need a persistent .front. I do not like the idea of artificially limiting the scope of findAdjacent just because you can't assume input ranges' .front returns a persistent value. Like somebody else mentioned, whether .front is transient or not is orthogonal to whether the range is an input range or a forward range. There can be ranges whose .front is persistent, but they can't be forward ranges for practical reasons.Is it actually orthogonal? Is it possible for a forward range to be transient?
Oct 18 2012
On 10/16/12 1:09 AM, H. S. Teoh wrote:The scary thing is, I see similar code like this all over Phobos. Does this mean that most of std.algorithm may need to be revised to address this issue? At the very least, it would seem that a code audit is in order to weed out this particular issue.Such issues do happen, are relatively rare, and are virtually untested because there's been no unittests with a deliberately "bad" input range. Although of course we do need to add the appropriate fixes and unittests, I'm not worried about it at all systemically. Andrei
Oct 17 2012
On Wednesday, 17 October 2012 at 15:14:44 UTC, Andrei Alexandrescu wrote:Such issues do happen, are relatively rare, and are virtually untested because there's been no unittests with a deliberately "bad" input range. Although of course we do need to add the appropriate fixes and unittests, I'm not worried about it at all systemically.If an abstraction encourages use in a way which leads to hard-to-detect logic bugs that do not occur with the most common test cases, this is _exactly_ the thing we should be worried about! David
Oct 17 2012
On 10/17/12 12:57 PM, David Nadlinger wrote:On Wednesday, 17 October 2012 at 15:14:44 UTC, Andrei Alexandrescu wrote:When I designed input ranges vs. forward ranges there were long discussion on how to distinguish them. If you have a better design it would probably not influence the state of the affair, but it would be a good discussion to have. FWIW I can already think of a couple but all rely on UFCS. AndreiSuch issues do happen, are relatively rare, and are virtually untested because there's been no unittests with a deliberately "bad" input range. Although of course we do need to add the appropriate fixes and unittests, I'm not worried about it at all systemically.If an abstraction encourages use in a way which leads to hard-to-detect logic bugs that do not occur with the most common test cases, this is _exactly_ the thing we should be worried about!
Oct 17 2012