digitalmars.D - algorithms that take ranges by reference
- Andrei Alexandrescu (99/99) Dec 30 2009 The grand STL tradition is to _always_ take iterators by value. If
- Kevin Bealer (21/41) Dec 30 2009 I would vote yes -- I've used this technique (with my own character slic...
- Andrei Alexandrescu (3/37) Dec 30 2009 You read my mind. I was thinking of something with "copy" in the name.
- Philippe Sigaud (11/26) Dec 31 2009 I find popFrontUntil / popFrontWhile to be quite useful myself. Heck, I
- Michel Fortin (13/32) Dec 30 2009 I'd say go by the most efficient method. I've implemented a few text
- Andrei Alexandrescu (4/36) Dec 30 2009 No, just some html scraping. I need to parse 200 GB worth of html :o).
The grand STL tradition is to _always_ take iterators by value. If iterators are computed, they are returned by the algorithm. My initial approach to defining std.algorithm was to continue that tradition for ranges: ranges are values. No algorithm currently takes a range by reference. There are a couple of simple functions that emphatically do take ref, namely popFrontN and popBackN in std.range. It is becoming, however, increasingly clear that there are algorithms that could and should manipulate ranges by reference. They might take and return values, but it's just too messy to do so. (Cue music for the "improve built-in tuples choir.) A concrete case is text processing. Many contemporary libraries use index-based processing, but that's difficult when handling multibyte characters. To address that, bidirectional ranges are one correct way to go. Phobos defines a ByCodeUnit range that spans a string correctly, one dchar at a time. With that, you can write: auto fname = "some-utf8-file.html"; auto txt = byDchar(readText(fname)); // use txt.empty, txt.front, txt.back, txt.popFront, txt.popBack The front and back methods have type dchar. So far so good. But then I noticed that many std.algorithms make it difficult to play with the range comfortably. For example, say I want to match a tag, and special-case for a few particular tags: while (!txt.empty) { auto c = txt.front; txt.popFront(); if (c == '<') { if (startsWith(txt, "!--")) { // This is a comment popFrontN(txt, 3); txt = find(txt, "-->"); popFrontN(txt, 3); ... } else if (startsWith(txt, "script")) { ... } } ... } This is already wasteful: startsWith scans a few characters off txt, and then popFrontN does it again. In this particular case it's not all that inefficient, but clearly the approach doesn't have elegance on its side either, so it's a lose-lose. There's also the case issue with e.g. the "script" tag, but I hope the approach outlined in the post "one step towards unification of std.algorithm and std.string" will take care of that. Looking at samples like the above, some needed algorithms look like this (simplified by e.g. giving up custom predicates etc.): /** If r1 starts with r2, advance r1 past it and return true. Otherwise, leave r1 unchanged and return false. */ bool skip(R1, R2)(ref R1 r1, R2 r2) { auto r = r1; // .save(); while (!r2.empty && !r.empty && r.front == r2.front) { r.popFront(); r2.popFront(); } if (r2.empty) { r1 = r; return true; } return false; } /** If r2 can be find in r1, advance r1 past r2 and return true. Otherwise, leave r1 unchanged and return false. */ bool findSkip(R1, R2)(ref R1 r1, R2 r2) { auto r = r1; // .save(); while (!r.empty && !startsWith(r1, r2)) { r.popFront(); r2.popFront(); } if (r2.empty) { r1 = r; return true; } return false; } With such functions the code looks much cleaner: while (!txt.empty) { auto c = txt.front; txt.popFront(); if (c == '<') { if (skip(txt, "!--")) { // This is a comment enforce(findSkip(txt, "-->")); ... } else if (skip(txt, "script")) { ... } } ... } But they break the tradition because now an algorithm may alter or not a range, and client code must be aware of that - one more thing to worry about. What do you think? Should we go with by-ref passing or not? Other ideas? Andrei
Dec 30 2009
Andrei Alexandrescu Wrote:The grand STL tradition is to _always_ take iterators by value. If iterators are computed, they are returned by the algorithm. My initial approach to defining std.algorithm was to continue that tradition for ranges: ranges are values. No algorithm currently takes a range by reference. There are a couple of simple functions that emphatically do take ref, namely popFrontN and popBackN in std.range. It is becoming, however, increasingly clear that there are algorithms that could and should manipulate ranges by reference. They might take and return values, but it's just too messy to do so. (Cue music for the "improve built-in tuples choir.) A concrete case is text processing. Many contemporary libraries use index-based processing, but that's difficult when handling multibyte characters. To address that, bidirectional ranges are one correct way to go. Phobos defines a ByCodeUnit range that spans a string correctly, one dchar at a time. With that, you can write:...AndreiI would vote yes -- I've used this technique (with my own character slice classes in C++) and they are a great idiom to work with. I think ranges (and slices) have some of the properties from each of pointers, containers, and streams. A stream is always a by-ref kind of thing unless you are in a language that needs monads etc. Let me suggest one more function I've found very useful: bool readUntil(R1, R2, R3)(ref R1 input, R2 delim, ref R3 result) { // Like findSkip, but returning intervening text. } Then you can write something like: bool consumeQuotedString(R1,R2)(ref R1 text, ref R2 quoted) { if (skip(text, "'")) { if (! readUntil(text, "'", quoted)) { /*error*/ } return true; } return found; } Kevin
Dec 30 2009
Kevin Bealer wrote:Andrei Alexandrescu Wrote:You read my mind. I was thinking of something with "copy" in the name. AndreiThe grand STL tradition is to _always_ take iterators by value. If iterators are computed, they are returned by the algorithm. My initial approach to defining std.algorithm was to continue that tradition for ranges: ranges are values. No algorithm currently takes a range by reference. There are a couple of simple functions that emphatically do take ref, namely popFrontN and popBackN in std.range. It is becoming, however, increasingly clear that there are algorithms that could and should manipulate ranges by reference. They might take and return values, but it's just too messy to do so. (Cue music for the "improve built-in tuples choir.) A concrete case is text processing. Many contemporary libraries use index-based processing, but that's difficult when handling multibyte characters. To address that, bidirectional ranges are one correct way to go. Phobos defines a ByCodeUnit range that spans a string correctly, one dchar at a time. With that, you can write:...AndreiI would vote yes -- I've used this technique (with my own character slice classes in C++) and they are a great idiom to work with. I think ranges (and slices) have some of the properties from each of pointers, containers, and streams. A stream is always a by-ref kind of thing unless you are in a language that needs monads etc. Let me suggest one more function I've found very useful: bool readUntil(R1, R2, R3)(ref R1 input, R2 delim, ref R3 result) { // Like findSkip, but returning intervening text. }
Dec 30 2009
On Thu, Dec 31, 2009 at 01:11, Andrei Alexandrescu < SeeWebsiteForEmail erdani.org> wrote:Kevin Bealer wrote:I find popFrontUntil / popFrontWhile to be quite useful myself. Heck, I even needed an 'exhaust' function once, that just takes a range (mostly, the rest of a range inside an algorithm) and makes it empty. When writing these by-ref funs, I take care to indicate quite clearly in the docs that they modify the input range, because it's not a common pattern in std.algorithm. I like Michel's suggestion to use consume as a prefix. If Phobos goes for a nested structure, you could put this consuming algorithms in a special module for them. std.algorith.consume or somesuch. PhilippeYou read my mind. I was thinking of something with "copy" in the name.It is becoming, however, increasingly clear that there are algorithms that could and should manipulate ranges by reference. They might take and return values, but it's just too messy to do so. (Cue music for the "improve built-in tuples choir.) Let me suggest one more function I've found very useful:bool readUntil(R1, R2, R3)(ref R1 input, R2 delim, ref R3 result) { // Like findSkip, but returning intervening text. }
Dec 31 2009
On 2009-12-30 14:53:33 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:But they break the tradition because now an algorithm may alter or not a range, and client code must be aware of that - one more thing to worry about. What do you think? Should we go with by-ref passing or not? Other ideas?I'd say go by the most efficient method. I've implemented a few text parsing functions of my own and they all take the range by reference. I think you can make things clear with proper naming. All my functions that advance the range passed by reference are prefixed "consume": "consumeOneChar", "consumeString", "consumeNumber", "consumeUntil", "consumeWhile", etc. This makes the intent very clear.while (!txt.empty) { auto c = txt.front; txt.popFront(); if (c == '<') { if (skip(txt, "!--")) { // This is a comment enforce(findSkip(txt, "-->")); ... } else if (skip(txt, "script")) { ... } } ... }Are you writing a new XML parser? -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Dec 30 2009
Michel Fortin wrote:On 2009-12-30 14:53:33 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:Perfect. I'll do that.But they break the tradition because now an algorithm may alter or not a range, and client code must be aware of that - one more thing to worry about. What do you think? Should we go with by-ref passing or not? Other ideas?I'd say go by the most efficient method. I've implemented a few text parsing functions of my own and they all take the range by reference. I think you can make things clear with proper naming. All my functions that advance the range passed by reference are prefixed "consume": "consumeOneChar", "consumeString", "consumeNumber", "consumeUntil", "consumeWhile", etc. This makes the intent very clear.No, just some html scraping. I need to parse 200 GB worth of html :o). Andreiwhile (!txt.empty) { auto c = txt.front; txt.popFront(); if (c == '<') { if (skip(txt, "!--")) { // This is a comment enforce(findSkip(txt, "-->")); ... } else if (skip(txt, "script")) { ... } } ... }Are you writing a new XML parser?
Dec 30 2009