digitalmars.D - algorithms that take ranges by reference

Andrei Alexandrescu (99/99) Dec 30 2009 The grand STL tradition is to _always_ take iterators by value. If

Kevin Bealer (21/41) Dec 30 2009 I would vote yes -- I've used this technique (with my own character slic...

Andrei Alexandrescu (3/37) Dec 30 2009 You read my mind. I was thinking of something with "copy" in the name.

Philippe Sigaud (11/26) Dec 31 2009 I find popFrontUntil / popFrontWhile to be quite useful myself. Heck, I

Michel Fortin (13/32) Dec 30 2009 I'd say go by the most efficient method. I've implemented a few text

Andrei Alexandrescu (4/36) Dec 30 2009 No, just some html scraping. I need to parse 200 GB worth of html :o).

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

The grand STL tradition is to _always_ take iterators by value. If 
iterators are computed, they are returned by the algorithm.

My initial approach to defining std.algorithm was to continue that 
tradition for ranges: ranges are values. No algorithm currently takes a 
range by reference. There are a couple of simple functions that 
emphatically do take ref, namely popFrontN and popBackN in std.range.

It is becoming, however, increasingly clear that there are algorithms 
that could and should manipulate ranges by reference. They might take 
and return values, but it's just too messy to do so. (Cue music for the 
"improve built-in tuples choir.)

A concrete case is text processing. Many contemporary libraries use 
index-based processing, but that's difficult when handling multibyte 
characters. To address that, bidirectional ranges are one correct way to 
go. Phobos defines a ByCodeUnit range that spans a string correctly, one 
dchar at a time. With that, you can write:

auto fname = "some-utf8-file.html";
auto txt = byDchar(readText(fname));
// use txt.empty, txt.front, txt.back, txt.popFront, txt.popBack

The front and back methods have type dchar.

So far so good. But then I noticed that many std.algorithms make it 
difficult to play with the range comfortably. For example, say I want to 
match a tag, and special-case for a few particular tags:

while (!txt.empty) {
     auto c = txt.front;
     txt.popFront();
     if (c == '<') {
         if (startsWith(txt, "!--")) {
             // This is a comment
             popFrontN(txt, 3);
             txt = find(txt, "-->");
             popFrontN(txt, 3);
             ...
          } else if (startsWith(txt, "script")) {
          ...
          }
     }
     ...
}

This is already wasteful: startsWith scans a few characters off txt, and 
then popFrontN does it again. In this particular case it's not all that 
inefficient, but clearly the approach doesn't have elegance on its side 
either, so it's a lose-lose.

There's also the case issue with e.g. the "script" tag, but I hope the 
approach outlined in the post "one step towards unification of 
std.algorithm and std.string" will take care of that.

Looking at samples like the above, some needed algorithms look like this 
(simplified by e.g. giving up custom predicates etc.):

/**
If r1 starts with r2, advance r1 past it and return true. Otherwise, 
leave r1 unchanged and return false.
*/
bool skip(R1, R2)(ref R1 r1, R2 r2) {
     auto r = r1; // .save();
     while (!r2.empty && !r.empty && r.front == r2.front) {
         r.popFront();
         r2.popFront();
     }
     if (r2.empty) {
         r1 = r;
         return true;
     }
     return false;
}

/**
If r2 can be find in r1, advance r1 past r2 and return true. Otherwise, 
leave r1 unchanged and return false.
*/
bool findSkip(R1, R2)(ref R1 r1, R2 r2) {
     auto r = r1; // .save();
     while (!r.empty && !startsWith(r1, r2)) {
         r.popFront();
         r2.popFront();
     }
     if (r2.empty) {
         r1 = r;
         return true;
     }
     return false;
}

With such functions the code looks much cleaner:

while (!txt.empty) {
     auto c = txt.front;
     txt.popFront();
     if (c == '<') {
         if (skip(txt, "!--")) {
             // This is a comment
             enforce(findSkip(txt, "-->"));
             ...
          } else if (skip(txt, "script")) {
          ...
          }
     }
     ...
}

But they break the tradition because now an algorithm may alter or not a 
range, and client code must be aware of that - one more thing to worry 
about.

What do you think? Should we go with by-ref passing or not? Other ideas?



Andrei

Dec 30 2009

Kevin Bealer <kevinbealer gmail.com> writes:

Andrei Alexandrescu Wrote:

 The grand STL tradition is to _always_ take iterators by value. If 
 iterators are computed, they are returned by the algorithm.
 
 My initial approach to defining std.algorithm was to continue that 
 tradition for ranges: ranges are values. No algorithm currently takes a 
 range by reference. There are a couple of simple functions that 
 emphatically do take ref, namely popFrontN and popBackN in std.range.
 
 It is becoming, however, increasingly clear that there are algorithms 
 that could and should manipulate ranges by reference. They might take 
 and return values, but it's just too messy to do so. (Cue music for the 
 "improve built-in tuples choir.)
 
 A concrete case is text processing. Many contemporary libraries use 
 index-based processing, but that's difficult when handling multibyte 
 characters. To address that, bidirectional ranges are one correct way to 
 go. Phobos defines a ByCodeUnit range that spans a string correctly, one 
 dchar at a time. With that, you can write:
 

...
 Andrei

I would vote yes -- I've used this technique (with my own character slice
classes in C++) and they are a great idiom to work with.

I think ranges (and slices) have some of the properties from each of pointers,
containers, and streams.  A stream is always a by-ref kind of thing unless you
are in a language that needs monads etc.

Let me suggest one more function I've found very useful:

bool readUntil(R1, R2, R3)(ref R1 input, R2 delim, ref R3 result)
{
  // Like findSkip, but returning intervening text.
}

Then you can write something like:

bool consumeQuotedString(R1,R2)(ref R1 text, ref R2 quoted)
{
    if (skip(text, "'")) {
        if (! readUntil(text, "'", quoted)) {
            /*error*/
        }
        return true;
    }
    return found;
}

Kevin

Dec 30 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Kevin Bealer wrote:
 Andrei Alexandrescu Wrote:
 
 The grand STL tradition is to _always_ take iterators by value. If 
 iterators are computed, they are returned by the algorithm.

 My initial approach to defining std.algorithm was to continue that 
 tradition for ranges: ranges are values. No algorithm currently takes a 
 range by reference. There are a couple of simple functions that 
 emphatically do take ref, namely popFrontN and popBackN in std.range.

 It is becoming, however, increasingly clear that there are algorithms 
 that could and should manipulate ranges by reference. They might take 
 and return values, but it's just too messy to do so. (Cue music for the 
 "improve built-in tuples choir.)

 A concrete case is text processing. Many contemporary libraries use 
 index-based processing, but that's difficult when handling multibyte 
 characters. To address that, bidirectional ranges are one correct way to 
 go. Phobos defines a ByCodeUnit range that spans a string correctly, one 
 dchar at a time. With that, you can write:

 ...
 Andrei

 
 I would vote yes -- I've used this technique (with my own character slice
classes in C++) and they are a great idiom to work with.
 
 I think ranges (and slices) have some of the properties from each of pointers,
containers, and streams.  A stream is always a by-ref kind of thing unless you
are in a language that needs monads etc.
 
 Let me suggest one more function I've found very useful:
 
 bool readUntil(R1, R2, R3)(ref R1 input, R2 delim, ref R3 result)
 {
   // Like findSkip, but returning intervening text.
 }

You read my mind. I was thinking of something with "copy" in the name.

Andrei

Dec 30 2009

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Thu, Dec 31, 2009 at 01:11, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 Kevin Bealer wrote:

 It is becoming, however, increasingly clear that there are algorithms
 that could and should manipulate ranges by reference. They might take and
 return values, but it's just too messy to do so. (Cue music for the "improve
 built-in tuples choir.)

 Let me suggest one more function I've found very useful:

 bool readUntil(R1, R2, R3)(ref R1 input, R2 delim, ref R3 result)
 {
  // Like findSkip, but returning intervening text.
 }

 You read my mind. I was thinking of something with "copy" in the name.

I find popFrontUntil / popFrontWhile to be quite useful myself.  Heck, I
even needed an 'exhaust' function once, that just takes a range (mostly, the
rest of a range inside an algorithm) and makes it empty. When writing these
by-ref funs, I take care to indicate quite clearly in the docs that they
modify the input range, because it's not  a common pattern in std.algorithm.

I like Michel's suggestion to use consume as a prefix. If Phobos goes for a
nested structure, you could put this consuming algorithms in a special
module for them. std.algorith.consume or somesuch.

 Philippe

Dec 31 2009

Michel Fortin <michel.fortin michelf.com> writes:

On 2009-12-30 14:53:33 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 But they break the tradition because now an algorithm may alter or not 
 a range, and client code must be aware of that - one more thing to 
 worry about.
 
 What do you think? Should we go with by-ref passing or not? Other ideas?

I'd say go by the most efficient method. I've implemented a few text 
parsing functions of my own and they all take the range by reference.

I think you can make things clear with proper naming. All my functions 
that advance the range passed by reference are prefixed "consume": 
"consumeOneChar", "consumeString", "consumeNumber", "consumeUntil", 
"consumeWhile", etc. This makes the intent very clear.


 while (!txt.empty) {
      auto c = txt.front;
      txt.popFront();
      if (c == '<') {
          if (skip(txt, "!--")) {
              // This is a comment
              enforce(findSkip(txt, "-->"));
              ...
           } else if (skip(txt, "script")) {
           ...
           }
      }
      ...
 }

Are you writing a new XML parser?

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Dec 30 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Michel Fortin wrote:
 On 2009-12-30 14:53:33 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> said:
 
 But they break the tradition because now an algorithm may alter or not 
 a range, and client code must be aware of that - one more thing to 
 worry about.

 What do you think? Should we go with by-ref passing or not? Other ideas?

 
 I'd say go by the most efficient method. I've implemented a few text 
 parsing functions of my own and they all take the range by reference.
 
 I think you can make things clear with proper naming. All my functions 
 that advance the range passed by reference are prefixed "consume": 
 "consumeOneChar", "consumeString", "consumeNumber", "consumeUntil", 
 "consumeWhile", etc. This makes the intent very clear.

Perfect. I'll do that.

 while (!txt.empty) {
      auto c = txt.front;
      txt.popFront();
      if (c == '<') {
          if (skip(txt, "!--")) {
              // This is a comment
              enforce(findSkip(txt, "-->"));
              ...
           } else if (skip(txt, "script")) {
           ...
           }
      }
      ...
 }

 
 Are you writing a new XML parser?

No, just some html scraping. I need to parse 200 GB worth of html :o).


Andrei

Dec 30 2009

D Programming

C/C++ Programming

Other

digitalmars.D - algorithms that take ranges by reference