www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Dealing with ranges where front and popFront do the same logic / eager

reply Dennis <dkorpel gmail.com> writes:
I've always been curious around the design choice of ranges to 
make front and popFront separate functions, instead of having 
popFront return the front. I suppose it is useful sometimes to be 
able to access front multiple times without having to save it 
temporarily (can't think of a specific example though).

But sometimes doing popFront requires doing the same work as 
front. For example, for strings and autodecoding, both front and 
popFront need to determine the length of the next code point. Now 
for utf-8 that is still managable, but I'm making a tokenizer 
range for a programming language. Most of the logic in `front` 
needs to be duplicated for `popFront`! (and `empty` as well)

So I often end up with designs like:
```
struct Range
{
   string source;
   bool empty;
   Token front;
   this(string source) {
     this.source = source;
     popFront();
   }

   void popFront() {
     // *complicated logic*
     front = result;
   }
}
```

This increases the size of the Range struct, and makes the range 
one element eager. If the first token is invalid, and I use 
Exceptions, then merely constructing the range will immediately 
throw an Exception. I recall reading that constructors throwing 
exceptions are problematic, but I don't have any experience with 
that. So maybe I should add a try-catch in the constructor, and 
rethrow the Exception upon the first 'front' call? That feels 
hacky.

This is not a dealbreaker, I'm just curious on your tips and 
opinions regarding this.
Oct 16 2018
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Oct 16, 2018 at 10:59:50PM +0000, Dennis via Digitalmars-d-learn wrote:
 I've always been curious around the design choice of ranges to make
 front and popFront separate functions, instead of having popFront
 return the front. I suppose it is useful sometimes to be able to
 access front multiple times without having to save it temporarily
 (can't think of a specific example though).
IMO, it is a good design choice, though I have also encountered the issues you mention myself. The main advantage is the ability to make decisions with a 1-element lookahead -- you can examine the next element without consuming it, which is useful in writing parsers, when the next token may not be a part of the current parse subtree, so a subtree parsing routine can just gracefully exit without consuming the next token (and requiring stupid hacks like unputting the token back so that another parsing routine can pick it up), and let the upper-level routines take care of the case where the next token is actually invalid.
 But sometimes doing popFront requires doing the same work as front.
 For example, for strings and autodecoding, both front and popFront
 need to determine the length of the next code point. Now for utf-8
 that is still managable, but I'm making a tokenizer range for a
 programming language. Most of the logic in `front` needs to be
 duplicated for `popFront`! (and `empty` as well)
 
 So I often end up with designs like:
[...] Your solution is how I would do it too. There's no point in duplicating the work in .front; calling .front multiple times shouldn't recompute the same thing over and over again. Just do the work once and save it, then return the same result if the caller calls .front multiple times. If you're worried about the size of the struct becoming too big, just add a fake Token type that marks EOF, and check for that in .empty instead of the `bool empty`. [...]
 This increases the size of the Range struct, and makes the range one
 element eager. If the first token is invalid, and I use Exceptions,
 then merely constructing the range will immediately throw an
 Exception. I recall reading that constructors throwing exceptions are
 problematic, but I don't have any experience with that. So maybe I
 should add a try-catch in the constructor, and rethrow the Exception
 upon the first 'front' call? That feels hacky.
[...] It *is* hacky. I'd just throw the Exception in the ctor, and not bother with rethrowing it in .front. I'm not sure what's the reasoning behind the saying that throwing exceptions in ctors is bad, but exceptions are exactly the kind of thing designed for handling this sort of situation. If the parser detects a problem early (i.e., at construction time, rather than the first call to .front), why not report the problem early instead of waiting? Now, there *is* one situation where the first-element-eager behaviour of ranges isn't desirable, and that is if you're reading live user input from stdin (e.g., in a REPL), and you want to do something like print a prompt before the first input line is read. In this case, the mere act of constructing the parser would block on input, likely before the code has a chance to print the prompt. I've encountered this situation on multiple occasions, and have found it truly aggravating, esp. if you're using std.stdio.File.byLine, which has to read a line of input before it can know the value of .empty, so the mere act of calling .byLine requires you to print a prompt first, even if you structured your code differently for better encapsulation. My eventual solution is to wrap .byLine inside a struct that lazily constructs the ByLine instance: // Implementation of byLinePrompted. private struct ByLinePrompted(alias writePrompt, File) { private File f; private alias ByLine = typeof(f.byLine()); import std.typecons : Nullable; private Nullable!ByLine _impl; private ref ByLine impl() { if (_impl.isNull) { writePrompt(); _impl = f.byLine; } return _impl; } property bool empty() { return impl.empty; } property auto front() { return impl.front; } void popFront() { writePrompt(); impl.popFront(); } static assert(isInputRange!(typeof(this))); } /** * Returns: A lazily-initialized wrapper around File.byLine that * emits a prompt before every line read. * * Params: * writePrompt = A function invoked each time a prompt is called for. * f = The file to wrap around. */ auto byLinePrompted(alias writePrompt, File)(File f) if (is(typeof(writePrompt()))) { return ByLinePrompted!(writePrompt, File)(f); } Example usage: auto input = stdin.byLinePrompted!({ stdout.write("> "); stdout.flush; }).parse; while (!input.empty) { auto token = input.front; ... } In essence, we write the prompt before constructing byLine. This also defers any exceptions from the parser until the first call to .front. The byLinePrompted wrapper gives us a nice API to hide away all of this ugly mess. It does entail a null check every invocation, but the CPU branch predictor ought to be able to figure out pretty quickly that the null check will always be false after the first time, so you shouldn't see a significant performance hit. T -- Bomb technician: If I'm running, try to keep up.
Oct 16 2018
parent Dennis <dkorpel gmail.com> writes:
On Wednesday, 17 October 2018 at 00:12:13 UTC, H. S. Teoh wrote:
 I'm not sure what's the reasoning behind the saying that 
 throwing exceptions in ctors is bad, but exceptions are exactly 
 the kind of thing designed for handling this sort of situation. 
 If the parser detects a problem early (i.e., at construction 
 time, rather than the first call to .front), why not report the 
 problem early instead of waiting?
I think I must have confused constructors with destructors there. Walter made this comment: "Throwing in a destructor is a nightmare, it makes my brain hurt just trying to figure out what 'should' happen. I've proposed before that destructors should be nothrow." https://issues.dlang.org/show_bug.cgi?id=14903 I couldn't find anything about nothrow constructors, so it might be just fine. Throwing exceptions to early is a problem when you do something like this: ``` auto tokens = Lexer(source).handle!(ParseException, RangePrimitive.front, (e, r) => Token.error); ``` https://dlang.org/phobos/std_exception.html#.handle It's unexpected that the exception will be thrown one token in advance. Interesting solution for your problem with byLine by the way.
Oct 17 2018
prev sibling parent reply Dukc <ajieskola gmail.com> writes:
On Tuesday, 16 October 2018 at 22:59:50 UTC, Dennis wrote:
 [snip]
The first thing to consider for invalid tokens, at least for me, would be to either have popFront set empty to true, or set front to some value representing a parsing error. A programming language parser almost certainly needs to output helpful error messages, so the latter is likely a better approach. Whatever iterates through the range could then throw an appopriate exception when it encounters an error token. I'm not saying your approach is wrong though. It's probably just a matter of taste.
Oct 17 2018
parent Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Wednesday, 17 October 2018 at 09:53:22 UTC, Dukc wrote:

 Whatever iterates through the range could then throw an 
 appopriate exception when it encounters an error token.
Exceptions in a parser? Monsieur knows his perversion.
Oct 17 2018