
digitalmars.D - buffered input (examples?)

reply spir <denis.spir gmail.com> writes:
Hello,


Andrei Alexandrescu wrote:
"Generally laziness may be a tactics you could use to help with memory use. A 
good example is split() vs. splitter(). The "er" version offers one element at 
a time thus never forcing an allocation. The split() version must do all work 
upfront and also allocate a structure for depositing the output."
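
(A small aside, to check I read this right -- with Phobos' split and splitter
the difference is roughly the following: split allocates the whole result
array up front, while splitter just walks the input:)

import std.algorithm : splitter;
import std.array : split;
import std.stdio;

void main()
{
    string text = "one two three";
    string[] all = split(text);        // eager: the whole array is allocated up front
    auto words = splitter(text, ' ');  // lazy: yields one slice at a time, no allocation
    foreach (w; words) writeln(w);
}
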
-------
"Let's restart this. What do you want to do that you can't do with the proposed 
API?"


First, I would love to read about some typical or sample uses of buffered input 
(hopefully clearly explained), especially ones showing why & how buffering is a 
true (theoretical or concrete) gain there, and why lower-level control over 
buffering is helpful, if not required.
Here is a (real & current) use case for which I wonder whether it matches the 
intended use of buffered input ranges: the interface between a lexer and a 
parser. (This is the reason why I have followed this thread closely, while not 
knowing much about the domain.)

Currently, my lexing toolkit constructs lexers that generate a complete lexeme 
stream in one go. How can I make them work lazily, meaning on demand from the 
parser? (*) Here is the way I see the problem, and how to solve it naively:

This is simple for the subset of language grammars in which --at every syntactic 
level-- the first element (e.g. one lexeme) unambiguously determines which 
syntactic construct may be expected; just like, at the morphological level (for 
most programming languages), the first character unambiguously determines the 
lexeme type. (Hope I'm clear.) In this case, we only need to read one 
token/lexeme at a time: say, the buffer is a single token.
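
(For illustration, a toy sketch of that single-lexeme dispatch; the lexeme type
and all the names here are invented for the example, not taken from any existing
library:)

import std.stdio;

// Invented lexeme kinds, just for the example.
enum LexType { number, ident, lparen, eof }
struct Lexeme { LexType type; string text; }

// With a one-lexeme buffer, the type of the current lexeme alone decides
// which construct to try next, so no further lookahead is needed.
string constructFor(Lexeme current)
{
    final switch (current.type)
    {
        case LexType.number: return "literal expression";
        case LexType.ident:  return "name or call";
        case LexType.lparen: return "parenthesised expression";
        case LexType.eof:    return "end of input";
    }
}

void main()
{
    writeln(constructFor(Lexeme(LexType.ident, "foo")));  // name or call
}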

In the general case, things are somewhat more complicated. I won't go into the 
details (see e.g. http://en.wikipedia.org/wiki/Lookahead & 
http://en.wikipedia.org/wiki/Backtracking) of how/why, but we need to be able 
to look ahead & to backtrack when a choice pattern's sub-patterns are composite. 
Not only that, but --if I do not fool myself-- the amount of lookahead required 
can in some cases be unbounded (due to recursion), and in some extreme cases we 
may need to backtrack over the whole of what has been matched so far (if the 
top pattern is a choice). This seems to mean an unlimited buffer.

Thus, here is how I see the lexer <--> parser interface:
* One or more match function(s), mainly checking the current lexeme's type, 
possibly returning some more info or performing some useful actions.
* A store() command: on the first call, start recording matched lexemes into 
the buffer; push the current cursor in the buffer onto a stack [0, i, j, c...].
* An unstore() command: backtrack, meaning discard buffer[c..$]; pop the last 
cursor c.

This is to be used the following way:
1. When starting to match a composite pattern which is a sub-pattern of a 
choice, meaning we need lookahead and may need backtracking, call store().
2. When matching this pattern ends, whether by success or failure, call 
unstore().
     ~ success: the matched tokens are validated and have produced one or more
       AST node(s), thus we do not need them in the buffer anymore.
     ~ failure: we need to backtrack to where matching this pattern started,
       to try matching an alternative pattern (**)

The possible need to store in the buffer more than the current composite 
pattern's tokens, and to keep a stack of cursors, is (as you can guess) due to 
pattern nesting, and to self- or mutual recursion. When unstoring a pattern's 
tokens, be it after validation or to backtrack, there may be a higher-level 
pattern still under match trial, which also requires lookahead.
If the top pattern is a choice, and some of its sub-patterns are composite, 
then the whole stream of matched tokens needs to be kept in the buffer until 
parsing of the whole source succeeds or fails. I guess.
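
(To make this concrete, here is a rough, incomplete sketch of what I have in
mind; LexemeBuffer, store and unstore are my own names, and the match functions
and the fetching of new lexemes from the lexer are left out:)

import std.stdio;

struct LexemeBuffer(Lexeme)
{
    Lexeme[] buffer;   // matched lexemes kept around for possible backtracking
    size_t   cursor;   // index of the current lexeme in the buffer
    size_t[] marks;    // stack of cursors, one per nested store()

    // Record the current position; matching of a choice sub-pattern starts here.
    void store() { marks ~= cursor; }

    // End matching of that sub-pattern:
    //   success -> the mark is simply dropped (the tokens are validated);
    //   failure -> the cursor is reset to the mark, i.e. we backtrack.
    void unstore(bool success)
    {
        auto mark = marks[$ - 1];
        marks = marks[0 .. $ - 1];
        if (!success)
            cursor = mark;
        // Once no mark remains, everything before the cursor can be discarded.
        if (marks.length == 0)
        {
            buffer = buffer[cursor .. $];
            cursor = 0;
        }
    }
}

void main()
{
    auto lexemes = LexemeBuffer!string(["if", "(", "x", ")"]);
    lexemes.store();
    lexemes.cursor += 2;       // pretend two lexemes were matched...
    lexemes.unstore(false);    // ...then the pattern failed: backtrack
    writeln(lexemes.cursor);   // 0
}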


Does this have anything to do with currently discussed buffered input ranges? 
If yes, how does such a design, or any alternative, fit their proposed
interface?


Denis

(*) I doubt there is any great advantage to be expected from avoiding lexing 
the whole source (typically one module file) in one go. I am rather exploring 
the topic for its own interest.

(**) There may be match memoisation (e.g. so-called "packrat parsing") to avoid 
re-matching the same pattern type at the same position, but this is another 
story, and it rather happens at a higher level than individual tokens as such.
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 07 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-02-07 08:24:32 -0500, spir <denis.spir gmail.com> said:

 Does this have anything to do with currently discussed buffered input 
 ranges? If yes, how does such a design, or any alternative, fit their 
 proposed interface?
You can build all of this on top of a buffered input range. The buffered input
range is not an alternative to your complex parsing algorithm, it's just the
component managing the buffer.

Having the underlying range manage the buffer (as opposed to having your parser
algorithm do it) means that the buffer implementation can vary depending on the
type of range. For instance, if you're parsing data already in memory, the
buffered range can use that data directly as its buffer, requiring no allocation
and no copy; if you're parsing from a file or a network stream, it'll use a more
standard buffer. But how exactly the buffer is implemented does not affect your
parsing algorithm in any way. That's the great thing about it, separation of
concerns: your algorithm will work independently of the buffer implementation
used.

All your parser algorithm has to do is say "shiftFront" and "appendToFront" to
control what's available in the buffer.
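
To make that concrete, here's a very rough sketch of such a range over data
already in memory (the exact names and signatures of the proposed interface may
end up different; this only illustrates shiftFront/appendToFront controlling
the exposed window):

import std.algorithm : min;
import std.stdio;

struct MemoryBuffered(T)
{
    T[] data;        // the whole input, already in memory
    size_t lo, hi;   // the window data[lo .. hi] is what front exposes

    @property T[] front() { return data[lo .. hi]; }
    @property bool empty() { return lo == data.length; }

    // Discard the first n elements of the window: they have been consumed.
    void shiftFront(size_t n) { assert(n <= hi - lo); lo += n; }

    // Make (up to) n more elements available at the end of the window.
    void appendToFront(size_t n) { hi = min(hi + n, data.length); }
}

void main()
{
    auto r = MemoryBuffered!char("hello world".dup);
    r.appendToFront(5);
    writeln(r.front);    // hello
    r.shiftFront(5);
    r.appendToFront(6);
    writeln(r.front);    //  world
}

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/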
Feb 07 2011
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/7/11 9:40 AM, Michel Fortin wrote:
 On 2011-02-07 08:24:32 -0500, spir <denis.spir gmail.com> said:

 Does this have anything to do with currently discussed buffered input
 ranges? If yes, how does such a design, or any alternative, fit their
 proposed interface?
 You can build all of this on top of a buffered input range. The buffered input
 range is not an alternative to your complex parsing algorithm, it's just the
 component managing the buffer.

 Having the underlying range manage the buffer (as opposed to having your parser
 algorithm do it) means that the buffer implementation can vary depending on the
 type of range. For instance, if you're parsing data already in memory, the
 buffered range can use that data directly as its buffer, requiring no
 allocation and no copy; if you're parsing from a file or a network stream,
 it'll use a more standard buffer. But how exactly the buffer is implemented
 does not affect your parsing algorithm in any way. That's the great thing about
 it, separation of concerns: your algorithm will work independently of the
 buffer implementation used.

 All your parser algorithm has to do is say "shiftFront" and "appendToFront" to
 control what's available in the buffer.

Very well put, thanks.

Andrei
Feb 07 2011
prev sibling parent reply spir <denis.spir gmail.com> writes:
On 02/07/2011 03:40 PM, Michel Fortin wrote:
 On 2011-02-07 08:24:32 -0500, spir <denis.spir gmail.com> said:

 Does this have anything to do with currently discussed buffered input ranges?
 If yes, how does such a design, or any alternative, fit their proposed
 interface?
 You can build all of this on top of a buffered input range. The buffered input
 range is not an alternative to your complex parsing algorithm, it's just the
 component managing the buffer.

 Having the underlying range manage the buffer (as opposed to having your parser
 algorithm do it) means that the buffer implementation can vary depending on the
 type of range. For instance, if you're parsing data already in memory, the
 buffered range can use that data directly as its buffer, requiring no
 allocation and no copy; if you're parsing from a file or a network stream,
 it'll use a more standard buffer. But how exactly the buffer is implemented
 does not affect your parsing algorithm in any way. That's the great thing about
 it, separation of concerns: your algorithm will work independently of the
 buffer implementation used.

 All your parser algorithm has to do is say "shiftFront" and "appendToFront" to
 control what's available in the buffer.

Right, that's nice. But I was certainly unclear. My question was in fact how to
make the lexeme stream itself be a buffered range, so that, precisely, the
parser does not have to care about buffering (or rather cares about it in the
standard way clients will use to deal with buffered ranges once they are
standardised).

In my hypothetical design, store() and unstore() are commands used to manage
buffering, thus similar in purpose to, but different from, what Andrei's
proposal provides. They just match the way I see a parser's needs, naively.
Now, how to translate this design to make the lexeme stream a buffered range is
very unclear to me; the needed operations do not directly translate to
appendToFront / shiftFront (unless I misunderstand their action). I'm not even
sure whether it would simply be correct to use a buffered range.

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 07 2011
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-02-07 13:25:53 -0500, spir <denis.spir gmail.com> said:

 On 02/07/2011 03:40 PM, Michel Fortin wrote:
 On 2011-02-07 08:24:32 -0500, spir <denis.spir gmail.com> said:
 
 Does this have anything to do with currently discussed buffered input ranges?
 If yes, how does such a design, or any alternative, fit their proposed
 interface?
 You can build all of this on top of a buffered input range. The buffered input
 range is not an alternative to your complex parsing algorithm, it's just the
 component managing the buffer.

 Having the underlying range manage the buffer (as opposed to having your parser
 algorithm do it) means that the buffer implementation can vary depending on the
 type of range. For instance, if you're parsing data already in memory, the
 buffered range can use that data directly as its buffer, requiring no
 allocation and no copy; if you're parsing from a file or a network stream,
 it'll use a more standard buffer. But how exactly the buffer is implemented
 does not affect your parsing algorithm in any way. That's the great thing about
 it, separation of concerns: your algorithm will work independently of the
 buffer implementation used.

 All your parser algorithm has to do is say "shiftFront" and "appendToFront" to
 control what's available in the buffer.
 Right, that's nice. But I was certainly unclear. My question was in fact how to
 make the lexeme stream itself be a buffered range, so that, precisely, the
 parser does not have to care about buffering (or rather cares about it in the
 standard way clients will use to deal with buffered ranges once they are
 standardised).

 In my hypothetical design, store() and unstore() are commands used to manage
 buffering, thus similar in purpose to, but different from, what Andrei's
 proposal provides. They just match the way I see a parser's needs, naively.
 Now, how to translate this design to make the lexeme stream a buffered range
 is very unclear to me; the needed operations do not directly translate to
 appendToFront / shiftFront (unless I misunderstand their action). I'm not even
 sure whether it would simply be correct to use a buffered range.

You'd probably need a wrapper around the buffered range to handle backtracking
nicely. Something like this:

import std.range : ElementType;

struct Backtracker(R)
{
    R bufRange;
    size_t backtrackLength;

    // Expose the buffer minus the elements backtracked over.
    @property ElementType!R front() { return bufRange.front[0 .. $ - backtrackLength]; }

    void shiftFront(size_t n) { bufRange.shiftFront(n); }

    // Backtrack over the last n elements currently exposed.
    void backtrackFront(size_t n) { backtrackLength += n; }

    void appendToFront(size_t n)
    {
        if (n <= backtrackLength)
            backtrackLength -= n;  // the request is covered by elements we already hold
        else
        {
            bufRange.appendToFront(n - backtrackLength);
            backtrackLength = 0;
        }
    }

    void popFront()
    {
        assert(backtrackLength == 0,
               "mixing backtracking with by-chunk iteration not implemented");
        bufRange.popFront();
    }
}

Perhaps this "backtrackFront()" function could be an optional part of the
standard buffered range interface. If an algorithm requires a backtracking
buffered input range and the input range you're provided with does not support
backtracking, you can use a wrapper like the one drafted above to make the
input acceptable to the algorithm.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 07 2011