digitalmars.D - buffered input (examples?)
- spir (69/69) Feb 07 2011 Hello,
- Michel Fortin (21/24) Feb 07 2011 You can build all of this on top of a buffered input range. The
- Andrei Alexandrescu (3/21) Feb 07 2011 Very well put, thanks.
- spir (17/35) Feb 07 2011 Right, that's nice. But I was certainly unclear. My question was in fact...
- Michel Fortin (39/76) Feb 07 2011 You'd probably need a wrapper around the buffered range to handle
Hello,

Andrei Alexandrescu wrote:

> "Generally laziness may be a tactics you could use to help with memory
> use. A good example is split() vs. splitter(). The "er" version offers
> one element at a time thus never forcing an allocation. The split()
> version must do all work upfront and also allocate a structure for
> depositing the output."

> "Let's restart this. What do you want to do that you can't do with the
> proposed API?"

First, I would love to read about some typical or sample uses of buffered input (hopefully clearly explained); especially showing why & how buffering is a true (theoretical, concrete) gain there, and why lower-level control over buffering is helpful, if not required.

Here is a (real & current) use case about which I wonder whether it matches the intended use of buffered input ranges: the interface between a lexer and a parser. (This is the reason why I followed this thread closely, while not knowing much in the domain.) Currently, my lexing toolkit constructs lexers that generate a complete lexeme stream in one go. How to make it work lazily, meaning on demand from the parser? (*)

Here is the way I see the problem, and how to solve it naively. This is simple for the subset of language grammars for which --at every syntactic level-- the first element (eg 1 lexeme) univocally determines which syntactic construct may be expected; just like, at the morphological level (for most prog langs), the first character univocally determines the lexeme type. (Hope I'm clear.) In this case, we only need to read one token/lexeme at a time: say, the buffer is a single token.

In the general case, things are somewhat more complicated. I won't enter into the details (see eg http://en.wikipedia.org/wiki/Lookahead & http://en.wikipedia.org/wiki/Backtracking) of how/why, but we need to be able to look ahead & to backtrack when a choice pattern's sub-patterns are composite.
Not only that, but --if I do not fool myself-- the amount of lookahead required can in some cases be unbounded (due to recursion), and we may in some extreme cases need to backtrack over the whole of what has been matched so far (if the top pattern is a choice). This seems to mean an unlimited buffer.

Thus, here is how I see the lexer <--> parser interface:

* One or more match function(s), mainly checking the current lexeme's type, possibly giving some more info or performing some useful actions.
* store() command: on the first call, start recording matched lexemes into a buffer; push the current cursor position onto a stack [0, i, j, c...].
* unstore() command: backtrack, meaning discard buffer[c..$]; pop the last cursor c.

This is to be used the following way:

1. When starting to match a composite pattern which is a sub-pattern of a choice, meaning we need lookahead and may need backtracking, call store().
2. When matching this pattern ends, either by success or failure, call unstore().
   ~ success: matched tokens are validated and have produced one or more AST node(s), thus we do not need them in the buffer anymore.
   ~ failure: we need to backtrack to where matching this pattern started, to try matching an alternative pattern (**).

The possible necessity of storing in the buffer more than the current composite pattern's tokens, and of having a stack of cursors, is (as you guess) due to pattern nesting, and to self- or mutual recursion. When unstoring a pattern's tokens, be it after validation or to backtrack, there may be a higher-level pattern under match trial which also requires lookahead. If the top pattern is a choice, and some of its sub-patterns are composite, then the whole stream of matched tokens needs to be stored in the buffer until parsing of the whole source is validated or fails. I guess.

Does this have anything to do with the currently discussed buffered input ranges? If yes, how does such a design, or any alternative, fit their proposed interface?
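Concretely, a rough sketch of this store()/unstore() machinery could look like the following (all names, and the success flag passed to unstore(), are illustrative only, not from any existing library; lexemes are assumed to be appended to buf elsewhere as the lexer produces them):

```d
struct TokenBuffer(Token)
{
    Token[] buf;      // retained lexemes (appended as the lexer runs)
    size_t cursor;    // index of the next lexeme to consume
    size_t[] marks;   // stack of cursor positions pushed by store()

    // Begin a lookahead region: remember where we are.
    void store()
    {
        marks ~= cursor;
    }

    // End a lookahead region.
    //  failure: backtrack the cursor to the last mark.
    //  success: drop the mark; once no outer mark remains,
    //           discard the validated lexemes from the buffer.
    void unstore(bool success)
    {
        auto mark = marks[$ - 1];
        marks = marks[0 .. $ - 1];
        if (!success)
            cursor = mark;
        else if (marks.length == 0)
        {
            buf = buf[cursor .. $];
            cursor = 0;
        }
    }
}

unittest
{
    TokenBuffer!string tb;
    tb.buf = ["a", "b", "c"];

    tb.store();
    tb.cursor = 2;        // pretend we matched "a" and "b"
    tb.unstore(false);    // alternative failed: backtrack
    assert(tb.cursor == 0 && tb.buf.length == 3);

    tb.store();
    tb.cursor = 2;
    tb.unstore(true);     // validated: consumed lexemes are dropped
    assert(tb.buf == ["c"] && tb.cursor == 0);
}
```

The stack of marks is what makes nested choice patterns work: an inner unstore() only moves the cursor, and the buffer itself shrinks only when the outermost lookahead region is validated.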
Denis

(*) I doubt there is any great advantage to be expected from avoiding lexing the whole source in one go (typically one module file). I rather explore the topic for its own interest.

(**) There may be match memoisation (eg so-called "packrat parsing") to avoid re-matching the same pattern type at the same position, but this is another story, and it rather happens at a higher level than individual tokens as such.

--
_________________
vita es estrany
spir.wikidot.com
Feb 07 2011
On 2011-02-07 08:24:32 -0500, spir <denis.spir gmail.com> said:

> Does this have anything to do with currently discussed buffered input
> ranges? If yes, how does such a design, or any alternative, fit their
> proposed interface?

You can build all of this on top of a buffered input range. The buffered input range is not an alternative to your complex parsing algorithm, it's just the component managing the buffer.

Having the underlying range manage the buffer (as opposed to having your parser algorithm do it) means that the buffer implementation can vary depending on the type of range. For instance, if you're parsing data that is already in memory, the buffered range can use that data directly as its buffer, requiring no allocation and no copy; if you're parsing from a file or a network stream it'll use a more standard buffer. But how exactly the buffer is implemented does not affect your parsing algorithm in any way. That's the great thing about it, separation of concerns: your algorithm will work independently of the buffer implementation used.

All your parser algorithm has to do is say "shiftFront" and "appendToFront" to control what's available in the buffer.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 07 2011
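Michel's in-memory point can be made concrete with a small sketch. The following is a hypothetical array-backed buffered range (illustrative only, not the actual proposed Phobos API): front is simply a window into the original array, so appendToFront and shiftFront move indices without allocating or copying anything.

```d
struct ArrayBufferedRange(T)
{
    T[] data;       // the underlying in-memory data
    size_t lo, hi;  // current window: data[lo .. hi]

    @property T[] front() { return data[lo .. hi]; }
    @property bool empty() { return lo == data.length; }

    // Grow the window by n elements (clamped to the available data).
    void appendToFront(size_t n)
    {
        hi = (data.length - hi >= n) ? hi + n : data.length;
    }

    // Discard n elements from the start of the window.
    void shiftFront(size_t n)
    {
        assert(n <= hi - lo);
        lo += n;
    }
}

unittest
{
    auto r = ArrayBufferedRange!int([10, 20, 30, 40]);
    r.appendToFront(2);
    assert(r.front == [10, 20]);
    r.shiftFront(1);
    assert(r.front == [20]);
    r.appendToFront(2);
    assert(r.front == [20, 30, 40]);
}
```

A file- or socket-backed variant would refill an internal buffer inside appendToFront instead of sliding indices, but code written against this two-method interface would not change, which is the separation of concerns Michel describes.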
On 2/7/11 9:40 AM, Michel Fortin wrote:

> You can build all of this on top of a buffered input range. The
> buffered input range is not an alternative to your complex parsing
> algorithm, it's just the component managing the buffer. [...]
> All your parser algorithm has to do is say "shiftFront" and
> "appendToFront" to control what's available in the buffer.

Very well put, thanks.

Andrei
Feb 07 2011
On 02/07/2011 03:40 PM, Michel Fortin wrote:

> You can build all of this on top of a buffered input range. [...]
> All your parser algorithm has to do is say "shiftFront" and
> "appendToFront" to control what's available in the buffer.

Right, that's nice. But I was certainly unclear. My question was in fact how to make the lexeme stream itself a buffered range, so that, precisely, the parser does not have to deal with buffering (or rather deals with it in the standard way clients will use once buffered ranges are standardised).

In my hypothetical design, store() and unstore() are commands used to manage buffering, thus similar in purpose to, but different from, what Andrei's proposal provides. They just match the way I see a parser's needs, naively. Now, how to translate this design so as to make the lexeme stream a buffered range is very unclear to me; the needed operations do not directly translate to appendToFront / shiftFront (unless I do not understand their action). I'm not even sure whether it would simply be correct to use a buffered range.

Denis

--
_________________
vita es estrany
spir.wikidot.com
Feb 07 2011
On 2011-02-07 13:25:53 -0500, spir <denis.spir gmail.com> said:

> In my hypothetical design, store() and unstore() are commands used to
> manage buffering [...] Now, how to translate this design to make the
> lexeme stream a buffered range is very unclear for me; needed
> operations do not directly translate to appendToFront / shiftFront
> (unless I do not understand their action). I'm not even sure whether
> it would be simply correct to use a buffered range.

You'd probably need a wrapper around the buffered range to handle backtracking nicely. Something like this:

    struct Backtracker(R)
    {
        R bufRange;
        size_t backtrackLength;

        // The visible window: the underlying front minus whatever
        // has been backtracked over.
        @property auto front()
        {
            return bufRange.front[0 .. $ - backtrackLength];
        }

        void shiftFront(size_t n)
        {
            bufRange.shiftFront(n);
        }

        // Hide the last n elements of the window again.
        void backtrackFront(size_t n)
        {
            backtrackLength += n;
        }

        void appendToFront(size_t n)
        {
            if (backtrackLength >= n)
            {
                // Backtracked elements are still buffered:
                // just re-expose them, no fetch needed.
                backtrackLength -= n;
            }
            else
            {
                bufRange.appendToFront(n - backtrackLength);
                backtrackLength = 0;
            }
        }

        void popFront()
        {
            assert(backtrackLength == 0,
                "mixing backtracking with by-chunk iteration not implemented");
            bufRange.popFront();
        }
    }

Perhaps this "backtrackFront()" function could be an optional part of the standard buffered range interface. If an algorithm requires a backtracking buffered input range and the input range you're provided with does not support backtracking, then you can use a wrapper like the one I drafted above to make the input acceptable to the algorithm.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 07 2011
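The semantics of the wrapper Michel sketches can be demonstrated in isolation. In the following self-contained toy (a window over an array standing in for the buffered range; names are illustrative, not from any library), backtrackFront() hides elements at the end of the window, and appendToFront() re-exposes hidden elements for free before fetching anything new:

```d
struct Window
{
    int[] data;
    size_t hi;              // elements fetched so far
    size_t backtrackLength; // elements hidden again by backtracking

    int[] front() { return data[0 .. hi - backtrackLength]; }

    void appendToFront(size_t n)
    {
        if (backtrackLength >= n)
            backtrackLength -= n;       // re-expose hidden elements
        else
        {
            hi += n - backtrackLength;  // fetch only the remainder
            backtrackLength = 0;
        }
    }

    void backtrackFront(size_t n) { backtrackLength += n; }
}

unittest
{
    auto w = Window([1, 2, 3, 4, 5]);
    w.appendToFront(3);
    assert(w.front == [1, 2, 3]);
    w.backtrackFront(2);            // un-consume the last two
    assert(w.front == [1]);
    w.appendToFront(1);             // re-exposed for free, no fetch
    assert(w.front == [1, 2]);
}
```

This is exactly the property a backtracking parser needs: a failed alternative backtracks cheaply, and retrying the next alternative re-reads the already-buffered lexemes instead of pulling them from the lexer again.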