
digitalmars.D - buffered input (examples?)

reply spir <denis.spir gmail.com> writes:
Hello,


Andrei Alexandrescu wrote:
"Generally laziness may be a tactics you could use to help with memory use. A 
good example is split() vs. splitter(). The "er" version offers one element at 
a time thus never forcing an allocation. The split() version must do all work 
upfront and also allocate a structure for depositing the output."
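
(A small aside, to check I read this right -- with Phobos' split and splitter
the difference is roughly the following: split allocates the whole result
array up front, while splitter just walks the input:)

import std.algorithm : splitter;
import std.array : split;
import std.stdio;

void main()
{
    string text = "one two three";
    string[] all = split(text);        // eager: the whole array is allocated up front
    auto words = splitter(text, ' ');  // lazy: yields one slice at a time, no allocation
    foreach (w; words) writeln(w);
}
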
-------
"Let's restart this. What do you want to do that you can't do with the proposed 
API?"


First, I would love to read about some typical or sample uses of buffered input 
(hopefully clearly explained), especially ones showing why & how buffering is a 
true (theoretical or concrete) gain there, and why lower-level control over 
buffering is helpful, if not required.
Here is a (real & current) use case for which I wonder whether it matches the 
intended use of buffered input ranges: the interface between a lexer and a 
parser. (This is the reason why I have followed this thread closely, while not 
knowing much about the domain.)

Currently, my lexing toolkit constructs lexers that generate a complete lexeme 
stream in one go. How can I make them work lazily, meaning on demand from the 
parser? (*) Here is the way I see the problem, and how to solve it naively:

This is simple for the subset of language grammars in which --at every syntactic 
level-- the first element (e.g. one lexeme) unambiguously determines which 
syntactic construct may be expected; just like, at the morphological level (for 
most programming languages), the first character unambiguously determines the 
lexeme type. (Hope I'm clear.) In this case, we only need to read one 
token/lexeme at a time: say, the buffer is a single token.
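
(For illustration, a toy sketch of that single-lexeme dispatch; the lexeme type
and all the names here are invented for the example, not taken from any existing
library:)

import std.stdio;

// Invented lexeme kinds, just for the example.
enum LexType { number, ident, lparen, eof }
struct Lexeme { LexType type; string text; }

// With a one-lexeme buffer, the type of the current lexeme alone decides
// which construct to try next, so no further lookahead is needed.
string constructFor(Lexeme current)
{
    final switch (current.type)
    {
        case LexType.number: return "literal expression";
        case LexType.ident:  return "name or call";
        case LexType.lparen: return "parenthesised expression";
        case LexType.eof:    return "end of input";
    }
}

void main()
{
    writeln(constructFor(Lexeme(LexType.ident, "foo")));  // name or call
}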

In the general case, things are somewhat more complicated. I won't go into the 
details (see e.g. http://en.wikipedia.org/wiki/Lookahead & 
http://en.wikipedia.org/wiki/Backtracking) of how/why, but we need to be able 
to look ahead & to backtrack when a choice pattern's sub-patterns are composite. 
Not only that, but --if I do not fool myself-- the amount of lookahead required 
can in some cases be unbounded (due to recursion), and in some extreme cases we 
may need to backtrack over the whole of what has been matched so far (if the 
top pattern is a choice). This seems to mean an unlimited buffer.

Thus, here is how I see the lexer <--> parser interface:
* One or more match function(s), mainly checking the current lexeme's type, 
possibly returning some more info or performing some useful actions.
* A store() command: on the first call, start recording matched lexemes into 
the buffer; push the current cursor in the buffer onto a stack [0, i, j, c...].
* An unstore() command: backtrack, meaning discard buffer[c..$]; pop the last 
cursor c.

This is to be used the following way:
1. When starting to match a composite pattern which is a sub-pattern of a 
choice, meaning we need lookahead and may need backtracking, call store().
2. When matching this pattern ends, whether by success or failure, call 
unstore().
     ~ success: the matched tokens are validated and have produced one or more
       AST node(s), thus we do not need them in the buffer anymore.
     ~ failure: we need to backtrack to where matching this pattern started,
       to try matching an alternative pattern (**)

The possible need to store in the buffer more than the current composite 
pattern's tokens, and to keep a stack of cursors, is (as you can guess) due to 
pattern nesting, and to self- or mutual recursion. When unstoring a pattern's 
tokens, be it after validation or to backtrack, there may be a higher-level 
pattern still under match trial, which also requires lookahead.
If the top pattern is a choice, and some of its sub-patterns are composite, 
then the whole stream of matched tokens needs to be kept in the buffer until 
parsing of the whole source succeeds or fails. I guess.
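
(To make this concrete, here is a rough, incomplete sketch of what I have in
mind; LexemeBuffer, store and unstore are my own names, and the match functions
and the fetching of new lexemes from the lexer are left out:)

import std.stdio;

struct LexemeBuffer(Lexeme)
{
    Lexeme[] buffer;   // matched lexemes kept around for possible backtracking
    size_t   cursor;   // index of the current lexeme in the buffer
    size_t[] marks;    // stack of cursors, one per nested store()

    // Record the current position; matching of a choice sub-pattern starts here.
    void store() { marks ~= cursor; }

    // End matching of that sub-pattern:
    //   success -> the mark is simply dropped (the tokens are validated);
    //   failure -> the cursor is reset to the mark, i.e. we backtrack.
    void unstore(bool success)
    {
        auto mark = marks[$ - 1];
        marks = marks[0 .. $ - 1];
        if (!success)
            cursor = mark;
        // Once no mark remains, everything before the cursor can be discarded.
        if (marks.length == 0)
        {
            buffer = buffer[cursor .. $];
            cursor = 0;
        }
    }
}

void main()
{
    auto lexemes = LexemeBuffer!string(["if", "(", "x", ")"]);
    lexemes.store();
    lexemes.cursor += 2;       // pretend two lexemes were matched...
    lexemes.unstore(false);    // ...then the pattern failed: backtrack
    writeln(lexemes.cursor);   // 0
}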


Does this have anything to do with currently discussed buffered input ranges? 
If yes, how does such a design, or any alternative, fit their proposed
interface?


Denis

(*) I doubt there is any great advantage to be expected from avoiding lexing 
the whole source (typically one module file) in one go. I am rather exploring 
the topic for its own interest.

(**) There may be match memoisation (e.g. so-called "packrat parsing") to avoid 
re-matching the same pattern type at the same position, but this is another 
story, and it rather happens at a higher level than individual tokens as such.
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 07 2011
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2011-02-07 08:24:32 -0500, spir <denis.spir gmail.com> said:

 Does this have anything to do with currently discussed buffered input 
 ranges? If yes, how does such a design, or any alternative, fit their 
 proposed interface?
You can build all of this on top of a buffered input range. The buffered input
range is not an alternative to your complex parsing algorithm, it's just the
component managing the buffer.

Having the underlying range manage the buffer (as opposed to having your parser
algorithm do it) means that the buffer implementation can vary depending on the
type of range. For instance, if you're parsing data already in memory, the
buffered range can use that data directly as its buffer, requiring no allocation
and no copy; if you're parsing from a file or a network stream, it'll use a more
standard buffer. But how exactly the buffer is implemented does not affect your
parsing algorithm in any way. That's the great thing about it, separation of
concerns: your algorithm will work independently of the buffer implementation
used.

All your parser algorithm has to do is say "shiftFront" and "appendToFront" to
control what's available in the buffer.
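
To make that concrete, here's a very rough sketch of such a range over data
already in memory (the exact names and signatures of the proposed interface may
end up different; this only illustrates shiftFront/appendToFront controlling
the exposed window):

import std.algorithm : min;
import std.stdio;

struct MemoryBuffered(T)
{
    T[] data;        // the whole input, already in memory
    size_t lo, hi;   // the window data[lo .. hi] is what front exposes

    @property T[] front() { return data[lo .. hi]; }
    @property bool empty() { return lo == data.length; }

    // Discard the first n elements of the window: they have been consumed.
    void shiftFront(size_t n) { assert(n <= hi - lo); lo += n; }

    // Make (up to) n more elements available at the end of the window.
    void appendToFront(size_t n) { hi = min(hi + n, data.length); }
}

void main()
{
    auto r = MemoryBuffered!char("hello world".dup);
    r.appendToFront(5);
    writeln(r.front);    // hello
    r.shiftFront(5);
    r.appendToFront(6);
    writeln(r.front);    //  world
}

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/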
Feb 07 2011
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/7/11 9:40 AM, Michel Fortin wrote:
 On 2011-02-07 08:24:32 -0500, spir <denis.spir gmail.com> said:

 Does this have anything to do with currently discussed buffered input
 ranges? If yes, how does such a design, or any alternative, fit their
 proposed interface?
 You can build all of this on top of a buffered input range. The buffered input
 range is not an alternative to your complex parsing algorithm, it's just the
 component managing the buffer.

 Having the underlying range manage the buffer (as opposed to having your parser
 algorithm do it) means that the buffer implementation can vary depending on the
 type of range. For instance, if you're parsing data already in memory, the
 buffered range can use that data directly as its buffer, requiring no
 allocation and no copy; if you're parsing from a file or a network stream,
 it'll use a more standard buffer. But how exactly the buffer is implemented
 does not affect your parsing algorithm in any way. That's the great thing about
 it, separation of concerns: your algorithm will work independently of the
 buffer implementation used.

 All your parser algorithm has to do is say "shiftFront" and "appendToFront" to
 control what's available in the buffer.

Very well put, thanks.

Andrei
Feb 07 2011
prev sibling parent reply spir <denis.spir gmail.com> writes:
On 02/07/2011 03:40 PM, Michel Fortin wrote:
 On 2011-02-07 08:24:32 -0500, spir <denis.spir gmail.com> said:

 Does this have anything to do with currently discussed buffered input ranges?
 If yes, how does such a design, or any alternative, fit their proposed
 interface?
 You can build all of this on top of a buffered input range. The buffered input
 range is not an alternative to your complex parsing algorithm, it's just the
 component managing the buffer.

 Having the underlying range manage the buffer (as opposed to having your parser
 algorithm do it) means that the buffer implementation can vary depending on the
 type of range. For instance, if you're parsing data already in memory, the
 buffered range can use that data directly as its buffer, requiring no
 allocation and no copy; if you're parsing from a file or a network stream,
 it'll use a more standard buffer. But how exactly the buffer is implemented
 does not affect your parsing algorithm in any way. That's the great thing about
 it, separation of concerns: your algorithm will work independently of the
 buffer implementation used.

 All your parser algorithm has to do is say "shiftFront" and "appendToFront" to
 control what's available in the buffer.

Right, that's nice. But I was certainly unclear. My question was in fact how to
make the lexeme stream itself be a buffered range, so that, precisely, the
parser does not have to care about buffering (or rather cares about it in the
standard way clients will use to deal with buffered ranges once they are
standardised).

In my hypothetical design, store() and unstore() are commands used to manage
buffering, thus similar in purpose to, but different from, what Andrei's
proposal provides. They just match the way I see a parser's needs, naively.
Now, how to translate this design to make the lexeme stream a buffered range is
very unclear to me; the needed operations do not directly translate to
appendToFront / shiftFront (unless I misunderstand their action). I'm not even
sure whether it would simply be correct to use a buffered range.

Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 07 2011
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2011-02-07 13:25:53 -0500, spir <denis.spir gmail.com> said:

 On 02/07/2011 03:40 PM, Michel Fortin wrote:
 On 2011-02-07 08:24:32 -0500, spir <denis.spir gmail.com> said:
 
 Does this have anything to do with currently discussed buffered input ranges?
 If yes, how does such a design, or any alternative, fit their proposed
 interface?
 You can build all of this on top of a buffered input range. The buffered input
 range is not an alternative to your complex parsing algorithm, it's just the
 component managing the buffer.

 Having the underlying range manage the buffer (as opposed to having your parser
 algorithm do it) means that the buffer implementation can vary depending on the
 type of range. For instance, if you're parsing data already in memory, the
 buffered range can use that data directly as its buffer, requiring no
 allocation and no copy; if you're parsing from a file or a network stream,
 it'll use a more standard buffer. But how exactly the buffer is implemented
 does not affect your parsing algorithm in any way. That's the great thing about
 it, separation of concerns: your algorithm will work independently of the
 buffer implementation used.

 All your parser algorithm has to do is say "shiftFront" and "appendToFront" to
 control what's available in the buffer.
 Right, that's nice. But I was certainly unclear. My question was in fact how to
 make the lexeme stream itself be a buffered range, so that, precisely, the
 parser does not have to care about buffering (or rather cares about it in the
 standard way clients will use to deal with buffered ranges once they are
 standardised).

 In my hypothetical design, store() and unstore() are commands used to manage
 buffering, thus similar in purpose to, but different from, what Andrei's
 proposal provides. They just match the way I see a parser's needs, naively.
 Now, how to translate this design to make the lexeme stream a buffered range
 is very unclear to me; the needed operations do not directly translate to
 appendToFront / shiftFront (unless I misunderstand their action). I'm not even
 sure whether it would simply be correct to use a buffered range.

You'd probably need a wrapper around the buffered range to handle backtracking
nicely. Something like this:

import std.range : ElementType;

struct Backtracker(R)
{
    R bufRange;
    size_t backtrackLength;

    // Expose the buffer minus the elements backtracked over.
    @property ElementType!R front() { return bufRange.front[0 .. $ - backtrackLength]; }

    void shiftFront(size_t n) { bufRange.shiftFront(n); }

    // Backtrack over the last n elements currently exposed.
    void backtrackFront(size_t n) { backtrackLength += n; }

    void appendToFront(size_t n)
    {
        if (n <= backtrackLength)
            backtrackLength -= n;  // the request is covered by elements we already hold
        else
        {
            bufRange.appendToFront(n - backtrackLength);
            backtrackLength = 0;
        }
    }

    void popFront()
    {
        assert(backtrackLength == 0,
               "mixing backtracking with by-chunk iteration not implemented");
        bufRange.popFront();
    }
}

Perhaps this "backtrackFront()" function could be an optional part of the
standard buffered range interface. If an algorithm requires a backtracking
buffered input range and the input range you're provided with does not support
backtracking, you can use a wrapper like the one drafted above to make the
input acceptable to the algorithm.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 07 2011