digitalmars.D - PEG matching/parsing lib in progress
- spir (89/89) Nov 08 2010 Hello,
Hello, I'm currently writing a PEG matching/parsing library in and for D. This is = rather to learn the language better, but it will be published anyway. (Tell= me whether there are already, could not find any.) One key characteristic is: instead of using a meta-grammar (an enhanced dia= lect of PEG) to write a grammar which then needs to be translated into a pa= rser, one defines the parser directly in D, using Pattern types like Litera= l, Choice, Option, etc. For instance: auto sign =3D new Option(new Klass("+-")); auto digits =3D new String(new Klass("0-9")); auto integer =3D new Tuple(sign, digits); writeln(integer.match("-321")); // should write: "integer:(sign:- digits:321)" In other words: the grammar is the parser. A big advantage is one needs not= learn and master a new langage. A big drawback is the grammar is verbose -= - but that's ok for me. Another advantage is one can often express the grammar more elegantly, or s= olve difficult issues, by using D features inside the grammar: lDelim =3D Character("<"); lDelim =3D Character(">"); Node tagSection(name) {return Tuple(lDelim, Literal(name), rDelim);)} boldSection =3D tagSection("b"); (This defines a kind of super-pattern. Also solves the problem that tags mu= st match -- no need to keep a stack of open tags.) I'm facing a 2 issues, 1 about design, 1 about implementation. First, you may find it weird that I call "Tuple" what is usually named "Seq= uence". Actually, a sequence of predefined elements in text, like above, co= rrespond to tuple fields. The result is a composite node, each child of whi= ch is a node matched by a sub-pattern. Hope you see what I mean. "Sequence" would be misleading: in fact, repetitive patterns yield sequence= s (arrays) of nodes. For instance: digits =3D OneOrMore(digit); writeln(digits.match("321")); // should write: "digits:[digit:3 digit:2 digit:1]" For this reason, a TupleNode presently holds an associative array of childr= en nodes, each indexed by the pattern that yielded it. I find this rather s= ensible, and rich in terms of semantics (but see also below). As a result, = when traversing an AST, one can access tuple node fields by key (instead of= by index number), which is imo nicer and more expressive. What do you thin= k? One issue happens when several fields of the tuple actually are of the same= kind: addition =3D new Tuple(operand, PLUS, operand); In fact, the pattern operand here determines the format of each operand wit= hout telling apart their distinct *roles*. This issue breaks the above dsig= n, since both operands would go to the same entry of the associative array.= The user must thus write eg: rightOperand =3D leftOperand.copy(); addition =3D new Tuple(leftOperand, PLUS, rightOperand); So, I need to make a choice: 1. Keep this design, which forces users to copy patterns in such cases. 2. Let it down, make nodes generated by Tuple patterns hold a plain arr= ay of chil nodes. 3. Have tuple nodes hold both an plain array and an associative array. 4. ??? I'm currently for option 1., because it make the grammar clearer, even if i= t requires one more line from the user. Second, I would like node output (toString) to show pattern *names*, like i= n the examples. Do you find this useful? In my practice, it seems a priceles feature, even = if it makes output more verbose. It especially helps a lot in setting up co= rrect grammars, commonly a difficult task. Unnamed results miss semantic in= formation. What do you think? Nodes presently hold a ref to their pattern, right. But I cannot find a way= for the patterns to know their own names ;-) I do not want the user to hav= e to name them manually, since they already give the needed information in = definitions. But the said info seems inaccessible at runtime. One apparent solution would be write the grammar itself in an associative a= rray of (name:pattern) pairs, then traverse it to set the names on the patt= erns themselves. But this does not work because hogher-level patterns need = to reference lower-level ones. Another approach would be to explore the current scope (symbol table). That= 's what I do win dynamic languages. Is this possible in D? Can you imagine another solution? (Note: .stringof is not a solution, I gue= ss, since this.stringof yields "this"!) This feature is also highly useful when writing out patterns themselves: su= bpatterns are called by their names. For instance, integer shows as "(sign = digits)". The only alternative is to fully expand subpatterns, and subsubpa= ttern.., which quickly leads to unreadable mess like regexes. The same adva= ntage helps in having clear match failure error messages: the failing patte= rn shows in a readible manner. It would also give an alternative for tuple nodes: keys could be pattern na= mes instead of patterns themselves; which is more flexible since a tree can= then be traversed by a routine lying in a scope where pattern definitions = are not available (eg the parser can be written in a module). Thank you, Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Nov 08 2010