www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - PEG matching/parsing lib in progress

Hello,


I'm currently writing a PEG matching/parsing library in and for D. This is =
rather to learn the language better, but it will be published anyway. (Tell=
 me whether there are already, could not find any.)
One key characteristic is: instead of using a meta-grammar (an enhanced dia=
lect of PEG) to write a grammar which then needs to be translated into a pa=
rser, one defines the parser directly in D, using Pattern types like Litera=
l, Choice, Option, etc. For instance:
    auto sign =3D new Option(new Klass("+-"));
    auto digits =3D new String(new Klass("0-9"));
    auto integer =3D new Tuple(sign, digits);
    writeln(integer.match("-321"));
    // should write: "integer:(sign:- digits:321)"
In other words: the grammar is the parser. A big advantage is one needs not=
 learn and master a new langage. A big drawback is the grammar is verbose -=
- but that's ok for me.
Another advantage is one can often express the grammar more elegantly, or s=
olve difficult issues, by using D features inside the grammar:
    lDelim =3D Character("<"); lDelim =3D Character(">");
    Node tagSection(name) {return Tuple(lDelim, Literal(name), rDelim);)}
    boldSection =3D tagSection("b");
(This defines a kind of super-pattern. Also solves the problem that tags mu=
st match -- no need to keep a stack of open tags.)

I'm facing a 2 issues, 1 about design, 1 about implementation.



First, you may find it weird that I call "Tuple" what is usually named "Seq=
uence". Actually, a sequence of predefined elements in text, like above, co=
rrespond to tuple fields. The result is a composite node, each child of whi=
ch is a node matched by a sub-pattern. Hope you see what I mean.
"Sequence" would be misleading: in fact, repetitive patterns yield sequence=
s (arrays) of nodes. For instance:
    digits =3D OneOrMore(digit);
    writeln(digits.match("321"));
    // should write: "digits:[digit:3 digit:2 digit:1]"

For this reason, a TupleNode presently holds an associative array of childr=
en nodes, each indexed by the pattern that yielded it. I find this rather s=
ensible, and rich in terms of semantics (but see also below). As a result, =
when traversing an AST, one can access tuple node fields by key (instead of=
 by index number), which is imo nicer and more expressive. What do you thin=
k?

One issue happens when several fields of the tuple actually are of the same=
 kind:
    addition =3D new Tuple(operand, PLUS, operand);
In fact, the pattern operand here determines the format of each operand wit=
hout telling apart their distinct *roles*. This issue breaks the above dsig=
n, since both operands would go to the same entry of the associative array.=
 The user must thus write eg:
    rightOperand =3D leftOperand.copy();
    addition =3D new Tuple(leftOperand, PLUS, rightOperand);

So, I need to make a choice:
    1. Keep this design, which forces users to copy patterns in such cases.
    2. Let it down, make nodes generated by Tuple patterns hold a plain arr=
ay of chil nodes.
    3. Have tuple nodes hold both an plain array and an associative array.
    4. ???
I'm currently for option 1., because it make the grammar clearer, even if i=
t requires one more line from the user.



Second, I would like node output (toString) to show pattern *names*, like i=
n the examples.
Do you find this useful? In my practice, it seems a priceles feature, even =
if it makes output more verbose. It especially helps a lot in setting up co=
rrect grammars, commonly a difficult task. Unnamed results miss semantic in=
formation. What do you think?
Nodes presently hold a ref to their pattern, right. But I cannot find a way=
 for the patterns to know their own names ;-) I do not want the user to hav=
e to name them manually, since they already give the needed information in =
definitions. But the said info seems inaccessible at runtime.

One apparent solution would be write the grammar itself in an associative a=
rray of (name:pattern) pairs, then traverse it to set the names on the patt=
erns themselves. But this does not work because hogher-level patterns need =
to reference lower-level ones.
Another approach would be to explore the current scope (symbol table). That=
's what I do win dynamic languages. Is this possible in D?
Can you imagine another solution? (Note: .stringof is not a solution, I gue=
ss, since this.stringof yields "this"!)

This feature is also highly useful when writing out patterns themselves: su=
bpatterns are called by their names. For instance, integer shows as "(sign =
digits)". The only alternative is to fully expand subpatterns, and subsubpa=
ttern.., which quickly leads to unreadable mess like regexes. The same adva=
ntage helps in having clear match failure error messages: the failing patte=
rn shows in a readible manner.
It would also give an alternative for tuple nodes: keys could be pattern na=
mes instead of patterns themselves; which is more flexible since a tree can=
 then be traversed by a routine lying in a scope where pattern definitions =
are not available (eg the parser can be written in a module).


Thank you,
Denis
-- -- -- -- -- -- --
vit esse estrany =E2=98=A3

spir.wikidot.com
Nov 08 2010