
digitalmars.D.learn - Writing a Parser

reply Dan <murpsoft hotmail.com> writes:
I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.

At the moment, I'm trying recursive descent parsing.

The problem is that I've realized I'm duplicating huge volumes of code to cope
with the tristate decision of { unexpected, allow, require } for any given
token.

For example, to consume a for loop, you consume something similar to
/for\s*\((.*?)\)\s*\{(.*?)\}/

I have it doing that, but my soul feels heavy with the masses of looped
switches it's doing.  Is there any way to ease the pain?

Regards,
Dan
Jan 04 2008
next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Dan" <murpsoft hotmail.com> wrote in message 
news:flmtrv$2jrn$1 digitalmars.com...
 I've been messing with how to write a parser, and so far I've played with 
 numerous patterns before eventually wanting to cry.

 At the moment, I'm trying recursive descent parsing.

 The problem is that I've realized I'm duplicating huge volumes of code to 
 cope with the tristate decision of { unexpected, allow, require } for any 
 given token.

 For example, to consume a for loop, you consume something similar to
 /for\s*\((.*?)\)\s*\{(.*?)\}/

 I have it doing that, but my soul feels heavy with the masses of looped 
 switches it's doing.  Is there any way to ease the pain?
Separate tokenization and syntax parsing? It makes things a hell of a lot easier. You don't even necessarily have to tokenize the source entirely before parsing; just have a lexer which lexes tokens out of the source on demand. The syntax parsing is then unencumbered from dealing with the raw source and just has to do stuff like "expect 'for', expect left-paren, expect (your condition), expect right-paren" etc.
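To make that concrete, here is a minimal D sketch (all names and details invented for illustration, not from any poster's code). The { unexpected, allow, require } tristate collapses into two helpers: accept() covers "allow", expect() covers "require", and anything else is "unexpected" and fails inside expect():

import std.exception : enforce;

enum Tok { For, LParen, RParen, LBrace, RBrace, Ident, Eof }

struct Lexer
{
    string src;
    Tok front;      // current token, lexed on demand
    string text;    // its spelling, for identifiers

    void popFront()
    {
        // skip whitespace between tokens
        while (src.length && (src[0] == ' ' || src[0] == '\t' || src[0] == '\n'))
            src = src[1 .. $];
        if (!src.length) { front = Tok.Eof; return; }
        switch (src[0])
        {
            case '(': front = Tok.LParen; src = src[1 .. $]; return;
            case ')': front = Tok.RParen; src = src[1 .. $]; return;
            case '{': front = Tok.LBrace; src = src[1 .. $]; return;
            case '}': front = Tok.RBrace; src = src[1 .. $]; return;
            default:
                size_t n;
                while (n < src.length && src[n] >= 'a' && src[n] <= 'z')
                    ++n;
                enforce(n > 0, "unexpected character");
                text = src[0 .. n];
                src = src[n .. $];
                front = text == "for" ? Tok.For : Tok.Ident;
        }
    }
}

struct Parser
{
    Lexer lex;

    // "allow": consume the token if it is there, say whether we did
    bool accept(Tok t)
    {
        if (lex.front != t) return false;
        lex.popFront();
        return true;
    }

    // "require": consume the token or fail with a syntax error
    void expect(Tok t)
    {
        enforce(accept(t), "syntax error: required token missing");
    }

    // for-loop rule: 'for' '(' header ')' '{' body '}'
    void parseFor()
    {
        expect(Tok.For);
        expect(Tok.LParen);
        while (accept(Tok.Ident)) {}    // placeholder for the real header rule
        expect(Tok.RParen);
        expect(Tok.LBrace);
        while (accept(Tok.Ident)) {}    // placeholder for the real body rule
        expect(Tok.RBrace);
    }
}

void main()
{
    auto p = Parser(Lexer("for (i) { stuff }"));
    p.lex.popFront();    // prime the first token
    p.parseFor();
}

The point is that the masses of looped switches shrink to one switch in the lexer; every grammar rule is then a flat sequence of accept/expect calls.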
Jan 04 2008
parent Alan Knowles <alan akbkhome.com> writes:
Yes, normally when doing OO patterns for this you write either

class Parser : Tokenizer { ... }

or

class Parser {
	Tokenizer reader;
	// parser code...
}

Recommended reading:
- the parser in DMD's source (not DMDScript) is quite nice - it's hand 
coded and quite easy to understand.

- this one is interesting for its use of function pointers to do 
pattern testing (though it's horribly borked in that it's really the 
tokenizer...):
http://www.dsource.org/projects/leds/browser/trunk/src/language/php/Parser.d

If you don't want a hand-coded parser, have a look at csLex - it's what 
Mono used, and is pretty trivial to modify and retarget to generate 
code in other languages (eg. D code...)

 From someone who's written far too many parsers ;)
Regards
Alan




Jan 04 2008
prev sibling next sibling parent reply Jascha Wetzel <firstname mainia.de> writes:
Dan wrote:
 I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.
 
 At the moment, I'm trying recursive descent parsing.
 
 The problem is that I've realized I'm duplicating huge volumes of code to cope
with the tristate decision of { unexpected, allow, require } for any given
token.
 
 For example, to consume a for loop, you consume something similar to
 /for\s*\((.*?)\)\s*\{(.*?)\}/
 
 I have it doing that, but my soul feels heavy with the masses of looped
switches it's doing.  Is there any way to ease the pain?
a parser generator :) writing a parser or scanner manually is a bit like writing any program in assembler - tedious, error-prone, and hard to maintain. there's a lot of stuff in a parser that can be generated automatically. even if you want to write the parser all by yourself, i'd suggest you write a simple parser generator to do that tedious part for you.
Jan 04 2008
next sibling parent 0ffh <frank youknow.what.todo.interNETz> writes:
Jascha Wetzel wrote:
 Dan wrote:
 I've been messing with how to write a parser, and so far I've played 
 with numerous patterns before eventually wanting to cry.
  [...]
 I have it doing that, but my soul feels heavy with the masses of 
 looped switches it's doing.  Is there any way to ease the pain?
a parser generator :) [...]
Or a parsing combinator library. :) And it's always nice to roll your own, if you have the extra time. Not only to have, but to know. regards, frank
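For flavor, here is a very small combinator sketch in D (purely hypothetical code, not based on any existing library): each parser is a delegate that consumes a prefix of the input and returns the rest, or null on failure; seq and choice build bigger parsers out of smaller ones.

import std.algorithm : startsWith;

// A parser is a delegate that consumes a prefix of the input and
// returns the rest, or null on failure.
alias Parser = string delegate(string input);

// match a literal string
Parser lit(string s)
{
    return (string input) => input.startsWith(s) ? input[s.length .. $] : null;
}

// run parsers in sequence; all must succeed
Parser seq(Parser[] ps...)
{
    auto parsers = ps.dup;   // typesafe variadic arrays may live on the stack
    return delegate string(string input)
    {
        foreach (p; parsers)
        {
            input = p(input);
            if (input is null) return null;
        }
        return input;
    };
}

// try alternatives in order; first success wins
Parser choice(Parser[] ps...)
{
    auto parsers = ps.dup;
    return delegate string(string input)
    {
        foreach (p; parsers)
        {
            auto rest = p(input);
            if (rest !is null) return rest;
        }
        return null;
    };
}

void main()
{
    auto abOrC = choice(seq(lit("a"), lit("b")), lit("c"));
    assert(abOrC("abx") == "x");     // consumed "ab", left "x"
    assert(abOrC("c") == "");        // consumed everything
    assert(abOrC("z") is null);      // no alternative matched
}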
Jan 05 2008
prev sibling parent reply Alan Knowles <alan akbkhome.com> writes:
I did spend some time looking at aPaGeD for this yesterday - here's some 
feedback that may help (and if you can answer some of the questions, that 
would help me a lot as well)

- It works (yes, but you would be amazed; for the life of me I could 
not get antlr to do anything - java: write once, run nowhere...) - so 
being able to generate code and working programs from the examples was 
great, and a real +100 for the project.

- Syntax - on the face of it, it looks reasonable and easy to understand - 
similar enough to antlr.
(That's the end of the really good stuff.)


- Documentation
While I know it's a pain to write, the docs you have already tend to 
focus on how the parser is built, and are biased towards someone who 
understands the internals and terminology involved in parsers, rather 
than an end user - who just wants to know "if I'm looking for this, then 
put this, and the result is available in these variables".

Specifically, I've no idea what these mean, and they are rather critical 
to the docs:
"Terminal" "Non-Terminal"

- Regexes
While I can see the benefits, I'd much rather the compiler built them 
for me - part of the beauty of the BNF format is that it's easy to 
read, and explains regex-style situations a lot better. - Otherwise, see 
below about explaining how they can deal with classic situations...

- How to handle classic situations
This is the key to success for the documentation (and what is seriously 
missing) - most people will probably have come from a lex/yacc 
background...

These are a few classic examples that the Documentation could do with.

* Top level parser starts.
Most grammars start with a top-level statement, eg.
Program:
	Statements;

In which case the application should only start by solving Statements - 
the biggest problem I found was that I had no idea how to stop it 
matching any of the condition rules (that were only relevant to a 
specific state - eg. see next example)


* Parse a string
This is a common pattern, but it's quite difficult to see how to 
implement it. -- And as above, when I tried, the parser started matching 
DoubleQuotedStringChars at the start of the file (even though it's only 
used in DoubleQuotedString).

DoubleQuotedString:
	QUOTE DoubleQuotedStringChars QUOTE;

DoubleQuotedStringChars:
	(DoubleQuotedStringChar)*

DoubleQuotedStringChar:
	"\" ANYCHAR;
	^QUOTE;
	
* Classic groupings:
	(.....)*    eg. many of these matches..
	(.....)+    eg. one or more of these matches..
	(.....)?    eg. one or none of these matches..
	(.....)=> ...    if forward lookup succeeds on (...) try match next combo.


Regards
Alan









Jan 08 2008
parent reply Jascha Wetzel <firstname mainia.de> writes:
Alan Knowles wrote:
 - It works (yes, but you would be amazed; for the life of me I could 
 not get antlr to do anything - java: write once, run nowhere...) - so 
 being able to generate code and working programs from the examples was 
 great, and a real +100 for the project.
I'm glad to hear that - although, i must admit, i was so bold as to assume that already ;)
 - Documentation
 While I know it's a pain to write, the docs you have already tend to 
 focus on how the parser is built, and are biased towards someone who 
 understands the internals and terminology involved in parsers, rather 
 than an end user - who just wants to know "if I'm looking for this, 
 then put this, and the result is available in these variables".
Yeah, i became aware of that through the feedback. Without thinking too much, i assumed that using parsers would be something you'd do only if you dealt with parsers intimately. Therefore i felt it should be sufficient to describe everything that's specific to apaged. Once i find the time, i'll extend the documentation with a thorough hands-on guide.
 Specifically I've no idea what the meanings of these are, and they are 
 rather critical to the docs....:
 "Terminal" "Non-Terminal"
A Terminal is a symbol that is "final" wrt. expansion, while a 
Non-Terminal can be expanded by some rules. Terminals, Tokens and 
Lexemes are *practically* more or less the same. In apaged, a Terminal 
is always a string or a regexp. A Non-Terminal is specified by a set of 
production rules that tell how to match a sequence of Terminals and 
Non-Terminals. In apaged, a Non-Terminal is always something looking 
like this:

MyNT {
	"a" MyOtherNT MyNT;
	"b" Nt2 { /* D Code ... */ }
}

If you think of parse trees when dealing with your grammar (if not, 
ignore the rest), Non-Terminals are the inner nodes, and Terminals are 
the leaves.
 - Regex's
 While I can see the benefit's I'd much rather the compiler built them 
 for me.. - part of the beauty of the BNF format is that it's easy to 
 read, and explains regex style situations alot better.. - Otherwise (see 
 below about explaining how they can deal with classic situations...)
Do you mean regexp-like syntax for rules (i.e. EBNF syntax)? This is a 
feature on my todo list, but i still have to find a nice way to merge 
this with the semantic code (which is non-trivial) - a thing for the 
future.

If you mean Regex's as in "asdf[a-z0-9]*", then i don't understand what 
you mean, since apaged does compile them for you. You don't need to use 
them at all, though. They're merely a convenience feature.
 - How to handle classic situations
 This is the key to success for the Documentation. (and what is seriously 
 missing) - as most people will probably have come from a lexx/yacc 
 background...
 
 These are a few classic examples that the Documentation could do with.
 
 * Top level parser starts.
 Most grammers start with a top level statement, eg.
 Program:
     Statements;
 
 In which case the application should only start by solving Statements, - 
 the biggest problem I found was that I had no idea how to stop it 
 matching any of the condition rules (that were only relivant to a 
 specific state - eg. see next example)
The docs state that "The first non-terminal in a grammar file will be treated as the start symbol for the grammar." I admit that it should be "...in the main grammar file..." since that doesn't apply for imported grammar files.
 * Parse a string
 This is a common pattern but it's quite difficult to see how to 
 implement it. -- And as above, when I tried, the parser started matching 
 DoubleQuotedStringChars at the start of the file (even though it's only 
 used in  DoubleQuotedString.
 
 DoubleQuotedString:
     QUOTE DoubleQuotedStringChars QUOTE;
 
 DoubleQuotedStringChars:
     (DoubleQuotedStringChar)*
 
 DoubleQuotedStringChar:
     "\" ANYCHAR;
     ^QUOTE;
Hm, it shouldn't do that. I'll assume that's because of how QUOTE is 
defined. I'd need to see the whole grammar.

Besides that, you use 2 unsupported EBNF features in the above grammar: 
^QUOTE and (DoubleQuotedStringChar)*. Apaged doesn't support full EBNF 
syntax for the reason mentioned above.
 * Classic groupings:
     (.....)*    eg. many of these matches..
     (.....)+    eg. one or more of these matches..
     (.....)?    eg. one or none of these matches..
     (.....)=> ...    if forward lookup succeeds on (...) try match next 
 combo.
None of those are supported. You need to write them BNF style:

A*  becomes  As { As A; epsilon; }
A+  becomes  Ap { Ap A; A; }
A?  becomes  Aq { A; epsilon; }

Lookahead isn't supported for rules either. You may lookahead a single 
Terminal symbol, though, using >"a" or >["a" "b"], as documented. 
Rule-level lookahead can also be rewritten BNF style, but it may affect 
more related rules and therefore depends on your instance of the 
problem.

I hope that helps a bit. If you read up on BNF vs. EBNF and consider 
that apaged is BNF based, you should find solutions to all of these 
problems.
Jan 09 2008
parent reply Dan <murpsoft hotmail.com> writes:
: D

So far I've got it doing an LL(0) predictive tokenizer which generates a 
[still pretty buggy] AST.

I'm quite proud of it at the moment, as I'm certain now that I can 
accomplish LL(0) scanning for everything but binaryOperators, where it's 
LL(n) | n ∈ expression.


Jascha Wetzel Wrote:
 Alan Knowles wrote:
 - Documentation
 While I know it's a pain to write, the docs you have already tend to 
 focus on how the parser is built, and are biased towards someone who 
 understands the internals and terminology involved in parsers, rather 
 than an end user - who just wants to know "if I'm looking for this, 
 then put this, and the result is available in these variables".
Yeah, i became aware of that through the feedback. Without thinking too much, i assumed that using parsers would be something you'd do only if you dealt with parsers intimately.
I took linguistics, and I'm an interpreter writer, and I still haven't looked at BNF or YACC/BISON notation yet. To be honest, I'm not interested in it. Like MathML, it's way too far from the machine to generate an *efficient* parser. Mine might not be efficient, but that wouldn't be intrinsic to it being low-level.
 A Terminal is a symbol that is "final" wrt. expansion, while a 
 Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are
*practically* more or less the same. 
Linguistics assigns them very different meanings.
 If you think of parse trees when dealing with your grammar
Are those like sentence trees, with noun phrase, verb phrase, etc?
 ignore the rest), Non-Terminals are the inner nodes, and Terminals are 
 the leaves.
That makes sense.
 - How to handle classic situations
When I first started with Walnut 0.x, I was grinding my brain trying to 
figure out how to match these correctly:

{ { /* } */ } , { } }

Another classical problem is JavaScript RegExp literals or divide:

/bob/i can be "divide bob divide i", or a regexp, depending on whether 
we expect an operator or operand.

How would you write that? How would the machine read that?

Both are highly important, but I find most parser writers only care 
about the former, and that the product is a working AST. If I only 
wanted that, I could write the interpreter entirely in javascript 
regular expressions and it'll only be 40 lines of code. In fact, I 
think someone did that already, but I'm sure he wasn't that terse.

Regards,
Dan
Jan 09 2008
parent reply Jascha Wetzel <firstname mainia.de> writes:
Dan wrote:
 Yeah, i became aware of that through the feedback. Without thinking too 
 much, i assumed that using parsers would be something you'd do only if 
 you dealt with parsers intimately. 
I took linguistics, and I'm an interpreter writer, and I still haven't looked at BNF or YACC/BISON notation yet. To be honest, I'm not interested in it. Like MathML, it's way too far from the machine to generate an *efficient* parser. Mine might not be efficient, but that wouldn't be intrinsic to it being low-level.
That depends on the grammar that is being parsed. Simple grammars can often be parsed faster with hand-optimized parsers. The more complex the grammar is, the less impact the generated parsers' overhead has. Apaged was tailored for parsing D, and it's very fast for that. Last time i checked, it parsed the complete Tango package in less than 2 seconds (including disk io).
 A Terminal is a symbol that is "final" wrt. expansion, while a 
 Non-Terminal can be expanded by some rules. Terminals, Tokens and Lexemes are
*practically* more or less the same. 
Linguistics assigns them very different meanings.
Formal languages do too, but for practical use it's not much of a difference.
 If you think of parse trees when dealing with your grammar
Are those like sentence trees, with noun phrase, verb phrase, etc?
Probably. I don't know linguistics. Example:

S -> A
A -> aBa
B -> bCb
C -> c
C -> A

parsing ababcbaba yields

        S
        |
        A
       /|\
      a B a
       /|\
      b C b
       /|\
      a B a
       /|\
      b C b
        |
        c
 - How to handle classic situations
When I first started with Walnut 0.x, I was grinding my brain trying to figure out how to match these correctly: { { /* } */ } , { } }
it's a matter of what else is allowed in { }. besides, usually /* */ comments are handled as whitespace lexemes, solving the problem before parsing.
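As a hand-rolled sketch of that idea (assuming a simple character-level scanner, not taken from any real codebase): once the skip loop below has eaten comments along with whitespace, the brace-matching question never reaches the parser.

// Skip whitespace and /* ... */ comments in one loop; braces inside a
// comment are consumed here and never seen by the parser, so an input
// like "{ { /* } */ } }" poses no matching problem.
string skipIgnorable(string s)
{
    for (;;)
    {
        while (s.length && (s[0] == ' ' || s[0] == '\t' || s[0] == '\n'))
            s = s[1 .. $];
        if (s.length >= 2 && s[0] == '/' && s[1] == '*')
        {
            s = s[2 .. $];
            while (s.length >= 2 && !(s[0] == '*' && s[1] == '/'))
                s = s[1 .. $];
            s = s.length >= 2 ? s[2 .. $] : "";   // drop the closing */
            continue;
        }
        return s;
    }
}

unittest
{
    assert(skipIgnorable("  /* } */ }") == "}");
}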
 Another classical problem is JavaScript RegExp literals or divide:
 
 /bob/i  can be "divide bob divide i", or a regexp, depending on whether we
expect an operator or operand.
 
 How would you write that?
 How would the machine read that?
the whole problem of parsing such a thing doesn't arise until it's 
embedded into the grammar. it therefore depends on what else interferes 
with this syntax. i don't know exactly what's allowed in JavaScript, but 
you can probably distinguish these expressions by the leading / - in an 
arithmetic expression that isn't allowed, therefore the parser tries to 
match a regexp expression. Very simplified:

Expr -> ReEx | ArEx
ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral
ReEx -> '/' StringLiteral '/' OptParameters

Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' 
is not in the first-set of ArEx), the grammar is unambiguous.
Jan 10 2008
parent reply Dan <murpsoft hotmail.com> writes:
Jascha Wetzel Wrote:

 Dan wrote:
Like MathML, it's way too far from the machine to generate an *efficient*
parser.  
That depends on the grammar that is being parsed. Simple grammars can often be parsed faster with hand-optimized parsers. The more complex the grammar is, the less impact the generated parsers' overhead has.
Ah, but there's always overhead.
 Apaged was tailored for parsing D, and it's very fast for that. Last 
 time i checked, it parsed the complete Tango package in less than 2 
 seconds (including disk io).
Eep. That would instantly make any JavaScript interpreter a failure; scripts need to *run* in the < 500ms (unnoticeable) range to even be considered.
 it's a matter of what else is allowed in { }. besides, usually /* */ 
 comments are handled as whitespace lexemes, solving the problem before 
 parsing.
Aha! Well then, the way I wrote my scanner/parser, a whole tree is built before parsing. It's not fully functional yet, but I'm not seeing any design failures.
 
 Another classical problem is JavaScript RegExp literals or divide:
 
 /bob/i  can be "divide bob divide i", or a regexp, depending on whether we
expect an operator or operand.
 
 How would you write that?
 How would the machine read that?
the whole problem of parsing such a thing doesn't arise until it's 
embedded into the grammar. it therefore depends on what else interferes 
with this syntax. i don't know exactly what's allowed in JavaScript, but 
you can probably distinguish these expressions by the leading / - in an 
arithmetic expression that isn't allowed, therefore the parser tries to 
match a regexp expression. Very simplified:

Expr -> ReEx | ArEx
ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral
ReEx -> '/' StringLiteral '/' OptParameters

Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' 
is not in the first-set of ArEx), the grammar is unambiguous.
So you mean to say it does it by looking at the tokens before and after, 
and inferring the value based on whether an operator or operand is 
expected?

That makes sense. I did it similarly to that.
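Sketched in D (hypothetical token names, not from any real engine), the rule amounts to: a '/' is division only if the previous token could end an operand.

enum Tok { Number, Ident, RParen, Plus, Divide, Regexp /* ... */ }

// A '/' can be division only after a token that can end an operand
// (a value, an identifier, or a closing parenthesis).
bool endsOperand(Tok t)
{
    return t == Tok.Number || t == Tok.Ident || t == Tok.RParen;
}

Tok classifySlash(Tok previous)
{
    return endsOperand(previous) ? Tok.Divide : Tok.Regexp;
}

unittest
{
    assert(classifySlash(Tok.Ident) == Tok.Divide);   // bob / i
    assert(classifySlash(Tok.Plus) == Tok.Regexp);    // x + /bob/i
}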
Jan 12 2008
parent Jascha Wetzel <firstname mainia.de> writes:
Dan wrote:
 That depends on the grammar that is being parsed. Simple grammars can 
 often be parsed faster with hand-optimized parsers. The more complex the 
 grammar is, the less impact the generated parsers' overhead has.
Ah, but there's always overhead.
What i meant was that this overhead has a purpose. Once your grammar 
uses it, it's not overhead anymore.

Automatically generated parsers usually employ bottom-up parsing 
algorithms (LALR, GLR) that have several advantages over the top-down 
algorithms (LL(k), usually implemented as recursive descent) used in 
manually written parsers. LALR, for example, is completely iterative 
(i.e. no call-stack activity, no function calls, less code -> completely 
cached), making it more efficient than recursive descent. GLR is almost 
completely iterative, depending on the implementation. If an 
automatically generated parser is slower, it is usually due to features 
that make it more flexible wrt. grammars. Therefore, as stated 
initially, once you need this flexibility, the parser performs nicely.

Another thing is backtracking. RD can use lookahead, but it is more 
elaborate to implement manually, and generally top-down lookahead isn't 
as effective as bottom-up lookahead. That is, it takes more complex 
grammars to create the need for backtracking in a bottom-up parser than 
in a top-down parser. That is where bottom-up parsing is categorically 
faster than top-down.

All this makes me claim that it'll be hard to write, for example, a 
recursive descent D parser by hand that performs better than its 
automatically generated GLR counterpart. You might still succeed by a 
small margin, but you will have spent a *lot* more time doing so. And 
you will have a parser that is not nearly as easily maintainable as the 
grammar used for the automatically generated parser.
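To illustrate the "iterative" point, this is roughly the shape of a table-driven LR loop in D - the tables, rules and token ids below are placeholders, not a real grammar: an explicit state stack, a flat loop, and no recursion anywhere.

enum Action { Shift, Reduce, Accept, Error }

struct Entry { Action action; int arg; }   // arg: next state or rule index
struct Rule  { int lhs; int rhsLength; }   // left-hand side symbol, rhs size

// input is a sequence of token ids ending with an end-of-input token;
// the tables are whatever the generator emitted for the grammar.
bool parseLR(const int[] input, const Entry[][] actionTable,
             const int[][] gotoTable, const Rule[] rules)
{
    int[] stack = [0];   // explicit state stack, starts in state 0
    size_t pos = 0;
    for (;;)
    {
        const e = actionTable[stack[$ - 1]][input[pos]];
        final switch (e.action)
        {
            case Action.Shift:     // consume one token, enter its state
                stack ~= e.arg;
                ++pos;
                break;
            case Action.Reduce:    // pop the rule's right-hand side, then
            {                      // take the goto on its left-hand side
                const r = rules[e.arg];
                stack = stack[0 .. $ - r.rhsLength];
                stack ~= gotoTable[stack[$ - 1]][r.lhs];
                break;
            }
            case Action.Accept:
                return true;
            case Action.Error:
                return false;
        }
    }
}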
 Apaged was tailored for parsing D, and it's very fast for that. Last 
 time i checked, it parsed the complete Tango package in less than 2 
 seconds (including disk io).
Eep. That would instantly make any JavaScript interpreter a failure; scripts need to *run* in the < 500ms (unnoticeable) range to even be considered.
The tango package has >400 files with multiple MB of code. I doubt there is any script or library written in JavaScript that is even close to that size. Parsing single files <1000 LoC is a matter of <1ms with apaged or any other LALR/GLR parser, no need to worry ;).
 So you mean to say, it does it by looking at the tokens before and after and
inferring the value based on whether an operator or operand is expected?
 
 That makes sense.  I did it similar to that.
In this case it needs to look only at the next token, even. You're usually right with what you do to solve such a case if you don't need to backtrack. If you do, you should double-check that there is no other way.
Jan 12 2008
prev sibling next sibling parent reply Dan <murpsoft hotmail.com> writes:
Dan Wrote:

 
 I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.
 
 At the moment, I'm trying recursive descent parsing.
I've got it performing rather nicely now, for its completeness.

parseOperand() returns an { ident, int, double, string, regexp, 
expression, arrayliteral, objectliteral } *value*
- incomplete for e-notation numbers
- incomplete for expression, arrayliteral, objectliteral

parseBinaryOperator() returns one of the binary operator tokens

parseOptionalWS() consumes whitespace and doesn't return anything

I've got some more written, but they're not tested. Once I have numbers 
done, I'll try writing the arrayLiteral, which will of course be

parseOperand() "," parseOperand()

: )
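A hypothetical sketch of how that arrayLiteral rule might look, working on a token array for brevity - in Dan's character-level version, parseOptionalWS() calls would sit between the steps, and all the helper bodies here are invented:

import std.exception : enforce;

string[] toks;   // the pending token stream (module-level for brevity)

// "allow": consume a token if it matches
bool accept(string t)
{
    if (toks.length && toks[0] == t) { toks = toks[1 .. $]; return true; }
    return false;
}

// "require": consume a token or fail
void expect(string t)
{
    enforce(accept(t), "expected " ~ t);
}

// stand-in for Dan's parseOperand(): take the next token as a value
string parseOperand()
{
    enforce(toks.length, "operand expected");
    auto v = toks[0];
    toks = toks[1 .. $];
    return v;
}

// arrayLiteral: "[" operand ("," operand)* "]"
string[] parseArrayLiteral()
{
    string[] elems;
    expect("[");
    if (toks.length && toks[0] != "]")
    {
        elems ~= parseOperand();
        while (accept(","))
            elems ~= parseOperand();
    }
    expect("]");
    return elems;
}

unittest
{
    toks = ["[", "1", ",", "2", "]"];
    assert(parseArrayLiteral() == ["1", "2"]);
}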
Jan 06 2008
prev sibling parent reply Ap Lee <aplee primus.ca> writes:
 Dan Wrote:

 I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry.

 At the moment, I'm trying recursive descent parsing.
Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for antlr, I strongly recommend the book advertised on the antlr.org page; I liked it very much. Good luck.
Jan 09 2008
parent reply Frank Benoit <keinfarbton googlemail.com> writes:
Ap Lee wrote:
 Have you looked at ANTLR? I have used it for two projects and I am quite 
 impressed. It does not have a D backend, nor a D grammar AFAIK, but if 
 you are writing a parser, you might find a lot of the work has been done 
 for you already. If you go for antlr, I strongly recommend the book 
 advertised on the antlr.org page; I liked it very much.
 
 Good luck.
I haven't tested it, but there is an ANTLR D backend: http://www.mbutscher.de/antlrd/
Jan 09 2008
parent Alan Knowles <alan akbkhome.com> writes:
I tried antlrd on Debian; while I could get it to install (apt-get for 
pure antlr) and download antlrd, all the examples resulted in an error 
message:

#runantlr /tmp/SimpleC.g
/tmp/SimpleC.g:3:1: unexpected token: grammar

which did not have any Google results for potential reasons.

Regards
Alan


Jan 09 2008
prev sibling parent reply Christoph Singewald <chrisrtoph singewald.at> writes:
Here you can find some parser generators:

http://www.prowiki.org/wiki4d/wiki.cgi?GrammarParsers

I use lemonde as parser generator and re2c as lexer. Using re2c, in most 
cases you only have to replace 'unsigned int' with 'uint' in the 
resulting code.

regards
cs

Jan 10 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Christoph Singewald:
 I use lemonde as parser generator and re2c as lexer. Using re2c, in most 
 cases you only have to replace 'unsigned int' with 'uint' in the 
 resulting code.
I have seen re2c and it looks nice. I think its source code can be 
modified without too much effort to make it produce D code instead of C.

I think C isn't the right language to write such a tool: its sources are 
about 200 KB of C code (plus some code generated by itself); they could 
probably be replaced by a 50 KB (or smaller) Python module (that 
generates the C/D code).

Something like a tiny but really fast C compiler like TinyCC 
(http://fabrice.bellard.free.fr/tcc/) could be used to compile the C 
code on the fly in memory and execute it. This may become the starting 
point for giving run-time compiled Regular Expressions to D :-) Probably 
there are other, smarter ways to do similar things.

Bye,
bearophile
Jan 10 2008