digitalmars.D.learn - Writing a Parser
Thread overview:
- Dan, Jan 04 2008
- Jarrett Billingsley, Jan 04 2008
- Alan Knowles, Jan 04 2008
- Jascha Wetzel, Jan 04 2008
- 0ffh, Jan 05 2008
- Alan Knowles, Jan 08 2008
- Jascha Wetzel, Jan 09 2008
- Dan, Jan 09 2008
- Jascha Wetzel, Jan 10 2008
- Dan, Jan 12 2008
- Jascha Wetzel, Jan 12 2008
- Dan, Jan 06 2008
- Ap Lee, Jan 09 2008
- Frank Benoit, Jan 09 2008
- Alan Knowles, Jan 09 2008
- Christoph Singewald, Jan 10 2008
- bearophile, Jan 10 2008
I've been messing with how to write a parser, and so far I've played with
numerous patterns before eventually wanting to cry. At the moment, I'm trying
recursive descent parsing.

The problem is that I've realized I'm duplicating huge volumes of code to cope
with the tristate decision of { unexpected, allow, require } for any given
token. For example, to consume a for loop, you consume something similar to

    /for\s*\((.*?)\)\s*\{(.*?)\}/

I have it doing that, but my soul feels heavy with the masses of looped
switches it's doing. Is there any way to ease the pain?

Regards,
Dan
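One way to collapse that tristate is a pair of tiny helpers on the token
stream, so each grammar rule reads roughly like its production. A minimal
sketch, assuming made-up Token/TokType types rather than anything from an
existing codebase:

    enum TokType { identifier, kwFor, lparen, rparen, lbrace, rbrace, eof }

    struct Token { TokType type; string text; }

    struct TokenStream
    {
        Token[] toks;   // assume the last element is always an eof token
        size_t  pos;

        Token peek() { return toks[pos]; }

        // "allow": consume the token if it matches, and say whether it did
        bool accept(TokType t)
        {
            if (toks[pos].type == t) { pos++; return true; }
            return false;
        }

        // "require": consume the token or signal "unexpected"
        Token require(TokType t)
        {
            if (toks[pos].type != t)
                throw new Exception("unexpected token: " ~ toks[pos].text);
            return toks[pos++];
        }
    }

With those two, consuming a for loop becomes require(TokType.kwFor),
require(TokType.lparen), the header parts, require(TokType.rparen),
require(TokType.lbrace), the body, require(TokType.rbrace); the looped
switches collapse into the two helpers.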
Jan 04 2008
"Dan" <murpsoft hotmail.com> wrote in message news:flmtrv$2jrn$1 digitalmars.com...I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry. At the moment, I'm trying recursive descent parsing. The problem is that I've realized I'm duplicating huge volumes of code to cope with the tristate decision of { unexpected, allow, require } for any given token. For example, to consume a for loop, you consume something similar to /for\s*\((.*?)\)\s*\{(.*?)\}/ I have it doing that, but my soul feels heavy with the masses of looped switches it's doing. Is there any way to ease the pain?Separate tokenization and syntax parsing? It makes things a hell of a lot easier. You don't even necessarily have to tokenize the source entirely before parsing; just have a lexer which lexes tokens out of the source on demand. The syntax parsing is then unencumbered from dealing with the raw source and just has to do stuff like "expect 'for', expect left-paren, expect (your condition), expect right-paren" etc.
Jan 04 2008
Yes, normally when doing OO patterns for this:

    class Parser : Tokenizer { ... }

either that, or:

    class Parser { Tokenizer reader; /* parser code... */ }

Recommended reading - the parser in DMD's source (not DMDScript) is quite
nice - it's a hand coded one and is quite easy to understand.

This is interesting from the perspective of using function pointers to do
pattern testing (it's horribly borked from the perspective that it's really
the tokenizer..):
http://www.dsource.org/projects/leds/browser/trunk/src/language/php/Parser.d

If you don't want a hand coded parser, have a look at csLex - it's what mono
used, and is pretty trivial to modify and retarget to generate other language
code (e.g. D code...)

From someone who's written far too many parsers ;)

Regards
Alan

Jarrett Billingsley wrote:
> "Dan" <murpsoft hotmail.com> wrote in message
> news:flmtrv$2jrn$1 digitalmars.com...
>> [...]
> Separate tokenization and syntax parsing? It makes things a hell of a lot
> easier. [...]
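For the second shape, the parser simply holds the tokenizer as a member; a
hypothetical D version of that layering (class and method names are
placeholders, reusing the Lexer/Token types sketched earlier):

    class Tokenizer
    {
        private Lexer lex;                   // any on-demand token source

        this(string src) { lex = Lexer(src); }

        Token next() { return lex.next(); }
    }

    class Parser
    {
        private Tokenizer reader;            // composition: the parser *has* a tokenizer

        this(Tokenizer reader) { this.reader = reader; }

        void parseProgram()
        {
            for (auto t = reader.next(); t.type != TokType.eof; t = reader.next())
            {
                // dispatch on t.type and recurse into the grammar rules here
            }
        }
    }

Keeping the tokenizer behind a member rather than inheriting from it means the
parser can be pointed at a different token source (a file, a string, a test
fixture) without touching the grammar code.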
Jan 04 2008
Dan wrote:
> I've been messing with how to write a parser, and so far I've played with
> numerous patterns before eventually wanting to cry. At the moment, I'm
> trying recursive descent parsing. [...] I have it doing that, but my soul
> feels heavy with the masses of looped switches it's doing. Is there any way
> to ease the pain?

a parser generator :)

writing a parser or scanner manually is a bit like writing any program in
assembler - tedious, error-prone and not well maintainable. there's a lot of
stuff in a parser that can be automatically generated. even if you want to
write the parser all by yourself, i'd rather suggest you write a simple parser
generator to do that tedious part for you.
Jan 04 2008
Jascha Wetzel wrote:
> Dan wrote:
>> I've been messing with how to write a parser, and so far I've played with
>> numerous patterns before eventually wanting to cry. [...]
> a parser generator :) [...]

Or a parsing combinator library. :)

And it's always nice to roll your own, if you have the extra time.
Not only to have, but to know.

regards, frank
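For reference, the combinator idea in miniature: parsers are ordinary values
that you build up by composing smaller parsers. A toy sketch in D, not based
on any existing library:

    // A parser takes the remaining input and reports whether it matched,
    // plus whatever input is left after the match.
    struct Result { bool ok; string rest; }
    alias Parser = Result delegate(string);

    // Match a literal string.
    Parser lit(string s)
    {
        return (string input) {
            if (input.length >= s.length && input[0 .. s.length] == s)
                return Result(true, input[s.length .. $]);
            return Result(false, input);
        };
    }

    // Match p1 followed by p2.
    Parser seq(Parser p1, Parser p2)
    {
        return (string input) {
            auto r1 = p1(input);
            if (!r1.ok) return Result(false, input);
            auto r2 = p2(r1.rest);
            return r2.ok ? Result(true, r2.rest) : Result(false, input);
        };
    }

    // Match p1, or p2 if p1 fails.
    Parser either(Parser p1, Parser p2)
    {
        return (string input) {
            auto r = p1(input);
            return r.ok ? r : p2(input);
        };
    }

    void main()
    {
        auto forHead = seq(lit("for"), lit("("));   // compose into bigger grammars
        assert(forHead("for(x;y;z)").ok);
        assert(!forHead("while(").ok);
    }

Real combinator libraries add repetition, result values and error reporting on
top of the same idea.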
Jan 05 2008
I did spend some time looking at aPaGeD for this yesterday - here's some
feedback that may help (and if you can answer some of the questions, that
would help me a lot as well).

- It works (yes, but you would be amazed; for the life of me I could not get
  antlr to do anything - java: write once, run nowhere...) - so being able to
  generate code and working programs from the examples was great, and a real
  +100 for the project.

- Syntax - on the face of it looks reasonable, and easy to understand -
  similar enough to antlr.

(That's the end of the really good stuff.)

- Documentation
  While I know it's a pain to write, the things you have already tend to
  focus on how the parser is built, and are biased towards someone who
  understands the internals and phraseology involved in parsers, rather than
  an end user - who just knows "if I'm looking for this, then put this, and
  the result is available in these variables". Specifically, I've no idea
  what the meanings of these are, and they are rather critical to the docs:
  "Terminal", "Non-Terminal".

- Regexes
  While I can see the benefits, I'd much rather the compiler built them for
  me - part of the beauty of the BNF format is that it's easy to read, and
  explains regex-style situations a lot better. (Otherwise, see below about
  explaining how they can deal with classic situations...)

- How to handle classic situations
  This is the key to success for the documentation (and what is seriously
  missing), as most people will probably have come from a lex/yacc
  background. These are a few classic examples that the documentation could
  do with:

  * Top level parser starts.
    Most grammars start with a top level statement, e.g.

        Program:
            Statements;

    in which case the application should only start by solving Statements.
    The biggest problem I found was that I had no idea how to stop it
    matching any of the other rules (that were only relevant to a specific
    state - e.g. see the next example).

  * Parse a string.
    This is a common pattern, but it's quite difficult to see how to
    implement it. And as above, when I tried, the parser started matching
    DoubleQuotedStringChars at the start of the file (even though it's only
    used in DoubleQuotedString):

        DoubleQuotedString:
            QUOTE DoubleQuotedStringChars QUOTE
        DoubleQuotedStringChars:
            (DoubleQuotedStringChar)*
        DoubleQuotedStringChar:
            "\" ANYCHAR
            ^QUOTE

  * Classic groupings:

        (.....)*   eg. many of these matches..
        (.....)+   eg. one or more of these matches..
        (.....)?   eg. one or none of these matches..
        (.....)=>  ... if forward lookup succeeds on (...), try to match the
                   next combo.

Regards
Alan

Jascha Wetzel wrote:
> a parser generator :)
> writing a parser or scanner manually is a bit like writing any program in
> assembler - tedious, error-prone and not well maintainable. [...]
Jan 08 2008
Alan Knowles wrote:
> - It works (yes, but you would be amazed; for the life of me I could not
>   get antlr to do anything - java: write once, run nowhere...) - so being
>   able to generate code and working programs from the examples was great,
>   and a real +100 for the project.

I'm glad to hear that, although, i must admit, i was so bold to assume that
already ;)

> - Documentation
>   While I know it's a pain to write, the things you have already tend to
>   focus on how the parser is built, and are biased towards someone who
>   understands the internals and phraseology involved in parsers, rather
>   than an end user [...]

Yeah, i became aware of that through the feedback. Without thinking too much,
i assumed that using parsers would be something you'd do only if you dealt
with parsers intimately. Therefore i felt it should be sufficient to describe
everything that's specific to apaged. Once i find the time, i'll extend the
documentation with a thorough hands-on guide.

> Specifically, I've no idea what the meanings of these are, and they are
> rather critical to the docs: "Terminal", "Non-Terminal".

A Terminal is a symbol that is "final" wrt. expansion, while a Non-Terminal
can be expanded by some rules. Terminals, Tokens and Lexemes are
*practically* more or less the same. In apaged, a Terminal is always a string
or a regexp. A Non-Terminal is specified by a set of production rules that
tell how to match a sequence of Terminals and Non-Terminals. In apaged, a
Non-Terminal is always something looking like this:

    MyNT
    {
        "a" MyOtherNT MyNT;
        "b" Nt2 { /* D Code ... */ }
    }

If you think of parse trees when dealing with your grammar (if not, ignore
the rest), Non-Terminals are the inner nodes, and Terminals are the leaves.

> - Regexes
>   While I can see the benefits, I'd much rather the compiler built them for
>   me - part of the beauty of the BNF format is that it's easy to read, and
>   explains regex-style situations a lot better. [...]

Do you mean regexp-like syntax for rules (i.e. EBNF syntax)? This is a
feature on my todo list, but i still have to find a nice way to merge this
with the semantic code (which is non-trivial) - a thing for the future. If
you mean regexes as in "asdf[a-z0-9]*", then i don't understand what you
mean, since apaged does compile them for you. You don't need to use them at
all, though; they're merely a convenience feature.

> * Top level parser starts.
>   Most grammars start with a top level statement, e.g. Program: Statements;
>   in which case the application should only start by solving Statements.
>   The biggest problem I found was that I had no idea how to stop it
>   matching any of the other rules [...]

The docs state that "The first non-terminal in a grammar file will be treated
as the start symbol for the grammar." I admit that it should be "...in the
main grammar file..." since that doesn't apply to imported grammar files.

> * Parse a string.
>   This is a common pattern, but it's quite difficult to see how to
>   implement it. And as above, when I tried, the parser started matching
>   DoubleQuotedStringChars at the start of the file (even though it's only
>   used in DoubleQuotedString). [grammar snipped]

Hm, it shouldn't do that. I'll assume that's because of how QUOTE is defined;
I'd need to see the whole grammar. Besides that, you use 2 unsupported EBNF
features in the above grammar: ^QUOTE and (DoubleQuotedStringChar)*. Apaged
doesn't support full EBNF syntax for the reason mentioned above.

> * Classic groupings:
>   (.....)*   eg. many of these matches..
>   (.....)+   eg. one or more of these matches..
>   (.....)?   eg. one or none of these matches..
>   (.....)=>  ... if forward lookup succeeds on (...) try match next combo.

None of those are supported. You need to write them BNF style:

A* becomes

    As { As A; epsilon; }

A+ becomes

    Ap { Ap A; A; }

A? becomes

    Aq { A; epsilon; }

Lookahead isn't supported for rules either. You may lookahead a single
Terminal symbol, though, using >"a" or >["a" "b"], as documented. Rule-level
lookahead can also be rewritten BNF style, but it may affect more related
rules and therefore depends on your instance of the problem.

I hope that helps a bit. If you read up on BNF vs. EBNF and consider that
apaged is BNF based, you should find solutions to all of these problems.
Jan 09 2008
: D

So far I've got it doing an LL(0) predictive tokenizer which generates a
[still pretty buggy] AST. I'm quite proud of it at the moment, as I'm certain
now that I can accomplish LL(0) scanning for everything but binaryOperators;
where it's LL(n) | n E expression.

Jascha Wetzel Wrote:
>> - Documentation
>> While I know it's a pain to write, the things you have already tend to
>> focus on how the parser is built, and are biased towards someone who
>> understands the internals and phraseology involved in parsers, rather
>> than an end user [...]
> Yeah, i became aware of that through the feedback. Without thinking too
> much, i assumed that using parsers would be something you'd do only if you
> dealt with parsers intimately.

I took linguistics, and I'm an interpreter writer, and I still haven't looked
at BNF or YACC/BISON notation yet. To be honest, I'm not interested in it.
Like MathML, it's way too far from the machine to generate an *efficient*
parser. Mine might not be efficient, but that wouldn't be intrinsic to it
being low-level.

> A Terminal is a symbol that is "final" wrt. expansion, while a Non-Terminal
> can be expanded by some rules. Terminals, Tokens and Lexemes are
> *practically* more or less the same.

Linguistics assigns them very different meanings.

> If you think of parse trees when dealing with your grammar

Are those like sentence trees, with noun phrase, verb phrase, etc?

> ignore the rest), Non-Terminals are the inner nodes, and Terminals are the
> leaves.

That makes sense.

>> - How to handle classic situations

When I first started with Walnut 0.x, I was grinding my brain trying to
figure out how to match these correctly:

    { { /* } */ } , { } }

Another classical problem is JavaScript RegExp literals or divide: /bob/i can
be "divide bob divide i", or a regexp, depending on whether we expect an
operator or operand. How would you write that? How would the machine read
that? Both are highly important, but I find most parser writers only care
about the former, and that the product is a working AST. If I only wanted
that, I could write the interpreter entirely in javascript regular
expressions and it'll only be 40 lines of code. In fact, I think someone did
that already, but I'm sure he wasn't that terse.

Regards,
Dan
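For what it's worth, one common trick in hand-written JavaScript lexers is to
remember whether the previous significant token could end an operand; if it
could, a '/' is division, otherwise it starts a RegExp literal. A tiny sketch
of that rule in D (purely illustrative, not from any particular
implementation):

    enum Prev { none, operand, operator }

    // After an operand (identifier, literal, ')' or ']'), '/' must be
    // division; after an operator or at the start of an expression it
    // begins a regexp literal.
    bool slashStartsRegexp(Prev prev)
    {
        return prev != Prev.operand;
    }

    unittest
    {
        assert(!slashStartsRegexp(Prev.operand));  // "a / b"      -> division
        assert(slashStartsRegexp(Prev.operator));  // "x = /bob/i" -> regexp
        assert(slashStartsRegexp(Prev.none));      // "/bob/i;"    -> regexp
    }

The lexer updates the flag after every token it emits, so the decision never
needs more than a single token of context.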
Jan 09 2008
Dan wrote:
>> Yeah, i became aware of that through the feedback. Without thinking too
>> much, i assumed that using parsers would be something you'd do only if
>> you dealt with parsers intimately.
> I took linguistics, and I'm an interpreter writer, and I still haven't
> looked at BNF or YACC/BISON notation yet. To be honest, I'm not interested
> in it. Like MathML, it's way too far from the machine to generate an
> *efficient* parser. Mine might not be efficient, but that wouldn't be
> intrinsic to it being low-level.

That depends on the grammar that is being parsed. Simple grammars can often
be parsed faster with hand-optimized parsers. The more complex the grammar
is, the less impact the generated parsers' overhead has. Apaged was tailored
for parsing D, and it's very fast for that. Last time i checked, it parsed
the complete Tango package in less than 2 seconds (including disk io).

>> Terminals, Tokens and Lexemes are *practically* more or less the same.
> Linguistics assigns them very different meanings.

Formal languages do too, but for practical use it's not much of a difference.

>> If you think of parse trees when dealing with your grammar
> Are those like sentence trees, with noun phrase, verb phrase, etc?

Probably. I don't know linguistics. Example:

    S -> A
    A -> aBa
    B -> bCb
    C -> c
    C -> A

parsing ababcbaba yields

              S
              |
         _____A_____
        /     |     \
       a   ___B___    a
          /   |   \
         b  __C__   b
           /  |  \
          a   B   a
             /|\
            b C b
              |
              c

>>> - How to handle classic situations
> When I first started with Walnut 0.x, I was grinding my brain trying to
> figure out how to match these correctly: { { /* } */ } , { } }

it's a matter of what else is allowed in { }. besides, usually /* */ comments
are handled as whitespace lexemes, solving the problem before parsing.

> Another classical problem is JavaScript RegExp literals or divide: /bob/i
> can be "divide bob divide i", or a regexp, depending on whether we expect
> an operator or operand. How would you write that? How would the machine
> read that?

the whole problem of parsing such a thing doesn't arise until embedded into
the grammar. it therefore depends on what else interferes with this syntax.
i don't know exactly what's allowed in JavaScript, but you can probably
distinguish these expressions by the leading / - in an arithmetic expression
that isn't allowed, therefore the parser tries to match a regexp expression.
Very simplified:

    Expr -> ReEx | ArEx
    ArEx -> ArEx '/' ArEx | Identifier | NumericLiteral
    ReEx -> '/' StringLiteral '/' OptParameters

Since neither Identifier nor NumericLiteral may start with '/' (i.e. '/' is
not in the first-set of ArEx), the grammar is unambiguous.
Jan 10 2008
Jascha Wetzel Wrote:
>> Like MathML, it's way too far from the machine to generate an *efficient*
>> parser.
> That depends on the grammar that is being parsed. Simple grammars can often
> be parsed faster with hand-optimized parsers. The more complex the grammar
> is, the less impact the generated parsers' overhead has.

Ah, but there's always overhead.

> Apaged was tailored for parsing D, and it's very fast for that. Last time i
> checked, it parsed the complete Tango package in less than 2 seconds
> (including disk io).

Eep. That would instantly make any JavaScript interpreter a failure; scripts
need to *run* in the < 500ms (unnoticeable) range to even be considered.

> it's a matter of what else is allowed in { }. besides, usually /* */
> comments are handled as whitespace lexemes, solving the problem before
> parsing.

Aha! Well then, the way I wrote my scanner/parser, a whole tree is built
before parsing. It's not fully functional yet, but I'm not seeing any design
failures.

>> Another classical problem is JavaScript RegExp literals or divide: /bob/i
>> can be "divide bob divide i", or a regexp, depending on whether we expect
>> an operator or operand. How would you write that? How would the machine
>> read that?
> the whole problem of parsing such a thing doesn't arise until embedded into
> the grammar. [...] you can probably distinguish these expressions by the
> leading / - in an arithmetic expression that isn't allowed, therefore the
> parser tries to match a regexp expression. [...]

So you mean to say, it does it by looking at the tokens before and after and
inferring the value based on whether an operator or operand is expected? That
makes sense. I did it similar to that.
Jan 12 2008
Dan wrote:
>> That depends on the grammar that is being parsed. Simple grammars can
>> often be parsed faster with hand-optimized parsers. The more complex the
>> grammar is, the less impact the generated parsers' overhead has.
> Ah, but there's always overhead.

What i meant was, that this overhead has a purpose. Once your grammar uses
it, it's not overhead anymore.

Automatically generated parsers usually facilitate bottom-up parsing
algorithms (LALR, GLR) that have several advantages over top-down algorithms
(LL(k), usually implemented as recursive descent) that are used in manually
written parsers. LALR for example is completely iterative (i.e. no call stack
action, no calls, less code -> completely cached), making it more efficient
than recursive descent. GLR is almost completely iterative, depending on the
implementation. If an automatically generated parser is slower, it is usually
due to features that make it more flexible wrt. grammars. Therefore, as
stated initially, once you need this flexibility, the parser performs nicely.

Another thing is backtracking. RD can use lookahead, but it is more elaborate
to implement it manually, and generally top-down lookahead isn't as effective
as bottom-up lookahead. That is, it takes more complex grammars to create the
need for backtracking in a bottom-up parser than in a top-down parser. That
is where bottom-up parsing is categorically faster than top-down.

All this makes me claim that it'll be hard to write, for example, a recursive
descent D parser by hand that performs better than its automatically
generated GLR counterpart. You might still succeed by a small margin, but you
will have spent a *lot* more time to do so. And you will have a parser that
is not nearly as easily maintainable as the grammar used for the
automatically generated parser.

>> Apaged was tailored for parsing D, and it's very fast for that. Last time
>> i checked, it parsed the complete Tango package in less than 2 seconds
>> (including disk io).
> Eep. That would instantly make any JavaScript interpreter a failure;
> scripts need to *run* in the < 500ms (unnoticeable) range to even be
> considered.

The tango package has >400 files with multiple MB of code. I doubt there is
any script or library written in JavaScript that is even close to that size.
Parsing single files <1000 LoC is a matter of <1ms with apaged or any other
LALR/GLR parser, no need to worry ;).

> So you mean to say, it does it by looking at the tokens before and after
> and inferring the value based on whether an operator or operand is
> expected? That makes sense. I did it similar to that.

In this case it even only needs to look at the next token. You're usually
right with what you do to solve such a case if you don't need to backtrack.
If so, you should double check that there is no other way.
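To make the "completely iterative" point concrete: a table-driven bottom-up
parser is a single loop over an explicit state stack. Below is a toy
shift/reduce driver in D for the grammar E -> E '+' 'n' | 'n'; the states and
table entries are worked out by hand for this tiny grammar, not produced by
apaged or any other generator.

    // token kinds for the toy grammar  E -> E '+' 'n' | 'n'
    enum Tok { n, plus, eof }

    // One loop over an explicit state stack - no recursion,
    // no function call per grammar rule.
    bool parse(Tok[] input)
    {
        int[] stack = [0];      // state stack; state 0 is the start state
        size_t ip = 0;          // input position

        while (true)
        {
            int s = stack[$ - 1];
            Tok t = ip < input.length ? input[ip] : Tok.eof;

            if (s == 0 && t == Tok.n)         { stack ~= 2; ip++; } // shift
            else if (s == 1 && t == Tok.plus) { stack ~= 3; ip++; } // shift
            else if (s == 1 && t == Tok.eof)  return true;          // accept
            else if (s == 3 && t == Tok.n)    { stack ~= 4; ip++; } // shift
            else if (s == 2)                  // reduce E -> n
            {
                stack.length -= 1;            // pop the rhs (1 state)
                stack ~= 1;                   // goto[state 0][E] == 1
            }
            else if (s == 4)                  // reduce E -> E '+' n
            {
                stack.length -= 3;            // pop the rhs (3 states)
                stack ~= 1;                   // goto[state 0][E] == 1
            }
            else
                return false;                 // syntax error
        }
    }

    void main()
    {
        assert(parse([Tok.n, Tok.plus, Tok.n, Tok.plus, Tok.n])); // n + n + n
        assert(!parse([Tok.plus, Tok.n]));                        // starts with '+'
    }

A real generated parser keeps the actions in tables indexed by (state, token)
instead of a hand-written switch, but the driver loop keeps this shape.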
Jan 12 2008
Dan Wrote:
> I've been messing with how to write a parser, and so far I've played with
> numerous patterns before eventually wanting to cry. At the moment, I'm
> trying recursive descent parsing.

I've got it performing rather nicely now, for its completeness.

parseOperand() returns an { ident, int, double, string, regexp, expression,
arrayliteral, objectliteral *value* }
  - incomplete for e notation numbers
  - incomplete for expression, arrayliteral, objectliteral

parseBinaryOperator() returns one of the binary operator tokens

parseOptionalWS() consumes whitespace and doesn't return anything

I've got some more written, but they're not tested. Once I have numbers done,
I'll try writing the arrayLiteral, which will of course be
parseOperand() "," parseOperand()

: )
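A sketch of how that arrayLiteral could sit on top of those pieces. The
signatures and the ref-string scanning style are assumptions; only the
"operand, comma, operand" shape is taken from the description above:

    struct Value { string text; }             // stand-in for the *value* result

    // Placeholder operand parser: grab everything up to the next ',' or ']'
    Value parseOperand(ref string src)
    {
        size_t i = 0;
        while (i < src.length && src[i] != ',' && src[i] != ']')
            i++;
        auto v = Value(src[0 .. i]);
        src = src[i .. $];
        return v;
    }

    void parseOptionalWS(ref string src)
    {
        while (src.length && (src[0] == ' ' || src[0] == '\t' || src[0] == '\n'))
            src = src[1 .. $];
    }

    Value[] parseArrayLiteral(ref string src)
    {
        Value[] elems;
        src = src[1 .. $];                    // consume '['
        parseOptionalWS(src);
        while (src.length && src[0] != ']')
        {
            elems ~= parseOperand(src);       // parseOperand()
            parseOptionalWS(src);
            if (src.length && src[0] == ',')  // ","
            {
                src = src[1 .. $];
                parseOptionalWS(src);         // then the next parseOperand()
            }
        }
        if (src.length)
            src = src[1 .. $];                // consume ']'
        return elems;
    }

    void main()
    {
        string s = "[a, 5, bob]";
        assert(parseArrayLiteral(s).length == 3);
    }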
Jan 06 2008
Dan Wrote:Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for antlr, I strongly recommend the book advertized on the antlr.org page, I liked it very much. Good luck.I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry. At the moment, I'm trying recursive descent parsing.
Jan 09 2008
Dan Wrote:Have you looked at ANTLR? I have used it for two projects and I am quite impressed. It does not have a D backend, nor a D grammar AFAIK, but if you are writing a parser, you might find a lot of the work has been done for you already. If you go for antlr, I strongly recommend the book advertized on the antlr.org page, I liked it very much. Good luck.I've been messing with how to write a parser, and so far I've played with numerous patterns before eventually wanting to cry. At the moment, I'm trying recursive descent parsing.
Jan 09 2008
Ap Lee schrieb:
> Have you looked at ANTLR? I have used it for two projects and I am quite
> impressed. [...]

I haven't tested it, but there is an Antlr D backend:
http://www.mbutscher.de/antlrd/
Jan 09 2008
I tried antlrd on Debian. While I could get it to install (apt-get for pure
antlr) and download antlrd, all the examples resulted in an error message:

    #runantlr /tmp/SimpleC.g
    /tmp/SimpleC.g:3:1: unexpected token: grammar

which did not have any google results for potential reasons..

Regards
Alan

Frank Benoit wrote:
> I haven't tested it, but there is an Antlr D backend:
> http://www.mbutscher.de/antlrd/
Jan 09 2008
Here you can find some parser generators:
http://www.prowiki.org/wiki4d/wiki.cgi?GrammarParsers

I use lemon as parser generator and re2c as lexer. Using re2c, in most cases
you only have to replace 'unsigned int' with 'uint' in the resulting code.

regards
cs

Dan Wrote:
> I've been messing with how to write a parser, and so far I've played with
> numerous patterns before eventually wanting to cry. [...] Is there any way
> to ease the pain?
>
> Regards,
> Dan
Jan 10 2008
Christoph Singewald:
> I use lemon as parser generator and re2c as lexer. Using re2c, in most
> cases you only have to replace 'unsigned int' with 'uint' in the resulting
> code.

I have seen re2c and it looks nice. I think its source code can be modified
without too much effort to make it produce D code instead of C. I think C
isn't the right language to write such a tool: its sources are about 200 KB
of C code (plus some code generated by itself); they could probably be
replaced by a 50 KB (or smaller) Python module that generates the C/D code.

Something like a tiny but really fast C compiler like TinyCC
(http://fabrice.bellard.free.fr/tcc/) can be used to compile the C code on
the fly in memory and execute it. This may become the starting point for
giving run-time compiled Regular Expressions to D :-) Probably there are
other smarter ways to do similar things.

Bye,
bearophile
Jan 10 2008