digitalmars.D - Lexer and parser generators using CTFE
- Andrei Alexandrescu (31/31) Feb 27 2012 I'm starting a new thread on this because I think the matter is of
- CTFE-4-the-win (6/41) Feb 28 2012 Definitely, I applaud this initiative! I've long been of the
- bls (27/34) Feb 28 2012 Are you aware of Philippe Sigaud's PEG work (may become part of his
- Christopher Bergqvist (12/66) Feb 28 2012 I agree that the current direction of D in this area is
- Jonathan M Davis (5/17) Feb 28 2012 Well, one cool thing would be that a compiler could effectively compile ...
- Piotr Szturmaj (3/19) Feb 28 2012 CTFE code can be much slower than native one, and I would like to see
- Marco Leise (4/6) Feb 29 2012 I second this. As a fan of clean optimizations this is one of the things...
- Andrei Alexandrescu (6/14) Mar 01 2012 I think this is the kind of development that will be naturally motivated...
- James Miller (32/94) Feb 28 2012 As far as I can tell, the advantage of compile-time parser-ge...
- H. S. Teoh (9/26) Feb 28 2012 [...]
- d coder (9/14) Feb 28 2012 CTFE parsing is especially useful for DSEL (Domain Specific Embedded
- Philippe Sigaud (40/48) Feb 29 2012 variables/identifiers in the parsed subset of DSL with the mainstream D
- Andrei Alexandrescu (4/9) Feb 29 2012 That looks about right, but still has a fair amount of noise. I think
- deadalnix (3/12) Mar 01 2012 It make sense so an external file can be imported as string and
- Philippe Sigaud (8/14) Mar 01 2012 approach of putting the entire grammar in one string is best.
- John Belmonte (5/17) May 27 2012 I'm wondering if people have seen LPeg. It's a Lua library, but
- Philippe Sigaud (22/26) May 28 2012 This is exactly the approach followed by C++ Boost::Spirit, with halas
- John Belmonte (22/34) May 28 2012 Fair enough. I did notice the following in the markdown PEG
- Philippe Sigaud (9/24) May 29 2012 You're exactly right! Nice catch.
- H. S. Teoh (6/22) Mar 01 2012 You could maybe just put D code in the grammar string, which gets
- Philippe Sigaud (5/9) Mar 01 2012 I could, but one of my driving goals while beginning this project a mont...
- Jacob Carlborg (9/19) Mar 01 2012 Maybe we can take some ideas from the CoffeeScript compiler (written in
- d coder (3/7) Feb 29 2012 I am eagerly waiting for you to upload the stuff on GitHub.
- Andrei Alexandrescu (20/29) Feb 29 2012 Barrier of entry and granularity of approach, I think.
- Christopher Bergqvist (7/31) Feb 29 2012 Thanks for your response. The lowered barrier of entry in
- H. S. Teoh (15/24) Feb 29 2012 I see that std.stdio.writef already parses format strings at
- Andrej Mitrovic (2/2) Feb 29 2012 Why do you need grammar rules for json? Just use a custom-defined
- Hisayuki Mima (33/64) Feb 28 2012 Hello Andrei,
- Hisayuki Mima (4/80) Feb 29 2012 A minimum of documentation is now available here:
- Andrei Alexandrescu (4/6) Feb 29 2012 Great start, you may want to aggressively add examples. Then people
- Andrei Alexandrescu (61/86) Feb 29 2012 Nice! I have a few comments. I should say that they entail a fair amount...
- Hisayuki Mima (26/116) Mar 04 2012 Certainly, "built AST" option is very interesting.
- Andrei Alexandrescu (12/28) Mar 04 2012 It's your project so you define what's fun to work on. Let me just say
- Ben Hanson (10/15) Mar 05 2012 This chimes nicely with the comments made at the Going Native
- d coder (32/38) May 27 2012 Hello Youkei
- Hisayuki Mima (9/48) May 31 2012 Hello Puneet,
- d coder (10/14) May 27 2012 Ok. I see my folly. At compile time, there would not be any "this" since
- H. S. Teoh (23/42) Feb 28 2012 CTFE is one of the big features that attracted me to D.
- Andrei Alexandrescu (3/10) Feb 29 2012 For example integrated lexer and parser generators!
- Nick Sabalausky (8/18) Feb 29 2012 ...With statically-checked symbol names that can include arbitrary
- mist (2/2) Feb 28 2012 Something similar to Boost::Spirit2 but with sane syntax and
- bcs (3/5) Feb 28 2012 How about not having to encode the syntax as arithmetic expressions? D
- Andrei Alexandrescu (3/8) Feb 29 2012 I think that's what he meant by "sane syntax".
- Andrei Alexandrescu (4/6) Feb 29 2012 Exactly. Boost::Spirit2 is a perfect illustration of a dancing bear -
- Martin Nowak (22/53) Feb 28 2012 I wrote a generic lexer generator some time ago.
- Dmitry Olshansky (24/89) Feb 28 2012 To be truly successful it should have some distinct properties like:
- H. S. Teoh (38/70) Feb 28 2012 [...]
- Dmitry Olshansky (17/84) Feb 29 2012 Well I see no problem in defining a separate OperatorGrammar
- Andrei Alexandrescu (6/12) Feb 29 2012 What does stacking grammars mean?
- Dmitry Olshansky (32/44) Feb 29 2012 In general it should be read as using one parser as lexer for another
- Andrei Alexandrescu (9/26) Feb 29 2012 Yes. But AST creation should be supported without any custom code.
- Nick Sabalausky (10/20) Feb 29 2012 Probably LR, too, unless you build the state tables in a separate prior
- Nick Sabalausky (3/13) Feb 29 2012 Erm, I mean LALR specifically.
- Andrei Alexandrescu (5/15) Feb 29 2012 It's been a while since I read about PEGs (the packrat parser paper),
- Nick Sabalausky (13/28) Feb 29 2012 I don't know much about packrat parsers, but what I meant is that genera...
- Ben Hanson (29/32) Mar 05 2012 Hi Nick,
- Dmitry Olshansky (11/24) Feb 29 2012 It might involve parsing and grammars, unless academia folk seduce me
- H. S. Teoh (17/35) Feb 28 2012 Cool! I'll have to take a look at this sometime.
- deadalnix (6/10) Feb 28 2012 std.conainer is already a good candidate for allocators.
- Martin Nowak (25/43) Feb 28 2012 Yacc does work but at the price of an additional build step and total
- H. S. Teoh (8/28) Feb 28 2012 Excellent idea, I like this.
- Nick Sabalausky (27/51) Feb 28 2012 In Goldie, I've taken an inverted approach, which IMHO is easier to use:...
- Nick Sabalausky (4/61) Feb 28 2012 Hmm, maybe I need to think about what it would take to make Goldie able ...
- Nick Sabalausky (8/10) Feb 28 2012 Just gave it a quick shot. It was looking like it might not be too bad, ...
- Philippe Sigaud (14/26) Feb 29 2012 but
- Nick Sabalausky (4/25) Feb 29 2012 "Philippe Sigaud" wrote in message
- Nick Sabalausky (3/15) Mar 07 2012 Filed: http://d.puremagic.com/issues/show_bug.cgi?id=7667
- Andrei Alexandrescu (3/28) Feb 29 2012 I think this is the right approach.
- Nick Sabalausky (3/40) Feb 29 2012 What? We agree? ;)
- Andrei Alexandrescu (3/4) Feb 29 2012 It's 2012 after all.
- Andrei Alexandrescu (9/29) Feb 29 2012 I'm not sure what a parser combinator is. I think grammar inheritance
- Timon Gehr (2/4) Feb 29 2012 Well, it is slower at lexing than DMD at parsing. What is the bottleneck...
- Martin Nowak (6/12) Feb 29 2012 No, it's as fast as dmd's lexer.
- H. S. Teoh (12/28) Feb 29 2012 This sounds like something I ran into in my D lexer project. Lexing
- d coder (1/7) Feb 29 2012 And how different was just a few seconds from 10-15 seconds? :-)
- Nick Sabalausky (4/15) Feb 29 2012 ~10 seconds
- H. S. Teoh (6/17) Feb 29 2012 OK so I was imprecise. It was 2-3 seconds compared to 10-15 seconds. So
- Timon Gehr (2/13) Feb 29 2012 I did that.
- Martin Nowak (14/33) Feb 29 2012 Interesting, I've commented it out https://gist.github.com/1262321#L1559...
- Timon Gehr (4/38) Feb 29 2012 I get 0.160s for lexing using your lexer.
- Martin Nowak (10/52) Feb 29 2012 Mmh, I've retested and you're right dmd's lexer is about 2x faster.
- H. S. Teoh (26/34) Feb 29 2012 [...]
- bcs (6/37) Feb 28 2012 A bit outdated but....
- F i L (71/71) May 27 2012 I'm not sure I follow all the details of what Andrei's suggesting
- Jacob Carlborg (8/75) May 28 2012 This is a very cool feature of Nimrod. It allows to move several
- Ken (11/46) Jun 01 2012 Great! So what's the next step? Do we wait for the maintainers
- Andrei Alexandrescu (7/15) Jun 01 2012 I think this core strength of the language should be properly supplanted...
I'm starting a new thread on this because I think the matter is of strategic importance.
We all felt for a long time that there's a lot of potential in CTFE, and potential applications have been discussed more than a few times, ranging from parsing format strings to DSLs and parser generators. Such feats are now approaching fruition because a number of factors converge:
* Dmitry Olshansky's regex library (now in Phobos) generates efficient D code straight from regexen.
* The scope and quality of CTFE has improved enormously, making more advanced uses possible and even relatively easy (thanks Don!)
* Hisayuki Mima implemented a parser generator in only 3000 lines of code (sadly, no comments or documentation yet :o))
* With the occasion of that announcement we also find out Philippe Sigaud already has a competing design and implementation of a parser generator.
This is the kind of stuff I've had an eye on for the longest time. I'm saying it's of strategic importance because CTFE technology, though not new and already available with some languages, has unique powers when combined with other features of D. With CTFE we get to do things that are quite literally impossible to do in other languages.
We need to have an easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself.
What do you all think? Let's get this project off the ground!
Thanks,
Andrei
Feb 27 2012
On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei Alexandrescu wrote:
[...]
Definitely, I applaud this initiative! I've long been of the opinion that CTFE parsing is D's killer-feature, which would allow me to "sneak" D into a "nameless above average size company". ;)
Feb 28 2012
On 02/28/2012 12:36 AM, CTFE-4-the-win wrote:
We need to have an easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. [...]
Are you aware of Philippe Sigaud's PEG work (may become part of his template book)?
Quote Philippe ..
I recently wrote a parsing expression grammar module in D, also to create grammars and parsers at compile-time and parse inputs at CT. (PEG: http://en.wikipedia.org/wiki/Parsing_expression_grammar)
Usage is like this:
mixin Grammar( "JSON <- '{' ( Pair ( ',' Pair )* )? '}'"
             , "Pair <- String ':' Value"
             , "Value <- String / Number / JSON / Array / True / False / Null"
             , `True <- "true"`
             (..., rest of JSON grammar)
             );
enum result = JSON.parse( `{ "Hello":42, "World!": { "a":[0,1,2,3] } }`);
End Quote
Not that bad :), I think. Well, I am still a fan of EBNF-based parser generators (recursive descent), but that's another story. If I recall correctly BCS has created something working. (BNF, so limited)
Feb 28 2012
On Tuesday, 28 February 2012 at 08:36:13 UTC, CTFE-4-the-win wrote:
[...]
Definitely, I applaud this initiative! I've long been of the opinion that CTFE parsing is D's killer-feature, which would allow me to "sneak" D into a "nameless above average size company". ;)
I agree that the current direction of D in this area is impressive. However, I fail to see a killer-feature in generating a lexer-parser generator at compile-time instead of run-time.
A run-time generator would benefit from not having to execute within the limited CTFE environment and would always be on-par in that respect. A compile-time generator would internalize the generation and compilation of the result (with possible glue-code), simplifying the build process somewhat.
What am I failing to pick up on?
Feb 28 2012
On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
I agree that the current direction of D in this area is impressive. However, I fail to see a killer-feature in generating a lexer-parser generator at compile-time instead of run-time. [...] What am I failing to pick up on?
Well, one cool thing would be that a compiler could effectively compile itself - or at least the lexer and parser would compile themselves. You wouldn't have to run a program to generate them. It could be self-contained.
- Jonathan M Davis
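For reference, a minimal sketch of that self-containedness in plain D (nothing parser-specific here; the table and names are made up): the very same ordinary function is run by the compiler to produce a lookup table, with no external generator step and no generated source files.

    // Ordinary D function; nothing marks it as compile-time-only.
    bool[256] makeDigitTable()
    {
        bool[256] t;
        foreach (c; cast(int)'0' .. cast(int)'9' + 1)
            t[c] = true;
        return t;
    }

    enum digit = makeDigitTable();  // evaluated via CTFE, baked into the binary
    static assert(digit['7'] && !digit['x']);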
Feb 28 2012
Jonathan M Davis wrote:
Well, one cool thing would be that a compiler could effectively compile itself - or at least the lexer and parser would compile themselves. You wouldn't have to run a program to generate them. It could be self-contained.
CTFE code can be much slower than native code, and I would like to see some kind of compiler cache for such code.
Feb 28 2012
On 29.02.2012, 02:30, Piotr Szturmaj <bncrbme jadamspam.pl> wrote:
CTFE code can be much slower than native code, and I would like to see some kind of compiler cache for such code.
I second this. As a fan of clean optimizations this is one of the things I tossed around my head for a while. It could use a cache file, or the compiler could be started as a daemon process keeping a memory cache. All code that is called through CTFE would go into the cache, indexed by the internal representation of the function body and parameters.
But there are a few open questions, like how seamlessly this could be integrated. Is it easy to get a hash for function bodies and parameters? How should the cache be limited? N functions, n bytes, or maybe one cache per module (together with the .o files)? The latter case would mean that if a.d uses CTFE that executes code from b.d, the results of CTFE would all be cached together with a.o, because that was the entry point. And if there was a module c.d that does the same as a.d, it would duplicate the b.d part in its own cache. The benefit is that the cache works with both single-module and whole-program compilation, and it doesn't need any limits. Instead the caches for each module are always refreshed with what was last used in the compilation.
In addition to the last compilation, the caches could be aware of versioned compilation. I usually want to compile debug versions and Linux/Windows release versions at least, so I wouldn't want to invalidate the caches. For 32-bit vs. 64-bit I assume it is best to just cache them separately, as it could prove difficult to distinguish two versions of code that uses (void*).sizeof or something else that isn't declared wrapped in a version statement like size_t is.
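As a toy illustration of the "indexed by function body and parameters" idea (purely hypothetical; a real compiler would hash its internal IR rather than source text, and the function name here is invented):

    import std.digest.md5;

    // Hypothetical cache key: hash of the function's body plus the
    // stringified arguments of the particular CTFE invocation.
    string ctfeCacheKey(string functionBody, string args)
    {
        return toHexString(md5Of(functionBody ~ '\0' ~ args)).idup;
    }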
Feb 29 2012
On 3/1/12 1:54 AM, Marco Leise wrote:
[...] It could use a cache file, or the compiler could be started as a daemon process keeping a memory cache. All code that is called through CTFE would go into the cache, indexed by the internal representation of the function body and parameters.
I think this is the kind of development that will be naturally motivated and pushed by serious CTFE applications pushing the boundary. I'm not very worried about it right now as about investigating CTFE potential and blockers.
Andrei
Mar 01 2012
On 29 February 2012 14:16, Christopher Bergqvist <spambox0 digitalpoetry.se> wrote:
[...]
I agree that the current direction of D in this area is impressive. However, I fail to see a killer-feature in generating a lexer-parser generator at compile-time instead of run-time. A run-time generator would benefit from not having to execute within the limited CTFE environment and would always be on-par in that respect. A compile-time generator would internalize the generation and compilation of the result (with possible glue-code), simplifying the build process somewhat. What am I failing to pick up on?
As far as I can tell, the advantage of compile-time parser-generation is that you get significantly better start-up times for the program. Generating a lexer/parser at runtime produces significant overhead. Also, if you want to do a yacc-style action system, then having it created at compile-time means that you don't have the overhead of using anonymous functions or delegates at runtime; you can just give it a raw string of D code.
The CTFE environment is not that limited, and has enough functionality to produce a lexer/parser. Complex language definitions would benefit from CTFE parser-generation, since the load times to create the generator can be quite long - a problem for programs that may not be long-running (a lint tool for example).
--
James Miller
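The "raw string of D code" point is just D's string mixins at work; a minimal sketch (mulAction and applyMul are made-up names, not part of any generator):

    // The action arrives as source text inside the grammar definition...
    enum mulAction = q{ return lhs * rhs; };

    // ...and is spliced directly into a generated function at compile
    // time - no delegate object, no indirect call at run time.
    int applyMul(int lhs, int rhs)
    {
        mixin(mulAction);
    }

    static assert(applyMul(6, 7) == 42);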
Feb 28 2012
On Tue, Feb 28, 2012 at 08:25:21PM -0500, Jonathan M Davis wrote:
[...]
Well, one cool thing would be that a compiler could effectively compile itself - or at least the lexer and parser would compile themselves. You wouldn't have to run a program to generate them. It could be self-contained.
[...]
You also eliminate possibly lengthy program startup initialization. Which can add up, if the program is run many times. Plus, if something can already be computed at compile-time, why waste time & resources doing it at runtime?
T
--
Without effort you cannot even pull a fish out of the pond.
Feb 28 2012
CTFE parsing is especially useful for DSEL (Domain Specific Embedded Languages) or internal DSLs. The advantages are:
1. Syntactic errors (in the parsed constructs) are given out at compile time.
2. D reflections are available only at compile time. Referencing the variables/identifiers in the parsed subset of the DSL with the mainstream D code is impossible without reflections in place.
Regards
- Puneet
On Wednesday, February 29, 2012 02:16:12 Christopher Bergqvist wrote:
I agree that the current direction of D in this area is impressive. However, I fail to see a killer-feature in generating a lexer-parser generator at compile-time instead of run-time.
Feb 28 2012
CTFE parsing is especially useful for DSEL (Domain Specific Embedded Languages) or internal DSLs. The advantages are:
1. Syntactic errors (in the parsed constructs) are given out at compile time.
2. D reflections are available only at compile time. Referencing the variables/identifiers in the parsed subset of the DSL with the mainstream D code is impossible without reflections in place.
One of my goals while writing a CT grammar generator was to get a compile-time parse tree. Since it contains strings, it's easy to walk the tree, assembling strings as you go and generating the code you want (if/when you want to write code, that is). Strings are a D way to represent code, so any way to get structured strings at compile time opens whole vistas for code generation.
As for semantic actions, I added them in my code yesterday. I had hopes for using D's new anonymous function syntax (p => p), but by being anonymous, they cannot be inserted easily in string mixins (other modules do not know about __lambda1 and co). Anyway, I now have semantic actions at compile time. I used them to write a small (as in, veeery simple) XML parser: I use semantic actions to push node names while encountering them and pop the last tag while encountering a closing tag. It seems to work OK. That looks a bit like this (sorry, writing on a pad):
mixin(Grammar!("Doc <- Node*"
               "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
               "OpeningTag <- '<' Identifier '>'", OpeningAction,
               "ClosingTag <- `</` Identifier '>'", ClosingAction,
               "Text <- (!(OpeningTag / ClosingTag) _)+"));
The PEG for Text just means: any char, as long as it's not an OpeningTag nor a ClosingTag. PEGs use '.' to say 'any char', but I wanted to be able to deal with qualified names, so I chose '_' instead. When there is no action, it defaults to NoOp, as is the case for Doc, Node and Text.
I also added named captures (and capture comparison), to be able to say: "I want a sequence of equal chars":
"Equal <- _ first (_=first)*"
That is: any char, store it as "first", then take any number of chars, as long as their match is equal to first's match. All this works at CT.
I'm afraid being on holiday right now means I do not have easy access to GitHub (no git on a pad, and the computer I use to code right now does not have any network connection). I'll put all this online in a few days, because this must seem like the ramblings of a madman right now...
Philippe
Feb 29 2012
On 2/29/12 3:45 AM, Philippe Sigaud wrote:
mixin(Grammar!("Doc <- Node*"
               "Node <- OpeningTag (Text / Node)* ClosingTag", NodeAction,
               "OpeningTag <- '<' Identifier '>'", OpeningAction,
               "ClosingTag <- `</` Identifier '>'", ClosingAction,
               "Text <- (!(OpeningTag / ClosingTag) _)+"));
That looks about right, but still has a fair amount of noise. I think the approach of putting the entire grammar in one string is best.
Andrei
Feb 29 2012
On 29/02/2012 17:45, Andrei Alexandrescu wrote:
[...]
That looks about right, but still has a fair amount of noise. I think the approach of putting the entire grammar in one string is best.
It makes sense: that way an external file can be imported as a string and processed at compile time.
Mar 01 2012
That looks about right, but still has a fair amount of noise. I think the approach of putting the entire grammar in one string is best.
Yes, using one string is indeed better. That won't be too difficult to code. But how to associate actions with a rule, in that case? I mean, some rules will have actions, some not.
In my case, a parse tree is automatically generated (indeed, it's almost the only thing returned by a parse call, along with the recognized string and named captures). I have some additional syntax to 'drop' a node and such, similar to what you proposed a few posts ago.
Mar 01 2012
On Thursday, 1 March 2012 at 15:10:36 UTC, Philippe Sigaud wrote:
Yes, using one string is indeed better. That won't be too difficult to code.
I'm wondering if people have seen LPeg. It's a Lua library, but the design is interesting in that patterns are first-class objects which can be composed with operator overloading.
http://www.inf.puc-rio.br/~roberto/lpeg/
May 27 2012
On Sun, May 27, 2012 at 11:13 PM, John Belmonte <john neggie.net> wrote:
I'm wondering if people have seen LPeg. It's a Lua library, but the design is interesting in that patterns are first-class objects which can be composed with operator overloading.
http://www.inf.puc-rio.br/~roberto/lpeg/
This is exactly the approach followed by C++ Boost::Spirit, with alas the limitations from Algol-language operators and precedence: no postfix '*' (Kleene star) operator. It seems that LPeg has the same problem.
I played with this idea with my own Pegged (https://github.com/PhilippeSigaud/Pegged), but I wasn't quite convinced by the result, exactly for the reason above. Also, when looking at real-world Spirit examples, I was a bit disappointed by the resulting syntax: it's not *that* readable for complicated expressions. In fact, that's exactly why I decided to follow the DSL road with Pegged, so as to obtain exactly the PEG syntax, with the real operators and their standard precedence.
Btw, if you're interested in expression templates, I uploaded yesterday a very preliminary project on GitHub, to construct, and then manipulate, expressions:
auto e = expr(); // The null expression
auto f = 2*e + 1 - (e~"abc")*3; // f is an expression template that encodes the right-hand side.
Then, it's up to the user to define what the expression represents.
https://github.com/PhilippeSigaud/Expression-Tree
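For readers unfamiliar with expression templates, a stripped-down D sketch of the general idea (this is not the Expression-Tree API, just an illustration): each overloaded operator returns a new type that records the operation instead of computing a value.

    struct Node(string op, L, R) { L lhs; R rhs; }

    struct Leaf
    {
        auto opBinary(string op, T)(T rhs)
        {
            return Node!(op, Leaf, T)(this, rhs);
        }
    }

    unittest
    {
        auto e = Leaf();
        auto f = e + 1;     // Node!("+", Leaf, int)
        auto g = e ~ "abc"; // Node!("~", Leaf, string)
        static assert(is(typeof(f) == Node!("+", Leaf, int)));
    }

(A full library would also define opBinaryRight so that forms like 2*e work.)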
May 28 2012
On Monday, 28 May 2012 at 12:27:09 UTC, Philippe Sigaud wrote:
[...] In fact, that's exactly why I decided to follow the DSL road with Pegged, so as to obtain exactly the PEG syntax, with the real operators and their standard precedence.
Fair enough. I did notice the following in the markdown PEG though, which could benefit from first-class patterns:
HtmlBlockOpenAddress <- '<' Spnl ("address" / "ADDRESS") Spnl HtmlAttribute* '>'
HtmlBlockCloseAddress <- '<' Spnl '/' ("address" / "ADDRESS") Spnl '>'
HtmlBlockAddress <- HtmlBlockOpenAddress (HtmlBlockAddress / !HtmlBlockCloseAddress .)* HtmlBlockCloseAddress
HtmlBlockOpenBlockquote <- '<' Spnl ("blockquote" / "BLOCKQUOTE") Spnl HtmlAttribute* '>'
HtmlBlockCloseBlockquote <- '<' Spnl '/' ("blockquote" / "BLOCKQUOTE") Spnl '>'
HtmlBlockBlockquote <- HtmlBlockOpenBlockquote (HtmlBlockBlockquote / !HtmlBlockCloseBlockquote .)* HtmlBlockCloseBlockquote
. . .
May 28 2012
On Mon, May 28, 2012 at 11:42 PM, John Belmonte <john neggie.net> wrote:
Fair enough. I did notice the following in the markdown PEG though, which could benefit from first-class patterns: [...]
You're exactly right! Nice catch. I took this PEG from another github project (I hope I put the attribution somewhere?) and that was before Pegged accepted parameterized rules. I could indeed drastically factor the previous code.
May 29 2012
On Thu, Mar 01, 2012 at 04:10:26PM +0100, Philippe Sigaud wrote:
[...]
Yes, using one string is indeed better. That won't be too difficult to code. But how to associate actions with a rule, in that case? I mean, some rules will have actions, some not.
You could maybe just put D code in the grammar string, which gets compiled as a string mixin by CTFE?
T
--
"A man's wife has more power over him than the state has." -- Ralph Emerson
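A toy sketch of that suggestion, done entirely in CTFE; the "rule { action }" convention and the rule text are invented for illustration, not any library's syntax:

    import std.string : indexOf;

    // Invented convention: "PEG part { D action }" in one string.
    enum rule = `Pair <- String ':' Value { return tuple(key, value); }`;

    enum braceAt = rule.indexOf('{');
    enum pegPart = rule[0 .. braceAt];          // handed to the grammar engine
    enum action  = rule[braceAt + 1 .. $ - 1];  // mixed in as D code

    static assert(pegPart == `Pair <- String ':' Value `);
    static assert(action == ` return tuple(key, value); `);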
Mar 01 2012
But how to associate actions with a rule, in that case? I mean, some rules will have actions, some not.
You could maybe just put D code in the grammar string, which gets compiled as a string mixin by CTFE?
I could, but one of my driving goals while beginning this project a month ago was to use the shiny new lambda syntax directly :)
"rule1", o => o
"rule2", o => o
etc.
Mar 01 2012
On 2012-03-01 18:21, Philippe Sigaud wrote:
> I could, but one of my driving goals while beginning this project a
> month ago was to use the shiny new lambda syntax directly :)
>
> "rule1", o => o
> "rule2", o => o
> etc.
Maybe we can take some ideas from the CoffeeScript compiler (written in CoffeeScript):
Grammar:
http://jashkenas.github.com/coffee-script/documentation/docs/grammar.html
Lexer:
http://jashkenas.github.com/coffee-script/documentation/docs/lexer.html
--
/Jacob Carlborg
Mar 01 2012
I'm afraid being on holiday right now means I do not have easy access to GitHub (no git on a pad, and the computer I use to code right now does not have any network connection). I'll put all this online in a few days, because this must seem like the ramblings of a madman right now...
I am eagerly waiting for you to upload the stuff on GitHub.
Regards
- Puneet
Feb 29 2012
On 2/28/12 7:16 PM, Christopher Bergqvist wrote:
[...] What am I failing to pick up on?
Barrier of entry and granularity of approach, I think.
Currently if one wants to parse some simple grammar, there are options such as (a) do it by hand, (b) use boost::spirit, or (c) use lex/yacc. Parsing by hand has the obvious disadvantages. Using boost::spirit has a steep learning curve and tends to create very contorted grammar representations, full of representation noise, and scales very poorly. Using lex/yacc is hamfisted - there's an additional build step, generated files to deal with, and the related logistics, which make lex/yacc a viable choice only for "big" grammars.
An efficient, integrated parser generator would lower the barrier of entry dramatically - if we play our cards right, even a sprintf specifier string could be parsed simpler and faster using an embedded grammar, instead of painfully writing the recognizer by hand. Parsing config files, XML, JSON, CSV, various custom file formats and many others - all would be a few lines away. Ideally a user who has a basic understanding of grammars should have an easier time using a small grammar to parse simple custom formats than writing the parsing code by hand.
Andrei
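As a taste of the sprintf point: even a hand-rolled CTFE checker is only a few lines, and a grammar-driven one would replace the body of this function. validSpec is a made-up name, not a Phobos function, and it accepts only a tiny subset of specifiers:

    bool validSpec(string fmt)
    {
        for (size_t i = 0; i < fmt.length; ++i)
        {
            if (fmt[i] != '%') continue;
            if (++i >= fmt.length) return false;   // dangling '%'
            switch (fmt[i])
            {
                case 'd', 's', 'f', '%': break;    // known specifiers
                default: return false;
            }
        }
        return true;
    }

    static assert( validSpec("x = %d, y = %s"));
    static assert(!validSpec("broken %q"));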
Feb 29 2012
On Wednesday, 29 February 2012 at 16:41:22 UTC, Andrei Alexandrescu wrote:
[...]
Thanks for your response. The lowered barrier of entry in parsing something like a customized JSON format or config files is nice, and something I could see myself use. I'm still skeptical about the level of "killer-featureness", but I would be glad to be proven wrong.
Feb 29 2012
On Wed, Feb 29, 2012 at 10:41:22AM -0600, Andrei Alexandrescu wrote:
[...]
An efficient, integrated parser generator would lower the barrier of entry dramatically - if we play our cards right, even a sprintf specifier string could be parsed simpler and faster using an embedded grammar, instead of painfully writing the recognizer by hand.
I see that std.stdio.writef already parses format strings at compile-time, though the code is somewhat ugly. :) It would be ideal if that could be done just by a series of EBNF/regex declarations.
Parsing config files, XML, JSON, CSV, various custom file formats and many others - all would be a few lines away. Ideally a user who has a basic understanding of grammars should have an easier time using a small grammar to parse simple custom formats than writing the parsing code by hand.
[...]
This is definitely something worth having. It has always been my dream that an ideal programming language should let you write grammar rules and pretty much autogen the code for you. There shouldn't need to be external utilities with the associated messy build rules, auxiliary files and other such things, esp. if you just need a one-time small parser for a very specific task.
T
--
Why ask rhetorical questions? -- JC
Feb 29 2012
Why do you need grammar rules for json? Just use a custom-defined struct and pass it to a toJson function like ae's json module. :)
Feb 29 2012
(2012/02/28 16:59), Andrei Alexandrescu wrote:
[...]
Hello Andrei,
Certainly, I haven't yet written the documentation for my library, ctpg. (But a few examples are available here: https://github.com/youkei/ctpg/tree/master/example)
So, I'd like to describe it here.
To be honest, ctpg is inspired by one module of Scala's standard library, Parser Combinators. One of the differences between Parser Combinators and ctpg is that Parser Combinators generates parsers at run-time, but ctpg generates parsers at compile-time by the power of CTFE and mixin.
A parser generated by ctpg is a recursive descent parser, so it does lexical analysis and parsing at a time. And the type of input which it can accept is string, wstring, dstring and ForwardRange whose element type is char, wchar or dchar. So, dividing lexical analysis and parsing into two processes is difficult in ctpg.
Importantly, a parser generated by ctpg is statically typed, as one of the examples, "parsing simple arithmetic expression", shows. Generally a parser generated by another tool and accepting tokens returns an abstract syntax tree, but this one returns the evaluated value, as the example shows. In other words, it does lexical analysis, parsing and (type) converting at a time. If you simply want an abstract syntax tree, it may be a little painful to use ctpg.
That's all I'd like to say about ctpg here.
By the way, I wholeheartedly agree with Andrei's opinion. But currently, I think, CTFE is limited because of this issue: http://d.puremagic.com/issues/show_bug.cgi?id=6498 . Without dealing with this issue, we couldn't bring out the full potential of CTFE.
Thanks,
Hisayuki Mima
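For reference, the "parsing simple arithmetic expression" example mentioned above boils down to the following (the same grammar reappears in a post below; the compile-time check assumes, as the project advertises, that the generated parser itself runs under CTFE):

    import ctpg;
    import std.array : join;
    import std.conv : to;

    mixin(generateParsers(q{
        int root = mulExp $;
        int mulExp =
              primary !"*" mulExp >> (lhs, rhs){ return lhs * rhs; }
            / primary;
        int primary = !"(" mulExp !")" / [0-9]+ >> join >> to!int;
    }));

    // Note the static typing: parse!root yields an int directly,
    // not a syntax tree.
    enum answer = parse!root("5*8");
    static assert(answer == 40);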
Feb 28 2012
(2012/02/28 23:58), Hisayuki Mima wrote:
[...]
A minimum of documentation is now available here:
https://github.com/youkei/ctpg/wiki/Home-en
Hisayuki Mima
Feb 29 2012
On 2/29/12 10:03 AM, Hisayuki Mima wrote:
A minimum of documentation is now available here:
https://github.com/youkei/ctpg/wiki/Home-en
Great start, you may want to aggressively add examples. Then people would love using ctpg and will write more documentation, tutorials etc.
Andrei
Feb 29 2012
On 2/28/12 8:58 AM, Hisayuki Mima wrote:
Certainly, I haven't yet written the documentation for my library, ctpg. (But a few examples are available here: https://github.com/youkei/ctpg/tree/master/example)
Nice! I have a few comments. I should say that they entail a fair amount of work.
The current setup makes things very difficult for a common case - generating a parse tree. The user would need to insert a lot of custom code to create the appropriate nodes, but that's very easy for the parser because it already has the components. The parser should have a "build AST" option, in which case it should build all nodes without much effort from the user. That's what ANTLR does, and it has a special operator "^" to indicate where the root of the tree should be (http://www.antlr2.org/doc/trees.html). So here's what I think your examples should look like:
The syntax could be changed a bit - it should be less like D and more like PEG. The "$" is implicit and shouldn't be needed. The ";" is a useful terminator for large productions spanning more than one line, so let's keep it. I don't understand the use of "!"; for example, the PEG for expression parsing at http://en.wikipedia.org/wiki/Parsing_expression_grammar is:
Value ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Sum ← Product (('+' / '-') Product)*
Expr ← Sum
but your grammar has:
int primary = !"(" addExp !")" / intLit_p;
whereas it seems to me it should be
int primary = "(" addExp ")" / intLit_p;
No?
Here's what I think a parser that also builds an AST looks like:
mixin generateParsers!(
    Options.createAST,
    q{
        root <- addExp;
        addExp <- mulExp (("+" / "-")^ addExp)?
        mulExp <- primary (("*" / "/")^ mulExp)?
        primary <- "(" addExp ")"% / intLit_p;
    }
);
I used the following notation: a node suffixed with "^" becomes the root of the produced AST, and a node suffixed with "%" will be dropped altogether (that's because ")" is not needed in the AST after the expression has been parsed). With only three characters I informed the parser what the shape of the AST will be.
At this point calling parse!root("1+2*3") would return an object of type ASTNode!"addExp", which in turn has two children, lhs and rhs. The lhs node has type ASTNode!"intLit_p" which has inside a value "1". The rhs node has type ASTNode!"mulExp", and that in turn has two children nodes lhs and rhs, both of type ASTNode!"intLit_p", each with its own value ("2" and "3", respectively). All these nodes are dynamically created by the parsing process and inherit a common type, e.g. ASTNode!"" or some type ASTRootNode.
[...] A parser generated by ctpg is a recursive descent parser, so it does lexical analysis and parsing at a time.
Oh, I forgot this property of PEGs - integrated lexing and parsing. So no need for a separate lexer!
And the type of input which it can accept is string, wstring, dstring and ForwardRange whose element type is char, wchar or dchar.
Great. I think we should actually define the notion of BufferedRange. I'll get to that topic later.
So, dividing lexical analysis and parsing into two processes is difficult in ctpg.
Yes, sorry I forgot about that!
Importantly, a parser generated by ctpg is statically typed [...] If you simply want an abstract syntax tree, it may be a little painful to use ctpg.
Why would it be difficult to integrate AST creation with parsing? I'm not sure I understand this. My understanding is that you should be able to add a couple of simple operators ("^" and "%" as described above) to inform AST creation, and you're done!
Thanks,
Andrei
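None of this exists yet; a rough sketch of the node types the proposal implies (the ASTNode and ASTRootNode names are from the post, everything else is guessed):

    class ASTRootNode
    {
        ASTRootNode[] children;
    }

    class ASTNode(string rule) : ASTRootNode
    {
        string match;   // matched text, e.g. "1" for an intLit_p node
        this(string m) { match = m; }
    }

    unittest
    {
        // Hand-built stand-in for what parse!root("1+2*3") would return.
        auto one = new ASTNode!"intLit_p"("1");
        auto mul = new ASTNode!"mulExp"("2*3");
        mul.children = [new ASTNode!"intLit_p"("2"), new ASTNode!"intLit_p"("3")];
        auto root = new ASTNode!"addExp"("1+2*3");
        root.children = [one, mul];
        assert(root.children.length == 2);
    }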
Feb 29 2012
(2012/03/01 1:19), Andrei Alexandrescu wrote:
[...]
Certainly, the "build AST" option is very interesting. I don't know whether building an AST is a common case or not, because I'm unfamiliar with syntax analysis. But I'd like to complete ctpg as a D version of boost::spirit::qi first.
Currently, ctpg can probably be used like boost::spirit::qi. (Both do typed syntax analysis.) There are some points where ctpg is superior to boost::spirit::qi. For example, the source code written using ctpg is simple because it is like PEG. In addition, I'm planning to improve error messages by making excellent use of #line and __LINE__. I'd like to write a library which is functionally the same as boost::spirit::qi and written in D style. Hence, I'd like to make implementing the "build AST" option, i.e. untyped syntax analysis, a low priority.
What is your idea?
P.S. The prefix operator "!" is the same as the postfix operator "%" which you described. The types which the parsers on both sides of the "/" operator have should be compatible. The type of "(" addExp ")" is Tuple!(string, int, string) and the type of intLit_p is int. They are not compatible. The type of !"(" addExp !")" is int. So the types of !"(" addExp !")" and intLit_p are compatible.
Thanks,
Hisayuki Mima
Mar 04 2012
On 3/4/12 8:34 AM, Hisayuki Mima wrote:Certainly, the "build AST" option is very interesting. I don't know whether building an AST is a common case or not, because I'm unfamiliar with syntax analysis. But I'd like to complete ctpg as a D version of boost::spirit::qi first. Currently, ctpg can probably be used like boost::spirit::qi. (Both do typed syntax analysis.) There are some points where ctpg is superior to boost::spirit::qi. For example, the source code written using ctpg is simple because it is like a PEG. In addition, I'm planning to improve error messages by making excellent use of #line and __LINE__. I'd like to write a library that is functionally the same as boost::spirit::qi but written in D style. Hence, I'd like to make implementing the "build AST" option, i.e. untyped syntax analysis, a low priority. What is your idea?It's your project so you define what's fun to work on. Let me just say that I have worked on a lot of parsing-related stuff, and most often projects that start as "this grammar is simple enough, let's just do processing during parsing" ultimately require a rewrite to generate an AST, process it, and then use it for computation. You can get away without an AST for the simplest grammars (e.g. printf etc.), but in that case you compete with hand-written code. I wouldn't think of parsing a serious grammar with more than a few productions without generating an AST. My intuition is that your work will best excel at those grammars. Andrei
Mar 04 2012
On Sunday, 4 March 2012 at 21:00:00 UTC, Andrei Alexandrescu wrote:You can get away without an AST for the simplest grammars (e.g. printf etc.), but in that case you compete with hand-written code. I wouldn't think of parsing a serious grammar with more than a few productions without generating an AST. My intuition is that your work will best excel at those grammars.This chimes nicely with the comments made at the Going Native conference with regard to clang generating an AST and MSVC and GCC not. Herb Sutter referred to it as "AST envy" (as in, he wished MSVC took the AST route). I only mention it because maybe not everyone here watched that video. Andrei himself was at the conference, of course! Regards, Ben
Mar 05 2012
Generally, a parser generated by another tool and accepting tokens returns an abstract syntax tree, but it returns the evaluated value in the example. In other words, it does lexical analysis, parsing and (type) converting at the same time. If you simply want an abstract syntax tree, it may be a little pain to use ctpg.Hello Youkei I am trying to use CTPG for compile-time parsing for a DSL I am working on. I have tried the examples you created in the examples directory. I would like the parser to produce some side effects. For this purpose, I tried including the parser mixin into a class, but I got a strange error saying: Error: need 'this' to access member parse I have copied the code I am trying to compile at the end of the email. Let me know what I could be doing wrong here. Regards - Puneet

import ctpg;
import std.array: join;
import std.conv: to;

class Foo {
    int result;

    mixin(generateParsers(q{
        int root = mulExp $;
        int mulExp =
              primary !"*" mulExp >> (lhs, rhs){ return lhs * rhs; }
            / primary;
        int primary = !"(" mulExp !")" / [0-9]+ >> join >> to!int;
    }));

    void frop() {
        result = parse!root("5*8");
    }
}

void main(){
    Foo foo = new Foo();
    foo.frop();
}
May 27 2012
(2012年05月28日 02:31), d coder wrote:Generally, a parser generated by another tool and accepting tokens returns an abstract syntax tree, but it returns the evaluated value in the example. In other words, it does lexical analysis, parsing and (type) converting at the same time. If you simply want an abstract syntax tree, it may be a little pain to use ctpg. Hello Youkei I am trying to use CTPG for compile-time parsing for a DSL I am working on. I have tried the examples you created in the examples directory. I would like the parser to produce some side effects. For this purpose, I tried including the parser mixin into a class, but I got a strange error saying: Error: need 'this' to access member parse I have copied the code I am trying to compile at the end of the email. Let me know what I could be doing wrong here. Regards - Puneet

import ctpg;
import std.array: join;
import std.conv: to;

class Foo {
    int result;

    mixin(generateParsers(q{
        int root = mulExp $;
        int mulExp =
              primary !"*" mulExp >> (lhs, rhs){ return lhs * rhs; }
            / primary;
        int primary = !"(" mulExp !")" / [0-9]+ >> join >> to!int;
    }));

    void frop() {
        result = parse!root("5*8");
    }
}

void main(){
    Foo foo = new Foo();
    foo.frop();
}

Hello Puneet, Thank you for your report. I fixed it. Now CTPG creates a static function as a parser. But I'm afraid this fix doesn't help you, because I don't understand what side effect you mean. Can you show me some examples that include the side effect? Thanks, Hisayuki Mima
May 31 2012
I would like the parser to produce some side effects. For this purpose, I tried including the parser mixin into a class, but I got a strange error saying: Error: need 'this' to access member parseOk. I see my folly. At compile time, there would not be any "this" since the class has not been instantiated yet. I will have to think of other solutions. Basically, I am trying to use the parser to create functions that I can use at run time. But I wanted to create two functions from the same parser. I have succeeded in creating one function. I do not want to create two separate parsers because each of these parsers would add to the memory footprint of the compiler. Any ideas? Maybe I could use tuples here? Regards - Puneet
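A sketch of one way out, assuming the CTPG fix mentioned above (the parser becoming a static function) and reusing the grammar from the example: mix the parser in once at module scope, where no "this" exists, and let any number of run-time functions share it. The two helper function names below are made up.

import ctpg;
import std.array : join;
import std.conv : to;

// One parser, mixed in once at module scope.
mixin(generateParsers(q{
    int root = mulExp $;
    int mulExp =
          primary !"*" mulExp >> (lhs, rhs){ return lhs * rhs; }
        / primary;
    int primary = !"(" mulExp !")" / [0-9]+ >> join >> to!int;
}));

// Two run-time functions sharing the single generated parser.
int evalProduct(string src) { return parse!root(src); }
int evalTwice(string src)   { return parse!root(src) * 2; }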
May 27 2012
On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:I'm starting a new thread on this because I think the matter is of strategic importance.+1.We all felt for a long time that there's a lot of potential in CTFE, and potential applications have been discussed more than a few times, ranging from formatting strings parsed to DSLs and parser generators.CTFE is one of the big features that attracted me to D. [...]This is the kind of stuff I've had an eye on for the longest time. I'm saying it's of strategic importance because CTFE technology, though not new and already available with some languages, has unique powers when combined with other features of D. With CTFE we get to do things that are quite literally impossible to do in other languages.For example?We need to have a easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself.Definitely agree. I frequently work with code that requires some amount of lexing/parsing; having something like this in Phobos would be absolutely awesome.What do you all think? Let's get this project off the ground![...] +1. On that note, I'd like to mention that CTFE currently has some parts that need more work: (1) We need compile-time alternatives to the functions in std.math so that functions that are currently only implemented in asm can be used in CTFE. (2) It should be possible to generate AA literals from CTFE and use them to initialize things like parsing/lexing lookup tables, etc.. While this in itself is not a major critical issue, it would make D look bad if we tout D's CTFE capabilities yet at the same time AA's have to be created by static this() at runtime. T -- Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert
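For reference, point (2) is why lookup tables today get filled in by a module constructor instead of being compile-time constants; the idiom looks like this (the table contents are made up):

// The AA literal cannot be computed at compile time yet, so the
// table is built once at program startup.
immutable int[string] tokenIds;

shared static this()
{
    tokenIds = ["if": 1, "else": 2, "while": 3];
}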
Feb 28 2012
On 2/28/12 9:57 AM, H. S. Teoh wrote:On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:For example integrated lexer and parser generators! AndreiThis is the kind of stuff I've had an eye on for the longest time. I'm saying it's of strategic importance because CTFE technology, though not new and already available with some languages, has unique powers when combined with other features of D. With CTFE we get to do things that are quite literally impossible to do in other languages.For example?
Feb 29 2012
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:jiljf7$18em$1 digitalmars.com...On 2/28/12 9:57 AM, H. S. Teoh wrote:...With statically-checked symbol names that can include arbitrary characters: Sym!"+" Sym!"{" Sym!"<Foo>" etc...On Tue, Feb 28, 2012 at 01:59:21AM -0600, Andrei Alexandrescu wrote:For example integrated lexer and parser generators!This is the kind of stuff I've had an eye on for the longest time. I'm saying it's of strategic importance because CTFE technology, though not new and already available with some languages, has unique powers when combined with other features of D. With CTFE we get to do things that are quite literally impossible to do in other languages.For example?
Feb 29 2012
Something similar to Boost::Spirit2 but with sane syntax and better debugging would have been awesome.
Feb 28 2012
On 02/28/2012 08:23 AM, mist wrote:Something similar to Boost::Spirit2 but with sane syntax and better debugging would have been awesome.How about not having to encode the syntax as arithmetic expressions? D can do string parsing of eBNF at compile time if you want that.
Feb 28 2012
On 2/29/12 12:07 AM, bcs wrote:On 02/28/2012 08:23 AM, mist wrote:I think that's what he meant by "sane syntax". AndreiSomething similar to Boost::Spirit2 but with sane syntax and better debugging would have been awesome.How about not having to encode the syntax as arithmetic expressions? D can do string parsing of eBNF at compile time if you want that.
Feb 29 2012
On 2/28/12 10:23 AM, mist wrote:Something similar to Boost::Spirit2 but with sane syntax and better debugging would have been awesome.Exactly. Boost::Spirit2 is a perfect illustration of a dancing bear - using inadequate technology for a specific purpose. Andrei
Feb 29 2012
On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I'm starting a new thread on this because I think the matter is of strategic importance. We all felt for a long time that there's a lot of potential in CTFE, and potential applications have been discussed more than a few times, ranging from formatting strings parsed to DSLs and parser generators. Such feats are now approaching fruition because a number of factors converge: * Dmitry Olshansky's regex library (now in Phobos) generates efficient D code straight from regexen. * The scope and quality of CTFE has improved enormously, making more advanced uses possible and even relatively easy (thanks Don!) * Hisayuki Mima implemented a parser generator in only 3000 lines of code (sadly, no comments or documentation yet :o)) * With the occasion of that announcement we also find out Philippe Sigaud has already a competing design and implementation of a parser generator. This is the kind of stuff I've had an eye on for the longest time. I'm saying it's of strategic importance because CTFE technology, though not new and already available with some languages, has unique powers when combined with other features of D. With CTFE we get to do things that are quite literally impossible to do in other languages. We need to have an easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself. What do you all think? Let's get this project off the ground! Thanks, AndreiI wrote a generic lexer generator some time ago. It already led to some compiler O(N^2) optimizations, because the token declarations sneak into the mangling :(. https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexer I've ditched an attempt to write a parser combinator. It was based on expression templates and ended up at spirit craziness. <PERSONAL OPINION The hassle of providing good error messages and synthesizing parse results in a generic parser outweighs the benefit of a declarative grammar. /PERSONAL OPINION> A lot becomes feasible from the CTFE perspective; apart from some bugfixes, I currently only miss exp and log. I do not agree that it's the right moment to write a parser though. It hits Phobos' two biggest shortcomings: the lack of a good I/O system and the missing allocators. Any parser written now will either risk not playing nice with ranges or have to come up with its own buffering again.
Feb 28 2012
On 28.02.2012 22:46, Martin Nowak wrote:On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I'm starting a new thread on this because I think the matter is of strategic importance. We all felt for a long time that there's a lot of potential in CTFE, and potential applications have been discussed more than a few times, ranging from formatting strings parsed to DSLs and parser generators. Such feats are now approaching fruition because a number of factors converge: * Dmitry Olshansky's regex library (now in Phobos) generates efficient D code straight from regexen. * The scope and quality of CTFE has improved enormously, making more advanced uses possible and even relatively easy (thanks Don!) * Hisayuki Mima implemented a parser generator in only 3000 lines of code (sadly, no comments or documentation yet :o)) * With the occasion of that announcement we also find out Philippe Sigaud has already a competing design and implementation of a parser generator. This is the kind of stuff I've had an eye on for the longest time. I'm saying it's of strategic importance because CTFE technology, though not new and already available with some languages, has unique powers when combined with other features of D. With CTFE we get to do things that are quite literally impossible to do in other languages. We need to have an easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself. What do you all think? Let's get this project off the ground!To be truly successful it should have some distinct properties, like:
- being faster than or identical to the handwritten stuff we already have (like e.g. std.json); the recent INI parser proposal reminds us it should be an easy pick for a general parser.
- being flexible: the result need not be predefined AST tree nodes. Hisayuki Mima's parser shows some nice movement in this direction.
- having reasonable compile times and memory consumption (though that will only improve over time).
Recalling the EBNF parser idea that I ran with before finally being dragged down by real life: roughly, I thought to follow a hybrid LL(*) approach; while I had a solid plan on doing a DFA for lexer and parser lookahead, the other things were more or less floating.Cool stuff.Thanks, AndreiI wrote a generic lexer generator some time ago. It already led to some compiler O(N^2) optimizations, because the token declarations sneak into the mangling :(. https://gist.github.com/1255439 - lexer generatorhttps://gist.github.com/1262321 - complete and fast D lexer I've ditched an attempt to write a parser combinator. It was based on expression templates and ended up at spirit craziness. <PERSONAL OPINION The hassle of providing good error messages and synthesizing parse results in a generic parser outweighs the benefit of a declarative grammar.
/PERSONAL OPINION>Error reporting is the weak spot; still no good ideas on solving that.A lot becomes feasible from the CTFE perspective; apart from some bugfixes, I currently only miss exp and log.Reminds me that I have to dig up a wonderful CTFE bug in std.regex that's been somehow worked around by duping arrays since August.I do not agree that it's the right moment to write a parser though. It hits Phobos' two biggest shortcomings: the lack of a good I/O system and the missing allocators. Any parser written now will either risk not playing nice with ranges or have to come up with its own buffering again.All good points. There is a prototype of an interactive regex matcher that works directly on a stream (buried in std.regex); it even passed dry-run unittests back then. Though I had to postpone it till I/O is sorted out. I really loved Steven's design with its easy access to the buffer and well thought out primitives; hope it will come about sometime soon. -- Dmitry Olshansky
Feb 28 2012
On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:On 28.02.2012 22:46, Martin Nowak wrote:[...]On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:[...]We need to have an easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself.To be truly successful it should have some distinct properties, like: - being faster than or identical to the handwritten stuff we already have (like e.g. std.json); the recent INI parser proposal reminds us it should be an easy pick for a general parser.We probably have to use string mixins to achieve this. OTOH, a generic parser generator is unlikely to be able to handle special case optimizations such as operator-precedence parsers (http://en.wikipedia.org/wiki/Operator-precedence_parser), because operator precedence is only implied by grammar rules, and to optimize for this case requires explicitly specifying each operator's precedence. There is some price to be paid for generality, unfortunately.- being flexible: the result need not be predefined AST tree nodes. Hisayuki Mima's parser shows some nice movement in this direction.In my mind, D's delegates are ideal for this sort of thing. Each grammar rule will have an associated callback. We can provide default callbacks that generate AST nodes, but the user can override them with his own delegates to do whatever they want (evaluate expressions/commands on the fly, build only AST subtrees of interest, etc.).- having reasonable compile times and memory consumption (though that will only improve over time).This will be good motivation to improve dmd's CTFE implementation. ;-) [...]Error reporting is the weak spot; still no good ideas on solving that.This is not specific to D, though, right? [...][...] Are you talking about std.io? I glanced over that code briefly recently; it looks very promising. Hope it will make it into Phobos soon. We *need* to integrate streams, ranges, and the stdio writeln, etc., stuff. The current situation to an outsider looks quite bad (streams not compatible with stdio, no easy way to substitute a string buffer for files, etc.). We also need automatic UTF encoding detection for all input streams/files/ranges/whatever. Including mismatching endianness (i.e. auto byte-swapping). In a recent personal project to write a D lexer (as an exercise to help me learn D better), I ended up having to roll my own input stream abstraction in order to handle automatic encoding detection. Which is quite poorly written, I have to admit. Something like this needs to be standardized in Phobos with a well-written implementation. T -- "You know, maybe we don't *need* enemies." "Yeah, best friends are about all I can take." -- Calvin & HobbesI do not agree that it's the right moment to write a parser though. It hits Phobos' two biggest shortcomings: the lack of a good I/O system and the missing allocators. Any parser written now will either risk not playing nice with ranges or have to come up with its own buffering again.All good points. There is a prototype of an interactive regex matcher that works directly on a stream (buried in std.regex); it even passed dry-run unittests back then.
Though I had to postpone it till I/O is sorted out. I really loved Steven's design with its easy access to the buffer and well thought out primitives; hope it will come about sometime soon.
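A small sketch of the delegate-per-rule idea from the post above, with hypothetical names; a generated parser would invoke the registered delegate (or a default AST-building one) each time a rule reduces:

alias Action = int delegate(int lhs, int rhs);

void demo()
{
    // Map each production name to a user-supplied semantic action.
    Action[string] actions = [
        "Sum":     delegate(int l, int r) { return l + r; },
        "Product": delegate(int l, int r) { return l * r; },
    ];
    // The parser would call these as reductions happen; here we
    // emulate reducing "1+2*3" by hand.
    assert(actions["Sum"](1, actions["Product"](2, 3)) == 7);
}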
Feb 28 2012
On 29.02.2012 0:19, H. S. Teoh wrote:On Tue, Feb 28, 2012 at 11:52:52PM +0400, Dmitry Olshansky wrote:Well, I see no problem in defining a separate OperatorGrammar constructor that will take a base Unit production and a bunch of operators with arity, plus brackets with their priority; that is plenty. Semantic actions are then trivially defined for each of the operators respectively. Some way of stacking grammars is bonus points for the project. As Andrei started with lexer+parser, I see no problem with it being lexer+parser(+parser)*.On 28.02.2012 22:46, Martin Nowak wrote:[...]On Tue, 28 Feb 2012 08:59:21 +0100, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:[...]We need to have an easy-to-use, complete, seamless, and efficient lexer-parser generator combo in Phobos, pronto. The lexer itself could use a character-level PEG or a classic automaton, and emit tokens for consumption by a parser generator. The two should work in perfect tandem (no need for glue code). At the end of the day, defining a complete lexer+parser combo for a language should be just a few lines longer than the textual representation of the grammar itself.To be truly successful it should have some distinct properties, like: - being faster than or identical to the handwritten stuff we already have (like e.g. std.json); the recent INI parser proposal reminds us it should be an easy pick for a general parser.We probably have to use string mixins to achieve this. OTOH, a generic parser generator is unlikely to be able to handle special case optimizations such as operator-precedence parsers (http://en.wikipedia.org/wiki/Operator-precedence_parser), because operator precedence is only implied by grammar rules, and to optimize for this case requires explicitly specifying each operator's precedence. There is some price to be paid for generality, unfortunately.Yup, delegates and mixins are the tools. Moreover, lambdas shine here: less clutter + type deduction.- being flexible: the result need not be predefined AST tree nodes. Hisayuki Mima's parser shows some nice movement in this direction.In my mind, D's delegates are ideal for this sort of thing. Each grammar rule will have an associated callback. We can provide default callbacks that generate AST nodes, but the user can override them with his own delegates to do whatever they want (evaluate expressions/commands on the fly, build only AST subtrees of interest, etc.).Such a framework relying on mixins is bound to produce awful error messages at compile time, unless explicitly stuffed up to the waterline with some kind of static assert(__traits(compiles,...),"Error x + info");- having reasonable compile times and memory consumption (though that will only improve over time).This will be good motivation to improve dmd's CTFE implementation. ;-) [...]Error reporting is the weak spot; still no good ideas on solving that.This is not specific to D, though, right?[...]Aye, that's it. Not to mention DIP9, though that's another topic in itself.[...] Are you talking about std.io? I glanced over that code briefly recently; it looks very promising. Hope it will make it into Phobos soon. We *need* to integrate streams, ranges, and the stdio writeln, etc., stuff. The current situation to an outsider looks quite bad (streams not compatible with stdio, no easy way to substitute a string buffer for files, etc.).I do not agree that it's the right moment to write a parser though. It hits Phobos' two biggest shortcomings: the lack of a good I/O system and the missing allocators.
Any parser written now will either risk not playing nice with ranges or have to come up with its own buffering again.All good points. There is a prototype of an interactive regex matcher that works directly on a stream (buried in std.regex); it even passed dry-run unittests back then. Though I had to postpone it till I/O is sorted out. I really loved Steven's design with its easy access to the buffer and well thought out primitives; hope it will come about sometime soon.We also need automatic UTF encoding detection for all input streams/files/ranges/whatever. Including mismatching endianness (i.e. auto byte-swapping). In a recent personal project to write a D lexer (as an exercise to help me learn D better), I ended up having to roll my own input stream abstraction in order to handle automatic encoding detection. Which is quite poorly written, I have to admit. Something like this needs to be standardized in Phobos with a well-written implementation. T -- Dmitry Olshansky
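The BOM-sniffing half of that detection is straightforward; a self-contained sketch (input without a BOM would need heuristics on top):

enum Encoding { utf8, utf16le, utf16be, utf32le, utf32be }

// Order matters: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
Encoding detectEncoding(const(ubyte)[] b)
{
    static immutable ubyte[] u32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] u32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] u8bom = [0xEF, 0xBB, 0xBF];

    if (b.length >= 4 && b[0 .. 4] == u32be) return Encoding.utf32be;
    if (b.length >= 4 && b[0 .. 4] == u32le) return Encoding.utf32le;
    if (b.length >= 3 && b[0 .. 3] == u8bom) return Encoding.utf8;
    if (b.length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return Encoding.utf16be;
    if (b.length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return Encoding.utf16le;
    return Encoding.utf8; // no BOM: assume UTF-8
}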
Feb 29 2012
On 2/29/12 3:48 AM, Dmitry Olshansky wrote:Some way of stacking grammars is bonus points for the project. As Andrei started with lexer+parser, I see no problem with it being lexer+parser(+parser)*.What does stacking grammars mean? Anyhow, one thing that would become important is tree walkers - e.g. you have the tree and you define code for evaluating it etc.Such a framework relying on mixins is bound to produce awful error messages at compile time, unless explicitly stuffed up to the waterline with some kind of static assert(__traits(compiles,...),"Error x + info");We need to figure out ways to make that simpler. Andrei
Feb 29 2012
On 29.02.2012 20:47, Andrei Alexandrescu wrote:On 2/29/12 3:48 AM, Dmitry Olshansky wrote:In general it should be read as using one parser as the lexer for another parser. But there are some nice ways to integrate another parser within a recursive descent parser. Let's assume a simple recursive descent parser that uses an operator precedence grammar. Suppose we have the main grammar:

Declaration = Type Identifier;
Statement = 'if' '(' Expr ')' | Expr
Type = int | float | ...

Where Expr is connected to the second grammar expressed as ... Ok, let's sketch it, skipping semantic actions:

operatorGrammar!q{
    Expr
    Unit: Constant | Identifier // assume we have a lexer
    Operators:
    Not, !, prefix, 4
    UnMinus, -, prefix, 4
    Mul, *, binary, 3
    Div, /, binary, 3
    Plus, +, binary, 2
    BiMinus, -, binary, 2
    Index, [], brace-like postfix
    Braces, (), brace-like
};

Effectively the operator grammar for expressions parses the "token" Expr. It's easy in recursive descent because it can just try to call this parser, then try the other alternative if it fails. I'm not sure this stacking adds that much, but it's something to keep in mind.Some way of stacking grammars is bonus points for the project. As Andrei started with lexer+parser, I see no problem with it being lexer+parser(+parser)*.What does stacking grammars mean?Anyhow, one thing that would become important is tree walkers - e.g. you have the tree and you define code for evaluating it etc.Yes, it's a must-have if the AST is generated generically via annotations.-- Dmitry OlshanskySuch a framework relying on mixins is bound to produce awful error messages at compile time, unless explicitly stuffed up to the waterline with some kind of static assert(__traits(compiles,...),"Error x + info");We need to figure out ways to make that simpler. Andrei
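For reference, the algorithm such an operator table typically drives is precedence climbing; a tiny self-contained sketch over digits-only input, without error handling:

import std.ascii : isDigit;

int parsePrimary(ref string s)
{
    int v = 0;
    while (s.length && isDigit(s[0]))
    {
        v = v * 10 + (s[0] - '0');
        s = s[1 .. $];
    }
    return v;
}

int parseExpr(ref string s, int minPrec = 0)
{
    int lhs = parsePrimary(s);
    while (s.length)
    {
        immutable char op = s[0];
        immutable prec = (op == '+' || op == '-') ? 2
                       : (op == '*' || op == '/') ? 3 : -1;
        if (prec < minPrec) break; // not an operator, or binds too weakly
        s = s[1 .. $];
        immutable rhs = parseExpr(s, prec + 1); // +1: left-associative
        switch (op)
        {
            case '+': lhs += rhs; break;
            case '-': lhs -= rhs; break;
            case '*': lhs *= rhs; break;
            case '/': lhs /= rhs; break;
            default: assert(0);
        }
    }
    return lhs;
}

unittest
{
    string s = "1+2*3";
    assert(parseExpr(s) == 7);
}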
Feb 29 2012
On 2/28/12 1:52 PM, Dmitry Olshansky wrote:To be truly successful it should have some distinct properties like: - being faster or identical to handwritten stuff we already have (like e.g. std.json ), reminds us of recent INI parser proposal, should be an easy pick for general parser.Yes.- be flexible, the result need not be a predefined AST tree nodes, Hisayuki Mima's parser shows some nice movement in this directionYes. But AST creation should be supported without any custom code.- have reasonable compile times and memory consumption (though it will only improve over time)Yes. I guess PEGs have problems there.Recalling EBNF parser idea that I run with before finally being dragged down by real life. Roughly I thought to follow hybrid LL(*) aproach, while I had a solid plan on doing DFA for lexer and parser lookahead, the other things were more or less floating.Well, maybe you could integrate that within your up-and-coming research. Grammars have been considered a solved problem every five years since 1970, and there's always new work coming :o).There is prototype of interactive regex matcher that works directly on stream (buried in std.regex), it even passed dry-run unittests back then. Though I had to postpone till I/O is sorted out. I really loved Steven's design with it's easy access to buffer and well thought out primitives, hope it will come about sometime soon.An interactive regex would be a dream come true to me... Andrei
Feb 29 2012
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:jiljsg$193s$1 digitalmars.com...On 2/28/12 1:52 PM, Dmitry Olshansky wrote:Probably LR, too, unless you build the state tables in a separate prior build step.- have reasonable compile times and memory consumption (though it will only improve over time)Yes. I guess PEGs have problems there.One thing I've been wanting to see in a lexer's regex would be a way to match things like D's nested comments or: qEOS fancy string blah blah EOS Support for either one would probably render it "not *technically* a regex" but it should be entirely possible by just adding some extra data/processing to the states.There is prototype of interactive regex matcher that works directly on stream (buried in std.regex), it even passed dry-run unittests back then. Though I had to postpone till I/O is sorted out. I really loved Steven's design with it's easy access to buffer and well thought out primitives, hope it will come about sometime soon.An interactive regex would be a dream come true to me...
Feb 29 2012
"Nick Sabalausky" <a a.a> wrote in message news:jilmhr$1dul$1 digitalmars.com..."Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:jiljsg$193s$1 digitalmars.com...Erm, I mean LALR specifically.On 2/28/12 1:52 PM, Dmitry Olshansky wrote:Probably LR, too, unless you build the state tables in a separate prior build step.- have reasonable compile times and memory consumption (though it will only improve over time)Yes. I guess PEGs have problems there.
Feb 29 2012
On 2/29/12 11:15 AM, Nick Sabalausky wrote:"Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org> wrote in message news:jiljsg$193s$1 digitalmars.com...It's been a while since I read about PEGs (the packrat parser paper), but I think it's a more fundamental issue with PEGs because they need to memoize several possible parses of the input. AndreiOn 2/28/12 1:52 PM, Dmitry Olshansky wrote:Probably LR, too, unless you build the state tables in a separate prior build step.- have reasonable compile times and memory consumption (though it will only improve over time)Yes. I guess PEGs have problems there.
Feb 29 2012
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:jilnjg$1eu9$3 digitalmars.com...On 2/29/12 11:15 AM, Nick Sabalausky wrote:I don't know much about packrat parsers, but what I meant is that generating LALR tables (which are then used by an LALR parser) can be very computationally expensive: For simple grammars, it's trivial, but for more complex "typical programming langauge" grammars, it can easily take upwards of quite a few minutes when *not* done in CTFE (at least on my underpowered machine). Although maybe my implementation and GOLD's implementation both just have some additional hidden scaling issue that isn't inherent to the algorithms? Actually using an *already*-generated LALR table (which only needs to be generated once per grammar, not once per input) to parse doesn't have major inherent efficiency issues compared to other parsing algorithms, AFAIK."Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org> wrote in message news:jiljsg$193s$1 digitalmars.com...It's been a while since I read about PEGs (the packrat parser paper), but think it's a more fundamental issue with PEGs because they need to memoize several possible parses of the input.On 2/28/12 1:52 PM, Dmitry Olshansky wrote:Probably LR, too, unless you build the state tables in a separate prior build step.- have reasonable compile times and memory consumption (though it will only improve over time)Yes. I guess PEGs have problems there.
Feb 29 2012
Hi Nick, On Wednesday, 29 February 2012 at 17:16:44 UTC, Nick Sabalausky wrote:One thing I've been wanting to see in a lexer's regex would be a way to match things like D's nested comments or:I have already implemented a lexer generator that can handle recursion in the token definitions (using multiple lexer states). See http://www.benhanson.net/lexertl.html The library is C++ and generates lexers at runtime, but the concepts should be easily transferable. Basically I have a boolean for pop in the end state and a separate column for push_dfa_:

if (end_state_)
{
    // Return longest match
    if (pop_)
    {
        start_state_ = results_.stack.top ().first;
        results_.stack.pop ();
    }
    else if (push_dfa_ != results_.npos ())
    {
        results_.stack.push (typename results::id_type_pair
            (push_dfa_, id_));
    }
    . . .

I'm happy to answer any questions etc. Regards, Ben
Mar 05 2012
On 29.02.2012 20:31, Andrei Alexandrescu wrote:It might involve parsing and grammars, unless academia folk seduce me with their ideas of optimization for highly parallel architectures :).Recalling EBNF parser idea that I run with before finally being dragged down by real life. Roughly I thought to follow hybrid LL(*) aproach, while I had a solid plan on doing DFA for lexer and parser lookahead, the other things were more or less floating.Well, maybe you could integrate that within your up-and-coming research. Grammars have been considered a solved problem every five years since 1970, and there's always new work coming :o).Where you've been? ;) I mean during GSOC we've spent days with Fawzi talking about shoehorning regex matcher to work on buffered streams. He actually did that prototype, I was mostly reviewing/fixing the source to make sure it fits. Recalling the inner works it was even expected to optionally do NFC normalization on the fly (wow, now that was ambitious) and all that without copying stuff around 99% of the time. -- Dmitry OlshanskyThere is prototype of interactive regex matcher that works directly on stream (buried in std.regex), it even passed dry-run unittests back then. Though I had to postpone till I/O is sorted out. I really loved Steven's design with it's easy access to buffer and well thought out primitives, hope it will come about sometime soon.An interactive regex would be a dream come true to me...
Feb 29 2012
On Tue, Feb 28, 2012 at 07:46:04PM +0100, Martin Nowak wrote: [...]I wrote a generic lexer generator some time ago. It already led to some compiler O(N^2) optimizations, because the token declarations sneak into the mangling :(. https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexerCool! I'll have to take a look at this sometime. [...]<PERSONAL OPINION The hassle of providing good error messages and synthesizing parse results in a generic parser outweighs the benefit of a declarative grammar. /PERSONAL OPINION>But things like lex/yacc have been useful throughout the years. With D's delegates, lexer/parser action rules should be even cleaner, no?A lot becomes feasible from the CTFE perspective; apart from some bugfixes, I currently only miss exp and log.All of std.math should have CTFE versions so that we can perform complex calculations at compile time (e.g., create a table of scaled sin/cos values for use in high-performance 3D apps -- no need to waste time generating these tables at startup).I do not agree that it's the right moment to write a parser though. It hits Phobos' two biggest shortcomings: the lack of a good I/O system and the missing allocators. Any parser written now will either risk not playing nice with ranges or have to come up with its own buffering again.OTOH, perhaps once we start writing a lexer/parser generator, that will force us to fix the I/O and allocator system. Without a good project to motivate us to fix these tedious issues, it seems that we just lose inspiration to do it after a while. T -- The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5
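The sin/cos half of that wish already works, since those are intrinsics CTFE understands; exp and log are the stragglers Martin mentioned. The table idea then looks like this:

import std.math : sin, PI;

// Computed entirely during compilation; no static this() and no
// startup cost - the data is baked into the binary.
immutable double[] sinTable =
{
    double[] t;
    foreach (i; 0 .. 256)
        t ~= sin(2.0 * PI * i / 256.0);
    return t;
}();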
Feb 28 2012
On 28/02/2012 21:02, H. S. Teoh wrote:OTOH, perhaps once we start writing a lexer/parser generator, that will force us to fix the I/O and allocator system. Without a good project to motivate us to fix these tedious issues, it seems that we just lose inspiration to do it after a while.std.container is already a good candidate for allocators. Back to the topic, this is a great idea, but isn't it building on weak bases? We still have issues with const correctness or postblit, and on the library side we lack allocators, error handling is poor, containers are poor, etc.
Feb 28 2012
On Tue, 28 Feb 2012 21:02:24 +0100, H. S. Teoh <hsteoh quickfur.ath.cx> wrote:On Tue, Feb 28, 2012 at 07:46:04PM +0100, Martin Nowak wrote: [...]I wrote a generic lexer generator some time ago. It already led to some compiler O(N^2) optimizations, because the token declarations sneak into the mangling :(. https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexerCool! I'll have to take a look at this sometime. [...]<PERSONAL OPINION The hassle of providing good error messages and synthesizing parse results in a generic parser outweighs the benefit of a declarative grammar. /PERSONAL OPINION>But things like lex/yacc have been useful throughout the years. With D's delegates, lexer/parser action rules should be even cleaner, no?Yacc does work, but at the price of an additional build step and total automaton obfuscation. And even at that price the results are still hard-to-maintain Klingon sources. http://supercollider.git.sourceforge.net/git/gitweb.cgi?p=supercollider/supercollider;a=blob;f=lang/LangSource/Bison/lang11d I won't deny that the combination of CTFE text processing and static introspection could improve on this. It could be made more feasible by some conventions, e.g. the parse result always uses structs or classes and built-in arrays. ----

class Module
{
    this(Declaration[]);
}

class StructDeclaration : Declaration
{
    enum _ebnf = "struct $1=Identifier { $2=Declaration* }";
    this(Identifier, Declaration[]);
}

...

Parser!(Module, StructDeclaration, ...) parser;
Module m = parser.parse(read("file.d"));
Feb 28 2012
On Tue, Feb 28, 2012 at 10:03:42PM +0100, Martin Nowak wrote: [...]I won't deny that the combination of CTFE text processing and static introspection could improve on this. It could be made more feasible by some conventions, e.g. the parse result always uses structs or classes and built-in arrays.Excellent idea, I like this.class Module { this(Declaration[]); } class StructDeclaration : Declaration { enum _ebnf = "struct $1=Identifier { $2=Declaration* }"; this(Identifier, Declaration[]); } ... Parser!(Module, StructDeclaration, ...) parser; Module m = parser.parse(read("file.d"));I like this! Definitely an improvement over yacc syntax. T -- Unix was not designed to stop people from doing stupid things, because that would also stop them from doing clever things. -- Doug Gwyn
Feb 28 2012
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.215.1330472867.24984.digitalmars-d puremagic.com...On Tue, Feb 28, 2012 at 10:03:42PM +0100, Martin Nowak wrote: [...]In Goldie, I've taken an inverted approach, which IMHO is easier to use: The types are automatically generated from the grammar, not the other way around. So applying that approach to the above code, it'd be more like this: mixin genGrammar!("myGrammar", ` Identifier = [a-zA-Z_][a-zA-Z_0-9]* Module = Declaration+ Declaration = StructDeclaration StructDeclaration = 'struct' Identifier '{' Declaration* '}' `); Which generates these classes: Parser!"myGrammar" Symbol!("myGrammar.Identifier") Symbol!("myGrammar.Module") Symbol!("myGrammar.Declaration") Symbol!("myGrammar.StructDeclaration") and/or these: Parser_myGrammar Symbol_myGrammar!"Identifier" Symbol_myGrammar!"Module" Symbol_myGrammar!"Declaration" Symbol_myGrammar!"StructDeclaration" would could then be aliased by the user however they wanted: alias Symbol_myGrammar MySym; And there can still be hooks (delegates, subclassing, whatever) to add customized behavior/functionality.I won't deny that the combination of CTFE text processing and static introspection could improve on this. It could be made more feasible by some conventions, e.g. parse result always uses structs or classes and built-in arrays.Excellent idea, I like this.class Module { this(Declaration[]); } class StructDeclaration : Declaration { enum _enbf = "struct $1=Identifier { $2=Declaration* }"; this(Identifier, Declaration[]); } ... Parser!(Module, StructDeclaration, ...) parser; Module m = parser.parse(read("file.d"));I like this! Definitely an improvement over yacc syntax.
Feb 28 2012
"Nick Sabalausky" <a a.a> wrote in message news:jikcca$1vq7$1 digitalmars.com..."H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.215.1330472867.24984.digitalmars-d puremagic.com...Hmm, maybe I need to think about what it would take to make Goldie able to parse at compile-time...On Tue, Feb 28, 2012 at 10:03:42PM +0100, Martin Nowak wrote: [...]In Goldie, I've taken an inverted approach, which IMHO is easier to use: The types are automatically generated from the grammar, not the other way around. So applying that approach to the above code, it'd be more like this: mixin genGrammar!("myGrammar", ` Identifier = [a-zA-Z_][a-zA-Z_0-9]* Module = Declaration+ Declaration = StructDeclaration StructDeclaration = 'struct' Identifier '{' Declaration* '}' `); Which generates these classes: Parser!"myGrammar" Symbol!("myGrammar.Identifier") Symbol!("myGrammar.Module") Symbol!("myGrammar.Declaration") Symbol!("myGrammar.StructDeclaration") and/or these: Parser_myGrammar Symbol_myGrammar!"Identifier" Symbol_myGrammar!"Module" Symbol_myGrammar!"Declaration" Symbol_myGrammar!"StructDeclaration" would could then be aliased by the user however they wanted: alias Symbol_myGrammar MySym; And there can still be hooks (delegates, subclassing, whatever) to add customized behavior/functionality.I won't deny that the combination of CTFE text processing and static introspection could improve on this. It could be made more feasible by some conventions, e.g. parse result always uses structs or classes and built-in arrays.Excellent idea, I like this.class Module { this(Declaration[]); } class StructDeclaration : Declaration { enum _enbf = "struct $1=Identifier { $2=Declaration* }"; this(Identifier, Declaration[]); } ... Parser!(Module, StructDeclaration, ...) parser; Module m = parser.parse(read("file.d"));I like this! Definitely an improvement over yacc syntax.
Feb 28 2012
"Nick Sabalausky" <a a.a> wrote in message news:jikcit$201o$1 digitalmars.com...Hmm, maybe I need to think about what it would take to make Goldie able to parse at compile-time...Just gave it a quick shot. It was looking like it might not be too bad, but then I hit: Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file 'interpret.c' Bleh. (This was with DMD 2.058)
Feb 28 2012
On 29 Feb 2012 07:20, "Nick Sabalausky" <a a.a> wrote:"Nick Sabalausky" <a a.a> wrote in message news:jikcit$201o$1 digitalmars.com...Hmm, maybe I need to think about what it would take to make Goldie able to parse at compile-time...Just gave it a quick shot. It was looking like it might not be too bad, but then I hit: Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file 'interpret.c' Bleh. (This was with DMD 2.058)Yeah, I had the very same yesterday :( Also, another one on line 94 in interpret.c, 'v->ctfeSomethin' failing. Too bad. In my case, I found a workaround: I was doing array[] ~= SomeStruct(args); which asserts at CTFE. But: auto s = SomeStruct(args); array[] ~= s; works.
Feb 29 2012
"Philippe Sigaud" <philippe.sigaud gmail.com> wrote in message news:mailman.229.1330508922.24984.digitalmars-d puremagic.com... Le 29 fvr. 2012 07:20, "Nick Sabalausky" <a a.a> a crit :Ooh, cool! I know exactly where I'm doing that!Just gave it a quick shot. It was looking like it might not be too bad,butthen I hit: Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file 'interpret.c' Bleh. (This was with DMD 2.058)Yeah, I had the very same yesterday :( Also, another one on line 94 in interpret.c 'v->ctfeSomethin' failing. Too bad. In my case, I found a workaround: I was doing array[] ~= SomeStruct(args); which asserts at CTFE. But: auto s = SomeStructs(args); array[] ~= s; works.
Feb 29 2012
"Nick Sabalausky" <a a.a> wrote in message news:jikfpi$25cc$1 digitalmars.com..."Nick Sabalausky" <a a.a> wrote in message news:jikcit$201o$1 digitalmars.com...Filed: http://d.puremagic.com/issues/show_bug.cgi?id=7667Hmm, maybe I need to think about what it would take to make Goldie able to parse at compile-time...Just gave it a quick shot. It was looking like it might not be too bad, but then I hit: Assertion failure: 'ctfeStack.stackPointer() == 0' on line 4823 in file 'interpret.c' Bleh. (This was with DMD 2.058)
Mar 07 2012
On 2/28/12 11:15 PM, Nick Sabalausky wrote:In Goldie, I've taken an inverted approach, which IMHO is easier to use: The types are automatically generated from the grammar, not the other way around. So applying that approach to the above code, it'd be more like this: mixin genGrammar!("myGrammar", ` Identifier = [a-zA-Z_][a-zA-Z_0-9]* Module = Declaration+ Declaration = StructDeclaration StructDeclaration = 'struct' Identifier '{' Declaration* '}' `); Which generates these classes: Parser!"myGrammar" Symbol!("myGrammar.Identifier") Symbol!("myGrammar.Module") Symbol!("myGrammar.Declaration") Symbol!("myGrammar.StructDeclaration") and/or these: Parser_myGrammar Symbol_myGrammar!"Identifier" Symbol_myGrammar!"Module" Symbol_myGrammar!"Declaration" Symbol_myGrammar!"StructDeclaration" would could then be aliased by the user however they wanted: alias Symbol_myGrammar MySym; And there can still be hooks (delegates, subclassing, whatever) to add customized behavior/functionality.I think this is the right approach. Andrei
Feb 29 2012
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:jilkj6$1a50$2 digitalmars.com...On 2/28/12 11:15 PM, Nick Sabalausky wrote:What? We agree? ;)In Goldie, I've taken an inverted approach, which IMHO is easier to use: The types are automatically generated from the grammar, not the other way around. So applying that approach to the above code, it'd be more like this: mixin genGrammar!("myGrammar", ` Identifier = [a-zA-Z_][a-zA-Z_0-9]* Module = Declaration+ Declaration = StructDeclaration StructDeclaration = 'struct' Identifier '{' Declaration* '}' `); Which generates these classes: Parser!"myGrammar" Symbol!("myGrammar.Identifier") Symbol!("myGrammar.Module") Symbol!("myGrammar.Declaration") Symbol!("myGrammar.StructDeclaration") and/or these: Parser_myGrammar Symbol_myGrammar!"Identifier" Symbol_myGrammar!"Module" Symbol_myGrammar!"Declaration" Symbol_myGrammar!"StructDeclaration" would could then be aliased by the user however they wanted: alias Symbol_myGrammar MySym; And there can still be hooks (delegates, subclassing, whatever) to add customized behavior/functionality.I think this is the right approach.
Feb 29 2012
On 2/29/12 11:05 AM, Nick Sabalausky wrote:What? We agree? ;)It's 2012 after all. Andrei
Feb 29 2012
On 2/28/12 12:46 PM, Martin Nowak wrote:I wrote a generic lexer generator some time ago. It already led to some compiler O(N^2) optimizations, because the token declarations sneak into the mangling :(. https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexerWow, now we have an embarrassment of riches!I've ditched an attempt to write a parser combinator. It was based on expression templates and ended up at spirit craziness.I'm not sure what a parser combinator is. I think grammar inheritance could actually be interesting to explore.<PERSONAL OPINION The hassle of providing good error messages and synthesizing parse results in a generic parser outweighs the benefit of a declarative grammar. /PERSONAL OPINION>I think synthesizing ASTs is a solved problem, see my post before. Error messages are still a hassle, but I missed the point when they became good in hand-written parsers :o).A lot becomes feasible from the CTFE perspective; apart from some bugfixes, I currently only miss exp and log. I do not agree that it's the right moment to write a parser though. It hits Phobos' two biggest shortcomings: the lack of a good I/O system and the missing allocators. Any parser written now will either risk not playing nice with ranges or have to come up with its own buffering again.Agreed, but that doesn't seem like the largest hurdle to me. Andrei
Feb 29 2012
On 02/28/2012 07:46 PM, Martin Nowak wrote:https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexerWell, it is slower at lexing than DMD at parsing. What is the bottleneck?
Feb 29 2012
On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:On 02/28/2012 07:46 PM, Martin Nowak wrote:No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop. There is also an --echo option which recreates the complete source from the tokens.https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexerWell, it is slower at lexing than DMD at parsing. What is the bottleneck?
Feb 29 2012
On Wed, Feb 29, 2012 at 07:28:48PM +0100, Martin Nowak wrote:On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:This sounds like something I ran into in my D lexer project. Lexing Phobos took upwards of 10-15 seconds, which is extremely slow considering that dmd can *compile* it in just a few seconds. So I did some profiling, and found out that most of the time was spent formatting tokens into strings and running writeln. After I commented that out, the running time dropped to just a few seconds. :-) T -- The fact that anyone still uses AOL shows that even the presence of options doesn't stop some people from picking the pessimal one. - Mike EllisOn 02/28/2012 07:46 PM, Martin Nowak wrote:No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop. There is also an --echo option which recreates the complete source from the tokens.https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexerWell, it is slower at lexing than DMD at parsing. What is the bottleneck?
Feb 29 2012
This sounds like something I ran into in my D lexer project. Lexing Phobos took upwards of 10-15 seconds, which is extremely slow considering that dmd can *compile* it in just a few seconds. So I did some profiling, and found out that most of the time was spent formatting tokens into strings and running writeln. After I commented that out, the running time dropped to just a few seconds. :-)And how different was just a few seconds from 10-15 seconds? :-)
Feb 29 2012
"d coder" <dlang.coder gmail.com> wrote in message news:mailman.241.1330541814.24984.digitalmars-d puremagic.com...~10 seconds ;)This sounds like something I ran into in my D lexer project. Lexing Phobos took upwards of 10-15 seconds, which is extremely slow considering that dmd can *compile* it in just a few seconds. So I did some profiling, and found out that most of the time was spent formatting tokens into strings and running writeln. After I commented that out, the running time dropped to just a few seconds. :-)And how different was just a few seconds from 10-15 seconds? :-)
Feb 29 2012
On Thu, Mar 01, 2012 at 12:19:30AM +0530, d coder wrote:OK so I was imprecise. It was 2-3 seconds compared to 10-15 seconds. So it's about an order of magnitude difference. :) T -- "I speak better English than this villain Bush" -- Mohammed Saeed al-Sahaf, Iraqi Minister of InformationThis sounds like something I ran into in my D lexer project. Lexing Phobos took upwards of 10-15 seconds, which is extremely slow considering that dmd can *compile* it in just a few seconds. So I did some profiling, and found out that most of the time was spent formatting tokens into strings and running writeln. After I commented that out, the running time dropped to just a few seconds. :-)And how different was just a few seconds from 10-15 seconds? :-)
Feb 29 2012
On 02/29/2012 07:28 PM, Martin Nowak wrote:On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:I did that.On 02/28/2012 07:46 PM, Martin Nowak wrote:No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop.https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexerWell, it is slower at lexing than DMD at parsing. What is the bottleneck?
Feb 29 2012
On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:On 02/29/2012 07:28 PM, Martin Nowak wrote:On Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:I did that.On 02/28/2012 07:46 PM, Martin Nowak wrote:No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop.https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexerWell, it is slower at lexing than DMD at parsing. What is the bottleneck?Interesting, I've commented it out https://gist.github.com/1262321#L1559 and get the following. <<<

PHOBOS=~/Code/D/DPL/phobos
mkdir test_lexer
cd test_lexer
curl https://raw.github.com/gist/1255439/lexer.d > lexer.d
curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d
curl https://raw.github.com/gist/1262321/entity.d > entity.d
dmd -O -release -inline dlexer lexer entity
wc -l ${PHOBOS}/std/*.d
time ./dlexer ${PHOBOS}/std/*.d
Feb 29 2012
On 02/29/2012 09:03 PM, Martin Nowak wrote:On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:I get 0.160s for lexing using your lexer. Parsing the same file with DMDs parser takes 0.155 seconds. The difference grows with larger files.On 02/29/2012 07:28 PM, Martin Nowak wrote:Interesting, I've commented it out https://gist.github.com/1262321#L1559 and get the following. <<< PHOBOS=~/Code/D/DPL/phobos mkdir test_lexer cd test_lexer curl https://raw.github.com/gist/1255439/lexer.d > lexer.d curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d curl https://raw.github.com/gist/1262321/entity.d > entity.d dmd -O -release -inline dlexer lexer entity wc -l ${PHOBOS}/std/*.d time ./dlexer ${PHOBOS}/std/*.dOn Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:I did that.On 02/28/2012 07:46 PM, Martin Nowak wrote:No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop.https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexerWell, it is slower at lexing than DMD at parsing. What is the bottleneck?./dlexer ${PHOBOS}/std/*.d 0.21s user 0.00s system 99% cpu 0.211 total
Feb 29 2012
On Wed, 29 Feb 2012 21:30:57 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:On 02/29/2012 09:03 PM, Martin Nowak wrote:Mmh, I've retested and you're right dmd's lexer is about 2x faster. The main overhead stems from using ranges and enforce. Quick profiling shows that 25% is spent in popFront and std.utf.stride. Last time I worked on this I rewrote std.utf.decode to be much faster. But utf characters are still "decoded" twice, once for front and then again for popFront. Also stride uses table lookup and can't be inlined. If switch tables were implemented on x64 one could use them for integral ElementType.On Wed, 29 Feb 2012 20:30:43 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:I get 0.160s for lexing using your lexer. Parsing the same file with DMDs parser takes 0.155 seconds. The difference grows with larger files.On 02/29/2012 07:28 PM, Martin Nowak wrote:Interesting, I've commented it out https://gist.github.com/1262321#L1559 and get the following. <<< PHOBOS=~/Code/D/DPL/phobos mkdir test_lexer cd test_lexer curl https://raw.github.com/gist/1255439/lexer.d > lexer.d curl https://raw.github.com/gist/1262321/dlexer.d > dlexer.d curl https://raw.github.com/gist/1262321/entity.d > entity.d dmd -O -release -inline dlexer lexer entity wc -l ${PHOBOS}/std/*.d time ./dlexer ${PHOBOS}/std/*.dOn Wed, 29 Feb 2012 17:41:19 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:I did that.On 02/28/2012 07:46 PM, Martin Nowak wrote:No, it's as fast as dmd's lexer. Writing the tokens to stdout takes a lot of time though. Just disable the "writeln(tok);" in the main loop.https://gist.github.com/1255439 - lexer generator https://gist.github.com/1262321 - complete and fast D lexerWell, it is slower at lexing than DMD at parsing. What is the bottleneck?./dlexer ${PHOBOS}/std/*.d 0.21s user 0.00s system 99% cpu 0.211 total
Feb 29 2012
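To make the "decoded twice" point concrete: a character range that decodes in front and then re-examines the bytes in popFront does the UTF work twice per character. A minimal sketch of that pattern (invented for illustration, not Nowak's actual lexer code), using std.utf:

    import std.utf : decode, stride;

    struct TwiceDecoded
    {
        const(char)[] buf;

        @property bool empty() const { return buf.length == 0; }

        @property dchar front() const
        {
            size_t i = 0;
            return decode(buf, i);            // UTF work, first time
        }

        void popFront()
        {
            buf = buf[stride(buf, 0) .. $];   // UTF work again, via table lookup
        }
    }

The reply below removes the second pass by caching the decoded dchar.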
On Thu, Mar 01, 2012 at 12:04:39AM +0100, Martin Nowak wrote:
[...]
> Mmh, I've retested and you're right: dmd's lexer is about 2x faster.
> The main overhead stems from using ranges and enforce. Quick profiling
> shows that 25% is spent in popFront and std.utf.stride. Last time I
> worked on this I rewrote std.utf.decode to be much faster, but UTF
> characters are still "decoded" twice: once for front and then again
> for popFront. Also, stride uses a table lookup and can't be inlined.
[...]

One way to avoid decoding characters twice is to keep a single-dchar buffer in your range. Something like:

    import std.utf : decode;

    struct MyRange {
        private File src;
        char[] buf;
        dchar readahead;

        this(File _src) {
            // ... fill up buf from src here
            popFront();
        }

        @property pure dchar front() {
            return readahead;
        }

        void popFront() {
            size_t stride;    // decode() advances this past the character
            readahead = decode(buf, stride);
            buf = buf[stride .. $];
            // ... fill up buf more if needed
        }
    }

T

-- 
"A man's wife has more power over him than the state has." -- Ralph Emerson
Feb 29 2012
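For reference, here is a self-contained, compilable variant of the same one-dchar-lookahead idea, operating on an in-memory buffer instead of a File so the "... fill up buf ..." parts disappear. Again a sketch of the technique, not Teoh's exact code; the names are invented:

    import std.utf : decode;

    struct Decoded
    {
        private const(char)[] buf;
        private dchar readahead;
        private bool done;

        this(const(char)[] src)
        {
            buf = src;
            popFront();                  // prime the one-character lookahead
        }

        @property bool empty() const { return done; }
        @property dchar front() const { return readahead; }

        void popFront()
        {
            if (buf.length == 0) { done = true; return; }
            size_t i = 0;
            readahead = decode(buf, i);  // the only decode per character
            buf = buf[i .. $];
        }
    }

    unittest
    {
        import std.algorithm : equal;
        assert(equal(Decoded("héllo"), "héllo"d));
    }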
On 02/27/2012 11:59 PM, Andrei Alexandrescu wrote:
> I'm starting a new thread on this because I think the matter is of
> strategic importance. We all felt for a long time that there's a lot
> of potential in CTFE, and potential applications have been discussed
> more than a few times, ranging from formatting strings parsed to DSLs
> and parser generators. Such feats are now approaching fruition because
> a number of factors converge:
>
> * Dmitry Olshansky's regex library (now in Phobos) generates efficient
>   D code straight from regexen.
> * The scope and quality of CTFE has improved enormously, making more
>   advanced uses possible and even relatively easy (thanks Don!)
> * Hisayuki Mima implemented a parser generator in only 3000 lines of
>   code (sadly, no comments or documentation yet :o))
> * With the occasion of that announcement we also find out Philippe
>   Sigaud has already a competing design and implementation of a parser
>   generator.
>
> This is the kind of stuff I've had an eye on for the longest time. I'm
> saying it's of strategic importance because CTFE technology, though
> not new and already available with some languages, has unique powers
> when combined with other features of D. With CTFE we get to do things
> that are quite literally impossible to do in other languages. We need
> to have a easy-to-use, complete, seamless, and efficient lexer-parser
> generator combo in Phobos, pronto.
>
> The lexer itself could use a character-level PEG or a classic
> automaton, and emit tokens for consumption by a parser generator. The
> two should work in perfect tandem (no need for glue code). At the end
> of the day, defining a complete lexer+parser combo for a language
> should be just a few lines longer than the textual representation of
> the grammar itself.
>
> What do you all think? Let's get this project off the ground!
>
> Thanks,
>
> Andrei

A bit outdated but.... http://dsource.org/projects/scrapple/browser/trunk/dparser/dparse.d

BTW, that uses very little in the way of CTFE and even less in the way of string mixins. Those are, IMHO, very powerful tools that should only be combined as weapons of last resort.
Feb 28 2012
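For readers new to the thread, the "combination" being cautioned against looks roughly like this: a CTFE-evaluated function assembles D source text, and a string mixin compiles it in place. A tiny hypothetical example (makeConstants is invented here):

    import std.conv : to;

    // Runs at compile time when used inside mixin(): builds source as a string.
    string makeConstants(int n)
    {
        string code;
        foreach (i; 0 .. n)
            code ~= "enum c" ~ to!string(i) ~ " = " ~ to!string(i * i) ~ ";\n";
        return code;
    }

    mixin(makeConstants(3));   // declares c0 = 0, c1 = 1, c2 = 4
    static assert(c2 == 4);

The power, and the readability cost, is that the generated text is arbitrary; nothing but discipline keeps it reviewable, which is presumably the poster's point.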
I'm not sure I follow all the details of what Andrei's suggesting and what's being talked about here; this parser/lexer stuff is still very new to me, so this may be a bit off-topic. However, I thought I'd weigh in on something I was very impressed with about the Nimrod language's direct AST access/manipulation.

Nim has a "template", which is very much like D's mixin templates. Example:

    template foo(b: string) =
      var bar = b

    block main:
      foo("test")
      assert(bar == "test")

and the equivalent in D:

    mixin template foo(string b)
    {
        auto bar = b;
    }

    void main()
    {
        mixin foo!("test");  // note: mixin template instantiation needs the '!'
        assert(bar == "test");
    }

which is all very straightforward stuff. The cool part comes with Nim's macros. Nim has two unique types: expr & stmt (expression & statement). They're direct AST structures which can be passed to template/macro procedures and arbitrarily mutated. Example:

    macro foo(s: stmt): stmt =
      result = newNimNode(nnkStmtList)
      for i in 1 .. s.len-1:
        var str = s[i].toStrLit()
        result.add(newCall("echo", str))

    block main:
      foo:
        bar
        baz

The above code prints:

    bar
    baz

(Some notes: result is what's returned, and the reason you can use "foo" with a statement body is that any macro/template whose last parameter is of type 'stmt' can be called with block semantics, similar to how UFCS works with the first parameter.)

The above *might* look like the following in D:

    macro foo(ASTNode[] stmts...)
    {
        ASTNode[] result;
        foreach (s; stmts)
        {
            auto str = s.toASTString();
            result ~= new ASTCall!"writeln"(str);
        }
        return result;
    }

    void main()
    {
        foo {
            bar;
            baz;
        }
    }

This kind of direct AST manipulation + body semantics opens the doors for a lot of cool things. If you read through Nim's library documentation, you'll notice many of the "language" features are actually just library procedures in the system module that is included by default. That's great, because contributions to the *language* can be made very easily. Also, the infrastructure to read/write AST is no doubt immensely useful for IDE support and other such dev tools.

I'm not a huge fan of everything in Nimrod, but this is something they definitely got right, and I think D could gain from their experience.
May 27 2012
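Current D has no stmt-typed AST parameters, but the effect of that particular Nim macro can be approximated with a CTFE helper plus a string mixin. A hedged sketch; echoEach is invented for this example:

    import std.stdio : writeln;

    // Build one writeln per name at compile time, echoing the Nim macro above.
    string echoEach(string[] names...)
    {
        string code;
        foreach (n; names)
            code ~= `writeln("` ~ n ~ `");` ~ "\n";
        return code;
    }

    void main()
    {
        mixin(echoEach("bar", "baz"));   // prints "bar" then "baz"
    }

The difference is exactly the one highlighted above: the Nim version manipulates typed AST nodes, while the D version concatenates raw source text.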
On 2012-05-27 22:15, F i L wrote:
> [... Nimrod template/macro examples snipped; see the previous message ...]
> This kind of direct AST manipulation + body semantics opens the doors
> for a lot of cool things.

This is a very cool feature of Nimrod. It allows several language features to be moved into the library:

* synchronized
* scope
* foreach (possibly)

-- 
/Jacob Carlborg
May 28 2012
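Even without Nim-style macros, the first item on that list can be approximated in library D today by trading the bare block for a delegate. A sketch under that assumption; synchronizedDo is a hypothetical name:

    import core.sync.mutex : Mutex;

    // A library stand-in for the built-in synchronized statement.
    void synchronizedDo(Mutex m, scope void delegate() critical)
    {
        m.lock();
        scope (exit) m.unlock();   // released even if the delegate throws
        critical();
    }

    void main()
    {
        auto m = new Mutex;
        synchronizedDo(m, {
            // critical section
        });
    }

What AST macros would add is the nicer surface syntax: passing a plain block instead of a delegate literal.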
On Tuesday, 28 February 2012 at 07:59:16 UTC, Andrei Alexandrescu wrote:
> [... original proposal quoted in full earlier in the thread ...]
> We need to have a easy-to-use, complete, seamless, and efficient
> lexer-parser generator combo in Phobos, pronto.
> What do you all think? Let's get this project off the ground!

Great! So what's the next step? Do we wait for the maintainers of one of the CTFE parser-generator packages to drop it into the Phobos Review Queue? Or do a reimplementation for Phobos?

We could attack this in pieces. Start with a lexer/FSA generator (like Ragel, but using CTFE) - this will make it much easier to consume many wire protocols, for starters (I found it very easy to make a simple HTTP client using Ragel), and it will be quite useful on its own. Then extend it into a lexer for a parser generator.
Jun 01 2012
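As a toy of the CTFE direction sketched in the preceding message (nothing like a full Ragel, just the flavor): generate a switch-based recognizer from a token list at compile time and mix it in. The names here are invented for illustration:

    enum keywords = ["if", "else", "while"];

    // CTFE: emit a switch over the keyword list; mixin turns it into real code.
    string makeIsKeyword(const string[] toks)
    {
        string code = "bool isKeyword(string s) { switch (s) {\n";
        foreach (t; toks)
            code ~= `case "` ~ t ~ `":` ~ "\n";
        code ~= "return true;\ndefault: return false;\n} }";
        return code;
    }

    mixin(makeIsKeyword(keywords));

    static assert(isKeyword("while") && !isKeyword("foo"));

A real lexer/FSA generator would emit per-state switches over input characters instead of one switch over whole strings, but the generate-then-mixin pipeline is the same.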
On 6/1/12 8:39 AM, Ken wrote:
> Great! So what's the next step? [...]
> We could attack this in pieces. Start with a lexer/FSA generator (like
> Ragel, but using CTFE) [...] Then extend it into a lexer for a parser
> generator.

I think this core strength of the language should be properly complemented by library support, so I'd be very interested to look at Phobos proposals. The proposals should come with sample grammars of nontrivial size, ideally a parser for the entire D language. There might be things to fix in the compiler, e.g. memory consumption.

Andrei
Jun 01 2012