digitalmars.D - Let's stop parser Hell
- Denis Shelomovskij (10/10) Jul 05 2012 There are more and more projects requiring parsing D code (IDE plugins,
- Roman D. Boiko (14/31) Jul 05 2012 Visual-D is not Boost-licensed (I think this would be possible to
- Roman D. Boiko (4/4) Jul 05 2012 Forgot to add DDMD, which also has been forked and redesigned
- Denis Shelomovskij (7/10) Jul 05 2012 Why? Even were they all almost equal we can select any one. The point is...
- Roman D. Boiko (7/18) Jul 05 2012 Well, I guess we'd like to select one and not change it later,
- Jacob Carlborg (4/5) Jul 05 2012 Haven't we already done that a couple of times.
- Roman D. Boiko (5/8) Jul 05 2012 Well, we did something like that for DCT... but I doubt that it
- Jacob Carlborg (32/36) Jul 05 2012 I don't know. Here is what I wrote for DCT:
- Roman D. Boiko (7/9) Jul 05 2012 OK, fairly complete. Let it be the starting point. Why not put it
- Jacob Carlborg (4/5) Jul 05 2012 Aren't you voting on your own project :)
- Roman D. Boiko (3/6) Jul 05 2012 Well, I'm going to base parsing on Pegged, after tweaking it to
- Philippe Sigaud (32/35) Jul 05 2012 As much as I'm flattered by that, my current impression is Pegged is ver...
- Andrei Alexandrescu (12/37) Jul 05 2012 I'll be glad to buy for you any book you might feel you need for this.
- Roman D. Boiko (4/21) Jul 05 2012 +10 (can I vote ten times?)
- Philippe Sigaud (20/25) Jul 05 2012 That's nice of you, if a bit extreme for a public mailing list :)
- Roman D. Boiko (35/71) Jul 05 2012 I have four, from 1 to 7 years old... Wouldn't call them a
- Roman D. Boiko (15/28) Jul 05 2012 Check whether an identifier is a keyword using AA (dictionary)
- Dmitry Olshansky (9/21) Jul 05 2012 I'd stress the fact that it's a fully unrolled & hard-coded
- Philippe Sigaud (3/5) Jul 05 2012 Better not telling my wife. She's making noises about having a fourth.
- Andrei Alexandrescu (3/5) Jul 05 2012 Former sux latter rox.
- Nick Sabalausky (6/9) Jul 05 2012 FWIW, I've found this one to be pretty helpful:
- Philippe Sigaud (5/14) Jul 06 2012 Thanks Nick, I'll add it to my 'will buy one day' list (which grows
- Nick Sabalausky (8/26) Jul 06 2012 Yea, it's from an academic publisher unfortunately, so $$$. I got ahold
- Roman D. Boiko (8/40) Jul 05 2012 I'm sure it can generate **much** faster code. I'm going to focus
- Tobias Pankrath (6/10) Jul 05 2012 If you are interested in parsing, than I wouldn't recommend the
- Philippe Sigaud (10/18) Jul 05 2012 rn
- Roman D. Boiko (3/6) Jul 05 2012 http://www.komkon.org/~sher/books/parsing_techniques_2008.pdf
- deadalnix (2/12) Jul 05 2012 Why not program instead of doing bureaucracy ?
- Roman D. Boiko (8/9) Jul 05 2012 To avoid programming things which are not needed or don't fit.
- deadalnix (4/11) Jul 05 2012 Well you did the mistake to introduce an over complex mechanism in
- Roman D. Boiko (2/6) Jul 05 2012 I'm fine with that, let you doubt.
- Denis Shelomovskij (7/21) Jul 05 2012 Just want to mention I'm talking about parser only. Things like
- Jonathan M Davis (28/39) Jul 05 2012 It's been discussed before, but there are some obvious use cases such as...
- Roman D. Boiko (11/17) Jul 05 2012 Resume: everybody is welcome to join effort of translating DMD
- Andrei Alexandrescu (10/26) Jul 05 2012 I'd really want to create a task force on this, it is of strategic
- Dmitry Olshansky (11/41) Jul 05 2012 Count me as interested.
- Roman D. Boiko (3/9) Jul 05 2012 IMO, first priority is to make code generated by Pegged work
- Andrei Alexandrescu (4/13) Jul 05 2012 Well I'd say there are things like supporting large, realistic grammars
- Andrei Alexandrescu (6/25) Jul 05 2012 Excellent point. Walter is 100% behind the general strategy and we can
- Jacob Carlborg (8/12) Jul 05 2012 I don't see why you are against this. I'm seeing this as basically the
- David Piepgrass (41/59) Jul 06 2012 Hi everybody! My name's David and I've been dreaming since around
- H. S. Teoh (17/23) Jul 06 2012 Frankly, I don't know what DCT stands for (anyone?), but AA stands for
- Roman D. Boiko (7/8) Jul 07 2012 That's a working (not final) name of my project, a.k.a. The D
- Roman D. Boiko (6/16) Jul 07 2012 I decided to ask a question about this:
- Dmitry Olshansky (8/21) Jul 07 2012 I can tell you that they are not slower then one another in principle.
- Roman D. Boiko (5/11) Jul 07 2012 Exactly, that is also my point. I think that Pegged can be
- Roman D. Boiko (6/10) Jul 07 2012 Hmm... found an interesting article:
- Dmitry Olshansky (5/15) Jul 07 2012 Yup, LL(*) is my favorite so far. As for debugging and error recovery
- Roman D. Boiko (6/8) Jul 07 2012 It's interesting, that I also wanted to use DFA for non-terminals
- Andrei Alexandrescu (5/19) Jul 07 2012 That's Terence Parr's discovery, right? I've always liked ANTLR, so if
- Dmitry Olshansky (5/25) Jul 07 2012 I believe that it may need very few _semantic_ predicates. But there are...
- Tobias Pankrath (1/5) Jul 08 2012 We should also consider using GLR if LL(*) doesn't work.
- Dmitry Olshansky (4/9) Jul 08 2012 GLR ... are you serious? It still does parsing in n^3 if I'm not mistake...
- Tobias Pankrath (9/29) Jul 08 2012 Since GLR can parse any CFG the upper bound is in O(n^3), but the
- Tobias Pankrath (1/2) Jul 08 2012 http://scottmcpeak.com/elkhound/sources/elkhound/algorithm.html
- Dmitry Olshansky (6/32) Jul 08 2012 The problem is that say LL(*) can be easily tweaked in place to run
- Tobias Pankrath (11/15) Jul 08 2012 Did you look at Elkhound? According to the tutorial you can use
- Roman D. Boiko (3/14) Jul 08 2012 http://www.antlr.org/papers/LL-star-PLDI11.pdf describes some
- Tobias Pankrath (7/13) Jul 07 2012 To be picky here:
- Roman D. Boiko (3/7) Jul 07 2012 Also any PEG defines a language with linear-time parsing. Some
- Tobias Pankrath (2/10) Jul 07 2012 Interesting, I thought that PEG ⊂ CFG holds.
- Roman D. Boiko (5/6) Jul 07 2012 See section 3.4 of http://bford.info/pub/lang/peg.pdf
- Philippe Sigaud (2/9) Jul 07 2012 Not that anyone is interested in such languages, but still :)
- deadalnix (2/26) Jul 07 2012 D isn't 100% CFG. But it is close.
- travert phare.normalesup.org (Christophe Travert) (2/3) Jul 09 2012 What makes D fail to be a CFG?
- deadalnix (4/7) Jul 10 2012 type[something] <= something can be a type or an expression.
- Timon Gehr (3/11) Jul 10 2012 'something' is context-free:
- travert phare.normalesup.org (Christophe Travert) (5/19) Jul 11 2012 Do you have to know if something is a type or an expression for a simple...
- Timon Gehr (4/24) Jul 11 2012 No. Some token sequences can be both a type and an expression based on
- David Piepgrass (4/14) Jul 11 2012 I don't see how "type | expression" is context free. The input
- Timon Gehr (8/24) Jul 11 2012 'Context free' means that all the production rule left hand sides do
- Philippe Sigaud (9/12) Jul 23 2012 This little post to say a big 'Thanks' to Tobias. I'm reading Grune's
- Andrei Alexandrescu (3/15) Jul 23 2012 When will we see the fruit of that all in the form of D libraries?
- Roman D. Boiko (2/15) Jul 23 2012 Yes, I'm also reading it and find it amusingly high-quality.
- Tobias Pankrath (2/15) Jul 23 2012 I'm glad that you like it.
- Andrei Alexandrescu (3/7) Jul 07 2012 Isn't it the case that PEG require more memory for certain grammars?
- Roman D. Boiko (25/26) Jul 07 2012 So far it looks like LALR parsers may have lower constant factors
- Andrei Alexandrescu (7/33) Jul 07 2012 Isn't also the fact that lexing and parsing are integrated? With
- Dmitry Olshansky (6/46) Jul 07 2012 I'll have to point out that the whole point about integrated lexing is
- Andrei Alexandrescu (3/6) Jul 07 2012 Interesting. I'll have to ask for more details on why.
- Roman D. Boiko (7/15) Jul 07 2012 +1. Personally I associate PEG with a parser that includes a
- Dmitry Olshansky (9/23) Jul 07 2012 Well put yet it's still lexer. And most horribly it's not optimized to
- Roman D. Boiko (12/16) Jul 07 2012 I think that your conclusions are about statistical evidences of
- Dmitry Olshansky (9/23) Jul 07 2012 You may misunderstood me as well, the point is still the same:
- Roman D. Boiko (5/18) Jul 07 2012 But PEG itself doesn't require backtracking parsing, does it? So
- Dmitry Olshansky (15/31) Jul 07 2012 It does. Specifically the algorithm is to try option A, try option B,
- Roman D. Boiko (18/32) Jul 07 2012 There is no algorithm, only __grammar__. It specifies that option
- Dmitry Olshansky (23/53) Jul 07 2012 That would not always work, but yes that's what we should do IMHO.
- Roman D. Boiko (18/52) Jul 07 2012 I didn't do a deeper research on this, just shared my current
- Dmitry Olshansky (29/34) Jul 07 2012 I think I told that before but anyway the main point is:
- Roman D. Boiko (5/17) Jul 07 2012 This is the implementation I have in mind as the only sane way to
- deadalnix (2/27) Jul 07 2012 Many usages only need a lexer.
- Jonathan M Davis (22/26) Jul 06 2012 I'm not even vaguely convinced that having a lexer/parser specifically f...
- Jacob Carlborg (16/26) Jul 07 2012 I think the whole point of having a compiler as a library is that the
- deadalnix (3/32) Jul 07 2012 I tried that. This is almost impossible, dmd's parser and AST are very
- Roman D. Boiko (2/4) Jul 07 2012 +1
- Jacob Carlborg (4/6) Jul 08 2012 I suspected that.
- Jonathan M Davis (54/83) Jul 07 2012 There are multiple issues here. The one that Andrei is raising is the fa...
- Daniel Murphy (7/15) Jul 07 2012 A more realistic alternative is to modify dmd to optionally (at build/li...
- Jacob Carlborg (16/68) Jul 08 2012 That should be true for D as well when the spec is complete.
- Jonathan M Davis (20/44) Jul 08 2012 True, but Andrei is basically arguing that porting dmd's lexer and parse...
- Roman D. Boiko (4/9) Jul 08 2012 Can you provide a *specific* example of performance optimization
- Jonathan M Davis (23/33) Jul 08 2012 It's been too long since I was actively working on parsers to give any
- Roman D. Boiko (13/25) Jul 08 2012 My aim is to find out any potential bottlenecks and ensure that
- Roman D. Boiko (1/3) Jul 08 2012 s/sews/seams
- David Piepgrass (49/71) Jul 08 2012 I'm convinced that the output of a parser generator (PG) can be
- Roman D. Boiko (4/4) Jul 08 2012 David, I would suggest you to create a proof-of-concept prototype
- Jacob Carlborg (5/8) Jul 08 2012 I'm not expert in this field but I've heard that it's harder to get good...
- Roman D. Boiko (2/4) Jul 09 2012 Good point!
- Jacob Carlborg (4/8) Jul 08 2012 It don't feel like that, didn't we get lambdas or UFCS in the last relea...
- Jonathan M Davis (8/15) Jul 08 2012 lambdas are a bit older than that IIRC, and I don't think that UFCS actu...
- Jacob Carlborg (6/12) Jul 09 2012 I'm pretty sure UFCS affects lexing or parsing. How else would this be
- Jonathan M Davis (16/28) Jul 09 2012 That definitely wouldn't affect lexing, because it doesn't affect the to...
- Daniel Murphy (3/11) Jul 09 2012 Not true. This used to be lexed as '4.f' 'oo'. (I think)
- Timon Gehr (2/11) Jul 09 2012 Those changes are trivial to incorporate into a well written parser.
- Denis Shelomovskij (7/12) Jul 05 2012 I didn't want for someone to code anything at all! I on the contrary
- Jonathan M Davis (23/33) Jul 05 2012 eone
- Dmitry Olshansky (5/16) Jul 05 2012 It doesn't work like that in OpenSource. No matter what you do people
- deadalnix (4/25) Jul 05 2012 You never looked at dmd frontend soure code don't you ? It is a horror
- Jonathan M Davis (19/47) Jul 05 2012 files
- Chad J (9/10) Jul 07 2012 Interesting, I thought you were hand-writing this stuff.
- Roman D. Boiko (29/41) Jul 07 2012 There are many possible customization point which would make it a
- Philippe Sigaud (15/29) Jul 07 2012 I added dstrings because
- Roman D. Boiko (9/28) Jul 07 2012 I propose to switch code to use S if(isSomeString!S) everywhere.
- Chad J (8/22) Jul 07 2012 Yeah, it's good to hear this notion reinforced. I had this suspicion
- Roman D. Boiko (6/14) Jul 07 2012 At the very least, we could use DFA instead of backtracking where
- Chad J (3/15) Jul 07 2012 These were my thoughts exactly, although somewhat unsubstantiated in my
- Dmitry Olshansky (5/21) Jul 07 2012 Yup the nice thing about ANTLR is the usage of DFA. e.g. it's used for
- Andrei Alexandrescu (4/16) Jul 07 2012 Doesn't ANTLR use full-fledged character-level LL(*) parsing even in the...
- Roman D. Boiko (13/15) Jul 07 2012 Since I didn't understand your question I assume that my
- David Piepgrass (9/17) Jul 07 2012 ANTLR 3 doesn't use a DFA unless it needs to. If unlimited
- Andrei Alexandrescu (4/13) Jul 07 2012 This should help:
- Roman D. Boiko (6/7) Jul 07 2012 Thanks, nice article.
- Dmitry Olshansky (6/30) Jul 07 2012 Another idea that I've never realized is to add operator precedence
- Roman D. Boiko (5/9) Jul 07 2012 But that's already available by explicitly defining expression
- Dmitry Olshansky (5/13) Jul 07 2012 Yes, I've meant something completely different ;)
- Timon Gehr (2/10) Jul 07 2012 http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climb...
- Roman D. Boiko (6/7) Jul 07 2012 OK, at least I didn't misunderstand. So my question was whether
- Roman D. Boiko (10/17) Jul 07 2012 OK, now I agree, the need to perform several nested calls in
- Jonathan M Davis (5/20) Jul 07 2012 I don't know about this particular case, because I haven't really looked...
- Roman D. Boiko (16/32) Jul 10 2012 One disadvantage of Packrat parsers I mentioned was problematic
- Philippe Sigaud (7/15) Jul 10 2012 IIRC, that's what I encoded in Pegged (admittedly limited) error
- Timon Gehr (2/17) Jul 10 2012 FWIW, this is what most HTML parsers are doing.
- Philippe Sigaud (4/9) Jul 10 2012 Ah, right. I can get it for HTML/XML. JSON also, maybe.
- Roman D. Boiko (5/20) Jul 10 2012 It would still generate errors. But would enable a lot of useful
- Jonathan M Davis (6/30) Jul 10 2012 Which is horrible. You pretty much have to with HTML because of the horr...
- Roman D. Boiko (4/14) Jul 10 2012 Not having control over parser or source code causes problems.
- Jacob Carlborg (15/19) Jul 10 2012 I'm not sure but I think he was referring to a kind of error reporting
- Jonathan M Davis (6/26) Jul 10 2012 Well, giving an error, continuing to parse, and giving a partial result ...
- Timon Gehr (3/28) Jul 10 2012 This is actually precisely what many of the more recent curly-brace-
- Jacob Carlborg (11/15) Jul 10 2012 No, that is _not_ what happens with HTML. With HTML, the browser _do
- Jonathan M Davis (11/26) Jul 10 2012 ??? I guess that I wasn't clear. I mean that with HTML, it ignores error...
- Jacob Carlborg (4/13) Jul 11 2012 Ok, I see. It seems we're meaning the same thing.
- Dmitry Olshansky (6/35) Jul 10 2012 BTW clang does this and even more of stuff on semantic level. It's known...
- David Piepgrass (64/86) Jul 07 2012 Interesting. After trying to use ANTLR-C# several years back, I
- Chad J (91/165) Jul 07 2012 I was thinking the same thing.
- Roman D. Boiko (13/30) Jul 07 2012 Funny, we've discussed an idea to introduce some hybrid of regex
- Chad J (64/77) Jul 07 2012 From some of my earlier scribblings:
- Roman D. Boiko (19/31) Jul 07 2012 This would cause wasting space (probably a lot). Definitely it
- Chad J (13/42) Jul 07 2012 I see what you mean.
- Roman D. Boiko (15/16) Jul 07 2012 And I think that I expressed myself badly. Let me rephrase.
- Chad J (2/17) Jul 07 2012 Yep. Makes sense.
- David Piepgrass (79/116) Jul 07 2012 I'm glad to hear you like the tree-parsing approach, Chad,
- Roman D. Boiko (11/20) Jul 07 2012 I bet that after stage 2 you would have performed almost the same
- David Piepgrass (21/44) Jul 07 2012 Hmm, you've got a good point there, although simple
- David Piepgrass (18/23) Jul 07 2012 That reminds me, I forgot to write a another advantage I expected
- Roman D. Boiko (10/20) Jul 07 2012 There could be multiple errors that compensate each other and
- David Piepgrass (10/30) Jul 07 2012 This is all true, but forgetting a brace commonly results in a
- Roman D. Boiko (4/13) Jul 07 2012 So you simply admit that error recovery is difficult to
- David Piepgrass (4/17) Jul 07 2012 I'm not seeing any tremendous error-handling difficulty in my
- Roman D. Boiko (55/58) Jul 07 2012 (I will use the word "you" to refer to some abstract person or
- Roman D. Boiko (15/21) Jul 07 2012 Exactly. Semantic analysis is what would benefit from that, as
- Chad J (64/97) Jul 07 2012 Yes and yes.
- David Piepgrass (32/61) Jul 07 2012 Well, for templates, in general, it would be necessary to
- Timon Gehr (16/159) Jul 07 2012 I'd suggest:
- Timon Gehr (3/22) Jul 07 2012 Also, I'd like to point out that the transformation is not actually
- Chad J (3/21) Jul 07 2012 I wish.
- Dmitry Olshansky (6/9) Jul 05 2012 Then do it. It's all about having something so obviously good, fast and
- Roman D. Boiko (2/7) Jul 05 2012 Would you help to make Pegged faster?
- Dmitry Olshansky (5/11) Jul 05 2012 Well why not.
- Roman D. Boiko (5/8) Jul 05 2012 I bet that after you finish with GSOC optimizing Pegged will not
- Andrei Alexandrescu (5/15) Jul 05 2012 Good call :o).
- Dmitry Olshansky (5/20) Jul 05 2012 It's great that you are not opposed to adjusting scope of project.
- Caligo (8/18) Jul 05 2012 Is the actual grammar available somewhere? The online language
- Roman D. Boiko (4/7) Jul 05 2012 In PEG format, yes (not necessarily correct):
- Kai Meyer (27/34) Jul 31 2012 I know I'm a little late coming into this conversation. This seems like
- Tobias Pankrath (15/28) Jul 31 2012 Yes. To make this statement more precise: We need a lexer that
- Philippe Sigaud (26/45) Jul 31 2012 I think we need a tokenizer generator and a parser generator. They can
- Tobias Pankrath (4/14) Jul 31 2012 I have no numbers. It's a statement that I found in some (1-3)
- Dmitry Olshansky (20/61) Jul 31 2012 Parser can use constant IDs for nested tables, IDs point to string table...
- Philippe Sigaud (18/43) Jul 31 2012 The latter, I get. The former, not so much.
- Timon Gehr (5/9) Jul 31 2012 Ddoc is typically not required. By default it should be treated as
- Jonathan M Davis (6/9) Jul 31 2012 That was how I was intending to deal with ddoc. It's just a nested block...
- Philippe Sigaud (11/19) Jul 31 2012 OK. Same for standard comment and doc comments?
- Dmitry Olshansky (10/20) Jul 31 2012 That's why there is "allows" (not "required") the reason in general is
- Dmitry Olshansky (12/61) Jul 31 2012 Well while lookahead will help, there are simpler ways. e.g.
- Philippe Sigaud (5/18) Jul 31 2012 In the FORTRAN case, you indeed *need* to re-lex the stuff after IF,
- Dmitry Olshansky (8/16) Jul 31 2012 Okay. Say lexer maps all unique strings that are not keywords to some ID...
- Philippe Sigaud (3/21) Jul 31 2012 Ah, that! This I know, I misunderstood your initial post. The symbol
- Jonathan M Davis (9/11) Jul 31 2012 I'm actually quite far along with one now - one which is specifically wr...
- Jacob Carlborg (4/12) Aug 01 2012 That's awesome. I really looking forward to this. Keep up the good work.
- Jacob Carlborg (4/11) Aug 01 2012 BTW, do you have to code online somewhere?
- Jonathan M Davis (3/13) Aug 01 2012 No, because I'm still in the middle of working on it.
- Denis Shelomovskij (5/16) Aug 01 2012 Good. Will wait for announce.
- Philippe Sigaud (10/21) Jul 31 2012 Could you please describe the kind of token it produces?
- Dmitry Olshansky (27/63) Jul 31 2012 Okay. Critics go as follows:
- Kai Meyer (15/86) Aug 06 2012 I cut my teeth on perl, so everything gets compared to perl in my mind.
- Philippe Sigaud (5/18) Aug 06 2012 D does have compiled regex, it's called ctRegex (meaning compile-time re...
- Dmitry Olshansky (26/48) Aug 06 2012 Of course regex is first compiled to bytecode (same thing as "compile"
- Philippe Sigaud (7/20) Aug 06 2012 Btw, I wanted to ask you that for a long time: what do you mean by
- Dmitry Olshansky (26/46) Aug 06 2012 by this:
- Dmitry Olshansky (4/7) Aug 06 2012 Compiled native code is faster. Perl is slower.
- Philippe Sigaud (3/13) Aug 06 2012 Sorry, I meant: what second version? How do you distinguish between
- David Nadlinger (4/7) Aug 06 2012 As far as I know, ctRegex _always_ produces machine code.
- Philippe Sigaud (6/8) Aug 06 2012 That's what I had in mind too, but Dmitry seemed to say things are
- Dmitry Olshansky (17/26) Aug 06 2012 Yeah, sorry for confusion - there are only 2 ways to do the job:
- Jonathan M Davis (69/94) Jul 31 2012 Yeah. Once I started on it, I made a lot of progress really quickly. The...
- travert phare.normalesup.org (Christophe Travert) (16/30) Aug 01 2012 The occurence of tabWidth surprises me.
- Jonathan M Davis (42/66) Aug 01 2012 col counts code points. tabWidth is the number of code points that '\t'=
- Jacob Carlborg (8/23) Aug 01 2012 What about the end/length of a token? Token.str.length would give the
- Jonathan M Davis (14/42) Aug 01 2012 I'm not sure. I don't think so. It doesn't really keep track of code poi...
- Jacob Carlborg (5/11) Aug 01 2012 Doing a syntax highlighter and calculate the length (walkLength) for
- Jonathan M Davis (16/41) Jul 31 2012 blockComment, /// $(D /* */)
- Philippe Sigaud (43/98) Jul 31 2012 My fear (having encoded the D grammar once) is that 99% of the code
- Jacob Carlborg (5/9) Aug 01 2012 Some IDE's do a more advanced syntax highlighting based on the semantic
- Jonathan M Davis (64/121) Jul 31 2012 I'm not using regexes at all. It's using string mixins to reduce code
- Jacob Carlborg (8/13) Aug 01 2012 That's good idea. Most code can be treated as ASCII (I assume most
- Jonathan M Davis (10/23) Aug 01 2012 What's of particular importance is the fact that _all_ of the language
- Philippe Sigaud (17/52) Jul 31 2012 OK. It'll more difficult to genericize, then, but that's not your
- Jonathan M Davis (40/68) Jul 31 2012 It was never intended to be even vaguely generic. It's targeting D
- Jacob Carlborg (4/8) Aug 01 2012 :)
- Philippe Sigaud (20/40) Aug 01 2012 That's exactly what I had in mind. Anyway, we need a D lexer. We also
- Jacob Carlborg (11/26) Aug 01 2012 I'm not quite sure how it works either. But I'm thinking like this:
- Jonathan M Davis (70/82) Aug 01 2012 It contains unicode. The lexer is lexing whatever encoding the source i=
- Philippe Sigaud (9/29) Aug 01 2012 do you mean I can do:
- Jacob Carlborg (16/27) Aug 01 2012 Unicode supports three encodings: UTF-8, UTF-16 and UTF-32. All these
- Philippe Sigaud (9/13) Aug 01 2012 (emphasis mine)
- Jacob Carlborg (4/10) Aug 01 2012 You're welcome :)
- Andrej Mitrovic (10/15) Aug 01 2012 I think many people viewed Unicode this way at first. But there is a
- Jacob Carlborg (5/14) Aug 01 2012 This is a good read as well:
- Philippe Sigaud (8/17) Aug 01 2012 I will, but not yet. I've a few books on parsing and compilers to read
- Dmitry Olshansky (6/24) Aug 01 2012 Once you have time to learn some unicode, check out this page:
- Andrej Mitrovic (3/6) Aug 01 2012 Didn't know about that one, cool! Also might come in handy:
- Jonathan M Davis (8/13) Aug 01 2012 I guess that that would explain why you didn't understand what I was say...
- Philippe Sigaud (5/17) Aug 01 2012 I knew about 1-2-4 bytes schemes and such. But, somehow, for me,
- Jonathan M Davis (26/42) Aug 01 2012 D source text can be in one of the following formats:
- Jacob Carlborg (8/31) Aug 01 2012 But if you read a source file which is encoded using UTF-16 you would
- Jonathan M Davis (16/21) Aug 01 2012 It may very well be a good idea to templatize Token on range type. It wo...
- Jacob Carlborg (33/46) Aug 01 2012 To me a string type would be enough. I don't need support for ranges.
- travert phare.normalesup.org (Christophe Travert) (32/46) Aug 02 2012 It can't if it is a simple input range! Like a file read with most
- Jonathan M Davis (43/66) Aug 02 2012 It
There are more and more projects requiring parsing D code (IDE plugins, DCT and dfmt now). We definitely need a _standard_ fast D parser (suitable as a tokenizer). We already have a good one at least in Visual D. Maybe dmd's parser is faster; if so, it can be (easily?) rewritten in D. We also already have others (from Mono-D etc.). What about picking one and calling it standard?

--
Денис В. Шеломовский
Denis V. Shelomovskij
Jul 05 2012
On Thursday, 5 July 2012 at 12:11:33 UTC, Denis Shelomovskij wrote:
> There are more and more projects requiring parsing D code (IDE plugins, DCT and dfmt now). We definitely need a _standard_ fast D parser (suitable as a tokenizer). We already have a good one at least in Visual D. Maybe dmd's parser is faster; if so, it can be (easily?) rewritten in D. We also already have others (from Mono-D etc.). What about picking one and calling it standard?

Visual-D is not Boost-licensed (I think this would be possible to change).

Pegged may eventually become standard, if it gets performance-optimized and a bit more customizable.

Dscanner (https://github.com/Hackerpilot/Dscanner) from Brian Schott is pretty good, too.

SDC is another nice option.

DIL (http://code.google.com/p/dil/) is very nice but GPL.

I plan to try using Pegged inside my DCT project. Probably that will require huge modifications though...

Some more links from the Pegged readme:
> Hisayuki Mima's CTPG (https://github.com/youkei/ctpg), very similar, also done in D. Have a look!
> Nick Sabalausky's Goldie (http://www.dsource.org/projects/goldie).
> Benjamin Shropshire's dparser (http://dsource.org/projects/scrapple/browser/trunk/dparser).
> Martin Nowak put these gists on the D newsgroup:
> https://gist.github.com/1255439 - lexer generator
> https://gist.github.com/1262321 - complete and fast D lexer
Jul 05 2012
Forgot to add DDMD, which also has been forked and redesigned recently by someone (thus two more different compiler frontends). But overall I doubt that any project could become standard very soon.
Jul 05 2012
On 05.07.2012 16:30, Roman D. Boiko wrote:
> Forgot to add DDMD, which also has been forked and redesigned recently by someone (thus two more different compiler frontends). But overall I doubt that any project could become standard very soon.

Why? Even were they all almost equal, we could select any one. The point is to select something (anything) that guides a coder who wants to work on this task to a good/up-to-date/supported parser.

--
Денис В. Шеломовский
Denis V. Shelomovskij
Jul 05 2012
On Thursday, 5 July 2012 at 12:53:02 UTC, Denis Shelomovskij wrote:
> On 05.07.2012 16:30, Roman D. Boiko wrote:
>> Forgot to add DDMD, which also has been forked and redesigned recently by someone (thus two more different compiler frontends). But overall I doubt that any project could become standard very soon.
> Why? Even were they all almost equal, we could select any one. The point is to select something (anything) that guides a coder who wants to work on this task to a good/up-to-date/supported parser.

Well, I guess we'd like to select one and not change it later, don't we? I'm not sure we can decide which is the best currently and will stay the best in the future for all major use cases. Anyway I propose to enumerate major use cases first.
Jul 05 2012
On 2012-07-05 15:08, Roman D. Boiko wrote:
> Anyway I propose to enumerate major use cases first.

Haven't we already done that a couple of times.

--
/Jacob Carlborg
Jul 05 2012
On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:
> On 2012-07-05 15:08, Roman D. Boiko wrote:
>> Anyway I propose to enumerate major use cases first.
> Haven't we already done that a couple of times.

Well, we did something like that for DCT... but I doubt that it would fit general needs. If we had, why haven't they been analyzed, classified, discussed, etc.? Or have they?
Jul 05 2012
On 2012-07-05 18:04, Roman D. Boiko wrote:
> Well, we did something like that for DCT... but I doubt that it would fit general needs.

Why wouldn't it.

> If we had, why haven't they been analyzed, classified, discussed, etc.? Or have they?

I don't know. Here is what I wrote for DCT:

* IDE integration
* Refactoring tool
* Static analysis
* Compiler
* Doc generating
* Build tool
* DI generating

In general, use cases that can span several compile phases, i.e. lexing, parsing, semantic analysis and so on. Some of these use cases can be broken into several new use cases at a lower level. Some examples:

IDE integration:
* Syntax highlighting
* Code completion
* Showing lex, syntax and semantic errors
* Formatter

Refactoring:
* Cross-referencing symbols
* Renaming of symbols
* Extract local variable to instance variable
* Extract variable to function/method
* Extract a piece of code/method into a new class

Build tool:
* Tracking module dependencies

Doc generating:
* Associate a declaration and its documentation

Original post:
http://www.digitalmars.com/d/archives/digitalmars/D/DCT_use_cases_-_draft_168106.html#N168141

--
/Jacob Carlborg
Jul 05 2012
On Thursday, 5 July 2012 at 16:14:27 UTC, Jacob Carlborg wrote:
> Original post:
> http://www.digitalmars.com/d/archives/digitalmars/D/DCT_use_cases_-_draft_168106.html#N168141

OK, fairly complete. Let it be the starting point. Why not put it on some wiki, then add some more, discuss, vote, etc.? Meanwhile, create a matrix of features implemented in each of the interesting standardization candidates, and pick one of them as the "standard" or "reference" parser. My vote would be for Pegged, I guess.
Jul 05 2012
On 2012-07-05 18:32, Roman D. Boiko wrote:
> My vote would be for Pegged, I guess.

Aren't you voting for your own project? :)

--
/Jacob Carlborg
Jul 05 2012
On Thursday, 5 July 2012 at 16:38:27 UTC, Jacob Carlborg wrote:
> On 2012-07-05 18:32, Roman D. Boiko wrote:
>> My vote would be for Pegged, I guess.
> Aren't you voting for your own project? :)

Well, I'm going to base parsing on Pegged, after tweaking it to my needs.
Jul 05 2012
On 2012-07-05 18:32, Roman D. Boiko wrote:
> My vote would be for Pegged, I guess.

As much as I'm flattered by that, my current impression is that Pegged is very far from being performant. As a proof of concept that, in D, it's possible to parse a string and create a parse tree at compile time and then generate code from this, it's successful. Go D! As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I learn by coding. Hey, I just received the Dragon Book (International Edition); I'm sure I'll learn many things in there.

So, if anyone is willing to change the code generated by Pegged, I'm game. The results you showed me on keyword parsing are very interesting!

But my impression is that the need for a D-only parser and lexer is far greater and much more immediate than the need for a parser generator. All the reasons advanced upthread ask for a D parser, not a generic generator. Parser generators are for those of us interested in having DSLs or macros in D. So Pegged or any other generator should *not* get the community focus right now.

My plan would be as follows:

1- assemble a group of people knowing parsing. I don't think I'm exactly knowledgeable, but I'm ready to be part of such a group.
2- create a github process.
3- translate an existing parser / adapt a D parser for Phobos. I'm ready to be part of this (I'm sure I'll learn in the process).
4- spend 1-2 years fighting over LR parsing and such :) (Just kidding)
5- submit it to Phobos and have it adopted.

much later:

6- see the way the parser code is organized and tweak a code generator (Pegged is a possibility if recursive parsing is OK) to produce equivalent code when fed the D grammar.

side-effect: maybe a std.tree or std.collection.tree to deal with trees in a generic way.

Philippe
Jul 05 2012
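For readers who have not used Pegged: the compile-time parse-tree generation Philippe refers to is its headline feature. Below is a minimal sketch in the spirit of the Pegged README; the exact rule syntax and the API of the generated parser have varied between Pegged versions, so treat the details as illustrative rather than authoritative.

// Hedged sketch of Pegged usage, modelled on its README example.
// The grammar() call and mixin pattern are the documented way to use Pegged;
// details may differ from the exact Pegged version discussed in this thread.
import pegged.grammar;

mixin(grammar(`
Arithmetic:
    Expr     < Term (("+" / "-") Term)*
    Term     < Primary (("*" / "/") Primary)*
    Primary  < "(" Expr ")" / Number
    Number   <~ [0-9]+
`));

void main()
{
    // The same generated parser runs during compilation (via CTFE)...
    enum ctTree = Arithmetic("1 + 2 * (3 - 4)");
    // ...and at run time, producing a parse tree in both cases.
    auto rtTree = Arithmetic("42 * (8 + 1)");
    assert(ctTree.successful && rtTree.successful);
}

The point of contention in the thread is not this usage pattern but how fast the generated parser, and the compile-time generation step itself, can be made.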
On 7/5/12 2:16 PM, Philippe Sigaud wrote:
> As much as I'm flattered by that, my current impression is that Pegged is very far from being performant. As a proof of concept that, in D, it's possible to parse a string and create a parse tree at compile time and then generate code from this, it's successful. Go D! As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I learn by coding. Hey, I just received the Dragon Book (International Edition); I'm sure I'll learn many things in there.

I'll be glad to buy for you any book you might feel you need for this. Again, there are few things more important for D right now than exploiting its unmatched-by-competition features to great ends. I don't want the lack of educational material to hold you down. Please continue working on this and let me know of what you need.

> So, if anyone is willing to change the code generated by Pegged, I'm game. The results you showed me on keyword parsing are very interesting! But my impression is that the need for a D-only parser and lexer is far greater and much more immediate than the need for a parser generator.

I very strongly disagree. This is our chance to get things right instead of having a limited product that destroys competition (much like lex and yacc have been for years in the parser generator field).

> All the reasons advanced upthread ask for a D parser, not a generic generator. Parser generators are for those of us interested in having DSLs or macros in D. So Pegged or any other generator should *not* get the community focus right now.

Pegged should be the focus.

> My plan would be as follows:
> 1- assemble a group of people knowing parsing. I don't think I'm exactly knowledgeable, but I'm ready to be part of such a group.
> 2- create a github process.
> 3- translate an existing parser / adapt a D parser for Phobos. I'm ready to be part of this (I'm sure I'll learn in the process).
> 4- spend 1-2 years fighting over LR parsing and such :) (Just kidding)
> 5- submit it to Phobos and have it adopted.

Sounds good. Replace 1-2 years with 1-2 months.


Andrei
Jul 05 2012
On Thursday, 5 July 2012 at 18:28:21 UTC, Andrei Alexandrescu wrote:
> On 7/5/12 2:16 PM, Philippe Sigaud wrote:
>> So Pegged or any other generator should *not* get the community focus right now.
> Pegged should be the focus.

+10 (can I vote ten times?)

>> My plan would be as follows:
>> 1- assemble a group of people knowing parsing. I don't think I'm exactly knowledgeable, but I'm ready to be part of such a group.
>> 2- create a github process.
>> 3- translate an existing parser / adapt a D parser for Phobos. I'm ready to be part of this (I'm sure I'll learn in the process).
>> 4- spend 1-2 years fighting over LR parsing and such :) (Just kidding)
>> 5- submit it to Phobos and have it adopted.
> Sounds good. Replace 1-2 years with 1-2 months.

Well, probably with 3-4 months... :)
Jul 05 2012
On Thu, Jul 5, 2012 at 8:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> I'll be glad to buy for you any book you might feel you need for this. Again, there are few things more important for D right now than exploiting its unmatched-by-competition features to great ends. I don't want the lack of educational material to hold you down. Please continue working on this and let me know of what you need.

That's nice of you, if a bit extreme for a public mailing list :)

Andrei, money is no problem :) Interest in the field of parsing is no problem. Interest in D's future is no problem. Having a demanding job, and three children, is a problem. No, scratch that, you know what I mean.

But hey, Roman is doing interesting things on keyword parsing right now, and changing the parser generated by Pegged is not difficult. We will see where this thread leads. (Roman, you should send your results here, because I'm still taken aback by the built-in AA speed compared to linear array look-up for 100 keywords).

As Dmitry said, we might hit a CTFE wall: memory consumption, computation speed, ... (*channelling Andrei*: then we will correct whatever makes CTFE a problem. Right)

Philippe

(Hesitating between 'The Art of the Metaobject Protocol' and 'Compilers, Techniques and Tools', right now)
Jul 05 2012
On Thursday, 5 July 2012 at 19:54:39 UTC, Philippe Sigaud wrote:
> On Thu, Jul 5, 2012 at 8:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>> I'll be glad to buy for you any book you might feel you need for this. Again, there are few things more important for D right now than exploiting its unmatched-by-competition features to great ends. I don't want the lack of educational material to hold you down. Please continue working on this and let me know of what you need.
> That's nice of you, if a bit extreme for a public mailing list :) Andrei, money is no problem :) Interest in the field of parsing is no problem. Interest in D's future is no problem. Having a demanding job, and three children, is a problem. No, scratch that, you know what I mean.

I have four, from 1 to 7 years old... Wouldn't call them a problem, though :)))

> But hey, Roman is doing interesting things on keyword parsing right now, and changing the parser generated by Pegged is not difficult. We will see where this thread leads. (Roman, you should send your results here, because I'm still taken aback by the built-in AA speed compared to linear array look-up for 100 keywords).

Well, I wouldn't call those "results" yet. Just some benchmarks. Here they are:

isKeyword_Dummy (baseline): 427 [microsec] total, 28 [nanosec / lookup].
isKeyword_Dictionary: 1190 [microsec] total, 214 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 466 [microsec] total, 83 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 2722 [microsec] total, 490 [nanosec / lookup].
isKeyword_LinearArrayLookup: 13822 [microsec] total, 2490 [nanosec / lookup].
isKeyword_UnicodeTrie: 1317 [microsec] total, 237 [nanosec / lookup].
isKeyword_UnicodeTrieBoolLookup: 1072 [microsec] total, 193 [nanosec / lookup].
Total: 22949 identifiers + 5551 keywords.

isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [nanosec / lookup].
isKeyword_Dictionary: 4247 [microsec] total, 242 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 14351 [microsec] total, 820 [nanosec / lookup].
isKeyword_LinearArrayLookup: 59564 [microsec] total, 3405 [nanosec / lookup].
isKeyword_UnicodeTrie: 4167 [microsec] total, 238 [nanosec / lookup].
isKeyword_UnicodeTrieBoolLookup: 3466 [microsec] total, 198 [nanosec / lookup].
Total: 104183 identifiers + 17488 keywords.

> As Dmitry said, we might hit a CTFE wall: memory consumption, computation speed, ... (*channelling Andrei*: then we will correct whatever makes CTFE a problem. Right)
> Philippe
> (Hesitating between 'The Art of the Metaobject Protocol' and 'Compilers, Techniques and Tools', right now)
Jul 05 2012
> isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [ns / lookup].

This one calculates a sum of all identifier code units. Included for comparison.

> isKeyword_Dictionary: 4247 [microsec] total, 242 [ns / lookup].

Check whether an identifier is a keyword using AA (dictionary) lookup.

> isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 [ns / lookup].

Switch by identifier length at the top, then recursively switch by each character. Clearly a winner, but I think it can be improved further.

> isKeyword_BinaryArrayLookup: 14351 [microsec] total, 820 [ns / lookup].

Binary search in an ordered array of keywords.

> isKeyword_LinearArrayLookup: 59564 [microsec] total, 3405 [ns / lookup].

Ditto, search is linear.

> isKeyword_UnicodeTrie: 4167 [microsec] total, 238 [ns / lookup].

Lookup a keyword in a trie, created by Dmitry. This will be improved.

> isKeyword_UnicodeTrieBoolLookup: 3466 [microsec] total, 198 [ns / lookup].

The same, but only determines whether an identifier is a keyword, not which exactly.

> Total: 104183 identifiers + 17488 keywords.

Analyzed the largest Phobos file (DateTime? I didn't check the name.) Results are consistent for other files, too.
Jul 05 2012
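To make the winning approach concrete: "switch by identifier length at the top, then by each character" can be written by hand for a few keywords as below. This is not Roman's generated benchmark code; the function name and the tiny keyword subset are invented for illustration only.

// Hand-written sketch of the "switch by length, then by character" keyword
// test benchmarked above. The real version is machine-generated for the full
// D keyword table; this one covers only a handful of keywords to show the shape.
bool isKeywordSketch(string ident)
{
    switch (ident.length)
    {
    case 2:
        switch (ident[0])
        {
        case 'd': return ident[1] == 'o';                    // "do"
        case 'i': return ident[1] == 'f' || ident[1] == 'n'; // "if", "in"
        default:  return false;
        }
    case 4:
        switch (ident[0])
        {
        case 'e': return ident[1 .. $] == "lse";             // "else"
        case 'n': return ident[1 .. $] == "ull";             // "null"
        default:  return false;
        }
    default:
        return false;
    }
}

unittest
{
    assert(isKeywordSketch("if") && isKeywordSketch("null"));
    assert(!isKeywordSketch("iq") && !isKeywordSketch("main"));
}

Unrolling the same idea over every D keyword is what produces the large (~72 KB, as Dmitry notes below) but very fast generated switch.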
On 06-Jul-12 00:16, Roman D. Boiko wrote:
>> isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [ns / lookup].
> This one calculates a sum of all identifier code units. Included for comparison.
>> isKeyword_Dictionary: 4247 [microsec] total, 242 [ns / lookup].
> Check whether an identifier is a keyword using AA (dictionary) lookup.
>> isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 [ns / lookup].
> Switch by identifier length at the top, then recursively switch by each character. Clearly a winner, but I think it can be improved further.

I'd stress the fact that it's a fully unrolled & hard-coded switch that takes a lot of pages (~72Kbytes). It could easily be perfect. Sorry, couldn't resist ;) And I'm not sure how much it could be optimized - maybe some measly 10-30%.

>> Total: 104183 identifiers + 17488 keywords.
> Analyzed the largest Phobos file (DateTime? I didn't check the name.) Results are consistent for other files, too.

It is std.datetime as I've been running this benchmark against it and seen the same numbers :)

--
Dmitry Olshansky
Jul 05 2012
On Thu, Jul 5, 2012 at 10:02 PM, Roman D. Boiko <rb d-coding.com> wrote (on children):
> I have four, from 1 to 7 years old... Wouldn't call them a problem, though :)))

Better not tell my wife. She's making noises about having a fourth.
Jul 05 2012
On 7/5/12 3:54 PM, Philippe Sigaud wrote:
> (Hesitating between 'The Art of the Metaobject Protocol' and 'Compilers, Techniques and Tools', right now)

Former sux latter rox.

Andrei
Jul 05 2012
On Thu, 5 Jul 2012 21:54:29 +0200 Philippe Sigaud <philippe.sigaud gmail.com> wrote:
> (Hesitating between 'The Art of the Metaobject Protocol' and 'Compilers, Techniques and Tools', right now)

FWIW, I've found this one to be pretty helpful:
http://www.pearsonhighered.com/educator/product/Crafting-A-Compiler/9780136067054.page

It's the closest one I've seen to being more for programmers than mathematicians.
Jul 05 2012
On Fri, Jul 6, 2012 at 7:26 AM, Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> wrote:
> On Thu, 5 Jul 2012 21:54:29 +0200 Philippe Sigaud <philippe.sigaud gmail.com> wrote:
>> (Hesitating between 'The Art of the Metaobject Protocol' and 'Compilers, Techniques and Tools', right now)
> FWIW, I've found this one to be pretty helpful:
> http://www.pearsonhighered.com/educator/product/Crafting-A-Compiler/9780136067054.page
> It's the closest one I've seen to being more for programmers than mathematicians.

Thanks Nick, I'll add it to my 'will buy one day' list (which grows quite rapidly).

$122? Ouch.
Jul 06 2012
On Fri, 6 Jul 2012 09:50:26 +0200 Philippe Sigaud <philippe.sigaud gmail.com> wrote:
> On Fri, Jul 6, 2012 at 7:26 AM, Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> wrote:
>> On Thu, 5 Jul 2012 21:54:29 +0200 Philippe Sigaud <philippe.sigaud gmail.com> wrote:
>>> (Hesitating between 'The Art of the Metaobject Protocol' and 'Compilers, Techniques and Tools', right now)
>> FWIW, I've found this one to be pretty helpful:
>> http://www.pearsonhighered.com/educator/product/Crafting-A-Compiler/9780136067054.page
>> It's the closest one I've seen to being more for programmers than mathematicians.
> Thanks Nick, I'll add it to my 'will buy one day' list (which grows quite rapidly).
> $122? Ouch.

Yea, it's from an academic publisher unfortunately, so $$$. I got ahold of it through the local library systems (All of Ohio's college libraries are connected in a system called OhioLINK, and books from it can be put on hold and shipped to most of the public libraries - or at least most of the public libraries around the Cleveland area.) Maybe there's something similar out your way?
Jul 06 2012
On Thursday, 5 July 2012 at 18:17:06 UTC, Philippe Sigaud wrote:
> On 2012-07-05 18:32, Roman D. Boiko wrote:
>> My vote would be for Pegged, I guess.
> As much as I'm flattered by that, my current impression is that Pegged is very far from being performant. As a proof of concept that, in D, it's possible to parse a string and create a parse tree at compile time and then generate code from this, it's successful. Go D! As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I learn by coding. Hey, I just received the Dragon Book (International Edition); I'm sure I'll learn many things in there.
> So, if anyone is willing to change the code generated by Pegged, I'm game. The results you showed me on keyword parsing are very interesting!
> But my impression is that the need for a D-only parser and lexer is far greater and much more immediate than the need for a parser generator. All the reasons advanced upthread ask for a D parser, not a generic generator. Parser generators are for those of us interested in having DSLs or macros in D. So Pegged or any other generator should *not* get the community focus right now.

I'm sure it can generate **much** faster code. I'm going to focus on the part of it that generates the D parser (i.e., make it significantly faster and able to efficiently parse-as-you-type). Actually, I'm sure it will be able to beat any other parser with respect to performance. :)

So my plan is the following:
1. Invite whoever would want to help.
2. Prove my claims above in practice. :-)))))
Jul 05 2012
On Thursday, 5 July 2012 at 18:17:06 UTC, Philippe Sigaud wrote:
> As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I learn by coding. Hey, I just received the Dragon Book (International Edition),

If you are interested in parsing, then I wouldn't recommend the Dragon Book, but this one:
http://dickgrune.com/Books/PTAPG_2nd_Edition/Additional.html

It really is an awesome book.
Jul 05 2012
On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias pankrath.net> wrote:
> On Thursday, 5 July 2012 at 18:17:06 UTC, Philippe Sigaud wrote:
>> As a parser proper, Pegged is awful :-) Nothing I'm ashamed of, as I learn by coding. Hey, I just received the Dragon Book (International Edition),
> If you are interested in parsing, then I wouldn't recommend the Dragon Book, but this one:
> http://dickgrune.com/Books/PTAPG_2nd_Edition/Additional.html
> It really is an awesome book.

Hey, good catch! I saw this one a few months ago and forgot about it. Having half a book being annotated bibliography (IIUC) scared me.

Hmm, 72 € by Springer, 55 € on Amazon. Something is not right. Paperback vs perfect bound maybe?

Is "Modern Compiler Design" by the same authors any good?
Jul 05 2012
On Thursday, 5 July 2012 at 20:28:32 UTC, Philippe Sigaud wrote:
> Hmm, 72 € by Springer, 55 € on Amazon. Something is not right. Paperback vs perfect bound maybe?

http://www.komkon.org/~sher/books/parsing_techniques_2008.pdf

Not sure that it is legal, but definitely free.
Jul 05 2012
On 05/07/2012 18:32, Roman D. Boiko wrote:
> On Thursday, 5 July 2012 at 16:14:27 UTC, Jacob Carlborg wrote:
>> Original post:
>> http://www.digitalmars.com/d/archives/digitalmars/D/DCT_use_cases_-_draft_168106.html#N168141
> OK, fairly complete. Let it be the starting point. Why not put it on some wiki, then add some more, discuss, vote, etc.? Meanwhile, create a matrix of features implemented in each of the interesting standardization candidates, and pick one of them as the "standard" or "reference" parser. My vote would be for Pegged, I guess.

Why not program instead of doing bureaucracy?
Jul 05 2012
On Thursday, 5 July 2012 at 22:11:41 UTC, deadalnix wrote:
> Why not program instead of doing bureaucracy?

To avoid programming things which are not needed or don't fit. I've thrown away several implementations already... time to reflect a little :)

But, actually, for DCT I do know what I need to do. That suggestion was with respect to any candidate for becoming standard. Everybody understands that in their own way (likely differently from others).
Jul 05 2012
On 06/07/2012 00:21, Roman D. Boiko wrote:
> On Thursday, 5 July 2012 at 22:11:41 UTC, deadalnix wrote:
>> Why not program instead of doing bureaucracy?
> To avoid programming things which are not needed or don't fit. I've thrown away several implementations already... time to reflect a little :) But, actually, for DCT I do know what I need to do. That suggestion was with respect to any candidate for becoming standard. Everybody understands that in their own way (likely differently from others).

Well, you made the mistake of introducing an over-complex mechanism in O(log n) to get back to source code when an obvious one in O(1) exists. Let me doubt.
Jul 05 2012
On Thursday, 5 July 2012 at 22:32:29 UTC, deadalnix wrote:
> Well, you made the mistake of introducing an over-complex mechanism in O(log n) to get back to source code when an obvious one in O(1) exists. Let me doubt.

I'm fine with that, let you doubt.
Jul 05 2012
On 05.07.2012 20:14, Jacob Carlborg wrote:
> On 2012-07-05 18:04, Roman D. Boiko wrote:
>> Well, we did something like that for DCT... but I doubt that it would fit general needs.
> Why wouldn't it.
>> If we had, why haven't they been analyzed, classified, discussed, etc.? Or have they?
> I don't know. Here is what I wrote for DCT:
> * IDE integration
> * Refactoring tool
> * Static analysis
> * Compiler
> * Doc generating
> * Build tool
> * DI generating

Just want to mention I'm talking about the parser only. Things like "Refactoring" have to work perfectly or are unusable, so they require a full compiler at least as finished as the current dmd.

--
Денис В. Шеломовский
Denis V. Shelomovskij
Jul 05 2012
On Thursday, July 05, 2012 18:04:11 Roman D. Boiko wrote:
> On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:

It's been discussed before, but there are some obvious use cases such as syntax highlighting, code formatting, and manipulation of D source files (e.g. to strip out the unit tests). We need to have a lexer in Phobos which parses D code and generates a range of tokens, and we need to have a parser in Phobos which operates on a range of tokens.

As discussed previously, there is desire to have the lexer and parser ported from dmd's frontend to D for Phobos so that what we get is easily maintainable alongside dmd's frontend and produces the same results (and is fast). It's _also_ desirable that we get a lexer/parser generator into Phobos for generating lexers/parsers for whatever language you might want to generate them for. Pegged is a good example of what can be done, and I think that Andrei was trying to get Philippe to prepare a submission to Phobos from it (though I'm not sure), but regardless of whether Pegged (or anything based on it) makes it into Phobos, we definitely want something similar.

So, we have a fair idea of some of the stuff that we want, but it's a question of time and effort. I keep intending to port dmd's lexer to D for submission to Phobos, but I've been too busy to do it. At least a couple of other people have also expressed interest in doing it, but no one has submitted anything for Phobos. So, it remains undone, and anything which would need to lex or parse D code has to find its own solution.

As with most everything around here, it's a question of people having the time and being willing to put in the effort to do it. It's all too common for someone to suggest that we should do something or implement something without ever attempting to do it themselves, and in general, stuff around here gets done because someone really wants it done, takes the time to do it, and sees it through until it's done and in Phobos.

- Jonathan M Davis
Jul 05 2012
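To make the "range of tokens" idea concrete, here is a hypothetical sketch of such an API. None of the names (TokenType, Token, DLexer, byToken) are an actual Phobos or dmd API, and the lexing logic is deliberately minimal - a real D lexer would handle the full token grammar, track line/column, and classify every keyword.

// Hypothetical range-of-tokens lexer API, for illustration only.
import std.range.primitives : isInputRange, ElementType;

enum TokenType { identifier, keyword, symbol, eof }

struct Token
{
    TokenType type;
    string    text;  // slice of the source covered by the token
}

// A minimal input range of Tokens over a source string.
struct DLexer
{
    private string src;
    private Token  current;
    private bool   exhausted;

    this(string source) { src = source; popFront(); }

    @property bool empty() const { return exhausted; }
    @property Token front() const { return current; }

    void popFront()
    {
        import std.ascii : isAlpha, isWhite;

        // Skip whitespace (a real lexer would track line/column here).
        while (src.length && isWhite(src[0]))
            src = src[1 .. $];

        if (src.length == 0) { exhausted = true; return; }

        if (isAlpha(src[0]) || src[0] == '_')
        {
            size_t i = 1;
            while (i < src.length && (isAlpha(src[i]) || src[i] == '_'))
                ++i;
            auto word = src[0 .. i];
            src = src[i .. $];
            // Keyword table reduced to two entries for the sketch.
            current = Token(word == "if" || word == "while"
                            ? TokenType.keyword : TokenType.identifier, word);
        }
        else
        {
            current = Token(TokenType.symbol, src[0 .. 1]);
            src = src[1 .. $];
        }
    }
}

auto byToken(string source) { return DLexer(source); }

static assert(isInputRange!DLexer && is(ElementType!DLexer == Token));

unittest
{
    import std.algorithm.comparison : equal;
    import std.algorithm.iteration : map;

    assert(equal(byToken("if x _y").map!(t => t.text), ["if", "x", "_y"]));
}

A parser built on top of this would then accept any input range of Token rather than one concrete lexer type.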
On Thursday, 5 July 2012 at 16:28:57 UTC, Jonathan M Davis wrote:
> It's all too common for someone to suggest that we should do something or implement something without ever attempting to do it themselves, and in general, stuff around here gets done because someone really wants it done, takes the time to do it, and sees it through until it's done and in Phobos.
> - Jonathan M Davis

Resume: everybody is welcome to join the effort of translating the DMD front end and improving Pegged. Also I would like to invite those interested in the DCT project to help me with it.

Right now I'm trying to understand whether it is possible to incorporate Pegged inside without losing anything critical (and I think it is very likely possible), and how exactly to do that. Dmitry proposed to help improve Pegged (or some other compiler's) speed. Anyone else?
Jul 05 2012
On 7/5/12 12:39 PM, Roman D. Boiko wrote:
> On Thursday, 5 July 2012 at 16:28:57 UTC, Jonathan M Davis wrote:
>> It's all too common for someone to suggest that we should do something or implement something without ever attempting to do it themselves, and in general, stuff around here gets done because someone really wants it done, takes the time to do it, and sees it through until it's done and in Phobos.
>> - Jonathan M Davis
> Resume: everybody is welcome to join the effort of translating the DMD front end and improving Pegged. Also I would like to invite those interested in the DCT project to help me with it. Right now I'm trying to understand whether it is possible to incorporate Pegged inside without losing anything critical (and I think it is very likely possible), and how exactly to do that. Dmitry proposed to help improve Pegged (or some other compiler's) speed. Anyone else?

I'd really want to create a task force on this, it is of strategic importance to D. In Walter's own words, no new feature is going to push us forward since we're not really using the great goodies we've got, and CTFE technology is the most important.

I also am actively opposed to a project of just translating D's front-end to D and dropping it into Phobos because it would smother (a) work on generic parser generators, and (b) strong, dependable formalization of D's syntax.


Andrei
Jul 05 2012
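Since CTFE keeps coming up (and is asked about later in the thread), here is a minimal, self-contained illustration of Compile-Time Function Evaluation; the makeGetter helper is invented purely for this example.

// Minimal CTFE illustration: an ordinary function is executed by the compiler
// whenever its result is required at compile time, e.g. inside mixin() or
// static assert. This is the mechanism a CTFE-based parser generator builds on.
string makeGetter(string field)
{
    return "int " ~ field ~ "() { return _" ~ field ~ "; }";
}

struct Point
{
    private int _x = 3;
    // The string is computed during compilation and mixed in as code.
    mixin(makeGetter("x"));
}

void main()
{
    static assert(makeGetter("y") == "int y() { return _y; }"); // forces CTFE
    assert(Point().x == 3);                                     // runtime use
}

Pegged pushes the same mechanism much further: the whole parser source is produced by CTFE from the grammar string.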
On 05-Jul-12 22:22, Andrei Alexandrescu wrote:
> On 7/5/12 12:39 PM, Roman D. Boiko wrote:
>> On Thursday, 5 July 2012 at 16:28:57 UTC, Jonathan M Davis wrote:
>>> It's all too common for someone to suggest that we should do something or implement something without ever attempting to do it themselves, and in general, stuff around here gets done because someone really wants it done, takes the time to do it, and sees it through until it's done and in Phobos.
>>> - Jonathan M Davis
>> Resume: everybody is welcome to join the effort of translating the DMD front end and improving Pegged. Also I would like to invite those interested in the DCT project to help me with it. Right now I'm trying to understand whether it is possible to incorporate Pegged inside without losing anything critical (and I think it is very likely possible), and how exactly to do that. Dmitry proposed to help improve Pegged (or some other compiler's) speed. Anyone else?
> I'd really want to create a task force on this, it is of strategic importance to D. In Walter's own words, no new feature is going to push us forward since we're not really using the great goodies we've got, and CTFE technology is the most important.

Count me as interested. CTFE needs more correctness & speed though. So to put it bluntly - no, it's not possible right NOW. BUT it doesn't prevent us from planning and doing a proof of concept. Pegged seems a good starting point even if we end up re-writing it from scratch.

> I also am actively opposed to a project of just translating D's front-end to D and dropping it into Phobos because it would smother (a) work on generic parser generators, and (b) strong, dependable formalization of D's syntax.

Well put. It shouldn't stop people from doing parsers, IMO the more the merrier.

--
Dmitry Olshansky
Jul 05 2012
On Thursday, 5 July 2012 at 18:33:50 UTC, Dmitry Olshansky wrote:
> Count me as interested. CTFE needs more correctness & speed though. So to put it bluntly - no, it's not possible right NOW. BUT it doesn't prevent us from planning and doing a proof of concept. Pegged seems a good starting point even if we end up re-writing it from scratch.

IMO, first priority is to make code generated by Pegged work fast, and second priority - make code generation itself fast.
Jul 05 2012
On 7/5/12 2:38 PM, Roman D. Boiko wrote:
> On Thursday, 5 July 2012 at 18:33:50 UTC, Dmitry Olshansky wrote:
>> Count me as interested. CTFE needs more correctness & speed though. So to put it bluntly - no, it's not possible right NOW. BUT it doesn't prevent us from planning and doing a proof of concept. Pegged seems a good starting point even if we end up re-writing it from scratch.
> IMO, first priority is to make code generated by Pegged work fast, and second priority - make code generation itself fast.

Well, I'd say that supporting large, realistic grammars without blowing up or taking long compilation times is the first priority.


Andrei
Jul 05 2012
On 7/5/12 2:33 PM, Dmitry Olshansky wrote:
> On 05-Jul-12 22:22, Andrei Alexandrescu wrote:
>> I'd really want to create a task force on this, it is of strategic importance to D. In Walter's own words, no new feature is going to push us forward since we're not really using the great goodies we've got, and CTFE technology is the most important.
> Count me as interested. CTFE needs more correctness & speed though. So to put it bluntly - no, it's not possible right NOW. BUT it doesn't prevent us from planning and doing a proof of concept. Pegged seems a good starting point even if we end up re-writing it from scratch.

Excellent point. Walter is 100% behind the general strategy and we can bring him to work on specific bug reports and enhancement requests if we make a case they are important for Pegged.

>> I also am actively opposed to a project of just translating D's front-end to D and dropping it into Phobos because it would smother (a) work on generic parser generators, and (b) strong, dependable formalization of D's syntax.
> Well put. It shouldn't stop people from doing parsers, IMO the more the merrier.

Exactly.


Andrei
Jul 05 2012
On 2012-07-05 20:22, Andrei Alexandrescu wrote:I also am actively opposed to a project of just translating D's front-end to D and dropping it into Phobos because it would smother (a) work on generic parser generators, and (b) strong, dependable formalization of D's syntax.I don't see why you are against this. I'm seeing this as basically the only chance of DMD ever being written in D. Sure I would much rather have a completely new compiler written in D, with a modular approach and designed from scratch to be used as a library, but I'm trying to be realistic here. -- /Jacob Carlborg
Jul 05 2012
Hi everybody! My name's David and I've been dreaming since around 1999 of making my own computer language, and never found the time for it. The first time I looked at D it was around 2004 or so, and it just looked like a "moderately better C++" which I forgot about, having more lofty ideas. When I found out about D2's metaprogramming facilities I instantly became much more interested, although I still wish to accomplish more than is possible ATM. I've been talking to my boss about reducing my working hours, mainly in order to have time to work on something related to D. My goal is to popularize a language that is efficient (as in runtime speed and size), expressive, safe, concise, readable, well-documented, easy-to-use, and good at finding errors in your code. In other words, I want a language that is literally all things to all people, a language that is effective for any task. I want to kill off this preconceived notion that most programmers seem to have, that fast code requires a language like C++ that is hard to use. The notion that Rapid Application Development is incompatible with an efficient executable is nonsense and I want to kill it :) To be honest I have some reservations about D, but of all the languages I know, D is currently closest to my ideal. This work on parsers might be a good place for me to dive in. I have an interest in parsers and familiarity with LL, LALR, PEGs, and even Pratt parsers (fun!), but I am still inexperienced. I also like writing documentation and articles, but I always find it hard to figure out how other people's code works well enough to document it. I'm having some trouble following this thread due to the acronyms: CTFE, DCT, AA. At least I managed to figure out that CTFE is Compile Time Function Execution. DCT and AA I already know as Discrete Cosine Transform and Anti-Aliasing, of course.... but what's it mean to you? One thing that has always concerned me about PEGs is that they always say PEGs are slower than traditional two-phase LALR(1) or LL(k) parsers. However, I have never seen any benchmarks. Does anyone know exactly how much performance you lose in an (optimized) PEG compared to an (optimized) LALR/LL parser + LL/regex lexer? Anyway, it's the weekend, during which I hope I can find a place to fit in with you guys.Resume: everybody is welcome to join effort of translating DMD front end, and improving Pegged. Also I would like to invite those interested in DCT project to help me with it. Right now I'm trying to understand whether it is possible to incorporate Pegged inside without losing anything critical (and I think it is very likely possible), and how exactly to do that. Dmitry proposed to help improve Pegged (or some other compiler's) speed. Anyone else?I'd really want to create a task force on this, it is of strategic importance to D. In Walter's own words, no new feature is going to push us forward since we're not really using the great goodies we've got, and CTFE technology is the most important.
Jul 06 2012
On Sat, Jul 07, 2012 at 02:45:35AM +0200, David Piepgrass wrote: [...]I'm having some trouble following this thread due to the acronyms: CTFE, DCT, AA. At least I managed to figure out that CTFE is Compile Time Function Execution. DCT and AA I already know as Discrete Cosine Transform and Anti-Aliasing, of course.... but what's it mean to you?Frankly, I don't know what DCT stands for (anyone?), but AA stands for Associative Array (aka hash, dictionary, etc.). [...]Anyway, it's the weekend, during which I hope I can find a place to fit in with you guys.Welcome aboard!!! D still has some ways to go (there are still warts in some places), but I also have to admit that it's closest to my ideal language too. I'm so sick and fed up of the tedium of C and the convolutions of C++, that I just can't see myself doing any more C/C++ except for code maintenance. New projects I would do in D. (Personal projects, anyway... work projects, I don't get to choose.) Anyway, D still has some ways to go, meaning that more manpower is always more than welcome. T -- Elegant or ugly code as well as fine or rude sentences have something in common: they don't depend on the language. -- Luca De Vitis
Jul 06 2012
On Saturday, 7 July 2012 at 01:47:21 UTC, H. S. Teoh wrote:Frankly, I don't know what DCT stands for (anyone?)That's a working (not final) name of my project, a.k.a. The D Compiler Tools. I've used this acronym in previous threads but didn't explain this time. Current state is very far from completion, it is in research phase. Link: https://github.com/roman-d-boiko/dct, but I'd suggest to contact me at rb d-coding.com if anyone is interested.
Jul 07 2012
On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:This work on parsers might be a good place for me to dive in. I have an interest in parsers and familiarity with LL, LALR, PEGs, and even Pratt parsers (fun!), but I am still inexperienced....One thing that has always concerned me about PEGs is that they always say PEGs are slower than traditional two-phase LALR(1) or LL(k) parsers. However, I have never seen any benchmarks. Does anyone know exactly how much performance you lose in an (optimized) PEG compared to an (optimized) LALR/LL parser + LL/regex lexer?I decided to ask a question about this: http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk Don't hesitate to edit it if you would like to clarify some aspect.
Jul 07 2012
On 07-Jul-12 13:06, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:I can tell you that they are not slower than one another in principle. Quality of implementations trumps every theoretical aspect here. LALR is usually fast even if implemented by the book but they are hard to optimize further and quite restrictive on "semantic extensions". Proper CFG parsers all are linear-time anyway. -- Dmitry OlshanskyThis work on parsers might be a good place for me to dive in. I have an interest in parsers and familiarity with LL, LALR, PEGs, and even Pratt parsers (fun!), but I am still inexperienced....One thing that has always concerned me about PEGs is that they always say PEGs are slower than traditional two-phase LALR(1) or LL(k) parsers. However, I have never seen any benchmarks. Does anyone know exactly how much performance you lose in an (optimized) PEG compared to an (optimized) LALR/LL parser + LL/regex lexer?I decided to ask a question about this: http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk Don't hesitate to edit it if you would like to clarify some aspect.
Jul 07 2012
On Saturday, 7 July 2012 at 10:05:30 UTC, Dmitry Olshansky wrote:I can tell you that they are not slower then one another in principle. Quality of implementations trumps every theoretical aspect here. LALR is usually fast even if implemented by book but they are hard to optimize futher and quite restrictive on "semantic extensions". Proper CFG parsers all are liner-time anyway.Exactly, that is also my point. I think that Pegged can be heavily optimized in performance, and there are no fundamental issues which would make it inherently slower than LALR or a hand-written D-specific parser.
Jul 07 2012
On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:I think that Pegged can be heavily optimized in performance, and there are no fundamental issues which would make it inherently slower than LALR or a hand-written D-specific parser.Hmm... found an interesting article: http://www.antlr.org/papers/LL-star-PLDI11.pdf It describes some disadvantages of Packrat parsing, like problems with debugging and error recovery. These are important for DCT, so I'll have to perform additional research.
Jul 07 2012
On 07-Jul-12 15:30, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:Yup, LL(*) is my favorite so far. As for debugging and error recovery they are always a problem. -- Dmitry OlshanskyI think that Pegged can be heavily optimized in performance, and there are no fundamental issues which would make it inherently slower than LALR or a hand-written D-specific parser.Hmm... found an interesting article: http://www.antlr.org/papers/LL-star-PLDI11.pdf It describes some disadvantages of Packrat parsing, like problems with debugging and error recovery. These are important for DCT, so I'll have to perform additional research.
Jul 07 2012
On Saturday, 7 July 2012 at 11:33:18 UTC, Dmitry Olshansky wrote:Yup, LL(*) is my favorite so far. As for debugging and error recovery they are always a problem.It's interesting, that I also wanted to use DFA for non-terminals (similarly to LL(*)), but was considering their usage as pure performance optimization. Concerns raised in the article seem to be very reasonable, I will likely reconsider my previous plans... This forum is an endless source of high-quality feed-back for me.
Jul 07 2012
On 7/7/12 7:33 AM, Dmitry Olshansky wrote:On 07-Jul-12 15:30, Roman D. Boiko wrote:That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues, LL(*) sounds like a promising alternative. How many semantic hacks does D's syntax need for an LL(*) parser? AndreiOn Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:Yup, LL(*) is my favorite so far.I think that Pegged can be heavily optimized in performance, and there are no fundamental issues which would make it inherently slower than LALR or a hand-written D-specific parser.Hmm... found an interesting article: http://www.antlr.org/papers/LL-star-PLDI11.pdf It describes some disadvantages of Packrat parsing, like problems with debugging and error recovery. These are important for DCT, so I'll have to perform additional research.
Jul 07 2012
On 07-Jul-12 22:25, Andrei Alexandrescu wrote:On 7/7/12 7:33 AM, Dmitry Olshansky wrote:I believe that it may need very few _semantic_ predicates. But there are cases where infinite lookahead is a must. Can't recall which cases offhand.On 07-Jul-12 15:30, Roman D. Boiko wrote:That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative. How many semantics hacks does D's syntax need for a LL(*) parser?On Saturday, 7 July 2012 at 10:26:39 UTC, Roman D. Boiko wrote:Yup, LL(*) is my favorite so far.I think that Pegged can be heavily optimized in performance, and there are no fundamental issues which would make it inherently slower than LALR or a hand-written D-specific parser.Hmm... found an interesting article: http://www.antlr.org/papers/LL-star-PLDI11.pdf It describes some disadvantages of Packrat parsing, like problems with debugging and error recovery. These are important for DCT, so I'll have to perform additional research.Andrei-- Dmitry Olshansky
Jul 07 2012
We should also consider using GLR if LL(*) doesn't work.Yup, LL(*) is my favorite so far.That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.
Jul 08 2012
On 08-Jul-12 13:05, Tobias Pankrath wrote:GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken. -- Dmitry OlshanskyWe should also consider using GLR if LL(*) doesn't work.Yup, LL(*) is my favorite so far.That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.
Jul 08 2012
On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote:On 08-Jul-12 13:05, Tobias Pankrath wrote:Since GLR can parse any CFG, the upper bound is in O(n^3), but the actual performance seems to depend on the grammar. From Elkhound [1]GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken.We should also consider using GLR if LL(*) doesn't work.Yup, LL(*) is my favorite so far.That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.It should be competitive with Bison for LALR (fragements of) grammars, and degrade gracefully from there. On the scale of grammar nondeterminism, from none (LALR) to some to lots, "some" is the niche Elkhound is going after. These goals are driven by Elkhound's primary application, the Elsa C++ Parser. In essence, Elkhound came about because I wanted to apply automatic parsing technology to parsing C++, but found exiting systems inadequate.So it seems to me that it is not worse than PEG if you have a grammar with reasonably many ambiguities. Performance is one thing, but you should be able to express your language in the underlying formalism without too many hurdles.
Jul 08 2012
From Elkhound [1]: http://scottmcpeak.com/elkhound/sources/elkhound/algorithm.html
Jul 08 2012
On 08-Jul-12 14:48, Tobias Pankrath wrote:On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote:The problem is that, say, LL(*) can be easily tweaked in place to run various semantic actions as it parses the source. I think GLR is harder to make do anything other than "parse & say it's fine". -- Dmitry OlshanskyOn 08-Jul-12 13:05, Tobias Pankrath wrote:Since GLR can parse any CFG the upper bound is in O(n^3), but the actual performance seems to depend on the grammar. From Elkhound [1]GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken.We should also consider using GLR if LL(*) doesn't work.Yup, LL(*) is my favorite so far.That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.It should be competitive with Bison for LALR (fragements of) grammars, and degrade gracefully from there. On the scale of grammar nondeterminism, from none (LALR) to some to lots, "some" is the niche Elkhound is going after. These goals are driven by Elkhound's primary application, the Elsa C++ Parser. In essence, Elkhound came about because I wanted to apply automatic parsing technology to parsing C++, but found exiting systems inadequate.So it seems to me that it is not worse than PEG if you have a grammar with reasonable many ambiguities. Performance is one thing, but you should be able to express your language in the underlying formalism without too many hurdles.
Jul 08 2012
The problem is that say LL(*) can be easily tweaked in place to run various semantic actions as it parse the source. I think GLR is harder to make do anything other then "parse & say it's fine".Did you look at Elkhound? According to the tutorial you can use actions in the same way as if you were using an (LA)LR parser. Dunno if it's sufficient, but with Elkhound a C++ parser has been written. See also http://stackoverflow.com/questions/428892/what-parser-generator-do-you-recommend where Ira Baxter (maybe you know him) makes a case for DMS, which uses GLR internally. Here is a Google tech talk by Ira explaining DMS: http://www.youtube.com/watch?v=C-_dw9iEzhA . I'm not very experienced with parser generators. Just trying to put all options on the table.
Jul 08 2012
On Sunday, 8 July 2012 at 09:14:32 UTC, Dmitry Olshansky wrote:On 08-Jul-12 13:05, Tobias Pankrath wrote:http://www.antlr.org/papers/LL-star-PLDI11.pdf describes some disadvantages of GLR with respect to LL(*).GLR ... are you serious? It still does parsing in n^3 if I'm not mistaken.We should also consider using GLR if LL(*) doesn't work.Yup, LL(*) is my favorite so far.That's Terence Parr's discovery, right? I've always liked ANTLR, so if PEGs turn out to have issues LL(*) sounds like a promising alternative.
Jul 08 2012
I can tell you that they are not slower then one another in principle. Quality of implementations trumps every theoretical aspect here. LALR is usually fast even if implemented by book but they are hard to optimize futher and quite restrictive on "semantic extensions". Proper CFG parsers all are liner-time anyway.To be picky here: The languages that can be parsed in linear time are a strict subset of CFGs. However I do agree, that error handling and flexibility are more important than raw speed. I don't want to get a concrete syntax tree full of unneeded productions, like the ones you get when you encode precedence rules in your grammar.
Jul 07 2012
On Saturday, 7 July 2012 at 14:17:54 UTC, Tobias Pankrath wrote:Also any PEG defines a language with linear-time parsing. Some non-context-free languages can be described with PEG.Proper CFG parsers all are liner-time anyway.To be picky here: The languages that can be parsed in linear time are a strict subset of CFGs.
Jul 07 2012
On Saturday, 7 July 2012 at 15:39:52 UTC, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 14:17:54 UTC, Tobias Pankrath wrote:Interesting, I thought that PEG ⊂ CFG holds.Also any PEG defines a language with linear-time parsing. Some non-context-free languages can be described with PEG.Proper CFG parsers all are liner-time anyway.To be picky here: The languages that can be parsed in linear time are a strict subset of CFGs.
Jul 07 2012
On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote:Interesting, I thought that PEG ⊂ CFG holds.See section 3.4 of http://bford.info/pub/lang/peg.pdf - it contains a simple proof that a non-context-free language (a^n) (b^n) (c^n) can be described with PEG syntax. There are infinitely many such languages.
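For concreteness, here is a small hand-written recognizer in D for that language, following the shape of the construction in the paper (the rule spelling S <- &(A 'c') 'a'+ B !., A <- 'a' A? 'b', B <- 'b' B? 'c' is a paraphrase of the paper's grammar, and the code is only an illustrative sketch, not DCT or Pegged code):

struct Peg
{
    string s;
    size_t pos;

    bool lit(char c)
    {
        if (pos < s.length && s[pos] == c) { ++pos; return true; }
        return false;
    }

    // A <- 'a' A? 'b'   (matches a^n b^n)
    bool A()
    {
        auto save = pos;
        if (!lit('a')) return false;
        auto mid = pos;
        if (!A()) pos = mid;              // A? is optional: undo on failure
        if (!lit('b')) { pos = save; return false; }
        return true;
    }

    // B <- 'b' B? 'c'   (matches b^n c^n)
    bool B()
    {
        auto save = pos;
        if (!lit('b')) return false;
        auto mid = pos;
        if (!B()) pos = mid;
        if (!lit('c')) { pos = save; return false; }
        return true;
    }

    // S <- &(A 'c') 'a'+ B !.
    bool S()
    {
        auto save = pos;
        if (!(A() && lit('c'))) { pos = save; return false; }
        pos = save;                       // &(...) is a predicate: consumes nothing
        if (!lit('a')) return false;      // 'a'+
        while (lit('a')) {}
        return B() && pos == s.length;    // B followed by end of input (!.)
    }
}

bool matchesAnBnCn(string input)
{
    auto p = Peg(input);
    return p.S();
}

void main()
{
    assert(matchesAnBnCn("abc"));
    assert(matchesAnBnCn("aaabbbccc"));
    assert(!matchesAnBnCn("aabbbccc"));   // counts don't match
    assert(!matchesAnBnCn("aabbccc"));
}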
Jul 07 2012
On Sat, Jul 7, 2012 at 6:25 PM, Roman D. Boiko <rb d-coding.com> wrote:On Saturday, 7 July 2012 at 16:14:13 UTC, Tobias Pankrath wrote:Not that anyone is interested in such languages, but still :)Interesting, I thought that PEG ⊂ CFG holds.See section 3.4 of http://bford.info/pub/lang/peg.pdf It contains a simple proof that a non-context-free language (a^n) (b^n) (c^n) can be described with PEG syntax. There are infinitely many such languages.
Jul 07 2012
On 07/07/2012 12:05, Dmitry Olshansky wrote:On 07-Jul-12 13:06, Roman D. Boiko wrote:D isn't 100% CFG. But it is close.On Saturday, 7 July 2012 at 00:45:36 UTC, David Piepgrass wrote:I can tell you that they are not slower then one another in principle. Quality of implementations trumps every theoretical aspect here. LALR is usually fast even if implemented by book but they are hard to optimize futher and quite restrictive on "semantic extensions". Proper CFG parsers all are liner-time anyway.This work on parsers might be a good place for me to dive in. I have an interest in parsers and familiarity with LL, LALR, PEGs, and even Pratt parsers (fun!), but I am still inexperienced....One thing that has always concerned me about PEGs is that they always say PEGs are slower than traditional two-phase LALR(1) or LL(k) parsers. However, I have never seen any benchmarks. Does anyone know exactly how much performance you lose in an (optimized) PEG compared to an (optimized) LALR/LL parser + LL/regex lexer?I decided to ask a question about this: http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llk Don't hesitate to edit it if you would like to clarify some aspect.
Jul 07 2012
deadalnix, dans le message (digitalmars.D:171330), a écrit:D isn't 100% CFG. But it is close.What makes D fail to be a CFG?
Jul 09 2012
On 09/07/2012 10:14, Christophe Travert wrote:deadalnix , dans le message (digitalmars.D:171330), a crit :type[something] <= something can be a type or an expression. typeid(something) <= same here identifier!(something) <= again
Jul 10 2012
On 07/11/2012 01:16 AM, deadalnix wrote:On 09/07/2012 10:14, Christophe Travert wrote:'something' is context-free: something ::= type | expression.deadalnix , dans le message (digitalmars.D:171330), a crit :type[something] <= something can be a type or an expression. typeid(somethning) <= same here identifier!(something) <= againD isn't 100% CFG. But it is close.What makes D fail to be a CFG?
Jul 10 2012
Timon Gehr, dans le message (digitalmars.D:171814), a écrit:On 07/11/2012 01:16 AM, deadalnix wrote:Do you have to know if something is a type or an expression for simple parsing? The language had better not require this, otherwise simple parsing is not possible without looking at all forward references and imported files.On 09/07/2012 10:14, Christophe Travert wrote:deadalnix , dans le message (digitalmars.D:171330), a crit :type[something] <= something can be a type or an expression. typeid(somethning) <= same here identifier!(something) <= again'something' is context-free: something ::= type | expression.D isn't 100% CFG. But it is close.What makes D fail to be a CFG?
Jul 11 2012
On 07/11/2012 10:23 AM, Christophe Travert wrote:Timon Gehr , dans le message (digitalmars.D:171814), a crit :No. Some token sequences can be both a type and an expression based on the context (the CFG is ambiguous), but that is the analysers business. Parsing D code does not require any kind of analysis.On 07/11/2012 01:16 AM, deadalnix wrote:Do you have to know if something is a type or an expression for a simple parsing?On 09/07/2012 10:14, Christophe Travert wrote:'something' is context-free: something ::= type | expression.deadalnix , dans le message (digitalmars.D:171330), a crit :type[something]<= something can be a type or an expression. typeid(somethning)<= same here identifier!(something)<= againD isn't 100% CFG. But it is close.What makes D fail to be a CFG?The langage would better not require this, otherwise simple parsing is not possible without looking at all forward references and imported files.
Jul 11 2012
On Tuesday, 10 July 2012 at 23:49:58 UTC, Timon Gehr wrote:On 07/11/2012 01:16 AM, deadalnix wrote:I don't see how "type | expression" is context free. The input "Foo" could be a type or expression, you can't tell which without looking at the context.On 09/07/2012 10:14, Christophe Travert wrote:'something' is context-free: something ::= type | expression.deadalnix , dans le message (digitalmars.D:171330), a écrit :type[something] <= something can be a type or an expression. typeid(somethning) <= same here identifier!(something) <= againD isn't 100% CFG. But it is close.What makes D fail to be a CFG?
Jul 11 2012
On 07/11/2012 11:55 PM, David Piepgrass wrote:On Tuesday, 10 July 2012 at 23:49:58 UTC, Timon Gehr wrote:'Context free' means that all the production rule left hand sides do not have a context, i.e. they consist of a single non-terminal. What you are describing is the fact that the grammar is ambiguous. The parser can just assume whatever works and the analyzer takes care of the interpretation of the resulting AST. The parser does not have to give a meaning to the code. It just turns it into a representation that is easy to work with.On 07/11/2012 01:16 AM, deadalnix wrote:I don't see how "type | expression" is context free. The input "Foo" could be a type or expression, you can't tell which without looking at the context.On 09/07/2012 10:14, Christophe Travert wrote:'something' is context-free: something ::= type | expression.deadalnix , dans le message (digitalmars.D:171330), a écrit :type[something] <= something can be a type or an expression. typeid(somethning) <= same here identifier!(something) <= againD isn't 100% CFG. But it is close.What makes D fail to be a CFG?
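To make that concrete, here is a hypothetical AST sketch in D (not dmd's actual node hierarchy): for input like Foo[bar] the parser just records the bracketed operand as a plain node, and only semantic analysis decides whether bar names a type or an expression.

class Node {}

class Identifier : Node
{
    string name;
    this(string name) { this.name = name; }
}

// Covers both a static-array/slice type like Foo[bar] and an index
// expression; the parser never decides which one it is.
class BracketedNode : Node
{
    Node lhs;      // e.g. Foo
    Node operand;  // e.g. bar; classified as a type or an expression later
    this(Node lhs, Node operand) { this.lhs = lhs; this.operand = operand; }
}

void main()
{
    // The same tree shape is produced whether `bar` turns out to be a
    // type or an expression; no lookahead or symbol table is needed.
    Node n = new BracketedNode(new Identifier("Foo"), new Identifier("bar"));
    assert(n !is null);
}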
Jul 11 2012
On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias pankrath.net> wrote:If you are interested in parsing, than I wouldn't recommend the dragon book, but this one It really is an awesome book.This little post to say a big 'Thanks' to Tobias. I'm reading Grune's Parsing Techniques and find it absolutely wonderful: easy to read, going into details and limits of the many parsing algos. Very cool! I'm playing with fully-backtracking recursive-descent parsers right now, to deal with ambiguous grammars and the small parser works beautifully. Hey, I just saw he gives the code on his website, nice! Now, on to LR. Nick, here I come. I'll at last know how Goldie works. Philippe
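As a rough illustration of what such a parser looks like (a toy sketch over a made-up grammar, not Philippe's code or anything from the book): each rule remembers the input position and restores it when an alternative fails, which is all there is to "full backtracking".

// Toy grammar:  Start <- Pair / Single ; Pair <- 'a' 'b' ; Single <- 'a'
struct Parser
{
    string input;
    size_t pos;

    bool lit(char c)
    {
        if (pos < input.length && input[pos] == c) { ++pos; return true; }
        return false;
    }

    bool pair()
    {
        auto saved = pos;                 // remember where the rule started
        if (lit('a') && lit('b')) return true;
        pos = saved;                      // backtrack on failure
        return false;
    }

    bool single() { return lit('a'); }

    bool start() { return pair() || single(); }
}

void main()
{
    auto p = Parser("a");
    assert(p.start() && p.pos == 1);      // pair fails, backtracks, single matches

    auto q = Parser("ab");
    assert(q.start() && q.pos == 2);      // pair matches directly
}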
Jul 23 2012
On 7/23/12 9:22 AM, Philippe Sigaud wrote:On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath<tobias pankrath.net> wrote:When will we see the fruit of that all in the form of D libraries? AndreiIf you are interested in parsing, than I wouldn't recommend the dragon book, but this one It really is an awesome book.This little post to say a big 'Thanks' to Tobias. I'm reading Grune's Parsing Techniques and find it absolutely wonderful: easy to read, going into details and limits of the many parsing algos. Very cool! I'm playing with fully-backtracking recursive-descent parsers right now, to deal with ambiguous grammars and the small parser works beautifully. Hey, I just saw he gives the code on his website, nice! Now, on to LR. Nick, here I come. I'll at last know how Goldie works.
Jul 23 2012
On Monday, 23 July 2012 at 13:22:47 UTC, Philippe Sigaud wrote:On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias pankrath.net> wrote:Yes, I'm also reading it and find it amusingly high-quality.If you are interested in parsing, than I wouldn't recommend the dragon book, but this one It really is an awesome book.This little post to say a big 'Thanks' to Tobias. I'm reading Grune's Parsing Techniques and find it absolutely wonderful: easy to read, going into details and limits of the many parsing algos. Very cool!
Jul 23 2012
On Monday, 23 July 2012 at 13:22:47 UTC, Philippe Sigaud wrote:On Thu, Jul 5, 2012 at 10:00 PM, Tobias Pankrath <tobias pankrath.net> wrote:I'm glad that you like it.If you are interested in parsing, than I wouldn't recommend the dragon book, but this one It really is an awesome book.This little post to say a big 'Thanks' to Tobias. I'm reading Grune's Parsing Techniques and find it absolutely wonderful: easy to read, going into details and limits of the many parsing algos. Very cool!
Jul 23 2012
On 7/7/12 6:05 AM, Dmitry Olshansky wrote:I can tell you that they are not slower then one another in principle. Quality of implementations trumps every theoretical aspect here. LALR is usually fast even if implemented by book but they are hard to optimize futher and quite restrictive on "semantic extensions".Isn't it the case that PEGs require more memory for certain grammars? Andrei
Jul 07 2012
On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llkSo far it looks like LALR parsers may have lower constant factors than Packrat. The difference could be minimized by paying attention to parsing of terminal symbols, which was in my plans already. It is not necessary to strictly follow the Packrat parsing algorithm. The benefits of Pegged, in my view, are its support of Parsing Expression Grammar (PEG) and compile-time evaluation. It is easily extensible and modifiable. When I implemented a recursive-descent parser by hand in one of the early drafts of DCT, I strongly felt the need to generalize code in a way which in retrospect I would call PEG-like. The structure of my hand-written recursive-descent parser was a one-to-one mapping to an implemented subset of the D specification, and I considered it problematic, because I needed to duplicate the same structure in the resulting AST. PEG is basically a language that describes both the implementation of the parser and the language syntax. It greatly reduces implicit code duplication. I think that generated code can be made almost as fast as a hand-written parser for a particular language (probably, a few percent slower). Especially if that language is similar to D (context-free, with fine-grained hierarchical grammar). Optimizations might require forgetting about strictly following any theoretical algorithm, but that should be OK.
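A tiny sketch of the "the grammar doubles as the parser" point with Pegged (the grammar and rule names are mine, not from DCT or the D spec, and the call below is an assumption about Pegged's grammar-mixin interface, which may differ between versions):

import pegged.grammar;

// The same text serves as the syntax description and, through CTFE,
// as the generated parser; the parse tree mirrors the rule names.
mixin(grammar(`
Mini:
    Decl       < Type Identifier ';'
    Type       < Identifier
    Identifier <~ [a-zA-Z_] [a-zA-Z0-9_]*
`));

void main()
{
    // Assumed call style; some Pegged versions use Mini.parse("int x;") instead.
    auto tree = Mini("int x;");
    assert(tree.successful);
    // Nodes come out named after the rules, e.g. "Mini.Decl", "Mini.Type".
}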
Jul 07 2012
On 7/7/12 6:24 AM, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:Isn't also the fact that lexing and parsing are integrated? With traditional LALR you need a separate tokenizer.http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llkSo far it looks like LALR parsers may have lower constant factors than Packrat. The difference could be minimized by paying attention to parsing of terminal symbols, which was in my plans already. It is not necessary to strictly follow Packrat parsing algorithm. The benefits of Pegged, in my view, are its support of Parsing Expression Grammar (PEG) and compile-time evaluation. It is easily extensible and modifiable.When I implemented recursive-descent parser by hand in one of early drafts of DCT, I strongly felt the need to generalize code in a way which in retrospect I would call PEG-like. The structure of my hand-written recursive-descent parser was a one-to-one mapping to an implemented subset of D specification, and I considered it problematic, because it was needed to duplicate the same structure in the resulting AST. PEG is basically a language that describes both, the implementation of parser, and the language syntax. It greatly reduces implicit code duplication. I think that generated code can be made almost as fast as a hand-written parser for a particular language (probably, a few percent slower). Especially if that language is similar to D (context-free, with fine-grained hierarchical grammar). Optimizations might require to forget about strictly following any theoretical algorithm, but that should be OK.All that sounds really encouraging. I'm really looking forward to more work in that area. If you stumble upon bugs that block you, let us know and Walter agreed he'll boost their priority. Andrei
Jul 07 2012
On 07-Jul-12 22:23, Andrei Alexandrescu wrote:On 7/7/12 6:24 AM, Roman D. Boiko wrote:I'll have to point out that the whole point about integrated lexing is moot. It's more of liability then benefit. At very least it's just implementation curiosity not advantage.On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:Isn't also the fact that lexing and parsing are integrated? With traditional LALR you need a separate tokenizer.http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llkSo far it looks like LALR parsers may have lower constant factors than Packrat. The difference could be minimized by paying attention to parsing of terminal symbols, which was in my plans already. It is not necessary to strictly follow Packrat parsing algorithm. The benefits of Pegged, in my view, are its support of Parsing Expression Grammar (PEG) and compile-time evaluation. It is easily extensible and modifiable.-- Dmitry OlshanskyWhen I implemented recursive-descent parser by hand in one of early drafts of DCT, I strongly felt the need to generalize code in a way which in retrospect I would call PEG-like. The structure of my hand-written recursive-descent parser was a one-to-one mapping to an implemented subset of D specification, and I considered it problematic, because it was needed to duplicate the same structure in the resulting AST. PEG is basically a language that describes both, the implementation of parser, and the language syntax. It greatly reduces implicit code duplication. I think that generated code can be made almost as fast as a hand-written parser for a particular language (probably, a few percent slower). Especially if that language is similar to D (context-free, with fine-grained hierarchical grammar). Optimizations might require to forget about strictly following any theoretical algorithm, but that should be OK.All that sounds really encouraging. I'm really looking forward to more work in that area. If you stumble upon bugs that block you, let us know and Walter agreed he'll boost their priority. Andrei
Jul 07 2012
On 7/7/12 3:17 PM, Dmitry Olshansky wrote:I'll have to point out that the whole point about integrated lexing is moot. It's more of liability then benefit. At very least it's just implementation curiosity not advantage.Interesting. I'll have to ask for more details on why. Andrei
Jul 07 2012
On Saturday, 7 July 2012 at 20:09:56 UTC, Andrei Alexandrescu wrote:On 7/7/12 3:17 PM, Dmitry Olshansky wrote:+1. Personally I associate PEG with a parser that includes a __distributed__ lexer inside, which gives the advantage of having to check fewer alternatives at each step (that's similar to deterministic parsing). If the lexer is separate, it seems to be more difficult to scan for only a subset of possible tokens.I'll have to point out that the whole point about integrated lexing is moot. It's more of liability then benefit. At very least it's just implementation curiosity not advantage.Interesting. I'll have to ask for more details on why. Andrei
Jul 07 2012
On 08-Jul-12 00:19, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 20:09:56 UTC, Andrei Alexandrescu wrote:Well put, yet it's still a lexer. And most horribly, it's not optimized to DFAs most of the time, last time I checked the papers. which gives the advantage of having toOn 7/7/12 3:17 PM, Dmitry Olshansky wrote:+1. Personally I associate PEG with a parser that includes a __distributed__ lexer inside,I'll have to point out that the whole point about integrated lexing is moot. It's more of liability then benefit. At very least it's just implementation curiosity not advantage.Interesting. I'll have to ask for more details on why. Andreicheck less alternatives at each step (that's similar to deterministic parsing). If lexer is separate, it seems to be more difficult to scan for only a subset of possible tokens.And given the backtracking nature of PEGs you'll do your distributed thing many times over or ... spend a lot of RAM to remember not to redo it. I recall lexing takes even more time than parsing itself. -- Dmitry Olshansky
Jul 07 2012
On Saturday, 7 July 2012 at 20:29:26 UTC, Dmitry Olshansky wrote:And given the backtracking nature of PEGs you'll do your distributed thing many times over or ... spend a lot of RAM to remember not to redo it. I recall lexing takes even more then parsing itself.I think that your conclusions are about statistical evidence of PEG misuses and poor PEG parser implementations. My point was that there is nothing fundamentally worse in having the lexer integrated with the parser, but there are performance advantages of having to check fewer possible cases when the structural information is available (so that lexSmth could be called when Smth is expected, thus requiring fewer switch branches if a switch is used). As for lexing multiple times, simply use a free list of terminals (aka tokens). I still assume that the grammar is properly defined, so that there is only one way to split the source into tokens.
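A minimal sketch of the lexSmth idea in D (the names lexIdentifier/lexInteger are made up for illustration; this is not DCT or Pegged code): because the parser knows which terminals are allowed at a given point, it can call a scanner specialized for exactly that subset instead of a general nextToken that must consider every token kind.

import std.ascii : isAlpha, isAlphaNum, isDigit;

struct Scanner
{
    string src;
    size_t pos;

    // Called only where the grammar allows an identifier.
    string lexIdentifier()
    {
        auto start = pos;
        if (pos < src.length && (isAlpha(src[pos]) || src[pos] == '_'))
        {
            ++pos;
            while (pos < src.length && (isAlphaNum(src[pos]) || src[pos] == '_'))
                ++pos;
        }
        return src[start .. pos];
    }

    // Called only where the grammar allows an integer literal.
    string lexInteger()
    {
        auto start = pos;
        while (pos < src.length && isDigit(src[pos]))
            ++pos;
        return src[start .. pos];
    }
}

void main()
{
    assert(Scanner("foo42 = 1;").lexIdentifier() == "foo42");
    assert(Scanner("123;").lexInteger() == "123");
}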
Jul 07 2012
On 08-Jul-12 00:50, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 20:29:26 UTC, Dmitry Olshansky wrote:You may have misunderstood me as well, the point is still the same: there are 2 things - notation and implementation; the fact that the lexer is integrated in the notation, like in PEGs, is not related to the fact that PEGs in their classic definition never use the term Token and do backtracking parsing essentially at the character level.And given the backtracking nature of PEGs you'll do your distributed thing many times over or ... spend a lot of RAM to remember not to redo it. I recall lexing takes even more then parsing itself.I think that your conclusions are about statistical evidences of PEG misuses and poor PEG parser implementations. My point was that there is nothing fundamentally worse in having lexer integrated with parser, but there are performance advantages of having to check less possible cases when the structural information is available (so that lexSmth could be called when Smth is expected, thus requiring less switch branches if switch is used).As for lexing multiple times, simply use a free list of terminals (aka tokens). I still assume that grammar is properly defined, so that there is only one way to split source into tokens.Tokens.. there is no such term in use if we talk about 'pure' PEG. -- Dmitry Olshansky
Jul 07 2012
On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote:You may misunderstood me as well, the point is still the same: there are 2 things - notation and implementation, the fact that lexer is integrated in notation like in PEGs is not related to the fact that PEGs in their classic definition never use term Token and do backtracking parsing essentially on character level.But PEG itself doesn't require backtracking parsing, does it? So that's an implementation detail, or a specific algorithm. Lexer separation seems to be independent of this.Terminal symbols.As for lexing multiple times, simply use a free list of terminals (aka tokens). I still assume that grammar is properly defined, so that there is only one way to split source into tokens.Tokens.. there is no such term in use if we talk about 'pure' PEG.
Jul 07 2012
On 08-Jul-12 01:24, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 21:08:43 UTC, Dmitry Olshansky wrote:It does. Specifically the algorithm is to try option A, try option B, try option C... it's obvious how backtracking comes in ... when, say, A fails somewhere in the middle of it. Of course there is a fair amount of bookkeeping that reduces doing work all over again but it does ... allocate. This could be avoided via disambiguation a-la LL(k) but that won't be true PEG anymore ;) So that'sYou may misunderstood me as well, the point is still the same: there are 2 things - notation and implementation, the fact that lexer is integrated in notation like in PEGs is not related to the fact that PEGs in their classic definition never use term Token and do backtracking parsing essentially on character level.But PEG itself doesn't require backtracking parsing, does it?an implementation detail, or a specific algorithm.No Lexer separation seems to be independent of this.YesCharacters. -- Dmitry Olshansky
Jul 07 2012
On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:On 08-Jul-12 01:24, Roman D. Boiko wrote:There is no algorithm, only __grammar__. It specifies that option A has higher priority than option B in expression A / B / C. But it doesn't forbid, for example, analyzing all of them in parallel (tracking state transitions) for each character consecutively, as a proper DFA implementation would do, until some option succeeds and all previous fail. A greedy regex would also have to check all successor options (C in the example above) to determine the longest one.But PEG itself doesn't require backtracking parsing, does it?It does. Specifically the algorithm is to try option A, try option B, try option C...it's obvious how backtracking comes in ... whan say A fails somewhere in the middle of it.Simply stop tracking the respective state. Process others in parallel as I described above.Of course there is a fair amount of bookkeeping that reduces doing work all over again but it does ... allocate.The amount of used memory would be proportional to the length of input after the branching point times the number of rules (actually, of states in the DFA, which is likely larger but still finite for a particular grammar).Nope. According to the definition, PEG is a set of non-terminal symbols, terminal symbols, production rules, and a starting non-terminal. Terminals are not necessarily characters.
Jul 07 2012
On 08-Jul-12 01:49, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 21:35:52 UTC, Dmitry Olshansky wrote:That would not always work, but yes that's what we should do IMHO. It's not what packrats do however (when I say algorithm I mean specifically packrat). And PEGs are commonly associated with packrats. A greedy regex would also have to check all successor options (COn 08-Jul-12 01:24, Roman D. Boiko wrote:There is no algorithm, only __grammar__. It specifies that option A has higher priority than option B in expression A / B / C. But it doesn't forbid, for example, to analyze all in parallel (track state transitions) for each character consequently, as a proper DFA implementation would do, until some option succeeds and all previous fail.But PEG itself doesn't require backtracking parsing, does it?It does. Specifically the algorithm is to try option A, try option B, try option C...in the exapmle above) to determine the longest one.Yes, yet a regex engine (in simple form) can 'merge' identical parallel states* easily; it's not so simple with general CFGs. *As in NFA states; I like to think in multiple NFA states vs a single DFA state.Yeah it could be done, but it doesn't buy you a thing in a lot of cases, unlike in a regular grammar. The problem is that the state is not as simple as in a regex, for instance (even for a regex that must capture some submatches a DFA won't do, BTW). Thus you can't merge seemingly identical states because they depend on a (different) interpretation of the previous part of the input. That's why the DFA in ANTLR is only used to do lookahead or lexing; it doesn't maintain a path through states during parsing.it's obvious how backtracking comes in ... whan say A fails somewhere in the middle of it.Simply stop tracking respective state. Process others in parallel as I described above.Of course there is a fair amount of bookkeeping that reduces doing work all over again but it does ... allocate.The amount of used memory would be proportional to the length of input after the branching point times number of rules (actually, of states in the DFA, which is likely larger but still finite for a particular grammar).Finite and proportional to input size? Another way of saying why packrats have an O(n) memory requirement?Yup. But I'm not talking definitions here, I'm talking about the notion of "integrated lexing", and lexing is certainly about characters, or was it? -- Dmitry Olshansky
Jul 07 2012
On Saturday, 7 July 2012 at 22:05:05 UTC, Dmitry Olshansky wrote:That would not always work, but yes that's what we should do IMHO.I didn't do a deeper research on this, just shared my current understanding. When that doesn't work in a particular case, we will need to find the problem and solve that.It's not what packrats do however (when I say algorithm I mean specifically packrat). And PEGs are commonly associated with packrats.As Philippe Sigaud admitted, that is an incorrect association.Well, if you must track the path to a particular state, then DFA simply doesn't match the problem and there's nothing we can do.Yeah it could be done, but it doesn't buy you a thing in a lot of cases unlike in regular grammar. The problem is that state is not as simple as in regex for instance (even in regex that must capture some submatches DFA won't do BTW). Thus you can't merge seemingly identical states because they depend on (different) interpretation of previous part of input.it's obvious how backtracking comes in ... whan say A fails somewhere in the middle of it.Simply stop tracking respective state. Process others in parallel as I described above.That's why DFA in ANTLR is only used to do lookahead or lexing, it doesn't maintain path through states during parsing.See above. Is maintaining path really necessary? I thought that DFA with additional states could be created for any particular situation of ambiguity, etc. This needs further research.Sorry, I incorrectly included the O(n) multiplier. It should be finite. When you go to the next character, you don't need previous states any more.Finite and proportional to input size? Another way to say why an O(n) memory requirement for packrats?Of course there is a fair amount of bookkeeping that reduces doing work all over again but it does ... allocate.The amount of used memory would be proportional to the length of input after the branching point times number of rules (actually, of states in the DFA, which is likely larger but still finite for a particular grammar).Yup. But I'm not talking definitions here I'm talking the notion of "integrated lexing" and lexing is certainly about characters or was it?I thought integrated lexing was about doing it along with parsing and specifying its rules directly inside the grammar. From the former we get structural context to narrow-down available options, and from the latter we have a uniform grammar description.
Jul 07 2012
On 08-Jul-12 00:09, Andrei Alexandrescu wrote:On 7/7/12 3:17 PM, Dmitry Olshansky wrote:I think I said that before, but anyway the main point is: the reason a lexer exists is performance. It's obvious that one can write a grammar without ever using a lexer. If the notation is good, say regex or the one used in PEG (which is not quite the same, it turns out), then tokens can be easily described in the grammar. Given that, a lexer can give only 2 things: - manageability: it's just splitting the grammar in 2 parts that could be maintained separately (in fact some "lexers" are not DFA nor regular) (sometimes it could go the opposite direction - it could be better to have one integrated grammar) - speed. The reason it could be faster is that a lexer can use a highly deterministic grammar. Now back to PEGs and packrats. What I see everywhere I look is essentially integration of a backtracking-NFA lexer, which is neither fast nor particularly convenient. The notation is a whole other matter. So my view on the matter: whether or not the lexer is integrated is a two-fold question: notation vs implementation. E.g. it may make regular, lexer-like things a part of the notation, thus having rules like: Identifier -> Alpha (Alpha|Number|_)* Yet it may or may not use the same engine for _all_ _rules_; in fact, if the implementation optimizes things by ripping off the regular parts of the grammar into small DFA engines, it would make some kind of lexer behind the scenes! Anyway, highly hybrid parsers are all the rage now, and the reasons are obvious - they run fast and can be bent by some magic to fit the quite convoluted grammars of modern languages ;) -- Dmitry OlshanskyI'll have to point out that the whole point about integrated lexing is moot. It's more of liability then benefit. At very least it's just implementation curiosity not advantage.Interesting. I'll have to ask for more details on why.
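As a minimal illustration of the "ripping off regular parts into small DFA engines" idea (a sketch of the general technique, not Pegged's or anyone's actual implementation), the regular rule Identifier -> Alpha (Alpha|Number|_)* can be turned into a tiny hard-coded two-state machine:

import std.ascii : isAlpha, isDigit;

enum State { start, inIdent, reject }

State step(State s, char c)
{
    final switch (s)
    {
        case State.start:
            return isAlpha(c) ? State.inIdent : State.reject;
        case State.inIdent:
            return (isAlpha(c) || isDigit(c) || c == '_') ? State.inIdent : State.reject;
        case State.reject:
            return State.reject;
    }
}

// Longest-match scan: length of the identifier prefix of s, 0 if none.
size_t matchIdentifier(string s)
{
    auto st = State.start;
    size_t len = 0;
    foreach (i, c; s)
    {
        st = step(st, c);
        if (st == State.reject) break;
        len = i + 1;
    }
    return len;
}

void main()
{
    assert(matchIdentifier("foo42 bar") == 5);
    assert(matchIdentifier("42foo") == 0);
}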
Jul 07 2012
On Saturday, 7 July 2012 at 20:27:14 UTC, Dmitry Olshansky wrote:So my view on the matter: whether or not lexer is integrated is two-fold question: notation vs implementation. E.g. it may make regular lexer-like things a part of notation thus having rules like: Identifier -> Alpha (Alpha|Number|_)* Yet it may or may use the same engine for _all_ _rules_, in fact if implementation optimize things by ripping off regular parts of grammar to small DFA engines it would make some kind of lexer behind the scenes!This is the implementation I have in mind as the only sane way to parse PEG efficiently. Plus specializing DFA to only check for those terminals which are allowed according to available structural information at the moment of their invocation.Anyway highly hybrid parsers are all the rage now, and reasons are obvious - they runs fast and can be bend by some magic to quite convoluted grammars of modern languages ;)
Jul 07 2012
On 07/07/2012 21:17, Dmitry Olshansky wrote:On 07-Jul-12 22:23, Andrei Alexandrescu wrote:Many usages only need a lexer.On 7/7/12 6:24 AM, Roman D. Boiko wrote:I'll have to point out that the whole point about integrated lexing is moot. It's more of liability then benefit. At very least it's just implementation curiosity not advantage.On Saturday, 7 July 2012 at 09:06:57 UTC, Roman D. Boiko wrote:Isn't also the fact that lexing and parsing are integrated? With traditional LALR you need a separate tokenizer.http://stackoverflow.com/questions/11373644/performance-of-parsers-peg-vs-lalr1-or-llkSo far it looks like LALR parsers may have lower constant factors than Packrat. The difference could be minimized by paying attention to parsing of terminal symbols, which was in my plans already. It is not necessary to strictly follow Packrat parsing algorithm. The benefits of Pegged, in my view, are its support of Parsing Expression Grammar (PEG) and compile-time evaluation. It is easily extensible and modifiable.
Jul 07 2012
On Thursday, July 05, 2012 14:22:35 Andrei Alexandrescu wrote:I also am actively opposed to a project of just translating D's front-end to D and dropping it into Phobos because it would smother (a) work on generic parser generators, and (b) strong, dependable formalization of D's syntax.I'm not even vaguely convinced that having a lexer/parser specifically for D in Phobos (regardless of whether it comes from dmd or not) will have a negative effect on work for making generic parser generators. People are going to want a good parser generator _regardless_ of what the situation with parsing D is. And I'd be very surprised if you couldn't make a hand-written parser for D which was better than one that you can generate. Parser generation is great, because it allows you to quickly and easily put together a parser from a grammar, but it's unlikely to be as fast as a hand-written one optimized for a particular language. However, as the recent discussion on popFront shows, only benchmarking of actual solutions would show that for sure. Now, the issue of a "strong, dependable formalization of D's syntax" is another thing entirely. Porting dmd's lexer and parser to Phobos would keep the Phobos implementation in line with dmd much more easily and avoid inconsistencies in the language definition and the like. However, if we write a new lexer and parser specifically for Phobos which _doesn't_ port the lexer or parser from dmd, then that _would_ help drive making the spec match the compiler (or vice versa). So, I agree that could be a definite argument for writing a lexer and parser from scratch rather than porting the one from dmd, but I don't buy the bit about it smothering parser generators at all. I think that the use cases are completely different. - Jonathan M Davis
Jul 06 2012
On 2012-07-07 03:12, Jonathan M Davis wrote:Now, the issue of a "strong, dependable formalization of D's syntax" is another thing entirely. Porting dmd's lexer and parser to Phobos would keep the Phobos implementation in line with dmd much more easily and avoid inconsistencies in the language definition and the like. However, if we write a new lexer and parser specifically for Phobos which _doesn't_ port the lexer or parser from dmd, then that _would_ help drive making the spec match the compiler (or vice versa). So, I agree that could be a definite argument for writing a lexer and parser from scratch rather than porting the one from dmd, but I don't buy the bit about it smothering parser generators at all. I think that the use cases are completely different.I think the whole point of having a compiler as a library is that the compiler should use the library as well. Otherwise the two will get out of sync. Just look at Clang, LLVM, LLDB and Xcode, they took the correct approach. Clang and LLVM (and I think LLDB) are available as libraries. Then the compiler, debugger (lldb) and IDE uses these libraries as part of their implementation. They don't have their own implementation that is similar to the libraries, making it "easy" to stay in sync. They _use_ the libraries as libraries. This is what DMD and Phobos should do as well. If it's too complicated to port the lexer/parser to D then it would be better, at least as a first step, to modify DMD as needed. Create a C API for DMD and then create D bindings to be put into Phobos. -- /Jacob Carlborg
Jul 07 2012
On 07/07/2012 13:05, Jacob Carlborg wrote:On 2012-07-07 03:12, Jonathan M Davis wrote:I tried that. This is almost impossible, dmd's parser and AST are very tightly mixed with dmd's internals.Now, the issue of a "strong, dependable formalization of D's syntax" is another thing entirely. Porting dmd's lexer and parser to Phobos would keep the Phobos implementation in line with dmd much more easily and avoid inconsistencies in the language definition and the like. However, if we write a new lexer and parser specifically for Phobos which _doesn't_ port the lexer or parser from dmd, then that _would_ help drive making the spec match the compiler (or vice versa). So, I agree that could be a definite argument for writing a lexer and parser from scratch rather than porting the one from dmd, but I don't buy the bit about it smothering parser generators at all. I think that the use cases are completely different.I think the whole point of having a compiler as a library is that the compiler should use the library as well. Otherwise the two will get out of sync. Just look at Clang, LLVM, LLDB and Xcode, they took the correct approach. Clang and LLVM (and I think LLDB) are available as libraries. Then the compiler, debugger (lldb) and IDE uses these libraries as part of their implementation. They don't have their own implementation that is similar to the libraries, making it "easy" to stay in sync. They _use_ the libraries as libraries. This is what DMD and Phobos should do as well. If it's too complicated to port the lexer/parser to D then it would be better, at least as a first step, to modify DMD as needed.
Jul 07 2012
On Saturday, 7 July 2012 at 16:49:09 UTC, deadalnix wrote:I tried that. This is almost impossible, dmd's parser and AST are very tightly mixed with dmd's internals.+1
Jul 07 2012
On 2012-07-07 18:49, deadalnix wrote:I tried that. This is almost impossible, dmd's parser and AST are very tightly mixed with dmd's internals.I suspected that. -- /Jacob Carlborg
Jul 08 2012
On Saturday, July 07, 2012 13:05:29 Jacob Carlborg wrote:On 2012-07-07 03:12, Jonathan M Davis wrote:There are multiple issues here. The one that Andrei is raising is the fact that D isn't properly formalized. Essentially, the compiler _is_ the spec, and while the online spec _mostly_ follows it, it doesn't entirely, and the online spec isn't always as precise as it needs to be regardless. With a fully formalized spec, it should be possible to fully implement a D compiler from the spec alone, and I don't think that that's currently possible. Writing a D lexer and parser (if not a full-blown compiler) from scratch would help highlight the places in the spec which are lacking, and having it be part of Phobos would arguably increase Walter's incentive to make sure that the spec is in line with the compiler (and vice versa) so that stuff _other_ than the compiler which is based on the spec would be able to match the compiler. Clang is in a _completely_ different situation. It's a C/C++ compiler, and both C and C++ already have official, formalized specs which Clang conforms to (or is supposed to anyway). Clang has no control over the spec at all. It merely implements it. So, there is no issue of trying to keep other tools or compilers in line with Clang due to it being the language's spec like we effectively have with dmd. It may help the tools which use Clang to be fully in line with Clang and not have to worry about whether Clang implements the spec slightly differently, but in theory, if they all follow the spec correctly, that isn't an issue (though obviously in practice it can be). In D's case, all of the major D compilers use the same frontend, which helps compatibility but harms the spec, because there's less incentive to keep it precise and up-to-date. So, from the perspective of the spec, implementing the D lexer and parser for Phobos separately from dmd would be of great benefit. IMHO, the reason that porting dmd's lexer and parser would be of great benefit is primarily maintenance. It makes it much easier to keep Phobos' lexer and parser in line with dmd, making discrepancies less likely, but it arguably reduces the incentive to improve the spec. The benefits of having a lexer and parser as a library (regardless of whether it's from scratch or a port from dmd) are primarily derived from the fact that it makes it much easier to create tools which use them. Such tools no longer have to write their own lexers or parsers. If the compiler uses the same library, it has the added benefit of making it so that any tool using the library will be in sync with the compiler, but if the spec is properly formalized and up-to-date, and the library is kept up-to-date with _it_, that shouldn't really be a problem. You still have the debate as to whether it's better to have a separate implementation based on the spec (thereby making it more likely that the spec is correct) or whether it's better to have the compiler share the implementation so that the library is guaranteed to match the compiler (though not necessarily the spec), but I think that that's a separate debate from whether we should have the lexer and parser as a library. In all honesty though, I would be surprised if you could convince Walter to switch dmd's frontend to Phobos' lexer and parser even once they've been implemented. So, while I agree that there are benefits in doing so, I'm not sure how much chance you have of ever getting any traction with that. 
Another thing to consider is that that might make it so that gdc and ldc couldn't share the same frontend with dmd (assuming that they have to keep their frontends in C or C++ - I don't know if they do) - but if so, that would increase the incentive for the spec to be correct if dmd ever started using a D frontend. - Jonathan M DavisNow, the issue of a "strong, dependable formalization of D's syntax" is another thing entirely. Porting dmd's lexer and parser to Phobos would keep the Phobos implementation in line with dmd much more easily and avoid inconsistencies in the language definition and the like. However, if we write a new lexer and parser specifically for Phobos which _doesn't_ port the lexer or parser from dmd, then that _would_ help drive making the spec match the compiler (or vice versa). So, I agree that could be a definite argument for writing a lexer and parser from scratch rather than porting the one from dmd, but I don't buy the bit about it smothering parser generators at all. I think that the use cases are completely different.I think the whole point of having a compiler as a library is that the compiler should use the library as well. Otherwise the two will get out of sync. Just look at Clang, LLVM, LLDB and Xcode, they took the correct approach. Clang and LLVM (and I think LLDB) are available as libraries. Then the compiler, debugger (lldb) and IDE uses these libraries as part of their implementation. They don't have their own implementation that is similar to the libraries, making it "easy" to stay in sync. They _use_ the libraries as libraries. This is what DMD and Phobos should do as well. If it's too complicated to port the lexer/parser to D then it would be better, at least as a first step, to modify DMD as needed. Create a C API for DMD and then create D bindings to be put into Phobos.
Jul 07 2012
"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message news:mailman.163.1341683651.31962.digitalmars-d puremagic.com...[snip] In all honesty though, I would be surprised if you could convince Walter to switch dmd's frontend to Phobos' lexer and parser even once they've been implemented. So, while I agree that there are benefits in doing so, I'm not sure how much chance you have of ever getting any traction with that. - Jonathan M DavisA more realistic alternative is to modify dmd to optionally (at build/link time) use Phobos' lexer and/or parser, which would allow them to be run against the test suite. Parser and lexer changes are relatively rare in the compiler, so keeping two implementations in sync is not really that big a deal.
Jul 07 2012
On 2012-07-07 19:53, Jonathan M Davis wrote:There are multiple issues here. The one that Andrei is raising is the fact that D isn't properly formalized. Essentially, the compiler _is_ the spec, and while the online spec _mostly_ follows it, it doesn't entirely, and the online spec isn't always as precise as it needs to be regardless. With a fully formalized spec, it should be possible to fully implement a D compiler from the spec alone, and I don't think that that's currently possible.That's been a problem for a long time.Writing a D lexer and parser (if not a full-blown compiler) from scratch would help highlight the places in the spec which are lacking, and having it be part of Phobos would arguably increase Walter's incentive to make sure that the spec is in line with the compiler (and vice versa) so that stuff _other_ than the compiler which is based on the spec would be able to match the compiler. Clang is in a _completely_ different situation. It's a C/C++ compiler, and both C and C++ already have official, formalized specs which Clang conforms to (or is supposed to anyway). Clang has no control over the spec at all. It merely implements it. So, there is no issue of trying to keep other tools or compilers in line with Clang due to it being the language's spec like we effectively have with dmd. It may help the tools which use Clang to be fully in line with Clang and not have to worry about whether Clang implements the spec slightly differently, but in theory, if they all follow the spec correctly, that isn't in issue (though obviously in practice it can be).That should be true for D as well when the spec is complete.In D's case, all of the major D compilers use the same frontend, which helps compatability but harms the spec, because there's less incentive to keep it precise and up-to-date. So, from the perspective of the spec, implementing the D lexer and parser for Phobos separately from dmd would be of great benefit.That might be true.IMHO, the reason that porting dmd's lexer and parser would be of great benefit is primarily maintenance. It makes it much easier to keep Phobos' lexer and parser in line with dmd, making discrepencies less likely, but it arguably reduces the incentive to improve the spec.But then it sounds like the best solution would be not to have a lexer/parser based on DMD and instead making sure the spec is correct.The benefits of having a lexer and parser as a library (regardless of whether it's from scratch or a port from dmd) are primarly derived from the fact that it makes it much easier to create tools which use them. Such tools no longer have to write their own lexers or parsers.Of course, that's the whole point.If the compiler uses the same library, it has the added benefit of making it so that any tool using the library will be in sync with the compiler, but if the spec is properly formalized and up-to-date, and the library is kep up-to-date with _it_, that shouldn't really be a problem. You still have the debate as to whether it's better to have a separate implementation based on the spec (thereby making it more likely that the spec is correct) or whether it's better to have the compiler share the implementation so that the library is guaranteed to match the compiler (though not necessarily the spec), but I think that that's a separate debate from whether we should have the lexer and parser as a library.What keeps popping up in my head is the scenario where users are complaining about the frontend in Phobos not behaving as their compiler. 
Whether this is because they are out of sync, because of bugs, or because of not following the spec doesn't matter. I still think D is changing too much to have two separate implementations of the compiler and a library like this.In all honesty though, I would be surprised if you could convince Walter to switch dmd's frontend to Phobos' lexer and parser even once they've been implemented. So, while I agree that there are benefits in doing so, I'm not sure how much chance you have of ever getting any traction with that.That is perhaps true.Another thing to consider is that that might make it so that gdc and ldc couldn't share the same frontend with dmd (assuming that they have to keep their frontends in C or C++ - I don't know if they do) - but if so, that would increase the incentive for the spec to be correct if dmd ever started using a D frontend.-- /Jacob Carlborg
Jul 08 2012
On Sunday, July 08, 2012 14:13:18 Jacob Carlborg wrote:On 2012-07-07 19:53, Jonathan M Davis wrote:True, but Andrei is basically arguing that porting dmd's lexer and parser to D would reduce the incentive to fix this, whereas having a new lexer/parser would encourage fixing it (though he's arguing for purely using a generative parser rather than writing a D-specific one).There are multiple issues here. The one that Andrei is raising is the fact that D isn't properly formalized. Essentially, the compiler _is_ the spec, and while the online spec _mostly_ follows it, it doesn't entirely, and the online spec isn't always as precise as it needs to be regardless. With a fully formalized spec, it should be possible to fully implement a D compiler from the spec alone, and I don't think that that's currently possible.That's been a problem for a long time.That would be the question. Given enough time, what I'd probably want to do would be to port dmd's lexer and parser to D in order to properly figure it all out, updating the spec in the process, and _then_ go write one from scratch, possibly basing some of it on how dmd did it, possibly not. But I don't really have the time for any of that right now, and most of the focus right now from people interested in parsing seems to be on pegged and parser generators (which are very cool and in some ways much more interesting, but I seriously question that that's the performant way to go if you're looking to parse D specifically). So, who knows what we're going to get out of this and when we'll get it.IMHO, the reason that porting dmd's lexer and parser would be of great benefit is primarily maintenance. It makes it much easier to keep Phobos' lexer and parser in line with dmd, making discrepencies less likely, but it arguably reduces the incentive to improve the spec.But then it sounds like the best solution would be not to have a lexer/parser based on DMD and instead making sure the spec is correct.What keeps popping up in my head is the scenario where users are complaining about the frontend in Phobos not behaving as their compiler. If this is due to they are out of sync, bugs or not following the spec doesn't matter. I still thinks D is changing too much to have two separate implementations of the compiler and a library like this.Well, for the lexer and parser, we're probably okay (or very close to it). As Daniel pointed out elsewhere in this thread, that part of the frontend doesn't change much (particularly the lexer). There's definitely still some churn, but it's nothing like it used to be. - Jonathan M Davis
Jul 08 2012
On Sunday, 8 July 2012 at 20:06:07 UTC, Jonathan M Davis wrote:most of the focus right now from people interested in parsing seems to be on pegged and parser generators (which are very cool and in some ways much more interesting, but I seriously question that that's the performant way to go if you're looking to parse D specifically).Can you provide a *specific* example of a performance optimization which a hand-written D parser would have, but which a parser generator would be unlikely to match?
Jul 08 2012
On Sunday, July 08, 2012 22:15:26 Roman D. Boiko wrote:On Sunday, 8 July 2012 at 20:06:07 UTC, Jonathan M Davis wrote:It's been too long since I was actively working on parsers to give any details, but it is my understanding that because a hand-written parser is optimized for a specific grammar, it's going to be faster. Also, look at dmd and dmc vs other compilers. They use hand-written parsers and are generally much faster than their competitors. One thing to remember about hand-written parsers vs generative ones though is that they usually are completely different in terms of the type of parser that you write (e.g. hand-written parsers are generally recursive-descent parsers whereas generative ones usually use bottom-up parsers). So, that could have a large impact on performance as well (in either direction). To be clear though, I have _no_ problem with having a generative parser in Phobos (or having other 3rd party options available). Parsers like pegged are extremely cool and extremely useful. It's just that it's my understanding that well-written hand-written parsers are faster than generated ones, so I think that it would be beneficial to have a hand-written parser for D in Phobos _in addition_ to a general, generative solution. But to fully prove that a hand-written one would be faster, we'd of course have to have actual solutions to compare. And if the API for a D-specific parser in Phobos is designed well enough, and it somehow proved that a generative solution was faster, then the hand-written one could be replaced by the generative one underneath the hood. - Jonathan M Davismost of the focus right now from people interested in parsing seems to be on pegged and parser generators (which are very cool and in some ways much more interesting, but I seriously question that that's the performant way to go if you're looking to parse D specifically).Can you provide a *specific* example of performance optimization which a custom D compiler would have, but parser generator would be unlikely to catch up.
Jul 08 2012
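To make the hand-written vs. generated comparison concrete, here is what "optimized for a specific grammar" tends to look like in practice: one function per rule, with every decision made by peeking at the next character. This is only a sketch in D over an invented toy grammar; none of the names below come from dmd or Phobos.

import std.ascii : isAlpha, isDigit;

// Toy grammar, one function per rule. Every decision is made by looking at
// the next character, which is the kind of specialization a hand-written
// parser gets for free.
//   Assignment -> Identifier '=' Number ';'
struct ToyParser
{
    string src;
    size_t pos;
    string[] errors;

    bool parseAssignment()
    {
        return parseIdentifier() && expect('=') && parseNumber() && expect(';');
    }

    bool parseIdentifier()
    {
        immutable start = pos;
        while (pos < src.length && (isAlpha(src[pos]) || src[pos] == '_'))
            ++pos;
        return check(pos > start, "identifier expected");
    }

    bool parseNumber()
    {
        immutable start = pos;
        while (pos < src.length && isDigit(src[pos]))
            ++pos;
        return check(pos > start, "number expected");
    }

    bool expect(char c)
    {
        if (pos < src.length && src[pos] == c) { ++pos; return true; }
        return check(false, "'" ~ c ~ "' expected");
    }

    bool check(bool ok, string message)
    {
        if (!ok) errors ~= message;
        return ok;
    }
}

unittest
{
    auto p = ToyParser("answer=42;");
    assert(p.parseAssignment() && p.errors.length == 0);
}

The point of the example is that nothing generic sits between the grammar and the code, which is where hand-written parsers usually win their speed.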
On Sunday, 8 July 2012 at 21:03:41 UTC, Jonathan M Davis wrote:It's been too long since I was actively working on parsers to give any details, but it is my understanding that because a hand-written parser is optimized for a specific grammar, it's going to be faster.My aim is to find out any potential bottlenecks and ensure that those are possible to get rid of. So, let's try. I believe it would not hurt generality or quality of a parser generator if it contained sews for inserting custom (optimized) code where necessary, including those needed to take advantage of some particular aspects of D grammar. Thus I claim that optimization for D grammar is possible.Also, look at dmd and dmc vs other compilers. They use hand-written parsers and are generally much faster than their competitors.Due to which particular design or implementation decisions? Would it be possible to name a few which are relevant in this context?One thing to remember about hand-written parsers vs generative ones though is that they usually are completely different in terms of the type of parser that you write (e.g. hand-written parsers are generally recursive-decent parser whereas generative ones usually use bottom-up parsers).So far discussion goes in favor of LL(*) parser like ANTLR, which is top-down recursive-descent. Either Pegged will be optimized with LL(*) algorithms, or a new parser generator created.
Jul 08 2012
I believe it would not hurt generality or quality of a parser generator if it contained sews for inserting custom (optimized)s/sews/seams
Jul 08 2012
On Sunday, 8 July 2012 at 21:22:39 UTC, Roman D. Boiko wrote:On Sunday, 8 July 2012 at 21:03:41 UTC, Jonathan M Davis wrote:I'm convinced that the output of a parser generator (PG) can be very nearly as fast as hand-written code. ANTLR's output (last I checked) was not ideal, but the one I planned to make (a few years ago) would have produced faster code. By default the PG's output will not be the speed of hand-written code, but the user can optimize it. Assuming an ANTLR-like PG, the user can inspect the original output looking for inefficient lookahead, or cases where the parser looks for rare cases before common cases, and then improve the grammar and insert ... I forget all the ANTLR terminology ... syntactic predicates or whatever, to optimize the parser.It's been too long since I was actively working on parsers to give any details, but it is my understanding that because a hand-written parser is optimized for a specific grammar, it's going to be faster.My aim is to find out any potential bottlenecks and ensure that those are possible to get rid of. So, let's try. I believe it would not hurt generality or quality of a parser generator if it contained sews for inserting custom (optimized) code where necessary, including those needed to take advantage of some particular aspects of D grammar. Thus I claim that optimization for D grammar is possible.So far discussion goes in favor of LL(*) parser like ANTLR, which is top-down recursive-descent. Either Pegged will be optimized with LL(*) algorithms, or a new parser generator created.Right, for instance I am interested in writing a top-down PG because I understand them better and prefer the top-down approach due to its flexibility (semantic actions, allowing custom code) and understandability (the user can realistically understand the output; in fact readability would be a specific goal of mine) Roman, regarding what you were saying to me earlier:In stage 2 you have only performed some basic analysis, like, e.g., matched braces to define some hierarchy. This means that at the time when you find a missing brace, for example, you cannot tell anything more than that braces don't match.Stage 2 actually can tell more than just "a brace is missing somewhere". Because so many languages are C-like. So given this situation: frob (c &% x) } It doesn't need to know what language this is to tell where the brace belongs. Even in a more nebulous case like: frob (c &% x) bar lic } probably the brace belongs at the end of the first line. Perhaps your point is that there are situations where a parser that knows the "entire" grammar could make a better guess about where the missing brace/paren belongs. That's certainly true. However, just because it can guess better, doesn't mean it can reinterpret the code based on that guess. I mean, I don't see any way to "back up" a parser by an arbitrary amount. A hypothetical stage 2 would probably be hand-written and could realistically back up and insert a brace/paren anywhere that the heuristics dictate, because it is producing a simple data structure and it doesn't need to do any semantic actions as it parses. A "full" parser, on the other hand, has done a lot of work that it can't undo, so the best it can do is report to the user "line 54: error: brace mismatch; did you forget a brace on line 13?" 
The heuristic is still helpful, but it has already parsed lines 13 to 54 in the wrong context (and, in some cases, has already spit out a series of error messages that are unrelated to the user's actual mistake).As I demonstrated in some examples, it could get the output which implies incorrect structureI was unable to find the examples you refer to... this thread's getting a little unwieldy :)
Jul 08 2012
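A minimal sketch of the "stage 2" bracket-matching pass described above, written in D. The names and the error policy (blaming the line where the unclosed bracket was opened) are invented for illustration, and a real pre-pass would also have to skip string literals and comments.

struct Mismatch
{
    size_t line;   // line where the problem was detected or introduced
    char expected; // the closer we were waiting for, or '\0' for an extra closer
}

// Bracket-matching pre-pass: it knows nothing about the language except
// which characters open and close a nesting level.
Mismatch[] checkBrackets(string source)
{
    static char closerFor(char c)
    {
        switch (c)
        {
            case '(': return ')';
            case '[': return ']';
            case '{': return '}';
            default:  return '\0';
        }
    }

    struct Open { char closer; size_t line; }
    Open[] stack;
    Mismatch[] problems;
    size_t line = 1;

    foreach (c; source)
    {
        if (c == '\n') { ++line; continue; }
        immutable closer = closerFor(c);
        if (closer != '\0')
            stack ~= Open(closer, line);
        else if (c == ')' || c == ']' || c == '}')
        {
            if (stack.length && stack[$ - 1].closer == c)
                stack = stack[0 .. $ - 1];
            else
                problems ~= Mismatch(line, '\0');   // unexpected closer
        }
    }
    foreach (open; stack)                           // still-unclosed openers
        problems ~= Mismatch(open.line, open.closer);
    return problems;
}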
David, I would suggest you to create a proof-of-concept prototype and pay attention to some non-trivial use cases. This way you'd likely be able to reduce major risks significantly before making any serious time investments.
Jul 08 2012
On 2012-07-08 22:15, Roman D. Boiko wrote:Can you provide a *specific* example of performance optimization which a custom D compiler would have, but parser generator would be unlikely to catch up.I'm no expert in this field, but I've heard that it's harder to get good error reporting with a parser generator. -- /Jacob Carlborg
Jul 08 2012
On Monday, 9 July 2012 at 06:35:38 UTC, Jacob Carlborg wrote:I'm not expert in this field but I've heard that it's harder to get good error reporting with a parser generator.Good point!
Jul 09 2012
On 2012-07-08 22:05, Jonathan M Davis wrote:Well, for the lexer and parser, we're probably okay (or very close to it). As Daniel pointed out elsewhere in this thread, that part of the frontend doesn't change much (particularly the lexer). There's definitely still some churn, but it's nothing like it used to be.It doesn't feel like that; didn't we get lambdas or UFCS in the last release? -- /Jacob Carlborg
Jul 08 2012
On Monday, July 09, 2012 08:33:31 Jacob Carlborg wrote:On 2012-07-08 22:05, Jonathan M Davis wrote:lambdas are a bit older than that IIRC, and I don't think that UFCS actually affects the lexer or parser. Yes, changes are still made, but they're increasingly rare, and most of the stuff being changed is on the semantic level (most of which is bug fixes). So, out of all of the parts of the compiler to duplicate, those are the least likely to have to have major changes made to them later (especially the lexer). - Jonathan M DavisWell, for the lexer and parser, we're probably okay (or very close to it). As Daniel pointed out elsewhere in this thread, that part of the frontend doesn't change much (particularly the lexer). There's definitely still some churn, but it's nothing like it used to be.It don't feel like that, didn't we get lambdas or UFCS in the last release?
Jul 08 2012
On 2012-07-09 08:39, Jonathan M Davis wrote:lambdas are a bit older than that IIRC, and I don't think that UFCS actually affects the lexer or parser. Yes, changes are still made, but they're increasingly rare, and most of the stuff being changed is on the semantic level (most of which is bug fixes). So, out of all of the parts of the compiler to duplicate, those are the least likely to have to have major changes made to them later (especially the lexer).I'm pretty sure UFCS affects lexing or parsing. How else would this be legal: 4.foo(); -- /Jacob Carlborg
Jul 09 2012
On Monday, July 09, 2012 09:16:41 Jacob Carlborg wrote:On 2012-07-09 08:39, Jonathan M Davis wrote:That definitely wouldn't affect lexing, because it doesn't affect the tokens at all. Whether it affects the parser would depend on the grammar. There's a good chance it would affect parsing, but it might not. It depends on how numeric literals are treated. In most cases though, UFCS definitely wouldn't affect parsing at all, because it's purely a matter of symbol resolution, so if changes had to be made to the parser, they would almost certainly have been minimal. Yes, some changes are still being made to the parser, but since the language is mostly frozen, almost all of the changes have gone into bug fixes, which are generally semantic issues and don't affect lexing or parsing at all. So, while there would still be some maintenance issues in keeping a separate lexer and parser in line with dmd, I really don't think that it would take much at this point. Most of the work would be in getting them to match dmd when they're first written. - Jonathan M Davislambdas are a bit older than that IIRC, and I don't think that UFCS actually affects the lexer or parser. Yes, changes are still made, but they're increasingly rare, and most of the stuff being changed is on the semantic level (most of which is bug fixes). So, out of all of the parts of the compiler to duplicate, those are the least likely to have to have major changes made to them later (especially the lexer).I'm pretty sure UFCS affects lexing or parsing. How else would this be legal: 4.foo();
Jul 09 2012
"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message news:mailman.190.1341818983.31962.digitalmars-d puremagic.com...Not true. This used to be lexed as '4.f' 'oo'. (I think)I'm pretty sure UFCS affects lexing or parsing. How else would this be legal: 4.foo();That definitely wouldn't affect lexing, because it doesn't affect the tokens at all.
Jul 09 2012
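For readers wondering how a lexer can make 4.foo() work at all, one common approach is to stop the number literal at a '.' unless a digit follows, leaving the '.' for the parser. The following sketch shows the technique only; it is not dmd's actual lexer, and the function name is invented.

import std.ascii : isDigit;

// Scan an integer or float literal starting at pos and return the lexeme.
// When the '.' is not followed by a digit, it is left in place, so that
// 4.foo() lexes as 4 . foo rather than 4.f oo.
string scanNumber(string src, ref size_t pos)
{
    immutable start = pos;
    while (pos < src.length && isDigit(src[pos]))
        ++pos;

    if (pos + 1 < src.length && src[pos] == '.'
        && isDigit(src[pos + 1]))            // only digits extend the literal
    {
        ++pos;                               // consume '.'
        while (pos < src.length && isDigit(src[pos]))
            ++pos;
    }
    return src[start .. pos];
}

unittest
{
    size_t pos = 0;
    assert(scanNumber("4.foo()", pos) == "4");   // '.' left for the parser
    pos = 0;
    assert(scanNumber("4.5+x", pos) == "4.5");
}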
On 07/09/2012 08:33 AM, Jacob Carlborg wrote:On 2012-07-08 22:05, Jonathan M Davis wrote:Those changes are trivial to incorporate into a well written parser.Well, for the lexer and parser, we're probably okay (or very close to it). As Daniel pointed out elsewhere in this thread, that part of the frontend doesn't change much (particularly the lexer). There's definitely still some churn, but it's nothing like it used to be.It don't feel like that, didn't we get lambdas or UFCS in the last release?
Jul 09 2012
On 05.07.2012 20:28, Jonathan M Davis wrote:It's all too common for someone to suggest that we should do something or implement something without ever attempting to do it themselves, and in general, stuff around here gets done because someone really wants it done, takes the time to do it, and sees it through until its done and in Phobos.I didn't want anyone to code anything at all! I, on the contrary, want them to stop writing parsers, because it has only one consequence: one has to spend more time to find a better parser. -- Денис В. Шеломовский Denis V. Shelomovskij
Jul 05 2012
On Thursday, July 05, 2012 22:23:00 Denis Shelomovskij wrote:On 05.07.2012 20:28, Jonathan M Davis wrote:It's all too common for someone to suggest that we should do something or implement something without ever attempting to do it themselves, and in general, stuff around here gets done because someone really wants it done, takes the time to do it, and sees it through until its done and in Phobos.I didn't want for someone to code anything at all! I on the contrary want them to stop writing parsers because it results in the only consequence: one have to spend more time to find a better parser.Well, until a lexer and parser for D make it into Phobos, more people are going to keep writing them, and even if/when they _do_ make it into Phobos, people will keep writing them, because some people like to write that kind of thing. Honestly though, if your complaint is that there's too much choice, I don't have much sympathy for you. In general, we have too little D code out there, not too much. If there's a problem with more parsers being written, I think that it's almost purely an issue of some people's time probably being better spent on other projects, but it's their right to work on whatever they feel like working on. - Jonathan M Davis
Jul 05 2012
On 05-Jul-12 22:23, Denis Shelomovskij wrote:On 05.07.2012 20:28, Jonathan M Davis wrote:It doesn't work like that in OpenSource. No matter what you do people keep writing code ;) -- Dmitry OlshanskyIt's all too common for someone to suggest that we should do something or implement something without ever attempting to do it themselves, and in general, stuff around here gets done because someone really wants it done, takes the time to do it, and sees it through until its done and in Phobos.I didn't want for someone to code anything at all! I on the contrary want them to stop writing parsers because it results in the only consequence: one have to spend more time to find a better parser.
Jul 05 2012
Le 05/07/2012 18:28, Jonathan M Davis a écrit :On Thursday, July 05, 2012 18:04:11 Roman D. Boiko wrote:You never looked at dmd frontend soure code don't you ? It is a horror museum, and if we want to have something in D, I certainly wouldn't copy that.On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:It's been discussed before, but there are some obvious use cases such as syntax highlighting, code formatting, and manipulation of D source files (e.g. to strip out the unit tests). We need to have a lexer in Phobos which parsers D code and generates a range of tokens, and we need to have a parser in Phobos which operates on a range of tokens. As discussed previously, there is desire to have the lexer and parser ported from dmd's frontend to D for Phobos so that what we get is easily maintainable alongside dmd's frontend and produces the same results (and is fast).On 2012-07-05 15:08, Roman D. Boiko wrote:Well, we did something like that for DCT... but I doubt that it would fit general needs. If we had, why haven't they been analyzed, classified, discussed, etc.? Or have they?Anyway I propose to enumerate major use cases first.Haven't we already done that a couple of times.
Jul 05 2012
On Friday, July 06, 2012 00:24:11 deadalnix wrote:On 05/07/2012 18:28, Jonathan M Davis wrote:On Thursday, July 05, 2012 18:04:11 Roman D. Boiko wrote:On Thursday, 5 July 2012 at 15:40:53 UTC, Jacob Carlborg wrote:It's been discussed before, but there are some obvious use cases such as syntax highlighting, code formatting, and manipulation of D source files (e.g. to strip out the unit tests). We need to have a lexer in Phobos which parsers D code and generates a range of tokens, and we need to have a parser in Phobos which operates on a range of tokens. As discussed previously, there is desire to have the lexer and parser ported from dmd's frontend to D for Phobos so that what we get is easily maintainable alongside dmd's frontend and produces the same results (and is fast).On 2012-07-05 15:08, Roman D. Boiko wrote:Well, we did something like that for DCT... but I doubt that it would fit general needs. If we had, why haven't they been analyzed, classified, discussed, etc.? Or have they?Anyway I propose to enumerate major use cases first.Haven't we already done that a couple of times.You never looked at dmd frontend soure code don't you ? It is a horror museum, and if we want to have something in D, I certainly wouldn't copy that.I have definitely looked at dmd's frontend's source code. The idea behind copying the lexer and parser from there is then that they'd match the compiler, which would make it much easier to keep them in sync with changes to the compiler. Whether that's what someone would have written had they done it purely in D initially is pretty much irrelevant. - Jonathan M Davis
Jul 05 2012
On 07/05/2012 08:28 AM, Roman D. Boiko wrote:Pegged may eventually become standard, if it will be performance optimized and a bit more customizableInteresting, I thought you were hand-writing this stuff. I'm a fan of pegged and made some pull requests, one of which was aimed at making it more customizable by allowing the user to define what type gets used as a parse tree node, thus allowing one to potentially use their parse tree as an AST (and maybe do a couple other things too). It's a WIP, but the proof of concept is done: DMD can handle the extra templating at compile time, so it works. What kind of things did you want in terms of customizability?
Jul 07 2012
On Saturday, 7 July 2012 at 15:42:30 UTC, Chad J wrote:On 07/05/2012 08:28 AM, Roman D. Boiko wrote:There are many possible customization points which would make it a better fit for my project while also being useful in general. The most critical was the one you implemented: ability to define a custom parse tree. I also needed the ability to use immutable structures (almost) everywhere, while currently they must be mutable. Also it would be great to avoid UTF conversions (instead use functions and types templated by the string type). I also wanted to add the ability to reuse parts of previously generated AST which correspond to source code that has not been changed, or to identical source code parsed previously. (This would allow incremental parsing, e.g., while the user is typing.) The latter would require tracking source code positions separately, or even generating them on demand (this is the feature actively criticized by deadalnix in some previous discussions but central to DCT architecture). The AST would only hold information about the widths of its nodes. I've also written some notes (10-15 suggestions) while studying Pegged code, which will be shared later. However, as I found a few hours ago, Packrat parsing (typically used to handle PEG) has serious disadvantages: it complicates debugging because of frequent backtracking, it has problems with error recovery, and typically disallows adding actions with side effects (because of the possibility of backtracking). These are important enough to reconsider my plans of using Pegged. I will try to analyze whether the issues are so fundamental that I (or somebody else) will have to create an ANTLR-like parser instead, or whether it is possible to introduce changes into Pegged that would fix these problems.Pegged may eventually become standard, if it will be performance optimized and a bit more customizableInteresting, I thought you were hand-writing this stuff. I'm a fan of pegged and made some pull requests, one of which was aimed at making it more customizable by allowing the user to define what type gets used as a parse tree node, thus allowing one to potentially use their parse tree as an AST (and maybe do a couple other things too). It's a WIP, but the proof of concept is done: DMD can handle the extra templating at compile time, so it works. What kind of things did you want in terms of customizability?
Jul 07 2012
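Two of the points above (templating on the string type, and nodes that know only their own width so unchanged subtrees can be reused after an edit) can be illustrated with a small hypothetical node type in D. This is not Pegged's or DCT's actual data structure, just a sketch of the idea.

import std.traits : isSomeString;

// Hypothetical parse-tree node, parameterized on the string type and storing
// only the width of the text it covers. Absolute positions are computed on
// demand by summing the widths of preceding siblings, so a subtree built for
// unchanged source can be reused as-is after an edit earlier in the file.
struct Node(S) if (isSomeString!S)
{
    string rule;      // e.g. "Declaration"
    size_t width;     // number of code units this node covers
    Node!S[] children;

    // Start offset of the i-th child, relative to this node.
    size_t childOffset(size_t i) const
    {
        size_t offset = 0;
        foreach (child; children[0 .. i])
            offset += child.width;
        return offset;
    }
}

unittest
{
    alias N = Node!string;
    auto decl = N("Declaration", 9, [N("Type", 4), N("Identifier", 5)]);
    assert(decl.childOffset(1) == 4);   // "Identifier" starts 4 units in
}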
On Sat, Jul 7, 2012 at 6:06 PM, Roman D. Boiko <rb d-coding.com> wrote:The most critical was the one you implemented: ability to define a custom parse tree. I also needed the ability to use immutable structures (almost) everywhere, while currently they must be mutable. Also it would be great to avoid UTF conversions (instead use functions and types templated by the string type).I added dstrings because 1- at the time (a few months ago), the lists here were awash in UTF-32 discussions and I thought that'd be the way to go anyway 2- other D parsing libraries seemed to go to UTF32 also (CTPG) 3- I wanted to be able to parse mathematical notation like nabla, derivatives, etc. which all have UTF32 symbols.However, as I found a few hours ago, Packrat parsing (typically used to handle PEG) has serious disadvantages: it complicates debugging because of frequent backtracking, it has problems with error recovery, and typically disallows to add actions with side effects (because of possibility of backtracking). These are important enough to reconsider my plans of using Pegged. I will try to analyze whether the issues are so fundamental that I (or somebody else) will have to create an ANTLR-like parser instead, or whether it is possible to introduce changes into Pegged that would fix these problems.Note that PEG does not impose the use of packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford's thesis used the two together. Note that many PEG parsers do not rely on packrat (Pegged does not). There are a bunch of articles on Bryan Ford's website by a guy writing a PEG parser for Java, who found that storing the last rules was enough to get a slight speed improvement, but that doing any more storage was detrimental to the parser's overall efficiency.
Jul 07 2012
On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:I added dstrings because 1- at the time (a few months ago), the lists here were awash in UTF-32 discussions and I thought that'd be the way to go anyway 2- other D parsing libraries seemed to go to UTF32 also (CTPG) 3- I wanted to be able to parse mathematical notation like nabla, derivatives, etc. which all have UTF32 symbols.I propose to switch code to use S if(isSomeString!S) everywhere. Client code would first determine source encoding scheme, and then instantiate parsers specifying a string type. This is not a trivial change, but I'm willing to help implementing it.Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together. Note that many PEG parsers do not rely on packrat (Pegged does not). There are a bunch of articles on Bryan Ford's website by a guy writting a PEG parser for Java, and who found that storing the last rules was enought to get a slight speed improvement, buth that doing anymore sotrage was detrimental to the parser's overall efficiency.That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development.
Jul 07 2012
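At the API level the proposal might look roughly like this; the signatures are hypothetical, not Pegged's real ones, and the "parser" below does no actual parsing.

import std.traits : isSomeString;

// Hypothetical signatures: each rule is instantiated per string type, so
// client code that already holds UTF-8, UTF-16 or UTF-32 text never pays
// for a conversion.
struct ParseTree(S) if (isSomeString!S)
{
    bool successful;
    S matched;
}

ParseTree!S parseModule(S)(S input) if (isSomeString!S)
{
    // Real parsing would happen here; this sketch just echoes the input.
    return ParseTree!S(true, input);
}

unittest
{
    auto r8  = parseModule("module a;");      // string  (UTF-8)
    auto r32 = parseModule("module a;"d);     // dstring (UTF-32)
    assert(r8.successful && r32.successful);
}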
On 07/07/2012 12:37 PM, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:Yeah, it's good to hear this notion reinforced. I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance. Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations. At least that's what makes sense to me.Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together. Note that many PEG parsers do not rely on packrat (Pegged does not). There are a bunch of articles on Bryan Ford's website by a guy writting a PEG parser for Java, and who found that storing the last rules was enought to get a slight speed improvement, buth that doing anymore sotrage was detrimental to the parser's overall efficiency.That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development.
Jul 07 2012
On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:Yeah, it's good to hear this notion reinforced. I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance. Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations. At least that's what makes sense to me.At the very least, we could use DFA instead of backtracking where possible. This is the approach implemented in ANTLR, but I wanted to introduce them before I knew about existence of the latter, simply because this would likely produce the fastest parsers possible.
Jul 07 2012
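A tiny illustration of deciding between alternatives with a lookahead automaton instead of backtracking; the token kinds and the two rules are invented for the example, and a real LL(*) decision procedure would of course be generated from the grammar rather than hand-coded.

enum Tok { identifier, assign, semicolon, other }

// Decide between two alternatives at a statement boundary without backtracking:
//   Declaration: identifier identifier ';'    (e.g.  MyType x;)
//   Assignment : identifier '=' ...           (e.g.  x = 5;)
// A two-state automaton over the token kinds makes the decision in one pass,
// with no partial parse to undo.
bool looksLikeDeclaration(const Tok[] lookahead)
{
    // state 0: expecting the first identifier; state 1: expecting the second
    int state = 0;
    foreach (t; lookahead)
    {
        if (state == 0)
        {
            if (t != Tok.identifier) return false;
            state = 1;
        }
        else
        {
            return t == Tok.identifier;   // second identifier means declaration
        }
    }
    return false;
}

unittest
{
    assert( looksLikeDeclaration([Tok.identifier, Tok.identifier, Tok.semicolon]));
    assert(!looksLikeDeclaration([Tok.identifier, Tok.assign, Tok.other]));
}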
On 07/07/2012 01:01 PM, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:These were my thoughts exactly, although somewhat unsubstantiated in my case ;)Yeah, it's good to hear this notion reinforced. I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance. Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations. At least that's what makes sense to me.At the very least, we could use DFA instead of backtracking where possible. This is the approach implemented in ANTLR, but I wanted to introduce them before I knew about existence of the latter, simply because this would likely produce the fastest parsers possible.
Jul 07 2012
On 07-Jul-12 22:29, Chad J wrote:On 07/07/2012 01:01 PM, Roman D. Boiko wrote:Yup the nice thing about ANTLR is the usage of DFA. e.g. it's used for disambiguating alternatives that LL(k) could never do. -- Dmitry OlshanskyOn Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:These were my thoughts exactly, although somewhat unsubstantiated in my case ;)Yeah, it's good to hear this notion reinforced. I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance. Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations. At least that's what makes sense to me.At the very least, we could use DFA instead of backtracking where possible. This is the approach implemented in ANTLR, but I wanted to introduce them before I knew about existence of the latter, simply because this would likely produce the fastest parsers possible.
Jul 07 2012
On 7/7/12 1:01 PM, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 16:56:06 UTC, Chad J wrote:Doesn't ANTLR use full-fledged character-level LL(*) parsing even in the tokenizer? AndreiYeah, it's good to hear this notion reinforced. I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance. Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations. At least that's what makes sense to me.At the very least, we could use DFA instead of backtracking where possible. This is the approach implemented in ANTLR, but I wanted to introduce them before I knew about existence of the latter, simply because this would likely produce the fastest parsers possible.
Jul 07 2012
On Saturday, 7 July 2012 at 20:04:21 UTC, Andrei Alexandrescu wrote:Doesn't ANTLR use full-fledged character-level LL(*) parsing even in the tokenizer?Since I didn't understand your question I assume that my statement was somehow incorrect (likely because I made some wrong assumptions about ANTLR). I didn't know about its existence until today and still don't understand it completely. What I think I understood is that it uses DFA for deciding which grammar rule to apply instead of doing backtracking. I also think that it uses DFA for low-level scanning (I'm not sure). The idea to introduce DFA for both determining which rule to apply and lexing of terminal symbols appeared to me much earlier, and the suggestion to introduce them into Pegged is one of options which I think could extremely improve performance.
Jul 07 2012
Since I didn't understand your question I assume that my statement was somehow incorrect (likely because I made some wrong assumptions about ANTLR). I didn't know about its existence until today and still don't understand it completely. What I think I understood is that it uses DFA for deciding which grammar rule to apply instead of doing backtracking. I also think that it uses DFA for low-level scanning (I'm not sure).ANTLR 3 doesn't use a DFA unless it needs to. If unlimited lookahead is not called for, it uses standard LL(k) or perhaps it uses the modified (approximate? I forget the name) LL(k) from ANTLR 2. DFA comes into play, for instance, if you need to check what comes after an argument list (of, unlimited, length) before you can decide that it *is* an argument list and start the "real" parsing (The author says LL(k) is too inefficient so he used a restricted form of it; personally I'm not convinced, but I digress)
Jul 07 2012
On 7/7/12 4:15 PM, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 20:04:21 UTC, Andrei Alexandrescu wrote:This should help: http://www.antlr.org/wiki/display/~admin/ANTLR+v4+lexers AndreiDoesn't ANTLR use full-fledged character-level LL(*) parsing even in the tokenizer?Since I didn't understand your question I assume that my statement was somehow incorrect (likely because I made some wrong assumptions about ANTLR). I didn't know about its existence until today and still don't understand it completely. What I think I understood is that it uses DFA for deciding which grammar rule to apply instead of doing backtracking. I also think that it uses DFA for low-level scanning (I'm not sure).
Jul 07 2012
On Saturday, 7 July 2012 at 22:40:05 UTC, Andrei Alexandrescu wrote:http://www.antlr.org/wiki/display/~admin/ANTLR+v4+lexersThanks, nice article. Also found another post: http://www.antlr.org/wiki/display/~admin/2008/11/30/Example+tree+rewriting+with+patterns, which is related to some other discussion in this thread :)
Jul 07 2012
On 07-Jul-12 20:56, Chad J wrote:On 07/07/2012 12:37 PM, Roman D. Boiko wrote:Another idea that I've never realized is to add operator precedence grammar to pegged. Of course it should fit naturally with traditional PEG, for instance taking responsibility for parsing expressions. -- Dmitry OlshanskyOn Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:Yeah, it's good to hear this notion reinforced. I had this suspicion that the packrat parser is not necessarily the best/fastest solution, mostly because of the large allocation that has to happen before you get O(n) performance. Thus I figured that pegged might eventually use different parsing strategies underneath it all, possibly with a lot of special-casing and clever hand-tuned and profiled optimizations. At least that's what makes sense to me.Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together. Note that many PEG parsers do not rely on packrat (Pegged does not). There are a bunch of articles on Bryan Ford's website by a guy writting a PEG parser for Java, and who found that storing the last rules was enought to get a slight speed improvement, buth that doing anymore sotrage was detrimental to the parser's overall efficiency.That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development.
Jul 07 2012
On Saturday, 7 July 2012 at 19:29:47 UTC, Dmitry Olshansky wrote:Another idea that I've never realized is to add operator precedence grammar to pegged. Of course it should fit naturally with traditional PEG, for instance taking responsibility for parsing expressions.But that's already available by explicitly defining expression grammar via nested rules. See for example the examples/dgrammar.d in Pegged. This way, for example, multiplication has precedence over addition. (It looks like I misunderstood you. Did I?)
Jul 07 2012
On 07-Jul-12 23:35, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 19:29:47 UTC, Dmitry Olshansky wrote:Yes, I've meant something completely different ;) http://en.wikipedia.org/wiki/Operator-precedence_grammar -- Dmitry OlshanskyAnother idea that I've never realized is to add operator precedence grammar to pegged. Of course it should fit naturally with traditional PEG, for instance taking responsibility for parsing expressions.But that's already available by explicitly defining expression grammar via nested rules. See for example the examples/dgrammar.d in Pegged. This way, for example, multiplication has precedence over addition. (It looks like I misunderstood you. Did I?)
Jul 07 2012
On 07/07/2012 09:35 PM, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 19:29:47 UTC, Dmitry Olshansky wrote:http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_methodAnother idea that I've never realized is to add operator precedence grammar to pegged. Of course it should fit naturally with traditional PEG, for instance taking responsibility for parsing expressions.But that's already available by explicitly defining expression grammar via nested rules. See for example the examples/dgrammar.d in Pegged. This way, for example, multiplication has precedence over addition. (It looks like I misunderstood you. Did I?)
Jul 07 2012
On Saturday, 7 July 2012 at 19:50:37 UTC, Timon Gehr wrote:http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_methodOK, at least I didn't misunderstand. So my question was whether the alternative that I described and which exists in PEG is somehow worse than OPP. Wikipedia seems to provide an answer to that: OPP tend to be simpler. (I didn't investigate this topic further.)
Jul 07 2012
On Saturday, 7 July 2012 at 19:58:37 UTC, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 19:50:37 UTC, Timon Gehr wrote:OK, now I agree, the need to perform several nested calls in order to parse some expression is costly enough to consider OPP superior to a general PEG for expressions. At first I was surprised when found that D doesn't define operator precedence explicitly, but instead provides a hierarchy of expression types. Then I somehow concluded that the approaches are equivalent. (I started learning parsing techniques only in February '12.) Since then I never reconsidered my past conclusions.http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_methodOK, at least I didn't misunderstand. So my question was whether the alternative that I described and which exists in PEG is somehow worse than OPP. Wikipedia seems to provide an answer to that: OPP tend to be simpler. (I didn't investigate this topic further.)
Jul 07 2012
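For comparison with encoding precedence as nested rules, a precedence-climbing expression parser can be sketched in a few dozen lines of D. The grammar here is a toy (integers and four operators), not D's expression grammar, and all names are invented.

import std.ascii : isDigit;

// Precedence climbing over a toy grammar: integers combined with + - * /.
// A higher number means the operator binds tighter.
struct ExprParser
{
    string src;
    size_t pos;

    int parseExpression(int minPrec = 0)
    {
        int left = parsePrimary();
        while (pos < src.length)
        {
            immutable op = src[pos];
            immutable prec = precedenceOf(op);
            if (prec < minPrec) break;
            ++pos;                                       // consume the operator
            immutable right = parseExpression(prec + 1); // left-associative
            left = apply(op, left, right);
        }
        return left;
    }

    int parsePrimary()
    {
        int value = 0;
        while (pos < src.length && isDigit(src[pos]))
        {
            value = value * 10 + (src[pos] - '0');
            ++pos;
        }
        return value;
    }

    static int precedenceOf(char op)
    {
        if (op == '+' || op == '-') return 1;
        if (op == '*' || op == '/') return 2;
        return -1;                                       // not an operator
    }

    static int apply(char op, int a, int b)
    {
        switch (op)
        {
            case '+': return a + b;
            case '-': return a - b;
            case '*': return a * b;
            default:  return a / b;
        }
    }
}

unittest
{
    auto p = ExprParser("2+3*4-5");
    assert(p.parseExpression() == 9);
}

Each precedence level costs a loop iteration rather than a nested rule invocation, which is the usual argument for this scheme on expression-heavy input.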
On Saturday, July 07, 2012 18:37:54 Roman D. Boiko wrote:On Saturday, 7 July 2012 at 16:27:00 UTC, Philippe Sigaud wrote:I don't know about this particular case, because I haven't really looked at pegged, but in general, string parsing stuff should be taking ranges of dchar and then specializing on string type where appropriate for efficiency. - Jonathan M DavisI added dstrings because 1- at the time (a few months ago), the lists here were awash in UTF-32 discussions and I thought that'd be the way to go anyway 2- other D parsing libraries seemed to go to UTF32 also (CTPG) 3- I wanted to be able to parse mathematical notation like nabla, derivatives, etc. which all have UTF32 symbols.I propose to switch code to use S if(isSomeString!S) everywhere. Client code would first determine source encoding scheme, and then instantiate parsers specifying a string type. This is not a trivial change, but I'm willing to help implementing it.
Jul 07 2012
On Saturday, 7 July 2012 at 16:37:56 UTC, Roman D. Boiko wrote:One disadvantage of Packrat parsers I mentioned was problematic error recovery (according to the article from ANTLR website). After some additional research, I found that it is not a critical problem. To find the exact place of error (from parser's perspective, not user's) one only needs to remember the farthest successfully parsed position (among several backtracking attempts) and the reason that it failed. It is also possible to rerun parsing with some additional heuristics after failing, thus enabling advanced error repair scenarios. Since Pegged doesn't use Packrat algorithm, this solution might be either not relevant or not applicable, but I doubt that there will be any fundamental problem with error recovery. Unpleasant debugging experience, however, should be relevant for any parser that uses backtracking heavily.Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together. Note that many PEG parsers do not rely on packrat (Pegged does not). There are a bunch of articles on Bryan Ford's website by a guy writting a PEG parser for Java, and who found that storing the last rules was enought to get a slight speed improvement, buth that doing anymore sotrage was detrimental to the parser's overall efficiency.That's great! Anyway I want to understand the advantages and limitations of both Pegged and ANTLR, and probably study some more techniques. Such research consumes a lot of time but can be done incrementally along with development.
Jul 10 2012
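The bookkeeping for "remember the farthest failure" is small; something along the following lines (names invented) is enough to report a single useful error even after many alternatives have been tried and rolled back.

// Sketch of farthest-failure tracking for a backtracking parser: every failed
// match updates one record, and whichever failure got deepest into the input
// is the one reported to the user.
struct FarthestError
{
    size_t pos;        // how far we got before this expectation failed
    string expected;   // what the parser was looking for there

    void note(size_t failedAt, string what)
    {
        if (failedAt >= pos)
        {
            pos = failedAt;
            expected = what;
        }
    }
}

unittest
{
    FarthestError err;
    // Three alternatives failed at various positions; only the deepest one
    // (position 17) ends up in the report.
    err.note(4,  "')'");
    err.note(17, "';'");
    err.note(9,  "identifier");
    assert(err.pos == 17 && err.expected == "';'");
}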
Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko <rb d-coding.com> wrote:One disadvantage of Packrat parsers I mentioned was problematic error recovery (according to the article from ANTLR website). After some additional research, I found that it is not a critical problem. To find the exact place of error (from parser's perspective, not user's) one only needs to remember the farthest successfully parsed position (among several backtracking attempts) and the reason that it failed.IIRC, that's what I encoded in Pegged (admittedly limited) error reporting: remember the farthest error.It is also possible to rerun parsing with some additional heuristics after failing, thus enabling advanced error repair scenarios.Do people really want error-repairing parsers? I want my parsers to tell me something is bad, and, optionally, to advance a possible repair, but definitely *not* to automatically repair an inferred error and continue happily.
Jul 10 2012
On 07/10/2012 09:14 PM, Philippe Sigaud wrote:Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko<rb d-coding.com> wrote:FWIW, this is what most HTML parsers are doing.One disadvantage of Packrat parsers I mentioned was problematic error recovery (according to the article from ANTLR website). After some additional research, I found that it is not a critical problem. To find the exact place of error (from parser's perspective, not user's) one only needs to remember the farthest successfully parsed position (among several backtracking attempts) and the reason that it failed.IIRC, that's what I encoded in Pegged (admittedly limited) error reporting: remember the farthest error.It is also possible to rerun parsing with some additional heuristics after failing, thus enabling advanced error repair scenarios.Do people really what error-repairing parsers? I want my parsers to tell me something is bad, and, optionally to advance a possible repair, but definitely *not* to automatically repair a inferred error and continue happily.
Jul 10 2012
On Tue, Jul 10, 2012 at 9:25 PM, Timon Gehr <timon.gehr gmx.ch> wrote:Ah, right. I can get it for HTML/XML. JSON also, maybe. I was thinking of parsing a programming language (C, D, etc) Consider me half-convinced :)Do people really what error-repairing parsers? I want my parsers to tell me something is bad, and, optionally to advance a possible repair, but definitely *not* to automatically repair a inferred error and continue happily.FWIW, this is what most HTML parsers are doing.
Jul 10 2012
On Tuesday, 10 July 2012 at 19:41:29 UTC, Philippe Sigaud wrote:On Tue, Jul 10, 2012 at 9:25 PM, Timon Gehr <timon.gehr gmx.ch> wrote:It would still generate errors. But would enable a lot of useful functionality: autocompletion, refactoring, symbol documentation in a tooltip, displaying method overloads with parameters as-you-type, go to definition, etc.Ah, right. I can get it for HTML/XML. JSON also, maybe. I was thinking of parsing a programming language (C, D, etc) Consider me half-convinced :)Do people really what error-repairing parsers? I want my parsers to tell me something is bad, and, optionally to advance a possible repair, but definitely *not* to automatically repair a inferred error and continue happily.FWIW, this is what most HTML parsers are doing.
Jul 10 2012
On Tuesday, July 10, 2012 21:25:52 Timon Gehr wrote:On 07/10/2012 09:14 PM, Philippe Sigaud wrote:Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko<rb d-coding.com> wrote:FWIW, this is what most HTML parsers are doing.One disadvantage of Packrat parsers I mentioned was problematic error recovery (according to the article from ANTLR website). After some additional research, I found that it is not a critical problem. To find the exact place of error (from parser's perspective, not user's) one only needs to remember the farthest successfully parsed position (among several backtracking attempts) and the reason that it failed.IIRC, that's what I encoded in Pegged (admittedly limited) error reporting: remember the farthest error.It is also possible to rerun parsing with some additional heuristics after failing, thus enabling advanced error repair scenarios.Do people really what error-repairing parsers? I want my parsers to tell me something is bad, and, optionally to advance a possible repair, but definitely *not* to automatically repair a inferred error and continue happily.Which is horrible. You pretty much have to with HTML because of the horrid decision that it should be parsed so laxly by browsers, but pretty much nothing else should do that. Either it's correct or it's not. Having the compiler "fix" your code would cause far more problems than it would ever fix. - Jonathan M Davis
Jul 10 2012
On Tuesday, 10 July 2012 at 20:25:12 UTC, Jonathan M Davis wrote:On Tuesday, July 10, 2012 21:25:52 Timon Gehr wrote:Not having control over parser or source code causes problems. Ability to deliver useful functionality (see my post above) is a different use case.FWIW, this is what most HTML parsers are doing.Which is horrible. You pretty much have to with HTML because of the horrid decision that it should be parsed so laxly by browsers, but pretty much nothing else should do that. Either it's correct or it's not. Having the compiler "fix" your code would cause far more problems that it would ever fix.
Jul 10 2012
On 2012-07-10 22:25, Jonathan M Davis wrote:Which is horrible. You pretty much have to with HTML because of the horrid decision that it should be parsed so laxly by browsers, but pretty much nothing else should do that. Either it's correct or it's not. Having the compiler "fix" your code would cause far more problems that it would ever fix.I'm not sure but I think he was referring to a kind of error reporting technique used by compilers. Example:

int foo ()
{
    int a = 3 // note the missing semicolon
    return a;
}

Instead of the parser going completely mad because of the missing semicolon, it will basically insert a semicolon, report the error and then happily continue parsing. I think this will make it easier to find later errors and less likely to report incorrect errors due to a previous error. -- /Jacob Carlborg
Jul 10 2012
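In a hand-written recursive-descent parser this strategy usually lives in the "expect a token" helper: report the missing token, pretend it was there, and skip ahead to a synchronizing token such as ';' or '}' so that later errors stay meaningful. A hypothetical sketch, not any existing compiler's code:

enum Tok { semicolon, rbrace, identifier, other, eof }

struct Recovery
{
    Tok[] tokens;
    size_t pos;
    string[] errors;

    Tok current() const { return pos < tokens.length ? tokens[pos] : Tok.eof; }

    // Expect `kind`; if it is missing, report it, act as if it were there,
    // and resynchronize at the next ';' or '}' so later errors stay useful.
    void expect(Tok kind)
    {
        if (current == kind)
        {
            ++pos;
            return;
        }
        errors ~= "missing token, assumed present";   // report, don't abort
        while (current != Tok.semicolon && current != Tok.rbrace
               && current != Tok.eof)
            ++pos;                                    // skip to a sync point
    }
}

unittest
{
    // identifier identifier }   (the ';' after the first statement is missing)
    auto p = Recovery([Tok.identifier, Tok.identifier, Tok.rbrace]);
    p.expect(Tok.identifier);
    p.expect(Tok.semicolon);        // reported, parsing continues at '}'
    assert(p.errors.length == 1 && p.current == Tok.rbrace);
}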
On Tuesday, July 10, 2012 22:40:17 Jacob Carlborg wrote:On 2012-07-10 22:25, Jonathan M Davis wrote:Well, giving an error, continuing to parse, and giving a partial result can be useful (and you give a prime example of that), but "fixing" the problem (e.g by inserting the semicolon) and not considering it to be an error would be a _huge_ mistake IMHO. And that's essentially what happens with HTML. - Jonathan M DavisWhich is horrible. You pretty much have to with HTML because of the horrid decision that it should be parsed so laxly by browsers, but pretty much nothing else should do that. Either it's correct or it's not. Having the compiler "fix" your code would cause far more problems that it would ever fix.I'm not sure but I think he was referring to a kind of error reporting technique used by compilers. Example: int foo () { int a = 3 // note the missing semicolon return a; } Instead of the parser going completely mad because of the missing semicolon. It will basically insert a semicolon, report the error and then happily continue parsing. I think this will make it easier to find later errors and less likely to report incorrect errors due to a previous error.
Jul 10 2012
On 07/10/2012 10:53 PM, Jonathan M Davis wrote:On Tuesday, July 10, 2012 22:40:17 Jacob Carlborg wrote:This is actually precisely what many of the more recent curly-brace- and-semicolon languages have been doing with regard to semicolons.On 2012-07-10 22:25, Jonathan M Davis wrote:Well, giving an error, continuing to parse, and giving a partial result can be useful (and you give a prime example of that), but "fixing" the problem (e.g by inserting the semicolon) and not considering it to be an error would be a _huge_ mistake IMHO.Which is horrible. You pretty much have to with HTML because of the horrid decision that it should be parsed so laxly by browsers, but pretty much nothing else should do that. Either it's correct or it's not. Having the compiler "fix" your code would cause far more problems that it would ever fix.I'm not sure but I think he was referring to a kind of error reporting technique used by compilers. Example: int foo () { int a = 3 // note the missing semicolon return a; } Instead of the parser going completely mad because of the missing semicolon. It will basically insert a semicolon, report the error and then happily continue parsing. I think this will make it easier to find later errors and less likely to report incorrect errors due to a previous error.
Jul 10 2012
On 2012-07-10 22:53, Jonathan M Davis wrote:Well, giving an error, continuing to parse, and giving a partial result can be useful (and you give a prime example of that), but "fixing" the problem (e.g. by inserting the semicolon) and not considering it to be an error would be a _huge_ mistake IMHO. And that's essentially what happens with HTML.No, that is _not_ what happens with HTML. With HTML, the browser _does not_ output the error and continues as if it was valid code. As far as I know, up until HTML 5, the spec hasn't mentioned what should happen with invalid code. This is just an error handling strategy that is an implementation detail. It will not change what is and what isn't valid code. Would you prefer getting just the first error when compiling? Fix the error, compile again, get a new error and so on. -- /Jacob Carlborg
Jul 10 2012
On Wednesday, July 11, 2012 08:41:53 Jacob Carlborg wrote:On 2012-07-10 22:53, Jonathan M Davis wrote:??? I guess that I wasn't clear. I mean that with HTML, it ignores errors. The browser doesn't spit out errors. It just guesses at what you really meant and displays that. It "fixes" the error for you, which is a horrible design IMHO. Obviously, we're stuck with it for HTML, but it should not be replicated with anything else. This is in contrast to your example of outputting an error and continuing to parse as best it can in order to provide more detail and more error messages but _not_ ultimately considering the parsing successful. _That_ is useful. HTML's behavior is not. - Jonathan M DavisWell, giving an error, continuing to parse, and giving a partial result can be useful (and you give a prime example of that), but "fixing" the problem (e.g by inserting the semicolon) and not considering it to be an error would be a _huge_ mistake IMHO. And that's essentially what happens with HTML.No, that is _not_ what happens with HTML. With HTML, the browser _do not_ output the error and continues as if it was valid could. As far as I know, up until HTML 5, the spec hasn't mentioned what should happen with invalid code. This is just a error handling strategy that is an implementation detail. It will not change what is and what isn't valid code. Are you preferring getting just the first error when compiling? Fix the error, compile again, get a new error and so on.
Jul 10 2012
On 2012-07-11 08:52, Jonathan M Davis wrote:??? I guess that I wasn't clear. I mean that with HTML, it ignores errors. The browser doesn't spit out errors. It just guesses at what you really meant and displays that. It "fixes" the error for you, which is a horrible design IMHO. Obviously, we're stuck with it for HTML, but it should not be replicated with anything else. This is in contrast to your example of outputting an error and continuing to parse as best it can in order to provide more detail and more error messages but _not_ ultimately considering the parsing successful. _That_ is useful. HTML's behavior is not.Ok, I see. It seems we're meaning the same thing. -- /Jacob Carlborg
Jul 11 2012
On 11-Jul-12 00:25, Jonathan M Davis wrote:On Tuesday, July 10, 2012 21:25:52 Timon Gehr wrote:BTW clang does this and even more at the semantic level. It's known to have won legions of users because of that (well, not only that, but good diagnostics in general). -- Dmitry OlshanskyOn 07/10/2012 09:14 PM, Philippe Sigaud wrote:Which is horrible. You pretty much have to with HTML because of the horrid decision that it should be parsed so laxly by browsers, but pretty much nothing else should do that. Either it's correct or it's not. Having the compiler "fix" your code would cause far more problems than it would ever fix.Tue, Jul 10, 2012 at 12:41 PM, Roman D. Boiko<rb d-coding.com> wrote:FWIW, this is what most HTML parsers are doing.One disadvantage of Packrat parsers I mentioned was problematic error recovery (according to the article from ANTLR website). After some additional research, I found that it is not a critical problem. To find the exact place of error (from parser's perspective, not user's) one only needs to remember the farthest successfully parsed position (among several backtracking attempts) and the reason that it failed.IIRC, that's what I encoded in Pegged (admittedly limited) error reporting: remember the farthest error.It is also possible to rerun parsing with some additional heuristics after failing, thus enabling advanced error repair scenarios.Do people really want error-repairing parsers? I want my parsers to tell me something is bad, and, optionally to advance a possible repair, but definitely *not* to automatically repair an inferred error and continue happily.
Jul 10 2012
Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together.got disillusioned because nobody was interested in fixing the bugs in it (ANTLR's author is a Java guy first and foremost) and the source code of the required libraries didn't have source code or a license (wtf.) So, for awhile I was thinking about how I might make my own parser generator that was "better" than ANTLR. I liked the syntax of PEG descriptions, but I was concerned about the performance hit of packrat and, besides, I already liked the syntax and flexibility of ANTLR. So my idea was to make something that was LL(k) and mixed the syntax of ANTLR and PEG while using more sane (IMO) semantics than ANTLR did at the time (I've no idea if ANTLR 3 still uses the same semantics today...) All of this is 'water under the bridge' now, but I hand-wrote a lexer to help me plan out how my parser-generator would produce code. The output code was to be both more efficient and significantly more readable than ANTLR's output. I didn't get around to writing the parser-generator itself but I'll have a look back at my handmade lexer for inspiration.I don't like the sound of this either. Even if PEGs were fast, difficulty in debugging, error handling, etc. would give me pause. I insist on well-rounded tools. For example, even though LALR(1) may be the fastest type of parser (is it?), I prefer not to use it due to its inflexibility (it just doesn't like some reasonable grammars), and the fact that the generated code is totally unreadable and hard to debug (mind you, when I learned LALR in school I found that it is possible to visualize how it works in a pretty intuitive way--but debuggers won't do that for you.) While PEGs are clearly far more flexible than LALR and probably more flexible than LL(k), I am a big fan of old-fashioned recursive descent because it's very flexible (easy to insert actions during parsing, and it's possible to use custom parsing code in certain places, if necessary*) and the parser generator's output is potentially very straightforward to understand and debug. In my mind, the main reason you want to use a parser generator instead of hand-coding is convenience, e.g. (1) to compress the grammar down so you can see it clearly, (2) have the PG compute the first-sets and follow-sets for you, (3) get reasonably automatic error handling. * (If the language you want to parse is well-designed, you'll probably not need much custom parsing. But it's a nice thing to offer in a general-purpose parser generator.) I'm not totally sure yet how to support good error messages, efficiency and straightforward output at the same time, but by the power of D I'm sure I could think of something... I would like to submit another approach to parsing that I dare say is my favorite, even though I have hardly used it at all yet. ANTLR offers something called "tree parsing" that is extremely cool. It parses trees instead of linear token streams, and produces other trees as output. I don't have a good sense of how tree parsing works, but I think that some kind of tree-based parser generator could become the basis for a very flexible and easy-to-understand D front-end. If a PG operates on trees instead of linear token streams, I have a sneaky suspicion that it could revolutionize how a compiler front-end works. Why? 
because right now parsers operate just once, on the user's input, and from there you manipulate the AST with "ordinary" code. But if you have a tree parser, you can routinely manipulate and transform parts of the tree with a sequence of independent parsers and grammars. Thus, parsers would replace a lot of things for which you would otherwise use a visitor pattern, or something. I think I'll try to sketch out this idea in more detail later.However, as I found a few hours ago, Packrat parsing (typically used to handle PEG) has serious disadvantages: it complicates debugging because of frequent backtracking, it has problems with error recovery, and typically disallows to add actions with side effects (because of possibility of backtracking). These are important enough to reconsider my plans of using Pegged. I will try to analyze whether the issues are so fundamental that I (or somebody else) will have to create an ANTLR-like parser instead, or whether it is possible to introduce changes into Pegged that would fix these problems.
Jul 07 2012
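As a tiny illustration of the "recursive descent with inline actions" point above, a hand-written D sketch for a toy grammar (nothing to do with the actual D grammar):

import std.ascii : isDigit;
import std.conv : to;
import std.stdio;

// Hand-written recursive descent for:  Expr -> Term ('+' Term)*
// Semantic actions (here: evaluation) are plain D code inserted mid-rule,
// which is exactly the flexibility being praised above.
struct Calc
{
    string src;
    size_t pos;

    int expr()
    {
        int value = term();          // action: seed the result
        while (pos < src.length && src[pos] == '+')
        {
            ++pos;                   // consume '+'
            value += term();         // action: fold the next term in
        }
        return value;
    }

    int term()
    {
        size_t start = pos;
        while (pos < src.length && src[pos].isDigit)
            ++pos;
        return src[start .. pos].to!int;  // action: convert the lexeme
    }
}

void main()
{
    writeln(Calc("12+30+0").expr()); // 42
}

The generated output of a parser generator can look very much like this, which is what makes it straightforward to read and debug.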
On 07/07/2012 02:23 PM, David Piepgrass wrote:I was thinking the same thing. My intent is to create a kind of regular-expression-of-nodes with push/pop operators to recognize ascent and descent on the tree. Such a regular expression would allow one to capture subtrees out of generalized patterns and then place them into new trees that then become the input for the next pattern or set of patterns. I think this is much closer to how I conceptualize semantic analysis than how semantic analysis is done in front ends like DMD: it should be done with pattern recognition and substitution, not with myriad nested if-statements and while-loops. My vision is to have code similar to this in the front-end:

/+
Lower
    while ( boolExpr )
    {
        statements;
    }
Into
    loopAgain:
    if ( !boolExpr )
        goto exitLoop
    statements;
    goto loopAgain
    exitLoop:
+/
void lowerWhileStatement( SyntaxElement* syntaxNode )
{
    auto captures = syntaxNode.matchNodes(
        TOK_WHILE_NODE,
        OP_ENTER_NODE,
            OP_CAPTURE(0),
            OP_BEGIN,
                TOK_EXPRESSION,
            OP_END,
            OP_CAPTURE(1),
            OP_BEGIN,
                TOK_STATEMENT,
            OP_END,
        OP_LEAVE_NODE);

    if ( captures is null )
        return;

    syntaxNode.replaceWith(
        LabelNode("loopAgain"),
        TOK_IF_STATEMENT, OP_INSERT,
        OP_BEGIN,
            TOK_NEGATE, OP_INSERT,
            OP_BEGIN,
                captures[0], // Expression
            OP_END,
            GotoStatement("exitLoop"),
        OP_END,
        captures[1], // statements
        GotoStatement("loopAgain"),
        LabelNode("exitLoop")
        );
}

The specifics will easily change. One problem with the above code is that it could probably stand to use more templates and compile-time action to allow patterns happening in the same pass to be merged together into one expression, thus preventing any unnecessary rescanning. It all becomes DFAs or DPDAs operating on syntax trees. In this vision I do not use classes and inheritance for my AST. Instead I use structs that contain some kind of nodeType member that would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE in the above code. Dynamic dispatch is instead performed by (very fast) DFAs recognizing parts of the AST. This kind of architecture leads to other interesting benefits, like being able to assert which symbols a pattern is designed to handle or which symbols are allowed to exist in the AST at any point in time. Thus if you write a lowering that introduces nodes that a later pass can't handle, you'll know very quickly, at least in principle. I wanted to make such a front-end so that I could easily make a C backend. I believe such a compiler would be able to do that with great ease. I really want a D compiler that can output ANSI C code that can be used with few or no OS/CPU dependencies. I would be willing to lose a lot of the nifty parallelism/concurrency stuff and deal with reference counting instead of full garbage collection, as long as it lets me EASILY target new systems (any phone, console platform, and some embedded microcontrollers). Then what I have is something that's as ubiquitous as C, but adds a lot of useful features like exception handling, dynamic arrays, templates, CTFE, etc etc. My ideas for how to deal with ASTs in pattern recognition and substitution followed from this. Needing to use D in places where it isn't available is a real pain-point for me right now, and I'll probably find ways to spend time on it eventually.Note that PEG does not impose to use packrat parsing, even though it was developed to use it. 
I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together.disillusioned because nobody was interested in fixing the bugs in it (ANTLR's author is a Java guy first and foremost) and the source code of the required libraries didn't have source code or a license (wtf.) So, for awhile I was thinking about how I might make my own parser generator that was "better" than ANTLR. I liked the syntax of PEG descriptions, but I was concerned about the performance hit of packrat and, besides, I already liked the syntax and flexibility of ANTLR. So my idea was to make something that was LL(k) and mixed the syntax of ANTLR and PEG while using more sane (IMO) semantics than ANTLR did at the time (I've no idea if ANTLR 3 still uses the same semantics today...) All of this is 'water under the bridge' now, but I hand-wrote a lexer to help me plan out how my parser-generator would produce code. The output code was to be both more efficient and significantly more readable than ANTLR's output. I didn't get around to writing the parser-generator itself but I'll have a look back at my handmade lexer for inspiration.I don't like the sound of this either. Even if PEGs were fast, difficulty in debugging, error handling, etc. would give me pause. I insist on well-rounded tools. For example, even though LALR(1) may be the fastest type of parser (is it?), I prefer not to use it due to its inflexibility (it just doesn't like some reasonable grammars), and the fact that the generated code is totally unreadable and hard to debug (mind you, when I learned LALR in school I found that it is possible to visualize how it works in a pretty intuitive way--but debuggers won't do that for you.) While PEGs are clearly far more flexible than LALR and probably more flexible than LL(k), I am a big fan of old-fashioned recursive descent because it's very flexible (easy to insert actions during parsing, and it's possible to use custom parsing code in certain places, if necessary*) and the parser generator's output is potentially very straightforward to understand and debug. In my mind, the main reason you want to use a parser generator instead of hand-coding is convenience, e.g. (1) to compress the grammar down so you can see it clearly, (2) have the PG compute the first-sets and follow-sets for you, (3) get reasonably automatic error handling. * (If the language you want to parse is well-designed, you'll probably not need much custom parsing. But it's a nice thing to offer in a general-purpose parser generator.) I'm not totally sure yet how to support good error messages, efficiency and straightforward output at the same time, but by the power of D I'm sure I could think of something... I would like to submit another approach to parsing that I dare say is my favorite, even though I have hardly used it at all yet. ANTLR offers something called "tree parsing" that is extremely cool. It parses trees instead of linear token streams, and produces other trees as output. I don't have a good sense of how tree parsing works, but I think that some kind of tree-based parser generator could become the basis for a very flexible and easy-to-understand D front-end. If a PG operates on trees instead of linear token streams, I have a sneaky suspicion that it could revolutionize how a compiler front-end works. Why? because right now parsers operate just once, on the user's input, and from there you manipulate the AST with "ordinary" code. 
But if you have a tree parser, you can routinely manipulate and transform parts of the tree with a sequence of independent parsers and grammars. Thus, parsers would replace a lot of things for which you would otherwise use a visitor pattern, or something. I think I'll try to sketch out this idea in more detail later.However, as I found a few hours ago, Packrat parsing (typically used to handle PEG) has serious disadvantages: it complicates debugging because of frequent backtracking, it has problems with error recovery, and typically disallows to add actions with side effects (because of possibility of backtracking). These are important enough to reconsider my plans of using Pegged. I will try to analyze whether the issues are so fundamental that I (or somebody else) will have to create an ANTLR-like parser instead, or whether it is possible to introduce changes into Pegged that would fix these problems.
Jul 07 2012
On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:I was thinking the same thing. My intent is to create a kind of regular-expression-of-nodes with push/pop operators to recognize ascent and descent on the tree. Such a regular expression would allow one to capture subtrees out of generalized patterns and then place them into new trees that then become the input for the next pattern or set of patterns. I think this is much closer to how I conceptualize semantic analysis than how semantic analysis is done in front ends like DMD: it should to be done with pattern recognition and substitution, not with myriad nested if-statements and while-loops.Funny, we've discussed an idea to introduce some hybrid of regex and xpath for querying, searching and traversing AST with Dmitry a few weeks ago. A custom NDepend-like Code Query Language was the major alternative we considered. Discussion started on this forum and continued via email.In this vision I do not use classes and inheritance for my AST. Instead I use structs that contain some kind of nodeType member that would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE in the above code. Dynamic dispatch is instead performed by (very fast) DFAs recognizing parts of the AST.Exactly. This idea first came to me in April after I implemented the first top-down recursive descent custom parser for a D subset. I tried Visitor pattern before that and wasn't happy. There are some subtle difficulties which I believe will be possible to overcome, most important being the need to introduce a mechanism for hierarchical classification (like a pow expression being an assign expression at the same time).
Jul 07 2012
On 07/07/2012 03:13 PM, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:From some of my earlier scribblings: enum : SyntaxElement { AST_EXPRESSION = 0x0001_0000_0000_0000, AST_UNARY_EXPR = 0x0000_0001_0000_0000 | AST_EXPRESSION, AST_NEGATE_EXPR = 0x0000_0000_0001_0000 | AST_UNARY_EXPR, AST_COMPLIMENT_EXPR = 0x0000_0000_0002_0000 | AST_UNARY_EXPR, AST_POST_ADD_EXPR = 0x0000_0000_0003_0000 | AST_UNARY_EXPR, AST_POST_SUB_EXPR = 0x0000_0000_0004_0000 | AST_UNARY_EXPR, AST_PRE_ADD_EXPR = 0x0000_0000_0005_0000 | AST_UNARY_EXPR, AST_PRE_SUB_EXPR = 0x0000_0000_0006_0000 | AST_UNARY_EXPR, AST_BINARY_EXPR = 0x0000_0002_0000_0000 | AST_EXPRESSION, AST_AND_EXPR = 0x0000_0000_0001_0000 | AST_BINARY_EXPR, AST_OR_EXPR = 0x0000_0000_0002_0000 | AST_BINARY_EXPR, AST_XOR_EXPR = 0x0000_0000_0003_0000 | AST_BINARY_EXPR, AST_AND_AND_EXPR = 0x0000_0000_0004_0000 | AST_BINARY_EXPR, AST_OR_OR_EXPR = 0x0000_0000_0005_0000 | AST_BINARY_EXPR, AST_ADD_EXPR = 0x0000_0000_0006_0000 | AST_BINARY_EXPR, AST_TRINARY_EXPR = 0x0000_0003_0000_0000 | AST_EXPRESSION, AST_NARY_EXPR = 0x0000_0004_0000_0000 | AST_EXPRESSION, AST_STATEMENT = 0x0002_0000_0000_0000, } bool isA( SyntaxElement leafier, SyntaxElement rootier ) { SyntaxElement mask = 0; if ( rootier & 0x0000_0000_FFFF_FFFF ) { if ( rootier & 0x0000_0000_0000_FFFF ) mask = 0xFFFF_FFFF_FFFF_FFFF; else mask = 0xFFFF_FFFF_FFFF_0000; } else { if ( rootier & 0x0000_FFFF_FFFF_FFFF ) mask = 0xFFFF_FFFF_0000_0000; else mask = 0xFFFF_0000_0000_0000; } return (leafier & mask) == rootier; } unittest { assert( isA( AST_EXPRESSION, AST_EXPRESSION) ); assert( isA( AST_NEGATE_EXPR, AST_NEGATE_EXPR) ); assert( isA( AST_NEGATE_EXPR, AST_EXPRESSION) ); assert( isA( AST_NEGATE_EXPR, AST_UNARY_EXPR) ); assert( !isA( AST_EXPRESSION, AST_STATEMENT) ); assert( !isA( AST_NEGATE_EXPR, AST_BINARY_EXPR) ); assert( !isA( AST_NEGATE_EXPR, AST_STATEMENT) ); assert( !isA( AST_NEGATE_EXPR, AST_COMPLIMENT_EXPR) ); assert(false); } This approach of course has shameful nesting limitations, but I feel like these determinations could be fairly well optimized even for the general case. For example: another approach that I might be more inclined to take is to give each token/symbol a low-valued index into a small inheritance table. I would expect the regex engine to call the isA function as one of it's operations. Thus placing an AST_EXPRESSION into your expression would also match an AST_NEGATE_EXPR too.In this vision I do not use classes and inheritance for my AST. Instead I use structs that contain some kind of nodeType member that would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE in the above code. Dynamic dispatch is instead performed by (very fast) DFAs recognizing parts of the AST.Exactly. This idea first came to me in April after I implemented the first top-down recursive descent custom parser for a D subset. I tried Visitor pattern before that and wasn't happy. There are some subtle difficulties which I believe will be possible to overcome, most important being the need to introduce a mechanism for hierarchical classification (like a pow expression being an assign expression at the same time).
Jul 07 2012
On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:enum : SyntaxElement { AST_EXPRESSION = 0x0001_0000_0000_0000, AST_UNARY_EXPR = 0x0000_0001_0000_0000 |This would cause wasting space (probably a lot). Definitely it would not be easy to implement in a parser generator, when various language properties are not known beforehand for fine-grained tuning.This approach of course has shameful nesting limitations, but I feel like these determinations could be fairly well optimized even for the general case. For example: another approach that I might be more inclined to take is to give each token/symbol a low-valued index into a small inheritance table.Depending on implementation, that might introduce the multiplier overhead of table access per each comparison (and there would be many in case of searching for nodes of specific type).I would expect the regex engine to call the isA function as one of it's operations. Thus placing an AST_EXPRESSION into your expression would also match an AST_NEGATE_EXPR too.But actually it is not so difficult to implement in a very similar way to what you described. I was thinking about a lookup table, but different from a traditional inheritance table. It would be indexed by AST node type (integral enum value), and store various classification information as bits. Maybe this is what you meant and I misunderstood you... Example is here: https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d (sorry, it doesn't show how to do classification, and has a different context, but I hope you get the idea). The advantage over storing hierarchical information directly in each token is obviously memory usage.
Jul 07 2012
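A rough D sketch of the per-node-type lookup table idea from the post above; every name here is hypothetical and the encoding is only one of many possible:

enum NodeKind : ubyte
{
    addExpr, negateExpr, assignExpr, powExpr,
    ifStatement, whileStatement,
}

// Classification flags, stored once per node *type*, not per node instance.
enum Class : ubyte
{
    expression = 1 << 0,
    binaryExpr = 1 << 1,
    unaryExpr  = 1 << 2,
    assignExpr = 1 << 3,  // e.g. "a pow expression is also an assign expression"
    statement  = 1 << 4,
}

immutable uint[NodeKind.max + 1] classOf =
[
    NodeKind.addExpr:        Class.expression | Class.binaryExpr,
    NodeKind.negateExpr:     Class.expression | Class.unaryExpr,
    NodeKind.assignExpr:     Class.expression | Class.binaryExpr | Class.assignExpr,
    NodeKind.powExpr:        Class.expression | Class.binaryExpr | Class.assignExpr,
    NodeKind.ifStatement:    Class.statement,
    NodeKind.whileStatement: Class.statement,
];

bool isA(NodeKind kind, Class c) pure nothrow @nogc
{
    return (classOf[kind] & c) == c;
}

unittest
{
    static assert(NodeKind.sizeof == 1); // per-instance cost stays one byte
    assert(isA(NodeKind.powExpr, Class.expression));
    assert(isA(NodeKind.powExpr, Class.assignExpr));
    assert(!isA(NodeKind.ifStatement, Class.expression));
}

Each node instance only carries its one-byte kind; the classification bits live in the table, paid for once per kind, which is the memory advantage being discussed.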
On 07/07/2012 06:18 PM, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:I see what you mean. I'm not sure that I buy that language properties are known before-hand. The front-end knows what the language grammar looks like and it knows what kind of things it can find in an AST. Thus you can make the regex DSL do this transformation and let the DFAs handle everything: expr -> (expr | unary_expr | negate_expr | binary_compliment | incr_expr | decr_expr | binary_expr | assign_expr | add_assign_expr | nary_expr | ...) Because what we really want it to do is match any of those expression kinds when we place the expression symbol in our regex. I think the important point here though is that inheritance can be reduced to bit-twiddling optimizations in this case.enum : SyntaxElement { AST_EXPRESSION = 0x0001_0000_0000_0000, AST_UNARY_EXPR = 0x0000_0001_0000_0000 |This would cause wasting space (probably a lot). Definitely it would not be easy to implement in a parser generator, when various language properties are not known beforehand for fine-grained tuning.This approach of course has shameful nesting limitations, but I feel like these determinations could be fairly well optimized even for the general case. For example: another approach that I might be more inclined to take is to give each token/symbol a low-valued index into a small inheritance table.Depending on implementation, that might introduce the multiplier overhead of table access per each comparison (and there would be many in case of searching for nodes of specific type).I would expect the regex engine to call the isA function as one of it's operations. Thus placing an AST_EXPRESSION into your expression would also match an AST_NEGATE_EXPR too.But actually it is not so difficult to implement in a very similar way to what you described. I was thinking about a lookup table, but different from a traditional inheritance table. It would be indexed by AST node type (integral enum value), and store various classification information as bits. Maybe this is what you meant and I misunderstood you... Example is here: https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d (sorry, it doesn't show how to do classification, and has a different context, but I hope you get the idea). The advantage over storing hierarchical information directly in each token is obviously memory usage.
Jul 07 2012
On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote:I see what you mean.And I think that I expressed myself badly. Let me rephrase. When the inheritance hierarchy is deep, every level would require at least one bit position. Or even every base class would require a separate bit. (I think that the former + several bits to distinguish among hierarchies.) Otherwise it would not be easy to check by a bit-mask. Even if the above is incorrect (and that is likely since I didn't try to encode that compactly for the real grammar), I think that in general that information would only be possible to store in a fairly large integral. Especially if we try to generate the parser from the grammar, and thus can't do fine-tuning to pack the information tightly. This overhead would be paid for each AST node __instance__. But that is not necessary, since we could store information in a lookup table only once per node __type__.
Jul 07 2012
On 07/07/2012 07:04 PM, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 22:45:01 UTC, Chad J wrote:Yep. Makes sense.I see what you mean.And I think that I expressed myself badly. Let me rephrase. When the memory hierarchy is deep, every level would require at least one bit position. Or even every base class would require a separate bit. (I think that the former + several bits to distinguish among hierarchies.) Otherwise it would not be easy to check by a bit-mask. Even if the above is incorrect (and that is likely since I didn't try to encode that compactly for the real grammar), I think that in general that information would only be possible to store in a fairly large integral. Especially if we try to generate parser from grammar, and thus can't do fine-tuning to pack the information tightly. This overhead would be paid per each AST node __instance__. But that is not necessary, since we could store information in a lookup table only once per node __type__.
Jul 07 2012
auto captures = syntaxNode.matchNodes( TOK_WHILE_NODE, OP_ENTER_NODE, OP_CAPTURE(0), OP_BEGIN, TOK_EXPRESSION, OP_END, OP_CAPTURE(1), OP_BEGIN, TOK_STATEMENT, OP_END, OP_LEAVE_NODE);I'm glad to hear you like the tree-parsing approach, Chad, although the particular syntax here looks pretty unfriendly :O -- does this represent something that you are working on right now?This kind of architecture leads to other interesting benefits, like being able to assert which symbols a pattern is designed to handle or which symbols are allowed to exist in the AST at any point in time. Thus if you write a lowering that introduces nodes that a later pass can't handle, you'll know very quickly, at least in principle. I wanted to make such a front-end so that I could easily make a C backend. I believe such a compiler would be able to do that with great ease. I really want a D compiler that can output ANSI C code that can be used with few or no OS/CPU dependencies. I would be willing to lose a lot of the nifty parallelism/concurrency stuff and deal with reference counting instead of full garbage collection, as long as it lets me EASILY target new systems (any phone, console platform, and some embedded microcontrollers). Then what I have is something that's as ubiquitous as C, but adds a lot of useful features like exception handling, dynamic arrays, templates, CTFE, etc etc. My ideas for how to deal with ASTs in pattern recognition and substitution followed from this.I tend to agree that it would be better to have a "general" node class with the node type as a property rather than a subtype and rather than a myriad of independent types, although in the past I haven't been able to figure out how to make this approach simultaneously general, efficient, and easy to use. I'm kind of a perfectionist which perhaps holds me back sometimes :) I'd like to add that if we give tree parsing first-class treatment, I believe the most logical approach to parsing has three or more stages instead of the traditional two (lex+parse): 1. Lexer 2. Tree-ification 3. Parsing to AST (which may itself use multiple stages, e.g. parse the declarations first, then parse function bodies later) The new stage two simply groups things that are in parenthesis and braces. So an input stream such as the following: A man (from a [very ugly] house in the suburbs) was quoted as saying { I saw Batman (and Robin) last night! } Is converted to a tree where everything parenthesized or braced gets to be a child: A man ( from a [ very ugly ] house in the suburbs ) was quoted as saying { ... } Some of the things I like about this approach are: 1. It's language-agnostic. Lots of languages and DSLs could re-use exactly the same code from stage 2. (Stage 1, also, is fairly similar between languages and I wonder if a parameterized standard lexer is a worthwhile pursuit.) 2. It mostly eliminates the need for arbitrary-length lookahead for things like D's template_functions(...)(...). Of course, the tokens will almost always end up getting scanned twice, but hey, at least you know you won't need to scan them more than twice, right? (er, of course the semantic analysis will scan it several times anyway. Maybe this point is moot.) 3. It is very efficient for tools that don't need to examine function bodies. Such tools can easily leave out that part of the parser simply by not invoking the function-body sub-parser. 4. It leaves open the door to supporting embedded DSLs in the future. 
It's trivial to just ignore a block of text in braces and hand it off to a DSL later. It is similar to the way PEGs allow several different parties to contribute parts of a grammar, except that this approach does not constrain all the parties to actually use PEGs; for instance if I am a really lazy DSL author and I already have a SQL parser laying around (whether it's LL(k), LALR, whatever), I can just feed the original input text to that parser (or, better, use the flat token stream, sans comments, that came out of the lexer.) 5. It's risky 'cause I've never heard of anyone taking this approach before. Bring on the danger! I have observed that most PLs (Programming Langs) use one of two versions of stage 2: (1) C-style, with structure indicated entirely with {}, (), [], and possibly <> (shudder), or (2) Python-style, with structure indicated by indentation instead of {}. My favorite is the Boo language, which combines these two, using Python style by default, but also having a WSA parsing mode (whitespace-agnostic) with braces, and switching to WSA mode inside a Python-style module whenever the user uses an opener ("(,{,[") (IIRC). So a single Boo-like stage 2 could be re-used for almost all PLs, and thus would be a reasonable candidate as part of the standard library or a "D Boost library" (there is not such a thing, is there?).I wanted to make such a front-end so that I could easily make a C backend. I believe such a compiler would be able to do that with great ease. Needing to use D in places where it isn't available is a real pain-point for me right now, and I'll probably find ways to spend time on it eventually.Yeah, with a tree-transforming parser, I imagine the same thing, except my current fetish is to convert a certain subset of D to multiple other languages automatically. Then I could write libraries that can easily be used by an astonishingly large audience. I certainly would like to see D targetting Android, but that's best done directly from D to ARM. Anyway, the devil's in the detail. Originally I wanted to do a couldn't seem to work out the details, but D is more flexible and is likely better suited to the task.
Jul 07 2012
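A stripped-down D sketch of that "stage 2" tree-ification pass, assuming the input has already been split into string tokens (Node and treeify are invented names, not part of any existing library):

import std.stdio;

struct Node
{
    string text;      // token text, or the opening bracket for a group
    Node*[] children; // non-empty only for bracketed groups
}

// Stage 2: group tokens by (), [] and {} into a generic tree, without
// knowing anything about the language's grammar.
Node* treeify(string[] tokens)
{
    auto root = new Node("");
    Node*[] stack = [root];

    foreach (tok; tokens)
    {
        switch (tok)
        {
            case "(", "[", "{":
            {
                auto group = new Node(tok);
                stack[$ - 1].children ~= group;
                stack ~= group;
                break;
            }
            case ")", "]", "}":
                if (stack.length > 1)
                    stack = stack[0 .. $ - 1];
                else
                    writeln("unbalanced closing bracket: ", tok); // error hook
                break;
            default:
                stack[$ - 1].children ~= new Node(tok);
                break;
        }
    }
    if (stack.length > 1)
        writeln("unclosed bracket: ", stack[$ - 1].text); // error hook
    return root;
}

void main()
{
    auto t = treeify(["A", "man", "(", "from", "[", "very", "ugly", "]",
                      "house", ")", "said", "{", "hi", "}"]);
    writeln(t.children.length); // 5: "A", "man", "(...)", "said", "{...}"
}

Later stages (or embedded DSL parsers) would then walk the children of each group instead of re-lexing the raw text.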
On Saturday, 7 July 2012 at 20:26:07 UTC, David Piepgrass wrote:I'd like to add that if we give tree parsing first-class treatment, I believe the most logical approach to parsing has three or more stages instead of the traditional two (lex+parse): 1. Lexer 2. Tree-ification 3. Parsing to AST (which may itself use multiple stages, e.g. parse the declarations first, then parse function bodies later) The new stage two simply groups things that are in parenthesis and braces. So an input stream such as the following:I bet that after stage 2 you would have performed almost the same amount of work (in other words, spent almost the same time) as you would if you did full parsing. After that you would need to iterate the whole tree (possibly multiple times), modify (or recreate if the AST is immutable) its nodes, etc. Altogether this might be a lot of overhead. My opinion is that tree manipulation is something that should be available to clients of parser-as-a-library or even of parser+semantic analyzer, but not necessarily advantageous for parser itself.
Jul 07 2012
On Saturday, 7 July 2012 at 20:39:18 UTC, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 20:26:07 UTC, David Piepgrass wrote:Hmm, you've got a good point there, although simple tree-ification is clearly less work than standard parsing, since statements like "auto x = y + z;" would be quickly "blitted" into the same node in phase 2, but would become multiple separate nodes in phase 3. What I like about it is not its performance, but how it matches the way we think about languages. Humans tend to see overall structure first, and examine the fine details later. The tree parsing approach is similarly nonlinear and can be modularized in a way that might be more intuitive than traditional EBNF. On the other hand, one could argue it is /too/ flexible, admitting so many different approaches to parsing that a front-end based on this approach is not necessarily intuitive to follow; and of course, not using a standard EBNF-type grammar could be argued to be bad. Still... it's a fun concept, and even if the initial parsing ends up using the good-old lex-parse approach, semantic analysis and lowering can benefit from a tree parser. Tree parsing, of course, is just a generalization of linear parsing and so a tree parser generator (TPG) could work equally well for flat input.I'd like to add that if we give tree parsing first-class treatment, I believe the most logical approach to parsing has three or more stages instead of the traditional two (lex+parse): 1. Lexer 2. Tree-ification 3. Parsing to AST (which may itself use multiple stages, e.g. parse the declarations first, then parse function bodies later) The new stage two simply groups things that are in parenthesis and braces. So an input stream such as the following:I bet that after stage 2 you would have performed almost the same amount of work (in other words, spent almost the same time) as you would if you did full parsing. After that you would need to iterate the whole tree (possibly multiple times), modify (or recreate if the AST is immutable) its nodes, etc. Altogether this might be a lot of overhead. My opinion is that tree manipulation is something that should be available to clients of parser-as-a-library or even of parser+semantic analyzer, but not necessarily advantageous for parser itself.
Jul 07 2012
What I like about it is not its performance, but how it matches the way we think about languages. Humans tend to see overall structure first, and examine the fine details later. The tree parsing approach is similarly nonlinear and can be modularized in a way that might be more intuitive than traditional EBNF.That reminds me, I forgot to write another advantage I expected the three-phase approach to have, namely, that it seems easier to tell what the programmer "meant" with three phases, in the face of errors. I mean, phase 2 can tell when braces and parentheses are not matched up properly and then it can make reasonable guesses about where those missing braces/parentheses were meant to be, based on things like indentation. That would be especially helpful when the parser is used in an IDE, since if the IDE guesses the intention correctly, it can still understand broken code and provide code completion for it. And since phase 2 is a standard tool, anybody's parser can use it. Example:

void f() {
   if (cond)
      x = y + 1;
      y = z + 1;
   }
} // The error appears to be here, but it's really 4 lines up.
Jul 07 2012
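A crude D sketch of that indentation-based guessing, as a toy heuristic only (no claim that any real IDE or parser works this way):

import std.stdio;
import std.string : stripLeft;

// When a '}' does not line up with the '{' it closes, the brace that is
// actually missing is probably somewhere in between. This is a guess,
// which is exactly the point of the heuristic.
void checkBraces(string[] lines)
{
    static struct Open { size_t line, indent; }
    Open[] stack;

    foreach (i, line; lines)
    {
        immutable indent = line.length - line.stripLeft.length;
        foreach (ch; line)
        {
            if (ch == '{')
                stack ~= Open(i + 1, indent);
            else if (ch == '}')
            {
                if (stack.length == 0)
                    writefln("line %s: extra '}' with no matching '{'", i + 1);
                else
                {
                    auto open = stack[$ - 1];
                    stack = stack[0 .. $ - 1];
                    if (open.indent != indent)
                        writefln("line %s: '}' does not line up with the '{' on line %s;"
                               ~ " an opening brace may be missing in between",
                                 i + 1, open.line);
                }
            }
        }
    }
    foreach (open; stack)
        writefln("line %s: '{' is never closed", open.line);
}

void main()
{
    checkBraces([
        "void f() {",
        "   if (cond)",
        "      x = y + 1;",
        "      y = z + 1;",
        "   }",
        "}",
    ]);
}

On the example above this points back toward the top of the function rather than at the final '}', which is roughly the behavior being argued for.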
On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote:it seems easier to tell what the programmer "meant" with three phases, in the face of errors. I mean, phase 2 can tell when braces and parentheses are not matched up properly and then it can make reasonable guesses about where those missing braces/parentheses were meant to be, based on things like indentation. That would be especially helpful when the parser is used in an IDE, since if the IDE guesses the intention correctly, it can still understand broken code and provide code completion for it. And since phase 2 is a standard tool, anybody's parser can use it.There could be multiple errors that compensate each other and make your phase 2 succeed and prevent phase 3 from doing proper error handling. Even knowing that there is an error, in many cases you would not be able to create a meaningful error message. And any error would make your phase-2 tree incorrect, so it would be difficult to recover from it by inserting an additional token or ignoring tokens until the parser is able to continue its work properly. All this would suffer for the same reason: you lose information.
Jul 07 2012
On Saturday, 7 July 2012 at 22:07:02 UTC, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 21:52:09 UTC, David Piepgrass wrote:This is all true, but forgetting a brace commonly results in a barrage of error messages anyway. Code that guesses what you meant obviously won't work all the time, and phase 3 would have to take care not to emit an error message about a "{" token that doesn't actually exist (that was merely "guessed-in"). But at least it's nice for a parser to be /able/ to guess what you meant; for a typical parser it would be out of the question, upon detecting an error, to back up four source lines, insert a brace and try again.it seems easier to tell what the programmer "meant" with three phases, in the face of errors. I mean, phase 2 can tell when braces and parenthesis are not matched up properly and then it can make reasonable guesses about where those missing braces/parenthesis were meant to be, based on things like indentation. That would be especially helpful when the parser is used in an IDE, since if the IDE guesses the intention correctly, it can still understand broken code and provide code completion for it. And since phase 2 is a standard tool, anybody's parser can use it.There could be multiple errors that compensate each other and make your phase 2 succeed and prevent phase 3 from doing proper error handling. Even knowing that there is an error, in many cases you would not be able to create a meaningful error message. And any error would make your phase-2 tree incorrect, so it would be difficult to recover from it by inserting an additional token or ignoring tokens until parser is able to continue its work properly. All this would suffer for the same reason: you loose information.
Jul 07 2012
On Saturday, 7 July 2012 at 22:25:00 UTC, David Piepgrass wrote:This is all true, but forgetting a brace commonly results in a barrage of error messages anyway. Code that guesses what you meant obviously won't work all the time, and phase 3 would have to take care not to emit an error message about a "{" token that doesn't actually exist (that was merely "guessed-in"). But at least it's nice for a parser to be /able/ to guess what you meant; for a typical parser it would be out of the question, upon detecting an error, to back up four source lines, insert a brace and try again.So you simply admit that error recovery is difficult to implement. For me, it is a must-have, and thus throwing away information is bad.
Jul 07 2012
On Saturday, 7 July 2012 at 22:35:37 UTC, Roman D. Boiko wrote:On Saturday, 7 July 2012 at 22:25:00 UTC, David Piepgrass wrote:I'm not seeing any tremendous error-handling difficulty in my idea. Anyway, I missed the part about information being thrown away...?This is all true, but forgetting a brace commonly results in a barrage of error messages anyway. Code that guesses what you meant obviously won't work all the time, and phase 3 would have to take care not to emit an error message about a "{" token that doesn't actually exist (that was merely "guessed-in"). But at least it's nice for a parser to be /able/ to guess what you meant; for a typical parser it would be out of the question, upon detecting an error, to back up four source lines, insert a brace and try again.So you simply admit that error recovery is difficult to implement. For me, it is a must-have, and thus throwing away information is bad.
Jul 07 2012
On Sunday, 8 July 2012 at 01:15:31 UTC, David Piepgrass wrote:I'm not seeing any tremendous error-handling difficulty in my idea. Anyway, I missed the part about information being thrown away...?(I will use the word "you" to refer to some abstract person or compiler.) Error handling could mean either error reporting and stopping after the first error, or reporting several errors and continuing parsing + semantic analysis whenever possible, so that the user would get partial functionality like code completion, information about method overloads, or "go to definition" / find all usages, etc. The first method is the less powerful option, but still requires constructing a meaningful error message which could help the user. There are many possible error recovery strategies. For example, you might decide to insert some token according to the parsing information available up to the moment when error is discovered, if that would fix the problem and allow to continue parsing. Another option is to ignore a series of further tokens (treat them a whitespace, for example), until parser is able to continue its work. Also there are many heuristics for typical error situations. All of these can only be performed if parser gathers the syntax information which can be inferred from source code according to the grammar rules. In stage 2 you have only performed some basic analysis, like, e.g., matched braces to define some hierarchy. This means that at the time when you find a missing brace, for example, you cannot tell anything more than that braces don't match. Or, if the user inserts a brace in an incorrect location, it is only possible to say that it is either missing somewhere and somewhere else another brace is missing, or that it is missing in one place, and some other brace is redundant. In many cases you won't even notice that a brace is incorrectly placed, and pass the resulting tree to the 3rd stage. You don't take any hint from grammar about the exact locations where some token should be present. Now, stage 3 heavily depends on the output of stage 2. As I demonstrated in some examples, it could get the output which implies incorrect structure, even if that has not been found in the previous stage. You would need to analyze so much information attempting to find the real roots of a problem, that effectively it would involve duplicating (implicitly) the logic of previous stage, but with combinatorial complexity growth. The problems you would need to deal with are much more complex than I described here. So you wouldn't be able to deliver error recovery at all, or (if you could) it would be either trivial or would require needlessly complex logic. Breaking the system at the wrong boundary (when the number of dependencies that cross the boundary is large) always introduces artificial complexity. Described above is the illustration of what I call information loss. I may have described something not as clearly as needed, but I didn't have the goal to provide a thorough and verifiable analysis. I speculated and simplified a lot. If you decide to ignore this, it is not a problem and I don't state that you will fail any of your goals. Just admit that __for me__ this approach is not a viable option. Everything written above is IMHO and there may be many ways to resolve the problems with various degrees of success.
Jul 07 2012
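For concreteness, a small D sketch of the two recovery strategies listed above, phantom-token insertion and skipping to a synchronization token; the token set and parser are invented for the example:

import std.conv : to;
import std.stdio;

enum Tok { ident, number, plus, semicolon, eof }

struct RecoveringParser
{
    Tok[] toks;
    size_t i;
    string[] errors;

    Tok peek() { return i < toks.length ? toks[i] : Tok.eof; }

    // Strategy: treat tokens as if they were whitespace until a
    // synchronization point (here ';'), then continue parsing.
    void skipToSemicolon()
    {
        while (peek() != Tok.semicolon && peek() != Tok.eof)
            ++i;
    }

    // statement := (ident | number) ';'
    void statement()
    {
        if (peek() == Tok.ident || peek() == Tok.number)
            ++i;
        else
        {
            errors ~= "token " ~ i.to!string ~ ": expected identifier or number";
            skipToSemicolon();          // recovery instead of giving up
        }

        if (peek() == Tok.semicolon)
            ++i;
        else
            errors ~= "token " ~ i.to!string ~ ": expected ';'";
            // here the other strategy applies: pretend ';' was inserted
    }

    void program()
    {
        while (peek() != Tok.eof)
            statement();
    }
}

void main()
{
    // ident ; + + ; number ;   -> one bad statement in the middle
    auto p = RecoveringParser([Tok.ident, Tok.semicolon,
                               Tok.plus, Tok.plus, Tok.semicolon,
                               Tok.number, Tok.semicolon]);
    p.program();
    foreach (e; p.errors) writeln(e);  // one error, but all statements were seen
}

Both strategies lean on the grammar knowing exactly which token was expected at the failure point, which is the information that the argument above says a grammar-unaware stage would not have.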
On Saturday, 7 July 2012 at 21:43:58 UTC, David Piepgrass wrote:Still... it's a fun concept, and even if the initial parsing ends up using the good-old lex-parse approach, semantic analysis and lowering can benefit from a tree parser. Tree parsing, of course, is just a generalization of linear parsing and so a tree parser generator (TPG) could work equally well for flat input.Exactly. Semantic analysis is what would benefit from that, as well as client code (for example, refactoring tools, etc.) Parser would become quite complicated. Probably it would need to perform complex tree navigation (which might decrease performance proportionally to the average tree depth or even more, if parser algorithms were themselves fast). Any non-trivial query would require direct character manipulation (like comparing sub-strings of captures with string literals, etc.). Probably you would get frequent cache misses because of the need to jump throughout the (incomplete) AST. It would definitely be problematic to build an immutable AST model, thus (IMO) significantly and needlessly constraining you and users of your library. And likely you would have to deal with many more problems __without__ significant gains.
Jul 07 2012
On 07/07/2012 04:26 PM, David Piepgrass wrote:Yes and yes. I didn't choose this because it because it's pretty. I chose it because: (1) It's easy to implement. (2) Both the implementation and syntax can be altered easily. I do not have time to write a language for the tree pattern recognition and substitution that is needed to do this in an aesthetically pleasing way. I've tried to sketch what it might look like before, and even then it is hard to make it nice, much less begin implementing the thing. I'd love to have such a language, but resource constraints exist. I also think that this approach would allow me to find out what my usage patterns look like before I commit to a more complicated architecture/tool. I really think the design of this regex/language/DSL thing should be dominated by its usage. This is a tricky chicken-and-egg thing because it's not currently used. The hacky syntax you see is the bootstrapping. Point (2) is important because, since we don't have existing usage patterns, this thing is going to change. It's going to change /a lot/. I want it to be easy to change. I think a complete DSL will be harder to change quickly. I also like how it doesn't require a lot of CTFE trickery or pushing DMD too far. D has really cool features, but I find that when I use things like CTFE aggressively then I lose productivity because I end up spending a lot of time finding compiler bugs. This leads to my current strategy: use the simpler features that work for sure, and only use the more advanced stuff when I really need to. I think my syntax fits this strategy and thus contributes to point (1). That said, it is good that even mostly-working CTFE exists and that a powerful template and metaprogramming system exists, because I don't think a compiler like this would be very practical to program otherwise. It would be doable in other languages, but could easily suffer from performance pessimizations due to being forced to compute everything at runtime. If anyone has an approach that shares the above strengths and looks nicer or is more powerful, I'd love to see it.auto captures = syntaxNode.matchNodes( TOK_WHILE_NODE, OP_ENTER_NODE, OP_CAPTURE(0), OP_BEGIN, TOK_EXPRESSION, OP_END, OP_CAPTURE(1), OP_BEGIN, TOK_STATEMENT, OP_END, OP_LEAVE_NODE);I'm glad to hear you like the tree-parsing approach, Chad, although the particular syntax here looks pretty unfriendly :O -- does this represent something that you are working on right now?5. It's risky 'cause I've never heard of anyone taking this approach before. Bring on the danger!The danger is the fun part! <g>That does sound very cool. Possibly difficult though, due to having to cater to the lowest-common-denominator in all of your API designs. No templated functions or ranges in your API, that's for sure. I'm sure there are some things where this is very doable though; it probably depends on what kind of libraries you are writing. As for D targeting Android, my intent is really to target X where X is any CPU/OS combo you can think of. I want to be able to get D, the language, not necessarily phobos or other niceties, to work on any platform, and to do so without much work. Cross-compiling to a new platform that has never been cross-compiled before should require zero coding. Perhaps it might be possible to have a text file with some key-value configuration that tells it certain common features are available on the target, thus allowing you to have more features with almost no effort involved. 
Still, I'll always take a crippled but existent D compiler that targets Android over a perfect but non-existent D compiler that targets Android. I think that the D-directly-to-ARM is the current approach for cross-compiling. I critique it for its underwhelming lack of results.I wanted to make such a front-end so that I could easily make a C backend. I believe such a compiler would be able to do that with great ease. Needing to use D in places where it isn't available is a real pain-point for me right now, and I'll probably find ways to spend time on it eventually.Yeah, with a tree-transforming parser, I imagine the same thing, except my current fetish is to convert a certain subset of D to multiple other languages automatically. Then I could write libraries that can easily be used by an astonishingly large audience. I certainly would like to see D targetting Android, but that's best done directly from D to ARM.Anyway, the devil's in the detail. Originally I wanted to do a parser out the details, but D is more flexible and is likely better suited to the task.I can easily see how this is the case. I don't think I'd be interested in doing a project like this in any other language. I imagined trying practical. For instance, I don't think the "use regular expressions to match AST structures" would work well in other cases because it would either (1) have a bunch of runtime overhead for compiling the expressions into DFAs at runtime or (2) suffer from integration problems if you try to compile the expressions in separate files before compiling the rest of the front-end.
Jul 07 2012
Well, for templates, in general, it would be necessary to instantiate a particular set of templates and explicitly give them names in the target language. So for instance, I could define a Point!T struct in D, sure, but then I'd have to tell the language converter to create target-language specializations: in were C++, the template could be translated to a C++ template, Point<T>, as long as there aren't any "static ifs" or other things that can't be translated. Notably, if a template P!T depends on another template Q!T, then P!T cannot be translated to Adapting standard libraries could no doubt be a gigantic problem. I don't know how to begin to think about doing that. But for ranges in particular, I think the concept is too important to leave out of public interfaces. So I'd port the major range data structures to the target languages, most likely by hand, so that they could be used by converted code.Yeah, with a tree-transforming parser, I imagine the same thing, except my current [fantasy] is to convert a certain subset of D to multiple other languages automatically. Then I could write libraries that can easily be used by an astonishingly large audience. I certainly would like to see D targetting Android, but that's best done directly from D to ARM.That does sound very cool. Possibly difficult though, due to having to cater to the lowest-common-denominator in all of your API designs. No templated functions or ranges in your API, that's for sure. I'm sure there are some things where this is very doable though; it probably depends on what kind of libraries you are writing.As for D targeting Android, my intent is really to target X where X is any CPU/OS combo you can think of. I want to be able to get D, the language, not necessarily phobos or other niceties, to work on any platform, and to do so without much work. Cross-compiling to a new platform that has never been cross-compiled before should require zero coding.I understand. Conversion to C is an effective last resort. And, well, I hear a lot of compilers have even used it as a standard practice. I guess you'd be stuck with refcounting, though.I think that the D-directly-to-ARM is the current approach for cross-compiling. I critique it for its underwhelming lack of results.Yeah. I assume it involves weird object-file formats, calling conventions and ABIs. I guess very few want to get involved with that stuff, and very few have the slightest clue where to begin, myself included.(2) suffer from integration problems if you try to compile the expressions in separate files before compiling the rest of the front-end.Absolutely, I love language-integrated metaprogramming. Without it you end up with complicated build environments, and I hate those, cuz there isn't a single standard build environment that everybody likes. I think people should be able to just load up their favorite IDE and add all source files to the project and It Just Works. Or on the command line, do dmd *.d or whatever. Oh, and the ability to run the same code at meta-compile-time, compile-time and run-time, also priceless.
Jul 07 2012
On 07/07/2012 08:55 PM, Chad J wrote:On 07/07/2012 02:23 PM, David Piepgrass wrote:I'd suggest: AstOp!` Lower while ( boolExpr ) { statements; } Into loopAgain: if ( !boolExpr ) { statements; } else goto exitLoop goto loopAgain exitLoop: `.run(syntaxNode);I was thinking the same thing. My intent is to create a kind of regular-expression-of-nodes with push/pop operators to recognize ascent and descent on the tree. Such a regular expression would allow one to capture subtrees out of generalized patterns and then place them into new trees that then become the input for the next pattern or set of patterns. I think this is much closer to how I conceptualize semantic analysis than how semantic analysis is done in front ends like DMD: it should to be done with pattern recognition and substitution, not with myriad nested if-statements and while-loops. My vision is to have code similar to this in the front-end: /+ Lower while ( boolExpr ) { statements; } Into loopAgain: if ( !boolExpr ) goto exitLoop statements; goto loopAgain exitLoop: +/ void lowerWhileStatement( SyntaxElement* syntaxNode ) { auto captures = syntaxNode.matchNodes( TOK_WHILE_NODE, OP_ENTER_NODE, OP_CAPTURE(0), OP_BEGIN, TOK_EXPRESSION, OP_END, OP_CAPTURE(1), OP_BEGIN, TOK_STATEMENT, OP_END, OP_LEAVE_NODE); if ( captures is null ) return; syntaxNode.replaceWith( LabelNode("loopAgain"), TOK_IF_STATEMENT, OP_INSERT, OP_BEGIN, TOK_NEGATE, OP_INSERT, OP_BEGIN, captures[0], // Expression OP_END, GotoStatement("exitLoop"), OP_END, captures[1], // statements GotoStatement("loopAgain"), LabelNode("exitLoop") ); } The specifics will easily change.Note that PEG does not impose to use packrat parsing, even though it was developed to use it. I think it's a historical 'accident' that put the two together: Bryan Ford thesis used the two together.disillusioned because nobody was interested in fixing the bugs in it (ANTLR's author is a Java guy first and foremost) and the source code of the required libraries didn't have source code or a license (wtf.) So, for awhile I was thinking about how I might make my own parser generator that was "better" than ANTLR. I liked the syntax of PEG descriptions, but I was concerned about the performance hit of packrat and, besides, I already liked the syntax and flexibility of ANTLR. So my idea was to make something that was LL(k) and mixed the syntax of ANTLR and PEG while using more sane (IMO) semantics than ANTLR did at the time (I've no idea if ANTLR 3 still uses the same semantics today...) All of this is 'water under the bridge' now, but I hand-wrote a lexer to help me plan out how my parser-generator would produce code. The output code was to be both more efficient and significantly more readable than ANTLR's output. I didn't get around to writing the parser-generator itself but I'll have a look back at my handmade lexer for inspiration.I don't like the sound of this either. Even if PEGs were fast, difficulty in debugging, error handling, etc. would give me pause. I insist on well-rounded tools. For example, even though LALR(1) may be the fastest type of parser (is it?), I prefer not to use it due to its inflexibility (it just doesn't like some reasonable grammars), and the fact that the generated code is totally unreadable and hard to debug (mind you, when I learned LALR in school I found that it is possible to visualize how it works in a pretty intuitive way--but debuggers won't do that for you.) 
While PEGs are clearly far more flexible than LALR and probably more flexible than LL(k), I am a big fan of old-fashioned recursive descent because it's very flexible (easy to insert actions during parsing, and it's possible to use custom parsing code in certain places, if necessary*) and the parser generator's output is potentially very straightforward to understand and debug. In my mind, the main reason you want to use a parser generator instead of hand-coding is convenience, e.g. (1) to compress the grammar down so you can see it clearly, (2) have the PG compute the first-sets and follow-sets for you, (3) get reasonably automatic error handling.

* (If the language you want to parse is well-designed, you'll probably not need much custom parsing. But it's a nice thing to offer in a general-purpose parser generator.)

I'm not totally sure yet how to support good error messages, efficiency and straightforward output at the same time, but by the power of D I'm sure I could think of something...

I would like to submit another approach to parsing that I dare say is my favorite, even though I have hardly used it at all yet. ANTLR offers something called "tree parsing" that is extremely cool. It parses trees instead of linear token streams, and produces other trees as output. I don't have a good sense of how tree parsing works, but I think that some kind of tree-based parser generator could become the basis for a very flexible and easy-to-understand D front-end. If a PG operates on trees instead of linear token streams, I have a sneaky suspicion that it could revolutionize how a compiler front-end works. Why? Because right now parsers operate just once, on the user's input, and from there you manipulate the AST with "ordinary" code. But if you have a tree parser, you can routinely manipulate and transform parts of the tree with a sequence of independent parsers and grammars. Thus, parsers would replace a lot of things for which you would otherwise use a visitor pattern, or something. I think I'll try to sketch out this idea in more detail later.

However, as I found a few hours ago, Packrat parsing (typically used to handle PEG) has serious disadvantages: it complicates debugging because of frequent backtracking, it has problems with error recovery, and it typically disallows adding actions with side effects (because of the possibility of backtracking). These are important enough to reconsider my plans of using Pegged. I will try to analyze whether the issues are so fundamental that I (or somebody else) will have to create an ANTLR-like parser instead, or whether it is possible to introduce changes into Pegged that would fix these problems.
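To make the "old-fashioned recursive descent" point above concrete, here is a tiny self-contained sketch for a toy grammar of single digits, '+' and '*'. All of the names are mine and purely illustrative, not taken from any existing parser:

import std.ascii : isDigit;
import std.exception : enforce;

int parseExpr(ref string s)   // Expr -> Term ('+' Term)*
{
    int value = parseTerm(s);
    while (s.length && s[0] == '+')
    {
        s = s[1 .. $];
        value += parseTerm(s);
    }
    return value;
}

int parseTerm(ref string s)   // Term -> Digit ('*' Digit)*
{
    int value = parseDigit(s);
    while (s.length && s[0] == '*')
    {
        s = s[1 .. $];
        value *= parseDigit(s);
    }
    return value;
}

int parseDigit(ref string s)  // Digit -> '0' .. '9'
{
    enforce(s.length && isDigit(s[0]), "digit expected");
    int value = s[0] - '0';
    s = s[1 .. $];
    return value;
}

unittest
{
    string src = "2+3*4";
    assert(parseExpr(src) == 14);
    assert(src.length == 0);   // everything consumed
}

The appeal is exactly what's described above: every rule is an ordinary function, so you can read it, set breakpoints in it, and splice in custom actions wherever you like.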
Jul 07 2012
On 07/07/2012 10:26 PM, Timon Gehr wrote:On 07/07/2012 08:55 PM, Chad J wrote:Also, I'd like to point out that the transformation is not actually valid, because there are break/continue.... The specifics will easily change.I'd suggest: AstOp!` Lower while ( boolExpr ) { statements; } Into loopAgain: if ( !boolExpr ) { statements; } else goto exitLoop goto loopAgain exitLoop: `.run(syntaxNode);
Jul 07 2012
On 07/07/2012 04:26 PM, Timon Gehr wrote:On 07/07/2012 08:55 PM, Chad J wrote:I wish. Can you make it happen?The specifics will easily change.I'd suggest: AstOp!` Lower while ( boolExpr ) { statements; } Into loopAgain: if ( !boolExpr ) { statements; } else goto exitLoop goto loopAgain exitLoop: `.run(syntaxNode);
Jul 07 2012
On 05-Jul-12 16:11, Denis Shelomovskij wrote:There are more and more projects requiring parsing D code (IDE plugins, DCT and dfmt now). We definitely need a _standard_ fast D parser (suitable as tokenizer).Then do it. It's all about having something so obviously good, fast and flexible that other stuff can be considered only as toys. I might do one, but I'd rather just help other folks make it faster ;) -- Dmitry Olshansky
Jul 05 2012
On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:Then do it. It's all about having something so obviously good, fast and flexible that other stuff can be considered only as toys. I might do one, but I'd rather just help other folks make it faster ;)Would you help to make Pegged faster?
Jul 05 2012
On 05-Jul-12 17:04, Roman D. Boiko wrote:On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:Well why not. But first I'll need to deliver some stuff on my GSOC project. -- Dmitry OlshanskyThen do it. It's all about having something so obviously good, fast and flexible that other stuff can be considered only as toys. I might do one, but I'd rather just help other folks make it faster ;)Would you help to make Pegged faster?
Jul 05 2012
On Thursday, 5 July 2012 at 13:05:41 UTC, Dmitry Olshansky wrote:On 05-Jul-12 17:04, Roman D. Boiko wrote: Well why not. But first I'll need to deliver some stuff on my GSOC project.I bet that after you finish with GSOC optimizing Pegged will not be less relevant than it is now :) And as for DCT, it will take another half a year (at least) until it will become usable for any real needs (except the most trivial).
Jul 05 2012
On 7/5/12 9:05 AM, Dmitry Olshansky wrote:On 05-Jul-12 17:04, Roman D. Boiko wrote:Good call :o). Note that we can discuss tweaking the scope of the project after the midterm. Ping me if you have some ideas. AndreiOn Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:Well why not. But first I'll need to deliver some stuff on my GSOC project.Then do it. It's all about having something so obviously good, fast and flexible that other stuff can be considered only as toys. I might do one, but I'd rather just help other folks make it faster ;)Would you help to make Pegged faster?
Jul 05 2012
On 05-Jul-12 18:29, Andrei Alexandrescu wrote:On 7/5/12 9:05 AM, Dmitry Olshansky wrote:It's great that you are not opposed to adjusting scope of project. I have a ton of ideas, but let's discuss them after midterm. -- Dmitry OlshanskyOn 05-Jul-12 17:04, Roman D. Boiko wrote:Good call :o). Note that we can discuss tweaking the scope of the project after the midterm. Ping me if you have some ideas.On Thursday, 5 July 2012 at 12:32:19 UTC, Dmitry Olshansky wrote:Well why not. But first I'll need to deliver some stuff on my GSOC project.Then do it. It's all about having something so obviously good, fast and flexible that other stuff can be considered only as toys. I might do one, but I'd rather just help other folks make it faster ;)Would you help to make Pegged faster?
Jul 05 2012
Is the actual grammar available somewhere? The online language reference is all we've got, I guess? DMD doesn't seem to be using yacc/bison either.

On Thu, Jul 5, 2012 at 7:11 AM, Denis Shelomovskij <verylonglogin.reg gmail.com> wrote:There are more and more projects requiring parsing D code (IDE plugins, DCT and dfmt now). We definitely need a _standard_ fast D parser (suitable as tokenizer). We already have a good one at least in Visual D. Maybe dmd's parser is faster. If so, it can be (easily?) rewritten in D. We also already have some other
What about to get one and call it standard? -- Денис В. Шеломовский Denis V. Shelomovskij
Jul 05 2012
On Thursday, 5 July 2012 at 15:42:22 UTC, Caligo wrote:Is the actual grammar available somewhere? The online language reference is all we got I guess? DMD doesn't seem to be using yacc/bison either.In PEG format, yes (not necessarily correct): https://github.com/roman-d-boiko/Pegged/blob/dct/pegged/examples/dgrammar.d I don't know about any others, though.
Jul 05 2012
On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:There are more and more projects requiring parsing D code (IDE plugins, DCT and dfmt now). We definitely need a _standard_ fast D parser (suitable as tokenizer). We already have a good one at least in Visual D. Maybe dmd's parser is faster. If so, it can be (easily?) rewritten in D. We also already have What about to get one and call it standard?I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based off of a defined grammar. As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed. With some very minor tweaks, and a properly written Grammar class, I basically have it already done. D was going to be one of the first languages I would have written a definition for. https://github.com/kai4785/firstfront I haven't had time to look through Pegged, but there are some differences in my approach that I can immediately see in Pegged's. 1) Pegged does compile-time or run-time code generation for your parser. Mine is run-time only and regex based. 2) Pegged uses PEG format, which I have been introduced to through this thread. One of my goals was to actually invent something like PEG. This will save me time :) I would love to receive some critical feedback on my approach as well as the actual code base. To build it, you'll need cmake, and cmaked2 from here: http://code.google.com/p/cmaked2/wiki/GettingStarted Or just dmd *.d :) I haven't tried to build it on Windows yet, but I don't see anything that immediately jumps out as not cross-platform. I've been working on it on both Fedora and CentOS. -Kai Meyer
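For reference, a run-time, regex-table lexer in the spirit described above might look roughly like the sketch below (the rule set and names are illustrative, not taken from the repository; and, as pointed out later in the thread, trying one regex after another is fine for a prototype but too slow for a production lexer):

import std.regex;
import std.typecons : Tuple, tuple;

alias Tok = Tuple!(string, "kind", string, "text");

Tok[] lexSimple(string src)
{
    // Order matters: earlier rules win, so keywords would go before identifiers.
    auto rules = [
        tuple("number",     regex(`^[0-9]+`)),
        tuple("identifier", regex(`^[A-Za-z_][A-Za-z0-9_]*`)),
        tuple("whitespace", regex(`^\s+`)),
    ];
    Tok[] tokens;
    while (src.length)
    {
        bool matched = false;
        foreach (rule; rules)
        {
            auto m = matchFirst(src, rule[1]);
            if (!m.empty)
            {
                if (rule[0] != "whitespace")
                    tokens ~= Tok(rule[0], m.hit);
                src = src[m.hit.length .. $];
                matched = true;
                break;
            }
        }
        if (!matched)
            throw new Exception("no rule matched near: " ~ src);
    }
    return tokens;
}

unittest
{
    auto toks = lexSimple("foo 42 bar");
    assert(toks.length == 3 && toks[1].kind == "number" && toks[1].text == "42");
}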
Jul 31 2012
On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based off of a defined grammar.Every helping hand is appreciated :-)As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed.Yes. To make this statement more precise: We need a lexer that provides a range of tokens and we need a parser which makes it possible to build an AST. Pegged combines both approaches but imposes an overhead if you just need a token list. However I'm not sure if this is a problem. There are already some working D-Parsers buried in this thread.With some very minor tweaks, and a properly written Grammar class, I basically have it already done. D was going to be one of the first languages I would have written a definition for. https://github.com/kai4785/firstfrontI've only glimpsed at your code. For most languages lexing is far more expensive then parsing and thus the lexer has to be very fast and I wouldn't pursue your approach and instead use something like ragel. It already has D output but needs a separate build step.
Jul 31 2012
On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tobias pankrath.net> wrote:On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:Hi Kai, welcome here!I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based off of a defined grammar.Every helping hand is appreciated :-)I think we need a tokenizer generator and a parser generator. They can be grouped or not, but we need both, in Phobos. We also need to define what's needed in a token. Kai's approach is OK, but what's the _type field for an operator or and identifier? Also, a lexer can fill a symbol table and produce identifier tokens that refer to a particular entry in the symbol table. I guess creating a tree of symbol tables according to scope visibility is then more the job of the parser, but I'm not sure.As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed.Yes. To make this statement more precise: We need a lexer that provides a range of tokens and we need a parser which makes it possible to build an AST. Pegged combines both approaches but imposes an overhead if you just need a token list. However I'm not sure if this is a problem.There are already some working D-Parsers buried in this thread.Yes. We also need a D parser (not generic), but no one pushed one for Phobos inclusion right now. Anyway, we can put generic parsers in Phobos too (LALR, LL(*), etc), but I'd say the first priority would be to have a D lexer (producing a range of tokens) and a D parser (consuming a range of tokens, and executing semantic actions, like the building of an AST). Generic parsers can come later on. Having a D parser in the standard distribution would create much goodwill. Many people want that.I've only glimpsed at your code. For most languages lexing is far more expensive then parsingIs that so?and thus the lexer has to be very fast and I wouldn't pursue your approach and instead use something like ragel. It already has D output but needs a separate build step.Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D. The only problem I see in Kai's approach (which is the standard one, a prioritized list of regexes) is how to recognize tokens that are not regular (I mean, that cannot be encoded as a regex), like nesting comments? How does the lexer know when to stop producing a 'comment' token? Philippe
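As a rough illustration of the "lexer producing a range of tokens" interface mentioned above, here is a minimal sketch with a deliberately trivial tokenization rule; Token and TokenRange are placeholder names, not a proposed Phobos API:

struct Token { string kind; string text; }

struct TokenRange
{
    private string src;
    private Token cur;
    private bool exhausted;

    this(string source) { src = source; popFront(); }

    @property bool empty() const { return exhausted; }
    @property Token front() const { return cur; }

    void popFront()
    {
        import std.string : stripLeft, indexOf;
        src = stripLeft(src);               // discard whitespace between tokens
        if (src.length == 0) { exhausted = true; return; }
        auto end = src.indexOf(' ');        // toy rule: a token is a space-separated word
        auto text = (end == -1) ? src : src[0 .. end];
        cur = Token("word", text);
        src = src[text.length .. $];
    }
}

unittest
{
    import std.algorithm : map, equal;
    auto r = TokenRange("lex me  lazily");
    assert(equal(r.map!(t => t.text), ["lex", "me", "lazily"]));
}

Because it satisfies the input range interface, the usual filter/map machinery (e.g. stripping whitespace or comment tokens) composes on top of it for free.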
Jul 31 2012
On Tuesday, 31 July 2012 at 21:10:52 UTC, Philippe Sigaud wrote:I have no numbers. It's a statement that I found in some (1-3) sources about parsing. I'll share if I can find them.I've only glimpsed at your code. For most languages lexing is far more expensive then parsingIs that so?Yeah.and thus the lexer has to be very fast and I wouldn't pursue your approach and instead use something like ragel. It already has D output but needs a separate build step.Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.
Jul 31 2012
On 01-Aug-12 01:10, Philippe Sigaud wrote:On Tue, Jul 31, 2012 at 7:29 PM, Tobias Pankrath <tobias pankrath.net> wrote:Yeah.On Tuesday, 31 July 2012 at 16:10:14 UTC, Kai Meyer wrote:Hi Kai, welcome here!I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based off of a defined grammar.Every helping hand is appreciated :-)I think we need a tokenizer generator and a parser generator. They can be grouped or not, but we need both, in Phobos. We also need to define what's needed in a token. Kai's approach is OK, but what's the _type field for an operator or and identifier? Also, a lexer can fill a symbol table and produce identifier tokens that refer to a particular entry in the symbol table.As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed.Yes. To make this statement more precise: We need a lexer that provides a range of tokens and we need a parser which makes it possible to build an AST. Pegged combines both approaches but imposes an overhead if you just need a token list. However I'm not sure if this is a problem.I guess creating a tree of symbol tables according to scope visibility is then more the job of the parser, but I'm not sure.Parser can use constant IDs for nested tables, IDs point to string table. String table is populated by lexer.It usually is. Unless parser is overly generic as GLL/GLR (not to claim that it's very slow but GLL/GLR are slower for obvious reasons).I've only glimpsed at your code. For most languages lexing is far more expensive then parsingIs that so?Have to agree here if anything a better DFA generator is needed, current std.regex can't get as good in this field because of: a) irregularities like lookahed/lookbehind/etc. in patterns (avoided in ctRegex via codegen) b) full unicode features commitment (again expect some improvement here in near future) c) designed to take multiple captures from matched piece of text. I'm not sure when (or even if) std.regex will ever special case all of the above.and thus the lexer has to be very fast and I wouldn't pursue your approach and instead use something like ragel. It already has D output but needs a separate build step.Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D. The only problem I see in Kai's approach (which is the standard one, a prioritized list of regexes) is how to recognize tokens that are not regular (I mean, that cannot be encoded as a regex), like nesting comments?See below.How does the lexer know when to stop producing a 'comment' token?Call special function skipComment when lexer hits comment_start pseudotoken. Typically lexeres are regular as it allows them to be fast. -- Dmitry Olshansky
Jul 31 2012
On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:The latter, I get. The former, not so much.I guess creating a tree of symbol tables according to scope visibility is then more the job of the parser, but I'm not sure.Parser can use constant IDs for nested tables, IDs point to string table. String table is populated by lexer.I'd have thought that, seeing as 'the end result' is more complicated (richer, if I may say so) for parsing, then parsing is more expensive. I'm reading on GLR, and was wondering what the practical limits are. Do people use GLR to parse thousands of pages of text?It usually is. Unless parser is overly generic as GLL/GLR (not to claim that it's very slow but GLL/GLR are slower for obvious reasons).I've only glimpsed at your code. For most languages lexing is far more expensive then parsingIs that so?Have to agree here if anything a better DFA generator is needed, current std.regex can't get as good in this field because of: a) irregularities like lookahed/lookbehind/etc. in patterns (avoided in ctRegex via codegen) b) full unicode features commitment (again expect some improvement here in near future) c) designed to take multiple captures from matched piece of text. I'm not sure when (or even if) std.regex will ever special case all of the above.Well, - for a lexer lookahead is sometimes useful (the Dragon book cite the FORTRAN grammar, for which keywords are not reserved and so when you encounter IF, you don't know if (!) it's a function call or a 'real' if) - Unicode is needed to lex D correctly, no? - multiple captures doesn't seem necessary *for a lexer*Ah, and this special function must somehow maintain a stack, to 'count' the comment_start and comment_end pseudotokens. So in a way, it quits the regular world to become temporarily more powerful.How does the lexer know when to stop producing a 'comment' token?Call special function skipComment when lexer hits comment_start pseudotoken.Typically lexeres are regular as it allows them to be fast.Hmm, but then it has to treat comments a one token. So no Ddoc love?
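For the nesting-comment case above, a full stack isn't actually needed: since /+ and +/ nest like parentheses, a single depth counter is enough. A hedged sketch, purely illustrative and not code from any existing lexer (it works on bytes, which is safe here because '/' and '+' are ASCII):

size_t skipNestedComment(string src, size_t i)
{
    // i points at the opening "/+"; returns the index just past the matching "+/".
    assert(i + 2 <= src.length && src[i .. i + 2] == "/+");
    size_t depth = 0;
    while (i + 1 < src.length)
    {
        if (src[i .. i + 2] == "/+")      { depth++; i += 2; }
        else if (src[i .. i + 2] == "+/")
        {
            depth--; i += 2;
            if (depth == 0)
                return i;
        }
        else
            i++;
    }
    throw new Exception("unterminated /+ +/ comment");
}

unittest
{
    auto s = "/+ outer /+ inner +/ still outer +/ rest";
    assert(s[skipNestedComment(s, 0) .. $] == " rest");
}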
Jul 31 2012
On 08/01/2012 12:01 AM, Philippe Sigaud wrote:On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Regularity of the language is not required for speed.Typically lexeres are regular as it allows them to be fast.Hmm, but then it has to treat comments a one token. So no Ddoc love?Ddoc is typically not required. By default it should be treated as whitespace. If it is required, one token seems reasonable: The post-processing of the doc comment is best done as a separate step.
Jul 31 2012
On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:Ddoc is typically not required. By default it should be treated as whitespace. If it is required, one token seems reasonable: The post-processing of the doc comment is best done as a separate step.That was how I was intending to deal with ddoc. It's just a nested block comment token. The whole comment string is there, so the ddoc processor can use that to do whatever it does. ddoc isn't part of lexing really. It's a separate thing. - Jonathan M Davis
Jul 31 2012
On Wed, Aug 1, 2012 at 12:58 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:OK. Same for standard comments and doc comments? I was wondering how to get the code possibly inside a ---- / ---- block (I never dealt with something like documentation or syntax highlighting), but your solution makes it easy: Token(TokenType.DocComment, "/** ... */"), Token(TokenType.Module, "module"), ... A small specialised parser can then extract text, DDoc macros and code blocks from inside the comment. Finding and stripping '----' is easy, and then the lexer can be locally reinvoked on the slice containing the example code.Ddoc is typically not required. By default it should be treated as whitespace. If it is required, one token seems reasonable: The post-processing of the doc comment is best done as a separate step.That was how I was intending to deal with ddoc. It's just a nested block comment token. The whole comment string is there, so the ddoc processor can use that to do whatever it does. ddoc isn't part of lexing really. It's a separate thing.
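A possible shape for that small specialised pass, as a hedged sketch (extractExampleCode is a hypothetical helper, not an existing API):

import std.algorithm : findSplit;
import std.string : strip;

// Hypothetical helper: pull the first ---- fenced example out of an
// already-extracted ddoc comment body.
string extractExampleCode(string ddocBody)
{
    auto first = findSplit(ddocBody, "----");
    if (first[1].length == 0) return null;      // no opening fence
    auto second = findSplit(first[2], "----");
    if (second[1].length == 0) return null;     // unterminated block
    return strip(second[0]);                    // the code between the fences
}

unittest
{
    auto doc = "Adds two numbers.\n----\nassert(add(1, 2) == 3);\n----\n";
    assert(extractExampleCode(doc) == "assert(add(1, 2) == 3);");
}

The returned slice is exactly what would be handed back to the lexer for re-lexing.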
Jul 31 2012
On 01-Aug-12 02:54, Timon Gehr wrote:On 08/01/2012 12:01 AM, Philippe Sigaud wrote:That's why it says "allows" (not "required"); the reason in general is that a plain deterministic automaton is insanely fast compared with just about any CFG parsing scheme. Yet there are certain grammars/cases where it doesn't matter much. Also, tweaking the DFA with "semantic" actions could make it handle some irregularities at no extra cost, etc.On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Regularity of the language is not required for speed.Typically lexeres are regular as it allows them to be fast.Yup. -- Dmitry Olshansky
Jul 31 2012
On 01-Aug-12 02:01, Philippe Sigaud wrote:On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Well while lookahead will help, there are simpler ways. e.g. regex ("IF\ZYOUR_LOOKAHEAD"); \Z means that you are capturing text only up to \Z. It's still regular. I think Boost regex does support this. We could have supported this cleanly but have chosen ECM-262 standard. Critical difference is that lookahead can be used like this: regex("blah(?=lookahead)some_other_stuff_that_lookahead_also_checks");The latter, I get. The former, not so much.I guess creating a tree of symbol tables according to scope visibility is then more the job of the parser, but I'm not sure.Parser can use constant IDs for nested tables, IDs point to string table. String table is populated by lexer.I'd have thought that, seeing as 'the end result' is more complicated (richer, if I may say so) for parsing, then parsing is more expensive. I'm reading on GLR, and was wondering what the practical limits are. Do people use GLR to parse thousands of pages of text?It usually is. Unless parser is overly generic as GLL/GLR (not to claim that it's very slow but GLL/GLR are slower for obvious reasons).I've only glimpsed at your code. For most languages lexing is far more expensive then parsingIs that so?Have to agree here if anything a better DFA generator is needed, current std.regex can't get as good in this field because of: a) irregularities like lookahed/lookbehind/etc. in patterns (avoided in ctRegex via codegen) b) full unicode features commitment (again expect some improvement here in near future) c) designed to take multiple captures from matched piece of text. I'm not sure when (or even if) std.regex will ever special case all of the above.Well, - for a lexer lookahead is sometimes useful (the Dragon book cite the FORTRAN grammar, for which keywords are not reserved and so when you encounter IF, you don't know if (!) it's a function call or a 'real' if)- Unicode is needed to lex D correctly, no?Not that much of unicode, e.g. it could skip decoding stuff most of the time.- multiple captures doesn't seem necessary *for a lexer*-- Dmitry OlshanskyAh, and this special function must somehow maintain a stack, to 'count' the comment_start and comment_end pseudotokens. So in a way, it quits the regular world to become temporarily more powerful.How does the lexer know when to stop producing a 'comment' token?Call special function skipComment when lexer hits comment_start pseudotoken.Typically lexeres are regular as it allows them to be fast.Hmm, but then it has to treat comments a one token. So no Ddoc love?
Jul 31 2012
On Wed, Aug 1, 2012 at 7:48 AM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:In the FORTRAN case, you indeed *need* to re-lex the stuff after IF, with another regex, once you've determined it's an IF instruction and not some moron who used IF as an identifier. You know, as a first step, I'd be happy to get ctRegex to recognize the \Z flag.Well, - for a lexer lookahead is sometimes useful (the Dragon book cite the FORTRAN grammar, for which keywords are not reserved and so when you encounter IF, you don't know if (!) it's a function call or a 'real' if)Well while lookahead will help, there are simpler ways. e.g. regex ("IF\ZYOUR_LOOKAHEAD"); \Z means that you are capturing text only up to \Z. It's still regular. I think Boost regex does support this. We could have supported this cleanly but have chosen ECM-262 standard. Critical difference is that lookahead can be used like this: regex("blah(?=lookahead)some_other_stuff_that_lookahead_also_checks");
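For what it's worth, the ECMA-262-style lookahead form mentioned above is, as far as I can tell, accepted by std.regex's run-time engine (whether ctRegex handles it is another matter). A small illustration:

import std.regex;

unittest
{
    // Match "IF" only when a "(" follows, without consuming the "(" itself.
    auto re = regex(`\bIF(?=\()`);
    assert(!matchFirst("IF(X) = 1", re).empty);
    assert(matchFirst("IFOO = 1", re).empty);
}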
Jul 31 2012
On 01-Aug-12 02:01, Philippe Sigaud wrote:On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Okay. Say lexer maps all unique strings that are not keywords to some ID. Then parser creates a stack of scoped symbol tables. These nested symbol tables use only IDs not strings themselves. (Though I expect that only semantic analysis require the use of these tables) -- Dmitry OlshanskyThe latter, I get. The former, not so much.I guess creating a tree of symbol tables according to scope visibility is then more the job of the parser, but I'm not sure.Parser can use constant IDs for nested tables, IDs point to string table. String table is populated by lexer.
Jul 31 2012
On Wed, Aug 1, 2012 at 8:12 AM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:On 01-Aug-12 02:01, Philippe Sigaud wrote:Ah, that! This I know, I misunderstood your initial post. The symbol table can also be pre-charged with keywords, if needed.On Tue, Jul 31, 2012 at 11:38 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Okay. Say lexer maps all unique strings that are not keywords to some ID. Then parser creates a stack of scoped symbol tables. These nested symbol tables use only IDs not strings themselves. (Though I expect that only semantic analysis require the use of these tables)The latter, I get. The former, not so much.I guess creating a tree of symbol tables according to scope visibility is then more the job of the parser, but I'm not sure.Parser can use constant IDs for nested tables, IDs point to string table. String table is populated by lexer.
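A minimal sketch of that ID-based string table, just to pin down the shape (names are illustrative; "pre-charging" it with keywords simply means interning them first):

struct StringTable
{
    string[] strings;        // ID -> string
    size_t[string] ids;      // string -> ID

    size_t intern(string s)
    {
        if (auto p = s in ids)
            return *p;
        auto id = strings.length;
        strings ~= s;
        ids[s] = id;
        return id;
    }
}

unittest
{
    StringTable tab;
    auto kwWhile = tab.intern("while");   // pre-charged keyword
    auto foo = tab.intern("foo");
    assert(tab.intern("foo") == foo && foo != kwWhile);
    assert(tab.strings[foo] == "foo");
}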
Jul 31 2012
On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see). Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate. - Jonathan M Davis
Jul 31 2012
On 2012-07-31 23:20, Jonathan M Davis wrote:I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see). Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate. - Jonathan M DavisThat's awesome. I'm really looking forward to this. Keep up the good work. -- /Jacob Carlborg
Aug 01 2012
On 2012-07-31 23:20, Jonathan M Davis wrote:I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see). Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate.BTW, do you have the code online somewhere? -- /Jacob Carlborg
Aug 01 2012
On Wednesday, August 01, 2012 13:30:31 Jacob Carlborg wrote:On 2012-07-31 23:20, Jonathan M Davis wrote:No, because I'm still in the middle of working on it. - Jonathan M DavisI'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see). Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate.BTW, do you have to code online somewhere?
Aug 01 2012
01.08.2012 1:20, Jonathan M Davis wrote:On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:Good. Will wait for the announcement. -- Денис В. Шеломовский Denis V. ShelomovskijHaving std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see). Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate. - Jonathan M Davis
Aug 01 2012
On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:That was quick! Cool!Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see).Writing it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate.Could you please describe the kind of token it produces? Can it build a symbol table? Does it recognize all kind of strings (including q{ } ones)? How does it deal with comments, particularly nested ones? Does it automatically discard whitespaces or produce them as a token? I'd favor this approach, if only because wrapping the lexer in a filter!noWS(tokenRange) is easy. Does it produce a lazy range btw?
Jul 31 2012
On 31-Jul-12 20:10, Kai Meyer wrote:On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:There are more and more projects requiring parsing D code (IDE plugins, DCT and dfmt now). We definitely need a _standard_ fast D parser (suitable as tokenizer). We already have a good one at least in Visual D. Maybe dmd's parser is faster. If so, it can be (easily?) rewritten in D. We also already have What about to get one and call it standard?I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based off of a defined grammar. As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed. With some very minor tweaks, and a properly written Grammar class, I basically have it already done. D was going to be one of the first languages I would have written a definition for. https://github.com/kai4785/firstfront I haven't had time to look through Pegged, but there are some differences in my approach that I can immediately see in Pegged's. 1) Pegged does compile-time or run-time code generation for your parser. Mine is run-time only and regex based. 2) Pegged uses PEG format, which I have been introduced to through this thread. One of my goals was to actually invent something like PEG. This will save me time :)I would love to receive some critical feedback on my approach as well as the actual code base.Okay. Critics go as follows: First, as an author of std.regex I'm pleased that you find it suitable to use it in lexer. :) Still there is a BIG word of warning: Lexer based on trying out an array of regexes until one will match is NOT fast and not even close to fast ones. It is acceptable as proof of concept/prototype kind of thing only. To help this use case I though of making multi-regex matching a-la: match(text, regex1, regex2, regex3...); then lexing would be effectively a loop of form: foreach(tok; match(text, regex1, regex2, regex3...)){ switch(tok[0]){//suppose match returns tuple - N of regex matched + the usual piece of text ... } } (with some glitches on /+ ... +/ and other non-regular stuff). But until such time it's not to be used seriously in this context. The reason is that each call to match does have some overhead to setup regex engine, it's constant but it's huge compared to what usual lexers typically do. The other thing is that you should use ctRegex for this kind of job if you can (i.e. compiler doesn't explode on it). (keep in mind I only glimpsed the source, will post more feedback later)To build it, you'll need cmake, and cmaked2 from here: http://code.google.com/p/cmaked2/wiki/GettingStarted Or just dmd *.d :) I haven't tried to build it on Windows yet, but I don't see anything that immediately jumps out as not cross-platform. I've been working on it on both Fedora and CentOS.rdmd for the win! -- Dmitry Olshansky
Jul 31 2012
On 07/31/2012 03:27 PM, Dmitry Olshansky wrote:On 31-Jul-12 20:10, Kai Meyer wrote:I cut my teeth on perl, so everything gets compared to perl in my mind. In perl, I can 'precompile' a regular expression, so I can avoid some overhead. So something like this: while(<STDIN>){ if($_ =~ m/<someregex>/){ some work; } } Would end up re-compiling the regex on each line from STDIN. Perl uses the term "compiling" the regular expression, which may be different than what you call "setup regex engine". Does/Can D's std.regex offer something like this? If not, I would be interested in why not. -Kai MeyerOn 07/05/2012 06:11 AM, Denis Shelomovskij wrote:There are more and more projects requiring parsing D code (IDE plugins, DCT and dfmt now). We definitely need a _standard_ fast D parser (suitable as tokenizer). We already have a good one at least in Visual D. Maybe dmd's parser is faster. If so, it can be (easily?) rewritten in D. We also already have What about to get one and call it standard?I know I'm a little late coming into this conversation. This seems like a nice thread to toss myself into. I've started working on a generic lexer that is based off of a defined grammar. As I read through the thread (I unfortunately don't have enough time to read every post, but I skimmed through as many as I could, and read the ones that seemed important), it seems like we need a parser in D that can lex D, and provide a Range of tokens that could be consumed. With some very minor tweaks, and a properly written Grammar class, I basically have it already done. D was going to be one of the first languages I would have written a definition for. https://github.com/kai4785/firstfront I haven't had time to look through Pegged, but there are some differences in my approach that I can immediately see in Pegged's. 1) Pegged does compile-time or run-time code generation for your parser. Mine is run-time only and regex based. 2) Pegged uses PEG format, which I have been introduced to through this thread. One of my goals was to actually invent something like PEG. This will save me time :)I would love to receive some critical feedback on my approach as well as the actual code base.Okay. Critics go as follows: First, as an author of std.regex I'm pleased that you find it suitable to use it in lexer. :) Still there is a BIG word of warning: Lexer based on trying out an array of regexes until one will match is NOT fast and not even close to fast ones. It is acceptable as proof of concept/prototype kind of thing only. To help this use case I though of making multi-regex matching a-la: match(text, regex1, regex2, regex3...); then lexing would be effectively a loop of form: foreach(tok; match(text, regex1, regex2, regex3...)){ switch(tok[0]){//suppose match returns tuple - N of regex matched + the usual piece of text ... } } (with some glitches on /+ ... +/ and other non-regular stuff). But until such time it's not to be used seriously in this context. The reason is that each call to match does have some overhead to setup regex engine, it's constant but it's huge compared to what usual lexers typically do. The other thing is that you should use ctRegex for this kind of job if you can (i.e. compiler doesn't explode on it). 
(keep in mind I only glimpsed the source, will post more feedback later)To build it, you'll need cmake, and cmaked2 from here: http://code.google.com/p/cmaked2/wiki/GettingStarted Or just dmd *.d :) I haven't tried to build it on Windows yet, but I don't see anything that immediately jumps out as not cross-platform. I've been working on it on both Fedora and CentOS.rdmd for the win!
Aug 06 2012
I cut my teeth on perl, so everything gets compared to perl in my mind. In perl, I can 'precompile' a regular expression, so I can avoid some overhead. So something like this: while(<STDIN>){ if($_ =~ m/<someregex>/){ some work; } } Would end up re-compiling the regex on each line from STDIN. Perl uses the term "compiling" the regular expression, which may be different than what you call "setup regex engine". Does/Can D's std.regex offer something like this? If not, I would be interested in why not.D does have compiled regex, it's called ctRegex (meaning compile-time regex): http://dlang.org/phobos/std_regex.html#ctRegex The tests done recently put it among the fastest regex engine known. Caution: not every regex extension known to mankind is implemented here!
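A quick illustration of both forms next to each other; the rough equivalent of Perl's qr// precompilation is simply keeping the regex() result around, while ctRegex goes a step further:

import std.regex;

unittest
{
    auto rt = regex(`\d+`);       // pattern parsed and compiled to bytecode at run time
    auto ct = ctRegex!(`\d+`);    // matcher generated as native code at compile time
    assert(matchFirst("abc123", rt).hit == "123");
    assert(matchFirst("abc123", ct).hit == "123");
}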
Aug 06 2012
On 06-Aug-12 22:52, Philippe Sigaud wrote:Of course regex is first compiled to bytecode (same thing as "compile" in perl). Moreover if you use regex pattern directly it is compiled on first use and put into TLS cache of compiled patterns. From now on it's used in compiled form. (there about 8 entries in cache, don't relay on it too much). What I mean by "setup engine" is extra cost spent on _starting_ matching with compiled pattern. For one thing it adds 1 call to malloc/free. Again I think in Perl the same thing applies. In other words doing continuous search (via foreach(x; match(..., "g" )) ) is much faster then calling match on individual pieces over and over again in cycle.I cut my teeth on perl, so everything gets compared to perl in my mind. In perl, I can 'precompile' a regular expression, so I can avoid some overhead. So something like this: while(<STDIN>){ if($_ =~ m/<someregex>/){ some work; } } Would end up re-compiling the regex on each line from STDIN. Perl uses the term "compiling" the regular expression, which may be different than what you call "setup regex engine".Of course it does, in fact you can't match without compiling pattern first (it just provides a shortcut that does it for you behind the scenes).Does/Can D's std.regex offer something like this? If not, I would be interested in why not.D does have compiled regex, it's called ctRegex (meaning compile-time regex): http://dlang.org/phobos/std_regex.html#ctRegexAnd there is a second version - compiled native code. Unlike perl it's not bytecode and thus usually much faster. Frankly the most slow regex I've seen is in Python, the second most sucky one is PCRE (but is incredibly popular somehow). Perl is not bad but usually slower then top dogs from C++ & std.regex.The tests done recently put it among the fastest regex engine known.Yeah, on top of said chart. Needless to say the test favored my implementation, ctRegex is not the fastest regex engine in general (yet).Caution: not every regex extension known to mankind is implemented here!Sure as hell. In fact, the most problematic thing is that parser often fails during CTFE. Also I have a solid plan on enhancing a bunch of things effectively making std.regex v2 but no sooner then October-November. -- Dmitry Olshansky
Aug 06 2012
On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Of course regex is first compiled to bytecode (same thing as "compile" in perl). Moreover if you use regex pattern directly it is compiled on first use and put into TLS cache of compiled patterns. From now on it's used in compiled form. (there about 8 entries in cache, don't relay on it too much).Btw, I wanted to ask you that for a long time: what do you mean by 'compiled to bytecode', for D source?And there is a second version - compiled native code. Unlike perl it's not bytecode and thus usually much faster.Which?Frankly the most slow regex I've seen is in Python, the second most sucky one is PCRE (but is incredibly popular somehow). Perl is not bad but usually slower then top dogs from C++ & std.regex.Do people *really* need speed, or a great number of extensions?Sure as hell. In fact, the most problematic thing is that parser often fails during CTFE.For example?Also I have a solid plan on enhancing a bunch of things effectively making std.regex v2 but no sooner then October-November.That's good to know.
Aug 06 2012
On 06-Aug-12 23:43, Philippe Sigaud wrote:On Mon, Aug 6, 2012 at 9:31 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:by this: auto r = regex(some_string); // where some_string can come from user input. r - contains bytecode that matches pattern. Unlike ctRegex which does output D source code and compiles it to native code.Of course regex is first compiled to bytecode (same thing as "compile" in perl). Moreover if you use regex pattern directly it is compiled on first use and put into TLS cache of compiled patterns. From now on it's used in compiled form. (there about 8 entries in cache, don't relay on it too much).Btw, I wanted to ask you that for a long time: what do you mean by 'compiled to bytecode', for D source?Compiled native code is faster. Or what youAnd there is a second version - compiled native code. Unlike perl it's not bytecode and thus usually much faster.Which?They want both. In my experience you can't satisfy everybody. I think features are not what people look for in regex, even basic ECMA-262 stuff is way above what most programmers need to do. Unlike extra speed which even newbies could use :) And while I'm at it - we already have a damn good collection of extensions, way too much I'd say. For example, I've yet to see one regex engine that support Unicode to the same level as std.regex, namely I haven't seen a single one with full set operations (not only union but subtraction, intersection etc.) inside of char class [...]. Funny even ICU one didn't support them a year ago, dunno about now. Also some extensions come from implementations inherent inefficiency e.g. (as in Perl) possessive quantifiers, atomic groups. No wonder it's so hard to make look-behind unrestricted in Perl, the whole thing is a mess.Frankly the most slow regex I've seen is in Python, the second most sucky one is PCRE (but is incredibly popular somehow). Perl is not bad but usually slower then top dogs from C++ & std.regex.Do people *really* need speed, or a great number of extensions?Ehmn. See bugzilla, search ctRegex. But not yet filed are: said non-union set operations usually fail + the fact that Unicode properties are not readable at CT (thus \p{xxx} fails).Sure as hell. In fact, the most problematic thing is that parser often fails during CTFE.For example?-- Dmitry OlshanskyAlso I have a solid plan on enhancing a bunch of things effectively making std.regex v2 but no sooner then October-November.That's good to know.
Aug 06 2012
On 06-Aug-12 23:59, Dmitry Olshansky wrote:Compiled native code is faster. Perl is slower. -- Dmitry OlshanskyWhich?And there is a second version - compiled native code. Unlike perl it's notbytecode and thus usually much faster.
Aug 06 2012
On Mon, Aug 6, 2012 at 10:00 PM, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:On 06-Aug-12 23:59, Dmitry Olshansky wrote:Sorry, I meant: what second version? How do you distinguish between bytecode-producing and compiled-code producing ctRegex?Compiled native code is faster. Perl is slower.Which?And there is a second version - compiled native code. Unlike perl it's notbytecode and thus usually much faster.
Aug 06 2012
On Tuesday, 7 August 2012 at 04:58:05 UTC, Philippe Sigaud wrote:Sorry, I meant: what second version? How do you distinguish between bytecode-producing and compiled-code producing ctRegex?As far as I know, ctRegex _always_ produces machine code. Bytecode is what the runtime implementation, i.e. regex, uses. David
Aug 06 2012
On Tue, Aug 7, 2012 at 7:02 AM, David Nadlinger <see klickverbot.at> wrote:As far as I know, ctRegex _always_ produces machine code. Bytecode is what the runtime implementation, i.e. regex, uses.That's what I had in mind too, but Dmitry seemed to say things are more complicated. Maybe I misunderstood. And do you know why it's called bytecode? I never heard of 'bytecode' for D outside of std.regex. Does that just mean 'intermediate form D code', as in 'bunch of small and easily optimized instructions.'?
Aug 06 2012
On 07-Aug-12 09:11, Philippe Sigaud wrote:On Tue, Aug 7, 2012 at 7:02 AM, David Nadlinger <see klickverbot.at> wrote:Yeah, sorry for confusion - there are only 2 ways to do the job: regex and ctRegex. The first one contains special bytecode, the second one generates native code.As far as I know, ctRegex _always_ produces machine code. Bytecode is what the runtime implementation, i.e. regex, uses.That's what I had in mind too, but Dmitry seemed to say things are more complicated. Maybe I misunderstood.And do you know why it's called bytecode? I never heard of 'bytecode' for D outside of std.regex.It's a special code that encodes pattern matching machine :) It's not a bytecode for D. In other words instruction go as follows: match char, match one of few chars, match one of character set, match any char, jump/loop/option etc.Does that just mean 'intermediate form D code', as in 'bunch of small and easily optimized instructions.'?No :) So far there are only 3 cases: RT parsing ---> Bytecode ( { auto x = regex("blah) }) CT parsing ---> Bytecode ( static x = regex("blah"); ) CT parsing ---> Machine code ( static/auto x = ctRegex!"blah" ) It would be nice to have JIT compiler for D at RT but no luck yet. (std.jit?) Then the picture would be complete. -- Dmitry Olshansky
Aug 06 2012
On Tuesday, July 31, 2012 23:39:38 Philippe Sigaud wrote:On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisProg gmx.com>wrote:Yeah. Once I started on it, I made a lot of progress really quickly. There's still a fair bit to do (primarily having to do with literals), but it probably won't take all that much longer. Certainly, I'd expect to have it done within a couple of weeks if not sooner, unless something goes wrong.On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:That was quick! Cool!Having std.lexer in Phobos would be quite good. With a pre-compiled lexer for D.I'm actually quite far along with one now - one which is specifically written and optimized for lexing D. I'll probably be done with it not too long after the 2.060 release (though we'll see).Well, it's still a work in progress, so it certainly can be adjusted as necessary. I intend it to fully implement the spec (and make sure that both it and the spec match what dmd is doing) as far as lexing goes. The idea is that you should be able to build a fully compliant D parser on top of it and build a fully compliant D compiler on top of that. It already supports all of the comment types and several of the string literal types. I haven't sorted out q{} yet, but I will before I'm done, and that may or may not affect how some things work, since I'm not quite sure how to handle q{} yet (it may end up being done with tokens marking the beginning and end of the token sequence encompassed by q{}, but we'll see). I'm in the middle of dealing with the named entity stuff at the moment, which unfortunately has revealed a rather nasty compiler bug with regards to template compile times, which I still need to report (I intend to do that this evening). The file generating the table of named entities currently takes over 6 minutes to compile on my Phenom II thanks to that bug, so I'm not quite sure how that's going to affect things. Regardless, the lexer should support _everything_ as far as what's required for fully lexing D goes by the time that I'm done. I don't have the code with me at the moment, but I believe that the token type looks something like struct Token { TokenType type; string str; LiteralValue value; SourcePos pos; } struct SourcePos { size_t line; size_ col; size_t tabWidth = 8; } The type is an enum which gives the type of the token (obviously) which includes the various comment types and an error type (so errors are reported by getting a token that was an error token rather than throwing or anything like that, which should make lexing passed malformed stuff easy). str holds the exact text which was lexed (this includes the entire comment for the various comment token types), which in the case of lexing string rather than another range type would normally (always? - I don't remember) be a slice of the string being lexed, which should help make lexing string very efficient. It may or may not make sense to change that to the range type used rather than string. For nesting block comments, the whole comment is one token (with the token type which is specifically for nested comments), regardless of whether there's any nesting going on. But that could be changed if there were a need to get separate tokens for the comments inside. 
LiteralValue is a VariantN of the types that a literal can be (long, ulong, real, and string IIRC) and is empty unless the token is a literal type (the various string postfixes - c,w, and d - are treated as different token types rather than giving the literal value different string types - the same with the integral and floating point literals). And pos holds the position in the text where the token started, which should make it easy to use for syntax highlighting and the like (as well as indicating where an error occurred). The initial position is passed as an optional argument to the lexing function, so it doesn't have to be 1:1 (though that's the default), and it allows you to select the tabwidth. So, you'll pass a range and an optional starting position to the lexing function, and it'll return a lazy range of Tokens. Whitespace is stripped as part of the lexing process, but if you take the pos properties of two adjacent tokens, you should be able to determine how much whitespace was between them. I _think_ that that's how it currently is, but again, I don't have the code with me at the moment, so it may not be 100% correct. And since it's a work in progress, it's certainly open to changes. - Jonathan M DavisWriting it has been going surprisingly quickly actually, and I've already found some bugs in the spec as a result (some of which have been fixed, some of which I still need to create pull requests for). So, regardless of what happens with my lexer, at least the spec will be more accurate.Could you please describe the kind of token it produces? Can it build a symbol table? Does it recognize all kind of strings (including q{ } ones)? How does it deal with comments, particularly nested ones? Does it automatically discard whitespaces or produce them as a token? I'd favor this approach, if only because wrapping the lexer in a filter!noWS(tokenRange) is easy. Does it produce a lazy range btw?
Jul 31 2012
"Jonathan M Davis" , dans le message (digitalmars.D:173860), a crit:struct Token { TokenType type; string str; LiteralValue value; SourcePos pos; } struct SourcePos { size_t line; size_t col; size_t tabWidth = 8; }The occurence of tabWidth surprises me. What is col supposed to be ? an index (code unit), a character number (code point), an estimation of where the caracter is supposed to be printed on the line, given the provided tabwidth ? I don't think the lexer can realy try to calculate at what column the character is printed, since it depends on the editor (if you want to use the lexer to syntax highlight for example), how it supports combining characters, zero or multiple column characters, etc. (which you may not want to have to decode). You may want to provide the number of tabs met so far. Note that there are other whitespace that you may want to count, but you shouldn't have a very complicated SourcePos structure. It might be easier to have whitespace, endofline and endoffile tokens, and let the user filter out or take into account what he wants to take into account. Or just let the user look into the original string...
Aug 01 2012
On Wednesday, August 01, 2012 07:12:30 Christophe Travert wrote:"Jonathan M Davis", in the message (digitalmars.D:173860), wrote:

struct Token
{
    TokenType type;
    string str;
    LiteralValue value;
    SourcePos pos;
}

struct SourcePos
{
    size_t line;
    size_t col;
    size_t tabWidth = 8;
}

The occurrence of tabWidth surprises me. What is col supposed to be? An index (code unit), a character number (code point), an estimation of where the character is supposed to be printed on the line, given the provided tabwidth?

col counts code points. tabWidth is the number of code points that '\t' is considered to be. That's it. So, in theory, an editor should be able to use it to indicate where on the line the token starts.

If the code using the lexer wants to treat tabs as having the width of a single code point, then all they have to do is pass in a SourcePos with a tabWidth of 1. But if the lexer doesn't have a way to count tabs differently, then there's no way for the code using the lexer to figure out the tabs without going back and lexing the whitespace itself. But counting tabs as a different width than everything else is so common that it seemed prudent to add it. Given that Phobos doesn't support graphemes and that ranges use code points, a code point is the closest to a character that the lexer would be counting, and it just makes sense to count code points.

Now, the one thing that might be a problem with treating tabs as a fixed width is that it's not uncommon to treat tabs as being up to a particular width rather than having a fixed width, such that if there are other characters before a tab, then the tab's width is the max tab width minus the number of characters since the start of that tab width. e.g. if tabs had a max width of 8, then \t123\t could have the first tab with a width of 8 and the second as only having a width of 5. But that's too complicated to deal with in the lexer IMHO.

Maybe the tab width isn't worth having in SourcePos and it will ultimately be removed, but it struck me as a useful feature, so I added it.

- Jonathan M Davis
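Spelling out the fixed-width rule described above as code (columnAfter is purely illustrative, not part of the lexer being discussed): every '\t' advances the column by tabWidth, every other code point by one.

size_t columnAfter(string line, size_t tabWidth = 8)
{
    size_t col = 1;                     // columns are 1-based by default
    foreach (dchar c; line)             // foreach with dchar decodes by code point
        col += (c == '\t') ? tabWidth : 1;
    return col;
}

unittest
{
    assert(columnAfter("") == 1);
    assert(columnAfter("\tif") == 1 + 8 + 2);
}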
Aug 01 2012
On 2012-08-01 00:38, Jonathan M Davis wrote:I don't have the code with me at the moment, but I believe that the token type looks something like struct Token { TokenType type; string str; LiteralValue value; SourcePos pos; } struct SourcePos { size_t line; size_ col; size_t tabWidth = 8; }What about the end/length of a token? Token.str.length would give the number of bytes (code units?) instead of the number of characters (code points?). I'm not entirely sure what's needed when, for example, doing syntax highlighting. I assume you would know the length in characters of a given token internally inside the lexer? -- /Jacob Carlborg
Aug 01 2012
On Wednesday, August 01, 2012 10:25:18 Jacob Carlborg wrote:On 2012-08-01 00:38, Jonathan M Davis wrote:I'm not sure. I don't think so. It doesn't really keep track of code points. It operates in code units as much as possible, and pos doesn't really help, because any newline that occurred would make it so that subtracting the start col from the end col would be completely bogus (that and tabs would mess that up pretty thoroughly, but as Christophe pointed out, the whole tabWidth thing may not actually have been a good idea anyway). It could certainly be added, but unless the lexer always knows it (and I'm pretty sure that it doesn't), then keeping track of that entails extra overhead. But maybe it's worth that overhead. I'll have to look at what I have and see. Worst case, the caller can just use walkLength on str, but if it has to do that all the time, then that's not exactly conducive to good performance. - Jonathan M DavisI don't have the code with me at the moment, but I believe that the token type looks something like struct Token { TokenType type; string str; LiteralValue value; SourcePos pos; } struct SourcePos { size_t line; size_t col; size_t tabWidth = 8; }What about the end/length of a token? Token.str.length would give the number of bytes (code units?) instead of the number of characters (code points?). I'm not entirely sure what's needed when, for example, doing syntax highlighting. I assume you would know the length in characters of a given token internally inside the lexer?
Aug 01 2012
On 2012-08-01 10:40, Jonathan M Davis wrote:It could certainly be added, but unless the lexer always knows it (and I'm pretty sure that it doesn't), then keeping track of that entails extra overhead. But maybe it's worth that overhead. I'll have to look at what I have and see. Worst case, the caller can just use walkLength on str, but if it has to do that all the time, then that's not exactly conducive to good performance.Doing a syntax highlighter and calculating the length (walkLength) for each token would most likely slow it down a bit. -- /Jacob Carlborg
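To make the cost concrete, this is roughly what the walkLength fallback looks like from the caller's side; Token is cut down to its str field here, and the names are only illustrative.

    import std.range : walkLength;
    import std.stdio : writeln;

    struct Token { string str; } // reduced to the one field that matters here

    // Length in code points, obtained by decoding the slice; this extra
    // decode pass per token is the overhead being discussed.
    size_t codePointLength(Token tok)
    {
        return tok.str.walkLength;
    }

    void main()
    {
        auto tok = Token("écrit");
        writeln(tok.str.length);       // 6 code units (UTF-8 bytes)
        writeln(codePointLength(tok)); // 5 code points
    }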
Aug 01 2012
On Wednesday, August 01, 2012 07:26:07 Philippe Sigaud wrote:On Wed, Aug 1, 2012 at 12:58 AM, Jonathan M Davis <jmdavisProg gmx.com>wrote:On Wednesday, August 01, 2012 00:54:56 Timon Gehr wrote:OK. Same for standard comment and doc comments?Ddoc is typically not required. By default it should be treated as whitespace. If it is required, one token seems reasonable: The post-processing of the doc comment is best done as a separate step.That was how I was intending to deal with ddoc. It's just a nested block comment token. The whole comment string is there, so the ddoc processor can use that to do whatever it does. ddoc isn't part of lexing really. It's a separate thing.From the TokenType enum declaration:blockComment, /// $(D /* */) lineComment, /// $(D // ) nestingBlockComment, /// $(D /+ +/) There are then functions which operate on Tokens to give you information about them. Among them is isDdocComment, which will return true if the Token type is a comment, and that comment is a ddoc comment (i.e. starts with /**, ///, or /++ rather than /*, //, or /+). So, anything that wants to process ddoc comments can lex them out and process them, and if they want to know what symbols that a ddoc comment applies to, then they look at the tokens that follow (though a full-on parser would be required to do that correctly).I was wondering how to get the code possibly inside a ---- / ---- block (I never dealt with something like documentation or syntax highlighting), but your solution makes it easy: Token(TokenType.DocComment, "/** ... */"), Token(TokenType.Module, "module"), ... A small specialised parser can then extract text, DDoc macros and code blocks from inside the comment. Finding and stripping '----' is easy and then the lexer can be locally reinvoked on the slice containing the example code.Yes. The lexer isn't concerned with where the text comes from, and it isn't concerned with lexing comments beyond putting them in a token. But that should be powerful enough to lex the examples if you've already extracted them. - Jonathan M Davis
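The isDdocComment function itself isn't shown in the thread, so the following is only a guess at the check it describes (comment text starting with /**, ///, or /++), ignoring corner cases such as /**/ that a real implementation has to get right. The function name is hypothetical.

    import std.algorithm : startsWith;

    // Hypothetical predicate matching the description above, not the actual
    // function from the work-in-progress lexer.
    bool looksLikeDdocComment(string commentText)
    {
        return commentText.startsWith("/**")
            || commentText.startsWith("///")
            || commentText.startsWith("/++");
    }

    unittest
    {
        assert(looksLikeDdocComment("/// a doc comment"));
        assert(looksLikeDdocComment("/** a doc comment */"));
        assert(looksLikeDdocComment("/++ a doc comment +/"));
        assert(!looksLikeDdocComment("/* a plain comment */"));
        assert(!looksLikeDdocComment("// a plain comment"));
    }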
Jul 31 2012
Thanks for taking the time to write such an explanation!Yeah. Once I started on it, I made a lot of progress really quickly. There's still a fair bit to do (primarily having to do with literals), but it probably won't take all that much longer. Certainly, I'd expect to have it done within a couple of weeks if not sooner, unless something goes wrong.Is it based on a prioritized list of regexes?Well, it's still a work in progress, so it certainly can be adjusted as necessary. I intend it to fully implement the spec (and make sure that both it and the spec match what dmd is doing) as far as lexing goes. The idea is that you should be able to build a fully compliant D parser on top of it and build a fully compliant D compiler on top of that.My fear (having encoded the D grammar once) is that 99% of the code constructs are easily done (even new stuff like the => syntax), but that the remaining 1% is a nightmare. The many different kinds of strings with or without escaping, for example. Your lexer returns a string literal as a string field, right? If so, escaping chars like " and ` makes me nervous.It already supports all of the comment types and several of the string literal types. I haven't sorted out q{} yet, but I will before I'm done, and that may or may not affect how some things work, since I'm not quite sure how to handle q{} yet (it may end up being done with tokens marking the beginning and end of the token sequence encompassed by q{}, but we'll see).I see. My question was because, as I said before, I found this to be a difficult and hairy part of the D grammar, and an IDE might want to colorize/manipulate the content of a string, when it knows it's a string mixin.I don't have the code with me at the moment, but I believe that the token type looks something like struct Token { TokenType type; string str; LiteralValue value; SourcePos pos; }Seems OK. I guess it can be templated on the string type, because you sure want to lex dstrings.The type is an enum which gives the type of the token (obviously) which includes the various comment types and an error type (so errors are reported by getting a token that was an error token rather than throwing or anything like that, which should make lexing past malformed stuff easy).And the recovery is done on the next token. Good idea.str holds the exact text which was lexed (this includes the entire comment for the various comment token types), which in the case of lexing string rather than another range type would normally (always? - I don't remember) be a slice of the string being lexed, which should help make lexing string very efficient.Yeah, you may as well use this excellent feature. But that means the input must be held in memory always, no? If it's an input range (coming from a big file), can you keep slices from it?It may or may not make sense to change that to the range type used rather than string. For nesting block comments, the whole comment is one token (with the token type which is specifically for nested comments), regardless of whether there's any nesting going on. But that could be changed if there were a need to get separate tokens for the comments inside.That's OK, as I reply in another message. It's now quite clear for me how to deal with that. IDE/doc generator can parse the comment at their leisure, as they are not long nor critical sections of code.
They can even easily get the code fragments used as examples inside ---/--- markers.LiteralValue is a VariantN of the types that a literal can be (long, ulong, real, and string IIRC) and is empty unless the token is a literal typeYou can make it an Algebraic!(long, ulong, real, string), as your universe of types is restricted. That can even catch a non-legal value. Array literals are not considered literals, since they can contain arbitrary expressions, is that right? And it's a good idea to drop the complex literals once and for all.(the various string postfixes - c, w, and d - are treated as different token types rather than giving the literal value different string types - the same with the integral and floating point literals).That seems reasonable enough, but can you really store a dstring literal in a string field?And pos holds the position in the text where the token started, which should make it easy to use for syntax highlightingDoes syntax highlighting need more than a token stream? Without having thought a lot about it, it seems to me IDEs tend to highlight based just on the token type, not on a parse tree. So that means your lexer can be used directly by interested people, that's nice.and the like (as well as indicating where an error occurred). The initial position is passed as an optional argument to the lexing function, so it doesn't have to be 1:1 (though that's the default), and it allows you to select the tab width.I really like the idea of having malformed tokens become error tokens, with the lexing continuing. As I said before, that enables an IDE to continue to do syntax highlighting.So, you'll pass a range and an optional starting position to the lexing function, and it'll return a lazy range of Tokens. Whitespace is stripped as part of the lexing process, but if you take the pos properties of two adjacent tokens, you should be able to determine how much whitespace was between them.OK, so whitespace is strictly that, and comments are not dropped. That's OK. I guess a D parser would then just filter the comments out of the token range.I _think_ that that's how it currently is, but again, I don't have the code with me at the moment, so it may not be 100% correct. And since it's a work in progress, it's certainly open to changes.My only quibble for now would be the string/dstring thingy. Did you achieve UFCS wrt floating point values? Is 1.f seen as a float or as f called on 1 (int)?
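A short sketch of the Algebraic alternative being suggested here, just to make the comparison with VariantN concrete; whether the lexer actually switches to it is left open in the thread.

    import std.variant : Algebraic;

    // Restricted to the types a D literal's value can have, per the discussion.
    alias LiteralValue = Algebraic!(long, ulong, real, string);

    unittest
    {
        LiteralValue v = 42L;
        assert(v.peek!long !is null && v.get!long == 42);

        v = 3.14L; // a real
        assert(v.peek!real !is null);

        v = "hello";
        assert(v.get!string == "hello");
    }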
Jul 31 2012
On 2012-08-01 07:44, Philippe Sigaud wrote:Does syntax highlighting need more than a token stream? Without having thought a lot about it, it seems to me IDEs tend to highlight based just on the token type, not on a parse tree. So that means your lexer can be used directly by interested people, that's nice.Some IDEs do more advanced syntax highlighting based on semantic analysis. For example, Eclipse highlights instance variables differently. -- /Jacob Carlborg
Aug 01 2012
On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:Is it based on a prioritized list of regexes?I'm not using regexes at all. It's using string mixins to reduce code duplication, but it's effectively hand-written. If I do it right, it should be _very_ difficult to make it any faster than it's going to be. It even specifically avoids decoding unicode characters and operates on ASCII characters as much as possible.Well, if you want the original text, there's the str property, which contains everything that was in the source code (including whatever escaping there was), whereas the value property includes only the body, and it will have already processed whatever escaped characters needed to be processed. It's the actual value of the literal. So, \` would be ` in the value property, named entities would be their actual code points, etc. So, I don't really see a problem there. Sure, depending on what is done with the processed string, it could cause further problems, but that always happens when doing string processing.Well, it's still a work in progress, so it certainly can be adjusted as necessary. I intend it to fully implement the spec (and make sure that both it and the spec match what dmd is doing) as far as lexing goes. The idea is that you should be able to build a fully compliant D parser on top of it and build a fully compliant D compiler on top of that.My fear (having encoded the D grammar once) is that 99% of the code constructs are easily done (even new stuff like the => syntax), but that the remaining 1% is a nightmare. The many different kinds of strings with or without escaping, for example. Your lexer returns a string literal as a string field, right? If so, escaping chars like " and ` makes me nervous.Well, it could very well cause me trouble. I haven't gotten to it yet, and I'll need to study it to fully understand what it does before crafting my solution, but it'll probably end up being a sequence of tokens of some kind given that that's what it is - a string of tokens. Regardless, it'll be done in a way that whatever is using the lexer will be able to know what it is and process it accordingly.It already supports all of the comment types and several of the string literal types. I haven't sorted out q{} yet, but I will before I'm done, and that may or may not affect how some things work, since I'm not quite sure how to handle q{} yet (it may end up being done with tokens marking the beginning and end of the token sequence encompassed by q{}, but we'll see).I see. My question was because, as I said before, I found this to be a difficult and hairy part of the D grammar, and an IDE might want to colorize/manipulate the content of a string, when it knows it's a string mixin.Well, whatever is using the lexer is going to have to make sure that what it passes to the lexer continues to exist while it uses the lexer. Given how slicing works in D, it's pretty much a given that lexers and parsers are going to use slices heavily, and any code using a lexer or parser is going to have to take that into account. Slicing is one of the major reasons that Tango's XML parser is so insanely fast.str holds the exact text which was lexed (this includes the entire comment for the various comment token types), which in the case of lexing string rather than another range type would normally (always? - I don't remember) be a slice of the string being lexed, which should help make lexing string very efficient.Yeah, you may as well use this excellent feature. 
But that means the input must be held in memory always, no? If it's an input range (coming from a big file), can you keep slices from it?I'll have to look at that to see whether using Algebraic is better. I'm not super-familiar with std.variant, so I may have picked the wrong type. However, VariantN already holds a specific set of types (unlike Variant), so that part isn't a problem.LiteralValue is a VariantN of the types that a literal can be (long, ulong, real, and string IIRC) and is empty unless the token is a literal typeYou can make it an Algebraic!(long, ulong, real, string), as your universe of types is restricted. That can even catch a non-legal value.Array literals are not considered literals, since they can contain arbitrary expressions, is that right?There is no concept of array literal in the lexing portion of the spec. There may be at the level of the parser, but that's not the lexer's problem.And it's a good idea to drop the complex literals once and for all.I'm not supporting any symbols which are known to be scheduled for deprecation (e.g. !<> and !>=). The _only_ stuff which I'm supporting along those lines is to-be-deprecated keywords (e.g. volatile and delete), since they still won't be legal to use as identifiers. And there's a function to query a token as to whether it's using a deprecated or unused keyword so that the program using the lexer can flag it if it wants to.Yeah. Why not? The string is the same in the source code regardless of the type of the literal. The only difference is the letter tacked onto the end. That will be turned into the appropriate string type by the semantic analyzer, but the lexer doesn't care.(the various string postfixes - c, w, and d - are treated as different token types rather than giving the literal value different string types - the same with the integral and floating point literals).That seems reasonable enough, but can you really store a dstring literal in a string field?Does syntax highlighting need more than a token stream? Without having thought a lot about it, it seems to me IDEs tend to highlight based just on the token type, not on a parse tree. So that means your lexer can be used directly by interested people, that's nice.If all you want to do is highlight specific symbols and keywords, then you don't need a parser. If you want to do fancier stuff related to templates and the like, then you'll need a parser and maybe even a semantic analyzer. But a lexer should be plenty for basic syntax highlighting.I really like the idea of having malformed tokens become error tokens, with the lexing continuing. As I said before, that enables an IDE to continue to do syntax highlighting.That was the idea. You need to be able to continue lexing beyond errors in many cases, and that seemed the natural way to do it.Did you achieve UFCS wrt floating point values? Is 1.f seen as a float or as f called on 1 (int)?I haven't tackled floating point literals yet, but since the f in 1.f is not part of the floating point literal, I'll have to handle it. I suspect that that's one area where I'll need to update the spec though, since it probably hasn't been updated for that yet. 
Basically, the lexer that I'm writing needs to be 100% compliant with the D spec and dmd (which means updating the spec or fixing dmd in some cases), and it needs to be possible to build on top of it anything and everything that dmd does that would use a lexer (even if it's not the way that dmd currently does it) so that it's possible to build a fully compliant D parser and compiler on top of it as well as a ddoc generator and anything else that you'd need to do with a lexer for D. So, if you have any questions about what my lexer does (or is supposed to do) with regards to the spec, that should answer it. If my lexer doesn't match the spec or dmd when I'm done (aside from specific exceptions relating to stuff like deprecated symbols), then I screwed up. - Jonathan M Davis
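To make the "string mixins to reduce code duplication" idea mentioned above concrete, here is a toy version of the trick for a handful of single-character operators. It only illustrates the technique; it is not code from the actual lexer, and all names are made up.

    import std.stdio : writeln;

    enum TokenType { plus, minus, star, slash, invalid }

    // Builds "case '+': return TokenType.plus;" and so on at compile time,
    // so the switch below is still a plain hard-coded switch after expansion.
    string makeCases(string[][] pairs)
    {
        string code;
        foreach (p; pairs)
            code ~= "case '" ~ p[0] ~ "': return TokenType." ~ p[1] ~ ";\n";
        return code;
    }

    TokenType lexOperator(char c)
    {
        switch (c)
        {
            mixin(makeCases([["+", "plus"], ["-", "minus"],
                             ["*", "star"], ["/", "slash"]]));
            default:
                return TokenType.invalid;
        }
    }

    void main()
    {
        writeln(lexOperator('+')); // plus
        writeln(lexOperator('?')); // invalid
    }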
Jul 31 2012
On 2012-08-01 08:11, Jonathan M Davis wrote:I'm not using regexes at all. It's using string mixins to reduce code duplication, but it's effectively hand-written. If I do it right, it should be _very_ difficult to make it any faster than it's going to be. It even specifically avoids decoding unicode characters and operates on ASCII characters as much as possible.That's a good idea. Most code can be treated as ASCII (I assume most people code in English). It would basically only be string literals containing characters outside the ASCII table. BTW, have you seen this: http://woboq.com/blog/utf-8-processing-using-simd.html -- /Jacob Carlborg
Aug 01 2012
On Wednesday, August 01, 2012 11:14:52 Jacob Carlborg wrote:On 2012-08-01 08:11, Jonathan M Davis wrote:What's of particular importance is the fact that _all_ of the language constructs are ASCII. So, unicode comes in exclusively with identifiers, string literals, char literals, and whitespace. And with those, ASCII is still going to be far more common, so coding it in a way that makes ASCII faster at a slight cost to performance for unicode is perfectly acceptable.I'm not using regexes at all. It's using string mixins to reduce code duplication, but it's effectively hand-written. If I do it right, it should be _very_ difficult to make it any faster than it's going to be. It even specifically avoids decoding unicode characters and operates on ASCII characters as much as possible.That's a good idea. Most code can be treated as ASCII (I assume most people code in English). It would basically only be string literals containing characters outside the ASCII table.BTW, have you seen this: http://woboq.com/blog/utf-8-processing-using-simd.htmlNo, I'll have to take a look. I know pretty much nothing about SIMD though. I've only heard of it because Walter implemented some SIMD stuff in dmd not too long ago. - Jonathan M Davis
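A small sketch of the ASCII-first strategy being described, assuming plain string input; the point is just that the common case never calls the UTF-8 decoder. The function is illustrative, not the lexer's actual code.

    import std.utf : decode;

    // ASCII fast path: for code units below 0x80 the code unit is the code
    // point, so no decoding is needed; everything else falls back to decode.
    dchar nextChar(string source, ref size_t index)
    {
        immutable c = source[index];
        if (c < 0x80)
        {
            ++index;
            return c;
        }
        return decode(source, index); // decode advances index past the sequence
    }

    unittest
    {
        size_t i = 0;
        assert(nextChar("a€", i) == 'a' && i == 1);
        assert(nextChar("a€", i) == '€' && i == 4); // '€' is 3 UTF-8 code units
    }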
Aug 01 2012
On Wed, Aug 1, 2012 at 8:11 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:On Wednesday, August 01, 2012 07:44:33 Philippe Sigaud wrote:OK. It'll be more difficult to genericize, then, but that's not your problem (could be mine, though).Is it based on a prioritized list of regexes?I'm not using regexes at all. It's using string mixins to reduce code duplication, but it's effectively hand-written. If I do it right, it should be _very_ difficult to make it any faster than it's going to be. It even specifically avoids decoding unicode characters and operates on ASCII characters as much as possible.Well, whatever is using the lexer is going to have to make sure that what it passes to the lexer continues to exist while it uses the lexer.Yeah, I realized that just after posting. And anyway, the tokens are made to be consumed at once, normally.I'll have to look at that to see whether using Algebraic is better. I'm not super-familiar with std.variant, so I may have picked the wrong type. However, VariantN already holds a specific set of types (unlike Variant), so that part isn't a problem.OK, I forgot about VariantN (I'm not so used to std.variant either)I'm not supporting any symbols which are known to be scheduled for deprecation (e.g. !<> and !>=). The _only_ stuff which I'm supporting along those lines is to-be-deprecated keywords (e.g. volatile and delete), since they still won't be legal to use as identifiers. And there's a function to query a token as to whether it's using a deprecated or unused keyword so that the program using the lexer can flag it if it wants to.Good idea. A good set of query functions will be a nice complement - isDoc - isComment - isString - isDeprecated ...I don't get it. Say I have a literal with non-UTF-8 chars, how will it be stored inside the .str field as a string?That seems reasonable enough, but can you really store a dstring literal in a string field?Yeah. Why not? The string is the same in the source code regardless of the type of the literal. The only difference is the letter tacked onto the end. That will be turned into the appropriate string type by the semantic analyzer, but the lexer doesn't care.Basically, the lexer that I'm writing needs to be 100% compliant with the D spec and dmd (which means updating the spec or fixing dmd in some cases), and it needs to be possible to build on top of it anything and everything that dmd does that would use a lexer (even if it's not the way that dmd currently does it) so that it's possible to build a fully compliant D parser and compiler on top of it as well as a ddoc generator and anything else that you'd need to do with a lexer for D. So, if you have any questions about what my lexer does (or is supposed to do) with regards to the spec, that should answer it. If my lexer doesn't match the spec or dmd when I'm done (aside from specific exceptions relating to stuff like deprecated symbols), then I screwed up.That's a lofty goal, but that would indeed be quite good to have an officially integrated lexer in Phobos that would (as Andrei said) "be it". The token spec would be the lexer.
Jul 31 2012
On Wednesday, August 01, 2012 08:20:04 Philippe Sigaud wrote:OK. It'll be more difficult to genericize, then, but that's not your problem (could be mine, though).It was never intended to be even vaguely generic. It's targeting D specifically. If someone can take it and make it generic when I'm done, then great. But its goal is to lex D as efficiently as possible, and it'll do whatever it takes to do that. From how the main switch statement's cases are constructed though, there's a lot there which could be genericized. I currently have several mixins used to create them, but I'm pretty sure that I can generate a _lot_ of the case statements using a single mixin which just takes the list of symbols and their associated tokens, which I'll probably do before I'm done. So, I'm sure that pieces of what I'm doing could be used to generate a lexer for another language, but plenty of it is very specific to D.I don't get it. Say I have a literal with non-UTF-8 chars, how will it be stored inside the .str field as a string?That seems reasonable enough, but can you really store a dstring literal in a string field?Yeah. Why not? The string is the same in the source code regardless of the type of the literal. The only difference is the letter tacked onto the end. That will be turned into the appropriate string type by the semantic analyzer, but the lexer doesn't care.The literal is written in whatever encoding the range is in. If it's UTF-8, it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set of characters that UTF-32 can. Your range could be UTF-32, but the string literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when the literal is UTF-32. The characters themselves are in the encoding type of the range regardless. It's just the values that the compiler generates which change. "hello world" "hello world"c "hello world"w "hello world"d are absolutely identical as far as lexing goes save for the trailing character. It would be the same regardless of the characters in the strings or the encoding used in the source file. In either case, a lot of string literals have to be decoded (e.g. if they contain escaped characters), so you often can't create them with a slice anyway, and if a range is used which isn't one of the string types, then it's impossible for Token's value property to use the range type whenever it can't use a slice. So, it's just simplest to always use string. It may be a slight performance hit for lexing wstrings and dstrings, since they _could_ be both sliced and created as new strings (unlike other ranges), but I don't think that it's worth the extra complication to make it so that the string literal's value could be other string types, especially when lexing strings is likely to be the common case.Well, I think that that's what a lexer in Phobos _has_ to do, or it can't be in Phobos. And if Jacob Carlborg gets his way, dmd's frontend will eventually switch to using the lexer and parser from Phobos, and in that sort of situation, it's that much more imperative that they follow the spec exactly. 
- Jonathan M DavisBasically, the lexer that I'm writing needs to be 100% compliant with the D spec and dmd (which means updating the spec or fixing dmd in some cases), and it needs to be possible to build on top of it anything and everything that dmd does that would use a lexer (even if it's not the way that dmd currently does it) so that it's possible to build a fully compliant D parser and compiler on top of it as well as a ddoc generator and anything else that you'd need to do with a lexer for D. So, if you have any questions about what my lexer does (or is supposed to do) with regards to the spec, that should answer it. If my lexer doesn't match the spec or dmd when I'm done (aside from specific exceptions relating to stuff like deprecated symbols), then I screwed up.That's a lofty goal, but that would indeed be quite good to have an officially integrated lexer in Phobos that would (as Andrei said) "be it". The token spec would be the lexer.
Jul 31 2012
On 2012-08-01 08:39, Jonathan M Davis wrote:Well, I think that that's what a lexer in Phobos _has_ to do, or it can't be in Phobos. And if Jacob Carlborg gets his way, dmd's frontend will eventually switch to using the lexer and parser from Phobos, and in that sort of situation, it's that much more imperative that they follow the spec exactly.:) -- /Jacob Carlborg
Aug 01 2012
On Wed, Aug 1, 2012 at 8:39 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:It was never intended to be even vaguely generic. It's targeting D specifically. If someone can take it and make it generic when I'm done, then great. But it's goal is to lex D as efficiently as possible, and it'll do whatever it takes to do that.That's exactly what I had in mind. Anyway, we need a D lexer. We also need a generic lexer generator, but as a far-away second choice and we can admit it being less efficient. Of course, any trick used on the D lexer can most probably be reused for Algol-family lexers.Everytime I think I understand D strings, you prove me wrong. So, I *still* don't get how that works: say I have auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d; Then, the "..." part is lexed as a string literal. How can the string field in the Token magically contain UTF32 characters? Or, are they automatically cut in four nonsense chars each? What about comments containing non-ASCII chars? How can code coming after the lexer make sense of it? As Jacob say, many people code in English. That's right, but 1- they most probably use their own language for internal documentation 2- any in8n part of a code base will have non-ASCII chars 3- D is supposed to accept UTF-16 and UTF-32 source code. So, wouldn't it make sense to at least provide an option on the lexer to specifically store identifier lexemes and comments as a dstring?I don't get it. Say I have an literal with non UTF-8 chars, how will it be stored inside the .str field as a string?The literal is written in whatever encoding the range is in. If it's UTF-8, it's UTF-8. If it's UTF-32, it's UTF-32. UTF-8 can hold exactly the same set of characters that UTF-32 can. Your range could be UTF-32, but the string literal is supposed to be UTF-8 ultimately. Or the range could be UTF-8 when the literal is UTF-32. The characters themselves are in the encoding type of the range regardless. It's just the values that the compiler generates which change. "hello world" "hello world"c "hello world"w "hello world"d are absolutely identical as far as lexing goes save for the trailing character. It would be the same regardless of the characters in the strings or the encoding used in the source file.
Aug 01 2012
On 2012-08-01 14:44, Philippe Sigaud wrote:Everytime I think I understand D strings, you prove me wrong. So, I *still* don't get how that works: say I have auto s = " - some greek or chinese chars, mathematical symbols, whatever - "d; Then, the "..." part is lexed as a string literal. How can the string field in the Token magically contain UTF32 characters? Or, are they automatically cut in four nonsense chars each? What about comments containing non-ASCII chars? How can code coming after the lexer make sense of it? As Jacob say, many people code in English. That's right, but 1- they most probably use their own language for internal documentation 2- any in8n part of a code base will have non-ASCII chars 3- D is supposed to accept UTF-16 and UTF-32 source code. So, wouldn't it make sense to at least provide an option on the lexer to specifically store identifier lexemes and comments as a dstring?I'm not quite sure how it works either. But I'm thinking like this: The string representing what's in the source code can be either UTF-8 or the encoding of the file. I'm not sure if the lexer needs to re-encode the string if it's not in the same encoding as the file. Then there's another field/function that returns the processed token, i.e. for a token of the type "int" it will return an actual int. This function will return different types of string depending on the type of the string literal the token represents. -- /Jacob Carlborg
Aug 01 2012
On Wednesday, August 01, 2012 14:44:29 Philippe Sigaud wrote:Every time I think I understand D strings, you prove me wrong. So, I *still* don't get how that works: say I have auto s = " - some greek or chinese chars, mathematical symbols, whatever -"d; Then, the "..." part is lexed as a string literal. How can the string field in the Token magically contain UTF32 characters?It contains unicode. The lexer is lexing whatever encoding the source is in, which has _nothing_ to do with the d on the end. It could be UTF-8, or UTF-16, or UTF-32. If we supported other encodings in ranges, it could be one of those. Which of those it is is irrelevant. As far as the value of the literal goes, these two strings are identical: "ウェブサイト" "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8" The encoding of the source file is irrelevant. By tacking a d on the end "ウェブサイト"d "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8"d you're just telling the compiler that you want the value that it generates to be in UTF-32. The source code could be in any of the supported encodings, and the string could be held in any encoding until the object code is actually generated.So, wouldn't it make sense to at least provide an option on the lexer to specifically store identifier lexemes and comments as a dstring?You mean make it so that Token is struct Token(R) { TokenType type; R str; LiteralValue value; SourcePos pos; } instead of struct Token { TokenType type; string str; LiteralValue value; SourcePos pos; } or do you mean something else? I may do something like that, but I would point out that if R doesn't have slicing, then that doesn't work. So, str can't always be the same type as the original range. For ranges with no slicing, it would have to be something else (probably either string or typeof(takeExactly(range))). However, making str R _does_ come at the cost of complicating code using the lexer, since instead of just using Token, you have to worry about whether it's a Token!string, Token!dstring, etc, and whether it's worth that complication is debatable. By far the most common use case is to lex string, and if str is string, and R is not, then you incur the penalty of converting R to string. So, the common use case is fast, and the uncommon use case still works but is slower, and the user of the lexer doesn't have to care what the original range type was. It could go either way. I used string on first pass, but as I said, I could change it to R later if that makes more sense. I'm not particularly hung up on that little detail at this point, and that's probably one of the things that can be changed reasonably easily later. - Jonathan M Davis
Aug 01 2012
On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:"ウェブサイト" "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8" The encoding of the source file is irrelevant.do you mean I can do: string field = "ウェブサイト"; ? Geez, just tested it, it works. even writeln(field) correctly output the Japanese chars and dmd doesn't choke on it. Bang, back to state 0: I don't get how D strings work.You mean make it so that Token is struct Token(R) { TokenType type; R str; LiteralValue value; SourcePos pos; } instead of struct Token { TokenType type; string str; LiteralValue value; SourcePos pos; } or do you mean something else?Right, this.
Aug 01 2012
On 2012-08-01 19:50, Philippe Sigaud wrote:On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:Unicode supports three encodings: UTF-8, UTF-16 and UTF-32. All these encodings can store every character in the Unicode standard. What's different is how the characters are stored and how many bytes a single character takes to store in the string. For example: string str = "ö"; The above character will take up two bytes in the string. On the other hand, this won't work: char c = 'ö'; The reason for that is that the above character needs two bytes to be stored but "char" can only store one byte. Therefore you need to store the character in a type where it fits, i.e. "wchar" or "dchar". Or you can use a string where you can store how many bytes you want. Don't know if that makes it clearer. -- /Jacob Carlborg"ウェブサイト" "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8" The encoding of the source file is irrelevant.do you mean I can do: string field = "ウェブサイト"; ? Geez, just tested it, it works. even writeln(field) correctly output the Japanese chars and dmd doesn't choke on it. Bang, back to state 0: I don't get how D strings work.
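A few lines showing the difference in practice, using only Phobos: the code-unit count changes with the encoding, while the code-point count stays the same.

    import std.range : walkLength;
    import std.stdio : writeln;

    void main()
    {
        string  s = "ウェブサイト";  // UTF-8
        wstring w = "ウェブサイト"w; // UTF-16
        dstring d = "ウェブサイト"d; // UTF-32

        // Code units differ per encoding...
        writeln(s.length, " ", w.length, " ", d.length); // 18 6 6
        // ...but the number of code points is identical.
        writeln(s.walkLength, " ", w.walkLength, " ", d.walkLength); // 6 6 6

        writeln("ö".length);  // 2 code units in UTF-8
        writeln("ö"d.length); // 1 code unit in UTF-32
    }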
Aug 01 2012
On Wed, Aug 1, 2012 at 8:24 PM, Jacob Carlborg <doob me.com> wrote:Don't know if that makes it clearer.It does! Particularly this:All these encodings can store *every* character in the Unicode standard. What's different is how the characters are stored and how many bytes a single character takes to store in the string.(emphasis mine) I somehow thought that with UTF-8 you were limited to a part of Unicode, and to another, bigger part with UTF-16. I equated Unicode with UTF-32. This is what completely warped my vision. It's good to learn something new everyday, I guess. Thanks Jacob!
Aug 01 2012
On 2012-08-01 22:47, Philippe Sigaud wrote:I somehow thought that with UTF-8 you were limited to a part of Unicode, and to another, bigger part with UTF-16. I equated Unicode with UTF-32. This is what completely warped my vision. It's good to learn something new everyday, I guess. Thanks Jacob!You're welcome :) -- /Jacob Carlborg
Aug 01 2012
On 8/1/12, Philippe Sigaud <philippe.sigaud gmail.com> wrote:I somehow thought that with UTF-8 you were limited to a part of Unicode, and to another, bigger part with UTF-16. I equated Unicode with UTF-32. This is what completely warped my vision. It's good to learn something new everyday, I guess.I think many people viewed Unicode this way at first. But there is a metric ton of cool info out there if you want to get to know more about unicode (this may or may not be interesting reading material), e.g.: http://www.catch22.net/tuts/introduction-unicode http://icu-project.org/docs/papers/forms_of_unicode/ http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode I used to have more of these links but lost them. There's even a gigantic book about unicode (Unicode Demystified).
Aug 01 2012
On 2012-08-01 22:54, Andrej Mitrovic wrote:I think many people viewed Unicode this way at first. But there is a metric ton of cool info out there if you want to get to know more about unicode (this may or may not be interesting reading material), e.g.: http://www.catch22.net/tuts/introduction-unicode http://icu-project.org/docs/papers/forms_of_unicode/ http://stackoverflow.com/questions/222386/what-do-i-need-to-know-about-unicode I used to have more of these links but lost them. There's even a gigantic book about unicode (Unicode Demystified).This is a good read as well: http://www.joelonsoftware.com/articles/Unicode.html -- /Jacob Carlborg
Aug 01 2012
On Wed, Aug 1, 2012 at 10:54 PM, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:On 8/1/12, Philippe Sigaud <philippe.sigaud gmail.com> wrote:I will, but not yet. I've a few books on parsing and compilers to read before that. I just read http://www.joelonsoftware.com/articles/Unicode.html, though, and I'm a bit disappointed that char 7 (\u007) does not make my computer beep. I remember now having my computer beep on char 7 during the 80s when ASCII was the only thing that existed.I somehow thought that with UTF-8 you were limited to a part of Unicode, and to another, bigger part with UTF-16. I equated Unicode with UTF-32. This is what completely warped my vision. It's good to learn something new everyday, I guess.I think many people viewed Unicode this way at first. But there is a metric ton of cool info out there if you want to get to know more about unicode
Aug 01 2012
On 02-Aug-12 01:23, Philippe Sigaud wrote:On Wed, Aug 1, 2012 at 10:54 PM, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:Once you have time to learn some unicode, check out this page: http://unicode.org/cldr/utility/index.jsp I've found these tools to be incredibly useful. -- Dmitry OlshanskyOn 8/1/12, Philippe Sigaud <philippe.sigaud gmail.com> wrote:I will, but not yet. I've a few books on parsing and compilers to read before that. I just read http://www.joelonsoftware.com/articles/Unicode.html, though, and I'm a bit disappointed that char 7 (\u007) does not make my computer beep. I remember now having my computer beep on char 7 during the 80s when ASCII was the only thing that existed.I somehow thought that with UTF-8 you were limited to a part of Unicode, and to another, bigger part with UTF-16. I equated Unicode with UTF-32. This is what completely warped my vision. It's good to learn something new everyday, I guess.I think many people viewed Unicode this way at first. But there is a metric ton of cool info out there if you want to get to know more about unicode
Aug 01 2012
On 8/1/12, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Once you have time to learn some unicode, check out this page: http://unicode.org/cldr/utility/index.jsp I've found these tools to be incredibly useful.Didn't know about that one, cool! Also might come in handy: http://people.w3.org/rishida/scripts/uniview/
Aug 01 2012
On Wednesday, August 01, 2012 22:47:47 Philippe Sigaud wrote:I somehow thought that with UTF-8 you were limited to a part of Unicode, and to another, bigger part with UTF-16. I equated Unicode with UTF-32. This is what completely warped my vision. It's good to learn something new everyday, I guess.I guess that that would explain why you didn't understand what I was saying. I was highly confused as to what was confusing about what I was saying, but it didn't even occur to me that you had that sort of misunderstanding. You really should get a better grip on unicode if you want to be writing code that lexes or parses it efficiently (though it sounds like you're reading up on a lot already right now). - Jonathan M Davis
Aug 01 2012
On Thu, Aug 2, 2012 at 1:29 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:On Wednesday, August 01, 2012 22:47:47 Philippe Sigaud wrote:I knew about 1-2-4 bytes schemes and such. But, somehow, for me, string == only-almost-ASCII characters. Anyway, it all *clicked* into place right afterwards and your answers are perfectly clear to me now.I somehow thought that with UTF-8 you were limited to a part of Unicode, and to another, bigger part with UTF-16. I equated Unicode with UTF-32. This is what completely warped my vision. It's good to learn something new everyday, I guess.I guess that that would explain why you didn't understand what I was saying. I was highly confused as to what was confusing about what I was saying, but it didn't even occur to me that you had that sort of misunderstanding. You really should get a better grip on unicode if you want to be writing code that lexes or parses it efficiently (though it sounds like you're reading up on a lot already right now).
Aug 01 2012
On Wednesday, August 01, 2012 19:50:10 Philippe Sigaud wrote:On Wed, Aug 1, 2012 at 5:45 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:"ウェブサイト" "\u30A6\u30A7\u30D6\u30B5\u30A4\u30C8" The encoding of the source file is irrelevant.do you mean I can do: string field = "ウェブサイト"; ? Geez, just tested it, it works. even writeln(field) correctly output the Japanese chars and dmd doesn't choke on it. Bang, back to state 0: I don't get how D strings work.From http://dlang.org/lex.htmlD source text can be in one of the following formats: * ASCII * UTF-8 * UTF-16BE * UTF-16LE * UTF-32BE * UTF-32LE So, yes, you can stick unicode characters directly in D code. Though I wonder about the correctness of the spec here. It claims that if there's no BOM, then it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of them have BOM markers, but I can put unicode in a .d file just fine with vim. I should probably study up on BOMs. In any case, the source is read in whatever encoding it's in. String literals then all become UTF-8 in the final object code unless they're marked as specifically being another type via the postfix letter or they're inferred as being another type (e.g. when you assign a string literal to a dstring). Regardless, what's in the final object code is based on the types that the type system marks strings as, not what the encoding of the source code was. So, a lexer shouldn't care about what the encoding of the source is beyond what it takes to convert it to a format that it can deal with and potentially being written in a way which makes handling a particular encoding more efficient. The values of literals and the like are completely unaffected regardless. - Jonathan M Davis
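For reference, BOM-based detection of the source formats listed above can be sketched as follows. This is only a rough illustration; the BOM-less case (which the spec treats as ASCII/UTF-8) is not attempted, and the function name is made up.

    import std.algorithm : startsWith;

    enum SourceEncoding { utf8, utf16be, utf16le, utf32be, utf32le, unknown }

    // Check the 4-byte UTF-32 BOMs before UTF-16LE, since "FF FE" is also
    // a prefix of the UTF-32LE BOM.
    SourceEncoding detectByBom(const(ubyte)[] raw)
    {
        static immutable ubyte[] bomUtf32be = [0x00, 0x00, 0xFE, 0xFF];
        static immutable ubyte[] bomUtf32le = [0xFF, 0xFE, 0x00, 0x00];
        static immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
        static immutable ubyte[] bomUtf16be = [0xFE, 0xFF];
        static immutable ubyte[] bomUtf16le = [0xFF, 0xFE];

        if (raw.startsWith(bomUtf32be)) return SourceEncoding.utf32be;
        if (raw.startsWith(bomUtf32le)) return SourceEncoding.utf32le;
        if (raw.startsWith(bomUtf8))    return SourceEncoding.utf8;
        if (raw.startsWith(bomUtf16be)) return SourceEncoding.utf16be;
        if (raw.startsWith(bomUtf16le)) return SourceEncoding.utf16le;
        return SourceEncoding.unknown; // no BOM: the spec's ASCII/UTF-8 case
    }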
Aug 01 2012
On 2012-08-01 20:24, Jonathan M Davis wrote:D source text can be in one of the following formats: * ASCII * UTF-8 * UTF-16BE * UTF-16LE * UTF-32BE * UTF-32LE So, yes, you can stick unicode characters directly in D code. Though I wonder about the correctness of the spec here. It claims that if there's no BOM, then it's ASCII, but unless vim inserts BOM markers into all of my .d files, none of them have BOM markers, but I can put unicode in a .d file just fine with vim. I should probably study up on BOMs. In any case, the source is read in whatever encoding it's in. String literals then all become UTF-8 in the final object code unless they're marked as specifically being another type via the postfix letter or they're inferred as being another type (e.g. when you assign a string literal to a dstring). Regardless, what's in the final object code is based on the types that the type system marks strings as, not what the encoding of the source code was. So, a lexer shouldn't care about what the encoding of the source is beyond what it takes to convert it to a format that it can deal with and potentially being written in a way which makes handling a particular encoding more efficient. The values of literals and the like are completely unaffected regardless.But if you read a source file which is encoded using UTF-16 you would need to re-encode that to store it in the "str" field in your Token struct? If that's the case, wouldn't it be better to make Token a template to be able to store all Unicode encodings without re-encoding? Although I don't know if that will complicate the rest of the lexer. -- /Jacob Carlborg
Aug 01 2012
On Wednesday, August 01, 2012 20:29:45 Jacob Carlborg wrote:But if you read a source file which is encoded using UTF-16 you would need to re-encode that to store it in the "str" field in your Token struct?Currently, yes.If that's the case, wouldn't it be better to make Token a template to be able to store all Unicode encodings without re-encoding? Although I don't know if that will complicate the rest of the lexer.It may very well be a good idea to templatize Token on range type. It would be nice not to have to templatize it, but that may be the best route to go. The main question is whether str is _always_ a slice (or the result of takeExactly) of the original range. I _think_ that it is, but I'd have to make sure of that. If it's not and can't be for whatever reason, then that poses a problem. If Token _does_ get templatized, then I believe that R will end up being the original type in the case of the various string types or a range which has slicing, but it'll be the result of takeExactly(range, len) for everything else. I just made str a string to begin with, since it was simple, and I was still working on a lot of the initial design and how I was going to go about things. If it makes more sense for it to be templated, then it'll be changed so that it's templated. - Jonathan M Davis
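The type selection described here could look something like the sketch below; hasSlicing and takeExactly are from std.range, while the template itself and its name are guesswork about how a templated Token might pick the type of its str field.

    import std.range : hasSlicing, takeExactly;
    import std.traits : isSomeString;

    // The type a hypothetical Token!R.str would have: the range itself when
    // it is a string or supports slicing, otherwise takeExactly's wrapper.
    template TokenText(R)
    {
        static if (isSomeString!R || hasSlicing!R)
            alias TokenText = R;
        else
            alias TokenText = typeof(takeExactly(R.init, 0));
    }

    unittest
    {
        import std.algorithm : filter;

        static assert(is(TokenText!string == string));

        // A filtered range has no slicing, so the takeExactly wrapper is used.
        alias Filtered = typeof("abc".filter!(c => c != 'b'));
        static assert(!is(TokenText!Filtered == Filtered));
    }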
Aug 01 2012
On 2012-08-01 22:10, Jonathan M Davis wrote:It may very well be a good idea to templatize Token on range type. It would be nice not to have to templatize it, but that may be the best route to go. The main question is whether str is _always_ a slice (or the result of takeExactly) of the original range. I _think_ that it is, but I'd have to make sure of that. If it's not and can't be for whatever reason, then that poses a problem. If Token _does_ get templatized, then I believe that R will end up being the original type in the case of the various string types or a range which has slicing, but it'll be the result of takeExactly(range, len) for everything else.To me a string type would be enough. I don't need support for ranges. How about adding a union instead? enum StringType { utf8, utf16, utf32 } struct Token { StringType stringType; union { string strc; wstring strw; dstring strd; } @property T str(T = string)() { static if (is(T == string)) { assert(stringType == StringType.utf8); return strc; } ... } } Most use cases would look like this: string s = token.str;I just made str a string to begin with, since it was simple, and I was still working on a lot of the initial design and how I was going to go about things. If it makes more sense for it to be templated, then it'll be changed so that it's templated.I completely understand that. -- /Jacob Carlborg
Aug 01 2012
"Jonathan M Davis" , dans le message (digitalmars.D:173942), a crit:It may very well be a good idea to templatize Token on range type. It would be nice not to have to templatize it, but that may be the best route to go. The main question is whether str is _always_ a slice (or the result of takeExactly) of the orignal range. I _think_ that it is, but I'd have to make sure of that. If it's not and can't be for whatever reason, then that poses a problem.It can't if it is a simple input range! Like a file read with most 'lazy' methods. Then you need either to transform the input range in a forward range using a range adapter that performs buffering, or perform your own buffering internally. You also have to decide how long the token will be valid (I believe if you want lexing to be blazing fast, you don't want to allocate for each token). Maybe you want you lexer to work with range of strings too, like File.byLine or File.byChunk (the latter require buffering if you split in the middle of a token...). But that may wait until a nice API for files, streams, etc. is found.If Token _does_ get templatized, then I believe that R will end up being the original type in the case of the various string types or a range which has slicing, but it'll be the result of takeExactly(range, len) for everything else.A range which has slicing doesn't necessarily return it's own type when opSlice is used, according to hasSlicing. I'm pretty sure parts of Phobos doesn't take that into account. However, the result of takeExactly will always be the good type, since it uses opSlice when it can, so you can just use that. Making a generic lexer that works with any forward range of dchar and returns a range of tokens without performing decoding of litteral seems to be a good first step.I just made str a string to begin with, since it was simple, and I was still working on a lot of the initial design and how I was going to go about things. If it makes more sense for it to be templated, then it'll be changed so that it's templated.string may not be where you want to start, because it is a specialization for which you need to optimize utf-8 decoding. Also, you said in this thread that you only need to consider ansy characters in the lexer because non-ansy characters are only used in non-keyword identifier. That is not entirely true: EndOfLine defines 2 non-ansy characters, namely LINE SEPARATORand PARAGRAPH SEPARATOR. http://dlang.org/lex.html#EndOfLine Maybe they should be dropped, since other non-ansy whitespace are not supported. You may want the line count to be consistent with other programs. I don't know what text-processing programs usualy considers an end of line. -- Christophe
Aug 02 2012
On Thursday, August 02, 2012 07:06:25 Christophe Travert wrote:"Jonathan M Davis", in message (digitalmars.D:173942), wrote:It may very well be a good idea to templatize Token on range type. It would be nice not to have to templatize it, but that may be the best route to go. The main question is whether str is _always_ a slice (or the result of takeExactly) of the original range. I _think_ that it is, but I'd have to make sure of that. If it's not and can't be for whatever reason, then that poses a problem.It can't if it is a simple input range! Like a file read with most 'lazy' methods. Then you need either to transform the input range into a forward range using a range adapter that performs buffering, or perform your own buffering internally. You also have to decide how long the token will be valid (I believe if you want lexing to be blazing fast, you don't want to allocate for each token).My lexer specifically requires a forward range. The more that I deal with input ranges, the more that I'm convinced that they're nearly useless. If you need even _one_ character of lookahead, then an input range doesn't fly at all, and considering the performance gains in using slicing (or takeExactly), I just don't think that it makes sense to operate on an input range. Besides, if anyone wants full performance, they'll probably need to use one of the built-in string types. Any range of dchar which needs to decode on the call to front or popFront will take a serious performance hit. It'll work, but it's not really advisable if you need performance.Also, you said in this thread that you only need to consider ASCII characters in the lexer because non-ASCII characters are only used in non-keyword identifiers. That is not entirely true: EndOfLine defines two non-ASCII characters, namely LINE SEPARATOR and PARAGRAPH SEPARATOR. http://dlang.org/lex.html#EndOfLine Maybe they should be dropped, since other non-ASCII whitespace is not supported. You may want the line count to be consistent with other programs. I don't know what text-processing programs usually consider an end of line.I'm well aware of all of that. Almost all of the lexer can operate entirely on ASCII, with a few exceptions, and even in some of those cases, decoding isn't required (e.g. lineSep and paraSep can be dealt with as code units rather than having to decode to compare against them). The lexer that I'm writing will follow the spec. And any lexer that wants to get into Phobos will need to do the same. So, stuff like lineSep and the end of file characters that the spec has will be supported. - Jonathan M Davis
Aug 02 2012