digitalmars.D - std.d.lexer: pre-voting review / discussion
- Dicebot (36/36) Sep 11 2013 std.d.lexer is a standard module for lexing D code, written by
- Tove (6/8) Sep 11 2013 I remember reading there were some interesting hash-advances in
- Johannes Pfau (38/41) Sep 11 2013 Question / Minor issue:
- Walter Bright (3/6) Sep 11 2013 It is important, and I'm glad you brought it up. The LexerConfig can pro...
- qznc (5/9) Sep 11 2013 The documentation for Token twice says "measured in ASCII
- Walter Bright (2/10) Sep 11 2013 Pedantically, it is just UTF-8 code units, which are a superset of ASCII...
- Walter Bright (23/28) Sep 11 2013 Thank you, Brian! This is important work.
- Brian Schott (3/3) Sep 11 2013 The choice of ending token names with underscores was made
- Walter Bright (3/6) Sep 11 2013 I didn't realize that was in the style guide. I guess I can't complain a...
- H. S. Teoh (8/16) Sep 11 2013 I disagree. I think it's more readable to use a consistent prefix, like
- Dicebot (3/8) Sep 11 2013 Not unless you want to change the style guide and break existing
- H. S. Teoh (6/13) Sep 11 2013 How would that break Phobos code? Phobos code doesn't even use
- Dicebot (4/18) Sep 11 2013 Phobos code must conform its style guide. You can't change it
- H. S. Teoh (22/38) Sep 11 2013 This doesn't violate Phobos style guidelines:
- deadalnix (4/37) Sep 11 2013 Int, Function, Scope, Import are all valid identifiers.
- Jonathan M Davis (8/9) Sep 11 2013 All of which violate Phobos' naming conventions for enum values (they mu...
- H. S. Teoh (18/27) Sep 11 2013 [...]
- sclytrack (4/43) Sep 11 2013 Delphi would use
- Manfred Nowak (3/4) Sep 11 2013 Why---and since which version?
- Walter Bright (3/6) Sep 11 2013 Since the very beginning.
- Manfred Nowak (5/9) Sep 11 2013 I see now, that you wrote about parsing---not about lexing.
- Robert Schadek (7/16) Sep 12 2013 I had problems with it, especially with
- Manfred Nowak (4/5) Sep 12 2013 I see the shift/reduce conflict if you were indeed trying to syntacticall...
- Jonathan M Davis (6/10) Sep 11 2013 IIRC, it's at least partially cause by the .. token. You have to look ah...
- Brian Schott (17/30) Sep 11 2013 Yeah. D requires lookahead in both lexing and parsing. Some
- deadalnix (6/21) Sep 11 2013 Does not require lookahead.
- Manfred Nowak (3/5) Sep 11 2013 Walter wrote about _arbitrary_ lookahead, i.e. unlimited lookahead.
- Manfred Nowak (10/12) Sep 11 2013 This lookahead is introduced by using a petty grammar.
- Jonathan M Davis (5/15) Sep 11 2013 LOL. You were asking about _parsing_ requiring arbitrary lookahead, and ...
- Jonathan M Davis (10/12) Sep 11 2013 Appending _ is what we do in Phobos when we need to use a keyword, and i...
- H. S. Teoh (85/117) Sep 11 2013 I believe this is probably a result of having a separate enum value of
- Walter Bright (9/26) Sep 11 2013 A type is a well-understood concept in D. This is using 'type' to mean s...
- H. S. Teoh (39/73) Sep 11 2013 No, in typical D code, user-defined type names are capitalized, such as
- Walter Bright (5/24) Sep 11 2013 It's not that big a deal, and certainly not evil.
- H. S. Teoh (9/14) Sep 11 2013 But then the code example proceeds to pass byLine() to it. Is that
- Brian Schott (3/10) Sep 11 2013 The example doesn't pass the result of byLine to the byToken
- H. S. Teoh (7/18) Sep 11 2013 *facepalm* You're right, it's calling join() on it. Nevermind what I
- Jacob Carlborg (10/13) Sep 11 2013 Currently this is possible:
- Brian Schott (21/161) Sep 11 2013 There are already plenty of hackish things in that module, so
- H. S. Teoh (34/157) Sep 11 2013 I was just going by what Walter said, but OK, fair enough.
- Jacob Carlborg (8/9) Sep 12 2013 Really? It seems you only need to change two places in the code:
- Piotr Szturmaj (7/14) Sep 11 2013 Delphi designers realized this problem years ago and they came up with a...
- Kapps (6/27) Sep 11 2013 C# allows this as well. For example, you can have:
- Piotr Szturmaj (4/31) Sep 11 2013 Yes, and this is important to actually have the real name without a '_'
- Jacob Carlborg (5/11) Sep 12 2013 Scala does it as well: `keyword` if I recall correctly. Seems like you
- Michel Fortin (10/12) Sep 11 2013 Since this is a forward range, you can save() it at one location and
- Walter Bright (3/10) Sep 11 2013 That's correct, but that implies re-lexing the tokens, which has negativ...
- deadalnix (3/20) Sep 11 2013 Indeed. What solution do you have in mind ?
- Walter Bright (3/4) Sep 11 2013 The solution dmd uses is to put in an intermediary layer that saves the
- deadalnix (5/9) Sep 11 2013 But then, you have an extra step when looking up every tokens +
- Walter Bright (6/14) Sep 11 2013 Not really - I use a dirty trick in that a static instance is always the...
- deadalnix (10/29) Sep 11 2013 If I understand you correctly, that mean that lookahead of one
- Walter Bright (4/9) Sep 11 2013 I don't believe that, because you can see about anything for tokens in l...
- deadalnix (4/18) Sep 12 2013 I plan to, but you know what it is, the best optimization is the
- Martin Nowak (5/8) Sep 12 2013 Me neither and if you have to duplicate the code for this it's a double
- Martin Nowak (4/8) Sep 11 2013 Linked list sounds bad.
- H. S. Teoh (11/21) Sep 11 2013 I still don't understand why backtracking is necessary in the first
- deadalnix (29/38) Sep 11 2013 The problem is that it can cause a exponential (and I literally
- H. S. Teoh (23/45) Sep 12 2013 [...]
- deadalnix (4/29) Sep 12 2013 And then you got to backtrack the parsing instead of the lexing.
- H. S. Teoh (98/127) Sep 12 2013 No. You can just use ArgListItem for both runtime args and compile-time
- deadalnix (9/9) Sep 12 2013 We are talking about parameters here, not arguments. But overall,
- Manfred Nowak (3/4) Sep 12 2013 If one ever reaches that position one uses petty lexer/grammar definitio...
- Mehrdad (3/9) Sep 27 2013 Pretty sure a GLR parser handles that well within O(n^2) space.
- Mehrdad (2/12) Sep 27 2013 I meant time... Though it's true for space too.
- Walter Bright (2/11) Sep 11 2013 No, I don't have a rough idea, but it doesn't show up in the profile.
- Robert Schadek (3/12) Sep 12 2013 Maybe some fixed size stack vector with 64 elements or so and some
- Dmitry Olshansky (4/17) Sep 12 2013 And an extra branch to detect which one is currently the case? No, thank...
- Robert Schadek (3/8) Sep 12 2013 I would think that branching and using the vector is faster than using
- Jonathan M Davis (5/14) Sep 12 2013 Well, it sounds like there are several ideas to try out in this thread. ...
- qznc (6/10) Sep 12 2013 I think this is the right approach. It can probably be another
- Martin Nowak (10/14) Sep 11 2013 Buffering the tokens would work. There are some ways promote input
- Timon Gehr (23/33) Sep 12 2013 It's a type tag. The tag uniquely determines the type. (As in 'the type
- Dmitry Olshansky (11/36) Sep 12 2013 I like this.
- H. S. Teoh (12/32) Sep 12 2013 +1. This is superior to both the ad hoc _ suffix and my ad hoc
- Timon Gehr (23/28) Sep 12 2013 This is just a quick hack. One would maybe want to avoid the unwarranted...
- Jacob Carlborg (6/7) Sep 12 2013 Unfortunately that's currently not possible to pass to functions, see my...
- Martin Nowak (6/15) Sep 12 2013 Do you think it's possible to use CircularBufferRange itself as anchor.
- Timon Gehr (14/35) Sep 12 2013 Well, this implementation is geared towards the usage pattern found in
- Walter Bright (4/5) Sep 11 2013 One litmus test of the validity of the design is to see if one can build...
- Brian Schott (14/21) Sep 11 2013 I'm not sure if you've followed the announce mailing list
- Walter Bright (2/4) Sep 11 2013 Awesome! This is just what I asked for.
- Martin Nowak (18/26) Sep 11 2013 I only had a short look at the source code and was left with a good
- Jacob Carlborg (4/8) Sep 12 2013 Sounds like that would fit in std.file or similar.
- dennis luehring (4/6) Sep 11 2013 a standard benchmark scenario is needed - to compare dmd and std.d.lexer
- Jacob Carlborg (63/64) Sep 12 2013 Finally :)
- Dmitry Olshansky (12/32) Sep 12 2013 These could be one module. There is really no meaningful way to use
- Jacob Carlborg (5/7) Sep 12 2013 std.d.lexer.config might be a bit unnecessary. But yes, in general I
- Brian Schott (10/10) Sep 12 2013 I got some time to work on the lexer this evening. Changeset
- dennis luehring (20/30) Sep 12 2013 problem: many occurences of the same string
- dennis luehring (9/9) Sep 12 2013 just an idea regarding your string[TokenType.max + 1] in
- Walter Bright (4/13) Sep 12 2013 I strongly recommend running the unit tests with -cov. A lexer should be...
- Brian Schott (9/11) Sep 12 2013 Some of the code is only present to be used at compile-time to
- Walter Bright (3/12) Sep 12 2013 Thanks.
- deadalnix (6/11) Sep 12 2013 One quick remark : we need some kind of value provider to reuse
- Timon Gehr (30/36) Sep 12 2013 (Commenting on what's visible in the documentation only for now.)
- Dicebot (5/5) Sep 17 2013 Discussion has slowed down and thread is fading away.
- deadalnix (3/8) Sep 17 2013 I had some comments that nobody addressed. Mostly about firing
- Dicebot (9/19) Sep 17 2013 Well, I suppose only Brian can address it :) I am only checking
- Brian Schott (3/23) Sep 17 2013 I've been busy with things that aren't D-related recently, but I
- Dicebot (4/6) Sep 17 2013 Ok, please notify me then once you feel ready for voting / extra
- Brian Schott (4/6) Sep 24 2013 Changes since last time:
- Jacob Carlborg (31/33) Sep 25 2013 I had some comments and a couple of minor things like spell errors:
- Brian Schott (7/14) Sep 25 2013 This is the default behavior that happens when you don't
- Jacob Carlborg (8/13) Sep 25 2013 I see.
- Dominikus Dittes Scherkl (20/20) Sep 26 2013 Hello.
- Jos van Uden (3/14) Sep 26 2013 I imagine the lexer follows the language specification:
- Dominikus Dittes Scherkl (11/21) Sep 27 2013 I know. What I wanted to say is: The language specification has a
- Brian Schott (4/6) Sep 24 2013 Doing that would require making the identifier pool part of the
- deadalnix (4/10) Sep 24 2013 Yes, ideally as alias parameter. This is a show stopper for many
- ilya-stromberg (6/8) Sep 17 2013 Brian, have you got any plans to write common lexer/parser
std.d.lexer is a standard module for lexing D code, written by Brian Schott

---- Input ----

Code: https://github.com/Hackerpilot/phobos/tree/master/std/d

Documentation: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html

Initial discussion: http://forum.dlang.org/thread/dpdgcycrgfspcxenzrjf@forum.dlang.org

Usage example in real project: https://github.com/Hackerpilot/Dscanner (as stdx.d.lexer; Brian, please correct me if versions do not match)

---- Information for reviewers ----

(yes, I am mostly copy-pasting this :P)

The goal of this thread is to detect any outstanding issues that need to be fixed before formal "yes"/"no" voting happens. If no critical objections arise, voting will begin next week. Otherwise it depends on the time the module author needs to implement suggestions.

Please take this part seriously: "If you identify problems along the way, please note if they are minor, serious, or showstoppers." (http://wiki.dlang.org/Review/Process). This information will later be used to determine whether the library is ready for voting.

If there are any frequent Phobos contributors / core developers here, please pay extra attention to the submission's code style and how it fits into overall Phobos guidelines and structure.

The most important goal of this review is to find any API / design problems. Internal implementation tweaks may happen after inclusion into Phobos, but it is important to ensure that no breaking changes will be required soon after the module gets wider usage.

---- Information request from module author ----

Performance was a major topic of discussion in the previous thread. Could you please provide benchmarking data for the version currently under review?
Sep 11 2013
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott

I remember reading there were some interesting hash advances in dmd recently: http://forum.dlang.org/thread/kq7ov0$2o8n$1@digitalmars.com?page=1

Maybe it's worth benchmarking those hashes for std.d.lexer as well.
Sep 11 2013
On Wed, 11 Sep 2013 17:01:58 +0200, "Dicebot" <public@dicebot.lv> wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott

Question / minor issue:

As we already have a range-based interface, I'd love to have partial lexing / parsing, especially for IDEs. Say I have this source code:

--------------------------------------------
1: module a;
2:
3: void test(int a)
4: {
5:     [...]
6: }
7:
8: void test2()
9:     [...]
--------------------------------------------

Then I first do a full parse pass over the source. Now line 5 is being edited. I know from the full parse that line 5 is part of a FunctionDeclaration which starts at line 3 and ends at line 6. Now I'd like to re-parse only that part:

--------------------------------------------
FunctionDeclaration decl = document.getDeclByLine(5);
decl.reparse(/*start_line=*/ 3, docBuffer);
--------------------------------------------

I think these are the two critical points related to this for the proposed std.lexer:

* How can I tell the lexer to start lexing at line/character n? Of course the input could be sliced, but then line number and position information in the Token struct is wrong.

* I guess std.lexer slices the original input? This could make things difficult if the file buffer is edited in place. But this can probably be dealt with outside of std.lexer (by reallocating all .value members).

(And once this is working, an example in the docs would be great.)

But to be honest I'm not sure how important this really is. I think it should help for more responsive IDEs, but maybe parsing is not a bottleneck at all?
Sep 11 2013
On 9/11/2013 11:43 AM, Johannes Pfau wrote:
> But to be honest I'm not sure how important this really is. I think it should help for more responsive IDEs, but maybe parsing is not a bottleneck at all?

It is important, and I'm glad you brought it up. The LexerConfig can provide a spot to put a starting line/column value.
Sep 11 2013
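To make Johannes' scenario concrete, here is a minimal sketch assuming LexerConfig grows startLine/startColumn fields as Walter suggests. The field names and the byToken call shape are assumptions based on this thread, not the reviewed module's actual API:

    // Hypothetical: re-lex just the edited declaration, telling the lexer
    // where the fragment sits in the original file so that Token.line and
    // Token.column come out relative to the whole document.
    LexerConfig config;
    config.startLine = 3;   // assumed field: first line of the fragment
    config.startColumn = 1; // assumed field: first column of the fragment

    auto fragment = "void test(int a)\n{\n}\n";
    foreach (token; fragment.byToken(config))
    {
        // token.line / token.column refer to the original file, not the slice
    }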
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott
>
> Documentation: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html

The documentation for Token twice says "measured in ASCII characters or UTF-8 code units", which sounds confusing to me. Is it UTF-8, which includes ASCII? Then it should not be "or". This is nitpicking.

Overall, I like the proposal. Great work!
Sep 11 2013
On 9/11/2013 11:45 AM, qznc wrote:
> On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
>> std.d.lexer is a standard module for lexing D code, written by Brian Schott
>>
>> Documentation: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
>
> The documentation for Token twice says "measured in ASCII characters or UTF-8 code units", which sounds confusing to me. Is it UTF-8, which includes ASCII? Then it should not be "or".

Pedantically, it is just UTF-8 code units, which are a superset of ASCII.
Sep 11 2013
On 9/11/2013 8:01 AM, Dicebot wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott

Thank you, Brian! This is important work.

Not a thorough review, just some notes from reading the doc file:

1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

2. The example uses an if-then sequence of isBuiltType, isKeyword, etc. Should be an enum so a switch can be done for speed.

3. I assumed TokenType is a type. But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.

4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For an example that is done better: '*' is called 'star' rather than 'dereference'.

5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. The LexerConfig documentation is awfully thin. For example, 'tokenStyle' is explained as being 'Token style', whatever that is.

6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.

7. "uint line;" should indicate that lines start with '1', not '0'. Ditto for columns.

8. 'default_': again with the awful practice of appending _.

9. Need to insert intra-page navigation links, such as when 'byToken()' appears in the text, it should be a link to where byToken is described.

> The goal of this thread is to detect any outstanding issues that need to be fixed before formal "yes"/"no" voting happens. If no critical objections arise, voting will begin next week. Otherwise it depends on the time the module author needs to implement suggestions.

I believe the state of the documentation is a showstopper, and needs to be extensively fleshed out before it can be considered ready for voting.
Sep 11 2013
The choice of ending token names with underscores was made according to the Phobos style guide. http://dlang.org/dstyle.html
Sep 11 2013
On 9/11/2013 12:10 PM, Brian Schott wrote:
> The choice of ending token names with underscores was made according to the Phobos style guide. http://dlang.org/dstyle.html

I didn't realize that was in the style guide. I guess I can't complain about it, then :-)
Sep 11 2013
On Wed, Sep 11, 2013 at 12:13:07PM -0700, Walter Bright wrote:
> On 9/11/2013 12:10 PM, Brian Schott wrote:
>> The choice of ending token names with underscores was made according to the Phobos style guide. http://dlang.org/dstyle.html
>
> I didn't realize that was in the style guide. I guess I can't complain about it, then :-)

I disagree. I think it's more readable to use a consistent prefix, like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's clear you're referring to token types, not the actual keyword.

T

--
One Word to write them all, One Access to find them, One Excel to count them all, And thus to Windows bind them. -- Mike Champion
Sep 11 2013
On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
> I disagree. I think it's more readable to use a consistent prefix, like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's clear you're referring to token types, not the actual keyword.

Not unless you want to change the style guide and break existing Phobos code ;)
Sep 11 2013
On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
> Not unless you want to change the style guide and break existing Phobos code ;)

How would that break Phobos code? Phobos code doesn't even use std.d.lexer right now.

T

--
Being able to learn is a great learning; being able to unlearn is a greater learning.
Sep 11 2013
On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh wrote:
> How would that break Phobos code? Phobos code doesn't even use std.d.lexer right now.

Phobos code must conform to its style guide. You can't change it without changing existing Phobos code that relies on it. Inconsistent style is the worst of all options.
Sep 11 2013
On Wed, Sep 11, 2013 at 10:18:12PM +0200, Dicebot wrote:
> Phobos code must conform to its style guide. You can't change it without changing existing Phobos code that relies on it. Inconsistent style is the worst of all options.

This doesn't violate Phobos style guidelines:

    enum TokenType {
        kwInt,
        kwFloat,
        kwDouble,
        ...
        kwFunction,
        kwScope,
        ...
        // etc.
    }

The token representing a particular keyword is not the same thing as the keyword itself. There's nothing that says "a D lexer must use token type names that are derived from the D keywords it represents". In fact, using a kw prefix for tokens representing keywords is *more* consistent, because it doesn't require inserting _ for some enum values but not for others.

I never said anything about changing Phobos style guidelines.

T

--
I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
Sep 11 2013
On Wednesday, 11 September 2013 at 20:28:06 UTC, H. S. Teoh wrote:
> This doesn't violate Phobos style guidelines:
>
>     enum TokenType {
>         kwInt,
>         kwFloat,
>         kwDouble,
>         ...
>         kwFunction,
>         kwScope,
>         ...
>         // etc.
>     }

Int, Function, Scope, Import are all valid identifiers. Random abbreviation like kw is really bad. It is even worse when it doesn't make anything shorter.
Sep 11 2013
On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
> Int, Function, Scope, Import are all valid identifiers.

All of which violate Phobos' naming conventions for enum values (they must start with a lowercase letter), which is why we went with adding an _ on the end. And it's pretty much as close as you can get to the keyword without actually using the keyword, which is a plus IMHO (though from the sounds of it, H. S. Teoh would consider that a negative due to possible confusion with the keyword).

- Jonathan M Davis
Sep 11 2013
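For reference, a minimal illustration of the convention under debate; the member names are invented for the example, this is not the module's actual enum:

    // Phobos style: enum members start with a lowercase letter; members
    // that would collide with a keyword get a trailing underscore.
    enum TokenType
    {
        identifier, // no collision, no underscore
        dotdot,     // the '..' token
        default_,   // 'default' is a keyword, hence the underscore
        import_,    // likewise 'import'
    }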
On Wed, Sep 11, 2013 at 10:06:11PM -0400, Jonathan M Davis wrote:
> On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
>> Int, Function, Scope, Import are all valid identifiers.
>
> All of which violate Phobos' naming conventions for enum values (they must start with a lowercase letter), which is why we went with adding an _ on the end. And it's pretty much as close as you can get to the keyword without actually using the keyword, which is a plus IMHO (though from the sounds of it, H. S. Teoh would consider that a negative due to possible confusion with the keyword).

[...]

Actually, the main issue I have is that some of the enum values end with _ while others don't. This is inconsistent. I'd rather have consistency than superficial resemblance to the keywords as typed. Either *all* of the enum values should end with _, or *none* of them should. Having a mixture of both is an eyesore, and leads to people wondering: should I add a _ at the end or not?

If people insist that the 'default' keyword absolutely must be represented as TokenType.default_ (I really don't see why), then *all* TokenType values should end with _. But honestly, I find that really ugly. Writing something like kwDefault, or tokenTypeDefault, would be far better.

Sigh, Andrei was right. Once the bikeshed is up for painting, even the rainbow won't suffice. :-P

T

--
MASM = Mana Ada Sistem, Man!
Sep 11 2013
On Thursday, 12 September 2013 at 03:31:42 UTC, H. S. Teoh wrote:
> Actually, the main issue I have is that some of the enum values end with _ while others don't. This is inconsistent. [...] Writing something like kwDefault, or tokenTypeDefault, would be far better.

Delphi would use:

TokenType ttDefault
MyType mtDefault
Sep 11 2013
Walter Bright wrote:
> Parsing D requires arbitrary lookahead.

Why---and since which version?

-manfred
Sep 11 2013
On 9/11/2013 12:17 PM, Manfred Nowak wrote:
> Walter Bright wrote:
>> Parsing D requires arbitrary lookahead.
>
> Why---and since which version?

Since the very beginning. One example is determining if something is a declaration or an expression.
Sep 11 2013
Walter Bright wrote:
> Since the very beginning. One example is determining if something is a declaration or an expression.

I see now that you wrote about parsing---not about lexing.

Btw. I wrote an LALR(1) parser for an early version of D. This means a lookahead of one was sufficient---or I made terrible mistakes.

-manfred
Sep 11 2013
On 09/12/2013 12:09 AM, Manfred Nowak wrote:
> Walter Bright wrote:
>> Since the very beginning. One example is determining if something is a declaration or an expression.
>
> I see now that you wrote about parsing---not about lexing.
>
> Btw. I wrote an LALR(1) parser for an early version of D. This means a lookahead of one was sufficient---or I made terrible mistakes.

I had problems with it, especially with

IdentifierList:
    Identifier . IdentifierList
    TemplateInstance . IdentifierList

And Bison also complained.
Sep 12 2013
Robert Schadek wrote:
> especially with IdentifierList:

I see the shift/reduce conflict if you were indeed trying to syntactically differentiate between template identifiers and other identifiers.

-manfred
Sep 12 2013
On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
> Walter Bright wrote:
>> Parsing D requires arbitrary lookahead.
>
> Why---and since which version?

IIRC, it's at least partially caused by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013.

- Jonathan M Davis
Sep 11 2013
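Concretely: after consuming the digits of `1`, the lexer must peek one character past the `.` to decide between a float literal (`1.5`) and an integer followed by a `..` token (`1..5`). A hand-written sketch of that check, with illustrative names that are not the module's internals:

    // After scanning an integer's digits, decide what a following '.' means:
    // "1.5"  -> keep lexing a float literal
    // "1..5" -> end the integer here; the next token will be '..'
    Token lexNumber(ref string input)
    {
        size_t i;
        while (i < input.length && input[i] >= '0' && input[i] <= '9')
            ++i;

        if (i < input.length && input[i] == '.'
            && (i + 1 >= input.length || input[i + 1] != '.'))
            return lexFloat(input, i); // hypothetical: continue as a float

        auto tok = Token(TokenType.intLiteral, input[0 .. i]); // assumed ctor
        input = input[i .. $]; // leave any ".." for the next call
        return tok;
    }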
On Wednesday, 11 September 2013 at 19:29:48 UTC, Jonathan M Davis wrote:
> IIRC, it's at least partially caused by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013.

Yeah. D requires lookahead in both lexing and parsing. Some examples:

* String literals such as q{}
* If statements need to determine the difference between if (expression) and if (type identifier = expression)
* Determining if something is a lambda expression or a function literal expression
* Determining if something is a declaration or a statement or an expression
* Differentiating between (type).identifier and a lambda expression
* Differentiating between a type and an expression inside a typeid expression
* Differentiating between associative array key types and tuple slices in type suffixes
Sep 11 2013
On Wednesday, 11 September 2013 at 19:46:25 UTC, Brian Schott wrote:
> Yeah. D requires lookahead in both lexing and parsing. Some examples:
>
> * String literals such as q{}
> * If statements need to determine the difference between if (expression) and if (type identifier = expression)
> * Determining if something is a lambda expression or a function literal expression
> * Determining if something is a declaration or a statement or an expression

Does not require lookahead.

> * Differentiating between (type).identifier and a lambda expression

Yup !

> * Differentiating between a type and an expression inside a typeid expression

Does not require lookahead.

> * Differentiating between associative array key types and tuple slices in type suffixes

ditto.
Sep 11 2013
Brian Schott wrote:
>> Parsing D requires arbitrary lookahead.
>
> Yeah. D requires lookahead in both lexing and parsing.

Walter wrote about _arbitrary_ lookahead, i.e. unlimited lookahead.

-manfred
Sep 11 2013
Jonathan M Davis wrote:
> You have to look ahead to figure out whether it's .. or a floating point literal.

This lookahead is introduced by using a petty grammar. Please reconsider that lexing searches for the leftmost longest pattern of the rest of the input. This means that introducing a pattern like

    `<int>\.\.'  return TokenType.INTDOTDOT;

would eliminate the lookahead in the lexer. In the parser an additional rule then has to be added:

    <range> ::= INT DOTDOT INT
              | INTDOTDOT INT

-manfred
Sep 11 2013
On Wednesday, September 11, 2013 15:29:33 Jonathan M Davis wrote:
> On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
>> Walter Bright wrote:
>>> Parsing D requires arbitrary lookahead.
>>
>> Why---and since which version?
>
> IIRC, it's at least partially caused by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013.

LOL. You were asking about _parsing_ requiring arbitrary lookahead, and I answered as to why lexing requires more than one character of lookahead. I should read more carefully...

- Jonathan M Davis
Sep 11 2013
On Wednesday, September 11, 2013 11:49:44 Walter Bright wrote:
> 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

Appending _ is what we do in Phobos when we need to use a keyword, and it's even called out in the style guide: http://dlang.org/dstyle.html

Now, if it makes sense to use something other than a keyword, that's probably better, but in cases where it really makes the most sense to use a keyword and we can't (e.g. std.traits.FunctionAttribute), we append _ to the keyword. What makes the most sense in this case, I don't know, because I haven't looked at either the documentation or the code yet, but if Brian is appending _ to keywords, he's following the official style guide.

- Jonathan M Davis
Sep 11 2013
On Wed, Sep 11, 2013 at 11:49:44AM -0700, Walter Bright wrote:
> On 9/11/2013 8:01 AM, Dicebot wrote:
>> std.d.lexer is a standard module for lexing D code, written by Brian Schott
>
> Thank you, Brian! This is important work.
>
> Not a thorough review, just some notes from reading the doc file:
>
> 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

I agree.

> 2. The example uses an if-then sequence of isBuiltType, isKeyword, etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value for each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. Perhaps something like:

    enum TokenCategory { BuiltInType, Keyword, ... }

    // Convert TokenType into TokenCategory
    TokenCategory category(TokenType tt) { ... }

Then in user code, you call category() on the token type, and switch over that. This maximizes performance.

Implementation-wise, I'd suggest either a hash table on TokenType, or perhaps even encoding the category into the TokenType enum values, for example:

    enum TokenCategory { BuiltInType, Keyword, ... }

    enum TokenType {
        IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
        FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
        ...
        FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
    }

Then the category function can be simply:

    TokenCategory category(TokenType tt) {
        return cast(TokenCategory)(tt >> 16);
    }

Though admittedly, this is a bit hackish. But if you're going for speed, this would be quite fast.

> 3. I assumed TokenType is a type. But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.

I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).

> 4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For an example that is done better: '*' is called 'star' rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code to refer to 'dereference' when parsing expressions. :)

> 5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. The LexerConfig documentation is awfully thin. For example, 'tokenStyle' is explained as being 'Token style', whatever that is.

I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me. Perhaps one way to improve this is to rename LexerConfig to DLexer, and make byToken a member function (or call it via UFCS):

    DLexer lex;
    lex.iterStyle = ...;
    // etc.

    foreach (token; lex.byToken()) {
        ...
    }

This reads better: you're getting a list of tokens from the lexer 'lex', as opposed to getting something from byToken(config), which doesn't really *say* what it's doing: is it tokenizing the config object?

> 6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.

What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it? I'm not sure how DMD's parser works, but the way I usually write parsers is that their state encodes the series of tokens encountered so far, so they don't need the lexer to save any tokens for them. If they need to refer to tokens, they create partial AST trees on the parser stack that reference said tokens. I don't see why it should be the lexer's job to keep track of such things.

[...]

> 8. 'default_': again with the awful practice of appending _.

Yeah, I didn't like that. Prefixing keywords with "kw" or "kw_" is much better, instead of the confusing way of appending _ to some token types but not others.

> 9. Need to insert intra-page navigation links, such as when 'byToken()' appears in the text, it should be a link to where byToken is described.

Agreed.

[...]

> I believe the state of the documentation is a showstopper, and needs to be extensively fleshed out before it can be considered ready for voting.

I think the API can be improved. The LexerConfig -> DLexer rename is an important readability issue IMO.

Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?

Now, somebody pointed out that there's currently no way to tell it that the data you're feeding to it is a partial slice of the full source. I think this should be an easy fix: LexerConfig (or DLexer after the rename) can have a field for specifying the initial line number / column number, defaulting to (1, 1) but changeable by the caller for parsing code fragments.

A minor concern I have about the current implementation (I didn't look at the code yet, but this is just based on the documentation) is that there's no way to choose what kind of data you want in the token stream. Right now, it appears that it always computes startIndex, line, and column. What if I don't care about this information, and only want, say, the token type and value (say I'm pretty-printing the source, so I don't care how the original was formatted)? Would it be possible to skip the additional computations required to populate these fields?

T

--
MACINTOSH: Most Applications Crash, If Not, The Operating System Hangs
Sep 11 2013
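A short sketch of what calling code gains from such a category helper: one switch per token instead of an if-then chain. TokenCategory and category() are as sketched in the message above; the Token fields are assumed:

    // Syntax-highlighting dispatch on the category, not the full TokenType.
    string cssClass(Token tok)
    {
        switch (category(tok.type))
        {
            case TokenCategory.BuiltInType: return "type";
            case TokenCategory.Keyword:     return "kwrd";
            default:                        return "plain";
        }
    }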
On 9/11/2013 12:55 PM, H. S. Teoh wrote:
>> 3. I assumed TokenType is a type. But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.
>
> I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).

A type is a well-understood concept in D. This is using 'type' to mean something completely different. This is very jarring. Your idea of using 'kind' is vastly superior.

>> 5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. The LexerConfig documentation is awfully thin. For example, 'tokenStyle' is explained as being 'Token style', whatever that is.
>
> I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me.

Yes, using a constructor :-)

>> 6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.
>
> What has this got to do with lexing?

The lexer needs to support backing up and going forward again. That's how the dmd lexer works.

> Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?

The point of having the input be an InputRange is you DO NOT CARE if the input is coming from a string, file, etc.
Sep 11 2013
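For readers unfamiliar with the dmd approach: "backing up" means the consumer can remember a position in the token stream and later rewind to it. A hedged sketch of one common shape for this on top of a token range; all names are hypothetical, with Token standing in for the module's token type:

    // Caches tokens as they are read so that earlier positions stay valid.
    struct TokenBuffer(R)
    {
        R source;       // underlying token range
        Token[] cache;  // tokens read so far
        size_t cursor;  // current position within the cache

        bool empty() { fill(); return cursor >= cache.length; }
        Token front() { fill(); return cache[cursor]; }
        void popFront() { fill(); ++cursor; }

        size_t mark() { return cursor; }      // remember a position
        void rewind(size_t m) { cursor = m; } // back up to it

        private void fill()
        {
            if (cursor >= cache.length && !source.empty)
            {
                cache ~= source.front;
                source.popFront();
            }
        }
    }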
On Wed, Sep 11, 2013 at 01:31:45PM -0700, Walter Bright wrote:
> On 9/11/2013 12:55 PM, H. S. Teoh wrote:
>> I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).
>
> A type is a well-understood concept in D. This is using 'type' to mean something completely different. This is very jarring. Your idea of using 'kind' is vastly superior.

No, in typical D code, user-defined type names are capitalized, such as InputRange, Duration, Complex, ByLine, etc. They do not have 'Type' in their names (we don't call them InputRangeType, for example). So this usage doesn't conflict at all.

> Yes, using a constructor :-)

Constructors with many arguments are evil, because you can't just modify one setting and use defaults for the rest; you have to specify all of them up to the last default argument you have to change.

> The lexer needs to support backing up and going forward again. That's how the dmd lexer works.

Then the parser should be fixed. Keeping track of what tokens have been seen so far is the parser's job, not the lexer's. LALR(1) parsers, for example, deal with multiple similar-looking constructs without requiring more than a single token of lookahead. The trick is that you do not assume a particular grammar production until it has been fully disambiguated. The grammar, if not conducive to this, can usually be rearranged to make it possible. For example, if the grammar has two production rules:

    StatementA ::= TokenA TokenB TokenC TokenD
    StatementB ::= TokenA TokenB TokenC TokenE

then you may claim that you need a 4-token lookahead in order to tell, given TokenA in the current input, which type of statement it's going to be. However, these rules can be rewritten like this:

    StatementABPrefix ::= TokenA TokenB TokenC
    StatementA ::= StatementABPrefix TokenD
    StatementB ::= StatementABPrefix TokenE

Now the parser only needs a 1-token lookahead to do its job, and no backtracking is ever needed. Any information encoded by the first 3 tokens can be stored in the AST node of StatementABPrefix, so it's recoverable without needing to save previous tokens.

Unless you're saying that the D grammar is more complex than LALR(1)... If so, I'd like to see an example of where this is the case.

>> Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?
>
> The point of having the input be an InputRange is you DO NOT CARE if the input is coming from a string, file, etc.

Yes, but the docs currently don't say what form this input range must take. Must it be a range of characters? Am I allowed to pass in a range of *strings* (representing source lines)? This needs to be clearly stated. The fact that tokens slice the input also needs to be clearly stated, since that precludes the use of transient ranges like byLine (your token values will be garbled by the time you get to them).

T

--
Let's call it an accidental feature. -- Larry Wall
Sep 11 2013
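A toy rendering of the rewritten grammar in recursive-descent form, showing why a single token of lookahead now suffices. Node, StatementA/StatementB, parsePrefix, and the token names are all hypothetical:

    // After the StatementABPrefix factoring, the parser commits to nothing
    // until the shared prefix is consumed; one lookahead token then picks
    // the production, with no backtracking.
    Node parseStatementAB(ref Token[] toks)
    {
        auto prefix = parsePrefix(toks); // consumes TokenA TokenB TokenC

        switch (toks[0].type) // single token of lookahead
        {
            case TokenType.tokenD:
                toks = toks[1 .. $];
                return new StatementA(prefix);
            case TokenType.tokenE:
                toks = toks[1 .. $];
                return new StatementB(prefix);
            default:
                throw new Exception("expected TokenD or TokenE");
        }
    }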
On 9/11/2013 3:02 PM, H. S. Teoh wrote:
> No, in typical D code, user-defined type names are capitalized, such as InputRange, Duration, Complex, ByLine, etc. They do not have 'Type' in their names (we don't call them InputRangeType, for example). So this usage doesn't conflict at all.

It's not a type as is understood in D. It's an enumerated value.

> Constructors with many arguments are evil, because you can't just modify one setting and use defaults for the rest; you have to specify all of them up to the last default argument you have to change.

It's not that big a deal, and certainly not evil.

>> The lexer needs to support backing up and going forward again. That's how the dmd lexer works.
>
> Then the parser should be fixed.

The parser works fine. It doesn't need fixing.

> Yes, but the docs currently don't say what form this input range must take. Must it be a range of characters?

It does say that.
Sep 11 2013
On Wed, Sep 11, 2013 at 04:42:34PM -0700, Walter Bright wrote:
> On 9/11/2013 3:02 PM, H. S. Teoh wrote:
>> Yes, but the docs currently don't say what form this input range must take. Must it be a range of characters?
>
> It does say that.

[...]

But then the code example proceeds to pass byLine() to it. Is that correct? If it is, then the docs need to be updated, because last time I checked, byLine() isn't a range of char, but a range of char *arrays*.

T

--
Real men don't take backups. They put their source on a public FTP-server and let the world mirror it. -- Linus Torvalds
Sep 11 2013
On Thursday, 12 September 2013 at 00:13:36 UTC, H. S. Teoh wrote:
> But then the code example proceeds to pass byLine() to it. Is that correct? If it is, then the docs need to be updated, because last time I checked, byLine() isn't a range of char, but a range of char *arrays*.

The example doesn't pass the result of byLine to the byToken function directly.
Sep 11 2013
On Thu, Sep 12, 2013 at 03:17:06AM +0200, Brian Schott wrote:
> The example doesn't pass the result of byLine to the byToken function directly.

*facepalm* You're right, it's calling join() on it. Nevermind what I said then. :-P Sorry for all the unnecessary noise; I don't know what got into me that I didn't see the join().

T

--
May you live all the days of your life. -- Jonathan Swift
Sep 11 2013
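For readers following along, this is roughly the shape of the example under discussion: the transient byLine() buffers are join()ed into one contiguous buffer before lexing. byToken and LexerConfig are from the module under review; the file name is illustrative:

    import std.array : join;
    import std.stdio : File, KeepTerminator;

    // byLine yields transient char[] lines; join() copies them into a
    // single buffer, which is then safe for the lexer to slice.
    LexerConfig config;
    auto source = File("app.d").byLine(KeepTerminator.yes).join();
    foreach (token; source.byToken(config))
    {
        // ...
    }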
On 2013-09-12 00:02, H. S. Teoh wrote:
> Constructors with many arguments are evil, because you can't just modify one setting and use defaults for the rest; you have to specify all of them up to the last default argument you have to change.

Currently this is possible:

    LexerConfig config = { iterStyle: IterationStyle.everything, tokenStyle: TokenStyle.source };

I would really like the following to be possible as well:

    void foo (LexerConfig config);

    foo({ iterStyle: IterationStyle.everything, tokenStyle: TokenStyle.source });

--
/Jacob Carlborg
Sep 11 2013
On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
> I believe this is probably a result of having a separate enum value for each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. [...] Though admittedly, this is a bit hackish. But if you're going for speed, this would be quite fast.

There are already plenty of hackish things in that module, so this would fit right in.

> I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code to refer to 'dereference' when parsing expressions. :)

If you read the docs/code you'll see that "*" is called "star" :-). The slice -> dotdot rename is pretty simple to do.

> I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me. Perhaps one way to improve this is to rename LexerConfig to DLexer, and make byToken a member function (or call it via UFCS): [...]

byToken is a free function because its first parameter is the range to tokenize. This allows you to use UFCS on the range (e.g. "sourceCode.byToken()"). Putting it in a struct/class would break this.

> What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it? [...] I don't see why it should be the lexer's job to keep track of such things.

For parsing, you'll likely want to use array() to grab all the tokens. But there are other uses, such as syntax highlighting, that only need one token at a time.

>> 9. Need to insert intra-page navigation links, such as when 'byToken()' appears in the text, it should be a link to where byToken is described.
>
> Agreed.

I'll work on this later this evening.

>> I believe the state of the documentation is a showstopper, and needs to be extensively fleshed out before it can be considered ready for voting.
>
> I think the API can be improved. The LexerConfig -> DLexer rename is an important readability issue IMO.

I'm not convinced (yet?).

> Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?

The input is a range of bytes, which the lexer assumes is UTF-8. The lexer works much faster on arrays than it does on arbitrary ranges, though. Dmitry Olshansky implemented a circular buffer in this module that's used to cache the input range internally. This buffer is then used for the lexing process.

> Now, somebody pointed out that there's currently no way to tell it that the data you're feeding to it is a partial slice of the full source. I think this should be an easy fix: LexerConfig (or DLexer after the rename) can have a field for specifying the initial line number / column number, defaulting to (1, 1) but changeable by the caller for parsing code fragments.

That's simple enough to add.

> A minor concern I have about the current implementation [...] is that there's no way to choose what kind of data you want in the token stream. [...] Would it be possible to skip the additional computations required to populate these fields?

It's possible, but I fear that it would make the code a mess.
Sep 11 2013
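A toy sketch of that caching idea: a fixed-size window over an input range, so the lexer can peek ahead without re-reading a non-seekable source. This illustrates the concept only; it is not the module's actual implementation:

    // Minimal circular buffer over an input range of bytes.
    struct CircularCache(R, size_t N = 1024)
    {
        R source;
        ubyte[N] buffer;
        size_t head;   // index of the current byte
        size_t filled; // bytes buffered at and after head

        // Peek n bytes ahead of the current position.
        ubyte peek(size_t n)
        {
            assert(n < N, "lookahead exceeds window size");
            while (filled <= n && !source.empty)
            {
                buffer[(head + filled) % N] = source.front;
                source.popFront();
                ++filled;
            }
            return buffer[(head + n) % N];
        }

        // Consume the current byte.
        void advance()
        {
            assert(filled > 0);
            head = (head + 1) % N;
            --filled;
        }
    }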
On Wed, Sep 11, 2013 at 10:37:34PM +0200, Brian Schott wrote:
> There are already plenty of hackish things in that module, so this would fit right in.

Better hope it passes code review, though. :)

> If you read the docs/code you'll see that "*" is called "star" :-). The slice -> dotdot rename is pretty simple to do.

I was just going by what Walter said, but OK, fair enough.

> byToken is a free function because its first parameter is the range to tokenize. This allows you to use UFCS on the range (e.g. "sourceCode.byToken()"). Putting it in a struct/class would break this.

Very good point. In that case, would it be possible to make byToken accept configuration parameters inline? That is:

    auto tokens = File(mysource)
        .byLine()    // does this actually work??
        .byToken(IterationStyle.everything, ...);

This seems better suited for UFCS-style chains, as opposed to needing to create a separate config object outside the chain just so you can use it inside. And writing LexerConfig(...) inline seems a bit cumbersome:

    auto tokens = File(mysource)
        .byLine()    // does this actually work??
        .byToken(LexerConfig(IterationStyle.everything, ...));

though, in retrospect, perhaps not as bad as it initially sounds.

> For parsing, you'll likely want to use array() to grab all the tokens. But there are other uses, such as syntax highlighting, that only need one token at a time.

I don't think that's necessary. An appropriately-constructed parser doesn't need arbitrary slicing of tokens; its internal state should encode what tokens have been seen thus far so that it never needs to backtrack.

[...]

> I'm not convinced (yet?).

OK, I take that back. UFCS chaining is an important emerging D paradigm that we should support, and the current API seems better suited for that purpose.

> The input is a range of bytes, which the lexer assumes is UTF-8. The lexer works much faster on arrays than it does on arbitrary ranges, though. Dmitry Olshansky implemented a circular buffer in this module that's used to cache the input range internally. This buffer is then used for the lexing process.

OK, if this can be indicated in the docs, that'd be great. :) (A signature constraint of the form `if (is(ElementType!R == char))` should be good enough, I think, since DDOC in DMD git HEAD will include signature constraints in the generated docs, and it should be clear enough what is expected.)

[...]

> It's possible, but I fear that it would make the code a mess.

Fair enough. It's not a significant overhead anyway (I believe), so it shouldn't be a big issue.

T

--
Once bitten, twice cry...
Sep 11 2013
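To make the bit-packing suggestion above concrete, here is a minimal compilable sketch; all names (TokenCategory, IntToken and so on) are illustrative stand-ins, not part of the proposed std.d.lexer API:

enum TokenCategory : ushort { BuiltInType, Keyword, Operator }

enum TokenType : uint
{
    IntToken   = (TokenCategory.BuiltInType << 16) | 0x0001,
    FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
    FuncToken  = (TokenCategory.Keyword << 16) | 0x0001,
}

// The category is recovered with a single shift, no table lookup needed.
pure nothrow TokenCategory category(TokenType tt)
{
    return cast(TokenCategory)(tt >> 16);
}

unittest
{
    assert(category(TokenType.FuncToken) == TokenCategory.Keyword);
    // The whole point: user code switches over a handful of categories
    final switch (category(TokenType.IntToken))
    {
        case TokenCategory.BuiltInType: break;
        case TokenCategory.Keyword:     break;
        case TokenCategory.Operator:    break;
    }
}

The final switch also gets the compiler to verify that every category is handled, which an if-then chain cannot do.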
On 2013-09-11 22:37, Brian Schott wrote:It's possible, but I fear that it would make the code a mess.Really? It seems you only need to change two places in the code: https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L1724 And https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L400 Although I just did a quick search for "line". -- /Jacob Carlborg
Sep 12 2013
On 11.09.2013 20:49, Walter Bright wrote:On 9/11/2013 8:01 AM, Dicebot wrote:Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".std.d.lexer is standard module for lexing D code, written by Brian SchottThank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file: 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that. 8. 'default_' Again with the awful practice of appending _.
Sep 11 2013
On Wednesday, 11 September 2013 at 19:57:28 UTC, Piotr Szturmaj wrote:On 11.09.2013 20:49, Walter Bright wrote:void MyVoid(string @int) { ... } The name is actually int for purposes of reflection and the like, the @ just tells the compiler that it's not a keyword.On 9/11/2013 8:01 AM, Dicebot wrote:Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".std.d.lexer is standard module for lexing D code, written by Brian SchottThank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file: 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that. 8. 'default_' Again with the awful practice of appending _.
Sep 11 2013
On 12.09.2013 01:22, Kapps wrote:On Wednesday, 11 September 2013 at 19:57:28 UTC, Piotr Szturmaj wrote:Yes, and this is important to actually have the real name without a '_' or any other prefix or suffix. Is there an enhancement proposal for this in the bugzilla?On 11.09.2013 20:49, Walter Bright wrote:void MyVoid(string int) { ... } The name is actually int for purposes of reflection and the like, the just tells the compiler that it's not a keyword.On 9/11/2013 8:01 AM, Dicebot wrote:Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".std.d.lexer is standard module for lexing D code, written by Brian SchottThank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file: 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that. 8. 'default_' Again with the awful practice of appending _.
Sep 11 2013
On 2013-09-11 21:57, Piotr Szturmaj wrote:Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".Scala does it as well: `keyword` if I recall correctly. Seems like you can put basically anything between the backticks in Scala. -- /Jacob Carlborg
Sep 12 2013
On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2 digitalmars.com> said:6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.Since this is a forward range, you can save() it at one location and continue parsing, and then restart parsing from the savepoint. https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324 This also means that the underlying range used as input to the lexer must be a forward range that you can save() too. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Sep 11 2013
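For illustration, speculative parsing on top of such a forward range could be shaped like the following minimal sketch; trySpeculative and its success test are hypothetical placeholders, not part of the module:

import std.range : isForwardRange;

// Attempt one parse interpretation; on failure, rewind to the savepoint.
bool trySpeculative(R)(ref R tokens) if (isForwardRange!R)
{
    auto checkpoint = tokens.save; // remember the current position
    bool succeeded = false;        // placeholder: a real parser would
                                   // consume tokens here and set this
    if (!succeeded)
        tokens = checkpoint;       // backtrack: restart from the savepoint
    return succeeded;
}

As Walter points out below, the price of this approach is that the tokens between the savepoint and the failure point get lexed a second time.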
On 9/11/2013 3:09 PM, Michel Fortin wrote:On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2 digitalmars.com> said:That's correct, but that implies re-lexing the tokens, which has negative performance implications.6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.Since this is a forward range, you can save() it at one location and continue parsing, and then restart parsing from the savepoint. https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324 This also means that the underlying range used as input to the lexer must be a forward range that you can save() too.
Sep 11 2013
On Wednesday, 11 September 2013 at 23:44:55 UTC, Walter Bright wrote:On 9/11/2013 3:09 PM, Michel Fortin wrote:Indeed. What solution do you have in mind?On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2 digitalmars.com> said:That's correct, but that implies re-lexing the tokens, which has negative performance implications.6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.Since this is a forward range, you can save() it at one location and continue parsing, and then restart parsing from the savepoint. https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324 This also means that the underlying range used as input to the lexer must be a forward range that you can save() too.
Sep 11 2013
On 9/11/2013 6:30 PM, deadalnix wrote:Indeed. What solution do you have in mind ?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:On 9/11/2013 6:30 PM, deadalnix wrote:But then, you have an extra step when looking up every token, plus memory management overhead. How big is the performance improvement?Indeed. What solution do you have in mind?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
On 9/11/2013 6:54 PM, deadalnix wrote:On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:Not really - I use a dirty trick in that a static instance is always the start of the list. But even so, an extra indirection is better than re-lexing. Lexing is a clear bottleneck in the profiling. I've even been thinking of special casing the code that scans comments to use SIMD instructions.On 9/11/2013 6:30 PM, deadalnix wrote:But then, you have an extra step when looking up every tokens + memory management overhead. How big is the performance improvement ?Indeed. What solution do you have in mind ?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
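A rough self-contained sketch of the scheme Walter describes, with an embedded head node standing in for dmd's static instance; Tok and LookaheadLexer are invented for illustration, not dmd's actual types:

struct Tok { int type; Tok* next; }

struct LookaheadLexer
{
    int[] source; // stand-in for the real lexer input
    Tok head;     // embedded first node: the list always has a start

    // Return the token after t, lexing it on demand.
    Tok* peek(Tok* t)
    {
        if (t.next is null) // not lexed yet: lex it exactly once
        {
            t.next = new Tok(source.length ? source[0] : -1, null);
            if (source.length)
                source = source[1 .. $];
        }
        return t.next;
    }
}

unittest
{
    auto lex = LookaheadLexer([1, 2, 3]);
    auto t1 = lex.peek(&lex.head); // lookahead depth 1
    auto t2 = lex.peek(t1);        // lookahead depth 2
    assert(t1.type == 1 && t2.type == 2);
    assert(lex.peek(&lex.head) is t1); // re-peeking never re-lexes
}

Because the head node lives inside the lexer, a lookahead of depth one never allocates a list node for the current token, which matches the "dirty trick" described above.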
On Thursday, 12 September 2013 at 04:57:50 UTC, Walter Bright wrote:On 9/11/2013 6:54 PM, deadalnix wrote:If I understand you correctly, that means that a lookahead of one token does not trigger any allocation.On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:Not really - I use a dirty trick in that a static instance is always the start of the list.On 9/11/2013 6:30 PM, deadalnix wrote:But then, you have an extra step when looking up every token, plus memory management overhead. How big is the performance improvement?Indeed. What solution do you have in mind?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.But even so, an extra indirection is better than re-lexing. Lexing is a clear bottleneck in the profiling. I've even been thinking of special casing the code that scans comments to use SIMD instructions.See my comment: it is possible, with increased parser complexity, to handle many cases where you don't know what you are parsing yet. Doing so, lookahead is only required to find the matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allows saving one branch per token.
Sep 11 2013
On 9/11/2013 10:10 PM, deadalnix wrote:See my comment, it is possible, with increased parser complexity, to handle many cases where you don't know what you are parsing yet. Doing so, lookahead is only required to find matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allow to save one branch per token.I don't believe that, because you can see about anything for tokens in lookahead and so have to duplicate nearly the whole lexer anyway for the 'fast path', but you're free to try it out and prove me wrong.
Sep 11 2013
On Thursday, 12 September 2013 at 06:17:04 UTC, Walter Bright wrote:On 9/11/2013 10:10 PM, deadalnix wrote:I plan to, but you know how it is: the best optimization is the one that goes from a non-working to a working state.See my comment: it is possible, with increased parser complexity, to handle many cases where you don't know what you are parsing yet. Doing so, lookahead is only required to find the matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allows saving one branch per token.I don't believe that, because you can see about anything for tokens in lookahead and so have to duplicate nearly the whole lexer anyway for the 'fast path', but you're free to try it out and prove me wrong.
Sep 12 2013
On 09/12/2013 08:17 AM, Walter Bright wrote:I don't believe that, because you can see about anything for tokens in lookahead and so have to duplicate nearly the whole lexer anyway for the 'fast path', but you're free to try it out and prove me wrong.Me neither, and if you have to duplicate the code for this it's a double loss. I think it would be more fruitful to batch lexing, i.e. fill a whole memory page (4Kb) with tokens, then go back to parsing until you need more.
Sep 12 2013
On 09/12/2013 03:39 AM, Walter Bright wrote:On 9/11/2013 6:30 PM, deadalnix wrote:Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.Indeed. What solution do you have in mind ?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
On Thu, Sep 12, 2013 at 04:21:36AM +0200, Martin Nowak wrote:On 09/12/2013 03:39 AM, Walter Bright wrote:I still don't understand why backtracking is necessary in the first place. I would've thought a modern parser should be well able to encode seen tokens in its state so that backtracking is never necessary. Or does D grammar have tricky bits that cannot be handled this way, that I'm unaware of? T -- The peace of mind---from knowing that viruses which exploit Microsoft system vulnerabilities cannot touch Linux---is priceless. -- Frustrated system administrator.On 9/11/2013 6:30 PM, deadalnix wrote:Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.Indeed. What solution do you have in mind ?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
On Thursday, 12 September 2013 at 03:37:55 UTC, H. S. Teoh wrote:I still don't understand why backtracking is necessary in the first place. I would've thought a modern parser should be well able to encode seen tokens in its state so that backtracking is never necessary. Or does the D grammar have tricky bits that cannot be handled this way, that I'm unaware of?The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity. In some cases the complexity is manageable, but in others it doesn't make any sense (it has to be noted that even full lexing doesn't make any sense here). For instance:

int foo()() {}
       ^

When you are at the caret position, you don't know if you face a function declaration or a template declaration. You could go for some ambiguous parsing, but each template argument can itself be a type, an expression or a symbol. In such a situation, it is much simpler to skip ahead to the closing parenthesis, check what's coming after, and act accordingly. The alternative is to go for some ambiguous function/template parameter parsing and resolve things at the end, but as template arguments are themselves ambiguous types/expressions/symbols, the amount of complexity in the parser is doomed to explode. Note that the example is not self-contained; for instance, both template parameters and arguments can imply template instantiation, which means more ambiguous argument parsing. SDC does a good deal of ambiguous parsing without backtracking (more than DMD does), but you've got to draw the line somewhere. What I'd like to see is a go-to-the-closing-token feature that can eventually take a fast path to do so, or an efficient token buffering system (I'm not sure which would be faster, but I'm inclined to think the first one; however, that won't be suited to a dmd-style parser).
Sep 11 2013
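The skip-to-the-closing-token operation described above is cheap to sketch. Here tokens are modelled as plain characters for brevity; a real implementation would compare TokenType values instead:

import std.range;

// Advance past a balanced open/close pair, assuming the range
// currently starts at the opening token.
void skipBalanced(R)(ref R tokens, ElementType!R open, ElementType!R close)
{
    int depth = 0;
    do
    {
        if (tokens.front == open) ++depth;
        else if (tokens.front == close) --depth;
        tokens.popFront();
    } while (!tokens.empty && depth > 0);
}

unittest
{
    auto r = "(alias A)(alias B) {}";
    skipBalanced(r, '(', ')');
    assert(r == "(alias B) {}");
}

With that, the foo()() example amounts to skipping from the first '(' to its matching ')' and then testing whether another '(' follows.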
On Thu, Sep 12, 2013 at 07:00:09AM +0200, deadalnix wrote:On Thursday, 12 September 2013 at 03:37:55 UTC, H. S. Teoh wrote:[...] This can be handled by using an intermediate grammar rule. Reduce all (...) into an intermediate type, say ArgList, so the reduction happens something like this:

int    foo    ()       ()       {}
Type   Ident  ArgList  ArgList
                               ^

Then have the rules:

CompileTimeArgs ::= ArgList
RuntimeArgs     ::= ArgList
TemplateDecl    ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
FuncDecl        ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed whether they are compile-time arguments or runtime arguments. It's only after you see the next '(' or '{' that you decide whether ArgList should reduce to CompileTimeArgs or RuntimeArgs. ArgList itself, of course, will accept all possible parameters (both runtime and compile-time): types, expressions, symbols. Then when you reduce it to RuntimeArgs, you reject stuff that can't be interpreted as parameter declarations. T -- There are two ways to write error-free programs; only the third one works.I still don't understand why backtracking is necessary in the first place. I would've thought a modern parser should be well able to encode seen tokens in its state so that backtracking is never necessary. Or does the D grammar have tricky bits that cannot be handled this way, that I'm unaware of?The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity. In some cases the complexity is manageable, but in others it doesn't make any sense (it has to be noted that even full lexing doesn't make any sense here). For instance:

int foo()() {}
       ^

When you are at the caret position, you don't know if you face a function declaration or a template declaration. You could go for some ambiguous parsing, but each template argument can itself be a type, an expression or a symbol.
Sep 12 2013
On Thursday, 12 September 2013 at 14:09:43 UTC, H. S. Teoh wrote:This can be handled by using an intermediate grammar rule. Reduce all (...) into an intermediate type, say ArgList, so the reduction happens something like this:

int    foo    ()       ()       {}
Type   Ident  ArgList  ArgList
                               ^

Then have the rules:

CompileTimeArgs ::= ArgList
RuntimeArgs     ::= ArgList
TemplateDecl    ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
FuncDecl        ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed whether they are compile-time arguments or runtime arguments. It's only after you see the next '(' or '{' that you decide whether ArgList should reduce to CompileTimeArgs or RuntimeArgs. ArgList itself, of course, will accept all possible parameters (both runtime and compile-time): types, expressions, symbols. Then when you reduce it to RuntimeArgs, you reject stuff that can't be interpreted as parameter declarations.And then you've got to backtrack the parsing instead of the lexing. You've just moved the problem around. You'll have to create some temporary AST nodes that will then be fixed up into what they really are.
Sep 12 2013
On Thu, Sep 12, 2013 at 06:09:41PM +0200, deadalnix wrote:On Thursday, 12 September 2013 at 14:09:43 UTC, H. S. Teoh wrote:No. You can just use ArgListItem for both runtime args and compile-time args. Once you've decided which one it is, wrong arguments are rejected at semantic time (which you have to do anyway). Let's take a concrete example. Say we're parsing this invalid code:

int foo(alias A)(alias B) {}

You'd go through these steps:

1) Parse initial prefix of declaration:

int foo(alias A)(alias B) {}
       ^
AST: FuncDecl
 |--RetType: int
 |--Ident: foo
 \--[ being built ]

2) Parse first (...):

int foo(alias A)(alias B) {}
                ^
AST: FuncDecl
 |--RetType: int
 |--Ident: foo
 |--ArgList
 |   \-- AliasArg
 |        \-- ident: A
 \--[ being built ]

I'm skipping the intermediate steps here, it's obvious how to construct AliasArg from the usual parsing process.

3) Parse second (...):

int foo(alias A)(alias B) {}
                         ^
AST: FuncDecl
 |--RetType: int
 |--Ident: foo
 |--ArgList
 |   \-- AliasArg
 |        \-- ident: A
 |--ArgList
 |   \-- AliasArg
 |        \-- ident: B
 \--[ being built ]

4) At this point, you now know the first ArgList is CompileTimeArgList, and the second is RuntimeArgList, so you can just change the type fields (along with narrowing FuncDecl to TemplateFuncDecl):

AST: TemplateFuncDecl (was: FuncDecl)
 |--RetType: int
 |--Ident: foo
 |--CompileTimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: A
 |--RuntimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: B
 \--[ being built ]

Since you're still constructing FuncDecl, your current parsing context should still have a direct reference to the partially-constructed FuncDecl node, which in turn has a direct reference to both ArgList child nodes. So this is just dereferencing a couple of pointers. No backtracking.

5) Finish parsing the declaration:

int foo(alias A)(alias B) {}
                            ^
AST: TemplateFuncDecl
 |--RetType: int
 |--Ident: foo
 |--CompileTimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: A
 |--RuntimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: B
 \--FuncBody
     \-- CompoundStatement
          \-- [empty body]

6) Run semantic:
 - Create local symbol table for foo, etc.
 - Run semantic on CompileTimeArgList:
   - Check AliasArg for validity
   - Run semantic on AliasArg: add A to function's local symbol table, etc.
 - Run semantic on RuntimeArgList:
   - Check AliasArg for validity: ERROR: cannot have alias parameter at runtime.
   - (Add B to local symbol table)(skipped due to previous error)
 - (Run semantic on FuncBody)(skipped due to previous error)
 - (Run semantic on RetType (verify return type match, etc.))(skipped due to previous error)
 - (Add function to parent scope symbol table)(skipped due to previous error)

So, no backtracking is necessary. Of course, it sounds like DMD's parser doesn't work this way, but that's a limitation of DMD's parser, not an *inherent* need for backtracking. T -- I see that you JS got Bach.This can be handled by using an intermediate grammar rule. Reduce all (...) into an intermediate type, say ArgList, so the reduction happens something like this:

int    foo    ()       ()       {}
Type   Ident  ArgList  ArgList
                               ^

Then have the rules:

CompileTimeArgs ::= ArgList
RuntimeArgs     ::= ArgList
TemplateDecl    ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
FuncDecl        ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed whether they are compile-time arguments or runtime arguments. It's only after you see the next '(' or '{' that you decide whether ArgList should reduce to CompileTimeArgs or RuntimeArgs. ArgList itself, of course, will accept all possible parameters (both runtime and compile-time): types, expressions, symbols. Then when you reduce it to RuntimeArgs, you reject stuff that can't be interpreted as parameter declarations.And then you've got to backtrack the parsing instead of the lexing. You've just moved the problem around. You'll have to create some temporary AST nodes that will then be fixed up into what they really are.
Sep 12 2013
We are talking about parameters here, not arguments. But overall, that isn't the point. If I have int a = 3 as a parameter in the first set of (), I still have no clue whether it is a runtime or compile-time parameter. And what about ref int a = 3? Going down that road, you end up with a lot of ambiguous nodes (like FunctionArgumentOrValueTemplateParameter) and have to manage that complexity, or you have to backtrack on the parsing and fix up the ambiguous nodes once you have enough information.
Sep 12 2013
deadalnix wrote:When you are at the caret position, you don't know...If one ever reaches that position, one uses petty lexer/grammar definitions. -manfred
Sep 12 2013
On Thursday, 12 September 2013 at 05:00:11 UTC, deadalnix wrote:The problem is that it can cause a exponential (and I literally mean exponential here) amount of complexity. The alternative is to go for some ambiguous function/template parameters parsing and resolve at the end, but as template argument are themselves ambiguous type/expression/symbols, the amount of complexity in the parser is doomed to explode.Pretty sure a GLR parser handles that well within O(n^2) space. Nothing exponential necessary...
Sep 27 2013
On Saturday, 28 September 2013 at 01:23:48 UTC, Mehrdad wrote:On Thursday, 12 September 2013 at 05:00:11 UTC, deadalnix wrote:I meant time... Though it's true for space too.The problem is that it can cause a exponential (and I literally mean exponential here) amount of complexity. The alternative is to go for some ambiguous function/template parameters parsing and resolve at the end, but as template argument are themselves ambiguous type/expression/symbols, the amount of complexity in the parser is doomed to explode.Pretty sure a GLR parser handles that well within O(n^2) space. Nothing exponential necessary...
Sep 27 2013
On 9/11/2013 7:21 PM, Martin Nowak wrote:On 09/12/2013 03:39 AM, Walter Bright wrote:No, I don't have a rough idea, but it doesn't show up in the profile.On 9/11/2013 6:30 PM, deadalnix wrote:Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.Indeed. What solution do you have in mind ?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
On 09/12/2013 04:21 AM, Martin Nowak wrote:On 09/12/2013 03:39 AM, Walter Bright wrote:On 9/11/2013 6:30 PM, deadalnix wrote:Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.Indeed. What solution do you have in mind?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.Maybe some fixed-size stack vector with 64 elements or so, plus a linked list for the unusual case, would help...
Sep 12 2013
13-Sep-2013 00:46, Robert Schadek wrote:On 09/12/2013 04:21 AM, Martin Nowak wrote:On 09/12/2013 03:39 AM, Walter Bright wrote:On 9/11/2013 6:30 PM, deadalnix wrote:Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.Indeed. What solution do you have in mind?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.Maybe some fixed-size stack vector with 64 elements or so, plus a linked list for the unusual case, would help...And an extra branch to detect which one is currently the case? No, thanks ;) -- Dmitry Olshansky
Sep 12 2013
On 09/12/2013 11:03 PM, Dmitry Olshansky wrote:I would think that branching and using the vector is faster than constructing and linking nodes in the LLMaybe some fixed-size stack vector with 64 elements or so, plus a linked list for the unusual case, would help...And an extra branch to detect which one is currently the case? No, thanks ;)
Sep 12 2013
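The suggestion above amounts to a small-buffer optimization. A minimal sketch, using a dynamic array rather than a linked list as the spill area (TokenBuffer is an invented name, not part of the module):

struct TokenBuffer(T, size_t N = 64)
{
    private T[N] small;   // fixed storage: covers almost all lookahead
    private T[] overflow; // rarely touched spill area
    private size_t len;

    void push(T t)
    {
        if (len < N)
            small[len] = t;
        else
            overflow ~= t; // the "unusual case" branch
        ++len;
    }

    ref T opIndex(size_t i)
    {
        if (i < N)
            return small[i];
        return overflow[i - N];
    }

    @property size_t length() const { return len; }
}

unittest
{
    TokenBuffer!int buf;
    foreach (i; 0 .. 100)
        buf.push(i);
    assert(buf[3] == 3 && buf[80] == 80 && buf.length == 100);
}

Whether the extra bounds branch beats pointer chasing is exactly the question raised here; only profiling can settle it.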
On Thursday, September 12, 2013 23:55:23 Robert Schadek wrote:On 09/12/2013 11:03 PM, Dmitry Olshansky wrote:Well, it sounds like there are several ideas to try out in this thread. I guess that only benchmarking and profiling will really tell us which (if any) are better though. - Jonathan M DavisI would think that branching and using the vector is faster than using the constructing and linking nodes in the LLMaybe some fixed size stack vector with 64 elements or so and some linked list for the unusual case would help ..And an extra branch to detect which one is currently the case? No, thanks ;)
Sep 12 2013
On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:On 9/11/2013 6:30 PM, deadalnix wrote:I think this is the right approach. It can probably be another function that we can put into std.range and reuse for other lexers/parsers. The lexer or the parser should not be made more complex for this.Indeed. What solution do you have in mind?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 12 2013
On 09/12/2013 03:30 AM, deadalnix wrote:Buffering the tokens would work. There are some ways to promote input ranges to forward ranges. But there are also some pitfalls, like the implicit save on copy. I have two prototypes for a generic input range buffer:

https://gist.github.com/dawgfoto/2187220 - uses a growing ring buffer
https://gist.github.com/dawgfoto/1257196 - uses ref-counted lookahead buffers in a singly linked list

The lexer itself has a ring buffer for input ranges. https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2278That's correct, but that implies re-lexing the tokens, which has negative performance implications.Indeed. What solution do you have in mind?
Sep 11 2013
On 09/11/2013 08:49 PM, Walter Bright wrote:3. I assumed TokenType is a type.(is(TokenType) is true.)But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.It's a type tag. The tag uniquely determines the type. (As in 'the type of a token', as opposed to 'the type of an expression'.)4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For example that is done better, '*' is called 'star', rather than 'dereference'.FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.5. The LexerConfig initialization should be a constructor rather than a sequence of assignments.Using the { a:2, b:3 }-style initialization syntax?6. No clue how lookahead works with this.Eg. use a CircularBuffer adapter range. I have an implementation currently coupled with my own lexer implementation. If there is interest, I could factor it out. Lookahead is realized as follows in the parser (assume 'code' is the circular buffer range):

// saves the state and mutes all error messages until the state is restored
auto saveState(){ muteerr++; return code.pushAnchor(); }
void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.) This approach is typically more efficient than using a free list on contemporary architectures.
Sep 12 2013
12-Sep-2013 19:39, Timon Gehr wrote:On 09/11/2013 08:49 PM, Walter Bright wrote:I like this. Not only does this have the benefit of not colliding with keywords. I also imagine that it could be incredibly convenient to get back the symbolic representation of a token (when a token is used as a parameter to an AST node, say BinaryExpr!(Tok!"+")). And truth be told, we all know how tokens look in symbolic form, so learning a pack of names for them feels pointless.4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For example that is done better, '*' is called 'star', rather than 'dereference'.FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.This ^^ is how. In fact std.d.lexer internally does a similar thing with non-RA ranges of bytes.6. No clue how lookahead works with this.Eg. use a CircularBuffer adapter range. I have an implementation currently coupled with my own lexer implementation. If there is interest, I could factor it out. Lookahead is realized as follows in the parser (assume 'code' is the circular buffer range):

// saves the state and mutes all error messages until the state is restored
auto saveState(){ muteerr++; return code.pushAnchor(); }
void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.) This approach is typically more efficient than using a free list on contemporary architectures.
Sep 12 2013
On Thu, Sep 12, 2013 at 08:10:18PM +0400, Dmitry Olshansky wrote:12-Sep-2013 19:39, Timon Gehr wrote:+1. This is superior to both the ad hoc _ suffix and my ad hoc prefixing approach. Tok!"default" is maximally readable, and requires no silly convolutions like _ or 'kw' / 'tokenType' prefixes. I vote for Tok!"..." to denote token types. Question: what's the implementation of Tok? Does it fit into an enum? What's the underlying representation? I imagine some kind of canonical mapping into an integral type would be desired, to maximize runtime performance. T -- There are three kinds of people in the world: those who can count, and those who can't.On 09/11/2013 08:49 PM, Walter Bright wrote:I like this. Not only does this have the benefit of not colliding with keywords. I also imagine that it could be incredibly convenient to get back the symbolic representation of a token (when a token is used as a parameter to an AST node, say BinaryExpr!(Tok!"+")). And truth be told, we all know how tokens look in symbolic form, so learning a pack of names for them feels pointless.4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For example that is done better, '*' is called 'star', rather than 'dereference'.FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.
Sep 12 2013
On 09/12/2013 06:40 PM, H. S. Teoh wrote:I vote for Tok!"..." to denote token types. Question: what's the implementation of Tok? Does it fit into an enum? What's the underlying representation? I imagine some kind of canonical mapping into an integral type would be desired, to maximize runtime performance.This is just a quick hack. One would maybe want to avoid the unwarranted copies and linear searches at compile time:

import std.algorithm;

private enum symbols = ["i",".","..",",","0",/+...+/];

struct TokenType{
    private uint _tag; // or smaller type
    private this(int tag){ _tag = tag; }
    string toString(){ // (optional)
        static array(R)(R s){ string[] r; foreach(x;s) r~=x; return r; }
        static immutable strs = array(symbols.map!((a)=>`Tok!"`~a~`"`));
        return strs[_tag];
    }
}

template Tok(string name)if(symbols.countUntil(name)!=-1){
    enum Tok = TokenType(cast(uint)symbols.countUntil(name));
}

import std.stdio;
void main(){
    enum x = Tok!"i";
    writeln(x);
    writeln(Tok!"0");
}
Sep 12 2013
On 2013-09-12 17:39, Timon Gehr wrote:Using the { a:2, b:3 }-style initialization syntax?Unfortunately that's currently not possible to pass to functions, see my other post: http://forum.dlang.org/thread/jsnhlcbulwyjuqcqoepe forum.dlang.org?page=6#post-l0ro6h:249mk:241:40digitalmars.com -- /Jacob Carlborg
Sep 12 2013
On 09/12/2013 05:39 PM, Timon Gehr wrote:Lookahead is realized as follows in the parser (assume 'code' is the circular buffer range):

// saves the state and mutes all error messages until the state is restored
auto saveState(){ muteerr++; return code.pushAnchor(); }
void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.)Do you think it's possible to use CircularBufferRange itself as the anchor? Then calling save() would return a CircularBufferRange and you could scratch the two functions above. I had some promising experiments in that direction, but the implicit save on copy is annoying because you end up with anchors from temporary copies.
Sep 12 2013
On 09/12/2013 10:03 PM, Martin Nowak wrote:On 09/12/2013 05:39 PM, Timon Gehr wrote:Well, this implementation is geared towards the usage pattern found in the top-down parser. If the first anchor is not restored last, it will not generally work. This does not conform to the range interface. Implementing .save requires a more advanced and less efficient implementation.Lookahead is realized as follows in the parser (assume 'code' is the circular buffer range):

// saves the state and mutes all error messages until the state is restored
auto saveState(){ muteerr++; return code.pushAnchor(); }
void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.)Do you think it's possible to use CircularBufferRange itself as the anchor?Then calling save() would return a CircularBufferRange and you could scratch the two functions above. I had some promising experiments in that direction,What did you do? The way I think I'd approach it would be to maintain a binary min-heap using a few pointers in each range instance.but the implicit save on copy is annoying because you end up with anchors from temporary copies.Is this referring to suboptimal efficiency? I guess that is rather hard to address in a safe way. Basically, you'd want to forget about "anchor management" in case it can be proven that there is another range on the same buffer that stays at most as progressed during the whole lifetime of the range under consideration. Of course, it is always possible to do it unsafely.
Sep 12 2013
On 9/11/2013 8:01 AM, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian SchottOne litmus test of the validity of the design is to see if one can build a D parser out of it. Otherwise we'll likely face the unpleasant task of needing to deprecate std.d.lexer.
Sep 11 2013
On Wednesday, 11 September 2013 at 20:37:11 UTC, Walter Bright wrote:On 9/11/2013 8:01 AM, Dicebot wrote:I'm not sure if you've followed the announce mailing list recently, but this lexer is already used in both the DScanner and DCD projects. There is a parser module that goes along with this lexer included in DScanner (and used by DCD as well). It's not ready for review though. I'm still working on improving its error recovery. DScanner is a tool for D source that's able to generate syntax-highlighted HTML, perform syntax validation, generate the AST in XML format, generate CTAGS, list imports, and count lines of code. DCD is an autocompletion client/server program that has integrations for several text editors.std.d.lexer is standard module for lexing D code, written by Brian SchottOne litmus test of the validity of the design is to see if one can build a D parser out of it. Otherwise we'll likely face the unpleasant task of needing to deprecate std.d.lexer.
Sep 11 2013
On 9/11/2013 1:44 PM, Brian Schott wrote:I'm not sure if you've followed the announce mailing list recently, but this lexer is already used in both the DScanner and DCD projects.Awesome! This is just what I asked for.
Sep 11 2013
On 09/11/2013 05:01 PM, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian SchottNice!Please take this part seriously: "If you identify problems along the way, please note if they are minor, serious, or showstoppers." (http://wiki.dlang.org/Review/Process). This information later will be used to determine if library is ready for voting.I only had a short look at the source code and was left with a good impression, i.e. it looks like a lexer written in D. I do like how you made the lexer configurable to serve the different needs. Also it was very easy to set up and use. The appended underscore for keywords is much nicer than messing with uppercase/lowercase, which was what I did. One thing that bothered me was the usage of const(ubyte)[] as input. If you operate on UTF-8 it should take const(char)[]; you can cast it to const(ubyte)[] internally. Also a convenience function that reads a file and processes UTF BOM marks would be nice (see toUtf8 https://github.com/dawgfoto/lexer/blob/master/dlexer/dlexer.d#L1429), but that could as well fit somewhere else into Phobos.---- Information request from module author ---- Performance was a major discussed topic in previous thread. Could you please provide benchmarking data for version currently ongoing the review?Nice too, it works as fast as my D lexer (https://github.com/dawgfoto/lexer) which is about as fast as DMD's lexer (when compiled with LDC).
Sep 11 2013
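Such a convenience function might look like the following sketch. It handles only the UTF-8 BOM; a full version would also detect the UTF-16/32 BOMs and transcode, as the linked toUtf8 does (readSource is an invented name):

import std.file : read;

// Read a source file and strip a leading UTF-8 BOM (EF BB BF) if present.
const(char)[] readSource(string path)
{
    auto bytes = cast(const(ubyte)[]) read(path);
    if (bytes.length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        bytes = bytes[3 .. $];
    return cast(const(char)[]) bytes;
}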
On 2013-09-12 00:36, Martin Nowak wrote:Also a convenience function that reads a file and processes UTF BOM marks would be nice (see toUtf8 https://github.com/dawgfoto/lexer/blob/master/dlexer/dlexer.d#L1429), but that could as well fit somewhere else into phobos.Sounds like that would fit in std.file or similar. -- /Jacob Carlborg
Sep 12 2013
On 11.09.2013 17:01, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian SchottA standard benchmark scenario is needed - to compare dmd and std.d.lexer with dmc_dmd, ldc, gdc and msc_dmd - just to be clear how close it comes to the current dmd lexing speed.
Sep 11 2013
On 2013-09-11 17:01, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian SchottFinally :)

* How does it handle errors? Just return TokenType.invalid?
* Personally I think the module is too big. I would go with:
  - std.d.lexer.token
  - std.d.lexer.tokentype
  - std.d.lexer.lexer - contains the rest
  - std.d.lexer.config - IterationStyle, TokenStyle, LexerConfig
  - CircularRange, StringCache, possibly put somewhere else. I assume this can be used for other things than lexing?
  - Trie related code, same as above
* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token. If we do decide it should throw, it should absolutely _not_ throw a plain Exception. Create a new type, LexException or similar. I hate when code throws plain Exceptions, it makes it useless to catch them. I would also expect this LexException to contain a Token. It shouldn't be needed to parse the exception message to get line and column information.
* I like that you overall use clear and descriptive variable and function names. Except "sbox": https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L3265
* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table
* I would like to see a short description for each unit test, what it's testing. Personally I have started with this style:

describe("byToken") {
    context("valid string literal") {
        it("should return a token with the type TokenType.stringLiteral")
        unittest {
            // test
        }

        it("should return a token with the correct lexeme")
        unittest {
            // test
        }
    }
}

Better formatted: http://pastebin.com/Dx78Vw6r People here might think that would be a bit too verbose. The following would be ok as well:

describe("short description of the unit test")
unittest {
}

* Could you remove debug code and other code that is commented out:
  - 344
  - 1172
  - 1226, is that needed?
  - 3165-3166
  - 3197-3198
  - 3392
  - 3410
  - 3434

Spelling errors:
* "forwarad" - 292
* "commemnt" - 2031
* "sentenels" - 299
* "messsage" - 301
* "underliying" - 2454
* "alloctors" - 3230
* "strightforward" - 2276

-- /Jacob Carlborg
Sep 12 2013
12-Sep-2013 12:05, Jacob Carlborg wrote:On 2013-09-11 17:01, Dicebot wrote:These could be one module. There is really no meaningful way to use the token type separately from the token.std.d.lexer is standard module for lexing D code, written by Brian SchottFinally :) * How does it handle errors? Just return TokenType.invalid? * Personally I think the module is too big. I would go with: - std.d.lexer.token - std.d.lexer.tokentype- std.d.lexer.lexer - contains the rest - std.d.lexer.config - IterationStyle, TokenStyle, LexerConfigOn the contrary, I see this breakdown as pointless - do you really want to use the config without the lexer?- CircularRange, StringCache, possibly put somewhere else. I assume this can be used for other things than lexing? - Trie related code, same as aboveThe reason is that no good public interface has been defined. Basically the same as with Trie in the new std.uni module - it needs its own review.* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token. If we do decide it should throw, it should absolutely _not_ throw a plain Exception. Create a new type, LexException or similar. I hate when code throws plain Exceptions, it makes it useless to catch them. I would also expect this LexException to contain a Token. It shouldn't be needed to parse the exception message to get line and column information.Better yet would be to have a std exception hierarchy... so that all parsing modules can be tied to ParseException. So this needs to be resolved in a forward-compatible way. -- Dmitry Olshansky
Sep 12 2013
On 2013-09-12 18:21, Dmitry Olshansky wrote:Contrary I see this break down pointless - do you really want to use config without the lexer?std.d.lexer.config might be a bit unnecessary. But yes, in general I would like to see smaller modules. -- /Jacob Carlborg
Sep 12 2013
I got some time to work on the lexer this evening. Changeset here: https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d The DDoc page has moved here: http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html

* There are a few more unit tests now
* bitAnd renamed to amp
* slice renamed to dotdot
* Much more cross-referencing in the doc comments
* Start line and column can be specified in the lexer config
Sep 12 2013
On 12.09.2013 10:15, Brian Schott wrote:I got some time to work on the lexer this evening. Changeset here: https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d The DDoc page has moved here: http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html * There are a few more unit tests now * bitAnd renamed to amp * slice renamed to dotdot * Much more cross-referencing in the doc comments * Start line and column can be specified in the lexer configProblem: many occurrences of the same string. You should use constants for the tokens (and others):

string asm_token = "asm";
...
immutable(string[TokenType.max + 1]) tokenValues = [ ... asm_token ... ]

and reuse these constants in your "optimization". Maybe you can replace these lines with something that gets fed asm_token and gives the same result, but without 'a' and "sm" as split-up and different magic values - maybe a nice template or subrange...

case 'a':
    if (input[1..$].equal("sm"))
        return TokenType.asm_;
    else ...
    break;

and in your unit tests, for example on auto expected =
Sep 12 2013
just an idea regarding your string[TokenType.max + 1] in immutable(string[TokenType.max + 1]) tokenValues = [..]: it seems that you are trying to reduce the memory usage. Wouldn't it be a nice idea to generate a combined immutable string at compile time, like this one: "...pragmaexportpackageprivate...", and generate string slice accesses?

immutable string big_one = generated_from(token_list);
immutable string export_token = big_one[10 .. 16];
Sep 12 2013
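A rough CTFE sketch of how that could work; tokenList here is a hypothetical three-entry table standing in for the real one, and the offset computation is illustrative, not what std.d.lexer actually does:

// Hypothetical token list; the real table is much larger.
enum tokenList = ["asm", "export", "pragma"];

// CTFE-able concatenation of all token strings into one blob.
string buildBlob(const(string)[] toks)
{
    string blob;
    foreach (t; toks)
        blob ~= t;
    return blob;
}

immutable string bigOne = buildBlob(tokenList);

// Slice for token #index, with the offsets fixed at compile time.
string tokenSlice(size_t index)()
{
    enum start = buildBlob(tokenList[0 .. index]).length;
    enum end = start + tokenList[index].length;
    return bigOne[start .. end];
}

unittest
{
    assert(tokenSlice!1() == "export");           // bigOne[3 .. 9]
    assert(tokenSlice!1().ptr is bigOne.ptr + 3); // a slice, not a copy
}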
On 9/12/2013 1:15 AM, Brian Schott wrote:I got some time to work on the lexer this evening. Changeset here: https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d The DDoc page has moved here: http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.htmlGreat!* There are a few more unit tests nowI strongly recommend running the unit tests with -cov. A lexer should be able to get near 100% coverage with the unit tests.* bitAnd renamed to amp * slice rename to dotdot * Much more cross-referencing in the doc comments * Start line and column can be specified in the lexer config
Sep 12 2013
On Thursday, 12 September 2013 at 23:40:55 UTC, Walter Bright wrote:I strongly recommend running the unit tests with -cov. A lexer should be able to get near 100% coverage with the unit tests.Some of the code is only present to be used at compile-time to generate switch statements inside of the lexer. -cov doesn't show code that's executed at compile-time to be covered, and it couldn't show meaningful line numbers on code that's generated and mixed in. That being said, it's currently 70% covered. I'll be making another pass over the code later to fill in some of the gaps.
Sep 12 2013
On 9/12/2013 4:57 PM, Brian Schott wrote:On Thursday, 12 September 2013 at 23:40:55 UTC, Walter Bright wrote:That's right.I strongly recommend running the unit tests with -cov. A lexer should be able to get near 100% coverage with the unit tests.Some of the code is only present to be used at compile-time to generate switch statements inside of the lexer. -cov doesn't show code that's executed at compile-time to be covered, and it couldn't show meaningful line numbers on code that's generated and mixed in.That being said, it's currently 70% covered. I'll be making another pass over the code later to fill in some of the gaps.Thanks.
Sep 12 2013
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:Most important goal of this review is to determine any API / design problems. Any internal implementation tweaks may happen after inclusion to Phobos but it is important to assure that no breaking changes will be required any time soon after module will get wider usage.One quick remark: we need some kind of value provider that can be reused across different lexing passes, and that can be used outside the lexer. If I process a module and have to kick off a new lexing phase because of a mixin, I want to generate identifiers out of the same pool.
Sep 12 2013
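What is being asked for is presumably something like the following hypothetical interning pool, shared across lexer instances; this is a sketch, not std.d.lexer's current API:

struct IdentifierPool
{
    private string[string] interned;

    // Return the canonical copy of s, storing it on first sight.
    string intern(string s)
    {
        if (auto p = s in interned)
            return *p;
        interned[s] = s;
        return s;
    }
}

unittest
{
    IdentifierPool pool;
    auto a = pool.intern("foo".idup); // from the module's source
    auto b = pool.intern("foo".idup); // from a later mixin's source
    assert(a is b); // one pool, one copy, across lexing passes
}

With one pool passed to every lexer instance, identifiers can be compared by reference instead of by content, which is the benefit being requested.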
On 09/11/2013 05:01 PM, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian Schott ---- Input ---- Code: https://github.com/Hackerpilot/phobos/tree/master/std/d Documentation: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html ...(Commenting on what's visible in the documentation only for now.)

auto config = ...
... .byToken(config) ...

Seems to be a natural candidate for manual partial specialization:

enum config = ...
... .byToken!config() ...

uint line;
ushort column; // is there overflow checking?

"Check to see if the token is of the same type and has the same string representation as the given token." Tokens with the same string representation are always of the same type, so this seems redundant. Furthermore, I'd expect (!a.opCmp(b)) === (a == b). Why provide the operator overloads at all? They don't implement essential or natural functionality. "includeSpecialTokens". It's not clear what this flag does. "If the input range supports slicing, the caching layer aliases itself away and the lexing process is much more efficient." It might be more sensible to require the user to manually wrap his range.

pure nothrow bool isOperator(const TokenType t);
pure nothrow bool isOperator(ref const Token t);
pure nothrow bool isKeyword(const TokenType t);
pure nothrow bool isKeyword(ref const Token t);
...

IMO we should get rid of these. TokenType naming seems inconsistent. E.g. & is amp, = is assign, == is equal, but &= is bitAndEqual and && is logicAnd. IMO better: & is and, = is assign, &= is andAssign and && is andAnd. Of course, it might be best to use a template instead: Tok!"&", Tok!"&=" and Tok!"&&".
Sep 12 2013
Discussion has slowed down and thread is fading away. I personally have not noticed any crucial objections (ones that can't be figured out during normal pull request review process). Brian, what is your state on this? Is there something you want to add/change before the voting gets initiated?
Sep 17 2013
On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:Discussion has slowed down and thread is fading away. I personally have not noticed any crucial objections (ones that can't be figured out during normal pull request review process). Brian, what is your state on this? Is there something you want to add/change before the voting gets initiated?I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.
Sep 17 2013
On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:Well, I suppose only Brian can address it :) I am only checking if there are no commonly repeated objections that are likely to ruin the voting. There are plenty of comments of course and I would recommend Brian to at least make some statement about those before voting but that is his call. There is a possibility to mention one issue as a blocker when voting though, that will be taken into consideration when counting actual votes.Discussion has slowed down and thread is fading away. I personally have not noticed any crucial objections (ones that can't be figured out during normal pull request review process). Brian, what is your state on this? Is there something you want to add/change before the voting gets initiated?I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.
Sep 17 2013
On Tuesday, 17 September 2013 at 16:48:44 UTC, Dicebot wrote:On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:I've been busy with things that aren't D-related recently, but I should have time the rest of this week to address the lexer.On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:Well, I suppose only Brian can address it :) I am only checking if there are no commonly repeated objections that are likely to ruin the voting. There are plenty of comments of course and I would recommend Brian to at least make some statement about those before voting but that is his call. There is a possibility to mention one issue as a blocker when voting though, that will be taken into consideration when counting actual votes.Discussion has slowed down and thread is fading away. I personally have not noticed any crucial objections (ones that can't be figured out during normal pull request review process). Brian, what is your state on this? Is there something you want to add/change before the voting gets initiated?I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.
Sep 17 2013
On Tuesday, 17 September 2013 at 20:14:36 UTC, Brian Schott wrote:I've been busy with things that aren't D-related recently, but I should have time the rest of this week to address the lexer.Ok, please notify me then once you feel ready for voting / extra review ( public at dicebot.lv ). Thanks for your work.
Sep 17 2013
On Tuesday, 17 September 2013 at 20:14:36 UTC, Brian Schott wrote:I've been busy with things that aren't D-related recently, but I should have time the rest of this week to address the lexer.Changes since last time: https://github.com/Hackerpilot/phobos/compare/D-Programming-Language:df38839...master Test coverage is up to 85% now and a few bugs have been fixed.
Sep 24 2013
On 2013-09-25 04:48, Brian Schott wrote:Changes since last time: https://github.com/Hackerpilot/phobos/compare/D-Programming-Language:df38839...masterI had some comments and a couple of minor things like spelling errors:

* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token.
* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table
* Personally I would like to see a short description for each unit test, what it's testing
* Could you remove debug code and other code that is commented out:
  - 344
  - 1172
  - 1226, is that needed?
  - 3165-3166
  - 3197-3198
  - 3392
  - 3410
  - 3434

Spelling errors:
* "forwarad" - 292
* "commemnt" - 2031
* "sentenels" - 299
* "messsage" - 301
* "underliying" - 2454
* "alloctors" - 3230
* "strightforward" - 2276

I guess these line numbers might be off now. My original comments were made September 12. For reference see: http://forum.dlang.org/thread/jsnhlcbulwyjuqcqoepe forum.dlang.org?page=7#post-l0rsje:24jf9:241:40digitalmars.com -- /Jacob Carlborg
Sep 25 2013
On Wednesday, 25 September 2013 at 09:36:43 UTC, Jacob Carlborg wrote:* I see that errorMessage throws an exception. Do we really want that? I would except it just returns an invalid token.This is the default behavior that happens when you don't configure an error callback.* Could we get some unit tests for string literals, comments and identifies out side of the ASCII tableI've added one.* Could you remove debug code and other code that is commented out:Most of this is now gone.Spell errors:These were fixed weeks ago.
Sep 25 2013
On Wednesday, 25 September 2013 at 16:52:43 UTC, Brian Schott wrote:This is the default behavior that happens when you don't configure an error callback.I see.I've added one.Thanks.Most of this is now gone.That's good.These were fixed weeks ago.Great, I just never got a response to that. -- /Jacob Carlborg
Sep 25 2013
Hello. I'm not sure if this belongs here, but I think there is a bug at the very start of the Lexer chapter: Is U+001A really meant to end the source file? According to the Unicode specification this is a "replacement character", like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to end the source (this would fit, as it means "end of media")? I don't know if anybody has ever ended his source in that way or if it was tested. More important to me is that all the space characters beyond ASCII are not considered whitespace (starting with U+00A0 NBSP, the different wide spaces U+2000 to U+200B, up to the exotic stuff U+202F, U+205F, U+2060, U+3000 and the famous U+FEFF). Why? Ok, the set is much larger, but for end-of-line the Unicode versions (U+2028 and U+2029) are also added. This seems inconsistent to me.
Sep 26 2013
On 26-9-2013 17:41, Dominikus Dittes Scherkl wrote:Hello. I'm not sure if this belongs here, but I think there is bug at the very start of the Lexer chapter: Is U+001A really meant to end the source file? According to the Unicode specification this is a "replacement character", like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to end the source (this would fit, as it means "end of media"). I don't know if anybody ever has ended his source in that way or if it was tested. More important to me is, that all the Space-Characters beyond ASCII are not considered whitespace (starting with U+00A0 NBSP, the different wide spaces U+2000 to U+200B up to the exotic stuff U+202F, U+205F, U+2060, U+3000 and the famous U+FEFF). Why? Ok, the set is much larger, but for the end-of-line also the unicode versions (U+2028 and U+2029) are added. This seems inconsequent to me.I imagine the lexer follows the language specification: http://dlang.org/lex.html#EndOfFile
Sep 26 2013
On Thursday, 26 September 2013 at 16:47:09 UTC, Jos van Uden wrote:I know. What I wanted to say is: the language specification has a bug here (at least it is strange to interpret a "replacement character" as end of file but not "end of media") and the handling of Unicode space characters is not nice. If this is not the right place to discuss that matter, please point me to a better place.Is U+001A really meant to end the source file? According to the Unicode specification this is a "replacement character", like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to end the source (this would fit, as it means "end of media")? More important to me is that all the space characters beyond ASCII are not considered whitespaceI imagine the lexer follows the language specification: http://dlang.org/lex.html#EndOfFile
Sep 27 2013
On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.Doing that would require making the identifier pool part of the public API, which is not something that I want to do at the moment. Let's wait until the allocators are figured out first.
Sep 24 2013
On Wednesday, 25 September 2013 at 02:23:36 UTC, Brian Schott wrote:On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:Yes, ideally as an alias parameter. This is a showstopper for many use cases.I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.Doing that would require making the identifier pool part of the public API, which is not something that I want to do at the moment. Let's wait until the allocators are figured out first.
Sep 24 2013
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian SchottBrian, have you got any plans to write a common lexer/parser generator based on a context-free grammar (CFG) or a parsing expression grammar (PEG), not only for the D grammar? Something like this: https://github.com/PhilippeSigaud/Pegged
Sep 17 2013