digitalmars.D - std.d.lexer: pre-voting review / discussion
- Dicebot (36/36) Sep 11 2013 std.d.lexer is a standard module for lexing D code, written by
- Tove (6/8) Sep 11 2013 I remember reading there were some interesting hash-advances in
- Johannes Pfau (38/41) Sep 11 2013 Question / Minor issue:
- Walter Bright (3/6) Sep 11 2013 It is important, and I'm glad you brought it up. The LexerConfig can pro...
- qznc (5/9) Sep 11 2013 The documentation for Token twice says "measured in ASCII
- Walter Bright (2/10) Sep 11 2013 Pedantically, it is just UTF-8 code units, which are a superset of ASCII...
- Walter Bright (23/28) Sep 11 2013 Thank you, Brian! This is important work.
- Brian Schott (3/3) Sep 11 2013 The choice of ending token names with underscores was made
- Walter Bright (3/6) Sep 11 2013 I didn't realize that was in the style guide. I guess I can't complain a...
- H. S. Teoh (8/16) Sep 11 2013 I disagree. I think it's more readable to use a consistent prefix, like
- Dicebot (3/8) Sep 11 2013 Not unless you want to change the style guide and break existing
- H. S. Teoh (6/13) Sep 11 2013 How would that break Phobos code? Phobos code doesn't even use
- Dicebot (4/18) Sep 11 2013 Phobos code must conform its style guide. You can't change it
- H. S. Teoh (22/38) Sep 11 2013 This doesn't violate Phobos style guidelines:
- deadalnix (4/37) Sep 11 2013 Int, Function, Scope, Import are all valid identifiers.
- Jonathan M Davis (8/9) Sep 11 2013 All of which violate Phobos' naming conventions for enum values (they mu...
- H. S. Teoh (18/27) Sep 11 2013 [...]
- sclytrack (4/43) Sep 11 2013 Delphi would use
- Manfred Nowak (3/4) Sep 11 2013 Why---and since which version?
- Walter Bright (3/6) Sep 11 2013 Since the very beginning.
- Manfred Nowak (5/9) Sep 11 2013 I see now, that you wrote about parsing---not about lexing.
- Robert Schadek (7/16) Sep 12 2013 I had problems with it, especially with
- Manfred Nowak (4/5) Sep 12 2013 I see the shift/reduce conflict if you were indeed trying to syntacticall...
- Jonathan M Davis (6/10) Sep 11 2013 IIRC, it's at least partially cause by the .. token. You have to look ah...
- Brian Schott (17/30) Sep 11 2013 Yeah. D requires lookahead in both lexing and parsing. Some
- deadalnix (6/21) Sep 11 2013 Does not require lookahead.
- Manfred Nowak (3/5) Sep 11 2013 Walter wrote about _arbitrary_ lookahead, i.e. unlimited lookahead.
- Manfred Nowak (10/12) Sep 11 2013 This lookahead is introduced by using a petty grammar.
- Jonathan M Davis (5/15) Sep 11 2013 LOL. You were asking about _parsing_ requiring arbitrary lookahead, and ...
- Jonathan M Davis (10/12) Sep 11 2013 Appending _ is what we do in Phobos when we need to use a keyword, and i...
- H. S. Teoh (85/117) Sep 11 2013 I believe this is probably a result of having a separate enum value of
- Walter Bright (9/26) Sep 11 2013 A type is a well-understood concept in D. This is using 'type' to mean s...
- H. S. Teoh (39/73) Sep 11 2013 No, in typical D code, user-defined type names are capitalized, such as
- Walter Bright (5/24) Sep 11 2013 It's not that big a deal, and certainly not evil.
- H. S. Teoh (9/14) Sep 11 2013 But then the code example proceeds to pass byLine() to it. Is that
- Brian Schott (3/10) Sep 11 2013 The example doesn't pass the result of byLine to the byToken
- H. S. Teoh (7/18) Sep 11 2013 *facepalm* You're right, it's calling join() on it. Nevermind what I
- Jacob Carlborg (10/13) Sep 11 2013 Currently this is possible:
- Brian Schott (21/161) Sep 11 2013 There are already plenty of hackish things in that module, so
- H. S. Teoh (34/157) Sep 11 2013 I was just going by what Walter said, but OK, fair enough.
- Jacob Carlborg (8/9) Sep 12 2013 Really? It seems you only need to change two places in the code:
- Piotr Szturmaj (7/14) Sep 11 2013 Delphi designers realized this problem years ago and they came up with a...
- Kapps (6/27) Sep 11 2013 C# allows this as well. For example, you can have:
- Piotr Szturmaj (4/31) Sep 11 2013 Yes, and this is important to actually have the real name without a '_'
- Jacob Carlborg (5/11) Sep 12 2013 Scala does it as well: `keyword` if I recall correctly. Seems like you
- Michel Fortin (10/12) Sep 11 2013 Since this is a forward range, you can save() it at one location and
- Walter Bright (3/10) Sep 11 2013 That's correct, but that implies re-lexing the tokens, which has negativ...
- deadalnix (3/20) Sep 11 2013 Indeed. What solution do you have in mind ?
- Walter Bright (3/4) Sep 11 2013 The solution dmd uses is to put in an intermediary layer that saves the
- deadalnix (5/9) Sep 11 2013 But then, you have an extra step when looking up every tokens +
- Walter Bright (6/14) Sep 11 2013 Not really - I use a dirty trick in that a static instance is always the...
- deadalnix (10/29) Sep 11 2013 If I understand you correctly, that mean that lookahead of one
- Walter Bright (4/9) Sep 11 2013 I don't believe that, because you can see about anything for tokens in l...
- deadalnix (4/18) Sep 12 2013 I plan to, but you know what it is, the best optimization is the
- Martin Nowak (5/8) Sep 12 2013 Me neither and if you have to duplicate the code for this it's a double
- Martin Nowak (4/8) Sep 11 2013 Linked list sounds bad.
- H. S. Teoh (11/21) Sep 11 2013 I still don't understand why backtracking is necessary in the first
- deadalnix (29/38) Sep 11 2013 The problem is that it can cause a exponential (and I literally
- H. S. Teoh (23/45) Sep 12 2013 [...]
- deadalnix (4/29) Sep 12 2013 And then you got to backtrack the parsing instead of the lexing.
- H. S. Teoh (98/127) Sep 12 2013 No. You can just use ArgListItem for both runtime args and compile-time
- deadalnix (9/9) Sep 12 2013 We are talking about parameters here, not arguments. But overall,
- Manfred Nowak (3/4) Sep 12 2013 If one ever reaches that position one uses petty lexer/grammar definitio...
- Mehrdad (3/9) Sep 27 2013 Pretty sure a GLR parser handles that well within O(n^2) space.
- Mehrdad (2/12) Sep 27 2013 I meant time... Though it's true for space too.
- Walter Bright (2/11) Sep 11 2013 No, I don't have a rough idea, but it doesn't show up in the profile.
- Robert Schadek (3/12) Sep 12 2013 Maybe some fixed size stack vector with 64 elements or so and some
- Dmitry Olshansky (4/17) Sep 12 2013 And an extra branch to detect which one is currently the case? No, thank...
- Robert Schadek (3/8) Sep 12 2013 I would think that branching and using the vector is faster than using
- Jonathan M Davis (5/14) Sep 12 2013 Well, it sounds like there are several ideas to try out in this thread. ...
- qznc (6/10) Sep 12 2013 I think this is the right approach. It can probably be another
- Martin Nowak (10/14) Sep 11 2013 Buffering the tokens would work. There are some ways promote input
- Timon Gehr (23/33) Sep 12 2013 It's a type tag. The tag uniquely determines the type. (As in 'the type
- Dmitry Olshansky (11/36) Sep 12 2013 I like this.
- H. S. Teoh (12/32) Sep 12 2013 +1. This is superior to both the ad hoc _ suffix and my ad hoc
- Timon Gehr (23/28) Sep 12 2013 This is just a quick hack. One would maybe want to avoid the unwarranted...
- Jacob Carlborg (6/7) Sep 12 2013 Unfortunately that's currently not possible to pass to functions, see my...
- Martin Nowak (6/15) Sep 12 2013 Do you think it's possible to use CircularBufferRange itself as anchor.
- Timon Gehr (14/35) Sep 12 2013 Well, this implementation is geared towards the usage pattern found in
- Walter Bright (4/5) Sep 11 2013 One litmus test of the validity of the design is to see if one can build...
- Brian Schott (14/21) Sep 11 2013 I'm not sure if you've followed the announce mailing list
- Walter Bright (2/4) Sep 11 2013 Awesome! This is just what I asked for.
- Martin Nowak (18/26) Sep 11 2013 I only had a short look at the source code and was left with a good
- Jacob Carlborg (4/8) Sep 12 2013 Sounds like that would fit in std.file or similar.
- dennis luehring (4/6) Sep 11 2013 a standard benchmark scenario is needed - to compare dmd and std.d.lexer
- Jacob Carlborg (63/64) Sep 12 2013 Finally :)
- Dmitry Olshansky (12/32) Sep 12 2013 These could be one module. There is really no meaningful way to use
- Jacob Carlborg (5/7) Sep 12 2013 std.d.lexer.config might be a bit unnecessary. But yes, in general I
- Brian Schott (10/10) Sep 12 2013 I got some time to work on the lexer this evening. Changeset
- dennis luehring (20/30) Sep 12 2013 problem: many occurences of the same string
- dennis luehring (9/9) Sep 12 2013 just an idea regarding your string[TokenType.max + 1] in
- Walter Bright (4/13) Sep 12 2013 I strongly recommend running the unit tests with -cov. A lexer should be...
- Brian Schott (9/11) Sep 12 2013 Some of the code is only present to be used at compile-time to
- Walter Bright (3/12) Sep 12 2013 Thanks.
- deadalnix (6/11) Sep 12 2013 One quick remark : we need some kind of value provider to reuse
- Timon Gehr (30/36) Sep 12 2013 (Commenting on what's visible in the documentation only for now.)
- Dicebot (5/5) Sep 17 2013 Discussion has slowed down and thread is fading away.
- deadalnix (3/8) Sep 17 2013 I had some comments that nobody addressed. Mostly about firing
- Dicebot (9/19) Sep 17 2013 Well, I suppose only Brian can address it :) I am only checking
- Brian Schott (3/23) Sep 17 2013 I've been busy with things that aren't D-related recently, but I
- Dicebot (4/6) Sep 17 2013 Ok, please notify me then once you feel ready for voting / extra
- Brian Schott (4/6) Sep 24 2013 Changes since last time:
- Jacob Carlborg (31/33) Sep 25 2013 I had some comments and a couple of minor things like spell errors:
- Brian Schott (7/14) Sep 25 2013 This is the default behavior that happens when you don't
- Jacob Carlborg (8/13) Sep 25 2013 I see.
- Dominikus Dittes Scherkl (20/20) Sep 26 2013 Hello.
- Jos van Uden (3/14) Sep 26 2013 I imagine the lexer follows the language specification:
- Dominikus Dittes Scherkl (11/21) Sep 27 2013 I know. What I wanted to say is: The language specification has a
- Brian Schott (4/6) Sep 24 2013 Doing that would require making the identifier pool part of the
- deadalnix (4/10) Sep 24 2013 Yes, ideally as alias parameter. This is a show stopper for many
- ilya-stromberg (6/8) Sep 17 2013 Brian, have you got any plans to write common lexer/parser
std.d.lexer is a standard module for lexing D code, written by Brian Schott

---- Input ----

Code: https://github.com/Hackerpilot/phobos/tree/master/std/d

Documentation: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html

Initial discussion: http://forum.dlang.org/thread/dpdgcycrgfspcxenzrjf@forum.dlang.org

Usage example in real project: https://github.com/Hackerpilot/Dscanner (as stdx.d.lexer; Brian, please correct me if versions do not match)

---- Information for reviewers ----

(yes, I am mostly copy-pasting this :P)

The goal of this thread is to detect any outstanding issues that need to be fixed before formal "yes"/"no" voting happens. If no critical objections arise, voting will begin next week. Otherwise it depends on the time the module author needs to implement suggestions.

Please take this part seriously: "If you identify problems along the way, please note if they are minor, serious, or showstoppers." (http://wiki.dlang.org/Review/Process). This information will later be used to determine whether the library is ready for voting.

If there are any frequent Phobos contributors / core developers here, please pay extra attention to the submission's code style and how it fits into overall Phobos guidelines and structure.

The most important goal of this review is to find any API / design problems. Internal implementation tweaks may happen after inclusion into Phobos, but it is important to ensure that no breaking changes will be required soon after the module gets wider usage.

---- Information request from module author ----

Performance was a major topic of discussion in the previous thread. Could you please provide benchmarking data for the version currently under review?
Sep 11 2013
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott

I remember reading there were some interesting hash advances in dmd recently: http://forum.dlang.org/thread/kq7ov0$2o8n$1@digitalmars.com?page=1

Maybe it's worth benchmarking those hashes for std.d.lexer as well.
Sep 11 2013
On Wed, 11 Sep 2013 17:01:58 +0200, "Dicebot" <public@dicebot.lv> wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott

Question / minor issue:

As we already have a range-based interface, I'd love to have partial lexing / parsing, especially for IDEs. Say I have this source code:

--------------------------------------------
1: module a;
2:
3: void test(int a)
4: {
5:     [...]
6: }
7:
8: void test2()
9:     [...]
--------------------------------------------

Then I first do a full parse pass over the source. Now line 5 is being edited. I know from the full parse that line 5 is part of a FunctionDeclaration which starts at line 3 and ends at line 6. Now I'd like to re-parse only that part:

--------------------------------------------
FunctionDeclaration decl = document.getDeclByLine(5);
decl.reparse(/*start_line=*/ 3, docBuffer);
--------------------------------------------

I think these are the two critical points related to this for the proposed std.lexer:

* How can I tell the lexer to start lexing at line/character n? Of course the input could be sliced, but then line number and position information in the Token struct is wrong.

* I guess std.lexer slices the original input? This could make things difficult if the file buffer is edited in place. But this can probably be dealt with outside of std.lexer (by reallocating all .value members).

(And once this is working, an example in the docs would be great.)

But to be honest I'm not sure how important this really is. I think it should help for more responsive IDEs, but maybe parsing is not a bottleneck at all?
Sep 11 2013
On 9/11/2013 11:43 AM, Johannes Pfau wrote:
> But to be honest I'm not sure how important this really is. I think it should help for more responsive IDEs, but maybe parsing is not a bottleneck at all?

It is important, and I'm glad you brought it up. The LexerConfig can provide a spot to put a starting line/column value.
Sep 11 2013
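To make Johannes' scenario concrete, here is a minimal sketch assuming LexerConfig grows startLine/startColumn fields as Walter suggests. The field names and the byToken call shape are assumptions based on this thread, not the reviewed module's actual API:

    // Hypothetical: re-lex just the edited declaration, telling the lexer
    // where the fragment sits in the original file so that Token.line and
    // Token.column come out relative to the whole document.
    LexerConfig config;
    config.startLine = 3;   // assumed field: first line of the fragment
    config.startColumn = 1; // assumed field: first column of the fragment

    auto fragment = "void test(int a)\n{\n}\n";
    foreach (token; fragment.byToken(config))
    {
        // token.line / token.column refer to the original file, not the slice
    }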
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott
>
> Documentation: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html

The documentation for Token twice says "measured in ASCII characters or UTF-8 code units", which sounds confusing to me. Is it UTF-8, which includes ASCII? Then it should not be "or". This is nitpicking.

Overall, I like the proposal. Great work!
Sep 11 2013
On 9/11/2013 11:45 AM, qznc wrote:
> On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:
>> std.d.lexer is a standard module for lexing D code, written by Brian Schott
>>
>> Documentation: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
>
> The documentation for Token twice says "measured in ASCII characters or UTF-8 code units", which sounds confusing to me. Is it UTF-8, which includes ASCII? Then it should not be "or".

Pedantically, it is just UTF-8 code units, which are a superset of ASCII.
Sep 11 2013
On 9/11/2013 8:01 AM, Dicebot wrote:
> std.d.lexer is a standard module for lexing D code, written by Brian Schott

Thank you, Brian! This is important work.

Not a thorough review, just some notes from reading the doc file:

1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

2. The example uses an if-then sequence of isBuiltType, isKeyword, etc. Should be an enum so a switch can be done for speed.

3. I assumed TokenType is a type. But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.

4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For an example that is done better: '*' is called 'star' rather than 'dereference'.

5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. The LexerConfig documentation is awfully thin. For example, 'tokenStyle' is explained as being 'Token style', whatever that is.

6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.

7. "uint line;" should indicate that lines start with '1', not '0'. Ditto for columns.

8. 'default_': again with the awful practice of appending _.

9. Need to insert intra-page navigation links, such as when 'byToken()' appears in the text, it should be a link to where byToken is described.

> The goal of this thread is to detect any outstanding issues that need to be fixed before formal "yes"/"no" voting happens. If no critical objections arise, voting will begin next week. Otherwise it depends on the time the module author needs to implement suggestions.

I believe the state of the documentation is a showstopper, and needs to be extensively fleshed out before it can be considered ready for voting.
Sep 11 2013
The choice of ending token names with underscores was made according to the Phobos style guide. http://dlang.org/dstyle.html
Sep 11 2013
On 9/11/2013 12:10 PM, Brian Schott wrote:
> The choice of ending token names with underscores was made according to the Phobos style guide. http://dlang.org/dstyle.html

I didn't realize that was in the style guide. I guess I can't complain about it, then :-)
Sep 11 2013
On Wed, Sep 11, 2013 at 12:13:07PM -0700, Walter Bright wrote:
> On 9/11/2013 12:10 PM, Brian Schott wrote:
>> The choice of ending token names with underscores was made according to the Phobos style guide. http://dlang.org/dstyle.html
>
> I didn't realize that was in the style guide. I guess I can't complain about it, then :-)

I disagree. I think it's more readable to use a consistent prefix, like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's clear you're referring to token types, not the actual keyword.

T

--
One Word to write them all, One Access to find them, One Excel to count them all, And thus to Windows bind them. -- Mike Champion
Sep 11 2013
On Wednesday, 11 September 2013 at 19:58:36 UTC, H. S. Teoh wrote:
> I disagree. I think it's more readable to use a consistent prefix, like kw... or kw_... (e.g. kw_int, kw_return, etc.), so that it's clear you're referring to token types, not the actual keyword.

Not unless you want to change the style guide and break existing Phobos code ;)
Sep 11 2013
On Wed, Sep 11, 2013 at 10:04:20PM +0200, Dicebot wrote:
> Not unless you want to change the style guide and break existing Phobos code ;)

How would that break Phobos code? Phobos code doesn't even use std.d.lexer right now.

T

--
Being able to learn is a great learning; being able to unlearn is a greater learning.
Sep 11 2013
On Wednesday, 11 September 2013 at 20:08:44 UTC, H. S. Teoh wrote:
> How would that break Phobos code? Phobos code doesn't even use std.d.lexer right now.

Phobos code must conform to its style guide. You can't change it without changing existing Phobos code that relies on it. Inconsistent style is the worst of all options.
Sep 11 2013
On Wed, Sep 11, 2013 at 10:18:12PM +0200, Dicebot wrote:
> Phobos code must conform to its style guide. You can't change it without changing existing Phobos code that relies on it. Inconsistent style is the worst of all options.

This doesn't violate Phobos style guidelines:

    enum TokenType {
        kwInt,
        kwFloat,
        kwDouble,
        ...
        kwFunction,
        kwScope,
        ...
        // etc.
    }

The token representing a particular keyword is not the same thing as the keyword itself. There's nothing that says "a D lexer must use token type names that are derived from the D keywords it represents". In fact, using a kw prefix for tokens representing keywords is *more* consistent, because it doesn't require inserting _ for some enum values but not for others.

I never said anything about changing Phobos style guidelines.

T

--
I think Debian's doing something wrong, `apt-get install pesticide', doesn't seem to remove the bugs on my system! -- Mike Dresser
Sep 11 2013
On Wednesday, 11 September 2013 at 20:28:06 UTC, H. S. Teoh wrote:
> This doesn't violate Phobos style guidelines:
>
>     enum TokenType {
>         kwInt,
>         kwFloat,
>         kwDouble,
>         ...
>         kwFunction,
>         kwScope,
>         ...
>         // etc.
>     }

Int, Function, Scope, Import are all valid identifiers. Random abbreviation like kw is really bad. It is even worse when it doesn't make anything shorter.
Sep 11 2013
On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
> Int, Function, Scope, Import are all valid identifiers.

All of which violate Phobos' naming conventions for enum values (they must start with a lowercase letter), which is why we went with adding an _ on the end. And it's pretty much as close as you can get to the keyword without actually using the keyword, which is a plus IMHO (though from the sounds of it, H. S. Teoh would consider that a negative due to possible confusion with the keyword).

- Jonathan M Davis
Sep 11 2013
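For reference, a minimal illustration of the convention under debate; the member names are invented for the example, this is not the module's actual enum:

    // Phobos style: enum members start with a lowercase letter; members
    // that would collide with a keyword get a trailing underscore.
    enum TokenType
    {
        identifier, // no collision, no underscore
        dotdot,     // the '..' token
        default_,   // 'default' is a keyword, hence the underscore
        import_,    // likewise 'import'
    }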
On Wed, Sep 11, 2013 at 10:06:11PM -0400, Jonathan M Davis wrote:
> On Thursday, September 12, 2013 03:37:06 deadalnix wrote:
>> Int, Function, Scope, Import are all valid identifiers.
>
> All of which violate Phobos' naming conventions for enum values (they must start with a lowercase letter), which is why we went with adding an _ on the end. And it's pretty much as close as you can get to the keyword without actually using the keyword, which is a plus IMHO (though from the sounds of it, H. S. Teoh would consider that a negative due to possible confusion with the keyword).

[...]

Actually, the main issue I have is that some of the enum values end with _ while others don't. This is inconsistent. I'd rather have consistency than superficial resemblance to the keywords as typed. Either *all* of the enum values should end with _, or *none* of them should. Having a mixture of both is an eyesore, and leads to people wondering: should I add a _ at the end or not?

If people insist that the 'default' keyword absolutely must be represented as TokenType.default_ (I really don't see why), then *all* TokenType values should end with _. But honestly, I find that really ugly. Writing something like kwDefault, or tokenTypeDefault, would be far better.

Sigh, Andrei was right. Once the bikeshed is up for painting, even the rainbow won't suffice. :-P

T

--
MASM = Mana Ada Sistem, Man!
Sep 11 2013
On Thursday, 12 September 2013 at 03:31:42 UTC, H. S. Teoh wrote:
> Actually, the main issue I have is that some of the enum values end with _ while others don't. This is inconsistent. [...] Writing something like kwDefault, or tokenTypeDefault, would be far better.

Delphi would use:

TokenType ttDefault
MyType mtDefault
Sep 11 2013
Walter Bright wrote:
> Parsing D requires arbitrary lookahead.

Why---and since which version?

-manfred
Sep 11 2013
On 9/11/2013 12:17 PM, Manfred Nowak wrote:
> Walter Bright wrote:
>> Parsing D requires arbitrary lookahead.
>
> Why---and since which version?

Since the very beginning. One example is determining if something is a declaration or an expression.
Sep 11 2013
Walter Bright wrote:
> Since the very beginning. One example is determining if something is a declaration or an expression.

I see now that you wrote about parsing---not about lexing.

Btw. I wrote an LALR(1) parser for an early version of D. This means a lookahead of one was sufficient---or I made terrible mistakes.

-manfred
Sep 11 2013
On 09/12/2013 12:09 AM, Manfred Nowak wrote:
> Walter Bright wrote:
>> Since the very beginning. One example is determining if something is a declaration or an expression.
>
> I see now that you wrote about parsing---not about lexing.
>
> Btw. I wrote an LALR(1) parser for an early version of D. This means a lookahead of one was sufficient---or I made terrible mistakes.

I had problems with it, especially with

IdentifierList:
    Identifier . IdentifierList
    TemplateInstance . IdentifierList

And Bison also complained.
Sep 12 2013
Robert Schadek wrote:
> especially with IdentifierList:

I see the shift/reduce conflict if you were indeed trying to syntactically differentiate between template identifiers and other identifiers.

-manfred
Sep 12 2013
On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
> Walter Bright wrote:
>> Parsing D requires arbitrary lookahead.
>
> Why---and since which version?

IIRC, it's at least partially caused by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013.

- Jonathan M Davis
Sep 11 2013
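Concretely: after consuming the digits of `1`, the lexer must peek one character past the `.` to decide between a float literal (`1.5`) and an integer followed by a `..` token (`1..5`). A hand-written sketch of that check, with illustrative names that are not the module's internals:

    // After scanning an integer's digits, decide what a following '.' means:
    // "1.5"  -> keep lexing a float literal
    // "1..5" -> end the integer here; the next token will be '..'
    Token lexNumber(ref string input)
    {
        size_t i;
        while (i < input.length && input[i] >= '0' && input[i] <= '9')
            ++i;

        if (i < input.length && input[i] == '.'
            && (i + 1 >= input.length || input[i + 1] != '.'))
            return lexFloat(input, i); // hypothetical: continue as a float

        auto tok = Token(TokenType.intLiteral, input[0 .. i]); // assumed ctor
        input = input[i .. $]; // leave any ".." for the next call
        return tok;
    }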
On Wednesday, 11 September 2013 at 19:29:48 UTC, Jonathan M Davis wrote:
> IIRC, it's at least partially caused by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013.

Yeah. D requires lookahead in both lexing and parsing. Some examples:

* String literals such as q{}
* If statements need to determine the difference between if (expression) and if (type identifier = expression)
* Determining if something is a lambda expression or a function literal expression
* Determining if something is a declaration or a statement or an expression
* Differentiating between (type).identifier and a lambda expression
* Differentiating between a type and an expression inside a typeid expression
* Differentiating between associative array key types and tuple slices in type suffixes
Sep 11 2013
On Wednesday, 11 September 2013 at 19:46:25 UTC, Brian Schott wrote:
> Yeah. D requires lookahead in both lexing and parsing. Some examples:
>
> * String literals such as q{}
> * If statements need to determine the difference between if (expression) and if (type identifier = expression)
> * Determining if something is a lambda expression or a function literal expression
> * Determining if something is a declaration or a statement or an expression

Does not require lookahead.

> * Differentiating between (type).identifier and a lambda expression

Yup !

> * Differentiating between a type and an expression inside a typeid expression

Does not require lookahead.

> * Differentiating between associative array key types and tuple slices in type suffixes

ditto.
Sep 11 2013
Brian Schott wrote:
>> Parsing D requires arbitrary lookahead.
>
> Yeah. D requires lookahead in both lexing and parsing.

Walter wrote about _arbitrary_ lookahead, i.e. unlimited lookahead.

-manfred
Sep 11 2013
Jonathan M Davis wrote:
> You have to look ahead to figure out whether it's .. or a floating point literal.

This lookahead is introduced by using a petty grammar. Please reconsider that lexing searches for the leftmost longest pattern of the rest of the input. This means that introducing a pattern like

    `<int>\.\.'  return TokenType.INTDOTDOT;

would eliminate the lookahead in the lexer. In the parser an additional rule then has to be added:

    <range> ::= INT DOTDOT INT
              | INTDOTDOT INT

-manfred
Sep 11 2013
On Wednesday, September 11, 2013 15:29:33 Jonathan M Davis wrote:
> On Wednesday, September 11, 2013 19:17:26 Manfred Nowak wrote:
>> Walter Bright wrote:
>>> Parsing D requires arbitrary lookahead.
>>
>> Why---and since which version?
>
> IIRC, it's at least partially caused by the .. token. You have to look ahead to figure out whether it's .. or a floating point literal. There may be other cases as well, but that particular one was actually complained about by Robert Schadek in his talk at dconf 2013.

LOL. You were asking about _parsing_ requiring arbitrary lookahead, and I answered as to why lexing requires more than one character of lookahead. I should read more carefully...

- Jonathan M Davis
Sep 11 2013
On Wednesday, September 11, 2013 11:49:44 Walter Bright wrote:
> 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

Appending _ is what we do in Phobos when we need to use a keyword, and it's even called out in the style guide: http://dlang.org/dstyle.html

Now, if it makes sense to use something other than a keyword, that's probably better, but in cases where it really makes the most sense to use a keyword and we can't (e.g. std.traits.FunctionAttribute), we append _ to the keyword. What makes the most sense in this case, I don't know, because I haven't looked at either the documentation or the code yet, but if Brian is appending _ to keywords, he's following the official style guide.

- Jonathan M Davis
Sep 11 2013
On Wed, Sep 11, 2013 at 11:49:44AM -0700, Walter Bright wrote:
> On 9/11/2013 8:01 AM, Dicebot wrote:
>> std.d.lexer is a standard module for lexing D code, written by Brian Schott
>
> Thank you, Brian! This is important work.
>
> Not a thorough review, just some notes from reading the doc file:
>
> 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that.

I agree.

> 2. The example uses an if-then sequence of isBuiltType, isKeyword, etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value for each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. Perhaps something like:

    enum TokenCategory { BuiltInType, Keyword, ... }

    // Convert TokenType into TokenCategory
    TokenCategory category(TokenType tt) { ... }

Then in user code, you call category() on the token type, and switch over that. This maximizes performance.

Implementation-wise, I'd suggest either a hash table on TokenType, or perhaps even encoding the category into the TokenType enum values, for example:

    enum TokenCategory { BuiltInType, Keyword, ... }

    enum TokenType {
        IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
        FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
        ...
        FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
    }

Then the category function can be simply:

    TokenCategory category(TokenType tt) {
        return cast(TokenCategory)(tt >> 16);
    }

Though admittedly, this is a bit hackish. But if you're going for speed, this would be quite fast.

> 3. I assumed TokenType is a type. But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.

I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).

> 4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For an example that is done better: '*' is called 'star' rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code to refer to 'dereference' when parsing expressions. :)

> 5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. The LexerConfig documentation is awfully thin. For example, 'tokenStyle' is explained as being 'Token style', whatever that is.

I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me. Perhaps one way to improve this is to rename LexerConfig to DLexer, and make byToken a member function (or call it via UFCS):

    DLexer lex;
    lex.iterStyle = ...;
    // etc.

    foreach (token; lex.byToken()) {
        ...
    }

This reads better: you're getting a list of tokens from the lexer 'lex', as opposed to getting something from byToken(config), which doesn't really *say* what it's doing: is it tokenizing the config object?

> 6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.

What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it? I'm not sure how DMD's parser works, but the way I usually write parsers is that their state encodes the series of tokens encountered so far, so they don't need the lexer to save any tokens for them. If they need to refer to tokens, they create partial AST trees on the parser stack that reference said tokens. I don't see why it should be the lexer's job to keep track of such things.

[...]

> 8. 'default_': again with the awful practice of appending _.

Yeah, I didn't like that. Prefixing keywords with "kw" or "kw_" is much better, instead of the confusing way of appending _ to some token types but not others.

> 9. Need to insert intra-page navigation links, such as when 'byToken()' appears in the text, it should be a link to where byToken is described.

Agreed.

[...]

> I believe the state of the documentation is a showstopper, and needs to be extensively fleshed out before it can be considered ready for voting.

I think the API can be improved. The LexerConfig -> DLexer rename is an important readability issue IMO.

Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?

Now, somebody pointed out that there's currently no way to tell it that the data you're feeding to it is a partial slice of the full source. I think this should be an easy fix: LexerConfig (or DLexer after the rename) can have a field for specifying the initial line number / column number, defaulting to (1, 1) but changeable by the caller for parsing code fragments.

A minor concern I have about the current implementation (I didn't look at the code yet, but this is just based on the documentation) is that there's no way to choose what kind of data you want in the token stream. Right now, it appears that it always computes startIndex, line, and column. What if I don't care about this information, and only want, say, the token type and value (say I'm pretty-printing the source, so I don't care how the original was formatted)? Would it be possible to skip the additional computations required to populate these fields?

T

--
MACINTOSH: Most Applications Crash, If Not, The Operating System Hangs
Sep 11 2013
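A short sketch of what calling code gains from such a category helper: one switch per token instead of an if-then chain. TokenCategory and category() are as sketched in the message above; the Token fields are assumed:

    // Syntax-highlighting dispatch on the category, not the full TokenType.
    string cssClass(Token tok)
    {
        switch (category(tok.type))
        {
            case TokenCategory.BuiltInType: return "type";
            case TokenCategory.Keyword:     return "kwrd";
            default:                        return "plain";
        }
    }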
On 9/11/2013 12:55 PM, H. S. Teoh wrote:
>> 3. I assumed TokenType is a type. But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.
>
> I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).

A type is a well-understood concept in D. This is using 'type' to mean something completely different. This is very jarring. Your idea of using 'kind' is vastly superior.

>> 5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. The LexerConfig documentation is awfully thin. For example, 'tokenStyle' is explained as being 'Token style', whatever that is.
>
> I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me.

Yes, using a constructor :-)

>> 6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.
>
> What has this got to do with lexing?

The lexer needs to support backing up and going forward again. That's how the dmd lexer works.

> Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?

The point of having the input be an InputRange is you DO NOT CARE if the input is coming from a string, file, etc.
Sep 11 2013
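For readers unfamiliar with the dmd approach: "backing up" means the consumer can remember a position in the token stream and later rewind to it. A hedged sketch of one common shape for this on top of a token range; all names are hypothetical, with Token standing in for the module's token type:

    // Caches tokens as they are read so that earlier positions stay valid.
    struct TokenBuffer(R)
    {
        R source;       // underlying token range
        Token[] cache;  // tokens read so far
        size_t cursor;  // current position within the cache

        bool empty() { fill(); return cursor >= cache.length; }
        Token front() { fill(); return cache[cursor]; }
        void popFront() { fill(); ++cursor; }

        size_t mark() { return cursor; }      // remember a position
        void rewind(size_t m) { cursor = m; } // back up to it

        private void fill()
        {
            if (cursor >= cache.length && !source.empty)
            {
                cache ~= source.front;
                source.popFront();
            }
        }
    }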
On Wed, Sep 11, 2013 at 01:31:45PM -0700, Walter Bright wrote:
> On 9/11/2013 12:55 PM, H. S. Teoh wrote:
>> I don't understand this complaint. I thought it's pretty clear that it's referring to what type of token it is (or would you prefer TokenKind?).
>
> A type is a well-understood concept in D. This is using 'type' to mean something completely different. This is very jarring. Your idea of using 'kind' is vastly superior.

No, in typical D code, user-defined type names are capitalized, such as InputRange, Duration, Complex, ByLine, etc. They do not have 'Type' in their names (we don't call them InputRangeType, for example). So this usage doesn't conflict at all.

> Yes, using a constructor :-)

Constructors with many arguments are evil, because you can't just modify one setting and use defaults for the rest; you have to specify all of them up to the last default argument you have to change.

> The lexer needs to support backing up and going forward again. That's how the dmd lexer works.

Then the parser should be fixed. Keeping track of what tokens have been seen so far is the parser's job, not the lexer's. LALR(1) parsers, for example, deal with multiple similar-looking constructs without requiring more than a single token of lookahead. The trick is that you do not assume a particular grammar production until it has been fully disambiguated. The grammar, if not conducive to this, can usually be rearranged to make it possible. For example, if the grammar has two production rules:

    StatementA ::= TokenA TokenB TokenC TokenD
    StatementB ::= TokenA TokenB TokenC TokenE

then you may claim that you need a 4-token lookahead in order to tell, given TokenA in the current input, which type of statement it's going to be. However, these rules can be rewritten like this:

    StatementABPrefix ::= TokenA TokenB TokenC
    StatementA ::= StatementABPrefix TokenD
    StatementB ::= StatementABPrefix TokenE

Now the parser only needs a 1-token lookahead to do its job, and no backtracking is ever needed. Any information encoded by the first 3 tokens can be stored in the AST node of StatementABPrefix, so it's recoverable without needing to save previous tokens.

Unless you're saying that the D grammar is more complex than LALR(1)... If so, I'd like to see an example of where this is the case.

>> Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?
>
> The point of having the input be an InputRange is you DO NOT CARE if the input is coming from a string, file, etc.

Yes, but the docs currently don't say what form this input range must take. Must it be a range of characters? Am I allowed to pass in a range of *strings* (representing source lines)? This needs to be clearly stated. The fact that tokens slice the input also needs to be clearly stated, since that precludes the use of transient ranges like byLine (your token values will be garbled by the time you get to them).

T

--
Let's call it an accidental feature. -- Larry Wall
Sep 11 2013
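A toy rendering of the rewritten grammar in recursive-descent form, showing why a single token of lookahead now suffices. Node, StatementA/StatementB, parsePrefix, and the token names are all hypothetical:

    // After the StatementABPrefix factoring, the parser commits to nothing
    // until the shared prefix is consumed; one lookahead token then picks
    // the production, with no backtracking.
    Node parseStatementAB(ref Token[] toks)
    {
        auto prefix = parsePrefix(toks); // consumes TokenA TokenB TokenC

        switch (toks[0].type) // single token of lookahead
        {
            case TokenType.tokenD:
                toks = toks[1 .. $];
                return new StatementA(prefix);
            case TokenType.tokenE:
                toks = toks[1 .. $];
                return new StatementB(prefix);
            default:
                throw new Exception("expected TokenD or TokenE");
        }
    }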
On 9/11/2013 3:02 PM, H. S. Teoh wrote:
> No, in typical D code, user-defined type names are capitalized, such as InputRange, Duration, Complex, ByLine, etc. They do not have 'Type' in their names (we don't call them InputRangeType, for example). So this usage doesn't conflict at all.

It's not a type as is understood in D. It's an enumerated value.

> Constructors with many arguments are evil, because you can't just modify one setting and use defaults for the rest; you have to specify all of them up to the last default argument you have to change.

It's not that big a deal, and certainly not evil.

>> The lexer needs to support backing up and going forward again. That's how the dmd lexer works.
>
> Then the parser should be fixed.

The parser works fine. It doesn't need fixing.

> Yes, but the docs currently don't say what form this input range must take. Must it be a range of characters?

It does say that.
Sep 11 2013
On Wed, Sep 11, 2013 at 04:42:34PM -0700, Walter Bright wrote:
> On 9/11/2013 3:02 PM, H. S. Teoh wrote:
>> Yes, but the docs currently don't say what form this input range must take. Must it be a range of characters?
>
> It does say that.

[...]

But then the code example proceeds to pass byLine() to it. Is that correct? If it is, then the docs need to be updated, because last time I checked, byLine() isn't a range of char, but a range of char *arrays*.

T

--
Real men don't take backups. They put their source on a public FTP-server and let the world mirror it. -- Linus Torvalds
Sep 11 2013
On Thursday, 12 September 2013 at 00:13:36 UTC, H. S. Teoh wrote:
> But then the code example proceeds to pass byLine() to it. Is that correct? If it is, then the docs need to be updated, because last time I checked, byLine() isn't a range of char, but a range of char *arrays*.

The example doesn't pass the result of byLine to the byToken function directly.
Sep 11 2013
On Thu, Sep 12, 2013 at 03:17:06AM +0200, Brian Schott wrote:
> The example doesn't pass the result of byLine to the byToken function directly.

*facepalm* You're right, it's calling join() on it. Nevermind what I said then. :-P Sorry for all the unnecessary noise; I don't know what got into me that I didn't see the join().

T

--
May you live all the days of your life. -- Jonathan Swift
Sep 11 2013
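For readers following along, this is roughly the shape of the example under discussion: the transient byLine() buffers are join()ed into one contiguous buffer before lexing. byToken and LexerConfig are from the module under review; the file name is illustrative:

    import std.array : join;
    import std.stdio : File, KeepTerminator;

    // byLine yields transient char[] lines; join() copies them into a
    // single buffer, which is then safe for the lexer to slice.
    LexerConfig config;
    auto source = File("app.d").byLine(KeepTerminator.yes).join();
    foreach (token; source.byToken(config))
    {
        // ...
    }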
On 2013-09-12 00:02, H. S. Teoh wrote:
> Constructors with many arguments are evil, because you can't just modify one setting and use defaults for the rest; you have to specify all of them up to the last default argument you have to change.

Currently this is possible:

    LexerConfig config = { iterStyle: IterationStyle.everything, tokenStyle: TokenStyle.source };

I would really like the following to be possible as well:

    void foo (LexerConfig config);

    foo({ iterStyle: IterationStyle.everything, tokenStyle: TokenStyle.source });

--
/Jacob Carlborg
Sep 11 2013
On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
> I believe this is probably a result of having a separate enum value for each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. [...] Though admittedly, this is a bit hackish. But if you're going for speed, this would be quite fast.

There are already plenty of hackish things in that module, so this would fit right in.

> I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code to refer to 'dereference' when parsing expressions. :)

If you read the docs/code you'll see that "*" is called "star" :-). The slice -> dotdot rename is pretty simple to do.

> I'm on the fence about this one. Setting up the configuration before starting the lexing process makes sense to me. Perhaps one way to improve this is to rename LexerConfig to DLexer, and make byToken a member function (or call it via UFCS): [...]

byToken is a free function because its first parameter is the range to tokenize. This allows you to use UFCS on the range (e.g. "sourceCode.byToken()"). Putting it in a struct/class would break this.

> What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it? [...] I don't see why it should be the lexer's job to keep track of such things.

For parsing, you'll likely want to use array() to grab all the tokens. But there are other uses, such as syntax highlighting, that only need one token at a time.

>> 9. Need to insert intra-page navigation links, such as when 'byToken()' appears in the text, it should be a link to where byToken is described.
>
> Agreed.

I'll work on this later this evening.

>> I believe the state of the documentation is a showstopper, and needs to be extensively fleshed out before it can be considered ready for voting.
>
> I think the API can be improved. The LexerConfig -> DLexer rename is an important readability issue IMO.

I'm not convinced (yet?).

> Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an in-memory buffer first?

The input is a range of bytes, which the lexer assumes is UTF-8. The lexer works much faster on arrays than it does on arbitrary ranges, though. Dmitry Olshansky implemented a circular buffer in this module that's used to cache the input range internally. This buffer is then used for the lexing process.

> Now, somebody pointed out that there's currently no way to tell it that the data you're feeding to it is a partial slice of the full source. I think this should be an easy fix: LexerConfig (or DLexer after the rename) can have a field for specifying the initial line number / column number, defaulting to (1, 1) but changeable by the caller for parsing code fragments.

That's simple enough to add.

> A minor concern I have about the current implementation [...] is that there's no way to choose what kind of data you want in the token stream. [...] Would it be possible to skip the additional computations required to populate these fields?

It's possible, but I fear that it would make the code a mess.
Sep 11 2013
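A toy sketch of that caching idea: a fixed-size window over an input range, so the lexer can peek ahead without re-reading a non-seekable source. This illustrates the concept only; it is not the module's actual implementation:

    // Minimal circular buffer over an input range of bytes.
    struct CircularCache(R, size_t N = 1024)
    {
        R source;
        ubyte[N] buffer;
        size_t head;   // index of the current byte
        size_t filled; // bytes buffered at and after head

        // Peek n bytes ahead of the current position.
        ubyte peek(size_t n)
        {
            assert(n < N, "lookahead exceeds window size");
            while (filled <= n && !source.empty)
            {
                buffer[(head + filled) % N] = source.front;
                source.popFront();
                ++filled;
            }
            return buffer[(head + n) % N];
        }

        // Consume the current byte.
        void advance()
        {
            assert(filled > 0);
            head = (head + 1) % N;
            --filled;
        }
    }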
On Wed, Sep 11, 2013 at 10:37:34PM +0200, Brian Schott wrote:
> There are already plenty of hackish things in that module, so this would fit right in.

Better hope it passes code review, though. :)

> If you read the docs/code you'll see that "*" is called "star" :-). The slice -> dotdot rename is pretty simple to do.

I was just going by what Walter said, but OK, fair enough.

> byToken is a free function because its first parameter is the range to tokenize. This allows you to use UFCS on the range (e.g. "sourceCode.byToken()"). Putting it in a struct/class would break this.

Very good point. In that case, would it be possible to make byToken accept configuration parameters inline? That is:

    auto tokens = File(mysource)
        .byLine()    // does this actually work??
        .byToken(IterationStyle.everything, ...);

This seems better suited for UFCS-style chains, as opposed to needing to create a separate config object outside the chain just so you can use it inside. And writing LexerConfig(...) inline seems a bit cumbersome:

    auto tokens = File(mysource)
        .byLine()    // does this actually work??
        .byToken(LexerConfig(IterationStyle.everything, ...));

though, in retrospect, perhaps not as bad as it initially sounds.

> For parsing, you'll likely want to use array() to grab all the tokens. But there are other uses, such as syntax highlighting, that only need one token at a time.

I don't think that's necessary. An appropriately-constructed parser doesn't need arbitrary slicing of tokens; its internal state should encode what tokens have been seen thus far so that it never needs to backtrack.

[...]

> I'm not convinced (yet?).

OK, I take that back. UFCS chaining is an important emerging D paradigm that we should support, and the current API seems better suited for that purpose.

> The input is a range of bytes, which the lexer assumes is UTF-8. The lexer works much faster on arrays than it does on arbitrary ranges, though. Dmitry Olshansky implemented a circular buffer in this module that's used to cache the input range internally. This buffer is then used for the lexing process.

OK, if this can be indicated in the docs, that'd be great. :) (A signature constraint of the form `if (is(ElementType!R == char))` should be good enough, I think, since DDOC in DMD git HEAD will include signature constraints in the generated docs, and it should be clear enough what is expected.)

[...]

> It's possible, but I fear that it would make the code a mess.

Fair enough. It's not a significant overhead anyway (I believe), so it shouldn't be a big issue.

T

--
Once bitten, twice cry...
Sep 11 2013
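To make the bit-packing suggestion above concrete, here is a minimal compilable sketch; all names (TokenCategory, IntToken and so on) are illustrative stand-ins, not part of the proposed std.d.lexer API:

enum TokenCategory : ushort { BuiltInType, Keyword, Operator }

enum TokenType : uint
{
    IntToken   = (TokenCategory.BuiltInType << 16) | 0x0001,
    FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
    FuncToken  = (TokenCategory.Keyword << 16) | 0x0001,
}

// The category is recovered with a single shift, no table lookup needed.
pure nothrow TokenCategory category(TokenType tt)
{
    return cast(TokenCategory)(tt >> 16);
}

unittest
{
    assert(category(TokenType.FuncToken) == TokenCategory.Keyword);
    // The whole point: user code switches over a handful of categories
    final switch (category(TokenType.IntToken))
    {
        case TokenCategory.BuiltInType: break;
        case TokenCategory.Keyword:     break;
        case TokenCategory.Operator:    break;
    }
}

The final switch also gets the compiler to verify that every category is handled, which an if-then chain cannot do.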
On 2013-09-11 22:37, Brian Schott wrote:It's possible, but I fear that it would make the code a mess.Really? It seems you only need to change two places in the code: https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L1724 And https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L400 Although I just did a quick search for "line". -- /Jacob Carlborg
Sep 12 2013
On 11.09.2013 20:49, Walter Bright wrote:On 9/11/2013 8:01 AM, Dicebot wrote:Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".std.d.lexer is standard module for lexing D code, written by Brian SchottThank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file: 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that. 8. 'default_' Again with the awful practice of appending _.
Sep 11 2013
On Wednesday, 11 September 2013 at 19:57:28 UTC, Piotr Szturmaj wrote:On 11.09.2013 20:49, Walter Bright wrote:void MyVoid(string @int) { ... } The name is actually int for purposes of reflection and the like, the @ just tells the compiler that it's not a keyword.On 9/11/2013 8:01 AM, Dicebot wrote:Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".std.d.lexer is standard module for lexing D code, written by Brian SchottThank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file: 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that. 8. 'default_' Again with the awful practice of appending _.
Sep 11 2013
On 12.09.2013 01:22, Kapps wrote:On Wednesday, 11 September 2013 at 19:57:28 UTC, Piotr Szturmaj wrote:Yes, and this is important to actually have the real name without a '_' or any other prefix or suffix. Is there an enhancement proposal for this in the bugzilla?On 11.09.2013 20:49, Walter Bright wrote:void MyVoid(string int) { ... } The name is actually int for purposes of reflection and the like, the just tells the compiler that it's not a keyword.On 9/11/2013 8:01 AM, Dicebot wrote:Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".std.d.lexer is standard module for lexing D code, written by Brian SchottThank you, Brian! This is important work. Not a thorough review, just some notes from reading the doc file: 1. I don't like the _ suffix for keywords. Just call it kwimport or something like that. 8. 'default_' Again with the awful practice of appending _.
Sep 11 2013
On 2013-09-11 21:57, Piotr Szturmaj wrote:Delphi designers realized this problem years ago and they came up with a solution: http://docwiki.embarcadero.com/RADStudio/XE4/en/Fundamental_Syntactic_Elements#Extended_Identifiers Basically, Delphi allows escaping reserved identifiers with a '&'. I wonder how D solves that problem when interfacing to COM classes if they have for example a function named "scope".Scala does it as well: `keyword` if I recall correctly. Seems like you can put basically anything between the backticks in Scala. -- /Jacob Carlborg
Sep 12 2013
On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2 digitalmars.com> said:6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.Since this is a forward range, you can save() it at one location and continue parsing, and then restart parsing from the savepoint. https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324 This also means that the underlying range used as input to the lexer must be a forward range that you can save() too. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
Sep 11 2013
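For illustration, speculative parsing on top of such a forward range could be shaped like the following minimal sketch; trySpeculative and its success test are hypothetical placeholders, not part of the module:

import std.range : isForwardRange;

// Attempt one parse interpretation; on failure, rewind to the savepoint.
bool trySpeculative(R)(ref R tokens) if (isForwardRange!R)
{
    auto checkpoint = tokens.save; // remember the current position
    bool succeeded = false;        // placeholder: a real parser would
                                   // consume tokens here and set this
    if (!succeeded)
        tokens = checkpoint;       // backtrack: restart from the savepoint
    return succeeded;
}

As Walter points out below, the price of this approach is that the tokens between the savepoint and the failure point get lexed a second time.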
On 9/11/2013 3:09 PM, Michel Fortin wrote:On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2 digitalmars.com> said:That's correct, but that implies re-lexing the tokens, which has negative performance implications.6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.Since this is a forward range, you can save() it at one location and continue parsing, and then restart parsing from the savepoint. https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324 This also means that the underlying range used as input to the lexer must be a forward range that you can save() too.
Sep 11 2013
On Wednesday, 11 September 2013 at 23:44:55 UTC, Walter Bright wrote:On 9/11/2013 3:09 PM, Michel Fortin wrote:Indeed. What solution do you have in mind?On 2013-09-11 18:49:44 +0000, Walter Bright <newshound2 digitalmars.com> said:That's correct, but that implies re-lexing the tokens, which has negative performance implications.6. No clue how lookahead works with this. Parsing D requires arbitrary lookahead.Since this is a forward range, you can save() it at one location and continue parsing, and then restart parsing from the savepoint. https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2324 This also means that the underlying range used as input to the lexer must be a forward range that you can save() too.
Sep 11 2013
On 9/11/2013 6:30 PM, deadalnix wrote:Indeed. What solution do you have in mind ?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:On 9/11/2013 6:30 PM, deadalnix wrote:But then, you have an extra step when looking up every token, plus memory management overhead. How big is the performance improvement?Indeed. What solution do you have in mind?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
On 9/11/2013 6:54 PM, deadalnix wrote:On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:Not really - I use a dirty trick in that a static instance is always the start of the list. But even so, an extra indirection is better than re-lexing. Lexing is a clear bottleneck in the profiling. I've even been thinking of special casing the code that scans comments to use SIMD instructions.On 9/11/2013 6:30 PM, deadalnix wrote:But then, you have an extra step when looking up every tokens + memory management overhead. How big is the performance improvement ?Indeed. What solution do you have in mind ?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
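A rough self-contained sketch of the scheme Walter describes, with an embedded head node standing in for dmd's static instance; Tok and LookaheadLexer are invented for illustration, not dmd's actual types:

struct Tok { int type; Tok* next; }

struct LookaheadLexer
{
    int[] source; // stand-in for the real lexer input
    Tok head;     // embedded first node: the list always has a start

    // Return the token after t, lexing it on demand.
    Tok* peek(Tok* t)
    {
        if (t.next is null) // not lexed yet: lex it exactly once
        {
            t.next = new Tok(source.length ? source[0] : -1, null);
            if (source.length)
                source = source[1 .. $];
        }
        return t.next;
    }
}

unittest
{
    auto lex = LookaheadLexer([1, 2, 3]);
    auto t1 = lex.peek(&lex.head); // lookahead depth 1
    auto t2 = lex.peek(t1);        // lookahead depth 2
    assert(t1.type == 1 && t2.type == 2);
    assert(lex.peek(&lex.head) is t1); // re-peeking never re-lexes
}

Because the head node lives inside the lexer, a lookahead of depth one never allocates a list node for the current token, which matches the "dirty trick" described above.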
On Thursday, 12 September 2013 at 04:57:50 UTC, Walter Bright wrote:On 9/11/2013 6:54 PM, deadalnix wrote:If I understand you correctly, that means that a lookahead of one token does not trigger any allocation.On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:Not really - I use a dirty trick in that a static instance is always the start of the list.On 9/11/2013 6:30 PM, deadalnix wrote:But then, you have an extra step when looking up every token, plus memory management overhead. How big is the performance improvement?Indeed. What solution do you have in mind?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.But even so, an extra indirection is better than re-lexing. Lexing is a clear bottleneck in the profiling. I've even been thinking of special casing the code that scans comments to use SIMD instructions.See my comment: it is possible, with increased parser complexity, to handle many cases where you don't know what you are parsing yet. Doing so, lookahead is only required to find the matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allows saving one branch per token.
Sep 11 2013
On 9/11/2013 10:10 PM, deadalnix wrote:See my comment, it is possible, with increased parser complexity, to handle many cases where you don't know what you are parsing yet. Doing so, lookahead is only required to find matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allow to save one branch per token.I don't believe that, because you can see about anything for tokens in lookahead and so have to duplicate nearly the whole lexer anyway for the 'fast path', but you're free to try it out and prove me wrong.
Sep 11 2013
On Thursday, 12 September 2013 at 06:17:04 UTC, Walter Bright wrote:On 9/11/2013 10:10 PM, deadalnix wrote:I plan to, but you know how it is: the best optimization is the one that goes from a non-working to a working state.See my comment: it is possible, with increased parser complexity, to handle many cases where you don't know what you are parsing yet. Doing so, lookahead is only required to find the matching closing token. I suspect that a fast path in the lexer for that precise use case may be faster than buffering tokens, as it allows saving one branch per token.I don't believe that, because you can see about anything for tokens in lookahead and so have to duplicate nearly the whole lexer anyway for the 'fast path', but you're free to try it out and prove me wrong.
Sep 12 2013
On 09/12/2013 08:17 AM, Walter Bright wrote:I don't believe that, because you can see about anything for tokens in lookahead and so have to duplicate nearly the whole lexer anyway for the 'fast path', but you're free to try it out and prove me wrong.Me neither, and if you have to duplicate the code for this it's a double loss. I think it would be more fruitful to batch lexing, i.e. fill a whole memory page (4Kb) with tokens, then go back to parsing until you need more.
Sep 12 2013
On 09/12/2013 03:39 AM, Walter Bright wrote:On 9/11/2013 6:30 PM, deadalnix wrote:Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.Indeed. What solution do you have in mind ?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
On Thu, Sep 12, 2013 at 04:21:36AM +0200, Martin Nowak wrote:On 09/12/2013 03:39 AM, Walter Bright wrote:I still don't understand why backtracking is necessary in the first place. I would've thought a modern parser should be well able to encode seen tokens in its state so that backtracking is never necessary. Or does D grammar have tricky bits that cannot be handled this way, that I'm unaware of? T -- The peace of mind---from knowing that viruses which exploit Microsoft system vulnerabilities cannot touch Linux---is priceless. -- Frustrated system administrator.On 9/11/2013 6:30 PM, deadalnix wrote:Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.Indeed. What solution do you have in mind ?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
On Thursday, 12 September 2013 at 03:37:55 UTC, H. S. Teoh wrote:I still don't understand why backtracking is necessary in the first place. I would've thought a modern parser should be well able to encode seen tokens in its state so that backtracking is never necessary. Or does the D grammar have tricky bits that cannot be handled this way, that I'm unaware of?The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity. In some cases the complexity is manageable, but in others it doesn't make any sense (it has to be noted that even full lexing doesn't make any sense here). For instance:

int foo()() {}
       ^

When you are at the caret position, you don't know if you face a function declaration or a template declaration. You could go for some ambiguous parsing, but each template argument can itself be a type, an expression or a symbol. In such a situation, it is much simpler to skip ahead to the closing parenthesis, check what's coming after, and act accordingly. The alternative is to go for some ambiguous function/template parameter parsing and resolve things at the end, but as template arguments are themselves ambiguous types/expressions/symbols, the amount of complexity in the parser is doomed to explode. Note that the example is not self-contained; for instance, both template parameters and arguments can imply template instantiation, which means more ambiguous argument parsing. SDC does a good deal of ambiguous parsing without backtracking (more than DMD does), but you've got to draw the line somewhere. What I'd like to see is a go-to-the-closing-token feature that can eventually take a fast path to do so, or an efficient token buffering system (I'm not sure which would be faster, but I'm inclined to think the first one; however, that won't be suited to a dmd-style parser).
Sep 11 2013
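The skip-to-the-closing-token operation described above is cheap to sketch. Here tokens are modelled as plain characters for brevity; a real implementation would compare TokenType values instead:

import std.range;

// Advance past a balanced open/close pair, assuming the range
// currently starts at the opening token.
void skipBalanced(R)(ref R tokens, ElementType!R open, ElementType!R close)
{
    int depth = 0;
    do
    {
        if (tokens.front == open) ++depth;
        else if (tokens.front == close) --depth;
        tokens.popFront();
    } while (!tokens.empty && depth > 0);
}

unittest
{
    auto r = "(alias A)(alias B) {}";
    skipBalanced(r, '(', ')');
    assert(r == "(alias B) {}");
}

With that, the foo()() example amounts to skipping from the first '(' to its matching ')' and then testing whether another '(' follows.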
On Thu, Sep 12, 2013 at 07:00:09AM +0200, deadalnix wrote:On Thursday, 12 September 2013 at 03:37:55 UTC, H. S. Teoh wrote:[...] This can be handled by using an intermediate grammar rule. Reduce all (...) into an intermediate type, say ArgList, so the reduction happens something like this:

int    foo    ()       ()       {}
Type   Ident  ArgList  ArgList
                               ^

Then have the rules:

CompileTimeArgs ::= ArgList
RuntimeArgs     ::= ArgList
TemplateDecl    ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
FuncDecl        ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed whether they are compile-time arguments or runtime arguments. It's only after you see the next '(' or '{' that you decide whether ArgList should reduce to CompileTimeArgs or RuntimeArgs. ArgList itself, of course, will accept all possible parameters (both runtime and compile-time): types, expressions, symbols. Then when you reduce it to RuntimeArgs, you reject stuff that can't be interpreted as parameter declarations. T -- There are two ways to write error-free programs; only the third one works.I still don't understand why backtracking is necessary in the first place. I would've thought a modern parser should be well able to encode seen tokens in its state so that backtracking is never necessary. Or does the D grammar have tricky bits that cannot be handled this way, that I'm unaware of?The problem is that it can cause an exponential (and I literally mean exponential here) amount of complexity. In some cases the complexity is manageable, but in others it doesn't make any sense (it has to be noted that even full lexing doesn't make any sense here). For instance:

int foo()() {}
       ^

When you are at the caret position, you don't know if you face a function declaration or a template declaration. You could go for some ambiguous parsing, but each template argument can itself be a type, an expression or a symbol.
Sep 12 2013
On Thursday, 12 September 2013 at 14:09:43 UTC, H. S. Teoh wrote:This can be handled by using an intermediate grammar rule. Reduce all (...) into an intermediate type, say ArgList, so the reduction happens something like this:

int    foo    ()       ()       {}
Type   Ident  ArgList  ArgList
                               ^

Then have the rules:

CompileTimeArgs ::= ArgList
RuntimeArgs     ::= ArgList
TemplateDecl    ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
FuncDecl        ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed whether they are compile-time arguments or runtime arguments. It's only after you see the next '(' or '{' that you decide whether ArgList should reduce to CompileTimeArgs or RuntimeArgs. ArgList itself, of course, will accept all possible parameters (both runtime and compile-time): types, expressions, symbols. Then when you reduce it to RuntimeArgs, you reject stuff that can't be interpreted as parameter declarations.And then you've got to backtrack the parsing instead of the lexing. You've just moved the problem around. You'll have to create some temporary AST nodes that will then be fixed up into what they really are.
Sep 12 2013
On Thu, Sep 12, 2013 at 06:09:41PM +0200, deadalnix wrote:On Thursday, 12 September 2013 at 14:09:43 UTC, H. S. Teoh wrote:No. You can just use ArgListItem for both runtime args and compile-time args. Once you've decided which one it is, wrong arguments are rejected at semantic time (which you have to do anyway). Let's take a concrete example. Say we're parsing this invalid code:

int foo(alias A)(alias B) {}

You'd go through these steps:

1) Parse initial prefix of declaration:

int foo(alias A)(alias B) {}
       ^
AST: FuncDecl
 |--RetType: int
 |--Ident: foo
 \--[ being built ]

2) Parse first (...):

int foo(alias A)(alias B) {}
                ^
AST: FuncDecl
 |--RetType: int
 |--Ident: foo
 |--ArgList
 |   \-- AliasArg
 |        \-- ident: A
 \--[ being built ]

I'm skipping the intermediate steps here, it's obvious how to construct AliasArg from the usual parsing process.

3) Parse second (...):

int foo(alias A)(alias B) {}
                         ^
AST: FuncDecl
 |--RetType: int
 |--Ident: foo
 |--ArgList
 |   \-- AliasArg
 |        \-- ident: A
 |--ArgList
 |   \-- AliasArg
 |        \-- ident: B
 \--[ being built ]

4) At this point, you now know the first ArgList is CompileTimeArgList, and the second is RuntimeArgList, so you can just change the type fields (along with narrowing FuncDecl to TemplateFuncDecl):

AST: TemplateFuncDecl (was: FuncDecl)
 |--RetType: int
 |--Ident: foo
 |--CompileTimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: A
 |--RuntimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: B
 \--[ being built ]

Since you're still constructing FuncDecl, your current parsing context should still have a direct reference to the partially-constructed FuncDecl node, which in turn has a direct reference to both ArgList child nodes. So this is just dereferencing a couple of pointers. No backtracking.

5) Finish parsing the declaration:

int foo(alias A)(alias B) {}
                            ^
AST: TemplateFuncDecl
 |--RetType: int
 |--Ident: foo
 |--CompileTimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: A
 |--RuntimeArgList (was: ArgList)
 |   \-- AliasArg
 |        \-- ident: B
 \--FuncBody
     \-- CompoundStatement
          \-- [empty body]

6) Run semantic:
 - Create local symbol table for foo, etc.
 - Run semantic on CompileTimeArgList:
   - Check AliasArg for validity
   - Run semantic on AliasArg: add A to function's local symbol table, etc.
 - Run semantic on RuntimeArgList:
   - Check AliasArg for validity: ERROR: cannot have alias parameter at runtime.
   - (Add B to local symbol table)(skipped due to previous error)
 - (Run semantic on FuncBody)(skipped due to previous error)
 - (Run semantic on RetType (verify return type match, etc.))(skipped due to previous error)
 - (Add function to parent scope symbol table)(skipped due to previous error)

So, no backtracking is necessary. Of course, it sounds like DMD's parser doesn't work this way, but that's a limitation of DMD's parser, not an *inherent* need for backtracking. T -- I see that you JS got Bach.This can be handled by using an intermediate grammar rule. Reduce all (...) into an intermediate type, say ArgList, so the reduction happens something like this:

int    foo    ()       ()       {}
Type   Ident  ArgList  ArgList
                               ^

Then have the rules:

CompileTimeArgs ::= ArgList
RuntimeArgs     ::= ArgList
TemplateDecl    ::= Type Ident CompileTimeArgs RuntimeArgs '{' ...
FuncDecl        ::= Type Ident RuntimeArgs '{' ...

So first, all (...) gets parsed to ArgList, but it's not yet fixed whether they are compile-time arguments or runtime arguments. It's only after you see the next '(' or '{' that you decide whether ArgList should reduce to CompileTimeArgs or RuntimeArgs. ArgList itself, of course, will accept all possible parameters (both runtime and compile-time): types, expressions, symbols. Then when you reduce it to RuntimeArgs, you reject stuff that can't be interpreted as parameter declarations.And then you've got to backtrack the parsing instead of the lexing. You've just moved the problem around. You'll have to create some temporary AST nodes that will then be fixed up into what they really are.
Sep 12 2013
We are talking about parameters here, not arguments. But overall, that isn't the point. If I have int a = 3 as a parameter in the first set of (), I still have no clue whether it is a runtime or compile-time parameter. And what about ref int a = 3? Going down that road, you end up with a lot of ambiguous nodes (like FunctionArgumentOrValueTemplateParameter) and have to manage that complexity, or you have to backtrack on the parsing and fix up the ambiguous nodes once you have enough information.
Sep 12 2013
deadalnix wrote:When you are at the caret position, you don't know...If one ever reaches that position, one uses petty lexer/grammar definitions. -manfred
Sep 12 2013
On Thursday, 12 September 2013 at 05:00:11 UTC, deadalnix wrote:The problem is that it can cause a exponential (and I literally mean exponential here) amount of complexity. The alternative is to go for some ambiguous function/template parameters parsing and resolve at the end, but as template argument are themselves ambiguous type/expression/symbols, the amount of complexity in the parser is doomed to explode.Pretty sure a GLR parser handles that well within O(n^2) space. Nothing exponential necessary...
Sep 27 2013
On Saturday, 28 September 2013 at 01:23:48 UTC, Mehrdad wrote:On Thursday, 12 September 2013 at 05:00:11 UTC, deadalnix wrote:I meant time... Though it's true for space too.The problem is that it can cause a exponential (and I literally mean exponential here) amount of complexity. The alternative is to go for some ambiguous function/template parameters parsing and resolve at the end, but as template argument are themselves ambiguous type/expression/symbols, the amount of complexity in the parser is doomed to explode.Pretty sure a GLR parser handles that well within O(n^2) space. Nothing exponential necessary...
Sep 27 2013
On 9/11/2013 7:21 PM, Martin Nowak wrote:On 09/12/2013 03:39 AM, Walter Bright wrote:No, I don't have a rough idea, but it doesn't show up in the profile.On 9/11/2013 6:30 PM, deadalnix wrote:Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.Indeed. What solution do you have in mind ?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 11 2013
On 09/12/2013 04:21 AM, Martin Nowak wrote:On 09/12/2013 03:39 AM, Walter Bright wrote:On 9/11/2013 6:30 PM, deadalnix wrote:Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.Indeed. What solution do you have in mind?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.Maybe some fixed-size stack vector with 64 elements or so, plus a linked list for the unusual case, would help...
Sep 12 2013
13-Sep-2013 00:46, Robert Schadek wrote:On 09/12/2013 04:21 AM, Martin Nowak wrote:On 09/12/2013 03:39 AM, Walter Bright wrote:On 9/11/2013 6:30 PM, deadalnix wrote:Linked list sounds bad. Do you have a rough idea how often lookahead is needed, i.e. is it performance relevant? If so it might be worth tuning.Indeed. What solution do you have in mind?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.Maybe some fixed-size stack vector with 64 elements or so, plus a linked list for the unusual case, would help...And an extra branch to detect which one is currently the case? No, thanks ;) -- Dmitry Olshansky
Sep 12 2013
On 09/12/2013 11:03 PM, Dmitry Olshansky wrote:I would think that branching and using the vector is faster than constructing and linking nodes in the LLMaybe some fixed-size stack vector with 64 elements or so, plus a linked list for the unusual case, would help...And an extra branch to detect which one is currently the case? No, thanks ;)
Sep 12 2013
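The suggestion above amounts to a small-buffer optimization. A minimal sketch, using a dynamic array rather than a linked list as the spill area (TokenBuffer is an invented name, not part of the module):

struct TokenBuffer(T, size_t N = 64)
{
    private T[N] small;   // fixed storage: covers almost all lookahead
    private T[] overflow; // rarely touched spill area
    private size_t len;

    void push(T t)
    {
        if (len < N)
            small[len] = t;
        else
            overflow ~= t; // the "unusual case" branch
        ++len;
    }

    ref T opIndex(size_t i)
    {
        if (i < N)
            return small[i];
        return overflow[i - N];
    }

    @property size_t length() const { return len; }
}

unittest
{
    TokenBuffer!int buf;
    foreach (i; 0 .. 100)
        buf.push(i);
    assert(buf[3] == 3 && buf[80] == 80 && buf.length == 100);
}

Whether the extra bounds branch beats pointer chasing is exactly the question raised here; only profiling can settle it.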
On Thursday, September 12, 2013 23:55:23 Robert Schadek wrote:On 09/12/2013 11:03 PM, Dmitry Olshansky wrote:Well, it sounds like there are several ideas to try out in this thread. I guess that only benchmarking and profiling will really tell us which (if any) are better though. - Jonathan M DavisI would think that branching and using the vector is faster than using the constructing and linking nodes in the LLMaybe some fixed size stack vector with 64 elements or so and some linked list for the unusual case would help ..And an extra branch to detect which one is currently the case? No, thanks ;)
Sep 12 2013
On Thursday, 12 September 2013 at 01:39:52 UTC, Walter Bright wrote:On 9/11/2013 6:30 PM, deadalnix wrote:I think this is the right approach. It can probably be another function that we can put into std.range and reuse for other lexers/parsers. The lexer or the parser should not be made more complex for this.Indeed. What solution do you have in mind?The solution dmd uses is to put in an intermediary layer that saves the lookahead tokens in a linked list.
Sep 12 2013
On 09/12/2013 03:30 AM, deadalnix wrote:Buffering the tokens would work. There are some ways to promote input ranges to forward ranges. But there are also some pitfalls, like the implicit save on copy. I have two prototypes for a generic input range buffer:

https://gist.github.com/dawgfoto/2187220 - uses a growing ring buffer
https://gist.github.com/dawgfoto/1257196 - uses ref-counted lookahead buffers in a singly linked list

The lexer itself has a ring buffer for input ranges. https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L2278That's correct, but that implies re-lexing the tokens, which has negative performance implications.Indeed. What solution do you have in mind?
Sep 11 2013
On 09/11/2013 08:49 PM, Walter Bright wrote:3. I assumed TokenType is a type.(is(TokenType) is true.)But it's not, it's an enum. Even the document says it's a 'type', but it's not a type.It's a type tag. The tag uniquely determines the type. (As in 'the type of a token', as opposed to 'the type of an expression'.)4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For example that is done better, '*' is called 'star', rather than 'dereference'.FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.5. The LexerConfig initialization should be a constructor rather than a sequence of assignments.Using the { a:2, b:3 }-style initialization syntax?6. No clue how lookahead works with this.Eg. use a CircularBuffer adapter range. I have an implementation currently coupled with my own lexer implementation. If there is interest, I could factor it out. Lookahead is realized as follows in the parser (assume 'code' is the circular buffer range):

// saves the state and mutes all error messages until the state is restored
auto saveState(){ muteerr++; return code.pushAnchor(); }
void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.) This approach is typically more efficient than using a free list on contemporary architectures.
Sep 12 2013
12-Sep-2013 19:39, Timon Gehr wrote:On 09/11/2013 08:49 PM, Walter Bright wrote:I like this. Not only does this have the benefit of not colliding with keywords. I also imagine that it could be incredibly convenient to get back the symbolic representation of a token (when a token is used as a parameter to an AST node, say BinaryExpr!(Tok!"+")). And truth be told, we all know how tokens look in symbolic form, so learning a pack of names for them feels pointless.4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For example that is done better, '*' is called 'star', rather than 'dereference'.FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.This ^^ is how. In fact std.d.lexer internally does a similar thing with non-RA ranges of bytes.6. No clue how lookahead works with this.Eg. use a CircularBuffer adapter range. I have an implementation currently coupled with my own lexer implementation. If there is interest, I could factor it out. Lookahead is realized as follows in the parser (assume 'code' is the circular buffer range):

// saves the state and mutes all error messages until the state is restored
auto saveState(){ muteerr++; return code.pushAnchor(); }
void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.) This approach is typically more efficient than using a free list on contemporary architectures.
Sep 12 2013
On Thu, Sep 12, 2013 at 08:10:18PM +0400, Dmitry Olshansky wrote:12-Sep-2013 19:39, Timon Gehr wrote:+1. This is superior to both the ad hoc _ suffix and my ad hoc prefixing approach. Tok!"default" is maximally readable, and requires no silly convolutions like _ or 'kw' / 'tokenType' prefixes. I vote for Tok!"..." to denote token types. Question: what's the implementation of Tok? Does it fit into an enum? What's the underlying representation? I imagine some kind of canonical mapping into an integral type would be desired, to maximize runtime performance. T -- There are three kinds of people in the world: those who can count, and those who can't.On 09/11/2013 08:49 PM, Walter Bright wrote:I like this. Not only does this have the benefit of not colliding with keywords. I also imagine that it could be incredibly convenient to get back the symbolic representation of a token (when a token is used as a parameter to an AST node, say BinaryExpr!(Tok!"+")). And truth be told, we all know how tokens look in symbolic form, so learning a pack of names for them feels pointless.4. When naming tokens like .. 'slice', it is giving it a syntactic/semantic name rather than a token name. This would be awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For example that is done better, '*' is called 'star', rather than 'dereference'.FWIW, I use Tok!"..". I.e. a "UDL" for specifying kinds of tokens when interfacing with the parser. Some other kinds of tokens get a canonical representation. Eg. Tok!"i" is the kind of identifier tokens, Tok!"0" is the kind of signed integer literal tokens etc.
Sep 12 2013
On 09/12/2013 06:40 PM, H. S. Teoh wrote:I vote for Tok!"..." to denote token types. Question: what's the implementation of Tok? Does it fit into an enum? What's the underlying representation? I imagine some kind of canonical mapping into an integral type would be desired, to maximize runtime performance.This is just a quick hack. One would maybe want to avoid the unwarranted copies and linear searches at compile time:

import std.algorithm;

private enum symbols = ["i",".","..",",","0",/+...+/];

struct TokenType{
    private uint _tag; // or smaller type
    private this(int tag){ _tag = tag; }
    string toString(){ // (optional)
        static array(R)(R s){ string[] r; foreach(x;s) r~=x; return r; }
        static immutable strs = array(symbols.map!((a)=>`Tok!"`~a~`"`));
        return strs[_tag];
    }
}

template Tok(string name)if(symbols.countUntil(name)!=-1){
    enum Tok = TokenType(cast(uint)symbols.countUntil(name));
}

import std.stdio;
void main(){
    enum x = Tok!"i";
    writeln(x);
    writeln(Tok!"0");
}
Sep 12 2013
On 2013-09-12 17:39, Timon Gehr wrote:Using the { a:2, b:3 }-style initialization syntax?Unfortunately that's currently not possible to pass to functions, see my other post: http://forum.dlang.org/thread/jsnhlcbulwyjuqcqoepe forum.dlang.org?page=6#post-l0ro6h:249mk:241:40digitalmars.com -- /Jacob Carlborg
Sep 12 2013
On 09/12/2013 05:39 PM, Timon Gehr wrote:Lookahead is realized as follows in the parser (assume 'code' is the circular buffer range):

// saves the state and mutes all error messages until the state is restored
auto saveState(){ muteerr++; return code.pushAnchor(); }
void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.)Do you think it's possible to use CircularBufferRange itself as the anchor? Then calling save() would return a CircularBufferRange and you could scratch the two functions above. I had some promising experiments in that direction, but the implicit save on copy is annoying because you end up with anchors from temporary copies.
Sep 12 2013
On 09/12/2013 10:03 PM, Martin Nowak wrote:On 09/12/2013 05:39 PM, Timon Gehr wrote:Well, this implementation is geared towards the usage pattern found in the top-down parser. If the first anchor is not restored last, it will not generally work. This does not conform to the range interface. Implementing .save requires a more advanced and less efficient implementation.Lookahead is realized as follows in the parser (assume 'code' is the circular buffer range):

// saves the state and mutes all error messages until the state is restored
auto saveState(){ muteerr++; return code.pushAnchor(); }
void restoreState(Anchor state){ muteerr--; code.popAnchor(state); }

The 'Anchor' is a trivial wrapper around a size_t. The circular buffer grows automatically to keep around tokens still reachable by an anchor. (The range only needs small constant space besides the buffer to support this functionality, though it is unable to detect usage errors.)Do you think it's possible to use CircularBufferRange itself as the anchor?Then calling save() would return a CircularBufferRange and you could scratch the two functions above. I had some promising experiments in that direction,What did you do? The way I think I'd approach it would be to maintain a binary min-heap using a few pointers in each range instance.but the implicit save on copy is annoying because you end up with anchors from temporary copies.Is this referring to suboptimal efficiency? I guess that is rather hard to address in a safe way. Basically, you'd want to forget about "anchor management" in case it can be proven that there is another range on the same buffer that stays at most as progressed during the whole lifetime of the range under consideration. Of course, it is always possible to do it unsafely.
Sep 12 2013
On 9/11/2013 8:01 AM, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian SchottOne litmus test of the validity of the design is to see if one can build a D parser out of it. Otherwise we'll likely face the unpleasant task of needing to deprecate std.d.lexer.
Sep 11 2013
On Wednesday, 11 September 2013 at 20:37:11 UTC, Walter Bright wrote:On 9/11/2013 8:01 AM, Dicebot wrote:I'm not sure if you've followed the announce mailing list recently, but this lexer is already used in both the DScanner and DCD projects. There is a parser module that goes along with this lexer included in DScanner (and used by DCD as well). It's not ready for review though. I'm still working on improving its error recovery. DScanner is a tool for D source that's able to generate syntax-highlighted HTML, perform syntax validation, generate the AST in XML format, generate CTAGS, list imports, and count lines of code. DCD is an autocompletion client/server program that has integrations for several text editors.std.d.lexer is standard module for lexing D code, written by Brian SchottOne litmus test of the validity of the design is to see if one can build a D parser out of it. Otherwise we'll likely face the unpleasant task of needing to deprecate std.d.lexer.
Sep 11 2013
On 9/11/2013 1:44 PM, Brian Schott wrote:I'm not sure if you've followed the announce mailing list recently, but this lexer is already used in both the DScanner and DCD projects.Awesome! This is just what I asked for.
Sep 11 2013
On 09/11/2013 05:01 PM, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian SchottNice!Please take this part seriously: "If you identify problems along the way, please note if they are minor, serious, or showstoppers." (http://wiki.dlang.org/Review/Process). This information later will be used to determine if library is ready for voting.I only had a short look at the source code and was left with a good impression, i.e. it looks like a lexer written in D. I do like how you made the lexer configurable to serve the different needs. Also it was very easy to set up and use. The appended underscore for keywords is much nicer than messing with uppercase/lowercase, which was what I did. One thing that bothered me was the usage of const(ubyte)[] as input. If you operate on UTF-8 it should take const(char)[]; you can cast it to const(ubyte)[] internally. Also a convenience function that reads a file and processes UTF BOM marks would be nice (see toUtf8 https://github.com/dawgfoto/lexer/blob/master/dlexer/dlexer.d#L1429), but that could as well fit somewhere else into Phobos.---- Information request from module author ---- Performance was a major discussed topic in previous thread. Could you please provide benchmarking data for version currently ongoing the review?Nice too, it works as fast as my D lexer (https://github.com/dawgfoto/lexer) which is about as fast as DMD's lexer (when compiled with LDC).
Sep 11 2013
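Such a convenience function might look like the following sketch. It handles only the UTF-8 BOM; a full version would also detect the UTF-16/32 BOMs and transcode, as the linked toUtf8 does (readSource is an invented name):

import std.file : read;

// Read a source file and strip a leading UTF-8 BOM (EF BB BF) if present.
const(char)[] readSource(string path)
{
    auto bytes = cast(const(ubyte)[]) read(path);
    if (bytes.length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        bytes = bytes[3 .. $];
    return cast(const(char)[]) bytes;
}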
On 2013-09-12 00:36, Martin Nowak wrote:Also a convenience function that reads a file and processes UTF BOM marks would be nice (see toUtf8 https://github.com/dawgfoto/lexer/blob/master/dlexer/dlexer.d#L1429), but that could as well fit somewhere else into phobos.Sounds like that would fit in std.file or similar. -- /Jacob Carlborg
Sep 12 2013
On 11.09.2013 17:01, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian SchottA standard benchmark scenario is needed - to compare dmd and std.d.lexer with dmc_dmd, ldc, gdc and msc_dmd - just to be clear how close it comes to the current dmd lexing speed.
Sep 11 2013
On 2013-09-11 17:01, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian SchottFinally :)

* How does it handle errors? Just return TokenType.invalid?
* Personally I think the module is too big. I would go with:
  - std.d.lexer.token
  - std.d.lexer.tokentype
  - std.d.lexer.lexer - contains the rest
  - std.d.lexer.config - IterationStyle, TokenStyle, LexerConfig
  - CircularRange, StringCache, possibly put somewhere else. I assume this can be used for other things than lexing?
  - Trie related code, same as above
* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token. If we do decide it should throw, it should absolutely _not_ throw a plain Exception. Create a new type, LexException or similar. I hate when code throws plain Exceptions, it makes it useless to catch them. I would also expect this LexException to contain a Token. It shouldn't be needed to parse the exception message to get line and column information.
* I like that you overall use clear and descriptive variable and function names. Except "sbox": https://github.com/Hackerpilot/phobos/blob/master/std/d/lexer.d#L3265
* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table
* I would like to see a short description for each unit test, what it's testing. Personally I have started with this style:

describe("byToken") {
    context("valid string literal") {
        it("should return a token with the type TokenType.stringLiteral")
        unittest {
            // test
        }

        it("should return a token with the correct lexeme")
        unittest {
            // test
        }
    }
}

Better formatted: http://pastebin.com/Dx78Vw6r People here might think that would be a bit too verbose. The following would be ok as well:

describe("short description of the unit test")
unittest {
}

* Could you remove debug code and other code that is commented out:
  - 344
  - 1172
  - 1226, is that needed?
  - 3165-3166
  - 3197-3198
  - 3392
  - 3410
  - 3434

Spelling errors:
* "forwarad" - 292
* "commemnt" - 2031
* "sentenels" - 299
* "messsage" - 301
* "underliying" - 2454
* "alloctors" - 3230
* "strightforward" - 2276

-- /Jacob Carlborg
Sep 12 2013
12-Sep-2013 12:05, Jacob Carlborg wrote:On 2013-09-11 17:01, Dicebot wrote:These could be one module. There is really no meaningful way to use the token type separately from the token.std.d.lexer is standard module for lexing D code, written by Brian SchottFinally :) * How does it handle errors? Just return TokenType.invalid? * Personally I think the module is too big. I would go with: - std.d.lexer.token - std.d.lexer.tokentype- std.d.lexer.lexer - contains the rest - std.d.lexer.config - IterationStyle, TokenStyle, LexerConfigOn the contrary, I see this breakdown as pointless - do you really want to use the config without the lexer?- CircularRange, StringCache, possibly put somewhere else. I assume this can be used for other things than lexing? - Trie related code, same as aboveThe reason is that no good public interface has been defined. Basically the same as with Trie in the new std.uni module - it needs its own review.* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token. If we do decide it should throw, it should absolutely _not_ throw a plain Exception. Create a new type, LexException or similar. I hate when code throws plain Exceptions, it makes it useless to catch them. I would also expect this LexException to contain a Token. It shouldn't be needed to parse the exception message to get line and column information.Better yet would be to have a std exception hierarchy... so that all parsing modules can be tied to ParseException. So this needs to be resolved in a forward-compatible way. -- Dmitry Olshansky
Sep 12 2013
On 2013-09-12 18:21, Dmitry Olshansky wrote:Contrary I see this break down pointless - do you really want to use config without the lexer?std.d.lexer.config might be a bit unnecessary. But yes, in general I would like to see smaller modules. -- /Jacob Carlborg
Sep 12 2013
I got some time to work on the lexer this evening. Changeset here: https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d The DDoc page has moved here: http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html

* There are a few more unit tests now
* bitAnd renamed to amp
* slice renamed to dotdot
* Much more cross-referencing in the doc comments
* Start line and column can be specified in the lexer config
Sep 12 2013
On 12.09.2013 10:15, Brian Schott wrote:I got some time to work on the lexer this evening. Changeset here: https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d The DDoc page has moved here: http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.html * There are a few more unit tests now * bitAnd renamed to amp * slice renamed to dotdot * Much more cross-referencing in the doc comments * Start line and column can be specified in the lexer configProblem: many occurrences of the same string. You should use constants for the tokens (and others):

string asm_token = "asm";
...
immutable(string[TokenType.max + 1]) tokenValues = [ ... asm_token ... ]

and reuse these constants in your "optimization". Maybe you can replace these lines with something that gets fed asm_token and gives the same result, but without 'a' and "sm" as split-up and different magic values - maybe a nice template or subrange...

case 'a':
    if (input[1..$].equal("sm"))
        return TokenType.asm_;
    else ...
    break;

and in your unit tests, for example on auto expected =
Sep 12 2013
just an idea regarding your string[TokenType.max + 1] in immutable(string[TokenType.max + 1]) tokenValues = [..]: it seems that you are trying to reduce the memory usage. Wouldn't it be a nice idea to generate a combined immutable string at compile time, like this one: "...pragmaexportpackageprivate...", and generate string slice accesses?

immutable string big_one = generated_from(token_list);
immutable string export_token = big_one[10 .. 16];
Sep 12 2013
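A rough CTFE sketch of how that could work; tokenList here is a hypothetical three-entry table standing in for the real one, and the offset computation is illustrative, not what std.d.lexer actually does:

// Hypothetical token list; the real table is much larger.
enum tokenList = ["asm", "export", "pragma"];

// CTFE-able concatenation of all token strings into one blob.
string buildBlob(const(string)[] toks)
{
    string blob;
    foreach (t; toks)
        blob ~= t;
    return blob;
}

immutable string bigOne = buildBlob(tokenList);

// Slice for token #index, with the offsets fixed at compile time.
string tokenSlice(size_t index)()
{
    enum start = buildBlob(tokenList[0 .. index]).length;
    enum end = start + tokenList[index].length;
    return bigOne[start .. end];
}

unittest
{
    assert(tokenSlice!1() == "export");           // bigOne[3 .. 9]
    assert(tokenSlice!1().ptr is bigOne.ptr + 3); // a slice, not a copy
}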
On 9/12/2013 1:15 AM, Brian Schott wrote:I got some time to work on the lexer this evening. Changeset here: https://github.com/Hackerpilot/phobos/commit/9bdb7f97bb8021f3b0d0291896b8fe21a6fead23#std/d/lexer.d The DDoc page has moved here: http://hackerpilot.github.io/experimental/std_lexer/phobos/std_d_lexer.htmlGreat!* There are a few more unit tests nowI strongly recommend running the unit tests with -cov. A lexer should be able to get near 100% coverage with the unit tests.* bitAnd renamed to amp * slice rename to dotdot * Much more cross-referencing in the doc comments * Start line and column can be specified in the lexer config
Sep 12 2013
On Thursday, 12 September 2013 at 23:40:55 UTC, Walter Bright wrote:I strongly recommend running the unit tests with -cov. A lexer should be able to get near 100% coverage with the unit tests.Some of the code is only present to be used at compile-time to generate switch statements inside of the lexer. -cov doesn't show code that's executed at compile-time to be covered, and it couldn't show meaningful line numbers on code that's generated and mixed in. That being said, it's currently 70% covered. I'll be making another pass over the code later to fill in some of the gaps.
Sep 12 2013
On 9/12/2013 4:57 PM, Brian Schott wrote:On Thursday, 12 September 2013 at 23:40:55 UTC, Walter Bright wrote:That's right.I strongly recommend running the unit tests with -cov. A lexer should be able to get near 100% coverage with the unit tests.Some of the code is only present to be used at compile-time to generate switch statements inside of the lexer. -cov doesn't show code that's executed at compile-time to be covered, and it couldn't show meaningful line numbers on code that's generated and mixed in.That being said, it's currently 70% covered. I'll be making another pass over the code later to fill in some of the gaps.Thanks.
Sep 12 2013
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:Most important goal of this review is to determine any API / design problems. Any internal implementation tweaks may happen after inclusion to Phobos but it is important to assure that no breaking changes will be required any time soon after module will get wider usage.One quick remark: we need some kind of value provider that can be reused across different lexing passes, and that can be used outside the lexer. If I process a module and have to kick off a new lexing phase because of a mixin, I want to generate identifiers out of the same pool.
Sep 12 2013
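What is being asked for is presumably something like the following hypothetical interning pool, shared across lexer instances; this is a sketch, not std.d.lexer's current API:

struct IdentifierPool
{
    private string[string] interned;

    // Return the canonical copy of s, storing it on first sight.
    string intern(string s)
    {
        if (auto p = s in interned)
            return *p;
        interned[s] = s;
        return s;
    }
}

unittest
{
    IdentifierPool pool;
    auto a = pool.intern("foo".idup); // from the module's source
    auto b = pool.intern("foo".idup); // from a later mixin's source
    assert(a is b); // one pool, one copy, across lexing passes
}

With one pool passed to every lexer instance, identifiers can be compared by reference instead of by content, which is the benefit being requested.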
On 09/11/2013 05:01 PM, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian Schott ---- Input ---- Code: https://github.com/Hackerpilot/phobos/tree/master/std/d Documentation: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html ...(Commenting on what's visible in the documentation only for now.)

auto config = ...
... .byToken(config) ...

Seems to be a natural candidate for manual partial specialization:

enum config = ...
... .byToken!config() ...

uint line;
ushort column; // is there overflow checking?

"Check to see if the token is of the same type and has the same string representation as the given token." Tokens with the same string representation are always of the same type, so this seems redundant. Furthermore, I'd expect (!a.opCmp(b)) === (a == b). Why provide the operator overloads at all? They don't implement essential or natural functionality. "includeSpecialTokens". It's not clear what this flag does. "If the input range supports slicing, the caching layer aliases itself away and the lexing process is much more efficient." It might be more sensible to require the user to manually wrap his range.

pure nothrow bool isOperator(const TokenType t);
pure nothrow bool isOperator(ref const Token t);
pure nothrow bool isKeyword(const TokenType t);
pure nothrow bool isKeyword(ref const Token t);
...

IMO we should get rid of these. TokenType naming seems inconsistent. E.g. & is amp, = is assign, == is equal, but &= is bitAndEqual and && is logicAnd. IMO better: & is and, = is assign, &= is andAssign and && is andAnd. Of course, it might be best to use a template instead: Tok!"&", Tok!"&=" and Tok!"&&".
Sep 12 2013
Discussion has slowed down and thread is fading away. I personally have not noticed any crucial objections (ones that can't be figured out during normal pull request review process). Brian, what is your state on this? Is there something you want to add/change before the voting gets initiated?
Sep 17 2013
On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:Discussion has slowed down and thread is fading away. I personally have not noticed any crucial objections (ones that can't be figured out during normal pull request review process). Brian, what is your state on this? Is there something you want to add/change before the voting gets initiated?I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.
Sep 17 2013
On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:Well, I suppose only Brian can address it :) I am only checking if there are no commonly repeated objections that are likely to ruin the voting. There are plenty of comments of course and I would recommend Brian to at least make some statement about those before voting but that is his call. There is a possibility to mention one issue as a blocker when voting though, that will be taken into consideration when counting actual votes.Discussion has slowed down and thread is fading away. I personally have not noticed any crucial objections (ones that can't be figured out during normal pull request review process). Brian, what is your state on this? Is there something you want to add/change before the voting gets initiated?I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.
Sep 17 2013
On Tuesday, 17 September 2013 at 16:48:44 UTC, Dicebot wrote:On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:I've been busy with things that aren't D-related recently, but I should have time the rest of this week to address the lexer.On Tuesday, 17 September 2013 at 15:31:00 UTC, Dicebot wrote:Well, I suppose only Brian can address it :) I am only checking if there are no commonly repeated objections that are likely to ruin the voting. There are plenty of comments of course and I would recommend Brian to at least make some statement about those before voting but that is his call. There is a possibility to mention one issue as a blocker when voting though, that will be taken into consideration when counting actual votes.Discussion has slowed down and thread is fading away. I personally have not noticed any crucial objections (ones that can't be figured out during normal pull request review process). Brian, what is your state on this? Is there something you want to add/change before the voting gets initiated?I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.
Sep 17 2013
On Tuesday, 17 September 2013 at 20:14:36 UTC, Brian Schott wrote:I've been busy with things that aren't D-related recently, but I should have time the rest of this week to address the lexer.Ok, please notify me then once you feel ready for voting / extra review ( public at dicebot.lv ). Thanks for your work.
Sep 17 2013
On Tuesday, 17 September 2013 at 20:14:36 UTC, Brian Schott wrote:I've been busy with things that aren't D-related recently, but I should have time the rest of this week to address the lexer.Changes since last time: https://github.com/Hackerpilot/phobos/compare/D-Programming-Language:df38839...master Test coverage is up to 85% now and a few bugs have been fixed.
Sep 24 2013
On 2013-09-25 04:48, Brian Schott wrote:Changes since last time: https://github.com/Hackerpilot/phobos/compare/D-Programming-Language:df38839...masterI had some comments and a couple of minor things like spelling errors:

* I see that errorMessage throws an exception. Do we really want that? I would expect it to just return an invalid token.
* Could we get some unit tests for string literals, comments and identifiers outside of the ASCII table
* Personally I would like to see a short description for each unit test, what it's testing
* Could you remove debug code and other code that is commented out:
  - 344
  - 1172
  - 1226, is that needed?
  - 3165-3166
  - 3197-3198
  - 3392
  - 3410
  - 3434

Spelling errors:
* "forwarad" - 292
* "commemnt" - 2031
* "sentenels" - 299
* "messsage" - 301
* "underliying" - 2454
* "alloctors" - 3230
* "strightforward" - 2276

I guess these line numbers might be off now. My original comments were made September 12. For reference see: http://forum.dlang.org/thread/jsnhlcbulwyjuqcqoepe forum.dlang.org?page=7#post-l0rsje:24jf9:241:40digitalmars.com -- /Jacob Carlborg
Sep 25 2013
On Wednesday, 25 September 2013 at 09:36:43 UTC, Jacob Carlborg wrote:* I see that errorMessage throws an exception. Do we really want that? I would except it just returns an invalid token.This is the default behavior that happens when you don't configure an error callback.* Could we get some unit tests for string literals, comments and identifies out side of the ASCII tableI've added one.* Could you remove debug code and other code that is commented out:Most of this is now gone.Spell errors:These were fixed weeks ago.
Sep 25 2013
On Wednesday, 25 September 2013 at 16:52:43 UTC, Brian Schott wrote:This is the default behavior that happens when you don't configure an error callback.I see.I've added one.Thanks.Most of this is now gone.That's good.These were fixed weeks ago.Great, I just never got a response to that. -- /Jacob Carlborg
Sep 25 2013
Hello. I'm not sure if this belongs here, but I think there is a bug at the very start of the Lexer chapter: Is U+001A really meant to end the source file? According to the Unicode specification this is a "replacement character", like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to end the source (this would fit, as it means "end of media")? I don't know if anybody has ever ended his source in that way or if it was tested. More important to me is that all the space characters beyond ASCII are not considered whitespace (starting with U+00A0 NBSP, the different wide spaces U+2000 to U+200B, up to the exotic stuff U+202F, U+205F, U+2060, U+3000 and the famous U+FEFF). Why? Ok, the set is much larger, but for end-of-line the Unicode versions (U+2028 and U+2029) are also added. This seems inconsistent to me.
Sep 26 2013
On 26-9-2013 17:41, Dominikus Dittes Scherkl wrote:Hello. I'm not sure if this belongs here, but I think there is bug at the very start of the Lexer chapter: Is U+001A really meant to end the source file? According to the Unicode specification this is a "replacement character", like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to end the source (this would fit, as it means "end of media"). I don't know if anybody ever has ended his source in that way or if it was tested. More important to me is, that all the Space-Characters beyond ASCII are not considered whitespace (starting with U+00A0 NBSP, the different wide spaces U+2000 to U+200B up to the exotic stuff U+202F, U+205F, U+2060, U+3000 and the famous U+FEFF). Why? Ok, the set is much larger, but for the end-of-line also the unicode versions (U+2028 and U+2029) are added. This seems inconsequent to me.I imagine the lexer follows the language specification: http://dlang.org/lex.html#EndOfFile
Sep 26 2013
On Thursday, 26 September 2013 at 16:47:09 UTC, Jos van Uden wrote:I know. What I wanted to say is: the language specification has a bug here (at least it is strange to interpret a "replacement character" as end of file but not "end of media") and the handling of Unicode space characters is not nice. If this is not the right place to discuss that matter, please point me to a better place.Is U+001A really meant to end the source file? According to the Unicode specification this is a "replacement character", like the newer U+FFFC. Or is it simply a spelling error and U+0019 was intended to end the source (this would fit, as it means "end of media")? More important to me is that all the space characters beyond ASCII are not considered whitespaceI imagine the lexer follows the language specification: http://dlang.org/lex.html#EndOfFile
Sep 27 2013
On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.Doing that would require making the identifier pool part of the public API, which is not something that I want to do at the moment. Let's wait until the allocators are figured out first.
Sep 24 2013
On Wednesday, 25 September 2013 at 02:23:36 UTC, Brian Schott wrote:On Tuesday, 17 September 2013 at 16:34:01 UTC, deadalnix wrote:Yes, ideally as an alias parameter. This is a showstopper for many use cases.I had some comments that nobody addressed. Mostly about firing several instances of the lexer with the same identifier pool.Doing that would require making the identifier pool part of the public API, which is not something that I want to do at the moment. Let's wait until the allocators are figured out first.
Sep 24 2013
On Wednesday, 11 September 2013 at 15:02:00 UTC, Dicebot wrote:std.d.lexer is standard module for lexing D code, written by Brian SchottBrian, have you got any plans to write a common lexer/parser generator based on a context-free grammar (CFG) or a parsing expression grammar (PEG), not only for the D grammar? Something like this: https://github.com/PhilippeSigaud/Pegged
Sep 17 2013