www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - What would need to be done to get sdc.lexer to std.lexer quality?

reply "Bernard Helyer" <b.helyer gmail.com> writes:
Okay, so I've seen several comments from several people
regarding the need for a D lexer in Phobos. I figure
I should contribute something to this NG other than
misdirected anger, so here it is.

SDC has a lexer, and it's pretty much complete. It handles
unicode and script lines, and #line and friends.

It's currently MIT, but I've been meaning to re license to
to boost, so that's not an issue. It used to have some number
lexing code stolen from DMD, but I removed that when we moved
to MIT.

https://github.com/bhelyer/SDC/blob/master/src/sdc/lexer.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/source.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/tokenstream.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/token.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/location.d

TokenStream would need to become a range, name and specific
interface details requested from you fine people.

opKirbyRape will, with great regret, have to go.

Documentation will need to be buffed, and it'll need to be
renamed into Phobos style.

I'm willing to do the work if people think it's worthwhile,
and I can get some directed suggestions.

-Bernard.
Aug 01 2012
next sibling parent reply "Bernard Helyer" <b.helyer gmail.com> writes:
I have been informed that deadalnix, that wily Frenchman, has
already built a range abstraction on top of it. So that's a
plus.
Aug 01 2012
parent deadalnix <deadalnix gmail.com> writes:
Le 02/08/2012 01:14, Bernard Helyer a écrit :
 I have been informed that deadalnix, that wily Frenchman, has
 already built a range abstraction on top of it. So that's a
 plus.
It shouldn't be included in phobos, but can be useful to test things during dev.
Aug 01 2012
prev sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 1 August 2012 at 23:06:19 UTC, Bernard Helyer wrote:
 Okay, so I've seen several comments from several people
 regarding the need for a D lexer in Phobos. I figure
 I should contribute something to this NG other than
 misdirected anger, so here it is.

 SDC has a lexer, and it's pretty much complete. It handles
 unicode and script lines, and #line and friends.

 It's currently MIT, but I've been meaning to re license to
 to boost, so that's not an issue. It used to have some number
 lexing code stolen from DMD, but I removed that when we moved
 to MIT.

 https://github.com/bhelyer/SDC/blob/master/src/sdc/lexer.d
 https://github.com/bhelyer/SDC/blob/master/src/sdc/source.d
 https://github.com/bhelyer/SDC/blob/master/src/sdc/tokenstream.d
 https://github.com/bhelyer/SDC/blob/master/src/sdc/token.d
 https://github.com/bhelyer/SDC/blob/master/src/sdc/location.d

 TokenStream would need to become a range, name and specific
 interface details requested from you fine people.

 opKirbyRape will, with great regret, have to go.

 Documentation will need to be buffed, and it'll need to be
 renamed into Phobos style.

 I'm willing to do the work if people think it's worthwhile,
 and I can get some directed suggestions.

 -Bernard.
Some of the other comments I brought up on IRC: * Currently files are read in their entirety first, then parsed. It is worth exploring the idea of reading it in chunks lazily. * The current result (TokenStream) is a wrapper over a GC-allocated array of Token class instances, each instance with its own GC allocation (new Token). It is worth exploring an alternative allocation strategy for the tokens. There are a *lot* of little things that need to be done, but everything important is in place.
Aug 01 2012
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 4:18 PM, Jakob Ovrum wrote:
   * Currently files are read in their entirety first, then parsed. It is worth
 exploring the idea of reading it in chunks lazily.
Using an input range will take care of that nicely.
   * The current result (TokenStream) is a wrapper over a GC-allocated array of
 Token class instances, each instance with its own GC allocation (new Token). It
 is worth exploring an alternative allocation strategy for the tokens.
That's just not going to produce a high performance lexer. The way to do it is in the Lexer instance, have a value which is the current Token instance. That way, in the normal case, one NEVER has to allocate a token instance. Only when lookahead is done is storage allocation required, and that list should be held by Lexer and recycled as tokens get consumed. This is how the dmd lexer works. Doing one allocation per token is never going to scale to trying to shove millions upon millions of lines of code through it.
Aug 01 2012
parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 2 August 2012 at 04:38:11 UTC, Walter Bright wrote:
 That's just not going to produce a high performance lexer.

 The way to do it is in the Lexer instance, have a value which 
 is the current Token instance. That way, in the normal case, 
 one NEVER has to allocate a token instance.

 Only when lookahead is done is storage allocation required, and 
 that list should be held by Lexer and recycled as tokens get 
 consumed. This is how the dmd lexer works.

 Doing one allocation per token is never going to scale to 
 trying to shove millions upon millions of lines of code through 
 it.
Which is exactly why I'm pointing out the current, poor approach. Having a single array with contiguous Tokens for lookahead is completely doable even when Token is a class with some simple GC.malloc and emplace composition. I think SDC's Token class is too big to be useful as a struct, you'd pretty much never want to pass it anywhere by value.
Aug 01 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/1/2012 10:31 PM, Jakob Ovrum wrote:
 On Thursday, 2 August 2012 at 04:38:11 UTC, Walter Bright wrote:
 That's just not going to produce a high performance lexer.

 The way to do it is in the Lexer instance, have a value which is the current
 Token instance. That way, in the normal case, one NEVER has to allocate a
 token instance.

 Only when lookahead is done is storage allocation required, and that list
 should be held by Lexer and recycled as tokens get consumed. This is how the
 dmd lexer works.

 Doing one allocation per token is never going to scale to trying to shove
 millions upon millions of lines of code through it.
Which is exactly why I'm pointing out the current, poor approach. Having a single array with contiguous Tokens for lookahead is completely doable even when Token is a class with some simple GC.malloc and emplace composition. I think SDC's Token class is too big to be useful as a struct, you'd pretty much never want to pass it anywhere by value.
Using a class implies an extra level of indirection, and the other issue is the only point to using a class is if you're going to derive from it and override its methods. I don't see that for a Token. Use pass-by-ref for the Token.
Aug 01 2012
next sibling parent reply "David Nadlinger" <see klickverbot.at> writes:
On Thursday, 2 August 2012 at 05:36:37 UTC, Walter Bright wrote:
 Using a class implies an extra level of indirection, […]
 Use pass-by-ref for the Token.
How is pass-by-ref not an extra level of indirection? David
Aug 02 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 12:04 AM, David Nadlinger wrote:
 On Thursday, 2 August 2012 at 05:36:37 UTC, Walter Bright wrote:
 Using a class implies an extra level of indirection, […]
 Use pass-by-ref for the Token.
How is pass-by-ref not an extra level of indirection?
If you have a "Lexer" instance that contains by value the current Token, then you deref to get to "Lexer". If Token is a class, then you deref to get "Lexer" and then deref again to get the current Token.
Aug 02 2012
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Thursday, 2 August 2012 at 05:36:37 UTC, Walter Bright wrote:
 Using a class implies an extra level of indirection, and the 
 other issue is the only point to using a class is if you're 
 going to derive from it and override its methods. I don't see 
 that for a Token.

 Use pass-by-ref for the Token.
You'll always have an extra layer of indirection if you aim not to pass by value. By only exposing a pointer/class reference you make it impossible to do the wrong thing by implicitly copying the struct; and if we have a struct which is only ever meant to be used through a pointer, we're better off using a class. Of course, if we can trim the size of the struct sufficiently I'm all for using a value type; then we would also lose the two-word overhead all classes have (but really shouldn't have to have; the monitor should only be on synchronized classes and we wouldn't have to force a vpointer if Object wasn't bloated, but that's an argument for another day).
Aug 02 2012
prev sibling parent deadalnix <deadalnix gmail.com> writes:
Le 02/08/2012 07:35, Walter Bright a écrit :
 Using a class implies an extra level of indirection, and the other issue
 is the only point to using a class is if you're going to derive from it
 and override its methods. I don't see that for a Token.

 Use pass-by-ref for the Token.
The fact that ref for classes and ref for everything else works differently don't really help here.
Aug 02 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 07:31, Jakob Ovrum wrote:

 Which is exactly why I'm pointing out the current, poor approach. Having
 a single array with contiguous Tokens for lookahead is completely doable
 even when Token is a class with some simple GC.malloc and emplace
 composition. I think SDC's Token class is too big to be useful as a
 struct, you'd pretty much never want to pass it anywhere by value.
If you change Token to a struct it takes 64bytes on a LP64 platform. I don't know if that is too big to be passed around by value. -- /Jacob Carlborg
Aug 02 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 09:11, Jacob Carlborg wrote:

 If you change Token to a struct it takes 64 bytes on a LP64 platform. I
 don't know if that is too big to be passed around by value.
Just for comparison, the type used for tokens in Clang is only 24 bytes. The main reason is the small source location. It's only 32 bites wide, it uses an uint as some kind of offset or id. http://clang.llvm.org/doxygen/classclang_1_1Token.html http://clang.llvm.org/doxygen/classclang_1_1SourceLocation.html -- /Jacob Carlborg
Aug 02 2012
parent reply "Bernard Helyer" <b.helyer gmail.com> writes:
In my dev work I've shaved some bytes off of Token.
I removed the filename from Location, as we don't assume
the input is a file anymore, and I've changed to tracking
line and column numbers as uint instead of size_t.

I don't know what kind of number I _should_ be aiming for,
but I'd imagine I'm not gonna get it that small.
Aug 02 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-08-02 09:26, Bernard Helyer wrote:
 In my dev work I've shaved some bytes off of Token.
 I removed the filename from Location, as we don't assume
 the input is a file anymore, and I've changed to tracking
 line and column numbers as uint instead of size_t.

 I don't know what kind of number I _should_ be aiming for,
 but I'd imagine I'm not gonna get it that small.
I think the source location is calculated on demand based on that offset. You can probably shave off a couple of bytes by using a (u)short or (u)byte instead of TokenKind. The TokenKind takes 32 bits, that's way more then what's actually needed. -- /Jacob Carlborg
Aug 02 2012
parent "Bernard Helyer" <b.helyer gmail.com> writes:
On Thursday, 2 August 2012 at 07:42:05 UTC, Jacob Carlborg wrote:
 You can probably shave off a couple of bytes by using a 
 (u)short or (u)byte instead of TokenKind. The TokenKind takes 
 32 bits, that's way more then what's actually needed.
Good point. I think there's 180 ish at the moment, so we can get away with a ubyte save a cambrian explosion of new keywords. :P
Aug 02 2012
prev sibling parent reply "Bernard Helyer" <b.helyer gmail.com> writes:
On Thursday, 2 August 2012 at 07:11:36 UTC, Jacob Carlborg wrote:
 If you change Token to a struct it takes 64bytes on a LP64 
 platform. I don't know if that is too big to be passed around 
 by value.
That's why I moved Token to a class in the first place. It became far too big and you had to pass it around by reference, which I thought defeated the purpose. https://github.com/bhelyer/std.d.lexer/ Gonna spend some time massaging this into a Walter-Approved (tm) lexer. It's got some ways to go.
Aug 02 2012
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 12:22 AM, Bernard Helyer wrote:
 Gonna spend some time massaging this into a
 Walter-Approved (tm) lexer. It's got some ways to go.
Thank you. I've got a lot of experience writing lexers and heavily using them professionally, but I'm still finding ways to make them better - faster - more encapsulated. Since std.d.lexer won't be a third party module, but one that comes with a compiler, it must pass muster as a top quality one and be as good as the one in the compiler.
Aug 02 2012
parent reply "Bernard Helyer" <b.helyer gmail.com> writes:
On Thursday, 2 August 2012 at 20:05:59 UTC, Walter Bright wrote:
 On 8/2/2012 12:22 AM, Bernard Helyer wrote:
 Gonna spend some time massaging this into a
 Walter-Approved (tm) lexer. It's got some ways to go.
Thank you. I've got a lot of experience writing lexers and heavily using them professionally, but I'm still finding ways to make them better - faster - more encapsulated. Since std.d.lexer won't be a third party module, but one that comes with a compiler, it must pass muster as a top quality one and be as good as the one in the compiler.
I will make it my mission to kick your (metaphorical performance based) ass, sir.
Aug 02 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 8/2/2012 7:21 PM, Bernard Helyer wrote:
 I will make it my mission to kick your
 (metaphorical performance based) ass, sir.
I am looking forward to a good ass-kicking lexer!
Aug 02 2012
prev sibling parent "Nathan M. Swan" <nathanmswan gmail.com> writes:
On Thursday, 2 August 2012 at 07:22:57 UTC, Bernard Helyer wrote:
 Gonna spend some time massaging this into a
 Walter-Approved (tm) lexer. It's got some ways to go.
Are there any specific ways I could help? NMS
Aug 03 2012