digitalmars.D - What would need to be done to get sdc.lexer to std.lexer quality?

Bernard Helyer (23/23) Aug 01 2012 Okay, so I've seen several comments from several people

Bernard Helyer (3/3) Aug 01 2012 I have been informed that deadalnix, that wily Frenchman, has

deadalnix (3/6) Aug 01 2012 It shouldn't be included in phobos, but can be useful to test things

Jakob Ovrum (10/33) Aug 01 2012 Some of the other comments I brought up on IRC:

Walter Bright (11/16) Aug 01 2012 That's just not going to produce a high performance lexer.

Jakob Ovrum (7/17) Aug 01 2012 Which is exactly why I'm pointing out the current, poor approach.

Walter Bright (5/23) Aug 01 2012 Using a class implies an extra level of indirection, and the other issue...

David Nadlinger (3/5) Aug 02 2012 How is pass-by-ref not an extra level of indirection?

Walter Bright (4/8) Aug 02 2012 If you have a "Lexer" instance that contains by value the current Token,...

Jakob Ovrum (12/17) Aug 02 2012 You'll always have an extra layer of indirection if you aim not
deadalnix (3/7) Aug 02 2012 The fact that ref for classes and ref for everything else works

Jacob Carlborg (5/10) Aug 02 2012 If you change Token to a struct it takes 64bytes on a LP64 platform. I

Jacob Carlborg (8/10) Aug 02 2012 Just for comparison, the type used for tokens in Clang is only 24 bytes....

Bernard Helyer (6/6) Aug 02 2012 In my dev work I've shaved some bytes off of Token.

Jacob Carlborg (7/13) Aug 02 2012 I think the source location is calculated on demand based on that offset...

Bernard Helyer (4/7) Aug 02 2012 Good point. I think there's 180 ish at the moment, so we can

Bernard Helyer (7/10) Aug 02 2012 That's why I moved Token to a class in the first place.

Walter Bright (7/9) Aug 02 2012 Thank you. I've got a lot of experience writing lexers and heavily using...

Bernard Helyer (3/12) Aug 02 2012 I will make it my mission to kick your

Walter Bright (2/4) Aug 02 2012 I am looking forward to a good ass-kicking lexer!

Nathan M. Swan (3/5) Aug 03 2012 Are there any specific ways I could help?

"Bernard Helyer" <b.helyer gmail.com> writes:

Okay, so I've seen several comments from several people
regarding the need for a D lexer in Phobos. I figure
I should contribute something to this NG other than
misdirected anger, so here it is.

SDC has a lexer, and it's pretty much complete. It handles
unicode and script lines, and #line and friends.

It's currently MIT, but I've been meaning to re license to
to boost, so that's not an issue. It used to have some number
lexing code stolen from DMD, but I removed that when we moved
to MIT.

https://github.com/bhelyer/SDC/blob/master/src/sdc/lexer.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/source.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/tokenstream.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/token.d
https://github.com/bhelyer/SDC/blob/master/src/sdc/location.d

TokenStream would need to become a range, name and specific
interface details requested from you fine people.

opKirbyRape will, with great regret, have to go.

Documentation will need to be buffed, and it'll need to be
renamed into Phobos style.

I'm willing to do the work if people think it's worthwhile,
and I can get some directed suggestions.

-Bernard.

Aug 01 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

I have been informed that deadalnix, that wily Frenchman, has
already built a range abstraction on top of it. So that's a
plus.

Aug 01 2012

deadalnix <deadalnix gmail.com> writes:

Le 02/08/2012 01:14, Bernard Helyer a écrit :
 I have been informed that deadalnix, that wily Frenchman, has
 already built a range abstraction on top of it. So that's a
 plus.

It shouldn't be included in phobos, but can be useful to test things 
during dev.

Aug 01 2012

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Wednesday, 1 August 2012 at 23:06:19 UTC, Bernard Helyer wrote:
 Okay, so I've seen several comments from several people
 regarding the need for a D lexer in Phobos. I figure
 I should contribute something to this NG other than
 misdirected anger, so here it is.

 SDC has a lexer, and it's pretty much complete. It handles
 unicode and script lines, and #line and friends.

 It's currently MIT, but I've been meaning to re license to
 to boost, so that's not an issue. It used to have some number
 lexing code stolen from DMD, but I removed that when we moved
 to MIT.

 https://github.com/bhelyer/SDC/blob/master/src/sdc/lexer.d
 https://github.com/bhelyer/SDC/blob/master/src/sdc/source.d
 https://github.com/bhelyer/SDC/blob/master/src/sdc/tokenstream.d
 https://github.com/bhelyer/SDC/blob/master/src/sdc/token.d
 https://github.com/bhelyer/SDC/blob/master/src/sdc/location.d

 TokenStream would need to become a range, name and specific
 interface details requested from you fine people.

 opKirbyRape will, with great regret, have to go.

 Documentation will need to be buffed, and it'll need to be
 renamed into Phobos style.

 I'm willing to do the work if people think it's worthwhile,
 and I can get some directed suggestions.

 -Bernard.

Some of the other comments I brought up on IRC:

  * Currently files are read in their entirety first, then parsed. 
It is worth exploring the idea of reading it in chunks lazily.
  * The current result (TokenStream) is a wrapper over a 
GC-allocated array of Token class instances, each instance with 
its own GC allocation (new Token). It is worth exploring an 
alternative allocation strategy for the tokens.

There are a *lot* of little things that need to be done, but 
everything important is in place.

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 4:18 PM, Jakob Ovrum wrote:
   * Currently files are read in their entirety first, then parsed. It is worth
 exploring the idea of reading it in chunks lazily.

Using an input range will take care of that nicely.

   * The current result (TokenStream) is a wrapper over a GC-allocated array of
 Token class instances, each instance with its own GC allocation (new Token). It
 is worth exploring an alternative allocation strategy for the tokens.

That's just not going to produce a high performance lexer.

The way to do it is in the Lexer instance, have a value which is the current 
Token instance. That way, in the normal case, one NEVER has to allocate a token 
instance.

Only when lookahead is done is storage allocation required, and that list
should 
be held by Lexer and recycled as tokens get consumed. This is how the dmd lexer 
works.

Doing one allocation per token is never going to scale to trying to shove 
millions upon millions of lines of code through it.

Aug 01 2012

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Thursday, 2 August 2012 at 04:38:11 UTC, Walter Bright wrote:
 That's just not going to produce a high performance lexer.

 The way to do it is in the Lexer instance, have a value which 
 is the current Token instance. That way, in the normal case, 
 one NEVER has to allocate a token instance.

 Only when lookahead is done is storage allocation required, and 
 that list should be held by Lexer and recycled as tokens get 
 consumed. This is how the dmd lexer works.

 Doing one allocation per token is never going to scale to 
 trying to shove millions upon millions of lines of code through 
 it.

Which is exactly why I'm pointing out the current, poor approach. 
Having a single array with contiguous Tokens for lookahead is 
completely doable even when Token is a class with some simple 
GC.malloc and emplace composition. I think SDC's Token class is 
too big to be useful as a struct, you'd pretty much never want to 
pass it anywhere by value.

Aug 01 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/1/2012 10:31 PM, Jakob Ovrum wrote:
 On Thursday, 2 August 2012 at 04:38:11 UTC, Walter Bright wrote:
 That's just not going to produce a high performance lexer.

 The way to do it is in the Lexer instance, have a value which is the current
 Token instance. That way, in the normal case, one NEVER has to allocate a
 token instance.

 Only when lookahead is done is storage allocation required, and that list
 should be held by Lexer and recycled as tokens get consumed. This is how the
 dmd lexer works.

 Doing one allocation per token is never going to scale to trying to shove
 millions upon millions of lines of code through it.

 Which is exactly why I'm pointing out the current, poor approach. Having a
 single array with contiguous Tokens for lookahead is completely doable even
when
 Token is a class with some simple GC.malloc and emplace composition. I think
 SDC's Token class is too big to be useful as a struct, you'd pretty much never
 want to pass it anywhere by value.

Using a class implies an extra level of indirection, and the other issue is the 
only point to using a class is if you're going to derive from it and override 
its methods. I don't see that for a Token.

Use pass-by-ref for the Token.

Aug 01 2012

"David Nadlinger" <see klickverbot.at> writes:

On Thursday, 2 August 2012 at 05:36:37 UTC, Walter Bright wrote:
 Using a class implies an extra level of indirection, […]
 Use pass-by-ref for the Token.

How is pass-by-ref not an extra level of indirection?

David

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 12:04 AM, David Nadlinger wrote:
 On Thursday, 2 August 2012 at 05:36:37 UTC, Walter Bright wrote:
 Using a class implies an extra level of indirection, […]
 Use pass-by-ref for the Token.

 How is pass-by-ref not an extra level of indirection?

If you have a "Lexer" instance that contains by value the current Token, then 
you deref to get to "Lexer". If Token is a class, then you deref to get "Lexer" 
and then deref again to get the current Token.

Aug 02 2012

"Jakob Ovrum" <jakobovrum gmail.com> writes:

On Thursday, 2 August 2012 at 05:36:37 UTC, Walter Bright wrote:
 Using a class implies an extra level of indirection, and the 
 other issue is the only point to using a class is if you're 
 going to derive from it and override its methods. I don't see 
 that for a Token.

 Use pass-by-ref for the Token.

You'll always have an extra layer of indirection if you aim not 
to pass by value. By only exposing a pointer/class reference you 
make it impossible to do the wrong thing by implicitly copying 
the struct; and if we have a struct which is only ever meant to 
be used through a pointer, we're better off using a class.

Of course, if we can trim the size of the struct sufficiently I'm 
all for using a value type; then we would also lose the two-word 
overhead all classes have (but really shouldn't have to have; the 
monitor should only be on synchronized classes and we wouldn't 
have to force a vpointer if Object wasn't bloated, but that's an 
argument for another day).

Aug 02 2012

deadalnix <deadalnix gmail.com> writes:

Le 02/08/2012 07:35, Walter Bright a écrit :
 Using a class implies an extra level of indirection, and the other issue
 is the only point to using a class is if you're going to derive from it
 and override its methods. I don't see that for a Token.

 Use pass-by-ref for the Token.

The fact that ref for classes and ref for everything else works 
differently don't really help here.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 07:31, Jakob Ovrum wrote:

 Which is exactly why I'm pointing out the current, poor approach. Having
 a single array with contiguous Tokens for lookahead is completely doable
 even when Token is a class with some simple GC.malloc and emplace
 composition. I think SDC's Token class is too big to be useful as a
 struct, you'd pretty much never want to pass it anywhere by value.

If you change Token to a struct it takes 64bytes on a LP64 platform. I 
don't know if that is too big to be passed around by value.

-- 
/Jacob Carlborg

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 09:11, Jacob Carlborg wrote:

 If you change Token to a struct it takes 64 bytes on a LP64 platform. I
 don't know if that is too big to be passed around by value.

Just for comparison, the type used for tokens in Clang is only 24 bytes. 
The main reason is the small source location. It's only 32 bites wide, 
it uses an uint as some kind of offset or id.

http://clang.llvm.org/doxygen/classclang_1_1Token.html
http://clang.llvm.org/doxygen/classclang_1_1SourceLocation.html

-- 
/Jacob Carlborg

Aug 02 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

In my dev work I've shaved some bytes off of Token.
I removed the filename from Location, as we don't assume
the input is a file anymore, and I've changed to tracking
line and column numbers as uint instead of size_t.

I don't know what kind of number I _should_ be aiming for,
but I'd imagine I'm not gonna get it that small.

Aug 02 2012

Jacob Carlborg <doob me.com> writes:

On 2012-08-02 09:26, Bernard Helyer wrote:
 In my dev work I've shaved some bytes off of Token.
 I removed the filename from Location, as we don't assume
 the input is a file anymore, and I've changed to tracking
 line and column numbers as uint instead of size_t.

 I don't know what kind of number I _should_ be aiming for,
 but I'd imagine I'm not gonna get it that small.

I think the source location is calculated on demand based on that offset.

You can probably shave off a couple of bytes by using a (u)short or 
(u)byte instead of TokenKind. The TokenKind takes 32 bits, that's way 
more then what's actually needed.

-- 
/Jacob Carlborg

Aug 02 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

On Thursday, 2 August 2012 at 07:42:05 UTC, Jacob Carlborg wrote:
 You can probably shave off a couple of bytes by using a 
 (u)short or (u)byte instead of TokenKind. The TokenKind takes 
 32 bits, that's way more then what's actually needed.

Good point. I think there's 180 ish at the moment, so we can
get away with a ubyte save a cambrian explosion of new
keywords. :P

Aug 02 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

On Thursday, 2 August 2012 at 07:11:36 UTC, Jacob Carlborg wrote:
 If you change Token to a struct it takes 64bytes on a LP64 
 platform. I don't know if that is too big to be passed around 
 by value.

That's why I moved Token to a class in the first place.
It became far too big and you had to pass it around by
reference, which I thought defeated the purpose.

https://github.com/bhelyer/std.d.lexer/

Gonna spend some time massaging this into a
Walter-Approved (tm) lexer. It's got some ways to go.

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 12:22 AM, Bernard Helyer wrote:
 Gonna spend some time massaging this into a
 Walter-Approved (tm) lexer. It's got some ways to go.


Thank you. I've got a lot of experience writing lexers and heavily using them 
professionally, but I'm still finding ways to make them better - faster - more 
encapsulated.

Since std.d.lexer won't be a third party module, but one that comes with a 
compiler, it must pass muster as a top quality one and be as good as the one in 
the compiler.

Aug 02 2012

"Bernard Helyer" <b.helyer gmail.com> writes:

On Thursday, 2 August 2012 at 20:05:59 UTC, Walter Bright wrote:
 On 8/2/2012 12:22 AM, Bernard Helyer wrote:
 Gonna spend some time massaging this into a
 Walter-Approved (tm) lexer. It's got some ways to go.


 Thank you. I've got a lot of experience writing lexers and 
 heavily using them professionally, but I'm still finding ways 
 to make them better - faster - more encapsulated.

 Since std.d.lexer won't be a third party module, but one that 
 comes with a compiler, it must pass muster as a top quality one 
 and be as good as the one in the compiler.

I will make it my mission to kick your
(metaphorical performance based) ass, sir.

Aug 02 2012

Walter Bright <newshound2 digitalmars.com> writes:

On 8/2/2012 7:21 PM, Bernard Helyer wrote:
 I will make it my mission to kick your
 (metaphorical performance based) ass, sir.

I am looking forward to a good ass-kicking lexer!

Aug 02 2012

"Nathan M. Swan" <nathanmswan gmail.com> writes:

On Thursday, 2 August 2012 at 07:22:57 UTC, Bernard Helyer wrote:
 Gonna spend some time massaging this into a
 Walter-Approved (tm) lexer. It's got some ways to go.

Are there any specific ways I could help?
NMS

Aug 03 2012

D Programming

C/C++ Programming

Other

digitalmars.D - What would need to be done to get sdc.lexer to std.lexer quality?