
digitalmars.D.announce - DCT: D compiler as a collection of libraries

reply "Roman D. Boiko" <rb d-coding.com> writes:
There were several discussions about the need for a D compiler 
library.

I propose my draft implementation of lexer for community review:
https://github.com/roman-d-boiko/dct

Lexer is based on Brian Schott's project 
https://github.com/Hackerpilot/Dscanner, but it has been 
refactored and extended (and more changes are on the way).

The goal is to have source code loading, lexer, parser and 
semantic analysis available as parts of Phobos. These libraries 
should be designed to be usable in multiple scenarios (e.g., 
refactoring, code analysis, etc.).

My commitment is to have at least front end built this year (and 
conforming to the D2 specification unless explicitly stated 
otherwise for some particular aspect).

Please post any feedback here. A dedicated project website will be 
created later.
May 11 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 10:01, Roman D. Boiko wrote:
 There were several discussions about the need for a D compiler library.

 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct

 Lexer is based on Brian Schott's project
 https://github.com/Hackerpilot/Dscanner, but it has been refactored and
 extended (and more changes are on the way).

 The goal is to have source code loading, lexer, parser and semantic
 analysis available as parts of Phobos. These libraries should be
 designed to be usable in multiple scenarios (e.g., refactoring, code
 analysis, etc.).

 My commitment is to have at least front end built this year (and
 conforming to the D2 specification unless explicitly stated otherwise
 for some particular aspect).

 Please post any feed here. A dedicated project web-site will be created
 later.
(Re-posting here)

A couple of questions:

* What's the state of the lexer
* Does it convert numerical literals and similar to their actual values
* Does it retain full source information
* Is there an example we can look at to see how the API is used
* Does it have a range based interface

-- 
/Jacob Carlborg
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 08:38:36 UTC, Jacob Carlborg wrote:
 (Re-posting here)
 A couple of questions:

 * What's the sate of the lexer
I consider it a draft state, because it has got several rewrites recently and I plan to do more, especially based on community feedback. However, the implementation handles almost all possible cases. Because of the rewrites it is most likely broken at this moment; I'm going to fix it ASAP (in a day or two).

The lexer will provide a random-access range of tokens (this is not done yet). Each token contains:

* start index (position in the original encoding, 0 corresponds to the first code unit after the BOM),
* token value encoded as a UTF-8 string,
* token kind (e.g., token.kind = TokenKind.Float),
* possibly an enum with annotations (e.g., token.annotations = FloatAnnotation.Hex | FloatAnnotation.Real)
 * Does it convert numerical literals and similar to their 
 actual values
It is planned to add a post-processor for that as part of the parser; please see README.md for some more details.
 * Does it retain full source information
Yes, this is a design choice to preserve all information. Source code is converted to UTF-8 and stored as token.value, even whitespaces. Information about code unit indices in the original encoding is preserved, too.
 * Is there an example we can look at to see how the API is used
TBD soon (see Roadmap in the readme file)
 * Does it have a range based interface
Yes, this is what I consider one of its strengths.
May 11 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 10:58, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 08:38:36 UTC, Jacob Carlborg wrote:
 (Re-posting here)
 A couple of questions:

 * What's the sate of the lexer
I consider it a draft state, because it has got several rewrites recently and I plan to do more, especially based on community feedback. However, implementation handles almost all possible cases. Because of rewrites it is most likely broken at this moment, I'm going to fix it ASAP (in a day or two).
I see.
 Lexer will provide a random-access range of tokens (this is not
 done yet).
Ok.
 Each token contains:
 * start index (position in the original encoding, 0 corresponds
 to the first code unit after BOM),
 * token value encoded as UTF-8 string,
 * token kind (e.g., token.kind = TokenKind.Float),
 * possibly enum with annotations (e.g., token.annotations =
 FloatAnnotation.Hex | FloatAnnotation.Real)
What about line and column information?
 * Does it convert numerical literals and similar to their actual values
It is planned to add a post-processor for that as part of parser, please see README.md for some more details.
Isn't that a job for the lexer?
 * Does it retain full source information
Yes, this is a design choice to preserve all information. Source code is converted to UTF-8 and stored as token.value, even whitespaces. Information about code unit indices in the original encoding is preserved, too.
That sounds good.
 * Is there an example we can look at to see how the API is used
TBD soon (see Roadmap in the readme file)
 * Does it have a range based interface
Yes, this is what I consider one of its strengths.
I see. Thanks. -- /Jacob Carlborg
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:08:24 UTC, Jacob Carlborg wrote:
 On 2012-05-11 10:58, Roman D. Boiko wrote:
 Each token contains:
 * start index (position in the original encoding, 0 corresponds
 to the first code unit after BOM),
 * token value encoded as UTF-8 string,
 * token kind (e.g., token.kind = TokenKind.Float),
 * possibly enum with annotations (e.g., token.annotations =
 FloatAnnotation.Hex | FloatAnnotation.Real)
What about line and column information?
Indices of the first code unit of each line are stored inside the lexer, and a function will compute a Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that a Location can be computed on demand and is not needed frequently. So the column is calculated by a reverse walk to the previous end of line, etc.

It will be possible to calculate Locations either taking into account special token sequences (e.g., #line 3 "ab/c.d") or discarding them.
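To make the idea concrete, here is a minimal sketch (not the actual DCT API; locate and lineStarts are hypothetical names) of how a Location could be computed on demand via a binary search over the stored line-start indices:

import std.range : assumeSorted;

struct Location
{
    size_t line;   // 1-based line number
    size_t column; // 1-based column, counted in code units here for simplicity
}

// lineStarts holds the index of the first code unit of each line, in order,
// with lineStarts[0] == 0.
Location locate(const(size_t)[] lineStarts, size_t index)
{
    // All line starts that are <= index; the last of them begins our line.
    auto preceding = lineStarts.assumeSorted.lowerBound(index + 1);
    immutable line = preceding.length;
    immutable lineStart = lineStarts[line - 1];
    return Location(line, index - lineStart + 1);
}

A real implementation would additionally walk UTF code points from lineStart and account for tab characters, as described above.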
 * Does it convert numerical literals and similar to their 
 actual values
It is planned to add a post-processor for that as part of parser, please see README.md for some more details.
Isn't that a job for the lexer?
That might be done in the lexer for efficiency reasons (to avoid lexing the token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed.

Also, I don't think that performance would be improved, given the ratio of the number of literals to the total number of tokens and the need to store additional information per token if it is done in the lexer. I will elaborate on that later.
May 11 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 11:22, Roman D. Boiko wrote:

 What about line and column information?
Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct?
 * Does it convert numerical literals and similar to their actual values
It is planned to add a post-processor for that as part of parser, please see README.md for some more details.
Isn't that a job for the lexer?
That might be done in lexer for efficiency reasons (to avoid lexing token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed.
That might be the case. But I don't think it belongs in the parser.
 Also I don't think that performance would be
 improved given the ratio of number of literals to total number of tokens
 and the need to store additional information per token if it is done in
 lexer. I will elaborate on that later.
Ok, fair enough. Perhaps this could be a property in the Token struct as well. In that case I would suggest renaming "value" to lexeme/spelling/representation, or something like that, and then name the new property "value". -- /Jacob Carlborg
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote:
 On 2012-05-11 11:22, Roman D. Boiko wrote:
 Locations will possible to calculate
 both taking into account special token sequences (e.g., #line 3
 "ab/c.d"), or discarding them.
Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct?
There is a method for that in the Lexer interface; to me it looks like it belongs there and not in the token. A version accepting a token and producing a pair of start/end Locations will be added.
 * Does it convert numerical literals and similar to their 
 actual values
It is planned to add a post-processor for that as part of parser, please see README.md for some more details.
Isn't that a job for the lexer?
That might be done in lexer for efficiency reasons (to avoid lexing token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed.
That might be the case. But I don't think it belongs in the parser.
I will provide example code and a dedicated post later to illustrate my point.
 Also I don't think that performance would be
 improved given the ratio of number of literals to total number 
 of tokens
 and the need to store additional information per token if it 
 is done in
 lexer. I will elaborate on that later.
Ok, fair enough. Perhaps this could be a property in the Token struct as well. In that case I would suggest renaming "value" to lexeme/spelling/representation, or something like that, and then name the new property "value".
I was going to rename value, but couldn't find a nice term. Thanks for your suggestions! As for the property with a strongly typed literal value, currently I plan to put it into the AST.
May 11 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 12:35, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote:
 Aha, clever. As long as I can get out the information I'm happy :) How
 about adding properties for this in the token struct?
There is a method for that in Lexer interface, for me it looks like it belongth there and not to token. Version accepting token and producing a pair of start/end Locations will be added.
Found it now, "calculateFor". I'm not sure it's the most intuitive name though. I get the feeling: "calculate what?".
 That might be the case. But I don't think it belongs in the parser.
I will provide example code and a dedicated post later to illustrate my point.
I guess I'll have to wait for that then :)
 Ok, fair enough. Perhaps this could be a property in the Token struct
 as well. In that case I would suggest renaming "value" to
 lexeme/spelling/representation, or something like that, and then name
 the new property "value".
I was going to rename value, but couldn't find a nice term. Thanks for your suggestions! As for the property with strongly typed literal value, currently I plan to put it into AST.
I stole "spelling" from Clang :) -- /Jacob Carlborg
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:
 On 2012-05-11 12:35, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote:
 Aha, clever. As long as I can get out the information I'm 
 happy :) How
 about adding properties for this in the token struct?
There is a method for that in Lexer interface, for me it looks like it belongth there and not to token. Version accepting token and producing a pair of start/end Locations will be added.
Found it now, "calculateFor". It not sure if it's the most intuitive name though. I get the feeling: "calculate what?".
calculateLocation was the original name, but I don't like repeating the return type in method names, so I decided to change it in a way that makes it clear that another renaming is needed ;) Any suggestions?
 That might be the case. But I don't think it belongs in the 
 parser.
I will provide example code and a dedicated post later to illustrate my point.
I guess I'll have to wait for that then :)
I'll try to do that ahead of the roadmap; it is important.
May 11 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 14:07, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:
 Found it now, "calculateFor". It not sure if it's the most intuitive
 name though. I get the feeling: "calculate what?".
 calculateLocation was original name, but I don't like repeating return
 type in method names, I decided to change it so that it is clear that
 another renaming is needed ;) Any suggestions?
My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought.
 I guess I'll have to wait for that then :)
I'll try to do that ahead of roadmap, it is important.
Cool. -- /Jacob Carlborg
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 12:55:58 UTC, Jacob Carlborg wrote:
 On 2012-05-11 14:07, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:
 Found it now, "calculateFor". It not sure if it's the most 
 intuitive
 name though. I get the feeling: "calculate what?".
 calculateLocation was original name, but I don't like 
 repeating return
 type in method names, I decided to change it so that it is 
 clear that
 another renaming is needed ;) Any suggestions?
My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought.
What about the following signature: Location locate(size_t index)?

Or even better:

alias size_t CodeUnitIndex;
Location locateFor(CodeUnitIndex position);

The problem with placing it in Token is that Token should not know anything about the source as a whole.
May 11 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 15:01, Roman D. Boiko wrote:

 What about the following signature: Location locate(size_t index)?
 Or even better:
 alias size_t CodeUnitIndex;
 Location locateFor(CodeUnitIndex position);
That is better, although I would prefer to pass in a token (assuming that is where the index is declared). Then it would be an implementation detail that "index" is used to get the location.

Another option would be to turn it around:

struct/class Location
{
    Location find/locate (Token token);
}
 The problem with placing it in Token is that Token should not know
 anything about source as a whole.
I see. For convenience there could be properties defined that just forward the call to some other function.

-- 
/Jacob Carlborg
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 13:25:53 UTC, Jacob Carlborg wrote:
 On 2012-05-11 15:01, Roman D. Boiko wrote:

 What about the following signature: Location locate(size_t 
 index)?
 Or even better:
 alias size_t CodeUnitIndex;
 Location locateFor(CodeUnitIndex position);
That is better although I would prefer to pass in a token (assuming that is where index is declared). Then it would be an implementation detail that "index" is used to get the location.
That is also what I was planning to do after I posted my last suggestion, unless I discover some conceptual problem with it, which is unlikely.
 Another option would be to turn it around:

 sturct/class Location
 {
     Location find/locate (Token token);
 }

 The problem with placing it in Token is that Token should not 
 know
 anything about source as a whole.
I see. For convenience there could be properties defined that just forwards the call to some other function.
Both of these cases would introduce a circular dependency between the Lexer and its output data (Location or Token). It is possible to break such a dependency by complicating things even more, or to live with it. But I don't like circular dependencies when there is no compelling reason to introduce them.
May 11 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 11/05/2012 15:01, Roman D. Boiko a écrit :
 On Friday, 11 May 2012 at 12:55:58 UTC, Jacob Carlborg wrote:
 On 2012-05-11 14:07, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:
 Found it now, "calculateFor". It not sure if it's the most intuitive
 name though. I get the feeling: "calculate what?".
 calculateLocation was original name, but I don't like repeating return
 type in method names, I decided to change it so that it is clear that
 another renaming is needed ;) Any suggestions?
My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought.
What about the following signature: Location locate(size_t index)? Or even better: alias size_t CodeUnitIndex; Location locateFor(CodeUnitIndex position); The problem with placing it in Token is that Token should not know anything about source as a whole.
I don't really see the benefit of this. You are trading an O(1) operation for an O(log(n)) one. It can only be faster in specific cases, which should be measured.

It is likely to be slower when compiling erroneous code or when the lexer is used for something other than compiling, and likely to be insignificant when compiling correct code (I suspect the performance win is negligible compared to the time required to build a fully operational executable).

You are overcomplicating things for little to no benefit.
May 11 2012
next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote:
 Le 11/05/2012 15:01, Roman D. Boiko a écrit :
 The problem with placing it in Token is that Token should not 
 know
 anything about source as a whole.
I don't really see the benefit of this. You are trading a O(1) operation to an O(log(n)) . It can only be faster in specific cases, which should be measured.
It would be interesting to see benchmarks. But anyway, log(1M) is just 20, and I haven't seen any source code with 1M lines :).

The main point is design, not performance. I'm trying to build a minimal design that handles common use cases very well, and others well enough. I don't see why a Location would be needed for every Token.
 It is likely to be slower on compiling erroneous code or used 
 for something else than compiling, and likely to be 
 insignificant to compile correct code (I suspect the 
 performance win is negligible compared to the time required to 
 build a fully operational executable).
IMO, it is unrelated.
 You are overcomplicating things for no to little benefice.
Well, it depends... However, I will provide a design rationale document this month.
May 11 2012
prev sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote:
 Le 11/05/2012 15:01, Roman D. Boiko a écrit :
 The problem with placing it in Token is that Token should not 
 know
 anything about source as a whole.
I don't really see the benefit of this. You are trading a O(1) operation to an O(log(n)) . It can only be faster in specific cases, which should be measured.
Technically, I'm trading the N*O(1) operations needed to track line and column while consuming each character for M*O(log(n)) operations when calculating them on demand, where N is the number of characters, n is the number of lines, and M is the number of actual usages of Location. My assumption is that M << N (M is much smaller than N).
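As a purely illustrative calculation (all numbers are assumptions, not measurements), take a module of about 1,000,000 characters spread over 20,000 lines, with 50 Location lookups actually performed:

  eager tracking: N * O(1)      ~ 1,000,000 constant-time updates
  lazy lookup:    M * O(log n)  ~ 50 * log2(20,000) ~ 50 * 15 = 750 search steps

This is the sense in which M << N makes the on-demand approach cheap.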
May 11 2012
parent reply deadalnix <deadalnix gmail.com> writes:
Le 11/05/2012 16:02, Roman D. Boiko a écrit :
 On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote:
 Le 11/05/2012 15:01, Roman D. Boiko a écrit :
 The problem with placing it in Token is that Token should not know
 anything about source as a whole.
I don't really see the benefit of this. You are trading a O(1) operation to an O(log(n)) . It can only be faster in specific cases, which should be measured.
Technically, I'm trading N*0(1) operations needed to track line and column while consuming each character to M*0(log(n)) operations when calculating them on demand. N = number of characters, n is number of lines and M is number of actual usages of Location. My assumption is that M << N (M is much smaller than N).
N can easily be number of tokens.
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 15:05:19 UTC, deadalnix wrote:
 Le 11/05/2012 16:02, Roman D. Boiko a écrit :
 Technically, I'm trading N*0(1) operations needed to track 
 line and
 column while consuming each character to M*0(log(n)) 
 operations when
 calculating them on demand. N = number of characters, n is 
 number of
 lines and M is number of actual usages of Location. My 
 assumption is
 that M << N (M is much smaller than N).
N can easily be number of tokens.
Yes, if you are looking for the token by its index, not for a location. E.g., for autocompletion it is needed to find the last token before the cursor location. But that is not related to location search. Also please note that I oversimplified the formula for complexity; it also has other components. My reply was just an additional minor comment. The motivation is design, not performance.

One additional reason for such separation: with my approach it is possible to calculate a Location either taking into account information from special token sequences, or ignoring it. How would you do that with eager calculation? Calculate only one category? Or track and store both? Unnecessary complexity will eventually find a way to shoot you in the leg :)

Real-world usage (at least, anticipated scenarios) should be the basis for designing. Sometimes this contradicts intuition and requires you to look at the problem from a different side.
May 11 2012
parent reply Ary Manzana <ary esperanto.org.ar> writes:
On 5/11/12 10:14 PM, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 15:05:19 UTC, deadalnix wrote:
 Le 11/05/2012 16:02, Roman D. Boiko a écrit :
 Technically, I'm trading N*0(1) operations needed to track line and
 column while consuming each character to M*0(log(n)) operations when
 calculating them on demand. N = number of characters, n is number of
 lines and M is number of actual usages of Location. My assumption is
 that M << N (M is much smaller than N).
N can easily be number of tokens.
Yes, if you are looking for the token by its index, not for location. E.g., for autocompletion it is needed to find the last token before cursor location. But that is not related to location search. Also please note that I oversimplified formula for complexity. It also has other components. My reply was just an additional minor comment. Motivation is design, not performance. One additional reason for such separation: with my approach it is possible to calculate Location both taking into account infromation from special token sequences, or ignoring it. How would you do that for eager calculation? Calculate only one category? Or track and store both? Unnecessary complexity will eventually find a way to shoot your leg :) Real-world usage (at least, anticipated scenarios) should be the basis for designing. Sometimes this contradicts intuition and requires you to look at the problem from a different side.
As deadalnix says, I think you are over-complicating things.

I mean, to store the column and line information it's just:

if (isNewLine(c)) {
  line++;
  column = 0;
} else {
  column++;
}

(I think you need to add that to the SourceRange class. Then copy line and column to token on the Lexer#lex() method)

Do you really think it's that costly in terms of performance?

I think you are wasting much more memory and performance by storing all the tokens in the lexer.

Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token.

So you see, for the simplest use case of a lexer the performance of DCT is awful.

Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.
May 11 2012
next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 As deadalnix says, I think you are over-complicating things.

 I mean, to store the column and line information it's just:

 if (isNewLine(c)) {
   line++;
   column = 0;
 } else {
   column++;
 }

 (I think you need to add that to the SourceRange class. Then 
 copy line and column to token on the Lexer#lex() method)

 Do you really think it's that costly in terms of performance?

 I think you are wasting much more memory and performance by 
 storing all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just 
 highlight keywords. How can I tell DCT to *not* store all 
 tokens because I need each one in turn? And since I'll be 
 highlighting in the editor I will need column and line 
 information. That means I'll have to do that O(log(n)) 
 operation for every token.

 So you see, for the simplest use case of a lexer the 
 performance of DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens 
 one by one, probably peeking in some cases. If I want to store 
 line and column information I just copy them to the AST. You 
 say the tokens are discarded but their data is not, and that's 
 why their data is usually copied.
Summary of your points (I deliberately emphasize some of them more than you did; please correct me if I misinterpreted anything):

* storing location for each token is simple and cheap
* SourceRange is needed; an instance per token; it should be a class, not a struct
* syntax highlighting doesn't need to keep all tokens in memory
* but it must know column and line for each token
* for this use case DCT has awful performance
* to build an AST, lines and columns are required
* information from tokens must be copied and possibly transformed in order to put it into the AST; after such transformation the tokens are not needed any more

Those are all valid points given that we don't have any benchmarks and usage examples. The good news is that use cases like highlighting, parsing, autocompletion, etc. are the core functionality for which DCT should be designed. So if it fails any of them, the design needs to be revisited. But I will need some time to thoroughly address each of these issues (either prove that it is not relevant, or fix the problem). I will definitely complete this work during May. I will regularly report my progress here.

In general, I tend to disagree with most of them, because I already thought about each and designed accordingly, but it is important to give them enough attention. What I fail miserably at this moment is providing actual usage code and benchmarks, so I'm going to focus on that.

Thanks for your feedback!
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 05:13:36 UTC, Roman D. Boiko wrote:
 On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 As deadalnix says, I think you are over-complicating things.
...
 ... I will need some time to thoroughly address each of this 
 issues (either prove that it is not relevant, or fix the 
 problem). I will definitely complete this work during May. I 
 will regularly report my progress here.

 In general, I tend to disagree with most of them, because I 
 already thought about each and designed accordingly, but it is 
 important to give them enough attention.
https://github.com/roman-d-boiko/dct/issues/15
May 11 2012
prev sibling next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 As deadalnix says, I think you are over-complicating things.

 I mean, to store the column and line information it's just:

 if (isNewLine(c)) {
   line++;
   column = 0;
 } else {
   column++;
 }

 (I think you need to add that to the SourceRange class. Then 
 copy line and column to token on the Lexer#lex() method)

 Do you really think it's that costly in terms of performance?

 I think you are wasting much more memory and performance by 
 storing all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just 
 highlight keywords. How can I tell DCT to *not* store all 
 tokens because I need each one in turn? And since I'll be 
 highlighting in the editor I will need column and line 
 information. That means I'll have to do that O(log(n)) 
 operation for every token.

 So you see, for the simplest use case of a lexer the 
 performance of DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens 
 one by one, probably peeking in some cases. If I want to store 
 line and column information I just copy them to the AST. You 
 say the tokens are discarded but their data is not, and that's 
 why their data is usually copied.
Would it be possible for you to fork my code and tweak it for comparison? You will definitely discover more problems this way, and such feedback would really help me. That doesn't seem to be too much work to do. Any other volunteers for this?
May 11 2012
parent Ary Manzana <ary esperanto.org.ar> writes:
On 5/12/12 12:17 PM, Roman D. Boiko wrote:
 On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 As deadalnix says, I think you are over-complicating things.

 I mean, to store the column and line information it's just:

 if (isNewLine(c)) {
 line++;
 column = 0;
 } else {
 column++;
 }

 (I think you need to add that to the SourceRange class. Then copy line
 and column to token on the Lexer#lex() method)

 Do you really think it's that costly in terms of performance?

 I think you are wasting much more memory and performance by storing
 all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just
 highlight keywords. How can I tell DCT to *not* store all tokens
 because I need each one in turn? And since I'll be highlighting in the
 editor I will need column and line information. That means I'll have
 to do that O(log(n)) operation for every token.

 So you see, for the simplest use case of a lexer the performance of
 DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens one by
 one, probably peeking in some cases. If I want to store line and
 column information I just copy them to the AST. You say the tokens are
 discarded but their data is not, and that's why their data is usually
 copied.
Would it be possible for you to fork my code and tweak it for comparison? You will definitely discover more problems this way, and such feedback would really help me. That doesn't seem to be inadequately much work to do. Any other volunteers for this?
Sure, I'll do it and provide some benchmarks. Thanks for creating the issue.
May 12 2012
prev sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 I think you are wasting much more memory and performance by 
 storing all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just 
 highlight keywords. How can I tell DCT to *not* store all 
 tokens because I need each one in turn? And since I'll be 
 highlighting in the editor I will need column and line 
 information. That means I'll have to do that O(log(n)) 
 operation for every token.

 So you see, for the simplest use case of a lexer the 
 performance of DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens 
 one by one, probably peeking in some cases. If I want to store 
 line and column information I just copy them to the AST. You 
 say the tokens are discarded but their data is not, and that's 
 why their data is usually copied.
Currently I am thinking about making the token a class instead of a struct. A token (from https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:

// Represents a lexed token
struct Token
{
    size_t startIndex; // position of the first code unit in the source string
    string spelling;   // characters from which this token has been lexed
    TokenKind kind;    // enum; each keyword and operator has a dedicated kind
    ubyte annotations; // meta information like whether a token is valid, or an integer literal is signed, long, hexadecimal, etc.
}

Making it a class would give several benefits:

* allow not to worry about allocating a big array of tokens. E.g., on a 64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require a 4 times smaller chunk of contiguous memory if it were an array of class references, because each would consume only 8 bytes instead of 32.

* allow subclassing, for example, for storing strongly typed literal values; this flexibility could also facilitate future extensibility (but it's difficult to predict which kind of extension may be needed)

* there would be no need to copy data from tokens into the AST, passing an object reference would be enough (again, copying 8 instead of 32 bytes); the same applies to passing into methods - no need to pass by ref to minimise overhead

It would incur some additional memory overhead (at least 8 bytes per token), but that's hardly significant. Also, there is an additional price for accessing token members because of indirection, and, possibly, worse cache friendliness (token instances may be allocated anywhere in memory, not close to each other).

These considerations are mostly about performance. I think there is also some impact on design, but I couldn't find anything significant (given that currently I see a token as merely a data structure without associated behavior).

Could anybody suggest other pros and cons? Which option would you choose?
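For comparison, a minimal sketch of the class-based alternative being weighed here (illustrative only, not DCT code; IntegerLiteralToken and its value field are assumptions):

enum TokenKind { Float, IntegerLiteral /* ..., one per keyword, operator, etc. */ }

// Class version: an array of these holds 8-byte references instead of 32-byte structs.
class Token
{
    size_t startIndex; // position of the first code unit in the source string
    string spelling;   // characters from which this token has been lexed
    TokenKind kind;    // enum; each keyword and operator has a dedicated kind
    ubyte annotations; // validity, literal flags, etc.
}

// Subclassing could attach a strongly typed literal value where needed.
class IntegerLiteralToken : Token
{
    ulong value; // the converted numeric value
}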
May 14 2012
next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Monday, 14 May 2012 at 15:00:37 UTC, Roman D. Boiko wrote:
 Could anybody suggest other pros and cons? Which option would 
 you choose?
Further discussion on this topic (struct vs class) is at http://forum.dlang.org/thread/asdrqlaydzcdpqwsbtiv forum.dlang.org
May 14 2012
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 14/05/2012 17:00, Roman D. Boiko a écrit :
 On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 I think you are wasting much more memory and performance by storing
 all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just
 highlight keywords. How can I tell DCT to *not* store all tokens
 because I need each one in turn? And since I'll be highlighting in the
 editor I will need column and line information. That means I'll have
 to do that O(log(n)) operation for every token.

 So you see, for the simplest use case of a lexer the performance of
 DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens one by
 one, probably peeking in some cases. If I want to store line and
 column information I just copy them to the AST. You say the tokens are
 discarded but their data is not, and that's why their data is usually
 copied.
Currently I think about making token a class instead of struct. A token (from https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is: // Represents lexed token struct Token { size_t startIndex; // position of the first code unit in the source string string spelling; // characters from which this token has been lexed TokenKind kind; // enum; each keyword and operator, have a dedicated kind ubyte annotations; // meta information like whether a token is valid, or an integer literal is signed, long, hexadecimal, etc. } Making it a class would give several benefits: * allow not to worry about allocating a big array of tokens. E.g., on 64-bit OS the largest module in Phobos (IIRC, the std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require 4 times smaller chunk of contiguous memory if it was an array of class objects, because each would consume only 8 bytes instead of 32.
Why is this a benefit?
 * allow subclassing, for example, for storing strongly typed literal
 values; this flexibility could also facilitate future extensibility (but
 it's difficult to predict which kind of extension may be needed)
I'm pretty sure that D's tokens will not change that much. If the need isn't identified right now, I'd advocate for YAGNI.
 * there would be no need to copy data from tokens into AST, passing an
 object would be enough (again, copy 8 instead of 32 bytes); the same
 applies to passing into methods - no need to pass by ref to minimise
 overhead
Yes, but now you add pressure on the GC and add indirections. I'm not sure it's worth it. It seems to me like a premature optimization.
 It would incur some additional memory overhead (at least 8 bytes per
 token), but that's hardly significant. Also there is additional price
 for accessing token members because of indirection, and, possibly, worse
 cache friendliness (token instances may be allocated anywhere in memory,
 not close to each other).

 These considerations are mostly about performance. I think there is also
 some impact on design, but couldn't find anything significant (given
 that currently I see a token as merely a datastructure without
 associated behavior).

 Could anybody suggest other pros and cons? Which option would you choose?
You are over-engineering the whole thing.
May 14 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Monday, 14 May 2012 at 16:30:21 UTC, deadalnix wrote:
 Le 14/05/2012 17:00, Roman D. Boiko a écrit :
 Making it a class would give several benefits:
 * allow not to worry about allocating a big array of tokens. 
 E.g., on
 64-bit OS the largest module in Phobos (IIRC, the 
 std.datetime) consumes
 13.5MB in an array of almost 500K tokens. It would require 4 
 times
 smaller chunk of contiguous memory if it was an array of class 
 objects,
 because each would consume only 8 bytes instead of 32.
Why is this a benefice ?
"NNTP error: 400 load at 23.60, try later" prevented me from answering :)

Because it might be difficult to find a big chunk of available memory (3.5MB vs 14MB for this particular case).
 * allow subclassing, for example, for storing strongly typed 
 literal
 values; this flexibility could also facilitate future 
 extensibility (but
 it's difficult to predict which kind of extension may be 
 needed)
I'm pretty sure that D's token will not change that much. If the need isn't identified right know, I'd advocate for YAGNI.
Agree.
 * there would be no need to copy data from tokens into AST, 
 passing an
 object would be enough (again, copy 8 instead of 32 bytes); 
 the same
 applies to passing into methods - no need to pass by ref to 
 minimise
 overhead
Yes but now you add pressure on the GC and add indirections. I'm not sure it worth it. It seems to me like a premature optimization.
It looks so. Thanks.
 It would incur some additional memory overhead (at least 8 
 bytes per
 token), but that's hardly significant. Also there is 
 additional price
 for accessing token members because of indirection, and, 
 possibly, worse
 cache friendliness (token instances may be allocated anywhere 
 in memory,
 not close to each other).

 These considerations are mostly about performance. I think 
 there is also
 some impact on design, but couldn't find anything significant 
 (given
 that currently I see a token as merely a datastructure without
 associated behavior).

 Could anybody suggest other pros and cons? Which option would 
 you choose?
You are over engineering the whole stuff.
I'm trying to balance this and other tradeoffs. I'd like to simplify, but still satisfy my design goals.
May 14 2012
parent reply "Tove" <tove fransson.se> writes:
On Monday, 14 May 2012 at 16:58:42 UTC, Roman D. Boiko wrote:
 You are over engineering the whole stuff.
I'm trying to solve this and other tradeoffs. I'd like to simplify but satisfy my design goals.
What if there were two different lexer modes... with different structs.

1. For an IDE with on-the-fly lexing:
   Assumption: the error rate is high (need to keep much info).

2. For the compiler:
   Assumption: the error rate is non-existent, and if there is an error it really doesn't matter if it's slow.

So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;)

try
{
  lex(mode.compiler);
}
catch
{
  lex(mode.ide); // calculates column etc., whatever info it needs.
}
May 14 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:
 On Monday, 14 May 2012 at 16:58:42 UTC, Roman D. Boiko wrote:
 You are over engineering the whole stuff.
I'm trying to solve this and other tradeoffs. I'd like to simplify but satisfy my design goals.
What if there were two different lex:er modes... with different struct:s. 1. For an IDE with on the fly lexing: Assumption, the error rate is high.(need to keep much info) 2. For the compiler Assumption, the error rate is non existent, and if there is an error it really doesn't matter if it's slow. So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;) try { lex(mode.compiler); } catch { lex(mode.ide); // calculates column etc. what ever info it needs. }
So far it doesn't seem expensive to tolerate errors and proceed. The only thing I miss is some sort of specification of when to stop including characters in a token's spelling and start a new token. I don't think I'll use backtracking for that in the near future. If I did, I would really separate that part of the lexer and provide two implementations for it. Given this, accepting errors and moving on simply requires some finite set of rules about the boundaries of invalid tokens. I also think structural code editing concepts will help here, but I haven't done any research on this topic yet.

The problem with multiple lexer implementations is that it might become much more difficult to maintain them.
May 14 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Monday, 14 May 2012 at 19:13:39 UTC, Roman D. Boiko wrote:
 On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:
 What if there were two different lex:er modes... with 
 different struct:s.

 1. For an IDE with on the fly lexing:
  Assumption, the error rate is high.(need to keep much info)

 2. For the compiler
 Assumption, the error rate is non existent, and if there is an 
 error it really doesn't matter if it's slow.

 So... when choosing the "compiler mode"... and there actually 
 is an error, then just lex it again, to produce a pretty error 
 message ;)
...
 The problem with multiple lexer implementations is that it might
 become much more difficult to maintain them.
Just to clarify: different modes in the lexer are, in my view, like two different implementations combined in a non-trivial way (unless the difference is minor). So the complexity comes from two factors: the different implementations and how to combine them. I try to avoid this.
May 14 2012
parent reply deadalnix <deadalnix gmail.com> writes:
Le 14/05/2012 21:21, Roman D. Boiko a écrit :
 On Monday, 14 May 2012 at 19:13:39 UTC, Roman D. Boiko wrote:
 On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:
 What if there were two different lex:er modes... with different
 struct:s.

 1. For an IDE with on the fly lexing:
 Assumption, the error rate is high.(need to keep much info)

 2. For the compiler
 Assumption, the error rate is non existent, and if there is an error
 it really doesn't matter if it's slow.

 So... when choosing the "compiler mode"... and there actually is an
 error, then just lex it again, to produce a pretty error message ;)
...
 The problem with multiple lexer implementations is that it might
 become much more difficult to maintain them.
Just to clarify: different modes in lexer in my view are like two different implementations combined in a non-trivial way (unless the difference is minor). So complexity goes from two factors: different implementations and how to combine them. I try to avoid this.
If both have the same interface, metaprogramming can do the magic for you.
May 15 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Tuesday, 15 May 2012 at 09:33:31 UTC, deadalnix wrote:
 Le 14/05/2012 21:21, Roman D. Boiko a écrit :
 Just to clarify: different modes in lexer in my view are like 
 two
 different implementations combined in a non-trivial way (unless
 the difference is minor). So complexity goes from two factors:
 different implementations and how to combine them. I try to 
 avoid this.
If both have the same interface, metaprograming can do the magic for you.
Well, I didn't state that there is no way to solve the issue. Also I added "unless the difference is minor".
May 15 2012
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 05/14/2012 05:00 PM, Roman D. Boiko wrote:
 On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 I think you are wasting much more memory and performance by storing
 all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just
 highlight keywords. How can I tell DCT to *not* store all tokens
 because I need each one in turn? And since I'll be highlighting in the
 editor I will need column and line information. That means I'll have
 to do that O(log(n)) operation for every token.

 So you see, for the simplest use case of a lexer the performance of
 DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens one by
 one, probably peeking in some cases. If I want to store line and
 column information I just copy them to the AST. You say the tokens are
 discarded but their data is not, and that's why their data is usually
 copied.
Currently I think about making token a class instead of struct. ... Could anybody suggest other pros and cons? Which option would you choose?
Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
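As an illustration of that suggestion (a sketch only, with hypothetical names, not code from this thread), a fixed-size circular buffer that keeps just the last N value-type tokens could look like:

struct TokenBuffer(Token, size_t N = 16)
{
    private Token[N] buf; // storage for the most recent N tokens
    private size_t count; // total number of tokens pushed so far

    // Store the next token, overwriting the oldest one once the buffer is full.
    void push(Token t)
    {
        buf[count % N] = t;
        ++count;
    }

    // Look back at one of the last N tokens (back == 0 is the most recent).
    ref Token peek(size_t back = 0)
    {
        assert(back < N && back < count);
        return buf[(count - 1 - back) % N];
    }
}

With such a buffer the parser only ever holds a bounded window of recent tokens instead of the whole file.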
May 15 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Tuesday, 15 May 2012 at 19:27:26 UTC, Timon Gehr wrote:
 On 05/14/2012 05:00 PM, Roman D. Boiko wrote:
 Currently I think about making token a class instead of struct.
 ...

 Could anybody suggest other pros and cons? Which option would 
 you choose?
Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
I'd like to be able to do efficient token lookup based on a token's start index in the file (e.g., for autocompletion and for calculating column/line numbers on demand). I also plan to have the ability to update lexer, parser and semantic analysis results as the user or tool makes edits (incrementally, not by restarting), and to keep the history of changes as long as it is needed. To achieve this, I will use Token[N][] for storing tokens, and an immutable binary tree for efficient lookup and for preserving history.

I don't worry much about keeping all tokens in memory, because the information they contain is useful and needed for most scenarios. If, on the contrary, some information is no longer used, it will be garbage collected.

This is a significant redesign of the current implementation. It is based on feedback from this thread and completely satisfies my design goals. It is also going to be efficient. An alpha release for community review is planned for May 21, with design documentation 10 days after that.
May 15 2012
parent deadalnix <deadalnix gmail.com> writes:
Le 15/05/2012 21:59, Roman D. Boiko a écrit :
 On Tuesday, 15 May 2012 at 19:27:26 UTC, Timon Gehr wrote:
 On 05/14/2012 05:00 PM, Roman D. Boiko wrote:
 Currently I think about making token a class instead of struct.
 ...

 Could anybody suggest other pros and cons? Which option would you
 choose?
Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
I'd like to be able to do efficient token lookup based on its start index in the file (e.g., for autocompletion and calculating column/line numbers on demand). I also plan to have ability to update lexer, parser and semantic analysis results as the user or tool makes edits (incrementally, not by restarting), and to keep history of changes as long as it is needed. To achieve this, I will use Token[N][] for storing tokens, and an immutable binary tree for efficient lookup and preserving history. I don't worry much about keeping all tokens in memory, because information they contain is useful and needed for most scenarios. If, on the contrary, information is no longer used, it will be garbage collected. This is a significant redesign of current implementation. It is based on feedback from this thread and completely satisfies my design goals. It is also going to be efficient. Alpha release for community review is planned on May 21, and design documentation in 10 days after that.
The only information that is useful over time is the position. This is the one that you want to calculate lazily. In other words, what you want to keep in memory is of no interest. A circular buffer, as Timon proposes, is a better idea.
May 16 2012
prev sibling parent reply Ary Manzana <ary esperanto.org.ar> writes:
On 5/11/12 4:22 PM, Roman D. Boiko wrote:
 What about line and column information?
Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
But then how do you efficiently compute line numbers (if a reverse walk is efficient at all)?

Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek at the next token, or have a buffer of tokens, you can use a freelist (http://dlang.org/memory.html#freelists, one of the many nice things I learned by looking at DMD's source code).

So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.
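For reference, a minimal sketch of the kind of reuse being described (illustrative only; the acquire/release names and fields are assumptions, not the dlang.org or DCT API):

class Token
{
    size_t startIndex;
    string spelling;
    uint line, column;

    private Token next;          // link used only while the token sits in the freelist
    private static Token freelist;

    // Reuse a discarded Token if one is available, otherwise allocate a new one.
    static Token acquire()
    {
        if (freelist is null)
            return new Token;
        auto t = freelist;
        freelist = t.next;
        t.next = null;
        return t;
    }

    // Put a no-longer-needed Token back for later reuse instead of leaving it to the GC.
    static void release(Token t)
    {
        t.next = freelist;
        freelist = t;
    }
}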
May 11 2012
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 11.05.2012 13:50, schrieb Ary Manzana:
 On 5/11/12 4:22 PM, Roman D. Boiko wrote:
  What about line and column information?
Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
But then how do you do to efficiently (if reverse walk is any efficient) compute line numbers? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks tokens, process them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek the next token, or have a buffer of token, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.
it would be better to add something like a ColumnLine-collector thing - one that, if applied, is able to hold this information, so there is no waste if it's not needed

i think there are several parts that can work like that
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 12:12:01 UTC, dennis luehring wrote:
 it would be better to add something like an 
 ColumnLine-collector-thing - that if applied is able to hold 
 this information, so no waste if its not needed

 i think there are several parts that can work like that
This looks like what I called a hook for plugging in the respective functionality in my previous message. But I'm not sure it is needed at all (in addition to what I have already designed with the calculateFor() method).
May 11 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 11:50:29 UTC, Ary Manzana wrote:
 On 5/11/12 4:22 PM, Roman D. Boiko wrote:
 What about line and column information?
Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
But then how do you do to efficiently (if reverse walk is any efficient) compute line numbers?
I borrowed the trick from Brian Schott: tokens will be stored as a sorted array. Sorting is done based on their indices in the source code (they are naturally sorted immediately, no need to run any algorithm for that). The same is true for the information about line start indices. Now, given an index, we can do a binary search in such sorted arrays and get the corresponding line number or token (whatever we need). Walking along UTF code points starting from the index which corresponds to the beginning of the line, and taking tab characters into account, it is possible to calculate the column reasonably fast.

My assumption is that column numbers are needed for use cases like error reporting or other messages for the user, and thus there is no need to pre-calculate them for each token (or character). For such use cases the efficiency should really be enough. In general, I precalculate and store only information which is needed frequently. Other information is evaluated lazily.
 Usually tokens are used and discarded. I mean, somebody that 
 uses the lexer asks tokens, process them (for example to 
 highlight code or to build an AST) and then discards them. So 
 you can reuse the same Token instance. If you want to peek the 
 next token, or have a buffer of token, you can use a freelist ( 
 http://dlang.org/memory.html#freelists , one of the many nice 
 things I learned by looking at DMD's source code ).
But the information from tokens is not discarded (at least, this is the requirement for DCT). So my choice is to keep it in tokens instead of converting to some other form. That also implies that Token is free from any auxiliary information which is not necessary for common use cases.
 So adding line and column information is not like wasting a lot 
 of memory: just 8 bytes more for each token in the freelist.
It is inefficient to calculate it during lexing, the scanning algorithm becomes less robust and more complicated, and in most cases we won't need it anyway. I will add a hook to plug in such functionality (when needed) if I learn why it is necessary.
May 11 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 13:50, Ary Manzana wrote:
 On 5/11/12 4:22 PM, Roman D. Boiko wrote:
 What about line and column information?
Indices of the first code unit of each line are stored inside the lexer, and a function will compute the Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that a Location can be computed on demand and is not needed frequently, so the column is calculated by a reverse walk to the previous end of line, etc. Locations can be computed either taking into account special token sequences (e.g., #line 3 "ab/c.d") or ignoring them.
But then how do you efficiently compute line numbers (if a reverse walk is efficient at all)? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek at the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.
SDC uses a struct Location to store such data. Every token and AST element has a member of type Location. As it is a value type, there is no need for a free list or anything of that sort. When the token is discarded, the location is discarded too. The same goes for AST nodes.
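For illustration, a rough sketch of that layout (the declarations are mine, not SDC's actual ones): the Location lives inside the token or node by value, so it is allocated and collected together with it and never on its own:

    struct Location
    {
        string fileName;
        uint line;
        uint column;
    }

    struct Token
    {
        Location location; // embedded by value: no separate allocation
        string value;
    }

    class AstNode
    {
        Location location; // part of the node's own memory block
    }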
May 12 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 13:24:11 UTC, deadalnix wrote:
 On 11/05/2012 13:50, Ary Manzana wrote:
 Usually tokens are used and discarded. I mean, somebody that 
 uses the
 lexer asks tokens, process them (for example to highlight code 
 or to
 build an AST) and then discards them. So you can reuse the 
 same Token
 instance. If you want to peek the next token, or have a buffer 
 of token,
 you can use a freelist ( 
 http://dlang.org/memory.html#freelists , one of
 the many nice things I learned by looking at DMD's source code 
 ).

 So adding line and column information is not like wasting a 
 lot of
 memory: just 8 bytes more for each token in the freelist.
SDC uses a struct Location to store such data. Every token and AST element has a member of type Location. As it is a value type, there is no need for a free list or anything of that sort. When the token is discarded, the location is discarded too. The same goes for AST nodes.
1. If a struct is a field of a heap-allocated object, it will be allocated and garbage-collected. Only if it exists solely on the stack (i.e., in a method body) is the GC not used. 2. For a struct which is often allocated and deallocated, a free list would allow skipping its initialization, which is especially important if the struct is large. This is true for both heap- and stack-allocated data.
May 12 2012
parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
 1. If a struct is a field of heap allocated object, it will be
 allocated and garbage collected. Only if it only exists on stack
 (i.e., in method body), GC is not used.
As far as I can tell, it won't be allocated on its own, since it is stored in the garbage-collected object as a value field. So using a freelist would actually increase the overhead: you have to manage the freelist and still do the allocation of the containing object.
May 12 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:
 1. If a struct is a field of heap allocated object, it will be
 allocated and garbage collected. Only if it only exists on 
 stack
 (i.e., in method body), GC is not used.
As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
Sorry for the confusion. In the first point I was trying to clarify that using a struct doesn't mean it is stack-allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context (nor in a majority of other contexts), since I don't do many allocations of Location. As is often the case, performance optimization should be done only when benchmarks show it is needed.
May 12 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 05/12/2012 11:08 PM, Roman D. Boiko wrote:
 On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:
 1. If a struct is a field of heap allocated object, it will be
 allocated and garbage collected. Only if it only exists on stack
 (i.e., in method body), GC is not used.
As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
Sorry for confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context, since I don't do many allocations of Location, as well as in a majority of other contexts. As it is often the case, performance optimization should be performed only when benchmarks show it is needed.
Well yes, but it seems to me that you are deliberately complicating the design in order to make it less efficient?
May 12 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 22:19:55 UTC, Timon Gehr wrote:
 On 05/12/2012 11:08 PM, Roman D. Boiko wrote:
 On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath 
 wrote:
 1. If a struct is a field of heap allocated object, it will 
 be
 allocated and garbage collected. Only if it only exists on 
 stack
 (i.e., in method body), GC is not used.
As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
Sorry for confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context, since I don't do many allocations of Location, as well as in a majority of other contexts. As it is often the case, performance optimization should be performed only when benchmarks show it is needed.
Well yes, but it seems to me that you are deliberately complicating the design in order to make it less efficient?
Do you mean something like what I summarized from Ary: http://forum.dlang.org/post/pggkobonzmgdwjigyaac forum.dlang.org ? If that is the case, could you please add your concerns to the dedicated issue: https://github.com/roman-d-boiko/dct/issues/15 ? Otherwise, if by overcomplicating you meant using a free list, I'd like to clarify that I don't plan to introduce it yet because I see no reason to use it in my context. Thanks
May 12 2012
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 10:01, Roman D. Boiko wrote:
 There were several discussions about the need for a D compiler library.

 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct

 Lexer is based on Brian Schott's project
 https://github.com/Hackerpilot/Dscanner, but it has been refactored and
 extended (and more changes are on the way).

 The goal is to have source code loading, lexer, parser and semantic
 analysis available as parts of Phobos. These libraries should be
 designed to be usable in multiple scenarios (e.g., refactoring, code
 analysis, etc.).

 My commitment is to have at least front end built this year (and
 conforming to the D2 specification unless explicitly stated otherwise
 for some particular aspect).

 Please post any feed here. A dedicated project web-site will be created
 later.
I think that the end goal of a project like this, putting a D frontend in Phobos, should be that the compiler is built using this library. This would result in the compiler and library always being in sync and having the same behavior. Otherwise it's easy for this to become just another tool that tries to lex and parse D code, always being out of sync with the compiler and not having the same behavior. For this to happen, for Walter to start using this, I think there would be a greater chance if the frontend was a port of the DMD frontend and not changed too much. -- /Jacob Carlborg
May 11 2012
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 11.05.2012 11:02, Jacob Carlborg wrote:
 For this to happen, for Walter to start using this, I think there would
 be a greater chance if the frontend was a port of the DMD frontend and
 not changed too much.
or a pure D version of it with these features:

-very fast in parsing/lexing - there needs to be a benchmark environment from the very start
-easy to extend and fix

and I think that is what Walter wants from a D-ish frontend in the first place
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:21:29 UTC, dennis luehring wrote:
 On 11.05.2012 11:02, Jacob Carlborg wrote:
 For this to happen, for Walter to start using this, I think 
 there would
 be a greater change if the frontend was a port of the DMD 
 frontend and
 not changed too much.
or a pure D version of it with the features: -very fast in parsing/lexing - there need to be a benchmark enviroment from the very start
Will add that to May roadmap.
 -easy to extend and fix

 and think that is what walter wants from an D-ish frontend in 
 the first place
I plan to go this way.
May 11 2012
parent reply dennis luehring <dl.soluz gmx.net> writes:
On 11.05.2012 11:33, Roman D. Boiko wrote:
  -very fast in parsing/lexing - there need to be a benchmark
  enviroment from the very start
Will add that to May roadmap.
are you using slices to prevent copying everything around? the parser/lexer needs to be as fast as the original one - maybe even faster - else it won't replace Walter's at any time - because speed matters very much here
May 11 2012
next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 10:01:17 UTC, dennis luehring wrote:
 On 11.05.2012 11:33, Roman D. Boiko wrote:
 -very fast in parsing/lexing - there need to be a benchmark
 enviroment from the very start
Will add that to May roadmap.
are you using slices to prevent copying everything around? the parser/lexer needs to be as fast as the original one - maybe even faster - else it won't replace Walter's at any time - because speed matters very much here
I tried optimizing code when it didn't complicate design. And more optimizations will be added later. It would be interesting to have benchmarks for comparing performance, but I don't plan to branch and edit DMD front end to prepare it for benchmarking (any time soon).
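To illustrate the slicing dennis is asking about - a minimal, hypothetical sketch, not necessarily how DCT stores token values - a token can keep a slice of the original source so that no text is copied:

    enum TokenKind { identifier, number, whitespace }

    struct Token
    {
        size_t index;   // start position in the source
        string value;   // slice of the source text: shares memory, no copy
        TokenKind kind;
    }

    // Slicing stores only a (pointer, length) pair pointing into source.
    Token makeToken(string source, size_t start, size_t end, TokenKind kind)
    {
        return Token(start, source[start .. end], kind);
    }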
May 11 2012
parent reply "alex" <info alexanderbothe.com> writes:
On Friday, 11 May 2012 at 10:40:43 UTC, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 10:01:17 UTC, dennis luehring wrote:
 On 11.05.2012 11:33, Roman D. Boiko wrote:
 -very fast in parsing/lexing - there need to be a benchmark
 enviroment from the very start
Will add that to May roadmap.
are you using slices to prevent copying everything around? the parser/lexer needs to be as fast as the original one - maybe even faster - else it won't replace Walter's at any time - because speed matters very much here
I tried optimizing code when it didn't complicate design. And more optimizations will be added later. It would be interesting to have benchmarks for comparing performance, but I don't plan to branch and edit DMD front end to prepare it for benchmarking (any time soon).
Ever thought of asking the VisualD developer to integrate your library into his IDE extension? Might be cool to do so because of extended completion abilities etc. (lol I'm the Mono-D dev -- but why not? ;D)
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 11:41:34 UTC, alex wrote:
 Ever thought of asking the VisualD developer to integrate your 
 library into his IDE extension? Might be cool to do so because 
 of extended completion abilities etc. (lol I'm the Mono-D dev 
 -- but why not? ;D)
Didn't think about that yet, because I don't use VisualD. I actually planned to analyse whether DCT could be integrated into Mono-D, so your feedback is welcome :)
May 11 2012
parent reply "alex" <info alexanderbothe.com> writes:
On Friday, 11 May 2012 at 11:55:47 UTC, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:41:34 UTC, alex wrote:
 Ever thought of asking the VisualD developer to integrate your 
 library into his IDE extension? Might be cool to do so because 
 of extended completion abilities etc. (lol I'm the Mono-D dev 
 -- but why not? ;D)
Didn't think about that yet, because I don't use VisualD. I actually planned to analyse whether DCT could be integrated into Mono-D, so your feedback is welcome :)
be easier to integrate into the second one :)
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 12:13:53 UTC, alex wrote:

 should be easier to integrate into the second one :)
Sorry, I meant D-IDE. But there might exist a reason to collaborate to make it usable for that. I have nothing against VisualD, just didn't think about it yet.
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 12:20:27 UTC, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 12:13:53 UTC, alex wrote:

 should be easier to integrate into the second one :)
Sorry, I meant D-IDE. But there might exist the reason to collaborate to make it usable for that.
May 11 2012
prev sibling parent deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 12:01, dennis luehring wrote:
 On 11.05.2012 11:33, Roman D. Boiko wrote:
 -very fast in parsing/lexing - there need to be a benchmark
 enviroment from the very start
Will add that to May roadmap.
are using slices for prevent coping everything around? the parser/lexer need to be as fast as the original one - maybe even faster - else it won't replace walters at any time - because speed does matter here very much
The best optimization is the one that brings your code from a non-working state to a working state.
May 11 2012
prev sibling next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:02:12 UTC, Jacob Carlborg wrote:
 If think that the end goal of a project like this, putting a D 
 frontend in Phobos, should be that the compiler should be built 
 using this library. This would result in the compiler and 
 library always being in sync and having the same behavior. 
 Otherwise it's easy this would be just another tool that tries 
 to lex and parse D code, always being out of sync with the 
 compiler and not having the same behavior.

 For this to happen, for Walter to start using this, I think 
 there would be a greater change if the frontend was a port of 
 the DMD frontend and not changed too much.
My plan is to create a front end that would be much better than the existing one, both in design and implementation. I decided to work on this full time for several months. The front end will not produce the same data as the DMD front end does, so most likely it will not be suitable to replace the existing C++ implementation. But since no information will be lost, I will consider creating a separate project which would build output compatible with the existing one (not identical - I don't think that would be feasible).
May 11 2012
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-05-11 11:31, Roman D. Boiko wrote:

 My plan is to create frontend that would be much better than existing,
 both in design and implementation. I decided to work on this full time
 for several months.
That's good news.
 Front end will not produce the same data as DMD front end does, so most
 likely it will not be suitable to replace existing C++ implementation.
That's too bad.
 But since no information will be lost I will consider creating a
 separate project which would build output compatible with existing (not
 identical, I don't think that would be feasible).
Ok. -- /Jacob Carlborg
May 11 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 11:31, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:02:12 UTC, Jacob Carlborg wrote:
 If think that the end goal of a project like this, putting a D
 frontend in Phobos, should be that the compiler should be built using
 this library. This would result in the compiler and library always
 being in sync and having the same behavior. Otherwise it's easy this
 would be just another tool that tries to lex and parse D code, always
 being out of sync with the compiler and not having the same behavior.

 For this to happen, for Walter to start using this, I think there
 would be a greater change if the frontend was a port of the DMD
 frontend and not changed too much.
My plan is to create frontend that would be much better than existing, both in design and implementation. I decided to work on this full time for several months.
If this is your plan, I'm very happy. We really should discuss this, to avoid duplicating effort as we are doing right now.
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 11:50:14 UTC, deadalnix wrote:
 On 11/05/2012 11:31, Roman D. Boiko wrote:
 My plan is to create frontend that would be much better than 
 existing,
 both in design and implementation. I decided to work on this 
 full time
 for several months.
I this is your plan, I'm very happy. We really should discuss to avoid duplicating effort as we are doing right now.
 I have started a similar stuff, and it is currently more 
 advanced than DCT is. I kind of decapitated sdc to do it.

 Maybe we should join effort instead of doing 2 separate 
 projects ?
That makes sense. Is it possible to switch SDC to the Boost license? I'm trying to keep it for all DCT code.
May 11 2012
parent deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 14:04, Roman D. Boiko wrote:
 That makes sense. Is it possible to switch SDC to the Boost license? I'm
 trying to keep it for all DCT code.
Let me do a clean package of my code this weekend. For now it is mixed with the SDC source code, which was enough as I was working alone, but isn't quite right for people to join the effort.
May 11 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 11:02, Jacob Carlborg wrote:
 If think that the end goal of a project like this, putting a D frontend
 in Phobos, should be that the compiler should be built using this
 library. This would result in the compiler and library always being in
 sync and having the same behavior. Otherwise it's easy this would be
 just another tool that tries to lex and parse D code, always being out
 of sync with the compiler and not having the same behavior.

 For this to happen, for Walter to start using this, I think there would
 be a greater change if the frontend was a port of the DMD frontend and
 not changed too much.
From the beginning, I'm thinking about AST macros using CTFE.
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:
 From the beginning, I'm think AST macro using CTFE.
Could you please elaborate? I plan to strictly follow the published D specification. Exceptions from this rule are possible provided either of the following is true:

* new functionality has been implemented in DMD but is not included in the specification yet
* the specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from the specification
* a change is compatible with the specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce a post-processor after the lexer)

Some more exceptions might be added later, but the goal is to minimize differences.
May 11 2012
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 14:14, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:
 From the beginning, I'm think AST macro using CTFE.
Could you please elaborate? I plan to strictly follow published D specification. Exceptions from this rule are possible provided either of the following is true: * new functionality has been implemented in DMD but is not included into specification yet * specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from specification * change is compatible with specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce post-processor after lexer) Some more exceptions might be added later, but the goal is to minimize differences.
More explicitly, the goal isn't to implement a different language than D, simply to do the parsing/AST building in a way that would allow AST macros to be introduced later.

Your 3 points seem reasonable. Mine were:

* Implement something that can parse D as it is currently defined/implemented (if dmd's behavior and the spec differ, it is handled on a per-case basis).
* Discard all deprecated features. Don't even try to implement them, even if dmd currently supports them.
* Do the parsing in several steps to allow different tools to work with it.

I think we both have very compatible goals. Let me do a clean package of it and write about the design goals. I don't have that much time right now, so I will do it this weekend.
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 12:30:01 UTC, deadalnix wrote:
 Your 3 points seem reasonable. Mine were :
  * Implement something that can parse D as it is currently 
 defined/implemented (if dmd's behavior and spec differs, it is 
 handled on a per case basis).
All differences should be documented.
  * Discard all deprecated features. Not even try to implement 
 them even if dmd support them currently.
Yes, I forgot this one. Actually, I didn't discard imaginary floats, because I don't know what exactly will be done instead and it is easy to keep them.
  * Do the parsing in several steps to allow different tools to 
 work with it.
I was thinking about a pool of analysers, each of which would add some information. This could be more than is needed for semantic analysis. An analyser would be created whenever information (e.g., some indexing) is needed for a particular use case.
 I think we both have very compatibles goals. Let me do a clean 
 package of it I write about design goals. I don't have that 
 much time right now to do it, I will this week end.
I saw your ast_as_lib branch of SDC, but didn't dig deeper.
May 11 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 14:14, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:
 From the beginning, I'm think AST macro using CTFE.
Could you please elaborate? I plan to strictly follow published D specification.
That won't be easy; nobody knows what the specification is. TDPL, DMD or dlang.org?
 Exceptions from this rule are possible provided either of the following
 is true:
 * new functionality has been implemented in DMD but is not included into
 specification yet
 * specification is incorrect (has a bug) or incomplete, especially if
 DMD behavior differs from specification
 * change is compatible with specification and brings some significant
 improvement (e.g., this seems to be the case for my decision to
 introduce post-processor after lexer)

 Some more exceptions might be added later, but the goal is to minimize
 differences.
-- /Jacob Carlborg
May 11 2012
parent reply Rory McGuire <rjmcguire gmail.com> writes:
My guess is that the spec is TDPL + the TDPL errata. dlang.org should be
updated as people notice inaccuracies.
This project would be an ideal time to update dlang.org as people notice
it's not in sync with TDPL.
If TDPL doesn't cover it, then the community should review it.


On Fri, May 11, 2012 at 3:17 PM, Jacob Carlborg <doob me.com> wrote:

 On 2012-05-11 14:14, Roman D. Boiko wrote:

 On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:

 From the beginning, I'm think AST macro using CTFE.
Could you please elaborate? I plan to strictly follow published D specification.
That won't be easy, nobody know what the specification is . TDPL, DMD or dlang.org? Exceptions from this rule are possible provided either of the following
 is true:
 * new functionality has been implemented in DMD but is not included into
 specification yet
 * specification is incorrect (has a bug) or incomplete, especially if
 DMD behavior differs from specification
 * change is compatible with specification and brings some significant
 improvement (e.g., this seems to be the case for my decision to
 introduce post-processor after lexer)

 Some more exceptions might be added later, but the goal is to minimize
 differences.
-- /Jacob Carlborg
May 11 2012
next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 13:30:49 UTC, Rory McGuire wrote:
 ? my guess is that the spec is TDPL + TDPL errata. dlang.org 
 should be
 updated as people notice inaccuracies.
 This project would be an ideal time to update dlang.org as 
 people notice
 its not in sync with TDPL.
 If TDPL doesn't cover it then the community should review it.
Whenever I'm in doubt, I try to analyze both TDPL and dlang.org, and also consult the DMD sources. But in general, when I write "specification" I mean dlang.org. I agree that it should be maintained, but I understand that it is difficult to do. I'll try to summarize the problems which I discover (I have already found many), but it is quite a time-consuming activity.
May 11 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 15:30, Rory McGuire wrote:
 ? my guess is that the spec is TDPL + TDPL errata. dlang.org
 <http://dlang.org> should be updated as people notice inaccuracies.
 This project would be an ideal time to update dlang.org
 <http://dlang.org> as people notice its not in sync with TDPL.
 If TDPL doesn't cover it then the community should review it.
None of these are in sync. -- /Jacob Carlborg
May 11 2012
next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 14:45:45 UTC, Jacob Carlborg wrote:
 On 2012-05-11 15:30, Rory McGuire wrote:
 ? my guess is that the spec is TDPL + TDPL errata. dlang.org
 <http://dlang.org> should be updated as people notice 
 inaccuracies.
 This project would be an ideal time to update dlang.org
 <http://dlang.org> as people notice its not in sync with TDPL.
 If TDPL doesn't cover it then the community should review it.
None of these are in sync.
I suggest choosing a dedicated place where it is possible to discuss differences and other problems with the specification. This forum?
May 11 2012
prev sibling next sibling parent Rory McGuire <rjmcguire gmail.com> writes:
Yip, but TDPL still has to take precedence because that is the one that
Walter + Andrei + the community put the most focused effort into.


On Fri, May 11, 2012 at 4:45 PM, Jacob Carlborg <doob me.com> wrote:

 On 2012-05-11 15:30, Rory McGuire wrote:

 ? my guess is that the spec is TDPL + TDPL errata. dlang.org
 <http://dlang.org> should be updated as people notice inaccuracies.

 This project would be an ideal time to update dlang.org
 <http://dlang.org> as people notice its not in sync with TDPL.

 If TDPL doesn't cover it then the community should review it.
None of these are in sync. -- /Jacob Carlborg
May 11 2012
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, May 11, 2012 23:00:15 Rory McGuire wrote:
 yip, but TDPL still has to take precedence because that is the one that
 walter + andrei + community put the most focused effort into.
It doesn't necessarily. Each place that they differ is examined individually and decided on its own merits. There is no automatic "TDPL says it's so, so it's so." There _are_ cases where things have even been explicitly changed after TDPL was released, making TDPL out-of-date for that particular item (e.g., strong vs weak purity is not discussed in TDPL at all, because it didn't exist at the time). In general, however, TDPL _is_ correct, and when there's a difference, it's simply a case of the documentation or compiler not having been updated accordingly. We try to avoid making any changes that would conflict with TDPL. - Jonathan M Davis
May 11 2012
prev sibling next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 11.05.2012 10:01, Roman D. Boiko wrote:
 There were several discussions about the need for a D compiler
 library.

 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct

 Lexer is based on Brian Schott's project
 https://github.com/Hackerpilot/Dscanner, but it has been
 refactored and extended (and more changes are on the way).

 The goal is to have source code loading, lexer, parser and
 semantic analysis available as parts of Phobos. These libraries
 should be designed to be usable in multiple scenarios (e.g.,
 refactoring, code analysis, etc.).

 My commitment is to have at least front end built this year (and
 conforming to the D2 specification unless explicitly stated
 otherwise for some particular aspect).

 Please post any feed here. A dedicated project web-site will be
 created later.
Does the parser/lexer allow parsing of half-finished syntax, so it is usable in an IDE for syntax highlighting while coding?
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:
 does the parser/lexer allow half-finished syntax parsing? for 
 being useable in an IDE for syntax-highlighting while coding?
That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).
May 11 2012
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 11.05.2012 11:23, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:
  does the parser/lexer allow half-finished syntax parsing? for
  being useable in an IDE for syntax-highlighting while coding?
That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).
Try to syntax-highlight while coding - that's the scenario; parts of the source code aren't fully valid while writing.
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:28:36 UTC, dennis luehring wrote:
 On 11.05.2012 11:23, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:
 does the parser/lexer allow half-finished syntax parsing? for
 being useable in an IDE for syntax-highlighting while coding?
That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).
try to syntaxhiglight while coding - thats the scenario, parts of the source code isn't fully valid while writing
It depends on the IDE. For example, Sublime Text (and most likely TextMate) uses regexes for syntax highlighting. That makes it impossible to use for D in some scenarios (like nested block comments). Any IDE that provides an API for coloring will get correct information if the code is valid. If it is not valid, it is only possible to handle specific kinds of errors, but in general there will always be cases when highlighting (or some parsing / semantic analysis information) is "incorrect". I aim to handle common errors gracefully. In practice, I think it is possible to handle 99% of problems, but this requires a lot of design / specification work.
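For illustration, D's nesting /+ +/ comments are one such scenario: a regular expression cannot track the nesting depth, while a real lexer can.

    /+
        Outer comment.
        /+ Nested comment - a regex-based highlighter typically thinks the
           whole comment ends at the first +/
        ...but a lexer knows this line is still inside the outer comment.
    +/
    void main() {}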
May 11 2012
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-05-11 11:23, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:
 does the parser/lexer allow half-finished syntax parsing? for being
 useable in an IDE for syntax-highlighting while coding?
That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).
Example from TextMate:

* "void" - keyword is colored
* "module main" - nothing is colored until I type a semicolon
* "module main;" - keyword and "main" are colored (differently)
* "void foo" - keyword is colored
* "void foo (" - keyword and "foo" are colored (differently)
* "struct F" - keyword and "F" are colored (differently)
* Literals are always colored.
* User-defined constants are always colored. It's basically any token that is all upper case.

-- /Jacob Carlborg
May 11 2012
prev sibling next sibling parent deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 10:01, Roman D. Boiko wrote:
 There were several discussions about the need for a D compiler library.

 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct

 Lexer is based on Brian Schott's project
 https://github.com/Hackerpilot/Dscanner, but it has been refactored and
 extended (and more changes are on the way).

 The goal is to have source code loading, lexer, parser and semantic
 analysis available as parts of Phobos. These libraries should be
 designed to be usable in multiple scenarios (e.g., refactoring, code
 analysis, etc.).

 My commitment is to have at least front end built this year (and
 conforming to the D2 specification unless explicitly stated otherwise
 for some particular aspect).

 Please post any feed here. A dedicated project web-site will be created
 later.
I have started similar stuff, and it is currently more advanced than DCT is. I kind of decapitated SDC to do it. Maybe we should join efforts instead of doing 2 separate projects?
May 11 2012
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, May 11, 2012 10:01:28 Roman D. Boiko wrote:
 There were several discussions about the need for a D compiler
 library.
 
 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct
 
 Lexer is based on Brian Schott's project
 https://github.com/Hackerpilot/Dscanner, but it has been
 refactored and extended (and more changes are on the way).
 
 The goal is to have source code loading, lexer, parser and
 semantic analysis available as parts of Phobos. These libraries
 should be designed to be usable in multiple scenarios (e.g.,
 refactoring, code analysis, etc.).
 
 My commitment is to have at least front end built this year (and
 conforming to the D2 specification unless explicitly stated
 otherwise for some particular aspect).
 
 Please post any feed here. A dedicated project web-site will be
 created later.
It's great that you're working on this. We definitely need more of this sort of stuff. However, I would point out that I'm not sure that it will be acceptable for Phobos (it _may_ be, but it's not quite what we've been looking for). There are two types of lexers/parsers that we've been looking to add to Phobos.

1. A direct port of the lexer from dmd's C++ front-end. The API would be D-ified to use ranges, but the internals would be almost identical to the C++ front-end so that porting changes back and forth would be very easy. It also would make it easier to replace the C++ front-end with a D one later if the D one is very close to the C++ one.

2. A lexer/parser generator using templates and/or CTFE to generate a lexer/parser of whatever language using a BNF or EBNF grammar - probably similar to what Philippe Sigaud has done to generate a parser using PEGs. Such could be used to generate a lexer/parser for D or any other language.

What you seem to be doing is creating a D-specific parser from scratch, which is great, but it's not quite what we've been looking to add to Phobos. That doesn't necessarily mean that we can't or won't add it to Phobos once it's complete (especially if nothing else gets done first), but it also may not be; that's to be decided once it's done and ready to be reviewed for inclusion in Phobos. I'm not trying to discourage you at all. I'm just pointing out that what you're doing is not quite what we're looking for. Regardless, it _would_ be interesting to be able to compare such a lexer/parser with what you're doing and see which performs better.

- Jonathan M Davis
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 05:41:16 UTC, Jonathan M Davis wrote:
 It's great that you're working on this. We definitely need more 
 of this sort of
 stuff. However, I would point out that I'm not sure that it 
 will be acceptable
 for Phobos (it _may_ be, but it's not quite what we've been 
 looking for).
 There are two types of lexers/parsers that we've been looking 
 to add to
 Phobos.

 1. A direct port of the lexer from dmd's C++ front-end. The API 
 would be D-
 ified to use ranges, but the internals would be almost 
 identical to the C++
 front-end so that porting changes back and forth would be very 
 easy. It also
 would make it easier to replace the C++ front-end with a D one 
 later if the D
 one is very close to the C++ one.

 2. A lexer/parser generator using templates and/or CTFE to 
 generate a
 lexer/parser of whatever language using a BNF or EBNF grammar - 
 probably
 similar to what Philippe Sigaud has done to generate a parser 
 using PEGs. Such
 could be used to generate a lexer/parser for D or any other 
 language.

 What you seem to be doing is creating a D-specific parser from 
 scratch, which
 is great, but it's not quite what we've been looking to add to 
 Phobos. That
 doesn't necessarily mean that we can't or won't add it to 
 Phobos once it's
 complete (especially if nothing else gets done first), but it also may not 
 be; that's to be decided once it's done and ready to be reviewed for 
 inclusion in Phobos. I'm not trying to discourage you at all. I'm just 
 pointing out that what you're doing is not quite what we're looking for. 
 Regardless, it _would_ be interesting to be able to compare such a 
 lexer/parser with what you're doing and see which performs better.

 - Jonathan M Davis
The need I'm trying to address is making development of tools for D much easier. Tools are needed for D to become more widely used in practice. A secondary goal (a nice-to-have) would be to build a compiler on top of that. I plan to do both, but I don't intend to be backward-compatible with the DMD backend. I hope that at least part of my code will either influence or be reused in the reference D compiler (whatever that will be in the future). It is important to have a reference front-end implementation which can be used in many tools; otherwise a lot of effort would be completely wasted and no tool would comply with the D specification and compilers.
May 11 2012
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Fri, 11 May 2012 10:01:28 +0200,
"Roman D. Boiko" <rb d-coding.com> wrote:

 There were several discussions about the need for a D compiler 
 library.
 
 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct
 
 Lexer is based on Brian Schott's project 
 https://github.com/Hackerpilot/Dscanner, but it has been 
 refactored and extended (and more changes are on the way).
 
 The goal is to have source code loading, lexer, parser and 
 semantic analysis available as parts of Phobos. These libraries 
 should be designed to be usable in multiple scenarios (e.g., 
 refactoring, code analysis, etc.).
 
 My commitment is to have at least front end built this year (and 
 conforming to the D2 specification unless explicitly stated 
 otherwise for some particular aspect).
 
 Please post any feed here. A dedicated project web-site will be 
 created later.
A general purpose D front-end library has been discussed several times. So at least for some day-dreaming about better IDE support and code analysis tools it has a huge value. It is good to see that someone finally takes the time to implement one. My only concern is that not enough planning went into the design. I think about things like brainstorming and collecting possible use cases from the community, or looking at how some C++ tools do their job and what infrastructure they are built on. Although it is really difficult to tell from other people's code which decision was 'important' and what was just the author's way to do it.

Inclusion into Phobos I would not see as a priority. As Jonathan said, there are already some clear visions of what such modules would look like. Also, if any project seeks to replace the DMD front-end, Walter should be the technical advisor. Like anyone putting a lot of time and effort into a design, he could have strong feelings about certain decisions and implementing them in a seemingly worse way.

That said, you give the impression of being really dedicated to the project, even giving yourself a timeline, which is a good sign. I wish you all the best for the project and hope that - even without any 'official' recognition - it will see a lot of use as the base of D tools.

To learn about parsing I wrote a syntax highlighter for the DCPU-16 assembly of Minecraft author Markus Persson. (Which was a good exercise.) Interestingly, I ended up with a similar design for the Token struct to yours:

- TokenType/TokenKind enum
- Trie for matching token kinds (why do you expand it into nested switch-case through CTFE mixins though?)

Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple, and I am sure it won't get any faster ;).

One notable difference is that I only check for isEoF in one location, since I append "\0" to the source code as a stop token (which can also serve as an end-of-line indicator). (The general idea is to move checks outside of loops.)

** Errors **

I generally keep the start and end column, in case someone wants that. A real world example:

  ubyte x = ...;
  if (x >= 0 && x <= 8) ...
  Line 2, warning: Comparison is always true.

At first glance you say "No, a byte can become larger than 8.", but the compiler is just complaining about the first part of the expression here, which would be clarified by e.g. some ASCII/Unicode art:

  Line 2, warning: Comparison is always true:
  if (x >= 0 && x <= 8) ...
      ^----^

** Code highlighting **

Especially Gtk's TextView (which GtkSourceView is based on) isn't made for several thousand tokens. The text view will take a second per 20000 tokens or so. The source view works around that by highlighting in chunks, but you can still feel the delay when you jump to the end of the file in gedit, and even MonoDevelop suffers from the bad performance. If I understood your posts correctly, you are already planning for minimal change sets. It is *very* important to only update changed parts in a syntax highlighting editor. So for now I ended up writing an 'insertText' and a 'deleteText' method for the parser that take 'start index', 'text' and 'start index', 'end index' respectively and return a list of removed and added tokens.

If possible, make sure that your library can be used with GtkSourceView, Scintilla and QSyntaxHighlighter. Unfortunately the bindings for the last two could use an update, but the API help should get you an idea of how to use it most efficiently.

-- Marco
May 19 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 19 May 2012 at 20:35:10 UTC, Marco Leise wrote:
 On Fri, 11 May 2012 10:01:28 +0200,
 "Roman D. Boiko" <rb d-coding.com> wrote:

 There were several discussions about the need for a D compiler 
 library.
 
 I propose my draft implementation of lexer for community 
 review:
 https://github.com/roman-d-boiko/dct
(I decided to comment on both my post and your reply.) I've got *a lot* of feedback from the community, which caused a significant redesign (but also a delay of as yet unknown duration). I'll write more on that later. Thanks a lot to everybody!
 Lexer is based on Brian Schott's project 
 https://github.com/Hackerpilot/Dscanner, but it has been 
 refactored and extended (and more changes are on the way).
Looks like I'm going to replace this implementation almost completely.
 The goal is to have source code loading, lexer, parser and 
 semantic analysis available as parts of Phobos. These 
 libraries should be designed to be usable in multiple 
 scenarios (e.g., refactoring, code analysis, etc.).
See below.
 My commitment is to have at least front end built this year 
 (and conforming to the D2 specification unless explicitly 
 stated otherwise for some particular aspect).
By the D specification I mean http://dlang.org/language-reference.html, or (when the specification is not clear to me) one of the following:

* TDPL,
* the current DMD implementation,
* community feedback.

(Note that I may assume that I understand some aspect, but later revisit it if needed.)
 Please post any feed here. A dedicated project web-site will 
 be created later.
It is going to be http://d-coding.com, but it is not usable yet (I don't have much web-design practice). It is hosted at https://github.com/roman-d-boiko/roman-d-boiko.github.com; pull requests are welcome.
 A general purpose D front-end library has been discussed 
 several times. So at least for some day-dreaming about better 
 IDE support and code analysis tools it has a huge value. It is 
 good to see that someone finally takes the time to implement 
 one. My only concern is that not enough planing went into the 
 design.
Could you name a few specific concerns? The reason for this thread was to reflect on the design early. OTOH, I didn't spend time documenting my design goals and trade-offs, so the discussion turned into a brainstorm and wasn't always productive. But now that I see how much value I can get even without doing my homework, I'm much more likely to provide better documentation and ask more specific questions.
I think about things like brainstorming and collecting
 possible use cases from the community or looking at how some 
 C++ tools do their job and what infrastructure they are built 
 on. Although it is really difficult to tell from other people's 
 code which decision was 'important' and what was just the 
 author's way to do it.
I'm going to pick up several use cases and prioritize them according to my judgement. Feel free to suggest any cases that you think are needed (with motivation). Prioritizing is necessary to define what is out of scope and plan work into milestones, in order to ensure the project is feasible.
 Inclusion into Phobos I would not see as a priority. As 
 Jonathan said, there are already some clear visions of how such 
 modules would look like.
Well, *considering* such inclusion seriously would help improve the quality of the project. But it is not necessarily what will happen. Currently I think that my goals are close to those of use case 2 from Jonathan's reply. But until the project is reasonably complete, it is not the time to argue whether to include it (or part of it).
 Also if any project seeks to replace the DMD front-end, Walter 
 should be the technical advisor. Like anyone putting a lot of 
 time and effort into a design, he could have strong feelings 
 about certain decisions and implementing them in a seemingly 
 worse way.
Not a goal currently, because that would make the project significantly less likely to be completed ever.
 That said, you make the impression of being really dedicated to 
 the project, even giving yourself a time line, which is a good 
 sign. I wish you all the best for the project and hope that - 
 even without any 'official' recognition - it will see a lot of 
 use as the base of D tools.
Well, I hope that some more people will join the project once it stabilizes and delivers something useful. By the way, is anybody *potentially* interested to join?
 To learn about parsing I wrote a syntax highlighter for the 
 DCPU-16 assembly of Minecraft author Markus Persson. (Which was 
 a good exercise.) Interestingly I ended up with a similar 
 design for the Token struct like yours:

 - TokenType/TokenKind enum
 - Trie for matching token kinds (why do you expand it into 
 nested switch-case through CTFE mixins though?)
Switch code generation is something temporary that works currently and helps me evaluate possible problems with the future design, which is planned to be a compile-time finite automaton (likely deterministic).
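For illustration only - a tiny, hypothetical CTFE-generated string switch; DCT's real generator produces nested switches over characters, which this sketch does not attempt:

    import std.stdio;

    enum TokenKind { identifier, kwIf, kwElse }

    // Builds the source code of a string switch at compile time (CTFE).
    string genKeywordSwitch(string[] keywords, string[] kinds)
    {
        string code = "switch (word)\n{\n";
        foreach (i, kw; keywords)
            code ~= `    case "` ~ kw ~ `": return TokenKind.` ~ kinds[i] ~ ";\n";
        return code ~ "    default: return TokenKind.identifier;\n}";
    }

    TokenKind classify(string word)
    {
        mixin(genKeywordSwitch(["if", "else"], ["kwIf", "kwElse"]));
    }

    void main()
    {
        writeln(classify("if")); // kwIf
        writeln(classify("x"));  // identifier
    }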
 Since assembly code is usually small I just preallocate an 
 array of sourceCode.length tokens and realloc it to the correct 
 size when I'm done parsing. Nothing pretty, but simple and I am 
 sure it won't get any faster ;).
I'm sure it will :) (I'm going to elaborate on this some time later).
 One notable difference is that I only check for isEoF in one 
 location, since I append "\0" to the source code as a stop 
 token (that can also serve as an end-of-line indicator). (The 
 general idea is to move checks outside of loops.)
There are several EoF conditions: \0, \x1A, __EOF__ and physical EoF. And any loop would need to check for all of them. Preallocation could eliminate the need to check for physical EoF, but it would make it impossible to avoid copying the input string, and it also would not allow passing a slice of a string to a function to be lexed.
 ** Errors  **
 I generally keep the start and end column, in case someone 
 wants that. A real world example:

   ubyte x = ...;
   if (x >= 0 && x <= 8) ...
   Line 2, warning: Comparison is always true.

 At first glace you say "No, a byte can become larger than 8.", 
 but the compiler is just complaining about the first part of 
 the expression here, which would be clarified by e.g. some 
 ASCII/Unicode art:

   Line 2, warning: Comparison is always true:
   if (x >= 0 && x <= 8) ...
       ^----^
This functionality has been a priority from the beginning. It is not implemented yet, but designed for. Evaluating column and line only on demand is based on the assumption that such information is rarely needed (primarily to display messages to the user). My new design also addresses the use cases where it is needed frequently.
 ** Code highlighting **
 Especially Gtk's TextView (which GtkSourceView is base on) 
 isn't made for several thousand tokens. The text view will take 
 a second per 20000 tokens or so. The source view works around 
 that by highlighting in chunks, but you can still fell the 
 delay when you jump to the end of the file in gedit and even 
 MonoDevelop suffers from the bad performance. If I understood 
 your posts correctly, you are already planning for minimal 
 change sets. It is *very* important to only update changed 
 parts in a syntax highlighting editor. So for now I ended up 
 writing a 'insertText' and 'deleteText' method for the parser 
 that take 'start index', 'text' and 'start index', 'end index' 
 respectively and return a list of removed and added tokens.
 If possible, make sure that your library can be used with 
 GtkSourceView, Scintilla and QSyntaxHighlighter. Unfortunately 
 the bindings for the last two could use an update, but the API 
 help should get you an idea of how to use it most efficiently.
Currently I plan to support GtkSourceView, Scintilla, and a custom editor API which I plan to define so that it is most efficient in this respect. I haven't worked on that thoroughly yet. Incremental changes are the key to efficiency, and I'm going to invest a lot of effort into making them work well. Also, immutability of data structures will enable many optimizations.
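A hedged sketch of the kind of incremental interface Marco describes above (insertText/deleteText returning removed and added tokens); the names and signatures are illustrative, not an existing DCT or editor API:

    struct Token { size_t index; string value; }

    struct TokenDelta
    {
        Token[] removed; // tokens invalidated by the edit
        Token[] added;   // tokens produced by re-lexing the affected range
    }

    interface IncrementalLexer
    {
        // Re-lex only the region affected by inserting text at start.
        TokenDelta insertText(size_t start, string text);

        // Re-lex only the region affected by deleting [start .. end).
        TokenDelta deleteText(size_t start, size_t end);
    }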
May 20 2012
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Sun, 20 May 2012 10:09:34 +0200,
"Roman D. Boiko" <rb d-coding.com> wrote:

 Could you name a few specific concerns?
Mostly my own gut feeling that things which sound great in my head turn out to bite me in the end - things one just doesn't think of because of the limited horizon everyone has, and probably a lack of experience in the field of compiler/tools writing. On the other hand, I have good experience with working from community feedback. So the rough edges in the design may already be ironed out by now.
 I'm going to pick up several use cases and prioritize them 
 according to my judgement. Feel free to suggest any cases that 
 you think are needed (with motivation). Prioritizing is necessary 
 to define what is out of scope and plan work into milestones, in 
 order to ensure the project is feasible.
There is one feature I remember caused some headaches for Code::Blocks. They used a separate parser instance for every project in the IDE, which meant that all the standard include files would be parsed and kept in memory multiple times. When they later switched to a common parser for all projects they ran into stability issues. If you can arrange it, it would be great for multi-project IDEs to be able to add and remove projects to your parser without reparsing Phobos/druntime (which may have its .di files replaced by .d files, looking at the past discussion). C bindings could be an option. (As in: the smallest common denominator.) They would allow other languages (Python, ...) to use your library.
 Since assembly code is usually small I just preallocate an 
 array of sourceCode.length tokens and realloc it to the correct 
 size when I'm done parsing. Nothing pretty, but simple and I am 
 sure it won't get any faster ;).
I'm sure it will :) (I'm going to elaborate on this some time later).
I'm curious.
 There are several EoF conditions: \0, \x1A, __EOF__ and physicas 
 EOF. And any loop would need to check for all. Preallocation 
 could eliminate the need to check for physical EoF, but would 
 make it impossible to avoid input string copying and also would 
 not allow passing a function a slice of string to be lexed.
I found that I usually either load from a file into a newly allocated buffer (where a copy occurs, only because I forgot about assumeUnique in std.exception), or I am editing the file, in which case I recreate the source string after every keystroke anyway. I can still pass slices of that string to functions though. Not sure what you mean.

It probably doesn't work for D as well as for ASM code, but I could also check for \x1A and __EOF__ in the same fashion. (By the way, is it \x1A (substitute, ^Z) or did you mean \x04 (end-of-transmission, ^D)?)

The way it works is: parser states like 'in expression' can safely peek at the next character at any time. If it doesn't match what they expect, they emit an error and drop back to the "surrounding" parser state. When they reach the "file" level, that's the only place where an EOF (which will only occur once per file anyway) will be consumed. In theory one can drop all string length checks and work on char* directly with a known terminator char that is distinct from anything else.
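A minimal sketch of the sentinel idea Marco describes, assuming a '\0' is appended to the buffer up front; this is not DCT's approach, which also has to accept arbitrary slices:

    // One copy up front when loading; afterwards no length checks are needed.
    string loadWithSentinel(string text)
    {
        return text ~ '\0';
    }

    // Skip an identifier starting at p; the sentinel guarantees termination,
    // because '\0' fails the character test and ends the loop.
    size_t skipIdentifier(const(char)* p)
    {
        const(char)* start = p;
        while ((*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z') || *p == '_')
            ++p;
        return cast(size_t)(p - start);
    }

    // Usage: auto src = loadWithSentinel(sourceText);
    //        auto len = skipIdentifier(src.ptr);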
 ** Errors  **
 I generally keep the start and end column, in case someone 
 wants that.
This functionality has been the priority from the beginning. Not implemented yet but designed for. Evaluation of column and line only on demand is caused by the assumption that such information is needed rarely (primarily to display information to the user). My new design also addresses the use cases when it is needed frequently. [...] Incremental changes are the key to efficiency, and I'm going to invest a lot of effort into making them. Also immutability of data structures will enable many optimizations.
No more questions! -- Marco
May 20 2012
next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Sunday, 20 May 2012 at 17:42:59 UTC, Marco Leise wrote:
 There is one feature I remember caused some head-aches for 
 Code::Blocks. They used a separate parser instance for every 
 project in the IDE, which meant that all the standard include 
 files would be parsed and kept in memory multiple times. When 
 they later switched to a common parser for all projects they 
 ran into stability issues. If you can arrange it, it would be 
 great for multi-project IDEs to be able to add and remove 
 projects to your parser without reparsing Phobos/druntime 
 (which may have its .di files replaced by .d files, looking at 
 the past discussion).
The opposite situation here: I didn't think about a parser per project :) So I guess there is no need to worry.
 C bindings could be an option. (As in: the smallest common 
 denominator.) They would allow other languages (Python, ...) to use your library.
Yeah, but I'm far from there yet.
 Since assembly code is usually small I just preallocate an 
 array of sourceCode.length tokens and realloc it to the 
 correct size when I'm done parsing. Nothing pretty, but 
 simple and I am sure it won't get any faster ;).
I'm sure it will :) (I'm going to elaborate on this some time later).
I'm curious.
Maybe I don't understand your use case, but the idea is that if you parse as you type, it should be possible to avoid parsing and allocating memory for those lines which have not changed. That is not compatible with storing tokens in an array, since it would require reallocating memory on each change, so another data structure should be used (e.g., a linked list or, if efficient lookup is needed, a red-black tree). Only benchmarks can show whether (and how much) my approach would be faster for a specific situation (input patterns such as average size of data, complexity of parsing algorithms, requirements, etc.).
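As a toy illustration of what I mean (not the actual design, and the container choice is just one option), tokens could be grouped per line in a red-black tree, so that editing one line replaces only that line's entry:

import std.container.rbtree : RedBlackTree;

struct LineTokens
{
    size_t line;
    string[] tokens;   // stand-in for real Token values
}

void main()
{
    // Tokens grouped per line, ordered by line number.
    auto index = new RedBlackTree!(LineTokens, "a.line < b.line");
    index.insert(LineTokens(1, ["import", "std", ".", "stdio", ";"]));
    index.insert(LineTokens(2, ["void", "main", "(", ")", "{", "}"]));

    // A key stroke on line 2: replace only that node; line 1 is untouched.
    index.removeKey(LineTokens(2));
    index.insert(LineTokens(2, ["void", "main", "(", ")", "{", "writeln", ";", "}"]));
}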
 I found that I usually either load from file into a newly 
 allocated buffer (where a copy occurs, only because I forgot 
 about assumeUnique in std.exception) or I am editing the file 
 in which case I recreate the source string after every key 
 stroke anyway. I can still pass slices of that string to 
 functions though. Not sure what you mean.
Answered below.
 It probably doesn't work for D as well as for ASM code, but I 
 could also check for \x1A and __EOF__ in the same fashion.
 (By the way, is it \x1A (substitute, ^Z) or did you mean \x04 
 (end-of-transmission, ^D)?)
D has the following EoF cases: \0, \x1A, physical EoF, and the __EOF__ special token when it appears outside comments and certain string literals. ^D is not in this list.
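In other words, an end-of-source check has to look something like this (hypothetical helper, not the lexer's API; __EOF__ is handled at the token level because it must be ignored inside comments and some string literals):

bool atEndOfSource(string source, size_t pos)
{
    return pos >= source.length        // physical EoF
        || source[pos] == '\0'         // NUL
        || source[pos] == '\x1A';      // substitute character (^Z)
}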
 The way it works is: Parser states like 'in expression' can 
 safely peek at the next character at any time. If it doesn't 
 match what they expect they emit an error and drop back to the 
 "surrounding" parser state. When they reach the "file" level, 
 that's the only place where an EOF (which will only occur once 
 per file anyway) will be consumed.
 In theory one can drop all string length checks and work on 
 char* directly with a known terminator char that is distinct 
 from anything else.
If you want to pass a slice of the input string to a function, you cannot append \0 to it without copying data. If you don't append some pre-defined terminator, you must check for length *and* all supported terminating characters. On the other hand, your design might not require passing slices, and if the language syntax allows deterministic parsing (when you know what to expect next), there is no need to check for EoF at all.
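To illustrate the copy (a trivial standalone example, nothing from the library): appending a terminator to a slice of someone else's buffer always allocates, because the bytes right after the slice are not ours to overwrite.

void main()
{
    string whole = "int x = 1; // imagine much more code here";
    const(char)[] part = whole[0 .. 10];   // a slice handed to the lexer

    auto terminated = part ~ '\0';         // new allocation + copy
    assert(terminated.ptr !is part.ptr);   // the original buffer is untouched
}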
May 20 2012
parent Marco Leise <Marco.Leise gmx.de> writes:
On Sun, 20 May 2012 20:37:07 +0200, "Roman D. Boiko" <rb d-coding.com> wrote:

 Since assembly code is usually small I just preallocate an 
 array of sourceCode.length tokens and realloc it to the 
 correct size when I'm done parsing. Nothing pretty, but 
 simple and I am sure it won't get any faster ;).
I'm sure it will :) (I'm going to elaborate on this some time later).
I'm curious.
Maybe I don't understand your use case, but the idea is that if you parse as you type, it should be possible to avoid parsing and allocating memory for those lines which have not changed. That is not compatible with storing tokens in an array, since it would require reallocating memory on each change, so another data structure should be used (e.g., a linked list or, if efficient lookup is needed, a red-black tree). Only benchmarks can show whether (and how much) my approach would be faster for a specific situation (input patterns such as average size of data, complexity of parsing algorithms, requirements, etc.).
I had only thought about the single-use situation, not about editing files. Now the idea has evolved a bit, and you have probably thought about storing the scope hierarchy in a tree, too. It is still difficult to write a fast parser when someone can add or remove a '{' somewhere at the top of datetime.d, which changes the interpretation of the rest of the file. Mono-D has some hiccups worse than those (several seconds even) - maybe a bug. I'll keep my fingers crossed.
 If you want to pass a slice of the input string to a function, 
 you cannot append \0 to it without copying data. If you don't 
 append some pre-defined terminator, you must check for length 
 *and* all supported terminating characters. On the other hand, 
 your design might not require passing slices, and if the 
 language syntax allows deterministic parsing (when you know 
 what to expect next), there is no need to check for EoF at all.
Now I get it! -- Marco
May 20 2012