
digitalmars.D.announce - DCT: D compiler as a collection of libraries

reply "Roman D. Boiko" <rb d-coding.com> writes:
There were several discussions about the need for a D compiler 
library.

I propose my draft implementation of lexer for community review:
https://github.com/roman-d-boiko/dct

Lexer is based on Brian Schott's project 
https://github.com/Hackerpilot/Dscanner, but it has been 
refactored and extended (and more changes are on the way).

The goal is to have source code loading, lexer, parser and 
semantic analysis available as parts of Phobos. These libraries 
should be designed to be usable in multiple scenarios (e.g., 
refactoring, code analysis, etc.).

My commitment is to have at least front end built this year (and 
conforming to the D2 specification unless explicitly stated 
otherwise for some particular aspect).

Please post any feedback here. A dedicated project website will be 
created later.
May 11 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 10:01, Roman D. Boiko wrote:
 There were several discussions about the need for a D compiler library.

 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct

 Lexer is based on Brian Schott's project
 https://github.com/Hackerpilot/Dscanner, but it has been refactored and
 extended (and more changes are on the way).

 The goal is to have source code loading, lexer, parser and semantic
 analysis available as parts of Phobos. These libraries should be
 designed to be usable in multiple scenarios (e.g., refactoring, code
 analysis, etc.).

 My commitment is to have at least front end built this year (and
 conforming to the D2 specification unless explicitly stated otherwise
 for some particular aspect).

 Please post any feed here. A dedicated project web-site will be created
 later.
(Re-posting here)

A couple of questions:

* What's the state of the lexer
* Does it convert numerical literals and similar to their actual values
* Does it retain full source information
* Is there an example we can look at to see how the API is used
* Does it have a range based interface

-- 
/Jacob Carlborg
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 08:38:36 UTC, Jacob Carlborg wrote:
 (Re-posting here)
 A couple of questions:

 * What's the sate of the lexer
I consider it a draft state, because it has got several rewrites recently and I plan to do more, especially based on community feedback. However, the implementation handles almost all possible cases. Because of the rewrites it is most likely broken at this moment; I'm going to fix it ASAP (in a day or two).

The lexer will provide a random-access range of tokens (this is not done yet). Each token contains:

* start index (position in the original encoding, 0 corresponds to the first code unit after the BOM),
* token value encoded as a UTF-8 string,
* token kind (e.g., token.kind = TokenKind.Float),
* possibly an enum with annotations (e.g., token.annotations = FloatAnnotation.Hex | FloatAnnotation.Real)
 * Does it convert numerical literals and similar to their 
 actual values
It is planned to add a post-processor for that as part of the parser; please see README.md for some more details.
 * Does it retain full source information
Yes, this is a design choice to preserve all information. Source code is converted to UTF-8 and stored as token.value, even whitespaces. Information about code unit indices in the original encoding is preserved, too.
 * Is there an example we can look at to see how the API is used
TBD soon (see Roadmap in the readme file)
 * Does it have a range based interface
Yes, this is what I consider one of its strengths.
May 11 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 10:58, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 08:38:36 UTC, Jacob Carlborg wrote:
 (Re-posting here)
 A couple of questions:

 * What's the sate of the lexer
I consider it a draft state, because it has got several rewrites recently and I plan to do more, especially based on community feedback. However, implementation handles almost all possible cases. Because of rewrites it is most likely broken at this moment, I'm going to fix it ASAP (in a day or two).
I see.
 Lexer will provide a random-access range of tokens (this is not
 done yet).
Ok.
 Each token contains:
 * start index (position in the original encoding, 0 corresponds
 to the first code unit after BOM),
 * token value encoded as UTF-8 string,
 * token kind (e.g., token.kind = TokenKind.Float),
 * possibly enum with annotations (e.g., token.annotations =
 FloatAnnotation.Hex | FloatAnnotation.Real)
What about line and column information?
 * Does it convert numerical literals and similar to their actual values
It is planned to add a post-processor for that as part of parser, please see README.md for some more details.
Isn't that a job for the lexer?
 * Does it retain full source information
Yes, this is a design choice to preserve all information. Source code is converted to UTF-8 and stored as token.value, even whitespaces. Information about code unit indices in the original encoding is preserved, too.
That sounds good.
 * Is there an example we can look at to see how the API is used
TBD soon (see Roadmap in the readme file)
 * Does it have a range based interface
Yes, this is what I consider one of its strengths.
I see. Thanks. -- /Jacob Carlborg
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:08:24 UTC, Jacob Carlborg wrote:
 On 2012-05-11 10:58, Roman D. Boiko wrote:
 Each token contains:
 * start index (position in the original encoding, 0 corresponds
 to the first code unit after BOM),
 * token value encoded as UTF-8 string,
 * token kind (e.g., token.kind = TokenKind.Float),
 * possibly enum with annotations (e.g., token.annotations =
 FloatAnnotation.Hex | FloatAnnotation.Real)
What about line and column information?
Indices of the first code unit of each line are stored inside the lexer, and a function will compute a Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that a Location can be computed on demand and is not needed frequently. So the column is calculated by a reverse walk to the previous end of line, etc.

It will be possible to calculate Locations either taking into account special token sequences (e.g., #line 3 "ab/c.d") or discarding them.
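To make the idea concrete, here is a minimal sketch (not the actual DCT API; locate and lineStarts are hypothetical names) of how a Location could be computed on demand via a binary search over the stored line-start indices:

import std.range : assumeSorted;

struct Location
{
    size_t line;   // 1-based line number
    size_t column; // 1-based column, counted in code units here for simplicity
}

// lineStarts holds the index of the first code unit of each line, in order,
// with lineStarts[0] == 0.
Location locate(const(size_t)[] lineStarts, size_t index)
{
    // All line starts that are <= index; the last of them begins our line.
    auto preceding = lineStarts.assumeSorted.lowerBound(index + 1);
    immutable line = preceding.length;
    immutable lineStart = lineStarts[line - 1];
    return Location(line, index - lineStart + 1);
}

A real implementation would additionally walk UTF code points from lineStart and account for tab characters, as described above.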
 * Does it convert numerical literals and similar to their 
 actual values
It is planned to add a post-processor for that as part of parser, please see README.md for some more details.
Isn't that a job for the lexer?
That might be done in the lexer for efficiency reasons (to avoid lexing the token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed.

Also, I don't think that performance would be improved, given the ratio of the number of literals to the total number of tokens and the need to store additional information per token if it is done in the lexer. I will elaborate on that later.
May 11 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 11:22, Roman D. Boiko wrote:

 What about line and column information?
Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct?
 * Does it convert numerical literals and similar to their actual values
It is planned to add a post-processor for that as part of parser, please see README.md for some more details.
Isn't that a job for the lexer?
That might be done in lexer for efficiency reasons (to avoid lexing token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed.
That might be the case. But I don't think it belongs in the parser.
 Also I don't think that performance would be
 improved given the ratio of number of literals to total number of tokens
 and the need to store additional information per token if it is done in
 lexer. I will elaborate on that later.
Ok, fair enough. Perhaps this could be a property in the Token struct as well. In that case I would suggest renaming "value" to lexeme/spelling/representation, or something like that, and then name the new property "value". -- /Jacob Carlborg
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote:
 On 2012-05-11 11:22, Roman D. Boiko wrote:
 Locations will possible to calculate
 both taking into account special token sequences (e.g., #line 3
 "ab/c.d"), or discarding them.
Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct?
There is a method for that in the Lexer interface; to me it looks like it belongs there and not in the token. A version accepting a token and producing a pair of start/end Locations will be added.
 * Does it convert numerical literals and similar to their 
 actual values
It is planned to add a post-processor for that as part of parser, please see README.md for some more details.
Isn't that a job for the lexer?
That might be done in lexer for efficiency reasons (to avoid lexing token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed.
That might be the case. But I don't think it belongs in the parser.
I will provide example code and a dedicated post later to illustrate my point.
 Also I don't think that performance would be
 improved given the ratio of number of literals to total number 
 of tokens
 and the need to store additional information per token if it 
 is done in
 lexer. I will elaborate on that later.
Ok, fair enough. Perhaps this could be a property in the Token struct as well. In that case I would suggest renaming "value" to lexeme/spelling/representation, or something like that, and then name the new property "value".
I was going to rename value, but couldn't find a nice term. Thanks for your suggestions! As for the property with a strongly typed literal value, currently I plan to put it into the AST.
May 11 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 12:35, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote:
 Aha, clever. As long as I can get out the information I'm happy :) How
 about adding properties for this in the token struct?
There is a method for that in Lexer interface, for me it looks like it belongth there and not to token. Version accepting token and producing a pair of start/end Locations will be added.
Found it now, "calculateFor". I'm not sure it's the most intuitive name though. I get the feeling: "calculate what?".
 That might be the case. But I don't think it belongs in the parser.
I will provide example code and a dedicated post later to illustrate my point.
I guess I'll have to wait for that then :)
 Ok, fair enough. Perhaps this could be a property in the Token struct
 as well. In that case I would suggest renaming "value" to
 lexeme/spelling/representation, or something like that, and then name
 the new property "value".
I was going to rename value, but couldn't find a nice term. Thanks for your suggestions! As for the property with strongly typed literal value, currently I plan to put it into AST.
I stole "spelling" from Clang :) -- /Jacob Carlborg
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:
 On 2012-05-11 12:35, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote:
 Aha, clever. As long as I can get out the information I'm 
 happy :) How
 about adding properties for this in the token struct?
There is a method for that in Lexer interface, for me it looks like it belongth there and not to token. Version accepting token and producing a pair of start/end Locations will be added.
Found it now, "calculateFor". It not sure if it's the most intuitive name though. I get the feeling: "calculate what?".
calculateLocation was the original name, but I don't like repeating the return type in method names, so I decided to change it in a way that makes it clear that another renaming is needed ;) Any suggestions?
 That might be the case. But I don't think it belongs in the 
 parser.
I will provide example code and a dedicated post later to illustrate my point.
I guess I'll have to wait for that then :)
I'll try to do that ahead of the roadmap; it is important.
May 11 2012
parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 14:07, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:
 Found it now, "calculateFor". It not sure if it's the most intuitive
 name though. I get the feeling: "calculate what?".
 calculateLocation was original name, but I don't like repeating return
 type in method names, I decided to change it so that it is clear that
 another renaming is needed ;) Any suggestions?
My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought.
 I guess I'll have to wait for that then :)
I'll try to do that ahead of roadmap, it is important.
Cool. -- /Jacob Carlborg
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 12:55:58 UTC, Jacob Carlborg wrote:
 On 2012-05-11 14:07, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:
 Found it now, "calculateFor". It not sure if it's the most 
 intuitive
 name though. I get the feeling: "calculate what?".
 calculateLocation was original name, but I don't like 
 repeating return
 type in method names, I decided to change it so that it is 
 clear that
 another renaming is needed ;) Any suggestions?
My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought.
What about the following signature: Location locate(size_t index)?

Or even better:

alias size_t CodeUnitIndex;
Location locateFor(CodeUnitIndex position);

The problem with placing it in Token is that Token should not know anything about the source as a whole.
May 11 2012
next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 15:01, Roman D. Boiko wrote:

 What about the following signature: Location locate(size_t index)?
 Or even better:
 alias size_t CodeUnitIndex;
 Location locateFor(CodeUnitIndex position);
That is better, although I would prefer to pass in a token (assuming that is where the index is declared). Then it would be an implementation detail that "index" is used to get the location.

Another option would be to turn it around:

struct/class Location
{
    Location find/locate (Token token);
}
 The problem with placing it in Token is that Token should not know
 anything about source as a whole.
I see. For convenience there could be properties defined that just forward the call to some other function.

-- 
/Jacob Carlborg
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 13:25:53 UTC, Jacob Carlborg wrote:
 On 2012-05-11 15:01, Roman D. Boiko wrote:

 What about the following signature: Location locate(size_t 
 index)?
 Or even better:
 alias size_t CodeUnitIndex;
 Location locateFor(CodeUnitIndex position);
That is better although I would prefer to pass in a token (assuming that is where index is declared). Then it would be an implementation detail that "index" is used to get the location.
That is also what I was planning to do after I posted my last suggestion, unless I discover some conceptual problem with it, which is unlikely.
 Another option would be to turn it around:

 sturct/class Location
 {
     Location find/locate (Token token);
 }

 The problem with placing it in Token is that Token should not 
 know
 anything about source as a whole.
I see. For convenience there could be properties defined that just forwards the call to some other function.
Both of these cases would introduce a circular dependency between the Lexer and its output data (Location or Token). It is possible to break such a dependency by complicating things even more, or to live with it. But I don't like circular dependencies when there is no compelling reason to introduce them.
May 11 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 11/05/2012 15:01, Roman D. Boiko a écrit :
 On Friday, 11 May 2012 at 12:55:58 UTC, Jacob Carlborg wrote:
 On 2012-05-11 14:07, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:
 Found it now, "calculateFor". It not sure if it's the most intuitive
 name though. I get the feeling: "calculate what?".
 calculateLocation was original name, but I don't like repeating return
 type in method names, I decided to change it so that it is clear that
 another renaming is needed ;) Any suggestions?
My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought.
What about the following signature: Location locate(size_t index)? Or even better: alias size_t CodeUnitIndex; Location locateFor(CodeUnitIndex position); The problem with placing it in Token is that Token should not know anything about source as a whole.
I don't really see the benefit of this. You are trading an O(1) operation for an O(log(n)) one. It can only be faster in specific cases, which should be measured.

It is likely to be slower when compiling erroneous code or when the lexer is used for something other than compiling, and likely to be insignificant when compiling correct code (I suspect the performance win is negligible compared to the time required to build a fully operational executable).

You are overcomplicating things for little to no benefit.
May 11 2012
next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote:
 Le 11/05/2012 15:01, Roman D. Boiko a écrit :
 The problem with placing it in Token is that Token should not 
 know
 anything about source as a whole.
I don't really see the benefit of this. You are trading a O(1) operation to an O(log(n)) . It can only be faster in specific cases, which should be measured.
It would be interesting to see benchmarks. But anyway, log(1M) is just 20, and I haven't seen any source code with 1M lines :).

The main point is design, not performance. I'm trying to build a minimal design that handles common use cases very well, and others well enough. I don't see why a Location would be needed for every Token.
 It is likely to be slower on compiling erroneous code or used 
 for something else than compiling, and likely to be 
 insignificant to compile correct code (I suspect the 
 performance win is negligible compared to the time required to 
 build a fully operational executable).
IMO, it is unrelated.
 You are overcomplicating things for no to little benefice.
Well, it depends... However, I will provide a design rationale document this month.
May 11 2012
prev sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote:
 Le 11/05/2012 15:01, Roman D. Boiko a écrit :
 The problem with placing it in Token is that Token should not 
 know
 anything about source as a whole.
I don't really see the benefit of this. You are trading a O(1) operation to an O(log(n)) . It can only be faster in specific cases, which should be measured.
Technically, I'm trading the N*O(1) operations needed to track line and column while consuming each character for M*O(log(n)) operations when calculating them on demand, where N is the number of characters, n is the number of lines, and M is the number of actual usages of Location. My assumption is that M << N (M is much smaller than N).
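As a purely illustrative calculation (all numbers are assumptions, not measurements), take a module of about 1,000,000 characters spread over 20,000 lines, with 50 Location lookups actually performed:

  eager tracking: N * O(1)      ~ 1,000,000 constant-time updates
  lazy lookup:    M * O(log n)  ~ 50 * log2(20,000) ~ 50 * 15 = 750 search steps

This is the sense in which M << N makes the on-demand approach cheap.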
May 11 2012
parent reply deadalnix <deadalnix gmail.com> writes:
Le 11/05/2012 16:02, Roman D. Boiko a écrit :
 On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote:
 Le 11/05/2012 15:01, Roman D. Boiko a écrit :
 The problem with placing it in Token is that Token should not know
 anything about source as a whole.
I don't really see the benefit of this. You are trading a O(1) operation to an O(log(n)) . It can only be faster in specific cases, which should be measured.
Technically, I'm trading N*0(1) operations needed to track line and column while consuming each character to M*0(log(n)) operations when calculating them on demand. N = number of characters, n is number of lines and M is number of actual usages of Location. My assumption is that M << N (M is much smaller than N).
N can easily be number of tokens.
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 15:05:19 UTC, deadalnix wrote:
 Le 11/05/2012 16:02, Roman D. Boiko a écrit :
 Technically, I'm trading N*0(1) operations needed to track 
 line and
 column while consuming each character to M*0(log(n)) 
 operations when
 calculating them on demand. N = number of characters, n is 
 number of
 lines and M is number of actual usages of Location. My 
 assumption is
 that M << N (M is much smaller than N).
N can easily be number of tokens.
Yes, if you are looking for the token by its index, not for a location. E.g., for autocompletion it is needed to find the last token before the cursor location. But that is not related to location search. Also please note that I oversimplified the formula for complexity; it also has other components. My reply was just an additional minor comment. The motivation is design, not performance.

One additional reason for such separation: with my approach it is possible to calculate a Location either taking into account information from special token sequences, or ignoring it. How would you do that with eager calculation? Calculate only one category? Or track and store both? Unnecessary complexity will eventually find a way to shoot you in the leg :)

Real-world usage (at least, anticipated scenarios) should be the basis for designing. Sometimes this contradicts intuition and requires you to look at the problem from a different side.
May 11 2012
parent reply Ary Manzana <ary esperanto.org.ar> writes:
On 5/11/12 10:14 PM, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 15:05:19 UTC, deadalnix wrote:
 Le 11/05/2012 16:02, Roman D. Boiko a écrit :
 Technically, I'm trading N*0(1) operations needed to track line and
 column while consuming each character to M*0(log(n)) operations when
 calculating them on demand. N = number of characters, n is number of
 lines and M is number of actual usages of Location. My assumption is
 that M << N (M is much smaller than N).
N can easily be number of tokens.
Yes, if you are looking for the token by its index, not for location. E.g., for autocompletion it is needed to find the last token before cursor location. But that is not related to location search. Also please note that I oversimplified formula for complexity. It also has other components. My reply was just an additional minor comment. Motivation is design, not performance. One additional reason for such separation: with my approach it is possible to calculate Location both taking into account infromation from special token sequences, or ignoring it. How would you do that for eager calculation? Calculate only one category? Or track and store both? Unnecessary complexity will eventually find a way to shoot your leg :) Real-world usage (at least, anticipated scenarios) should be the basis for designing. Sometimes this contradicts intuition and requires you to look at the problem from a different side.
As deadalnix says, I think you are over-complicating things.

I mean, to store the column and line information it's just:

if (isNewLine(c)) {
  line++;
  column = 0;
} else {
  column++;
}

(I think you need to add that to the SourceRange class. Then copy line and column to token on the Lexer#lex() method)

Do you really think it's that costly in terms of performance?

I think you are wasting much more memory and performance by storing all the tokens in the lexer.

Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token.

So you see, for the simplest use case of a lexer the performance of DCT is awful.

Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.
May 11 2012
next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 As deadalnix says, I think you are over-complicating things.

 I mean, to store the column and line information it's just:

 if (isNewLine(c)) {
   line++;
   column = 0;
 } else {
   column++;
 }

 (I think you need to add that to the SourceRange class. Then 
 copy line and column to token on the Lexer#lex() method)

 Do you really think it's that costly in terms of performance?

 I think you are wasting much more memory and performance by 
 storing all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just 
 highlight keywords. How can I tell DCT to *not* store all 
 tokens because I need each one in turn? And since I'll be 
 highlighting in the editor I will need column and line 
 information. That means I'll have to do that O(log(n)) 
 operation for every token.

 So you see, for the simplest use case of a lexer the 
 performance of DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens 
 one by one, probably peeking in some cases. If I want to store 
 line and column information I just copy them to the AST. You 
 say the tokens are discarded but their data is not, and that's 
 why their data is usually copied.
Summary of your points (I deliberately emphasize some of them more than you did; please correct me if I misinterpreted anything):

* storing location for each token is simple and cheap
* SourceRange is needed; an instance per token; it should be a class, not a struct
* syntax highlighting doesn't need to keep all tokens in memory
* but it must know column and line for each token
* for this use case DCT has awful performance
* to build an AST, lines and columns are required
* information from tokens must be copied and possibly transformed in order to put it into the AST; after such transformation the tokens are not needed any more

Those are all valid points given that we don't have any benchmarks and usage examples. The good news is that use cases like highlighting, parsing, autocompletion, etc. are the core functionality for which DCT should be designed. So if it fails any of them, the design needs to be revisited. But I will need some time to thoroughly address each of these issues (either prove that it is not relevant, or fix the problem). I will definitely complete this work during May. I will regularly report my progress here.

In general, I tend to disagree with most of them, because I already thought about each and designed accordingly, but it is important to give them enough attention. What I fail miserably at this moment is providing actual usage code and benchmarks, so I'm going to focus on that.

Thanks for your feedback!
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 05:13:36 UTC, Roman D. Boiko wrote:
 On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 As deadalnix says, I think you are over-complicating things.
...
 ... I will need some time to thoroughly address each of this 
 issues (either prove that it is not relevant, or fix the 
 problem). I will definitely complete this work during May. I 
 will regularly report my progress here.

 In general, I tend to disagree with most of them, because I 
 already thought about each and designed accordingly, but it is 
 important to give them enough attention.
https://github.com/roman-d-boiko/dct/issues/15
May 11 2012
prev sibling next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 As deadalnix says, I think you are over-complicating things.

 I mean, to store the column and line information it's just:

 if (isNewLine(c)) {
   line++;
   column = 0;
 } else {
   column++;
 }

 (I think you need to add that to the SourceRange class. Then 
 copy line and column to token on the Lexer#lex() method)

 Do you really think it's that costly in terms of performance?

 I think you are wasting much more memory and performance by 
 storing all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just 
 highlight keywords. How can I tell DCT to *not* store all 
 tokens because I need each one in turn? And since I'll be 
 highlighting in the editor I will need column and line 
 information. That means I'll have to do that O(log(n)) 
 operation for every token.

 So you see, for the simplest use case of a lexer the 
 performance of DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens 
 one by one, probably peeking in some cases. If I want to store 
 line and column information I just copy them to the AST. You 
 say the tokens are discarded but their data is not, and that's 
 why their data is usually copied.
Would it be possible for you to fork my code and tweak it for comparison? You will definitely discover more problems this way, and such feedback would really help me. That doesn't seem to be too much work to do. Any other volunteers for this?
May 11 2012
parent Ary Manzana <ary esperanto.org.ar> writes:
On 5/12/12 12:17 PM, Roman D. Boiko wrote:
 On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 As deadalnix says, I think you are over-complicating things.

 I mean, to store the column and line information it's just:

 if (isNewLine(c)) {
 line++;
 column = 0;
 } else {
 column++;
 }

 (I think you need to add that to the SourceRange class. Then copy line
 and column to token on the Lexer#lex() method)

 Do you really think it's that costly in terms of performance?

 I think you are wasting much more memory and performance by storing
 all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just
 highlight keywords. How can I tell DCT to *not* store all tokens
 because I need each one in turn? And since I'll be highlighting in the
 editor I will need column and line information. That means I'll have
 to do that O(log(n)) operation for every token.

 So you see, for the simplest use case of a lexer the performance of
 DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens one by
 one, probably peeking in some cases. If I want to store line and
 column information I just copy them to the AST. You say the tokens are
 discarded but their data is not, and that's why their data is usually
 copied.
Would it be possible for you to fork my code and tweak it for comparison? You will definitely discover more problems this way, and such feedback would really help me. That doesn't seem to be inadequately much work to do. Any other volunteers for this?
Sure, I'll do it and provide some benchmarks. Thanks for creating the issue.
May 12 2012
prev sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 I think you are wasting much more memory and performance by 
 storing all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just 
 highlight keywords. How can I tell DCT to *not* store all 
 tokens because I need each one in turn? And since I'll be 
 highlighting in the editor I will need column and line 
 information. That means I'll have to do that O(log(n)) 
 operation for every token.

 So you see, for the simplest use case of a lexer the 
 performance of DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens 
 one by one, probably peeking in some cases. If I want to store 
 line and column information I just copy them to the AST. You 
 say the tokens are discarded but their data is not, and that's 
 why their data is usually copied.
Currently I am thinking about making the token a class instead of a struct. A token (from https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:

// Represents a lexed token
struct Token
{
    size_t startIndex; // position of the first code unit in the source string
    string spelling;   // characters from which this token has been lexed
    TokenKind kind;    // enum; each keyword and operator has a dedicated kind
    ubyte annotations; // meta information like whether a token is valid, or an integer literal is signed, long, hexadecimal, etc.
}

Making it a class would give several benefits:

* allow not to worry about allocating a big array of tokens. E.g., on a 64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require a 4 times smaller chunk of contiguous memory if it were an array of class references, because each would consume only 8 bytes instead of 32.

* allow subclassing, for example, for storing strongly typed literal values; this flexibility could also facilitate future extensibility (but it's difficult to predict which kind of extension may be needed)

* there would be no need to copy data from tokens into the AST, passing an object reference would be enough (again, copying 8 instead of 32 bytes); the same applies to passing into methods - no need to pass by ref to minimise overhead

It would incur some additional memory overhead (at least 8 bytes per token), but that's hardly significant. Also, there is an additional price for accessing token members because of indirection, and, possibly, worse cache friendliness (token instances may be allocated anywhere in memory, not close to each other).

These considerations are mostly about performance. I think there is also some impact on design, but I couldn't find anything significant (given that currently I see a token as merely a data structure without associated behavior).

Could anybody suggest other pros and cons? Which option would you choose?
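For comparison, a minimal sketch of the class-based alternative being weighed here (illustrative only, not DCT code; IntegerLiteralToken and its value field are assumptions):

enum TokenKind { Float, IntegerLiteral /* ..., one per keyword, operator, etc. */ }

// Class version: an array of these holds 8-byte references instead of 32-byte structs.
class Token
{
    size_t startIndex; // position of the first code unit in the source string
    string spelling;   // characters from which this token has been lexed
    TokenKind kind;    // enum; each keyword and operator has a dedicated kind
    ubyte annotations; // validity, literal flags, etc.
}

// Subclassing could attach a strongly typed literal value where needed.
class IntegerLiteralToken : Token
{
    ulong value; // the converted numeric value
}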
May 14 2012
next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Monday, 14 May 2012 at 15:00:37 UTC, Roman D. Boiko wrote:
 Could anybody suggest other pros and cons? Which option would 
 you choose?
Further discussion on this topic (struct vs class) is at http://forum.dlang.org/thread/asdrqlaydzcdpqwsbtiv forum.dlang.org
May 14 2012
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
Le 14/05/2012 17:00, Roman D. Boiko a écrit :
 On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 I think you are wasting much more memory and performance by storing
 all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just
 highlight keywords. How can I tell DCT to *not* store all tokens
 because I need each one in turn? And since I'll be highlighting in the
 editor I will need column and line information. That means I'll have
 to do that O(log(n)) operation for every token.

 So you see, for the simplest use case of a lexer the performance of
 DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens one by
 one, probably peeking in some cases. If I want to store line and
 column information I just copy them to the AST. You say the tokens are
 discarded but their data is not, and that's why their data is usually
 copied.
Currently I think about making token a class instead of struct. A token (from https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is: // Represents lexed token struct Token { size_t startIndex; // position of the first code unit in the source string string spelling; // characters from which this token has been lexed TokenKind kind; // enum; each keyword and operator, have a dedicated kind ubyte annotations; // meta information like whether a token is valid, or an integer literal is signed, long, hexadecimal, etc. } Making it a class would give several benefits: * allow not to worry about allocating a big array of tokens. E.g., on 64-bit OS the largest module in Phobos (IIRC, the std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require 4 times smaller chunk of contiguous memory if it was an array of class objects, because each would consume only 8 bytes instead of 32.
Why is this a benefit?
 * allow subclassing, for example, for storing strongly typed literal
 values; this flexibility could also facilitate future extensibility (but
 it's difficult to predict which kind of extension may be needed)
I'm pretty sure that D's tokens will not change that much. If the need isn't identified right now, I'd advocate for YAGNI.
 * there would be no need to copy data from tokens into AST, passing an
 object would be enough (again, copy 8 instead of 32 bytes); the same
 applies to passing into methods - no need to pass by ref to minimise
 overhead
Yes, but now you add pressure on the GC and add indirections. I'm not sure it's worth it. It seems to me like a premature optimization.
 It would incur some additional memory overhead (at least 8 bytes per
 token), but that's hardly significant. Also there is additional price
 for accessing token members because of indirection, and, possibly, worse
 cache friendliness (token instances may be allocated anywhere in memory,
 not close to each other).

 These considerations are mostly about performance. I think there is also
 some impact on design, but couldn't find anything significant (given
 that currently I see a token as merely a datastructure without
 associated behavior).

 Could anybody suggest other pros and cons? Which option would you choose?
You are over-engineering the whole thing.
May 14 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Monday, 14 May 2012 at 16:30:21 UTC, deadalnix wrote:
 Le 14/05/2012 17:00, Roman D. Boiko a écrit :
 Making it a class would give several benefits:
 * allow not to worry about allocating a big array of tokens. 
 E.g., on
 64-bit OS the largest module in Phobos (IIRC, the 
 std.datetime) consumes
 13.5MB in an array of almost 500K tokens. It would require 4 
 times
 smaller chunk of contiguous memory if it was an array of class 
 objects,
 because each would consume only 8 bytes instead of 32.
Why is this a benefice ?
"NNTP error: 400 load at 23.60, try later" prevented me from answering :)

Because it might be difficult to find a big chunk of available memory (3.5MB vs 14MB for this particular case).
 * allow subclassing, for example, for storing strongly typed 
 literal
 values; this flexibility could also facilitate future 
 extensibility (but
 it's difficult to predict which kind of extension may be 
 needed)
I'm pretty sure that D's token will not change that much. If the need isn't identified right know, I'd advocate for YAGNI.
Agree.
 * there would be no need to copy data from tokens into AST, 
 passing an
 object would be enough (again, copy 8 instead of 32 bytes); 
 the same
 applies to passing into methods - no need to pass by ref to 
 minimise
 overhead
Yes but now you add pressure on the GC and add indirections. I'm not sure it worth it. It seems to me like a premature optimization.
It looks so. Thanks.
 It would incur some additional memory overhead (at least 8 
 bytes per
 token), but that's hardly significant. Also there is 
 additional price
 for accessing token members because of indirection, and, 
 possibly, worse
 cache friendliness (token instances may be allocated anywhere 
 in memory,
 not close to each other).

 These considerations are mostly about performance. I think 
 there is also
 some impact on design, but couldn't find anything significant 
 (given
 that currently I see a token as merely a datastructure without
 associated behavior).

 Could anybody suggest other pros and cons? Which option would 
 you choose?
You are over engineering the whole stuff.
I'm trying to balance this and other tradeoffs. I'd like to simplify, but still satisfy my design goals.
May 14 2012
parent reply "Tove" <tove fransson.se> writes:
On Monday, 14 May 2012 at 16:58:42 UTC, Roman D. Boiko wrote:
 You are over engineering the whole stuff.
I'm trying to solve this and other tradeoffs. I'd like to simplify but satisfy my design goals.
What if there were two different lexer modes... with different structs.

1. For an IDE with on-the-fly lexing:
   Assumption: the error rate is high (need to keep much info).

2. For the compiler:
   Assumption: the error rate is non-existent, and if there is an error it really doesn't matter if it's slow.

So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;)

try
{
  lex(mode.compiler);
}
catch
{
  lex(mode.ide); // calculates column etc., whatever info it needs.
}
May 14 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:
 On Monday, 14 May 2012 at 16:58:42 UTC, Roman D. Boiko wrote:
 You are over engineering the whole stuff.
I'm trying to solve this and other tradeoffs. I'd like to simplify but satisfy my design goals.
What if there were two different lex:er modes... with different struct:s. 1. For an IDE with on the fly lexing: Assumption, the error rate is high.(need to keep much info) 2. For the compiler Assumption, the error rate is non existent, and if there is an error it really doesn't matter if it's slow. So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;) try { lex(mode.compiler); } catch { lex(mode.ide); // calculates column etc. what ever info it needs. }
So far it doesn't seem expensive to tolerate errors and proceed. The only thing I miss is some sort of specification of when to stop including characters in a token's spelling and start a new token. I don't think I'll use backtracking for that in the near future. If I did, I would really separate that part of the lexer and provide two implementations for it. Given this, accepting errors and moving on simply requires some finite set of rules about the boundaries of invalid tokens. I also think structural code editing concepts will help here, but I haven't done any research on this topic yet.

The problem with multiple lexer implementations is that it might become much more difficult to maintain them.
May 14 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Monday, 14 May 2012 at 19:13:39 UTC, Roman D. Boiko wrote:
 On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:
 What if there were two different lex:er modes... with 
 different struct:s.

 1. For an IDE with on the fly lexing:
  Assumption, the error rate is high.(need to keep much info)

 2. For the compiler
 Assumption, the error rate is non existent, and if there is an 
 error it really doesn't matter if it's slow.

 So... when choosing the "compiler mode"... and there actually 
 is an error, then just lex it again, to produce a pretty error 
 message ;)
...
 The problem with multiple lexer implementations is that it might
 become much more difficult to maintain them.
Just to clarify: different modes in the lexer are, in my view, like two different implementations combined in a non-trivial way (unless the difference is minor). So the complexity comes from two factors: the different implementations and how to combine them. I try to avoid this.
May 14 2012
parent reply deadalnix <deadalnix gmail.com> writes:
Le 14/05/2012 21:21, Roman D. Boiko a écrit :
 On Monday, 14 May 2012 at 19:13:39 UTC, Roman D. Boiko wrote:
 On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:
 What if there were two different lex:er modes... with different
 struct:s.

 1. For an IDE with on the fly lexing:
 Assumption, the error rate is high.(need to keep much info)

 2. For the compiler
 Assumption, the error rate is non existent, and if there is an error
 it really doesn't matter if it's slow.

 So... when choosing the "compiler mode"... and there actually is an
 error, then just lex it again, to produce a pretty error message ;)
...
 The problem with multiple lexer implementations is that it might
 become much more difficult to maintain them.
Just to clarify: different modes in lexer in my view are like two different implementations combined in a non-trivial way (unless the difference is minor). So complexity goes from two factors: different implementations and how to combine them. I try to avoid this.
If both have the same interface, metaprogramming can do the magic for you.
May 15 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Tuesday, 15 May 2012 at 09:33:31 UTC, deadalnix wrote:
 Le 14/05/2012 21:21, Roman D. Boiko a écrit :
 Just to clarify: different modes in lexer in my view are like 
 two
 different implementations combined in a non-trivial way (unless
 the difference is minor). So complexity goes from two factors:
 different implementations and how to combine them. I try to 
 avoid this.
If both have the same interface, metaprograming can do the magic for you.
Well, I didn't state that there is no way to solve the issue. Also I added "unless the difference is minor".
May 15 2012
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 05/14/2012 05:00 PM, Roman D. Boiko wrote:
 On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
 I think you are wasting much more memory and performance by storing
 all the tokens in the lexer.

 Imagine I want to implement a simple syntax highlighter: just
 highlight keywords. How can I tell DCT to *not* store all tokens
 because I need each one in turn? And since I'll be highlighting in the
 editor I will need column and line information. That means I'll have
 to do that O(log(n)) operation for every token.

 So you see, for the simplest use case of a lexer the performance of
 DCT is awful.

 Now imagine I want to build an AST. Again, I consume the tokens one by
 one, probably peeking in some cases. If I want to store line and
 column information I just copy them to the AST. You say the tokens are
 discarded but their data is not, and that's why their data is usually
 copied.
Currently I think about making token a class instead of struct. ... Could anybody suggest other pros and cons? Which option would you choose?
Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
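As an illustration of that suggestion (a sketch only, with hypothetical names, not code from this thread), a fixed-size circular buffer that keeps just the last N value-type tokens could look like:

struct TokenBuffer(Token, size_t N = 16)
{
    private Token[N] buf; // storage for the most recent N tokens
    private size_t count; // total number of tokens pushed so far

    // Store the next token, overwriting the oldest one once the buffer is full.
    void push(Token t)
    {
        buf[count % N] = t;
        ++count;
    }

    // Look back at one of the last N tokens (back == 0 is the most recent).
    ref Token peek(size_t back = 0)
    {
        assert(back < N && back < count);
        return buf[(count - 1 - back) % N];
    }
}

With such a buffer the parser only ever holds a bounded window of recent tokens instead of the whole file.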
May 15 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Tuesday, 15 May 2012 at 19:27:26 UTC, Timon Gehr wrote:
 On 05/14/2012 05:00 PM, Roman D. Boiko wrote:
 Currently I think about making token a class instead of struct.
 ...

 Could anybody suggest other pros and cons? Which option would 
 you choose?
Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
I'd like to be able to do efficient token lookup based on a token's start index in the file (e.g., for autocompletion and for calculating column/line numbers on demand). I also plan to have the ability to update lexer, parser and semantic analysis results as the user or tool makes edits (incrementally, not by restarting), and to keep the history of changes as long as it is needed. To achieve this, I will use Token[N][] for storing tokens, and an immutable binary tree for efficient lookup and for preserving history.

I don't worry much about keeping all tokens in memory, because the information they contain is useful and needed for most scenarios. If, on the contrary, some information is no longer used, it will be garbage collected.

This is a significant redesign of the current implementation. It is based on feedback from this thread and completely satisfies my design goals. It is also going to be efficient. An alpha release for community review is planned for May 21, with design documentation 10 days after that.
May 15 2012
parent deadalnix <deadalnix gmail.com> writes:
Le 15/05/2012 21:59, Roman D. Boiko a écrit :
 On Tuesday, 15 May 2012 at 19:27:26 UTC, Timon Gehr wrote:
 On 05/14/2012 05:00 PM, Roman D. Boiko wrote:
 Currently I think about making token a class instead of struct.
 ...

 Could anybody suggest other pros and cons? Which option would you
 choose?
Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
I'd like to be able to do efficient token lookup based on its start index in the file (e.g., for autocompletion and calculating column/line numbers on demand). I also plan to have ability to update lexer, parser and semantic analysis results as the user or tool makes edits (incrementally, not by restarting), and to keep history of changes as long as it is needed. To achieve this, I will use Token[N][] for storing tokens, and an immutable binary tree for efficient lookup and preserving history. I don't worry much about keeping all tokens in memory, because information they contain is useful and needed for most scenarios. If, on the contrary, information is no longer used, it will be garbage collected. This is a significant redesign of current implementation. It is based on feedback from this thread and completely satisfies my design goals. It is also going to be efficient. Alpha release for community review is planned on May 21, and design documentation in 10 days after that.
The only information that is useful over time is the position. This is the one that you want to calculate lazily. In other words, what you want to keep in memory is of no interest. A circular buffer, as Timon proposes, is a better idea.
May 16 2012
prev sibling parent reply Ary Manzana <ary esperanto.org.ar> writes:
On 5/11/12 4:22 PM, Roman D. Boiko wrote:
 What about line and column information?
Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
But then how do you efficiently compute line numbers (if a reverse walk is efficient at all)?

Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek at the next token, or have a buffer of tokens, you can use a freelist (http://dlang.org/memory.html#freelists, one of the many nice things I learned by looking at DMD's source code).

So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.
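For reference, a minimal sketch of the kind of reuse being described (illustrative only; the acquire/release names and fields are assumptions, not the dlang.org or DCT API):

class Token
{
    size_t startIndex;
    string spelling;
    uint line, column;

    private Token next;          // link used only while the token sits in the freelist
    private static Token freelist;

    // Reuse a discarded Token if one is available, otherwise allocate a new one.
    static Token acquire()
    {
        if (freelist is null)
            return new Token;
        auto t = freelist;
        freelist = t.next;
        t.next = null;
        return t;
    }

    // Put a no-longer-needed Token back for later reuse instead of leaving it to the GC.
    static void release(Token t)
    {
        t.next = freelist;
        freelist = t;
    }
}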
May 11 2012
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 11.05.2012 13:50, schrieb Ary Manzana:
 On 5/11/12 4:22 PM, Roman D. Boiko wrote:
  What about line and column information?
Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
But then how do you do to efficiently (if reverse walk is any efficient) compute line numbers? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks tokens, process them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek the next token, or have a buffer of token, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.
it would be better to add something like a ColumnLine-collector thing - one that, if applied, is able to hold this information, so there is no waste if it's not needed

i think there are several parts that can work like that
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 12:12:01 UTC, dennis luehring wrote:
 it would be better to add something like an 
 ColumnLine-collector-thing - that if applied is able to hold 
 this information, so no waste if its not needed

 i think there are several parts that can work like that
This looks like what I called a hook for plugging in the respective functionality in my previous message. But I'm not sure it is needed at all (in addition to what I have already designed with the calculateFor() method).
May 11 2012
prev sibling next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 11:50:29 UTC, Ary Manzana wrote:
 On 5/11/12 4:22 PM, Roman D. Boiko wrote:
 What about line and column information?
Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
But then how do you do to efficiently (if reverse walk is any efficient) compute line numbers?
I borrowed the trick from Brian Schott: tokens will be stored as a sorted array. Sorting is done based on their indices in the source code (they are naturally sorted immediately, no need to run any algorithm for that). The same is true for the information about line start indices. Now, given an index, we can do a binary search in such sorted arrays and get the corresponding line number or token (whatever we need). Walking along UTF code points starting from the index which corresponds to the beginning of the line, and taking tab characters into account, it is possible to calculate the column reasonably fast.

My assumption is that column numbers are needed for use cases like error reporting or other messages for the user, and thus there is no need to pre-calculate them for each token (or character). For such use cases the efficiency should really be enough. In general, I precalculate and store only information which is needed frequently. Other information is evaluated lazily.
 Usually tokens are used and discarded. I mean, somebody that 
 uses the lexer asks tokens, process them (for example to 
 highlight code or to build an AST) and then discards them. So 
 you can reuse the same Token instance. If you want to peek the 
 next token, or have a buffer of token, you can use a freelist ( 
 http://dlang.org/memory.html#freelists , one of the many nice 
 things I learned by looking at DMD's source code ).
But the information from tokens is not discarded (at least, this is the requirement for DCT). So my choice is to keep it in tokens instead of converting to some other form. That also implies that Token is free from any auxiliary information which is not necessary for common use cases.
 So adding line and column information is not like wasting a lot 
 of memory: just 8 bytes more for each token in the freelist.
It is inefficient to calculate it during lexing, the scanning algorithm becomes less robust and more complicated, and in most cases we won't need it anyway. I will add a hook to plug in such functionality (when needed) if I learn why it is necessary.
May 11 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 13:50, Ary Manzana wrote:
 On 5/11/12 4:22 PM, Roman D. Boiko wrote:
 What about line and column information?
Indices of the first code unit of each line are stored inside the lexer, and a function will compute the Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that a Location can be computed on demand and is not needed frequently, so the column is calculated by a reverse walk to the previous end of line, etc. Locations can be computed either taking into account special token sequences (e.g., #line 3 "ab/c.d") or ignoring them.
But then how do you efficiently compute line numbers (if a reverse walk is efficient at all)? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek at the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.
SDC uses a struct Location to store such data. Every token and AST element has a member of type Location. As it is a value type, there is no need for a free list or anything of that sort. When the token is discarded, the location is discarded too. The same goes for AST nodes.
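For illustration, a rough sketch of that layout (the declarations are mine, not SDC's actual ones): the Location lives inside the token or node by value, so it is allocated and collected together with it and never on its own:

    struct Location
    {
        string fileName;
        uint line;
        uint column;
    }

    struct Token
    {
        Location location; // embedded by value: no separate allocation
        string value;
    }

    class AstNode
    {
        Location location; // part of the node's own memory block
    }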
May 12 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 13:24:11 UTC, deadalnix wrote:
 On 11/05/2012 13:50, Ary Manzana wrote:
 Usually tokens are used and discarded. I mean, somebody that 
 uses the
 lexer asks tokens, process them (for example to highlight code 
 or to
 build an AST) and then discards them. So you can reuse the 
 same Token
 instance. If you want to peek the next token, or have a buffer 
 of token,
 you can use a freelist ( 
 http://dlang.org/memory.html#freelists , one of
 the many nice things I learned by looking at DMD's source code 
 ).

 So adding line and column information is not like wasting a 
 lot of
 memory: just 8 bytes more for each token in the freelist.
SDC uses a struct Location to store such data. Every token and AST element has a member of type Location. As it is a value type, there is no need for a free list or anything of that sort. When the token is discarded, the location is discarded too. The same goes for AST nodes.
1. If a struct is a field of a heap-allocated object, it will be allocated and garbage-collected. Only if it exists solely on the stack (i.e., in a method body) is the GC not used. 2. For a struct which is often allocated and deallocated, a free list would allow skipping its initialization, which is especially important if the struct is large. This is true for both heap- and stack-allocated data.
May 12 2012
parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
 1. If a struct is a field of heap allocated object, it will be
 allocated and garbage collected. Only if it only exists on stack
 (i.e., in method body), GC is not used.
As far as I can tell, it won't be allocated on its own, since it is stored in the garbage-collected object as a value field. So using a freelist would actually increase the overhead: you have to manage the freelist and still do the allocation of the containing object.
May 12 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:
 1. If a struct is a field of heap allocated object, it will be
 allocated and garbage collected. Only if it only exists on 
 stack
 (i.e., in method body), GC is not used.
As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
Sorry for the confusion. In the first point I was trying to clarify that using a struct doesn't mean it is stack-allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context (nor in a majority of other contexts), since I don't do many allocations of Location. As is often the case, performance optimization should be done only when benchmarks show it is needed.
May 12 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 05/12/2012 11:08 PM, Roman D. Boiko wrote:
 On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:
 1. If a struct is a field of heap allocated object, it will be
 allocated and garbage collected. Only if it only exists on stack
 (i.e., in method body), GC is not used.
As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
Sorry for confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context, since I don't do many allocations of Location, as well as in a majority of other contexts. As it is often the case, performance optimization should be performed only when benchmarks show it is needed.
Well yes, but it seems to me that you are deliberately complicating the design in order to make it less efficient?
May 12 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 22:19:55 UTC, Timon Gehr wrote:
 On 05/12/2012 11:08 PM, Roman D. Boiko wrote:
 On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath 
 wrote:
 1. If a struct is a field of heap allocated object, it will 
 be
 allocated and garbage collected. Only if it only exists on 
 stack
 (i.e., in method body), GC is not used.
As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
Sorry for confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context, since I don't do many allocations of Location, as well as in a majority of other contexts. As it is often the case, performance optimization should be performed only when benchmarks show it is needed.
Well yes, but it seems to me that you are deliberately complicating the design in order to make it less efficient?
Do you mean something like what I summarized from Ary: http://forum.dlang.org/post/pggkobonzmgdwjigyaac forum.dlang.org ? If that is the case, could you please add your concerns to the dedicated issue: https://github.com/roman-d-boiko/dct/issues/15 ? Otherwise, if by overcomplicating you meant using a free list, I'd like to clarify that I don't plan to introduce it yet because I see no reason to use it in my context. Thanks
May 12 2012
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 10:01, Roman D. Boiko wrote:
 There were several discussions about the need for a D compiler library.

 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct

 Lexer is based on Brian Schott's project
 https://github.com/Hackerpilot/Dscanner, but it has been refactored and
 extended (and more changes are on the way).

 The goal is to have source code loading, lexer, parser and semantic
 analysis available as parts of Phobos. These libraries should be
 designed to be usable in multiple scenarios (e.g., refactoring, code
 analysis, etc.).

 My commitment is to have at least front end built this year (and
 conforming to the D2 specification unless explicitly stated otherwise
 for some particular aspect).

 Please post any feed here. A dedicated project web-site will be created
 later.
I think that the end goal of a project like this, putting a D frontend in Phobos, should be that the compiler is built using this library. This would result in the compiler and library always being in sync and having the same behavior. Otherwise it's easy for this to become just another tool that tries to lex and parse D code, always being out of sync with the compiler and not having the same behavior. For this to happen, for Walter to start using this, I think there would be a greater chance if the frontend was a port of the DMD frontend and not changed too much. -- /Jacob Carlborg
May 11 2012
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 11.05.2012 11:02, Jacob Carlborg wrote:
 For this to happen, for Walter to start using this, I think there would
 be a greater chance if the frontend was a port of the DMD frontend and
 not changed too much.
or a pure D version of it with these features:

-very fast in parsing/lexing - there needs to be a benchmark environment from the very start
-easy to extend and fix

and I think that is what Walter wants from a D-ish frontend in the first place
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:21:29 UTC, dennis luehring wrote:
 On 11.05.2012 11:02, Jacob Carlborg wrote:
 For this to happen, for Walter to start using this, I think 
 there would
 be a greater change if the frontend was a port of the DMD 
 frontend and
 not changed too much.
or a pure D version of it with the features: -very fast in parsing/lexing - there need to be a benchmark enviroment from the very start
Will add that to May roadmap.
 -easy to extend and fix

 and think that is what walter wants from an D-ish frontend in 
 the first place
I plan to go this way.
May 11 2012
parent reply dennis luehring <dl.soluz gmx.net> writes:
On 11.05.2012 11:33, Roman D. Boiko wrote:
  -very fast in parsing/lexing - there need to be a benchmark
  enviroment from the very start
Will add that to May roadmap.
are you using slices to prevent copying everything around? the parser/lexer needs to be as fast as the original one - maybe even faster - else it won't replace Walter's at any time - because speed matters very much here
May 11 2012
next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 10:01:17 UTC, dennis luehring wrote:
 On 11.05.2012 11:33, Roman D. Boiko wrote:
 -very fast in parsing/lexing - there need to be a benchmark
 enviroment from the very start
Will add that to May roadmap.
are you using slices to prevent copying everything around? the parser/lexer needs to be as fast as the original one - maybe even faster - else it won't replace Walter's at any time - because speed matters very much here
I tried optimizing code when it didn't complicate design. And more optimizations will be added later. It would be interesting to have benchmarks for comparing performance, but I don't plan to branch and edit DMD front end to prepare it for benchmarking (any time soon).
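To illustrate the slicing dennis is asking about - a minimal, hypothetical sketch, not necessarily how DCT stores token values - a token can keep a slice of the original source so that no text is copied:

    enum TokenKind { identifier, number, whitespace }

    struct Token
    {
        size_t index;   // start position in the source
        string value;   // slice of the source text: shares memory, no copy
        TokenKind kind;
    }

    // Slicing stores only a (pointer, length) pair pointing into source.
    Token makeToken(string source, size_t start, size_t end, TokenKind kind)
    {
        return Token(start, source[start .. end], kind);
    }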
May 11 2012
parent reply "alex" <info alexanderbothe.com> writes:
On Friday, 11 May 2012 at 10:40:43 UTC, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 10:01:17 UTC, dennis luehring wrote:
 On 11.05.2012 11:33, Roman D. Boiko wrote:
 -very fast in parsing/lexing - there need to be a benchmark
 enviroment from the very start
Will add that to May roadmap.
are you using slices to prevent copying everything around? the parser/lexer needs to be as fast as the original one - maybe even faster - else it won't replace Walter's at any time - because speed matters very much here
I tried optimizing code when it didn't complicate design. And more optimizations will be added later. It would be interesting to have benchmarks for comparing performance, but I don't plan to branch and edit DMD front end to prepare it for benchmarking (any time soon).
Ever thought of asking the VisualD developer to integrate your library into his IDE extension? Might be cool to do so because of extended completion abilities etc. (lol I'm the Mono-D dev -- but why not? ;D)
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 11:41:34 UTC, alex wrote:
 Ever thought of asking the VisualD developer to integrate your 
 library into his IDE extension? Might be cool to do so because 
 of extended completion abilities etc. (lol I'm the Mono-D dev 
 -- but why not? ;D)
Didn't think about that yet, because I don't use VisualD. I actually planned to analyse whether DCT could be integrated into Mono-D, so your feedback is welcome :)
May 11 2012
parent reply "alex" <info alexanderbothe.com> writes:
On Friday, 11 May 2012 at 11:55:47 UTC, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:41:34 UTC, alex wrote:
 Ever thought of asking the VisualD developer to integrate your 
 library into his IDE extension? Might be cool to do so because 
 of extended completion abilities etc. (lol I'm the Mono-D dev 
 -- but why not? ;D)
Didn't think about that yet, because I don't use VisualD. I actually planned to analyse whether DCT could be integrated into Mono-D, so your feedback is welcome :)
be easier to integrate into the second one :)
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 12:13:53 UTC, alex wrote:

 should be easier to integrate into the second one :)
Sorry, I meant D-IDE. But there might exist a reason to collaborate to make it usable for that. I have nothing against VisualD, just didn't think about it yet.
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 12:20:27 UTC, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 12:13:53 UTC, alex wrote:

 should be easier to integrate into the second one :)
Sorry, I meant D-IDE. But there might exist the reason to collaborate to make it usable for that.
May 11 2012
prev sibling parent deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 12:01, dennis luehring wrote:
 On 11.05.2012 11:33, Roman D. Boiko wrote:
 -very fast in parsing/lexing - there need to be a benchmark
 enviroment from the very start
Will add that to May roadmap.
are using slices for prevent coping everything around? the parser/lexer need to be as fast as the original one - maybe even faster - else it won't replace walters at any time - because speed does matter here very much
The best optimization is the one that brings your code from a non-working state to a working state.
May 11 2012
prev sibling next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:02:12 UTC, Jacob Carlborg wrote:
 If think that the end goal of a project like this, putting a D 
 frontend in Phobos, should be that the compiler should be built 
 using this library. This would result in the compiler and 
 library always being in sync and having the same behavior. 
 Otherwise it's easy this would be just another tool that tries 
 to lex and parse D code, always being out of sync with the 
 compiler and not having the same behavior.

 For this to happen, for Walter to start using this, I think 
 there would be a greater change if the frontend was a port of 
 the DMD frontend and not changed too much.
My plan is to create a front end that would be much better than the existing one, both in design and implementation. I decided to work on this full time for several months. The front end will not produce the same data as the DMD front end does, so most likely it will not be suitable to replace the existing C++ implementation. But since no information will be lost, I will consider creating a separate project which would build output compatible with the existing one (not identical - I don't think that would be feasible).
May 11 2012
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-05-11 11:31, Roman D. Boiko wrote:

 My plan is to create frontend that would be much better than existing,
 both in design and implementation. I decided to work on this full time
 for several months.
That's good news.
 Front end will not produce the same data as DMD front end does, so most
 likely it will not be suitable to replace existing C++ implementation.
That's too bad.
 But since no information will be lost I will consider creating a
 separate project which would build output compatible with existing (not
 identical, I don't think that would be feasible).
Ok. -- /Jacob Carlborg
May 11 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 11:31, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:02:12 UTC, Jacob Carlborg wrote:
 If think that the end goal of a project like this, putting a D
 frontend in Phobos, should be that the compiler should be built using
 this library. This would result in the compiler and library always
 being in sync and having the same behavior. Otherwise it's easy this
 would be just another tool that tries to lex and parse D code, always
 being out of sync with the compiler and not having the same behavior.

 For this to happen, for Walter to start using this, I think there
 would be a greater change if the frontend was a port of the DMD
 frontend and not changed too much.
My plan is to create frontend that would be much better than existing, both in design and implementation. I decided to work on this full time for several months.
If this is your plan, I'm very happy. We really should discuss this, to avoid duplicating effort as we are doing right now.
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 11:50:14 UTC, deadalnix wrote:
 On 11/05/2012 11:31, Roman D. Boiko wrote:
 My plan is to create frontend that would be much better than 
 existing,
 both in design and implementation. I decided to work on this 
 full time
 for several months.
I this is your plan, I'm very happy. We really should discuss to avoid duplicating effort as we are doing right now.
 I have started a similar stuff, and it is currently more 
 advanced than DCT is. I kind of decapitated sdc to do it.

 Maybe we should join effort instead of doing 2 separate 
 projects ?
That makes sense. Is it possible to switch SDC to the Boost license? I'm trying to keep it for all DCT code.
May 11 2012
parent deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 14:04, Roman D. Boiko wrote:
 That makes sense. Is it possible to switch SDC to the Boost license? I'm
 trying to keep it for all DCT code.
Let me do a clean package of my code this weekend. For now it is mixed with the SDC source code, which was enough as I was working alone, but isn't quite right for people to join the effort.
May 11 2012
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 11:02, Jacob Carlborg wrote:
 If think that the end goal of a project like this, putting a D frontend
 in Phobos, should be that the compiler should be built using this
 library. This would result in the compiler and library always being in
 sync and having the same behavior. Otherwise it's easy this would be
 just another tool that tries to lex and parse D code, always being out
 of sync with the compiler and not having the same behavior.

 For this to happen, for Walter to start using this, I think there would
 be a greater change if the frontend was a port of the DMD frontend and
 not changed too much.
From the beginning, I'm thinking about AST macros using CTFE.
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:
 From the beginning, I'm think AST macro using CTFE.
Could you please elaborate? I plan to strictly follow the published D specification. Exceptions from this rule are possible provided either of the following is true:

* new functionality has been implemented in DMD but is not included in the specification yet
* the specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from the specification
* a change is compatible with the specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce a post-processor after the lexer)

Some more exceptions might be added later, but the goal is to minimize differences.
May 11 2012
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 14:14, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:
 From the beginning, I'm think AST macro using CTFE.
Could you please elaborate? I plan to strictly follow published D specification. Exceptions from this rule are possible provided either of the following is true: * new functionality has been implemented in DMD but is not included into specification yet * specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from specification * change is compatible with specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce post-processor after lexer) Some more exceptions might be added later, but the goal is to minimize differences.
More explicitly, the goal isn't to implement a different language than D, simply to do the parsing/AST building in a way that would allow AST macros to be introduced later.

Your 3 points seem reasonable. Mine were:

* Implement something that can parse D as it is currently defined/implemented (if dmd's behavior and the spec differ, it is handled on a per-case basis).
* Discard all deprecated features. Don't even try to implement them, even if dmd currently supports them.
* Do the parsing in several steps to allow different tools to work with it.

I think we both have very compatible goals. Let me do a clean package of it and write about the design goals. I don't have that much time right now, so I will do it this weekend.
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 12:30:01 UTC, deadalnix wrote:
 Your 3 points seem reasonable. Mine were :
  * Implement something that can parse D as it is currently 
 defined/implemented (if dmd's behavior and spec differs, it is 
 handled on a per case basis).
All differences should be documented.
  * Discard all deprecated features. Not even try to implement 
 them even if dmd support them currently.
Yes, I forgot this one. Actually, I didn't discard imaginary floats, because I don't know what exactly will be done instead and it is easy to keep them.
  * Do the parsing in several steps to allow different tools to 
 work with it.
I was thinking about a pool of analysers, each of which would add some information. This could be more than is needed for semantic analysis. An analyser would be created whenever information (e.g., some indexing) is needed for a particular use case.
 I think we both have very compatibles goals. Let me do a clean 
 package of it I write about design goals. I don't have that 
 much time right now to do it, I will this week end.
I saw your ast_as_lib branch of SDC, but didn't dig deeper.
May 11 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 14:14, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:
 From the beginning, I'm think AST macro using CTFE.
Could you please elaborate? I plan to strictly follow published D specification.
That won't be easy; nobody knows what the specification is. TDPL, DMD or dlang.org?
 Exceptions from this rule are possible provided either of the following
 is true:
 * new functionality has been implemented in DMD but is not included into
 specification yet
 * specification is incorrect (has a bug) or incomplete, especially if
 DMD behavior differs from specification
 * change is compatible with specification and brings some significant
 improvement (e.g., this seems to be the case for my decision to
 introduce post-processor after lexer)

 Some more exceptions might be added later, but the goal is to minimize
 differences.
-- /Jacob Carlborg
May 11 2012
parent reply Rory McGuire <rjmcguire gmail.com> writes:
My guess is that the spec is TDPL + the TDPL errata. dlang.org should be
updated as people notice inaccuracies.
This project would be an ideal time to update dlang.org as people notice
it's not in sync with TDPL.
If TDPL doesn't cover it, then the community should review it.


On Fri, May 11, 2012 at 3:17 PM, Jacob Carlborg <doob me.com> wrote:

 On 2012-05-11 14:14, Roman D. Boiko wrote:

 On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:

 From the beginning, I'm think AST macro using CTFE.
Could you please elaborate? I plan to strictly follow published D specification.
That won't be easy, nobody know what the specification is . TDPL, DMD or dlang.org? Exceptions from this rule are possible provided either of the following
 is true:
 * new functionality has been implemented in DMD but is not included into
 specification yet
 * specification is incorrect (has a bug) or incomplete, especially if
 DMD behavior differs from specification
 * change is compatible with specification and brings some significant
 improvement (e.g., this seems to be the case for my decision to
 introduce post-processor after lexer)

 Some more exceptions might be added later, but the goal is to minimize
 differences.
-- /Jacob Carlborg
May 11 2012
next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 13:30:49 UTC, Rory McGuire wrote:
 ? my guess is that the spec is TDPL + TDPL errata. dlang.org 
 should be
 updated as people notice inaccuracies.
 This project would be an ideal time to update dlang.org as 
 people notice
 its not in sync with TDPL.
 If TDPL doesn't cover it then the community should review it.
Whenever I'm in doubt, I try to analyze both TDPL and dlang.org, and also consult the DMD sources. But in general, when I write "specification" I mean dlang.org. I agree that it should be maintained, but I understand that it is difficult to do. I'll try to summarize the problems which I discover (I have already found many), but it is quite a time-consuming activity.
May 11 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-05-11 15:30, Rory McGuire wrote:
 ? my guess is that the spec is TDPL + TDPL errata. dlang.org
 <http://dlang.org> should be updated as people notice inaccuracies.
 This project would be an ideal time to update dlang.org
 <http://dlang.org> as people notice its not in sync with TDPL.
 If TDPL doesn't cover it then the community should review it.
None of these are in sync. -- /Jacob Carlborg
May 11 2012
next sibling parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 14:45:45 UTC, Jacob Carlborg wrote:
 On 2012-05-11 15:30, Rory McGuire wrote:
 ? my guess is that the spec is TDPL + TDPL errata. dlang.org
 <http://dlang.org> should be updated as people notice 
 inaccuracies.
 This project would be an ideal time to update dlang.org
 <http://dlang.org> as people notice its not in sync with TDPL.
 If TDPL doesn't cover it then the community should review it.
None of these are in sync.
I suggest choosing a dedicated place where it is possible to discuss differences and other problems with the specification. This forum?
May 11 2012
prev sibling next sibling parent Rory McGuire <rjmcguire gmail.com> writes:
Yip, but TDPL still has to take precedence because that is the one that
Walter + Andrei + the community put the most focused effort into.


On Fri, May 11, 2012 at 4:45 PM, Jacob Carlborg <doob me.com> wrote:

 On 2012-05-11 15:30, Rory McGuire wrote:

 ? my guess is that the spec is TDPL + TDPL errata. dlang.org
 <http://dlang.org> should be updated as people notice inaccuracies.

 This project would be an ideal time to update dlang.org
 <http://dlang.org> as people notice its not in sync with TDPL.

 If TDPL doesn't cover it then the community should review it.
None of these are in sync. -- /Jacob Carlborg
May 11 2012
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, May 11, 2012 23:00:15 Rory McGuire wrote:
 yip, but TDPL still has to take precedence because that is the one that
 walter + andrei + community put the most focused effort into.
It doesn't necessarily. Each place that they differ is examined individually and decided on its own merits. There is no automatic "TDPL says it's so, so it's so." There _are_ cases where things have even been explicitly changed after TDPL was released, making TDPL out-of-date for that particular item (e.g., strong vs weak purity is not discussed in TDPL at all, because it didn't exist at the time). In general, however, TDPL _is_ correct, and when there's a difference, it's simply a case of the documentation or compiler not having been updated accordingly. We try to avoid making any changes that would conflict with TDPL. - Jonathan M Davis
May 11 2012
prev sibling next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 11.05.2012 10:01, Roman D. Boiko wrote:
 There were several discussions about the need for a D compiler
 library.

 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct

 Lexer is based on Brian Schott's project
 https://github.com/Hackerpilot/Dscanner, but it has been
 refactored and extended (and more changes are on the way).

 The goal is to have source code loading, lexer, parser and
 semantic analysis available as parts of Phobos. These libraries
 should be designed to be usable in multiple scenarios (e.g.,
 refactoring, code analysis, etc.).

 My commitment is to have at least front end built this year (and
 conforming to the D2 specification unless explicitly stated
 otherwise for some particular aspect).

 Please post any feed here. A dedicated project web-site will be
 created later.
Does the parser/lexer allow parsing of half-finished syntax, so it is usable in an IDE for syntax highlighting while coding?
May 11 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:
 does the parser/lexer allow half-finished syntax parsing? for 
 being useable in an IDE for syntax-highlighting while coding?
That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).
May 11 2012
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
On 11.05.2012 11:23, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:
  does the parser/lexer allow half-finished syntax parsing? for
  being useable in an IDE for syntax-highlighting while coding?
That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).
Try to syntax-highlight while coding - that's the scenario; parts of the source code aren't fully valid while writing.
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Friday, 11 May 2012 at 09:28:36 UTC, dennis luehring wrote:
 On 11.05.2012 11:23, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:
 does the parser/lexer allow half-finished syntax parsing? for
 being useable in an IDE for syntax-highlighting while coding?
That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).
try to syntaxhiglight while coding - thats the scenario, parts of the source code isn't fully valid while writing
It depends on the IDE. For example, Sublime Text (and most likely TextMate) uses regexes for syntax highlighting. That makes it impossible to use for D in some scenarios (like nested block comments). Any IDE that provides an API for coloring will get correct information if the code is valid. If it is not valid, it is only possible to handle specific kinds of errors, but in general there will always be cases when highlighting (or some parsing / semantic analysis information) is "incorrect". I aim to handle common errors gracefully. In practice, I think it is possible to handle 99% of problems, but this requires a lot of design / specification work.
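For illustration, D's nesting /+ +/ comments are one such scenario: a regular expression cannot track the nesting depth, while a real lexer can.

    /+
        Outer comment.
        /+ Nested comment - a regex-based highlighter typically thinks the
           whole comment ends at the first +/
        ...but a lexer knows this line is still inside the outer comment.
    +/
    void main() {}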
May 11 2012
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2012-05-11 11:23, Roman D. Boiko wrote:
 On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:
 does the parser/lexer allow half-finished syntax parsing? for being
 useable in an IDE for syntax-highlighting while coding?
That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).
Example from TextMate:

* "void" - keyword is colored
* "module main" - nothing is colored until I type a semicolon
* "module main;" - keyword and "main" are colored (differently)
* "void foo" - keyword is colored
* "void foo (" - keyword and "foo" are colored (differently)
* "struct F" - keyword and "F" are colored (differently)
* Literals are always colored.
* User-defined constants are always colored. It's basically any token that is all upper case.

-- /Jacob Carlborg
May 11 2012
prev sibling next sibling parent deadalnix <deadalnix gmail.com> writes:
On 11/05/2012 10:01, Roman D. Boiko wrote:
 There were several discussions about the need for a D compiler library.

 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct

 Lexer is based on Brian Schott's project
 https://github.com/Hackerpilot/Dscanner, but it has been refactored and
 extended (and more changes are on the way).

 The goal is to have source code loading, lexer, parser and semantic
 analysis available as parts of Phobos. These libraries should be
 designed to be usable in multiple scenarios (e.g., refactoring, code
 analysis, etc.).

 My commitment is to have at least front end built this year (and
 conforming to the D2 specification unless explicitly stated otherwise
 for some particular aspect).

 Please post any feed here. A dedicated project web-site will be created
 later.
I have started similar stuff, and it is currently more advanced than DCT is. I kind of decapitated SDC to do it. Maybe we should join efforts instead of doing 2 separate projects?
May 11 2012
prev sibling next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, May 11, 2012 10:01:28 Roman D. Boiko wrote:
 There were several discussions about the need for a D compiler
 library.
 
 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct
 
 Lexer is based on Brian Schott's project
 https://github.com/Hackerpilot/Dscanner, but it has been
 refactored and extended (and more changes are on the way).
 
 The goal is to have source code loading, lexer, parser and
 semantic analysis available as parts of Phobos. These libraries
 should be designed to be usable in multiple scenarios (e.g.,
 refactoring, code analysis, etc.).
 
 My commitment is to have at least front end built this year (and
 conforming to the D2 specification unless explicitly stated
 otherwise for some particular aspect).
 
 Please post any feed here. A dedicated project web-site will be
 created later.
It's great that you're working on this. We definitely need more of this sort of stuff. However, I would point out that I'm not sure that it will be acceptable for Phobos (it _may_ be, but it's not quite what we've been looking for). There are two types of lexers/parsers that we've been looking to add to Phobos.

1. A direct port of the lexer from dmd's C++ front-end. The API would be D-ified to use ranges, but the internals would be almost identical to the C++ front-end so that porting changes back and forth would be very easy. It also would make it easier to replace the C++ front-end with a D one later if the D one is very close to the C++ one.

2. A lexer/parser generator using templates and/or CTFE to generate a lexer/parser of whatever language using a BNF or EBNF grammar - probably similar to what Philippe Sigaud has done to generate a parser using PEGs. Such could be used to generate a lexer/parser for D or any other language.

What you seem to be doing is creating a D-specific parser from scratch, which is great, but it's not quite what we've been looking to add to Phobos. That doesn't necessarily mean that we can't or won't add it to Phobos once it's complete (especially if nothing else gets done first), but it also may not be; that's to be decided once it's done and ready to be reviewed for inclusion in Phobos. I'm not trying to discourage you at all. I'm just pointing out that what you're doing is not quite what we're looking for. Regardless, it _would_ be interesting to be able to compare such a lexer/parser with what you're doing and see which performs better.

- Jonathan M Davis
May 11 2012
parent "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 12 May 2012 at 05:41:16 UTC, Jonathan M Davis wrote:
 It's great that you're working on this. We definitely need more 
 of this sort of
 stuff. However, I would point out that I'm not sure that it 
 will be acceptable
 for Phobos (it _may_ be, but it's not quite what we've been 
 looking for).
 There are two types of lexers/parsers that we've been looking 
 to add to
 Phobos.

 1. A direct port of the lexer from dmd's C++ front-end. The API 
 would be D-
 ified to use ranges, but the internals would be almost 
 identical to the C++
 front-end so that porting changes back and forth would be very 
 easy. It also
 would make it easier to replace the C++ front-end with a D one 
 later if the D
 one is very close to the C++ one.

 2. A lexer/parser generator using templates and/or CTFE to 
 generate a
 lexer/parser of whatever language using a BNF or EBNF grammar - 
 probably
 similar to what Philippe Sigaud has done to generate a parser 
 using PEGs. Such
 could be used to generate a lexer/parser for D or any other 
 language.

 What you seem to be doing is creating a D-specific parser from 
 scratch, which
 is great, but it's not quite what we've been looking to add to 
 Phobos. That
 doesn't necessarily mean that we can't or won't add it to 
 Phobos once it's
 complete (especially if nothing else gets done first), but it also may not 
 be; that's to be decided once it's done and ready to be reviewed for 
 inclusion in Phobos. I'm not trying to discourage you at all. I'm just 
 pointing out that what you're doing is not quite what we're looking for. 
 Regardless, it _would_ be interesting to be able to compare such a 
 lexer/parser with what you're doing and see which performs better.

 - Jonathan M Davis
The need I'm trying to address is making development of tools for D much easier. Tools are needed for D to become more widely used in practice. A secondary goal (a nice-to-have) would be to build a compiler on top of that. I plan to do both, but I don't intend to be backward-compatible with the DMD backend. I hope that at least part of my code will either influence or be reused in the reference D compiler (whatever that will be in the future). It is important to have a reference front-end implementation which can be used in many tools; otherwise a lot of effort would be completely wasted and no tool would comply with the D specification and compilers.
May 11 2012
prev sibling parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Fri, 11 May 2012 10:01:28 +0200,
"Roman D. Boiko" <rb d-coding.com> wrote:

 There were several discussions about the need for a D compiler 
 library.
 
 I propose my draft implementation of lexer for community review:
 https://github.com/roman-d-boiko/dct
 
 Lexer is based on Brian Schott's project 
 https://github.com/Hackerpilot/Dscanner, but it has been 
 refactored and extended (and more changes are on the way).
 
 The goal is to have source code loading, lexer, parser and 
 semantic analysis available as parts of Phobos. These libraries 
 should be designed to be usable in multiple scenarios (e.g., 
 refactoring, code analysis, etc.).
 
 My commitment is to have at least front end built this year (and 
 conforming to the D2 specification unless explicitly stated 
 otherwise for some particular aspect).
 
 Please post any feed here. A dedicated project web-site will be 
 created later.
A general purpose D front-end library has been discussed several times. So at least for some day-dreaming about better IDE support and code analysis tools it has a huge value. It is good to see that someone finally takes the time to implement one. My only concern is that not enough planning went into the design. I think about things like brainstorming and collecting possible use cases from the community, or looking at how some C++ tools do their job and what infrastructure they are built on. Although it is really difficult to tell from other people's code which decision was 'important' and what was just the author's way to do it.

Inclusion into Phobos I would not see as a priority. As Jonathan said, there are already some clear visions of what such modules would look like. Also, if any project seeks to replace the DMD front-end, Walter should be the technical advisor. Like anyone putting a lot of time and effort into a design, he could have strong feelings about certain decisions and implementing them in a seemingly worse way.

That said, you give the impression of being really dedicated to the project, even giving yourself a timeline, which is a good sign. I wish you all the best for the project and hope that - even without any 'official' recognition - it will see a lot of use as the base of D tools.

To learn about parsing I wrote a syntax highlighter for the DCPU-16 assembly of Minecraft author Markus Persson. (Which was a good exercise.) Interestingly, I ended up with a similar design for the Token struct to yours:

- TokenType/TokenKind enum
- Trie for matching token kinds (why do you expand it into nested switch-case through CTFE mixins though?)

Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple, and I am sure it won't get any faster ;).

One notable difference is that I only check for isEoF in one location, since I append "\0" to the source code as a stop token (which can also serve as an end-of-line indicator). (The general idea is to move checks outside of loops.)

** Errors **

I generally keep the start and end column, in case someone wants that. A real world example:

  ubyte x = ...;
  if (x >= 0 && x <= 8) ...
  Line 2, warning: Comparison is always true.

At first glance you say "No, a byte can become larger than 8.", but the compiler is just complaining about the first part of the expression here, which would be clarified by e.g. some ASCII/Unicode art:

  Line 2, warning: Comparison is always true:
  if (x >= 0 && x <= 8) ...
      ^----^

** Code highlighting **

Especially Gtk's TextView (which GtkSourceView is based on) isn't made for several thousand tokens. The text view will take a second per 20000 tokens or so. The source view works around that by highlighting in chunks, but you can still feel the delay when you jump to the end of the file in gedit, and even MonoDevelop suffers from the bad performance. If I understood your posts correctly, you are already planning for minimal change sets. It is *very* important to only update changed parts in a syntax highlighting editor. So for now I ended up writing an 'insertText' and a 'deleteText' method for the parser that take 'start index', 'text' and 'start index', 'end index' respectively and return a list of removed and added tokens.

If possible, make sure that your library can be used with GtkSourceView, Scintilla and QSyntaxHighlighter. Unfortunately the bindings for the last two could use an update, but the API help should get you an idea of how to use it most efficiently.

-- Marco
May 19 2012
parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Saturday, 19 May 2012 at 20:35:10 UTC, Marco Leise wrote:
 On Fri, 11 May 2012 10:01:28 +0200,
 "Roman D. Boiko" <rb d-coding.com> wrote:

 There were several discussions about the need for a D compiler 
 library.
 
 I propose my draft implementation of lexer for community 
 review:
 https://github.com/roman-d-boiko/dct
(I decided to comment on both my post and your reply.) I've got *a lot* of feedback from the community, which caused a significant redesign (but also a delay of as yet unknown duration). I'll write more on that later. Thanks a lot to everybody!
 Lexer is based on Brian Schott's project 
 https://github.com/Hackerpilot/Dscanner, but it has been 
 refactored and extended (and more changes are on the way).
Looks like I'm going to replace this implementation almost completely.
 The goal is to have source code loading, lexer, parser and 
 semantic analysis available as parts of Phobos. These 
 libraries should be designed to be usable in multiple 
 scenarios (e.g., refactoring, code analysis, etc.).
See below.
 My commitment is to have at least front end built this year 
 (and conforming to the D2 specification unless explicitly 
 stated otherwise for some particular aspect).
By the D specification I mean http://dlang.org/language-reference.html, or (when the specification is not clear to me) one of the following:

* TDPL,
* the current DMD implementation,
* community feedback.

(Note that I may assume that I understand some aspect, but later revisit it if needed.)
 Please post any feed here. A dedicated project web-site will 
 be created later.
It is going to be http://d-coding.com, but it is not usable yet (I don't have much web-design practice). It is hosted at https://github.com/roman-d-boiko/roman-d-boiko.github.com; pull requests are welcome.
 A general purpose D front-end library has been discussed 
 several times. So at least for some day-dreaming about better 
 IDE support and code analysis tools it has a huge value. It is 
 good to see that someone finally takes the time to implement 
 one. My only concern is that not enough planing went into the 
 design.
Could you name a few specific concerns? The reason for this thread was to reflect on the design early. OTOH, I didn't spend time documenting my design goals and trade-offs, so the discussion turned into a brainstorm and wasn't always productive. But now that I see how much value I can get even without doing my homework, I'm much more likely to provide better documentation and ask more specific questions.
I think about things like brainstorming and collecting
 possible use cases from the community or looking at how some 
 C++ tools do their job and what infrastructure they are built 
 on. Although it is really difficult to tell from other people's 
 code which decision was 'important' and what was just the 
 author's way to do it.
I'm going to pick up several use cases and prioritize them according to my judgement. Feel free to suggest any cases that you think are needed (with motivation). Prioritizing is necessary to define what is out of scope and plan work into milestones, in order to ensure the project is feasible.
 Inclusion into Phobos I would not see as a priority. As 
 Jonathan said, there are already some clear visions of how such 
 modules would look like.
Well, *considering* such inclusion seriously would help improve the quality of the project. But it is not necessarily what will happen. Currently I think that my goals are close to those of use case 2 from Jonathan's reply. But until the project is reasonably complete, it is not the time to argue whether to include it (or part of it).
 Also if any project seeks to replace the DMD front-end, Walter 
 should be the technical advisor. Like anyone putting a lot of 
 time and effort into a design, he could have strong feelings 
 about certain decisions and implementing them in a seemingly 
 worse way.
Not a goal currently, because that would make the project significantly less likely to be completed ever.
 That said, you make the impression of being really dedicated to 
 the project, even giving yourself a time line, which is a good 
 sign. I wish you all the best for the project and hope that - 
 even without any 'official' recognition - it will see a lot of 
 use as the base of D tools.
Well, I hope that some more people will join the project once it stabilizes and delivers something useful. By the way, is anybody *potentially* interested to join?
 To learn about parsing I wrote a syntax highlighter for the 
 DCPU-16 assembly of Minecraft author Markus Persson. (Which was 
 a good exercise.) Interestingly I ended up with a similar 
 design for the Token struct like yours:

 - TokenType/TokenKind enum
 - Trie for matching token kinds (why do you expand it into 
 nested switch-case through CTFE mixins though?)
Switch code generation is something temporary that works currently and helps me evaluate possible problems with the future design, which is planned to be a compile-time finite automaton (likely deterministic).
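For illustration only - a tiny, hypothetical CTFE-generated string switch; DCT's real generator produces nested switches over characters, which this sketch does not attempt:

    import std.stdio;

    enum TokenKind { identifier, kwIf, kwElse }

    // Builds the source code of a string switch at compile time (CTFE).
    string genKeywordSwitch(string[] keywords, string[] kinds)
    {
        string code = "switch (word)\n{\n";
        foreach (i, kw; keywords)
            code ~= `    case "` ~ kw ~ `": return TokenKind.` ~ kinds[i] ~ ";\n";
        return code ~ "    default: return TokenKind.identifier;\n}";
    }

    TokenKind classify(string word)
    {
        mixin(genKeywordSwitch(["if", "else"], ["kwIf", "kwElse"]));
    }

    void main()
    {
        writeln(classify("if")); // kwIf
        writeln(classify("x"));  // identifier
    }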
 Since assembly code is usually small I just preallocate an 
 array of sourceCode.length tokens and realloc it to the correct 
 size when I'm done parsing. Nothing pretty, but simple and I am 
 sure it won't get any faster ;).
I'm sure it will :) (I'm going to elaborate on this some time later).
 One notable difference is that I only check for isEoF in one 
 location, since I append "\0" to the source code as a stop 
 token (that can also serve as an end-of-line indicator). (The 
 general idea is to move checks outside of loops.)
There are several EoF conditions: \0, \x1A, __EOF__ and physical EoF. And any loop would need to check for all of them. Preallocation could eliminate the need to check for physical EoF, but it would make it impossible to avoid copying the input string, and it also would not allow passing a slice of a string to a function to be lexed.
 ** Errors  **
 I generally keep the start and end column, in case someone 
 wants that. A real world example:

   ubyte x = ...;
   if (x >= 0 && x <= 8) ...
   Line 2, warning: Comparison is always true.

 At first glace you say "No, a byte can become larger than 8.", 
 but the compiler is just complaining about the first part of 
 the expression here, which would be clarified by e.g. some 
 ASCII/Unicode art:

   Line 2, warning: Comparison is always true:
   if (x >= 0 && x <= 8) ...
       ^----^
This functionality has been a priority from the beginning. It is not implemented yet, but designed for. Evaluating column and line only on demand is based on the assumption that such information is rarely needed (primarily to display messages to the user). My new design also addresses the use cases where it is needed frequently.
 ** Code highlighting **
 Especially Gtk's TextView (which GtkSourceView is base on) 
 isn't made for several thousand tokens. The text view will take 
 a second per 20000 tokens or so. The source view works around 
 that by highlighting in chunks, but you can still fell the 
 delay when you jump to the end of the file in gedit and even 
 MonoDevelop suffers from the bad performance. If I understood 
 your posts correctly, you are already planning for minimal 
 change sets. It is *very* important to only update changed 
 parts in a syntax highlighting editor. So for now I ended up 
 writing a 'insertText' and 'deleteText' method for the parser 
 that take 'start index', 'text' and 'start index', 'end index' 
 respectively and return a list of removed and added tokens.
 If possible, make sure that your library can be used with 
 GtkSourceView, Scintilla and QSyntaxHighlighter. Unfortunately 
 the bindings for the last two could use an update, but the API 
 help should get you an idea of how to use it most efficiently.
Currently I plan to support GtkSourceView, Scintilla, and a custom editor API which I plan to define so that it is most efficient in this respect. I haven't worked on that thoroughly yet. Incremental changes are the key to efficiency, and I'm going to invest a lot of effort into making them work well. Also, immutability of data structures will enable many optimizations.
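A hedged sketch of the kind of incremental interface Marco describes above (insertText/deleteText returning removed and added tokens); the names and signatures are illustrative, not an existing DCT or editor API:

    struct Token { size_t index; string value; }

    struct TokenDelta
    {
        Token[] removed; // tokens invalidated by the edit
        Token[] added;   // tokens produced by re-lexing the affected range
    }

    interface IncrementalLexer
    {
        // Re-lex only the region affected by inserting text at start.
        TokenDelta insertText(size_t start, string text);

        // Re-lex only the region affected by deleting [start .. end).
        TokenDelta deleteText(size_t start, size_t end);
    }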
May 20 2012
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Sun, 20 May 2012 10:09:34 +0200,
"Roman D. Boiko" <rb d-coding.com> wrote:

 Could you name a few specific concerns?
Mostly my own gut feeling that things which sound great in my head turn out to bite me in the end - things one just doesn't think of because of the limited horizon everyone has, and probably a lack of experience in the field of compiler/tools writing. On the other hand, I have good experience with working from community feedback. So the rough edges in the design may already be ironed out by now.
 I'm going to pick up several use cases and prioritize them 
 according to my judgement. Feel free to suggest any cases that 
 you think are needed (with motivation). Prioritizing is necessary 
 to define what is out of scope and plan work into milestones, in 
 order to ensure the project is feasible.
There is one feature I remember caused some headaches for Code::Blocks. They used a separate parser instance for every project in the IDE, which meant that all the standard include files would be parsed and kept in memory multiple times. When they later switched to a common parser for all projects they ran into stability issues. If you can arrange it, it would be great for multi-project IDEs to be able to add and remove projects to your parser without reparsing Phobos/druntime (which may have its .di files replaced by .d files, looking at the past discussion). C bindings could be an option. (As in: the smallest common denominator.) They would allow other languages (Python, ...) to use your library.
 Since assembly code is usually small I just preallocate an 
 array of sourceCode.length tokens and realloc it to the correct 
 size when I'm done parsing. Nothing pretty, but simple and I am 
 sure it won't get any faster ;).
I'm sure it will :) (I'm going to elaborate on this some time later).
I'm curious.
 There are several EoF conditions: \0, \x1A, __EOF__ and physicas 
 EOF. And any loop would need to check for all. Preallocation 
 could eliminate the need to check for physical EoF, but would 
 make it impossible to avoid input string copying and also would 
 not allow passing a function a slice of string to be lexed.
I found that I usually either load from a file into a newly allocated buffer (where a copy occurs, only because I forgot about assumeUnique in std.exception), or I am editing the file, in which case I recreate the source string after every keystroke anyway. I can still pass slices of that string to functions though. Not sure what you mean.

It probably doesn't work for D as well as for ASM code, but I could also check for \x1A and __EOF__ in the same fashion. (By the way, is it \x1A (substitute, ^Z) or did you mean \x04 (end-of-transmission, ^D)?)

The way it works is: parser states like 'in expression' can safely peek at the next character at any time. If it doesn't match what they expect, they emit an error and drop back to the "surrounding" parser state. When they reach the "file" level, that's the only place where an EOF (which will only occur once per file anyway) will be consumed. In theory one can drop all string length checks and work on char* directly with a known terminator char that is distinct from anything else.
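A minimal sketch of the sentinel idea Marco describes, assuming a '\0' is appended to the buffer up front; this is not DCT's approach, which also has to accept arbitrary slices:

    // One copy up front when loading; afterwards no length checks are needed.
    string loadWithSentinel(string text)
    {
        return text ~ '\0';
    }

    // Skip an identifier starting at p; the sentinel guarantees termination,
    // because '\0' fails the character test and ends the loop.
    size_t skipIdentifier(const(char)* p)
    {
        const(char)* start = p;
        while ((*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z') || *p == '_')
            ++p;
        return cast(size_t)(p - start);
    }

    // Usage: auto src = loadWithSentinel(sourceText);
    //        auto len = skipIdentifier(src.ptr);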
 ** Errors  **
 I generally keep the start and end column, in case someone 
 wants that.
This functionality has been the priority from the beginning. Not implemented yet but designed for. Evaluation of column and line only on demand is caused by the assumption that such information is needed rarely (primarily to display information to the user). My new design also addresses the use cases when it is needed frequently. [...] Incremental changes are the key to efficiency, and I'm going to invest a lot of effort into making them. Also immutability of data structures will enable many optimizations.
No more questions! -- Marco
May 20 2012
next sibling parent reply "Roman D. Boiko" <rb d-coding.com> writes:
On Sunday, 20 May 2012 at 17:42:59 UTC, Marco Leise wrote:
 There is one feature I remember caused some head-aches for 
 Code::Blocks. They used a separate parser instance for every 
 project in the IDE, which meant that all the standard include 
 files would be parsed and kept in memory multiple times. When 
 they later switched to a common parser for all projects they 
 ran into stability issues. If you can arrange it, it would be 
 great for multi-project IDEs to be able to add and remove 
 projects to your parser without reparsing Phobos/druntime 
 (which may have its .di files replaced by .d files, looking at 
 the past discussion).
The opposite situation here: I didn't think about a parser per project :) So I guess there is no need to worry.
 C bindings could be an option. (As in: the smallest common 
 denominator.) They would allow other languages (Python, ...) to use your library.
Yeah, but I'm far from there yet.
 Since assembly code is usually small I just preallocate an 
 array of sourceCode.length tokens and realloc it to the 
 correct size when I'm done parsing. Nothing pretty, but 
 simple and I am sure it won't get any faster ;).
I'm sure it will :) (I'm going to elaborate on this some time later).
I'm curious.
Maybe I don't understand your use case, but the idea is that if you parse as you type, it should be possible to avoid parsing and allocating memory for those lines which have not changed. That is not compatible with storing tokens in an array, since it would require reallocating memory on each change, so another data structure should be used (e.g., a linked list or, if efficient lookup is needed, a red-black tree). Only benchmarks can show whether (and how much) my approach would be faster for a specific situation (input patterns such as average size of data, complexity of parsing algorithms, requirements, etc.).
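As a toy illustration of what I mean (not the actual design, and the container choice is just one option), tokens could be grouped per line in a red-black tree, so that editing one line replaces only that line's entry:

import std.container.rbtree : RedBlackTree;

struct LineTokens
{
    size_t line;
    string[] tokens;   // stand-in for real Token values
}

void main()
{
    // Tokens grouped per line, ordered by line number.
    auto index = new RedBlackTree!(LineTokens, "a.line < b.line");
    index.insert(LineTokens(1, ["import", "std", ".", "stdio", ";"]));
    index.insert(LineTokens(2, ["void", "main", "(", ")", "{", "}"]));

    // A key stroke on line 2: replace only that node; line 1 is untouched.
    index.removeKey(LineTokens(2));
    index.insert(LineTokens(2, ["void", "main", "(", ")", "{", "writeln", ";", "}"]));
}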
 I found that I usually either load from file into a newly 
 allocated buffer (where a copy occurs, only because I forgot 
 about assumeUnique in std.exception) or I am editing the file 
 in which case I recreate the source string after every key 
 stroke anyway. I can still pass slices of that string to 
 functions though. Not sure what you mean.
Answered below.
 It probably doesn't work for D as well as for ASM code, but I 
 could also check for \x1A and __EOF__ in the same fashion.
 (By the way, is it \x1A (substitute, ^Z) or did you mean \x04 
 (end-of-transmission, ^D)?)
D has the following EoF cases: \0, \x1A, physical EoF, and the __EOF__ special token when it appears outside comments and certain string literals. ^D is not in this list.
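In other words, an end-of-source check has to look something like this (hypothetical helper, not the lexer's API; __EOF__ is handled at the token level because it must be ignored inside comments and some string literals):

bool atEndOfSource(string source, size_t pos)
{
    return pos >= source.length        // physical EoF
        || source[pos] == '\0'         // NUL
        || source[pos] == '\x1A';      // substitute character (^Z)
}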
 The way it works is: Parser states like 'in expression' can 
 safely peek at the next character at any time. If it doesn't 
 match what they expect they emit an error and drop back to the 
 "surrounding" parser state. When they reach the "file" level, 
 that's the only place where an EOF (which will only occur once 
 per file anyway) will be consumed.
 In theory one can drop all string length checks and work on 
 char* directly with a known terminator char that is distinct 
 from anything else.
If you want to pass a slice of the input string to a function, you cannot append \0 to it without copying data. If you don't append some pre-defined terminator, you must check for length *and* all supported terminating characters. On the other hand, your design might not require passing slices, and if the language syntax allows deterministic parsing (when you know what to expect next), there is no need to check for EoF at all.
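To illustrate the copy (a trivial standalone example, nothing from the library): appending a terminator to a slice of someone else's buffer always allocates, because the bytes right after the slice are not ours to overwrite.

void main()
{
    string whole = "int x = 1; // imagine much more code here";
    const(char)[] part = whole[0 .. 10];   // a slice handed to the lexer

    auto terminated = part ~ '\0';         // new allocation + copy
    assert(terminated.ptr !is part.ptr);   // the original buffer is untouched
}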
May 20 2012
parent Marco Leise <Marco.Leise gmx.de> writes:
On Sun, 20 May 2012 20:37:07 +0200, "Roman D. Boiko" <rb d-coding.com> wrote:

 Since assembly code is usually small I just preallocate an 
 array of sourceCode.length tokens and realloc it to the 
 correct size when I'm done parsing. Nothing pretty, but 
 simple and I am sure it won't get any faster ;).
I'm sure it will :) (I'm going to elaborate on this some time later).
I'm curious.
Maybe I don't understand your use case, but the idea is that if you parse as you type, it should be possible to avoid parsing and allocating memory for those lines which have not changed. That is not compatible with storing tokens in an array, since it would require reallocating memory on each change, so another data structure should be used (e.g., a linked list or, if efficient lookup is needed, a red-black tree). Only benchmarks can show whether (and how much) my approach would be faster for a specific situation (input patterns such as average size of data, complexity of parsing algorithms, requirements, etc.).
I had only thought about the single-use situation, not about editing files. Now the idea has evolved a bit, and you have probably thought about storing the scope hierarchy in a tree, too. It is still difficult to write a fast parser when someone can add or remove a '{' somewhere at the top of datetime.d, which changes the interpretation of the rest of the file. Mono-D has some hiccups worse than those (several seconds even) - maybe a bug. I'll keep my fingers crossed.
 If you want to pass a slice of the input string to a function, 
 you cannot append \0 to it without copying data. If you don't 
 append some pre-defined terminator, you must check for length 
 *and* all supported terminating characters. On the other hand, 
 your design might not require passing slices, and if the 
 language syntax allows deterministic parsing (when you know 
 what to expect next), there is no need to check for EoF at all.
Now I get it! -- Marco
May 20 2012