digitalmars.D.announce - DCT: D compiler as a collection of libraries
- Roman D. Boiko (16/16) May 11 2012 There were several discussions about the need for a D compiler
- Jacob Carlborg (10/25) May 11 2012 (Re-posting here)
- Roman D. Boiko (23/31) May 11 2012 I consider it a draft state, because it has got several rewrites
- Jacob Carlborg (9/40) May 11 2012 Ok.
- Roman D. Boiko (18/33) May 11 2012 Indices of the first code unit of each line are stored inside
- Jacob Carlborg (10/32) May 11 2012 Aha, clever. As long as I can get out the information I'm happy :) How
- Roman D. Boiko (10/42) May 11 2012 There is a method for that in Lexer interface, for me it looks
- Jacob Carlborg (7/24) May 11 2012 Found it now, "calculateFor". It not sure if it's the most intuitive
- Roman D. Boiko (5/23) May 11 2012 calculateLocation was original name, but I don't like repeating
- Jacob Carlborg (7/15) May 11 2012 My original suggestion was to have the functionality in Token, which
- Roman D. Boiko (7/22) May 11 2012 What about the following signature: Location locate(size_t index)?
- Jacob Carlborg (13/19) May 11 2012 That is better although I would prefer to pass in a token (assuming that...
- Roman D. Boiko (9/28) May 11 2012 It is also what I was planning to do after I posted my last
- deadalnix (9/30) May 11 2012 I don't really see the benefit of this. You are trading a O(1) operation...
- Roman D. Boiko (10/23) May 11 2012 It would be interesting to see benchmarks. But anyway log(1M) is
- Roman D. Boiko (7/14) May 11 2012 Technically, I'm trading N*0(1) operations needed to track line
- deadalnix (2/15) May 11 2012 N can easily be number of tokens.
- Roman D. Boiko (17/28) May 11 2012 Yes, if you are looking for the token by its index, not for
- Ary Manzana (26/49) May 11 2012 As deadalnix says, I think you are over-complicating things.
- Roman D. Boiko (29/55) May 11 2012 Summary of your points (I deliberately emphasize some of them
- Roman D. Boiko (3/12) May 11 2012 https://github.com/roman-d-boiko/dct/issues/15
- Roman D. Boiko (6/32) May 11 2012 Would it be possible for you to fork my code and tweak it for
- Ary Manzana (2/41) May 12 2012 Sure, I'll do it and provide some benchmarks. Thanks for creating the is...
- Roman D. Boiko (43/58) May 14 2012 Currently I think about making token a class instead of struct.
- Roman D. Boiko (3/5) May 14 2012 Further discussion on this topic (struct vs class) is at
- deadalnix (7/60) May 14 2012 I'm pretty sure that D's token will not change that much. If the need
- Roman D. Boiko (9/61) May 14 2012 NNTP error: 400 load at 23.60, try later prevented me from
- Tove (20/23) May 14 2012 What if there were two different lex:er modes... with different
- Roman D. Boiko (13/36) May 14 2012 So far it doesn't seem expensive to tolerate errors and proceed.
- Roman D. Boiko (7/23) May 14 2012 Just to clarify: different modes in lexer in my view are like two
- deadalnix (2/24) May 15 2012 If both have the same interface, metaprograming can do the magic for you...
- Roman D. Boiko (3/12) May 15 2012 Well, I didn't state that there is no way to solve the issue.
- Timon Gehr (3/24) May 15 2012 Just use a circular buffer of value-type tokens. There is absolutely no
- Roman D. Boiko (18/26) May 15 2012 I'd like to be able to do efficient token lookup based on its
- deadalnix (5/31) May 16 2012 The only information that is usefull over time is the position. This is
- Ary Manzana (11/20) May 11 2012 But then how do you do to efficiently (if reverse walk is any efficient)...
- dennis luehring (5/25) May 11 2012 it would be better to add something like an ColumnLine-collector-thing -...
- Roman D. Boiko (5/9) May 11 2012 This looks like what I called a hook for pluggin-in respective
- Roman D. Boiko (26/53) May 11 2012 I borrowed the trick from Brian Schott: tokens will be stored as
- deadalnix (6/26) May 12 2012 SDC uses struct Location to store such data.
- Roman D. Boiko (8/30) May 12 2012 1. If a struct is a field of heap allocated object, it will be
- Tobias Pankrath (5/8) May 12 2012 As far as I can tell, it won't be allocated on it's own, since it
- Roman D. Boiko (9/18) May 12 2012 Sorry for confusion. In the first point I was trying to clarify
- Timon Gehr (3/19) May 12 2012 Well yes, but it seems to me that you are deliberately complicating the
- Roman D. Boiko (10/43) May 12 2012 Do you mean something like what I summarized from Ary:
- Jacob Carlborg (12/27) May 11 2012 If think that the end goal of a project like this, putting a D frontend
- dennis luehring (7/10) May 11 2012 or a pure D version of it with the features:
- Roman D. Boiko (3/15) May 11 2012 I plan to go this way.
- dennis luehring (5/8) May 11 2012 are using slices for prevent coping everything around?
- Roman D. Boiko (6/14) May 11 2012 I tried optimizing code when it didn't complicate design. And
- alex (5/21) May 11 2012 Ever thought of asking the VisualD developer to integrate your
- Roman D. Boiko (4/8) May 11 2012 Didn't think about that yet, because I don't use VisualD.
- alex (3/11) May 11 2012 Mono-D is written in C#, VisualD uses D -- so it actually should
- Roman D. Boiko (5/7) May 11 2012 Sorry, I meant D-IDE. But there might exist the reason to consume
- Roman D. Boiko (2/8) May 11 2012 Oops D-IDE is also C#...
- deadalnix (3/11) May 11 2012 The best optimization is the one that bring your code from non working
- Roman D. Boiko (10/20) May 11 2012 My plan is to create frontend that would be much better than
- Jacob Carlborg (6/14) May 11 2012 That's too bad.
- deadalnix (3/17) May 11 2012 I this is your plan, I'm very happy. We really should discuss to avoid
- Roman D. Boiko (3/16) May 11 2012 That makes sense. Is it possible to switch SDC to the Boost
- deadalnix (4/6) May 11 2012 Let me do a clean package of my code this week end. For now it is mixed
- deadalnix (2/11) May 11 2012 From the beginning, I'm think AST macro using CTFE.
- Roman D. Boiko (14/15) May 11 2012 Could you please elaborate?
- deadalnix (15/30) May 11 2012 More explicitly, the goal isn't to implement a different language than D...
- Roman D. Boiko (10/21) May 11 2012 Yes, I forgot this one. Actually, I didn't discard imaginary
- Jacob Carlborg (5/20) May 11 2012 That won't be easy, nobody know what the specification is . TDPL, DMD or...
- Rory McGuire (6/34) May 11 2012 ? my guess is that the spec is TDPL + TDPL errata. dlang.org should be
- Roman D. Boiko (7/14) May 11 2012 Whenever I'm in doubt, I try to analyze both TDPL and dlang.org,
- Jacob Carlborg (4/9) May 11 2012 None of these are in sync.
- Roman D. Boiko (4/13) May 11 2012 I suggest to choose a dedicated place where it is possible to
- Rory McGuire (3/16) May 11 2012 yip, but TDPL still has to take precedence because that is the one that
- Jonathan M Davis (11/13) May 11 2012 It doesn't necessarily. Each place that they differ is examined individu...
- dennis luehring (3/19) May 11 2012 does the parser/lexer allow half-finished syntax parsing? for being
- Roman D. Boiko (3/5) May 11 2012 That's planned, but I would like to see your usage scenarios
- dennis luehring (3/8) May 11 2012 try to syntaxhiglight while coding - thats the scenario, parts
- Roman D. Boiko (12/21) May 11 2012 I depends on IDE. For example, sublime-text (and most likely
- Jacob Carlborg (13/18) May 11 2012 Example from TextMate:
- deadalnix (4/19) May 11 2012 I have started a similar stuff, and it is currently more advanced than
- Jonathan M Davis (26/47) May 11 2012 It's great that you're working on this. We definitely need more of this ...
- Roman D. Boiko (13/59) May 11 2012 Yes, my project has different goals from #1 or #2. The primary
- Marco Leise (25/46) May 19 2012 A general purpose D front-end library has been discussed several times. ...
- Roman D. Boiko (67/158) May 20 2012 (I decided to comment on both my post and your reply.)
- Marco Leise (13/43) May 20 2012 Mostly my own gut feeling, that things that sound great in my head turn ...
- Roman D. Boiko (25/64) May 20 2012 The opposite situation: I didn't think about parser per project :)
- Marco Leise (6/31) May 20 2012 I just only thought about the single-use situation, not the situation wh...
- Roman D. Boiko (25/64) May 20 2012 The opposite situation: I didn't think about parser per project :)
There were several discussions about the need for a D compiler library. I propose my draft implementation of a lexer for community review: https://github.com/roman-d-boiko/dct The lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, the lexer, the parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least the front end built this year (conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feedback here. A dedicated project web-site will be created later.
May 11 2012
On 2012-05-11 10:01, Roman D. Boiko wrote:There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dct Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feedback here. A dedicated project web-site will be created later.(Re-posting here) A couple of questions: * What's the state of the lexer * Does it convert numerical literals and similar to their actual values * Does it retain full source information * Is there an example we can look at to see how the API is used * Does it have a range based interface -- /Jacob Carlborg
May 11 2012
On Friday, 11 May 2012 at 08:38:36 UTC, Jacob Carlborg wrote:(Re-posting here) A couple of questions: * What's the state of the lexerI consider it a draft state, because it has got several rewrites recently and I plan to do more, especially based on community feedback. However, the implementation handles almost all possible cases. Because of the rewrites it is most likely broken at the moment; I'm going to fix it ASAP (in a day or two). The lexer will provide a random-access range of tokens (this is not done yet). Each token contains: * start index (position in the original encoding, 0 corresponds to the first code unit after BOM), * token value encoded as UTF-8 string, * token kind (e.g., token.kind = TokenKind.Float), * possibly an enum with annotations (e.g., token.annotations = FloatAnnotation.Hex | FloatAnnotation.Real)* Does it convert numerical literals and similar to their actual valuesIt is planned to add a post-processor for that as part of the parser, please see README.md for some more details.* Does it retain full source informationYes, this is a design choice to preserve all information. Source code is converted to UTF-8 and stored as token.value, even whitespace. Information about code unit indices in the original encoding is preserved, too.* Is there an example we can look at to see how the API is usedTBD soon (see Roadmap in the readme file)* Does it have a range based interfaceYes, this is what I consider one of its strengths.
May 11 2012
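A minimal sketch, assuming the token fields listed above, of how a consumer might use such a token range. The names here (TokenKind, FloatAnnotation, the field names) mirror the post but are hypothetical; the actual DCT API may differ:

    import std.algorithm : filter;

    enum TokenKind { Identifier, Float, Keyword /* ... */ }
    enum FloatAnnotation : ubyte { None = 0, Hex = 1, Real = 2 }

    struct Token
    {
        size_t startIndex;   // position in the original encoding
        string spelling;     // UTF-8 text the token was lexed from
        TokenKind kind;
        ubyte annotations;   // e.g. FloatAnnotation.Hex | FloatAnnotation.Real
    }

    // Consume any range of tokens and inspect float literals.
    void listHexFloats(R)(R tokens)
    {
        foreach (t; tokens.filter!(t => t.kind == TokenKind.Float))
        {
            if (t.annotations & FloatAnnotation.Hex)
            {
                // highlight, report, convert, ...
            }
        }
    }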
On 2012-05-11 10:58, Roman D. Boiko wrote:On Friday, 11 May 2012 at 08:38:36 UTC, Jacob Carlborg wrote:I see.(Re-posting here) A couple of questions: * What's the state of the lexerI consider it a draft state, because it has got several rewrites recently and I plan to do more, especially based on community feedback. However, implementation handles almost all possible cases. Because of rewrites it is most likely broken at this moment, I'm going to fix it ASAP (in a day or two).Lexer will provide a random-access range of tokens (this is not done yet).Ok.Each token contains: * start index (position in the original encoding, 0 corresponds to the first code unit after BOM), * token value encoded as UTF-8 string, * token kind (e.g., token.kind = TokenKind.Float), * possibly enum with annotations (e.g., token.annotations = FloatAnnotation.Hex | FloatAnnotation.Real)What about line and column information?Isn't that a job for the lexer?* Does it convert numerical literals and similar to their actual valuesIt is planned to add a post-processor for that as part of parser, please see README.md for some more details.That sounds good.* Does it retain full source informationYes, this is a design choice to preserve all information. Source code is converted to UTF-8 and stored as token.value, even whitespaces. Information about code unit indices in the original encoding is preserved, too.I see. Thanks. -- /Jacob Carlborg* Is there an example we can look at to see how the API is usedTBD soon (see Roadmap in the readme file)* Does it have a range based interfaceYes, this is what I consider one of its strengths.
May 11 2012
On Friday, 11 May 2012 at 09:08:24 UTC, Jacob Carlborg wrote:On 2012-05-11 10:58, Roman D. Boiko wrote:Indices of the first code unit of each line are stored inside the lexer and a function will compute Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So the column is calculated by a reverse walk to the previous end of line, etc. It will be possible to calculate locations either taking into account special token sequences (e.g., #line 3 "ab/c.d") or discarding them.Each token contains: * start index (position in the original encoding, 0 corresponds to the first code unit after BOM), * token value encoded as UTF-8 string, * token kind (e.g., token.kind = TokenKind.Float), * possibly enum with annotations (e.g., token.annotations = FloatAnnotation.Hex | FloatAnnotation.Real)What about line and column information?That might be done in the lexer for efficiency reasons (to avoid lexing the token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed. Also I don't think that performance would be improved, given the ratio of the number of literals to the total number of tokens and the need to store additional information per token if it is done in the lexer. I will elaborate on that later.Isn't that a job for the lexer?* Does it convert numerical literals and similar to their actual valuesIt is planned to add a post-processor for that as part of the parser, please see README.md for some more details.
May 11 2012
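A sketch of the on-demand lookup described above, assuming the lexer keeps a sorted array of line-start indices. All names are hypothetical, and the column here is a naive code-unit count; code-point and tab handling is discussed later in the thread:

    import std.range : assumeSorted;

    struct Location { size_t line; size_t column; string file; }

    // lineStarts[i] is the index of the first code unit of line i + 1,
    // so lineStarts[0] == 0. Tokens only need to carry `index`.
    Location locate(const(size_t)[] lineStarts, size_t index, string file = "")
    {
        auto sorted = assumeSorted(lineStarts);
        // number of line starts at or before `index` == 1-based line number
        auto line = sorted.lowerBound(index + 1).length;
        auto column = index - lineStarts[line - 1] + 1;
        return Location(line, column, file);
    }

    unittest
    {
        // "ab\ncd" -> lines start at indices 0 and 3
        size_t[] starts = [0, 3];
        assert(locate(starts, 4) == Location(2, 2, ""));
    }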
On 2012-05-11 11:22, Roman D. Boiko wrote:Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct?What about line and column information?Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.That might be the case. But I don't think it belongs in the parser.That might be done in lexer for efficiency reasons (to avoid lexing token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed.Isn't that a job for the lexer?* Does it convert numerical literals and similar to their actual valuesIt is planned to add a post-processor for that as part of parser, please see README.md for some more details.Also I don't think that performance would be improved given the ratio of number of literals to total number of tokens and the need to store additional information per token if it is done in lexer. I will elaborate on that later.Ok, fair enough. Perhaps this could be a property in the Token struct as well. In that case I would suggest renaming "value" to lexeme/spelling/representation, or something like that, and then name the new property "value". -- /Jacob Carlborg
May 11 2012
On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote:On 2012-05-11 11:22, Roman D. Boiko wrote:There is a method for that in the Lexer interface; to me it looks like it belongs there and not in Token. A version accepting a token and producing a pair of start/end Locations will be added.Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct?I will provide example code and a dedicated post later to illustrate my point.That might be the case. But I don't think it belongs in the parser.That might be done in lexer for efficiency reasons (to avoid lexing token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), also suitable for uses when such values are not needed.I was going to rename value, but couldn't find a nice term. Thanks for your suggestions! As for the property with strongly typed literal value, currently I plan to put it into the AST.Also I don't think that performance would be improved given the ratio of number of literals to total number of tokens and the need to store additional information per token if it is done in lexer. I will elaborate on that later.Ok, fair enough. Perhaps this could be a property in the Token struct as well. In that case I would suggest renaming "value" to lexeme/spelling/representation, or something like that, and then name the new property "value".
May 11 2012
On 2012-05-11 12:35, Roman D. Boiko wrote:On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote:Found it now, "calculateFor". I'm not sure if it's the most intuitive name though. I get the feeling: "calculate what?".Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct?There is a method for that in Lexer interface, for me it looks like it belongth there and not to token. Version accepting token and producing a pair of start/end Locations will be added.I guess I'll have to wait for that then :)That might be the case. But I don't think it belongs in the parser.I will provide example code and a dedicated post later to illustrate my point.I stole "spelling" from Clang :) -- /Jacob CarlborgOk, fair enough. Perhaps this could be a property in the Token struct as well. In that case I would suggest renaming "value" to lexeme/spelling/representation, or something like that, and then name the new property "value".I was going to rename value, but couldn't find a nice term. Thanks for your suggestions! As for the property with strongly typed literal value, currently I plan to put it into AST.
May 11 2012
On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:On 2012-05-11 12:35, Roman D. Boiko wrote:calculateLocation was the original name, but I don't like repeating the return type in method names, so I decided to change it in a way that makes it clear that another renaming is needed ;) Any suggestions?On Friday, 11 May 2012 at 09:36:28 UTC, Jacob Carlborg wrote:Found it now, "calculateFor". I'm not sure if it's the most intuitive name though. I get the feeling: "calculate what?".Aha, clever. As long as I can get out the information I'm happy :) How about adding properties for this in the token struct?There is a method for that in Lexer interface, for me it looks like it belongth there and not to token. Version accepting token and producing a pair of start/end Locations will be added.I'll try to do that ahead of the roadmap; it is important.I guess I'll have to wait for that then :)That might be the case. But I don't think it belongs in the parser.I will provide example code and a dedicated post later to illustrate my point.
May 11 2012
On 2012-05-11 14:07, Roman D. Boiko wrote:On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:Found it now, "calculateFor". It not sure if it's the most intuitive name though. I get the feeling: "calculate what?".calculateLocation was original name, but I don't like repeating return type in method names, I decided to change it so that it is clear that another renaming is needed ;) Any suggestions?My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought.Cool. -- /Jacob CarlborgI guess I'll have to wait for that then :)I'll try to do that ahead of roadmap, it is important.
May 11 2012
On Friday, 11 May 2012 at 12:55:58 UTC, Jacob Carlborg wrote:On 2012-05-11 14:07, Roman D. Boiko wrote:What about the following signature: Location locate(size_t index)? Or even better: alias size_t CodeUnitIndex; Location locateFor(CodeUnitIndex position); The problem with placing it in Token is that Token should not know anything about source as a whole.On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:Found it now, "calculateFor". It not sure if it's the most intuitive name though. I get the feeling: "calculate what?".calculateLocation was original name, but I don't like repeating return type in method names, I decided to change it so that it is clear that another renaming is needed ;) Any suggestions?My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought.
May 11 2012
On 2012-05-11 15:01, Roman D. Boiko wrote:What about the following signature: Location locate(size_t index)? Or even better: alias size_t CodeUnitIndex; Location locateFor(CodeUnitIndex position);That is better although I would prefer to pass in a token (assuming that is where index is declared). Then it would be an implementation detail that "index" is used to get the location. Another option would be to turn it around: struct/class Location { Location find/locate (Token token); }The problem with placing it in Token is that Token should not know anything about source as a whole.I see. For convenience there could be properties defined that just forward the call to some other function. -- /Jacob Carlborg
May 11 2012
On Friday, 11 May 2012 at 13:25:53 UTC, Jacob Carlborg wrote:On 2012-05-11 15:01, Roman D. Boiko wrote:That is also what I was planning to do after I posted my last suggestion, unless I discover some conceptual problem with it, which is unlikely.What about the following signature: Location locate(size_t index)? Or even better: alias size_t CodeUnitIndex; Location locateFor(CodeUnitIndex position);That is better although I would prefer to pass in a token (assuming that is where index is declared). Then it would be an implementation detail that "index" is used to get the location.Another option would be to turn it around: struct/class Location { Location find/locate (Token token); }Both of these cases would introduce a circular dependency between Lexer and its output data (Location or Token). It is possible to break such a dependency by complicating things even more, or to live with it. But I don't like circular dependencies when there is no compelling reason to introduce them.The problem with placing it in Token is that Token should not know anything about source as a whole.I see. For convenience there could be properties defined that just forwards the call to some other function.
May 11 2012
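The shape being discussed, written out as a sketch. The alias and locateFor signature are quoted from the posts above; the interface and the token-taking overload are only an illustration, not the actual DCT API:

    alias size_t CodeUnitIndex;

    struct Location { size_t line, column; string file; }
    struct Token    { CodeUnitIndex startIndex; string spelling; }

    interface Lexer
    {
        // Lookup by code unit index; the lexer owns the line-start table,
        // so Token itself needs to know nothing about the source as a whole.
        Location locateFor(CodeUnitIndex position);

        // Convenience overload: start location of a token.
        final Location locateFor(Token token)
        {
            return locateFor(token.startIndex);
        }
    }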
Le 11/05/2012 15:01, Roman D. Boiko a écrit :On Friday, 11 May 2012 at 12:55:58 UTC, Jacob Carlborg wrote:I don't really see the benefit of this. You are trading an O(1) operation for an O(log(n)) one. It can only be faster in specific cases, which should be measured. It is likely to be slower when compiling erroneous code or when the lexer is used for something other than compiling, and the difference is likely to be insignificant when compiling correct code (I suspect the performance win is negligible compared to the time required to build a fully operational executable). You are overcomplicating things for little to no benefit.On 2012-05-11 14:07, Roman D. Boiko wrote:What about the following signature: Location locate(size_t index)? Or even better: alias size_t CodeUnitIndex; Location locateFor(CodeUnitIndex position); The problem with placing it in Token is that Token should not know anything about source as a whole.On Friday, 11 May 2012 at 11:49:23 UTC, Jacob Carlborg wrote:Found it now, "calculateFor". It not sure if it's the most intuitive name though. I get the feeling: "calculate what?".calculateLocation was original name, but I don't like repeating return type in method names, I decided to change it so that it is clear that another renaming is needed ;) Any suggestions?My original suggestion was to have the functionality in Token, which would have made for intuitive names: line, column and file. But since you didn't like that I have to give it some thought.
May 11 2012
On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote:Le 11/05/2012 15:01, Roman D. Boiko a écrit :It would be interesting to see benchmarks. But anyway log(1M) is just 20, and I didn't see any source code with 1M lines :). The main point is design, not performance. I'm trying to build minimal design that handles common use cases very well, and others well enough. I don't see why Location would be needed for every Token.The problem with placing it in Token is that Token should not know anything about source as a whole.I don't really see the benefit of this. You are trading a O(1) operation to an O(log(n)) . It can only be faster in specific cases, which should be measured.It is likely to be slower on compiling erroneous code or used for something else than compiling, and likely to be insignificant to compile correct code (I suspect the performance win is negligible compared to the time required to build a fully operational executable).IMO, it is unrelated.You are overcomplicating things for no to little benefice.Well, it depends... However, I will provide design rationale document this month.
May 11 2012
On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote:Le 11/05/2012 15:01, Roman D. Boiko a écrit :Technically, I'm trading the N*O(1) operations needed to track line and column while consuming each character for M*O(log(n)) operations when calculating them on demand. N is the number of characters, n is the number of lines and M is the number of actual usages of Location. My assumption is that M << N (M is much smaller than N).The problem with placing it in Token is that Token should not know anything about source as a whole.I don't really see the benefit of this. You are trading a O(1) operation to an O(log(n)) . It can only be faster in specific cases, which should be measured.
May 11 2012
Le 11/05/2012 16:02, Roman D. Boiko a écrit :On Friday, 11 May 2012 at 13:28:21 UTC, deadalnix wrote:N can easily be number of tokens.Le 11/05/2012 15:01, Roman D. Boiko a écrit :Technically, I'm trading N*0(1) operations needed to track line and column while consuming each character to M*0(log(n)) operations when calculating them on demand. N = number of characters, n is number of lines and M is number of actual usages of Location. My assumption is that M << N (M is much smaller than N).The problem with placing it in Token is that Token should not know anything about source as a whole.I don't really see the benefit of this. You are trading a O(1) operation to an O(log(n)) . It can only be faster in specific cases, which should be measured.
May 11 2012
On Friday, 11 May 2012 at 15:05:19 UTC, deadalnix wrote:Le 11/05/2012 16:02, Roman D. Boiko a écrit :Yes, if you are looking for the token by its index, not by location. E.g., for autocompletion you need to find the last token before the cursor location. But that is not related to location search. Also please note that I oversimplified the formula for complexity. It also has other components. My reply was just an additional minor comment. Motivation is design, not performance. One additional reason for such separation: with my approach it is possible to calculate Location either taking into account information from special token sequences or ignoring it. How would you do that for eager calculation? Calculate only one category? Or track and store both? Unnecessary complexity will eventually find a way to shoot you in the leg :) Real-world usage (at least, anticipated scenarios) should be the basis for design. Sometimes this contradicts intuition and requires you to look at the problem from a different side.Technically, I'm trading N*0(1) operations needed to track line and column while consuming each character to M*0(log(n)) operations when calculating them on demand. N = number of characters, n is number of lines and M is number of actual usages of Location. My assumption is that M << N (M is much smaller than N).N can easily be number of tokens.
May 11 2012
On 5/11/12 10:14 PM, Roman D. Boiko wrote:On Friday, 11 May 2012 at 15:05:19 UTC, deadalnix wrote:As deadalnix says, I think you are over-complicating things. I mean, to store the column and line information it's just: if (isNewLine(c)) { line++; column = 0; } else { column++; } (I think you need to add that to the SourceRange class. Then copy line and column to token on the Lexer#lex() method) Do you really think it's that costly in terms of performance? I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.Le 11/05/2012 16:02, Roman D. Boiko a écrit :Yes, if you are looking for the token by its index, not for location. E.g., for autocompletion it is needed to find the last token before cursor location. But that is not related to location search. Also please note that I oversimplified formula for complexity. It also has other components. My reply was just an additional minor comment. Motivation is design, not performance. One additional reason for such separation: with my approach it is possible to calculate Location both taking into account infromation from special token sequences, or ignoring it. How would you do that for eager calculation? Calculate only one category? Or track and store both? Unnecessary complexity will eventually find a way to shoot your leg :) Real-world usage (at least, anticipated scenarios) should be the basis for designing. Sometimes this contradicts intuition and requires you to look at the problem from a different side.Technically, I'm trading N*0(1) operations needed to track line and column while consuming each character to M*0(log(n)) operations when calculating them on demand. N = number of characters, n is number of lines and M is number of actual usages of Location. My assumption is that M << N (M is much smaller than N).N can easily be number of tokens.
May 11 2012
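For comparison, a self-contained sketch of the eager approach suggested here: track line and column while consuming characters and stamp them on each token as it is produced (the names are hypothetical):

    struct Cursor
    {
        size_t line = 1;
        size_t column = 1;

        void advance(dchar c)
        {
            if (c == '\n') { ++line; column = 1; }
            else           { ++column; }
        }
    }

    unittest
    {
        Cursor cur;
        foreach (dchar c; "ab\ncd")
            cur.advance(c);
        assert(cur.line == 2 && cur.column == 3);
    }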
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:As deadalnix says, I think you are over-complicating things. I mean, to store the column and line information it's just: if (isNewLine(c)) { line++; column = 0; } else { column++; } (I think you need to add that to the SourceRange class. Then copy line and column to token on the Lexer#lex() method) Do you really think it's that costly in terms of performance? I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.Summary of your points (I deliberately emphasize some of them more than you did; please correct me if I misinterpreted anything): * storing location for each token is simple and cheap * SourceRange is needed; an instance per token; it should be a class, not struct * Syntax highlighting doesn't need to keep all tokens in memory * but it must know column and line for each token * for this use case DCT has awful performance * to build an AST, lines and columns are required * information from tokens must be copied and possibly transformed in order to put it into the AST; after such transformation tokens are not needed any more All of those are valid points, given that we don't have any benchmarks or usage examples. The good news is that use cases like highlighting, parsing, autocompletion, etc. are the core functionality for which DCT should be designed. So if it fails any of them, the design needs to be revisited. But I will need some time to thoroughly address each of these issues (either prove that it is not relevant, or fix the problem). I will definitely complete this work during May. I will regularly report my progress here. In general, I tend to disagree with most of them, because I already thought about each and designed accordingly, but it is important to give them enough attention. What I am failing at miserably at the moment is providing actual usage code and benchmarks, so I'm going to focus on that. Thanks for your feedback!
May 11 2012
On Saturday, 12 May 2012 at 05:13:36 UTC, Roman D. Boiko wrote:On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:...As deadalnix says, I think you are over-complicating things.... I will need some time to thoroughly address each of this issues (either prove that it is not relevant, or fix the problem). I will definitely complete this work during May. I will regularly report my progress here. In general, I tend to disagree with most of them, because I already thought about each and designed accordingly, but it is important to give them enough attention.https://github.com/roman-d-boiko/dct/issues/15
May 11 2012
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:As deadalnix says, I think you are over-complicating things. I mean, to store the column and line information it's just: if (isNewLine(c)) { line++; column = 0; } else { column++; } (I think you need to add that to the SourceRange class. Then copy line and column to token on the Lexer#lex() method) Do you really think it's that costly in terms of performance? I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.Would it be possible for you to fork my code and tweak it for comparison? You will definitely discover more problems this way, and such feedback would really help me. That doesn't seem to be inadequately much work to do. Any other volunteers for this?
May 11 2012
On 5/12/12 12:17 PM, Roman D. Boiko wrote:On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:Sure, I'll do it and provide some benchmarks. Thanks for creating the issue.As deadalnix says, I think you are over-complicating things. I mean, to store the column and line information it's just: if (isNewLine(c)) { line++; column = 0; } else { column++; } (I think you need to add that to the SourceRange class. Then copy line and column to token on the Lexer#lex() method) Do you really think it's that costly in terms of performance? I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.Would it be possible for you to fork my code and tweak it for comparison? You will definitely discover more problems this way, and such feedback would really help me. That doesn't seem to be inadequately much work to do. Any other volunteers for this?
May 12 2012
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.Currently I am thinking about making Token a class instead of a struct. A token (from https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:

    // Represents a lexed token
    struct Token
    {
        size_t startIndex;  // position of the first code unit in the source string
        string spelling;    // characters from which this token has been lexed
        TokenKind kind;     // enum; each keyword and operator has a dedicated kind
        ubyte annotations;  // meta information like whether a token is valid, or an integer literal is signed, long, hexadecimal, etc.
    }

Making it a class would give several benefits:
* no need to worry about allocating a big array of tokens. E.g., on a 64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require a 4 times smaller chunk of contiguous memory if it was an array of class objects, because each element would consume only 8 bytes instead of 32.
* allow subclassing, for example, for storing strongly typed literal values; this flexibility could also facilitate future extensibility (but it's difficult to predict which kind of extension may be needed)
* there would be no need to copy data from tokens into the AST, passing an object would be enough (again, copying 8 instead of 32 bytes); the same applies to passing into methods - no need to pass by ref to minimise overhead
It would incur some additional memory overhead (at least 8 bytes per token), but that's hardly significant. Also there is an additional price for accessing token members because of indirection, and, possibly, worse cache friendliness (token instances may be allocated anywhere in memory, not close to each other). These considerations are mostly about performance. I think there is also some impact on design, but I couldn't find anything significant (given that currently I see a token as merely a data structure without associated behavior). Could anybody suggest other pros and cons? Which option would you choose?
May 14 2012
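The two layouts being weighed, side by side (a sketch; the field sizes in the comments assume a 64-bit target, matching the 8-vs-32-byte numbers above):

    enum TokenKind { Identifier, Float /* ... */ }

    struct TokenStruct         // value type: ~32 bytes per array element
    {
        size_t startIndex;     //  8 bytes
        string spelling;       // 16 bytes (pointer + length)
        TokenKind kind;        //  4 bytes
        ubyte annotations;     //  1 byte (+ padding)
    }

    class TokenClass           // reference type: the array stores 8-byte references,
    {                          // but each instance adds object overhead and indirection
        size_t startIndex;
        string spelling;
        TokenKind kind;
        ubyte annotations;
    }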
On Monday, 14 May 2012 at 15:00:37 UTC, Roman D. Boiko wrote:Could anybody suggest other pros and cons? Which option would you choose?Further discussion on this topic (struct vs class) is at http://forum.dlang.org/thread/asdrqlaydzcdpqwsbtiv@forum.dlang.org
May 14 2012
Le 14/05/2012 17:00, Roman D. Boiko a écrit :On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:Why is this a benefit?I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.Currently I think about making token a class instead of struct. A token (from https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is: // Represents lexed token struct Token { size_t startIndex; // position of the first code unit in the source string string spelling; // characters from which this token has been lexed TokenKind kind; // enum; each keyword and operator, have a dedicated kind ubyte annotations; // meta information like whether a token is valid, or an integer literal is signed, long, hexadecimal, etc. } Making it a class would give several benefits: * allow not to worry about allocating a big array of tokens. E.g., on 64-bit OS the largest module in Phobos (IIRC, the std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require 4 times smaller chunk of contiguous memory if it was an array of class objects, because each would consume only 8 bytes instead of 32.* allow subclassing, for example, for storing strongly typed literal values; this flexibility could also facilitate future extensibility (but it's difficult to predict which kind of extension may be needed)I'm pretty sure that D's token will not change that much. If the need isn't identified right now, I'd advocate for YAGNI.* there would be no need to copy data from tokens into AST, passing an object would be enough (again, copy 8 instead of 32 bytes); the same applies to passing into methods - no need to pass by ref to minimise overheadYes but now you add pressure on the GC and add indirections. I'm not sure it's worth it. It seems to me like a premature optimization.It would incur some additional memory overhead (at least 8 bytes per token), but that's hardly significant. Also there is additional price for accessing token members because of indirection, and, possibly, worse cache friendliness (token instances may be allocated anywhere in memory, not close to each other). These considerations are mostly about performance. I think there is also some impact on design, but couldn't find anything significant (given that currently I see a token as merely a datastructure without associated behavior). Could anybody suggest other pros and cons? Which option would you choose?You are over-engineering the whole thing.
May 14 2012
On Monday, 14 May 2012 at 16:30:21 UTC, deadalnix wrote:Le 14/05/2012 17:00, Roman D. Boiko a écrit :NNTP error: 400 load at 23.60, try later prevented me from answering :) Because it might be difficult to find a big chunk of available memory (3.5M vs 14M for this particular case).Making it a class would give several benefits: * allow not to worry about allocating a big array of tokens. E.g., on 64-bit OS the largest module in Phobos (IIRC, the std.datetime) consumes 13.5MB in an array of almost 500K tokens. It would require 4 times smaller chunk of contiguous memory if it was an array of class objects, because each would consume only 8 bytes instead of 32.Why is this a benefice ?Agree.* allow subclassing, for example, for storing strongly typed literal values; this flexibility could also facilitate future extensibility (but it's difficult to predict which kind of extension may be needed)I'm pretty sure that D's token will not change that much. If the need isn't identified right know, I'd advocate for YAGNI.It looks so. Thanks.* there would be no need to copy data from tokens into AST, passing an object would be enough (again, copy 8 instead of 32 bytes); the same applies to passing into methods - no need to pass by ref to minimise overheadYes but now you add pressure on the GC and add indirections. I'm not sure it worth it. It seems to me like a premature optimization.I'm trying to solve this and other tradeoffs. I'd like to simplify but satisfy my design goals.It would incur some additional memory overhead (at least 8 bytes per token), but that's hardly significant. Also there is additional price for accessing token members because of indirection, and, possibly, worse cache friendliness (token instances may be allocated anywhere in memory, not close to each other). These considerations are mostly about performance. I think there is also some impact on design, but couldn't find anything significant (given that currently I see a token as merely a datastructure without associated behavior). Could anybody suggest other pros and cons? Which option would you choose?You are over engineering the whole stuff.
May 14 2012
On Monday, 14 May 2012 at 16:58:42 UTC, Roman D. Boiko wrote:What if there were two different lex:er modes... with different struct:s. 1. For an IDE with on the fly lexing: Assumption, the error rate is high.(need to keep much info) 2. For the compiler Assumption, the error rate is non existent, and if there is an error it really doesn't matter if it's slow. So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;) try { lex(mode.compiler); } catch { lex(mode.ide); // calculates column etc. what ever info it needs. }You are over engineering the whole stuff.I'm trying to solve this and other tradeoffs. I'd like to simplify but satisfy my design goals.
May 14 2012
On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:On Monday, 14 May 2012 at 16:58:42 UTC, Roman D. Boiko wrote:So far it doesn't seem expensive to tolerate errors and proceed. The only thing I miss is some sort of specification when to stop including characters into token spelling and start a new token. I don't think I'll use backtracking for that in the nearest future. If I did, I would really separate part of lexer and provide two implementations for that part. Given this, accepting errors and moving on simply requires some finite set of rules about boundaries of invalid tokens. I also think structural code editing concepts will help here, but I didn't do any research on this topic yet. The problem with multiple lexer implementations is that it might become much more difficult to maintain them.What if there were two different lex:er modes... with different struct:s. 1. For an IDE with on the fly lexing: Assumption, the error rate is high.(need to keep much info) 2. For the compiler Assumption, the error rate is non existent, and if there is an error it really doesn't matter if it's slow. So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;) try { lex(mode.compiler); } catch { lex(mode.ide); // calculates column etc. what ever info it needs. }You are over engineering the whole stuff.I'm trying to solve this and other tradeoffs. I'd like to simplify but satisfy my design goals.
May 14 2012
On Monday, 14 May 2012 at 19:13:39 UTC, Roman D. Boiko wrote:On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:...What if there were two different lex:er modes... with different struct:s. 1. For an IDE with on the fly lexing: Assumption, the error rate is high.(need to keep much info) 2. For the compiler Assumption, the error rate is non existent, and if there is an error it really doesn't matter if it's slow. So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;)The problem with multiple lexer implementations is that it might become much more difficult to maintain them.Just to clarify: different modes in lexer in my view are like two different implementations combined in a non-trivial way (unless the difference is minor). So complexity goes from two factors: different implementations and how to combine them. I try to avoid this.
May 14 2012
Le 14/05/2012 21:21, Roman D. Boiko a écrit :On Monday, 14 May 2012 at 19:13:39 UTC, Roman D. Boiko wrote:If both have the same interface, metaprograming can do the magic for you.On Monday, 14 May 2012 at 19:04:20 UTC, Tove wrote:...What if there were two different lex:er modes... with different struct:s. 1. For an IDE with on the fly lexing: Assumption, the error rate is high.(need to keep much info) 2. For the compiler Assumption, the error rate is non existent, and if there is an error it really doesn't matter if it's slow. So... when choosing the "compiler mode"... and there actually is an error, then just lex it again, to produce a pretty error message ;)The problem with multiple lexer implementations is that it might become much more difficult to maintain them.Just to clarify: different modes in lexer in my view are like two different implementations combined in a non-trivial way (unless the difference is minor). So complexity goes from two factors: different implementations and how to combine them. I try to avoid this.
May 15 2012
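A sketch of the kind of metaprogramming hinted at above: one lexer implementation, with the extra bookkeeping compiled in only for the mode that needs it (everything here is hypothetical):

    enum LexMode { compiler, ide }

    struct Lexer(LexMode mode)
    {
        static if (mode == LexMode.ide)
            size_t line = 1, column = 1;   // eager tracking only in IDE mode

        void consume(dchar c)
        {
            static if (mode == LexMode.ide)
            {
                if (c == '\n') { ++line; column = 1; }
                else ++column;
            }
            // ... scanning logic shared by both modes ...
        }
    }

    unittest
    {
        Lexer!(LexMode.ide) ide;
        foreach (dchar c; "a\nb") ide.consume(c);
        assert(ide.line == 2 && ide.column == 2);

        Lexer!(LexMode.compiler) fast;     // carries no line/column fields
        static assert(!__traits(hasMember, typeof(fast), "line"));
    }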
On Tuesday, 15 May 2012 at 09:33:31 UTC, deadalnix wrote:Le 14/05/2012 21:21, Roman D. Boiko a écrit :Well, I didn't state that there is no way to solve the issue. Also I added "unless the difference is minor".Just to clarify: different modes in lexer in my view are like two different implementations combined in a non-trivial way (unless the difference is minor). So complexity goes from two factors: different implementations and how to combine them. I try to avoid this.If both have the same interface, metaprograming can do the magic for you.
May 15 2012
On 05/14/2012 05:00 PM, Roman D. Boiko wrote:On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT to *not* store all tokens because I need each one in turn? And since I'll be highlighting in the editor I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful. Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information I just copy them to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.Currently I think about making token a class instead of struct. ... Could anybody suggest other pros and cons? Which option would you choose?
May 15 2012
On Tuesday, 15 May 2012 at 19:27:26 UTC, Timon Gehr wrote:On 05/14/2012 05:00 PM, Roman D. Boiko wrote:I'd like to be able to do efficient token lookup based on its start index in the file (e.g., for autocompletion and calculating column/line numbers on demand). I also plan to have ability to update lexer, parser and semantic analysis results as the user or tool makes edits (incrementally, not by restarting), and to keep history of changes as long as it is needed. To achieve this, I will use Token[N][] for storing tokens, and an immutable binary tree for efficient lookup and preserving history. I don't worry much about keeping all tokens in memory, because information they contain is useful and needed for most scenarios. If, on the contrary, information is no longer used, it will be garbage collected. This is a significant redesign of current implementation. It is based on feedback from this thread and completely satisfies my design goals. It is also going to be efficient. Alpha release for community review is planned on May 21, and design documentation in 10 days after that.Currently I think about making token a class instead of struct. ... Could anybody suggest other pros and cons? Which option would you choose?Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
May 15 2012
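A rough sketch of the chunked storage idea from the post above: fixed-size chunks of value-type tokens plus a binary search by start index (e.g. to find the token under the cursor). The immutable tree for keeping history is not shown, and all names and the chunk size are assumptions:

    struct Token { size_t startIndex; string spelling; }

    struct TokenStore(size_t N = 1024)
    {
        Token[N][] chunks;
        size_t count;

        ref Token opIndex(size_t i) { return chunks[i / N][i % N]; }

        void put(Token t)
        {
            if (count % N == 0)            // grow by one chunk when needed
            {
                Token[N] chunk;
                chunks ~= chunk;
            }
            this[count++] = t;
        }

        // Index of the last token starting at or before `pos`
        // (tokens are naturally sorted by startIndex; assumes count > 0).
        size_t indexAt(size_t pos)
        {
            size_t lo = 0, hi = count;
            while (lo + 1 < hi)
            {
                auto mid = (lo + hi) / 2;
                if (this[mid].startIndex <= pos) lo = mid;
                else hi = mid;
            }
            return lo;
        }
    }

    unittest
    {
        TokenStore!4 store;
        foreach (i; 0 .. 10) store.put(Token(i * 5));
        assert(store.indexAt(12) == 2);    // the token starting at index 10
    }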
Le 15/05/2012 21:59, Roman D. Boiko a écrit :On Tuesday, 15 May 2012 at 19:27:26 UTC, Timon Gehr wrote:The only information that is useful over time is the position. This is the one that you want to calculate lazily. In other words, what you want to keep in memory is of no interest. A circular buffer as Timon proposes is a better idea.On 05/14/2012 05:00 PM, Roman D. Boiko wrote:I'd like to be able to do efficient token lookup based on its start index in the file (e.g., for autocompletion and calculating column/line numbers on demand). I also plan to have ability to update lexer, parser and semantic analysis results as the user or tool makes edits (incrementally, not by restarting), and to keep history of changes as long as it is needed. To achieve this, I will use Token[N][] for storing tokens, and an immutable binary tree for efficient lookup and preserving history. I don't worry much about keeping all tokens in memory, because information they contain is useful and needed for most scenarios. If, on the contrary, information is no longer used, it will be garbage collected. This is a significant redesign of current implementation. It is based on feedback from this thread and completely satisfies my design goals. It is also going to be efficient. Alpha release for community review is planned on May 21, and design documentation in 10 days after that.Currently I think about making token a class instead of struct. ... Could anybody suggest other pros and cons? Which option would you choose?Just use a circular buffer of value-type tokens. There is absolutely no excuse for slow parsing.
May 16 2012
On 5/11/12 4:22 PM, Roman D. Boiko wrote:But then how do you efficiently compute line numbers (if a reverse walk is at all efficient)? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek at the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.What about line and column information?Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
May 11 2012
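For reference, the freelist idiom linked above (dlang.org/memory.html#freelists) looks roughly like this when applied to a class-based token; this mirrors the documented pattern rather than any actual DCT code:

    class Token
    {
        Token next;            // freelist link, only meaningful while the token is discarded
        size_t index;
        string text;
        // line and column could live here too, at a few bytes extra per pooled token

        static Token freelist; // head of the freelist

        static Token allocate()
        {
            if (freelist !is null)
            {
                auto t = freelist; // reuse a previously discarded token
                freelist = t.next;
                return t;
            }
            return new Token;      // pool is empty, fall back to the GC
        }

        static void deallocate(Token t)
        {
            t.next = freelist;     // push the discarded token onto the pool
            freelist = t;
        }
    }

As discussed further down in the thread, for a small value-type Token the pattern buys little; it mainly pays off for heap-allocated tokens that are created and discarded frequently.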
Am 11.05.2012 13:50, schrieb Ary Manzana:On 5/11/12 4:22 PM, Roman D. Boiko wrote:it would be better to add something like an ColumnLine-collector-thing - that if applied is able to hold this information, so no waste if its not needed i think there are several parts that can work like thatBut then how do you do to efficiently (if reverse walk is any efficient) compute line numbers? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks tokens, process them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek the next token, or have a buffer of token, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.What about line and column information?Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
May 11 2012
On Friday, 11 May 2012 at 12:12:01 UTC, dennis luehring wrote:it would be better to add something like an ColumnLine-collector-thing - that if applied is able to hold this information, so no waste if its not needed i think there are several parts that can work like thatThis looks like what I called a hook for plugging in respective functionality in my previous message. But I'm not sure it is needed at all (in addition to what I already designed with the calculateFor() method).

May 11 2012
On Friday, 11 May 2012 at 11:50:29 UTC, Ary Manzana wrote:On 5/11/12 4:22 PM, Roman D. Boiko wrote:I borrowed the trick from Brian Schott: tokens will be stored as a sorted array. Sorting is done based on their indices in source code (they are naturally sorted immediately, no need to run any algorithm for that). The same is true for information about line start indices. Now given an index we can do a binary search in such sorted arrays, and get corresponding line number or token (whatever we need). Walking along utf code points starting from the index which corresponds to beginning of line and taking into account tab characters it is possible to calculate column reasonably fast. My assumption is that column numbers are needed for use cases like error reporting or other messages for a user, and thus there is no need to pre-calculate them for each token (or character). For such use cases efficiency should be really enough. In general, I precalculate and store only information which is needed frequently. Other information is evaluated lazily.But then how do you do to efficiently (if reverse walk is any efficient) compute line numbers?What about line and column information?Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.Usually tokens are used and discarded. I mean, somebody that uses the lexer asks tokens, process them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek the next token, or have a buffer of token, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ).But the information from tokens is not discarded (at least, this is the requirement for DCT). So my choice is to keep it in tokens instead of converting to some other form. That also implies that Token is free from any auxiliary information which is not necessary for common use cases.So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.It is inefficient to calculate it during lexing, scanning algorithm become less robust and more complicated, and in most cases we won't need it anyway. I will add a hook to plug-in such functionality (when needed) if I will know why it is necessary.
May 11 2012
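A sketch of the lookup described above: line-start indices come out of the lexer already sorted, so a binary search maps any code-unit index to its line, and the column is found by walking code points from that line's start (tab expansion is left out here). The names are illustrative, not DCT's actual API:

    import std.range : assumeSorted;
    import std.utf : stride;

    struct Location { size_t line; size_t column; } // both 1-based, for messages

    // `lineStarts` holds the index of the first code unit of every line,
    // with lineStarts[0] == 0, just as described above.
    Location locate(string source, const(size_t)[] lineStarts, size_t index)
    {
        // the number of line starts <= index gives the 1-based line number
        auto line = assumeSorted(lineStarts).lowerBound(index + 1).length;
        size_t i = lineStarts[line - 1];
        size_t column = 1;
        while (i < index)           // walk by code points, not code units
        {
            i += stride(source, i);
            ++column;
        }
        return Location(line, column);
    }

The same assumeSorted/lowerBound search works for finding the token that covers a given index, since the token start indices are kept sorted as well.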
Le 11/05/2012 13:50, Ary Manzana a écrit :On 5/11/12 4:22 PM, Roman D. Boiko wrote:SDC uses struct Location to store such data. Every token and AST element have a member data of type location. As it is value type, no need for free list all thoses sort of thing. When the token is discarded, the location is discarded too. The same goes for AST nodes.But then how do you do to efficiently (if reverse walk is any efficient) compute line numbers? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks tokens, process them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek the next token, or have a buffer of token, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.What about line and column information?Indices of the first code unit of each line are stored inside lexer and a function will compute Location (line number, column number, file specification) for any index. This way size of Token instance is reduced to the minimum. It is assumed that Location can be computed on demand, and is not needed frequently. So column is calculated by reverse walk till previous end of line, etc. Locations will possible to calculate both taking into account special token sequences (e.g., #line 3 "ab/c.d"), or discarding them.
May 12 2012
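In outline, the SDC-style layout described above is simply the following (the field names here are guesses for illustration, not SDC's actual declarations). Since Location is a value field inside Token, discarding the token discards the location with it, which is the point being made:

    struct Location
    {
        string filename;
        uint line;
        uint column;
    }

    enum TokenType { identifier, keyword, intLiteral /+ ... +/ }

    struct Token
    {
        TokenType type;
        string value;      // slice of the source text
        Location location; // travels with the token; no separate allocation
    }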
On Saturday, 12 May 2012 at 13:24:11 UTC, deadalnix wrote:Le 11/05/2012 13:50, Ary Manzana a écrit :1. If a struct is a field of heap allocated object, it will be allocated and garbage collected. Only if it only exists on stack (i.e., in method body), GC is not used. 2. If struct which is often allocated and deallocated, free list would allow to skip its initialization, which is especially important if size of struct is large. This is true for both heap and stack allocated data.Usually tokens are used and discarded. I mean, somebody that uses the lexer asks tokens, process them (for example to highlight code or to build an AST) and then discards them. So you can reuse the same Token instance. If you want to peek the next token, or have a buffer of token, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code ). So adding line and column information is not like wasting a lot of memory: just 8 bytes more for each token in the freelist.SDC uses struct Location to store such data. Every token and AST element have a member data of type location. As it is value type, no need for free list all thoses sort of thing. When the token is discarded, the location is discarded too. The same goes for AST nodes.
May 12 2012
1. If a struct is a field of heap allocated object, it will be allocated and garbage collected. Only if it only exists on stack (i.e., in method body), GC is not used.As far as I can tell, it won't be allocated on its own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
May 12 2012
On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:Sorry for confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context, since I don't do many allocations of Location, as well as in a majority of other contexts. As it is often the case, performance optimization should be performed only when benchmarks show it is needed.1. If a struct is a field of heap allocated object, it will be allocated and garbage collected. Only if it only exists on stack (i.e., in method body), GC is not used.As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
May 12 2012
On 05/12/2012 11:08 PM, Roman D. Boiko wrote:On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:Well yes, but it seems to me that you are deliberately complicating the design in order to make it less efficient?Sorry for confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context, since I don't do many allocations of Location, as well as in a majority of other contexts. As it is often the case, performance optimization should be performed only when benchmarks show it is needed.1. If a struct is a field of heap allocated object, it will be allocated and garbage collected. Only if it only exists on stack (i.e., in method body), GC is not used.As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
May 12 2012
On Saturday, 12 May 2012 at 22:19:55 UTC, Timon Gehr wrote:On 05/12/2012 11:08 PM, Roman D. Boiko wrote:Do you mean something like what I summarized from Ary: http://forum.dlang.org/post/pggkobonzmgdwjigyaac forum.dlang.org ? If that is the case, could you please add your concerns to the dedicated issue: https://github.com/roman-d-boiko/dct/issues/15 ? Otherwise, if by overcomplicating you meant using a free list, I'd like to clarify that I don't plan to introduce it yet because I see no reason to use it in my context. ThanksOn Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:Well yes, but it seems to me that you are deliberately complicating the design in order to make it less efficient?Sorry for confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack allocated, but that statement was off-topic for the given context. Currently I consider my first point to be useless, and the second one is not relevant in *my* context, since I don't do many allocations of Location, as well as in a majority of other contexts. As it is often the case, performance optimization should be performed only when benchmarks show it is needed.1. If a struct is a field of heap allocated object, it will be allocated and garbage collected. Only if it only exists on stack (i.e., in method body), GC is not used.As far as I can tell, it won't be allocated on it's own, since it is stored in the garbage collected object as a value field. So using a freelist would actually increase the overhead. You have to manage the freelist and do the allocation of the containing object.
May 12 2012
On 2012-05-11 10:01, Roman D. Boiko wrote:There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dct Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feed here. A dedicated project web-site will be created later.I think that the end goal of a project like this, putting a D frontend in Phobos, should be that the compiler should be built using this library. This would result in the compiler and library always being in sync and having the same behavior. Otherwise it's easy for this to become just another tool that tries to lex and parse D code, always being out of sync with the compiler and not having the same behavior. For this to happen, for Walter to start using this, I think there would be a greater chance if the frontend was a port of the DMD frontend and not changed too much. -- /Jacob Carlborg
May 11 2012
Am 11.05.2012 11:02, schrieb Jacob Carlborg:For this to happen, for Walter to start using this, I think there would be a greater chance if the frontend was a port of the DMD frontend and not changed too much.or a pure D version of it with the features: -very fast in parsing/lexing - there needs to be a benchmark environment from the very start -easy to extend and fix and I think that is what Walter wants from a D-ish frontend in the first place
May 11 2012
On Friday, 11 May 2012 at 09:21:29 UTC, dennis luehring wrote:Am 11.05.2012 11:02, schrieb Jacob Carlborg:Will add that to May roadmap.For this to happen, for Walter to start using this, I think there would be a greater change if the frontend was a port of the DMD frontend and not changed too much.or a pure D version of it with the features: -very fast in parsing/lexing - there need to be a benchmark enviroment from the very start-easy to extend and fix and think that is what walter wants from an D-ish frontend in the first placeI plan to go this way.
May 11 2012
Am 11.05.2012 11:33, schrieb Roman D. Boiko:are you using slices to prevent copying everything around? the parser/lexer needs to be as fast as the original one - maybe even faster - else it won't replace Walter's at any time - because speed does matter here very much-very fast in parsing/lexing - there needs to be a benchmark environment from the very startWill add that to May roadmap.
May 11 2012
On Friday, 11 May 2012 at 10:01:17 UTC, dennis luehring wrote:Am 11.05.2012 11:33, schrieb Roman D. Boiko:I tried optimizing code when it didn't complicate design. And more optimizations will be added later. It would be interesting to have benchmarks for comparing performance, but I don't plan to branch and edit DMD front end to prepare it for benchmarking (any time soon).are using slices for prevent coping everything around? the parser/lexer need to be as fast as the original one - maybe even faster - else it won't replace walters at any time - because speed does matter here very much-very fast in parsing/lexing - there need to be a benchmark enviroment from the very startWill add that to May roadmap.
May 11 2012
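On the slicing question: a token's text can simply be a window into the original source string, so nothing is copied while lexing. A trivial illustration (not DCT's actual Token layout):

    import std.ascii : isAlphaNum;

    struct Token
    {
        size_t index; // start position in the source
        string text;  // slice of the source, not a copy
    }

    Token lexIdentifier(string source, size_t start)
    {
        size_t end = start;
        while (end < source.length
               && (source[end] == '_' || isAlphaNum(source[end])))
            ++end;
        return Token(start, source[start .. end]); // O(1), shares memory with source
    }

The price of slicing is that the source string has to stay alive as long as any token does.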
On Friday, 11 May 2012 at 10:40:43 UTC, Roman D. Boiko wrote:On Friday, 11 May 2012 at 10:01:17 UTC, dennis luehring wrote:Ever thought of asking the VisualD developer to integrate your library into his IDE extension? Might be cool to do so because of extended completion abilities etc. (lol I'm the Mono-D dev -- but why not? ;D)Am 11.05.2012 11:33, schrieb Roman D. Boiko:I tried optimizing code when it didn't complicate design. And more optimizations will be added later. It would be interesting to have benchmarks for comparing performance, but I don't plan to branch and edit DMD front end to prepare it for benchmarking (any time soon).are using slices for prevent coping everything around? the parser/lexer need to be as fast as the original one - maybe even faster - else it won't replace walters at any time - because speed does matter here very much-very fast in parsing/lexing - there need to be a benchmark enviroment from the very startWill add that to May roadmap.
May 11 2012
On Friday, 11 May 2012 at 11:41:34 UTC, alex wrote:Ever thought of asking the VisualD developer to integrate your library into his IDE extension? Might be cool to do so because of extended completion abilities etc. (lol I'm the Mono-D dev -- but why not? ;D)Didn't think about that yet, because I don't use VisualD. I actually planned to analyse whether DCT could be integrated into Mono-D, so your feedback is welcome :)
May 11 2012
On Friday, 11 May 2012 at 11:55:47 UTC, Roman D. Boiko wrote:On Friday, 11 May 2012 at 11:41:34 UTC, alex wrote:be easier to integrate into the second one :)Ever thought of asking the VisualD developer to integrate your library into his IDE extension? Might be cool to do so because of extended completion abilities etc. (lol I'm the Mono-D dev -- but why not? ;D)Didn't think about that yet, because I don't use VisualD. I actually planned to analyse whether DCT could be integrated into Mono-D, so your feedback is welcome :)
May 11 2012
On Friday, 11 May 2012 at 12:13:53 UTC, alex wrote:should be easier to integrate into the second one :)Sorry, I meant D-IDE. But there might exist the reason to collaborate to make it usable for that. I have nothing against VisualD, just didn't think about it yet.
May 11 2012
On Friday, 11 May 2012 at 12:20:27 UTC, Roman D. Boiko wrote:On Friday, 11 May 2012 at 12:13:53 UTC, alex wrote:should be easier to integrate into the second one :)Sorry, I meant D-IDE. But there might exist the reason to collaborate to make it usable for that.
May 11 2012
Le 11/05/2012 12:01, dennis luehring a écrit :Am 11.05.2012 11:33, schrieb Roman D. Boiko:The best optimization is the one that bring your code from non working state to working state.are using slices for prevent coping everything around? the parser/lexer need to be as fast as the original one - maybe even faster - else it won't replace walters at any time - because speed does matter here very much-very fast in parsing/lexing - there need to be a benchmark enviroment from the very startWill add that to May roadmap.
May 11 2012
On Friday, 11 May 2012 at 09:02:12 UTC, Jacob Carlborg wrote:I think that the end goal of a project like this, putting a D frontend in Phobos, should be that the compiler should be built using this library. This would result in the compiler and library always being in sync and having the same behavior. Otherwise it's easy for this to become just another tool that tries to lex and parse D code, always being out of sync with the compiler and not having the same behavior. For this to happen, for Walter to start using this, I think there would be a greater chance if the frontend was a port of the DMD frontend and not changed too much.My plan is to create a frontend that would be much better than the existing one, both in design and implementation. I decided to work on this full-time for several months. The front end will not produce the same data as the DMD front end does, so most likely it will not be suitable to replace the existing C++ implementation. But since no information will be lost, I will consider creating a separate project which would build output compatible with the existing one (not identical, I don't think that would be feasible).
May 11 2012
On 2012-05-11 11:31, Roman D. Boiko wrote:My plan is to create frontend that would be much better than existing, both in design and implementation. I decided to work on this full time for several months.That's good news.Front end will not produce the same data as DMD front end does, so most likely it will not be suitable to replace existing C++ implementation.That's too bad.But since no information will be lost I will consider creating a separate project which would build output compatible with existing (not identical, I don't think that would be feasible).Ok. -- /Jacob Carlborg
May 11 2012
Le 11/05/2012 11:31, Roman D. Boiko a écrit :On Friday, 11 May 2012 at 09:02:12 UTC, Jacob Carlborg wrote:I think that the end goal of a project like this, putting a D frontend in Phobos, should be that the compiler should be built using this library. This would result in the compiler and library always being in sync and having the same behavior. Otherwise it's easy for this to become just another tool that tries to lex and parse D code, always being out of sync with the compiler and not having the same behavior. For this to happen, for Walter to start using this, I think there would be a greater chance if the frontend was a port of the DMD frontend and not changed too much.My plan is to create a frontend that would be much better than the existing one, both in design and implementation. I decided to work on this full-time for several months.If this is your plan, I'm very happy. We really should discuss this to avoid duplicating effort, as we are doing right now.
May 11 2012
On Friday, 11 May 2012 at 11:50:14 UTC, deadalnix wrote:Le 11/05/2012 11:31, Roman D. Boiko a écrit :My plan is to create frontend that would be much better than existing, both in design and implementation. I decided to work on this full time for several months.I this is your plan, I'm very happy. We really should discuss to avoid duplicating effort as we are doing right now.I have started a similar stuff, and it is currently more advanced than DCT is. I kind of decapitated sdc to do it. Maybe we should join effort instead of doing 2 separate projects ?That makes sense. Is it possible to switch SDC to the Boost license? I'm trying to keep it for all DCT code.
May 11 2012
Le 11/05/2012 14:04, Roman D. Boiko a écrit :That makes sense. Is it possible to switch SDC to the Boost license? I'm trying to keep it for all DCT code.Let me do a clean package of my code this weekend. For now it is mixed with SDC source code, which was enough as I was working alone, but isn't quite right for people to join the effort.
May 11 2012
Le 11/05/2012 11:02, Jacob Carlborg a écrit :I think that the end goal of a project like this, putting a D frontend in Phobos, should be that the compiler should be built using this library. This would result in the compiler and library always being in sync and having the same behavior. Otherwise it's easy for this to become just another tool that tries to lex and parse D code, always being out of sync with the compiler and not having the same behavior. For this to happen, for Walter to start using this, I think there would be a greater chance if the frontend was a port of the DMD frontend and not changed too much.From the beginning, I'm thinking about AST macros using CTFE.
May 11 2012
On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:From the beginning, I'm thinking about AST macros using CTFE.Could you please elaborate? I plan to strictly follow published D specification. Exceptions from this rule are possible provided either of the following is true: * new functionality has been implemented in DMD but is not included into specification yet * specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from specification * change is compatible with specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce post-processor after lexer) Some more exceptions might be added later, but the goal is to minimize differences.
May 11 2012
Le 11/05/2012 14:14, Roman D. Boiko a écrit :On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:From the beginning, I'm thinking about AST macros using CTFE.Could you please elaborate? I plan to strictly follow published D specification. Exceptions from this rule are possible provided either of the following is true: * new functionality has been implemented in DMD but is not included into specification yet * specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from specification * change is compatible with specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce post-processor after lexer) Some more exceptions might be added later, but the goal is to minimize differences.More explicitly, the goal isn't to implement a different language than D. Simply doing the parsing/AST building in a way that would allow AST macros to be introduced later. Your 3 points seem reasonable. Mine were: * Implement something that can parse D as it is currently defined/implemented (if dmd's behavior and the spec differ, it is handled on a per-case basis). * Discard all deprecated features. Not even try to implement them, even if dmd supports them currently. * Do the parsing in several steps to allow different tools to work with it. I think we both have very compatible goals. Let me do a clean package of it and write about the design goals. I don't have that much time right now to do it, I will this weekend.
May 11 2012
On Friday, 11 May 2012 at 12:30:01 UTC, deadalnix wrote:Your 3 points seem reasonable. Mine were : * Implement something that can parse D as it is currently defined/implemented (if dmd's behavior and spec differs, it is handled on a per case basis).All differences should be documented.* Discard all deprecated features. Not even try to implement them even if dmd support them currently.Yes, I forgot this one. Actually, I didn't discard imaginary floats, because I don't know what exactly will be done instead and it is easy to keep them.* Do the parsing in several steps to allow different tools to work with it.I was thinking about a pool of analysers each of which would add some information. This could be more than needed for semantic analysis. An analyser would be created each time when information (e.g., some indexing) is needed for a particular use case.I think we both have very compatibles goals. Let me do a clean package of it I write about design goals. I don't have that much time right now to do it, I will this week end.I saw your ast_as_lib branch of SDC, but didn't dig deeper.
May 11 2012
On 2012-05-11 14:14, Roman D. Boiko wrote:On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:That won't be easy, nobody knows what the specification is: TDPL, DMD or dlang.org?Exceptions from this rule are possible provided either of the following is true: * new functionality has been implemented in DMD but is not included into specification yet * specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from specification * change is compatible with specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce post-processor after lexer) Some more exceptions might be added later, but the goal is to minimize differences.-- /Jacob Carlborg
May 11 2012
? my guess is that the spec is TDPL + TDPL errata. dlang.org should be updated as people notice inaccuracies. This project would be an ideal time to update dlang.org as people notice its not in sync with TDPL. If TDPL doesn't cover it then the community should review it. On Fri, May 11, 2012 at 3:17 PM, Jacob Carlborg <doob me.com> wrote:On 2012-05-11 14:14, Roman D. Boiko wrote:On Friday, 11 May 2012 at 11:47:18 UTC, deadalnix wrote:That won't be easy, nobody know what the specification is . TDPL, DMD or dlang.org? Exceptions from this rule are possible provided either of the followingFrom the beginning, I'm think AST macro using CTFE.Could you please elaborate? I plan to strictly follow published D specification.is true: * new functionality has been implemented in DMD but is not included into specification yet * specification is incorrect (has a bug) or incomplete, especially if DMD behavior differs from specification * change is compatible with specification and brings some significant improvement (e.g., this seems to be the case for my decision to introduce post-processor after lexer) Some more exceptions might be added later, but the goal is to minimize differences.-- /Jacob Carlborg
May 11 2012
On Friday, 11 May 2012 at 13:30:49 UTC, Rory McGuire wrote:? my guess is that the spec is TDPL + TDPL errata. dlang.org should be updated as people notice inaccuracies. This project would be an ideal time to update dlang.org as people notice its not in sync with TDPL. If TDPL doesn't cover it then the community should review it.Whenever I'm in doubt, I try to analyze both TDPL and dlang.org, and also consult DMD sources. But in general when I write specification I mean dlang.org. I agree that it should be maintained, but I understand that it is difficult to do. I'll try to summarize problems which I discover (I already found many), but it is quite time-consuming activity.
May 11 2012
On 2012-05-11 15:30, Rory McGuire wrote:? my guess is that the spec is TDPL + TDPL errata. dlang.org <http://dlang.org> should be updated as people notice inaccuracies. This project would be an ideal time to update dlang.org <http://dlang.org> as people notice its not in sync with TDPL. If TDPL doesn't cover it then the community should review it.None of these are in sync. -- /Jacob Carlborg
May 11 2012
On Friday, 11 May 2012 at 14:45:45 UTC, Jacob Carlborg wrote:On 2012-05-11 15:30, Rory McGuire wrote:I suggest to choose a dedicated place where it is possible to discuss differences and other problems with specification. This forum?? my guess is that the spec is TDPL + TDPL errata. dlang.org <http://dlang.org> should be updated as people notice inaccuracies. This project would be an ideal time to update dlang.org <http://dlang.org> as people notice its not in sync with TDPL. If TDPL doesn't cover it then the community should review it.None of these are in sync.
May 11 2012
yip, but TDPL still has to take precedence because that is the one that walter + andrei + community put the most focused effort into. On Fri, May 11, 2012 at 4:45 PM, Jacob Carlborg <doob me.com> wrote:On 2012-05-11 15:30, Rory McGuire wrote:? my guess is that the spec is TDPL + TDPL errata. dlang.org <http://dlang.org> should be updated as people notice inaccuracies. This project would be an ideal time to update dlang.org <http://dlang.org> as people notice its not in sync with TDPL. If TDPL doesn't cover it then the community should review it.None of these are in sync. -- /Jacob Carlborg
May 11 2012
On Friday, May 11, 2012 23:00:15 Rory McGuire wrote:yip, but TDPL still has to take precedence because that is the one that walter + andrei + community put the most focused effort into.It doesn't necessarily. Each place that they differ is examined individually and decided on its own merits. There is no automatic "TDPL says it's so, so it's so." There _are_ cases where things have even been explicitly changed after TDPL was released, making TDPL out-of-date for that particular item (e.g. strong vs weak purity is not discussed in TDPL at all, because it didn't exist at the time). In general, however, TDPL _is_ correct, and when there's a difference, it's simply a case of the documentation or compiler not having been updated accordingly. We try to avoid making any changes that would conflict with TDPL. - Jonathan M Davis
May 11 2012
Am 11.05.2012 10:01, schrieb Roman D. Boiko:There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dct Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feed here. A dedicated project web-site will be created later.does the parser/lexer allow half-finished syntax parsing? for being useable in an IDE for syntax-highlighting while coding?
May 11 2012
On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:does the parser/lexer allow half-finished syntax parsing? for being useable in an IDE for syntax-highlighting while coding?That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).
May 11 2012
Am 11.05.2012 11:23, schrieb Roman D. Boiko:On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:does the parser/lexer allow half-finished syntax parsing? for being useable in an IDE for syntax-highlighting while coding?That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).try to syntax-highlight while coding - that's the scenario; parts of the source code aren't fully valid while writing
May 11 2012
On Friday, 11 May 2012 at 09:28:36 UTC, dennis luehring wrote:Am 11.05.2012 11:23, schrieb Roman D. Boiko:On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:does the parser/lexer allow half-finished syntax parsing? for being useable in an IDE for syntax-highlighting while coding?That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).try to syntax-highlight while coding - that's the scenario; parts of the source code aren't fully valid while writingIt depends on the IDE. For example, sublime-text (and most likely TextMate) uses regex for syntax highlighting. That makes it impossible to use for D in some scenarios (like nested block comments). Any IDE that provides an API for coloring will get correct information if the code is valid. If it is not valid, it is only possible to handle specific kinds of errors, but in general there will always be cases when highlighting (or some parsing / semantic analysis information) is "incorrect". I aim to handle common errors gracefully. In practice, I think, it is possible to handle 99% of problems, but this requires a lot of design / specification work.
May 11 2012
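One common way to "handle common errors gracefully" at the lexing level, sketched only as an illustration of the idea (not DCT's actual strategy): an unexpected character becomes an error token and scanning continues, so a highlighter can keep colouring the rest of the file:

    enum TokenKind { identifier, number, operator, error /+ ... +/ }

    struct Token { size_t index; string text; TokenKind kind; }

    // Called when no rule matches at `pos`: consume something so the lexer
    // always makes progress, and let the IDE mark the span as an error.
    // (A real lexer would consume a whole code point, not a single code unit.)
    Token errorToken(string source, size_t pos)
    {
        return Token(pos, source[pos .. pos + 1], TokenKind.error);
    }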
On 2012-05-11 11:23, Roman D. Boiko wrote:On Friday, 11 May 2012 at 09:19:07 UTC, dennis luehring wrote:Example from TextMate: * "void" - keyword is colored * "module main" - nothing colored until I type a semicolon * "module main;" - keyword and "main" is colored (differently) * "void foo" - keyword is colored * "void foo (" - keyword and "foo" is colored (differently) * "struct F" - keyword and "F is colored (differently) * Literals are always colored. * User-defined constants are always colored. It's basically any token that is all upper case -- /Jacob Carlborgdoes the parser/lexer allow half-finished syntax parsing? for being useable in an IDE for syntax-highlighting while coding?That's planned, but I would like to see your usage scenarios (pseudo-code would help a lot).
May 11 2012
Le 11/05/2012 10:01, Roman D. Boiko a écrit :There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dct Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feed here. A dedicated project web-site will be created later.I have started a similar stuff, and it is currently more advanced than DCT is. I kind of decapitated sdc to do it. Maybe we should join effort instead of doing 2 separate projects ?
May 11 2012
On Friday, May 11, 2012 10:01:28 Roman D. Boiko wrote:There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dct Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feed here. A dedicated project web-site will be created later.It's great that you're working on this. We definitely need more of this sort of stuff. However, I would point out that I'm not sure that it will be acceptable for Phobos (it _may_ be, but it's not quite what we've been looking for). There are two types of lexers/parsers that we've been looking to add to Phobos. 1. A direct port of the lexer from dmd's C++ front-end. The API would be D-ified to use ranges, but the internals would be almost identical to the C++ front-end so that porting changes back and forth would be very easy. It also would make it easier to replace the C++ front-end with a D one later if the D one is very close to the C++ one. 2. A lexer/parser generator using templates and/or CTFE to generate a lexer/parser of whatever language using a BNF or EBNF grammar - probably similar to what Philippe Sigaud has done to generate a parser using PEGs. Such could be used to generate a lexer/parser for D or any other language. What you seem to be doing is creating a D-specific parser from scratch, which is great, but it's not quite what we've been looking to add to Phobos. That doesn't necessarily mean that we can't or won't add it to Phobos once it's complete (especially if nothing else gets done first), but it also may not be; that will have to be decided once it's done and ready to be reviewed for inclusion in Phobos. I'm not trying to discourage you at all. I'm just pointing out that what you're doing isn't quite what we're looking for. Regardless, it _would_ be interesting to benchmark what you're doing and see which performs better. - Jonathan M Davis
May 11 2012
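For a picture of what a "D-ified, range-based" lexer API could look like, here is a bare-bones sketch; the scanning logic is a whitespace-splitting stand-in and the names (DLexer, byToken) are hypothetical, not from DMD, Phobos or DCT:

    import std.ascii : isWhite;

    struct Token { size_t index; string text; }

    struct DLexer // an input range of Tokens over the source text
    {
        private string source;
        private size_t pos;
        private Token tok;
        private bool done;

        this(string source) { this.source = source; popFront(); } // prime the first token

        @property bool empty() { return done; }
        @property Token front() { return tok; }

        void popFront()
        {
            while (pos < source.length && isWhite(source[pos])) ++pos;
            if (pos >= source.length) { done = true; return; }
            // Stand-in scanner: a real lexer would branch on the character
            // class here; this one takes a run of non-whitespace as one token.
            immutable start = pos;
            while (pos < source.length && !isWhite(source[pos])) ++pos;
            tok = Token(start, source[start .. pos]);
        }
    }

    auto byToken(string source) { return DLexer(source); }

    unittest
    {
        import std.algorithm : equal, map;
        assert(byToken("int x ;").map!(t => t.text).equal(["int", "x", ";"]));
    }

Anything with this input-range shape composes with std.algorithm and std.range out of the box, which is the main point of the range-based option.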
On Saturday, 12 May 2012 at 05:41:16 UTC, Jonathan M Davis wrote:It's great that you're working on this. We definitely need more of this sort of stuff. However, I would point out that I'm not sure that it will be acceptable for Phobos (it _may_ be, but it's not quite what we've been looking for). There are two types of lexers/parsers that we've been looking to add to Phobos. 1. A direct port of the lexer from dmd's C++ front-end. The API would be D- ified to use ranges, but the internals would be almost identical to the C++ front-end so that porting changes back and forth would be very easy. It also would make it easier to replace the C++ front-end with a D one later if the D one is very close to the C++ one. 2. A lexer/parser generator using templates and/or CTFE to generate a lexer/parser of whatever language using a BNF or EBNF grammar - probably similar to what Philippe Sigaud has done to generate a parser using PEGs. Such could be used to generate a lexer/parser for D or any other language. What you seem to be doing is creating a D-specific parser from scratch, which is great, but it's not quite what we've been looking to add to Phobos. That doesn't necessarily mean that we can't or won't add it to Phobos once it's complete (especially if nothing else gets done first), but it also may not be to be decided once it's done and ready to be reviewed for inclusion in Phobos. I'm not trying to discourage you at all. I'm just pointing out that what you're doing is quite what we're looking for. Regardless, it _would_ be interesting to be you're doing and seeing which performs better. - Jonathan M Davisneed I'm trying to address is making development of tools for D development much easier. Tools are needed for D to become more widely used in practise. Secondary (a nice-to-have) would be to build a compiler on top of that. I plan to do both, but I don't intend to be backwards-compatible with DMD backend. I hope that at least part of my code will either influence or be reused in the reference D compiler (whatever that will be in the future). It is important to have a reference frontend implementation which can be used in many tools, otherwise a lot of effort would be completely wasted and no tool would comply with the D specification and compilers.
May 11 2012
Am Fri, 11 May 2012 10:01:28 +0200 schrieb "Roman D. Boiko" <rb d-coding.com>:There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dct Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way). The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.). My commitment is to have at least front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect). Please post any feed here. A dedicated project web-site will be created later.A general purpose D front-end library has been discussed several times. So at least for some day-dreaming about better IDE support and code analysis tools it has a huge value. It is good to see that someone finally takes the time to implement one. My only concern is that not enough planing went into the design. I think about things like brainstorming and collecting possible use cases from the community or looking at how some C++ tools do their job and what infrastructure they are built on. Although it is really difficult to tell from other people's code which decision was 'important' and what was just the author's way to do it. Inclusion into Phobos I would not see as a priority. As Jonathan said, there are already some clear visions of how such modules would look like. Also if any project seeks to replace the DMD front-end, Walter should be the technical advisor. Like anyone putting a lot of time and effort into a design, he could have strong feelings about certain decisions and implementing them in a seemingly worse way. That said, you make the impression of being really dedicated to the project, even giving yourself a time line, which is a good sign. I wish you all the best for the project and hope that - even without any 'official' recognition - it will see a lot of use as the base of D tools. To learn about parsing I wrote a syntax highlighter for the DCPU-16 assembly of Minecraft author Markus Persson. (Which was a good exercise.) Interestingly I ended up with a similar design for the Token struct like yours: - TokenType/TokenKind enum - Trie for matching token kinds (why do you expand it into nested switch-case through CTFE mixins though?) Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;). One notable difference is that I only check for isEoF in one location, since I append "\0" to the source code as a stop token (that can also serve as an end-of-line indicator). (The general idea is to move checks outside of loops.) ** Errors ** I generally keep the start and end column, in case someone wants that. A real world example: ubyte x = ...; if (x >= 0 && x <= 8) ... Line 2, warning: Comparison is always true. At first glace you say "No, a byte can become larger than 8.", but the compiler is just complaining about the first part of the expression here, which would be clarified by e.g. some ASCII/Unicode art: Line 2, warning: Comparison is always true: if (x >= 0 && x <= 8) ... 
 ^----^ ** Code highlighting ** Especially Gtk's TextView (which GtkSourceView is based on) isn't made for several thousand tokens. The text view will take a second per 20000 tokens or so. The source view works around that by highlighting in chunks, but you can still feel the delay when you jump to the end of the file in gedit and even MonoDevelop suffers from the bad performance. If I understood your posts correctly, you are already planning for minimal change sets. It is *very* important to only update changed parts in a syntax highlighting editor. So for now I ended up writing an 'insertText' and a 'deleteText' method for the parser that take 'start index', 'text' and 'start index', 'end index' respectively and return a list of removed and added tokens. If possible, make sure that your library can be used with GtkSourceView, Scintilla and QSyntaxHighlighter. Unfortunately the bindings for the last two could use an update, but the API help should get you an idea of how to use it most efficiently. -- Marco
May 19 2012
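The stop-token trick reads roughly like this in code; it is just an illustration of the pattern Marco describes, not his actual highlighter:

    import std.ascii : isAlphaNum;

    // One copy up front buys branch-free inner loops: '\0' terminates every scan.
    string withSentinel(string source)
    {
        return source ~ '\0';
    }

    size_t skipIdentifier(string src, size_t i)
    {
        // '\0' is neither '_' nor alphanumeric, so the loop stops at the
        // sentinel without ever comparing i against src.length.
        while (src[i] == '_' || isAlphaNum(src[i]))
            ++i;
        return i;
    }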
On Saturday, 19 May 2012 at 20:35:10 UTC, Marco Leise wrote:Am Fri, 11 May 2012 10:01:28 +0200 schrieb "Roman D. Boiko" <rb d-coding.com>:(I decided to comment on both my post and your reply.) I've got *a lot* of feedback from community, which caused a significant redesign (but caused a delay with yet unknown duration). I'll write more on that later. Thanks a lot to everybody!There were several discussions about the need for a D compiler library. I propose my draft implementation of lexer for community review: https://github.com/roman-d-boiko/dctLooks like I'm going to replace this implementation almost completely.Lexer is based on Brian Schott's project https://github.com/Hackerpilot/Dscanner, but it has been refactored and extended (and more changes are on the way).See below.The goal is to have source code loading, lexer, parser and semantic analysis available as parts of Phobos. These libraries should be designed to be usable in multiple scenarios (e.g., refactoring, code analysis, etc.).By the D specification I mean http://dlang.org/language-reference.html, or (when specification is not clear for me) one of the following: * TDPL, * current DMD implementation * community feedback. (Note that I may assume that I understand some aspect, but later revisit it if needed.)My commitment is to have at least front end built this year (and conforming to the D2 specification unless explicitly stated otherwise for some particular aspect).It is going to be http://d-coding.com, but it is not usable yet (I have not much web-design practise). It is hosted on https://github.com/roman-d-boiko/roman-d-boiko.github.com, pull requests are welcome.Please post any feed here. A dedicated project web-site will be created later.A general purpose D front-end library has been discussed several times. So at least for some day-dreaming about better IDE support and code analysis tools it has a huge value. It is good to see that someone finally takes the time to implement one. My only concern is that not enough planing went into the design.Could you name a few specific concerns? The reason of this thread was to reflect on design early. OTOH, I didn't spend time documenting my design goals and tradeoffs, so discussion turned into a brainstorm and wasn't always productive. But now when I see how much value can I get even without doing my homework, I'm much more likely to provide better documentation and ask more specific questions.I think about things like brainstorming and collecting possible use cases from the community or looking at how some C++ tools do their job and what infrastructure they are built on. Although it is really difficult to tell from other people's code which decision was 'important' and what was just the author's way to do it.I'm going to pick up several use cases and prioritize them according to my judgement. Feel free to suggest any cases that you think are needed (with motivation). Prioritizing is necessary to define what is out of scope and plan work into milestones, in order to ensure the project is feasible.Inclusion into Phobos I would not see as a priority. As Jonathan said, there are already some clear visions of how such modules would look like.Well, *considering* such inclusion seriously would help improve the quality of project. But it is not what necessarily will happen. Currently I think that my goals are close to those of use case 2 from Jonathan's reply. 
But until the project is reasonably complete it is not the time to argue whether to include it (or its part).Also if any project seeks to replace the DMD front-end, Walter should be the technical advisor. Like anyone putting a lot of time and effort into a design, he could have strong feelings about certain decisions and implementing them in a seemingly worse way.Not a goal currently, because that would make the project significantly less likely to be completed ever.That said, you make the impression of being really dedicated to the project, even giving yourself a time line, which is a good sign. I wish you all the best for the project and hope that - even without any 'official' recognition - it will see a lot of use as the base of D tools.Well, I hope that some more people will join the project once it stabilizes and delivers something useful. By the way, is anybody *potentially* interested to join?To learn about parsing I wrote a syntax highlighter for the DCPU-16 assembly of Minecraft author Markus Persson. (Which was a good exercise.) Interestingly I ended up with a similar design for the Token struct like yours: - TokenType/TokenKind enum - Trie for matching token kinds (why do you expand it into nested switch-case through CTFE mixins though?)Switch code generation is something temporary that works currently and helps me evaluate possible problems with future design, which is planned to be compile-time finite automata (likely deterministic).Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;).I'm sure it will :) (I'm going to elaborate on this some time later).One notable difference is that I only check for isEoF in one location, since I append "\0" to the source code as a stop token (that can also serve as an end-of-line indicator). (The general idea is to move checks outside of loops.)There are several EoF conditions: \0, \x1A, __EOF__ and physicas EOF. And any loop would need to check for all. Preallocation could eliminate the need to check for physical EoF, but would make it impossible to avoid input string copying and also would not allow passing a function a slice of string to be lexed.** Errors ** I generally keep the start and end column, in case someone wants that. A real world example: ubyte x = ...; if (x >= 0 && x <= 8) ... Line 2, warning: Comparison is always true. At first glace you say "No, a byte can become larger than 8.", but the compiler is just complaining about the first part of the expression here, which would be clarified by e.g. some ASCII/Unicode art: Line 2, warning: Comparison is always true: if (x >= 0 && x <= 8) ... ^----^This functionality has been the priority from the beginning. Not implemented yet but designed for. Evaluation of column and line only on demand is caused by the assumption that such information is needed rarely (primarily to display information to the user). My new design also addresses the use cases when it is needed frequently.** Code highlighting ** Especially Gtk's TextView (which GtkSourceView is base on) isn't made for several thousand tokens. The text view will take a second per 20000 tokens or so. The source view works around that by highlighting in chunks, but you can still fell the delay when you jump to the end of the file in gedit and even MonoDevelop suffers from the bad performance. 
If I understood your posts correctly, you are already planning for minimal change sets. It is *very* important to only update changed parts in a syntax highlighting editor. So for now I ended up writing an 'insertText' and a 'deleteText' method for the parser that take 'start index', 'text' and 'start index', 'end index' respectively and return a list of removed and added tokens. If possible, make sure that your library can be used with GtkSourceView, Scintilla and QSyntaxHighlighter. Unfortunately the bindings for the last two could use an update, but the API help should get you an idea of how to use it most efficiently.Currently I plan to support GtkSourceView, Scintilla, and a custom editor API which I plan to define so that it would be most efficient in this respect. I haven't worked on that thoroughly yet. Incremental changes are the key to efficiency, and I'm going to invest a lot of effort into making them. Also immutability of data structures will enable many optimizations.
May 20 2012
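A possible shape for such an incremental interface, following Marco's description; the signatures are guesses for illustration, not DCT's or Marco's actual code:

    struct Token { size_t index; string text; } // placeholder

    struct TokenDelta
    {
        Token[] removed; // tokens invalidated by the edit
        Token[] added;   // tokens produced by relexing the affected region
    }

    interface IncrementalLexer
    {
        // `text` was typed at code-unit offset `start`
        TokenDelta insertText(size_t start, string text);

        // the range [start, end) was deleted
        TokenDelta deleteText(size_t start, size_t end);
    }

An editor integration (GtkSourceView, Scintilla, QSyntaxHighlighter) would then re-highlight only the spans covered by `removed` and `added` instead of the whole buffer.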
Am Sun, 20 May 2012 10:09:34 +0200 schrieb "Roman D. Boiko" <rb d-coding.com>:Could you name a few specific concerns?Mostly my own gut feeling, that things that sound great in my head turn out to bite me in the end. Things that one just doesn't think of because of the limited horizon everyone has and probably a lack of experience in the field of compiler/tools writing. On the other hand I have good experience with working by community feedback. So the rough edges in the design may already be ironed out by now.I'm going to pick up several use cases and prioritize them according to my judgement. Feel free to suggest any cases that you think are needed (with motivation). Prioritizing is necessary to define what is out of scope and plan work into milestones, in order to ensure the project is feasible.There is one feature I remember caused some head-aches for Code::Blocks. They used a separate parser instance for every project in the IDE, which meant that all the standard include files would be parsed and kept in memory multiple times. When they later switched to a common parser for all projects they ran into stability issues. If you can arrange it, it would be great for multi-project IDEs to be able to add and remove projects to your parser without reparsing Phobos/druntime (which may have its .di files replaced by .d files, looking at the past discussion). C bindings could be an option. (As in: the smallest common denominator.) TheyI'm curious.Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;).I'm sure it will :) (I'm going to elaborate on this some time later).There are several EoF conditions: \0, \x1A, __EOF__ and physicas EOF. And any loop would need to check for all. Preallocation could eliminate the need to check for physical EoF, but would make it impossible to avoid input string copying and also would not allow passing a function a slice of string to be lexed.I found that I usually either load from file into a newly allocated buffer (where a copy occurs, only because I forgot about assumeUnique in std.exception) or I am editing the file in which case I recreate the source string after every key stroke anyway. I can still pass slices of that string to functions though. Not sure what you mean. It probably doesn't work for D as well as for ASM code, but I could also check for \x1A and __EOF__ in the same fashion. (By the way, is it \x1A (substitute, ^Z) or did you mean \0x04 (end-of-transmission, ^D)?) The way it works is: Parser states like 'in expression' can safely peek at the next character at any time. If it doesn't match what they expect they emit an error and drop back to the "surrounding" parser state. When they reach the "file" level, that's the only place where an EOF (which will only occur once per file anyway) will be consumed. In theory one can drop all string length checks and work on char* directly with a known terminator char that is distinct from anything else.No more questions! -- Marco** Errors ** I generally keep the start and end column, in case someone wants that.This functionality has been the priority from the beginning. Not implemented yet but designed for. Evaluation of column and line only on demand is caused by the assumption that such information is needed rarely (primarily to display information to the user). My new design also addresses the use cases when it is needed frequently. [...] 
Incremental changes are the key to efficiency, and I'm going to invest a lot of effort into making them. Also immutability of data structures will enable many optimizations.
May 20 2012
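[Editor's note: as an aside on evaluating line and column only on demand, here is one common way to do it. This is only an illustrative sketch, not DCT's code; `LineIndex`, `build` and `locate` are made-up names. The idea is to record the index at which each line starts and binary-search that array only when a location is actually requested, so nothing is paid per token.]

import std.range : assumeSorted;

/// Hypothetical helper: maps a code-unit index to (line, column) on demand.
struct LineIndex
{
    immutable(size_t)[] lineStarts; // index of the first code unit of each line

    static LineIndex build(string source)
    {
        size_t[] starts = [0];
        foreach (i, c; source)
            if (c == '\n')
                starts ~= i + 1; // the next line begins after the newline
        return LineIndex(starts.idup);
    }

    /// O(log n) lookup, paid only when a location is actually requested.
    void locate(size_t index, out size_t line, out size_t column)
    {
        auto sorted = lineStarts.assumeSorted;
        line = sorted.lowerBound(index + 1).length; // 1-based line number
        column = index - lineStarts[line - 1] + 1;  // 1-based column (code units)
    }
}

unittest
{
    auto idx = LineIndex.build("int x;\nint y;\n");
    size_t line, col;
    idx.locate(7, line, col); // index of 'i' in "int y;"
    assert(line == 2 && col == 1);
}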
On Sunday, 20 May 2012 at 17:42:59 UTC, Marco Leise wrote:

> There is one feature I remember caused some headaches for Code::Blocks. They used a separate parser instance for every project in the IDE, which meant that all the standard include files would be parsed and kept in memory multiple times. When they later switched to a common parser for all projects they ran into stability issues. If you can arrange it, it would be great for multi-project IDEs to be able to add and remove projects to your parser without reparsing Phobos/druntime (which may have its .di files replaced by .d files, looking at the past discussion).

The opposite situation: I didn't think about a parser per project :) So I guess no need to worry here.

> C bindings could be an option. (As in: the smallest common denominator.) They [...] Python, ...) to use your library.

Yeah, but I'm far from there yet.

Maybe I don't understand your use case, but the idea is that if you parse as you type, it should be possible to avoid parsing and allocating memory for those lines which have not changed. But that is not compatible with storing tokens in an array, since it would cause reallocating memory each time, so some other data structure should be used (e.g., a linked list or, if efficient lookup is needed, a red-black tree). Only benchmarks can show whether (and how much) my approach would be faster for a specific situation (input patterns like average size of data, complexity of parsing algorithms, requirements, etc.).

> I'm curious.

>>> Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;).

>> I'm sure it will :) (I'm going to elaborate on this some time later).

> I found that I usually either load from file into a newly allocated buffer (where a copy occurs, only because I forgot about assumeUnique in std.exception), or I am editing the file, in which case I recreate the source string after every keystroke anyway. I can still pass slices of that string to functions, though. Not sure what you mean.

Answered below.

> It probably doesn't work for D as well as for ASM code, but I could also check for \x1A and __EOF__ in the same fashion. (By the way, is it \x1A (substitute, ^Z), or did you mean \x04 (end-of-transmission, ^D)?)

D has the following EoF cases: \0, \x1A, physical EoF, and the __EOF__ special token when it is not inside a comment or certain string literals. ^D is not in this list.

> The way it works is: parser states like 'in expression' can safely peek at the next character at any time. If it doesn't match what they expect, they emit an error and drop back to the "surrounding" parser state. When they reach the "file" level, that's the only place where an EOF (which will only occur once per file anyway) will be consumed. In theory one can drop all string length checks and work on char* directly with a known terminator char that is distinct from anything else.

If you want to pass a slice of the input string to a function, you cannot append \0 to it without copying data. If you don't append some pre-defined character, you must check for length *and* all supported terminating characters. On the other hand, your design might not require passing slices, and if the language syntax allows deterministic parsing (when you know what to expect next), there is no need for checking EoF.
May 20 2012
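[Editor's note: the per-line reuse described above can be pictured with a small sketch. It is purely illustrative; `LineTokens`, `lexLine` and `relex` are invented names, not DCT's API, and real D tokens can span lines (block comments, multi-line strings), which this ignores. Tokens are grouped by the line they were lexed from, and after an edit only the groups whose source text changed are rebuilt.]

import std.algorithm : filter, splitter;
import std.array : array;
import std.string : splitLines;

/// Hypothetical per-line token group. Groups for unchanged lines are
/// reused between versions of the source instead of being re-lexed.
struct LineTokens
{
    string line;      // exact source text this group was lexed from
    string[] tokens;  // whatever the real lexer would emit for that line
}

/// Stand-in for the real lexer: split one line on spaces.
string[] lexLine(string line)
{
    return line.splitter(' ').filter!(t => t.length != 0).array;
}

/// Re-lex only the lines whose text changed; reuse the old groups otherwise.
LineTokens[] relex(LineTokens[] old, string newSource)
{
    LineTokens[] result;
    foreach (i, line; newSource.splitLines())
    {
        if (i < old.length && old[i].line == line)
            result ~= old[i];                          // unchanged: reuse
        else
            result ~= LineTokens(line, lexLine(line)); // changed: re-lex
    }
    return result;
}

void main()
{
    auto v1 = relex(null, "int x;\nint y;");
    auto v2 = relex(v1, "int x;\nlong y;");
    assert(v2[0].tokens is v1[0].tokens); // first line reused, not re-lexed
    assert(v2[1].tokens == ["long", "y;"]);
}

[A balanced tree keyed by line start, as mentioned above, would additionally give O(log n) lookup of a group by source index; the flat array here is just the simplest illustration.]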
On Sun, 20 May 2012 20:37:07 +0200, "Roman D. Boiko" <rb d-coding.com> wrote:

I had only thought about the single-use situation, not the situation when editing files. Now the idea has evolved a bit, and you probably thought about storing the scope hierarchy in a tree, too. It is still difficult to write a fast parser when someone can add/remove a '{' somewhere at the top of datetime.d, which changes the interpretation of the rest of the file. Mono-D has some hiccups worse than those (several seconds even) - maybe a bug. I'll keep my fingers crossed.

> Maybe I don't understand your use case, but the idea is that if you parse as you type, it should be possible to avoid parsing and allocating memory for those lines which have not changed. But that is not compatible with storing tokens in an array, since it would cause reallocating memory each time, so some other data structure should be used (e.g., a linked list or, if efficient lookup is needed, a red-black tree). Only benchmarks can show whether (and how much) my approach would be faster for a specific situation (input patterns like average size of data, complexity of parsing algorithms, requirements, etc.).

>> I'm curious.

>>>> Since assembly code is usually small I just preallocate an array of sourceCode.length tokens and realloc it to the correct size when I'm done parsing. Nothing pretty, but simple and I am sure it won't get any faster ;).

>>> I'm sure it will :) (I'm going to elaborate on this some time later).

> If you want to pass a slice of the input string to a function, you cannot append \0 to it without copying data. If you don't append some pre-defined character, you must check for length *and* all supported terminating characters. On the other hand, your design might not require passing slices, and if the language syntax allows deterministic parsing (when you know what to expect next), there is no need for checking EoF.

Now I get it!

-- Marco
May 20 2012
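[Editor's note: the trade-off behind that exchange can be made concrete with a rough sketch. It is hypothetical code, not taken from either lexer; the token-counting functions are stand-ins for real lexing and __EOF__ handling is left out. A lexer that accepts arbitrary slices must test the index plus every in-band terminator, while a lexer that owns its buffer can append a single sentinel and test one character, at the cost of a copy.]

/// A lexer that accepts arbitrary slices has to test the index *and*
/// every in-band terminator ('\0', '\x1A').
size_t countTokensInSlice(string code)
{
    size_t tokens, i;
    while (i < code.length && code[i] != '\0' && code[i] != '\x1A')
    {
        if (code[i] == ' ') { ++i; continue; }
        ++tokens; // start of a token
        while (i < code.length && code[i] != ' '
               && code[i] != '\0' && code[i] != '\x1A')
            ++i;
    }
    return tokens;
}

/// A lexer that owns its buffer can append one sentinel and test a single
/// character everywhere, but appending forces a copy if the caller only
/// had a slice.
size_t countTokensWithSentinel(string code)
{
    auto buf = code ~ '\0'; // copy + sentinel
    size_t tokens, i;
    while (buf[i] != '\0')
    {
        if (buf[i] == ' ') { ++i; continue; }
        ++tokens; // start of a token
        while (buf[i] != ' ' && buf[i] != '\0')
            ++i;
    }
    return tokens;
}

void main()
{
    assert(countTokensInSlice("int x = 1;"[0 .. 5]) == 2); // lexes only "int x"
    assert(countTokensWithSentinel("int x = 1;") == 4);
}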