digitalmars.D - std.d.lexer requirements
- Walter Bright (34/34) Aug 01 2012 Given the various proposals for a lexer module for Phobos, I thought I'd...
- Jakob Ovrum (9/10) Aug 01 2012 I agree with everything but this point. Tokens become quite large
- Walter Bright (3/11) Aug 01 2012 Doing a class requires a heap allocation per token, which will have disa...
- Jakob Ovrum (3/5) Aug 01 2012 No it doesn't. That's an implication of using the NewExpression,
- Walter Bright (2/3) Aug 01 2012 See my other reply to you on this topic.
- Bernard Helyer (1/4) Aug 01 2012 I assume you mean 'shouldn't'. :P
- Walter Bright (2/5) Aug 01 2012 Good catch.
- deadalnix (7/16) Aug 01 2012 I see the lexer as a function that take an range of char as input and
- Walter Bright (2/8) Aug 01 2012 Yes, as the lexer will have some state and some configuration parameters...
- Jonathan M Davis (9/18) Aug 01 2012 Couldn't that just be part of the range type? A separate lexer type sho...
- Walter Bright (4/16) Aug 01 2012 No, because the user may want to serially present ranges to the same lex...
- Jonathan M Davis (4/10) Aug 01 2012 Then just pass the same identifier table to the function which creates t...
- Walter Bright (13/15) Aug 01 2012 You're still going to require another type, otherwise you'll have to dup...
- Jonathan M Davis (5/12) Aug 01 2012 Why would you be duplicating any state in the token? The range would hav...
- Walter Bright (2/13) Aug 01 2012 I meant two types: the Lexer and the Token. The Lexer can present a rang...
- Jonathan M Davis (11/29) Aug 01 2012 What I was expecting there to be was a type which was a range of tokens...
- Walter Bright (2/11) Aug 01 2012 You could do it either way.
- Jonathan M Davis (5/15) Aug 01 2012 If you end up with enough pieces of state that you need to carry around
- H. S. Teoh (12/26) Aug 01 2012 [...]
- Walter Bright (4/10) Aug 01 2012 Yup. I keep trying to think of a way to lex multiple files at the same t...
- deadalnix (4/14) Aug 02 2012 That was exactly my reaction to the « let's reuse the identifier table »...
- Jason House (5/20) Aug 06 2012 The following is an incredibly fast multithreaded hash table. It
- Walter Bright (3/6) Aug 07 2012 It might if I understood it! There do seem to be some cases where fences...
- Jonathan M Davis (55/80) Aug 01 2012 But that's not how ranges of characters work. They're ranges of dchar. R...
- Walter Bright (27/108) Aug 01 2012 I have argued against making ranges of characters dchars, because of per...
- Jonathan M Davis (24/54) Aug 01 2012 1. The current design of Phobos is to have ranges of dchar, because it f...
- Walter Bright (19/44) Aug 01 2012 The lexer must use char or it will not be acceptable as anything but a t...
- Jonathan M Davis (13/25) Aug 01 2012 Avoiding decoding can be done with strings and operating on ranges of dc...
- Walter Bright (15/40) Aug 01 2012 1. Encoding it into a dchar is a performance problem. Source code sits i...
- Jacob Carlborg (13/16) Aug 01 2012 Your first requirement is a bit strange and confusing:
- Walter Bright (2/11) Aug 02 2012 I answered this point a few posts up in the thread.
- Jacob Carlborg (7/8) Aug 02 2012 I've read a few posts up and the only answer I found is that the lexer
- Walter Bright (2/8) Aug 02 2012 You can use an adapter range.
- Jonathan M Davis (54/68) Aug 02 2012 I think that we're misunderstanding each other here. A typical, well-wri...
- Walter Bright (16/42) Aug 02 2012 It *still* must convert UTF8 to dchars before presenting them to the con...
- Jacob Carlborg (4/19) Aug 01 2012 I think you two are saying the same things.
- Jonathan M Davis (16/26) Aug 01 2012 Another thing that I should point out is that a range of UTF-8 or UTF-16...
- Walter Bright (10/24) Aug 02 2012 My experience in writing fast string based code that worked on UTF8 and
- Bernard Helyer (2/6) Aug 02 2012 If you want to throw out some target times, that would be useful.
- Walter Bright (2/7) Aug 02 2012 As fast as the dmd one would be best.
- Jonathan M Davis (4/15) Aug 02 2012 How would we measure that? dmd's lexer is tied to dmd, so how would we t...
- Walter Bright (2/4) Aug 02 2012 Easy. Just make a special version of dmd that lexes only, and time it.
- Ed McCardell (9/14) Aug 03 2012 I made a lexing-only version of dmd at
- dennis luehring (6/21) Aug 03 2012 Walter mentioned an easier way - without "forking" dmd - should be
- Walter Bright (2/4) Aug 03 2012 Forking it is fine. This is just for a one-off benchmarking thing.
- dennis luehring (3/11) Aug 02 2012 would it be (easily) possible to "extract" the dmd lexer code (and
- dennis luehring (3/11) Aug 02 2012 can the dmd lexer seperated as an lib and become useable from outside -
- deadalnix (7/17) Aug 02 2012 That'd be great but . . .
- Adam D. Ruppe (4/6) Aug 02 2012 What if we're just using this lexer in something like a
- Marco Leise (8/15) Aug 02 2012 My pet peeve for editors is how to make the lexer work on only what I ju...
- Walter Bright (5/9) Aug 02 2012 A good IDE should do its parsing in a separate thread, so the main user ...
- Andrej Mitrovic (5/10) Aug 02 2012 Agreed, this is an editor/IDE domain problem. And there are some
- Jacob Carlborg (6/11) Aug 02 2012 It still needs to update the editor view with the correct syntax
- Andrej Mitrovic (5/8) Aug 02 2012 It can do that immediately for the text that's visible in the window
- Jacob Carlborg (4/8) Aug 02 2012 That should work if the lexer can handle it.
- Walter Bright (10/19) Aug 02 2012 The rendering code should be in yet a third thread.
- Dmitry Olshansky (11/34) Aug 02 2012 OT:
- Jacob Carlborg (5/11) Aug 02 2012 I'm not entirely sure what you mean be "rendering" but most GUI systems
- Dmitry Olshansky (7/16) Aug 03 2012 Draw thing to an off screen bitmap then blit it to window (aye, pass
- Jacob Carlborg (4/8) Aug 03 2012 Ok, now I see.
- Jacob Carlborg (15/24) Aug 02 2012 Most GUI systems are not thread safe. I know for sure that Cocoa on Mac
- deadalnix (3/16) Aug 03 2012 A common solution to that problem is to cancel the computation when user...
- Walter Bright (8/17) Aug 02 2012 A lexer is like a device driver, or a memory allocator. Having it be as ...
- Dmitry Olshansky (9/32) Aug 02 2012 +1 Another point is that it's crystal clear *how* to optimize lexer, and...
- Andrej Mitrovic (3/4) Aug 02 2012 In a single project or globally? I hope not the former, D should
- Walter Bright (5/7) Aug 02 2012 I'm less sure how well known it is. DMC++ was known for being *several t...
- Jacob Carlborg (6/9) Aug 02 2012 I'm not sure how easy it would be to just measure the lexing phase of
- dennis luehring (5/12) Aug 02 2012 wouldn't it be better to extract the lexer part of dmd into its own
- Jacob Carlborg (4/8) Aug 02 2012 That was what I meant. If it's easy someone would have already done it.
- Walter Bright (3/5) Aug 02 2012 You don't need to extract it to measure it. Just have it lex the source ...
- Jacob Carlborg (11/13) Aug 03 2012 Well, that's the problem. It's not like DMD has a single "lex" function
- Walter Bright (2/13) Aug 03 2012 Look in doc.c at highlightCode2() for how to call the lexer by itself.
- Jacob Carlborg (10/11) Aug 03 2012 So basically:
- Walter Bright (2/11) Aug 03 2012 Pretty much. Or just call lex.nextToken() until you get a TOKeof.
- Jonathan M Davis (15/38) Aug 02 2012 Recent benchmarks by Iain Buclaw show that lexing is not a bottleneck in...
- Jonathan M Davis (10/16) Aug 02 2012 It is for ranges in general. In the general case, a range of UTF-8 or UT...
- Walter Bright (2/10) Aug 02 2012 Yes, it can work.
- Jonathan M Davis (21/33) Aug 02 2012 How? If you operate on a range of code units, then you're operating on
- Walter Bright (5/24) Aug 02 2012 Keep a 6 character buffer in your consumer. If you read a char with the ...
- deadalnix (2/24) Aug 02 2012 How is that different than a manually done range of dchar ?
- Walter Bright (4/5) Aug 02 2012 The decoding is rarely necessary, even if non-ascii data is there. Howev...
- Jonathan M Davis (8/14) Aug 02 2012 Hence why special-casing must be used to deal with variably-length encod...
- Dmitry Olshansky (7/29) Aug 02 2012 4 bytes is enough.
- Walter Bright (4/11) Aug 02 2012 Yeah, but I thought 6 bytes would future proof it! (Inevitably, the Unic...
- Jonathan M Davis (38/58) Aug 02 2012 And how on earth is that going to work as a range? Range-based functions...
- travert phare.normalesup.org (Christophe Travert) (10/13) Aug 02 2012 In some case a range of dchar is useful. In some case a range of char is...
- Walter Bright (9/33) Aug 02 2012 1. read a character from the range
- Jonathan M Davis (26/27) Aug 02 2012 But that's the problem. The consumer has to treat character ranges speci...
- Andrei Alexandrescu (4/8) Aug 02 2012 It is generic! It's just in another dimension: it operates on any range
- Jonathan M Davis (13/22) Aug 02 2012 So, a function which does the buffering of code units like Walter sugges...
- Andrei Alexandrescu (8/21) Aug 02 2012 Of course, because it operates on bytes read from memory, files, or
- Jonathan M Davis (6/13) Aug 02 2012 It may be the best approach for the lexer (though I'm not convinced; I'l...
- Andrei Alexandrescu (7/20) Aug 02 2012 I think Walter has very often emphasized the need for the lexer to be
- Jonathan M Davis (27/37) Aug 02 2012 LOL. Well, I'm not about to decide on the best approach to this without
- Jonathan M Davis (28/33) Aug 02 2012 LOL. It looks like taking this approach results in almost identical code...
- Walter Bright (2/6) Aug 02 2012 +1
- Walter Bright (17/43) Aug 02 2012 No, the consumer can do its own buffering. It only needs a 4 character b...
- Jonathan M Davis (5/8) Aug 02 2012 I think that this is the main point of misunderstanding then. From your
- Andrei Alexandrescu (8/13) Aug 02 2012 The lexer must be configurable enough to tokenize other languages than
- Jonathan M Davis (7/20) Aug 02 2012 You're not going to get as fast a lexer if it's not written specifically...
- Timon Gehr (3/19) Aug 02 2012 It is a more general problem.
- Andrei Alexandrescu (3/7) Aug 02 2012 Do you have any evidence to back that up? I mean you're just saying it.
- Jonathan M Davis (16/24) Aug 02 2012 Because all of the rules are built directly into the code. You don't hav...
- Timon Gehr (14/41) Aug 02 2012 The parts that can be specified with simple regexen are certainly not a
- Jonathan M Davis (5/7) Aug 02 2012 If fully concur that we should have a generic lexer, but unless the gene...
- Philippe Sigaud (4/8) Aug 02 2012 I propose we let him finish std.lexer, test it with Jonathan, benchmark ...
- deadalnix (6/16) Aug 03 2012 I think it can be as fast if the lexer is compile time created. I'd also...
- Bernard Helyer (3/5) Aug 02 2012 You're going to have to defend that one.
- Andrei Alexandrescu (5/8) Aug 02 2012 I wouldn't know how to. To me it's all too obvious it's better to have a...
- Walter Bright (6/17) Aug 02 2012 I agree and I hope the three can combine their efforts with the best ide...
- Bernard Helyer (5/29) Aug 02 2012 If the other guys think they've got it, then I can withdraw my
- Jonathan M Davis (10/14) Aug 02 2012 I'm a fair ways along already. Making changes according to what Walter w...
- Bernard Helyer (5/27) Aug 02 2012 I'll let you get on with it then. I'll amuse myself with the
- Jonathan M Davis (6/10) Aug 02 2012 Well, if std.d.lexer is done correctly, then it should at least be very ...
- Jakob Ovrum (5/9) Aug 02 2012 I don't think you should give up on this adaption. SDC's lexer is
- Philippe Sigaud (11/24) Aug 02 2012 Look, many of us here were interested in your idea of having comments
- Piotr Szturmaj (3/16) Aug 02 2012 Working example:
- José Armando García Sancio (13/41) Aug 02 2012 A little bit off topic but...
- Jacob Carlborg (7/14) Aug 02 2012 I do understand that the lexer needs to be insanely fast and it needs to...
- Walter Bright (2/4) Aug 02 2012 Worst case use an adapter range.
- travert phare.normalesup.org (Christophe Travert) (7/14) Aug 02 2012 Yes
- Jacob Carlborg (4/5) Aug 02 2012 And that is better than a plain string?
- travert phare.normalesup.org (Christophe Travert) (2/8) Aug 02 2012 because its front method does not do any decoding.
- Jacob Carlborg (12/20) Aug 02 2012 If it was a plain string you wouldn't use "front". You would handle any,...
- travert phare.normalesup.org (Christophe Travert) (16/20) Aug 03 2012 I find it more comfortable to just use
- Walter Bright (3/11) Aug 03 2012 No may about it, I wouldn't be happy :-)
- Walter Bright (8/14) Aug 03 2012 For example, let's suppose we want to do D syntax highlighting in our ID...
- Tobias Pankrath (8/16) Aug 03 2012 Would this be an argument for putting the computation of source
- Walter Bright (2/8) Aug 03 2012 Worth thinking about.
- Dmitry Olshansky (11/39) Aug 02 2012 Well, it doesn't convert back to UTF-8 as it just slices of the input :)
- deadalnix (2/16) Aug 02 2012 Really nice idea. It is still easy to wrap the Range in another Range
- Walter Bright (4/6) Aug 02 2012 It's suboptimal for a high speed lexer/parser, as already explained.
- Jacob Carlborg (5/9) Aug 01 2012 I'm no expert on ranges but won't that prevent slicing? Slicing is one
- Walter Bright (5/12) Aug 02 2012 You don't want to use slicing on the lexer. The reason is that your slic...
- deadalnix (4/21) Aug 02 2012 Token are not kept in memory. You usually consume them for other
- Walter Bright (4/27) Aug 02 2012 The tokens are not kept, correct. But the identifier strings, and the st...
- deadalnix (15/47) Aug 03 2012 Ok, what do you think of that :
- travert phare.normalesup.org (Christophe Travert) (39/53) Aug 03 2012 If I may add, there are several possitilities here:
- Tobias Pankrath (4/13) Aug 03 2012 #6 None. Just provide information of token type and location
- Walter Bright (5/16) Aug 03 2012 You're talking about doing for strings what is done for identifiers - re...
- deadalnix (8/32) Aug 04 2012 That option have the benefice to allow very fast identifier comparison
- Jonathan M Davis (17/20) Aug 03 2012 String literals often _can't_ be slices unless you leave them in their
- travert phare.normalesup.org (Christophe Travert) (10/34) Aug 04 2012 I thought it was not the lexer's job to process litterals. Just split
- Dmitry Olshansky (4/36) Aug 04 2012 Most likely - since you re-read the same memory twice to do it.
- travert phare.normalesup.org (Christophe Travert) (12/13) Aug 04 2012 You're probably right, but if you do this right after the token is
- Dmitry Olshansky (11/23) Aug 04 2012 q{ .. }
- Jonathan M Davis (5/8) Aug 04 2012 It's starting to look like figuring out what should and shouldn't be
- Tobias Pankrath (5/17) Aug 04 2012 If we have a really fast lexer that is highly compile-time
- Dmitry Olshansky (34/41) Aug 04 2012 Let's add some meat to my post.
- Jonathan M Davis (12/25) Aug 05 2012 [snip]
- deadalnix (9/50) Aug 06 2012 It seems way too much.
- Philippe Sigaud (8/15) Aug 06 2012 I think one should pass it an empty symbol table and the lexer should
- Jacob Carlborg (5/9) Aug 06 2012 Especially when if the lexer is used in an IDE. The code is constantly
- Walter Bright (4/8) Aug 06 2012 That's why I suggested supplying a callback delegate to decide what to d...
- travert phare.normalesup.org (Christophe Travert) (6/16) Aug 07 2012 It may be easier to take into accound few cases (return error token and
- Jonathan M Davis (25/36) Aug 07 2012 to do
- Walter Bright (4/8) Aug 07 2012 It's not just overhead - it's just plain ugly to constantly check for er...
- Jonathan M Davis (10/21) Aug 07 2012 It's easier to see where in the range of tokens the errors occur. A dele...
- Jacob Carlborg (5/10) Aug 07 2012 Just pass the same token to the delegate that you would have returned
- travert phare.normalesup.org (Christophe Travert) (11/21) Aug 07 2012 That's what I would do. If you have to define a way to return error
- Walter Bright (3/8) Aug 07 2012 The delegate has a context pointer giving it a reference to whatever con...
- travert phare.normalesup.org (Christophe Travert) (24/33) Aug 07 2012 It's not necessarily ugly, because of the powerful range design. You can...
- Philippe Sigaud (18/21) Aug 07 2012 IIRC, I was not the only one, as people here interested in coding an
- Walter Bright (6/17) Aug 07 2012 Delegates are not C-isms.
- Philippe Sigaud (10/20) Aug 07 2012 And what a wonderful decision that was!
- Jonathan M Davis (23/26) Aug 07 2012 It doesn't really affect much to allow choosing between returning a toke...
- Walter Bright (4/7) Aug 07 2012 "When in doubt, leave it out."
- Jonathan M Davis (27/37) Aug 07 2012 On the whole, I agree, and I'm definitely leaving some of that stuff out...
- Walter Bright (4/12) Aug 07 2012 If the delegate returns, then the lexer recovers.
- travert phare.normalesup.org (Christophe Travert) (10/11) Aug 07 2012 That's an option, if there is only one way to recover (which is a
- Dmitry Olshansky (15/77) Aug 06 2012 Editor that highlights text may choose not to build identifier table at
- Jacob Carlborg (4/10) Aug 06 2012 The Eclipse plugin Descent can show formatted DDoc comments.
- Dmitry Olshansky (5/15) Aug 06 2012 I've meant that IDE may be interested in more then just DDoc, at least
- travert phare.normalesup.org (Christophe Travert) (10/18) Aug 04 2012 Yes, I figured-out a policy could be used to, but since the begining of
- Bernard Helyer (2/2) Aug 02 2012 http://i.imgur.com/oSXTc.png
- Walter Bright (2/4) Aug 02 2012 I'm afraid I'm mystified as to your point.
- Bernard Helyer (2/7) Aug 02 2012 Just that I'm slaving away over a hot IDE here. :P
- Walter Bright (2/3) Aug 02 2012 Ah! Well, keep on keepin' on, then!
- Chad J (2/4) Aug 04 2012 Hell yeah Alexander Brandon.
- dennis luehring (5/6) Aug 02 2012 then add a benchmark "suite" to the list - the lexer should be
- Piotr Szturmaj (4/8) Aug 02 2012 Why it is a mistake? I think Lexer should parse any UTF range and return...
- Walter Bright (6/15) Aug 02 2012 Because the lexer is large and it would have to have a lot of special ca...
- Andrei Alexandrescu (29/30) Aug 02 2012 Here's a crazy idea that I'll hang to this one remark. No, two crazy ide...
- Michel Fortin (30/35) Aug 02 2012 That's not a great surprise to me. I hit the same issues when writing
- Andrei Alexandrescu (8/34) Aug 02 2012 I agree frontUnit and popFrontUnit are more generic because they allow
- Jacob Carlborg (10/13) Aug 02 2012 The skype call you suggested:
- Andrei Alexandrescu (4/16) Aug 02 2012 Oh, ok, I thought I'd be on the call. That would be between Jonathan and...
- travert phare.normalesup.org (Christophe Travert) (9/14) Aug 02 2012 Any range of dchar could have a representation (or you may want to call
- Piotr Szturmaj (6/14) Aug 02 2012 Instead of returning whole strings, you may provide another sub-range
- Jonathan M Davis (5/17) Aug 02 2012 I'm sure it's possible. I just don't know how messy it will be. It depen...
- Brad Roberts (13/59) Aug 05 2012 To help with performance comparisons I ripped dmd's lexer out and got it...
- Walter Bright (2/14) Aug 05 2012 Thanks, Brad!
- Ary Manzana (3/7) Aug 06 2012 I believe there should be an option to get comments as tokens. Or
Given the various proposals for a lexer module for Phobos, I thought I'd share some characteristics it ought to have.

First of all, it should be suitable for, at a minimum:

1. compilers
2. syntax highlighting editors
3. source code formatters
4. html creation

To that end:

1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)
2. It should output an input range of tokens
3. tokens should be values, not classes
4. It should avoid memory allocation as much as possible
5. It should read or write any mutable global state outside of its "Lexer" instance
6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier table
7. It should accept a callback delegate for errors. That delegate should decide whether to:
   1. ignore the error (and "Lexer" will try to recover and continue)
   2. print an error message (and "Lexer" will try to recover and continue)
   3. throw an exception, "Lexer" is done with that input range
8. Lexer should be configurable as to whether it should collect information about comments and ddoc comments or not
9. Comments and ddoc comments should be attached to the next following token, they should not themselves be tokens
10. High speed matters a lot
11. Tokens should have begin/end line/column markers, though most of the time this can be implicitly determined
12. It should come with unittests that, using -cov, show 100% coverage

Basically, I don't want anyone to be motivated to do a separate one after seeing this one.
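As a rough sketch of what these requirements add up to (the names Lexer, Token, tokenize and ErrorHandler are purely illustrative, not the actual std.d.lexer interface), a toy version that recognizes only identifiers, integer literals and '+' might look like this; it shows value tokens, an input range of tokens, one shared identifier table, line/column tracking, and an error delegate:

import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;

enum TokType { eof, identifier, intLiteral, plus }

struct Token                        // requirement 3: a value, not a class
{
    TokType type;
    string  text;                   // a slice of the input; no allocation
    uint    line, column;           // requirement 11
}

alias ErrorHandler = void delegate(string msg, uint line, uint column);

struct Lexer                        // requirement 6: reusable, owns the identifier table
{
    uint[string] identifierTable;   // identifier => unique handle

    // Requirement 2: lexing yields an input range of tokens.
    auto tokenize(string src, ErrorHandler onError)
    {
        static struct TokenRange
        {
            Lexer* lexer;
            string src;
            ErrorHandler onError;
            Token current;
            uint line = 1, col = 1;

            bool  empty() const { return current.type == TokType.eof; }
            Token front() const { return current; }
            void  popFront()    { scan(); }

            void scan()
            {
                while (src.length && isWhite(src[0]))
                {
                    if (src[0] == '\n') { ++line; col = 1; } else ++col;
                    src = src[1 .. $];
                }
                if (!src.length) { current = Token(TokType.eof, null, line, col); return; }

                immutable startLine = line, startCol = col;
                if (src[0] == '+')
                {
                    current = Token(TokType.plus, src[0 .. 1], startLine, startCol);
                    src = src[1 .. $]; ++col;
                }
                else if (isAlpha(src[0]) || src[0] == '_')
                {
                    size_t i = 1;
                    while (i < src.length && (isAlphaNum(src[i]) || src[i] == '_')) ++i;
                    auto id = src[0 .. i];
                    if (id !in lexer.identifierTable)              // requirement 6
                        lexer.identifierTable[id] = cast(uint) lexer.identifierTable.length;
                    current = Token(TokType.identifier, id, startLine, startCol);
                    src = src[i .. $]; col += cast(uint) i;
                }
                else if (isDigit(src[0]))
                {
                    size_t i = 1;
                    while (i < src.length && isDigit(src[i])) ++i;
                    current = Token(TokType.intLiteral, src[0 .. i], startLine, startCol);
                    src = src[i .. $]; col += cast(uint) i;
                }
                else                                               // requirement 7
                {
                    onError("unexpected character", line, col);
                    src = src[1 .. $]; ++col;
                    scan();                                        // recover and continue
                }
            }
        }

        auto r = TokenRange(&this, src, onError);
        r.scan();                       // prime the range with the first token
        return r;
    }
}

A consumer would then write something like "Lexer lexer; foreach (tok; lexer.tokenize(source, &reportError)) { ... }" and feed the same lexer instance several sources serially so that they all share one identifier table.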
Aug 01 2012
On Thursday, 2 August 2012 at 00:11:15 UTC, Walter Bright wrote:3. tokens should be values, not classesI agree with everything but this point. Tokens become quite large when they have all kinds of string and position information on them, much larger than the typical recommended sizes for pass-by-value. It's still possible to provide an interface for them that doesn't require as much copying (ref returns and getters and whatnot), but it's fiddly and way too easy to copy these huge Token structs around, especially if the output is a range of Token structs.
Aug 01 2012
On 8/1/2012 5:21 PM, Jakob Ovrum wrote:On Thursday, 2 August 2012 at 00:11:15 UTC, Walter Bright wrote:Doing a class requires a heap allocation per token, which will have disastrous performance consequences.3. tokens should be values, not classesI agree with everything but this point. Tokens become quite large when they have all kinds of string and position information on them, much larger than the typical recommended sizes for pass-by-value. It's still possible to provide an interface for them that doesn't require as much copying (ref returns and getters and whatnot), but it's fiddly and way too easy to copy these huge Token structs around, especially if the output is a range of Token structs.
Aug 01 2012
On Thursday, 2 August 2012 at 04:00:54 UTC, Walter Bright wrote:Doing a class requires a heap allocation per token, which will have disastrous performance consequences.No it doesn't. That's an implication of using the NewExpression, not classes.
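Jakob's point, presumably, is that the NewExpression is what allocates: a class instance can instead be constructed in memory the caller already owns, for example with std.conv.emplace (std.typecons.scoped wraps the same trick for stack allocation). A hedged sketch with made-up names:

import core.stdc.stdlib : free, malloc;
import std.conv : emplace;

class TokenC
{
    int type;
    string text;
    this(int type, string text) { this.type = type; this.text = text; }
}

void example()
{
    enum size = __traits(classInstanceSize, TokenC);

    // Storage obtained once from a non-GC source (plain malloc here; a real
    // lexer would more likely carve instances out of a pre-allocated pool).
    void[] buffer = malloc(size)[0 .. size];
    scope(exit) free(buffer.ptr);

    // Construct the class instance in that storage: no NewExpression, no GC heap allocation.
    TokenC tok = emplace!TokenC(buffer, 1, "foo");
    assert(tok.type == 1 && tok.text == "foo");
}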
Aug 01 2012
On 8/1/2012 9:02 PM, Jakob Ovrum wrote:No it doesn't. That's an implication of using the NewExpression, not classes.See my other reply to you on this topic.
Aug 01 2012
5. It should read or write any mutable global state outside of its "Lexer" instanceI assume you mean 'shouldn't'. :P
Aug 01 2012
On 8/1/2012 6:30 PM, Bernard Helyer wrote:Good catch.5. It should read or write any mutable global state outside of its "Lexer" instanceI assume you mean 'shouldn't'. :P
Aug 01 2012
Le 02/08/2012 02:10, Walter Bright a écrit :6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier tableI see the lexer as a function that take an range of char as input and give back a range of token. Does it make sense to make an instance of a lexer ?7. It should accept a callback delegate for errors. That delegate should decide whether to: 1. ignore the error (and "Lexer" will try to recover and continue) 2. print an error message (and "Lexer" will try to recover and continue) 3. throw an exception, "Lexer" is done with that input rangeOff topic, but it look like the condition proposal from H.S. Teoh and myself.Basically, I don't want anyone to be motivated to do a separate one after seeing this one.That would be awesome !
Aug 01 2012
On 8/1/2012 7:04 PM, deadalnix wrote:Le 02/08/2012 02:10, Walter Bright a écrit :Yes, as the lexer will have some state and some configuration parameters.6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier tableI see the lexer as a function that take an range of char as input and give back a range of token. Does it make sense to make an instance of a lexer ?
Aug 01 2012
On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:
On 8/1/2012 7:04 PM, deadalnix wrote:
Le 02/08/2012 02:10, Walter Bright a écrit :
6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier table
I see the lexer as a function that take an range of char as input and give back a range of token. Does it make sense to make an instance of a lexer ?
Yes, as the lexer will have some state and some configuration parameters.
Couldn't that just be part of the range type? A separate lexer type shouldn't be necessary.

- Jonathan M Davis
On 8/1/2012 9:06 PM, Jonathan M Davis wrote:On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:No, because the user may want to serially present ranges to the same lexer type. Why do that, you might say? That's how dmd works - it means the identifier table will be the same for all source files.On 8/1/2012 7:04 PM, deadalnix wrote:Couldn't that just be part of the range type? A separate lexer type shouldn't be necessary.Le 02/08/2012 02:10, Walter Bright a écrit :Yes, as the lexer will have some state and some configuration parameters.6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier tableI see the lexer as a function that take an range of char as input and give back a range of token. Does it make sense to make an instance of a lexer ?
Aug 01 2012
On Wednesday, August 01, 2012 21:47:22 Walter Bright wrote:Then just pass the same identifier table to the function which creates the token range. That doesn't require another type. - Jonathan M DavisCouldn't that just be part of the range type? A separate lexer type shouldn't be necessary.No, because the user may want to serially present ranges to the same lexer type. Why do that, you might say? That's how dmd works - it means the identifier table will be the same for all source files.
Aug 01 2012
On 8/1/2012 9:54 PM, Jonathan M Davis wrote:Then just pass the same identifier table to the function which creates the token range. That doesn't require another type.You're still going to require another type, otherwise you'll have to duplicate the state in every token allocation, with resultant heavy memory and initialization costs. Please keep in mind that a lexer is not something you just pass a few short strings to. It's very very very performance critical as all those extra instructions add up to long delays when you're shoving millions of lines of code into its maw. For the same reason you're also not going to want the lexer putting pressure on the GC. It could bring your whole system down. To get a high performance lexer, you're going to be counting the average number of instructions executed per input character. Each one counts. Shaving one off is a victory. You should also be thinking about memory cache access patterns.
Aug 01 2012
On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:On 8/1/2012 9:54 PM, Jonathan M Davis wrote:Why would you be duplicating any state in the token? The range would have the state, not the token. The token would only have the data specific to the token, not the range as a whole. - Jonathan M DavisThen just pass the same identifier table to the function which creates the token range. That doesn't require another type.You're still going to require another type, otherwise you'll have to duplicate the state in every token allocation, with resultant heavy memory and initialization costs.
Aug 01 2012
On 8/1/2012 10:34 PM, Jonathan M Davis wrote:On Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:I meant two types: the Lexer and the Token. The Lexer can present a range interface.On 8/1/2012 9:54 PM, Jonathan M Davis wrote:Why would you be duplicating any state in the token? The range would have the state, not the token. The token would only have the data specific to the token, not the range as a whole.Then just pass the same identifier table to the function which creates the token range. That doesn't require another type.You're still going to require another type, otherwise you'll have to duplicate the state in every token allocation, with resultant heavy memory and initialization costs.
Aug 01 2012
On Wednesday, August 01, 2012 22:41:01 Walter Bright wrote:On 8/1/2012 10:34 PM, Jonathan M Davis wrote:What I was expecting there to be was a type which was a range of tokens. You passed the source string to a function which returned that range, and you iterated over it to process each token. What you appear to have been arguing for is another type which you get the range from which holds additional state (e.g. the indentifier table). So, you're saying now that you only meant that the token type needs to be separate from the range type (which I would have thought would be the only way to do it), or are you indeed saying that there should be another type that the range comes from and which holds additional state? - Jonathan M DavisOn Wednesday, August 01, 2012 22:09:05 Walter Bright wrote:I meant two types: the Lexer and the Token. The Lexer can present a range interface.On 8/1/2012 9:54 PM, Jonathan M Davis wrote:Why would you be duplicating any state in the token? The range would have the state, not the token. The token would only have the data specific to the token, not the range as a whole.Then just pass the same identifier table to the function which creates the token range. That doesn't require another type.You're still going to require another type, otherwise you'll have to duplicate the state in every token allocation, with resultant heavy memory and initialization costs.
Aug 01 2012
On 8/1/2012 10:53 PM, Jonathan M Davis wrote:What I was expecting there to be was a type which was a range of tokens. You passed the source string to a function which returned that range, and you iterated over it to process each token. What you appear to have been arguing for is another type which you get the range from which holds additional state (e.g. the indentifier table). So, you're saying now that you only meant that the token type needs to be separate from the range type (which I would have thought would be the only way to do it), or are you indeed saying that there should be another type that the range comes from and which holds additional state?You could do it either way.
Aug 01 2012
On Wednesday, August 01, 2012 21:54:01 Jonathan M Davis wrote:On Wednesday, August 01, 2012 21:47:22 Walter Bright wrote:If you end up with enough pieces of state that you need to carry around between files/strings being lexed, then creating another object makes sense, but I would hope that the amount of global state like that would be minimal. - Jonathan M DavisThen just pass the same identifier table to the function which creates the token range. That doesn't require another type.Couldn't that just be part of the range type? A separate lexer type shouldn't be necessary.No, because the user may want to serially present ranges to the same lexer type. Why do that, you might say? That's how dmd works - it means the identifier table will be the same for all source files.
Aug 01 2012
On Wed, Aug 01, 2012 at 09:06:42PM -0700, Jonathan M Davis wrote:On Wednesday, August 01, 2012 21:00:53 Walter Bright wrote:[...] Whether it's part of the range type or a separate lexer type, *definitely* make it possible to have multiple instances. One of the biggest flaws of otherwise-good lexer generators like lex and flex (C/C++) is that the core code assumes a single instance, and multi-instances were glued on after the fact, making it a royal pain to work with anything that needs lexing multiple things at the same time. T -- The day Microsoft makes something that doesn't suck is probably the day they start making vacuum cleaners... -- SlashdotterOn 8/1/2012 7:04 PM, deadalnix wrote:Couldn't that just be part of the range type? A separate lexer type shouldn't be necessary.Le 02/08/2012 02:10, Walter Bright a écrit :Yes, as the lexer will have some state and some configuration parameters.6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier tableI see the lexer as a function that take an range of char as input and give back a range of token. Does it make sense to make an instance of a lexer ?
Aug 01 2012
On 8/1/2012 9:41 PM, H. S. Teoh wrote:Whether it's part of the range type or a separate lexer type, *definitely* make it possible to have multiple instances. One of the biggest flaws of otherwise-good lexer generators like lex and flex (C/C++) is that the core code assumes a single instance, and multi-instances were glued on after the fact, making it a royal pain to work with anything that needs lexing multiple things at the same time.Yup. I keep trying to think of a way to lex multiple files at the same time in separate threads, but the problem is serializing access to the identifier table will likely kill off any perf gain.
Aug 01 2012
Le 02/08/2012 06:48, Walter Bright a écrit :On 8/1/2012 9:41 PM, H. S. Teoh wrote:That was exactly my reaction to the « let's reuse the identifier table » comment of yours. The future is multicore.Whether it's part of the range type or a separate lexer type, *definitely* make it possible to have multiple instances. One of the biggest flaws of otherwise-good lexer generators like lex and flex (C/C++) is that the core code assumes a single instance, and multi-instances were glued on after the fact, making it a royal pain to work with anything that needs lexing multiple things at the same time.Yup. I keep trying to think of a way to lex multiple files at the same time in separate threads, but the problem is serializing access to the identifier table will likely kill off any perf gain.
Aug 02 2012
On Thursday, 2 August 2012 at 04:48:56 UTC, Walter Bright wrote:On 8/1/2012 9:41 PM, H. S. Teoh wrote:The following is an incredibly fast multithreaded hash table. It is both lock-free and fence-free. Would something like that solve your problem? http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdfWhether it's part of the range type or a separate lexer type, *definitely* make it possible to have multiple instances. One of the biggest flaws of otherwise-good lexer generators like lex and flex (C/C++) is that the core code assumes a single instance, and multi-instances were glued on after the fact, making it a royal pain to work with anything that needs lexing multiple things at the same time.Yup. I keep trying to think of a way to lex multiple files at the same time in separate threads, but the problem is serializing access to the identifier table will likely kill off any perf gain.
Aug 06 2012
On 8/6/2012 5:14 PM, Jason House wrote:The following is an incredibly fast multithreaded hash table. It is both lock-free and fence-free. Would something like that solve your problem? http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdfIt might if I understood it! There do seem to be some cases where fences are required. This would take considerable study.
Aug 07 2012
On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)But that's not how ranges of characters work. They're ranges of dchar. Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create special wrappers around string or wstring to have ranges of UTF-8. The way that it's normally done is to have ranges of dchar and then special-case range-based functions for strings. Then the function can operate on any range of dchar but still operates on strings efficiently. So, it's quite possible to make a lexer operate on ranges of dchar but be operating primarily on ASCII by special-casing strings and avoiding decoding as much as possible. My lexer does a good job of this, I believe, so it's thoroughly optimized for strings, but it's still operating on ranges of dchar, not ranges of UTF-8.2. It should output an input range of tokens 3. tokens should be values, not classes 4. It should avoid memory allocation as much as possible 5. It should read or write any mutable global state outside of its "Lexer" instance 6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier tableWhen you say identifier table, do you mean symbol table, or something else? Isn't the symbol table something for the parser to worry about? Other than parsers, I can't think of anything which would even _care_ about a symbol table. And depending on how a symbol table were done, you'd want it to take scoping rules into account (_I'd_ certainly expect it to), meaning that it only includes identifiers which are relevant to the current scope. And if that's the case, it _really_ doesn't belong in the lexer. The code using the lexer can easily have its own table that it populates according to however it wants it populated by processing the identifier tokens as they get lexed. Not to mention, don't symbol tables usually include type information that a lexer _wouldn't_ have? Sure, a symbol table _could_ be put in the lexer (at least if it doesn't care about scoping), but I definitely question that that's the right place for it. If you mean a table that simply lists identifiers rather than a symbol table, I'm not quite sure why you'd want it in addition to the symbol table, and again, I'd expect only parsers to care about that sort of thing, so I'd expect the parser to be implementing it. So, what exactly are you looking for the lexer to implement as far as an identifier table goes, and why is it in the lexer rather than the parser?7. It should accept a callback delegate for errors. That delegate should decide whether to: 1. ignore the error (and "Lexer" will try to recover and continue) 2. print an error message (and "Lexer" will try to recover and continue) 3. throw an exception, "Lexer" is done with that input rangeI'm currently treating errors as tokens. It then becomes easy for the code using the lexer to just ignore the errors, to process them immediately, or to put off dealing with them until the lexing is complete. So, the code using the lexer can handle errors however and whenever it likes without having to worry about delegates or exceptions. 
Since tokens are lexed lazily, the fact that an error is reported as a token doesn't require that the lexing continue, but it also makes it _easy_ to continue lexing, ignoring or saving the error. I'm inclined to think that that's a superior approach to using delegates and exceptions.8. Lexer should be configurable as to whether it should collect information about comments and ddoc comments or not 9. Comments and ddoc comments should be attached to the next following token, they should not themselves be tokensWhy? I'd argue for just processing them as tokens just like I'm doing with errors. The code using the lexer can then process them or ignore them as it likes (it can even specifically look for them and ignore all other tokens if it's something which only operates on comments). And if you want to see which token follows the comment, it's the next one in the range. So, I don't see much need to attach comments to other tokens. What's the advantage of not treating them as tokens?12. It should come with unittests that, using -cov, show 100% coverage100% is actually unreasonable, because there are lines which should _never_ be hit (e.g. an assert(0) line in a switch statement). All lines _other_ than those sort of lines should have code coverage, but depending on the lexer implementation, 100% is impossible. std.datetime has a ton of unit tests and IIRC it still only manages 98% because of assert(0) lines. So, I agree with the spirit of your statement, but it's not necessarily reasonable to do exactly what you're saying. - Jonathan M Davis
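To make the errors-as-tokens idea concrete, it would look roughly like this (TokType and Token are hypothetical names, not from an actual implementation). The trade-off Walter objects to below is visible in option 1, where every consumer of the token stream has to check for the error kind:

import std.algorithm : filter;

enum TokType { identifier, intLiteral, comment, error /* ... */ }

struct Token
{
    TokType type;
    string  text;          // for an error token, the diagnostic message
    uint    line, column;
}

// Option 1: deal with errors inline, at the point in the stream where they occur.
void parseWithDiagnostics(R)(R tokens)
{
    foreach (tok; tokens)
    {
        if (tok.type == TokType.error)
        {
            // report tok.text at tok.line/tok.column, then keep going
            continue;
        }
        // ... normal handling of the token ...
    }
}

// Option 2: a consumer that only wants well-formed tokens strips the errors
// out up front and never mentions them again.
auto ignoringErrors(R)(R tokens)
{
    return tokens.filter!(t => t.type != TokType.error);
}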
Aug 01 2012
On 8/1/2012 8:04 PM, Jonathan M Davis wrote:On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:I have argued against making ranges of characters dchars, because of performance reasons. This will especially adversely affect the performance of the lexer. The fact is, files are normally in UTF8 and just about everything else is in UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that the single largest thing holding back regex performance is that premature conversion to dchar and back to char. If lexer is required to accept dchar ranges, its performance will drop at least in half, and people are going to go reinvent their own lexers.1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)But that's not how ranges of characters work. They're ranges of dchar. Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create special wrappers around string or wstring to have ranges of UTF-8. The way that it's normally done is to have ranges of dchar and then special-case range-based functions for strings. Then the function can operate on any range of dchar but still operates on strings efficiently.So, it's quite possible to make a lexer operate on ranges of dchar but be operating primarily on ASCII by special-casing strings and avoiding decoding as much as possible. My lexer does a good job of this, I believe, so it's thoroughly optimized for strings, but it's still operating on ranges of dchar, not ranges of UTF-8.All identifiers are entered into a hashtable, and are referred to by pointers into that hashtable for the rest of dmd. This makes symbol lookups incredibly fast, as no string comparisons are done.2. It should output an input range of tokens 3. tokens should be values, not classes 4. It should avoid memory allocation as much as possible 5. It should read or write any mutable global state outside of its "Lexer" instance 6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier tableWhen you say identifier table, do you mean symbol table, or something else?Isn't the symbol table something for the parser to worry about? Other than parsers, I can't think of anything which would even _care_ about a symbol table. And depending on how a symbol table were done, you'd want it to take scoping rules into account (_I'd_ certainly expect it to), meaning that it only includes identifiers which are relevant to the current scope. And if that's the case, it _really_ doesn't belong in the lexer. The code using the lexer can easily have its own table that it populates according to however it wants it populated by processing the identifier tokens as they get lexed. 
Not to mention, don't symbol tables usually include type information that a lexer _wouldn't_ have?The symbol table becomes a symbol table of pointers into the lexer's identifier table.Sure, a symbol table _could_ be put in the lexer (at least if it doesn't care about scoping), but I definitely question that that's the right place for it.The symbol table isn't in the lexer, the identifier string table is.If you mean a table that simply lists identifiers rather than a symbol table, I'm not quite sure why you'd want it in addition to the symbol table, and again, I'd expect only parsers to care about that sort of thing, so I'd expect the parser to be implementing it.Because it is gawdammed fast to do it that way. An identifier is treated as a string exactly once, in the lexer, and thereafter by its unique handle.So, what exactly are you looking for the lexer to implement as far as an identifier table goes, and why is it in the lexer rather than the parser?Very simple. A mapping of [identifier string] => [unique value], and a pointer servers nicely as that unique value.I don't agree. It means that the parser has to check everywhere for error tokens. Besides the ugliness, it means a slowdown for those checks.7. It should accept a callback delegate for errors. That delegate should decide whether to: 1. ignore the error (and "Lexer" will try to recover and continue) 2. print an error message (and "Lexer" will try to recover and continue) 3. throw an exception, "Lexer" is done with that input rangeI'm currently treating errors as tokens. It then becomes easy for the code using the lexer to just ignore the errors, to process them immediately, or to put off dealing with them until the lexing is complete. So, the code using the lexer can handle errors however and whenever it likes without having to worry about delegates or exceptions. Since tokens are lexed lazily, the fact that an error is reported as a token doesn't require that the lexing continue, but it also makes it _easy_ to continue lexing, ignoring or saving the error. I'm inclined to think that that's a superior approach to using delegates and exceptions.Because then the parser would have to add checks for 0 or more comment tokens between every other token. It uglifies the parser pretty badly, and slows it down.8. Lexer should be configurable as to whether it should collect information about comments and ddoc comments or not 9. Comments and ddoc comments should be attached to the next following token, they should not themselves be tokensWhy?I'd argue for just processing them as tokens just like I'm doing with errors. The code using the lexer can then process them or ignore them as it likes (it can even specifically look for them and ignore all other tokens if it's something which only operates on comments). And if you want to see which token follows the comment, it's the next one in the range. So, I don't see much need to attach comments to other tokens. What's the advantage of not treating them as tokens?Performance and simplicity. Your grammar will be awfully ugly unless you insert another range to filter out those tokens - but that'll be a slowdown.I know, but I expect all unreachable code to be justified. The lexer is a well defined problem, and this should not be difficult.12. It should come with unittests that, using -cov, show 100% coverage100% is actually unreasonable, because there are lines which should _never_ be hit (e.g. an assert(0) line in a switch statement). 
All lines _other_ than those sort of lines should have code coverage, but depending on the lexer implementation, 100% is impossible. std.datetime has a ton of unit tests and IIRC it still only manages 98% because of assert(0) lines. So, I agree with the spirit of your statement, but it's not necessarily reasonable to do exactly what you're saying.
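A rough sketch of the interning scheme being described: one node per distinct identifier spelling, handed out by the lexer, so that everything downstream compares identifiers by pointer alone. This mirrors the idea behind dmd's identifier table, not its actual code; the names are made up.

// One Identifier node per distinct spelling. Everything downstream of the
// lexer holds only the pointer, so identifier comparison and hashing never
// touch the string again.
struct Identifier
{
    string name;
    // a real implementation might also cache a token kind for keywords here
}

struct IdentifierTable
{
    private Identifier*[string] table;   // spelling => unique node

    Identifier* intern(string spelling)
    {
        if (auto p = spelling in table)
            return *p;
        auto id = new Identifier(spelling);
        table[spelling] = id;
        return id;
    }
}

unittest
{
    IdentifierTable tab;
    auto a = tab.intern("foo");
    auto b = tab.intern("foo");
    assert(a is b);                      // same handle: a pointer comparison
    assert(a !is tab.intern("bar"));
}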
Aug 01 2012
On Wednesday, August 01, 2012 21:30:44 Walter Bright wrote:On 8/1/2012 8:04 PM, Jonathan M Davis wrote:1. The current design of Phobos is to have ranges of dchar, because it fosters code correctness (though it can harm efficiency). It's arguably too late to do otherwise. Certainly, doing otherwise now would break a lot of code. If the lexer tried to operate on UTF-8 as part of its API rather than operating on ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos at all. 2. The lexer does _not_ have to have its performance tank by accepting ranges of dchar. It's true that the performance will be harmed for ranges which _aren't_ strings, but for strings (as would be by far the most common use case) it can be very efficient by special-casing them. And as much as there are potential performance issues with Phobos' choice of treating strings as ranges of dchar, if it were to continue to treat them as ranges of code units, it's pretty much a guarantee that there would be a _ton_ of bugs caused by it. Most programmers have absolutely no clue about how unicode works and don't really want to know. They want strings to just work. Phobos' approach of defaulting to correct but making it possible to make the code faster through specializations is very much in line with D's typical approach of making things safe by default but allowing the programmer to do unsafe things for optimizations when they know what they're doing.On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:I have argued against making ranges of characters dchars, because of performance reasons. This will especially adversely affect the performance of the lexer. The fact is, files are normally in UTF8 and just about everything else is in UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that the single largest thing holding back regex performance is that premature conversion to dchar and back to char. If lexer is required to accept dchar ranges, its performance will drop at least in half, and people are going to go reinvent their own lexers.1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)But that's not how ranges of characters work. They're ranges of dchar. Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create special wrappers around string or wstring to have ranges of UTF-8. The way that it's normally done is to have ranges of dchar and then special-case range-based functions for strings. Then the function can operate on any range of dchar but still operates on strings efficiently.All identifiers are entered into a hashtable, and are referred to by pointers into that hashtable for the rest of dmd. This makes symbol lookups incredibly fast, as no string comparisons are done.Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing else will care about it. At most, it should be an optional feature of the lexer. But it certainly could be added that way. - Jonathan M Davis
Aug 01 2012
On 8/1/2012 9:52 PM, Jonathan M Davis wrote:1. The current design of Phobos is to have ranges of dchar, because it fosters code correctness (though it can harm efficiency). It's arguably too late to do otherwise. Certainly, doing otherwise now would break a lot of code. If the lexer tried to operate on UTF-8 as part of its API rather than operating on ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos at all.The lexer must use char or it will not be acceptable as anything but a toy for performance reasons.2. The lexer does _not_ have to have its performance tank by accepting ranges of dchar. It's true that the performance will be harmed for ranges which _aren't_ strings, but for strings (as would be by far the most common use case) it can be very efficient by special-casing them.Somebody has to convert the input files into dchars, and then back into chars. That blows for performance. Think billions and billions of characters going through, not just a few random strings. Always always think of the lexer as having a firehose of data shoved into its maw, and it better be thirsty!And as much as there are potential performance issues with Phobos' choice of treating strings as ranges of dchar, if it were to continue to treat them as ranges of code units, it's pretty much a guarantee that there would be a _ton_ of bugs caused by it. Most programmers have absolutely no clue about how unicode works and don't really want to know. They want strings to just work. Phobos' approach of defaulting to correct but making it possible to make the code faster through specializations is very much in line with D's typical approach of making things safe by default but allowing the programmer to do unsafe things for optimizations when they know what they're doing.I expect std.d.lexer to handle UTF8 correctly, so I don't think this should be an issue in this particular case. dmd's lexer does handle UTF8 correctly. Note also that the places where non-ASCII characters can appear in correct D code is severely limited, and there are even fewer places where multibyte characters need to be decoded at all, and the lexer takes full advantage of this to boost its speed. For example, non-ASCII characters can appear in comments, but they DO NOT need to be decoded, and even just having the test for a non-ASCII character in the comment scanner will visibly slow down the lexer.I hate to say "trust me on this", but if you don't, have a look at dmd's lexer and how it handles identifiers, then look at dmd's symbol table.All identifiers are entered into a hashtable, and are referred to by pointers into that hashtable for the rest of dmd. This makes symbol lookups incredibly fast, as no string comparisons are done.Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing else will care about it. At most, it should be an optional feature of the lexer. But it certainly could be added that way.
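The comment-scanning point is worth making concrete: in well-formed UTF-8 every byte of a multi-byte sequence is 0x80 or above, so while scanning for the '\n' that ends a // comment (or the */ that ends a block comment) non-ASCII bytes can simply be stepped over without decoding. A minimal sketch (hypothetical helper, not dmd's code):

// i indexes the first byte after the leading "//"; returns the index of the
// terminating '\n' (or src.length at end of input). Multi-byte UTF-8 sequences
// consist only of bytes >= 0x80, so they can never be mistaken for '\n' and
// are stepped over untouched.
size_t skipLineComment(const(char)[] src, size_t i)
{
    while (i < src.length && src[i] != '\n')
        ++i;
    return i;
}

The same reasoning covers string literal bodies: the only bytes the scanner has to recognize there are the ASCII delimiter and escape characters.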
Aug 01 2012
On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:The lexer must use char or it will not be acceptable as anything but a toy for performance reasons.Avoiding decoding can be done with strings and operating on ranges of dchar, so you'd be operating almost entirely on ASCII. Are you saying that there's a performance issue aside from decoding?Somebody has to convert the input files into dchars, and then back into chars. That blows for performance. Think billions and billions of characters going through, not just a few random strings.Why is there any converting to dchar going on here? I don't see why any would be necessary. If you reading in a file as a string or char[] (as would be typical), then you're operating on a string, and then the only time that any decoding will be necessary is when you actually need to operate on a unicode character, which is very rare in D's grammar. It's only when operating on something _other_ than a string that you'd have to actually deal with dchars.My point is that it's the sort of thing that _only_ a parser would care about. So, unless it _needs_ to be in the lexer for some reason, it shouldn't be. - Jonathan M DavisHmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing else will care about it. At most, it should be an optional feature of the lexer. But it certainly could be added that way.I hate to say "trust me on this", but if you don't, have a look at dmd's lexer and how it handles identifiers, then look at dmd's symbol table.
Aug 01 2012
On 8/1/2012 10:44 PM, Jonathan M Davis wrote:On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:1. Encoding it into a dchar is a performance problem. Source code sits in files that are nearly always in UTF8. So your input range MUST check every single char and convert it to UTF32 as necessary. Plus, there's that additional step removed from sticking the file input buffer directly into the lexer's input. 2. Storing strings as dchars is a performance and memory problem (4x as much data and hence time). Remember, nearly ALL of D source will be ASCII. All performance considerations must be tilted towards the usual case.The lexer must use char or it will not be acceptable as anything but a toy for performance reasons.Avoiding decoding can be done with strings and operating on ranges of dchar, so you'd be operating almost entirely on ASCII. Are you saying that there's a performance issue aside from decoding?Because your input range is a range of dchar?Somebody has to convert the input files into dchars, and then back into chars. That blows for performance. Think billions and billions of characters going through, not just a few random strings.Why is there any converting to dchar going on here?I don't see why any would be necessary. If you reading in a file as a string or char[] (as would be typical), then you're operating on a string, and then the only time that any decoding will be necessary is when you actually need to operate on a unicode character, which is very rare in D's grammar. It's only when operating on something _other_ than a string that you'd have to actually deal with dchars.That's what I've been saying. So why have an input range of dchars, which must be decoded in advance, otherwise it wouldn't be a range of dchars?I think you are misunderstanding. The lexer doesn't have a *symbol* table in it. It has a mapping from identifiers to unique handles. It needs to be there, otherwise the semantic analysis has to scan identifier strings a second time.My point is that it's the sort of thing that _only_ a parser would care about. So, unless it _needs_ to be in the lexer for some reason, it shouldn't be.Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing else will care about it. At most, it should be an optional feature of the lexer. But it certainly could be added that way.I hate to say "trust me on this", but if you don't, have a look at dmd's lexer and how it handles identifiers, then look at dmd's symbol table.
Aug 01 2012
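To make the identifier handling concrete, here is a minimal sketch of the interning Walter describes, with hypothetical names (IdentifierTable, intern) rather than dmd's actual data structures: the lexer hashes each identifier string once and every later stage compares the returned handles instead of rescanning strings.

struct Identifier
{
    string text;   // the single canonical copy of the spelling
}

final class IdentifierTable
{
    private Identifier*[string] table;   // built-in AA used as the hash map

    // Returns the unique Identifier for `spelling`, creating it on first use,
    // so the parser and semantic passes compare handles, not strings.
    Identifier* intern(string spelling)
    {
        if (auto existing = spelling in table)
            return *existing;
        auto id = new Identifier(spelling);
        table[spelling] = id;
        return id;
    }
}

unittest
{
    auto ids = new IdentifierTable;
    assert(ids.intern("foo") is ids.intern("foo"));    // same handle every time
    assert(ids.intern("foo") !is ids.intern("bar"));
}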
On 2012-08-02 08:39, Walter Bright wrote:That's what I've been saying. So why have an input range of dchars, which must be decoded in advance, otherwise it wouldn't be a range of dchars?Your first requirement is a bit strange and confusing: "1. It should accept as input an input range of UTF8." To my understanding, ranges in Phobos are designed to operate on dchars, not chars, regardless of the original input type. So if you can create a range that operates on UTF8, I don't know if that will be usable with the rest of the Phobos functions expecting ranges of dchars. Basically, that makes this type of range useless since it cannot use any other functionality in Phobos. The current state of Phobos: you either have a range of dchars or you don't have a range; you have an array or something else. -- /Jacob Carlborg
Aug 01 2012
On 8/1/2012 11:59 PM, Jacob Carlborg wrote:Your first requirement is a bit strange and confusing: "1. It should accept as input an input range of UTF8." To my understanding, ranges in Phobos are designed to operate on dchars, not chars, regardless of the original input type. So if you can create a range that operates on UTF8, I don't know if that will be usable with the rest of the Phobos functions expecting ranges of dchars. Basically, that makes this type of range useless since it cannot use any other functionality in Phobos. The current state of Phobos: you either have a range of dchars or you don't have a range; you have an array or something else.I answered this point a few posts up in the thread.
Aug 02 2012
On 2012-08-02 09:21, Walter Bright wrote:I answered this point a few posts up in the thread.I've read a few posts up and the only answer I found is that the lexer needs to operate on chars. But it does not answer the question of how that range type would be used by all other range based functions in Phobos. Have I missed this? Can you please link to your post? -- /Jacob Carlborg
Aug 02 2012
On 8/2/2012 12:29 AM, Jacob Carlborg wrote:On 2012-08-02 09:21, Walter Bright wrote:You can use an adapter range.I answered this point a few posts up in the thread.I've read a few posts up and the only answer I found is that the lexer needs to operate on chars. But it does not answer the question of how that range type would be used by all other range based functions in Phobos. Have I missed this? Can you please link to your post?
Aug 02 2012
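One plausible reading of the "adapter range" answer, sketched under the assumption that the lexer keeps its input as raw UTF-8 and a wrapper is added only where a dchar-based Phobos function is actually needed (ByDchar is a hypothetical name; for plain strings Phobos already provides this view itself):

import std.utf : decode, stride;

// Presents a UTF-8 slice as an input range of dchar, decoding lazily.
struct ByDchar
{
    const(char)[] source;

    @property bool empty() { return source.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(source, i);                  // decode one code point
    }

    void popFront()
    {
        source = source[stride(source, 0) .. $];   // skip its code units
    }
}

unittest
{
    import std.range : walkLength;
    auto r = ByDchar("héllo");
    assert(walkLength(r) == 5);    // 5 code points, though 6 code units
}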
On Wednesday, August 01, 2012 23:39:42 Walter Bright wrote:I think that we're misunderstanding each other here. A typical, well-written, range-based function which operates on ranges of dchar will use static if or overloads to special-case strings. This means that it will function with any range of dchar, but it _also_ will be as efficient with strings as if it just operated on strings. It won't decode anything in the string unless it has to. So, having a lexer which operates on ranges of dchar does _not_ make string processing less efficient. It just makes it so that it can _also_ operate on ranges of dchar which aren't strings. For instance, my lexer uses this whenever it needs to get at the first character in the range: static if(isNarrowString!R) Unqual!(ElementEncodingType!R) first = range[0]; else dchar first = range.front; I use string mixins to reduce the code duplication that goes with this, but notice that strings and wstrings do _not_ get decoded. Their first code unit is used, and then most everything operates on that code unit. e.g. switch(first) { case '\n': //... case ' ': //... case '+' //... //... default: //... } It's only when the code actually needs to decode the first code point that it does so (e.g. when all of the possible ASCII characters have been exhausted, it needs to check whether the first code point is a unicode alpha character, since that could be the beginning of an identifier). At that point, I end up doing something like static if(isNarrowString!R) dchar front = range.front; else alias first front; or if I need to know the number of code units that make up the code point, I explicitly call decode in the case of a narrow string. In either case, code units are _not_ being converted to dchar unless they absolutely have to be. The only thing that suffers here performance-wise is ranges that aren't actually strings. For instance, filter!"true"(source) would have _much_ worse performance, because it has to decode every character. But without the concept of a variably-lengthed encoded range, we can't really get around that.Because your input range is a range of dchar?Somebody has to convert the input files into dchars, and then back into chars. That blows for performance. Think billions and billions of characters going through, not just a few random strings.Why is there any converting to dchar going on here?Yes. I understand. It has a mapping of pointers to identifiers. My point is that nothing but parsers will need that. From the standpoint of functionality, it's a parser feature, not a lexer feature. So, if it can be done just fine in the parser, then that's where it should be. If on the other hand, it _needs_ to be in the lexer for some reason (e.g. performance), then that's a reason to put it there. It sounds like you're arguing that performance requirements dictate that it be put in the lexer even if nothing but the parser cares about it, in which case, the lexer is indeed where it should be, but if performance isn't affected, then I'd still argue that it should be in the parser. - Jonathan M DavisMy point is that it's the sort of thing that _only_ a parser would care about. So, unless it _needs_ to be in the lexer for some reason, it shouldn't be.I think you are misunderstanding. The lexer doesn't have a *symbol* table in it. It has a mapping from identifiers to unique handles. It needs to be there, otherwise the semantic analysis has to scan identifier strings a second time.
Aug 02 2012
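For reference, a self-contained variant of the special-casing pattern described above, using a hypothetical skipSpaces helper rather than code from the actual lexer: narrow strings are advanced by code unit, while any other range of dchar falls back to front/popFront.

import std.range.primitives : ElementType, isInputRange, empty, front, popFront;
import std.traits : isNarrowString, Unqual;

void skipSpaces(R)(ref R range)
    if (isInputRange!R && is(Unqual!(ElementType!R) == dchar))
{
    while (!range.empty)
    {
        static if (isNarrowString!R)
            auto first = range[0];          // raw code unit, no decoding
        else
            dchar first = range.front;

        if (first != ' ' && first != '\t')
            break;

        static if (isNarrowString!R)
            range = range[1 .. $];          // ' ' and '\t' are single code units
        else
            range.popFront();
    }
}

unittest
{
    string s = "  \tfoo";
    skipSpaces(s);
    assert(s == "foo");
}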
On 8/2/2012 12:21 AM, Jonathan M Davis wrote:It *still* must convert UTF8 to dchars before presenting them to the consumer of the dchar elements.Because your input range is a range of dchar?I think that we're misunderstanding each other here. A typical, well-written, range-based function which operates on ranges of dchar will use static if or overloads to special-case strings. This means that it will function with any range of dchar, but it _also_ will be as efficient with strings as if it just operated on strings.It won't decode anything in the string unless it has to. So, having a lexer which operates on ranges of dchar does _not_ make string processing less efficient. It just makes it so that it can _also_ operate on ranges of dchar which aren't strings. For instance, my lexer uses this whenever it needs to get at the first character in the range: static if(isNarrowString!R) Unqual!(ElementEncodingType!R) first = range[0]; else dchar first = range.front;You're requiring a random access input range that has random access to something other than the range element type?? and you're requiring an isNarrowString to work on an arbitrary range?if I need to know the number of code units that make up the code point, I explicitly call decode in the case of a narrow string. In either case, code units are _not_ being converted to dchar unless they absolutely have to be.Or you could do away with requiring a special range type and just have it be a UTF8 range. What I wasn't realizing earlier was that you were positing a range type that has two different kinds of elements. I don't think this is a proper component type.Yes. I understand. It has a mapping of pointers to identifiers. My point is that nothing but parsers will need that. From the standpoint of functionality, it's a parser feature, not a lexer feature. So, if it can be done just fine in the parser, then that's where it should be. If on the other hand, it _needs_ to be in the lexer for some reason (e.g. performance), then that's a reason to put it there.If you take it out of the lexer, then: 1. the lexer must allocate storage for every identifier, rather than only for unique identifiers 2. and then the parser must scan the identifier string *again* 3. there must be two hash lookups of each identifier rather than one It's a suboptimal design.
Aug 02 2012
On 2012-08-02 07:44, Jonathan M Davis wrote:On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:I think you two are saying the same things. -- /Jacob CarlborgThe lexer must use char or it will not be acceptable as anything but a toy for performance reasons.Avoiding decoding can be done with strings and operating on ranges of dchar, so you'd be operating almost entirely on ASCII. Are you saying that there's a performance issue aside from decoding?Somebody has to convert the input files into dchars, and then back into chars. That blows for performance. Think billions and billions of characters going through, not just a few random strings.Why is there any converting to dchar going on here? I don't see why any would be necessary. If you reading in a file as a string or char[] (as would be typical), then you're operating on a string, and then the only time that any decoding will be necessary is when you actually need to operate on a unicode character, which is very rare in D's grammar. It's only when operating on something _other_ than a string that you'd have to actually deal with dchars.
Aug 01 2012
On Wednesday, August 01, 2012 21:52:15 Jonathan M Davis wrote:And as much as there are potential performance issues with Phobos' choice of treating strings as ranges of dchar, if it were to continue to treat them as ranges of code units, it's pretty much a guarantee that there would be a _ton_ of bugs caused by it. Most programmers have absolutely no clue about how unicode works and don't really want to know. They want strings to just work. Phobos' approach of defaulting to correct but making it possible to make the code faster through specializations is very much in line with D's typical approach of making things safe by default but allowing the programmer to do unsafe things for optimizations when they know what they're doing.Another thing that I should point out is that a range of UTF-8 or UTF-16 wouldn't work with many range-based functions at all. Most of std.algorithm and its ilk would be completely useless. Range-based functions operate on a ranges elements, so operating on a range of code units would mean operating on code units, which is going to be _wrong_ almost all the time. Think about what would happen if you used a function like map or filter on a range of code units. The resultant range would be completely invalid as far as unicode goes. Range-based functions need to be operating on _characters_. Technically, not even code points gets us there, so it's _still_ buggy. It's just a _lot_ closer to being correct and works 99.99+% of the time. If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add a concept of variably-length encoded ranges so that it's possible to treat them as both their encoding and whatever they represent (e.g. code point or grapheme in the case of ranges of code units). - Jonathan M Davis
Aug 01 2012
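A small demonstration of the difference, using only std.range and std.utf: the same reversal applied to code points yields a valid string, while applied to raw code units it tears a multibyte sequence apart.

import std.algorithm : equal;
import std.array : array;
import std.exception : assertThrown;
import std.range : retro;
import std.string : representation;
import std.utf : UTFException, validate;

unittest
{
    string s = "héllo";

    // As a range of dchar (code points), retro reverses whole characters.
    assert(equal(s.retro, "olléh"));

    // As a range of code units, retro reverses raw bytes, splitting the
    // two-byte sequence for 'é' and producing invalid UTF-8.
    auto bytes = s.representation.retro.array;
    assertThrown!UTFException(validate(cast(const(char)[]) bytes));
}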
On 8/1/2012 11:56 PM, Jonathan M Davis wrote:Another thing that I should point out is that a range of UTF-8 or UTF-16 wouldn't work with many range-based functions at all. Most of std.algorithm and its ilk would be completely useless. Range-based functions operate on a ranges elements, so operating on a range of code units would mean operating on code units, which is going to be _wrong_ almost all the time. Think about what would happen if you used a function like map or filter on a range of code units. The resultant range would be completely invalid as far as unicode goes.My experience in writing fast string based code that worked on UTF8 and correctly handled multibyte characters was that they are very possible and practical, and they are faster. The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer. I think there's some serious underestimation of how critical this is.Range-based functions need to be operating on _characters_. Technically, not even code points gets us there, so it's _still_ buggy. It's just a _lot_ closer to being correct and works 99.99+% of the time.Multi-code point characters are quite irrelevant to the correctness of a D lexer.If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add a concept of variably-length encoded ranges so that it's possible to treat them as both their encoding and whatever they represent (e.g. code point or grapheme in the case of ranges of code units).No, this is not necessary.
Aug 02 2012
On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer.If you want to throw out some target times, that would be useful.
Aug 02 2012
On 8/2/2012 12:33 AM, Bernard Helyer wrote:On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:As fast as the dmd one would be best.The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer.If you want to throw out some target times, that would be useful.
Aug 02 2012
On Thursday, August 02, 2012 01:13:04 Walter Bright wrote:On 8/2/2012 12:33 AM, Bernard Helyer wrote:How would we measure that? dmd's lexer is tied to dmd, so how would we test the speed of only its lexer? - Jonathan M DavisOn Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:As fast as the dmd one would be best.The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer.If you want to throw out some target times, that would be useful.
Aug 02 2012
On 8/2/2012 1:21 AM, Jonathan M Davis wrote:How would we measure that? dmd's lexer is tied to dmd, so how would we test the speed of only its lexer?Easy. Just make a special version of dmd that lexes only, and time it.
Aug 02 2012
On 08/02/2012 04:41 AM, Walter Bright wrote:On 8/2/2012 1:21 AM, Jonathan M Davis wrote:I made a lexing-only version of dmd at https://github.com/edmccard/dmd/tree/lexonly by stripping non-lexer-related code from mars.c, and adding a lexModule function that is called instead of Module::parse.. There's no benchmarking code yet; it basically just does while (token.value != TOKeof) nextToken(); for each D source file passed on the command line. --EdHow would we measure that? dmd's lexer is tied to dmd, so how would we test the speed of only its lexer?Easy. Just make a special version of dmd that lexes only, and time it.
Aug 03 2012
Am 03.08.2012 09:56, schrieb Ed McCardell:On 08/02/2012 04:41 AM, Walter Bright wrote:Walter mentioned an easier way - without "forking" dmd - should be better an integral part of the ongoing development maybe under a tools or benchmark section? news://news.digitalmars.com:119/jvfv14$1ri0$1 digitalmars.com ->Look in doc.c at highlightCode2() for how to call the lexer by itself.On 8/2/2012 1:21 AM, Jonathan M Davis wrote:I made a lexing-only version of dmd at https://github.com/edmccard/dmd/tree/lexonly by stripping non-lexer-related code from mars.c, and adding a lexModule function that is called instead of Module::parse.. There's no benchmarking code yet; it basically just does while (token.value != TOKeof) nextToken(); for each D source file passed on the command line. --EdHow would we measure that? dmd's lexer is tied to dmd, so how would we test the speed of only its lexer?Easy. Just make a special version of dmd that lexes only, and time it.
Aug 03 2012
On 8/3/2012 1:27 AM, dennis luehring wrote:Walter mentioned an easier way - without "forking" dmd - should be better an integral part of the ongoing development maybe under a tools or benchmark section?Forking it is fine. This is just for a one-off benchmarking thing.
Aug 03 2012
Am 02.08.2012 10:13, schrieb Walter Bright:On 8/2/2012 12:33 AM, Bernard Helyer wrote:would it be (easily) possible to "extract" the dmd lexer code (and the needed interface) to use it as a separate benchmark reference?On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:As fast as the dmd one would be best.The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer.If you want to throw out some target times, that would be useful.
Aug 02 2012
Am 02.08.2012 10:13, schrieb Walter Bright:On 8/2/2012 12:33 AM, Bernard Helyer wrote:can the dmd lexer be separated out as a lib and become usable from outside - as the benchmark reference?On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:As fast as the dmd one would be best.The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer.If you want to throw out some target times, that would be useful.
Aug 02 2012
Le 02/08/2012 10:13, Walter Bright a écrit :On 8/2/2012 12:33 AM, Bernard Helyer wrote:That'd be great but . . . lexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language). Additionally, anybody that have touched dmd source code can agree that usability/maintainability isn't as its best. Sacrificing some perfs in a non bottleneck area to increase ease of use make perfect sense.On Thursday, 2 August 2012 at 07:29:52 UTC, Walter Bright wrote:As fast as the dmd one would be best.The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer.If you want to throw out some target times, that would be useful.
Aug 02 2012
On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:lexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language).What if we're just using this lexer in something like a syntax highlighting text editor? I'd be annoyed if it stopped typing for a little bit cuz of processing.
Aug 02 2012
Am Thu, 02 Aug 2012 14:26:58 +0200 schrieb "Adam D. Ruppe" <destructionator gmail.com>:On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:My pet peeve for editors is how to make the lexer work on only what I just edited. It is really not trivial, but eased for example by editors that insert closing } automatically, so the scopes don't mess up. A missing curly bracket results in the parser reinterpreting the whole file. The same goes for string terminators: If I type `` at the top of std.datetime in Mono-D right after "module std.datetime;", I have a 12 (!) seconds wait, till the IDE becomes responsive again. But if I use ' or ", then the editor inserts the second terminator and is done with reparsing in 1-2 seconds. Can we really have a good lexer that is equally well suited for compilers as for editors ? -- Marcolexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language).What if we're just using this lexer in something like a syntax highlighting text editor? I'd be annoyed if it stopped typing for a little bit cuz of processing.
Aug 02 2012
On 8/2/2012 11:14 AM, Marco Leise wrote:If I type `` at the top of std.datetime in Mono-D right after "module std.datetime;", I have a 12 (!) seconds wait, till the IDE becomes responsive again. But if I use ' or ", then the editor inserts the second terminator and is done with reparsing in 1-2 seconds.A good IDE should do its parsing in a separate thread, so the main user input thread remains crisp and responsive. If the user edits the text while the parsing is in progress, the background parsing thread simply abandons the current parse and starts over.
Aug 02 2012
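A rough sketch of that abandon-and-restart scheme (not Walter's actual editor code), with hypothetical names (highlightWorker, onUserEdit, editGeneration) and a stand-in loop where the real lexing would go:

import core.atomic : atomicLoad, atomicOp;
import std.concurrency : receive, send, Tid;

shared size_t editGeneration;      // bumped by the GUI thread on every edit

// Background worker: process the latest snapshot, but bail out as soon as
// a newer edit is noticed - the "abandon the current parse and start over" idea.
void highlightWorker()
{
    for (;;)
    {
        string snapshot;
        receive((string s) { snapshot = s; });
        immutable myGen = atomicLoad(editGeneration);

        foreach (c; snapshot)                       // stand-in for real lexing
        {
            if (atomicLoad(editGeneration) != myGen)
                break;                              // newer edit: abandon pass
            // ... feed c to the lexer, collect highlight spans ...
        }
        // if the pass completed, post the results back to the GUI thread here
    }
}

// Called on the GUI thread whenever the user types.
void onUserEdit(Tid worker, string newBuffer)
{
    atomicOp!"+="(editGeneration, 1);   // invalidate any in-flight pass
    send(worker, newBuffer);            // hand the worker a fresh snapshot
}

The GUI thread would spawn the worker once at startup and call onUserEdit on every keystroke; a pass still running against a stale snapshot sees the bumped counter and gives up almost immediately.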
On 8/2/12, Walter Bright <newshound2 digitalmars.com> wrote:A good IDE should do its parsing in a separate thread, so the main user input thread remains crisp and responsive. If the user edits the text while the parsing is in progress, the background parsing thread simply abandons the current parse and starts over.Agreed, this is an editor/IDE domain problem. And there are some obvious optimizations an editor could use, such as only lexing the visible portion of text on the screen (and then lexing the rest in a background thread which doesn't block the interface).
Aug 02 2012
On 2012-08-02 21:35, Walter Bright wrote:A good IDE should do its parsing in a separate thread, so the main user input thread remains crisp and responsive. If the user edits the text while the parsing is in progress, the background parsing thread simply abandons the current parse and starts over.It still needs to update the editor view with the correct syntax highlighting which needs to be done in the same thread as the rest of the GUI. -- /Jacob Carlborg
Aug 02 2012
On 8/2/12, Jacob Carlborg <doob me.com> wrote:It still needs to update the editor view with the correct syntax highlighting which needs to be done in the same thread as the rest of the GUI.It can do that immediately for the text that's visible in the window because ~100 lines of text can be lexed pretty damn instantly. As soon as that's done the GUI should be responsive and the rest of the text buffer should be lexed in the background.
Aug 02 2012
On 2012-08-02 22:54, Andrej Mitrovic wrote:It can do that immediately for the text that's visible in the window because ~100 lines of text can be lexed pretty damn instantly. As soon as that's done the GUI should be responsive and the rest of the text buffer should be lexed in the background.That should work if the lexer can handle it. -- /Jacob Carlborg
Aug 02 2012
On 8/2/2012 1:41 PM, Jacob Carlborg wrote:On 2012-08-02 21:35, Walter Bright wrote:The rendering code should be in yet a third thread. An editor I wrote years ago had the rendering code in a separate thread from user input. You never had to wait to type in commands, the rendering would catch up when it could. What was also effective was the rendering would abandon a render midstream and restart it if it detected that the underlying data had changed in the meantime. This meant that the display was never more than one render out of date. Although the code itself wasn't any faster, it certainly *felt* faster with this approach. It made for crisp editing even on a pig slow machine.A good IDE should do its parsing in a separate thread, so the main user input thread remains crisp and responsive. If the user edits the text while the parsing is in progress, the background parsing thread simply abandons the current parse and starts over.It still needs to update the editor view with the correct syntax highlighting which needs to be done in the same thread as the rest of the GUI.
Aug 02 2012
On 03-Aug-12 02:10, Walter Bright wrote:On 8/2/2012 1:41 PM, Jacob Carlborg wrote:OT: It never ceases to amaze me how people miss this very simple point: GUI runs on its own thread and shouldn't ever block on something (save for message pump itself, of course). Everything else (including possibly slow rendering) done on the side and then result (once ready) swiftly indicated on GUI. Recent Windows 8 talks were in fact nothing new in this regard, but now even API is made so that it's harder to do something blocking in GUI thread.On 2012-08-02 21:35, Walter Bright wrote:The rendering code should be in yet a third thread.A good IDE should do its parsing in a separate thread, so the main user input thread remains crisp and responsive. If the user edits the text while the parsing is in progress, the background parsing thread simply abandons the current parse and starts over.It still needs to update the editor view with the correct syntax highlighting which needs to be done in the same thread as the rest of the GUI.An editor I wrote years ago had the rendering code in a separate thread from user input. You never had to wait to type in commands, the rendering would catch up when it could. What was also effective was the rendering would abandon a render midstream and restart it if it detected that the underlying data had changed in the meantime. This meant that the display was never more than one render out of date. Although the code itself wasn't any faster, it certainly *felt* faster with this approach. It made for crisp editing even on a pig slow machine.-- Dmitry Olshansky
Aug 02 2012
On 2012-08-03 00:25, Dmitry Olshansky wrote:OT: It never ceases to amaze me how people miss this very simple point: GUI runs on its own thread and shouldn't ever block on something (save for message pump itself, of course). Everything else (including possibly slow rendering) done on the side and then result (once ready) swiftly indicated on GUI.I'm not entirely sure what you mean be "rendering" but most GUI systems are not thread safe and all GUI updates need to happen in the same thread. -- /Jacob Carlborg
Aug 02 2012
On 03-Aug-12 10:35, Jacob Carlborg wrote:On 2012-08-03 00:25, Dmitry Olshansky wrote:Draw thing to an off screen bitmap then blit it to window (aye, pass back to UI thread a reference to the buffer with pixels). This technique been in use for decades. Imagine drawing some large intricate fractal it could easily take few seconds. -- Dmitry OlshanskyOT: It never ceases to amaze me how people miss this very simple point: GUI runs on its own thread and shouldn't ever block on something (save for message pump itself, of course). Everything else (including possibly slow rendering) done on the side and then result (once ready) swiftly indicated on GUI.I'm not entirely sure what you mean be "rendering" but most GUI systems are not thread safe and all GUI updates need to happen in the same thread.
Aug 03 2012
On 2012-08-03 18:49, Dmitry Olshansky wrote:Draw thing to an off screen bitmap then blit it to window (aye, pass back to UI thread a reference to the buffer with pixels). This technique been in use for decades. Imagine drawing some large intricate fractal it could easily take few seconds.Ok, now I see. -- /Jacob Carlborg
Aug 03 2012
On 2012-08-03 00:10, Walter Bright wrote:The rendering code should be in yet a third thread.Most GUI systems are not thread safe. I know for sure that Cocoa on Mac OS X is not. All the changes to the GUI needs to happen in the same thread. But you can usually post a message from another thread to the GUI thread if you want to update the GUI from another thread.An editor I wrote years ago had the rendering code in a separate thread from user input. You never had to wait to type in commands, the rendering would catch up when it could. What was also effective was the rendering would abandon a render midstream and restart it if it detected that the underlying data had changed in the meantime. This meant that the display was never more than one render out of date. Although the code itself wasn't any faster, it certainly *felt* faster with this approach. It made for crisp editing even on a pig slow machine.But if you type something and you don't see any new characters that will feel slow. Regardless of the application received the user input or not. The user doesn't care what's happening in the background of the application, it only cares about what it can see. I've used many editors/IDEs where the GUI locks, for whatever reason, you continue typing and suddenly you get 5 characters at once. To me, that would indicate the the application do receive the user input but it cannot render it fast enough, for whatever reason. -- /Jacob Carlborg
Aug 02 2012
Le 02/08/2012 20:14, Marco Leise a écrit :Am Thu, 02 Aug 2012 14:26:58 +0200 schrieb "Adam D. Ruppe"<destructionator gmail.com>:A common solution to that problem is to cancel the computation when user type something again.On Thursday, 2 August 2012 at 11:47:20 UTC, deadalnix wrote:My pet peeve for editors is how to make the lexer work on only what I just edited. It is really not trivial, but eased for example by editors that insert closing } automatically, so the scopes don't mess up. A missing curly bracket results in the parser reinterpreting the whole file. The same goes for string terminators: If I type `` at the top of std.datetime in Mono-D right after "module std.datetime;", I have a 12 (!) seconds wait, till the IDE becomes responsive again. But if I use ' or ", then the editor inserts the second terminator and is done with reparsing in 1-2 seconds. Can we really have a good lexer that is equally well suited for compilers as for editors ?lexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language).What if we're just using this lexer in something like a syntax highlighting text editor? I'd be annoyed if it stopped typing for a little bit cuz of processing.
Aug 03 2012
On 8/2/2012 4:47 AM, deadalnix wrote:Le 02/08/2012 10:13, Walter Bright a écrit :A lexer is like a device driver, or a memory allocator. Having it be as fast as possible pays off dividends endlessly for whoever uses it. And in my profiling, lexing speed is a big factor in the debug build times. The Digital Mars C++ compiler (and its antecedents) made a lot of hay by being the fastest compiler available by a wide margin. A big advantage D has is also compile speed. Please do not underestimate its importance.As fast as the dmd one would be best.That'd be great but . . . lexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language). Additionally, anybody that have touched dmd source code can agree that usability/maintainability isn't as its best. Sacrificing some perfs in a non bottleneck area to increase ease of use make perfect sense.
Aug 02 2012
On 02-Aug-12 22:06, Walter Bright wrote:On 8/2/2012 4:47 AM, deadalnix wrote:+1 Another point is that it's crystal clear *how* to optimize lexer, and gain some precious time. It is well trodden path with well known destination. The time saved in lexer gives you some headroom in say optimizer, where you always can find more ways to improve result by sacrificing speed. In the long run I totally expect millions of lines of D code in the wild.Le 02/08/2012 10:13, Walter Bright a écrit :A lexer is like a device driver, or a memory allocator. Having it be as fast as possible pays off dividends endlessly for whoever uses it.As fast as the dmd one would be best.That'd be great but . . . lexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language). Additionally, anybody that have touched dmd source code can agree that usability/maintainability isn't as its best. Sacrificing some perfs in a non bottleneck area to increase ease of use make perfect sense.And in my profiling, lexing speed is a big factor in the debug build times. The Digital Mars C++ compiler (and its antecedents) made a lot of hay by being the fastest compiler available by a wide margin. A big advantage D has is also compile speed. Please do not underestimate its importance.-- Dmitry Olshansky
Aug 02 2012
On 8/2/12, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:In the long run I totally expect millions of lines of D code in the wild.In a single project or globally? I hope not the former, D should inspire people to write less code that does more. :)
Aug 02 2012
On 8/2/2012 1:20 PM, Dmitry Olshansky wrote:+1 Another point is that it's crystal clear *how* to optimize lexer, and gain some precious time. It is well trodden path with well known destination.I'm less sure how well known it is. DMC++ was known for being *several times* faster than other C++ compilers, not just a few percent faster. But we do have the DMD lexer which is useful as a benchmark and a guide. I won't say it couldn't be made faster, but it does set a minimum bar for performance.
Aug 02 2012
On 2012-08-03 00:01, Walter Bright wrote:But we do have the DMD lexer which is useful as a benchmark and a guide. I won't say it couldn't be made faster, but it does set a minimum bar for performance.I'm not sure how easy it would be to just measure the lexing phase of DMD. If it's easy someone would probably already have extracted the lexer from DMD. -- /Jacob Carlborg
Aug 02 2012
Am 03.08.2012 08:37, schrieb Jacob Carlborg:On 2012-08-03 00:01, Walter Bright wrote:wouldn't it be better to extract the lexer part of dmd into its own (hopefully small) library - that way the lexer is still usable by dmd AND benchmarkable from outside - it is then even possible to replace the dmd lexer with a D version due to the C linkage feature of DBut we do have the DMD lexer which is useful as a benchmark and a guide. I won't say it couldn't be made faster, but it does set a minimum bar for performance.I'm not sure how easy it would be to just measure the lexing phase of DMD. If it's easy someone would probably already have extracted the lexer from DMD.
Aug 02 2012
On 2012-08-03 08:49, dennis luehring wrote:wouldn't it be better to extract the lexer part of dmd into its own (hopefully small) library - that way the lexer is still usable by dmd AND benchmarkable from outside - it is then even possible to replace the dmd lexer with a D version due to the C linkage feature of DThat was what I meant. If it's easy someone would have already done it. -- /Jacob Carlborg
Aug 02 2012
On 8/2/2012 11:37 PM, Jacob Carlborg wrote:I'm not sure how easy it would be to just measure the lexing phase of DMD. If it's easy someone would probably already have extracted the lexer from DMD.You don't need to extract it to measure it. Just have it lex the source files in a loop, and time that loop.
Aug 02 2012
On 2012-08-03 08:59, Walter Bright wrote:You don't need to extract it to measure it. Just have it lex the source files in a loop, and time that loop.Well, that's the problem. It's not like DMD has a single "lex" function that does all the job. Would it perhaps be possible to time Parser::parseModule and remove all things that doesn't seem related to lexing? Remove stuff like: a = new Identifiers(); md = new ModuleDeclaration(a, id, safe); And similar. -- /Jacob Carlborg
Aug 03 2012
On 8/3/2012 12:11 AM, Jacob Carlborg wrote:On 2012-08-03 08:59, Walter Bright wrote:Look in doc.c at highlightCode2() for how to call the lexer by itself.You don't need to extract it to measure it. Just have it lex the source files in a loop, and time that loop.Well, that's the problem. It's not like DMD has a single "lex" function that does all the job. Would it perhaps be possible to time Parser::parseModule and remove all things that doesn't seem related to lexing? Remove stuff like: a = new Identifiers(); md = new ModuleDeclaration(a, id, safe); And similar.
Aug 03 2012
On 2012-08-03 09:35, Walter Bright wrote:Look in doc.c at highlightCode2() for how to call the lexer by itself.So basically: Token tok; //start timer while (tok.value != TOKeof) lex.scan(&tok); //end timer Something like that? -- /Jacob Carlborg
Aug 03 2012
On 8/3/2012 2:01 AM, Jacob Carlborg wrote:On 2012-08-03 09:35, Walter Bright wrote:Pretty much. Or just call lex.nextToken() until you get a TOKeof.Look in doc.c at highlightCode2() for how to call the lexer by itself.So basically: Token tok; //start timer while (tok.value != TOKeof) lex.scan(&tok); //end timer Something like that?
Aug 03 2012
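A harness along those lines might look like the following sketch; the whitespace split is only a stand-in so the example compiles on its own, and the nextToken loop shown in the comment is where dmd's lexer (or the future std.d.lexer) would actually be measured.

import std.array : split;
import std.datetime : Clock;
import std.file : readText;
import std.stdio : writefln;

void main(string[] args)
{
    size_t tokens;
    auto start = Clock.currTime();
    foreach (f; args[1 .. $])
    {
        auto src = readText(f);
        // Stand-in for the real loop, i.e. for
        //     while (lex.nextToken() != TOKeof) ++tokens;
        // once an actual lexer is plugged in.
        tokens += src.split().length;
    }
    auto elapsed = Clock.currTime() - start;
    writefln("%s pseudo-tokens in %s ms", tokens, elapsed.total!"msecs");
}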
On Thursday, August 02, 2012 11:06:39 Walter Bright wrote:On 8/2/2012 4:47 AM, deadalnix wrote:Recent benchmarks by Iain Buclaw show that lexing is not a bottleneck in dmd, so it's natural for some folks to think that the lexing is not all that critical in comparison to other compiler components. However, the reason that lexing is not a bottleneck is almost ceratinly because the lexer is so thoroughly optimized. So, if it _didn't_ try and be as efficient as possible, it _would_ be a bottleneck. However, given that we're providing an API for a lexer rather than having the lexer be an internal component of a compiler, that _does_ put more pressure on making the API user-friendly. Hopefully that can be done without sacrificing any performance, but there may be a point where the only way to make it faster would be to make the API very unfriendly. If that's the case, then we're going to have some tough decisions to make. But we'll cross that bridge if/when we get to it. - Jonathan M DavisLe 02/08/2012 10:13, Walter Bright a écrit :A lexer is like a device driver, or a memory allocator. Having it be as fast as possible pays off dividends endlessly for whoever uses it. And in my profiling, lexing speed is a big factor in the debug build times. The Digital Mars C++ compiler (and its antecedents) made a lot of hay by being the fastest compiler available by a wide margin. A big advantage D has is also compile speed. Please do not underestimate its importance.As fast as the dmd one would be best.That'd be great but . . . lexer really isn't the performance bottleneck of dmd (or any compiler of a non trivial language). Additionally, anybody that have touched dmd source code can agree that usability/maintainability isn't as its best. Sacrificing some perfs in a non bottleneck area to increase ease of use make perfect sense.
Aug 02 2012
On Thursday, August 02, 2012 00:29:09 Walter Bright wrote:It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all. - Jonathan M DavisIf we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add a concept of variably-length encoded ranges so that it's possible to treat them as both their encoding and whatever they represent (e.g. code point or grapheme in the case of ranges of code units).No, this is not necessary.
Aug 02 2012
On 8/2/2012 12:43 AM, Jonathan M Davis wrote:It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all.Yes, it can work.
Aug 02 2012
On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:On 8/2/2012 12:43 AM, Jonathan M Davis wrote:How? If you operate on a range of code units, then you're operating on individual code units, which almost never makes sense. There are plenty cases where a function which understands the encoding can avoid some of costs associated with decoding and whatnot, but since range-based functions operate on their elements, if the elementse are code units, a range-based function will operate on individual code units with _no_ understanding of the encoding at all. Ranges have no concept of encoding. Do you really think that it makes sense for a function like map or filter to operate on individual code units? Because that's what would end up happening with a range of code units. Your average, range-based function only makes sense with _characters_, not code units. Functions which can operate on ranges of code units without screwing up the encoding are a rarity. Unless a range-based function special cases a range-type which is variably- lengthed encoded (e.g. string), it just isn't going to deal with the encoding properly. Either it operates on the encoding or the actual value, depending on what its element type is. I concur that operating on strings as code units is better from the standpoint of efficiency, but it just doesn't work with a generic function without it having a special case which therefore _isn't_ generic. - Jonathan M DavisIt is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all.Yes, it can work.
Aug 02 2012
On 8/2/2012 1:38 AM, Jonathan M Davis wrote:On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.On 8/2/2012 12:43 AM, Jonathan M Davis wrote:How?It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all.Yes, it can work.Do you really think that it makes sense for a function like map or filter to operate on individual code units? Because that's what would end up happening with a range of code units. Your average, range-based function only makes sense with _characters_, not code units. Functions which can operate on ranges of code units without screwing up the encoding are a rarity.Rare or not, they are certainly possible, and the early versions of std.string did just that (although they weren't using ranges, the same techniques apply).
Aug 02 2012
Le 02/08/2012 10:44, Walter Bright a écrit :On 8/2/2012 1:38 AM, Jonathan M Davis wrote:How is that different than a manually done range of dchar ?On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.On 8/2/2012 12:43 AM, Jonathan M Davis wrote:How?It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all.Yes, it can work.
Aug 02 2012
On 8/2/2012 4:49 AM, deadalnix wrote:How is that different than a manually done range of dchar ?The decoding is rarely necessary, even if non-ascii data is there. However, the range cannot decide if decoding is necessary - the receiver has to, hence the receiver does the decoding.
Aug 02 2012
On Thursday, August 02, 2012 10:58:21 Walter Bright wrote:On 8/2/2012 4:49 AM, deadalnix wrote:Hence why special-casing must be used to deal with variably-length encodings like UTF-8. If it's not something that the range can handle through front and popFront, then it's not going to work as a range. Having it be _possible_ to special-case but not require it allows for the type to works a range but still be efficient of the function operating on the range codes for it. But if you require the function to code for it, then it's not really a range. - Jonathan M DavisHow is that different than a manually done range of dchar ?The decoding is rarely necessary, even if non-ascii data is there. However, the range cannot decide if decoding is necessary - the receiver has to, hence the receiver does the decoding.
Aug 02 2012
On 02-Aug-12 12:44, Walter Bright wrote:On 8/2/2012 1:38 AM, Jonathan M Davis wrote:4 bytes is enough. Since Unicode 5(?) the range of codepoints was defined to be 0...0x10FFFF specifically so that it could be encoded in 4 bytes of UTF-8. P.S. Looks like I'm too late for this party ;) -- Dmitry OlshanskyOn Thursday, August 02, 2012 01:14:30 Walter Bright wrote:Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.On 8/2/2012 12:43 AM, Jonathan M Davis wrote:How?It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all.Yes, it can work.
Aug 02 2012
On 8/2/2012 8:46 AM, Dmitry Olshansky wrote:Yeah, but I thought 6 bytes would future proof it! (Inevitably, the Unicode committee will add more.)Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.4 bytes is enough. Since Unicode 5(?) the range of codepoints was defined to be 0...0x10FFFF specifically so that it could be encoded in 4 bytes of UTF-8.P.S. Looks like I'm too late for this party ;)It affects you strongly, too, so I'm glad to see you join in.
Aug 02 2012
On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:On 8/2/2012 1:38 AM, Jonathan M Davis wrote:And how on earth is that going to work as a range? Range-based functions operate on elements. They use empty, front, popFront, etc. If front doesn't return an element that a range-based function can operate on without caring what it is, then that type isn't going to work as a range. If you need the consumer to be doing something special, then that means you need to special case it for that range type. And that's what you're doing when you special- case range-base functions for strings. So, because of how front and popFront work, you either have to have a range of code units or a range of code points. With a range of code units, the element type is a code unit, so any operations that you do will be operating on individual code units, not code points. With a range of code points, any operations being done will operate on code points, which _will_ require decoding as long as you're actually using the range API. You only make strings more efficent by special-casing the function for them such that it understands unicode and will operate on the string in the most efficient way according to how that string's encoding works. You seem to be arguing that we can somehow have a generic range API which operates on strings, and yet what you're saying a function using those ranges must do (e.g. having a buffer of multipl code units) requires that a range- based function operate in a non-generic manner for strings. If the consumer has to do _anything_ special for strings, then what it's doing is non-generic with regards to strings. I agree that we should be making string operations more efficient by taking code units into account, but I completely disagree that we can do that generically. At best, we could add the concept of a variably-length encoded range such that a range-based function could special case them and use the encoding where appropriate, but all that buys us is the ability to special case variably- length encoded ranges _other_ than strings (since we can already special case strings), and I don't think that it's even possible to deal with a variably- length encoded range properly without understanding what the encoding is, in which case, we'd be dealing with special range types which were specifically UTF-8 encoded or UTF-16 encoded, and range-based functions would be special- casing _those_ rather than a generic variably-length encoded range. In either case, because the consumer must do something other than simply operate on front, popFront, empty, etc., you're _not_ dealing with the range API but rather working around it. - Jonathan M DavisOn Thursday, August 02, 2012 01:14:30 Walter Bright wrote:Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.On 8/2/2012 12:43 AM, Jonathan M Davis wrote:How?It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all.Yes, it can work.
Aug 02 2012
"Jonathan M Davis" , dans le message (digitalmars.D:174059), a écrit :In either case, because the consumer must do something other than simply operate on front, popFront, empty, etc., you're _not_ dealing with the range API but rather working around it.In some case a range of dchar is useful. In some case a range of char is sufficient, and much more efficient. And for the UTF-aware programer, it makes much sense. The fact that you sometimes have to buffer some information because the meaning of one element is affected by the previous element is a normal issue in many algoritms, it's not working arround anything. Your lexer uses range API, would you say it is just working arroung range because you have to take into account several caracters (let they be dchars) at the same time to know what they are meaning?
Aug 02 2012
On 8/2/2012 1:26 PM, Jonathan M Davis wrote:On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:1. read a character from the range 2. if the character is the start of a multibyte character, put the character in the buffer 3. keep reading from the range until you've got the whole of the multibyte character 4. convert that 6 (or 4) character buffer into a dchar Remember, its the consumer doing the decoding, not the input range.On 8/2/2012 1:38 AM, Jonathan M Davis wrote:And how on earth is that going to work as a range?On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.On 8/2/2012 12:43 AM, Jonathan M Davis wrote:How?It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all.Yes, it can work.I agree that we should be making string operations more efficient by taking code units into account, but I completely disagree that we can do that generically.The requirement I listed was that the input range present UTF8 characters. Not any random character type.
Aug 02 2012
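Spelled out as code, those four steps might look like this hypothetical nextChar helper; a 4-byte buffer suffices, since UTF-8 sequences are at most four code units, as noted a few posts down.

import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.traits : Unqual;
import std.utf : decode;

// Pulls one code point out of an input range of UTF-8 code units, decoding
// only when the lead byte has its high bit set (hypothetical helper).
dchar nextChar(R)(ref R input)
    if (isInputRange!R && is(Unqual!(ElementType!R) == char))
{
    char c = input.front;
    input.popFront();
    if (c < 0x80)
        return c;                      // ASCII fast path: no decoding at all

    // Multibyte sequence: buffer the lead byte plus its continuation bytes.
    char[4] buf = void;
    buf[0] = c;
    size_t n = 1;
    while (n < 4 && !input.empty && (input.front & 0xC0) == 0x80)
    {
        buf[n++] = input.front;
        input.popFront();
    }
    size_t index = 0;
    return decode(buf[0 .. n], index); // validates and decodes the sequence
}

unittest
{
    import std.utf : byChar;
    auto src = "aπb".byChar;   // a true range of char, not a narrow string
    assert(nextChar(src) == 'a');
    assert(nextChar(src) == 'π');
    assert(nextChar(src) == 'b');
}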
On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:Remember, its the consumer doing the decoding, not the input range.But that's the problem. The consumer has to treat character ranges specially to make this work. It's not generic. If it were generic, then it would simply be using front, popFront, etc. It's going to have to special case strings to do the buffering that you're suggesting. And if you have to special case strings, then how is that any different from what we have now? If you're arguing that strings should be treated as ranges of code units, then pretty much _every_ range-based function will have to special case strings to even work correctly - otherwise it'll be operating on individual code points rather than code points (e.g. filtering code units rather than code points, which would generate an invalid string). This makes the default behavior incorrect, forcing _everyone_ to special case strings _everywhere_ if they want correct behavior with ranges which are strings. And efficiency means nothing if the result is wrong. As it is now, the default behavior of strings with range-based functions is correct but inefficient, so at least we get correct code. And if someone wants their string processing to be efficient, then they special case strings and do things like the buffering that you're suggesting. So, we have correct by default with efficiency as an option. The alternative that you seem to be suggesting (making strings be treated as ranges of code units) means that it would be fast by default but correct as an option, which is completely backwards IMHO. Efficiency is important, but it's pointless how efficient something is if it's wrong, and expecting that your average programmer is going to write unicode-aware code which functions correctly is completely unrealistic. - Jonathan M Davis
Aug 02 2012
On 8/2/12 6:38 PM, Jonathan M Davis wrote:On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:It is generic! It's just in another dimension: it operates on any range of _bytes_. AndreiRemember, its the consumer doing the decoding, not the input range.But that's the problem. The consumer has to treat character ranges specially to make this work. It's not generic.
Aug 02 2012
On Thursday, August 02, 2012 18:41:23 Andrei Alexandrescu wrote:On 8/2/12 6:38 PM, Jonathan M Davis wrote:So, a function which does the buffering of code units like Walter suggests is generic? It's doing something that makes no sense outside of strings. And if it's doing that with strings and something else with everything else (which it _has_ to do if the same function is going to work with both unicode as well as range types that have nothing to do with unicode), then it's special casing strings and therefore is _not_ generic. Sure, you could have a function which specifically operates on ranges of code units and understands how unicode works and is written accordingly, but then that function is specific to ranges of code units and is only generic with regards to various ranges of code units. It can't operate on generic ranges like functions such as map and filter can. - Jonathan M DavisOn Thursday, August 02, 2012 15:14:17 Walter Bright wrote:It is generic! It's just in another dimension: it operates on any range of _bytes_.Remember, its the consumer doing the decoding, not the input range.But that's the problem. The consumer has to treat character ranges specially to make this work. It's not generic.
Aug 02 2012
On 8/2/12 6:54 PM, Jonathan M Davis wrote:So, a function which does the buffering of code units like Walter suggests is generic?Of course, because it operates on bytes read from memory, files, or sockets etc.It's doing something that makes no sense outside of strings.Right. The bytes represent UTF-8 encoded strings, except their type is ubyte so there's no processing in the library.And if it's doing that with strings and something else with everything else (which it _has_ to do if the same function is going to work with both unicode as well as range types that have nothing to do with unicode), then it's special casing strings and therefore is _not_ generic.This is automatically destroyed because its assumption was destroyed.Sure, you could have a function which specifically operates on ranges of code units and understands how unicode works and is written accordingly, but then that function is specific to ranges of code units and is only generic with regards to various ranges of code units. It can't operate on generic ranges like functions such as map and filter can.Yes, and I think that's exactly what the doctor prescribed here. Andrei
Aug 02 2012
On Thursday, August 02, 2012 19:06:32 Andrei Alexandrescu wrote:It may be the best approach for the lexer (though I'm not convinced; I'll have to think about it more), but Walter seems to be arguing that strings should be treated as ranges of code units in general, which I think is completely wrong. - Jonathan M DavisSure, you could have a function which specifically operates on ranges of code units and understands how unicode works and is written accordingly, but then that function is specific to ranges of code units and is only generic with regards to various ranges of code units. It can't operate on generic ranges like functions such as map and filter can.Yes, and I think that's exactly what the doctor prescribed here.
Aug 02 2012
On 8/2/12 7:18 PM, Jonathan M Davis wrote:On Thursday, August 02, 2012 19:06:32 Andrei Alexandrescu wrote:Your insights are always appreciated; even their Cliff notes :o).It may be the best approach for the lexer (though I'm not convinced; I'll have to think about it more),Sure, you could have a function which specifically operates on ranges of code units and understands how unicode works and is written accordingly, but then that function is specific to ranges of code units and is only generic with regards to various ranges of code units. It can't operate on generic ranges like functions such as map and filter can.Yes, and I think that's exactly what the doctor prescribed here.but Walter seems to be arguing that that strings should be treated as ranges of code units in general, which I think is completely wrong.I think Walter has very often emphasized the need for the lexer to be faster than the usual client software. My perception is that he's discussing lexer design in understanding there's a need for a less comfortable approach, namely do decoding in client. Andrei
Aug 02 2012
On Thursday, August 02, 2012 19:30:47 Andrei Alexandrescu wrote:On 8/2/12 7:18 PM, Jonathan M Davis wrote: Your insights are always appreciated; even their Cliff notes :o).LOL. Well, I'm not about to decide on the best approach to this without thinking through it more. What I've been doing manages to deal quite nicely with avoiding unnecessary decoding and still allows for the lexing of ranges of dchar which aren't strings (though there's obviously an efficiency hit there), and it really isn't complicated or messy thanks to some basic mixins that I've been using. Switching to operating specifically on code units and not accepting ranges of dchar at all has some serious ramifications, and I have to think through them all before I take a position on that.That may be, but if he's arguing that strings should _always_ be treated as range of code units - as in all D programs, most of which don't have anything to do with lexers (other than when they're compiled) - then I'm definitely going to object to that, and it's my understanding that that's what he's arguing. But maybe I've misunderstood. I've been arguing that strings should still be treated as ranges of code points and that that does not preclude making the lexer efficiently operate on code units when operating on strings even if it operates on ranges of dchar. I think that whether making the lexer operate on ranges of dchar but specialize on strings is a better approach or making it operate specifically on ranges of code units is a better approach depends on what we want it to be usable with. It should be just as fast with strings in either case, so it becomes a question of how we want to handle ranges which _aren't_ strings. I suppose that we could make it operate on code units and just let ranges of dchar have UTF-32 as their code unit (since dchar is both a code unit and a code point), then ranges of dchar will still work but ranges of char and wchar will _also_ work. Hmmm. As I said, I'll have to think this through a bit. - Jonathan M Davisbut Walter seems to be arguing that that strings should be treated as ranges of code units in general, which I think is completely wrong.I think Walter has very often emphasized the need for the lexer to be faster than the usual client software. My perception is that he's discussing lexer design in understanding there's a need for a less comfortable approach, namely do decoding in client.
Aug 02 2012
On Thursday, August 02, 2012 19:52:35 Jonathan M Davis wrote:I suppose that we could make it operate on code units and just let ranges of dchar have UTF-32 as their code unit (since dchar is both a code unit and a code point), then ranges of dchar will still work but ranges of char and wchar will _also_ work. Hmmm. As I said, I'll have to think this through a bit.LOL. It looks like taking this approach results in almost identical code to what I've been doing. The main difference is that if you're dealing with a range other than a string, you need to use decode instead of front, which means that decode is going to need to work with more than just strings (probably stride too). I'll have to create a pull request for that. But unless you restrict it to strings and ranges of code units which are random access, you still have to worry about stuff like using range[0] vs range.front depending on the type, so my mixin approach is still applicable, and it makes it very easy to switch what I'm doing, since there are very few lines that need to be tweaked. So, I guess that I'll be taking the approach of taking ranges of char, wchar, and dchar and treating them all as ranges of code units. So, it'll work with everything that it worked with before but will now also work with ranges of char and wchar. There's still a performance hit if you do something like passing it filter!"true"(source), but there's no way to fix that without disallowing dchar ranges entirely, which would be unnecessarily restrictive. Once you allow arbitrary ranges of char rather than requiring strings, the extra code required to allow ranges of wchar and dchar is trivial. It's stuff like worrying about range[0] vs range.front which complicates things (even if front is a code unit rather than a code point), and using string mixins makes it so that the code with the logic is just as simple as it would be with strings. So, I think that I can continue almost exactly as I have been and still achieve what Walter wants. The main issue that I have (beyond finishing what I haven't gotten to yet) is changing how I handle errors and comments, since I currently have them as tokens, but that shouldn't be terribly hard to fix. - Jonathan M Davis
Aug 02 2012
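A minimal sketch of the dispatch Jonathan describes above (his actual code uses string mixins; the helper name peekCodeUnit and this static-if variant are my own illustrative assumptions, not his code):

import std.range, std.traits;

// Read the next code unit without decoding: index narrow strings directly,
// fall back to front for any other range (dchar ranges, ubyte ranges, ...).
auto peekCodeUnit(R)(R range)
    if (isInputRange!R)
{
    static if (isNarrowString!R)
        return range[0];        // char/wchar code unit, no decoding
    else
        return range.front;     // the element type already is the "code unit"
}

The lexing logic can then be written once against peekCodeUnit and work unchanged for ranges of char, wchar, and dchar.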
On 8/2/2012 4:30 PM, Andrei Alexandrescu wrote:I think Walter has very often emphasized the need for the lexer to be faster than the usual client software. My perception is that he's discussing lexer design in understanding there's a need for a less comfortable approach, namely do decoding in client.+1
Aug 02 2012
On 8/2/2012 3:38 PM, Jonathan M Davis wrote:On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:No, the consumer can do its own buffering. It only needs a 4 character buffer, worst case.Remember, its the consumer doing the decoding, not the input range.But that's the problem. The consumer has to treat character ranges specially to make this work. It's not generic. If it were generic, then it would simply be using front, popFront, etc. It's going to have to special case strings to do the buffering that you're suggesting. And if you have to special case strings, then how is that any different from what we have now?If you're arguing that strings should be treated as ranges of code units, then pretty much _every_ range-based function will have to special case strings to even work correctly - otherwise it'll be operating on individual code points rather than code points (e.g. filtering code units rather than code points, which would generate an invalid string). This makes the default behavior incorrect, forcing _everyone_ to special case strings _everywhere_ if they want correct behavior with ranges which are strings. And efficiency means nothing if the result is wrong.No, I'm arguing that the LEXER should accept a UTF8 input range for its input. I am not making a general argument about ranges, characters, or Phobos.As it is now, the default behavior of strings with range-based functions is correct but inefficient, so at least we get correct code. And if someone wants their string processing to be efficient, then they special case strings and do things like the buffering that you're suggesting. So, we have correct by default with efficiency as an option. The alternative that you seem to be suggesting (making strings be treated as ranges of code units) means that it would be fast by default but correct as an option, which is completely backwards IMHO. Efficiency is important, but it's pointless how efficient something is if it's wrong, and expecting that your average programmer is going to write unicode-aware code which functions correctly is completely unrealistic.Efficiency for the *lexer* is of *paramount* importance. I don't anticipate std.d.lexer will be implemented by some random newbie, I expect it to be carefully implemented and to do Unicode correctly, regardless of how difficult or easy that may be. I seem to utterly fail at making this point. The same point applies to std.regex - efficiency is terribly, terribly important for it. Everyone judges regexes by their speed, and nobody cares how hard they are to implement to get that speed. To reiterate another point, since we are in the compiler business, people will expect std.d.lexer to be of top quality, not some bag on the side. It needs to be usable as a base for writing a professional quality compiler. It's the reason why I'm pushing much harder on this than I do for other modules.
Aug 02 2012
On Thursday, August 02, 2012 19:40:13 Walter Bright wrote:No, I'm arguing that the LEXER should accept a UTF8 input range for its input. I am not making a general argument about ranges, characters, or Phobos.I think that this is the main point of misunderstanding then. From your comments, you seemed to me to be arguing for strings being treated as ranges of code units all the time, which really doesn't make sense. - Jonathan M Davis
Aug 02 2012
On 8/2/12 10:40 PM, Walter Bright wrote:To reiterate another point, since we are in the compiler business, people will expect std.d.lexer to be of top quality, not some bag on the side. It needs to be usable as a base for writing a professional quality compiler. It's the reason why I'm pushing much harder on this than I do for other modules.The lexer must be configurable enough to tokenize other languages than D. I confess I'm very unhappy that there seem to be no less than three people determined to write lexers for D. We're wasting precious talent and resources doubly. Once, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved. Andrei
Aug 02 2012
On Thursday, August 02, 2012 23:00:41 Andrei Alexandrescu wrote:On 8/2/12 10:40 PM, Walter Bright wrote:You're not going to get as fast a lexer if it's not written specifically for D. Writing a generic lexer is a different problem. It's also one that needs to be solved, but I think that it's a mistake to think that a generic lexer is going to be able to be as fast as one specifically optimized for D. And there are several people already working on the generic stuff (like Phillipe). - Jonathan M DavisTo reiterate another point, since we are in the compiler business, people will expect std.d.lexer to be of top quality, not some bag on the side. It needs to be usable as a base for writing a professional quality compiler. It's the reason why I'm pushing much harder on this than I do for other modules.The lexer must be configurable enough to tokenize other languages than D. I confess I'm very unhappy that there seem to be no less than three people determined to write lexers for D. We're wasting precious talent and resources doubly. Once, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved.
Aug 02 2012
On 08/03/2012 05:08 AM, Jonathan M Davis wrote:On Thursday, August 02, 2012 23:00:41 Andrei Alexandrescu wrote:This should certainly be possible.On 8/2/12 10:40 PM, Walter Bright wrote:You're not going to get as fast a lexer if it's not written specifically for D.To reiterate another point, since we are in the compiler business, people will expect std.d.lexer to be of top quality, not some bag on the side. It needs to be usable as a base for writing a professional quality compiler. It's the reason why I'm pushing much harder on this than I do for other modules.The lexer must be configurable enough to tokenize other languages than D. I confess I'm very unhappy that there seem to be no less than three people determined to write lexers for D. We're wasting precious talent and resources doubly. Once, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved.Writing a generic lexer is a different problem.It is a more general problem.
Aug 02 2012
On 8/2/12 11:08 PM, Jonathan M Davis wrote:You're not going to get as fast a lexer if it's not written specifically for D. Writing a generic lexer is a different problem. It's also one that needs to be solved, but I think that it's a mistake to think that a generic lexer is going to be able to be as fast as one specifically optimized for D.Do you have any evidence to back that up? I mean you're just saying it. Andrei
Aug 02 2012
On Thursday, August 02, 2012 23:41:39 Andrei Alexandrescu wrote:On 8/2/12 11:08 PM, Jonathan M Davis wrote:Because all of the rules are built directly into the code. You don't have to use regexes or anything like that. Pieces of the lexer could certainly be generic or copied over to other lexers just fine, but when you write the lexer by hand specifically for D, you can guarantee that it checks exactly what it needs to for D without any extra cruft or lost efficiency due to decoding where it doesn't need to or checking an additional character at any point or anything along those lines. And tuning it is much easier, because you have control over the whole thing. Also, given features such as token strings, I would think that using a generic lexer on D would be rather difficult anyway. If someone wants to try and write a generic lexer for D and see if they can beat out any hand-written ones, then more power to them, but I don't see how you could possibly expect to shave the operations down to the bare minimum necessary to get the job done with a generic lexer, whereas a hand-written parser can do that given enough effort. - Jonathan M DavisYou're not going to get as fast a lexer if it's not written specifically for D. Writing a generic lexer is a different problem. It's also one that needs to be solved, but I think that it's a mistake to think that a generic lexer is going to be able to be as fast as one specifically optimized for D.Do you have any evidence to back that up? I mean you're just saying it.
Aug 02 2012
On 08/03/2012 05:53 AM, Jonathan M Davis wrote:On Thursday, August 02, 2012 23:41:39 Andrei Alexandrescu wrote:The parts that can be specified with simple regexen are certainly not a problem. A generic string mixin based lexer should be able to generate very close to optimal code by eg. merging common token prefixes and the like.On 8/2/12 11:08 PM, Jonathan M Davis wrote:Because all of the rules are built directly into the code. You don't have to use regexes or anything like that.You're not going to get as fast a lexer if it's not written specifically for D. Writing a generic lexer is a different problem. It's also one that needs to be solved, but I think that it's a mistake to think that a generic lexer is going to be able to be as fast as one specifically optimized for D.Do you have any evidence to back that up? I mean you're just saying it.Pieces of the lexer could certainly be generic or copied over to other lexers just fine, but when you write the lexer by hand specifically for D, you can guarantee that it checks exactly what it needs to for D without any extra cruft or lost efficiency due to decoding where it doesn't need to or checking an additional character at any point or anything along those lines.This is achievable if it is fully generic as well. Just add generic peephole optimizations until the generated lexer is identical to what the hand-written one would have looked like.And tuning it is much easier,Yes.because you have control over the whole thing.The library writer has control over the whole thing in each case.Also, given features such as token strings, I would think that using a generic lexer on D would be rather difficult anyway.It would of course need to incorporate custom parsing routine support.If someone wants to try and write a generic lexer for D and see if they can beat out any hand-written ones,I'll possibly give it a shot if I can find the time.then more power to them, but I don't see how you could possibly expect to shave the operations down to the bare minimum necessary to get the job done with a generic lexer, whereas a hand-written parser can do that given enough effort.If it is optimal for D lexing and close-optimal or optimal for other languages then it is profoundly more useful than just a D lexer.
Aug 02 2012
On Friday, August 03, 2012 06:14:08 Timon Gehr wrote:If it is optimal for D lexing and close-optimal or optimal for other languages then it is profoundly more useful than just a D lexer.I fully concur that we should have a generic lexer, but unless the generic lexer can be just as fast as a hand-written one (which I seriously question), then I definitely think that we're going to want both, not just a generic one. - Jonathan M Davis
Aug 02 2012
On Fri, Aug 3, 2012 at 6:14 AM, Timon Gehr <timon.gehr gmx.ch> wrote:I propose we let him finish std.lexer, test it with Jonathan, benchmark it, etc. Once it's out, we can create a small group interested in making it generic and we can try and decorticate it at our leisure.If someone wants to try and write a generic lexer for D and see if they can beat out any hand-written ones,I'll possibly give it a shot if I can find the time.
Aug 02 2012
Le 03/08/2012 05:41, Andrei Alexandrescu a écrit :On 8/2/12 11:08 PM, Jonathan M Davis wrote:I think it can be as fast if the lexer is compile time created. I'd also argue that it is better to have a D lexer ASAP, even if it isn't as configurable as it could be. Better something now than something better later. We can make it better later anyway.You're not going to get as fast a lexer if it's not written specifically for D. Writing a generic lexer is a different problem. It's also one that needs to be solved, but I think that it's a mistake to think that a generic lexer is going to be able to be as fast as one specifically optimized for D.Do you have any evidence to back that up? I mean you're just saying it. Andrei
Aug 03 2012
On Friday, 3 August 2012 at 03:00:42 UTC, Andrei Alexandrescu wrote:The lexer must be configurable enough to tokenize other languages than D.You're going to have to defend that one.
Aug 02 2012
On 8/2/12 11:11 PM, Bernard Helyer wrote:On Friday, 3 August 2012 at 03:00:42 UTC, Andrei Alexandrescu wrote:I wouldn't know how to. To me it's all too obvious it's better to have a tool that allows writing lexers for many languages (and incidentally use it for D), than a tool that can only tokenize D code. AndreiThe lexer must be configurable enough to tokenize other languages than D.You're going to have to defend that one.
Aug 02 2012
On 8/2/2012 8:00 PM, Andrei Alexandrescu wrote:On 8/2/12 10:40 PM, Walter Bright wrote:I agree and I hope the three can combine their efforts with the best ideas of each. I don't think the lexer needs to be configurable to other languages. I think it's fine that it be custom for D. However, and this is a big however, its user-facing interface should be usable for other languages, I see no obvious reason why this cannot be.To reiterate another point, since we are in the compiler business, people will expect std.d.lexer to be of top quality, not some bag on the side. It needs to be usable as a base for writing a professional quality compiler. It's the reason why I'm pushing much harder on this than I do for other modules.The lexer must be configurable enough to tokenize other languages than D. I confess I'm very unhappy that there seem to be no less than three people determined to write lexers for D. We're wasting precious talent and resources doubly. Once, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved.
Aug 02 2012
On Friday, 3 August 2012 at 03:14:14 UTC, Walter Bright wrote:On 8/2/2012 8:00 PM, Andrei Alexandrescu wrote:If the other guys think they've got it, then I can withdraw my efforts. I was just thinking I had a lexer just sitting around, may as well use it, but if the other guys have it, then I'm fine with withdrawing.On 8/2/12 10:40 PM, Walter Bright wrote:I agree and I hope the three can combine their efforts with the best ideas of each.To reiterate another point, since we are in the compiler business, people will expect std.d.lexer to be of top quality, not some bag on the side. It needs to be usable as a base for writing a professional quality compiler. It's the reason why I'm pushing much harder on this than I do for other modules.The lexer must be configurable enough to tokenize other languages than D. I confess I'm very unhappy that there seem to be no less than three people determined to write lexers for D. We're wasting precious talent and resources doubly. Once, several people are working in parallel on the same product. Second, none of them is actually solving the problem that should be solved.
Aug 02 2012
On Friday, August 03, 2012 05:36:05 Bernard Helyer wrote:If the other guys think they've got it, then I can withdraw my efforts. I was just thinking I had a lexer just sitting around, may as well use it, but if the other guys have it, then I'm fine with withdrawing.I'm a fair ways along already. Making changes according to what Walter wants will slow me down some but not a lot. It's something that I've wanted to do for quite some time but just finally got around to starting a couple of weeks ago. So, I definitely want to complete it regardless of what anyone else is doing. It also gives me an opportunity to make sure that the spec and dmd match (and I've already found a few bugs with the spec and fixed them). Feel free to do whatever you want, but I'm definitely going to complete what I'm working on. - Jonathan M Davis
Aug 02 2012
On Friday, 3 August 2012 at 03:59:29 UTC, Jonathan M Davis wrote:On Friday, August 03, 2012 05:36:05 Bernard Helyer wrote:I'll let you get on with it then. I'll amuse myself with the thought of someone asking why SDC doesn't use std.d.lexer or a parser generator. I'll then hit them with my cane, and tell them to get off of my lawn. :PIf the other guys think they've got it, then I can withdraw my efforts. I was just thinking I had a lexer just sitting around, may as well use it, but if the other guys have it, then I'm fine with withdrawing.I'm a fair ways along already. Making changes according to what Walter wants will slow me down some but not a lot. It's something that I've wanted to do for quite some time but just finally got around to starting a couple of weeks ago. So, I definitely want to complete it regardless of what anyone else is doing. It also gives me an opportunity to make sure that the spec and dmd match (and I've already found a few bugs with the spec and fixed them). Feel free to do whatever you want, but I'm definitely going to complete what I'm working on. - Jonathan M Davis
Aug 02 2012
On Friday, August 03, 2012 06:02:29 Bernard Helyer wrote:I'll let you get on with it then. I'll amuse myself with the thought of someone asking why SDC doesn't use std.d.lexer or a parser generator. I'll then hit them with my cane, and tell them to get off of my lawn. :PWell, if std.d.lexer is done correctly, then it should at least be very usable by SDC, though whether it would be worth changing SDC to use it, I have no idea. That'll be up to you and the other SDC devs. But when code gets posted, you should definitely chime in on it. - Jonathan M Davis
Aug 02 2012
On Friday, 3 August 2012 at 04:02:34 UTC, Bernard Helyer wrote:I'll let you get on with it then. I'll amuse myself with the thought of someone asking why SDC doesn't use std.d.lexer or a parser generator. I'll then hit them with my cane, and tell them to get off of my lawn. :PI don't think you should give up on this adaption. SDC's lexer is already complete. I don't know if anyone else can claim that. Apart from that, any changes made will almost certainly benefit SDC as well.
Aug 02 2012
On Fri, Aug 3, 2012 at 5:59 AM, Jonathan M Davis <jmdavisProg gmx.com> wrote:On Friday, August 03, 2012 05:36:05 Bernard Helyer wrote:Look, many of us here were interested in your idea of having comments and errors lexed as tokens. Could it be possible to just add two static policies errorBehavior { report, skip, ...} and comments { asToken, grouped }? That way, a user just say; // Walter auto toks = Lexer!(comments.grouped)(code); // May be the default value // Other: auto toks = Lexer!(comments.asToken)(code); It's just a small static if away...If the other guys think they've got it, then I can withdraw my efforts. I was just thinking I had a lexer just sitting around, may as well use it, but if the other guys have it, then I'm fine with withdrawing.I'm a fair ways along already. Making changes according to what Walter wants will slow me down some but not a lot. It's something that I've wanted to do for quite some time but just finally got around to starting a couple of weeks ago. So, I definitely want to complete it regardless of what anyone else is doing. It also gives me an opportunity to make sure that the spec and dmd match (and I've already found a few bugs with the spec and fixed them). Feel free to do whatever you want, but I'm definitely going to complete what I'm working on.
Aug 02 2012
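A rough sketch of what static policies along Philippe's lines could look like (the enum and struct names below are invented for illustration, not a proposed std.d.lexer API):

enum CommentPolicy { grouped, asToken }
enum ErrorPolicy   { report, skipAndContinue, asToken }

struct DLexer(CommentPolicy comments = CommentPolicy.grouped,
              ErrorPolicy errors = ErrorPolicy.report)
{
    this(string code) { source = code; }

    private void lexComment()
    {
        static if (comments == CommentPolicy.asToken)
        {
            // emit a token carrying the comment text
        }
        else
        {
            // skip over the comment entirely
        }
    }

    private string source;
}

// Usage along the lines Philippe shows:
//   auto toks  = DLexer!(CommentPolicy.grouped)(code);   // the suggested default
//   auto toks2 = DLexer!(CommentPolicy.asToken)(code);   // comments as tokens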
Walter Bright wrote:On 8/2/2012 1:26 PM, Jonathan M Davis wrote:Working example: https://github.com/pszturmaj/json-streaming-parser/blob/master/json.d#L18 :-)On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:1. read a character from the range 2. if the character is the start of a multibyte character, put the character in the buffer 3. keep reading from the range until you've got the whole of the multibyte character 4. convert that 6 (or 4) character buffer into a dcharKeep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.And how on earth is that going to work as a range?
Aug 02 2012
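A minimal sketch of the four-step buffering scheme quoted above, over a generic input range of UTF-8 code units (the function name nextChar, the constraint, and the use of std.utf.decode are my own assumptions for illustration; error handling is omitted, and narrow strings would first be wrapped or cast so that front yields code units):

import std.range, std.traits, std.utf;

// Decode the next character from a range of UTF-8 code units (char or ubyte).
dchar nextChar(R)(ref R input)
    if (isInputRange!R && !isNarrowString!R)
{
    immutable char c = cast(char) input.front;   // 1. read a code unit
    input.popFront();
    if (c < 0x80)
        return c;                                // ASCII needs no buffering

    char[4] buf = void;                          // 2. start filling the buffer
    buf[0] = c;
    size_t len = 2;                              //    110xxxxx => 2 bytes
    if ((c & 0xE0) == 0xE0) len = 3;             //    1110xxxx => 3 bytes
    if ((c & 0xF0) == 0xF0) len = 4;             //    11110xxx => 4 bytes

    foreach (i; 1 .. len)                        // 3. read the rest of the sequence
    {
        buf[i] = cast(char) input.front;
        input.popFront();
    }

    size_t index = 0;
    return decode(buf[0 .. len], index);         // 4. convert the buffer to a dchar
}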
On Thu, Aug 2, 2012 at 1:26 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:A little bit off topic but... People have been composing/decorating Streams/Ranges for probably 30 years now. Examples: input stream, output stream, byte stream, char stream, buffered stream, cipher stream, base64 stream, etc. If you need more example. Consider an HTTPS request. At the lowest level you have a byte stream/range. No sane developer wants to deal with HTTPS request at this level so you decorate it with an SSL stream/range. That is still too low level so you decorate this with a char stream/range. Still too low level? Decorate it with a modal line buffered stream/range. We are getting closer but it still not the correct range abstraction so then you need a modal http stream/range. You need the modal part if you want to support http streaming.On 8/2/2012 1:38 AM, Jonathan M Davis wrote:And how on earth is that going to work as a range? Range-based functions operate on elements. They use empty, front, popFront, etc. If front doesn't return an element that a range-based function can operate on without caring what it is, then that type isn't going to work as a range. If you need the consumer to be doing something special, then that means you need to special case it for that range type. And that's what you're doing when you special- case range-base functions for strings.On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:Keep a 6 character buffer in your consumer. If you read a char with the high bit set, start filling that buffer and then decode it.On 8/2/2012 12:43 AM, Jonathan M Davis wrote:How?It is for ranges in general. In the general case, a range of UTF-8 or UTF-16 makes no sense whatsoever. Having range-based functions which understand the encodings and optimize accordingly can be very beneficial (which happens with strings but can't happen with general ranges without the concept of a variably-length encoded range like we have with forward range or random access range), but to actually have a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions operate on elements, and doing stuff like filter or map or reduce on code units doesn't make any sense at all.Yes, it can work.
Aug 02 2012
On 2012-08-02 09:29, Walter Bright wrote:My experience in writing fast string based code that worked on UTF8 and correctly handled multibyte characters was that they are very possible and practical, and they are faster. The lexer MUST MUST MUST be FAST FAST FAST. Or it will not be useful. If it isn't fast, serious users will eschew it and will cook up their own. You'll have a nice, pretty, useless toy of std.d.lexer. I think there's some serious underestimation of how critical this is.I do understand that the lexer needs to be insanely fast and it needs to operate on UTF-8 and not UTF-32 or anything else. But what I still don't understand is how a UTF-8 range is going to be usable by other range based functions in Phobos. -- /Jacob Carlborg
Aug 02 2012
On 8/2/2012 12:49 AM, Jacob Carlborg wrote:But what I still don't understand is how a UTF-8 range is going to be usable by other range based functions in Phobos.Worst case use an adapter range.
Aug 02 2012
Walter Bright , dans le message (digitalmars.D:174015), a écrit :On 8/2/2012 12:49 AM, Jacob Carlborg wrote:Yes auto r = myString.byChar(); after implementing a byChar adapter range or just auto r = cast(const(ubyte)[]) myString; And it's a range of code unit, not code point. And it's usable in phobos.But what I still don't understand is how a UTF-8 range is going to be usable by other range based functions in Phobos.Worst case use an adapter range.
Aug 02 2012
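A minimal hand-rolled sketch of such a byChar adapter (purely illustrative; the struct and function names here are assumptions, not an existing Phobos API at the time of this thread):

struct ByChar
{
    private const(char)[] str;

    @property bool empty() const { return str.length == 0; }
    @property char front()  const { return str[0]; }   // a code unit, no decoding
    void popFront() { str = str[1 .. $]; }
}

auto byChar(const(char)[] s) { return ByChar(s); }

// Usage: front now yields individual UTF-8 code units instead of decoded dchars.
//   auto r = "héllo".byChar();
//   assert(r.front == 'h');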
On 2012-08-02 10:15, Walter Bright wrote:Worst case use an adapter range.And that is better than a plain string? -- /Jacob Carlborg
Aug 02 2012
Jacob Carlborg , dans le message (digitalmars.D:174069), a écrit :On 2012-08-02 10:15, Walter Bright wrote:because its front method does not do any decoding.Worst case use an adapter range.And that is better than a plain string?
Aug 02 2012
On 2012-08-02 22:51, Christophe Travert wrote:Jacob Carlborg , dans le message (digitalmars.D:174069), a écrit :If it was a plain string you wouldn't use "front". You would handle any possible decoding yourself, internally in the lexer. This is what Jonathan already does, it seems: static if(isNarrowString!R) Unqual!(ElementEncodingType!R) first = range[0]; else dchar first = range.front; If you're only supporting plain UTF-8 strings you would just do: char first = range[0]; -- /Jacob CarlborgOn 2012-08-02 10:15, Walter Bright wrote:because its front method does not do any decoding.Worst case use an adapter range.And that is better than a plain string?
Aug 02 2012
Jacob Carlborg , dans le message (digitalmars.D:174131), a écrit :static if(isNarrowString!R) Unqual!(ElementEncodingType!R) first = range[0]; else dchar first = range.front;I find it more comfortable to just use first = range.front, with a range of char or ubyte. This range does not have to be a string, it can be something over a file, stream, socket. It can also be the result of an algorithm, because you *can* use algorithms on ranges of char, and it makes sense if you know what you are doing. If Walter discovers the lexer does not work with a socket, a "file.byChunk.join", and has to do expensive utf-8 decoding for the lexer to work because it can only use ranges of dchar, and not ranges of char (except that it special-cased strings), he may not be happy. If the range happens to be a string, I would use an adapter to make it appear like a range of char, not dchar, like the library likes to do. I think Andrei suggested that already. -- Christophe
Aug 03 2012
On 8/3/2012 1:21 AM, Christophe Travert wrote:This range does not have to be a string, it can be a something over a file, stream, socket. It can also be the result of an algorithm, because you *can* use algorithm on ranges of char, and it makes sense if you know what you are doing.Correct, that's the whole point of using a range - it can come from anything.If Walter discovers the lexer does not work with a socket, a "file.byChunk.join", and has to do expensive utf-8 decoding for the lexer to work because it can only use range of dchar, and not range of char (except that it special-cased strings), he may not be happy.No may about it, I wouldn't be happy :-)
Aug 03 2012
On 8/3/2012 3:07 AM, Walter Bright wrote:On 8/3/2012 1:21 AM, Christophe Travert wrote:For example, let's suppose we want to do D syntax highlighting in our IDE. It is highly unlikely that the text editor's data structure is a simple string. It's likely to be an array of lines, or something like that. It's simple to provide a range interface to that, where it presents the data structure as a sequence of chars. But that wouldn't work with Jonathan's string specialization. Nor would the string specialization work if the input was a 1024 byte file buffer.This range does not have to be a string, it can be a something over a file, stream, socket. It can also be the result of an algorithm, because you *can* use algorithm on ranges of char, and it makes sense if you know what you are doing.Correct, that's the whole point of using a range - it can come from anything.
Aug 03 2012
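For illustration, a rough sketch of how an editor buffer stored as an array of lines could present itself to the lexer as a range of chars (purely hypothetical, not how any particular IDE stores text):

struct LineBufferRange
{
    private const(char[])[] lines;   // the editor's array of lines
    private size_t line, col;

    @property bool empty() const { return line == lines.length; }

    @property char front() const
    {
        // Synthesize a '\n' at the end of every line.
        return col < lines[line].length ? lines[line][col] : '\n';
    }

    void popFront()
    {
        if (col < lines[line].length)
            ++col;
        else { ++line; col = 0; }    // step past the synthetic '\n'
    }
}

// The lexer never needs to know it isn't reading a contiguous string:
//   auto src = LineBufferRange(["int x;", "x = 42;"]);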
Would this be an argument for putting the computation of source locations (i.e. line + offset or similar) into the range / into an template argument / policy, so that it's done in the most effective way for the client? Kate for example has a "range"-type that marks a span in the text buffer. This way the lexer can return token with the correct "textrange" attached and you don't need to recompute the text ranges from line/col numbers.Correct, that's the whole point of using a range - it can come from anything.For example, let's suppose we want to do D syntax highlighting in our IDE. It is highly unlikely that the text editor's data structure is a simple string. It's likely to be an array of lines, or something like that. It's simple to provide a range interface to that, where it presents the data structure as a sequence of chars.
Aug 03 2012
On 8/3/2012 4:40 AM, Tobias Pankrath wrote:Would this be an argument for putting the computation of source locations (i.e. line + offset or similar) into the range / into an template argument / policy, so that it's done in the most effective way for the client? Kate for example has a "range"-type that marks a span in the text buffer. This way the lexer can return token with the correct "textrange" attached and you don't need to recompute the text ranges from line/col numbers.Worth thinking about.
Aug 03 2012
On 02-Aug-12 08:30, Walter Bright wrote:On 8/1/2012 8:04 PM, Jonathan M Davis wrote:Well, it doesn't convert back to UTF-8 as it just slices of the input :) Otherwise very true especially with ctRegex that used to recieve quite some hype even in its present state. 33% of time spent is doing and redoing UTF-8 decoding. (Note that quite some extra work on top of what lexer does is done, e.g. lexer is largely deterministic but regex has some of try-rollback).On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:I have argued against making ranges of characters dchars, because of performance reasons. This will especially adversely affect the performance of the lexer. The fact is, files are normally in UTF8 and just about everything else is in UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that the single largest thing holding back regex performance is that premature conversion to dchar and back to char.1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)But that's not how ranges of characters work. They're ranges of dchar. Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create special wrappers around string or wstring to have ranges of UTF-8. The way that it's normally done is to have ranges of dchar and then special-case range-based functions for strings. Then the function can operate on any range of dchar but still operates on strings efficiently.If lexer is required to accept dchar ranges, its performance will drop at least in half, and people are going to go reinvent their own lexers.Yes, it slows things down. Decoding (if any) should kick in only where it's absolutely necessary and be an integral part of lexer automation. -- Dmitry Olshansky
Aug 02 2012
Really nice idea. It is still easy to wrap the Range in another Range that process errors in a custom way.7. It should accept a callback delegate for errors. That delegate should decide whether to: 1. ignore the error (and "Lexer" will try to recover and continue) 2. print an error message (and "Lexer" will try to recover and continue) 3. throw an exception, "Lexer" is done with that input rangeI'm currently treating errors as tokens. It then becomes easy for the code using the lexer to just ignore the errors, to process them immediately, or to put off dealing with them until the lexing is complete. So, the code using the lexer can handle errors however and whenever it likes without having to worry about delegates or exceptions. Since tokens are lexed lazily, the fact that an error is reported as a token doesn't require that the lexing continue, but it also makes it _easy_ to continue lexing, ignoring or saving the error. I'm inclined to think that that's a superior approach to using delegates and exceptions.
Aug 02 2012
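A minimal sketch of the errors-as-tokens idea (the type and field names are hypothetical, not Jonathan's actual code):

enum TokenType { identifier, intLiteral, stringLiteral, comment, error /*, ... */ }

struct Token
{
    TokenType type;
    string    text;    // the lexed text, or the error message for error tokens
    size_t    line;
    size_t    column;
}

// The caller decides what an error means - ignore it, collect it, or bail out:
//   foreach (tok; lexSource(source))        // lexSource is a placeholder name
//   {
//       if (tok.type == TokenType.error)
//           errors ~= tok;                  // collect and keep lexing
//   }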
On 8/2/2012 4:27 AM, deadalnix wrote:Really nice idea. It is still easy to wrap the Range in another Range that process errors in a custom way.It's suboptimal for a high speed lexer/parser, as already explained. Speed is of paramount importance for a lexer, and it overrides things like elegance and simplicity.
Aug 02 2012
On 2012-08-02 02:10, Walter Bright wrote:1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)I'm no expert on ranges but won't that prevent slicing? Slicing is one of the main reasons for why the Tango XML parser is so amazingly fast. -- /Jacob Carlborg
Aug 01 2012
On 8/1/2012 11:49 PM, Jacob Carlborg wrote:On 2012-08-02 02:10, Walter Bright wrote:You don't want to use slicing on the lexer. The reason is that your slices will be spread all over memory, as source files can be huge, and all that memory will be retained and never released. What you want is a compact representation after lexing. Compactness also helps a lot with memory caching.1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)I'm no expert on ranges but won't that prevent slicing? Slicing is one of the main reasons for why the Tango XML parser is so amazingly fast.
Aug 02 2012
Le 02/08/2012 09:30, Walter Bright a écrit :On 8/1/2012 11:49 PM, Jacob Carlborg wrote:Token are not kept in memory. You usually consume them for other processing and throw them away. It isn't an issue.On 2012-08-02 02:10, Walter Bright wrote:You don't want to use slicing on the lexer. The reason is that your slices will be spread all over memory, as source files can be huge, and all that memory will be retained and never released. What you want is a compact representation after lexing. Compactness also helps a lot with memory caching.1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)I'm no expert on ranges but won't that prevent slicing? Slicing is one of the main reasons for why the Tango XML parser is so amazingly fast.
Aug 02 2012
On 8/2/2012 4:52 AM, deadalnix wrote:Le 02/08/2012 09:30, Walter Bright a écrit :The tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.On 8/1/2012 11:49 PM, Jacob Carlborg wrote:Token are not kept in memory. You usually consume them for other processing and throw them away. It isn't an issue.On 2012-08-02 02:10, Walter Bright wrote:You don't want to use slicing on the lexer. The reason is that your slices will be spread all over memory, as source files can be huge, and all that memory will be retained and never released. What you want is a compact representation after lexing. Compactness also helps a lot with memory caching.1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)I'm no expert on ranges but won't that prevent slicing? Slicing is one of the main reasons for why the Tango XML parser is so amazingly fast.
Aug 02 2012
Le 02/08/2012 20:08, Walter Bright a écrit :On 8/2/2012 4:52 AM, deadalnix wrote:Ok, what do you think of this: the lexer can have a parameter that tells whether it should build a table of tokens or slice the input. The second is important, for instance for an IDE: lexing will occur often, and you prefer slicing here because you already have the source file in memory anyway. The token always contains a slice as a member. The slice comes either from the source or from a memory chunk allocated by the lexer. If the lexer allocates chunks, it will reuse the same memory location for the same string. With the following mechanism to compare slices, this will require only 2 comparisons for identifiers lexed with that method: if(a.length != b.length) return false; if(a.ptr == b.ptr) return true; // Regular char by char comparison. Is that a suitable option?Le 02/08/2012 09:30, Walter Bright a écrit :The tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.On 8/1/2012 11:49 PM, Jacob Carlborg wrote:Token are not kept in memory. You usually consume them for other processing and throw them away. It isn't an issue.On 2012-08-02 02:10, Walter Bright wrote:You don't want to use slicing on the lexer. The reason is that your slices will be spread all over memory, as source files can be huge, and all that memory will be retained and never released. What you want is a compact representation after lexing. Compactness also helps a lot with memory caching.1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)I'm no expert on ranges but won't that prevent slicing? Slicing is one of the main reasons for why the Tango XML parser is so amazingly fast.
Aug 03 2012
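The comparison deadalnix sketches, written out as a small function for illustration (sameIdentifier is an invented name; the pointer check is only valid under the assumption he states, namely that the lexer reuses the same chunk location for the same string):

// Two slices handed out by the lexer's chunk allocator are equal iff they
// are the very same slice; anything else falls back to element comparison.
bool sameIdentifier(const(char)[] a, const(char)[] b)
{
    if (a.length != b.length) return false;
    if (a.ptr == b.ptr) return true;   // same interned storage => equal
    return a == b;                     // regular char by char comparison
}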
deadalnix , dans le message (digitalmars.D:174155), a écrit :If I may add, there are several possitilities here: 1- a real slice of the input range 2- a slice of the input range created with .save and takeExactly 3- a slice allocated in GC memory by the lexer 4- a slice of memory owned by the lexer, which is reused for the next token (thus, the next call to popFront invalidates the token). 5- a slice of memory from a lookup table. All are useful in certain situations. don't have a huge amont of code to parse. token, you save the range, and when you found the end of the token, you just use takeExactly on the saved ranged. to do them in a second step if you want to be able to use input range (which you have too). Actually, the buffer may be external, if you use a buffered-range adapter to make a forward range out of an input range. Having an internal buffer may be more efficient. That something that has to be profiled. slice saved somewhere before having a look at the look-up table. efficiency by an algorithm external to the lexer. This would probably bring many ways to use the lexer. For example, you can filter-out many tokens that you don't want before building the table, which avoids to have an overfull look-up table if you are only interested in a subset of tokens. user to use the adapters blindly, or if an internal implementation proves to be significantly more efficient. -- ChristopheThe tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.Ok, what do you think of that : lexer can have a parameter that tell if it should build a table of token or slice the input. The second is important, for instance for an IDE : lexing will occur often, and you prefer slicing here because you already have the source file in memory anyway. The token always contains as a member a slice. The slice come either from the source or from a memory chunk allocated by the lexer.
Aug 03 2012
On Friday, 3 August 2012 at 14:49:55 UTC, travert phare.normalesup.orgIf I may add, there are several possitilities here: 1- a real slice of the input range 2- a slice of the input range created with .save and takeExactly 3- a slice allocated in GC memory by the lexer 4- a slice of memory owned by the lexer, which is reused for the next token (thus, the next call to popFront invalidates the token). 5- a slice of memory from a lookup table.(i.e. for syntax highlighting).
Aug 03 2012
On 8/3/2012 6:18 AM, deadalnix wrote:lexer can have a parameter that tell if it should build a table of token or slice the input. The second is important, for instance for an IDE : lexing will occur often, and you prefer slicing here because you already have the source file in memory anyway.A string may span multiple lines - IDEs do not store the text as one string.If the lexer allocate chunks, it will reuse the same memory location for the same string. Considering the following mecanism to compare slice, this will require 2 comparaisons for identifier lexed with that method : if(a.length != b.length) return false; if(a.ptr == b.ptr) return true; // Regular char by char comparison. Is that a suitable option ?You're talking about doing for strings what is done for identifiers - returning a unique handle for each. I don't think this works very well for string literals, as there seem to be few duplicates.
Aug 03 2012
Le 03/08/2012 21:59, Walter Bright a écrit :On 8/3/2012 6:18 AM, deadalnix wrote:That option has the benefit of allowing very fast identifier comparison (like DMD does) without imposing it. For instance, you could use that trick within a single thread, and another identifier table for another. It completely avoids the multithreading problem you mention, while keeping most identifier comparisons really fast. It also allows for several allocation schemes for the slice that fit different needs, as shown by Christophe Travert.lexer can have a parameter that tell if it should build a table of token or slice the input. The second is important, for instance for an IDE : lexing will occur often, and you prefer slicing here because you already have the source file in memory anyway.A string may span multiple lines - IDEs do not store the text as one string.If the lexer allocate chunks, it will reuse the same memory location for the same string. Considering the following mecanism to compare slice, this will require 2 comparaisons for identifier lexed with that method : if(a.length != b.length) return false; if(a.ptr == b.ptr) return true; // Regular char by char comparison. Is that a suitable option ?You're talking about doing for strings what is done for identifiers - returning a unique handle for each. I don't think this works very well for string literals, as there seem to be few duplicates.
Aug 04 2012
On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:The tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.String literals often _can't_ be slices unless you leave them in their original state rather than giving the version that they translate to (e.g. leaving \© in the string rather than replacing it with its actual, unicode value). And since you're not going to be able to create the literal using whatever type the range is unless it's a string of some variety, that means that the literals often can't be slices, which - depending on the implementation - would make it so that that they can't _ever_ be slices. Identifiers are a different story, since they don't have to be translated at all, but regardless of whether keeping a slice would be better than creating a new string, the identifier table will be far superior, since then you only need one copy of each identifier. So, it ultimately doesn't make sense to use slices in either case even without considering issues like them being spread across memory. The only place that I'd expect a slice in a token is in the string which represents the text which was lexed, and that won't normally be kept around. - Jonathan M Davis
Aug 03 2012
Jonathan M Davis , dans le message (digitalmars.D:174191), a écrit :On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:I thought it was not the lexer's job to process litterals. Just split the input in tokens, and provide minimal info: TokenType, line and col along with the representation from the input. That's enough for a syntax highlighting tools for example. Otherwise you'll end up doing complex interpretation and the lexer will not be that efficient. Litteral interpretation can be done in a second step. Do you think doing litteral interpretation separately when you need it would be less efficient? -- ChristopheThe tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.String literals often _can't_ be slices unless you leave them in their original state rather than giving the version that they translate to (e.g. leaving \© in the string rather than replacing it with its actual, unicode value). And since you're not going to be able to create the literal using whatever type the range is unless it's a string of some variety, that means that the literals often can't be slices, which - depending on the implementation - would make it so that that they can't _ever_ be slices. Identifiers are a different story, since they don't have to be translated at all, but regardless of whether keeping a slice would be better than creating a new string, the identifier table will be far superior, since then you only need one copy of each identifier. So, it ultimately doesn't make sense to use slices in either case even without considering issues like them being spread across memory. The only place that I'd expect a slice in a token is in the string which represents the text which was lexed, and that won't normally be kept around. - Jonathan M Davis
Aug 04 2012
On 04-Aug-12 14:02, Christophe Travert wrote:Jonathan M Davis , dans le message (digitalmars.D:174191), a écrit :Most likely - since you re-read the same memory twice to do it. -- Dmitry OlshanskyOn Thursday, August 02, 2012 11:08:23 Walter Bright wrote:I thought it was not the lexer's job to process litterals. Just split the input in tokens, and provide minimal info: TokenType, line and col along with the representation from the input. That's enough for a syntax highlighting tools for example. Otherwise you'll end up doing complex interpretation and the lexer will not be that efficient. Litteral interpretation can be done in a second step. Do you think doing litteral interpretation separately when you need it would be less efficient?The tokens are not kept, correct. But the identifier strings, and the string literals, are kept, and if they are slices into the input buffer, then everything I said applies.String literals often _can't_ be slices unless you leave them in their original state rather than giving the version that they translate to (e.g. leaving \© in the string rather than replacing it with its actual, unicode value). And since you're not going to be able to create the literal using whatever type the range is unless it's a string of some variety, that means that the literals often can't be slices, which - depending on the implementation - would make it so that that they can't _ever_ be slices. Identifiers are a different story, since they don't have to be translated at all, but regardless of whether keeping a slice would be better than creating a new string, the identifier table will be far superior, since then you only need one copy of each identifier. So, it ultimately doesn't make sense to use slices in either case even without considering issues like them being spread across memory. The only place that I'd expect a slice in a token is in the string which represents the text which was lexed, and that won't normally be kept around. - Jonathan M Davis
Aug 04 2012
Dmitry Olshansky , dans le message (digitalmars.D:174214), a écrit :Most likely - since you re-read the same memory twice to do it.You're probably right, but if you do this right after the token is generated, the memory should still be near the processor. And the operation on the first read should be very basic: just check nothing illegal appears, and check for the end of the token. The cost is not negligible, but what you do with literal tokens can vary a lot, and what the lexer will propose may not be what the user wants, so the user may suffer the cost of the literal decoding (including allocation of the decoded string, the copy of the characters, etc.) that he doesn't want, or will have to redo his own way... -- Christophe
Aug 04 2012
On 04-Aug-12 14:55, Christophe Travert wrote:Dmitry Olshansky , dans le message (digitalmars.D:174214), a écrit :q{ .. } "\x13\x27 ...\u1212" In most cases it takes around the same time to check correctness and output it as simply pass it by. (see also re-decoding unicode in identifiers, though that's rare to see unicode chars in identifier)Most likely - since you re-read the same memory twice to do it.You're probably right, but if you do this right after the token is generated, the memory should still be near the processor. And the operation on the first read should be very basic: just check nothing illegal appears, and check for the end of the token.The cost is not negligible, but what you do with litteral tokens can vary much, and what the lexer will propose may not be what the user want, so the user may suffer the cost of the litteral decoding (including allocation of the decoded string, the copy of the caracters, etc), that he doesn't want, or will have to re-do his own way...I see it as a compile-time policy, that will fit nicely and solve both issues. Just provide a templates with a few hooks, and add a Noop policy that does nothing. -- Dmitry Olshansky
Aug 04 2012
On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:I see it as a compile-time policy, that will fit nicely and solve both issues. Just provide a templates with a few hooks, and add a Noop policy that does nothing.It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer... - Jonathan M Davis
Aug 04 2012
On Saturday, 4 August 2012 at 11:58:09 UTC, Jonathan M Davis wrote:On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:If we have a really fast lexer that is highly compile-time configureable and has a readable codebase, then this would be a really good showcase for D.I see it as a compile-time policy, that will fit nicely and solve both issues. Just provide a templates with a few hooks, and add a Noop policy that does nothing.It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer... - Jonathan M Davis
Aug 04 2012
On 04-Aug-12 15:48, Jonathan M Davis wrote:On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:Let's add some meat to my post. I've seen it mostly as follows: //user defines mixin template that is mixed in inside the lexer template MyConfig() { enum identifierTable = true; // means there would be calls to table.insert on each identifier enum countLines = true; //adds line, column properties to the lexer/Tokens //statically bound callbacks, inside one can use say: // skip() - to skip a char (popFront) // get() - to read next char (via popFront, front) // line, col - as readonly properties // (skip & get do the counting if enabled) bool onError() { skip(); //the most dumb recovery, just skip a char return true; //go on with tokenizing, false - stop prematurely } ... } usage: { auto my_supa_table = ...; //some kind of container (should be a set of strings and support .insert("blah"); ) auto dlex = Lexer!(MyConfig)(table); auto all_tokens = array(dlex(joiner(stdin.byChunk(4096)))); //or if we had no interest in the table but only tokens: auto noop = Lexer!(NoopLex)(); ... } -- Dmitry OlshanskyI see it as a compile-time policy, that will fit nicely and solve both issues. Just provide a template with a few hooks, and add a Noop policy that does nothing.It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...
Aug 04 2012
On Saturday, August 04, 2012 17:45:58 Dmitry Olshansky wrote:On 04-Aug-12 15:48, Jonathan M Davis wrote:[snip] It would probably be a bit more user friendly to pass a struct as a template argument (which you can't do in the normal sense, but you can pass it as an alias). Regardless, the problem isn't so much how to provide a configuration as how the configuration options affect the lexer and what it does. I'll figure it out, but it does definitely complicate things. I wasn't originally planning on having anything be configurable. But if we want it to be such that no one will want to write another one (as Walter is looking for), then it's going to need to be configurable enough to make it efficient for all of the common lexing scenarios rather than efficient for one particular scenario. - Jonathan M DavisOn Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:Let's add some meat to my post. I've seen it mostly as follows:I see it as a compile-time policy, that will fit nicely and solve both issues. Just provide a template with a few hooks, and add a Noop policy that does nothing.It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...
Aug 05 2012
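A rough sketch of what passing a configuration struct as an alias template argument could look like (LexerConfig and its fields are made up here for illustration):

    struct LexerConfig
    {
        bool buildIdentifierTable = true;
        bool countLines = true;
        bool tokenizeComments = false;
    }

    struct Lexer(alias config)
        if (is(typeof(config) == LexerConfig))
    {
        static if (config.countLines)
        {
            size_t line = 1;
            size_t column = 1;
        }
        // ... the rest of the lexer uses static if on the other flags ...
    }

    enum myConfig = LexerConfig(false, true, true);

    void main()
    {
        auto lex = Lexer!myConfig();
        static assert(is(typeof(lex.line))); // countLines enabled, so the fields exist
    }

Since the configuration is a compile-time constant, every disabled feature disappears from the generated code entirely.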
On 04/08/2012 15:45, Dmitry Olshansky wrote:On 04-Aug-12 15:48, Jonathan M Davis wrote:It seems way too much. The most complex thing that is needed is the policy to allocate identifiers in tokens. It can be made by passing a function that takes a string as parameter and returns a string. The default one would be an identity function. The second parameter is a bool to tokenize comments or not. Is that enough? The onError looks like a typical use case for conditions as explained in the huge thread on Exception.On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:Let's add some meat to my post. I've seen it mostly as follows: //user defines mixin template that is mixed in inside the lexer template MyConfig() { enum identifierTable = true; // means there would be calls to table.insert on each identifier enum countLines = true; //adds line, column properties to the lexer/Tokens //statically bound callbacks, inside one can use say: // skip() - to skip a char (popFront) // get() - to read next char (via popFront, front) // line, col - as readonly properties // (skip & get do the counting if enabled) bool onError() { skip(); //the most dumb recovery, just skip a char return true; //go on with tokenizing, false - stop prematurely } ... } usage: { auto my_supa_table = ...; //some kind of container (should be a set of strings and support .insert("blah"); ) auto dlex = Lexer!(MyConfig)(table); auto all_tokens = array(dlex(joiner(stdin.byChunk(4096)))); //or if we had no interest in the table but only tokens: auto noop = Lexer!(NoopLex)(); ... }I see it as a compile-time policy, that will fit nicely and solve both issues. Just provide a template with a few hooks, and add a Noop policy that does nothing.It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...
Aug 06 2012
On Mon, Aug 6, 2012 at 8:03 PM, deadalnix <deadalnix gmail.com> wrote:The most complex thing that is needed is the policy to allocate identifiers in tokens. It can be made by passing a function that takes a string as parameter and returns a string. The default one would be an identity function.I think one should pass it an empty symbol table and the lexer should fill it and associate each identifier with a unique ID, an ID which would appear in the Identifier token.The second parameter is a bool to tokenize comments or not. Is that enough? The onError looks like a typical use case for conditions as explained in the huge thread on Exception.Yes, well we don't have a condition system. And using exceptions during lexing would most probably kill its efficiency. Errors in lexing are not uncommon. The usual D idiom of having an enum StopOnError { no, yes } should be enough.
Aug 06 2012
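A minimal sketch of that suggestion - a table the lexer fills, handing out one unique ID per distinct identifier (the names and layout are assumptions, not a proposed API):

    struct SymbolTable
    {
        private uint[string] ids;   // identifier -> ID
        private string[] names;     // ID -> identifier

        uint intern(string ident)
        {
            if (auto p = ident in ids)
                return *p;
            auto id = cast(uint) names.length;
            names ~= ident;         // one stored copy per distinct identifier
            ids[ident] = id;
            return id;
        }

        string lookup(uint id) const { return names[id]; }
    }

    unittest
    {
        SymbolTable table;
        assert(table.intern("foo") == 0);
        assert(table.intern("bar") == 1);
        assert(table.intern("foo") == 0); // same identifier, same ID
        assert(table.lookup(1) == "bar");
    }

An Identifier token then only needs to carry the uint, which keeps tokens small and makes identifier comparison a plain integer compare.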
On 2012-08-06 21:00, Philippe Sigaud wrote:Yes, well we don't have a condition system. And using exceptions during lexing would most probably kill its efficiency. Errors in lexing are not uncommon. The usual D idiom of having an enum StopOnError { no, yes } should be enough.Especially if the lexer is used in an IDE. The code is constantly changing and will be invalid quite often. -- /Jacob Carlborg
Aug 06 2012
On 8/6/2012 12:00 PM, Philippe Sigaud wrote:Yes, well we don't have a condition system. And using exceptions during lexing would most probably kill its efficiency. Errors in lexing are not uncommon. The usual D idiom of having an enum StopOnError { no, yes } should be enough.That's why I suggested supplying a callback delegate to decide what to do with errors (ignore, throw exception, or quit) and have the delegate itself do that. That way, there is no customization of the Lexer required.
Aug 06 2012
Walter Bright, in the message (digitalmars.D:174360), wrote:On 8/6/2012 12:00 PM, Philippe Sigaud wrote:It may be easier to take into account a few cases (returning an error token and throwing is enough, so that is a basic static if), than to define a way to integrate a delegate (what should be the delegate's signature, what value to return to query for stopping, how to provide ways to recover, etc.).Yes, well we don't have a condition system. And using exceptions during lexing would most probably kill its efficiency. Errors in lexing are not uncommon. The usual D idiom of having an enum StopOnError { no, yes } should be enough.That's why I suggested supplying a callback delegate to decide what to do with errors (ignore, throw exception, or quit) and have the delegate itself do that. That way, there is no customization of the Lexer required.
Aug 07 2012
On Tuesday, August 07, 2012 08:00:24 Christophe Travert wrote:Walter Bright, in the message (digitalmars.D:174360), wrote:That's why I suggested supplying a callback delegate to decide what to do with errors (ignore, throw exception, or quit) and have the delegate itself do that. That way, there is no customization of the Lexer required.It may be easier to take into account a few cases (returning an error token and throwing is enough, so that is a basic static if), than to define a way to integrate a delegate (what should be the delegate's signature, what value to return to query for stopping, how to provide ways to recover, etc.).For the moment at least, I'm doing this bool delegate(string errorMsg, SourcePos pos) errorHandler; where SourcePos is a struct which holds the line and col of the place in the source code where the bad token starts. If it returns true, the token is skipped. If it returns false, an exception is thrown - and of course the delegate can throw its own exception if it wants to. But you can also configure the lexer to return an error token instead of using the delegate if that's what you prefer. But Walter is right in that if you have to check every token for whether it's an error, that will incur overhead. So, depending on your use case, that could be unacceptable. - Jonathan M Davis
Aug 07 2012
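For illustration, a usage sketch of the delegate described above; SourcePos and the way the lexer would invoke the handler are assumptions about the work in progress, not its final API:

    struct SourcePos
    {
        size_t line;
        size_t col;
    }

    void lexWithRecovery()
    {
        import std.stdio : stderr;

        bool delegate(string errorMsg, SourcePos pos) errorHandler =
            (errorMsg, pos)
            {
                // Log the problem and let the lexer skip the bad token.
                stderr.writefln("lex error at %s:%s: %s", pos.line, pos.col, errorMsg);
                return true; // returning false would make the lexer throw instead
            };

        // Inside the lexer, a bad token would then boil down to something like:
        // if (!errorHandler("invalid character", currentPos))
        //     throw new LexException(...);
    }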
On 8/7/2012 1:14 AM, Jonathan M Davis wrote:But you can also configure the lexer to return an error token instead of using the delegate if that's what you prefer. But Walter is right in that if you have to check every token for whether it's an error, that will incur overhead. So, depending on your use case, that could be unacceptable.It's not just overhead - it's just plain ugly to constantly check for error tokens. It's also tedious and error prone to insert those checks. I don't see any advantage to it.
Aug 07 2012
On Tuesday, August 07, 2012 02:54:42 Walter Bright wrote:On 8/7/2012 1:14 AM, Jonathan M Davis wrote:It's easier to see where in the range of tokens the errors occur. A delegate is disconnected from the point where the range is being consumed, whereas if tokens are used for errors, then the function consuming the range can see exactly where in the range of tokens the error is (and potentially handle it differently based on that information). Regardless, I was asked to keep that option in there by at least one person (Philippe Sigaud IIRC), which is why I didn't just switch over to the delegate entirely. - Jonathan M DavisBut you can also configure the lexer to return an error token instead of using the delegate if that's what you prefer. But Walter is right in that if you have to check every token for whether it's an error, that will incur overhead. So, depending on your use case, that could be unacceptable.It's not just overhead - it's just plain ugly to constantly check for error tokens. It's also tedious and error prone to insert those checks. I don't see any advantage to it.
Aug 07 2012
On 2012-08-07 12:06, Jonathan M Davis wrote:It's easier to see where in the range of tokens the errors occur. A delegate is disconnected from the point where the range is being consumed, whereas if tokens are used for errors, then the function consuming the range can see exactly where in the range of tokens the error is (and potentially handle it differently based on that information).Just pass the same token to the delegate that you would have returned otherwise? -- /Jacob Carlborg
Aug 07 2012
Jacob Carlborg, in the message (digitalmars.D:174421), wrote:On 2012-08-07 12:06, Jonathan M Davis wrote:That's what I would do. If you have to define a way to return error information as a token, better to reuse it when using delegates. Personally, I would have the delegate be: int delegate(Token); A return value of 0 means: continue parsing. Any other value is an error number and stops the parser (makes it empty). The error number can be retrieved from the empty parser with a specific function. If you want to throw, just throw in the delegate. No need to return a specific value for that. But a bool return value may be enough too...It's easier to see where in the range of tokens the errors occur. A delegate is disconnected from the point where the range is being consumed, whereas if tokens are used for errors, then the function consuming the range can see exactly where in the range of tokens the error is (and potentially handle it differently based on that information).Just pass the same token to the delegate that you would have returned otherwise?
Aug 07 2012
On 8/7/2012 3:06 AM, Jonathan M Davis wrote:It's easier to see where in the range of tokens the errors occur. A delegate is disconnected from the point where the range is being consumed, whereas if tokens are used for errors, then the function consuming the range can see exactly where in the range of tokens the error is (and potentially handle it differently based on that information).The delegate has a context pointer giving it a reference to whatever context the code calling the Lexer needs.
Aug 07 2012
Walter Bright, in the message (digitalmars.D:174394), wrote:On 8/7/2012 1:14 AM, Jonathan M Davis wrote:It's not necessarily ugly, because of the powerful range design. You can branch the lexer to a range adapter that just ignores error tokens, or throws when it meets an error token. For example, just use: auto tokens = data.lexer.throwOnErrorToken; I don't think this is more ugly than: auto tokens = data.lexer!(complex signature) { throw LexException; }; But yes, there is overhead, so I understand returning error tokens is not satisfactory for everyone.But you can also configure the lexer to return an error token instead of using the delegate if that's what you prefer. But Walter is right in that if you have to check every token for whether it's an error, that will incur overhead. So, depending on your use case, that could be unacceptable.It's not just overhead - it's just plain ugly to constantly check for error tokens. It's also tedious and error prone to insert those checks.I don't see any advantage to it.Storing the error somewhere can be of use. For example, you may want to lex a whole file into an array of tokens, and then deal with your errors as you analyse the array of tokens. Of course, you can always make a delegate to store the error somewhere, but it is easier if this somewhere is in your token pile. What I don't see any advantage in is using a delegate that can only return or throw. A policy does the job: auto tokens = data.lexer!ExceptionPolicy.throwException; That's clean too. If you want the delegate to be of any use, then it must have data to process. That's why I said we have to worry about the signature of the delegate. -- Christophe
Aug 07 2012
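As a sketch of that adapter idea (Token, TokenType, and LexException here are stand-ins invented for the example, not the module's actual types):

    import std.range : ElementType, isInputRange;

    enum TokenType { identifier, literal, error /* ... */ }

    struct Token
    {
        TokenType type;
        string text;
    }

    class LexException : Exception
    {
        this(string msg) { super(msg); }
    }

    auto throwOnErrorToken(R)(R tokens)
        if (isInputRange!R && is(ElementType!R == Token))
    {
        static struct Result
        {
            R source;
            @property bool empty() { return source.empty; }
            @property Token front()
            {
                auto t = source.front;
                if (t.type == TokenType.error)
                    throw new LexException("lex error: " ~ t.text);
                return t;
            }
            void popFront() { source.popFront(); }
        }
        return Result(tokens);
    }

With UFCS this reads exactly like the line in the post, auto tokens = data.lexer.throwOnErrorToken; the cost is one comparison per front call, which is the overhead being debated.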
On Tue, Aug 7, 2012 at 12:06 PM, Jonathan M Davis <jmdavisProg gmx.com> wrote:Regardless, I was asked to keep that option in there by at least one person (Philippe Sigaud IIRC), which is why I didn't just switch over to the delegate entirely.IIRC, I was not the only one, as people here interested in coding an IDE asked for it too. A lexer is useful for more than 'just' parsing D afterwards: an IDE could easily color tokens according to their type, and an error token is just what is needed to highlight errors. Also, what I proposed was a *static* decision: an enum SkipErrors { no, yes }. With a static if inside its guts, the lexer could change its behavior accordingly. Make SkipErrors.yes the default and Walter gets his speed. It's just that an IDE or another parser could use auto lex = std.lexer.Lexer!(SkipErrors.no)(input); Walter, with all due respect, you sometimes give the impression of forgetting we are talking about D and going back to deeply entrenched C-isms. Compile-time decisions can be used to avoid any overhead as long as you have a clear idea of what the two code paths should look like. And, as Christophe said, ranges are a powerful API. In another thread Simen and I did some comparisons between C-like code and code using only ranges upon ranges upon ranges. A (limited!) difference in speed appeared only for very long calculations.
Aug 07 2012
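The static decision could look roughly like this (only a sketch; the enum and member names follow the post above, everything else is invented):

    enum SkipErrors { no, yes }

    struct Lexer(SkipErrors skipErrors = SkipErrors.yes)
    {
        void onBadToken(string msg)
        {
            static if (skipErrors == SkipErrors.yes)
            {
                // swallow the error and resynchronize on the next valid char
            }
            else
            {
                throw new Exception(msg); // or emit an error token instead
            }
        }
    }

    // usage, as in the post:
    // auto lex = Lexer!(SkipErrors.no)(input);

Because the branch is resolved at compile time, the default instantiation contains no error-token code at all.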
On 8/7/2012 7:15 AM, Philippe Sigaud wrote:Also, what I proposed was a *static* decision: with SkipErrors { no, yes }. With a static if inside its guts, the lexer could change its behavior accordingly.Yes, I understand about static if decisions :-) hell I invented them!Walter, with all due respect, you sometimes give the impression to forget we are talking about D and go back to deeply entrenched C-isms.Delegates are not C-isms.Compile-time decisions can be used to avoid any overhead as long as you have a clear idea of what the two code paths should look like.Yes, I understand that. There's also a point about adding too much complexity to the interface. The delegate callback reduces complexity in the interface.And, as Christophe said, ranges are a powerful API. In another thread Simen and me did some comparison between C-like code and code using only ranges upon ranges upon ranges. A (limited!) difference in speed appeared only for very long calculations.That's good, and you really don't need to sell me on ranges - I'm already sold.
Aug 07 2012
On Tue, Aug 7, 2012 at 9:38 PM, Walter Bright <newshound2 digitalmars.com> wrote:Yes, I understand about static if decisions :-) hell I invented them!And what a wonderful decision that was!Yes, I understand that. There's also a point about adding too much complexity to the interface. The delegate callback reduces complexity in the interface.OK, then let's let Jonathan work, and we will see how it goes.Well, you gave the impression a bit upstream in this thread that having to filter a token range to eliminate errors was an atrocity (millions of tokens!). As far as I'm concerned, the recent good news was to (re?)discover that complex calls of ranges upon ranges could still be calculated by CTFE. That's really neat.And, as Christophe said, ranges are a powerful API. In another thread Simen and I did some comparisons between C-like code and code using only ranges upon ranges upon ranges. A (limited!) difference in speed appeared only for very long calculations.That's good, and you really don't need to sell me on ranges - I'm already sold.
Aug 07 2012
On Tuesday, August 07, 2012 12:38:26 Walter Bright wrote:Yes, I understand that. There's also a point about adding too much complexity to the interface. The delegate callback reduces complexity in the interface.It doesn't really affect much to allow choosing between returning a token and using a delegate, especially if ignoring errors is treated as a separate option rather than simply using a delegate that skips them (which may or may not be beneficial - it's faster without the delegate, but it's actually kind of hard to get lexing errors). What worries me more is stuff like providing a way to have the range calculate the current position itself (as Christophe suggested IIRC) or having it provide an efficient way to determine the number of code units between two ranges so that you can slice the range lexed to put in the Token. Determining the number of code units is easily done with ptr for strings, but for everything else, you generally have to count as code units are consumed, which isn't really an issue for small tokens (especially those like symbols where the length is known without counting) but does add up for arbitrarily long ones such as comments or string literals. So, providing a way to calculate it more efficiently where possible might be desirable, but it's yet another layer of complication, and I don't know that it's actually possible to provide such a function in enough situations for it to be worth providing that functionality. I expect that the configuration stuff is going to have to be adjusted after I'm done, since I'm not sure that it's entirely clear what's worth configuring or not. - Jonathan M Davis
Aug 07 2012
On 8/7/2012 2:14 PM, Jonathan M Davis wrote:I expect that the configuration stuff is going to have to be adjusted after I'm done, since I'm not sure that it's entirely clear what's worth configuring or not."When in doubt, leave it out." If experience later shows it is really needed, it is easier to add it in than to have a blizzard of useless configuration options that cannot be removed.
Aug 07 2012
On Tuesday, August 07, 2012 14:30:51 Walter Bright wrote:On 8/7/2012 2:14 PM, Jonathan M Davis wrote:On the whole, I agree, and I'm definitely leaving some of that stuff out at this point. A simplified version of what I have at the moment is struct LexConfig { /+ configuration options +/ } struct Lexer(alias config) if(is(typeof(config) == LexConfig)) { auto lex(R)(R range, SourcePos pos = SourcePos.init) {} } The main problem that I have is that if I end up with any delegates which need to take the range being lexed, then either I can't put them in the Lexer like I am with the error handler delegate (forcing you to pass it to the lex function every time that you lex a new range), or I have to templatize Lexer on the range type, which is also undesirable. If it turns out later that we want to add such delegates and we didn't templatize Lexer on the range type, then we'll be forced to make it so that they're arguments to the lex function. But templatizing the Lexer on the range type is ugly enough that it's not really something that I want to do if I can reasonably avoid it. Neither solution appeals to me. If we don't ever need to add such delegates, then it doesn't really matter, but I'd prefer to have a solution which is appropriately flexible if we _do_ need to add that sort of thing later. I guess that I'll have to keep thinking about it to see if I can come up with a reasonable way to do it without hampering the API by either making it ugly through unneeded flexibility or by making it too simple to be able to reasonably add more functionality later. - Jonathan M DavisI expect that the configuration stuff is going to have to be adjusted after I'm done, since I'm not sure that it's entirely clear what's worth configuring or not."When in doubt, leave it out." If experience later shows it is really needed, it is easier to add it in than to have a blizzard of useless configuration options that cannot be removed.
Aug 07 2012
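To illustrate why such delegates are awkward with the simplified API above: the token range that lex() returns is templated on R, so anything that needs to see the input range type has to live inside lex() (or be passed to it) rather than in the Lexer struct itself. A hypothetical sketch:

    struct SourcePos { size_t line = 1; size_t col = 1; }

    struct LexConfig { bool tokenizeComments = false; }

    struct Lexer(alias config)
        if (is(typeof(config) == LexConfig))
    {
        auto lex(R)(R range, SourcePos pos = SourcePos.init)
        {
            // Any delegate that needs to inspect R would have to be a
            // parameter of lex(), since the Lexer itself never knows R.
            static struct TokenRange
            {
                R source;
                SourcePos pos;
                // empty/front/popFront of the actual lexer would go here
            }
            return TokenRange(range, pos);
        }
    }

    enum cfg = LexConfig(true);

    void example(string source)
    {
        auto lexer = Lexer!cfg();
        auto tokens = lexer.lex(source);
    }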
On 8/7/2012 1:00 AM, Christophe Travert wrote:If the delegate returns, then the lexer recovers. The delegate is passed the error message and the location. I don't see that it is more complex than that.That's why I suggested supplying a callback delegate to decide what to do with errors (ignore, throw exception, or quit) and have the delegate itself do that. That way, there is no customization of the Lexer required.It may be easier to take into account a few cases (returning an error token and throwing is enough, so that is a basic static if), than to define a way to integrate a delegate (what should be the delegate's signature, what value to return to query for stopping, how to provide ways to recover, etc.).
Aug 07 2012
Walter Bright, in the message (digitalmars.D:174393), wrote:If the delegate returns, then the lexer recovers.That's an option, if there is only one way to recover (which is a reasonable assumption). You wanted the delegate to "decide what to do with the errors (ignore, throw exception, or quit)". Throwing is handled, but not ignore/quit. Jonathan's solution (delegate returning a bool) is good. It could also be a delegate returning an int, 0 meaning continue, and any other value being an error code that can be retrieved later. It could also be a number of characters to skip (0 meaning break).
Aug 07 2012
On 06-Aug-12 22:03, deadalnix wrote:On 04/08/2012 15:45, Dmitry Olshansky wrote:An editor that highlights text may choose not to build an identifier table at all. One may see it as a safe mode (low resource mode) for more advanced IDEs.On 04-Aug-12 15:48, Jonathan M Davis wrote:It seems way too much. The most complex thing that is needed is the policy to allocate identifiers in tokens.On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:Let's add some meat to my post. I've seen it mostly as follows: //user defines mixin template that is mixed in inside the lexer template MyConfig() { enum identifierTable = true; // means there would be calls to table.insert on each identifier enum countLines = true; //adds line, column properties to the lexer/Tokens //statically bound callbacks, inside one can use say: // skip() - to skip a char (popFront) // get() - to read next char (via popFront, front) // line, col - as readonly properties // (skip & get do the counting if enabled) bool onError() { skip(); //the most dumb recovery, just skip a char return true; //go on with tokenizing, false - stop prematurely } ... } usage: { auto my_supa_table = ...; //some kind of container (should be a set of strings and support .insert("blah"); ) auto dlex = Lexer!(MyConfig)(table); auto all_tokens = array(dlex(joiner(stdin.byChunk(4096)))); //or if we had no interest in the table but only tokens: auto noop = Lexer!(NoopLex)(); ... }I see it as a compile-time policy, that will fit nicely and solve both issues. Just provide a template with a few hooks, and add a Noop policy that does nothing.It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...The second parameter is a bool to tokenize comments or not. Is that enough?No. And making comments special Tokens is frankly a bad idea. See Walter's comments in this thread. Also, e.g. for the compiler only DDoc comments are ever useful, not so for an IDE. Filtering them out later is inefficient, as it would be far better not to create them in the first place.The onError looks like a typical use case for conditions as explained in the huge thread on Exception.Hmm, I lost track of that discussion. Either way I see a statically bound function as a good enough hook into the process, as it can do anything useful: skip wrong chars, throw an exception, stop parsing prematurely, whatever - pick your poison. -- Dmitry Olshansky
Aug 06 2012
On 2012-08-06 22:26, Dmitry Olshansky wrote:No. And making comments special Tokens is frankly a bad idea. See Walter's comments in this thread. Also, e.g. for the compiler only DDoc comments are ever useful, not so for an IDE. Filtering them out later is inefficient, as it would be far better not to create them in the first place.The Eclipse plugin Descent can show formatted DDoc comments. -- /Jacob Carlborg
Aug 06 2012
On 07-Aug-12 01:48, Jacob Carlborg wrote:On 2012-08-06 22:26, Dmitry Olshansky wrote:I meant that an IDE may be interested in more than just DDoc, at least to highlight things properly. -- Dmitry OlshanskyNo. And making comments special Tokens is frankly a bad idea. See Walter's comments in this thread. Also, e.g. for the compiler only DDoc comments are ever useful, not so for an IDE. Filtering them out later is inefficient, as it would be far better not to create them in the first place.The Eclipse plugin Descent can show formatted DDoc comments.
Aug 06 2012
Jonathan M Davis, in the message (digitalmars.D:174223), wrote:On Saturday, August 04, 2012 15:32:22 Dmitry Olshansky wrote:Yes, I figured out a policy could be used too, but since the beginning of the thread, that makes a lot of things to configure! Jonathan would have trouble trying to make them all. Choices have to be made. That's why I proposed using an adapter range to do the buffering instead of slicing, and to build the lookup table. Done correctly, it can make the core of the lexer implementation clean without losing efficiency (I hope). If this policy for parsing literals is the only thing that remains to be configured directly in the core of the lexer with static if, then it's reasonable.I see it as a compile-time policy, that will fit nicely and solve both issues. Just provide a template with a few hooks, and add a Noop policy that does nothing.It's starting to look like figuring out what should and shouldn't be configurable and how to handle it is going to be the largest problem in the lexer...
Aug 04 2012
http://i.imgur.com/oSXTc.png Posted without comment.
Aug 02 2012
On 8/2/2012 12:09 AM, Bernard Helyer wrote:http://i.imgur.com/oSXTc.png Posted without comment.I'm afraid I'm mystified as to your point.
Aug 02 2012
On Thursday, 2 August 2012 at 07:32:59 UTC, Walter Bright wrote:On 8/2/2012 12:09 AM, Bernard Helyer wrote:Just that I'm slaving away over a hot IDE here. :Phttp://i.imgur.com/oSXTc.png Posted without comment.I'm afraid I'm mystified as to your point.
Aug 02 2012
On 8/2/2012 12:34 AM, Bernard Helyer wrote:Just that I'm slaving away over a hot IDE here. :PAh! Well, keep on keepin' on, then!
Aug 02 2012
On 08/02/2012 03:09 AM, Bernard Helyer wrote:http://i.imgur.com/oSXTc.png Posted without comment.Hell yeah Alexander Brandon.
Aug 04 2012
10. High speed matters a lotThen add a benchmark "suite" to the list - the lexer should be benchmarked from the very beginning, and it should be designed for multithreading - there is no need for on-the-fly hash-table updating - maybe just one update at the end of each lex thread
Aug 02 2012
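A trivial benchmark harness along those lines might look like this; lexSource is only a placeholder for whatever the real lexer's entry point ends up being:

    import std.datetime.stopwatch : AutoStart, StopWatch;
    import std.file : SpanMode, dirEntries, readText;
    import std.stdio : writefln;

    size_t lexSource(string source)
    {
        // placeholder: run the lexer over `source` and return the token count
        return 0;
    }

    void benchmark(string dir)
    {
        size_t total;
        auto sw = StopWatch(AutoStart.yes);
        foreach (entry; dirEntries(dir, "*.d", SpanMode.depth))
            total += lexSource(readText(entry.name));
        sw.stop();
        writefln("lexed %s tokens in %s msecs", total, sw.peek.total!"msecs");
    }

A multithreaded variant could lex the files in parallel (e.g. with std.parallelism) and merge each thread's identifier table once at the end, as suggested.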
Walter Bright wrote:1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)Why is it a mistake? I think the Lexer should parse any UTF range and return compatible token strings. That is, it should provide strings for UTF8 input, wstrings for UTF16 input and so on.
Aug 02 2012
On 8/2/2012 2:27 AM, Piotr Szturmaj wrote:Walter Bright wrote:Because the lexer is large and it would have to have a lot of special case code inserted here and there to make that work.1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)Why it is a mistake?I think Lexer should parse any UTF range and return compatible token's strings. That is it should provide strings for UTF8 input, wstrings for UTF16 input and so on.Why? I've never seen any UTF16 or UTF32 D source in the wild. Besides, if it is not templated then it doesn't need to be recompiled by every user of it - it can exist as object code in the library.
Aug 02 2012
On 8/2/12 6:07 AM, Walter Bright wrote:Why? I've never seen any UTF16 or UTF32 D source in the wild.Here's a crazy idea that I'll hang to this one remark. No, two crazy ideas. First, after having read the large back-and-forth Jonathan/Walter in one sitting, it's becoming obvious to me you'll never understand each other on this nontrivial matter through this medium. I suggest you set up a skype/phone call. Once you get past the first 30 seconds of social awkwardness of hearing each other's voice, you'll make fantastic progress in communicating. Regarding the problem at hand, it's becoming painfully obvious to me that the lexer MUST do its own decoding internally. Hence, a very simple thing to do is have the entire lexer only deal with ranges of ubyte. If someone passes a char[], the lexer's front end can simply call s.representation and obtain the underlying ubyte[]. If someone passes some range of char, the lexer uses an adapter (e.g. map()) that casts every char to ubyte, which is a zero-cost operation. Then it uses the same core operating on ranges of ubyte. In the first implementation, the lexer may actually refuse any range of 16-bit or 32-bit elements (wchar[], range of wchar, dchar[], range of dchar). Later on the core may be evolved to handle range of ushort and range of dchar. The front-end would use again representation() against wchar[], cast with range of wchar, and would just pass the dchar[] and range of dchar around. This makes the core simple and efficient (I think Jonathan's use of static if and mixins everywhere, while well-intended, complicates matters without necessity). And as such we have a lexer! Which operates with ranges, just has a simple front-end clarifying that the lexer must do its own decoding. Works? Andrei
Aug 02 2012
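A rough sketch of that front end; asCodeUnits is a made-up name, and lexCore stands in for the ubyte-based core described above:

    import std.algorithm : map;
    import std.range : ElementType, isInputRange;
    import std.string : representation;

    auto asCodeUnits(R)(R input)
    {
        static if (is(R : const(char)[]))
            return input.representation;            // char[]/string -> ubyte[]
        else static if (isInputRange!R && is(ElementType!R : char))
            return input.map!(c => cast(ubyte) c);  // range of char, zero-cost cast
        else
            static assert(0, "this sketch only handles UTF-8 input");
    }

    // The core is then written once, against ranges of ubyte:
    // auto tokens = lexCore(asCodeUnits(source));

The 16-bit and 32-bit cases would later get their own branches (representation for wchar[], a cast for ranges of wchar, pass-through for dchar), exactly as outlined in the post.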
On 2012-08-02 12:28:03 +0000, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:Regarding the problem at hand, it's becoming painfully obvious to me that the lexer MUST do its own decoding internally.That's not a great surprise to me. I hit the same issues when writing my XML parser, which is why I invented functions called frontUnit and popFrontUnit. I'm glad you're realizing this.Hence, a very simple thing to do is have the entire lexer only deal with ranges of ubyte. If someone passes a char[], the lexer's front end can simply call s.representation and obtain the underlying ubyte[].That's ugly, but it could work (assuming s.representation returns the casted range by ref). I still prefer my frontUnit and popFrontUnit approach though. In fact, any parser for which speed is important will have to bypass std.range's clever handling of UTF characters. Dealing simply with ubytes isn't enough, since in some cases you'll want to fire up the UTF decoder. The next issue, which I haven't seen discussed here, is that for a parser to be efficient it should operate on buffers. You can make it work with arbitrary ranges, but if you don't have a buffer you can slice when you need to preserve a string, you're going to have to build the string character by character, which is not efficient at all. But then you can only really return slices if the underlying representation is the same as the output representation, and unless your API has a templated output type, you're going to special case a lot of things. After having attempted an XML parser with ranges, I'm not sure parsing using generic ranges can be made very efficient. Automatic conversion to UTF-32 is a nuisance for performance, and if the output needs to return parts of the input, you'll need to create an inefficient special case just to allocate many new strings in the correct format. I wonder how your call with Walter will turn out. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca/
Aug 02 2012
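For readers unfamiliar with the idea, a bare-bones sketch of such a wrapper (StringSource is an invented name, not Michel's actual implementation): it exposes raw code units alongside the usual decoded range interface, so the lexer decodes only when it has to.

    struct StringSource
    {
        string data;
        size_t index;

        // code-unit interface: no UTF decoding at all
        @property char frontUnit() const { return data[index]; }
        void popFrontUnit() { ++index; }

        // decoded interface, used only when a whole code point is needed
        @property dchar front() const
        {
            import std.utf : decode;
            size_t i = index;
            return decode(data, i);
        }
        void popFront()
        {
            import std.utf : stride;
            index += stride(data, index);
        }
        @property bool empty() const { return index >= data.length; }
    }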
On 8/2/12 2:17 PM, Michel Fortin wrote:On 2012-08-02 12:28:03 +0000, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I agree frontUnit and popFrontUnit are more generic because they allow other ranges to define them.Hence, a very simple thing to do is have the entire lexer only deal with ranges of ubyte. If someone passes a char[], the lexer's front end can simply call s.representation and obtain the underlying ubyte[].That's ugly, but it could work (assuming s.representation returns the casted range by ref). I still prefer my frontUnit and popFrontUnit approach though.In fact, any parser for which speed is important will have to bypass std.range's clever handling of UTF characters. Dealing simply with ubytes isn't enough, since in some cases you'll want to fire up the UTF decoder. The next issue, which I haven's seen discussed here is that for a parser to be efficient it should operate on buffers. You can make it work with arbitrary ranges, but if you don't have a buffer you can slice when you need to preserve a string, you're going to have to build the string character by character, which is not efficient at all. But then you can only really return slices if the underlying representation is the same as the output representation, and unless your API has a templated output type, you're going to special case a lot of things.I think a BufferedRange could go a long way here.After having attempted an XML parser with ranges, I'm not sure parsing using generic ranges can be made very efficient. Automatic conversion to UTF-32 is a nuisance for performance, and if the output needs to return parts of the input, you'll need to create an inefficient special case just to allocate many new strings in the correct format.I'm not so sure, but I'll measure.I wonder how your call with Walter will turn out.What call? Thanks, Andrei
Aug 02 2012
On 2012-08-02 22:26, Andrei Alexandrescu wrote:On 8/2/12 2:17 PM, Michel Fortin wrote:The skype call you suggested: "First, after having read the large back-and-forth Jonathan/Walter in one sitting, it's becoming obvious to me you'll never understand each other on this nontrivial matter through this medium. I suggest you set up a skype/phone call. Once you get past the first 30 seconds of social awkwardness of hearing each other's voice, you'll make fantastic progress in communicating" -- /Jacob CarlborgI wonder how your call with Walter will turn out.What call?
Aug 02 2012
On 8/2/12 4:44 PM, Jacob Carlborg wrote:On 2012-08-02 22:26, Andrei Alexandrescu wrote:Oh, ok, I thought I'd be on the call. That would be between Jonathan and Walter. AndreiOn 8/2/12 2:17 PM, Michel Fortin wrote:The skype call you suggested: "First, after having read the large back-and-forth Jonathan/Walter in one sitting, it's becoming obvious to me you'll never understand each other on this nontrivial matter through this medium. I suggest you set up a skype/phone call. Once you get past the first 30 seconds of social awkwardness of hearing each other's voice, you'll make fantastic progress in communicatin"I wonder how your call with Walter will turn out.What call?
Aug 02 2012
Andrei Alexandrescu, in the message (digitalmars.D:174060), wrote:I agree frontUnit and popFrontUnit are more generic because they allow other ranges to define them.Any range of dchar could have a representation (or you may want to call it something else) that returns a range of char (or ubyte). And I think they are more generic because they use a generic API (i.e. ranges), which is very powerful: the representation can provide length, slicing, etc... which is different from the dchar length or whatever. You don't want to duplicate all range methods by postfixing Unit...You proposed that Jonathan call Walter in an earlier post. I believe there is a misunderstanding.I wonder how your call with Walter will turn out.What call?
Aug 02 2012
Michel Fortin wrote:The next issue, which I haven't seen discussed here, is that for a parser to be efficient it should operate on buffers. You can make it work with arbitrary ranges, but if you don't have a buffer you can slice when you need to preserve a string, you're going to have to build the string character by character, which is not efficient at all. But then you can only really return slices if the underlying representation is the same as the output representation, and unless your API has a templated output type, you're going to special case a lot of things.Instead of returning whole strings, you may provide another sub-range for the string itself. I wrote a JSON parser using this approach (https://github.com/pszturmaj/json-streaming-parser) and thanks to that it is possible to parse JSON without a single heap allocation. This could also be used in an XML parser.
Aug 02 2012
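In the same spirit, a toy sketch of a lazy literal range for a D lexer - it yields the decoded content of a string literal without allocating; only the \n escape is handled, purely for illustration:

    struct LiteralContent
    {
        string raw;   // slice of the source between the quotes

        @property bool empty() const { return raw.length == 0; }

        @property char front() const
        {
            if (raw.length > 1 && raw[0] == '\\' && raw[1] == 'n')
                return '\n';
            return raw[0];
        }

        void popFront()
        {
            if (raw.length > 1 && raw[0] == '\\' && raw[1] == 'n')
                raw = raw[2 .. $];
            else
                raw = raw[1 .. $];
        }
    }

    unittest
    {
        import std.algorithm : equal;
        assert(equal(LiteralContent(`a\nb`), "a\nb"));
    }

A consumer that really wants a string can still copy the range into one (e.g. with std.array.array), but that allocation now happens only on demand.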
On Friday, August 03, 2012 07:30:32 Philippe Sigaud wrote:Look, many of us here were interested in your idea of having comments and errors lexed as tokens. Could it be possible to just add two static policies errorBehavior { report, skip, ...} and comments { asToken, grouped }? That way, a user just say; // Walter auto toks = Lexer!(comments.grouped)(code); // May be the default value // Other: auto toks = Lexer!(comments.asToken)(code); It's just a small static if away...I'm sure it's possible. I just don't know how messy it will be. It depends on how the code is affected by needing to not treat comments and errors as tokens for what Walter wants. I'll see what I can do though. - Jonathan M Davis
Aug 02 2012
To help with performance comparisons I ripped dmd's lexer out and got it building as a few .d files. It's very crude. It's got tons of casts (more than the original c++ version). I attempted no cleanup or any other change than the minimum I could to get it to build and run. Obviously there's tons of room for cleanup, but that's not the point... it's just useful as a baseline. The branch: https://github.com/braddr/phobos/tree/dmd_lexer The commit with the changes: https://github.com/braddr/phobos/commit/040540ef3baa38997b15a56be3e9cd9c4bfa51ab On my desktop (far from idle, it's running 2 of the auto testers), it consistently takes 0.187s to lex all of the .d files in phobos. Later, Brad On 8/1/2012 5:10 PM, Walter Bright wrote:Given the various proposals for a lexer module for Phobos, I thought I'd share some characteristics it ought to have. First of all, it should be suitable for, at a minimum: 1. compilers 2. syntax highlighting editors 3. source code formatters 4. html creation To that end: 1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.) 2. It should output an input range of tokens 3. tokens should be values, not classes 4. It should avoid memory allocation as much as possible 5. It should read or write any mutable global state outside of its "Lexer" instance 6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier table 7. It should accept a callback delegate for errors. That delegate should decide whether to: 1. ignore the error (and "Lexer" will try to recover and continue) 2. print an error message (and "Lexer" will try to recover and continue) 3. throw an exception, "Lexer" is done with that input range 8. Lexer should be configurable as to whether it should collect information about comments and ddoc comments or not 9. Comments and ddoc comments should be attached to the next following token, they should not themselves be tokens 10. High speed matters a lot 11. Tokens should have begin/end line/column markers, though most of the time this can be implicitly determined 12. It should come with unittests that, using -cov, show 100% coverage Basically, I don't want anyone to be motivated to do a separate one after seeing this one.
Aug 05 2012
On 8/5/2012 12:59 AM, Brad Roberts wrote:To help with performance comparisons I ripped dmd's lexer out and got it building as a few .d files. It's very crude. It's got tons of casts (more than the original c++ version). I attempted no cleanup or any other change than the minimum I could to get it to build and run. Obviously there's tons of room for cleanup, but that's not the point... it's just useful as a baseline. The branch: https://github.com/braddr/phobos/tree/dmd_lexer The commit with the changes: https://github.com/braddr/phobos/commit/040540ef3baa38997b15a56be3e9cd9c4bfa51ab On my desktop (far from idle, it's running 2 of the auto testers), it consistently takes 0.187s to lex all of the .d files in phobos. Later, BradThanks, Brad!
Aug 05 2012
On 8/1/12 21:10, Walter Bright wrote:8. Lexer should be configurable as to whether it should collect information about comments and ddoc comments or not 9. Comments and ddoc comments should be attached to the next following token, they should not themselves be tokensI believe there should be an option to get comments as tokens. Or otherwise the attached comments should have a source location...
Aug 06 2012