digitalmars.D - Request for comments: std.d.lexer
- Brian Schott (18/18) Jan 27 2013 I'm writing a D lexer for possible inclusion in Phobos.
- Philippe Sigaud (29/44) Jan 27 2013 Cool! I remember linking to it in the wiki a week ago:
- deadalnix (3/14) Jan 27 2013 Oh yes, that is very important. Conditions are perfect to handle
- Brian Schott (23/44) Jan 27 2013 It implements the InputRange interface from std.range so that
- Timon Gehr (5/28) Jan 27 2013 You are measuring lexing speed against compilation speed. A reasonably
- Johannes Pfau (6/45) Jan 27 2013 Profiling is always a good idea, but to be fair: His dmd was probably
- dennis luehring (6/51) Jan 27 2013 that makes no sense - we need a tiny piece of benchmark code inside of
- dennis luehring (3/8) Jan 27 2013 the dmd lexer is in
- Philippe Sigaud (12/32) Jan 27 2013 Hmm. You're the first person I see pushing this. Personally, I'd go
- Jacob Carlborg (5/9) Jan 27 2013 Always good to try and minimize the size of a token when the lexer
- Walter Bright (5/7) Jan 27 2013 Speed is critical for a lexer.
- Dmitry Olshansky (8/15) Jan 27 2013 I concur. One of the biggest reason* there is a separate lexer step is
- Philippe Sigaud (6/11) Jan 27 2013 ... which is exactly what parsing expression grammars (and other
- Dmitry Olshansky (14/26) Jan 27 2013 That is only a happenstance and is highly overrated. What does it buy
- Philippe Sigaud (13/33) Jan 27 2013 I don't know. I discovered what scannerless parsing was after
- Dmitry Olshansky (14/27) Jan 27 2013 Reminds me to tweak/refactor/optimize CT-part of it.
- Philippe Sigaud (4/8) Jan 27 2013 The latter. The one I use for Pegged to generate (what is hopefully) a
- Walter Bright (5/17) Jan 27 2013 If you do such a thing when designing a language, I guarantee that you w...
- Philippe Sigaud (6/14) Jan 27 2013 Something I never tought about: given a 10k lines module, how many
- Jacob Carlborg (13/18) Jan 27 2013 If we're talking about dynamic allocation you can make sure you're just
- FG (12/22) Jan 28 2013 I was also thinking about using slices to limit string allocation. So fa...
- Walter Bright (3/19) Jan 28 2013 Just study the dmd lexer.c source code.
- Mehrdad (9/15) Jan 28 2013 Here's my (VERY) simple NFA-based "regex"-based lexer, which
- Philippe Sigaud (2/3) Jan 28 2013 This link does not work for me :(
- Mehrdad (8/12) Jan 28 2013 Yeah sorry, I found a bug in it, so I took it off and fixed it
- Philippe Sigaud (5/13) Jan 28 2013 Grabby grab!
- Mehrdad (6/10) Jan 28 2013 lol. Mainly because I got tired of running into compiler bugs, to
- Paulo Pinto (2/9) Jan 27 2013 I second that.
- Mehrdad (4/11) Jan 28 2013 Semi-unrelated question -- how would you benchmark a _parser_?
- Jacob Carlborg (6/9) Jan 28 2013 It should always be possible to get a number, but it would be hard to
- Mehrdad (4/13) Jan 28 2013 Sorry I think my question was vague -- what I meant was, how do
- Jacob Carlborg (6/9) Jan 28 2013 The easy way out would be to just measure how long time it takes to
- Mehrdad (4/14) Jan 28 2013 Yeah that's exactly what I meant by not comparing xD
- Walter Bright (2/3) Jan 28 2013 Just run under a profiler, then fix the hot spots.
- dennis luehring (2/5) Jan 28 2013 mehrdad speaks about benchmarking not optimization :)
- Walter Bright (3/10) Jan 28 2013 I know. But profiling is how you make it fast. Lexers should be as fast ...
- Mehrdad (3/15) Jan 28 2013 I was talking about parsers actually :| on a second thought maybe
- Walter Bright (2/17) Jan 28 2013 Same comment applies.
- Johannes Pfau (10/18) Jan 28 2013 But to be fair that doesn't fit ranges very well. If you don't want to
- Dmitry Olshansky (9/27) Jan 28 2013 Not the whole source but to construct a table of all identifiers. The
- Timon Gehr (18/33) Jan 28 2013 Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My
- Walter Bright (2/3) Jan 28 2013 This makes symbol lookups very fast.
- Dmitry Olshansky (13/49) Jan 30 2013 In allocation scheme I proposed that ID could be a 32bit offset into the...
- Walter Bright (4/6) Feb 01 2013 That only works if you know in advance the max size the chunk can ever b...
- Dmitry Olshansky (13/20) Feb 01 2013 Well I supposed it's exactly one reallocatable block. Then token have an...
- Walter Bright (4/23) Feb 01 2013 Your technique can work, provided the number of identifiers isn't large ...
- Dmitry Olshansky (238/270) Feb 02 2013 I think I failed to describe the virtual memory trick or you just
- Walter Bright (3/4) Feb 02 2013 You can't just reserve 1Gb of address space in a 32 bit system. (I know ...
- Dmitry Olshansky (20/25) Feb 03 2013 OK, sorry you must knew this stuff throughly. 1Gb is not the point I
- Tove (4/12) Feb 01 2013 This can easily be archived by preallocating file.size bytes...
- deadalnix (22/22) Jan 27 2013 Very happy to see that !
- deadalnix (5/27) Jan 27 2013 And the famous Job's « one last thing » : I'm not a big fan of
- Brian Schott (17/38) Jan 27 2013 I decided not to do this because the lexer actually calls itself
- deadalnix (6/23) Jan 27 2013 Even without special casing for slice of the input, generating a
- Dicebot (4/4) Jan 27 2013 In one of last discussions about standard lexer/parser I remember
- Philippe Sigaud (4/4) Jan 27 2013 Let's add another question:
- Jacob Carlborg (4/8) Jan 27 2013 Perhaps an option for this.
- Jacob Carlborg (5/22) Jan 27 2013 How about changing the type of TokenType to ushort, if all members fit.
- Walter Bright (3/6) Jan 27 2013 Just a quick comment: byToken() should not accept a filename. It's input...
- Dmitry Olshansky (5/13) Jan 27 2013 InputRange of char/wchar/dchar I guess as lexer should have an ability
- Walter Bright (3/15) Jan 27 2013 Just char would suffice. If a user needs wchar/dchar, they can put a map...
- Brian Schott (5/13) Jan 27 2013 The file name is accepted for eventual error reporting purposes.
- Timon Gehr (3/6) Jan 27 2013 Sure. The point you brought across, however, was that it is not
- Walter Bright (2/3) Jan 27 2013 Use an OutputRange for that.
- David Nadlinger (4/8) Jan 27 2013 What about that delegate-based design? I thought everyone agreed
- Walter Bright (4/10) Jan 27 2013 An OutputRange is a way of doing that. The advantage of OutputRange's is...
- David Nadlinger (8/22) Jan 27 2013 I was talking about the design you proposed yourself here:
- Brian Schott (42/46) Jan 27 2013 I think you misunderstand. The file name is so that if you pass
- Walter Bright (3/11) Jan 27 2013 Yes, I did misunderstand. I suggest updating the documentation to clear ...
- deadalnix (5/15) Jan 27 2013 I don't think that is a good idea. For instance mixin need to be
- Johannes Pfau (7/11) Jan 28 2013 This sounds like a valid use case for buffered ranges (which we don't
- Jacob Carlborg (6/11) Jan 28 2013 If think slicing the buffer will force the whole buffer to remain in
- Walter Bright (3/6) Jan 28 2013 Well, the source buffer can be large, and will span a lot of memory cach...
- Jacob Carlborg (5/7) Jan 28 2013 Would it be better to .dup the slices? Won't that be just as slow as
- Dmitry Olshansky (5/10) Jan 28 2013 It would be better to compactly layout these one by one in a separate
- jerro (5/19) Jan 28 2013 Would it also make sense to do small string optimization? On 64
- Dmitry Olshansky (4/20) Jan 28 2013 It very well could be iff slice is a field in the Token struct.
- Timon Gehr (9/42) Jan 28 2013 I see, probably there should be an option to do this by slicing instead.
- Brian Schott (28/33) Jan 30 2013 I implemented the various suggestions from a past thread and made
- dennis luehring (3/19) Jan 30 2013 but you still need to compare that to the current dmd lexer - nothing
- FG (3/8) Jan 30 2013 Yep. I'm eager to see what timings you get with the whole file kept in m...
- Dmitry Olshansky (9/42) Jan 30 2013 idup --> allocation
- FG (10/13) Jan 30 2013 Yeah, similar to what I suggested, except that probably it would be good...
- Dmitry Olshansky (14/27) Jan 30 2013 Mm trie shouldn't have that many allocations. Typically each node is
- Jacob Carlborg (4/24) Jan 31 2013 How many tokens would that be in total?
- H. S. Teoh (99/105) Jan 28 2013 [...]
- Dejan Lekic (3/22) Jan 28 2013 Having a "standard" lexer module, that many people work on, and
- Jacob Carlborg (6/23) Jan 31 2013 Just thinking out loud here. Would it be possible to lex a file in
- FG (2/4) Jan 31 2013 Do you know where you can safely cut it without having it lexed beforeha...
- Jacob Carlborg (8/10) Jan 31 2013 I was thinking that myself. It would probably be possible to just cut it...
- H. S. Teoh (9/18) Jan 31 2013 [...]
- Jacob Carlborg (4/7) Jan 31 2013 That would be a problem.
- deadalnix (4/13) Jan 31 2013 I don't think it worth the complexity. You can lex both file in
- dennis luehring (4/30) Jan 31 2013 why not only the symbols at the border needs to be "connected" then
- Jacob Carlborg (4/7) Jan 31 2013 That would require some profiling to figure out.
- dennis luehring (7/13) Jan 31 2013 i would say it "can" help alot so the design should be able to use
- Jacob Carlborg (7/13) Jan 31 2013 This is kind of obvious, I think. That's why I started to think at this
- qznc (6/9) Jan 31 2013 I think a lexer should be IO-bound on todays machines, so
- Mehrdad (4/11) Feb 01 2013 Pretty sure RAMs these days can handle more than 1 processor's
- Jacob Carlborg (4/6) Feb 01 2013 I'm not sure I understand.
- Brian Schott (5/5) Feb 03 2013 After several hours of optimizing I've managed to make it so that
- FG (2/4) Feb 03 2013 What are you comparing here? How do you time DMD's lexing stage?
- Brian Schott (3/8) Feb 03 2013 A simple hack of module.c that prints out the file's token count
- FG (5/8) Feb 03 2013 Ah, fine. Then it's not that bad. :)
- Jacob Carlborg (5/9) Feb 03 2013 That would be interesting to hear. Three times slower than DMD doesn't
- FG (5/12) Feb 04 2013 Looking at the current source, there is now a StringCache to hold the st...
- Mehrdad (6/11) Feb 04 2013 Where is the current bottleneck?
- Brian Schott (4/4) Feb 04 2013 More optimizing:
- Andrei Alexandrescu (4/8) Feb 04 2013 Suggestion: take lexer.c and convert it to D. Should take one day, and
- Andrej Mitrovic (3/5) Feb 04 2013 This was already done for DDMD, and the more recent minimal version of i...
- Andrei Alexandrescu (3/8) Feb 04 2013 Awesome! Has anyone measured the speed?
- Brian Schott (6/8) Feb 04 2013 I gave up on getting it to compile.
- Andrej Mitrovic (6/7) Feb 05 2013 It seems master is broken. An earlier commit is buildable with a small
- deadalnix (5/14) Feb 04 2013 DMD's lexer is not suitable for phobos IMO. It doesn't take a
- Andrei Alexandrescu (4/18) Feb 05 2013 I looked at the code. Once converted, it's trivial to convert the lexer
- Jacob Carlborg (9/11) Feb 05 2013 There's reason for why nobody has just extract the lexer from DMD. It
- Jonathan M Davis (18/20) Feb 05 2013 Not exactly. I decided that it would be of greater benefit to just write...
- Jacob Carlborg (4/7) Feb 05 2013 Aha, I see.
- Jonathan M Davis (12/19) Feb 05 2013 There are basic ideas about how it works which are obviously good and sh...
- Jacob Carlborg (4/10) Feb 05 2013 Yeah, that could be useful.
- Andrei Alexandrescu (5/23) Feb 05 2013 As far as I could tell the dependencies of the lexer are fairly
- Jacob Carlborg (4/7) Feb 05 2013 That's not what I remember from last time I gave it a try.
- Jonathan M Davis (14/17) Feb 05 2013 I don't remember all of the details at the moment, since it's been sever...
- Andrei Alexandrescu (5/21) Feb 05 2013 I think it would be reasonable for a lexer to require a range of ubyte
- Jonathan M Davis (9/12) Feb 05 2013 I'd have to think about how you'd handle the Unicode stuff in that case,...
- Jonathan M Davis (27/30) Feb 08 2013 Another big issue is the fact that in some ways, using a pointer like dm...
- Dmitry Olshansky (6/17) Feb 08 2013 Not true, certain ranges know length but can't be random access as
- Jonathan M Davis (33/51) Feb 08 2013 t may
- Dmitry Olshansky (14/45) Feb 08 2013 Well I honestly disagree about the promise of knowing length - being
- Jonathan M Davis (10/15) Feb 08 2013 I don't know exactly how costly it ends up being (and certainly, poor de...
- Dmitry Olshansky (20/37) Feb 08 2013 I suspect it *might* require extra register for base depending on
- Jonathan M Davis (9/11) Feb 08 2013 The main problem isn't RA ranges. It's quite probable that the optimizer...
- deadalnix (6/24) Feb 08 2013 For pure InputRanges, that is pretty bad as the decoding of UTF
- deadalnix (7/33) Feb 08 2013 Wow that is complete bullshit xD I completely messed up what I
- Jonathan M Davis (15/19) Feb 09 2013 Well, with the pull request that I have, it'll always pop for input rang...
- Jacob Carlborg (6/12) Feb 09 2013 Following this discussion it seems it's complicated to support ranges in...
- Andrei Alexandrescu (4/17) Feb 09 2013 Requiring a random-access range of ubyte with a terminating zero may be
- Jacob Carlborg (6/8) Feb 09 2013 Requiring a random-access range probably makes it easier. People here
- Andrei Alexandrescu (4/10) Feb 09 2013 Yah. The way I see it is, start with a random-access range and then see
- Dmitry Olshansky (10/21) Feb 09 2013 I don't get this. There is no sensible requirement to forbid non-random
- Jonathan M Davis (21/40) Feb 09 2013 may be
- Walter Bright (4/6) Feb 08 2013 A more problematic case is dmd's lexer relies on a 0 byte at the end to ...
- Jonathan M Davis (7/14) Feb 09 2013 I didn't know that. That's a cute trick. But yeah, without controlling t...
- Andrei Alexandrescu (3/16) Feb 09 2013 That's not a problem. You may require that the input has a terminating z...
- Jacob Carlborg (4/6) Feb 09 2013 You cannot just append a terminating zero in the lexer if it's a range?
- Andrei Alexandrescu (3/7) Feb 09 2013 You require it. It's client's burden.
- Jacob Carlborg (4/5) Feb 09 2013 Ok, I see.
- Walter Bright (11/27) Feb 09 2013 Perhaps we can formalize this a bit. Define a SentinelInputRange, which ...
- Walter Bright (4/11) Feb 09 2013 If it's not clear, SentinelInputRange.front can be called *before* check...
- Andrei Alexandrescu (5/23) Feb 09 2013 Yah, that all would fit C-style strings and singly-linked lists like a
- deadalnix (10/14) Feb 08 2013 Playing around that, I discovered a bug in std.utf : slice and
- Jonathan M Davis (7/13) Feb 05 2013 It turns out that it has nothing to do with string mixins (though it doe...
- Jacob Carlborg (4/9) Feb 05 2013 Ok, good that it's not a blocker.
- Dmitry Olshansky (4/8) Feb 05 2013 Time to do some hacking on your lexer I guess. I'll try add a couple of
- Brian Schott (7/10) Feb 05 2013 I've been using avgtime[1] for measuring run times, perf[2], and
- Dmitry Olshansky (7/18) Feb 05 2013 Thanks.
- Brian Schott (5/11) Feb 08 2013 http://hackerpilot.github.com/experimental/std_lexer/images/times3.png
- Jacob Carlborg (4/7) Feb 08 2013 DMD is still consistently faster, :(
- Brian Schott (3/4) Feb 08 2013 We might be getting to the part where we say that code generated
- Brian Schott (5/11) Feb 08 2013 For the lulz I compiled with "dmd -release -inline -noboundscheck
- Dmitry Olshansky (4/14) Feb 08 2013 Would be intereesting to try gdc as dmd on linux uses gcc.
- Iain Buclaw (6/27) Feb 08 2013 What? That's an outright fib. :-)
- Brian Schott (2/3) Feb 08 2013 I think he means that "on linux the dmd binary is compiled by gcc"
- Iain Buclaw (5/9) Feb 08 2013 That's still lies. :o)
- Dmitry Olshansky (4/13) Feb 08 2013 --
- Iain Buclaw (9/25) Feb 08 2013 I see we could be doing this all day. :þ
- Brian Schott (5/13) Feb 08 2013 What we're saying is that dmd, The Digital Mars D Compiler, is
- Iain Buclaw (8/20) Feb 08 2013 ly
- Dmitry Olshansky (7/23) Feb 08 2013 We've been discussing DMD's lexer written in C++ that is obviously
- SomeDude (3/12) Feb 09 2013 I'd think it's built by DMC++, Digital Mars C++ compiler.
- Dmitry Olshansky (4/18) Feb 09 2013 On linux, sure ;)
- Brian Schott (8/9) Feb 08 2013 GDC decided to randomly not fail to build on my system, so it's
- Brian Schott (3/4) Feb 08 2013 Copy/pate fail. That's actually "dmd -release -inline
- Iain Buclaw (5/15) Feb 08 2013 Cool. How do you determine the speed times?
- FG (3/5) Feb 08 2013 Interesting. Seems that LDC is faster than GDC but has a bigger start ov...
- David Nadlinger (7/14) Feb 08 2013 It seems like that, but the differences are small enough that you
- dennis luehring (8/17) Feb 08 2013 i still don't get what you comparing here
- Jacob Carlborg (4/12) Feb 09 2013 Do you have the exact code you test available somewhere?
- Dmitry Olshansky (40/46) Feb 08 2013 Keep in mind the D runtime start-up cost. In the end on small files DMD
- Jacob Carlborg (5/52) Feb 08 2013 Ok, I see. But wouldn't it be more fair to compare either GCC to GDC or
- Walter Bright (2/4) Feb 08 2013 I worship the Dark Arts.
- Iain Buclaw (3/8) Feb 09 2013 More like cursed the god's when you wrote it. :-)
- Brian Schott (5/10) Feb 28 2013 *recites incantations*
- Andrej Mitrovic (2/3) Feb 28 2013 That's a nice plot, but what is the X axis?
- FG (2/5) Feb 28 2013 Lexing time in milliseconds.
- Dmitry Olshansky (22/33) Feb 28 2013 Looking far better now. Keep in mind that we still don't use the
- Dmitry Olshansky (7/32) Feb 28 2013 To clarify these are problems I observed but haven't solved/avoided
I'm writing a D lexer for possible inclusion in Phobos.

DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
Code: https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

It's currently able to correctly syntax highlight all of Phobos, but does a fairly bad job at rejecting or notifying users/callers about invalid input. I'd like to hear arguments on the various ways to handle errors in the lexer. In a compiler it would be useful to throw an exception on finding something like a string literal that doesn't stop before EOF, but a text editor or IDE would probably want to be a bit more lenient. Maybe having it run-time (or compile-time) configurable like std.csv would be the best option here.

I'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)
Jan 27 2013
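The run-time configurable error handling Brian mentions could be expressed as a small policy struct passed to the lexer, in the spirit of std.csv. The following is only a hedged sketch; `ErrorBehavior`, `LexerConfig`, and the way they would plug into `byToken` are invented names for illustration, not the actual std.d.lexer API:

```d
// Hypothetical sketch of a configurable error-handling policy.
// ErrorBehavior and LexerConfig are invented names, not the real API.
enum ErrorBehavior
{
    throwException, // compiler-style: abort on invalid input
    errorToken      // editor/IDE-style: emit an error token, keep lexing
}

struct LexerConfig
{
    ErrorBehavior errorBehavior = ErrorBehavior.errorToken;
}

void main()
{
    // An IDE keeps the lenient default; a compiler front end
    // would opt into the strict behavior explicitly:
    auto strict = LexerConfig(ErrorBehavior.throwException);
    assert(strict.errorBehavior == ErrorBehavior.throwException);
    assert(LexerConfig.init.errorBehavior == ErrorBehavior.errorToken);
}
```

The point of the struct (rather than a bare bool) is that more policies can be added later without breaking callers.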
On Sun, Jan 27, 2013 at 10:51 AM, Brian Schott <briancschott gmail.com> wrote:
> I'm writing a D lexer for possible inclusion in Phobos.
> DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
> Code: https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

Cool! I remember linking to it in the wiki a week ago, here: http://wiki.dlang.org/Lexers_Parsers
Feel free to correct the entry.

> It's currently able to correctly syntax highlight all of Phobos, but does a fairly bad job at rejecting or notifying users/callers about invalid input. I'd like to hear arguments on the various ways to handle errors in the lexer. In a compiler it would be useful to throw an exception on finding something like a string literal that doesn't stop before EOF, but a text editor or IDE would probably want to be a bit more lenient. Maybe having it run-time (or compile-time) configurable like std.csv would be the best option here.

Last time we discussed it, IIRC, some people wanted the lexer to stop at once, others just wanted an Error token. I personally prefer an Error token, but that means finding a way to start lexing again after the error (and hence, finding where the error ends). I guess any separator/terminator could be used to re-engage the lexer: space, semicolon, closing brace, closing parenthesis?

> I'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)

OK, here are a few questions:

* Having a range interface is good. Any reason why you made byToken a class and not a struct? Most (like, 99%) of ranges in Phobos are structs. Do you need reference semantics?
* Also, is there a way to keep comments? Any code wanting to modify the code might need them. (edit: Ah, I see it: IterationStyle.IncludeComments)
* I'd distinguish between standard comments and documentation comments. These are different beasts, to my eyes.
* I see Token has a startIndex member. Any reason not to have an endIndex member? Or can an end index always be deduced from startIndex and value.length?
* How does it fare with non-ASCII code?
* A rough estimate of number of tokens/s would be good (I know it'll vary). Walter seems to think if a lexer is not able to vomit thousands of tokens a second, then it's not good. On a related note, does your lexer have any problem with 10k+-line files?
Jan 27 2013
On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
> Last time we discussed it, IIRC, some people wanted the lexer to stop at once, others just wanted an Error token. I personally prefer an Error token, but that means finding a way to start lexing again after the error (and hence, finding where the error ends). I guess any separator/terminator could be used to re-engage the lexer: space, semicolon, closing brace, closing parenthesis?

Oh yes, that is very important. Conditions are perfect to handle stuff like that.
Jan 27 2013
On Sunday, 27 January 2013 at 10:17:48 UTC, Philippe Sigaud wrote:
> * Having a range interface is good. Any reason why you made byToken a class and not a struct? Most (like, 99%) of ranges in Phobos are structs. Do you need reference semantics?

It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code.

> * Also, is there a way to keep comments? Any code wanting to modify the code might need them. (edit: Ah, I see it: IterationStyle.IncludeComments)
> * I'd distinguish between standard comments and documentation comments. These are different beasts, to my eyes.

The standard at http://dlang.org/lex.html doesn't differentiate between them. It's trivial to write a function that checks if a token starts with "///", "/**", or "/++" while iterating over the tokens.

> * I see Token has a startIndex member. Any reason not to have an endIndex member? Or can an end index always be deduced from startIndex and value.length?

That's the idea.

> * How does it fare with non-ASCII code?

Everything is templated on the character type, but I haven't done any testing on UTF-16 or UTF-32. Valgrind still shows functions from std.uni being called, so at the moment I assume it works.

> * A rough estimate of number of tokens/s would be good (I know it'll vary). Walter seems to think if a lexer is not able to vomit thousands of tokens a second, then it's not good. On a related note, does your lexer have any problem with 10k+-line files?

$ time dscanner --sloc ../phobos/std/datetime.d
14950

real    0m0.319s
user    0m0.313s
sys     0m0.006s

$ time dmd -c ../phobos/std/datetime.d

real    0m0.354s
user    0m0.318s
sys     0m0.036s

Yes, I know that "time" is a terrible benchmarking tool, but they're fairly close for whatever that's worth.
Jan 27 2013
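The "trivial" doc-comment check Brian describes could look like the sketch below, assuming the comment's text, including the leading delimiters, is available as a string (the function name and that assumption are mine, not the module's):

```d
import std.algorithm : startsWith;

// Sketch: classify a comment's text as a Ddoc documentation comment
// by checking the leading delimiters, as suggested in-thread.
bool isDocComment(string text)
{
    return text.startsWith("///")
        || text.startsWith("/**")
        || text.startsWith("/++");
}

unittest
{
    assert(isDocComment("/// summary"));
    assert(isDocComment("/** docs */"));
    assert(isDocComment("/++ docs +/"));
    assert(!isDocComment("// plain comment"));
    assert(!isDocComment("/* plain block */"));
}
```

A caller iterating over comment tokens could apply this filter without the lexer itself having to distinguish the two kinds.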
On 01/27/2013 11:42 AM, Brian Schott wrote:
> It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code.

The lexer range must be a struct.

> $ time dscanner --sloc ../phobos/std/datetime.d
> 14950
>
> real 0m0.319s
> user 0m0.313s
> sys 0m0.006s
>
> $ time dmd -c ../phobos/std/datetime.d
>
> real 0m0.354s
> user 0m0.318s
> sys 0m0.036s

You are measuring lexing speed against compilation speed. A reasonably well performing lexer is around one order of magnitude faster on std.datetime. Maybe you should profile a little?
Jan 27 2013
On Sun, 27 Jan 2013 12:38:33 +0100, Timon Gehr <timon.gehr gmx.ch> wrote:
> You are measuring lexing speed against compilation speed. A reasonably well performing lexer is around one order of magnitude faster on std.datetime. Maybe you should profile a little?

Profiling is always a good idea, but to be fair: his dmd was probably compiled with gcc and -O2 if it's a normal release build. So to compare that he should use gdc with -O2 -release -fno-bounds-check and probably more flags to compile the D code.
Jan 27 2013
On 27.01.2013 13:31, Johannes Pfau wrote:
> Profiling is always a good idea, but to be fair: his dmd was probably compiled with gcc and -O2 if it's a normal release build. So to compare that he should use gdc with -O2 -release -fno-bounds-check and probably more flags to compile the D code.

that makes no sense - we need a tiny piece of benchmark code inside of the dmd frontend (and gdc) - these results are the only reliable/comparable benchmark

does someone know the place where such benchmarking can take place in the dmd frontend code?
Jan 27 2013
> that makes no sense - we need a tiny piece of benchmark code inside of the dmd frontend (and gdc) - these results are the only reliable/comparable benchmark
>
> does someone know the place where such benchmarking can take place in the dmd frontend code?

the dmd lexer is in https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c with a small unit test
Jan 27 2013
On Sun, Jan 27, 2013 at 11:42 AM, Brian Schott <briancschott gmail.com> wrote:
> It implements the InputRange interface from std.range so that users have a choice of using template constraints or the OO model in their code.

Hmm. You're the first person I see pushing this. Personally, I'd go for a struct, if only because I don't need reference semantics.

> The standard at http://dlang.org/lex.html doesn't differentiate between them. It's trivial to write a function that checks if a token starts with "///", "/**", or "/++" while iterating over the tokens.

Yes, but the standard lexer was done with DMD in mind, and DMD has a different code path for generating comments. It's your project, sure, but I'd appreciate tokens differentiating between the many kinds of comments. Oh, and recognizing some inner Ddoc tokens, like ---- delimiters for documentation code blocks. That way, code-in-doc could use the lexer also. Pretty please?

> That's the idea.

Does it work for UTF-16 and UTF-32 strings?
Jan 27 2013
On 2013-01-27 11:42, Brian Schott wrote:
>> * I see Token has a startIndex member. Any reason not to have an endIndex member? Or can an end index always be deduced from startIndex and value.length?
>
> That's the idea.

Always good to try and minimize the size of a token when the lexer should output thousands of tokens per second.

-- 
/Jacob Carlborg
Jan 27 2013
On 1/27/2013 2:17 AM, Philippe Sigaud wrote:
> Walter seems to think if a lexer is not able to vomit thousands of tokens a second, then it's not good.

Speed is critical for a lexer.

This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.
Jan 27 2013
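The standard way to squeeze allocation out, assuming the lexer keeps the whole source buffer in memory, is to make each token's text a slice of the input rather than a copy. A hedged sketch of the idea (the field layout and the `lexIdentifier` helper are illustrative, not the actual std.d.lexer design):

```d
// Sketch: a token whose text is a slice into the source buffer,
// so producing a token allocates nothing. Layout is illustrative;
// small integer types keep the struct compact, as suggested in-thread.
struct Token
{
    string value;      // slice of the original source, not a copy
    size_t startIndex; // offset into the source buffer
    uint line;
    ushort type;       // a TokenType enum would fit in a ushort
}

// Toy lexing step: scan an ASCII identifier starting at `start`
// and return a token that merely slices the input.
Token lexIdentifier(string source, size_t start)
{
    size_t i = start;
    while (i < source.length
        && (source[i] == '_'
            || ('a' <= source[i] && source[i] <= 'z')
            || ('A' <= source[i] && source[i] <= 'Z')
            || ('0' <= source[i] && source[i] <= '9')))
        ++i;
    return Token(source[start .. i], start, 1, 0);
}

void main()
{
    auto tok = lexIdentifier("foo = 1", 0);
    assert(tok.value == "foo");      // points into the source string
    assert(tok.startIndex == 0);
}
```

The trade-off, raised later in the thread, is that slicing pins the whole source buffer in memory for as long as any token is alive; interning the strings into a separate compact block avoids that.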
On 27-Jan-2013 23:48, Walter Bright wrote:
> Speed is critical for a lexer.
>
> This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.

I concur. One of the biggest reasons* there is a separate lexer step is that it can be made to do this stage very, very fast. Then the rest of the parser will greatly benefit from this underlying speed.

*Otherwise we could just as well have added the lexer stage as simple rules to the grammar that treat all codepoints as terminals.

-- 
Dmitry Olshansky
Jan 27 2013
> I concur. One of the biggest reasons* there is a separate lexer step is that it can be made to do this stage very, very fast. Then the rest of the parser will greatly benefit from this underlying speed.
>
> *Otherwise we could just as well have added the lexer stage as simple rules to the grammar that treat all codepoints as terminals.

... which is exactly what parsing expression grammars (and other scannerless parsers) do. AFAICT, one interesting consequence is the ability to have composition of grammars, which I sure have with Pegged. But maybe it's not related that much; that's not something I stopped to think about. In any case, grammar composition is something I've learnt to like quite a lot.
Jan 27 2013
On 28-Jan-2013 00:18, Philippe Sigaud wrote:
> ... which is exactly what parsing expression grammars (and other scannerless parsers) do.

That is only a happenstance and is highly overrated. What it buys you is the right question to ask. After all, every LL parser can be written avoiding the notion of a lexer, and strangely I see no fuss about it. So is it just hype? I dunno.

> AFAICT, one interesting consequence is the ability to have composition of grammars, which I sure have with Pegged. But maybe it's not related that much; that's not something I stopped to think about. In any case, grammar composition is something I've learnt to like quite a lot.

You can still have composability of grammars. In fact I'd define a couple of less powerful but better optimized variations if I were you. PEG-style memoization is still possible as long as the underlying grammars are context-free; all the useful ones usually are. LR is hardly composable (I haven't seen a single one that is). But all top-down ones always are, unless there is some explicit work spent to undermine this ability ;)

-- 
Dmitry Olshansky
Jan 27 2013
I don't know. I discovered what scannerless parsing was after discovering (and getting interested) in PEG. That's a somewhat unusual path, as I'm now discovering standard lexer+parser parsing :) Anyway, let's not derail the OP thread too much.... which is exactly what parsing expression grammars (and other scannerless parsers) do.That is only a happenstance and is highly overrated. What does it buy you is the right question to ask. After all every LL-parser can be written avoiding notion of lexer and strangely I see no fuss about it. So is it just hype? I dunno.Yes, getting an automaton to deal with the regular part would be good. But then, there already is std.regex ;) it's good to know composition is possible elsewhere. I really like it: it makes me use grammars (and the related, generated parsers) as I use functions in PL: slowly building layers of abstraction and reuse.AFAICT, one interesting consequence is the ability to have composition of grammars, which I sure have with Pegged. But maybe it's not related that much, that's not something I stopped to think about. In any case, grammar composition is something I've learnt to like quite a lot.You can still have composability of grammars. In fact I'd define a couple of a less powerful but better optimized variations if I were you.PEG-style memoization is still possible as long as underlying grammars are context-free all the useful usually are.Yes, I agree.LR hardly composable ( I haven't seen a single one that does). But all top-down one always are unless there is some explicit work spent to undermine this ability ;)What I still can't get an easy answer on is: is the D grammar LR(1), LR(0), LALR?
Jan 27 2013
28-Jan-2013 00:45, Philippe Sigaud пишет: [snip]Reminds me to tweak/refactor/optimize CT-part of it. [snip]Yes, getting an automaton to deal with the regular part would be good. But then, there already is std.regex ;)AFAICT, one interesting consequence is the ability to have composition of grammars, which I sure have with Pegged. But maybe it's not related that much, that's not something I stopped to think about. In any case, grammar composition is something I've learnt to like quite a lot.You can still have composability of grammars. In fact I'd define a couple of a less powerful but better optimized variations if I were you.What I still can't get an easy answer on is: is the D grammar LR(1), LR(0), LALR?I *think* you can shoehorn it into any one of these: LALR(1), LL(k)+ some lookahead, LL(*) a generalization of it, or PEG. As usual you might need semantic predicates to iron out some quirks though. We are getting OT quick ;) The hint is that your question is a bit faulty: by calling it "the D grammar" do you mean the exact one listed on the website or any equivalent that parses the same language (including the ones obtained by simple transformations)? -- Dmitry Olshansky
Jan 27 2013
The hint is that your question is a bit faulty: by calling it "the D grammar" do you mean the exact one listed on the website or any equivalent that parses the same language (including the ones obtained by simple transformations)?The latter. The one I use for Pegged to generate (what is hopefully) a D parser is already modified, discards constructs like NameList := Name NameList in favor of Name+ Anyway, let's stop here. Back to lexing proper :)
Jan 27 2013
On 1/27/2013 12:18 PM, Philippe Sigaud wrote:If you do such a thing when designing a language, I guarantee that you will find it impossible to resist having custom "tokens" in various parts of the grammar. This will make it impossible to write tools that only need to tokens, such as those syntax highlighting editors that really are token highlighters.I concur. One of the biggest reason* there is a separate lexer step is because it could be made to do this stage very-very fast. Then the rest of the parser will greatly benefit from this underlying speed. *Otherwise we could have just as well add the lexer stage as simple rules to the grammar that treats all of codepoints as terminals.... which is exactly what parsing expression grammars (and other scannerless parsers) do. AFAICT, one interesting consequence is the ability to have composition of grammars, which I sure have with Pegged. But maybe it's not related that much, that's not something I stopped to think about. In any case, grammar composition is something I've learnt to like quite a lot.
Jan 27 2013
On Sun, Jan 27, 2013 at 8:48 PM, Walter Bright <newshound2 digitalmars.com> wrote:On 1/27/2013 2:17 AM, Philippe Sigaud wrote:Something I never thought about: given a 10k-line module, how many tokens does that represent, roughly? Ten times as many?Walter seems to think if a lexer is not able to vomit thousands of tokens a second, then it's not good.Speed is critical for a lexer.This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.How does one do that? Honest question, I'm not really concerned by extreme speed most of the time, and have lots to learn in this area.
Jan 27 2013
On 2013-01-27 21:15, Philippe Sigaud wrote:If we're talking about dynamic allocation, you can make sure you're just using value types. A token could look like: struct Token { TokenKind kind; string value; uint index; } For the "value" you could just slice the buffer. But I think this will prevent the whole buffer from being collected. -- /Jacob CarlborgThis means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.How does one do that? Honest question, I'm not really concerned by extreme speed most of the time, and have lots to learn in this area.
Jan 27 2013
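Jacob's value-type token above can be fleshed out into a small allocation-free sketch. The `lexIdentifier` helper and the `TokenKind` members are invented for illustration; they are not part of any proposed std.d.lexer API:

```d
// Minimal sketch of the value-type token idea: "value" is a slice of the
// source buffer, so producing a token performs no heap allocation.
enum TokenKind { identifier, number, eof }

struct Token
{
    TokenKind kind;
    string value; // slice into the original source, not a copy
    uint index;   // byte offset of the token within the source
}

bool isIdentChar(char c)
{
    return c == '_' || ('a' <= c && c <= 'z')
        || ('A' <= c && c <= 'Z') || ('0' <= c && c <= '9');
}

// Hypothetical helper: lex one identifier starting at byte offset `start`.
Token lexIdentifier(string source, uint start)
{
    uint i = start;
    while (i < source.length && isIdentChar(source[i]))
        ++i;
    return Token(TokenKind.identifier, source[start .. i], start);
}
```

As Jacob notes, slicing pins the whole source buffer for the GC; interning the slices into separate storage avoids that at the cost of a copy per unique identifier.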
On 2013-01-28 08:47, Jacob Carlborg wrote:If we're talking about dynamic allocation, you can make sure you're just using value types. A token could look like: struct Token { TokenKind kind; string value; uint index; } For the "value" you could just slice the buffer. But I think this will prevent the whole buffer from being collected.I was also thinking about using slices to limit string allocation. So far the combined size of source files in D projects is so small that it wouldn't hurt to mmap the files and slice them. It is possible, though, that someone would create a huge file, if only to see this program crash. :) In that case something else may be useful. Allocate special arrays for holding value strings, for example 256 kB per array. Token.value will be a slice of such an array. Additionally, have a lookup Trie to help reuse repeating values - if a string is already in an array, the Trie leaf will store its position (slice) and the token will only have to copy that info. If the lookup doesn't turn up the string, the string will be added to the end of the array using Appender or, if it doesn't fit, a new array will be created.
Jan 28 2013
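A stripped-down sketch of the reuse idea above, using D's built-in associative array where the post proposes a Trie (the `InternPool` name and its shape are invented for illustration):

```d
// Each distinct identifier text is stored once; every later occurrence
// gets the same canonical string, so tokens don't duplicate storage.
struct InternPool
{
    private string[string] table; // seen text -> its canonical copy

    string intern(string s)
    {
        if (auto p = s in table)
            return *p; // reuse the first copy we saw
        table[s] = s;
        return s;
    }
}
```

A real implementation would copy `s` out of the mmap'd file (e.g. with `s.idup`) before storing it, so the canonical strings survive after the source buffer is unmapped.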
On 1/27/2013 12:15 PM, Philippe Sigaud wrote:On Sun, Jan 27, 2013 at 8:48 PM, Walter Bright <newshound2 digitalmars.com> wrote:I don't know.On 1/27/2013 2:17 AM, Philippe Sigaud wrote:Something I never tought about: given a 10k lines module, how many tokens does that represent, roughly. Ten times as much?Walter seems to think if a lexer is not able to vomit thousands of tokens a seconds, then it's not good.Speed is critical for a lexer.Just study the dmd lexer.c source code.This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token will is not going to do very well at all.How does one do that? Honest question, I'm not really concerned by extreme speed, most of the time, and have lots to learn in this area.
Jan 28 2013
On Sunday, 27 January 2013 at 20:15:33 UTC, Philippe Sigaud wrote:Here's my (VERY) simple NFA-based "regex"-based lexer, which performs **no** heap allocations (unless the pattern is very complicated): https://gist.github.com/b10ae22ab822c87467a3 Yes, the code is ugly and the capabilities are far too basic, but it was good enough for what I needed. The point it illustrates is how you can lex without any heap allocations.This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token will is not going to do very well at all.How does one do that? Honest question, I'm not really concerned by extreme speed, most of the time, and have lots to learn in this area.
Jan 28 2013
On Mon, Jan 28, 2013 at 12:43 PM, Mehrdad <wfunction hotmail.com> wrote:https://gist.github.com/b10ae22ab822c87467a3This link does not work for me :(
Jan 28 2013
On Monday, 28 January 2013 at 20:35:51 UTC, Philippe Sigaud wrote:On Mon, Jan 28, 2013 at 12:43 PM, Mehrdad <wfunction hotmail.com> wrote:Yeah sorry, I found a bug in it, so I took it off and fixed it (... or so I think). I wouldn't be surprised if it's still buggy. It's more of a proof of concept than anything... it definitely has room for improvement, but the speed is still decent. Here's another link (grab it before it's gone!): https://gist.github.com/addfe830acf9785d01efhttps://gist.github.com/b10ae22ab822c87467a3This link does not work for me :(
Jan 28 2013
Grabby grab! Owww, C++! My eyes, they're melting! My hairs, on fire! Why did you forfeit your immortal soul? Anyway, thanks. I'll look at it more closely. It seems to be the kind of C++ I can still read *and* understand.This link does not work for me :(Yeah sorry, I found a bug in it, so I took it off and fixed it (... or so I think). I wouldn't be surprised if it's still buggy. It's more of a proof of concept than anything... it definitely has room for improvement, but the speed is still decent. Here's another link (grab it before it's gone!): https://gist.github.com/addfe830acf9785d01ef
Jan 28 2013
On Monday, 28 January 2013 at 22:00:30 UTC, Philippe Sigaud wrote:Owww, C++! My eyes, they're melting! My hairs, on fire! Why did you forfeit your immortal soul?lol. Mainly because I got tired of running into compiler bugs, to be honest. :|Anyway, thanks. I'll look at it more closely. It seems to be the kind of C++ I can still read *and* understand.Yeah, my goal was to make it mostly C-compatible, so I kept the C++-iness to a minimum. It's more like "C"++ rather than "C++", I think. :P
Jan 28 2013
On 27.01.2013 20:48, Walter Bright wrote:On 1/27/2013 2:17 AM, Philippe Sigaud wrote:I second that.Walter seems to think if a lexer is not able to vomit thousands of tokens a second, then it's not good.Speed is critical for a lexer. This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token is not going to do very well at all.
Jan 27 2013
On Sunday, 27 January 2013 at 19:48:23 UTC, Walter Bright wrote:On 1/27/2013 2:17 AM, Philippe Sigaud wrote:Semi-unrelated question -- how would you benchmark a _parser_? Is there any way to get a _number_ as an answer, or is comparing against a rival parser the only way of benchmarking a parser?Walter seems to think if a lexer is not able to vomit thousands of tokens a seconds, then it's not good.Speed is critical for a lexer. This means, for example, you'll need to squeeze pretty much all storage allocation out of it. A lexer that does an allocation per token will is not going to do very well at all.
Jan 28 2013
On 2013-01-28 10:03, Mehrdad wrote:Semi-unrelated question -- how would you benchmark a _parser_? Is there any way to get a _number_ as an answer, or is comparing against a rival parser the only way of benchmarking a parser?It should always be possible to get a number, but it would be hard to determine if that's a good number or not without comparing with anything else. -- /Jacob Carlborg
Jan 28 2013
On Monday, 28 January 2013 at 09:07:15 UTC, Jacob Carlborg wrote:On 2013-01-28 10:03, Mehrdad wrote:Sorry I think my question was vague -- what I meant was, how do you measure it, i.e. in what units? Lines per second? Tokens per second? Syntax nodes per second? etc.Semi-unrelated question -- how would you benchmark a _parser_? Is there any way to get a _number_ as an answer, or is comparing against a rival parser the only way of benchmarking a parser?It should always be possible to get a number, but it would be hard to determine if that's a good number or not without comparing with anything else.
Jan 28 2013
On 2013-01-28 10:09, Mehrdad wrote:Sorry I think my question was vague -- what I meant was, how do you measure it, i.e. in what units? Lines per second? Tokens per second? Syntax nodes per second? etc.The easy way out would be to just measure how long time it takes to parse a given piece of code and compare with some other parse using the same code. -- /Jacob Carlborg
Jan 28 2013
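Jacob's "easy way out" can be sketched as a tiny harness. Everything here (the names, the bytes-per-second metric, the use of the current std.datetime.stopwatch module) is illustrative, not a proposed benchmark standard:

```d
import std.datetime.stopwatch : StopWatch, AutoStart;

// Time one run of a parser (or lexer) over `source` and normalize by the
// input size, so results on different inputs are at least roughly
// comparable; comparing implementations on the *same* input stays the
// most meaningful test.
double bytesPerSecond(string source, void delegate(string) run)
{
    auto sw = StopWatch(AutoStart.yes);
    run(source);
    sw.stop();
    auto secs = sw.peek.total!"hnsecs" / 10_000_000.0;
    return source.length / secs;
}
```

Tokens per second or AST nodes per second work the same way; just have `run` return the count and divide by that instead.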
On Monday, 28 January 2013 at 09:21:51 UTC, Jacob Carlborg wrote:On 2013-01-28 10:09, Mehrdad wrote:Yeah that's exactly what I meant by not comparing xD I'm wondering if there's any way to make it independent of the code/grammarSorry I think my question was vague -- what I meant was, how do you measure it, i.e. in what units? Lines per second? Tokens per second? Syntax nodes per second? etc.The easy way out would be to just measure how long time it takes to parse a given piece of code and compare with some other parse using the same code.
Jan 28 2013
On 1/28/2013 1:30 AM, Mehrdad wrote:I'm wondering if there's any way to make it independent of the code/grammarJust run under a profiler, then fix the hot spots.
Jan 28 2013
Am 28.01.2013 11:13, schrieb Walter Bright:On 1/28/2013 1:30 AM, Mehrdad wrote:mehrdad speaks about benchmarking not optimization :)I'm wondering if there's any way to make it independent of the code/grammarJust run under a profiler, then fix the hot spots.
Jan 28 2013
On 1/28/2013 2:16 AM, dennis luehring wrote:Am 28.01.2013 11:13, schrieb Walter Bright:I know. But profiling is how you make it fast. Lexers should be as fast as possible, not merely faster than the competition.On 1/28/2013 1:30 AM, Mehrdad wrote:mehrdad speaks about benchmarking not optimization :)I'm wondering if there's any way to make it independent of the code/grammarJust run under a profiler, then fix the hot spots.
Jan 28 2013
On Monday, 28 January 2013 at 11:01:49 UTC, Walter Bright wrote:On 1/28/2013 2:16 AM, dennis luehring wrote:I was talking about parsers actually :| on a second thought maybe I shouldn't have asked that on a thread about lexing...Am 28.01.2013 11:13, schrieb Walter Bright:I know. But profiling is how you make it fast. Lexers should be as fast as possible, not merely faster than the competition.On 1/28/2013 1:30 AM, Mehrdad wrote:mehrdad speaks about benchmarking not optimization :)I'm wondering if there's any way to make it independent of the code/grammarJust run under a profiler, then fix the hot spots.
Jan 28 2013
On 1/28/2013 3:34 AM, Mehrdad wrote:On Monday, 28 January 2013 at 11:01:49 UTC, Walter Bright wrote:Same comment applies.On 1/28/2013 2:16 AM, dennis luehring wrote:I was talking about parsers actually :| on a second thought maybe I shouldn't have asked that on a thread about lexing...Am 28.01.2013 11:13, schrieb Walter Bright:I know. But profiling is how you make it fast. Lexers should be as fast as possible, not merely faster than the competition.On 1/28/2013 1:30 AM, Mehrdad wrote:mehrdad speaks about benchmarking not optimization :)I'm wondering if there's any way to make it independent of the code/grammarJust run under a profiler, then fix the hot spots.
Jan 28 2013
Am Sun, 27 Jan 2013 11:48:23 -0800 schrieb Walter Bright <newshound2 digitalmars.com>:On 1/27/2013 2:17 AM, Philippe Sigaud wrote:But to be fair that doesn't fit ranges very well. If you don't want to do any allocation but still keep identifiers etc in memory this basically means you have to keep the whole source in memory and this is conceptually an array and not a range. But you can of course write a lexer which accepts buffered ranges and does some allocation for those and is special cased for arrays to not allocate at all. (Unbuffered ranges should be supported using a generic BufferedRange)Walter seems to think if a lexer is not able to vomit thousands of tokens a seconds, then it's not good.Speed is critical for a lexer. This means, for example, you'll need to squeeze pretty much all storage allocation out of it.
Jan 28 2013
28-Jan-2013 15:48, Johannes Pfau пишет:Am Sun, 27 Jan 2013 11:48:23 -0800 schrieb Walter Bright <newshound2 digitalmars.com>:Not the whole source but to construct a table of all identifiers. The source is awfully redundant because of repeated identifiers, spaces, comments and what not. The set of unique identifiers is rather small. I think the best course of action is to just provide a hook to trigger on every identifier encountered. That could be as discussed earlier a delegate.On 1/27/2013 2:17 AM, Philippe Sigaud wrote:But to be fair that doesn't fit ranges very well. If you don't want to do any allocation but still keep identifiers etc in memory this basically means you have to keep the whole source in memory and this is conceptually an array and not a range.Walter seems to think if a lexer is not able to vomit thousands of tokens a seconds, then it's not good.Speed is critical for a lexer. This means, for example, you'll need to squeeze pretty much all storage allocation out of it.But you can of course write a lexer which accepts buffered ranges and does some allocation for those and is special cased for arrays to not allocate at all. (Unbuffered ranges should be supported using a generic BufferedRange)-- Dmitry Olshansky
Jan 28 2013
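Dmitry's hook-on-every-identifier idea might look roughly like this (names invented; a real lexer would invoke the delegate from its identifier-scanning state rather than re-scanning the text as this toy does):

```d
alias IdentifierSink = void delegate(const(char)[] ident, size_t offset);

// Toy scanner that reports each identifier-like run to the caller's
// delegate; the caller decides whether and how to intern or store it.
void scanIdentifiers(string source, IdentifierSink onIdentifier)
{
    static bool identChar(char c)
    {
        return c == '_' || ('a' <= c && c <= 'z')
            || ('A' <= c && c <= 'Z') || ('0' <= c && c <= '9');
    }
    size_t i = 0;
    while (i < source.length)
    {
        if (identChar(source[i]) && !(source[i] >= '0' && source[i] <= '9'))
        {
            auto start = i;
            while (i < source.length && identChar(source[i]))
                ++i;
            if (onIdentifier !is null)
                onIdentifier(source[start .. i], start);
        }
        else
            ++i;
    }
}
```

With this shape the lexer itself allocates nothing for identifiers; only the caller's delegate decides whether a copy is made.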
On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:28-Jan-2013 15:48, Johannes Pfau пишет:Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My own lexer-parser combination slices directly into the original source code, for every token, in order to remember the exact source location, and the last time I have measured, it ran faster than DMD's. I keep the source around for error reporting anyway: tt.d:27:5: error: no member 'n' for type 'A' a.n=2; ^~~ Since the tokens point directly into the source code, it is not necessary to construct any other data structures in order to allow fast retrieval of the appropriate source code line. But it's clear that a general-purpose library might not want to impose this restriction on storage upon it's clients. I think it is somewhat helpful for speed though. The other thing I do is buffering tokens in a contiguous ring buffer that grows if a lot of lookahead is requested.... But to be fair that doesn't fit ranges very well. If you don't want to do any allocation but still keep identifiers etc in memory this basically means you have to keep the whole source in memory and this is conceptually an array and not a range.Not the whole source but to construct a table of all identifiers. The source is awfully redundant because of repeated identifiers, spaces, comments and what not. The set of unique identifiers is rather small.I think the best course of action is to just provide a hook to trigger on every identifier encountered. That could be as discussed earlier a delegate. ...Maybe. I map identifiers to unique id's later, in the identifier AST node constructor, though.
Jan 28 2013
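Timon's point that source-slicing tokens double as source locations can be sketched like so (a linear scan for clarity; a real implementation would precompute line starts, or scan lazily only when reporting an error):

```d
struct LineColumn { size_t line, column; } // both 1-based

// Given the kept-around source and a token that slices into it, recover
// the token's line and column from pointer arithmetic alone: no per-token
// location fields are required.
LineColumn locate(string source, string tokenSlice)
{
    size_t offset = tokenSlice.ptr - source.ptr;
    size_t line = 1, lineStart = 0;
    foreach (i; 0 .. offset)
        if (source[i] == '\n') { line++; lineStart = i + 1; }
    return LineColumn(line, offset - lineStart + 1);
}
```

The same slice also gives the error-reporting code the full source line to print, as in the tt.d example above.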
On 1/28/2013 1:04 PM, Timon Gehr wrote:Maybe. I map identifiers to unique id's later,This makes symbol lookups very fast.
Jan 28 2013
29-Jan-2013 01:04, Timon Gehr пишет:On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:In allocation scheme I proposed that ID could be a 32bit offset into the unique identifiers chunk. Identifiers themselves then would have to be stored with sentinels (e.g. zeros) at end like in C. Then the 'map' process comes for free. The only overhead is actually copying the bytes to this buffer but it only happens once per unique identifier and in exchange you get to lex anything not only 'the whole module in one memory block' kind of thing. On the plus side it's also more cache friendly. Overall I think it's a good trade-off, but I never had the time to exercise it. -- Dmitry Olshansky28-Jan-2013 15:48, Johannes Pfau пишет:Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My own lexer-parser combination slices directly into the original source code, for every token, in order to remember the exact source location, and the last time I have measured, it ran faster than DMD's. I keep the source around for error reporting anyway: tt.d:27:5: error: no member 'n' for type 'A' a.n=2; ^~~ Since the tokens point directly into the source code, it is not necessary to construct any other data structures in order to allow fast retrieval of the appropriate source code line. But it's clear that a general-purpose library might not want to impose this restriction on storage upon it's clients. I think it is somewhat helpful for speed though. The other thing I do is buffering tokens in a contiguous ring buffer that grows if a lot of lookahead is requested.... But to be fair that doesn't fit ranges very well. If you don't want to do any allocation but still keep identifiers etc in memory this basically means you have to keep the whole source in memory and this is conceptually an array and not a range.Not the whole source but to construct a table of all identifiers. The source is awfully redundant because of repeated identifiers, spaces, comments and what not. 
The set of unique identifiers is rather small.I think the best course of action is to just provide a hook to trigger on every identifier encountered. That could be as discussed earlier a delegate. ...Maybe. I map identifiers to unique id's later, in the identifier AST node constructor, though.
Jan 30 2013
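Dmitry's offset-as-ID scheme might look roughly like this (an associative array stands in for whatever lookup structure a real lexer would use; all names are invented):

```d
// Unique identifiers are appended once to a single growing byte buffer,
// zero-terminated C-style; a token then stores only a 32-bit offset into
// that buffer as its id, so the "map to unique id" step comes for free.
struct IdTable
{
    private ubyte[] chunk;        // all unique identifiers, 0-terminated
    private uint[string] offsets; // identifier text -> offset in chunk

    uint idOf(string s)
    {
        if (auto p = s in offsets)
            return *p;
        auto off = cast(uint) chunk.length;
        chunk ~= cast(const(ubyte)[]) s;
        chunk ~= 0; // C-style sentinel, as suggested in the post
        // keyed by the caller's string for simplicity; a real lexer would
        // copy the key so as not to pin the source buffer
        offsets[s] = off;
        return off;
    }

    // Recover the text from an id by scanning to the sentinel.
    string text(uint off)
    {
        size_t end = off;
        while (chunk[end] != 0)
            ++end;
        return cast(string) chunk[off .. end].idup;
    }
}
```

The per-unique-identifier copy happens exactly once; repeated occurrences cost only a hash lookup.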
On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:In allocation scheme I proposed that ID could be a 32bit offset into the unique identifiers chunk.That only works if you know in advance the max size the chunk can ever be and preallocate it. Otherwise, you have no guarantee that the next allocated chunk will be within 32 bits of address of the previous chunks.
Feb 01 2013
01-Feb-2013 15:05, Walter Bright wrote:On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:Well, I supposed it's exactly one reallocatable block. Then tokens have an offset that doesn't care whether the block was reallocated. Or rather, the reallocation just RESERVEs virtual RAM for it (say 1G), and COMMITs it page by page when you need to grow it. Once lexing is done, shrink the virtual region to the actual used size to free up address space (e.g. if we are on 32 bits). As for the 32-bit limit, a 4Gb maximum for the cumulative length of all unique identifier names is more than enough by any standard. I haven't seen a 4G codebase, to say nothing of the identifiers alone, even if we count all the repetitions separately. -- Dmitry OlshanskyIn the allocation scheme I proposed, the ID could be a 32-bit offset into the unique identifiers chunk.That only works if you know in advance the max size the chunk can ever be and preallocate it. Otherwise, you have no guarantee that the next allocated chunk will be within 32 bits of address of the previous chunks.
Feb 01 2013
On 2/1/2013 3:22 AM, Dmitry Olshansky wrote:01-Feb-2013 15:05, Walter Bright пишет:Your technique can work, provided the number of identifiers isn't large enough that memory fragmentation will prevent being able to reallocate the buffer to a larger size.On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:Well I supposed it's exactly one reallocatable block. Then token have an offset that doesn't care if the block was reallocated. Or rather the reallocating just RESERVE virtual RAM for it (say 1G), and COMMIT it page by page when you need to grow it. Once lexing is done, shrink virtual region to the actual used size to free up address space (e.g. if we are on 32bits). AS for 32bit limit that gives 4Gb maximum of the cumulative length of all unique identifier names is more then enough by any standard. I haven't seen a 4G codebase not to speak of identifiers alone that even if we count all the repetitions separately.In allocation scheme I proposed that ID could be a 32bit offset into the unique identifiers chunk.That only works if you know in advance the max size the chunk can ever be and preallocate it. Otherwise, you have no guarantee that the next allocated chunk will be within 32 bits of address of the previous chunks.
Feb 01 2013
02-Feb-2013 01:10, Walter Bright wrote:On 2/1/2013 3:22 AM, Dmitry Olshansky wrote:I think I failed to describe the virtual memory trick, or you just ignored it? But regardless, I have to correct myself on both the amount to reserve and the fact that it's actually not virtual memory alone but a virtue of the MMU. See below a full program in D showing the trick in action; currently Win32 only, as I can't recall the right POSIX calls offhand. It's a benchmark against the built-in array/appender - the heap stuff. The speed is similar, with my quick and dirty stuff being much faster if arrays don't allocate all of the RAM beforehand. The steps to snatch the power of not having any reallocations: 1. Reserve a contiguous *address space* range. Not even a bit of virtual memory is reserved/committed/touched at this moment. This just creates a TLB entry, AFAICT. (For the lexer this address space range would have length equal to the sum of all file lengths, as Tove points out; this is the first correction.) 2. Any time there is no memory to put more stuff into (as is initially the case), commit the next few pages. At this moment virtual RAM is committed but not allocated. 3. [This I think we all know.] The moment a memory location in the committed part of the range is touched, the page for it is allocated (and on Win32 initially contains zeros automatically). Only at this point is memory actually allocated. So you see where my 2nd correction goes. In essence you get 0 reallocations and no more than 1 page of virtual memory wasted in all cases. Sweet deal ;) Then there is an optional step if you think that too much address space is used (makes sense only on 32 bits, if at all): copy the resulting block of actual data elsewhere in the heap and release the whole address range.
The code: (the same as the github gist: https://gist.github.com/4699388 )

///Written in the D programming language
// Bench built-in array append, std.array.Appender
// and custom virtual-memory aware VirtualArray in terms of
// memory usage and the time taken
// to append up to X megabytes using different chunk sizes:
// 16 bytes, 64 bytes, 256 bytes and 16Kb at a time.

// pass this version to insert readln's
// at interesting points and investigate RAM usage;
// make sure to enable relevant columns in task manager process details:
// committed size and private working set
// version = diagnose;

import std.c.windows.windows;
import std.exception, std.datetime, std.algorithm, std.range, std.array;

auto roundUpToPowerOf2(size_t var, size_t pow2)
{
    return (var + pow2 - 1) & ~(pow2 - 1);
}

struct VirtualArray
{
private:
    ubyte* pointer;  // to the beginning of the reserved memory
    size_t _length;  // length of data
    size_t commited; // committed memory so far
    size_t reserved; // total possible maximum to grow to
    size_t pageSize; // page size, this could be global

    // commit some more memory from the reserved range
    void extend(size_t size)
    {
        size = roundUpToPowerOf2(size, pageSize);
        // this will throw once we run out of reserved pages
        enforce(VirtualAlloc(pointer + commited, size, MEM_COMMIT, PAGE_READWRITE));
        commited += size;
    }

public:
    this(size_t maxCapacity)
    {
        SYSTEM_INFO sysinfo;
        GetSystemInfo(&sysinfo);
        pageSize = sysinfo.dwPageSize; // the page size is a power of 2
        // round the capacity up to a multiple of page size
        maxCapacity = roundUpToPowerOf2(maxCapacity, pageSize);
        pointer = cast(ubyte*)enforce(VirtualAlloc(null, maxCapacity, MEM_RESERVE, PAGE_READWRITE));
        commited = 0;
        reserved = maxCapacity;
        _length = 0;
    }

    // bare minimum primitives to run the benchmark
    ref ubyte opIndex(size_t idx)
    {
        assert(idx < length);
        return pointer[idx];
    }

    // ditto
    ubyte[] opSlice(size_t from, size_t to)
    {
        assert(from < to && to <= length);
        return pointer[from..to];
    }

    // ditto
    @property size_t length(){ return _length; }

    void opOpAssign(string op:"~")(ubyte[] data)
    {
        size_t end = length + data.length;
        while(end > commited)
            extend(end - commited);
        pointer[length..end] = data[];
        _length = end;
    }

    ~this()
    {
        // should not throw, struct is not copyable
        enforce(VirtualFree(pointer, 0, MEM_RELEASE));
    }
}

import std.stdio;

void timeIt(string prefix, void delegate() run)
{
    StopWatch sw;
    sw.start();
    run();
    sw.stop();
    writefln(prefix ~ " %4s ms", sw.peek().msecs);
}

bool verify(Arr)(ubyte[] pattern, ref Arr arr)
{
    for(size_t i = 0; i < arr.length; i += pattern.length)
    {
        if(arr[i..i+pattern.length] != pattern[])
        {
            writeln(arr[i..i+pattern.length], " vs ", pattern);
            return false;
        }
    }
    return true;
}

void main()
{
    size_t totalSize = 120*2^^20;
    auto chunks = [
        iota(0, 16).map!(x=>cast(ubyte)x).array,
        iota(0, 64).map!(x=>cast(ubyte)x).array,
        iota(0, 256).map!(x=>cast(ubyte)x).array,
        iota(0, 2^^10).map!(x=>cast(ubyte)x).array
    ];
    auto titles = [ " 16b", " 64b", "256b", " 16K" ];
    import core.memory;
    GC.disable(); // to prevent measuring a collection by mistake

    foreach(n, chunk; chunks)
    {
        // reserve memory anew on each iteration;
        // I was able to easily go up to 1.5 G on 32 bits
        // note: there are no reallocations here at all
        version(diagnose) if(n == 0){ writeln("Before reserving address space"); readln(); }
        auto virtArr = VirtualArray(totalSize);
        version(diagnose) if(n == 0){ writeln("Reserved address space"); readln(); }
        timeIt(titles[n]~" virtual", (){
            foreach(i; 0..totalSize/chunk.length)
            {
                virtArr ~= chunk;
            }
        });
        // take a pause here to investigate RAM usage
        version(diagnose) if(n == 0){ writeln("Commited all of virtual memory"); readln(); }
        // time to verify is the same for all with -O -inline
        enforce(verify(chunk, virtArr));
    }
    writeln("============");

    foreach(n, chunk; chunks)
    {
        ubyte[] arr;
        // the GC is out of the game as soon as 512 Mbytes on 32 bits,
        // as I hit OutOfMemory on a 2nd iteration;
        // try to help the poor runtime:
        // without reserve it crashes and burns at about 120 M,
        // with reserve it's on par with chunk >= 256
        // but it _commits_ all memory beforehand!
        version(diagnose) if(n == 0){ writeln("Before array.reserve()"); readln(); }
        arr.reserve(totalSize);
        version(diagnose) if(n == 0){ writeln("After array.reserve()"); readln(); }
        timeIt(titles[n]~" array  ", (){
            foreach(i; 0..totalSize/chunk.length)
            {
                arr ~= chunk;
            }
        });
        version(diagnose) if(n == 0){ writeln("After all data touched"); readln(); }
        // time to verify is the same for all with -O -inline
        enforce(verify(chunk, arr));
        // sadly needed to not run out of memory on 32 bits
        delete arr;
    }
    writeln("============");

    foreach(n, chunk; chunks)
    {
        {
            // appender is supposed to be faster than array on appends,
            // but it's actually slower starting at 256-byte chunks
            // and esp. so with 16Kb chunks
            auto app = appender!(ubyte[])();
            timeIt(titles[n]~" appender", (){
                foreach(i; 0..totalSize/chunk.length)
                {
                    app.put(chunk);
                }
            });
            auto appArr = app.data();
            enforce(verify(chunk, appArr));
        }
        GC.collect();
    }
}

-- Dmitry Olshansky01-Feb-2013 15:05, Walter Bright wrote:Your technique can work, provided the number of identifiers isn't large enough that memory fragmentation will prevent being able to reallocate the buffer to a larger size.On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:Well, I supposed it's exactly one reallocatable block. Then tokens have an offset that doesn't care whether the block was reallocated. Or rather, the reallocation just RESERVEs virtual RAM for it (say 1G), and COMMITs it page by page when you need to grow it. Once lexing is done, shrink the virtual region to the actual used size to free up address space (e.g. if we are on 32 bits). As for the 32-bit limit, a 4Gb maximum for the cumulative length of all unique identifier names is more than enough by any standard. I haven't seen a 4G codebase, to say nothing of the identifiers alone, even if we count all the repetitions separately.In allocation scheme I proposed that ID could be a 32bit offset into the unique identifiers chunk.That only works if you know in advance the max size the chunk can ever be and preallocate it. Otherwise, you have no guarantee that the next allocated chunk will be within 32 bits of address of the previous chunks.
Feb 02 2013
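Dmitry's offset-as-ID scheme can be sketched in a few lines. This is a hypothetical, illustrative structure (not std.d.lexer's actual code): all unique names live in one contiguous chunk, '\0'-separated, and an identifier's ID is its 32-bit offset into that chunk, which stays valid even if the chunk is reallocated.

```d
/// Hypothetical identifier table: all unique names in one chunk,
/// the ID is the 32-bit offset of the name inside it.
struct IdTable
{
    private char[] buf;        // the unique-identifiers chunk ('\0'-separated)
    private uint[string] ids;  // name -> offset, for uniqueness lookups

    /// Intern `name`, returning a stable 32-bit ID (an offset into buf).
    uint intern(const(char)[] name)
    {
        if (auto p = name in ids)
            return *p;
        immutable off = cast(uint) buf.length;
        buf ~= name;
        buf ~= '\0';           // identifiers never contain '\0'
        ids[name.idup] = off;
        return off;
    }

    /// Recover the text for an ID; offsets survive reallocation of buf.
    const(char)[] text(uint id) const
    {
        size_t end = id;
        while (buf[end] != '\0') ++end;
        return buf[id .. end];
    }
}
```

Interning the same name twice yields the same ID, so token comparison becomes a 32-bit integer comparison.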
On 2/2/2013 1:54 PM, Dmitry Olshansky wrote:I think I failed to describe the virtual memory trick or you just ignored it?You can't just reserve 1Gb of address space in a 32 bit system. (I know about reserving address space vs committing memory.)
Feb 02 2013
03-Feb-2013 05:30, Walter Bright wrote:On 2/2/2013 1:54 PM, Dmitry Olshansky wrote:OK, sorry, you must know this stuff thoroughly. 1Gb is not the point; I feel like I shouldn't have mentioned 1Gb at all, as we need much less. It's about 21 Mb of D source code in the full dmd test suite + druntime + phobos, and only 4M in vibe.d including all examples. Reserve this amount. It's the same amount of memory as in the read-them-all-and-slice-them-all approach that is said to be so fast and practical. And yet we don't commit this RAM at all most of the time. Thinking more of 32-bit systems that are so tight on address space - it's 2Gb for user mode. Precisely because of that, a signed 32-bit offset is enough to address RAM if we store it in multiple chunks like you said. The fact that there is only 2Gb of user space actually sort of helps us, as we surely can reach the other chunks from a single base. In case of 3Gb (some Linuxes and/or large-address-aware win32) we just need to reserve the first range somewhere in the middle; as we do it at startup this should be easy with a few probes. Also, after the lexing you can always free the address space and reallocate-compact it as the last step, like I said earlier. -- Dmitry OlshanskyI think I failed to describe the virtual memory trick or you just ignored it?You can't just reserve 1Gb of address space in a 32 bit system. (I know about reserving address space vs committing memory.)
Feb 03 2013
On Friday, 1 February 2013 at 11:06:02 UTC, Walter Bright wrote:On 1/30/2013 8:44 AM, Dmitry Olshansky wrote:This can easily be achieved by preallocating file.size bytes... it will be x orders of magnitude too much, but it doesn't matter, as in the end only the cache locality matters.In allocation scheme I proposed that ID could be a 32bit offset into the unique identifiers chunk.That only works if you know in advance the max size the chunk can ever be and preallocate it. Otherwise, you have no guarantee that the next allocated chunk will be within 32 bits of address of the previous chunks.
Feb 01 2013
Very happy to see that! Some remarks:
- Many parameters should be compile-time parameters instead of runtime ones.
- I'm not a big fan of the byToken name, but let's see what others think of it.
- I'm not sure it's the role of the lexer to process __IDENTIFIER__ special stuff.
- You need to provide a way to specify how the textual representation of the token (i.e. value) is set. The best way to do it IMO is an alias parameter that returns a string when called with a string (then the user can choose to keep the value from the original string, create a copy, always get the same copy for the same string, etc.).
- Ideally, the location format should be configurable.
- You must return at least a forward range, and not an input range, otherwise a lexer cannot look ahead.
- I'm not sure about making TokenRange a class.
As a more positive note, I was just in need of something like this today and wondered if such a project was going to finally happen. This is a great step forward! Some stuff still needs to be polished, but who gets everything right the first time?
Jan 27 2013
On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:Very happy to see that! Some remarks:
- Many parameters should be compile-time parameters instead of runtime ones.
- I'm not a big fan of the byToken name, but let's see what others think of it.
- I'm not sure it's the role of the lexer to process __IDENTIFIER__ special stuff.
- You need to provide a way to specify how the textual representation of the token (i.e. value) is set. The best way to do it IMO is an alias parameter that returns a string when called with a string (then the user can choose to keep the value from the original string, create a copy, always get the same copy for the same string, etc.).
- Ideally, the location format should be configurable.
- You must return at least a forward range, and not an input range, otherwise a lexer cannot look ahead.
- I'm not sure about making TokenRange a class.
As a more positive note, I was just in need of something like this today and wondered if such a project was going to finally happen. This is a great step forward! Some stuff still needs to be polished, but who gets everything right the first time?And the famous Jobs « one last thing »: I'm not a big fan of having OPERATORS_BEGIN be of the same type as regular token types. Now they make valid tokens. Why not provide a set of functions like isOperator?
Jan 27 2013
On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:Very happy to see that! Some remarks: - Many parameters should be compile-time parameters instead of runtime ones.I decided not to do this because the lexer actually calls itself while parsing token strings. If they were compile-time parameters, the compiler would likely generate a lot more code.- I'm not a big fan of the byToken name, but let's see what others think of it.Chosen for consistency with the various functions in std.stdio, but now that you point this out, it's not very consistent with std.algorithm or std.range.- I'm not sure it's the role of the lexer to process __IDENTIFIER__ special stuff.According to http://dlang.org/lex#specialtokens this is the correct behavior.- You need to provide a way to specify how the textual representation of the token (i.e. value) is set. The best way to do it IMO is an alias parameter that returns a string when called with a string (then the user can choose to keep the value from the original string, create a copy, always get the same copy for the same string, etc.).The lexer does not operate on slices of its input. It would be possible to special-case for this in the future.- Ideally, the location format should be configurable. - You must return at least a forward range, and not an input range, otherwise a lexer cannot look ahead.It's easy to wrap this range inside of another that does buffering for lookahead. https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/circularbuffer.dAnd the famous Jobs « one last thing »: I'm not a big fan of having OPERATORS_BEGIN be of the same type as regular token types. Now they make valid tokens. Why not provide a set of functions like isOperator?This eliminates possible uses of the case range statement. It may be the case (ha ha) that nobody cares and would rather have those functions. I'd be fine with changing that.
Jan 27 2013
On Sunday, 27 January 2013 at 10:55:49 UTC, Brian Schott wrote:On Sunday, 27 January 2013 at 10:32:39 UTC, deadalnix wrote:Have you measured?Very happy to see that! Some remarks: - Many parameters should be compile-time parameters instead of runtime ones.I decided not to do this because the lexer actually calls itself while parsing token strings. If they were compile-time parameters, the compiler would likely generate a lot more code.Even without special-casing for slices of the input, generating a new string or reusing an existing one is already an important thing. Depending on the processing that comes after, it may be of great importance.- You need to provide a way to specify how the textual representation of the token (i.e. value) is set. The best way to do it IMO is an alias parameter that returns a string when called with a string (then the user can choose to keep the value from the original string, create a copy, always get the same copy for the same string, etc.).The lexer does not operate on slices of its input. It would be possible to special-case for this in the future.
Jan 27 2013
In one of the last discussions about a standard lexer/parser I remember quite a neat proposal - take a delegate for error handling and provide two out of the box (one that throws an exception and one that returns an Error token).
Jan 27 2013
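The delegate-based error-handling proposal above might look roughly like this; all names are illustrative, not std.d.lexer's actual API.

```d
// Sketch: the lexer takes an error callback; two stock behaviors are shipped.
import std.format : format;

/// Error callback signature: file, line, column, message.
alias ErrorHandler = void delegate(string, uint, uint, string);

class LexError : Exception
{
    this(string msg) { super(msg); }
}

/// Stock behavior 1: abort lexing by throwing.
void throwOnError(string file, uint line, uint col, string msg)
{
    throw new LexError(format("%s(%s:%s): %s", file, line, col, msg));
}

/// Stock behavior 2: record the message and keep lexing (the lexer would
/// then emit an error token and resume at the next safe point).
struct ErrorCollector
{
    string[] errors;
    void report(string file, uint line, uint col, string msg)
    {
        errors ~= format("%s(%s:%s): %s", file, line, col, msg);
    }
    /// The delegate to hand to the lexer.
    ErrorHandler handler() { return &report; }
}
```

A compiler front end would pass the throwing handler; an IDE would pass the collecting one and highlight the recorded positions.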
Let's add another question: what about treating q{ } token strings as... well, a list of tokens? IDEs would like this: no need to reparse the string; the tokens are there directly.
Jan 27 2013
On 2013-01-27 12:51, Philippe Sigaud wrote:Let's add another question: what about treating q{ } token strings as... well, a list of tokens? IDE would like this: no need to reparse the string, the token are there directly.Perhaps an option for this. -- /Jacob Carlborg
Jan 27 2013
On 2013-01-27 10:51, Brian Schott wrote:I'm writing a D lexer for possible inclusion in Phobos. DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html Code: https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d It's currently able to correctly syntax highlight all of Phobos, but does a fairly bad job at rejecting or notifying users/callers about invalid input. I'd like to hear arguments on the various ways to handle errors in the lexer. In a compiler it would be useful to throw an exception on finding something like a string literal that doesn't stop before EOF, but a text editor or IDE would probably want to be a bit more lenient. Maybe having it run-time (or compile-time configurable) like std.csv would be the best option here. I'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)How about changing the type of TokenType to ushort, if all members fit. Just to minimize the size of a token. -- /Jacob Carlborg
Jan 27 2013
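What a ushort token type buys depends on the rest of the layout. With a 16-byte string field the padding eats the gain, but with an offset-based token (an illustrative layout, not std.d.lexer's actual Token) the difference is visible:

```d
// Illustrative token layouts: the enum width changes the struct size.
enum TypeInt : int { identifier, number }       // 4-byte enum
enum TypeShort : ushort { identifier, number }  // 2-byte enum

struct TokenInt   { uint index; uint line; ushort column; TypeInt type; }
struct TokenShort { uint index; uint line; ushort column; TypeShort type; }

// 2 bytes of padding appear before the int-sized enum, while the ushort
// enum packs right after `column` (fixed D type sizes, 4-byte alignment).
static assert(TokenInt.sizeof == 16);
static assert(TokenShort.sizeof == 12);
```

Four bytes per token is significant when a large module produces hundreds of thousands of tokens.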
On 1/27/2013 1:51 AM, Brian Schott wrote:I'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)Just a quick comment: byToken() should not accept a filename. Its input should be via an InputRange, not a file.
Jan 27 2013
27-Jan-2013 23:46, Walter Bright wrote:On 1/27/2013 1:51 AM, Brian Schott wrote:An InputRange of char/wchar/dchar, I guess, as the lexer should be able to avoid decoding. -- Dmitry OlshanskyI'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)Just a quick comment: byToken() should not accept a filename. Its input should be via an InputRange, not a file.
Jan 27 2013
On 1/27/2013 12:11 PM, Dmitry Olshansky wrote:27-Jan-2013 23:46, Walter Bright пишет:Just char would suffice. If a user needs wchar/dchar, they can put a map in between the InputRange and the lexer.On 1/27/2013 1:51 AM, Brian Schott wrote:InputRange of char/wchar/dchar I guess as lexer should have an ability to avoid decoding.I'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)Just a quick comment: byToken() should not accept a filename. It's input should be via an InputRange, not a file.
Jan 27 2013
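Walter's "put a map in between" suggestion can be done with std.utf.byChar, which iterates any char/wchar/dchar source by UTF-8 code unit. The countNonSpace function below is only a stand-in for a hypothetical char-only lexer entry point:

```d
// Sketch: a lexer entry point that accepts only ranges of char; wider
// sources are adapted with std.utf.byChar before being passed in.
import std.range.primitives : isInputRange, ElementType;
import std.utf : byChar;

size_t countNonSpace(R)(R range)
    if (isInputRange!R && is(ElementType!R : char))
{
    size_t n;
    foreach (char c; range)
        if (c != ' ')
            ++n;
    return n;
}
```

Note that a plain string is rejected by the constraint (its autodecoded element type is dchar), which is exactly the point: the caller must opt into code-unit iteration with byChar.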
On Sunday, 27 January 2013 at 19:46:12 UTC, Walter Bright wrote:On 1/27/2013 1:51 AM, Brian Schott wrote:The file name is accepted for eventual error reporting purposes. The actual input for the lexer is the parameter called "range". Regarding the times that I posted, my point was that it's not slower than "dmd -c", nothing more.I'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)Just a quick comment: byToken() should not accept a filename. It's input should be via an InputRange, not a file.
Jan 27 2013
On 01/27/2013 10:39 PM, Brian Schott wrote:... Regarding the times that I posted, my point was that it's not slower than "dmd -c", nothing more.Sure. The point you brought across, however, was that it is not significantly faster yet. :o)
Jan 27 2013
On 1/27/2013 1:39 PM, Brian Schott wrote:The file name is accepted for eventual error reporting purposes.Use an OutputRange for that.
Jan 27 2013
On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:On 1/27/2013 1:39 PM, Brian Schott wrote:What about that delegate-based design? I thought everyone agreed that it was nice? DavidThe file name is accepted for eventual error reporting purposes.Use an OutputRange for that.
Jan 27 2013
On 1/27/2013 4:48 PM, David Nadlinger wrote:On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:An OutputRange is a way of doing that. The advantage of OutputRange's is that is TheWayToDoThings in Phobos so that components can all interoperate and plug into each other.On 1/27/2013 1:39 PM, Brian Schott wrote:What about that delegate-based design? I thought everyone agreed that it was nice?The file name is accepted for eventual error reporting purposes.Use an OutputRange for that.
Jan 27 2013
On Monday, 28 January 2013 at 00:51:28 UTC, Walter Bright wrote:On 1/27/2013 4:48 PM, David Nadlinger wrote:I was talking about the design you proposed yourself here: http://forum.dlang.org/post/jvp9ke$2m45$1 digitalmars.com Oh, and you really don't need to give me the basic Phobos/ranges sales pitch, I think I'm quite aware of their advantages. I'm just not sure that e.g. having an "exception thrower" output range would be a wise design decision. DavidOn Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:An OutputRange is a way of doing that. The advantage of OutputRange's is that is TheWayToDoThings in Phobos so that components can all interoperate and plug into each other.On 1/27/2013 1:39 PM, Brian Schott wrote:What about that delegate-based design? I thought everyone agreed that it was nice?The file name is accepted for eventual error reporting purposes.Use an OutputRange for that.
Jan 27 2013
On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:On 1/27/2013 1:39 PM, Brian Schott wrote:I think you misunderstand. The file name is so that if you pass in "foo.d" the lexer can say "Error: unterminated string literal beginning on line 123 of foo.d". It's not so that error messages will be written to a file of that name. On the topic of performance, I realized that the numbers posted previously were actually for a debug build. Fail. For whatever reason, the current version of the lexer code isn't triggering my heisenbug[1] and I was able to build with -release -inline -O. Here's what avgtime has to say:

$ avgtime -q -h -r 200 dscanner --tokenCount ../phobos/std/datetime.d
------------------------
Total time (ms): 51409.8
Repetitions : 200
Sample mode : 250 (169 ocurrences)
Median time : 255.57
Avg time : 257.049
Std dev. : 4.39338
Minimum : 252.931
Maximum : 278.658
95% conf.int. : [248.438, 265.66] e = 8.61087
99% conf.int. : [245.733, 268.366] e = 11.3166
EstimatedAvg95%: [256.44, 257.658] e = 0.608881
EstimatedAvg99%: [256.249, 257.849] e = 0.800205
Histogram : msecs: count normalized bar

Which works out to 1,327,784 tokens per second on my Ivy Bridge i7. I created a small program that demangles the output of valgrind so that tools like KCachegrind can display profiling information more clearly. It's now on the wiki[2] The bottleneck in std.d.lexer as it stands is the appender instances that assemble Token.value during iteration and front() on the array of char[]. (As I'm sure everyone expected) [1] http://forum.dlang.org/thread/bug-9353-3 http.d.puremagic.com%2Fissues%2F [2] http://wiki.dlang.org/Other_Dev_ToolsThe file name is accepted for eventual error reporting purposes.Use an OutputRange for that.
Jan 27 2013
On 1/27/2013 4:53 PM, Brian Schott wrote:On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:Yes, I did misunderstand. I suggest updating the documentation to clear up the misunderstanding.On 1/27/2013 1:39 PM, Brian Schott wrote:I think you misunderstand. The file name is so that if you pass in "foo.d" the lexer can say "Error: unterminated string literal beginning on line 123 of foo.d". It's not so that error messagaes will be written to a file of that name.The file name is accepted for eventual error reporting purposes.Use an OutputRange for that.
Jan 27 2013
On Monday, 28 January 2013 at 00:53:03 UTC, Brian Schott wrote:On Sunday, 27 January 2013 at 23:49:11 UTC, Walter Bright wrote:I don't think that is a good idea. For instance, mixins need to be lexed but don't come from a file. The lexer should report the error; what is done on error is up to the user of the lexer.On 1/27/2013 1:39 PM, Brian Schott wrote:I think you misunderstand. The file name is so that if you pass in "foo.d" the lexer can say "Error: unterminated string literal beginning on line 123 of foo.d". It's not so that error messages will be written to a file of that name.The file name is accepted for eventual error reporting purposes.Use an OutputRange for that.
Jan 27 2013
Am Mon, 28 Jan 2013 01:53:02 +0100 schrieb "Brian Schott" <briancschott gmail.com>:The bottleneck in std.d.lexer as it stands is the appender instances that assemble Token.value during iteration and front() on the array of char[]. (As I'm sure everyone expected)This sounds like a valid use case for buffered ranges (which we don't have yet, afaik). When used correctly you could avoid the appender completely. Instead slice the input buffer and copy it if necessary. (If the complete file is kept in memory copying could also be avoided)
Jan 28 2013
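Slicing instead of appending, as Johannes suggests, can look like this minimal identifier scanner (illustrative only): mark the start index, slice on token end, zero copies.

```d
/// Minimal identifier scanner over an in-memory buffer: the token value is
/// a slice of the source, so no appender and no per-token allocation.
struct SliceLexer
{
    string src;
    size_t pos;

    private static bool isIdentChar(char c)
    {
        return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
            || ('0' <= c && c <= '9') || c == '_';
    }

    /// Returns the next identifier as a slice of `src`, or null at the end.
    string nextIdent()
    {
        while (pos < src.length && !isIdentChar(src[pos])) ++pos;
        if (pos >= src.length) return null;
        immutable start = pos;                       // mark token start
        while (pos < src.length && isIdentChar(src[pos])) ++pos;
        return src[start .. pos];                    // slice, no copy
    }
}
```

The trade-off Jacob raises applies: every returned slice pins the whole source buffer for the GC, which is fine only if the file is kept in memory anyway.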
On 2013-01-28 09:51, Johannes Pfau wrote:This sounds like a valid use case for buffered ranges (which we don't have yet, afaik). When used correctly you could avoid the appender completely. Instead slice the input buffer and copy it if necessary. (If the complete file is kept in memory copying could also be avoided)I think slicing the buffer will force the whole buffer to remain in memory and not be collected by the GC. If one keeps the whole buffer in memory anyway this won't be a problem. -- /Jacob Carlborg
Jan 28 2013
On 1/28/2013 1:05 AM, Jacob Carlborg wrote:If think slicing the buffer will force the whole buffer to remain in memory and not be collected by the GC. If one keeps the whole buffer in memory anyway this won't be a problem.Well, the source buffer can be large, and will span a lot of memory cache lines, making accessing those slices slow.
Jan 28 2013
On 2013-01-28 11:14, Walter Bright wrote:Well, the source buffer can be large, and will span a lot of memory cache lines, making accessing those slices slow.Would it be better to .dup the slices? Won't that be just as slow as using the appender? -- /Jacob Carlborg
Jan 28 2013
28-Jan-2013 14:39, Jacob Carlborg wrote:On 2013-01-28 11:14, Walter Bright wrote:It would be better to lay these out compactly one by one in a separate buffer, skipping all of the extra slack in the source file(s). -- Dmitry OlshanskyWell, the source buffer can be large, and will span a lot of memory cache lines, making accessing those slices slow.Would it be better to .dup the slices? Won't that be just as slow as using the appender?
Jan 28 2013
On Monday, 28 January 2013 at 11:00:09 UTC, Dmitry Olshansky wrote:28-Jan-2013 14:39, Jacob Carlborg пишет:Would it also make sense to do small string optimization? On 64 bit, you could use the memory used by the string field itself to store strings of length up to 15 characters.On 2013-01-28 11:14, Walter Bright wrote:It would be better to compactly layout these one by one in a separate buffer skipping all of the extra slack in the source file(s).Well, the source buffer can be large, and will span a lot of memory cache lines, making accessing those slices slow.Would it be better to .dup the slices? Won't that be just as slow as using the appender?
Jan 28 2013
28-Jan-2013 18:08, jerro wrote:On Monday, 28 January 2013 at 11:00:09 UTC, Dmitry Olshansky wrote:It very well could be iff the slice is a field in the Token struct. -- Dmitry Olshansky28-Jan-2013 14:39, Jacob Carlborg wrote:Would it also make sense to do small string optimization? On 64 bit, you could use the memory used by the string field itself to store strings of length up to 15 characters.On 2013-01-28 11:14, Walter Bright wrote:It would be better to compactly layout these one by one in a separate buffer skipping all of the extra slack in the source file(s).Well, the source buffer can be large, and will span a lot of memory cache lines, making accessing those slices slow.Would it be better to .dup the slices? Won't that be just as slow as using the appender?
Jan 28 2013
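A sketch of the small-string optimization jerro describes, for 64-bit: values up to 15 bytes are stored inline in the same 16 bytes a string slice would occupy; longer values fall back to a pointer/length pair. A real implementation would steal a bit for the discriminant instead of the separate `inline` flag used here for clarity.

```d
struct SmallString
{
    private union
    {
        struct { immutable(char)* ptr; size_t len; } // slice form, 16 bytes
        char[16] small;  // inline form; small[15] holds the length (0..15)
    }
    private bool inline; // discriminant, kept separate for readability

    this(string s)
    {
        if (s.length <= 15)
        {
            small[0 .. s.length] = s[];
            small[15] = cast(char) s.length;
            inline = true;
        }
        else
        {
            ptr = s.ptr;
            len = s.length;
        }
    }

    string toString() const
    {
        return inline ? small[0 .. small[15]].idup : ptr[0 .. len].idup;
    }
}
```

Since most identifiers are short, the common case pays no allocation and stays on one cache line, as Dmitry notes, provided the slice lives directly in the Token struct.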
On 01/28/2013 01:53 AM, Brian Schott wrote:... On the topic of performance, I realized that the numbers posted previously were actually for a debug build. Fail. For whatever reason, the current version of the lexer code isn't triggering my heisenbug[1] and I was able to build with -release -inline -O. Here's what avgtime has to say: $ avgtime -q -h -r 200 dscanner --tokenCount ../phobos/std/datetime.d ------------------------ Total time (ms): 51409.8 Repetitions : 200 Sample mode : 250 (169 ocurrences) Median time : 255.57 Avg time : 257.049 Std dev. : 4.39338 Minimum : 252.931 Maximum : 278.658 95% conf.int. : [248.438, 265.66] e = 8.61087 99% conf.int. : [245.733, 268.366] e = 11.3166 EstimatedAvg95%: [256.44, 257.658] e = 0.608881 EstimatedAvg99%: [256.249, 257.849] e = 0.800205 Histogram : msecs: count normalized bar Which works out to 1,327,784 tokens per second on my Ivy Bridge i7.Better, but still slow.I created a small program that demangles the output of valgrind so that tools like KCachegrind can display profiling information more clearly. It's now on the wiki[2] The bottleneck in std.d.lexer as it stands is the appender instances that assemble Token.value during iteration and front() on the array of char[]. (As I'm sure everyone expected)I see, probably there should be an option to do this by slicing instead. Also try to treat narrow strings in such a way that they do not incur undue decoding overhead. I guess that at some point pure nothrow TokenType lookupTokenType(const string input) might become a bottleneck. (DMD does not generate near-optimal string switches, I think.)
Jan 28 2013
On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:Better, but still slow.I implemented the various suggestions from a past thread and made the lexer work only on ubyte[] (to avoid Phobos converting everything to dchar all the time) and gave the tokenizer instance a character buffer that it re-uses. Results:

$ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d
------------------------
Total time (ms): 13861.8
Repetitions : 200
Sample mode : 69 (90 ocurrences)
Median time : 69.0745
Avg time : 69.3088
Std dev. : 0.670203
Minimum : 68.613
Maximum : 72.635
95% conf.int. : [67.9952, 70.6223] e = 1.31357
99% conf.int. : [67.5824, 71.0351] e = 1.72633
EstimatedAvg95%: [69.2159, 69.4016] e = 0.0928836
EstimatedAvg99%: [69.1867, 69.4308] e = 0.12207

If my math is right, that means it's getting 4.9 million tokens/second now. According to Valgrind the only way to really improve things now is to require that the input to the lexer support slicing. (Remember the secret of Tango's XML parser...) The bottleneck is now on the calls to .idup to construct the token strings from slices of the buffer.I guess that at some point pure nothrow TokenType lookupTokenType(const string input) might become a bottleneck. (DMD does not generate near-optimal string switches, I think.)Right now that's a fairly small box on KCachegrind.
Jan 30 2013
Am 30.01.2013 10:49, schrieb Brian Schott:On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:but you still need to compare that to the current dmd lexer - nothing else can be a benchmark referenceBetter, but still slow.I implemented the various suggestions from a past thread and made the lexer only work ubyte[] (to aviod phobos converting everything to dchar all the time) and gave the tokenizer instance a character buffer that it re-uses. Results: $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d ..... If my math is right, that means it's getting 4.9 million tokens/second now. According to Valgrind the only way to really improve things now is to require that the input to the lexer support slicing. (Remember the secret of Tango's XML parser...) The bottleneck is now on the calls to .idup to construct the token strings from slices of the buffer.
Jan 30 2013
On 2013-01-30 10:49, Brian Schott wrote:If my math is right, that means it's getting 4.9 million tokens/second now. According to Valgrind the only way to really improve things now is to require that the input to the lexer support slicing. (Remember the secret of Tango's XML parser...) The bottleneck is now on the calls to .idup to construct the token strings from slices of the buffer.Yep. I'm eager to see what timings you get with the whole file kept in memory and sliced, without copying token strings.
Jan 30 2013
30-Jan-2013 13:49, Brian Schott wrote:On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:idup --> allocation. Instead I suggest trying to allocate a big block of fixed size (say about 16-64K) upfront and copy identifiers one by one there. When it fills, just allocate another one and move on. If an identifier is exceptionally long, then you can just idup it as before. This should bring down the number of calls to the GC significantly.Better, but still slow.I implemented the various suggestions from a past thread and made the lexer work only on ubyte[] (to avoid Phobos converting everything to dchar all the time) and gave the tokenizer instance a character buffer that it re-uses. Results: $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d ------------------------ Total time (ms): 13861.8 Repetitions : 200 Sample mode : 69 (90 ocurrences) Median time : 69.0745 Avg time : 69.3088 Std dev. : 0.670203 Minimum : 68.613 Maximum : 72.635 95% conf.int. : [67.9952, 70.6223] e = 1.31357 99% conf.int. : [67.5824, 71.0351] e = 1.72633 EstimatedAvg95%: [69.2159, 69.4016] e = 0.0928836 EstimatedAvg99%: [69.1867, 69.4308] e = 0.12207 If my math is right, that means it's getting 4.9 million tokens/second now. According to Valgrind the only way to really improve things now is to require that the input to the lexer support slicing. (Remember the secret of Tango's XML parser...) The bottleneck is now on the calls to .idup to construct the token strings from slices of the buffer.-- Dmitry OlshanskyI guess that at some point pure nothrow TokenType lookupTokenType(const string input) might become a bottleneck. (DMD does not generate near-optimal string switches, I think.)Right now that's a fairly small box on KCachegrind.
Jan 30 2013
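Dmitry's block-allocation suggestion as a sketch (illustrative names, not std.d.lexer code): copy each identifier into a large fixed-size block with a bump pointer, starting a new block when the current one fills, and fall back to idup only for exceptionally long identifiers.

```d
/// Bump allocator for identifier copies: one GC allocation per 64K block
/// instead of one per identifier.
struct IdentAllocator
{
    enum blockSize = 64 * 1024;
    private char[][] blocks;
    private size_t used = blockSize;   // forces a block on first use

    const(char)[] copy(const(char)[] ident)
    {
        if (ident.length > blockSize)
            return ident.idup;          // exceptionally long: fall back
        if (used + ident.length > blockSize)
        {
            blocks ~= new char[blockSize];
            used = 0;
        }
        auto dst = blocks[$ - 1][used .. used + ident.length];
        dst[] = ident[];
        used += ident.length;
        return dst;
    }
}
```

Successive copies land back to back in the same block, which is also good for the cache locality discussed earlier in the thread.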
On 2013-01-30 17:50, Dmitry Olshansky wrote:Instead I suggest to try and allocate a big block of fixed size (say about 16-64K) upfront and copy identifiers one by one there. When it fills just allocate another one and move on.Yeah, similar to what I suggested, except that probably it would be good to also have a look-up structure for identifiers, so that only unique strings go into those blocks. I wonder what would be a practical data structure for such look-ups: A trie is great except for the implementation hurdle to keep it also in one or a few memory blocks to prevent frequent allocations. A simpler approach would be to make it an array of (hash, string_slice) pairs, sorted by hash (of the identifier) - lots of string scanning though. What do you think?
Jan 30 2013
30-Jan-2013 21:21, FG пишет:On 2013-01-30 17:50, Dmitry Olshansky wrote:Mm trie shouldn't have that many allocations. Typically each node is fixed-sized. Also I don't see keeping it in many memory blocks as a show stopper. The trick is exactly the same as I mentioned above but blocks got to be quite a bit large (say 8 times :) ). Truth be told the trie should be optimal for this, but a lot of effort is required to implement a fast trie compared to rolling out a simple hash-table. That's something I hope to change once my Trie template is polished enough for Phobos proposal.Instead I suggest to try and allocate a big block of fixed size (say about 16-64K) upfront and copy identifiers one by one there. When it fills just allocate another one and move on.Yeah, similar to what I suggested, except that probably it would be good to also have a look-up structure for identifiers, so that only unique strings go into those blocks. I wonder what would be a practical data structure for such look-ups: A trie is great except for the implementation hurdle to keep it also in one or a few memory blocks to prevent frequent allocations.A simpler approach would be to make it an array of (hash, string_slice) pairs, sorted by hash (of the identifier) - lots of string scanning though.Mm since you need to find what's unique as you lex the file I suspect that sorting is out of the window. I'd be modest and for starters just use a hash-table + tune the hash function a bit. -- Dmitry Olshansky
Jan 30 2013
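The simple hash-table approach Dmitry recommends for starters can be sketched with a built-in associative array, so that only unique identifiers are ever copied; repeated occurrences get the stored copy back. The cast used for probing is an assumption-laden shortcut: it avoids allocating a key for lookups, and only an idup'd copy is ever stored.

```d
/// Hash-table interning: each distinct identifier is stored once.
struct InternTable
{
    private string[string] table;

    string intern(const(char)[] ident)
    {
        // Probe with a temporary cast so repeat lookups allocate nothing;
        // the AA never retains this key, only the idup'd copy below.
        if (auto p = cast(string) ident in table)
            return *p;
        auto s = ident.idup;   // one copy per unique identifier
        table[s] = s;
        return s;
    }
}
```

Because interned identifiers are unique, pointer equality then substitutes for string comparison downstream.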
On 2013-01-30 10:49, Brian Schott wrote:Results: $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d ------------------------ Total time (ms): 13861.8 Repetitions : 200 Sample mode : 69 (90 ocurrences) Median time : 69.0745 Avg time : 69.3088 Std dev. : 0.670203 Minimum : 68.613 Maximum : 72.635 95% conf.int. : [67.9952, 70.6223] e = 1.31357 99% conf.int. : [67.5824, 71.0351] e = 1.72633 EstimatedAvg95%: [69.2159, 69.4016] e = 0.0928836 EstimatedAvg99%: [69.1867, 69.4308] e = 0.12207 If my math is right, that means it's getting 4.9 million tokens/second now. According to Valgrind the only way to really improve things now is to require that the input to the lexer support slicing. (Remember the secret of Tango's XML parser...) The bottleneck is now on the calls to .idup to construct the token strings from slices of the buffer.How many tokens would that be in total? -- /Jacob Carlborg
Jan 31 2013
On Sun, Jan 27, 2013 at 10:39:13PM +0100, Brian Schott wrote:On Sunday, 27 January 2013 at 19:46:12 UTC, Walter Bright wrote:[...][...] FWIW, I've developed this little idiom in my code when it comes to dealing with error reporting in lexing/parsing code (for my own DSLs, not D): The main problem I have is that my lexer/parser accepts an input range, but input ranges don't (necessarily) have any filename/line number associated with them. Moreover, the code that throws the exception may be quite far down the call chain, and may not have access to the context that knows what filename/line number the error occurred at. For example, I may have a generic function called parseInt(), which can be called from the lexer, the parser, or a whole bunch of other places. It wouldn't make sense to force parseInt() to take a filename and line number just so it can have nicer error reporting, for example. So I decided to move the inclusion of filename/line number information to where it belongs: in the code that knows about it. So here's a sketch of my approach:

	class SyntaxError : Exception {
		string filename;
		int linenum;
		this(string msg) { super(msg); }
	}
	...
	int parseInt(R)(R inputRange) {
		...
		// N.B.: no filename/line number info here
		if (!isDigit(inputRange.front))
			throw new SyntaxError(
				format("Invalid digit: %s", inputRange.front));
	}
	...
	Expr parseExpr(R)(R inputRange) {
		...
		// N.B.: any exception just unrolls past this call, 'cos
		// we have no filename/line number info to insert anyway
		if (tokenType == IntLiteral) {
			value = parseInt(inputRange);
		}
		...
	}
	...
	Expr parseFileInput(string filename) {
		auto f = File(filename);
		// Wrapper range that counts line numbers
		auto r = NumberedSrc(f);
		try {
			return parseExpr(r);
		} catch (SyntaxError e) {
			// Insert filename/line number info into message
			e.filename = filename;
			e.linenum = r.linenum;
			e.msg = format("%s:%d %s", filename, r.linenum, e.msg);
			throw e;
		}
	}
	...
	Expr parseConsoleInput() {
		// No filename/line number info here
		return parseExpr(stdin.byLine());
	}
	...
	Expr parseStringInput(string input) {
		auto r = NumberedSrc(input);
		try {
			return parseExpr(r);
		} catch (SyntaxError e) {
			// We don't have a filename here, but we do have
			// a line number, so use that.
			e.linenum = r.linenum;
			e.msg = format("Line %d: %s", r.linenum, e.msg);
			throw e;
		}
	}

Notice that I have different wrapper functions for dealing with different kinds of input; the underlying parser doesn't even care about filenames/line numbers, but the wrapper functions catch any parsing exceptions that are thrown from underneath and prepend this info as appropriate. This simplifies the parsing code (don't have to keep worrying about line numbers and propagating filenames) and also produces output that makes sense:
- Console input doesn't need line numbers; the user doesn't care if this is the 500th command he typed, or the 701st.
- Internal strings don't get a nonsensical "filename", 'cos they don't *have* a filename in the first place. Just a single line number so you can locate the problem in, say, the string literal or something.
- File input has filename and line number.
- Other kinds of input contexts can be handled in the same way.
- The use of NumberedSrc (maybe better named LineNumberedRange or something) makes line numbers available to each of these contexts at the top level. Though of course, the lexer itself can also handle this (but it adds complications if you have to continue detecting newlines in, say, string literals, when the lexer is in a different state).
The cleaner code does come at a price, though: this code probably is a bit inefficient due to the number of layers in it. But, just thought I'd share this idea. T -- You are only young once, but you can stay immature indefinitely. -- azephrahelJust a quick comment: byToken() should not accept a filename. 
Its input should be via an InputRange, not a file.The file name is accepted for eventual error-reporting purposes. The actual input for the lexer is the parameter called "range".
Jan 28 2013
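The NumberedSrc wrapper in the post above is never shown. Here is a minimal sketch of what such a line-counting range might look like; the name and the linenum field come from the post, everything else is an assumption:

```d
import std.range.primitives;

// Hypothetical line-counting wrapper around any character input range.
// It forwards the wrapped range's elements unchanged while tracking
// the current line number for error reporting.
struct NumberedSrc(R) if (isInputRange!R)
{
    R src;
    int linenum = 1;        // 1-based, like compiler diagnostics

    @property bool empty() { return src.empty; }
    @property auto front() { return src.front; }

    void popFront()
    {
        if (src.front == '\n')
            ++linenum;      // count the newline we are consuming
        src.popFront();
    }
}

auto numberedSrc(R)(R src) { return NumberedSrc!R(src); }
```

A parseFileInput-style wrapper can then read r.linenum in its catch block, exactly as the post sketches.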
On Sunday, 27 January 2013 at 09:51:10 UTC, Brian Schott wrote:I'm writing a D lexer for possible inclusion in Phobos. DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html Code: https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d It's currently able to correctly syntax highlight all of Phobos, but does a fairly bad job at rejecting or notifying users/callers about invalid input. I'd like to hear arguments on the various ways to handle errors in the lexer. In a compiler it would be useful to throw an exception on finding something like a string literal that doesn't stop before EOF, but a text editor or IDE would probably want to be a bit more lenient. Maybe having it run-time (or compile-time configurable) like std.csv would be the best option here. I'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)Having a "standard" lexer module that many people work on and that has up-to-date rules is priceless! Thank you for doing this.
Jan 28 2013
On 2013-01-27 10:51, Brian Schott wrote:I'm writing a D lexer for possible inclusion in Phobos. DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html Code: https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d It's currently able to correctly syntax highlight all of Phobos, but does a fairly bad job at rejecting or notifying users/callers about invalid input. I'd like to hear arguments on the various ways to handle errors in the lexer. In a compiler it would be useful to throw an exception on finding something like a string literal that doesn't stop before EOF, but a text editor or IDE would probably want to be a bit more lenient. Maybe having it run-time (or compile-time configurable) like std.csv would be the best option here. I'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)Just thinking out loud here. Would it be possible to lex a file in parallel? Cutting it in half (or similar) and lex both pieces simultaneously in parallel. -- /Jacob Carlborg
Jan 31 2013
On 2013-01-31 13:14, Jacob Carlborg wrote:Just thinking out loud here. Would it be possible to lex a file in parallel? Cutting it in half (or similar) and lex both pieces simultaneously in parallel.Do you know where you can safely cut it without having it lexed beforehand? :)
Jan 31 2013
On 2013-01-31 13:34, FG wrote:Do you know where you can safely cut it without having it lexed beforehand? :)I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forward and backwards until you get a valid token. Try and calculate the correct index where to cut. Although I have no idea how much trouble it would give you and how much you would gain. -- /Jacob Carlborg
Jan 31 2013
On Thu, Jan 31, 2013 at 01:48:02PM +0100, Jacob Carlborg wrote:On 2013-01-31 13:34, FG wrote:[...] Doesn't work if the middle happens to be inside a string literal containing code. Esp. a q{} literal (you wouldn't be able to tell where it starts/ends without scanning the entire file, because the {}'s nest). T -- Life is unfair. Ask too much from it, and it may decide you don't deserve what you have now either.Do you know where you can safely cut it without having it lexed beforehand? :)I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forward and backwards until you get a valid token. Try and calculate the correct index where to cut.
Jan 31 2013
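The q{} problem is easy to demonstrate. In the token string below, every brace is part of the literal's text, so a scanner dropped at an arbitrary midpoint has no way to know it is inside a string rather than inside real code:

```d
// A D token string: its content is lexed as D tokens, and the {} pairs
// nest. A lexer started from the middle of a file containing this
// would treat the inner braces as real code instead of literal text.
enum snippet = q{
    void f(bool x) { if (x) { g(); } }
};
```

The only way to know that the closing brace of `q{` has been reached is to have tracked the nesting depth since the opening one, which is exactly why a blind midpoint cut cannot work.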
On 2013-01-31 16:54, H. S. Teoh wrote:Doesn't work if the middle happens to be inside a string literal containing code. Esp. a q{} literal (you wouldn't be able to tell where it starts/ends without scanning the entire file, because the {}'s nest).That would be a problem. -- /Jacob Carlborg
Jan 31 2013
On Thursday, 31 January 2013 at 12:48:03 UTC, Jacob Carlborg wrote:On 2013-01-31 13:34, FG wrote:I don't think it's worth the complexity. You can lex both files in parallel with two lexer instances if you want to make things faster.Do you know where you can safely cut it without having it lexed beforehand? :)I was thinking that myself. It would probably be possible to just cut it in the middle and then lex a few characters forward and backwards until you get a valid token. Try and calculate the correct index where to cut. Although I have no idea how much trouble it would give you and how much you would gain.
Jan 31 2013
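The one-lexer-per-file approach deadalnix suggests maps directly onto std.parallelism. A hedged sketch: tokenCount below is a self-contained stand-in (it counts whitespace-separated chunks), not a real D lexer, and in practice the strings would be file contents read from disk:

```d
import std.algorithm.iteration : splitter;
import std.parallelism : taskPool;
import std.range : walkLength;

// Stand-in for a real lexing pass so the example is self-contained:
// counts whitespace-separated chunks instead of real D tokens.
size_t tokenCount(string source)
{
    return source.splitter.walkLength;
}

void main()
{
    // In practice these would be the contents of separate modules.
    auto modules = ["int a;", "void f() {}", "// comment only"];

    // One independent "lexer" run per module, distributed over cores.
    auto counts = taskPool.amap!tokenCount(modules);
    assert(counts == [2, 3, 3]);
}
```

Each work unit owns its lexer state, so no synchronization is needed and the per-file parallelism is embarrassingly simple.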
On 31.01.2013 13:14, Jacob Carlborg wrote:On 2013-01-27 10:51, Brian Schott wrote:Why not? Only the symbols at the border need to be "connected". Then the question is: how many blocks (threads) are OK for 1, 2, 3, 8 cores - it can also depend on the speed of the filesystemI'm writing a D lexer for possible inclusion in Phobos. DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html Code: https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d It's currently able to correctly syntax highlight all of Phobos, but does a fairly bad job at rejecting or notifying users/callers about invalid input. I'd like to hear arguments on the various ways to handle errors in the lexer. In a compiler it would be useful to throw an exception on finding something like a string literal that doesn't stop before EOF, but a text editor or IDE would probably want to be a bit more lenient. Maybe having it run-time (or compile-time configurable) like std.csv would be the best option here. I'm interested in ideas on the API design and other high-level issues at the moment. I don't consider this ready for inclusion. (The current module being reviewed for inclusion in Phobos is the new std.uni.)Just thinking out loud here. Would it be possible to lex a file in parallel? Cutting it in half (or similar) and lex both pieces simultaneously in parallel.
Jan 31 2013
On 2013-01-31 13:35, dennis luehring wrote:why not only the symbols at the border needs to be "connected" then the question is: how many blocks(threads) are ok for 1,2,3,8 cores - can also depend on the speed of the filesystemThat would require some profiling to figure out. -- /Jacob Carlborg
Jan 31 2013
On 31.01.2013 13:48, Jacob Carlborg wrote:On 2013-01-31 13:35, dennis luehring wrote:I would say it "can" help a lot, so the design should be able to use split parts and combine symbols at the border. File-based lexing should also be threadable, so that 16 files can be handled by my 16-core system fully in parallel :) But it's also size-dependent - it makes no sense to split into too-small parts; that could be counter-productivewhy not only the symbols at the border needs to be "connected" then the question is: how many blocks(threads) are ok for 1,2,3,8 cores - can also depend on the speed of the filesystemThat would require some profiling to figure out.
Jan 31 2013
On 2013-01-31 13:57, dennis luehring wrote:i would say it "can" help alot so the design should be able to use split-parts and combine symboles at the border and also file-based lexing should be threadable so that 16 files can be handled by my 16 cores system full in parallel :)This is kind of obvious, I think. That's why I started to think about this less obvious case.but its also size dependend - it makes no sense to split in too small parts, could be counter-productiveOf course not. That would require some profiling as well to find a sweet spot. -- /Jacob Carlborg
Jan 31 2013
On Thursday, 31 January 2013 at 12:14:35 UTC, Jacob Carlborg wrote:Just thinking out loud here. Would it be possible to lex a file in parallel? Cutting it in half (or similar) and lex both pieces simultaneously in parallel.I think a lexer should be IO-bound on today's machines, so parallelizing should not give much benefit. Of course, you can write a microbenchmark such that lexing is compute-bound and parallelizing gives a speedup. ;)
Jan 31 2013
On Friday, 1 February 2013 at 07:41:11 UTC, qznc wrote:On Thursday, 31 January 2013 at 12:14:35 UTC, Jacob Carlborg wrote:Pretty sure RAMs these days can handle more than 1 processor's memory request at a given point in time, so there should be an N-times speedup, where N maxes out at the max throughput...Just thinking out loud here. Would it be possible to lex a file in parallel? Cutting it in half (or similar) and lex both pieces simultaneously in parallel.I think a lexer should be IO-bound on todays machines, so parallelizing should not give much benefits.
Feb 01 2013
On 2013-02-01 08:41, qznc wrote:I think a lexer should be IO-bound on todays machines, so parallelizing should not give much benefits.I'm not sure I understand. -- /Jacob Carlborg
Feb 01 2013
After several hours of optimizing I've managed to make it so that dmd's lexer is only three times faster. http://hackerpilot.github.com/experimental/std_lexer/images/times.png The funny thing is that compiling with LDC gave a bigger speed boost than any of my code refactoring.
Feb 03 2013
On 2013-02-04 01:22, Brian Schott wrote:After several hours of optimizing I've managed to make it so that dmd's lexer is only three times faster.What are you comparing here? How do you time DMD's lexing stage?
Feb 03 2013
On Monday, 4 February 2013 at 00:38:57 UTC, FG wrote:On 2013-02-04 01:22, Brian Schott wrote:A simple hack of module.c that prints out the file's token count and then calls exit(0).After several hours of optimizing I've managed to make it so that dmd's lexer is only three times faster.What are you comparing here? How do you time DMD's lexing stage?
Feb 03 2013
On 2013-02-04 01:41, Brian Schott wrote:Ah, fine. Then it's not that bad. :) Have you already made it use slices of the whole source, without copying the string values of tokens? What kind of optimizations have you made?What are you comparing here? How do you time DMD's lexing stage?A simple hack of module.c that prints out the file's token count and then calls exit(0).
Feb 03 2013
On 2013-02-04 01:50, FG wrote:Ah, fine. Then it's not that bad. :) Have you already made it use slices of the whole source, without copying the string values of tokens? What kind of optimizations have you made?That would be interesting to hear. Three times slower than DMD doesn't sound good. I know that DMD is fast, but three times. -- /Jacob Carlborg
Feb 03 2013
On 2013-02-04 08:57, Jacob Carlborg wrote:On 2013-02-04 01:50, FG wrote:Looking at the current source, there is now a StringCache to hold the strings. It is however also used to store all those long comments, so I believe quite some time is unnecessarily wasted on generating hashes for them. But probably there are other parts slowing things down.Ah, fine. Then it's not that bad. :) Have you already made it use slices of the whole source, without copying the string values of tokens? What kind of optimizations have you made?That would be interesting to hear. Three times slower than DMD doesn't sound good. I know that DMD is fast, but three times.
Feb 04 2013
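FG's point about hashing comments suggests an obvious split: intern identifiers (which benefit from deduplication) but store comments as plain slices of the source, so no hash is ever computed for them. A minimal sketch of that idea with hypothetical names; this is not the actual StringCache API:

```d
// Toy interner: a real StringCache would use a custom hash table,
// but an associative array keeps the sketch short.
struct Cache
{
    string[string] pool;

    string intern(string s)
    {
        if (auto p = s in pool) return *p;  // reuse first occurrence
        pool[s] = s;
        return s;
    }
}

struct Tok { string text; }

// Comments: slice the source directly, never hashed or copied.
Tok commentToken(Cache* c, string src, size_t lo, size_t hi)
{
    return Tok(src[lo .. hi]);
}

// Identifiers: interned so duplicates share one string.
Tok identifierToken(Cache* c, string src, size_t lo, size_t hi)
{
    return Tok(c.intern(src[lo .. hi]));
}
```

With this split, a file that is mostly documentation comments pays nothing for the cache.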
On Monday, 4 February 2013 at 00:22:42 UTC, Brian Schott wrote:After several hours of optimizing I've managed to make it so that dmd's lexer is only three times faster. http://hackerpilot.github.com/experimental/std_lexer/images/times.png The funny thing is that compiling with LDC gave a bigger speed boost than any of my code refacatoring.Where is the current bottleneck? (Should be easy to find just by running the program ~5 times and suddenly breaking into it with a debugger.) Also, I'm assuming you've already tried disabling range-checking on arrays?
Feb 04 2013
More optimizing: http://hackerpilot.github.com/experimental/std_lexer/images/times2.png Still only half speed. I'm becoming more and more convinced that Walter is actually a wizard.
Feb 04 2013
On 2/4/13 10:19 PM, Brian Schott wrote:More optimizing: http://hackerpilot.github.com/experimental/std_lexer/images/times2.png Still only half speed. I'm becoming more and more convinced that Walter is actually a wizard.Suggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par. Andrei
Feb 04 2013
On 2/5/13, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Suggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par.This was already done for DDMD, and the more recent minimal version of it: https://github.com/zachthemystic/ddmd-clean/blob/master/dmd/lexer.d
Feb 04 2013
On 2/4/13 11:05 PM, Andrej Mitrovic wrote:On 2/5/13, Andrei Alexandrescu<SeeWebsiteForEmail erdani.org> wrote:Awesome! Has anyone measured the speed? AndreiSuggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par.This was already done for DDMD, and the more recent minimal version of it: https://github.com/zachthemystic/ddmd-clean/blob/master/dmd/lexer.d
Feb 04 2013
On Tuesday, 5 February 2013 at 04:19:54 UTC, Andrei Alexandrescu wrote:Awesome! Has anyone measured the speed? AndreiI gave up on getting it to compile. ddmd (the project this one is based on) was last updated to 2.040 according to its page on dsource, and I also haven't gotten it to compile.
Feb 04 2013
On 2/5/13, Brian Schott <briancschott gmail.com> wrote:I gave up on getting it to compile.It seems master is broken. An earlier commit is buildable with a small change, but it doesn't seem to parse modules like std.datetime. I don't know what exact version the front-end was ported from, but I've tried to parse datetime from 2.055 to 2.061 and it didn't work. So yeah it's not usable in this state.
Feb 05 2013
On Tuesday, 5 February 2013 at 03:22:52 UTC, Andrei Alexandrescu wrote:On 2/4/13 10:19 PM, Brian Schott wrote:DMD's lexer is not suitable for Phobos IMO. It doesn't take a range as input and doesn't produce a range. It also lacks the features you may want from a multi-usage D lexer.More optimizing: http://hackerpilot.github.com/experimental/std_lexer/images/times2.png Still only half speed. I'm becoming more and more convinced that Walter is actually a wizard.Suggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par.
Feb 04 2013
On 2/5/13 12:45 AM, deadalnix wrote:On Tuesday, 5 February 2013 at 03:22:52 UTC, Andrei Alexandrescu wrote:I looked at the code. Once converted, it's trivial to convert the lexer into an input range. AndreiOn 2/4/13 10:19 PM, Brian Schott wrote:DMD's lexer is not suitable for phobos IMO. It doesn't take a range as input and don't produce a range. It also lack the features you may want from a multi usage D lexer.More optimizing: http://hackerpilot.github.com/experimental/std_lexer/images/times2.png Still only half speed. I'm becoming more and more convinced that Walter is actually a wizard.Suggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par.
Feb 05 2013
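For what it's worth, wrapping a scan()-style lexer into an input range is mostly mechanical, as Andrei says. A sketch under stated assumptions: DummyLexer stands in for the real lexer, and token type 0 plays the role dmd's TOKeof would:

```d
struct Token { int type; string text; }

// Stand-in lexer with a dmd-like interface: scan() returns the next
// token and keeps returning an EOF token (type 0) at the end.
struct DummyLexer
{
    Token[] toks = [Token(1, "int"), Token(2, "a"), Token(0, "")];

    Token scan()
    {
        auto t = toks[0];
        if (toks.length > 1) toks = toks[1 .. $];
        return t;
    }
}

// Adapter turning any lexer with that interface into an input range.
struct TokenRange(Lexer)
{
    Lexer lexer;
    Token current;
    bool done;

    this(Lexer l)
    {
        lexer = l;
        popFront();                 // prime the first token
    }

    @property bool empty() { return done; }
    @property Token front() { return current; }

    void popFront()
    {
        current = lexer.scan();
        done = current.type == 0;   // hypothetical EOF convention
    }
}
```

Once this shim exists, the adapted lexer composes with map, filter, and the rest of std.algorithm like any other range.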
On 2013-02-05 04:22, Andrei Alexandrescu wrote:Suggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par.There's a reason why nobody has just extracted the lexer from DMD. It will probably take more than a day just to extract the lexer so it can be used without the rest of DMD. It's probably easier to do the actual porting than to extract the lexer. Actually, when I think about it, Jonathan is working on porting the DMD lexer to D. -- /Jacob Carlborg
Feb 05 2013
On Tuesday, February 05, 2013 09:14:35 Jacob Carlborg wrote:Actually, when I think about it, Johnathan is working on porting the DMD lexer to D.Not exactly. I decided that it would be of greater benefit to just write the thing according to the grammar and make sure that the compiler matched the spec. I've already found and fixed several bugs in the spec because of that. Also, given how I'm writing it, I expect that its speed will be similar to that of dmd given the fact that I'm writing it such that it will generally do the minimum number of operations required. But it wouldn't surprise me at all if no lexer can possibly match dmd at the moment as long as it's compiled with dmd (at least on Linux), because gcc's optimizer is so much better than dmd's, and dmd is getting compiled with gcc, whereas the competing lexer in D is being compiled with dmd. I don't have a lot of time to work on my lexer at the moment, but I'd really like to get it done soon, and I have most of the features in place. Unfortunately, when I went to try and work on it again the other day, the code wasn't compiling anymore, and I need to figure out why. I suspect that it's a regression related to string mixins, but I have to investigate further to sort it out. - Jonathan M Davis
Feb 05 2013
On 2013-02-05 10:07, Jonathan M Davis wrote:Not exactly. I decided that it would be of greater benefit to just write the thing according to the grammar and make sure that the compiler matched the spec.Aha, I see. -- /Jacob Carlborg
Feb 05 2013
On Tuesday, February 05, 2013 09:14:35 Jacob Carlborg wrote:On 2013-02-05 04:22, Andrei Alexandrescu wrote:There are basic ideas about how it works which are obviously good and should be in the finished product in D, but it's not range-based, which forces you to do things differently. It's also not configurable, which forces you to do things differently. If it could be ported as-is and then compared for speed, then that would be a great test, since it would be able to show how much of the speed problem is purely a compiler issue as opposed to a design issue, but you wouldn't be able to actually use it for anything more than what Brian is doing with his performance testing, because as you point out, it's too integrated into dmd. It _would_ be valuable though as a performance test of the compiler. - Jonathan M DavisSuggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par.There's reason for why nobody has just extract the lexer from DMD. It will probably take more than a day just to extract the lexer to be able to use it without the rest of DMD.
Feb 05 2013
On 2013-02-05 11:44, Jonathan M Davis wrote:If it could be ported as-is and then compared for speed, then that would be a great test, since it would be able to show how much of the speed problem is purely a compiler issue as opposed to a design issue, but you wouldn't be able to actually use it for anything more than what Brian is doing with his performance testing, because as you point out, it's too integrated into dmd. It _would_ be valuable though as a performance test of the compiler.Yeah, that could be useful. -- /Jacob Carlborg
Feb 05 2013
On 2/5/13 5:44 AM, Jonathan M Davis wrote:On Tuesday, February 05, 2013 09:14:35 Jacob Carlborg wrote:As far as I could tell the dependencies of the lexer are fairly contained (util, token, identifier) and conversion to input range is immediate. AndreiOn 2013-02-05 04:22, Andrei Alexandrescu wrote:There are basic ideas about how it works which are obviously good and should be in the finished product in D, but it's not range-based, which forces you to do things differently. It's also not configurable, which forces you to do things differently. If it could be ported as-is and then compared for speed, then that would be a great test, since it would be able to show how much of the speed problem is purely a compiler issue as opposed to a design issue, but you wouldn't be able to actually use it for anything more than what Brian is doing with his performance testing, because as you point out, it's too integrated into dmd. It _would_ be valuable though as a performance test of the compiler.Suggestion: take lexer.c and convert it to D. Should take one day, and you'll have performance on par.There's reason for why nobody has just extract the lexer from DMD. It will probably take more than a day just to extract the lexer to be able to use it without the rest of DMD.
Feb 05 2013
On 2013-02-05 14:34, Andrei Alexandrescu wrote:As far as I could tell the dependencies of the lexer are fairly contained (util, token, identifier) and conversion to input range is immediate.That's not what I remember from last time I gave it a try. -- /Jacob Carlborg
Feb 05 2013
On Tuesday, February 05, 2013 08:34:29 Andrei Alexandrescu wrote:As far as I could tell the dependencies of the lexer are fairly contained (util, token, identifier) and conversion to input range is immediate.I don't remember all of the details at the moment, since it's been several months since I looked at dmd's lexer, but a lot of the problem stems from the fact that it's all written around the assumption that it's dealing with a char*. Converting it to operate on string might be fairly straightforward, but it gets more complicated when dealing with ranges. Also, both Walter and others have stated that the lexer in D should be configurable in a number of ways, and dmd's lexer isn't configurable at all. So, while a direct translation would likely be quick, refactoring it to do what it's been asked to be able to do would not be. I'm quite a ways along with one that's written from scratch, but I need to find the time to finish it. Also, doing it from scratch has had the added benefit of helping me find bugs in the spec and in dmd. - Jonathan M Davis
Feb 05 2013
On 2/5/13 10:29 PM, Jonathan M Davis wrote:On Tuesday, February 05, 2013 08:34:29 Andrei Alexandrescu wrote:I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte. AndreiAs far as I could tell the dependencies of the lexer are fairly contained (util, token, identifier) and conversion to input range is immediate.I don't remember all of the details at the moment, since it's been several months since I looked at dmd's lexer, but a lot of the problem stems from the fact that it's all written around the assumption that it's dealing with a char*. Converting it to operate on string might be fairly straightforward, but it gets more complicated when dealing with ranges. Also, both Walter and others have stated that the lexer in D should be configurable in a number of ways, and dmd's lexer isn't configurable at all. So, while a direct translation would likely be quick, refactoring it to do what it's been asked to be able to do would not be. I'm quite a ways along with one that's written from scratch, but I need to find the time to finish it. Also, doing it from scratch has had the added benefit of helping me find bugs in the spec and in dmd.
Feb 05 2013
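A sketch of what "carry its own decoding" could look like in practice: take an ASCII fast path and call the UTF-8 decoder only for multi-byte sequences, the minimal-decoding style of approach discussed in this thread. The helper name is made up:

```d
import std.utf : decode;

// Returns the code point starting at src[i] and advances i past it.
// ASCII bytes (the vast majority of D source text) skip the decoder.
dchar nextChar(string src, ref size_t i)
{
    immutable ubyte c = src[i];
    if (c < 0x80)
    {
        ++i;                // single-byte sequence: no decode call
        return c;
    }
    return decode(src, i);  // std.utf.decode advances i for us
}
```

The same shape works on immutable(ubyte)[] input; only the non-ASCII branch ever has to care that the bytes are UTF-8.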
On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.I'd have to think about how you'd handle the Unicode stuff in that case, since I'm not quite sure what you mean by having it handle its own decoding if it's a range of code units, but what I've been working on works with all of the character types and is very careful about how it deals with decoding in order to avoid unnecessary decoding. And that wasn't all that hard as far as the lexer's code goes. The hard part with that was making std.utf work with ranges of code units rather than just strings, and that was committed months ago. - Jonathan M Davis
Feb 05 2013
On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges. And to do that, you'd either have to keep calculating the index for each token as it's generated or save the range with every token (rather than just having a pointer) so that you could determine the index later if you needed to. And depending on the range, all of that saving could be expensive. And for any other type of range, you'd literally have to count the code units as you iterated in order to figure out what the index is (and you'd have to keep saving the range as you went along if you wanted to slice it at all, since it wouldn't actually be sliceable, and so getting to a particular index in the range would be very expensive even if you kept track of it). And for syntax highlighting and some error reporting and a variety of other uses, you need to be able to determine where in the text a token was (not just its column and line number). And that's simply a lot easier with a pointer. So, dealing with generic ranges is a bit problematic - far more so than any issues with character types. 
If the lexer is well-written, the extra overhead could be obviated by having the lexer function do stuff a bit differently depending on the type of the range, but regardless, restricting it to strings or pointers would be cleaner in that regard. It's not quite a use case where ranges shine - especially when efficiency is a top priority. - Jonathan M Davis
Feb 08 2013
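For string input, the pointer/.ptr subtraction Jonathan describes is a one-liner. offsetOf below is a hypothetical helper; with a plain forward range the lexer would instead have to maintain this count by hand in popFront:

```d
// Byte offset of a token within its source, recovered from .ptr
// arithmetic alone. Only meaningful when token is a slice of source,
// which is exactly what a slicing lexer produces.
size_t offsetOf(string source, string token)
{
    return token.ptr - source.ptr;
}
```

Because the offset falls out of the slice itself, no per-token bookkeeping is needed; that is the "pointer is more lightweight" advantage in concrete form.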
08-Feb-2013 12:01, Jonathan M Davis wrote:On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:Not true, certain ranges know length but can't be random access as indexing is O(lgN) or worse. Including a stripe of chunks as taken from a file.I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges. -- Dmitry Olshansky
Feb 08 2013
On Friday, February 08, 2013 12:12:30 Dmitry Olshansky wrote:08-Feb-2013 12:01, Jonathan M Davis wrote:On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges.Not true, certain ranges know length but can't be random access as indexing is O(lgN) or worse. Including a stripe of chunks as taken from a file.I said that the only ones which are "likely" to define length are random-access ranges. There _are_ other ranges which can, but in most cases, if you can know the length, you can do random access as well. Regardless, the main issue still stands in that dealing with keeping track of the index of the code unit of a token is more complicated and generally more expensive with ranges than it is with a pointer. Some range types will do better than others, but short of using a string's ptr property, there's always going to be some additional overhead in comparison to pointers to keep track of the indices or to keep a range or slice of one as part of a token. The pointer's just more lightweight. That doesn't make ranges unacceptable by any means. It just means that they're going to take at least a slight performance hit in comparison to pointers. - Jonathan M Davis
Feb 08 2013
08-Feb-2013 13:40, Jonathan M Davis wrote:On Friday, February 08, 2013 12:12:30 Dmitry Olshansky wrote:Well I honestly disagree about the premise that knowing length implies being able to index. "Most ranges" here are arrays and wrappers on top of these. Given the current realities of D and Phobos I'm afraid you are right though.08-Feb-2013 12:01, Jonathan M Davis wrote:I said that the only ones which are "likely" to define length are random-access ranges. There _are_ other ranges which can, but in most cases, if you can know the length, you can do random access as well.On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:Not true, certain ranges know length but can't be random access as indexing is O(lgN) or worse. Including a stripe of chunks as taken from a file.I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.Another big issue is the fact that in some ways, using a pointer like dmd's lexer does is actually superior to using a range. In particular, it's trivial to determine where in the text a token is, because you can simply subtract the pointer in the token from the initial pointer. Strings would be okay too, because you can subtract their ptr properties. But the closest that you'll get with ranges is to subtract their lengths, and the only ranges that are likely to define length are random-access ranges.Regardless, the main issue still stands in that dealing with keeping track of the index of the code unit of a token is more complicated and generally more expensive with ranges than it is with a pointer.If the target is a random-access range, just use an offset throughout. It basically becomes base + offset vs base + pointer, i.e. a non-issue. If not, then the pointer argument no longer applies and you can just as well use a separate counter per popFront. 
It'd not be that costly in this case, and flexibility trumps other concerns with forward ranges in any case.Some range types will do better than others, but short of using a string's ptr property, there's always going to be some additional overhead in comparison to pointers to keep track of the indices or to keep a range or slice of one as part of a token. The pointer's just more lightweight. That doesn't make ranges unacceptable by any means. It just means that they're going to take at least a slight performance hit in comparison to pointers.See above. A pointer to something inside a buffer == an index into the buffer; typically even with a pointer you can't drop the 'buffer' reference itself. -- Dmitry Olshansky
Feb 08 2013
On Friday, February 08, 2013 14:29:20 Dmitry Olshansky wrote:If the target is a random-access range, just use an offset throughout. It basically becomes base + offset vs base + pointer, i.e. a non-issue. If not, then the pointer argument no longer applies and you can just as well use a separate counter per popFront. It'd not be that costly in this case, and flexibility trumps other concerns with forward ranges in any case.I don't know exactly how costly it ends up being (and certainly, poor design choices in other areas could dwarf that cost), but it does incur extra overhead throughout the lexing process. In most code, it wouldn't be a big deal, but the lexer is trying to be lightning fast, so every extra bit like that is going to add up and slow it down. But you really don't have any choice if you don't even have random access. Regardless, my point is that accepting generic ranges rather than pointers complicates things somewhat and does incur at least a slight performance penalty. - Jonathan M Davis
Feb 08 2013
08-Feb-2013 23:25, Jonathan M Davis writes:On Friday, February 08, 2013 14:29:20 Dmitry Olshansky wrote:If the target is a random-access range, just use an offset throughout. It basically becomes base + offset vs base + pointer, i.e. a non-issueIf not, then the pointer argument no longer applies and you can just as well use a separate counter on each popFront. It's not that costly in this case and flexibility trumps other concerns with forward ranges in any case.I don't know exactly how costly it ends up being (and certainly, poor design choices in other areas could dwarf that cost), but it does incur extra overhead throughout the lexing process. In most code, it wouldn't be a big deal, but the lexer is trying to be lightning fast, so every extra bit like that is going to add up and slow it down.I suspect it *might* require an extra register for the base depending on the smartness of the compiler, and an extra register in turn could raise the cost. The key to speed here, however, is not in using raw pointers but in assembly and/or SIMD. FWIW if there is only a single base pointer + offsets the compiler can assume no pointer aliasing and optimize things better. The discussion becomes purely rhetorical unless some hard data is actually presented, and not data based on DMD's optimizer, please.But you really don't have any choice if you don't even have random access.Agreed.Regardless, my point is that accepting generic ranges rather than pointers complicates things somewhat and does incur at least a slight performance penalty.Complication - yes; the slight performance cost is what I doubt in the RA case. Seems like a compiler/optimizer issue. For one, I used to write cycles with naked pointers in C to get better performance out of crappy compilers. I won't ever do it today as it would get worse performance in addition to being less readable (I've checked at least on VC++ about a year ago; the compiler rewrites these index-based loops better, YMMV). -- Dmitry Olshansky
Feb 08 2013
On Friday, February 08, 2013 23:50:13 Dmitry Olshansky wrote:Complication - yes, slight performance cost is what I doubt it in RA case. Seems like a compiler/optimizer issue.The main problem isn't RA ranges. It's quite probable that the optimizer takes care of any minor additional overhead that's incurred there. The issue is really pure forward ranges, because you're forced to count the code units as they're processed (or even save the range as you go along, given how expensive it would be to go find the index even if you have it). But we really have no way of knowing how bad it is without hard data. It would certainly be trivial to end up doing other stuff in the implementation which cost far more. - Jonathan M Davis
Feb 08 2013
On Friday, 8 February 2013 at 20:54:32 UTC, Jonathan M Davis wrote:On Friday, February 08, 2013 23:50:13 Dmitry Olshansky wrote:For pure InputRanges, that is pretty bad, as the decoding of UTF chars basically has to be done twice, with each code point popped individually, or the lexer has to embed its own homebrew version of std.utf.Complication - yes; the slight performance cost is what I doubt in the RA case. Seems like a compiler/optimizer issue.The main problem isn't RA ranges. It's quite probable that the optimizer takes care of any minor additional overhead that's incurred there. The issue is really pure forward ranges, because you're forced to count the code units as they're processed (or even save the range as you go along, given how expensive it would be to go find the index even if you have it). But we really have no way of knowing how bad it is without hard data. It would certainly be trivial to end up doing other stuff in the implementation which cost far more.
Feb 08 2013
On Saturday, 9 February 2013 at 07:15:57 UTC, deadalnix wrote:On Friday, 8 February 2013 at 20:54:32 UTC, Jonathan M Davis wrote:Wow, that is complete bullshit xD I completely messed up what I wanted to say. So, std.utf.decodeFront may or may not pop the UTF character. And in case it does not, you end up having to do extra hanky panky (and duplicate logic from std.utf to know whether it pops or not, which is likely to be very bug prone).On Friday, February 08, 2013 23:50:13 Dmitry Olshansky wrote:For pure InputRanges, that is pretty bad, as the decoding of UTF chars basically has to be done twice, with each code point popped individually, or the lexer has to embed its own homebrew version of std.utf.Complication - yes; the slight performance cost is what I doubt in the RA case. Seems like a compiler/optimizer issue.The main problem isn't RA ranges. It's quite probable that the optimizer takes care of any minor additional overhead that's incurred there. The issue is really pure forward ranges, because you're forced to count the code units as they're processed (or even save the range as you go along, given how expensive it would be to go find the index even if you have it). But we really have no way of knowing how bad it is without hard data. It would certainly be trivial to end up doing other stuff in the implementation which cost far more.
Feb 08 2013
On Saturday, February 09, 2013 08:19:33 deadalnix wrote:So, std.utf.decodeFront pop or not the utf character. And in case it does not, you ends up having to do extra hanky panky (and duplicate logic in std.utf to know if it does pop or not, which is likely to be very bug prone).Well, with the pull request that I have, it'll always pop for input ranges and never for anything else. I'm disinclined to have it pop off elements at all, because I'd like it to function as much like decode as possible, but there's no choice with pure input ranges, and it may be better to just make it always pop the elements off. I'll have to think about it. I hate pure input ranges in general. They're just so frustrating to deal with, and I believe that you can always have a forward range if you're willing to put a bit more effort into it. The lexer that I'm writing doesn't even support pure input ranges, because it has to be able to save to do what it does, and the only reason that decodeFront exists is because I wrote it to support that use case. And needing to operate on ranges of code units is likely to be quite rare, and the function is quite new, so I don't expect that much code is affected by any bugs in decodeFront or that much would be broken were it to be changed. - Jonathan M Davis
Feb 09 2013
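For reference, the usage pattern under discussion looks roughly like this. It is a sketch against std.utf.decodeFront with its `out size_t numCodeUnits` parameter (the disputed part in this thread is precisely whether the range is also advanced consistently for non-slice input ranges, so treat the popping behavior as an assumption):

```d
import std.utf : decodeFront;

void main()
{
    // decodeFront reports how many code units the decoded code point
    // consumed via its out parameter, so a lexer tracking byte offsets
    // has to accumulate numCodeUnits itself.
    string s = "héllo";
    size_t offset;
    size_t numCodeUnits;

    dchar c = decodeFront(s, numCodeUnits); // 'h': one code unit
    offset += numCodeUnits;

    c = decodeFront(s, numCodeUnits);       // 'é': two code units in UTF-8
    offset += numCodeUnits;

    assert(offset == 3); // 1 byte for 'h' + 2 bytes for 'é'
}
```

A lexer built on a pure input range has nothing but this running `offset` to locate tokens, which is why the inconsistency between slices and other ranges bites so hard.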
On 2013-02-09 09:05, Jonathan M Davis wrote:The lexer that I'm writing doesn't even support pure input ranges, because it has to be able to save to do what it does, and the only reason that decodeFront exists is because I wrote it to support that use case. And needing to operate on ranges of code units is likely to be quite rare, and the function is quite new, so I don't expect that much code is affected by any bugs in decodeFront or that much would be broken were it to be changed.Following this discussion it seems it's complicated to support ranges in a lexer and still maintain the speed. Is it really worth the trouble to support ranges? -- /Jacob Carlborg
Feb 09 2013
On 2/9/13 10:05 AM, Jacob Carlborg wrote:On 2013-02-09 09:05, Jonathan M Davis wrote:Requiring a random-access range of ubyte with a terminating zero may be the most general approach to a fast lexer - and that's fine. AndreiThe lexer that I'm writing doesn't even support pure input ranges, because it has to be able to save to do what it does, and the only reason that decodeFront exists is because I wrote it to support that use case. And needing to operate on ranges of code units is likely to be quite rare, and the function is quite new, so I don't expect that much code is affected by any bugs in decodeFront or that much would be broken were it to be changed.Following this discussion it seems it's complicated to support ranges in a lexer and still maintain the speed. Is it really worth the trouble to support ranges?
Feb 09 2013
On 2013-02-09 16:10, Andrei Alexandrescu wrote:Requiring a random-access range of ubyte with a terminating zero may be the most general approach to a fast lexer - and that's fine.Requiring a random-access range probably makes it easier. People here seem to be trying to support ranges with less functionality, like input or forward ranges. -- /Jacob Carlborg
Feb 09 2013
On 2/9/13 10:37 AM, Jacob Carlborg wrote:On 2013-02-09 16:10, Andrei Alexandrescu wrote:Yah. The way I see it is, start with a random-access range and then see what the use patterns are. Then possibly refine. AndreiRequiring a random-access range of ubyte with a terminating zero may be the most general approach to a fast lexer - and that's fine.Requiring a random-access range probably makes it easier. People here seem to be trying to support ranges with less functionality, like input or forward ranges.
Feb 09 2013
09-Feb-2013 19:46, Andrei Alexandrescu writes:On 2/9/13 10:37 AM, Jacob Carlborg wrote:I don't get this. There is no sensible requirement to forbid non-random access. A couple of static ifs and you are done. And I can confidently say that std.d.lexer has quite some room for optimization in both cases and it doesn't have to sacrifice the generic path. I intend to continue hacking on Brian's implementation and to help him refine it. Any real help (as in work and analysis) is appreciated, thanks. -- Dmitry OlshanskyOn 2013-02-09 16:10, Andrei Alexandrescu wrote:Yah. The way I see it is, start with a random-access range and then see what the use patterns are. Then possibly refine.Requiring a random-access range of ubyte with a terminating zero may be the most general approach to a fast lexer - and that's fine.Requiring a random-access range probably makes it easier. People here seem to be trying to support ranges with less functionality, like input or forward ranges.
Feb 09 2013
On Sunday, February 10, 2013 00:26:41 Dmitry Olshansky wrote:09-Feb-2013 19:46, Andrei Alexandrescu writes:On 2/9/13 10:37 AM, Jacob Carlborg wrote:On 2013-02-09 16:10, Andrei Alexandrescu wrote:Requiring a random-access range of ubyte with a terminating zero may be the most general approach to a fast lexer - and that's fine.Requiring a random-access range probably makes it easier. People here seem to be trying to support ranges with less functionality, like input or forward ranges.Yah. The way I see it is, start with a random-access range and then see what the use patterns are. Then possibly refine.I don't get this. There is no sensible requirement to forbid non-random access. A couple of static ifs and you are done. And I can confidently say that std.d.lexer has quite some room for optimization in both cases and it doesn't have to sacrifice the generic path.It may end up being less efficient with some types of ranges, but it can still work, and someone's use case may not care about that extra loss of efficiency. Those who really do can use RA ranges or strings or whatever. But if we throw too many extra restrictions on the ranges (like they have to be random access or end with 0), then pretty soon you might as well require that they use string, because the extra benefit of the few extra types of ranges which would work would be pretty minimal. - Jonathan M Davis
Feb 09 2013
On 2/8/2013 12:01 AM, Jonathan M Davis wrote:It's not quite a use case where ranges shine - especially when efficiency is a top priority.A more problematic case is dmd's lexer relies on a 0 byte at the end to be a "sentinel" for the end of file. Without such a sentinel, every consumption of a source character requires two operations rather than one.
Feb 08 2013
On Friday, February 08, 2013 23:48:50 Walter Bright wrote:On 2/8/2013 12:01 AM, Jonathan M Davis wrote:I didn't know that. That's a cute trick. But yeah, without controlling the input, it's not possible, and that won't work with a general implementation in a library. It would be trivial enough to put a wrapper around the input to add the 0 byte at the end, but the wrapper would still have to keep checking for empty, so you wouldn't gain anything. - Jonathan M DavisIt's not quite a use case where ranges shine - especially when efficiency is a top priority.A more problematic case is dmd's lexer relies on a 0 byte at the end to be a "sentinel" for the end of file. Without such a sentinel, every consumption of a source character requires two operations rather than one.
Feb 09 2013
On 2/9/13 3:07 AM, Jonathan M Davis wrote:On Friday, February 08, 2013 23:48:50 Walter Bright wrote:That's not a problem. You may require that the input has a terminating zero. AndreiOn 2/8/2013 12:01 AM, Jonathan M Davis wrote:I didn't know that. That's a cute trick. But yeah, without controlling the input, it's not possible, and that won't work with a general implementation in a library. It would be trivial enough to put a wrapper around the input to add the 0 byte at the end, but the wrapper would still have to keep checking for empty, so you wouldn't gain anything.It's not quite a use case where ranges shine - especially when efficiency is a top priority.A more problematic case is dmd's lexer relies on a 0 byte at the end to be a "sentinel" for the end of file. Without such a sentinel, every consumption of a source character requires two operations rather than one.
Feb 09 2013
On 2013-02-09 15:37, Andrei Alexandrescu wrote:That's not a problem. You may require that the input has a terminating zero.You cannot just append a terminating zero in the lexer if it's a range? -- /Jacob Carlborg
Feb 09 2013
On 2/9/13 10:38 AM, Jacob Carlborg wrote:On 2013-02-09 15:37, Andrei Alexandrescu wrote:You require it. It's client's burden. AndreiThat's not a problem. You may require that the input has a terminating zero.You cannot just append a terminating zero in the lexer if it's a range?
Feb 09 2013
On 2013-02-09 16:46, Andrei Alexandrescu wrote:You require it. It's client's burden.Ok, I see. -- /Jacob Carlborg
Feb 09 2013
On 2/9/2013 6:37 AM, Andrei Alexandrescu wrote:On 2/9/13 3:07 AM, Jonathan M Davis wrote:Perhaps we can formalize this a bit. Define a SentinelInputRange, which has an additional manifest constant with the name 'sentinel'. empty becomes defined as: empty = front == sentinel; popFront() does not advance if empty. Advancing over a SentinelInputRange can be done the usual: for (r = range; !r.empty; r.popFront()) { auto e = r.front; } or can be done as: for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; } This should work well with 0 terminated strings. The output of the lexer should also be a SentinelInputRange, with TOKeof as the sentinel.On Friday, February 08, 2013 23:48:50 Walter Bright wrote:That's not a problem. You may require that the input has a terminating zero.On 2/8/2013 12:01 AM, Jonathan M Davis wrote:I didn't know that. That's a cute trick. But yeah, without controlling the input, it's not possible, and that won't work with a general implementation in a library. It would be trivial enough to put a wrapper around the input to add the 0 byte at the end, but the wrapper would still have to keep checking for empty, so you wouldn't gain anything.It's not quite a use case where ranges shine - especially when efficiency is a top priority.A more problematic case is dmd's lexer relies on a 0 byte at the end to be a "sentinel" for the end of file. Without such a sentinel, every consumption of a source character requires two operations rather than one.
Feb 09 2013
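Walter's proposal can be sketched concretely. This is a hypothetical D fragment: the names `ZeroTerminated`, `sentinel`, and `countChars` are illustrative, not an existing Phobos API, and the wrapper assumes the caller guarantees a zero-terminated buffer:

```d
// Hypothetical wrapper turning a zero-terminated buffer into what Walter
// calls a SentinelInputRange: empty is defined as front == sentinel.
struct ZeroTerminated
{
    enum ubyte sentinel = 0;  // manifest constant naming the terminator
    immutable(ubyte)* p;      // current position in the buffer

    @property ubyte front() const { return *p; }     // safe even when empty:
                                                     // it just yields sentinel
    @property bool empty() const { return *p == sentinel; }
    void popFront() { if (!empty) ++p; }             // no-op once empty
}

// The payoff: the inner loop needs a single comparison per character
// instead of a separate bounds check plus a dereference.
size_t countChars(ZeroTerminated r)
{
    size_t n;
    for (; r.front != ZeroTerminated.sentinel; r.popFront())
        ++n;
    return n;
}
```

Note that, per Walter's follow-up, `front` may be called before checking `empty`; when the range is empty it simply returns the sentinel, which is what makes the single-comparison loop valid.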
On 2/9/2013 12:43 PM, Walter Bright wrote:> Perhaps we can formalize this a bit. Define a SentinelInputRange, which has anadditional manifest constant with the name 'sentinel'. empty becomes defined as: empty = front == sentinel; popFront() does not advance if empty. Advancing over a SentinelInputRange can be done the usual: for (r = range; !r.empty; r.popFront()) { auto e = r.front; } or can be done as: for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; }If it's not clear, SentinelInputRange.front can be called *before* checking empty. If it is empty, then the sentinel is returned.
Feb 09 2013
On 2/9/13 3:49 PM, Walter Bright wrote:On 2/9/2013 12:43 PM, Walter Bright wrote:> Perhaps we can formalize this a bit. Define a SentinelInputRange, which has an > additional manifest constant with the name 'sentinel'. empty becomes defined as: > > empty = front == sentinel; > > popFront() does not advance if empty. Advancing over a SentinelInputRange can be > done the usual: > > for (r = range; !r.empty; r.popFront()) { auto e = r.front; } > > or can be done as: > > for (r = range; r.front != sentinel; r.popFront()) { auto e = r.front; } If it's not clear, SentinelInputRange.front can be called *before* checking empty. If it is empty, then the sentinel is returned.Yah, that all would fit C-style strings and singly-linked lists like a glove. I've been long hoping the opportunity to define such a range would present itself :o). Andrei
Feb 09 2013
On Wednesday, 6 February 2013 at 03:51:33 UTC, Andrei Alexandrescu wrote:I think it would be reasonable for a lexer to require a range of ubyte as input, and carry its own decoding. In the first approximation it may even require a random-access range of ubyte.Playing around with that, I discovered a bug in std.utf: slices and other ranges are not treated the same way by decodeFront, which is rather problematic. Jonathan also hit that bug : http://d.puremagic.com/issues/show_bug.cgi?id=9456 That makes the lexer hard to write with consistent behavior for InputRanges. The bug probably exists in everything that relies on decodeFront at some point.
Feb 08 2013
On Tuesday, February 05, 2013 01:07:48 Jonathan M Davis wrote:I don't have a lot of time to work on my lexer at the moment, but I'd really like to get it done soon, and I have most of the features in place. Unfortunately, when I went to try and work on it again the other day, the code wasn't compiling anymore, and I need to figure out why. I suspect that it's a regression related to string mixins, but I have to investigate further to sort it out.It turns out that it has nothing to do with string mixins (though it does have to do with CTFE): http://d.puremagic.com/issues/show_bug.cgi?id=9452 Fortunately, there's a simple workaround that'll let me continue until the bug is fixed. - Jonathan M Davis
Feb 05 2013
On 2013-02-05 11:46, Jonathan M Davis wrote:It turns out that it has nothing to do with string mixins (though it does have to do with CTFE): http://d.puremagic.com/issues/show_bug.cgi?id=9452 Fortunately, there's a simple workaround that'll let me continue until the bug is fixed.Ok, good that it's not a blocker. -- /Jacob Carlborg
Feb 05 2013
On 02/05/2013 07:19 AM, Brian Schott wrote:More optimizing: http://hackerpilot.github.com/experimental/std_lexer/images/times2.png Still only half speed. I'm becoming more and more convinced that Walter is actually a wizard.Time to do some hacking on your lexer I guess. I'll try to add a couple of tricks and see if it helps. What command do you use for benchmarking?
Feb 05 2013
On Tuesday, 5 February 2013 at 08:52:53 UTC, Dmitry Olshansky wrote:Time to do some hacking on your lexer I guess. I'll try add a couple of tricks and see if it helps. What command do you use for benchmarking?I've been using avgtime[1] for measuring run times, perf[2], and callgrind/kcachegrind[3] for profiling. [1] https://github.com/jmcabo/avgtime [2] https://perf.wiki.kernel.org/index.php/Tutorial [3] http://kcachegrind.sourceforge.net/html/Home.html
Feb 05 2013
05-Feb-2013 22:25, Brian Schott writes:On Tuesday, 5 February 2013 at 08:52:53 UTC, Dmitry Olshansky wrote:Thanks. I've made a pass through the code and found some places to improve. Sadly I've been rather busy at work today. Anyway, get ready for pull requests should my ideas prove to be worthy performance-wise. -- Dmitry OlshanskyTime to do some hacking on your lexer I guess. I'll try to add a couple of tricks and see if it helps. What command do you use for benchmarking?I've been using avgtime[1] for measuring run times, perf[2], and callgrind/kcachegrind[3] for profiling. [1] https://github.com/jmcabo/avgtime [2] https://perf.wiki.kernel.org/index.php/Tutorial [3] http://kcachegrind.sourceforge.net/html/Home.html
Feb 05 2013
On Tuesday, 5 February 2013 at 19:00:42 UTC, Dmitry Olshansky wrote:Thanks. I've made a pass through the code and found some places to improve. Sadly I've been rather busy at work today. Anyway get ready for pull requests should my ideas prove to be worthy performance-wise.http://hackerpilot.github.com/experimental/std_lexer/images/times3.png Your suggestions seem to have worked. We're below 20ms on my machine for the datetime module.
Feb 08 2013
On 2013-02-08 16:08, Brian Schott wrote:http://hackerpilot.github.com/experimental/std_lexer/images/times3.png Your suggestions seem to have worked. We're below 20ms on my machine for the datetime module.DMD is still consistently faster. :( -- /Jacob Carlborg
Feb 08 2013
On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:DMD is still consistently faster, :(We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.
Feb 08 2013
On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:For the lulz I compiled with "dmd -release -inline -noboundscheck -O -m64 -property": http://hackerpilot.github.com/experimental/std_lexer/images/times3-dmd.png Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.DMD is still consistently faster, :(We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.
Feb 08 2013
08-Feb-2013 19:34, Brian Schott writes:On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:Would be interesting to try gdc, as dmd on linux uses gcc. -- Dmitry OlshanskyOn Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:For the lulz I compiled with "dmd -release -inline -noboundscheck -O -m64 -property": http://hackerpilot.github.com/experimental/std_lexer/images/times3-dmd.png Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.DMD is still consistently faster. :(We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.
Feb 08 2013
On 8 February 2013 15:35, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:08-Feb-2013 19:34, Brian Schott writes: On Friday, 8 February 2013 at 15:23:00 UTC, Brian Schott wrote:On Friday, 8 February 2013 at 15:12:23 UTC, Jacob Carlborg wrote:For the lulz I compiled with "dmd -release -inline -noboundscheck -O -m64 -property": http://hackerpilot.github.com/experimental/std_lexer/images/times3-dmd.pngDMD is still consistently faster. :(We might be getting to the part where we say that code generated by gcc is still consistently faster than code generated by ldc.What? That's an outright fib. :-) -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';Yes, dmd vs ldc actually is a matter of 32 vs 18 ms for datetime.Would be interesting to try gdc, as dmd on linux uses gcc.
Feb 08 2013
On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote:What? That's an outright fib. :-)I think he means that "on linux the dmd binary is compiled by gcc"
Feb 08 2013
On 8 February 2013 15:46, Brian Schott <briancschott gmail.com> wrote:On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote:That's still lies. :o) -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';What? That's an outright fib. :-)I think he means that "on linux the dmd binary is compiled by gcc"
Feb 08 2013
08-Feb-2013 19:52, Iain Buclaw writes:On 8 February 2013 15:46, Brian Schott <briancschott gmail.com <mailto:briancschott gmail.com>> wrote: On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote: What? That's an outright fib. :-) I think he means that "on linux the dmd binary is compiled by gcc" That's still lies. :o)g++ ? :)-- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';-- Dmitry Olshansky
Feb 08 2013
On 8 February 2013 16:00, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:08-Feb-2013 19:52, Iain Buclaw writes:I see we could be doing this all day. :þ I'll lay down the hint: dmd compiles the source, not gcc. And although gcc may be invoked during a certain special stage of compilation, it's actually just a driver to call a certain toolchain program that is outside of gcc. :-) -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';On 8 February 2013 15:46, Brian Schott <briancschott gmail.com <mailto:briancschott gmail.com>> wrote: On Friday, 8 February 2013 at 15:42:47 UTC, Iain Buclaw wrote: What? That's an outright fib. :-) I think he means that "on linux the dmd binary is compiled by gcc" That's still lies. :o)g++ ? :)
Feb 08 2013
On Friday, 8 February 2013 at 16:38:00 UTC, Iain Buclaw wrote:I see we could be doing this all day. :þWe could.I'll lay down the hint, dmd compiles the source, not gcc. And although gcc may be invoked during a certain special stage of compilation, its actually just a driver to call a certain toolchain program that is outside of gcc. :-)What we're saying is that dmd, The Digital Mars D Compiler, is written in C++ and is thus built by GCC/G++. We can tell by looking at the Makefile that DMD doesn't build itself.
Feb 08 2013
On 8 February 2013 16:41, Brian Schott <briancschott gmail.com> wrote:On Friday, 8 February 2013 at 16:38:00 UTC, Iain Buclaw wrote:I see we could be doing this all day. :þWe could.I'll lay down the hint: dmd compiles the source, not gcc. And although gcc may be invoked during a certain special stage of compilation, it's actually just a driver to call a certain toolchain program that is outside of gcc. :-)That still has nothing to do with how dmd links D programs. ;) -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';What we're saying is that dmd, The Digital Mars D Compiler, is written in C++ and is thus built by GCC/G++. We can tell by looking at the Makefile that DMD doesn't build itself.
Feb 08 2013
08-Feb-2013 20:51, Iain Buclaw writes:On 8 February 2013 16:41, Brian Schott <briancschott gmail.com <mailto:briancschott gmail.com>> wrote: On Friday, 8 February 2013 at 16:38:00 UTC, Iain Buclaw wrote: I see we could be doing this all day. :þ We could. I'll lay down the hint: dmd compiles the source, not gcc. And although gcc may be invoked during a certain special stage of compilation, it's actually just a driver to call a certain toolchain program that is outside of gcc. :-) What we're saying is that dmd, The Digital Mars D Compiler, is written in C++ and is thus built by GCC/G++. We can tell by looking at the Makefile that DMD doesn't build itself. That still has nothing to do with how dmd links D programs. ;)We've been discussing DMD's lexer, written in C++, which is obviously compiled using the default compiler on the target platform. That's what is being measured: the good ol' C++ lexer vs the D lexer. I don't see how the way DMD does linking is related to what tool is used to actually build it. -- Dmitry Olshansky
Feb 08 2013
On Friday, 8 February 2013 at 16:00:05 UTC, Dmitry Olshansky wrote:08-Feb-2013 19:52, Iain Buclaw writes:I'd think it's built by DMC++, the Digital Mars C++ compiler.That's still lies. :o)g++ ? :)-- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Feb 09 2013
09-Feb-2013 23:20, SomeDude writes:On Friday, 8 February 2013 at 16:00:05 UTC, Dmitry Olshansky wrote:On linux, sure ;) -- Dmitry Olshansky08-Feb-2013 19:52, Iain Buclaw writes:I'd think it's built by DMC++, the Digital Mars C++ compiler.That's still lies. :o)g++ ? :)-- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Feb 09 2013
On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky wrote:Would be interesting to try gdc, as dmd on linux uses gcc.GDC decided to randomly not fail to build on my system, so it's time for MOAR CHARTS. dmd -inline -noboundscheck -O -w -wi -m64 -property ldc2 -O2 -release -vectorize -m64 gdc -O3 -fno-bounds-check -frelease -m64 http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png
Feb 08 2013
On Friday, 8 February 2013 at 16:35:50 UTC, Brian Schott wrote:dmd -inline -noboundscheck -O -w -wi -m64 -propertyCopy/paste fail. That's actually "dmd -release -inline -noboundscheck -O -w -wi -m64 -property"
Feb 08 2013
On 8 February 2013 16:35, Brian Schott <briancschott gmail.com> wrote:On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky wrote:Cool. How do you determine the speed times? -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';Would be intereesting to try gdc as dmd on linux uses gcc.GDC decided to randomly not fail to build on my system, so it's time for MOAR CHARTS. dmd -inline -noboundscheck -O -w -wi -m64 -property ldc2 -O2 -release -vectorize -m64 gdc -O3 -fno-bounds-check -frelease -m64 http://hackerpilot.github.com/**experimental/std_lexer/images/** times3-all.png<http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png>
Feb 08 2013
On 2013-02-08 17:35, Brian Schott wrote:GDC decided to randomly not fail to build on my system, so it's time for MOAR CHARTS.Interesting. Seems that LDC is faster than GDC but has a bigger startup overhead. OT: Why is the chart's legend in reverse order? :P
Feb 08 2013
On Friday, 8 February 2013 at 20:41:52 UTC, FG wrote:On 2013-02-08 17:35, Brian Schott wrote:It seems like that, but the differences are small enough that you can't make reliable statements without having a closer look at the statistics. David, who dreams of a world in which every graph has (at least) error barsGDC decided to randomly not fail to build on my system, so it's time for MOAR CHARTS.Interesting. Seems that LDC is faster than GDC but has a bigger start overhead. OT: Why is the chart's legend in reverse order? :P
Feb 08 2013
On 08.02.2013 17:35, Brian Schott wrote:On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky wrote:I still don't get what you're comparing here. You're benchmarking something like a mix of code generation, startup times, and filesystem behavior plus lexing time (plus lexing output style), and all of this only makes sense if your lexer produces the very same output as the dmd lexer (so it can be used out of the box as a replacement), for it to be comparable. Or aren't you at the stage of detailed benchmarking/profiling yet?Would be interesting to try gdc, as dmd on linux uses gcc.GDC decided to randomly not fail to build on my system, so it's time for MOAR CHARTS. dmd -inline -noboundscheck -O -w -wi -m64 -property ldc2 -O2 -release -vectorize -m64 gdc -O3 -fno-bounds-check -frelease -m64 http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png
Feb 08 2013
On 2013-02-08 17:35, Brian Schott wrote:On Friday, 8 February 2013 at 15:35:55 UTC, Dmitry Olshansky wrote:Do you have the exact code you test available somewhere? -- /Jacob CarlborgWould be intereesting to try gdc as dmd on linux uses gcc.GDC decided to randomly not fail to build on my system, so it's time for MOAR CHARTS. dmd -inline -noboundscheck -O -w -wi -m64 -property ldc2 -O2 -release -vectorize -m64 gdc -O3 -fno-bounds-check -frelease -m64 http://hackerpilot.github.com/experimental/std_lexer/images/times3-all.png
Feb 09 2013
08-Feb-2013 19:12, Jacob Carlborg пишет:On 2013-02-08 16:08, Brian Schott wrote:Keep in mind the D runtime start-up cost. In the end on small files DMD always wins because of slim C-runtime. To estimate runtime startup lag I've compared run times of int main(){ return 0; } in C (gcc): ------------------------ Total time (ms): 31637.7 Repetitions : 35000 Sample mode : 0.8 (23649 ocurrences) Median time : 0.873 Avg time : 0.903935 Std dev. : 0.204107 Minimum : 0.552 Maximum : 9.057 95% conf.int. : [0.503892, 1.30398] e = 0.400043 99% conf.int. : [0.378189, 1.42968] e = 0.525746 EstimatedAvg95%: [0.901796, 0.906073] e = 0.00213832 EstimatedAvg99%: [0.901125, 0.906745] e = 0.00281023 void main(){} in D (ldc): ------------------------ Total time (ms): 15429.4 Repetitions : 7000 Sample mode : 2.1 (1785 ocurrences) Median time : 2.128 Avg time : 2.2042 Std dev. : 0.466834 Minimum : 1.286 Maximum : 19.933 95% conf.int. : [1.28922, 3.11918] e = 0.914978 99% conf.int. : [1.00171, 3.40668] e = 1.20248 EstimatedAvg95%: [2.19326, 2.21514] e = 0.0109361 EstimatedAvg99%: [2.18983, 2.21857] e = 0.0143724 dmitry dmitry-VirtualBox ~ $ I think that the mode could serve as an indicator of the most probable outcome. For me it's around 0.8 vs 2.1 ms. D runtime rides on top of C one + plus the same OS costs to launch a process, so from now on let's keep in mind these extra ~ 1.3 ms of bias in favor of C-based lexer. -- Dmitry Olshanskyhttp://hackerpilot.github.com/experimental/std_lexer/images/times3.png Your suggestions seem to have worked. We're below 20ms on my machine for the datetime module.DMD is still consistently faster, :(
Feb 08 2013
On 2013-02-08 16:28, Dmitry Olshansky wrote:[start-up overhead measurements quoted in full in the previous post snipped]Ok, I see. But wouldn't it be more fair to compare either GCC to GDC or Clang to LDC? -- /Jacob Carlborg
Feb 08 2013
On 2/4/2013 7:19 PM, Brian Schott wrote:Still only half speed. I'm becoming more and more convinced that Walter is actually a wizard.I worship the Dark Arts.
Feb 08 2013
On Feb 9, 2013 8:01 AM, "Walter Bright" <newshound2 digitalmars.com> wrote:On 2/4/2013 7:19 PM, Brian Schott wrote:Still only half speed. I'm becoming more and more convinced that Walter is actually a wizard.I worship the Dark Arts.More like cursed the gods when you wrote it. :-)
Feb 09 2013
On Saturday, 9 February 2013 at 07:55:04 UTC, Walter Bright wrote:On 2/4/2013 7:19 PM, Brian Schott wrote:*recites incantations* *merges pull requests* *adds unit tests* http://hackerpilot.github.com/experimental/std_lexer/images/times4.pngStill only half speed. I'm becoming more and more convinced that Walter is actually a wizard.I worship the Dark Arts.
Feb 28 2013
On 2/28/13, Brian Schott <briancschott gmail.com> wrote:http://hackerpilot.github.com/experimental/std_lexer/images/times4.pngThat's a nice plot, but what is the X axis?
Feb 28 2013
On 2013-02-28 16:34, Andrej Mitrovic wrote:On 2/28/13, Brian Schott <briancschott gmail.com> wrote:Lexing time in milliseconds.http://hackerpilot.github.com/experimental/std_lexer/images/times4.pngThat's a nice plot, but what is the X axis?
Feb 28 2013
28-Feb-2013 13:09, Brian Schott wrote:On Saturday, 9 February 2013 at 07:55:04 UTC, Walter Bright wrote:Looking far better now. Keep in mind that we still don't use the original dmd's sentinel magic to avoid the length check in std.d.lexer :) Being that wizard who gave a couple of scrolls to Brian, I'll outline some interesting data points collected while helping him out. - The D run-time takes some time to start up/shut down. It's on the order of 1 ms for me vs 1/10th of a ms for plain C - expanding on the previous point - GC.malloc takes some time and triggers collections from time to time (2 collections to lex datetime, there is next to no garbage, but they run regardless). Even simply disabling the GC at first and re-enabling it at the end of the program speeds up the whole run by roughly 5% on my machine. - opApply is definitely slower than an inlined range-based foreach loop, even with a scope delegate. Beyond these specifics, the prime spells that give the performance are: - don't use the built-in AA or subtle allocations (avoid the GC as far as you can) - keep the common path lean: only the operations that are required in *every* code path - avoid ever doing the same work twice (redesign, hack and whatnot, but don't do it) -- Dmitry OlshanskyOn 2/4/2013 7:19 PM, Brian Schott wrote:*recites incantations* *merges pull requests* *adds unit tests* http://hackerpilot.github.com/experimental/std_lexer/images/times4.pngStill only half speed. I'm becoming more and more convinced that Walter is actually a wizard.I worship the Dark Arts.
Feb 28 2013
28-Feb-2013 21:27, Dmitry Olshansky wrote:28-Feb-2013 13:09, Brian Schott wrote:To clarify, these are problems I observed but haven't solved/avoided completely:On Saturday, 9 February 2013 at 07:55:04 UTC, Walter Bright wrote:Looking far better now. Keep in mind that we still don't use the original dmd's sentinel magic to avoid the length check in std.d.lexer :) Being that wizard who gave a couple of scrolls to Brian, I'll outline some interesting data points collected while helping him out.On 2/4/2013 7:19 PM, Brian Schott wrote:*recites incantations* *merges pull requests* *adds unit tests* http://hackerpilot.github.com/experimental/std_lexer/images/times4.pngStill only half speed. I'm becoming more and more convinced that Walter is actually a wizard.I worship the Dark Arts.- The D run-time takes some time to start up/shut down. It's on the order of 1 ms for me vs 1/10th of a ms for plain C - expanding on the previous point - GC.malloc takes some time and triggers collections from time to time (2 collections to lex datetime, there is next to no garbage, but they run regardless).And this one below is an experiment and is not related to the numbers posted above:Even simply disabling the GC at first and re-enabling it at the end of the program speeds up the whole run by roughly 5% on my machine.-- Dmitry Olshansky
Feb 28 2013