digitalmars.D - Improving std.regex(p)
- Andrei Alexandrescu (26/26) Jun 17 2010 There are currently two regexen in the standard library. The older one,
- Lars T. Kyllingstad (8/40) Jun 17 2010 There is the 'scregexp' project on dsource:
- Andrei Alexandrescu (9/52) Jun 18 2010 scregexp includes the following requirement within the license:
- David Gileadi (7/15) Jun 22 2010 Since I didn't see someone else reply, back when he announced the code h...
- Ben Hanson (19/19) Jun 18 2010 Hi Andrei,
- Ben Hanson (24/24) Jun 18 2010 Reading your original post again and thinking about this a bit more...
- bearophile (4/6) Jun 18 2010 This can be done with string mixin. You can use compile-time evaluation ...
- BCS (8/21) Jun 19 2010 OTOH, and IHMO it should be avoided where you can, kinda like goto. (For...
- Ben Hanson (9/28) Jun 19 2010 It occurred to me that you could have a normal regex library that builds...
- BCS (9/11) Jun 19 2010 I don't remember what versions this is (I've done about 3 that I can rem...
- Ben Hanson (31/40) Jun 19 2010 I can't really speak for the Spirit people, but it's certainly interesti...
- div0 (11/18) Jun 19 2010 I ported most of the classic version of spirit to d a while back.
- Ben Hanson (10/25) Jun 22 2010 Faster compile time code generation is a massive advantage for D.
- BCS (29/58) Jun 19 2010 I've never got a usable grammer for D to feed into it but I did get one ...
- Ben.Hanson (5/52) Jun 22 2010 Let's come back to this! For now I will concentrate on the regex stuff a...
- Don (8/34) Jun 19 2010 Whenever that happens, it's a compiler bug.
- Ben Hanson (7/41) Jun 19 2010 that and then just
- BCS (14/23) Jun 19 2010 The issue isn't that the error has no line number but that the line numb...
- Nick Sabalausky (4/17) Jun 22 2010 For reference, that's bug #2887:
- Lars T. Kyllingstad (13/20) Jun 18 2010 D's metaprogramming facilities are insanely powerful. You can take any
- Lars T. Kyllingstad (7/32) Jun 18 2010 I should mention that the makeParserCode() function in the example must
- bearophile (5/6) Jun 18 2010 One less if this works and gets accepted:
- Don (13/21) Jun 18 2010 It's a great piece of work, but the rest of the CTFE mechanics are not
- Nick Sabalausky (5/17) Jun 18 2010 IIRC, that's the bug (1330) that's exploted to detect runtime vs CTFE, s...
- Don (2/21) Jun 18 2010 Not any more. There's __ctfe now.
- Ellery Newcomer (10/34) Jun 18 2010 I've always understood that dfas have an exponential lower bound
- Ben Hanson (44/53) Jun 18 2010 I also wondered about this and originally just thought, well, you can so...
- Andrei Alexandrescu (6/55) Jun 18 2010 I think only backtracking engines (not DFAs) have the exponential run
- sybrandy (11/16) Jun 18 2010 I don't know how practical it is, but I had a thought to allow some of
- Ellery Newcomer (10/28) Jun 18 2010 The NFA->DFA algorithms are what I was referring to. My copy of the
- Ben Hanson (24/32) Jun 19 2010 I can't prove what the comlexity is for DFA compilation, but instead, I
- Ellery Newcomer (5/28) Jun 19 2010 it does
- Ben Hanson (42/77) Jun 19 2010 (just
- Ben Hanson (9/9) Jun 19 2010 I forgot to mention... Ideally I'd like some kind of abstract interface ...
- Nick Sabalausky (6/9) Jun 19 2010 That's cool. That's something I've often wanted in regexes (It would be
- Nick Sabalausky (4/13) Jun 19 2010 That's "...defining lexer tokens...", not "...defining lever tokens...",...
- Ben Hanson (10/19) Jun 22 2010 Yeah, it does seem obvious that there'd be some middle ground between re...
- Ben Hanson (2/18) Jun 21 2010 Try this link: http://www.arl.wustl.edu/~mbecchi/files/becchi_ancs2009.p...
- Ellery Newcomer (2/20) Jun 21 2010 thanks
- Jacob Carlborg (7/33) Jun 18 2010 There is already a compile time regular expression engine available in the...
- Don (7/51) Jun 18 2010 I wrote that code before CTFE was available. It helped to design D's
- Ben Hanson (5/5) Jun 18 2010 OK, I've just ordered The D Programming Language from Amazon. Hopefully ...
- Nick Sabalausky (5/29) Jun 18 2010 This would be a good thing for it to pay attention to, if it doesn't
- Ellery Newcomer (29/55) Jun 19 2010 Generally I think D's CT capabilities have a way to go yet before this
- BCS (6/9) Jun 19 2010 You use the type system. I can say from experience that (memory issues a...
- Nick Sabalausky (3/9) Jun 19 2010 Trivial example?
- BCS (52/66) Jun 19 2010 Building an arbitrary tree:
- Ellery Newcomer (3/19) Jun 19 2010 Okay, you use structs without indirection. I guess that works. How much
- Don (4/26) Jun 20 2010 Currently CTFE uses copy-on-write. If you're just passing the tree, not
- BCS (6/29) Jun 20 2010 At compile time as a template parameter? I think it gets passed around a...
- Ellery Newcomer (2/23) Jun 21 2010 Nevermind, I see what you're doing now.
There are currently two regexen in the standard library. The older one, std.regexp, is time-tested but only works with UTF8 and has a clunkier API. The newer one, std.regex, isolates the engine from the matches (and therefore can reuse and cache engines more easily), and supports all character widths. But it's less tested and doesn't have that great of an interface because it pretty much inherits the existing one.

I wish to improve regex handling in Phobos. The most important improvement is not in the interface - it's in the engine. The current engine is adequate but nothing to write home about, and for simple regexen is markedly slower than equivalent hand-written code (e.g. matching whitespace).

One great opportunity would be for D to leverage its uncanny compile-time evaluation abilities and offer a regex that parses the pattern during compilation:

foreach (s; splitter(line, sregex!",[ \t\r]*"))
{
    ...
}

Such a static regex could be simpler than a full-blown regex with captures and backreferences etc., but it would have guaranteed performance (e.g. it would be an automaton instead of a backtracking engine) and would be darn fast because it would generate custom code for each regex pattern. See related work:

http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

If we get as far as implementing what RE2 can do with compile-time evaluation, people will definitely notice. If there's anyone who'd want to tackle such a project (for Phobos or not), I highly encourage you to do so.


Andrei
Jun 17 2010
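The "generate custom code for each regex pattern" idea above can be illustrated with a string-mixin sketch. This is not the proposed sregex API; all names (makeClassMatcher, StaticDelim) are invented, and only a single delimiter character set is compiled, not a full regex:

    // Build, at compile time, the source of a membership test for a
    // fixed set of delimiter characters, then mix it in. A real sregex
    // would compile the whole pattern to a DFA the same way.
    import std.stdio;

    // Escape characters that cannot appear raw in a char literal.
    string escapeChar(char c)
    {
        switch (c)
        {
            case '\t': return "\\t";
            case '\r': return "\\r";
            case '\n': return "\\n";
            case '\\': return "\\\\";
            case '\'': return "\\'";
            default:   return "" ~ c;
        }
    }

    // Runs under CTFE: emit a switch with one case per delimiter.
    string makeClassMatcher(string chars)
    {
        string code = "static bool isDelim(dchar c) { switch (c) {\n";
        foreach (c; chars)
            code ~= "case '" ~ escapeChar(c) ~ "': return true;\n";
        code ~= "default: return false; } }";
        return code;
    }

    struct StaticDelim(string chars)
    {
        mixin(makeClassMatcher(chars));  // custom code for this pattern
    }

    void main()
    {
        alias D = StaticDelim!(", \t\r");
        writeln(D.isDelim(','));  // true
        writeln(D.isDelim('x'));  // false
    }

The point is that the per-pattern switch is ordinary generated code the optimizer can see, unlike a table interpreted at runtime.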
On Thu, 17 Jun 2010 21:44:03 -0700, Andrei Alexandrescu wrote:
> If there's anyone who'd want to tackle such a project (for Phobos or
> not), I highly encourage you to do so.

There is the 'scregexp' project on dsource:

http://www.dsource.org/projects/scregexp/

It's D1/Tango, but maybe it could be adapted to D2/Phobos? It would at least serve as a starting point for anyone wanting to try their hand at doing this.

-Lars
Jun 17 2010
Lars T. Kyllingstad wrote:
> There is the 'scregexp' project on dsource:
>
> http://www.dsource.org/projects/scregexp/
>
> It's D1/Tango, but maybe it could be adapted to D2/Phobos? It would at
> least serve as a starting point for anyone wanting to try their hand
> at doing this.

scregexp includes the following requirement within the license:

"Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution."

That would need to be changed before inclusion in Phobos. It looks like there are three people in the copyright notice: Walter Bright, Marton Papp, and yidabu. Does anyone know Marton's email address?

Andrei
Jun 18 2010
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
> That would need to be changed before inclusion in Phobos. It looks like
> there are three people in the copyright notice: Walter Bright, Marton
> Papp, and yidabu. Does anyone know Marton's email address?

Since I didn't see someone else reply, back when he announced the code he used anteusz freemail.hu. Google also turned up http://www.equinoxbase.com/. I'm not sure it's the same fellow, but the page has his name, mentions D and has a contact form.

You could also try PMing him at dsource: http://www.dsource.org/forums/profile.php?mode=viewprofile&u=1513
Jun 22 2010
Hi Andrei,

I am unable to offer to write an entire compile time regex library in D, but it is certainly an interesting subject for me and I would be happy to contribute to any discussions. I am sure a D implementation could be much more successful than a C++ TMP DFA producer (see http://old.nabble.com/-regex%2C-xpressive-- interesting(-)-perf-benchmark-to28800789.html#a28815063 - I'm sure you have already! ;-))

I am not really into D yet, although I've been following this group for a while. The release of Visual D is a huge improvement and of course I am very interested in your new book!

Here are the links to the DFA implementations I have done in C++:

http://www.benhanson.net/lexertl.html
http://www.codeproject.com/KB/recipes/wc.aspx

hopefully emitting CLR byte-codes. I would be happy to talk to anyone concerning all things DFA (and it would be intriguing to code up or try some code snippets!)

Regards,

Ben
Jun 18 2010
Reading your original post again and thinking about this a bit more...

If someone can help me get up to speed with TMP in D, I could probably put together a proof of concept pretty quickly. Aside from the D syntax, it is all in the design (which is why I would like to discuss it in more detail). For anyone who is interested, here is how I go about DFA production currently:

- Bog standard tokenisation of regex tokens
- Normalise each regex charset and enumerate in a map
- Build a regex syntax tree with leaf nodes storing the enumerated charset id
- Intersect all the charsets and build a vector of equivalent charsets
- Perform the followset to DFA transform as described in the Dragon book

For my lexer generator, I build a lookup table for each character to an equivalence class. For my wild card matcher, I leave the representation as characters and do a linear lookup of chars to 'next' state. The lexer equivalence class method (a la flex) is more efficient as you just have two lookups per character with no linear searches.

I'm thinking I could look at getting as far as the regex syntax tree and then we could discuss DFA representation (it sounds like you want to generate code directly - how do you do that?! This is why I mentioned the .NET bytecode stuff - I imagine there is some overlap there, but I don't know enough to be sure. Can you output source code at compilation time and have D compile it, or do you use a load of recursion like C++ TMP?)

And I forgot to mention - boost.spirit uses my lexertl library.

Regards,

Ben
Jun 18 2010
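The charset-equivalence step in the list above can be sketched as follows. This is a plain runtime illustration, not lexertl code; partitionClasses and its bitmask signature scheme are invented for the example (and cap the number of charsets at 64):

    // Characters that no regex charset distinguishes belong to the same
    // equivalence class, so DFA transition tables need one column per
    // class instead of one per character.
    import std.stdio;

    // Map every byte value to an equivalence class id, given the
    // membership tables of all charsets used by the pattern(s).
    ubyte[256] partitionClasses(const bool[256][] charsets)
    {
        ubyte[256] classOf;
        ubyte next = 0;
        ubyte[ulong] seen;  // membership signature -> class id
        foreach (c; 0 .. 256)
        {
            ulong sig = 0;
            foreach (i, set; charsets)
                if (set[c]) sig |= 1UL << i;
            if (auto p = sig in seen)
                classOf[c] = *p;
            else
            {
                seen[sig] = next;
                classOf[c] = next++;
            }
        }
        return classOf;
    }

    void main()
    {
        bool[256] ws;
        foreach (ch; " \t\r") ws[ch] = true;
        bool[256] comma;
        comma[','] = true;

        auto cls = partitionClasses([ws, comma]);
        writeln(cls[' '] == cls['\t']);  // true: same class
        writeln(cls[' '] == cls[',']);   // false: distinguished
    }

With the two charsets above, 256 characters collapse into three classes (whitespace, comma, everything else), which is the "two lookups per character" win Ben mentions.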
Ben Hanson:
> Can you output source code at compilation time and have D compile it,
> or do you use a load of recursion like C++ TMP?

This can be done with string mixin. You can use compile-time evaluation to build a string of code, and then mix it in somewhere; the compiler will compile it. It's a dirty thing, but unless D3 gets macros, it's the thing to be used. You can use it to build a string of asm code too, because it's part of D syntax.

Bye,
bearophile
Jun 18 2010
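The technique bearophile describes, in miniature (declareCounters is a made-up generator, shown only to demonstrate the mechanics):

    // Build code as a string with an ordinary function, then compile it
    // with mixin(). The function runs under CTFE because its result is
    // needed at compile time.
    import std.stdio;

    string declareCounters(int n)
    {
        string code;
        foreach (i; 0 .. n)
            code ~= "int counter" ~ cast(char)('0' + i) ~ " = "
                  ~ cast(char)('0' + i) ~ ";\n";
        return code;
    }

    void main()
    {
        mixin(declareCounters(3));  // declares counter0, counter1, counter2
        writeln(counter0 + counter1 + counter2);  // 3
    }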
Hello bearophile,

> This can be done with string mixin. You can use compile-time
> evaluation to build a string of code, and then mix it in somewhere;
> the compiler will compile it.

OTOH, and IMHO it should be avoided where you can, kinda like goto. (For one thing, forget having usable line numbers in your error messages.) You can often get a lot done with static if, tuple foreach, const values fed to boilerplate code and the like, with just a scattering of one or two line mixins stuffed in the middle.

-- 
... <IXOYE><
Jun 19 2010
== Quote from BCS (none anon.com)'s article
> OTOH, and IMHO it should be avoided where you can, kinda like goto.
> (For one thing, forget having usable line numbers in your error
> messages.)

It occurred to me that you could have a normal regex library that builds the regex at runtime, debug it, produce a code generator (similar code to re2c), debug that, and then just wrap it with the mixin at the end. If you run into bugs, you can just do it at runtime instead and debug.

The nice thing is, if I can get this converted, the lexer library is just slightly different. That means you could have the same approach with a lexer generator! This would lead to huge interest from the boost.spirit guys...

Regards,

Ben
Jun 19 2010
Hello Ben,

> This would lead to huge interest from the boost.spirit guys...

I don't remember what version this is (I've done about 3 that I can remember) or even if it works, but I wonder if they would have any interest in this:

http://www.dsource.org/projects/scrapple/browser/trunk/dparser/dparse.d

Note (if that is the version I think it is) that the only mixins in that are tiny (mixin("label_"~tag!(id)~":;"); and mixin("goto label_"~tag!(id)~";");) or generating a big list of invocations of other code.

-- 
... <IXOYE><
Jun 19 2010
Hi BCS,

== Quote from BCS (none anon.com)'s article
> I don't remember what version this is (I've done about 3 that I can
> remember) or even if it works, but I wonder if they would have any
> interest in this:
>
> http://www.dsource.org/projects/scrapple/browser/trunk/dparser/dparse.d

I can't really speak for the Spirit people, but it's certainly interesting to me! :-)

Here's some background to make things clearer: I'm not even currently a Spirit user myself as I've only really needed a tokeniser at work (they think that's mind-bogglingly sophisticated, never mind using a beast like Spirit! ;-)). The only reason they were interested in my lexer generator is that recursive descent lexing is pretty slow. So for more demanding grammars, a DFA based lexer is better. The lexer library allows you to create a lexer at runtime (which doesn't fit in so well with their model of doing everything at compile time), or you can generate code offline and then just add that to your project. This is why a compile time DFA lexer would be really interesting to them.

From memory (Joel did a PDF recently, but that is on my works machine) Joel has been developing Spirit for over ten years. The latest version is pretty sophisticated and has all kinds of clever stuff for directly parsing data in structures all inline etc. Needless to say, I find the whole thing pretty mind boggling. The biggest problem (as far as I can see as an observer) is the compile times. This is where D could be really interesting overall (Hartmut has certainly got his eye on it and in fact I'm sure all the Boost people do).

For my part, I'd like to see an LR parser generator developed. I'd be happy with one that creates the parser at runtime (assuming decent performance), but if it can generate and compile code efficiently at compile time, so much the better! :-) I really like the idea of being able to switch the runtime/compile time behaviour.

When it comes to the code generation part for the DFA regex engine in D, I'd be happy to talk to you further about the techniques you've employed regarding emit. I'm still completely new to D, so it'll take me a while! :-)

Cheers,

Ben
Jun 19 2010
On 19/06/2010 20:39, Ben Hanson wrote:
> The biggest problem (as far as I can see as an observer) is the
> compile times. This is where D could be really interesting overall
> (Hartmut has certainly got his eye on it and in fact I'm sure all the
> Boost people do).

I ported most of the classic version of spirit to d a while back.

I've recently written a XML parser using it and it takes nearly 7 whole seconds to compile with DMD, which is a vast amount of time compared to rest of the stuff I compile. :)

The most trivial spirit parser in c++ takes over 30 seconds on my machine, even with everything in the pre compiled header.

D is just a massive win for productivity.

-- 
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk
Jun 19 2010
== Quote from div0 (div0 users.sourceforge.net)'s article
> The most trivial spirit parser in c++ takes over 30 seconds on my
> machine, even with everything in the pre compiled header.
>
> D is just a massive win for productivity.

Faster compile time code generation is a massive advantage for D.

Hartmut contacted me about a runtime LALR parser generator called Sweet Parser (http://www.sweetsoftware.co.nz/parser_overview.php). They've contacted the author about integrating a version of that into Spirit. A version in D that could run at compile time would be cool too... Bizarre timing or what! It looks like interesting times lie ahead...

Regards,

Ben
Jun 22 2010
Hello Ben,

> The biggest problem (as far as I can see as an observer) is the
> compile times.

I've never got a usable grammar for D to feed into it but I did get one (~200 productions) that at least compiled: it took 7 (or 17 minutes, I forget) and 700MB of RAM to compile (P-III 550MHz, >700MB RAM).

> For my part, I'd like to see an LR parser generator developed. I'd be
> happy with one that creates the parser at runtime (assuming decent
> performance), but if it can generate and compile code efficiently at
> compile time, so much the better! :-) I really like the idea of being
> able to switch the runtime/compile time behaviour.

Given that most LR parsers are table driven, all that would be needed for a compile time system would be that the table generator be evaluatable at compile time. The result could be stored in a constant. The only tricky bits then would be handling the data values on the stack and the part to tack in the action rules, and that could just be a naming convention:

class MyActions
{
    resultType Action_statement_if(/+ args +/) { /+ code +/ }
    resultType Action_statement_ifThen(/+ args +/) { /+ code +/ }
    resultType Action_addExpression_plus(/+ args +/) { /+ code +/ }
    resultType Action_addExpression_minus(/+ args +/) { /+ code +/ }
    resultType Action_addExpression_chain(/+ args +/) { /+ code +/ }
    // ...

    mixin Parser!(
        `statement : "if" "(" expression ")" statement / if
                   | "if" "(" expression ")" statement "then" statement / ifThen
                   | "{" statement* "}" / block
                   | expression ";" / expression;
         addExpression : ..... `);
}

Parser!(string) would generate a function that just assumes the existence of functions of the given names.

-- 
... <IXOYE><
Jun 19 2010
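The "table generator evaluatable at compile time, result stored in a constant" idea can be shown in a few lines. buildTable and its dummy contents are invented; a real generator would compute LR shift/reduce actions from the grammar:

    // A module-level immutable initialized from a function call forces
    // the call to run under CTFE, so the table is baked into the binary
    // with zero runtime generation cost.
    import std.stdio;

    struct Entry { int shift; int reduce; }

    Entry[3][3] buildTable()  // stand-in for a real LR table generator
    {
        Entry[3][3] t;
        foreach (s; 0 .. 3)
            foreach (sym; 0 .. 3)
                t[s][sym] = Entry(s + 1, -1);  // dummy actions
        return t;
    }

    static immutable parseTable = buildTable();  // computed at compile time

    void main()
    {
        writeln(parseTable[0][0].shift);  // 1
    }

The same buildTable function could be called at runtime instead, which is exactly the runtime/compile-time switching Ben asked about.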
== Quote from BCS (none anon.com)'s article
> Given that most LR parsers are table driven, all that would be needed
> for a compile time system would be that the table generator be
> evaluatable at compile time. The result could be stored in a constant.

Let's come back to this! For now I will concentrate on the regex stuff and review the Sweet Parser for the Spirit guys.

Regards,

Ben
Jun 22 2010
Ben Hanson wrote:
> == Quote from BCS (none anon.com)'s article
>> OTOH, and IMHO it should be avoided where you can, kinda like goto.
>> (For one thing, forget having usable line numbers in your error
>> messages.)

Whenever that happens, it's a compiler bug. If you can produce a test case where that happens, please put it in Bugzilla.

> It occurred to me that you could have a normal regex library that
> builds the regex at runtime, debug it, produce a code generator
> (similar code to re2c), debug that, and then just wrap it with the
> mixin at the end. If you run into bugs, you can just do it at runtime
> instead and debug.

That is exactly the way I recommend to develop CTFE code. Get the code completely working, run unit tests on every component, and then CTFE it.
Jun 19 2010
== Quote from Don (nospam nospam.com)'s article
> That is exactly the way I recommend to develop CTFE code. Get the code
> completely working, run unit tests on every component, and then CTFE
> it.

Good to know Don, thanks!
Jun 19 2010
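Don's develop-then-CTFE workflow, in miniature (repeatCode is a toy generator made up for the example):

    // Debug the generator as ordinary runtime code first (unittest),
    // then evaluate the very same function at compile time via mixin().
    import std.stdio;

    string repeatCode(string stmt, int n)
    {
        string code;
        foreach (_; 0 .. n)
            code ~= stmt;
        return code;
    }

    unittest
    {
        // Ordinary runtime call: easy to debug with a debugger/printf.
        assert(repeatCode("x++;", 2) == "x++;x++;");
    }

    void main()
    {
        int x = 0;
        mixin(repeatCode("x++;", 3));  // same function, now run under CTFE
        writeln(x);  // 3
    }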
Hello Don,

> Whenever that happens, it's a compiler bug. If you can produce a test
> case where that happens, please put it in Bugzilla.

The issue isn't that the error has no line number but that the line number is the line of the mixin or the line within the mixin (I forget which); to be useful you need both AND the full text of the mixin.

Yes, there are ways around this, but my original point is that in my experience I can get just about as much done, and for about the same effort, with very little code being mixed in.

My approach to this kind of problem has typically been to use CTFE to process input, templates and/or mixins to generate an AST (or whatever the problem demands) represented as a type, and normal templates, static ifs and tuple foreach to do the code generation, with the occasional line of code mixed in here and there. The result of this is that most of the code is transparently visible as code rather than as string literals.

-- 
... <IXOYE><
Jun 19 2010
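The "AST represented as a type" approach BCS describes can be sketched without any string mixins at all. Leaf, Node and sumTree are invented names; a parser generator would build such types from the grammar and walk them to emit code:

    // Encode a tree in the type system, then process it with static
    // introspection: is-expressions deconstruct the type, and the whole
    // computation happens at compile time.
    import std.stdio;

    struct Leaf(int v) {}
    struct Node(L, R) {}

    template sumTree(T)
    {
        static if (is(T == Leaf!v, int v))
            enum sumTree = v;
        else static if (is(T == Node!(L, R), L, R))
            enum sumTree = sumTree!L + sumTree!R;
        else
            static assert(0, "not a tree node: " ~ T.stringof);
    }

    void main()
    {
        alias T = Node!(Leaf!1, Node!(Leaf!2, Leaf!3));
        writeln(sumTree!T);  // 6
    }

Errors in such code point at real source lines, which is BCS's argument for preferring this style over large mixed-in strings.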
"BCS" <none anon.com> wrote in message news:a6268ff154bd8ccddee0380cd08 news.digitalmars.com...Hello Don,http://d.puremagic.com/issues/show_bug.cgi?id=2887The issue isn't that the error has no line number but that the line number is the line of the mixin or the line within the mixin (I forget which) to be useful you need both AND the full text of the mixin.== Quote from BCS (none anon.com)'s articleWhenever that happens, it's a compiler bug. If you can produce a test case where that happens, please put it in Bugzilla.OTOH, and IHMO it should be avoided where you can, kinda like goto. (For one things, forget having usable line numbers in your error messages.)
Jun 22 2010
On Fri, 18 Jun 2010 11:36:48 +0000, Ben Hanson wrote:
> I'm thinking I could look at getting as far as the regex syntax tree and then we could discuss DFA representation (it sounds like you want to generate code directly - how do you do that?!). This is why I mentioned the .NET bytecode stuff - I imagine there is some overlap there, but I don't know enough to be sure. Can you output source code at compilation time and have D compile it, or do you use a load of recursion like C++ TMP?

D's metaprogramming facilities are insanely powerful. You can take any string containing valid source code (that is known at compile time, of course) and "paste" it into your code with the mixin() expression:

mixin("int i;");
i = 2;

For the regex parser, I suppose you'd do something like:

string makeParserCode(string regex) { ... }

struct sregex(string regex)
{
    mixin(makeParserCode(regex));
}

-Lars
Jun 18 2010
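Fleshing the sketch above out into something compilable: the following is a minimal, hypothetical stub. A real makeParserCode would parse the pattern and emit DFA-matching code; this one merely generates code that stores the pattern and compares literally.

```d
// Hypothetical stub: a real makeParserCode would parse `regex` and emit
// DFA-matching code; this one just stores the pattern and compares
// literally (patterns containing '"' would need escaping).
string makeParserCode(string regex)
{
    return `enum pattern = "` ~ regex ~ `";
            bool match(string s) { return s == pattern; }`;
}

struct sregex(string regex)
{
    // The string is built during compilation, then compiled as D code.
    mixin(makeParserCode(regex));
}

void main()
{
    sregex!"abc" r;
    static assert(sregex!"abc".pattern == "abc"); // generated at compile time
    assert(r.match("abc"));
    assert(!r.match("abd"));
}
```

The point of the stub is only to show the shape of the technique: everything inside the mixin string is ordinary D code once the compiler pastes it into the struct.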
On Fri, 18 Jun 2010 12:09:54 +0000, Lars T. Kyllingstad wrote:
> On Fri, 18 Jun 2010 11:36:48 +0000, Ben Hanson wrote:
>> Can you output source code at compilation time and have D compile it, or do you use a load of recursion like C++ TMP?
> D's metaprogramming facilities are insanely powerful. You can take any string containing valid source code (that is known at compile time, of course) and "paste" it into your code with the mixin() expression: mixin("int i;"); i = 2; For the regex parser, I suppose you'd do something like: string makeParserCode(string regex) { ... } struct sregex(string regex) { mixin(makeParserCode(regex)); }

I should mention that the makeParserCode() function in the example must of course be evaluatable at compile time, and there are some restrictions on such functions. Lately, many of those restrictions have been overcome, though, and I'm not sure what's left. Does anyone know what the current limitations on CTFE are?

-Lars
Jun 18 2010
Lars T. Kyllingstad:
> Does anyone know what the current limitations on CTFE are?

One less if this works and gets accepted: http://d.puremagic.com/issues/show_bug.cgi?id=3050

Bye,
bearophile
Jun 18 2010
bearophile wrote:
> Lars T. Kyllingstad:
>> Does anyone know what the current limitations on CTFE are?
> One less if this works and gets accepted: http://d.puremagic.com/issues/show_bug.cgi?id=3050
> Bye,
> bearophile

It's a great piece of work, but the rest of the CTFE mechanics are not ready for that patch. The one which is blocking everything in CTFE is bug 1330. At present, CTFE uses copy-on-write, and that doesn't work with reference types. Fixing that is going to require large changes throughout the compiler, so I've been holding off on that one. Meanwhile I've been trying to fix all the non-reference bugs in CTFE and assembling a comprehensive test suite. So to answer the original question -- the main things that don't work in CTFE that are allowable in safe D are classes, exceptions, and built-in features which are implemented in the runtime rather than in the compiler (including array operations).
Jun 18 2010
"Don" <nospam nospam.com> wrote in message news:hvgasq$2k3a$1 digitalmars.com...bearophile wrote:IIRC, that's the bug (1330) that's exploted to detect runtime vs CTFE, so if it's fixed, that would break any code that does that detection, so another CTFE-detection mechanism may be needed before a 1330 fix.Lars T. Kyllingstad:It's a great piece of work, but the rest of the CTFE mechanics are not ready for that patch. The one which is blocking everything in CTFE is bug 1330.Does anyone know what the current limitations on CTFE are?One less if this works and gets accepted: http://d.puremagic.com/issues/show_bug.cgi?id=3050 Bye, bearophile
Jun 18 2010
Nick Sabalausky wrote:
> "Don" <nospam nospam.com> wrote in message news:hvgasq$2k3a$1 digitalmars.com...
>> It's a great piece of work, but the rest of the CTFE mechanics are not ready for that patch. The one which is blocking everything in CTFE is bug 1330.
> IIRC, that's the bug (1330) that's exploited to detect runtime vs CTFE, so if it's fixed, that would break any code that does that detection, so another CTFE-detection mechanism may be needed before a 1330 fix.

Not any more. There's __ctfe now.
Jun 18 2010
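A minimal illustration of the __ctfe detection Don mentions (assuming a DMD recent enough to have it): the same function takes a pure-arithmetic path under CTFE and is free to use runtime-only facilities, such as C's malloc, otherwise.

```d
import core.stdc.stdlib : free, malloc;

int sum(int n)
{
    if (__ctfe)
    {
        // CTFE path: plain arithmetic only.
        int total = 0;
        foreach (i; 1 .. n + 1)
            total += i;
        return total;
    }
    else
    {
        // Runtime path may use facilities CTFE cannot, e.g. C's malloc.
        auto total = cast(int*) malloc(int.sizeof);
        scope(exit) free(total);
        *total = 0;
        foreach (i; 1 .. n + 1)
            *total += i;
        return *total;
    }
}

enum ct = sum(10);        // forces compile-time evaluation
static assert(ct == 55);

void main()
{
    assert(sum(10) == 55); // ordinary runtime call takes the malloc path
}
```

Unlike the old bug-1330 trick, __ctfe is an ordinary runtime-false/CTFE-true variable, so the branch costs nothing once the optimizer sees it.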
On 06/18/2010 06:36 AM, Ben Hanson wrote:
> Reading your original post again and thinking about this a bit more... If someone can help me get up to speed with TMP in D, I could probably put together a proof of concept pretty quickly. Aside from the D syntax, it is all in the design (which is why I would like to discuss it in more detail). For anyone who is interested, here is how I go about DFA production currently:
> - Bog standard tokenisation of regex tokens
> - Normalise each regex charset and enumerate in a map
> - Build a regex syntax tree with leaf nodes storing the enumerated charset id
> - Intersect all the charsets and build a vector of equivalent charsets
> - Perform the followset to DFA transform as described in the Dragon book
> For my lexer generator, I build a lookup table for each character to an equivalence class. For my wild card matcher, I leave the representation as characters and do a linear lookup of chars to 'next' state. The lexer equivalence class method (a la flex) is more efficient as you just have two lookups per character with no linear searches. I'm thinking I could look at getting as far as the regex syntax tree and then we could discuss DFA representation (it sounds like you want to generate code directly - how do you do that?!). This is why I mentioned the .NET bytecode stuff - I imagine there is some overlap there, but I don't know enough to be sure. Can you output source code at compilation time and have D compile it, or do you use a load of recursion like C++ TMP? And I forgot to mention - boost.spirit uses my lexertl library.
> Regards, Ben

I've always understood that DFAs have an exponential lower bound relative to .. something .. but I don't believe I have ever actually used a DFA generator (or written one, sadly enough). What has your experience been on DFA compile time? I suppose RE2's syntax is a good starting point: http://code.google.com/p/re2/wiki/Syntax I wonder about those unicode character classes, though. Looks like an imminent head-on collision with dmd's issues regarding memory management at compile time.
Jun 18 2010
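The "two lookups per character" scheme described above can be sketched with tiny hand-written tables. These tables are hypothetical, hand-built for the regex ab+ rather than generated, purely to show the shape of the lookup.

```d
void main()
{
    // First lookup: character -> equivalence class.
    // Class 0 = 'a', class 1 = 'b', class 2 = everything else.
    ubyte[256] ec = 2;
    ec['a'] = 0;
    ec['b'] = 1;

    // Second lookup: dfa[state][class] -> next state; `reject` is dead.
    enum size_t reject = size_t.max;
    size_t[3][3] dfa = [
        [1, reject, reject],  // state 0: expecting 'a'
        [reject, 2, reject],  // state 1: expecting first 'b'
        [reject, 2, reject],  // state 2 (accepting): further 'b's
    ];

    bool match(string s)
    {
        size_t state = 0;
        foreach (ch; s)
        {
            state = dfa[state][ec[cast(ubyte) ch]];  // two table lookups
            if (state == reject)
                return false;
        }
        return state == 2; // state 2 is the only accepting state
    }

    assert(match("ab"));
    assert(match("abbb"));
    assert(!match("ba"));
    assert(!match("a"));
}
```

No linear searches occur anywhere in the hot loop: each input character costs exactly one equivalence-class lookup plus one transition lookup, which is the flex-style property Ben describes.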
Hi Ellery,

== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
> I've always understood that DFAs have an exponential lower bound relative to .. something .. but I don't believe I have ever actually used a DFA generator (or written one, sadly enough). What has your experience been on DFA compile time? I suppose RE2's syntax is a good starting point: http://code.google.com/p/re2/wiki/Syntax I wonder about those unicode character classes, though. Looks like an imminent head-on collision with dmd's issues regarding memory management at compile time.

I also wondered about this and originally just thought, well, you can solve any performance problems with decent optimal code! :-) So this train of thought was years ago now, and in reality the only way I can really see to bring DFA compilation to its knees is to use massive repeats. DFA also implies a more limited syntax (Andrei hinted at this in his original post) and there are certain things you just don't need to worry about any more, such as 'possessive' quantifiers. Traditionally DFA regex has been used for lexer generators as they generally don't need some of the more outlandish 'extra' features, so it's possible that a straight up regex engine would have more demands put on it, but there again, a lexer generator is dealing with a list of regexes, not just one! The last time I timed it, lexertl could compile all of the C++ token rules in well under 0.5 seconds. There are such things as tagged DFAs, which I would love to know more about. The Tango regex library uses them for captures I believe. If anyone can provide more info on how TDFAs really work that would be fantastic. All the literature I've read didn't exactly enlighten me (including Ville Laurikari's papers). The full RE2 syntax is way more than I could provide. See http://www.benhanson.net/lexertl.html for a table of regex syntax I know how to support. On the subject of Unicode chars, this is an interesting topic.
The Tango regex library gets around the storage requirements by having a list of ranges. The approach lexertl takes is to maintain a sorted string of chars. The trick is, if the string has more than 0.5 of the max possible number of chars in it, you negate it. This is actually part of the normalisation process and leads to quicker intersections later on. When it comes to the final machine, lexertl slices 'wide chars' into bytes for lookup now, keeping the first stage lookup to 256 entries (I can elaborate if anyone wants me to). Of course, if you are generating code (as seems to be desired) none of this matters very much. Don't forget that if you intersect all the charsets *once* after building your syntax tree, then when you do transition intersections, you are only intersecting sets of indexes to charset equivalence classes, which is really quick. For a quick example:

- Say you have two char classes, one is [a-z] and the other one is [a].
- You intersect these classes once, giving [b-z] and [a].
- So if originally [a-z] = 0 and [a] = 1 (as I said, charsets are enumerated as they are scanned), then now the equivalence index sets are [a-z] = {0, 1} and [a] = {1}, due to [b-z] = 0 and [a] = 1.

When you generate the DFA you are then intersecting {0, 1} and {1} rather than the actual charsets. The sets are obviously really small! :-)

Hope that was useful, or at least interesting!

Regards, Ben
Jun 18 2010
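The [a-z]/[a] example can be sketched directly. This hypothetical simplification uses bool[256] membership maps rather than lexertl's sorted strings, but the equivalence-class idea is the same.

```d
// Hypothetical sketch: [a-z] and [a] are intersected once, yielding the
// disjoint classes [b-z] (class 0) and [a] (class 1); thereafter DFA
// construction intersects small index sets, never character sets.
import std.algorithm.setops : setIntersection;
import std.array : array;

void main()
{
    bool[256] az, onlyA;
    foreach (c; 'a' .. 'z' + 1)
        az[c] = true;      // [a-z]
    onlyA['a'] = true;     // [a]

    // Disjoint class 0 = [b-z]: in [a-z] but not in [a].
    bool[256] bz;
    foreach (i; 0 .. 256)
        bz[i] = az[i] && !onlyA[i];
    assert(bz['b'] && !bz['a']);

    // Each original charset becomes a sorted set of class indices.
    int[][2] classSets = [[0, 1],  // [a-z] = {0, 1}
                          [1]];    // [a]   = {1}

    // Transition intersection is now over tiny index sets.
    auto common = setIntersection(classSets[0], classSets[1]).array;
    assert(common == [1]); // only the [a] class is shared
}
```

However many characters the original classes span, the sets that get intersected during DFA construction stay as small as the number of equivalence classes.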
Ellery Newcomer wrote:
> On 06/18/2010 06:36 AM, Ben Hanson wrote:
>> Reading your original post again and thinking about this a bit more... If someone can help me get up to speed with TMP in D, I could probably put together a proof of concept pretty quickly. Aside from the D syntax, it is all in the design (which is why I would like to discuss it in more detail). For anyone who is interested, here is how I go about DFA production currently:
>> - Bog standard tokenisation of regex tokens
>> - Normalise each regex charset and enumerate in a map
>> - Build a regex syntax tree with leaf nodes storing the enumerated charset id
>> - Intersect all the charsets and build a vector of equivalent charsets
>> - Perform the followset to DFA transform as described in the Dragon book
>> For my lexer generator, I build a lookup table for each character to an equivalence class. For my wild card matcher, I leave the representation as characters and do a linear lookup of chars to 'next' state. The lexer equivalence class method (a la flex) is more efficient as you just have two lookups per character with no linear searches. I'm thinking I could look at getting as far as the regex syntax tree and then we could discuss DFA representation (it sounds like you want to generate code directly - how do you do that?!). This is why I mentioned the .NET bytecode stuff - I imagine there is some overlap there, but I don't know enough to be sure. Can you output source code at compilation time and have D compile it, or do you use a load of recursion like C++ TMP? And I forgot to mention - boost.spirit uses my lexertl library.
>> Regards, Ben
> I've always understood that DFAs have an exponential lower bound relative to .. something .. but I don't believe I have ever actually used a DFA generator (or written one, sadly enough). What has your experience been on DFA compile time?

I think only backtracking engines (not DFAs) have the exponential run time problem. Regular expression grammar is itself context free (so it's parsable efficiently), the NFA and NFA->DFA algorithms are polynomial, and the generated DFA makes one transition per input character.

Andrei
Jun 18 2010
> I think only backtracking engines (not DFAs) have the exponential run time problem. Regular expression grammar is itself context free (so it's parsable efficiently), the NFA and NFA->DFA algorithms are polynomial, and the generated DFA makes one transition per input character.
> Andrei

I don't know how practical it is, but I had a thought to allow some of these nasty regexes to run better: use threads. In theory, and I have not done any work to test/prove this so it's very theoretical, you can have multiple threads running in parallel, each testing a potential path. Granted, there's always the issue of spawning/managing threads and potentially overloading the system by spawning too many; however, it may be a viable alternative if you really HAVE to execute a particularly nasty regex. Anyway, it's just a thought. I'd hate to lose the benefits of an NFA if we're afraid of the potential performance hit if a user writes a bad regex.

Casey
Jun 18 2010
On 06/18/2010 05:08 PM, sybrandy wrote:
>> I think only backtracking engines (not DFAs) have the exponential run time problem. Regular expression grammar is itself context free (so it's parsable efficiently), the NFA and NFA->DFA algorithms are polynomial, and the generated DFA makes one transition per input character.
>> Andrei

The NFA->DFA algorithms are what I was referring to. My copy of the dragon book suggests that DFAs usually have O(n^3) initialization time but O(n^2 * 2^n) worst case relative to the number of states. I'd love to be shown wrong, though.

> I don't know how practical it is, but I had a thought to allow some of these nasty regexes to run better: use threads. In theory, and I have not done any work to test/prove this so it's very theoretical, you can have multiple threads running in parallel each testing a potential path. Granted, there's always the issue of spawning/managing threads and potentially overloading the system by spawning too many, however it may be a viable alternative if you really HAVE to execute a particularly nasty regex.

Isn't what you're describing an NFA?

> Anyway, it's just a thought. I'd hate to lose the benefits of an NFA if we're afraid of the potential performance hit if a user writes a bad regex.
> Casey

You don't get a performance hit with DFAs on gnarly regexen (actually, they would be faster than NFAs), just regexen that take forever to compile. I wonder if it would be possible to mechanically detect problem cases for DFA compilation and on detection switch over to an NFA engine.
Jun 18 2010
== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
> The NFA->DFA algorithms are what I was referring to. My copy of the dragon book suggests that DFAs usually have O(n^3) initialization time but O(n^2 * 2^n) worst case relative to the number of states. I'd love to be shown wrong, though.

I can't prove what the complexity is for DFA compilation, but instead, I challenge anyone to show slow compilation times with any DFA compatible regex. As I don't have some D code yet, you could try the current lexertl code (just add a single rule). Compile as Release if using Visual C++, as the iterator debugging kills performance in VC++ 8 or above (one problem I guess wouldn't exist with a D codebase). As I say, the only obvious area for me that hammers compilation times is huge repeats. If you repeat an expression 50,000 times or something, then yes, you will get a huge machine and it will take time to construct! :-) There are papers on tackling this with counting repeats, even for DFAs. If this area interests anyone, please let me know.

> You don't get a performance hit with DFAs on gnarly regexen (actually, they would be faster than NFAs), just regexen that take forever to compile. I wonder if it would be possible to mechanically detect problem cases for DFA compilation and on detection switch over to an NFA engine.

DFA compilation is particularly good for simplifying verbose regexes, even without DFA minimisation. Hybrid DFA/NFA engines have been written in the past and as far as features go, this is probably the ideal way to go. Russ Cox has some clever techniques too. Don't forget that grep is (traditionally) implemented with a DFA engine. Of course, the real irony is that Henry Spencer wrote a much better regex engine for TCL (a DFA/NFA hybrid as I understand it) which is much better than his original NFA backtracker, and yet everyone holds up the backtracking algorithm as the way to do things... That's inertia for you.

Regards, Ben
Jun 19 2010
On 06/19/2010 03:35 AM, Ben Hanson wrote:I can't prove what the comlexity is for DFA compilation, but instead, I challenge anyone to show slow compilation times with any DFA compatible regex. As I don't have some D code yet, you could try the current lexertl code (just add a single rule). Compile as Release if using Visual C++, as the iterator debugging kills performance in VC++ 8 or above (one problem I guess wouldn't exist with a D codebase). As I say, the only obvious area for me that hammers compilation times is huge repeats. If you repeat an expression 50,0000 times or something, then yes, you will get a huge machine and it will take time to construct! :-)et unicode?There are papers on tackling this with counting repeats, even for DFAs. If this area interests anyone, please let me know.it doesDFA compilation is particularly good for simplifying verbose regexes, even wihout DFA minimisation. Hybrid DFA/NFA engines have been written in the past and as far as feaures go, this is probably the ideal way to go. Russ Cox has some clever techniques too.interestingDon't forget that grep is (traditionally) implemented with a DFA engine.I didn't know thatOf course, the real irony is that Henry Spencer wrote a much better regex engine for TCL (a DFA/NFA hybrid as I understand it) which is much better than his original NFA backtracker and yet everyone holds up the backtracking algorithm as the way to do things... That's inertia for you. Regards, Ben
Jun 19 2010
== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
> On 06/19/2010 03:35 AM, Ben Hanson wrote:
>> I can't prove what the complexity is for DFA compilation, but instead, I challenge anyone to show slow compilation times with any DFA compatible regex. As I don't have some D code yet, you could try the current lexertl code (just add a single rule). Compile as Release if using Visual C++, as the iterator debugging kills performance in VC++ 8 or above (one problem I guess wouldn't exist with a D codebase). As I say, the only obvious area for me that hammers compilation times is huge repeats. If you repeat an expression 50,000 times or something, then yes, you will get a huge machine and it will take time to construct! :-)
> et unicode?

The latest lexertl takes the following strategy:

- If you're building a lexer using 8 bit chars, then no problem. Two phase lookup: the first lookup maps all 256 chars to an equivalence class index, then that index is used to locate the correct column in the DFA.
- If you're using 'wide characters', then still have a 256 entry lookup table and slice each 'wide character' into 8 bit chunks and use those chunks as charsets.
- If you want to generate code, you can generate tables using a first phase lookup of MAX_CHARS of the 'wide character'. You can then convert this back to a character based state machine and then you can just deal with characters again instead of equivalence classes.

One of the enhancements I want to make is the ability to go straight to a char based state machine to avoid the huge first phase lookup. Now, as far as huge Unicode ranges go, this shouldn't be a problem when you're generating code. You can use an if statement with an upper and lower bound in those cases and only use a switch statement when the range is disparate etc. If the technique of storing all those chars in a string ends up being too expensive on RAM (even with the auto negation trick), then there's always the method used by Tango Regex, which is a list of ranges (this probably works quite well with garbage collection - remember I designed for C++). Let me know if any of that isn't clear. Basically it's not a huge problem; you can use various techniques to compress the data sensibly and efficiently.

>> There are papers on tackling this with counting repeats, even for DFAs. If this area interests anyone, please let me know.
> it does

I'll have to dig out any references I have! I don't have them to hand right now.

>> DFA compilation is particularly good for simplifying verbose regexes, even without DFA minimisation. Hybrid DFA/NFA engines have been written in the past and as far as features go, this is probably the ideal way to go. Russ Cox has some clever techniques too.
> interesting

Russ (from memory) uses competing 'threads' in his matcher (http://swtch.com/~rsc/regexp/regexp2.html). http://swtch.com/~rsc/regexp/regexp3.html discusses RE2.

A lot of automata theory appears to have been forgotten, as noted by Russ Cox (amongst others). Lua has an interesting pattern matcher, as it goes more in the formal grammar direction to (for example) match balanced parentheses etc. Another thing I'd like to do (no doubt time will not allow!) is to extend Notepad RE (http://www.codeproject.com/KB/recipes/notepadre.aspx) to have the Lua pattern matcher as an option.
Jun 19 2010
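The "auto negation trick" mentioned above can be sketched over an 8-bit alphabet (struct and function names here are invented): store whichever of the charset or its complement is smaller, plus a negated flag.

```d
// Hypothetical sketch of the negation trick: a charset covering more
// than half the alphabet is stored as its (smaller) complement.
struct Charset
{
    string chars;   // sorted members, or non-members when negated
    bool negated;

    bool contains(char c) const
    {
        import std.algorithm.searching : canFind;
        return chars.canFind(c) != negated; // flip the answer when negated
    }
}

Charset normalise(bool[256] members)
{
    size_t count;
    foreach (m; members)
        if (m) ++count;
    immutable neg = count * 2 > 256; // more than half? store complement
    char[] kept;
    foreach (i; 0 .. 256)
        if (members[i] != neg)
            kept ~= cast(char) i;
    return Charset(kept.idup, neg);
}

void main()
{
    bool[256] notNewline = true;
    notNewline['\n'] = false;            // the regex '.' charset
    auto cs = normalise(notNewline);
    assert(cs.negated && cs.chars == "\n"); // 255 chars stored as ~[\n]
    assert(cs.contains('a'));
    assert(!cs.contains('\n'));
}
```

Besides saving storage, normalising both operands this way keeps later intersections cheap, since each side is at most half the alphabet.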
I forgot to mention... Ideally I'd like some kind of abstract interface to the generated DFA (this is in data mode before code generation etc.) so that different implementations can be hidden under a common interface. This is another example of why it would be nice to discuss the implementation with others! :-) Generally people just don't care about this stuff. I realise that various folk would like to go one better and not even assume a DFA behind the interface, but that is currently out of my scope..! Regards, Ben
Jun 19 2010
"Ben Hanson" <Ben.Hanson tfbplc.co.uk> wrote in message news:hvj9i1$1q7b$1 digitalmars.com...Lua has an interesting pattern matcher, as it goes more in the formal grammar direction to (for example) match balanced parenthesis etc.That's cool. That's something I've often wanted in regexes (It would be great for defining lever tokens in a grammar-definition language), and though I'm no FSA or regex expert, it always seemed like something entirely doable without having to resort to a full-blown LL or LR parser.
Jun 19 2010
"Nick Sabalausky" <a a.a> wrote in message news:hvjmo1$2mnu$1 digitalmars.com..."Ben Hanson" <Ben.Hanson tfbplc.co.uk> wrote in message news:hvj9i1$1q7b$1 digitalmars.com...That's "...defining lexer tokens...", not "...defining lever tokens...", of course.Lua has an interesting pattern matcher, as it goes more in the formal grammar direction to (for example) match balanced parenthesis etc.That's cool. That's something I've often wanted in regexes (It would be great for defining lever tokens in a grammar-definition language), and though I'm no FSA or regex expert, it always seemed like something entirely doable without having to resort to a full-blown LL or LR parser.
Jun 19 2010
== Quote from Nick Sabalausky (a a.a)'s article
> "Ben Hanson" <Ben.Hanson tfbplc.co.uk> wrote in message news:hvj9i1$1q7b$1 digitalmars.com...
>> Lua has an interesting pattern matcher, as it goes more in the formal grammar direction to (for example) match balanced parenthesis etc.
> That's cool. That's something I've often wanted in regexes (It would be great for defining lever tokens in a grammar-definition language), and though I'm no FSA or regex expert, it always seemed like something entirely doable without having to resort to a full-blown LL or LR parser.

Yeah, it does seem obvious that there'd be some middle ground between regexes and full parsers - certainly the author of the Lua pattern matching thought so! :-) The topic could definitely do with more attention. There is limited back-tracking in the Lua pattern matching, which I don't like, but it appears more efficient than a back-tracking NFA. Apparently it has some solid theory behind it too, which is good. Sadly just not enough time to pursue all these possibilities...

Regards, Ben
Jun 22 2010
== Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
> On 06/19/2010 03:35 AM, Ben Hanson wrote:
>> I can't prove what the complexity is for DFA compilation, but instead, I challenge anyone to show slow compilation times with any DFA compatible regex. As I don't have some D code yet, you could try the current lexertl code (just add a single rule). Compile as Release if using Visual C++, as the iterator debugging kills performance in VC++ 8 or above (one problem I guess wouldn't exist with a D codebase). As I say, the only obvious area for me that hammers compilation times is huge repeats. If you repeat an expression 50,000 times or something, then yes, you will get a huge machine and it will take time to construct! :-)
> et unicode?
>> There are papers on tackling this with counting repeats, even for DFAs. If this area interests anyone, please let me know.
> it does

Try this link: http://www.arl.wustl.edu/~mbecchi/files/becchi_ancs2009.pdf
Jun 21 2010
On 06/21/2010 04:07 AM, Ben Hanson wrote:
> == Quote from Ellery Newcomer (ellery-newcomer utulsa.edu)'s article
>> On 06/19/2010 03:35 AM, Ben Hanson wrote:
>>> I can't prove what the complexity is for DFA compilation, but instead, I challenge anyone to show slow compilation times with any DFA compatible regex. As I don't have some D code yet, you could try the current lexertl code (just add a single rule). Compile as Release if using Visual C++, as the iterator debugging kills performance in VC++ 8 or above (one problem I guess wouldn't exist with a D codebase). As I say, the only obvious area for me that hammers compilation times is huge repeats. If you repeat an expression 50,000 times or something, then yes, you will get a huge machine and it will take time to construct! :-)
>> et unicode?
>>> There are papers on tackling this with counting repeats, even for DFAs. If this area interests anyone, please let me know.
>> it does
> Try this link: http://www.arl.wustl.edu/~mbecchi/files/becchi_ancs2009.pdf

thanks
Jun 21 2010
On 2010-06-18 06:44, Andrei Alexandrescu wrote:
> There are currently two regexen in the standard library. The older one, std.regexp, is time-tested but only works with UTF8 and has a clunkier API. The newer one, std.regex, is newer and isolates the engine from the matches (and therefore can reuse and cache engines easier), and supports all character widths. But it's less tested and doesn't have that great of an interface because it pretty much inherits the existing one. I wish to improve regex handling in Phobos. The most important improvement is not in the interface - it's in the engine. The current engine is adequate but nothing to write home about, and for simple regexen is markedly slower than equivalent hand-written code (e.g. matching whitespace). One great opportunity would be for D to leverage its uncanny compile-time evaluation abilities and offer a regex that parses the pattern during compilation: foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... } Such a static regex could be simpler than a full-blown regex with captures and backreferences etc., but it would have guaranteed performance (e.g. it would be an automaton instead of a backtracking engine) and would be darn fast because it would generate custom code for each regex pattern. See related work: http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html If we get as far as implementing what RE2 can do with compile-time evaluation, people will definitely notice. If there's anyone who'd want to tackle such a project (for Phobos or not), I highly encourage you to do so.
> Andrei

There is already a compile time regular expression engine available in the DDL project at dsource; it's in the meta package: http://dsource.org/projects/ddl/browser/trunk/meta I don't know if it is still usable.

-- /Jacob Carlborg
Jun 18 2010
Jacob Carlborg wrote:
> On 2010-06-18 06:44, Andrei Alexandrescu wrote:
>> There are currently two regexen in the standard library. The older one, std.regexp, is time-tested but only works with UTF8 and has a clunkier API. The newer one, std.regex, is newer and isolates the engine from the matches (and therefore can reuse and cache engines easier), and supports all character widths. But it's less tested and doesn't have that great of an interface because it pretty much inherits the existing one. I wish to improve regex handling in Phobos. The most important improvement is not in the interface - it's in the engine. The current engine is adequate but nothing to write home about, and for simple regexen is markedly slower than equivalent hand-written code (e.g. matching whitespace). One great opportunity would be for D to leverage its uncanny compile-time evaluation abilities and offer a regex that parses the pattern during compilation: foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... } Such a static regex could be simpler than a full-blown regex with captures and backreferences etc., but it would have guaranteed performance (e.g. it would be an automaton instead of a backtracking engine) and would be darn fast because it would generate custom code for each regex pattern. See related work: http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html If we get as far as implementing what RE2 can do with compile-time evaluation, people will definitely notice. If there's anyone who'd want to tackle such a project (for Phobos or not), I highly encourage you to do so.
>> Andrei
> There is already a compile time regular expression engine available in the DDL project at dsource; it's in the meta package: http://dsource.org/projects/ddl/browser/trunk/meta I don't know if it is still usable.

I wrote that code before CTFE was available. It helped to design D's metaprogramming system, but the code itself is thoroughly obsolete now. It is SO much easier to write that stuff with CTFE. Incidentally, now that we have the no-brackets form of template instantiation, there's much more freedom for compile-time regex syntax:

auto k = regex!"[A-Za-z][0-9A-Za-z]*"(s1);
Jun 18 2010
OK, I've just ordered The D Programming Language from Amazon. Hopefully this will enable me to cut out a certain amount of total n00b questions such as "How do strings work?". Regards, Ben
Jun 18 2010
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:hvetik$2k7j$1 digitalmars.com...There are currently two regexen in the standard library. The older one, std.regexp, is time-tested but only works with UTF8 and has a clunkier API. The newer one, std.regex, is newer and isolates the engine from the matches (and therefore can reuse and cache engines easier), and supports all character widths. But it's less tested and doesn't have that great of an interface because it pretty much inherits the existing one. I wish to improve regex handling in Phobos. The most important improvement is not in the interface - it's in the engine. The current engine is adequate but nothing to write home about, and for simple regexen is markedly slower than equivalent hand-written code (e.g. matching whitespace). One great opportunity would be for D to leverage its uncanny compile-time evaluation abilities and offer a regex that parses the pattern during compilation: foreach (s; splitter(line, sregex!",[ \t\r]*")) { ... } Such a static regex could be simpler than a full-blown regex with captures and backreferences etc., but it would have guaranteed performance (e.g. it would be an automaton instead of a backtracking engine) and would be darn fast because it would generate custom code for each regex pattern. See related work: http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html If we get as far as implementing what RE2 can do with compile-time evaluation, people will definitely notice. If there's anyone who'd want to tackle such a project (for Phobos or not), I highly encourage you to do so.This would be a good thing for it to pay attention to, if it doesn't already: http://swtch.com/~rsc/regexp/regexp1.html
Jun 18 2010
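[A minimal sketch of the idea Andrei describes: because the pattern is a compile-time constant, the compiler can specialise the matching loop per pattern instead of interpreting the regex at runtime. This is an illustration only, restricted to patterns of the shape "X[set]*" (a literal followed by a character class, enough to cover ",[ \t\r]*"); sregex is Andrei's hypothetical name and matchStar below is made up - none of this is Phobos API.]

```d
import std.stdio;

// Hedged sketch: a matcher specialised at compile time for the pattern
// "lit[set]*" -- one literal character followed by zero or more
// characters from a set. Both lit and set are template parameters,
// so each pattern gets its own custom-compiled matching code.
size_t matchStar(char lit, string set)(string s)
{
    static bool inSet(char c)   // set is known at compile time
    {
        foreach (ch; set)
            if (ch == c) return true;
        return false;
    }

    if (s.length == 0 || s[0] != lit)
        return 0;                       // no match at the front of s
    size_t n = 1;
    while (n < s.length && inSet(s[n]))
        ++n;
    return n;                           // length of the match
}

void main()
{
    alias sep = matchStar!(',', " \t\r");   // ",[ \t\r]*"
    writeln(sep(",  \tabc"));               // 4
    writeln(sep("abc"));                    // 0
}
```

[A real engine would of course compile a full automaton along RE2's lines; the point here is only that the per-pattern specialisation happens for free once the pattern is a template argument.]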
On 06/17/2010 11:44 PM, Andrei Alexandrescu wrote: [full proposal snipped - quoted in the previous post]

Generally I think D's CT capabilities have a way to go yet before this would be worth tackling. E.g. how do you build a parse tree if pointers and classes aren't allowed in CT code? But, predicated on the assumption that it will happen relatively soon... What do we want our regex language to look like?
this: http://digitalmars.com/ctg/regular.html is okay, but I like the idea of ascii and unicode character classes as per RE2's syntax, and I don't like the idea of an overloaded \n. (I assume we're talking about strings as bidirectional unicode ranges) and I would like to see the language grammar explicitly written out somewhere. and I like the idea of capture groups. Thinking of something along the lines of:

import std.regex;

int i;
double d;
ctcapture!r"(?<i>\d+): (?<d>\d+\.\d+)"("10: 100.5"); // this will probably have to be inside a mixin statement
assert(i == 10);
assert(d == 100.5); // grumph, floating point..

for that matter, why not some predefined subexpressions corresponding to D's integer literals, et al? And available inside the regex?

ctcapture!r"(?<i>{:IntegerLiteral:}): (?<d>{:FloatLiteral:})" // Yeah, I know. You come up with a better syntax
    ("10: 0xEA.A1p-2");

Okay, I'm done dreaming. I don't know how much of the above can be done within the constraints of an FA. How much sense does regex over unicode make? Do others have ideas for features for regex?
Jun 19 2010
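[For comparison, something close to the ctcapture dream can be had with named capture groups, assuming a std.regex-style engine that supports the (?P<name>...) syntax with matchFirst and named-group lookup; the captured slices are converted explicitly rather than bound to outer variables. A sketch under those assumptions:]

```d
import std.regex;
import std.conv : to;

void main()
{
    // Named groups instead of ctcapture's variable binding; the
    // (?P<name>...) syntax and matchFirst are assumed to be available.
    auto m = matchFirst("10: 100.5", regex(r"(?P<i>\d+): (?P<d>\d+\.\d+)"));
    int i = m["i"].to!int;        // captured text converted by hand
    double d = m["d"].to!double;
    assert(i == 10);
    assert(d == 100.5);           // 100.5 is exactly representable
}
```

[Binding directly to the variables i and d, as in the ctcapture dream, would indeed need a mixin so the generated code can refer to names in the caller's scope.]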
Hello Ellery,

> E.g. how do you build a parse tree if pointers and classes aren't allowed in CT code?

You use the type system. I can say from experience that (memory issues aside) it works. You can even do duck typing: "Template Foo works on anything that defines a Bar and a Baz." -- ... <IXOYE><
Jun 19 2010
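[The "works on anything that defines a Bar and a Baz" duck typing BCS mentions can be checked entirely at compile time with a template constraint. A minimal sketch; Duck, Brick, and useIt are invented names for illustration:]

```d
struct Duck  { void bar() { } void baz() { } }
struct Brick { }

// Accepts any T on which both bar() and baz() compile -- duck typing
// resolved by the type system, with no runtime dispatch involved.
void useIt(T)(T t)
    if (__traits(compiles, { T x; x.bar(); x.baz(); }))
{
    t.bar();
    t.baz();
}

void main()
{
    useIt(Duck());                                       // fine
    static assert(!__traits(compiles, useIt(Brick()))); // rejected at compile time
}
```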
"BCS" <none anon.com> wrote in message news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...Hello Ellery,Trivial example?Generally I think D's CT capabilities have a way to go yet before this would be worth tackling. E.g. how do you build a parse tree if pointers and classes aren't allowed in CT code?You use the type system. I can say from experience that (memory issues aside) it works.
Jun 19 2010
Hello Nick,"BCS" <none anon.com> wrote in message news:a6268ff154ca8ccddf1ef51e1d8 news.digitalmars.com...Building an arbitrary tree: int find(char[] data) { foreach(int j, char c; data) switch(c) { default: break; case '[', ']': return j; } return data.length; } int match(char[] data) { int i = 0; foreach(int j, char c; data) switch(c) { default: break; case '[': i++; break; case ']': if(--i == 0) return j; } return data.length; } template Tp(T...){ alias T Tp; } struct Bar(T...) { } template Foo(char[] data) { static if(data.length == 0) alias Tp!() Foo; else static if(data[0] == '[') alias Tp!(Bar!(Foo!(data[1..match(data)])), Foo!(data[match(data)+1..$])) Foo; else alias Tp!(data[0..find(data)], Foo!(data[find(data)..$])) Foo; } import std.stdio; void main() { writef("%s\n", Foo!("So [she was [considering ][in her] own[ mind [(as well] as] she could, ]for the").stringof); } Output: tuple("So ",(Bar!("she was ",Bar!("considering ") ,Bar!("in her") ," own",Bar!(" mind ",Bar!("(as well") ," as") ," she could, ") ),"for the") By placing things under different aliases and whatnot, named members can be done. Also, if the lisp like constructs puts you off, the construction of the tree can be done via more or less the same parsing code and some fairly trivial ctfe+mixin -- ... <IXOYE><Hello Ellery,Trivial example?Generally I think D's CT capabilities have a way to go yet before this would be worth tackling. E.g. how do you build a parse tree if pointers and classes aren't allowed in CT code?You use the type system. I can say from experience that (memory issues aside) it works.
Jun 19 2010
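[The "ctfe + mixin" route mentioned at the end can be sketched as: run ordinary parsing code during compilation to produce D source as a string, then mixin the result. makeDepthDecl and maxDepth below are invented names; the "parser" is reduced to a nesting-depth counter to keep the sketch short:]

```d
import std.conv : to;

// CTFE-able helper: computes the maximum [ ] nesting depth of its input
// and emits a D declaration as a string.
string makeDepthDecl(string s)
{
    int depth = 0, max = 0;
    foreach (c; s)
    {
        if (c == '[') { ++depth; if (depth > max) max = depth; }
        else if (c == ']') --depth;
    }
    // Emit a declaration instead of building a tree of types.
    return "enum maxDepth = " ~ max.to!string ~ ";";
}

// The parser runs at compile time; its string output is compiled as code.
mixin(makeDepthDecl("So [she was [considering ][in her] own[ mind]]"));
static assert(maxDepth == 2);

void main() { }
```

[Compared with the tuple-of-types encoding above, the mixin route keeps the parsing code ordinary imperative D, at the cost of assembling the result as text.]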
On 06/19/2010 10:55 PM, BCS wrote: [tree-building example snipped]

Okay, you use structs without indirection. I guess that works. How much copying does that entail when you e.g. pass your tree in a function call?
Jun 19 2010
Ellery Newcomer wrote:

> How much copying does that entail when you e.g. pass your tree in a function call?

Currently CTFE uses copy-on-write. If you're just passing the tree, not modifying it, there's no copying. If you modify the tree, there's a horrific amount of copying -- this is the major CTFE bug.
Jun 20 2010
Hello Ellery,

> How much copying does that entail when you e.g. pass your tree in a function call?

At compile time, as a template parameter? I think it gets passed around as a reference. At runtime? That would be pointless, as it has no runtime members. -- ... <IXOYE><
Jun 20 2010
On 06/19/2010 11:27 PM, Ellery Newcomer wrote: [snip]

Nevermind, I see what you're doing now.
Jun 21 2010