digitalmars.D - Fast regular expressions
- Jascha Wetzel (24/24) Apr 14 2007 Here is an alpha release of a regular expression engine i'm working on:
- Aziz K. (6/6) Apr 15 2007 Hi Jascha,
- Lionello Lunesu (7/11) Apr 15 2007 I think phobos' version does support this by using the attribute 'g':
- Lionello Lunesu (2/3) Apr 15 2007 Or I must have MISunderstood.. blergh :S
- Jascha Wetzel (8/17) Apr 15 2007 thanks!
- Aziz K. (9/15) Apr 15 2007 What I mean is that the dot meta character in Phobos' regexps doesn't =
- Jascha Wetzel (6/23) Apr 15 2007 ah, ic. that will be parameterized later on so you can choose whether
- Jeff (5/5) Apr 15 2007 I wonder if the Tango team would be interested in incorporating this
- Lars Ivar Igesund (10/15) Apr 15 2007 I'd say we're likely to be interested once Jascha's library reaches a mo...
- Jascha Wetzel (4/9) Apr 15 2007 this is also somewhat meant to be a response to Walter's post a few
- Davidl (6/30) Apr 15 2007 will u please add some feature of fuzzy matching? like
- Jeff (10/10) Apr 16 2007 It seems to me that supporting fuzzy matching in a general regular
- yidabu (2/32) May 19 2007
Here is an alpha release of a regular expression engine i'm working on: http://mainia.de/tdfaregexp-0.1-alpha.zip (artistic license 2.0) It implements a slightly modified version of Ville Laurikari's tagged NFA method that allows for regular expression parsing in linear time as opposed to the very common backtracking method that requires exponential time (worst case). As of now, the engine supports greedy and non-greedy operators, submatches and character classes. It compiles a regexp to a TDFA from which it can either be applied directly to an input string or compiled to D code. Supported operators: . | ( ) ? * + ?? *? +? The regexp-to-D compilation cannot be done at compile-time, though, because the required data structures are too complex to be represented in tuples or strings. Compile regexp.d and run it for example like this: regexp "(a*)([0-9a-fA-Z]*)(g|qwe)" aaaa987IUTqwe aaa 234234 This will output the TNFA, TDFA and D Code and try to match the 3 input strings. The file regextest.d is an example of how to use the generated code. There is still a lot of room for optimization, but i think that categorically, such a natively compiled, linear time regexp matcher should be the fastest possible. Comments and bug reports are encouraged ;)
Apr 14 2007
Hi Jascha, Since I found an important feature lacking in Phobos' regex implementation, I want to ask you: does your library support matching over new-lines? In any case, good job on writing an improved regex library! Regards.
Apr 15 2007
"Aziz K." <aziz.kerim gmail.com> wrote in message news:op.tqtketww545v36 vaio...Hi Jascha, Since I found an important feature lacking in Phobos' regex implementation, I want to ask you: does your library support matching over new-lines?I think phobos' version does support this by using the attribute 'g': char[] s = "strap a \nbla to bla\n"; auto ss = sub(s, " \nb", "ZZ", "g"); // result: strap aZZla to bla Or I must have understood what you meant? L.
Apr 15 2007
Or I must have understood what you meant?Or I must have MISunderstood.. blergh :S L.
Apr 15 2007
thanks! what exactly do you mean by "over new-lines"? phobos' regexps will match newlines if stated explicitly. it can also treat the input as multiple lines. my engine doesn't care about special chars, yet. the point is merely to improve on the algorithmic side. all the details with assertions, special chars and charsets will be filled in when all the operators work. Aziz K. wrote:Hi Jascha, Since I found an important feature lacking in Phobos' regex implementation, I want to ask you: does your library support matching over new-lines? In any case, good job on writing an improved regex library! Regards.
Apr 15 2007
what exactly do you mean by "over new-lines"? phobos' regexps will match newlines if stated explicitly. it can also treat the input as multiple lines.What I mean is that the dot meta character in Phobos' regexps doesn't = match a new-line. For example: char[] s =3D "<p>ab\ncd</p>"; auto m =3D search(s, "<p>(.+)</p>", "gmi"); writefln(m.match(1)); // no match...my engine doesn't care about special chars, yet. the point is merely t=oimprove on the algorithmic side. all the details with assertions, special chars and charsets will be filled in when all the operators wo=rk. Yeah, I noticed that when I looked through your code a bit.
Apr 15 2007
ah, ic. that will be parameterized later on so you can choose whether to have the dot match it or not. you can easily patch it into the alpha by changing charclass.d line 76 to static const CharClass!(char_t) any_char = {parts: [{l:32, r:127}, {l:10,r:10}, {l:13,r:13}]}; Aziz K. wrote:what exactly do you mean by "over new-lines"? phobos' regexps will match newlines if stated explicitly. it can also treat the input as multiple lines.What I mean is that the dot meta character in Phobos' regexps doesn't match a new-line. For example: char[] s = "<p>ab\ncd</p>"; auto m = search(s, "<p>(.+)</p>", "gmi"); writefln(m.match(1)); // no match...my engine doesn't care about special chars, yet. the point is merely to improve on the algorithmic side. all the details with assertions, special chars and charsets will be filled in when all the operators work.Yeah, I noticed that when I looked through your code a bit.
Apr 15 2007
I wonder if the Tango team would be interested in incorporating this work into their library? I was disappointed to see they've adopted the Phobos regular expression module, given I had some pretty serious gripes with it (seemingly erroneous behavior - I don't think this was ever resolved).
Apr 15 2007
Jeff wrote:I wonder if the Tango team would be interested in incorporating this work into their library? I was disappointed to see they've adopted the Phobos regular expression module, given I had some pretty serious gripes with it (seemingly erroneous behavior - I don't think this was ever resolved).I'd say we're likely to be interested once Jascha's library reaches a more stable state. We are aware that current implementation (which is the same in both Phobos and Tango) has issues. As with everything that goes into Tango, someone needs to write it, and mantain it. -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango
Apr 15 2007
this is also somewhat meant to be a response to Walter's post a few weeks ago: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=47838 Jeff wrote:I wonder if the Tango team would be interested in incorporating this work into their library? I was disappointed to see they've adopted the Phobos regular expression module, given I had some pretty serious gripes with it (seemingly erroneous behavior - I don't think this was ever resolved).
Apr 15 2007
will u please add some feature of fuzzy matching? like with user definable error char number. for example, in fuzzy match people can match ab in abc or acb with 1char errorness and they can match ab in acdb or aefb with 2 char errornessHere is an alpha release of a regular expression engine i'm working on: http://mainia.de/tdfaregexp-0.1-alpha.zip (artistic license 2.0) It implements a slightly modified version of Ville Laurikari's tagged NFA method that allows for regular expression parsing in linear time as opposed to the very common backtracking method that requires exponential time (worst case). As of now, the engine supports greedy and non-greedy operators, submatches and character classes. It compiles a regexp to a TDFA from which it can either be applied directly to an input string or compiled to D code. Supported operators: . | ( ) ? * + ?? *? +? The regexp-to-D compilation cannot be done at compile-time, though, because the required data structures are too complex to be represented in tuples or strings. Compile regexp.d and run it for example like this: regexp "(a*)([0-9a-fA-Z]*)(g|qwe)" aaaa987IUTqwe aaa 234234 This will output the TNFA, TDFA and D Code and try to match the 3 input strings. The file regextest.d is an example of how to use the generated code. There is still a lot of room for optimization, but i think that categorically, such a natively compiled, linear time regexp matcher should be the fastest possible. Comments and bug reports are encouraged ;)
Apr 15 2007
It seems to me that supporting fuzzy matching in a general regular expression library would add a whole lot of complexity that wouldn't be used by many people. I think it would be hard to justify this. Were you really adamant about having regular expression matching, or more focussed on simple fuzzy matching? There are probably some libraries already that can search for fuzzy matches pretty well. If you can't find anything that meets your needs, you might consider having a look at http://en.wikipedia.org/wiki/Bitap_algorithm and http://en.wikipedia.org/wiki/Levenshtein_distance - they could be useful if you decide to write your own.
Apr 16 2007
great work! Jascha Wetzel <[firstname] mainia.de> Wrote:Here is an alpha release of a regular expression engine i'm working on: http://mainia.de/tdfaregexp-0.1-alpha.zip (artistic license 2.0) It implements a slightly modified version of Ville Laurikari's tagged NFA method that allows for regular expression parsing in linear time as opposed to the very common backtracking method that requires exponential time (worst case). As of now, the engine supports greedy and non-greedy operators, submatches and character classes. It compiles a regexp to a TDFA from which it can either be applied directly to an input string or compiled to D code. Supported operators: . | ( ) ? * + ?? *? +? The regexp-to-D compilation cannot be done at compile-time, though, because the required data structures are too complex to be represented in tuples or strings. Compile regexp.d and run it for example like this: regexp "(a*)([0-9a-fA-Z]*)(g|qwe)" aaaa987IUTqwe aaa 234234 This will output the TNFA, TDFA and D Code and try to match the 3 input strings. The file regextest.d is an example of how to use the generated code. There is still a lot of room for optimization, but i think that categorically, such a natively compiled, linear time regexp matcher should be the fastest possible. Comments and bug reports are encouraged ;)
May 19 2007