digitalmars.D - FReD is ready for review
- Dmitry Olshansky (27/27) Sep 13 2011 FReD ( Fast Regular expressions for D) is a GSOC project proposed as
- Vladimir Panteleev (11/16) Sep 13 2011 Hi, thanks for your hard work!
- Vladimir Panteleev (7/9) Sep 13 2011 Sorry, I had -wi enabled in my build environment. (Why would -wi cause
- Jonathan M Davis (5/13) Sep 13 2011 -wi shouldn't cause fatal errors, but anything going into Phobos needs t...
- Dmitry Olshansky (20/33) Sep 13 2011 Arrrrgh! On serious note it is fixed in some recent pull request, but
- Vladimir Panteleev (8/10) Sep 13 2011 That's really cool... I noticed that d_dna.d doesn't use regexCt, would ...
- Dmitry Olshansky (11/18) Sep 13 2011 In this particular case it may very well much faster actually. The thing...
- Vladimir Panteleev (7/10) Sep 13 2011 Ah, okay. I thought compile-time regex was the same as the default engin...
- Dmitry Olshansky (28/35) Sep 13 2011 Let me explain a bit. The default as chosen is fairly advanced NFA
- Andrei Alexandrescu (5/16) Sep 13 2011 Sounds great! I advise you to invest effort in reducing the 25% handicap...
- Dmitry Olshansky (23/40) Sep 14 2011 Yes, sir! :)
- Simen Kjaeraas (6/24) Sep 14 2011 Even so, that is awesome. Another point to D for its
- Marco Leise (5/31) Sep 14 2011 Wait with the champaign until this doesn't use 8 GB RAM during compile a...
- Martin Nowak (2/36) Sep 14 2011 It peaks at 4GB for me and takes about 20s to compile.
- Andrei Alexandrescu (7/49) Sep 14 2011 Great, thanks. Could you please email me privately some more information...
- bearophile (6/13) Sep 13 2011 This is not so good. Even if this problem is not solved soon, it will be...
- Marco Leise (6/11) Sep 13 2011 I heard a Google engineer talk about that. In JavaScript regular
- Jonathan M Davis (12/17) Sep 13 2011 I suspect that there's value in reviewing the unicode stuff separately, ...
- Dmitry Olshansky (12/28) Sep 13 2011 It doesn't affect API at all, but an important implementation detail.
- Jonathan M Davis (5/37) Sep 13 2011 As long as it doesn't affect the API, then we can review std.regex first...
FReD ( Fast Regular expressions for D) is a GSOC project proposed as source level compatible replacement for std.regex. Needless to say it is also superior in many ways ;) Source + docs + goodies packaged here: https://github.com/downloads/blackwhale/FReD/FReD.zip Goodies include popular regex-dna benchmark, by my measures we should be linux box, on win32 sadly it runs out of memory on the same machine (false pointers?). Speaking of source there are few artifacts that should end up in std.uni and not in std.regex: CodepointSet and CodepointTrie, and unicode property tables. I'd rather have them all accessible to user, but their interface is admittedly clunky, I'm open to ideas on better API. Notable caveats: - backreferences allowed only to locally unique groups, e.g. \b(\w+)\b\1 is allowed, same w/o any of \b - not. - replace with delegeate now takes Captures!R where R is e.g. string. This is due to having few types of engine and RegexMatch!X respectively. Previous signature allowed a certain measure of abuse e.g. you could have done a couple of popFronts on it matcher(!) in this delegate. However in most cases the code inside can be left as is. Otherwise it should work with existing code after replacing imports. P.S. I'm taking a seat somewhere in review queue, and busying myself with newspaper :) -- Dmitry Olshansky
Sep 13 2011
Hi, thanks for your hard work! On Wed, 14 Sep 2011 00:04:08 +0300, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:https://github.com/downloads/blackwhale/FReD/FReD.zipDoesn't seem to compile on the latest DMD, due to switch case fallthrough.Goodies include popular regex-dna benchmark, by my measures we should beThis is awesome, but if it's not a secret, who are the first two, and how far behind is FReD? Does this regard regular or compile-time regex?Though I used 64bit VM linux box, on win32 sadly it runs out of memory on the same machine (false pointers?).If I'll find time to port my memory debugger to D2 before someone else figures it out, I'll have a look at this. -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Sep 13 2011
On Wed, 14 Sep 2011 00:18:03 +0300, Vladimir Panteleev <vladimir thecybershadow.net> wrote:Doesn't seem to compile on the latest DMD, due to switch case fallthrough.Sorry, I had -wi enabled in my build environment. (Why would -wi cause fatal errors, though? Sounds like a bug.) -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Sep 13 2011
On Tuesday, September 13, 2011 14:25 Vladimir Panteleev wrote:On Wed, 14 Sep 2011 00:18:03 +0300, Vladimir Panteleev <vladimir thecybershadow.net> wrote:-wi shouldn't cause fatal errors, but anything going into Phobos needs to build with -w. So, if the proposed changes don't build with -w, that needs to be fixed. - Jonathan M DavisDoesn't seem to compile on the latest DMD, due to switch case fallthrough.Sorry, I had -wi enabled in my build environment. (Why would -wi cause fatal errors, though? Sounds like a bug.)
Sep 13 2011
On 14.09.2011 1:18, Vladimir Panteleev wrote:Hi, thanks for your hard work! On Wed, 14 Sep 2011 00:04:08 +0300, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Arrrrgh! On serious note it is fixed in some recent pull request, but that was hopelessly late for getting in this release.https://github.com/downloads/blackwhale/FReD/FReD.zipDoesn't seem to compile on the latest DMD, due to switch case fallthrough.OK in this particular benchmark, we are slighlty behind Google inc. :) Namely RE2 and irregex. On my machine RE2 is ~5 sec while FReD does the job in ~7sec. As for irregex, ... uhm-oh I haven't actually tested it (failed to just compile) but it beats RE2 in this test by some ~30%. OpenMP declarations stripped: http://shootout.alioth.debian.org/u32/performance.php?test=regexdna Anyway, the rest of competition is not even close, and I still haven't spent all of my tricks.Goodies include popular regex-dna benchmark, by my measures we shouldThis is awesome, but if it's not a secret, who are the first two, and how far behind is FReD? Does this regard regular or compile-time regex?Thanks, by my cut and paste tests shows the problem is caused by allocating large buffers in replace loop. Probably a better solution to this is alternative replace function that doesn't allocate on each call but uses a OutputRange. -- Dmitry OlshanskyThough I used 64bit VM linux box, on win32 sadly it runs out of memory on the same machine (false pointers?).If I'll find time to port my memory debugger to D2 before someone else figures it out, I'll have a look at this.
Sep 13 2011
On Wed, 14 Sep 2011 00:37:28 +0300, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Anyway, the rest of competition is not even close, and I still haven't spent all of my tricks.That's really cool... I noticed that d_dna.d doesn't use regexCt, would that make any difference or am I overestimating the advantage of compile-time codegen? -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Sep 13 2011
On 14.09.2011 1:42, Vladimir Panteleev wrote:On Wed, 14 Sep 2011 00:37:28 +0300, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:In this particular case it may very well much faster actually. The thing about this benchmark is that it exposes all the nasty constant overhead there is in the engine. On the other hand C-T regex doesn't have most of "what do we do next" overhead so it should be faster. However when it comes to some serious business like matching charsets and doing ambiguous control flow it starts losing to more superior design of the default engine. -- Dmitry OlshanskyAnyway, the rest of competition is not even close, and I still haven't spent all of my tricks.That's really cool... I noticed that d_dna.d doesn't use regexCt, would that make any difference or am I overestimating the advantage of compile-time codegen?
Sep 13 2011
On Wed, 14 Sep 2011 00:49:47 +0300, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:However when it comes to some serious business like matching charsets and doing ambiguous control flow it starts losing to more superior design of the default engine.Ah, okay. I thought compile-time regex was the same as the default engine, but without the runtime parsing and bytecode interpretation(?). -- Best regards, Vladimir mailto:vladimir thecybershadow.net
Sep 13 2011
On 14.09.2011 1:55, Vladimir Panteleev wrote:On Wed, 14 Sep 2011 00:49:47 +0300, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Let me explain a bit. The default as chosen is fairly advanced NFA engine which features: - never backtracks and consequently decodes each UTF8/16/32 character only once; - as side effect doesn't support fully combinatorial backreferences (why would someone use 'em anyway); - guaranteed O(N*M) time on input with M being "size" of pattern; This also opens up matching directly on buffered streams... but back to the point. There is also a slower traditional backtracking engine, it still faster then many other engines out there but it ebbs out on some patterns like .*.*.*Z in text with no 'Z', precisely because of combinatorial nature of backtracking. Then C-T is built on top of backtracking, just because it was much easier this way :) It's indeed replacing bytecode interpretations with hardcoded code which is exceptionally fast, but the prime time consumer in real patterns is usually this memoize/return. C-T regex is still a bit experimental though and going to be improved over time. Last but not least there is also completely dumb (as in dead simple and fast) kickstart engine that does all of heavy lifting iff pattern starts with something more or less constant (e.g. it does 50% of job in regex-dna). And you can just parse at compile time and use default run-time engine for matching: enum r = regex("whatever"); -- Dmitry OlshanskyHowever when it comes to some serious business like matching charsets and doing ambiguous control flow it starts losing to more superior design of the default engine.Ah, okay. I thought compile-time regex was the same as the default engine, but without the runtime parsing and bytecode interpretation(?).
Sep 13 2011
On 9/13/11 4:37 PM, Dmitry Olshansky wrote:OK in this particular benchmark, we are slighlty behind Google inc. :) Namely RE2 and irregex. On my machine RE2 is ~5 sec while FReD does the job in ~7sec. As for irregex, ... uhm-oh I haven't actually tested it (failed to just compile) but it beats RE2 in this test by some ~30%. OpenMP declarations stripped: http://shootout.alioth.debian.org/u32/performance.php?test=regexdna Anyway, the rest of competition is not even close, and I still haven't spent all of my tricks.Sounds great! I advise you to invest effort in reducing the 25% handicap to <1%. As regexen have a fairly standardized interface, speed is the most impactful differentiatior people look at. Andrei
Sep 13 2011
On 14.09.2011 2:44, Andrei Alexandrescu wrote:On 9/13/11 4:37 PM, Dmitry Olshansky wrote:Yes, sir! :) As regexen have a fairly standardized interface, speed is theOK in this particular benchmark, we are slighlty behind Google inc. :) Namely RE2 and irregex. On my machine RE2 is ~5 sec while FReD does the job in ~7sec. As for irregex, ... uhm-oh I haven't actually tested it (failed to just compile) but it beats RE2 in this test by some ~30%. OpenMP declarations stripped: http://shootout.alioth.debian.org/u32/performance.php?test=regexdna Anyway, the rest of competition is not even close, and I still haven't spent all of my tricks.Sounds great! I advise you to invest effort in reducing the 25% handicap to <1%.most impactful differentiatior people look at.Well in this particular benchmark there is little I can do, some small scale optimizations and such, I expect no more then 10%. The bigger picture is that our best bet here is static regex, the competitors use some aggressive JIT so it's a fair play. I actually received pull request this morning do just that (thx, Martin) though dmd uses so much memory it's not for mortals yet: I had 2G of ram + some 600mb of swap - fail on 3rd pattern of 9 + 2 Gb of swap - 6 out of 9 Come on, it's a sport! + 4 Gb of swap and after few restless minutes I got it working Fresh timings: FReD C-T 3.4 sec FReD R-T 6.6 sec RE2 4.8 sec V8 JS 3.7 sec Look ma, we are on top! :) So, by the end of day, D does kick ass, even though it's a bit of Pyrrhic victory. --- Dmitry Olshansky
Sep 14 2011
On Wed, 14 Sep 2011 14:51:54 +0200, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Well in this particular benchmark there is little I can do, some small scale optimizations and such, I expect no more then 10%. The bigger picture is that our best bet here is static regex, the competitors use some aggressive JIT so it's a fair play. I actually received pull request this morning do just that (thx, Martin) though dmd uses so much memory it's not for mortals yet: I had 2G of ram + some 600mb of swap - fail on 3rd pattern of 9 + 2 Gb of swap - 6 out of 9 Come on, it's a sport! + 4 Gb of swap and after few restless minutes I got it working Fresh timings: FReD C-T 3.4 sec FReD R-T 6.6 sec RE2 4.8 sec V8 JS 3.7 sec Look ma, we are on top! :) So, by the end of day, D does kick ass, even though it's a bit of Pyrrhic victory.Even so, that is awesome. Another point to D for its metaprogramming. -- Simen
Sep 14 2011
Am 14.09.2011, 17:10 Uhr, schrieb Simen Kjaeraas <simen.kjaras gmail.com>:On Wed, 14 Sep 2011 14:51:54 +0200, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Wait with the champaign until this doesn't use 8 GB RAM during compile any more ;) Other than that, congrats! This is really impressive. I think you do some great pioneer work, that will help other who try to embed simple domain specific languages.Well in this particular benchmark there is little I can do, some small scale optimizations and such, I expect no more then 10%. The bigger picture is that our best bet here is static regex, the competitors use some aggressive JIT so it's a fair play. I actually received pull request this morning do just that (thx, Martin) though dmd uses so much memory it's not for mortals yet: I had 2G of ram + some 600mb of swap - fail on 3rd pattern of 9 + 2 Gb of swap - 6 out of 9 Come on, it's a sport! + 4 Gb of swap and after few restless minutes I got it working Fresh timings: FReD C-T 3.4 sec FReD R-T 6.6 sec RE2 4.8 sec V8 JS 3.7 sec Look ma, we are on top! :) So, by the end of day, D does kick ass, even though it's a bit of Pyrrhic victory.Even so, that is awesome. Another point to D for its metaprogramming.
Sep 14 2011
On Wed, 14 Sep 2011 17:36:25 +0200, Marco Leise <Marco.Leise gmx.de> wrote:Am 14.09.2011, 17:10 Uhr, schrieb Simen Kjaeraas <simen.kjaras gmail.com>:It peaks at 4GB for me and takes about 20s to compile.On Wed, 14 Sep 2011 14:51:54 +0200, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:Wait with the champaign until this doesn't use 8 GB RAM during compile any more ;) Other than that, congrats! This is really impressive. I think you do some great pioneer work, that will help other who try to embed simple domain specific languages.Well in this particular benchmark there is little I can do, some small scale optimizations and such, I expect no more then 10%. The bigger picture is that our best bet here is static regex, the competitors use some aggressive JIT so it's a fair play. I actually received pull request this morning do just that (thx, Martin) though dmd uses so much memory it's not for mortals yet: I had 2G of ram + some 600mb of swap - fail on 3rd pattern of 9 + 2 Gb of swap - 6 out of 9 Come on, it's a sport! + 4 Gb of swap and after few restless minutes I got it working Fresh timings: FReD C-T 3.4 sec FReD R-T 6.6 sec RE2 4.8 sec V8 JS 3.7 sec Look ma, we are on top! :) So, by the end of day, D does kick ass, even though it's a bit of Pyrrhic victory.Even so, that is awesome. Another point to D for its metaprogramming.
Sep 14 2011
On 9/14/11 7:51 AM, Dmitry Olshansky wrote:On 14.09.2011 2:44, Andrei Alexandrescu wrote:Great, thanks. Could you please email me privately some more information about the benchmark and your test code? I'm considering using this in my upcoming talk at the Strange Loop conference. Thanks again, Andrei AndreiOn 9/13/11 4:37 PM, Dmitry Olshansky wrote:Yes, sir! :) As regexen have a fairly standardized interface, speed is theOK in this particular benchmark, we are slighlty behind Google inc. :) Namely RE2 and irregex. On my machine RE2 is ~5 sec while FReD does the job in ~7sec. As for irregex, ... uhm-oh I haven't actually tested it (failed to just compile) but it beats RE2 in this test by some ~30%. OpenMP declarations stripped: http://shootout.alioth.debian.org/u32/performance.php?test=regexdna Anyway, the rest of competition is not even close, and I still haven't spent all of my tricks.Sounds great! I advise you to invest effort in reducing the 25% handicap to <1%.most impactful differentiatior people look at.Well in this particular benchmark there is little I can do, some small scale optimizations and such, I expect no more then 10%. The bigger picture is that our best bet here is static regex, the competitors use some aggressive JIT so it's a fair play. I actually received pull request this morning do just that (thx, Martin) though dmd uses so much memory it's not for mortals yet: I had 2G of ram + some 600mb of swap - fail on 3rd pattern of 9 + 2 Gb of swap - 6 out of 9 Come on, it's a sport! + 4 Gb of swap and after few restless minutes I got it working Fresh timings: FReD C-T 3.4 sec FReD R-T 6.6 sec RE2 4.8 sec V8 JS 3.7 sec Look ma, we are on top! :) So, by the end of day, D does kick ass, even though it's a bit of Pyrrhic victory. --- Dmitry Olshansky
Sep 14 2011
Dmitry Olshansky:Goodies include popular regex-dna benchmark, by my measures we should beamount of engineering efforts.Though I used 64bit VM linux box, on win32 sadly it runs out of memory on the same machine (false pointers?).This is not so good. Even if this problem is not solved soon, it will be good to know its real cause.P.S. I'm taking a seat somewhere in review queue, and busying myself with newspaper :)*Gives some fine white-golden tea* Bye, bearophile
Sep 13 2011
Am 13.09.2011, 23:22 Uhr, schrieb bearophile <bearophileHUGS lycos.com>:Dmitry Olshansky:I heard a Google engineer talk about that. In JavaScript regular expressions are often used for a wider range of operations on strings (e.g. splitting), since they were generally faster than doing the same in pure JavaScript. It is one of those few occasions where a JS developer can rely on native code in a library doing most of the work.Goodies include popular regex-dna benchmark, by my measures we should besignificant amount of engineering efforts.
Sep 13 2011
On Tuesday, September 13, 2011 14:04 Dmitry Olshansky wrote:Speaking of source there are few artifacts that should end up in std.uni and not in std.regex: CodepointSet and CodepointTrie, and unicode property tables. I'd rather have them all accessible to user, but their interface is admittedly clunky, I'm open to ideas on better API.I suspect that there's value in reviewing the unicode stuff separately, and if it affects std.regex' API at all (as opposed to being simply an implementation detail), I'd suggest that we review and sort out the unicode stuff before we review then new std.regex, since that would then have definite impact on std.regex, and it affects a lot more than just std.regex. If the unicode stuff is entirely an implementation detail of std.regex and is not in the API, then we can review the unicode stuff separately and make std.regex use the stuff in std.uni once it's in there, but if you're doing major unicode stuff, I think that that's significant enough to merit its own review. - Jonathan M Davis
Sep 13 2011
On 14.09.2011 1:36, Jonathan M Davis wrote:On Tuesday, September 13, 2011 14:04 Dmitry Olshansky wrote:It doesn't affect API at all, but an important implementation detail. Particularly one can't get reasonable performance w/o unicode incorporated as an implementation detail yet I feel there should be a more general interface to it. Otherwise everyone willing to tackle unicode would have to, for instance, duplicate all of unicode property tables, and that's some 100Kb we are looking at.Speaking of source there are few artifacts that should end up in std.uni and not in std.regex: CodepointSet and CodepointTrie, and unicode property tables. I'd rather have them all accessible to user, but their interface is admittedly clunky, I'm open to ideas on better API.I suspect that there's value in reviewing the unicode stuff separately, and if it affects std.regex' API at all (as opposed to being simply an implementation detail), I'd suggest that we review and sort out the unicode stuff before we review then new std.regex, since that would then have definite impact on std.regex, and it affects a lot more than just std.regex.If the unicode stuff is entirely an implementation detail of std.regex and is not in the API, then we can review the unicode stuff separately and make std.regex use the stuff in std.uni once it's in there, but if you're doing major unicode stuff, I think that that's significant enough to merit its own review.It could be done in two steps, if I was to choose which way to go I'd first folded in std.regex then moved all its hidden unicode stuff to std.uni. It's mostly about character classification and about doing it fast. -- Dmitry Olshansky
Sep 13 2011
On Tuesday, September 13, 2011 14:52 Dmitry Olshansky wrote:On 14.09.2011 1:36, Jonathan M Davis wrote:As long as it doesn't affect the API, then we can review std.regex first, but we do definitely want to get any major unicode improvements into std.uni. And that will probably merit its own review. - Jonathan M DavisOn Tuesday, September 13, 2011 14:04 Dmitry Olshansky wrote:It doesn't affect API at all, but an important implementation detail. Particularly one can't get reasonable performance w/o unicode incorporated as an implementation detail yet I feel there should be a more general interface to it. Otherwise everyone willing to tackle unicode would have to, for instance, duplicate all of unicode property tables, and that's some 100Kb we are looking at.Speaking of source there are few artifacts that should end up in std.uni and not in std.regex: CodepointSet and CodepointTrie, and unicode property tables. I'd rather have them all accessible to user, but their interface is admittedly clunky, I'm open to ideas on better API.I suspect that there's value in reviewing the unicode stuff separately, and if it affects std.regex' API at all (as opposed to being simply an implementation detail), I'd suggest that we review and sort out the unicode stuff before we review then new std.regex, since that would then have definite impact on std.regex, and it affects a lot more than just std.regex.If the unicode stuff is entirely an implementation detail of std.regex and is not in the API, then we can review the unicode stuff separately and make std.regex use the stuff in std.uni once it's in there, but if you're doing major unicode stuff, I think that that's significant enough to merit its own review.It could be done in two steps, if I was to choose which way to go I'd first folded in std.regex then moved all its hidden unicode stuff to std.uni. It's mostly about character classification and about doing it fast.
Sep 13 2011