digitalmars.D - "Regular Expression Matching Can Be Simple And Fast (but...) "
- spir (11/11) Feb 24 2011 Hello,
- Trass3r (2/2) Feb 24 2011 Yep, finite automata FTW!
- Andrei Alexandrescu (5/12) Feb 24 2011 An often-quoted article. It turns out Thompson automata are 2x slower
- PeteC (49/49) Feb 27 2011 A few months ago I was using D on a project that involved filtering and
- Nick Sabalausky (35/41) Feb 24 2011 Yea, I've come across that before. There's definitely good stuff in ther...
- spir (11/18) Feb 24 2011 Actually, couldn't you say that about 90% of all technical writings, inc...
- Russel Winder (22/24) Feb 24 2011 On Thu, 2011-02-24 at 17:43 -0500, Nick Sabalausky wrote:
- Paulo Pinto (4/16) Feb 25 2011 The author is one of Go's developers. I wonder how fast is the current
- Russel Winder (25/45) Feb 25 2011 Is there a small microbenchmark you want to try out? I guess doing a C
- Paulo Pinto (9/9) Feb 25 2011 Hi,
Hello,

"Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)"

A *very* interesting and well written article about slow & fast regex engines, why and how:
http://swtch.com/~rsc/regexp/regexp1.html

Denis
--
_________________
vita es estrany
spir.wikidot.com
Feb 24 2011
Yep, finite automata FTW! Same goes for non-regexp string matching, see Aho-Corasick algorithm.
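For readers who haven't met it, the Aho-Corasick idea can be sketched briefly. The following is an illustrative Python sketch (not from this thread and not tied to any particular library): build a trie over the patterns, compute failure links by BFS, and then every occurrence of every pattern is found in a single linear pass over the text.

```python
from collections import deque

def build(patterns):
    """Build an Aho-Corasick automaton: a trie plus failure links."""
    goto = [{}]      # goto[state] maps a character to the next state
    fail = [0]       # fail[state] points to the longest proper suffix state
    out = [set()]    # out[state] holds the patterns that end at this state
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    # BFS so that shallower states get their failure links first.
    q = deque(goto[0].values())
    while q:
        r = q.popleft()
        for c, u in goto[r].items():
            q.append(u)
            f = fail[r]
            while f and c not in goto[f]:
                f = fail[f]
            fail[u] = goto[f].get(c, 0)
            out[u] |= out[fail[u]]   # inherit matches ending at the suffix state
    return goto, fail, out

def search(text, goto, fail, out):
    """One linear scan; yields (start_index, pattern) for every match."""
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits
```

With patterns {"he", "she", "his", "hers"}, searching "ushers" reports "she", "he" and "hers", including the overlapping ones, in one pass.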
Feb 24 2011
On 2/24/11 8:26 AM, spir wrote:
> Hello,
> "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)"
> A *very* interesting and well written article about slow & fast regex engines, why and how:
> http://swtch.com/~rsc/regexp/regexp1.html
> Denis

An often-quoted article. It turns out Thompson automata are 2x slower than Perl's for many real-world inputs, so it should be taken with a grain of salt.

Andrei
Feb 24 2011
A few months ago I was using D on a project that involved filtering and transforming file paths. Using D helped a lot, but I was a bit frustrated to discover there were a number of bugs in the standard regex library. I was able to fix a couple, which I reported in bugzilla, but I couldn't understand the source code well enough to deal with all of them.

After reading Russ Cox's articles, I implemented an experimental regular expression engine based on the operations and execution methods mentioned in the http://swtch.com/~rsc/regexp/regexp2.html article. It's not yet complete, but I've put it up on github if anyone is interested in looking at it:

https://github.com/PeteChadwick/pcregexeng

Bear in mind that at this stage it's just an experiment and isn't intended to be used as a regex library. Comments, suggestions and criticisms are welcome.

The library includes a Thompson style machine and a very basic recursive backtracking machine. I tried a few simple benchmarks of these machines, alongside std.regex.match. The results are below:

Regex test

Regex: '([a-zA-Z0-9._%+-]+) ([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 100000
Text length: 20
Text: User domain.name.com
lockstep (Email): 2699 ticks 1415.05 ticks/MB (User domain.name.com)
backtrack (Email): 109 ticks 57.1474 ticks/MB (User domain.name.com)
std.regex (Email): 250 ticks 131.072 ticks/MB

Regex: '([a-zA-Z0-9._%+-]+) ([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 10
Text length: 210020
Text: not-an-email-address not-an-email-address not-an-e...
lockstep (Email prefix 1): 2074 ticks 1035.5 ticks/MB (User domain.name.com)
backtrack (Email prefix 1): 468 ticks 233.66 ticks/MB (User domain.name.com)
std.regex (Email prefix 1): 1342 ticks 670.026 ticks/MB

Regex: '([a-zA-Z0-9._%+-]+) ([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})'
Iterations: 10
Text length: 1048596
Text: ||||||||||||||||||||||||||||||||||||||||||||||||||...
lockstep (Email prefix 2): 5944 ticks 594.389 ticks/MB (User domain.name.com)
backtrack (Email prefix 2): 717 ticks 71.6986 ticks/MB (User domain.name.com)
std.regex (Email prefix 2): 468 ticks 46.7991 ticks/MB

Regex: 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaa'
Iterations: 1
Text length: 18
Text: aaaaaaaaaaaaaaaaaa
lockstep (Pathological): 0 ticks 0 ticks/MB (aaaaaaaaaaaaaaaaaa)
backtrack (Pathological): 47 ticks 2.73795e+06 ticks/MB (aaaaaaaaaaaaaaaaaa)
std.regex (Pathological): 14586 ticks 8.49696e+08 ticks/MB

In terms of ticks/MB the Thompson machine implementation is quite a lot slower in these tests, but it shows more consistency and is able to deal with the pathological case.
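The blow-up the pathological benchmark exercises is easy to reproduce with a toy backtracking matcher. The sketch below is illustrative Python, not PeteC's D code: it handles only literal characters and '?', just enough to run a pattern like a?^n a^n against a^n and count the recursive calls, which grow roughly exponentially in n.

```python
def parse(pat):
    """Split a pattern of literals and '?' into (char, optional) tokens."""
    tokens, i = [], 0
    while i < len(pat):
        opt = i + 1 < len(pat) and pat[i + 1] == "?"
        tokens.append((pat[i], opt))
        i += 2 if opt else 1
    return tokens

def backtrack(tokens, text, ti=0, si=0, calls=None):
    """Recursive backtracking match; 'calls' counts the work performed."""
    calls[0] += 1
    if ti == len(tokens):
        return si == len(text)
    ch, opt = tokens[ti]
    # Guess "consume a character" first; if that fails, an optional token
    # may be skipped instead -- this retrying is where the time goes.
    if si < len(text) and text[si] == ch and \
            backtrack(tokens, text, ti + 1, si + 1, calls):
        return True
    return opt and backtrack(tokens, text, ti + 1, si, calls)

def count_calls(n):
    """Match a?^n a^n against a^n, returning (matched, number_of_calls)."""
    calls = [0]
    matched = backtrack(parse("a?" * n + "a" * n), "a" * n, calls=calls)
    return matched, calls[0]
```

Each extra 'a?' multiplies the number of calls by roughly a constant factor, mirroring the huge std.regex tick count on the pathological case above, while a lockstep machine handles the same input in linear time.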
Feb 27 2011
"spir" <denis.spir gmail.com> wrote in message news:mailman.1923.1298557650.4748.digitalmars-d puremagic.com...
> Hello,
> "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)"
> A *very* interesting and well written article about slow & fast regex engines, why and how:
> http://swtch.com/~rsc/regexp/regexp1.html

Yea, I've come across that before. There's definitely good stuff in there, and it's much better-written than a lot of things I've seen. But one thing I really don't like about it is that there's a fundamental relationship going on that he merely implies throughout the article and never clearly states right up-front:

---------------------------
A regex engine first needs to convert the regex to an NFA, but then there are three ways it can proceed:

1. Execute the NFA directly, which involves making guesses at certain points and then backing up if the guesses turn out wrong. This is what Perl does. Unfortunately, all this backing up and retrying makes it scale very poorly on certain complex regexes.

2. Convert the NFA to a DFA and execute the DFA. Optionally, the resulting DFA can then be optimized further. There's the extra upfront overhead of the conversion, but executing the DFA (even if it isn't optimized) never involves any guessing or backing up to retry, so it scales well.

3. Thompson NFA: Execute the NFA as if it were the equivalent DFA. You're essentially executing the DFA *as* you generate it from the NFA. (Although the resultant DFA doesn't actually get stored.)

Note, of course, that all this applies to any use of an NFA, not just regex.
---------------------------

Those few little paragraphs are the whole damn *point* of the article, but the only way to get that take-away is to stitch it together by reading the entire multi-page text. Without that "big picture" I had a much harder time than necessary understanding it the first time I read it, even though I was already aware that NFAs can be converted to DFAs.
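The third approach can be made concrete with a tiny sketch. The following is illustrative Python, not a real regex engine and not taken from any library discussed here: it handles only literal characters and '?', but it shows the lockstep idea of keeping the *set* of live NFA states after each input character, so the pathological a?^n a^n pattern runs in linear time with no guessing or backing up.

```python
def parse(pat):
    """Tokens are (char, optional) pairs; 'a?' becomes ('a', True)."""
    tokens, i = [], 0
    while i < len(pat):
        opt = i + 1 < len(pat) and pat[i + 1] == "?"
        tokens.append((pat[i], opt))
        i += 2 if opt else 1
    return tokens

def closure(states, tokens):
    """Epsilon closure: an optional token may always be skipped for free."""
    out, stack = set(), list(states)
    while stack:
        i = stack.pop()
        if i in out:
            continue
        out.add(i)
        if i < len(tokens) and tokens[i][1]:
            stack.append(i + 1)
    return out

def lockstep_match(pat, text):
    """Simulate all NFA paths at once: one set of states per input char."""
    tokens = parse(pat)
    states = closure({0}, tokens)        # every state reachable before input
    for c in text:
        # Advance each live state on c, then take the epsilon closure again.
        states = closure({i + 1 for i in states
                          if i < len(tokens) and tokens[i][0] == c}, tokens)
    return len(tokens) in states         # did any path reach the end state?
```

The state set never holds more entries than there are NFA states, so the total work is O(pattern length x text length) even on patterns that send a backtracking engine into exponential retries.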
Actually, that's one of the things I really hate about academic writings in general (not that this article is particularly academic, as far as academic-themed stuff goes): There's rarely a decent overview. You're expected to implicitly piece together the big picture *as* you learn the details. Which, of course, is stupid because details mean practically nothing without proper context. Context *helps* you understand the details, plus it leaves less to learn all at once.
Feb 24 2011
On 02/24/2011 11:43 PM, Nick Sabalausky wrote:
> Actually, that's one of the things I really hate about academic writings in general (not that this article is particularly academic, as far as academic-themed stuff goes): There's rarely a decent overview. You're expected to implicitly piece together the big picture *as* you learn the details. Which, of course, is stupid because details mean practically nothing without proper context. Context *helps* you understand the details, plus it's less to be trying to learn all at once.

Actually, couldn't you say that about 90% of all technical writings, including purely practical ones (or ones supposed to be, at least), among which much of the D doc? "You're expected to implicitly piece together the big picture *as* you learn the details." That's the feeling I get, at least. I'm all the time guessing (especially the /purpose/ of this or that feature or sub-feature).

Denis
--
_________________
vita es estrany
spir.wikidot.com
Feb 24 2011
On Thu, 2011-02-24 at 17:43 -0500, Nick Sabalausky wrote:
[ . . . ]

Send feedback to Russ Cox?

[ . . . ]
> Actually, that's one of the things I really hate about academic writings in
> general (not that this article is particularly academic, as far as
[ . . . ]

I take exception to this category of "academic writing" and lumping everything of this sort into that category to be slagged off. Just because some academics can't author good documents doesn't imply all have this failing. I would hazard a guess though that a far greater percentage of software developers in industry and commerce are unable to write good articles.

--
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Feb 24 2011
The author is one of Go's developers. I wonder how fast the current regexp implementation in Go is.

"spir" <denis.spir gmail.com> wrote in message news:mailman.1923.1298557650.4748.digitalmars-d puremagic.com...
> Hello,
> "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)"
> A *very* interesting and well written article about slow & fast regex engines, why and how:
> http://swtch.com/~rsc/regexp/regexp1.html
> Denis
Feb 25 2011
On Fri, 2011-02-25 at 09:37 +0100, Paulo Pinto wrote:
> The author is one of Go's developers. I wonder how fast is the current
> regexp implementation in Go.

Is there a small microbenchmark you want to try out? I guess doing a C or C++ / D / Go comparison is appropriate! (Better than debating rhetorical questions :-) I'd be up for participating.

We should note though that the standard Go is known not to be highly optimized; it is in fact quite slow. This is a known state of affairs, and is right given the evolution of the semantics of the language and libraries just now. They will soon though have to put high optimization on the road map. Of course there is gccgo, which should benefit from all the GNU optimization goodness.

--
Russel.
Feb 25 2011
Hi,

I know my way around C and C++ quite well, but I am afraid that I lack the knowledge to create a high-performance regexp engine that would do justice to the chosen language.

--
Paulo

"Russel Winder" <russel russel.org.uk> wrote in message news:mailman.1943.1298624548.4748.digitalmars-d puremagic.com...
Feb 25 2011