digitalmars.D.learn - Tips on making regex more performant?
- Gary Willoughby (24/24) Jun 18 2013 Below is an example snippet of code to test for performance of
- 1100110 (3/25) Jun 18 2013 enum alert = ctRegex!r"^Alert ([0-9]+)";
- Gary Willoughby (2/4) Jun 18 2013 Thanks. Hmmm.. i get 500K (worse performance) using that. :/
- bearophile (11/12) Jun 18 2013 Try using the LDC2 compiler, with the ctRegex, and using the
- Jacob Carlborg (6/8) Jun 18 2013 D has basically the fastest regular expression library/module. It's
- Dmitry Olshansky (13/17) Jun 18 2013 As much as I'm appeased to hear this it's isn't simply "faster then V8"....
- Jacob Carlborg (4/14) Jun 19 2013 I see, thanks for the correction.
- Dmitry Olshansky (5/10) Jun 18 2013 My bet would be allocations are taking the bulk of time. Any chance to
- Dmitry Olshansky (52/74) Jun 18 2013 Depending on you data size reading more into memory in one go and then
- KJS (21/45) Jun 18 2013 I'm working with some string-heavy applications so I was curious
- Gary Willoughby (7/15) Jun 20 2013 This is the approach i used and it's very very fast. Using
Below is an example snippet of code to test the performance of regex matches. I need to parse a large log and extract data from it, and I've noticed a huge increase in the loop's time when reading and using regex.

    auto alert = regex(r"^Alert ([0-9]+)");

    while ((line = file.readln()) !is null)
    {
        auto m = match(line, alert);

        if (m)
        {
            alerts++;
        }

        counter++;
    }

Using the above example I parse about 700K lines per second (I'm reading from an SSD). If I comment out the regex match, I read at 4.5M lines per second. Considering I need to use about 8 regex matches and extract data, this figure further drops to about 100K lines per second.

Is there anything I can do to speed up regex matching in such a scenario? Are there any tips you can share to speed things up? Thanks.
Jun 18 2013
On 06/18/2013 01:53 PM, Gary Willoughby wrote:

> Below is an example snippet of code to test for performance of regex
> matches. [...]

    enum alert = ctRegex!r"^Alert ([0-9]+)";

And then use it the same way.
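For reference, a minimal self-contained sketch of the compile-time variant (the helper name is made up; the file-reading loop from the original post is assumed around it):

```d
import std.regex;

// ctRegex builds the matching engine at compile time, so the pattern is
// parsed once during compilation rather than on program startup.
enum alertRe = ctRegex!r"^Alert ([0-9]+)";

// Hypothetical helper mirroring the loop body from the original post.
bool isAlert(const(char)[] line)
{
    return !matchFirst(line, alertRe).empty;
}
```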
Jun 18 2013
> enum alert = ctRegex!r"^Alert ([0-9]+)";
>
> And then use it the same way.

Thanks. Hmmm... I get 500K lines per second (worse performance) using that. :/

Any more tips?
Jun 18 2013
Gary Willoughby:

> Any more tips?

Try using the LDC2 compiler with ctRegex, and use the correct compilation flags (like ldmd2 -O -release -inline -noboundscheck).

If that fails, one solution is to write your own finite state machine where the state is encoded by the program counter, using gotos. There is a well-known, simple-to-use C library that generates such code automatically. But it's probably better to rethink your problem and re-frame it.

Bye,
bearophile
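To make the idea concrete, here is a minimal hand-rolled matcher for the ^Alert ([0-9]+) pattern (my sketch, not bearophile's code; the function name is made up). The scan state lives in the control flow itself, with no regex engine and no per-call allocation:

```d
import std.algorithm : startsWith;

// Hand-rolled equivalent of matching ^Alert ([0-9]+) and extracting
// the captured number. Returns false if the line doesn't match.
bool matchAlert(const(char)[] line, out int number)
{
    enum prefix = "Alert ";
    if (!line.startsWith(prefix) || line.length == prefix.length)
        return false;

    size_t i = prefix.length;
    if (line[i] < '0' || line[i] > '9')
        return false;           // pattern requires at least one digit

    int n = 0;
    while (i < line.length && line[i] >= '0' && line[i] <= '9')
    {
        n = n * 10 + (line[i] - '0');
        ++i;
    }
    number = n;
    return true;
}
```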
Jun 18 2013
On 2013-06-18 21:22, Gary Willoughby wrote:

> Thanks. Hmmm.. i get 500K (worse performance) using that. :/

D has basically the fastest regular expression library/module. It's faster than V8.

> Any more tips?

How about reading in larger chunks than single lines?

-- 
/Jacob Carlborg
Jun 18 2013
19-Jun-2013 00:34, Jacob Carlborg wrote:

> D has basically the fastest regular expression library/module. It's
> faster than V8.

As much as I'm pleased to hear this, it isn't simply "faster than V8". The V8 engine is very fast in the general case (a JIT compiler aids in that), and I think it would come out as the winner more often than std.regex. What is closer to the truth is that std.regex is typically among the top 5 well-known engines, and it beats the rest of the competition on some patterns, including in certain popular benchmarks such as regex-dna. For instance, it typically obliterates PCRE, infinitely so on Unicode patterns (I never had the patience to let it finish).

With that being said, there are many things to improve in std.regex speed-wise; sadly I wasn't able to talk about it at DConf.

-- 
Dmitry Olshansky
Jun 18 2013
On 2013-06-18 23:29, Dmitry Olshansky wrote:

> [...]

I see, thanks for the correction.

-- 
/Jacob Carlborg
Jun 19 2013
18-Jun-2013 23:22, Gary Willoughby wrote:

>> enum alert = ctRegex!r"^Alert ([0-9]+)";
>>
>> And then use it the same way.
>
> Thanks. Hmmm.. i get 500K (worse performance) using that. :/
>
> Any more tips?

My bet would be that allocations are taking the bulk of the time. Any chance to compile it with -profile?

-- 
Dmitry Olshansky
Jun 18 2013
18-Jun-2013 22:53, Gary Willoughby wrote:

> Below is an example snippet of code to test for performance of regex
> matches. [...]

Depending on your data size, read more into memory in one go and then use regex with the "g" option on it.

The difference is that each time you call `match` the regex engine has to allocate state for the matcher. This leads to basically doing a malloc/free per line. Millions of allocations per second are surely out of reach.

I've long been wondering what I can do to speed up such one-off matches, as most of the tuning has gone into the machine itself, not the start/stop paths. What could be done to fix this is adding a TLS cache of memory in std.regex, so that it reuses the same memory on each call to match. Partially this was postponed till the supposedly soon-to-come Allocators design, but that never came.

Alternatives. The simplest alternative would be to use some Appender!char object and fill it up with, say, 100 lines. A second option is to read a big buffer, find a newline from the end of it, and run a global match on that chunk. Then copy the (hopefully small) tail to the beginning of the buffer and read into the rest. The example below assumes you can't have a line bigger than the buffer size.

(it's a sketch, untested)

    alert = regex(..., "gm");
    ubyte[] buffer = new ubyte[64*1024];
    ubyte[] slice = buffer[];
    auto hasData = true;
    while (hasData)
    {
        size_t got = file.rawRead(slice).length; // + tail from prev iteration
        auto usable = got + buffer.length - slice.length;
        if (got < slice.length)
            hasData = false;
        auto idx = indexOf(retro(buffer[0 .. got]), '\n');
        if (idx < 0)
        {
            slice = buffer[];
            continue;
        }
        size_t lastLF = got - idx;
        char[] piece = cast(char[]) buffer[0 .. lastLF];
        foreach (m; match(piece, alert))
        {
            // old code here
        }
        copy(buffer[lastLF .. $], buffer[]); // copy leftover to front
        slice = buffer[$ - lastLF .. $];     // read into the rest of buffer
    }

This is how I'd go about it ATM if speed is important.

> Using the above example i parse about 700K lines per second (i'm
> reading from an SSD). [...] Considering i need to use about 8 regex
> matches and extract data, this figure further drops to about 100K
> lines per second.

Like testing 8 separate patterns? That would multiply the aforementioned malloc/free by 8. One idea is to try putting them all in a single regex with alternations '|'.

> Is there anything i can do to speed up regex matching in such a
> scenario? Are there any tips you can share to speed things up?

There are also compiler flags to fiddle with (dmd): -O -release -inline -noboundscheck

-- 
Dmitry Olshansky
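As a smaller, runnable illustration of the same idea (my sketch, not Dmitry's code; matchAll is the modern spelling of a global match): with the "m" flag one call scans a whole in-memory block, so matcher state is allocated once instead of once per line:

```d
import std.regex;

// Count "Alert <n>" lines in a whole block of text with one matcher,
// instead of calling match() on every line.
size_t countAlerts(const(char)[] block)
{
    // "m" makes ^ anchor at the start of every line in the block.
    auto alert = regex(r"^Alert ([0-9]+)", "m");
    size_t alerts = 0;
    foreach (m; matchAll(block, alert))
        ++alerts;
    return alerts;
}
```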
Jun 18 2013
On Tuesday, 18 June 2013 at 18:53:34 UTC, Gary Willoughby wrote:

> Below is an example snippet of code to test for performance of regex
> matches. [...]

I'm working with some string-heavy applications, so I was curious about this myself. I'm new to D, but I did some heavy data analysis on chat files a while back.

Not knowing anything about your data or what other queries you might want to run on it: matching the first part of the string with std.algorithm.startsWith() and splitting the line on a delimiter outperforms regex matching in my admittedly arbitrary test code. I tested a few extremes at 10,000,000 rounds:

Match everything: ~39 seconds for match, ~8 seconds for startsWith/split.
Match fails at start of string: ~10 seconds for match, ~1 second for startsWith/split.
Match fails at end of string: ~15 seconds for match, ~1 second for startsWith/split.

Even if you need a regex to match the middle, it might be worthwhile to filter the list with startsWith if you're matching a fixed string at the start of the line. Again, it depends on the frequency of hits and how the data is structured. Split is probably not the best way to slice the match, but I don't have time tonight to try other slicing methods.
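A minimal sketch of the startsWith/split approach described above (the exact field layout is my assumption, since the post doesn't show its test data):

```d
import std.algorithm : startsWith;
import std.array : split;
import std.conv : to;

// Extract the number from space-delimited lines like "Alert 42 disk full"
// with a prefix check plus split, instead of a regex.
bool parseAlert(const(char)[] line, out int number)
{
    if (!line.startsWith("Alert "))
        return false;               // cheap filter before any slicing
    auto fields = line.split(" ");  // ["Alert", "42", ...]
    if (fields.length < 2)
        return false;
    number = fields[1].to!int;
    return true;
}
```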
Jun 18 2013
> I'm working with some string-heavy applications so I was curious
> about this myself. [...]

This is the approach I used, and it's very, very fast. Using std.algorithm.startsWith() and simple delimiter-based string slicing, I've solved it. It's so fast there is hardly any slowdown.

As a side note, I also run regex matching on the same data in debug blocks, to assert that the slicing method matches what I would have got from the regex, which is nice.
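A sketch of the pattern described here (names and line format are my assumptions, not Gary's actual code): slicing on the fast path, with a debug-only regex cross-check that disappears in release builds:

```d
import std.algorithm : startsWith;
import std.conv : to;
import std.regex;

// Fast path: prefix check plus digit slicing. The debug block re-checks
// the slice against the original regex and asserts they agree.
bool extractAlert(const(char)[] line, out int id)
{
    enum prefix = "Alert ";
    if (!line.startsWith(prefix))
        return false;

    auto rest = line[prefix.length .. $];
    size_t end = 0;
    while (end < rest.length && rest[end] >= '0' && rest[end] <= '9')
        ++end;
    if (end == 0)
        return false;
    id = rest[0 .. end].to!int;

    debug
    {
        // Only compiled in with -debug: verify slicing against the regex.
        auto m = matchFirst(line, regex(r"^Alert ([0-9]+)"));
        assert(!m.empty && m[1].to!int == id,
               "slicing disagrees with the regex");
    }
    return true;
}
```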
Jun 20 2013