www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Tips on making regex more performant?

reply "Gary Willoughby" <dev kalekold.net> writes:
Below is an example snippet of code to test for performance of 
regex matches. I need to parse a large log and extract data from 
it and i've noticed a huge increase in time of the loop when 
reading and using regex.

	...
	auto alert = regex(r"^Alert ([0-9]+)");

	while ((line = file.readln()) !is null)
	{
		auto m = match(line, alert);

		if (m)
		{
			alerts++;
		}

		counter++;
	}
	...

Using the above example i parse about 700K lines per second (i'm 
reading from an SSD). If i comment out the regex match function, 
i read at 4.5M lines per second. Considering i need to use about 
8 regex matches and extract data, this figure further drops to 
about 100K lines per second.

Is there anything i can do to speed up regex matching in such a 
scenario? Are there any tips you can share to speed things up?

Thanks.
Jun 18 2013
next sibling parent reply 1100110 <0b1100110 gmail.com> writes:
On 06/18/2013 01:53 PM, Gary Willoughby wrote:
 Below is an example snippet of code to test for performance of regex
 matches. I need to parse a large log and extract data from it and i've
 noticed a huge increase in time of the loop when reading and using regex.

      ...
      auto alert = regex(r"^Alert ([0-9]+)");

      while ((line = file.readln()) !is null)
      {
          auto m = match(line, alert);

          if (m)
          {
              alerts++;
          }

          counter++;
      }
      ...

 Using the above example i parse about 700K lines per second (i'm reading
 from an SSD). If i comment out the regex match function, i read at 4.5M
 lines per second. Considering i need to use about 8 regex matches and
 extract data, this figure further drops to about 100K lines per second.

 Is there anything i can do to speed up regex matching in such a
 scenario? Are there any tips you can share to speed things up?

 Thanks.
enum alert = ctRegex!r"^Alert ([0-9]+)"; And then use it the same way.
Jun 18 2013
parent reply "Gary Willoughby" <dev kalekold.net> writes:
 enum alert = ctRegex!r"^Alert ([0-9]+)";

 And then use it the same way.
Thanks. Hmmm.. i get 500K (worse performance) using that. :/ Any more tips?
Jun 18 2013
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Gary Willoughby:

 Any more tips?
Try using the LDC2 compiler, with the ctRegex, and using the correct compilation flags (like ldmd2 -O -release -inline -noboundscheck). If that fails one solution is to write your own finite state machine where the state is encoded by the program counter, using gotos. There is a well known simple to use C library that generates such code automatically. But it's probably better to rethink your problem and re-frame it. Bye, bearophile
Jun 18 2013
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-06-18 21:22, Gary Willoughby wrote:

 Thanks. Hmmm.. i get 500K (worse performance) using that. :/
D has basically the fastest regular expression library/module. It's faster than V8.
 Any more tips?
How about reading in larger chunks than single lines? -- /Jacob Carlborg
Jun 18 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
19-Jun-2013 00:34, Jacob Carlborg пишет:
 On 2013-06-18 21:22, Gary Willoughby wrote:

 Thanks. Hmmm.. i get 500K (worse performance) using that. :/
D has basically the fastest regular expression library/module. It's faster than V8.
As much as I'm appeased to hear this it's isn't simply "faster then V8". V8 one is very fast in general case (JIT compiler aids in that) and I think would come out as winer more often then std.regex. What is closer to truth is that typically std.regex is within about 5 top well-known engines and it's beats the rest of competition on some patterns including in certain popular benchmarks such as regex-dna. For instance it typically obliterates PCRE, infinitely so on Unicode patterns (I never had patience to let it finish) . With that being said there are many things to improve in std.regex speed-wise, sadly I haven't been able to tell about it at DConf. -- Dmitry Olshansky
Jun 18 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-06-18 23:29, Dmitry Olshansky wrote:

 As much as I'm appeased to hear this it's isn't simply "faster then V8".
 V8 one is very fast in general case (JIT compiler aids in that) and I
 think would come out as winer more often then std.regex.

 What is closer to truth is that typically std.regex is within about 5
 top well-known engines and it's beats the rest of competition on some
 patterns including in certain popular benchmarks such as regex-dna.

 For instance it typically obliterates PCRE, infinitely so on Unicode
 patterns (I never had patience to let it finish) .

 With that being said there are many things to improve in std.regex
 speed-wise, sadly I haven't been able to tell about it at DConf.
I see, thanks for the correction. -- /Jacob Carlborg
Jun 19 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
18-Jun-2013 23:22, Gary Willoughby пишет:
 enum alert = ctRegex!r"^Alert ([0-9]+)";

 And then use it the same way.
Thanks. Hmmm.. i get 500K (worse performance) using that. :/
My bet would be allocations are taking the bulk of time. Any chance to compile it with -profile?
 Any more tips?
-- Dmitry Olshansky
Jun 18 2013
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
18-Jun-2013 22:53, Gary Willoughby пишет:
 Below is an example snippet of code to test for performance of regex
 matches. I need to parse a large log and extract data from it and i've
 noticed a huge increase in time of the loop when reading and using regex.

      ...
      auto alert = regex(r"^Alert ([0-9]+)");

      while ((line = file.readln()) !is null)
      {
          auto m = match(line, alert);

          if (m)
          {
              alerts++;
          }

          counter++;
      }
      ...
Depending on you data size reading more into memory in one go and then use regex with "g" option on it. The difference is that each time you call `match` regex engine has to allocate state for the matcher. This leads to basically doing malloc/free per line. Millions of allocations per second are surely out of reach. I've been long wondering what I can do to speed up such one-off matches as most of tuning had gone into the machine itself, not start/stop paths. What can be done to fix this is adding TLS cache of memory in std.regex, so that it reuses the same memory on each call to match. Partially this was postponed till supposedly soon to come Allocators design but it never did. Alternatives. Simplest alternative would be to use some Appender!char object and fill it up with say 100 lines. Second option is read up a big buffer, find a newline from the end of it and run global match on this chunk. Copy the (hopefully small) tail to the beginning of buffer and read into the rest. The example below assumes you can't have line bigger then some buffer size. (it's a sketch, untested) alert = regex(..., "gm"); ubyte[] buffer = new ubyte[64*1024]; ubyte[] slice = buffer[]; auto hasData = true; while(hasData){ size_t got = file.rawRead(slice).length; // + tail from prev iteration auto usable = got + buffer.length - slice.length; if(got < slice.length) hasData = false; auto idx = indexOf(retro(buffer[0..got]), '\n'); if(idx < 0){ slice = buffer[]; continue; } size_t lastLF = got-idx; char[] piece = cast(char[])buffer[0..lastLF]; foreach(m; match(piece, alert)) { //old code here } copy(buffer[lastLF..$], buffer[]); //copy leftover to front slice = buffer[$-lastLF..$]; //read into the rest of buffer } This is how I'd go about it ATM if speed is important.
 Using the above example i parse about 700K lines per second (i'm reading
 from an SSD). If i comment out the regex match function, i read at 4.5M
 lines per second. Considering i need to use about 8 regex matches and
 extract data, this figure further drops to about 100K lines per second.
Like test 8 separate patterns? This would multiply aforementioned malloc/free by 8. One idea is to try putting them all in a single regex with alternations '|'.
 Is there anything i can do to speed up regex matching in such a
 scenario? Are there any tips you can share to speed things up?
There are also compiler flags to fiddle with (dmd): -O -release -inline -noboundscheck
 Thanks.
-- Dmitry Olshansky
Jun 18 2013
prev sibling parent reply "KJS" <kirkjobsluder fastmail.fm> writes:
On Tuesday, 18 June 2013 at 18:53:34 UTC, Gary Willoughby wrote:
 Below is an example snippet of code to test for performance of 
 regex matches. I need to parse a large log and extract data 
 from it and i've noticed a huge increase in time of the loop 
 when reading and using regex.

 	...
 	auto alert = regex(r"^Alert ([0-9]+)");

 	while ((line = file.readln()) !is null)
 	{
 		auto m = match(line, alert);

 		if (m)
 		{
 			alerts++;
 		}

 		counter++;
 	}
 	...

 Using the above example i parse about 700K lines per second 
 (i'm reading from an SSD). If i comment out the regex match 
 function, i read at 4.5M lines per second. Considering i need 
 to use about 8 regex matches and extract data, this figure 
 further drops to about 100K lines per second.

 Is there anything i can do to speed up regex matching in such a 
 scenario? Are there any tips you can share to speed things up?

 Thanks.
I'm working with some string-heavy applications so I was curious about this myself. I'm new to D, but I did some heavy data analysis on chat files a while back. Not knowing anything about your data or what other queries you might want to do on it, matching the first part of the string with std.algorithm.startsWith() and splitting the line on a delimiter outperforms regex matching on my admittedly arbitrary test code. I tested two extremes at 10,000,000 rounds: Match everything: ~39 seconds for match, ~8 seconds for startsWith/split. Match fails at start of string: ~10 seconds for match, ~1 second for startsWith/split. Match fails at end of string: ~15 seconds for match, ~1 second for startsWith/split. Even if you need a regex to match the middle, it might be worthwhile to filter the list with startsWith if you're matching a fixed string at the start of the line. Again, it depends on the frequency of hits and how the data is structured. Split is probably not the best way to slice the match, but I don't have time tonight to try other slicing methods.
Jun 18 2013
parent "Gary Willoughby" <dev kalekold.net> writes:
 I'm working with some string-heavy applications so I was 
 curious about this myself. I'm new to D, but I did some heavy 
 data analysis on chat files a while back.

 Not knowing anything about your data or what other queries you 
 might want to do on it, matching the first part of the string 
 with std.algorithm.startsWith() and splitting the line on a 
 delimiter outperforms regex matching on my admittedly arbitrary 
 test code.
This is the approach i used and it's very very fast. Using std.algorithm.startsWith() and simple delimiter based string slicing i've solved it. It's so fast there is hardly any slow down. As a side note, i also use regex matching for the same data in debug blocks to assert the slicing method matches what i would of got from the regex, which is nice.
Jun 20 2013