digitalmars.D.learn - Tips on making regex more performant?
- Gary Willoughby (24/24) Jun 18 2013 Below is an example snippet of code to test for performance of
- 1100110 (3/25) Jun 18 2013 enum alert = ctRegex!r"^Alert ([0-9]+)";
- Gary Willoughby (2/4) Jun 18 2013 Thanks. Hmmm.. i get 500K (worse performance) using that. :/
- bearophile (11/12) Jun 18 2013 Try using the LDC2 compiler, with the ctRegex, and using the
- Jacob Carlborg (6/8) Jun 18 2013 D has basically the fastest regular expression library/module. It's
- Dmitry Olshansky (13/17) Jun 18 2013 As much as I'm appeased to hear this it's isn't simply "faster then V8"....
- Jacob Carlborg (4/14) Jun 19 2013 I see, thanks for the correction.
- Dmitry Olshansky (5/10) Jun 18 2013 My bet would be allocations are taking the bulk of time. Any chance to
- Dmitry Olshansky (52/74) Jun 18 2013 Depending on you data size reading more into memory in one go and then
- KJS (21/45) Jun 18 2013 I'm working with some string-heavy applications so I was curious
- Gary Willoughby (7/15) Jun 20 2013 This is the approach i used and it's very very fast. Using
Below is an example snippet of code to test the performance of regex matches. I need to parse a large log and extract data from it, and I've noticed a huge increase in the loop's time when reading and using regex.

    auto alert = regex(r"^Alert ([0-9]+)");

    while ((line = file.readln()) !is null)
    {
        auto m = match(line, alert);

        if (m)
        {
            alerts++;
        }

        counter++;
    }

Using the above example I parse about 700K lines per second (I'm reading from an SSD). If I comment out the regex match, I read at 4.5M lines per second. Considering I need to use about 8 regex matches and extract data, this figure further drops to about 100K lines per second.

Is there anything I can do to speed up regex matching in such a scenario? Are there any tips you can share to speed things up? Thanks.
Jun 18 2013
On 06/18/2013 01:53 PM, Gary Willoughby wrote:

> Below is an example snippet of code to test for performance of regex
> matches. [...]

    enum alert = ctRegex!r"^Alert ([0-9]+)";

And then use it the same way.
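For reference, a minimal self-contained sketch of the compile-time variant (the helper name is made up; the file-reading loop from the original post is assumed around it):

```d
import std.regex;

// ctRegex builds the matching engine at compile time, so the pattern is
// parsed once during compilation rather than on program startup.
enum alertRe = ctRegex!r"^Alert ([0-9]+)";

// Hypothetical helper mirroring the loop body from the original post.
bool isAlert(const(char)[] line)
{
    return !matchFirst(line, alertRe).empty;
}
```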
Jun 18 2013
> enum alert = ctRegex!r"^Alert ([0-9]+)";
>
> And then use it the same way.

Thanks. Hmmm... I get 500K lines per second (worse performance) using that. :/

Any more tips?
Jun 18 2013
Gary Willoughby:

> Any more tips?

Try using the LDC2 compiler with ctRegex, and use the correct compilation flags (like ldmd2 -O -release -inline -noboundscheck).

If that fails, one solution is to write your own finite state machine where the state is encoded by the program counter, using gotos. There is a well-known, simple-to-use C library that generates such code automatically. But it's probably better to rethink your problem and re-frame it.

Bye,
bearophile
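To make the idea concrete, here is a minimal hand-rolled matcher for the ^Alert ([0-9]+) pattern (my sketch, not bearophile's code; the function name is made up). The scan state lives in the control flow itself, with no regex engine and no per-call allocation:

```d
import std.algorithm : startsWith;

// Hand-rolled equivalent of matching ^Alert ([0-9]+) and extracting
// the captured number. Returns false if the line doesn't match.
bool matchAlert(const(char)[] line, out int number)
{
    enum prefix = "Alert ";
    if (!line.startsWith(prefix) || line.length == prefix.length)
        return false;

    size_t i = prefix.length;
    if (line[i] < '0' || line[i] > '9')
        return false;           // pattern requires at least one digit

    int n = 0;
    while (i < line.length && line[i] >= '0' && line[i] <= '9')
    {
        n = n * 10 + (line[i] - '0');
        ++i;
    }
    number = n;
    return true;
}
```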
Jun 18 2013
On 2013-06-18 21:22, Gary Willoughby wrote:

> Thanks. Hmmm.. i get 500K (worse performance) using that. :/

D has basically the fastest regular expression library/module. It's faster than V8.

> Any more tips?

How about reading in larger chunks than single lines?

-- 
/Jacob Carlborg
Jun 18 2013
19-Jun-2013 00:34, Jacob Carlborg wrote:

> D has basically the fastest regular expression library/module. It's
> faster than V8.

As much as I'm pleased to hear this, it isn't simply "faster than V8". The V8 engine is very fast in the general case (a JIT compiler aids in that), and I think it would come out as the winner more often than std.regex. What is closer to the truth is that std.regex is typically among the top 5 well-known engines, and it beats the rest of the competition on some patterns, including in certain popular benchmarks such as regex-dna. For instance, it typically obliterates PCRE, infinitely so on Unicode patterns (I never had the patience to let it finish).

With that being said, there are many things to improve in std.regex speed-wise; sadly I wasn't able to talk about it at DConf.

-- 
Dmitry Olshansky
Jun 18 2013
On 2013-06-18 23:29, Dmitry Olshansky wrote:

> [...]

I see, thanks for the correction.

-- 
/Jacob Carlborg
Jun 19 2013
18-Jun-2013 23:22, Gary Willoughby wrote:

>> enum alert = ctRegex!r"^Alert ([0-9]+)";
>>
>> And then use it the same way.
>
> Thanks. Hmmm.. i get 500K (worse performance) using that. :/
>
> Any more tips?

My bet would be that allocations are taking the bulk of the time. Any chance to compile it with -profile?

-- 
Dmitry Olshansky
Jun 18 2013
18-Jun-2013 22:53, Gary Willoughby wrote:

> Below is an example snippet of code to test for performance of regex
> matches. [...]

Depending on your data size, read more into memory in one go and then use regex with the "g" option on it.

The difference is that each time you call `match` the regex engine has to allocate state for the matcher. This leads to basically doing a malloc/free per line. Millions of allocations per second are surely out of reach.

I've long been wondering what I can do to speed up such one-off matches, as most of the tuning has gone into the machine itself, not the start/stop paths. What could be done to fix this is adding a TLS cache of memory in std.regex, so that it reuses the same memory on each call to match. Partially this was postponed till the supposedly soon-to-come Allocators design, but that never came.

Alternatives. The simplest alternative would be to use some Appender!char object and fill it up with, say, 100 lines. A second option is to read a big buffer, find a newline from the end of it, and run a global match on that chunk. Then copy the (hopefully small) tail to the beginning of the buffer and read into the rest. The example below assumes you can't have a line bigger than the buffer size.

(it's a sketch, untested)

    alert = regex(..., "gm");
    ubyte[] buffer = new ubyte[64*1024];
    ubyte[] slice = buffer[];
    auto hasData = true;
    while (hasData)
    {
        size_t got = file.rawRead(slice).length; // + tail from prev iteration
        auto usable = got + buffer.length - slice.length;
        if (got < slice.length)
            hasData = false;
        auto idx = indexOf(retro(buffer[0 .. got]), '\n');
        if (idx < 0)
        {
            slice = buffer[];
            continue;
        }
        size_t lastLF = got - idx;
        char[] piece = cast(char[]) buffer[0 .. lastLF];
        foreach (m; match(piece, alert))
        {
            // old code here
        }
        copy(buffer[lastLF .. $], buffer[]); // copy leftover to front
        slice = buffer[$ - lastLF .. $];     // read into the rest of buffer
    }

This is how I'd go about it ATM if speed is important.

> Using the above example i parse about 700K lines per second (i'm
> reading from an SSD). [...] Considering i need to use about 8 regex
> matches and extract data, this figure further drops to about 100K
> lines per second.

Like testing 8 separate patterns? That would multiply the aforementioned malloc/free by 8. One idea is to try putting them all in a single regex with alternations '|'.

> Is there anything i can do to speed up regex matching in such a
> scenario? Are there any tips you can share to speed things up?

There are also compiler flags to fiddle with (dmd): -O -release -inline -noboundscheck

-- 
Dmitry Olshansky
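As a smaller, runnable illustration of the same idea (my sketch, not Dmitry's code; matchAll is the modern spelling of a global match): with the "m" flag one call scans a whole in-memory block, so matcher state is allocated once instead of once per line:

```d
import std.regex;

// Count "Alert <n>" lines in a whole block of text with one matcher,
// instead of calling match() on every line.
size_t countAlerts(const(char)[] block)
{
    // "m" makes ^ anchor at the start of every line in the block.
    auto alert = regex(r"^Alert ([0-9]+)", "m");
    size_t alerts = 0;
    foreach (m; matchAll(block, alert))
        ++alerts;
    return alerts;
}
```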
Jun 18 2013
On Tuesday, 18 June 2013 at 18:53:34 UTC, Gary Willoughby wrote:

> Below is an example snippet of code to test for performance of regex
> matches. [...]

I'm working with some string-heavy applications, so I was curious about this myself. I'm new to D, but I did some heavy data analysis on chat files a while back.

Not knowing anything about your data or what other queries you might want to run on it: matching the first part of the string with std.algorithm.startsWith() and splitting the line on a delimiter outperforms regex matching in my admittedly arbitrary test code. I tested a few extremes at 10,000,000 rounds:

Match everything: ~39 seconds for match, ~8 seconds for startsWith/split.
Match fails at start of string: ~10 seconds for match, ~1 second for startsWith/split.
Match fails at end of string: ~15 seconds for match, ~1 second for startsWith/split.

Even if you need a regex to match the middle, it might be worthwhile to filter the list with startsWith if you're matching a fixed string at the start of the line. Again, it depends on the frequency of hits and how the data is structured. Split is probably not the best way to slice the match, but I don't have time tonight to try other slicing methods.
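A minimal sketch of the startsWith/split approach described above (the exact field layout is my assumption, since the post doesn't show its test data):

```d
import std.algorithm : startsWith;
import std.array : split;
import std.conv : to;

// Extract the number from space-delimited lines like "Alert 42 disk full"
// with a prefix check plus split, instead of a regex.
bool parseAlert(const(char)[] line, out int number)
{
    if (!line.startsWith("Alert "))
        return false;               // cheap filter before any slicing
    auto fields = line.split(" ");  // ["Alert", "42", ...]
    if (fields.length < 2)
        return false;
    number = fields[1].to!int;
    return true;
}
```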
Jun 18 2013
> I'm working with some string-heavy applications so I was curious
> about this myself. [...]

This is the approach I used, and it's very, very fast. Using std.algorithm.startsWith() and simple delimiter-based string slicing, I've solved it. It's so fast there is hardly any slowdown.

As a side note, I also run regex matching on the same data in debug blocks, to assert that the slicing method matches what I would have got from the regex, which is nice.
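A sketch of the pattern described here (names and line format are my assumptions, not Gary's actual code): slicing on the fast path, with a debug-only regex cross-check that disappears in release builds:

```d
import std.algorithm : startsWith;
import std.conv : to;
import std.regex;

// Fast path: prefix check plus digit slicing. The debug block re-checks
// the slice against the original regex and asserts they agree.
bool extractAlert(const(char)[] line, out int id)
{
    enum prefix = "Alert ";
    if (!line.startsWith(prefix))
        return false;

    auto rest = line[prefix.length .. $];
    size_t end = 0;
    while (end < rest.length && rest[end] >= '0' && rest[end] <= '9')
        ++end;
    if (end == 0)
        return false;
    id = rest[0 .. end].to!int;

    debug
    {
        // Only compiled in with -debug: verify slicing against the regex.
        auto m = matchFirst(line, regex(r"^Alert ([0-9]+)"));
        assert(!m.empty && m[1].to!int == id,
               "slicing disagrees with the regex");
    }
    return true;
}
```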
Jun 20 2013