digitalmars.D - ctRegex vs. Regex vs. plain string

"Chris" <wendlec tcd.ie> writes:

I have updated my code (finally!) to 2.060. As my project deals a 
lot with text processing including loads of special characters 
(á, ú etc.), I make extensive use of the std.regex module (and I 
really appreciate the use of the Thompson NFA). To optimize my 
program I have experimented with ctRegex / StaticRegex and Regex. 
However, there are still compile time problems with Regex and 
StaticRegex which is why I am using plain strings at the moment, 
which work fine with the same regular expressions. Are there any 
precautions I have to take when using compile time regular 
expressions? Does anyone have any experience as regards 
performance enhancement?

Dec 06 2012

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 12/6/2012 7:21 PM, Chris wrote:
> I have updated my code (finally!) to 2.060.

Congrats!

> As my project deals a lot
> with text processing including loads of special characters (á, ú etc.),
> I make extensive use of the std.regex module (and I really appreciate
> the use of the Thompson NFA). To optimize my program I have experimented
> with ctRegex / StaticRegex and Regex. However, there are still compile
> time problems with Regex and StaticRegex which is why I am using plain
> strings at the moment, which work fine with the same regular
> expressions.

At first I was confused by "make extensive use of the std.regex" and 
"using plain strings". But then I recalled the problematic "bug" in how 
the compiler treats globals.

So if your code goes like this:

//globals or statics
auto re1 = regex(...);
auto re2 = regex(...);
//...
auto reK = regex(...);

//and e.g. in main:
void main(){
  ... use reX etc. ...
}

Then the long compile times are caused by the compiler constant-folding 
the re1..reK variables, which forces it to parse and compile these 
patterns at compile time.

While it's cute and looks like a minor optimization, it can make compile 
times monstrous, especially as it just produces the same ordinary pattern 
that the run-time (R-T) regex uses. The way out is to keep compiled 
patterns on the stack or to initialize them inside a static this() 
module constructor.
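A minimal sketch of both workarounds, assuming a D compiler with std.regex available. The names numberRe, firstNumber and hasWord, and the patterns themselves, are made up for illustration and do not come from the original code.

```d
import std.regex;

// Module-scope declaration WITHOUT an initializer: there is nothing here
// for the compiler to evaluate at compile time.
Regex!char numberRe;

static this()
{
    // Module constructor: the pattern is parsed and compiled at program
    // start, at run time.
    numberRe = regex(`\d+`);
}

/// Returns the first run of digits in `s`, or an empty string if none.
string firstNumber(string s)
{
    auto m = match(s, numberRe);
    return m.empty ? "" : m.hit;
}

/// Same idea, with the compiled pattern kept on the stack instead.
bool hasWord(string s)
{
    auto wordRe = regex(`\w+`); // local variable, compiled at run time
    return !match(s, wordRe).empty;
}
```

The only difference from the slow-compiling version is where the call to regex() sits: in a module constructor or a function body rather than in a global initializer.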

As for using strings as patterns: std.regex compiles them internally and 
caches the last 8 of them. In other words, scripts and programs that use 
only a few patterns should be fine with plain strings. It doesn't slow 
things down considerably, even in a tight loop.

But once you have about 10 or more commonly used patterns, precompiling 
them is the better option.
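To make the two styles concrete, here is a small sketch. The function names and the pattern are hypothetical; the first function relies on the internal pattern cache on every call, the second compiles the pattern once up front.

```d
import std.regex;

/// Passes the pattern as a plain string on every iteration; std.regex
/// compiles it internally and can reuse a cached copy.
size_t countWithString(string[] lines)
{
    size_t n;
    foreach (line; lines)
        if (!match(line, `\d+`).empty)
            ++n;
    return n;
}

/// Compiles the pattern once, outside the loop, and reuses the result.
size_t countPrecompiled(string[] lines)
{
    auto re = regex(`\d+`);
    size_t n;
    foreach (line; lines)
        if (!match(line, re).empty)
            ++n;
    return n;
}
```

Both functions return the same counts; the precompiled form only starts to pay off once a program cycles through more patterns than the cache holds.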

> Are there any precautions I have to take when using compile
> time regular expressions?

One precaution is to use ctRegex only when things are well tested and 
you are ready to go for that extra speed. It typically takes a lot of 
time and RAM to compile.

Then again, testing that the results do match is recommended. Simply 
because of the pressure it puts on the compiler, ctRegex is not as well 
tested as the regular engine (it goes through only a couple of tests in 
the Phobos unittests).
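One way to run such a check is to feed the same inputs to both engines and compare what they find. The sketch below assumes a hypothetical pattern and a helper named sameResults; neither comes from the original code.

```d
import std.regex;

// Hypothetical pattern, used only for illustration.
enum pattern = `\w+-\d+`;

/// Returns true if the run-time and compile-time engines agree on the
/// first match for every sample.
bool sameResults(string[] samples)
{
    auto rt = regex(pattern);     // compiled at run time
    auto ct = ctRegex!(pattern);  // matcher generated at compile time
    foreach (s; samples)
    {
        auto a = match(s, rt);
        auto b = match(s, ct);
        if (a.empty != b.empty)
            return false;
        if (!a.empty && a.hit != b.hit)
            return false;
    }
    return true;
}
```

Calling this from a unittest block with real-world samples, including text with characters such as á and ú, would catch a disagreement before the ctRegex version is relied on.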

> Does anyone have any experience as regards
> performance enhancement?

You tell me ;) As a matter of fact, I collect problematic or frequent 
patterns; I guess I need to advertise that somewhere.

Seriously, it depends on the patterns and the data. I'd expect it to be 
about 20-50% faster. But there are even cases where it may slow things 
down (the compile-time (C-T) backend is not as sophisticated as the 
primary R-T one... something to improve with time).
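Since the gain depends on the pattern and the data, measuring is the only reliable answer. A rough sketch follows, assuming a recent Phobos that provides std.datetime.stopwatch (it postdates 2.060); the pattern and the timeEngines helper are hypothetical.

```d
import core.time : Duration;
import std.datetime.stopwatch : benchmark;
import std.regex;

// Hypothetical pattern, used only for illustration.
enum pattern = `\w+-\d+`;

/// Runs `iterations` searches with each engine.
/// Index 0 is the run-time regex, index 1 is ctRegex.
Duration[2] timeEngines(string text, uint iterations)
{
    auto rt = regex(pattern);
    auto ct = ctRegex!(pattern);
    size_t hits; // side effect so the searches do real, observable work

    void useRt() { if (!match(text, rt).empty) ++hits; }
    void useCt() { if (!match(text, ct).empty) ++hits; }

    return benchmark!(useRt, useCt)(iterations);
}
```

Printing the result of timeEngines(sampleText, 100_000) for representative input shows whether ctRegex is worth the longer build for a given pattern.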

-- 
Dmitry Olshansky

Dec 06 2012

"Chris" <wendlec tcd.ie> writes:

Thanks a lot. That's very useful information. I will follow the 
rules Roberto Ierusalimschy mentions:

"In Lua, as in any other programming language, we should always 
follow the two maxims of program optimization:

Dec 06 2012
