digitalmars.D - ctRegex vs. Regex vs. plain string
- Chris (11/11) Dec 06 2012 I have updated my code (finally!) to 2.060. As my project deals a
- Dmitry Olshansky (43/56) Dec 06 2012 At first I was confused by "make extensive use of the std.regex" and
- Chris (8/70) Dec 06 2012 Thanks a lot. That's very useful information. I will follow the
I have updated my code (finally!) to 2.060. As my project deals a lot with text processing including loads of special characters (á, ú etc.), I make extensive use of the std.regex module (and I really appreciate the use of the Thompson NFA). To optimize my program I have experimented with ctRegex / StaticRegex and Regex. However, there are still compile time problems with Regex and StaticRegex which is why I am using plain strings at the moment, which work fine with the same regular expressions. Are there any precautions I have to take when using compile time regular expressions? Does anyone have any experience as regards performance enhancement?
Dec 06 2012
12/6/2012 7:21 PM, Chris wrote:
> I have updated my code (finally!) to 2.060.

Congrats!

> As my project deals a lot with text processing including loads of special characters (á, ú etc.), I make extensive use of the std.regex module (and I really appreciate the use of the Thompson NFA). To optimize my program I have experimented with ctRegex / StaticRegex and Regex. However, there are still compile time problems with Regex and StaticRegex which is why I am using plain strings at the moment, which work fine with the same regular expressions.

At first I was confused by "make extensive use of the std.regex" and "using plain strings". But then I recalled the problematic "bug" in how the compiler treats globals. So if your code goes like this:

    // globals or statics
    auto re1 = regex(...);
    auto re2 = regex(...);
    // ...
    auto reK = regex(...);

    // and e.g. in main:
    void main() {
        ... use reX etc. ...
    }

then the long compilations are caused by the compiler doing constant folding on the re1-reK variables. This forces it to parse and compile these patterns at compile time. While it's cute and looks like a minor optimization, it can make compile times monstrous, especially as it just produces the same normal pattern that the run-time (R-T) regex uses. The way out is to keep compiled patterns on the stack or to initialize them inside of static this.

As for using strings as patterns: std.regex does compile them internally and caches the last 8 of them. In other words, scripts and programs that use only a few patterns should be fine with plain strings. It doesn't slow things down considerably even in a tight loop. But once you have about 10 or more commonly used patterns, precompiling them is the better option.

> Are there any precautions I have to take when using compile time regular expressions?

One precaution is to use ctRegex only when things are well tested and you are ready to go for that extra speed. It typically takes a lot of time and RAM to get it to compile. Then again, testing that the results do match is recommended. Simply because of the pressure it puts on the compiler, ctRegex is not that well tested (it goes through only a couple of tests in the Phobos unittests), unlike the regular one.

> Does anyone have any experience as regards performance enhancement?

You tell me ;) As a matter of fact I collect problematic or frequent patterns; I guess I need to advertise that somewhere. Seriously, it depends on the patterns and the data. I'd expect it to be about 20-50% faster. But there are even cases where it may slow things down (the compile-time (C-T) backend is not as sophisticated as the primary R-T one... something to improve with time).

-- Dmitry Olshansky
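For readers who want to try the two suggestions above, here is a minimal sketch, not taken from the thread. It assumes a recent Phobos in which `matchAll` is available (on older releases, `match` with the "g" flag plays the same role). The names `accentPattern`, `accentRe` and `allHits` are made up for illustration. The module-level pattern is declared without an initializer and filled in inside `static this`, so the compiler has nothing to evaluate at compile time, and the ctRegex result is compared against the run-time engine before it is trusted.

```d
import std.regex;
import std.stdio;

// Hypothetical pattern, chosen only for illustration.
enum accentPattern = `[áéíóú]`;

// Module-level pattern declared WITHOUT an initializer, so there is
// no regex(...) call for the compiler to evaluate at compile time.
Regex!char accentRe;

static this()
{
    // Runs at program start-up (once per thread); the pattern is
    // parsed and compiled at run time here.
    accentRe = regex(accentPattern);
}

// Collects every match of `re` in `input`. `re` may be a run-time
// Regex, a ctRegex result, or a plain pattern string.
string[] allHits(R)(string input, R re)
{
    string[] hits;
    foreach (m; matchAll(input, re))
        hits ~= m.hit;
    return hits;
}

void main()
{
    // Compile-time engine: costs extra compilation time and memory.
    auto ctRe = ctRegex!(accentPattern);

    string sample = "Dún Laoghaire, áit álainn";

    auto rtHits  = allHits(sample, accentRe);      // precompiled, run-time engine
    auto ctHits  = allHits(sample, ctRe);          // compile-time engine
    auto strHits = allHits(sample, accentPattern); // plain string pattern

    // Check that all three ways of matching agree before relying on ctRegex.
    assert(rtHits == ctHits);
    assert(rtHits == strHits);

    writeln(rtHits);
}
```

A check like the one in `main` could just as well live in a `unittest` block, so the agreement between engines is re-verified whenever the patterns change.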
Dec 06 2012
On Thursday, 6 December 2012 at 16:00:11 UTC, Dmitry Olshansky wrote:
> [...]

Thanks a lot. That's very useful information. I will follow the rules Roberto Ierusalimschy mentions: "In Lua, as in any other programming language, we should always follow the two maxims of program optimization:
Dec 06 2012