www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Regex matching cause lots of _d_arrayliteralTX calls

reply "JR" <zorael gmail.com> writes:
I'm working on a toy IRC bot. Much of the logic involved is 
translating the incoming raw IRC string into something that makes 
sense (so now I have two problems, etc). But I managed to cook up 
a regex that so far seems to work well. Time for callgrind!

Grouped by source file, most time is spent in regex.d (as would 
seem natural) but more time is spent in gc.d than I would have 
expected. Looking at the callgraph I see that there's a curious 
amount of calls to _d_arrayliteralTX from (around) where the 
regex matching is done. (There's some inlining going on.)

Example:   http://dpaste.dzfl.pl/3932a231 (needs dmd head)

Callgraph: http://i.imgur.com/AZEutCE.png

TL;DR: 67 regex matches are done in that example snippet, on real 
but (hopefully) anonymized raw irc strings; _d_arrayliteralTX 
sees 800+ calls.

Is this working as expected? Or am I doing it wrong?


(In one of the instances currently running the figures are 2_162 
and 111_708, or a ratio of not quite 1:52.)
Sep 26 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
JR:

 Is this working as expected? Or am I doing it wrong?
I am not sure how a IRC bot could consume more than a tiny fraction of the CPU time of a modern multi-GHz processor. And I am not sure if regular expressions are a good idea to implement a IRC interface. But the author of the current regex module is probably interested in your profiling of his module. Bye, bearophile
Sep 26 2013
parent reply "JR" <zorael gmail.com> writes:
On Thursday, 26 September 2013 at 23:04:22 UTC, bearophile wrote:
 I am not sure how a IRC bot could consume more than a tiny 
 fraction of the CPU time of a modern multi-GHz processor.
Nor does it bite into my 8 gigabytes of ram. Forgive me, but the main culprit in all of this is still me doing it wrong. Can I keep the same RegexMatcher (perhaps as a struct member) and reuse it between matchings?
 And I am not sure if regular expressions are a good idea to 
 implement a IRC interface.
I dare say I disagree! As input you get strings which all contain two to ~five nuggets of information, where your only two immediately clues are that the first character will show if it's an off case (namely PING and ERROR), and that the second field as delimeted by whitespace -- ^[^ ]+ ([^ ]+).* -- will be the type/action of the event. What comes next is highly type-specific and with seemingly ad-hoc placement at times, but can be made sense of either with some regex, or in an std.algorithmic slicefest with extra countUntils for all.
Sep 26 2013
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Sep 27, 2013 at 01:51:51AM +0200, JR wrote:
 On Thursday, 26 September 2013 at 23:04:22 UTC, bearophile wrote:
I am not sure how a IRC bot could consume more than a tiny
fraction of the CPU time of a modern multi-GHz processor.
Nor does it bite into my 8 gigabytes of ram. Forgive me, but the main culprit in all of this is still me doing it wrong. Can I keep the same RegexMatcher (perhaps as a struct member) and reuse it between matchings?
Not sure what you mean, but are you compiling regexes every time you use them? If so, you should be storing them instead, for example: // Place to store precompiled regex matchers. struct MyContext { Regex!char pattern1; Regex!char pattern2; Regex!char pattern3; ... } // Presumably, this will run far more frequently than // updatePatterns. auto runMatches(MyContext ctxt, string message) { if (message.match(ctxt.pattern1)) { ... } else if (message.match(ctxt.pattern2)) { ... } ... } // Presumably, this only runs once in a while, so you save on // the cost of compiling/storing the regex every single time you // run a match. void updatePatterns(ref MyContext ctxt, string newPattern1, string newPattern2, string newPattern3, ...) { ctxt.pattern1 = regex(newPattern1); ctxt.pattern2 = regex(newPattern2); ctxt.pattern3 = regex(newPattern3); ... } So when you need to update your regexes, say based on reloading a config file or something, you'd run updatePatterns() to compile all the patterns, then runMatches() can be used during the normal course of your program. This should save on a lot of overhead. Of course, if you have regexes that are fixed at compile-time, you could use ctRegex to *really* speed things up. Or, if that makes your compilation too slow (it does have a tendency of doing that), initialize your patterns in a static this() block: Regex!char predeterminedPattern1; Regex!char predeterminedPattern2; static this() { predeterminedPattern1 = regex(`...`); predeterminedPattern2 = regex(`...`); } ... void matchStuff(string message) { if (message.match(preterminedPattern1)) { ... } ... }
And I am not sure if regular expressions are a good idea to
implement a IRC interface.
I dare say I disagree!
Yeah, anything involving heavy string processing is probably best done using regexes rather than ad hoc string slicing, which is bug-prone and hard to maintain. T -- First Rule of History: History doesn't repeat itself -- historians merely repeat each other.
Sep 26 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Sep 27, 2013 at 12:00:45AM +0200, JR wrote:
 I'm working on a toy IRC bot. Much of the logic involved is
 translating the incoming raw IRC string into something that makes
 sense (so now I have two problems, etc). But I managed to cook up a
 regex that so far seems to work well. Time for callgrind!
 
 Grouped by source file, most time is spent in regex.d (as would seem
 natural) but more time is spent in gc.d than I would have expected.
 Looking at the callgraph I see that there's a curious amount of
 calls to _d_arrayliteralTX from (around) where the regex matching is
 done. (There's some inlining going on.)
 
 Example:   http://dpaste.dzfl.pl/3932a231 (needs dmd head)
 
 Callgraph: http://i.imgur.com/AZEutCE.png
Actually, nevermind what I said in the last post. Obviously you're already using ctRegex. The problem is in this code: scope fields = raw.match(ircRegexPattern).front;
 TL;DR: 67 regex matches are done in that example snippet, on real
 but (hopefully) anonymized raw irc strings; _d_arrayliteralTX sees
 800+ calls.
Herein lies the hint: the exact number of calls (as seen from your callgraph) is 804, and 804 / 67 = 12 (exactly). This means that there are precisely 12 calls to _d_arrayliteralTX per regex match. So that leads to the question of why this is happening. I don't know the answer, but does it help if you don't call .front on the match object? I.e., try this: auto m = raw.match(ircRegexPattern); auto c = m.captures; // c now contains the captured fields, for example, c[1] returns // matching text for the first pair of parentheses, c[2] returns // the matching text for the second pair, etc.. c[0] returns the // entire match (uninteresting in your case). T -- Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
Sep 26 2013
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Sep-2013 02:00, JR пишет:
 I'm working on a toy IRC bot. Much of the logic involved is translating
 the incoming raw IRC string into something that makes sense (so now I
 have two problems, etc). But I managed to cook up a regex that so far
 seems to work well. Time for callgrind!

 Grouped by source file, most time is spent in regex.d (as would seem
 natural) but more time is spent in gc.d than I would have expected.
 Looking at the callgraph I see that there's a curious amount of calls to
 _d_arrayliteralTX from (around) where the regex matching is done.
 (There's some inlining going on.)

 Example:   http://dpaste.dzfl.pl/3932a231 (needs dmd head)
And the answer is - don't use ENUM with ctRegex. The problem is that ctRegex returns you a pack of datastructures (=arrays). Using them with enum makes it behave as if you pasted them as array literals and these do allocate each time. TL;DR: use static and/or auto with ctRegex not enum. -- Dmitry Olshansky
Sep 27 2013
parent reply "JR" <zorael gmail.com> writes:
On Friday, 27 September 2013 at 14:37:05 UTC, Dmitry Olshansky 
wrote:
 27-Sep-2013 02:00, JR пишет:
 And the answer is - don't use ENUM with ctRegex.
 The problem is that ctRegex returns you a pack of 
 datastructures (=arrays).
 Using them with enum makes it behave as if you pasted them as 
 array literals and these do allocate each time.

 TL;DR: use static and/or auto with ctRegex not enum.
That fixed it. Thank you both for your help! (I was of the notion that that enum merely translate to compile-time-evaluated constants.)
Sep 27 2013
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Sep 27, 2013 at 04:52:20PM +0200, JR wrote:
 On Friday, 27 September 2013 at 14:37:05 UTC, Dmitry Olshansky
 wrote:
27-Sep-2013 02:00, JR пишет:
And the answer is - don't use ENUM with ctRegex.  The problem is that
ctRegex returns you a pack of datastructures (=arrays).  Using them
with enum makes it behave as if you pasted them as array literals and
these do allocate each time.

TL;DR: use static and/or auto with ctRegex not enum.
That fixed it. Thank you both for your help! (I was of the notion that that enum merely translate to compile-time-evaluated constants.)
It does do that, yes. But the semantics aren't *quite* what one might expect (I was a victim of this misconception too :-P). What it does is to, in effect, "copy-n-paste" the value of the enum into whatever code uses it, every time. That means that if you write: enum x = "abc"; void fun() { auto y = x; } void gun() { auto z = x; } It will *copy* "abc" into y and z (rather than just aliasing the same underlying string), *each time* you call fun() and gun(). TL;DR: use immutable instead of enum for array-based constants: immutable x = "abc"; void fun() { auto y = x; } // now y is just an alias of x void gun() { auto z = x; } // now z is just an alias of x T -- Debian GNU/Linux: Cray on your desktop.
Sep 27 2013
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 27 September 2013 at 15:22:14 UTC, H. S. Teoh wrote:
 On Fri, Sep 27, 2013 at 04:52:20PM +0200, JR wrote:
 On Friday, 27 September 2013 at 14:37:05 UTC, Dmitry Olshansky
 wrote:
27-Sep-2013 02:00, JR пишет:
And the answer is - don't use ENUM with ctRegex.  The problem 
is that
ctRegex returns you a pack of datastructures (=arrays).  
Using them
with enum makes it behave as if you pasted them as array 
literals and
these do allocate each time.

TL;DR: use static and/or auto with ctRegex not enum.
That fixed it. Thank you both for your help! (I was of the notion that that enum merely translate to compile-time-evaluated constants.)
It does do that, yes. But the semantics aren't *quite* what one might expect (I was a victim of this misconception too :-P). What it does is to, in effect, "copy-n-paste" the value of the enum into whatever code uses it, every time. That means that if you write: enum x = "abc"; void fun() { auto y = x; } void gun() { auto z = x; } It will *copy* "abc" into y and z (rather than just aliasing the same underlying string), *each time* you call fun() and gun().
I think strings are an actual exception to this rule, what with being immutable and all (provided you place the info in a string of course "char[] ss = s" will allocate of course). All other types, even immutable (AFAIK) will allocate though, yes.
Sep 27 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
27-Sep-2013 18:52, JR пишет:
 (I was of the notion that that enum merely translate to
 compile-time-evaluated constants.)
I've recently investigated similar problem w.r.t. performance of std.regex on particular use-case/pattern. In fact it dig up an interesting bug with inlining in the compiler along the way. It took about 5 minutes to test that enum used instead of static devastates performance but quite some time to figure out the actual "infection chain". And indeed suspicious collections/d_array_literal is what I've seen in what supposed to be GC-free loop (callgrind is awesome). -- Dmitry Olshansky
Sep 27 2013