digitalmars.D.bugs - [Issue 7471] New: Improve performance of std.regex

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7471

           Summary: Improve performance of std.regex
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: Jesse.K.Phillips+D gmail.com



09:27:58 PST ---
The previous implementation is said to do some caching of the last used engine.
english.dic is 134,950 entries for these timings.

Test code
----------
import std.file;
import std.string;
import std.datetime;
import std.regex;

private int[string] model;

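void main() {
   auto name = "english.dic";
   // Count each lower-cased dictionary line.
   foreach(w; std.file.readText(name).toLower.splitLines)
      model[w] += 1;

   // The regexes are deliberately constructed on every iteration to
   // sidestep the single-entry cache.
   foreach(w; std.string.split(readText(name)))
      if(!match(w, regex(r"\d")).empty)
      {}
      else if(!match(w, regex(r"\W")).empty)
      {}
}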
-------

I'm trying to avoid the caching here, but still see better performance in
2.056. Note that these timings were taken under MinGW on Windows. I find it
odd that user time is small while real time is where the slowdown shows up;
does MinGW have access to the proper information?

$ time ./test2.056.exe

real    0m0.860s
user    0m0.047s
sys     0m0.000s

$ time ./test2.058.exe

real    0m55.500s
user    0m0.031s
sys     0m0.000s
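
Since the user/real split reported by the shell is in doubt here, one way to cross-check is to measure wall-clock time inside the program itself. The following is only a sketch, assuming a recent Phobos that ships std.datetime.stopwatch; the function name timeMatchingMsecs and the three-word input are placeholders, not part of the original test.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.regex;
import std.stdio;

// Hypothetical helper: wall-clock milliseconds spent matching `words`,
// constructing the regex on every iteration as the original test does.
long timeMatchingMsecs(string[] words)
{
    auto sw = StopWatch(AutoStart.yes);
    size_t hits;
    foreach (w; words)
        if (!match(w, regex(r"\d")).empty)
            ++hits;
    sw.stop();
    return sw.peek.total!"msecs";
}

void main()
{
    // Placeholder input; substitute the dictionary words for a real run.
    writefln("elapsed: %s ms", timeMatchingMsecs(["apple", "b4", "can't"]));
}
```

The figure printed this way is comparable to the "real" column above, independent of what the shell's time builtin can see.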

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 09 2012
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7471


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh gmail.com



11:14:52 PST ---
I'm willing to investigate the issue. Can you attach english.dic file?

Feb 24 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7471


dawg dawgfoto.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dawg dawgfoto.de



You are compiling two different regexes. So a single entry cache will only
solve part of your problem.
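
To illustrate the point, a cache keyed by the pattern string would cover both regexes. This is a minimal user-side sketch, not how std.regex caches internally; cachedRegex is a hypothetical helper name.

```d
import std.regex;

// Hypothetical helper (not part of std.regex): memoize compiled patterns
// by their source string so each distinct pattern is compiled only once.
Regex!char cachedRegex(string pattern)
{
    static Regex!char[string] cache;   // thread-local associative array
    if (auto hit = pattern in cache)
        return *hit;
    auto compiled = regex(pattern);
    cache[pattern] = compiled;
    return compiled;
}

void main()
{
    foreach (w; ["abc", "a1", "a-b"])
    {
        // Both patterns are compiled on first use and looked up afterwards.
        if (!match(w, cachedRegex(r"\d")).empty) {}
        else if (!match(w, cachedRegex(r"\W")).empty) {}
    }
}
```

Each call still pays for a hash lookup and a copy of the compiled regex, so this reduces the cost of construction without removing it.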

Feb 24 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7471


Jesse Phillips <Jesse.K.Phillips+D gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Jesse.K.Phillips+D gmail.com



18:02:05 PST ---
The exact file isn't important, and I can't get to it right now, but you could
grab a similar one from http://www.winedt.org/Dict/

I realize that the example given sidesteps the benefit of a single-entry cache,
but 2.056 still performs better on it, and that is probably worth working
towards.

Feb 24 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7471




02:22:02 PST ---
Profiling shows that about 99% of the time is spent in the GC, ouch.
What's at work here is that the new regex engine is more costly to create and
allocates a bunch of structures on the heap. The biggest of them, e.g. the
Tries, are cached, but the others are not.
I think I'll spend some time introducing more caching and will probably seek
out some GC-unfriendly stuff in the parser.
Still, I should point out that \d and \W in the new engine are Unicode-aware
and correspond to MUCH broader character classes than in the previous engine.
(That belongs in the ddocs somewhere.)
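
A small sketch of the difference, assuming the Unicode-aware behaviour described above: \d is expected to accept a non-ASCII decimal digit, while an explicit [0-9] range is not. The sample character is my own choice for illustration.

```d
import std.regex;
import std.stdio;

void main()
{
    auto unicodeDigit = regex(r"\d");     // Unicode-aware digit class
    auto asciiDigit   = regex(r"[0-9]");  // explicit ASCII-only range

    string sample = "\u0663";  // ARABIC-INDIC DIGIT THREE

    // Print whether each pattern finds a match in the sample.
    writeln(`\d matches:    `, !matchFirst(sample, unicodeDigit).empty);
    writeln(`[0-9] matches: `, !matchFirst(sample, asciiDigit).empty);
}
```

If only ASCII digits matter, as with an English word list, spelling the range out keeps the character class small.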

Feb 26 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7471




06:32:30 PST ---
Anyway, how do 2.056 and 2.058 compare when you don't create regex objects
inside a tight loop?
It is a strange thing to do under any circumstances; even with N-slot caching
you pay some extra on each iteration to look up and copy out the compiled regex
you need.

I'm dreaming that one day the compiler will just see that it's a loop
invariant and move it out for you.
Hm.. that could happen sometime soon: if 'regex' is pure and its result is
immutable, the compiler would have the guarantees it needs to go ahead and
optimize.
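
Until the compiler can do that, the hoisting has to be done by hand. A minimal sketch of the test loop with both regexes compiled once, outside the loop; countFlagged and the sample words are illustrative names and data, not from the original test:

```d
import std.regex;
import std.stdio;

// Count words containing a digit or a non-word character.
size_t countFlagged(string[] words)
{
    // Compile each pattern once, before the loop.
    auto hasDigit = regex(r"\d");
    auto hasNonWord = regex(r"\W");

    size_t flagged;
    foreach (w; words)
    {
        if (!match(w, hasDigit).empty || !match(w, hasNonWord).empty)
            ++flagged;
    }
    return flagged;
}

void main()
{
    writeln(countFlagged(["apple", "b4", "can't"]));
}
```

With this layout the per-iteration cost is only the matching itself; no engine is constructed and nothing is looked up in a cache.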

Feb 26 2012
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7471




18:03:06 PST ---
After moving the regex outside the loop, along with (I think) some other
changes, performance improved immensely. Declaring them as module variables
didn't seem to gain anything more. I didn't have time to play with it much
further; it was acceptable, though I hope to do more with regex and will just
need to watch out for tight loops.

Feb 26 2012