digitalmars.D.bugs - [Issue 7471] New: Improve performance of std.regex
- d-bugmail puremagic.com (48/48) Feb 09 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7471
- d-bugmail puremagic.com (10/10) Feb 24 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7471
- d-bugmail puremagic.com (11/11) Feb 24 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7471
- d-bugmail puremagic.com (14/14) Feb 24 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7471
- d-bugmail puremagic.com (14/14) Feb 26 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7471
- d-bugmail puremagic.com (13/13) Feb 26 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7471
- d-bugmail puremagic.com (9/9) Feb 26 2012 http://d.puremagic.com/issues/show_bug.cgi?id=7471
http://d.puremagic.com/issues/show_bug.cgi?id=7471

           Summary: Improve performance of std.regex
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: Jesse.K.Phillips+D gmail.com

09:27:58 PST ---
The previous implementation is said to do some caching of the last used engine.
english.dic has 134,950 entries for these timings.

Test code
----------
import std.file;
import std.string;
import std.datetime;
import std.regex;

private int[string] model;

void main() {
    auto name = "english.dic";

    // Count occurrences of each lower-cased line of the dictionary.
    foreach(w; std.file.readText(name).toLower.splitLines)
        model[w] += 1;

    // regex() is called on every iteration, alternating between two
    // patterns, so a cache of the last used engine cannot help.
    foreach(w; std.string.split(readText(name)))
        if(!match(w, regex(r"\d")).empty) {}
        else if(!match(w, regex(r"\W")).empty) {}
}
-------

I'm trying to avoid the caching here, but still see better performance in 2.056.

Note that these timings were taken with mingw on Windows. I find it odd that
user time is actually fast while real time is the slow piece. Does mingw have
access to the proper information?

$ time ./test2.056.exe
real    0m0.860s
user    0m0.047s
sys     0m0.000s

$ time ./test2.058.exe
real    0m55.500s
user    0m0.031s
sys     0m0.000s

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Feb 09 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471

Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh gmail.com

11:14:52 PST ---
I'm willing to investigate the issue. Can you attach the english.dic file?
Feb 24 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471

dawg dawgfoto.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dawg dawgfoto.de

You are compiling two different regexes, so a single-entry cache will only
solve part of your problem.
Feb 24 2012
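To illustrate the point about two alternating patterns, here is a minimal sketch of a cache keyed by pattern text rather than a single last-used slot. The names `cachedRegex` and `compileCount` are hypothetical helpers made up for this example; they are not part of std.regex and not what Phobos does internally. The sketch also assumes a Phobos recent enough to provide `matchFirst`.

```d
import std.regex : Regex, regex, matchFirst;

/// Number of times a pattern was actually compiled (for demonstration only).
size_t compileCount;

/// Hypothetical helper, not part of std.regex: returns a compiled regex for
/// `pattern`, compiling it only the first time that pattern text is seen.
Regex!char cachedRegex(string pattern)
{
    // Function-level statics are thread-local in D, so no locking is needed.
    static Regex!char[string] cache;
    if (auto hit = pattern in cache)
        return *hit; // a hit still pays for the lookup and the copy
    ++compileCount;
    auto compiled = regex(pattern);
    cache[pattern] = compiled;
    return compiled;
}

void main()
{
    foreach (w; ["abc", "a1", "it's"])
    {
        if (!matchFirst(w, cachedRegex(r"\d")).empty) {}
        else if (!matchFirst(w, cachedRegex(r"\W")).empty) {}
    }
    // One compilation per distinct pattern, regardless of the word count.
    assert(compileCount == 2);
}
```

Even with such a cache, every iteration still hashes the pattern string and copies the compiled regex out, so it does not replace compiling the patterns once outside the loop.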
http://d.puremagic.com/issues/show_bug.cgi?id=7471

Jesse Phillips <Jesse.K.Phillips+D gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Jesse.K.Phillips+D gmail.com

18:02:05 PST ---
The exact file isn't important, and I can't get it right now. But you could
grab a similar one from http://www.winedt.org/Dict/

I realize that the example given avoids the benefit of single-entry caching,
but 2.056 does perform better on it, and that is probably something to work
towards.
Feb 24 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471

02:22:02 PST ---
Profiling shows that about 99% of the time is spent in the GC, ouch.
What's at work here is that the new regex engine is more costly to create and
allocates a bunch of structures on the heap. The biggest of them, e.g. Tries,
are cached, but others are not.

I think I'll spend some time on introducing more caching and probably seek out
some GC-unfriendly stuff in the parser.

Still, I should point out that \d and \W in the new engine are Unicode aware
and correspond to MUCH broader character classes than in the previous engine.
(That belongs in the ddocs somewhere.)
Feb 26 2012
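As a small illustration of the Unicode-awareness remark, the sketch below compares `\d` and `\W` with explicit ASCII-only classes. It assumes a current Phobos (with `matchFirst`) in which `\d` covers Unicode decimal digits and `\w` covers Unicode word characters, as the std.regex documentation describes; the helper `matches` is a name invented for this example.

```d
import std.regex : regex, matchFirst;

/// Hypothetical helper: true when `pattern` matches somewhere in `text`.
bool matches(string text, string pattern)
{
    return !matchFirst(text, regex(pattern)).empty;
}

void main()
{
    enum arabicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE
    enum eAcute = "\u00E9";      // LATIN SMALL LETTER E WITH ACUTE

    // \d accepts a non-ASCII decimal digit; [0-9] does not.
    assert(matches(arabicThree, r"\d"));
    assert(!matches(arabicThree, r"[0-9]"));

    // \W does not treat an accented letter as a non-word character,
    // while an ASCII-only negated class does.
    assert(!matches(eAcute, r"\W"));
    assert(matches(eAcute, r"[^A-Za-z0-9_]"));
}
```

If only ASCII input matters, spelling the class out (`[0-9]`, `[^A-Za-z0-9_]`) states that intent directly; whether it is also faster would need measuring.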
http://d.puremagic.com/issues/show_bug.cgi?id=7471

06:32:30 PST ---
Anyway, how do 2.056 and 2.058 compare when you don't create regex objects
inside a tight loop? It is a strange thing to do under any circumstances: even
with N-slot caching you pay some extra on each iteration to look up and copy
out the compiled regex needed.

I'm dreaming that one day the compiler can just see it's a loop invariant and
move it out for you. Hm.. that could happen sometime soon: if 'regex' is pure
and its result is immutable, the compiler would have the guarantees it needs
to go ahead and optimize.
Feb 26 2012
http://d.puremagic.com/issues/show_bug.cgi?id=7471

18:03:06 PST ---
Moving the regex outside the loop, along with (I think) some other changes,
helped immensely. Declaring them as module variables didn't seem to gain
anything more.

I didn't have time to play with it much further; it was exceptionable, though
I hope to do more with regex and just need to watch out for tight loops.
Feb 26 2012
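The change discussed in the last two posts, compiling each pattern once before the loop, can be sketched roughly as follows. This is not the reporter's actual revised code, which was not posted; the function name `classify`, the in-memory word list, and the use of `matchFirst` (available in newer Phobos releases than the ones timed above) are assumptions for the example.

```d
import std.regex : regex, matchFirst;

/// Hypothetical example function: returns how many words contain a digit
/// (index 0) and how many of the remaining words contain a non-word
/// character (index 1).
size_t[2] classify(string[] words)
{
    // Both patterns are compiled once, before the loop.
    auto hasDigit = regex(r"\d");
    auto hasNonWord = regex(r"\W");

    size_t[2] counts;
    foreach (w; words)
    {
        if (!matchFirst(w, hasDigit).empty)
            ++counts[0];
        else if (!matchFirst(w, hasNonWord).empty)
            ++counts[1];
    }
    return counts;
}

void main()
{
    auto counts = classify(["hello", "abc1", "it's", "x2-y"]);
    assert(counts[0] == 2); // "abc1" and "x2-y" contain a digit
    assert(counts[1] == 1); // "it's" contains an apostrophe
}
```

With this shape the loop body only runs the matcher, so the per-iteration allocations made while constructing the engine no longer occur.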