digitalmars.D.bugs - [Issue 17820] New: std.regex performance: .matchFirst allocates
- via Digitalmars-d-bugs (100/100) Sep 09 2017 https://issues.dlang.org/show_bug.cgi?id=17820
https://issues.dlang.org/show_bug.cgi?id=17820 Issue ID: 17820 Summary: std.regex performance: .matchFirst allocates frequently; causes thread contention Product: D Version: D2 Hardware: x86_64 OS: Windows Status: NEW Severity: enhancement Priority: P1 Component: phobos Assignee: nobody puremagic.com Reporter: chadjoan gmail.com Created attachment 1657 --> https://issues.dlang.org/attachment.cgi?id=1657&action=edit Regex .matchFirst benchmark When a sufficiently complicated regular expression is used to analyze a large number of small strings, the memory allocation done by .matchFirst (and possibly other similar regex functions) causes punishing slowdown and thread contention. This happens even though the same Regex object (per thread) is reused for .matchFirst(). I have attached a test program which is also available on pastebin: https://pastebin.com/Am7BgYkL I strongly suspect this is caused by the call to malloc in .matchOnce in std.regex.package, as well as the call to calloc in Capture.newMatches also in std.regex.package. I was able to work around it by modifying Phobos and statically allocating the memory used by .matchOnce, which provided something like a 100x speedup on that project (with serial processing, nevermind that it made threading possible too). I was never going to call .matchFirst (or anything that generates captures) while still iterating over previous captures, so the workaround was fine for that project, but not acceptable for general use of course. Given that this slowdown was related to malloc, the exact ramifications and severity will change significantly depending on the host system's malloc implementation. It was really bad on Windows. Linux seems to fare much better. I suspect that Windows' malloc does a thread-global lock during allocation and then takes its sweet time, which results in some parallelized code actually running *slower* than the serial equivalent. And since this doesn't seem to happen to ALL regexes where the Captures (n > smallString) condition is met, I have to wonder if the allocation size, or some other yet undetermined factor, might change the behavior of Windows' malloc/calloc. This could make further testing difficult. Here is my run of the test program on a Windows 7 machine with dmd v2.076.0: -------------------------------------------------------- C:\dprojects\regex\benchmark>dmd main.d -release -O -m64 C:\dprojects\regex\benchmark>main Benchmarking non-regex parsing, parallel: processed 100000 elements in 0.061 seconds. Benchmarking regex parsing, no captures, parallel: processed 100000 elements in 202.893 seconds. Benchmarking regex parsing, many captures, parallel: processed 100000 elements in 977.339 seconds. Benchmarking non-regex parsing, serial: processed 100000 elements in 0.031 seconds. Benchmarking regex parsing, no captures, serial: processed 100000 elements in 202.395 seconds. Benchmarking regex parsing, many captures, serial: processed 100000 elements in 837.401 seconds. -------------------------------------------------------- Here it is again on a slightly slower (hardware-wise) Linux (Gentoo) machine with dmd v2.074.1: -------------------------------------------------------- $ dmd main.d -release -O -m64 $ ./main Benchmarking non-regex parsing, parallel: processed 100000 elements in 0.062 seconds. Benchmarking regex parsing, no captures, parallel: processed 100000 elements in 88.055 seconds. Benchmarking regex parsing, many captures, parallel: processed 100000 elements in 93.546 seconds. Benchmarking non-regex parsing, serial: processed 100000 elements in 0.043 seconds. Benchmarking regex parsing, no captures, serial: processed 100000 elements in 135.063 seconds. Benchmarking regex parsing, many captures, serial: processed 100000 elements in 135.250 seconds. -------------------------------------------------------- I noticed that my benchmark's non-regex numbers don't parallelize well. I'm not sure what's going on there. Maybe there are some hidden allocations happening? Either way, they only exist to provide a sense of scale. The more important numbers are the regex parsing benchmarks. When I used a different regular expression earlier, it *did* parallelize somewhat, so I believe this benchmark does represent what I experienced in production (in other words: to my knowledge the benchmark itself isn't causing any slowdowns, allocations, or thread contention). I have already begun working on a fix for this issue. I am not sure what to do with it now that Dmitry is kicking butt on regex bugs (https://forum.dlang.org/post/jshuunlmctbzniqvceru forum.dlang.org), but I'm at least glad I was able to make this bug report and keep others informed. If Dmitry doesn't feel like taking this one on, then I suppose I will have to reintegrate my WIP changes later and hopefully benefit from the cleaner future code. I am worried that my WIP code uses std.experimental.allocator: I feel like freelists are a very elegant solution to this problem; I'm not sure if std.experimental is unused in the rest of Phobos by chance or by decree; and reimplementing a freelist just to avoid using the experimental modules seems like a bad idea to me. I'm not sure how others feel. You can find my WIP fix here: https://github.com/chadjoan/phobos/tree/regex_allocator_optimization --
Sep 09 2017