digitalmars.D.bugs - [Issue 17820] New: std.regex performance: .matchFirst allocates

via Digitalmars-d-bugs (100/100) Sep 09 2017 https://issues.dlang.org/show_bug.cgi?id=17820
via Digitalmars-d-bugs <digitalmars-d-bugs puremagic.com> writes:
https://issues.dlang.org/show_bug.cgi?id=17820

          Issue ID: 17820
           Summary: std.regex performance: .matchFirst allocates
                    frequently; causes thread contention
           Product: D
           Version: D2
          Hardware: x86_64
                OS: Windows
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: phobos
          Assignee: nobody puremagic.com
          Reporter: chadjoan gmail.com

Created attachment 1657
  --> https://issues.dlang.org/attachment.cgi?id=1657&action=edit
Regex .matchFirst benchmark

When a sufficiently complicated regular expression is used to analyze a large
number of small strings, the memory allocation done by .matchFirst (and
possibly other similar regex functions) causes punishing slowdown and thread
contention.  This happens even though the same Regex object (per thread) is
reused for .matchFirst().

I have attached a test program which is also available on pastebin:
https://pastebin.com/Am7BgYkL

I strongly suspect this is caused by the call to malloc in .matchOnce in
std.regex.package, as well as the call to calloc in Capture.newMatches also in
std.regex.package.  I was able to work around it by modifying Phobos and
statically allocating the memory used by .matchOnce, which provided something
like a 100x speedup on that project (with serial processing, nevermind that it
made threading possible too).  I was never going to call .matchFirst (or
anything that generates captures) while still iterating over previous captures,
so the workaround was fine for that project, but not acceptable for general use
of course.

Given that this slowdown was related to malloc, the exact ramifications and
severity will change significantly depending on the host system's malloc
implementation.  It was really bad on Windows.  Linux seems to fare much
better.  I suspect that Windows' malloc does a thread-global lock during
allocation and then takes its sweet time, which results in some parallelized
code actually running *slower* than the serial equivalent.  And since this
doesn't seem to happen to ALL regexes where the Captures (n > smallString)
condition is met, I have to wonder if the allocation size, or some other yet
undetermined factor, might change the behavior of Windows' malloc/calloc.  This
could make further testing difficult.

Here is my run of the test program on a Windows 7 machine with dmd v2.076.0:
--------------------------------------------------------
C:\dprojects\regex\benchmark>dmd main.d -release -O -m64

C:\dprojects\regex\benchmark>main
Benchmarking non-regex parsing, parallel:                 processed 100000
elements in 0.061 seconds.
Benchmarking regex parsing, no captures, parallel:        processed 100000
elements in 202.893 seconds.
Benchmarking regex parsing, many captures, parallel:      processed 100000
elements in 977.339 seconds.
Benchmarking non-regex parsing, serial:                   processed 100000
elements in 0.031 seconds.
Benchmarking regex parsing, no captures, serial:          processed 100000
elements in 202.395 seconds.
Benchmarking regex parsing, many captures, serial:        processed 100000
elements in 837.401 seconds.
--------------------------------------------------------

Here it is again on a slightly slower (hardware-wise) Linux (Gentoo) machine
with dmd v2.074.1:
--------------------------------------------------------
$ dmd main.d -release -O -m64
$ ./main
Benchmarking non-regex parsing, parallel:                 processed 100000
elements in 0.062 seconds.
Benchmarking regex parsing, no captures, parallel:        processed 100000
elements in 88.055 seconds.
Benchmarking regex parsing, many captures, parallel:      processed 100000
elements in 93.546 seconds.
Benchmarking non-regex parsing, serial:                   processed 100000
elements in 0.043 seconds.
Benchmarking regex parsing, no captures, serial:          processed 100000
elements in 135.063 seconds.
Benchmarking regex parsing, many captures, serial:        processed 100000
elements in 135.250 seconds.
--------------------------------------------------------

I noticed that my benchmark's non-regex numbers don't parallelize well.  I'm
not sure what's going on there.  Maybe there are some hidden allocations
happening?  Either way, they only exist to provide a sense of scale.  The more
important numbers are the regex parsing benchmarks.  When I used a different
regular expression earlier, it *did* parallelize somewhat, so I believe this
benchmark does represent what I experienced in production (in other words: to
my knowledge the benchmark itself isn't causing any slowdowns, allocations, or
thread contention).

I have already begun working on a fix for this issue.  I am not sure what to do
with it now that Dmitry is kicking butt on regex bugs
(https://forum.dlang.org/post/jshuunlmctbzniqvceru forum.dlang.org), but I'm at
least glad I was able to make this bug report and keep others informed.  If
Dmitry doesn't feel like taking this one on, then I suppose I will have to
reintegrate my WIP changes later and hopefully benefit from the cleaner future
code.  I am worried that my WIP code uses std.experimental.allocator: I feel
like freelists are a very elegant solution to this problem; I'm not sure if
std.experimental is unused in the rest of Phobos by chance or by decree; and
reimplementing a freelist just to avoid using the experimental modules seems
like a bad idea to me.  I'm not sure how others feel.

You can find my WIP fix here:
https://github.com/chadjoan/phobos/tree/regex_allocator_optimization

--
Sep 09 2017
D Programming

C/C++ Programming

Other

digitalmars.D.bugs - [Issue 17820] New: std.regex performance: .matchFirst allocates