digitalmars.D - Do we need faster regex?

Dmitry Olshansky (11/11) Dec 17 2023 So I’ve been working on rewind-regex trying to correct all of the

Witold Baryluk (7/18) Dec 17 2023 I never needed to use backreferences in last 25 years. Something

Dmitry Olshansky (8/29) Dec 17 2023 Yes, RE2 is the main contender, rewind-regex aims at basically

Anonymouse (9/10) Dec 18 2023 I really like regex as a thing, but I have had to drop it because

Richard (Rikki) Andrew Cattermole (6/13) Dec 18 2023 As long as you use std.regex at runtime this is no longer the case.

H. S. Teoh (37/45) Dec 18 2023 [...]

Richard (Rikki) Andrew Cattermole (26/26) Dec 18 2023 If you import std.regex and do nothing else:

H. S. Teoh (6/18) Dec 18 2023 Cool, awesome stuff!

Richard (Rikki) Andrew Cattermole (3/3) Dec 18 2023 Yeah basically std.regex is no longer the cause for importing std.regex

H. S. Teoh (10/14) Dec 18 2023 I haven't noticed too much horrible slowdown from std.conv, but std.uni

Richard (Rikki) Andrew Cattermole (9/9) Dec 18 2023 The tables are stored compressed.
Hipreme (15/29) Dec 18 2023 std.conv slowdown comes from `std.conv.to!float` specifically.
Dmitry Olshansky (9/23) Dec 20 2023 Far as I know there is not CTFE involved with the tables, but

BoQsc (17/17) Dec 26 2023 The focus should always be on slower but more maintainable and

Dmitry Olshansky (21/40) Dec 30 2023 Well we already have std.regex which is quite fast and feature

Dmitry Olshansky (10/61) Dec 18 2023 A runtime cache should work, btw std.regex caches regexes (at

H. S. Teoh (9/18) Dec 18 2023 Cool, didn't know that. :-)

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

So I’ve been working on rewind-regex trying to correct all of the 
decisions in the original engine that slowed it down, dropping 
some features that I knew I cannot implement efficiently 
(backreferences have to go).

So while I’m obsessed with simplicity and speed I thought I’d ask 
people if it was an issue and what they really want from gen2 
regex library.

—
Dmitry Olshansky
CEO   Glowlabs
https://olshansky.me

Dec 17 2023

Witold Baryluk <witold.baryluk gmail.com> writes:

On Sunday, 17 December 2023 at 15:43:22 UTC, Dmitry Olshansky 
wrote:
 So I’ve been working on rewind-regex trying to correct all of 
 the decisions in the original engine that slowed it down, 
 dropping some features that I knew I cannot implement 
 efficiently (backreferences have to go).

 So while I’m obsessed with simplicity and speed I thought I’d 
 ask people if it was an issue and what they really want from 
 gen2 regex library.

 —
 Dmitry Olshansky
 CEO   Glowlabs
 https://olshansky.me

I never needed to use backreferences in last 25 years. Something 
that has similar performance and scaling like RE2, Plan 9 grep, 
or original awk (modulo unicode), would be the best. I somebody 
needs backreferences, they can use PCRE, or some 3rd party 
libraries.

Dec 17 2023

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On Sunday, 17 December 2023 at 17:40:21 UTC, Witold Baryluk wrote:
 On Sunday, 17 December 2023 at 15:43:22 UTC, Dmitry Olshansky 
 wrote:
 So I’ve been working on rewind-regex trying to correct all of 
 the decisions in the original engine that slowed it down, 
 dropping some features that I knew I cannot implement 
 efficiently (backreferences have to go).

 So while I’m obsessed with simplicity and speed I thought I’d 
 ask people if it was an issue and what they really want from 
 gen2 regex library.

 —
 Dmitry Olshansky
 CEO   Glowlabs
 https://olshansky.me

 I never needed to use backreferences in last 25 years. 
 Something that has similar performance and scaling like RE2, 
 Plan 9 grep, or original awk (modulo unicode), would be the 
 best.

Yes, RE2 is the main contender, rewind-regex aims at basically 
the same feature set and better/same performance.

 I somebody needs backreferences, they can use PCRE, or some 3rd 
 party libraries.

Same thoughts here.

—
Dmitry Olshansky
CEO   Glowlabs
https://olshansky.me

Dec 17 2023

Anonymouse <zorael gmail.com> writes:

On Sunday, 17 December 2023 at 15:43:22 UTC, Dmitry Olshansky 
wrote:
 [...]

I really like regex as a thing, but I have had to drop it because 
of the increased compilation memory and time requirements that 
use of `std.regex` incurred. Maybe that can't be avoided, I don't 
know.

I have not had that much use for backreferences, so to me the 
important part is that it doesn't require me to have 300+ Mb more 
RAM and an extra second or two on the wall clock to compile.

Dec 18 2023

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 18/12/2023 10:20 PM, Anonymouse wrote:
 I really like regex as a thing, but I have had to drop it because of the 
 increased compilation memory and time requirements that use of 
 |std.regex| incurred. Maybe that can't be avoided, I don't know.
 
 I have not had that much use for backreferences, so to me the important 
 part is that it doesn't require me to have 300+ Mb more RAM and an extra 
 second or two on the wall clock to compile.

As long as you use std.regex at runtime this is no longer the case.

https://github.com/dlang/phobos/pull/8699

https://github.com/dlang/phobos/pull/8698

Looks like I messed up and not got it into the changelog.

Anyway 2.103 should be the release that have them in it.

Dec 18 2023

"H. S. Teoh" <hsteoh qfbox.info> writes:

On Sun, Dec 17, 2023 at 03:43:22PM +0000, Dmitry Olshansky via Digitalmars-d
wrote:
 So I’ve been working on rewind-regex trying to correct all of the
 decisions in the original engine that slowed it down, dropping some
 features that I knew I cannot implement efficiently (backreferences
 have to go).
 
 So while I’m obsessed with simplicity and speed I thought I’d ask
 people if it was an issue and what they really want from gen2 regex
 library.

[...]

What I really want:

- Reduce compile-time cost of `import std.regex;` to zero, or at least
  close enough it's no longer noticeable.

- Automatic caching of fixed-string regexes, i.e., the equivalent of:

	struct Re(string ctKnownRe) {
		Regex!char re;
		shared static this() {
			re = regex(ctKnownRe);
		}
		Regex!char Re() {
			return re;
		}
	}

	void main() {
		string s;
		if (s.matchFirst(Re!`some\+pattern`)) {
			...
		}

		// This should reuse the Regex instance from before:
		if (s.matchFirst(Re!`some\+pattern`)) {
			...
		}
	}

- Reasonably fast runtime performance. I don't really care if it's the
  top-of-the-line superfast regex matcher, even though that would be
  really nice.  The primary pain points are the cost of import, and the
  need to manually write code for automatic caching of fixed runtime
  regexen.

- Get rid of ctRegex -- it adds a huge compile-time cost with
  questionable runtime benefit. Unless there's a way to do this at
  compile-time that *doesn't* add like 5 seconds per regex to compile
  times.


T

-- 
That's not a bug; that's a feature!

Dec 18 2023

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

If you import std.regex and do nothing else:

sema1 61ms
sema2 101ms

std.internal.unicode_tables:
sema1 4ms
sema2 100ms

https://github.com/dlang/phobos/blob/master/std/internal/unicode_tables.d



Now for:

```d
auto ctr = ctRegex!(`^.*/([^/]+)/?$`);

// It works just like a normal regex:
auto c2 = matchFirst("foo/bar", ctr);   // First match found here, if any
assert(!c2.empty);   // Be sure to check if there is a match before 
examining contents!
assert(c2[1] == "bar");   // Captures is a range of submatches: 0 = full 
match.
```

toImpl:
sema3 125ms

parseSet:
Sema3 111ms

findAny (child of parseRegex and parseSet):

CTFE 51ms





While there are improvements to be had, I have already done all the big 
ones when I shaved off ~600ms earlier this year.

Gotta love time trace!

Dec 18 2023

"H. S. Teoh" <hsteoh qfbox.info> writes:

On Tue, Dec 19, 2023 at 06:34:04AM +1300, Richard (Rikki) Andrew Cattermole via
Digitalmars-d wrote:
 If you import std.regex and do nothing else:
 
 sema1 61ms
 sema2 101ms
 
 std.internal.unicode_tables:
 sema1 4ms
 sema2 100ms

[...]
 While there are improvements to be had, I have already done all the
 big ones when I shaved off ~600ms earlier this year.
 
 Gotta love time trace!

Cool, awesome stuff!


T

-- 
If blunt statements had a point, they wouldn't be blunt...

Dec 18 2023

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

Yeah basically std.regex is no longer the cause for importing std.regex 
slowdown.

Its stuff like std.conv and std.uni.

Dec 18 2023

"H. S. Teoh" <hsteoh qfbox.info> writes:

On Tue, Dec 19, 2023 at 06:47:00AM +1300, Richard (Rikki) Andrew Cattermole via
Digitalmars-d wrote:
 Yeah basically std.regex is no longer the cause for importing
 std.regex slowdown.
 
 Its stuff like std.conv and std.uni.

I haven't noticed too much horrible slowdown from std.conv, but std.uni
could use some fixing. I'm tempted to suggest that those internal tables
in std.uni should be pre-generated rather than done at compile-time.
There comes a point where repeatedly doing something at every compile
just isn't worth it when the desired output could be autogenerated
beforehand and saved as a straight .d file with hard-coded values.


T

-- 
Heads I win, tails you lose.

Dec 18 2023

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

The tables are stored compressed.

We do that because otherwise the binary size of Phobos would grow by 
multiple megabytes (think 10, not 1).

This is why when you trigger usage of regex at CT it slows down a whole 
lot. One of my optimizations was to prevent this unless it was needed.

However there are a lot of templates in use in each of the tables, and 
these could all be swapped out for something else and I think that might 
shave off something close to 100ms.

The tradeoffs for the tables are good ones, but it does bite us a bit here.

Dec 18 2023

Hipreme <msnmancini hotmail.com> writes:

On Monday, 18 December 2023 at 18:09:16 UTC, H. S. Teoh wrote:
 On Tue, Dec 19, 2023 at 06:47:00AM +1300, Richard (Rikki) 
 Andrew Cattermole via Digitalmars-d wrote:
 Yeah basically std.regex is no longer the cause for importing 
 std.regex slowdown.
 
 Its stuff like std.conv and std.uni.

 I haven't noticed too much horrible slowdown from std.conv, but 
 std.uni could use some fixing. I'm tempted to suggest that 
 those internal tables in std.uni should be pre-generated rather 
 than done at compile-time. There comes a point where repeatedly 
 doing something at every compile just isn't worth it when the 
 desired output could be autogenerated beforehand and saved as a 
 straight .d file with hard-coded values.


 T

std.conv slowdown comes from `std.conv.to!float` specifically. 
The other ones looks fine I guess.

about rikki's statement, it can help, but whenever you import the 
values, it will still take a lot of time in compilation time, you 
can look at `core.sys.windows.uuid` for reference, most of the 
compilation time spent on any windows module is this one taking a 
lot, and when you look at it, it is only a lot of definitions. 
So, yes, I think it could be a lot better. One example I've done 
was to separate the complete generation in Metal and never import 
the file which actually does all the CTFE and mixin's. Think of 
this file like only important to the linker and not needed to be 
used. A .di file with only definitions could help even more, and 
let other thing implement it so only the symbols would be 
imported.

Dec 18 2023

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On Monday, 18 December 2023 at 18:09:16 UTC, H. S. Teoh wrote:
 On Tue, Dec 19, 2023 at 06:47:00AM +1300, Richard (Rikki) 
 Andrew Cattermole via Digitalmars-d wrote:
 Yeah basically std.regex is no longer the cause for importing 
 std.regex slowdown.
 
 Its stuff like std.conv and std.uni.

 I haven't noticed too much horrible slowdown from std.conv, but 
 std.uni could use some fixing. I'm tempted to suggest that 
 those internal tables in std.uni should be pre-generated rather 
 than done at compile-time.

Far as I know there is not CTFE involved with the tables, but 
they are kind of huge. Perhaps doing .di files and enforce 
separate compilation may work here.


 There comes a point where repeatedly doing something at every 
 compile just isn't worth it when the desired output could be 
 autogenerated beforehand and saved as a straight .d file with 
 hard-coded values.

Which they are.

—
Dmitry Olshansky
CEO   Glowlabs
https://olshansky.me

Dec 20 2023

BoQsc <vaidas.boqsc gmail.com> writes:

The focus should always be on slower but more maintainable and 
simple implementation.

The efficiency and speed should be left for specific cases where 
it is needed and, again with as much clarity as possible.

Support and Features over efficiency and speed.

The question of "is anyone even using that?" is a wrong question.
If it's a common behaviour among implementations, it should exist.

If behaviour makes sense, it should exist and not avoided to be 
implemented just to gain some fraction of speed that can be 
gained in other ways or again, using more specific implementation 
for the job.

If you think that your implementation can be grown into 
supporting all crucial features, then it should be a good draft 
for other people to explore and complete implementation.

Else it should be a specific case implementation that could be 
selected if there is a need for speed but with the sacrifice of 
features.

Dec 26 2023

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On Tuesday, 26 December 2023 at 15:58:03 UTC, BoQsc wrote:
 The focus should always be on slower but more maintainable and 
 simple implementation.

Agreed on simple, though humbly disagree on speed.

 The efficiency and speed should be left for specific cases 
 where it is needed and, again with as much clarity as possible.

 Support and Features over efficiency and speed.

Well we already have std.regex which is quite fast and feature 
rich, sadly simplicity is not an option if we are to support full 
ECMAScript regex language and basic level 1 unicode regex.


 The question of "is anyone even using that?" is a wrong 
 question.
 If it's a common behaviour among implementations, it should 
 exist.

RE2 specifically avoids complicated features that block design of 
a fast engine.

 If behaviour makes sense, it should exist and not avoided to be 
 implemented just to gain some fraction of speed that can be 
 gained in other ways or again, using more specific 
 implementation for the job.

 If you think that your implementation can be grown into 
 supporting all crucial features, then it should be a good draft 
 for other people to explore and complete implementation.

There are very few folks who want to develop regex engine I 
think. Encouraging collaboration is an interesting angle I did 
not account for.


 Else it should be a specific case implementation that could be 
 selected if there is a need for speed but with the sacrifice of 
 features.

Okay, I understand your points. For the most part I could 
summarize my point of view as follows.
Building regex engine without regard for speed is not challenging 
for me, nor do I think that a simple slow engine could be 
gradually improved into simple fast engine. Speed is something 
you have to think of laying the first brick of whatever you are 
building, iff speed is desired.

—
Dmitry Olshansky
CEO   Glowlabs
https://olshansky.me

Dec 30 2023

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On Monday, 18 December 2023 at 17:16:40 UTC, H. S. Teoh wrote:
 On Sun, Dec 17, 2023 at 03:43:22PM +0000, Dmitry Olshansky via 
 Digitalmars-d wrote:
 So I’ve been working on rewind-regex trying to correct all of 
 the decisions in the original engine that slowed it down, 
 dropping some features that I knew I cannot implement 
 efficiently (backreferences have to go).
 
 So while I’m obsessed with simplicity and speed I thought I’d 
 ask people if it was an issue and what they really want from 
 gen2 regex library.

 [...]

 What I really want:

 - Reduce compile-time cost of `import std.regex;` to zero, or 
 at least
   close enough it's no longer noticeable.

 - Automatic caching of fixed-string regexes, i.e., the 
 equivalent of:

 	struct Re(string ctKnownRe) {
 		Regex!char re;
 		shared static this() {
 			re = regex(ctKnownRe);
 		}
 		Regex!char Re() {
 			return re;
 		}
 	}

A runtime cache should work, btw std.regex caches regexes (at 
least those passed as strings to match* family of functions).

 	void main() {
 		string s;
 		if (s.matchFirst(Re!`some\+pattern`)) {
 			...
 		}

 		// This should reuse the Regex instance from before:
 		if (s.matchFirst(Re!`some\+pattern`)) {
 			...
 		}
 	}

I'm thinking if it's worth it to intern patterns like that.

 - Reasonably fast runtime performance. I don't really care if 
 it's the
   top-of-the-line superfast regex matcher, even though that 
 would be
   really nice.  The primary pain points are the cost of import, 
 and the
   need to manually write code for automatic caching of fixed 
 runtime
   regexen.


 - Get rid of ctRegex -- it adds a huge compile-time cost with
   questionable runtime benefit. Unless there's a way to do this 
 at
   compile-time that *doesn't* add like 5 seconds per regex to 
 compile
   times.

Yup it's dropped, to be eventually replaced by JIT which is both 
better at compile-time and much more flexible at run-time.

---
Dmitry Olshansky
CEO   Glowlabs
https://olshansky.me

Dec 18 2023

"H. S. Teoh" <hsteoh qfbox.info> writes:

On Mon, Dec 18, 2023 at 06:34:51PM +0000, Dmitry Olshansky via Digitalmars-d
wrote:
[...]
 A runtime cache should work, btw std.regex caches regexes (at least
 those passed as strings to match* family of functions).

Cool, didn't know that. :-)


[...]
 - Get rid of ctRegex -- it adds a huge compile-time cost with
   questionable runtime benefit. Unless there's a way to do this at
   compile-time that *doesn't* add like 5 seconds per regex to
   compile times.

 
 Yup it's dropped, to be eventually replaced by JIT which is both
 better at compile-time and much more flexible at run-time.

[...]

Awesome stuff!


T

-- 
A mathematician is a device for turning coffee into theorems. -- P. Erdos

Dec 18 2023

D Programming

C/C++ Programming

Other

digitalmars.D - Do we need faster regex?