digitalmars.D - Do we need faster regex?
- Dmitry Olshansky (11/11) Dec 17 2023 So I’ve been working on rewind-regex trying to correct all of the
- Witold Baryluk (7/18) Dec 17 2023 I never needed to use backreferences in last 25 years. Something
- Dmitry Olshansky (8/29) Dec 17 2023 Yes, RE2 is the main contender, rewind-regex aims at basically
- Anonymouse (9/10) Dec 18 2023 I really like regex as a thing, but I have had to drop it because
- Richard (Rikki) Andrew Cattermole (6/13) Dec 18 2023 As long as you use std.regex at runtime this is no longer the case.
- H. S. Teoh (37/45) Dec 18 2023 [...]
- Richard (Rikki) Andrew Cattermole (26/26) Dec 18 2023 If you import std.regex and do nothing else:
- H. S. Teoh (6/18) Dec 18 2023 Cool, awesome stuff!
- Richard (Rikki) Andrew Cattermole (3/3) Dec 18 2023 Yeah basically std.regex is no longer the cause for importing std.regex
- H. S. Teoh (10/14) Dec 18 2023 I haven't noticed too much horrible slowdown from std.conv, but std.uni
- Richard (Rikki) Andrew Cattermole (9/9) Dec 18 2023 The tables are stored compressed.
- Hipreme (15/29) Dec 18 2023 std.conv slowdown comes from `std.conv.to!float` specifically.
- Dmitry Olshansky (9/23) Dec 20 2023 Far as I know there is not CTFE involved with the tables, but
- BoQsc (17/17) Dec 26 2023 The focus should always be on slower but more maintainable and
- Dmitry Olshansky (21/40) Dec 30 2023 Well we already have std.regex which is quite fast and feature
- Dmitry Olshansky (10/61) Dec 18 2023 A runtime cache should work, btw std.regex caches regexes (at
- H. S. Teoh (9/18) Dec 18 2023 Cool, didn't know that. :-)
So I’ve been working on rewind-regex trying to correct all of the decisions in the original engine that slowed it down, dropping some features that I knew I cannot implement efficiently (backreferences have to go). So while I’m obsessed with simplicity and speed I thought I’d ask people if it was an issue and what they really want from gen2 regex library. — Dmitry Olshansky CEO Glowlabs https://olshansky.me
Dec 17 2023
On Sunday, 17 December 2023 at 15:43:22 UTC, Dmitry Olshansky wrote:So I’ve been working on rewind-regex trying to correct all of the decisions in the original engine that slowed it down, dropping some features that I knew I cannot implement efficiently (backreferences have to go). So while I’m obsessed with simplicity and speed I thought I’d ask people if it was an issue and what they really want from gen2 regex library. — Dmitry Olshansky CEO Glowlabs https://olshansky.meI never needed to use backreferences in last 25 years. Something that has similar performance and scaling like RE2, Plan 9 grep, or original awk (modulo unicode), would be the best. I somebody needs backreferences, they can use PCRE, or some 3rd party libraries.
Dec 17 2023
On Sunday, 17 December 2023 at 17:40:21 UTC, Witold Baryluk wrote:On Sunday, 17 December 2023 at 15:43:22 UTC, Dmitry Olshansky wrote:Yes, RE2 is the main contender, rewind-regex aims at basically the same feature set and better/same performance.So I’ve been working on rewind-regex trying to correct all of the decisions in the original engine that slowed it down, dropping some features that I knew I cannot implement efficiently (backreferences have to go). So while I’m obsessed with simplicity and speed I thought I’d ask people if it was an issue and what they really want from gen2 regex library. — Dmitry Olshansky CEO Glowlabs https://olshansky.meI never needed to use backreferences in last 25 years. Something that has similar performance and scaling like RE2, Plan 9 grep, or original awk (modulo unicode), would be the best.I somebody needs backreferences, they can use PCRE, or some 3rd party libraries.Same thoughts here. — Dmitry Olshansky CEO Glowlabs https://olshansky.me
Dec 17 2023
On Sunday, 17 December 2023 at 15:43:22 UTC, Dmitry Olshansky wrote:[...]I really like regex as a thing, but I have had to drop it because of the increased compilation memory and time requirements that use of `std.regex` incurred. Maybe that can't be avoided, I don't know. I have not had that much use for backreferences, so to me the important part is that it doesn't require me to have 300+ Mb more RAM and an extra second or two on the wall clock to compile.
Dec 18 2023
On 18/12/2023 10:20 PM, Anonymouse wrote:I really like regex as a thing, but I have had to drop it because of the increased compilation memory and time requirements that use of |std.regex| incurred. Maybe that can't be avoided, I don't know. I have not had that much use for backreferences, so to me the important part is that it doesn't require me to have 300+ Mb more RAM and an extra second or two on the wall clock to compile.As long as you use std.regex at runtime this is no longer the case. https://github.com/dlang/phobos/pull/8699 https://github.com/dlang/phobos/pull/8698 Looks like I messed up and not got it into the changelog. Anyway 2.103 should be the release that have them in it.
Dec 18 2023
On Sun, Dec 17, 2023 at 03:43:22PM +0000, Dmitry Olshansky via Digitalmars-d wrote:So I’ve been working on rewind-regex trying to correct all of the decisions in the original engine that slowed it down, dropping some features that I knew I cannot implement efficiently (backreferences have to go). So while I’m obsessed with simplicity and speed I thought I’d ask people if it was an issue and what they really want from gen2 regex library.[...] What I really want: - Reduce compile-time cost of `import std.regex;` to zero, or at least close enough it's no longer noticeable. - Automatic caching of fixed-string regexes, i.e., the equivalent of: struct Re(string ctKnownRe) { Regex!char re; shared static this() { re = regex(ctKnownRe); } Regex!char Re() { return re; } } void main() { string s; if (s.matchFirst(Re!`some\+pattern`)) { ... } // This should reuse the Regex instance from before: if (s.matchFirst(Re!`some\+pattern`)) { ... } } - Reasonably fast runtime performance. I don't really care if it's the top-of-the-line superfast regex matcher, even though that would be really nice. The primary pain points are the cost of import, and the need to manually write code for automatic caching of fixed runtime regexen. - Get rid of ctRegex -- it adds a huge compile-time cost with questionable runtime benefit. Unless there's a way to do this at compile-time that *doesn't* add like 5 seconds per regex to compile times. T -- That's not a bug; that's a feature!
Dec 18 2023
If you import std.regex and do nothing else: sema1 61ms sema2 101ms std.internal.unicode_tables: sema1 4ms sema2 100ms https://github.com/dlang/phobos/blob/master/std/internal/unicode_tables.d Now for: ```d auto ctr = ctRegex!(`^.*/([^/]+)/?$`); // It works just like a normal regex: auto c2 = matchFirst("foo/bar", ctr); // First match found here, if any assert(!c2.empty); // Be sure to check if there is a match before examining contents! assert(c2[1] == "bar"); // Captures is a range of submatches: 0 = full match. ``` toImpl: sema3 125ms parseSet: Sema3 111ms findAny (child of parseRegex and parseSet): CTFE 51ms While there are improvements to be had, I have already done all the big ones when I shaved off ~600ms earlier this year. Gotta love time trace!
Dec 18 2023
On Tue, Dec 19, 2023 at 06:34:04AM +1300, Richard (Rikki) Andrew Cattermole via Digitalmars-d wrote:If you import std.regex and do nothing else: sema1 61ms sema2 101ms std.internal.unicode_tables: sema1 4ms sema2 100ms[...]While there are improvements to be had, I have already done all the big ones when I shaved off ~600ms earlier this year. Gotta love time trace!Cool, awesome stuff! T -- If blunt statements had a point, they wouldn't be blunt...
Dec 18 2023
Yeah basically std.regex is no longer the cause for importing std.regex slowdown. Its stuff like std.conv and std.uni.
Dec 18 2023
On Tue, Dec 19, 2023 at 06:47:00AM +1300, Richard (Rikki) Andrew Cattermole via Digitalmars-d wrote:Yeah basically std.regex is no longer the cause for importing std.regex slowdown. Its stuff like std.conv and std.uni.I haven't noticed too much horrible slowdown from std.conv, but std.uni could use some fixing. I'm tempted to suggest that those internal tables in std.uni should be pre-generated rather than done at compile-time. There comes a point where repeatedly doing something at every compile just isn't worth it when the desired output could be autogenerated beforehand and saved as a straight .d file with hard-coded values. T -- Heads I win, tails you lose.
Dec 18 2023
The tables are stored compressed. We do that because otherwise the binary size of Phobos would grow by multiple megabytes (think 10, not 1). This is why when you trigger usage of regex at CT it slows down a whole lot. One of my optimizations was to prevent this unless it was needed. However there are a lot of templates in use in each of the tables, and these could all be swapped out for something else and I think that might shave off something close to 100ms. The tradeoffs for the tables are good ones, but it does bite us a bit here.
Dec 18 2023
On Monday, 18 December 2023 at 18:09:16 UTC, H. S. Teoh wrote:On Tue, Dec 19, 2023 at 06:47:00AM +1300, Richard (Rikki) Andrew Cattermole via Digitalmars-d wrote:std.conv slowdown comes from `std.conv.to!float` specifically. The other ones looks fine I guess. about rikki's statement, it can help, but whenever you import the values, it will still take a lot of time in compilation time, you can look at `core.sys.windows.uuid` for reference, most of the compilation time spent on any windows module is this one taking a lot, and when you look at it, it is only a lot of definitions. So, yes, I think it could be a lot better. One example I've done was to separate the complete generation in Metal and never import the file which actually does all the CTFE and mixin's. Think of this file like only important to the linker and not needed to be used. A .di file with only definitions could help even more, and let other thing implement it so only the symbols would be imported.Yeah basically std.regex is no longer the cause for importing std.regex slowdown. Its stuff like std.conv and std.uni.I haven't noticed too much horrible slowdown from std.conv, but std.uni could use some fixing. I'm tempted to suggest that those internal tables in std.uni should be pre-generated rather than done at compile-time. There comes a point where repeatedly doing something at every compile just isn't worth it when the desired output could be autogenerated beforehand and saved as a straight .d file with hard-coded values. T
Dec 18 2023
On Monday, 18 December 2023 at 18:09:16 UTC, H. S. Teoh wrote:On Tue, Dec 19, 2023 at 06:47:00AM +1300, Richard (Rikki) Andrew Cattermole via Digitalmars-d wrote:Far as I know there is not CTFE involved with the tables, but they are kind of huge. Perhaps doing .di files and enforce separate compilation may work here.Yeah basically std.regex is no longer the cause for importing std.regex slowdown. Its stuff like std.conv and std.uni.I haven't noticed too much horrible slowdown from std.conv, but std.uni could use some fixing. I'm tempted to suggest that those internal tables in std.uni should be pre-generated rather than done at compile-time.There comes a point where repeatedly doing something at every compile just isn't worth it when the desired output could be autogenerated beforehand and saved as a straight .d file with hard-coded values.Which they are. — Dmitry Olshansky CEO Glowlabs https://olshansky.me
Dec 20 2023
The focus should always be on slower but more maintainable and simple implementation. The efficiency and speed should be left for specific cases where it is needed and, again with as much clarity as possible. Support and Features over efficiency and speed. The question of "is anyone even using that?" is a wrong question. If it's a common behaviour among implementations, it should exist. If behaviour makes sense, it should exist and not avoided to be implemented just to gain some fraction of speed that can be gained in other ways or again, using more specific implementation for the job. If you think that your implementation can be grown into supporting all crucial features, then it should be a good draft for other people to explore and complete implementation. Else it should be a specific case implementation that could be selected if there is a need for speed but with the sacrifice of features.
Dec 26 2023
On Tuesday, 26 December 2023 at 15:58:03 UTC, BoQsc wrote:The focus should always be on slower but more maintainable and simple implementation.Agreed on simple, though humbly disagree on speed.The efficiency and speed should be left for specific cases where it is needed and, again with as much clarity as possible. Support and Features over efficiency and speed.Well we already have std.regex which is quite fast and feature rich, sadly simplicity is not an option if we are to support full ECMAScript regex language and basic level 1 unicode regex.The question of "is anyone even using that?" is a wrong question. If it's a common behaviour among implementations, it should exist.RE2 specifically avoids complicated features that block design of a fast engine.If behaviour makes sense, it should exist and not avoided to be implemented just to gain some fraction of speed that can be gained in other ways or again, using more specific implementation for the job. If you think that your implementation can be grown into supporting all crucial features, then it should be a good draft for other people to explore and complete implementation.There are very few folks who want to develop regex engine I think. Encouraging collaboration is an interesting angle I did not account for.Else it should be a specific case implementation that could be selected if there is a need for speed but with the sacrifice of features.Okay, I understand your points. For the most part I could summarize my point of view as follows. Building regex engine without regard for speed is not challenging for me, nor do I think that a simple slow engine could be gradually improved into simple fast engine. Speed is something you have to think of laying the first brick of whatever you are building, iff speed is desired. — Dmitry Olshansky CEO Glowlabs https://olshansky.me
Dec 30 2023
On Monday, 18 December 2023 at 17:16:40 UTC, H. S. Teoh wrote:On Sun, Dec 17, 2023 at 03:43:22PM +0000, Dmitry Olshansky via Digitalmars-d wrote:A runtime cache should work, btw std.regex caches regexes (at least those passed as strings to match* family of functions).So I’ve been working on rewind-regex trying to correct all of the decisions in the original engine that slowed it down, dropping some features that I knew I cannot implement efficiently (backreferences have to go). So while I’m obsessed with simplicity and speed I thought I’d ask people if it was an issue and what they really want from gen2 regex library.[...] What I really want: - Reduce compile-time cost of `import std.regex;` to zero, or at least close enough it's no longer noticeable. - Automatic caching of fixed-string regexes, i.e., the equivalent of: struct Re(string ctKnownRe) { Regex!char re; shared static this() { re = regex(ctKnownRe); } Regex!char Re() { return re; } }void main() { string s; if (s.matchFirst(Re!`some\+pattern`)) { ... } // This should reuse the Regex instance from before: if (s.matchFirst(Re!`some\+pattern`)) { ... } }I'm thinking if it's worth it to intern patterns like that.- Reasonably fast runtime performance. I don't really care if it's the top-of-the-line superfast regex matcher, even though that would be really nice. The primary pain points are the cost of import, and the need to manually write code for automatic caching of fixed runtime regexen.- Get rid of ctRegex -- it adds a huge compile-time cost with questionable runtime benefit. Unless there's a way to do this at compile-time that *doesn't* add like 5 seconds per regex to compile times.Yup it's dropped, to be eventually replaced by JIT which is both better at compile-time and much more flexible at run-time. --- Dmitry Olshansky CEO Glowlabs https://olshansky.me
Dec 18 2023
On Mon, Dec 18, 2023 at 06:34:51PM +0000, Dmitry Olshansky via Digitalmars-d wrote: [...]A runtime cache should work, btw std.regex caches regexes (at least those passed as strings to match* family of functions).Cool, didn't know that. :-) [...][...] Awesome stuff! T -- A mathematician is a device for turning coffee into theorems. -- P. Erdos- Get rid of ctRegex -- it adds a huge compile-time cost with questionable runtime benefit. Unless there's a way to do this at compile-time that *doesn't* add like 5 seconds per regex to compile times.Yup it's dropped, to be eventually replaced by JIT which is both better at compile-time and much more flexible at run-time.
Dec 18 2023