digitalmars.D - Going on std.regex & std.uni bug-fixing hunt
- Dmitry Olshansky (17/17) Sep 05 2017 It's been tough time on D front for me, going down from about ~1
- Jon Degenhardt (11/14) Sep 05 2017 Not intending to add to your workload, but I added an enhancement
- Dmitry Olshansky (6/18) Sep 05 2017 Yeah, there is quite a few missing bits from std.uni. I didn't
- Chad Joan (42/52) Sep 09 2017 I was working on std.regex a bit myself, so I created this bug
- Dmitry Olshansky (10/60) Sep 10 2017 Yeah, well known problem. Solution is to keep a bit of memory
- Chad Joan (25/73) Sep 10 2017 Indeed.
- Dmitry Olshansky (28/79) Sep 10 2017 No it's just somthing that showed up in a number of benchmarks
It's been tough time on D front for me, going down from about ~1 week of activity during July to ~2-3 days in August. One thing I realised is that doing a new GC is going to be a long battle. Before I'm too deep down this rabbit hole I decided to first address the long-standing backlog of issues of std.regex and std.uni. My burndown list for std.regex: https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr Should get to about 5-6 enhacements. Same for std.uni: https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216641&query_format=advanced&resolution=---&short_desc=std.uni&short_desc_type=allwordssubstr This should be actually 0. Anything not showing up on these lists will not get fixed anytime soon. The starting point for std.regex: https://github.com/dlang/phobos/pull/5722
Sep 05 2017
On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky wrote:Before I'm too deep down this rabbit hole I decided to first address the long-standing backlog of issues of std.regex and std.uni.Not intending to add to your workload, but I added an enhancement request for something I hit ran into recently: Having a version of wcwidth/wcswidth functions available as part of unicode support. These functions calculate the width of a character in fixed-width fonts, they are useful in command-line apps. (CJK characters typically have a width of two.) Character width is part of the unicode standard. Issue: https://issues.dlang.org/show_bug.cgi?id=17810 --Jon
Sep 05 2017
On Wednesday, 6 September 2017 at 05:23:51 UTC, Jon Degenhardt wrote:On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky wrote:Yeah, there is quite a few missing bits from std.uni. I didn't expect this one though. Hopefully we can add them on pay as go principle, so if you don't use it you don't pay for it.Before I'm too deep down this rabbit hole I decided to first address the long-standing backlog of issues of std.regex and std.uni.Not intending to add to your workload, but I added an enhancement request for something I hit ran into recently: Having a version of wcwidth/wcswidth functions available as part of unicode support. These functions calculate the width of a character in fixed-width fonts, they are useful in command-line apps.--Jon
Sep 05 2017
On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky wrote:It's been tough time on D front for me, going down from about ~1 week of activity during July to ~2-3 days in August. One thing I realised is that doing a new GC is going to be a long battle. Before I'm too deep down this rabbit hole I decided to first address the long-standing backlog of issues of std.regex and std.uni. My burndown list for std.regex: https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr ...I was working on std.regex a bit myself, so I created this bug report to capture some of the findings/progress: https://issues.dlang.org/show_bug.cgi?id=17820 It seems like something you might be interested in, or might even have a small chance of fixing in the course of other things. ... There are other regex improvements I might be interested in, but I'm not sure I have time to make bug reports for them right now. I might be convinced to fast track them if someone wants to make legitimate effort towards fixing them, otherwise I'll eventually get around to writing the reports and/or making PRs someday. Examples: -- Calls to malloc in the CTFE path cause some regexes to fail at compile time. I suspect this happens due to the Captures (n > smallString) condition when the number of possible captures is greater than 3, but I haven't tested it (time consuming...). -- I remember being unable to iterate over named captures. But I'm not confident that I'm remembering this correctly, and I'm not sure if it's still true. -- The Captures struct does not specify what value is returned for submatches that were in the branch of an alternation that wasn't taken or in a repetition that matched 0 or more than 1 times. -- The Captures struct does not seem to have a way to access all of the strings matched by a submatch in repetition context, not to mention nested repetition contexts. I'm not sure how much those mentions help without proper bug reports, but at least I got it off my chest (for the time being) without having to spend my whole weekend writing bug reports ;) ... Dmitry, I appreciate your working towards making the regex module easier to work on. Thanks. ... I'm curious what you're thinking about when you mention something ambitious like writing a new GC :) (like this https://imgur.com/cWa4evD) I can't help but fantasize about cheap ways to get GC allocations to parallelize well and not end up writing an entire generational collector! But I doubt I'll ever have the opportunity to work on such things. I hope your GC attempt works out!
Sep 09 2017
On Sunday, 10 September 2017 at 00:16:10 UTC, Chad Joan wrote:On Tuesday, 5 September 2017 at 10:50:46 UTC, Dmitry Olshansky wrote:Yeah, well known problem. Solution is to keep a bit of memory cached eg in TLS variable.My burndown list for std.regex: https://issues.dlang.org/buglist.cgi?bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&component=phobos&list_id=216638&product=D&query_format=advanced&resolution=---&short_desc=regex&short_desc_type=allwordssubstr ...I was working on std.regex a bit myself, so I created this bug report to capture some of the findings/progress: https://issues.dlang.org/show_bug.cgi?id=17820 It seems like something you might be interested in, or might even have a small chance of fixing in the course of other things.... There are other regex improvements I might be interested in, but I'm not sure I have time to make bug reports for them right now. I might be convinced to fast track them if someone wants to make legitimate effort towards fixing them, otherwise I'll eventually get around to writing the reports and/or making PRs someday. Examples: -- Calls to malloc in the CTFE path cause some regexes to fail at compile time. I suspect this happens due to the Captures (nSholudn't be a problem, but please report an example.smallString) condition when the number of possible capturesis greater than 3, but I haven't tested it (time consuming...).-- I remember being unable to iterate over named captures. But I'm not confident that I'm remembering this correctly, and I'm not sure if it's still true.Would be nice and simple enhancement.-- The Captures struct does not specify what value is returned for submatches that were in the branch of an alternation that wasn't taken or in a repetition that matched 0 or more than 1 times.As every engine out there the value is "", empty string.-- The Captures struct does not seem to have a way to access all of the strings matched by a submatch in repetition context, not to mention nested repetition contexts.Just like any other regex library.I'm not sure how much those mentions help without proper bug reports, but at least I got it off my chest (for the time being) without having to spend my whole weekend writing bug reports ;)Well they are warmly welcome shouldypu get to it.... Dmitry, I appreciate your working towards making the regex module easier to work on. Thanks. ... I'm curious what you're thinking about when you mention something ambitious like writing a new GC :) (like this https://imgur.com/cWa4evD) I can't help but fantasize about cheap ways to get GC allocations to parallelize well and not end up writing an entire generational collector!ThreadCache can go a long way to help that.But I doubt I'll ever have the opportunity to work on such things. I hope your GC attempt works out!Me too. It's won't be trivial effort though.
Sep 10 2017
On Sunday, 10 September 2017 at 11:47:02 UTC, Dmitry Olshansky wrote:On Sunday, 10 September 2017 at 00:16:10 UTC, Chad Joan wrote:Indeed. Is there another issue I can mark it as a duplicate of?I was working on std.regex a bit myself, so I created this bug report to capture some of the findings/progress: https://issues.dlang.org/show_bug.cgi?id=17820 It seems like something you might be interested in, or might even have a small chance of fixing in the course of other things.Yeah, well known problem. Solution is to keep a bit of memory cached eg in TLS variable.I usually don't refer to other libraries while using a library. If an API doesn't define something, then it is, by definition, undefined behavior, and thus quite undesirable to rely upon. This one seems pretty easy to fix though. I will probably make a documentation PR at some point.[...] -- The Captures struct does not specify what value is returned for submatches that were in the branch of an alternation that wasn't taken or in a repetition that matched 0 or more than 1 times.As every engine out there the value is "", empty string.Counterexample: https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures(v=vs.110).aspx#code-snippet-3 I actually have a strong interest in this. And not because I need to write regular expressions that extract lists of patterns all the time (well, it might've happened). More importantly: this would make it easier to integrate Phobos' regex engine into a parser generator framework. Current plans involve regular expression + parsing expression grammars. I'm pretty sure it is possible to mechanically convert a subset of PEGs into Regexes and gain some useful optimizations, but this requires granular control over regular expression captures to be able to extract the text matched by the original PEG symbols.-- The Captures struct does not seem to have a way to access all of the strings matched by a submatch in repetition context, not to mention nested repetition contexts.Just like any other regex library.Thanks!I'm not sure how much those mentions help without proper bug reports, but at least I got it off my chest (for the time being) without having to spend my whole weekend writing bug reports ;)Well they are warmly welcome shouldypu get to it.Google didn't help me with this one. Any chance I could get a link?... Dmitry, I appreciate your working towards making the regex module easier to work on. Thanks. ... I'm curious what you're thinking about when you mention something ambitious like writing a new GC :) (like this https://imgur.com/cWa4evD) I can't help but fantasize about cheap ways to get GC allocations to parallelize well and not end up writing an entire generational collector!ThreadCache can go a long way to help that.Good luck!But I doubt I'll ever have the opportunity to work on such things. I hope your GC attempt works out!Me too. It's won't be trivial effort though.
Sep 10 2017
On Sunday, 10 September 2017 at 18:54:21 UTC, Chad Joan wrote:On Sunday, 10 September 2017 at 11:47:02 UTC, Dmitry Olshansky wrote:No it's just somthing that showed up in a number of benchmarks people posted, never got around to it. So just file it so I'll have an issue to tick once I fix this.Yeah, well known problem. Solution is to keep a bit of memory cached eg in TLS variable.Indeed. Is there another issue I can mark it as a duplicate of?In many way we just copy ECMAScript regex semantics, with some extensions from Python and Perl.I usually don't refer to other libraries while using a library. If an API doesn't define something, then it is, by definition, undefined behavior, and thus quite undesirable to rely upon.[...] -- The Captures struct does not specify what value is returned for submatches that were in the branch of an alternation that wasn't taken or in a repetition that matched 0 or more than 1 times.As every engine out there the value is "", empty string.This one seems pretty easy to fix though. I will probably make a documentation PR at some point.Please do.Horrible, horrible idea. Performance-wise that is.Counterexample: https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures(v=vs.110).aspx#code-snippet-3-- The Captures struct does not seem to have a way to access all of the strings matched by a submatch in repetition context, not to mention nested repetition contexts.Just like any other regex library.I actually have a strong interest in this. And not because I need to write regular expressions that extract lists of patterns all the time (well, it might've happened). More importantly: this would make it easier to integrate Phobos' regex engine into a parser generator framework.No-no-no. Please don't.Current plans involve regular expression + parsing expression grammars. I'm pretty sure it is possible to mechanically convert a subset of PEGs into Regexes and gain some useful optimizations, but this requires granular control over regular expression captures to be able to extract the text matched by the original PEG symbols.This is heavily misguided, but a common idea. PEGs are actually way simpler then regex. PEGs * and + have nothing to do with regex * and + qualifiers. In PEG [ab]*b will never match because [ab]* eats any sequence of a-s and b-s and never backtracks. In regex [ab]* will match b, ab, bb, .... because regex 'backtracks'. Thus one can easily see that PEGs constructs are trivial to implement, and provide quite a number of optimizations as well. I'd argue that PEG can be made to run faster in 'parse' scenario whereas nothing beats _simple_ regex in 'search' scenario.The manpage for jemalloc mentions it explicitly search for tcache. Also look through jemalloc and related papers, I think they all call it thread cache. https://linux.die.net/man/3/jemalloc In essene you keep a small cache of free memory in TLS to serve next allocations faster. Some suggest a per-cpu cache, which is bit trickier but also interestingly avoid contention.ThreadCache can go a long way to help that.Google didn't help me with this one. Any chance I could get a link?Thanks!Good luck!things. I hope your GC attempt works out!Me too. It's won't be trivial effort though.
Sep 10 2017