digitalmars.D - How is std.regex.replaceAllInto more efficient?
- Shriramana Sharma (25/25) Oct 15 2015 Hello. The doc on std.regex.replace{First,All}Into reads:
- Dmitry Olshansky (16/40) Oct 16 2015 It is dupped just because we are going to use it as _input_ for the
- Shriramana Sharma (4/6) Oct 16 2015 That's much clear now, thanks!
- Dmitry Olshansky (4/8) Oct 16 2015 Feel free to improve on the examples, I'd gladly pull such patches.
- Shriramana Sharma (12/13) Oct 17 2015 How about this, using the "comifying" regex from the replaceAll example:
- Dmitry Olshansky (8/19) Oct 17 2015 My problem with it is writeln/byLine b/c it can't be unittested.
- Marc =?UTF-8?B?U2Now7x0eg==?= (5/25) Oct 17 2015 You can prefix it with `version(StdDdoc)`, then it will only
- Shriramana Sharma (6/8) Oct 17 2015 That's true, but then the point was not to produce a valid unittest for ...
- Dmitry Olshansky (6/12) Oct 17 2015 Listing code that is not routinely tested on each build means someday it...
- Shriramana Sharma (26/29) Oct 18 2015 Hmmm... AFAICS the *Into function is most useful when you don't know how...
- Dmitry Olshansky (12/39) Oct 18 2015 Looks good.
- Shriramana Sharma (23/26) Oct 18 2015 Mmm no that throws an assertion failure, since the guarantee of the rege...
- Dmitry Olshansky (10/34) Oct 18 2015 That would miss the zero. iota is half-open [) interval.
- =?UTF-8?Q?S=c3=b6nke_Ludwig?= (8/19) Oct 19 2015 How about going one step further:
- Shriramana Sharma (5/11) Oct 20 2015 Another option is to use KeepTerminator.yes. Using writeln() is shorter!...
Hello. The doc on std.regex.replace{First,All}Into reads: A variation on replaceAll that instead of allocating a new string on each call outputs the result piece-wise to the sink. In particular this enables efficient construction of a final output incrementally. Example: //swap all 3 letter words and bring it back string text = "How are you doing?"; auto sink = appender!(char[])(); replaceAllInto!(cap => retro(cap[0]))(sink, text, regex(`\b\w{3}\b`)); auto swapped = sink.data.dup; // make a copy explicitly assert(swapped == "woH era uoy doing?"); sink.clear(); replaceAllInto!(cap => retro(cap[0]))(sink, swapped, regex(`\b\w{3}\b`)); assert(sink.data == text); Now IIUC the code, there are only two calls to replaceAllInto, and after the first call the string is dupped, in which case there is an allocation nevertheless, so how does using *into make for efficiency? Or is it that "each call" means each match of the regex? So there aren't just two calls, and that without the sink and *Into, each match of the regex will cause a new string allocation and this is what is being avoided? If so, why can't the default implementation of replace{First,All} itself use an internal sink instead of needing the user to manually specify it? Thanks! Shriramana Sharma.
Oct 15 2015
On 16-Oct-2015 04:50, Shriramana Sharma wrote:Hello. The doc on std.regex.replace{First,All}Into reads: A variation on replaceAll that instead of allocating a new string on each call outputs the result piece-wise to the sink. In particular this enables efficient construction of a final output incrementally. Example: //swap all 3 letter words and bring it back string text = "How are you doing?"; auto sink = appender!(char[])(); replaceAllInto!(cap => retro(cap[0]))(sink, text, regex(`\b\w{3}\b`)); auto swapped = sink.data.dup; // make a copy explicitly assert(swapped == "woH era uoy doing?"); sink.clear(); replaceAllInto!(cap => retro(cap[0]))(sink, swapped, regex(`\b\w{3}\b`)); assert(sink.data == text); Now IIUC the code, there are only two calls to replaceAllInto, and after the first call the string is dupped,It is dupped just because we are going to use it as _input_ for the second call. If instead of swapped the second call would use some other source the intent of the example would be clearer. I agree it's not a great example.in which case there is an allocation nevertheless, so how does using *into make for efficiency?Consider that the appender is being reused in 2 replaceAllInto calls, therefore the same allocated memory is reused. Otherwise one would have to allocate on each call to replace.Or is it that "each call" means each match of the regex? So there aren't just two calls, and that without the sink and *Into, each match of the regex will cause a new string allocation and this is what is being avoided?No. Conceptually replace does replaceInto but creates a new appender each time. This is the overhead replaceInto means to avoid.If so, why can't the default implementation of replace{First,All} itself use an internal sink instead of needing the user to manually specify it?It does. replaceFirst is replaceFirstInto + allocation of appender. If you already have and appender you can reuse it. Even better if you were meaning to just print out the result you can pass it File("some-file").lockingTextWriter bypassing allocation completely. -- Dmitry Olshansky
Oct 16 2015
Dmitry Olshansky wrote:No. Conceptually replace does replaceInto but creates a new appender each time. This is the overhead replaceInto means to avoid.That's much clear now, thanks! --
Oct 16 2015
On 16-Oct-2015 18:23, Shriramana Sharma wrote:Dmitry Olshansky wrote:Feel free to improve on the examples, I'd gladly pull such patches. -- Dmitry OlshanskyNo. Conceptually replace does replaceInto but creates a new appender each time. This is the overhead replaceInto means to avoid.That's much clear now, thanks!
Oct 16 2015
Dmitry Olshansky wrote:Feel free to improve on the examples, I'd gladly pull such patches.How about this, using the "comifying" regex from the replaceAll example: auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); auto sink = appender!(char [])(); foreach (line; stdin.byLine()) { sink.clear(); replaceAllInto(sink, line, re, ","); writeln(sink.data()); } --
Oct 17 2015
On 17-Oct-2015 13:06, Shriramana Sharma wrote:Dmitry Olshansky wrote:My problem with it is writeln/byLine b/c it can't be unittested. Documented unit-tests as examples are gold. On the other hand you can post it to list of example shown on the first page of dlang.org because I think it's kind of cool. Also a variation of this example can be put in std.regex synopsis. -- Dmitry OlshanskyFeel free to improve on the examples, I'd gladly pull such patches.How about this, using the "comifying" regex from the replaceAll example: auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); auto sink = appender!(char [])(); foreach (line; stdin.byLine()) { sink.clear(); replaceAllInto(sink, line, re, ","); writeln(sink.data()); }
Oct 17 2015
On Saturday, 17 October 2015 at 10:54:19 UTC, Dmitry Olshansky wrote:On 17-Oct-2015 13:06, Shriramana Sharma wrote:You can prefix it with `version(StdDdoc)`, then it will only appear in the documentation. It won't be really tested of course, but the compiler will at least do syntax checking on it, I guess.Dmitry Olshansky wrote:My problem with it is writeln/byLine b/c it can't be unittested. Documented unit-tests as examples are gold.Feel free to improve on the examples, I'd gladly pull such patches.How about this, using the "comifying" regex from the replaceAll example: auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); auto sink = appender!(char [])(); foreach (line; stdin.byLine()) { sink.clear(); replaceAllInto(sink, line, re, ","); writeln(sink.data()); }
Oct 17 2015
Dmitry Olshansky wrote:My problem with it is writeln/byLine b/c it can't be unittested. Documented unit-tests as examples are gold.That's true, but then the point was not to produce a valid unittest for the *Into functions but a meaningful *example*, which will make clear to a user how this function is useful, no? --
Oct 17 2015
On 17-Oct-2015 16:43, Shriramana Sharma wrote:Dmitry Olshansky wrote:Listing code that is not routinely tested on each build means someday it may become broken. Anyway just issue a pull request, we can figure out the details in github discussion. -- Dmitry OlshanskyMy problem with it is writeln/byLine b/c it can't be unittested. Documented unit-tests as examples are gold.That's true, but then the point was not to produce a valid unittest for the *Into functions but a meaningful *example*, which will make clear to a user how this function is useful, no?
Oct 17 2015
Dmitry Olshansky wrote:Listing code that is not routinely tested on each build means someday it may become broken. Anyway just issue a pull request, we can figure out the details in github discussion.Hmmm... AFAICS the *Into function is most useful when you don't know how many items of input you are going to need to process and want to avoid an allocation at each item. The command-line byLine()/writeln() number delimiter makes a very sensible example there, so I would recommend you put that in *somewhere*, though it cannot be a unittest. For a unittest, how about: static auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); auto sink = appender!(char [])(); enum ulong min = 10UL ^^ 10, max = 10UL ^^ 19; foreach (i; 0 .. 50) { sink.clear(); replaceAllInto(sink, text(uniform(min, max)), re, ","); // NOTE: size_t is an unsigned type, so it will wrap over when going below zero for (size_t pos = sink.data.length - 4; pos < sink.data.length; pos -= 4) assert(sink.data[pos] == ','); } I didn't intentionally do that wrap-around trick, but I can't implicitly convert an unsigned type to a signed one either, and using to!long is equally as awkward, not to mention that it would not work for overly long strings (though that doesn't apply in the present case). --
Oct 18 2015
On 18-Oct-2015 17:26, Shriramana Sharma wrote:Dmitry Olshansky wrote:Yes - e.g. add to std.regex synopsis to showcase this ability.Listing code that is not routinely tested on each build means someday it may become broken. Anyway just issue a pull request, we can figure out the details in github discussion.Hmmm... AFAICS the *Into function is most useful when you don't know how many items of input you are going to need to process and want to avoid an allocation at each item. The command-line byLine()/writeln() number delimiter makes a very sensible example there, so I would recommend you put that in *somewhere*, though it cannot be a unittest.For a unittest, how about:Looks good.static auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g");Just staticauto sink = appender!(char [])(); enum ulong min = 10UL ^^ 10, max = 10UL ^^ 19; foreach (i; 0 .. 50) { sink.clear(); replaceAllInto(sink, text(uniform(min, max)), re, ","); // NOTE: size_t is an unsigned type, so it will wrap over when going below zero for (size_t pos = sink.data.length - 4; pos < sink.data.length; pos -= 4) assert(sink.data[pos] == ','); }I guess the idiomatic way is: foreach (pos; iota(0, sink.data.length, 4).retro) { ... }I didn't intentionally do that wrap-around trick, but I can't implicitly convert an unsigned type to a signed one either, and using to!long is equally as awkward, not to mention that it would not work for overly long strings (though that doesn't apply in the present case).See above. -- Dmitry Olshansky
Oct 18 2015
Dmitry Olshansky wrote:I guess the idiomatic way is: foreach (pos; iota(0, sink.data.length, 4).retro)Mmm no that throws an assertion failure, since the guarantee of the regex is only to insert commas before every third digit *from the end*, but doing retro on an *ascending* iota fixes the end point at index zero, and start point at some multiple of four *from the beginning*, which isn't compatible with the regex's promise. Thus the correct approach would be to do a *descending* iota: foreach (pos; iota(sink.data.length - 4, 0, -4)) assert(sink.data[pos] == ','); Pull: https://github.com/D-Programming-Language/phobos/pull/3731 I am not much experienced with GitHub pull requests, so I hope I did that correctly. IIUC after you pull, it will be back in my fork and its clone once I pull back from you, then I can delete my local branch, right? BTW since I still didn't like that wraparound thing, (and it would fail in some extreme corner cases where sink.data.length is > size_t.max - 3 or such, I think) I thought of it much and arrived at the much more convoluted: for (size_t pos = sink.data.length; pos > 3; ) { pos -= 4; assert(sink.data[pos] == ','); } which is still not easy for one's mind to "wraparound"! So the descending iota solution we arrived above is the most elegant. As I said, I'm still getting my "D-legs"... --
Oct 18 2015
On 18-Oct-2015 21:16, Shriramana Sharma wrote:Dmitry Olshansky wrote:That would miss the zero. iota is half-open [) interval. How about doing a bit of math and going for: iota(sink.data.length % 4, sink.data.length, 4) It really doesn't matter which way we go (forward/backward) but we must hit the right spots.I guess the idiomatic way is: foreach (pos; iota(0, sink.data.length, 4).retro)Mmm no that throws an assertion failure, since the guarantee of the regex is only to insert commas before every third digit *from the end*, but doing retro on an *ascending* iota fixes the end point at index zero, and start point at some multiple of four *from the beginning*, which isn't compatible with the regex's promise. Thus the correct approach would be to do a *descending* iota: foreach (pos; iota(sink.data.length - 4, 0, -4)) assert(sink.data[pos] == ',');Pull: https://github.com/D-Programming-Language/phobos/pull/3731 I am not much experienced with GitHub pull requests, so I hope I did that correctly. IIUC after you pull, it will be back in my fork and its clone once I pull back from you, then I can delete my local branch, right?Cool, will comment there.BTW since I still didn't like that wraparound thing, (and it would fail in some extreme corner cases where sink.data.length is > size_t.max - 3 or such, I think) I thought of it much and arrived at the much more convoluted: for (size_t pos = sink.data.length; pos > 3; ) { pos -= 4; assert(sink.data[pos] == ','); } which is still not easy for one's mind to "wraparound"! So the descending iota solution we arrived above is the most elegant. As I said, I'm still getting my "D-legs"...Welcome on board ;) -- Dmitry Olshansky
Oct 18 2015
Am 17.10.2015 um 12:06 schrieb Shriramana Sharma:Dmitry Olshansky wrote:How about going one step further: auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); foreach (line; stdin.byLine()) { replaceAllInto(stdout.lockingTextWriter, line, re, ","); writeln(); }Feel free to improve on the examples, I'd gladly pull such patches.How about this, using the "comifying" regex from the replaceAll example: auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); auto sink = appender!(char [])(); foreach (line; stdin.byLine()) { sink.clear(); replaceAllInto(sink, line, re, ","); writeln(sink.data()); }
Oct 19 2015
Sönke Ludwig wrote:auto re = ctRegex!(`(?<=\d)(?=(\d\d\d)+\b)`, "g"); foreach (line; stdin.byLine()) { replaceAllInto(stdout.lockingTextWriter, line, re, ","); writeln(); }Another option is to use KeepTerminator.yes. Using writeln() is shorter! :-) So this one does avoids all the extra allocations due to use of *Into, then? --
Oct 20 2015