digitalmars.D - std.string.reverse() for mutable array of chars
Reversing an array of chars/wchars is a common enough operation (mutable arrays often come from earlier operations that built them). Currently std.algorithm.reverse() can't be used:

    import std.algorithm;
    void main() {
        dchar[] s1 = "hello"d.dup;
        s1.reverse(); // OK
        wchar[] s2 = "hello"w.dup;
        s2.reverse(); // error
        char[] s3 = "hello".dup;
        s3.reverse(); // error
    }

I suggest adding a char[]/wchar[] specialization to std.algorithm.reverse() (or adding a std.string.reverse()), to make it work on those types too.

Generally the functions in std.algorithm don't work on UTF-8/UTF-16 arrays because of the variable length of their items, but for this specific algorithm I think this is not a problem because:

1) Reversing an array is an O(n) operation, and decoding UTF adds only a constant overhead per item, so the computational complexity of reverse doesn't change.

2) If you reverse a char[] or wchar[] the result will fit in the input array (is this always true? Please tell me if this isn't true). It "just" needs to correctly swap the bytes of multi-byte chars, and to swap combining code points correctly too.

- - - - - - - - - - - - - - - - - -

And I think std.algorithm.reverse() is sometimes buggy on a dchar[] (UTF-32):

    import std.algorithm: reverse;
    void main() {
        dchar[] txt = "\U00000041\U00000308\U00000042"d.dup;
        txt.reverse();
        assert(txt == "\U00000042\U00000308\U00000041"d);
    }

txt contains LATIN CAPITAL LETTER A, COMBINING DIAERESIS, LATIN CAPITAL LETTER B (see bug 7084 for more details).

A correct output for reversing txt is (LATIN CAPITAL LETTER B, LATIN CAPITAL LETTER A, COMBINING DIAERESIS):

    "\U00000042\U00000041\U00000308"d

For some code, see:
http://stackoverflow.com/questions/199260/how-do-i-reverse-a-utf-8-string-in-place

See also:
http://d.puremagic.com/issues/show_bug.cgi?id=7085

Regarding the printing of Unicode strings see also:
http://d.puremagic.com/issues/show_bug.cgi?id=7084

Bye,
bearophile
Dec 09 2011
On Friday, December 09, 2011 04:46:35 bearophile wrote:
> I suggest to add a char[]/wchar[] specialization to std.algorithm.reverse() (or to add a std.string.reverse()), to make it work on those types too.

If you want to reverse a char[], then cast it to ubyte[] and reverse that. If you want to reverse a wchar[], then cast it to ushort[] and reverse that. In Phobos, strings are ranges of dchar, so reverse is going to reverse code points. If you want it to reverse code units instead, then you just use the appropriate cast. There's no reason to have it reverse the code units and completely mess up Unicode strings.

> And I think std.algorithm.reverse() is sometimes buggy on a dchar[] (UTF32)

reverse on a dchar[] is completely correct. It's reversing the code points, _not_ the graphemes. If you want to operate on graphemes, you need a range of graphemes, which Phobos does not yet support. Once it does (or if you implement it yourself), you can reverse a string based on graphemes if that's what you want to do. But as it stands, ranges of code points are the most advanced Unicode construct that Phobos currently supports, so that's what its functions are going to operate on.

- Jonathan M Davis
Dec 09 2011
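To make the code unit versus code point distinction concrete, here is a minimal sketch of what the suggested cast does. It assumes a Phobos that provides std.utf.validate and std.exception.assertThrown. The variable names are illustrative only.

```d
import std.algorithm : reverse;
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // Pure ASCII: one code unit per code point, so reversing bytes is harmless.
    char[] ascii = "hello".dup;
    reverse(cast(ubyte[]) ascii);
    assert(ascii == "olleh");

    // U+00E9 is the two bytes C3 A9 in UTF-8. Reversing the raw bytes gives
    // A9 C3 68, which begins with a continuation byte and is not valid UTF-8.
    char[] accented = "h\u00E9".dup;
    reverse(cast(ubyte[]) accented);
    assertThrown!UTFException(validate(accented));
}
```

The cast is therefore only equivalent to a code point reversal when every code point occupies a single code unit.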
Jonathan M Davis:
> dchar[] is completely correct. It's reversing the code points, _not_ the graphemes.

OK. Maybe I will open a differently worded enhancement request, for a grapheme-aware std.string.

> If you want to reverse a char[], then cast it to ubyte[] and reverse that. If you want to reverse a wchar[], then cast it to ushort[] and reverse that. In Phobos, strings are ranges of dchar, so reverse is going to reverse code points. If you want it to reverse code units instead, then you just use the appropriate cast. There's no reason to have it reverse the code units and completely mess up unicode strings.

I am not interested in reversing code units. Sorry if my post gave that impression. For this specific problem I am not going to cast to ubyte[] or ushort[] because it gives very wrong results. It's possible to write a "correct" reverse (one that doesn't take graphemes into account) without casts, keeping the array as char[] or wchar[]: reverse all the bytes, then reverse the bytes of each variable-length code point. This is what I was asking of an in-place reverse().

Bye,
bearophile
Dec 09 2011
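The two-pass idea described in the post above (reverse all the bytes, then reverse the bytes of each code point) can be sketched as follows. This is an illustration under stated assumptions, not the Phobos implementation: reverseCodePoints is a hypothetical helper name, the input is assumed to be valid UTF-8, and the cast to ubyte[] is used only to reach the raw bytes.

```d
import std.algorithm : reverse;

/// Hypothetical helper (not a Phobos function): reverses the code points of
/// a valid UTF-8 array in place, using O(1) extra memory.
void reverseCodePoints(char[] s)
{
    auto bytes = cast(ubyte[]) s;

    // Pass 1: reverse every byte. Each multi-byte sequence is now stored
    // backwards: continuation bytes first, lead byte last.
    reverse(bytes);

    // Pass 2: restore the byte order inside each sequence.
    size_t start = 0;
    for (size_t i = 0; i < bytes.length; ++i)
    {
        // Continuation bytes have the form 0b10xxxxxx. Any other byte is an
        // ASCII byte or a lead byte, which closes the current sequence.
        if ((bytes[i] & 0xC0) != 0x80)
        {
            reverse(bytes[start .. i + 1]);
            start = i + 1;
        }
    }
}

void main()
{
    char[] s = "h\u00E9llo \u20AC".dup;
    reverseCodePoints(s);
    assert(s == "\u20AC oll\u00E9h");
}
```

Both passes touch each byte a bounded number of times, so the running time stays O(n), and since the total byte count does not change, the result always fits in the input array.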
On Friday, December 09, 2011 05:58:40 bearophile wrote:
> OK. Maybe I will open a differently worded enhancement request, for a grapheme-aware std.string.

I don't expect that std.string will _ever_ be grapheme-aware or be processed by default as a range of graphemes. That's far too expensive as far as performance goes. Rather, we're likely to have a wrapper and/or separate range-type which handles graphemes. Then if you want the extra correctness and are willing to pay the cost, you use that. As I understand it, std.regex does have the beginnings of such, but we do still need to have a range type of some variety (probably in std.utf) which fully supports graphemes.

- Jonathan M Davis
Dec 09 2011
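For readers on a later Phobos: a grapheme range did eventually appear, in std.uni rather than std.utf. The sketch below assumes such a release and uses std.uni.byGrapheme and std.uni.byCodePoint. The helper name reverseGraphemes is hypothetical, and the function allocates a new string rather than working in place.

```d
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

/// Hypothetical helper (not a Phobos function): returns a new string with
/// the graphemes of s in reverse order.
string reverseGraphemes(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    // LATIN CAPITAL LETTER A, COMBINING DIAERESIS, LATIN CAPITAL LETTER B
    string txt = "\U00000041\U00000308\U00000042";
    // The combining mark stays attached to the A.
    assert(reverseGraphemes(txt) == "\U00000042\U00000041\U00000308");
}
```

This is the opt-in, pay-for-what-you-use shape described in the post above: the default reverse stays at the code point level, and grapheme handling goes through a separate range.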
Jonathan M Davis:
> I don't expect that std.string will _ever_ be grapheme-aware or be processed by default as a range of graphemes.

OK, let's forget about graphemes in this discussion. I am no longer asking for a grapheme-aware reverse.

Bye,
bearophile
Dec 09 2011
On Friday, December 09, 2011 06:15:04 bearophile wrote:
> OK, let's forget about graphemes in this discussion. Now I am not asking for a grapheme-aware reverse.

It sounded like you were, because you were complaining about how reverse on a dchar[] did an exact reversal of the code points rather than taking combining code points into account.

- Jonathan M Davis
Dec 09 2011
Jonathan M Davis:
> It sounded like you were,

Right, I was :-) But you changed my mind when you explained to me that nothing in std.algorithm is grapheme-aware. So I have reduced the scope of what I am asking for.

Bye,
bearophile
Dec 09 2011
On Friday, December 09, 2011 12:44:37 bearophile wrote:
> Right, I was :-) But you have changed my mind when you have explained me that nothing in std.algorithm is grapheme-aware. So I have reduced the amount of what I am asking for.

So, now you're asking that char and wchar arrays be reversible with reverse such that their code points are reversed (i.e. the result is the same as if you reversed an array of dchar). Well, I'm not sure that you can actually do that with the same efficiency. I'd have to think about it more. Regardless, the implementation would be _really_ complicated in comparison to how reverse works right now.

char[] and wchar[] don't work with reverse, because their elements aren't swappable. So, you can't just swap elements as you iterate in from both ends. You'd have to be moving stuff off into temporaries as you swapped them, because the code point on one side wouldn't necessarily fit where the code point on the other side was, and in the worst case (i.e. all of the code points on one half of the string are multiple code units and all of those on the other side are single code units), you'd pretty much end up having to copy half the array while you waited for enough space to open up on one side to fit the characters from the other side. So, regardless of whether it has the same computational complexity as the current reverse, its memory requirements would be far greater.

I don't think that the request is completely unreasonable, but I'm also not sure it's acceptable for reverse to change its performance characteristics as much as would be required for it to work with arrays of char or wchar - particularly with regards to how much memory would be required. In general, the performance characteristics of the algorithms in Phobos don't vary much with regards to the type that's used. I'm pretty sure that in terms of big-O notation, the memory complexity wouldn't match (though I don't recall exactly how big-O notation works with memory rather than computational complexity).

- Jonathan M Davis
Dec 09 2011
On Friday, December 09, 2011 13:27:01 Jonathan M Davis wrote:
> Well, I'm not sure that you can actually do that with the same efficiency. I'd have to think about it more. Regardless, the implementation would be _really_ complicated in comparison to how reverse works right now.

Well, it looks like Andrei had some free time this morning and figured it out. He has a pull request for it:

https://github.com/D-Programming-Language/phobos/pull/359

- Jonathan M Davis
Dec 09 2011
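As a usage note for later readers: Phobos releases after this thread accept mutable char[] and wchar[] in std.algorithm's reverse and reverse them by code point. The sketch below assumes such a release. Whether that behavior corresponds exactly to the pull request linked above is not something the thread itself establishes.

```d
import std.algorithm : reverse;

void main()
{
    // UTF-8: U+00E9 takes two code units, U+20AC takes three.
    char[] s3 = "h\u00E9llo \u20AC".dup;
    reverse(s3);
    assert(s3 == "\u20AC oll\u00E9h");

    // UTF-16: U+10143 is stored as a surrogate pair.
    wchar[] s2 = "a\U00010143b"w.dup;
    reverse(s2);
    assert(s2 == "b\U00010143a"w);
}
```

With such a release, the two lines marked "error" in the first post of the thread compile and give the same result as reversing the equivalent dchar[].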
Jonathan M Davis:
> Well, it looks like Andrei had some free time this morning and figured it out. He has a pull request for it:

Thank you, Andrei Alexandrescu and Jonathan :-)

Bye,
bearophile
Dec 09 2011