digitalmars.D - [challenge] can you break wstring's back?
- Steven Schveighoffer (25/25) Nov 23 2010 I am working on a string implementation that enforces the correct
- stephan (14/39) Nov 24 2010 Here you go
- Steven Schveighoffer (5/6) Nov 24 2010 [snip]
- spir (14/23) Nov 24 2010 If the same logic is supposed tell the nr of code units of char & dchar ...
I am working on a string implementation that enforces the correct restrictions on a string (bi-directional range, etc), and I came across what I feel is a bug. However, I don't know enough about UTF to construct a test case to prove it wrong.

In std.array, there are separate functions for array.popBack(), depending on whether the array is a char[], a wchar[], or any other array type. The char[] and wchar[] popBacks are drastically different. However, there is only one back() function for narrow strings, which supposedly handles both char[] and wchar[]. It looks like it will parse 1, 2, 3, or 4 elements depending on the bit pattern, and it's only looking at the least significant 8 bits of the elements to determine this.

Does this make sense for wstring? I would think the wstring has a different way of decoding data than the string, otherwise why the two different popBacks? I don't know how to construct a string which shows there is an issue. Is there one? If so, can you prove it with a unit test?

Hint: the bit pattern at the end of the string must 'trick' the function into using the wrong number of elements, because patterns that happen to match the correct number of elements will not cause an error (after deciding how many elements to decode, the data is passed to the decode function, which should do the right thing).

As a bonus, can you write a correct wstring.back function so I can include it in my string struct? :)

-Steve
Nov 23 2010
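The hint above can be made concrete with a small sketch. It assumes, as the post describes, that back() classifies a code unit by a UTF-8 style test on its low 8 bits; `looksLikeUtf8Continuation` is a hypothetical helper written for illustration, not the actual Phobos source.

```d
// Hypothetical helper: the UTF-8 continuation test (10xxxxxx) applied to
// the low 8 bits of a code unit. An assumption, not Phobos code.
bool looksLikeUtf8Continuation(uint unit)
{
    return (unit & 0xC0) == 0x80;
}

void main()
{
    // U+10000 is encoded in UTF-16 as the surrogate pair 0xD800 0xDC00.
    wstring ws = "\U00010000"w;
    assert(ws.length == 2);
    assert(ws[0] == 0xD800 && ws[1] == 0xDC00);

    // The low byte of the last unit is 0x00, so a UTF-8 style test treats
    // it as a complete one-element character and never looks at ws[0].
    assert(!looksLikeUtf8Continuation(ws[1]));

    // A UTF-16 test has to inspect the whole 16-bit unit instead.
    assert(ws[1] >= 0xDC00 && ws[1] <= 0xDFFF);
}
```

If the check really works this way, any wstring ending in a surrogate pair whose last unit has a low byte outside 0x80..0xBF would be cut one element short before decoding.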
On 24.11.2010 04:08, Steven Schveighoffer wrote:
> I am working on a string implementation that enforces the correct restrictions on a string (bi-directional range, etc), and I came across what I feel is a bug. However, I don't know enough about utf to construct a test case to prove it wrong.
>
> In std.array, there are separate functions for array.popBack(), depending on whether the array is a char[], a wchar[], or any other array type. The char[] and wchar[] popBacks are drastically different. However, there is only one back() function for narrow strings which supposedly handles both char[] and wchar[]. It looks like it will parse 1, 2, 3, or 4 elements depending on the bit pattern, and it's only looking at the least significant 8 bits of the elements to determine this.
>
> Does this make sense for wstring? I would think the wstring has a different way of decoding data than the string, otherwise why the two different popBacks? I don't know how to construct a string which shows there is an issue, is there one? If so, can you prove it with a unit test?

Here you go

    import std.array;
    import std.conv;

    void main()
    {
        dchar c = cast(dchar) 0x10000;
        auto ws = to!wstring(c);
        assert(ws.length == 2);  // encoded as surrogate pair
        assert(ws.back == c);    // fails with decoding error
    }

> Hint, the bit pattern of the end of the string must 'trick' the function into using the wrong number of elements, because ones that happen to match the correct number of elements needed will not cause an error (after deciding how many elements to decode, the data is passed to the decode function, which should do the right thing).
>
> As a bonus, can you write a correct wstring.back function so I can include it in my string struct? :)

Use the same logic as in popBack for wstring, i.e. check whether the last wchar is the low (trailing) part of a surrogate pair (i.e. between 0xDC00 and 0xDFFF inclusive). If yes, two wchars are needed to decode to dchar. Otherwise, only one is needed.

> -Steve
Nov 24 2010
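The rule described in the reply above could be sketched roughly as follows. `backUtf16` is a hypothetical name chosen for illustration, and this is a minimal sketch built on `std.utf.decode`, not the fix that went into Phobos.

```d
import std.utf : decode;

// Hypothetical helper: returns the last code point of a UTF-16 string.
// A sketch under the assumptions stated above, not the Phobos source.
dchar backUtf16(const(wchar)[] s)
{
    assert(s.length > 0, "back of an empty string");
    size_t i = s.length - 1;

    // A trailing (low) surrogate, 0xDC00..0xDFFF, means the character
    // started one code unit earlier, at the leading (high) surrogate.
    if (i > 0 && s[i] >= 0xDC00 && s[i] <= 0xDFFF)
        --i;

    // decode validates the units from index i on and throws on bad input.
    return decode(s, i);
}

void main()
{
    assert(backUtf16("a\U00010000"w) == 0x10000); // surrogate pair at the end
    assert(backUtf16("ab"w) == 'b');              // single code unit
}
```

Leaving validation to `decode` keeps the helper short: an unpaired surrogate at the end is reported as a decoding error instead of being returned silently.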
On Wed, 24 Nov 2010 03:46:26 -0500, stephan <none example.com> wrote:
> Here you go
> [snip]

Thank you very much :)

http://d.puremagic.com/issues/show_bug.cgi?id=5265

-Steve
Nov 24 2010
On Tue, 23 Nov 2010 22:08:04 -0500 "Steven Schveighoffer" <schveiguy yahoo.com> wrote:
> However, there is only one back() function for narrow strings which supposedly handles both char[] and wchar[].

If the same logic is supposed to tell the number of code units of char & dchar, then it is certainly a bug.

> It looks like it will parse 1, 2, 3, or 4 elements depending on the bit pattern, and it's only looking at the least significant 8 bits of the elements to determine this. Does this make sense for wstring?

This is what to do for utf8: recomposing a whole code point / char from a variable number of code units. Stephan answered about utf16 (wchar).

> I would think the wstring has a different way of decoding data than the string, otherwise why the two different popBacks?

You are right.

Denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com
Nov 24 2010
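For comparison, the utf8 case mentioned in the last reply can be sketched the same way: step back over continuation bytes until the lead byte is found, then decode from there. `backUtf8` is a hypothetical name, and this is only an illustration of the idea, not the Phobos code.

```d
import std.exception : enforce;
import std.utf : decode;

// Hypothetical helper: returns the last code point of a UTF-8 string.
// A sketch for illustration, not the Phobos source.
dchar backUtf8(const(char)[] s)
{
    assert(s.length > 0, "back of an empty string");
    size_t i = s.length - 1;

    // Continuation bytes look like 10xxxxxx; a code point has at most
    // three of them, so step back at most three times.
    size_t steps = 0;
    while (i > 0 && steps < 3 && (s[i] & 0xC0) == 0x80)
    {
        --i;
        ++steps;
    }

    immutable c = decode(s, i); // advances i past the decoded sequence
    enforce(i == s.length, "malformed UTF-8 at end of string");
    return c;
}

void main()
{
    assert(backUtf8("a\u00E9") == 0xE9);       // 2-byte sequence
    assert(backUtf8("\U00010000") == 0x10000); // 4-byte sequence
    assert(backUtf8("ab") == 'b');             // ASCII
}
```

Here the test on the top two bits of each byte is the right one, which is why the same test cannot be reused for 16-bit code units.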