
digitalmars.D - Behavior of strings with invalid unicode...

reply "monarch_dodra" <monarchdodra gmail.com> writes:
I made a commit that was meant to better certify what functions 
threw in UTF.

I thus noticed that some of our functions are unsafe. For 
example:

string s = [0b1100_0000]; // 1 byte of a 2-byte sequence
s.popFront();             // Assertion error because of the invalid
                          // slicing s[2 .. $]

"pop" is nothrow, so throwing exception is out of the question, 
and the implementation seems to imply that "invalid unicode 
sequences are removed".

This is a bug, right?
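For comparison, the fully decoding path does reject this input. A minimal sketch using std.utf.decode's documented throwing behavior (the helper decodeThrows is mine, not Phobos):

```d
import std.utf : decode, UTFException;

/// Hypothetical helper: true if decoding the first code point of s
/// throws a UTFException.
bool decodeThrows(string s)
{
    size_t i = 0;
    try { decode(s, i); } catch (UTFException) { return true; }
    return false;
}

void main()
{
    assert(decodeThrows("\xC0")); // lone lead byte of a 2-byte sequence
    assert(!decodeThrows("ab"));  // valid ASCII decodes fine
}
```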

--------
Things get more complicated if you take into account "partial 
invalidity". For example:

string s = [0b1100_0000, 'a', 'b'];

Here, the first byte actually starts an invalid sequence, since 
the second byte is not of the form 0b10XX_XXXX. What's more, the 
second byte ('a') is itself a valid sequence. We do not detect 
this, though, and create this output:
s.popFront(); => s == "b";
*arguably*, the correct behavior would be:
s.popFront(); => s == "ab";
Where only the single invalid first byte is removed.
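A sketch of what that *arguably* correct behavior could look like, written from scratch here (this is not Phobos code, just an illustration of popping only the single invalid lead byte):

```d
/// Hypothetical alternative (not Phobos): when the lead byte claims a
/// multi-byte sequence but the continuation bytes don't match
/// 0b10XX_XXXX, pop only the single invalid lead byte.
void popFrontValidating(ref string s)
{
    assert(s.length, "cannot popFront an empty string");
    immutable ubyte b = cast(ubyte) s[0];
    size_t len = 1;
    if (b < 0x80) len = 1;                              // ASCII
    else if ((b & 0b1110_0000) == 0b1100_0000) len = 2;
    else if ((b & 0b1111_0000) == 0b1110_0000) len = 3;
    else if ((b & 0b1111_1000) == 0b1111_0000) len = 4;
    // else: invalid lead byte, len stays 1
    foreach (i; 1 .. len)
        if (i >= s.length || (s[i] & 0b1100_0000) != 0b1000_0000)
        {
            len = 1; // bad or missing continuation: drop only the lead byte
            break;
        }
    s = s[len .. $];
}

void main()
{
    string s = "\xC0ab"; // invalid lead byte, then valid 'a', 'b'
    popFrontValidating(s);
    assert(s == "ab");   // only the single invalid byte was removed

    string t = "\xC3\xA9x"; // valid 2-byte 'é', then 'x'
    popFrontValidating(t);
    assert(t == "x");       // a valid sequence is popped whole
}
```

As the next paragraph notes, the extra continuation-byte check is exactly the cost being avoided.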

The problem is that doing this would actually be much more 
expensive, especially for a rare case. Worse yet, chances are you 
validate the same character again, and again (and again).

--------
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized "behavior to 
follow when decoding utf with invalid codes"?

2. Do we even really support invalid UTF after we "leave" the 
std.utf.decode layer? E.g. do we simply suppose that the string 
is valid?
Nov 21 2012
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
 So here are my 2 questions:
 1. Is there, or does anyone know of, a standardized "behavior to
 follow when decoding utf with invalid codes"?
 
 2. Do we even really support invalid UTF after we "leave" the
 std.utf.decode layer? EG: We simply suppose that the string is
 valid?
We don't support invalid unicode beyond providing ways to check for it and in some cases throwing if it's encountered. If you create a string with invalid unicode, then you're shooting yourself in the foot, and you could get weird results. Some code checks for validity and will throw when it's given invalid unicode (decode in particular does this), whereas some code will simply ignore the fact that it's invalid and move on (generally, because it's not bothering to go to the effort of validating it).

I believe that at the moment, the idea is that when the full decoding of a character occurs, a UTFException will be thrown if an invalid code point is encountered, whereas anything which partially decodes characters (e.g. just figures out how large a code point is) may or may not throw. popFront used to throw but doesn't any longer in an effort to make it faster, letting decode be the one to throw (so front would still throw, but popFront wouldn't).

I'm not aware of there being any standard way to deal with invalid Unicode, but I believe that popFront currently just treats invalid code points as being of length 1.

- Jonathan M Davis
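The split described here can be seen directly. A sketch assuming front throws via decode while popFront stays nothrow, as stated above (frontThrows is my helper, not Phobos):

```d
import std.range.primitives : front, popFront;
import std.utf : UTFException;

/// Hypothetical helper: true if front (which fully decodes) throws.
bool frontThrows(string s)
{
    try { cast(void) s.front; return false; }
    catch (UTFException) { return true; }
}

void main()
{
    string s = "\xC0ab";    // begins with an invalid lead byte
    assert(frontThrows(s)); // front decodes, so it throws
    s.popFront();           // popFront only computes the stride: no throw
    assert(s == "b");       // the claimed 2-byte stride was skipped
}
```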
Nov 21 2012
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 21 November 2012 at 18:25:56 UTC, Jonathan M Davis 
wrote:
 On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
 So here are my 2 questions:
 1. Is there, or does anyone know of, a standardized "behavior 
 to
 follow when decoding utf with invalid codes"?
 
 2. Do we even really support invalid UTF after we "leave" the
 std.utf.decode layer? EG: We simply suppose that the string is
 valid?
 We don't support invalid unicode beyond providing ways to check 
 for it and in some cases throwing if it's encountered. If you 
 create a string with invalid unicode, then you're shooting 
 yourself in the foot, and you could get weird results. Some code 
 checks for validity and will throw when it's given invalid 
 unicode (decode in particular does this), whereas some code will 
 simply ignore the fact that it's invalid and move on (generally, 
 because it's not bothering to go to the effort of validating it).
 
 I believe that at the moment, the idea is that when the full 
 decoding of a character occurs, a UTFException will be thrown if 
 an invalid code point is encountered, whereas anything which 
 partially decodes characters (e.g. just figures out how large a 
 code point is) may or may not throw. popFront used to throw but 
 doesn't any longer in an effort to make it faster, letting decode 
 be the one to throw (so front would still throw, but popFront 
 wouldn't).
OK: I guess that makes sense. I kind of wish there'd be more of a documented "two-level" scheme, but that should be fine.
 I'm not aware of there being any standard way to deal with 
 invalid Unicode,
 but I believe that popFront currently just treats invalid code 
 points as being
 of length 1.

 - Jonathan M Davis
Well, popFront pops only 1 element if the very first element is an invalid code point, but it will not "see" whether the later code units of a multi-byte sequence are invalid.

This kind of gives it a double-standard behavior, but I guess we have to draw a line somewhere.
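To make the double standard concrete, here is a from-scratch sketch mirroring the behavior described (not Phobos source; it pops a lone invalid lead byte by itself but never inspects continuation bytes):

```d
/// Mirrors the described behavior: an invalid lead byte is treated as a
/// length-1 sequence, but the continuation bytes of an apparently valid
/// lead byte are never checked.
void popFrontAsDescribed(ref string s)
{
    assert(s.length, "cannot popFront an empty string");
    immutable ubyte b = cast(ubyte) s[0];
    size_t len;
    if (b < 0x80) len = 1;                              // ASCII
    else if ((b & 0b1110_0000) == 0b1100_0000) len = 2;
    else if ((b & 0b1111_0000) == 0b1110_0000) len = 3;
    else if ((b & 0b1111_1000) == 0b1111_0000) len = 4;
    else len = 1;             // invalid lead byte: popped alone
    if (len > s.length) len = s.length;
    s = s[len .. $];          // continuation bytes are trusted blindly
}

void main()
{
    string a = "\x80ab"; // bare continuation byte: invalid as lead
    popFrontAsDescribed(a);
    assert(a == "ab");   // popped alone, as a length-1 sequence

    string b = "\xC0ab"; // lead byte claims 2 bytes; 'a' is never checked
    popFrontAsDescribed(b);
    assert(b == "b");    // the bogus 2-byte stride was skipped
}
```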
Nov 25 2012
parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday, November 26, 2012 08:47:48 monarch_dodra wrote:
 OK: I guess that makes sense. I kind of wish there'd be more of
 a documented "two-level" scheme, but that should be fine.
It's pretty much grown over time and isn't necessarily applied consistently.
 Well, popFront only pops 1 element only if the very first element
 of is an invalid code point, but will not "see" if the code point
 at index 2 is invalid for multi-byte codes.
 
 This kind of gives it a double-standard behavior, but I guess we
 have to draw a line somewhere.
We care about making popFront as fast as possible, and in general, front is called on the character as well (making the whole way that front and popFront work for strings naturally inefficient, unfortunately), so it makes sense to skip the checking as much as possible in popFront. It's basically doing the best that it can to be as fast as it can, so any checking that it doesn't need to do is best skipped. Speed wins over correctness here, and anything that we can do to make it faster is desirable.

It's not perfect that way, but since in most cases the Unicode will be correct, and the correctness is generally checked by front (or decode), it was deemed to be the best approach.

- Jonathan M Davis
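The usage pattern being optimized for looks like this: in a generic range loop, front is called on (nearly) every element anyway, so validation naturally lands there. A sketch (countCodePoints is my illustration, not a Phobos function):

```d
import std.range.primitives : empty, front, popFront;

/// Counts code points the way generic range code iterates a string:
/// front decodes (and validates); popFront merely advances.
size_t countCodePoints(string s)
{
    size_t n;
    for (; !s.empty; s.popFront()) // popFront: stride only, no validation
    {
        cast(void) s.front;        // any validation cost is paid here
        ++n;
    }
    return n;
}

void main()
{
    // "héllo" is 6 bytes of UTF-8 but 5 code points
    assert(countCodePoints("héllo") == 5);
}
```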
Nov 26 2012