
digitalmars.D - Behavior of strings with invalid unicode...

reply "monarch_dodra" <monarchdodra gmail.com> writes:
I made a commit that was meant to better certify what functions 
threw in UTF.

I thus noticed that some of our functions are unsafe. For 
example:

string s = [0b1100_0000]; // 1 byte of a 2-byte sequence
s.popFront();             // Assertion error because of the invalid
                          // slicing s[2 .. $]

"pop" is nothrow, so throwing exception is out of the question, 
and the implementation seems to imply that "invalid unicode 
sequences are removed".

This is a bug, right?
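For comparison, the fully decoding path does reject this input. A minimal sketch using std.utf.decode's documented throwing behavior (the helper decodeThrows is mine, not Phobos):

```d
import std.utf : decode, UTFException;

/// Hypothetical helper: true if decoding the first code point of s
/// throws a UTFException.
bool decodeThrows(string s)
{
    size_t i = 0;
    try { decode(s, i); } catch (UTFException) { return true; }
    return false;
}

void main()
{
    assert(decodeThrows("\xC0")); // lone lead byte of a 2-byte sequence
    assert(!decodeThrows("ab"));  // valid ASCII decodes fine
}
```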

--------
Things get more complicated if you take into account "partial 
invalidity". For example:

string s = [0b1100_0000, 'a', 'b'];

Here, the first byte actually starts an invalid sequence, since 
the second byte is not of the form 0b10XX_XXXX. What's more, the 
second byte ('a') is itself a valid sequence. We do not detect 
this, though, and create this output:
s.popFront(); => s == "b";
*arguably*, the correct behavior would be:
s.popFront(); => s == "ab";
Where only the single invalid first byte is removed.
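A sketch of what that *arguably* correct behavior could look like, written from scratch here (this is not Phobos code, just an illustration of popping only the single invalid lead byte):

```d
/// Hypothetical alternative (not Phobos): when the lead byte claims a
/// multi-byte sequence but the continuation bytes don't match
/// 0b10XX_XXXX, pop only the single invalid lead byte.
void popFrontValidating(ref string s)
{
    assert(s.length, "cannot popFront an empty string");
    immutable ubyte b = cast(ubyte) s[0];
    size_t len = 1;
    if (b < 0x80) len = 1;                              // ASCII
    else if ((b & 0b1110_0000) == 0b1100_0000) len = 2;
    else if ((b & 0b1111_0000) == 0b1110_0000) len = 3;
    else if ((b & 0b1111_1000) == 0b1111_0000) len = 4;
    // else: invalid lead byte, len stays 1
    foreach (i; 1 .. len)
        if (i >= s.length || (s[i] & 0b1100_0000) != 0b1000_0000)
        {
            len = 1; // bad or missing continuation: drop only the lead byte
            break;
        }
    s = s[len .. $];
}

void main()
{
    string s = "\xC0ab"; // invalid lead byte, then valid 'a', 'b'
    popFrontValidating(s);
    assert(s == "ab");   // only the single invalid byte was removed

    string t = "\xC3\xA9x"; // valid 2-byte 'é', then 'x'
    popFrontValidating(t);
    assert(t == "x");       // a valid sequence is popped whole
}
```

As the next paragraph notes, the extra continuation-byte check is exactly the cost being avoided.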

The problem is that doing this would actually be much more 
expensive, especially for a rare case. Worse yet, chances are you 
validate the same character again, and again (and again).

--------
So here are my 2 questions:
1. Is there, or does anyone know of, a standardized "behavior to 
follow when decoding utf with invalid codes"?

2. Do we even really support invalid UTF after we "leave" the 
std.utf.decode layer? E.g. do we simply suppose that the string 
is valid?
Nov 21 2012
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
 So here are my 2 questions:
 1. Is there, or does anyone know of, a standardized "behavior to
 follow when decoding utf with invalid codes"?
 
 2. Do we even really support invalid UTF after we "leave" the
 std.utf.decode layer? EG: We simply suppose that the string is
 valid?
We don't support invalid unicode beyond providing ways to check for it and in some cases throwing if it's encountered. If you create a string with invalid unicode, then you're shooting yourself in the foot, and you could get weird results. Some code checks for validity and will throw when it's given invalid unicode (decode in particular does this), whereas some code will simply ignore the fact that it's invalid and move on (generally, because it's not bothering to go to the effort of validating it).

I believe that at the moment, the idea is that when the full decoding of a character occurs, a UTFException will be thrown if an invalid code point is encountered, whereas anything which partially decodes characters (e.g. just figures out how large a code point is) may or may not throw. popFront used to throw but doesn't any longer in an effort to make it faster, letting decode be the one to throw (so front would still throw, but popFront wouldn't).

I'm not aware of there being any standard way to deal with invalid Unicode, but I believe that popFront currently just treats invalid code points as being of length 1.

- Jonathan M Davis
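The split described here can be seen directly. A sketch assuming front throws via decode while popFront stays nothrow, as stated above (frontThrows is my helper, not Phobos):

```d
import std.range.primitives : front, popFront;
import std.utf : UTFException;

/// Hypothetical helper: true if front (which fully decodes) throws.
bool frontThrows(string s)
{
    try { cast(void) s.front; return false; }
    catch (UTFException) { return true; }
}

void main()
{
    string s = "\xC0ab";    // begins with an invalid lead byte
    assert(frontThrows(s)); // front decodes, so it throws
    s.popFront();           // popFront only computes the stride: no throw
    assert(s == "b");       // the claimed 2-byte stride was skipped
}
```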
Nov 21 2012
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 21 November 2012 at 18:25:56 UTC, Jonathan M Davis 
wrote:
 On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
 So here are my 2 questions:
 1. Is there, or does anyone know of, a standardized "behavior 
 to
 follow when decoding utf with invalid codes"?
 
 2. Do we even really support invalid UTF after we "leave" the
 std.utf.decode layer? EG: We simply suppose that the string is
 valid?
 We don't support invalid unicode beyond providing ways to check 
 for it and in some cases throwing if it's encountered. If you 
 create a string with invalid unicode, then you're shooting 
 yourself in the foot, and you could get weird results. Some code 
 checks for validity and will throw when it's given invalid 
 unicode (decode in particular does this), whereas some code will 
 simply ignore the fact that it's invalid and move on (generally, 
 because it's not bothering to go to the effort of validating it).
 
 I believe that at the moment, the idea is that when the full 
 decoding of a character occurs, a UTFException will be thrown if 
 an invalid code point is encountered, whereas anything which 
 partially decodes characters (e.g. just figures out how large a 
 code point is) may or may not throw. popFront used to throw but 
 doesn't any longer in an effort to make it faster, letting decode 
 be the one to throw (so front would still throw, but popFront 
 wouldn't).
OK: I guess that makes sense. I kind of wish there'd be more of a documented "two-level" scheme, but that should be fine.
 I'm not aware of there being any standard way to deal with 
 invalid Unicode,
 but I believe that popFront currently just treats invalid code 
 points as being
 of length 1.

 - Jonathan M Davis
Well, popFront pops only 1 element if the very first element is an invalid code point, but it will not "see" whether the later code units of a multi-byte sequence are invalid.

This kind of gives it a double-standard behavior, but I guess we have to draw a line somewhere.
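To make the double standard concrete, here is a from-scratch sketch mirroring the behavior described (not Phobos source; it pops a lone invalid lead byte by itself but never inspects continuation bytes):

```d
/// Mirrors the described behavior: an invalid lead byte is treated as a
/// length-1 sequence, but the continuation bytes of an apparently valid
/// lead byte are never checked.
void popFrontAsDescribed(ref string s)
{
    assert(s.length, "cannot popFront an empty string");
    immutable ubyte b = cast(ubyte) s[0];
    size_t len;
    if (b < 0x80) len = 1;                              // ASCII
    else if ((b & 0b1110_0000) == 0b1100_0000) len = 2;
    else if ((b & 0b1111_0000) == 0b1110_0000) len = 3;
    else if ((b & 0b1111_1000) == 0b1111_0000) len = 4;
    else len = 1;             // invalid lead byte: popped alone
    if (len > s.length) len = s.length;
    s = s[len .. $];          // continuation bytes are trusted blindly
}

void main()
{
    string a = "\x80ab"; // bare continuation byte: invalid as lead
    popFrontAsDescribed(a);
    assert(a == "ab");   // popped alone, as a length-1 sequence

    string b = "\xC0ab"; // lead byte claims 2 bytes; 'a' is never checked
    popFrontAsDescribed(b);
    assert(b == "b");    // the bogus 2-byte stride was skipped
}
```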
Nov 25 2012
parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Monday, November 26, 2012 08:47:48 monarch_dodra wrote:
 OK: I guess that makes sense. I kind of wish there'd be more of
 a documented "two-level" scheme, but that should be fine.
It's pretty much grown over time and isn't necessarily applied consistently.
 Well, popFront only pops 1 element only if the very first element
 of is an invalid code point, but will not "see" if the code point
 at index 2 is invalid for multi-byte codes.
 
 This kind of gives it a double-standard behavior, but I guess we
 have to draw a line somewhere.
We care about making popFront as fast as possible, and in general, front is called on the character as well (making the whole way that front and popFront work for strings naturally inefficient, unfortunately), so it makes sense to skip the checking as much as possible in popFront. It's basically doing the best that it can to be as fast as it can, so any checking that it doesn't need to do is best skipped. Speed wins over correctness here, and anything that we can do to make it faster is desirable.

It's not perfect that way, but since in most cases the Unicode will be correct, and the correctness is generally checked by front (or decode), it was deemed to be the best approach.

- Jonathan M Davis
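The usage pattern being optimized for looks like this: in a generic range loop, front is called on (nearly) every element anyway, so validation naturally lands there. A sketch (countCodePoints is my illustration, not a Phobos function):

```d
import std.range.primitives : empty, front, popFront;

/// Counts code points the way generic range code iterates a string:
/// front decodes (and validates); popFront merely advances.
size_t countCodePoints(string s)
{
    size_t n;
    for (; !s.empty; s.popFront()) // popFront: stride only, no validation
    {
        cast(void) s.front;        // any validation cost is paid here
        ++n;
    }
    return n;
}

void main()
{
    // "héllo" is 6 bytes of UTF-8 but 5 code points
    assert(countCodePoints("héllo") == 5);
}
```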
Nov 26 2012