digitalmars.D - Handling invalid UTF sequences


Walter Bright <newshound2 digitalmars.com> writes:

Currently we do it by throwing a UTFException. This has problems:

1. about anything that deals with UTF cannot be made nothrow

2. turns innocuous errors into major problems, such as DOS attack vectors
http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

One option to fix this is to treat invalid sequences as:

1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

2. U+FFFD

I kinda like option 1.

What do you think?
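
For readers who want to see the three behaviours side by side, here is a minimal sketch. It uses Python's built-in UTF-8 codec as a stand-in, not D's std.utf, so every name in it is illustrative. The `init_value` error handler is a hypothetical name registered only for this example; it models option 1 by emitting U+FFFF (the value the post gives for UTF-32), while the built-in `replace` handler models option 2.

```python
import codecs

def init_value_handler(exc):
    # Hypothetical handler modelling option 1: emit U+FFFF for the
    # offending bytes and resume decoding right after them.
    if isinstance(exc, UnicodeDecodeError):
        return ("\uffff", exc.end)
    raise exc

codecs.register_error("init_value", init_value_handler)

data = b"ab\xffcd"  # 0xFF never appears in well-formed UTF-8

# Current behaviour in the thread's terms: decoding throws.
try:
    data.decode("utf-8")
    threw = False
except UnicodeDecodeError:
    threw = True

option1 = data.decode("utf-8", errors="init_value")  # .init-style marker
option2 = data.decode("utf-8", errors="replace")     # U+FFFD

print(threw, ascii(option1), ascii(option2))
```

Either substitution lets the surrounding code run without an exception path; the two options differ only in which marker ends up in the decoded text.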

Mar 20 2014

"deadalnix" <deadalnix gmail.com> writes:

On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
 Currently we do it by throwing a UTFException. This has 
 problems:

 1. about anything that deals with UTF cannot be made nothrow

 2. turns innocuous errors into major problems, such as DOS 
 attack vectors
 http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

 One option to fix this is to treat invalid sequences as:

 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

 2. U+FFFD

 I kinda like option 1.

 What do you think?

Hiding errors under the carpet is not a good strategy. These 
sequences are invalid, and doomed to explode at some point. I'm 
not sure what the solution is, but the .init one does not seem 
like the right one to me.

Mar 20 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
 Currently we do it by throwing a UTFException. This has 
 problems:

 1. about anything that deals with UTF cannot be made nothrow

 2. turns innocuous errors into major problems, such as DOS 
 attack vectors
 http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

 One option to fix this is to treat invalid sequences as:

 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

 2. U+FFFD

 I kinda like option 1.

 What do you think?

I had thought of this before, and had an idea along the lines of:
1. strings "inside" the program are always valid.
2. encountering invalid strings "inside" the program  is an Error.
3. strings from the "outside" world must be validated before use.

The advantage is *more* than just a nothrow guarantee, but also a 
performance guarantee in release. And it *is* a pretty sane 
approach to the problem:
- User data: validate before use.
- Internal data: if it's bad, your program is in a failure state.
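
A rough sketch of that split, written in Python rather than D purely for illustration: raw bytes from the outside world pass through one validating function, and everything past that boundary assumes valid text. The function names `from_outside` and `count_vowels` are made up for this example.

```python
def from_outside(raw: bytes) -> str:
    """Boundary check: validate once, raising UnicodeDecodeError on bad input."""
    return raw.decode("utf-8")  # strict by default

def count_vowels(text: str) -> int:
    # Internal code: takes validity for granted and never re-checks it.
    return sum(ch in "aeiou" for ch in text)

text = from_outside(b"caf\xc3\xa9 au lait")  # well-formed UTF-8
vowels = count_vowels(text)

try:
    from_outside(b"caf\xc3")  # truncated two-byte sequence
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(vowels, rejected)
```

Under this scheme only the boundary function can fail on malformed input, which is what would let the internal functions carry a nothrow-style guarantee.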

----

As for your proposal, I can't really say. Silently accepting 
invalid sequences sounds nice at first, but it's kind of just 
squelching the problem, isn't it?

----

In any case, both proposals would be major breaking changes...

Mar 20 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 3/20/2014 3:51 PM, monarch_dodra wrote:
 In any case, both proposals would be major breaking changes...

Or we could do this as alternate names, leaving the originals as throwing.

 Silently accepting invalid sequences sounds nice at first, but it's kind of
 just squelching the problem, isn't it?

Not exactly. The decoded/encoded string will still have invalid code units in 
it. It'd be like floating-point NaN: the invalid bits will still be propagated 
onwards to the output.

I'm also of the belief that UTF sequences should be validated on input, not 
necessarily on every operation on them.
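
The NaN analogy can be sketched as follows. This is a Python illustration, not D code, and it uses U+FFFD as the marker because that is what Python's `replace` handler emits; the point is only that the marker survives later processing and is still visible in the output.

```python
raw = b"caf\xe9 ok"  # 0xE9 is Latin-1 'e acute', malformed as UTF-8 here

# Decode without throwing; the bad byte becomes a marker code point.
text = raw.decode("utf-8", errors="replace")

# Ordinary string processing neither trips over the marker nor removes it.
processed = text.upper().strip()

# The marker is still there when the text is written back out.
out = processed.encode("utf-8")
print(ascii(processed), out)
```

As with NaN, nothing is lost silently: any consumer that cares can look for the marker at the end of the pipeline.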

Mar 20 2014

"Brad Anderson" <eco gnuk.net> writes:

On Thursday, 20 March 2014 at 22:51:27 UTC, monarch_dodra wrote:
 On Thursday, 20 March 2014 at 22:39:47 UTC, Walter Bright wrote:
 Currently we do it by throwing a UTFException. This has 
 problems:

 1. about anything that deals with UTF cannot be made nothrow

 2. turns innocuous errors into major problems, such as DOS 
 attack vectors
 http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

 One option to fix this is to treat invalid sequences as:

 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

 2. U+FFFD

 I kinda like option 1.

 What do you think?

 I had thought of this before, and had an idea along the lines 
 of:
 1. strings "inside" the program are always valid.
 2. encountering invalid strings "inside" the program  is an 
 Error.
 3. strings from the "outside" world must be validated before 
 use.

 The advantage is *more* than just a nothrow guarantee, but also 
 a performance guarantee in release. And it *is* a pretty sane 
 approach to the problem:
 - User data: validate before use.
 - Internal data: if its bad, your program is in a failure state.

I'm a fan of this approach but Timon pointed out when I wrote 
about it once that it's rather trivial to get an invalid string 
through slicing mid-code point so now I'm not so sure. I think 
I'm still in favor of it because you've obviously got a logic 
error if that happens so your program isn't correct anyway (it's 
not a matter of bad user input).

Mar 20 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Thursday, 20 March 2014 at 23:34:02 UTC, Brad Anderson wrote:
 I'm a fan of this approach but Timon pointed out when I wrote 
 about it once that it's rather trivial to get an invalid string 
 through slicing mid-code point so now I'm not so sure.

It's just as easy to slice mid-codepoint as it is to access a 
range out of bounds. In both cases, it's a programming error.
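
To make the slicing hazard concrete, here is a small Python sketch over raw UTF-8 bytes (D strings are arrays of UTF-8 code units, so the byte-level picture is the same). The helper name `decodes_cleanly` is invented for this example.

```python
def decodes_cleanly(raw: bytes) -> bool:
    """True if raw is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

s = "na\u00efve".encode("utf-8")  # b'na\xc3\xafve'; the i-diaeresis takes two bytes

on_boundary = s[:4]   # b'na\xc3\xaf' ends after the full two-byte sequence
mid_codepoint = s[:3]  # b'na\xc3'     cuts the sequence in half
tail = s[3:]           # b'\xafve'     starts with a stray continuation byte

print(decodes_cleanly(on_boundary),
      decodes_cleanly(mid_codepoint),
      decodes_cleanly(tail))
```

An off-by-one in the slice index is all it takes, and both halves of the split become invalid.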

The only excuse I see for throwing an exception for slicing 
mid-codepoint, is that
1. programmers are less aware of the issue, so it's more 
forgiving in a released program (nobody likes a crash).
2. arguably, it's not the *program* state that's bad. It's the 
*data*.

Well, in regards to "2", you could argue that program state and 
data state are one and the same.

 I think I'm still in favor of it because you've obviously got a 
 logic error if that happens so your program isn't correct 
 anyway (it's not a matter of bad user input).


If I remember correctly, with a specially written UTF string, it 
*was* possible to corrupt program state. I think. I need to 
double check. I didn't give it much thought then ("it should 
virtually never happen"), but it could be used as a deliberate 
security vulnerability.

Mar 21 2014

Denis Shelomovskij <verylonglogin.reg gmail.com> writes:

21.03.2014 12:25, monarch_dodra wrote:
 On Thursday, 20 March 2014 at 23:34:02 UTC, Brad Anderson wrote:
 I'm a fan of this approach but Timon pointed out when I wrote about it
 once that it's rather trivial to get an invalid string through slicing
 mid-code point so now I'm not so sure.

 It's just as easy to slice mid-codepoint as it is to access a range out
 of bounds. In both cases, it's a programming error.

 The only excuse I see for throwing an exception for slicing
 mid-codepoint, is that
 1. programmers are less aware of the issue, so it's more forgiving in a
 released program (nobody likes a crash).
 2. arguably, it's not the *program* state that's bad. It's the *data*.

 Well, in regards to "2", you could argue that program state and data
 state is one and the same.

 I think I'm still in favor of it because you've obviously got a logic
 error if that happens so your program isn't correct anyway (it's not a
 matter of bad user input).


 If I remember correctly, with a specially written UTF string, it *was*
 possible to corrupt program state. I think. I need to double check. I
 didn't give it much thought then ("it should virtually never happen"),
 but it could be used as deliberate security vulnerability.

Almost nothing to add here. We already have `-noboundscheck`, which can 
dramatically increase performance; throwing `UTFError` should either use the 
same mechanism (`-noutfcheck`?) or just be stripped in release. 
Personally I'd choose the latter, as lots of (sometimes very 
slow) assertions are already stripped with `-release` in real programs, and 
those indicate the same kind of critical data corruption.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij

Mar 21 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 21 March 2014 at 10:39:49 UTC, Denis Shelomovskij 
wrote:
 21.03.2014 12:25, monarch_dodra wrote:
 If I remember correctly, with a specially written UTF string, it *was*
 possible to corrupt program state. I think. I need to double check. I
 didn't give it much thought then ("it should virtually never happen"),
 but it could be used as deliberate security vulnerability.

 Almost nothing to add here. We already have `-noboundscheck` 
 which can dramatically increase performance, throwing 
 `UTFError` should either use same mechanics (`-noutfcheck`?) or 
 just be stripped in release. Personally I'd choose the latter 
 as there are lots of (sometimes very slow) assertions stripped 
 with `-release` in real programs, which indicates same critical 
 data corruption.

Except it's a Unicode *Exception*. Invalid Unicode is *NOT* 
supposed to be an error.

Now I remember: Truncated Unicode strings can cause slicing out 
of bounds in popFront.

This means we are currently operating on a double standard of 
sometimes exception, sometimes error, sometimes corruption.
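
The out-of-bounds case can be sketched like this. The code is a hypothetical Python model of a popFront that trusts the lead byte, not the actual Phobos implementation; `stride` and `unchecked_pop_front` are names invented for the example.

```python
def stride(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte, with no validation."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # stray continuation or invalid byte: skip one unit

def unchecked_pop_front(buf: bytes, pos: int) -> int:
    # Advances by whatever the lead byte promises, without checking
    # that the promised bytes are actually present.
    return pos + stride(buf[pos])

# The euro sign is E2 82 AC; here the last byte has been cut off.
buf = b"ok\xe2\x82"
new_pos = unchecked_pop_front(buf, 2)
overrun = new_pos > len(buf)
print(new_pos, len(buf), overrun)
```

The lead byte promises three bytes but only two remain, so the new position lands past the end of the buffer. Depending on whether bounds checks are compiled in, that is a range error or a read of unrelated memory.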

Mar 22 2014

Nick Sabalausky <SeeWebsiteToContactMe semitwist.com> writes:

On 3/20/2014 6:39 PM, Walter Bright wrote:
 Currently we do it by throwing a UTFException. This has problems:

 1. about anything that deals with UTF cannot be made nothrow

 2. turns innocuous errors into major problems, such as DOS attack vectors
 http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

 One option to fix this is to treat invalid sequences as:

 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

 2. U+FFFD

 I kinda like option 1.

 What do you think?

I'd have to give some thought to have an opinion on the right solution, 
however I do want to say the current UTFException throwing is something 
I've always been unhappy with. So it definitely should get addressed in 
some way.

Mar 20 2014

"Chris Williams" <yoreanon-chrisw yahoo.co.jp> writes:

To the extent possible, it should try to retain the data. But if 
ever the character is actually needed for something (like parsing 
JSON or displaying a glyph), the bad region should be replaced 
with a series of replacement characters:

http://en.wikipedia.org/wiki/Replacement_character#Replacement_character

Mar 20 2014

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

On Thu, 20 Mar 2014 18:39:50 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 Currently we do it by throwing a UTFException. This has problems:

 1. about anything that deals with UTF cannot be made nothrow

 2. turns innocuous errors into major problems, such as DOS attack vectors
 http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

 One option to fix this is to treat invalid sequences as:

 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

 2. U+FFFD

 I kinda like option 1.

 What do you think?

Can't say I like it. Especially since current code expects a throw.

I understand the need. What about creating a different type which decodes  
into a known invalid code, and doesn't throw? This leaves the selection of  
throwing or not up to the type, which is generally decided on declaration,  
instead of having to change all your calls.
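
A rough sketch of the idea, in Python for illustration only: the choice between throwing and substituting is made once, by picking the wrapper type, and code that iterates over the string is unchanged. `StrictUtf8` and `LenientUtf8` are hypothetical names, and U+FFFD is used as the "known invalid code" only because Python's codec provides it.

```python
class StrictUtf8:
    """Hypothetical wrapper: iteration raises on malformed input."""
    def __init__(self, raw: bytes):
        self.raw = raw

    def __iter__(self):
        return iter(self.raw.decode("utf-8"))

class LenientUtf8:
    """Hypothetical wrapper: iteration yields U+FFFD for malformed input."""
    def __init__(self, raw: bytes):
        self.raw = raw

    def __iter__(self):
        return iter(self.raw.decode("utf-8", errors="replace"))

def code_points(s):
    # Caller code is identical for both types.
    return [hex(ord(ch)) for ch in s]

lenient = code_points(LenientUtf8(b"a\xffb"))

try:
    code_points(StrictUtf8(b"a\xffb"))
    strict_threw = False
except UnicodeDecodeError:
    strict_threw = True

print(lenient, strict_threw)
```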

-Steve

Mar 20 2014

"Regan Heath" <regan netmail.co.nz> writes:

On Thu, 20 Mar 2014 22:39:50 -0000, Walter Bright  
<newshound2 digitalmars.com> wrote:

 Currently we do it by throwing a UTFException. This has problems:

 1. about anything that deals with UTF cannot be made nothrow

 2. turns innocuous errors into major problems, such as DOS attack vectors
 http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

 One option to fix this is to treat invalid sequences as:

 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

 2. U+FFFD

 I kinda like option 1.

 What do you think?

In Windows/Win32:

WideCharToMultiByte has flags for a bunch of similar behaviours and allows  
you to define a default char to use as a replacement in such cases.

swprintf, when passed %S, will convert a wchar_t UTF-16 argument into ASCII,  
and replaces invalid characters with ? as it does so.

swprintf_s (the safe version), IIRC, will invoke the invalid parameter  
handler for sequences which cannot be converted.

I think, ideally, we want some sensible default behaviour but also the  
ability to alter it globally, and even better in specific calls where it  
makes sense to do so (where flags/arguments can be passed to that effect).

So, the default behaviour could be to throw (therefore no breaking change)  
and we provide a function to change this to one of the other options, and  
another to select a replacement character (which would default to .init or  
U+FFFD).
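
A minimal sketch of that shape, in Python as a stand-in: a process-wide default that starts out as "throw", a setter to change it along with the replacement character, and a per-call override. Every name here (`set_invalid_utf_policy`, `decode_utf8`, the `thread_sketch_substitute` handler) is hypothetical and exists only for this example.

```python
import codecs

# Process-wide settings; throwing stays the default, so nothing breaks.
_policy = {"mode": "throw", "replacement": "\ufffd"}

def _substitute(exc):
    # Error handler that inserts the currently configured replacement.
    if isinstance(exc, UnicodeDecodeError):
        return (_policy["replacement"], exc.end)
    raise exc

codecs.register_error("thread_sketch_substitute", _substitute)

def set_invalid_utf_policy(mode: str, replacement: str = "\ufffd") -> None:
    if mode not in ("throw", "replace"):
        raise ValueError(mode)
    _policy["mode"] = mode
    _policy["replacement"] = replacement

def decode_utf8(raw: bytes, mode=None) -> str:
    # A per-call argument overrides the global default.
    mode = mode or _policy["mode"]
    if mode == "throw":
        return raw.decode("utf-8")
    return raw.decode("utf-8", errors="thread_sketch_substitute")

per_call = decode_utf8(b"a\xffb", mode="replace")

try:
    decode_utf8(b"a\xffb")
    default_threw = False
except UnicodeDecodeError:
    default_threw = True

set_invalid_utf_policy("replace", "?")
after_global_change = decode_utf8(b"a\xffb")
print(ascii(per_call), default_threw, after_global_change)
```

The usual caveat with a global switch applies: library code can no longer assume which behaviour it will get unless it passes the per-call argument.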

R

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/

Mar 21 2014

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

21-Mar-2014 02:39, Walter Bright wrote:
 Currently we do it by throwing a UTFException. This has problems:

 1. about anything that deals with UTF cannot be made nothrow

 2. turns innocuous errors into major problems, such as DOS attack vectors
 http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

 One option to fix this is to treat invalid sequences as:

 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

If we are talking about decoding, then only dchar is relevant.
If transcoding, then emitting 0xFF makes for broken UTF-8, so I 
see no sense in going for it.

 2. U+FFFD

Also has the benefit of being recommended by the standard specifically 
for the case of substitution for bad encoding.

Details:
https://d.puremagic.com/issues/show_bug.cgi?id=12113
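
The transcoding point can be checked directly. In this Python sketch, U+FFFD encodes to three ordinary UTF-8 bytes, so output containing it is still well-formed, whereas a raw 0xFF byte makes the output itself invalid.

```python
REPLACEMENT = "\ufffd"

# U+FFFD has a normal UTF-8 encoding: EF BF BD.
encoded = REPLACEMENT.encode("utf-8")

with_fffd = b"a" + encoded + b"b"  # still valid UTF-8
with_ff = b"a" + b"\xff" + b"b"    # 0xFF is never valid in UTF-8

round_trip = with_fffd.decode("utf-8")

try:
    with_ff.decode("utf-8")
    ff_is_valid = True
except UnicodeDecodeError:
    ff_is_valid = False

print(encoded, ascii(round_trip), ff_is_valid)
```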

 I kinda like option 1.

Not enough of an argument ;)


-- 
Dmitry Olshansky

Mar 21 2014

Walter Bright <newshound2 digitalmars.com> writes:

On 3/21/2014 10:14 AM, Dmitry Olshansky wrote:
 21-Mar-2014 02:39, Walter Bright wrote:
 Currently we do it by throwing a UTFException. This has problems:

 1. about anything that deals with UTF cannot be made nothrow

 2. turns innocuous errors into major problems, such as DOS attack vectors
 http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

 One option to fix this is to treat invalid sequences as:

 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)

 If we talk decoding then only dchar is relevant.
 If transcoding then, having 0xFF makes for broken UTF-8 encoding so I see no
 sense in going for it.

 2. U+FFFD

 Also has the benefit of being recommended by the standard specifically for the
 case of substitution for bad encoding.

 Details:
 https://d.puremagic.com/issues/show_bug.cgi?id=12113

Ah, that's what I was looking for. The Wikipedia article was a bit wishy-washy 
about the whole thing.

 I kinda like option 1.

 Not enough of an argument ;)

Mar 21 2014

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, March 20, 2014 15:39:50 Walter Bright wrote:
 Currently we do it by throwing a UTFException. This has problems:
 
 1. about anything that deals with UTF cannot be made nothrow
 
 2. turns innocuous errors into major problems, such as DOS attack vectors
 http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
 
 One option to fix this is to treat invalid sequences as:
 
 1. the .init value (0xFF for UTF8, 0xFFFF for UTF16 and UTF32)
 
 2. U+FFFD
 
 I kinda like option 1.
 
 What do you think?

After a discussion on this a few weeks back (where I was in favor of the
current behavior when the discussion started), I'm now completely in favor
of making it so that std.utf.decode simply replaces invalid code points with
U+FFFD per the standard. Most code won't care and will continue to work as
before. The main difference is that invalid Unicode would then fall in the
same category as when a program is given a string with characters that it's
not supposed to be given. Any code that checks for that sort of thing will
then treat invalid Unicode as it would have treated other invalid strings,
and code that doesn't care will continue to not care except that now it will
work with invalid Unicode instead of throwing.

A prime example is something like find. What does it care if it's given 
invalid Unicode? It will simply look for what you tell it to look for, and if 
it's not there, it won't find it. U+FFFD will just be one more character that 
doesn't match what it's looking for.
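
As a quick illustration of the find case (Python's `str.find` standing in for std.algorithm's find, so this is an analogy rather than D behaviour): once the invalid bytes have been decoded to U+FFFD, searching proceeds as usual and the replacement characters are simply non-matching input.

```python
# Two invalid bytes sit in the middle of otherwise ordinary input.
raw = b"key=\xff\xfe;name=walter"
text = raw.decode("utf-8", errors="replace")

found = text.find("name")       # index of the match
missing = text.find("missing")  # -1: not present, no exception either way

print(ascii(text), found, missing)
```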

The few programs that really care about whether a string that they're given 
contains any invalid Unicode can simply validate the string ahead of time. The 
main problem there is that we need to replace std.utf.validate with something 
like std.utf.isValidUnicode, because validate makes the horrendous decision of 
throwing rather than returning a bool (which is what triggered the previous 
discussion on the topic IIRC).
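
The difference in API shape looks roughly like this. It is a Python sketch with hypothetical names (`validate_throwing`, `is_valid_utf8`), meant only to contrast a throwing validator with a bool-returning predicate of the kind suggested above.

```python
def validate_throwing(raw: bytes) -> None:
    """Throwing style: returns nothing, raises on malformed input."""
    raw.decode("utf-8")

def is_valid_utf8(raw: bytes) -> bool:
    """Predicate style: reports the result as a bool."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

inputs = [b"plain", b"caf\xc3\xa9", b"bad\xff", b"cut\xe2\x82"]

# A predicate composes directly with filtering; the throwing version
# would need a try/except wrapped around every call site.
accepted = [raw for raw in inputs if is_valid_utf8(raw)]
print(accepted)
```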

There may be some concern about this change silently changing behavior, but I 
think that the reality is that the vast majority of programs will continue to 
work just fine, and our string processing code will be that much cleaner and 
faster as a result. So, I'm very much inclined to take the path of making this 
change and putting a warning about it in the changelog rather than not making 
the change or trying to do this alongside what we currently have.

- Jonathan M Davis

Mar 21 2014
