digitalmars.D - string need to be robust
- ZY Zhou (18/18) Mar 13 2011 Hi,
- Jonathan M Davis (3/22) Mar 13 2011 Check out std.utf. It has the functions for dealing with unicode stuff.
- ZY Zhou (8/8) Mar 13 2011 std.utf throw exception instead of crash the program. but you still need...
- %u (4/12) Mar 13 2011 At first glance I would like to overload the opInvalid? function in the ...
- Jonathan M Davis (11/21) Mar 13 2011 I think that it's completely unreasonable to expect all string functions...
- ZY Zhou (13/37) Mar 13 2011 As I explained in the first mail, if utf8 parser convert all invalid utf...
- spir (11/26) Mar 13 2011 This is not a good idea, imo. Surrogate values /are/ invalid code points...
- ZY Zhou (4/8) Mar 13 2011 If a invalid utf8 or utf16 code need to be converted to utf32, then it s...
- spir (19/27) Mar 13 2011 You are wrong on both points.
- spir (12/27) Mar 13 2011 PS: You are free to preprocess the source if you like it, and convert in...
- spir (18/26) Mar 13 2011 The line reader /must/ crash or throw if the source is invalid. If you d...
- spir (16/32) Mar 13 2011 D native features *must* crash or throw when the source text is invalid....
- ZY Zhou (7/11) Mar 13 2011 What if I'm making a text editor with D?
- Michel Fortin (23/30) Mar 13 2011 But what is the best thing to do when you got an invalid UTF file in a
- Jacob Carlborg (6/44) Mar 13 2011 I would say that the functions should NOT crash but instead throw an
- Andrei Alexandrescu (4/7) Mar 13 2011 Yah. In addition, the exception should provide index information such
- Andrej Mitrovic (4/4) Mar 13 2011 Crash -> Have fun stepping through your code with a debugger, or
- ZY Zhou (9/16) Mar 13 2011 it doesn't make sense to add try/catch every time you use tolower/touppe...
- Jesse Phillips (4/12) Mar 13 2011 Right that isn't the point of throwing an exception. In this case you ha...
- Jonathan M Davis (11/20) Mar 13 2011 If you're going to worry about string validity, it should be checked whe...
- Jacob Carlborg (5/21) Mar 14 2011 Depending on what kind of application you write, all you need could just...
- KennyTM~ (17/20) Mar 13 2011 It is already throwing an exception called
- Jesse Phillips (2/21) Mar 13 2011 foreach does not use the range interface provided by std.array (that I k...
- KennyTM~ (5/26) Mar 13 2011 I haven't checked on Windows, but there is no access violation in
- Jacob Carlborg (5/25) Mar 14 2011 Good, then we can call it a day :) I assumed it crashed since he said it...
Hi, I wrote a small program to read and parse html(charset=UTF-8). It worked great until some invalid utf8 chars appears in that page. When the string is invalid, things like foreach or std.string.tolower will just crash. this make the string type totally unusable when processing files, since there is no guarantee that utf8 file doesn't contain invalid utf8 chars. So I made a utf8 decoder myself to convert char[] to dchar[]. In my decoder, I convert all invalid utf8 chars to low surrogate code points(0x80~0xFF -> 0xDC80~0xDCFF), since low surrogate are invalid utf32 codes, I'm still able to know which part of the string is invalid. Besides, after processing the dchar[] string, I still can convert it back to utf8 char[] without affecting any of the invalid part. But it is still too easy to crash program with invalid string. Is it possible to make this a native feature of string? Or is there any other recommended method to solve this issue? Thank you, --ZY Zhou
Mar 13 2011
On Sunday 13 March 2011 01:57:12 ZY Zhou wrote:Hi, I wrote a small program to read and parse html(charset=UTF-8). It worked great until some invalid utf8 chars appears in that page. When the string is invalid, things like foreach or std.string.tolower will just crash. this make the string type totally unusable when processing files, since there is no guarantee that utf8 file doesn't contain invalid utf8 chars. So I made a utf8 decoder myself to convert char[] to dchar[]. In my decoder, I convert all invalid utf8 chars to low surrogate code points(0x80~0xFF -> 0xDC80~0xDCFF), since low surrogate are invalid utf32 codes, I'm still able to know which part of the string is invalid. Besides, after processing the dchar[] string, I still can convert it back to utf8 char[] without affecting any of the invalid part. But it is still too easy to crash program with invalid string. Is it possible to make this a native feature of string? Or is there any other recommended method to solve this issue?Check out std.utf. It has the functions for dealing with unicode stuff. - Jonathan M Davis
Mar 13 2011
std.utf throw exception instead of crash the program. but you still need to add try/catch everywhere. My point is: this simple code should work, instead of crash, it is supposed to leave all invalid codes untouched and just process the valid parts. Stream file = new BufferedFile("sample.txt"); foreach(char[] line; file) { string s = line.idup.tolower; }
Mar 13 2011
== Quote from ZY Zhou (rinick GeeMail.com)'s articlestd.utf throw exception instead of crash the program. but you still need to add try/catch everywhere. My point is: this simple code should work, instead of crash, it is supposed to leave all invalid codes untouched and just process the valid parts. Stream file = new BufferedFile("sample.txt"); foreach(char[] line; file) { string s = line.idup.tolower; }At first glance I would like to overload the opInvalid? function in the stream such that iso a throw, the char would be ignored. Disclaimer: I don't know D2 nor streams :)
Mar 13 2011
On Sunday 13 March 2011 04:34:24 ZY Zhou wrote:std.utf throw exception instead of crash the program. but you still need to add try/catch everywhere. My point is: this simple code should work, instead of crash, it is supposed to leave all invalid codes untouched and just process the valid parts. Stream file = new BufferedFile("sample.txt"); foreach(char[] line; file) { string s = line.idup.tolower; }I think that it's completely unreasonable to expect all string functions to worry about whether they're dealing with valid unicode or not. And a lot of string stuff would involve ranges which would require converting each code point to UTF-32. And how is it supposed to do _that_ with invalid UTF-8? I don't know how you expect to really be able to do anything with invalid UTF-8 anyway. There may be something that could be added to std.utf to help better handle the situation, but I think that it's completely unreasonable to expect all of the string-based and/or range-based functions to be able to handle invalid unicode. - Jonathan M Davis
Mar 13 2011
but I think that it's completely unreasonable to expect all of the string-based and/or range-based functions to be able to handle invalid unicode.As I explained in the first mail, if utf8 parser convert all invalid utf8 chars to low surrogate code points(0x80~0xFF -> 0xDC80~0xDCFF), other string related functions will still work fine, and you can also handle these error if you want string s = "\xa0"; foreach(dchar d; s) { if (isValidUnicode(d)) { process(d); } else { handleError(d); } } == Quote from Jonathan M Davis (jmdavisProg gmx.com)'s articleOn Sunday 13 March 2011 04:34:24 ZY Zhou wrote:std.utf throw exception instead of crash the program. but you still need to add try/catch everywhere. My point is: this simple code should work, instead of crash, it is supposed to leave all invalid codes untouched and just process the valid parts. Stream file = new BufferedFile("sample.txt"); foreach(char[] line; file) { string s = line.idup.tolower; }I think that it's completely unreasonable to expect all string functions to worry about whether they're dealing with valid unicode or not. And a lot of string stuff would involve ranges which would require converting each code point to UTF-32. And how is it supposed to do _that_ with invalid UTF-8? I don't know how you expect to really be able to do anything with invalid UTF-8 anyway. There may be something that could be added to std.utf to help better handle the situation, but I think that it's completely unreasonable to expect all of the string-based and/or range-based functions to be able to handle invalid unicode. - Jonathan M Davis
Mar 13 2011
On 03/13/2011 01:25 PM, ZY Zhou wrote:This is not a good idea, imo. Surrogate values /are/ invalid code points. (For the ones who guess, there are a range of /code unit/ values used to code in utf16 code points > 0xFFFF.) They should never appear in a string of dchar[]; and a string of char[] code units should never encode a non-code point in the surrogate range. Denis -- _________________ vita es estrany spir.wikidot.combut I think that it's completely unreasonable to expectAs I explained in the first mail, if utf8 parser convert all invalid utf8 chars to low surrogate code points(0x80~0xFF -> 0xDC80~0xDCFF), other string related functions will still work fine, and you can also handle these error if you want string s = "\xa0"; foreach(dchar d; s) { if (isValidUnicode(d)) { process(d); } else { handleError(d); } }all of the string-based and/or range-based functions to be able to handle invalid unicode.
Mar 13 2011
If a invalid utf8 or utf16 code need to be converted to utf32, then it should be converted to an invalid utf32. that's why D800~DFFF are marked as invalid points in unicode standard. == Quote from spir (denis.spir gmail.com)'s articleThis is not a good idea, imo. Surrogate values /are/ invalid code points. (For the ones who guess, there are a range of /code unit/ values used to code in utf16 code points > 0xFFFF.) They should never appear in a string of dchar[]; and a string of char[] code units should never encode a non-code point in the
Mar 13 2011
On 03/13/2011 04:43 PM, ZY Zhou wrote:If a invalid utf8 or utf16 code need to be converted to utf32, then it should be converted to an invalid utf32. that's why D800~DFFF are marked as invalid points in unicode standard.You are wrong on both points. First, there is no definition of invalid source conversion into another format/encoding; instead it should be treated as invalid, that's all. A language or string-processing library should certainly *not* provide any way to do that. Instead, it should just signal invalidity by crashing or throwing. Second, the range you mention is not intended for application use; instead it is reserved for special use by utf16; and, as such, invalid. Since the beginning of this thread, you are demanding for D standard features (the *string types or *char[] arrays) to cope with your particular needs of the moment, doing your job; at the price of all other use cases of those features potentially becoming unsecure or incorrect; crashing loads of existing code which rely on correct behaviour; and breaking the standard. Strange. Denis== Quote from spir (denis.spir gmail.com)'s article-- _________________ vita es estrany spir.wikidot.comThis is not a good idea, imo. Surrogate values /are/ invalid code points. (For the ones who guess, there are a range of /code unit/ values used to code in utf16 code points> 0xFFFF.) They should never appear in a string of dchar[]; and a string of char[] code units should never encode a non-code point in the
Mar 13 2011
On 03/13/2011 01:25 PM, ZY Zhou wrote:PS: You are free to preprocess the source if you like it, and convert invalid parts into whatever you like. But instead of surrogates, you'd rather use one of the freely usable ranges of values; or use 0 maybe (so that output won't be disturbed); or better the code point intended for "un-representable" thingie, that all fonts would correctly interpret (and usually display as an inverse video '?'). Denis -- _________________ vita es estrany spir.wikidot.combut I think that it's completely unreasonable to expectAs I explained in the first mail, if utf8 parser convert all invalid utf8 chars to low surrogate code points(0x80~0xFF -> 0xDC80~0xDCFF), other string related functions will still work fine, and you can also handle these error if you want string s = "\xa0"; foreach(dchar d; s) { if (isValidUnicode(d)) { process(d); } else { handleError(d); } }all of the string-based and/or range-based functions to be able to handle invalid unicode.
Mar 13 2011
On 03/13/2011 12:34 PM, ZY Zhou wrote:std.utf throw exception instead of crash the program. but you still need to add try/catch everywhere. My point is: this simple code should work, instead of crash, it is supposed to leave all invalid codes untouched and just process the valid parts. Stream file = new BufferedFile("sample.txt"); foreach(char[] line; file) { string s = line.idup.tolower; }The line reader /must/ crash or throw if the source is invalid. If you do not want to pre-process the source (and convert to surrogates as you do, which IMO is not a very good idea) /before/ running the process loop, then read the source as plain byte[], check and pre-process invalid text in this form, and then only convert to string to be able to use string functions. There is no builtin type in D that takes random input as supposed string (text) /without/ checking it, and still allows to use string operations on it as if it were checked to be valid. This would be foolish, don't you think? The D way to check the source of strings at construction time allows checking only once, then safely apply any number of operations. The opposite would require to check at each oeration. And would still throw or crash, which is not what you expect, I guess: you would like IIUC invalid input to be ignored. Denis -- _________________ vita es estrany spir.wikidot.com
Mar 13 2011
On 03/13/2011 10:57 AM, ZY Zhou wrote:Hi, I wrote a small program to read and parse html(charset=UTF-8). It worked great until some invalid utf8 chars appears in that page. When the string is invalid, things like foreach or std.string.tolower will just crash. this make the string type totally unusable when processing files, since there is no guarantee that utf8 file doesn't contain invalid utf8 chars. So I made a utf8 decoder myself to convert char[] to dchar[]. In my decoder, I convert all invalid utf8 chars to low surrogate code points(0x80~0xFF -> 0xDC80~0xDCFF), since low surrogate are invalid utf32 codes, I'm still able to know which part of the string is invalid. Besides, after processing the dchar[] string, I still can convert it back to utf8 char[] without affecting any of the invalid part. But it is still too easy to crash program with invalid string. Is it possible to make this a native feature of string? Or is there any other recommended method to solve this issue?D native features *must* crash or throw when the source text is invalid. What do you think? What should a square root function do when you pass it negative input? /You/ may have special requirements for those cases (ignore it, log it, negate it, replace it with 0 or 1...), but the library must crash anyway. Your requirements are application-specific needs that /you/ must define yourself. Hope I'm clear. D offers an utf8 checking function (checking utf8 beeing the same as convertingto utf32, it just tries to convert and throws when fails). I would use before process to do what /you/ expect. Denis -- _________________ vita es estrany spir.wikidot.com
Mar 13 2011
What if I'm making a text editor with D? I know the text has something wrong, I want to open it and fix it. the exception won't help, if the editor just refuse to open invalid file, then the editor is useless. Try open an invalid utf file with a text editor, like vim, you will understand what I mean == Quote from spir (denis.spir gmail.com)'s articleD offers an utf8 checking function (checking utf8 beeing the same as convertingto utf32, it just tries to convert and throws when fails). I would use before process to do what /you/ expect. Denis
Mar 13 2011
On 2011-03-13 10:18:24 -0400, ZY Zhou <rinick GeeeMail.com> said:What if I'm making a text editor with D? I know the text has something wrong, I want to open it and fix it. the exception won't help, if the editor just refuse to open invalid file, then the editor is useless. Try open an invalid utf file with a text editor, like vim, you will understand what I meanBut what is the best thing to do when you got an invalid UTF file in a text editor? Perhaps you should show a warning to the user, perhaps you also should ask the user to select the right text encoding (because it might simply not be UTF-8), or perhaps you want to silently ignore the error and show an invalid character marker at the right point in the text. All of these options are valid and the programing language shouldn't decide that for you. So I'd point out that a text file editor is a special use case, most programs aren't text file editors and don't share this concern. In the same vein, HTML parsers are also a special case that should know how to handle encodings. In fact, HTML 5 defines explicitly how to deal with invalid UTF-8 sequences: <http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#utf-8> There are many good ways to deal with invalid UTF-8 sequences. Throwing an exception seems like the most robust one to me since it protects against invalid input. What to do with invalid input belongs in the application logic, not the language. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 13 2011
On 2011-03-13 13:22, spir wrote:On 03/13/2011 10:57 AM, ZY Zhou wrote:I would say that the functions should NOT crash but instead throw an exception. Then the developer can choose what to do when there's an invalid unicode character. -- /Jacob CarlborgHi, I wrote a small program to read and parse html(charset=UTF-8). It worked great until some invalid utf8 chars appears in that page. When the string is invalid, things like foreach or std.string.tolower will just crash. this make the string type totally unusable when processing files, since there is no guarantee that utf8 file doesn't contain invalid utf8 chars. So I made a utf8 decoder myself to convert char[] to dchar[]. In my decoder, I convert all invalid utf8 chars to low surrogate code points(0x80~0xFF -> 0xDC80~0xDCFF), since low surrogate are invalid utf32 codes, I'm still able to know which part of the string is invalid. Besides, after processing the dchar[] string, I still can convert it back to utf8 char[] without affecting any of the invalid part. But it is still too easy to crash program with invalid string. Is it possible to make this a native feature of string? Or is there any other recommended method to solve this issue?D native features *must* crash or throw when the source text is invalid. What do you think? What should a square root function do when you pass it negative input? /You/ may have special requirements for those cases (ignore it, log it, negate it, replace it with 0 or 1...), but the library must crash anyway. Your requirements are application-specific needs that /you/ must define yourself. Hope I'm clear. D offers an utf8 checking function (checking utf8 beeing the same as convertingto utf32, it just tries to convert and throws when fails). I would use before process to do what /you/ expect. Denis
Mar 13 2011
On 3/13/11 1:55 PM, Jacob Carlborg wrote:I would say that the functions should NOT crash but instead throw an exception. Then the developer can choose what to do when there's an invalid unicode character.Yah. In addition, the exception should provide index information such that an application can elect to continue processing. Andrei
Mar 13 2011
Crash -> Have fun stepping through your code with a debugger, or worse, observe disassembly. Throw -> (Hopefully) get an informative error message, which could mean you'll be able to fix the bug quickly.
Mar 13 2011
it doesn't make sense to add try/catch every time you use tolower/toupper/foreach on string. No one will do that. You either throw exception when convert invalid utf8 bytes to string, or never throw exception and use invalid UTF32 code in dchar to represent invalid utf8 code. string s = "\x0A"; // this is the right place to throw the exception (or compile error) s.tolower; // no one will add try/catch on this --ZY Zhou == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleOn 3/13/11 1:55 PM, Jacob Carlborg wrote:I would say that the functions should NOT crash but instead throw an exception. Then the developer can choose what to do when there's an invalid unicode character.Yah. In addition, the exception should provide index information such that an application can elect to continue processing. Andrei
Mar 13 2011
ZY Zhou Wrote:it doesn't make sense to add try/catch every time you use tolower/toupper/foreach on string. No one will do that. You either throw exception when convert invalid utf8 bytes to string, or never throw exception and use invalid UTF32 code in dchar to represent invalid utf8 code. string s = "\x0A"; // this is the right place to throw the exception (or compile error)I don't think here is even a good place to have it. The problem being most strings/invalid strings will come in during run-time. And at run-time we have std.file.readText which will validate the string upon entry.s.tolower; // no one will add try/catch on thisRight that isn't the point of throwing an exception. In this case you have not written you program to handle invalid Unicode, which means your program should not continue to operate until you decide to fix it. (Exceptions are robust) However an access violation like the one received from using foreach is not appropriate and should be filed as a bug. You should be programming under the assumption that everything is how you expect it, and verify and remedy external inputs at the point of input. If you don't then the program shouldn't continue running.
Mar 13 2011
On Sunday 13 March 2011 22:45:38 ZY Zhou wrote:it doesn't make sense to add try/catch every time you use tolower/toupper/foreach on string. No one will do that. You either throw exception when convert invalid utf8 bytes to string, or never throw exception and use invalid UTF32 code in dchar to represent invalid utf8 code. string s = "\x0A"; // this is the right place to throw the exception (or compile error) s.tolower; // no one will add try/catch on thisIf you're going to worry about string validity, it should be checked when the string is initially created. If it's not valid, then fix it in whatever way you deem appropriate. After that, you shouldn't have to worry about string validity anymore. Honestly, invalid UTF-8 is pretty rare overall. You'll get it because of a bad file or somesuch, but once the string is valid, it stays valid. So, you really only have to worry about it when you read in files and such. Once the file has been correctly read in, you just use the strings without worrying about it. There shouldn't be a need to use try-catch blocks much of anywhere to worry about invalid unicode. - Jonathan M Davis
Mar 13 2011
On 2011-03-14 06:45, ZY Zhou wrote:it doesn't make sense to add try/catch every time you use tolower/toupper/foreach on string. No one will do that. You either throw exception when convert invalid utf8 bytes to string, or never throw exception and use invalid UTF32 code in dchar to represent invalid utf8 code. string s = "\x0A"; // this is the right place to throw the exception (or compile error) s.tolower; // no one will add try/catch on this --ZY Zhou == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleDepending on what kind of application you write, all you need could just be to wrap the main method in try/catch. -- /Jacob CarlborgOn 3/13/11 1:55 PM, Jacob Carlborg wrote:I would say that the functions should NOT crash but instead throw an exception. Then the developer can choose what to do when there's an invalid unicode character.Yah. In addition, the exception should provide index information such that an application can elect to continue processing. Andrei
Mar 14 2011
On Mar 14, 11 02:55, Jacob Carlborg wrote:I would say that the functions should NOT crash but instead throw an exception. Then the developer can choose what to do when there's an invalid unicode character.It is already throwing an exception called core.exception.UnicodeException. This even provides you the index where decoding failed. (However Phobos is not using it, AFAIK.) ----------- import core.exception, std.stdio, std.conv; void main() { char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc]; try { foreach (dchar d; s){} } catch (UnicodeException e) { writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx])); // error at index 5 (9a) } } -----------
Mar 13 2011
KennyTM~ Wrote:It is already throwing an exception called core.exception.UnicodeException. This even provides you the index where decoding failed. (However Phobos is not using it, AFAIK.) ----------- import core.exception, std.stdio, std.conv; void main() { char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc]; try { foreach (dchar d; s){} } catch (UnicodeException e) { writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx])); // error at index 5 (9a) } } -----------foreach does not use the range interface provided by std.array (that I know of). So there should be a bug reported on the current behavior of Access Violation.
Mar 13 2011
On Mar 14, 11 13:53, Jesse Phillips wrote:KennyTM~ Wrote:I haven't checked on Windows, but there is no access violation in foreach (http://ideone.com/WSBRm), and there is no access violation in the range interface either (as of 2.052 it throws a std.utf.UtfException), so I don't know what bug you are talking about.It is already throwing an exception called core.exception.UnicodeException. This even provides you the index where decoding failed. (However Phobos is not using it, AFAIK.) ----------- import core.exception, std.stdio, std.conv; void main() { char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc]; try { foreach (dchar d; s){} } catch (UnicodeException e) { writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx])); // error at index 5 (9a) } } -----------foreach does not use the range interface provided by std.array (that I know of). So there should be a bug reported on the current behavior of Access Violation.
Mar 13 2011
On 2011-03-13 23:36, KennyTM~ wrote:On Mar 14, 11 02:55, Jacob Carlborg wrote:Good, then we can call it a day :) I assumed it crashed since he said it did. -- /Jacob CarlborgI would say that the functions should NOT crash but instead throw an exception. Then the developer can choose what to do when there's an invalid unicode character.It is already throwing an exception called core.exception.UnicodeException. This even provides you the index where decoding failed. (However Phobos is not using it, AFAIK.) ----------- import core.exception, std.stdio, std.conv; void main() { char[] s = [0x0f, 0x12,0x34,0x56,0x78,0x9a,0xbc]; try { foreach (dchar d; s){} } catch (UnicodeException e) { writefln("error at index %s (%x)", e.idx, to!ubyte(s[e.idx])); // error at index 5 (9a) } } -----------
Mar 14 2011