digitalmars.D.learn - UTF-8 problems
- Deewiant (21/21) Jun 12 2006 import std.stream, std.cstream;
- Oskar Linde (13/41) Jun 12 2006 I had a quick look at the std.stream sources and it seems std.stream
- Deewiant (8/54) Jun 12 2006 Thanks for the explanation. Unfortunately, I'm not knowledgeable enough ...
- Oskar Linde (6/62) Jun 12 2006 dchar std.utf.decode(char[],int)
- Deewiant (31/47) Jun 12 2006 Thanks, that works. What I did was write a short function looking like t...
- Oskar Linde (16/58) Jun 12 2006 For a more general implementation, change the last 3 lines to:
- Deewiant (10/80) Jun 12 2006 6? Aren't 4 UTF-8 units enough for all of Unicode? I see that UTF8stride...
- Carlos Santander (7/17) Jun 12 2006 Keep using readLine. The entire line should be made of valid UTF8 charac...
- Deewiant (5/18) Jun 12 2006 That would work, but I was originally using only getc() so it's easier f...
    import std.stream, std.cstream;

    // åäöΔ

    void main() {
        Stream file = new File(__FILE__, FileMode.In);
        // alternatively:
        //Stream file = din;

        while (!file.eof)
            dout.writef("%s", file.getc);
    }

--

With the above UTF-8 code, I expect the program's source to be output, also in UTF-8. However, I get ASCII output, and on line three appears everyone's favourite "Error: 4invalid UTF-8 sequence".

Furthermore, unless I use the "alternative" where std.cstream.din is used, the two line breaks after "std.cstream;" are not \r\n as they should be in the DOS encoding I use, they are \r\r\n. Converting the line breaks to just \n causes them to become \r\n in the output. Whence the extra \r?

What's strange is if I use e.g. readLine instead of getc, everything is fine. Since readLine seems to use getc internally, I'm having trouble understanding why this is the case.

A bug or two, or where am I going wrong?
Jun 12 2006
Deewiant wrote:
> With the above UTF-8 code, I expect the program's source to be output, also in UTF-8. However, I get ASCII output, and on line three appears everyone's favourite "Error: 4invalid UTF-8 sequence".
> [...]
> A bug or two, or where am I going wrong?

I had a quick look at the std.stream sources and it seems std.stream isn't really Unicode-aware. getc() assumes the stream to be in UTF-8 and returns a char, which means it returns a single UTF-8 code unit, not a full character. getcw(), on the other hand, assumes the stream is in UTF-16 and returns a UTF-16 code unit as a wchar.

You are printing individual UTF-8 code units as characters, which triggers your error.

If D claims to have full Unicode support, std.stream ought to either have decoding routines that return a dchar, or have a UTF-decoding wrapper stream, in which case std.stream.getc() ought to return a ubyte, not a char...

/Oskar
Jun 12 2006
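The distinction Oskar draws between a code unit and a full character can be seen in a few lines. This is a minimal sketch in present-day D, not the 2006 std.stream API; the helper names codeUnits and codePoints are made up for illustration.

```d
import std.stdio : writefln;

/// Collects the raw UTF-8 code units (bytes) of a string.
/// Hypothetical helper, named for this example only.
ubyte[] codeUnits(string s)
{
    ubyte[] units;
    foreach (char unit; s)       // iterating as char: no decoding happens
        units ~= cast(ubyte) unit;
    return units;
}

/// Collects the decoded code points of a string.
/// Hypothetical helper, named for this example only.
dchar[] codePoints(string s)
{
    dchar[] points;
    foreach (dchar point; s)     // iterating as dchar: UTF-8 is decoded
        points ~= point;
    return points;
}

void main()
{
    string s = "\u00E4";         // U+00E4, stored as the two bytes C3 A4

    foreach (unit; codeUnits(s))
        writefln("code unit  0x%02X", unit);
    foreach (point; codePoints(s))
        writefln("code point U+%04X", cast(uint) point);
}
```

Printing each element of codeUnits as if it were a character is what the getc() loop in the original post does. Neither byte of the pair is valid UTF-8 on its own.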
Oskar Linde wrote:
> getc() assumes the stream to be in UTF-8 and returns a char, which means it returns a UTF-8 code unit, not a full character. [...] You are printing individual UTF-8 code units as characters, which triggers your error.

Thanks for the explanation. Unfortunately, I'm not knowledgeable enough in these matters to correct the problem.

So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I combine the former two into a single "char"? Say I check whether the char received from getc() is greater than 127 (outside ASCII) and, if it is, I store it and the following char in two ubytes. Now what? How do I get a char?
Jun 12 2006
Deewiant wrote:
> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I combine the former two into a single "char"? [...] Now what? How do I get a char?

Use

    dchar std.utf.decode(char[],int)

even if it can be quite clumsy. A hint is to use

    std.utf.UTF8stride[c]

to get the total number of bytes in the sequence that starts with the byte c.

/Oskar
Jun 12 2006
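A short sketch of the two tools Oskar names, written against current std.utf. Two assumptions to note: the thread's UTF8stride is indexed like a table, whereas the sketch uses the stride function that std.utf offers today, and decodeAll is a hypothetical helper.

```d
import std.stdio : writefln;
import std.utf : decode, stride;

/// Decodes every code point in a UTF-8 buffer.
/// Hypothetical helper, named for this example only.
dchar[] decodeAll(const(char)[] buf)
{
    dchar[] result;
    size_t i = 0;
    while (i < buf.length)
        result ~= decode(buf, i);   // decode advances i past the whole sequence
    return result;
}

void main()
{
    string s = "a\u00E4\u0394";     // 'a', U+00E4 and U+0394

    // stride reports how many code units the sequence at an index occupies.
    writefln("sequence at index 0 is %s unit(s) long", stride(s, 0));
    writefln("sequence at index 1 is %s unit(s) long", stride(s, 1));

    foreach (c; decodeAll(s))
        writefln("U+%04X", cast(uint) c);
}
```

The index argument is passed by reference and updated by decode, which is why the post that follows needs a "dummy" variable even when decoding a single character.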
Oskar Linde wrote:
> dchar std.utf.decode(char[],int)
> even if it can be quite clumsy. A hint is to use: std.utf.UTF8stride[c] to get the total number of bytes that are part of the starting token c.

Thanks, that works. What I did was write a short function looking like this:

    dchar myGetchar(Stream s) {
        char c = s.getc;

        // ASCII
        if (c <= 127)
            return c;
        else {
            // UTF-8
            char[] str = new char[2];
            str[0] = c;
            str[1] = s.getc;

            // dummy var, needed by decode
            size_t i = 0;
            return decode(str, i);
        }
    }

Using that in place of getc() pretty much does the trick.

Unfortunately, when reading from files instead of stdin, I still run into the problem of \r\n being converted to \r\r\n. I think I know why, too: '\n' is being converted into \r\n because I'm on a Windows platform. I use the following workaround:

    if (c == '\r') {
        char d = s.getc;
        if (d == '\n')
            return '\n';
        else {
            s.ungetc(d);
            return c;
        }
    }
Jun 12 2006
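Deewiant's explanation of the doubled \r can be replayed on plain strings. In this sketch, textModeWrite is a hypothetical stand-in that simulates an output stream expanding every '\n' to "\r\n". That behaviour is an assumption taken from the post, not a call into any real stream API. collapseCrLf applies the post's workaround to a whole string instead of a character at a time.

```d
import std.stdio : writefln;

/// Simulates text-mode output that writes every '\n' as "\r\n".
/// Hypothetical stand-in; the expansion itself is the post's assumption.
string textModeWrite(string s)
{
    string result;
    foreach (char c; s)
    {
        if (c == '\n')
            result ~= "\r\n";
        else
            result ~= c;
    }
    return result;
}

/// Collapses each "\r\n" pair into a single '\n', leaving a lone '\r' alone.
string collapseCrLf(string s)
{
    string result;
    for (size_t i = 0; i < s.length; i++)
    {
        if (s[i] == '\r' && i + 1 < s.length && s[i + 1] == '\n')
            continue;            // drop the '\r'; the '\n' is appended next turn
        result ~= s[i];
    }
    return result;
}

void main()
{
    string fromFile = "a\r\nb";  // bytes as read from a DOS-format file

    // Passing the bytes straight through doubles the carriage return.
    writefln("%(%02X %)", cast(const(ubyte)[]) textModeWrite(fromFile));

    // Collapsing first yields a single "\r\n" again.
    writefln("%(%02X %)", cast(const(ubyte)[]) textModeWrite(collapseCrLf(fromFile)));
}
```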
Deewiant wrote:
> Thanks, that works. What I did was write a short function looking like this:
>
>     dchar myGetchar(Stream s) {
>     [...]
>             // UTF-8
>             char[] str = new char[2];
>             str[0] = c;
>             str[1] = s.getc;

This only works for a small subset of Unicode... For a more general implementation, change the last 3 lines to:

    char[6] str;
    str[0] = c;
    int n = std.utf.UTF8stride[c];
    if (n == 0xff)
        return cast(dchar)-1; // corrupt string
    for (int i = 1; i < n; i++)
        str[i] = s.getc;

> Unfortunately, when reading from files instead of stdin, I still run into the problem of \r\n being converted to \r\r\n. I think I know why, too: '\n' is being converted into \r\n because I'm on a Windows platform.

Yes. This is another proof that std.stream is lacking functionality. Because of this conversion, it is clear that std.stream isn't a binary stream, and as such it ought to be a UTF-8, UTF-16 or UTF-32 encoded text stream. In those cases std.stream should have a function returning a dchar, just like the code above.

/Oskar
Jun 12 2006
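The generalised reader can be sketched without std.stream. ByteSource below is a hypothetical stand-in for a stream that hands out one byte per call, and sequenceLength replaces the UTF8stride lookup with explicit bit tests limited to four-byte sequences. Both are assumptions made for this example, not the Phobos implementation.

```d
import std.stdio : writefln;
import std.utf : decode;

/// Hypothetical stand-in for a byte stream: one byte per call to next().
struct ByteSource
{
    const(ubyte)[] data;
    size_t pos;

    bool empty() const { return pos >= data.length; }
    ubyte next() { return data[pos++]; }
}

/// Length of a UTF-8 sequence judged from its first byte;
/// 0 means the byte cannot start a sequence.
size_t sequenceLength(ubyte lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or invalid byte
}

/// Reads one whole code point; throws on malformed or truncated input.
dchar getDchar(ref ByteSource src)
{
    char[4] buf;
    buf[0] = cast(char) src.next();
    immutable n = sequenceLength(cast(ubyte) buf[0]);
    if (n == 0)
        throw new Exception("invalid UTF-8 lead byte");
    foreach (k; 1 .. n)
    {
        if (src.empty)
            throw new Exception("truncated UTF-8 sequence");
        buf[k] = cast(char) src.next();
    }
    size_t index = 0;                     // decode needs an index by reference
    return decode(buf[0 .. n], index);    // also rejects bad continuation bytes
}

void main()
{
    // 'A', U+00E4 and U+20AC as raw UTF-8 bytes
    immutable ubyte[] bytes = [0x41, 0xC3, 0xA4, 0xE2, 0x82, 0xAC];
    auto src = ByteSource(bytes);
    while (!src.empty)
        writefln("U+%04X", cast(uint) getDchar(src));
}
```

Throwing on a bad lead byte is a design choice for this sketch; the code in the thread returns cast(dchar)-1 instead.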
Oskar Linde wrote:
> This only works for a small subset of Unicode...

Thanks for correcting it, I was unsure myself.

> For a more general implementation, change the last 3 lines to:
>
>     char[6] str;

6? Aren't 4 UTF-8 units enough for all of Unicode? I see that UTF8stride also has 5 or 6 as some of its elements; why is that?

> Yes. This is another proof that std.stream is lacking functionality. [...]

Yes, I agree wholeheartedly. It would appear that the std.stream classes are for textual input, but currently some of the methods choke on UTF-x input. In addition to a getcd() method to complement getc() and getcw(), a getb() method returning a ubyte might also be handy, for when one really wants byte-by-byte input. Perhaps getc()'s signature should actually be changed to that, since after all that's all it currently seems to be doing.
Jun 12 2006
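The question about 5 and 6 is not answered in the thread. As far as the UTF-8 specifications go, RFC 2279 defined sequences of up to six bytes covering 31-bit values, and RFC 3629 later restricted UTF-8 to code points up to U+10FFFF, for which four bytes suffice. Table entries of 5 and 6 would be consistent with the older definition, though nothing in the thread confirms that this is why Phobos had them. The arithmetic can be checked with a small sketch; payloadBits is a hypothetical helper.

```d
import std.stdio : writefln;
import std.utf : codeLength;

/// Payload bits carried by an n-byte sequence under the original
/// one-to-six-byte UTF-8 layout. Hypothetical helper for this example.
uint payloadBits(uint n)
{
    // One byte carries 7 bits. Otherwise the lead byte carries (7 - n) bits
    // and each of the (n - 1) continuation bytes carries 6.
    return n == 1 ? 7 : (7 - n) + 6 * (n - 1);
}

void main()
{
    foreach (uint n; 1 .. 7)
        writefln("%s byte(s): %s payload bits", n, payloadBits(n));

    // U+10FFFF, the highest code point, fits in 21 bits, so four bytes are enough.
    writefln("U+10FFFF takes %s UTF-8 code units", codeLength!char(dchar.max));
}
```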
Deewiant wrote:
> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I combine the former two into a single "char"? [...] Now what? How do I get a char?

Keep using readLine. The entire line should be made of valid UTF-8 characters.

One way to address this would be to add getUTF8char, getUTF16char and getUTF32char, which would return char[], wchar[] and dchar, respectively, the first one returning an array of 1 to 4 elements, and the second 1 or 2.

--
Carlos Santander Bernal
Jun 12 2006
Carlos Santander wrote:
> Keep using readLine. The entire line should be made of valid UTF8 characters.

That would work, but I was originally using only getc(), so it's easier for me to replace that than to change half of my input paradigm. <g>

> Maybe something to do about it would be to add getUTF8char, getUTF16char and getUTF32char, which would return char[], wchar[] and dchar, respectively, the first one returning an array of 1 to 4 elements, and the second 1 or 2.

Something like that would indeed be handy. It's too bad std.stream is lacking in some respects, such as this.
Jun 12 2006