digitalmars.D.bugs - A couple of issues with UTF
- Georg Wrede (120/120) Nov 18 2005 Haa, look ma, no hands!
- Jari-Matti Mäkelä (4/13) Nov 18 2005 I think you have some serious issues with the political correctness of
- Georg Wrede (9/25) Nov 18 2005 ROFL !
- Jari-Matti Mäkelä (7/30) Nov 18 2005 write(...) writes the source value to the stream byte by byte.
- Georg Wrede (11/45) Nov 18 2005 Oops.
- Georg Wrede (22/58) Nov 23 2005 At first the Java style where one chains streams seemed terribly
- Bruno Medeiros (7/14) Nov 18 2005 Because that's not the BOM, it's (an int with) the string length...
Haa, look ma, no hands!

So we have implicit UTF conversion. The following compiles ok!

$ cat outtest2fixed.d
import std.stream;

void main()
{
    char[] c = "Saatana perkele"c;
    wchar[] w = "Saatana perkele"w;
    File of1 = new File;
    File of2 = new File;
    of1.create("/tmp/1f.txt");
    of2.create("/tmp/2f.txt");
    of1.write(c);
    of1.write(c);
    of1.write(w);
    of1.write(w);
    of2.write(w);
    of2.write(w);
    of2.write(c);
    of2.write(c);
    of1.close();
    of2.close();
}

$ hexdump -C /tmp/1f.txt
00000000 0f 00 00 00 53 61 61 74 61 6e 61 20 70 65 72 6b |....Saatana perk|
00000010 65 6c 65 0f 00 00 00 53 61 61 74 61 6e 61 20 70 |ele....Saatana p|
00000020 65 72 6b 65 6c 65 0f 00 00 00 53 00 61 00 61 00 |erkele....S.a.a.|
00000030 74 00 61 00 6e 00 61 00 20 00 70 00 65 00 72 00 |t.a.n.a. .p.e.r.|
00000040 6b 00 65 00 6c 00 65 00 0f 00 00 00 53 00 61 00 |k.e.l.e.....S.a.|
00000050 61 00 74 00 61 00 6e 00 61 00 20 00 70 00 65 00 |a.t.a.n.a. .p.e.|
00000060 72 00 6b 00 65 00 6c 00 65 00 |r.k.e.l.e.|

$ hexdump -C /tmp/2f.txt
00000000 0f 00 00 00 53 00 61 00 61 00 74 00 61 00 6e 00 |....S.a.a.t.a.n.|
00000010 61 00 20 00 70 00 65 00 72 00 6b 00 65 00 6c 00 |a. .p.e.r.k.e.l.|
00000020 65 00 0f 00 00 00 53 00 61 00 61 00 74 00 61 00 |e.....S.a.a.t.a.|
00000030 6e 00 61 00 20 00 70 00 65 00 72 00 6b 00 65 00 |n.a. .p.e.r.k.e.|
00000040 6c 00 65 00 0f 00 00 00 53 61 61 74 61 6e 61 20 |l.e.....Saatana |
00000050 70 65 72 6b 65 6c 65 0f 00 00 00 53 61 61 74 61 |perkele....Saata|
00000060 6e 61 20 70 65 72 6b 65 6c 65 |na perkele|

$ cat /tmp/1f.txt
Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
$ cat /tmp/2f.txt
Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
$

HOWEVER, I have a couple of issues here.

First of all, it looks like we don't have implicit conversion, but rather that the strings get copied to the output stream byte by byte!

Now, the standard says that you are not allowed to have illegal octets or characters in a UTF file. Of any width. Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same file!!!!!

Further, it seems like write puts the BOM before every string. That is definitely illegal.

(The operating system let me "cat" the files to the screen, and tried its best to show them in a reasonable way (as you see above). But it really would not have had to.)

---

What we could have happen is that the first string output to the stream causes the stream to choose its UTF width (and theoretically the endianness, too). (This is what the OS does when choosing whether to open in byte width or wider, according to the Linux documentation.) And whenever somebody tries to stuff "the wrong" crap there, do either of the following:

- implicitly convert the string to the right UTF
- throw an error

---

While D is in pre-1.0, I think we should at first decide that streams have to be opened with the UTF specified. Since the compiler should know the type of all the strings (see my other post today), it can then insert code for the appropriate runtime conversion.

Since the compiler knows the type of the string, it might be suggested that the first output string defines the stream type. I think that would be unwise. But _only_ for the same reason D demands a default in a switch, denies a semicolon right after an if clause, etc. That is, to help the programmer not shoot himself in the foot. There is _no_ technical reason why it couldn't be set by the first string automatically.

HOWEVER, good table manners ask for reasonable defaults where at all possible. Such a default would be the UTF width and endianness that is "natural" on the particular platform. (If D is ever ported to a platform that doesn't handle UTF, then the Natural Default of course is None. That is, one has to choose manually when opening the stream.)

---

Similarly, if we want to implement our INPUT streams correctly, then they should _definitively_ choose their UTF type before the first time the application gets to read from the stream.

FOR THE SITUATIONS where one has to process the first octets before enough of the stream has been seen to know which UTF type it is, THEN in THAT CASE a raw input stream of e.g. ubyte should be mandatory instead. Or more to the point, UTF streams should not be used then.

---

I have to remark on "since the compiler knows the type of the string" above. Since this is such rocket science, DO REMEMBER that it "knows" because it looks at the TYPE (as in char[], wchar[], dchar[]) and not the CONTENTS of the string at that time. :-) Just to keep apples and oranges in order...

---

What I called the BOM above does, incidentally, not look like it should, in the above file dumps anyway.

---

Before we continue, I think everybody should read the following:

www.unicode.org/faq/
Nov 18 2005
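To make the "implicitly convert the string to the right UTF" option concrete, here is a minimal sketch in present-day D (D2 with current Phobos, not the 2005 std.stream). The names `Utf`, `encode` and `TextOut` are made up for illustration and are not part of any library; the only library calls assumed are `std.utf.toUTF8`/`toUTF16`/`toUTF32` and `std.stdio.File.rawWrite`. The encoding is fixed when the stream is created, and every string is converted to it no matter what its static type is.

```d
import std.stdio : File;
import std.traits : isSomeString;
import std.utf : toUTF8, toUTF16, toUTF32;

/// Hypothetical: the encodings a text stream can be opened with.
enum Utf { utf8, utf16, utf32 }

/// Convert any D string type to the bytes of the target encoding.
/// UTF-16 and UTF-32 code units come out in native byte order.
immutable(ubyte)[] encode(S)(S text, Utf target) if (isSomeString!S)
{
    final switch (target)
    {
        case Utf.utf8:  return cast(immutable(ubyte)[]) toUTF8(text);
        case Utf.utf16: return cast(immutable(ubyte)[]) toUTF16(text);
        case Utf.utf32: return cast(immutable(ubyte)[]) toUTF32(text);
    }
}

/// Hypothetical text stream: one encoding per stream, chosen up front.
struct TextOut
{
    File file;
    Utf target;

    /// The compiler picks the conversion from the static type of `text`.
    void put(S)(S text) if (isSomeString!S)
    {
        file.rawWrite(encode(text, target));
    }
}

void main()
{
    string narrow = "Saatana perkele";   // UTF-8 source
    wstring wide = "Saatana perkele"w;   // UTF-16 source

    // Once the target is fixed, the source width no longer matters.
    assert(encode(narrow, Utf.utf8) == encode(wide, Utf.utf8));
    assert(encode(narrow, Utf.utf16).length == 30);

    auto sink = TextOut(File.tmpfile(), Utf.utf8);
    sink.put(narrow);
    sink.put(wide);
    assert(sink.file.tell == 30);        // 15 + 15 bytes, no length prefixes
}
```

The "throw an error" alternative would be the same struct with `put` rejecting any string whose type does not match `target`. Note that this sketch writes no BOM and no length prefix; it only shows the conversion step.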
Georg Wrede wrote:
> <snip>
> $ cat /tmp/1f.txt
> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
> $ cat /tmp/2f.txt
> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
> $
> HOWEVER, I have a couple of issues here.

I think you have some serious issues with the political correctness of the message here ;)
Nov 18 2005
Jari-Matti Mäkelä wrote:
> Georg Wrede wrote:
>> <snip>
>> $ cat /tmp/1f.txt
>> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
>> $ cat /tmp/2f.txt
>> Saatana perkeleSaatana perkeleSaatana perkeleSaatana perkele
>> $
>> HOWEVER, I have a couple of issues here.
>
> I think you have some serious issues with the political correctness of the message here ;)

ROFL !

I trust the non-Finns use inborn Duck Typing. If it doesn't look like obscenities, then it isn't. :-)

Or maybe I have an encryptor that turns "Hail Mary" into that string.

Or maybe repeatedly drawing a picture out of iron wire has my hands bleeding, and I'm getting pissed off here. Maybe I should switch to clay models...

But hey, it was USASCII all over!
Nov 18 2005
Georg Wrede wrote:
> <snip>
> HOWEVER, I have a couple of issues here. First of all, it looks like we don't have implicit conversion, but rather that the strings get copied to the output stream byte by byte!

write(...) writes the source value to the stream byte by byte.

> Now, the standard says that you are not allowed to have illegal octets, or characters in a UTF file. Of any width. Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same file!!!!! Further, seems like write puts the BOM before every string. That is definitely illegal.

That is illegal if you're trying to create a valid _text_ file. AFAIK the normal File is just a regular OutputStream, it doesn't care about UTF.

> What we could have happen is, that the first string output to the stream, causes the stream to choose the stream UTF width (and theoretically the endianness, too). (This is what the OS does when choosing whether to open in byte width or wider, according to linux documentation.) And whenever somebody tries to stuff "the wrong" crap there, do either of the following:
> - implicitly convert the string to the right UTF
> - throw error

I think this should not be the default for all streams. Maybe it would be better to have a new TextStream class that supports full Unicode?
Nov 18 2005
Jari-Matti Mäkelä wrote:
> Georg Wrede wrote:
>> <snip>
>> HOWEVER, I have a couple of issues here. First of all, it looks like we don't have implicit conversion, but rather that the strings get copied to the output stream byte by byte!
>
> write(...) writes the source value to the stream byte by byte.

Oops.

Well, in that case, we should give it uchar[] when we don't want fanciness. Or void[], right!

Which should make it EITHER illegal to write [c/w/d]char[] to it -- OR we should have different kinds of streams. Some of which would be UTF savvy, some text, some void streams.

>> Now, the standard says that you are not allowed to have illegal octets, or characters in a UTF file. Of any width. Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same file!!!!! Further, seems like write puts the BOM before every string. That is definitely illegal.
>
> That is illegal if you're trying to create a valid _text_ file. AFAIK the normal File is just a regular OutputStream, it doesn't care about UTF.

We should have a set of different streams. Hey, Java has like millions to choose from! You can even join them to get, say, a "buffered, character-code-translating, rot-13, foo-izing" stream!!!

>> What we could have happen is, that the first string output to the stream, causes the stream to choose the stream UTF width (and theoretically the endianness, too). (This is what the OS does when choosing whether to open in byte width or wider, according to linux documentation.) And whenever somebody tries to stuff "the wrong" crap there, do either of the following:
>> - implicitly convert the string to the right UTF
>> - throw error
>
> I think this should not be the default for all streams. Maybe it would be better to have a new TextStream class that supports full Unicode?

Of course!
Nov 18 2005
Georg Wrede wrote:
> Jari-Matti Mäkelä wrote:
>> Georg Wrede wrote:
>>> <snip>
>>> HOWEVER, I have a couple of issues here. First of all, it looks like we don't have implicit conversion, but rather that the strings get copied to the output stream byte by byte!
>>
>> write(...) writes the source value to the stream byte by byte.
>
> Oops. Well, in that case, we should give it uchar[] when we don't want fanciness. Or void[], right! Which should make it EITHER illegal to write [c/w/d]char[] to it -- OR we should have different kinds of streams. Some of which would be UTF savvy, some text, some void streams.
>
>>> Now, the standard says that you are not allowed to have illegal octets, or characters in a UTF file. Of any width. Therefore, you cannot put UTF-8 and UTF-16 (or UTF-32) in the same file!!!!! Further, seems like write puts the BOM before every string. That is definitely illegal.
>>
>> That is illegal if you're trying to create a valid _text_ file. AFAIK the normal File is just a regular OutputStream, it doesn't care about UTF.
>
> We should have a set of different streams. Hey, Java has like millions to choose from! You can even join them to get, say, a "buffered, character-code-translating, rot-13, foo-izing" stream!!!

At first the Java style, where one chains streams, seemed terribly inefficient. But later I understood that it wasn't, it just looked inefficient.

We could have raw input and output streams, and then a set of conversion streams (or actually filters), like this:

OutStream os = new OutStream("foo"); // opens a raw outstream
StreamBuffer sb = new StreamBuffer(os);
ConvStream out = new ConvStream(UTF8, ISO8859-15, sb);
...
char[] mytext = "kjsldkfjlskdfjslkd";
fwritefln(out, mytext);

Since StreamBuffer eventually outputs everything, one doesn't even have to worry about the buffer getting filled up in "mid-char" if doing output in UTF (not the example above), since the rest of the char gets output later anyhow.

I think this looks clean and easy to maintain (for the library maintainer), and its use is straightforward, flexible, and conceptually clear.

This would also bring tighter locality to the whole input/output system, since every stream only does its own thing. With this setup it also becomes much easier for the programmer to write his own stream filters, without having to become a Stream Guru first.
Nov 23 2005
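The chaining idea above can be tried out in a few lines of present-day D. This is only a sketch under stated assumptions: `ByteSink`, `MemorySink`, `BufferedSink` and `Utf16LeSink` are hypothetical names (they stand in for the OutStream, StreamBuffer and ConvStream of the post, none of which exist in any library), the raw end of the chain collects bytes in memory instead of opening a file, and the conversion is UTF-8 to UTF-16LE rather than ISO 8859-15, because that needs no mapping table. The library calls assumed are `std.utf.toUTF16` and `std.bitmanip.nativeToLittleEndian`.

```d
import std.bitmanip : nativeToLittleEndian;
import std.utf : toUTF16;

/// Hypothetical interface: anything that accepts raw bytes.
interface ByteSink
{
    void put(const(ubyte)[] chunk);
    void flush();
}

/// Raw end of the chain. Collects bytes in memory; a real one would write a file.
class MemorySink : ByteSink
{
    ubyte[] data;
    void put(const(ubyte)[] chunk) { data ~= chunk; }
    void flush() {}
}

/// Filter: holds bytes back until `capacity` is reached, then forwards them.
class BufferedSink : ByteSink
{
    private ByteSink next;
    private ubyte[] pending;
    private size_t capacity;

    this(ByteSink next, size_t capacity)
    {
        this.next = next;
        this.capacity = capacity;
    }

    void put(const(ubyte)[] chunk)
    {
        pending ~= chunk;
        if (pending.length >= capacity)
            flush();
    }

    void flush()
    {
        next.put(pending);
        pending = null;
        next.flush();
    }
}

/// Filter: takes UTF-8 text, passes UTF-16LE bytes down the chain.
class Utf16LeSink
{
    private ByteSink next;
    this(ByteSink next) { this.next = next; }

    void put(const(char)[] text)
    {
        foreach (wchar unit; toUTF16(text))
        {
            ubyte[2] bytes = nativeToLittleEndian(cast(ushort) unit);
            next.put(bytes[]);
        }
    }

    void flush() { next.flush(); }
}

void main()
{
    auto raw = new MemorySink;
    auto buffered = new BufferedSink(raw, 8);
    auto text = new Utf16LeSink(buffered);

    text.put("Saatana");              // 7 characters
    assert(raw.data.length == 8);     // one full buffer forwarded so far
    text.flush();                     // the remainder follows
    assert(raw.data.length == 14);    // 7 UTF-16 code units

    ubyte[] expected = [0x53, 0x00, 0x61, 0x00];
    assert(raw.data[0 .. 4] == expected);
}
```

Each class only does its own job, which is the locality argument made in the post: the buffer knows nothing about encodings, and the converter knows nothing about buffering.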
Georg Wrede wrote:
> Further, seems like write puts the BOM before every string. That is definitely illegal.
> ...
> What I called BOM above, does incidentally not look like it should, in the above file dumps anyway.

Because that's not the BOM, it's (an int with) the string length...

--
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 18 2005
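The length-prefix reading is easy to check against the hexdump in the first post. A minimal sketch in present-day D, assuming the prefix is a 4-byte little-endian integer as it appears in the dump; `startsWithBom` and `lengthPrefix` are hypothetical helper names, and the only library call assumed is `std.bitmanip.littleEndianToNative`.

```d
import std.bitmanip : littleEndianToNative;
import std.stdio : writeln;

/// True if the data starts with a UTF-8 or UTF-16 byte order mark.
bool startsWithBom(const(ubyte)[] data)
{
    static immutable ubyte[] utf8Bom = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] utf16Le = [0xFF, 0xFE];
    static immutable ubyte[] utf16Be = [0xFE, 0xFF];

    foreach (bom; [utf8Bom, utf16Le, utf16Be])
        if (data.length >= bom.length && data[0 .. bom.length] == bom)
            return true;
    return false;
}

/// Read the first four bytes as a little-endian unsigned integer.
uint lengthPrefix(const(ubyte)[] data)
{
    ubyte[4] raw = data[0 .. 4];
    return littleEndianToNative!uint(raw);
}

void main()
{
    // First eight bytes of /tmp/1f.txt from the dump: 0f 00 00 00 53 61 61 74
    ubyte[] head = [0x0f, 0x00, 0x00, 0x00, 0x53, 0x61, 0x61, 0x74];

    assert(!startsWithBom(head));       // no BOM here
    assert(lengthPrefix(head) == 15);   // "Saatana perkele" has 15 characters
    writeln(lengthPrefix(head));
}
```

The same 0f 00 00 00 precedes the wchar[] copies in the dump, where 30 bytes of payload follow, so the prefix counts array elements (code units), not bytes.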