digitalmars.D.bugs - writef doesn't work on Windows XP console
- Roberto Mariottini (37/37) Dec 01 2004 Hi.
- Ben Hinkle (4/37) Dec 01 2004 This is expected behavior. Writef takes utf-8 strings hence the error th...
- Roberto Mariottini (6/24) Dec 01 2004 ^^^^^^^^^^^^^^^^^^^^
- Anders F Björklund (9/18) Dec 01 2004 It works just fine, but you *have* to set your console to UTF-8.
- Stewart Gordon (10/18) Dec 01 2004
- Anders F Björklund (7/13) Dec 01 2004 Sounds like a good idea. I have some very small encoding additions...
- Stewart Gordon (11/19) Dec 01 2004 I wrote that, but then discovered that the 'norm' (if Phobos is anything...
- Anders F Björklund (6/14) Dec 01 2004 Or one can do like Java and use wchar[] and wchar, and ignore the bloat
- Stewart Gordon (8/9) Dec 01 2004
- Ben Hinkle (17/30) Dec 01 2004 Note std.stream now has BOM support. Call readBOM or writeBOM in
- Roberto Mariottini (5/18) Dec 01 2004 The Windows function to accomplish this task is the already cited CharTo...
- Ben Hinkle (14/35) Dec 03 2004 Actually on second thought I'm getting hesitant to put this kind of thi...
- kris (5/12) Dec 03 2004 I'd like to encourage you to do so. If you take that approach I'll write...
- Roberto Mariottini (7/12) Dec 01 2004 Windows XP does *not* support UTF-8 consoles. Neither Windows NT/2000.
- Anders F Björklund (12/19) Dec 01 2004 Moral of the story being that 8-bit strings should be declared ubyte[].
- Roberto Mariottini (12/25) Dec 02 2004 In article ...
- Anders F Björklund (22/40) Dec 02 2004 That is why it was skipped, but you still need to be aware of the
- Regan Heath (20/61) Dec 02 2004 I think it's a good idea. I reckon it will initially cause people to be ...
- Anders F Björklund (5/22) Dec 03 2004 I'm not using Windows, but a modern system with an UTF-8 console ;-)
Hi.
I can't make writef work on Windows XP using non-7bit-ASCII characters.
The attached test program outputs:

-- untranslated --
-- translated --
äöüßÄÖÜ
Error: invalid UTF-8 sequence

Test program:

import std.stdio;
import std.c.stdio;
import std.c.windows.windows;

extern (Windows)
{
    export BOOL CharToOemW(
        LPCWSTR lpszSrc,  // string to translate
        LPSTR lpszDst     // translated string
    );
}

int main()
{
    puts("-- untranslated --");
    puts("äöüßÄÖÜ");
    writef("äöüßÄÖÜ\n");
    puts("-- translated --");
    wchar[] mess = "äöüßÄÖÜ";
    char[] OEMmess = new char[mess.length];
    CharToOemW(mess, OEMmess);
    puts(OEMmess);
    writef(OEMmess);
    return 0;
}
Dec 01 2004
"Roberto Mariottini" <Roberto_member pathlink.com> wrote in message news:cok6li$1pkp$1 digitaldaemon.com...Hi. I can't make writef work on Windows XP using non-7bit-ASCII characters. The attached test program outputs: -- untranslated -- -- translated -- äöüßÄÖÜ Error: invalid UTF-8 sequence Test program: import std.stdio; import std.c.stdio; import std.c.windows.windows; extern (Windows) { export BOOL CharToOemW( LPCWSTR lpszSrc, // string to translate LPSTR lpszDst // translated string ); } int main() { puts("-- untranslated --"); puts("äöüßÄÖÜ"); writef("äöüßÄÖÜ\n"); puts("-- translated --"); wchar[] mess = "äöüßÄÖÜ"; char[] OEMmess = new char[mess.length]; CharToOemW(mess, OEMmess); puts(OEMmess); writef(OEMmess); return 0; }This is expected behavior. Writef takes utf-8 strings hence the error that the supplied string is not in utf-8 (because it isn't).
Dec 01 2004
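To make the distinction concrete: writef only checks that its argument is
well-formed UTF-8, it does not transcode anything for the console. A minimal
sketch, assuming std.utf and its toUTF8 are available in the Phobos of that
time (the OEM buffer from CharToOemW, by contrast, must only ever go to C
routines such as puts):

import std.stdio;   // writef
import std.utf;     // toUTF8 - assumed present in this era's Phobos

int main()
{
    wchar[] mess = "äöüßÄÖÜ";     // UTF-16 literal
    char[] utf8 = toUTF8(mess);   // well-formed UTF-8: writef accepts it
    writef(utf8);                 // no "invalid UTF-8 sequence" error here,
    writef("\n");                 // but how it renders still depends on the
                                  // console's code page (see the follow-ups)
    return 0;
}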
In article <coketl$25kv$1 digitaldaemon.com>, Ben Hinkle says...
[...]
>> int main()
>> {
>>     puts("-- untranslated --");
>>     puts("äöüßÄÖÜ");
>>     writef("äöüßÄÖÜ\n");
       ^^^^^^^^^^^^^^^^^^^^

The first writef uses an UTF-8 string, but it doesn't print what expected.

>>     puts("-- translated --");
>>     wchar[] mess = "äöüßÄÖÜ";
>>     char[] OEMmess = new char[mess.length];
>>     CharToOemW(mess, OEMmess);
>>     puts(OEMmess);
>>     writef(OEMmess);
>>     return 0;
>> }
>
> This is expected behavior. Writef takes utf-8 strings hence the error that
> the supplied string is not in utf-8 (because it isn't).

Either one should work, but both don't work.

Ciao
Dec 01 2004
Roberto Mariottini wrote:

> The first writef uses an UTF-8 string, but it doesn't print what
> expected. Either one should work, but both don't work.

It works just fine, but you *have* to set your console to UTF-8.
D does *not* support consoles or shells which are not Unicode... :(

Simple example:

import std.stdio;

void main()
{
    writefln("äöüßÄÖÜ");
}

In UTF-8 Terminal mode, this prints:

äöüßÄÖÜ

In Latin-1 Terminal mode, you get:

äöüÃÃÃÃ

I'm assuming it prints similar garbage on a non-Unicode XP console ?
(being a Mac user myself I have no idea how to change it on Windows)

--anders
Dec 01 2004
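On Windows, the usual way to ask for a UTF-8 console is to switch the output
code page to 65001 before printing. A hedged sketch, declaring the Win32 call
by hand on the assumption that it is not in std.c.windows.windows of that
era; as the follow-ups point out, whether the NT/XP console then actually
renders UTF-8 is another matter:

import std.stdio;
import std.c.windows.windows;   // BOOL

// Real kernel32 function, declared here by hand.
extern (Windows) export BOOL SetConsoleOutputCP(uint wCodePageID);

int main()
{
    SetConsoleOutputCP(65001);   // 65001 = CP_UTF8
    writefln("äöüßÄÖÜ");         // writefln emits the string as UTF-8 bytes
    return 0;
}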
Anders F Björklund wrote:

> Roberto Mariottini wrote:
>
>> The first writef uses an UTF-8 string, but it doesn't print what
>> expected. Either one should work, but both don't work.
>
> It works just fine, but you *have* to set your console to UTF-8.
> D does *not* support consoles or shells which are not Unicode... :(
<snip>

A while back I suggested writing some classes to do text file I/O, which
would have conversion capabilities built in.

http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089

I guess it would extend to console I/O.

Stewart.

--
My e-mail is valid but not my primary mailbox.  Please keep replies on
the 'group where everyone may benefit.
Dec 01 2004
Stewart Gordon wrote:

> A while back I suggested writing some classes to do text file I/O,
> which would have conversion capabilities built in.
>
> http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089
>
> I guess it would extend to console I/O.

Sounds like a good idea. I have some very small encoding additions...
(just a lookup for each supported charset, without entire icu/iconv)

http://www.algonet.se/~afb/d/mapping.zip

And I think it should use char[] instead of dchar/dchar[], but that's
rather minor (and it should probably overload all three string types)

--anders
Dec 01 2004
Anders F Björklund wrote:

> Stewart Gordon wrote:
>
>> A while back I suggested writing some classes to do text file I/O,
>> which would have conversion capabilities built in.
>>
>> http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089
<snip>
> And I think it should use char[] instead of dchar/dchar[], but that's
> rather minor (and it should probably overload all three string types)

I wrote that, but then discovered that the 'norm' (if Phobos is anything
to go by) is for strings to be manipulated as UTF-8, while dchar gets
used for individual characters.

Maybe it should 'normally' use char[]. After all, that's the most compact
for text in alphabets below U+0800. But you're probably right that it
should overload the lot.

Stewart.

--
My e-mail is valid but not my primary mailbox.  Please keep replies on
the 'group where everyone may benefit.
Dec 01 2004
Stewart Gordon wrote:

>> And I think it should use char[] instead of dchar/dchar[], but that's
>> rather minor (and it should probably overload all three string types)
>
> I wrote that, but then discovered that the 'norm' (if Phobos is
> anything to go by) is for strings to be manipulated as UTF-8, while
> dchar gets used for individual characters.
>
> Maybe it should 'normally' use char[]. After all, that's the most
> compact for text in alphabets below U+0800.

Or one can do like Java and use wchar[] and wchar, and ignore the bloat
for ASCII strings - and hack in support for surrogates some other way...

Most of the consoles mentioned only support old 16-bit Unicode anyway ?

> But you're probably right that it should overload the lot.

It's the D way :-)

--anders
Dec 01 2004
Anders F Björklund wrote:
<snip>
> Most of the consoles mentioned only support old 16-bit Unicode anyway ?
<snip>

MS-DOS, and hence DOS windows in Win9x, only support 8-bit IBM codepages.

Stewart.

--
My e-mail is valid but not my primary mailbox.  Please keep replies on
the 'group where everyone may benefit.
Dec 01 2004
"Stewart Gordon" <smjg_1998 yahoo.com> wrote in message news:cokn8o$2id0$1 digitaldaemon.com...Anders F Björklund wrote:Note std.stream now has BOM support. Call readBOM or writeBOM in EndianStream. Now that you mention it it might be nice to make another Stream subclass and add support for the "native" encodings. It sounds fun - I'll give it a shot. It should be pretty easy actually since you just override writeString and writeStringW to call some OS function to convert the string or char from utf to native encoding. Supporting arbitrary encodings would probably be left for non-phobos libraries since they would presumably require something like ICU or libiconv. So basically what I have in mind is that to write to stdout with native encoding you'd have to write import std.stream; ... stdoutn = NativeTextStream(stdout); stdoutn.writef(<some utf encoded string>); -BenRoberto Mariottini wrote:<snip> A while back I suggested writing some classes to do text file I/O, which would have conversion capabilities built in. http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089The first writef uses an UTF-8 string, but it doesn't print what expected. Either one should work, but both don't work.It works just fine, but you *have* to set your console to UTF-8. D does *not* support consoles or shells which are not Unicode... :(
Dec 01 2004
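To make the idea concrete, here is a minimal sketch of the Windows-side
conversion such writeString/writeStringW overrides might delegate to,
reusing the CharToOemW declaration from the original post. The helper name
is made up, the buffer sizing (one extra slot for the terminating NUL, no
allowance for double-byte OEM code pages) is an assumption, and error
handling is omitted; this is not the actual std.stream design:

import std.c.stdio;             // puts
import std.c.windows.windows;   // BOOL, LPCWSTR, LPSTR

extern (Windows)
{
    export BOOL CharToOemW(
        LPCWSTR lpszSrc,   // zero-terminated UTF-16 string to translate
        LPSTR lpszDst      // receives the OEM-encoded bytes
    );
}

// Hypothetical helper: UTF-16 in, OEM code page bytes out (NUL-terminated,
// ready for the C runtime). Assumes the input is zero-terminated, as D
// string literals are.
char[] utf16ToOem(wchar[] s)
{
    char[] oem = new char[s.length + 1];   // +1 for the NUL CharToOemW writes
    CharToOemW(s, cast(LPSTR) oem);
    return oem;
}

int main()
{
    wchar[] mess = "äöüßÄÖÜ";
    puts(cast(char*) utf16ToOem(mess));    // OEM bytes go to C routines only
    return 0;
}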
In article <cokqtc$2okj$1 digitaldaemon.com>, Ben Hinkle says...
[...]
> Now that you mention it it might be nice to make another Stream subclass
> and add support for the "native" encodings. It sounds fun - I'll give it
> a shot. It should be pretty easy actually since you just override
> writeString and writeStringW to call some OS function to convert the
> string or char from utf to native encoding.

The Windows function to accomplish this task is the already cited
CharToOemW().

> Supporting arbitrary encodings would probably be left for non-phobos
> libraries since they would presumably require something like ICU or
> libiconv. So basically what I have in mind is that to write to stdout
> with native encoding you'd have to write
>
> import std.stream;
> ...
> stdoutn = new NativeTextStream(stdout);
> stdoutn.writef(<some utf encoded string>);

The default stdout should be a NativeTextStream on Windows.

Ciao
Dec 01 2004
"Roberto Mariottini" <Roberto_member pathlink.com> wrote in message news:comhrd$26ug$1 digitaldaemon.com...In article <cokqtc$2okj$1 digitaldaemon.com>, Ben Hinkle says...fun -[...]Now that you mention it it might be nice to make another Stream subclass and add support for the "native" encodings. It soundsCharToOemW().I'll give it a shot. It should be pretty easy actually since you just override writeString and writeStringW to call some OS function to convert the string or char from utf to native encoding.The Windows function to accomplish this task is the already citedwithSupporting arbitrary encodings would probably be left for non-phobos libraries since they would presumably require something like ICU or libiconv. So basically what I have in mind is that to write to stdoutActually on second thought I'm getting hesitant to put this kind of thing into std.stream since it is so platform specific - the Mac's iconv API is mapped to libiconv using C preprocessor macros so the D code will have to hard-code in those symbol names (AFAIK). Also it looks like CharToOemW might not be on all Win95/98/Me systems. Each platform will have to get special code for how to handle this NativeTextStream stuff. It could get pretty messy for fairly small bang-for-buck. I'm leaning towards putting in an outside library that can handle arbitrary encodings - like libiconv or mango's ICU wrapper or something.native encoding you'd have to write import std.stream; ... stdoutn = NativeTextStream(stdout); stdoutn.writef(<some utf encoded string>);The default stdout should be a NativeTextStrem on Windows. Ciao
Dec 03 2004
Ben Hinkle wrote:

> Each platform will have to get special code for how to handle this
> NativeTextStream stuff. It could get pretty messy for fairly small
> bang-for-buck. I'm leaning towards putting in an outside library that
> can handle arbitrary encodings - like libiconv or mango's ICU wrapper
> or something.

I'd like to encourage you to do so. If you take that approach I'll write
an adapter for Mango.io, so there's more options for everyone.

I'd also like to see a Stream adapter for the ICU converters; perhaps
there will be cases where the 200+ ICU transcoders cover areas that iconv
does not?
Dec 03 2004
In article <cokk5u$2dt9$1 digitaldaemon.com>, Anders F Björklund says...

> Roberto Mariottini wrote:
>
>> The first writef uses an UTF-8 string, but it doesn't print what
>> expected. Either one should work, but both don't work.
>
> It works just fine, but you *have* to set your console to UTF-8.
> D does *not* support consoles or shells which are not Unicode... :(

Windows XP does *not* support UTF-8 consoles. Neither Windows NT/2000.
So the bug still applies.

I don't think D will go any further if it doesn't support non-English
versions of Windows.

Ciao
Dec 01 2004
Ben Hinkle wrote:

>> I can't make writef work on Windows XP using non-7bit-ASCII characters.
> [...snip...]
> This is expected behavior. Writef takes utf-8 strings hence the error
> that the supplied string is not in utf-8 (because it isn't).

Moral of the story being that 8-bit strings should be declared ubyte[].
Even if it makes you cast it to a pointer, before usage with C routines:

ubyte[] OEMmess = new ubyte[mess.length];
CharToOemW(mess, cast(LPSTR) OEMmess);
puts(cast(char *) OEMmess);

The "char" type in C, is known as "byte" in D. Confusingly enough.
Like Ben says, the D char type only accepts valid UTF-8 code units...

--anders

PS. No, it doesn't help that the C routines are declared as (char *)
when they really take (ubyte *) arguments. It's just as a shortcut
to avoid having to translate the C function declarations to D...

And of course, it also works just fine for ASCII-only strings.
(a char[] can be directly converted to char *, iff it is ASCII)
With non-US-ASCII characters, it doesn't work - as you've seen.
Dec 01 2004
In article <cokjmk$2d8u$1 digitaldaemon.com>, Anders F Björklund says...
[...]
> Moral of the story being that 8-bit strings should be declared ubyte[].
> Even if it makes you cast it to a pointer, before usage with C routines:

Sorry, I don't understand. Are you proposing to change any C function
prototype that uses "char*" to "ubyte*"? I agree that this would make
clear that D char[] are different from C char*. But it's a lot of work.

> ubyte[] OEMmess = new ubyte[mess.length];
> CharToOemW(mess, cast(LPSTR) OEMmess);
> puts(cast(char *) OEMmess);
>
> The "char" type in C, is known as "byte" in D. Confusingly enough.
> Like Ben says, the D char type only accepts valid UTF-8 code units...
>
> PS. No, it doesn't help that the C routines are declared as (char *)
> when they really take (ubyte *) arguments. It's just as a shortcut
> to avoid having to translate the C function declarations to D...
>
> And of course, it also works just fine for ASCII-only strings.
> (a char[] can be directly converted to char *, iff it is ASCII)
> With non-US-ASCII characters, it doesn't work - as you've seen.

With non-US-ASCII, but within the currently selected 8-bit OEM codepage,
it works. The problem is that UTF-8 doesn't get correctly translated to
IBM-850 (or 437, or ...) on Windows.

Ciao
Dec 02 2004
Roberto Mariottini wrote:

>> PS. No, it doesn't help that the C routines are declared as (char *)
>> when they really take (ubyte *) arguments. It's just as a shortcut
>> to avoid having to translate the C function declarations to D...
>
> Sorry, I don't understand.

I was being somewhat vague, sorry.

> Are you proposing to change any C function prototype that uses "char*"
> to "ubyte*"? I agree that this would make clear that D char[] are
> different from C char*. But it's a lot of work.

That is why it was skipped, but you still need to be aware of the
difference or it will cause subtle bugs like the one you encountered...
(actually is a huge pain, as soon as you leave the old ascii strings)

Anyway, if you stick non-UTF-8 strings in char[] variables you are
setting yourself up for "invalid UTF-8 sequence". So ubyte[] is better ?
They both convert to C's (char *) in the usual way (with a NUL added)

char[] and wchar[] should be enough for any strings internal to D,
you should only need to mess with 8-bit encodings for input/output...
(and then it should preferrably all be handled by a library routine)

>> And of course, it also works just fine for ASCII-only strings.
>> (a char[] can be directly converted to char *, iff it is ASCII)
>> With non-US-ASCII characters, it doesn't work - as you've seen.
>
> With non-US-ASCII, but within the currently selected 8-bit OEM codepage,
> it works. The problem is that UTF-8 doesn't get correctly translated to
> IBM-850 (or 437, or ...) on Windows.

I meant that you can output ASCII as UTF-8 and it will still work...
(mostly, except if you are stuck in EBCDIC or some other weird place)

writefln("hello world!"); // English, works about everywhere US-ASCII

But to output to the console on Windows (or other non-Unicode platform),
it needs to be translated to the local "code page" or "charset/encoding"
Like if you want to support characters beyond the 96 or so that are in
the ASCII subset, for instance if you live in Italy or Sweden.

writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8

And there is currently no functions in D to do that, as far as I know ?

Same thing applies to console input such as the "char[] args" params...
If you just echo those args on a non-Unicode console, you get errors!
(since then they are not really UTF-8 strings, but casted ubyte[]'s)

--anders
Dec 02 2004
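As an illustration of that last point, a sketch of checking whether
command-line arguments really are UTF-8 before echoing them. std.utf's
validate is assumed to be available (it throws on malformed sequences; the
exact exception type varies, so the catch is kept generic), and the fallback
branch is only one thing a program might choose to do:

import std.stdio;
import std.utf;    // validate - assumed present in this era's Phobos

int main(char[][] args)
{
    foreach (char[] arg; args)
    {
        try
        {
            validate(arg);          // throws if arg is not well-formed UTF-8
            writefln("%s", arg);    // safe to treat as a D string
        }
        catch (Exception e)         // exact exception type left generic
        {
            // Not UTF-8 (e.g. typed on a non-Unicode console):
            // treat it as raw bytes rather than as a char[] string.
            ubyte[] raw = cast(ubyte[]) arg;
            writefln("(argument is not UTF-8: %d bytes)", raw.length);
        }
    }
    return 0;
}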
On Thu, 02 Dec 2004 11:11:27 +0100, Anders F Björklund <afb algonet.se>
wrote:

> Roberto Mariottini wrote:
>
>>> PS. No, it doesn't help that the C routines are declared as (char *)
>>> when they really take (ubyte *) arguments. It's just as a shortcut
>>> to avoid having to translate the C function declarations to D...
>>
>> Sorry, I don't understand.
>
> I was being somewhat vague, sorry.
>
>> Are you proposing to change any C function prototype that uses "char*"
>> to "ubyte*"? I agree that this would make clear that D char[] are
>> different from C char*. But it's a lot of work.
>
> That is why it was skipped, but you still need to be aware of the
> difference or it will cause subtle bugs like the one you encountered...
> (actually is a huge pain, as soon as you leave the old ascii strings)

I think it's a good idea. I reckon it will initially cause people to be
confused, i.e. they see:

int strcmp(byte *, byte *)

and think "huh? strcmp takes a char * not a byte *" but then if they look
up byte * and/or char * in the D docs they should hopefully realise the
difference, that C's char * is really a byte * and D's char[] is UTF
encoded.

Oh yeah, correct me if I'm wrong but C's "char*" is really a "byte*" not
a "ubyte*" as C's char's are signed.

> Anyway, if you stick non-UTF-8 strings in char[] variables you are
> setting yourself up for "invalid UTF-8 sequence". So ubyte[] is better ?
> They both convert to C's (char *) in the usual way (with a NUL added)
>
> char[] and wchar[] should be enough for any strings internal to D,
> you should only need to mess with 8-bit encodings for input/output...
> (and then it should preferrably all be handled by a library routine)

Exactly, all transcoding should be done at the input/output stage (if at
all) internally you should use char[] wchar[] or dchar[]. Unless of
course you have a good reason not to.

>>> And of course, it also works just fine for ASCII-only strings.
>>> (a char[] can be directly converted to char *, iff it is ASCII)
>>> With non-US-ASCII characters, it doesn't work - as you've seen.
>>
>> With non-US-ASCII, but within the currently selected 8-bit OEM
>> codepage, it works. The problem is that UTF-8 doesn't get correctly
>> translated to IBM-850 (or 437, or ...) on Windows.
>
> I meant that you can output ASCII as UTF-8 and it will still work...
> (mostly, except if you are stuck in EBCDIC or some other weird place)
>
> writefln("hello world!"); // English, works about everywhere US-ASCII
>
> But to output to the console on Windows (or other non-Unicode platform),
> it needs to be translated to the local "code page" or "charset/encoding"
> Like if you want to support characters beyond the 96 or so that are in
> the ASCII subset, for instance if you live in Italy or Sweden.
>
> writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8
>
> And there is currently no functions in D to do that, as far as I know ?

No, but you can wrap and use the C (windows) function CharToOemW.

Someone suggested that the default stdout stream should do this
automatically, I think that's a great idea. IIRC Ben was considering
giving this a go.

> Same thing applies to console input such as the "char[] args" params...
> If you just echo those args on a non-Unicode console, you get errors!
> (since then they are not really UTF-8 strings, but casted ubyte[]'s)

Which strikes me as ridiculous.

Regan
Dec 02 2004
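For reference, the two declaration styles being compared; both link against
the same C symbol and have the same ABI, the pointer type only documents
intent (strictly speaking, whether C's plain char is signed is
implementation-defined, though it is signed on the compilers relevant here):

// Conventional D translation of the C prototype:
extern (C) int strcmp(char* s1, char* s2);

// Byte-oriented alternative under discussion - same ABI, it merely
// documents that the bytes need not be valid UTF-8 (spell it byte* or
// ubyte* depending on how you want signedness to read):
// extern (C) int strcmp(ubyte* s1, ubyte* s2);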
Regan Heath wrote:

> and think "huh? strcmp takes a char * not a byte *" but then if they
> look up byte * and/or char * in the D docs they should hopefully
> realise the difference, that C's char * is really a byte * and D's
> char[] is UTF encoded.
>
> Oh yeah, correct me if I'm wrong but C's "char*" is really a "byte*"
> not a "ubyte*" as C's char's are signed.

D is only concerned about "byte size", so it will remain as char*...

>> writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8
>>
>> And there is currently no functions in D to do that, as far as I know ?
>
> No, but you can wrap and use the C (windows) function CharToOemW.

I'm not using Windows, but a modern system with an UTF-8 console ;-)

>> Same thing applies to console input such as the "char[] args" params...
>> If you just echo those args on a non-Unicode console, you get errors!
>> (since then they are not really UTF-8 strings, but casted ubyte[]'s)
>
> Which strikes me as ridiculous.

Either way, both stdout and stdin need to be "extended" for non-UTF-8

--anders
Dec 03 2004