
digitalmars.D.bugs - writef doesn't work on Windows XP console

reply Roberto Mariottini <Roberto_member pathlink.com> writes:
Hi.
I can't make writef work on Windows XP using non-7bit-ASCII characters.

The attached test program outputs:

-- untranslated --


-- translated --
äöüßÄÖÜ
Error: invalid UTF-8 sequence

Test program:

import std.stdio;
import std.c.stdio;
import std.c.windows.windows;
 
extern (Windows)
{
  export BOOL CharToOemW(
    LPCWSTR lpszSrc,  // string to translate
    LPSTR lpszDst     // translated string
  );
}
 
int main()
{
   puts("-- untranslated --");
   puts("äöüßÄÖÜ");
   writef("äöüßÄÖÜ\n");
 
   puts("-- translated --");
   wchar[] mess = "äöüßÄÖÜ";
   char[] OEMmess = new char[mess.length];
   CharToOemW(mess, OEMmess);
   puts(OEMmess);
   writef(OEMmess);
 
   return 0;
}
Dec 01 2004
parent reply "Ben Hinkle" <ben.hinkle gmail.com> writes:
"Roberto Mariottini" <Roberto_member pathlink.com> wrote in message 
news:cok6li$1pkp$1 digitaldaemon.com...
 Hi.
 I can't make writef work on Windows XP using non-7bit-ASCII characters.

 The attached test program outputs:

 -- untranslated --


 -- translated --
 äöüßÄÖÜ
 Error: invalid UTF-8 sequence

 Test program:

 import std.stdio;
 import std.c.stdio;
 import std.c.windows.windows;

 extern (Windows)
 {
  export BOOL CharToOemW(
    LPCWSTR lpszSrc,  // string to translate
    LPSTR lpszDst     // translated string
  );
 }

 int main()
 {
   puts("-- untranslated --");
   puts("äöüßÄÖÜ");
   writef("äöüßÄÖÜ\n");

   puts("-- translated --");
   wchar[] mess = "äöüßÄÖÜ";
   char[] OEMmess = new char[mess.length];
   CharToOemW(mess, OEMmess);
   puts(OEMmess);
   writef(OEMmess);

   return 0;
 }
This is expected behavior. writef takes UTF-8 strings, hence the error that the supplied string is not in UTF-8 (because it isn't).
Dec 01 2004
next sibling parent reply Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <coketl$25kv$1 digitaldaemon.com>, Ben Hinkle says...

[...]
 int main()
 {
   puts("-- untranslated --");
   puts("äöüßÄÖÜ");
   writef("äöüßÄÖÜ\n");
^^^^^^^^^^^^^^^^^^^^
   puts("-- translated --");
   wchar[] mess = "äöüßÄÖÜ";
   char[] OEMmess = new char[mess.length];
   CharToOemW(mess, OEMmess);
   puts(OEMmess);
   writef(OEMmess);

   return 0;
 }
This is expected behavior. Writef takes utf-8 strings hence the error that the supplied string is not in utf-8 (because it isn't).
The first writef uses a UTF-8 string, but it doesn't print what is expected. Either one should work, but neither does. Ciao
Dec 01 2004
parent reply Anders F Björklund <afb algonet.se> writes:
Roberto Mariottini wrote:

 The first writef uses an UTF-8 string, but it doesn't print what expected.
 Either one should work, but both don't work.
It works just fine, but you *have* to set your console to UTF-8. D does *not* support consoles or shells which are not Unicode... :( Simple example:
 import std.stdio;
 void main()
 {
   writefln("äöüßÄÖÜ");
 }
In UTF-8 Terminal mode, this prints:
 äöüßÄÖÜ
In Latin-1 Terminal mode, you get:
 äöüÃÃÃÃ
I'm assuming it prints similar garbage on a non-Unicode XP console? (Being a Mac user myself, I have no idea how to change it on Windows.) --anders
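PS. Someone on Windows could try switching the console output code page to UTF-8 (65001) with the Win32 SetConsoleOutputCP call before writing. This is only an untested sketch - whether the XP console then actually renders the output correctly is an assumption:

 import std.stdio;
 import std.c.windows.windows;

 // SetConsoleOutputCP may already be declared in std.c.windows.windows;
 // if so, this declaration can be dropped.
 extern (Windows) BOOL SetConsoleOutputCP(UINT wCodePageID);

 const UINT CP_UTF8 = 65001;  // Win32 code page identifier for UTF-8

 void main()
 {
   SetConsoleOutputCP(CP_UTF8);  // ask the console to interpret output as UTF-8
   writefln("äöüßÄÖÜ");          // the UTF-8 bytes should now display as-is
 }

(Running "chcp 65001" in the command prompt is supposed to do the same thing, but that is equally untested here.)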
Dec 01 2004
next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Anders F Björklund wrote:
 Roberto Mariottini wrote:
 
 The first writef uses an UTF-8 string, but it doesn't print what 
 expected.
 Either one should work, but both don't work.
It works just fine, but you *have* to set your console to UTF-8. D does *not* support consoles or shells which are not Unicode... :(
<snip>

A while back I suggested writing some classes to do text file I/O, which would have conversion capabilities built in.

http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089

I guess it would extend to console I/O.

Stewart.

-- 
My e-mail is valid but not my primary mailbox. Please keep replies on the 'group where everyone may benefit.
Dec 01 2004
next sibling parent reply Anders F Björklund <afb algonet.se> writes:
Stewart Gordon wrote:

 A while back I suggested writing some classes to do text file I/O, which 
 would have conversion capabilities built in.
 
 http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089
 
 I guess it would extend to console I/O.
Sounds like a good idea. I have some very small encoding additions... (just a lookup for each supported charset, without entire icu/iconv)

http://www.algonet.se/~afb/d/mapping.zip

And I think it should use char[] instead of dchar/dchar[], but that's rather minor (and it should probably overload all three string types)

--anders
Dec 01 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Anders F Björklund wrote:
 Stewart Gordon wrote:
 
 A while back I suggested writing some classes to do text file I/O, 
 which would have conversion capabilities built in.

 http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089
<snip>
 And I think it should use char[] instead of dchar/dchar[], but that's
 rather minor (and it should probably overload all three string types)
I wrote that, but then discovered that the 'norm' (if Phobos is anything to go by) is for strings to be manipulated as UTF-8, while dchar gets used for individual characters.

Maybe it should 'normally' use char[]. After all, that's the most compact for text in alphabets below U+0800. But you're probably right that it should overload the lot.

Stewart.

-- 
My e-mail is valid but not my primary mailbox. Please keep replies on the 'group where everyone may benefit.
Dec 01 2004
parent reply Anders F Björklund <afb algonet.se> writes:
Stewart Gordon wrote:

 And I think it should use char[] instead of dchar/dchar[], but that's
 rather minor (and it should probably overload all three string types)
I wrote that, but then discovered that the 'norm' (if Phobos is anything to go by) is for strings to be manipulated as UTF-8, while dchar gets used for individual characters. Maybe it should 'normally' use char[]. After all, that's the most compact for text in alphabets below U+0800.
Or one can do like Java and use wchar[] and wchar, and ignore the bloat for ASCII strings - and hack in support for surrogates some other way... Most of the consoles mentioned only support old 16-bit Unicode anyway?
 But you're probably right that it should overload the lot.
It's the D way :-) --anders
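PS. A rough sketch of what overloading all three string types could look like - the TextWriter name is made up, and the char[] overload's body is just a placeholder for the real transcoding/output step:

 import std.utf;

 class TextWriter
 {
   char[] buffer;  // collected UTF-8 text (stand-in for real output)

   // the UTF-8 overload does the real work
   void write(char[] s)  { buffer ~= s; }

   // the other two overloads just funnel into the UTF-8 one
   void write(wchar[] s) { write(std.utf.toUTF8(s)); }
   void write(dchar[] s) { write(std.utf.toUTF8(s)); }
 }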
Dec 01 2004
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Anders F Björklund wrote:
<snip>
 Most of the consoles mentioned only support old 16-bit Unicode anyway ?
<snip>

MS-DOS, and hence DOS windows in Win9x, only support 8-bit IBM codepages.

Stewart.

-- 
My e-mail is valid but not my primary mailbox. Please keep replies on the 'group where everyone may benefit.
Dec 01 2004
prev sibling parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Stewart Gordon" <smjg_1998 yahoo.com> wrote in message
news:cokn8o$2id0$1 digitaldaemon.com...
 Anders F Björklund wrote:
 Roberto Mariottini wrote:

 The first writef uses an UTF-8 string, but it doesn't print what
 expected.
 Either one should work, but both don't work.
It works just fine, but you *have* to set your console to UTF-8. D does *not* support consoles or shells which are not Unicode... :(
<snip> A while back I suggested writing some classes to do text file I/O, which would have conversion capabilities built in. http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/6089
Note std.stream now has BOM support. Call readBOM or writeBOM in EndianStream.

Now that you mention it, it might be nice to make another Stream subclass and add support for the "native" encodings. It sounds fun - I'll give it a shot. It should be pretty easy actually, since you just override writeString and writeStringW to call some OS function to convert the string or char from UTF to the native encoding. Supporting arbitrary encodings would probably be left for non-Phobos libraries, since they would presumably require something like ICU or libiconv. So basically what I have in mind is that to write to stdout with native encoding you'd have to write

 import std.stream;
 ...
 stdoutn = NativeTextStream(stdout);
 stdoutn.writef(<some utf encoded string>);

-Ben
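To sketch the conversion step such a stream would do on Windows (rough and untested - the writeNative name, the buffer sizing and the NUL handling are placeholder assumptions, and only the output direction is shown):

 import std.stream;
 import std.utf;
 import std.c.windows.windows;

 extern (Windows) BOOL CharToOemW(LPCWSTR lpszSrc, LPSTR lpszDst);

 // Convert a UTF-8 string to the console's OEM code page and write the
 // raw bytes to the given stream. A NativeTextStream.writeString override
 // would do essentially this.
 void writeNative(Stream s, char[] text)
 {
   wchar* wide = std.utf.toUTF16z(text);       // NUL-terminated UTF-16 copy
   ubyte[] oem = new ubyte[text.length + 1];   // assumed big enough: one OEM byte per character, plus NUL
   CharToOemW(wide, cast(LPSTR) oem);

   // write everything up to (not including) the terminating NUL
   int len = 0;
   while (oem[len] != 0)
     len++;
   s.writeString(cast(char[]) oem[0 .. len]);
 }

Wrapping that in an actual NativeTextStream subclass would then mostly be a matter of forwarding the raw block I/O to the wrapped stream and routing writeString/writeStringW through a conversion like the one above.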
Dec 01 2004
parent reply Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <cokqtc$2okj$1 digitaldaemon.com>, Ben Hinkle says...

[...]
Now that you mention it it might be nice to make another
Stream subclass and add support for the "native" encodings. It sounds fun -
I'll give it a shot. It should be pretty easy actually since you just
override writeString and writeStringW to call some OS function to convert
the string or char from utf to native encoding.
The Windows function to accomplish this task is the already cited CharToOemW().
Supporting arbitrary encodings would probably be left for non-phobos
libraries since they would presumably require something like ICU or
libiconv. So basically what I have in mind is that to write to stdout with
native encoding you'd have to write

import std.stream;
...
stdoutn = NativeTextStream(stdout);
stdoutn.writef(<some utf encoded string>);
The default stdout should be a NativeTextStream on Windows. Ciao
Dec 01 2004
parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Roberto Mariottini" <Roberto_member pathlink.com> wrote in message
news:comhrd$26ug$1 digitaldaemon.com...
 In article <cokqtc$2okj$1 digitaldaemon.com>, Ben Hinkle says...

 [...]
Now that you mention it it might be nice to make another
Stream subclass and add support for the "native" encodings. It sounds
fun -
I'll give it a shot. It should be pretty easy actually since you just
override writeString and writeStringW to call some OS function to convert
the string or char from utf to native encoding.
The Windows function to accomplish this task is the already cited
CharToOemW().
Supporting arbitrary encodings would probably be left for non-phobos
libraries since they would presumably require something like ICU or
libiconv. So basically what I have in mind is that to write to stdout
with
native encoding you'd have to write

import std.stream;
...
stdoutn = NativeTextStream(stdout);
stdoutn.writef(<some utf encoded string>);
The default stdout should be a NativeTextStream on Windows. Ciao
Actually, on second thought I'm getting hesitant to put this kind of thing into std.stream, since it is so platform specific - the Mac's iconv API is mapped to libiconv using C preprocessor macros, so the D code will have to hard-code those symbol names (AFAIK). Also, it looks like CharToOemW might not be on all Win95/98/Me systems.

Each platform will have to get special code for how to handle this NativeTextStream stuff. It could get pretty messy for fairly small bang-for-buck. I'm leaning towards putting this in an outside library that can handle arbitrary encodings - like libiconv or mango's ICU wrapper or something.
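For reference, a minimal sketch of what binding to libiconv from D could look like. The signatures follow the standard iconv(3) C API; linking against the system library, the libiconv_* name prefix used on some platforms, and all error handling are glossed over here:

 import std.string;  // toStringz

 extern (C)
 {
   alias void* iconv_t;

   iconv_t iconv_open(char* tocode, char* fromcode);
   size_t  iconv(iconv_t cd, char** inbuf, size_t* inbytesleft,
                 char** outbuf, size_t* outbytesleft);
   int     iconv_close(iconv_t cd);
 }

 // Convert a UTF-8 string to the named target encoding.
 // Sketch only: no error checking, and the output size is a rough guess.
 ubyte[] toEncoding(char[] target, char[] utf8)
 {
   iconv_t cd = iconv_open(toStringz(target), toStringz("UTF-8"));

   ubyte[] outbuf = new ubyte[utf8.length * 4 + 4];
   char* inp = cast(char*) utf8;
   char* outp = cast(char*) outbuf;
   size_t inleft = utf8.length;
   size_t outleft = outbuf.length;

   iconv(cd, &inp, &inleft, &outp, &outleft);
   iconv_close(cd);

   return outbuf[0 .. outbuf.length - outleft];
 }

With something like this, the "native" encoding just becomes one more target name passed to iconv_open.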
Dec 03 2004
parent kris <fu bar.org> writes:
Ben Hinkle wrote:
  Each platform will have to get special
 code for how to handle this NativeTextStream stuff. It could get pretty
 messy for fairly small bang-for-buck. I'm leaning towards putting in an
 outside library that can handle arbitrary encodings - like libiconv or
 mango's ICU wrapper or something.
 
 
I'd like to encourage you to do so. If you take that approach I'll write an adapter for Mango.io, so there are more options for everyone. I'd also like to see a Stream adapter for the ICU converters; perhaps there will be cases where the 200+ ICU transcoders cover areas that iconv does not?
Dec 03 2004
prev sibling parent Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <cokk5u$2dt9$1 digitaldaemon.com>,
Anders F Björklund says...
Roberto Mariottini wrote:

 The first writef uses an UTF-8 string, but it doesn't print what expected.
 Either one should work, but both don't work.
It works just fine, but you *have* to set your console to UTF-8. D does *not* support consoles or shells which are not Unicode... :(
Windows XP does *not* support UTF-8 consoles, and neither do Windows NT/2000. So the bug still applies. I don't think D will get very far if it doesn't support non-English versions of Windows. Ciao
Dec 01 2004
prev sibling parent reply Anders F Björklund <afb algonet.se> writes:
Ben Hinkle wrote:

 I can't make writef work on Windows XP using non-7bit-ASCII characters.
[...snip...] This is expected behavior. Writef takes utf-8 strings hence the error that the supplied string is not in utf-8 (because it isn't).
Moral of the story being that 8-bit strings should be declared ubyte[]. Even if it makes you cast it to a pointer before usage with C routines:
 ubyte[] OEMmess = new ubyte[mess.length];
 CharToOemW(mess, cast(LPSTR) OEMmess);
 puts(cast(char *) OEMmess);
The "char" type in C, is known as "byte" in D. Confusingly enough. Like Ben says, the D char type only accepts valid UTF-8 code units... --anders PS. No, it doesn't help that the C routines are declared as (char *) when they really take (ubyte *) arguments. It's just as a shortcut to avoid having to translate the C function declarations to D... And of course, it also works just fine for ASCII-only strings. (a char[] can be directly converted to char *, iff it is ASCII) With non-US-ASCII characters, it doesn't work - as you've seen.
Dec 01 2004
parent reply Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <cokjmk$2d8u$1 digitaldaemon.com>,
Anders F Björklund says...

[...]
Moral of the story being that 8-bit strings should be declared ubyte[].
Even if it makes you cast it to a pointer, before usage with C routines:

 ubyte[] OEMmess = new ubyte[mess.length];
 CharToOemW(mess, cast(LPSTR) OEMmess);
 puts(cast(char *) OEMmess);
The "char" type in C, is known as "byte" in D. Confusingly enough. Like Ben says, the D char type only accepts valid UTF-8 code units... PS. No, it doesn't help that the C routines are declared as (char *) when they really take (ubyte *) arguments. It's just as a shortcut to avoid having to translate the C function declarations to D...
Sorry, I don't understand. Are you proposing to change every C function prototype that uses "char*" to "ubyte*"? I agree that this would make it clear that D's char[] is different from C's char*. But it's a lot of work.
     And of course, it also works just fine for ASCII-only strings.
     (a char[] can be directly converted to char *, iff it is ASCII)
     With non-US-ASCII characters, it doesn't work - as you've seen.
With non-US-ASCII, but within the currently selected 8-bit OEM codepage, it works. The problem is that UTF-8 doesn't get correctly translated to IBM-850 (or 437, or ...) on Windows. Ciao
Dec 02 2004
parent reply Anders F Björklund <afb algonet.se> writes:
Roberto Mariottini wrote:

PS. No, it doesn't help that the C routines are declared as (char *)
    when they really take (ubyte *) arguments. It's just as a shortcut
    to avoid having to translate the C function declarations to D...
Sorry, I don't understand.
I was being somewhat vague, sorry.
 Are you proposing to change any C function prototype that uses "char*" to
 "ubyte*"?
 I agree that this would make clear that D char[] are different from C char*.
 But it's a lot of work.
That is why it was skipped, but you still need to be aware of the difference or it will cause subtle bugs like the one you encountered... (it's actually a huge pain, as soon as you leave the old ASCII strings)

Anyway, if you stick non-UTF-8 strings in char[] variables you are setting yourself up for "invalid UTF-8 sequence". So ubyte[] is better? They both convert to C's (char *) in the usual way (with a NUL added).

char[] and wchar[] should be enough for any strings internal to D; you should only need to mess with 8-bit encodings for input/output... (and then it should preferably all be handled by a library routine)
    And of course, it also works just fine for ASCII-only strings.
    (a char[] can be directly converted to char *, iff it is ASCII)
    With non-US-ASCII characters, it doesn't work - as you've seen.
With non-US-ASCII, but within the currently selected 8-bit OEM codepage, it works. The problem is that UTF-8 doesn't get correctly translated to IBM-850 (or 437, or ...) on Windows.
I meant that you can output ASCII as UTF-8 and it will still work... (mostly, except if you are stuck in EBCDIC or some other weird place)
   writefln("hello world!"); // English, works about everywhere US-ASCII
But to output to the console on Windows (or another non-Unicode platform), it needs to be translated to the local "code page" or "charset/encoding". Like if you want to support characters beyond the 96 or so that are in the ASCII subset, for instance if you live in Italy or Sweden.
   writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8
And there are currently no functions in D to do that, as far as I know?

The same thing applies to console input, such as the "char[] args" params... If you just echo those args on a non-Unicode console, you get errors! (since then they are not really UTF-8 strings, but casted ubyte[]s)

--anders
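PS. For the input direction, a rough sketch of what such a library routine could do on Windows: take the raw 8-bit bytes and bring them into UTF-8 via Win32. Untested, and the code page choice is an assumption - console reads would use the OEM code page, while things like command-line args may use the ANSI one instead:

 import std.utf;
 import std.c.windows.windows;

 // May already be declared in std.c.windows.windows; if so, drop this.
 extern (Windows) int MultiByteToWideChar(UINT CodePage, DWORD dwFlags,
                                          LPCSTR lpMultiByteStr, int cbMultiByte,
                                          LPWSTR lpWideCharStr, int cchWideChar);

 const UINT CP_OEMCP = 1;  // Win32 identifier for "the current OEM code page"

 // Turn raw console bytes (OEM encoded) into a proper UTF-8 D string.
 char[] fromOem(ubyte[] raw)
 {
   // first call: ask how many wide characters are needed
   int n = MultiByteToWideChar(CP_OEMCP, 0, cast(LPCSTR) raw, raw.length, null, 0);
   wchar[] wide = new wchar[n];

   // second call: do the actual conversion
   MultiByteToWideChar(CP_OEMCP, 0, cast(LPCSTR) raw, raw.length, cast(LPWSTR) wide, n);

   return std.utf.toUTF8(wide);
 }

Going the other way (UTF-8 out to the console) would be WideCharToMultiByte, or the CharToOemW call already mentioned.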
Dec 02 2004
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Thu, 02 Dec 2004 11:11:27 +0100, Anders F Björklund <afb algonet.se> wrote:
 Roberto Mariottini wrote:

 PS. No, it doesn't help that the C routines are declared as (char *)
    when they really take (ubyte *) arguments. It's just as a shortcut
    to avoid having to translate the C function declarations to D...
Sorry, I don't understand.
I was being somewhat vague, sorry.
 Are you proposing to change any C function prototype that uses "char*"  
 to
 "ubyte*"?
 I agree that this would make clear that D char[] are different from C  
 char*.
 But it's a lot of work.
I think it's a good idea. I reckon it will initially cause people to be confused, i.e. they see:

 int strcmp(byte *, byte *)

and think "huh? strcmp takes a char *, not a byte *", but then if they look up byte * and/or char * in the D docs they should hopefully realise the difference: that C's char * is really a byte * and D's char[] is UTF encoded.

Oh yeah, correct me if I'm wrong, but C's "char*" is really a "byte*", not a "ubyte*", as C's chars are signed.
 That is why it was skipped, but you still need to be aware of the  
 difference or it will cause subtle bugs like the one you encountered...
 (actually is a huge pain, as soon as you leave the old ascii strings)

 Anyway, if you stick non-UTF-8 strings in char[] variables you are
 setting yourself up for "invalid UTF-8 sequence". So ubyte[] is better ?
 They both convert to C's (char *) in the usual way (with a NUL added)

 char[] and wchar[] should be enough for any strings internal to D,
 you should only need to mess with 8-bit encodings for input/output...
 (and then it should preferrably all be handled by a library routine)
Exactly. All transcoding should be done at the input/output stage (if at all); internally you should use char[], wchar[], or dchar[], unless of course you have a good reason not to.
    And of course, it also works just fine for ASCII-only strings.
    (a char[] can be directly converted to char *, iff it is ASCII)
    With non-US-ASCII characters, it doesn't work - as you've seen.
With non-US-ASCII, but within the currently selected 8-bit OEM codepage, it works. The problem is that UTF-8 doesn't get correctly translated to IBM-850 (or 437, or ...) on Windows.
I meant that you can output ASCII as UTF-8 and it will still work... (mostly, except if you are stuck in EDBDIC or some other weird place)
   writefln("hello world!"); // English, works about everywhere US-ASCII
But to output to the console on Windows (or other non-Unicode platform), it needs to be translated to the local "code page" or "charset/encoding" Like if you want to support characters beyond the 96 or so that are in the ASCII subset, for instance if you live in Italy or Sweden.
   writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8
And there is currently no functions in D to do that, as far as I know ?
No, but you can wrap and use the C (Windows) function CharToOemW. Someone suggested that the default stdout stream should do this automatically; I think that's a great idea. IIRC Ben was considering giving this a go.
 Same thing applies to console input such as the "char[] args" params...
 If you just echo those args on a non-Unicode console, you get errors!
 (since then they are not really UTF-8 strings, but casted ubyte[]'s)
Which strikes me as ridiculous. Regan
Dec 02 2004
parent Anders F Björklund <afb algonet.se> writes:
Regan Heath wrote:

 and think "huh? strcmp takes a char * not a byte *" but then if they 
 look  up byte * and/or char * in the D docs they should hopefully 
 realise the  difference, that C's char * is really a byte * and D's 
 char[] is UTF  encoded.
 
 Oh yeah, correct me if I'm wrong but C's "char*" is really a "byte*" not 
 a  "ubyte*" as C's char's are signed.
D is only concerned about "byte size", so it will remain as char*...
   writefln("hall\u00e5 v\u00e4rlden!"); // Swedish, only works in UTF-8
And there is currently no functions in D to do that, as far as I know ?
No, but you can wrap and use the C (windows) function CharToOemW.
I'm not using Windows, but a modern system with a UTF-8 console ;-)
 Same thing applies to console input such as the "char[] args" params...
 If you just echo those args on a non-Unicode console, you get errors!
 (since then they are not really UTF-8 strings, but casted ubyte[]'s)
Which strikes me as ridiculous.
Either way, both stdout and stdin need to be "extended" for non-UTF-8 --anders
Dec 03 2004