digitalmars.D - TDPL: Foreach over Unicode string

reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On page 123 there's an example of what happens when traversing a unicode string
with a char, and on the next page the string is traversed with a dchar, which
should fix the output. But I'm getting different results, here's the code and
output of the two samples:

import std.stdio;

void main() {
    string str = "Hall\u00E5, V\u00E4rld!";
    foreach (c; str) {
        write('[', c, ']');
    }
    writeln();
}

Prints:
[H][a][l][l][Ã][¥][,][ ][V][Ã][¤][r][l][d][!]

Second example:

import std.stdio;

void main() {
    string str = "Hall\u00E5, V\u00E4rld!";
    foreach (dchar c; str) {
        write('[', c, ']');
    }
    writeln();
}

Prints:
[H][a][l][l][å][,][ ][V][ä][r][l][d][!]


The second example should print out:
[H][a][l][l][å][,][ ][V][ä][r][l][d][!] 

This is on DMD 2.047 on Windows.
Jul 27 2010
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Andrej Mitrovic Wrote:

 On page 123 there's an example of what happens when traversing a unicode
string with a char, and on the next page the string is traversed with a dchar,
which should fix the output. But I'm getting different results, here's the code
and output of the two samples:
 
 import std.stdio;
 
 void main() {
     string str = "Hall\u00E5, V\u00E4rld!";
     foreach (c; str) {
         write('[', c, ']');
     }
     writeln();
 }
 
 Prints:
 [H][a][l][l][Ã][¥][,][ ][V][Ã][¤][r][l][d][!]
 
 Second example:
 
 import std.stdio;
 
 void main() {
     string str = "Hall\u00E5, V\u00E4rld!";
     foreach (dchar c; str) {
         write('[', c, ']');
     }
     writeln();
 }
 
 Prints:
 [H][a][l][l][å][,][ ][V][ä][r][l][d][!]
 
 
 The second example should print out:
 [H][a][l][l][å][,][ ][V][ä][r][l][d][!] 
 
 This is on DMD 2.047 on Windows.
I think it's Windows integration that's the problem, on OSX I get:

[H][a][l][l][?][?][,][ ][V][?][?][r][l][d][!]
[H][a][l][l][å][,][ ][V][ä][r][l][d][!]

which is essentially correct.  The only difference between this and doing the
same thing in C and using printf() in place of write() is that both lines
display correctly in C.  I think printf() must be detecting partial UTF-8
characters and buffering until the complete chunk has arrived.  Interestingly,
the C output can't even be broken by badly timed calls to fflush(), so the
buffering is happening at a fairly high level.  I'd be interested in seeing the
same thing in write() at some point.
Jul 27 2010
parent reply Sean Kelly <sean invisibleduck.org> writes:
Sean Kelly Wrote:
 
 I think it's Windows integration that's the problem, on OSX I get:
 
 [H][a][l][l][?][?][,][ ][V][?][?][r][l][d][!]
 [H][a][l][l][å][,][ ][V][ä][r][l][d][!]
 
 which is essentially correct.  The only difference between this and doing the
same thing in C and using printf() in place of write() is that both lines
display correctly in C.  I think printf() must be detecting partial UTF-8
characters and buffering until the complete chunk has arrived.  Interestingly,
the C output can't even be broken by badly timed calls to fflush(), so the
buffering is happening at a fairly high level.  I'd be interested in seeing the
same thing in write() at some point.
Ah, write() already works that way. It was the brackets that were screwing things up.
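
To see why the brackets matter, it can help to look at the raw UTF-8 code units. The following is only a minimal sketch, assuming the string literal is stored as UTF-8 (as D strings are); the helper name codeUnits is made up for illustration and is not part of Phobos:

```d
import std.stdio;
import std.range : walkLength;

// Hypothetical helper: view the UTF-8 code units of a string as raw bytes.
immutable(ubyte)[] codeUnits(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    string str = "Hall\u00E5, V\u00E4rld!";

    // .length counts UTF-8 code units; walkLength counts code points.
    writeln(str.length);
    writeln(str.walkLength);

    // U+00E5 is stored as the two bytes C3 A5. Iterating by char hands them
    // out one at a time, so '[' and ']' end up between the two halves and
    // the terminal never sees a complete UTF-8 sequence.
    writefln("%(%02X %)", codeUnits("\u00E5"));
}
```

Iterating with dchar decodes both bytes into one code point before anything is written, which is why the second loop keeps each sequence intact.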
Jul 27 2010
parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On Wed, Jul 28, 2010 at 12:34 AM, Sean Kelly <sean invisibleduck.org> wrote:

 Sean Kelly Wrote:
 I think it's Windows integration that's the problem, on OSX I get:

 [H][a][l][l][?][?][,][ ][V][?][?][r][l][d][!]
 [H][a][l][l][å][,][ ][V][ä][r][l][d][!]

 which is essentially correct.  The only difference between this and doing
 the same thing in C and using printf() in place of write() is that both
 lines display correctly in C.  I think printf() must be detecting partial
 UTF-8 characters and buffering until the complete chunk has arrived.
 Interestingly, the C output can't even be broken by badly timed calls to
 fflush(), so the buffering is happening at a fairly high level.  I'd be
 interested in seeing the same thing in write() at some point.

 Ah, write() already works that way.  It was the brackets that were screwing
 things up.
You are right about printf(), I'm getting the correct output with this code:

import std.stdio, std.stream;

void main() {
    string str = "Hall\u00E5, V\u00E4rld!";
    foreach (dchar c; str) {
        printf("%c", c);
    }
    writeln();
}

Hallå, Värld!

Should I file this as a Windows bug for DMD?
Jul 27 2010
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Andrej Mitrovic Wrote:

 On Wed, Jul 28, 2010 at 12:34 AM, Sean Kelly <sean invisibleduck.org> wrote:
 
 Sean Kelly Wrote:
 I think it's Windows integration that's the problem, on OSX I get:

 [H][a][l][l][?][?][,][ ][V][?][?][r][l][d][!]
 [H][a][l][l][å][,][ ][V][ä][r][l][d][!]

 which is essentially correct.  The only difference between this and doing
 the same thing in C and using printf() in place of write() is that both lines
 display correctly in C. I think printf() must be detecting partial UTF-8
 characters and buffering until the complete chunk has arrived. Interestingly,
 the C output can't even be broken by badly timed calls to fflush(), so the
 buffering is happening at a fairly high level. I'd be interested in seeing the
 same thing in write() at some point.

 Ah, write() already works that way. It was the brackets that were screwing
 things up.

 You are right about printf(), I'm getting the correct output with this code:

 import std.stdio, std.stream;

 void main() {
     string str = "Hall\u00E5, V\u00E4rld!";
     foreach (dchar c; str) {
         printf("%c", c);
     }
     writeln();
 }

 Hallå, Värld!

 Should I file this as a Windows bug for DMD?
Yes. I looked into this briefly, and after a bit of googling, it looks like fwide() isn't implemented on Windows (unless Walter had done this himself in the DMC libraries). See here: http://blogs.msdn.com/b/michkap/archive/2009/06/23/9797156.aspx If I change std.stdio.LockingTextWriter.put(C)(C c) to always use the version(Windows) code for a 32-bit argument it *almost* works correctly. Instead of garbage, the Unicode characters are a lowercase o with an accent above (U+01A1 I believe) and an uppercase sigma (U+01A9). I'll have to spend some more time later trying to figure out why it's these characters and not the intended ones. I wouldn't think that endian issues should be relevant, but that's the only thing I've come up with so far.
Jul 27 2010
next sibling parent reply Sean kelly <sean invisibleduck.org> writes:
After a bit more research, the situation is a bit more complicated than I
realized.  First, if I compile this C app using DMC:

#include <stdio.h>

int main()
{
    printf( "Hall\u00E5, V\u00E4rld!" );
    return 0;
}

The output is:



This is what I was seeing once I started messing with std.stdio.  An
improvement I suppose, since it's not garbage, but the output is still
incorrect if you're expecting Unicode.  After a bit of experimenting, it looks
like there are two ways to output non-ASCII correctly in Windows: convert to a
multi-byte string (toMBSz) or call WriteConsoleW.  Here's a test app and the
associated output.  Notice how writeln() has the same output as
printf(unicodeString).

import std.stdio;
import std.string;
import std.utf;
import std.windows.charset;
import core.sys.windows.windows;

void main()
{
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD ignore;
    wchar[] buf = ("\u00E5 \u00E4"w).dup;

    writeln(buf);
    printf("%s\n", toStringz(toUTF8(buf)));
    printf("%s\n", toMBSz(toUTF8(buf), 1));
    WriteConsoleW(h, buf.ptr, buf.length, &ignore, null);
}

prints:



å ä
å ä

I'd think it should be enough to have std.stdio call the wide char output
routine to have things display correctly, but I tried that and that's when I
got the sigma.  Figuring out what's going on there will take some more work,
and the ultimate fix may end up being in the DMC libraries... I really don't
know.
Jul 27 2010
parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Black unicode magic.

It's not a big issue for me, but it probably will be for people that deal
with Unicode all the time. Personally, ASCII is good enough for me. :)

 Thanks for your efforts!

On Wed, Jul 28, 2010 at 7:17 AM, Sean kelly <sean invisibleduck.org> wrote:

 After a bit more research, the situation is a bit more complicated than I
 realized.  First, if I compile this C app using DMC:

 #include <stdio.h>

 int main()
 {
    printf( "Hall\u00E5, V\u00E4rld!" );
    return 0;
 }

 The output is:



 This is what I was seeing once I started messing with std.stdio.  An
 improvement I suppose, since it's not garbage, but the output is still
 incorrect if you're expecting Unicode.  After a bit of experimenting, it
 looks like there are two ways to output non-ASCII correctly in Windows:
 convert to a multi-byte string (toMBSz) or call WriteConsoleW.  Here's a
 test app and the associated output.  Notice how writeln() has the same
 output as printf(unicodeString).

 import std.stdio;
 import std.string;
 import std.utf;
 import std.windows.charset;
 import core.sys.windows.windows;

 void main()
 {
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD ignore;
    wchar[] buf = ("\u00E5 \u00E4"w).dup;

    writeln(buf);
    printf("%s\n", toStringz(toUTF8(buf)));
    printf("%s\n", toMBSz(toUTF8(buf), 1));
    WriteConsoleW(h, buf.ptr, buf.length, &ignore, null);
 }

 prints:



 å ä
 å ä

 I'd think it should be enough to have std.stdio call the wide char output
 routine to have things display correctly, but I tried that and that's when I
 got the sigma.  Figuring out what's going on there will take some more work,
 and the ultimate fix may end up being in the DMC libraries... I really don't
 know.
Jul 28 2010
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Sean Kelly wrote:
 Yes.  I looked into this briefly, and after a bit of googling, it looks like
 fwide() isn't implemented on Windows (unless Walter had done this himself in
 the DMC libraries).
fwide() has nothing to do with Windows. Yes, it is implemented in dmc, upon which dmd for Windows depends. When writing characters out to Windows, though, you have to be careful what "code page" Windows thinks your app is running in.
Jul 29 2010
parent reply Kagamin <spam here.lot> writes:
Walter Bright Wrote:

 When writing characters out to Windows, though, you have to be careful what 
 "code page" Windows thinks your app is running in.
It's valid for char functions. Is it valid that wide functions don't work either?
Jul 29 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
Kagamin wrote:
 Walter Bright Wrote:
 
 When writing characters out to Windows, though, you have to be careful what 
 "code page" Windows thinks your app is running in.
It's valid for char functions. Is it valid that wide functions don't work either?
The wide functions are supposed to be utf16, and those should work.
Jul 29 2010
parent reply Sean Kelly <sean invisibleduck.org> writes:
Walter Bright <newshound2 digitalmars.com> wrote:
 Kagamin wrote:
 Walter Bright Wrote:
 When writing characters out to Windows, though, you have to be
 careful what "code page" Windows thinks your app is running
 in.
It's valid for char functions.
Is it valid that wide functions don't work either?
The wide functions are supposed to be utf16, and those should work.
Surprisingly, they don't appear to work properly. The locale used for the UTF16 to multibyte conversion is the currently set locale, and that prints garbage on my Windows install. I had to use the OEM locale for it to work. I was going to fix this but wasn't sure if std.stdio should be setting the codepage it requires, or if the DMC code is broken (which doesn't seem likely).
Jul 30 2010
next sibling parent reply Kagamin <spam here.lot> writes:
Sean Kelly Wrote:

 The wide functions are supposed to be utf16, and those should work.
Surprisingly, they don't appear to work properly. The locale used for the UTF16 to multibyte conversion is the currently set locale, and that prints garbage on my Windows install.
For me it just didn't print non-ASCII characters. Maybe it supports just a small subset of Unicode?
Jul 30 2010
parent Sean Kelly <sean invisibleduck.org> writes:
Kagamin Wrote:

 Sean Kelly Wrote:
 
 The wide functions are supposed to be utf16, and those should work.
Surprisingly, they don't appear to work properly. The locale used for the UTF16 to multibyte conversion is the currently set locale, and that prints garbage on my Windows install.
For me it just didn't print non-ASCII characters. Maybe it supports just a small subset of Unicode?
I think it depends on the default codepage. My guess is that it does just as you described and only passes through ASCII.
Jul 30 2010
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
Sean Kelly wrote:
 Walter Bright <newshound2 digitalmars.com> wrote:
 Kagamin wrote:
 Walter Bright Wrote:
 When writing characters out to Windows, though, you have to be
 careful what "code page" Windows thinks your app is running
 in.
It's valid for char functions.
Is it valid that wide functions don't work either?
The wide functions are supposed to be utf16, and those should work.
Surprisingly, they don't appear to work properly. The locale used for the UTF16 to multibyte conversion is the currently set locale, and that prints garbage on my Windows install. I had to use the OEM locale for it to work. I was going to fix this but wasn't sure if std.stdio should be setting the codepage it requires, or if the DMC code is broken (which doesn't seem likely).
The D functions are supposed to send UTF16 to Windows via the "W" interface. What Windows does with it is up to Windows. The functions are NOT supposed to do a multibyte conversion and send it to the Windows "A" interface, except for the Win9x versions.
Jul 30 2010
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Walter Bright Wrote:

 Sean Kelly wrote:
 
 Surprisingly, they don't appear to work properly. The locale used for
 the UTF16 to multibyte conversion is the currently set locale, and that
 prints garbage on my Windows install. I had to use the OEM locale for it
 to work. I was going to fix this but wasn't sure if std.stdio should be
 setting the codepage it requires, or if the DMC code is broken (which
 doesn't seem likely).
The D functions are supposed to send UTF16 to Windows via the "W" interface. What Windows does with it is up to Windows. The functions are NOT supposed to do a multibyte conversion and send it to the Windows "A" interface, except for the Win9x versions.
So the relevant code for printing the described string is essentially as follows:

module std.stdio;

alias _fputc_nlock FPUTC;
alias _fputwc_nlock FPUTWC;

void put(C)(C c) if (is(C : const(dchar)))
{
    int orientation = fwide(fps, 0);
    if (orientation <= 0)
    {
        auto b = std.utf.toUTF8(buf, c);
        foreach (i ; 0 .. b.length)
            FPUTC(b[i], handle);
    }
    else
    {
        if (c <= 0xFFFF)
            FPUTWC(c, handle);
    }
}

Assuming the orientation is wide and the file is open in text mode:

wint_t _fputwc_nlock(wint_t wch, FILE *fp)
{
    char mbc[3];
    int size = wctomb(mbc, wch);
    _fputc_nlock(mbc[0], fp);
    _fputc_nlock(mbc[1], fp);
}

int wctomb(char *s, wchar_t wch)
{
    len = WideCharToMultiByte(__locale_codepage, ...);
}

I found the C code via grep so I may not be looking at the correct implementation of each function, but it matches the behavior I'm seeing. I think the standard C routines were used in D to make sure IO buffers were shared with C, etc. Are you saying this should be changed to use the Windows routines instead? Alternately, is fputwc() really doing the right thing by using the default locale? I'd imagine so except that this approach doesn't work in my tests on Windows.
Jul 30 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Sean Kelly wrote:
 I found the C code via grep so I may not be looking at the correct
 implementation of each function, but it matches the behavior I'm seeing.  I
 think the standard C routines were used in D to make sure IO buffers were
 shared with C, etc.  Are you saying this should be changed to use the Windows
 routines instead?  Alternately, is fputwc() really doing the right thing by
 using the default locale?  I'd imagine so except that this approach doesn't
 work in my tests on Windows.
I don't know, it's been years since I worked on that code. The idea is that D and C writes to stdio can be interleaved.
Jul 31 2010
prev sibling parent Kagamin <spam here.lot> writes:
Walter Bright Wrote:

 The D functions are supposed to send UTF16 to Windows via the "W" interface.
 What Windows does with it is up to Windows. The functions are NOT supposed to
 do a multibyte conversion and send it to the Windows "A" interface, except for
 the Win9x versions.
They can't just blindly call WriteConsoleW because according to msdn it fails if stdout is not a console. Shin Fujishiro's code is the correct one.
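
The point about redirected output can be sketched roughly as follows. This is not Shin Fujishiro's code and not what std.stdio does; it is only an illustration of the idea, assuming GetConsoleMode is an acceptable test for "stdout is a console". The names writeToConsole and putText are invented for the example:

```d
import std.stdio;
import std.utf : toUTF16;

version (Windows)
{
    import core.sys.windows.windows;

    // Sketch: send UTF-16 through WriteConsoleW only when stdout is a real
    // console; return false so the caller can fall back otherwise.
    bool writeToConsole(string s)
    {
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD mode;
        // GetConsoleMode fails for files and pipes, i.e. redirected output.
        if (h == INVALID_HANDLE_VALUE || !GetConsoleMode(h, &mode))
            return false;

        stdout.flush(); // keep ordering with earlier buffered output
        wstring w = toUTF16(s);
        DWORD written;
        return WriteConsoleW(h, w.ptr, cast(DWORD) w.length, &written, null) != 0;
    }
}
else
{
    // Non-Windows placeholder so the sketch compiles everywhere.
    bool writeToConsole(string s) { return false; }
}

// Hypothetical wrapper: console path first, plain stdio otherwise.
void putText(string s)
{
    if (!writeToConsole(s))
        write(s);
}

void main()
{
    putText("Hall\u00E5, V\u00E4rld!\n");
}
```

The fallback branch still writes raw UTF-8, so which encoding redirected output should use remains an open question in this sketch.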
Jul 31 2010
prev sibling parent reply Shin Fujishiro <rsinfu gmail.com> writes:
Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
 You are right about printf(), I'm getting the correct output with this code:
 
 import std.stdio, std.stream;
 
 void main() {
     string str = "Hall\u00E5, V\u00E4rld!";
     foreach (dchar c; str) {
         printf("%c", c);
     }
     writeln();
 }
 
 Hallå, Värld!
The reason why printf printed the correct characters is probably that the console was working in Windows-1252 (a variant of ISO-8859-1). The ISO-8859-1 (aka Latin-1) coded character set is compatible with Unicode. For example, Latin-1 0xE5 corresponds to U+00E5 and both represent the character å. Due to this fact, your console could _occasionally_ print Latin-1 compatible Unicode characters.

The reason that Sean saw õ and Õ was that the console worked in CP850, I believe. In the CP850 coded character set, 0xE4 = õ and 0xE5 = Õ.

D/Phobos works in Unicode, but the system (console) works in a different codeset. As Kagamin pointed out, Phobos must transcode Unicode to the system's native codeset to print characters correctly (even on Linux).

By the way, I'm working on this problem in a devel branch:

  http://www.dsource.org/projects/phobos/browser/branches/devel/stdio-native-codeset/

The native codeset transcoder (std/internal/stdio/nativechar.d) is done. Now I'm thinking about how to integrate the conversion facility into the stdio File framework.

Shin
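
The byte-level argument can be illustrated with a small sketch. It covers only the two bytes mentioned in this thread (values from the CP850 code chart); the function names decodeLatin1 and decodeCP850 are hypothetical, and a real decoder would need the full table:

```d
import std.stdio;

// Latin-1 maps each byte to the code point with the same numeric value.
dchar decodeLatin1(ubyte b)
{
    return cast(dchar) b;
}

// Only the two bytes discussed here plus ASCII; everything else is
// reported as U+FFFD because this sketch has no full CP850 table.
dchar decodeCP850(ubyte b)
{
    switch (b)
    {
        case 0xE4: return '\u00F5'; // õ
        case 0xE5: return '\u00D5'; // Õ
        default:   return b < 0x80 ? cast(dchar) b : '\uFFFD';
    }
}

void main()
{
    immutable ubyte[] bytes = [0xE5, 0xE4];
    foreach (b; bytes)
        writefln("0x%02X  Latin-1: %s  CP850: %s",
                 b, decodeLatin1(b), decodeCP850(b));
}
```

The same byte therefore shows up as å on a Latin-1 style console and as Õ on a CP850 console, which is consistent with the explanation above.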
Jul 29 2010
next sibling parent Kagamin <spam here.lot> writes:
Shin Fujishiro Wrote:

 Now I'm thinking on how to integrate conversion facility to the stdio
 File framework.
I think creating a low-level unicode console interface will help. Like this

void putchar(char c) @disable { assert(false); }
void putchar(wchar c) @disable { assert(false); }
void putchar(dchar c) {...}
Jul 29 2010
prev sibling parent reply Kagamin <spam here.lot> writes:
Shin Fujishiro Wrote:

   http://www.dsource.org/projects/phobos/browser/branches/devel/stdio-native-codeset/
I don't quite get, what is the difference between GetConsoleCP and CP_OEMCP for japanese and korean windows.
Jul 29 2010
parent reply Shin Fujishiro <rsinfu gmail.com> writes:
Kagamin <spam here.lot> wrote:
 Shin Fujishiro Wrote:
 
   http://www.dsource.org/projects/phobos/browser/branches/devel/stdio-native-codeset/
I don't quite get, what is the difference between GetConsoleCP and CP_OEMCP for japanese and korean windows.
The user might change the console code page with the chcp command, or it might be changed by the programmer. CP_OEMCP does not track such changes.

By the way, which CP should be used for redirected stdio: ANSI or OEM? I thought ANSI was preferred, but OEM seems to be more commonly used for console apps.

Shin
Jul 29 2010
parent Kagamin <spam here.lot> writes:
Shin Fujishiro Wrote:

 By the way, which CP should be used for redirected stdio: ANSI or OEM?
 I thought ANSI was preferred, but OEM seems to be more commonly used
 for console apps.
I think they just don't care and write text as usual. The C standard was created with the implication that strings are in the system codepage, and no transcoding is ever mentioned; it's a language for ASCII text. There is even a problem when program code is edited in a GUI editor and saved in the ANSI codepage: after compilation, hardcoded strings are not transcoded and remain in the ANSI codepage, and printf just writes the text blindly, so the output is broken.
Jul 30 2010
prev sibling parent Kagamin <spam here.lot> writes:
Sean kelly Wrote:

 Figuring out what's going on there will take some more work, and the ultimate
 fix may end up being in the DMC libraries... I really don't know.
This can't be a C bug because character encoding is not specified for C strings. If I remember it right, C uses the system encoding, and string IO just blindly passes strings in and out. It's Phobos' duty to convert a D string to whatever encoding is used by the C library.
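
As a rough illustration of that conversion step, std.encoding can re-encode a UTF-8 D string into a single-byte codeset before the bytes are handed to C. The sketch below assumes the target is Latin-1, which is only a stand-in for whatever the system codeset actually is; toLatin1Bytes is a made-up helper name:

```d
import std.encoding : Latin1String, transcode;
import std.stdio;

// Hypothetical helper: re-encode a UTF-8 D string as Latin-1 bytes.
immutable(ubyte)[] toLatin1Bytes(string s)
{
    Latin1String latin1;
    transcode(s, latin1);
    return cast(immutable(ubyte)[]) latin1;
}

void main()
{
    // "å" takes two bytes in UTF-8 but one byte (0xE5) in Latin-1.
    auto bytes = toLatin1Bytes("Hall\u00E5");
    writefln("%(%02X %)", bytes);
}
```

std.encoding does not provide the OEM code pages such as CP850, so a real fix would still need a native transcoder along the lines of the one described earlier in the thread.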
Jul 28 2010