digitalmars.D - TDPL: Foreach over Unicode string

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On page 123 there's an example of what happens when traversing a unicode string
with a char, and on the next page the string is traversed with a dchar, which
should fix the output. But I'm getting different results, here's the code and
output of the two samples:

import std.stdio;

void main() {
    string str = "Hall\u00E5, V\u00E4rld!";
    foreach (c; str) {
        write('[', c, ']');
    }
    writeln();
}

Prints:
[H][a][l][l][�][�][,][ ][V][�][�][r][l][d][!]

Second example:

import std.stdio;

void main() {
    string str = "Hall\u00E5, V\u00E4rld!";
    foreach (dchar c; str) {
        write('[', c, ']');
    }
    writeln();
}

Prints:
[H][a][l][l][Ã¥][,][ ][V][Ã¤][r][l][d][!]


The second example should print out:
[H][a][l][l][å][,][ ][V][ä][r][l][d][!]

This is on DMD 2.047 on Windows.
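
The two loops differ only in what foreach yields: with char (the inferred element type of a string) it walks UTF-8 code units, with dchar it decodes whole code points. The following is a minimal sketch that builds the bracketed text in memory, so the console's code page plays no part. The helper names bracketUnits, bracketPoints and isValidUtf8 are made up for illustration and are not part of Phobos.

```d
import std.utf : validate, UTFException;

// Hypothetical helpers for illustration only.

/// Brackets every UTF-8 code unit, as the first loop does.
string bracketUnits(string s)
{
    string result;
    foreach (char c; s)      // char: one byte of the encoding at a time
    {
        result ~= '[';
        result ~= c;
        result ~= ']';
    }
    return result;
}

/// Brackets every decoded code point, as the second loop does.
string bracketPoints(string s)
{
    string result;
    foreach (dchar c; s)     // dchar: foreach decodes the UTF-8 sequence
    {
        result ~= '[';
        result ~= c;         // appending a dchar re-encodes it as UTF-8
        result ~= ']';
    }
    return result;
}

/// True if s is a well-formed UTF-8 sequence.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

For "Hall\u00E5", bracketUnits produces 18 bytes because å occupies two code units, and the result is no longer valid UTF-8 since a bracket sits between the lead byte and its continuation byte. bracketPoints keeps each character whole, so its result stays valid.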

Jul 27 2010

Sean Kelly <sean invisibleduck.org> writes:

I think it's Windows integration that's the problem; on OSX I get:

[H][a][l][l][?][?][,][ ][V][?][?][r][l][d][!]
[H][a][l][l][å][,][ ][V][ä][r][l][d][!]

which is essentially correct.  The only difference between this and doing the
same thing in C and using printf() in place of write() is that both lines
display correctly in C.  I think printf() must be detecting partial UTF-8
characters and buffering until the complete chunk has arrived.  Interestingly,
the C output can't even be broken by badly timed calls to fflush(), so the
buffering is happening at a fairly high level.  I'd be interested in seeing the
same thing in write() at some point.

Jul 27 2010

Sean Kelly <sean invisibleduck.org> writes:

Ah, write() already works that way.  It was the brackets that were screwing
things up.

Jul 27 2010

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

You are right about printf(), I'm getting the correct output with this code:

import std.stdio, std.stream;

void main() {
    string str = "Hall\u00E5, V\u00E4rld!";
    foreach (dchar c; str) {
        printf("%c", c);
    }
    writeln();
}

Hallå, Värld!

Should I file this as a Windows bug for DMD?

Jul 27 2010

Sean Kelly <sean invisibleduck.org> writes:

Andrej Mitrovic Wrote:

 Should I file this as a Windows bug for DMD?

Yes.  I looked into this briefly, and after a bit of googling, it looks like
fwide() isn't implemented on Windows (unless Walter had done this himself in
the DMC libraries).  See here:

http://blogs.msdn.com/b/michkap/archive/2009/06/23/9797156.aspx

If I change std.stdio.LockingTextWriter.put(C)(C c) to always use the
version(Windows) code for a 32-bit argument it *almost* works correctly. 
Instead of garbage, the Unicode characters are a lowercase o with an accent
above (U+01A1 I believe) and an uppercase sigma (U+01A9).  I'll have to spend
some more time later trying to figure out why it's these characters and not the
intended ones.  I wouldn't think that endian issues should be relevant, but
that's the only thing I've come up with so far.

Jul 27 2010

Sean kelly <sean invisibleduck.org> writes:

After a bit more research, the situation is a bit more complicated than I
realized.  First, if I compile this C app using DMC:

#include <stdio.h>

int main()
{
    printf( "Hall\u00E5, V\u00E4rld!" );
    return 0;
}

The output is:



This is what I was seeing once I started messing with std.stdio.  An
improvement I suppose, since it's not garbage, but the output is still
incorrect if you're expecting Unicode.  After a bit of experimenting, it looks
like there are two ways to output non-ASCII correctly in Windows: convert to a
multi-byte string (toMBSz) or call WriteConsoleW.  Here's a test app and the
associated output.  Notice how writeln() has the same output as
printf(unicodeString).

import std.stdio;
import std.string;
import std.utf;
import std.windows.charset;
import core.sys.windows.windows;

void main()
{
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD ignore;
    wchar[] buf = ("\u00E5 \u00E4"w).dup;

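    writeln(buf);                            // through std.stdio
    printf("%s\n", toStringz(toUTF8(buf)));  // raw UTF-8 bytes
    printf("%s\n", toMBSz(toUTF8(buf), 1));  // code page 1 is CP_OEMCP
    WriteConsoleW(h, buf.ptr, buf.length, &ignore, null);  // UTF-16 straight to the console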
}

prints:



å ä
å ä

I'd think it should be enough to have std.stdio call the wide char output
routine to have things display correctly, but I tried that and that's when I
got the sigma.  Figuring out what's going on there will take some more work,
and the ultimate fix may end up being in the DMC libraries... I really don't
know.

Jul 27 2010

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Black unicode magic.

It's not a big issue for me, but it probably will be for people that deal
with Unicode all the time. Personally, ASCII is good enough for me. :)

Thanks for your efforts!

Jul 28 2010

Walter Bright <newshound2 digitalmars.com> writes:

Sean Kelly wrote:
 Yes.  I looked into this briefly, and after a bit of googling, it looks like
 fwide() isn't implemented on Windows (unless Walter had done this himself in
 the DMC libraries).

fwide() has nothing to do with Windows. Yes, it is implemented in dmc, upon 
which dmd for Windows depends.

When writing characters out to Windows, though, you have to be careful what 
"code page" Windows thinks your app is running in.

Jul 29 2010

Kagamin <spam here.lot> writes:

Walter Bright Wrote:

 When writing characters out to Windows, though, you have to be careful what 
 "code page" Windows thinks your app is running in.

It's valid for char functions.
Is it valid that wide functions don't work either?

Jul 29 2010

Walter Bright <newshound2 digitalmars.com> writes:

Kagamin wrote:
 It's valid for char functions.
 Is it valid that wide functions don't work either?

The wide functions are supposed to be utf16, and those should work.

Jul 29 2010

Sean Kelly <sean invisibleduck.org> writes:

Walter Bright <newshound2 digitalmars.com> wrote:
 The wide functions are supposed to be utf16, and those should work.

Surprisingly, they don't appear to work properly. The locale used for
the UTF16 to multibyte conversion is the currently set locale, and that
prints garbage on my Windows install. I had to use the OEM locale for it
to work. I was going to fix this but wasn't sure if std.stdio should be
setting the codepage it requires, or if the DMC code is broken (which
doesn't seem likely).

Jul 30 2010

Kagamin <spam here.lot> writes:

Sean Kelly Wrote:

 Surprisingly, they don't appear to work properly. The locale used for
 the UTF16 to multibyte conversion is the currently set locale, and that
 prints garbage on my Windows install.

For me it just didn't print non-ASCII characters. Maybe it supports just a
small subset of Unicode?

Jul 30 2010

Sean Kelly <sean invisibleduck.org> writes:

Kagamin Wrote:

 For me it just didn't print non-ASCII characters. Maybe it supports just a
 small subset of Unicode?

I think it depends on the default codepage.  My guess is that it does just as
you described and only passes through ASCII.

Jul 30 2010

Walter Bright <newshound2 digitalmars.com> writes:

Sean Kelly wrote:
 Surprisingly, they don't appear to work properly. The locale used for
 the UTF16 to multibyte conversion is the currently set locale, and that
 prints garbage on my Windows install. I had to use the OEM locale for it
 to work. I was going to fix this but wasn't sure if std.stdio should be
 setting the codepage it requires, or if the DMC code is broken (which
 doesn't seem likely).

The D functions are supposed to send UTF16 to Windows via the "W" interface.
What Windows does with it is up to Windows. The functions are NOT supposed to
do a multibyte conversion and send it to the Windows "A" interface, except for
the Win9x versions.

Jul 30 2010

Sean Kelly <sean invisibleduck.org> writes:

Walter Bright Wrote:

 The D functions are supposed to send UTF16 to Windows via the "W" interface.
 What Windows does with it is up to Windows. The functions are NOT supposed to
 do a multibyte conversion and send it to the Windows "A" interface, except for
 the Win9x versions.

So the relevant code for printing the described string is essentially as
follows:

module std.stdio;

alias _fputc_nlock FPUTC;
alias _fputwc_nlock FPUTWC;

void put(C)(C c) if (is(C : const(dchar)))
{
    int orientation = fwide(fps, 0);
    if (orientation <= 0) {
        auto b = std.utf.toUTF8(buf, c);
        foreach (i ; 0 .. b.length)
            FPUTC(b[i], handle);
    } else {
        if (c <= 0xFFFF)
            FPUTWC(c, handle);
    }
}

Assuming the orientation is wide and the file is open in text mode:

wint_t _fputwc_nlock(wint_t wch, FILE *fp)
{
    char mbc[3];
    int size = wctomb(mbc, wch);
    _fputc_nlock(mbc[0], fp);
    _fputc_nlock(mbc[1], fp);
}

int wctomb(char *s, wchar_t wch) {
    len = WideCharToMultiByte(__locale_codepage, ...);
}

I found the C code via grep so I may not be looking at the correct
implementation of each function, but it matches the behavior I'm seeing.  I
think the standard C routines were used in D to make sure IO buffers were
shared with C, etc.  Are you saying this should be changed to use the Windows
routines instead?  Alternately, is fputwc() really doing the right thing by
using the default locale?  I'd imagine so except that this approach doesn't
work in my tests on Windows.
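
To make the two branches of put() concrete, here is a small sketch of the code units each branch would produce for a single character, using std.utf.encode. It does not touch stdio or the DMC runtime. The helper names narrowUnits and wideUnits are hypothetical, and the surrogate-pair case is shown only as an assumption about what a complete wide branch would need, since the excerpt above covers c <= 0xFFFF only.

```d
import std.utf : encode;

// Hypothetical helpers for illustration only.

/// UTF-8 code units that the narrow branch would pass to FPUTC one by one.
string narrowUnits(dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c);   // number of code units written
    return buf[0 .. n].idup;
}

/// UTF-16 code units for the wide branch. Code points above U+FFFF
/// need a surrogate pair, i.e. two calls to the wide output routine.
wstring wideUnits(dchar c)
{
    wchar[2] buf;
    immutable n = encode(buf, c);
    return buf[0 .. n].idup;
}
```

For U+00E5 the narrow branch emits the two bytes 0xC3 0xA5, while the wide branch emits the single unit 0x00E5, which the C runtime then converts with the locale's code page as traced above.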

Jul 30 2010

Walter Bright <newshound2 digitalmars.com> writes:

Sean Kelly wrote:
 I found the C code via grep so I may not be looking at the correct
 implementation of each function, but it matches the behavior I'm seeing.  I
 think the standard C routines were used in D to make sure IO buffers were
 shared with C, etc.  Are you saying this should be changed to use the Windows
 routines instead?  Alternately, is fputwc() really doing the right thing by
 using the default locale?  I'd imagine so except that this approach doesn't
 work in my tests on Windows.

I don't know, it's been years since I worked on that code.

The idea is that D and C writes to stdio can be interleaved.

Jul 31 2010

Kagamin <spam here.lot> writes:

Walter Bright Wrote:

 The D functions are supposed to send UTF16 to Windows via the "W" interface.
 What Windows does with it is up to Windows. The functions are NOT supposed to
 do a multibyte conversion and send it to the Windows "A" interface, except for
 the Win9x versions.

They can't just blindly call WriteConsoleW because, according to MSDN, it fails
if stdout is not a console. Shin Fujishiro's code is the correct one.
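
The console check can be sketched as follows. This is not the code from Shin Fujishiro's branch; it is a minimal illustration that assumes GetConsoleMode succeeding is an adequate test for "stdout is a console". The name writeUnicode is hypothetical. What encoding to use for redirected output is left open here, and the sketch simply passes the text to std.stdio in that case.

```d
import std.stdio : stdout;

version (Windows)
{
    import core.sys.windows.windows;
    import std.utf : toUTF16;

    /// Hypothetical helper: writes UTF-8 text to standard output.
    /// Returns false if the console call fails.
    bool writeUnicode(string text)
    {
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD mode, written;

        if (GetConsoleMode(h, &mode))
        {
            // A real console: flush buffered stdio output first so the
            // ordering is kept, then hand UTF-16 to the "W" interface.
            stdout.flush();
            auto wide = toUTF16(text);
            return WriteConsoleW(h, wide.ptr, cast(DWORD) wide.length,
                                 &written, null) != 0;
        }

        // Redirected to a file or pipe: WriteConsoleW would fail here.
        // Assumption: leave the bytes to std.stdio unchanged.
        stdout.write(text);
        return true;
    }
}
else
{
    /// On other systems there is no separate console API to choose.
    bool writeUnicode(string text)
    {
        stdout.write(text);
        return true;
    }
}
```

Calling WriteConsoleW directly bypasses the C runtime's buffer, which is why the sketch flushes first; interleaving with C output, as Walter describes, would still need care.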

Jul 31 2010

Shin Fujishiro <rsinfu gmail.com> writes:

Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
 You are right about printf(), I'm getting the correct output with this code:
 
 import std.stdio, std.stream;
 
 void main() {
     string str = "Hall\u00E5, V\u00E4rld!";
     foreach (dchar c; str) {
         printf("%c", c);
     }
     writeln();
 }
 
 Hallå, Värld!

The reason why printf printed the correct characters is probably that
the console was working in Windows-1257 (variant of ISO-8859-1).

The ISO-8859-1 (aka Latin-1) coded character set is compatible with Unicode.
For example, Latin-1 0xE5 corresponds to U+00E5 and both represent the
character å.  Due to this fact, your console could _occasionally_ print
Latin-1 compatible Unicode characters.

The reason that Sean saw õ and Õ was that the console worked in CP850,
I believe.  In the CP850 coded character set, 0xE4 = õ and 0xE5 = Õ.

D/Phobos works in Unicode, but the system (console) works in a different
codeset.  As Kagamin pointed out, Phobos must transcode Unicode to the
system's native codeset to correctly print characters (even on Linux).
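
The byte-level effect can be illustrated with a small sketch. The function names toLatin1, inCP850 and inCP437 are hypothetical, and the two lookup functions cover only the bytes 0xE4 and 0xE5 discussed in this thread. CP437 is included purely as an assumption, because it maps those same bytes to Σ and σ, which resembles the sigma Sean described.

```d
// Hypothetical helpers for illustration only.

/// Converts UTF-8 text to ISO-8859-1 bytes. Code points above U+00FF
/// have no Latin-1 equivalent and are replaced with '?'.
immutable(ubyte)[] toLatin1(string s)
{
    immutable(ubyte)[] result;
    foreach (dchar c; s)
        result ~= c <= 0xFF ? cast(ubyte) c : cast(ubyte) '?';
    return result;
}

/// How a console in CP850 displays a byte (only 0xE4 and 0xE5 mapped).
dchar inCP850(ubyte b)
{
    switch (b)
    {
        case 0xE4: return '\u00F5';   // õ
        case 0xE5: return '\u00D5';   // Õ
        default:   return b < 0x80 ? cast(dchar) b : '\uFFFD';
    }
}

/// How a console in CP437 displays a byte (only 0xE4 and 0xE5 mapped).
dchar inCP437(ubyte b)
{
    switch (b)
    {
        case 0xE4: return '\u03A3';   // Σ
        case 0xE5: return '\u03C3';   // σ
        default:   return b < 0x80 ? cast(dchar) b : '\uFFFD';
    }
}
```

So printf("%c", c) with c = U+00E5 sends the single byte 0xE5, which a Latin-1 compatible console shows as å, while a CP850 console shows Õ for the very same byte.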

By the way, I'm working on this problem in a devel branch:

  http://www.dsource.org/projects/phobos/browser/branches/devel/stdio-native-codeset/

The native codeset transcoder (std/internal/stdio/nativechar.d) is done.
Now I'm thinking about how to integrate the conversion facility into the stdio
File framework.


Shin

Jul 29 2010

Kagamin <spam here.lot> writes:

Shin Fujishiro Wrote:

 Now I'm thinking on how to integrate conversion facility to the stdio
 File framework.

I think creating a low-level unicode console interface will help.

Like this

void putchar(char c) @disable { assert(false); }
void putchar(wchar c) @disable { assert(false); }
void putchar(dchar c) {...}

Jul 29 2010

Kagamin <spam here.lot> writes:

Shin Fujishiro Wrote:

   http://www.dsource.org/projects/phobos/browser/branches/devel/stdio-native-codeset/

I don't quite get what the difference is between GetConsoleCP and CP_OEMCP for
Japanese and Korean Windows.

Jul 29 2010

Shin Fujishiro <rsinfu gmail.com> writes:

Kagamin <spam here.lot> wrote:
 Shin Fujishiro Wrote:
 
   http://www.dsource.org/projects/phobos/browser/branches/devel/stdio-native-codeset/

 
 I don't quite get what the difference is between GetConsoleCP and CP_OEMCP
 for Japanese and Korean Windows.

The user might change the console code page with the chcp command, or it
might be changed programmatically.  CP_OEMCP does not track such changes.

By the way, which CP should be used for redirected stdio: ANSI or OEM?
I thought ANSI was preferred, but OEM seems to be more commonly used
for console apps.


Shin

Jul 29 2010

Kagamin <spam here.lot> writes:

Shin Fujishiro Wrote:

 By the way, which CP should be used for redirected stdio: ANSI or OEM?
 I thought ANSI was preferred, but OEM seems to be more commonly used
 for console apps.

I think they just don't care and write text as usual. The C standard was created
with the implication that strings are in the system codepage, and no transcoding
is ever mentioned; it's a language for ASCII text.

There is even a problem when program code is edited in a GUI editor and saved in
an ANSI codepage: after compilation, hardcoded strings are not transcoded and
remain in the ANSI codepage, and printf just writes the text blindly, so the
output is broken.

Jul 30 2010

Kagamin <spam here.lot> writes:

Sean kelly Wrote:

 Figuring out what's going on there will take some more work, and the ultimate
 fix may end up being in the DMC libraries... I really don't know.

This can't be a C bug because character encoding is not specified for C strings.
If I remember it right, C uses the system encoding, and string IO just blindly
passes strings in and out. It's Phobos' duty to convert a D string to whatever
encoding is used by the C library.

Jul 28 2010
