digitalmars.D - TDPL: Foreach over Unicode string
- Andrej Mitrovic (25/25) Jul 27 2010 On page 123 there's an example of what happens when traversing a unicode...
- Sean Kelly (5/40) Jul 27 2010 I think it's Windows integration that's the problem, on OSX I get:
- Sean Kelly (2/9) Jul 27 2010 Ah, write() already works that way. It was the brackets that were screw...
- Andrej Mitrovic (16/32) Jul 27 2010 ng
- Sean Kelly (4/40) Jul 27 2010 Yes. I looked into this briefly, and after a bit of googling, it looks ...
- Sean kelly (31/31) Jul 27 2010 After a bit more research, the situation is a bit more complicated than ...
- Andrej Mitrovic (8/50) Jul 28 2010 Black unicode magic.
- Walter Bright (5/8) Jul 29 2010 fwide() has nothing to do with Windows. Yes, it is implemented in dmc, u...
- Kagamin (3/5) Jul 29 2010 It's valid for char functions.
- Walter Bright (2/9) Jul 29 2010 The wide functions are supposed to be utf16, and those should work.
- Sean Kelly (7/16) Jul 30 2010 Surprisingly, they don't appear to work properly. The locale used for
- Kagamin (2/7) Jul 30 2010 For me it just didn't print non-ASCII characters. May be it supports jus...
- Sean Kelly (2/11) Jul 30 2010 I think it depends on the default codepage. My guess is that it does ju...
- Walter Bright (5/21) Jul 30 2010 The D functions are supposed to send UTF16 to Windows via the "W" interf...
- Sean Kelly (29/42) Jul 30 2010 So the relevant code for printing the described string is essentially as...
- Walter Bright (3/10) Jul 31 2010 I don't know, it's been years since I worked on that code.
- Kagamin (2/6) Jul 31 2010 They can't just blindly call WriteConsoleW because according to msdn it ...
- Shin Fujishiro (18/31) Jul 29 2010 The reason why printf printed the correct characters is probably that
- Kagamin (6/8) Jul 29 2010 I think creating a low-level unicode console interface will help.
- Kagamin (2/3) Jul 29 2010 I don't quite get, what is the difference between GetConsoleCP and CP_OE...
- Shin Fujishiro (7/12) Jul 29 2010 User might change console code page by the chcp command. Or it might
- Kagamin (3/6) Jul 30 2010 I think, they just don't care and write text as usual. C standard was cr...
- Kagamin (2/3) Jul 28 2010 This can't be C bug because character encoding is not specified for C st...
On page 123 there's an example of what happens when traversing a Unicode string with a char, and on the next page the string is traversed with a dchar, which should fix the output. But I'm getting different results. Here's the code and output of the two samples:

    import std.stdio;

    void main()
    {
        string str = "Hall\u00E5, V\u00E4rld!";
        foreach (c; str)
        {
            write('[', c, ']');
        }
        writeln();
    }

Prints:
[H][a][l][l][Ã][¥][,][ ][V][Ã][¤][r][l][d][!]

Second example:

    import std.stdio;

    void main()
    {
        string str = "Hall\u00E5, V\u00E4rld!";
        foreach (dchar c; str)
        {
            write('[', c, ']');
        }
        writeln();
    }

Prints:
[H][a][l][l][Ã¥][,][ ][V][ä][r][l][d][!]

The second example should print out:
[H][a][l][l][å][,][ ][V][ä][r][l][d][!]

This is on DMD 2.047 on Windows.
Jul 27 2010
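A minimal sketch of the difference between the two loops, for readers following along. It only counts iterations; the helper names `countCodeUnits` and `countCodePoints` are made up for this illustration and are not from TDPL or Phobos. With `char` the loop visits UTF-8 code units, with `dchar` it decodes whole code points, which is why the first sample prints 15 bracket pairs and the second is expected to print 13.

```d
import std.stdio;

// Iterating with char visits every UTF-8 code unit (byte).
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Iterating with dchar makes foreach decode whole code points.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string str = "Hall\u00E5, V\u00E4rld!";
    writeln(countCodeUnits(str));  // 15: å and ä take two bytes each
    writeln(countCodePoints(str)); // 13
}
```

Whether the decoded characters then show up correctly is a separate question that depends on how the bytes reach the console.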
Andrej Mitrovic Wrote:
> The second example should print out: [H][a][l][l][å][,][ ][V][ä][r][l][d][!] This is on DMD 2.047 on Windows.

I think it's Windows integration that's the problem. On OSX I get:

[H][a][l][l][?][?][,][ ][V][?][?][r][l][d][!]
[H][a][l][l][å][,][ ][V][ä][r][l][d][!]

which is essentially correct. The only difference between this and doing the same thing in C and using printf() in place of write() is that both lines display correctly in C. I think printf() must be detecting partial UTF-8 characters and buffering until the complete chunk has arrived. Interestingly, the C output can't even be broken by badly timed calls to fflush(), so the buffering is happening at a fairly high level. I'd be interested in seeing the same thing in write() at some point.
Jul 27 2010
Sean Kelly Wrote:
> I think printf() must be detecting partial UTF-8 characters and buffering until the complete chunk has arrived. [...] I'd be interested in seeing the same thing in write() at some point.

Ah, write() already works that way. It was the brackets that were screwing things up.
Jul 27 2010
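A rough sketch of why the brackets matter, assuming the terminal expects UTF-8 (as on OSX above). Bracketing each code unit puts `]` and `[` between the two bytes of å, so the stream is no longer well-formed UTF-8; bracketing each code point keeps the bytes together. The helper names are invented for this example; `std.utf.validate` is the only Phobos call relied on.

```d
import std.stdio;
import std.utf : validate, UTFException;

// True when s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

// Wrap every UTF-8 code unit in brackets, like the first sample.
char[] bracketCodeUnits(string s)
{
    char[] result;
    foreach (char c; s)
    {
        result ~= '[';
        result ~= c;
        result ~= ']';
    }
    return result;
}

// Wrap every code point in brackets, like the second sample.
string bracketCodePoints(string s)
{
    string result;
    foreach (dchar c; s)
    {
        result ~= '[';
        result ~= c; // appending a dchar re-encodes it as UTF-8
        result ~= ']';
    }
    return result;
}

void main()
{
    string str = "Hall\u00E5";
    writeln(isValidUtf8(bracketCodeUnits(str)));  // false
    writeln(isValidUtf8(bracketCodePoints(str))); // true
}
```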
On Wed, Jul 28, 2010 at 12:34 AM, Sean Kelly <sean invisibleduck.org> wrote:
> I think printf() must be detecting partial UTF-8 characters and buffering until the complete chunk has arrived.
> [...]
> Ah, write() already works that way. It was the brackets that were screwing things up.

You are right about printf(), I'm getting the correct output with this code:

    import std.stdio, std.stream;

    void main()
    {
        string str = "Hall\u00E5, V\u00E4rld!";
        foreach (dchar c; str)
        {
            printf("%c", c);
        }
        writeln();
    }

Hallå, Värld!

Should I file this as a Windows bug for DMD?
Jul 27 2010
Andrej Mitrovic Wrote:
> You are right about printf(), I'm getting the correct output with this code: [...]
> Should I file this as a Windows bug for DMD?

Yes. I looked into this briefly, and after a bit of googling, it looks like fwide() isn't implemented on Windows (unless Walter had done this himself in the DMC libraries). See here:

http://blogs.msdn.com/b/michkap/archive/2009/06/23/9797156.aspx

If I change std.stdio.LockingTextWriter.put(C)(C c) to always use the version(Windows) code for a 32-bit argument it *almost* works correctly. Instead of garbage, the Unicode characters are a lowercase o with an accent above (U+01A1 I believe) and an uppercase sigma (U+01A9). I'll have to spend some more time later trying to figure out why it's these characters and not the intended ones. I wouldn't think that endian issues should be relevant, but that's the only thing I've come up with so far.
Jul 27 2010
After a bit more research, the situation is a bit more complicated than I realized. First, if I compile this C app using DMC:

    #include <stdio.h>

    int main()
    {
        printf( "Hall\u00E5, V\u00E4rld!" );
        return 0;
    }

The output is:

This is what I was seeing once I started messing with std.stdio. An improvement I suppose, since it's not garbage, but the output is still incorrect if you're expecting Unicode. After a bit of experimenting, it looks like there are two ways to output non-ASCII correctly in Windows: convert to a multi-byte string (toMBSz) or call WriteConsoleW. Here's a test app and the associated output. Notice how writeln() has the same output as printf(unicodeString).

    import std.stdio;
    import std.string;
    import std.utf;
    import std.windows.charset;
    import core.sys.windows.windows;

    void main()
    {
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD ignore;
        wchar[] buf = ("\u00E5 \u00E4"w).dup;

        writeln(buf);
        printf("%s\n", toStringz(toUTF8(buf)));
        printf("%s\n", toMBSz(toUTF8(buf), 1));
        WriteConsoleW(h, buf.ptr, buf.length, &ignore, null);
    }

prints:
å ä å ä

I'd think it should be enough to have std.stdio call the wide char output routine to have things display correctly, but I tried that and that's when I got the sigma. Figuring out what's going on there will take some more work, and the ultimate fix may end up being in the DMC libraries... I really don't know.
Jul 27 2010
Black Unicode magic. It's not a big issue for me, but it probably will be for people that deal with Unicode all the time. Personally, ASCII is good enough for me. :) Thanks for your efforts!

On Wed, Jul 28, 2010 at 7:17 AM, Sean kelly <sean invisibleduck.org> wrote:
> After a bit more research, the situation is a bit more complicated than I realized.
> [...]
> Figuring out what's going on there will take some more work, and the ultimate fix may end up being in the DMC libraries... I really don't know.
Jul 28 2010
Sean Kelly wrote:
> Yes. I looked into this briefly, and after a bit of googling, it looks like fwide() isn't implemented on Windows (unless Walter had done this himself in the DMC libraries).

fwide() has nothing to do with Windows. Yes, it is implemented in dmc, upon which dmd for Windows depends. When writing characters out to Windows, though, you have to be careful what "code page" Windows thinks your app is running in.
Jul 29 2010
Walter Bright Wrote:
> When writing characters out to Windows, though, you have to be careful what "code page" Windows thinks your app is running in.

It's valid for char functions. Is it valid that wide functions don't work either?
Jul 29 2010
Kagamin wrote:
> It's valid for char functions. Is it valid that wide functions don't work either?

The wide functions are supposed to be utf16, and those should work.
Jul 29 2010
Walter Bright <newshound2 digitalmars.com> wrote:
> The wide functions are supposed to be utf16, and those should work.

Surprisingly, they don't appear to work properly. The locale used for the UTF16 to multibyte conversion is the currently set locale, and that prints garbage on my Windows install. I had to use the OEM locale for it to work. I was going to fix this but wasn't sure if std.stdio should be setting the codepage it requires, or if the DMC code is broken (which doesn't seem likely).
Jul 30 2010
Sean Kelly Wrote:
> Surprisingly, they don't appear to work properly. The locale used for the UTF16 to multibyte conversion is the currently set locale, and that prints garbage on my Windows install.

For me it just didn't print non-ASCII characters. Maybe it supports just a small subset of Unicode?
Jul 30 2010
Kagamin Wrote:
> For me it just didn't print non-ASCII characters. Maybe it supports just a small subset of Unicode?

I think it depends on the default codepage. My guess is that it does just as you described and only passes through ASCII.
Jul 30 2010
Sean Kelly wrote:
> Surprisingly, they don't appear to work properly. The locale used for the UTF16 to multibyte conversion is the currently set locale, and that prints garbage on my Windows install. I had to use the OEM locale for it to work. I was going to fix this but wasn't sure if std.stdio should be setting the codepage it requires, or if the DMC code is broken (which doesn't seem likely).

The D functions are supposed to send UTF16 to Windows via the "W" interface. What Windows does with it is up to Windows. The functions are NOT supposed to do a multibyte conversion and send it to the Windows "A" interface, except for the Win9x versions.
Jul 30 2010
Walter Bright Wrote:
> The D functions are supposed to send UTF16 to Windows via the "W" interface. What Windows does with it is up to Windows. The functions are NOT supposed to do a multibyte conversion and send it to the Windows "A" interface, except for the Win9x versions.

So the relevant code for printing the described string is essentially as follows:

    module std.stdio;

    alias _fputc_nlock FPUTC;
    alias _fputwc_nlock FPUTWC;

    void put(C)(C c) if (is(C : const(dchar)))
    {
        int orientation = fwide(fps, 0);
        if (orientation <= 0)
        {
            auto b = std.utf.toUTF8(buf, c);
            foreach (i ; 0 .. b.length)
                FPUTC(b[i], handle);
        }
        else
        {
            if (c <= 0xFFFF)
                FPUTWC(c, handle);
        }
    }

Assuming the orientation is wide and the file is open in text mode:

    wint_t _fputwc_nlock(wint_t wch, FILE *fp)
    {
        char mbc[3];
        int size = wctomb(mbc, wch);
        _fputc_nlock(mbc[0], fp);
        _fputc_nlock(mbc[1], fp);
    }

    int wctomb(char *s, wchar_t wch)
    {
        len = WideCharToMultiByte(__locale_codepage, ...);
    }

I found the C code via grep so I may not be looking at the correct implementation of each function, but it matches the behavior I'm seeing. I think the standard C routines were used in D to make sure IO buffers were shared with C, etc. Are you saying this should be changed to use the Windows routines instead? Alternately, is fputwc() really doing the right thing by using the default locale? I'd imagine so except that this approach doesn't work in my tests on Windows.
Jul 30 2010
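To make the two branches of put() above more concrete, here is a small sketch of what each one has to emit for a single code point: UTF-8 code units on the byte-oriented path, UTF-16 code units on the wide path. It is not the Phobos implementation; `narrowUnits` and `wideUnits` are names invented here, and only `std.utf.encode` is assumed. Note that the quoted excerpt only shows the wide path handling `c <= 0xFFFF`; a code point above that needs two UTF-16 units.

```d
import std.stdio;
import std.utf : encode;

// UTF-8 code units for one code point (byte-oriented stream).
char[] narrowUnits(dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c);
    return buf[0 .. n].dup;
}

// UTF-16 code units for one code point (wide-oriented stream).
wchar[] wideUnits(dchar c)
{
    wchar[2] buf;
    immutable n = encode(buf, c);
    return buf[0 .. n].dup;
}

void main()
{
    // U+00E5 (å): two UTF-8 units, one UTF-16 unit.
    writeln(narrowUnits('\u00E5').length, " ", wideUnits('\u00E5').length);
    // U+1F600 lies above 0xFFFF: four UTF-8 units, a surrogate pair in UTF-16.
    writeln(narrowUnits('\U0001F600').length, " ", wideUnits('\U0001F600').length);
}
```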
Sean Kelly wrote:
> I found the C code via grep so I may not be looking at the correct implementation of each function, but it matches the behavior I'm seeing. I think the standard C routines were used in D to make sure IO buffers were shared with C, etc. Are you saying this should be changed to use the Windows routines instead? Alternately, is fputwc() really doing the right thing by using the default locale? I'd imagine so except that this approach doesn't work in my tests on Windows.

I don't know, it's been years since I worked on that code. The idea is that D and C writes to stdio can be interleaved.
Jul 31 2010
Walter Bright Wrote:
> The D functions are supposed to send UTF16 to Windows via the "W" interface. What Windows does with it is up to Windows. The functions are NOT supposed to do a multibyte conversion and send it to the Windows "A" interface, except for the Win9x versions.

They can't just blindly call WriteConsoleW because according to MSDN it fails if stdout is not a console. Shin Fujishiro's code is the correct one.
Jul 31 2010
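One possible shape for the check described above, as a hedged sketch only; it is not the code from Shin Fujishiro's branch. The assumption here is that a successful GetConsoleMode call is a good enough test for "stdout is a console"; `writeText` is a made-up name, and the redirected case simply passes the UTF-8 bytes through, which sidesteps the ANSI/OEM question raised elsewhere in this thread.

```d
import std.stdio;
import std.utf : toUTF16;

version (Windows)
{
    import core.sys.windows.windows;

    void writeText(string s)
    {
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD mode;
        if (GetConsoleMode(h, &mode))
        {
            // Real console: flush pending C stdio output, then send UTF-16.
            stdout.flush();
            wstring w = toUTF16(s);
            DWORD written;
            WriteConsoleW(h, w.ptr, cast(DWORD) w.length, &written, null);
        }
        else
        {
            // Redirected to a file or pipe: WriteConsoleW would fail here,
            // so write the bytes unchanged (an assumption of this sketch).
            stdout.rawWrite(s);
        }
    }
}
else
{
    // Other platforms: hand the UTF-8 text to stdio as usual.
    void writeText(string s)
    {
        write(s);
    }
}

void main()
{
    writeText("Hall\u00E5, V\u00E4rld!\n");
}
```

Bypassing the C stdio buffer like this gives up the interleaving with C output that Walter mentions, which is why the flush is there.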
Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
> You are right about printf(), I'm getting the correct output with this code: [...]
> Hallå, Värld!

The reason why printf printed the correct characters is probably that the console was working in Windows-1252 (a variant of ISO-8859-1). The ISO-8859-1 (aka Latin-1) coded character set is compatible with Unicode. For example, Latin-1 0xE5 corresponds to U+00E5 and both represent the character å. Due to this fact, your console could _occasionally_ print Latin-1 compatible Unicode characters.

The reason that Sean saw õ and Õ was that the console worked in CP850, I believe. In the CP850 coded character set, 0xE4 = õ and 0xE5 = Õ.

D/Phobos works in Unicode, but the system (console) works in a different codeset. As Kagamin pointed out, Phobos must transcode Unicode to the system native codeset to correctly print characters (even on Linux).

By the way, I'm working on this problem in a devel branch:
http://www.dsource.org/projects/phobos/browser/branches/devel/stdio-native-codeset/

The native codeset transcoder (std/internal/stdio/nativechar.d) is done. Now I'm thinking about how to integrate the conversion facility into the stdio File framework.

Shin
Jul 29 2010
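The byte-level argument above can be sketched as follows. The Latin-1 mapping is exact, since ISO-8859-1 byte values equal the code points U+0000 to U+00FF. The CP850 function is deliberately partial and only covers the two bytes mentioned in the post; both function names are invented for this example and have nothing to do with the nativechar.d transcoder.

```d
import std.stdio;

// ISO-8859-1 bytes coincide with code points U+0000 .. U+00FF.
dchar fromLatin1(ubyte b)
{
    return cast(dchar) b;
}

// Partial CP850 table: only the two bytes discussed in this thread are
// mapped; every other non-ASCII byte becomes U+FFFD in this sketch.
dchar fromCp850Partial(ubyte b)
{
    switch (b)
    {
        case 0xE4: return '\u00F5'; // õ
        case 0xE5: return '\u00D5'; // Õ
        default:   return b < 0x80 ? cast(dchar) b : '\uFFFD';
    }
}

void main()
{
    // printf("%c", c) with c == U+00E5 emits the single byte 0xE5.
    ubyte raw = cast(ubyte) '\u00E5';
    writeln(fromLatin1(raw));       // å, what a Latin-1 style console shows
    writeln(fromCp850Partial(raw)); // Õ, what a CP850 console shows
}
```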
Shin Fujishiro Wrote:
> Now I'm thinking on how to integrate conversion facility to the stdio File framework.

I think creating a low-level Unicode console interface will help. Like this:

    void putchar(char c) @disable { assert(false); }
    void putchar(wchar c) @disable { assert(false); }
    void putchar(dchar c) {...}
Jul 29 2010
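A compilable variation on that idea, as a sketch under a few assumptions: the function is renamed to the hypothetical `putCodePoint` to stay clear of C's putchar, the disabled overloads are bodiless declarations, and the dchar overload just re-encodes to UTF-8 and hands the bytes to write(), so no native codeset conversion happens here.

```d
import std.stdio : write, writeln;
import std.utf : encode;

// Single code units cannot be transcoded on their own, so reject them.
@disable void putCodePoint(char c);
@disable void putCodePoint(wchar c);

// Only whole code points are accepted.
void putCodePoint(dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);
    write(buf[0 .. len]);
}

// Passing a char or wchar is rejected at compile time.
static assert(!__traits(compiles, putCodePoint('a')));
static assert(!__traits(compiles, putCodePoint("x"w[0])));
static assert( __traits(compiles, putCodePoint(cast(dchar) 'a')));

void main()
{
    foreach (dchar c; "Hall\u00E5, V\u00E4rld!")
        putCodePoint(c);
    writeln();
}
```

The point of the design is that a caller iterating with foreach (c; str) gets a compile error instead of split multi-byte sequences.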
Shin Fujishiro Wrote:
> http://www.dsource.org/projects/phobos/browser/branches/devel/stdio-native-codeset/

I don't quite get what the difference is between GetConsoleCP and CP_OEMCP for Japanese and Korean Windows.
Jul 29 2010
Kagamin <spam here.lot> wrote:
> I don't quite get, what is the difference between GetConsoleCP and CP_OEMCP for japanese and korean windows.

The user might change the console code page with the chcp command, or it might be changed by the programmer. CP_OEMCP does not track such changes.

By the way, which CP should be used for redirected stdio: ANSI or OEM? I thought ANSI was preferred, but OEM seems to be more commonly used for console apps.

Shin
Jul 29 2010
Shin Fujishiro Wrote:
> By the way, which CP should be used for redirected stdio: ANSI or OEM? I thought ANSI was preferred, but OEM seems to be more commonly used for console apps.

I think they just don't care and write text as usual. The C standard was created on the assumption that strings are in the system codepage, and no transcoding is ever mentioned; it's a language for ASCII text. There is even a problem when program code is edited in a GUI editor and saved in the ANSI codepage: after compilation, hardcoded strings are not transcoded and remain in the ANSI codepage, and printf just writes the text blindly, so the output is broken.
Jul 30 2010
Sean kelly Wrote:
> Figuring out what's going on there will take some more work, and the ultimate fix may end up being in the DMC libraries... I really don't know.

This can't be a C bug because character encoding is not specified for C strings. If I remember it right, C uses the system encoding, and string IO just blindly passes strings in and out. It's Phobos' duty to convert a D string to whatever encoding is used by the C library.
Jul 28 2010