digitalmars.D.bugs - writef crashes on international string output

Dr.Dizel <Dr.Dizel_member pathlink.com> writes:
writef crashes when printing international (Russian) strings that are in a generic 8-bit encoding rather than UTF.
Jan 28 2005
"Thomas Kuehne" <thomas-dloop kuehne.cn> writes:
Dr.Dizel wrote in news:ctea06$k6q$1 digitaldaemon.com...
 writef crashes when printing international (Russian) strings that are in a generic 8-bit encoding rather than UTF.
Platform? OS? Compiler version? Sample string? Which shell?
Thomas
Jan 28 2005
Dr.Dizel <Dr.Dizel_member pathlink.com> writes:
In article <cteamj$ku0$1 digitaldaemon.com>, Thomas Kuehne says...

Dr.Dizel wrote in news:ctea06$k6q$1 digitaldaemon.com...
 writef crashes when printing international (Russian) strings that are in a generic 8-bit encoding rather than UTF.
Ouch! It is a dmd parsing bug. I cannot write source files in my national language: not identifiers, just simple strings for output, for example. If I do, dmd cannot parse them in any encoding (ANSI, OEM, KOI8-R ...) except UTF-16. If I use UTF-16, dmd does strange codepage conversions. However, I need to write and print my strings in Russian!

Examples with DOS codepage (866):
------------------------------------
import std.stdio;

int main(char[][] args)
{
    char[] hello_on_russian = "Привет, мир!";
    return 0;
}

C:\dmd\bin>dmd helloworld.d
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
--------------------------------------------------
import std.stdio;

int main(char[][] args)
{
    char[] hello_on_russian = `Привет, мир!`; // backquotes here
    writef(hello_on_russian);
    return 0;
}

C:\dmd\bin>dmd helloworld.d
C:\dmd\bin\..\..\dm\bin\link.exe helloworld,,,user32+kernel32/noi;

C:\dmd\bin>helloworld
Error: invalid UTF-8 sequence
------------------------------------
import std.stdio;

int main(char[][] args)
{
    char[] hello_on_russian = `Привет, мир!`; // backquotes here
    printf(hello_on_russian);
    return 0;
}

C:\dmd\bin>helloworld
Привет, мир!

The old printf way works. I think other parts of the dmd library have bugs in parsing national-language strings, too.

P.S. I use Windows XP, and the dmd version is 0.111.
Jan 29 2005
Anders F Björklund <afb algonet.se> writes:
Dr.Dizel wrote:

 Ouch! It is a dmd parsing bug.
It's not a dmd bug, but a limitation by design...
 I cannot write source files in my national language: not identifiers, just
 simple strings for output, for example. If I do, dmd cannot parse them in any
 encoding (ANSI, OEM, KOI8-R ...) except UTF-16. If I use UTF-16, dmd does strange
 codepage conversions. However, I need to write and print my strings in
 Russian!
D *only* supports Unicode (UTF-8, UTF-16, UTF-32).

This means:
1) Your source code must be in UTF-8
2) Your console input must be UTF-8
3) Your console output will be UTF-8

Otherwise you *will* get errors such as
"invalid UTF-8 sequence" or wrong output.

However, Unicode does have full support
for Russian / Cyrillic - and so does D.

This means that if you want to run D programs on an unsupported console,
you need to cast and change encoding on the char[] before input/output.

The input you get will be in ubyte[], in the local encoding, and can be
converted to wchar[] with a lookup table... Similarly, you can convert
your char[] to a ubyte[] for output by using the reverse of that table.
The lookup table, "wchar[256] mapping", is different for each encoding.

I can post some sample code, if wanted?

You can also use routines from the Windows API to convert to and from
the current console code page. They should be somewhere in D, as well.

--anders

PS. Lookup from codepage 866 (ubyte) to Unicode (wchar) can be found at:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT
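
Editor's sketch of the lookup-table idea described above. It is not the sample code from the thread, and it assumes a current D compiler with today's Phobos (std.utf.toUTF8) rather than dmd 0.111. The names buildCp866Table, cp866ToUtf8 and utf8ToCp866 are hypothetical, and the table is deliberately abbreviated: it covers only ASCII and the Cyrillic letters, while the full mapping is in the CP866.TXT file linked above.

```d
import std.utf : toUTF8;

/// Hypothetical helper: abbreviated CP866 -> Unicode table (letters only).
/// Box-drawing and symbol slots are left as U+FFFD; see CP866.TXT for them.
wchar[256] buildCp866Table()
{
    wchar[256] table = '\uFFFD';                       // default: replacement character
    foreach (i; 0 .. 0x80)
        table[i] = cast(wchar) i;                      // ASCII maps to itself
    foreach (i; 0x80 .. 0xB0)
        table[i] = cast(wchar)(0x0410 + (i - 0x80));   // А..п
    foreach (i; 0xE0 .. 0xF0)
        table[i] = cast(wchar)(0x0440 + (i - 0xE0));   // р..я
    table[0xF0] = '\u0401';                            // Ё
    table[0xF1] = '\u0451';                            // ё
    return table;
}

/// Console bytes (CP866) -> UTF-8 text, one table lookup per byte.
string cp866ToUtf8(const(ubyte)[] input)
{
    immutable table = buildCp866Table();
    wchar[] wide;
    wide.reserve(input.length);
    foreach (b; input)
        wide ~= table[b];
    return toUTF8(wide);
}

/// UTF-8 text -> console bytes (CP866), using the same table in reverse.
/// Characters missing from the table become '?' (0x3F).
ubyte[] utf8ToCp866(string text)
{
    immutable table = buildCp866Table();
    ubyte[] result;
    foreach (dchar c; text)                            // decodes UTF-8 to code points
    {
        ubyte found = 0x3F;
        if (c != '\uFFFD')
        {
            foreach (i, w; table)
            {
                if (w == c)
                {
                    found = cast(ubyte) i;
                    break;
                }
            }
        }
        result ~= found;
    }
    return result;
}

unittest
{
    ubyte[] raw = [0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2]; // "Привет" in CP866
    assert(cp866ToUtf8(raw) == "Привет");
    assert(utf8ToCp866("Привет") == raw);
}
```

The reverse direction scans the table linearly, which is fine for a sketch; a real program would probably build a second table or an associative array once. The unittest block can be run with dmd -unittest -main.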
Jan 29 2005
Dr.Dizel <Dr.Dizel_member pathlink.com> writes:
In article <ctgl26$4jh$1 digitaldaemon.com>, Anders F Björklund says...
Dr.Dizel wrote:
 Ouch! It is a dmd parsing bug.
It's not a dmd bug, but a limitation by design...
Then the backquotes in my example break this design. Why can I use only English strings and no others? Is it the tyranny of the US? :-)
 I cannot write source files in my national language: not identifiers, just
 simple strings for output, for example. If I do, dmd cannot parse them in any
 encoding (ANSI, OEM, KOI8-R ...) except UTF-16. If I use UTF-16, dmd does strange
 codepage conversions. However, I need to write and print my strings in
 Russian!
D *only* supports Unicode (UTF-8, UTF-16, UTF-32)
However, backquotes ...
This means:
1) Your source code must be in UTF-8
2) Your console input must be UTF-8
3) Your console output will be UTF-8
Where have you seen such a console? Which programs can use it? Is it a spherical horse in a vacuum? :-) If the std.stdio module has no input functions at all, how can I read input? Is it codepage safe? How can I read from and write to a non-UTF console? Is it a big problem, or just difficult, to use dmd for programs that run in a multilingual environment?
Otherwise you *will* get errors such as
"invalid UTF-8 sequence" or wrong output.

However, Unicode does have full support
for Russian / Cyrillic - and so does D.


This means that if you want to run D programs on an unsupported console,
you need to cast and change encoding on the char[] before input/output.
How can I do that, when char[] can hold only UTF-8 chars and writef cannot output other codepages (see my example)?
The input you get will be in ubyte[], in the local encoding, and can be
converted to wchar[] with a lookup table... Similarly, you can convert
your char[] to a ubyte[] for output by using the reverse of that table.
The lookup table, "wchar[256] mapping", is different for each encoding.
How can I output ubyte[] with writef?
I can post some sample code, if wanted?
Yes. In addition, the developers must rename char to utf8, because it is not a real char, and wchar to utf16 and dchar to utf32. A char must be able to store any character from 0x00 to 0xFF.
Jan 30 2005
Sebastian Beschke <s.beschke gmx.de> writes:
Dr.Dizel wrote:
 In article <ctgl26$4jh$1 digitaldaemon.com>,
 Anders F Björklund says...
 
Dr.Dizel wrote:
Ouch! It is a dmd parsing bug.
It's not a dmd bug, but a limitation by design...
Then the backquotes in my example break this design. Why can I use only English strings and no others? Is it the tyranny of the US? :-)
Funny one gets accused as a tyrant when using the most liberal and general encoding available... ;)
 
 
I cannot write source files in my national language: not identifiers, just
simple strings for output, for example. If I do, dmd cannot parse them in any
encoding (ANSI, OEM, KOI8-R ...) except UTF-16. If I use UTF-16, dmd does strange
codepage conversions. However, I need to write and print my strings in Russian!
D *only* supports Unicode (UTF-8, UTF-16, UTF-32)
However, backquotes ...
You oughta make sure your text editor saves the source code correctly. If you wish to use UTF-16 or UTF-32, be sure that there is a Byte Order Mark at the start of the file. I use jEdit and save files in UTF-8, which works fine.
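
Editor's sketch: one way to check what an editor actually saved is to look at the first bytes of the file for a Byte Order Mark. The helper name bomEncoding is hypothetical and the code assumes a current D compiler; it only inspects the BOM and does not prove that the rest of the file is valid in that encoding.

```d
/// Hypothetical helper: names the encoding announced by a leading BOM,
/// or returns "none" when the data does not start with one.
string bomEncoding(const(ubyte)[] data)
{
    static immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] utf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] utf16le = [0xFF, 0xFE];
    static immutable ubyte[] utf16be = [0xFE, 0xFF];

    bool startsWithBytes(const(ubyte)[] prefix)
    {
        return data.length >= prefix.length
            && data[0 .. prefix.length] == prefix;
    }

    // UTF-32LE must be tested before UTF-16LE: both begin with FF FE.
    // (FF FE 00 00 could also be UTF-16LE followed by a NUL character;
    // this sketch simply prefers the longer match.)
    if (startsWithBytes(utf32le)) return "UTF-32LE";
    if (startsWithBytes(utf32be)) return "UTF-32BE";
    if (startsWithBytes(utf8))    return "UTF-8";
    if (startsWithBytes(utf16le)) return "UTF-16LE";
    if (startsWithBytes(utf16be)) return "UTF-16BE";
    return "none";
}

unittest
{
    ubyte[] withUtf8Bom = [0xEF, 0xBB, 0xBF, 0x69];
    ubyte[] withUtf16Bom = [0xFF, 0xFE, 0x69, 0x00];
    ubyte[] plain = [0x69, 0x6D];
    assert(bomEncoding(withUtf8Bom) == "UTF-8");
    assert(bomEncoding(withUtf16Bom) == "UTF-16LE");
    assert(bomEncoding(plain) == "none");
}
```

The bytes could come from something like cast(ubyte[]) std.file.read("helloworld.d"); a result of "none" on a file containing Cyrillic text saved in a legacy codepage would match the "invalid UTF-8 sequence" errors shown earlier in the thread.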
 
 
This means:
1) Your source code must be in UTF-8
2) Your console input must be UTF-8
3) Your console output will be UTF-8
Where have you seen such a console? Which programs can use it? Is it a spherical horse in a vacuum? :-)
I guess your best bet currently would be to not use the console, sad as that is. Alternatively, you might use something like iconv, but I have no idea if it's available for D. How does Russian console input work, anyway? I'd be interested in that ^^
 
 In addition, the developers must rename char to utf8, because it is not a real char,
 and wchar to utf16 and dchar to utf32. A char must be able to store any character
 from 0x00 to 0xFF.
This has been up for discussion a lot of times, actually. IMHO, it doesn't really matter what you call them; the docs state clearly enough what they *are*. -Sebastian
Jan 30 2005
Anders F Björklund <afb algonet.se> writes:
Dr.Dizel wrote:

 Why can I use only English strings and no others? Is it the tyranny of the US? :-)
On the contrary, you can now use a lot more than just Western languages.
This means:
1) Your source code must be in UTF-8
This implies that your text editor must also be able to handle UTF-8.
2) Your console input must be UTF-8
3) Your console output will be UTF-8
Where did you see such console? Which programs can use it?
Linux has one. Mac OS X has one. I hope Windows XP can get one...
 If the std.stdio module has no input functions at all, how can I read input? Is it codepage safe?
 How can I read from and write to a non-UTF console?
 Is it a big problem, or just difficult, to use dmd for programs
 that run in a multilingual environment?
Non-UTF consoles are unsupported, but it can still be done.
 How can I do that, when char[] can hold only UTF-8 chars and writef cannot
 output other codepages (see my example)?
Yes.
 How can I output ubyte[] with writef?
That I am not 100% sure of, since I used printf instead. writef works just fine for Unicode, but not for 8-bit...
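
Editor's sketch: in today's Phobos (an assumption; this API is not what dmd 0.111 shipped), bytes that are already in the console's codepage can be written without any UTF-8 validation through File.rawWrite. The helper name writeRawBytes is hypothetical.

```d
import std.stdio : File;

/// Hypothetical helper: writes bytes exactly as given, with no UTF-8
/// validation or transcoding, then flushes the stream.
void writeRawBytes(File output, const(ubyte)[] bytes)
{
    output.rawWrite(bytes);
    output.flush();
}

unittest
{
    // "Привет" encoded in CP866; not valid UTF-8, but fine as raw bytes.
    ubyte[] privet = [0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2];

    // Round-trip through a temporary file to show the bytes are untouched.
    auto f = File.tmpfile();
    writeRawBytes(f, privet);
    f.rewind();
    auto buffer = new ubyte[privet.length];
    assert(f.rawRead(buffer) == privet);
}
```

For console output the call would be writeRawBytes(stdout, bytes) with stdout imported from std.stdio; whether the result is readable then depends entirely on the console's active codepage.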
I can post some sample code, if wanted ?
Yes.
See http://www.algonet.se/~afb/d/mapping.zip
I haven't added CP866, but CP437 is there for reference.
Note: There are better versions of this, for Windows only. (Maybe someone else can post a version using the Win32 API?)
 In addition, the developers must rename char to utf8, because it is not a real char,
 and wchar to utf16 and dchar to utf32. A char must be able to store any character
 from 0x00 to 0xFF.
The "char" type in D is, by definition, a UTF-8 code unit. A single char holds 0x00-0x7F, and every other Unicode character is encoded using up to char[4]... To store any so-called character from 0x00-0xFF, you *need* ubyte.
Note: The "real char", if we are talking C/C++, is called "byte" in D.
--anders
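
Editor's sketch of the char versus ubyte distinction, assuming a current D compiler and Phobos (std.utf.validate, std.range.walkLength). The helper name isValidUtf8 is hypothetical. It also illustrates why codepage bytes stored in a char[] lead to "invalid UTF-8 sequence" as soon as something tries to decode them.

```d
import std.range : walkLength;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    // One character, three UTF-8 code units (E2 82 AC).
    string euro = "\u20ac";
    assert(euro.length == 3);
    assert(euro.walkLength == 1);

    // Six Cyrillic letters take two code units each in UTF-8.
    string privet = "Привет";
    assert(privet.length == 12);
    assert(privet.walkLength == 6);

    // The same word in CP866 is six bytes, which are not valid UTF-8,
    // so they belong in a ubyte[] rather than a char[].
    ubyte[] cp866 = [0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2];
    assert(!isValidUtf8(cp866));
    assert(isValidUtf8(cast(const(ubyte)[]) privet));
}
```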
Jan 30 2005
Benjamin Herr <ben 0x539.de> writes:
Anders F Björklund wrote:
 Linux has one. Mac OS X has one. I hope Windows XP can get one...
Michael Walter has demonstrated that the WinXP console is indeed capable of UTF-8: <http://ilfirin.org/unicode.png>
Jan 30 2005
Sebastian Beschke <s.beschke gmx.de> writes:
Benjamin Herr wrote:
 <http://ilfirin.org/unicode.png>
OMG, don't open the homepage!
Jan 30 2005
Benjamin Herr <ben 0x539.de> writes:
Sebastian Beschke wrote:
 Benjamin Herr wrote:
 
 <http://ilfirin.org/unicode.png>
OMG, don't open the homepage!
Sorry if I offended you :(
Jan 30 2005
Sebastian Beschke <s.beschke gmx.de> writes:
Benjamin Herr wrote:
 Sebastian Beschke wrote:
 
 Benjamin Herr wrote:

 <http://ilfirin.org/unicode.png>
OMG, don't open the homepage!
Sorry if I offended you :(
Nah, that was a joke. ;) I'm not so easily offended. There have been far worse one-picture web sites in the past. I just assumed that the image was supposed to convey a humorous meaning about the person depicted (is it you?), so I tried to be humorous too. I forgot to put a smiley, though. :) -Sebastian
Jan 30 2005
Benjamin Herr <ben 0x539.de> writes:
Sebastian Beschke wrote:
 Benjamin Herr wrote:
 Sebastian Beschke wrote:
 Benjamin Herr wrote:
 <http://ilfirin.org/unicode.png>
OMG, don't open the homepage!
Sorry if I offended you :(
Nah, that was a joke. ;) I'm not so easily offended. There have been far worse one-picture web sites in the past. I just assumed that the image was supposed to convey a humorous meaning about the person depicted (is it you?), so I tried to be humorous too. I forgot to put a smiley, though. :) -Sebastian
It is indeed me. And it continues to freak out a lot of people. :D Also, it is not a one-picture website by design, I am just too lazy to actually create a website to populate the domain I am paying for. -ben
Jan 30 2005
Anders F Björklund <afb algonet.se> writes:
Benjamin Herr wrote:

 Linux has one. Mac OS X has one. I hope Windows XP can get one...
Michael Walter has demonstrated that the WinXP console is indeed capable of UTF-8: <http://ilfirin.org/unicode.png>
I meant a native UTF-8 console, where you can do:
 import std.stdio;
 void main()
 {
   writefln("\u20ac");
 }
And have it print € ? http://www.fileformat.info/info/unicode/char/20ac/ --anders
Jan 30 2005
Benjamin Herr <ben 0x539.de> writes:
Anders F Björklund wrote:
 I meant a native UTF-8 console, where you can do:
 
 import std.stdio;
 void main()
 {
   writefln("\u20ac");
 }
And have it print € ? http://www.fileformat.info/info/unicode/char/20ac/ --anders
I am led to assume that chcp <nifty parameters go here> will switch the Windows XP console to UTF-8 mode. This is untested, however, as I use uxterm.
-ben
Jan 30 2005
Dr.Dizel <Dr.Dizel_member pathlink.com> writes:
In article <ctjgji$1a8r$1 digitaldaemon.com>, Anders F Björklund says...
Benjamin Herr wrote:

 Linux has one. Mac OS X has one. I hope Windows XP can get one...
Michael Walter has demonstrated that the WinXP console is indeed capable of UTF-8: <http://ilfirin.org/unicode.png>
I meant a native UTF-8 console, where you can do:
 import std.stdio;
 void main()
 {
   writefln("\u20ac");
 }
And have it print € ? http://www.fileformat.info/info/unicode/char/20ac/
It looks like a console hack. You must use _only_ the Lucida Console font to get readable output; set it up in the console properties. Can you read anything useful from this UTF-8 console? How?

I have another question. Does "std.stdio" mean "Standard . Standard Input Output library"? It ought to be named something like "std.io". However, it has no input functions at all, only output, so it ought to be named "std.o". Sounds cool: S-T-D--O! :-D Developers, don't name things after functionality they do not have.
Feb 01 2005