digitalmars.D.bugs - writef crashes on international string output


Dr.Dizel <Dr.Dizel_member pathlink.com> writes:

Writef crashes on international (Russian) string output when the string is not UTF but in a generic (codepage) encoding.

Jan 28 2005

"Thomas Kuehne" <thomas-dloop kuehne.cn> writes:

Dr.Dizel wrote in news:ctea06$k6q$1 digitaldaemon.com...
 Writef crashes on international (russian) string output not UTF but generic.

platform?
OS?
compiler version?
sample string?
what shell?

Thomas

Jan 28 2005

Dr.Dizel <Dr.Dizel_member pathlink.com> writes:

In article <cteamj$ku0$1 digitaldaemon.com>, Thomas Kuehne says...

Dr.Dizel wrote in news:ctea06$k6q$1 digitaldaemon.com...
 Writef crashes on international (russian) string output not UTF but generic.


Ouch! It is a dmd parsing bug.

I cannot write source files in my national language: not identifiers, just
simple strings for output. If I do, dmd cannot parse them in any encoding
(ANSI, OEM, KOI8-R, ...) except UTF-16. If I use UTF-16, dmd does strange
codepage conversions. However, I need to write and print my strings in Russian!

Examples with DOS codepage (866):
------------------------------------
import std.stdio;

int main(char[][] args)
{
char[] hello_on_russian	= "Привет, мир!";

return 0;	
}

C:\dmd\bin>dmd helloworld.d
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
--------------------------------------------------
import std.stdio;

int main(char[][] args)
{
char[] hello_on_russian	= `Привет, мир!`;	// backquotes here
writef(hello_on_russian);

return 0;
}

C:\dmd\bin>dmd helloworld.d
C:\dmd\bin\..\..\dm\bin\link.exe helloworld,,,user32+kernel32/noi;

C:\dmd\bin>helloworld
Error: invalid UTF-8 sequence
------------------------------------
import std.stdio;

int main(char[][] args)
{
char[] hello_on_russian	= `Привет, мир!`;	// backquotes here
printf(hello_on_russian);

return 0;
}

C:\dmd\bin>helloworld
Привет, мир!

The old printf way works.

I think other parts of the dmd library also have bugs in parsing
national-language strings.

P.S. I use Windows XP, and the dmd version is 0.111.

Jan 29 2005

Anders F Björklund <afb algonet.se> writes:

Dr.Dizel wrote:

 Ouch! It is a dmd parsing bug.

It's not a dmd bug, but a limitation by design...

 I cannot write source files on my national language not identifiers but for
 example just simple strings for output. If I do so dmd cannot parse they in any
 encoding: ANSI, OEM, KOI8R ... except UTF-16. If I use UTF-16 dmd do strange
 codepage conversions. However, I need to write and print my strings on Russian!

D *only* supports Unicode (UTF-8, UTF-16, UTF-32)

This means:
1) Your source code must be in UTF-8
2) Your console input must be UTF-8
3) Your console output will be UTF-8

Otherwise you *will* get errors such as
"invalid UTF-8 sequence" or wrong output.

However, Unicode does have full support
for Russian / Cyrillic - and so does D.
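
The "invalid UTF-8 sequence" message can be reproduced outside the compiler. The following is a minimal sketch, not taken from the thread: it assumes a current D compiler and Phobos (std.utf.validate) rather than dmd 0.111, and isValidUtf8 is a hypothetical helper name. It checks the word "Привет" once as codepage 866 bytes and once as UTF-8 bytes.

```d
import std.stdio;
import std.utf : validate, UTFException;

/// Returns true when `bytes` form a well-formed UTF-8 sequence.
/// (Hypothetical helper, named for this sketch only.)
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate() throws UTFException on the first malformed sequence
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // "Привет" as a codepage 866 editor would save it
    immutable ubyte[] cp866 = [0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2];
    // The same word as UTF-8 (this source file itself must be saved as UTF-8)
    immutable ubyte[] utf8 = cast(immutable(ubyte)[]) "Привет";

    writeln("cp866 bytes valid UTF-8: ", isValidUtf8(cp866));
    writeln("utf8 bytes valid UTF-8:  ", isValidUtf8(utf8));
}
```

Only the UTF-8 byte sequence passes. The codepage 866 bytes start with 0x8F, which is a continuation byte and cannot begin a UTF-8 sequence.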


This means that if you want to run D programs on an unsupported console,
you need to cast and change encoding on the char[] before input/output.

The input you get will be in ubyte[], in the local encoding, and can be
converted to wchar[] with a lookup table... Similarly, you can convert
your char[] to an ubyte[] for output by using the reverse of that table.
The lookup table, "wchar[256] mapping", is different for each encoding.

I can post some sample code, if wanted ?


You can also use routines from the Windows API, to convert to and from
the current console code page. They should be somewhere in D, as well.

--anders

PS.
Lookup from codepage 866 (ubyte) to unicode (wchar) can be found at:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT
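
The lookup-table approach described above can be sketched as follows. This is an illustration under stated assumptions, not the sample code offered in the post: it targets a current D compiler and Phobos, the function names are hypothetical, and instead of a full wchar[256] table it covers only ASCII and the Cyrillic letter ranges of codepage 866 (every other byte decodes to U+FFFD). A complete table would be filled in from CP866.TXT.

```d
import std.stdio;
import std.utf : encode;

/// Maps one CP866 byte to a Unicode code point.
/// Partial mapping (assumption of this sketch): ASCII, А..я and Ё/ё only.
dchar cp866ToUnicode(ubyte b)
{
    if (b < 0x80) return cast(dchar) b;                      // ASCII is unchanged
    if (b <= 0xAF) return cast(dchar)(0x0410 + (b - 0x80));  // А..Я, а..п
    if (b >= 0xE0 && b <= 0xEF)
        return cast(dchar)(0x0440 + (b - 0xE0));             // р..я
    if (b == 0xF0) return cast(dchar) 0x0401;                // Ё
    if (b == 0xF1) return cast(dchar) 0x0451;                // ё
    return cast(dchar) 0xFFFD;                               // not covered here
}

/// Reverse mapping; code points outside the partial table become '?'.
ubyte unicodeToCp866(dchar c)
{
    if (c < 0x80) return cast(ubyte) c;
    if (c >= 0x0410 && c <= 0x043F) return cast(ubyte)(0x80 + (c - 0x0410));
    if (c >= 0x0440 && c <= 0x044F) return cast(ubyte)(0xE0 + (c - 0x0440));
    if (c == 0x0401) return 0xF0;
    if (c == 0x0451) return 0xF1;
    return 0x3F; // '?'
}

/// Console input direction: local 8-bit bytes to a UTF-8 string.
string cp866ToUtf8(const(ubyte)[] bytes)
{
    char[] result;
    foreach (b; bytes)
        encode(result, cp866ToUnicode(b)); // appends the UTF-8 encoding
    return result.idup;
}

/// Console output direction: UTF-8 string to local 8-bit bytes.
ubyte[] utf8ToCp866(string text)
{
    ubyte[] result;
    foreach (dchar c; text) // foreach over dchar decodes UTF-8 on the fly
        result ~= unicodeToCp866(c);
    return result;
}

void main()
{
    // "Привет, мир!" as codepage 866 bytes
    immutable ubyte[] raw = [0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2,
                             0x2C, 0x20, 0xAC, 0xA8, 0xE0, 0x21];
    string text = cp866ToUtf8(raw);
    assert(utf8ToCp866(text) == raw); // round trip is lossless for these letters
    writeln(text);                    // UTF-8 out, for a UTF-8 console
}
```

For a non-UTF console, the bytes returned by utf8ToCp866 would be written unformatted, for example with stdout.rawWrite, rather than passed to writef.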

Jan 29 2005

Dr.Dizel <Dr.Dizel_member pathlink.com> writes:

In article <ctgl26$4jh$1 digitaldaemon.com>,
Anders F Björklund says...
Dr.Dizel wrote:
 Ouch! It is a dmd parsing bug.

It's not a dmd bug, but a limitation by design...

Then the backquotes in my example break this design.
Why can I use only English strings and no others? Is it the tyranny of the US? :-)

 I cannot write source files on my national language not identifiers but for
 example just simple strings for output. If I do so dmd cannot parse they in any
 encoding: ANSI, OEM, KOI8R ... except UTF-16. If I use UTF-16 dmd do strange
 codepage conversions. However, I need to write and print my strings on Russian!

D *only* supports Unicode (UTF-8, UTF-16, UTF-32)

However backquotes ...

This means:
1) Your source code must be in UTF-8
2) Your console input must be UTF-8
3) Your console output will be UTF-8

Where have you seen such a console? Which programs can use it? Is it a
spherical horse in a vacuum? :-)
If module std.stdio has no input at all, how can I do it? Is it codepage safe?
How can I read input from, and write output to, a non-UTF console?
Is it a big problem, or a difficult thing, to use dmd for programs that run in
a multilanguage environment?

Otherwise you *will* get errors such as
"invalid UTF-8 sequence" or wrong output.

However, Unicode does have full support
for Russian / Cyrillic - and so does D.


This means that if you want to run D programs on an unsupported console,
you need to cast and change encoding on the char[] before input/output.

How can I do that? char[] can hold only UTF-8 chars, and writef cannot output
other codepages (see my example).

The input you get will be in ubyte[], in the local encoding, and can be
converted to wchar[] with a lookup table... Similarly, you can convert
your char[] to an ubyte[] for output by using the reverse of that table.
The lookup table, "wchar[256] mapping", is different for each encoding.

How can I output ubyte[] with writef?

I can post some sample code, if wanted ?

Yes.

In addition, the developers should rename char to utf8, because it is not a real
char, and wchar to utf16 and dchar to utf32. A char must be able to store any
character from 0x00 to 0xFF.

Jan 30 2005

Sebastian Beschke <s.beschke gmx.de> writes:

Dr.Dizel wrote:
 In article <ctgl26$4jh$1 digitaldaemon.com>,
 Anders F Björklund says...
 
Dr.Dizel wrote:
Ouch! It is a dmd parsing bug.

It's not a dmd bug, but a limitation by design...

 
 
 Then backquotes in my example destroy this design.
 Why I can use only English strings but cannot others? Is it tyranny of US? :-)

Funny how one gets accused of being a tyrant for using the most liberal and
general encoding available... ;)

 
 
I cannot write source files on my national language not identifiers but for
example just simple strings for output. If I do so dmd cannot parse they in any
encoding: ANSI, OEM, KOI8R ... except UTF-16. If I use UTF-16 dmd do strange
codepage conversions. However, I need to write and print my strings on Russian! 

D *only* supports Unicode (UTF-8, UTF-16, UTF-32)

 
 
 However backquotes ...

You oughta make sure your text editor saves the source code correctly. 
If you wish to use UTF-16 or UTF-32, be sure that there is a Byte Order 
Mark at the start of the file.

I use jEdit and save files in UTF-8, which works fine.

 
 
This means:
1) Your source code must be in UTF-8
2) Your console input must be UTF-8
3) Your console output will be UTF-8

 
 
 Where did you see such console? Which programs can use it? Is it sferic horse in
 vacuum? :-)

I guess your best bet currently would be to not use the console, sad as 
that is. Alternatively, you might use something like iconv, but I have 
no idea if it's available for D.

How does Russian console input work, anyway? I'd be interested in that ^^

 
 In addition, developers must rename char to utf8 because it is not real char and
 wchar to utf16 and dchar to utf32. Char must store any char from 0x00 to 0xFF.

This has been up for discussion a lot of times, actually. IMHO, it
doesn't really matter what you call them; the docs state clearly enough
what they *are*.

-Sebastian

Jan 30 2005

Anders F Björklund <afb algonet.se> writes:

Dr.Dizel wrote:

 Why I can use only English strings but cannot others? Is it tyranny of US? :-)

On the contrary, you can now use a lot more than just Western languages.

This means:
1) Your source code must be in UTF-8


This implies that your text editor must also be able to handle UTF-8.

2) Your console input must be UTF-8
3) Your console output will be UTF-8

 
 Where did you see such console? Which programs can use it?

Linux has one. Mac OS X has one. I hope Windows XP can get one...

 If module std.stdio has no any input, how can I do it? Is it codepage safe?
 How can I input from and output to none UTF console?
 Is it a big problem or difficult thing to use dmd for programs,
 which use multilanguage envieroment?

Non-UTF consoles are unsupported, but it can still be done.

 How can I do so: char[] can hold only UTF-8 chars and writef cannot output other
 codepages (see my example)?

Yes.

 How can I output ubyte[] with writef?

That I am not 100% sure of, since I used printf instead.
writef works just fine for Unicode, but not for 8-bit...

I can post some sample code, if wanted ?

 
 Yes.

See http://www.algonet.se/~afb/d/mapping.zip
I haven't added CP866, but CP437 is there for reference.

Note: There are better versions of this, for Windows only.
(Maybe someone else can post a version using the Win32 API?)

 In addition, developers must rename char to utf8 because it is not real char and
 wchar to utf16 and dchar to utf32. Char must store any char from 0x00 to 0xFF.

The "char" type in D is, by definition, a UTF-8 code unit. One char on its own
can represent only 0x00-0x7F; all other Unicode characters take a sequence of
up to four chars (char[4])...

To store any so called character, from 0x00-0xFF, you *need* ubyte.
Note: The "real char", if we are talking C/C++, is called "byte" in D.
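
The difference between code units and characters can be checked directly. This is a minimal sketch, assuming a current D compiler and Phobos (std.range.walkLength) rather than dmd 0.111; the variable names are illustrative only.

```d
import std.range : walkLength;
import std.stdio;

void main()
{
    string  utf8  = "\u20ac";   // Euro sign stored as UTF-8
    wstring utf16 = "\u20ac"w;  // stored as UTF-16
    dstring utf32 = "\u20ac"d;  // stored as UTF-32

    assert(utf8.length == 3);      // three char code units
    assert(utf16.length == 1);     // one wchar code unit
    assert(utf32.length == 1);     // one dchar code unit
    assert(utf8.walkLength == 1);  // still a single character

    // Arbitrary 8-bit data (here: codepage 866 bytes) belongs in ubyte[],
    // because a lone byte above 0x7F is not a character in char[].
    ubyte[] cp866 = [0x8F, 0xE0, 0xA8];
    assert(cp866.length == 3);

    // Show the individual UTF-8 code units of the Euro sign in hex.
    foreach (b; cast(const(ubyte)[]) utf8)
        writef("%02X ", b);
    writeln();
}
```

The three code units printed for the Euro sign are E2 82 AC, none of which is a valid character on its own.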

--anders

Jan 30 2005

Benjamin Herr <ben 0x539.de> writes:

Anders F Björklund wrote:
 Linux has one. Mac OS X has one. I hope Windows XP can get one...

Michael Walter has demonstrated that the WinXP console is indeed capable 
of UTF-8: <http://ilfirin.org/unicode.png>

Jan 30 2005

Sebastian Beschke <s.beschke gmx.de> writes:

Benjamin Herr wrote:
 <http://ilfirin.org/unicode.png>

OMG, don't open the homepage!

Jan 30 2005

Benjamin Herr <ben 0x539.de> writes:

Sebastian Beschke wrote:
 Benjamin Herr wrote:
 
 <http://ilfirin.org/unicode.png>

 
 
 OMG, don't open the homepage!

Sorry if I offended you :(

Jan 30 2005

Sebastian Beschke <s.beschke gmx.de> writes:

Benjamin Herr wrote:
 Sebastian Beschke wrote:
 
 Benjamin Herr wrote:

 <http://ilfirin.org/unicode.png>



 OMG, don't open the homepage!

 
 
 Sorry if I offended you :(

Nah, that was a joke. ;)

I'm not so easily offended. There have been far worse one-picture web 
sites in the past.

I just assumed that the image was supposed to convey a humorous meaning 
about the person depicted (is it you?), so I tried to be humorous too. I 
forgot to put a smiley, though. :)

-Sebastian

Jan 30 2005

Benjamin Herr <ben 0x539.de> writes:

Sebastian Beschke wrote:
 Benjamin Herr wrote:
 Sebastian Beschke wrote:
 Benjamin Herr wrote:
 <http://ilfirin.org/unicode.png>

 OMG, don't open the homepage!

 Sorry if I offended you :(

 
 Nah, that was a joke. ;)
 
 I'm not so easily offended. There have been far worse one-picture web 
 sites in the past.
 
 I just assumed that the image was supposed to convey a humorous meaning 
 about the person depicted (is it you?), so I tried to be humorous too. I 
 forgot to put a smiley, though. :)
 
 -Sebastian

It is indeed me. And it continues to freak out a lot of people. :D
Also, it is not a one-picture website by design, I am just too lazy to 
actually create a website to populate the domain I am paying for.

-ben

Jan 30 2005

Anders F Björklund <afb algonet.se> writes:

Benjamin Herr wrote:

 Linux has one. Mac OS X has one. I hope Windows XP can get one...

 
 Michael Walter has demonstrated that the WinXP console is indeed
 capable of UTF-8: <http://ilfirin.org/unicode.png>

I meant a native UTF-8 console, where you can do:

 import std.stdio;
 void main()
 {
   writefln("\u20ac");
 }

And have it print € ?

http://www.fileformat.info/info/unicode/char/20ac/

--anders

Jan 30 2005

Benjamin Herr <ben 0x539.de> writes:

Anders F Björklund wrote:
 I meant a native UTF-8 console, where you can do:
 
 import std.stdio;
 void main()
 {
   writefln("\u20ac");
 }

 
 
 And have it print € ?
 
 http://www.fileformat.info/info/unicode/char/20ac/
 
 --anders

I am led to assume that chcp <nifty parameters go here> will cause
the Windows XP console to switch to UTF-8 mode.
This is untested, however, as I use uxterm.

-ben

Jan 30 2005

Dr.Dizel <Dr.Dizel_member pathlink.com> writes:

In article <ctjgji$1a8r$1 digitaldaemon.com>,
Anders F Björklund says...
Benjamin Herr wrote:

 Linux has one. Mac OS X has one. I hope Windows XP can get one...

 
 Michael Walter has demonstrated that the WinXP console is indeed
 capable of UTF-8: <http://ilfirin.org/unicode.png>

I meant a native UTF-8 console, where you can do:

 import std.stdio;
 void main()
 {
   writefln("\u20ac");
 }

And have it print € ?

http://www.fileformat.info/info/unicode/char/20ac/

It looks like a console hack. You get readable output _only_ with the Lucida
Console font. Set it up in the console properties.

Can you read anything useful from this UTF-8 console? How?

I have another question.
Does "std.stdio" mean "Standard . Standard Input Output library"? It should be
named "std.io". However, it has no input facilities, only output. Then it
should be named "std.o". Sounds cool: S-T-D--O! :-D
Developers, don't give things names for functionality they don't have.

Feb 01 2005
