digitalmars.D.bugs - Bug in std.string.format?
- Juanjo Álvarez (6/6) Jul 09 2004 If I do:
- Stewart Gordon (10/14) Jul 09 2004 std.string.format isn't documented as I look. Is this the string coun...
- Arcane Jill (12/18) Jul 09 2004 This is not a bug. You have an invalid UTF-8 sequence. The library is co...
- Arcane Jill (8/13) Jul 09 2004 Oh - and here's the fix. Save your source-code text file in UTF-8 format...
- Arcane Jill (15/15) Jul 09 2004 Actually, come to think of it, it would be very, very helpful to users o...
- Stewart Gordon (19/24) Jul 09 2004 Hang on ... according to the docs, the compiler is supposed to accept
- Arcane Jill (25/46) Jul 09 2004 I stand corrected. However, the UTFs are all very easy to tell apart. UT...
- Juanjo Álvarez (3/10) Jul 10 2004 And they are (or at least were) extensively used in the obfuscated C
- Stewart Gordon (31/63) Jul 12 2004 Are we talking of the byte-order mark, or the fallback for if that's
- Arcane Jill (33/62) Jul 12 2004 I meant heuristically. Although obviously, if there's a BOM, you can tel...
- Stewart Gordon (14/27) Jul 13 2004 As long as you don't confuse its semantics with those of the other
- Juanjo Álvarez (16/26) Jul 09 2004 Then I was confused by the fact that inserting the line:
- Arcane Jill (39/54) Jul 09 2004 That may be a red herring, but I don't know what Python does and I'm not
- Juanjo Álvarez (23/50) Jul 10 2004 True, but the funny thing was that the files are saved (I've just tested...
- Arcane Jill (71/91) Jul 12 2004 In point of fact, your assertion that virtually every other compiler and
If I do:

    //Also with any other ascii 8 bit chars:
    char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");

the program says (at runtime):

    Error: invalid UTF-8 sequence

AFAIK 'Ñ' is UTF-8.
Jul 09 2004
Juanjo Álvarez wrote:

    If I do:
    //Also with any other ascii 8 bit chars:
    char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");
<snip>

std.string.format isn't documented, as far as I can see. Is this the string counterpart of writef, which I'd just pointed out we should have over on d.D?

Stewart.

--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 09 2004
In article <cclofh$1qrr$1@digitaldaemon.com>, Juanjo Álvarez says...

    If I do:
    //Also with any other ascii 8 bit chars:
    char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");
    The program says (in runtime): Error: invalid UTF-8 sequence

This is not a bug. You have an invalid UTF-8 sequence. The library is correctly reporting it.

    AFAIK 'Ñ' is UTF-8.

It is not. The Unicode character U+00D1, LATIN CAPITAL LETTER N WITH TILDE, is represented in UTF-8 by the two byte sequence { 0xC3, 0x91 }.

UTF-8 is backwardly compatible with ASCII. It is /not/, however, backwardly compatible with ISO-8859-1. Any character with codepoint greater than 0x7F must be correctly UTF-8 encoded.

You can get the correct UTF-8 sequence by starting with a string of dchars and passing it to std.utf.toUTF8().

Arcane Jill
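A minimal sketch of that approach in code (my own example, not the poster's; it assumes the std.utf.toUTF8 overload that takes a dchar[], as described above):

    import std.utf;

    void main()
    {
        // Start from the Unicode code point and let toUTF8 do the encoding.
        dchar[] wide;
        wide ~= cast(dchar) 0x00D1;            // U+00D1, LATIN CAPITAL LETTER N WITH TILDE
        char[] narrow = std.utf.toUTF8(wide);  // narrow now holds the two bytes 0xC3, 0x91
        assert(narrow.length == 2);
    }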
Jul 09 2004
In article <ccm0is$2768$1@digitaldaemon.com>, Arcane Jill says...

    char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");
    Error: invalid UTF-8 sequence

    This is not a bug. You have an invalid UTF-8 sequence. The library is correctly reporting it.

Oh - and here's the fix. Save your source-code text file in UTF-8 format before attempting to compile it. I suspect it is currently saved in some ANSI format or other - probably ISO-8859-1 or WINDOWS-1252 depending on your operating system. You need a text editor which can save in UTF-8. D source files should always be saved in UTF-8 format if you want string literals to be correctly interpreted.

Jill
Jul 09 2004
Actually, come to think of it, it would be very, very helpful to users of D if the D compiler actually checked the integrity of all string literals at compile time. If any string literal were found (at compile time) to contain an invalid UTF-8 sequence, it would help the user ENORMOUSLY if an error message along the lines of "source file is not valid UTF-8 - please re-save it in UTF-8 format" were to be printed.

(Strictly speaking, the D compiler should always pass the entire source file to toUTF32(), and generate the above error if toUTF32() fails. However, the source file encoding won't make any difference EXCEPT to string literals).

So ... although it /is/ a user-error, it is nonetheless a user-error which DMD could have detected at compile-time, instead of leaving the error reporting to run time. The error message itself (as it stands) doesn't really help people to understand what's wrong.

Arcane Jill
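A rough sketch of the proposed check (this is not DMD's actual code; it merely illustrates running the whole source through toUTF32 and reporting failure up front):

    import std.utf;

    // Returns true if the source text decodes cleanly as UTF-8.
    bool sourceIsValidUTF8(char[] src)
    {
        try
        {
            std.utf.toUTF32(src);   // throws on an invalid UTF-8 sequence
            return true;
        }
        catch (Object e)            // the Phobos of the time threw an error object here
        {
            return false;
        }
    }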
Jul 09 2004
Arcane Jill wrote:
<snip>

Hang on ... according to the docs, the compiler is supposed to accept UTF-16 and UTF-32 too.

<snip>

    So ... although it /is/ a user-error, it is nonetheless a user-error which DMD could have detected at compile-time, instead of leaving the error reporting to run time. The error message itself (as it stands) doesn't really help people to understand what's wrong.

Some debate is possible. Obviously the compiler isn't being UTF compliant. But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8?

(FTM, how are escaped characters supposed to be handled ITR, considering that a string literal can be either a char[], wchar[] or dchar[]?)

Speaking of lexical.html... "There are no digraphs or trigraphs in D." What is meant by this, exactly?

Stewart.

--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 09 2004
In article <ccmo0h$8u0$1@digitaldaemon.com>, Stewart Gordon says...

    Arcane Jill wrote:
    <snip>

    Hang on ... according to the docs, the compiler is supposed to accept UTF-16 and UTF-32 too.
    <snip>

I stand corrected. However, the UTFs are all very easy to tell apart. UTF-16 looks very different from UTF-8, and it only takes a simple algorithm to distinguish them. Ditto UTF-32. What I *SHOULD* have said is that DMD assumes that the source file is encoded in UTF-8, UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE. What it can't do is tell 8-bit encodings apart from each other, so it assumes that, if it's an 8-bit encoding, it will be UTF-8.

        So ... although it /is/ a user-error, it is nonetheless a user-error which DMD could have detected at compile-time, instead of leaving the error reporting to run time. The error message itself (as it stands) doesn't really help people to understand what's wrong.

    Some debate is possible. Obviously the compiler isn't being UTF compliant.

Yes, it is. The compiler is being 100% UTF compliant. Problems only arise if the source code isn't.

    But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8?

There ain't no such character. UTF-8 can encode the whole of Unicode. I'm not sure there's an OS on the planet which uses characters which are not in Unicode. Oh wait - I believe the ZX Spectrum had some weird clunky graphics characters which are not in Unicode. But we don't need to worry about that because D has not been ported to that platform.

    (FTM, how are escaped characters supposed to be handled ITR, considering that a string literal can be either a char[], wchar[] or dchar[]?)

They are supposed to be represented as is, not escaped in any way (beyond being encoded in UTF-whatever). Unless of course you mean stuff like "\n" - which obviously is stored in source as backslash followed by 'n'. The compiler can figure THAT out because it's part of D.

    Speaking of lexical.html... "There are no digraphs or trigraphs in D." What is meant by this, exactly?

Old, old stuff from the early days of C. You have to go back a long time, but once, there were keyboards without square brackets or curly braces and things, and which were not remappable in software. Digraphs are two-character sequences which a C compiler will replace with those single missing characters. Trigraphs are similar three character sequences.
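A sketch of the sort of simple distinction being described, based on the byte-order mark (the BOM values are standard Unicode; the heuristic for BOM-less files is omitted, and the function name is my own):

    // Identify a UTF encoding from a byte-order mark, if one is present.
    // A real lexer would fall back to a heuristic (or assume UTF-8) otherwise.
    char[] encodingFromBOM(ubyte[] src)
    {
        if (src.length >= 4 && src[0] == 0x00 && src[1] == 0x00
                            && src[2] == 0xFE && src[3] == 0xFF) return "UTF-32BE";
        if (src.length >= 4 && src[0] == 0xFF && src[1] == 0xFE
                            && src[2] == 0x00 && src[3] == 0x00) return "UTF-32LE";
        if (src.length >= 3 && src[0] == 0xEF && src[1] == 0xBB
                            && src[2] == 0xBF)                   return "UTF-8";
        if (src.length >= 2 && src[0] == 0xFE && src[1] == 0xFF) return "UTF-16BE";
        if (src.length >= 2 && src[0] == 0xFF && src[1] == 0xFE) return "UTF-16LE";
        return "no BOM - assume UTF-8";
    }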
Jul 09 2004
Arcane Jill wrote:

        What is meant by this, exactly?

    Old, old stuff from the early days of C. You have to go back a long time, but once, there were keyboards without square brackets or curly braces and things, and which were not remappable in software. Digraphs are two-character sequences which a C compiler will replace with those single missing characters. Trigraphs are similar three character sequences.

And they are (or at least were) extensively used in the obfuscated C contests :)
Jul 10 2004
Arcane Jill wrote:
<snip>

    I stand corrected. However, the UTFs are all very easy to tell apart. UTF-16 looks very different from UTF-8, and it only takes a simple algorithm to distinguish them. Ditto UTF-32.

Are we talking of the byte-order mark, or the fallback for if that's missing?

    What I *SHOULD* have said is that DMD assumes that the source file is encoded in UTF-8, UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE. What it can't do is tell 8-bit encodings apart from each other, so it assumes that, if it's an 8-bit encoding, it will be UTF-8.

Actually, there is a BOM for UTF-8 according to the docs. But no doubt many UTF-8 files are typed without it.

<snip>

    Yes, it is. The compiler is being 100% UTF compliant. Problems only arise if the source code isn't.

Actually, I read that UTF compliance of a text reader necessarily means rejecting input that isn't UTF compliant.

        But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8?

    There ain't no such character. UTF-8 can encode the whole of Unicode. I'm not sure there's an OS on the planet which uses characters which are not in Unicode.

By "match" I actually meant be represented by the same byte sequence. An important issue when it comes to generating console output, interfacing the OS API and stuff like that.

<snip>

        (FTM, how are escaped characters supposed to be handled ITR, considering that a string literal can be either a char[], wchar[] or dchar[]?)

    They are supposed to be represented as is, not escaped in any way (beyond being encoded in UTF-whatever). Unless of course you mean stuff like "\n" - which obviously is stored in source as backslash followed by 'n'. The compiler can figure THAT out because it's part of D.

I meant stuff like "\xA3" actually, and in terms of what it becomes in the actual string data being represented.

<snip>

    Old, old stuff from the early days of C. You have to go back a long time, but once, there were keyboards without square brackets or curly braces and things, and which were not remappable in software. Digraphs are two-character sequences which a C compiler will replace with those single missing characters. Trigraphs are similar three character sequences.

My dad had an old C manual (which I first learned from, but only the very basics) with handwritten notes in it about teletypes from well before my time. From what I remember, you typed something like:

    MAIN()
    \(
        PRINTF("\HELLO, WORLD!\\N");
    \)

But I don't remember there being any trigraphs in those notes. And back in those days, you wrote x =- 4 instead of x -= 4. I don't know at what point someone decided to break existing code by redefining the former to be the same as x = -4.

Stewart.

--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 12 2004
In article <cctnge$1801$1@digitaldaemon.com>, Stewart Gordon says...

    Arcane Jill wrote:
    <snip>

        I stand corrected. However, the UTFs are all very easy to tell apart. UTF-16 looks very different from UTF-8, and it only takes a simple algorithm to distinguish them. Ditto UTF-32.

    Are we talking of the byte-order mark, or the fallback for if that's missing?

I meant heuristically. Although obviously, if there's a BOM, you can tell just by reading the first (at most) four bytes.

    Actually, there is a BOM for UTF-8 according to the docs. But no doubt many UTF-8 files are typed without it.

Yep. Plenty of text editors save UTF-8 without a BOM. Some even offer you the choice of BOM or no-BOM. So the absence of a BOM does not imply that a text file is not UTF-8.

    Actually, I read that UTF compliance of a text reader necessarily means rejecting input that isn't UTF compliant.

Gotcha. In that case, you are correct. So I guess this means that DMD really /must/ validate the source file, or be itself in error. Well spotted.

        But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8?

        There ain't no such character. UTF-8 can encode the whole of Unicode. I'm not sure there's an OS on the planet which uses characters which are not in Unicode.

    By "match" I actually meant be represented by the same byte sequence. An important issue when it comes to generating console output, interfacing the OS API and stuff like that.

Aha. Well, that's an implementation-dependent thing, is it not? Not really a D matter, I would have thought. Would I be correct in assuming that most console escape sequences can be composed entirely out of ASCII characters? If that is so, there isn't a problem anyway.

        Unless of course you mean stuff like "\n" - which obviously is stored in source as backslash followed by 'n'. The compiler can figure THAT out because it's part of D.

    I meant stuff like "\xA3" actually, and in terms of what it becomes in the actual string data being represented.

Understood. Well, there's a little-known difference between '\xA3' and '\u00A3'. '\xA3' means "the byte 0xA3", or, if it's a character, "the character represented by codepoint 0xA3 in whatever encoding I happen to be using at the time", whereas, '\u00A3' means specifically "the character represented by codepoint 0xA3 in /Unicode/". That is, U+00A3, POUND SIGN.

In the particular case of D, a char[] contains UTF-8. So, I imagine it would be perfectly OK to construct valid UTF-8 sequences by hand. That is, I would _HOPE_ that all three of the following lines would produce identical results: but I haven't tested this, so I don't know for sure. If not, it's a bug.

For console escape sequences which are absolutely NOT UTF-8, I would encourage you to store such strings in ubyte[] arrays instead of char[] arrays, where such validity restrictions don't apply. There's nothing to stop you from passing a ubyte[] to std.stream.Stream.write(), after all.

    And back in those days, you wrote x =- 4 instead of x -= 4. I don't know at what point someone decided to break existing code by redefining the former to be the same as x = -4.

I don't know when that happened either. I gather that that change happened though because compilers had a hard time distinguishing between: (a) x =- 4; (b) x = -4;

Arcane Jill
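The three lines Jill refers to were presumably along these lines (my own reconstruction, using the U+00A3 POUND SIGN example from the post; the first line requires the source file itself to be saved as UTF-8):

    char[] a = "£";          // the literal character, as stored by the editor
    char[] b = "\u00A3";     // the Unicode escape for U+00A3, POUND SIGN
    char[] c = "\xC2\xA3";   // the same character's UTF-8 bytes written by hand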
Jul 12 2004
Arcane Jill wrote: <snip>For console escape sequences which are absolutely NOT UTF-8, I would encourage you to store such strings in ubyte[] arrays instead of char[] arrays, where such validity restrictions don't apply. There's nothing to stop you from passing a ubyte[] to std.stream.Stream.write(), after all.As long as you don't confuse its semantics with those of the other methods called write.It's no harder than distinguishing between (a) + +x; (b) ++ x; But no doubt programmers confused them, particularly when they tried writing x=-4 without any spaces. Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.And back in those days, you wrote x =- 4 instead of x -= 4. I don't know at what point someone decided to break existing code by redefining the former to be the same as x = -4.I don't know when that happened either. I gather that that change happened though because compilers had a hard time distinguishing between: (a) x =- 4; (b) x = -4;
Jul 13 2004
Arcane Jill wrote:

        AFAIK 'Ñ' is UTF-8.

    It is not. The Unicode character U+00D1, LATIN CAPITAL LETTER N WITH TILDE, is represented in UTF-8 by the two byte sequence { 0xC3, 0x91 }. UTF-8 is backwardly compatible with ASCII. It is /not/, however, backwardly compatible with ISO-8859-1. Any character with codepoint greater than 0x7F must be correctly UTF-8 encoded.

Then I was confused by the fact that inserting the encoding-declaration line at the start of a Python script makes the interpreter work with latin1 chars directly.

    You can get the correct UTF-8 sequence by starting with a string of dchars and passing it to std.utf.toUTF8().

Could you please provide an example of how that would be done? Because if I try:

    dchar[] dstr = "ESPAÑA";

the compiler says:

    otroformat.d(7): invalid UTF-8 sequence

and if I instead try:

    dchar[] dstr = std.utf.toUTF8("ESPAÑA");

it says:

    otroformat.d(7): function toUTF8 overloads char[](char[] s) and char[](dchar[] s) both match argument list for toUTF8

So I'm a little lost here.
Jul 09 2004
In article <ccn00c$khq$1@digitaldaemon.com>, Juanjo Álvarez says...

    Then I was confused by the fact that inserting the encoding-declaration line at the start of a Python script makes the interpreter work with latin1 chars directly.

That may be a red herring, but I don't know what Python does and I'm not qualified to comment. If I had to guess, I'd say that declaration tells Python the encoding with which the source file was saved.

I can tell you though that D also interprets all Latin-1 characters (and indeed, all Unicode characters) directly ... *IF* the source file is saved in a UTF format. (See below).

DMD may be "deficient" in the sense that it does not understand ISO-8859-1, ISO-8859-2, WINDOWS-1252, etc, etc. - but I would regard that as a strength, not a weakness. Simple. Neat. Clean. However, this does need to be better documented.

    Could you please provide an example of how that would be done? Because if I try:
    dchar[] dstr = "ESPAÑA";
    the compiler says:
    otroformat.d(7): invalid UTF-8 sequence

Honestly - this has got nothing whatsoever to do with the compiler. There's a stage BEFORE compiling - it's called saving the text file.

Let's say you're using Microsoft Notepad. Type something into it, such as: Now - instead of clicking on "Save", click instead on "Save As". You'll see three drop-down menus at the bottom of the dialog. One of them is labelled "Encoding", and it will have "ANSI" selected by default. *** CHANGE IT TO UTF-8 ***. Now save. Now the D compiler will be happy with it.

Pretty much all text editors these days offer such a choice - however it is usually not the default, so you have to remember to explicitly do the Save As / UTF-8 thing. And you can use ALL characters too, not just Latin-1. You can use Latin-2, Greek, Russian, Chinese, whatever. Just remember that trick - SAVE AS UTF-8 before you attempt to compile.

    and if I instead try:
    dchar[] dstr = std.utf.toUTF8("ESPAÑA");
    it says:
    otroformat.d(7): function toUTF8 overloads char[](char[] s) and char[](dchar[] s) both match argument list for toUTF8
    So I'm a little lost here.

I can understand that because, as I said, the DMD error message is not helpful. However, bear in mind that the fault lies with your use of the text editor, not with your use of D.

If Walter would care to help everyone out with this one by improving the error message (if only to lay blame somewhere other than DMD), what he should do is this. The compiler should pass the entire source file contents to std.utf.validate (or some equivalent function written in C/C++). If it passes, go ahead and compile. If it fails, issue an error message that the source file is not correctly encoded, and needs to be re-saved as UTF-8 before it will compile.

Of course, if the source file contains only ASCII characters then it is automatically valid UTF-8, even if it was saved as "ANSI".

Arcane Jill
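For the specific errors quoted above, a rough sketch of what should work once the file really is saved as UTF-8 (my own example, not tested against the original poster's setup):

    import std.utf;

    void main()
    {
        char[] str = "ESPAÑA";                // fine now: the literal is valid UTF-8
        dchar[] dstr = std.utf.toUTF32(str);  // passing a char[] variable, rather than
                                              // a bare literal, avoids the ambiguous
                                              // overload error quoted above
    }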
Jul 09 2004
Arcane Jill wrote:

First things first; thanks for your comments and your patience with me.

        at the start of a Python script makes the interpreter work with latin1 chars directly.

    That may be a red herring, but I don't know what Python does and I'm not qualified to comment. If I had to guess, I'd say that declaration tells Python the encoding with which the source file was saved.

True, but the funny thing was that the files are saved (I've just tested it) as latin1, and it works and doesn't issue the warning it issues if you don't put that line.

    I can tell you though that D also interprets all Latin-1 characters (and indeed, all Unicode characters) directly ... *IF* the source file is saved in a UTF format. (See below).

I didn't notice that my editor was saving the files as ISO-8859-15 and, as you said, the compiler error message didn't help with that.

    DMD may be "deficient" in the sense that it does not understand ISO-8859-1, ISO-8859-2, WINDOWS-1252, etc, etc. - but I would regard that as a strength, not a weakness. Simple. Neat. Clean. However, this does need to be better documented.

I really think that making it also understand ISO-8859-1 (like virtually every other compiler and interpreter out there) would not harm.

    Let's say you're using Microsoft Notepad. Type something into it, such as:

I'm using vim/KDE Kate/KDevelop; after your comment I've configured them to save in utf-8 by default and everything seems to work OK now (well, almost, I still have to configure my terminal emulator to use unicode so the D program's textual non-ascii output is correctly shown.)

    Pretty much all text editors these days offer such a choice - however it is usually not the default, so you have to remember to explicitly do the Save As / UTF-8 thing.

Also true; it wasn't the default _because_ my LC_ALL environment variable (Linux) was set to "es_ES.ISO-8859-15".

    Just remember that trick - SAVE AS UTF-8 before you attempt to compile.

I will, sure.

    If Walter would care to help everyone out with this one by improving the error message (if only to lay blame somewhere other than DMD), what he should do is this. The compiler should pass the entire source file contents to std.utf.validate (or some equivalent function written in C/C++). If it passes, go ahead and compile. If it fails, issue an error message that the source file is not correctly encoded, and needs to be re-saved as UTF-8 before it will compile.

That would be perfectly logical.

Now, abusing your knowledge about the issue, how can I transform (in D) a default utf-8 encoded font into ISO-latin1? In the program I'm writing most users will use it from a unix console (graphical or not) and I don't want to force them to configure their consoles to utf-8.

Thanks again for your answers
Jul 10 2004
In article <ccp45f$r1r$1@digitaldaemon.com>, Juanjo Álvarez says...

        DMD may be "deficient" in the sense that it does not understand ISO-8859-1, ISO-8859-2, WINDOWS-1252, etc, etc. - but I would regard that as a strength, not a weakness. Simple. Neat. Clean. However, this does need to be better documented.

    I really think that making it also understand ISO-8859-1 (like virtually every other compiler and interpreter out there) would not harm.

In point of fact, your assertion that virtually every other compiler and interpreter out there "understands" ISO-8859-1 is not correct. D is superior, in this regard.

In a traditional C compiler, the encoding of the source file is essentially *IGNORED*. There is absolutely no "understanding" going on. A string literal is just a sequence of uninterpreted bytes. The illusion of "understanding" is simply caused by the fact that the text editor at one end, and the console or whatever at the other, happen to use the same encoding as each other. With that borne in mind, you should appreciate that what a C compiler APPEARS to understand is not, in fact, ISO-8859-1, at all. It is simply the default OS encoding, whatever that happens to be. It sounds to me like Python may have real understanding of encodings - but if that's true, Python would be the exception rather than the rule.

However, D *CANNOT* ignore the encoding of the source file. In D, a char[] array must contain, *BY DEFINITION*, UTF-8. Ignoring the encoding of the source file would break that definition, and result in invalid UTF-8 sequences within char[] arrays, and consequent run-time errors. This means that D has two choices: (1) it could mandate that the source file encoding MUST be one of the UTF- family, or (2) it could be made to understand and decode other encodings. In effect, this would mean transcoding the source file at compile-time from its original encoding into UTF-8 before feeding it to the existing compilation process.

D has chosen option (1), and I think it was the right choice. Option (2) would have added a tremendous amount of bloat to the compiler - and all so that users don't have to get the hang of "Save As". If D were to "understand" ISO-8859-1 specifically, there would be complaints from those whose native encoding were ISO-8859-2. Why is THEIR encoding supported, but not MINE? The UTF- family are the only truly global encodings we have, right now. They can be understood anywhere in the world, and can encode each and every Unicode character. By insisting that D source files must be UTF-XX, D is helping to educate people to think globally, to be less parochial. ISO-8859-1 is not understood everywhere. UTF-XX is.

        If Walter would care to help everyone out with this one by improving the error message (if only to lay blame somewhere other than DMD), what he should do is this. The compiler should pass the entire source file contents to std.utf.validate (or some equivalent function written in C/C++). If it passes, go ahead and compile. If it fails, issue an error message that the source file is not correctly encoded, and needs to be re-saved as UTF-8 before it will compile.

    That would be perfectly logical.

It's something I would encourage. That said, I'm not sure if the current (unhelpful) error message could actually be deemed a bug. It does, after all, give AN error.

    Now, abusing your knowledge about the issue, how can I transform (in D) a default utf-8 encoded font into ISO-latin1?

I'll assume that where you wrote "font", you meant "string".
In general, to convert a UTF encoded string into another encoding, you need to do "transcoding". This was discussed in the open streams discussion on the main forum not so long ago. In general, you need classes (called Readers) to translate from ENCODING-X to UTF, and other classes (called Writers) to translate from UTF to ENCODING-X. Some people prefer the generic term Filter to the terms Reader and Writer. So, you'd need an ISO-8859-1 Writer class. Unfortunately, such readers and writers don't exist yet. They are part of the ongoing discussion about the future of streams.

Fortunately for you, as it happens, the algorithm for converting ISO-8859-1 to Unicode is dead simple, so you can roll your own. In function form, it is this (see the sketch after this message). Observe that the input is declared as ubyte[], not char[] - this is because, in D, you can't use a char[] array for anything other than UTF-8. Obviously, this algorithm won't work for ISO-8859-2, WINDOWS-1252, or indeed ANY encoding other than Latin1.

    In the program I'm writing most users will use it from a unix console (graphical or not) and I don't want to force them to configure their consoles to utf-8.

But, if I have understood you correctly, you *ARE* going to force them to configure their consoles to ISO-8859-1. That seems most unfair to people who happen not to live in Western Europe or America.

    Thanks again for your answers

No probs. But we seem not to be talking about D bugs any more, so maybe we should re-title this thread and move the discussion over to the main forum?

Arcane Jill
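A minimal sketch of a Latin-1 to UTF-8 conversion function of the kind described above (my own reconstruction, not the original code from the post; the function name is illustrative):

    // Convert ISO-8859-1 (Latin-1) bytes to a UTF-8 char[].
    // In Latin-1, every byte value IS the Unicode code point, so each byte
    // becomes either one UTF-8 byte (0x00-0x7F) or two (0x80-0xFF).
    char[] latin1ToUTF8(ubyte[] s)
    {
        char[] result;
        foreach (ubyte b; s)
        {
            if (b < 0x80)
            {
                result ~= cast(char) b;                    // ASCII: unchanged
            }
            else
            {
                result ~= cast(char) (0xC0 | (b >> 6));    // lead byte
                result ~= cast(char) (0x80 | (b & 0x3F));  // continuation byte
            }
        }
        return result;
    }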
Jul 12 2004