digitalmars.D.bugs - Bug in std.string.format?
- Juanjo Álvarez (6/6) Jul 09 2004 If I do:
- Stewart Gordon (10/14) Jul 09 2004 std.string.format isn't documented as I look. Is this the string coun...
- Arcane Jill (12/18) Jul 09 2004 This is not a bug. You have an invalid UTF-8 sequence. The library is co...
- Arcane Jill (8/13) Jul 09 2004 Oh - and here's the fix. Save your source-code text file in UTF-8 format...
- Arcane Jill (15/15) Jul 09 2004 Actually, come to think of it, it would be very, very helpful to users o...
- Stewart Gordon (19/24) Jul 09 2004 Hang on ... according to the docs, the compiler is supposed to accept
- Arcane Jill (25/46) Jul 09 2004 I stand corrected. However, the UTFs are all very easy to tell apart. UT...
- Juanjo Álvarez (3/10) Jul 10 2004 And they are (or at least were) extensively used in the obfuscated C
- Stewart Gordon (31/63) Jul 12 2004 Are we talking of the byte-order mark, or the fallback for if that's
- Arcane Jill (33/62) Jul 12 2004 I meant heuristically. Although obviously, if there's a BOM, you can tel...
- Stewart Gordon (14/27) Jul 13 2004 As long as you don't confuse its semantics with those of the other
- Juanjo Álvarez (16/26) Jul 09 2004 Then I was confused by the fact that inserting the line:
- Arcane Jill (39/54) Jul 09 2004 That may be a red herring, but I don't know what Python does and I'm not
- Juanjo Álvarez (23/50) Jul 10 2004 True, but the funny thing was that the files are saved (I've just tested...
- Arcane Jill (71/91) Jul 12 2004 In point of fact, your assertion that virtually every other compiler and
If I do:

    //Also with any other ascii 8 bit chars:
    char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");

the program says (at runtime):

    Error: invalid UTF-8 sequence

AFAIK 'Ñ' is UTF-8.
Jul 09 2004
Juanjo Álvarez wrote:

    If I do:
    //Also with any other ascii 8 bit chars:
    char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");
<snip>

std.string.format isn't documented, as far as I can see. Is this the string counterpart of writef, which I'd just pointed out we should have over on d.D?

Stewart.

--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 09 2004
In article <cclofh$1qrr$1@digitaldaemon.com>, Juanjo Álvarez says...

    If I do:
    //Also with any other ascii 8 bit chars:
    char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");
    The program says (in runtime): Error: invalid UTF-8 sequence

This is not a bug. You have an invalid UTF-8 sequence. The library is correctly reporting it.

    AFAIK 'Ñ' is UTF-8.

It is not. The Unicode character U+00D1, LATIN CAPITAL LETTER N WITH TILDE, is represented in UTF-8 by the two byte sequence { 0xC3, 0x91 }.

UTF-8 is backwardly compatible with ASCII. It is /not/, however, backwardly compatible with ISO-8859-1. Any character with codepoint greater than 0x7F must be correctly UTF-8 encoded.

You can get the correct UTF-8 sequence by starting with a string of dchars and passing it to std.utf.toUTF8().

Arcane Jill
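A minimal sketch of that approach in code (my own example, not the poster's; it assumes the std.utf.toUTF8 overload that takes a dchar[], as described above):

    import std.utf;

    void main()
    {
        // Start from the Unicode code point and let toUTF8 do the encoding.
        dchar[] wide;
        wide ~= cast(dchar) 0x00D1;            // U+00D1, LATIN CAPITAL LETTER N WITH TILDE
        char[] narrow = std.utf.toUTF8(wide);  // narrow now holds the two bytes 0xC3, 0x91
        assert(narrow.length == 2);
    }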
Jul 09 2004
In article <ccm0is$2768$1@digitaldaemon.com>, Arcane Jill says...

    char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");
    Error: invalid UTF-8 sequence

    This is not a bug. You have an invalid UTF-8 sequence. The library is correctly reporting it.

Oh - and here's the fix. Save your source-code text file in UTF-8 format before attempting to compile it. I suspect it is currently saved in some ANSI format or other - probably ISO-8859-1 or WINDOWS-1252 depending on your operating system. You need a text editor which can save in UTF-8. D source files should always be saved in UTF-8 format if you want string literals to be correctly interpreted.

Jill
Jul 09 2004
Actually, come to think of it, it would be very, very helpful to users of D if the D compiler actually checked the integrity of all string literals at compile time. If any string literal were found (at compile time) to contain an invalid UTF-8 sequence, it would help the user ENORMOUSLY if an error message along the lines of "source file is not valid UTF-8 - please re-save it in UTF-8 format" were to be printed.

(Strictly speaking, the D compiler should always pass the entire source file to toUTF32(), and generate the above error if toUTF32() fails. However, the source file encoding won't make any difference EXCEPT to string literals).

So ... although it /is/ a user-error, it is nonetheless a user-error which DMD could have detected at compile-time, instead of leaving the error reporting to run time. The error message itself (as it stands) doesn't really help people to understand what's wrong.

Arcane Jill
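A rough sketch of the proposed check (this is not DMD's actual code; it merely illustrates running the whole source through toUTF32 and reporting failure up front):

    import std.utf;

    // Returns true if the source text decodes cleanly as UTF-8.
    bool sourceIsValidUTF8(char[] src)
    {
        try
        {
            std.utf.toUTF32(src);   // throws on an invalid UTF-8 sequence
            return true;
        }
        catch (Object e)            // the Phobos of the time threw an error object here
        {
            return false;
        }
    }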
Jul 09 2004
Arcane Jill wrote:
<snip>

Hang on ... according to the docs, the compiler is supposed to accept UTF-16 and UTF-32 too.

<snip>

    So ... although it /is/ a user-error, it is nonetheless a user-error which DMD could have detected at compile-time, instead of leaving the error reporting to run time. The error message itself (as it stands) doesn't really help people to understand what's wrong.

Some debate is possible. Obviously the compiler isn't being UTF compliant. But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8?

(FTM, how are escaped characters supposed to be handled ITR, considering that a string literal can be either a char[], wchar[] or dchar[]?)

Speaking of lexical.html... "There are no digraphs or trigraphs in D." What is meant by this, exactly?

Stewart.

--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 09 2004
In article <ccmo0h$8u0$1@digitaldaemon.com>, Stewart Gordon says...

    Arcane Jill wrote:
    <snip>

    Hang on ... according to the docs, the compiler is supposed to accept UTF-16 and UTF-32 too.
    <snip>

I stand corrected. However, the UTFs are all very easy to tell apart. UTF-16 looks very different from UTF-8, and it only takes a simple algorithm to distinguish them. Ditto UTF-32. What I *SHOULD* have said is that DMD assumes that the source file is encoded in UTF-8, UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE. What it can't do is tell 8-bit encodings apart from each other, so it assumes that, if it's an 8-bit encoding, it will be UTF-8.

        So ... although it /is/ a user-error, it is nonetheless a user-error which DMD could have detected at compile-time, instead of leaving the error reporting to run time. The error message itself (as it stands) doesn't really help people to understand what's wrong.

    Some debate is possible. Obviously the compiler isn't being UTF compliant.

Yes, it is. The compiler is being 100% UTF compliant. Problems only arise if the source code isn't.

    But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8?

There ain't no such character. UTF-8 can encode the whole of Unicode. I'm not sure there's an OS on the planet which uses characters which are not in Unicode. Oh wait - I believe the ZX Spectrum had some weird clunky graphics characters which are not in Unicode. But we don't need to worry about that because D has not been ported to that platform.

    (FTM, how are escaped characters supposed to be handled ITR, considering that a string literal can be either a char[], wchar[] or dchar[]?)

They are supposed to be represented as is, not escaped in any way (beyond being encoded in UTF-whatever). Unless of course you mean stuff like "\n" - which obviously is stored in source as backslash followed by 'n'. The compiler can figure THAT out because it's part of D.

    Speaking of lexical.html... "There are no digraphs or trigraphs in D." What is meant by this, exactly?

Old, old stuff from the early days of C. You have to go back a long time, but once, there were keyboards without square brackets or curly braces and things, and which were not remappable in software. Digraphs are two-character sequences which a C compiler will replace with those single missing characters. Trigraphs are similar three character sequences.
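A sketch of the sort of simple distinction being described, based on the byte-order mark (the BOM values are standard Unicode; the heuristic for BOM-less files is omitted, and the function name is my own):

    // Identify a UTF encoding from a byte-order mark, if one is present.
    // A real lexer would fall back to a heuristic (or assume UTF-8) otherwise.
    char[] encodingFromBOM(ubyte[] src)
    {
        if (src.length >= 4 && src[0] == 0x00 && src[1] == 0x00
                            && src[2] == 0xFE && src[3] == 0xFF) return "UTF-32BE";
        if (src.length >= 4 && src[0] == 0xFF && src[1] == 0xFE
                            && src[2] == 0x00 && src[3] == 0x00) return "UTF-32LE";
        if (src.length >= 3 && src[0] == 0xEF && src[1] == 0xBB
                            && src[2] == 0xBF)                   return "UTF-8";
        if (src.length >= 2 && src[0] == 0xFE && src[1] == 0xFF) return "UTF-16BE";
        if (src.length >= 2 && src[0] == 0xFF && src[1] == 0xFE) return "UTF-16LE";
        return "no BOM - assume UTF-8";
    }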
Jul 09 2004
Arcane Jill wrote:

        What is meant by this, exactly?

    Old, old stuff from the early days of C. You have to go back a long time, but once, there were keyboards without square brackets or curly braces and things, and which were not remappable in software. Digraphs are two-character sequences which a C compiler will replace with those single missing characters. Trigraphs are similar three character sequences.

And they are (or at least were) extensively used in the obfuscated C contests :)
Jul 10 2004
Arcane Jill wrote:
<snip>

    I stand corrected. However, the UTFs are all very easy to tell apart. UTF-16 looks very different from UTF-8, and it only takes a simple algorithm to distinguish them. Ditto UTF-32.

Are we talking of the byte-order mark, or the fallback for if that's missing?

    What I *SHOULD* have said is that DMD assumes that the source file is encoded in UTF-8, UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE. What it can't do is tell 8-bit encodings apart from each other, so it assumes that, if it's an 8-bit encoding, it will be UTF-8.

Actually, there is a BOM for UTF-8 according to the docs. But no doubt many UTF-8 files are typed without it.

<snip>

    Yes, it is. The compiler is being 100% UTF compliant. Problems only arise if the source code isn't.

Actually, I read that UTF compliance of a text reader necessarily means rejecting input that isn't UTF compliant.

        But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8?

    There ain't no such character. UTF-8 can encode the whole of Unicode. I'm not sure there's an OS on the planet which uses characters which are not in Unicode.

By "match" I actually meant be represented by the same byte sequence. An important issue when it comes to generating console output, interfacing the OS API and stuff like that.

<snip>

        (FTM, how are escaped characters supposed to be handled ITR, considering that a string literal can be either a char[], wchar[] or dchar[]?)

    They are supposed to be represented as is, not escaped in any way (beyond being encoded in UTF-whatever). Unless of course you mean stuff like "\n" - which obviously is stored in source as backslash followed by 'n'. The compiler can figure THAT out because it's part of D.

I meant stuff like "\xA3" actually, and in terms of what it becomes in the actual string data being represented.

<snip>

    Old, old stuff from the early days of C. You have to go back a long time, but once, there were keyboards without square brackets or curly braces and things, and which were not remappable in software. Digraphs are two-character sequences which a C compiler will replace with those single missing characters. Trigraphs are similar three character sequences.

My dad had an old C manual (which I first learned from, but only the very basics) with handwritten notes in it about teletypes from well before my time. From what I remember, you typed something like:

    MAIN()
    \(
        PRINTF("\HELLO, WORLD!\\N");
    \)

But I don't remember there being any trigraphs in those notes. And back in those days, you wrote x =- 4 instead of x -= 4. I don't know at what point someone decided to break existing code by redefining the former to be the same as x = -4.

Stewart.

--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 12 2004
In article <cctnge$1801$1@digitaldaemon.com>, Stewart Gordon says...

    Arcane Jill wrote:
    <snip>

        I stand corrected. However, the UTFs are all very easy to tell apart. UTF-16 looks very different from UTF-8, and it only takes a simple algorithm to distinguish them. Ditto UTF-32.

    Are we talking of the byte-order mark, or the fallback for if that's missing?

I meant heuristically. Although obviously, if there's a BOM, you can tell just by reading the first (at most) four bytes.

    Actually, there is a BOM for UTF-8 according to the docs. But no doubt many UTF-8 files are typed without it.

Yep. Plenty of text editors save UTF-8 without a BOM. Some even offer you the choice of BOM or no-BOM. So the absence of a BOM does not imply that a text file is not UTF-8.

    Actually, I read that UTF compliance of a text reader necessarily means rejecting input that isn't UTF compliant.

Gotcha. In that case, you are correct. So I guess this means that DMD really /must/ validate the source file, or be itself in error. Well spotted.

        But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8?

        There ain't no such character. UTF-8 can encode the whole of Unicode. I'm not sure there's an OS on the planet which uses characters which are not in Unicode.

    By "match" I actually meant be represented by the same byte sequence. An important issue when it comes to generating console output, interfacing the OS API and stuff like that.

Aha. Well, that's an implementation-dependent thing, is it not? Not really a D matter, I would have thought. Would I be correct in assuming that most console escape sequences can be composed entirely out of ASCII characters? If that is so, there isn't a problem anyway.

        Unless of course you mean stuff like "\n" - which obviously is stored in source as backslash followed by 'n'. The compiler can figure THAT out because it's part of D.

    I meant stuff like "\xA3" actually, and in terms of what it becomes in the actual string data being represented.

Understood. Well, there's a little-known difference between '\xA3' and '\u00A3'. '\xA3' means "the byte 0xA3", or, if it's a character, "the character represented by codepoint 0xA3 in whatever encoding I happen to be using at the time", whereas, '\u00A3' means specifically "the character represented by codepoint 0xA3 in /Unicode/". That is, U+00A3, POUND SIGN.

In the particular case of D, a char[] contains UTF-8. So, I imagine it would be perfectly OK to construct valid UTF-8 sequences by hand. That is, I would _HOPE_ that all three of the following lines would produce identical results: but I haven't tested this, so I don't know for sure. If not, it's a bug.

For console escape sequences which are absolutely NOT UTF-8, I would encourage you to store such strings in ubyte[] arrays instead of char[] arrays, where such validity restrictions don't apply. There's nothing to stop you from passing a ubyte[] to std.stream.Stream.write(), after all.

    And back in those days, you wrote x =- 4 instead of x -= 4. I don't know at what point someone decided to break existing code by redefining the former to be the same as x = -4.

I don't know when that happened either. I gather that that change happened though because compilers had a hard time distinguishing between: (a) x =- 4; (b) x = -4;

Arcane Jill
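The three lines Jill refers to were presumably along these lines (my own reconstruction, using the U+00A3 POUND SIGN example from the post; the first line requires the source file itself to be saved as UTF-8):

    char[] a = "£";          // the literal character, as stored by the editor
    char[] b = "\u00A3";     // the Unicode escape for U+00A3, POUND SIGN
    char[] c = "\xC2\xA3";   // the same character's UTF-8 bytes written by hand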
Jul 12 2004
Arcane Jill wrote: <snip>For console escape sequences which are absolutely NOT UTF-8, I would encourage you to store such strings in ubyte[] arrays instead of char[] arrays, where such validity restrictions don't apply. There's nothing to stop you from passing a ubyte[] to std.stream.Stream.write(), after all.As long as you don't confuse its semantics with those of the other methods called write.It's no harder than distinguishing between (a) + +x; (b) ++ x; But no doubt programmers confused them, particularly when they tried writing x=-4 without any spaces. Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.And back in those days, you wrote x =- 4 instead of x -= 4. I don't know at what point someone decided to break existing code by redefining the former to be the same as x = -4.I don't know when that happened either. I gather that that change happened though because compilers had a hard time distinguishing between: (a) x =- 4; (b) x = -4;
Jul 13 2004
Arcane Jill wrote:

        AFAIK 'Ñ' is UTF-8.

    It is not. The Unicode character U+00D1, LATIN CAPITAL LETTER N WITH TILDE, is represented in UTF-8 by the two byte sequence { 0xC3, 0x91 }. UTF-8 is backwardly compatible with ASCII. It is /not/, however, backwardly compatible with ISO-8859-1. Any character with codepoint greater than 0x7F must be correctly UTF-8 encoded.

Then I was confused by the fact that inserting the encoding-declaration line at the start of a Python script makes the interpreter work with latin1 chars directly.

    You can get the correct UTF-8 sequence by starting with a string of dchars and passing it to std.utf.toUTF8().

Could you please provide an example of how that would be done? Because if I try:

    dchar[] dstr = "ESPAÑA";

the compiler says:

    otroformat.d(7): invalid UTF-8 sequence

and if I instead try:

    dchar[] dstr = std.utf.toUTF8("ESPAÑA");

it says:

    otroformat.d(7): function toUTF8 overloads char[](char[] s) and char[](dchar[] s) both match argument list for toUTF8

So I'm a little lost here.
Jul 09 2004
In article <ccn00c$khq$1@digitaldaemon.com>, Juanjo Álvarez says...

    Then I was confused by the fact that inserting the encoding-declaration line at the start of a Python script makes the interpreter work with latin1 chars directly.

That may be a red herring, but I don't know what Python does and I'm not qualified to comment. If I had to guess, I'd say that declaration tells Python the encoding with which the source file was saved.

I can tell you though that D also interprets all Latin-1 characters (and indeed, all Unicode characters) directly ... *IF* the source file is saved in a UTF format. (See below).

DMD may be "deficient" in the sense that it does not understand ISO-8859-1, ISO-8859-2, WINDOWS-1252, etc, etc. - but I would regard that as a strength, not a weakness. Simple. Neat. Clean. However, this does need to be better documented.

    Could you please provide an example of how that would be done? Because if I try:
    dchar[] dstr = "ESPAÑA";
    the compiler says:
    otroformat.d(7): invalid UTF-8 sequence

Honestly - this has got nothing whatsoever to do with the compiler. There's a stage BEFORE compiling - it's called saving the text file.

Let's say you're using Microsoft Notepad. Type something into it, such as: Now - instead of clicking on "Save", click instead on "Save As". You'll see three drop-down menus at the bottom of the dialog. One of them is labelled "Encoding", and it will have "ANSI" selected by default. *** CHANGE IT TO UTF-8 ***. Now save. Now the D compiler will be happy with it.

Pretty much all text editors these days offer such a choice - however it is usually not the default, so you have to remember to explicitly do the Save As / UTF-8 thing. And you can use ALL characters too, not just Latin-1. You can use Latin-2, Greek, Russian, Chinese, whatever. Just remember that trick - SAVE AS UTF-8 before you attempt to compile.

    and if I instead try:
    dchar[] dstr = std.utf.toUTF8("ESPAÑA");
    it says:
    otroformat.d(7): function toUTF8 overloads char[](char[] s) and char[](dchar[] s) both match argument list for toUTF8
    So I'm a little lost here.

I can understand that because, as I said, the DMD error message is not helpful. However, bear in mind that the fault lies with your use of the text editor, not with your use of D.

If Walter would care to help everyone out with this one by improving the error message (if only to lay blame somewhere other than DMD), what he should do is this. The compiler should pass the entire source file contents to std.utf.validate (or some equivalent function written in C/C++). If it passes, go ahead and compile. If it fails, issue an error message that the source file is not correctly encoded, and needs to be re-saved as UTF-8 before it will compile.

Of course, if the source file contains only ASCII characters then it is automatically valid UTF-8, even if it was saved as "ANSI".

Arcane Jill
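For the specific errors quoted above, a rough sketch of what should work once the file really is saved as UTF-8 (my own example, not tested against the original poster's setup):

    import std.utf;

    void main()
    {
        char[] str = "ESPAÑA";                // fine now: the literal is valid UTF-8
        dchar[] dstr = std.utf.toUTF32(str);  // passing a char[] variable, rather than
                                              // a bare literal, avoids the ambiguous
                                              // overload error quoted above
    }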
Jul 09 2004
Arcane Jill wrote:

First things first; thanks for your comments and your patience with me.

        at the start of a Python script makes the interpreter work with latin1 chars directly.

    That may be a red herring, but I don't know what Python does and I'm not qualified to comment. If I had to guess, I'd say that declaration tells Python the encoding with which the source file was saved.

True, but the funny thing was that the files are saved (I've just tested it) as latin1, and it works and doesn't issue the warning it issues if you don't put that line.

    I can tell you though that D also interprets all Latin-1 characters (and indeed, all Unicode characters) directly ... *IF* the source file is saved in a UTF format. (See below).

I didn't notice that my editor was saving the files as ISO-8859-15 and, as you said, the compiler error message didn't help with that.

    DMD may be "deficient" in the sense that it does not understand ISO-8859-1, ISO-8859-2, WINDOWS-1252, etc, etc. - but I would regard that as a strength, not a weakness. Simple. Neat. Clean. However, this does need to be better documented.

I really think that making it also understand ISO-8859-1 (like virtually every other compiler and interpreter out there) would not harm.

    Let's say you're using Microsoft Notepad. Type something into it, such as:

I'm using vim/KDE Kate/KDevelop; after your comment I've configured them to save in utf-8 by default and everything seems to work OK now (well, almost, I still have to configure my terminal emulator to use unicode so the D program's textual non-ascii output is correctly shown.)

    Pretty much all text editors these days offer such a choice - however it is usually not the default, so you have to remember to explicitly do the Save As / UTF-8 thing.

Also true; it wasn't the default _because_ my LC_ALL environment variable (Linux) was set to "es_ES.ISO-8859-15".

    Just remember that trick - SAVE AS UTF-8 before you attempt to compile.

I will, sure.

    If Walter would care to help everyone out with this one by improving the error message (if only to lay blame somewhere other than DMD), what he should do is this. The compiler should pass the entire source file contents to std.utf.validate (or some equivalent function written in C/C++). If it passes, go ahead and compile. If it fails, issue an error message that the source file is not correctly encoded, and needs to be re-saved as UTF-8 before it will compile.

That would be perfectly logical.

Now, abusing your knowledge about the issue, how can I transform (in D) a default utf-8 encoded font into ISO-latin1? In the program I'm writing most users will use it from a unix console (graphical or not) and I don't want to force them to configure their consoles to utf-8.

Thanks again for your answers
Jul 10 2004
In article <ccp45f$r1r$1@digitaldaemon.com>, Juanjo Álvarez says...

        DMD may be "deficient" in the sense that it does not understand ISO-8859-1, ISO-8859-2, WINDOWS-1252, etc, etc. - but I would regard that as a strength, not a weakness. Simple. Neat. Clean. However, this does need to be better documented.

    I really think that making it also understand ISO-8859-1 (like virtually every other compiler and interpreter out there) would not harm.

In point of fact, your assertion that virtually every other compiler and interpreter out there "understands" ISO-8859-1 is not correct. D is superior, in this regard.

In a traditional C compiler, the encoding of the source file is essentially *IGNORED*. There is absolutely no "understanding" going on. A string literal is just a sequence of uninterpreted bytes. The illusion of "understanding" is simply caused by the fact that the text editor at one end, and the console or whatever at the other, happen to use the same encoding as each other. With that borne in mind, you should appreciate that what a C compiler APPEARS to understand is not, in fact, ISO-8859-1, at all. It is simply the default OS encoding, whatever that happens to be. It sounds to me like Python may have real understanding of encodings - but if that's true, Python would be the exception rather than the rule.

However, D *CANNOT* ignore the encoding of the source file. In D, a char[] array must contain, *BY DEFINITION*, UTF-8. Ignoring the encoding of the source file would break that definition, and result in invalid UTF-8 sequences within char[] arrays, and consequent run-time errors. This means that D has two choices: (1) it could mandate that the source file encoding MUST be one of the UTF- family, or (2) it could be made to understand and decode other encodings. In effect, this would mean transcoding the source file at compile-time from its original encoding into UTF-8 before feeding it to the existing compilation process.

D has chosen option (1), and I think it was the right choice. Option (2) would have added a tremendous amount of bloat to the compiler - and all so that users don't have to get the hang of "Save As". If D were to "understand" ISO-8859-1 specifically, there would be complaints from those whose native encoding were ISO-8859-2. Why is THEIR encoding supported, but not MINE? The UTF- family are the only truly global encodings we have, right now. They can be understood anywhere in the world, and can encode each and every Unicode character. By insisting that D source files must be UTF-XX, D is helping to educate people to think globally, to be less parochial. ISO-8859-1 is not understood everywhere. UTF-XX is.

        If Walter would care to help everyone out with this one by improving the error message (if only to lay blame somewhere other than DMD), what he should do is this. The compiler should pass the entire source file contents to std.utf.validate (or some equivalent function written in C/C++). If it passes, go ahead and compile. If it fails, issue an error message that the source file is not correctly encoded, and needs to be re-saved as UTF-8 before it will compile.

    That would be perfectly logical.

It's something I would encourage. That said, I'm not sure if the current (unhelpful) error message could actually be deemed a bug. It does, after all, give AN error.

    Now, abusing your knowledge about the issue, how can I transform (in D) a default utf-8 encoded font into ISO-latin1?

I'll assume that where you wrote "font", you meant "string".
In general, to convert a UTF encoded string into another encoding, you need to do "transcoding". This was discussed in the open streams discussion on the main forum not so long ago. In general, you need classes (called Readers) to translate from ENCODING-X to UTF, and other classes (called Writers) to translate from UTF to ENCODING-X. Some people prefer the generic term Filter to the terms Reader and Writer. So, you'd need an ISO-8859-1 Writer class. Unfortunately, such readers and writers don't exist yet. They are part of the ongoing discussion about the future of streams.

Fortunately for you, as it happens, the algorithm for converting ISO-8859-1 to Unicode is dead simple, so you can roll your own. In function form, it is this (see the sketch after this message). Observe that the input is declared as ubyte[], not char[] - this is because, in D, you can't use a char[] array for anything other than UTF-8. Obviously, this algorithm won't work for ISO-8859-2, WINDOWS-1252, or indeed ANY encoding other than Latin1.

    In the program I'm writing most users will use it from a unix console (graphical or not) and I don't want to force them to configure their consoles to utf-8.

But, if I have understood you correctly, you *ARE* going to force them to configure their consoles to ISO-8859-1. That seems most unfair to people who happen not to live in Western Europe or America.

    Thanks again for your answers

No probs. But we seem not to be talking about D bugs any more, so maybe we should re-title this thread and move the discussion over to the main forum?

Arcane Jill
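A minimal sketch of a Latin-1 to UTF-8 conversion function of the kind described above (my own reconstruction, not the original code from the post; the function name is illustrative):

    // Convert ISO-8859-1 (Latin-1) bytes to a UTF-8 char[].
    // In Latin-1, every byte value IS the Unicode code point, so each byte
    // becomes either one UTF-8 byte (0x00-0x7F) or two (0x80-0xFF).
    char[] latin1ToUTF8(ubyte[] s)
    {
        char[] result;
        foreach (ubyte b; s)
        {
            if (b < 0x80)
            {
                result ~= cast(char) b;                    // ASCII: unchanged
            }
            else
            {
                result ~= cast(char) (0xC0 | (b >> 6));    // lead byte
                result ~= cast(char) (0x80 | (b & 0x3F));  // continuation byte
            }
        }
        return result;
    }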
Jul 12 2004