digitalmars.D.bugs - Standard omission or compiler bug: Hexadecimal escapes don't encode
- Burton Radons (8/8) Sep 19 2004 The following string is encoded into UTF-8:
- Walter (5/13) Sep 20 2004 I wasn't sure what to do about that case, so I left the \x as whatever t...
- Stewart Gordon (11/17) Sep 21 2004 Here's somewhere I agree with your choice of behaviour, where \x denotes...
- Stewart Gordon (11/19) Sep 21 2004 In article <ciout3$skf$1 digitaldaemon.com>, Stewart Gordon says and the...
- Regan Heath (8/34) Sep 21 2004 I agree.. however doesn't this make it possible to create an invalid UTF...
- Walter (4/8) Sep 21 2004 Yes.
- Arcane Jill (10/14) Sep 22 2004 Yup. If you use \x in a char array you are doing /low level stuff/. You ...
- Regan Heath (14/31) Sep 22 2004 I agree.
- Arcane Jill (9/13) Sep 23 2004 Okay, you've convinced me.
- Stewart Gordon (15/24) Sep 23 2004 Do char[] and ubyte[] implicitly convert between each other? If not,
- Arcane Jill (58/71) Sep 23 2004 Of course not. They can't even /ex/plicitly convert. How could they? You...
- Stewart Gordon (25/64) Sep 23 2004 I wouldn't. I'd be converting from bytes interpreted as chars to
- Regan Heath (15/25) Sep 23 2004 On Thu, 23 Sep 2004 12:51:16 +0000 (UTC), Stewart Gordon
- Arcane Jill (7/10) Sep 24 2004 Completely in agreement with you there. However, Stewart did actually as...
- Stewart Gordon (12/15) Sep 22 2004 I firmly don't believe in any attempts to force a specific character
- Arcane Jill (15/25) Sep 22 2004 I agree that it should remain possible - but I disagree with the reason....
- Regan Heath (12/27) Sep 22 2004 But it's 'defined' as having that encoding, if you dont want it, dont us...
- Burton Radons (14/31) Sep 25 2004 I don't think this will work; it requires specifying what encoding the
- Arcane Jill (37/53) Sep 27 2004 Walter assures us that the D language itself is not prejudiced toward UT...
- Stewart Gordon (7/11) Sep 29 2004
- Burton Radons (28/56) Sep 29 2004 I don't understand what political interpretation you gave my statement,
- Arcane Jill (38/65) Sep 30 2004 Yes. I was wrong.
- Arcane Jill (21/29) Sep 22 2004 This is correct behavior. You should be using \u for Unicode characters....
The following string is encoded into UTF-8:

    char [] c = "\u7362";

It encodes into x"E4 98 B7". The following string, however, does not get encoded:

    char [] c = "\x8F";

It remains x"8F". Thus the \x specifies a literal byte in the character stream as implemented. The specification doesn't mention this twist, if it was intentional.
Sep 19 2004
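The byte-level claim in this post can be checked outside the compiler. What follows is a minimal Python sketch, used only as a neutral calculator: it says nothing about how DMD itself is implemented. It encodes a code point the way the post describes `\u` behaving, and then shows that a lone 0x8F byte is not a valid UTF-8 sequence by itself. The helper name `hex_bytes` is made up for illustration.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex, e.g. 'E7 8D A2'."""
    return " ".join(f"{b:02X}" for b in data)


# A \u escape names a code point; placing it in a UTF-8 string means encoding it.
encoded = "\u7362".encode("utf-8")
print(hex_bytes(encoded))

# A \x escape, as described in the post, names one raw byte and nothing more.
raw = bytes([0x8F])
print(hex_bytes(raw))

# 0x8F is a UTF-8 continuation byte, so it cannot be decoded on its own.
try:
    raw.decode("utf-8")
    raw_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_is_valid_utf8 = False
print(raw_is_valid_utf8)
```

In this calculation U+7362 comes out as E7 8D A2, while the bytes E4 98 B7 decode to U+4637, so one of the two values quoted in the post looks like a slip. The point being made, that `\u` is encoded and `\x` is left alone, does not depend on which one it is.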
"Burton Radons" <burton-radons shaw.ca> wrote in message news:cilbgp$1p3a$1 digitaldaemon.com...The following string is encoded into UTF-8: char [] c = "\u7362"; It encodes into x"E4 98 B7". The following string, however, does not get encoded: char [] c = "\x8F"; It remains x"8F". Thus the \x specifies a literal byte in the character stream as implemented. The specification doesn't mention this twist, if it was intentional.I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
Sep 20 2004
In article <cioggi$k7p$1 digitaldaemon.com>, Walter says...
<snip>
>> It remains x"8F". Thus the \x specifies a literal byte in the character stream as implemented. The specification doesn't mention this twist, if it was intentional.
> I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.

Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle.

Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.

Stewart.
Sep 21 2004
In article <ciout3$skf$1 digitaldaemon.com>, Stewart Gordon says and then some program or another makes a mess of...
<snip>
> Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle. Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.

Just what is wrong with this web newsgroup interface? I should've carried on using my quote tidier. If anyone else is having the same troubles, you're pointed here....

http://smjg.port5.com/faqs/usenet/quotetidy.html

Hopefully my regular posting environment will soon have a working power supply once again....

Stewart.
Sep 21 2004
On Tue, 21 Sep 2004 10:13:55 +0000 (UTC), Stewart Gordon <Stewart_member pathlink.com> wrote:
> In article <cioggi$k7p$1 digitaldaemon.com>, Walter says...
> <snip>
>>> It remains x"8F". Thus the \x specifies a literal byte in the character stream as implemented. The specification doesn't mention this twist, if it was intentional.
>> I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
> Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle.

I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence? Does the compiler/program catch this invalid sequence? I believe it should.

> Of course, if circumstances dictate that the string be interpreted as a wchar[] or dchar[], then that's another matter.
>
> Stewart.

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Sep 21 2004
"Regan Heath" <regan netwin.co.nz> wrote in message news:opseo5g1si5a2sq9 digitalmars.com...I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence?Yes.Does the compiler/program catch this invalid sequence? I believe it should.Only if the string is interpreted as a wchar[] or dchar[].
Sep 21 2004
In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...
> I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence?

Yup. If you use \x in a char array you are doing /low level stuff/. You are doing encoding-by-hand - and it's up to you to get it right.

> Does the compiler/program catch this invalid sequence? I believe it should.

I disagree. If you're using \x then you're working at the byte level. You might be doing some system-programming-type stuff where you actually /want/ to break the rules. The compiler will catch it if and when you pass it to a toUTF function, and that's good enough for me.

People simply need to understand the difference between \u and \x.

Arcane Jill
Sep 22 2004
On Wed, 22 Sep 2004 07:21:27 +0000 (UTC), Arcane Jill <Arcane_member pathlink.com> wrote:
> In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...
>> I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence?
> Yup. If you use \x in a char array you are doing /low level stuff/. You are doing encoding-by-hand - and it's up to you to get it right.

I agree.

>> Does the compiler/program catch this invalid sequence? I believe it should.
> I disagree. If you're using \x then you're working at the byte level. You might be doing some system-programming-type stuff where you actually /want/ to break the rules.

I disagree. char is 'defined' as being UTF encoded, IMO it should never not be. If you want to 'break the rules' you can/should use ubyte[], then, you're not breaking any rules.

> The compiler will catch it if and when you pass it to a toUTF function, and that's good enough for me.

Probably fair enough.. however, I think it would be more robust if it was made impossible to have an invalid utf8/16/32 sequence. That may be an impossible dream..

> People simply need to understand the difference between \u and \x.

But of course.

Regan
Sep 22 2004
In article <opseq16svz5a2sq9 digitalmars.com>, Regan Heath says...
> I disagree. char is 'defined' as being UTF encoded, IMO it should never not be. If you want to 'break the rules' you can/should use ubyte[], then, you're not breaking any rules.

Okay, you've convinced me. other integer and integer array literals. But that would be a real headache for Walter, since D is supposed to have a context-free grammar. It's not clear to me how the compiler could parse the difference.

Arcane Jill
Sep 23 2004
In article <opseq16svz5a2sq9 digitalmars.com>, Regan Heath says...
<snip>
> I disagree. char is 'defined' as being UTF encoded, IMO it should never not be. If you want to 'break the rules' you can/should use ubyte[], then, you're not breaking any rules.

Do char[] and ubyte[] implicitly convert between each other? If not, it could make code that interfaces a foreign API somewhat cluttered with casts.

And besides this, which is more self-documenting for the purpose? ubyte[] obscures the fact that it's a string, rather than any old block of 8-bit numbers. char[] denotes a string, but is it any more misleading? People coming from a C(++) background are likely to see it and think 'string' rather than 'UTF-8'. (Does anyone actually come from a D background yet?)

>> The compiler will catch it if and when you pass it to a toUTF function, and that's good enough for me.
> Probably fair enough.. however, I think it would be more robust if it was made impossible to have an invalid utf8/16/32 sequence. That may be an impossible dream..
<snip>

That would mean that a single char value would be restricted to the ASCII set, wouldn't it?

Stewart.
Sep 23 2004
In article <ciu7q5$s35$1 digitaldaemon.com>, Stewart Gordon says...
> Do char[] and ubyte[] implicitly convert between each other?

Of course not. They can't even /ex/plicitly convert. How could they? You'd be converting from UTF-8 to ... what exactly? But I suspect you meant implicitly /cast/. In which case, no, they don't do that either.

> If not, it could make code that interfaces a foreign API somewhat cluttered with casts.

Not really, since foreign API functions should be /expecting/ C-strings, that is, pointers to arrays of bytes (not chars), terminated with the byte value \0. So, for example, strcat() should be declared in D as: and not as:

> And besides this, which is more self-documenting for the purpose?

Well, this of course is the big area of disagreement. We all want code to be easily maintainable. That means, more readable; more self-documenting. Readable code is a good thing.

The problem is that some of us (Regan and I, for example) look at a declaration of char[] and see "A string of Unicode characters encoded in UTF-8". It is eminently self-documenting, by the very definition of char[]. We also look at a declaration of byte[] and see "An array of bytes whose interpretation depends on what you do with them".

Others (yourself included) apparently see things differently. You look at a declaration of char[] and see "A string of not-necessarily-Unicode characters encoded in some unspecified way", and see byte[] as "An array of bytes whose interpretation is anything /other/ than a sequence of characters".

It is not really possible for code to be simultaneously self-documenting in both paradigms - but you might like to consider the fact that in C and C++, an array of C chars must be interpreted as "An array of bytes whose interpretation depends on what you do with them" - because C/C++ don't actually /have/ a character type, merely an overused byte type. As soon as you start to think:

    D        Java            C/C++
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    byte     byte            signed char
    ubyte    no equivalent   unsigned char
    char     no equivalent   no equivalent
    wchar    char            wchar_t

and /stop/ imagining that D's char == C's char (which it clearly doesn't) then everything makes sense.

> ubyte[] obscures the fact that it's a string, rather than any old block of 8-bit numbers.

And how would you make such a distinction in C?

> char[] denotes a string, but is it any more misleading? People coming from a C(++) background are likely to see it and think 'string' rather than 'UTF-8'. (Does anyone actually come from a D background yet?)

Maybe you're answering your own question there. Stop thinking in C. This is D. Think in D.

Even if nobody comes from a D background yet - let's just assume that one day, they will. It has been suggested over on the main forum that D's types be renamed. If no type called "char" existed in D; if instead, you had to choose between the types "utf8", "uint8" and "int8", it would be obvious which one you'd go for.

> That would mean that a single char value would be restricted to the ASCII set, wouldn't it?

You're not thinking in Unicode. A D char stores a "code unit" (a UTF-8 fragment), not a character codepoint. UTF-8 code-units coincide with character codepoints /only/ in the ASCII range. A single char value, however, can store any valid UTF-8 fragment. You would be wrong, however, to interpret this as a character. For example:

Arcane Jill
Sep 23 2004
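The code-unit versus code-point distinction made at the end of this post can be sketched numerically. The snippet below is Python rather than D and is only an illustration of UTF-8 itself; the helper name `utf8_units` and the choice of U+00E9 as the sample character are assumptions, since the post's own example did not survive in this archive.

```python
def utf8_units(text: str) -> list:
    """Return the UTF-8 code units (bytes) that encode `text`."""
    return list(text.encode("utf-8"))


# In the ASCII range, one code point is exactly one code unit.
ascii_units = utf8_units("A")
print(ascii_units)

# Outside ASCII, one code point needs several code units.
accented_units = utf8_units("\u00E9")
print(accented_units)

# A single non-ASCII code unit is only a fragment: it does not decode
# to a character on its own.
fragment = bytes([accented_units[0]])
try:
    fragment.decode("utf-8")
    fragment_is_character = True
except UnicodeDecodeError:
    fragment_is_character = False
print(fragment_is_character)
```

Read this way, a single 8-bit slot can hold any UTF-8 code unit, but only values below 0x80 stand for a whole character.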
In article <ciudb4$11ba$1 digitaldaemon.com>, Arcane Jill says...
> In article <ciu7q5$s35$1 digitaldaemon.com>, Stewart Gordon says...
>> Do char[] and ubyte[] implicitly convert between each other?
> Of course not. They can't even /ex/plicitly convert. How could they? You'd be converting from UTF-8 to ... what exactly?

I wouldn't. I'd be converting from bytes interpreted as chars to bytes interpreted as bytes of arbitrary semantics.

<snip>
> Not really, since foreign API functions should be /expecting/ C-strings, that is, pointers to arrays of bytes (not chars), terminated with the byte value \0.

Even if they're written in/for Pascal or Fortran?

> So, for example, strcat() should be declared in D as: and not as:

Then how would I write the C call strcat(qwert, "yuiop"); in D?

<snip>
> Others (yourself included) apparently see things differently. You look at a declaration of char[] and see "A string of not-necessarily-Unicode characters encoded in some unspecified way", and see byte[] as "An array of bytes whose interpretation is anything /other/ than a sequence of characters".

Did I say that? I didn't mean to indicate that byte[] necessarily isn't an array of characters. Merely that I don't see people as seeing it and thinking 'string'.

<snip>
>> ubyte[] obscures the fact that it's a string, rather than any old block of 8-bit numbers.
> And how would you make such a distinction in C?

With a typedef.

<snip>
> Maybe you're answering your own question there. Stop thinking in C. This is D. Think in D.

I do on the whole. But trying to think in Windows API at the same time isn't easy. It'll probably be easier once the D Windows headers are finished.

> Even if nobody comes from a D background yet - let's just assume that one day, they will. It has been suggested over on the main forum that D's types be renamed. If no type called "char" existed in D; if instead, you had to choose between the types "utf8", "uint8" and "int8", it would be obvious which one you'd go for.

Then if only such types existed as "ansi", "windows1252", "windows1253", "ibm", "iso8859_5", "macdevanagari" then the list would be complete.

<snip>
> You're not thinking in Unicode. A D char stores a "code unit" (a UTF-8 fragment), not a character codepoint. UTF-8 code-units coincide with character codepoints /only/ in the ASCII range. A single char value, however, can store any valid UTF-8 fragment. You would be wrong, however, to interpret this as a character.
<snip>

That makes sense....

Stewart.
Sep 23 2004
On Thu, 23 Sep 2004 12:51:16 +0000 (UTC), Stewart Gordon <Stewart_member pathlink.com> wrote:
> <snip>
>> Even if nobody comes from a D background yet - let's just assume that one day, they will. It has been suggested over on the main forum that D's types be renamed. If no type called "char" existed in D; if instead, you had to choose between the types "utf8", "uint8" and "int8", it would be obvious which one you'd go for.
> Then if only such types existed as "ansi", "windows1252", "windows1253", "ibm", "iso8859_5", "macdevanagari" then the list would be complete.

It appears to me that Walter has decided on having only 3 types with a specified encoding, and all other encodings will be handled by using ubyte[]/byte[] and conversion functions. I think this is the right choice.

I see unicode as the future and other encodings as legacy encodings, whose use I hope gradually disappears. Of course if there is a valid reason for a certain encoding to remain, for speed/space/other reasons, and D wanted the same sort of built-in support as we do for utf8/16/32 then a new type might emerge.

> <snip>

Regan
Sep 23 2004
In article <opses3k8mv5a2sq9 digitalmars.com>, Regan Heath says...
> It appears to me that Walter has decided on having only 3 types with a specified encoding, and all other encodings will be handled by using ubyte[]/byte[] and conversion functions.

Completely in agreement with you there. However, Stewart did actually ask a question which I couldn't answer, and which we shouldn't ignore. Maybe you have some ideas.

..anyway... I'm moving my reply to the main forum. I think it's more appropriate there.

Arcane Jill
Sep 24 2004
In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...
<snip>
> I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence? Does the compiler/program catch this invalid sequence? I believe it should.

I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created. As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs.

The ability to use arbitrary \x codes provides this neatly. I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8.

Stewart.
Sep 22 2004
In article <cirlav$2a9s$1 digitaldaemon.com>, Stewart Gordon says...
> I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created.

I do, since it's documented that way.

> As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs.

I agree that it should remain possible - but I disagree with the reason. Non-UTF encodings are more properly stored as ubyte[] arrays in D. Remember, C and C++ simply don't /have/ a type equivalent to D's char, so functions written in C or C++ were /never/ intended to receive such a type. C's char == D's byte or ubyte.

The possible reasons why one might want to store arbitrary byte values in chars include scary hand-encoding of UTF-8 and possibly some esoteric custom extensions (for example, imagine you invent some backwardly compatible UTF-8-PLUS). Such uses are, however, rare. About as rare as, say, needing to write a custom allocator because "new" isn't good enough. It should always be possible, but never commonplace.

> The ability to use arbitrary \x codes provides this neatly. I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8.

Of course this makes perfect logical sense - /if/ you're talking about a ubyte[] array, not a char[] array.

Jill
Sep 22 2004
On Wed, 22 Sep 2004 10:49:03 +0000 (UTC), Stewart Gordon <Stewart_member pathlink.com> wrote:
> In article <opseo5g1si5a2sq9 digitalmars.com>, Regan Heath says...
> <snip>
>> I agree.. however doesn't this make it possible to create an invalid UTF-8 sequence? Does the compiler/program catch this invalid sequence? I believe it should.
> I firmly don't believe in any attempts to force a specific character encoding on every char[] ever created.

But it's 'defined' as having that encoding; if you don't want it, don't use char[], use byte[] instead.

> As said before, it should remain possible for char[] literals to contain character codes that aren't UTF-8, for such purposes as interfacing OS APIs.

A C/C++ char* is a signed 8 bit value with no specified encoding. D's byte[] matches that perfectly. Maybe byte[] should be implicitly convertible to char* (if it's not already).

> The ability to use arbitrary \x codes provides this neatly. I imagine few people would use it to insert UTF-8 characters in practice - if they want the checking, they can either type the character directly or use the \u code, which is much simpler than manually converting it to UTF-8.

Sure, really I'm playing devil's advocate.. I question the logic of 'defining' char to be utf8 if you're not going to enforce it.

Regan
Sep 22 2004
Stewart Gordon wrote:
> In article <cioggi$k7p$1 digitaldaemon.com>, Walter says...
> <snip>
>>> It remains x"8F". Thus the \x specifies a literal byte in the character stream as implemented. The specification doesn't mention this twist, if it was intentional.
>> I wasn't sure what to do about that case, so I left the \x as whatever the programmer wrote. The \u, though, is definitely meant as unicode and so is converted to UTF-8.
> Here's somewhere I agree with your choice of behaviour, where \x denotes byte values, not Unicode codepoints. Hence here, the coder who writes \x8F intended the byte having this value - a single value of type char. Moreover, it follows the "looks like C, acts like C" principle.

I don't think this will work; it requires specifying what encoding the compiler worked with internally. For example, DMD works in UTF-8 internally. Therefore the first string is okay but the second is not because the UTF-8 is broken:

    char [] foo = "\x8F";
    wchar [] bar = "\x8F";

But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect any problem with either of those strings.

So a literal string must be valid for arbitrary conversion between any encoding (that can only be interpreted as "\x specifies a UNICODE character"), OR there must be a mandate for what encoding the compiler uses internally. I think the former is less odious; as soon as you start depending upon features of an encoding, you get into trouble.
Sep 25 2004
In article <cj4c5p$1r6s$1 digitaldaemon.com>, Burton Radons says...
> I don't think this will work; it requires specifying what encoding the compiler worked with internally. For example, DMD works in UTF-8 internally.

Walter assures us that the D language itself is not prejudiced toward UTF-8; that UTF-16 and UTF-32 have equal status. I can think of one or two examples which seem to contradict this, but they are likely to disappear once D gives us implicit conversions between the UTFs.

> Therefore the first string is okay but the second is not because the UTF-8 is broken: char [] foo = "\x8F"; wchar [] bar = "\x8F";

I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay. But that's okay. Anyone using \x in a char[], wchar[] or dchar[] is expected to know what they're doing, otherwise they should be using \u.

> But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect any problem with either of those strings.

There isn't /necessarily/ anything wrong with either of those strings. For example: The requirement for using \x is merely that the programmer knows their UTF.

> So a literal string must be valid for arbitrary conversion between any encoding (that can only be interpreted as "\x specifies a UNICODE character"),

No, the requirement is that programmers /must not/ use \x within a string unless they understand exactly how it will be interpreted. For most normal purposes, stick to this golden rule:

*) For char[], wchar[] or dchar[] - use \u
*) For all other arrays - use \x

> OR there must be a mandate for what encoding the compiler uses internally.

I don't see a need for that.

> I think the former is less odious; as soon as you start depending upon features of an encoding, you get into trouble.

Right. Which is why \x in strings should be considered "experts only". But I would hesitate to call that a "bug".

It would be /possible/ for D's lexer to distinguish character string constants from byte string constants in some cases. I don't know if that would be a good idea. What I mean is: This would catch a lot of such bugs at compile time. Maybe Walter could be persuaded to go for this, I don't know. But \x bugs are bugs in user code, not in the compiler.

Arcane Jill
Sep 27 2004
In article <cj8fhp$1h9u$1 digitaldaemon.com>, Arcane Jill says...
<snip>
> UTF-8 UTF-16 for U+008F
<snip>

No, "\x8F" _means_ the byte with value 0x8F, meant to be interpreted as UTF-8. Somewhere in the docs there's an example or two of a wchar[] or dchar[] being initialised with UTF-8 in this way.

Stewart.
Sep 29 2004
Arcane Jill wrote:
> In article <cj4c5p$1r6s$1 digitaldaemon.com>, Burton Radons says...
>> I don't think this will work; it requires specifying what encoding the compiler worked with internally. For example, DMD works in UTF-8 internally.
> Walter assures us that the D language itself is not prejudiced toward UTF-8; that UTF-16 and UTF-32 have equal status. I can think of one or two examples which seem to contradict this, but they are likely to disappear once D gives us implicit conversions between the UTFs.

I don't understand what political interpretation you gave my statement, but it was only to introduce an object example.

>> Therefore the first string is okay but the second is not because the UTF-8 is broken: char [] foo = "\x8F"; wchar [] bar = "\x8F";
> I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay.

If the compiler uses UTF-8 internally, the first string compiles correctly as a string with length one while the second string does not, because the compiler tries to re-encode it as UTF-16 during semantic processing and fails. I am describing DMD's current behaviour, mind.

If the compiler uses UTF-16 or UTF-32 internally (where it would convert the source file into its native encoding during BOM processing), then both strings compile. The first string has length two; the second string has length one.

[snip]
>> But if a compiler uses UTF-16 or UTF-32 internally, then it won't detect any problem with either of those strings.
> There isn't /necessarily/ anything wrong with either of those strings. For example:

If the compiler is using UTF-8 internally, there is no possible way to re-encode the second string as UTF-16 while remaining consistent with compilers that use different encodings. To use your example encoding, if the compiler uses UTF-8 internally, then this code:

    wchar [] s = "\xC4\x8F";

Would result in a single-code string (you understand that D grammar is contextless and that string escapes are interpreted during tokenisation, right?). However, if the compiler uses UTF-16 internally, it would result in a two-code string.

This does show a third option, however: change string escapes so that they must not be interpreted until after semantic processing where they can be interpreted directly as their destination encoding. But that only serves to illustrate how unnatural this behaviour is; which may be why no Unicode-supporting language that I can find that handles \x interprets it as anything but a character.

[snip]
Sep 29 2004
In article <cjg3qq$1ui6$1 digitaldaemon.com>, Burton Radons says...
>>> Therefore the first string is okay but the second is not because the UTF-8 is broken: char [] foo = "\x8F"; wchar [] bar = "\x8F";
>> I presume you meant that the other way round: the /first/ string is broken; the /second/ string is okay.
> If the compiler uses UTF-8 internally, the first string compiles correctly as string with length one while the second string does not, because the compiler tries to re-encode it as UTF-16 during semantic processing and fails. I am describing DMD's current behaviour, mind.

Yes. I was wrong. The first example compiles okay, but results in foo containing an invalid UTF-8 sequence. The second example does not compile. (I assumed that it would, without testing the hypothesis. That'll teach me).

> To use your example encoding, if the compiler uses UTF-8 internally, then this code: wchar [] s = "\xC4\x8F";

Again, I was wrong. I assumed (without testing) that this would compile to a two-wchar string constant, with s[0] containing U+00C4 and s[1] containing U+008F. In actual fact, what this code yields is a one-wchar string constant, with s[0] containing U+010F. I would call that a bug. [ 0xC4, 0x8F ] is UTF-8 (not UTF-16) for U+010F. But s is a wchar string, so it's supposed to be UTF-16.

> (you understand that D grammar is contextless and that string escapes are interpreted during tokenisation, right?). However, if the compiler uses UTF-16 internally, it would result in a two-code string.

I think you're right. That's what's happening. The compiler is interpreting all string constants as though they were UTF-8, regardless of the type of the destination.

> This does show a third option, however: change string escapes so that they must not be interpreted until after semantic processing where they can be interpreted directly as their destination encoding.

That's the way I assumed it would be done.

> But that only serves to illustrate how unnatural this behaviour is; which may be why that no Unicode-supporting language that I can find that handles \x interprets it as anything but a character.

Actually, C and C++ interpret \x as a LOCAL ENCODING character, and \u as a UNICODE character. Thus, in C++, if your local encoding were Windows-1252, then the following two statements would have identical effect: Both of these will leave the string s containing a single (byte-wide) char, with value 0x80. (Plus the null-terminator, of course). Compare this with which /should/ fail to compile on a Windows-1252 machine.

So you /are/ right, but nonetheless there is a difference between \x and \u. And this presents a problem for D, because D aims to be portable between encodings. In D, therefore, \x SHOULD NOT be interpreted according to the local encoding, because this would immediately make code non-portable.

One way around this would be to assert that \x should mean exactly the same thing as \u and \U (that is, to specify a Unicode character). Now, that would be fine for those of us used to Latin-1, but Cyrillic users (for example) would be left out in the cold.

Currently, I have come to the conclusion that \x should be deprecated. The escapes \u and \U explicitly specify a character set (i.e. Unicode), and that is what you need for portability. \x just has too many problems.

Arcane Jill
Sep 30 2004
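The two readings of "\xC4\x8F" discussed in this exchange can be compared side by side. The following is a minimal Python sketch of the arithmetic only; it does not model DMD's lexer, and the helper names `code_points` and `utf16_length` are invented for illustration.

```python
def code_points(text: str) -> list:
    """List the code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def utf16_length(text: str) -> int:
    """Count the 16-bit code units needed to store `text` as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


raw = bytes([0xC4, 0x8F])

# Reading 1: the two bytes form one UTF-8 sequence.
as_utf8 = raw.decode("utf-8")
print(code_points(as_utf8), utf16_length(as_utf8))

# Reading 2: each byte is its own code point in the range U+0000..U+00FF
# (latin-1 decoding maps byte values to code points one-to-one).
as_code_points = raw.decode("latin-1")
print(code_points(as_code_points), utf16_length(as_code_points))
```

The first reading gives the one-unit string containing U+010F that the post reports; the second gives the two-unit string containing U+00C4 and U+008F that was originally expected.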
In article <cilbgp$1p3a$1 digitaldaemon.com>, Burton Radons says...
> The following string is encoded into UTF-8: char [] c = "\u7362"; It encodes into x"E4 98 B7". The following string, however, does not get encoded: char [] c = "\x8F"; It remains x"8F". Thus the \x specifies a literal byte in the character stream as implemented. The specification doesn't mention this twist, if it was intentional.

This is correct behavior. You should be using \u for Unicode characters. \x is for literal bytes. \u is supposed to understand the encoding. \x is not.

In D, the source code encoding must always be a UTF, but in other computer languages, this is not so. Imagine a C++ program in which the source code encoding were WINDOWS-1252. In such a case, the following two lines would be equivalent: In both cases, a single byte [0x80] will be placed in the string s.

And now, here's the same thing in a C++ program in which the source code encoding is WINDOWS-1251: In both cases, a single byte [0x88] will be placed in the string s.

Now, since D does not allow non-UTF source code encodings, the distinction may appear blurred, but it's still there. Just remember:

\x => insert this literal byte
\u => insert this Unicode character, encoded in the appropriate encoding.

Arcane Jill
Sep 22 2004
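The two byte values mentioned in this post, 0x80 under WINDOWS-1252 and 0x88 under WINDOWS-1251, are consistent with a single character being encoded in two different code pages. The C++ lines themselves are missing from this archive, so the sketch below assumes the character was the euro sign U+20AC; that choice is a guess made for illustration, and Python's codecs stand in for the compiler's source encoding.

```python
# Assumed example character: the original code lines did not survive,
# and U+20AC is simply one code point that fits both quoted byte values.
EURO = "\u20AC"

# The same code point becomes a different single byte in each code page.
cp1252_bytes = EURO.encode("cp1252")
cp1251_bytes = EURO.encode("cp1251")

# In UTF-8 it is a three-byte sequence instead.
utf8_bytes = EURO.encode("utf-8")

print(cp1252_bytes.hex(), cp1251_bytes.hex(), utf8_bytes.hex())
```

This is the sense in which `\u` depends on the target encoding while a literal byte written with `\x` does not.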