digitalmars.D.learn - understanding string suffixes
- Manfred Nowak (3/3) Aug 13 2005 What are the default suffixes depending on the bom of the source?
- Derek Parnell (9/12) Aug 13 2005 I don't believe that the string literal suffixes are affected in any way...
- Manfred Nowak (25/34) Aug 14 2005 True, true. I never thought that the meaning of a supplied suffix
- Derek Parnell (19/22) Aug 14 2005 The ambiguity is not in the encoding of the source text but in the way t...
- Manfred Nowak (30/35) Aug 14 2005 [...]
- Derek Parnell (20/66) Aug 14 2005 I can see where you are going with this, but the encoding of the source
- Manfred Nowak (19/38) Aug 14 2005 Nice example, but is this argument suited in general? How will your
- Manfred Nowak (6/8) Aug 14 2005 http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/27069
- Derek Parnell (19/63) Aug 14 2005 As UTF16 encoded strings, just like the rest of the text. The editor
- Ben Hinkle (8/62) Aug 14 2005 Agreed. The c/w/d postfix indicates how to store the string in the obj f...
- xs0 (7/14) Aug 14 2005 Are you sure that it's a good idea to change behavior of code based on
- Manfred Nowak (13/20) Aug 14 2005 What is the code embedded in a file if you do not know the encoding
What are the default suffixes depending on the bom of the source? What is the meaning of a c-suffix in an utf32 source? -manfred
Aug 13 2005
On Sat, 13 Aug 2005 07:20:01 +0000 (UTC), Manfred Nowak wrote:

> What are the default suffixes depending on the bom of the source? What is the meaning of a c-suffix in an utf32 source?

I don't believe that the string literal suffixes are affected in any way by the source code encoding scheme. I think that "qwerty"c is formed as a UTF8 string in RAM by the compiler regardless of the UTF encoding of the source.

--
Derek Parnell
Melbourne, Australia
13/08/2005 11:26:11 PM
Aug 13 2005
Derek Parnell <derek psych.ward> wrote:

>> What are the default suffixes depending on the bom of the source? What is the meaning of a c-suffix in an utf32 source?
>
> I don't believe that the string literal suffixes are affected in any way by the source code encoding scheme.

True, true. I never thought that the meaning of a supplied suffix may change depending on the source code encoding scheme. But the specs state:

| The optional Postfix character gives a specific type to the
| string, rather than it being inferred from the context. This is
| useful when the type cannot be unambiguously inferred, such as
| when overloading based on string type.

But when is the type of a string ambiguous? The BOM, either missing or existing, always supplies a context for every string in the source:

    missing BOM    c
    UTF8-BOM       ? (probably c)
    UTF16-BOM      w
    UTF32-BOM      d

Therefore a string literal s in a UTF32-source is equivalent to the same string literal followed by the d-suffix: sd.

I have done some tests and found that a valid UTF32-code in a string literal suffixed with w throws an error, because it is not a legal UTF16-code. Therefore at least the w-suffix denotes not only a type but also a check of the semantic correctness of the content of the string literal.

> I think that "qwerty"c is formed as a UTF8 string in RAM by the compiler regardless of the UTF encoding of the source.

UTF8? Because it is indistinguishable from ASCII in this case?

-manfred
Aug 14 2005
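For anyone repeating the w-suffix test today, here is a minimal sketch, assuming a current D2 compiler rather than the 2005 one discussed in this thread (whose behaviour may well have differed). As far as I can tell, a code point above U+FFFF in a w-suffixed literal is now accepted and stored as a surrogate pair. The code point U+1F600 is just an arbitrary example, not something from the posts.

```d
void main()
{
    // One code point outside the Basic Multilingual Plane, written as an
    // escape so the result does not depend on the source file's encoding.
    auto c = "\U0001F600"c; // UTF-8: four code units
    auto w = "\U0001F600"w; // UTF-16: a surrogate pair, two code units
    auto d = "\U0001F600"d; // UTF-32: one code unit

    // .length counts code units of the chosen type, not characters.
    assert(c.length == 4);
    assert(w.length == 2);
    assert(d.length == 1);
}
```

So the suffix still selects the storage type, and the content is transcoded to fit it where that is possible.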
On Sun, 14 Aug 2005 12:17:44 +0000 (UTC), Manfred Nowak wrote:

> Derek Parnell <derek psych.ward> wrote:

[snip]

> But when is the type of a string ambiguous?

The ambiguity is not in the encoding of the source text but in the way that a string literal is used when matching function signatures. Given ...

    void func(char[] x) { . . . }
    func( "some string" );

There is no problem so far, as there is only one possible match, but add this ...

    void func(dchar[] x) { . . . }

And now there is an ambiguity. It is in this situation that string literal suffixes are useful. We need to do ...

    func( "some string"c );

or, before suffixes,

    func( cast(char[]) "some string" );

--
Derek Parnell
Melbourne, Australia
14/08/2005 10:54:46 PM
Aug 14 2005
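The suffix-based disambiguation described in the post above can be sketched as follows. This is a hedged illustration for a current D2 compiler, where string literals are immutable, so the hypothetical overloads take `string` and `dstring` instead of the `char[]` and `dchar[]` of 2005. The return values are invented purely to show which overload ran.

```d
// Hypothetical overloads, named func only to mirror the post above.
string func(string s)  { return "char overload"; }
string func(dstring s) { return "dchar overload"; }

void main()
{
    // The suffix fixes the type of the literal, so each call has
    // exactly one matching overload.
    assert(func("some string"c) == "char overload");
    assert(func("some string"d) == "dchar overload");
}
```

How an unsuffixed literal is resolved is deliberately left out here, since that is the very point under discussion and the rules have changed since 2005.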
Derek Parnell <derek psych.ward> wrote:

[...]
> but add this ...
>
>     void func(dchar[] x) { . . . }
>
> And now there is an ambiguity.
[...]

Ouch. Now I see that the old story on string literals has been covered with a fig leaf excuse.

A source containing an overloaded function like

    void func( char[] s){}
    void func( wchar[] s){}
    void func( dchar[] s){}

and a call with an unsuffixed string literal like

    func( "SomeString");

is unambiguously solvable by looking at the BOM of the source file, as I have already mentioned in the foregoing post: In an ASCII-source the char[]-overload has to be used, whereas in a UTF32-source the dchar[]-overload has to be used. What else should be natural?

"Hey dear Chinese, you have written all your strings in this UTF32-source in Chinese letters, but please assure your D-compiler that you really meant to write Chinese letters by appending the d-suffix to all your strings!"?

Nope. No Chinese should be forced to act this way. However, if a string in his source is not a UTF32-string he now can use the c- or d-suffix. Of course this would also imply that a UTF32-source may have severe behaviour changes if the BOM is changed.

There is one more problem I do not understand: what will now happen with the call:

    func( "\u00001111"d "qwerty"c);

Is this ambiguous?

-manfred
Aug 14 2005
On Sun, 14 Aug 2005 14:12:21 +0000 (UTC), Manfred Nowak wrote:

> Derek Parnell <derek psych.ward> wrote:
>
> [...]
>> but add this ...
>>
>>     void func(dchar[] x) { . . . }
>>
>> And now there is an ambiguity.
> [...]
>
> Ouch. Now I see that the old story on string literals has been covered with a fig leaf excuse.
>
> A source containing an overloaded function like
>
>     void func( char[] s){}
>     void func( wchar[] s){}
>     void func( dchar[] s){}
>
> and a call with an unsuffixed string literal like
>
>     func( "SomeString");
>
> is unambiguously solvable by looking at the BOM of the source file, as I have already mentioned in the foregoing post: In an ASCII-source the char[]-overload has to be used, whereas in a UTF32-source the dchar[]-overload has to be used. What else should be natural?

I can see where you are going with this, but the encoding of the source text should be independent of the interpretation of undecorated string literals. Just because a file is encoded as UTF8 there should be no restriction on me deciding to save the file as UTF16. The compiler should not go choosing which function to call based on how the file just happens to be encoded.

> "Hey dear Chinese, you have written all your strings in this UTF32-source in Chinese letters, but please assure your D-compiler that you really meant to write Chinese letters by appending the d-suffix to all your strings!"?

This sounds more like we need to have a pragma that specifies which default encoding we mean to have on the undecorated literals in a specific source text.

> Nope. No Chinese should be forced to act this way. However, if a string in his source is not a UTF32-string he now can use the c- or d-suffix. Of course this would also imply that a UTF32-source may have severe behaviour changes if the BOM is changed.

Exactly, so we should avoid this trap. Keep the default encoding as UTF8, but I still think that a pragma would be a good (and easy to implement) idea.

> There is one more problem I do not understand: what will now happen with the call:
>
>     func( "\u00001111"d "qwerty"c);
>
> Is this ambiguous?

Not to the compiler. If you try this you get the error message " mismatched string literal postfixes 'd' and 'c' "

--
Derek Parnell
Melbourne, Australia
15/08/2005 12:17:33 AM
Aug 14 2005
Derek Parnell <derek psych.ward> wrote:

[...]
> the encoding of the source text should be independent of the interpretation of undecorated string literals. Just because a file is encoded as UTF8 there should be no restriction on me deciding to save the file as UTF16. The compiler should not go choosing which function to call based on how the file just happens to be encoded.

Nice example, but is this argument suited in general? How will your embedded string literals be saved to the UTF16-source by your editor? And once you have changed some of them to real utf16-codes, how will your editor save them, if you decide to revert to utf8?

> This sounds more like we need to have a pragma that specifies which default encoding we mean to have on the undecorated literals in a specific source text.

Agreed. That might be a solution.

[...]
>> what will now happen with the call:
>>
>>     func( "\u00001111"d "qwerty"c);
>>
>> Is this ambiguous?
>
> Not to the compiler. If you try this you get the error message " mismatched string literal postfixes 'd' and 'c' "

Yes. But have you tried any further?

    func( "" ""c); //mismatched string literal postfixes ' ' and 'c'
    func( "" ""d); //mismatched string literal postfixes ' ' and 'd'

So an unsuffixed string literal is compatible with neither c nor d. Why, then, does the compiler complain that undecorated string literals match both char[] and dchar[]?

Are we chasing a phantom, because the overloading routine of dmd is broken? vathix and some others have already reported on similar problems:
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/3206

-manfred
Aug 14 2005
Manfred Nowak <svv1999 hotmail.com> wrote:

[...]
> Are we chasing a phantom, because the overloading routine of dmd is broken?

http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/27069
is the other reference where a `dchar' was matched by a `char' and a `creal'.

-manfred
Aug 14 2005
On Sun, 14 Aug 2005 18:07:45 +0000 (UTC), Manfred Nowak wrote:

> Derek Parnell <derek psych.ward> wrote:
>
> [...]
>> the encoding of the source text should be independent of the interpretation of undecorated string literals. Just because a file is encoded as UTF8 there should be no restriction on me deciding to save the file as UTF16. The compiler should not go choosing which function to call based on how the file just happens to be encoded.
>
> Nice example, but is this argument suited in general?

Yes.

> How will your embedded string literals be saved to the UTF16-source by your editor?

As UTF16 encoded strings, just like the rest of the text. The editor doesn't distinguish between string literals and other text.

> And once you have changed some of them to real utf16-codes, how will your editor save them, if you decide to revert to utf8?

"real utf16-codes"? What are these? If the text is encoded in UTF16 then the entire text already contains real UTF16 codes. If I decide to save it as UTF8, then the editor will translate it for me. The way that the text is displayed remains the same, regardless of its encoding. The storage of the text is independent of the way it is displayed and interpreted by the compiler.

>> This sounds more like we need to have a pragma that specifies which default encoding we mean to have on the undecorated literals in a specific source text.
>
> Agreed. That might be a solution.
>
> [...]
>>> what will now happen with the call:
>>>
>>>     func( "\u00001111"d "qwerty"c);
>>>
>>> Is this ambiguous?
>>
>> Not to the compiler. If you try this you get the error message " mismatched string literal postfixes 'd' and 'c' "
>
> Yes. But have you tried any further?
>
>     func( "" ""c); //mismatched string literal postfixes ' ' and 'c'
>     func( "" ""d); //mismatched string literal postfixes ' ' and 'd'
>
> So an unsuffixed string literal is compatible with neither c nor d. Why, then, does the compiler complain that undecorated string literals match both char[] and dchar[]?

That has got to be a bug.

> Are we chasing a phantom, because the overloading routine of dmd is broken?

It's not broken. It works as documented. ;-)

--
Derek Parnell
Melbourne, Australia
15/08/2005 8:02:38 AM
Aug 14 2005
"Derek Parnell" <derek psych.ward> wrote in message news:1fki98rnsnkwe$.11rh18dazasty.dlg 40tude.net...

> On Sun, 14 Aug 2005 18:07:45 +0000 (UTC), Manfred Nowak wrote:
>
>> Derek Parnell <derek psych.ward> wrote:
>>
>> [...]
>>> the encoding of the source text should be independent of the interpretation of undecorated string literals. Just because a file is encoded as UTF8 there should be no restriction on me deciding to save the file as UTF16. The compiler should not go choosing which function to call based on how the file just happens to be encoded.
>>
>> Nice example, but is this argument suited in general?
>
> Yes.
>
>> How will your embedded string literals be saved to the UTF16-source by your editor?
>
> As UTF16 encoded strings, just like the rest of the text. The editor doesn't distinguish between string literals and other text.
>
>> And once you have changed some of them to real utf16-codes, how will your editor save them, if you decide to revert to utf8?
>
> "real utf16-codes"? What are these? If the text is encoded in UTF16 then the entire text already contains real UTF16 codes. If I decide to save it as UTF8, then the editor will translate it for me. The way that the text is displayed remains the same, regardless of its encoding. The storage of the text is independent of the way it is displayed and interpreted by the compiler.

Agreed. The c/w/d postfix indicates how to store the string in the obj file, not how it is stored in the source file.

>>> This sounds more like we need to have a pragma that specifies which default encoding we mean to have on the undecorated literals in a specific source text.
>>
>> Agreed. That might be a solution.
>>
>> [...]
>>>> what will now happen with the call:
>>>>
>>>>     func( "\u00001111"d "qwerty"c);
>>>>
>>>> Is this ambiguous?
>>>
>>> Not to the compiler. If you try this you get the error message " mismatched string literal postfixes 'd' and 'c' "
>>
>> Yes. But have you tried any further?
>>
>>     func( "" ""c); //mismatched string literal postfixes ' ' and 'c'
>>     func( "" ""d); //mismatched string literal postfixes ' ' and 'd'
>>
>> So an unsuffixed string literal is compatible with neither c nor d. Why, then, does the compiler complain that undecorated string literals match both char[] and dchar[]?
>
> That has got to be a bug.

The current behavior is reasonable to me. It prevents any confusion with func(""c ""): should it keep the c postfix or substitute the "empty postfix"? The current behavior errors and makes the user choose.
Aug 14 2005
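The point that the postfix picks the stored representation, independent of how the source file is saved, can be sketched as below. This assumes a current D2 compiler. The characters are written as escapes so that the result cannot depend on the file's encoding or BOM, and the sample text (a, é, €) is my own choice, not taken from the thread.

```d
void main()
{
    // The same three characters (a, U+00E9, U+20AC) with each postfix.
    auto c = "a\u00E9\u20AC"c;
    auto w = "a\u00E9\u20AC"w;
    auto d = "a\u00E9\u20AC"d;

    // The postfix decides the element type of the stored array.
    static assert(is(typeof(c) == immutable(char)[]));
    static assert(is(typeof(w) == immutable(wchar)[]));
    static assert(is(typeof(d) == immutable(dchar)[]));

    // .length counts code units: 1 + 2 + 3 bytes in UTF-8,
    // one unit per character in UTF-16 and UTF-32 for these code points.
    assert(c.length == 6);
    assert(w.length == 3);
    assert(d.length == 3);
}
```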
Manfred Nowak wrote:

> In an ASCII-source the char[]-overload has to be used, whereas in a UTF32-source the dchar[]-overload has to be used. What else should be natural?

Are you sure that it's a good idea to change behavior of code based on the encoding of the file? I sure don't... That would be like if "123.456" would be interpreted either as 123456 or 123.456, depending on your regional settings.. A definite disaster :)

> "Hey dear Chinese, you have written all your strings in this UTF32-source in Chinese letters, but please assure your D-compiler that you really meant to write Chinese letters by appending the d-suffix to all your strings!"?

Aren't the characters the same in all cases, just the string type changes?

xs0
Aug 14 2005
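The question of whether the characters stay the same while only the string type changes can be checked with a small sketch. It assumes a current D2 compiler and Phobos' `std.utf` transcoding functions (`toUTF8`, `toUTF16`, `toUTF32`). The sample text is arbitrary and not from the thread.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    // Two characters (U+00E9, U+20AC) stored in three encodings.
    string  c = "\u00E9\u20AC"c;
    wstring w = "\u00E9\u20AC"w;
    dstring d = "\u00E9\u20AC"d;

    // Transcoding between the types round-trips: same characters.
    assert(toUTF32(c) == d);
    assert(toUTF32(w) == d);
    assert(toUTF8(d) == c);
    assert(toUTF16(d) == w);

    // Only the number of code units differs.
    assert(c.length == 5 && w.length == 2 && d.length == 2);
}
```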
xs0 <xs0 xs0.com> wrote:

[...]
> Are you sure that it's a good idea to change behavior of code based on the encoding of the file? I sure don't...

What is the code embedded in a file if you do not know the encoding of the file? Please explain.

[...]
> That would be like if "123.456" would be interpreted either as 123456 or 123.456, depending on your regional settings.. A definite disaster :)

.. or 123,456. A disaster the Germans are totally aware of, because comma and point change roles when changing from English to German encoding.

[...]
> Aren't the characters the same in all cases, just the string type changes?

I might be getting you wrong, but why should the string literal consisting of the one-letter d-string for "true" be the characters "true" as a 4-letter c-string?

-manfred
Aug 14 2005