digitalmars.D - Non UTF characters in comments
- Vathix (9/9) Jan 29 2005 I'm not sure if allowing non UTF characters in comments is such a good
- Thomas Kühne (23/23) Jan 29 2005 -----BEGIN PGP SIGNED MESSAGE-----
- Vathix (1/2) Jan 29 2005 Looks like DMD allows that in comments and I don't think it's a good ide...
- Anders F Björklund (20/32) Jan 29 2005 The current lexer just skips all bytes in comments,
- Thomas Kühne (39/39) Jan 29 2005 -----BEGIN PGP SIGNED MESSAGE-----
- Anders F Björklund (7/12) Jan 29 2005 I just think it should treat comments the
- Thomas Kühne (40/40) Jan 29 2005 -----BEGIN PGP SIGNED MESSAGE-----
- Anders F Björklund (7/15) Jan 29 2005 can be changed into:
- Anders F Björklund (6/11) Jan 29 2005 Hilarious, the new patch made phobos fail:
- Thomas Kühne (38/38) Jan 29 2005 -----BEGIN PGP SIGNED MESSAGE-----
- Anders F Björklund (7/16) Jan 29 2005 Yes, the idea is that it will not be valid and
- Andrew Fedoniouk (10/19) Jan 29 2005 What does it mean "non UTF character" ?
- Sebastian Beschke (4/11) Jan 29 2005 Invalid sequences *are* possible by using codepoints in the table that
- Anders F Björklund (4/6) Jan 29 2005 A simple way to do it is to try to interpret a file in Latin-1 as UTF-8.
- Andrew Fedoniouk (11/22) Jan 29 2005 I see. Thanks, Sebastian.
- Anders F Björklund (3/5) Jan 29 2005 A bug in the current DMD makes it allow almost everything, in comments.
- Andrew Fedoniouk (8/9) Jan 29 2005 It's a feature rather than a bug.
- Anders F Björklund (5/7) Jan 29 2005 :-)
- Andrew Fedoniouk (17/18) Jan 29 2005 Oh, dreams, dreams... this famous XHTML will never be a "lingua franca" ...
- Anders F Björklund (6/10) Jan 30 2005 XHTML *is* both HTML and XML at once, which is why it is so useful...
- Derek (8/20) Jan 29 2005 One could argue that comments are not actual D source code ;-)
- Anders F Björklund (4/8) Jan 30 2005 It makes other stuff like parsers easier, if all .d files are valid UTF.
- Vathix (3/7) Jan 29 2005 A value in the file that causes std.utf functions to throw an exception ...
- Walter (3/12) Jan 29 2005 Technically it's an error to have non-UTF characters anywhere in the sou...
- Brian Chapman (4/13) Jan 29 2005 I don't know what your situation/eviroment is, but you could just try
I'm not sure if allowing non UTF characters in comments is such a good idea. It seems to be complicating my parser, and it will probably complicate other things like text/code editors. What is supposed to happen when a non UTF character is encountered? Should it display a question mark, display nothing, use the current code page? What if the editor doesn't know about D's comments? I might not have mentioned this, but since D is supposed to be easily parsed, this might be an issue; a special case.
- Chris
Jan 29 2005
Vathix wrote:
| I'm not sure if allowing non UTF characters in comments is such a good
| idea. It seems to be complicating my parser, and it will probably
| complicate other things like text/code editors. What is supposed to
| happen when a non UTF character is encountered? Should it display a
| question mark, display nothing, use the current code page? What if the
| editor doesn't know about D's comments?

Maybe I am misreading your post. Are you trying to use 2 different encodings in one file?

Concerning Unicode: you are supposed to display the glyph of U+FFFD for all characters that can't be displayed by other means (e.g. a generic glyph displaying the codepoint or the code range). (Depending on your situation you might also use U+FFFC.)

http://www.unicode.org

Thomas
Jan 29 2005
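Thomas's suggestion above, showing U+FFFD for anything that cannot be displayed, could look roughly like this in an editor that reads bytes which may not be valid UTF-8. This is only a sketch: the names `utf8_seq_len` and `replace_invalid_utf8` are made up for the example and are not part of DMD or of any editor, and substituting one U+FFFD per offending byte is just one possible policy.

```c
#include <stddef.h>

/* Length (1..4) of the well-formed UTF-8 sequence at s, or 0 if there is none.
   Hypothetical helper for this sketch, not DMD code. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const unsigned long min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    unsigned long c;

    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;                                   /* ASCII */
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;                  /* stray continuation byte or 0xF8..0xFF */
    if (n < len) return 0;          /* sequence cut off by end of input */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[len]) return 0;             /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* surrogate */
    if (c > 0x10FFFF) return 0;                /* beyond the Unicode range */
    return len;
}

/* Copy in[0..n) to out, replacing every byte that does not begin a
   well-formed sequence with U+FFFD (bytes EF BF BD). Returns the number of
   bytes written; stops early if out (capacity cap) would overflow. */
size_t replace_invalid_utf8(const unsigned char *in, size_t n,
                            unsigned char *out, size_t cap)
{
    size_t i = 0, o = 0, k;

    while (i < n) {
        size_t len = utf8_seq_len(in + i, n - i);
        if (len == 0) {
            if (o + 3 > cap) break;
            out[o++] = 0xEF; out[o++] = 0xBF; out[o++] = 0xBD;
            i++;                    /* skip exactly one bad byte */
        } else {
            if (o + len > cap) break;
            for (k = 0; k < len; k++) out[o++] = in[i++];
        }
    }
    return o;
}
```

Fed the Latin-1 bytes of "Kühne" (4B FC 68 6E 65), this leaves the ASCII letters alone and turns the single byte FC into the three bytes of U+FFFD.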
> Are you trying to use 2 different encodings in one file?

Looks like DMD allows that in comments and I don't think it's a good idea.
Jan 29 2005
Vathix wrote:
>>> I'm not sure if allowing non UTF characters in comments is such a good idea.
>> Are you trying to use 2 different encodings in one file?
> Looks like DMD allows that in comments and I don't think it's a good idea.

The current lexer just skips all bytes in comments, until it finds the end of the current comment run. And that's probably not a good idea, but simpler... (otherwise you would have to check all non-ASCIIs)

You still cannot use such invalid UTF sequences for anything such as identifiers or strings, though...

Just consider it a bug in the current DMD front-end ? (i.e. don't abuse this, since it'll be fixed one day)

Says http://www.digitalmars.com/d/lex.html:
> D source text can be in one of the following formats:
> * ASCII
> * UTF-8
> * UTF-16BE
> * UTF-16LE
> * UTF-32BE
> * UTF-32LE

This implies that *all* source input should be valid UTF (since ASCII is also valid as UTF-8)

It *should* just stop dead when it finds one, e.g.:
    error("invalid UTF-8 sequence");

--anders

PS. A nice feature would be to have the frontend convert from other encodings as well, but it would just add unneeded complexity since there are a *lot* of possible encodings out there (200)
Jan 29 2005
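As an illustration of the format list Anders quotes, here is a hedged sketch of how a front-end might tell those encodings apart from a leading byte order mark before handing the text to the lexer. The enum and the function `sniff_bom` are hypothetical names for this example; treating a file without a BOM as UTF-8/ASCII is an assumption of the sketch, and it does not claim to reproduce what DMD actually does.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    ENC_UTF8_OR_ASCII,   /* UTF-8 BOM (EF BB BF) or no BOM at all */
    ENC_UTF16BE,
    ENC_UTF16LE,
    ENC_UTF32BE,
    ENC_UTF32LE
} SourceEncoding;

/* Inspect the first bytes of a source file. *bom_len receives the number of
   BOM bytes to skip (0 if no BOM was found). */
SourceEncoding sniff_bom(const unsigned char *p, size_t n, size_t *bom_len)
{
    *bom_len = 0;

    /* UTF-32LE must be tested before UTF-16LE: both begin with FF FE. */
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) {
        *bom_len = 4;
        return ENC_UTF32LE;
    }
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) {
        *bom_len = 4;
        return ENC_UTF32BE;
    }
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8_OR_ASCII;
    }
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16BE;
    }
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16LE;
    }
    return ENC_UTF8_OR_ASCII;   /* assumption: no BOM means UTF-8 or ASCII */
}
```

Note that none of this helps with the Latin-1 case discussed in the thread: a Latin-1 file has no BOM, so it would be read as UTF-8 and only fail later, when an invalid sequence is decoded.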
Anders F Björklund wrote:
| Vathix wrote:
|
|>>> I'm not sure if allowing non UTF characters in comments is such a
|>>> good idea.
|
|>> Are you trying to use 2 different encodings in one file?
|>
|> Looks like DMD allows that in comments and I don't think it's a good
|> idea.
|
| The current lexer just skips all bytes in comments,
| until it finds the end of the current comment run.
[snip]
| It *should* just stop dead when it finds one, e.g.:
| error("invalid UTF-8 sequence");

I don't think the compiler should try to check the comment's content.

What is an "invalid" UTF-8 sequence? How would you e.g. handle Java's pre 1.5 "customised" UTF-8? (encoding >U-FFFF as UTF-16 surrogates encoded in 2 UTF-8 sequences)

- Granted, we might agree on overlong sequences, but how about unassigned codepoints?
- Has the input to be normalized? What normalization?
- Are you going to enforce the full Unicode spec? What spec version?
- How about the PUA?
- How about >U-11FFFD?
- Is U-FFFD/U-FFFC allowed?

Thomas
Jan 29 2005
Thomas Kühne wrote:
> | It *should* just stop dead when it finds one, e.g.:
> | error("invalid UTF-8 sequence");
>
> I don't think the compiler should try to check the comment's content.

Why not ? It checks the rest of the file...

> What is an "invalid" UTF-8 sequence?

I just think it should treat comments the same way it treats identifiers and literals ?

That is, call: utf_decodeChar and follow whatever error that it returns... (utf.c)

--anders
Jan 29 2005
Anders F Björklund wrote:
| Thomas Kühne wrote:
|
|> | It *should* just stop dead when it finds one, e.g.:
|> | error("invalid UTF-8 sequence");
|>
|> I don't think the compiler should try to check the comment's content.
|
| Why not ? It checks the rest of the file...
|
|> What is an "invalid" UTF-8 sequence?
|
| I just think it should treat comments the
| same way it treats identifiers and literals ?
|
| That is, call: utf_decodeChar and follow
| whatever error that it returns... (utf.c)

The current checks for identifiers are:

1) shortest possible byte sequence for UTF-8
   OK
2) no lone surrogate part
   That might clash with pre 1.5 Java output. This is a Java bug, thus can be ignored.
3) c <= 0x10FFFF
   OK
4) c != 0xFFFE && c != 0xFFFF
   That's the only check I reject. Those codepoints can occur if a non-Unicode document is converted to UTF encoded Unicode. Inside of comments they shouldn't stop the parsing.

Those checks above are, except for the 4th, reasonable for comments.

Thomas
Jan 29 2005
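The four checks Thomas lists can be sketched as a small decoder. To be clear, this is not the code from dmd/utf.c: `decode_utf8` and its `allow_nonchars` parameter are hypothetical, written only to show where each check would sit and how check 4 could be switched off for comments, as Thomas proposes. Like the function discussed in the thread, it returns an error string on failure and NULL on success.

```c
#include <stddef.h>

/* Decode one code point starting at s[*idx]. On success return NULL, store
   the code point in *out and advance *idx past the sequence. On failure
   return an error message and leave *idx and *out untouched.
   allow_nonchars != 0 disables check 4 (U+FFFE / U+FFFF), e.g. for comments.
   Hypothetical sketch, not DMD's utf_decodeChar. */
const char *decode_utf8(const unsigned char *s, size_t len, size_t *idx,
                        unsigned long *out, int allow_nonchars)
{
    static const char err[] = "invalid UTF-8 sequence";
    /* Smallest code point that legitimately needs n bytes. */
    static const unsigned long min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t i = *idx, n, k;
    unsigned long c;

    if (i >= len) return "unexpected end of input";
    c = s[i];
    if (c < 0x80)                { n = 1; }
    else if ((c & 0xE0) == 0xC0) { n = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { n = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { n = 4; c &= 0x07; }
    else return err;             /* stray continuation byte or 0xF8..0xFF */

    if (len - i < n) return err; /* sequence truncated by end of input */
    for (k = 1; k < n; k++) {
        if ((s[i + k] & 0xC0) != 0x80) return err;
        c = (c << 6) | (s[i + k] & 0x3F);
    }

    if (c < min_cp[n]) return err;                 /* 1) shortest form only */
    if (c >= 0xD800 && c <= 0xDFFF) return err;    /* 2) no lone surrogates */
    if (c > 0x10FFFF) return err;                  /* 3) c <= 0x10FFFF */
    if (!allow_nonchars && (c == 0xFFFE || c == 0xFFFF))
        return err;                                /* 4) c != FFFE, FFFF */

    *out = c;
    *idx = i + n;
    return NULL;
}
```

With this shape, a lexer could call `decode_utf8(..., 0)` for identifiers and literals and `decode_utf8(..., 1)` while scanning comments, which keeps checks 1 to 3 everywhere and relaxes only the fourth.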
Thomas Kühne wrote:
> 4) c != 0xFFFE && c != 0xFFFF
> That's the only check I reject. Those codepoints can occur if a non-Unicode document is converted to UTF encoded Unicode. Inside of comments they shouldn't stop the parsing.
>
> Those checks above are, except for the 4th, reasonable for comments.

If needed, that can be hacked around for comments, for those two.

> s = utf_decodeChar(octet, ndigits, &idx, &c);
> if (s || idx != ndigits)

can be changed into:

    s = utf_decodeChar(octet, ndigits, &idx, &c);
    if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)

Would that make it more reasonable ? (have the DMD patch ready...)

--anders
Jan 29 2005
I wrote:
> [...]
> Looks like DMD allows that in comments and I don't think it's a good idea.

> Would that make it more reasonable ? (have the DMD patch ready...)

Hilarious, the new patch made phobos fail:

> ../gcc-3.4.3/gcc/d/phobos/std/loader.d:62: invalid UTF-8 sequence

Due to this little comment line, from GDC:

> Modified by David Friedman, October 2004 (applied patches from Anders F Björklund.)

(as the ö here was in Latin-1, you see...)

--anders
Jan 29 2005
Anders F Björklund wrote:
| Thomas Kühne wrote:
|
|> 4) c != 0xFFFE && c != 0xFFFF That's the only check I reject. Those
|> codepoints can occur if a non-Unicode document is converted to UTF
|> encoded Unicode. Inside of comments they shouldn't stop the
|> parsing.
|>
|> Those checks above are, except for the 4th, reasonable for
|> comments.
|
| If needed, that can be hacked around for comments, for those two.
|
|> s = utf_decodeChar(octet, ndigits, &idx, &c);
|> if (s || idx != ndigits)
|
| can be changed into:
|
| s = utf_decodeChar(octet, ndigits, &idx, &c);
| if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
|
| Would that make it more reasonable ? (have the DMD patch ready...)

Have a look at utf_decodeChar: dmd/utf.c:92 and dmd/utf.c:183 ;)

While looking through utf.c I noticed that UTF-32 decoding doesn't undergo any checks.

I'll write a bunch of test cases for all those encoding issues tomorrow.

Thomas
Jan 29 2005
Thomas Kühne wrote:
> | can be changed into:
> |
> | s = utf_decodeChar(octet, ndigits, &idx, &c);
> | if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
> |
> | Would that make it more reasonable ? (have the DMD patch ready...)
>
> Have a look at utf_decodeChar: dmd/utf.c:92 and dmd/utf.c:183 ;)

Yes, the idea is that it will not be valid and return string "invalid UTF-8 sequence", which is then ignored because the char is FFFE/F... (all input is converted to UTF-8 before lexer)

The patch is in the digitalmars.D.bugs group.

--anders
Jan 29 2005
What does "non UTF character" mean?

UTFs are the forms of representing/encoding the full UNICODE table - 21-bit characters (code points).

So "non UTF character" sounds to me like "non UNICODE character". And what is that? Some new alphabet?

Andrew Fedoniouk.
http://terrainformatica.com

"Vathix" <vathix dprogramming.com> wrote in message news:opsldc5xwrkcck4r esi...
> I'm not sure if allowing non UTF characters in comments is such a good idea. It seems to be complicating my parser, and it will probably complicate other things like text/code editors. What is supposed to happen when a non UTF character is encountered? Should it display a question mark, display nothing, use the current code page? What if the editor doesn't know about D's comments? I might not have mentioned this, but since D is supposed to be easily parsed, this might be an issue; a special case.
> - Chris
Jan 29 2005
Andrew Fedoniouk wrote:
> What does "non UTF character" mean?
>
> UTFs are the forms of representing/encoding the full UNICODE table - 21-bit characters (code points).
>
> So "non UTF character" sounds to me like "non UNICODE character". And what is that? Some new alphabet?

Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by malforming UTF-8 or UTF-16 sequences.

-Sebastian
Jan 29 2005
Sebastian Beschke wrote:
> Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by malforming UTF-8 or UTF-16 sequences.

A simple way to do it is to try to interpret a file in Latin-1 as UTF-8. That'll give you "invalid UTF-8 sequence", for everything outside ASCII.

--anders
Jan 29 2005
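Why a Latin-1 file usually fails as UTF-8 can be seen from the byte structure alone. The sketch below (the function name `first_bad_utf8_byte` is invented for this example) only checks that every lead byte is followed by the right number of continuation bytes; it deliberately ignores overlong forms and surrogates, so it is weaker than a real decoder.

```c
#include <stddef.h>

/* Return the offset of the first byte that breaks the UTF-8 lead/continuation
   structure, or n if the whole buffer is structurally fine.
   Structural check only: overlong forms and surrogates are NOT rejected. */
size_t first_bad_utf8_byte(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i];
        size_t need, k;

        if (b < 0x80)                need = 0;  /* ASCII, stands alone */
        else if ((b & 0xE0) == 0xC0) need = 1;  /* 110xxxxx + 1 continuation */
        else if ((b & 0xF0) == 0xE0) need = 2;  /* 1110xxxx + 2 continuations */
        else if ((b & 0xF8) == 0xF0) need = 3;  /* 11110xxx + 3 continuations */
        else return i;                          /* continuation byte as lead */

        for (k = 1; k <= need; k++) {
            /* Each following byte must exist and look like 10xxxxxx. */
            if (i + k >= n || (s[i + k] & 0xC0) != 0x80)
                return i;
        }
        i += need + 1;
    }
    return n;
}
```

For the Latin-1 bytes of "Björklund", the ö is the single byte F6, which looks like the start of a four-byte sequence; the next byte is the ASCII letter r, not a continuation byte, so the function reports offset 2. The "everything outside ASCII" rule holds for ordinary text but not strictly: the Latin-1 byte pair C3 A9 ("Ã©"), for example, happens to be a well-formed UTF-8 sequence.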
I see. Thanks, Sebastian.

But text with erroneous utf sequences (not "non UTF character", sic! ) will not be compiled anyway.

So for "... other things like text/code editors. What is supposed to happen when a non UTF character is encountered?..." a (good) editor should mark them as "bad string literal" or the like.

Andrew Fedoniouk.
http://terrainformatica.com

"Sebastian Beschke" <s.beschke gmx.de> wrote in message news:ctgkta$48a$1 digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> What does "non UTF character" mean?
>>
>> UTFs are the forms of representing/encoding the full UNICODE table - 21-bit characters (code points).
>>
>> So "non UTF character" sounds to me like "non UNICODE character". And what is that? Some new alphabet?
>
> Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by malforming UTF-8 or UTF-16 sequences.
>
> -Sebastian
Jan 29 2005
Andrew Fedoniouk wrote:
> But text with erroneous utf sequences (not "non UTF character", sic! ) will not be compiled anyway.

A bug in the current DMD makes it allow almost everything, in comments.

--anders
Jan 29 2005
> A bug in the current DMD makes it allow almost everything, in comments.

It's a feature rather than a bug. Preparation for attributed programming I guess. With option to include binary data inline :)

I can imagine properties/methods having their own descriptive GIFs given in source text as bytes. Le Cauchemar!

BTW: Are there any ports of png/jpeg/gif libs in D?

Andrew Fedoniouk.
http://terrainformatica.com
Jan 29 2005
Andrew Fedoniouk wrote:
> Preparation for attributed programming I guess. With option to include binary data inline :)

:-)

No, it's a bug. D source code is supposed to be valid UTF-8/16/32.

Ideally, the HTML used should be made to be valid XHTML as well...

--anders
Jan 29 2005
> Ideally, the HTML used should be made to be valid XHTML as well...

Oh, dreams, dreams... this famous XHTML will never be a "lingua franca" as HTML is. The first browser that enforces showing *only valid* or even only well-formed docs will die in the market for many reasons. Even the XHTML standard says that a UA (user agent - browser) should try to show partial content. Strictly speaking, partial content of an XML doc is not well-formed XML and thus invalid. So browsers will show invalid XHTML anyway. So there will not be a strong motivation to use valid XHTML, so XHTML will lose its 'X' and become just HTML v.5,6, etc....

And I have not even mentioned CSS grammar, where whitespace is an operator with different meaning in different places.... Parsing nightmare. I love this game...

What kind of grammar does D use, is it context free, BTW?

Andrew Fedoniouk.
http://terrainformatica.com
Jan 29 2005
Andrew Fedoniouk wrote:
> Oh, dreams, dreams... this famous XHTML will never be a "lingua franca" as HTML is.

XHTML *is* both HTML and XML at once, which is why it is so useful... Just as UTF-8 is both ASCII and Unicode at once, best of both worlds.

> The first browser that enforces showing *only valid* or even only well-formed docs will die in the market for many reasons.

Just because certain browsers display it, is no reason for nonvalid markup. And it's easy to verify too, using http://validator.w3.org/ ?

--anders
Jan 30 2005
On Sun, 30 Jan 2005 01:11:47 +0100, Anders F Björklund wrote:
> Andrew Fedoniouk wrote:
>> Preparation for attributed programming I guess. With option to include binary data inline :)
>
> :-)
>
> No, it's a bug. D source code is supposed to be valid UTF-8/16/32.
>
> Ideally, the HTML used should be made to be valid XHTML as well...
>
> --anders

One could argue that comments are not actual D source code ;-)

Consider: does the compiler need the comments to create the application? If the comments are not for the compiler, then why should it care what's in the comments?

--
Derek
Melbourne, Australia
Jan 29 2005
Derek wrote:
> One could argue that comments are not actual D source code ;-) Consider: does the compiler need the comments to create the application?

The source code is not just for the compiler. (I meant "compiler input")

> If the comments are not for the compiler, then why should it care what's in the comments?

It makes other stuff like parsers easier, if all .d files are valid UTF.

--anders
Jan 30 2005
> So "non UTF character" sounds to me like "non UNICODE character". And what is that? Some new alphabet?

A value in the file that causes std.utf functions to throw an exception because it's invalid. I'm not good at this stuff and I don't know all the proper terminology.
Jan 29 2005
"Vathix" <vathix dprogramming.com> wrote in message news:opsldc5xwrkcck4r esi...
> I'm not sure if allowing non UTF characters in comments is such a good idea. It seems to be complicating my parser, and it will probably complicate other things like text/code editors. What is supposed to happen when a non UTF character is encountered? Should it display a question mark, display nothing, use the current code page? What if the editor doesn't know about D's comments? I might not have mentioned this, but since D is supposed to be easily parsed, this might be an issue; a special case.
> - Chris

Technically it's an error to have non-UTF characters anywhere in the source.
Jan 29 2005
On 2005-01-29 08:57:23 -0600, Vathix <vathix dprogramming.com> said:
> I'm not sure if allowing non UTF characters in comments is such a good idea. It seems to be complicating my parser, and it will probably complicate other things like text/code editors. What is supposed to happen when a non UTF character is encountered? Should it display a question mark, display nothing, use the current code page? What if the editor doesn't know about D's comments? I might not have mentioned this, but since D is supposed to be easily parsed, this might be an issue; a special case.
> - Chris

I don't know what your situation/environment is, but you could just try piping the source files through the GNU iconv utility to convert the files to whatever encoding you want.
Jan 29 2005
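For the specific Latin-1 case that bit the GDC comment earlier in the thread, the conversion Brian has in mind (something along the lines of `iconv -f ISO-8859-1 -t UTF-8`) is simple enough to sketch by hand, because every Latin-1 byte value is also the Unicode code point of the same character. The function name `latin1_to_utf8` is made up for this example; it is not how iconv is implemented and it covers only this one source encoding.

```c
#include <stddef.h>

/* Convert n Latin-1 (ISO-8859-1) bytes to UTF-8. The caller must provide an
   output buffer of at least 2 * n bytes, since each input byte becomes one
   or two output bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[o++] = b;                                  /* ASCII: unchanged */
        } else {
            /* U+0080..U+00FF: two-byte form 110000xx 10xxxxxx */
            out[o++] = (unsigned char)(0xC0 | (b >> 6));
            out[o++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return o;
}
```

Applied to the Latin-1 bytes of "Björklund", the lone F6 becomes C3 B6 and the rest is copied through, which yields the UTF-8 form a strict lexer would accept.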