digitalmars.D - string\utf question
- Lars Ivar Igesund (14/14) Aug 05 2004 I don't really know much about utf strings, but doing some processing of...
- Arcane Jill (8/22) Aug 05 2004 Google "ISO/IEC 9899:1999 (E)" or "ISO-C-FDIS.1999-04.pdf" and then head...
- Arcane Jill (10/13) Aug 05 2004 Hey, Lars, my reply to your post sorts before your post. Isn't that weir...
- Lars Ivar Igesund (9/27) Aug 05 2004 That is indeed curious, my system clock is only 5 minutes quick, but the...
- Lars Ivar Igesund (12/30) Aug 05 2004 Shows up correct here, and so something is strange somewhere (or was
- Lars Ivar Igesund (9/17) Aug 05 2004 I looked at the document you pointed out (together with the docs...),
- Arcane Jill (30/40) Aug 05 2004 Yes, they are.
- Lars Ivar Igesund (20/79) Aug 05 2004 Hmm, that didn't answer anything, you just came to the same conclusion
- J C Calvarese (19/54) Aug 05 2004 I think when he mentioned "Unicode Alpha", he meant "UniversalAlpha".
- Arcane Jill (10/14) Aug 06 2004 I know a little about Unicode, but far less about compilers, and pretty ...
- J C Calvarese (9/23) Aug 06 2004 (Did you happen to scroll down to my example in
- Martin M. Pedersen (33/36) Aug 06 2004 that
- Walter (3/6) Aug 06 2004 I'll fix it.
- J C Calvarese (10/18) Aug 05 2004 The web interface runs on tachyons. ;)
I don't really know much about UTF strings, but since I'm doing some processing of D files (e.g. ddepcheck), I suspect that I should know at least something before I proceed.

lex.html states that identifiers can contain Universal Alphas. What is a universal alpha, and is it possible to check whether a character is a universal alpha using some function currently in Phobos (or will it come with std.utype)?

Also, I use File : Stream's toString to get the content of the file (I don't care whether this is the most efficient way to do it or not, since it makes the processing itself much simpler compared to reading line by line). File's toString returns a char[] no matter what, whereas the std.ctype functions all take dchars as inputs. What's the recommended type to use (char, wchar, dchar)? What's the recommended way to convert the char[] to the best type?

Lars Ivar Igesund
Aug 05 2004
In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...
> I don't really know much about utf strings, but doing some processing of D files (e.g. ddepcheck), I suspect that I should know at least something before I proceed. lex.html states that identifiers can contain Universal Alphas. What is a universal alpha,

Google "ISO/IEC 9899:1999 (E)" or "ISO-C-FDIS.1999-04.pdf" and then head to Annex D on page 438. (Obvious really! :))

> and is it possible to check if a character is a universal alpha using some function currently in Phobos (or will it come with std.utype)?

No and no. But it would be easy for me to add such a function to etc.unicode if you want. It would end up being called isUniversalAlpha(dchar).

> Also, I use File : Stream's toString to get the content of the file (I don't care whether this is the most efficient way to do it or not, since it makes the processing itself much simpler compared to reading line by line). File's toString returns a char [] no matter what, whereas the std.ctype functions all take dchars as inputs. What's the recommended type to use (char, wchar, dchar)? What's the recommended way to convert the char [] to the best type?

That's an application-dependent question, but personally I'd just do std.utf.toUTF32(char[]).

Arcane Jill
Aug 05 2004
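The conversion suggested above can be sketched as follows. This is a minimal example against current Phobos, not the 2004 library the thread was written for; the helper name toCodePoints is made up for illustration. It shows why the conversion helps: after decoding, each array element is one whole code point, so per-character classification functions can be applied element by element.

```d
import std.utf : toUTF32;

/// Decode UTF-8 text into UTF-32 so that each element is one code point.
/// (toCodePoints is a hypothetical helper name, not part of Phobos.)
dstring toCodePoints(const(char)[] utf8Text)
{
    return toUTF32(utf8Text);
}

void main()
{
    import std.stdio : writeln;

    const(char)[] source = "eñe"; // 'ñ' occupies two UTF-8 code units
    dstring decoded = toCodePoints(source);

    writeln(source.length);  // number of UTF-8 code units
    writeln(decoded.length); // number of code points
}
```

Decoding the whole file up front costs four bytes per character, which is usually acceptable for source files and keeps the scanning code simple.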
In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...
> In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...

Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

> No and no. But it would be easy for me to add such a function to etc.unicode if you want. It would end up being called isUniversalAlpha(dchar).

Thinking about this logically, isUniversalAlpha() would be a custom property, not actually part of the Unicode standard, so it probably doesn't really belong in etc.unicode. In fact, it probably belongs in std.compiler. I could invent etc.compiler and put it there, where it could stay until (if) Walter moves it. Would that make more sense?

Jill (trying to stay organized)
Aug 05 2004
Arcane Jill wrote:
> In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...
>> In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...
>
> Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

That is indeed curious. My system clock is only 5 minutes fast, but the post is in my sent folder with the 9:41 time. OK, this post should get the time 16:50 (GMT+1) if everything's correct.

>> No and no. But it would be easy for me to add such a function to etc.unicode if you want. It would end up being called isUniversalAlpha(dchar).
>
> Thinking about this logically, isUniversalAlpha() would be a custom property, not actually part of the Unicode standard, so it probably doesn't really belong in etc.unicode. In fact, it probably belongs in std.compiler. I could invent etc.compiler and put it there, where it could stay until (if) Walter moves it. Would that make more sense?

Well, maybe I can write that one myself and just keep it in my project until it is included in Phobos (unless it is very complicated). Currently I have no need for (other) external libraries. Anyway, thanks for the answers.

Lars Ivar Igesund
Aug 05 2004
Lars Ivar Igesund wrote:
> Arcane Jill wrote:
>> In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...
>>> In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...
>>
>> Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours says 09:41:38 + 1 (which must be wrong). Curious. Anyway...
>
> That is indeed curious, my system clock is only 5 minutes quick, but the post is in my sent folder with the 9:41 time. Ok, this post should get the time 16:50 (GMT+1) if everythings correct.

It shows up correctly here, so something is strange somewhere (or was when I posted that message this morning). Many posts on the newsgroup have strange times (IMO), possibly because time zones are handled differently in different clients (and maybe on the server). Also, your message shows up with the time 9:06 (GMT+1, that is) in my Thunderbird. Maybe discrepancies pop up when some time passes between a message being sent and being accepted at the server, combined with time zone differences. Well, I don't know; the *real* answer is probably that you have a time machine and went back in time to answer my questions.

Lars Ivar Igesund
Aug 05 2004
Arcane Jill wrote:
> Thinking about this logically, isUniversalAlpha() would be a custom property, not actually part of the Unicode standard, so it probably doesn't really belong in etc.unicode. In fact, it probably belongs in std.compiler. I could invent etc.compiler and put it there, where it could stay until (if) Walter moves it. Would that make more sense? Jill (trying to stay organized)

I looked at the document you pointed out (together with the docs...), but the ranges there include Digits, and digits aren't among the Universal Alphas allowed as an IdentifierStart. Sorry for acting stupid here, but are Digits and Special Characters from Annex D part of the Universal Alphas, or are there unmentioned exceptions?

Also, trying to add any of these characters to my identifier names using Vim's hexadecimal mode fails miserably, but that's probably an error on my side.

Lars Ivar Igesund
Aug 05 2004
In article <cetmo6$1qpe$1 digitaldaemon.com>, Lars Ivar Igesund says...
> I looked at the document you pointed out (together with the docs...), but the ranges there include Digits, and digits aren't part of the Universal Alphas allowed as an IdentifierStart.

Well, digits are allowed in identifiers - just not at the start.

> Sorry for acting stupid here, but are Digits and Special Characters from Annex D part of the Universal Alphas,

Yes, they are.

> or are there unmentioned exceptions?

Not so far as I am aware. In http://www.digitalmars.com/d/lex.html, it says:

"Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.) Identifiers can be arbitrarily long, and are case sensitive. Identifiers starting with __ (two underscores) are reserved."

So it looks like "universal alphas" are not actually permitted as the /first/ character of a D identifier, only as the second or subsequent char. The /first/ character is apparently allowed to be a "unicode alpha", not a "universal alpha".

Of course, this begs the question "what is a unicode alpha"? The docs don't define this. Almost certainly, Walter means "a Unicode character which has the Alphabetic property", but I don't know that for sure. I would suggest this needs to be clarified in the documentation.

The definition should also state /which version/ of Unicode the D compiler uses, since new "unicode alphas" will be added with each new version of Unicode. (Otherwise you could end up in the curious state whereby new Unicode letters would be allowed at the start of an identifier but not in the middle or end!) Moreover, you probably wouldn't /want/ the definition of an identifier to change with each new release of Unicode.

Suggestion to Walter: you could redefine an identifier start to be: "an ASCII letter, underscore, or any universal alpha which has the Unicode Alphabetic property" (and, obviously, ensure that this definition is met, which it probably is already). Now you don't need to state a Unicode version number, because you're dealing only with a fixed and stable subset of Unicode.

> Also, trying to add any of these characters to my identifier names using Vim's hexadecimal mode fails miserably, but that's probably an error on my side.

I know nothing about vim.
Aug 05 2004
Arcane Jill wrote:
> In article <cetmo6$1qpe$1 digitaldaemon.com>, Lars Ivar Igesund says...
>> I looked at the document you pointed out (together with the docs...), but the ranges there include Digits, and digits aren't part of the Universal Alphas allowed as an IdentifierStart.
>
> Well, digits are allowed in identifiers - just not at the start.
>
>> Sorry for acting stupid here, but are Digits and Special Characters from Annex D part of the Universal Alphas,
>
> Yes, they are.
>
>> or are there unmentioned exceptions?
>
> Not so far as I am aware. In http://www.digitalmars.com/d/lex.html, it says: "Identifiers start with a letter, _, or unicode alpha, and are followed by any number of letters, _, digits, or universal alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the C99 Standard.) Identifiers can be arbitrarily long, and are case sensitive. Identifiers starting with __ (two underscores) are reserved." So it looks like "universal alphas" are not actually permitted as the /first/ character of a D identifier, only as the second or subsequent char. The /first/ character is apparently allowed to be a "unicode alpha", not a "universal alpha". Of course, this begs the question "what is a unicode alpha"? The docs don't define this. Almost certainly, Walter means "a Unicode character which has the Alphabetic property", but I don't know that for sure. I would suggest this needs to be clarified in the documentation.
>
> The definition should also state /which version/ of Unicode the D compiler uses, since new "unicode alphas" will be added with each new version of Unicode. (Otherwise you could end up in the curious state whereby new Unicode letters would be allowed at the start of an identifier but not in the middle or end!) Moreover, you probably wouldn't /want/ the definition of an identifier to change with each new release of Unicode. Suggestion to Walter: you could redefine an identifier start to be: "an ASCII letter, underscore, or any universal alpha which has the Unicode Alphabetic property" (and, obviously, ensure that this definition is met, which it probably is already). Now you don't need to state a Unicode version number, because you're dealing only with a fixed and stable subset of Unicode.

Hmm, that didn't answer anything, you just came to the same conclusion as me regarding the somewhat lacking documentation :)

Another point: above the text you quoted, the grammar says

    IdentifierStart:
        _
        Letter
        UniversalAlpha

Note that there is no mention of Unicode Alpha, neither there nor elsewhere, except in the excerpt you mentioned.

Walter, an obvious bug in the documentation has been found. A fix would be received with joyous celebrations across the globe (or at least along an axis between me and Jill). And a clarification in this thread :)

>> Also, trying to add any of these characters to my identifier names using Vim's hexadecimal mode fails miserably, but that's probably an error on my side.
>
> I know nothing about vim.

Well, what I did was to add characters from the list as part of identifiers, but dmd didn't accept them. It is possible that the file wasn't saved in the correct format, but that didn't look like the problem. I might look at it again later when I have something to test.

Lars Ivar Igesund
Aug 05 2004
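One way to rule out the "file wasn't saved in the correct format" possibility raised above is to check whether the file's bytes are well-formed UTF-8 before handing them to the compiler. The following is a minimal sketch using current Phobos (std.utf.validate); the helper name isValidUtf8 is an assumption made for illustration, and the byte arrays stand in for file contents that would normally come from disk.

```d
import std.utf : validate, UTFException;

/// Returns true if `bytes` form well-formed UTF-8.
/// (isValidUtf8 is a hypothetical helper name, not part of Phobos.)
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    import std.stdio : writeln;

    // "eñe" saved as UTF-8: 'ñ' is the two bytes C3 B1.
    ubyte[] utf8 = [0x65, 0xC3, 0xB1, 0x65];
    // "eñe" saved as Latin-1: 'ñ' is the single byte F1.
    ubyte[] latin1 = [0x65, 0xF1, 0x65];

    writeln(isValidUtf8(utf8));
    writeln(isValidUtf8(latin1));
}
```

An editor that silently saves in a single-byte encoding produces bytes like the second array, which a UTF-8 lexer cannot decode.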
Lars Ivar Igesund wrote:
> Arcane Jill wrote:

...

>> In article <cetmo6$1qpe$1 digitaldaemon.com>, Lars Ivar Igesund says...
>
> Hmm, that didn't answer anything, you just came to the same conclusion as me regarding the somewhat lacking documentation :) Another point, above the text you quoted, " IdentifierStart: _ Letter UniversalAlpha " Note that there is no mention of Unicode Alpha, neither there or elsewhere except in the excerpt you mentioned.

I think when he mentioned "Unicode Alpha", he meant "UniversalAlpha".

> Walter, obvious bug in documentation has been found. A fix would be received with joyous celebrations across the globe (or at least in an axis between me and Jill). And a clarification in this thread :)

>>> Also, trying to add any of these characters to my identifier names using Vim's hexadecimal mode fails miserably, but that's probably an error on my side.
>>
>> I know nothing about vim.
>
> Well, what I did, was to add characters from the list as part of identifiers, but dmd didn't accept them. It is possible that the file wasn't saved in the correct format, but it didn't look like the problem. I might look at again later when I have something to test. Lars Ivar Igesund

I don't know about your test, but I got D to accept a Spanish letter (ñ) and a Chinese character (義) as an identifier. In case this stuff gets garbled in transmission, I attached a .zip.

    // Non-ASCII letters used as complete identifiers.
    const char[] ñ = "eñe";
    const char[] 義 = "justice";

    import std.stdio;

    void main()
    {
        writefln("Feliz Cumpleaños.");
        writefln(ñ);
        writefln(義);
        /* It doesn't print right, but that's probably DOS's fault. */
    }

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Aug 05 2004
In article <ceu2eq$21vs$1 digitaldaemon.com>, Lars Ivar Igesund says...
> Well, what I did, was to add characters from the list as part of identifiers, but dmd didn't accept them. It is possible that the file wasn't saved in the correct format, but it didn't look like the problem. I might look at again later when I have something to test.

I know a little about Unicode, but far less about compilers, and pretty much nothing at all about the D compiler. I'm afraid this question will have to be answered by someone else.

J C Calvarese said 'I think when he [Walter] mentioned "Unicode Alpha", he meant "UniversalAlpha".' If this is true, it would seem strange (to me). It would imply that non-ASCII digits are allowed as the first character of an identifier. Not that that's necessarily a /bad/ thing - just unexpected.

Jill
Aug 06 2004
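The consequence described above can be made concrete with a small classifier. This is only a sketch of one reading of the grammar (IdentifierStart is _, Letter, or UniversalAlpha), not the compiler's actual lexer. The range table is a tiny illustrative excerpt of the kind of ranges listed in C99 Annex D and should be checked against the standard before use; all function and type names here are assumptions made for illustration.

```d
import std.ascii : isAlpha, isDigit;

/// Inclusive range of code points.
struct CodePointRange
{
    dchar lo;
    dchar hi;
}

// ILLUSTRATIVE EXCERPT ONLY: the real Annex D table is much longer.
// Verify every range against the standard before relying on it.
immutable CodePointRange[] sampleUniversalAlphas = [
    CodePointRange(0x00C0, 0x00D6), // Latin letters
    CodePointRange(0x00D8, 0x00F6), // Latin letters (includes ñ)
    CodePointRange(0x0660, 0x0669), // Arabic-Indic digits ("Digits" group)
    CodePointRange(0x4E00, 0x9FA5), // CJK ideographs (includes 義)
];

/// Hypothetical stand-in for the isUniversalAlpha(dchar) proposed in the thread.
bool isUniversalAlpha(dchar c)
{
    foreach (r; sampleUniversalAlphas)
        if (c >= r.lo && c <= r.hi)
            return true;
    return false;
}

/// One reading of the grammar: _, an ASCII letter, or a universal alpha.
bool isIdentifierStart(dchar c)
{
    return c == '_' || isAlpha(c) || isUniversalAlpha(c);
}

/// Later characters may additionally be ASCII digits.
bool isIdentifierChar(dchar c)
{
    return isIdentifierStart(c) || isDigit(c);
}

void main()
{
    import std.stdio : writeln;

    writeln(isIdentifierStart('7'));      // ASCII digit
    writeln(isIdentifierStart('\u0660')); // non-ASCII digit from the table
}
```

Because the Digits group is part of the table, this reading rejects an ASCII digit at the start of an identifier but accepts a non-ASCII one, which is exactly the unexpected result noted in the post.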
In article <cevaqt$2u57$1 digitaldaemon.com>, Arcane Jill says...
> In article <ceu2eq$21vs$1 digitaldaemon.com>, Lars Ivar Igesund says...
>> Well, what I did, was to add characters from the list as part of identifiers, but dmd didn't accept them. It is possible that the file wasn't saved in the correct format, but it didn't look like the problem. I might look at again later when I have something to test.
>
> I know a little about Unicode, but far less about compilers, and pretty much nothing at all about the D compiler. I'm afraid this question will have to be answered by someone else. J C Calvarese said 'I think when he [Walter] mentioned "Unicode Alpha", he meant "UniversalAlpha".' If this is true, it would seem strange (to me). It would imply that non-ASCII digits are allowed as the first character of an identifier. Not that that's necessarily a /bad/ thing - just unexpected. Jill

(Did you happen to scroll down to my example in http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/8333?)

It's definitely allowed. I was able to use a _Chinese character_ as an identifier. It was the first character. It was the last character. It was the only character.

I can upload the example to my web site if you can't get the example from the post.

jcc7
Aug 06 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cevaqt$2u57$1 digitaldaemon.com...
> If this is true, it would seem strange (to me). It would imply that non-ASCII-digits are allowed as the first character of an identifier. Not that that's necessarily a /bad/ thing - just unexpected.

C allows it, and so must D to be link-compatible. The relevant C grammar is:

    identifier:
        identifier-nondigit
        identifier identifier-nondigit
        identifier digit

    identifier-nondigit:
        nondigit
        universal-character-name
        other implementation-defined characters

    universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad

    hex-quad:
        hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

Constraints

A universal character name shall not specify a character short identifier in the range 00000000 through 00000020, 0000007F through 0000009F, or 0000D800 through 0000DFFF inclusive. A universal character name shall not designate a character in the required character set.

Description

Universal character names may be used in identifiers, character constants, and string literals to designate characters that are not in the required character set.

Semantics

The universal character name \Unnnnnnnn designates the character whose character short identifier (as specified by ISO/IEC 10646) is nnnnnnnn. Similarly, the universal character name \unnnn designates the character whose character short identifier is 0000nnnn.

Regards,
Martin M. Pedersen
Aug 06 2004
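The quoted grammar and range constraint can be turned into a short sketch. It is written in D to match the rest of the thread, although the rules themselves are C's; the function names are made up for illustration. The check covers only the three excluded ranges exactly as quoted in the post above and does not test the "required character set" rule.

```d
import std.conv : to;

/// Parse `\uXXXX` (one hex-quad) or `\UXXXXXXXX` (two hex-quads) into a
/// code point. Hypothetical helper; throws on anything else.
uint parseUniversalCharacterName(string ucn)
{
    if (ucn.length == 6 && ucn[0 .. 2] == `\u`)
        return ucn[2 .. $].to!uint(16);
    if (ucn.length == 10 && ucn[0 .. 2] == `\U`)
        return ucn[2 .. $].to!uint(16);
    throw new Exception("not a universal character name: " ~ ucn);
}

/// Range constraint as quoted in the post above (assumption: the quoted
/// draft wording, not re-checked against the published standard).
bool satisfiesQuotedConstraint(uint codePoint)
{
    if (codePoint <= 0x20)
        return false;
    if (codePoint >= 0x7F && codePoint <= 0x9F)
        return false;
    if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        return false;
    return true;
}

void main()
{
    import std.stdio : writeln;

    uint enye = parseUniversalCharacterName(`\u00F1`);
    writeln(satisfiesQuotedConstraint(enye));   // ñ
    writeln(satisfiesQuotedConstraint(0xD800)); // surrogate range
}
```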
"Lars Ivar Igesund" <larsivar igesund.net> wrote in message news:ceu2eq$21vs$1 digitaldaemon.com...
> Walter, obvious bug in documentation has been found. A fix would be received with joyous celebrations across the globe (or at least in an axis between me and Jill). And a clarification in this thread :)

I'll fix it.
Aug 06 2004
Arcane Jill wrote:
> In article <cesppu$19jq$1 digitaldaemon.com>, Arcane Jill says...
>> In article <ceso29$18p3$1 digitaldaemon.com>, Lars Ivar Igesund says...
>
> Hey, Lars, my reply to your post sorts before your post. Isn't that weird? My post's timestamp says 08:02:22 + 0 (which is, in fact, when I posted it). Yours says 09:41:38 + 1 (which must be wrong). Curious. Anyway...

The web interface runs on tachyons. ;)

Actually, I've seen this happen before. I think it's related to the delay before a message appears on the web when it's posted through the web interface. The order looks normal if you're viewing through Thunderbird. Go figure.

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Aug 05 2004