digitalmars.D - UTF-8 char[] consistency
- Jaap Geurts (24/24) Sep 25 2004 Hi all,
- Ben Hinkle (13/62) Sep 25 2004 which string functions specifically? What do you mean by "fail"?
- Jaap Geurts (7/23) Sep 26 2004 I tried the wchar[] and dchar[] and that works just fine. But because I ...
- David L. Davis (8/17) Sep 26 2004 Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and yo...
- Jaap Geurts (7/11) Sep 27 2004 can
- Jaap Geurts (17/39) Sep 27 2004 David,
- Arcane Jill (29/35) Sep 28 2004 Sorry to leap into the middle of your conversation with David, but that ...
- David L. Davis (21/31) Sep 28 2004 Jaap: Currently for anything unicode based, I've been waiting on work th...
- David L. Davis (6/6) Sep 29 2004 Everyone: Oops!!! Sorry about the repost everyone. I had a bad storm in ...
- Arcane Jill (48/52) Sep 29 2004 Unlike UTF-8, UTF-16 is very cunning - and this is basically because Uni...
- Arcane Jill (6/9) Sep 29 2004 Erratum.
- David L. Davis (6/6) Sep 29 2004 Arcane Jill: Thxs as always for the clear insight! I now have a better
- Ben Hinkle (16/29) Sep 26 2004 to
- Jaap Geurts (12/20) Sep 27 2004 have.
- Arcane Jill (52/71) Sep 26 2004 Cool.
- Benjamin Herr (19/21) Sep 26 2004 So can we not just drop char and char[]s and define some standard string...
- Thomas Kuehne (13/19) Sep 26 2004 I guess you didn't (yet) dive into Unicode?
- Arcane Jill (34/44) Sep 27 2004 True enough. The best definition of "character" I have ever encountered ...
- Benjamin Herr (11/31) Sep 27 2004 I only theoretically dealed with Unicode (so, no). I had not idea I am
- Thomas Kuehne (13/23) Sep 27 2004 UTF-8/16/32 only deal with one codepoint at a time(except for some
- Jaap Geurts (28/47) Sep 27 2004 I see. If that is the way it is. Than I'll use functions operating on
- Arcane Jill (28/41) Sep 27 2004 Most Unicode platforms use UTF-16, including the ICU library. It follows
- Thomas Kuehne (6/15) Sep 27 2004 Guess you missed the extended CJK part. There are names of living person...
- Arcane Jill (30/45) Sep 28 2004 I will freely admit that I don't speak Chinese and don't know the intric...
- Benjamin Herr (6/9) Sep 28 2004 I guess I really do not get it. I thought I was just told that
- Sean Kelly (10/18) Sep 28 2004 I think what Jill was saying is that in most cases, UTF-16 will represen...
- Arcane Jill (21/31) Sep 29 2004 Yes, exactly. And to some extent, the same is also true of UTF-8 if your
- Thomas Kuehne (10/17) Sep 27 2004 Potentially codepoints are 64 bit. The highes currently assigned codepoi...
- Arcane Jill (9/10) Sep 29 2004 First I've heard of it. Do you have a source for this information?
- Arcane Jill (32/40) Sep 28 2004 Head out to www.unicode.org and check out their various FAQs. They do a ...
- J C Calvarese (6/63) Sep 29 2004 Cool. I added this to a wiki page:
- Ben Hinkle (22/37) Sep 27 2004 not my
- Arcane Jill (6/42) Sep 29 2004 I posted this yesterday:
Hi all,

I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set. I have some trouble with the way D handles the char[].length property. If I make a string as follows

    char[] s = "câu này có những chữ cái tiếng việt";

then the length property (s.length) reports the number of bytes, not the number of characters, as I would expect to happen. Returning the number of bytes is what I would expect for a byte[]. Therefore I still need to use a strlen function to determine the correct string length. One of the implications is that most *string* handling functions in the phobos library depend on the length property and thus fail.

There are some solutions to this, without modifying the language:
1. use special functions to do the work.
2. make a string class.
3. convert everything internally to UTF-16, convert it back to UTF-8 before output.

1. The special functions would work but are troublesome because the phobos functions cannot be used (i.e. they have to be rewritten).
2. The string class doesn't work well because the opAssign function cannot be overridden, and thus the following cannot be done: String s = "hello"; I know that it can be done slightly differently (String s = new String("hello");) but I'd like it to be as seamless as possible. However the phobos functions still don't work and have to be included in the class. Wasn't Walter against a String class?? ;)
3. Converting everything is not very efficient, and requires non-transparent extra work.

I'd suggest the following:
1. The char[] needs to be treated by the D compiler as a string array, not as a byte array, or
2. Implement a special String datatype (has been discussed earlier and Walter is against it.)

Also, a lot of phobos functions are missing for wide (wchar) and double-wide (dchar) character operations. E.g. wchar[] ljustify(wchar[], int width); is not available, and many more are not available for the larger char types.

Regards,
Jaap
---
D programming from Vietnam
Sep 25 2004
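To put numbers on the byte-versus-character distinction Jaap describes, here is a minimal D sketch; the byte counts assume the literal is stored in precomposed (NFC) form, and toUTF32 is assumed to behave as in the std.utf of the time:

    import std.utf;

    void main()
    {
        char[] s = "chữ";        // three characters: c, h, ữ
        // .length counts UTF-8 code units (bytes); 'ữ' (U+1EEF) takes 3 bytes
        assert(s.length == 5);
        // converting to UTF-32 yields one dchar per code point,
        // so its length is the character count Jaap is after
        dchar[] d = toUTF32(s);
        assert(d.length == 3);
    }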
Jaap Geurts wrote:
> Hi all, I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set. I have some trouble with the way D handles the char[].length property.

If this isn't in some FAQ it should be.

> If I make a string as follows char[] s = "câu này có những chữ cái tiếng việt"; Then the length property (s.length) reports the number of bytes not the number of characters as I would expect to happen. The length property would return the number of bytes for the byte[]. Therefore I still need to use a strlen function to determine the correct string length. One of the implications is that most *string* handling functions in the phobos library depend on the length property and thus fail.

which string functions specifically? What do you mean by "fail"?

> There are some solutions to this, without modifying the language: 1. use special functions to do the work. 2. make a string class. 3. convert everything internally to UTF-16, convert it back to UTF-8 before output.

4. use dchar[] (or possibly wchar[] if you know the unicode codepoints in your string will fit in a wchar).

> 1. The special functions would work but are troublesome because the phobos functions cannot be used (i.e. they have to be rewritten). 2. The string class doesn't work well because the opAssign function cannot be overridden, and thus the following cannot be done: String s = "hello"; I know that it can be done slightly differently (String s = new String("hello");) but I'd like it to be as seamless as possible. However the phobos functions still don't work and have to be included in the class. Wasn't Walter against a String class?? ;) 3. Converting everything is not very efficient, and requires non-transparent extra work. I'd suggest the following: 1. The char[] needs to be treated by the D compiler as a string array, not as a byte array, or 2. Implement a special String datatype (has been discussed earlier and Walter is against it.)

Have you tried using dchar[] or wchar[] in your app? Someone has made wstring.d which is the wchar equivalent to std.string (maybe it works for dchar, too, I don't remember exactly). And AJ and some others are working on expanding the unicode support - see www.dsource.org.

> Also, a lot of phobos functions are missing for wide and double character operations. E.g. wchar[] ljustify(wchar[], int width); is not available and many more are not available for larger char sets.

I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.

> Regards, Jaap --- D programming from Vietnam
Sep 25 2004
On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:

>> If I make a string as follows char[] s = "câu này có những chữ cái tiếng việt"; Then the length property (s.length) reports the number of bytes not the number of characters as I would expect to happen. The length property would return the number of bytes for the byte[]. Therefore I still need to use a strlen function to determine the correct string length. One of the implications is that most *string* handling functions in the phobos library depend on the length property and thus fail.
>
> which string functions specifically? What do you mean by "fail"?

They report the incorrect length. It reports the byte count, not the actual character count, as I would expect because it's an array of char. If I'm right, for a char[] s; array, requesting its length (s.length) should report a wcslen(s) of some sort. But the current implementation doesn't.

> 4. use dchar[] (or possibly wchar[] if you know the unicode codepoints in your string will fit in a wchar).

I tried the wchar[] and dchar[] and that works just fine. But because I program under linux it would be nice if I could keep all my internal data in a consistent format, which is utf-8 for unix based systems. It seems a little odd to have to convert it to utf-16 each time I need to know the length of a string. Of course the occasional conversion is unavoidable, because sometimes if one wants to insert a utf-8 encoded character into a string, one has to fit a wchar into a char[], I realize that.

> I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.

If someone is reading this and knows where the wstring.d is, can you please point me to it?

Thanks, Jaap --- D programming from Vietnam
Sep 26 2004
In article <opsexriepv2saxk9 krd8833t>, Jaap Geurts says...
> On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
>> I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.
>
> If someone is reading this and knows where the wstring.d is, can you please point me to it? Thanks, Jaap --- D programming from Vietnam

Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and you can find it here: http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.html

Please, let me know if there's any missing std.string.d function(s) that you need, and I'll work on getting them in as soon as possible.

David L.
-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 26 2004
"David L. Davis" <SpottedTiger yahoo.com> wrote in message news:cj7aih$mq5$1 digitaldaemon.com...Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and youcanfind here:http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.htmlPlease, let me know if there's any missing std.string.d function(s) thatyouneed, and I'll work on getting them in as soon as possible.If I find bugs or I need other functions, I'll submit my ideas to you. Thanks, David.
Sep 27 2004
David,

I've examined your wstring library, and noticed that the case (islower, isupper) family of functions cannot do other languages than plain latin ascii. Am I right in this? What is needed, I guess, is for the user to supply a conversion table (are the functions in phobos suitable?). I don't know enough about locale support in OS's, but if it is not available there we'd have to code it into the lib. I'll do some probing about how to code it first, and if you wish I can provide you the one for Vietnamese.

Regards,
Jaap

"David L. Davis" <SpottedTiger yahoo.com> wrote in message news:cj7aih$mq5$1 digitaldaemon.com...
> In article <opsexriepv2saxk9 krd8833t>, Jaap Geurts says...
>> On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
>>> I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.
>>
>> If someone is reading this and knows where the wstring.d is, can you please point me to it? Thanks, Jaap --- D programming from Vietnam
>
> Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and you can find it here: http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.html
>
> Please, let me know if there's any missing std.string.d function(s) that you need, and I'll work on getting them in as soon as possible.
>
> David L.
> -------------------------------------------------------------------
> "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 27 2004
In article <cjal85$1oia$1 digitaldaemon.com>, Jaap Geurts says...
> What is needed I guess is for the user to supply a conversion table (are the functions in phobos suitable?).

Sorry to leap into the middle of your conversation with David, but that is not so. What you need to do is go to www.dsource.org and look for a project called Deimos. Therein, you will find a library called etc.unicode, in source code form. Development of this library has been halted, in favor of ICU, but etc.unicode /does/ do simple casing. (And don't be fooled by the word "simple" - which only means that the function works on characters, not strings, (so it can't uppercase "ß" to "SS") and that it doesn't know that Turkish, Azeri and Lithuanian have non-standard casing rules. It is "simple casing" as opposed to "full casing", that's all). The relevant prototypes are those of getSimpleUppercaseMapping() and getSimpleLowercaseMapping(). You do not need to specify a locale, because, if the locale is anything other than Turkish, Azeri or Lithuanian, the casing will be done correctly.

> I don't know enough about locale support in OS's but if it is not available there we'd have to code it into the lib.

It is a common misconception that casing is locale sensitive. In Unicode, in general, it is not. Okay, so (as mentioned above) Turkish, Azeri and Lithuanian are different, but that is a small enough number that I prefer to think of it as being "locale-independent with three exceptions". I think the misconception arises because the C functions toupper(), tolower() etc. are dependent on something /called/ locale, but which is in fact more closely related to encoding scheme. These ctype functions need to do this because C's chars are only eight bits wide. This logic does not apply to Unicode, and certainly not to the functions in etc.unicode and the forthcoming ICU port.

> I'll do some probing about how to code it first and if you wish I can provide you the one for Vietnamese.

The Unicode standard does not regard Vietnamese as an exception to the standard lookups, so etc.unicode is all you need.

Arcane Jill
Sep 28 2004
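A sketch of how those etc.unicode mappings might be used. The dchar-in, dchar-out signatures are an assumption here, inferred from the description above rather than taken from the actual Deimos source:

    import etc.unicode;   // Deimos library, as described above

    void main()
    {
        // simple casing maps one character to one character, no locale needed
        dchar up = getSimpleUppercaseMapping('á');   // assumed signature
        assert(up == 'Á');
        // "simple" also means no expansion: 'ß' stays 'ß', because its
        // full uppercase "SS" would change the string's length
        assert(getSimpleUppercaseMapping('ß') == 'ß');
    }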
In article <cjal85$1oia$1 digitaldaemon.com>, Jaap Geurts says...
> David, I've examined your wstring library, and noticed that the case (islower, isupper) family of functions cannot do other languages than plain latin ascii. Am I right in this? What is needed, I guess, is for the user to supply a conversion table (are the functions in phobos suitable?). I don't know enough about locale support in OS's, but if it is not available there we'd have to code it into the lib. I'll do some probing about how to code it first, and if you wish I can provide you the one for Vietnamese. Regards, Jaap

Jaap: Currently for anything unicode based, I've been waiting on work that Arcane Jill is doing. StringW.d was mainly created to make it easier to work with 16-bit characters (string.d made it a real pain...you nearly have to cast everything), and hopefully in turn it will work with Windows' 16-bit wide character API functions. But at this point I haven't tested it, plus I don't understand enough to know the real difference between 16-bit characters and unicode characters (some real example data and code would be helpful in this area...Jill?, Ben?, and/or anyone?). Anywayz, needless to say I've mirrored string.d functions like tolower(), toupper() and my very own asciiProperCase() functions to still work on ascii characters only. In my last reply I mainly meant to point you to where stringw.d could be found, in case you found it useful, and to let you know that if you needed anything that string.d had that's missing in it...I would add it. I hope I didn't give the impression that it did unicode? Also, I'm afraid I don't know much about "locale support" either. But if you do something in that area I wouldn't mind taking a look at it. :))

Good Luck in your project,
David L.
-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 28 2004
Everyone: Oops!!! Sorry about the repost everyone. I had a bad storm in my area last night and my connection to the internet wasn't working right, so I didn't think my message had gotten posted. Again sorry. David L. ------------------------------------------------------------------- "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 29 2004
In article <cje51f$q8t$1 digitaldaemon.com>, David L. Davis says...
> plus I don't understand enough to know the real difference between 16-bit characters and unicode characters (some real example data and code would be helpful in this area...Jill?, Ben?, and/or anyone?).

Unlike UTF-8, UTF-16 is very cunning - and this is basically because Unicode and UTF-16 were designed together, to work with each other. Here's how it works - there are two different perspectives: the 16-bit perspective, and the 21-bit perspective.

In the 21-bit perspective, characters run from U+0000 to U+10FFFF - /but/ the range U+D800 to U+DFFF is illegal and invalid. There are /no/ Unicode characters in this range. Any application built to view the Unicode world from this point of view should be prepared to correctly handle and display all valid characters (which excludes U+D800 to U+DFFF).

In the 16-bit perspective, characters run from U+0000 to U+FFFF - and, in this world, the range U+D800 to U+DFFF is just hunky dory. In this perspective, they are called "surrogate characters". They always occur in pairs, with a high surrogate (a character in the range U+D800 to U+DBFF) always immediately followed by a low surrogate (a character in the range U+DC00 to U+DFFF). There are plenty of applications built to view the Unicode world from this point of view (in particular, legacy applications written before Unicode 3.0, when all Unicode characters actually /were/ 16 bits wide).

Let's take an example: the Unicode character U+1D11E (musical symbol G clef). When viewed by an application which sees 21-bit wide characters, what you see is U+1D11E, which you interpret as a single character, and display as ... well ... as musical symbol G clef. A legacy 16-bit-Unicode application looking at the same text file (assuming it to have been saved in UTF-16) will see two "characters": U+D874 followed by U+DD1E. (These are the UTF-16 fragments which together represent U+1D11E). Such an application may safely interpret these wchars as "unknown character" followed by "unknown character", and nothing will break. A slightly more sophisticated application might even interpret them as "high surrogate" followed by "low surrogate", and still nothing would break. These pseudo-characters would likely both display as "unknown character" glyphs, but some fonts may give high surrogates a different glyph from low surrogates. (And, indeed, the Mac's "last chance" fallback font will actually display each pseudo-character as a tiny little hex representation of its codepoint!)

Of course, all of this will fail completely if UTF-8 is used instead of UTF-16. In UTF-8, the representation of U+1D11E is: F0 9D 84 9E. Every UTF-8-aware application will decode this as 0x1D11E, and an application which is unaware of characters beyond U+FFFF would fall over badly here. (It might even truncate it to U+D11E: Hangul syllable TYAELM). But of course, you can still transcode into UTF-16 and deal with it that way - which is another reason why UTF-16 is very good for the internal workings of an application.

Arcane Jill

PS. It is worth noting that the vast majority of fonts available today which are either free or come bundled with an OS do not render characters beyond U+FFFF at all. In fact, I have yet to find /even one/ free font which contains U+1D11E (musical symbol G clef). [I would be very happy to be shown to be wrong on this point - anyone know of one?]
This means that if you stick such characters in a web page, nobody will be able to see them - so you'll have to use a gif after all. :( Unicode may be the future, but sadly it is not the present.
Sep 29 2004
In article <cje7o0$rj6$1 digitaldaemon.com>, Arcane Jill says...
> A legacy 16-bit-Unicode application looking at the same text file (assuming it to have been saved in UTF-16) will see two "characters": U+D874 followed by U+DD1E. (These are the UTF-16 fragments which together represent U+1D11E).

Erratum. Whoops! UTF-16 for 1D11E is actually D834 followed by DD1E. (That'll teach me not to try UTF-16 transcoding by hand in future!) The logic of the post still holds, however.

Jill
Sep 29 2004
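The arithmetic behind that correction is mechanical enough to check in a few lines of D; this is a sketch of the standard surrogate-pair computation, with nothing taken from the posts except the values:

    void main()
    {
        dchar c = 0x1D11E;                   // musical symbol G clef
        uint  v = c - 0x10000;               // 20-bit offset: 0x0D11E
        wchar hi = cast(wchar)(0xD800 + (v >> 10));    // high surrogate
        wchar lo = cast(wchar)(0xDC00 + (v & 0x3FF));  // low surrogate
        // matches the erratum: D834 DD1E, not D874 DD1E
        assert(hi == 0xD834 && lo == 0xDD1E);
    }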
Arcane Jill: Thxs as always for the clear insight! I now have a better understanding of how 16-bit characters (aka UTF-16 / wchar[]) and Unicode (v3.0 / v4.0) match against one another. :)) I hope your ICU conversion work is coming along fine. ------------------------------------------------------------------- "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 29 2004
"Jaap Geurts" <jaapsen hotmail.com> wrote in message news:opsexriepv2saxk9 krd8833t...On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:toIf I make a string as follows char[] s = "câu này có nh?ng ch? cái ti?ng vi?t"; Then the length property (s.length) reports the number of bytes not the number of characters as I would expect to happen. The length property would return the number of bytes for the byte[]. Therefore I still needtheuse a strlen function to determine the correct string length. One ofactual character count, as I would expect because it's an array of char. If I'm right for a char[] s; array and then requesting its length s.length; should report a wcslen(s) of some sort. But the curren't implementation doesn't. That is by design. Out of curiosity, what are you doing with your strings that require the number of characters? Usually one just deals with string fragments and it doesn't matter how long it is (either in characters or in bytes). In a perfect world your expectation of having a one-to-one mapping between array indexing and character indexing would clearly be nice to have. But the current design is (in Walter's opinion - and I agree with him) the best we can do given the imperfect world we find ourselves in and given D's design goals.The report the incorrect length. It reports the byte count not the theimplications is that most *string* handling functions in the phobos library depend on the length property and thus fail.which string functions specifically? What do you mean by "fail"?
Sep 26 2004
"Ben Hinkle" <bhinkle mathworks.com> wrote in message news:cj7eb6$ole$1 digitaldaemon.com...That is by design. Out of curiosity, what are you doing with your strings that require the number of characters? Usually one just deals with string fragments and it doesn't matter how long it is (either in characters or in bytes). In a perfect world your expectation of having a one-to-one mapping between array indexing and character indexing would clearly be nice tohave.But the current design is (in Walter's opinion - and I agree with him) the best we can do given the imperfect world we find ourselves in and givenD'sdesign goals.If this is by design than fine. Who am I to change it. It is just because I need to insert characters into existing strings. I see. Moreover if char[] does behave the way it currently does it will be fast, but it probably won't if it had to interpret the array as UTF-8. But then I see little difference between byte[] and char[]. They are basically the same and can be interpreted ambiguously. Something that Walter wanted to prevent if I remember correctly. Jaap
Sep 27 2004
In article <opsevonsdl2saxk9 krd8833t>, Jaap Geurts says...
> Hi all,

Hi.

> I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set.

Cool.

> I have some trouble with the way D handles the char[].length property.

length does what it does. What you need is a character count, which is something different.

> Therefore I still need to use a strlen function to determine the correct string length.

Okay, here's one:

And some overloads to complete the set:

> One of the implications is that most *string* handling functions in the phobos library depend on the length property and thus fail.

Phobos is not really geared up for Unicode yet. The string handling functions are defined to work only for ASCII. What you need is Unicode string handling. D doesn't have that yet. There is a third party Unicode library called ICU (International Components for Unicode) which I'm trying to port to D, but it's slow work, partly because I've got too much else on at the moment.

> There are some solutions to this, without modifying the language: 1. use special functions to do the work. 2. make a string class. 3. convert everything internally to UTF-16, convert it back to UTF-8 before output.

Option 3 won't work in general. In general, you'll need to convert everything internally to UTF-32, not UTF-16. Of course, if it's just for Vietnamese, UTF-16 will be fine.

> 1. The special functions would work but are troublesome because the phobos functions cannot be used (i.e. they have to be rewritten).

True.

> 2. The string class doesn't work well because the opAssign function cannot be overridden, and thus the following cannot be done: String s = "hello"; I know that it can be done slightly differently (String s = new String("hello");) but I'd like it to be as seamless as possible. However the phobos functions still don't work and have to be included in the class. Wasn't Walter against a String class?? ;)

I've had exactly the same problem with a completely different class. I would very much like to see implicit constructors in D, so we could do: String s = "hello"; But this sort of thing is down to Walter, and he doesn't consider it a priority.

> 3. Converting everything is not very efficient, and requires non-transparent extra work. I'd suggest the following: 1. The char[] needs to be treated by the D compiler as a string array, not as a byte array,

That's just not possible. A char is a UTF-8 fragment, not a Unicode character. They're just not the same.

> or 2. Implement a special String datatype (has been discussed earlier and Walter is against it.)

This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.

> Also, a lot of phobos functions are missing for wide and double character operations. E.g. wchar[] ljustify(wchar[], int width); is not available and many more are not available for larger char sets.

Again, ICU will fill in these gaps. I wish I could bring you better news, but at least these things are on their way and will get here eventually.

Arcane Jill
Sep 26 2004
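The code that followed "Okay, here's one:" was stripped by the archive. A minimal sketch of what such a character-counting function could look like in D, assuming the input is already valid UTF-8:

    // counts code points in a UTF-8 string by skipping continuation bytes,
    // which always have the bit pattern 10xxxxxx
    uint charCount(char[] s)
    {
        uint n = 0;
        for (size_t i = 0; i < s.length; i++)
        {
            if ((s[i] & 0xC0) != 0x80)
                n++;
        }
        return n;
    }

    unittest
    {
        // 5 bytes, 3 characters (assuming the precomposed form of 'ữ')
        assert(charCount("chữ") == 3);
    }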
Arcane Jill wrote:
> This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.

So can we not just drop char and char[]s and define some standard string class to be used for unicode strings (preferably one returning dchars when prompted for individual characters)?

I mean, strings via easy-to-use arrays were one of those nifty ideas that attract me to D. No freaky libraries to remember, just intuitive things that work the same for all kinds of arrays. But having strings implemented as character arrays is cool only as long as I can actually use that char[]-string like an array and get characters out of it by using the [] operator. Beyond that, it is just an annoyingly inconsistent analogy. Also it appears confusing to me that some string operations are supposed to be done with array operations, while others are defined in std.string. Now it seems far easier to have a string class that wraps all this.

I apologise if my uneducated ranting is far below the average level of insight that is to be available here, and I apologise for the slight offtopicness, and I apologise for bringing this up long after the case to ditch char.

-ben
Sep 26 2004
Benjamin Herr <ben 0x539.de> wrote:
> So can we not just drop char and char[]s and define some standard string class to be used for unicode strings (preferably one returning dchars when prompted for individual characters)?

I guess you didn't (yet) dive into Unicode? A "character" is something quite complicated.

1) it can consist of one codepoint like 0x41 "A"
2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x2CA "Á"
3) especially in Hangul/Korean a "character" might be a sequence of 1 up to 4 codepoints.
4) upper/lowercase conversion is dependent on the language used: Up1 -> Down1, Down2

The above points out only some of the basics you'd have to implement in your string class.

Thomas
Sep 26 2004
In article <cj8kb0$22n5$1 digitaldaemon.com>, Thomas Kuehne says...
> A "character" is something quite complicated.

True enough. The best definition of "character" I have ever encountered is this: A "character" is anything the Unicode Consortium say is a character! More official definitions such as "the smallest unit of information having semantic meaning" just don't hold up under close examination, as it's too easy to find counterexamples. The problem arises because Unicode started its life as the union of many existing legacy "character sets", each of which had their own different idea of what a "character" was.

> 1) it can consist of one codepoint like 0x41 "A"
> 2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x2CA "Á"
> 3) especially in Hangul/Korean a "character" might be a sequence of 1 up to 4 codepoints.

Actually, you're talking about graphemes and/or glyphs, not characters. There is, in fact, a precise one-to-one correspondence between codepoints and characters. A grapheme, on the other hand, may consist of one or more characters combined together (for example 'A' + combining-acute-accent = 'Á', as per your example); a glyph may consist of one or more graphemes ligated together (for example 'a' + zero-width-joiner + 'e' = 'æ'). And just to be even more pedantic, your statement "two different codepoint sequences can be /equal/" should really read "two different codepoint sequences can be /canonically equivalent/". Equal means equal.

> 4) upper/lowercase conversion is dependent on the language used: Up1 -> Down1, Down2

..though currently only Turkish, Lithuanian and Azeri are non-standard. As far as casing is concerned, locale is /almost/ ignorable. The functions getSimpleUppercaseMapping() and getSimpleLowercaseMapping() in etc.unicode will work fine for all languages apart from these few non-standard exceptions listed above. A bigger problem with casing is that (for example) uppercase "ß" is "SS" - that is, strings can get longer when you case-convert them. Even etc.unicode doesn't deal with that (because it got aborted in favor of ICU before full casing was implemented). You're probably thinking of collation (sort order), which varies /greatly/ from language to language.

> The above points out only some of the basics you'd have to implement in your string class.

I think the original poster was only talking about character counting, and the related problem of locating character boundaries in a UTF array. That's relatively easy, and can be hand-coded without too much trouble. The more complex stuff like casing, collation, equivalence, grapheme boundary identification, etc., is probably best left to an external library.

Arcane Jill
Sep 27 2004
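To make the "canonically equivalent, not equal" point concrete, here is a small D sketch. Note that it uses U+0301 (combining acute accent), the actual canonical decomposition of 'Á', rather than the 0x2CA cited above:

    void main()
    {
        dchar[] precomposed, decomposed;
        precomposed ~= 0x00C1;           // "Á" as a single code point
        decomposed  ~= 0x0041;           // "A"
        decomposed  ~= 0x0301;           // combining acute accent
        // array comparison works on code points, so these are not equal,
        // even though they are canonically equivalent as text
        assert(precomposed != decomposed);
    }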
Thomas Kuehne wrote:
> I guess you didn't (yet) dive into Unicode? A "character" is something quite complicated.

I only theoretically dealt with Unicode (so, no). I had no idea I am so far off, though.

> 1) it can consist of one codepoint like 0x41 "A"

Sounds easy so far.

> 2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x2CA "Á"

Is it not invalid, at least with utf8, to use anything but the least `large' representation?

> [...] The above points out only some of the basics you'd have to implement in your string class.

I either have to implement it in my string class, or I have to do it `by hand' every time I need any of this functionality. Which is why I suggested a class (or even just a struct, to keep the semantics closer to the standard arrays), after all.

-ben
Sep 27 2004
Benjamin Herr wrote:
> Is it not invalid, at least with utf8, to use anything but the least `large' representation?

UTF-8/16/32 only deal with one codepoint at a time (except for some checking). The codepoint sequences above would be U+0000C1 "Á" and U+000041 U+0002CA "Á". These are different Normalization Forms. (http://www.unicode.org/reports/tr15/)

> I either have to implement it in my string class, or I have to do it `by hand' every time I need any of this functionality.

If you ensure that the input only contains Latin (French/German...) / Greek / Cyrillic fully in NFC/NFKC, you can assume for most cases that 1 dchar == 1 character. If you really need full string handling, I suppose you could assist Arcane Jill with porting the ICU.

Thomas
Sep 27 2004
>> I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set.
>
> Cool.

I see. If that is the way it is, then I'll use functions operating on strings.

>> I have some trouble with the way D handles the char[].length property.
>
> length does what it does. What you need is a character count, which is something different.

>> Therefore I still need to use a strlen function to determine the correct string length.
>
> Okay, here's one:

Thanks for the code examples.

> Phobos is not really geared up for Unicode yet. The string handling functions are defined to work only for ASCII.

I noticed. I'll use David's (Spotted Tiger) stringw.d and complement it if necessary.

>> 3. convert everything internally to UTF-16, convert it back to UTF-8 before output.
>
> Option 3 won't work in general. In general, you'll need to convert everything internally to UTF-32, not UTF-16. Of course, if it's just for Vietnamese, UTF-16 will be fine.

Strangely enough, the Windows32 API uses UTF-16 as its encoding.

> That's just not possible. A char is a UTF-8 fragment, not a Unicode character. They're just not the same.

I understand the issues, and UTF-8 in particular was actually designed with backwards compatibility in mind. (For C uses the zero char as the terminator. Had the world programmed in Pascal, then we probably wouldn't have UTF-8.)

>> or 2. Implement a special String datatype (has been discussed earlier and Walter is against it.)
>
> This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.

But (if my memory serves me well) this is exactly what Walter wanted to prevent: a multitude of String classes all doing the same but having slightly different interfaces. Should this be part of the language or the Phobos library, don't you think? UTF-8 will always require a class of some sort... I'm not trying to put oil on the fire, but isn't this an important aspect for version 1.0?

Jaap
--
D Programming from Vietnam.
Sep 27 2004
In article <cj90u4$b46$1 digitaldaemon.com>, Jaap Geurts says...
> Strangely enough, the Windows32 API uses UTF-16 as its encoding.

Most Unicode platforms use UTF-16, including the ICU library. It follows logically, therefore, that on these platforms - /including/ the Windows API - you cannot use array indexing to find the nth character.

But there is method in this madness. The Unicode characters from U+10000 upwards are not characters from living languages. By and large, it is generally considered "harmless" to regard such characters as if they were two characters. For example, consider the character U+1D11E (musical symbol G clef) - does it really /matter/ if your application perceives instead U+D874 followed by U+D11C? It won't affect casing, sorting or anything like that because the character isn't part of any living language script. From the point of view of most general purpose algorithms, it's just another shape to draw, like a WingDings symbol. So UTF-16 is simply the best space/speed compromise for the majority of real-life languages.

> I understand the issues, and UTF-8 in particular was actually designed with backwards compatibility in mind. (For C uses the zero char as the terminator. Had the world programmed in Pascal, then we probably wouldn't have UTF-8.)

The compatibility is with ASCII, not with C. There is no Unicode meaning of U+0000, apart from "some sort of application-dependent control character".

>> This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.
>
> But (if my memory serves me well) this is exactly what Walter wanted to prevent: a multitude of String classes all doing the same but having slightly different interfaces. Should this be part of the language or the Phobos library, don't you think?

Well, ICU is not really anything to do with D. It was originally a Java API, then got ported to C and C++. We'll have it in D, too, eventually. It's not my fault if ICU defines a string class. But I don't think Walter will be complaining - the ICU class isn't a simple "replacement" or "alternative" to char[] - it provides full Unicode functionality, in a way that char[] doesn't. I don't think we'll be seeing "a multitude of String classes" either. To be honest, I don't think even ICU's UnicodeString class will ever become any kind of D "standard", because you won't be able to do implicit casting to/from it.

> UTF-8 will always require a class of some sort...

Well, I'm more inclined to the view that truly internationalized software just won't use UTF-8 at all. UTF-16 is much more manageable for this sort of thing. UTF-8 can do the job, but it's mainly intended for text which is "mostly ASCII".

Arcane Jill
Sep 27 2004
Arcane Jill wrote:
> But there is method in this madness. The Unicode characters from U+10000 upwards are not characters from living languages. By and large, it is generally considered "harmless" to regard such characters as if they were two characters. For example, consider the character U+1D11E (musical symbol G clef) - does it really /matter/ if your application perceives instead U+D874 followed by U+D11C? It won't affect casing, sorting or anything like that because the character isn't part of any living language script.

Guess you missed the extended CJK part. There are names of living persons that can only be encoded using post-U+FFFF stuff. As a consequence it does affect the sorting and "character"/"glyph"/"grapheme"/"codepoint"/"what-so-ever" count algorithms.

Thomas
Sep 27 2004
In article <cj97fh$t7f$1 digitaldaemon.com>, Thomas Kuehne says...
> Guess you missed the extended CJK part. There are names of living persons that can only be encoded using post-U+FFFF stuff. As a consequence it does affect the sorting and "character"/"glyph"/"grapheme"/"codepoint"/"what-so-ever" count algorithms.

I will freely admit that I don't speak Chinese and don't know the intricacies of CJK. But that isn't really what I was trying to get at. Yes, obviously, if an app wants to be general, it must use proper character access via library functions. All I really meant was that if you pretend UTF-16 fragments are characters then your application will /usually/ behave sensibly. That's all. Me, I'm all in favor of proper character iteration. It's just that a lot of apps are going to want a quick-and-dirty shortcut that works more often than not, and UTF-16 is exactly that.

So, there are characters in the >U+FFFF range which are used in proper names? I didn't know that. But how badly does that change things? Does it affect casing? I suppose the answer to that depends on whether or not CJK characters /have/ case. Do they? Does it affect sorting? Not in general, since collation is a function of the /user's preferences/, not the script (that is, if an English user sorts Czechoslovakian text, they will expect to see it in "English order", not "Czechoslovakian order"), so only applications which are (a) fully internationalized, or (b) written for CJK users specifically, will need to care. For the rest of the world, two "unknown character" glyphs is not that much worse than one.

So I'd summarize as:
*) If you want to write a fully internationalized app, you need to be using a proper Unicode library, but
*) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16 bits wide.

In other words, yes, you're right. But we can usually cheat. Anyway, this sort of conversation goes on all the time on the Unicode public forum. If you want to talk about this in depth, I suggest we move the discussion there.

Arcane Jill
Sep 28 2004
Arcane Jill wrote:
> *) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16 bits wide.

I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

-ben
Sep 28 2004
In article <cjc38q$2jna$2 digitaldaemon.com>, Benjamin Herr says...
> I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

I think what Jill was saying is that in most cases, UTF-16 will represent any character you care about with a single wchar (ie. in 16 bits). So if you code an application to use wchars you can generally pretend as if there is a 1 to 1 correspondence between wchars and characters. It's *possible* that some users (Chinese perhaps) could break your application, but if this isn't your target market then it may not be a concern. I think the point is that if you're worried that dchars will use up too much memory, you can usually get away with pretending UTF-16 is not a multi-char encoding scheme.

Sean
Sep 28 2004
In article <cjc7pb$2n3d$1 digitaldaemon.com>, Sean Kelly says...
> I think what Jill was saying is that in most cases, UTF-16 will represent any character you care about with a single wchar (ie. in 16 bits). So if you code an application to use wchars you can generally pretend as if there is a 1 to 1 correspondence between wchars and characters. It's *possible* that some users (Chinese perhaps) could break your application, but if this isn't your target market then it may not be a concern. I think the point is that if you're worried that dchars will use up too much memory, you can usually get away with pretending UTF-16 is not a multi-char encoding scheme.

Yes, exactly. And to some extent, the same is also true of UTF-8 if your application only cares about ASCII. /Many/ algorithms will work just fine if you pretend that /UTF-8/ is a character set, and that a char[] is an actual string of 8-bit-wide "characters". For example: concatenation (strcat, ~); finding a character or a substring (strchr, strstr, find); splitting on boundaries determined by strchr/strstr/find; tokenizing using ASCII separators such as space or tab; identification of C/C++/D comments; parsing XML; ... the list is endless. So long as you don't try to interpret or manipulate the characters you don't "understand", these encodings are robust enough to withstand most other manipulations.

The major reason for preferring UTF-16 over UTF-8, however, is that UTF-16 is likely to contain over 99% of all characters in which you are likely to be interested. The same cannot be said of UTF-8, whose single-code-unit range contains only ASCII characters. The major reason for preferring UTF-16 over UTF-32 is that you get a lot of wasted space with UTF-32. As noted above, >99% of your characters will only need two bytes, so that's two bytes of zeroes for each such character. Even the >U+FFFF characters are still guaranteed to have /over one third/ of their bits unused. UTF-32 text files (and strings), therefore, /will/ have between a third and a half (and maybe even more if the text is mostly ASCII) of all of their bits wasted.

So it's just a space/speed compromise, that's all. But a pretty good one in most cases.

Jill
Sep 29 2004
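A sketch of why those byte-level operations are safe: in UTF-8, every byte of a multi-byte sequence is >= 0x80, so a byte that compares equal to an ASCII delimiter really is that delimiter. The helper below is hypothetical, not from the thread:

    // splits a UTF-8 string on ASCII spaces without decoding it;
    // a multi-byte sequence can never contain a byte equal to ' '
    char[][] splitOnSpace(char[] s)
    {
        char[][] parts;
        size_t start = 0;
        for (size_t i = 0; i < s.length; i++)
        {
            if (s[i] == ' ')
            {
                parts ~= s[start .. i];
                start = i + 1;
            }
        }
        parts ~= s[start .. s.length];
        return parts;
    }

    unittest
    {
        assert(splitOnSpace("chữ cái").length == 2);
    }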
Benjamin Herr wrote:
>> *) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16 bits wide.
>
> I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

Potentially, codepoints are 64 bit. The highest currently assigned codepoint fits in 32 bit. For the majority of living languages the codepoints fit in 16 bit. The bit-size of a codepoint has nothing to do with multi-codepoint "chars". Again: if you ensure that neither Korean/Hebrew/Arabic, (zero-width) joiners nor combining accents are used, you might treat a 16-bit char as a "character" in most cases. Exceptions: sorting, display and advanced text analysis.

Thomas
Sep 27 2004
In article <cjc8ag$2nb2$1 digitaldaemon.com>, Thomas Kuehne says...
> Potentially, codepoints are 64 bit.

First I've heard of it. Do you have a source for this information? So far as I am aware, the UC are /adamant/ that they will never go beyond 21 bits. Programming languages tend to use 32 bits because (a) 32 bits is a more natural length for computers, and (b) they're not taking chances - once upon a time the UC thought that 16 bits would be sufficient. But I have never heard /anyone/ claim that codepoints are potentially 64 bits before. Whence does this originate?

Arcane Jill
Sep 29 2004
In article <cjc38q$2jna$2 digitaldaemon.com>, Benjamin Herr says...
> I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

Head out to www.unicode.org and check out their various FAQs. They do a much better job at explaining things than I. For what it's worth, here's my potted summary:

"code unit" = the technical name for a single primitive fragment of either UTF-8, UTF-16 or UTF-32 (that is, the value held in a single char, wchar or dchar). I tend to use the phrases UTF-8 fragment, UTF-16 fragment and UTF-32 fragment to express this concept.

"code point" = the technical name for the numerical value associated with a character. In Unicode, valid codepoints go from 0 to 0x10FFFF inclusive. In D, a codepoint can only be stored in a dchar.

"character" = officially, the smallest unit of textual information with semantic meaning. Practically speaking, this means either (a) a control code; (b) something printable; or (c) a combiner, such as an accent you can place over another character. Every character has a unique codepoint. Conversely, every codepoint in the range 0 to 0x10FFFF corresponds to a unique Unicode character. Characters are conventionally written as "U+" followed by the codepoint in hexadecimal (for example U+20AC, the euro sign, which is the character corresponding to codepoint 0x20AC).

As an observation, over 99% of all the characters you are likely to use, and which are involved in text processing, will occur in the range U+0000 to U+FFFF. Therefore an array of sixteen-bit values interpreted as characters will likely be sufficient for most purposes. (A UTF-16 string may be interpreted in this way). If you want that extra 1%, as some apps will, you'll need to go the whole hog and recognise characters all the way up to U+10FFFF.

"grapheme" = a printable base character which may have been modified by zero or more combining characters (for example 'a' followed by combining-acute-accent).

"glyph" = one or more graphemes glued together to form a single printable symbol. The Unicode character zero-width-joiner usually acts as the glue.

For more detailed information, as I suggested above, please feel free to go to the Unicode website, and get all the details from the people who organize the whole thing.

Arcane Jill
Sep 28 2004
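The code unit / code point distinction maps directly onto D's three character types; a small sketch using the euro sign mentioned above (the counts are facts of the encodings, not taken from the post):

    void main()
    {
        char[]  u8  = "\u20AC";   // the euro sign as UTF-8
        wchar[] u16 = "\u20AC";   // the same character as UTF-16
        dchar[] u32 = "\u20AC";   // and as UTF-32
        assert(u8.length  == 3);  // three UTF-8 code units: E2 82 AC
        assert(u16.length == 1);  // one UTF-16 code unit: 20AC
        assert(u32.length == 1);  // one code point
    }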
Arcane Jill wrote:
> Head out to www.unicode.org and check out their various FAQs. They do a much better job at explaining things than I. For what it's worth, here's my potted summary:
> [snip]

Cool. I added this to a wiki page: http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

--
Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Sep 29 2004
[snip]
>> But (if my memory serves me well) this is exactly what Walter wanted to prevent: a multitude of String classes all doing the same but having slightly different interfaces. Should this be part of the language or the Phobos library, don't you think?
>
> Well, ICU is not really anything to do with D. It was originally a Java API, then got ported to C and C++. We'll have it in D, too, eventually. It's not my fault if ICU defines a string class. But I don't think Walter will be complaining - the ICU class isn't a simple "replacement" or "alternative" to char[] - it provides full Unicode functionality, in a way that char[] doesn't. I don't think we'll be seeing "a multitude of String classes" either. To be honest, I don't think even ICU's UnicodeString class will ever become any kind of D "standard", because you won't be able to do implicit casting to/from it.

Is there a link to the String class API? I'm curious to see what the differences are from a function-based API. Is the basic difference that the String's encoding is determined at runtime? Maybe a struct would be better than a class:

    struct ICUString {
        enum Encoding { UTF8, UTF16, UTF32, ... };
        uint len;
        void* data;
        Encoding encoding;
        ... member functions like opIndex, etc ...
    }

... and functions like std.string with ICUString instead of char[] or wchar[] or dchar[]...

[snip]
Sep 27 2004
In article <cjd924$81h$1 digitaldaemon.com>, David L. Davis says...
> Jaap: Currently for anything unicode based, I've been waiting on work that Arcane Jill is doing.
> [snip]

I posted this yesterday: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/11206 - I hope it's helpful.

> -------------------------------------------------------------------
> "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"

Nice quote!

Jill
Sep 29 2004