digitalmars.D - The Unicode Casing Algorithms
- Arcane Jill (59/74) Jun 04 2004 This is true, but it's not relevant. This was relevant back in the days ...
- Arcane Jill (5/7) Jun 04 2004 should read:
- Kris (9/16) Jun 04 2004 If it turns out that Jill is Irish, this spells "imminent joviality" to ...
- Hauke Duden (4/121) Jun 04 2004 Just wanted to note that I have a "real" Unicode casing module in the
- Arcane Jill (22/31) Jun 04 2004 Wow! I'm so impressed. How's it done? Have you defined a String class?
- Ben Hinkle (12/32) Jun 04 2004 Instead of making a String class another approach would be to write
- Arcane Jill (6/12) Jun 04 2004 Yup, there are all sorts of possible approaches. I could think of a few ...
- Hauke Duden (25/57) Jun 04 2004 I'm afraid I don't deserve your praise ;).
- Walter (10/14) Jun 04 2004 How about just calling them isdigit(dchar c), etc.? Perhaps call the mod...
- Arcane Jill (5/9) Jun 04 2004 Hey, Hauke. You've just been offered a place in the vaulted "std" heirar...
- Hauke Duden (5/13) Jun 04 2004 Thanks for cheering me on AJ ;).
- Hauke Duden (17/30) Jun 04 2004 I had three reasons for choosing these function names:
- Walter (11/26) Jun 04 2004 I know, but since these are well-established names, I think we can bend ...
- Hauke Duden (28/61) Jun 04 2004 Well, if you're not going to make the cut now, when then? D is a new
- Kris (11/17) Jun 04 2004 Well then, Walter. If that's the case, perhaps you'd apply the same rule...
- Walter (8/25) Jun 04 2004 unicode
- Arcane Jill (10/12) Jun 04 2004 Unicode space is not whitespace. Whitespace is a completely different co...
- Sean Kelly (8/17) Jun 05 2004 But that doesn't break the ASCII functions for the ASCII character set, ...
- Arcane Jill (53/60) Jun 05 2004 Obviously you are aware of this, but your choice of words gives a strang...
- Walter (20/34) Jun 05 2004 have
- Sean Kelly (8/36) Jun 05 2004 Thanks for putting it so clearly. I'm a bit rusty with C locale stuff
- Hauke Duden (4/6) Jun 05 2004 It is now also available here:
- David L. Davis (7/21) Jun 04 2004 Walter: The above sounds like a good idea for the dchar character(s) in
- Walter (6/11) Jun 04 2004 are
- Walter (7/12) Jun 04 2004 compare
- Arcane Jill (23/25) Jun 04 2004 It's 21 bits actually, the top codepoint being 0x10FFFF. But yeah, there...
- Walter (10/15) Jun 04 2004 half-assed job
- Roberto Mariottini (3/7) Jun 07 2004 7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)?
- Arcane Jill (12/20) Jun 07 2004 Just ASCII.
- Roberto Mariottini (10/32) Jun 08 2004 I know. It's only that I'm italian, and the italian language needs at le...
- Arcane Jill (17/23) Jun 08 2004 Hauke has now implemented utype - a drop-in replacement for ctype, which...
- Hauke Duden (4/16) Jun 08 2004 It is compatible. It has a unittest that checks all ASCII characters
- Arcane Jill (8/15) Jun 08 2004 Excellent! This is superb. The only thing is, the docs don't make that c...
- Hauke Duden (10/28) Jun 08 2004 The documentation of isspace states that it is equivalent to
- Arcane Jill (11/18) Jun 08 2004 Yes, I know. But I think it would be nice to start getting people used t...
- Hauke Duden (4/18) Jun 08 2004 But the interface would have to be changed to return a string instead of...
- Arcane Jill (7/7) Jun 08 2004 Okay, cancel that. I've just realized I was talking complete rubbish. Yo...
- Hauke Duden (7/15) Jun 08 2004 Lol. Come on, don't be sad... ;)
Sean makes some good points in his posts, but the D character set is Unicode by definition. Let me go through this:Some languages don't have upper and lowercase letters.This is true, but it's not relevant. This was relevant back in the days of conflicting 8-bit character encoding standards, in which codepoint 0x41 didn't necessarily mean 'A'. But in Unicode this simply doesn't matter, because there is room for all the characters. '\u0416' (Cyrillic capital letter ZHE) will lowercase to '\u0436' (Cyrillic small letter ZHE) even if you don't speak Russian.And many others don't convert properly using the default routines,Again, this is true, if by "default routines" you mean existing C routines. But they do convert properly if you employ the Unicode casing algorithms. These guys (the Unicode Consortium) have been figuring out this stuff for the last few decades, and have knowledge and experience which encompasses pretty much all the scripts in the world.even if the ASCII character set contains all the appropriate symbols.ASCII, of course, doesn't even contain e-acute, a symbol used, for example, in the English word "café". This symbol (having codepoint '\u00E9') exists in ISO-8859-1, but not in ASCII (whose defined codepoint range is 0x00 to 0x7F). I realise from the context that Sean did know that.So tolower(x)==tolower(y) may yield the incorrect result if the string contains characters beyond the usual 52 ASCII English values.Absolutely. The existing tolower() function is not suitable for Unicode. It exists for historical reasons, and is useful in compiling legacy code. But it really should be deprecated. Having said that, one can't deprecate a function until one has something with which to replace it. Hmmm....I'd like to assume that a D string is a sequence of characters, unicode or otherwise, and I think it would be a mistake to provide methods that don't work properly outside of ASCII English. 
While I'm not much of an expert on localization, I do think that the library should be designed with localization in mind.Would you like to know what the localization issues ARE? In Turkish and Azeri, dotted lowercase i uppercases to DOTTED uppercase I, while dotless uppercase I lowercases to DOTLESS lowercase i. (So if you think about it, the Turkish system actually makes more sense). But Unicode wanted to be a superset of ASCII, so that particular casing rule did not become a part of the standard. Lithuanian retains the dot in a lowercase i when followed by accents. I believe that it would be perfectly acceptable to provide default casing algorithms which work for the whole world apart from the above exceptions. Special functions could be written for those languages if needed. For the rest of the world, it all works smoothly, and differences in display are consigned to "font rendering issues". For example, in French, it is unusual to display an accent on an uppercase letter - but '\u00E9' (e acute) still uppercases to '\u00C9' (E acute), even in France. The decision not to DISPLAY the acute accent is considered a rendering issue, not a character issue, and is a problem which is solved very, very neatly simply by supplying specialized French fonts (in which '\u00C9' is rendered without an accent). Similarly, in tradition Irish, the letter i is written without a font - but the codepoint is still '\u0069', same as for the rest of us. Likewise with French, the decision not to display the dot is a mere rendering issue.For a more thorough explanation, Scott Meyers discusses the problem in one of his "Effective C++" books, the second one IIRC.Yes, but that was then and this is now. Unicode was invented precisely to solve this kind of problem, and solve it it has. There is neither any need nor any sense in our reinventing the wheel here. To case-convert a Unicode character, one merely looks up that character in the published Unicode charts. 
These are purposefully in machine-readable form, and are easily parsed. Foldings (http://www.unicode.org/reports/tr30/). This is slightly more tricky, for reasons I won't go into here, but all of the algorithms are easily implementable. Collation, as we know, IS locale dependent. This is even more tricky, but the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/) handles it. If I had the time, I'd implement all of this myself, but I'm working on something else right now. I do hope, however, that D doesn't do a half-assed job and not be standards-compliant with the defined Unicode algorithms. I'm with what Walter says in the D manual on this one: Unicode is the future. Arcane Jill
Jun 04 2004
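The default, locale-independent case mappings Jill describes can be checked against any Unicode-aware runtime. A small sketch in Python (used here purely for illustration, since its str methods implement the default Unicode case tables the post refers to):

```python
# Default (locale-independent) Unicode case mappings.
# U+00E9 (e acute) uppercases to U+00C9 (E acute) even in France --
# omitting the accent on capitals is a rendering issue, not a casing one.
assert '\u00E9'.upper() == '\u00C9'

# The Cyrillic ZHE pair case-converts correctly whether or not the
# program (or programmer) knows any Russian.
assert '\u0416'.lower() == '\u0436'

# The Turkish dotted/dotless i is the notable exception: the DEFAULT
# mapping stays ASCII-compatible, so 'I' lowercases to 'i', not to
# U+0131 (dotless i). Turkish-specific casing needs a tailored routine.
assert 'I'.lower() == 'i'
assert 'I'.lower() != '\u0131'
```

This is exactly the trade-off described above: the default tables cover the whole world except a handful of documented language-specific exceptions.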
In article <c9p8dn$2j2i$1 digitaldaemon.com>, Arcane Jill says... Typo correction:in tradition Irish, the letter i is written without a fontshould read:in traditional Irish, the letter i is written without a DOTSorry about that, Jill
Jun 04 2004
If it turns out that Jill is Irish, this spells "imminent joviality" to me: The next time Matthew, Jill, and I disagree on the same thread, some canny wit is bound to make a fricking wisecrack about "There was this Englishman, Irishman, and Scotsman ...". I'll stake ten bucks, and a slightly worn pocket-protector, that it will be Brad Anderson ... any takers? <g> "Arcane Jill" <Arcane_member pathlink.com> wrote in message news:c9pa5a$2ln9$1 digitaldaemon.com...In article <c9p8dn$2j2i$1 digitaldaemon.com>, Arcane Jill says... Typo correction:in tradition Irish, the letter i is written without a fontshould read:in traditional Irish, the letter i is written without a DOTSorry about that, Jill
Jun 04 2004
Just wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested. I'll try to finish it up and post it here tonight. Arcane Jill wrote:Sean makes some good points in his posts, but the D character set is Unicode by definition. Let me go through this:Some languages don't have upper and lowercase letters.This is true, but it's not relevant. This was relevant back in the days of conflicting 8-bit character encoding standards, in which codepoint 0x41 didn't necessarily mean 'A'. But in Unicode this simply doesn't matter, because there is room for all the characters. '\u0416' (Cyrillic capital letter ZHE) will lowercase to '\u0436' (Cyrillic small letter ZHE) even if you don't speak Russian.And many others don't convert properly using the default routines,Again, this is true, if by "default routines" you mean existing C routines. But they do convert properly if you employ the Unicode casing algorithms. These guys (the Unicode Consortium) have been figuring out this stuff for the last few decades, and have knowledge and experience which encompasses pretty much all the scripts in the world.even if the ASCII character set contains all the appropriate symbols.ASCII, of course, doesn't even contain e-acute, a symbol used, for example, in the English word "café". This symbol (having codepoint '\u00E9') exists in ISO-8859-1, but not in ASCII (whose defined codepoint range is 0x00 to 0x7F). I realise from the context that Sean did know that.So tolower(x)==tolower(y) may yield the incorrect result if the string contains characters beyond the usual 52 ASCII English values.Absolutely. The existing tolower() function is not suitable for Unicode. It exists for historical reasons, and is useful in compiling legacy code. But it really should be deprecated. Having said that, one can't deprecate a function until one has something with which to replace it. 
Hmmm....I'd like to assume that a D string is a sequence of characters, unicode or otherwise, and I think it would be a mistake to provide methods that don't work properly outside of ASCII English. While I'm not much of an expert on localization, I do think that the library should be designed with localization in mind.Would you like to know what the localization issues ARE? In Turkish and Azeri, dotted lowercase i uppercases to DOTTED uppercase I, while dotless uppercase I lowercases to DOTLESS lowercase i. (So if you think about it, the Turkish system actually makes more sense). But Unicode wanted to be a superset of ASCII, so that particular casing rule did not become a part of the standard. Lithuanian retains the dot in a lowercase i when followed by accents. I believe that it would be perfectly acceptable to provide default casing algorithms which work for the whole world apart from the above exceptions. Special functions could be written for those languages if needed. For the rest of the world, it all works smoothly, and differences in display are consigned to "font rendering issues". For example, in French, it is unusual to display an accent on an uppercase letter - but '\u00E9' (e acute) still uppercases to '\u00C9' (E acute), even in France. The decision not to DISPLAY the acute accent is considered a rendering issue, not a character issue, and is a problem which is solved very, very neatly simply by supplying specialized French fonts (in which '\u00C9' is rendered without an accent). Similarly, in tradition Irish, the letter i is written without a font - but the codepoint is still '\u0069', same as for the rest of us. Likewise with French, the decision not to display the dot is a mere rendering issue.For a more thorough explanation, Scott Meyers discusses the problem in one of his "Effective C++" books, the second one IIRC.Yes, but that was then and this is now. Unicode was invented precisely to solve this kind of problem, and solve it it has. 
There is neither any need nor any sense in our reinventing the wheel here. To case-convert a Unicode character, one merely looks up that character in the published Unicode charts. These are purposefully in machine-readable form, and are easily parsed. Foldings (http://www.unicode.org/reports/tr30/). This is slightly more tricky, for reasons I won't go into here, but all of the algorithms are easily implementable. Collation, as we know, IS locale dependent. This is even more tricky, but the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/) handles it. If I had the time, I'd implement all of this myself, but I'm working on something else right now. I do hope, however, that D doesn't do a half-assed job and not be standards-compliant with the defined Unicode algorithms. I'm with what Walter says in the D manual on this one: Unicode is the future. Arcane Jill
Jun 04 2004
In article <c9pi28$jj$1 digitaldaemon.com>, Hauke Duden says...Just wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested. I'll try to finish it up and post it here tonight.Wow! I'm so impressed. How's it done? Have you defined a String class? I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different). I imagined crafting a String class which knew all about Unicode normalization, so that:assert(String("\u0065\u0301") == String("\u00E9"));would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one. ..and not forgetting the conversions:// String s; dchar[] a = s.nfc(); dchar[] b = s.nfd(); dchar[] c = s.nfkc(); dchar[] d = s.nfkd();If your module is already complete, I guess it's too late for me to point you in the direction of UPR, a binary format for Unicode character properties (much easier to parse than the code-charts). Info is at: http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. Still - you might want to bear it in mind for the future, unless you've already got your own code for parsing the code-charts (for when the next version of Unicode comes out). Anyway, good luck. I'm really pleased to see someone taking all this seriously. There are just too many people of the "ASCII's good enough for me" ilk, and it makes a refreshing change to see D and its supporters taking the initiative here. Arcane Jill
Jun 04 2004
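The normalization behaviour Jill wants from a String class can be demonstrated with Python's unicodedata module (again only an illustration of the Unicode normalization forms NFC/NFD, not of any proposed D API):

```python
import unicodedata

composed = '\u00E9'            # e-acute as a single precomposed code point
decomposed = '\u0065\u0301'    # 'e' followed by a combining acute accent

# A plain code-unit comparison fails -- even the lengths differ.
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2

# Normalizing both sides to the same form (NFC here) makes them equal,
# which is what String("\u0065\u0301") == String("\u00E9") requires.
assert unicodedata.normalize('NFC', decomposed) == composed
assert unicodedata.normalize('NFD', composed) == decomposed
```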
Arcane Jill wrote:In article <c9pi28$jj$1 digitaldaemon.com>, Hauke Duden says...Instead of making a String class another approach would be to write char[] normalize(char[]) that uses COW like std.string and use the regular comparison. That is the model used by tolower and friends. If it is desired an equivalent to cmp can be devised that takes normalization into account much like std.string.icmp takes case into account. A class for String came up a while ago and the basic argument against it was that it wasn't needed - functions work fine. Maybe we'll get to the point where a class is needed but the mental model of <length, ptr> and COW functions is so simple it would be a big change to give it up. -BenJust wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested. I'll try to finish it up and post it here tonight.Wow! I'm so impressed. How's it done? Have you defined a String class? I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different). I imagined crafting a String class which knew all about Unicode normalization, so that:assert(String("\u0065\u0301") == String("\u00E9"));would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one.
Jun 04 2004
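Ben's free-function model (normalize once, then use ordinary comparison, with an icmp-style helper for convenience) might look roughly like this, sketched in Python; the names normalize and ncmp are hypothetical stand-ins for the proposed D functions:

```python
import unicodedata

def normalize(s: str) -> str:
    # One-time normalization as the string "enters the program".
    # NFC is chosen here, matching the precomposed forms most input uses.
    return unicodedata.normalize('NFC', s)

def ncmp(a: str, b: str) -> int:
    # Normalization-aware comparison, analogous to how std.string.icmp
    # layers case-insensitivity over plain cmp.
    a, b = normalize(a), normalize(b)
    return (a > b) - (a < b)

# Decomposed and precomposed e-acute now compare equal.
assert ncmp('\u0065\u0301', '\u00E9') == 0
# Ordinary == still works on anything already normalized.
assert normalize('\u0065\u0301') == normalize('\u00E9')
```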
In article <c9ppdu$c90$1 digitaldaemon.com>, Ben Hinkle says...Instead of making a String class another approach would be to write char[] normalize(char[]) that uses COW like std.string and use the regular comparison. That is the model used by tolower and friends. If it is desired an equivalent to cmp can be devised that takes normalization into account much like std.string.icmp takes case into account.Yup, there are all sorts of possible approaches. I could think of a few more too (e.g. optimized comparisons which only need to test the start of the string instead of pre-normalizing all of it). But anyway - I'm keen to see which one Hauke Duden has come up with. I certainly look forward to it. Jill
Jun 04 2004
Arcane Jill wrote:In article <c9pi28$jj$1 digitaldaemon.com>, Hauke Duden says...I'm afraid I don't deserve your praise ;). While I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.Just wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested. I'll try to finish it up and post it here tonight.Wow! I'm so impressed. How's it done? Have you defined a String class?I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different). I imagined crafting a String class which knew all about Unicode normalization, so that:I think that Unicode is so complicated that doing the case foldings and normalizations on-the-fly for every comparison is a bit of an overkill and could also introduce unnecessary performance bottlenecks. For my own programs I have long settled on only comparing strings the simple way (i.e. character for character). That's good enough if you don't have to work on strings that come from outside your program. For all other situations you can use a normalize function that is called once when the string enters the program.assert(String("\u0065\u0301") == String("\u00E9"));would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one.If your module is already complete, I guess it's too late for me to point you in the direction of UPR, a binary format for Unicode character properties (much easier to parse than the code-charts). Info is at: http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. 
Still - you might want to bear it in mind for the future, unless you've already got your own code for parsing the code-charts (for when the next version of Unicode comes out).Thanks for that info - I will check it out. But as a matter of fact I do already have my own tool for parsing the Unicode data ;). It is more convenient for me, since the module works with static arrays that contain the data in compressed form (a relatively simple RLE algorithm, but effective enough to reduce 2 MB worth of tables to 12 KB).Anyway, good luck. I'm really pleased to see someone taking all this seriously. There are just too many people of the "ASCII's good enough for me" ilk, and it makes a refreshing change to see D and its supporters taking the initiative here.Thanks ;). I agree that far too many people ignore Unicode (right until their application needs to be translated to Japanese, for example). And D is in the position to make it easier for people to do the right thing from the start. We "only" have to make sure that Phobos implements proper Unicode support. Hauke
Jun 04 2004
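Hauke's figure of compressing 2 MB of tables down to 12 KB with a simple RLE scheme is plausible because Unicode property tables contain long runs of identical values. A toy sketch of the idea in Python (this is only an illustration of run-length encoding a property table, not his actual encoding):

```python
import unicodedata

def rle_categories(lo: int, hi: int):
    # Run-length encode the general-category table for the code-point
    # range [lo, hi) as a list of (run_length, category) pairs.
    runs = []
    prev, count = None, 0
    for cp in range(lo, hi):
        cat = unicodedata.category(chr(cp))
        if cat == prev:
            count += 1
        else:
            if prev is not None:
                runs.append((count, prev))
            prev, count = cat, 1
    runs.append((count, prev))
    return runs

# A-Z is a single 26-character run of 'Lu' (uppercase letter), so the
# run list is far smaller than one table entry per code point.
assert rle_categories(0x41, 0x5B) == [(26, 'Lu')]
```

Lookup in such a table is then a binary search over run boundaries, which matches Hauke's later point that heavier compression trades lookup speed for memory.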
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:c9q5sl$vcj$1 digitaldaemon.com...While I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace: import std.ctype; with: import std.utype; and they'll get the unicode-capable versions of the same functions.
Jun 04 2004
In article <c9qh23$1fdh$2 digitaldaemon.com>, Walter says...replace: import std.ctype; with: import std.utype;Hey, Hauke. You've just been offered a place in the vaunted "std" hierarchy! Go for it man. I must be working in the wrong field. Jill :(
Jun 04 2004
Arcane Jill wrote:Thanks for cheering me on AJ ;). But let's wait and see what Walter thinks about it when he has it in his hands - especially about the function names :). Haukereplace: import std.ctype; with: import std.utype;Hey, Hauke. You've just been offered a place in the vaunted "std" hierarchy! Go for it man.
Jun 04 2004
Walter wrote:"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:c9q5sl$vcj$1 digitaldaemon.com...I had three reasons for choosing these function names: 1) isdigit etc. do not conform to the convention that new words should be capitalized. 2) because of D's overloading rules (with definitions in one module being able to completely hide those in others) I'm reluctant to choose global names that could also be used in another context. 3) I wanted to improve on ctype in a few places and also keep a bit closer to the Unicode terms. For example, isspace tests for things that separate words (whitespace in ASCII). In Unicode that's more than just whitespace, thus the name doesn't fit. I also think charIsSpace should check for actual space characters instead of all whitespace. Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore. HaukeWhile I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace:
Jun 04 2004
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:c9qjqr$1jfv$1 digitaldaemon.com...I had three reasons for choosing these function names: 1) isdigit etc. do not conform to the convention that new words should be capitalized.I know, but since these are well-established names, I think we can bend the rules a bit for them <g>.2) because of D's overloading rules (with definitions in one module being able to completely hide those in others) I'm reluctant to choose global names that could also be used in another context.I can't think of a case where they conflict. Note that the actual global names will not conflict, because the names will be prefixed by the package.module name.3) I wanted to improve on ctype in a few places and also keep a bit closer to the Unicode terms. For example, isspace tests for things that separate words (whitespace in ASCII). In Unicode that's more than just whitespace, thus the name doesn't fit. I also think charIsSpace should check for actual space characters instead of all whitespace.If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.I'd do that if the utype functions didn't add significant bloat, but they do (I presume).
Jun 04 2004
Walter wrote:Well, if you're not going to make the cut now, when then? D is a new language and I think the standard library should at least be consistent.I had three reasons for choosing these function names: 1) isdigit etc. do not conform to the convention that new words should be capitalized.I know, but since these are well-established names, I think we can bend the rules a bit for them <g>.I can think of a few conflicts. In fact, in one of my own applications I had a function called "isSeparator" that had nothing at all to do with strings. Regarding the prefixes: I know that you can always access the functions in a fully qualified way, but I think having to do that can be a pain. Especially when you can sometimes get away without it and at other times you have to use the module name.2) because of D's overloading rules (with definitions in one module being able to completely hide those in others) I'm reluctant to choose global names that could also be used in another context.I can't think of a case where they conflict. Note that the actual global names will not conflict, because the names will be prefixed by the package.module name.That's precisely why it is not called isspace in my module :). I wanted to make it obvious that it has different behaviour. The function that does what ctype.isspace does is called charIsSeparator (Unicode calls such characters "separators"). charIsSpace on the other hand tests for characters with the Unicode separator subtype "space", which does NOT include linebreaks. That is as it should be, I think. However, I'd appreciate any ideas for a better name for charIsSpace that makes it obvious that it tests for spaces without actually using the word "space". I couldn't think of any.3) I wanted to improve on ctype in a few places and also keep a bit closer to the Unicode terms. For example, isspace tests for things that separate words (whitespace in ASCII). In Unicode that's more than just whitespace, thus the name doesn't fit. 
I also think charIsSpace should check for actual space characters instead of all whitespace.If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.Well, there's not THAT much overhead. But I guess every little bit could be too much for some specialized applications. For example, it would probably not be a good choice for embedded systems. Right now the module will increase executable size by 12 KB and uses about 2 MB of RAM. The RAM usage could be reduced quite a bit but then the character lookup would be about 3 times slower (right now only a comparison and a simple array indexing operation is needed). HaukeOf course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.I'd do that if the utype functions didn't add significant bloat, but they do (I presume).
Jun 04 2004
"Walter" wrote:Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.I'd do that if the utype functions didn't add significant bloat, but they do (I presume).Well then, Walter. If that's the case, perhaps you'd apply the same rule to printf usage within the root object? As we all know, printf drags along all the floating point formatting and boatloads of other, uhhh, errrrr ... stuff. It absolutely does not belong in the root object, and there's only a dozen or so references to it within debug code inside Phobos ... Sorry to sound a bit snotty, but this is surely a blatant double-standard <g> - Kris
Jun 04 2004
"Kris" <someidiot earthlink.dot.dot.dot.net> wrote in message news:c9qub0$22er$1 digitaldaemon.com..."Walter" wrote:Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.I'd do that if the utype functions didn't add significant bloat, but they do (I presume).Well then, Walter. If that's the case, perhaps you'd apply the same rule to printf usage within the root object? As we all know, printf drags along all the floating point formatting and boatloads of other, uhhh, errrrr ... stuff. It absolutely does not belong in the root object, and there's only a dozen or so references to it within debug code inside Phobos ... Sorry to sound a bit snotty, but this is surely a blatant double-standard <g>But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.
Jun 04 2004
Printf is certainly useful, but one shouldn't have to pay the bloat price when they don't even use it. Placing a printf call within Object.d (the print() method) adds zero value, and has negative impact. It's great not having to explicitly import printf ... but having it automatically loaded where it's never actually used is so totally bogus. BTW, there's actually only around 20 calls to Object.print(); All within Phobos (as Ben Hinkle pointed out). If you remove those, along with Object.print(), the problem just goes away ... "Walter" wrote:But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.
Jun 04 2004
"Walter" wrote:But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.Walter: I realize my reply wasn't very helpful, so please permit me to re-phrase? Yes, as you say, everyone needs printf <g>. They just don't need it in Object.print() - Kris
Jun 04 2004
"Kris" <someidiot earthlink.dot.dot.dot.net> wrote in message news:c9r8sq$2hnt$1 digitaldaemon.com..."Walter" wrote:Yeah, it probably should go from that.But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.Walter: I realize my reply wasn't very helpful, so please permit me to re-phrase? Yes, as you say, everyone needs printf <g>. They just don't need it in Object.print()
Jun 04 2004
In article <c9qr0q$1tk7$2 digitaldaemon.com>, Walter says...If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.Unicode space is not whitespace. Whitespace is a completely different concept. For example, non-breaking space ('\u00A0') is not considered whitespace, but Unicode correctly identifies it as a spacing character. Even more disastrous, '\n' is whitespace, but it is not space. Hauke is correct. These are different properties. You cannot simply re-use the old functions. You have to supply new ones, and preferably with different names. Arcane Jill (By the way, I couldn't download the zip file. Mozilla Firebird freaked out when I tried to click on the link).
Jun 04 2004
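The distinction Jill draws maps directly onto the Unicode general categories: "separators" are category Zs/Zl/Zp, while '\n' is a control character. A quick check in Python (char_is_separator is a hypothetical stand-in for Hauke's charIsSeparator, illustrating only the category lookup):

```python
import unicodedata

def char_is_separator(c: str) -> bool:
    # Unicode "separator" = general category Zs, Zl or Zp.
    return unicodedata.category(c).startswith('Z')

# U+00A0 NO-BREAK SPACE is a separator (category Zs)...
assert char_is_separator('\u00A0')
assert unicodedata.category('\u00A0') == 'Zs'

# ...but '\n' is NOT: its category is Cc (control), even though the
# C function isspace() reports it as whitespace.
assert not char_is_separator('\n')
assert unicodedata.category('\n') == 'Cc'
```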
In article <c9rqvu$bah$1 digitaldaemon.com>, Arcane Jill says...In article <c9qr0q$1tk7$2 digitaldaemon.com>, Walter says...But that doesn't break the ASCII functions for the ASCII character set, it only means that new ones must be provided for Unicode characters. Personally, I'd prefer that the new functions work for both Unicode and for ASCII, much like the locale-based functions do in C++. Localization in C++ is probably the most complex part of the language, however, and I'd like to see if we can't find a way to simplify it a bit in D. SeanIf you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.Unicode space is not whitespace. Whitespace is a completely different concept. For example, non-breaking space ('\u00A0') is not considered whitespace, but Unicode correctly identifies it as a spacing character. Even more disastrous, '\n' is whitespace, but it is not space. Hauke is correct. These are different properties. You cannot simply re-use the old functions. You have to supply new ones, and preferably with different names.
Jun 05 2004
In article <c9sob6$1qpn$1 digitaldaemon.com>, Sean Kelly says...But that doesn't break the ASCII functions for the ASCII character set, it only means that new ones must be provided for Unicode characters. Personally, I'd prefer that the new functions work for both Unicode and for ASCII,Obviously you are aware of this, but your choice of words gives a strange impression here. Clearly, ASCII characters *are* Unicode characters. ASCII is but a small subset of Unicode. They are defined for all Unicode characters, therefore they are defined for all ASCII characters.much like the locale-based functions do in C++. Localization in C++ is probably the most complex part of the language, however, and I'd like to see if we can't find a way to simplify it a bit in D.Agreed, but I'm not clear what you're asking. I've been involved with a text-to-speech project that we had to internationalize and localize for a whole bunch of languages. That was in C++, so I know the issues. Using Unicode made things a whole lot easier, but localization is about a lot more than selecting a character set. Stuff like what character you use for a decimal point, how you punctuate sentences, what kind of quotation marks you use, and so on, are all relevant to localization, and it would be nice to address these. But these issues are independent of the assigned properties of Unicode characters. But I never did like the way C handled locales. Java's tactic made more sense. With regard to those character properties, I couldn't quite figure out if you were agreeing or disagreeing. I suspect that we are all in agreement really. Certainly I would hope so, because actually there is no decision to be taken. And for obvious reasons: (1) The behavior of the ctype functions for the ASCII range is well and truly defined by years of precedent, and cannot be changed. 
(2) Similarly, the Unicode standard, and its various classifications, is an established international standard, and one which we are also not at liberty to change. So, either we implement Unicode properties or we don't, but if we want to be standards compliant, we /cannot/ change one single Unicode property - not even to make it compatible with isspace(), whether we agree with it or not. To do so would place us at odds with - well, basically, the rest of the world. It follows, therefore, that we need BOTH functions - for instance, we need the old-fashioned ctype isspace() AND we need the new Unicode function charIsSpace(). We need the old-fashioned ctype isalpha() AND we need the new Unicode function charIsLetter(). Supplying new functions cannot possibly break the old ones! But as Hauke and I have pointed out, in general they do not agree with each other, even in the ASCII range, and certainly not in the range 0x00 to 0xFF (the range for which the ctype functions are usually implemented). Java has a nice solution, which we might like to copy. Java implements the Unicode Standard (at least for Unicode 2.0), but they ALSO implement ADDITIONAL functions, such as isWhitespace(), isJavaIdentifierStart(), and so on. <ping!> I've just realized what you're referring to. How dumb of me not to have seen it earlier! Ok, let me go through this.... In C, the ctype functions such as toupper(c) will return a different value for a given codepoint c, depending on the current system default locale. toupper(0xD3) might give a different answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE WITH UNICODE. However, D implements toupper(), so the question is, should toupper() be locale dependent in D as it is in C. My immediate thought would be no. No way. The C system locale selects a character encoding upon which toupper() et al operate, but there is only one D character encoding standard. It is Unicode - the superset of all the others. 
And in Unicode, you *don't* call toupper(), you call Hauke's new function - charToUpper(). My inclination is that the old ctype functions should be defined only for the ASCII range (though having them take a dchar is harmless), and within that range, they be compatible with what C did. Arcane Jill
Jun 05 2004
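The locale-free mapping Jill describes is pure table lookup: Unicode's simple case mappings are fixed data, the same in Russia as in France. A toy sketch (the three-entry table and the char_to_upper name are illustrative only — a real implementation like Hauke's covers the whole Unicode range):

```python
# Tiny excerpt of a Unicode-style simple uppercase table: a→A, é→É, α→Α.
# The mapping is pure data — no locale is consulted anywhere.
SIMPLE_UPPER = {0x0061: 0x0041, 0x00E9: 0x00C9, 0x03B1: 0x0391}

def char_to_upper(cp):
    # identity for code points with no uppercase mapping
    return SIMPLE_UPPER.get(cp, cp)

assert chr(char_to_upper(ord("é"))) == "É"
assert chr(char_to_upper(ord("α"))) == "Α"   # Greek alpha → capital Alpha
```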
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:c9t05d$26ft$1 digitaldaemon.com...<ping!> I've just realized what you're referring to. How dumb of me not to have seen it earlier! Ok, let me go through this.... In C, the ctype functions such as toupper(c) will return a different value for a given codepoint c, depending on the current system default locale. toupper(0xD3) might give a different answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE WITH UNICODE. However, D implements toupper(), so the question is, should toupper() be locale dependent in D as it is in C. My immediate thought would be no. No way. The C system locale selects a character encoding upon which toupper() et al operate, but there is only one D character encoding standard. It is Unicode - the superset of all the others. And in Unicode, you *don't* call toupper(), you call Hauke's new function - charToUpper(). My inclination is that the old ctype functions should be defined only for the ASCII range (though having them take a dchar is harmless), and within that range, they be compatible with what C did.I've pretty much come to the same conclusions: 1) D's character types are unicode. They aren't indices into locale-dependent code pages. The library functions are unicode. If you have data that's in a locale-dependent code page, convert it to unicode before using library string functions. 2) The ctype functions will just return 0 for non-ASCII characters. 3) There will be a separate set of functions for unicode, with different names. Thanks to you and Hauke for clarifying the issues with this.
Jun 05 2004
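Walter's rule (2) — the ctype functions simply return 0 for anything outside ASCII — takes only a few lines to sketch. Python shown for illustration; isspace_ascii is a made-up name standing in for D's std.ctype behavior:

```python
def isspace_ascii(c):
    # rule (2): false/0 for any character outside the ASCII range,
    # the classic C whitespace set inside it
    return ord(c) < 0x80 and c in " \t\n\r\v\f"

assert isspace_ascii(" ") and isspace_ascii("\n")
assert not isspace_ascii("\u00A0")   # NBSP: non-ASCII, so always false
```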
Walter wrote:"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:c9t05d$26ft$1 digitaldaemon.com...Thanks for putting it so clearly. I'm a bit rusty with C locale stuff and had forgotten about the default locale business. I agree. I would prefer to have a set of basic functions that are not locale dependent for the ASCII character set and have D provide its own set of unicode functions.<ping!> I've just realized what you're referring to. How dumb of me not to have seen it earlier! Ok, let me go through this.... In C, the ctype functions such as toupper(c) will return a different value for a given codepoint c, depending on the current system default locale. toupper(0xD3) might give a different answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE WITH UNICODE. However, D implements toupper(), so the question is, should toupper() be locale dependent in D as it is in C. My immediate thought would be no. No way. The C system locale selects a character encoding upon which toupper() et al operate, but there is only one D character encoding standard. It is Unicode - the superset of all the others. And in Unicode, you *don't* call toupper(), you call Hauke's new function - charToUpper(). My inclination is that the old ctype functions should be defined only for the ASCII range (though having them take a dchar is harmless), and within that range, they be compatible with what C did.I've pretty much come to the same conclusions: 1) D's character types are unicode. They aren't indices into locale-dependent code pages. The library functions are unicode. If you have data that's in a locale-dependent code page, convert it to unicode before using library string functions. 2) The ctype functions will just return 0 for non-ASCII characters. 3) There will be a separate set of functions for unicode, with different names.Sounds fantastic. Sean
Jun 05 2004
Arcane Jill wrote:(By the way, I couldn't download the zip file. Mozilla Firebird freaked out when I tried to click on the link).It is now also available here: http://www.hazardarea.com/unichar.zip Hauke
Jun 05 2004
In article <c9qh23$1fdh$2 digitaldaemon.com>, Walter says..."Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:c9q5sl$vcj$1 digitaldaemon.com...Walter: The above sounds like a good idea for the dchar character(s) in std.ctype, but what about for strings that use std.string functions and are defined as char[], or is there a dchar[] string type I've missed somewhere? And if there isn't, shouldn't the strings really be defined as dchar[] to work with unicode 32-bit? Thxs for your answer in advance. :))While I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace: import std.ctype; with: import std.utype; and they'll get the unicode-capable versions of the same functions.
Jun 04 2004
"David L. Davis" <SpottedTiger yahoo.com> wrote in message news:c9qmr7$1nrj$1 digitaldaemon.com...Walter: The above sounds like a good idea for the dchar character(s) in std.ctype, but what about for strings that use std.string functions and are defined as char[], or is there a dchar[] string type I've missed somewhere? And if there isn't, shouldn't the strings really be defined as dchar[] to work with unicode 32-bit?Check out the std.utf package, which will decode char[] into a dchar.
Jun 04 2004
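What std.utf's decoding does can be illustrated with any UTF-8 decoder — each char in a D char[] is a UTF-8 code unit, and decoding yields full code points (dchars). A sketch in Python, which performs the same conversion via encode/decode:

```python
# A char[] in D holds UTF-8 code units; é takes two of them
utf8 = "é".encode("utf-8")
assert list(utf8) == [0xC3, 0xA9]            # two code units...

# ...which decode to a single code point, U+00E9
assert [ord(c) for c in utf8.decode("utf-8")] == [0xE9]
```

So a char[] can hold any Unicode text, but indexing it gives code units, not characters — hence the need for a decode step (or dchar[]).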
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:c9pneo$91a$1 digitaldaemon.com...I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different).Oh durn, even with 20 bit unicode they are *still* having multicharacter sequences? ARRRRGGGGHHH.
Jun 04 2004
In article <c9qh22$1fdh$1 digitaldaemon.com>, Walter says...Oh durn, even with 20 bit unicode they are *still* having multicharacter sequences? ARRRRGGGGHHH.It's 21 bits actually, the top codepoint being 0x10FFFF. But yeah, there is a distinction between characters and glyphs (or - if you want to get technical, "default grapheme clusters"). One character equals one dchar - no questions there - but there is not a one-to-one correspondence between characters and glyphs, and there may be several different "spellings" of the same glyph. The combining characters allow you, for example, to put an acute accent over any character. It's all cunning stuff, and of course something of a nightmare for those who design fonts, make text editors, and so on. But fortunately for us, font design is not an issue, just implementation of a few basic algorithms which someone else has already worked out for us. (Although of course, things are never that straightforward. The Consortium's algorithms are kind of "proof of concept". /Real/ implementations would have to throw in a bit of speed optimization). No need for the aaargh, though. Once you get your head around the character/glyph distinction, it all makes complete sense. D's dchars are *characters*, and for that purpose, they are exactly what they are designed to be. D has got it right. And no - there's no need to introduce a glyph type, before anyone asks. Glyphs are only important to people who write rendering algorithms. Glyph /boundaries/ are important, but the algorithms will cover that. I'm sure someone will take up the challenge. It's a fascinating area. Arcane Jill
Jun 04 2004
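The two "spellings" Jill mentions are exactly what Unicode normalization reconciles: e + combining acute and precomposed é compare unequal code point by code point, but normalizing both to the same form makes them comparable. A sketch using Python's unicodedata module, which implements the standard normalization forms:

```python
import unicodedata

decomposed  = "e\u0301"    # 'e' + U+0301 combining acute accent (two chars)
precomposed = "\u00E9"     # 'é' as a single precomposed character

assert decomposed != precomposed     # a plain code-point comparison fails
assert len(decomposed) == 2 and len(precomposed) == 1

# NFC composes; NFD decomposes — either form works if applied to both sides
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```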
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:c9p8dn$2j2i$1 digitaldaemon.com...If I had the time, I'd implement all of this myself, but I'm working on something else right now. I do hope, however, that D doesn't do a half-assed job and not be standards-compliant with the defined Unicode algorithms. I'm with what Walter says in the D manual on this one: Unicode is the future.Yes. Thanks for the excellent references. Right now, the std.ctype functions all take an argument of 'dchar'. This means the interface is correct for unicode, even if the current implementation fails to work on anything but ASCII. If an ambitious person wishes to fix the implementations so they work with unicode, I'll incorporate them.
Jun 04 2004
In article <c9qgf3$1ec3$1 digitaldaemon.com>, Walter says...Right now, the std.ctype functions all take an argument of 'dchar'. This means the interface is correct for unicode, even if the current implementation fails to work on anything but ASCII.7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)? Ciao
Jun 07 2004
In article <ca15r8$1uun$1 digitaldaemon.com>, Roberto Mariottini says...In article <c9qgf3$1ec3$1 digitaldaemon.com>, Walter says...Just ASCII. WINDOWS-1252 (to give it its official encoding name) is all too often incorrectly declared as ISO-8859-1, thanks to Microsoft. Okay, so Unicode is a superset of ISO-8859-1, which in turn is a superset of ASCII, so you *COULD* implement the ctype functions according to the ISO-8859-1 locale, but I suspect that would be terribly confusing to those for whom that was not their default locale. WINDOWS-1252 conflicts with Unicode in the range 0x80 to 0x9F, so I wouldn't recommend that at all. Anyway, Linux users wouldn't like it. Microsoft have taken over enough of the world as it is without their invading D as well. ;-) JillRight now, the std.ctype functions all take an argument of 'dchar'. This means the interface is correct for unicode, even if the current implementation fails to work on anything but ASCII.7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)? Ciao
Jun 07 2004
In article <ca173t$20v3$1 digitaldaemon.com>, Arcane Jill says...In article <ca15r8$1uun$1 digitaldaemon.com>, Roberto Mariottini says...I know. It's only that I'm Italian, and the Italian language needs at least ISO-8859-1 (with collation, etc.); ASCII is not sufficient. Supporting only ASCII means supporting only English. While this can be understandable for English-speaking people, I think that it's worth adding a single bit and upgrading to ISO-8859-1, thus supporting English, Spanish, French, Portuguese, German, Italian, etc.In article <c9qgf3$1ec3$1 digitaldaemon.com>, Walter says...Just ASCII. WINDOWS-1252 (to give it its official encoding name) is all too often incorrectly declared as ISO-8859-1, thanks to Microsoft. Okay, so Unicode is a superset of ISO-8859-1, which in turn is a superset of ASCII, so you *COULD* implement the ctype functions according to the ISO-8859-1 locale, but I suspect that would be terribly confusing to those for whom that was not their default locale.Right now, the std.ctype functions all take an argument of 'dchar'. This means the interface is correct for unicode, even if the current implementation fails to work on anything but ASCII.7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)? CiaoWINDOWS-1252 conflicts with Unicode in the range 0x80 to 0x9F, so I wouldn't recommend that at all. Anyway, Linux users wouldn't like it. Microsoft have taken over enough of the world as it is without their invading D as well. ;-)I don't know how D handles the interface with the OS, but I think Windows would pass CP1252-encoded characters to getchar(), for example. Ciao
Jun 08 2004
In article <ca3pe5$24v$1 digitaldaemon.com>, Roberto Mariottini says...I know. It's only that I'm Italian, and the Italian language needs at least ISO-8859-1 (with collation, etc.); ASCII is not sufficient. Supporting only ASCII means supporting only English. While this can be understandable for English-speaking people, I think that it's worth adding a single bit and upgrading to ISO-8859-1, thus supporting English, Spanish, French, Portuguese, German, Italian, etc.Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so). That, in conjunction with the real Unicode functions which he has also supplied, should solve all your problems. However, there is no way I would support adding explicit support to D for ISO-8859-1. I am also European, and I also use non-ASCII characters, but when I step outside the bounds of ASCII, I use Unicode, not ISO-8859-1. Jill PS. Unicode is a superset of ISO-8859-1 with codepoint equivalence. In this sense only, ISO-8859-1 has special status compared with, say, ISO-8859-2. (Unicode is a superset of ISO-8859-2 as well, of course, but the codepoints are different). So anything which works for Unicode will work for ISO-8859-1, codepoint for codepoint. But that's not the same as restricting it to that range.
Jun 08 2004
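Both halves of Jill's point — the codepoint equivalence of ISO-8859-1 in her PS, and the Windows-1252 conflict in the 0x80-0x9F range — can be demonstrated directly. In Python ("latin-1" and "windows-1252" are the standard codec names):

```python
# Latin-1 bytes map 1:1 onto Unicode code points U+0000–U+00FF
assert "é".encode("latin-1")[0] == ord("é")          # 0xE9 both ways

# ...but Windows-1252 reassigns 0x80–0x9F: e.g. byte 0x93 is a
# left curly quote (U+201C) in CP1252, yet a C1 control in Latin-1
assert b"\x93".decode("windows-1252") == "\u201C"
assert b"\x93".decode("latin-1") == "\x93"
```

This is why ISO-8859-1 needs no special support at all (it is already Unicode, codepoint for codepoint), while treating CP1252 bytes as codepoints would be wrong.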
Arcane Jill wrote:It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;). HaukeI know. It's only that I'm italian, and the italian language needs at least ISO-8859-1 (with collation, etc), ASCII is not sufficient. Supporting only ASCII means supporting only english. While this can be understandable for english-speaking people, I think that it's worth adding a single bit and upgrade to ISO-8859-1, thus supporting english, spanish, french, portuguese, german, italian, etc.Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so).
Jun 08 2004
In article <ca3v8c$fai$1 digitaldaemon.com>, Hauke Duden says...Excellent! This is superb. The only thing is, the docs don't make that claim (unless I missed it). When I read the docs for utype.isspace() I kinda got the impression that it just called charIsSpace(), which obviously would not be compatible with ctype. Perhaps you could make the documentation more explicit. All in all, I'm thoroughly impressed with this. Nice one! Jill PS. Did you omit charToCasefold(), or did I just miss it?Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so).It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;). Hauke
Jun 08 2004
Arcane Jill wrote:It is there, in the module description.Excellent! This is superb. The only thing is, the docs don't make that claim (unless I missed it).Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so).It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;). HaukeWhen I read the docs for utype.isspace() I kinda got the impression that it just called charIsSpace(), which obviously would not be compatible with ctype. Perhaps you could make the documentation more explicit.The documentation of isspace states that it is equivalent to charIsSeparator. But I will make it a little more obvious.All in all, I'm thoroughly impressed with this. Nice one!Thanks :).PS. Did you omit charToCasefold(), or did I just miss it?No, you didn't miss it. Real case folding is another beast entirely, as it requires one-to-many mappings. It is not supported by the module. If you want to do simple one-to-one case folding then calling charToLower on both characters should be equivalent. Hauke
Jun 08 2004
In article <ca49vq$10ui$1 digitaldaemon.com>, Hauke Duden says...Yes, I know. But I think it would be nice to start getting people used to the idea that they need to be calling toCasefold() instead of toLower() if they're going to do case-insensitive comparisons. It's a good "new thing to learn". Even if all it does (for now) is call charToLower(), that would be better than nothing.If you want to do simple one-to-one case folding then calling charToLower on both characters should be equivalent.I know, but basically, I'm saying that code which reads: if (charToCaseFold(c) == charToCaseFold(d)) is more self-documenting than code which reads: if (charToLower(c) == charToLower(d)) and it gets people to start thinking in the Unicode way. So - even if it does nothing useful, I think it's still a good function to have. Jill
Jun 08 2004
Arcane Jill wrote:In article <ca49vq$10ui$1 digitaldaemon.com>, Hauke Duden says...But the interface would have to be changed to return a string instead of a single character. That would break all code that uses it. HaukeYes, I know. But I think it would be nice to start getting people used to the idea that they need to be calling toCasefold() instead of toLower() if they're going to do case-insensitive comparisons. It's a good "new thing to learn". Even if all it does (for now) is call charToLower(), that would be better than nothing.PS. Did you omit charToCasefold(), or did I just miss it?No, you didn't miss it. Real case folding is another beast entirely, as it requires one-to-many mappings. It is not supported by the module.
Jun 08 2004
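Hauke's point — that full case folding is one-to-many and therefore cannot keep a dchar-to-dchar interface — is easy to see in a language that already implements Unicode full case folding. Python's str.casefold() is shown here purely as an illustration:

```python
# ß (U+00DF): lowercasing is one-to-one, but full case folding is not
assert "\u00DF".lower() == "\u00DF"      # ß stays ß under lower()
assert "\u00DF".casefold() == "ss"       # ...but folds to TWO characters

# which is exactly why a char→char charToCasefold() can't exist:
assert "Straße".casefold() == "strasse"
```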
Okay, cancel that. I've just realized I was talking complete rubbish. You were right. I was wrong. Case folding comes into play during special casing, not simple casing. (I was thinking it was in UnicodeData.txt, but of course it isn't, it's only in SpecialCasing.txt). So I withdraw my suggestion, apologize for questioning you, and now I'm going to go and hide in a corner until I stop feeling such a prat. Jill (embarrassed).
Jun 08 2004
Arcane Jill wrote:Okay, cancel that. I've just realized I was talking complete rubbish. You were right. I was wrong. Case folding comes into play during special casing, not simple casing. (I was thinking it was in UnicodeData.txt, but of course it isn't, it's only in SpecialCasing.txt). So I withdraw my suggestion, apologize for questioning you, and now I'm going to go and hide in a corner until I stop feeling such a prat. Jill (embarrassed).Lol. Come on, don't be sad... ;) It's good practice to question other people's work. They could be wrong just as easily as you could. At the very least it will keep both you and the other one thinking, which is always a good thing. Hauke
Jun 08 2004