digitalmars.D - The Unicode Casing Algorithms
- Arcane Jill (59/74) Jun 04 2004 This is true, but it's not relevant. This was relevant back in the days ...
- Arcane Jill (5/7) Jun 04 2004 should read:
- Kris (9/16) Jun 04 2004 If it turns out that Jill is Irish, this spells "imminent joviality" to ...
- Hauke Duden (4/121) Jun 04 2004 Just wanted to note that I have a "real" Unicode casing module in the
- Arcane Jill (22/31) Jun 04 2004 Wow! I'm so impressed. How's it done? Have you defined a String class?
- Ben Hinkle (12/32) Jun 04 2004 Instead of making a String class another approach would be to write
- Arcane Jill (6/12) Jun 04 2004 Yup, there are all sorts of possible approaches. I could think of a few ...
- Hauke Duden (25/57) Jun 04 2004 I'm afraid I don't deserve your praise ;).
- Walter (10/14) Jun 04 2004 How about just calling them isdigit(dchar c), etc.? Perhaps call the mod...
- Arcane Jill (5/9) Jun 04 2004 Hey, Hauke. You've just been offered a place in the vaulted "std" heirar...
- Hauke Duden (5/13) Jun 04 2004 Thanks for cheering me on AJ ;).
- Hauke Duden (17/30) Jun 04 2004 I had three reasons for choosing these function names:
- Walter (11/26) Jun 04 2004 I know, but since these are well-established names, I think we can bend ...
- Hauke Duden (28/61) Jun 04 2004 Well, if you're not going to make the cut now, when then? D is a new
- Kris (11/17) Jun 04 2004 Well then, Walter. If that's the case, perhaps you'd apply the same rule...
- Walter (8/25) Jun 04 2004 unicode
- Arcane Jill (10/12) Jun 04 2004 Unicode space is not whitespace. Whitespace is a completely different co...
- Sean Kelly (8/17) Jun 05 2004 But that doesn't break the ASCII functions for the ASCII character set, ...
- Arcane Jill (53/60) Jun 05 2004 Obviously you are aware of this, but your choice of words gives a strang...
- Walter (20/34) Jun 05 2004 have
- Sean Kelly (8/36) Jun 05 2004 Thanks for putting it so clearly. I'm a bit rusty with C locale stuff
- Hauke Duden (4/6) Jun 05 2004 It is now also available here:
- David L. Davis (7/21) Jun 04 2004 Walter: The above sounds like a good idea for the dchar character(s) in
- Walter (6/11) Jun 04 2004 are
- Walter (7/12) Jun 04 2004 compare
- Arcane Jill (23/25) Jun 04 2004 It's 21 bits actually, the top codepoint being 0x10FFFF. But yeah, there...
- Walter (10/15) Jun 04 2004 half-assed job
- Roberto Mariottini (3/7) Jun 07 2004 7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)?
- Arcane Jill (12/20) Jun 07 2004 Just ASCII.
- Roberto Mariottini (10/32) Jun 08 2004 I know. It's only that I'm italian, and the italian language needs at le...
- Arcane Jill (17/23) Jun 08 2004 Hauke has now implemented utype - a drop-in replacement for ctype, which...
- Hauke Duden (4/16) Jun 08 2004 It is compatible. It has a unittest that checks all ASCII characters
- Arcane Jill (8/15) Jun 08 2004 Excellent! This is superb. The only thing is, the docs don't make that c...
- Hauke Duden (10/28) Jun 08 2004 The documentation of isspace states that it is equivalent to
- Arcane Jill (11/18) Jun 08 2004 Yes, I know. But I think it would be nice to start getting people used t...
- Hauke Duden (4/18) Jun 08 2004 But the interface would have to be changed to return a string instead of...
- Arcane Jill (7/7) Jun 08 2004 Okay, cancel that. I've just realized I was talking complete rubbish. Yo...
- Hauke Duden (7/15) Jun 08 2004 Lol. Come on, don't be sad... ;)
Sean makes some good points in his posts, but the D character set is Unicode by definition. Let me go through this:Some languages don't have upper and lowercase letters.This is true, but it's not relevant. This was relevant back in the days of conflicting 8-bit character encoding standards, in which codepoint 0x41 didn't necessarily mean 'A'. But in Unicode this simply doesn't matter, because there is room for all the characters. '\u0416' (Cyrillic capital letter ZHE) will lowercase to '\u0436' (Cyrillic small letter ZHE) even if you don't speak Russian.And many others don't convert properly using the default routines,Again, this is true, if by "default routines" you mean existing C routines. But they do convert properly if you employ the Unicode casing algorithms. These guys (the Unicode Consortium) have been figuring out this stuff for the last few decades, and have knowledge and experience which encompasses pretty much all the scripts in the world.even if the ASCII character set contains all the appropriate symbols.ASCII, of course, doesn't even contain e-acute, a symbol used, for example, in the English word "café". This symbol (having codepoint '\u00E9') exists in ISO-8859-1, but not in ASCII (whose defined codepoint range is 0x00 to 0x7F). I realise from the context that Sean did know that.So tolower(x)==tolower(y) may yield the incorrect result if the string contains characters beyond the usual 52 ASCII English values.Absolutely. The existing tolower() function is not suitable for Unicode. It exists for historical reasons, and is useful in compiling legacy code. But it really should be deprecated. Having said that, one can't deprecate a function until one has something with which to replace it. Hmmm....I'd like to assume that a D string is a sequence of characters, unicode or otherwise, and I think it would be a mistake to provide methods that don't work properly outside of ASCII English. 
While I'm not much of an expert on localization, I do think that the library should be designed with localization in mind.Would you like to know what the localization issues ARE? In Turkish and Azeri, dotted lowercase i uppercases to DOTTED uppercase I, while dotless uppercase I lowercases to DOTLESS lowercase i. (So if you think about it, the Turkish system actually makes more sense). But Unicode wanted to be a superset of ASCII, so that particular casing rule did not become a part of the standard. Lithuanian retains the dot in a lowercase i when followed by accents. I believe that it would be perfectly acceptable to provide default casing algorithms which work for the whole world apart from the above exceptions. Special functions could be written for those languages if needed. For the rest of the world, it all works smoothly, and differences in display are consigned to "font rendering issues". For example, in French, it is unusual to display an accent on an uppercase letter - but '\u00E9' (e acute) still uppercases to '\u00C9' (E acute), even in France. The decision not to DISPLAY the acute accent is considered a rendering issue, not a character issue, and is a problem which is solved very, very neatly simply by supplying specialized French fonts (in which '\u00C9' is rendered without an accent). Similarly, in tradition Irish, the letter i is written without a font - but the codepoint is still '\u0069', same as for the rest of us. Likewise with French, the decision not to display the dot is a mere rendering issue.For a more thorough explanation, Scott Meyers discusses the problem in one of his "Effective C++" books, the second one IIRC.Yes, but that was then and this is now. Unicode was invented precisely to solve this kind of problem, and solve it it has. There is neither any need nor any sense in our reinventing the wheel here. To case-convert a Unicode character, one merely looks up that character in the published Unicode charts. 
These are purposefully in machine-readable form, and are easily parsed. Foldings (http://www.unicode.org/reports/tr30/). This is slightly more tricky, for reasons I won't go into here, but all of the algorithms are easily implementable. Collation, as we know, IS locale dependent. This is even more tricky, but the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/) handles it. If I had the time, I'd implement all of this myself, but I'm working on something else right now. I do hope, however, that D doesn't do a half-assed job and not be standards-compliant with the defined Unicode algorithms. I'm with what Walter says in the D manual on this one: Unicode is the future. Arcane Jill
Jun 04 2004
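The default, locale-independent case mappings Jill describes can be checked against any Unicode-aware runtime. A small sketch in Python (used here purely for illustration, since its str methods implement the default Unicode case tables the post refers to):

```python
# Default (locale-independent) Unicode case mappings.
# U+00E9 (e acute) uppercases to U+00C9 (E acute) even in France --
# omitting the accent on capitals is a rendering issue, not a casing one.
assert '\u00E9'.upper() == '\u00C9'

# The Cyrillic ZHE pair case-converts correctly whether or not the
# program (or programmer) knows any Russian.
assert '\u0416'.lower() == '\u0436'

# The Turkish dotted/dotless i is the notable exception: the DEFAULT
# mapping stays ASCII-compatible, so 'I' lowercases to 'i', not to
# U+0131 (dotless i). Turkish-specific casing needs a tailored routine.
assert 'I'.lower() == 'i'
assert 'I'.lower() != '\u0131'
```

This is exactly the trade-off described above: the default tables cover the whole world except a handful of documented language-specific exceptions.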
In article <c9p8dn$2j2i$1 digitaldaemon.com>, Arcane Jill says... Typo correction:in tradition Irish, the letter i is written without a fontshould read:in traditional Irish, the letter i is written without a DOTSorry about that, Jill
Jun 04 2004
If it turns out that Jill is Irish, this spells "imminent joviality" to me: The next time Matthew, Jill, and I disagree on the same thread, some canny wit is bound to make a fricking wisecrack about "There was this Englishman, Irishman, and Scotsman ...". I'll stake ten bucks, and a slightly worn pocket-protector, that it will be Brad Anderson ... any takers? <g> "Arcane Jill" <Arcane_member pathlink.com> wrote in message news:c9pa5a$2ln9$1 digitaldaemon.com...In article <c9p8dn$2j2i$1 digitaldaemon.com>, Arcane Jill says... Typo correction:in tradition Irish, the letter i is written without a fontshould read:in traditional Irish, the letter i is written without a DOTSorry about that, Jill
Jun 04 2004
Just wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested. I'll try to finish it up and post it here tonight. Arcane Jill wrote:Sean makes some good points in his posts, but the D character set is Unicode by definition. Let me go through this:Some languages don't have upper and lowercase letters.This is true, but it's not relevant. This was relevant back in the days of conflicting 8-bit character encoding standards, in which codepoint 0x41 didn't necessarily mean 'A'. But in Unicode this simply doesn't matter, because there is room for all the characters. '\u0416' (Cyrillic capital letter ZHE) will lowercase to '\u0436' (Cyrillic small letter ZHE) even if you don't speak Russian.And many others don't convert properly using the default routines,Again, this is true, if by "default routines" you mean existing C routines. But they do convert properly if you employ the Unicode casing algorithms. These guys (the Unicode Consortium) have been figuring out this stuff for the last few decades, and have knowledge and experience which encompasses pretty much all the scripts in the world.even if the ASCII character set contains all the appropriate symbols.ASCII, of course, doesn't even contain e-acute, a symbol used, for example, in the English word "café". This symbol (having codepoint '\u00E9') exists in ISO-8859-1, but not in ASCII (whose defined codepoint range is 0x00 to 0x7F). I realise from the context that Sean did know that.So tolower(x)==tolower(y) may yield the incorrect result if the string contains characters beyond the usual 52 ASCII English values.Absolutely. The existing tolower() function is not suitable for Unicode. It exists for historical reasons, and is useful in compiling legacy code. But it really should be deprecated. Having said that, one can't deprecate a function until one has something with which to replace it. 
Hmmm....I'd like to assume that a D string is a sequence of characters, unicode or otherwise, and I think it would be a mistake to provide methods that don't work properly outside of ASCII English. While I'm not much of an expert on localization, I do think that the library should be designed with localization in mind.Would you like to know what the localization issues ARE? In Turkish and Azeri, dotted lowercase i uppercases to DOTTED uppercase I, while dotless uppercase I lowercases to DOTLESS lowercase i. (So if you think about it, the Turkish system actually makes more sense). But Unicode wanted to be a superset of ASCII, so that particular casing rule did not become a part of the standard. Lithuanian retains the dot in a lowercase i when followed by accents. I believe that it would be perfectly acceptable to provide default casing algorithms which work for the whole world apart from the above exceptions. Special functions could be written for those languages if needed. For the rest of the world, it all works smoothly, and differences in display are consigned to "font rendering issues". For example, in French, it is unusual to display an accent on an uppercase letter - but '\u00E9' (e acute) still uppercases to '\u00C9' (E acute), even in France. The decision not to DISPLAY the acute accent is considered a rendering issue, not a character issue, and is a problem which is solved very, very neatly simply by supplying specialized French fonts (in which '\u00C9' is rendered without an accent). Similarly, in tradition Irish, the letter i is written without a font - but the codepoint is still '\u0069', same as for the rest of us. Likewise with French, the decision not to display the dot is a mere rendering issue.For a more thorough explanation, Scott Meyers discusses the problem in one of his "Effective C++" books, the second one IIRC.Yes, but that was then and this is now. Unicode was invented precisely to solve this kind of problem, and solve it it has. 
There is neither any need nor any sense in our reinventing the wheel here. To case-convert a Unicode character, one merely looks up that character in the published Unicode charts. These are purposefully in machine-readable form, and are easily parsed. Foldings (http://www.unicode.org/reports/tr30/). This is slightly more tricky, for reasons I won't go into here, but all of the algorithms are easily implementable. Collation, as we know, IS locale dependent. This is even more tricky, but the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/) handles it. If I had the time, I'd implement all of this myself, but I'm working on something else right now. I do hope, however, that D doesn't do a half-assed job and not be standards-compliant with the defined Unicode algorithms. I'm with what Walter says in the D manual on this one: Unicode is the future. Arcane Jill
Jun 04 2004
In article <c9pi28$jj$1 digitaldaemon.com>, Hauke Duden says...Just wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested. I'll try to finish it up and post it here tonight.Wow! I'm so impressed. How's it done? Have you defined a String class? I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different). I imagined crafting a String class which knew all about Unicode normalization, so that:assert(String("\u0065\u0301") == String("\u00E9"));would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one. ..and not forgetting the conversions:// String s; dchar[] a = s.nfc(); dchar[] b = s.nfd(); dchar[] c = s.nfkc(); dchar[] d = s.nfkd();If your module is already complete, I guess it's too late for me to point you in the direction of UPR, a binary format for Unicode character properties (much easier to parse than the code-charts). Info is at: http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. Still - you might want to bear it in mind for the future, unless you've already got your own code for parsing the code-charts (for when the next version of Unicode comes out). Anyway, good luck. I'm really pleased to see someone taking all this seriously. There are just too many people of the "ASCII's good enough for me" ilk, and it makes a refreshing change to see D and its supporters taking the initiative here. Arcane Jill
Jun 04 2004
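The normalization behaviour Jill wants from a String class can be demonstrated with Python's unicodedata module (again only an illustration of the Unicode normalization forms NFC/NFD, not of any proposed D API):

```python
import unicodedata

composed = '\u00E9'            # e-acute as a single precomposed code point
decomposed = '\u0065\u0301'    # 'e' followed by a combining acute accent

# A plain code-unit comparison fails -- even the lengths differ.
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2

# Normalizing both sides to the same form (NFC here) makes them equal,
# which is what String("\u0065\u0301") == String("\u00E9") requires.
assert unicodedata.normalize('NFC', decomposed) == composed
assert unicodedata.normalize('NFD', composed) == decomposed
```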
Arcane Jill wrote:In article <c9pi28$jj$1 digitaldaemon.com>, Hauke Duden says...Instead of making a String class another approach would be to write char[] normalize(char[]) that uses COW like std.string and use the regular comparison. That is the model used by tolower and friends. If it is desired an equivalent to cmp can be devised that takes normalization into account much like std.string.icmp takes case into account. A class for String came up a while ago and the basic argument against it was that it wasn't needed - functions work fine. Maybe we'll get to the point where a class is needed but the mental model of <length, ptr> and COW functions is so simple it would be a big change to give it up. -BenJust wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested. I'll try to finish it up and post it here tonight.Wow! I'm so impressed. How's it done? Have you defined a String class? I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different). I imagined crafting a String class which knew all about Unicode normalization, so that:assert(String("\u0065\u0301") == String("\u00E9"));would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one.
Jun 04 2004
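Ben's free-function model (normalize once, then use ordinary comparison, with an icmp-style helper for convenience) might look roughly like this, sketched in Python; the names normalize and ncmp are hypothetical stand-ins for the proposed D functions:

```python
import unicodedata

def normalize(s: str) -> str:
    # One-time normalization as the string "enters the program".
    # NFC is chosen here, matching the precomposed forms most input uses.
    return unicodedata.normalize('NFC', s)

def ncmp(a: str, b: str) -> int:
    # Normalization-aware comparison, analogous to how std.string.icmp
    # layers case-insensitivity over plain cmp.
    a, b = normalize(a), normalize(b)
    return (a > b) - (a < b)

# Decomposed and precomposed e-acute now compare equal.
assert ncmp('\u0065\u0301', '\u00E9') == 0
# Ordinary == still works on anything already normalized.
assert normalize('\u0065\u0301') == normalize('\u00E9')
```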
In article <c9ppdu$c90$1 digitaldaemon.com>, Ben Hinkle says...Instead of making a String class another approach would be to write char[] normalize(char[]) that uses COW like std.string and use the regular comparison. That is the model used by tolower and friends. If it is desired an equivalent to cmp can be devised that takes normalization into account much like std.string.icmp takes case into account.Yup, there are all sorts of possible approaches. I could think of a few more too (e.g. optimized comparisons which only need to test the start of the string instead of pre-normalizing all of it). But anyway - I'm keen to see which one Hauke Duden has come up with. I certainly look forward to it. Jill
Jun 04 2004
Arcane Jill wrote:In article <c9pi28$jj$1 digitaldaemon.com>, Hauke Duden says...I'm afraid I don't deserve your praise ;). While I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.Just wanted to note that I have a "real" Unicode casing module in the works. In fact, it is complete but not yet well tested. I'll try to finish it up and post it here tonight.Wow! I'm so impressed. How's it done? Have you defined a String class?I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different). I imagined crafting a String class which knew all about Unicode normalization, so that:I think that Unicode is so complicated that doing the case foldings and normalizations on-the-fly for every comparison is a bit of an overkill and could also introduce unnecessary performance bottlenecks. For my own programs I have long settled on only comparing strings the simple way (i.e. character for character). That's good enough if you don't have to work on strings that come from outside your program. For all other situations you can use a normalize function that is called once when the string enters the program.assert(String("\u0065\u0301") == String("\u00E9"));would hold true. And this needs to hold true even in a case-SENSITIVE compare, let alone a case-INsensitive one.If your module is already complete, I guess it's too late for me to point you in the direction of UPR, a binary format for Unicode character properties (much easier to parse than the code-charts). Info is at: http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. 
Still - you might want to bear it in mind for the future, unless you've already got your own code for parsing the code-charts (for when the next version of Unicode comes out).Thanks for that info - I will check it out. But as a matter of fact I do already have my own tool for parsing the Unicode data ;). It is more convenient for me, since the module works with static arrays that contain the data in compressed form (a relatively simple RLE algorithm, but effective enough to reduce 2 MB worth of tables to 12 KB).Anyway, good luck. I'm really pleased to see someone taking all this seriously. There are just too many people of the "ASCII's good enough for me" ilk, and it makes a refreshing change to see D and its supporters taking the initiative here.Thanks ;). I agree that far too many people ignore Unicode (right until their application needs to be translated to Japanese, for example). And D is in the position to make it easier for people to do the right thing from the start. We "only" have to make sure that Phobos implements proper Unicode support. Hauke
Jun 04 2004
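Hauke's figure of compressing 2 MB of tables down to 12 KB with a simple RLE scheme is plausible because Unicode property tables contain long runs of identical values. A toy sketch of the idea in Python (this is only an illustration of run-length encoding a property table, not his actual encoding):

```python
import unicodedata

def rle_categories(lo: int, hi: int):
    # Run-length encode the general-category table for the code-point
    # range [lo, hi) as a list of (run_length, category) pairs.
    runs = []
    prev, count = None, 0
    for cp in range(lo, hi):
        cat = unicodedata.category(chr(cp))
        if cat == prev:
            count += 1
        else:
            if prev is not None:
                runs.append((count, prev))
            prev, count = cat, 1
    runs.append((count, prev))
    return runs

# A-Z is a single 26-character run of 'Lu' (uppercase letter), so the
# run list is far smaller than one table entry per code point.
assert rle_categories(0x41, 0x5B) == [(26, 'Lu')]
```

Lookup in such a table is then a binary search over run boundaries, which matches Hauke's later point that heavier compression trades lookup speed for memory.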
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:c9q5sl$vcj$1 digitaldaemon.com...While I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace: import std.ctype; with: import std.utype; and they'll get the unicode-capable versions of the same functions.
Jun 04 2004
In article <c9qh23$1fdh$2 digitaldaemon.com>, Walter says...replace: import std.ctype; with: import std.utype;Hey, Hauke. You've just been offered a place in the vaunted "std" hierarchy! Go for it man. I must be working in the wrong field. Jill :(
Jun 04 2004
Arcane Jill wrote:Thanks for cheering me on AJ ;). But let's wait and see what Walter thinks about it when he has it in his hands - especially about the function names :). Haukereplace: import std.ctype; with: import std.utype;Hey, Hauke. You've just been offered a place in the vaunted "std" hierarchy! Go for it man.
Jun 04 2004
Walter wrote:"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:c9q5sl$vcj$1 digitaldaemon.com...I had three reasons for choosing these function names: 1) isdigit etc. do not conform to the convention that new words should be capitalized. 2) because of D's overloading rules (with definitions in one module being able to completely hide those in others) I'm reluctant to choose global names that could also be used in another context. 3) I wanted to improve on ctype in a few places and also keep a bit closer to the Unicode terms. For example, isspace tests for things that separate words (whitespace in ASCII). In Unicode that's more than just whitespace, thus the name doesn't fit. I also think charIsSpace should check for actual space characters instead of all whitespace. Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore. HaukeWhile I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace:
Jun 04 2004
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:c9qjqr$1jfv$1 digitaldaemon.com...I had three reasons for choosing these function names: 1) isdigit etc. do not conform to the convention that new words should be capitalized.I know, but since these are well-established names, I think we can bend the rules a bit for them <g>.2) because of D's overloading rules (with definitions in one module being able to completely hide those in others) I'm reluctant to choose global names that could also be used in another context.I can't think of a case where they conflict. Note that the actual global names will not conflict, because the names will be prefixed by the package.module name.3) I wanted to improve on ctype in a few places and also keep a bit closer to the Unicode terms. For example, isspace tests for things that separate words (whitespace in ASCII). In Unicode that's more than just whitespace, thus the name doesn't fit. I also think charIsSpace should check for actual space characters instead of all whitespace.If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.I'd do that if the utype functions didn't add significant bloat, but they do (I presume).
Jun 04 2004
Walter wrote:Well, if you're not going to make the cut now, when then? D is a new language and I think the standard library should at least be consistent.I had three reasons for choosing these function names: 1) isdigit etc. do not conform to the convention that new words should be capitalized.I know, but since these are well-established names, I think we can bend the rules a bit for them <g>.I can think of a few conflicts. In fact, in one of my own applications I had a function called "isSeparator" that had nothing at all to do with strings. Regarding the prefixes: I know that you can always access the functions in a fully qualified way, but I think having to do that can be a pain. Especially when you can sometimes get away without it and at other times you have to use the module name.2) because of D's overloading rules (with definitions in one module being able to completely hide those in others) I'm reluctant to choose global names that could also be used in another context.I can't think of a case where they conflict. Note that the actual global names will not conflict, because the names will be prefixed by the package.module name.That's precisely why it is not called isspace in my module :). I wanted to make it obvious that it has different behaviour. The function that does what ctype.isspace does is called charIsSeparator (Unicode calls such characters "separators"). charIsSpace on the other hand tests for characters with the Unicode separator subtype "space", which does NOT include linebreaks. That is as it should be, I think. However, I'd appreciate any ideas for a better name for charIsSpace that makes it obvious that it tests for spaces without actually using the word "space". I couldn't think of any.3) I wanted to improve on ctype in a few places and also keep a bit closer to the Unicode terms. For example, isspace tests for things that separate words (whitespace in ASCII). In Unicode that's more than just whitespace, thus the name doesn't fit. 
I also think charIsSpace should check for actual space characters instead of all whitespace.If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.Well, there's not THAT much overhead. But I guess every little bit could be too much for some specialized applications. For example, it would probably not be a good choice for embedded systems. Right now the module will increase executable size by 12 KB and uses about 2 MB of RAM. The RAM usage could be reduced quite a bit but then the character lookup would be about 3 times slower (right now only a comparison and a simple array indexing operation is needed). HaukeOf course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.I'd do that if the utype functions didn't add significant bloat, but they do (I presume).
Jun 04 2004
"Walter" wrote:Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.I'd do that if the utype functions didn't add significant bloat, but they do (I presume).Well then, Walter. If that's the case, perhaps you'd apply the same rule to printf usage within the root object? As we all know, printf drags along all the floating point formatting and boatloads of other, uhhh, errrrr ... stuff. It absolutely does not belong in the root object, and there's only a dozen or so references to it within debug code inside Phobos ... Sorry to sound a bit snotty, but this is surely a blatant double-standard <g> - Kris
Jun 04 2004
"Kris" <someidiot earthlink.dot.dot.dot.net> wrote in message news:c9qub0$22er$1 digitaldaemon.com..."Walter" wrote:Of course we could create a module std.utype which simply defines std.c.ctype compatible aliases. Or even better, simply call the unicode functions directly from std.c.ctype so that there is no wrong choice anymore.I'd do that if the utype functions didn't add significant bloat, but they do (I presume).Well then, Walter. If that's the case, perhaps you'd apply the same rule to printf usage within the root object? As we all know, printf drags along all the floating point formatting and boatloads of other, uhhh, errrrr ... stuff. It absolutely does not belong in the root object, and there's only a dozen or so references to it within debug code inside Phobos ... Sorry to sound a bit snotty, but this is surely a blatant double-standard <g>But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.
Jun 04 2004
Printf is certainly useful, but one shouldn't have to pay the bloat price when they don't even use it. Placing a printf call within Object.d (the print() method) adds zero value, and has negative impact. It's great not having to explicitly import printf ... but having it automatically loaded where it's never actually used is so totally bogus. BTW, there's actually only around 20 calls to Object.print(); All within Phobos (as Ben Hinkle pointed out). If you remove those, along with Object.print(), the problem just goes away ... "Walter" wrote:But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.
Jun 04 2004
"Walter" wrote:But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.Walter: I realize my reply wasn't very helpful, so please permit me to re-phrase? Yes, as you say, everyone needs printf <g>. They just don't need it in Object.print() - Kris
Jun 04 2004
"Kris" <someidiot earthlink.dot.dot.dot.net> wrote in message news:c9r8sq$2hnt$1 digitaldaemon.com..."Walter" wrote:Yeah, it probably should go from that.But everyone needs printf! And printf doesn't add 2Mb, either, last I checked <g>.Walter: I realize my reply wasn't very helpful, so please permit me to re-phrase? Yes, as you say, everyone needs printf <g>. They just don't need it in Object.print()
Jun 04 2004
In article <c9qr0q$1tk7$2 digitaldaemon.com>, Walter says...If you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.Unicode space is not whitespace. Whitespace is a completely different concept. For example, non-breaking space ('\u00A0') is not considered whitespace, but Unicode correctly identifies it as a spacing character. Even more disastrous, '\n' is whitespace, but it is not space. Hauke is correct. These are different properties. You cannot simply re-use the old functions. You have to supply new ones, and preferably with different names. Arcane Jill (By the way, I couldn't download the zip file. Mozilla Firebird freaked out when I tried to click on the link).
Jun 04 2004
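The distinction Jill draws maps directly onto the Unicode general categories: "separators" are category Zs/Zl/Zp, while '\n' is a control character. A quick check in Python (char_is_separator is a hypothetical stand-in for Hauke's charIsSeparator, illustrating only the category lookup):

```python
import unicodedata

def char_is_separator(c: str) -> bool:
    # Unicode "separator" = general category Zs, Zl or Zp.
    return unicodedata.category(c).startswith('Z')

# U+00A0 NO-BREAK SPACE is a separator (category Zs)...
assert char_is_separator('\u00A0')
assert unicodedata.category('\u00A0') == 'Zs'

# ...but '\n' is NOT: its category is Cc (control), even though the
# C function isspace() reports it as whitespace.
assert not char_is_separator('\n')
assert unicodedata.category('\n') == 'Cc'
```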
In article <c9rqvu$bah$1 digitaldaemon.com>, Arcane Jill says...In article <c9qr0q$1tk7$2 digitaldaemon.com>, Walter says...But that doesn't break the ASCII functions for the ASCII character set, it only means that new ones must be provided for Unicode characters. Personally, I'd prefer that the new functions work for both Unicode and for ASCII, much like the locale-based functions do in C++. Localization in C++ is probably the most complex part of the language, however, and I'd like to see if we can't find a way to simplify it a bit in D. SeanIf you're changing what, say, isspace does for ASCII characters, then I think that's a mistake.Unicode space is not whitespace. Whitespace is a completely different concept. For example, non-breaking space ('\u00A0') is not considered whitespace, but Unicode correctly identifies it as a spacing character. Even more disastrous, '\n' is whitespace, but it is not space. Hauke is correct. These are different properties. You cannot simply re-use the old functions. You have to supply new ones, and preferably with different names.
Jun 05 2004
In article <c9sob6$1qpn$1 digitaldaemon.com>, Sean Kelly says...But that doesn't break the ASCII functions for the ASCII character set, it only means that new ones must be provided for Unicode characters. Personally, I'd prefer that the new functions work for both Unicode and for ASCII,Obviously you are aware of this, but your choice of words gives a strange impression here. Clearly, ASCII characters *are* Unicode characters. ASCII is but a small subset of Unicode. They are defined for all Unicode characters, therefore they are defined for all ASCII characters.much like the locale-based functions do in C++. Localization in C++ is probably the most complex part of the language, however, and I'd like to see if we can't find a way to simplify it a bit in D.Agreed, but I'm not clear what you're asking. I've been involved with a text-to-speech project that we had to internationalize and localize for a whole bunch of languages. That was in C++, so I know the issues. Using Unicode made things a whole lot easier, but localization is about a lot more than selecting a character set. Stuff like what character you use for a decimal point, how you punctuate sentences, what kind of quotation marks you use, and so on, are all relevant to localization, and it would be nice to address these. But these issues are independent of the assigned properties of Unicode characters. But I never did like the way C handled locales. Java's tactic made more sense. With regard to those character properties, I couldn't quite figure out if you were agreeing or disagreeing. I suspect that we are all in agreement really. Certainly I would hope so, because actually there is no decision to be taken. And for obvious reasons: (1) The behavior of the ctype functions for the ASCII range is well and truly defined by years of precedent, and cannot be changed. 
(2) Similarly, the Unicode standard, and its various classifications, is an established international standard, and one which we are also not at liberty to change. So, either we implement Unicode properties or we don't, but if we want to be standards compliant, we /cannot/ change one single Unicode property - not even to make it compatible with isspace(), whether we agree with it or not. To do so would place us at odds with - well, basically, the rest of the world. It follows, therefore, that we need BOTH functions - for instance, we need the old-fashioned ctype isspace() AND we need the new Unicode function charIsSpace(). We need the old-fashioned ctype isalpha() AND we need the new Unicode function charIsLetter(). Supplying new functions cannot possibly break the old ones! But as Hauke and I have pointed out, in general they do not agree with each other, even in the ASCII range, and certainly not in the range 0x00 to 0xFF (the range for which the ctype functions are usually implemented). Java has a nice solution, which we might like to copy. Java implements the Unicode Standard (at least for Unicode 2.0), but they ALSO implement ADDITIONAL functions, such as isWhitespace(), isJavaIdentifierStart(), and so on. <ping!> I've just realized what you're referring to. How dumb of me not to have seen it earlier! Ok, let me go through this.... In C, the ctype functions such as toupper(c) will return a different value for a given codepoint c, depending on the current system default locale. toupper(0xD3) might give a different answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE WITH UNICODE. However, D implements toupper(), so the question is, should toupper() be locale dependent in D as it is in C. My immediate thought would be no. No way. The C system locale selects a character encoding upon which toupper() et al operate, but there is only one D character encoding standard. It is Unicode - the superset of all the others. 
And in Unicode, you *don't* call toupper(), you call Hauke's new function - charToUpper(). My inclination is that the old ctype functions should be defined only for the ASCII range (though having them take a dchar is harmless), and within that range, they be compatible with what C did. Arcane Jill
Jun 05 2004
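The locale-free mapping Jill describes is pure table lookup: Unicode's simple case mappings are fixed data, the same in Russia as in France. A toy sketch (the three-entry table and the char_to_upper name are illustrative only — a real implementation like Hauke's covers the whole Unicode range):

```python
# Tiny excerpt of a Unicode-style simple uppercase table: a→A, é→É, α→Α.
# The mapping is pure data — no locale is consulted anywhere.
SIMPLE_UPPER = {0x0061: 0x0041, 0x00E9: 0x00C9, 0x03B1: 0x0391}

def char_to_upper(cp):
    # identity for code points with no uppercase mapping
    return SIMPLE_UPPER.get(cp, cp)

assert chr(char_to_upper(ord("é"))) == "É"
assert chr(char_to_upper(ord("α"))) == "Α"   # Greek alpha → capital Alpha
```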
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:c9t05d$26ft$1 digitaldaemon.com...<ping!> I've just realized what you're referring to. How dumb of me not to have seen it earlier! Ok, let me go through this.... In C, the ctype functions such as toupper(c) will return a different value for a given codepoint c, depending on the current system default locale. toupper(0xD3) might give a different answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE WITH UNICODE. However, D implements toupper(), so the question is, should toupper() be locale dependent in D as it is in C. My immediate thought would be no. No way. The C system locale selects a character encoding upon which toupper() et al operate, but there is only one D character encoding standard. It is Unicode - the superset of all the others. And in Unicode, you *don't* call toupper(), you call Hauke's new function - charToUpper(). My inclination is that the old ctype functions should be defined only for the ASCII range (though having them take a dchar is harmless), and within that range, they be compatible with what C did.I've pretty much come to the same conclusions: 1) D's character types are unicode. They aren't indices into locale-dependent code pages. The library functions are unicode. If you have data that's in a locale-dependent code page, convert it to unicode before using library string functions. 2) The ctype functions will just return 0 for non-ASCII characters. 3) There will be a separate set of functions for unicode, with different names. Thanks to you and Hauke for clarifying the issues with this.
Jun 05 2004
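Walter's rule (2) — the ctype functions simply return 0 for anything outside ASCII — takes only a few lines to sketch. Python shown for illustration; isspace_ascii is a made-up name standing in for D's std.ctype behavior:

```python
def isspace_ascii(c):
    # rule (2): false/0 for any character outside the ASCII range,
    # the classic C whitespace set inside it
    return ord(c) < 0x80 and c in " \t\n\r\v\f"

assert isspace_ascii(" ") and isspace_ascii("\n")
assert not isspace_ascii("\u00A0")   # NBSP: non-ASCII, so always false
```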
Walter wrote:"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:c9t05d$26ft$1 digitaldaemon.com...Thanks for putting it so clearly. I'm a bit rusty with C locale stuff and had forgotten about the default locale business. I agree. I would prefer to have a set of basic functions that are not locale dependent for the ASCII character set and have D provide its own set of unicode functions.<ping!> I've just realized what you're referring to. How dumb of me not to have seen it earlier! Ok, let me go through this.... In C, the ctype functions such as toupper(c) will return a different value for a given codepoint c, depending on the current system default locale. toupper(0xD3) might give a different answer in Russia from that which it does in France. THIS PROBLEM DOES NOT ARISE WITH UNICODE. However, D implements toupper(), so the question is, should toupper() be locale dependent in D as it is in C. My immediate thought would be no. No way. The C system locale selects a character encoding upon which toupper() et al operate, but there is only one D character encoding standard. It is Unicode - the superset of all the others. And in Unicode, you *don't* call toupper(), you call Hauke's new function - charToUpper(). My inclination is that the old ctype functions should be defined only for the ASCII range (though having them take a dchar is harmless), and within that range, they be compatible with what C did.I've pretty much come to the same conclusions: 1) D's character types are unicode. They aren't indices into locale-dependent code pages. The library functions are unicode. If you have data that's in a locale-dependent code page, convert it to unicode before using library string functions. 2) The ctype functions will just return 0 for non-ASCII characters. 3) There will be a separate set of functions for unicode, with different names.Sounds fantastic. Sean
Jun 05 2004
Arcane Jill wrote:(By the way, I couldn't download the zip file. Mozilla Firebird freaked out when I tried to click on the link).It is now also available here: http://www.hazardarea.com/unichar.zip Hauke
Jun 05 2004
In article <c9qh23$1fdh$2 digitaldaemon.com>, Walter says..."Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:c9q5sl$vcj$1 digitaldaemon.com...Walter: The above sounds like a good idea for the dchar character(s) in std.ctype, but what about for strings that use std.string functions and are defined as char[], or is there a dchar[] string type I've missed somewhere? And if there isn't, shouldn't the strings really be defined as dchar[] to work with unicode 32-bit? Thxs for your answer in advance. :))While I'm also working on a string class, the module I'm talking about is a set of simple global functions like charToLower, charToUpper, charToTitle, charIsDigit, etc. Similar to std.c.ctype but with support for the full unicode character range.How about just calling them isdigit(dchar c), etc.? Perhaps call the module std.utype. The sole remaining advantage of the std.ctype functions is they are very small. So, all a program would need to do to upgrade to unicode is replace: import std.ctype; with: import std.utype; and they'll get the unicode-capable versions of the same functions.
Jun 04 2004
"David L. Davis" <SpottedTiger yahoo.com> wrote in message news:c9qmr7$1nrj$1 digitaldaemon.com...Walter: The above sounds like a good idea for the dchar character(s) in std.ctype, but what about for strings that use std.string functions and are defined as char[], or is there a dchar[] string type I've missed somewhere? And if there isn't, shouldn't the strings really be defined as dchar[] to work with unicode 32-bit?Check out the std.utf package, which will decode char[] into a dchar.
Jun 04 2004
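What std.utf's decoding does can be illustrated with any UTF-8 decoder — each char in a D char[] is a UTF-8 code unit, and decoding yields full code points (dchars). A sketch in Python, which performs the same conversion via encode/decode:

```python
# A char[] in D holds UTF-8 code units; é takes two of them
utf8 = "é".encode("utf-8")
assert list(utf8) == [0xC3, 0xA9]            # two code units...

# ...which decode to a single code point, U+00E9
assert [ord(c) for c in utf8.decode("utf-8")] == [0xE9]
```

So a char[] can hold any Unicode text, but indexing it gives code units, not characters — hence the need for a decode step (or dchar[]).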
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:c9pneo$91a$1 digitaldaemon.com...I ask because, as I'm sure you know, the Unicode character sequence '\u0065\u0301' (lowercase e followed by combining acute accent) should compare equal with '\u00E9' (pre-combined lowercase e with acute accent). Clearly they won't compare as equal in a straightforward dchar[] == test. (Even the lengths are different).Oh durn, even with 20 bit unicode they are *still* having multicharacter sequences? ARRRRGGGGHHH.
Jun 04 2004
In article <c9qh22$1fdh$1 digitaldaemon.com>, Walter says...Oh durn, even with 20 bit unicode they are *still* having multicharacter sequences? ARRRRGGGGHHH.It's 21 bits actually, the top codepoint being 0x10FFFF. But yeah, there is a distinction between characters and glyphs (or - if you want to get technical, "default grapheme clusters"). One character equals one dchar - no questions there - but there is not a one-to-one correspondence between characters and glyphs, and there may be several different "spellings" of the same glyph. The combining characters allow you, for example, to put an acute accent over any character. It's all cunning stuff, and of course something of a nightmare for those who design fonts, make text editors, and so on. But fortunately for us, font design is not an issue, just implementation of a few basic algorithms which someone else has already worked out for us. (Although of course, things are never that straightforward. The Consortium's algorithms are kind of "proof of concept". /Real/ implementations would have to throw in a bit of speed optimization). No need for the aaargh, though. Once you get your head around the character/glyph distinction, it all makes complete sense. D's dchars are *characters*, and for that purpose, they are exactly what they are designed to be. D has got it right. And no - there's no need to introduce a glyph type, before anyone asks. Glyphs are only important to people who write rendering algorithms. Glyph /boundaries/ are important, but the algorithms will cover that. I'm sure someone will take up the challenge. It's a fascinating area. Arcane Jill
Jun 04 2004
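The two "spellings" Jill mentions are exactly what Unicode normalization reconciles: e + combining acute and precomposed é compare unequal code point by code point, but normalizing both to the same form makes them comparable. A sketch using Python's unicodedata module, which implements the standard normalization forms:

```python
import unicodedata

decomposed  = "e\u0301"    # 'e' + U+0301 combining acute accent (two chars)
precomposed = "\u00E9"     # 'é' as a single precomposed character

assert decomposed != precomposed     # a plain code-point comparison fails
assert len(decomposed) == 2 and len(precomposed) == 1

# NFC composes; NFD decomposes — either form works if applied to both sides
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```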
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:c9p8dn$2j2i$1 digitaldaemon.com...If I had the time, I'd implement all of this myself, but I'm working on something else right now. I do hope, however, that D doesn't do a half-assed job and not be standards-compliant with the defined Unicode algorithms. I'm with what Walter says in the D manual on this one: Unicode is the future.Yes. Thanks for the excellent references. Right now, the std.ctype functions all take an argument of 'dchar'. This means the interface is correct for unicode, even if the current implementation fails to work on anything but ASCII. If an ambitious person wishes to fix the implementations so they work with unicode, I'll incorporate them.
Jun 04 2004
In article <c9qgf3$1ec3$1 digitaldaemon.com>, Walter says...Right now, the std.ctype functions all take an argument of 'dchar'. This means the interface is correct for unicode, even if the current implementation fails to work on anything but ASCII.7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)? Ciao
Jun 07 2004
In article <ca15r8$1uun$1 digitaldaemon.com>, Roberto Mariottini says...In article <c9qgf3$1ec3$1 digitaldaemon.com>, Walter says...Just ASCII. WINDOWS-1252 (to give it its official encoding name) is all too often incorrectly declared as ISO-8859-1, thanks to Microsoft. Okay, so Unicode is a superset of ISO-8859-1, which in turn is a superset of ASCII, so you *COULD* implement the ctype functions according to the ISO-8859-1 locale, but I suspect that would be terribly confusing to those for whom that was not their default locale. WINDOWS-1252 conflicts with Unicode in the range 0x80 to 0x9F, so I wouldn't recommend that at all. Anyway, Linux users wouldn't like it. Microsoft have taken over enough of the world as it is without their invading D as well. ;-) JillRight now, the std.ctype functions all take an argument of 'dchar'. This means the interface is correct for unicode, even if the current implementation fails to work on anything but ASCII.7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)? Ciao
Jun 07 2004
In article <ca173t$20v3$1 digitaldaemon.com>, Arcane Jill says...In article <ca15r8$1uun$1 digitaldaemon.com>, Roberto Mariottini says...I know. It's only that I'm Italian, and the Italian language needs at least ISO-8859-1 (with collation, etc.); ASCII is not sufficient. Supporting only ASCII means supporting only English. While this can be understandable for English-speaking people, I think that it's worth adding a single bit and upgrading to ISO-8859-1, thus supporting English, Spanish, French, Portuguese, German, Italian, etc.In article <c9qgf3$1ec3$1 digitaldaemon.com>, Walter says...Just ASCII. WINDOWS-1252 (to give it its official encoding name) is all too often incorrectly declared as ISO-8859-1, thanks to Microsoft. Okay, so Unicode is a superset of ISO-8859-1, which in turn is a superset of ASCII, so you *COULD* implement the ctype functions according to the ISO-8859-1 locale, but I suspect that would be terribly confusing to those for whom that was not their default locale.Right now, the std.ctype functions all take an argument of 'dchar'. This means the interface is correct for unicode, even if the current implementation fails to work on anything but ASCII.7-bit ASCII, 8-bit CP1252 or 8-bit ISO-8859-1 (Latin-1)? CiaoWINDOWS-1252 conflicts with Unicode in the range 0x80 to 0x9F, so I wouldn't recommend that at all. Anyway, Linux users wouldn't like it. Microsoft have taken over enough of the world as it is without their invading D as well. ;-)I don't know how D handles the interface with the OS, but I think Windows would pass CP1252-encoded characters to getchar(), for example. Ciao
Jun 08 2004
In article <ca3pe5$24v$1 digitaldaemon.com>, Roberto Mariottini says...I know. It's only that I'm Italian, and the Italian language needs at least ISO-8859-1 (with collation, etc.); ASCII is not sufficient. Supporting only ASCII means supporting only English. While this can be understandable for English-speaking people, I think that it's worth adding a single bit and upgrading to ISO-8859-1, thus supporting English, Spanish, French, Portuguese, German, Italian, etc.Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so). That, in conjunction with the real Unicode functions which he has also supplied, should solve all your problems. However, there is no way I would support adding explicit support to D for ISO-8859-1. I am also European, and I also use non-ASCII characters, but when I step outside the bounds of ASCII, I use Unicode, not ISO-8859-1. Jill PS. Unicode is a superset of ISO-8859-1 with codepoint equivalence. In this sense only, ISO-8859-1 has special status compared with, say, ISO-8859-2. (Unicode is a superset of ISO-8859-2 as well, of course, but the codepoints are different). So anything which works for Unicode will work for ISO-8859-1, codepoint for codepoint. But that's not the same as restricting it to that range.
Jun 08 2004
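Both halves of Jill's point — the codepoint equivalence of ISO-8859-1 in her PS, and the Windows-1252 conflict in the 0x80-0x9F range — can be demonstrated directly. In Python ("latin-1" and "windows-1252" are the standard codec names):

```python
# Latin-1 bytes map 1:1 onto Unicode code points U+0000–U+00FF
assert "é".encode("latin-1")[0] == ord("é")          # 0xE9 both ways

# ...but Windows-1252 reassigns 0x80–0x9F: e.g. byte 0x93 is a
# left curly quote (U+201C) in CP1252, yet a C1 control in Latin-1
assert b"\x93".decode("windows-1252") == "\u201C"
assert b"\x93".decode("latin-1") == "\x93"
```

This is why ISO-8859-1 needs no special support at all (it is already Unicode, codepoint for codepoint), while treating CP1252 bytes as codepoints would be wrong.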
Arcane Jill wrote:It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;). HaukeI know. It's only that I'm italian, and the italian language needs at least ISO-8859-1 (with collation, etc), ASCII is not sufficient. Supporting only ASCII means supporting only english. While this can be understandable for english-speaking people, I think that it's worth adding a single bit and upgrade to ISO-8859-1, thus supporting english, spanish, french, portuguese, german, italian, etc.Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so).
Jun 08 2004
In article <ca3v8c$fai$1 digitaldaemon.com>, Hauke Duden says...Excellent! This is superb. The only thing is, the docs don't make that claim (unless I missed it). When I read the docs for utype.isspace() I kinda got the impression that it just called charIsSpace(), which obviously would not be compatible with ctype. Perhaps you could make the documentation more explicit. All in all, I'm thoroughly impressed with this. Nice one! Jill PS. Did you omit charToCasefold(), or did I just miss it?Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so).It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;). Hauke
Jun 08 2004
Arcane Jill wrote:It is there, in the module description.Excellent! This is superb. The only thing is, the docs don't make that claim (unless I missed it).Hauke has now implemented utype - a drop-in replacement for ctype, which now supports all Unicode characters. (I don't know how he did it. I'm not completely convinced that it's backwardly compatible with ctype in the ASCII range, but even if it isn't, I'm sure it could be made so).It is compatible. It has a unittest that checks all ASCII characters with all functions to make sure ;). HaukeWhen I read the docs for utype.isspace() I kinda got the impression that it just called charIsSpace(), which obviously would not be compatible with ctype. Perhaps you could make the documentation more explicit.The documentation of isspace states that it is equivalent to charIsSeparator. But I will make it a little more obvious.All in all, I'm thoroughly impressed with this. Nice one!Thanks :).PS. Did you omit charToCasefold(), or did I just miss it?No, you didn't miss it. Real case folding is another beast entirely, as it requires one-to-many mappings. It is not supported by the module. If you want to do simple one-to-one case folding then calling charToLower on both characters should be equivalent. Hauke
Jun 08 2004
In article <ca49vq$10ui$1 digitaldaemon.com>, Hauke Duden says...Yes, I know. But I think it would be nice to start getting people used to the idea that they need to be calling toCasefold() instead of toLower() if they're going to do case-insensitive comparisons. It's a good "new thing to learn". Even if all it does (for now) is call charToLower(), that would be better than nothing.If you want to do simple one-to-one case folding then calling charToLower on both characters should be equivalent.I know, but basically, I'm saying that code which reads: if (charToCaseFold(c) == charToCaseFold(d)) is more self-documenting than code which reads: if (charToLower(c) == charToLower(d)) and it gets people to start thinking in the Unicode way. So - even if it does nothing useful, I think it's still a good function to have. Jill
Jun 08 2004
Arcane Jill wrote:In article <ca49vq$10ui$1 digitaldaemon.com>, Hauke Duden says...But the interface would have to be changed to return a string instead of a single character. That would break all code that uses it. HaukeYes, I know. But I think it would be nice to start getting people used to the idea that they need to be calling toCasefold() instead of toLower() if they're going to do case-insensitive comparisons. It's a good "new thing to learn". Even if all it does (for now) is call charToLower(), that would be better than nothing.PS. Did you omit charToCasefold(), or did I just miss it?No, you didn't miss it. Real case folding is another beast entirely, as it requires one-to-many mappings. It is not supported by the module.
Jun 08 2004
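Hauke's point — that full case folding is one-to-many and therefore cannot keep a dchar-to-dchar interface — is easy to see in a language that already implements Unicode full case folding. Python's str.casefold() is shown here purely as an illustration:

```python
# ß (U+00DF): lowercasing is one-to-one, but full case folding is not
assert "\u00DF".lower() == "\u00DF"      # ß stays ß under lower()
assert "\u00DF".casefold() == "ss"       # ...but folds to TWO characters

# which is exactly why a char→char charToCasefold() can't exist:
assert "Straße".casefold() == "strasse"
```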
Okay, cancel that. I've just realized I was talking complete rubbish. You were right. I was wrong. Case folding comes into play during special casing, not simple casing. (I was thinking it was in UnicodeData.txt, but of course it isn't, it's only in SpecialCasing.txt). So I withdraw my suggestion, apologize for questioning you, and now I'm going to go and hide in a corner until I stop feeling such a prat. Jill (embarrassed).
Jun 08 2004
Arcane Jill wrote:Okay, cancel that. I've just realized I was talking complete rubbish. You were right. I was wrong. Case folding comes into play during special casing, not simple casing. (I was thinking it was in UnicodeData.txt, but of course it isn't, it's only in SpecialCasing.txt). So I withdraw my suggestion, apologize for questioning you, and now I'm going to go and hide in a corner until I stop feeling such a prat. Jill (embarrassed).Lol. Come on, don't be sad... ;) It's good practice to question other people's work. They could be wrong just as easily as you could. At the very least it will keep both you and the other one thinking, which is always a good thing. Hauke
Jun 08 2004