digitalmars.D - RFC: Case-Insensitive Strings (And usually they really do *have* case)

digitalmars.D - RFC: Case-Insensitive Strings (And usually they really do have case)

Nick Sabalausky (58/58) Jan 09 2011 Imagine things like Windows filenames or tokenized BASIC code (There's

Jim (4/80) Jan 09 2011 I'm a firm believer of alternative B: Store the string with its original...

Jonathan M Davis (23/35) Jan 09 2011 I don't know that it's over-engineering. I expect that there _are_ cases...

Nick Sabalausky (44/87) Jan 09 2011 The problem certainly cropped up for me. See below.

Jim (5/12) Jan 10 2011 You have a point. Your case-sensitivity-aware string types will guarante...

Nick Sabalausky (28/46) Jan 10 2011 Not carrying any other data means not caching the lowercase version, whi...

Jonathan M Davis (6/32) Jan 10 2011 Why is caching necessary? Shouldn't you just be able to use std.string.i...

Nick Sabalausky (6/47) Jan 10 2011 Anything involving toHash (such as using Insensitive as an AA key) requi...

Nick Sabalausky (9/59) Jan 10 2011 The other (smaller) thing I'm concerned about, and the reason I keep

Michel Fortin (15/21) Jan 10 2011 Comparing the lowercase version of two strings works well for ASCII,

Nick Sabalausky (12/33) Jan 10 2011 Yea, Phobos doesn't even have folding-case functions yet (which is why I...

Daniel Gibson (3/6) Jan 10 2011 It's not the same, see
Nick Sabalausky (4/6) Jan 10 2011 Figured this belongs in bugilla:

Daniel Gibson (12/29) Jan 10 2011 That's wrong, '�' is lowercase and no upper-case version is used really,...

Nick Sabalausky (15/33) Jan 10 2011 One of the unicode documents mentions an example involving the three gre...

Michel Fortin (11/18) Jan 10 2011 Actually, what would be compatible with Unicode collation is probably

Nick Sabalausky (14/31) Jan 10 2011 I'll have to take a look. (I don't even know what "Unicode collation" is...

Michel Fortin (6/15) Jan 10 2011 Oops, you're right. I got it backward.

"Nick Sabalausky" <a a.a> writes:

Imagine things like Windows filenames or tokenized BASIC code (There's 
probably plenty of other examples too, for instance, I came across this 
issue when dealing with GOLD's pre-defined character set names: 
http://www.devincook.com/goldparser/doc/grammars/character-sets.htm ).

These things *do* have a specific case, but the vast majority of time their 
*comparisons* should be done insensitively. This can lead to awkwardness. 
For instance, how do you store it?

A. If you store it as lower-case: You lose the information of the original 
case. The casing information may not be important for comparisons, but it 
*is* an inherent part of the data. For instance, if the string gets output 
in some way, it's no longer what it originally was.

B. If you store it preserving case: You have to make sure to remember to 
convert to lower-case on most comparisons. Which has three problems:
    1. Easy to forget and cause a bug.
    2. May end up converting lower-case more often than really needed.
    3. What if the comparison occurs inside a routine that has no idea what 
it's supposed to be? You'd have to remember to convert to lower-case when 
passing to that routine. And then within the routines you're back to the 
problems of strategy "A".

C. If you store both the original and lower-case versions: You have to take 
special care to maintain two separate variables and keep them both in-sync.

All three options are kinda sucky. And in any way, how to do distinguish 
between variables that are suposed to be case-insensitive and ones that 
aren't? Naming convention?

I think treating case-sensitivity as an attribute of the data rather than 
the comparison, and utilizing the type-system, cleans things up 
considerably. So I've created a struct to represent a "case-insensitive 
string" (templated for string, wstring and dstring). I think it would be a 
good thing to have in phobos, and would like to submit it for review.

It's currently named Insensitive/WInsensitive/DInsensitive, but I'm thinking 
now that maybe it should be renamed something like 
istring/iwstring/idstring. They're used like this:

// Create
auto normalString = "Hello";
auto insensitiveString = Insensitive("Hello");

// Convert
normalString = insensitiveString.toString();
insensitiveString = Insensitive(normalString);

// Compare
assert(Insensitive("Hello") == Insensitive("hELLo"));

// Mixed-comparisons are deliberately disallowed at
// compile-time since the intent is ambiguous:
assert(normalString == insensitiveString); // Compile Error!

// To disambiguate:
assert(Insensitive(normalString) == insensitiveString)); // Case-Insensitive
assert(normalString == insensitiveString.toString())); // Case-Sensitive

// Preserves original case:
writeln(Insensitive("hELLo")); // Output: hELLo

It also works as expected as the key of an assoc array (which I've 
personally found useful). Also, it doesn't convert to lower-case until it 
actually needs to, and once it does, it caches it.

Here is the source:
http://www.dsource.org/projects/semitwist/browser/trunk/src/semitwist/util/text.d#L758

While it is in the middle of my big grab-bag library, I'm pretty sure it 
doesn't rely on anything that's not in phobos (unless I've missed something, 
in which case, I can correct). The unittests for it do use my own 
unittesting routines, but it should be trivial to see how they'd convert to 
unittest{}, assert() and assertPred.

Jan 09 2011

Jim <bitcirkel yahoo.com> writes:

I'm a firm believer of alternative B: Store the string with its original case,
unless it's particularly important to do otherwise.

The cost of case-insensitive comparison is REALLY small. Anytime you are to
compare two strings ask yourself whether case-sensitive or case-insensitive is
what you need. Have no inclination to prefer one type of comparison to the
other. Problem solved. Bloat avoided.


Creating specific types of strings that carry with them data on how they are to
be interpreted is over-engineering, solving a problem that doesn't exist.


Nick Sabalausky Wrote:

 Imagine things like Windows filenames or tokenized BASIC code (There's 
 probably plenty of other examples too, for instance, I came across this 
 issue when dealing with GOLD's pre-defined character set names: 
 http://www.devincook.com/goldparser/doc/grammars/character-sets.htm ).
 
 These things *do* have a specific case, but the vast majority of time their 
 *comparisons* should be done insensitively. This can lead to awkwardness. 
 For instance, how do you store it?
 
 A. If you store it as lower-case: You lose the information of the original 
 case. The casing information may not be important for comparisons, but it 
 *is* an inherent part of the data. For instance, if the string gets output 
 in some way, it's no longer what it originally was.
 
 B. If you store it preserving case: You have to make sure to remember to 
 convert to lower-case on most comparisons. Which has three problems:
     1. Easy to forget and cause a bug.
     2. May end up converting lower-case more often than really needed.
     3. What if the comparison occurs inside a routine that has no idea what 
 it's supposed to be? You'd have to remember to convert to lower-case when 
 passing to that routine. And then within the routines you're back to the 
 problems of strategy "A".
 
 C. If you store both the original and lower-case versions: You have to take 
 special care to maintain two separate variables and keep them both in-sync.
 
 All three options are kinda sucky. And in any way, how to do distinguish 
 between variables that are suposed to be case-insensitive and ones that 
 aren't? Naming convention?
 
 I think treating case-sensitivity as an attribute of the data rather than 
 the comparison, and utilizing the type-system, cleans things up 
 considerably. So I've created a struct to represent a "case-insensitive 
 string" (templated for string, wstring and dstring). I think it would be a 
 good thing to have in phobos, and would like to submit it for review.
 
 It's currently named Insensitive/WInsensitive/DInsensitive, but I'm thinking 
 now that maybe it should be renamed something like 
 istring/iwstring/idstring. They're used like this:
 
 // Create
 auto normalString = "Hello";
 auto insensitiveString = Insensitive("Hello");
 
 // Convert
 normalString = insensitiveString.toString();
 insensitiveString = Insensitive(normalString);
 
 // Compare
 assert(Insensitive("Hello") == Insensitive("hELLo"));
 
 // Mixed-comparisons are deliberately disallowed at
 // compile-time since the intent is ambiguous:
 assert(normalString == insensitiveString); // Compile Error!
 
 // To disambiguate:
 assert(Insensitive(normalString) == insensitiveString)); // Case-Insensitive
 assert(normalString == insensitiveString.toString())); // Case-Sensitive
 
 // Preserves original case:
 writeln(Insensitive("hELLo")); // Output: hELLo
 
 It also works as expected as the key of an assoc array (which I've 
 personally found useful). Also, it doesn't convert to lower-case until it 
 actually needs to, and once it does, it caches it.
 
 Here is the source:
 http://www.dsource.org/projects/semitwist/browser/trunk/src/semitwist/util/text.d#L758
 
 While it is in the middle of my big grab-bag library, I'm pretty sure it 
 doesn't rely on anything that's not in phobos (unless I've missed something, 
 in which case, I can correct). The unittests for it do use my own 
 unittesting routines, but it should be trivial to see how they'd convert to 
 unittest{}, assert() and assertPred.

Jan 09 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Sunday 09 January 2011 13:52:53 Jim wrote:
 I'm a firm believer of alternative B: Store the string with its original
 case, unless it's particularly important to do otherwise.
 
 The cost of case-insensitive comparison is REALLY small. Anytime you are to
 compare two strings ask yourself whether case-sensitive or
 case-insensitive is what you need. Have no inclination to prefer one type
 of comparison to the other. Problem solved. Bloat avoided.
 
 
 Creating specific types of strings that carry with them data on how they
 are to be interpreted is over-engineering, solving a problem that doesn't
 exist.

I don't know that it's over-engineering. I expect that there _are_ cases where 
it makes perfect sense. However, in the general case, I do think that it's 
overkill. std.string.icmp() deals with most cases where you need case-
insensitive comparison, but what if you really do need it everywhere as in 
Nick's case? Or what about cases like associative arrays, which you can't give
a 
comparison function to (it has to be built into the type)? I don't think that 
the cost of the comparison here is really the issue. If that's all you need, 
then there's icmp(). It's when you need the same comparison _everywhere_ that
it 
matters.

I don't know if this is worthwhile having in Phobos or not, but it might be. It 
definitely seems to me that in the general case, this sort of solution should
not 
be used but that there are definitely cases where it would be extremely useful 
and get a rid of a host of potential problems.

Now, I do wonder if perhaps this idea should be generalized to any type and/or
a 
given binary predicate to test for equality rather than making it specific to 
strings and case-insensitive comparison. The issue here (in the general sense) 
is that you want to wrap a type so that it will use a specialized comparison 
function everywhere, and that seems like it should be highly generalizable, 
though doing it right may require alias this, which _is_ rather buggy at the 
moment. Still, it would seem to me to be worthwhile to consider how it could 
and/or should be generalized.

- Jonathan M Davis

Jan 09 2011

"Nick Sabalausky" <a a.a> writes:

"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
news:mailman.529.1294620116.4748.digitalmars-d puremagic.com...
 On Sunday 09 January 2011 13:52:53 Jim wrote:
 I'm a firm believer of alternative B: Store the string with its original
 case, unless it's particularly important to do otherwise.

 The cost of case-insensitive comparison is REALLY small. Anytime you are 
 to
 compare two strings ask yourself whether case-sensitive or
 case-insensitive is what you need. Have no inclination to prefer one type
 of comparison to the other. Problem solved. Bloat avoided.


 Creating specific types of strings that carry with them data on how they
 are to be interpreted is over-engineering, solving a problem that doesn't
 exist.


The problem certainly cropped up for me. See below.

 I don't know that it's over-engineering. I expect that there _are_ cases 
 where
 it makes perfect sense. However, in the general case, I do think that it's
 overkill. std.string.icmp() deals with most cases where you need case-
 insensitive comparison, but what if you really do need it everywhere as in
 Nick's case? Or what about cases like associative arrays, which you can't 
 give a
 comparison function to (it has to be built into the type)? I don't think 
 that
 the cost of the comparison here is really the issue. If that's all you 
 need,
 then there's icmp(). It's when you need the same comparison _everywhere_ 
 that it
 matters.

Right. FWIW, this is the scenario that originally inspired it:

I was working on some code that processes a grammar definition 
(specifically, the BNF-style language that GOLD uses). The grammar 
definition language includes various pre-defined character sets, and allows 
user-defined character sets. These character sets are referred to be name 
(such as "AlphaNumeric", "Whitespace", or "Cyrillic Supplementary"). But 
those character set names are defined by the language as being 
case-insensitive. (Come to think of it, all the names of everything are 
case-insensitive: tokens, char sets, meta-data, etc.)

Due to the usage patterns in my program, it made sense to store the 
character sets as an associative array where the keys were the names of the 
character sets and the values were the data describing what characters were 
included in the set. And there were plenty of other AAs for other things 
that were all indexed by case-insensitive names. Obviously, I needed to 
ensure that *all* comparisons involving these names were done insensitively 
(to do otherwise would be a bug). And there were also times when I needed to 
display one of the character set names (error messages, for instance), and 
it would be awkward not to show the original capitalization. So I had to 
follow the convention of always creating lower-case versions to insert into 
and lookup from the AAs, and also maintain the original names (and be very 
careful about all of it). This quickly became an awful mess. But as soon as 
I wrote and started using the "Insensitive" type, the whole thing was 
simplified enormously.

While writing and dealing with all that code I realized something: While 
programmers are usually heavily conditioned to think of case-sensitivity as 
an attribute of the comparison, it's very frequent that the deciding factor 
in which comparison to use is *not* the comparison itself but *what* gets 
compared. And in those cases, you have to use the awful strategy of "relying 
on convention" to make sure you get it right in *every* place that 
particular data gets compared.

It's analogous to how Asm has separate operators for signed-integer, 
unsigned-integer and floating-point math: Many times a specific memory 
location is *supposed* to be treated as either signed, unsigned or float in 
*all* operations they participate in. Handling this with separate operators 
that behave differently is notoriously tedious and error-prone. That's why 
non-asm languages, even ones as low-level as C, employ a type system which 
allows the programmer to *force* a variable to always, and automatically, be 
used with the proper version of the given operator. Heck, it all goes back 
to the whole original point of a type-system.

 Now, I do wonder if perhaps this idea should be generalized to any type 
 and/or a
 given binary predicate to test for equality rather than making it specific 
 to
 strings and case-insensitive comparison. The issue here (in the general 
 sense)
 is that you want to wrap a type so that it will use a specialized 
 comparison
 function everywhere, and that seems like it should be highly 
 generalizable,
 though doing it right may require alias this, which _is_ rather buggy at 
 the
 moment. Still, it would seem to me to be worthwhile to consider how it 
 could
 and/or should be generalized.

That's a very good thought. Have to say I'm not really sure offhand how I'd 
do that though.

Jan 09 2011

Jim <bitcirkel yahoo.com> writes:

 While writing and dealing with all that code I realized something: While 
 programmers are usually heavily conditioned to think of case-sensitivity as 
 an attribute of the comparison, it's very frequent that the deciding factor 
 in which comparison to use is *not* the comparison itself but *what* gets 
 compared. And in those cases, you have to use the awful strategy of "relying 
 on convention" to make sure you get it right in *every* place that 
 particular data gets compared.


You have a point. Your case-sensitivity-aware string types will guarantee
correctness in a large and complex program. I like that. Ideally though, they
would only be compile-time constraints (i.e. not carrying any other data).

Perhaps it's possible to create such types?
CaseSensitiveString,
CaseInsensitiveString,
and the standard string type being unspecified.

Jan 10 2011

"Nick Sabalausky" <a a.a> writes:

"Jim" <bitcirkel yahoo.com> wrote in message 
news:igfado$11g3$1 digitalmars.com...
 While writing and dealing with all that code I realized something: While
 programmers are usually heavily conditioned to think of case-sensitivity 
 as
 an attribute of the comparison, it's very frequent that the deciding 
 factor
 in which comparison to use is *not* the comparison itself but *what* gets
 compared. And in those cases, you have to use the awful strategy of 
 "relying
 on convention" to make sure you get it right in *every* place that
 particular data gets compared.


 You have a point. Your case-sensitivity-aware string types will guarantee 
 correctness in a large and complex program. I like that. Ideally though, 
 they would only be compile-time constraints (i.e. not carrying any other 
 data).

Not carrying any other data means not caching the lowercase version, which 
means recreating the lowercase version more than necessary. So it's the 
classic speed vs. space tradeoff. I would think there would be cases where 
they get compared enough for that to make a difference, although I suppose 
we'd really need benchmarks to see. OTOH, there are certainly cases (such as 
my original motivating case) where the extra space is not an issue at all.

But that does raise a point worth exploring: Maybe it should support both 
caching and non-caching behavior? I can think of at least two different ways 
to do this:

1. Provide a "useCache" template parameter. Problem with that though, is 
that the caching and non-caching versions end up being two distinct types.

2. Make "useCache" a runtime property. Since the cached foldingCase var is 
entirely private, the "useCache" flag could probably be encoded as a 
specific value of foldingCase.length and/or foldingCase.ptr. So the overhead 
of the non-cached version would only be size_t*2 instead of 
size_t*2+str.length.

One other idea: It's possible to change it so the folding case version is 
only created and cached when toHash is used. For other cases, it could just 
use icmp. So the storage space (beyond .length and .ptr) for the folding 
case version would never be allocated unless you're using it as an AA key 
(or some other use that needs hashing). Some benchmarking would be needed to 
be certain, but I think that would hit a nice sweet spot between space and 
speed, plus it wouldn't bother the user of the type with "cache or no 
cache?".

 Perhaps it's possible to create such types?
 CaseSensitiveString,
 CaseInsensitiveString,
 and the standard string type being unspecified.

I think the standard type can be assumed to be case-sensitive since that's 
how it already operates.

Jan 10 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Monday, January 10, 2011 10:46:55 Nick Sabalausky wrote:
 "Jim" <bitcirkel yahoo.com> wrote in message
 news:igfado$11g3$1 digitalmars.com...
 
 While writing and dealing with all that code I realized something: While
 programmers are usually heavily conditioned to think of case-sensitivity
 as
 an attribute of the comparison, it's very frequent that the deciding
 factor
 in which comparison to use is *not* the comparison itself but *what*
 gets compared. And in those cases, you have to use the awful strategy
 of "relying
 on convention" to make sure you get it right in *every* place that
 particular data gets compared.

 
 You have a point. Your case-sensitivity-aware string types will guarantee
 correctness in a large and complex program. I like that. Ideally though,
 they would only be compile-time constraints (i.e. not carrying any other
 data).

 
 Not carrying any other data means not caching the lowercase version, which
 means recreating the lowercase version more than necessary. So it's the
 classic speed vs. space tradeoff. I would think there would be cases where
 they get compared enough for that to make a difference, although I suppose
 we'd really need benchmarks to see. OTOH, there are certainly cases (such
 as my original motivating case) where the extra space is not an issue at
 all.

Why is caching necessary? Shouldn't you just be able to use std.string.icmp() 
for comparisons internally, avoiding any copying or caching? That shouldn't
need 
to duplicate anything. Or do you need to cache the lower-case version for 
something other than comparison?

- Jonathan M Davis

Jan 10 2011

"Nick Sabalausky" <a a.a> writes:

"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
news:mailman.538.1294690510.4748.digitalmars-d puremagic.com...
 On Monday, January 10, 2011 10:46:55 Nick Sabalausky wrote:
 "Jim" <bitcirkel yahoo.com> wrote in message
 news:igfado$11g3$1 digitalmars.com...

 While writing and dealing with all that code I realized something: 
 While
 programmers are usually heavily conditioned to think of 
 case-sensitivity
 as
 an attribute of the comparison, it's very frequent that the deciding
 factor
 in which comparison to use is *not* the comparison itself but *what*
 gets compared. And in those cases, you have to use the awful strategy
 of "relying
 on convention" to make sure you get it right in *every* place that
 particular data gets compared.

 You have a point. Your case-sensitivity-aware string types will 
 guarantee
 correctness in a large and complex program. I like that. Ideally 
 though,
 they would only be compile-time constraints (i.e. not carrying any 
 other
 data).

 Not carrying any other data means not caching the lowercase version, 
 which
 means recreating the lowercase version more than necessary. So it's the
 classic speed vs. space tradeoff. I would think there would be cases 
 where
 they get compared enough for that to make a difference, although I 
 suppose
 we'd really need benchmarks to see. OTOH, there are certainly cases (such
 as my original motivating case) where the extra space is not an issue at
 all.

 Why is caching necessary? Shouldn't you just be able to use 
 std.string.icmp()
 for comparisons internally, avoiding any copying or caching? That 
 shouldn't need
 to duplicate anything. Or do you need to cache the lower-case version for
 something other than comparison?

Anything involving toHash (such as using Insensitive as an AA key) requires 
the use of a lower-case version. For anything else, you're probably right, 
icmp should be fine (Although I'd like to do a benchmark of icmp vs regular 
string comparison).

Jan 10 2011

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:igfq0n$22u8$1 digitalmars.com...
 "Jonathan M Davis" <jmdavisProg gmx.com> wrote in message 
 news:mailman.538.1294690510.4748.digitalmars-d puremagic.com...
 On Monday, January 10, 2011 10:46:55 Nick Sabalausky wrote:
 "Jim" <bitcirkel yahoo.com> wrote in message
 news:igfado$11g3$1 digitalmars.com...

 While writing and dealing with all that code I realized something: 
 While
 programmers are usually heavily conditioned to think of 
 case-sensitivity
 as
 an attribute of the comparison, it's very frequent that the deciding
 factor
 in which comparison to use is *not* the comparison itself but *what*
 gets compared. And in those cases, you have to use the awful strategy
 of "relying
 on convention" to make sure you get it right in *every* place that
 particular data gets compared.

 You have a point. Your case-sensitivity-aware string types will 
 guarantee
 correctness in a large and complex program. I like that. Ideally 
 though,
 they would only be compile-time constraints (i.e. not carrying any 
 other
 data).

 Not carrying any other data means not caching the lowercase version, 
 which
 means recreating the lowercase version more than necessary. So it's the
 classic speed vs. space tradeoff. I would think there would be cases 
 where
 they get compared enough for that to make a difference, although I 
 suppose
 we'd really need benchmarks to see. OTOH, there are certainly cases 
 (such
 as my original motivating case) where the extra space is not an issue at
 all.

 Why is caching necessary? Shouldn't you just be able to use 
 std.string.icmp()
 for comparisons internally, avoiding any copying or caching? That 
 shouldn't need
 to duplicate anything. Or do you need to cache the lower-case version for
 something other than comparison?

 Anything involving toHash (such as using Insensitive as an AA key) 
 requires the use of a lower-case version. For anything else, you're 
 probably right, icmp should be fine (Although I'd like to do a benchmark 
 of icmp vs regular string comparison).

The other (smaller) thing I'm concerned about, and the reason I keep 
mentioning I want to do a benchmark, is that case-sensitive comparisons are 
able to use the max-speed memcmp whereas icmp has to not only avoid memcmp 
but also stick extra logic in the middle of the loop (and currently that 
also includes calls to utf.decode). Without checking, I'm not sure if that 
wouldn't make multiple calls to icmp slower than just making a lower-case 
version once and using case-sensitive comparisons on them.

Jan 10 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-10 13:46:55 -0500, "Nick Sabalausky" <a a.a> said:

 Not carrying any other data means not caching the lowercase version, which
 means recreating the lowercase version more than necessary. So it's the
 classic speed vs. space tradeoff. I would think there would be cases where
 they get compared enough for that to make a difference, although I suppose
 we'd really need benchmarks to see. OTOH, there are certainly cases (such as
 my original motivating case) where the extra space is not an issue at all.

Comparing the lowercase version of two strings works well for ASCII, 
but I doubt it works very well for Unicode. Case conversion is not 
bidirectional (for instance both 'SS' and '�' become 'ss' in lowercase 
in German), and what's equal and what is not sometime depends on the 
language.

Checking for string equality is a special case of the Unicode collation 
algorithm. I'm not sure if implementing this part of Unicode is in the 
scope of Phobos (probably not), but short of having Unicode support it 
seems the utility of having a special string type dedicated to ASCII 
case-insensitive strings is quite limited.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 10 2011

"Nick Sabalausky" <a a.a> writes:

"Michel Fortin" <michel.fortin michelf.com> wrote in message 
news:igft2o$291g$1 digitalmars.com...
 On 2011-01-10 13:46:55 -0500, "Nick Sabalausky" <a a.a> said:

 Not carrying any other data means not caching the lowercase version, 
 which
 means recreating the lowercase version more than necessary. So it's the
 classic speed vs. space tradeoff. I would think there would be cases 
 where
 they get compared enough for that to make a difference, although I 
 suppose
 we'd really need benchmarks to see. OTOH, there are certainly cases (such 
 as
 my original motivating case) where the extra space is not an issue at 
 all.

 Comparing the lowercase version of two strings works well for ASCII, but I 
 doubt it works very well for Unicode. Case conversion is not bidirectional 
 (for instance both 'SS' and '�' become 'ss' in lowercase in German), and 
 what's equal and what is not sometime depends on the language.

 Checking for string equality is a special case of the Unicode collation 
 algorithm. I'm not sure if implementing this part of Unicode is in the 
 scope of Phobos (probably not), but short of having Unicode support it 
 seems the utility of having a special string type dedicated to ASCII 
 case-insensitive strings is quite limited.

Yea, Phobos doesn't even have folding-case functions yet (which is why I 
keep saying "lowercase"). (This is actually one place where Phobos is still 
behind Tango.)

However, I really think that's orthogonal to this since std.string.icmp 
doesn't handle such non-english issues either (just the english a-z, A-Z, 
and that's it). When Phobos does become multilingual, then this can be 
updated to follow suit.

One question though: Aren't 'SS' and '�' considered the same in german 
anyway? If so, how does using lowercase instead of folding case cause a 
problem?

Jan 10 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 10.01.2011 22:24, schrieb Nick Sabalausky:
 One question though: Aren't 'SS' and '�' considered the same in german
 anyway? If so, how does using lowercase instead of folding case cause a
 problem?

It's not the same, see
http://en.wikipedia.org/wiki/%C3%9F#Current_usage_in_German

Jan 10 2011

"Nick Sabalausky" <a a.a> writes:

"Nick Sabalausky" <a a.a> wrote in message 
news:igftls$2afo$1 digitalmars.com...
 since std.string.icmp doesn't handle such non-english issues either (just 
 the english a-z, A-Z, and that's it).

Figured this belongs in bugilla: 
http://d.puremagic.com/issues/show_bug.cgi?id=5443

Jan 10 2011

Daniel Gibson <metalcaedes gmail.com> writes:

Am 10.01.2011 22:16, schrieb Michel Fortin:
 On 2011-01-10 13:46:55 -0500, "Nick Sabalausky" <a a.a> said:

 Not carrying any other data means not caching the lowercase version, which
 means recreating the lowercase version more than necessary. So it's the
 classic speed vs. space tradeoff. I would think there would be cases where
 they get compared enough for that to make a difference, although I suppose
 we'd really need benchmarks to see. OTOH, there are certainly cases (such as
 my original motivating case) where the extra space is not an issue at all.

 Comparing the lowercase version of two strings works well for ASCII, but I
doubt
 it works very well for Unicode. Case conversion is not bidirectional (for
 instance both 'SS' and '�' become 'ss' in lowercase in German),

That's wrong, '�' is lowercase and no upper-case version is used really, though 
one exists in Unicode (see: http://en.wikipedia.org/wiki/Capital_%C3%9F ).
Sometimes, when stuff is written in fullcaps, '�' (which never is the first 
character of a word) is replaced by "SS", but I wouldn't expect that to be
equal 
on icmp(). (e.g. "Strings vergleichen macht keinen Spa�!" vs "STRINGS 
VERGLEICHEN MACHT KEINEN SPASS!")

Anyway, in this case comparing in lowercase would cause no trouble at all 
(comparing in uppercase however would, if you don't use the 
not-really-existing-but-defined-by-unicode-Capital-�).

I don't know if there may be problems with special characters in other 
languages, though.

 and what's equal
 and what is not sometime depends on the language.

 Checking for string equality is a special case of the Unicode collation
 algorithm. I'm not sure if implementing this part of Unicode is in the scope of
 Phobos (probably not), but short of having Unicode support it seems the utility
 of having a special string type dedicated to ASCII case-insensitive strings is
 quite limited.

Jan 10 2011

"Nick Sabalausky" <a a.a> writes:

"Daniel Gibson" <metalcaedes gmail.com> wrote in message 
news:igfttf$p46$1 digitalmars.com...
 Am 10.01.2011 22:16, schrieb Michel Fortin:
 Comparing the lowercase version of two strings works well for ASCII, but 
 I doubt
 it works very well for Unicode. Case conversion is not bidirectional (for
 instance both 'SS' and '�' become 'ss' in lowercase in German),

 That's wrong, '�' is lowercase and no upper-case version is used really, 
 though one exists in Unicode (see: 
 http://en.wikipedia.org/wiki/Capital_%C3%9F ).
 Sometimes, when stuff is written in fullcaps, '�' (which never is the 
 first character of a word) is replaced by "SS", but I wouldn't expect that 
 to be equal on icmp(). (e.g. "Strings vergleichen macht keinen Spa�!" vs 
 "STRINGS VERGLEICHEN MACHT KEINEN SPASS!")

 Anyway, in this case comparing in lowercase would cause no trouble at all 
 (comparing in uppercase however would, if you don't use the 
 not-really-existing-but-defined-by-unicode-Capital-�).

 I don't know if there may be problems with special characters in other 
 languages, though.

One of the unicode documents mentions an example involving the three greek 
"sigma" letters, although I never quite understood how it demonstrated the 
inadequacy of using lower-case:

http://www.unicode.org/reports/tr21/tr21-5.html#Caseless_Matching
...Which references some information near the end of this sub-section:
http://www.unicode.org/reports/tr21/tr21-5.html#Introduction

Actually, what probably should be stored is a *normalized* folding-case 
version of the string, because then (if I understand correctly) memcmp could 
be used. I don't think memcpy technically works on non-ASCII (unless it's in 
normalized form).

In any case, Phobos doesn't currently handle any of that stuff at all, so my 
case-insensitive string type wouldn't be taking things backwards in that 
regard.

Jan 10 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-10 17:00:40 -0500, "Nick Sabalausky" <a a.a> said:

 Actually, what probably should be stored is a *normalized* folding-case
 version of the string, because then (if I understand correctly) memcmp could
 be used. I don't think memcpy technically works on non-ASCII (unless it's in
 normalized form).

Actually, what would be compatible with Unicode collation is probably 
an array of collation elements. This would also make it useful to sort 
case-insensitively, not just testing for equality.

Details (perhaps too much details):
<http://unicode.org/reports/tr10/>


 In any case, Phobos doesn't currently handle any of that stuff at all, so my
 case-insensitive string type wouldn't be taking things backwards in that
 regard.

Probably not.


-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 10 2011

"Nick Sabalausky" <a a.a> writes:

"Michel Fortin" <michel.fortin michelf.com> wrote in message 
news:iggi34$cqm$1 digitalmars.com...
 On 2011-01-10 17:00:40 -0500, "Nick Sabalausky" <a a.a> said:

 Actually, what probably should be stored is a *normalized* folding-case
 version of the string, because then (if I understand correctly) memcmp 
 could
 be used. I don't think memcpy technically works on non-ASCII (unless it's 
 in
 normalized form).

 Actually, what would be compatible with Unicode collation is probably an 
 array of collation elements. This would also make it useful to sort 
 case-insensitively, not just testing for equality.

 Details (perhaps too much details):
 <http://unicode.org/reports/tr10/>

I'll have to take a look. (I don't even know what "Unicode collation" is 
:P )

 In any case, Phobos doesn't currently handle any of that stuff at all, so 
 my
 case-insensitive string type wouldn't be taking things backwards in that
 regard.

 Probably not.

Actually I was (mostly) wrong: There's already unicode-compatible 
upper/lower case characters functions in std.uni, and these are used by 
std.string.toupper and std.string.tolower (as well as their InPlace 
coutnerparts). Which incidentally means that my Insentitive type worked 
correctly for unicode all along (I didn't know about icmp when I wrote it) - 
well, except for things that need folding case instead of lower-case.

Not only that, but Andrei just committed a unicode fix for icmp (bugzilla 
5443) just a few hours ago. (That's gotta be a record for fastest issue 
fixed in D's bug tracker!)

Still no folding-case though.

Jan 10 2011

Michel Fortin <michel.fortin michelf.com> writes:

On 2011-01-10 16:30:50 -0500, Daniel Gibson <metalcaedes gmail.com> said:

 Am 10.01.2011 22:16, schrieb Michel Fortin:
 Comparing the lowercase version of two strings works well for ASCII, 
 but I doubt
 it works very well for Unicode. Case conversion is not bidirectional (for
 instance both 'SS' and '�' become 'ss' in lowercase in German),

 
 That's wrong, '�' is lowercase and no upper-case version is used 
 really, though one exists in Unicode (see: 
 http://en.wikipedia.org/wiki/Capital_%C3%9F ).

Oops, you're right. I got it backward.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/

Jan 10 2011

D Programming

C/C++ Programming

Other

digitalmars.D - RFC: Case-Insensitive Strings (And usually they really do have case)