digitalmars.D - RFC: Case-Insensitive Strings (And usually they really do *have* case)
- Nick Sabalausky (58/58) Jan 09 2011 Imagine things like Windows filenames or tokenized BASIC code (There's
- Jim (4/80) Jan 09 2011 I'm a firm believer of alternative B: Store the string with its original...
- Jonathan M Davis (23/35) Jan 09 2011 I don't know that it's over-engineering. I expect that there _are_ cases...
- Nick Sabalausky (44/87) Jan 09 2011 The problem certainly cropped up for me. See below.
- Jim (5/12) Jan 10 2011 You have a point. Your case-sensitivity-aware string types will guarante...
- Nick Sabalausky (28/46) Jan 10 2011 Not carrying any other data means not caching the lowercase version, whi...
- Jonathan M Davis (6/32) Jan 10 2011 Why is caching necessary? Shouldn't you just be able to use std.string.i...
- Nick Sabalausky (6/47) Jan 10 2011 Anything involving toHash (such as using Insensitive as an AA key) requi...
- Nick Sabalausky (9/59) Jan 10 2011 The other (smaller) thing I'm concerned about, and the reason I keep
- Michel Fortin (15/21) Jan 10 2011 Comparing the lowercase version of two strings works well for ASCII,
- Nick Sabalausky (12/33) Jan 10 2011 Yea, Phobos doesn't even have folding-case functions yet (which is why I...
- Daniel Gibson (3/6) Jan 10 2011 It's not the same, see
- Nick Sabalausky (4/6) Jan 10 2011 Figured this belongs in bugilla:
- Daniel Gibson (12/29) Jan 10 2011 That's wrong, 'ß' is lowercase and no upper-case version is used really,...
- Nick Sabalausky (15/33) Jan 10 2011 One of the unicode documents mentions an example involving the three gre...
- Michel Fortin (11/18) Jan 10 2011 Actually, what would be compatible with Unicode collation is probably
- Nick Sabalausky (14/31) Jan 10 2011 I'll have to take a look. (I don't even know what "Unicode collation" is...
- Michel Fortin (6/15) Jan 10 2011 Oops, you're right. I got it backward.
Imagine things like Windows filenames or tokenized BASIC code (There's probably plenty of other examples too, for instance, I came across this issue when dealing with GOLD's pre-defined character set names: http://www.devincook.com/goldparser/doc/grammars/character-sets.htm ). These things *do* have a specific case, but the vast majority of time their *comparisons* should be done insensitively. This can lead to awkwardness. For instance, how do you store it? A. If you store it as lower-case: You lose the information of the original case. The casing information may not be important for comparisons, but it *is* an inherent part of the data. For instance, if the string gets output in some way, it's no longer what it originally was. B. If you store it preserving case: You have to make sure to remember to convert to lower-case on most comparisons. Which has three problems: 1. Easy to forget and cause a bug. 2. May end up converting lower-case more often than really needed. 3. What if the comparison occurs inside a routine that has no idea what it's supposed to be? You'd have to remember to convert to lower-case when passing to that routine. And then within the routines you're back to the problems of strategy "A". C. If you store both the original and lower-case versions: You have to take special care to maintain two separate variables and keep them both in-sync. All three options are kinda sucky. And in any way, how to do distinguish between variables that are suposed to be case-insensitive and ones that aren't? Naming convention? I think treating case-sensitivity as an attribute of the data rather than the comparison, and utilizing the type-system, cleans things up considerably. So I've created a struct to represent a "case-insensitive string" (templated for string, wstring and dstring). I think it would be a good thing to have in phobos, and would like to submit it for review. It's currently named Insensitive/WInsensitive/DInsensitive, but I'm thinking now that maybe it should be renamed something like istring/iwstring/idstring. They're used like this: // Create auto normalString = "Hello"; auto insensitiveString = Insensitive("Hello"); // Convert normalString = insensitiveString.toString(); insensitiveString = Insensitive(normalString); // Compare assert(Insensitive("Hello") == Insensitive("hELLo")); // Mixed-comparisons are deliberately disallowed at // compile-time since the intent is ambiguous: assert(normalString == insensitiveString); // Compile Error! // To disambiguate: assert(Insensitive(normalString) == insensitiveString)); // Case-Insensitive assert(normalString == insensitiveString.toString())); // Case-Sensitive // Preserves original case: writeln(Insensitive("hELLo")); // Output: hELLo It also works as expected as the key of an assoc array (which I've personally found useful). Also, it doesn't convert to lower-case until it actually needs to, and once it does, it caches it. Here is the source: http://www.dsource.org/projects/semitwist/browser/trunk/src/semitwist/util/text.d#L758 While it is in the middle of my big grab-bag library, I'm pretty sure it doesn't rely on anything that's not in phobos (unless I've missed something, in which case, I can correct). The unittests for it do use my own unittesting routines, but it should be trivial to see how they'd convert to unittest{}, assert() and assertPred.
Jan 09 2011
I'm a firm believer of alternative B: Store the string with its original case, unless it's particularly important to do otherwise. The cost of case-insensitive comparison is REALLY small. Anytime you are to compare two strings ask yourself whether case-sensitive or case-insensitive is what you need. Have no inclination to prefer one type of comparison to the other. Problem solved. Bloat avoided. Creating specific types of strings that carry with them data on how they are to be interpreted is over-engineering, solving a problem that doesn't exist. Nick Sabalausky Wrote:Imagine things like Windows filenames or tokenized BASIC code (There's probably plenty of other examples too, for instance, I came across this issue when dealing with GOLD's pre-defined character set names: http://www.devincook.com/goldparser/doc/grammars/character-sets.htm ). These things *do* have a specific case, but the vast majority of time their *comparisons* should be done insensitively. This can lead to awkwardness. For instance, how do you store it? A. If you store it as lower-case: You lose the information of the original case. The casing information may not be important for comparisons, but it *is* an inherent part of the data. For instance, if the string gets output in some way, it's no longer what it originally was. B. If you store it preserving case: You have to make sure to remember to convert to lower-case on most comparisons. Which has three problems: 1. Easy to forget and cause a bug. 2. May end up converting lower-case more often than really needed. 3. What if the comparison occurs inside a routine that has no idea what it's supposed to be? You'd have to remember to convert to lower-case when passing to that routine. And then within the routines you're back to the problems of strategy "A". C. If you store both the original and lower-case versions: You have to take special care to maintain two separate variables and keep them both in-sync. All three options are kinda sucky. And in any way, how to do distinguish between variables that are suposed to be case-insensitive and ones that aren't? Naming convention? I think treating case-sensitivity as an attribute of the data rather than the comparison, and utilizing the type-system, cleans things up considerably. So I've created a struct to represent a "case-insensitive string" (templated for string, wstring and dstring). I think it would be a good thing to have in phobos, and would like to submit it for review. It's currently named Insensitive/WInsensitive/DInsensitive, but I'm thinking now that maybe it should be renamed something like istring/iwstring/idstring. They're used like this: // Create auto normalString = "Hello"; auto insensitiveString = Insensitive("Hello"); // Convert normalString = insensitiveString.toString(); insensitiveString = Insensitive(normalString); // Compare assert(Insensitive("Hello") == Insensitive("hELLo")); // Mixed-comparisons are deliberately disallowed at // compile-time since the intent is ambiguous: assert(normalString == insensitiveString); // Compile Error! // To disambiguate: assert(Insensitive(normalString) == insensitiveString)); // Case-Insensitive assert(normalString == insensitiveString.toString())); // Case-Sensitive // Preserves original case: writeln(Insensitive("hELLo")); // Output: hELLo It also works as expected as the key of an assoc array (which I've personally found useful). Also, it doesn't convert to lower-case until it actually needs to, and once it does, it caches it. Here is the source: http://www.dsource.org/projects/semitwist/browser/trunk/src/semitwist/util/text.d#L758 While it is in the middle of my big grab-bag library, I'm pretty sure it doesn't rely on anything that's not in phobos (unless I've missed something, in which case, I can correct). The unittests for it do use my own unittesting routines, but it should be trivial to see how they'd convert to unittest{}, assert() and assertPred.
Jan 09 2011
On Sunday 09 January 2011 13:52:53 Jim wrote:I'm a firm believer of alternative B: Store the string with its original case, unless it's particularly important to do otherwise. The cost of case-insensitive comparison is REALLY small. Anytime you are to compare two strings ask yourself whether case-sensitive or case-insensitive is what you need. Have no inclination to prefer one type of comparison to the other. Problem solved. Bloat avoided. Creating specific types of strings that carry with them data on how they are to be interpreted is over-engineering, solving a problem that doesn't exist.I don't know that it's over-engineering. I expect that there _are_ cases where it makes perfect sense. However, in the general case, I do think that it's overkill. std.string.icmp() deals with most cases where you need case- insensitive comparison, but what if you really do need it everywhere as in Nick's case? Or what about cases like associative arrays, which you can't give a comparison function to (it has to be built into the type)? I don't think that the cost of the comparison here is really the issue. If that's all you need, then there's icmp(). It's when you need the same comparison _everywhere_ that it matters. I don't know if this is worthwhile having in Phobos or not, but it might be. It definitely seems to me that in the general case, this sort of solution should not be used but that there are definitely cases where it would be extremely useful and get a rid of a host of potential problems. Now, I do wonder if perhaps this idea should be generalized to any type and/or a given binary predicate to test for equality rather than making it specific to strings and case-insensitive comparison. The issue here (in the general sense) is that you want to wrap a type so that it will use a specialized comparison function everywhere, and that seems like it should be highly generalizable, though doing it right may require alias this, which _is_ rather buggy at the moment. Still, it would seem to me to be worthwhile to consider how it could and/or should be generalized. - Jonathan M Davis
Jan 09 2011
"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message news:mailman.529.1294620116.4748.digitalmars-d puremagic.com...On Sunday 09 January 2011 13:52:53 Jim wrote:The problem certainly cropped up for me. See below.I'm a firm believer of alternative B: Store the string with its original case, unless it's particularly important to do otherwise. The cost of case-insensitive comparison is REALLY small. Anytime you are to compare two strings ask yourself whether case-sensitive or case-insensitive is what you need. Have no inclination to prefer one type of comparison to the other. Problem solved. Bloat avoided. Creating specific types of strings that carry with them data on how they are to be interpreted is over-engineering, solving a problem that doesn't exist.I don't know that it's over-engineering. I expect that there _are_ cases where it makes perfect sense. However, in the general case, I do think that it's overkill. std.string.icmp() deals with most cases where you need case- insensitive comparison, but what if you really do need it everywhere as in Nick's case? Or what about cases like associative arrays, which you can't give a comparison function to (it has to be built into the type)? I don't think that the cost of the comparison here is really the issue. If that's all you need, then there's icmp(). It's when you need the same comparison _everywhere_ that it matters.Right. FWIW, this is the scenario that originally inspired it: I was working on some code that processes a grammar definition (specifically, the BNF-style language that GOLD uses). The grammar definition language includes various pre-defined character sets, and allows user-defined character sets. These character sets are referred to be name (such as "AlphaNumeric", "Whitespace", or "Cyrillic Supplementary"). But those character set names are defined by the language as being case-insensitive. (Come to think of it, all the names of everything are case-insensitive: tokens, char sets, meta-data, etc.) Due to the usage patterns in my program, it made sense to store the character sets as an associative array where the keys were the names of the character sets and the values were the data describing what characters were included in the set. And there were plenty of other AAs for other things that were all indexed by case-insensitive names. Obviously, I needed to ensure that *all* comparisons involving these names were done insensitively (to do otherwise would be a bug). And there were also times when I needed to display one of the character set names (error messages, for instance), and it would be awkward not to show the original capitalization. So I had to follow the convention of always creating lower-case versions to insert into and lookup from the AAs, and also maintain the original names (and be very careful about all of it). This quickly became an awful mess. But as soon as I wrote and started using the "Insensitive" type, the whole thing was simplified enormously. While writing and dealing with all that code I realized something: While programmers are usually heavily conditioned to think of case-sensitivity as an attribute of the comparison, it's very frequent that the deciding factor in which comparison to use is *not* the comparison itself but *what* gets compared. And in those cases, you have to use the awful strategy of "relying on convention" to make sure you get it right in *every* place that particular data gets compared. It's analogous to how Asm has separate operators for signed-integer, unsigned-integer and floating-point math: Many times a specific memory location is *supposed* to be treated as either signed, unsigned or float in *all* operations they participate in. Handling this with separate operators that behave differently is notoriously tedious and error-prone. That's why non-asm languages, even ones as low-level as C, employ a type system which allows the programmer to *force* a variable to always, and automatically, be used with the proper version of the given operator. Heck, it all goes back to the whole original point of a type-system.Now, I do wonder if perhaps this idea should be generalized to any type and/or a given binary predicate to test for equality rather than making it specific to strings and case-insensitive comparison. The issue here (in the general sense) is that you want to wrap a type so that it will use a specialized comparison function everywhere, and that seems like it should be highly generalizable, though doing it right may require alias this, which _is_ rather buggy at the moment. Still, it would seem to me to be worthwhile to consider how it could and/or should be generalized.That's a very good thought. Have to say I'm not really sure offhand how I'd do that though.
Jan 09 2011
While writing and dealing with all that code I realized something: While programmers are usually heavily conditioned to think of case-sensitivity as an attribute of the comparison, it's very frequent that the deciding factor in which comparison to use is *not* the comparison itself but *what* gets compared. And in those cases, you have to use the awful strategy of "relying on convention" to make sure you get it right in *every* place that particular data gets compared.You have a point. Your case-sensitivity-aware string types will guarantee correctness in a large and complex program. I like that. Ideally though, they would only be compile-time constraints (i.e. not carrying any other data). Perhaps it's possible to create such types? CaseSensitiveString, CaseInsensitiveString, and the standard string type being unspecified.
Jan 10 2011
"Jim" <bitcirkel yahoo.com> wrote in message news:igfado$11g3$1 digitalmars.com...Not carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all. But that does raise a point worth exploring: Maybe it should support both caching and non-caching behavior? I can think of at least two different ways to do this: 1. Provide a "useCache" template parameter. Problem with that though, is that the caching and non-caching versions end up being two distinct types. 2. Make "useCache" a runtime property. Since the cached foldingCase var is entirely private, the "useCache" flag could probably be encoded as a specific value of foldingCase.length and/or foldingCase.ptr. So the overhead of the non-cached version would only be size_t*2 instead of size_t*2+str.length. One other idea: It's possible to change it so the folding case version is only created and cached when toHash is used. For other cases, it could just use icmp. So the storage space (beyond .length and .ptr) for the folding case version would never be allocated unless you're using it as an AA key (or some other use that needs hashing). Some benchmarking would be needed to be certain, but I think that would hit a nice sweet spot between space and speed, plus it wouldn't bother the user of the type with "cache or no cache?".While writing and dealing with all that code I realized something: While programmers are usually heavily conditioned to think of case-sensitivity as an attribute of the comparison, it's very frequent that the deciding factor in which comparison to use is *not* the comparison itself but *what* gets compared. And in those cases, you have to use the awful strategy of "relying on convention" to make sure you get it right in *every* place that particular data gets compared.You have a point. Your case-sensitivity-aware string types will guarantee correctness in a large and complex program. I like that. Ideally though, they would only be compile-time constraints (i.e. not carrying any other data).Perhaps it's possible to create such types? CaseSensitiveString, CaseInsensitiveString, and the standard string type being unspecified.I think the standard type can be assumed to be case-sensitive since that's how it already operates.
Jan 10 2011
On Monday, January 10, 2011 10:46:55 Nick Sabalausky wrote:"Jim" <bitcirkel yahoo.com> wrote in message news:igfado$11g3$1 digitalmars.com...Why is caching necessary? Shouldn't you just be able to use std.string.icmp() for comparisons internally, avoiding any copying or caching? That shouldn't need to duplicate anything. Or do you need to cache the lower-case version for something other than comparison? - Jonathan M DavisNot carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all.While writing and dealing with all that code I realized something: While programmers are usually heavily conditioned to think of case-sensitivity as an attribute of the comparison, it's very frequent that the deciding factor in which comparison to use is *not* the comparison itself but *what* gets compared. And in those cases, you have to use the awful strategy of "relying on convention" to make sure you get it right in *every* place that particular data gets compared.You have a point. Your case-sensitivity-aware string types will guarantee correctness in a large and complex program. I like that. Ideally though, they would only be compile-time constraints (i.e. not carrying any other data).
Jan 10 2011
"Jonathan M Davis" <jmdavisProg gmx.com> wrote in message news:mailman.538.1294690510.4748.digitalmars-d puremagic.com...On Monday, January 10, 2011 10:46:55 Nick Sabalausky wrote:Anything involving toHash (such as using Insensitive as an AA key) requires the use of a lower-case version. For anything else, you're probably right, icmp should be fine (Although I'd like to do a benchmark of icmp vs regular string comparison)."Jim" <bitcirkel yahoo.com> wrote in message news:igfado$11g3$1 digitalmars.com...Why is caching necessary? Shouldn't you just be able to use std.string.icmp() for comparisons internally, avoiding any copying or caching? That shouldn't need to duplicate anything. Or do you need to cache the lower-case version for something other than comparison?Not carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all.While writing and dealing with all that code I realized something: While programmers are usually heavily conditioned to think of case-sensitivity as an attribute of the comparison, it's very frequent that the deciding factor in which comparison to use is *not* the comparison itself but *what* gets compared. And in those cases, you have to use the awful strategy of "relying on convention" to make sure you get it right in *every* place that particular data gets compared.You have a point. Your case-sensitivity-aware string types will guarantee correctness in a large and complex program. I like that. Ideally though, they would only be compile-time constraints (i.e. not carrying any other data).
Jan 10 2011
"Nick Sabalausky" <a a.a> wrote in message news:igfq0n$22u8$1 digitalmars.com..."Jonathan M Davis" <jmdavisProg gmx.com> wrote in message news:mailman.538.1294690510.4748.digitalmars-d puremagic.com...The other (smaller) thing I'm concerned about, and the reason I keep mentioning I want to do a benchmark, is that case-sensitive comparisons are able to use the max-speed memcmp whereas icmp has to not only avoid memcmp but also stick extra logic in the middle of the loop (and currently that also includes calls to utf.decode). Without checking, I'm not sure if that wouldn't make multiple calls to icmp slower than just making a lower-case version once and using case-sensitive comparisons on them.On Monday, January 10, 2011 10:46:55 Nick Sabalausky wrote:Anything involving toHash (such as using Insensitive as an AA key) requires the use of a lower-case version. For anything else, you're probably right, icmp should be fine (Although I'd like to do a benchmark of icmp vs regular string comparison)."Jim" <bitcirkel yahoo.com> wrote in message news:igfado$11g3$1 digitalmars.com...Why is caching necessary? Shouldn't you just be able to use std.string.icmp() for comparisons internally, avoiding any copying or caching? That shouldn't need to duplicate anything. Or do you need to cache the lower-case version for something other than comparison?Not carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all.While writing and dealing with all that code I realized something: While programmers are usually heavily conditioned to think of case-sensitivity as an attribute of the comparison, it's very frequent that the deciding factor in which comparison to use is *not* the comparison itself but *what* gets compared. And in those cases, you have to use the awful strategy of "relying on convention" to make sure you get it right in *every* place that particular data gets compared.You have a point. Your case-sensitivity-aware string types will guarantee correctness in a large and complex program. I like that. Ideally though, they would only be compile-time constraints (i.e. not carrying any other data).
Jan 10 2011
On 2011-01-10 13:46:55 -0500, "Nick Sabalausky" <a a.a> said:Not carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all.Comparing the lowercase version of two strings works well for ASCII, but I doubt it works very well for Unicode. Case conversion is not bidirectional (for instance both 'SS' and 'ß' become 'ss' in lowercase in German), and what's equal and what is not sometime depends on the language. Checking for string equality is a special case of the Unicode collation algorithm. I'm not sure if implementing this part of Unicode is in the scope of Phobos (probably not), but short of having Unicode support it seems the utility of having a special string type dedicated to ASCII case-insensitive strings is quite limited. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 10 2011
"Michel Fortin" <michel.fortin michelf.com> wrote in message news:igft2o$291g$1 digitalmars.com...On 2011-01-10 13:46:55 -0500, "Nick Sabalausky" <a a.a> said:Yea, Phobos doesn't even have folding-case functions yet (which is why I keep saying "lowercase"). (This is actually one place where Phobos is still behind Tango.) However, I really think that's orthogonal to this since std.string.icmp doesn't handle such non-english issues either (just the english a-z, A-Z, and that's it). When Phobos does become multilingual, then this can be updated to follow suit. One question though: Aren't 'SS' and 'ß' considered the same in german anyway? If so, how does using lowercase instead of folding case cause a problem?Not carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all.Comparing the lowercase version of two strings works well for ASCII, but I doubt it works very well for Unicode. Case conversion is not bidirectional (for instance both 'SS' and 'ß' become 'ss' in lowercase in German), and what's equal and what is not sometime depends on the language. Checking for string equality is a special case of the Unicode collation algorithm. I'm not sure if implementing this part of Unicode is in the scope of Phobos (probably not), but short of having Unicode support it seems the utility of having a special string type dedicated to ASCII case-insensitive strings is quite limited.
Jan 10 2011
Am 10.01.2011 22:24, schrieb Nick Sabalausky:One question though: Aren't 'SS' and 'ß' considered the same in german anyway? If so, how does using lowercase instead of folding case cause a problem?It's not the same, see http://en.wikipedia.org/wiki/%C3%9F#Current_usage_in_German
Jan 10 2011
"Nick Sabalausky" <a a.a> wrote in message news:igftls$2afo$1 digitalmars.com...since std.string.icmp doesn't handle such non-english issues either (just the english a-z, A-Z, and that's it).Figured this belongs in bugilla: http://d.puremagic.com/issues/show_bug.cgi?id=5443
Jan 10 2011
Am 10.01.2011 22:16, schrieb Michel Fortin:On 2011-01-10 13:46:55 -0500, "Nick Sabalausky" <a a.a> said:That's wrong, 'ß' is lowercase and no upper-case version is used really, though one exists in Unicode (see: http://en.wikipedia.org/wiki/Capital_%C3%9F ). Sometimes, when stuff is written in fullcaps, 'ß' (which never is the first character of a word) is replaced by "SS", but I wouldn't expect that to be equal on icmp(). (e.g. "Strings vergleichen macht keinen Spaß!" vs "STRINGS VERGLEICHEN MACHT KEINEN SPASS!") Anyway, in this case comparing in lowercase would cause no trouble at all (comparing in uppercase however would, if you don't use the not-really-existing-but-defined-by-unicode-Capital-ß). I don't know if there may be problems with special characters in other languages, though.Not carrying any other data means not caching the lowercase version, which means recreating the lowercase version more than necessary. So it's the classic speed vs. space tradeoff. I would think there would be cases where they get compared enough for that to make a difference, although I suppose we'd really need benchmarks to see. OTOH, there are certainly cases (such as my original motivating case) where the extra space is not an issue at all.Comparing the lowercase version of two strings works well for ASCII, but I doubt it works very well for Unicode. Case conversion is not bidirectional (for instance both 'SS' and 'ß' become 'ss' in lowercase in German),and what's equal and what is not sometime depends on the language. Checking for string equality is a special case of the Unicode collation algorithm. I'm not sure if implementing this part of Unicode is in the scope of Phobos (probably not), but short of having Unicode support it seems the utility of having a special string type dedicated to ASCII case-insensitive strings is quite limited.
Jan 10 2011
"Daniel Gibson" <metalcaedes gmail.com> wrote in message news:igfttf$p46$1 digitalmars.com...Am 10.01.2011 22:16, schrieb Michel Fortin:One of the unicode documents mentions an example involving the three greek "sigma" letters, although I never quite understood how it demonstrated the inadequacy of using lower-case: http://www.unicode.org/reports/tr21/tr21-5.html#Caseless_Matching ...Which references some information near the end of this sub-section: http://www.unicode.org/reports/tr21/tr21-5.html#Introduction Actually, what probably should be stored is a *normalized* folding-case version of the string, because then (if I understand correctly) memcmp could be used. I don't think memcpy technically works on non-ASCII (unless it's in normalized form). In any case, Phobos doesn't currently handle any of that stuff at all, so my case-insensitive string type wouldn't be taking things backwards in that regard.Comparing the lowercase version of two strings works well for ASCII, but I doubt it works very well for Unicode. Case conversion is not bidirectional (for instance both 'SS' and 'ß' become 'ss' in lowercase in German),That's wrong, 'ß' is lowercase and no upper-case version is used really, though one exists in Unicode (see: http://en.wikipedia.org/wiki/Capital_%C3%9F ). Sometimes, when stuff is written in fullcaps, 'ß' (which never is the first character of a word) is replaced by "SS", but I wouldn't expect that to be equal on icmp(). (e.g. "Strings vergleichen macht keinen Spaß!" vs "STRINGS VERGLEICHEN MACHT KEINEN SPASS!") Anyway, in this case comparing in lowercase would cause no trouble at all (comparing in uppercase however would, if you don't use the not-really-existing-but-defined-by-unicode-Capital-ß). I don't know if there may be problems with special characters in other languages, though.
Jan 10 2011
On 2011-01-10 17:00:40 -0500, "Nick Sabalausky" <a a.a> said:Actually, what probably should be stored is a *normalized* folding-case version of the string, because then (if I understand correctly) memcmp could be used. I don't think memcpy technically works on non-ASCII (unless it's in normalized form).Actually, what would be compatible with Unicode collation is probably an array of collation elements. This would also make it useful to sort case-insensitively, not just testing for equality. Details (perhaps too much details): <http://unicode.org/reports/tr10/>In any case, Phobos doesn't currently handle any of that stuff at all, so my case-insensitive string type wouldn't be taking things backwards in that regard.Probably not. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Jan 10 2011
"Michel Fortin" <michel.fortin michelf.com> wrote in message news:iggi34$cqm$1 digitalmars.com...On 2011-01-10 17:00:40 -0500, "Nick Sabalausky" <a a.a> said:I'll have to take a look. (I don't even know what "Unicode collation" is :P )Actually, what probably should be stored is a *normalized* folding-case version of the string, because then (if I understand correctly) memcmp could be used. I don't think memcpy technically works on non-ASCII (unless it's in normalized form).Actually, what would be compatible with Unicode collation is probably an array of collation elements. This would also make it useful to sort case-insensitively, not just testing for equality. Details (perhaps too much details): <http://unicode.org/reports/tr10/>Actually I was (mostly) wrong: There's already unicode-compatible upper/lower case characters functions in std.uni, and these are used by std.string.toupper and std.string.tolower (as well as their InPlace coutnerparts). Which incidentally means that my Insentitive type worked correctly for unicode all along (I didn't know about icmp when I wrote it) - well, except for things that need folding case instead of lower-case. Not only that, but Andrei just committed a unicode fix for icmp (bugzilla 5443) just a few hours ago. (That's gotta be a record for fastest issue fixed in D's bug tracker!) Still no folding-case though.In any case, Phobos doesn't currently handle any of that stuff at all, so my case-insensitive string type wouldn't be taking things backwards in that regard.Probably not.
Jan 10 2011
On 2011-01-10 16:30:50 -0500, Daniel Gibson <metalcaedes gmail.com> said:Am 10.01.2011 22:16, schrieb Michel Fortin:Oops, you're right. I got it backward. -- Michel Fortin michel.fortin michelf.com http://michelf.com/Comparing the lowercase version of two strings works well for ASCII, but I doubt it works very well for Unicode. Case conversion is not bidirectional (for instance both 'SS' and 'ß' become 'ss' in lowercase in German),That's wrong, 'ß' is lowercase and no upper-case version is used really, though one exists in Unicode (see: http://en.wikipedia.org/wiki/Capital_%C3%9F ).
Jan 10 2011