
digitalmars.D.announce - std.string.toUpper() for greek characters

reply "Minas" <minas_mina1990 hotmail.co.uk> writes:
Currently, toUpper() (and probably toLower()) does not handle 
Greek characters correctly. I fixed toUpper() by adding another 
function for Greek characters:

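// called if (c >= 0x387 && c <= 0x3CE)
dchar toUpperGreek(dchar c)
{
	if( c >= 'α' && c <= 'ω' )
	{
		// 'ς' (final sigma) has no capital 32 code points below it
		if( c == 'ς' )
			c = 'Σ';
		else
			c -= 32;
	}
	else
	{
		// accented letters do not follow the fixed offset
		dchar[dchar] map;
		map['ά'] = 'Ά';
		map['έ'] = 'Έ';
		map['ή'] = 'Ή';
		map['ί'] = 'Ί';
		map['ϊ'] = 'Ϊ';
		map['ΐ'] = 'Ϊ';
		map['ό'] = 'Ό';
		map['ύ'] = 'Ύ';
		map['ϋ'] = 'Ϋ';
		map['ΰ'] = 'Ϋ';
		map['ώ'] = 'Ώ';
		
		// characters without an entry (e.g. capitals) stay unchanged;
		// a plain map[c] would throw a RangeError for them
		if( auto p = c in map )
			c = *p;
	}
	
	return c;
}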

Then, in toUpper()
{
    ....
    if (c >= 0x387 && c <= 0x3CE)
       c = toUpperGreek(c);
    ///
}

Do you think it should stay like that, or should I copy-paste it 
into the body of toUpper()?

I'm going to fix toLower() as well and make a pull request.
Oct 03 2012
next sibling parent reply "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
 Currently, toUpper() (and probably toLower()) does not handle 
 greek characters correctly. I fixed toUpper() by making a 
 another function for greek characters

 // called if (c >= 0x387 && c <= 0x3CE)
 dchar toUpperGreek(dchar c)
 {
 	if( c >= 'α' && c <= 'ω' )
 	{
 		if( c == 'ς' )
 			c = 'Σ';
 		else
 			c -= 32;
 	}
 	else
 	{
 		dchar[dchar] map;
 		map['ά'] = 'Ά';
 		map['έ'] = 'Έ';
 		map['ή'] = 'Ή';
 		map['ί'] = 'Ί';
 		map['ϊ'] = 'Ϊ';
 		map['ΐ'] = 'Ϊ';
 		map['ό'] = 'Ό';
 		map['ύ'] = 'Ύ';
 		map['ϋ'] = 'Ϋ';
 		map['ΰ'] = 'Ϋ';
 		map['ώ'] = 'Ώ';
 		
 		c = map[c];
 	}
 	
 	return c;
 }

 Then, in toUpper()
 {
    ....
    if (c >= 0x387 && c <= 0x3CE)
       c = toUpperGreek()...
    ///
 }

 Do you think it should stay like that or I should copy-paste it 
 in the body of toUpper()?

 I'm going to fix toLower() as well and make a pull request.
A switch with 11 cases is very likely going to be a lot faster than the hash table approach you're using, especially since the AA is not cached and will be dynamically allocated on every call.
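A minimal sketch of what the suggested switch-based version could look like. The name toUpperGreekSwitch is made up for illustration; the eleven mappings are the ones from the post above, and the decision to return unmapped characters unchanged is an assumption, not something settled in the thread.

```d
import std.stdio : writeln;

// Hypothetical switch-based variant of toUpperGreek (illustrative name).
// No associative array is built, so nothing is allocated per call.
dchar toUpperGreekSwitch(dchar c) pure nothrow @safe
{
    if (c >= 'α' && c <= 'ω')
    {
        // Final sigma has no capital 32 code points below it.
        if (c == 'ς')
            return 'Σ';
        return cast(dchar)(c - 32);
    }

    switch (c)
    {
        case 'ά': return 'Ά';
        case 'έ': return 'Έ';
        case 'ή': return 'Ή';
        case 'ί': return 'Ί';
        case 'ϊ': return 'Ϊ';
        case 'ΐ': return 'Ϊ';
        case 'ό': return 'Ό';
        case 'ύ': return 'Ύ';
        case 'ϋ': return 'Ϋ';
        case 'ΰ': return 'Ϋ';
        case 'ώ': return 'Ώ';
        default:  return c; // assumption: no known mapping, leave unchanged
    }
}

void main()
{
    writeln(toUpperGreekSwitch('ά'));
    writeln(toUpperGreekSwitch('ς'));
}
```

Because the function has no state, it can also be marked pure and evaluated at compile time, which the associative-array version cannot easily offer.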
Oct 03 2012
parent "Minas" <minas_mina1990 hotmail.co.uk> writes:
On Wednesday, 3 October 2012 at 11:03:27 UTC, Jakob Ovrum wrote:
 On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
 Currently, toUpper() (and probably toLower()) does not handle 
 greek characters correctly. I fixed toUpper() by making a 
 another function for greek characters

 // called if (c >= 0x387 && c <= 0x3CE)
 dchar toUpperGreek(dchar c)
 {
 	if( c >= 'α' && c <= 'ω' )
 	{
 		if( c == 'ς' )
 			c = 'Σ';
 		else
 			c -= 32;
 	}
 	else
 	{
 		dchar[dchar] map;
 		map['ά'] = 'Ά';
 		map['έ'] = 'Έ';
 		map['ή'] = 'Ή';
 		map['ί'] = 'Ί';
 		map['ϊ'] = 'Ϊ';
 		map['ΐ'] = 'Ϊ';
 		map['ό'] = 'Ό';
 		map['ύ'] = 'Ύ';
 		map['ϋ'] = 'Ϋ';
 		map['ΰ'] = 'Ϋ';
 		map['ώ'] = 'Ώ';
 		
 		c = map[c];
 	}
 	
 	return c;
 }

 Then, in toUpper()
 {
   ....
   if (c >= 0x387 && c <= 0x3CE)
      c = toUpperGreek()...
   ///
 }

 Do you think it should stay like that or I should copy-paste 
 it in the body of toUpper()?

 I'm going to fix toLower() as well and make a pull request.
A switch with 11 cases is very likely going to be a lot faster than the hash table approach you're using, especially since the AA is not cached and will be dynamically allocated on every call.
I had this in mind as well. I will change it, thanks.
Oct 03 2012
prev sibling next sibling parent reply "Paulo Pinto" <pjmlp progtools.org> writes:
On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
 Currently, toUpper() (and probably toLower()) does not handle 
 greek characters correctly. I fixed toUpper() by making a 
 another function for greek characters

 // called if (c >= 0x387 && c <= 0x3CE)
 dchar toUpperGreek(dchar c)
 {
 	if( c >= 'α' && c <= 'ω' )
 	{
 		if( c == 'ς' )
 			c = 'Σ';
 		else
 			c -= 32;
 	}
 	else
 	{
 		dchar[dchar] map;
 		map['ά'] = 'Ά';
 		map['έ'] = 'Έ';
 		map['ή'] = 'Ή';
 		map['ί'] = 'Ί';
 		map['ϊ'] = 'Ϊ';
 		map['ΐ'] = 'Ϊ';
 		map['ό'] = 'Ό';
 		map['ύ'] = 'Ύ';
 		map['ϋ'] = 'Ϋ';
 		map['ΰ'] = 'Ϋ';
 		map['ώ'] = 'Ώ';
 		
 		c = map[c];
 	}
 	
 	return c;
 }

 Then, in toUpper()
 {
    ....
    if (c >= 0x387 && c <= 0x3CE)
       c = toUpperGreek()...
    ///
 }

 Do you think it should stay like that or I should copy-paste it 
 in the body of toUpper()?

 I'm going to fix toLower() as well and make a pull request.
Regarding toLower(), a problem I see is how to handle sigma (Σ), because it has two possible lower-case representations depending on where it occurs in a word. But of course toLower() works on a per-character basis, so it cannot know what the receiver plans to do with the character.

--
Paulo
Oct 03 2012
parent reply "Minas" <minas_mina1990 hotmail.co.uk> writes:
On Wednesday, 3 October 2012 at 13:27:25 UTC, Paulo Pinto wrote:
 On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
 Currently, toUpper() (and probably toLower()) does not handle 
 greek characters correctly. I fixed toUpper() by making a 
 another function for greek characters

 // called if (c >= 0x387 && c <= 0x3CE)
 dchar toUpperGreek(dchar c)
 {
 	if( c >= 'α' && c <= 'ω' )
 	{
 		if( c == 'ς' )
 			c = 'Σ';
 		else
 			c -= 32;
 	}
 	else
 	{
 		dchar[dchar] map;
 		map['ά'] = 'Ά';
 		map['έ'] = 'Έ';
 		map['ή'] = 'Ή';
 		map['ί'] = 'Ί';
 		map['ϊ'] = 'Ϊ';
 		map['ΐ'] = 'Ϊ';
 		map['ό'] = 'Ό';
 		map['ύ'] = 'Ύ';
 		map['ϋ'] = 'Ϋ';
 		map['ΰ'] = 'Ϋ';
 		map['ώ'] = 'Ώ';
 		
 		c = map[c];
 	}
 	
 	return c;
 }

 Then, in toUpper()
 {
   ....
   if (c >= 0x387 && c <= 0x3CE)
      c = toUpperGreek()...
   ///
 }

 Do you think it should stay like that or I should copy-paste 
 it in the body of toUpper()?

 I'm going to fix toLower() as well and make a pull request.
Regarding toLower() a problem I see is how to handle sigma (Σ), because it has two possible lower case representations depending where it occurs in a word. But of course toLower() is working on character basis, so it cannot know what the receiver plans to do with the character. -- Paulo
Yeah, that's a problem indeed. I will make it become 'σ', and the programmer can change the final 'σ' to 'ς' himself.
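A rough sketch of what "changing the final 'σ' yourself" might look like on the caller's side. The helper name fixFinalSigma and the rule it applies (a 'σ' preceded by a letter and not followed by one) are simplifying assumptions; the Unicode Final_Sigma condition is more detailed than this.

```d
import std.stdio : writeln;
import std.uni : isAlpha;

// Hypothetical helper: turn a word-final 'σ' into 'ς' after lower-casing.
// Simplified rule: the sigma must follow a letter and must not precede one.
dstring fixFinalSigma(dstring s)
{
    auto result = s.dup;
    foreach (i, ref c; result)
    {
        if (c != 'σ')
            continue;
        immutable precededByLetter = i > 0 && isAlpha(s[i - 1]);
        immutable followedByLetter = i + 1 < s.length && isAlpha(s[i + 1]);
        if (precededByLetter && !followedByLetter)
            c = 'ς';
    }
    return result.idup;
}

void main()
{
    writeln(fixFinalSigma("οδοσ"d));
}
```

Working on a dstring keeps indexing per code point, so the neighbours of each sigma can be inspected directly.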
Oct 03 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Oct-12 18:11, Minas wrote:
 On Wednesday, 3 October 2012 at 13:27:25 UTC, Paulo Pinto wrote:
 On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
 Currently, toUpper() (and probably toLower()) does not handle greek
 characters correctly. I fixed toUpper() by making a another function
 for greek characters
And a lot of others. And it is handwritten and thus unmaintainable.
 // called if (c >= 0x387 && c <= 0x3CE)
 dchar toUpperGreek(dchar c)
 {
     if( c >= 'α' && c <= 'ω' )
     {
         if( c == 'ς' )
             c = 'Σ';
         else
             c -= 32;
     }
     else
     {
         dchar[dchar] map;
         map['ά'] = 'Ά';
         map['έ'] = 'Έ';
         map['ή'] = 'Ή';
         map['ί'] = 'Ί';
         map['ϊ'] = 'Ϊ';
         map['ΐ'] = 'Ϊ';
         map['ό'] = 'Ό';
         map['ύ'] = 'Ύ';
         map['ϋ'] = 'Ϋ';
         map['ΰ'] = 'Ϋ';
         map['ώ'] = 'Ώ';

         c = map[c];
     }

     return c;
 }

 Then, in toUpper()
 {
   ....
   if (c >= 0x387 && c <= 0x3CE)
      c = toUpperGreek()...
   ///
 }

 Do you think it should stay like that or I should copy-paste it in
 the body of toUpper()?

 I'm going to fix toLower() as well and make a pull request.
I'm *strongly* against bringing these temporary hacks into the standard library. The fact that toUpper/toLower are outdated is bad, but fixing it by piling hack after hack on this mess of if/else branches is not the way out. Also, I hope you haven't lost a few hundred over here:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Agreek%3A%5D+%26+%5B%3ACasedLetter%3A%5D&g=

The way out is a proper implementation that is a direct derivative of the Unicode character database. I've spent this summer on exactly this proper 'cure' for these kinds of problems with Unicode support in D.

Admittedly, my reworked Unicode support probably won't hit the next release (2.061); it needs to go through review etc. But I'm determined to get it into 2.062. I'd suggest keeping your personal version around for the moment and then just switching to the new std one. However, given our release schedule, this could be anywhere from 4 months to 1 year away :)
 Regarding toLower() a problem I see is how to handle sigma (Σ),
 because it has two possible lower case representations depending where
 it occurs in a word. But of course toLower() is working on character
 basis, so it cannot know what the receiver plans to do with the
 character.

 --
 Paulo
Yeah, that's a problem indeed. I will make it become 'σ', and the programmer can change the final'σ' to 'ς' himself.
I think this is one of a small number of special cases; see the full list here:
ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt
(Handling these subtleties is commonly called 'tailoring' and, I believe, is currently out of reach for the std library.)

Currently my toLower will produce 'σ', as prescribed by the simple case folding rules (i.e. the ones that can only map 1:1). I have a case-insensitive string comparison that does 1:n mappings as well (and is going to replace the current icmp), but it doesn't do tailoring.

One day we may add some language-specific tailoring (via locales etc.), but we'd better do it carefully.

-- Dmitry Olshansky
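For readers on a Phobos release that already ships the table-driven std.uni, a small sketch of the 1:1 per-character mappings described above. It assumes the single-character overloads std.uni.toUpper and std.uni.toLower; it does not show the reworked code from this thread, only how the resulting behaviour can be checked.

```d
import std.stdio : writeln;
import std.uni : toLower, toUpper;

void main()
{
    dchar accented = 'ά';
    dchar finalSigma = 'ς';
    dchar capitalSigma = 'Σ';

    // Simple (1:1) mappings: one code point in, one code point out.
    writeln(toUpper(accented));     // accented alpha to its capital
    writeln(toUpper(finalSigma));   // final sigma to capital sigma
    writeln(toLower(capitalSigma)); // capital sigma to non-final 'σ'
}
```

The last call is the case discussed above: a per-character toLower has no context, so it yields 'σ' and leaves any final-sigma handling to the caller.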
Oct 03 2012
prev sibling next sibling parent reply "David Nadlinger" <see klickverbot.at> writes:
On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
 Do you think it should stay like that or I should copy-paste it 
 in the body of toUpper()?

 I'm going to fix toLower() as well and make a pull request.
In any case, you should coordinate with Dmitry Olshansky, since he is (was?) working on Unicode support in Phobos. David
Oct 03 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Oct-12 20:13, David Nadlinger wrote:
 On Wednesday, 3 October 2012 at 10:56:11 UTC, Minas wrote:
 Do you think it should stay like that or I should copy-paste it in the
 body of toUpper()?

 I'm going to fix toLower() as well and make a pull request.
 In any case, you should coordinate with Dmitry Olshansky, since he is
 working on Unicode support in Phobos.
Fixed ;)

-- Dmitry Olshansky
Oct 03 2012
prev sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 10/03/2012 03:56 AM, Minas wrote:
 Currently, toUpper() (and probably toLower()) does not handle greek
 characters correctly. I fixed toUpper() by making a another function for
 greek characters

 // called if (c >= 0x387 && c <= 0x3CE)
 dchar toUpperGreek(dchar c)
 {
 if( c >= 'α' && c <= 'ω' )
 {
 if( c == 'ς' )
 c = 'Σ';
 else
 c -= 32;
 }
 else
 {
 dchar[dchar] map;
 map['ά'] = 'Ά';
 map['έ'] = 'Έ';
 map['ή'] = 'Ή';
 map['ί'] = 'Ί';
 map['ϊ'] = 'Ϊ';
 map['ΐ'] = 'Ϊ';
 map['ό'] = 'Ό';
 map['ύ'] = 'Ύ';
 map['ϋ'] = 'Ϋ';
 map['ΰ'] = 'Ϋ';
 map['ώ'] = 'Ώ';

 c = map[c];
 }

 return c;
 }

 Then, in toUpper()
 {
 ....
 if (c >= 0x387 && c <= 0x3CE)
 c = toUpperGreek()...
 ///
 }

 Do you think it should stay like that or I should copy-paste it in the
 body of toUpper()?

 I'm going to fix toLower() as well and make a pull request.
I don't want to detract from the usefulness of these functions, but toupper and tolower have been two of the strangest functions in computer history. It is amazing that they are still accepted, because they are useful in very limited situations, and those situations are becoming rarer as more and more systems support Unicode.

Two quick examples:

1) How should this string be capitalized in a scientific article?

    "Anti-obesity effects of α-lipoic acid"

I don't think the α in there should be upper-cased.

2) How should this name be capitalized in a list of names?

    "Ali"

It completely depends on the writing system of that string itself, not even the current locale. (There are two uppercases that I know of, which can be considered as correct: "ALI" and "ALİ".)

I agree that your toUpper() and toLower() will be useful in many contexts but will necessarily do the wrong thing in others.

Ali
Oct 03 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Oct-12 21:10, Ali Çehreli wrote:
 On 10/03/2012 03:56 AM, Minas wrote:
 Currently, toUpper() (and probably toLower()) does not handle greek
 characters correctly. I fixed toUpper() by making a another function for
 greek characters

 // called if (c >= 0x387 && c <= 0x3CE)
 dchar toUpperGreek(dchar c)
 {
 if( c >= 'α' && c <= 'ω' )
 {
 if( c == 'ς' )
 c = 'Σ';
 else
 c -= 32;
 }
 else
 {
 dchar[dchar] map;
 map['ά'] = 'Ά';
 map['έ'] = 'Έ';
 map['ή'] = 'Ή';
 map['ί'] = 'Ί';
 map['ϊ'] = 'Ϊ';
 map['ΐ'] = 'Ϊ';
 map['ό'] = 'Ό';
 map['ύ'] = 'Ύ';
 map['ϋ'] = 'Ϋ';
 map['ΰ'] = 'Ϋ';
 map['ώ'] = 'Ώ';

 c = map[c];
 }

 return c;
 }

 Then, in toUpper()
 {
 ....
 if (c >= 0x387 && c <= 0x3CE)
 c = toUpperGreek()...
 ///
 }

 Do you think it should stay like that or I should copy-paste it in the
 body of toUpper()?

 I'm going to fix toLower() as well and make a pull request.
I don't want to detract from the usefulness of these functions, but toupper and tolower have been two of the strangest functions in computer history. It is amazing that they are still accepted, because they are useful in very limited situations, and those situations are becoming rarer as more and more systems support Unicode.
Glad you showed up! The one use case that is by far the most useful is case-insensitive matching. That being said, this doesn't and shouldn't involve toLower/toUpper (applied to the whole string) anywhere: not only is that multipass instead of single pass, it is also wrong, like a lot of other ASCII-minded carry-overs. Other than this, and being used as some intermediate sanitized form, I don't think it has much use.
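A minimal sketch of case-insensitive matching that compares the two strings directly, without first building upper- or lower-cased copies. It assumes std.uni.icmp as found in current Phobos; the wrapper name equalsIgnoreCase is made up for illustration.

```d
import std.stdio : writeln;
import std.uni : icmp;

// Hypothetical convenience wrapper around std.uni.icmp.
// icmp returns 0 when the strings are equal ignoring case.
bool equalsIgnoreCase(string a, string b)
{
    return icmp(a, b) == 0;
}

void main()
{
    writeln(equalsIgnoreCase("ΌΎ", "όύ"));
    writeln(equalsIgnoreCase("abc", "abd"));
}
```

Like the rest of the approach discussed here, this applies no locale-specific tailoring.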
 Two quick examples:

 1) How should this string be capitalized in a scientific article?

    "Anti-obesity effects of α-lipoic acid"
There are a lot of lousy conversions. The basic toLower is defined in the standard; try it here: http://unicode.org/cldr/utility/transform.jsp?a=Upper&b=Anti-obesity+effects+of+%CE%B1-lipoic+acid
 I don't think the α in there should be upper-cased.
Depends on why you are doing it in the first place :) Capitalizing a scientific article strikes me as kind of strange as well.
 2) How should this name be capitalized in a list of names?

    "Ali"
Again, what's the goal of capitalization here? Simplifying matching afterwards? Then it doesn't matter, as long as its lousiness is acceptable (rarely so) and it stays within the system, i.e. doesn't leak away.
 It completely depends on the writing system of that string itself, not
 even the current locale. (There are two uppercases that I know of, which
 can be considered as correct: "ALI" and "ALİ".)
One word: tailoring. Basically any software made in Turkey has to do ALİ :) Only half-joking.
 I agree that your toUpper() and toLower() will be useful in many
 contexts but will necessarily do the wrong thing in others.

 Ali
-- Dmitry Olshansky
Oct 03 2012
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 10/03/2012 11:21 AM, Dmitry Olshansky wrote:
 On 03-Oct-12 21:10, Ali Çehreli wrote:
 On 10/03/2012 03:56 AM, Minas wrote:
[...]
 map['ά'] = 'Ά';
[...]
 Glad you showed up!
Why? Do I whine better? :p
 One and by far the most useful case is case-insensitive matching.
 That being said this doesn't and shouldn't involve toLower/toUpper (and
 on the whole string) anywhere. Not only it's multipass vs single pass
 but it's also wrong. As a lot of other ASCII-minded carry-overs.
As I have written at other times, there is an experimental alphabet-aware string library (unfortunately even the code is in Turkish at this time).

That library has the following struct for order-comparing alphabet-aware strings and characters:

struct Order
{
    /**
     * Represents comparing characters at their bases.
     *
     * This value indicates that 'a' and 'b' are different. 'C' and 'c'
     * are the same according to this value. This value disregards upper
     * and lower cases.
     */
    int base;

    /**
     * Represents comparing characters by their accents.
     *
     * This value indicates that 'a' and 'â' are different. This value
     * disregards upper and lower cases.
     */
    int accent;

    /**
     * Represents comparing characters also by their upper and lower cases.
     *
     * Lower case letter comes before upper case.
     */
    int cased;
}

(Of course opCmp() cannot return that type. :( )

The idea is that only the application knows what type of comparison makes sense.

Ali
Oct 03 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 03-Oct-12 23:56, Ali Çehreli wrote:
 On 10/03/2012 11:21 AM, Dmitry Olshansky wrote:
  > On 03-Oct-12 21:10, Ali Çehreli wrote:
  >> On 10/03/2012 03:56 AM, Minas wrote:
 [...]
  >>> map['ά'] = 'Ά';
 [...]
  > Glad you showed up!

 Why? Do I whine better? :p
Well, that might be the case :) But honestly, because you pushed for some Unicode support back in the day. I'm currently looking around to see if there are obviously important things not covered in my project.
  > One and by far the most useful case is case-insensitive matching.
  > That being said this doesn't and shouldn't involve toLower/toUpper (and
  > on the whole string) anywhere. Not only it's multipass vs single pass
  > but it's also wrong. As a lot of other ASCII-minded carry-overs.

 As I have written at other times, there is an experimental
 alphabet-aware string library (unfortunately even the code is in Turkish
 at this time).
If we are talking about the order, then this is the way to go: http://unicode.org/reports/tr10/ Looks like it's one of the things I haven't implemented yet :(
 That library has the following struct for order-comparing alphabet-aware
 strings and characters:

 struct Order
 {
      /**
       * Represents comparing characters at their bases.
       *
       * This value indicates that 'a' and 'b' are different. 'C' and 'c'
       * are the same according to this value. This value disregards upper
       * and lower cases.
       */
      int base;

      /**
       * Represents comparing characters by their accents.
       *
       * This value indicates that 'a' and 'â' are different. This value
       * disregards upper and lower cases.
       */
      int accent;

      /**
       * Represents comparing characters also by their upper and lower
 cases.
       *
       * Lower case letter comes before upper case.
       */
      int cased;
 }

 (Of course opCmp() cannot return that type. :( )

 The idea is that only the application knows what type of comparison
 makes sense.
So instead the library does all of them? Ouch... I'm not sure I got the idea. -- Dmitry Olshansky
Oct 03 2012
parent Ali Çehreli <acehreli yahoo.com> writes:
On 10/03/2012 01:37 PM, Dmitry Olshansky wrote:
 On 03-Oct-12 23:56, Ali Çehreli wrote:
 If we are talking about the order then this is the way to go:
 http://unicode.org/reports/tr10/
Thank you. I wasn't aware of that long read. :)
 struct Order
 {
 int base;
 int accent;
 int cased;
 }

 (Of course opCmp() cannot return that type. :( )

 The idea is that only the application knows what type of comparison
 makes sense.
So instead library does all of them ? Ouch.. I'm not sure I got the idea.
The idea was that there would be AlphabetChar and AlphabetString that knew about what writing system they belonged to: AlphabetChar!en, AlphabetChar!tr, etc.

For example, while letter ç is a distinct letter in the Turkish alphabet, it is an accented form of c in most Latin-based alphabets. That affects the 'base' member above. On the other hand, â is an accented 'a' both in the Turkish and the Latin-based alphabets. So the 'base' comparison for â and a would be the same.

Collation takes the alphabet into account. Although AlphabetChar!en is not compatible with AlphabetChar!tr, they can be forced to be compared according to the collation information of any alphabet. So, that experimental library provides a number of alphabets with their own collation orders.

I see now that the library should have supported the Unicode document that you have linked above. I will do some reading. :)

Ali
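A toy sketch of how a three-level result like Order could be filled in for single characters. Everything except the Order field names is an assumption made for illustration: the Key struct, the tiny hard-coded table, the function names, and the reading that each level refines the previous one. It is not the experimental library's code.

```d
import std.stdio : writeln;

struct Order
{
    int base;   // base letters only
    int accent; // base, then accent
    int cased;  // base, then accent, then case (lower before upper)
}

// Hypothetical per-character sort key for a toy alphabet.
struct Key
{
    int base;
    int accent;
    int cased;
}

Key keyOf(dchar c)
{
    switch (c)
    {
        case 'a': return Key(0, 0, 0);
        case 'A': return Key(0, 0, 1);
        case 'â': return Key(0, 1, 0);
        case 'Â': return Key(0, 1, 1);
        case 'b': return Key(1, 0, 0);
        case 'B': return Key(1, 0, 1);
        default:  assert(0, "character not in the toy table");
    }
}

// Negative, zero or positive per level, like opCmp.
Order compareChars(dchar x, dchar y)
{
    immutable kx = keyOf(x), ky = keyOf(y);
    immutable base = kx.base - ky.base;
    immutable accent = base != 0 ? base : kx.accent - ky.accent;
    immutable cased = accent != 0 ? accent : kx.cased - ky.cased;
    return Order(base, accent, cased);
}

void main()
{
    writeln(compareChars('a', 'A'));
    writeln(compareChars('a', 'â'));
}
```

The application then picks the field it cares about, for example .base for accent- and case-insensitive lookup or .cased for a full ordering. A per-alphabet table (say, one where ç gets its own base value for Turkish) would replace keyOf.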
Oct 03 2012