digitalmars.D - Turkish 'I's can't D either
- Ali Cehreli (31/31) Aug 24 2009 You may be aware of the problems related to the consistency of the two s...
- Rainer Deyke (12/14) Aug 24 2009 This appears to be a library issue to me. If Phobos can't do this
- Ali Cehreli (6/9) Aug 25 2009 I started to see this at a more fundamental level. The Unicode letter I ...
- Rainer Deyke (6/16) Aug 25 2009 That's hardly the only case where unicode behavior is locale-dependent.
- Frank Benoit (2/6) Aug 25 2009 There are existing ICU bindings in the mango project, see dsource.org
- Walter Bright (7/7) Aug 24 2009 I think it's great that you're doing a Turkish programming tutorial! I
- Ali Cehreli (15/19) Aug 25 2009 It is a very interesting story. The Turkish 'i's have caused lots of tro...
- Daniel Keep (14/44) Aug 25 2009 To me, it seems that the issue is that the library routines don't have
- Ameer Armaly (8/14) Sep 03 2009 If I'm understanding you correctly, then the hash function would treat
- Stewart Gordon (16/19) Sep 04 2009
- Ali Cehreli (4/18) Sep 05 2009 I think there should be three i's to solve problems like being able to c...
- Michel Fortin (45/112) Aug 25 2009 Perhaps this could be of some inspiration. In Cocoa you can pass a
- Daniel Keep (4/27) Aug 25 2009 You're assuming it's possible and practical to write every method that
- Michel Fortin (14/17) Aug 25 2009 No, only the base methods. You can build on them to create other
You may be aware of the problems related to the consistency of the two separate letter 'I's in the Turkish alphabet (and the alphabets that are based on the Turkish alphabet). Lowercase and uppercase versions of the two are consistent in whether they have a dot or not: http://en.wikipedia.org/wiki/Turkish_I

The Turkish alphabet, being so close to the western alphabets but not close enough, is in a strange position. (Strangely, the same applies geographically, politically, socially, etc. as well... ;)) Computer systems *almost* work for Turkish, but not for those two letters.

I love the fact that D allows Unicode letters in the source code and that it natively supports Unicode. I cannot stress enough how important this is. That is the single biggest reason why I decided to finally write a programming tutorial. Thank you to all who proposed and implemented those features!

Back to the Turquoise 'I's... What is a programmer to do when writing programs that deal with Turkish letters?

a) Accept that Phobos too has this age-old behavior that is a result of premature optimization (i.e. this code in tolower: c + (cast(char)'a' - 'A'))
b) Accept that the problem is unsolvable because the letter I has two minuscules, and the letter i has two majuscules anyway, and that the intent is not always clear
c) Accept the Turkish alphabet as being pathological (merely for being in the minority!), and use a Turkish version of Phobos or some other library
d) Solve the problem with locale support

Is option d possible with today's systems? Whose responsibility is this anyway? OS? Language? Program? Something else? Since alphanumerical ordering is also of interest, I think this has something to do with locales.

Is there a way for a program to work with Turkish letters and ensure that the following program produces the expected output of 'dotless i', 'I with dot', and 0?

    import std.stdio;
    import std.string;
    import std.c.locale;
    import std.uni;

    void main()
    {
        const char * result = setlocale(LC_ALL, "tr_TR.UTF-8");
        assert(result);

        writeln(toUniLower('I'));
        writeln(toUniUpper('i'));
        writeln(indexOf("I", '\u0131', // dotless i
                        (CaseSensitive).no));
    }

This is a practical question. I really want to be able to work with Turkish... :)

Thank you,
Ali
Aug 24 2009
Ali Cehreli wrote:
> Is option d possible with today's systems? Whose responsibility is this anyway? OS? Language? Program? Something else?

This appears to be a library issue to me. If Phobos can't do this properly, your options are basically:

1. Find a third party library solution.
2. Write your own solution.
2a. Write your own solution, and try to get it into Phobos.
2b. Write your own solution, and release it as a third-party library so that other people can use it.

I know ICU can use different case mappings for different locales, but I don't think it has D bindings.

--
Rainer Deyke - rainerd eldwood.com
Aug 24 2009
Rainer Deyke Wrote:
> This appears to be a library issue to me.

I started to see this at a more fundamental level. The Unicode letter I (dotless capital i) has two possible lowercases and letter i has two possible uppercases. A chain of historical events appears to have produced a crippled system: the application can't know how to lowercase or uppercase those. Having three separate 'i's would keep things elegant and correct, but the ASCII i and I have been in use for Turkish documents for decades now.

> I know ICU can use different case mappings for different locales, but I don't think it has D bindings.

Under my limited understanding, that seems to contradict what Walter mentions in the other comment: locales or no locales? I will investigate more. :)

Thanks!
Ali
Aug 25 2009
Ali Cehreli wrote:
> Rainer Deyke Wrote:
> > This appears to be a library issue to me.
>
> I started to see this at a more fundamental level. The Unicode letter I (dotless capital i) has two possible lowercases

According to ICU, Lithuanian sometimes uses a third.

> and letter i has two possible uppercases. The chain of some historical events appears to have produced a crippled system: the application can't know how to lowercase or uppercase those.

That's hardly the only case where Unicode behavior is locale-dependent. For example, collating order varies widely between languages.

--
Rainer Deyke - rainerd eldwood.com
Aug 25 2009
Rainer Deyke wrote:
> I know ICU can use different case mappings for different locales, but I don't think it has D bindings.

There are existing ICU bindings in the mango project, see dsource.org
Aug 25 2009
I think it's great that you're doing a Turkish programming tutorial! I can't help you, though, with details of the Turkish language because I have no idea how it works.

The only thing I can suggest is that using setlocale for Turkish is the wrong way, as Unicode was supposed to get away from that. tolower really is only for ASCII. But toUniLower should work right with Turkish, though I don't know what right is for that case.
Aug 24 2009
Walter Bright Wrote:
> with details of the Turkish language because I have no idea how it works.

It is a very interesting story. The Turkish 'i's have caused lots of trouble, even hardcoded conditionals in at least the early Java libraries that checked whether the locale was Turkish. Even Unicode is in a strange position because two Unicode code points have two separate upper and lower cases. (I don't know whether there are other alphabets in such a situation.)

> tolower really is only for ASCII. But the toUniLower should work right with Turkish, though I don't know what right is for that case.

The current implementation of toUniLower() favors the ASCII lowercasing of 'I' over the Turkish one (similar with toUniUpper() for i):

    dchar toUniLower(dchar c)
    {
        if (c >= 'A' && c <= 'Z')
        {
            c += 32;
        }

An application would need a separate set of toUniLower() and friends to be able to work in Turkish. I don't think the issue is big enough for Phobos to tackle with a solution similar to CaseSensitive:

    toUniLower('I', (Alphabet).tr);

Instead, a wrapper around toUniLower() should be used...

Ali
Aug 25 2009
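The wrapper idea in the post above could look roughly like the following. This is only a sketch under assumptions: the names toLowerTr and toUpperTr are hypothetical, and it calls toLower and toUpper from std.uni in current Phobos rather than the 2009-era toUniLower and toUniUpper discussed in the thread. Only the four Turkish special cases are intercepted; everything else falls through to the library.

```d
import std.stdio : writeln;
import std.uni : toLower, toUpper;

/// Hypothetical wrapper: lowercase one code point using Turkish rules.
dchar toLowerTr(dchar c)
{
    switch (c)
    {
        case 'I':      return '\u0131'; // I -> ı (dotless i)
        case '\u0130': return 'i';      // İ -> i
        default:       return toLower(c); // defer to Phobos otherwise
    }
}

/// Hypothetical wrapper: uppercase one code point using Turkish rules.
dchar toUpperTr(dchar c)
{
    switch (c)
    {
        case 'i':      return '\u0130'; // i -> İ (I with dot above)
        case '\u0131': return 'I';      // ı -> I
        default:       return toUpper(c); // defer to Phobos otherwise
    }
}

void main()
{
    writeln(toLowerTr('I')); // ı
    writeln(toUpperTr('i')); // İ
    writeln(toLowerTr('A')); // a
}
```

Because the wrappers hardcode Turkish behavior, they suit programs that process Turkish text only; mixed-language text still needs some way to say which rule applies to which string.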
Ali Cehreli wrote:
> Walter Bright Wrote:
> > with details of the Turkish language because I have no idea how it works.
>
> It is a very interesting story. The Turkish 'i's have caused lots of trouble, even hardcoded conditionals in at least the early Java libraries that checked whether the locale was Turkish. Even the Unicode is in a strange position because two Unicode code points have two separate upper and lower cases. (I don't know whether there are other alphabets in such a situation.)
>
> > tolower really is only for ASCII. But the toUniLower should work right with Turkish, though I don't know what right is for that case.
>
> The current implementation of toUniLower() favors the ASCII lowercasing of 'I' over the Turkish one (similar with toUniUpper() for i):
>
>     dchar toUniLower(dchar c)
>     {
>         if (c >= 'A' && c <= 'Z')
>         {
>             c += 32;
>         }
>
> An application would need a separate set of toUniLower() and friends to be able to work in Turkish. I don't think the issue is big enough for Phobos to tackle with a solution similar to CaseSensitive:
>
>     toUniLower('I', (Alphabet).tr);
>
> Instead, a wrapper around toUniLower() should be used...
>
> Ali

To me, it seems that the issue is that the library routines don't have enough context to be able to correctly work out how to lowercase a string.

Having it locale-dependent seems like a bad idea; let's say I'm processing some internal data that uses string names; case is irrelevant so I lowercase them and look them up in a hash table. If there's an I in there, but the hashtable is stored as i, the program will break if run in a Turkish locale.

One thing I think the type system should be used more for is attaching more semantic information to data. So maybe the solution is to introduce something like a Text type that also stores the language of the text. Then the library methods WILL have the right context to know how to act.

Just a thought.
Aug 25 2009
"Daniel Keep" <daniel.keep.lists gmail.com> wrote in message news:h70aup$cjn$1 digitalmars.com...
> One thing I think the typesystem should be used more for is attaching more semantic information to data. So maybe the solution is to introduce something like a Text type that also stores the language of the text. Then the library methods WILL have the right context to know how to act. Just a thought.

If I'm understanding you correctly, then the hash function would treat Turkish i's the same as any other letter i, because the focus is on internal processing, but writef and friends would make the distinction because the text is meant to be read. Am I right?

Ameer
Sep 03 2009
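One way to picture the Text idea from the two posts above is a value that carries its language along with its characters. The sketch below is an assumption throughout: the struct name Text comes from the thread, but its fields, the "tr" tag, and the lowercase method are invented for illustration, and only the Turkish rule is special-cased. An empty language tag stands for internal data such as hash-table keys, which always get the language-neutral mapping.

```d
import std.stdio : writeln;
import std.uni : toLower;

/// Hypothetical type: a string together with the language it is written in.
struct Text
{
    string content;
    string language; // e.g. "tr"; empty means language-neutral internal data

    /// Lowercase according to the stored language, not a global locale.
    Text lowercase() const
    {
        string result;
        foreach (dchar c; content) // decode the UTF-8 string code point by code point
        {
            dchar lowered;
            if (language == "tr" && c == 'I')
                lowered = '\u0131';   // Turkish: I -> ı
            else if (language == "tr" && c == '\u0130')
                lowered = 'i';        // Turkish: İ -> i
            else
                lowered = toLower(c); // default mapping from Phobos
            result ~= lowered;        // appending a dchar re-encodes it as UTF-8
        }
        return Text(result, language);
    }
}

void main()
{
    // Text meant to be read by a Turkish user.
    writeln(Text("TITLE", "tr").lowercase().content); // tıtle
    // Internal key: same result whatever locale the program runs in.
    writeln(Text("TITLE", "").lowercase().content);   // title
}
```

With this split, the hash-table lookup described earlier keeps working in a Turkish environment, because the key never picks up the Turkish rule unless the data itself is tagged as Turkish.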
Walter Bright wrote:
> I think it's great that you're doing a Turkish programming tutorial! I can't help you, though, with details of the Turkish language because I have no idea how it works.
<snip>

It's quite simple actually. I is the uppercase form of ı. İ is the uppercase form of i.

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt lists them as

0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049

but this is inadequate: while it tells you how to case-convert ı and İ (that's what the 0049 and 0069 at the end are), you need to add a locale-specific rule to all this to convert I and i in Turkish.

Stewart.
Sep 04 2009
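To make the trailing fields of those UnicodeData.txt lines concrete, here is a small sketch that reads the simple uppercase and lowercase mappings out of one record. It assumes the semicolon-separated layout shown in the post, with the uppercase mapping in field 12 and the lowercase mapping in field 13 (counting from 0); SimpleCase and parseSimpleCase are hypothetical names, and an empty field is taken to mean "maps to itself".

```d
import std.array : split;
import std.conv : to;
import std.stdio : writefln;

/// Hypothetical record: the simple case mappings of one code point.
struct SimpleCase
{
    dchar codePoint;
    dchar upper; // same as codePoint when the field is empty
    dchar lower; // same as codePoint when the field is empty
}

/// Parse one line of UnicodeData.txt (fields separated by ';').
SimpleCase parseSimpleCase(string line)
{
    auto fields = line.split(";");
    dchar cp = cast(dchar) to!uint(fields[0], 16);

    // Read a hexadecimal mapping field, falling back to the code point itself.
    dchar mapping(size_t i)
    {
        return fields[i].length ? cast(dchar) to!uint(fields[i], 16) : cp;
    }

    return SimpleCase(cp, mapping(12), mapping(13));
}

void main()
{
    foreach (line; [
        "0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;",
        "0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049",
        "0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;",
        "0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049"])
    {
        auto r = parseSimpleCase(line);
        writefln("U+%04X  upper U+%04X  lower U+%04X",
                 cast(uint) r.codePoint, cast(uint) r.upper, cast(uint) r.lower);
    }
}
```

The parsed records show the gap Stewart describes: ı and İ each have a mapping back to the ASCII letters, but nothing in these four lines sends I to ı or i to İ, so the Turkish direction has to come from a separate, language-specific rule.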
Stewart Gordon Wrote:
> I is the uppercase form of ı. İ is the uppercase form of i.
>
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt lists them as
>
> 0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
> 0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
> 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
> 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
>
> but this is inadequate: while it tells you how to case-convert ı and İ (that's what the 0049 and 0069 at the end are), you need to add a locale-specific rule to all this to convert I and i in Turkish.

I think there should be three i's to solve problems like being able to capitalize strings that contain words from two languages as in e.g. an imaginary company name "Ali & Jim". The two lowercase i's should have been separate to be able to work with them correctly. The problem stems from Unicode...

A group of us are about to start a small project that involves thin wrappers around Phobos to favor the Turkish behavior for character and string processing. That should help with applications that are happy to use Turkish only. More complex applications could use libraries like IBM's ICU.

Ali
Sep 05 2009
On 2009-08-25 00:23:25 -0400, Ali Cehreli <acehreli yahoo.com> said:
> You may be aware of the problems related to the consistency of the two separate letter 'I's in the Turkish alphabet (and the alphabets that are based on the Turkish alphabet). Lowercase and uppercase versions of the two are consistent in whether they have a dot or not: http://en.wikipedia.org/wiki/Turkish_I
>
> Turkish alphabet being in a position so close to the western alphabets, but not close enough, puts it in a strange position. (Strangely; the same applies geographically, politically, socially, etc. as well... ;)) Computer systems *almost* work for Turkish, but not for those two letters.
>
> I love the fact that D allows Unicode letters in the source code and that it natively supports Unicode. I cannot stress enough how important this is. That is the single biggest reason why I decided to finally write a programming tutorial. Thank you to all who proposed and implemented those features!
>
> Back to the Turquois 'I's... What a programmer is to do who is writing programs that deals with Turkish letters?
>
> a) Accept that Phobos too has this age old behavior that is a result of premature optimization (i.e. this code in tolower: c + (cast(char)'a' - 'A'))
> b) Accept that the problem is unsolvable because the letter I has two minuscules, and the letter i has two majuscules anyway, and that the intent is not always clear
> c) Accept Turkish alphabet as being pathological (merely for being in the minority!), and use a Turkish version of Phobos or some other library
> d) Solve the problem with locale support
>
> Is option d possible with today's systems? Whose responsibility is this anyway? OS? Language? Program? Something else? The fact that alphanumerical ordering is also of interest, I think this has something to do with locales.
>
> Is there a way for a program to work with Turkish letters and ensure that the following program produces the expected output of 'dotless i', 'I with dot', and 0?
>
>     import std.stdio;
>     import std.string;
>     import std.c.locale;
>     import std.uni;
>
>     void main()
>     {
>         const char * result = setlocale(LC_ALL, "tr_TR.UTF-8");
>         assert(result);
>
>         writeln(toUniLower('I'));
>         writeln(toUniUpper('i'));
>         writeln(indexOf("I", '\u0131', // dotless i
>                         (CaseSensitive).no));
>     }
>
> This is a practical question. I really want to be able to work with Turkish... :)

Perhaps this could be of some inspiration. In Cocoa you can pass a locale argument to many string methods (unfortunately, not lowercaseString or uppercaseString) to get the desired result. For instance, the "rangeOfString:options:range:locale:" method can search for substrings case-insensitively, and its documentation specifically discusses the Turkish “ı” character under the locale parameter.

It's also interesting to see that when you search for ß in a webpage using Safari, it also matches every instance of SS (whatever your locale). ß is a German character that becomes SS in uppercase.

- - -

What I'd like to see is a base class representing a locale. Then you can instantiate the locale you want (from a config file, by coding it directly, having bindings to system APIs, or a mix of all this) and use the locale. Something like:

    class Locale
    {
    immutable:
        string lowercase(string s);
        string uppercase(string s);
        int compare(string a, string b);
        // number & date formatting, etc.
    }

    immutable(Locale) systemLocale(); // get default system locale
    immutable(Locale) locale(string localeName); // get best matching locale

    void main()
    {
        auto turkish = locale("tr-TR");
        writeln(turkish.lowercase("I")); // writes "ı"
        writeln(turkish.uppercase("i")); // writes "İ"

        auto english = locale("en-US");
        writeln(english.lowercase("I")); // writes "i"
        writeln(english.uppercase("i")); // writes "I"

        writeln(systemLocale.lowercase("I")); // depends on user settings
        writeln(systemLocale.uppercase("i")); // depends on user settings
    }

This way you can work with many locales at once. And there's no reliance on a global state.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Aug 25 2009
Michel Fortin wrote:
> ...
>
> What I'd like to see is a base class representing a locale. Then you can instantiate the locale you want (from a config file, by coding it directly, having bindings to system APIs, or a mix of all this) and use the locale. Something like:
>
>     class Locale
>     {
>     immutable:
>         string lowercase(string s);
>         string uppercase(string s);
>         int compare(string a, string b);
>         // number & date formatting, etc.
>     }
>
> ...
>
> This way you can work with many locales at once. And there's no reliance on a global state.

You're assuming it's possible and practical to write every method that is locale-dependent at once in a single class.

I personally think that's somewhat unlikely...
Aug 25 2009
On 2009-08-25 08:04:44 -0400, Daniel Keep <daniel.keep.lists gmail.com> said:
> You're assuming it's possible and practical to write every method that is locale-dependant at once in a single class.

No, only the base methods. You can build on them to create other things. With a compare method you can do sorting according to various collations for instance, but the sorting algorithm doesn't need to be part of the locale, it just needs the locale as an argument.

> I personally think that's somewhat unlikely...

Every attempt at defining locales in an operating system attempts to centralize the data in one place. That said, a Locale class should know its locale name, which in turn means that a function given an instance of Locale can take the locale name and do its own thing.

--
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Aug 25 2009
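The point about sorting can be sketched as follows: the sort routine lives outside the locale and only receives it as an argument. Everything here is an assumption made for illustration. The Locale interface is reduced to the single compare method, TurkishLocale ranks lowercase letters by their position in the Turkish alphabet (a real collation is considerably more involved), CodePointLocale stands in for a locale-unaware default, and sortedBy is a hypothetical helper.

```d
import std.algorithm : cmp, min, sort;
import std.conv : to;
import std.stdio : writeln;
import std.string : indexOf;

/// Hypothetical base interface, reduced to the one method used here.
interface Locale
{
    int compare(string a, string b);
}

/// Orders lowercase words by the Turkish alphabet (simplified).
class TurkishLocale : Locale
{
    private static immutable dstring alphabet = "abcçdefgğhıijklmnoöprsştuüvyz"d;

    // Position in the alphabet; anything else sorts after the letters.
    private static size_t rank(dchar c)
    {
        auto i = alphabet.indexOf(c);
        return i >= 0 ? cast(size_t) i : alphabet.length + c;
    }

    int compare(string a, string b)
    {
        auto x = a.to!dstring, y = b.to!dstring;
        foreach (i; 0 .. min(x.length, y.length))
        {
            auto rx = rank(x[i]), ry = rank(y[i]);
            if (rx != ry)
                return rx < ry ? -1 : 1;
        }
        // Equal common prefix: the shorter word comes first.
        return x.length < y.length ? -1 : (x.length > y.length ? 1 : 0);
    }
}

/// Locale-unaware fallback: plain code point order.
class CodePointLocale : Locale
{
    int compare(string a, string b) { return cmp(a, b); }
}

/// The sorting algorithm is not part of the locale; it takes one as an argument.
string[] sortedBy(Locale loc, string[] words)
{
    auto result = words.dup;
    result.sort!((a, b) => loc.compare(a, b) < 0);
    return result;
}

void main()
{
    auto words = ["ilk", "ıslak", "hafta", "jale"];
    writeln(sortedBy(new TurkishLocale, words));   // ["hafta", "ıslak", "ilk", "jale"]
    writeln(sortedBy(new CodePointLocale, words)); // ["hafta", "ilk", "jale", "ıslak"]
}
```

In the Turkish alphabet ı comes directly before i, while plain code point order pushes ı (U+0131) past every ASCII letter, which is why the two calls disagree on where "ıslak" belongs.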