digitalmars.D - Table of strings sorting problem
- Aarti (44/44) Mar 10 2006 Hello all D-Fans!
- S. Chancellor (10/64) Mar 10 2006 Sort works off of the binary value of a character. To implement a sort
- Hasan Aljudy (3/74) Mar 10 2006 prints "english" characters first!! acelosz
- James Dunne (13/89) Mar 11 2006 Correction: ASCII characters first, because they are in the range
- John C (8/62) Mar 11 2006 As others have implied, D's standard library isn't culturally aware.
Hello all D-Fans!

I encountered a problem with string sorting according to Polish language rules. Here is a simple test program:

// ----------------------------------
import std.stdio;

void main()
{
    char[][] table;
    table.length=15;
    table[0]="ą"; table[1]="a";
    table[2]="ć"; table[3]="c";
    table[4]="ę"; table[5]="e";
    table[6]="ń"; table[7]="n";
    table[6]="ł"; table[7]="l";
    table[8]="ó"; table[9]="o";
    table[10]="ś"; table[11]="s";
    table[12]="ź"; table[13]="ż";
    table[14]="z";

    table.sort;

    foreach(char[] s; table) {
        writef(s);
    }
    writefln();
}
// ----------------------------------

Output of this test is:
aceloszóąćęłśźż
when it should be:
aącćeęlłoósśzźż

It looks like sort doesn't sort properly according to language rules. Is it a known issue? How to sort strings in D according to language rules?

PS. Possibility of using Polish characters in class identifiers is for me really cool. In C++ books in examples you can see all the time Trojkat instead of Trójkąt (triangle) and it looks awful.

Regards
Marcin Kuszczak
Mar 10 2006
On 2006-03-10 17:20:35 -0800, Aarti <aarti interia.pl> said:
> It looks like sort doesn't sort properly according to language rules. Is it a known issue? How to sort strings in D according to language rules?

Sort works off of the binary value of a character. A sort algorithm for the Polish language would need to be implemented manually by you: you would need to specify a map from each character to its sort order and sort based on that. I'm not sure if the sort property takes a delegate; that was something that was proposed before.

You could mainly say it's coincidence that the Latin characters fall in order numerically. (It was probably done on purpose by the person who decided the ASCII character values, though.)

-S.
Mar 10 2006
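The character-to-rank map suggested in the reply above could look roughly like the sketch below. It is written for present-day D (D2 with Phobos and `std.algorithm.sorting.sort`), not for the 2006 compiler used in the thread. The names `polishAlphabet`, `buildRanks` and `sortKey` are hypothetical, introduced only for this illustration, and the sketch covers lower-case letters only.

```d
import std.algorithm.sorting : sort;
import std.array : join;
import std.stdio : writeln;

// Assumption: lower-case Polish alphabet in dictionary order
// (q, v and x are included for loanwords).
immutable dstring polishAlphabet = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"d;

/// Maps every letter of `alphabet` to its position in that alphabet.
size_t[dchar] buildRanks(dstring alphabet)
{
    size_t[dchar] ranks;
    foreach (i, dchar c; alphabet)
        ranks[c] = i;
    return ranks;
}

/// Turns a string into a list of ranks that compares in alphabet order.
/// Characters missing from the table sort after all known letters,
/// ordered by code point.
size_t[] sortKey(string s, const size_t[dchar] ranks)
{
    size_t[] key;
    foreach (dchar c; s) // foreach over dchar decodes the UTF-8 string
    {
        if (auto p = c in ranks)
            key ~= *p;
        else
            key ~= ranks.length + c;
    }
    return key;
}

void main()
{
    auto ranks = buildRanks(polishAlphabet);
    string[] table = ["ą", "a", "ć", "c", "ę", "e", "ł", "l",
                      "ó", "o", "ś", "s", "ź", "ż", "z"];

    // Arrays of ranks compare lexicographically, element by element.
    sort!((a, b) => sortKey(a, ranks) < sortKey(b, ranks))(table);

    writeln(table.join); // prints "aącćeęlłoósśzźż"
}
```

Supporting another language would mean passing a different alphabet to `buildRanks`. The sketch recomputes the key on every comparison for brevity; a real implementation would probably cache the keys, and it would still fall short of full collation rules (case, multi-character letters, and so on).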
S. Chancellor wrote:
> Sort works off of the binary value of a character.

And note that the output
>> aceloszóąćęłśźż
prints "english" characters first!!
acelosz
Mar 10 2006
Hasan Aljudy wrote:
> And note that the output
>> aceloszóąćęłśźż
> prints "english" characters first!!
> acelosz

Correction: ASCII characters first, because they are in the range 0-127. Look at the Unicode tables; they're publicly available. Other Latin languages use the ASCII characters.

The problem is language- and culture-specific collation. It is a very difficult problem to solve generically, since each language has many subcultures and each subculture agrees on different rules for collating text. See the discussions on ICU in the archives.

If one is looking for an explanation of the problem along with a collation solution, I would recommend:
http://www.unicode.org/reports/tr10/

--
Regards,
James Dunne
Mar 11 2006
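To see why the ASCII letters come out first, it can help to print the code point of each letter in the order the test program produced them. The following is a small sketch in present-day D; `codePoints` is a hypothetical helper written for this illustration. Comparing UTF-8 strings byte by byte yields the same order as comparing their code points, so a binary sort is in effect a sort by code point.

```d
import std.algorithm.sorting : isSorted;
import std.stdio : writefln, writeln;

/// Decodes a UTF-8 string into its Unicode code points.
uint[] codePoints(string s)
{
    uint[] result;
    foreach (dchar c; s) // foreach over dchar decodes UTF-8
        result ~= cast(uint) c;
    return result;
}

void main()
{
    // The letters exactly as the thread's test program printed them.
    string output = "aceloszóąćęłśźż";

    foreach (dchar c; output)
        writefln("%s  U+%04X", c, cast(uint) c);

    // The sequence is ascending by code point: ASCII letters are below
    // U+0080, ó is U+00F3, and the remaining Polish letters start at U+0105.
    writeln(codePoints(output).isSorted); // prints "true"
}
```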
Aarti wrote:
> It looks like sort doesn't sort properly according to language rules. Is it a known issue? How to sort strings in D according to language rules?

As others have implied, D's standard library isn't culturally aware. I've been working on a locale package for Mango that will eventually allow correct string sorting for specific languages. This is how you'd sort a list of Polish characters:

    const char[][] table = [ "a","ą","c","ć","e","ę" ];
    Culture.current = Culture.getCulture("pl-PL");
    table.sort();
Mar 11 2006
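The Mango locale package described above had not been released at the time of the thread, so its API cannot be shown here. As a different technique that gives a similar effect, the sketch below delegates the comparison to the C library's `strcoll` through D's `core.stdc` bindings. It assumes present-day D, a POSIX-style system, and an installed `pl_PL.UTF-8` locale (the locale name is an assumption and differs on Windows). `sortByLocale` is a hypothetical helper, not part of Phobos or Mango.

```d
import core.stdc.locale : LC_COLLATE, setlocale;
import core.stdc.string : strcoll;
import std.algorithm.sorting : sort;
import std.array : join;
import std.stdio : writeln;
import std.string : toStringz;

/// Sorts `items` in place using the collation rules of `localeName`.
/// Returns false and leaves `items` untouched if the locale is unavailable.
/// Note: setlocale changes process-wide state.
bool sortByLocale(string[] items, string localeName)
{
    if (setlocale(LC_COLLATE, localeName.toStringz) is null)
        return false;

    // strcoll compares two NUL-terminated strings under the current
    // LC_COLLATE locale; toStringz makes a terminated copy per call.
    sort!((a, b) => strcoll(a.toStringz, b.toStringz) < 0)(items);
    return true;
}

void main()
{
    string[] table = ["ą", "a", "ć", "c", "ę", "e"];

    if (sortByLocale(table, "pl_PL.UTF-8"))
        writeln(table.join);
    else
        writeln("locale pl_PL.UTF-8 is not available on this system");
}
```

Because D lets a free function be called with member syntax, the call can also be written `table.sortByLocale("pl_PL.UTF-8")`. The result depends entirely on the locale data installed on the machine.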
John C wrote:
> I've been working on a locale package for Mango that will eventually allow correct string sorting for specific languages. This is how you'd sort a list of Polish characters:
>
>     const char[][] table = [ "a","ą","c","ć","e","ę" ];
>     Culture.current = Culture.getCulture("pl-PL");
>     table.sort();

It would be really helpful! Does it already work? In particular, I don't understand how I can change the standard behaviour of the table sort property.

I think that internationalization support is one of the most important areas that could increase D's acceptance all over the world. Although in C++ it's not as easy as it should be, it's still easier than writing your own sort function, especially when I want my program to support sorting according to the rules of _many_ different languages.

Another problem is that the D documentation does not say anywhere that D sorts tables only in binary order. There should also be a hint on how to implement your own sorters for a table, because right now the language does not behave as expected in the case of strings.

Regards
Marcin Kuszczak
Mar 11 2006
Aarti wrote:
> It would be really helpful! Does it already work? In particular, I don't understand how I can change the standard behaviour of the table sort property.

Well, I've written an implementation that works, but it's not yet ready to be unleashed on the public. It might be possible to override _adSort ... not tried it yet. Currently it's just a free function, which can be called as if an array property.
Mar 11 2006