digitalmars.D.learn - Sorting with non-ASCII characters
- Chris (10/10) Sep 19 2013 Short question in case anyone knows the answer straight away:
- monarch_dodra (5/15) Sep 19 2013 Short answer, we currently can't, because we haven't implemented
- Chris (3/26) Sep 19 2013 Good that I asked! Imagine the time I would have wasted.
- bearophile (11/13) Sep 19 2013 The correct solution is a well implemented, well updated and well
- Chris (2/15) Sep 19 2013 Ok, thanks. We'll see.
- Ali Çehreli (15/25) Sep 19 2013 I have a project that tries to do exactly that:
- Jos van Uden (15/23) Sep 19 2013 If you only need to process extended ascii, then you could perhaps
- Chris (3/35) Sep 19 2013 Ok, thanks, will try that. I'll let you know if it worked.
- Chris (3/35) Sep 24 2013 Thanks a million, Jos! This does the trick for me.
- Jos van Uden (29/65) Sep 24 2013 Great.
- Chris (3/76) Sep 24 2013 Ah, yes of course. I will keep that in mind. At the moment I only
Short question in case anyone knows the answer straight away:

How do I sort text so that non-ASCII characters like "á" are treated in the same way as "a"?

Now I'm getting this:

[wow, ara, ába, marca] ===> sort(listAbove);
[ara, marca, wow, ába]

I'd like to get:

[ába, ara, marca, wow]

Thanks.
Sep 19 2013
On Thursday, 19 September 2013 at 15:18:11 UTC, Chris wrote:
> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?

Short answer, we currently can't, because we haven't implemented the "Unicode Collation Algorithm":
http://d.puremagic.com/issues/show_bug.cgi?id=10566

Unfortunately, I don't know of any workarounds for you :/
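For readers wondering why the plain sort puts "ába" last: the default predicate compares the strings' code units, and "á" (U+00E1) has a larger value than any ASCII letter. A minimal sketch that reproduces the ordering reported in the question (the variable names are only for illustration):

```d
import std.algorithm : sort;

void main()
{
    auto words = ["wow", "ara", "ába", "marca"];
    words.sort(); // default "a < b" compares the raw UTF-8 code units
    assert(words == ["ara", "marca", "wow", "ába"]);

    // 'á' is U+00E1, numerically larger than 'w' (U+0077),
    // so any word starting with it lands after all ASCII words.
    dchar accented = 'á';
    dchar plain = 'w';
    assert(accented > plain);
}
```

A proper collation algorithm would compare by language-specific weights instead of these numeric values.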
Sep 19 2013
On Thursday, 19 September 2013 at 15:34:28 UTC, monarch_dodra wrote:
> Short answer, we currently can't, because we haven't implemented the "Unicode Collation Algorithm"

Good that I asked! Imagine the time I would have wasted.
Sep 19 2013
Chris:
> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?

The correct solution is a well implemented, well updated and well debugged Unicode Collation Algorithm, as already answered. But if an approximation is enough, then you could translate to D this Python code that converts a Unicode string to ASCII, and use it through a schwartzSort:
http://newsbruiser.tigris.org/source/browse/~checkout~/newsbruiser/nb/lib/AsciiDammit.py

(If you translate that function to D, you could also add it to Dub later.)

Bye,
bearophile
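As a rough illustration of the "convert to ASCII, then schwartzSort" idea, here is a minimal sketch. It is not a translation of the linked Python module: it assumes that decomposing each string with std.uni's NFD normalization and dropping the combining marks is a good enough approximation. The helper name foldKey is hypothetical.

```d
import std.algorithm : schwartzSort;
import std.array : appender;
import std.uni : combiningClass, normalize, NFD, toLower;

/// Hypothetical helper: builds an accent-free, lower-case sort key.
string foldKey(string s)
{
    auto buf = appender!string();
    // NFD splits "á" into "a" followed by U+0301 (combining acute accent).
    foreach (dchar c; s.normalize!NFD)
    {
        // Keep base characters only; combining marks have a non-zero class.
        if (combiningClass(c) == 0)
            buf.put(c);
    }
    return buf.data.toLower;
}

void main()
{
    auto words = ["wow", "ara", "ába", "marca"];
    // schwartzSort computes each key once, then sorts by the keys.
    words.schwartzSort!foldKey;
    assert(words == ["ába", "ara", "marca", "wow"]);
}
```

Letters without a canonical decomposition (for example "ø") pass through unchanged, so this remains an approximation rather than real collation.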
Sep 19 2013
On Thursday, 19 September 2013 at 15:42:52 UTC, bearophile wrote:
> But if an approximation is enough then you could translate to D this Python code that converts a Unicode string to ASCII, and use it through a schwartzSort

Ok, thanks. We'll see.
Sep 19 2013
On 09/19/2013 08:18 AM, Chris wrote:
> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?

I have a project that tries to do exactly that:
https://code.google.com/p/trileri/source/browse/trunk/tr/dizgi.d#823

However, it is in Turkish and in need of a rewrite. :/

For the whole thing to work, every character must be of a certain alphabet. Here is the English alphabet:
https://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d#747

Here is how I define e.g. á to be an accented version of a:
https://code.google.com/p/trileri/source/browse/trunk/tr/harfler.d#23

However, some characters stand individually as they are not accents but proper letters themselves (e.g. ç of the Turkish alphabet):
https://code.google.com/p/trileri/source/browse/trunk/tr/harfler.d#44

Well... I hope to get back to it at some point, taking advantage of the new std.uni as well.

Ali
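To make the alphabet idea concrete, here is a minimal sketch. It is not the code from the linked project. It assumes a lower-case Turkish alphabet string, and the names alphabet and alphabetKey are hypothetical. Each word is mapped to the positions of its letters in the alphabet, so "ç" sorts right after "c" instead of after "z".

```d
import std.algorithm : map, schwartzSort;
import std.array : array;
import std.string : indexOf;

// Lower-case Turkish alphabet in dictionary order; ç, ğ, ı, ö, ş and ü
// are letters in their own right, not accented variants.
immutable dstring alphabet = "abcçdefgğhıijklmnoöprsştuüvyz"d;

/// Hypothetical helper: maps each letter to its position in the alphabet.
/// Characters outside the alphabet get -1 and therefore sort first.
ptrdiff_t[] alphabetKey(string word)
{
    return word.map!(c => alphabet.indexOf(c)).array;
}

void main()
{
    auto words = ["dal", "çay", "cam"];
    // Arrays of positions compare element by element.
    words.schwartzSort!alphabetKey;
    assert(words == ["cam", "çay", "dal"]);
}
```

Upper-case input and accented variants of a base letter would need extra handling, which this sketch leaves out.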
Sep 19 2013
On 19-9-2013 17:18, Chris wrote:
> How do I sort text so that non-ascii characters like "á" are treated in the same way as "a"?

If you only need to process extended ASCII, then you could perhaps make do with a transliterated sort, something like:

import std.stdio, std.string, std.algorithm, std.uni;

void main()
{
    auto sa = ["wow", "ara", "ába", "Marca"];
    writeln(sa);
    trSort(sa);
    writeln(sa);
}

void trSort(C, alias less = "a < b")(C[] arr)
{
    // Each character of c1 is replaced by the character at the same
    // position in c2 when the sort key is built.
    static dstring c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ";
    static dstring c2 = "aaaaaaceeeeiiiinoooooouuuuyy";
    // Sort by the lower-cased, transliterated key; arr keeps the original strings.
    schwartzSort!(a => tr(toLower(a), c1, c2), less)(arr);
}
Sep 19 2013
On Thursday, 19 September 2013 at 18:44:54 UTC, Jos van Uden wrote:
> If you only need to process extended ascii, then you could perhaps make do with a transliterated sort

Ok, thanks, will try that. I'll let you know if it worked.
Sep 19 2013
On Thursday, 19 September 2013 at 18:44:54 UTC, Jos van Uden wrote:
> If you only need to process extended ascii, then you could perhaps make do with a transliterated sort

Thanks a million, Jos! This does the trick for me.
Sep 24 2013
On 24-9-2013 11:26, Chris wrote:
> Thanks a million, Jos! This does the trick for me.

Great.

Be aware that the above code does a case-insensitive sort. If you need a case-sensitive one, you can use something like:

import std.stdio, std.string, std.algorithm, std.uni;

void main()
{
    auto sa = ["wow", "ara", "ába", "Marca"];
    writeln(sa);
    trSort(sa, CaseSensitive.no);
    writeln(sa);

    writeln;
    sa = ["wow", "ara", "ába", "Marca"];
    writeln(sa);
    trSort(sa, CaseSensitive.yes);
    writeln(sa);
}

void trSort(C, alias less = "a < b")(C[] arr, CaseSensitive cs = CaseSensitive.yes)
{
    // Lower-case and upper-case accented letters and their plain replacements.
    static c1 = "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝŸ"d;
    static c2 = "aaaaaaceeeeiiiinoooooouuuuyyAAAAAACEEEEIIIINOOOOOOUUUUYY"d;
    if (cs == CaseSensitive.no)
        arr.schwartzSort!(a => a.toLower.tr(c1, c2), less); // fold case first
    else
        arr.schwartzSort!(a => a.tr(c1, c2), less); // keep case in the key
}
Sep 24 2013
On Tuesday, 24 September 2013 at 10:35:53 UTC, Jos van Uden wrote:
> Be aware that the above code does a case insensitive sort, if you need case sensitive, you can use something like:

Ah, yes of course. I will keep that in mind. At the moment I only need case insensitive, but you never know. Thanks again.
Sep 24 2013