www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Another algo for faster sorting

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
Coincidentally with another NG thread I'm curious if we can 
special-case our sort for
strings to Three-Way Radix QuickSort which is more efficient:

http://www.drdobbs.com/database/sorting-strings-with-three-way-radix-qui/184410724
Apr 07 2016
next sibling parent reply Andrea Fontana <nospam example.com> writes:
On Thursday, 7 April 2016 at 10:58:21 UTC, Dmitry Olshansky wrote:
 Coincidentally with another NG thread I'm curious if we can 
 special-case our sort for
 strings to Three-Way Radix QuickSort which is more efficient:

 http://www.drdobbs.com/database/sorting-strings-with-three-way-radix-qui/184410724
Radix sort is a very fast way to sort strings and integers . But it does not work with a custom " less" function . It just sorts date to lexical way . If phobos had a lexicalSort ( T [ ] , Ascending.YES ) ; probably a lot of optimizations could be done for many data types. Anyway my topic about sort optimization seems not to have a good luck if not for benchmark inaccuracy :) Andrea
Apr 07 2016
next sibling parent reply tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 7 April 2016 at 13:32:43 UTC, Andrea Fontana wrote:
 Anyway my topic about sort optimization seems not to have a 
 good luck if not for benchmark inaccuracy :)
I was planning to respond to your original question, I just didn't have time last night... I don't think that the simple "counting sort" you demonstrated in the other thread is particularly useful to have in Phobos, by itself: - It doesn't scale well at all to higher-entropy element types, like `int` or `string`. - It only works on value types, not reference types. - It's so trivial to implement, that anyone who really needs it can write it themselves. However, what might be really useful is to have a good generic implementation of "bucket sort". Using a user-supplied fast hash function whose output follows the same ordering as the original elements: 1. Move/copy each element into a bucket list chosen by the hash function. 2. Sort each individual bucket list using another algorithm. 3. Iterate through all the buckets lists, and move/copy each element to the output. With good design, it should be possible to express counting sort for small value types like `bool` and `byte` simply as a parameterization of bucket sort. (This would require that the bucket list type be parameterized, in addition to the hash function.)
Apr 07 2016
parent reply Andrea Fontana <nospam example.com> writes:
On Thursday, 7 April 2016 at 18:15:32 UTC, tsbockman wrote:
 I don't think that the simple "counting sort" you demonstrated 
 in the other thread is particularly useful to have in Phobos, 
 by itself:
     - It doesn't scale well at all to higher-entropy element 
 types, like `int` or `string`.
I mean to replace sort() on phobos if T == byte || T == ...
     - It only works on value types, not reference types.
Ditto
     - It's so trivial to implement, that anyone who really 
 needs it can write it themselves.
Ok, but it's not a good reason to keep an inefficient sort() on phobos. I would use algorithms in phobos without thinking if I have to reimplement them or not. If not we should implement ... sort(T)(T[]) if (T == byte) assert(0, "hey this is a bad implementation but is trivial to implement the right one"); :) Moreover: many algorithms are generic in phobos and in D ecosystem, and it could happen that byte's sort (or even bool) isn't used directly, but instead inside other algorithms. Let's say we need to write a burrows-wheeler transform for bzip2 and we need to sort byte... Won't you use the default sort function for it? Andrea
Apr 08 2016
next sibling parent tsbockman <thomas.bockman gmail.com> writes:
On Friday, 8 April 2016 at 07:28:38 UTC, Andrea Fontana wrote:
 I mean to replace sort() on phobos if T == byte || T == ...
I know that's what you meant.
 Ok, but it's not a good reason to keep an inefficient sort() on 
 phobos. I would use algorithms in phobos without thinking if I 
 have to reimplement them or not.
There are tons of algorithms in Phobos that *could* be made faster for some combinations of inputs, but it doesn't make sense to add every possible template specialization because of the long-term maintenance costs. My point is that I doubt specializing `sort()` for `bool` and `byte` is actually worth the costs. That's just my opinion though, and I could be swayed easily enough given some evidence of a specific real-world need for the optimization. In any case, you are free to submit a Phobos pull request if you think it's worth it.
Apr 08 2016
prev sibling parent tsbockman <thomas.bockman gmail.com> writes:
On Friday, 8 April 2016 at 07:28:38 UTC, Andrea Fontana wrote:
 Ok, but it's not a good reason to keep an inefficient sort() on 
 phobos...
Also, your response makes it sound like I was advocating for just keeping everything like it is now. But, my main point was actually that counting sort is just a variant of bucket sort, and it would (in my opinion) make more sense to write a good generic bucket sort implementation. This would be applicable to a much wider range of use cases, and could still be used to speed up `sort()` for `bool`, `byte`, and `ubyte` if needed.
Apr 08 2016
prev sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 7 April 2016 at 13:32:43 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 10:58:21 UTC, Dmitry Olshansky 
 wrote:
 Coincidentally with another NG thread I'm curious if we can 
 special-case our sort for
 strings to Three-Way Radix QuickSort which is more efficient:

 http://www.drdobbs.com/database/sorting-strings-with-three-way-radix-qui/184410724
Radix sort is a very fast way to sort strings and integers . But it does not work with a custom " less" function . It just sorts date to lexical way .
You can make it work with float as well with some horrible hacks. I tried to implement it, but performances were not that great at the end. Maybe I missed something, or maybe this is just not CPU friendly enough for the sample I had to throw at it. See: https://github.com/deadalnix/Dsort/blob/master/sort/radix.d
Apr 08 2016
parent reply Andrea Fontana <nospam example.com> writes:
On Friday, 8 April 2016 at 07:33:32 UTC, deadalnix wrote:
 On Thursday, 7 April 2016 at 13:32:43 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 10:58:21 UTC, Dmitry Olshansky 
 wrote:
 Coincidentally with another NG thread I'm curious if we can 
 special-case our sort for
 strings to Three-Way Radix QuickSort which is more efficient:

 http://www.drdobbs.com/database/sorting-strings-with-three-way-radix-qui/184410724
Radix sort is a very fast way to sort strings and integers . But it does not work with a custom " less" function . It just sorts date to lexical way .
You can make it work with float as well with some horrible hacks. I tried to implement it, but performances were not that great at the end. Maybe I missed something, or maybe this is just not CPU friendly enough for the sample I had to throw at it. See: https://github.com/deadalnix/Dsort/blob/master/sort/radix.d
Using your functions to radixify float this seems to run faster for me than phobos sort (for large arrays 2x, 3x it seems... not so many benchmarks) http://dpaste.dzfl.pl/1b4752ff37b1 I don't optimize it that much. Just did a try. Andrea
Apr 08 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/8/16 6:06 AM, Andrea Fontana wrote:
 On Friday, 8 April 2016 at 07:33:32 UTC, deadalnix wrote:
 On Thursday, 7 April 2016 at 13:32:43 UTC, Andrea Fontana wrote:
 On Thursday, 7 April 2016 at 10:58:21 UTC, Dmitry Olshansky wrote:
 Coincidentally with another NG thread I'm curious if we can
 special-case our sort for
 strings to Three-Way Radix QuickSort which is more efficient:

 http://www.drdobbs.com/database/sorting-strings-with-three-way-radix-qui/184410724
Radix sort is a very fast way to sort strings and integers . But it does not work with a custom " less" function . It just sorts date to lexical way .
You can make it work with float as well with some horrible hacks. I tried to implement it, but performances were not that great at the end. Maybe I missed something, or maybe this is just not CPU friendly enough for the sample I had to throw at it. See: https://github.com/deadalnix/Dsort/blob/master/sort/radix.d
Using your functions to radixify float this seems to run faster for me than phobos sort (for large arrays 2x, 3x it seems... not so many benchmarks) http://dpaste.dzfl.pl/1b4752ff37b1 I don't optimize it that much. Just did a try. Andrea
Guess we need a pull request. In order to distinguish the "less than" (default) predicate from arbitrary predicates, just define two overloads of sort, one without the predicate parameter. -- Andrei
Apr 08 2016
parent Andrea Fontana <nospam example.com> writes:
On Friday, 8 April 2016 at 13:35:53 UTC, Andrei Alexandrescu 
wrote:
 Guess we need a pull request. In order to distinguish the "less 
 than" (default) predicate from arbitrary predicates, just 
 define two overloads of sort, one without the predicate 
 parameter. -- Andrei
Why not a sort with SortAscending.yes/no or an enum as (template?) param?
Apr 08 2016
prev sibling parent reply Jonathan Villa <jv_vortex msn.com> writes:
On Thursday, 7 April 2016 at 10:58:21 UTC, Dmitry Olshansky wrote:
 Coincidentally with another NG thread I'm curious if we can 
 special-case our sort for
 strings to Three-Way Radix QuickSort which is more efficient:

 http://www.drdobbs.com/database/sorting-strings-with-three-way-radix-qui/184410724
The other day I had problems using Quicksort (trying to sorting 100_000 doubles) causing stackoverflow for so many recursive calls (using .NET), then searching for other sorting algorithms I found SHELL-SORT, it's not recursive and it ended being even faster than Quicksort (what the heck? xd, well probably the JIT compiler). Now you want to sort strings, I don't know in that case, but maybe that can be useful, who knows? c: Here's the link where I got the algorithm (is in visual basic D:) https://support.microsoft.com/en-us/kb/169617 JV
Apr 07 2016
parent tsbockman <thomas.bockman gmail.com> writes:
On Thursday, 7 April 2016 at 16:23:31 UTC, Jonathan Villa wrote:
 ...then searching for other sorting algorithms I found 
 SHELL-SORT, it's not recursive and it ended being even faster 
 than Quicksort (what the heck? xd, well probably the JIT 
 compiler).
It would be cool to have a shell sort implementation available for small data sets. It's a bad choice in the general case though, because its asymptotic scaling is worse than the standard O(N log(N)) achieved by stuff like Timsort. Quicksort is actually bad too, because its worst case asymptotic complexity is an embarrassing O(N^2), which makes it highly vulnerable to denial-of-service attacks.
Apr 07 2016