digitalmars.D - Faster sort?
- Andrea Fontana (90/90) Apr 06 2016 I did a couple of tries using a "specialized" implementation of
- tsbockman (5/14) Apr 06 2016 Can you share the benchmark code?
- Andrea Fontana (27/45) Apr 07 2016 A simple test just written:
- John Colvin (27/64) Apr 07 2016 But it definitely can eliminate an unused result. My prediction:
- Andrea Fontana (3/8) Apr 07 2016 But it should remove result if I replace boolSort() with sort()
- John Colvin (14/22) Apr 07 2016 Not necessarily. It totally depends on the implementation details
- Andrea Fontana (5/8) Apr 07 2016 If i move boolSort() on another module but I can't use auto as
- Andrea Fontana (27/35) Apr 07 2016 What about this?
- Andrea Fontana (4/18) Apr 07 2016 whoops i missed a line there. It's 300ms but i wonder if test is
- tsbockman (7/33) Apr 07 2016 A time of "zero" means the benchmark is broken. In this case, you
- John Colvin (4/12) Apr 07 2016 You could create a dummy function (or even the real function,
- John Colvin (4/18) Apr 07 2016 Correction, use
- tsbockman (3/10) Apr 07 2016 It's easier to just output the result in some form. I typically
- John Colvin (6/17) Apr 07 2016 Take care with this approach. For example, calling a pure
- tsbockman (85/93) Apr 07 2016 `boolSort(arr)` has no side effects, and in your benchmark the
- Andrea Fontana (5/10) Apr 07 2016 Yes point of my original discussion wasn't the obviusly wrong
I did a couple of tries using a "specialized" implementation of sort!less. For example using a counting sort for bool seems a good idea. This code: auto boolSort(alias less = "a<b")(in bool[] data) { import std.array : uninitializedArray; import std.functional : binaryFun; // Check if order is true/false or false/true immutable test = binaryFun!less(true, false); size_t count; foreach(ref x; data) if (x == test) count++; return chain(repeat(test,count), repeat(!test, data.length - count)); } Count number of false (or true, it depends on result of "less"), and then chain them. I did a test over 10000 randomized array of 2,4,8,16...65536 elements. rdmd -O -release -noboundscheck -inline sort.d : 2 inputs: Faster: 229 μs and 9 hnsecs Phobos: 215 μs and 5 hnsecs ... 64 inputs: Faster: 513 μs and 4 hnsecs Phobos: 2 ms, 143 μs, and 5 hnsecs ... 65536 inputs: Faster: 2 secs, 700 ms, and 696 μs Phobos: 6 secs, 835 ms, and 319 μs Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort instead: 2 inputs: Faster: 0 hnsecs Phobos: 33 μs and 5 hnsecs ... 65536 inputs: Faster: 0 hnsecs (???) Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecs The same thing works for byte[] and ubyte[] array using a few tricks (eg: using counting sort and then sorting only keys). So for ubyte[]/byte[] (ldc): 2 inputs: 1 ms, 305 μs, and 4 hnsecs 31 μs and 4 hnsecs ... 256 inputs: 8 ms, 76 μs, and 3 hnsecs 8 ms, 807 μs, and 3 hnsecs ... 65536 inputs: 347 ms, 330 μs, and 9 hnsecs 14 secs, 214 ms, 66 μs, and 5 hnsecs Anyway implementation for byte/ubyte should be improved as long as it probably won't work fine if less is a functor or delegate: auto byteSort(alias less = "a<b", T)(in T[] data) if (is(T == byte) || is(T == ubyte)) { import std.range : iota; import std.algorithm : map; import std.array : uninitializedArray, array; // It needs 256*size_t.sizeof bytes. Maybe not suitables for embedded platforms? size_t[256] buckets; foreach(ref x; data) buckets[cast(ubyte)x]++; // Sort 256 keys using less function at compile time (less should work at compile time) static immutable sortedKeys = iota(0, 256).map!(x => cast(T)x).array.sort!less.array; auto result = uninitializedArray!(T[])(data.length); size_t currentIdx = 0; // bucket[0] = 3, bucket[1] = 0, bucket[2] = 1 => 0,0,0,2 foreach(ref k; sortedKeys) { auto b = buckets[cast(ubyte)k]; if (b) { result[currentIdx .. currentIdx + b] = cast(T)k; currentIdx += b; if (currentIdx >= data.length) return result; } } assert(0); } Maybe those implementations could be useful to improve phobos sorting? Andrea
Apr 06 2016
On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana wrote:Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort instead: 2 inputs: Faster: 0 hnsecs Phobos: 33 μs and 5 hnsecs ... 65536 inputs: Faster: 0 hnsecs (???) Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecsCan you share the benchmark code? "0 hnsecs" results generally mean that your test was too simple and/or didn't have any obvious side effects, and so the optimizer just removed it completely.
Apr 06 2016
On Wednesday, 6 April 2016 at 18:54:08 UTC, tsbockman wrote:On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana wrote:A simple test just written: Duration total; foreach(_; 0..10000) { auto arr = generate!( () => uniform(0,2)).map!(x => cast(bool)x).take(65536).array; StopWatch sw; sw.start; boolSort(arr); total += sw.peek.to!Duration; sw.stop; } andrea ububocs:/tmp$ ./sort 0 hnsecs I don't think compiler can remove a random generated array... I think times are too small to be measured with a good accuracy, using start() stop() to resume stopwatch seems to add a (fixed) overhead (some microsecs over the 10000 cycles) that doesn't depend on size of array tested. Anyway, it doesn't matter if it is 0 hnsec or some microsecs, in my opinion. It's still faster and faster than original sort (of course it's n vs n*log(n)). Here the same code with "sort(arr)" instead of "boolSort(arr)": andrea ububocs:/tmp$ ./sort 10 secs, 620 ms, 414 μs, and 2 hnsecs AndreaUsing ldmd2 -O -release -noboundscheck -inline sort.d && ./sort instead: 2 inputs: Faster: 0 hnsecs Phobos: 33 μs and 5 hnsecs ... 65536 inputs: Faster: 0 hnsecs (???) Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecsCan you share the benchmark code? "0 hnsecs" results generally mean that your test was too simple and/or didn't have any obvious side effects, and so the optimizer just removed it completely.
Apr 07 2016
On Thursday, 7 April 2016 at 07:57:11 UTC, Andrea Fontana wrote:On Wednesday, 6 April 2016 at 18:54:08 UTC, tsbockman wrote:But it definitely can eliminate an unused result. My prediction: you took an array and sorted it, then did nothing with the result, so it rightly concluded that there was no point doing the sort. In any given case the compiler could be removing some or all of the work. Laborious approach that defeats the optimisations you don't want while keeping the ones you do: % cat modA.d float[] codeToBenchmark(int someParam, float[] someOtherParam) { /* blah blah code */ } % cat modB.d // Do not import modA here auto codeToBenchmark(int, float[]); void main() { // start loop and timers: codeToBenchmark(/*blah params */); // end timers and loop } % ldc2 -c <optimisation options here> modA.d % ldc2 <opt options, less important> modA.o modB.d % ./modB <reliable results come out here> I really need to write an article about bencharking in D...On Wednesday, 6 April 2016 at 08:15:39 UTC, Andrea Fontana wrote:A simple test just written: Duration total; foreach(_; 0..10000) { auto arr = generate!( () => uniform(0,2)).map!(x => cast(bool)x).take(65536).array; StopWatch sw; sw.start; boolSort(arr); total += sw.peek.to!Duration; sw.stop; } andrea ububocs:/tmp$ ./sort 0 hnsecs I don't think compiler can remove a random generated array...Using ldmd2 -O -release -noboundscheck -inline sort.d && ./sort instead: 2 inputs: Faster: 0 hnsecs Phobos: 33 μs and 5 hnsecs ... 65536 inputs: Faster: 0 hnsecs (???) Phobos: 7 secs, 865 ms, 450 μs, and 6 hnsecsCan you share the benchmark code? "0 hnsecs" results generally mean that your test was too simple and/or didn't have any obvious side effects, and so the optimizer just removed it completely.
Apr 07 2016
On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:But it definitely can eliminate an unused result. My prediction: you took an array and sorted it, then did nothing with the result, so it rightly concluded that there was no point doing the sort. In any given case the compiler could be removing some or all of the work.But it should remove result if I replace boolSort() with sort() too, instead it take 10 seconds to run.
Apr 07 2016
On Thursday, 7 April 2016 at 08:33:40 UTC, Andrea Fontana wrote:On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:Not necessarily. It totally depends on the implementation details and the exact way the optimiser works. It might be interesting and informative for you to explore exactly why a particular version of a particular compiler with particular compilation flags will inline and elide one sort function but not another, but I would recommend just not letting the compiler see the source code of the benchmark and the sorting at the same time*, then you know neither will be inlined and also no extra attributes will be inferred and unrealistically taken advantage of. *hench my example of compiling one module to an object file and then compiling the other and linking them, without ever importing one from the other.But it definitely can eliminate an unused result. My prediction: you took an array and sorted it, then did nothing with the result, so it rightly concluded that there was no point doing the sort. In any given case the compiler could be removing some or all of the work.But it should remove result if I replace boolSort() with sort() too, instead it take 10 seconds to run.
Apr 07 2016
On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:*hench my example of compiling one module to an object file and then compiling the other and linking them, without ever importing one from the other.If i move boolSort() on another module but I can't use auto as return type. Return type is a "voldemort" type returned by std.range.chain. How can I define it?
Apr 07 2016
On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:What about this? Duration total; foreach(_; 0..10000) { auto arr = generate!( () => uniform(0,2)).map!(x => cast(bool)x).take(65536).array; StopWatch sw; sw.start; auto t = boolSort(arr); sw.stop; size_t sum = 0; foreach(x; t.map!(x => x?1:0)) sum+=x; writeln(sum); } writeln(total); This outputs: 32784 ... ... ... 32819 32648 32649 32716 32972 0 hnsecs*hench my example of compiling one module to an object file and then compiling the other and linking them, without ever importing one from the other.If i move boolSort() on another module but I can't use auto as return type. Return type is a "voldemort" type returned by std.range.chain. How can I define it?
Apr 07 2016
On Thursday, 7 April 2016 at 09:00:05 UTC, Andrea Fontana wrote:What about this? Duration total; foreach(_; 0..10000) { auto arr = generate!( () => uniform(0,2)).map!(x => cast(bool)x).take(65536).array; StopWatch sw; sw.start; auto t = boolSort(arr);Missed total+=...sw.stop; size_t sum = 0; foreach(x; t.map!(x => x?1:0)) sum+=x; writeln(sum); }whoops i missed a line there. It's 300ms but i wonder if test is valid. Anyway, that's not the point of my original post :)
Apr 07 2016
On Thursday, 7 April 2016 at 09:00:05 UTC, Andrea Fontana wrote:What about this? Duration total; foreach(_; 0..10000) { auto arr = generate!( () => uniform(0,2)).map!(x => cast(bool)x).take(65536).array; StopWatch sw; sw.start; auto t = boolSort(arr); sw.stop; size_t sum = 0; foreach(x; t.map!(x => x?1:0)) sum+=x; writeln(sum); } writeln(total); This outputs: 32784 ... ... ... 32819 32648 32649 32716 32972 0 hnsecsA time of "zero" means the benchmark is broken. In this case, you forgot to actually add the stopwatch time to the `total` variable. You don't need to print the sum every time; just accumulate it in a variable declared outside the loop and print it once at the end (like with `total`). See my benchmark code that I posted a few minutes ago.
Apr 07 2016
On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:You could create a dummy function (or even the real function, just with a different name) that creates the same type and use typeof(myDummyFunction(myDummyArgs)) when making the definition.*hench my example of compiling one module to an object file and then compiling the other and linking them, without ever importing one from the other.If i move boolSort() on another module but I can't use auto as return type. Return type is a "voldemort" type returned by std.range.chain. How can I define it?
Apr 07 2016
On Thursday, 7 April 2016 at 09:25:46 UTC, John Colvin wrote:On Thursday, 7 April 2016 at 08:53:32 UTC, Andrea Fontana wrote:Correction, use http://dlang.org/phobos/std_traits.html#ReturnType instead of all that typeof mess.On Thursday, 7 April 2016 at 08:41:51 UTC, John Colvin wrote:You could create a dummy function (or even the real function, just with a different name) that creates the same type and use typeof(myDummyFunction(myDummyArgs)) when making the definition.*hench my example of compiling one module to an object file and then compiling the other and linking them, without ever importing one from the other.If i move boolSort() on another module but I can't use auto as return type. Return type is a "voldemort" type returned by std.range.chain. How can I define it?
Apr 07 2016
On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:But it definitely can eliminate an unused result. My prediction: you took an array and sorted it, then did nothing with the result, so it rightly concluded that there was no point doing the sort. In any given case the compiler could be removing some or all of the work. Laborious approach that defeats the optimisations you don't want while keeping the ones you do:It's easier to just output the result in some form. I typically calculate and display a checksum of some sort.
Apr 07 2016
On Thursday, 7 April 2016 at 09:01:14 UTC, tsbockman wrote:On Thursday, 7 April 2016 at 08:23:09 UTC, John Colvin wrote:Take care with this approach. For example, calling a pure function (whether D pure or some optimiser inferred sort of purity) repeatedly in a loop can sometimes cause the loop to be reduced to a single iteration (doesn't apply in this case, because of randomly generating data each iteration).But it definitely can eliminate an unused result. My prediction: you took an array and sorted it, then did nothing with the result, so it rightly concluded that there was no point doing the sort. In any given case the compiler could be removing some or all of the work. Laborious approach that defeats the optimisations you don't want while keeping the ones you do:It's easier to just output the result in some form. I typically calculate and display a checksum of some sort.
Apr 07 2016
On Thursday, 7 April 2016 at 07:57:11 UTC, Andrea Fontana wrote:I don't think compiler can remove a random generated array... I think times are too small to be measured with a good accuracy, using start() stop() to resume stopwatch seems to add a (fixed) overhead (some microsecs over the 10000 cycles) that doesn't depend on size of array tested.`boolSort(arr)` has no side effects, and in your benchmark the return value is discarded. This means that the compiler is free to simply skip that function call entirely. Because of this, with warnings-as-errors your code won't even compile: "Warning: calling app.boolSort!"a<b".boolSort without side effects discards return value of type Result, prepend a cast(void) if intentional"Anyway, it doesn't matter if it is 0 hnsec or some microsecs, in my opinion. It's still faster and faster than original sort (of course it's n vs n*log(n)).A time of "zero" means that your benchmark is broken, and tells you nothing about how fast your code actually is. Here's a less broken benchmark (it's still got lots of room for improvement): module app; import std.algorithm, std.conv, std.range, std.random, std.stdio; auto boolSort(alias less = "a<b")(in bool[] data) { import std.array : uninitializedArray; import std.functional : binaryFun; // Check if order is true/false or false/true immutable test = binaryFun!less(true, false); size_t count; foreach(ref x; data) if (x == test) count++; return chain(repeat(test,count), repeat(!test, data.length - count)); } void main() { import std.datetime : Duration, StopWatch; Duration boolSortDur, stdSortDur; ulong boolSortKS, stdSortKS; foreach(_; 0..10_000) { auto arr = generate!( () => uniform(0,2)) .map!(x => cast(bool)x) .take(65_536) .array; { StopWatch sw; sw.start; auto sorted = boolSort(arr); sw.stop; boolSortDur += sw.peek.to!Duration; /* Scan the sorted result so that the compiler doesn't elide the boolSort(arr) call: */ foreach(bool b; sorted) boolSortKS += b; } { StopWatch sw; sw.start; auto sorted = sort(arr); sw.stop; stdSortDur += sw.peek.to!Duration; /* Scan the sorted result so that the compiler doesn't elide the sort(arr) call: */ foreach(bool b; sorted) stdSortKS += b; } } /* Make some I/O conditional upon the sorted results to establish a direct causal chain between the code being benchmarked, and something that we know cannot ever be removed by the optimizer: */ if(boolSortKS != stdSortKS) writeln("Keep sum mismatch!"); writefln("boolSort(): %s", boolSortDur); writefln("std sort(): %s", stdSortDur); } Results on my 3.2GHz Haswell 64-bit Linux system: DMD 64-bit: boolSort(): 2 secs, 886 ms, 915 μs, and 5 hnsecs std sort(): 15 secs, 135 ms, 900 μs, and 8 hnsecs DMD 32-bit: boolSort(): 2 secs, 827 ms, 13 μs, and 6 hnsecs std sort(): 16 secs, 668 ms, 900 μs, and 2 hnsecs LDC 64-bit: boolSort(): 369 ms, 590 μs, and 6 hnsecs std sort(): 13 secs, 798 ms, 107 μs, and 8 hnsecs So, your boolSort() is faster as expected - but not infinitely so. Andrei gave a lecture on optimization a while back: http://forum.dlang.org/thread/n6odlg$j2n$1 digitalmars.com Among the key points he made was basically, "Never assume: measure. And, measuring correctly is harder than it looks."
Apr 07 2016
On Thursday, 7 April 2016 at 08:48:41 UTC, tsbockman wrote:LDC 64-bit: boolSort(): 369 ms, 590 μs, and 6 hnsecs std sort(): 13 secs, 798 ms, 107 μs, and 8 hnsecs So, your boolSort() is faster as expected - but not infinitely so.Yes point of my original discussion wasn't the obviusly wrong mesure but the greate gain over standard sort() 369ms vs 13 secs is still a lot (35x: and it increase more and more for bigger length... it's linear vs nlogn)
Apr 07 2016