digitalmars.D - topN using a heap

Andrei Alexandrescu (17/17) Jan 16 2016 https://github.com/D-Programming-Language/phobos/pull/3934

Ivan Kazmenko (21/32) Jan 16 2016 Yeah, this sounds nice. Suppose we have an array of length n and

Timon Gehr (3/8) Jan 16 2016 The implementation falls back to topNHeap whenever k is within the first...

Andrei Alexandrescu (3/13) Jan 16 2016 Correct. The pedantic approach would be to only do that when k is up to

Timon Gehr (7/21) Jan 16 2016 Ivan's analysis suggests that even something significantly larger, like

Andrei Alexandrescu (5/7) Jan 16 2016 I'm not clear on how you got to that boundary. There are a few

Ilya Yaroshenko (4/12) Jan 16 2016 size_t.sizeof * 8 - bsr(n) should be good approximation for
Ivan Kazmenko (10/18) Jan 17 2016 The average case is O(n + (k log n log k)) for small enough k.

Andrei Alexandrescu (12/14) Jan 17 2016 I don't understand how you derived the average case complexity, and I

Ivan Kazmenko (42/58) Jan 17 2016 Yeah, I have the same notion.

Andrei Alexandrescu (10/11) Jan 17 2016 OK, very nice. Thanks! I've modified topN to work as follows. In a loop:

Ivan Kazmenko (27/39) Jan 18 2016 Here goes the test which shows quadratic behavior for the new

Ivan Kazmenko (39/51) Jan 18 2016 Perhaps I should include a textual summary as well.
Ivan Kazmenko (5/9) Jan 18 2016 Correction: this is the result of removing a uniform call in pull

Ilya (3/12) Jan 18 2016 A RNGs don't improve worst case. It only changes an permutation

Ivan Kazmenko (4/6) Jan 18 2016 Still, use of RNG makes it impossible to construct the worst case

Andrei Alexandrescu (2/7) Jan 18 2016 BTW I forgot to thank you for this analysis. This is good stuff. -- Andr...
Ilya (3/9) Jan 18 2016 No, it is definitely possible, because RNGs are Pseudo-RNGs.

Andrei Alexandrescu (3/11) Jan 18 2016 unpredictableSeed uses the system clock as a source of randomness, so

Ilya (3/20) Jan 18 2016 Would this work for pure functions? --Ilya

Timon Gehr (2/17) Jan 18 2016 Only if they accept the RNG as an additional argument.

Ilya (3/20) Jan 18 2016 Exactly :) So, I hope we would not add Pseudo-RNGs in

Andrei Alexandrescu (27/42) Jan 18 2016 Of course not. I think this back-and-forth takes away from the gist of

Ilya (7/61) Jan 18 2016 (a) Hope no
Ivan Kazmenko (16/17) Jan 18 2016 No, sort utilizes Introsort (Quicksort but switch to Heapsort if

Andrei Alexandrescu (3/18) Jan 19 2016 Do you think sort and topN would be attackable if they used a

Ivan Kazmenko (17/20) Jan 19 2016 Yes, but with a little interactivity (not generating the input in

Andrei Alexandrescu (3/9) Jan 20 2016 [snip]

Xinok (29/59) Jan 18 2016 To be clear, sort is NOT attackable. Introsort is used for

Andrei Alexandrescu (11/28) Jan 19 2016 Sorry, my bad. Could you please point to the code doing the

Ivan Kazmenko (19/22) Jan 18 2016 Since point 4 is in fact already fixed a good while ago, my

Andrei Alexandrescu (10/29) Jan 19 2016 Yah, I also think heap/double-heap topN is better; median-of-medians

Andrei Alexandrescu (7/8) Jan 19 2016 Destroy! I made it part of

Timon Gehr (4/10) Jan 19 2016 The switching heuristic is bad. It always switches after at most 8 steps...

Andrei Alexandrescu (4/15) Jan 19 2016 My algebra is completely rotten. Thanks! Updated

Jon Degenhardt (25/31) Sep 21 2016 Not completely clear from this thread what the conclusion was wrt

Andrei Alexandrescu (3/10) Sep 21 2016 I have it on my list to move https://arxiv.org/abs/1606.00484 into

Jon Degenhardt (5/18) Sep 21 2016 Very good, thanks. It'll be interesting to see how this algorithm

Andrei Alexandrescu (4/20) Sep 21 2016 Preliminary results indicate that QuickselectAdaptive is about 10x

Andrei Alexandrescu (2/2) Sep 23 2016 Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528, which...

Stefan Koch (3/6) Sep 23 2016 annotate the templates.

Andrei Alexandrescu (9/14) Sep 23 2016 I don't want to lose the deduction. I used another workaround (enumerate...

Jon Degenhardt (6/10) Sep 23 2016 That's a very nice result. Both the absolute numbers and that all

Andrei Alexandrescu (129/132) Sep 24 2016 Got curious so I tried to patch my ldc installation (0.17.1 (DMD

Jon Degenhardt (18/33) Sep 25 2016 I added the topN pull request to a local LDC build (current LDC

Andrei Alexandrescu (2/35) Sep 25 2016 Thanks for this work! -- Andrei

Andrei Alexandrescu (2/7) Sep 23 2016 Nagonna work for generic functions -- Andrei

deadalnix (9/11) Jan 18 2016 I got problem with that even when crytographically secure

Andrei Alexandrescu (3/12) Jan 18 2016 How would this translate to a matter of selecting the pivot during sort?...

deadalnix (4/6) Jan 18 2016 A large chunk of a given datacenter going quadratic at the same

Andrei Alexandrescu (14/26) Jan 18 2016 Well it does improve things. The probability of hitting the worst case

Ilya (7/41) Jan 18 2016 1. Yes, probability of hitting the worst case repeatedly is is

Timon Gehr (2/8) Jan 18 2016 You also need to predict the seed. How do you do that?

Ilya (3/13) Jan 18 2016 We can not use unpredictable seed (like system clock) in pure

Timon Gehr (4/17) Jan 18 2016 Clearly. The point is, there already was an impure implementation of

Ilya (3/25) Jan 18 2016 There is already implementation with predictable seed. Proof:

Timon Gehr (3/7) Jan 18 2016 The default RNG is seeded with unpredictableSeed. What is the point you

Ilya (3/11) Jan 18 2016 unpredictableSeed is initialized only once and can be easily

Andrei Alexandrescu (9/22) Jan 18 2016 https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/rando...

Ilya (13/38) Jan 18 2016 A good variant with minimal overhead is to call unpredictableSeed
Ilya (16/41) Jan 18 2016 A good variant with minimal overhead is to call unpredictableSeed
Ilya (20/45) Jan 18 2016 A good variant with minimal overhead is to call unpredictableSeed

Timon Gehr (2/29) Jan 18 2016 https://en.wikipedia.org/wiki/Introselect
Ivan Kazmenko (19/34) Jan 18 2016 Possible, but only if the seed and previous usage of victim

Timon Gehr (3/5) Jan 18 2016 Yes, it selected the pivot uniformly at random using the global RNG.

Ivan Kazmenko (28/44) Jan 17 2016 I'd suggest being able to switch between implementations.

Andrei Alexandrescu (21/26) Jan 17 2016 A nice idea, but it seems a bit overengineered. The differences among

Ziad Hatahet via Digitalmars-d (3/4) Jan 16 2016 Why not the first 10?

Andrei Alexandrescu (2/6) Jan 16 2016 So you get to put the appropriate element in the 11th position. -- Andre...

Andrei Alexandrescu (4/14) Jan 16 2016 To clarify this using a degenerate case: say someone calls topN(r, 0)

Ivan Kazmenko (4/6) Jan 16 2016 And why this? Do we additionally require the k-th element to

Andrei Alexandrescu (2/7) Jan 16 2016 Yes. -- Andrei

deadalnix (4/23) Jan 17 2016 A common way to do it is to go quicksort, but only recurse on one

Andrei Alexandrescu (4/6) Jan 17 2016 Yah, that's quickselect (which this work started from). It's linear, and...

deadalnix (3/10) Jan 17 2016 Yeah; forget about me, I was dumb on that one.

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

https://github.com/D-Programming-Language/phobos/pull/3934

So, say you're looking for the smallest 10 elements out of 100_000. The 
quickselect algorithm (which topN currently uses) will successively 
partition the set in (assuming the pivot choice works well) 50_000, 
25_000, etc chunks all the way down to finding the smallest 10.

That's quite a bit of work, so 3934 uses an alternate strategy for 
finding the smallest 10:

1. Organize the first 11 elements into a max heap

2. Scan all other elements progressively. Whenever an element is found 
that is smaller than the largest in the heap, swap it with the largest 
in the heap then restore the heap property.

3. At the end, swap the largest in the heap with the 10th and you're done!

This is very effective, and is seldom referred in the selection 
literature. In fact, a more inefficient approach (heapify the entire 
range) is discussed more often.


Destroy!

Andrei

Jan 16 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu 
wrote:
 That's quite a bit of work, so 3934 uses an alternate strategy 
 for finding the smallest 10:

 1. Organize the first 11 elements into a max heap

 2. Scan all other elements progressively. Whenever an element 
 is found that is smaller than the largest in the heap, swap it 
 with the largest in the heap then restore the heap property.

 3. At the end, swap the largest in the heap with the 10th and 
 you're done!

 This is very effective, and is seldom referred in the selection 
 literature. In fact, a more inefficient approach (heapify the 
 entire range) is discussed more often.

Yeah, this sounds nice.  Suppose we have an array of length n and 
want to find the smallest k elements.  The first step (heapify) 
is O(k) operations.

1. Let us analyze the rest for "small enough" k.

1a. If the array is "random", the expected number of insertions 
is the sum of probabilities that each new element is inserted.  
This number is (k/(k+1)) + (k/(k+2)) + ... + k/n, which is k 
times a part of harmonic series, that is, on the order of k log 
n.  In each case, O(log k) operations are required to process the 
new element.  In total, the "average" case takes (n-k) + (k log n 
log k), a total of O(n) for small enough values of k.

1b. In the worst case (each element is less than the current 
top), this still requires (n-k) log k operations, so we can say 
the total is O(n log k).

2. For k of the same order as n, the algorithm is O(n log n), 
just as a regular HeapSort.

To summarize:
For k<<n we have O(n) average case and O(n log k) worst case.
For k~n we have O(n log n) as HeapSort.

Jan 16 2016

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:
 On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
 ...

 To summarize:
 For k<<n we have O(n) average case and O(n log k) worst case.
 For k~n we have O(n log n) as HeapSort.

The implementation falls back to topNHeap whenever k is within the first 
or last ~n/8 elements and therefore is Ω(n log n) on average.

Jan 16 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/16/16 8:00 PM, Timon Gehr wrote:
 On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:
 On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
 ...

 To summarize:
 For k<<n we have O(n) average case and O(n log k) worst case.
 For k~n we have O(n log n) as HeapSort.

 The implementation falls back to topNHeap whenever k is within the first
 or last ~n/8 elements and therefore is Ω(n log n) on average.

Correct. The pedantic approach would be to only do that when k is up to 
log(n). -- Andrei

Jan 16 2016

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:
 On 1/16/16 8:00 PM, Timon Gehr wrote:
 On 01/16/2016 10:28 PM, Ivan Kazmenko wrote:
 On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
 ...

 To summarize:
 For k<<n we have O(n) average case and O(n log k) worst case.
 For k~n we have O(n log n) as HeapSort.

 The implementation falls back to topNHeap whenever k is within the first
 or last ~n/8 elements and therefore is Ω(n log n) on average.

 Correct. The pedantic approach would be to only do that when k is up to
 log(n). -- Andrei

Ivan's analysis suggests that even something significantly larger, like 
n/log(n)² might work as an upper bound for k.
I don't think that meeting the documented runtime bounds amounts to 
pedantry. (Having linear average running time of course does not even 
imply linear expected running time, as is still written in the 
documentation.)

Jan 16 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/16/16 9:37 PM, Timon Gehr wrote:
 Ivan's analysis suggests that even something significantly larger, like
 n/log(n)² might work as an upper bound for k.

I'm not clear on how you got to that boundary. There are a few 
implementation details to be minded as well (quickly and carefully 
approximating the log etc). Could you please make a commented pull 
request against my own? Thanks! -- Andrei

Jan 16 2016

Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:

On Sunday, 17 January 2016 at 03:26:54 UTC, Andrei Alexandrescu 
wrote:
 On 1/16/16 9:37 PM, Timon Gehr wrote:
 Ivan's analysis suggests that even something significantly 
 larger, like
 n/log(n)² might work as an upper bound for k.

 I'm not clear on how you got to that boundary. There are a few 
 implementation details to be minded as well (quickly and 
 carefully approximating the log etc). Could you please make a 
 commented pull request against my own? Thanks! -- Andrei

size_t.sizeof * 8 - bsr(n) should be good approximation for 
sorting. --Ilya

Jan 16 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Sunday, 17 January 2016 at 03:26:54 UTC, Andrei Alexandrescu 
wrote:
 On 1/16/16 9:37 PM, Timon Gehr wrote:
 Ivan's analysis suggests that even something significantly 
 larger, like
 n/log(n)² might work as an upper bound for k.

 I'm not clear on how you got to that boundary. There are a few 
 implementation details to be minded as well (quickly and 
 carefully approximating the log etc). Could you please make a 
 commented pull request against my own? Thanks! -- Andrei

The average case is O(n + (k log n log k)) for small enough k.  
So, any k below roughly n / log^2 (n) makes the second summand 
less than the first.

Of course, the constant not present in asymptotic analysis, as 
well as handling the worst case, may make this notion 
impractical.  Measure is the way :) .  The worst case is just a 
decreasing sequence, or almost decreasing if we want to play 
against branch prediction as well.

Jan 17 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:
 The average case is O(n + (k log n log k)) for small enough k. So, any k
 below roughly n / log^2 (n) makes the second summand less than the first.

I don't understand how you derived the average case complexity, and I 
don't understand how you derived the bound for k.

Regarding the complexity: for all k (small or large), it takes O(k) to 
build a heap. Then in the worst case we do (n - k) heap insertions, i.e. 
total complexity is O(k + (n - k) * log(k)). Is that correct?

If k is any fraction of n, the worst case complexity comes to O(n + n 
log n) i.e. O(n log n). By your calculation, the average complexity 
comes to O(n log^2 n), i.e. greater than the worst case. So there's a 
mistake somewhere.

Could you please clarify matters? Thanks!


Andrei

Jan 17 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Sunday, 17 January 2016 at 16:06:31 UTC, Andrei Alexandrescu 
wrote:
 On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:
 The average case is O(n + (k log n log k)) for small enough k. 
 So, any k
 below roughly n / log^2 (n) makes the second summand less than 
 the first.

 I don't understand how you derived the average case complexity, 
 and I don't understand how you derived the bound for k.

 Regarding the complexity: for all k (small or large), it takes 
 O(k) to build a heap. Then in the worst case we do (n - k) heap 
 insertions, i.e. total complexity is O(k + (n - k) * log(k)). 
 Is that correct?

Yeah, I have the same notion.

 If k is any fraction of n, the worst case complexity comes to 
 O(n + n log n) i.e. O(n log n). By your calculation, the 
 average complexity comes to O(n log^2 n), i.e. greater than the 
 worst case. So there's a mistake somewhere.

The upper bound O(n + k log k log n) is correct, but it's tight 
only for small k.  I'll explain below.

 Could you please clarify matters? Thanks!

Here is a more verbose version.

Suppose for simplicity that the array consists real numbers of 
the same continuous random distribution.
(A discrete random distribution would complicate the analysis a 
bit, since the probability of two elements being equal would 
become greater than zero.)
Then, element a_i is the minimum of a_1, a_2, ..., a_i with 
probability 1/i.
Moreover, it is the second minimum with probability 1/i, the 
third minimum with probability 1/i, ..., the maximum of a_1, a_2, 
..., a_i with the same probability 1/i.
To prove this, we can show that, if we look at the ordering 
permutation p_1, p_2, ..., p_i such that of a_{p_1} < a_{p_2} < 
... a_{p_i}, each of the i! possible permutations is equally 
probable.

What immediately follows is that when we track k smallest 
elements and consider element i>k, the probability that it will 
change the k smallest elements encountered so far is k/i.
Summing that for i from k+1 to n, we get the expected number of 
heap insertions:
(k/(k+1)) + (k/(k+2)) + ... + k/n.

Now, this is just k multiplied by:
(1/1 + 1/2 + 1/3 + ... + 1/n) *minus*
(1/1 + 1/2 + 1/3 + ... + 1/k).
As these are harmonic series, the first line is asymptotically 
equal to log n, and the second line asymptotically equal to log k.
To summarize, the expected number of heap insertions for an array 
of random reals is k (log n - log k), and the complexity is O(n + 
k log k (log n - log k)).

What I did here for simplicity is just omit the "minus logarithm 
of k" part to get an upper bound.
For small enough k, the bound is tight.
For example, when k=sqrt(n), log n - log k = log(n/k) = (log n) / 
2.
On the other hand, when k=Theta(n), log n - log k = log(n/k) = 
Theta(1).

Ivan Kazmenko.

Jan 17 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/17/2016 03:32 PM, Ivan Kazmenko wrote:
 Here is a more verbose version.

OK, very nice. Thanks! I've modified topN to work as follows. In a loop:

* If nth <= r.length / log2(r.length)^^2 (or is similarly close to 
r.length), use topNHeap with one heap and stop
* If nth <= r.length / log2(r.length) (or is similarly close to 
r.length), use topNHeap with two heaps and stop
* Take a quickselect step, adjust r and nth, and continue

That seems to work nicely for all values involved.

All - let me know how things can be further improved. Thx!


Andrei

Jan 17 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu 
wrote:
 On 01/17/2016 03:32 PM, Ivan Kazmenko wrote:
 Here is a more verbose version.

 OK, very nice. Thanks! I've modified topN to work as follows. 
 In a loop:

 * If nth <= r.length / log2(r.length)^^2 (or is similarly close 
 to r.length), use topNHeap with one heap and stop
 * If nth <= r.length / log2(r.length) (or is similarly close to 
 r.length), use topNHeap with two heaps and stop
 * Take a quickselect step, adjust r and nth, and continue

 That seems to work nicely for all values involved.

 All - let me know how things can be further improved. Thx!


 Andrei

Here goes the test which shows quadratic behavior for the new 
version:
http://dpaste.dzfl.pl/e4b3bc26c3cf
(dpaste kills the slow code before it completes the task)

My local timings with "-O -release -noboundscheck -inline":
building Element array: 4989 msecs
checking Element array: 5018 msecs
checking int array: 626 msecs

With "-debug -unittest":
building Element array: 8384 msecs
checking Element array: 8380 msecs
checking int array: 2987 msecs

If we change the length MAX_N to something observable, say, 50, 
the array is:
[0, 536870911, 2, 536870911, 4, 536870911, 6, 36, 8, 33, 10, 35, 
12, 536870911, 14, 32, 16, 34, 18, 536870911, 536870911, 22, 23, 
21, 20, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 31, 30, 536870911, 29, 
536870911, 28, 536870911, 27, 536870911, 26, 536870911, 25, 
536870911, 24, 536870911]

The old version (2.070.0-b2) could not be tricked with it, does 
it use random?

The inspiration is the paper "A Killer Adversary for Quicksort":
http://www.cs.dartmouth.edu/~doug/mdmspe.pdf
(I already mentioned it on the forums a while ago)

Ivan Kazmenko.

Jan 18 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
 On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu 
 wrote:
 All - let me know how things can be further improved. Thx!

 Here goes the test which shows quadratic behavior for the new 
 version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)

 The inspiration is the paper "A Killer Adversary for Quicksort":
 http://www.cs.dartmouth.edu/~doug/mdmspe.pdf
 (I already mentioned it on the forums a while ago)

 Ivan Kazmenko.

Perhaps I should include a textual summary as well.

The code on DPaste starts by constructing an array of Elements of 
size MAX_N; in the code, MAX_N is 50_000.  After that, we run the 
victim function on our array.  Here, the victim is topN (array, 
MAX_N / 2); it could be sort (array) or something else.

An Element contains, or rather, pretends to contain, an int 
value.  Here is how Element is special: the result of comparison 
for two Elements is decided on-the-fly.  An Element can be either 
UNDECIDED or have a fixed value.  Initially, all elements are 
UNDECIDED.  When we compare two Elements and at least one of them 
has a fixed value, the comparison is resolved naturally, and 
UNDECIDED element is greater than any fixed element.  When we 
compare two UNDECIDED elements, the one which participated more 
in the comparisons so far gets a fixed value: greater than any 
other value fixed so far, but still less than UNDECIDED.  This 
way, the results of old comparisons are consistent with the new 
fixed value.

Now, what do we achieve by running the victim function?  Turns 
out that the algorithms using the idea of QuickSort or 
QuickSelect tend to make most comparisons against their current 
pivot value.  Our Element responds to that by fixing the pivot to 
one of the lowest available values.  After that, a partition 
using such pivot will have only few, O(1), elements before the 
pivot, and the rest after the pivot.  In total, this will lead to 
quadratic performance.

After running the victim function on our array of Elements (which 
- careful here - already takes quadratic time), we reorder them 
in their original order (to do that, each Element also stores its 
original index).

Now, we can re-run the algorithm on the array obtained so far.  
If the victim function is (strongly) pure, it will inevitably 
make the same comparisons in the same order.  The only difference 
is that their result will already be decided.

Alternatively, we can make an int array of the current values in 
our array of Elements (also in their original order).  Running 
the victim function on the int array must also make the same 
(quadratic number of) comparisons in the same order.

Ivan Kazmenko.

Jan 18 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
 Here goes the test which shows quadratic behavior for the new 
 version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)

Correction: this is the result of removing a uniform call in pull 
3921.  Since we are supposed to discuss the improvements related 
to pull 3934 here, I created a separate entry for the issue: 
https://issues.dlang.org/show_bug.cgi?id=15583.

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
 Here goes the test which shows quadratic behavior for the new 
 version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)

 Correction: this is the result of removing a uniform call in 
 pull 3921.  Since we are supposed to discuss the improvements 
 related to pull 3934 here, I created a separate entry for the 
 issue: https://issues.dlang.org/show_bug.cgi?id=15583.

A RNGs don't improve worst case. It only changes an permutation 
for worst case. --Ilya

Jan 18 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an permutation 
 for worst case. --Ilya

Still, use of RNG makes it impossible to construct the worst case 
beforehand, once and for all.  In that sense, this is a 
regression.

Jan 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/16 6:27 PM, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an permutation for
 worst case. --Ilya

 Still, use of RNG makes it impossible to construct the worst case
 beforehand, once and for all.  In that sense, this is a regression.

BTW I forgot to thank you for this analysis. This is good stuff. -- Andrei

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an 
 permutation for worst case. --Ilya

 Still, use of RNG makes it impossible to construct the worst 
 case beforehand, once and for all.  In that sense, this is a 
 regression.

No, it is definitely possible, because RNGs are Pseudo-RNGs. 
--Ilya

Jan 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an permutation for
 worst case. --Ilya

 Still, use of RNG makes it impossible to construct the worst case
 beforehand, once and for all.  In that sense, this is a regression.

 No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya

unpredictableSeed uses the system clock as a source of randomness, so 
we're good there. -- Andrei

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko 
 wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an 
 permutation for
 worst case. --Ilya

 Still, use of RNG makes it impossible to construct the worst 
 case
 beforehand, once and for all.  In that sense, this is a 
 regression.

 No, it is definitely possible, because RNGs are Pseudo-RNGs. 
 --Ilya

 unpredictableSeed uses the system clock as a source of 
 randomness, so we're good there. -- Andrei

Would this work for pure functions? --Ilya

Jan 18 2016

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/19/2016 12:51 AM, Ilya wrote:
 On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:
 On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an permutation for
 worst case. --Ilya

 Still, use of RNG makes it impossible to construct the worst case
 beforehand, once and for all.  In that sense, this is a regression.

 No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya

 unpredictableSeed uses the system clock as a source of randomness, so
 we're good there. -- Andrei

 Would this work for pure functions? --Ilya

Only if they accept the RNG as an additional argument.

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Monday, 18 January 2016 at 23:55:38 UTC, Timon Gehr wrote:
 On 01/19/2016 12:51 AM, Ilya wrote:
 On Monday, 18 January 2016 at 23:49:36 UTC, Andrei 
 Alexandrescu wrote:
 On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko 
 wrote:
 [...]

 No, it is definitely possible, because RNGs are Pseudo-RNGs. 
 --Ilya

 unpredictableSeed uses the system clock as a source of 
 randomness, so
 we're good there. -- Andrei

 Would this work for pure functions? --Ilya

 Only if they accept the RNG as an additional argument.

Exactly :) So, I hope we would not add Pseudo-RNGs in 
std.algorithm. --Ilya

Jan 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/16 6:51 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:
 On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an permutation for
 worst case. --Ilya

 Still, use of RNG makes it impossible to construct the worst case
 beforehand, once and for all.  In that sense, this is a regression.

 No, it is definitely possible, because RNGs are Pseudo-RNGs. --Ilya

 unpredictableSeed uses the system clock as a source of randomness, so
 we're good there. -- Andrei

 Would this work for pure functions? --Ilya

Of course not. I think this back-and-forth takes away from the gist of 
things. So let me summarize what has happened:

1. topN was reportedly slow. It was using a random pivot. I made it use 
getPivot (deterministic) instead of a random pivot in 
https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is 
also what sort uses.

2. Now that both functions use getPivot, I set out to improve it in 
https://github.com/D-Programming-Language/phobos/pull/3922. The problem 
is I couldn't make getPivot impure; pure functions already call sort, so 
making it impure would have been a regression.

3. So I made getPivot use just a bit of randomness taken from the length 
of the range.

4. sort was and is attackable before all of these changes

5. So now we have pure topN and sort (subject to the range offering pure 
primitives) but they are both attackable.

6. PRNGs don't have any other downside than making functions impure.

The way I see it we have these solutions:

(a) make topN still use a random pivot. That way there's no regression. 
Then proceed and make sort() avoid quadratic cases in other ways, e.g. 
switch to heapsort if performance degrades.

(b) Improve sort() first, then apply a similar strategy to improving 
topN. Do not use RNGs at all.

(c) Seek randomness in other places, e.g. address of elements, data 
hashes etc. Come with a solution that may still be attacked in narrow 
cases but make that unlikely enough to be a real concern.


Andrei

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 6:51 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:49:36 UTC, Andrei 
 Alexandrescu wrote:
 On 1/18/16 6:44 PM, Ilya wrote:
 On Monday, 18 January 2016 at 23:27:19 UTC, Ivan Kazmenko 
 wrote:
 On Monday, 18 January 2016 at 23:18:03 UTC, Ilya wrote:
 A RNGs don't improve worst case. It only changes an 
 permutation for
 worst case. --Ilya

 Still, use of RNG makes it impossible to construct the 
 worst case
 beforehand, once and for all.  In that sense, this is a 
 regression.

 No, it is definitely possible, because RNGs are Pseudo-RNGs. 
 --Ilya

 unpredictableSeed uses the system clock as a source of 
 randomness, so
 we're good there. -- Andrei

 Would this work for pure functions? --Ilya

 Of course not. I think this back-and-forth takes away from the 
 gist of things. So let me summarize what has happened:

 1. topN was reportedly slow. It was using a random pivot. I 
 made it use getPivot (deterministic) instead of a random pivot 
 in https://github.com/D-Programming-Language/phobos/pull/3921. 
 getPivot is also what sort uses.

 2. Now that both functions use getPivot, I set out to improve 
 it in 
 https://github.com/D-Programming-Language/phobos/pull/3922. The 
 problem is I couldn't make getPivot impure; pure functions 
 already call sort, so making it impure would have been a 
 regression.

 3. So I made getPivot use just a bit of randomness taken from 
 the length of the range.

 4. sort was and is attackable before all of these changes

 5. So now we have pure topN and sort (subject to the range 
 offering pure primitives) but they are both attackable.

 6. PRNGs don't have any other downside than making functions 
 impure.

 The way I see it we have these solutions:

 (a) make topN still use a random pivot. That way there's no 
 regression. Then proceed and make sort() avoid quadratic cases 
 in other ways, e.g. switch to heapsort if performance degrades.

 (b) Improve sort() first, then apply a similar strategy to 
 improving topN. Do not use RNGs at all.

 (c) Seek randomness in other places, e.g. address of elements, 
 data hashes etc. Come with a solution that may still be 
 attacked in narrow cases but make that unlikely enough to be a 
 real concern.


 Andrei

(a) Hope no
(b) Yes
(c) Memory addresses may not work with user defined ranges and 
hashes could be slow.

(d) Make PRNGs optional. --Ilya

Jan 18 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
 4. sort was and is attackable before all of these changes

No, sort utilizes Introsort (Quicksort but switch to Heapsort if 
recurse too deep): see 
https://github.com/D-Programming-Language/phobos/blob/2.067/std/algorithm/so
ting.d#L1048-L1053.  This improvement happened a year or two ago, when
algorithm.d was not even separated into sub-modules.

My adversary program (change "topN (a, MAX_N / 2)" to "sort (a)") 
admits that by not being able to slow the sort down.  But, if I 
comment out line 1053 to disable the switching, things go 
quadratic as expected.

L1053:    depth = depth >= depth.max / 2 ? (depth / 3) * 2 : 
(depth * 2) / 3;

The same remedy can help for topN: if we called partition more 
than log (length) in base (3/2) times, switch to the heap 
implementation of topN regardless of whether n is close to the 
edge.

Ivan Kazmenko.

Jan 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:
 On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
 4. sort was and is attackable before all of these changes

 No, sort utilizes Introsort (Quicksort but switch to Heapsort if recurse
 too deep): see
 https://github.com/D-Programming-Language/phobos/blob/2.067/std/algorithm/sorting.d#L1048-L1053.
 This improvement happened a year or two ago, when algorithm.d was not
 even separated into sub-modules.

 My adversary program (change "topN (a, MAX_N / 2)" to "sort (a)") admits
 that by not being able to slow the sort down.  But, if I comment out
 line 1053 to disable the switching, things go quadratic as expected.

 L1053:    depth = depth >= depth.max / 2 ? (depth / 3) * 2 : (depth * 2)
 / 3;

 The same remedy can help for topN: if we called partition more than log
 (length) in base (3/2) times, switch to the heap implementation of topN
 regardless of whether n is close to the edge.

Do you think sort and topN would be attackable if they used a 
per-process-seeded RNG as per Xinok's idea? -- Andrei

Jan 19 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Tuesday, 19 January 2016 at 13:52:08 UTC, Andrei Alexandrescu 
wrote:
 On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:
 Do you think sort and topN would be attackable if they used a 
 per-process-seeded RNG as per Xinok's idea? -- Andrei

Yes, but with a little interactivity (not generating the input in 
advance) and more labor (finding the state of RNG).

Suppose the adversary has the means to generate an input for 
sort/topN, inspect the output, and then generate a bad input.  
With the current (not-secure) RNGs, from the first two actions, 
they can learn the current state of RNG: with sort, the order of 
elements which compare as equal depends on the choice of pivot, 
and therefore exposes the random bits involved; with topN, the 
order of everything except what topN guarantees does the same 
thing.

After that, generating a bad input is definitely possible, since 
everything is known and deterministic at this point.  The 
previously discussed method can be used, or something more clever 
if they want to make it fast.

Ivan Kazmenko.

Jan 19 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/19/2016 08:08 PM, Ivan Kazmenko wrote:
 On Tuesday, 19 January 2016 at 13:52:08 UTC, Andrei Alexandrescu wrote:
 On 01/18/2016 09:21 PM, Ivan Kazmenko wrote:
 Do you think sort and topN would be attackable if they used a
 per-process-seeded RNG as per Xinok's idea? -- Andrei

 Yes, but with a little interactivity (not generating the input in
 advance) and more labor (finding the state of RNG).

[snip]

Thanks! -- Andrei

Jan 20 2016

Xinok <xinok live.com> writes:

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
 Of course not. I think this back-and-forth takes away from the 
 gist of things. So let me summarize what has happened:

 1. topN was reportedly slow. It was using a random pivot. I 
 made it use getPivot (deterministic) instead of a random pivot 
 in https://github.com/D-Programming-Language/phobos/pull/3921. 
 getPivot is also what sort uses.

 2. Now that both functions use getPivot, I set out to improve 
 it in 
 https://github.com/D-Programming-Language/phobos/pull/3922. The 
 problem is I couldn't make getPivot impure; pure functions 
 already call sort, so making it impure would have been a 
 regression.

 3. So I made getPivot use just a bit of randomness taken from 
 the length of the range.

 4. sort was and is attackable before all of these changes

 5. So now we have pure topN and sort (subject to the range 
 offering pure primitives) but they are both attackable.

 6. PRNGs don't have any other downside than making functions 
 impure.

 The way I see it we have these solutions:

 (a) make topN still use a random pivot. That way there's no 
 regression. Then proceed and make sort() avoid quadratic cases 
 in other ways, e.g. switch to heapsort if performance degrades.

 (b) Improve sort() first, then apply a similar strategy to 
 improving topN. Do not use RNGs at all.

 (c) Seek randomness in other places, e.g. address of elements, 
 data hashes etc. Come with a solution that may still be 
 attacked in narrow cases but make that unlikely enough to be a 
 real concern.


 Andrei

To be clear, sort is NOT attackable. Introsort is used for 
unstable sorting which begins with quick sort but falls back to 
heap sort after too many recursions to maintain O(n log n) 
running time. Timsort is used for stable sorting which is a 
variant of merge sort but still runs in O(n log n) time. In 
either case, you're guaranteed to have O(n log n) running time in 
the worst case.

On the other hand, someone can improve upon quick sort by 
choosing better pivots but be careful not to add too much 
overhead. A couple years ago, I tried choosing the pivot from a 
median of five but the result was as much as 10% slower. One idea 
is to begin initially choosing better pivots, but once the 
sublists fall below a certain length, simply choose the pivot 
from a median of three to avoid the extra overhead.

As for topN, there are two approaches I'm aware of that are 
deterministic without resorting to impurities or RNGs. The first 
is introselect which is similar to introsort in that it has a 
fall back algorithm. The other approach is to choose the pivot 
from a median of medians. My idea is a variant of the latter 
approach: Begin by choosing the pivot from a median of five. If 
you continuously choose bad pivots, take a larger sample of 
elements to choose the pivot from. Keep growing the sample as 
necessary.

Speaking of RNGs, they're technically pure as long as you always 
use the same seed. So what if we generated a random seed at the 
start of the process, but then reused that same seed over and 
over in pure functions for the duration of the process?

Jan 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/18/2016 09:42 PM, Xinok wrote:
 To be clear, sort is NOT attackable. Introsort is used for unstable
 sorting which begins with quick sort but falls back to heap sort after
 too many recursions to maintain O(n log n) running time. Timsort is used
 for stable sorting which is a variant of merge sort but still runs in
 O(n log n) time. In either case, you're guaranteed to have O(n log n)
 running time in the worst case.

Sorry, my bad. Could you please point to the code doing the 
introspection? I might do the same in topN.

 On the other hand, someone can improve upon quick sort by choosing
 better pivots but be careful not to add too much overhead. A couple
 years ago, I tried choosing the pivot from a median of five but the
 result was as much as 10% slower. One idea is to begin initially
 choosing better pivots, but once the sublists fall below a certain
 length, simply choose the pivot from a median of three to avoid the
 extra overhead.

That's what https://github.com/D-Programming-Language/phobos/pull/3922 
and in my measurements it's about as fast or faster than the current. 
Could you please re-run your measurements against that PR?

 Speaking of RNGs, they're technically pure as long as you always use the
 same seed. So what if we generated a random seed at the start of the
 process, but then reused that same seed over and over in pure functions
 for the duration of the process?

That's a great idea. The way D is defined, it already implies that 
"immutable" is process-immutable, not immutable across runs. Consider 
http://dpaste.dzfl.pl/2fa5baf0c149.

Very nice insights, Xinok. Thanks!


Andrei

Jan 19 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
 4. sort was and is attackable before all of these changes

 (b) Improve sort() first, then apply a similar strategy to 
 improving topN. Do not use RNGs at all.

Since point 4 is in fact already fixed a good while ago, my 
suggestion would be (b): to do the same for topN.

Introselect (https://en.wikipedia.org/wiki/Introselect), already 
mentioned in this thread, uses median-of-medians to achieve worst 
case O(n) performance if we recurse too deep.  There is already 
an issue suggesting to implement it: 
https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: 
implement deterministic topN).

In fact, the O(n log n) heap approach as it is implemented now 
could be faster than O(n) median-of-medians approach on 
reasonable inputs.  So, someone will have to implement 
median-of-medians and, well, measure.

At the very least, googling for "median of medians in practice" 
and such yields the tag wiki from StackOverflow.com: "The 
constant factor in the O(n) is large, and the algorithm is not 
commonly used in practice.".

Ivan Kazmenko.

Jan 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/18/2016 09:44 PM, Ivan Kazmenko wrote:
 On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
 4. sort was and is attackable before all of these changes

 (b) Improve sort() first, then apply a similar strategy to improving
 topN. Do not use RNGs at all.

 Since point 4 is in fact already fixed a good while ago, my suggestion
 would be (b): to do the same for topN.

 Introselect (https://en.wikipedia.org/wiki/Introselect), already
 mentioned in this thread, uses median-of-medians to achieve worst case
 O(n) performance if we recurse too deep.  There is already an issue
 suggesting to implement it:
 https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: implement
 deterministic topN).

 In fact, the O(n log n) heap approach as it is implemented now could be
 faster than O(n) median-of-medians approach on reasonable inputs.  So,
 someone will have to implement median-of-medians and, well, measure.

 At the very least, googling for "median of medians in practice" and such
 yields the tag wiki from StackOverflow.com: "The constant factor in the
 O(n) is large, and the algorithm is not commonly used in practice.".

Yah, I also think heap/double-heap topN is better; median-of-medians 
(MoM) is theoretically nice but practically less so. Though there are 
two things that could help MoM in D: (a) median of five (core part of 
MoM) is relatively cheap to compute with six comparisons; (b) we can use 
stride() to compute median without needing to copy a fifth of the input 
or to complicate the algorithm - just recurse on the stride! All in all 
this might be a good algo to implement and test.

So anyhow we need the "intro" part. I'll get to that soon.


Andrei

Jan 19 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:
 So anyhow we need the "intro" part. I'll get to that soon.

Destroy! I made it part of 
https://github.com/D-Programming-Language/phobos/pull/3934.

https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04

One interesting side topic would be how to avoid the use of goto without 
losing clarity.


Andrei

Jan 19 2016

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/19/2016 04:33 PM, Andrei Alexandrescu wrote:
 On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:
 So anyhow we need the "intro" part. I'll get to that soon.

 Destroy! I made it part of
 https://github.com/D-Programming-Language/phobos/pull/3934.

 https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04
 ...

The switching heuristic is bad. It always switches after at most 8 steps.

I'd just use the second heuristic discussed in 
https://en.wikipedia.org/wiki/Introselect .

Jan 19 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/19/2016 01:27 PM, Timon Gehr wrote:
 On 01/19/2016 04:33 PM, Andrei Alexandrescu wrote:
 On 01/19/2016 08:10 AM, Andrei Alexandrescu wrote:
 So anyhow we need the "intro" part. I'll get to that soon.

 Destroy! I made it part of
 https://github.com/D-Programming-Language/phobos/pull/3934.

 https://github.com/andralex/phobos/commit/4ba95cd1bd5124b53324c441e62c51d759481b04

 ...

 The switching heuristic is bad. It always switches after at most 8 steps.

My algebra is completely rotten. Thanks! Updated 
https://github.com/andralex/phobos/commit/cdd59b51a397ee1a4584e55c07d921c59acb5978.


Andrei

Jan 19 2016

Jon Degenhardt <jond noreply.com> writes:

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
 So let me summarize what has happened:

 1. topN was reportedly slow. It was using a random pivot. I 
 made it use getPivot (deterministic) instead of a random pivot 
 in https://github.com/D-Programming-Language/phobos/pull/3921. 
 getPivot is also what sort uses.

 [snip]

Not completely clear from this thread what the conclusion was wrt 
getting known topN performance issues addressed. From pull 
requests it appears identified fixes are in current release 
versions of the DMD/LDC. However, I hit significant issues on one 
of the first large data sets I tried. Not an artificial data, but 
one with very skewed distributions of values (a google ngram 
file).

Details here:  https://issues.dlang.org/show_bug.cgi?id=16517. 
Includes test program, url for the ngram file.

A brief summary - Data file is a TSV file with 3 numeric fields, 
a bit over 10 million values each with different distribution 
properties. Used both topN and sort get the median value. Since 
this was median, it topN for the mid-point value, not at one end 
or the other. (This is a specific callout for some of the issues 
identified.)

Timing comparison of sort and topN, times in milliseconds:

           sort      topN
Field 2:   289      1756
Field 3:   285    148793
Field 4:   273    668906

The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar 
behavior is seen for DMD 2.071.2. This makes topN pretty much 
unusable.

Sep 21 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/21/16 4:16 AM, Jon Degenhardt wrote:
 Timing comparison of sort and topN, times in milliseconds:

           sort      topN
 Field 2:   289      1756
 Field 3:   285    148793
 Field 4:   273    668906

 The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior
 is seen for DMD 2.071.2. This makes topN pretty much unusable.

I have it on my list to move https://arxiv.org/abs/1606.00484 into 
Phobos. Thanks for the data! -- Andrei

Sep 21 2016

Jon Degenhardt <jond noreply.com> writes:

On Wednesday, 21 September 2016 at 14:58:27 UTC, Andrei 
Alexandrescu wrote:
 On 9/21/16 4:16 AM, Jon Degenhardt wrote:
 Timing comparison of sort and topN, times in milliseconds:

           sort      topN
 Field 2:   289      1756
 Field 3:   285    148793
 Field 4:   273    668906

 The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar 
 behavior
 is seen for DMD 2.071.2. This makes topN pretty much unusable.

 I have it on my list to move https://arxiv.org/abs/1606.00484 
 into Phobos. Thanks for the data! -- Andrei

Very good, thanks. It'll be interesting to see how this algorithm 
does on this data set.

--Jon

Sep 21 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 09/21/2016 01:37 PM, Jon Degenhardt wrote:
 On Wednesday, 21 September 2016 at 14:58:27 UTC, Andrei Alexandrescu wrote:
 On 9/21/16 4:16 AM, Jon Degenhardt wrote:
 Timing comparison of sort and topN, times in milliseconds:

           sort      topN
 Field 2:   289      1756
 Field 3:   285    148793
 Field 4:   273    668906

 The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior
 is seen for DMD 2.071.2. This makes topN pretty much unusable.

 I have it on my list to move https://arxiv.org/abs/1606.00484 into
 Phobos. Thanks for the data! -- Andrei

 Very good, thanks. It'll be interesting to see how this algorithm does
 on this data set.

Preliminary results indicate that QuickselectAdaptive is about 10x 
faster than sort, net of data reading overheads, on the second-field 
data set. I'll proceed with submitting the algorithm in Phobos. -- Andrei

Sep 21 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528, which 
is quite a head-scratcher. Any ideas for a workaround? Thanks! -- Andrei

Sep 23 2016

Stefan Koch <uplink.coder googlemail.com> writes:

On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu 
wrote:
 Work is blocked by 
 https://issues.dlang.org/show_bug.cgi?id=16528, which is quite 
 a head-scratcher. Any ideas for a workaround? Thanks! -- Andrei

annotate the templates.

Sep 23 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 09/23/2016 11:40 AM, Stefan Koch wrote:
 On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu wrote:
 Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528,
 which is quite a head-scratcher. Any ideas for a workaround? Thanks!
 -- Andrei

 annotate the templates.

I don't want to lose the deduction. I used another workaround (enumerate 
the potential unsafe operations under an if (false) and then forward to 
a  trusted function) in https://github.com/dlang/phobos/pull/4815.

Reviews are welcome!

BTW, as I commented in https://issues.dlang.org/show_bug.cgi?id=16517, 
using the new topN implementation instead of sort to compute the median 
over google's 1-grams is over 11x faster using dmd.


Andrei

Sep 23 2016

Jon Degenhardt <jond noreply.com> writes:

On Friday, 23 September 2016 at 16:09:18 UTC, Andrei Alexandrescu 
wrote:
 BTW, as I commented in 
 https://issues.dlang.org/show_bug.cgi?id=16517, using the new 
 topN implementation instead of sort to compute the median over 
 google's 1-grams is over 11x faster using dmd.

That's a very nice result. Both the absolute numbers and that all 
three sets achieve similar performance, as they different 
distribution characteristics.

--Jon

Sep 23 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 09/23/2016 04:45 PM, Jon Degenhardt wrote:
 That's a very nice result. Both the absolute numbers and that all three
 sets achieve similar performance, as they different distribution
 characteristics.

Got curious so I tried to patch my ldc installation (0.17.1 (DMD 
v2.068.2, LLVM 3.8.0)), but sadly that didn't work out because sorting.d 
uses the new syntax in various places.

Probably the best baseline is the equivalent C++ code using the mature 
implementation of nth_element, see 
http://paste.ofcode.org/ieZPdchjzTXbDcG3LvsYBP (also pasted at the end 
of this message). Here's what I got: 
http://paste.ofcode.org/6feyBLRiJtieHdZ3bmGaUW, see also text below.

The topN code produced with dmd is faster in virtually all cases (has 
parity for -f 4, which I suspect is a degenerate case with most elements 
equal, which exercises only a small part of the algorithm). For C++ I 
used -O3 -DNDEBUG, any other flags I should use?

As an aside, I enjoy poking fun at the stunning inefficiency of 
iostreams in my public talks; they delivered again :o).


Andrei

=========================== shell
$ g++ -O3 -DNDEBUG -omedian_by_sort_cpp issue16517.cpp
$ g++ -O3 -DNDEBUG -DTOPN -omedian_by_topn_cpp issue16517.cpp
$ dmd -O -inline -release -ofmedian_by_sort -boundscheck=off 
issue16517.d
$ dmd -O -inline -release -version=topn -ofmedian_by_topn 
-boundscheck=off issue16517.d
$ cut -f 2  /tmp/googlebooks-eng-all-1gram-20120701-0 | 
./median_by_sort_cpp
median (via sort): 1972; lines: 10512769; total time (ms): 3419.11; sort 
time (ms): 314.086
$ cut -f 2  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort
median (via sort): 1972; lines: 10512769; total time (ms): 1462; sort 
time (ms): 399
$ cut -f 2  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp
median (via nth_element): 1972; lines: 10512769; total time (ms): 
3255.41; nth_element time (ms): 40.676
$ cut -f 2  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 1972; lines: 10512769; total time (ms): 1211; topN 
time (ms): 32
$ cut -f 3  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort_cpp
median (via sort): 3; lines: 10512769; total time (ms): 2949.11; sort 
time (ms): 297.814
$ cut -f 3  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort
median (via sort): 3; lines: 10512769; total time (ms): 1351; sort time 
(ms): 382
$ cut -f 3  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp
median (via nth_element): 3; lines: 10512769; total time (ms): 2316.75; 
nth_element time (ms): 65.041
$ cut -f 3  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 3; lines: 10512769; total time (ms): 973; topN time 
(ms): 59
$ cut -f 4  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort_cpp
median (via sort): 2; lines: 10512769; total time (ms): 2614.47; sort 
time (ms): 351.285
$ cut -f 4  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort
median (via sort): 2; lines: 10512769; total time (ms): 1269; sort time 
(ms): 361
$ cut -f 4  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp
median (via nth_element): 2; lines: 10512769; total time (ms): 2443.17; 
nth_element time (ms): 76.211
$ cut -f 4  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 2; lines: 10512769; total time (ms): 998; topN time 
(ms): 77
===========================

=========================== issue16517.cpp
#include <iostream>
#include <cstdio>
#include <ctime>
#include <vector>
#include <algorithm>

using namespace std;

int main()
{
     clock_t swTotal, swSort;
     swTotal = clock();
     vector<double> values;
     double num;
     while (cin >> num) {
         values.push_back(num);
     }

     size_t medianIndex = values.size() / 2;
     swSort = clock();
     #ifdef TOPN
         const char* method = "nth_element";
         nth_element(values.begin(), values.begin() + medianIndex, 
values.end());
     #else
         const char* method = "sort";
         sort(values.begin(), values.end());
     #endif
     clock_t done = clock();

     printf("median (via %s): %g; lines: %zu; total time (ms): %g; %s 
time (ms): %g\n",
              method, values[medianIndex], values.size(),
              (done - swTotal) * 1000. / CLOCKS_PER_SEC,
              method, (done - swSort) * 1000. / CLOCKS_PER_SEC);
}
===========================

=========================== issue16517.d
import std.stdio;
import std.array : appender;
import std.algorithm : sort, topN;
import std.conv;
import std.range;
import std.datetime: StopWatch;

void main()
{
     StopWatch swTotal, swSort;
     swTotal.start;
     Appender!(double[]) values;
     foreach (line; stdin.byLine)
         values.put(line.to!double);

     size_t medianIndex = values.data.length/2;
     swSort.start;
     version(topn)
     {
         topN(values.data, medianIndex);
         auto method = "topN";
     }
     else {
         sort(values.data);
         auto method = "sort";
     }
     swTotal.stop;
     swSort.stop;

     writefln("median (via %s): %g; lines: %d; total time (ms): %d; %s 
time (ms): %d",
              method, values.data[medianIndex], values.data.length, 
swTotal.peek.msecs,
              method, swSort.peek.msecs);
}
===========================

Sep 24 2016

Jon Degenhardt <jond noreply.com> writes:

On Saturday, 24 September 2016 at 18:22:51 UTC, Andrei 
Alexandrescu wrote:
 Got curious so I tried to patch my ldc installation (0.17.1 
 (DMD v2.068.2, LLVM 3.8.0)), but sadly that didn't work out 
 because sorting.d uses the new syntax in various places.

 Probably the best baseline is the equivalent C++ code using the 
 mature implementation of nth_element, see 
 http://paste.ofcode.org/ieZPdchjzTXbDcG3LvsYBP (also pasted at 
 the end of this message). Here's what I got: 
 http://paste.ofcode.org/6feyBLRiJtieHdZ3bmGaUW, see also text 
 below.

 The topN code produced with dmd is faster in virtually all 
 cases (has parity for -f 4, which I suspect is a degenerate 
 case with most elements equal, which exercises only a small 
 part of the algorithm). For C++ I used -O3 -DNDEBUG, any other 
 flags I should use?

 [snip]

I added the topN pull request to a local LDC build (current LDC 
master). Also built the C++ nth_element and sort program you 
listed. (GCC 4.9.3, compiler line: g++-mp-4.9 -O3 -static-libgcc 
-static-libstdc++ -std=c++11).

Did 7 runs for each variant on the three fields in ngram file. 
Median times are below.

Median sort times (ms):

      Field 2   Field 3   Field 4
DMD      351       348       326
LDC      273       261       245
C++      281       260       246

Median topN / nth_element times (ms):

      Field 2   Field 3   Field 4
LDC       21        44        57
C++       41        60        71

Looking very good indeed!

Sep 25 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 9/25/16 4:19 AM, Jon Degenhardt wrote:
 On Saturday, 24 September 2016 at 18:22:51 UTC, Andrei Alexandrescu wrote:
 Got curious so I tried to patch my ldc installation (0.17.1 (DMD
 v2.068.2, LLVM 3.8.0)), but sadly that didn't work out because
 sorting.d uses the new syntax in various places.

 Probably the best baseline is the equivalent C++ code using the mature
 implementation of nth_element, see
 http://paste.ofcode.org/ieZPdchjzTXbDcG3LvsYBP (also pasted at the end
 of this message). Here's what I got:
 http://paste.ofcode.org/6feyBLRiJtieHdZ3bmGaUW, see also text below.

 The topN code produced with dmd is faster in virtually all cases (has
 parity for -f 4, which I suspect is a degenerate case with most
 elements equal, which exercises only a small part of the algorithm).
 For C++ I used -O3 -DNDEBUG, any other flags I should use?

 [snip]

 I added the topN pull request to a local LDC build (current LDC master).
 Also built the C++ nth_element and sort program you listed. (GCC 4.9.3,
 compiler line: g++-mp-4.9 -O3 -static-libgcc -static-libstdc++ -std=c++11).

 Did 7 runs for each variant on the three fields in ngram file. Median
 times are below.

 Median sort times (ms):

      Field 2   Field 3   Field 4
 DMD      351       348       326
 LDC      273       261       245
 C++      281       260       246

 Median topN / nth_element times (ms):

      Field 2   Field 3   Field 4
 LDC       21        44        57
 C++       41        60        71

 Looking very good indeed!

Thanks for this work! -- Andrei

Sep 25 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 09/23/2016 11:40 AM, Stefan Koch wrote:
 On Friday, 23 September 2016 at 15:30:20 UTC, Andrei Alexandrescu wrote:
 Work is blocked by https://issues.dlang.org/show_bug.cgi?id=16528,
 which is quite a head-scratcher. Any ideas for a workaround? Thanks!
 -- Andrei

 annotate the templates.

Nagonna work for generic functions -- Andrei

Sep 23 2016

deadalnix <deadalnix gmail.com> writes:

On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu 
wrote:
 unpredictableSeed uses the system clock as a source of 
 randomness, so we're good there. -- Andrei

I got problem with that even when crytographically secure 
randomness wasn't needed more than once. A specific case included 
adding jitter on some timeout to avoid all hosts to expire at the 
same time, and used mersenne twister with a pseudo random seed. 
There still were way too many hosts ending up with the same 
timeout.

System time is not enough.

Jan 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/16 6:55 PM, deadalnix wrote:
 On Monday, 18 January 2016 at 23:49:36 UTC, Andrei Alexandrescu wrote:
 unpredictableSeed uses the system clock as a source of randomness, so
 we're good there. -- Andrei

 I got problem with that even when crytographically secure randomness
 wasn't needed more than once. A specific case included adding jitter on
 some timeout to avoid all hosts to expire at the same time, and used
 mersenne twister with a pseudo random seed. There still were way too
 many hosts ending up with the same timeout.

 System time is not enough.

How would this translate to a matter of selecting the pivot during sort? 
-- Andrei

Jan 18 2016

deadalnix <deadalnix gmail.com> writes:

On Tuesday, 19 January 2016 at 00:17:30 UTC, Andrei Alexandrescu 
wrote:
 How would this translate to a matter of selecting the pivot 
 during sort? -- Andrei

A large chunk of a given datacenter going quadratic at the same 
time.

Jan 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/16 6:18 PM, Ilya wrote:
 On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
 Here goes the test which shows quadratic behavior for the new version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)

 Correction: this is the result of removing a uniform call in pull
 3921.  Since we are supposed to discuss the improvements related to
 pull 3934 here, I created a separate entry for the issue:
 https://issues.dlang.org/show_bug.cgi?id=15583.

 A RNGs don't improve worst case. It only changes an permutation for
 worst case. --Ilya

Well it does improve things. The probability of hitting the worst case 
repeatedly is practically zero, and it's impossible to create an attack 
input.

I'm not sure whether we should worry about this. Probably we do because 
sort attacks are a press favorite.

The nice thing about not relying on randomness is that pure functions 
can call sort, topN etc.

As a sort of a compromise I was thinking of seeding the RNG with not 
only the length of the range, but also the integral representation of 
the address of the first element. This would still allow an attack if 
the input is always at the same address.


Thoughts?

Andrei

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Monday, 18 January 2016 at 23:39:02 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 6:18 PM, Ilya wrote:
 On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko 
 wrote:
 On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko 
 wrote:
 Here goes the test which shows quadratic behavior for the 
 new version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)

 Correction: this is the result of removing a uniform call in 
 pull
 3921.  Since we are supposed to discuss the improvements 
 related to
 pull 3934 here, I created a separate entry for the issue:
 https://issues.dlang.org/show_bug.cgi?id=15583.

 A RNGs don't improve worst case. It only changes an 
 permutation for
 worst case. --Ilya

 Well it does improve things. The probability of hitting the 
 worst case repeatedly is practically zero, and it's impossible 
 to create an attack input.

 I'm not sure whether we should worry about this. Probably we do 
 because sort attacks are a press favorite.

 The nice thing about not relying on randomness is that pure 
 functions can call sort, topN etc.

 As a sort of a compromise I was thinking of seeding the RNG 
 with not only the length of the range, but also the integral 
 representation of the address of the first element. This would 
 still allow an attack if the input is always at the same 
 address.


 Thoughts?

 Andrei

1. Yes, probability of hitting the worst case repeatedly is is 
practically zero. But RNGs do not change this probability.
2. It is possible to build attack for our RNGs, because they are 
Pseudo-RNGs.
--Ilya

Jan 18 2016

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/19/2016 12:50 AM, Ilya wrote:
 ...

 1. Yes, probability of hitting the worst case repeatedly is is
 practically zero. But RNGs do not change this probability.
 2. It is possible to build attack for our RNGs, because they are
 Pseudo-RNGs.
 --Ilya

You also need to predict the seed. How do you do that?

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:
 On 01/19/2016 12:50 AM, Ilya wrote:
 ...

 1. Yes, probability of hitting the worst case repeatedly is is
 practically zero. But RNGs do not change this probability.
 2. It is possible to build attack for our RNGs, because they 
 are
 Pseudo-RNGs.
 --Ilya

 You also need to predict the seed. How do you do that?

We can not use unpredictable seed (like system clock) in pure 
functions. --Ilya

Jan 18 2016

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/19/2016 12:55 AM, Ilya wrote:
 On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:
 On 01/19/2016 12:50 AM, Ilya wrote:
 ...

 1. Yes, probability of hitting the worst case repeatedly is is
 practically zero. But RNGs do not change this probability.
 2. It is possible to build attack for our RNGs, because they are
 Pseudo-RNGs.
 --Ilya

 You also need to predict the seed. How do you do that?

 We can not use unpredictable seed (like system clock) in pure functions.
 --Ilya

Clearly. The point is, there already was an impure implementation of 
topN whose expected linear running time is still specified in the 
documentation. The current implementation does not fulfill this bound.

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Tuesday, 19 January 2016 at 00:02:08 UTC, Timon Gehr wrote:
 On 01/19/2016 12:55 AM, Ilya wrote:
 On Monday, 18 January 2016 at 23:53:53 UTC, Timon Gehr wrote:
 On 01/19/2016 12:50 AM, Ilya wrote:
 ...

 1. Yes, probability of hitting the worst case repeatedly is 
 is
 practically zero. But RNGs do not change this probability.
 2. It is possible to build attack for our RNGs, because they 
 are
 Pseudo-RNGs.
 --Ilya

 You also need to predict the seed. How do you do that?

 We can not use unpredictable seed (like system clock) in pure 
 functions.
 --Ilya

 Clearly. The point is, there already was an impure 
 implementation of topN whose expected linear running time is 
 still specified in the documentation. The current 
 implementation does not fulfill this bound.

There is already implementation with predictable seed. Proof: 
https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
--Ilya

Jan 18 2016

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
 --Ilya

The default RNG is seeded with unpredictableSeed. What is the point you 
are trying to make?

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
 On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151
 --Ilya

 The default RNG is seeded with unpredictableSeed. What is the 
 point you are trying to make?

unpredictableSeed is initialized only once and can be easily 
estimated. --Ilya

Jan 18 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/18/16 7:46 PM, Ilya wrote:
 On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
 On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

 --Ilya

 The default RNG is seeded with unpredictableSeed. What is the point
 you are trying to make?

 unpredictableSeed is initialized only once and can be easily estimated.
 --Ilya

https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120

unpredictableSeed uses MinstdRand0. The sources of randomness used for 
initializing MinstdRand0 are the PID, the thread ID, and the system time 
at the moment of the seeding. Then at each subsequent call to 
unpredictableSeed, the time of the call is XORed with the current value 
of the MinstdRand0.

How do you think things could be improved?


Andrei

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 7:46 PM, Ilya wrote:
 On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
 On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

 --Ilya

 The default RNG is seeded with unpredictableSeed. What is the 
 point
 you are trying to make?

 unpredictableSeed is initialized only once and can be easily 
 estimated.
 --Ilya

 https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120

 unpredictableSeed uses MinstdRand0. The sources of randomness 
 used for initializing MinstdRand0 are the PID, the thread ID, 
 and the system time at the moment of the seeding. Then at each 
 subsequent call to unpredictableSeed, the time of the call is 
 XORed with the current value of the MinstdRand0.

 How do you think things could be improved?


 Andrei

A good variant with minimal overhead is to call unpredictableSeed 
for sorting big arrays each time (one seed per array):
    a. A hacker would not able to estimate a seed using other API 
calls. For example, "give me a set of random numbers".
    b. A hacker would not be able to estimate a seed using a small 
arrays because they don't use RNGs. (and they have not any 
overhead!)
    c. A hacker would not be able to estimate a seed for big 
arrays, because attack based on time measurement would not work 
for big arrays.

Ilya

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 7:46 PM, Ilya wrote:
 On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
 On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

 --Ilya

 The default RNG is seeded with unpredictableSeed. What is the 
 point
 you are trying to make?

 unpredictableSeed is initialized only once and can be easily 
 estimated.
 --Ilya

 https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120

 unpredictableSeed uses MinstdRand0. The sources of randomness 
 used for initializing MinstdRand0 are the PID, the thread ID, 
 and the system time at the moment of the seeding. Then at each 
 subsequent call to unpredictableSeed, the time of the call is 
 XORed with the current value of the MinstdRand0.

 How do you think things could be improved?


 Andrei

A good variant with minimal overhead is to call unpredictableSeed 
for sorting big arrays each time (one seed per array):
    a. A hacker would not able to estimate a seed using other API 
calls. For example, "give me a set of random numbers".
    b. A hacker would not be able to estimate a seed using a small 
arrays because they don't use RNGs. (and they have not any 
overhead!)
    c. A hacker would not be able to estimate a seed for big 
arrays, because attack based on time measurement would not work 
for big arrays.

EDIT:
   c. ... because each big array has own seed and time measurement 
would not work for big arrays with different seeds.

Ilya

Jan 18 2016

Ilya <ilyayaroshenko gmail.com> writes:

On Tuesday, 19 January 2016 at 01:04:03 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 7:46 PM, Ilya wrote:
 On Tuesday, 19 January 2016 at 00:38:14 UTC, Timon Gehr wrote:
 On 01/19/2016 01:12 AM, Ilya wrote:

 There is already implementation with predictable seed. Proof:
 https://github.com/D-Programming-Language/phobos/blob/master/std/random.d#L1151

 --Ilya

 The default RNG is seeded with unpredictableSeed. What is the 
 point
 you are trying to make?

 unpredictableSeed is initialized only once and can be easily 
 estimated.
 --Ilya

 https://github.com/D-Programming-Language/phobos/blob/v2.069.2/std/random.d#L1120

 unpredictableSeed uses MinstdRand0. The sources of randomness 
 used for initializing MinstdRand0 are the PID, the thread ID, 
 and the system time at the moment of the seeding. Then at each 
 subsequent call to unpredictableSeed, the time of the call is 
 XORed with the current value of the MinstdRand0.

 How do you think things could be improved?


 Andrei

A good variant with minimal overhead is to call unpredictableSeed 
for sorting big arrays each time (one seed per array):
    a. A hacker would not be able to estimate a seed using other 
API calls. For example, "give me a set of random numbers".
    b. A hacker would not be able to estimate a seed using a small 
arrays because they don't use RNGs. (and they have not any 
overhead!)
    c. A hacker would not be able to estimate a seed for big 
arrays, because attack based on time measurement would not work 
for big arrays.

EDIT:
   c. ... because each big array has own seed and time measurement 
would not work for big arrays with different seeds.

__EDIT2__:
To be fair (c) is not really correct. To do (c) correct a thread 
local global seed should be used to generate unpredictableSeed 
for each big arrays.

Ilya

Jan 18 2016

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/19/2016 12:39 AM, Andrei Alexandrescu wrote:
 On 1/18/16 6:18 PM, Ilya wrote:
 On Monday, 18 January 2016 at 20:45:56 UTC, Ivan Kazmenko wrote:
 On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
 Here goes the test which shows quadratic behavior for the new version:
 http://dpaste.dzfl.pl/e4b3bc26c3cf
 (dpaste kills the slow code before it completes the task)

 Correction: this is the result of removing a uniform call in pull
 3921.  Since we are supposed to discuss the improvements related to
 pull 3934 here, I created a separate entry for the issue:
 https://issues.dlang.org/show_bug.cgi?id=15583.

 A RNGs don't improve worst case. It only changes an permutation for
 worst case. --Ilya

 Well it does improve things. The probability of hitting the worst case
 repeatedly is practically zero, and it's impossible to create an attack
 input.

 I'm not sure whether we should worry about this. Probably we do because
 sort attacks are a press favorite.

 The nice thing about not relying on randomness is that pure functions
 can call sort, topN etc.

 As a sort of a compromise I was thinking of seeding the RNG with not
 only the length of the range, but also the integral representation of
 the address of the first element. This would still allow an attack if
 the input is always at the same address.


 Thoughts?

 Andrei

https://en.wikipedia.org/wiki/Introselect

Jan 18 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Monday, 18 January 2016 at 23:39:02 UTC, Andrei Alexandrescu 
wrote:
 On 1/18/16 6:18 PM, Ilya wrote:
 A RNGs don't improve worst case. It only changes an 
 permutation for worst case. --Ilya

 Well it does improve things. The probability of hitting the 
 worst case repeatedly is practically zero, and it's impossible 
 to create an attack input.

Possible, but only if the seed and previous usage of victim 
program's global RNG are also known to the attacker... at which 
point the victim may well have a bigger problem than just 
quadratic topN.

 I'm not sure whether we should worry about this. Probably we do 
 because sort attacks are a press favorite.

 The nice thing about not relying on randomness is that pure 
 functions can call sort, topN etc.

Is it feasible to have two overloads of sort and/or topN, one 
pure and the other with better complexity guarantees?

 As a sort of a compromise I was thinking of seeding the RNG 
 with not only the length of the range, but also the integral 
 representation of the address of the first element. This would 
 still allow an attack if the input is always at the same 
 address.

Hmm.  Isn't using any pointers still breaking strong purity?  Say 
the GC moved our array, and it now has a different address.  A 
strongly pure function would order the elements exactly the same 
then.

The same goes about sorting, where the thing depending on pivot 
choice is the relative order of "equal" elements.

 From a theoretical standpoint (not taking current D purity rules 
into account), I'd say using a pointer (which may be modified by 
GC) is as pure as just allowing a static RNG (a global one, or 
even another instance dedicated specifically to sort/topN).

Ivan Kazmenko.

Jan 18 2016

Timon Gehr <timon.gehr gmx.ch> writes:

On 01/18/2016 01:00 PM, Ivan Kazmenko wrote:
 The old version (2.070.0-b2) could not be tricked with it, does it use
 random?

Yes, it selected the pivot uniformly at random using the global RNG. 
(This is also why the documentation says topN is O(n) in expectation.)

Jan 18 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Sunday, 17 January 2016 at 02:37:48 UTC, Timon Gehr wrote:
 On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:
 On 1/16/16 8:00 PM, Timon Gehr wrote:
 The implementation falls back to topNHeap whenever k is 
 within the first
 or last ~n/8 elements and therefore is Ω(n log n) on average.

 Correct. The pedantic approach would be to only do that when k 
 is up to
 log(n). -- Andrei

 Ivan's analysis suggests that even something significantly 
 larger, like n/log(n)² might work as an upper bound for k.
 I don't think that meeting the documented runtime bounds 
 amounts to pedantry. (Having linear average running time of 
 course does not even imply linear expected running time, as is 
 still written in the documentation.)

I'd suggest being able to switch between implementations.

Recall that, when sorting, we have SwapStrategy.stable which is, 
well, stable, but additionally guarantees n log n with a larger 
constant (MergeSort).

And we have SwapStrategy.unstable which uses Introsort, which, in 
turn, has smaller constant on sane inputs.

Here, Andrei's suggestion is to also have two implementations, 
let us call them TopNStrategy.quickSelect and TopNStrategy.heap.

The quickSelect one is O(n) on sane inputs but O(n^2) on an 
artificial worst case.

The heap implementation is also O(n) when k is close to 0 or n, 
but O(n log n) otherwise.  What's important is that it is also 
O(n log n) worst case.

The current proposal switches between strategies based on a 
heuristic (k < n/8 or k > 7n/8 if I understand correctly), which 
may not be the best strategy in all cases.

So, what can be done is to introduce TopNStrategy.auto which is 
the default and uses the (current or better) heuristic to switch 
between strategies, but also leave a way to explicitly select one 
of the two strategies, just like one can do now with sort!(..., 
SwapStrategy.whatWeExplicitlyWant).

All the code for both strategies is already there.  Each strategy 
has its benefits not covered by the other (quickSelect is faster 
for k~n and sane inputs, heap is faster for k close to boundary 
and special inputs).  So, provide the default but let the user 
choose.

Ivan Kazmenko.

Jan 17 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 01/17/2016 06:55 AM, Ivan Kazmenko wrote:
 So, what can be done is to introduce TopNStrategy.auto which is the
 default and uses the (current or better) heuristic to switch between
 strategies, but also leave a way to explicitly select one of the two
 strategies, just like one can do now with sort!(...,
 SwapStrategy.whatWeExplicitlyWant).

A nice idea, but it seems a bit overengineered. The differences among 
approaches are rather subtle and explaining the circumstances under 
which one does better than the other is about as difficult as making the 
choice in the implementation.

BTW there's yet another approach:

1. Create a max heap for r[0 .. nth]

2. Create a min heap for r[nth .. $]

3. As long as r[nth] < r[0], swap them and restore the heap property in 
the two heaps

At the end of the process we have the smallest element in r[nth .. $], 
which is exactly in the r[nth] position, greater than or equal to the 
largest element in r[0 .. nth], which is exactly what the doctor prescribed.

Complexity of this turns rather nice. Step 1 is O(k), step 2 is O(n - 
k). In the worst case (r is sorted descending), step 3 will stop after k 
iterations and does k heap insertions in the two heaps. So overall 
complexity is O(n + k log(k) + k log(n)). Since k <= n, keep the largest 
terms: O(n + k log(n)). So if k is proportional to n / log(n), we get 
O(n). And that's worst case!

BTW I figured how to do stable partition. That'll come in a distinct PR.


Andrei

Jan 17 2016

Ziad Hatahet via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 1. Organize the first 11 elements into a max heap

Why not the first 10?

Jan 16 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote:
 On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d
 <digitalmars-d puremagic.com <mailto:digitalmars-d puremagic.com>> wrote:


     1. Organize the first 11 elements into a max heap


 Why not the first 10?

So you get to put the appropriate element in the 11th position. -- Andrei

Jan 16 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/16/16 9:12 PM, Andrei Alexandrescu wrote:
 On 1/16/16 4:58 PM, Ziad Hatahet via Digitalmars-d wrote:
 On Sat, Jan 16, 2016 at 7:25 AM, Andrei Alexandrescu via Digitalmars-d
 <digitalmars-d puremagic.com <mailto:digitalmars-d puremagic.com>> wrote:


     1. Organize the first 11 elements into a max heap


 Why not the first 10?

 So you get to put the appropriate element in the 11th position. -- Andrei

To clarify this using a degenerate case: say someone calls topN(r, 0) 
i.e. find the minimum. Then you'd need a mini-heap of exactly one 
element to track the smallest element found so far. -- Andrei

Jan 16 2016

Ivan Kazmenko <gassa mail.ru> writes:

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu 
wrote:
 3. At the end, swap the largest in the heap with the 10th and 
 you're done!

And why this?  Do we additionally require the k-th element to 
arrive exactly on k-th place?

Jan 16 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/16/16 5:27 PM, Ivan Kazmenko wrote:
 On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu wrote:
 3. At the end, swap the largest in the heap with the 10th and you're
 done!

 And why this?  Do we additionally require the k-th element to arrive
 exactly on k-th place?

Yes. -- Andrei

Jan 16 2016

deadalnix <deadalnix gmail.com> writes:

On Saturday, 16 January 2016 at 15:25:50 UTC, Andrei Alexandrescu 
wrote:
 https://github.com/D-Programming-Language/phobos/pull/3934

 So, say you're looking for the smallest 10 elements out of 
 100_000. The quickselect algorithm (which topN currently uses) 
 will successively partition the set in (assuming the pivot 
 choice works well) 50_000, 25_000, etc chunks all the way down 
 to finding the smallest 10.

 That's quite a bit of work, so 3934 uses an alternate strategy 
 for finding the smallest 10:

 1. Organize the first 11 elements into a max heap

 2. Scan all other elements progressively. Whenever an element 
 is found that is smaller than the largest in the heap, swap it 
 with the largest in the heap then restore the heap property.

 3. At the end, swap the largest in the heap with the 10th and 
 you're done!

 This is very effective, and is seldom referred in the selection 
 literature. In fact, a more inefficient approach (heapify the 
 entire range) is discussed more often.


 Destroy!

 Andrei

A common way to do it is to go quicksort, but only recurse on one 
side of the set. That should give log(n)^2 complexity on average.

Jan 17 2016

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/17/16 8:07 PM, deadalnix wrote:
 A common way to do it is to go quicksort, but only recurse on one side
 of the set. That should give log(n)^2 complexity on average.

Yah, that's quickselect (which this work started from). It's linear, and 
you can't get top n in sublinear time because you need to look at all 
elements. -- Andrei

Jan 17 2016

deadalnix <deadalnix gmail.com> writes:

On Monday, 18 January 2016 at 01:38:16 UTC, Andrei Alexandrescu 
wrote:
 On 1/17/16 8:07 PM, deadalnix wrote:
 A common way to do it is to go quicksort, but only recurse on 
 one side
 of the set. That should give log(n)^2 complexity on average.

 Yah, that's quickselect (which this work started from). It's 
 linear, and you can't get top n in sublinear time because you 
 need to look at all elements. -- Andrei

Yeah; forget about me, I was dumb on that one.

Jan 17 2016

D Programming

C/C++ Programming

Other

digitalmars.D - topN using a heap