digitalmars.D.learn - What is the use case of RandomCover?
- Ivan Kazmenko (87/87) Feb 18 2013 I'm unsure whether I should post into the ".learn" sub-forum or
- Ivan Kazmenko (5/10) Feb 18 2013 Sorry, this part doesn't look clear. O(n^2) total with O(n^2)
- Joseph Rushton Wakeling (3/10) Feb 18 2013 Surely, because randomShuffle is re-ordering the elements in-place, wher...
- Ivan Kazmenko (15/17) Feb 19 2013 Sorry, but that does not answer my question. If "in-place
- monarch_dodra (70/87) Feb 19 2013 Hum... randomShuffle and randomSample actually have nothing to do
- Ivan Kazmenko (22/46) Feb 19 2013 I'd like to note that my post is about randomCover, not
- Joseph Rushton Wakeling (13/17) Feb 19 2013 To be honest, I think RandomCover is (statistically) unsafe to use right...
- monarch_dodra (6/17) Feb 19 2013 If it keeps a copy, how it is passed (value or ref) actually
- Joseph Rushton Wakeling (8/10) Feb 21 2013 Yes, I didn't explain myself well there. The real problem is that copyi...
- Ivan Kazmenko (10/13) Feb 19 2013 Given that the lazy version will hardly be possible without
- monarch_dodra (31/83) Feb 19 2013 Extra documentation never hurts, but keep in mind that a ton of
- Ivan Kazmenko (13/29) Feb 19 2013 Oops indeed!
- monarch_dodra (8/33) Feb 19 2013 In theory, there should be no reproducibility breakage, as the
- Ivan Kazmenko (14/18) Feb 19 2013 Indeed, I just looked at Boost and C++11 implementations, and
- Ivan Kazmenko (12/16) Feb 19 2013 Sorry for replying to myself again, but the abovementioned
- bearophile (4/9) Feb 19 2013 Is this to report in Bugzilla?
- Joseph Rushton Wakeling (3/7) Feb 19 2013 Sorry, I missed your remark on this when writing my own email. But I'm ...
I'm unsure whether I should post into the ".learn" sub-forum or some other one in such a case, but still. I wonder what the use case of std.random.randomCover is when one has std.random.randomShuffle.

I was trying to use it just to get a random permutation of integers, and found randomCover before randomShuffle. However, for a number of elements as low as 10,000, the delay was already rather surprising, so I searched for a faster solution and found that randomShuffle does a superior job. And now I wonder: how does one correctly use randomCover? Below is a sample test program showing the difference.

-----
import std.array;
import std.random;
import std.range;
import std.stdio;

immutable int MAX_N = 10_000;

int [] fun (int n, ref Random rnd)
{
    auto t = array (iota (n));
    version (randomCover)
    {
        auto c = randomCover (t, rnd);
    }
    version (randomShuffle)
    {
        auto c = t;
        randomShuffle (c, rnd);
    }
    return array (c);
}

void main ()
{
    auto rnd = Random (123456789);
    writeln (fun (MAX_N, rnd));
    writeln (fun (MAX_N, rnd) == fun (MAX_N, rnd));
}
-----

Here is a comparison:

1. Speed.
+randomShuffle performs O(n) steps and O(n) uniform() calls.
-randomCover performs O(n^2) steps and O(n^2) uniform() calls. The latter however can (and perhaps should?) be optimized to O(n): in the implementation, the line
    auto chooseMe = uniform(0, k, _rnd) == 0;
can be moved outside the foreach loop, storing the integer
    auto toPick = uniform(0, k, _rnd);
instead of a bool. I can try and write the respective patch if needed; a rough sketch follows at the end of this post.

2. Size.
-randomShuffle does not allocate anything extra, but modifies the range in place, and so requires allocating another range of n values if the original range has to be stored too.
+randomCover only allocates an array of n bools. If that is the intended advantage, the implementation would be better off using a bit array instead of a bool array, as in this enhancement proposition: http://d.puremagic.com/issues/show_bug.cgi?id=2898

3. Laziness.
-randomShuffle just does its job once.
+randomCover produces some sort of lazy generator instead. Still, the generator performs an O(n) computation on each step, so the profit is debatable.

4. Convenience.
+randomShuffle called multiple times with the same RNG advances the internal state of the RNG and thus produces different results. If one needs the same results, that is still achievable by knowingly saving and loading the internal state of the RNG.
-randomCover called multiple times with the same RNG copies the RNG each time by value and thus produces the same result. That is not the intended behavior in the majority of use cases I can imagine (e.g., generating different random permutations in a loop). This is already the topic of an issue I found: http://d.puremagic.com/issues/show_bug.cgi?id=7067

Now, the only case I can think of where randomCover should be preferred to randomShuffle is when you have a huge range (hundreds of MB) but need to iterate through only the first few values of randomCover. Is there any other?

Whether the above is indeed the intended use of randomCover or not, I think the intended use (and a reference to randomShuffle for other cases) should be mentioned in the documentation along with the time complexity.

More on the topic of optimization: the performance of the whole randomCover thing can be improved to O(n log n) by using a Fenwick tree or similar to popFront in O(log n). But it will then require storing n integers, not n bools, thus losing the advantage of having smaller memory requirements than randomShuffle with a copy.
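For concreteness, a rough sketch of that change (coverIndices and the variable names are illustrative, not the actual Phobos internals): pick one uniform position among the still-unchosen elements, then scan for it. The scan keeps the total at O(n^2) steps, but the number of uniform() calls drops to O(n).

-----
import std.random;

// Illustrative sketch, not the Phobos code: one uniform() call per
// output element instead of one per remaining candidate.
size_t[] coverIndices(size_t n, ref Random rnd)
{
    auto chosen = new bool[n];    // plays the role of _chosen
    auto result = new size_t[n];
    foreach (step; 0 .. n)
    {
        // position of the element to emit, among the n - step unchosen
        auto toPick = uniform(0, n - step, rnd);
        size_t i = 0;
        while (chosen[i] || toPick > 0)
        {
            if (!chosen[i])
                --toPick;         // passed one more unchosen element
            ++i;
        }
        chosen[i] = true;
        result[step] = i;
    }
    return result;
}
-----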
----- Ivan Kazmenko.
Feb 18 2013
> 1. Speed.
> +randomShuffle performs O(n) steps and O(n) uniform() calls.
> -randomCover performs O(n^2) steps and O(n^2) uniform() calls. The latter however can (and perhaps should?) be optimized to O(n): in the implementation, the line

Sorry, this part doesn't look clear. O(n^2) total with O(n^2) uniform() calls can be optimized to the same O(n^2) total but with only O(n) uniform() calls. It can be further optimized to O(n log n) total with O(n) uniform() calls using a Fenwick tree, but that will then require storing n ints instead of n bools.
Feb 18 2013
On 02/18/2013 03:35 PM, Ivan Kazmenko wrote:
> I wonder what the use case of std.random.randomCover is when one has std.random.randomShuffle. I was trying to use it just to get a random permutation of integers, and found randomCover before randomShuffle. However, for a number of elements as low as 10,000, the delay was already rather surprising, so I searched for a faster solution and found that randomShuffle does a superior job. And now I wonder: how does one correctly use randomCover? Below is a sample test program showing the difference.

Surely, because randomShuffle is re-ordering the elements in-place, whereas randomCover is not.
Feb 18 2013
> Surely, because randomShuffle is re-ordering the elements in-place, whereas randomCover is not.

Sorry, but that does not answer my question. If "in-place shuffle" versus "function return value" were the only intended difference, randomCover could as well look like this (simplified to the int[] case for expressiveness):

-----
int[] randomCover(int[] r, ref/*or not*/ Random rnd)
{
    auto s = r.dup;
    randomShuffle(s, rnd);
    return s;
}
-----

However, there are ~70 lines in the implementation doing something quite different from just calling randomShuffle, presumably for some purpose. And my question is: what is that purpose?
Feb 19 2013
On Tuesday, 19 February 2013 at 09:04:12 UTC, Ivan Kazmenko wrote:
>> Surely, because randomShuffle is re-ordering the elements in-place, whereas randomCover is not.
>
> Sorry, but that does not answer my question. If "in-place shuffle" versus "function return value" were the only intended difference, randomCover could as well look like this (simplified to the int[] case for expressiveness):
>
> -----
> int[] randomCover(int[] r, ref/*or not*/ Random rnd)
> {
>     auto s = r.dup;
>     randomShuffle(s, rnd);
>     return s;
> }
> -----
>
> However, there are ~70 lines in the implementation doing something quite different from just calling randomShuffle, presumably for some purpose. And my question is: what is that purpose?

Hum... randomShuffle and randomSample actually have nothing to do with each other. RandomSample will not shuffle anything, but will just (lazily) iterate over the original range, skipping elements, making it a "random sample". It will not, in any circumstance, re-arrange any element. Try this:

//--------
import std.random;
import std.algorithm;
import std.stdio;
import std.array;

void main()
{
    int[25] arr;
    foreach (i, ref a; arr)
        a = cast(int) i;
    assert(isSorted(arr[]));
    foreach (_; 0 .. 10)
    {
        auto sample = randomSample(arr[], 4).array();
        assert(isSorted(sample));
        writefln("[%(%3s,%)]", sample);
    }
}
//--------
[  8, 12, 16, 21]
[  5, 12, 17, 23]
[  1, 15, 21, 22]
[  9, 15, 21, 23]
[  6,  7, 13, 24]
[  4,  6,  9, 11]
[  9, 12, 20, 23]
[  5, 14, 20, 23]
[  0,  2, 10, 24]
[  1,  9, 13, 23]
//--------

I also want to comment on your "randomSample" vs "randomShuffle" implementation suggestion. Keep in mind that:
a) randomSample doesn't allocate, whereas your suggestion does;
b) randomSample gives direct access to the elements, whereas your suggestion doesn't.

If you don't care about a) and b), then by all means, dup away, and get better performance! But think about the fact that you wouldn't be able to do something like this...

//--------
import std.random;
import std.algorithm;
import std.stdio;
import std.array;

void main()
{
    int[25] arr;
    foreach (_; 0 .. 10)
    {
        auto sample = randomSample(arr[], 5);
        foreach (ref a; sample)
            ++a;
    }
    writefln("[%(%3s,%)]", arr[]);
}
//--------

Last but not least, be warned there is a long-standing bug with anything in Phobos that takes a PRNG by value. Basically, the PRNG will be duplicated, and will generate the same sequence over and over. Ergo, do NOT pass a specific random generator to things like randomSample or randomShuffle. This problem is one of the major reasons we are currently (and slowly) re-designing std.random into std.random2.
Feb 19 2013
Hi! Thank you for the reply.

> Hum... randomShuffle and randomSample actually have nothing to do with each other.
> <snip>

I'd like to note that my post is about randomCover, not randomSample. I do see the difference between the purposes of randomSample and randomShuffle. But randomCover's effect is, at first glance, just a slower version of randomShuffle wrapped as a lazy generator.

> I also want to comment on your "randomSample" vs "randomShuffle" implementation suggestion. Keep in mind that:
> a) randomSample doesn't allocate, whereas your suggestion does;
> b) randomSample gives direct access to the elements, whereas your suggestion doesn't.
> If you don't care about a) and b), then by all means, dup away, and get better performance! But think about the fact that you wouldn't be able to do something like this...
> <snip>
>     auto sample = randomSample(arr[], 5);
>     foreach (ref a; sample)
>         ++a;

That stands for randomCover, too. Well, thank you, perhaps that's the difference I was seeking. If this is the intended difference, my proposition to enhance randomCover's performance and usefulness transforms into:

1. Document the usage of randomCover with an example such as the above, and refer to randomShuffle as a faster version for simpler use cases.

2. Optimize the performance by putting Fenwick trees to good use. Currently, randomCover'ing 10,000 elements takes time on the order of one second, and for 100,000 or more elements it is hardly usable. A rough sketch of the Fenwick idea follows at the end of this post.

> Last but not least, be warned there is a long-standing bug with anything in Phobos that takes a PRNG by value. Basically, the PRNG will be duplicated, and will generate the same sequence over and over. Ergo, do NOT pass a specific random generator to things like randomSample or randomShuffle. This problem is one of the major reasons we are currently (and slowly) re-designing std.random into std.random2.

So, there is a general agreement that in random2, RNGs should by default get passed by reference everywhere? That's nice to hear.

-----
Ivan Kazmenko.
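To make item 2 concrete, a rough sketch of the Fenwick-tree idea (FenwickCounter and takeKth are illustrative names; this is not a patch against the actual Phobos code): the tree keeps a count of 1 per still-unchosen element, so the k-th remaining index can be found and removed in O(log n), giving O(n log n) for a full cover.

-----
import std.random;
import std.stdio;

// Illustrative sketch: a Fenwick tree over n ones, supporting
// "find and remove the k-th remaining element" in O(log n).
struct FenwickCounter
{
    int[] tree;    // 1-based; tree[i] = count of unchosen in its span

    this(size_t n)
    {
        tree = new int[n + 1];
        foreach (i; 1 .. n + 1)
        {
            tree[i] += 1;                 // this element starts unchosen
            auto parent = i + (i & -i);
            if (parent <= n)
                tree[parent] += tree[i];  // O(n) bottom-up build
        }
    }

    // k is 1-based among the remaining elements; returns a 0-based index.
    size_t takeKth(int k)
    {
        size_t pos = 0, step = 1;
        while (step * 2 < tree.length)
            step *= 2;                    // highest power of two <= n
        for (; step > 0; step >>= 1)
            if (pos + step < tree.length && tree[pos + step] < k)
            {
                pos += step;
                k -= tree[pos];
            }
        ++pos;                            // 1-based index of the k-th one
        for (auto i = pos; i < tree.length; i += i & -i)
            tree[i] -= 1;                 // mark it chosen
        return pos - 1;
    }
}

void main()
{
    auto rnd = Random(123456789);
    enum n = 10;
    auto counter = FenwickCounter(n);
    foreach (step; 0 .. n)                // emits a random permutation
        write(counter.takeKth(uniform(1, n - step + 1, rnd)), " ");
    writeln();
}
-----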
Feb 19 2013
On 02/19/2013 12:46 PM, Ivan Kazmenko wrote:
> I'd like to note that my post is about randomCover, not randomSample. I do see the difference between the purposes of randomSample and randomShuffle. But randomCover's effect is, at first glance, just a slower version of randomShuffle wrapped as a lazy generator.

To be honest, I think RandomCover is (statistically) unsafe to use right now, as it takes a copy of the random number generator it is given _by value_. So unless you pass it an unpredictably-seeded new RNG that will not be used in any other context, you're going to get correlation between the output of RandomCover and other sequences of random numbers you generate elsewhere in your code. (This problem is also found in RandomSample, but at least in that case there is an alternative of calling rndGen.)

As for purpose -- I think the point of RandomCover is to provide a means of generating a random shuffle that doesn't re-order elements of the original range in place, and that doesn't require creating a copy. It's probably worth investigating the algorithms out there for this kind of functionality, because I doubt the algorithm that's given is optimal.
Feb 19 2013
On Tuesday, 19 February 2013 at 12:18:43 UTC, Joseph Rushton Wakeling wrote:
> On 02/19/2013 12:46 PM, Ivan Kazmenko wrote:
>> I'd like to note that my post is about randomCover, not randomSample. I do see the difference between the purposes of randomSample and randomShuffle. But randomCover's effect is, at first glance, just a slower version of randomShuffle wrapped as a lazy generator.
>
> To be honest, I think RandomCover is (statistically) unsafe to use right now, as it takes a copy of the random number generator it is given _by value_.

If it keeps a copy, how it is passed (value or ref) actually becomes irrelevant. This is also a strong argument for the reference-semantics PRNG approach.
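A tiny demo of that point (Keeper is a hypothetical type, for illustration only): even when the generator reaches a struct by ref, storing it in a field snapshots its state.

//--------
import std.random;
import std.stdio;

// Illustrative only: any type that stores a PRNG by value snapshots
// its state, no matter how the PRNG was passed in.
struct Keeper
{
    Random saved;
    this(ref Random rnd) { saved = rnd; }  // value copy of the engine
}

void main()
{
    auto rnd = Random(1);
    auto k1 = Keeper(rnd);
    auto k2 = Keeper(rnd);   // rnd was never advanced in between
    writeln(k1.saved.front == k2.saved.front);  // prints: true
}
//--------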
Feb 19 2013
On 02/19/2013 01:25 PM, monarch_dodra wrote:
> If it keeps a copy, how it is passed (value or ref) actually becomes irrelevant.

Yes, I didn't explain myself well there. The real problem is that copying really duplicates the RNG rather than just copying a reference to the RNG engine.

> This is also a strong argument for the reference-semantics PRNG approach.

Yes. Are you having any luck in discussions on how to map out a random2 where this will be implemented? I remember your original code was rejected as a plain update to Phobos because of fears over code breakage (though TBH I think this is a case where if you break someone's code you're doing them a favour, as they were probably doing something unwise).
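A minimal sketch of what reference semantics could look like, assuming nothing about the actual random2 design (RefRandom and drawOne are illustrative names): wrapping the engine in a class makes every copy share one state.

//--------
import std.random;
import std.stdio;

// Illustrative sketch only, not the random2 design: a class wrapper
// gives the engine reference semantics, so "copies" share one state.
final class RefRandom
{
    Mt19937 engine;
    this(uint seed) { engine.seed(seed); }
    @property uint front() { return engine.front; }
    void popFront() { engine.popFront(); }
    enum bool empty = false;   // an infinite range of random numbers
}

uint drawOne(RefRandom rng)    // "by value", but only the reference copies
{
    auto x = rng.front;
    rng.popFront();
    return x;
}

void main()
{
    auto rng = new RefRandom(123456789);
    // Two different draws: both calls advance the same shared state.
    writeln(drawOne(rng), " ", drawOne(rng));
}
//--------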
Feb 21 2013
Joseph Rushton Wakeling wrote:
> It's probably worth investigating the algorithms out there for this kind of functionality, because I doubt the algorithm that's given is optimal.

Given that the lazy version will hardly be possible without allocating O(n) additional memory anyway, the simple solution is:
- build a random permutation of iota(n) at the start using randomShuffle (instead of maintaining the private bool[] _chosen array);
- lazily iterate the range according to the permutation.

It is n ints of memory instead of n bools, which is comparable, but O(n) initialization and O(1) per step, which is definitely better than now. A sketch of this approach follows below.
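Something like this, with PermutedView and permutedView as illustrative names, simplified to random-access ranges with a known length:

-----
import std.array;
import std.random;
import std.range;

// Illustrative sketch: shuffle the indices once, then iterate lazily.
// O(n) setup, O(1) per step, n size_t's of extra memory.
struct PermutedView(Range)
{
    Range input;
    size_t[] order;   // a random permutation of 0 .. input.length
    size_t pos;

    @property bool empty() const { return pos == order.length; }
    @property auto ref front() { return input[order[pos]]; }
    void popFront() { ++pos; }
}

auto permutedView(Range)(Range input, ref Random rnd)
{
    auto order = array(iota(input.length));
    randomShuffle(order, rnd);
    return PermutedView!Range(input, order, 0);
}
-----

Like randomCover, this visits every element exactly once in random order, leaves the original range untouched, and keeps ref access to the elements.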
Feb 19 2013
On Tuesday, 19 February 2013 at 11:46:54 UTC, Ivan Kazmenko wrote:
> Hi! Thank you for the reply.
>
>> Hum... randomShuffle and randomSample actually have nothing to do with each other.
>> <snip>
>
> I'd like to note that my post is about randomCover, not randomSample. I do see the difference between the purposes of randomSample and randomShuffle. But randomCover's effect is, at first glance, just a slower version of randomShuffle wrapped as a lazy generator.

Hum... sorry about that.

> That stands for randomCover, too. Well, thank you, perhaps that's the difference I was seeking. If this is the intended difference, my proposition to enhance randomCover's performance and usefulness transforms into:
> 1. Document the usage of randomCover with an example such as the above, and refer to randomShuffle as a faster version for simpler use cases.
> 2. Optimize the performance by putting Fenwick trees to good use. Currently, randomCover'ing 10,000 elements takes time on the order of one second, and for 100,000 or more elements it is hardly usable.

Extra documentation never hurts, but keep in mind that a ton of algorithms in Phobos are lazy and operate this way. Usually, only the lazy version is implemented, as the eager version is trivial (as you suggested). AFAIK, most of the ranges in std.range are lazy (though not obviously) in one way or another.

> So, there is a general agreement that in random2, RNGs should by default get passed by reference everywhere? That's nice to hear.

The agreement is rather to make them reference types, so that even when passed by value, you won't accidentally duplicate them. This is important, as you sometimes pass random ranges to algorithms that have nothing to do with randomness. I'd say the "textbook" example would be:

//----
import std.random;
import std.algorithm;
import std.stdio;

void main()
{
    uint[5] arr1;
    arr1[].fill(rndGen);
    uint[5] arr2;
    arr2[].fill(rndGen);
    writeln("arr1: ", arr1[]);
    writeln("arr2: ", arr2[]);
}
//----
arr1: [3622200385, 2579361262, 3208046123, 1753759120, 133131992]
arr2: [3622200385, 2579361262, 3208046123, 1753759120, 133131992]
//----

Oops!
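Until then, a workaround sketch, relying on rndGen being a ref-returning property in the current std.random: drawing from it directly (and popping it) really does advance the global state between the two fills.

//----
import std.random;
import std.stdio;

void main()
{
    uint[5] arr1, arr2;
    // Draw through the ref-returning rndGen property, so the global
    // generator's state really advances.
    foreach (ref a; arr1) { a = rndGen.front; rndGen.popFront(); }
    foreach (ref a; arr2) { a = rndGen.front; rndGen.popFront(); }
    writeln("arr1: ", arr1[]);
    writeln("arr2: ", arr2[]);   // a different sequence this time
}
//----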
Feb 19 2013
monarch_dodra wrote:
> void main()
> {
>     uint[5] arr1;
>     arr1[].fill(rndGen);
>     uint[5] arr2;
>     arr2[].fill(rndGen);
>     writeln("arr1: ", arr1[]);
>     writeln("arr2: ", arr2[]);
> }
> //----
> arr1: [3622200385, 2579361262, 3208046123, 1753759120, 133131992]
> arr2: [3622200385, 2579361262, 3208046123, 1753759120, 133131992]
> //----
> Oops!

Oops indeed!

It should be noted that a common need is to have reproducible RNG routines. For example, I was angry at Python breaking the reproducibility of everything except the vanilla random.random() around version 3.2. Still, as the language is going to have random2 anyway, and reproducibility will already suffer much from RNGs becoming reference types, it will be a good time for some cleanup.

On a side note, as D also uses MT19937 as the main RNG, the random2 designers may want to address the same accuracy issue which Python did. More on that topic here: http://bugs.python.org/issue9025
Feb 19 2013
On Tuesday, 19 February 2013 at 21:54:54 UTC, Ivan Kazmenko wrote:
> Oops indeed!
>
> It should be noted that a common need is to have reproducible RNG routines. For example, I was angry at Python breaking the reproducibility of everything except the vanilla random.random() around version 3.2. Still, as the language is going to have random2 anyway, and reproducibility will already suffer much from RNGs becoming reference types, it will be a good time for some cleanup.

In theory, there should be no reproducibility breakage, as the generators themselves will not be changed (they are _strictly_ modeled after Boost's, the ones we have anyway). The breakage will only occur if the PRNG was being duplicated and generating the same sequence twice. In this case, though, one would argue it's not reproducibility breakage, but straight-up broken code being fixed...
Feb 19 2013
monarch_dodra wrote:
> In theory, there should be no reproducibility breakage, as the generators themselves will not be changed (they are _strictly_ modeled after Boost's, the ones we have anyway).

Indeed, I just looked at the Boost and C++11 implementations, and they currently have these divisions too. Having this kind of reproducibility across languages is a benefit in itself.

bearophile wrote:
> Is this to report in Bugzilla?

Sorry, I fear it's a false alarm, and the answer is no. In any case, before advocating such a change further, I would like to know why the Boost and C++11 people didn't do that in their realms. Maybe the division is actually more portable. Maybe extracting bits is indeed faster with MT19937, but it can produce poorer-quality random values with RNGs which are not so well-distributed bit-by-bit.

-----
Ivan Kazmenko.
Feb 19 2013
> On a side note, as D also uses MT19937 as the main RNG, the random2 designers may want to address the same accuracy issue which Python did. More on that topic here: http://bugs.python.org/issue9025

Sorry for replying to myself again, but the above-mentioned accuracy issue is not about MT19937 itself. It's about how we transform the (uniform enough) bits from MT19937 into an integer in some range [0, LIMIT). It looks like the inaccuracy is already addressed in D, so, sorry again for the bother.

Still, the current implementation uses divisions (which are slow on most CPUs), and for a good source of random bits such as MT19937, an implementation which repeatedly gets exactly core.bitop.bsr(LIMIT)+1 bits until the result is in [0, LIMIT) may be faster. A sketch of the idea follows below.

-----
Ivan Kazmenko
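Something along these lines (uniformBits is a hypothetical name; a sketch of the rejection method, not a drop-in Phobos patch):

-----
import core.bitop : bsr;
import std.random;

// Hypothetical helper: uniform value in [0, limit) using bit masking
// and rejection instead of division. Each attempt draws bsr(limit)+1
// low bits; an attempt is rejected with probability at most 1/2.
uint uniformBits(ref Random rnd, uint limit)
{
    assert(limit > 0);
    immutable bits = bsr(limit) + 1;
    immutable uint mask = (bits >= 32) ? uint.max : (1u << bits) - 1;
    uint r;
    do
    {
        r = rnd.front & mask;   // keep only the bits we need
        rnd.popFront();
    }
    while (r >= limit);         // reject and redraw out-of-range values
    return r;
}
-----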
Feb 19 2013
Ivan Kazmenko:
> Still, the current implementation uses divisions (which are slow on most CPUs), and for a good source of random bits such as MT19937, an implementation which repeatedly gets exactly core.bitop.bsr(LIMIT)+1 bits until the result is in [0, LIMIT) may be faster.

Is this to report in Bugzilla?

Bye,
bearophile
Feb 19 2013
On 02/19/2013 12:27 PM, monarch_dodra wrote:
> Last but not least, be warned there is a long-standing bug with anything in Phobos that takes a PRNG by value. Basically, the PRNG will be duplicated, and will generate the same sequence over and over. Ergo, do NOT pass a specific random generator to things like randomSample or randomShuffle.

Sorry, I missed your remark on this when writing my own email. But I'm not sure this issue can be over-emphasized ... :-)
Feb 19 2013