digitalmars.D.learn - What is the use case of RandomCover?
- Ivan Kazmenko (87/87) Feb 18 2013 I'm unsure whether I should post into the ".learn" sub-forum or
- Ivan Kazmenko (5/10) Feb 18 2013 Sorry, this part doesn't look clear. O(n^2) total with O(n^2)
- Joseph Rushton Wakeling (3/10) Feb 18 2013 Surely, because randomShuffle is re-ordering the elements in-place, wher...
- Ivan Kazmenko (15/17) Feb 19 2013 Sorry, but that does not answer my question. If "in-place
- monarch_dodra (70/87) Feb 19 2013 Hum... randomShuffle and randomSample actually have nothing to do
- Ivan Kazmenko (22/46) Feb 19 2013 I'd like to note that my post is about randomCover, not
- Joseph Rushton Wakeling (13/17) Feb 19 2013 To be honest, I think RandomCover is (statistically) unsafe to use right...
- monarch_dodra (6/17) Feb 19 2013 If it keeps a copy, how it is passed (value or ref) actually
- Joseph Rushton Wakeling (8/10) Feb 21 2013 Yes, I didn't explain myself well there. The real problem is that copyi...
- Ivan Kazmenko (10/13) Feb 19 2013 Given that the lazy version will hardly be possible without
- monarch_dodra (31/83) Feb 19 2013 Extra documentation never hurts, but keep in mind that a ton of
- Ivan Kazmenko (13/29) Feb 19 2013 Oops indeed!
- monarch_dodra (8/33) Feb 19 2013 In theory, there should be no reproducibility breakage, as the
- Ivan Kazmenko (14/18) Feb 19 2013 Indeed, I just looked at Boost and C++11 implementations, and
- Ivan Kazmenko (12/16) Feb 19 2013 Sorry for replying to myself again, but the abovementioned
- bearophile (4/9) Feb 19 2013 Is this to report in Bugzilla?
- Joseph Rushton Wakeling (3/7) Feb 19 2013 Sorry, I missed your remark on this when writing my own email. But I'm ...
I'm unsure whether I should post into the ".learn" sub-forum or some other one in such a case, but still. I wonder what the use case of std.random.randomCover is when one has std.random.randomShuffle.

I was trying to use it just to get a random permutation of integers, and found randomCover before randomShuffle. However, for a number of elements as low as 10,000, the delay was already rather surprising, so I searched for a faster solution and found that randomShuffle does a superior job. And now I wonder: how does one correctly use randomCover? Below is a sample test program showing the difference.

-----
import std.array;
import std.random;
import std.range;
import std.stdio;

immutable int MAX_N = 10_000;

int [] fun (int n, ref Random rnd)
{
    auto t = array (iota (n));
    version (randomCover)
    {
        auto c = randomCover (t, rnd);
    }
    version (randomShuffle)
    {
        auto c = t;
        randomShuffle (c, rnd);
    }
    return array (c);
}

void main ()
{
    auto rnd = Random (123456789);
    writeln (fun (MAX_N, rnd));
    writeln (fun (MAX_N, rnd) == fun (MAX_N, rnd));
}
-----

Here is a comparison:

1. Speed.
+randomShuffle performs O(n) steps and O(n) uniform() calls.
-randomCover performs O(n^2) steps and O(n^2) uniform() calls. The latter however can (and perhaps should?) be optimized to O(n): in the implementation, the line
    auto chooseMe = uniform(0, k, _rnd) == 0;
can be moved outside the foreach loop, storing the integer
    auto toPick = uniform(0, k, _rnd);
instead of a bool. I can try and write the respective patch if needed; a rough sketch follows at the end of this post.

2. Size.
-randomShuffle does not allocate anything extra, but modifies the range in place, and so requires allocating another range of n values if the original range has to be stored too.
+randomCover only allocates an array of n bools. If that is the intended advantage, the implementation would be better off using a bit array instead of a bool array, as in this enhancement proposition: http://d.puremagic.com/issues/show_bug.cgi?id=2898

3. Laziness.
-randomShuffle just does its job once.
+randomCover produces some sort of lazy generator instead. Still, the generator performs an O(n) computation on each step, so the profit is debatable.

4. Convenience.
+randomShuffle called multiple times with the same RNG advances the internal state of the RNG and thus produces different results. If one needs the same results, that is still achievable by knowingly saving and loading the internal state of the RNG.
-randomCover called multiple times with the same RNG copies the RNG each time by value and thus produces the same result. That is not the intended behavior in the majority of use cases I can imagine (e.g., generating different random permutations in a loop). This is already the topic of an issue I found: http://d.puremagic.com/issues/show_bug.cgi?id=7067

Now, the only case I can think of where randomCover should be preferred to randomShuffle is when you have a huge range (hundreds of MB) but need to iterate through only the first few values of randomCover. Is there any other?

Whether the above is indeed the intended use of randomCover or not, I think the intended use (and a reference to randomShuffle for other cases) should be mentioned in the documentation along with the time complexity.

More on the topic of optimization: the performance of the whole randomCover thing can be improved to O(n log n) by using a Fenwick tree or similar to popFront in O(log n). But it will then require storing n integers, not n bools, thus losing the advantage of having smaller memory requirements than randomShuffle with a copy.
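For concreteness, a rough sketch of that change (coverIndices and the variable names are illustrative, not the actual Phobos internals): pick one uniform position among the still-unchosen elements, then scan for it. The scan keeps the total at O(n^2) steps, but the number of uniform() calls drops to O(n).

-----
import std.random;

// Illustrative sketch, not the Phobos code: one uniform() call per
// output element instead of one per remaining candidate.
size_t[] coverIndices(size_t n, ref Random rnd)
{
    auto chosen = new bool[n];    // plays the role of _chosen
    auto result = new size_t[n];
    foreach (step; 0 .. n)
    {
        // position of the element to emit, among the n - step unchosen
        auto toPick = uniform(0, n - step, rnd);
        size_t i = 0;
        while (chosen[i] || toPick > 0)
        {
            if (!chosen[i])
                --toPick;         // passed one more unchosen element
            ++i;
        }
        chosen[i] = true;
        result[step] = i;
    }
    return result;
}
-----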
----- Ivan Kazmenko.
Feb 18 2013
> 1. Speed.
> +randomShuffle performs O(n) steps and O(n) uniform() calls.
> -randomCover performs O(n^2) steps and O(n^2) uniform() calls. The latter however can (and perhaps should?) be optimized to O(n): in the implementation, the line

Sorry, this part doesn't look clear. O(n^2) total with O(n^2) uniform() calls can be optimized to the same O(n^2) total but with only O(n) uniform() calls. It can be further optimized to O(n log n) total with O(n) uniform() calls using a Fenwick tree, but that will then require storing n ints instead of n bools.
Feb 18 2013
On 02/18/2013 03:35 PM, Ivan Kazmenko wrote:
> I wonder what the use case of std.random.randomCover is when one has std.random.randomShuffle. I was trying to use it just to get a random permutation of integers, and found randomCover before randomShuffle. However, for a number of elements as low as 10,000, the delay was already rather surprising, so I searched for a faster solution and found that randomShuffle does a superior job. And now I wonder: how does one correctly use randomCover? Below is a sample test program showing the difference.

Surely, because randomShuffle is re-ordering the elements in-place, whereas randomCover is not.
Feb 18 2013
> Surely, because randomShuffle is re-ordering the elements in-place, whereas randomCover is not.

Sorry, but that does not answer my question. If "in-place shuffle" versus "function return value" were the only intended difference, randomCover could as well look like this (simplified to the int[] case for expressiveness):

-----
int[] randomCover(int[] r, ref/*or not*/ Random rnd)
{
    auto s = r.dup;
    randomShuffle(s, rnd);
    return s;
}
-----

However, there are ~70 lines in the implementation doing something quite different from just calling randomShuffle, presumably for some purpose. And my question is: what is that purpose?
Feb 19 2013
On Tuesday, 19 February 2013 at 09:04:12 UTC, Ivan Kazmenko wrote:
>> Surely, because randomShuffle is re-ordering the elements in-place, whereas randomCover is not.
>
> Sorry, but that does not answer my question. If "in-place shuffle" versus "function return value" were the only intended difference, randomCover could as well look like this (simplified to the int[] case for expressiveness):
>
> -----
> int[] randomCover(int[] r, ref/*or not*/ Random rnd)
> {
>     auto s = r.dup;
>     randomShuffle(s, rnd);
>     return s;
> }
> -----
>
> However, there are ~70 lines in the implementation doing something quite different from just calling randomShuffle, presumably for some purpose. And my question is: what is that purpose?

Hum... randomShuffle and randomSample actually have nothing to do with each other. RandomSample will not shuffle anything, but will just (lazily) iterate over the original range, skipping elements, making it a "random sample". It will not, in any circumstance, re-arrange any element. Try this:

//--------
import std.random;
import std.algorithm;
import std.stdio;
import std.array;

void main()
{
    int[25] arr;
    foreach (i, ref a; arr)
        a = cast(int) i;
    assert(isSorted(arr[]));
    foreach (_; 0 .. 10)
    {
        auto sample = randomSample(arr[], 4).array();
        assert(isSorted(sample));
        writefln("[%(%3s,%)]", sample);
    }
}
//--------
[  8, 12, 16, 21]
[  5, 12, 17, 23]
[  1, 15, 21, 22]
[  9, 15, 21, 23]
[  6,  7, 13, 24]
[  4,  6,  9, 11]
[  9, 12, 20, 23]
[  5, 14, 20, 23]
[  0,  2, 10, 24]
[  1,  9, 13, 23]
//--------

I also want to comment on your "randomSample" vs "randomShuffle" implementation suggestion. Keep in mind that:
a) randomSample doesn't allocate, whereas your suggestion does;
b) randomSample gives direct access to the elements, whereas your suggestion doesn't.

If you don't care about a) and b), then by all means, dup away, and get better performance! But think about the fact that you wouldn't be able to do something like this...

//--------
import std.random;
import std.algorithm;
import std.stdio;
import std.array;

void main()
{
    int[25] arr;
    foreach (_; 0 .. 10)
    {
        auto sample = randomSample(arr[], 5);
        foreach (ref a; sample)
            ++a;
    }
    writefln("[%(%3s,%)]", arr[]);
}
//--------

Last but not least, be warned there is a long-standing bug with anything in Phobos that takes a PRNG by value. Basically, the PRNG will be duplicated, and will generate the same sequence over and over. Ergo, do NOT pass a specific random generator to things like randomSample or randomShuffle. This problem is one of the major reasons we are currently (and slowly) re-designing std.random into std.random2.
Feb 19 2013
Hi! Thank you for the reply.

> Hum... randomShuffle and randomSample actually have nothing to do with each other.
> <snip>

I'd like to note that my post is about randomCover, not randomSample. I do see the difference between the purposes of randomSample and randomShuffle. But randomCover's effect is, at first glance, just a slower version of randomShuffle wrapped as a lazy generator.

> I also want to comment on your "randomSample" vs "randomShuffle" implementation suggestion. Keep in mind that:
> a) randomSample doesn't allocate, whereas your suggestion does;
> b) randomSample gives direct access to the elements, whereas your suggestion doesn't.
> If you don't care about a) and b), then by all means, dup away, and get better performance! But think about the fact that you wouldn't be able to do something like this...
> <snip>
>     auto sample = randomSample(arr[], 5);
>     foreach (ref a; sample)
>         ++a;

That stands for randomCover, too. Well, thank you, perhaps that's the difference I was seeking. If this is the intended difference, my proposition to enhance randomCover's performance and usefulness transforms into:

1. Document the usage of randomCover with an example such as the above, and refer to randomShuffle as a faster version for simpler use cases.

2. Optimize the performance by putting Fenwick trees to good use. Currently, randomCover'ing 10,000 elements takes time on the order of one second, and for 100,000 or more elements it is hardly usable. A rough sketch of the Fenwick idea follows at the end of this post.

> Last but not least, be warned there is a long-standing bug with anything in Phobos that takes a PRNG by value. Basically, the PRNG will be duplicated, and will generate the same sequence over and over. Ergo, do NOT pass a specific random generator to things like randomSample or randomShuffle. This problem is one of the major reasons we are currently (and slowly) re-designing std.random into std.random2.

So, there is a general agreement that in random2, RNGs should by default get passed by reference everywhere? That's nice to hear.

-----
Ivan Kazmenko.
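To make item 2 concrete, a rough sketch of the Fenwick-tree idea (FenwickCounter and takeKth are illustrative names; this is not a patch against the actual Phobos code): the tree keeps a count of 1 per still-unchosen element, so the k-th remaining index can be found and removed in O(log n), giving O(n log n) for a full cover.

-----
import std.random;
import std.stdio;

// Illustrative sketch: a Fenwick tree over n ones, supporting
// "find and remove the k-th remaining element" in O(log n).
struct FenwickCounter
{
    int[] tree;    // 1-based; tree[i] = count of unchosen in its span

    this(size_t n)
    {
        tree = new int[n + 1];
        foreach (i; 1 .. n + 1)
        {
            tree[i] += 1;                 // this element starts unchosen
            auto parent = i + (i & -i);
            if (parent <= n)
                tree[parent] += tree[i];  // O(n) bottom-up build
        }
    }

    // k is 1-based among the remaining elements; returns a 0-based index.
    size_t takeKth(int k)
    {
        size_t pos = 0, step = 1;
        while (step * 2 < tree.length)
            step *= 2;                    // highest power of two <= n
        for (; step > 0; step >>= 1)
            if (pos + step < tree.length && tree[pos + step] < k)
            {
                pos += step;
                k -= tree[pos];
            }
        ++pos;                            // 1-based index of the k-th one
        for (auto i = pos; i < tree.length; i += i & -i)
            tree[i] -= 1;                 // mark it chosen
        return pos - 1;
    }
}

void main()
{
    auto rnd = Random(123456789);
    enum n = 10;
    auto counter = FenwickCounter(n);
    foreach (step; 0 .. n)                // emits a random permutation
        write(counter.takeKth(uniform(1, n - step + 1, rnd)), " ");
    writeln();
}
-----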
Feb 19 2013
On 02/19/2013 12:46 PM, Ivan Kazmenko wrote:
> I'd like to note that my post is about randomCover, not randomSample. I do see the difference between the purposes of randomSample and randomShuffle. But randomCover's effect is, at first glance, just a slower version of randomShuffle wrapped as a lazy generator.

To be honest, I think RandomCover is (statistically) unsafe to use right now, as it takes a copy of the random number generator it is given _by value_. So unless you pass it an unpredictably-seeded new RNG that will not be used in any other context, you're going to get correlation between the output of RandomCover and other sequences of random numbers you generate elsewhere in your code. (This problem is also found in RandomSample, but at least in that case there is an alternative of calling rndGen.)

As for purpose -- I think the point of RandomCover is to provide a means of generating a random shuffle that doesn't re-order elements of the original range in place, and that doesn't require creating a copy. It's probably worth investigating the algorithms out there for this kind of functionality, because I doubt the algorithm that's given is optimal.
Feb 19 2013
On Tuesday, 19 February 2013 at 12:18:43 UTC, Joseph Rushton Wakeling wrote:
> On 02/19/2013 12:46 PM, Ivan Kazmenko wrote:
>> I'd like to note that my post is about randomCover, not randomSample. I do see the difference between the purposes of randomSample and randomShuffle. But randomCover's effect is, at first glance, just a slower version of randomShuffle wrapped as a lazy generator.
>
> To be honest, I think RandomCover is (statistically) unsafe to use right now, as it takes a copy of the random number generator it is given _by value_.

If it keeps a copy, how it is passed (value or ref) actually becomes irrelevant. This is also a strong argument for the reference-semantics PRNG approach.
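A tiny demo of that point (Keeper is a hypothetical type, for illustration only): even when the generator reaches a struct by ref, storing it in a field snapshots its state.

//--------
import std.random;
import std.stdio;

// Illustrative only: any type that stores a PRNG by value snapshots
// its state, no matter how the PRNG was passed in.
struct Keeper
{
    Random saved;
    this(ref Random rnd) { saved = rnd; }  // value copy of the engine
}

void main()
{
    auto rnd = Random(1);
    auto k1 = Keeper(rnd);
    auto k2 = Keeper(rnd);   // rnd was never advanced in between
    writeln(k1.saved.front == k2.saved.front);  // prints: true
}
//--------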
Feb 19 2013
On 02/19/2013 01:25 PM, monarch_dodra wrote:
> If it keeps a copy, how it is passed (value or ref) actually becomes irrelevant.

Yes, I didn't explain myself well there. The real problem is that copying really duplicates the RNG rather than just copying a reference to the RNG engine.

> This is also a strong argument for the reference-semantics PRNG approach.

Yes. Are you having any luck in discussions on how to map out a random2 where this will be implemented? I remember your original code was rejected as a plain update to Phobos because of fears over code breakage (though TBH I think this is a case where if you break someone's code you're doing them a favour, as they were probably doing something unwise).
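A minimal sketch of what reference semantics could look like, assuming nothing about the actual random2 design (RefRandom and drawOne are illustrative names): wrapping the engine in a class makes every copy share one state.

//--------
import std.random;
import std.stdio;

// Illustrative sketch only, not the random2 design: a class wrapper
// gives the engine reference semantics, so "copies" share one state.
final class RefRandom
{
    Mt19937 engine;
    this(uint seed) { engine.seed(seed); }
    @property uint front() { return engine.front; }
    void popFront() { engine.popFront(); }
    enum bool empty = false;   // an infinite range of random numbers
}

uint drawOne(RefRandom rng)    // "by value", but only the reference copies
{
    auto x = rng.front;
    rng.popFront();
    return x;
}

void main()
{
    auto rng = new RefRandom(123456789);
    // Two different draws: both calls advance the same shared state.
    writeln(drawOne(rng), " ", drawOne(rng));
}
//--------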
Feb 21 2013
Joseph Rushton Wakeling wrote:
> It's probably worth investigating the algorithms out there for this kind of functionality, because I doubt the algorithm that's given is optimal.

Given that the lazy version will hardly be possible without allocating O(n) additional memory anyway, the simple solution is:
- build a random permutation of iota(n) at the start using randomShuffle (instead of maintaining the private bool[] _chosen array);
- lazily iterate the range according to the permutation.

It is n ints of memory instead of n bools, which is comparable, but O(n) initialization and O(1) per step, which is definitely better than now. A sketch of this approach follows below.
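Something like this, with PermutedView and permutedView as illustrative names, simplified to random-access ranges with a known length:

-----
import std.array;
import std.random;
import std.range;

// Illustrative sketch: shuffle the indices once, then iterate lazily.
// O(n) setup, O(1) per step, n size_t's of extra memory.
struct PermutedView(Range)
{
    Range input;
    size_t[] order;   // a random permutation of 0 .. input.length
    size_t pos;

    @property bool empty() const { return pos == order.length; }
    @property auto ref front() { return input[order[pos]]; }
    void popFront() { ++pos; }
}

auto permutedView(Range)(Range input, ref Random rnd)
{
    auto order = array(iota(input.length));
    randomShuffle(order, rnd);
    return PermutedView!Range(input, order, 0);
}
-----

Like randomCover, this visits every element exactly once in random order, leaves the original range untouched, and keeps ref access to the elements.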
Feb 19 2013
On Tuesday, 19 February 2013 at 11:46:54 UTC, Ivan Kazmenko wrote:
> Hi! Thank you for the reply.
>
>> Hum... randomShuffle and randomSample actually have nothing to do with each other.
>> <snip>
>
> I'd like to note that my post is about randomCover, not randomSample. I do see the difference between the purposes of randomSample and randomShuffle. But randomCover's effect is, at first glance, just a slower version of randomShuffle wrapped as a lazy generator.

Hum... sorry about that.

> That stands for randomCover, too. Well, thank you, perhaps that's the difference I was seeking. If this is the intended difference, my proposition to enhance randomCover's performance and usefulness transforms into:
> 1. Document the usage of randomCover with an example such as the above, and refer to randomShuffle as a faster version for simpler use cases.
> 2. Optimize the performance by putting Fenwick trees to good use. Currently, randomCover'ing 10,000 elements takes time on the order of one second, and for 100,000 or more elements it is hardly usable.

Extra documentation never hurts, but keep in mind that a ton of algorithms in Phobos are lazy and operate this way. Usually, only the lazy version is implemented, as the eager version is trivial (as you suggested). AFAIK, most of the ranges in std.range are lazy (though not obviously) in one way or another.

> So, there is a general agreement that in random2, RNGs should by default get passed by reference everywhere? That's nice to hear.

The agreement is rather to make them reference types, so that even when passed by value, you won't accidentally duplicate them. This is important, as you sometimes pass random ranges to algorithms that have nothing to do with randomness. I'd say the "textbook" example would be:

//----
import std.random;
import std.algorithm;
import std.stdio;

void main()
{
    uint[5] arr1;
    arr1[].fill(rndGen);
    uint[5] arr2;
    arr2[].fill(rndGen);
    writeln("arr1: ", arr1[]);
    writeln("arr2: ", arr2[]);
}
//----
arr1: [3622200385, 2579361262, 3208046123, 1753759120, 133131992]
arr2: [3622200385, 2579361262, 3208046123, 1753759120, 133131992]
//----

Oops!
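Until then, a workaround sketch, relying on rndGen being a ref-returning property in the current std.random: drawing from it directly (and popping it) really does advance the global state between the two fills.

//----
import std.random;
import std.stdio;

void main()
{
    uint[5] arr1, arr2;
    // Draw through the ref-returning rndGen property, so the global
    // generator's state really advances.
    foreach (ref a; arr1) { a = rndGen.front; rndGen.popFront(); }
    foreach (ref a; arr2) { a = rndGen.front; rndGen.popFront(); }
    writeln("arr1: ", arr1[]);
    writeln("arr2: ", arr2[]);   // a different sequence this time
}
//----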
Feb 19 2013
monarch_dodra wrote:
> void main()
> {
>     uint[5] arr1;
>     arr1[].fill(rndGen);
>     uint[5] arr2;
>     arr2[].fill(rndGen);
>     writeln("arr1: ", arr1[]);
>     writeln("arr2: ", arr2[]);
> }
> //----
> arr1: [3622200385, 2579361262, 3208046123, 1753759120, 133131992]
> arr2: [3622200385, 2579361262, 3208046123, 1753759120, 133131992]
> //----
> Oops!

Oops indeed!

It should be noted that a common need is to have reproducible RNG routines. For example, I was angry at Python breaking the reproducibility of everything except the vanilla random.random() around version 3.2. Still, as the language is going to have random2 anyway, and reproducibility will already suffer much from RNGs becoming reference types, it will be a good time for some cleanup.

On a side note, as D also uses MT19937 as the main RNG, the random2 designers may want to address the same accuracy issue which Python did. More on that topic here: http://bugs.python.org/issue9025
Feb 19 2013
On Tuesday, 19 February 2013 at 21:54:54 UTC, Ivan Kazmenko wrote:
> Oops indeed!
>
> It should be noted that a common need is to have reproducible RNG routines. For example, I was angry at Python breaking the reproducibility of everything except the vanilla random.random() around version 3.2. Still, as the language is going to have random2 anyway, and reproducibility will already suffer much from RNGs becoming reference types, it will be a good time for some cleanup.

In theory, there should be no reproducibility breakage, as the generators themselves will not be changed (they are _strictly_ modeled after Boost's, the ones we have anyway). The breakage will only occur if the PRNG was being duplicated and generating the same sequence twice. In this case, though, one would argue it's not reproducibility breakage, but straight-up broken code being fixed...
Feb 19 2013
monarch_dodra wrote:
> In theory, there should be no reproducibility breakage, as the generators themselves will not be changed (they are _strictly_ modeled after Boost's, the ones we have anyway).

Indeed, I just looked at the Boost and C++11 implementations, and they currently have these divisions too. Having this kind of reproducibility across languages is a benefit in itself.

bearophile wrote:
> Is this to report in Bugzilla?

Sorry, I fear it's a false alarm, and the answer is no. In any case, before advocating such a change further, I would like to know why the Boost and C++11 people didn't do that in their realms. Maybe the division is actually more portable. Maybe extracting bits is indeed faster with MT19937, but it can produce poorer-quality random values with RNGs which are not so well-distributed bit-by-bit.

-----
Ivan Kazmenko.
Feb 19 2013
> On a side note, as D also uses MT19937 as the main RNG, the random2 designers may want to address the same accuracy issue which Python did. More on that topic here: http://bugs.python.org/issue9025

Sorry for replying to myself again, but the above-mentioned accuracy issue is not about MT19937 itself. It's about how we transform the (uniform enough) bits from MT19937 into an integer in some range [0, LIMIT). It looks like the inaccuracy is already addressed in D, so, sorry again for the bother.

Still, the current implementation uses divisions (which are slow on most CPUs), and for a good source of random bits such as MT19937, an implementation which repeatedly gets exactly core.bitop.bsr(LIMIT)+1 bits until the result is in [0, LIMIT) may be faster. A sketch of the idea follows below.

-----
Ivan Kazmenko
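Something along these lines (uniformBits is a hypothetical name; a sketch of the rejection method, not a drop-in Phobos patch):

-----
import core.bitop : bsr;
import std.random;

// Hypothetical helper: uniform value in [0, limit) using bit masking
// and rejection instead of division. Each attempt draws bsr(limit)+1
// low bits; an attempt is rejected with probability at most 1/2.
uint uniformBits(ref Random rnd, uint limit)
{
    assert(limit > 0);
    immutable bits = bsr(limit) + 1;
    immutable uint mask = (bits >= 32) ? uint.max : (1u << bits) - 1;
    uint r;
    do
    {
        r = rnd.front & mask;   // keep only the bits we need
        rnd.popFront();
    }
    while (r >= limit);         // reject and redraw out-of-range values
    return r;
}
-----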
Feb 19 2013
Ivan Kazmenko:
> Still, the current implementation uses divisions (which are slow on most CPUs), and for a good source of random bits such as MT19937, an implementation which repeatedly gets exactly core.bitop.bsr(LIMIT)+1 bits until the result is in [0, LIMIT) may be faster.

Is this to report in Bugzilla?

Bye,
bearophile
Feb 19 2013
On 02/19/2013 12:27 PM, monarch_dodra wrote:
> Last but not least, be warned there is a long-standing bug with anything in Phobos that takes a PRNG by value. Basically, the PRNG will be duplicated, and will generate the same sequence over and over. Ergo, do NOT pass a specific random generator to things like randomSample or randomShuffle.

Sorry, I missed your remark on this when writing my own email. But I'm not sure this issue can be over-emphasized ... :-)
Feb 19 2013