digitalmars.D - RandomSample with specified random number generator
- Joseph Rushton Wakeling (89/89) Jun 12 2012 Hello all,
Hello all, An interesting challenge with the randomSample functionality in random.d. randomSample takes as input a range, a number of points to sample from that range, and (optionally) a random number generator. It returns a range corresponding to a lazily-calculated sample of the desired size. If you don't pass it a specific random number generator, then it just uses the thread-global random number generator. A consequence of this is that the sample changes each time you use it, i.e. if you write, auto sample1 = randomSample(iota(0, 100), 5); writeln(sample1); writeln(sample1); writeln(sample1); ... you get 3 different random samples, e.g. [0, 17, 18, 43, 77] [20, 25, 34, 53, 97] [6, 12, 28, 42, 87] Conversely, if you _do_ pass randomSample a specific random number generator, a copy of it is stored inside the RandomSample struct. This is necessary because, as the sample is lazily evaluated, you need to guarantee the RNG's existence at the point when you evaluate. A consequence of this is that if you evaluate the sample multiple times, you get the same result, i.e. if you do: auto sample2 = randomSample(iota(0, 100), 5, rndGen); writeln("Sample with specified RNG:"); writeln(sample2); writeln(sample2); writeln(sample2); then you get something like e.g. [33, 35, 54, 55, 94] [33, 35, 54, 55, 94] [33, 35, 54, 55, 94] ... which is problematic because it's different behaviour from the case without a specific RNG being used, but not problematic per se. One can see a logic for either case (a sample range always giving the same result, or always giving a different and independent result), it's just a matter of being clear which will happen. However, a real problem arises if you want to have multiple samples each being passed a specific source of randomness. If you do: auto sample3 = randomSample(iota(0, 100), 5, rndGen); writeln(sample3); auto sample4 = randomSample(iota(0, 100), 5, rndGen); writeln(sample4); auto sample5 = randomSample(iota(0, 100), 5, rndGen); writeln(sample5); ... then you'd expect in principle to get 3 different samples. However, because rndGen is _copied_ rather than used directly, its state does not get updated when the samples are evaluated. As a consequence, each of the samples above gets passed _the same random number generator in the same state_, and you get out the same samples, e.g. [33, 35, 54, 55, 94] [33, 35, 54, 55, 94] [33, 35, 54, 55, 94] (... yes, the same as sample2, assuming we're still in the same program and haven't made any other uses of rndGen in the meantime). What this means is that randomSample is impossible to use effectively with a specified random number generator. It's not just that successive "different" samples produce the same output, it's also that if you generate a random sample and then generate other random numbers afterwards, they will be correlated. I can see only two possible resolutions of this issue, but I'm curious if anyone else can think of something different. (i) store a reference to the random number generator instead of a copy (is this even possible?). The trouble is, this isn't safe, because what if e.g. you have a function which does something like auto generateSample() { auto rng = Random(unpredictableSeed); auto sample = randomSample(iota(0, 100), 5, rng); return sample; } void main() { auto sample = generateSample(); writeln(sample); // Won't work, because rng will no longer exist. } ... so this option seems impermissible. (ii) when copying the random number generator, seed it with an unpredictable seed before generating the first point of the random sample. This will effectively make the sample an independent thread with respect to random number generation. The disadvantage is that you lose the reproducibility of the sampling behaviour (e.g. computer game where you want to avoid the player being able to save/reload/try again and get a different outcome) and there might be unexpected correlations that occur if the seeding or the random number generation algorithm are done poorly. The second seems the only really valid option to me, and has the advantage of making identical the behaviour of the sample whether or not it's given a specific RNG (i.e. using the sample 3 different times gets you 3 different samples). ... any thoughts on this and on possible alternative options? Thanks & best wishes, -- Joe
Jun 12 2012