digitalmars.D - On random numbers and ranges
- Joseph Rushton Wakeling (116/116) Jun 10 2013 Hello all,
Hello all, Anyone looking at D bugzilla or Phobos pull requests recently will see that I've been filing a number of bug reports and fixes against std.random.RandomSample. (I have Diggory to thank for this -- during the course of a discussion on D.learn I noticed an error when attempting to sample file.byLine, and that in turn led to some careful re-examination of the whole RandomSample struct.) Now, the good news is that it's unlikely that the issues reported will have bitten anyone using randomSample as intended. However, the issues that have arisen seem interesting to discuss in a broader context of random number generation and its relation to ranges. These are thoughts I've been mulling over for some time, and it seems like an opportune moment to share them, with RandomSample providing a concrete illustration. There's also some concrete application: I am not sure how to proceed on certain issues I've identified, and would like feedback and advice. So. Let's begin by defining the concept of a Random Range: that is, a range where popFront() makes use of a source of randomness. This can be a pseudo-random number generator or a source of "true" randomness such as /dev/random. This immediately leads to a situation different from other range objects. For example, it means that the concept of a "save()" method is different and maybe meaningless: we can save the current state of the range, but the forward motion of different saved copies will diverge because of the randomness in popFront(). (Caveat: in the event that the source of randomness is a pseudo-random number generator, it's technically possible to also save the state of the RNG. However, this has other issues that make it undesirable, related to statistical "safety" within the program. This has been discussed before on the list.) That might ultimately be boring -- a conclusion that Random Ranges can't be Forward Ranges. What I find more compelling is something different: many Random Ranges also have a unique challenge that the initial value of front() cannot be determined until it is accessed for the first time. To see why, consider the case of RandomSample and its corresponding bug 8314. The original version of RandomSample determined the initial value of front() in the constructor. This meant that if you read from the same sample multiple times, the first value would always be identical, destroying the statistical independence of the samples. What I realized recently is that this has knock-on effects elsewhere. For example, for popFront() to work correctly, the value of front() must be initialized. In the case of RandomSample, calling popFront() before front() can result in a statistical anomaly -- sampling (n-1) items with even probability from a range of length (N-1) -- whereas what it _should_ be is sampling (n-1) items from the remains of the input range after the first item has been selected. There may also be other methods or properties that depend on the initialization of front() -- for example, RandomSample's index() method assumes that the first sample point has already been chosen. At the same time, there are methods one would definitely want _not_ to affect the initialization of front(). For example, RandomSample has the .length property. If calling .length would in turn trigger initialization of front(), you could have code like this: auto sample = randomSample(iota(100), 5); enforce(sample.length == 5); // <--- if it initializes the value of // .front, it will fix the first sample point foreach(s; sample) writeln(s); writeln(); foreach(s; sample) // <--- and if front is initialized by the call to writeln(s); // .length, it means this second extraction of the // sample will have an identical first point to the // previous one, which destroys statistical // independence. Cf. Issue 8314. The simple fix here is for the programmer to recognize what methods require initialization of front(), to check inside those methods for some boolean switch that indicates initialization, and depending on that switch call some initialization function. In RandomSample we have: property auto ref front() { assert(!empty); // The first sample point must be determined here to avoid // having it always correspond to the first element of the // input. The rest of the sample points are determined each // time we call popFront(). if(_first) { // We can save ourselves a random variate by checking right // at the beginning if we should use Algorithm A. if((_alphaInverse * _toSelect) > _available) { _algorithmA = true; } else { _Vprime = newVprime(_toSelect); _algorithmA = false; } prime(); _first = false; } return _input.front; } ... and what is currently in the if(_first) { } scope could be moved to a separate private frontInit() method to be called by any of the methods and properties that require front initialization. The problem with this approach is that it leaves it up to the programmer to be competent and may in the process be a rich source of bugs in random ranges. Other consequences include the fact that properties like front() can't actually be const or pure. Some while ago [ http://forum.dlang.org/post/mailman.1227.1344816494.31962.digitalmars-d puremagic.com ] I speculated about the idea of a start() method that could be guaranteed to be called immediately before the first access to front(). In practice it would need to be more complex than that: in fact, the check that would need to be done would have to be something like this: (1) What variables are set in the start() method? (2) What methods or properties make use of those variables? (3) Find the first call to any one (and only one) of those methods or properties, and call start() right before it. One could also make it a requirement that any range whose popFront() calls on random number generation either define a start() function or explicitly disable it. However, that seems to be a very complicated thing to define and check, and I wonder what others think about its feasibility. Anyway, I think this email is long enough by now -- I'm curious what others think about these issues and how they might be addressed. In the short term I will probably fix the issues with RandomSample as described here, but I am very interested in ideas for more general solutions. {Enj,Destr}oy? :-) Best wishes, -- Joe
Jun 10 2013