www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Random sampling next steps

reply Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
Hello all,

At the risk of sounding like a one-trick pony, some thoughts on RandomSample
and 
what remains to be done with it.

Besides algorithmic improvements, the recent updates from Martin Nowak and 
myself have fixed a couple of key bugs:
http://d.puremagic.com/issues/show_bug.cgi?id=7936
http://d.puremagic.com/issues/show_bug.cgi?id=8314

... but we still have a couple of important outstanding issues:
http://d.puremagic.com/issues/show_bug.cgi?id=7067
http://d.puremagic.com/issues/show_bug.cgi?id=8247

The solution here would seem to be identical and involve not changing 
RandomSample but changing the random number generators to being reference
types; 
cf. discussion in those bug reports, and also here:
http://forum.dlang.org/thread/4FD735EB.70404 webdrake.net

This is a controversial issue as it would involve a breaking change to D's RNG 
design, but broadly speaking seems like a necessary breaking change to avoid 
these kinds of issues cropping up consistently across various different 
applications.  Indeed, it seems likely that only flawed code, involving passing 
RNG by value, would have an issue with this change, and there may also be 
positive effects for future development of other parts of std.random.

However, the issue I'd like to dwell on here is the one raised by Craig 
Dillabaugh in comments on my blog.  The current RandomSample implementation 
covers only the case where the total number of items to pick from is known in 
advance.  In other words, either your input range needs hasLength!R == true or 
you need to manually specify the total number of items when calling
randomSample:

     auto randomSample(R)(R r, size_t n, size_t total)
     if(isInputRange!R)
     {
         return RandomSample!(R, void)(r, n, total);
     }

     auto randomSample(R)(R r, size_t n)
     if(isInputRange!R && hasLength!R)
     {
         return RandomSample!(R, void)(r, n, r.length);
     }

... and similarly in the constructor of RandomSample:

     static if (hasLength!R)
         this(R input, size_t howMany)
         {
             this(input, howMany, input.length);
         }

     this(R input, size_t howMany, size_t total)
     {
         _input = input;
         _available = total;
         _toSelect = howMany;
         enforce(_toSelect <= _available);
         _first = true;
     }

But what if the total number of items is not known in advance?  E.g. if you are 
reading a file, line by line, or reading records from a tape; you may know the 
total is finite, but you don't know what it actually _is_.

There are dedicated algorithms for this, via the technique known as "reservoir 
sampling".  Knuth lists Algorithm R; Vitter has defined an improved technique, 
Algorithm Z.  These should be fairly easy to implement in D, but the question
is 
how to integrate them with the existing random sampling functionality.

The simplest way would be to just implement an additional struct and function, 
say, RandomReservoirSample and randomReservoirSample.  However, I was wondering 
about a setup where there is a single interface that looks at the input range 
and other data you have provided and makes an intelligent choice of the optimal 
algorithm.  It's this I'd like advice on.

One solution would be to have two different underlying structs which are 
selected depending on provided data:

     auto randomSample(R)(R r, size_t n)
     if(isInputRange!R && hasLength!R)
     {
         return RandomSample!(R, void)(r, n, r.length);
     }

     auto randomSample(R)(R r, size_t n)
     if(isInputRange!R && !hasLength!R)
     {
         return RandomReservoirSample!(R, void)(r, n);
     }

... but doing something similar within RandomSample doesn't seem so easy.  Why? 
  Because the static if()s that you'd require within the struct would not
depend 
just on whether hasLength!R == true, but also on whether you'd passed a size_t 
total to the constructor.  I'm not sure how to factor in the latter case, even 
though it's clearly a compile-time issue.  Can anyone advise?

Basically, I want to be able to say: static if(!hasLength!R && 
no_total_was_passed) then certain functions or certain variables are or aren't 
defined.

I also think it would be a good idea for the reservoir sampling technique to 
emit a warning when in debug mode, to prompt the user to be _sure_ that they 
can't specify the total number of points to sample from.  Is there a
recommended 
method of doing something like this?

Alternatively, would people prefer to entirely separate the known-total and 
unknown-total sampling methods entirely, so the choice is always manual?

Finally, if hasLength!R == false, is there any way of guaranteeing that the 
input range is still going to be ultimately finite?  There could be some very 
nasty worst-case behaviour in the case of infinite ranges.

I'm asking this here rather than d-learn as it's related to design choices for 
Phobos, but I'll happily move the discussion over if necessary.

Thanks and best wishes,

     -- Joe
Jul 23 2012
parent travert phare.normalesup.org (Christophe Travert) writes:
Joseph Rushton Wakeling , dans le message (digitalmars.D:172997), a
 écrit :
 In other words, either your input range needs hasLength!R == true or 
 you need to manually specify the total number of items when calling
randomSample:
 
 But what if the total number of items is not known in advance?  E.g. if you
are 
 reading a file, line by line, or reading records from a tape; you may know the 
 total is finite, but you don't know what it actually _is_.
[snip]
 ... but doing something similar within RandomSample doesn't seem so easy. 
Why? 
   Because the static if()s that you'd require within the struct would not
depend 
 just on whether hasLength!R == true, but also on whether you'd passed a size_t 
 total to the constructor.
Why not using takeExactly ? this is the standard way select from a subset of the original range. I wouldn't even have provided the overload with 3 arguments, the user being able to use takeExactly when necessary (which could be advised in the doc in case the user doesn't know). struct RandomSample(R) if (isInputRange!R && hasLength!R) { ...// always use r.length, never total/available } auto randomSample(R)(R r, size_t n, size_t total) if(isInputRange!R) { return randomSample!(R, void)(takeExactly(r, total), n); } struct RandomSample(R) if(isInputRange!R && !hasLength!R) { ...// always reservoir random sample } There is no more issue here.
 I also think it would be a good idea for the reservoir sampling technique to 
 emit a warning when in debug mode, to prompt the user to be _sure_ that they 
 can't specify the total number of points to sample from.  Is there a
recommended 
 method of doing something like this?
I don't think library polluting compiler warnings is recommended.
 Alternatively, would people prefer to entirely separate the known-total and 
 unknown-total sampling methods entirely, so the choice is always manual?
RandomSample is a lazy range. RandomReservoirSample is not, and has a completely different implementation. IMHO, there is a fundamental difference that justifies to have a separate function with a different name.
 Finally, if hasLength!R == false, is there any way of guaranteeing that the 
 input range is still going to be ultimately finite?  There could be some very 
 nasty worst-case behaviour in the case of infinite ranges.
IsInfinite!Range. However, a finite range could return false on empty indefinitely, would the implementer of the range just forget to make empty an enum, or the user meet a corner case (e.g. repeat(1).until(2)). But that's a general problem, that would make most eager algorithm result in an infinite loop, starting with array and copy... -- Christophe
Jul 23 2012