digitalmars.D - Random sampling next steps
- Joseph Rushton Wakeling (94/94) Jul 23 2012 Hello all,
- travert phare.normalesup.org (Christophe Travert) (35/54) Jul 23 2012 [snip]
Hello all,
At the risk of sounding like a one-trick pony, some thoughts on RandomSample
and
what remains to be done with it.
Besides algorithmic improvements, the recent updates from Martin Nowak and
myself have fixed a couple of key bugs:
http://d.puremagic.com/issues/show_bug.cgi?id=7936
http://d.puremagic.com/issues/show_bug.cgi?id=8314
... but we still have a couple of important outstanding issues:
http://d.puremagic.com/issues/show_bug.cgi?id=7067
http://d.puremagic.com/issues/show_bug.cgi?id=8247
The solution here would seem to be identical and involve not changing
RandomSample but changing the random number generators to being reference
types;
cf. discussion in those bug reports, and also here:
http://forum.dlang.org/thread/4FD735EB.70404 webdrake.net
This is a controversial issue as it would involve a breaking change to D's RNG
design, but broadly speaking seems like a necessary breaking change to avoid
these kinds of issues cropping up consistently across various different
applications. Indeed, it seems likely that only flawed code, involving passing
RNG by value, would have an issue with this change, and there may also be
positive effects for future development of other parts of std.random.
However, the issue I'd like to dwell on here is the one raised by Craig
Dillabaugh in comments on my blog. The current RandomSample implementation
covers only the case where the total number of items to pick from is known in
advance. In other words, either your input range needs hasLength!R == true or
you need to manually specify the total number of items when calling
randomSample:
auto randomSample(R)(R r, size_t n, size_t total)
if(isInputRange!R)
{
return RandomSample!(R, void)(r, n, total);
}
auto randomSample(R)(R r, size_t n)
if(isInputRange!R && hasLength!R)
{
return RandomSample!(R, void)(r, n, r.length);
}
... and similarly in the constructor of RandomSample:
static if (hasLength!R)
this(R input, size_t howMany)
{
this(input, howMany, input.length);
}
this(R input, size_t howMany, size_t total)
{
_input = input;
_available = total;
_toSelect = howMany;
enforce(_toSelect <= _available);
_first = true;
}
But what if the total number of items is not known in advance? E.g. if you are
reading a file, line by line, or reading records from a tape; you may know the
total is finite, but you don't know what it actually _is_.
There are dedicated algorithms for this, via the technique known as "reservoir
sampling". Knuth lists Algorithm R; Vitter has defined an improved technique,
Algorithm Z. These should be fairly easy to implement in D, but the question
is
how to integrate them with the existing random sampling functionality.
The simplest way would be to just implement an additional struct and function,
say, RandomReservoirSample and randomReservoirSample. However, I was wondering
about a setup where there is a single interface that looks at the input range
and other data you have provided and makes an intelligent choice of the optimal
algorithm. It's this I'd like advice on.
One solution would be to have two different underlying structs which are
selected depending on provided data:
auto randomSample(R)(R r, size_t n)
if(isInputRange!R && hasLength!R)
{
return RandomSample!(R, void)(r, n, r.length);
}
auto randomSample(R)(R r, size_t n)
if(isInputRange!R && !hasLength!R)
{
return RandomReservoirSample!(R, void)(r, n);
}
... but doing something similar within RandomSample doesn't seem so easy. Why?
Because the static if()s that you'd require within the struct would not
depend
just on whether hasLength!R == true, but also on whether you'd passed a size_t
total to the constructor. I'm not sure how to factor in the latter case, even
though it's clearly a compile-time issue. Can anyone advise?
Basically, I want to be able to say: static if(!hasLength!R &&
no_total_was_passed) then certain functions or certain variables are or aren't
defined.
I also think it would be a good idea for the reservoir sampling technique to
emit a warning when in debug mode, to prompt the user to be _sure_ that they
can't specify the total number of points to sample from. Is there a
recommended
method of doing something like this?
Alternatively, would people prefer to entirely separate the known-total and
unknown-total sampling methods entirely, so the choice is always manual?
Finally, if hasLength!R == false, is there any way of guaranteeing that the
input range is still going to be ultimately finite? There could be some very
nasty worst-case behaviour in the case of infinite ranges.
I'm asking this here rather than d-learn as it's related to design choices for
Phobos, but I'll happily move the discussion over if necessary.
Thanks and best wishes,
-- Joe
Jul 23 2012
Joseph Rushton Wakeling , dans le message (digitalmars.D:172997), a écrit :In other words, either your input range needs hasLength!R == true or you need to manually specify the total number of items when calling randomSample: But what if the total number of items is not known in advance? E.g. if you are reading a file, line by line, or reading records from a tape; you may know the total is finite, but you don't know what it actually _is_.[snip]... but doing something similar within RandomSample doesn't seem so easy. Why? Because the static if()s that you'd require within the struct would not depend just on whether hasLength!R == true, but also on whether you'd passed a size_t total to the constructor.Why not using takeExactly ? this is the standard way select from a subset of the original range. I wouldn't even have provided the overload with 3 arguments, the user being able to use takeExactly when necessary (which could be advised in the doc in case the user doesn't know). struct RandomSample(R) if (isInputRange!R && hasLength!R) { ...// always use r.length, never total/available } auto randomSample(R)(R r, size_t n, size_t total) if(isInputRange!R) { return randomSample!(R, void)(takeExactly(r, total), n); } struct RandomSample(R) if(isInputRange!R && !hasLength!R) { ...// always reservoir random sample } There is no more issue here.I also think it would be a good idea for the reservoir sampling technique to emit a warning when in debug mode, to prompt the user to be _sure_ that they can't specify the total number of points to sample from. Is there a recommended method of doing something like this?I don't think library polluting compiler warnings is recommended.Alternatively, would people prefer to entirely separate the known-total and unknown-total sampling methods entirely, so the choice is always manual?RandomSample is a lazy range. RandomReservoirSample is not, and has a completely different implementation. IMHO, there is a fundamental difference that justifies to have a separate function with a different name.Finally, if hasLength!R == false, is there any way of guaranteeing that the input range is still going to be ultimately finite? There could be some very nasty worst-case behaviour in the case of infinite ranges.IsInfinite!Range. However, a finite range could return false on empty indefinitely, would the implementer of the range just forget to make empty an enum, or the user meet a corner case (e.g. repeat(1).until(2)). But that's a general problem, that would make most eager algorithm result in an infinite loop, starting with array and copy... -- Christophe
Jul 23 2012








travert phare.normalesup.org (Christophe Travert)