digitalmars.D - Random sampling next steps
- Joseph Rushton Wakeling (94/94) Jul 23 2012 Hello all,
- travert phare.normalesup.org (Christophe Travert) (35/54) Jul 23 2012 [snip]
Hello all, At the risk of sounding like a one-trick pony, some thoughts on RandomSample and what remains to be done with it. Besides algorithmic improvements, the recent updates from Martin Nowak and myself have fixed a couple of key bugs: http://d.puremagic.com/issues/show_bug.cgi?id=7936 http://d.puremagic.com/issues/show_bug.cgi?id=8314 ... but we still have a couple of important outstanding issues: http://d.puremagic.com/issues/show_bug.cgi?id=7067 http://d.puremagic.com/issues/show_bug.cgi?id=8247 The solution here would seem to be identical and involve not changing RandomSample but changing the random number generators to being reference types; cf. discussion in those bug reports, and also here: http://forum.dlang.org/thread/4FD735EB.70404 webdrake.net This is a controversial issue as it would involve a breaking change to D's RNG design, but broadly speaking seems like a necessary breaking change to avoid these kinds of issues cropping up consistently across various different applications. Indeed, it seems likely that only flawed code, involving passing RNG by value, would have an issue with this change, and there may also be positive effects for future development of other parts of std.random. However, the issue I'd like to dwell on here is the one raised by Craig Dillabaugh in comments on my blog. The current RandomSample implementation covers only the case where the total number of items to pick from is known in advance. In other words, either your input range needs hasLength!R == true or you need to manually specify the total number of items when calling randomSample: auto randomSample(R)(R r, size_t n, size_t total) if(isInputRange!R) { return RandomSample!(R, void)(r, n, total); } auto randomSample(R)(R r, size_t n) if(isInputRange!R && hasLength!R) { return RandomSample!(R, void)(r, n, r.length); } ... and similarly in the constructor of RandomSample: static if (hasLength!R) this(R input, size_t howMany) { this(input, howMany, input.length); } this(R input, size_t howMany, size_t total) { _input = input; _available = total; _toSelect = howMany; enforce(_toSelect <= _available); _first = true; } But what if the total number of items is not known in advance? E.g. if you are reading a file, line by line, or reading records from a tape; you may know the total is finite, but you don't know what it actually _is_. There are dedicated algorithms for this, via the technique known as "reservoir sampling". Knuth lists Algorithm R; Vitter has defined an improved technique, Algorithm Z. These should be fairly easy to implement in D, but the question is how to integrate them with the existing random sampling functionality. The simplest way would be to just implement an additional struct and function, say, RandomReservoirSample and randomReservoirSample. However, I was wondering about a setup where there is a single interface that looks at the input range and other data you have provided and makes an intelligent choice of the optimal algorithm. It's this I'd like advice on. One solution would be to have two different underlying structs which are selected depending on provided data: auto randomSample(R)(R r, size_t n) if(isInputRange!R && hasLength!R) { return RandomSample!(R, void)(r, n, r.length); } auto randomSample(R)(R r, size_t n) if(isInputRange!R && !hasLength!R) { return RandomReservoirSample!(R, void)(r, n); } ... but doing something similar within RandomSample doesn't seem so easy. Why? Because the static if()s that you'd require within the struct would not depend just on whether hasLength!R == true, but also on whether you'd passed a size_t total to the constructor. I'm not sure how to factor in the latter case, even though it's clearly a compile-time issue. Can anyone advise? Basically, I want to be able to say: static if(!hasLength!R && no_total_was_passed) then certain functions or certain variables are or aren't defined. I also think it would be a good idea for the reservoir sampling technique to emit a warning when in debug mode, to prompt the user to be _sure_ that they can't specify the total number of points to sample from. Is there a recommended method of doing something like this? Alternatively, would people prefer to entirely separate the known-total and unknown-total sampling methods entirely, so the choice is always manual? Finally, if hasLength!R == false, is there any way of guaranteeing that the input range is still going to be ultimately finite? There could be some very nasty worst-case behaviour in the case of infinite ranges. I'm asking this here rather than d-learn as it's related to design choices for Phobos, but I'll happily move the discussion over if necessary. Thanks and best wishes, -- Joe
Jul 23 2012
Joseph Rushton Wakeling , dans le message (digitalmars.D:172997), a écrit :In other words, either your input range needs hasLength!R == true or you need to manually specify the total number of items when calling randomSample: But what if the total number of items is not known in advance? E.g. if you are reading a file, line by line, or reading records from a tape; you may know the total is finite, but you don't know what it actually _is_.[snip]... but doing something similar within RandomSample doesn't seem so easy. Why? Because the static if()s that you'd require within the struct would not depend just on whether hasLength!R == true, but also on whether you'd passed a size_t total to the constructor.Why not using takeExactly ? this is the standard way select from a subset of the original range. I wouldn't even have provided the overload with 3 arguments, the user being able to use takeExactly when necessary (which could be advised in the doc in case the user doesn't know). struct RandomSample(R) if (isInputRange!R && hasLength!R) { ...// always use r.length, never total/available } auto randomSample(R)(R r, size_t n, size_t total) if(isInputRange!R) { return randomSample!(R, void)(takeExactly(r, total), n); } struct RandomSample(R) if(isInputRange!R && !hasLength!R) { ...// always reservoir random sample } There is no more issue here.I also think it would be a good idea for the reservoir sampling technique to emit a warning when in debug mode, to prompt the user to be _sure_ that they can't specify the total number of points to sample from. Is there a recommended method of doing something like this?I don't think library polluting compiler warnings is recommended.Alternatively, would people prefer to entirely separate the known-total and unknown-total sampling methods entirely, so the choice is always manual?RandomSample is a lazy range. RandomReservoirSample is not, and has a completely different implementation. IMHO, there is a fundamental difference that justifies to have a separate function with a different name.Finally, if hasLength!R == false, is there any way of guaranteeing that the input range is still going to be ultimately finite? There could be some very nasty worst-case behaviour in the case of infinite ranges.IsInfinite!Range. However, a finite range could return false on empty indefinitely, would the implementer of the range just forget to make empty an enum, or the user meet a corner case (e.g. repeat(1).until(2)). But that's a general problem, that would make most eager algorithm result in an infinite loop, starting with array and copy... -- Christophe
Jul 23 2012