digitalmars.D - randomSample with unknown length

Magnus Lie Hetland (21/21) Feb 02 2011 Reading the doc for std.random.randomSample, I saw that "The total

Andrei Alexandrescu (10/29) Feb 02 2011 I posted a problem solved by the algorithm above (and others, more

Simen kjaeraas (8/10) Feb 02 2011 Done. No comments or unittests yet, though.
Magnus Lie Hetland (8/16) Feb 02 2011 Sure. I was just thinking that you could have a version for the cases

Magnus Lie Hetland <magnus hetland.org> writes:

Reading the doc for std.random.randomSample, I saw that "The total 
length of r must be known". There are rather straightforward algorithms 
for drawing random samples *without* knowing this. This might be useful 
if one wants to support input ranges, I guess?

Take, for example, the method described by Knuth (TAoP 2), for 
selecting n elements uniformly at random from an input range:

- Select the first n elements as the current sample.
- Each subsequent element is rejected with a probability of 1 - n/t, 
where t is the number seen so far.
- If a new item is selected, it replaces a random item in the current sample.

A cool property of this is that at any time, the current sample is one 
drawn randomly (i.e., uniformly, without replacement) from the items 
you've seen so far, so you could really stop at any point. That is, 
stop iterating over the input; you can't really give the output as a 
small-footprint range here (as far as I can see). Seems you have to 
allocate room for n pointers. (Or you *could* just keep track of which 
objects were swapped in -- might be worth the overhead if n is large 
compared to the input size.)

-- 
Magnus Lie Hetland
http://hetland.org

Feb 02 2011

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 2/2/11 6:03 AM, Magnus Lie Hetland wrote:
 Reading the doc for std.random.randomSample, I saw that "The total
 length of r must be known". There are rather straightforward algorithms
 for drawing random samples *without* knowing this. This might be useful
 if one wants to support input ranges, I guess?

 Take, for example, the method described by Knuth (TAoP 2), for selecting
 n elements uniformly at random from an input range:

 - Select the first n elements as the current sample.
 - Each subsequent element is rejected with a probability of 1 - n/t,
 where t is the number seen so far.
 - If a new item is selected, it replaces a random item in the current
 sample.

 A cool property of this is that at any time, the current sample is one
 drawn randomly (i.e., uniformly, without replacement) from the items
 you've seen so far, so you could really stop at any point. That is, stop
 iterating over the input; you can't really give the output as a
 small-footprint range here (as far as I can see). Seems you have to
 allocate room for n pointers. (Or you *could* just keep track of which
 objects were swapped in -- might be worth the overhead if n is large
 compared to the input size.)

I posted a problem solved by the algorithm above (and others, more 
sophisticated ones) as a challenge to this group a couple of years ago.

randomSample is designed to subsample a large stream in constant space 
and without needing to scan the entire stream in order to output the 
first element. I used in in my dissertation where e.g. I had to select 
100K samples from a stream of many millions.

Having a reservoir sample available would be nice. I'd be thrilled if 
you coded up a reservoirSample(r, n) function for addition to std.random.


Andrei

Feb 02 2011

"Simen kjaeraas" <simen.kjaras gmail.com> writes:

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

 Having a reservoir sample available would be nice. I'd be thrilled if  
 you coded up a reservoirSample(r, n) function for addition to std.random.

Done. No comments or unittests yet, though.

randomSampleRange( R, num ) takes num samples from an input range, and
keeps a reservoir that is updated as you traverse the range (lazy, if you
wish).

randomSample( R, num ) takes num samples from all over a range (eager).

-- 
Simen

Feb 02 2011

Magnus Lie Hetland <magnus hetland.org> writes:

On 2011-02-02 16:32:25 +0100, Andrei Alexandrescu said:

 randomSample is designed to subsample a large stream in constant space 
 and without needing to scan the entire stream in order to output the 
 first element.

Sure. I was just thinking that you could have a version for the cases 
where there was no end in sight :)

 I used in in my dissertation where e.g. I had to select 100K samples 
 from a stream of many millions.

Cool.

 Having a reservoir sample available would be nice. I'd be thrilled if 
 you coded up a reservoirSample(r, n) function for addition to 
 std.random.

Seems Simen beat me to it :)

-- 
Magnus Lie Hetland
http://hetland.org

Feb 02 2011

D Programming

C/C++ Programming

Other

digitalmars.D - randomSample with unknown length