## digitalmars.D.learn - randomShuffle

• Yann (15/15) Jun 03 2013 Hey,
• maarten van damme (2/16) Jun 03 2013
• bearophile (12/15) Jun 03 2013 Two ways, the first gives items in random order, the second
• Yann (4/19) Jun 03 2013 Thanks a lot for the suggestions!
• Diggory (14/39) Jun 03 2013 For small samples from very large ranges an efficient algorithm
• Joseph Rushton Wakeling (4/21) Jun 03 2013 Are you sure about the efficiency of that algorithm? Looks like it's be...
• Joseph Rushton Wakeling (6/29) Jun 03 2013 There are at least two other key faults with that algorithm compared to
• Joseph Rushton Wakeling (10/30) Jun 03 2013 Apologies, I misread the algorithm slightly. Your qualifier is quite co...
• Diggory (16/54) Jun 03 2013 Thanks for testing before dismissing completely :P The way it
• Joseph Rushton Wakeling (17/29) Jun 03 2013 It's an interesting little piece of invention. I'll have to look around...
• Diggory (5/54) Jun 03 2013 I'd guess that the heavy use of floating point arithmetic to
• Joseph Rushton Wakeling (2/5) Jun 03 2013 Yes, I agree. There might be some optimizations that could be done ther...
• Diggory (40/48) Jun 04 2013 Well I've come up with this new algorithm:
• Diggory (4/53) Jun 04 2013 OK, posted too soon! There's a small bug in this one where it
• Joseph Rushton Wakeling (10/12) Jun 03 2013 You mean you want a random sample of the numbers 1, 2, ..., 1000?
"Yann" <skratchie gmx.de> writes:
```Hey,

I am trying to generate an array of 10 unique (!) random numbers
from 1 to 1000 each (for example). The best I could come up with
is:

auto arr = iota(1, 1000).array;
randomShuffle(arr);
return arr.take(10).array.sort;

This leaves me quite unhappy, because I would like the code
a) to be one line
b) not use "array" more than once

Is there a better way to accomplish this? Naively, I would expect
something like
"return iota(1, 1000).randomShuffle.take(10).sort;"

Thanks,
Yann
```
Jun 03 2013
maarten van damme <maartenvd1994 gmail.com> writes:
```How about using std.random.randomSample?

2013/6/3 Yann <skratchie gmx.de>

Hey,

I am trying to generate an array of 10 unique (!) random numbers from 1 to
1000 each (for example). The best I could come up with is:

auto arr = iota(1, 1000).array;
randomShuffle(arr);
return arr.take(10).array.sort;

This leaves me quite unhappy, because I would like the code
a) to be one line
b) not use "array" more than once

Is there a better way to accomplish this? Naively, I would expect
something like
"return iota(1, 1000).randomShuffle.take(10).sort;"

Thanks,
Yann

```
Jun 03 2013
"bearophile" <bearophileHUGS lycos.com> writes:
```Yann:

Is there a better way to accomplish this? Naively, I would
expect something like
"return iota(1, 1000).randomShuffle.take(10).sort;"

Two ways, the first gives items in random order, the second
ordered as you seem to desire:

import std.stdio, std.random, std.range, std.array;
void main() {
    iota(1, 1001).randomCover(rndGen).take(10).writeln;
    iota(1, 1001).randomSample(10).writeln;
}

But be careful with randomCover, because maybe it takes the rnd
generator by value.

Bye,
bearophile
```
Jun 03 2013
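[As a hedged illustration of bearophile's caveat: if randomCover copies its generator, reusing rndGen could replay the same "random" order on a later call. One workaround, sketched against the 2013-era randomCover(range, rng) overload (the function name here is my own, not from the thread), is to seed a fresh generator per call:]

```d
import std.array, std.random, std.range;

// Ten distinct values from 1 .. 1000, in random order.
// A fresh, unpredictably seeded generator per call avoids repeating
// the same cover order if the generator is taken by value.
int[] tenRandomOrder() {
    auto rnd = Random(unpredictableSeed);
    return iota(1, 1001).randomCover(rnd).take(10).array;
}
```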
"Yann" <skratchie gmx.de> writes:
```Thanks a lot for the suggestions!

Cheers,
Yann

On Monday, 3 June 2013 at 10:06:30 UTC, bearophile wrote:
Yann:

Is there a better way to accomplish this? Naively, I would
expect something like
"return iota(1, 1000).randomShuffle.take(10).sort;"

Two ways, the first gives items in random order, the second
ordered as you seem to desire:

import std.stdio, std.random, std.range, std.array;
void main() {
    iota(1, 1001).randomCover(rndGen).take(10).writeln;
    iota(1, 1001).randomSample(10).writeln;
}

But be careful with randomCover, because maybe it takes the rnd
generator by value.

Bye,
bearophile

```
Jun 03 2013
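[For reference, the one-liner the original post asked for falls out of randomSample directly: note iota(1, 1001) so 1000 is included, and that randomSample draws without replacement while preserving the input range's order, so the result comes out already ascending with no explicit sort. A sketch (the function name is my own):]

```d
import std.array, std.random, std.range;

// Ten unique values from 1 .. 1000, ascending, in one expression.
// randomSample preserves range order, so the output is already sorted.
uint[] tenUnique() {
    return iota(1u, 1001u).randomSample(10).array;
}
```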
```On Monday, 3 June 2013 at 10:10:15 UTC, Yann wrote:
Thanks a lot for the suggestions!

Cheers,
Yann

On Monday, 3 June 2013 at 10:06:30 UTC, bearophile wrote:
Yann:

Is there a better way to accomplish this? Naively, I would
expect something like
"return iota(1, 1000).randomShuffle.take(10).sort;"

Two ways, the first gives items in random order, the second
ordered as you seem to desire:

import std.stdio, std.random, std.range, std.array;
void main() {
    iota(1, 1001).randomCover(rndGen).take(10).writeln;
    iota(1, 1001).randomSample(10).writeln;
}

But be careful with randomCover, because maybe it takes the
rnd generator by value.

Bye,
bearophile

For small samples from very large ranges an efficient algorithm
would be:

int[] randomGen(int N, int M) {
    if (N == 0)
        return [];

    int[] result = randomGen(N-1, M-1);

    int num = rand(M); // i.e. a uniform integer in [0, M), e.g. std.random.uniform(0, M)
    foreach (ref int i; result)
        if (i >= num)
            ++i;

    result ~= num;

    return result;
}
```
Jun 03 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
```On 06/03/2013 01:29 PM, Diggory wrote:
For small samples from very large ranges an efficient algorithm would be:

int[] randomGen(int N, int M) {
    if (N == 0)
        return [];

    int[] result = randomGen(N-1, M-1);

    int num = rand(M);
    foreach (ref int i; result)
        if (i >= num)
            ++i;

    result ~= num;

    return result;
}

Are you sure about the efficiency of that algorithm?  Looks like it'd be O(M)
to me.

randomSample currently has a computational complexity (and number of random
numbers required) of O(n) where n is sample size.
```
Jun 03 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
```On 06/03/2013 02:30 PM, Joseph Rushton Wakeling wrote:
On 06/03/2013 01:29 PM, Diggory wrote:
For small samples from very large ranges an efficient algorithm would be:

int[] randomGen(int N, int M) {
    if (N == 0)
        return [];

    int[] result = randomGen(N-1, M-1);

    int num = rand(M);
    foreach (ref int i; result)
        if (i >= num)
            ++i;

    result ~= num;

    return result;
}

Are you sure about the efficiency of that algorithm?  Looks like it'd be O(M)
to me.

randomSample currently has a computational complexity (and number of random
numbers required) of O(n) where n is sample size.

There are at least two other key faults with that algorithm compared to
randomSample -- you're alloc'ing repeatedly inside it, and you return the whole
sample.  randomSample doesn't require any such storage, it can take any input
range as the sample source and just jumps across the range; and what it returns
is another range interface, from which you can take the sample points
sequentially.
```
Jun 03 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
```On 06/03/2013 02:30 PM, Joseph Rushton Wakeling wrote:
On 06/03/2013 01:29 PM, Diggory wrote:
For small samples from very large ranges an efficient algorithm would be:

int[] randomGen(int N, int M) {
    if (N == 0)
        return [];

    int[] result = randomGen(N-1, M-1);

    int num = rand(M);
    foreach (ref int i; result)
        if (i >= num)
            ++i;

    result ~= num;

    return result;
}

Are you sure about the efficiency of that algorithm?  Looks like it'd be O(M)
to me.

Apologies, I misread the algorithm slightly.  Your qualifier is quite correct --
when the number of sample points is very small (e.g. 5 out of some arbitrary
very large number), that algorithm is very quick indeed (I ran some tests as I
was curious:-), and can outperform D's randomSample.

It doesn't scale with the size of the input (your value M), but with the number
of sample points n, but that scaling is very bad -- so as the sample size grows
it becomes _much_ worse than randomSample.

Do you have a reference for the algorithm?  Off the top of my head I don't
recognize it from any of the texts I've read.
```
Jun 03 2013
```On Monday, 3 June 2013 at 13:18:30 UTC, Joseph Rushton Wakeling
wrote:
On 06/03/2013 02:30 PM, Joseph Rushton Wakeling wrote:
On 06/03/2013 01:29 PM, Diggory wrote:
For small samples from very large ranges an efficient
algorithm would be:

int[] randomGen(int N, int M) {
    if (N == 0)
        return [];

    int[] result = randomGen(N-1, M-1);

    int num = rand(M);
    foreach (ref int i; result)
        if (i >= num)
            ++i;

    result ~= num;

    return result;
}

Are you sure about the efficiency of that algorithm?  Looks
like it'd be O(M) to me.

Apologies, I misread the algorithm slightly.  Your qualifier is quite correct --
when the number of sample points is very small (e.g. 5 out of
some arbitrary
very large number), that algorithm is very quick indeed (I ran
some tests as I
was curious:-), and can outperform D's randomSample.

It doesn't scale with the size of the input (your value M), but
with the number
of sample points n, but that scaling is very bad -- so as the
sample size grows
it becomes _much_ worse than randomSample.

Do you have a reference for the algorithm?  Off the top of my head I don't
recognize it from any of the texts I've read.

Thanks for testing before dismissing completely :P The way it
returns results can be improved a lot by pre-allocating a range
of the necessary size/using a range passed in.

The complexity is O(N²) where N is the number of samples out of
M. The algorithm is something I just thought of although I've
possibly used it before and I'm sure someone else has thought of
it too. It's quite simple, it effectively says "pick a value from
the range except for this value" recursively but instead of
dividing the range in two, it shifts the top part down first to
make a contiguous range and then shifts the results which are
past the split up one afterward.

I expect the phobos implementation to be something like O(M) in
which case my method would be faster whenever N < sqrt(M) (with
the optimisations)
```
Jun 03 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
```On 06/03/2013 07:00 PM, Diggory wrote:
Thanks for testing before dismissing completely :P The way it returns results
can be improved a lot by pre-allocating a range of the necessary size/using a
range passed in.

Surely. :-)

The complexity is O(N²) where N is the number of samples out of M. The
algorithm
is something I just thought of although I've possibly used it before and I'm
sure someone else has thought of it too. It's quite simple, it effectively says
"pick a value from the range except for this value" recursively but instead of
dividing the range in two, it shifts the top part down first to make a
contiguous range and then shifts the results which are past the split up one
afterward.

It's an interesting little piece of invention.  I'll have to look around and see
if there's anything similar documented.  I've not seen anything in the
literature which has this particular property of O(N) random variates plus
O(N^2) calculation.

The fact that it's not sequential might have been a bit of an issue
historically, which maybe explains why I haven't come across something like it.

I expect the phobos implementation to be something like O(M) in which case my
method would be faster whenever N < sqrt(M) (with the optimisations)

No, the Phobos implementation is O(N) -- I wrote it :-)  See:
http://braingam.es/2012/07/sampling-d/

The API is Andrei's -- the original implementation used what Knuth refers to as
Algorithm S [which is really simple and easy to implement and check, but is
indeed O(M)], I updated it to use Algorithm D (which is complicated:-).

So, since randomSample takes kN time where k is constant, your algorithm will
probably win where N < k.

I found that the sampling time was about equal with N = 50 and your algorithm as
it's currently written.
```
Jun 03 2013
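[For reference, Knuth's Algorithm S (selection sampling) mentioned above can be sketched in a few lines. It takes each remaining candidate with probability (still needed)/(still remaining), which gives a uniform sample without replacement, but it visits every candidate, which is why it is O(M). The function name is my own, not from the thread:]

```d
import std.random;

// Knuth's Algorithm S: select n items uniformly from 0 .. total-1.
// Emits indices in ascending order; scans all `total` candidates: O(total).
uint[] algorithmS(uint n, uint total, ref Random rng) {
    uint[] result;
    uint needed = n;
    foreach (uint i; 0 .. total) {
        // Take candidate i with probability needed / (total - i).
        if (uniform(0u, total - i, rng) < needed) {
            result ~= i;
            --needed;
            if (needed == 0)
                break;
        }
    }
    return result;
}
```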
```On Monday, 3 June 2013 at 17:35:22 UTC, Joseph Rushton Wakeling
wrote:
On 06/03/2013 07:00 PM, Diggory wrote:
Thanks for testing before dismissing completely :P The way it
returns results
can be improved a lot by pre-allocating a range of the
necessary size/using a
range passed in.

Surely. :-)

The complexity is O(N²) where N is the number of samples out
of M. The algorithm
is something I just thought of although I've possibly used it
before and I'm
sure someone else has thought of it too. It's quite simple, it
effectively says
"pick a value from the range except for this value" recursively
but instead of dividing the range in two, it shifts the top part down first
to make a
contiguous range and then shifts the results which are past
the split up one
afterward.

It's an interesting little piece of invention.  I'll have to
look around and see
if there's anything similar documented.  I've not seen anything
in the
literature which has this particular property of O(N) random
variates plus
O(N^2) calculation.

The fact that it's not sequential might have been a bit of an
issue
historically, which maybe explains why I haven't come across
something like it.

I expect the phobos implementation to be something like O(M)
in which case my
method would be faster whenever N < sqrt(M) (with the
optimisations)

No, the Phobos implementation is O(N) -- I wrote it :-)  See:
http://braingam.es/2012/07/sampling-d/

The API is Andrei's -- the original implementation used what
Knuth refers to as
Algorithm S [which is really simple and easy to implement and
check, but is
indeed O(M)], I updated it to use Algorithm D (which is
complicated:-).

So, since randomSample takes kN time where k is constant, your
algorithm will
probably win where N < k.

I found that the sampling time was about equal with N = 50 and your algorithm as
it's currently written.

I'd guess that the heavy use of floating point arithmetic to
calculate the step sizes means that algorithm has a fairly large
constant overhead even though the complexity is smaller.
```
Jun 03 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
```On 06/03/2013 08:28 PM, Diggory wrote:
I'd guess that the heavy use of floating point arithmetic to calculate the step
sizes means that algorithm has a fairly large constant overhead even though the
complexity is smaller.

Yes, I agree.  There might be some optimizations that could be done there.
```
Jun 03 2013
```On Monday, 3 June 2013 at 21:24:50 UTC, Joseph Rushton Wakeling
wrote:
On 06/03/2013 08:28 PM, Diggory wrote:
I'd guess that the heavy use of floating point arithmetic to
calculate the step
sizes means that algorithm has a fairly large constant overhead even though the
complexity is smaller.

Yes, I agree.  There might be some optimizations that could be
done there.

Well I've come up with this new algorithm:

uint[] randomSample(uint N, uint M) {
    uint[] result = new uint[N];

    struct hashPair {
        uint key;
        uint index;
    }

    size_t tableSize = N*4;
    if (tableSize > M)
        tableSize = M;

    hashPair[] table = new hashPair[tableSize];

    for (uint i = 0; i < N; ++i) {
        uint v = (rndGen.front % (M-i))+i;
        uint newv = v;
        rndGen.popFront();

        uint vhash = v%tableSize;

        while (table[vhash].index) {
            if (table[vhash].key == v) {
                newv = table[vhash].index-1;
                table[vhash].index = i+1;
                goto done;
            }

            vhash = (vhash+1)%tableSize;
        }

        table[vhash].key = v;
        table[vhash].index = i+1;

    done:
        result[i] = newv;
    }

    return result;
}

It's O(N) rather than O(N²). Conceptually it works by randomly
shuffling the first N items in an array of size M which takes
O(N) time but requires an array of size O(M), so to avoid this it
uses a simple hash table to store just the items which have been
swapped. Since only O(N) items are being shuffled there can only
be O(N) swaps and so the hash table can be O(N) in size while
still offering constant time look-up.
```
Jun 04 2013
```On Tuesday, 4 June 2013 at 08:30:58 UTC, Diggory wrote:
On Monday, 3 June 2013 at 21:24:50 UTC, Joseph Rushton Wakeling
wrote:
On 06/03/2013 08:28 PM, Diggory wrote:
I'd guess that the heavy use of floating point arithmetic to
calculate the step
sizes means that algorithm has a fairly large constant overhead
even though the complexity is smaller.

Yes, I agree.  There might be some optimizations that could be
done there.

Well I've come up with this new algorithm:

uint[] randomSample(uint N, uint M) {
    uint[] result = new uint[N];

    struct hashPair {
        uint key;
        uint index;
    }

    size_t tableSize = N*4;
    if (tableSize > M)
        tableSize = M;

    hashPair[] table = new hashPair[tableSize];

    for (uint i = 0; i < N; ++i) {
        uint v = (rndGen.front % (M-i))+i;
        uint newv = v;
        rndGen.popFront();

        uint vhash = v%tableSize;

        while (table[vhash].index) {
            if (table[vhash].key == v) {
                newv = table[vhash].index-1;
                table[vhash].index = i+1;
                goto done;
            }

            vhash = (vhash+1)%tableSize;
        }

        table[vhash].key = v;
        table[vhash].index = i+1;

    done:
        result[i] = newv;
    }

    return result;
}

It's O(N) rather than O(N²). Conceptually it works by randomly
shuffling the first N items in an array of size M which takes
O(N) time but requires an array of size O(M), so to avoid this
it uses a simple hash table to store just the items which have
been swapped. Since only O(N) items are being shuffled there
can only be O(N) swaps and so the hash table can be O(N) in
size while still offering constant time look-up.

OK, posted too soon! There's a small bug in this one where it
could potentially return the same value twice, it can be fixed by
looking up "i" in the hash table as well as "v".
```
Jun 04 2013
```Here's the fixed one:

uint[] randomSample(uint N, uint M) {
    uint[] result = new uint[N];

    struct hashPair {
        uint key;
        uint index;
    }

    size_t tableSize = N*4;
    if (tableSize > M)
        tableSize = M;

    hashPair[] table = new hashPair[tableSize];

    for (uint i = 0; i < N; ++i) {
        uint v = (rndGen.front % (M-i))+i;
        uint newv = v, newi = i;
        rndGen.popFront();

        uint ihash = i%tableSize;
        while (table[ihash].index) {
            if (table[ihash].key == i) {
                newi = table[ihash].index-1;
                break;
            }
            ihash = (ihash+1)%tableSize; // advance the probe, or this loop never terminates on a collision
        }

        uint vhash = v%tableSize;

        while (table[vhash].index) {
            if (table[vhash].key == v) {
                newv = table[vhash].index-1;
                table[vhash].index = newi+1;
                goto done;
            }

            vhash = (vhash+1)%tableSize;
        }

        table[vhash].key = v;
        table[vhash].index = newi+1;

    done:
        result[i] = newv;
    }

    return result;
}

Still a few places to optimise, and I think the compiler
optimisation should be able to give a decent speed up as well.
Would be interested to see how it compares in your benchmark!
```
Jun 04 2013
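[The hash-table approach above is in essence a partial Fisher-Yates shuffle with sparse storage for the swapped slots. A more compact sketch of the same idea using D's built-in associative array, trading the custom open-addressing table for GC allocations; the names here are my own, not from the post:]

```d
import std.random;

// Partial Fisher-Yates shuffle over a virtual array 0 .. total-1,
// storing only the slots that have actually been swapped: O(n) time
// and O(n) extra space for n samples.
uint[] sparseSample(uint n, uint total, ref Random rng) {
    uint[uint] swapped;          // virtual slot -> value currently stored there
    auto result = new uint[n];
    foreach (uint i; 0 .. n) {
        uint j = uniform(i, total, rng);  // pick from the unshuffled tail
        uint vj = swapped.get(j, j);      // value currently at slot j
        uint vi = swapped.get(i, i);      // value currently at slot i
        swapped[j] = vi;                  // swap slots i and j
        result[i] = vj;
    }
    return result;
}
```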
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
```On 06/04/2013 02:03 PM, Diggory wrote:
Still a few places to optimise, and I think the compiler optimisation should be
able to give a decent speed up as well. Would be interested to see how it
compares in your benchmark!

I'll try it out later and get back to you. :-)
```
Jun 04 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
```On 06/04/2013 01:03 PM, Diggory wrote:
Still a few places to optimise, and I think the compiler optimisation should be
able to give a decent speed up as well. Would be interested to see how it
compares in your benchmark!

VERY belated apologies for not getting back to you here.  I was preparing an
email on the nice features of RandomSample -- e.g. that you can use it to sample
lines from a file -- when I realized that actually there was a bug that stopped
you from doing that:
http://d.puremagic.com/issues/show_bug.cgi?id=10265

So I stopped to prepare a fix, and then I got distracted with a bunch of things
I realized could be done to improve std.random ... :-P

Anyway, I _will_ do the benchmark.  Just might be a little while.
```
Jun 20 2013
```On Thursday, 20 June 2013 at 21:48:49 UTC, Joseph Rushton
Wakeling wrote:
On 06/04/2013 01:03 PM, Diggory wrote:
Still a few places to optimise, and I think the compiler
optimisation should be
able to give a decent speed up as well. Would be interested to
see how it compares in your benchmark!

VERY belated apologies for not getting back to you here.  I was
preparing an
email on the nice features of RandomSample -- e.g. that you can
use it to sample
lines from a file -- when I realized that actually there was a
bug that stopped
you from doing that:
http://d.puremagic.com/issues/show_bug.cgi?id=10265

So I stopped to prepare a fix, and then I got distracted with a
bunch of things
I realized could be done to improve std.random ... :-P

Anyway, I _will_ do the benchmark.  Just might be a little
while.

Haha, no problem!
```
Jun 21 2013
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
```On 06/03/2013 10:48 AM, Yann wrote:
I am trying to generate an array of 10 unique (!) random numbers from 1 to 1000
each (for example). The best I could come up with is:

You mean you want a random sample of the numbers 1, 2, ..., 1000?

That is, you want to pick 10 unique numbers from 1, ..., 1000 with equal
probability of picking any one?
```
Jun 03 2013