digitalmars.D - random k-sample of a file
- Andrei Alexandrescu (8/8) Oct 09 2008 I just ran across a nice little problem that I wanted to share as an
- dsimcha (11/19) Oct 09 2008 Pretty simple, actually. Read the file one character at a time. Get th...
- Andrei Alexandrescu (5/27) Oct 09 2008 So this would shuffle that index.
- Walter Bright (5/10) Oct 09 2008 1. read the file, counting lines => nlines
- Andrei Alexandrescu (3/14) Oct 09 2008 No go if the file is a stream. Can it be done in one pass?
- BCS (8/27) Oct 09 2008 read and store the first k lines with there line number,
- Steven Schveighoffer (5/18) Oct 09 2008 Any bounds on the input k? If not, then no. In that case, you must do ...
- Andrei Alexandrescu (5/22) Oct 09 2008 k is a "small" number in the sense that you can keep something
- Walter Bright (4/5) Oct 09 2008 No, because without knowing the number of lines, you cannot produce a
- Andrei Alexandrescu (3/10) Oct 09 2008 I am getting revenge for "I never use anything in std.algorithm" :o).
- Walter Bright (2/12) Oct 09 2008 Get in line for everyone getting their revenge on me!
- BCS (2/17) Oct 09 2008 "Et tu, Brute!" - (Julius Caesar, Act III, Scene I).
- Ary Borenszweig (12/23) Oct 09 2008 Can't this be done in one single pass?
- Denis Koroskin (3/19) Oct 09 2008 What if there is only one line in a file? Andrei told that
- Andrei Alexandrescu (3/29) Oct 09 2008 Keeping k of the lines in memory is fine.
- Ary Borenszweig (2/28) Oct 09 2008
- KennyTM~ (4/33) Oct 09 2008 I guess p depends on the number of lines unfortunately.
- bearophile (19/26) Oct 09 2008 It's an interesting problem, I can offer two solutions, the first one is...
- bearophile (20/20) Oct 09 2008 I am not reading the anwers written by others, of course :-) With the he...
- Andrei Alexandrescu (7/33) Oct 09 2008 This is unfair. First line in the original file is included with
- bearophile (30/35) Oct 09 2008 If each of them have the same chance, then you can't guarantee the resul...
- bearophile (18/18) Oct 09 2008 Third solution, this requires a storage of k lines (but you can keep thi...
- Andrei Alexandrescu (4/23) Oct 09 2008 We have a winner!!! There is actually a very simple proof on how and why...
- Ary Borenszweig (10/37) Oct 09 2008 Say you want 2 lines from a file that has 3 lines. Say the lines are a,
- Andrei Alexandrescu (3/42) Oct 09 2008 Nothing except you stop the induction at step 3...
- Ary Borenszweig (9/53) Oct 09 2008 ... which is the last step in this case. There are only three lines.
- Andrei Alexandrescu (7/63) Oct 09 2008 Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 /
- Ary Borenszweig (7/75) Oct 10 2008 Ah, ok. It works if you do that. So bearophile's code must say
- Andrei Alexandrescu (3/80) Oct 10 2008 His prize gets retired. Now how does Kirk's code fare? :o)
- Ary Borenszweig (12/92) Oct 10 2008 I don't have Kirk's code here at work (I told Thunderbird not to
- bearophile (29/41) Oct 10 2008 I don't like contests...
- Andrei Alexandrescu (7/26) Oct 10 2008 No, that's a different problem. They want to extract n random samples
- Sergey Gromov (6/56) Oct 10 2008 I think your algorithm has one mistake. The last line of your text gets...
- Kirk McDonald (25/82) Oct 09 2008 To reiterate the code I was talking about (the solution for k == 1):
- Andrei Alexandrescu (13/104) Oct 09 2008 That's why I thought it's such a beautiful problem. My proof is
- Jason House (4/15) Oct 09 2008 Based on your later stated requirements, use a heap. Insert each line i...
- Kirk McDonald (32/43) Oct 09 2008 This is a classic interview question. The solution for k == 1 is easy:
- Andrei Alexandrescu (5/60) Oct 09 2008 This looks good too, but I'm not sure where i will start from in the
- Kirk McDonald (14/80) Oct 09 2008 No, i should start at k (which it does, that is what the start=k
- Andrei Alexandrescu (3/84) Oct 09 2008 Sounds good.
- bearophile (20/22) Oct 09 2008 My third converted to D, using my libs:
- Andrei Alexandrescu (23/44) Oct 09 2008 True men only write D. Your solution looks eerily similar to mine.
- Christopher Wright (9/10) Oct 10 2008 Out of curiosity, why are you using enforce rather than assert?
- Andrei Alexandrescu (5/18) Oct 10 2008 Enforce is never ignored, even in production builds. You wouldn't want a...
- Christopher Wright (8/30) Oct 11 2008 If you want it, the code's in a public subversion repository:
- Andrei Alexandrescu (5/61) Oct 09 2008 Oh forgot to mention this is superior numerically to bearophile's
- Kirk McDonald (5/70) Oct 09 2008 To be fair, I started writing it before bearophile posted his, and
- Andrei Alexandrescu (3/5) Oct 09 2008 I think you forgot to attach the D solution. ;o)
- BCS (23/34) Oct 09 2008 struct S
- Jason House (2/44) Oct 09 2008 You should check if your new random number is good enough to keep before...
- BCS (3/47) Oct 09 2008 I was going for simplicity and ease of reading within a given big-O clas...
- Russell Lewis (15/26) Oct 09 2008 BEGIN CODE
- Andrei Alexandrescu (3/21) Oct 09 2008 Then the last line will be part of the selection with probability 1.
- Russell Lewis (27/58) Oct 09 2008 Sorry, I overlooked the (obvious) fact that this biases things towards
- Andrei Alexandrescu (3/20) Oct 09 2008 Yah that works. Nice!
- Christopher Wright (15/26) Oct 09 2008 You can't do a uniform random distribution without knowing the length.
- Carlos (7/7) Oct 09 2008 : You can't do a uniform random distribution without knowing the length.
- Andrei Alexandrescu (4/13) Oct 09 2008 I think you can do even nonuniform distributions. The number of samples
- Andrei Alexandrescu (3/16) Oct 09 2008 I think "the number of total samples..." is the correct statement.
- Sergey Gromov (38/46) Oct 09 2008 Here's mine
- Andrei Alexandrescu (6/62) Oct 09 2008 Sweet! Suggestion: replace Sample's definition with:
- Jarrett Billingsley (5/7) Oct 09 2008 ..what? You want to remove most of Phobos' IO capabilities?
- Andrei Alexandrescu (5/15) Oct 09 2008 Well I'd like to extend std.stdio with capabilities that meet and exceed...
- Denis Koroskin (4/12) Oct 10 2008 Yeah, why need them when Tango is soon to be used with Phobos side-by-si...
- Jarrett Billingsley (4/19) Oct 10 2008 That's kind of what I'm thinking. It seems like a waste of time for
- Andrei Alexandrescu (7/28) Oct 10 2008 Well Walter wouldn't be working on it anyway. The thing is, if we want
- Denis Koroskin (4/34) Oct 10 2008 Great! Then, how about getting familiar with Tango IO and building your ...
- Andrei Alexandrescu (4/43) Oct 10 2008 That would make Phobos dependent on Tango, thus running into those
- Sergey Gromov (7/71) Oct 10 2008 Good advice, thanks. The Sample definition is unimportant for the
- bearophile (6/10) Oct 10 2008 BufferedFile seems the faster in Phobos, but it's 2-3 times slower than ...
- Michel Fortin (45/53) Oct 10 2008 My attempt. Works in one pass, and distributes uniformly but only when
I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? Andrei
Oct 09 2008
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleI just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? AndreiPretty simple, actually. Read the file one character at a time. Get the offset of the beginning of each line as a ulong or something. Store this information in another file essentially as an on-disk array, where, for example, the first ulong.sizeof bytes represent the bit pattern for the offset of the first line, the next ulong.sizeof bytes represent the bit pattern for the offset of the second line, etc. Use plenty of casting tricks. Shuffle the ulong.sizeof chunks of this file the same way you would shuffle an ordinary array, using seeking as the equivalent to array indexing. (I never said this was going to be efficient.) For each of the first k offsets after shuffling, print everything in the original file from that offset to the first EOL character encountered.
Oct 09 2008
dsimcha wrote:== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s articleSo this is essentially an index of the file's lines?I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? AndreiPretty simple, actually. Read the file one character at a time. Get the offset of the beginning of each line as a ulong or something. Store this information in another file essentially as an on-disk array, where, for example, the first ulong.sizeof bytes represent the bit pattern for the offset of the first line, the next ulong.sizeof bytes represent the bit pattern for the offset of the second line, etc. Use plenty of casting tricks.Shuffle the ulong.sizeof chunks of this file the same way you would shuffle an ordinary array, using seeking as the equivalent to array indexing.So this would shuffle that index.(I never said this was going to be efficient.) For each of the first k offsets after shuffling, print everything in the original file from that offset to the first EOL character encountered.Can it be done faster and maybe in one pass? Andrei
Oct 09 2008
Andrei Alexandrescu wrote:The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines
Oct 09 2008
Walter Bright wrote:Andrei Alexandrescu wrote:No go if the file is a stream. Can it be done in one pass? AndreiThe challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines
Oct 09 2008
Reply to Andrei,Walter Bright wrote:read and store the first k lines with there line number, if no lines remain, output stored lines read another line replace an existing line with a probability based on the current line and the replaced line's location if no lines remain ... Now all that remains is to figure out that functionAndrei Alexandrescu wrote:No go if the file is a stream. Can it be done in one pass? AndreiThe challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines
Oct 09 2008
"Andrei Alexandrescu" wroteWalter Bright wrote:Any bounds on the input k? If not, then no. In that case, you must do one pass to count the lines, otherwise, you would end up possibly throwing away lines that needed to be output to satisfy k. -SteveAndrei Alexandrescu wrote:No go if the file is a stream. Can it be done in one pass?The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines
Oct 09 2008
Steven Schveighoffer wrote:"Andrei Alexandrescu" wrotek is a "small" number in the sense that you can keep something proportional to k in working memory. You shouldn't keep something proportional to n in working memory. AndreiWalter Bright wrote:Any bounds on the input k? If not, then no. In that case, you must do one pass to count the lines, otherwise, you would end up possibly throwing away lines that needed to be output to satisfy k.Andrei Alexandrescu wrote:No go if the file is a stream. Can it be done in one pass?The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines
Oct 09 2008
Andrei Alexandrescu wrote:No go if the file is a stream. Can it be done in one pass?No, because without knowing the number of lines, you cannot produce a uniform distribution. Unless you're willing to make a guess based on the file size.
Oct 09 2008
Walter Bright wrote:Andrei Alexandrescu wrote:I am getting revenge for "I never use anything in std.algorithm" :o). AndreiNo go if the file is a stream. Can it be done in one pass?No, because without knowing the number of lines, you cannot produce a uniform distribution. Unless you're willing to make a guess based on the file size.
Oct 09 2008
Andrei Alexandrescu wrote:Walter Bright wrote:Get in line for everyone getting their revenge on me!Andrei Alexandrescu wrote:I am getting revenge for "I never use anything in std.algorithm" :o).No go if the file is a stream. Can it be done in one pass?No, because without knowing the number of lines, you cannot produce a uniform distribution. Unless you're willing to make a guess based on the file size.
Oct 09 2008
Reply to Andrei,Walter Bright wrote:"Et tu, Brute!" - (Julius Caesar, Act III, Scene I).Andrei Alexandrescu wrote:I am getting revenge for "I never use anything in std.algorithm" :o). AndreiNo go if the file is a stream. Can it be done in one pass?No, because without knowing the number of lines, you cannot produce a uniform distribution. Unless you're willing to make a guess based on the file size.
Oct 09 2008
Walter Bright wrote:Andrei Alexandrescu wrote:Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines") 2. probability of replacing a line: p = ?? 3. for each new line, with probability p, replace a stored line (I don't know which), and decrease probability (p -= ??) If you don't decrease the value of p in each iterations, the probability of keeping a line in the first group tends to zero. I don't know the values of ??... maybe someone else can find them? (I don't know how to randomly choose k elements of n, n >= k).The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines
Oct 09 2008
On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary esperanto.org.ar> wrote:Walter Bright wrote:What if there is only one line in a file? Andrei told thatAndrei Alexandrescu wrote:Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines")The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected linesThe file's size is unbounded so loading it in memory is not an option.
Oct 09 2008
Denis Koroskin wrote:On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary esperanto.org.ar> wrote:Keeping k of the lines in memory is fine. AndreiWalter Bright wrote:What if there is only one line in a file? Andrei told thatAndrei Alexandrescu wrote:Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines")The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected linesThe file's size is unbounded so loading it in memory is not an option.
Oct 09 2008
Denis Koroskin wrote:On Thu, 09 Oct 2008 23:16:38 +0400, Ary Borenszweig <ary esperanto.org.ar> wrote:If you can't read more lines in this step, the answer are all the lines.Walter Bright wrote:What if there is only one line in a file? Andrei told thatAndrei Alexandrescu wrote:Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines")The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected linesThe file's size is unbounded so loading it in memory is not an option.
Oct 09 2008
Ary Borenszweig wrote:Walter Bright wrote:I guess p depends on the number of lines unfortunately. Because p needs to be decreasing with line count, otherwise the first few line will almost always be replaced (not uniform then).Andrei Alexandrescu wrote:Can't this be done in one single pass? My thoughts were: 1. read the first k lines and assign them to the ones to return (the "stored lines") 2. probability of replacing a line: p = ?? 3. for each new line, with probability p, replace a stored line (I don't know which), and decrease probability (p -= ??) If you don't decrease the value of p in each iterations, the probability of keeping a line in the first group tends to zero. I don't know the values of ??... maybe someone else can find them? (I don't know how to randomly choose k elements of n, n >= k).The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?1. read the file, counting lines => nlines 2. select k random numbers 0..nlines 3. read the file line by line again, outputting the line if it's one of the selected lines
Oct 09 2008
Andrei Alexandrescu:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?It's an interesting problem, I can offer two solutions, the first one is the one I use in practice, that is use a scripting language (but the same exact program can be written in D, using my libs it becomes very similar) that loads the file two times, the first to count the lines, then creates an array of randomly chosen indexes (without replacement!) and uses them to print the chosen ones: from sys import argv from random import sample filename = argv[1] k = int(argv[2]) nlines = sum(1 for _ in file(filename)) if k >= nlines: for line in file(filename): print line else: samples = set(sample(xrange(nlines), k)) for i, line in enumerate(file(filename)): if i in samples: print line If your practical tests show you that reading your file twice is too much slow, then you may want to look for a more complex solution. I'll think about the more complex problem in the next post. Later, bearophile
Oct 09 2008
I am not reading the anwers written by others, of course :-) With the help of "Programming Pearls" here is my second version, that spares the memory required for the chosen ones, so this code runs with very little memory: from sys import argv from random import random filename = argv[1] k = int(argv[2]) nlines = sum(1 for _ in file(filename)) if k >= nlines: for line in file(filename): print line else: select = k remaining = nlines for line in file(filename): if random() < float(select) / remaining: print line select -= 1 remaining -= 1 I'll think for a solution that avoids reading the file twice then... Bye, bearophile
Oct 09 2008
bearophile wrote:I am not reading the anwers written by others, of course :-) With the help of "Programming Pearls" here is my second version, that spares the memory required for the chosen ones, so this code runs with very little memory: from sys import argv from random import random filename = argv[1] k = int(argv[2]) nlines = sum(1 for _ in file(filename)) if k >= nlines: for line in file(filename): print line else: select = k remaining = nlines for line in file(filename): if random() < float(select) / remaining: print line select -= 1 remaining -= 1 I'll think for a solution that avoids reading the file twice then... Bye, bearophileThis is unfair. First line in the original file is included with probability k / n. The probability of including the second line is dependent on the event of including the first line. If it wasn't included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines should be given equal chance. Andrei
Oct 09 2008
Andrei Alexandrescu:This is unfair. First line in the original file is included with probability k / n. The probability of including the second line is dependent on the event of including the first line. If it wasn't included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines should be given equal chance.If each of them have the same chance, then you can't guarantee the result has k items. So you must have a biased choice. You can run this code: from random import random from collections import defaultdict def main(freqs, nrepeats): for i in xrange(nrepeats): chosen = [] select = k remaining = n for value in values: if random() < (float(select) / remaining): chosen.append(value) select -= 1 remaining -= 1 for c in chosen: freqs[c] += 1 return freqs import psyco; psyco.full() k = 15 n = 100 values = range(n) freqs = main(freqs=defaultdict(int), nrepeats=10000) print [freqs.get(i, 0) for i in xrange(n)] Its output: [1549, 1519, 1494, 1467, 1485, 1482, 1481, 1526, 1473, 1523, 1466, 1451, 1475, 1482, 1504, 1551, 1504, 1514, 1510, 1541, 1525, 1550, 1531, 1488, 1529, 1467, 1460, 1506, 1544, 1535, 1455, 1468, 1507, 1505, 1490, 1519, 1489, 1507, 1439, 1561, 1499, 1497, 1534, 1512, 1529, 1442, 1437, 1545, 1547, 1621, 1477, 1525, 1492, 1497, 1492, 1533, 1504, 1482, 1489, 1463, 1479, 1488, 1498, 1502, 1503, 1496, 1468, 1458, 1538, 1536, 1479, 1496, 1477, 1514, 1464, 1526, 1451, 1507, 1527, 1415, 1520, 1476, 1526, 1508, 1479, 1504, 1493, 1466, 1490, 1497, 1472, 1548, 1512, 1535, 1503, 1533, 1467, 1471, 1463, 1526] It shows the values are balanced. If that little benchmark isn't enough, my code is based on algorithm S by Knuth, you can take a look on its Art of computer programming for a demonstration that it work correctly. Bye, bearophile
Oct 09 2008
Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_lines Now I'll look for possible bugs in this third version and to the problem Andrei has just told me :-) Bye, bearophile
Oct 09 2008
bearophile wrote:Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_linesWe have a winner!!! There is actually a very simple proof on how and why this works. Andrei
Oct 09 2008
Andrei Alexandrescu escribió:bearophile wrote:Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_linesWe have a winner!!! There is actually a very simple proof on how and why this works.Andrei
Oct 09 2008
Ary Borenszweig wrote:Andrei Alexandrescu escribió:Nothing except you stop the induction at step 3... Andreibearophile wrote:Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_linesWe have a winner!!! There is actually a very simple proof on how and why this works.
Oct 09 2008
Andrei Alexandrescu escribió:Ary Borenszweig wrote:... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?Andrei Alexandrescu escribió:Nothing except you stop the induction at step 3...bearophile wrote:Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_linesWe have a winner!!! There is actually a very simple proof on how and why this works.Andrei
Oct 09 2008
Ary Borenszweig wrote:Andrei Alexandrescu escribió:Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should. AndreiAry Borenszweig wrote:... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?Andrei Alexandrescu escribió:Nothing except you stop the induction at step 3...bearophile wrote:Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_linesWe have a winner!!! There is actually a very simple proof on how and why this works.
Oct 09 2008
Andrei Alexandrescu wrote:Ary Borenszweig wrote:Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)Andrei Alexandrescu escribió:Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.Ary Borenszweig wrote:... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?Andrei Alexandrescu escribió:Nothing except you stop the induction at step 3...bearophile wrote:Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_linesWe have a winner!!! There is actually a very simple proof on how and why this works.Andrei
Oct 10 2008
Ary Borenszweig wrote:Andrei Alexandrescu wrote:His prize gets retired. Now how does Kirk's code fare? :o) AndreiAry Borenszweig wrote:Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)Andrei Alexandrescu escribió:Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.Ary Borenszweig wrote:... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?Andrei Alexandrescu escribió:Nothing except you stop the induction at step 3...bearophile wrote:Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_linesWe have a winner!!! There is actually a very simple proof on how and why this works.
Oct 10 2008
Andrei Alexandrescu wrote:Ary Borenszweig wrote:I don't have Kirk's code here at work (I told Thunderbird not to download everything I had in my home)... but if I remember correctly, it had the same problem, it was something like: if (0 == uniform(0, k)): or something like that, and should be 0 != ... I wonder why both used the wrong probability... (that's why I made the math, because I wasn't sure with that probability it worked) I really liked the problem! It has a very simple and elegant solution... I thought the probability would be something more complex, but it turned out to be simple. :-)Andrei Alexandrescu wrote:His prize gets retired. Now how does Kirk's code fare? :o)Ary Borenszweig wrote:Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)Andrei Alexandrescu escribió:Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 3.0. This is because c has the "right" to sit equally in either of the k positions. If code doesn't do that, there's a bug in it. Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, as it should.Ary Borenszweig wrote:... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?Andrei Alexandrescu escribió:Nothing except you stop the induction at step 3...bearophile wrote:Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_linesWe have a winner!!! There is actually a very simple proof on how and why this works.
Oct 10 2008
Andrei Alexandrescu:I don't like contests... This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335 from random import random, randrange from collections import defaultdict def main(k, values, freqs, nrepeats): for i in xrange(nrepeats): chosen = [None] * k for i, line in enumerate(values): if i < k: chosen.append(line) else: if random() < (1.0 / (i+1)): chosen[randrange(k)] = line for c in chosen: freqs[c] += 1 return freqs import psyco; psyco.full() k = 15 n = 100 values = range(n) freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000) print [freqs.get(i, 0) for i in xrange(n)] My first version works quite probably, and probably the second one too. I think there are ways to solve this problem keeping only k lines, but the code isn't obvious. Bye, bearophileAh, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)His prize gets retired. Now how does Kirk's code fare? :o)
Oct 10 2008
bearophile wrote:Andrei Alexandrescu:No, that's a different problem. They want to extract n random samples from N records *without replacement*. It's a very different ballgame. See http://www.cs.duke.edu/~jsv/Papers/catalog/node139.html The problem as I stated is reasonably easy and my implementation should be correct. AndreiI don't like contests... This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)His prize gets retired. Now how does Kirk's code fare? :o)
Oct 10 2008
Fri, 10 Oct 2008 10:55:51 -0400, bearophile wrote:Andrei Alexandrescu:I think your algorithm has one mistake. The last line of your text gets into the k with probability 1/n while it should be k/n. Replace 1.0 with 1.0*k and the distribution will be absolutely (theoretically) uniform.I don't like contests... This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc. The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335 from random import random, randrange from collections import defaultdict def main(k, values, freqs, nrepeats): for i in xrange(nrepeats): chosen = [None] * k for i, line in enumerate(values): if i < k: chosen.append(line) else: if random() < (1.0 / (i+1)): chosen[randrange(k)] = line for c in chosen: freqs[c] += 1 return freqs import psyco; psyco.full() k = 15 n = 100 values = range(n) freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000) print [freqs.get(i, 0) for i in xrange(n)] My first version works quite probably, and probably the second one too. I think there are ways to solve this problem keeping only k lines, but the code isn't obvious.Ah, ok. It works if you do that. So bearophile's code must say if random() > (1.0 / (i+1)): instead of if random() < (1.0 / (i+1)): (that is, if random() returns a real number between 0 and 1 with uniform distribution)His prize gets retired. Now how does Kirk's code fare? :o)
Oct 10 2008
Ary Borenszweig wrote:Andrei Alexandrescu escribió:To reiterate the code I was talking about (the solution for k == 1): from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line Assume there are three lines in the file. The algorithm churns through to the last line. We have the following state: i == 2 chosen_line = <one of the first two lines> Then we call randint(0, 2). This gives us a 1 out of 3 chance of chosen_line being the last line (if the random number is 0). Therefore, there is also a 2 out of 3 chance of it being some *other* line (if the random number is 1 or 2). It turns out there is an equal chance for it being either of the two lines. The algorithm starts on the first line, and does randint(0, 0), meaning chosen_line is always set to the first line initially. On the second line, it does randint(0, 1), which means it has a 50% chance of changing chosen_line to the second line. The math continues to work out in this way as the number of lines increases without bound. -- Kirk McDonaldAry Borenszweig wrote:... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?Andrei Alexandrescu escribió:Nothing except you stop the induction at step 3...bearophile wrote:Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_linesWe have a winner!!! There is actually a very simple proof on how and why this works.
Oct 09 2008
Kirk McDonald wrote:Ary Borenszweig wrote:That's why I thought it's such a beautiful problem. My proof is inductive over n. The invariant is that the k lines held represent a uniform selection of the n lines seen so far at any moment. Base: For n <= k, vacuously satisfied. Inductive step: We have a random selection of n lines so far in array selection, and we add one more line. We need to prove that the new selection is also uniform. The new line has probability k / (n + 1) of being part of the selection. We toss a skewed coin with that probability. If it turns heads, we pick the new element for keeping. Any element in the array should go with equal probability (by the inductive assumption), so we roll a k-sided dice to choose one element to throw away. AndreiAndrei Alexandrescu escribió:To reiterate the code I was talking about (the solution for k == 1): from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line Assume there are three lines in the file. The algorithm churns through to the last line. We have the following state: i == 2 chosen_line = <one of the first two lines> Then we call randint(0, 2). This gives us a 1 out of 3 chance of chosen_line being the last line (if the random number is 0). Therefore, there is also a 2 out of 3 chance of it being some *other* line (if the random number is 1 or 2). It turns out there is an equal chance for it being either of the two lines. The algorithm starts on the first line, and does randint(0, 0), meaning chosen_line is always set to the first line initially. On the second line, it does randint(0, 1), which means it has a 50% chance of changing chosen_line to the second line. The math continues to work out in this way as the number of lines increases without bound.Ary Borenszweig wrote:... which is the last step in this case. There are only three lines. p(a) = 5/6 p(b) = 5/6 p(c) = 1/3 That doesn't seem uniform. In another post, Kirk says: "Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?Andrei Alexandrescu escribió:Nothing except you stop the induction at step 3...bearophile wrote:Say you want 2 lines from a file that has 3 lines. Say the lines are a, b and c. What's the probability that c belongs to the result? It's "1.0 / (i+1)", where i = 2, so 1/3. What's the probability that a does not belong to the result? Well, c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. So it's 1/3 * 1/2 = 1/6. What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6. What am I doing wrong? :-(Third solution, this requires a storage of k lines (but you can keep this storage on disk): from sys import argv from random import random, randrange filename = argv[1] k = int(argv[2]) assert k > 0 chosen_lines = [] for i, line in enumerate(file(filename)): if i < k: chosen_lines.append(line) else: if random() < (1.0 / (i+1)): chosen_lines[randrange(k)] = line print chosen_linesWe have a winner!!! There is actually a very simple proof on how and why this works.
Oct 09 2008
Andrei Alexandrescu Wrote:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? AndreiBased on your later stated requirements, use a heap. Insert each line into the heap using a random number as the key. After pre-loading the first k elements, each insert into the heap is followed by the removal of the head element. Performance is O(characters in file + log2(k)*lines in file) Obviously, the random number generator needs a range of values that exceeds the lines in the file by a good margin, or else you'll get biases in what gets kept.
Oct 09 2008
Andrei Alexandrescu wrote:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? AndreiThis is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to. -- Kirk McDonald
Oct 09 2008
Kirk McDonald wrote:Andrei Alexandrescu wrote:This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1. But where's the D code you guys? D is better than Python at scripting. Ahem. AndreiI just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? AndreiThis is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
Oct 09 2008
Andrei Alexandrescu wrote:Kirk McDonald wrote:No, i should start at k (which it does, that is what the start=k parameter to enumerate() is for). This is because the range passed to randint() starts at 0. The islice function (like slices in D and indeed regular slices in Python) uses an [inclusive, exclusive) range. Thus, itertools.islice(f, k) will grab the first k lines from the file. (The opening index of 0 is implied.) islice is a function for "slicing" arbitrary iterable objects, particularly ones that can't normally be sliced. By slicing the first k lines from the file in this way, we also consume the first k lines from the file object, and so the for loop starts on line k, which is why I pass start=k to enumerate(). (Remember, I am counting lines from 0.) -- Kirk McDonaldAndrei Alexandrescu wrote:This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1. But where's the D code you guys? D is better than Python at scripting. Ahem. AndreiI just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? AndreiThis is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
Oct 09 2008
Kirk McDonald wrote:Andrei Alexandrescu wrote:Sounds good. AndreiKirk McDonald wrote:No, i should start at k (which it does, that is what the start=k parameter to enumerate() is for). This is because the range passed to randint() starts at 0. The islice function (like slices in D and indeed regular slices in Python) uses an [inclusive, exclusive) range. Thus, itertools.islice(f, k) will grab the first k lines from the file. (The opening index of 0 is implied.) islice is a function for "slicing" arbitrary iterable objects, particularly ones that can't normally be sliced. By slicing the first k lines from the file in this way, we also consume the first k lines from the file object, and so the for loop starts on line k, which is why I pass start=k to enumerate(). (Remember, I am counting lines from 0.)Andrei Alexandrescu wrote:This looks good too, but I'm not sure where i will start from in the loop. It should start at k + 1. But where's the D code you guys? D is better than Python at scripting. Ahem. AndreiI just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? AndreiThis is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
Oct 09 2008
Andrei Alexandrescu:But where's the D code you guys?My third converted to D, using my libs: import std.conv, d.all; void main(string[] args) { assert(args.length == 3); string filename = args[1]; int k = toInt(args[2]); assert (k > 0); string[] chosen_lines; foreach (i, line; xfile(filename)) if (i < k) chosen_lines ~= line; else if (fastRandom() < (1.0 / (i+1))) chosen_lines[randInt(k-1)] = line; putr(chosen_lines); }D is better than Python at scripting. Ahem.I know many languages, and so far I have never found something better than Python to create working prototypes. Bye, bearophile
Oct 09 2008
bearophile wrote:Andrei Alexandrescu:True men only write D. Your solution looks eerily similar to mine. import std.stdio, std.contracts, std.conv, std.random; void main(string[] args) { invariant k = parse!(uint)(args[1]); enforce(k > 0, "Must pass a strictly positive selection size"); string[] selection; auto gen = Random(unpredictableSeed); foreach (ulong tally, char[] line; lines(stdin)) { if (selection.length < k) { // Selection not full; add to it selection ~= line.idup; } else { auto t = uniform(gen, 0, tally + 1); if (t > k) continue; // no luck // Replace a random element in the selection selection[uniform(gen, 0, k - 1)] = line.idup; } } // Print selection foreach (s; selection) write(s); }But where's the D code you guys?My third converted to D, using my libs: import std.conv, d.all; void main(string[] args) { assert(args.length == 3); string filename = args[1]; int k = toInt(args[2]); assert (k > 0); string[] chosen_lines; foreach (i, line; xfile(filename)) if (i < k) chosen_lines ~= line; else if (fastRandom() < (1.0 / (i+1))) chosen_lines[randInt(k-1)] = line; putr(chosen_lines); }D is better than Python at scripting. Ahem.I know many languages, and so far I have never found something better than Python to create working prototypes.
Oct 09 2008
Andrei Alexandrescu wrote:enforce(k > 0, "Must pass a strictly positive selection size");Out of curiosity, why are you using enforce rather than assert? I've written an assertions framework, but it provides some more options than assert: expect.because ("Must pass a strictly positive selection size").that (k).greaterThan (0); If that failed, it would produce: AssertionException: Expected: greater than 0 but was: -4 -- Must pass a strictly positive selection size
Oct 10 2008
Christopher Wright wrote:Andrei Alexandrescu wrote:Enforce is never ignored, even in production builds. You wouldn't want a release build of your app to skip checking command line arguments.enforce(k > 0, "Must pass a strictly positive selection size");Out of curiosity, why are you using enforce rather than assert?I've written an assertions framework, but it provides some more options than assert: expect.because ("Must pass a strictly positive selection size").that (k).greaterThan (0); If that failed, it would produce: AssertionException: Expected: greater than 0 but was: -4 -- Must pass a strictly positive selection sizeThat looks pretty darn nice. Andrei
Oct 10 2008
Andrei Alexandrescu wrote:Christopher Wright wrote:True -- I forgot about that.Andrei Alexandrescu wrote:Enforce is never ignored, even in production builds. You wouldn't want a release build of your app to skip checking command line arguments.enforce(k > 0, "Must pass a strictly positive selection size");Out of curiosity, why are you using enforce rather than assert?If you want it, the code's in a public subversion repository: http://svn.dsource.org/projects/dmocks/dunit/trunk/ modules dunit.expect and dunit.constraint Currently BSD licensed, but if anyone wants it in a different license, that's up for negotiation. Though I think it would take some effort to make it work with Phobos.I've written an assertions framework, but it provides some more options than assert: expect.because ("Must pass a strictly positive selection size").that (k).greaterThan (0); If that failed, it would produce: AssertionException: Expected: greater than 0 but was: -4 -- Must pass a strictly positive selection sizeThat looks pretty darn nice. Andrei
Oct 11 2008
Kirk McDonald wrote:Andrei Alexandrescu wrote:Oh forgot to mention this is superior numerically to bearophile's because it's not affected by floating point quantization vagaries. But it came later too... AndreiI just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? AndreiThis is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
Oct 09 2008
Andrei Alexandrescu wrote:Kirk McDonald wrote:To be fair, I started writing it before bearophile posted his, and indeed did not see his solution until after I had posted it. :-) -- Kirk McDonaldAndrei Alexandrescu wrote:Oh forgot to mention this is superior numerically to bearophile's because it's not affected by floating point quantization vagaries. But it came later too... AndreiI just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? AndreiThis is a classic interview question. The solution for k == 1 is easy: from random import randint chosen_line = None for i, line in enumerate(open('filename')): if randint(0, i) == 0: chosen_line = line print chosen_line (It is worth noting that randint() operates over an inclusive range.) If you do the math, this works out to a uniform distribution. For instance, say the file has three lines. Once we read in the third line, there is a 1 out of 3 chance that it will be picked as the chosen_line. Of the remaining 2 out of 3 chances, there is a 50% chance the second line will be chosen, and a 50% chance of the first line. Doing this for k > 1 becomes more complicated. We start by reading in the first k lines of the file. (And if we run out of lines before that, we're done.) import itertools from random import randint f = open('filename') k_lines = list(itertools.islice(f, k)) for i, line in enumerate(f, start=k): if randint(0, i) == 0: k_lines[randint(0, k-1)] = line for line in k_lines: print line This is my first crack at a solution. I am not sure how close to a uniform distribution this works out to.
Oct 09 2008
Kirk McDonald wrote:To be fair, I started writing it before bearophile posted his, and indeed did not see his solution until after I had posted it. :-)I think you forgot to attach the D solution. ;o) Andrei
Oct 09 2008
Reply to Andrei,I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? Andreistruct S { char[] line; real v; int opCmp(S s) { if(s.v < this.v) return -1; else if(s.v > this.v) return +1; return 0; } } int k; S[] buf = new S[k+1]; foreach(s;buf) s.v = real.max; foreach(char[] line; stream) { delete buf[$-1].line; buf[$-1].line = line.dup;; buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max buf.sort; } foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);
Oct 09 2008
BCS Wrote:Reply to Andrei,You should check if your new random number is good enough to keep before doing the array manipulation and string dup... That small change should yield an near-optimal big O. A heap should give slightly faster performance for moderate values of k.I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? Andreistruct S { char[] line; real v; int opCmp(S s) { if(s.v < this.v) return -1; else if(s.v > this.v) return +1; return 0; } } int k; S[] buf = new S[k+1]; foreach(s;buf) s.v = real.max; foreach(char[] line; stream) { delete buf[$-1].line; buf[$-1].line = line.dup;; buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max buf.sort; } foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);
Oct 09 2008
Reply to Jason,BCS Wrote:I was going for simplicity and ease of reading within a given big-O class. Switching to insertion sort or reusing data buffers would also do better.Reply to Andrei,You should check if your new random number is good enough to keep before doing the array manipulation and string dup... That small change should yield an near-optimal big O. A heap should give slightly faster performance for moderate values of k.I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? Andreistruct S { char[] line; real v; int opCmp(S s) { if(s.v < this.v) return -1; else if(s.v > this.v) return +1; return 0; } } int k; S[] buf = new S[k+1]; foreach(s;buf) s.v = real.max; foreach(char[] line; stream) { delete buf[$-1].line; buf[$-1].line = line.dup;; buf[$-1].v = reaqlRand(); // returns v > real.min and v < real.max buf.sort; } foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);
Oct 09 2008
BEGIN CODE char[][] choose(int k) { char[][] bufs; bufs.length = k; for(int i=0; i<k; i++) bufs[i] = ReadALine(); while(!feof(stdin)) { int i = random(0,k); bufs[i] = ReadALine(); } return bufs; } END CODE Andrei Alexandrescu wrote:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? Andrei
Oct 09 2008
Russell Lewis wrote:BEGIN CODE char[][] choose(int k) { char[][] bufs; bufs.length = k; for(int i=0; i<k; i++) bufs[i] = ReadALine(); while(!feof(stdin)) { int i = random(0,k); bufs[i] = ReadALine(); } return bufs; } END CODEThen the last line will be part of the selection with probability 1. Andrei
Oct 09 2008
Sorry, I overlooked the (obvious) fact that this biases things towards the end of the file. Change the while loop as follows: BEGIN CODE int count = k; while(!feof(stdin)) { if(random(0,count) >= k) ReadALine(); // this discards the line else { int i = random(0,k); bufs[i] = ReadALine(); } count++; } END CODE Note that I'm assuming that random(a,b) is a function that randomly returns an integer in the range [a,b). This new version works because the most-recently-read line has a k/count chance of getting picked up into the current working set, and when we do choose to discard some line, we give all of the old lines an equal chance of survival. I don't have to to formally work out the math right now, but I'm pretty sure that that would give each line in the entire file a k/count chance of being in the set at any given point in the process. Russ Russell Lewis wrote:BEGIN CODE char[][] choose(int k) { char[][] bufs; bufs.length = k; for(int i=0; i<k; i++) bufs[i] = ReadALine(); while(!feof(stdin)) { int i = random(0,k); bufs[i] = ReadALine(); } return bufs; } END CODE Andrei Alexandrescu wrote:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? Andrei
Oct 09 2008
Russell Lewis wrote:Sorry, I overlooked the (obvious) fact that this biases things towards the end of the file. Change the while loop as follows: BEGIN CODE int count = k; while(!feof(stdin)) { if(random(0,count) >= k) ReadALine(); // this discards the line else { int i = random(0,k); bufs[i] = ReadALine(); } count++; } END CODEYah that works. Nice! Andrei
Oct 09 2008
Andrei Alexandrescu wrote:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it? AndreiYou can't do a uniform random distribution without knowing the length. Your options are: - Hold the entire file in memory. - Hold the entire file so far in memory. Once you get over a certain threshold, remove lines randomly. You have to tweak the random distribution for the axing method to make sure it's uniform. If this is going to be used often on huge data sets, that is a reasonable effort. How would you need to tweak the random distribution? The easiest way is to get two random numbers: what section to remove lines from and which line in that section to remove stuff from. You can work out some annoying recurrence to find out how to get the probability of axing from one section given that it has been available for k rounds. I might work out the recurrence tonight, but I don't have a chalkboard.
Oct 09 2008
: You can't do a uniform random distribution without knowing the length. Probably true for other distribution. Most certainly not true for uniform distribution (with a raisonable k) You can work on a subset of the file. Let say 1000 records. The distribution being uniform, you can select (or eliminate), a percentage of each subset and the results for the whole file will be ok.
Oct 09 2008
Carlos wrote:: You can't do a uniform random distribution without knowing the length. Probably true for other distribution. Most certainly not true for uniform distribution (with a raisonable k) You can work on a subset of the file. Let say 1000 records. The distribution being uniform, you can select (or eliminate), a percentage of each subset and the results for the whole file will be ok.I think you can do even nonuniform distributions. The number of samples seen should not influence your subsampling decision. Andrei
Oct 09 2008
Andrei Alexandrescu wrote:Carlos wrote:I think "the number of total samples..." is the correct statement. Andrei: You can't do a uniform random distribution without knowing the length. Probably true for other distribution. Most certainly not true for uniform distribution (with a raisonable k) You can work on a subset of the file. Let say 1000 records. The distribution being uniform, you can select (or eliminate), a percentage of each subset and the results for the whole file will be ok.I think you can do even nonuniform distributions. The number of samples seen should not influence your subsampling decision.
Oct 09 2008
Thu, 09 Oct 2008 13:56:48 -0500, Andrei Alexandrescu wrote:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?Here's mine module k_sample; import std.conv: to; import std.stream: BufferedFile; import std.random: Random, uniform, unpredictableSeed; import std.stdio: writefln; void main(string[] args) { auto f = new BufferedFile(args[1]); auto k = to!(int)(args[2]); assert(k > 0); struct Sample { string text; ulong line; this(string s, ulong n) { text = s; line = n; } } Sample[] samples; auto gen = Random(unpredictableSeed); foreach (ulong n, char[] s; f) { if (samples.length < k) samples ~= Sample(s.idup, n+1); else if (uniform(gen, 0., 1.) < cast(double) k / (n+1)) samples[uniform(gen, 0, k)] = Sample(s.idup, n+1); } foreach (sample; samples) writefln("%8s: %s", sample.line, sample.text); } Usage: k-sample <file> <k> It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.
Oct 09 2008
Sergey Gromov wrote:Thu, 09 Oct 2008 13:56:48 -0500, Andrei Alexandrescu wrote:Sweet! Suggestion: replace Sample's definition with: alias Tuple!(string, "text", ulong, "line") Sample; On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts? AndreiI just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?Here's mine module k_sample; import std.conv: to; import std.stream: BufferedFile; import std.random: Random, uniform, unpredictableSeed; import std.stdio: writefln; void main(string[] args) { auto f = new BufferedFile(args[1]); auto k = to!(int)(args[2]); assert(k > 0); struct Sample { string text; ulong line; this(string s, ulong n) { text = s; line = n; } } Sample[] samples; auto gen = Random(unpredictableSeed); foreach (ulong n, char[] s; f) { if (samples.length < k) samples ~= Sample(s.idup, n+1); else if (uniform(gen, 0., 1.) < cast(double) k / (n+1)) samples[uniform(gen, 0, k)] = Sample(s.idup, n+1); } foreach (sample; samples) writefln("%8s: %s", sample.line, sample.text); } Usage: k-sample <file> <k> It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.
Oct 09 2008
On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.
Oct 09 2008
Jarrett Billingsley wrote:On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Well I'd like to extend std.stdio with capabilities that meet and exceed those in std.stream. The design of std.stream is ok, but makes it hard to compose streams, and requires a lot of hand coding to define new streams. AndreiOn a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.
Oct 09 2008
On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley gmail.com> wrote:On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;)On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.
Oct 10 2008
On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:On Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley gmail.com> wrote:That's kind of what I'm thinking. It seems like a waste of time for _another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;)On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.
Oct 10 2008
Jarrett Billingsley wrote:On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:Well Walter wouldn't be working on it anyway. The thing is, if we want e.g. std.algorithm to work on I/O, there must be integration of I/O with the range concept. As far as I can see things, ranges will be quite a powerful currency allowing many std components to trade data with one another. AndreiOn Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley gmail.com> wrote:That's kind of what I'm thinking. It seems like a waste of time for _another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;)On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.
Oct 10 2008
On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Jarrett Billingsley wrote:Great! Then, how about getting familiar with Tango IO and building your concept on top of it? ;)On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:Well Walter wouldn't be working on it anyway. The thing is, if we want e.g. std.algorithm to work on I/O, there must be integration of I/O with the range concept. As far as I can see things, ranges will be quite a powerful currency allowing many std components to trade data with one another. AndreiOn Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley gmail.com> wrote:That's kind of what I'm thinking. It seems like a waste of time for _another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;)On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.
Oct 10 2008
Denis Koroskin wrote:On Fri, 10 Oct 2008 16:53:29 +0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:That would make Phobos dependent on Tango, thus running into those licensing issues I have little understanding of. AndreiJarrett Billingsley wrote:Great! Then, how about getting familiar with Tango IO and building your concept on top of it? ;)On Fri, Oct 10, 2008 at 5:55 AM, Denis Koroskin <2korden gmail.com> wrote:Well Walter wouldn't be working on it anyway. The thing is, if we want e.g. std.algorithm to work on I/O, there must be integration of I/O with the range concept. As far as I can see things, ranges will be quite a powerful currency allowing many std components to trade data with one another. AndreiOn Fri, 10 Oct 2008 08:27:37 +0400, Jarrett Billingsley <jarrett.billingsley gmail.com> wrote:That's kind of what I'm thinking. It seems like a waste of time for _another_ IO framework to be developed when we have two already. Priorities, priorities. How about fixing frontend bugs instead?On Thu, Oct 9, 2008 at 11:28 PM, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Yeah, why need them when Tango is soon to be used with Phobos side-by-side ;)On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?..what? You want to remove most of Phobos' IO capabilities? I'm guessing there's something else you'd like to replace it with, but I wonder what it could be.
Oct 10 2008
Thu, 09 Oct 2008 22:28:51 -0500, Andrei Alexandrescu wrote:Sergey Gromov wrote:Good advice, thanks. The Sample definition is unimportant for the solution but eats up a lot of visual space.Thu, 09 Oct 2008 13:56:48 -0500, Andrei Alexandrescu wrote:Sweet! Suggestion: replace Sample's definition with: alias Tuple!(string, "text", ulong, "line") Sample;I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?Here's mine module k_sample; import std.conv: to; import std.stream: BufferedFile; import std.random: Random, uniform, unpredictableSeed; import std.stdio: writefln; void main(string[] args) { auto f = new BufferedFile(args[1]); auto k = to!(int)(args[2]); assert(k > 0); struct Sample { string text; ulong line; this(string s, ulong n) { text = s; line = n; } } Sample[] samples; auto gen = Random(unpredictableSeed); foreach (ulong n, char[] s; f) { if (samples.length < k) samples ~= Sample(s.idup, n+1); else if (uniform(gen, 0., 1.) < cast(double) k / (n+1)) samples[uniform(gen, 0, k)] = Sample(s.idup, n+1); } foreach (sample; samples) writefln("%8s: %s", sample.line, sample.text); } Usage: k-sample <file> <k> It's easy to prove that a probability of any particular line appearing in the samples is k/n where n is the total number of lines in the text.On a different topic, your use of std.stream reminded me. I was thinking of yanking it from Phobos. Thoughts?While experimenting with Bearophile's loaderhttp://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402I've found out that BufferedFile is the fastest available way to process a file line-by-line. So the alternative should be as fast.
Oct 10 2008
Sergey Gromov:While experimenting with Bearophile's loaderBufferedFile seems the faster in Phobos, but it's 2-3 times slower than the line-by-line reading you have in Python, and I think it has 4 bugs, so I don't use it much, that's why I use the xfile() of my libs, that is a little slower, but without those bugs. Few bugs of BufferedFile can be seen if you try to load a very large file made of a single line (crashes D every time), or you try reading lines with \0 at the end of the line, etc. They are corner cases, usually. I suggest reading the C sources of the I/O of Python (or Perl, if you are able to read the sources. I think the differences in clarity between Python and Perl can be seen even in their C sources). Bye, bearophilehttp://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402I've found out that BufferedFile is the fastest available way to process a file line-by-line. So the alternative should be as fast.
Oct 10 2008
On 2008-10-09 14:56:48 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:I just ran across a nice little problem that I wanted to share as an exercise for the interested in futile pastimes. The challenge, should you accept it, is to write a program that given a number k and a file, outputs k lines picked uniformly at random from the file. (If the file has less than k lines, it should all be output.) The file's size is unbounded so loading it in memory is not an option. How'd you go about it?My attempt. Works in one pass, and distributes uniformly but only when k is a factor of the number of lines in the file. At least, I believe the distribution is uniform, but feel free to prove me wrong. /* Uniform random value r, with r >= min && r < max. */ real random(uint min, uint max) { ... } /* Shuffle elements uniformly in the given string array. */ void shuffle(ref string[] array) { ... } string[] challenge(alias R)(R input, uint k) { string[] result; uint line; real pickFactor = 1; // Initial stage: fill initial k strings while (!input.eof && line < k) { result ~= input.readLine; ++line; } shuffle(result); while (!input.eof) { // Next stage: half the chances of picking a candidate. pickFactor /= 2; string[] candidates; line = 0; while (!input.eof && line < k) { candidates ~= input.readLine; ++line; } shuffle(candidates); for (uint i; i < candidates.length; ++i) if (random(0, 1) < pickFactor) result[i] = candidates[i]; // Note: if candidate.length != result.length // we don't have uniformity. } return result; } -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Oct 10 2008