digitalmars.D - shuffling lines in a stream
- Andrei Alexandrescu (15/15) Oct 09 2008 The high traffic enjoyed by "random k-sample of a file" confirms that
- BCS (2/22) Oct 09 2008 Um... I'm not following that.
- Andrei Alexandrescu (21/45) Oct 09 2008 Very simple: given a file, just randomly change the order of its lines
- BCS (2/7) Oct 09 2008 I got the bit about the end effect. The above is what I didn't get.
- Andrei Alexandrescu (3/12) Oct 09 2008 Don't worry about that for now.
- BCS (10/25) Oct 09 2008 cache the whole file into a tmp file
- Andrei Alexandrescu (4/35) Oct 09 2008 I agree with the last paragraph, but lseeking seems overly inefficient.
- BCS (3/14) Oct 10 2008 algorithmically, I don't think the lseek will matter, as to I/O and cach...
- Andrei Alexandrescu (8/24) Oct 10 2008 I think it does. Essentially you impose random access on the input, or
- BCS (3/33) Oct 10 2008 Is that not an I/O effect?
- Andrei Alexandrescu (8/35) Oct 10 2008 Well I agree things can be coded such that you make the copy only if the...
- BCS (4/13) Oct 10 2008 I think Big-O on mine is n.
- Andrei Alexandrescu (6/20) Oct 10 2008 I disagree. Big-O with lseek under the rug? That's a non-solution
- BCS (2/25) Oct 10 2008 In that case this isn't a problem /I/ am intersted in trying to solve.
- Andrei Alexandrescu (6/26) Oct 09 2008 Forgot to mention a desirable property. It would be nice if the
- Robert Fraser (3/23) Oct 09 2008 I know it's been suggested before, but we really need
- Bruno Medeiros (5/8) Oct 16 2008 Word.
- BLS (7/27) Oct 09 2008 Just in general
- Andrei Alexandrescu (9/37) Oct 10 2008 I think I overspecified it. Pure and simple you must randomize lines in
- Sergey Gromov (32/46) Oct 10 2008 Well, my temporary file contains the whole stream. Because there is a
- Andrei Alexandrescu (4/23) Oct 10 2008 Yes, that probability exists, but that doesn't imply you must load the
- Sergey Gromov (4/27) Oct 10 2008 This implies though that I cannot start output before I read all the
- Andrei Alexandrescu (4/30) Oct 10 2008 Creating temporary files is fine as long as you never need to store the
- KennyTM~ (3/35) Oct 10 2008 So HDD (or any file system) is not counted as memory? Then just store
- Andrei Alexandrescu (5/41) Oct 10 2008 This takes us back to one lseek per line. You can create temporary
- Sergey Gromov (18/33) Oct 10 2008 Ok, let me re-formulate the task then, since the creation of a temporary...
- Andrei Alexandrescu (3/36) Oct 10 2008 Feeling doesn't qualify.
- Christopher Wright (4/37) Oct 11 2008 You have to continually alter k to maintain a random distribution, but
- Sean Kelly (10/10) Oct 10 2008 It would be slower than the seeking option, but something like a
- Andrei Alexandrescu (4/15) Oct 10 2008 Sweet! This is even better than my solution. (And it's much faster than
- Christopher Wright (5/22) Oct 11 2008 You can reduce the number of passes by using more temporary files per
- Andrei Alexandrescu (7/18) Oct 10 2008 I think I found a problem with your solution. Consider you break the
- Sergey Gromov (7/25) Oct 10 2008 After merging b and c you end up with a and bc. Then you randomly merge...
- Andrei Alexandrescu (3/29) Oct 10 2008 How do you "randomly merge"? Describe the process.
- Sergey Gromov (17/46) Oct 10 2008 a[] and b[] are files to merge. ab[] is the result. a.length and
- Andrei Alexandrescu (3/49) Oct 10 2008 This should work. Use of ranges noted :o).
- Sergey Gromov (7/57) Oct 10 2008 I just rigorously proved that probability of line from a getting into
- Sergey Gromov (8/23) Oct 10 2008 Actually I felt an urge to write:
- Sean Kelly (24/54) Oct 10 2008 while( a.hasMore || b.hasMore )
- Andrei Alexandrescu (15/71) Oct 10 2008 Yah, Sergey got me thinking too. What if you have a large file that
- Sergey Gromov (36/112) Oct 10 2008 Here's my little proof. Paste this into wikipedia to see formulas.
- Sean Kelly (4/22) Oct 10 2008 When a is merged with (b&c) then its lines will be interleaved randomly
- Andrei Alexandrescu (9/31) Oct 10 2008 I think that works with Sergey's appropriate crooking of the dice. For
- Russell Lewis (47/67) Oct 13 2008 I've read most of the responses (not all) and haven't yet seen a
The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous. In case you hate random stuff, just wait; the next problem won't have any random element in it :o). Andrei
Oct 09 2008
Reply to Andrei,The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous. In case you hate random stuff, just wait; the next problem won't have any random element in it :o). AndreiUm... I'm not following that.
Oct 09 2008
BCS wrote:Reply to Andrei,

Very simple: given a file, just randomly change the order of its lines and output them. For example (assume read from standard input):

$ ./shuffle
a
b
c
^D
b
a
c
$ ./shuffle
a
b
c
^D
b
c
a
$ _

Andrei

The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous. In case you hate random stuff, just wait; the next problem won't have any random element in it :o). Andrei

Um... I'm not following that.
Oct 09 2008
Reply to Andrei,You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.I got the bit about the end effect. The above is what I didn't get.
Oct 09 2008
BCS wrote:Reply to Andrei,Don't worry about that for now. AndreiYou don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.I got the bit about the end effect. The above is what I didn't get.
Oct 09 2008
Reply to Andrei,BCS wrote:cache the whole file into a tmp file store slice data on the way through (start/length of each line) shuffle that set lseek/read/write each slice in order Quick'n'dirty The only fun part is how to shuffle the deck. What I think might be easiest to do would be to qsort it with rand as the compare function. I don't think there is any way to avoid storing the whole file because for a uniform sort there is a possibility that the last line will come out first.Reply to Andrei,Don't worry about that for now. AndreiYou don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.I got the bit about the end effect. The above is what I didn't get.
Oct 09 2008
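[Editor's sketch] To make the slice-index approach above concrete, here is a rough sketch in present-day D (my code, not BCS's; the temp-file name shuffle.tmp is an assumption, and std.random.randomShuffle stands in for the rand-keyed qsort he mentions): spool stdin to a temporary file, record where each line starts, shuffle that index in memory, then seek to each recorded offset and copy the line to the output. The one-seek-per-line cost is exactly what the rest of the thread argues about.

import std.stdio  : stdin, stdout, File, KeepTerminator;
import std.random : randomShuffle;

void main()
{
    // Pass 1: copy every input line to a temp file, remembering its offset.
    auto tmp = File("shuffle.tmp", "w+b");
    ulong[] index;
    foreach (line; stdin.byLine(KeepTerminator.yes))
    {
        index ~= tmp.tell;          // start of this line in the temp file
        tmp.rawWrite(line);
    }

    // Pass 2: shuffle the in-memory index, then emit lines in that order.
    randomShuffle(index);
    foreach (off; index)
    {
        tmp.seek(cast(long) off);   // one seek per line -- the contested cost
        stdout.write(tmp.readln()); // the line still carries its terminator
    }
}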
BCS wrote:Reply to Andrei,I agree with the last paragraph, but lseeking seems overly inefficient. Could you avoid that? AndreiBCS wrote:cache the whole file into a tmp file store slice data on the way through (start/length of each line) shuffle that set lseek/read/write each slice in order Quick'n'dirty The only fun part is how to shuffle the deck. What I think might be easiest to do would be to qsort it with rand as the compare function. I don't think there is any way to avoid storing the whole file because for a uniform sort there is a possibility that the last line will come out first.Reply to Andrei,Don't worry about that for now. AndreiYou don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory.I got the bit about the end effect. The above is what I didn't get.
Oct 09 2008
Reply to Andrei,BCS wrote:algorithmically, I don't think the lseek will matter, as to I/O and cache effects, I'll leave that to someone else.I don't think there is any way to avoid storing the whole file because for a uniform sort there is a possibility that the last line will come out first.I agree with the last paragraph, but lseeking seems overly inefficient. Could you avoid that? Andrei
Oct 10 2008
BCS wrote:Reply to Andrei,I think it does. Essentially you impose random access on the input, or copy to a medium that offers it. gunzip --stdout bigfile.gz | shuffle You'll have to compulsively store a copy of the input. Besides, random access is kind of a dicey proposition on large files. Of course, only measurement will show... AndreiBCS wrote:algorithmically, I don't think the lseek will matter,I don't think there is any way to avoid storing the whole file because for a uniform sort there is a possibility that the last line will come out first.I agree with the last paragraph, but lseeking seems overly inefficient. Could you avoid that? Andrei
Oct 10 2008
Reply to Andrei,BCS wrote:You have to anyway, or is that not what you agreed with above?Reply to Andrei,I think it does. Essentially you impose random access on the input, or copy to a medium that offers it. gunzip --stdout bigfile.gz | shuffle You'll have to compulsively store a copy of the input.BCS wrote:algorithmically, I don't think the lseek will matter,I don't think there is any way to avoid storing the whole file because for a uniform sort there is a possibility that the last line will come out first.I agree with the last paragraph, but lseeking seems overly inefficient. Could you avoid that? AndreiBesides, random access is kind of a dicey proposition on large files. Of course, only measurement will show...Is that not an I/O effect?as to I/O and cache effects, I'll leave that to someone else.Andrei
Oct 10 2008
BCS wrote:Reply to Andrei,Well I agree things can be coded such that you make the copy only if the file is too large anyway. But one lseek call per line is a complete disaster, whether you call it I/O effect or whatnot. I'd go as far as saying I'd never even try such a solution, especially since there exists a forward-pass algorithm. Of course measuring would be a great way to prove me wrong. AndreiBCS wrote:You have to anyway, or is that not what you agread with above?Reply to Andrei,I think it does. Essentially you impose random access on the input, or copy to a medium that offers it. gunzip --stdout bigfile.gz | shuffle You'll have to compulsively store a copy of the input.BCS wrote:algorithmically, I don't think the lseek will matter,I don't think there is any way to avoid storing the whole file because for a uniform sort there is a possibility that the last line will come out first.I agree with the last paragraph, but lseeking seems overly inefficient. Could you avoid that? Andrei
Oct 10 2008
Reply to Andrei,Well I agree things can be coded such that you make the copy only if the file is too large anyway. But one lseek call per line is a complete disaster, whether you call it I/O effect or whatnot. I'd go as far as saying I'd never even try such a solution, especially since there exists a forward-pass algorithm. Of course measuring would be a great way to prove me wrong. AndreiI think Big-O on mine is n. I don't think that can be beat. (Maybe I was being not at all clear that I was only looking at that aspect, the rest doesn't interest me as a puzzle)
Oct 10 2008
BCS wrote:Reply to Andrei,I disagree. Big-O with lseek under the rug? That's a non-solution because it essentially redefines the problem into "if I had a random-access operator then this is pretty much like array shuffling". If you have a random-access operator there is no problem to think about. AndreiWell I agree things can be coded such that you make the copy only if the file is too large anyway. But one lseek call per line is a complete disaster, whether you call it I/O effect or whatnot. I'd go as far as saying I'd never even try such a solution, especially since there exists a forward-pass algorithm. Of course measuring would be a great way to prove me wrong. AndreiI think Big-O on mine is n.
Oct 10 2008
Reply to Andrei,BCS wrote:In that case this isn't a problem /I/ am interested in trying to solve.Reply to Andrei,I disagree. Big-O with lseek under the rug? That's a non-solution because it essentially redefines the problem into "if I had a random-access operator then this is pretty much like array shuffling". If you have a random-access operator there is no problem to think about. AndreiWell I agree things can be coded such that you make the copy only if the file is too large anyway. But one lseek call per line is a complete disaster, whether you call it I/O effect or whatnot. I'd go as far as saying I'd never even try such a solution, especially since there exists a forward-pass algorithm. Of course measuring would be a great way to prove me wrong. AndreiI think Big-O on mine is n.
Oct 10 2008
Andrei Alexandrescu wrote:The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous. In case you hate random stuff, just wait; the next problem won't have any random element in it :o). AndreiForgot to mention a desirable property. It would be nice if the algorithm would naturally subscale, e.g. if ran on a short file, it would produce a shuffled file with the speed competitive with an algorithm conceived for shorter files. Andrei
Oct 09 2008
Andrei Alexandrescu wrote:The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous. In case you hate random stuff, just wait; the next problem won't have any random element in it :o). AndreiI know it's been suggested before, but we really need digitalmars.d.offtopic
Oct 09 2008
Robert Fraser wrote:I know it's been suggested before, but we really need digitalmars.d.offtopicWord. -- Bruno Medeiros - Software Developer, MSc. in CS/E graduate http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Oct 16 2008
Andrei Alexandrescu schrieb:The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous. In case you hate random stuff, just wait; the next problem won't have any random element in it :o). AndreiJust in general I would use a memory mapped file. Map data in chunks, say 4MB. Chunks are unmapped as necessary ~LRU, means implement a LRUCache. This is how some "in memory database-engines" solve that sort of problems. Also possible : I miss the real problem :( Bjoern
Oct 09 2008
BLS wrote:Andrei Alexandrescu schrieb:I think I overspecified it. Pure and simple you must randomize lines in a file that may not fit in memory. Note that memory mapping may help the speed of whichever implementation, but gives zero indication on how you would actually tackle the problem. Please add [OT] to the title when answering, as I did, so people can filter it out. I think these are cool things to discuss nonetheless in D's context because it gives us ideas on language features and libraries. AndreiThe high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous. In case you hate random stuff, just wait; the next problem won't have any random element in it :o). AndreiJust in general I would use a memory mapped file. Map data in chunks, say 4MB. Chunks are unmapped as necessary ~LRU, means implement a LRUCache. This is how some "in memory database-engines" solve that sort of problems. Also possible : I miss the real problem :(
Oct 10 2008
Thu, 09 Oct 2008 23:42:47 -0500, Andrei Alexandrescu wrote:The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.

Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.

module shuffle;

import std.cstream: din, dout;
import std.stream: BufferedFile, FileMode;
import std.file: remove, exists;
import std.random: Random, uniform, unpredictableSeed, randomShuffle;
import std.stdio: writeln;

void main()
{
    auto gen = Random(unpredictableSeed);
    enum tempName = "temp.$$$";
    if (exists(tempName))
        remove(tempName);
    auto temp = new BufferedFile(tempName, FileMode.In | FileMode.OutNew);
    ulong[] cache;
    foreach (char[] s; din)
    {
        cache ~= temp.position;
        temp.writeLine(s);
    }
    randomShuffle(cache, gen);
    foreach (pos; cache)
    {
        temp.position = pos;
        writeln(temp.readLine());
    }
    temp.close();
}
Oct 10 2008
Sergey Gromov wrote:Thu, 09 Oct 2008 23:42:47 -0500, Andrei Alexandrescu wrote:Yes, that probability exists, but that doesn't imply you must load the entire file in memory. AndreiThe high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
Oct 10 2008
Fri, 10 Oct 2008 07:55:38 -0500, Andrei Alexandrescu wrote:Sergey Gromov wrote:This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.Thu, 09 Oct 2008 23:42:47 -0500, Andrei Alexandrescu wrote:Yes, that probability exists, but that doesn't imply you must load the entire file in memory.The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
Oct 10 2008
Sergey Gromov wrote:Fri, 10 Oct 2008 07:55:38 -0500, Andrei Alexandrescu wrote:Creating temporary files is fine as long as you never need to store the entire input in memory. AndreiSergey Gromov wrote:This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.Thu, 09 Oct 2008 23:42:47 -0500, Andrei Alexandrescu wrote:Yes, that probability exists, but that doesn't imply you must load the entire file in memory.The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
Oct 10 2008
Andrei Alexandrescu wrote:Sergey Gromov wrote:So HDD (or any file system) is not counted as memory? Then just store all temporary variables to HDD.Fri, 10 Oct 2008 07:55:38 -0500, Andrei Alexandrescu wrote:Creating temporary files is fine as long as you never need to store the entire input in memory. AndreiSergey Gromov wrote:This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.Thu, 09 Oct 2008 23:42:47 -0500, Andrei Alexandrescu wrote:Yes, that probability exists, but that doesn't imply you must load the entire file in memory.The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
Oct 10 2008
KennyTM~ wrote:Andrei Alexandrescu wrote:This takes us back to one lseek per line. You can create temporary files. They behave as files usually do - they are better accessed in sequential patterns and not too many times. AndreiSergey Gromov wrote:So HDD (or any file system) is not counted as memory? Then just store all temporary variables to HDD.Fri, 10 Oct 2008 07:55:38 -0500, Andrei Alexandrescu wrote:Creating temporary files is fine as long as you never need to store the entire input in memory. AndreiSergey Gromov wrote:This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.Thu, 09 Oct 2008 23:42:47 -0500, Andrei Alexandrescu wrote:Yes, that probability exists, but that doesn't imply you must load the entire file in memory.The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous.Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.
Oct 10 2008
Fri, 10 Oct 2008 09:20:44 -0500, Andrei Alexandrescu wrote:Sergey Gromov wrote:Ok, let me re-formulate the task then, since the creation of a temporary file containing all the lines seems inevitable. You basically want to uniformly shuffle lines in a very large file.

The idea is this. Let N be the number of lines in the file, which you know after storing it. There are two passes, forward and backward. They are symmetric. On a forward pass you read as many lines as you can fit in memory, let it be m. Shuffle them. Then choose a number k, with 0 < k < m, which is somehow based upon m and N, like k = m * m/N. Write the k first lines back into the file and discard them from memory. Append as many new lines to your memory array as possible. Shuffle. Choose k. Write. Repeat. The reverse pass is exactly the same but you start from the end of file.

The tricky part is how to choose k. It obviously needs to take the number of discarded lines, q, into account, like in k = m * m/(N-q). But my math-fu is almost non-existent so I'm unable to prove if this approach can give a uniform distribution. I feel like it can though.

Fri, 10 Oct 2008 07:55:38 -0500, Andrei Alexandrescu wrote:Sergey Gromov wrote:This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.Yes, that probability exists, but that doesn't imply you must load the entire file in memory.Creating temporary files is fine as long as you never need to store the entire input in memory.
Oct 10 2008
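[Editor's sketch] To pin down what the forward pass above does, here is one possible reading of it in D (my code, not Sergey's), written against an in-memory array standing in for the temporary file; the formula for k is left as a parameter, and nothing here settles whether the result is uniformly distributed, which is exactly what the following replies question.

import std.algorithm : max, min;
import std.random : randomShuffle;

// One forward pass: keep a window of at most m lines, shuffle it,
// flush k of them to the output, refill, and repeat.
// chooseK is the undecided part of the scheme, e.g. k = m * m / (N - q).
string[] forwardPass(string[] lines, size_t m, size_t delegate(size_t left) chooseK)
{
    string[] output;
    string[] window;
    size_t next = 0;                               // next unread line
    while (next < lines.length || window.length)
    {
        while (window.length < m && next < lines.length)
            window ~= lines[next++];               // refill the window
        randomShuffle(window);
        immutable left = lines.length - next + window.length;
        immutable k = next < lines.length
            ? min(max(chooseK(left), 1), window.length)
            : window.length;                       // final flush: everything left
        output ~= window[0 .. k];
        window = window[k .. $].dup;
    }
    return output;
}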
Sergey Gromov wrote:Fri, 10 Oct 2008 09:20:44 -0500, Andrei Alexandrescu wrote:Feeling doesn't qualify. AndreiSergey Gromov wrote:Ok, let me re-formulate the task then, since the creation of a temporary file containing all the lines seems inevitable. You basically want to uniformly shuffle lines in a very large file. The idea is this. Let N be number of lines in the file, which you know after storing it. There are two passes, forward and backward. They are symmetric. On a forward pass you read as many lines as you can fit in memory, let it be m. Shuffle them. Then choose a number, k which is 0 < k < m and is somehow based upon the m and N, like k = m * m/N. Write the k first lines back into the file and discard them from memory. Append as many new lines to your memory array as possible. Shuffle. Choose k. Write. Repeat. The reverse pass is exactly the same but you start from the end of file. The tricky part is how to choose k. It obviously needs to take number of discarded lines, q, into account like in k = m * m/(N-q). But my math-fu is almost non-existent so I'm unable to prove if this approach can give an uniform distribution. I feel like it can though.Fri, 10 Oct 2008 07:55:38 -0500, Andrei Alexandrescu wrote:Creating temporary files is fine as long as you never need to store the entire input in memory.Sergey Gromov wrote:This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
Oct 10 2008
Sergey Gromov wrote:Fri, 10 Oct 2008 09:20:44 -0500, Andrei Alexandrescu wrote:You have to continually alter k to maintain a random distribution, but you have an invariant 0 <= k <= m. Those bounds mean you can't use this approach.Sergey Gromov wrote:Ok, let me re-formulate the task then, since the creation of a temporary file containing all the lines seems inevitable. You basically want to uniformly shuffle lines in a very large file. The idea is this. Let N be number of lines in the file, which you know after storing it. There are two passes, forward and backward. They are symmetric. On a forward pass you read as many lines as you can fit in memory, let it be m. Shuffle them. Then choose a number, k which is 0 < k < m and is somehow based upon the m and N, like k = m * m/N. Write the k first lines back into the file and discard them from memory. Append as many new lines to your memory array as possible. Shuffle. Choose k. Write. Repeat. The reverse pass is exactly the same but you start from the end of file. The tricky part is how to choose k. It obviously needs to take number of discarded lines, q, into account like in k = m * m/(N-q). But my math-fu is almost non-existent so I'm unable to prove if this approach can give an uniform distribution. I feel like it can though.Fri, 10 Oct 2008 07:55:38 -0500, Andrei Alexandrescu wrote:Creating temporary files is fine as long as you never need to store the entire input in memory.Sergey Gromov wrote:This implies though that I cannot start output before I read all the lines from the stream. Therefore I must store those lines somewhere.Well, my temporary file contains the whole stream. Because there is a probability that the last line of the stream goes first.Yes, that probability exists, but that doesn't imply you must load the entire file in memory.
Oct 11 2008
It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place. Sean
Oct 10 2008
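[Editor's sketch] As a concrete rendering of the splitting phase described above (my sketch, not Sean's code; the run-file naming scheme run%04d.tmp is an arbitrary choice), the following D function reads up to k lines at a time, shuffles them in memory, and spools each shuffled chunk to its own temporary file. The random merge of those runs is worked out in the replies below.

import std.format : format;
import std.random : randomShuffle;
import std.stdio  : stdin, File, KeepTerminator;

// Split stdin into shuffled runs of at most k lines each; returns the run file names.
string[] makeShuffledRuns(size_t k)
{
    string[] runNames;
    string[] buf;

    void flush()
    {
        if (buf.length == 0) return;
        randomShuffle(buf);                        // shuffle this chunk in memory
        auto name = format("run%04d.tmp", runNames.length);
        auto f = File(name, "w");
        foreach (line; buf) f.write(line);         // lines keep their terminators
        f.close();
        runNames ~= name;
        buf.length = 0;
    }

    foreach (line; stdin.byLine(KeepTerminator.yes))
    {
        buf ~= line.idup;                          // copy: byLine reuses its buffer
        if (buf.length == k) flush();
    }
    flush();                                       // final, possibly short, run
    return runNames;
}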
Sean Kelly wrote:It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.Sweet! This is even better than my solution. (And it's much faster than seeking.) Andrei
Oct 10 2008
Andrei Alexandrescu wrote:Sean Kelly wrote:You can reduce the number of passes by using more temporary files per iteration. For instance, with O(log n) files per iteration, you'll get a time complexity of O(n log* n) rather than O(n log n). However, using more files makes the disk cache unhappy.It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.Sweet! This is even better than my solution. (And it's much faster than seeking.) Andrei
Oct 11 2008
Sean Kelly wrote:It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that? Andrei
Oct 10 2008
Fri, 10 Oct 2008 15:54:18 -0500, Andrei Alexandrescu wrote:Sean Kelly wrote:After merging b and c you end up with a and bc. Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result. When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
Oct 10 2008
Sergey Gromov wrote:Fri, 10 Oct 2008 15:54:18 -0500, Andrei Alexandrescu wrote:How do you "randomly merge"? Describe the process. AndreiSean Kelly wrote:After merging b and c you end up with a and bc. Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result. When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
Oct 10 2008
Fri, 10 Oct 2008 16:10:29 -0500, Andrei Alexandrescu wrote:Sergey Gromov wrote:a[] and b[] are files to merge. ab[] is the result. a.length and b.length are number of lines left in either file.

while (a.length || b.length)
{
    if (uniform(gen, 0, a.length + b.length) < a.length)
    {
        ab.put(a.head);
        a.next();
    }
    else
    {
        ab.put(b.head);
        b.next();
    }
}

Fri, 10 Oct 2008 15:54:18 -0500, Andrei Alexandrescu wrote:How do you "randomly merge"? Describe the process.Sean Kelly wrote:After merging b and c you end up with a and bc. Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result. When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
Oct 10 2008
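[Editor's sketch] Written out as a self-contained D function over in-memory arrays, the merge above looks roughly like this (my restatement, not Sergey's code; with files one would track the remaining line counts instead of slicing arrays):

import std.random : uniform;

// Randomly interleave a and b, picking the next line from a fragment with
// probability proportional to the number of lines it has left.
string[] randomMerge(string[] a, string[] b)
{
    string[] ab;
    ab.reserve(a.length + b.length);
    while (a.length || b.length)
    {
        if (uniform(0, a.length + b.length) < a.length)
        {
            ab ~= a[0];
            a = a[1 .. $];
        }
        else
        {
            ab ~= b[0];
            b = b[1 .. $];
        }
    }
    return ab;
}

Each step draws the next line from a fragment with probability proportional to how many lines that fragment still has left, which is the weighting the rest of the thread converges on.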
Sergey Gromov wrote:Fri, 10 Oct 2008 16:10:29 -0500, Andrei Alexandrescu wrote:This should work. Use of ranges noted :o). AndreiSergey Gromov wrote:a[] and b[] are files to merge. ab[] is the result. a.length and b.length are number of lines left in either file. while (a.length || b.length) { if (uniform(gen, 0, a.length + b.length) < a.length) { ab.put(a.head); a.next(); } else { ab.put(b.head); b.next(); } }Fri, 10 Oct 2008 15:54:18 -0500, Andrei Alexandrescu wrote:How do you "randomly merge"? Describe the process.Sean Kelly wrote:After merging b and c you end up with a and bc. Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result. When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
Oct 10 2008
Fri, 10 Oct 2008 16:27:03 -0500, Andrei Alexandrescu wrote:Sergey Gromov wrote:I just rigorously proved that probability of line from a getting into output position 0 is equal to probability of line from a getting into position 1 and is equal to a.length/(a.length+b.length). I need to recollect TeX to write this proof down for electronic use.Fri, 10 Oct 2008 16:10:29 -0500, Andrei Alexandrescu wrote:This should work.Sergey Gromov wrote:a[] and b[] are files to merge. ab[] is the result. a.length and b.length are number of lines left in either file. while (a.length || b.length) { if (uniform(gen, 0, a.length + b.length) < a.length) { ab.put(a.head); a.next(); } else { ab.put(b.head); b.next(); } }Fri, 10 Oct 2008 15:54:18 -0500, Andrei Alexandrescu wrote:How do you "randomly merge"? Describe the process.Sean Kelly wrote:After merging b and c you end up with a and bc. Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result. When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?Use of ranges noted :o).Except that a.length and b.length should be stored in files somehow.
Oct 10 2008
Fri, 10 Oct 2008 16:27:03 -0500, Andrei Alexandrescu wrote:Actually I felt an urge to write:

if (uniform(gen, 0, a.length + b.length) < a.length)
    ab.put(a.next);
else
    ab.put(b.next);

I had to really get hold of myself. ;)

while (a.length || b.length)
{
    if (uniform(gen, 0, a.length + b.length) < a.length)
    {
        ab.put(a.head);
        a.next();
    }
    else
    {
        ab.put(b.head);
        b.next();
    }
}

This should work. Use of ranges noted :o).
Oct 10 2008
Andrei Alexandrescu wrote:Sergey Gromov wrote:while( a.hasMore || b.hasMore ) { if( rand % 2 ) r ~= a.readLine; else r ~= b.readLine; } while( a.hasMore ) r ~= a.readLine; while( b.hasMore ) r ~= b.readLine; The downside to randomly choosing two files is that it's possible you could end up always choosing a small and a large file, thus increasing the chance that the large file will have a bunch of lines left over which are essentially appended. The easy way around this would be to maintain a list of files so only equal-sized files are merged (similar to how mergesort works). However, I think a better way would be to use some sort of bias for the random check above so that different-sized files are on equal footing. Perhaps write out the file with the first 4 bytes containing the number of lines in that file and use that to construct the bias. I'll leave those with better math-fu than I to come up with a formula which would be fair. SeanFri, 10 Oct 2008 15:54:18 -0500, Andrei Alexandrescu wrote:How do you "randomly merge"? Describe the process.Sean Kelly wrote:After merging b and c you end up with a and bc. Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result. When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
Oct 10 2008
Sean Kelly wrote:Andrei Alexandrescu wrote:Yah, Sergey got me thinking too. What if you have a large file that doesn't fit in memory by exactly one line? Then I shuffle (n - 1) lines at random, and there remains the last line to write. If I choose from either fragment with probability 0.5, the last line will be with high probability among the first in the result!Sergey Gromov wrote:while( a.hasMore || b.hasMore ) { if( rand % 2 ) r ~= a.readLine; else r ~= b.readLine; } while( a.hasMore ) r ~= a.readLine; while( b.hasMore ) r ~= b.readLine; The downside to randomly choosing two files is that it's possible you could end up always choosing a small and a large file, thus increasing the chance that the large file will have a bunch of lines left over which are essentially appended.Fri, 10 Oct 2008 15:54:18 -0500, Andrei Alexandrescu wrote:How do you "randomly merge"? Describe the process.Sean Kelly wrote:After merging b and c you end up with a and bc. Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result. When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?The easy way around this would be to maintain a list of files so only equal-sized files are merged (similar to how mergesort works). However, I think a better way would be to use some sort of bias for the random check above so that different-sized files are on equal footing. Perhaps write out the file with the first 4 bytes containing the number of lines in that file and use that to construct the bias. I'll leave those with better math-fu than I to come up with a formula which would be fair.(In similar situations I've put the number of lines right in the name of the temporary file. I kid you not! And it's O(1) too!) It looks like any fragment should be chosen from with a probability proportional to its length. The lingering question for me is whether that probability needs to be updated, or can be fixed at the beginning of merging. I think fixed doesn't quite work unless special measures are taken at whenever a range is done. Again many thanks to all participants for your illuminating insights. Andrei
Oct 10 2008
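[Editor's note] A quick arithmetic check of the end-case Andrei describes (my numbers, not from the thread): suppose one fragment holds a single leftover line and the other holds the remaining n - 1 already-shuffled lines. In LaTeX, a fair coin versus the proportional rule gives

P_\text{coin}(\text{leftover line comes out first}) = \tfrac{1}{2},
\qquad
P_\text{proportional}(\text{leftover line comes out first}) = \frac{1}{1 + (n - 1)} = \frac{1}{n}.

A uniform shuffle requires the 1/n answer, so the coin has to be weighted by the fragments' remaining lengths.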
Fri, 10 Oct 2008 17:45:28 -0500, Andrei Alexandrescu wrote:Sean Kelly wrote:Here's my little proof. Paste this into wikipedia to see formulas.

Probability of a line from <math>a</math> getting into the first line of result:

<math> p_a^1=\frac{n_a}{n_a+n_b} </math>

Likewise for <math>b</math>:

<math> p_b^1=\frac{n_b}{n_a+n_b} </math>

Probability of a line from <math>a</math> becoming a second line of the output, <math>p_a^2</math>, is:

<math> p_a^2=p_a^2\Big|_{a_1} + p_a^2\Big|_{b_1} </math>

where <math>p_a^2\Big|_{a_1}</math> and <math>p_a^2\Big|_{b_1}</math> are probabilities of a line from <math>a</math> getting into position <math>2</math> when the first line is from <math>a</math> and from <math>b</math> respectively.

<math> p_a^2\Big|_{a_1} = \frac{n_a}{n_a+n_b}\cdot\frac{n_a-1}{n_a-1+n_b} = \frac{n_a(n_a-1)}{(n_a+n_b)(n_a+n_b-1)} </math>

<math> p_a^2\Big|_{b_1} = \frac{n_b}{n_a+n_b}\cdot\frac{n_a}{n_a+n_b-1} = \frac{n_a n_b}{(n_a+n_b)(n_a+n_b-1)} </math>

Therefore

<math> p_a^2 = \frac{n_a(n_a-1)}{(n_a+n_b)(n_a+n_b-1)}+\frac{n_a n_b}{(n_a+n_b)(n_a+n_b-1)} = \frac{n_a(n_a+n_b-1)}{(n_a+n_b)(n_a+n_b-1)} = \frac{n_a}{n_a+n_b} </math>

So at least <math>p_a^1=p_a^2</math>. Induce anyone?

Andrei Alexandrescu wrote:Yah, Sergey got me thinking too. What if you have a large file that doesn't fit in memory by exactly one line? Then I shuffle (n - 1) lines at random, and there remains the last line to write. If I choose from either fragment with probability 0.5, the last line will be with high probability among the first in the result!Sergey Gromov wrote:while( a.hasMore || b.hasMore ) { if( rand % 2 ) r ~= a.readLine; else r ~= b.readLine; } while( a.hasMore ) r ~= a.readLine; while( b.hasMore ) r ~= b.readLine; The downside to randomly choosing two files is that it's possible you could end up always choosing a small and a large file, thus increasing the chance that the large file will have a bunch of lines left over which are essentially appended.Fri, 10 Oct 2008 15:54:18 -0500, Andrei Alexandrescu wrote:How do you "randomly merge"? Describe the process.Sean Kelly wrote:After merging b and c you end up with a and bc. Then you randomly merge these two files and lines from a have all the chances to appear anywhere within result. When randomly merging, the probability of picking a line from a file should be proportional to a number of lines left in that file.It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?The easy way around this would be to maintain a list of files so only equal-sized files are merged (similar to how mergesort works). 
However, I think a better way would be to use some sort of bias for the random check above so that different-sized files are on equal footing. Perhaps write out the file with the first 4 bytes containing the number of lines in that file and use that to construct the bias. I'll leave those with better math-fu than I to come up with a formula which would be fair.(In similar situations I've put the number of lines right in the name of the temporary file. I kid you not! And it's O(1) too!) It looks like any fragment should be chosen from with a probability proportional to its length. The lingering question for me is whether that probability needs to be updated, or can be fixed at the beginning of merging. I think fixed doesn't quite work unless special measures are taken at whenever a range is done. Again many thanks to all participants for your illuminating insights.
Oct 10 2008
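[Editor's note] Picking up the "Induce anyone?": the induction can be short-circuited by counting interleavings directly (my argument, in the same spirit as the proof above). Whatever sequence of choices the merge makes, the probability of producing one particular interleaving of the two files is the product of the per-step ratios (remaining lines in the chosen file over total remaining lines), and that product telescopes to the same value for every interleaving. In LaTeX:

P(\text{one specific interleaving}) = \frac{n_a!\,n_b!}{(n_a+n_b)!} = \binom{n_a+n_b}{n_a}^{-1}

So every interleaving is equally likely, which gives p_a^i = n_a/(n_a+n_b) for every output position i at once; and since each fragment was itself uniformly shuffled before merging, the merged file is a uniformly random permutation of all its lines.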
Andrei Alexandrescu wrote:Sean Kelly wrote:When a is merged with (b&c) then its lines will be interleaved randomly in the result. SeanIt would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
Oct 10 2008
Sean Kelly wrote:Andrei Alexandrescu wrote:I think that works with Sergey's appropriate crooking of the dice. For the record, my solution involved generating k shuffled sub-files, and then roll a fair k-sided dice and write one line in the chosen file to the output, until all files are exhausted. Sergey showed that my solution is wrong unless it so happens the last fragment is equal to all others. Thanks! Well are we ready for the next puzzle, or bored? AndreiSean Kelly wrote:When a is merged with (b&c) then its lines will be interleaved randomly in the result.It would be slower than the seeking option, but something like a randomized mergesort would work as well. If the program can buffer k lines in a file containing n lines, then read the first k lines into memory, shuffle them, and write them out to a temporary file. Repeat until the input is exhausted. Now randomly pick two of the temporary files and randomly merge them. Repeat until two temporary files remain, then output the result of the final random merge to the screen. For small files (ie. where n<k) the file would be read, shuffled in memory, and printed to the screen, assuming the proper checks were in place.I think I found a problem with your solution. Consider you break the file in three chunks: a, b, c. Then you pick at random b and c and randomly merge them. The problem is, you make early the decision that nothing in a will appear in the first two thirds of the result. So the quality of randomness suffers. How would you address that?
Oct 10 2008
I've read most of the responses (not all) and haven't yet seen a duplicate of my algorithm, so here it is. You can debate whether the cost of opening many little files is as bad as lseek or not...I don't know.

BEGIN CODE
void main()
{
    CreateAnEmptyTmpDirectoryAndChdirToIt();

    while(!feof(stdin))
    {
        char[] name = Random8CharacterFilename();
        if(!file_exists(name))
            WriteToFile(name, ReadALine(stdin));
        else
        {
            if(CoinFlip())
                WriteToFile("junk", ReadALine(stdin));
            else
            {
                rename(name, "junk");
                WriteToFile(name, ReadALine(stdin));
            }

            // shuffle "junk" to some non-duplicate name. chains of
            // renames are allowed so as to prevent any biasing of
            // the odds towards lines earlier in the file.
            name = Random8CharacterFilename();
            while(file_exists(name))
            {
                if(CoinFlip())
                {
                    rename(name, "tmp");
                    rename("junk", name);
                    rename("tmp", "junk");
                }
                name = Random8CharacterFilename();
            }
        }
    }

    char[][] names = GetSortedFilenames();
    foreach(name; names)
    {
        PrintFileContents(name);
        unlink(name);
    }
    DestroyTmpDir();
}
END CODE

Andrei Alexandrescu wrote:The high traffic enjoyed by "random k-sample of a file" confirms that plenty of people around here have idle cycles ready to be spent on fun tasks. In that vein, here's another problem. Say you want to randomly shuffle lines in a stream. You don't know the size beforehand, and one pass is preferable. The stream can be larger than the working memory, but not an unbounded amount of times larger. Also lines in the stream can be unbounded, but it is guaranteed that at least a nonzero fraction of the file fits in working memory. You can use temporary files, but of course minimizing their total size is recommended. Speed is of course important. Draft your solution and explain why you think it's correct and advantageous. In case you hate random stuff, just wait; the next problem won't have any random element in it :o). Andrei
Oct 13 2008