digitalmars.D - Re: [OT] shuffling lines in a stream
- Jason House (3/28) Oct 10 2008 What you want to optimize is still unclear to me. At its heart, you must...
- Andrei Alexandrescu (9/43) Oct 10 2008 At the end of the day, devise a means to take care of the task. Faster
- BLS (19/48) Oct 10 2008 Hi Jason,
- Benji Smith (25/80) Oct 10 2008 Of course, this relies on the assumption that sentences have
- BLS (4/94) Oct 10 2008 Thanks for the pointer benji. I should have a book about it, somewhere.
Andrei Alexandrescu Wrote:
> Sergey Gromov wrote:
>> Thu, 09 Oct 2008 23:42:47 -0500, Andrei Alexandrescu wrote:
>>> The high traffic enjoyed by "random k-sample of a file" confirms that
>>> plenty of people around here have idle cycles ready to be spent on fun
>>> tasks. In that vein, here's another problem. Say you want to randomly
>>> shuffle lines in a stream. You don't know the size beforehand, and one
>>> pass is preferable. The stream can be larger than the working memory,
>>> but not an unbounded number of times larger. Also, lines in the stream
>>> can be unbounded, but it is guaranteed that at least a nonzero fraction
>>> of the file fits in working memory. You can use temporary files, but of
>>> course minimizing their total size is recommended. Speed is of course
>>> important. Draft your solution and explain why you think it's correct
>>> and advantageous.
>>
>> Well, my temporary file contains the whole stream, because there is a
>> probability that the last line of the stream goes first.
>
> Yes, that probability exists, but that doesn't imply you must load the
> entire file in memory.
>
> Andrei

What you want to optimize is still unclear to me. At its heart, you must
pass over the data twice: once to index it, and once to display it. Some
library-based memory-mapped file or a seekable stream seems like enough to
get the job done. Neither that bit nor the randomization of indices seems
particularly interesting. What am I missing?

I look forward to the next challenge ;)
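[A minimal D sketch of the two-pass approach described here, assuming the
"stream" is in fact a seekable file; the function name and file handling are
illustrative only, not code from the thread.]

import std.stdio;
import std.random : randomShuffle;

void shuffleBySeeking(string fileName)
{
    auto f = File(fileName, "r");

    // Pass 1: record the byte offset at which every line starts.
    ulong[] offsets;
    for (;;)
    {
        auto pos = f.tell();
        auto line = f.readln();
        if (line.length == 0) break;   // end of file
        offsets ~= pos;
    }

    // Shuffle the index rather than the data itself.
    randomShuffle(offsets);

    // Pass 2: seek to each recorded offset in shuffled order and print the
    // line found there -- one random seek per output line.
    foreach (off; offsets)
    {
        f.seek(cast(long) off);
        write(f.readln());
    }
}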
Oct 10 2008
Jason House wrote:
> Andrei Alexandrescu Wrote:
>> [...]
>
> What you want to optimize is still unclear to me.

At the end of the day, devise a means to take care of the task. Faster is
better, occupying less space is better.

> At its heart, you must pass over the data twice. Once to index it, and
> once to display it. Some library-based memory-mapped file or a seekable
> stream seems like enough to get the job done. Neither that bit nor the
> randomization of indices seems particularly interesting. What am I
> missing?

Your solution will be slow. Doing one random seek per line is too much,
even in a memory-mapped file. Mapping into memory is not magic! We are
talking about many millions of lines. Also, you can't memory-map streams
unless you create a full-blown copy first.

> I look forward to the next challenge ;)

I think you will be surprised by the juicy parts of this one.

Andrei
Oct 10 2008
Jason House wrote:
> [...]
>
> What you want to optimize is still unclear to me. At its heart, you must
> pass over the data twice. Once to index it, and once to display it. Some
> library-based memory-mapped file or a seekable stream seems like enough
> to get the job done. Neither that bit nor the randomization of indices
> seems particularly interesting. What am I missing?
>
> I look forward to the next challenge ;)

Hi Jason,

I think (I may be wrong) that the normal distribution is the key to
optimizing speed. Let's say the stream Andrei is talking about is a book,
and each sentence in this book consists of N characters (end of line is a
dot). Now we read the first ~10 pages, character by character. Based on
this information we can say that the average sentence is about 100
characters long and that the lengths look roughly normally distributed (no
graphic possible here, so see Wikipedia:
http://en.wikipedia.org/wiki/Normal_distribution).

Indeed, this is NOT the solution Andrei is looking for, because indexing is
still necessary. But finding the EOL/dot should be faster if we only search
a slice from position 90 to 110 (instead of scanning the whole line), which
would already be a win. Further, during the indexing process we can keep
adjusting our first (base) normal distribution, as well as our +/-
tolerance.

Heck, it is late... hope this is not completely bullshi*

Bjoern
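[As an illustration of the sampling step sketched above: a small D routine,
with a hypothetical name and an arbitrary sample size, that reads the first
sampleSize lines and estimates the mean line length and its standard
deviation -- the two parameters a normal-distribution guess needs.]

import std.stdio;
import std.algorithm : map, sum;
import std.math : sqrt;

void estimateLineLengths(string fileName, size_t sampleSize = 1000)
{
    // Collect the lengths of the first sampleSize lines.
    double[] lengths;
    auto f = File(fileName, "r");
    foreach (line; f.byLine())
    {
        lengths ~= cast(double) line.length;
        if (lengths.length >= sampleSize) break;
    }
    if (lengths.length == 0) return;

    // Sample mean and standard deviation of the observed line lengths.
    immutable mean = sum(lengths) / lengths.length;
    immutable var  = lengths.map!(l => (l - mean) * (l - mean)).sum()
                     / lengths.length;
    writefln("sampled %s lines: mean length %.1f, std deviation %.1f",
             lengths.length, mean, sqrt(var));
}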
Oct 10 2008
BLS wrote:
> Jason House wrote:
>> [...]
>
> Hi Jason,
>
> I think (I may be wrong) that the normal distribution is the key to
> optimizing speed. Let's say the stream Andrei is talking about is a book,
> and each sentence in this book consists of N characters (end of line is a
> dot). Now we read the first ~10 pages, character by character. Based on
> this information we can say that the average sentence is about 100
> characters long and that the lengths look roughly normally distributed.
> [...]

Of course, this relies on the assumption that sentences have
normally-distributed lengths (which may or may not be true for any
specific document).

You might be interested in checking out something called a "kernel density
estimator", which is the typical solution for estimating the shape of an
unknown distribution based on a series of samples. Here's the relevant
Wikipedia article:

http://en.wikipedia.org/wiki/Kernel_density_estimation

I learned about this technique because it was the subject of a job
interview question once. The question was something like: "Given a file
containing 100 billion floating-point values, what's the most effective
algorithm for finding the median value? Your machine only has room for a
billion values in memory at any given time, and making multiple passes
over the file is very expensive."

In my solution, I built a histogram during the first pass, counting the
number of entries in each bucket. At the end of the first pass I could
isolate the bucket containing the median, and then during the second pass
I'd be able to definitively identify the median value.

They pointed out several degenerate cases where my algorithm would
collapse (because I always had to make assumptions about the shape of the
distribution), and I left the interview stumped about the real answer.
When I got home, I did my research and found the KDE technique, which I've
used on a few projects since then.

--benji
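[A rough D sketch of the two-pass histogram approach described above,
assuming a flat binary file of doubles whose values are known to lie in
[lo, hi); the file layout, bucket count, and function name are placeholders,
not details from the interview question.]

import std.stdio;
import std.algorithm : sort;

double histogramMedian(string fileName, double lo, double hi,
                       size_t nBuckets = 1024)
{
    // Map a value to its histogram bucket.
    size_t bucketOf(double v)
    {
        auto b = cast(size_t) ((v - lo) / (hi - lo) * nBuckets);
        return b >= nBuckets ? nBuckets - 1 : b;
    }

    auto f = File(fileName, "rb");
    double[4096] buf;

    // Pass 1: count how many values fall into each bucket.
    auto counts = new ulong[nBuckets];
    ulong total = 0;
    for (double[] chunk; (chunk = f.rawRead(buf[])).length != 0; )
        foreach (v; chunk) { ++counts[bucketOf(v)]; ++total; }

    // Locate the bucket that contains the median (element number total/2
    // in sorted order); 'seen' counts the values in all earlier buckets.
    immutable target = total / 2;
    ulong seen = 0;
    size_t medianBucket = 0;
    foreach (i, c; counts)
    {
        if (seen + c > target) { medianBucket = i; break; }
        seen += c;
    }

    // Pass 2: keep only that bucket's values in memory and pick the median
    // from among them.
    double[] candidates;
    f.seek(0);
    for (double[] chunk; (chunk = f.rawRead(buf[])).length != 0; )
        foreach (v; chunk)
            if (bucketOf(v) == medianBucket) candidates ~= v;

    sort(candidates);
    return candidates[cast(size_t) (target - seen)];
}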
Oct 10 2008
Benji Smith wrote:
> [...]
>
> You might be interested in checking out something called a "kernel
> density estimator", which is the typical solution for estimating the
> shape of an unknown distribution based on a series of samples. Here's
> the relevant Wikipedia article:
>
> http://en.wikipedia.org/wiki/Kernel_density_estimation
>
> [...]
>
> When I got home, I did my research and found the KDE technique, which
> I've used on a few projects since then.
>
> --benji

Thanks for the pointer, Benji. I should have a book about it somewhere...
well, I'll have to look for it another day. And yes, my idea is pretty
weak.

bjoern
Oct 10 2008