digitalmars.D - [OT] The horizon of a stream
- Andrei Alexandrescu (30/30) Oct 23 2008 Consider some text stream. We define the horizon of the stream as the
- bearophile (36/36) Oct 23 2008 First try, D1, Phobos:
- Andrei Alexandrescu (6/7) Oct 23 2008 [snip]
- bearophile (7/10) Oct 23 2008 I understand, but I have 2 GB of RAM, so that solution (in Python) works...
- Andrei Alexandrescu (6/27) Oct 23 2008 Now you're talking! This is a solution worth a lot of attention. Under
- bearophile (6/7) Oct 23 2008 Assuming that the hash function is good, very similar strings too have d...
- bearophile (4/5) Oct 23 2008 There's a third possible solution, that is very fast: assuming the hash ...
- bearophile (20/20) Oct 23 2008 This versions compares hash values first, avoiding some slower string co...
- bearophile (4/8) Oct 23 2008 Ignore this solution, it's not faster, because the whole string is hashe...
- Sergey Gromov (13/47) Oct 23 2008 Traverse the stream, line by line. If the word is new, add it to a
- mgen (69/69) Oct 23 2008 If size is a problem use a trie!
- BCS (21/21) Oct 23 2008 (going at this blind so I might dup others ideas)
- Lionello Lunesu (7/7) Oct 23 2008 What if "corpus" appeared one more time, but at a smaller distance?
- BCS (2/11) Oct 23 2008 IMNSHO no the max spread would be from 5 to 16
- Jason House (9/46) Oct 23 2008 By hash value, store a linked list of (line #, char offset within file)
- bearophile (104/105) Oct 24 2008 If the file is "small" my first implementation is good enough (small mea...
- Nigel Sandever (29/66) Oct 25 2008 The hash solution is simple and fast, but uses too much memory. So, sub-...
- Nigel Sandever (3/5) Oct 25 2008 Should have been 12MB/sec. I was instrumenting in 5 second intervals.
- mgen (1/1) Oct 25 2008 Could you post that file somewhere so we could get some even level testi...
- Nigel Sandever (5/6) Oct 25 2008 going for these algos?
- bearophile (4/5) Oct 25 2008 It looks a bit too much huge. Better to write a little program that gene...
- Nigel Sandever (8/16) Oct 25 2008 random data with a suitable distribution. To make it portable across
- bearophile (5/8) Oct 25 2008 A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', ...
- Nigel Sandever (34/45) Oct 25 2008 rest.
- bearophile (9/13) Oct 26 2008 This is a D newsgroup, so use D, it allows you to manage bits more effic...
- Nigel Sandever (18/32) Oct 26 2008 Sorry. No disrespect meant to D. I always prototype in Perl and then con...
- Christopher Wright (9/46) Nov 01 2008 Simplicity considering external memory constraints.
- Andrei Alexandrescu (3/51) Nov 01 2008 I like this solution a lot. Kudos!
Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines. For example, this stream has horizon 9:

lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
adipisicing
elit

This is because the word "corpus" occurs on line 5 and line 14. That is the farthest apart any two identical lines in the file occur; notice that "ipsum" also occurs twice, but at a smaller distance.

The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.

This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!

Andrei
Oct 23 2008
First try, D1, Phobos:

import std.stdio, std.stream;

void main() {
    int[string] seen;
    int horizon;
    int nline;

    foreach (string line; new BufferedFile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
        nline++;
    }

    writefln(horizon);
}

Idem, with my libs:

import d.all, d.string;

void main() {
    int[string] seen;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto pos_ptr = line in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[line.dup] = nline;
    }

    putr(horizon);
    putr(seen);
}

I'll show an "improved" version soon...

Bye,
bearophile
Oct 23 2008
bearophile wrote:
> First try, D1, Phobos:
[snip]

At this point a discussion "ok, I'll store every line in a hashtable" would have sufficed. I can tell straight away that this is not a workable solution for a large file. It's actually happened to me :o|.

Andrei
Oct 23 2008
Andrei Alexandrescu:
> At this point a discussion "ok, I'll store every line in a hashtable" would have sufficed. I can tell straight away that this is not a workable solution for a large file. It's actually happened to me :o|.

I understand, but I have 2 GB of RAM, so that solution (in Python) works most of the time. When you design an algorithm, you start with the simpler version (that's also easy to debug and test), and only when you experimentally see it doesn't suffice (too slow, not enough RAM, etc.) do you try to improve it. When you debug the refined version you can also reuse the unittests developed with the basic version; this is very useful.

Another solution is to keep an associative array of just the string hash values, and look up into the input file when you have a (possible) false positive.

Another solution that requires even less RAM is to use a Bloom filter (I have created a Bloom filter class in Python but not in D yet) to keep track of which strings have been seen, and look up into the input file (as before) to remove (possible) false positives.

Finally, looking up into the input file every time there's a match is too slow, so you just store the candidate matches into a file or in RAM, to test them all in a second read of the file. If the Bloom filter is sized well enough, then such possible matches aren't too many, and you can keep them in RAM.

Bye,
bearophile
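A minimal two-pass sketch of the Bloom-filter idea above, in the same D1/Phobos style as the code elsewhere in the thread. The BloomFilter class, its sizing (2^30 bits, 4 probes per string), the FNV-style mixing constants and the file name "data.txt" are illustrative assumptions, not an existing library:

import std.stdio, std.stream, std.string;

// Plain Bloom filter: k probe positions per string, derived from one
// FNV-1a style hash mixed with k different seeds.
class BloomFilter {
    private uint[] bits;
    private uint nbits;
    private int k;

    this(uint nbits, int k) {
        this.nbits = nbits;
        this.k = k;
        this.bits = new uint[(nbits + 31) / 32];
    }

    private uint probe(char[] s, uint seed) {
        uint h = 2166136261u ^ seed;
        foreach (c; s) {
            h ^= c;
            h *= 16777619u;
        }
        return h % nbits;
    }

    // Returns true if s was *possibly* seen before; records s either way.
    bool testAndAdd(char[] s) {
        bool all = true;
        for (int i = 0; i < k; i++) {
            uint b = probe(s, cast(uint)i * 0x9E3779B9u);
            if (!(bits[b / 32] & (1u << (b % 32)))) {
                all = false;
                bits[b / 32] |= 1u << (b % 32);
            }
        }
        return all;
    }
}

void main() {
    // First pass: collect the lines the filter flags as repeats
    // (plus some false positives).
    auto filter = new BloomFilter(1u << 30, 4);   // 2^30 bits = 128 MB
    bool[char[]] candidates;
    foreach (char[] line; new BufferedFile("data.txt")) {
        auto clean = line.chomp();
        if (filter.testAndAdd(clean))
            candidates[clean.dup] = true;
    }

    // Second pass: compute the horizon using only the candidate lines.
    int[char[]] first;
    int horizon, nline;
    foreach (char[] line; new BufferedFile("data.txt")) {
        auto clean = line.chomp();
        if (clean in candidates) {
            if (auto p = clean in first) {
                if (nline - *p > horizon)
                    horizon = nline - *p;
            } else
                first[clean.dup] = nline;
        }
        nline++;
    }
    writefln(horizon);
}

The false-positive rate is the usual Bloom-filter trade-off between the size of the bit array and the number of probes; false positives only cost some extra memory in the candidate set.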
Oct 23 2008
bearophile wrote:
> Andrei Alexandrescu:
>> At this point a discussion "ok, I'll store every line in a hashtable" would have sufficed. I can tell straight away that this is not a workable solution for a large file. It's actually happened to me :o|.
>
> I understand, but I have 2 GB of RAM, so that solution (in Python) works most of the time. When you design an algorithm, you start with the simpler version (that's also easy to debug and test), and only when you experimentally see it doesn't suffice (too slow, not enough RAM, etc.) do you try to improve it. When you debug the refined version you can also reuse the unittests developed with the basic version; this is very useful.
>
> Another solution is to keep an associative array of just the string hash values, and look up into the input file when you have a (possible) false positive.

Now you're talking! This is a solution worth a lot of attention. Under what circumstances does it do well vs. not so well? How can you ensure it works for a stream of any size if multiple passes are possible? etc.

> Another solution that requires even less RAM is to use a Bloom filter (I have created a Bloom filter class in Python but not in D yet) to keep track of which strings have been seen, and look up into the input file (as before) to remove (possible) false positives.

Bloom filter is also an excellent idea!

Andrei
Oct 23 2008
Andrei Alexandrescu:
> Under what circumstances does it do well vs. not so well?

Assuming that the hash function is good, even very similar strings have different hash values. When there are lots of equal lines it doesn't work as well, I presume, because there are lots of equal hash values. Note that in our example the lines are very short, so replacing them with a hash_t as the key of the associative array (which then stores that value twice, because it also keeps the hash of the key, and a size_t or uint probably just hashes to itself) doesn't save you that much RAM.

Bye,
bearophile
Oct 23 2008
bearophile Wrote:
> Another solution that requires even less RAM is to use a Bloom filter...

There's a third possible solution that is very fast: assuming the hash values are uint (32 bit), you can create a bitfield of 2^32 bits. It requires just 1/2 GB, that is 1/4 of my RAM. So each bit represents a possible different value of the hash.

Bye,
bearophile
Oct 23 2008
This version compares hash values first, avoiding some slower string compares, because in D1 strings don't store their hash value:

import d.all, d.string;

void main() {
    int[Record!(hash_t, string)] seen;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto clean_line = line.chomp();
        auto hash_clean_line = hash(clean_line);
        auto pos_ptr = record(hash_clean_line, clean_line) in seen;
        if (pos_ptr) {
            if ((nline - *pos_ptr) > horizon)
                horizon = nline - *pos_ptr;
        } else
            seen[record(hash_clean_line, clean_line.dup)] = nline;
    }

    putr(horizon);
}

Maybe it's not fully correct yet, so I'll test it some more...

Bye,
bearophile
Oct 23 2008
bearophile:
> This version compares hash values first, avoiding some slower string compares, because in D1 strings don't store their hash value:
> ...
> auto pos_ptr = record(hash_clean_line, clean_line) in seen;

Ignore this solution, it's not faster, because the whole string is hashed two times, etc. You have to use a String class like mine that lazily computes and stores the hash value too...

Bye,
bearophile
Oct 23 2008
Thu, 23 Oct 2008 14:29:36 -0500, Andrei Alexandrescu wrote:
> Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines. For example, this stream has horizon 9:
>
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
>
> This is because the word "corpus" occurs on line 5 and line 14. That is the farthest apart any two identical lines in the file occur; notice that "ipsum" also occurs twice, but at a smaller distance.
>
> The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.
>
> This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!

Traverse the stream, line by line. If the word is new, add it to a dictionary, size_t[string], with the current line number. If it's already in the dictionary, look at the distance to its first occurrence and update the horizon if that distance is greater than what we already have.

If we don't run out of memory, we're done in one pass. If we do run out of memory, recover ;) then continue without adding words into the dictionary, instead dumping them into a temporary file, while still checking and updating the horizon.

The stream ends. If our current horizon is greater than the number of lines we dumped into the temp file, we're done. Otherwise, empty the dictionary, open the temp file, and repeat.
Oct 23 2008
If size is a problem use a trie!

//---CODE---
import std.stream;
import std.stdio;

class position_trie {
    bool seen = false;
    ulong last_seen = 0;   // ulong: line numbers come from the stream foreach as ulong
    position_trie[char] branch;
}

void main() {
    int[char[]] last;
    ulong max;
    Stream file = new MemoryStream(
"lorem
ipsum
dolor
habeas
corpus
ipsum
sit
amet
rigor
mortis
consectetur
coitus
interruptus
corpus
adipisicing
elit");
    auto root = new position_trie();
    position_trie node = root;
    position_trie next;
    foreach (ulong lineno, char[] line; file) {
        foreach (int chrno, char chr; line) {
            if (chr in node.branch) {
                node = node.branch[chr];
            } else {
                next = new position_trie();
                node.branch[chr] = next;
                node = next;
                foreach (char chr2; line[chrno+1 .. $]) {
                    next = new position_trie();
                    node.branch[chr2] = next;
                    node = next;
                }
                break;
            }
        }
        if (node.seen) {
            auto dist = lineno - node.last_seen;
            if (dist > max) max = dist;
        }
        node.seen = true;
        node.last_seen = lineno;
        node = root;
    }
    writefln(max);
}
Oct 23 2008
(going at this blind so I might dup others' ideas)

- walk the file line by line
- hash each line with a "good hash function"*
- store the line index and the start location of each line as a function of its hash
- look for the hash bucket with the largest spread of line indexes
- grab that spread
- page in the needed file segments
- test if the lines are the same, if so return
- else try again
 -- using some kind of LRU cache, keep as many pages in ram as you can
 -- prefer testing stuff that is in memory over stuff that is not.

The main issue will be the what-to-pick-next function. It must be fast, not require too much metadata, and work well with the cache.

Another option: load every line into a database table with its index:

INSERT INTO tmp (dif) SELECT MAX(index) - MIN(index) FROM lines GROUP BY line;
SELECT MAX(dif) INTO i, COUNT(dif) INTO c FROM tmp;
if (c == 0) return 0; else return i;
Oct 23 2008
With if "corpus" would appear one more time but at a smaller distance? line 5: corpus ... line 14: corpus .. line 16: corpus Would this stream have a different horizon?
Oct 23 2008
Reply to Lionello,

> What if "corpus" appeared one more time, but at a smaller distance?
>
> line 5: corpus
> ...
> line 14: corpus
> ..
> line 16: corpus
>
> Would this stream have a different horizon?

IMNSHO no, the max spread would be from 5 to 16.
Oct 23 2008
By hash value, store a linked list of (line #, char offset within file). When a hash collision occurs, append to the list, but also calculate the horizon from the head of the list. While the horizon is higher than previously observed, scan the list until a string match is found and update the maximum observed horizon.

This performance is linear, but has extra string lookups over some alternatives. It's possible to optimize the string comparisons with some kind of temporary caching of strings, such as weak pointers.

Andrei Alexandrescu wrote:
> Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines. For example, this stream has horizon 9:
>
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
>
> This is because the word "corpus" occurs on line 5 and line 14. That is the farthest apart any two identical lines in the file occur; notice that "ipsum" also occurs twice, but at a smaller distance.
>
> The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.
>
> This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!
>
> Andrei
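A rough D1/Phobos sketch of that bookkeeping. The Occ record, the per-hash bucket layout, and the use of std.stream's position/readLine to re-read an earlier line for verification are assumptions for illustration; the seek-back check is exactly the extra string lookup mentioned above, and it costs a disk access per candidate:

import std.stdio, std.stream;

struct Occ {
    int line;       // line number of an occurrence
    ulong offset;   // byte offset of that line in the file
}

void main() {
    auto file = new BufferedFile("data.txt");
    Occ[][uint] buckets;   // hash value -> occurrences with that hash
    int horizon, nline;

    while (!file.eof()) {
        ulong off = file.position();
        // dup: readLine may reuse an internal buffer, and we seek around below
        char[] line = file.readLine().dup;
        uint h = typeid(char[]).getHash(&line);

        Occ o;
        o.line = nline;
        o.offset = off;

        if (auto list = h in buckets) {
            // Same hash seen before: verify against earlier occurrences,
            // earliest first, but only while a match would improve the horizon.
            ulong resume = file.position();
            foreach (occ; *list) {
                if (nline - occ.line <= horizon)
                    break;
                file.position(occ.offset);
                if (file.readLine() == line) {
                    horizon = nline - occ.line;
                    break;
                }
            }
            file.position(resume);
            *list ~= o;
        } else {
            Occ[] lst;
            lst ~= o;
            buckets[h] = lst;
        }
        nline++;
    }
    writefln(horizon);
}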
Oct 23 2008
bearophile Wrote:
> There's a third possible solution that is very fast: assuming the hash values are uint (32 bit), you can create a bitfield of 2^32 bits. It requires just 1/2 GB, that is 1/4 of my RAM. So each bit represents a possible different value of the hash.

If the file is "small" my first implementation is good enough (small means the associative array has to fit in, for example, 2 GB). For larger files, here's the implementation of the third idea, with D1, Phobos + my libs:

import d.all, std.string, std.c.stdlib, std.outofmemory;

const uint bpc = uint.sizeof * 8;

bool isFlagSet(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    return (flags[offset] & mask) != 0;
}

void setFlag(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    if ((flags[offset] & mask) == 0)
        flags[offset] |= mask;
}

void main() {
    uint* seen = cast(uint*)calloc(1 << 29, 1);
    uint* seen_again = cast(uint*)calloc(1 << 29, 1);
    if (seen is null || seen_again is null)
        throw new OutOfMemoryException();

    foreach (line; xfile("data.txt")) {
        uint h = hash(line.chomp());
        if (isFlagSet(seen, h)) // not used std.intrinsics, because they segfault here
            setFlag(seen_again, h);
        else
            setFlag(seen, h);
    }
    free(seen);

    int[string] aa;
    int horizon;

    foreach (nline, line; xfile("data.txt")) {
        auto clean_line = line.chomp();
        if (isFlagSet(seen_again, hash(clean_line))) {
            auto pos_ptr = clean_line in aa;
            if (pos_ptr) {
                if ((nline - *pos_ptr) > horizon)
                    horizon = nline - *pos_ptr;
            } else
                aa[clean_line.dup] = nline;
        }
    }

    putr(horizon);
}

With D1, just Phobos:

import std.stdio, std.string, std.c.stdlib, std.stream, std.outofmemory;

const uint bpc = uint.sizeof * 8;

bool isFlagSet(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    return (flags[offset] & mask) != 0;
}

void setFlag(uint* flags, uint i) {
    int offset = i / bpc;
    uint mask = 1 << (i % bpc);
    if ((flags[offset] & mask) == 0)
        flags[offset] |= mask;
}

void main() {
    uint* seen = cast(uint*)calloc(1 << 29, 1);
    uint* seen_again = cast(uint*)calloc(1 << 29, 1);
    if (seen is null || seen_again is null)
        throw new OutOfMemoryException();

    foreach (string line; new BufferedFile("data.txt")) {
        auto clean_line = line.chomp();
        uint h = typeid(string).getHash(&clean_line);
        if (isFlagSet(seen, h)) // not used std.intrinsics, because they segfault here
            setFlag(seen_again, h);
        else
            setFlag(seen, h);
    }
    free(seen);

    int[string] aa;
    int horizon, nline;

    foreach (string line; new BufferedFile("data.txt")) {
        auto clean_line = line.chomp();
        if (isFlagSet(seen_again, typeid(string).getHash(&clean_line))) {
            auto pos_ptr = clean_line in aa;
            if (pos_ptr) {
                if ((nline - *pos_ptr) > horizon)
                    horizon = nline - *pos_ptr;
            } else
                aa[clean_line.dup] = nline;
        }
        nline++;
    }

    writefln(horizon);
}

The worst case is like this:

A
B
C
D
E
E
D
C
B
A

In this case the aa associative array has to store half of the lines of the file, and that could be too much. As you can see, with this little program I have found another bug in Phobos; the intrinsics in this case don't work (and I have improved the Record of my libs too, used in another version of this code).

Bye,
bearophile
Oct 24 2008
On Thu, 23 Oct 2008 14:29:36 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines. For example, this stream has horizon 9:
>
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
>
> This is because the word "corpus" occurs on line 5 and line 14. That is the farthest apart any two identical lines in the file occur; notice that "ipsum" also occurs twice, but at a smaller distance.
>
> The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.
>
> This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!
>
> Andrei

The hash solution is simple and fast, but uses too much memory. So, sub-divide your data. Obviously, you cannot process the first half and then the second, but you can make multiple complete passes and only consider a subset during each pass.

For example, hash all lines beginning with 'a' (or 'A' or whatever is applicable with your data) on the first pass, 'b' on the second pass, etc. At the end of each pass you retain only the best horizon so far discovered and discard the rest. Assuming all lower-case and a roughly even distribution, you reduce your memory requirements to ~4%.

By way of example, I knocked up a data file consisting of 100+ million lines (1GB) of words chosen randomly from a dictionary. Using Perl & 26 passes, I got my result in just under 25 minutes using < 2MB of memory.

Admittedly, with this size of file, the entire thing was in the file system caches for the second and subsequent passes, which saved considerable time. If your file is much bigger, that probably wouldn't be the case. But if you're going to write this in D, you can probably speed things up a bit anyway. wc -l with a cold cache manages to achieve a throughput of 28mb/sec on my system, where perl only achieved a best of 60mb/sec with a warm cache. I seem to recall seeing a memory-mapped file module in the D library somewhere. Maybe that would reduce the IO for multiple passes?

Also, given the minimal 2MB max. memory usage above, if you know (or can probe) your data, then you can reduce the number of passes by doing (say) a-f on the first pass, g-l on the second, etc. Or if all your data begins with (say) "Line ...", then key off the appropriate position in the line to form your pass split. Or, if your file is really huge, use the first two characters for a 676-way split. (And wait :)

Anyway, just a thought from a drive-by reader.
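For reference, a sketch of the same multi-pass scheme in D1/Phobos, in the style of the earlier programs in the thread. The fixed 26 lower-case buckets and the file name are assumptions; lines that start with something other than a letter are ignored here and would need one extra pass of their own:

import std.stdio, std.stream, std.string;

void main() {
    int horizon;
    // One complete pass per initial letter; only lines starting with the
    // current letter are kept in the associative array, so peak memory is
    // roughly 1/26th of the single-pass version.
    for (char c = 'a'; c <= 'z'; c++) {
        int[char[]] first;
        int nline;
        foreach (char[] line; new BufferedFile("data.txt")) {
            auto clean = line.chomp();
            // ASCII lower-casing of the first character
            if (clean.length && (clean[0] | 0x20) == c) {
                if (auto p = clean in first) {
                    if (nline - *p > horizon)
                        horizon = nline - *p;
                } else
                    first[clean.dup] = nline;
            }
            nline++;
        }
    }
    writefln(horizon);
}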
Oct 25 2008
On Sun, 26 Oct 2008 01:00:59 GMT, Nigel Sandever <nigelsandever btconnect.com> wrote:
> system, where perl only achieved a best of 60mb/sec with a warm cache.

Should have been 12MB/sec. I was instrumenting in 5 second intervals.
Oct 25 2008
Could you post that file somewhere so we could get some even level testing going for these algos?
Oct 25 2008
On Sat, 25 Oct 2008 21:45:29 -0400, mgen <bmeck stedwards.edu> wrote:
> Could you post that file somewhere so we could get some even level testing going for these algos?

Sure. If you know somewhere that would allow me to upload a (bzip -9) 300Mb file? And are prepared to wait a while, I'm on dial-up :(
Oct 25 2008
mgen:
> Could you post that file somewhere so we could get some even level testing going for these algos?

It looks a bit too huge. Better to write a little program that generates random data with a suitable distribution. To make it portable across Phobos/Tango/etc. the RNG can be included in the code; there are very fast ones around.

Bye,
bearophile
Oct 25 2008
On Sat, 25 Oct 2008 22:54:47 -0400, bearophile <bearophileHUGS lycos.com> wrote:
> mgen:
>> Could you post that file somewhere so we could get some even level testing going for these algos?
>
> It looks a bit too huge. Better to write a little program that generates random data with a suitable distribution. To make it portable across Phobos/Tango/etc. the RNG can be included in the code; there are very fast ones around.
>
> Bye,
> bearophile

You'd need the PRNG to produce the same sequence for the same seed cross-platform (the MT for example). And you'd need to distribute (or point to) a common dictionary.
Oct 25 2008
Nigel Sandever:
> For example, hash all lines beginning with 'a' (or 'A' or whatever is applicable with your data) on the first pass, 'b' on the second pass, etc. At the end of each pass you retain only the best horizon so far discovered and discard the rest.

A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc., to split the input data into bins. But you want to fill those bins uniformly. So how can you do it? The hash number, if good, is nearly random, so you can partition its bits to partition your data into more uniform groups :-) The downside is that you have to compute the hash of each line on every pass. To speed this up you can hash just a sample of the chars of each line...

Note that a 1 GB file is too small for a true benchmark, I have 2 GB RAM...

You don't need to load the lines as strings, you can load the whole file as a single memory block in RAM. Then you don't even need slices (8 bytes) to represent a string, because each line's end point is the end of the file or the next newline, so you just need the starting points as keys and the line indexes as values. Unfortunately the built-in AA with the built-in memory allocator uses the same memory whether your keys are 8 or 4 bytes long. So if you see memory isn't enough to run your program, then you need to use a less memory-hungry hash map. You can find plenty around, written in C; I have translated one to D, it's easy. With that, if the lines aren't too short or there are some duplicate lines, then the RAM (assuming 2 GB) suffices to keep the whole associative array, so you need to load the file just one time.

Bye,
bearophile
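A sketch of that variant: the skeleton is the same as the 26-pass program above, and only the bucketing test changes to use the low bits of the line's hash. NPASSES and the file name are illustrative choices:

import std.stdio, std.stream, std.string;

const uint NPASSES = 16;   // any power of two; raise it if one pass still doesn't fit in RAM

void main() {
    int horizon;
    for (uint pass = 0; pass < NPASSES; pass++) {
        int[char[]] first;
        int nline;
        foreach (char[] line; new BufferedFile("data.txt")) {
            auto clean = line.chomp();
            // keep only the lines whose hash falls into this pass's bucket
            if ((typeid(char[]).getHash(&clean) & (NPASSES - 1)) == pass) {
                if (auto p = clean in first) {
                    if (nline - *p > horizon)
                        horizon = nline - *p;
                } else
                    first[clean.dup] = nline;
            }
            nline++;
        }
    }
    writefln(horizon);
}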
Oct 25 2008
On Sat, 25 Oct 2008 22:52:51 -0400, bearophile <bearophileHUGS lycos.com> wrote:
> Nigel Sandever:
>> For example, hash all lines beginning with 'a' (or 'A' or whatever is applicable with your data) on the first pass, 'b' on the second pass, etc. At the end of each pass you retain only the best horizon so far discovered and discard the rest.
>
> A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc., to split the input data into bins. But you want to fill those bins uniformly. So how can you do it? The hash number, if good, is nearly random, so you can partition its bits to partition your data into more uniform groups :-) The downside is that you have to compute the hash of each line on every pass. To speed this up you can hash just a sample of the chars of each line...

I did try that (using md5), but the penalty in Perl was horrible. The hash (AA) just had to rehash the binary md5 anyway, so it was pure cost for my known dataset. For a real application with unknown/non-uniform data it would work like a charm though ... assuming you don't hit that rarest of non-impossibilities: duplicate md5s of non-identical inputs :)

That said, for most applications of this, one would probably have some feel for the dataset(s) one is likely to deal with. And trading domain-specific knowledge for total generality is often the best optimisation one can perform.

> Note that a 1 GB file is too small for a true benchmark, I have 2 GB RAM...

Agreed, and ditto. But provided the algorithms are written to assume files larger than memory, the numbers do scale linearly.

> You don't need to load the lines as strings, you can load the whole file as a single memory block in RAM. Then you don't even need slices (8 bytes) to represent a string, because each line's end point is the end of the file or the next newline, so you just need the starting points as keys and the line indexes as values. Unfortunately the built-in AA with the built-in memory allocator uses the same memory whether your keys are 8 or 4 bytes long. So if you see memory isn't enough to run your program, then you need to use a less memory-hungry hash map. You can find plenty around, written in C; I have translated one to D, it's easy. With that, if the lines aren't too short or there are some duplicate lines, then the RAM (assuming 2 GB) suffices to keep the whole associative array, so you need to load the file just one time.

I agree that a 'testset producer' and a freely accessible known dictionary is a better option. I used (a slightly modified version of) 2of12inf, available from http://downloads.sourceforge.net/wordlist/alt12dicts-4.tar.gz

> Bye,
> bearophile
Oct 25 2008
Nigel Sandever:
> I did try that (using md5), but the penalty in Perl was horrible.

This is a D newsgroup, so use D, it allows you to manage bits more efficiently. md5 is too slow for this purpose, use the built-in hashing of strings.

> That said, for most applications of this, one would probably have some feel for the dataset(s) one is likely to deal with. And trading domain-specific knowledge for total generality is often the best optimisation one can perform.

I agree, when you have gigabytes of data it's better to gain some knowledge about the data before processing it.

> But provided the algorithms are written to assume files larger than memory, the numbers do scale linearly.

Several of the provided algorithms don't assume files larger than memory, or they work quite a bit better when the file is smaller than memory.

> I used (a slightly modified version of) 2of12inf, available from

That's a quite complex file, so I suggest something simpler, such as this after a cleaning of the non-ASCII words: http://www.norvig.com/big.txt

Bye,
bearophile
Oct 26 2008
On Sun, 26 Oct 2008 03:39:50 -0400, bearophile <bearophileHUGS lycos.com> wrote:
> Nigel Sandever:
>> I did try that (using md5), but the penalty in Perl was horrible.
>
> This is a D newsgroup, so use D, it allows you to manage bits more efficiently.

Sorry. No disrespect meant to D. I always prototype in Perl and then convert to C or D if I need performance. I'm just more familiar with Perl.

>> I used (a slightly modified version of) 2of12inf, available from
>
> That's a quite complex file, so I suggest something simpler, such as this after a cleaning of the non-ASCII words: http://www.norvig.com/big.txt

I don't know what is "complex" about a 1-word-per-line, 81536-line dictionary file? Or how having everyone clean up Conan Doyle would be simpler?

If you have Perl, you can produce a suitable test file from any 1-word-per-line dictionary with the command line:

perl -l12n0777aF/\n/ -ne'print $F[rand @F] for 1..4e8' yourdict >thedata

With the 2of12inf dictionary file, 4e8 produces a file a little under 4GB in around 10 minutes. YMMV depending upon the average length of the lines in your local dict. Of course they won't all be the same as mine, or anyone else's, but given the random nature, the results will be broadly comparable.

My D foo is too rusty to try and write that in D. Especially for a D audience :) I'm sure one of you guys can knock that up in the blink of an eye.
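For what it's worth, a possible D1/Phobos counterpart of that one-liner; the dictionary file name, the output file name and the default word count are placeholders:

import std.stream, std.string, std.file, std.conv, std.random;

// Writes n words picked at random (with repetition) from a one-word-per-line
// dictionary, one word per line. Modulo bias is ignored here.
void main(char[][] args) {
    auto words = splitlines(cast(char[])std.file.read("2of12inf.txt"));
    int n = (args.length > 1) ? toInt(args[1]) : 100_000_000;
    auto output = new BufferedFile("data.txt", FileMode.OutNew);
    for (int i = 0; i < n; i++)
        output.writeLine(words[rand() % words.length]);
    output.close();
}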
Oct 26 2008
Andrei Alexandrescu wrote:
> Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines. For example, this stream has horizon 9:
>
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
>
> This is because the word "corpus" occurs on line 5 and line 14. That is the farthest apart any two identical lines in the file occur; notice that "ipsum" also occurs twice, but at a smaller distance.
>
> The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.
>
> This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!
>
> Andrei

Simplicity considering external memory constraints.

Read in the file, line by line. Output the line followed by the line number. Sort the output (merge sort works wonderfully). Now you go through it a second time and find the maximum distance.

This takes O(n log n) time, which is far from optimal. However, I think we already did sorting in external memory, so I could implement this much quicker than most alternatives. Depending on the number of times this has to be done, that might be a net advantage.
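An in-memory illustration of that tag / sort / scan pipeline, in D1/Phobos style. For a file that doesn't fit in RAM, the sort step would instead be an external merge sort (or the system sort utility) over the same records; the scan pass is unchanged and streams the sorted data with constant memory. Using a tab as the separator assumes the lines themselves contain no tabs, and the file name is a placeholder:

import std.stdio, std.stream, std.string, std.conv;

void main() {
    // Tag every line with its line number: "text<TAB>number".
    char[][] tagged;
    int nline;
    foreach (char[] line; new BufferedFile("data.txt"))
        tagged ~= line.chomp() ~ "\t" ~ toString(nline++);

    // After sorting, all records with the same text are adjacent.
    tagged.sort;

    // Scan each group of identical lines and track the widest spread.
    int horizon;
    char[] curText;
    int minLine, maxLine;
    bool started = false;
    foreach (rec; tagged) {
        int tab = rfind(rec, '\t');
        char[] text = rec[0 .. tab];
        int ln = toInt(rec[tab + 1 .. $]);
        if (started && text == curText) {
            if (ln < minLine) minLine = ln;
            if (ln > maxLine) maxLine = ln;
            if (maxLine - minLine > horizon)
                horizon = maxLine - minLine;
        } else {
            curText = text;
            minLine = maxLine = ln;
            started = true;
        }
    }
    writefln(horizon);
}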
Nov 01 2008
Christopher Wright wrote:
> Andrei Alexandrescu wrote:
>> Consider some text stream. We define the horizon of the stream as the longest number of lines between two identical lines minus one, or zero if the stream consists of unique lines. For example, this stream has horizon 9:
>>
>> lorem
>> ipsum
>> dolor
>> habeas
>> corpus
>> ipsum
>> sit
>> amet
>> rigor
>> mortis
>> consectetur
>> coitus
>> interruptus
>> corpus
>> adipisicing
>> elit
>>
>> This is because the word "corpus" occurs on line 5 and line 14. That is the farthest apart any two identical lines in the file occur; notice that "ipsum" also occurs twice, but at a smaller distance.
>>
>> The challenge, should you blah blah, is to compute the horizon of a potentially large file. Of course theoretical and practical speed are as important as always.
>>
>> This is a task that's interesting and subtle in quite a number of ways, and it allows for many creative solutions. Therefore I intentionally did not restrict it much. Let ideas flow!
>>
>> Andrei
>
> Simplicity considering external memory constraints.
>
> Read in the file, line by line. Output the line followed by the line number. Sort the output (merge sort works wonderfully). Now you go through it a second time and find the maximum distance.
>
> This takes O(n log n) time, which is far from optimal. However, I think we already did sorting in external memory, so I could implement this much quicker than most alternatives. Depending on the number of times this has to be done, that might be a net advantage.

I like this solution a lot. Kudos!

Andrei
Nov 01 2008