digitalmars.D - Chunks, stream fusion, sortedness invariance
- bearophile (204/204) Mar 10 2013 One of the best ways to see how the development of Phobos is
One of the best ways to see how the development of Phobos is going is to actually use it in some program. There are many ways to write a program that generates text with a Markov chain. After trying various alternative implementations in D, I have produced this short version: import std.stdio, std.file, std.array, std.string, std.range, std.algorithm, std.typecons, std.random; void main() { enum K = 7; // K is order + 1. enum nGenerate = 1_000; const text = "moby_dick.txt" .readText .toLower .replace("\n", " ") .removechars("^a-z ") .squeeze(" "); auto pieces = iota(text.length - K + 1) .map!(i => text[i .. i + K]) .array .sort .group .array .assumeSorted!((a, b) => a[0][0..K-1] < b[0][0..K-1]); string precedent = pieces[pieces.map!q{a[1]}.dice][0]; std.stdio.write(precedent); foreach (_; 0 .. nGenerate) { alias P = typeof(pieces[0]); auto selected = pieces.equalRange(P(precedent[1 .. $] ~ ' ', 0)); precedent = selected[selected.map!q{a[1]}.dice][0]; std.stdio.write(precedent[$ - 1]); } } It processes about 1.2 MB of text and generates 1000 random chars in about 2 seconds. It's not the fastest program, but for my purposes it's fast enough. And it's short and not hard to read. The first part loads the text, and filters it, producing a clean string. The second part generates the overlapping string slices, sorts and groups them, finding their frequency. And then assumes the groups as sorted. The assumeSorted is fed with a lambda that is not strictly necessary, but it should speed up the program a little, because the second field of the tuples is ignored. Then we create the seed of the Markov chain, considering just the frequencies of all slices. We use dice for a biased extraction. Then the loop uses the precedent slice to extract the correctly biased successive slice. equalRange() performs a binary search on the pairs, and then a gallop to find the second end of the interval of the slices that will be used by the biased dice. This is very convenient and probably fast. I don't know of other languages (maybe beside C++) with a so nice equalRange(). This program is nice (and it shows that Phobos has come a long way compared to one or two years ago), but there are several things that could and should be improved, listed below. - - - - - - - - - - - - - - - The program uses this to generate the chunks: auto pieces = iota(text.length - K + 1) .map!(i => text[i .. i + K]) In Phobos there is a function that gives the chunks: Chunks!(Source) chunks(Source)(Source source, size_t chunkSize); But unfortunately it can't be used here because it lacks a very useful nOverlap optional field that tells it how much overlap the slices should have: Chunks!(Source) chunks(Source)(Source source, in size_t chunkSize, in size_t nOverlap = 0); With that extra argument the code becomes as it should be: auto pieces = text .chunks(K, K - 1) The relative request from 2011: http://d.puremagic.com/issues/show_bug.cgi?id=6621 - - - - - - - - - - - - - - - This shows just the first part of the program, with the text cleaning: import std.file: readText; import std.string: toLower, replace, removechars, squeeze; void main() { const clean = "text.txt" .readText .toLower .replace("\n", " ") .removechars("^a-z ") .squeeze(" "); // import std.stdio; writeln(clean); } It's not a fast part of the program. In D there are many ways to write it faster. One simple way is to use cast(read()) instead readText(), because this program assumes the input to be ASCII any way. Another way to speed up the code a little is to use inPlaceToLower. Such two ideas halve the creation time of the string 'clean'. But D is a multi-level language, so you can go down to implement a finite state machine that uses the program counter and goto to jump between about 13 states (and I am not even talking about inline asm). If you don't want to go that low, using a simple switch is enough to create the string 'clean' about 10 or 12 times faster than the original program: import std.file: read; import std.exception: assumeUnique; void main() { auto txt = cast(char[])read("moby_dick.txt"); int precedent = -1; size_t pos = 0; for (size_t i = 0; i < txt.length; i++) { immutable c = txt[i]; switch (c) { case 'a': .. case 'z': txt[pos] = c; pos++; precedent = c; break; case 'A': .. case 'Z': txt[pos] = cast(char)(c + (cast(char)'a' - 'A')); precedent = txt[pos]; pos++; break; case ' ': if (precedent != ' ') { txt[pos] = c; pos++; } precedent = c; break; case '\n', '\r': if (precedent != ' ') { txt[pos] = ' '; pos++; } precedent = ' '; break; default: } } const clean = assumeUnique(txt[0 .. pos]); // import std.stdio; writeln(clean); } This kind of code is fast, but it's long, and to be written it requires a bit of thinking. The original code with the chain of string functions is easy to write (if you don't miss a processing step!), but it's kind of slow. A modern language should offer a third way: simple enough code that is fast enough (even if it's not as fast as the low-level handwritten code). The main problem of code like this is that it creates a new large string at each step: "text.txt".readText.toLower.replace("\n", " ") .removechars("^a-z ").squeeze(" ") Lazy functional languages like Haskell show that there is a better way. In Phobos there are many lazy functions, but if you want to process a string, you have to go back to mostly eager functions. So I think there's a need for more lazy string processing functions. But I think adding more lazy string processing functions to Phobos is not enough. If you write that code in D with lazy functions you risk having a program that's slower than the current string processing code. So I think to face this program D front end needs some deforestation (also named stream fusion, www.google.com/search?q=haskell+deforestation ) logic that works on pure lazy ranges. (Deforestation is possible even when there is no laziness, like in the original code. But purity is needed to simplify the analysis a lot). - - - - - - - - - - - - - - - Now let's take another look at the second part of the program: auto pieces = iota(text.length - K + 1) .map!(i => text[i .. i + K]) .array .sort .group .array .assumeSorted!((a, b) => a[0][0..K-1] < b[0][0..K-1]); The missing argument for chunks is not the only visible problem. We sort, group, convert to an array, and then we have to state again that the array of (slice,frequency) pairs is sorted again. But both group and array don't change the order of items, so the result of the array() is already sorted. So in theory this should be enough for the successive binary searches of equalRange(): auto pieces = iota(text.length - K + 1) .map!(i => text[i .. i + K]) .array .sort .group .array; I have raised this topic already few days ago, "group sortedness invariance": http://forum.dlang.org/thread/zmiqbifxevljazceowif forum.dlang.org I think group(SortedRange) should give a lazy (input range) SortedRange of pairs. This will be possible after this enhancement, that will allow SortedRange to be an input range too: http://d.puremagic.com/issues/show_bug.cgi?id=9616 But that's not enough, because array() is going to convert the lazy SortedRange into a regular array, forgetting that it's sorted. One simple solution is to introduce a function like sortedArray() that converts a lazy range into a random-access eager range that is also a SortedRange. But maybe this is a bit too much brute solution. Maybe there are more general solutions. Ideas welcome. Bye, bearophile
Mar 10 2013