digitalmars.D - Anagrams comparison

bearophile (110/111) Jun 03 2010 A good way to find parts of D2/Phobos2 that can be improved is to use th...
bearophile <bearophileHUGS lycos.com> writes:
A good way to find parts of D2/Phobos2 that can be improved is to use them, so
now and then I'll show a small program and the problems I've found. Small
programs are enough for now, they are able to show enough troubles already.
Larger programs are for later when the most common bugs are fixed.

I show a comparison with Python again, because I know it well enough, but a
similar comparison can be done with Ruby, Clojure, etc.

This time I have used this simple RosettaCode problem:
http://rosettacode.org/wiki/Anagrams

Using the word list at http://www.puzzlers.org/pub/wordlists/unixdict.txt, find
the sets of words that share the same characters that contain the most words in
them. <

----------------------

This is a procedural version for D1 that works:


import std.stdio: writefln;
import std.stream: BufferedFile;
import std.string: chomp;

void main() {
    string[][string] sign2anags;

    foreach (string piece; new BufferedFile("unixdict.txt")) {
        string word = piece.chomp().dup;
        string key = word.dup.sort;
        sign2anags[key] ~= word;
    }

    int lmax;
    foreach (anags; sign2anags) {
        int len = anags.length;
        lmax = lmax < len ? len : lmax;
    }

    foreach (anags; sign2anags)
        if (anags.length == lmax)
            writefln(anags);
}


Its correct output:
[abel,able,bale,bela,elba]
[evil,levi,live,veil,vile]
[elan,lane,lean,lena,neal]
[caret,carte,cater,crate,trace]
[angel,angle,galen,glean,lange]
[alger,glare,lager,large,regal]


Comments on this D1 version:
- The BufferedFile is read lazily, by line, this is good.
- I have had to add "string" as type of "piece" inside the foreach because
BufferedFile doesn't yield strings, I have not understood why BufferedFile is
designed like this.
- The piece string inside the foreach body is not a new string each iteration.
This in theory can improve performance, but in practice similar code written in
Python is faster, despite all strings read in Python are newly allocated.
BufferedFile is not optimized for performance. And the optimization used by
BufferedFile forces the usage of dup. You must not forget to dup those strings
or the program will not work. This makes code even slower. And finding the
right spots where to add those dups is not immediate, in practice it's a
try-and-error process. This can be awful.
- The append to the string array of the values of sign2anags was slow in D1.
- The std.string.chomp() function of std.string is easy to confuse with
std.string.chop().
- The creation of the sorted key is nice, you just sort the string chars in
place.
- The final printing is not very nice, but it's readable enough.
- Overall this program is not too hard to understand and read, the most
bug-prone part is adding those dups. So this program is acceptable, but I'd
like a file line iterator similar to BufferedFile that yields strings that are
already duplicated and despite that is overall faster than BufferedFile.
- I don't know how this program will work if the input dictionary of words is
unicode and contains not ASCII chars like those requires to write French words.

----------------------

And now a Python version. There are many ways to write that program in Python,
here I've used the RosettaCode version, that is more functional-style. This
isn't the faster possible version, but this is a small problem, so max
performance is not so important. It's more important to have an easy to write
and read program that is correct.


import urllib
from collections import defaultdict

words = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt').read().split()


for word in words:
    anagram[tuple(sorted(word))].append(word)

count = max(len(ana) for ana in anagram.itervalues())

for ana in anagram.itervalues():
    if len(ana) >= count:
        print ana


The output:
['angel', 'angle', 'galen', 'glean', 'lange']
['alger', 'glare', 'lager', 'large', 'regal']
['caret', 'carte', 'cater', 'crate', 'trace']
['evil', 'levi', 'live', 'veil', 'vile']
['elan', 'lane', 'lean', 'lena', 'neal']
['abel', 'able', 'bale', 'bela', 'elba']


Some comments:
- This program doesn't read the dictionary of words from disk, it fetches it
from internet. A similar easy to use lib can be added to Phobos2 too.
- This Python2.x program will not work well with a nonASCII input dictionary.
Python3 can work better but you can't print accented chars that naively in the
Windows shell.

----------------------

I have then tried to write a more functional-style D2 version that uses the
algorithms and ranges of Phobos2, but this program fails in many different ways:


import std.stdio, std.stream, std.string, std.algorithm, std.range;

void main() {
    string[][string] sign2anags;
    foreach (string word; new BufferedFile("unixdict.txt"))
        sign2anags[word.chomp().dup.sort] ~= word.chomp();
    int maxCount = max(map!(walkLength)(sign2anags.byValue()));
    auto result = filter!((anags){ anags.length == maxCount;
})(sign2anags.byValue());
    writeln(array(xresult));
}


Some note:
- map is probably currently buggy.
- BufferedFile doesn't seem to yield D2 strings.
- The need to find the max of a lazy sequence is natural and very common, but I
am not sure it's currently supported.
- I have used walkLength to find the length of the string arrays.
- I am not sure that filter works, even if generally filter of Phobos looks
less buggy than map.
- The printing is not improved compared to the D1 version, it's worse.
- Overall not good.

----------------------

To show that it's not impossible to write a nice functional-style version in D
I have used my dlibs1, this is D1 code:


import std.string: chomp;
import d.string: putr;
import d.xio: xfile;
import d.func: maxs, len, array;

void main() {
    string[][string] sign2anags;
    foreach (word; xfile("unixdict.txt"))
        sign2anags[word.chomp().dup.sort] ~= word.chomp();
    auto result = maxs(sign2anags.values, &len!(string[]));
    putr(result);
}


Its correct printed output:

[["abel", "able", "bale", "bela", "elba"], ["evil", "levi", "live", "veil",
"vile"], ["elan", "lane", "lean", "lena", "neal"], ["caret", "carte", "cater",
"crate", "trace"], ["angel", "angle", "galen", "glean", "lange"], ["alger",
"glare", "lager", "large", "regal"]]


Notes:
- The program is not as good as the Python one, but to me it looks acceptable.
- putr prints with a newline.
- xfiles is faster than BufferedFile and yields strings. But as in the D1
version such strings are shared, so you have to dup them. A version that
doesn't need a dup can be added...
- max accepts 2, 3, 4 items, or a generic iterable of strings.
- By default in dlibs1 associative arrays are iterated on their keys first
because this is more useful, so I have used the values there. This can be slow.
If you prefer lazyness you can import d.func.xvalues, and then you can give
sign2anags.xvalues() to maxs.
- There are functions max, min, mins, maxs, maxPos and minPos. All such
functions accept a second optional argument that is a mapping "key", similar to
the key of schwartzSort of std.algorithm. maxs finds all the max items.
- You can say that this program is short because I have a maxs() in my dlibs1
that is casually fit for this program/problem. But the truth is different,
there is nothing random in that, maxs is present because it does a very common
operation, so it is able to shorten many other programs different from this
one. It's a very common "micropattern".
- The len is a templated function similar to walkLength.


Overall I think this time Phobos2 was not good enough (but of course it can be
improved and fixed, no changes to the D2 language are necessary here).

Bye,
bearophile
Jun 03 2010
D Programming

C/C++ Programming

Other

digitalmars.D - Anagrams comparison