digitalmars.D - Absolutely horrible default string hashing

reply dsimcha <dsimcha yahoo.com> writes:
I've been playing around with associative array implementations in D, mostly
trying to create ones that are more GC-friendly.  I had written one that
seemed much faster than the builtin for int and float keys but much slower for
strings.  I've found the reason:  The builtin AAs use trees for collision
resolution, while mine were using probing or chaining, and with strings there
were a heck of a lot of collisions because the default string hashing
algorithm is absolutely horrible.  Here's a small test program to demonstrate
this:

import std.stdio, std.random, std.algorithm, std.range;

void main() {
    uint[hash_t] hashFrq;
    bool[string] alreadyUsed;

    // Generate 100k random strings, compute their hashes and count the
    // frequency of each hash.
    foreach(i; 0..100_000) {
        string myRandString;

        // Make sure that no two strings are equal.
        do {
            myRandString = randString();
        } while(myRandString in alreadyUsed);
        alreadyUsed[myRandString] = true;

        hash_t hash = typeid(string).getHash(cast(void*) &myRandString);
        hashFrq[hash]++;
    }

    auto hashes = hashFrq.keys;
    auto counts = hashFrq.values;
    sort!((a, b) { return a.at!0 < b.at!0; })(zip(counts, hashes));
    foreach(i; 0..hashes.length) {
        writeln(hashes[i], "\t", counts[i]);
    }

    writeln(hashes.length, " unique hashes used.");
}


// Generates a random string of geometrically distributed length composed
// of uniformly distributed characters in [A-Z, a-z].  Expected length is 20.
string randString() {
    bool upperCase;
    bool keepGoing = true;
    string ret;

    uint upperA = cast(uint) 'A';
    uint upperZ = cast(uint) 'Z' + 1;
    uint lowerA = cast(uint) 'a';
    uint lowerZ = cast(uint) 'z' + 1;

    while(keepGoing) {
        upperCase = cast(bool) uniform!"[]"(0, 1);
        ret ~= cast(char)
            (upperCase ? uniform(upperA, upperZ) : uniform(lowerA, lowerZ));
        keepGoing = uniform(0.0, 1.0) > 0.05;
    }
    return ret;
}

The result is that only about 8400 unique hashes are used, meaning 90+ % of
keys cannot be stored directly in the position corresponding to their hash.
Note that this is in full hash_t space, not in the modulus space typically
used for AAs, so the real results are probably even worse.  If the hashes were
uniformly distributed across all possible values of hash_t, one would only
expect about 30 collisions, meaning about 99970 unique hashes.  (This is based
on Monte Carlo simulation, not exact combinatorics, so it's only a rough
figure, but it
doesn't really matter for the take-home message anyhow.)

Using trees instead of some simpler O(N) mechanism to resolve collisions is
very costly, in that the resulting array cannot be iterated over in O(1)
auxiliary space, and that the trees require extra space overhead to maintain,
meaning more GC pressure, etc.  If we want to fix D's AAs, we first need to
fix D's string hashing.
May 02 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
dsimcha:
 we first need to fix D's string hashing.<
Ages ago I did suggest this one in the main D newsgroup:
http://www.azillionmonkeys.com/qed/hash.html

Bye,
bearophile
May 02 2009
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 dsimcha:
 we first need to fix D's string hashing.<
Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile
Yeah, upon further reverse engineering, the way string hashing "works" is that
it simply adds up all the character code values of the characters and returns
this sum.  The implementation is in
dmd\src\druntime\src\compiler\dmd\typeinfo\ti_AC.d, and I've verified this
both by reading the code and by implementing the same thing myself and seeing
if the results are the same.

If anyone has a better way to do it (which would be almost anything; I've
written a few myself, although they are extremely simplistic and I'm sure
anyone who actually knows what they're doing could do even better), please
submit a patch.  I've filed this as bug 2922, because it's a severe
performance issue, so please submit there.
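For reference, a minimal sketch of the summing behaviour described above
(the name is just for illustration; this approximates what the generic array
typeinfo does, it is not the actual druntime source):

// Approximation of the current behaviour: just sum the character codes.
// Order-insensitive, so "ab", "ba" and every other anagram collide.
hash_t sumHash(in char[] str) {
    hash_t hash = 0;
    foreach(c; str)
        hash += c;
    return hash;
}

Since any permutation of the same characters yields the same value, random
strings drawn from [A-Za-z] get squeezed onto a tiny range of sums, which
matches the collision numbers above.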
May 02 2009
parent reply BCS <none anon.com> writes:
Hello dsimcha,

 == Quote from bearophile (bearophileHUGS lycos.com)'s article
 
 dsimcha:
 
 we first need to fix D's string hashing.<
 
Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile
Yeah, upon further reverse engineering, the way string hashing "works" is that it simply adds up all the character code values of the characters and returns this sum. The implementation is in dmd\src\druntime\src\compiler\dmd\typeinfo\ti_AC.d , and I've verified this both by reading the code and by implementing the same thing myself and seeing if the results are the same. If anyone has a better way to do it (which would be almost anything; I've written a few myself, although they are extremely simplistic and I'm sure anyone who actually knows what they're doing could do even better), please submit a patch. I've filed this as bug 2922, because it's a severe performance issue, so please submit there.
IIRC something like this is common:

hash_t ret = 0;
foreach(c; str) { ret *= SomePrime; ret += c; }
May 02 2009
parent reply "Kristian Kilpi" <kjkilpi gmail.com> writes:
On Sun, 03 May 2009 04:59:26 +0300, BCS <none anon.com> wrote:
 IIRC something like this is common:

 hash_t ret = 0;
 foreach(c;str) { ret *= SomePrime; ret += c; }
I think that's the basis for the best general string hashing funcs around.
IIRC from the university, it doesn't matter, in practice, if the multiplier is
a prime number or not.  So, the multiplication can be replaced with bit
shifting (that's as fast as the addition operation).  E.g.

foreach(c; str)
{
    ret = (ret << 4) + c;
}

I quickly googled around and found the algorithm djb2 which uses the
multiplier 33:
http://www.cse.yorku.ca/~oz/hash.html

djb2 is obviously a little bit slower but it *may* give slightly better
distribution (I'm not gonna run tests now to see if there is a real world
difference... ;) ).
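For comparison, a quick sketch of djb2 as given on the linked page (the seed
5381 and the multiplier 33 come from that page, not from this thread; note
that 33 = 2^5 + 1, so the multiply is itself just a shift plus an add):

// djb2-style hash: multiply-and-add keeps feedback from every character.
hash_t djb2Hash(in char[] str) {
    hash_t hash = 5381;
    foreach(c; str)
        hash = hash * 33 + c;   // equivalently: (hash << 5) + hash + c
    return hash;
}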
May 03 2009
next sibling parent reply "Jérôme M. Berger" <jeberger free.fr> writes:

Kristian Kilpi wrote:
| On Sun, 03 May 2009 04:59:26 +0300, BCS <none anon.com> wrote:
|>
|> IIRC something like this is common:
|>
|> hash_t ret = 0;
|> foreach(c;str) { ret *= SomePrime; ret += c; }
|>
|
| I think that's the basis for the best general string hashing funcs
| around. IIRC from the university, it doesn't matter, in practice,
if the
| multiplier is a prime number or not. So, the multiplication can be
| replaced with bit shifting (that's as fast as the addition operation).
| E.g.
|
| foreach(c; str)
| {
|     ret = (ret << 4) + c;
| }
|
	That one is very bad because it only takes into account the last
few characters of the string (how many depends on the size of the
hash). However, several hashing algorithms use 2^n+1 as their
multiplier, which is very fast even on old/small/embedded hardware
because it can be implemented as a shift and an addition.

| I quickly googled around and found the algorithm djb2 which uses the
| multiplier 33:
| http://www.cse.yorku.ca/~oz/hash.html
|
| djb2 is obviously a little bit slower but it *may* give slightly
better
| distribution (I'm not gonna run tests now to see if there is a real
| world difference... ;) ).

		Jerome
--
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
May 03 2009
parent reply "Kristian Kilpi" <kjkilpi gmail.com> writes:
On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr>  
wrote:
 |
 | foreach(c; str)
 | {
 |     ret = (ret << 4) + c;
 | }
 |
 	That one is very bad because it only takes into account the last
 few characters of the string (how many depends on the size of the
 hash). However, several hashing algorithms use 2^n+1 as their
 multiplier, which is very fast even on old/small/embedded hardware
 because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Let's scrap that algorithm! :)
May 03 2009
parent reply Steve Teale <steve.teale britseyeview.com> writes:
Kristian Kilpi Wrote:

 On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr>  
 wrote:
 |
 | foreach(c; str)
 | {
 |     ret = (ret << 4) + c;
 | }
 |
 	That one is very bad because it only takes into account the last
 few characters of the string (how many depends on the size of the
 hash). However, several hashing algorithms use 2^n+1 as their
 multiplier, which is very fast even on old/small/embedded hardware
 because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Lets scrap that algorithm! :)
Would something like

foreach(i, c; str)
{
    ret ^= cast(uint) c;
    if (i & 1)
        ret <<= 1;
}

do a better job?  That could cover the first 64 chars reasonably well, and I
suspect that would cover a lot of use cases.
May 03 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Steve Teale (steve.teale britseyeview.com)'s article
 Kristian Kilpi Wrote:
 On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr>
 wrote:
 |
 | foreach(c; str)
 | {
 |     ret = (ret << 4) + c;
 | }
 |
 	That one is very bad because it only takes into account the last
 few characters of the string (how many depends on the size of the
 hash). However, several hashing algorithms use 2^n+1 as their
 multiplier, which is very fast even on old/small/embedded hardware
 because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Lets scrap that algorithm! :)
 Would something like foreach(i, c; str) { ret ^= cast(uint) c; if (i & 1)
 ret <<= 1; } do a better job. That could cover the first 64 chars reasonably
 well, and I suspect that could cover a lot of use cases.
I guess a lot of people missed my follow-up post on this.  This is largely a
typeinfo bug, not a string hashing bug.  typeid(string) returns a TypeInfo
reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo
reference for which the hashing works well.  What is needed is for
typeid(typemodifier(char)[]) to return the same thing as typeid(char[]).

Then again, I guess that, given how bad the generic array hashing is, it
wouldn't hurt to improve that, too.  If someone implements a decent
meta-hashing function (i.e. take a bunch of hashes of elements of some object,
like an array, a tuple, or a struct, and produce a single good hash out of
these), there would be no shortage of use cases for it.
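As a rough illustration of such a meta-hashing function (this is not an
existing druntime API; the names are made up and the mixing constant is the
one popularized by Boost's hash_combine):

// Fold one element hash into a running seed.
hash_t combineHash(hash_t seed, hash_t h) {
    return seed ^ (h + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Hash an array by combining element hashes obtained via the element TypeInfo.
hash_t hashElements(T)(T[] arr) {
    hash_t seed = 0;
    foreach(ref e; arr)
        seed = combineHash(seed, typeid(T).getHash(cast(void*) &e));
    return seed;
}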
May 03 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
dsimcha wrote:
 == Quote from Steve Teale (steve.teale britseyeview.com)'s article
 Kristian Kilpi Wrote:
 On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr>
 wrote:
 |
 | foreach(c; str)
 | {
 |     ret = (ret << 4) + c;
 | }
 |
 	That one is very bad because it only takes into account the last
 few characters of the string (how many depends on the size of the
 hash). However, several hashing algorithms use 2^n+1 as their
 multiplier, which is very fast even on old/small/embedded hardware
 because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Lets scrap that algorithm! :)
 Would something like foreach(i, c; str) { ret ^= cast(uint) c; if (i & 1)
 ret <<= 1; } do a better job. That could cover the first 64 chars reasonably
 well, and I suspect that could cover a lot of use cases.
 I guess a lot of people missed my follow-up post on this.
Yes. Such a minor oversight in a face-to-face conversation takes five seconds to fix. In a newsgroup, it becomes a long thread :o).
 This is largely a
 typeinfo bug, not a string hashing bug.  typeid(string) returns a TypeInfo
 reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo
 reference for which the hashing works well.  What is needed is for
 typeid(typemodifier(char)[]) to return the same thing as typeid(char[]).
 
 Then again, I guess that, given how bad the generic array hashing is, it
wouldn't
 hurt to improve that, too.  If someone implements a decent meta-hashing
function
 (i.e. take a bunch of hashes of elements of some object, like an array, a
tuple,
 or a struct, and produce a single good hash out of these), there would be no
 shortage of use cases for it.
I have exchanged email with Paul Hsieh about using his hash function in Phobos
(azillionmonkeys.com) and he said his license is very permissive.  I seem to
recall I introduced his hash function, with credit, in Phobos' old runtime,
but I am not sure whether Sean took that over.  I suggest we do that in
druntime for strings, be they mutable or not.

Andrei
May 03 2009
parent Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu wrote:
 
 I have exchanged email with Paul Hsieh about using his hash function in 
 Phobos (azillionmonkeys.com) and he said his license is very permissive. 
 I seem to recall I have introduced his hash function, with credit, in 
 Phobos' old runtime but I am not sure whether Sean took that over. I 
 suggest we do that in druntime for string, be they mutable or not.
Unfortunately, that request is languishing in the "oops, I forgot" portion of my to-do list.
May 04 2009
prev sibling parent Steve Teale <steve.teale britseyeview.com> writes:
dsimcha Wrote:

 == Quote from Steve Teale (steve.teale britseyeview.com)'s article
 Kristian Kilpi Wrote:
 On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr>
 wrote:
 |
 | foreach(c; str)
 | {
 |     ret = (ret << 4) + c;
 | }
 |
 	That one is very bad because it only takes into account the last
 few characters of the string (how many depends on the size of the
 hash). However, several hashing algorithms use 2^n+1 as their
 multiplier, which is very fast even on old/small/embedded hardware
 because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Lets scrap that algorithm! :)
 Would something like foreach(i, c; str) { ret ^= cast(uint) c; if (i & 1)
 ret <<= 1; } do a better job. That could cover the first 64 chars reasonably
 well, and I suspect that could cover a lot of use cases.
 I guess a lot of people missed my follow-up post on this. This is largely a
 typeinfo bug, not a string hashing bug. typeid(string) returns a TypeInfo
 reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo
 reference for which the hashing works well. What is needed is for
 typeid(typemodifier(char)[]) to return the same thing as typeid(char[]).
 Then again, I guess that, given how bad the generic array hashing is, it
 wouldn't hurt to improve that, too. If someone implements a decent
 meta-hashing function (i.e. take a bunch of hashes of elements of some object,
 like an array, a tuple, or a struct, and produce a single good hash out of
 these), there would be no shortage of use cases for it.
Sorry, I missed that.  The topic listing does get a bit fragmented at times.

Thanks,
Steve
May 03 2009
prev sibling parent Bill Baxter <wbaxter gmail.com> writes:
This guy looks to have a pretty good hash function:
http://www.azillionmonkeys.com/qed/hash.html
He matched Bob Jenkins' One-at-a-time hash on a variety of tests and
beat it on speed.

--bb

On Sun, May 3, 2009 at 5:19 AM, Kristian Kilpi <kjkilpi gmail.com> wrote:
 On Sun, 03 May 2009 04:59:26 +0300, BCS <none anon.com> wrote:
 IIRC something like this is common:

 hash_t ret = 0;
 foreach(c;str) { ret *= SomePrime; ret += c; }
 I think that's the basis for the best general string hashing funcs around.
 IIRC from the university, it doesn't matter, in practice, if the multiplier
 is a prime number or not. So, the multiplication can be replaced with bit
 shifting (that's as fast as the addition operation).
 E.g.

 foreach(c; str)
 {
     ret = (ret << 4) + c;
 }

 I quickly googled around and found the algorithm djb2 which uses the
 multiplier 33:
 http://www.cse.yorku.ca/~oz/hash.html

 djb2 is obviously a little bit slower but it *may* give slightly better
 distribution (I'm not gonna run tests now to see if there is a real world
 difference... ;) ).
May 03 2009
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 dsimcha:
 we first need to fix D's string hashing.<
Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile
On second thought... The real problem is just a strange bug in D's RTTI:

(cast(void*) typeid(immutable(char)[])) != (cast(void*) typeid(char[]))

If one uses the typeinfo for char[] instead of that for immutable(char)[],
then the hash performance is actually good, around 96k unique hashes.
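In code, the workaround being described looks roughly like this (a sketch
against the D2 runtime of the time; the cast only changes the static type used
to select the TypeInfo, it does not copy the string):

string s = "hello world";
auto view = cast(char[]) s;   // reinterpret, no copy; used only for hashing
hash_t good = typeid(char[]).getHash(cast(void*) &view);  // well-distributed
hash_t bad  = typeid(string).getHash(cast(void*) &s);     // generic array sum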
May 02 2009
prev sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Sat, May 2, 2009 at 1:00 PM, dsimcha <dsimcha yahoo.com> wrote:

 The result is that only about 8400 unique hashes are used, meaning 90+ % of
 keys cannot be stored directly in the position corresponding to their hash.
 Note that this is in full hash_t space, not in the modulus space typically
 used for AAs, so the real results are probably even worse.  If the hashes
 were uniformly distributed across all possible values of hash_t, one would
 only expect about 30 collisions, meaning about 99970 unique hashes.  (This is
 based on Monte Carlo simulation, not exact combinatorics, so it's only a
 rough figure, but it doesn't really matter for the take-home message anyhow.)
Here's something bizarre: porting this example to Tango yields 99997 unique hashes, which is right in line with your estimate. But I think it's bizarre since, well, shouldn't druntime use _the same typeinfo code?_
May 02 2009
parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Jarrett Billingsley wrote:
 Here's something bizarre: porting this example to Tango yields 99997
 unique hashes, which is right in line with your estimate.  But I think
 it's bizarre since, well, shouldn't druntime use _the same typeinfo
 code?_
No, Tango recently upgraded its hashing algorithms. This was done after druntime was created, and presumably before the Tango version on your machine :).
May 03 2009
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 03 May 2009 08:10:43 -0400, Frits van Bommel  
<fvbommel remwovexcapss.nl> wrote:

 Jarrett Billingsley wrote:
 Here's something bizarre: porting this example to Tango yields 99997
 unique hashes, which is right in line with your estimate.  But I think
 it's bizarre since, well, shouldn't druntime use _the same typeinfo
 code?_
No, Tango recently upgraded its hashing algorithms. This was done after druntime was created, and presumably before the Tango version on your machine :).
Reading dsimcha's follow-up, it's probably because Tango is D1-only, which
doesn't use immutable(char)[].  (Man, I gotta get back to working on Tango4D2.)

-Steve
May 04 2009