digitalmars.D - Absolutely horrible default string hashing

reply dsimcha <dsimcha yahoo.com> writes:
I've been playing around with associative array implementations in D, mostly
trying to create ones that are more GC-friendly.  I had written one that
seemed much faster than the builtin for int and float keys but much slower for
strings.  I've found the reason:  The builtin AAs use trees for collision
resolution, while mine were using probing or chaining, and with strings there
were a heck of a lot of collisions because the default string hashing
algorithm is absolutely horrible.  Here's a small test program to demonstrate
this:

import std.stdio, std.random, std.algorithm, std.range;

void main() {
    uint[hash_t] hashFrq;
    bool[string] alreadyUsed;

    // Generate 100k random strings, compute their hashes and count the
    // frequency of each hash.
    foreach(i; 0..100_000) {
        string myRandString;

        // Make sure that no two strings are equal.
        do {
            myRandString = randString();
        } while(myRandString in alreadyUsed);
        alreadyUsed[myRandString] = true;

        hash_t hash = typeid(string).getHash(cast(void*) &myRandString);
        hashFrq[hash]++;
    }

    auto hashes = hashFrq.keys;
    auto counts = hashFrq.values;
    sort!((a, b) { return a.at!0 < b.at!0; })(zip(counts, hashes));
    foreach(i; 0..hashes.length) {
        writeln(hashes[i], "\t", counts[i]);
    }

    writeln(hashes.length, " unique hashes used.");
}


// Generates a random string of geometrically distributed length composed
// of uniformly distributed characters in [A-Z, a-z].  Expected length is 20.
string randString() {
    bool upperCase;
    bool keepGoing = true;
    string ret;

    uint upperA = cast(uint) 'A';
    uint upperZ = cast(uint) 'Z' + 1;
    uint lowerA = cast(uint) 'a';
    uint lowerZ = cast(uint) 'z' + 1;

    while(keepGoing) {
        upperCase = cast(bool) uniform!"[]"(0, 1);
        ret ~= cast(char)
            (upperCase ? uniform(upperA, upperZ) : uniform(lowerA, lowerZ));
        keepGoing = uniform(0.0, 1.0) > 0.05;
    }
    return ret;
}

The result is that only about 8400 unique hashes are used, meaning 90+ % of
keys cannot be stored directly in the position corresponding to their hash.
Note that this is in full hash_t space, not in the modulus space typically
used for AAs, so the real results are probably even worse.  If the hashes were
uniformly distributed across all possible values of hash_t, one would only
expect about 30 collisions, meaning about 99970 unique hashes.  (This is based
on Monte Carlo simulation, not exact combinatorics, so it's only a rough
figure, but it
doesn't really matter for the take-home message anyhow.)

Using trees instead of some simpler O(N) mechanism to resolve collisions is
very costly, in that the resulting array cannot be iterated over in O(1)
auxiliary space, and that the trees require extra space overhead to maintain,
meaning more GC pressure, etc.  If we want to fix D's AAs, we first need to
fix D's string hashing.
May 02 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
dsimcha:
 we first need to fix D's string hashing.<
Ages ago I did suggest this one in the main D newsgroup:
http://www.azillionmonkeys.com/qed/hash.html

Bye,
bearophile
May 02 2009
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 dsimcha:
 we first need to fix D's string hashing.<
Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile
Yeah, upon further reverse engineering, the way string hashing "works" is that
it simply adds up all the character code values of the characters and returns
this sum.  The implementation is in
dmd\src\druntime\src\compiler\dmd\typeinfo\ti_AC.d, and I've verified this
both by reading the code and by implementing the same thing myself and seeing
if the results are the same.

If anyone has a better way to do it (which would be almost anything; I've
written a few myself, although they are extremely simplistic and I'm sure
anyone who actually knows what they're doing could do even better), please
submit a patch.  I've filed this as bug 2922, because it's a severe
performance issue, so please submit there.
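For reference, a minimal sketch of the summing behaviour described above
(the name is just for illustration; this approximates what the generic array
typeinfo does, it is not the actual druntime source):

// Approximation of the current behaviour: just sum the character codes.
// Order-insensitive, so "ab", "ba" and every other anagram collide.
hash_t sumHash(in char[] str) {
    hash_t hash = 0;
    foreach(c; str)
        hash += c;
    return hash;
}

Since any permutation of the same characters yields the same value, random
strings drawn from [A-Za-z] get squeezed onto a tiny range of sums, which
matches the collision numbers above.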
May 02 2009
parent reply BCS <none anon.com> writes:
Hello dsimcha,

 == Quote from bearophile (bearophileHUGS lycos.com)'s article
 
 dsimcha:
 
 we first need to fix D's string hashing.<
 
Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile
Yeah, upon further reverse engineering, the way string hashing "works" is that it simply adds up all the character code values of the characters and returns this sum. The implementation is in dmd\src\druntime\src\compiler\dmd\typeinfo\ti_AC.d , and I've verified this both by reading the code and by implementing the same thing myself and seeing if the results are the same. If anyone has a better way to do it (which would be almost anything; I've written a few myself, although they are extremely simplistic and I'm sure anyone who actually knows what they're doing could do even better), please submit a patch. I've filed this as bug 2922, because it's a severe performance issue, so please submit there.
IIRC something like this is common:

hash_t ret = 0;
foreach(c; str) { ret *= SomePrime; ret += c; }
May 02 2009
parent reply "Kristian Kilpi" <kjkilpi gmail.com> writes:
On Sun, 03 May 2009 04:59:26 +0300, BCS <none anon.com> wrote:
 IIRC something like this is common:

 hash_t ret = 0;
 foreach(c;str) { ret *= SomePrime; ret += c; }
I think that's the basis for the best general string hashing funcs around.
IIRC from the university, it doesn't matter, in practice, if the multiplier is
a prime number or not.  So, the multiplication can be replaced with bit
shifting (that's as fast as the addition operation).  E.g.

foreach(c; str)
{
    ret = (ret << 4) + c;
}

I quickly googled around and found the algorithm djb2 which uses the
multiplier 33:
http://www.cse.yorku.ca/~oz/hash.html

djb2 is obviously a little bit slower but it *may* give slightly better
distribution (I'm not gonna run tests now to see if there is a real world
difference... ;) ).
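For comparison, a quick sketch of djb2 as given on the linked page (the seed
5381 and the multiplier 33 come from that page, not from this thread; note
that 33 = 2^5 + 1, so the multiply is itself just a shift plus an add):

// djb2-style hash: multiply-and-add keeps feedback from every character.
hash_t djb2Hash(in char[] str) {
    hash_t hash = 5381;
    foreach(c; str)
        hash = hash * 33 + c;   // equivalently: (hash << 5) + hash + c
    return hash;
}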
May 03 2009
next sibling parent reply "Jérôme M. Berger" <jeberger free.fr> writes:

Kristian Kilpi wrote:
| On Sun, 03 May 2009 04:59:26 +0300, BCS <none anon.com> wrote:
|>
|> IIRC something like this is common:
|>
|> hash_t ret = 0;
|> foreach(c;str) { ret *= SomePrime; ret += c; }
|>
|
| I think that's the basis for the best general string hashing funcs
| around. IIRC from the university, it doesn't matter, in practice,
if the
| multiplier is a prime number or not. So, the multiplication can be
| replaced with bit shifting (that's as fast as the addition operation).
| E.g.
|
| foreach(c; str)
| {
|     ret = (ret << 4) + c;
| }
|
	That one is very bad because it only takes into account the last
few characters of the string (how many depends on the size of the
hash). However, several hashing algorithms use 2^n+1 as their
multiplier, which is very fast even on old/small/embedded hardware
because it can be implemented as a shift and an addition.

| I quickly googled around and found the algorithm djb2 which uses the
| multiplier 33:
| http://www.cse.yorku.ca/~oz/hash.html
|
| djb2 is obviously a little bit slower but it *may* give slightly
better
| distribution (I'm not gonna run tests now to see if there is a real
| world difference... ;) ).

		Jerome
--
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr
May 03 2009
parent reply "Kristian Kilpi" <kjkilpi gmail.com> writes:
On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr>  
wrote:
 |
 | foreach(c; str)
 | {
 |     ret = (ret << 4) + c;
 | }
 |
 	That one is very bad because it only takes into account the last
 few characters of the string (how many depends on the size of the
 hash). However, several hashing algorithms use 2^n+1 as their
 multiplier, which is very fast even on old/small/embedded hardware
 because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Let's scrap that algorithm! :)
May 03 2009
parent reply Steve Teale <steve.teale britseyeview.com> writes:
Kristian Kilpi Wrote:

 On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr>  
 wrote:
 |
 | foreach(c; str)
 | {
 |     ret = (ret << 4) + c;
 | }
 |
 	That one is very bad because it only takes into account the last
 few characters of the string (how many depends on the size of the
 hash). However, several hashing algorithms use 2^n+1 as their
 multiplier, which is very fast even on old/small/embedded hardware
 because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Lets scrap that algorithm! :)
Would something like

foreach(i, c; str)
{
    ret ^= cast(uint) c;
    if (i & 1)
        ret <<= 1;
}

do a better job?  That could cover the first 64 chars reasonably well, and I
suspect that would cover a lot of use cases.
May 03 2009
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Steve Teale (steve.teale britseyeview.com)'s article
 Kristian Kilpi Wrote:
 On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr>
 wrote:
 |
 | foreach(c; str)
 | {
 |     ret = (ret << 4) + c;
 | }
 |
 	That one is very bad because it only takes into account the last
 few characters of the string (how many depends on the size of the
 hash). However, several hashing algorithms use 2^n+1 as their
 multiplier, which is very fast even on old/small/embedded hardware
 because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Lets scrap that algorithm! :)
 Would something like foreach(i, c; str) { ret ^= cast(uint) c; if (i & 1)
 ret <<= 1; } do a better job. That could cover the first 64 chars reasonably
 well, and I suspect that could cover a lot of use cases.
I guess a lot of people missed my follow-up post on this.  This is largely a
typeinfo bug, not a string hashing bug.  typeid(string) returns a TypeInfo
reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo
reference for which the hashing works well.  What is needed is for
typeid(typemodifier(char)[]) to return the same thing as typeid(char[]).

Then again, I guess that, given how bad the generic array hashing is, it
wouldn't hurt to improve that, too.  If someone implements a decent
meta-hashing function (i.e. take a bunch of hashes of elements of some object,
like an array, a tuple, or a struct, and produce a single good hash out of
these), there would be no shortage of use cases for it.
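As a rough illustration of such a meta-hashing function (this is not an
existing druntime API; the names are made up and the mixing constant is the
one popularized by Boost's hash_combine):

// Fold one element hash into a running seed.
hash_t combineHash(hash_t seed, hash_t h) {
    return seed ^ (h + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Hash an array by combining element hashes obtained via the element TypeInfo.
hash_t hashElements(T)(T[] arr) {
    hash_t seed = 0;
    foreach(ref e; arr)
        seed = combineHash(seed, typeid(T).getHash(cast(void*) &e));
    return seed;
}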
May 03 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
dsimcha wrote:
 == Quote from Steve Teale (steve.teale britseyeview.com)'s article
 Kristian Kilpi Wrote:
 On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr>
 wrote:
 |
 | foreach(c; str)
 | {
 |     ret = (ret << 4) + c;
 | }
 |
 	That one is very bad because it only takes into account the last
 few characters of the string (how many depends on the size of the
 hash). However, several hashing algorithms use 2^n+1 as their
 multiplier, which is very fast even on old/small/embedded hardware
 because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Lets scrap that algorithm! :)
 Would something like foreach(i, c; str) { ret ^= cast(uint) c; if (i & 1)
 ret <<= 1; } do a better job. That could cover the first 64 chars reasonably
 well, and I suspect that could cover a lot of use cases.
 I guess a lot of people missed my follow-up post on this.
Yes. Such a minor oversight in a face-to-face conversation takes five seconds to fix. In a newsgroup, it becomes a long thread :o).
 This is largely a
 typeinfo bug, not a string hashing bug.  typeid(string) returns a TypeInfo
 reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo
 reference for which the hashing works well.  What is needed is for
 typeid(typemodifier(char)[]) to return the same thing as typeid(char[]).
 
 Then again, I guess that, given how bad the generic array hashing is, it
wouldn't
 hurt to improve that, too.  If someone implements a decent meta-hashing
function
 (i.e. take a bunch of hashes of elements of some object, like an array, a
tuple,
 or a struct, and produce a single good hash out of these), there would be no
 shortage of use cases for it.
I have exchanged email with Paul Hsieh about using his hash function in Phobos
(azillionmonkeys.com) and he said his license is very permissive.  I seem to
recall I introduced his hash function, with credit, in Phobos' old runtime,
but I am not sure whether Sean took that over.  I suggest we do that in
druntime for strings, be they mutable or not.

Andrei
May 03 2009
parent Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu wrote:
 
 I have exchanged email with Paul Hsieh about using his hash function in 
 Phobos (azillionmonkeys.com) and he said his license is very permissive. 
 I seem to recall I have introduced his hash function, with credit, in 
 Phobos' old runtime but I am not sure whether Sean took that over. I 
 suggest we do that in druntime for string, be they mutable or not.
Unfortunately, that request is languishing in the "oops, I forgot" portion of my to-do list.
May 04 2009
prev sibling parent Steve Teale <steve.teale britseyeview.com> writes:
dsimcha Wrote:

 == Quote from Steve Teale (steve.teale britseyeview.com)'s article
 Kristian Kilpi Wrote:
 On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr>
 wrote:
 |
 | foreach(c; str)
 | {
 |     ret = (ret << 4) + c;
 | }
 |
 	That one is very bad because it only takes into account the last
 few characters of the string (how many depends on the size of the
 hash). However, several hashing algorithms use 2^n+1 as their
 multiplier, which is very fast even on old/small/embedded hardware
 because it can be implemented as a shift and an addition.
You are of course right; thanks for the correction. Lets scrap that algorithm! :)
 Would something like foreach(i, c; str) { ret ^= cast(uint) c; if (i & 1)
 ret <<= 1; } do a better job. That could cover the first 64 chars reasonably
 well, and I suspect that could cover a lot of use cases.
 I guess a lot of people missed my follow-up post on this. This is largely a
 typeinfo bug, not a string hashing bug. typeid(string) returns a TypeInfo
 reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo
 reference for which the hashing works well. What is needed is for
 typeid(typemodifier(char)[]) to return the same thing as typeid(char[]).
 Then again, I guess that, given how bad the generic array hashing is, it
 wouldn't hurt to improve that, too. If someone implements a decent
 meta-hashing function (i.e. take a bunch of hashes of elements of some object,
 like an array, a tuple, or a struct, and produce a single good hash out of
 these), there would be no shortage of use cases for it.
Sorry, I missed that.  The topic listing does get a bit fragmented at times.

Thanks,
Steve
May 03 2009
prev sibling parent Bill Baxter <wbaxter gmail.com> writes:
This guy looks to have a pretty good hash function:
http://www.azillionmonkeys.com/qed/hash.html
He matched Bob Jenkins' One-at-a-time hash on a variety of tests and
beat it on speed.

--bb

On Sun, May 3, 2009 at 5:19 AM, Kristian Kilpi <kjkilpi gmail.com> wrote:
 On Sun, 03 May 2009 04:59:26 +0300, BCS <none anon.com> wrote:
 IIRC something like this is common:

 hash_t ret = 0;
 foreach(c;str) { ret *= SomePrime; ret += c; }
 I think that's the basis for the best general string hashing funcs around.
 IIRC from the university, it doesn't matter, in practice, if the multiplier
 is a prime number or not. So, the multiplication can be replaced with bit
 shifting (that's as fast as the addition operation).
 E.g.

 foreach(c; str)
 {
     ret = (ret << 4) + c;
 }

 I quickly googled around and found the algorithm djb2 which uses the
 multiplier 33:
 http://www.cse.yorku.ca/~oz/hash.html

 djb2 is obviously a little bit slower but it *may* give slightly better
 distribution (I'm not gonna run tests now to see if there is a real world
 difference... ;) ).
May 03 2009
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 dsimcha:
 we first need to fix D's string hashing.<
Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile
On second thought... The real problem is just a strange bug in D's RTTI:

(cast(void*) typeid(immutable(char)[])) != (cast(void*) typeid(char[]))

If one uses the typeinfo for char[] instead of that for immutable(char)[],
then the hash performance is actually good, around 96k unique hashes.
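In code, the workaround being described looks roughly like this (a sketch
against the D2 runtime of the time; the cast only changes the static type used
to select the TypeInfo, it does not copy the string):

string s = "hello world";
auto view = cast(char[]) s;   // reinterpret, no copy; used only for hashing
hash_t good = typeid(char[]).getHash(cast(void*) &view);  // well-distributed
hash_t bad  = typeid(string).getHash(cast(void*) &s);     // generic array sum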
May 02 2009
prev sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Sat, May 2, 2009 at 1:00 PM, dsimcha <dsimcha yahoo.com> wrote:

 The result is that only about 8400 unique hashes are used, meaning 90+ % of
 keys cannot be stored directly in the position corresponding to their hash.
 Note that this is in full hash_t space, not in the modulus space typically
 used for AAs, so the real results are probably even worse.  If the hashes
 were uniformly distributed across all possible values of hash_t, one would
 only expect about 30 collisions, meaning about 99970 unique hashes.  (This is
 based on Monte Carlo simulation, not exact combinatorics, so it's only a
 rough figure, but it doesn't really matter for the take-home message anyhow.)
Here's something bizarre: porting this example to Tango yields 99997 unique hashes, which is right in line with your estimate. But I think it's bizarre since, well, shouldn't druntime use _the same typeinfo code?_
May 02 2009
parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Jarrett Billingsley wrote:
 Here's something bizarre: porting this example to Tango yields 99997
 unique hashes, which is right in line with your estimate.  But I think
 it's bizarre since, well, shouldn't druntime use _the same typeinfo
 code?_
No, Tango recently upgraded its hashing algorithms. This was done after druntime was created, and presumably before the Tango version on your machine :).
May 03 2009
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 03 May 2009 08:10:43 -0400, Frits van Bommel  
<fvbommel remwovexcapss.nl> wrote:

 Jarrett Billingsley wrote:
 Here's something bizarre: porting this example to Tango yields 99997
 unique hashes, which is right in line with your estimate.  But I think
 it's bizarre since, well, shouldn't druntime use _the same typeinfo
 code?_
No, Tango recently upgraded its hashing algorithms. This was done after druntime was created, and presumably before the Tango version on your machine :).
Reading dsimcha's follow-up, it's probably because Tango is D1-only, which
doesn't use immutable(char)[].  (Man, I gotta get back to working on Tango4D2.)

-Steve
May 04 2009