digitalmars.D - Absolutely horrible default string hashing
- dsimcha (63/63) May 02 2009 I've been playing around with associative array implementations in D, mo...
- bearophile (5/6) May 02 2009 Ages ago I did suggest this one in the main D newsgroup:
- dsimcha (11/17) May 02 2009 Yeah, upon further reverse engineering, the way string hashing "works" i...
- BCS (4/28) May 02 2009 IIRC something like this is common:
- Kristian Kilpi (16/19) May 03 2009 I think that's the basis for the best general string hashing funcs aroun...
- =?ISO-8859-15?Q?=22J=E9r=F4me_M=2E_Berger=22?= (47/47) May 03 2009 -----BEGIN PGP SIGNED MESSAGE-----
- Kristian Kilpi (4/15) May 03 2009 You are of course right; thanks for the correction. Lets scrap that
- Steve Teale (9/25) May 03 2009 Would something like
- dsimcha (12/37) May 03 2009 suspect that could cover a lot of use cases.
- Andrei Alexandrescu (9/48) May 03 2009 Yes. Such a minor oversight in a face-to-face conversation takes five
- Sean Kelly (3/9) May 04 2009 Unfortunately, that request is languishing in the "oops, I forgot"
- Steve Teale (3/42) May 03 2009 Sorry, I missed that. The topic listing does get a bit fragmented at tim...
- Bill Baxter (8/30) May 03 2009 This guy looks to have a pretty good hash function:
- dsimcha (5/11) May 02 2009 On second thought...The real problem is just a strange bug in D's RTTI.
- Jarrett Billingsley (12/20) May 02 2009 h.
- Frits van Bommel (3/7) May 03 2009 No, Tango recently upgraded its hashing algorithms. This was done after ...
- Steven Schveighoffer (6/14) May 04 2009 Reading dsimcha's followup, it's probably because tango is D1 only, whic...
I've been playing around with associative array implementations in D, mostly trying to create ones that are more GC-friendly. I had written one that seemed much faster than the builtin for int and float keys but much slower for strings. I've found the reason: The builtin AAs use trees for collision resolution, while mine were using probing or chaining, and with strings there were a heck of a lot of collisions because the default string hashing algorithm is absolutely horrible. Here's a small test program to demonstrate this: import std.stdio, std.random, std.algorithm, std.range; void main() { uint[hash_t] hashFrq; bool[string] alreadyUsed; // Generate 100k random strings, compute their hashes and count the // frequency of each hash. foreach(i; 0..100_000) { string myRandString; // Make sure that no two strings are equal. do { myRandString = randString(); } while(myRandString in alreadyUsed); alreadyUsed[myRandString] = true; hash_t hash = typeid(string).getHash(cast(void*) &myRandString); hashFrq[hash]++; } auto hashes = hashFrq.keys; auto counts = hashFrq.values; sort!((a, b) { return a.at!0 < b.at!0; })(zip(counts, hashes)); foreach(i; 0..hashes.length) { writeln(hashes[i], "\t", counts[i]); } writeln(hashes.length, " unique hashes used."); } // Generates a random string of geometrically distributed length composed // of uniformly distributed characters in [A-Z, a-z]. Expected length is 20. string randString() { bool upperCase; bool keepGoing = true; string ret; uint upperA = cast(uint) 'A'; uint upperZ = cast(uint) 'Z' + 1; uint lowerA = cast(uint) 'a'; uint lowerZ = cast(uint) 'z' + 1; while(keepGoing) { upperCase = cast(bool) uniform!"[]"(0, 1); ret ~= cast(char) (upperCase ? uniform(upperA, upperZ) : uniform(lowerA, lowerZ)); keepGoing = uniform(0.0, 1.0) > 0.05; } return ret; } The result is that only about 8400 unique hashes are used, meaning 90+ % of keys cannot be stored directly in the position corresponding to their hash. Note that this is in full hash_t space, not in the modulus space typically used for AAs, so the real results are probably even worse. If the hashes were uniformly distributed across all possible values of hash_t, one only would expect about 30 collisions, meaning about 99970 unique hashes. (This is based on monte carlo, not exact combinatorics, so it's only a rough figure, but it doesn't really matter for the take-home message anyhow.) Using trees instead of some simpler O(N) mechanism to resolve collisions is very costly, in that the resulting array cannot be iterated over in O(1) auxiliary space, and that the trees require extra space overhead to maintain, meaning more GC pressure, etc. If we want to fix D's AAs, we first need to fix D's string hashing.
May 02 2009
dsimcha:we first need to fix D's string hashing.<Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile
May 02 2009
== Quote from bearophile (bearophileHUGS lycos.com)'s articledsimcha:Yeah, upon further reverse engineering, the way string hashing "works" is that it simply adds up all the character code values of the characters and returns this sum. The implementation is in dmd\src\druntime\src\compiler\dmd\typeinfo\ti_AC.d , and I've verified this both by reading the code and by implementing the same thing myself and seeing if the results are the same. If anyone has a better way to do it (which would be almost anything; I've written a few myself, although they are extremely simplistic and I'm sure anyone who actually knows what they're doing could do even better), please submit a patch. I've filed this as bug 2922, because it's a severe performance issue, so please submit there.we first need to fix D's string hashing.<Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile
May 02 2009
Hello dsimcha,== Quote from bearophile (bearophileHUGS lycos.com)'s articleIIRC something like this is common: hash_t ret = 0; foreach(c;str) { ret *= SomePrime; ret += c; }dsimcha:Yeah, upon further reverse engineering, the way string hashing "works" is that it simply adds up all the character code values of the characters and returns this sum. The implementation is in dmd\src\druntime\src\compiler\dmd\typeinfo\ti_AC.d , and I've verified this both by reading the code and by implementing the same thing myself and seeing if the results are the same. If anyone has a better way to do it (which would be almost anything; I've written a few myself, although they are extremely simplistic and I'm sure anyone who actually knows what they're doing could do even better), please submit a patch. I've filed this as bug 2922, because it's a severe performance issue, so please submit there.we first need to fix D's string hashing.<Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile
May 02 2009
On Sun, 03 May 2009 04:59:26 +0300, BCS <none anon.com> wrote:IIRC something like this is common: hash_t ret = 0; foreach(c;str) { ret *= SomePrime; ret += c; }I think that's the basis for the best general string hashing funcs around. IIRC from the university, it doesn't matter, in practice, if the multiplier is a prime number or not. So, the multiplication can be replaced with bit shifting (that's as fast as the addition operation). E.g. foreach(c; str) { ret = (ret << 4) + c; } I quickly googled around and found the algorithm djb2 which uses the multiplier 33: http://www.cse.yorku.ca/~oz/hash.html djb2 is obviously a little bit slower but it *may* give slightly better distribution (I'm not gonna run tests now to see if there is a real world difference... ;) ).
May 03 2009
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Kristian Kilpi wrote: | On Sun, 03 May 2009 04:59:26 +0300, BCS <none anon.com> wrote: |> |> IIRC something like this is common: |> |> hash_t ret = 0; |> foreach(c;str) { ret *= SomePrime; ret += c; } |> | | I think that's the basis for the best general string hashing funcs | around. IIRC from the university, it doesn't matter, in practice, if the | multiplier is a prime number or not. So, the multiplication can be | replaced with bit shifting (that's as fast as the addition operation). | E.g. | | foreach(c; str) | { | ret = (ret << 4) + c; | } | That one is very bad because it only takes into account the last few characters of the string (how many depends on the size of the hash). However, several hashing algorithms use 2^n+1 as their multiplier, which is very fast even on old/small/embedded hardware because it can be implemented as a shift and an addition. | I quickly googled around and found the algorithm djb2 which uses the | multiplier 33: | http://www.cse.yorku.ca/~oz/hash.html | | djb2 is obviously a little bit slower but it *may* give slightly better | distribution (I'm not gonna run tests now to see if there is a real | world difference... ;) ). Jerome - -- mailto:jeberger free.fr http://jeberger.free.fr Jabber: jeberger jabber.fr -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iEUEARECAAYFAkn9jwgACgkQd0kWM4JG3k/dLgCeI9UpUhxiAuw4QB0LHFCAXk00 yLgAljamIvoLLsyDhZ0OSggl9lv/M9A= =CdQB -----END PGP SIGNATURE-----
May 03 2009
On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr> wrote:| | foreach(c; str) | { | ret = (ret << 4) + c; | } | That one is very bad because it only takes into account the last few characters of the string (how many depends on the size of the hash). However, several hashing algorithms use 2^n+1 as their multiplier, which is very fast even on old/small/embedded hardware because it can be implemented as a shift and an addition.You are of course right; thanks for the correction. Lets scrap that algorithm! :)
May 03 2009
Kristian Kilpi Wrote:On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr> wrote:Would something like foreach(i, c; str) { ret ^= cast(uint) c; if (i & 1) ret <<= 1; } do a better job. That could cover the first 64 chars reasonably well, and I suspect that could cover a lot of use cases.| | foreach(c; str) | { | ret = (ret << 4) + c; | } | That one is very bad because it only takes into account the last few characters of the string (how many depends on the size of the hash). However, several hashing algorithms use 2^n+1 as their multiplier, which is very fast even on old/small/embedded hardware because it can be implemented as a shift and an addition.You are of course right; thanks for the correction. Lets scrap that algorithm! :)
May 03 2009
== Quote from Steve Teale (steve.teale britseyeview.com)'s articleKristian Kilpi Wrote:suspect that could cover a lot of use cases. I guess a lot of people missed my follow-up post on this. This is largely a typeinfo bug, not a string hashing bug. typeid(string) returns a TypeInfo reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo reference for which the hashing works well. What is needed is for typeid(typemodifier(char)[]) to return the same thing as typeid(char[]). Then again, I guess that, given how bad the generic array hashing is, it wouldn't hurt to improve that, too. If someone implements a decent meta-hashing function (i.e. take a bunch of hashes of elements of some object, like an array, a tuple, or a struct, and produce a single good hash out of these), there would be no shortage of use cases for it.On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr> wrote:Would something like foreach(i, c; str) { ret ^= cast(uint) c; if (i & 1) ret <<= 1; } do a better job. That could cover the first 64 chars reasonably well, and I| | foreach(c; str) | { | ret = (ret << 4) + c; | } | That one is very bad because it only takes into account the last few characters of the string (how many depends on the size of the hash). However, several hashing algorithms use 2^n+1 as their multiplier, which is very fast even on old/small/embedded hardware because it can be implemented as a shift and an addition.You are of course right; thanks for the correction. Lets scrap that algorithm! :)
May 03 2009
dsimcha wrote:== Quote from Steve Teale (steve.teale britseyeview.com)'s articleYes. Such a minor oversight in a face-to-face conversation takes five seconds to fix. In a newsgroup, it becomes a long thread :o).Kristian Kilpi Wrote:suspect that could cover a lot of use cases. I guess a lot of people missed my follow-up post on this.On Sun, 03 May 2009 15:33:12 +0300, J�r�me M. Berger <jeberger free.fr> wrote:Would something like foreach(i, c; str) { ret ^= cast(uint) c; if (i & 1) ret <<= 1; } do a better job. That could cover the first 64 chars reasonably well, and I| | foreach(c; str) | { | ret = (ret << 4) + c; | } | That one is very bad because it only takes into account the last few characters of the string (how many depends on the size of the hash). However, several hashing algorithms use 2^n+1 as their multiplier, which is very fast even on old/small/embedded hardware because it can be implemented as a shift and an addition.You are of course right; thanks for the correction. Lets scrap that algorithm! :)This is largely a typeinfo bug, not a string hashing bug. typeid(string) returns a TypeInfo reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo reference for which the hashing works well. What is needed is for typeid(typemodifier(char)[]) to return the same thing as typeid(char[]). Then again, I guess that, given how bad the generic array hashing is, it wouldn't hurt to improve that, too. If someone implements a decent meta-hashing function (i.e. take a bunch of hashes of elements of some object, like an array, a tuple, or a struct, and produce a single good hash out of these), there would be no shortage of use cases for it.I have exchanged email with Paul Hsieh about using his hash function in Phobos (azillionmonkeys.com) and he said his license is very permissive. I seem to recall I have introduced his hash function, with credit, in Phobos' old runtime but I am not sure whether Sean took that over. I suggest we do that in druntime for string, be they mutable or not. Andrei
May 03 2009
Andrei Alexandrescu wrote:I have exchanged email with Paul Hsieh about using his hash function in Phobos (azillionmonkeys.com) and he said his license is very permissive. I seem to recall I have introduced his hash function, with credit, in Phobos' old runtime but I am not sure whether Sean took that over. I suggest we do that in druntime for string, be they mutable or not.Unfortunately, that request is languishing in the "oops, I forgot" portion of my to-do list.
May 04 2009
dsimcha Wrote:== Quote from Steve Teale (steve.teale britseyeview.com)'s articleSorry, I missed that. The topic listing does get a bit fragmented at times. Thanks SteveKristian Kilpi Wrote:suspect that could cover a lot of use cases. I guess a lot of people missed my follow-up post on this. This is largely a typeinfo bug, not a string hashing bug. typeid(string) returns a TypeInfo reference to generic array TypeInfo, whereas typeid(char[]) returns a TypeInfo reference for which the hashing works well. What is needed is for typeid(typemodifier(char)[]) to return the same thing as typeid(char[]). Then again, I guess that, given how bad the generic array hashing is, it wouldn't hurt to improve that, too. If someone implements a decent meta-hashing function (i.e. take a bunch of hashes of elements of some object, like an array, a tuple, or a struct, and produce a single good hash out of these), there would be no shortage of use cases for it.On Sun, 03 May 2009 15:33:12 +0300, Jérôme M. Berger <jeberger free.fr> wrote:Would something like foreach(i, c; str) { ret ^= cast(uint) c; if (i & 1) ret <<= 1; } do a better job. That could cover the first 64 chars reasonably well, and I| | foreach(c; str) | { | ret = (ret << 4) + c; | } | That one is very bad because it only takes into account the last few characters of the string (how many depends on the size of the hash). However, several hashing algorithms use 2^n+1 as their multiplier, which is very fast even on old/small/embedded hardware because it can be implemented as a shift and an addition.You are of course right; thanks for the correction. Lets scrap that algorithm! :)
May 03 2009
This guy looks to have a pretty good hash function: http://www.azillionmonkeys.com/qed/hash.html He matched Bob Jenkins' One-at-a-time hash on a variety of tests and beat it on speed. --bb On Sun, May 3, 2009 at 5:19 AM, Kristian Kilpi <kjkilpi gmail.com> wrote:On Sun, 03 May 2009 04:59:26 +0300, BCS <none anon.com> wrote:.IIRC something like this is common: hash_t ret =3D 0; foreach(c;str) { ret *=3D SomePrime; ret +=3D c; }I think that's the basis for the best general string hashing funcs around=IIRC from the university, it doesn't matter, in practice, if the multipli=eris a prime number or not. So, the multiplication can be replaced with bit shifting (that's as fast as the addition operation). E.g. foreach(c; str) { =A0 =A0ret =3D (ret << 4) + c; } I quickly googled around and found the algorithm djb2 which uses the multiplier 33: http://www.cse.yorku.ca/~oz/hash.html djb2 is obviously a little bit slower but it *may* give slightly better distribution (I'm not gonna run tests now to see if there is a real world difference... ;) ).
May 03 2009
== Quote from bearophile (bearophileHUGS lycos.com)'s articledsimcha:On second thought...The real problem is just a strange bug in D's RTTI. (cast(void*) typeid(immutable(char)[])) != (cast(void*) typeid(char[])). If one uses the typeinfo for char[] instead of that for immutable(char)[], then the hash performance is actually good, around 96k unique hashes.we first need to fix D's string hashing.<Ages ago I did suggest this one in the main D newsgroup: http://www.azillionmonkeys.com/qed/hash.html Bye, bearophile
May 02 2009
On Sat, May 2, 2009 at 1:00 PM, dsimcha <dsimcha yahoo.com> wrote:The result is that only about 8400 unique hashes are used, meaning 90+ % =ofkeys cannot be stored directly in the position corresponding to their has=h.Note that this is in full hash_t space, not in the modulus space typicall=yused for AAs, so the real results are probably even worse. =A0If the hash=es wereuniformly distributed across all possible values of hash_t, one only woul=dexpect about 30 collisions, meaning about 99970 unique hashes. =A0(This i=s basedon monte carlo, not exact combinatorics, so it's only a rough figure, but=itdoesn't really matter for the take-home message anyhow.)Here's something bizarre: porting this example to Tango yields 99997 unique hashes, which is right in line with your estimate. But I think it's bizarre since, well, shouldn't druntime use _the same typeinfo code?_
May 02 2009
Jarrett Billingsley wrote:Here's something bizarre: porting this example to Tango yields 99997 unique hashes, which is right in line with your estimate. But I think it's bizarre since, well, shouldn't druntime use _the same typeinfo code?_No, Tango recently upgraded its hashing algorithms. This was done after druntime was created, and presumably before the Tango version on your machine :).
May 03 2009
On Sun, 03 May 2009 08:10:43 -0400, Frits van Bommel <fvbommel remwovexcapss.nl> wrote:Jarrett Billingsley wrote:Reading dsimcha's followup, it's probably because tango is D1 only, which doesn't use immutable(char)[]. (man, I gotta get back to working on Tango4D2) -SteveHere's something bizarre: porting this example to Tango yields 99997 unique hashes, which is right in line with your estimate. But I think it's bizarre since, well, shouldn't druntime use _the same typeinfo code?_No, Tango recently upgraded its hashing algorithms. This was done after druntime was created, and presumably before the Tango version on your machine :).
May 04 2009