digitalmars.D - Superhash buried in Druntime does super work.
- Michael Rynn (79/79) Jul 28 2010 I noted a little while ago here a mention of the superhash, someone was
- bearophile (8/10) Jul 28 2010 Maybe that someone was me :-)
- Sean Kelly (1/1) Jul 28 2010 You might also want to look into MurmurHash. It's roughly twice as fast...
I noted a little while ago here a mention of the superhash, someone was curious about it, and I found it hidden in the D runtime library Superhash , published online, From "Hash Functions" , Paul Hsieh http://www.azillionmonkeys.com/qed/hash.htm. Since I had an almost ready made hash distribution and benchmark test in the dsource/projects/aa , and I somehow came to look at it again, I tried it out. I compared three hashing functions for a test class "WrapString", wrapping a D2 string. "TypeInfo.getHash" - the simple string hash used by D version(all). (uses number 11) "superHash" - my transcription of the above Hsieh string function into D. then I noticed that D uses this very same function in hashOf () in the module rt.util.hash. "Java hash" - the simple string hash as used by Java. (uses number 31) I also varied wether the hash was pre-caculated and cached, or was recalculated each toHash() call, to get an indication for the calculation overhead difference. The benchmark was compiled as release and with optimize flag. With hash functions, the distribution of the data makes a difference, so randomly constructed data behaves differently from more 'patterned' data. Patterned data might be expected to perform worse on not so good hash functions. Each run of the benchmark program used the same stored randomized data set of 1,000,000 valid strings of random length between 1 - 20 printable random characters distributed uniformly between 0x20 and 0x7E. The hashOf was passed the string length as a hash seed value. So this data maybe is not patterned enough to show big differences. The distribution bucket chain lengths... TypeInfo.getHash Java hash 31 SuperHash (hashOf) 1 50.90 52.07 52.97 2 31.03 33.02 33.70 3 10.11 11.23 10.64 4 2.89 2.91 2.25 5 1.41 0.62 0.39 6 1.02 0.14 0.04 7 0.72 0.02 0.00 8 0.66 0.00 0.00 9 0.67 10 0.38 11 0.13 12 0.04 13 0.01 14 0.00 Now for some hashtable timings. Using cached hash calculated by... (average of 10 / standard deviation ) TypeInfo.getHash 11 Java hash 31 SuperHash (hashOf) insert 2.36 / 0.113 2.50 / 0.173 2.37 / 0.089 lookup 3.19 / 0.232 2.91 / 0.087 2.73 / 0.107 The super hash gives the most random distribution of buckets and the best lookup times. The simple string hash using number 31 does better than using number 11. The differences are not very big. Insertion times are no different. But the distribution seems to affect lookups in the expected direction, and this replicates well enough. See how hash recalculation affects results. A hash calculation is done once for each insertion and lookup. TypeInfo.getHash 11 Java hash 31 Superhash (hashOf) insert 2.98 / 0.080 3.11 / 0.212 3.49 / 0.184 lookup 4.12 / 0.094 4.24 / 0.336 4.26 / 0.114 The difference in times between getHash 11 and Java hash 31, do not seem too significant. Maybe multiplying by bigger numbers takes a bit longer, and super hashing around takes a bit more longer. If I substituted 11 for 31 in the WrapString class toHash function, the times for 11 and 31 were equivalent. My SuperHash transcription also performed much the same as the rt.util version. For this data test set, there is not much of a time difference to choose between the hash functions. The Superhash hashOf does more work and is expected to take relatively longer, for larger slabs of data. On todays CPUs , most of the time is probably just spent fetching each character of the string from memory, before starting to hash the bits around. All the same, for the string hash, on this dataset, for both bucket distribution and reasonable hash time, and almost the same code, using 31 instead of 11 as the hash multiplier seems to be a benefit for bucket distribution without a speed penalty. If anyone can find or describe a dataset on which hash 31 performs worse than hash 11, please let me know. Back to whatever I was doing before now..
Jul 28 2010
Michael Rynn:I noted a little while ago here a mention of the superhash, someone was curious about it,Maybe that someone was me :-) I suggest you to: - time the code again using true words, that are less random, so they enjoy a stronger hash function (as the superhash); - show us all the code you have used in your timings; - If you can to use ldc D1 compiler, because it optimizes much better than DMD, so it can give more realistic comparisons of the hash functions. Bye, bearophile
Jul 28 2010
You might also want to look into MurmurHash. It's roughly twice as fast a SuperFastHash on x86 and an order of magnitude slower on SPARC. It's inspired by SuperFastHash, so the basic approach is similar.
Jul 28 2010