
digitalmars.D.learn - Performance of hashes and associative arrays

"Raphaël Jakse" <raphael.jakse gmail.com> writes:
Hello everybody,

Ali Çehreli and I had a conversation about the right way of hashing 
values / objects. This is related to the D language when using 
associative arrays, but it applies to any language as well.

One of the points was this: suppose we have the following class:

==
class Student // class representing a student
{
     string firstName; // the first name of the student
     string lastName;  // the last name of the student

     // constructor, so that `new Student("Jean", "Martin")` below compiles
     this(string firstName, string lastName)
     {
         this.firstName = firstName;
         this.lastName = lastName;
     }
}
==

Let s be a Student:

==
auto s = new Student("Jean", "Martin");
==

We want to be able to get the hash of s. Therefore, we override the 
toHash method of the Student class:

==
class Student
{
     string firstName;
     string lastName;

     override size_t toHash() const
     {
         //...
     }
}
==

The original solution proposed was:

==
     override size_t toHash() const
     {
         return (typeid(firstName).getHash(&firstName) +
                 typeid(lastName).getHash(&lastName));
     }
==

However, with this solution, we get the same hash for new 
Student("Jean", "Martin") and for new Student("Martin", "Jean"). We ran 
an experiment on the performance of associative arrays in D when the 
hash function is badly designed (returning the same hash for too many 
values): in that case, performance is disastrous (see the end of the 
post for more information).
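
To make the collision concrete, here is a minimal self-contained sketch 
(illustrative, not part of the original code) that computes the 
solution-1 hashes directly and shows that swapping the two names does 
not change the result:

==
import std.stdio;

void main()
{
    string f1 = "Jean",   l1 = "Martin";
    string f2 = "Martin", l2 = "Jean";

    // solution 1: sum of the two field hashes; addition commutes,
    // so swapped first/last names produce the same value
    size_t h1 = typeid(f1).getHash(&f1) + typeid(l1).getHash(&l1);
    size_t h2 = typeid(f2).getHash(&f2) + typeid(l2).getHash(&l2);

    writeln(h1 == h2); // prints true
}
==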

So here is the second solution that was proposed and that avoids 
having the same hash for two different names:

==
     override size_t toHash() const
     {
         auto h = firstName ~ ":" ~ lastName; // we suppose that 
":" is not a valid character in a name
         return typeid(h).getHash(&h);
     }
==

A ":" was added so new Student("Martin", "Jean") gets a different 
hash than new Student("Mar", "tinJean")).
Let's call the following solution "second bis solution" :

==
     override size_t toHash() const
     {
         auto h = firstName ~ lastName;
         return typeid(h).getHash(&h);
     }
==

This solution supposes that being named "Mar tinJean" instead of 
"Martin Jean" is really strange and therefore quite unusual.
However, both of these solutions imply a string concatenation each time 
we need the hash.
Ali agreed that concatenating strings each time would indeed be 
inefficient. He thought we might cache the value (third solution):

==
class Student
{
     string firstName;
     string lastName;
     size_t cachedHash;

     override size_t toHash() const
     {
         if(cachedHash is null)
         {
             auto h = firstName ~ ":" ~ lastName;
             cachedHash = typeid(h).getHash(&h);
         }
         return cachedHash;
     }
}
==

This solution doesn't compromise on the "quality" of the hash and is 
fast after the first use, but it needs to keep the hash as part of the 
class. The hash is still calculated once per object used as a key in an 
associative array, even if the hash for this specific name has already 
been calculated in another object.

The questions are:
  - What is the most efficient solution, and in which cases?
  - Is it preferable to try to make the hash unique (solution 2 bis), or 
to make the hash cheaper to compute and risk, in rare cases, getting the 
same hash for two different objects (solution 2)?
  - Is solution 3 good practice? Is it preferable to the other 
solutions? When?
  - Is solution 1 preferable to the other solutions, given that first 
name / last name inversions are quite "rare"?
  - How does solution 1 compare to solution 2 bis?

It would also be interesting to have ideas for the general case, i.e. 
arbitrary objects rather than names, which have particularities like 
"first names are not often last names and vice versa", or for other 
specific situations.
The questions are oriented towards associative arrays in D, but they 
still apply to other uses of hashes.


For those who are interested, here is Ali's experiment on the 
performance of associative arrays in D depending on the design of the 
hash function (I took the liberty of modifying it slightly).

==
import std.stdio;
import std.random;
import std.string;

class Clock
{
     int hour;
     int minute;
     int second;

     static size_t opCmpCount = 0;
     static size_t opEqualsCount = 0;
     static size_t toHashCount = 0;

     override equals_t opEquals(Object o) const
     {
         ++opEqualsCount;

         auto rhs = cast(const Clock)o;

         return (rhs &&
                 (hour == rhs.hour) &&
                 (minute == rhs.minute) &&
                 (second == rhs.second));
     }

     override int opCmp(Object o) const
     {
         ++opCmpCount;

         /* Taking advantage of the automatically-maintained
          * order of the types. */
         if (typeid(this) != typeid(o)) {
             return typeid(this).opCmp(typeid(o));
         }

         auto rhs = cast(const Clock)o;
         /* No need to check whether rhs is null, because at this
          * point o is known to have the same dynamic type as this. */

         return (hour != rhs.hour
                 ? hour - rhs.hour
                 : (minute != rhs.minute
                    ? minute - rhs.minute
                    : second - rhs.second));
     }

     override string toString() const
     {
         return format("%02s:%02s:%02s", hour, minute, second);
     }

     this(int hour, int minute, int second)
     {
         this.hour = hour;
         this.minute = minute;
         this.second = second;
     }

     override hash_t toHash() const
     {
         ++toHashCount;
         return hour + minute + second; // method 1
         // return hour + 60 * minute + 3600 * second; // method 2
     }

     static void printCounts()
     {
         writefln("opEquals: %8s", opEqualsCount);
         writefln("opCmp   : %8s", opCmpCount);
         writefln("toHash  : %8s", toHashCount);
     }
}

Clock randomClock()
{
     return new Clock(uniform(0, 24), uniform(0, 60), uniform(0, 60));
}

void main()
{
     enum count = 10_000;
     Clock[Clock] aa;

     foreach (i; 0 .. count) {
         auto entry = randomClock();
         aa[entry] = entry;
     }

     foreach (i; 0 .. count) {
         auto result = (randomClock() in aa);
     }

     Clock.printCounts();
}
==

We both get the following results for method 1:

opEquals:        0
opCmp   :  1438438
toHash  :    20000


and the following results for method 2:

opEquals:        0
opCmp   :     1681
toHash  :    20000

Apart from the fact that associative arrays in gdc and dmd don't use 
the opEquals method, we see that method 1 implies approx. 856 times 
more comparisons than method 2. I calculated that method 1 gives 139 
different hashes for 80063 different elements, which means each hash 
represents an average of approx. 576 elements. Method 2 gives exactly 
one hash per element.
This seems to be an extreme case; we might think the results would be 
completely different for a function whose hashes only "sometimes" 
represent 2 elements.
Nov 06 2012
next sibling parent reply "Dan" <dbdavidson yahoo.com> writes:
On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse 
wrote:

 ==
     override size_t toHash() const
     {
         return (typeid(firstName).getHash(&firstName) +
                 typeid(lastName).getHash(&lastName));
     }
 ==
Isn't the real problem the addition? You want to mix the bits together 
in a consistent way that does not have associativity. I've seen things 
like:

    result = 37 * first + 17 * second

I've incorporated this functionality in a toHash that can be mixed in. 
Any feedback on it would be great. See:

http://forum.dlang.org/post/tbjrqaogxegtyexnfags forum.dlang.org
https://github.com/patefacio/d-help/blob/master/d-help/opmix/mix.d

It only supports structs, but the hashing could be made to support 
classes. It allows for the following code for your example. It provides 
toHash, opEquals and opCmp. Below, the hash codes for s1 and s2 are 
different, the instances are not equal, and s1 is less than s2, which 
is what you want.

==
import std.stdio;
import opmix.mix;

struct Student // class representing a student
{
    mixin(HashSupport);
    string firstName; // the first name of the student
    string lastName;  // the last name of the student
}

void main()
{
    auto s1 = Student("Jean", "Martin");
    auto s2 = Student("Martin", "Jean");
    writeln(s1.toHash(), " vs ", s2.toHash());
    writeln(s1 == s2);
    writeln(s1 < s2);
}
==
 However, with this solution, we get the same hash for new 
 Student("Jean", "Martin") and new Student("Martin", "Jean"). We 
 did an experiment of performance of associative arrays in D 
 when the hash function is not well designed (returning the same 
 hash for too many values). When the hash function returns the 
 same hash for too many values, performance are dramatic (See 
 the end of the post for more information).
This makes sense and is not particular to D.
 Ali agreed that concatenating strings each time would indeed be 
 inefficient. He thought we might cache the value (third 
 solution) :
Caching the hash code is interesting, and on large classes it could 
save you. But isn't the signature shown const? Would that compile? 
Also, what if you change the strings - you get the wrong hash? I 
suppose you could limit writes to properties and null the hash code on 
any write.
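
A minimal sketch of that idea (the property names, the size_t.max 
sentinel standing in for "no cached value", and the multiply-by-31 
mixing are illustrative choices, not something from the thread; it also 
avoids the concatenation by combining the two field hashes):

==
class Student
{
    private string firstName_;
    private string lastName_;
    private size_t cachedHash = size_t.max; // size_t.max means "not computed yet"

    this(string first, string last) { firstName_ = first; lastName_ = last; }

    @property string firstName() const { return firstName_; }
    @property void firstName(string value) { firstName_ = value; cachedHash = size_t.max; }

    @property string lastName() const { return lastName_; }
    @property void lastName(string value) { lastName_ = value; cachedHash = size_t.max; }

    override size_t toHash() @trusted nothrow
    {
        if (cachedHash == size_t.max)
        {
            // asymmetric combination, so swapped names hash differently
            cachedHash = typeid(firstName_).getHash(&firstName_) * 31
                       + typeid(lastName_).getHash(&lastName_);
        }
        return cachedHash;
    }
}
==

The exact signature Object.toHash expects has varied between compiler 
releases, so the attributes above may need adjusting for older versions.
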
 Questions are :
  - what is the most efficient solution, and in which case ?
No string concatenation is good. I think a single pass over all the important data (which in most cases is all the data) is the goal.
 This seems to be an extreme case, we might think results would 
 be completely different from a function giving hashes that 
 "sometimes" represents 2 elements.
Very true. If I increase the count to 10_000_000 I see opCmp: 1913600 
for the smart method and (several minutes later) opCmp: 907216764 for 
the simple addition (method 1). In this case you know something special 
about the data and can take advantage of it.

If I run the example using mixin(HashSupport) it does opCmp: 7793499, 
which is about 4 times as many compares as the smart one. The simple 
addition does 474 times as many compares as the smart one - so it is 
clearly very bad.

So, if you know something special about the data, like the fact that it 
can easily be converted into a single number such as seconds, by all 
means include that. But remember, the next time you add something to 
the class, if you don't have some "automated" way of pulling in info 
from all fields, you need to revisit the hash (and opCmp and opEquals).

Thanks
Dan
Nov 07 2012
Raphaël Jakse <raphael.jakse gmail.com> writes:
Hello,

Thanks for this complete answer. I will take a look at your code.
Additionally, Ali gave me a really interesting link about hashing, good 
practices, what is efficient, etc. If you haven't read it, it might 
interest you. Here it is:

http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

On 07/11/2012 13:10, Dan wrote:
 On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:
 [...]
 Ali agreed that concatenating strings each time would indeed be
 inefficient. He thought we might cache the value (third solution) :
Interesting about caching the hashcode and on large classes could save you. But isn't the signature shown const? Would that compile?
Indeed, I didn't test this code; I just wanted to make explicit what I meant. Thank you for pointing it out.
 Also, what
 if you change the strings - you get the wrong hash? I suppose you could
 limit writes to properties and null the hashcode on any write.
Yes, good question and good idea; I think I would do it this way. I hadn't thought about it.
 Questions are :
  - what is the most efficient solution, and in which case ?
No string concatenation is good. I think a single pass on all important data (in most cases is all the data) is the goal.
I'm not sure I understood correctly. You meant that string concatenation is good, right? I was thinking about a hash function that would take several arguments and hash them together. That would let the hash take more than one string into account while avoiding concatenation.
[...]

 Thanks
 Dan
Nov 09 2012
parent reply "Dan" <dbdavidson yahoo.com> writes:
On Saturday, 10 November 2012 at 07:55:18 UTC, Raphaël Jakse 
wrote:
 Hello,

 Thanks for this complete answer. I will take a look to your 
 code.
Ok - good. I've been using 2.061 which I just realized allows "dup" on an associative array, a feature which was not available in 2.060. So mixin(Dup) and the unit tests require 2.061.
 If you didn't read it, it might interest you. Here it is:
I had not seen it and will read - thanks.
 Questions are :
 - what is the most efficient solution, and in which case ?
No string concatenation is good. I think a single pass on all important data (in most cases is all the data) is the goal.
I'm not sure I understood well. You wanted to say that string contatenations are good, right ?
I just meant that string concatenation is likely unnecessary. Imagine two one-megabyte strings. It is easy to concatenate them and call the string hash function on the result, but you have to create a 2 MB string first. Alternatively, you could hash each and combine the hash codes in some consistent way.
 I was thinking about a hash function that would take several 
 arguments and hash them together. That would let take in 
 account more than one string in the hash while avoiding 
 concatenation.
Yes. Thanks Dan
Nov 10 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 11/10/2012 05:17 AM, Dan wrote:

 I just meant string concatenation is likely unnecessary. Imagine two one
 megabyte strings. It is easy to concat them and call the string hash
 function on the result but you have to create a 2Mb string first.
Just a random thought: std.range.chain can obviate that copy:

http://dlang.org/phobos/std_range.html#chain

Ali
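
For instance, a sketch of hashing straight over such a chain, without 
building the concatenated string (the multiply-by-31 accumulation and 
the helper names are illustrative, not something from the thread):

==
import std.range : chain;
import std.stdio;

size_t hashBytes(R)(R bytes)
{
    // simple single-pass accumulation over any range of bytes
    size_t hash = 0;
    foreach (b; bytes)
        hash = hash * 31 + b;
    return hash;
}

size_t nameHash(string firstName, string lastName)
{
    // no temporary copy: chain iterates the three slices in place
    return hashBytes(chain(cast(const(ubyte)[]) firstName,
                           cast(const(ubyte)[]) ":",
                           cast(const(ubyte)[]) lastName));
}

void main()
{
    writeln(nameHash("Jean", "Martin"), " vs ", nameHash("Martin", "Jean"));
}
==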
Nov 10 2012
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 11/07/2012 07:38 AM, "Raphaël.Jakse"
<raphael.jakse gmail.com>" puremagic.com 
wrote:
 We want to be able to get the hash of s. Therefore, we re-implement the toHash
 method of the Student class :
OK, now I'm curious. Assuming I don't write a custom re-implementation, how would a custom struct or class be hashed? (What about a tuple?) I ask because I'm considering associative arrays where the key is a custom class or tuple as part of a current coding project.
Nov 10 2012
next sibling parent "Dan" <dbdavidson yahoo.com> writes:
On Saturday, 10 November 2012 at 18:18:07 UTC, Joseph Rushton 
Wakeling wrote:
 On 11/07/2012 07:38 AM, "Raphaël.Jakse" 
 <raphael.jakse gmail.com>" puremagic.com wrote:
 We want to be able to get the hash of s. Therefore, we 
 re-implement the toHash
 method of the Student class :
OK, now I'm curious. Assuming I don't write a custom re-implementation, how would a custom struct or class be hashed? (What about a tuple?) I ask because I'm considering associative arrays where the key is a custom class or tuple as part of a current coding project.
Not sure I understand the question. But here is how I'm doing it. No 
guarantees, but the output looks promising. Code follows the output.

Thanks
Dan

-----------------------------
true
false
false
true
7F053B2DCFC0
20883845 vs 20883845
-----------------------------

==
import std.stdio;
import std.traits;
import std.typecons;
import opmix.mix;

struct S
{
    mixin(HashSupport);
    alias Tuple!(int, char, string) X;
    X x;
    char[] mutable;
}

void main()
{
    S s1 = { tuple(3, 'a', "foo".idup), ['a','b'].dup };
    S s2 = { tuple(3, 'a', "foo".idup), ['a','b'].dup };
    writeln(s1 == s2);
    s1.x[0]++;
    writeln(s1 == s2);
    writeln(s1 < s2);
    writeln(s2 < s1);
    writeln(s1 in [s1: 3]);
    s2.x[0]++;
    writeln(s1.toHash(), " vs ", s2.toHash());
}
==
Nov 10 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 11/10/2012 05:37 AM, Joseph Rushton Wakeling wrote:
 On 11/07/2012 07:38 AM, "Raphaël.Jakse"
 <raphael.jakse gmail.com>" puremagic.com wrote:
 We want to be able to get the hash of s. Therefore, we re-implement
 the toHash
 method of the Student class :
OK, now I'm curious. Assuming I don't write a custom re-implementation, how would a custom struct or class be hashed? (What about a tuple?) I ask because I'm considering associative arrays where the key is a custom class or tuple as part of a current coding project.
For classes, because they are reference types, it is the object 
identity that determines the hash value (probably the address of the 
actual object). As a result, even when the values of the members are 
the same, two objects have different hash values.

For structs, because they are value types, it is the bytes that make up 
the object that determine the hash value.

But beware: the associative array members of structs are not hashed by 
the values of their elements. (There was a long discussion on this 
topic recently on the main newsgroup.)

Ali
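
A quick way to see the difference (the type names here are illustrative 
only):

==
import std.stdio;

class CPoint { int x, y; this(int x, int y) { this.x = x; this.y = y; } }
struct SPoint { int x, y; }

void main()
{
    auto c1 = new CPoint(1, 2);
    auto c2 = new CPoint(1, 2);
    // classes: hashed by identity, so two distinct objects differ
    // even though their members are equal
    writeln(typeid(c1).getHash(&c1) == typeid(c2).getHash(&c2));

    auto p1 = SPoint(1, 2);
    auto p2 = SPoint(1, 2);
    // structs: hashed from the bytes of the value, so these match
    writeln(typeid(p1).getHash(&p1) == typeid(p2).getHash(&p2));
}
==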
Nov 10 2012
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 11/11/2012 02:35 AM, Ali Çehreli wrote:
 For classes, because they are reference types, it is the object identity that
 determines the hash value (probably the pointer of the actual object). As a
 result, even when the values of the members are the same, two objects have
 different hash values.

 For structs, because they are value types, it is the bytes that make up the
 object determine the hash value.
Ahh, clear. So where classes are concerned there's a definite need to 
override toHash if you want the _values_ to be the basis of the 
association, rather than the object itself ... this would also have 
potential implications for memory blow-up where a class is used as the 
key type, no?
 But beware: The associative array members of structs are not hashed by values
of
 their elements. (There was a long discussion recently on the the main newsgroup
 on this topic.)
I do recall that discussion, and I'll bear that in mind. So far the 
only practical consideration I've had was for a Tuple!() to be the key, 
rather than an actual struct or class; as it happens I've avoided that 
so far, and I think I may continue to do so while there are still 
issues around hashing.

The more pressing reason for avoiding associative arrays, as it 
happens, was simply that I was getting unacceptably large memory usage 
as a result: I'd got some data stored in the form of byte[size_t] 
(really this should have been a set of size_t's, but this seemed 
theoretically an acceptable stop gap while there's no set class in 
Phobos), and for some reason that was generating memory consumption 
twice that of having instead a size_t[] regular array. I guess perhaps 
because although a byte should mean just that, in practice each value 
in the byte[size_t] associative array was taking up a whole word?

It's a shame, because functionally it was useful to be able to do 
things of the form if(i in list), but memory-wise it just didn't work 
for the size of data I'm dealing with.
Nov 11 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 11/11/2012 01:11 AM, Joseph Rushton Wakeling wrote:

 So where classes are concerned there's a definite need to
 override opHash if you want the _values_ to be the basis of the
 association, rather than the object itself ... this would also have
 potential implications for memory blow-up where a class is used as the
 key type, no?
I am not sure that I understand. The associative array would still 
contain just class variables (references), not class objects. Every 
time the hash needed to be calculated, the AA would call toHash() on 
the variable, which in turn would use the values of the members of the 
object.

As another case, a single class object is owned by the runtime, but 
many class variables referring to it can exist in multiple AAs.
 The more pressing reason for avoiding associative arrays as it happens
 was simply that I was getting unacceptably large memory usage as a
 result: I'd got some data stored in the form of byte[size_t] (really
 this should have been a set of size_t's, but this seemed theoretically
 an acceptable stop gap while there's no set class in Phobos),
I think std.container.RedBlackTree can take that responsibility, but if you don't need the elements to be in any particular order, then it may be seen as overkill as well.
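
For what it's worth, a small sketch of RedBlackTree playing the role of 
a set of size_t (assuming that fits the use case described above):

==
import std.container;
import std.stdio;

void main()
{
    auto seen = redBlackTree!size_t(1, 5, 42);
    seen.insert(7);

    writeln(5 in seen);  // supports the same `in` test as an associative array
    writeln(6 in seen);
}
==
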
 and for
 some reason that was generating memory consumption twice that of having
 instead a size_t[] regular array. I guess perhaps because although a
 byte should mean just that, in practice each value in the byte[size_t]
 associative array was taking up a whole word?
I don't know the actual implementation of AAs by dmd, but the hash buckets are probably singly-linked lists. Even if your data had no collisions, each byte would have to live in a singly-linked list.
 It's a shame, because functionally it was useful to be able to do things
 of the form if(i in list), but memory-wise it just didn't work for the
 size of data I'm dealing with.
If all you need is 'if(i in list)', then a HashSet may help. I haven't 
used one myself, but apparently there are such data structures in other 
languages.

Ali
Nov 11 2012
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 11/11/2012 07:40 PM, Ali Çehreli wrote:
 I think std.container.RedBlackTree can take that responsibility but if you
don't
 need the elements to be in any particular order, then it may be seen as an
 overkill as well.
I know, and I probably should give that a go, but as I've got working code right now I'm moving on to something else and will come back to it later.
 If all you need is 'if(i in list)', then a HashSet may help. I haven't used one
 myself but apparently there are such data structures in other languages.

Indeed, and I was wondering whether it might be worth writing an implementation for std.container. As it stands I don't really feel familiar enough with the underlying concepts, but I may revisit that idea once I'm done with a few other things.
Nov 11 2012
Manfred Nowak <svv1999 hotmail.com> writes:
Ali Çehreli wrote:

 but the hash buckets are probably singly-linked lists
They are unbalanced binary search trees, which indeed degenerate into singly-linked lists on sorted input.
Nov 11 2012
Manfred Nowak <svv1999 hotmail.com> writes:
Jakse wrote:

 It would also be interesting to have ideas for the general
 case 
Yes, ideas might be interesting. :-)

A root of "good" hashing is incorporated in method 2 of your posting: 
spread the values produced by the hash function uniformly over the 
interval size_t.min .. size_t.max.

Therefore: compute the hash value for a given key by multiplying each 
byte of the key with a factor, which increases from the byte with 
lowest significance to the byte with highest significance; of course, 
add the results.

Remarks:
a) Because of the general case: assume that the original key contains 
much redundancy. Therefore do not use the bytes of the original key, 
but compute an (at least close to) lossless compression of the original 
key.
b) The factors in the factor list should be primes, for spreading the 
hash value uniformly over the intended range.
c) The quotient of two consecutive primes in the factor list should be 
greater than 256, because the used key might not contain any redundancy.

-manfred
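
A rough sketch of one reading of that recipe (the factor values below 
are placeholders chosen only so that successive ratios are about 256; a 
real implementation would pick primes as remarks b) and c) require, and 
would not cycle through such a short list):

==
// Sum of key bytes, each multiplied by a factor that grows with the
// byte's position. size_t arithmetic simply wraps on overflow.
size_t spreadHash(const(ubyte)[] key)
{
    static immutable size_t[3] factors = [257, 65_809, 16_850_867];
    size_t hash = 0;
    foreach (i, b; key)
        hash += b * factors[i % factors.length];
    return hash;
}

void main()
{
    import std.stdio : writeln;
    writeln(spreadHash(cast(const(ubyte)[]) "Jean:Martin"));
}
==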
Nov 11 2012
Raphaël Jakse <raphael.jakse gmail.com> writes:
On 11/11/2012 12:38, Manfred Nowak wrote:
 Jakse wrote:

 It would also be interesting to have ideas for the general
 case
[...]
Thanks for this interesting answer. Is compressing for performance reasons? Is it more interesting to compress and then hash than to just hash?
Nov 17 2012
Manfred Nowak <svv1999 hotmail.com> writes:
Raphaël Jakse wrote:

 Is compressing for performance reasons?
Hopefully. Because in the general case the distribution of the keys is unknown, no function used for computing the hash value can be guaranteed to spread the hash values uniformly over the hash interval. Compressing would have a negative effect on performance if more time were needed for compressing and decompressing than was lost resolving conflicting hash values.
 is it more interesting to compress and then hash than just
 hash ?
The only purpose of compressing is to create the special case of close to no redundancy in the keys used for hashing. If this special case is already confirmed, or if it is known that no compression will diminish the computing time: just hash.

-manfred
Nov 17 2012