
digitalmars.D.learn - Performance of hashes and associative arrays

"Raphaël Jakse" <raphael.jakse gmail.com> writes:
Hello everybody,

Ali Çehreli and I had a conversation about the right way of hashing 
values / objects. This is related to the D language when using 
associative arrays, but it applies to any language as well.

One of the points was this: suppose we have the following class:

==
class Student // class representing a student
{
     string firstName; // the first name of the student
     string lastName;  // the last name of the student

     // constructor, so that `new Student("Jean", "Martin")` below compiles
     this(string firstName, string lastName)
     {
         this.firstName = firstName;
         this.lastName = lastName;
     }
}
==

Let s be a Student:

==
auto s = new Student("Jean", "Martin");
==

We want to be able to get the hash of s. Therefore, we override the 
toHash method of the Student class:

==
class Student
{
     string firstName;
     string lastName;

     override size_t toHash() const
     {
         //...
     }
}
==

The original solution proposed was:

==
     override size_t toHash() const
     {
         return (typeid(firstName).getHash(&firstName) +
                 typeid(lastName).getHash(&lastName));
     }
==

However, with this solution, we get the same hash for new 
Student("Jean", "Martin") and for new Student("Martin", "Jean"). We ran 
an experiment on the performance of associative arrays in D when the 
hash function is badly designed (returning the same hash for too many 
values): in that case, performance is disastrous (see the end of the 
post for more information).
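
To make the collision concrete, here is a minimal self-contained sketch 
(illustrative, not part of the original code) that computes the 
solution-1 hashes directly and shows that swapping the two names does 
not change the result:

==
import std.stdio;

void main()
{
    string f1 = "Jean",   l1 = "Martin";
    string f2 = "Martin", l2 = "Jean";

    // solution 1: sum of the two field hashes; addition commutes,
    // so swapped first/last names produce the same value
    size_t h1 = typeid(f1).getHash(&f1) + typeid(l1).getHash(&l1);
    size_t h2 = typeid(f2).getHash(&f2) + typeid(l2).getHash(&l2);

    writeln(h1 == h2); // prints true
}
==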

So here is the second solution that was proposed and that avoids 
having the same hash for two different names:

==
     override size_t toHash() const
     {
         auto h = firstName ~ ":" ~ lastName; // we suppose that 
":" is not a valid character in a name
         return typeid(h).getHash(&h);
     }
==

A ":" was added so new Student("Martin", "Jean") gets a different 
hash than new Student("Mar", "tinJean")).
Let's call the following solution "second bis solution" :

==
     override size_t toHash() const
     {
         auto h = firstName ~ lastName;
         return typeid(h).getHash(&h);
     }
==

This solution supposes that being named "Mar tinJean" instead of 
"Martin Jean" is really strange and therefore quite unusual.
However, both of these solutions imply a string concatenation each time 
we need the hash.
Ali agreed that concatenating strings each time would indeed be 
inefficient. He thought we might cache the value (third solution):

==
class Student
{
     string firstName;
     string lastName;
     size_t cachedHash;

     override size_t toHash() const
     {
         if(cachedHash is null)
         {
             auto h = firstName ~ ":" ~ lastName;
             cachedHash = typeid(h).getHash(&h);
         }
         return cachedHash;
     }
}
==

This solution doesn't compromise on the "quality" of the hash and is 
fast after the first use, but it needs to keep the hash as part of the 
class. The hash is still calculated once per object used as a key in an 
associative array, even if the hash for this specific name has already 
been calculated in another object.

The questions are:
  - What is the most efficient solution, and in which cases?
  - Is it preferable to try to make the hash unique (solution 2 bis), or 
to make the hash cheaper to compute and risk, in rare cases, getting the 
same hash for two different objects (solution 2)?
  - Is solution 3 good practice? Is it preferable to the other 
solutions? When?
  - Is solution 1 preferable to the other solutions, given that first 
name / last name inversions are quite "rare"?
  - How does solution 1 compare to solution 2 bis?

It would also be interesting to have ideas for the general case, i.e. 
arbitrary objects rather than names, which have particularities like 
"first names are not often last names and vice versa", or for other 
specific situations.
The questions are oriented towards associative arrays in D, but they 
still apply to other uses of hashes.


For those who are interested, here is Ali's experiment on the 
performance of associative arrays in D depending on the design of the 
hash function (I took the liberty of modifying it slightly).

==
import std.stdio;
import std.random;
import std.string;

class Clock
{
     int hour;
     int minute;
     int second;

     static size_t opCmpCount = 0;
     static size_t opEqualsCount = 0;
     static size_t toHashCount = 0;

     override equals_t opEquals(Object o) const
     {
         ++opEqualsCount;

         auto rhs = cast(const Clock)o;

         return (rhs &&
                 (hour == rhs.hour) &&
                 (minute == rhs.minute) &&
                 (second == rhs.second));
     }

     override int opCmp(Object o) const
     {
         ++opCmpCount;

         /* Taking advantage of the automatically-maintained
          * order of the types. */
         if (typeid(this) != typeid(o)) {
             return typeid(this).opCmp(typeid(o));
         }

         auto rhs = cast(const Clock)o;
         /* No need to check whether rhs is null, because at this
          * point o is known to have the same dynamic type as this. */

         return (hour != rhs.hour
                 ? hour - rhs.hour
                 : (minute != rhs.minute
                    ? minute - rhs.minute
                    : second - rhs.second));
     }

     override string toString() const
     {
         return format("%02s:%02s:%02s", hour, minute, second);
     }

     this(int hour, int minute, int second)
     {
         this.hour = hour;
         this.minute = minute;
         this.second = second;
     }

     override hash_t toHash() const
     {
         ++toHashCount;
         return hour + minute + second; // method 1
         // return hour + 60 * minute + 3600 * second; // method 2
     }

     static void printCounts()
     {
         writefln("opEquals: %8s", opEqualsCount);
         writefln("opCmp   : %8s", opCmpCount);
         writefln("toHash  : %8s", toHashCount);
     }
}

Clock randomClock()
{
     return new Clock(uniform(0, 24), uniform(0, 60), uniform(0, 60));
}

void main()
{
     enum count = 10_000;
     Clock[Clock] aa;

     foreach (i; 0 .. count) {
         auto entry = randomClock();
         aa[entry] = entry;
     }

     foreach (i; 0 .. count) {
         auto result = (randomClock() in aa);
     }

     Clock.printCounts();
}
==

We both get the following results for method 1:

opEquals:        0
opCmp   :  1438438
toHash  :    20000


and the following results for method 2:

opEquals:        0
opCmp   :     1681
toHash  :    20000

Apart from the fact that associative arrays in gdc and dmd don't use 
the opEquals method, we see that method 1 implies approx. 856 times 
more comparisons than method 2. I calculated that method 1 gives 139 
different hashes for 80063 different elements, which means each hash 
represents an average of approx. 576 elements. Method 2 gives exactly 
one hash per element.
This seems to be an extreme case; we might think the results would be 
completely different for a function whose hashes only "sometimes" 
represent 2 elements.
Nov 06 2012
next sibling parent reply "Dan" <dbdavidson yahoo.com> writes:
On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse 
wrote:

 ==
     override size_t toHash() const
     {
         return (typeid(firstName).getHash(&firstName) +
                 typeid(lastName).getHash(&lastName));
     }
 ==
Isn't the real problem the addition? You want to mix the bits together 
in a consistent way that does not have associativity. I've seen things 
like:

    result = 37 * first + 17 * second

I've incorporated this functionality in a toHash that can be mixed in. 
Any feedback on it would be great. See:

http://forum.dlang.org/post/tbjrqaogxegtyexnfags forum.dlang.org
https://github.com/patefacio/d-help/blob/master/d-help/opmix/mix.d

It only supports structs, but the hashing could be made to support 
classes. It allows for the following code for your example. It provides 
toHash, opEquals and opCmp. Below, the hash codes for s1 and s2 are 
different, the instances are not equal, and s1 is less than s2, which 
is what you want.

==
import std.stdio;
import opmix.mix;

struct Student // class representing a student
{
    mixin(HashSupport);
    string firstName; // the first name of the student
    string lastName;  // the last name of the student
}

void main()
{
    auto s1 = Student("Jean", "Martin");
    auto s2 = Student("Martin", "Jean");
    writeln(s1.toHash(), " vs ", s2.toHash());
    writeln(s1 == s2);
    writeln(s1 < s2);
}
==
 However, with this solution, we get the same hash for new 
 Student("Jean", "Martin") and new Student("Martin", "Jean"). We 
 did an experiment of performance of associative arrays in D 
 when the hash function is not well designed (returning the same 
 hash for too many values). When the hash function returns the 
 same hash for too many values, performance are dramatic (See 
 the end of the post for more information).
This makes sense and is not particular to D.
 Ali agreed that concatenating strings each time would indeed be 
 inefficient. He thought we might cache the value (third 
 solution) :
Caching the hash code is interesting, and on large classes it could 
save you. But isn't the signature shown const? Would that compile? 
Also, what if you change the strings - you get the wrong hash? I 
suppose you could limit writes to properties and null the hash code on 
any write.
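
A minimal sketch of that idea (the property names, the size_t.max 
sentinel standing in for "no cached value", and the multiply-by-31 
mixing are illustrative choices, not something from the thread; it also 
avoids the concatenation by combining the two field hashes):

==
class Student
{
    private string firstName_;
    private string lastName_;
    private size_t cachedHash = size_t.max; // size_t.max means "not computed yet"

    this(string first, string last) { firstName_ = first; lastName_ = last; }

    @property string firstName() const { return firstName_; }
    @property void firstName(string value) { firstName_ = value; cachedHash = size_t.max; }

    @property string lastName() const { return lastName_; }
    @property void lastName(string value) { lastName_ = value; cachedHash = size_t.max; }

    override size_t toHash() @trusted nothrow
    {
        if (cachedHash == size_t.max)
        {
            // asymmetric combination, so swapped names hash differently
            cachedHash = typeid(firstName_).getHash(&firstName_) * 31
                       + typeid(lastName_).getHash(&lastName_);
        }
        return cachedHash;
    }
}
==

The exact signature Object.toHash expects has varied between compiler 
releases, so the attributes above may need adjusting for older versions.
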
 Questions are :
  - what is the most efficient solution, and in which case ?
No string concatenation is good. I think a single pass over all the important data (which in most cases is all the data) is the goal.
 This seems to be an extreme case, we might think results would 
 be completely different from a function giving hashes that 
 "sometimes" represents 2 elements.
Very true. If I increase the count to 10_000_000 I see opCmp: 1913600 
for the smart method and (several minutes later) opCmp: 907216764 for 
the simple addition (method 1). In this case you know something special 
about the data and can take advantage of it.

If I run the example using mixin(HashSupport) it does opCmp: 7793499, 
which is about 4 times as many compares as the smart one. The simple 
addition does 474 times as many compares as the smart one - so it is 
clearly very bad.

So, if you know something special about the data, like the fact that it 
can easily be converted into a single number such as seconds, by all 
means include that. But remember, the next time you add something to 
the class, if you don't have some "automated" way of pulling in info 
from all fields, you need to revisit the hash (and opCmp and opEquals).

Thanks
Dan
Nov 07 2012
Raphaël Jakse <raphael.jakse gmail.com> writes:
Hello,

Thanks for this complete answer. I will take a look at your code.
Additionally, Ali gave me a really interesting link about hashing, good 
practices, what is efficient, etc. If you haven't read it, it might 
interest you. Here it is:

http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

On 07/11/2012 13:10, Dan wrote:
 On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:
 [...]
 Ali agreed that concatenating strings each time would indeed be
 inefficient. He thought we might cache the value (third solution) :
Interesting about caching the hashcode and on large classes could save you. But isn't the signature shown const? Would that compile?
Indeed, I didn't test this code; I just wanted to make explicit what I meant. Thank you for pointing it out.
 Also, what
 if you change the strings - you get the wrong hash? I suppose you could
 limit writes to properties and null the hashcode on any write.
Yes, good question and good idea; I think I would do it this way. I hadn't thought about it.
 Questions are :
  - what is the most efficient solution, and in which case ?
No string concatenation is good. I think a single pass on all important data (in most cases is all the data) is the goal.
I'm not sure I understood correctly. You meant that string concatenation is good, right? I was thinking about a hash function that would take several arguments and hash them together. That would let the hash take more than one string into account while avoiding concatenation.
[...]

 Thanks
 Dan
Nov 09 2012
parent reply "Dan" <dbdavidson yahoo.com> writes:
On Saturday, 10 November 2012 at 07:55:18 UTC, Raphaël Jakse 
wrote:
 Hello,

 Thanks for this complete answer. I will take a look to your 
 code.
Ok - good. I've been using 2.061 which I just realized allows "dup" on an associative array, a feature which was not available in 2.060. So mixin(Dup) and the unit tests require 2.061.
 If you didn't read it, it might interest you. Here it is:
I had not seen it and will read - thanks.
 Questions are :
 - what is the most efficient solution, and in which case ?
No string concatenation is good. I think a single pass on all important data (in most cases is all the data) is the goal.
I'm not sure I understood well. You wanted to say that string contatenations are good, right ?
I just meant that string concatenation is likely unnecessary. Imagine two one-megabyte strings. It is easy to concatenate them and call the string hash function on the result, but you have to create a 2 MB string first. Alternatively, you could hash each and combine the hash codes in some consistent way.
 I was thinking about a hash function that would take several 
 arguments and hash them together. That would let take in 
 account more than one string in the hash while avoiding 
 concatenation.
Yes. Thanks Dan
Nov 10 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 11/10/2012 05:17 AM, Dan wrote:

 I just meant string concatenation is likely unnecessary. Imagine two one
 megabyte strings. It is easy to concat them and call the string hash
 function on the result but you have to create a 2Mb string first.
Just a random thought: std.range.chain can obviate that copy:

http://dlang.org/phobos/std_range.html#chain

Ali
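
For instance, a sketch of hashing straight over such a chain, without 
building the concatenated string (the multiply-by-31 accumulation and 
the helper names are illustrative, not something from the thread):

==
import std.range : chain;
import std.stdio;

size_t hashBytes(R)(R bytes)
{
    // simple single-pass accumulation over any range of bytes
    size_t hash = 0;
    foreach (b; bytes)
        hash = hash * 31 + b;
    return hash;
}

size_t nameHash(string firstName, string lastName)
{
    // no temporary copy: chain iterates the three slices in place
    return hashBytes(chain(cast(const(ubyte)[]) firstName,
                           cast(const(ubyte)[]) ":",
                           cast(const(ubyte)[]) lastName));
}

void main()
{
    writeln(nameHash("Jean", "Martin"), " vs ", nameHash("Martin", "Jean"));
}
==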
Nov 10 2012
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 11/07/2012 07:38 AM, "Raphaël.Jakse"
<raphael.jakse gmail.com>" puremagic.com 
wrote:
 We want to be able to get the hash of s. Therefore, we re-implement the toHash
 method of the Student class :
OK, now I'm curious. Assuming I don't write a custom re-implementation, how would a custom struct or class be hashed? (What about a tuple?) I ask because I'm considering associative arrays where the key is a custom class or tuple as part of a current coding project.
Nov 10 2012
next sibling parent "Dan" <dbdavidson yahoo.com> writes:
On Saturday, 10 November 2012 at 18:18:07 UTC, Joseph Rushton 
Wakeling wrote:
 On 11/07/2012 07:38 AM, "Raphaël.Jakse" 
 <raphael.jakse gmail.com>" puremagic.com wrote:
 We want to be able to get the hash of s. Therefore, we 
 re-implement the toHash
 method of the Student class :
OK, now I'm curious. Assuming I don't write a custom re-implementation, how would a custom struct or class be hashed? (What about a tuple?) I ask because I'm considering associative arrays where the key is a custom class or tuple as part of a current coding project.
Not sure I understand the question. But here is how I'm doing it. No 
guarantees, but the output looks promising. Code follows the output.

Thanks
Dan

-----------------------------
true
false
false
true
7F053B2DCFC0
20883845 vs 20883845
-----------------------------

==
import std.stdio;
import std.traits;
import std.typecons;
import opmix.mix;

struct S
{
    mixin(HashSupport);
    alias Tuple!(int, char, string) X;
    X x;
    char[] mutable;
}

void main()
{
    S s1 = { tuple(3, 'a', "foo".idup), ['a','b'].dup };
    S s2 = { tuple(3, 'a', "foo".idup), ['a','b'].dup };
    writeln(s1 == s2);
    s1.x[0]++;
    writeln(s1 == s2);
    writeln(s1 < s2);
    writeln(s2 < s1);
    writeln(s1 in [s1: 3]);
    s2.x[0]++;
    writeln(s1.toHash(), " vs ", s2.toHash());
}
==
Nov 10 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 11/10/2012 05:37 AM, Joseph Rushton Wakeling wrote:
 On 11/07/2012 07:38 AM, "Raphaël.Jakse"
 <raphael.jakse gmail.com>" puremagic.com wrote:
 We want to be able to get the hash of s. Therefore, we re-implement
 the toHash
 method of the Student class :
OK, now I'm curious. Assuming I don't write a custom re-implementation, how would a custom struct or class be hashed? (What about a tuple?) I ask because I'm considering associative arrays where the key is a custom class or tuple as part of a current coding project.
For classes, because they are reference types, it is the object 
identity that determines the hash value (probably the address of the 
actual object). As a result, even when the values of the members are 
the same, two objects have different hash values.

For structs, because they are value types, it is the bytes that make up 
the object that determine the hash value.

But beware: the associative array members of structs are not hashed by 
the values of their elements. (There was a long discussion on this 
topic recently on the main newsgroup.)

Ali
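
A quick way to see the difference (the type names here are illustrative 
only):

==
import std.stdio;

class CPoint { int x, y; this(int x, int y) { this.x = x; this.y = y; } }
struct SPoint { int x, y; }

void main()
{
    auto c1 = new CPoint(1, 2);
    auto c2 = new CPoint(1, 2);
    // classes: hashed by identity, so two distinct objects differ
    // even though their members are equal
    writeln(typeid(c1).getHash(&c1) == typeid(c2).getHash(&c2));

    auto p1 = SPoint(1, 2);
    auto p2 = SPoint(1, 2);
    // structs: hashed from the bytes of the value, so these match
    writeln(typeid(p1).getHash(&p1) == typeid(p2).getHash(&p2));
}
==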
Nov 10 2012
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 11/11/2012 02:35 AM, Ali Çehreli wrote:
 For classes, because they are reference types, it is the object identity that
 determines the hash value (probably the pointer of the actual object). As a
 result, even when the values of the members are the same, two objects have
 different hash values.

 For structs, because they are value types, it is the bytes that make up the
 object determine the hash value.
Ahh, clear. So where classes are concerned there's a definite need to 
override toHash if you want the _values_ to be the basis of the 
association, rather than the object itself ... this would also have 
potential implications for memory blow-up where a class is used as the 
key type, no?
 But beware: The associative array members of structs are not hashed by values
of
 their elements. (There was a long discussion recently on the the main newsgroup
 on this topic.)
I do recall that discussion, and I'll bear that in mind. So far the 
only practical consideration I've had was for a Tuple!() to be the key, 
rather than an actual struct or class; as it happens I've avoided that 
so far, and I think I may continue to do so while there are still 
issues around hashing.

The more pressing reason for avoiding associative arrays, as it 
happens, was simply that I was getting unacceptably large memory usage 
as a result: I'd got some data stored in the form of byte[size_t] 
(really this should have been a set of size_t's, but this seemed 
theoretically an acceptable stop gap while there's no set class in 
Phobos), and for some reason that was generating memory consumption 
twice that of having instead a size_t[] regular array. I guess perhaps 
because although a byte should mean just that, in practice each value 
in the byte[size_t] associative array was taking up a whole word?

It's a shame, because functionally it was useful to be able to do 
things of the form if(i in list), but memory-wise it just didn't work 
for the size of data I'm dealing with.
Nov 11 2012
Ali Çehreli <acehreli yahoo.com> writes:
On 11/11/2012 01:11 AM, Joseph Rushton Wakeling wrote:

 So where classes are concerned there's a definite need to
 override opHash if you want the _values_ to be the basis of the
 association, rather than the object itself ... this would also have
 potential implications for memory blow-up where a class is used as the
 key type, no?
I am not sure that I understand. The associative array would still 
contain just class variables (references), not class objects. Every 
time the hash needed to be calculated, the AA would call toHash() on 
the variable, which in turn would use the values of the members of the 
object.

As another case, a single class object is owned by the runtime, but 
many class variables referring to it can exist in multiple AAs.
 The more pressing reason for avoiding associative arrays as it happens
 was simply that I was getting unacceptably large memory usage as a
 result: I'd got some data stored in the form of byte[size_t] (really
 this should have been a set of size_t's, but this seemed theoretically
 an acceptable stop gap while there's no set class in Phobos),
I think std.container.RedBlackTree can take that responsibility, but if you don't need the elements to be in any particular order, then it may be seen as overkill as well.
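
For what it's worth, a small sketch of RedBlackTree playing the role of 
a set of size_t (assuming that fits the use case described above):

==
import std.container;
import std.stdio;

void main()
{
    auto seen = redBlackTree!size_t(1, 5, 42);
    seen.insert(7);

    writeln(5 in seen);  // supports the same `in` test as an associative array
    writeln(6 in seen);
}
==
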
 and for
 some reason that was generating memory consumption twice that of having
 instead a size_t[] regular array. I guess perhaps because although a
 byte should mean just that, in practice each value in the byte[size_t]
 associative array was taking up a whole word?
I don't know the actual implementation of AAs by dmd, but the hash buckets are probably singly-linked lists. Even if your data had no collisions, each byte would have to live in a singly-linked list.
 It's a shame, because functionally it was useful to be able to do things
 of the form if(i in list), but memory-wise it just didn't work for the
 size of data I'm dealing with.
If all you need is 'if(i in list)', then a HashSet may help. I haven't 
used one myself, but apparently there are such data structures in other 
languages.

Ali
Nov 11 2012
Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On 11/11/2012 07:40 PM, Ali Çehreli wrote:
 I think std.container.RedBlackTree can take that responsibility but if you
don't
 need the elements to be in any particular order, then it may be seen as an
 overkill as well.
I know, and I probably should give that a go, but as I've got working code right now I'm moving on to something else and will come back to it later.
 If all you need is 'if(i in list)', then a HashSet may help. I haven't used one
 myself but apparently there are such data structures in other languages.

Indeed, and I was wondering whether it might be worth writing an implementation for std.container. As it stands I don't really feel familiar enough with the underlying concepts, but I may revisit that idea once I'm done with a few other things.
Nov 11 2012
Manfred Nowak <svv1999 hotmail.com> writes:
Ali Çehreli wrote:

 but the hash buckets are probably singly-linked lists
They are unbalanced binary search trees, which indeed degenerate into singly-linked lists on sorted input.
Nov 11 2012
Manfred Nowak <svv1999 hotmail.com> writes:
Jakse wrote:

 It would also be interesting to have ideas for the general
 case 
Yes, ideas might be interesting. :-)

A root of "good" hashing is incorporated in method 2 of your posting: 
spread the values produced by the hash function uniformly over the 
interval size_t.min .. size_t.max.

Therefore: compute the hash value for a given key by multiplying each 
byte of the key with a factor, which increases from the byte with 
lowest significance to the byte with highest significance; of course, 
add the results.

Remarks:
a) Because of the general case: assume that the original key contains 
much redundancy. Therefore do not use the bytes of the original key, 
but compute an (at least close to) lossless compression of the original 
key.
b) The factors in the factor list should be primes, for spreading the 
hash value uniformly over the intended range.
c) The quotient of two consecutive primes in the factor list should be 
greater than 256, because the used key might not contain any redundancy.

-manfred
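
A rough sketch of one reading of that recipe (the factor values below 
are placeholders chosen only so that successive ratios are about 256; a 
real implementation would pick primes as remarks b) and c) require, and 
would not cycle through such a short list):

==
// Sum of key bytes, each multiplied by a factor that grows with the
// byte's position. size_t arithmetic simply wraps on overflow.
size_t spreadHash(const(ubyte)[] key)
{
    static immutable size_t[3] factors = [257, 65_809, 16_850_867];
    size_t hash = 0;
    foreach (i, b; key)
        hash += b * factors[i % factors.length];
    return hash;
}

void main()
{
    import std.stdio : writeln;
    writeln(spreadHash(cast(const(ubyte)[]) "Jean:Martin"));
}
==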
Nov 11 2012
Raphaël Jakse <raphael.jakse gmail.com> writes:
On 11/11/2012 12:38, Manfred Nowak wrote:
 Jakse wrote:

 It would also be interesting to have ideas for the general
 case
[...]
Thanks for this interesting answer. Is compressing for performance reasons? Is it more interesting to compress and then hash than to just hash?
Nov 17 2012
Manfred Nowak <svv1999 hotmail.com> writes:
Raphaël Jakse wrote:

 Is compressing for performance reasons?
Hopefully. Because in the general case the distribution of the keys is unknown, no function used for computing the hash value can be guaranteed to spread the hash values uniformly over the hash interval. Compressing would have a negative effect on performance if more time were needed for compressing and decompressing than was lost resolving conflicting hash values.
 is it more interesting to compress and then hash than just
 hash ?
The only purpose of compressing is to create the special case of close to no redundancy in the keys used for hashing. If this special case is already confirmed, or if it is known that no compression will diminish the computing time: just hash.

-manfred
Nov 17 2012