digitalmars.D.learn - Performance of hashes and associative arrays
- "Raphaël Jakse" (211/211) Nov 06 2012 Hello everybody,
- Dan (54/76) Nov 07 2012 Isn't the real problem the addition. You want to mix the bits
- Raphaël Jakse (16/32) Nov 09 2012 Hello,
- Dan (14/31) Nov 10 2012 Ok - good. I've been using 2.061 which I just realized allows
- Ali Çehreli (4/7) Nov 10 2012 Just a random thought: std.range.chain can obviate that copy:
- Joseph Rushton Wakeling (6/8) Nov 10 2012 OK, now I'm curious. Assuming I don't write a custom re-implementation,...
- Dan (36/46) Nov 10 2012 Not sure I understand the question. But here is how I'm doing it.
- Ali Çehreli (11/20) Nov 10 2012 For classes, because they are reference types, it is the object identity...
- Joseph Rushton Wakeling (21/30) Nov 11 2012 Ahh, clear. So where classes are concerned there's a definite need to o...
- Ali Çehreli (18/36) Nov 11 2012 I am not sure that I understand.
- Joseph Rushton Wakeling (7/13) Nov 11 2012 I know, and I probably should give that a go, but as I've got working co...
- Manfred Nowak (4/5) Nov 11 2012 They are unbalanced binary search trees, which indeed turn to
- Manfred Nowak (25/27) Nov 11 2012 Yes, ideas might be interesting. :-)
- Raphaël Jakse (4/31) Nov 17 2012 Thanks for this interesting answer.
- Manfred Nowak (13/16) Nov 17 2012 Hopefully. Because in the general case the distribution of the
Hello everybody,

We had a conversation with Ali Çehreli about the right ways of hashing values / objects. This is related to the D language when using associative arrays, but it applies to any language as well.

One of the points was: we have the following class:

==
class Student // class representing a student
{
    string firstName; // the first name of the student
    string lastName;  // the last name of the student
}
==

Let s be a Student:

==
auto s = new Student("Jean", "Martin");
==

We want to be able to get the hash of s. Therefore, we re-implement the toHash method of the Student class:

==
class Student
{
    string firstName;
    string lastName;

    override size_t toHash() const
    {
        //...
    }
}
==

The original solution proposed was:

==
override size_t toHash() const
{
    return (typeid(firstName).getHash(&firstName) +
            typeid(lastName).getHash(&lastName));
}
==

However, with this solution, we get the same hash for new Student("Jean", "Martin") and new Student("Martin", "Jean").

We did an experiment on the performance of associative arrays in D when the hash function is not well designed (returning the same hash for too many values). When the hash function returns the same hash for too many values, performance is dramatically worse (see the end of the post for more information).

So here is the second solution that was proposed, which avoids giving the same hash to two different names:

==
override size_t toHash() const
{
    // we suppose that ":" is not a valid character in a name
    auto h = firstName ~ ":" ~ lastName;
    return typeid(h).getHash(&h);
}
==

The ":" was added so that new Student("Martin", "Jean") gets a different hash than new Student("Mar", "tinJean").

Let's call the following solution the "second bis solution":

==
override size_t toHash() const
{
    auto h = firstName ~ lastName;
    return typeid(h).getHash(&h);
}
==

This solution supposes that being named "Mar tinJean" instead of "Martin Jean" is really strange and therefore quite unusual.

However, these two solutions imply string concatenations each time we need the hash. Ali agreed that concatenating strings each time would indeed be inefficient. He thought we might cache the value (third solution):

==
class Student
{
    string firstName;
    string lastName;
    size_t cachedHash;

    override size_t toHash() const
    {
        if(cachedHash is null)
        {
            auto h = firstName ~ ":" ~ lastName;
            cachedHash = typeid(h).getHash(&h);
        }
        return cachedHash;
    }
}
==

This solution doesn't compromise on the "quality" of the hash and is fast after the first use, but it needs to keep the hash as part of the class. The hash is also calculated for each new object that is used as an index in an associative array, even if the hash for this specific name has already been calculated in another object.

Questions are:
- what is the most efficient solution, and in which case?
- Is it preferable to try to make the hash unique (second bis solution), or to try to make the hash easier to evaluate and risk, in rare cases, getting the same hash for two different objects (second solution)?
- Is solution 3 a good practice? Preferable to the other solutions? When?
- Is solution 1 preferable to the other solutions, given that first name / last name inversions are quite "rare"?
- How does solution 1 compare to solution 2 bis?

It would also be interesting to have ideas for the general case, i.e. random objects and not just names that have particularities like "first names are not often last names and vice versa", or for other specific situations.
Questions are oriented toward usage in associative arrays in D, but they still apply to other use cases of hashes.

For those who are interested, here is Ali's experiment on the performance of associative arrays in D depending on the design of the hash function (I took the liberty to modify it slightly).

==
import std.stdio;
import std.random;
import std.string;

class Clock
{
    int hour;
    int minute;
    int second;

    static size_t opCmpCount = 0;
    static size_t opEqualsCount = 0;
    static size_t toHashCount = 0;

    override equals_t opEquals(Object o) const
    {
        ++opEqualsCount;

        auto rhs = cast(const Clock)o;
        return (rhs &&
                (hour == rhs.hour) &&
                (minute == rhs.minute) &&
                (second == rhs.second));
    }

    override int opCmp(Object o) const
    {
        ++opCmpCount;

        /* Taking advantage of the automatically-maintained
         * order of the types. */
        if (typeid(this) != typeid(o)) {
            return typeid(this).opCmp(typeid(o));
        }

        auto rhs = cast(const Clock)o;
        /* No need to check whether rhs is null, because it is
         * known on this line that it has the same type as o. */

        return (hour != rhs.hour
                ? hour - rhs.hour
                : (minute != rhs.minute
                   ? minute - rhs.minute
                   : second - rhs.second));
    }

    override string toString() const
    {
        return format("%02s:%02s:%02s", hour, minute, second);
    }

    this(int hour, int minute, int second)
    {
        this.hour = hour;
        this.minute = minute;
        this.second = second;
    }

    override hash_t toHash() const
    {
        ++toHashCount;
        return hour + minute + second;                 // method 1
        // return hour + 60 * minute + 3600 * second;  // method 2
    }

    static void printCounts()
    {
        writefln("opEquals: %8s", opEqualsCount);
        writefln("opCmp   : %8s", opCmpCount);
        writefln("toHash  : %8s", toHashCount);
    }
}

Clock randomClock()
{
    return new Clock(uniform(0, 24), uniform(0, 60), uniform(0, 60));
}

void main()
{
    enum count = 10_000;

    Clock[Clock] aa;

    foreach (i; 0 .. count) {
        auto entry = randomClock();
        aa[entry] = entry;
    }

    foreach (i; 0 .. count) {
        auto result = (randomClock() in aa);
    }

    Clock.printCounts();
}
==

We both get the following results for method 1:

opEquals:        0
opCmp   :  1438438
toHash  :    20000

and the following results for method 2:

opEquals:        0
opCmp   :     1681
toHash  :    20000

Apart from the fact that the associative arrays in gdc and dmd don't use the opEquals method, we see that using method 1 implies approx. 856 times more comparisons than method 2.

I calculated that method 1 gives 139 different hashes for 80063 different elements, which means each hash represents an average of approx. 576 elements. Method 2 gives exactly one hash per element.

This seems to be an extreme case; we might think the results would be completely different with a function whose hashes only "sometimes" represent 2 elements.
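As a side note (not part of the original experiment), numbers like the "139 different hashes" above can be estimated without running the whole benchmark. A small counting sketch, assuming random sampling is representative enough for a rough comparison:

==
import std.stdio;
import std.random;

size_t method1(int h, int m, int s) { return h + m + s; }
size_t method2(int h, int m, int s) { return h + 60 * m + 3600 * s; }

void main()
{
    // Sample random clock values and count how many distinct hash
    // values each method produces; fewer distinct values means more
    // collisions in the associative array.
    enum count = 80_000;
    size_t[size_t] seen1, seen2;
    foreach (i; 0 .. count)
    {
        immutable h = uniform(0, 24), m = uniform(0, 60), s = uniform(0, 60);
        ++seen1[method1(h, m, s)];
        ++seen2[method2(h, m, s)];
    }
    writefln("method 1: %s distinct hashes", seen1.length);
    writefln("method 2: %s distinct hashes", seen2.length);
}
==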
Nov 06 2012
On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:
> ==
> override size_t toHash() const
> {
>     return (typeid(firstName).getHash(&firstName) +
>             typeid(lastName).getHash(&lastName));
> }
> ==

Isn't the real problem the addition? You want to mix the bits together in a consistent way that does not have associativity. I've seen things like:

    result = 37 * first + 17 * second

I've incorporated this functionality in a toHash that can be mixed in. Any feedback on it would be great. See:

http://forum.dlang.org/post/tbjrqaogxegtyexnfags@forum.dlang.org
https://github.com/patefacio/d-help/blob/master/d-help/opmix/mix.d

It only supports structs, but the hashing could be made to support classes. It allows for the following code for your example. It provides toHash, opEquals and opCmp. Below, the hash codes for s1 and s2 are different, the instances are not equal, and s1 is less than s2, which is what you want.

import std.stdio;
import opmix.mix;

struct Student // struct representing a student
{
    mixin(HashSupport);
    string firstName; // the first name of the student
    string lastName;  // the last name of the student
}

void main()
{
    auto s1 = Student("Jean", "Martin");
    auto s2 = Student("Martin", "Jean");
    writeln(s1.toHash(), " vs ", s2.toHash());
    writeln(s1 == s2);
    writeln(s1 < s2);
}

> However, with this solution, we get the same hash for new
> Student("Jean", "Martin") and new Student("Martin", "Jean").
>
> We did an experiment on the performance of associative arrays in D
> when the hash function is not well designed (returning the same hash
> for too many values). When the hash function returns the same hash
> for too many values, performance is dramatically worse (see the end
> of the post for more information).

This makes sense and is not particular to D.

> Ali agreed that concatenating strings each time would indeed be
> inefficient. He thought we might cache the value (third solution):

Interesting about caching the hashcode; on large classes it could save you. But isn't the signature shown const? Would that compile? Also, what if you change the strings - you get the wrong hash? I suppose you could limit writes to properties and null the hashcode on any write.

> Questions are:
> - what is the most efficient solution, and in which case?

No string concatenation is good. I think a single pass on all important data (in most cases that is all the data) is the goal.

> This seems to be an extreme case; we might think the results would be
> completely different with a function whose hashes only "sometimes"
> represent 2 elements.

Very true. If I increase to 10_000_000 I see opCmp:1913600 for the smart method and (several minutes later) opCmp:907216764 for the simple addition (method 1). In this case you know something special about the data and can take advantage of it.

If I run the example using the mixin(HashSupport) it does opCmp:7793499, which is about 4 times as many compares as the smart one. The simple addition does 474 times as many compares as the smart one - so it is clearly very bad.

So, if you know something special about the data, like that it can easily be converted into a single number such as seconds, by all means take advantage of that. But remember, the next time you add something to the class, if you don't have some "automated" way of pulling in info from all fields, you need to revisit the hash (and opCmp and opEquals).

Thanks
Dan
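As an aside, for the class version from the first post the same kind of non-associative mixing could look like the sketch below. The 37 and 17 constants are just the illustrative ones above, not what opmix.mix actually uses, and it assumes the const toHash override from the first post is accepted by your compiler version:

==
class Student
{
    string firstName;
    string lastName;

    this(string firstName, string lastName)
    {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    override size_t toHash() const
    {
        // Hash each field separately, then combine with different
        // multipliers so that swapping the fields changes the result.
        auto h1 = typeid(firstName).getHash(&firstName);
        auto h2 = typeid(lastName).getHash(&lastName);
        return 37 * h1 + 17 * h2;
    }
}
==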
Nov 07 2012
Hello,

Thanks for this complete answer. I will take a look at your code.

Additionally, Ali gave me a really interesting link about hashing, good practices, what is efficient, etc. If you didn't read it, it might interest you. Here it is:

http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx

Le 07/11/2012 13:10, Dan a écrit :
> On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:
> [...]
>> Ali agreed that concatenating strings each time would indeed be
>> inefficient. He thought we might cache the value (third solution):
>
> Interesting about caching the hashcode; on large classes it could save
> you. But isn't the signature shown const? Would that compile?

Indeed, I didn't test this code, I just wanted to make explicit what I wanted to say. Thank you for the notice.

> Also, what if you change the strings - you get the wrong hash? I
> suppose you could limit writes to properties and null the hashcode on
> any write.

Yes, good question and good idea; I think I would do it this way. I hadn't thought about it.

>> Questions are:
>> - what is the most efficient solution, and in which case?
>
> No string concatenation is good. I think a single pass on all important
> data (in most cases that is all the data) is the goal.

I'm not sure I understood well. You wanted to say that string concatenations are good, right?

I was thinking about a hash function that would take several arguments and hash them together. That would let us take into account more than one string in the hash while avoiding concatenation.

> [...]
> Thanks Dan
Nov 09 2012
On Saturday, 10 November 2012 at 07:55:18 UTC, Raphaël Jakse wrote:
> Hello,
>
> Thanks for this complete answer. I will take a look at your code.

Ok - good. I've been using 2.061, which I just realized allows "dup" on an associative array, a feature which was not available in 2.060. So mixin(Dup) and the unit tests require 2.061.

> If you didn't read it, it might interest you. Here it is:

I had not seen it and will read it - thanks.

>>> Questions are:
>>> - what is the most efficient solution, and in which case?
>>
>> No string concatenation is good. I think a single pass on all
>> important data (in most cases that is all the data) is the goal.
>
> I'm not sure I understood well. You wanted to say that string
> concatenations are good, right?

I just meant string concatenation is likely unnecessary. Imagine two one-megabyte strings. It is easy to concatenate them and call the string hash function on the result, but you have to create a 2 MB string first. Alternatively, you could hash each and combine the hash codes in some consistent way.

> I was thinking about a hash function that would take several arguments
> and hash them together. That would let us take into account more than
> one string in the hash while avoiding concatenation.

Yes.

Thanks
Dan
Nov 10 2012
On 11/10/2012 05:17 AM, Dan wrote:
> I just meant string concatenation is likely unnecessary. Imagine two
> one-megabyte strings. It is easy to concatenate them and call the
> string hash function on the result, but you have to create a 2 MB
> string first.

Just a random thought: std.range.chain can obviate that copy:

  http://dlang.org/phobos/std_range.html#chain

Ali
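To make that concrete, one possible use is sketched below. The FNV-1a-style loop is an assumption for illustration (it is not a Phobos function), and it hashes per code point rather than per byte, which is good enough to show the idea of hashing lazily over the chained strings without building a temporary:

==
import std.range : chain;
import std.stdio;

size_t hashOver(R)(R r)
{
    ulong h = 14695981039346656037UL;    // FNV-1a offset basis (64-bit)
    foreach (c; r)
    {
        h ^= cast(ulong) c;
        h *= 1099511628211UL;            // FNV-1a prime (64-bit)
    }
    return cast(size_t) h;
}

void main()
{
    // No temporary "Jean:Martin" string is ever allocated.
    auto a = hashOver(chain("Jean", ":", "Martin"));
    auto b = hashOver(chain("Martin", ":", "Jean"));
    writeln(a != b);  // swapping the fields should change the hash
}
==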
Nov 10 2012
On 11/07/2012 07:38 AM, Raphaël Jakse wrote:
> We want to be able to get the hash of s. Therefore, we re-implement the
> toHash method of the Student class:

OK, now I'm curious. Assuming I don't write a custom re-implementation, how would a custom struct or class be hashed? (What about a tuple?)

I ask because I'm considering associative arrays where the key is a custom class or tuple as part of a current coding project.
Nov 10 2012
On Saturday, 10 November 2012 at 18:18:07 UTC, Joseph Rushton Wakeling wrote:
> On 11/07/2012 07:38 AM, Raphaël Jakse wrote:
>> We want to be able to get the hash of s. Therefore, we re-implement
>> the toHash method of the Student class:
>
> OK, now I'm curious. Assuming I don't write a custom re-implementation,
> how would a custom struct or class be hashed? (What about a tuple?)
>
> I ask because I'm considering associative arrays where the key is a
> custom class or tuple as part of a current coding project.

Not sure I understand the question. But here is how I'm doing it. No guarantees, but the output looks promising. Code follows the output.

Thanks
Dan

-----------------------------
true
false
false
true
7F053B2DCFC0
20883845 vs 20883845
-----------------------------

import std.stdio;
import std.traits;
import std.typecons;
import opmix.mix;

struct S
{
    mixin(HashSupport);
    alias Tuple!(int, char, string) X;
    X x;
    char[] mutable;
}

void main()
{
    S s1 = { tuple(3, 'a', "foo".idup), ['a','b'].dup };
    S s2 = { tuple(3, 'a', "foo".idup), ['a','b'].dup };
    writeln(s1 == s2);
    s1.x[0]++;
    writeln(s1 == s2);
    writeln(s1 < s2);
    writeln(s2 < s1);
    writeln(s1 in [ s1: 3 ]);
    s2.x[0]++;
    writeln(s1.toHash(), " vs ", s2.toHash());
}
Nov 10 2012
On 11/10/2012 05:37 AM, Joseph Rushton Wakeling wrote:
> On 11/07/2012 07:38 AM, Raphaël Jakse wrote:
>> We want to be able to get the hash of s. Therefore, we re-implement
>> the toHash method of the Student class:
>
> OK, now I'm curious. Assuming I don't write a custom re-implementation,
> how would a custom struct or class be hashed? (What about a tuple?)
>
> I ask because I'm considering associative arrays where the key is a
> custom class or tuple as part of a current coding project.

For classes, because they are reference types, it is the object identity that determines the hash value (probably the pointer to the actual object). As a result, even when the values of the members are the same, two objects have different hash values.

For structs, because they are value types, it is the bytes that make up the object that determine the hash value.

But beware: the associative array members of structs are not hashed by the values of their elements. (There was a long discussion recently on the main newsgroup on this topic.)

Ali
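A small check of the difference described above. The results noted in the comments are what the default identity-based vs. byte-wise hashing would be expected to give; exact behaviour can depend on the compiler and runtime version:

==
import std.stdio;

class C { int x; this(int x) { this.x = x; } }
struct S { int x; }

void main()
{
    // Class keys: default hashing and equality are based on identity,
    // so a second object with the same member value is not found.
    int[C] byClass;
    byClass[new C(1)] = 1;
    writeln((new C(1) in byClass) !is null);  // expected: false

    // Struct keys: hashed from the bytes of the value, so an equal
    // value is found.
    int[S] byStruct;
    byStruct[S(1)] = 1;
    writeln((S(1) in byStruct) !is null);     // expected: true
}
==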
Nov 10 2012
On 11/11/2012 02:35 AM, Ali Çehreli wrote:
> For classes, because they are reference types, it is the object
> identity that determines the hash value (probably the pointer to the
> actual object). As a result, even when the values of the members are
> the same, two objects have different hash values.
>
> For structs, because they are value types, it is the bytes that make up
> the object that determine the hash value.

Ahh, clear. So where classes are concerned there's a definite need to override toHash if you want the _values_ to be the basis of the association, rather than the object itself ... this would also have potential implications for memory blow-up where a class is used as the key type, no?

> But beware: the associative array members of structs are not hashed by
> the values of their elements. (There was a long discussion recently on
> the main newsgroup on this topic.)

I do recall that discussion, and I'll bear that in mind. So far the only practical consideration I've had was for a Tuple!() to be the key, rather than an actual struct or class; as it happens I've avoided that so far, and I think I may continue to do so while there are still issues around hashing.

The more pressing reason for avoiding associative arrays, as it happens, was simply that I was getting unacceptably large memory usage as a result: I'd got some data stored in the form of byte[size_t] (really this should have been a set of size_t's, but this seemed theoretically an acceptable stop gap while there's no set class in Phobos), and for some reason that was generating memory consumption twice that of having instead a size_t[] regular array.

I guess perhaps because although a byte should mean just that, in practice each value in the byte[size_t] associative array was taking up a whole word?

It's a shame, because functionally it was useful to be able to do things of the form if(i in list), but memory-wise it just didn't work for the size of data I'm dealing with.
Nov 11 2012
On 11/11/2012 01:11 AM, Joseph Rushton Wakeling wrote:
> So where classes are concerned there's a definite need to override
> toHash if you want the _values_ to be the basis of the association,
> rather than the object itself ... this would also have potential
> implications for memory blow-up where a class is used as the key type,
> no?

I am not sure that I understand. The associative array would still contain just class variables, not class objects. Every time the hash needed to be calculated, the AA would call toHash() on the variable, which in turn would use the values of the members of the object.

As another case, a single class object is owned by the runtime, but many class variables of it can exist in multiple AAs.

> The more pressing reason for avoiding associative arrays as it happens
> was simply that I was getting unacceptably large memory usage as a
> result: I'd got some data stored in the form of byte[size_t] (really
> this should have been a set of size_t's, but this seemed theoretically
> an acceptable stop gap while there's no set class in Phobos),

I think std.container.RedBlackTree can take that responsibility, but if you don't need the elements to be in any particular order, then it may be seen as overkill as well.

> and for some reason that was generating memory consumption twice that
> of having instead a size_t[] regular array. I guess perhaps because
> although a byte should mean just that, in practice each value in the
> byte[size_t] associative array was taking up a whole word?

I don't know the actual implementation of AAs in dmd, but the hash buckets are probably singly-linked lists. Even if your data had no collisions, each byte would have to live in a singly-linked list.

> It's a shame, because functionally it was useful to be able to do
> things of the form if(i in list), but memory-wise it just didn't work
> for the size of data I'm dealing with.

If all you need is 'if(i in list)', then a HashSet may help. I haven't used one myself, but apparently there are such data structures in other languages.

Ali
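For reference, a sketch of the RedBlackTree-as-set idea. It only illustrates the membership test; whether it actually uses less memory than byte[size_t] would have to be measured, and the exact std.container API may differ slightly between compiler releases:

==
import std.container : RedBlackTree;
import std.stdio;

void main()
{
    auto ids = new RedBlackTree!size_t;
    ids.insert(3);
    ids.insert(17);

    // Membership test via equalRange, playing the role of "i in list".
    size_t i = 17;
    writeln(!ids.equalRange(i).empty);  // true
    writeln(!ids.equalRange(4).empty);  // false
}
==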
Nov 11 2012
On 11/11/2012 07:40 PM, Ali Çehreli wrote:
> I think std.container.RedBlackTree can take that responsibility, but if
> you don't need the elements to be in any particular order, then it may
> be seen as overkill as well.

I know, and I probably should give that a go, but as I've got working code right now I'm moving on to something else and will come back to it later.

> If all you need is 'if(i in list)', then a HashSet may help. I haven't
> used one myself, but apparently there are such data structures in other
> languages.

Indeed, and I was wondering whether it might be worth writing an implementation for std.container. As it stands I don't really feel familiar enough with the underlying concepts, but I may revisit that idea once I'm done with a few other things.
Nov 11 2012
Ali Çehreli wrote:
> but the hash buckets are probably singly-linked lists

They are unbalanced binary search trees, which indeed turn into singly-linked lists on sorted input.

-manfred
Nov 11 2012
Jakse wrote:
> It would also be interesting to have ideas for the general case

Yes, ideas might be interesting. :-)

A root of "good" hashing is incorporated in method 2 of your posting: spread the values produced by the hash function uniformly over the interval size_t.min .. size_t.max.

Therefore: compute the hash value for a given key by multiplying each byte of the key with a factor, which increases from the byte with lowest significance to the byte with highest significance; of course, add the results.

Remarks:
a) Because of the general case: assume that the original key contains much redundancy. Therefore do not use the bytes of the original key, but compute an (at least close to) lossless compression of the original key.
b) The factors in the factor list should be primes, for spreading the hash value uniformly over the intended range.
c) The quotient of two consecutive primes in the factor list should be greater than 256, because the used key might not contain any redundancy.

-manfred
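A sketch of what that scheme could look like. The factor sequence below simply uses powers of 257 for brevity, so it only approximates remarks b) and c) about primes whose quotients exceed 256, and the compression step from remark a) is left out:

==
// Multiply each byte of the key by a growing factor and sum the
// results, as described above.
size_t spreadHash(const(ubyte)[] key)
{
    size_t result = 0;
    size_t factor = 1;
    foreach (b; key)
    {
        result += b * factor;
        factor *= 257;  // grows by more than a factor of 256 per byte position
    }
    return result;
}

unittest
{
    // Swapping bytes changes the hash, unlike plain addition.
    ubyte[] a = [1, 2];
    ubyte[] b = [2, 1];
    assert(spreadHash(a) != spreadHash(b));
}
==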
Nov 11 2012
Le 11/11/2012 12:38, Manfred Nowak a écrit :
> Jakse wrote:
>> It would also be interesting to have ideas for the general case
>
> Yes, ideas might be interesting. :-)
>
> A root of "good" hashing is incorporated in method 2 of your posting:
> spread the values produced by the hash function uniformly over the
> interval size_t.min .. size_t.max.
>
> Therefore: compute the hash value for a given key by multiplying each
> byte of the key with a factor, which increases from the byte with
> lowest significance to the byte with highest significance; of course,
> add the results.
>
> Remarks:
> a) Because of the general case: assume that the original key contains
>    much redundancy. Therefore do not use the bytes of the original key,
>    but compute an (at least close to) lossless compression of the
>    original key.
> b) The factors in the factor list should be primes, for spreading the
>    hash value uniformly over the intended range.
> c) The quotient of two consecutive primes in the factor list should be
>    greater than 256, because the used key might not contain any
>    redundancy.
>
> -manfred

Thanks for this interesting answer.

Is compressing for performance reasons? Is it more interesting to compress and then hash than to just hash?
Nov 17 2012
Raphaël Jakse wrote:
> Is compressing for performance reasons?

Hopefully. Because in the general case the distribution of the keys is unknown, no function used for computing the hash value can be guaranteed to indeed spread the hash values uniformly over the hash interval.

Compressing would have a negative effect on performance if more time were needed for compressing and decompressing than was lost resolving conflicting hash values.

> Is it more interesting to compress and then hash than to just hash?

The only purpose of compressing is to create the special case of close to no redundancy in the keys used for hashing. If this special case is already confirmed, or it is known that no compression will diminish the computing time: just hash.

-manfred
Nov 17 2012