digitalmars.D - [druntime] TypeInfo.getHash problems
- H. S. Teoh (49/49) Feb 29 2012 I found the following bugs with the current implementations of getHash
- bearophile (5/7) Feb 29 2012 Maybe there are other bugs you have not found yet.
- H. S. Teoh (11/18) Feb 29 2012 [...]
- Dmitry Olshansky (12/58) Mar 01 2012 I suspect arrays should be usually hashed as strings, e.g. a simplistic
- H. S. Teoh (16/82) Mar 01 2012 This has to be fixed in druntime, though. Is there a druntime equivalent
- Dmitry Olshansky (7/73) Mar 01 2012 The fact that std.traits belongs to phobos doesn't mean it's impossible
I found the following bugs with the current implementations of getHash in the various TypeInfo classes: 1) TypeInfo_Array.getHash doesn't work correctly with class objects. with custom toHash() methods. Or, for that matter, arrays of AA's (which have their own problems--see (3a)). It just does a hash of the binary representation of the array. 2) TypeInfo_StaticArray.getHash doesn't suffer from (1), but it does suffer from another problem: it simply sums the hashes from each element, but this is prone to reordering hash collisions: an int[3] with elements [1,2,3] will have the same hash value as [3,2,1] or any reordering thereof. 3) TypeInfo_AssociativeArray.getHash is not implemented, so it just defaults to the base class getHash, which just does a hash of the binary representation of the AA. Unfortunately, this is fraught with problems: a) Two AA's with exactly the same key/value pairs may be binary unequal, because the underlying size of the hash table may be different (this happens when you initialize an AA with a literal vs. manually setting each element). b) If values are objects that implement toHash(), then the current implementation will have the wrong behaviour. 4) TypeInfo_Class.getHash will call the object's toHash method, if implemented, otherwise it hashes the binary representation of the object. But if the object has members that are references/pointers or other objects/structs that implement getHash, this will be wrong. These problems cause AA's to have many subtle breakages, for example, using another AA as key doesn't work properly because of (3); using arrays of objects as key doesn't work because of (1), using static arrays as key may trigger an unusually high number of hash collisions due to (2). To fix (1) may cause noticeable slowdown when the key is an array of basic types, such as dstring (we'd have to compute the hash of each dchar, then compute the hash of the hashes). Is there an easy way to tell whether or not an array's elements are binary-hashable? If so, we can still use hashing of the binary representation of the array for the simple cases (dstring, wstring, array of structs without nested refs/objects), and save the slower code for more complicated arrays. For (2), we may be able to incorporate the array index into the hash summation so as to make it resistant to reordering collisions. But it may be that we might as well just compute the hash of the hashes at this point? For (3), we can collect the hashes of the keys (already stored in the bucket) and the hashes of the values, and doing a hash on them. (This seems to be the easiest one to fix?) To fix (4) will require compile-time reflection... but may introduce slowdowns? Is there a way to detect the easy cases (binary-hashable) so that we only incur slowdown on the complicated cases? T -- Music critic: "That's an imitation fugue!"
Feb 29 2012
H. S. Teoh:I found the following bugs with the current implementations of getHash in the various TypeInfo classes:Maybe there are other bugs you have not found yet. In my opinion this is an important topic (some of the bugs listed here have wasted lot of my time, and generally those limits/bugs reduce the usefulness and safety of the built-in AAs significantly). Bye, bearophile
Feb 29 2012
On Wed, Feb 29, 2012 at 05:14:02PM -0500, bearophile wrote:H. S. Teoh:[...] Walter fixed a bug in TypeInfo_Array.getHash yesterday, so now wstring and dstring keys actually work properly with AA's (latest druntime on github -- note, you need to recompile phobos for the changes to take effect). But these additional issues are still there, so using an array of objects or another AA as key won't work properly yet. T -- Freedom of speech: the whole world has no right *not* to hear my spouting off!I found the following bugs with the current implementations of getHash in the various TypeInfo classes:Maybe there are other bugs you have not found yet. In my opinion this is an important topic (some of the bugs listed here have wasted lot of my time, and generally those limits/bugs reduce the usefulness and safety of the built-in AAs significantly).
Feb 29 2012
On 01.03.2012 1:45, H. S. Teoh wrote:I found the following bugs with the current implementations of getHash in the various TypeInfo classes: 1) TypeInfo_Array.getHash doesn't work correctly with class objects. with custom toHash() methods. Or, for that matter, arrays of AA's (which have their own problems--see (3a)). It just does a hash of the binary representation of the array. 2) TypeInfo_StaticArray.getHash doesn't suffer from (1), but it does suffer from another problem: it simply sums the hashes from each element, but this is prone to reordering hash collisions: an int[3] with elements [1,2,3] will have the same hash value as [3,2,1] or any reordering thereof. 3) TypeInfo_AssociativeArray.getHash is not implemented, so it just defaults to the base class getHash, which just does a hash of the binary representation of the AA. Unfortunately, this is fraught with problems: a) Two AA's with exactly the same key/value pairs may be binary unequal, because the underlying size of the hash table may be different (this happens when you initialize an AA with a literal vs. manually setting each element). b) If values are objects that implement toHash(), then the current implementation will have the wrong behaviour. 4) TypeInfo_Class.getHash will call the object's toHash method, if implemented, otherwise it hashes the binary representation of the object. But if the object has members that are references/pointers or other objects/structs that implement getHash, this will be wrong. These problems cause AA's to have many subtle breakages, for example, using another AA as key doesn't work properly because of (3); using arrays of objects as key doesn't work because of (1), using static arrays as key may trigger an unusually high number of hash collisions due to (2). To fix (1) may cause noticeable slowdown when the key is an array of basic types, such as dstring (we'd have to compute the hash of each dchar, then compute the hash of the hashes). Is there an easy way to tell whether or not an array's elements are binary-hashable? If so, we can still use hashing of the binary representation of the array for the simple cases (dstring, wstring, array of structs without nested refs/objects), and save the slower code for more complicated arrays.std.traits hasIndirections?For (2), we may be able to incorporate the array index into the hash summation so as to make it resistant to reordering collisions. But it may be that we might as well just compute the hash of the hashes at this point?I suspect arrays should be usually hashed as strings, e.g. a simplistic hash = seed; foreach(x; arr) hash = mix(hash, x);//if x is not integral use x.toHash() hash = mix(hash, arr.length); where mix(x) is k*x+b, or (x<<k) ^ b or whatever.For (3), we can collect the hashes of the keys (already stored in the bucket) and the hashes of the values, and doing a hash on them. (This seems to be the easiest one to fix?)Or hash them as array of pairs, but whatever is faster.To fix (4) will require compile-time reflection... but may introduce slowdowns? Is there a way to detect the easy cases (binary-hashable) so that we only incur slowdown on the complicated cases?see point 1. -- Dmitry Olshansky
Mar 01 2012
On Thu, Mar 01, 2012 at 12:36:31PM +0400, Dmitry Olshansky wrote:On 01.03.2012 1:45, H. S. Teoh wrote:[...]I found the following bugs with the current implementations of getHash in the various TypeInfo classes: 1) TypeInfo_Array.getHash doesn't work correctly with class objects. with custom toHash() methods. Or, for that matter, arrays of AA's (which have their own problems--see (3a)). It just does a hash of the binary representation of the array. 2) TypeInfo_StaticArray.getHash doesn't suffer from (1), but it does suffer from another problem: it simply sums the hashes from each element, but this is prone to reordering hash collisions: an int[3] with elements [1,2,3] will have the same hash value as [3,2,1] or any reordering thereof. 3) TypeInfo_AssociativeArray.getHash is not implemented, so it just defaults to the base class getHash, which just does a hash of the binary representation of the AA. Unfortunately, this is fraught with problems: a) Two AA's with exactly the same key/value pairs may be binary unequal, because the underlying size of the hash table may be different (this happens when you initialize an AA with a literal vs. manually setting each element). b) If values are objects that implement toHash(), then the current implementation will have the wrong behaviour. 4) TypeInfo_Class.getHash will call the object's toHash method, if implemented, otherwise it hashes the binary representation of the object. But if the object has members that are references/pointers or other objects/structs that implement getHash, this will be wrong.This has to be fixed in druntime, though. Is there a druntime equivalent to std.traits?To fix (1) may cause noticeable slowdown when the key is an array of basic types, such as dstring (we'd have to compute the hash of each dchar, then compute the hash of the hashes). Is there an easy way to tell whether or not an array's elements are binary-hashable? If so, we can still use hashing of the binary representation of the array for the simple cases (dstring, wstring, array of structs without nested refs/objects), and save the slower code for more complicated arrays.std.traits hasIndirections?Good idea!For (2), we may be able to incorporate the array index into the hash summation so as to make it resistant to reordering collisions. But it may be that we might as well just compute the hash of the hashes at this point?I suspect arrays should be usually hashed as strings, e.g. a simplistic hash = seed; foreach(x; arr) hash = mix(hash, x);//if x is not integral use x.toHash() hash = mix(hash, arr.length); where mix(x) is k*x+b, or (x<<k) ^ b or whatever.I did a quick implementation of this: https://github.com/D-Programming-Language/druntime/pull/171 Basically do a hash of the respective hash values of the key and value, and sum them over all pairs. The sum is commutative, so it avoids the problem of two AA's with identical contents but different internal hashtable sizes causing pair reordering.For (3), we can collect the hashes of the keys (already stored in the bucket) and the hashes of the values, and doing a hash on them. (This seems to be the easiest one to fix?)Or hash them as array of pairs, but whatever is faster.[...] It would be good if there was a clean way to do this in druntime. T -- This sentence is false.To fix (4) will require compile-time reflection... but may introduce slowdowns? Is there a way to detect the easy cases (binary-hashable) so that we only incur slowdown on the complicated cases?see point 1.
Mar 01 2012
On 01.03.2012 18:56, H. S. Teoh wrote:On Thu, Mar 01, 2012 at 12:36:31PM +0400, Dmitry Olshansky wrote:The fact that std.traits belongs to phobos doesn't mean it's impossible to smuggle few artifacts to druntime :) [..]On 01.03.2012 1:45, H. S. Teoh wrote:[...]I found the following bugs with the current implementations of getHash in the various TypeInfo classes: 1) TypeInfo_Array.getHash doesn't work correctly with class objects. with custom toHash() methods. Or, for that matter, arrays of AA's (which have their own problems--see (3a)). It just does a hash of the binary representation of the array. 2) TypeInfo_StaticArray.getHash doesn't suffer from (1), but it does suffer from another problem: it simply sums the hashes from each element, but this is prone to reordering hash collisions: an int[3] with elements [1,2,3] will have the same hash value as [3,2,1] or any reordering thereof. 3) TypeInfo_AssociativeArray.getHash is not implemented, so it just defaults to the base class getHash, which just does a hash of the binary representation of the AA. Unfortunately, this is fraught with problems: a) Two AA's with exactly the same key/value pairs may be binary unequal, because the underlying size of the hash table may be different (this happens when you initialize an AA with a literal vs. manually setting each element). b) If values are objects that implement toHash(), then the current implementation will have the wrong behaviour. 4) TypeInfo_Class.getHash will call the object's toHash method, if implemented, otherwise it hashes the binary representation of the object. But if the object has members that are references/pointers or other objects/structs that implement getHash, this will be wrong.This has to be fixed in druntime, though. Is there a druntime equivalent to std.traits?To fix (1) may cause noticeable slowdown when the key is an array of basic types, such as dstring (we'd have to compute the hash of each dchar, then compute the hash of the hashes). Is there an easy way to tell whether or not an array's elements are binary-hashable? If so, we can still use hashing of the binary representation of the array for the simple cases (dstring, wstring, array of structs without nested refs/objects), and save the slower code for more complicated arrays.std.traits hasIndirections?Looks like a clean solution.I did a quick implementation of this: https://github.com/D-Programming-Language/druntime/pull/171 Basically do a hash of the respective hash values of the key and value, and sum them over all pairs. The sum is commutative, so it avoids the problem of two AA's with identical contents but different internal hashtable sizes causing pair reordering.For (3), we can collect the hashes of the keys (already stored in the bucket) and the hashes of the values, and doing a hash on them. (This seems to be the easiest one to fix?)Or hash them as array of pairs, but whatever is faster.-- Dmitry Olshansky[...] It would be good if there was a clean way to do this in druntime.To fix (4) will require compile-time reflection... but may introduce slowdowns? Is there a way to detect the easy cases (binary-hashable) so that we only incur slowdown on the complicated cases?see point 1.
Mar 01 2012