digitalmars.D - getHash inconsistency
- H. S. Teoh (10/10) Mar 19 2012 Is this a bug?
- Daniel Murphy (3/10) Mar 19 2012 Of course.
- H. S. Teoh (10/24) Mar 20 2012 [...]
- Daniel Murphy (4/28) Mar 20 2012 I assume the custom hashing routine is better/faster? If so, that one. ...
- H. S. Teoh (19/31) Mar 20 2012 [...]
- Daniel Murphy (3/19) Mar 20 2012 Benchmark time!
- H. S. Teoh (18/37) Mar 22 2012 Alright, so after some benchmarking, I found that the above custom hash
- Andrei Alexandrescu (3/16) Mar 22 2012 Note that you can switch the actual algorithm depending on string length...
- Sean Kelly (2/39) Mar 22 2012 Might be worth plugging in MurmurHash for comparison.
- Sean Kelly (9/41) Mar 22 2012 a
Is this a bug? char[] a = "abc".dup; const(char)[] b = "abc"; string c = "abc"; writeln(typeid(a).getHash(&a)); // 12914 writeln(typeid(b).getHash(&b)); // 8503969683799911018 writeln(typeid(c).getHash(&c)); // 12914 T -- Obviously, some things aren't very obvious.
Mar 19 2012
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.910.1332214803.4860.digitalmars-d puremagic.com...Is this a bug? char[] a = "abc".dup; const(char)[] b = "abc"; string c = "abc"; writeln(typeid(a).getHash(&a)); // 12914 writeln(typeid(b).getHash(&b)); // 8503969683799911018 writeln(typeid(c).getHash(&c)); // 12914Of course.
Mar 19 2012
On Tue, Mar 20, 2012 at 05:07:12PM +1100, Daniel Murphy wrote:"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.910.1332214803.4860.digitalmars-d puremagic.com...[...] Then the question is, what should be the fix? Currently, char[] and string have getHash defined in rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which uses rt.util.hash.hashOf. Which one should be used? T -- Life would be easier if I had the source code. -- YHLIs this a bug? char[] a = "abc".dup; const(char)[] b = "abc"; string c = "abc"; writeln(typeid(a).getHash(&a)); // 12914 writeln(typeid(b).getHash(&b)); // 8503969683799911018 writeln(typeid(c).getHash(&c)); // 12914Of course.
Mar 20 2012
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.917.1332250604.4860.digitalmars-d puremagic.com...On Tue, Mar 20, 2012 at 05:07:12PM +1100, Daniel Murphy wrote:I assume the custom hashing routine is better/faster? If so, that one. I'd assume that getHash not working the same for const strings is an oversight."H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.910.1332214803.4860.digitalmars-d puremagic.com...[...] Then the question is, what should be the fix? Currently, char[] and string have getHash defined in rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which uses rt.util.hash.hashOf. Which one should be used? T -- Life would be easier if I had the source code. -- YHLIs this a bug? char[] a = "abc".dup; const(char)[] b = "abc"; string c = "abc"; writeln(typeid(a).getHash(&a)); // 12914 writeln(typeid(b).getHash(&b)); // 8503969683799911018 writeln(typeid(c).getHash(&c)); // 12914Of course.
Mar 20 2012
On Wed, Mar 21, 2012 at 03:55:49PM +1100, Daniel Murphy wrote:"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.917.1332250604.4860.digitalmars-d puremagic.com...[...][...]Then the question is, what should be the fix? Currently, char[] and string have getHash defined in rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which uses rt.util.hash.hashOf. Which one should be used?I assume the custom hashing routine is better/faster? If so, that one. I'd assume that getHash not working the same for const strings is an oversight.[...] Here's the current hashing code for char[] and string: foreach (char c; s) hash = hash * 11 + c; For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's SuperFastHash algorithm with very good hash distribution properties. It does seem to involve a lot more operations that the simple loop above, though; so I assume the above simple loop was chosen because hashing strings are a very common operation and, for the most part, only need a simple hash function. So I'm kinda leaning towards SuperFastHash, but worried about whether it will cause performance degradation, in which case we should stick with the simple loop. T -- "Maybe" is a strange word. When mom or dad says it it means "yes", but when my big brothers say it it means "no"! -- PJ jr.
Mar 20 2012
"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.951.1332306541.4860.digitalmars-d puremagic.com...Here's the current hashing code for char[] and string: foreach (char c; s) hash = hash * 11 + c; For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's SuperFastHash algorithm with very good hash distribution properties. It does seem to involve a lot more operations that the simple loop above, though; so I assume the above simple loop was chosen because hashing strings are a very common operation and, for the most part, only need a simple hash function. So I'm kinda leaning towards SuperFastHash, but worried about whether it will cause performance degradation, in which case we should stick with the simple loop. T -- "Maybe" is a strange word. When mom or dad says it it means "yes", but when my big brothers say it it means "no"! -- PJ jr.Benchmark time!
Mar 20 2012
On Wed, Mar 21, 2012 at 04:35:11PM +1100, Daniel Murphy wrote:"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.951.1332306541.4860.digitalmars-d puremagic.com...[...]Here's the current hashing code for char[] and string: foreach (char c; s) hash = hash * 11 + c; For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's SuperFastHash algorithm with very good hash distribution properties. It does seem to involve a lot more operations that the simple loop above, though; so I assume the above simple loop was chosen because hashing strings are a very common operation and, for the most part, only need a simple hash function. So I'm kinda leaning towards SuperFastHash, but worried about whether it will cause performance degradation, in which case we should stick with the simple loop.Benchmark time!Alright, so after some benchmarking, I found that the above custom hash function works best for *short* (3 to 10 character) randomized alphabetic strings (I assumed alphabetic to be the typical use case of strings). It's faster than SuperFastHash, and even has better distribution properties, probably because SuperFastHash is tuned for arbitrary binary data of arbitrary length, whereas the custom function is tuned for short string-like data. With longer strings, SuperFastHash beats the custom algorithm, and distribution properties are approximately the same. So I'm still on the fence about which algorithm is better. I can see why the custom hash was adopted for strings, since your typical AA tends to have short alphabetic keys, and something like SuperFastHash is probably overkill. But for longer keys, SuperFastHash is better. T -- Curiosity kills the cat. Moral: don't be the cat.
Mar 22 2012
On 3/22/12 1:18 PM, H. S. Teoh wrote:Alright, so after some benchmarking, I found that the above custom hash function works best for *short* (3 to 10 character) randomized alphabetic strings (I assumed alphabetic to be the typical use case of strings). It's faster than SuperFastHash, and even has better distribution properties, probably because SuperFastHash is tuned for arbitrary binary data of arbitrary length, whereas the custom function is tuned for short string-like data. With longer strings, SuperFastHash beats the custom algorithm, and distribution properties are approximately the same. So I'm still on the fence about which algorithm is better. I can see why the custom hash was adopted for strings, since your typical AA tends to have short alphabetic keys, and something like SuperFastHash is probably overkill. But for longer keys, SuperFastHash is better.Note that you can switch the actual algorithm depending on string length. Andrei
Mar 22 2012
On Mar 22, 2012, at 11:18 AM, H. S. Teoh wrote:On Wed, Mar 21, 2012 at 04:35:11PM +1100, Daniel Murphy wrote:Might be worth plugging in MurmurHash for comparison."H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message news:mailman.951.1332306541.4860.digitalmars-d puremagic.com...[...]Here's the current hashing code for char[] and string: foreach (char c; s) hash = hash * 11 + c; For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's SuperFastHash algorithm with very good hash distribution properties. It does seem to involve a lot more operations that the simple loop above, though; so I assume the above simple loop was chosen because hashing strings are a very common operation and, for the most part, only need a simple hash function. So I'm kinda leaning towards SuperFastHash, but worried about whether it will cause performance degradation, in which case we should stick with the simple loop.Benchmark time!Alright, so after some benchmarking, I found that the above custom hash function works best for *short* (3 to 10 character) randomized alphabetic strings (I assumed alphabetic to be the typical use case of strings). It's faster than SuperFastHash, and even has better distribution properties, probably because SuperFastHash is tuned for arbitrary binary data of arbitrary length, whereas the custom function is tuned for short string-like data. With longer strings, SuperFastHash beats the custom algorithm, and distribution properties are approximately the same. So I'm still on the fence about which algorithm is better. I can see why the custom hash was adopted for strings, since your typical AA tends to have short alphabetic keys, and something like SuperFastHash is probably overkill. But for longer keys, SuperFastHash is better.
Mar 22 2012
On Mar 20, 2012, at 10:10 PM, H. S. Teoh wrote:On Wed, Mar 21, 2012 at 03:55:49PM +1100, Daniel Murphy wrote:It"H. S. Teoh" <hsteoh quickfur.ath.cx> wrote in message=20 news:mailman.917.1332250604.4860.digitalmars-d puremagic.com...[...][...]Then the question is, what should be the fix? =20 Currently, char[] and string have getHash defined in rt.typeinfo.ti_Ad.TypeInfo_Aa, which appears to be a custom hashing algorithm, whereas const(char)[] uses TypeInfo_Array.getHash, which uses rt.util.hash.hashOf. Which one should be used?=20 I assume the custom hashing routine is better/faster? If so, that one. I'd assume that getHash not working the same for const strings is an oversight.=20[...] =20 Here's the current hashing code for char[] and string: =20 foreach (char c; s) hash =3D hash * 11 + c; =20 For const(char)[], it's rt.util.hash.hashOf, which is Paul Hsieh's SuperFastHash algorithm with very good hash distribution properties. =does seem to involve a lot more operations that the simple loop above, though; so I assume the above simple loop was chosen because hashing strings are a very common operation and, for the most part, only need =asimple hash function. =20 So I'm kinda leaning towards SuperFastHash, but worried about whether =itwill cause performance degradation, in which case we should stick with the simple loop.The thing with hashes is that you usually gain more by getting a good = distribution than you lose by computing the hash. Our AA implementation = should probably store the computed hash value per node if it isn't = already though, so we can compare against that before doing opEquals = when looking up a node, and so it's available when rehashing.=
Mar 22 2012