digitalmars.D - storing the hash multiplier instead of the hash value
- Andrei Alexandrescu (19/19) Mar 22 2010 Currently, each entry in a D hashtable stores the hash of the key for
- Daniel Keep (15/15) Mar 22 2010 How about just *fixing* the hashtable so it doesn't generate false
- Andrei Alexandrescu (14/29) Mar 22 2010 This was of course the first thing Walter and I discussed. It turns out
- Walter Bright (4/7) Mar 22 2010 Unfortunately, it won't be much of a win. Memory allocation is done in b...
- Andrei Alexandrescu (5/12) Mar 22 2010 As we discussed, if nodes are allocated in bunches, you could store 5
- Walter Bright (3/6) Mar 22 2010 Right, but the downside is that the nodes will not be collected unless a...
- Andrei Alexandrescu (9/22) Mar 22 2010 One more idea before I forget: the bunches approach requires using a
- Andrei Alexandrescu (7/30) Mar 22 2010 And the last idea before I forget (just talked to Walter about it) is
- Robert Jacques (5/10) Mar 22 2010 On Mon, 22 Mar 2010 19:00:57 -0300, Andrei Alexandrescu
- Andrei Alexandrescu (4/16) Mar 22 2010 Those are functions that you can call several times. The gc will keep
- Michael Rynn (22/51) Mar 23 2010 I just committed a aaA.d version that uses some heap node memory
- Andrei Alexandrescu (7/56) Mar 23 2010 Absolutely!
- Walter Bright (8/11) Mar 22 2010 All I did was switch from a tree used to resolve a bucket collision with...
- Andrei Alexandrescu (4/16) Mar 22 2010 In at least one application, the new hash performed on a par with or
- Robert Jacques (4/11) Mar 22 2010 What about the precise GC patch?
- Leandro Lucarella (7/15) Mar 23 2010 Well... there is an implementation hanging in bugzilla...
- Fawzi Mohamed (11/27) Mar 23 2010 I like murmurhash, that I what I put into tango, see
- Andrei Alexandrescu (14/27) Mar 23 2010 Thanks, Fawzi, that's great.
- Fawzi Mohamed (13/15) Mar 23 2010 that would be nice, that is one the main reasons Steven implementation
- Andrei Alexandrescu (11/25) Mar 23 2010 I have much loftier goals, which scare Walter sometimes :o). In the long...
- Walter Bright (2/5) Mar 23 2010 It would mean the end of precompiled libraries.
- Walter Bright (2/19) Mar 23 2010 D2 already allows this.
- Fawzi Mohamed (2/18) Mar 23 2010 ok I did not know that was possible.
- Walter Bright (3/8) Mar 23 2010 Just overload the toHash() function for your user-defined type to be wha...
- Fawzi Mohamed (7/15) Mar 23 2010 I know, maybe I have not expressed myself clearly, but this overridden
- Sea Kelly (5/41) Mar 30 2010 I like murmurhash as well, and I'd even had it in druntime briefly, but
- Robert Jacques (6/10) Mar 30 2010 That seems weird. The website it lists it as:
- Sean Kelly (4/20) Apr 11 2010 That must be new, I couldn't find any license at all at the time. I
- Lionello Lunesu (8/32) Mar 22 2010 Probably not. hash is an 'int' and this 'm' will still use up the same
Currently, each entry in a D hashtable stores the hash of the key for efficiency purposes. There is a bit of redundancy: as soon as you have entered a hash bucket you know that hashkey % tablesize is a specific number, call it k. But k is not enough information to restore the hash, so the actual hash gets stored for each entry.

I'm thinking of exploiting that redundancy by storing the hash multiplier, not the hash value. Instead of storing h in each slot, store m = h / tablesize. Then you can restore the actual h by writing:

restored_h = m * tablesize + k;

k is known for each given bucket (it's the index of the bucket) and m is what gets stored per entry.

What's the advantage of this approach? Well, the advantage is that m is a small number. Any hash function will try to disperse the hash value as much as possible across the 32 available bits. But m will be a smaller number, and therefore its values will be more grouped and will engender fewer false pointers.

Would this help?

Andrei
Mar 22 2010
How about just *fixing* the hashtable so it doesn't generate false pointers in the first place? Maybe I'm just strange, but that seems like a more direct, efficacious solution...

As for the redundancy, I was under the impression that the hash was cached to make resizing more efficient: if the number of buckets changes, you have to re-insert all the nodes. If you don't store the hash, you have to recompute it (which is expensive for anything other than ints, wherein D laughably 'hashes' by returning the int's value).

Given that I already think this is a silly way to go about solving the false pointer issue, I don't see any benefit to doing this other than to give the CPU something to do while it waits for memory reads.

Sadly, I lack the background to make any sort of objective judgement of the hash function *itself*, so I'll just reiterate what I've heard repeated to me over and over again by numerous people: D's builtin hash function sucks like a black hole.
Mar 22 2010
On 03/22/2010 02:50 PM, Daniel Keep wrote:
> How about just *fixing* the hashtable so it doesn't generate false pointers in the first place? Maybe I'm just strange, but that seems like a more direct, efficacious solution...

This was of course the first thing Walter and I discussed. It turns out that that would necessitate precise GC, which we don't have yet.

> As for the redundancy, I was under the impression that the hash was cached to make resizing more efficient: if the number of buckets changes, you have to re-insert all the nodes. If you don't store the hash, you have to recompute it (which is expensive for anything other than ints, wherein D laughably 'hashes' by returning the int's value).

What would be a better hashing scheme for ints?

> Given that I already think this is a silly way to go about solving the false pointer issue, I don't see any benefit to doing this other than to give the CPU something to do while it waits for memory reads. Sadly, I lack the background to make any sort of objective judgement of the hash function *itself*, so I'll just reiterate what I've heard repeated to me over and over again by numerous people: D's builtin hash function sucks like a black hole.

Then you might be glad to hear that Walter has recently improved the default hashtable implementation significantly (at least for heavy-duty applications). We're using Paul Hsieh's hash function for strings and general memory, which is no slouch.

http://www.dsource.org/projects/druntime/browser/trunk/src/rt/util/hash.d

Better suggestions are always welcome. For integrals I'm unclear on what we could use to make things better. (Clearly we could and should get rid of the extraneous field.)

Andrei
Mar 22 2010
Andrei Alexandrescu wrote:
> Better suggestions are always welcome. For integrals I'm unclear on what we could use to make things better. (Clearly we could and should get rid of the extraneous field.)

Unfortunately, it won't be much of a win. Memory allocation is done in buckets of size 16, 32, 64, etc. Reducing the node size for a uint[uint] from 16 to 12 saves no memory at all.
Mar 22 2010
On 03/22/2010 03:36 PM, Walter Bright wrote:
> Unfortunately, it won't be much of a win. Memory allocation is done in buckets of size 16, 32, 64, etc. Reducing the node size for a uint[uint] from 16 to 12 saves no memory at all.

As we discussed, if nodes are allocated in bunches, you could store 5 nodes in a 64-byte block instead of 4. That's more than a 25% increase in effectiveness because the per-block bookkeeping is also slashed by 5.

Andrei
Mar 22 2010
Andrei Alexandrescu wrote:
> As we discussed, if nodes are allocated in bunches, you could store 5 nodes in a 64-byte block instead of 4. That's more than a 25% increase in effectiveness because the per-block bookkeeping is also slashed by 5.

Right, but the downside is that the nodes will not be collected unless all their neighbors are also not used.
Mar 22 2010
On 03/22/2010 04:03 PM, Andrei Alexandrescu wrote:
> On 03/22/2010 03:36 PM, Walter Bright wrote:
>> Unfortunately, it won't be much of a win. Memory allocation is done in buckets of size 16, 32, 64, etc. Reducing the node size for a uint[uint] from 16 to 12 saves no memory at all.
>
> As we discussed, if nodes are allocated in bunches, you could store 5 nodes in a 64-byte block instead of 4. That's more than a 25% increase in effectiveness because the per-block bookkeeping is also slashed by 5.

One more idea before I forget: the bunches approach requires using a freelist for nodes that are available but not used. (Freelists are a good idea whether or not bunches are used.) One good possibility is to store the head of the freelist in a static variable, such that e.g. all int[string] instances use the same freelist. And there's no thread contention issue because each thread has its own freelist.

Andrei
Mar 22 2010
On 03/22/2010 04:59 PM, Andrei Alexandrescu wrote:
> One more idea before I forget: the bunches approach requires using a freelist for nodes that are available but not used. (Freelists are a good idea whether or not bunches are used.) One good possibility is to store the head of the freelist in a static variable, such that e.g. all int[string] instances use the same freelist. And there's no thread contention issue because each thread has its own freelist.

And the last idea before I forget (just talked to Walter about it) is that the GC could and should offer functions:

GC.preCollect(void delegate());
GC.postCollect(void delegate());

That way we can effectively implement weak pointers.

Andrei
Mar 22 2010
On Mon, 22 Mar 2010 19:00:57 -0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
[snip]
> And the last idea before I forget (just talked to Walter about it) is that the GC could and should offer functions:
>
> GC.preCollect(void delegate());
> GC.postCollect(void delegate());
>
> That way we can effectively implement weak pointers.

Pre and post collect should be an array of delegates. Otherwise only one person's weak pointer library would work.
Mar 22 2010
On 03/22/2010 08:32 PM, Robert Jacques wrote:
> On Mon, 22 Mar 2010 19:00:57 -0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> [snip]
>> That way we can effectively implement weak pointers.
>
> Pre and post collect should be an array of delegates. Otherwise only one person's weak pointer library would work.

Those are functions that you can call several times. The GC will indeed keep a container of delegates.

Andrei
Mar 22 2010
On Mon, 22 Mar 2010 16:59:36 -0500, Andrei Alexandrescu wrote:
> One more idea before I forget: the bunches approach requires using a freelist for nodes that are available but not used. (Freelists are a good idea whether or not bunches are used.) One good possibility is to store the head of the freelist in a static variable, such that e.g. all int[string] instances use the same freelist. And there's no thread contention issue because each thread has its own freelist.

I just committed an aaA.d version that uses some heap node memory management, although it's on a per class instance basis. Also it dispenses with separate hash storage for keys <= size_t.

Sharing would mean some working out of which classes share the same sized node blocks. Much easier to implement class sharing using templates.

See the code here: http://www.dsource.org/projects/aa/browser/trunk/druntime/aaA.d
See the benchmarks and comments here: http://www.dsource.org/projects/aa/wiki/DrunTimeAA

The result squeezes some extra performance out for integer or smaller sized keys (about 20% faster for lookups).

For druntime, the generic C interface constrains the kinds of specialized AA versions that can be instantiated using run-time TypeInfo and var-arg calls. Maybe class / interface direct calls would work better. Just imagine at runtime looking at the TypeInfo_AssociativeArray and trying to work out exactly which template is going to be instantiated. With a low-code-size, one-or-few-versions-fit-all approach, that performance sacrifice is unavoidable in a basic runtime library. But in template land, options and combinations and gains abound.

--- Michael Rynn
Mar 23 2010
On 03/23/2010 08:11 AM, Michael Rynn wrote:
> [snip]
> See the code here: http://www.dsource.org/projects/aa/browser/trunk/druntime/aaA.d
> See the benchmarks and comments here: http://www.dsource.org/projects/aa/wiki/DrunTimeAA
>
> The result squeezes some extra performance out for integer or smaller sized keys (about 20% faster for lookups).

Absolutely! This is solid work.

It would be interesting to assess how your implementation and the old built-in hashes compare with Walter's new implementation, which uses a singly-linked list instead of a tree. Do you have what it takes to check out and build druntime and phobos?

Andrei
Mar 23 2010
Andrei Alexandrescu wrote:
> Then you might be glad to hear that Walter has recently improved the default hashtable implementation significantly (at least for heavy-duty applications).

All I did was switch from a tree used to resolve bucket collisions to a singly linked list. The improvement comes from reduced memory consumption, which for uint[uint] changes a node from 20 bytes to 16, thus halving the memory used. If the hash node still uses over 16 bytes, I doubt there will be any improvement at all.

Note that the linked list implementation is equivalent to g++'s hash_map implementation.
Mar 22 2010
On 03/22/2010 03:54 PM, Walter Bright wrote:
> All I did was switch from a tree used to resolve bucket collisions to a singly linked list. [snip] Note that the linked list implementation is equivalent to g++'s hash_map implementation.

In at least one application, the new hash performed on a par with or better than g++'s hash_map.

Andrei
Mar 22 2010
On Mon, 22 Mar 2010 17:04:56 -0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> This was of course the first thing Walter and I discussed. It turns out that that would necessitate precise GC, which we don't have yet.

What about the precise GC patch?
http://d.puremagic.com/issues/show_bug.cgi?id=3463
Mar 22 2010
Andrei Alexandrescu, on March 22 at 15:04 you wrote:
> This was of course the first thing Walter and I discussed. It turns out that that would necessitate precise GC, which we don't have yet.

Well... there is an implementation hanging in bugzilla...

--
Leandro Lucarella (AKA luca)                     http://llucax.com.ar/
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
Mar 23 2010
On 22-mar-10, at 21:04, Andrei Alexandrescu wrote:
> Then you might be glad to hear that Walter has recently improved the default hashtable implementation significantly (at least for heavy-duty applications). We're using Paul Hsieh's hash function for strings and general memory, which is no slouch.
>
> http://www.dsource.org/projects/druntime/browser/trunk/src/rt/util/hash.d
>
> Better suggestions are always welcome. For integrals I'm unclear on what we could use to make things better. (Clearly we could and should get rid of the extraneous field.)

I like murmurhash; that is what I put into tango, see
http://www.dsource.org/projects/tango/browser/trunk/tango/core/rt/compiler/util/hash.d

All of that file is based on freely available code and written by me, and I give my code with whatever license would be needed.

I think that the public interface should be exposed (or re-exposed) somewhere to the outside so that one can easily create efficient hashes for user defined types. For example it is nice to be able to chain hash functions (something that the default one does not allow).

Fawzi
Mar 23 2010
On 03/23/2010 12:08 PM, Fawzi Mohamed wrote:
> I like murmurhash; that is what I put into tango, see http://www.dsource.org/projects/tango/browser/trunk/tango/core/rt/compiler/util/hash.d

Thanks, Fawzi, that's great. I'm trying to collect evidence of MurmurHash's effectiveness versus Hsieh's SuperFastHash. I seemed to find some:

http://tanjent.livejournal.com/756623.html

I also found what seems to be a thorough benchmark:

http://www.strchr.com/hash_functions

Looks pretty solid to me. The benchmarks look only at strings, not at typical workloads on void* (e.g. arbitrary objects containing integers and pointers and whatnot). Hsieh's hash is smack in the middle, whereas Murmur2 is fourth, faster by 10%.

> I think that the public interface should be exposed (or re-exposed) somewhere to the outside so that one can easily create efficient hashes for user defined types. For example it is nice to be able to chain hash functions (something that the default one does not allow).

What I'm pushing for as of now is to move the associative array definition from opacity into templated goodies in object_.d.

Andrei
Mar 23 2010
On 23-mar-10, at 19:04, Andrei Alexandrescu wrote:
> What I'm pushing for as of now is to move the associative array definition from opacity into templated goodies in object_.d.

That would be nice; that is one of the main reasons Steven's implementation is faster.

It would be nice if this were done by the compiler rewriting the calls as calls to "normal" templates, i.e. to a specially named templated struct (AArray for example), so that (syntactic sugar for the name aside) it would be the same as a library implementation. This would have two advantages:
- easy to replace the library implementation, as it would be even less special
- easy to replace the usage in one piece of code with another implementation (well, truth be told, that is rather easy also now)

Fawzi
Mar 23 2010
On 03/23/2010 02:34 PM, Fawzi Mohamed wrote:
> That would be nice; that is one of the main reasons Steven's implementation is faster. [snip]

I have much loftier goals, which scare Walter sometimes :o). In the long term, my plan is to allow object.d to decide on a number of fundamental program-wide choices, such as the use of mark-sweep GC versus reference counting GC. In the wake of my experience with D in heavyset programs, I reckon that there is a necessity for deterministic memory management for a subset of applications.

Walter is afraid that that's going to mean the balkanization of D - everyone will define their own object.d. That may be a risk, but I strongly believe it's a risk worth taking.

Andrei
Mar 23 2010
Andrei Alexandrescu wrote:
> Walter is afraid that that's going to mean the balkanization of D - everyone will define their own object.d. That may be a risk, but I strongly believe it's a risk worth taking.

It would mean the end of precompiled libraries.
Mar 23 2010
Fawzi Mohamed wrote:
> It would be nice if this were done by the compiler rewriting the calls as calls to "normal" templates, i.e. to a specially named templated struct (AArray for example), so that (syntactic sugar for the name aside) it would be the same as a library implementation. [snip]

D2 already allows this.
Mar 23 2010
On 23-mar-10, at 23:07, Walter Bright wrote:
> D2 already allows this.

OK, I did not know that was possible.
Mar 23 2010
Fawzi Mohamed wrote:
> I think that the public interface should be exposed (or re-exposed) somewhere to the outside so that one can easily create efficient hashes for user defined types. For example it is nice to be able to chain hash functions (something that the default one does not allow).

Just overload the toHash() function for your user-defined type to be whatever you want it to be.
Mar 23 2010
On 23-mar-10, at 23:02, Walter Bright wrote:
> Just overload the toHash() function for your user-defined type to be whatever you want it to be.

I know; maybe I have not expressed myself clearly, but this overridden function has to be written. For objects combining various pieces, one has to create a unique hash from the various pieces. The functions I have defined in hash.d help in doing that in such a way that changing a bit anywhere most likely changes the whole hash.
Mar 23 2010
I like murmurhash as well, and I'd even had it in druntime briefly, but it uses a proprietary license and I couldn't get permission from the author to use it. It's also an order of magnitude slower on sparc than x86.

Fawzi Mohamed <fawzi gmx.ch> wrote:
> I like murmurhash; that is what I put into tango, see
> http://www.dsource.org/projects/tango/browser/trunk/tango/core/rt/compiler/util/hash.d
>
> All of that file is based on freely available code and written by me, and I give my code with whatever license would be needed.
> [snip]
Mar 30 2010
On Tue, 30 Mar 2010 16:17:20 -0300, Sean Kelly <sean invisibleduck.org> wrote:
> I like murmurhash as well, and I'd even had it in druntime briefly, but it uses a proprietary license and I couldn't get permission from the author to use it. It's also an order of magnitude slower on sparc than x86.

That seems weird. The website lists it as:
"All code is released to the public domain. For business purposes, Murmurhash is under the MIT license."
Mar 30 2010
"Robert Jacques" <sandford jhu.edu> wrote:
> That seems weird. The website lists it as:
> "All code is released to the public domain. For business purposes, Murmurhash is under the MIT license."

That must be new; I couldn't find any license at all at the time. I emailed the author about it and he never replied, though he'd been helpful with other issues. Or perhaps I just missed it.
Apr 11 2010
On 23-3-2010 1:51, Andrei Alexandrescu wrote:
> [snip]
> What's the advantage of this approach? Well, the advantage is that m is a small number. Any hash function will try to disperse the hash value as much as possible across the 32 available bits. But m will be a smaller number, and therefore its values will be more grouped and will engender fewer false pointers.
>
> Would this help?

Probably not. hash is an 'int' and this 'm' will still use up the same space as an int, I reckon. Even if it ends up being 16-bit and can be put in a short, alignment will probably cause it to consume 32 bits anyway. What's more, the 16-bit access (not to mention the multiplier) will slow the AA down.

I say, leave it be.

L.
Mar 22 2010