digitalmars.D - AA rehash threshold
- Steven Schveighoffer (24/24) Nov 18 2014 From all I remember about hash tables, you are supposed to minimize the...
- Steven Schveighoffer (4/6) Nov 18 2014 BTW, I traced this load factor all the way back to the initial commit of...
- deadalnix (16/19) Nov 18 2014 That is the theory, but the real world is sometime different. It
- Steven Schveighoffer (12/29) Nov 18 2014 Heh, a load factor of 4 wouldn't work too well there :)
- Jerry Quinn (16/28) Nov 20 2014 This works nicely for small types, but has gotchas. For example, if
- Steven Schveighoffer (11/39) Nov 20 2014 Hm.. the bucket entry will simply be what a node is now. You are
- Jerry Quinn (7/20) Nov 20 2014 Almost. You do need a way to indicate that the bucket is empty. So you
- Steven Schveighoffer (10/31) Nov 20 2014 It's easy enough to set the "next" pointer to some invalid-but-not-null
- Martin Nowak (6/7) Nov 18 2014 We'll throw the current implementation away and implement an open
- Sean Kelly (5/11) Nov 18 2014 I'd like this to be configurable once we have the new and
From all I remember about hash tables, you are supposed to minimize the number of collisions when adding/looking up data. This helps keep the O(1) lookup property intact. Reading wikipedia (http://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing): "To keep the load factor under a certain limit, e.g. under 3/4, many table implementations expand the table when items are inserted. For example, in Java's HashMap class the default load factor threshold for table expansion is 0.75 and in Python's dict, table size is resized when load factor is greater than 2/3." Well, I just realized that D AA's "load factor" is 4 (see below). That means, for a hash table with 4 buckets, you need to add 17 items to generate a rehash. It also means that you are guaranteed to have one bucket with at least 4 elements on it before a rehash. And that's on an evenly spread hash table. In most cases, we would see buckets with around 8-10 elements. I don't think this is optimal, but I'm nervous about correcting this -- going from a load factor of 4 to a load factor of 0.75 would be a tremendous jump. I'm not certain how the interaction with the GC and the list of primes in the AA runtime will fare with this new load. Any thoughts from the experts out there? -Steve Code that uses factor of 4: https://github.com/D-Programming-Language/druntime/blob/9e410b0bd91ead4439bf304c28c2fce511df436a/src/rt/aaA.d#L221
Nov 18 2014
On 11/18/14 9:16 PM, Steven Schveighoffer wrote:Code that uses factor of 4: https://github.com/D-Programming-Language/druntime/blob/9e410b0bd91ead4439bf304c28c2fce511df436a/src/rt/aaA.d#L221BTW, I traced this load factor all the way back to the initial commit of druntime. It's been this way since 2009. -Steve
Nov 18 2014
On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven Schveighoffer wrote:From all I remember about hash tables, you are supposed to minimize the number of collisions when adding/looking up data. This helps keep the O(1) lookup property intact.That is the theory, but the real world is sometime different. It depends on how you resolve collisions. If you do it via some kind of linked list, then yes, you must remove all collisions. But if you do it by putting the element in a slot close to its actual position, then thing gets different. In that case, you don't want to minimize collision, but to minimize the distance to the object, in order for it to be in the cache line. After all, unless your hash table is very small, the first hit is most likely a cache miss, meaning ~300 cycles. at this point the cache line is in L1, with a 3 cycle access time. So accessing several element in the same cache line is often preferable. You may want to read about robin hood hash table for a simple example of this. These hash tables are fast even with a very high collision rate.
Nov 18 2014
On 11/18/14 9:46 PM, deadalnix wrote:On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven Schveighoffer wrote:This is what AAs currently do.From all I remember about hash tables, you are supposed to minimize the number of collisions when adding/looking up data. This helps keep the O(1) lookup property intact.That is the theory, but the real world is sometime different. It depends on how you resolve collisions. If you do it via some kind of linked list, then yes, you must remove all collisions.But if you do it by putting the element in a slot close to its actual position, then thing gets different. In that case, you don't want to minimize collision, but to minimize the distance to the object, in order for it to be in the cache line.Heh, a load factor of 4 wouldn't work too well there :) I do recall reading about all the different cache miss handling. The wikipedia article I referenced has a lot of good info.After all, unless your hash table is very small, the first hit is most likely a cache miss, meaning ~300 cycles. at this point the cache line is in L1, with a 3 cycle access time. So accessing several element in the same cache line is often preferable.In the case of AAs, all bucket elements are pointers, so this point is moot for our purposes -- one may have to jump outside the cache to find the first element. A good improvement (this is likely to come with a full library type) is to instead store an inline element for each bucket entry instead of just a pointer to an element. I recall when writing dcollections, this added a significant speedup. -Steve
Nov 18 2014
Steven Schveighoffer <schveiguy yahoo.com> writes:On 11/18/14 9:46 PM, deadalnix wrote:This works nicely for small types, but has gotchas. For example, if you've got an AA of ints, what value indicates that this is a value folded into the bucket entry vs actually being a pointer? You'll need extra space to make sure it's safe. Alignment is another concern. Let's say you have bool for the entry/element flag. With ints you've doubled the size of the bucket array. If you have large objects as the elements, you'll waste space. Probably the implementation needs to be switchable depending on the size of the object. As an aside, I find it invaluable to have statistics about hash table buckets, ala C++11. When building custom hash functions and hash table implementations, it's almost necessary. I'd like to see bucket stats available in some manner for built-in AAs. If they're going away for a library type, we should have such stat info as part of the API. JerryAfter all, unless your hash table is very small, the first hit is most likely a cache miss, meaning ~300 cycles. at this point the cache line is in L1, with a 3 cycle access time. So accessing several element in the same cache line is often preferable.In the case of AAs, all bucket elements are pointers, so this point is moot for our purposes -- one may have to jump outside the cache to find the first element. A good improvement (this is likely to come with a full library type) is to instead store an inline element for each bucket entry instead of just a pointer to an element. I recall when writing dcollections, this added a significant speedup.
Nov 20 2014
On 11/20/14 5:30 PM, Jerry Quinn wrote:Steven Schveighoffer <schveiguy yahoo.com> writes:Hm.. the bucket entry will simply be what a node is now. You are actually saving space, because you don't need to store that extra pointer to the first node. Where it gets dicey is when you rehash. Some of the nodes will be moved into the bucket space, some will move out of it. But I don't think it's a big deal, as a rehash doesn't happen very often.On 11/18/14 9:46 PM, deadalnix wrote:This works nicely for small types, but has gotchas. For example, if you've got an AA of ints, what value indicates that this is a value folded into the bucket entry vs actually being a pointer? You'll need extra space to make sure it's safe. Alignment is another concern. Let's say you have bool for the entry/element flag. With ints you've doubled the size of the bucket array.After all, unless your hash table is very small, the first hit is most likely a cache miss, meaning ~300 cycles. at this point the cache line is in L1, with a 3 cycle access time. So accessing several element in the same cache line is often preferable.In the case of AAs, all bucket elements are pointers, so this point is moot for our purposes -- one may have to jump outside the cache to find the first element. A good improvement (this is likely to come with a full library type) is to instead store an inline element for each bucket entry instead of just a pointer to an element. I recall when writing dcollections, this added a significant speedup.If you have large objects as the elements, you'll waste space. Probably the implementation needs to be switchable depending on the size of the object.Yes, this is also a concern. This almost necessitates a minimum load as well as a maximum.As an aside, I find it invaluable to have statistics about hash table buckets, ala C++11. When building custom hash functions and hash table implementations, it's almost necessary. I'd like to see bucket stats available in some manner for built-in AAs. If they're going away for a library type, we should have such stat info as part of the API.Absolutely all of this could be added to a library type. -Steve
Nov 20 2014
Steven Schveighoffer <schveiguy yahoo.com> writes:On 11/20/14 5:30 PM, Jerry Quinn wrote:Almost. You do need a way to indicate that the bucket is empty. So you need another flag though it can be smaller than a pointer (for 64 bit). But the overhead is low except for things like ints.This works nicely for small types, but has gotchas. For example, if you've got an AA of ints, what value indicates that this is a value folded into the bucket entry vs actually being a pointer? You'll need extra space to make sure it's safe. Alignment is another concern. Let's say you have bool for the entry/element flag. With ints you've doubled the size of the bucket array.Hm.. the bucket entry will simply be what a node is now. You are actually saving space, because you don't need to store that extra pointer to the first node.Where it gets dicey is when you rehash. Some of the nodes will be moved into the bucket space, some will move out of it. But I don't think it's a big deal, as a rehash doesn't happen very often.True. Umm, that does raise a question - are addresses of AA contents required to be stable? C++11 requires that of its hashtables. But it ties your hands when trying to write customized hash tables.
Nov 20 2014
On 11/20/14 8:42 PM, Jerry Quinn wrote:Steven Schveighoffer <schveiguy yahoo.com> writes:It's easy enough to set the "next" pointer to some invalid-but-not-null value. Hackish, but it would work ;)On 11/20/14 5:30 PM, Jerry Quinn wrote:Almost. You do need a way to indicate that the bucket is empty. So you need another flag though it can be smaller than a pointer (for 64 bit). But the overhead is low except for things like ints.This works nicely for small types, but has gotchas. For example, if you've got an AA of ints, what value indicates that this is a value folded into the bucket entry vs actually being a pointer? You'll need extra space to make sure it's safe. Alignment is another concern. Let's say you have bool for the entry/element flag. With ints you've doubled the size of the bucket array.Hm.. the bucket entry will simply be what a node is now. You are actually saving space, because you don't need to store that extra pointer to the first node.I don't think we make any guarantees about that. As it stands now, whatever is implemented seems to be what people think is the spec. But it's not always true. We have so many more options when AA's become a library type. We could say "the builtin AA is this flavor of hashtable", but provide options for many different flavors with the same template if you want specific ones. -SteveWhere it gets dicey is when you rehash. Some of the nodes will be moved into the bucket space, some will move out of it. But I don't think it's a big deal, as a rehash doesn't happen very often.True. Umm, that does raise a question - are addresses of AA contents required to be stable? C++11 requires that of its hashtables. But it ties your hands when trying to write customized hash tables.
Nov 20 2014
On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven Schveighoffer wrote:Any thoughts from the experts out there?We'll throw the current implementation away and implement an open addressing AA once we get to it. So don't worry about the load factor right now. https://github.com/D-Programming-Language/dmd/pull/4088#issuecomment-63152848
Nov 18 2014
On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven Schveighoffer wrote:Well, I just realized that D AA's "load factor" is 4 (see below). That means, for a hash table with 4 buckets, you need to add 17 items to generate a rehash. It also means that you are guaranteed to have one bucket with at least 4 elements on it before a rehash. And that's on an evenly spread hash table. In most cases, we would see buckets with around 8-10 elements.I'd like this to be configurable once we have the new and improved AA implementation. 4 does seem a bit high though. The default max load factor for C++ unordered containers is 1.0.
Nov 18 2014