digitalmars.D - AA rehash threshold

Steven Schveighoffer (24/24) Nov 18 2014 From all I remember about hash tables, you are supposed to minimize the...

Steven Schveighoffer (4/6) Nov 18 2014 BTW, I traced this load factor all the way back to the initial commit of...
deadalnix (16/19) Nov 18 2014 That is the theory, but the real world is sometime different. It

Steven Schveighoffer (12/29) Nov 18 2014 Heh, a load factor of 4 wouldn't work too well there :)

Jerry Quinn (16/28) Nov 20 2014 This works nicely for small types, but has gotchas. For example, if

Steven Schveighoffer (11/39) Nov 20 2014 Hm.. the bucket entry will simply be what a node is now. You are

Jerry Quinn (7/20) Nov 20 2014 Almost. You do need a way to indicate that the bucket is empty. So you

Steven Schveighoffer (10/31) Nov 20 2014 It's easy enough to set the "next" pointer to some invalid-but-not-null

Martin Nowak (6/7) Nov 18 2014 We'll throw the current implementation away and implement an open
Sean Kelly (5/11) Nov 18 2014 I'd like this to be configurable once we have the new and

Steven Schveighoffer <schveiguy yahoo.com> writes:

 From all I remember about hash tables, you are supposed to minimize the 
number of collisions when adding/looking up data. This helps keep the 
O(1) lookup property intact.

Reading wikipedia 
(http://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing):

"To keep the load factor under a certain limit, e.g. under 3/4, many 
table implementations expand the table when items are inserted. For 
example, in Java's HashMap class the default load factor threshold for 
table expansion is 0.75 and in Python's dict, table size is resized when 
load factor is greater than 2/3."

Well, I just realized that D AA's "load factor" is 4 (see below). That 
means, for a hash table with 4 buckets, you need to add 17 items to 
generate a rehash. It also means that you are guaranteed to have one 
bucket with at least 4 elements on it before a rehash. And that's on an 
evenly spread hash table. In most cases, we would see buckets with 
around 8-10 elements.

I don't think this is optimal, but I'm nervous about correcting this -- 
going from a load factor of 4 to a load factor of 0.75 would be a 
tremendous jump. I'm not certain how the interaction with the GC and the 
list of primes in the AA runtime will fare with this new load.

Any thoughts from the experts out there?

-Steve

Code that uses factor of 4: 
https://github.com/D-Programming-Language/druntime/blob/9e410b0bd91ead4439bf304c28c2fce511df436a/src/rt/aaA.d#L221

Nov 18 2014

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 11/18/14 9:16 PM, Steven Schveighoffer wrote:

 Code that uses factor of 4:
 https://github.com/D-Programming-Language/druntime/blob/9e410b0bd91ead4439bf304c28c2fce511df436a/src/rt/aaA.d#L221

BTW, I traced this load factor all the way back to the initial commit of 
druntime. It's been this way since 2009.

-Steve

Nov 18 2014

"deadalnix" <deadalnix gmail.com> writes:

On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven
Schveighoffer wrote:
 From all I remember about hash tables, you are supposed to 
 minimize the number of collisions when adding/looking up data. 
 This helps keep the O(1) lookup property intact.

That is the theory, but the real world is sometime different. It
depends on how you resolve collisions. If you do it via some kind
of linked list, then yes, you must remove all collisions.

But if you do it by putting the element in a slot close to its
actual position, then thing gets different. In that case, you
don't want to minimize collision, but to minimize the distance to
the object, in order for it to be in the cache line.

After all, unless your hash table is very small, the first hit is
most likely a cache miss, meaning ~300 cycles. at this point the
cache line is in L1, with a 3 cycle access time. So accessing
several element in the same cache line is often preferable.

You may want to read about robin hood hash table for a simple
example of this. These hash tables are fast even with a very high
collision rate.

Nov 18 2014

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 11/18/14 9:46 PM, deadalnix wrote:
 On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven
 Schveighoffer wrote:
 From all I remember about hash tables, you are supposed to minimize
 the number of collisions when adding/looking up data. This helps keep
 the O(1) lookup property intact.

 That is the theory, but the real world is sometime different. It
 depends on how you resolve collisions. If you do it via some kind
 of linked list, then yes, you must remove all collisions.

This is what AAs currently do.

 But if you do it by putting the element in a slot close to its
 actual position, then thing gets different. In that case, you
 don't want to minimize collision, but to minimize the distance to
 the object, in order for it to be in the cache line.

Heh, a load factor of 4 wouldn't work too well there :)

I do recall reading about all the different cache miss handling. The 
wikipedia article I referenced has a lot of good info.

 After all, unless your hash table is very small, the first hit is
 most likely a cache miss, meaning ~300 cycles. at this point the
 cache line is in L1, with a 3 cycle access time. So accessing
 several element in the same cache line is often preferable.

In the case of AAs, all bucket elements are pointers, so this point is 
moot for our purposes -- one may have to jump outside the cache to find 
the first element. A good improvement (this is likely to come with a 
full library type) is to instead store an inline element for each bucket 
entry instead of just a pointer to an element. I recall when writing 
dcollections, this added a significant speedup.

-Steve

Nov 18 2014

Jerry Quinn <jlquinn optonline.net> writes:

Steven Schveighoffer <schveiguy yahoo.com> writes:

 On 11/18/14 9:46 PM, deadalnix wrote:
 After all, unless your hash table is very small, the first hit is
 most likely a cache miss, meaning ~300 cycles. at this point the
 cache line is in L1, with a 3 cycle access time. So accessing
 several element in the same cache line is often preferable.

 In the case of AAs, all bucket elements are pointers, so this point is moot
 for our purposes -- one may have to jump outside the cache to find the first
 element. A good improvement (this is likely to come with a full library type)
 is to instead store an inline element for each bucket entry instead of just a
 pointer to an element. I recall when writing dcollections, this added a
 significant speedup.

This works nicely for small types, but has gotchas.  For example, if
you've got an AA of ints, what value indicates that this is a value
folded into the bucket entry vs actually being a pointer?  You'll need
extra space to make sure it's safe.  Alignment is another concern.
Let's say you have bool for the entry/element flag.  With ints you've
doubled the size of the bucket array.

If you have large objects as the elements, you'll waste space.  Probably
the implementation needs to be switchable depending on the size of the object.

As an aside, I find it invaluable to have statistics about hash table
buckets, ala C++11.  When building custom hash functions and hash table
implementations, it's almost necessary.

I'd like to see bucket stats available in some manner for built-in AAs.
If they're going away for a library type, we should have such stat info
as part of the API.

Jerry

Nov 20 2014

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 11/20/14 5:30 PM, Jerry Quinn wrote:
 Steven Schveighoffer <schveiguy yahoo.com> writes:

 On 11/18/14 9:46 PM, deadalnix wrote:
 After all, unless your hash table is very small, the first hit is
 most likely a cache miss, meaning ~300 cycles. at this point the
 cache line is in L1, with a 3 cycle access time. So accessing
 several element in the same cache line is often preferable.

 In the case of AAs, all bucket elements are pointers, so this point is moot
 for our purposes -- one may have to jump outside the cache to find the first
 element. A good improvement (this is likely to come with a full library type)
 is to instead store an inline element for each bucket entry instead of just a
 pointer to an element. I recall when writing dcollections, this added a
 significant speedup.

 This works nicely for small types, but has gotchas.  For example, if
 you've got an AA of ints, what value indicates that this is a value
 folded into the bucket entry vs actually being a pointer?  You'll need
 extra space to make sure it's safe.  Alignment is another concern.
 Let's say you have bool for the entry/element flag.  With ints you've
 doubled the size of the bucket array.

Hm.. the bucket entry will simply be what a node is now. You are 
actually saving space, because you don't need to store that extra 
pointer to the first node.

Where it gets dicey is when you rehash. Some of the nodes will be moved 
into the bucket space, some will move out of it. But I don't think it's 
a big deal, as a rehash doesn't happen very often.

 If you have large objects as the elements, you'll waste space.  Probably
 the implementation needs to be switchable depending on the size of the object.

Yes, this is also a concern. This almost necessitates a minimum load as 
well as a maximum.

 As an aside, I find it invaluable to have statistics about hash table
 buckets, ala C++11.  When building custom hash functions and hash table
 implementations, it's almost necessary.

 I'd like to see bucket stats available in some manner for built-in AAs.
 If they're going away for a library type, we should have such stat info
 as part of the API.

Absolutely all of this could be added to a library type.

-Steve

Nov 20 2014

Jerry Quinn <jlquinn optonline.net> writes:

Steven Schveighoffer <schveiguy yahoo.com> writes:

 On 11/20/14 5:30 PM, Jerry Quinn wrote:
 This works nicely for small types, but has gotchas.  For example, if
 you've got an AA of ints, what value indicates that this is a value
 folded into the bucket entry vs actually being a pointer?  You'll need
 extra space to make sure it's safe.  Alignment is another concern.
 Let's say you have bool for the entry/element flag.  With ints you've
 doubled the size of the bucket array.

 Hm.. the bucket entry will simply be what a node is now. You are actually
 saving space, because you don't need to store that extra pointer to the first
 node.

Almost. You do need a way to indicate that the bucket is empty.  So you
need another flag though it can be smaller than a pointer (for 64 bit).  But
the overhead is low except for things like ints.

 Where it gets dicey is when you rehash. Some of the nodes will be moved into
 the bucket space, some will move out of it. But I don't think it's a big deal,
 as a rehash doesn't happen very often.

True.  Umm, that does raise a question - are addresses of AA contents
required to be stable?  C++11 requires that of its hashtables.  But it
ties your hands when trying to write customized hash tables.

Nov 20 2014

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 11/20/14 8:42 PM, Jerry Quinn wrote:
 Steven Schveighoffer <schveiguy yahoo.com> writes:

 On 11/20/14 5:30 PM, Jerry Quinn wrote:
 This works nicely for small types, but has gotchas.  For example, if
 you've got an AA of ints, what value indicates that this is a value
 folded into the bucket entry vs actually being a pointer?  You'll need
 extra space to make sure it's safe.  Alignment is another concern.
 Let's say you have bool for the entry/element flag.  With ints you've
 doubled the size of the bucket array.

 Hm.. the bucket entry will simply be what a node is now. You are actually
 saving space, because you don't need to store that extra pointer to the first
 node.

 Almost. You do need a way to indicate that the bucket is empty.  So you
 need another flag though it can be smaller than a pointer (for 64 bit).  But
 the overhead is low except for things like ints.

It's easy enough to set the "next" pointer to some invalid-but-not-null 
value. Hackish, but it would work ;)

 Where it gets dicey is when you rehash. Some of the nodes will be moved into
 the bucket space, some will move out of it. But I don't think it's a big deal,
 as a rehash doesn't happen very often.

 True.  Umm, that does raise a question - are addresses of AA contents
 required to be stable?  C++11 requires that of its hashtables.  But it
 ties your hands when trying to write customized hash tables.

I don't think we make any guarantees about that. As it stands now, 
whatever is implemented seems to be what people think is the spec. But 
it's not always true.

We have so many more options when AA's become a library type. We could 
say "the builtin AA is this flavor of hashtable", but provide options 
for many different flavors with the same template if you want specific ones.

-Steve

Nov 20 2014

"Martin Nowak" <code dawg.eu> writes:

On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven 
Schveighoffer wrote:
 Any thoughts from the experts out there?

We'll throw the current implementation away and implement an open 
addressing AA once we get to it. So don't worry about the load 
factor right now.
https://github.com/D-Programming-Language/dmd/pull/4088#issuecomment-63152848

Nov 18 2014

"Sean Kelly" <sean invisibleduck.org> writes:

On Wednesday, 19 November 2014 at 02:16:36 UTC, Steven 
Schveighoffer wrote:
 Well, I just realized that D AA's "load factor" is 4 (see 
 below). That means, for a hash table with 4 buckets, you need 
 to add 17 items to generate a rehash. It also means that you 
 are guaranteed to have one bucket with at least 4 elements on 
 it before a rehash. And that's on an evenly spread hash table. 
 In most cases, we would see buckets with around 8-10 elements.

I'd like this to be configurable once we have the new and 
improved AA implementation.  4 does seem a bit high though.  The 
default max load factor for C++ unordered containers is 1.0.

Nov 18 2014

D Programming

C/C++ Programming

Other

digitalmars.D - AA rehash threshold