digitalmars.D.learn - Hash table element existence check

Illuminati (7/7) Sep 02 2016 I am trying to create a hash table and would like an efficient

Steven Schveighoffer (7/12) Sep 02 2016 You mean you are writing your own hash table, or you want to use a D

Illuminati (14/29) Sep 03 2016 First.

Steven Schveighoffer (15/26) Sep 06 2016 In general, there may not be a way to do this. Are you storing hash

Cauterite (3/6) Sep 03 2016 Just do a regular lookup on the hash? It's an O(1) operation,

Illuminati (3/9) Sep 03 2016 Huh? One can look up fine, but how does one know if the result is

Cauterite (14/25) Sep 03 2016 Okay I think I misinterpreted the question. I believe when I did

Illuminati (16/43) Sep 04 2016 Yeah, and there might not be any "null" keys. If there is, then

Basile B. (8/15) Sep 03 2016 fragmentation is a consequence of the hash function. You should

Illuminati (15/31) Sep 03 2016 I mean memory fragmentation. If the key and value are structs,

Illuminati <Joe Masons.com> writes:

I am trying to create a hash table and would like an efficient 
way to be able to know if an element exists to test for 
collisions.

I could keep a bitarray, but wasting around 12% space. I could 
use pointers(null check) to elements but this creates 
fragmentation. It is not terrible, just curious if anyone has a 
better way?

Sep 02 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 9/2/16 3:38 PM, Illuminati wrote:
 I am trying to create a hash table and would like an efficient way to be
 able to know if an element exists to test for collisions.

You mean you are writing your own hash table, or you want to use a D 
hash table (associative array)?

 I could keep a bitarray, but wasting around 12% space. I could use
 pointers(null check) to elements but this creates fragmentation. It is
 not terrible, just curious if anyone has a better way?

I'm not sure I understand the question. Hash tables have many many many 
different ways to implement. Obviously, marking empty buckets somehow is 
necessary.

-Steve

Sep 02 2016

Illuminati <Joe Masons.com> writes:

On Friday, 2 September 2016 at 19:48:30 UTC, Steven Schveighoffer 
wrote:
 On 9/2/16 3:38 PM, Illuminati wrote:
 I am trying to create a hash table and would like an efficient 
 way to be
 able to know if an element exists to test for collisions.

 You mean you are writing your own hash table, or you want to 
 use a D hash table (associative array)?

First.

 I could keep a bitarray, but wasting around 12% space. I could 
 use
 pointers(null check) to elements but this creates 
 fragmentation. It is
 not terrible, just curious if anyone has a better way?

 I'm not sure I understand the question. Hash tables have many 
 many many different ways to implement. Obviously, marking empty 
 buckets somehow is necessary.

 -Steve

Yes, one must have a way to mark empty "buckets"(I hate that 
term! Seems so banal for some reason ;/)

If using pointers, which is what I use, then null means empty, 
but one can't use inline structs and which then can create data 
fragmentation. If the structs are inline, there is no way to know 
if the element is a valid struct or empty. One could test against 
the .init, say, but I don't think any of that is safe.

Hence a bitmap could be used in that case, but wastes a lot of 
space. I don't see any other way though. Pointers, also, waste 
memory rather than structs inline and effectively is a bitmap but 
does save space for sparse maps.

Sep 03 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 9/3/16 8:30 AM, Illuminati wrote:

 Yes, one must have a way to mark empty "buckets"(I hate that term! Seems
 so banal for some reason ;/)

 If using pointers, which is what I use, then null means empty, but one
 can't use inline structs and which then can create data fragmentation.
 If the structs are inline, there is no way to know if the element is a
 valid struct or empty. One could test against the .init, say, but I
 don't think any of that is safe.

 Hence a bitmap could be used in that case, but wastes a lot of space. I
 don't see any other way though. Pointers, also, waste memory rather than
 structs inline and effectively is a bitmap but does save space for
 sparse maps.

In general, there may not be a way to do this. Are you storing hash 
values to optimize equality check and rehashing? If so, then you may be 
able to flag some hash value as "empty". Other than that, you'd need 
some mechanism to flag a value as a "sentinel" empty value, or store the 
flag inline with the elements.

Note that there are always tradeoffs, which is why hash table has so 
many solutions :)

Using pointers to elements can lead to fragmentation, but what if your 
pointers and elements are stored in the same memory segment, or at least 
2 memory segments (one array for pointers, one for elements)? And when 
you rehash, having inline storage is going to be more costly due to all 
the copying, and storing external references to elements is dangerous.

If there was one best solution, everyone would use it...

-Steve

Sep 06 2016

Cauterite <cauterite gmail.com> writes:

On Friday, 2 September 2016 at 19:38:34 UTC, Illuminati wrote:
 I am trying to create a hash table and would like an efficient 
 way to be able to know if an element exists to test for 
 collisions.

Just do a regular lookup on the hash? It's an O(1) operation, 
like 4 instructions.

Sep 03 2016

Illuminati <Joe Masons.com> writes:

On Saturday, 3 September 2016 at 07:44:28 UTC, Cauterite wrote:
 On Friday, 2 September 2016 at 19:38:34 UTC, Illuminati wrote:
 I am trying to create a hash table and would like an efficient 
 way to be able to know if an element exists to test for 
 collisions.

 Just do a regular lookup on the hash? It's an O(1) operation, 
 like 4 instructions.

Huh? One can look up fine, but how does one know if the result is 
valid or not?

Sep 03 2016

Cauterite <cauterite gmail.com> writes:

On Saturday, 3 September 2016 at 12:33:26 UTC, Illuminati wrote:
 On Saturday, 3 September 2016 at 07:44:28 UTC, Cauterite wrote:
 On Friday, 2 September 2016 at 19:38:34 UTC, Illuminati wrote:
 I am trying to create a hash table and would like an 
 efficient way to be able to know if an element exists to test 
 for collisions.

 Just do a regular lookup on the hash? It's an O(1) operation, 
 like 4 instructions.

 Huh? One can look up fine, but how does one know if the result 
 is valid or not?

Okay I think I misinterpreted the question. I believe when I did 
this my solution was to have an additional template parameter 
specifying null key values, so the template was like this:

struct HashTbl(
	K, /* key type */
	V, /* value type */
	K NullKey, /* key placeholder value */
	alias hash_key, /* hash function */
	alias keys_eq /* equality function */
) {...

I guess this doesn't really solve the problem, but makes it the 
user's problem.

 I could keep a bitarray, but wasting around 12% space.

That assumes your (K,V) tuples are 1 byte total, right?

Sep 03 2016

Illuminati <Joe Masons.com> writes:

On Saturday, 3 September 2016 at 14:07:27 UTC, Cauterite wrote:
 On Saturday, 3 September 2016 at 12:33:26 UTC, Illuminati wrote:
 On Saturday, 3 September 2016 at 07:44:28 UTC, Cauterite wrote:
 On Friday, 2 September 2016 at 19:38:34 UTC, Illuminati wrote:
 I am trying to create a hash table and would like an 
 efficient way to be able to know if an element exists to 
 test for collisions.

 Just do a regular lookup on the hash? It's an O(1) operation, 
 like 4 instructions.

 Huh? One can look up fine, but how does one know if the result 
 is valid or not?

 Okay I think I misinterpreted the question. I believe when I 
 did this my solution was to have an additional template 
 parameter specifying null key values, so the template was like 
 this:

 struct HashTbl(
 	K, /* key type */
 	V, /* value type */
 	K NullKey, /* key placeholder value */
 	alias hash_key, /* hash function */
 	alias keys_eq /* equality function */
 ) {...

 I guess this doesn't really solve the problem, but makes it the 
 user's problem.

Yeah, and there might not be any "null" keys. If there is, then 
it is easy, But, say the key is an int... what is "null"? While 
it does let the user define what null means, it assumes that it 
is possible to do so and that might not always be the case.

I suppose one could then switch to pointers and waste a bit of 
space if necessary. The main point of the question was if there 
were any tricks that avoided this problem. E.g., maybe the hash 
function itself could be used to determine if the key exists. Not 
sure how but who knows?

 I could keep a bitarray, but wasting around 12% space.

 That assumes your (K,V) tuples are 1 byte total, right?


Yeah. Larger values will reduce the size to some degree. Really, 
it is simply N/8 wasted space. For large N it becomes a problem. 
Maybe not a huge deal since virtually most of the time N < 1M, 
and it would be around 125k bytes for such a large array. Large 
arrays could probably be handled differently if memory 
constraints were such an issue.

Sep 04 2016

Basile B. <b2.temp gmx.com> writes:

On Friday, 2 September 2016 at 19:38:34 UTC, Illuminati wrote:
 I am trying to create a hash table and would like an efficient 
 way to be able to know if an element exists to test for 
 collisions.

 I could keep a bitarray, but wasting around 12% space. I could 
 use pointers(null check) to elements but this creates 
 fragmentation. It is not terrible, just curious if anyone has a 
 better way?

fragmentation is a consequence of the hash function. You should 
set the hasher as a template parameter so that, according to the 
value type, the best hash fun (the one that creates less 
clustering) can be supplied.

But otherwise the buckets is almost always an array of 
ReturnType!hashFun with the max value wrapped around the next 
power of two value following entry count.

Sep 03 2016

Illuminati <Joe Masons.com> writes:

On Saturday, 3 September 2016 at 09:43:04 UTC, Basile B. wrote:
 On Friday, 2 September 2016 at 19:38:34 UTC, Illuminati wrote:
 I am trying to create a hash table and would like an efficient 
 way to be able to know if an element exists to test for 
 collisions.

 I could keep a bitarray, but wasting around 12% space. I could 
 use pointers(null check) to elements but this creates 
 fragmentation. It is not terrible, just curious if anyone has 
 a better way?

 fragmentation is a consequence of the hash function. You should 
 set the hasher as a template parameter so that, according to 
 the value type, the best hash fun (the one that creates less 
 clustering) can be supplied.

I mean memory fragmentation. If the key and value are structs, 
The memory is an array of tuple(key, value). Any scanning and 
returning are quite efficiency since all the tuples are next to 
each other. If they are pointers to the key/value then they could 
point to any location in memory. Scanning and such are far more 
likely to create cache misses. Maybe not a big deal for simple 
one-time access but in other cases it could be extremely 
slow(such as iterating over the table).

 But otherwise the buckets is almost always an array of 
 ReturnType!hashFun with the max value wrapped around the next 
 power of two value following entry count.

My hash table is simply a fixed array of type X = tuple(key, 
value). X is at location key.hashOf % length(more or less). When 
the table becomes too small, it is enlarged and everything is 
rehashed. But keys and values can be values or references and 
this changes the behavior. references can be checked for null, 
but values can't.

Sep 03 2016

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Hash table element existence check