www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Build in associated array (DIctionary) needs some enhancements

reply An Pham <home home.com> writes:
1. No function to create with desired capability
The initial size is default to 8 and increased of power of 2. 
There are two problems with this scheme. To populate to size of 
few thousands, there are too many memory relocations and rebuild 
the list. The other problem is wasted of unused slots; example a 
list of 2_100, the list will have 4_096 of slots

2. Cascade of collision
If hash causes a collision, it will use the subsequent slot as an 
overflow. However, if a next item hash points to that subsequent 
slot and it is already occupied by that overflow item, hence the 
cascade collision

3. Duplicate hash calculation
For a key that fits into register, size_t, there is no hash 
calculation but when it is added into the list, it will apply 
final MurmurHash2; for a key size bigger than size_t, it will do 
hash calculation with applied final MurmurHah2 and later added to 
the list, it will apply MurmurHash2 again. There are two problems 
with this scheme, the first one is obvious un-needed calculation; 
the other is if an integer key value fitted into list 
length/capability, applied MurmurHash2 potentially causes hash 
collision

Some suggestions
1. Use prime number length/capability to avoid wasted of unused 
slots
2. Add a way to initialize list with length/capability
3. A way to avoid final MurmurHash2 or get rid of it completely 
and add function/delegate to customized hash calculation
4. Use different scheme to handle collision such as plain array 
to hold key-value pair and hash list with integer index pointed 
to key-value pair. This scheme has two advantages; key-value pair 
will have integer index for collision and associated list will 
have order of items added when do iteration (possible use in 
json...); The dis-advantage is indirect access and it will take 
more work when doing deletion. However, deletion is rarely used 
and not a big deal for this use case

Happy codings
Mar 30
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On Sunday, 30 March 2025 at 16:30:34 UTC, An Pham wrote:
 1. No function to create with desired capability
 The initial size is default to 8 and increased of power of 2. 
 There are two problems with this scheme. To populate to size of 
 few thousands, there are too many memory relocations and 
 rebuild the list. The other problem is wasted of unused slots; 
 example a list of 2_100, the list will have 4_096 of slots
The memory allocation of the bucket list is not significant, neither is having the bucket list be larger than needed. And the growth factor is actually 4. For an AA, you *want* to have a load factor that is not 1, or the hash lookup creeps towards linear complexity.
 2. Cascade of collision
 If hash causes a collision, it will use the subsequent slot as 
 an overflow. However, if a next item hash points to that 
 subsequent slot and it is already occupied by that overflow 
 item, hence the cascade collision
This is not the case. The algorithm to find an appropriate slot is not linear in this fashion. Please try reading the code again.
 3. Duplicate hash calculation
 For a key that fits into register, size_t, there is no hash 
 calculation but when it is added into the list, it will apply 
 final MurmurHash2; for a key size bigger than size_t, it will 
 do hash calculation with applied final MurmurHah2 and later 
 added to the list, it will apply MurmurHash2 again. There are 
 two problems with this scheme, the first one is obvious 
 un-needed calculation; the other is if an integer key value 
 fitted into list length/capability, applied MurmurHash2 
 potentially causes hash collision
The hash function is used once, and is calculated by the TypeInfo. If you want a different hash algorithm, you can create a type and define toHash.
 Some suggestions
Please implement your suggestions, test the performance, and get back with numbers. It's often quite surprising what makes the difference in performance! So suggesting changes without testing isn't fruitful. -Steve
Mar 30