www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - I submitted my container library to code.dlang.org

reply "w0rp" <devw0rp gmail.com> writes:
I just submitted my container library I was working on ages ago 
to code.dlang.org. It's currently waiting in a queue, but it 
should be available from there soon enough.

I submitted my library with the version "0.1.0" to indicate, "Do 
not expect this to work properly." I have a number of unit tests 
available for it, and documentation for the library hosted on my 
website.

https://w0rp.com/project/dstruct/

It is available on GitHub here.

https://github.com/w0rp/dstruct

If anyone is interested in this stuff at all, let me know what 
you think. If you hate it and want to burn me alive, let me know. 
A few points follow.

1. This library uses the GC a lot. std.allocator still hasn't 
made it out into the real world, so I can't use that yet. I would 
like to offer different allocation schemes at some point with 
that, but for now the GC default will do. If you absolutely hate 
the GC and can't use it, this library isn't for you. (For now.)
2. Again, I don't guarantee that everything will work. I would 
really like people to submit issues so I can cover any errors 
with more tests.
3. The API for just about anything can be subject to change at 
this point. I need more feedback from actual usage of my library.
4. I ended up writing my own library hashmap, which when I tested 
ages ago competed with the standard associative array in terms of 
performance. This allows me to mark many things  safe pure 
nothrow. Also, I was able to implement setDefault in a hopefully 
efficient manner. I have tried to push setDefault as an awesome 
thing a few times, but I've never been able to explain why it's 
good eloquently.

Destroy!
Mar 29 2015
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 03/29/2015 05:19 PM, w0rp wrote:
 4. I ended up writing my own library hashmap, which when I tested ages
 ago competed with the standard associative array in terms of
 performance. This allows me to mark many things  safe pure nothrow.
 
 Destroy!
Nice docs. Always use open addressing when implementing a hash table. https://github.com/D-Programming-Language/dmd/pull/4088 https://github.com/higgsjs/Higgs/pull/170
 Also, I was able to implement setDefault in a hopefully efficient
 manner. I have tried to push setDefault as an awesome thing a few times,
 but I've never been able to explain why it's good eloquently.
I guess that's the counterpart to `get(key, default)` that also inserts the default value. That's a very much needed primitive for our AA, because currently you end up doing at least 1 unnecessary lookup. Often I even write this. auto p = key in aa; if (p is null) { aa[key] = default; p = key in aa; } It's called getOrElseUpdate in Scala, but I'd prefer getOrSet. http://www.scala-lang.org/api/2.10.1-RC1/scala/collection/mutable/HashMap.html#getOrElseUpdate(A,⇒B):B
Mar 29 2015
next sibling parent "w0rp" <devw0rp gmail.com> writes:
On Sunday, 29 March 2015 at 22:32:34 UTC, Martin Nowak wrote:
 On 03/29/2015 05:19 PM, w0rp wrote:
 4. I ended up writing my own library hashmap, which when I 
 tested ages
 ago competed with the standard associative array in terms of
 performance. This allows me to mark many things  safe pure 
 nothrow.
 
 Destroy!
Nice docs. Always use open addressing when implementing a hash table. https://github.com/D-Programming-Language/dmd/pull/4088 https://github.com/higgsjs/Higgs/pull/170
You probably beat my implementation in terms of speed now. I think the only thing I did quite differently from the standard implementation was use & 2 for computing the hash index instead of a divide, as that was the method OpenJDK used for its hashmap.
 Also, I was able to implement setDefault in a hopefully 
 efficient
 manner. I have tried to push setDefault as an awesome thing a 
 few times,
 but I've never been able to explain why it's good eloquently.
I guess that's the counterpart to `get(key, default)` that also inserts the default value. That's a very much needed primitive for our AA, because currently you end up doing at least 1 unnecessary lookup. Often I even write this. auto p = key in aa; if (p is null) { aa[key] = default; p = key in aa; } It's called getOrElseUpdate in Scala, but I'd prefer getOrSet. http://www.scala-lang.org/api/2.10.1-RC1/scala/collection/mutable/HashMap.html#getOrElseUpdate(A,⇒B):B
Yeah, that's the one. I personally don't really care what the name is. 'setdefault' is the Python name for it. Only we can do better than Python, because we have the lazy keyword. For which I'll say, "One magical day, the compiler can generate special code for this." As you say, it's special in that it has the ability to only compute the hash once when implemented with direct access to the hashtable.
Mar 29 2015
prev sibling next sibling parent reply Johannes Pfau <nospam example.com> writes:
Am Mon, 30 Mar 2015 00:32:12 +0200
schrieb Martin Nowak <code+news.digitalmars dawg.eu>:

 On 03/29/2015 05:19 PM, w0rp wrote:
 4. I ended up writing my own library hashmap, which when I tested
 ages ago competed with the standard associative array in terms of
 performance. This allows me to mark many things  safe pure nothrow.
 
 Destroy!
Nice docs. Always use open addressing when implementing a hash table. https://github.com/D-Programming-Language/dmd/pull/4088 https://github.com/higgsjs/Higgs/pull/170
Another thing to worry about with hash tables is this: http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html https://www.youtube.com/watch?v=wGYj8fhhUVA
Mar 30 2015
parent reply "Martin Nowak" <code dawg.eu> writes:
On Monday, 30 March 2015 at 09:31:26 UTC, Johannes Pfau wrote:
 Another thing to worry about with hash tables is this:

 http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html
 https://www.youtube.com/watch?v=wGYj8fhhUVA
Mmh, that's a topic for specialized hash tables IMO. A generic hash table should do a best effort to avoid collisions, but you can't expect it to use a cryptographic quality hash by default if that is considerably slower. Anyone benchmarked that SipHash?
Mar 31 2015
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 03/31/2015 11:12 PM, Martin Nowak wrote:
 Anyone benchmarked that SipHash?
No way we use this for druntime. https://github.com/rurban/smhasher#readme I think we should have a flexible Hash in std.container though, that should allow to use a customized hash function.
Mar 31 2015
next sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
On Tue, 31 Mar 2015 23:45:28 +0200, Martin Nowak wrote:

 On 03/31/2015 11:12 PM, Martin Nowak wrote:
 Anyone benchmarked that SipHash?
=20 No way we use this for druntime. https://github.com/rurban/smhasher#readme
hay, can we switch to `donothing128`? it's speed is terrifying! ;-)=
Mar 31 2015
prev sibling parent reply Johannes Pfau <nospam example.com> writes:
Am Tue, 31 Mar 2015 23:45:28 +0200
schrieb Martin Nowak <code+news.digitalmars dawg.eu>:

 On 03/31/2015 11:12 PM, Martin Nowak wrote:
 Anyone benchmarked that SipHash?
No way we use this for druntime. https://github.com/rurban/smhasher#readme I think we should have a flexible Hash in std.container though, that should allow to use a customized hash function.
It's probably more a problem for vibe-d or other server-like applications. Those should make sure to use DOS-safe hash tables. For most applications there's no possibility for DOS attacks using hash tables and we indeed shouldn't make these applications slower.
Apr 01 2015
parent reply "Kagamin" <spam here.lot> writes:
On Wednesday, 1 April 2015 at 12:05:37 UTC, Johannes Pfau wrote:
 It's probably more a problem for vibe-d or other
 server-like applications. Those should make sure to use 
 DOS-safe hash
 tables. For most applications there's no possibility for DOS 
 attacks
 using hash tables and we indeed shouldn't make these 
 applications
 slower.
The vulnerability presentation suggests perl solution (random hash seed) is good enough, it doesn't slow down anything. The seed can be left zero and initialized by an application as needed. One can also use a longer key and add more its bits every, say, 10 bytes of hashed data, not sure if it will make any difference.
Apr 02 2015
parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 04/02/2015 02:10 PM, Kagamin wrote:
 
 The vulnerability presentation suggests perl solution (random hash seed)
 is good enough, it doesn't slow down anything. The seed can be left zero
 and initialized by an application as needed. One can also use a longer
 key and add more its bits every, say, 10 bytes of hashed data, not sure
 if it will make any difference.
A global random hash seed would work, but it needs to be accessible for reproducing test cases (druntime DRT option or in core.runtime).
Apr 03 2015
next sibling parent "w0rp" <devw0rp gmail.com> writes:
On Friday, 3 April 2015 at 20:41:26 UTC, Martin Nowak wrote:
 On 04/02/2015 02:10 PM, Kagamin wrote:
 
 The vulnerability presentation suggests perl solution (random 
 hash seed)
 is good enough, it doesn't slow down anything. The seed can be 
 left zero
 and initialized by an application as needed. One can also use 
 a longer
 key and add more its bits every, say, 10 bytes of hashed data, 
 not sure
 if it will make any difference.
A global random hash seed would work, but it needs to be accessible for reproducing test cases (druntime DRT option or in core.runtime).
At least for a library hashmap, you could provide a hash seed at compile time, or as some static value and maybe lie about purity.
Apr 03 2015
prev sibling parent reply "Kagamin" <spam here.lot> writes:
On Friday, 3 April 2015 at 20:41:26 UTC, Martin Nowak wrote:
 A global random hash seed would work, but it needs to be 
 accessible for
 reproducing test cases (druntime DRT option or in core.runtime).
I think, leave the seed zero and only provide a function to change it: extern(C) void _d_setHashSeed(int seed); Then applications will set it to whatever value they want. The function shouldn't be broadly exposed as only server frameworks will need it.
Apr 04 2015
parent Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 04/04/2015 05:12 PM, Kagamin wrote:
 
 I think, leave the seed zero and only provide a function to change it:
 extern(C) void _d_setHashSeed(int seed);
Sounds good, accepting patches :). https://issues.dlang.org/show_bug.cgi?id=14414
Apr 05 2015
prev sibling next sibling parent reply "thedeemon" <dlang thedeemon.com> writes:
On Sunday, 29 March 2015 at 22:32:34 UTC, Martin Nowak wrote:
 Always use open addressing when implementing a hash table.
 https://github.com/D-Programming-Language/dmd/pull/4088
 https://github.com/higgsjs/Higgs/pull/170
Here's a variant of a open addressing hash table (Robin Hood one) that uses std.container.Array instead of relying on GC: https://bitbucket.org/infognition/robinhood/src/ (see rbhash.d, the rest is just tests) Not documented yet, sadly. Here are some reflections in this, describing the approach and comparing with other implementations (built-in AAs and linear probing hashtable from vibe.d): http://www.infognition.com/blog/2014/on_robin_hood_hashing.html Apparently with default hash functions for strings such hash tables get very slow due to clustering. With other key types they are pretty fast.
Mar 30 2015
parent reply "Martin Nowak" <code dawg.eu> writes:
On Monday, 30 March 2015 at 11:23:13 UTC, thedeemon wrote:
 Here's a variant of a open addressing hash table (Robin Hood 
 one) that uses std.container.Array instead of relying on GC:
 https://bitbucket.org/infognition/robinhood/src/
 (see rbhash.d, the rest is just tests)
 Not documented yet, sadly.
You shouldn't write when you read. Robin Hood sounds like a good idea, but it really isn't. Keep your load factor reasonable and distribute values evenly, then you don't need a LRU lookup.
 Apparently with default hash functions for strings such hash 
 tables get very slow due to clustering. With other key types 
 they are pretty fast.
That's one reason why you should use quadratic probing for hash tables, but you should also use a good hash function. MurmurHash2 (not 3) is particularly fast, even for very small keys. The simplest and most efficient way to do quadratic probing is to use triangular numbers and a power of 2 sized hash table.
Mar 31 2015
next sibling parent "Martin Nowak" <code dawg.eu> writes:
On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:
 The simplest and most efficient way to do quadratic probing is 
 to use triangular numbers and a power of 2 sized hash table.
I'd be glad if someone implemented that for our current AA. https://issues.dlang.org/show_bug.cgi?id=14385
Mar 31 2015
prev sibling parent reply "thedeemon" <dlang thedeemon.com> writes:
On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:

 Robin Hood sounds like a good idea, but it really isn't. Keep 
 your load factor reasonable and distribute values evenly, then 
 you don't need a LRU lookup.
Is there a D version of a hash table with open addressing and quadratic probing? It would be interesting to compare speeds. I like Robin Hood for cache-friendliness, and it works quite well for many combinations of key-value types.
Mar 31 2015
parent reply "w0rp" <devw0rp gmail.com> writes:
On Wednesday, 1 April 2015 at 06:31:28 UTC, thedeemon wrote:
 On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:

 Robin Hood sounds like a good idea, but it really isn't. Keep 
 your load factor reasonable and distribute values evenly, then 
 you don't need a LRU lookup.
Is there a D version of a hash table with open addressing and quadratic probing? It would be interesting to compare speeds. I like Robin Hood for cache-friendliness, and it works quite well for many combinations of key-value types.
Now there is. I just changed the hashmap in my container library to use open addressing and quadratic probing. There's a benchmark program in the library for testing it in comparison to the standard associative array. I tested using 2.066 with optimisations on, and it seems to be about the same. https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d Now, I am not the most fantastic Computer Science guy in the world, and I probably got a few things wrong. If anyone would like to look at my code and point out mistakes, please do. I will add any improvements suggested.
Apr 03 2015
next sibling parent "w0rp" <devw0rp gmail.com> writes:
On Friday, 3 April 2015 at 16:12:00 UTC, w0rp wrote:
 On Wednesday, 1 April 2015 at 06:31:28 UTC, thedeemon wrote:
 On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:

 Robin Hood sounds like a good idea, but it really isn't. Keep 
 your load factor reasonable and distribute values evenly, 
 then you don't need a LRU lookup.
Is there a D version of a hash table with open addressing and quadratic probing? It would be interesting to compare speeds. I like Robin Hood for cache-friendliness, and it works quite well for many combinations of key-value types.
Now there is. I just changed the hashmap in my container library to use open addressing and quadratic probing. There's a benchmark program in the library for testing it in comparison to the standard associative array. I tested using 2.066 with optimisations on, and it seems to be about the same. https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d Now, I am not the most fantastic Computer Science guy in the world, and I probably got a few things wrong. If anyone would like to look at my code and point out mistakes, please do. I will add any improvements suggested.
Also, I changed the API slightly. 1. I removed setDefault for the standard associative array type. Now it's only there for my type. The module now only deals with the added library type. 2. I changed my range functions to match the names for range functions in 2.067, so they are now named byKey, byValue, and byKeyValue. I left out 'keys' and 'values', which I think are mistakes, when you can do byKey.array yourself, etc. 3. I renamed ItemRange to KeyValueRange. (My ranges for the hashmaps are all named, which is supposed to make it easier to build higher order ranges from them.) 4. I added a constructor for the hashmaps which lets you maybe avoid some allocations by providing a minimum size, where the hashmap will probably use a size larger than the one you provide. 5. I started using prime sizes for the hashmaps, because quadratic probing doesn't seem to work without primes. I think that's the bigger changes at least.
Apr 03 2015
prev sibling next sibling parent "thedeemon" <dlang thedeemon.com> writes:
On Friday, 3 April 2015 at 16:12:00 UTC, w0rp wrote:
 Is there a D version of a hash table with open addressing and 
 quadratic probing?
 Now there is. 
 https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d
Great! I'll experiment with it and do some comparisons.
Apr 03 2015
prev sibling parent reply Martin Nowak <code+news.digitalmars dawg.eu> writes:
On 04/03/2015 06:11 PM, w0rp wrote:
 
 Now, I am not the most fantastic Computer Science guy in the world, and
 I probably got a few things wrong. If anyone would like to look at my
 code and point out mistakes, please do. I will add any improvements
 suggested.
You should use triangular numbers and power of 2 bucket sizes instead of quadratic numbers and prime sized buckets, because that guarantees full utilisation of the buckets and a minimal period of the numbers. The necessary loop is really trivial. https://github.com/D-Programming-Language/dmd/blob/f234c39a0e633fc9a0b5474fe2def76be9a373ef/src/root/stringtable.c#L162 If you replace i = (i + j) & (tabledim - 1); with this i = (i + 1) & (tabledim - 1); you get linear probing btw.
Apr 03 2015
parent "w0rp" <devw0rp gmail.com> writes:
On Friday, 3 April 2015 at 20:35:40 UTC, Martin Nowak wrote:
 On 04/03/2015 06:11 PM, w0rp wrote:
 
 Now, I am not the most fantastic Computer Science guy in the 
 world, and
 I probably got a few things wrong. If anyone would like to 
 look at my
 code and point out mistakes, please do. I will add any 
 improvements
 suggested.
You should use triangular numbers and power of 2 bucket sizes instead of quadratic numbers and prime sized buckets, because that guarantees full utilisation of the buckets and a minimal period of the numbers. The necessary loop is really trivial. https://github.com/D-Programming-Language/dmd/blob/f234c39a0e633fc9a0b5474fe2def76be9a373ef/src/root/stringtable.c#L162 If you replace i = (i + j) & (tabledim - 1); with this i = (i + 1) & (tabledim - 1); you get linear probing btw.
I have pushed again, and now the hashmap uses powers of two and triangular numbers, per your suggestion. I noticed a small improvement in speed, probably coming from the & for modulo trick. If you would like to test the maps with the benchmark program you can run the following: dub run -q --build=release --config=benchmark_hashmap You can use the --compiler switch for dub to try different compilers.
Apr 03 2015
prev sibling parent "Martin Nowak" <code dawg.eu> writes:
On Sunday, 29 March 2015 at 22:32:34 UTC, Martin Nowak wrote:
 I guess that's the counterpart to `get(key, default)` that also 
 inserts
 the default value. That's a very much needed primitive for our 
 AA,
 because currently you end up doing at least 1 unnecessary 
 lookup.
https://issues.dlang.org/show_bug.cgi?id=14386
Mar 31 2015