digitalmars.D - I submitted my container library to code.dlang.org
- w0rp (31/31) Mar 29 2015 I just submitted my container library I was working on ages ago
- Martin Nowak (17/25) Mar 29 2015 Nice docs.
- w0rp (13/45) Mar 29 2015 You probably beat my implementation in terms of speed now. I
- Johannes Pfau (5/18) Mar 30 2015 Another thing to worry about with hash tables is this:
- Martin Nowak (6/9) Mar 31 2015 Mmh, that's a topic for specialized hash tables IMO.
- Martin Nowak (5/6) Mar 31 2015 No way we use this for druntime.
- ketmar (2/7) Mar 31 2015 hay, can we switch to `donothing128`? it's speed is terrifying! ;-)=
- Johannes Pfau (7/15) Apr 01 2015 It's probably more a problem for vibe-d or other
- Kagamin (7/15) Apr 02 2015 The vulnerability presentation suggests perl solution (random
- Martin Nowak (3/9) Apr 03 2015 A global random hash seed would work, but it needs to be accessible for
- w0rp (3/17) Apr 03 2015 At least for a library hashmap, you could provide a hash seed at
- Kagamin (7/10) Apr 04 2015 I think, leave the seed zero and only provide a function to
- Martin Nowak (3/6) Apr 05 2015 Sounds good, accepting patches :).
- thedeemon (13/16) Mar 30 2015 Here's a variant of a open addressing hash table (Robin Hood one)
- Martin Nowak (9/17) Mar 31 2015 You shouldn't write when you read. Robin Hood sounds like a good
- Martin Nowak (3/5) Mar 31 2015 I'd be glad if someone implemented that for our current AA.
- thedeemon (5/8) Mar 31 2015 Is there a D version of a hash table with open addressing and
- w0rp (11/19) Apr 03 2015 Now there is. I just changed the hashmap in my container library
- w0rp (18/39) Apr 03 2015 Also, I changed the API slightly.
- thedeemon (2/6) Apr 03 2015 Great! I'll experiment with it and do some comparisons.
- Martin Nowak (11/16) Apr 03 2015 You should use triangular numbers and power of 2 bucket sizes instead of
- w0rp (9/30) Apr 03 2015 I have pushed again, and now the hashmap uses powers of two and
- Martin Nowak (2/8) Mar 31 2015 https://issues.dlang.org/show_bug.cgi?id=14386
I just submitted my container library I was working on ages ago to code.dlang.org. It's currently waiting in a queue, but it should be available from there soon enough. I submitted my library with the version "0.1.0" to indicate, "Do not expect this to work properly." I have a number of unit tests available for it, and documentation for the library hosted on my website. https://w0rp.com/project/dstruct/ It is available on GitHub here. https://github.com/w0rp/dstruct If anyone is interested in this stuff at all, let me know what you think. If you hate it and want to burn me alive, let me know. A few points follow. 1. This library uses the GC a lot. std.allocator still hasn't made it out into the real world, so I can't use that yet. I would like to offer different allocation schemes at some point with that, but for now the GC default will do. If you absolutely hate the GC and can't use it, this library isn't for you. (For now.) 2. Again, I don't guarantee that everything will work. I would really like people to submit issues so I can cover any errors with more tests. 3. The API for just about anything can be subject to change at this point. I need more feedback from actual usage of my library. 4. I ended up writing my own library hashmap, which when I tested ages ago competed with the standard associative array in terms of performance. This allows me to mark many things safe pure nothrow. Also, I was able to implement setDefault in a hopefully efficient manner. I have tried to push setDefault as an awesome thing a few times, but I've never been able to explain why it's good eloquently. Destroy!
Mar 29 2015
On 03/29/2015 05:19 PM, w0rp wrote:4. I ended up writing my own library hashmap, which when I tested ages ago competed with the standard associative array in terms of performance. This allows me to mark many things safe pure nothrow. Destroy!Nice docs. Always use open addressing when implementing a hash table. https://github.com/D-Programming-Language/dmd/pull/4088 https://github.com/higgsjs/Higgs/pull/170Also, I was able to implement setDefault in a hopefully efficient manner. I have tried to push setDefault as an awesome thing a few times, but I've never been able to explain why it's good eloquently.I guess that's the counterpart to `get(key, default)` that also inserts the default value. That's a very much needed primitive for our AA, because currently you end up doing at least 1 unnecessary lookup. Often I even write this. auto p = key in aa; if (p is null) { aa[key] = default; p = key in aa; } It's called getOrElseUpdate in Scala, but I'd prefer getOrSet. http://www.scala-lang.org/api/2.10.1-RC1/scala/collection/mutable/HashMap.html#getOrElseUpdate(A,⇒B):B
Mar 29 2015
On Sunday, 29 March 2015 at 22:32:34 UTC, Martin Nowak wrote:On 03/29/2015 05:19 PM, w0rp wrote:You probably beat my implementation in terms of speed now. I think the only thing I did quite differently from the standard implementation was use & 2 for computing the hash index instead of a divide, as that was the method OpenJDK used for its hashmap.4. I ended up writing my own library hashmap, which when I tested ages ago competed with the standard associative array in terms of performance. This allows me to mark many things safe pure nothrow. Destroy!Nice docs. Always use open addressing when implementing a hash table. https://github.com/D-Programming-Language/dmd/pull/4088 https://github.com/higgsjs/Higgs/pull/170Yeah, that's the one. I personally don't really care what the name is. 'setdefault' is the Python name for it. Only we can do better than Python, because we have the lazy keyword. For which I'll say, "One magical day, the compiler can generate special code for this." As you say, it's special in that it has the ability to only compute the hash once when implemented with direct access to the hashtable.Also, I was able to implement setDefault in a hopefully efficient manner. I have tried to push setDefault as an awesome thing a few times, but I've never been able to explain why it's good eloquently.I guess that's the counterpart to `get(key, default)` that also inserts the default value. That's a very much needed primitive for our AA, because currently you end up doing at least 1 unnecessary lookup. Often I even write this. auto p = key in aa; if (p is null) { aa[key] = default; p = key in aa; } It's called getOrElseUpdate in Scala, but I'd prefer getOrSet. http://www.scala-lang.org/api/2.10.1-RC1/scala/collection/mutable/HashMap.html#getOrElseUpdate(A,⇒B):B
Mar 29 2015
Am Mon, 30 Mar 2015 00:32:12 +0200 schrieb Martin Nowak <code+news.digitalmars dawg.eu>:On 03/29/2015 05:19 PM, w0rp wrote:Another thing to worry about with hash tables is this: http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html https://www.youtube.com/watch?v=wGYj8fhhUVA4. I ended up writing my own library hashmap, which when I tested ages ago competed with the standard associative array in terms of performance. This allows me to mark many things safe pure nothrow. Destroy!Nice docs. Always use open addressing when implementing a hash table. https://github.com/D-Programming-Language/dmd/pull/4088 https://github.com/higgsjs/Higgs/pull/170
Mar 30 2015
On Monday, 30 March 2015 at 09:31:26 UTC, Johannes Pfau wrote:Another thing to worry about with hash tables is this: http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html https://www.youtube.com/watch?v=wGYj8fhhUVAMmh, that's a topic for specialized hash tables IMO. A generic hash table should do a best effort to avoid collisions, but you can't expect it to use a cryptographic quality hash by default if that is considerably slower. Anyone benchmarked that SipHash?
Mar 31 2015
On 03/31/2015 11:12 PM, Martin Nowak wrote:Anyone benchmarked that SipHash?No way we use this for druntime. https://github.com/rurban/smhasher#readme I think we should have a flexible Hash in std.container though, that should allow to use a customized hash function.
Mar 31 2015
On Tue, 31 Mar 2015 23:45:28 +0200, Martin Nowak wrote:On 03/31/2015 11:12 PM, Martin Nowak wrote:hay, can we switch to `donothing128`? it's speed is terrifying! ;-)=Anyone benchmarked that SipHash?=20 No way we use this for druntime. https://github.com/rurban/smhasher#readme
Mar 31 2015
Am Tue, 31 Mar 2015 23:45:28 +0200 schrieb Martin Nowak <code+news.digitalmars dawg.eu>:On 03/31/2015 11:12 PM, Martin Nowak wrote:It's probably more a problem for vibe-d or other server-like applications. Those should make sure to use DOS-safe hash tables. For most applications there's no possibility for DOS attacks using hash tables and we indeed shouldn't make these applications slower.Anyone benchmarked that SipHash?No way we use this for druntime. https://github.com/rurban/smhasher#readme I think we should have a flexible Hash in std.container though, that should allow to use a customized hash function.
Apr 01 2015
On Wednesday, 1 April 2015 at 12:05:37 UTC, Johannes Pfau wrote:It's probably more a problem for vibe-d or other server-like applications. Those should make sure to use DOS-safe hash tables. For most applications there's no possibility for DOS attacks using hash tables and we indeed shouldn't make these applications slower.The vulnerability presentation suggests perl solution (random hash seed) is good enough, it doesn't slow down anything. The seed can be left zero and initialized by an application as needed. One can also use a longer key and add more its bits every, say, 10 bytes of hashed data, not sure if it will make any difference.
Apr 02 2015
On 04/02/2015 02:10 PM, Kagamin wrote:The vulnerability presentation suggests perl solution (random hash seed) is good enough, it doesn't slow down anything. The seed can be left zero and initialized by an application as needed. One can also use a longer key and add more its bits every, say, 10 bytes of hashed data, not sure if it will make any difference.A global random hash seed would work, but it needs to be accessible for reproducing test cases (druntime DRT option or in core.runtime).
Apr 03 2015
On Friday, 3 April 2015 at 20:41:26 UTC, Martin Nowak wrote:On 04/02/2015 02:10 PM, Kagamin wrote:At least for a library hashmap, you could provide a hash seed at compile time, or as some static value and maybe lie about purity.The vulnerability presentation suggests perl solution (random hash seed) is good enough, it doesn't slow down anything. The seed can be left zero and initialized by an application as needed. One can also use a longer key and add more its bits every, say, 10 bytes of hashed data, not sure if it will make any difference.A global random hash seed would work, but it needs to be accessible for reproducing test cases (druntime DRT option or in core.runtime).
Apr 03 2015
On Friday, 3 April 2015 at 20:41:26 UTC, Martin Nowak wrote:A global random hash seed would work, but it needs to be accessible for reproducing test cases (druntime DRT option or in core.runtime).I think, leave the seed zero and only provide a function to change it: extern(C) void _d_setHashSeed(int seed); Then applications will set it to whatever value they want. The function shouldn't be broadly exposed as only server frameworks will need it.
Apr 04 2015
On 04/04/2015 05:12 PM, Kagamin wrote:I think, leave the seed zero and only provide a function to change it: extern(C) void _d_setHashSeed(int seed);Sounds good, accepting patches :). https://issues.dlang.org/show_bug.cgi?id=14414
Apr 05 2015
On Sunday, 29 March 2015 at 22:32:34 UTC, Martin Nowak wrote:Always use open addressing when implementing a hash table. https://github.com/D-Programming-Language/dmd/pull/4088 https://github.com/higgsjs/Higgs/pull/170Here's a variant of a open addressing hash table (Robin Hood one) that uses std.container.Array instead of relying on GC: https://bitbucket.org/infognition/robinhood/src/ (see rbhash.d, the rest is just tests) Not documented yet, sadly. Here are some reflections in this, describing the approach and comparing with other implementations (built-in AAs and linear probing hashtable from vibe.d): http://www.infognition.com/blog/2014/on_robin_hood_hashing.html Apparently with default hash functions for strings such hash tables get very slow due to clustering. With other key types they are pretty fast.
Mar 30 2015
On Monday, 30 March 2015 at 11:23:13 UTC, thedeemon wrote:Here's a variant of a open addressing hash table (Robin Hood one) that uses std.container.Array instead of relying on GC: https://bitbucket.org/infognition/robinhood/src/ (see rbhash.d, the rest is just tests) Not documented yet, sadly.You shouldn't write when you read. Robin Hood sounds like a good idea, but it really isn't. Keep your load factor reasonable and distribute values evenly, then you don't need a LRU lookup.Apparently with default hash functions for strings such hash tables get very slow due to clustering. With other key types they are pretty fast.That's one reason why you should use quadratic probing for hash tables, but you should also use a good hash function. MurmurHash2 (not 3) is particularly fast, even for very small keys. The simplest and most efficient way to do quadratic probing is to use triangular numbers and a power of 2 sized hash table.
Mar 31 2015
On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:The simplest and most efficient way to do quadratic probing is to use triangular numbers and a power of 2 sized hash table.I'd be glad if someone implemented that for our current AA. https://issues.dlang.org/show_bug.cgi?id=14385
Mar 31 2015
On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:Robin Hood sounds like a good idea, but it really isn't. Keep your load factor reasonable and distribute values evenly, then you don't need a LRU lookup.Is there a D version of a hash table with open addressing and quadratic probing? It would be interesting to compare speeds. I like Robin Hood for cache-friendliness, and it works quite well for many combinations of key-value types.
Mar 31 2015
On Wednesday, 1 April 2015 at 06:31:28 UTC, thedeemon wrote:On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:Now there is. I just changed the hashmap in my container library to use open addressing and quadratic probing. There's a benchmark program in the library for testing it in comparison to the standard associative array. I tested using 2.066 with optimisations on, and it seems to be about the same. https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d Now, I am not the most fantastic Computer Science guy in the world, and I probably got a few things wrong. If anyone would like to look at my code and point out mistakes, please do. I will add any improvements suggested.Robin Hood sounds like a good idea, but it really isn't. Keep your load factor reasonable and distribute values evenly, then you don't need a LRU lookup.Is there a D version of a hash table with open addressing and quadratic probing? It would be interesting to compare speeds. I like Robin Hood for cache-friendliness, and it works quite well for many combinations of key-value types.
Apr 03 2015
On Friday, 3 April 2015 at 16:12:00 UTC, w0rp wrote:On Wednesday, 1 April 2015 at 06:31:28 UTC, thedeemon wrote:Also, I changed the API slightly. 1. I removed setDefault for the standard associative array type. Now it's only there for my type. The module now only deals with the added library type. 2. I changed my range functions to match the names for range functions in 2.067, so they are now named byKey, byValue, and byKeyValue. I left out 'keys' and 'values', which I think are mistakes, when you can do byKey.array yourself, etc. 3. I renamed ItemRange to KeyValueRange. (My ranges for the hashmaps are all named, which is supposed to make it easier to build higher order ranges from them.) 4. I added a constructor for the hashmaps which lets you maybe avoid some allocations by providing a minimum size, where the hashmap will probably use a size larger than the one you provide. 5. I started using prime sizes for the hashmaps, because quadratic probing doesn't seem to work without primes. I think that's the bigger changes at least.On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:Now there is. I just changed the hashmap in my container library to use open addressing and quadratic probing. There's a benchmark program in the library for testing it in comparison to the standard associative array. I tested using 2.066 with optimisations on, and it seems to be about the same. https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d Now, I am not the most fantastic Computer Science guy in the world, and I probably got a few things wrong. If anyone would like to look at my code and point out mistakes, please do. I will add any improvements suggested.Robin Hood sounds like a good idea, but it really isn't. Keep your load factor reasonable and distribute values evenly, then you don't need a LRU lookup.Is there a D version of a hash table with open addressing and quadratic probing? It would be interesting to compare speeds. I like Robin Hood for cache-friendliness, and it works quite well for many combinations of key-value types.
Apr 03 2015
On Friday, 3 April 2015 at 16:12:00 UTC, w0rp wrote:Is there a D version of a hash table with open addressing and quadratic probing?Now there is. https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.dGreat! I'll experiment with it and do some comparisons.
Apr 03 2015
On 04/03/2015 06:11 PM, w0rp wrote:Now, I am not the most fantastic Computer Science guy in the world, and I probably got a few things wrong. If anyone would like to look at my code and point out mistakes, please do. I will add any improvements suggested.You should use triangular numbers and power of 2 bucket sizes instead of quadratic numbers and prime sized buckets, because that guarantees full utilisation of the buckets and a minimal period of the numbers. The necessary loop is really trivial. https://github.com/D-Programming-Language/dmd/blob/f234c39a0e633fc9a0b5474fe2def76be9a373ef/src/root/stringtable.c#L162 If you replace i = (i + j) & (tabledim - 1); with this i = (i + 1) & (tabledim - 1); you get linear probing btw.
Apr 03 2015
On Friday, 3 April 2015 at 20:35:40 UTC, Martin Nowak wrote:On 04/03/2015 06:11 PM, w0rp wrote:I have pushed again, and now the hashmap uses powers of two and triangular numbers, per your suggestion. I noticed a small improvement in speed, probably coming from the & for modulo trick. If you would like to test the maps with the benchmark program you can run the following: dub run -q --build=release --config=benchmark_hashmap You can use the --compiler switch for dub to try different compilers.Now, I am not the most fantastic Computer Science guy in the world, and I probably got a few things wrong. If anyone would like to look at my code and point out mistakes, please do. I will add any improvements suggested.You should use triangular numbers and power of 2 bucket sizes instead of quadratic numbers and prime sized buckets, because that guarantees full utilisation of the buckets and a minimal period of the numbers. The necessary loop is really trivial. https://github.com/D-Programming-Language/dmd/blob/f234c39a0e633fc9a0b5474fe2def76be9a373ef/src/root/stringtable.c#L162 If you replace i = (i + j) & (tabledim - 1); with this i = (i + 1) & (tabledim - 1); you get linear probing btw.
Apr 03 2015
On Sunday, 29 March 2015 at 22:32:34 UTC, Martin Nowak wrote:I guess that's the counterpart to `get(key, default)` that also inserts the default value. That's a very much needed primitive for our AA, because currently you end up doing at least 1 unnecessary lookup.https://issues.dlang.org/show_bug.cgi?id=14386
Mar 31 2015