digitalmars.D - I submitted my container library to code.dlang.org

w0rp (31/31) Mar 29 2015 I just submitted my container library I was working on ages ago

Martin Nowak (17/25) Mar 29 2015 Nice docs.

w0rp (13/45) Mar 29 2015 You probably beat my implementation in terms of speed now. I
Johannes Pfau (5/18) Mar 30 2015 Another thing to worry about with hash tables is this:

Martin Nowak (6/9) Mar 31 2015 Mmh, that's a topic for specialized hash tables IMO.

Martin Nowak (5/6) Mar 31 2015 No way we use this for druntime.

ketmar (2/7) Mar 31 2015 hay, can we switch to `donothing128`? it's speed is terrifying! ;-)=
Johannes Pfau (7/15) Apr 01 2015 It's probably more a problem for vibe-d or other

Kagamin (7/15) Apr 02 2015 The vulnerability presentation suggests perl solution (random

Martin Nowak (3/9) Apr 03 2015 A global random hash seed would work, but it needs to be accessible for

w0rp (3/17) Apr 03 2015 At least for a library hashmap, you could provide a hash seed at
Kagamin (7/10) Apr 04 2015 I think, leave the seed zero and only provide a function to

Martin Nowak (3/6) Apr 05 2015 Sounds good, accepting patches :).

thedeemon (13/16) Mar 30 2015 Here's a variant of a open addressing hash table (Robin Hood one)

Martin Nowak (9/17) Mar 31 2015 You shouldn't write when you read. Robin Hood sounds like a good

Martin Nowak (3/5) Mar 31 2015 I'd be glad if someone implemented that for our current AA.
thedeemon (5/8) Mar 31 2015 Is there a D version of a hash table with open addressing and

w0rp (11/19) Apr 03 2015 Now there is. I just changed the hashmap in my container library

w0rp (18/39) Apr 03 2015 Also, I changed the API slightly.
thedeemon (2/6) Apr 03 2015 Great! I'll experiment with it and do some comparisons.
Martin Nowak (11/16) Apr 03 2015 You should use triangular numbers and power of 2 bucket sizes instead of

w0rp (9/30) Apr 03 2015 I have pushed again, and now the hashmap uses powers of two and

Martin Nowak (2/8) Mar 31 2015 https://issues.dlang.org/show_bug.cgi?id=14386

"w0rp" <devw0rp gmail.com> writes:

I just submitted my container library I was working on ages ago 
to code.dlang.org. It's currently waiting in a queue, but it 
should be available from there soon enough.

I submitted my library with the version "0.1.0" to indicate, "Do 
not expect this to work properly." I have a number of unit tests 
available for it, and documentation for the library hosted on my 
website.

https://w0rp.com/project/dstruct/

It is available on GitHub here.

https://github.com/w0rp/dstruct

If anyone is interested in this stuff at all, let me know what 
you think. If you hate it and want to burn me alive, let me know. 
A few points follow.

1. This library uses the GC a lot. std.allocator still hasn't 
made it out into the real world, so I can't use that yet. I would 
like to offer different allocation schemes at some point with 
that, but for now the GC default will do. If you absolutely hate 
the GC and can't use it, this library isn't for you. (For now.)
2. Again, I don't guarantee that everything will work. I would 
really like people to submit issues so I can cover any errors 
with more tests.
3. The API for just about anything can be subject to change at 
this point. I need more feedback from actual usage of my library.
4. I ended up writing my own library hashmap, which when I tested 
ages ago competed with the standard associative array in terms of 
performance. This allows me to mark many things  safe pure 
nothrow. Also, I was able to implement setDefault in a hopefully 
efficient manner. I have tried to push setDefault as an awesome 
thing a few times, but I've never been able to explain why it's 
good eloquently.

Destroy!

Mar 29 2015

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 03/29/2015 05:19 PM, w0rp wrote:
 4. I ended up writing my own library hashmap, which when I tested ages
 ago competed with the standard associative array in terms of
 performance. This allows me to mark many things  safe pure nothrow.
 
 Destroy!

Nice docs.

Always use open addressing when implementing a hash table.
https://github.com/D-Programming-Language/dmd/pull/4088
https://github.com/higgsjs/Higgs/pull/170

 Also, I was able to implement setDefault in a hopefully efficient
 manner. I have tried to push setDefault as an awesome thing a few times,
 but I've never been able to explain why it's good eloquently.

I guess that's the counterpart to `get(key, default)` that also inserts
the default value. That's a very much needed primitive for our AA,
because currently you end up doing at least 1 unnecessary lookup.

Often I even write this.

auto p = key in aa;
if (p is null)
{
    aa[key] = default;
    p = key in aa;
}

It's called getOrElseUpdate in Scala, but I'd prefer getOrSet.
http://www.scala-lang.org/api/2.10.1-RC1/scala/collection/mutable/HashMap.html#getOrElseUpdate(A,⇒B):B

Mar 29 2015

"w0rp" <devw0rp gmail.com> writes:

On Sunday, 29 March 2015 at 22:32:34 UTC, Martin Nowak wrote:
On 03/29/2015 05:19 PM, w0rp wrote:
4. I ended up writing my own library hashmap, which when I
tested ages
ago competed with the standard associative array in terms of
performance. This allows me to mark many things safe pure
nothrow.

Destroy!

Nice docs.

Always use open addressing when implementing a hash table.
https://github.com/D-Programming-Language/dmd/pull/4088
https://github.com/higgsjs/Higgs/pull/170

You probably beat my implementation in terms of speed now. I
think the only thing
I did quite differently from the standard implementation was use
& 2 for computing the hash index instead of a divide, as that was
the method OpenJDK used for its hashmap.

Also, I was able to implement setDefault in a hopefully
efficient
manner. I have tried to push setDefault as an awesome thing a
few times,
but I've never been able to explain why it's good eloquently.

I guess that's the counterpart to `get(key, default)` that also
inserts
the default value. That's a very much needed primitive for our
AA,
because currently you end up doing at least 1 unnecessary
lookup.

Often I even write this.

auto p = key in aa;
if (p is null)
{
aa[key] = default;
p = key in aa;
}

It's called getOrElseUpdate in Scala, but I'd prefer getOrSet.
http://www.scala-lang.org/api/2.10.1-RC1/scala/collection/mutable/HashMap.html#getOrElseUpdate(A,⇒B):B

Yeah, that's the one. I personally don't really care what the
name is. 'setdefault' is the Python name for it. Only we can do
better than Python, because we have the lazy keyword. For which
I'll say, "One magical day, the compiler can generate special
code for this." As you say, it's special in that it has the
ability to only compute the hash once when implemented with
direct access to the hashtable.

Mar 29 2015

Johannes Pfau <nospam example.com> writes:

Am Mon, 30 Mar 2015 00:32:12 +0200
schrieb Martin Nowak <code+news.digitalmars dawg.eu>:

 On 03/29/2015 05:19 PM, w0rp wrote:
 4. I ended up writing my own library hashmap, which when I tested
 ages ago competed with the standard associative array in terms of
 performance. This allows me to mark many things  safe pure nothrow.
 
 Destroy!

 
 Nice docs.
 
 Always use open addressing when implementing a hash table.
 https://github.com/D-Programming-Language/dmd/pull/4088
 https://github.com/higgsjs/Higgs/pull/170
 

Another thing to worry about with hash tables is this:

http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html
https://www.youtube.com/watch?v=wGYj8fhhUVA

Mar 30 2015

"Martin Nowak" <code dawg.eu> writes:

On Monday, 30 March 2015 at 09:31:26 UTC, Johannes Pfau wrote:
 Another thing to worry about with hash tables is this:

 http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html
 https://www.youtube.com/watch?v=wGYj8fhhUVA

Mmh, that's a topic for specialized hash tables IMO.
A generic hash table should do a best effort to avoid collisions, 
but you can't expect it to use a cryptographic quality hash by 
default if that is considerably slower.
Anyone benchmarked that SipHash?

Mar 31 2015

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 03/31/2015 11:12 PM, Martin Nowak wrote:
 Anyone benchmarked that SipHash?

No way we use this for druntime.
https://github.com/rurban/smhasher#readme

I think we should have a flexible Hash in std.container though, that
should allow to use a customized hash function.

Mar 31 2015

ketmar <ketmar ketmar.no-ip.org> writes:

On Tue, 31 Mar 2015 23:45:28 +0200, Martin Nowak wrote:

 On 03/31/2015 11:12 PM, Martin Nowak wrote:
 Anyone benchmarked that SipHash?

=20
 No way we use this for druntime.
 https://github.com/rurban/smhasher#readme

hay, can we switch to `donothing128`? it's speed is terrifying! ;-)=

Mar 31 2015

Johannes Pfau <nospam example.com> writes:

Am Tue, 31 Mar 2015 23:45:28 +0200
schrieb Martin Nowak <code+news.digitalmars dawg.eu>:

 On 03/31/2015 11:12 PM, Martin Nowak wrote:
 Anyone benchmarked that SipHash?

 
 No way we use this for druntime.
 https://github.com/rurban/smhasher#readme
 
 I think we should have a flexible Hash in std.container though, that
 should allow to use a customized hash function.

It's probably more a problem for vibe-d or other
server-like applications. Those should make sure to use DOS-safe hash
tables. For most applications there's no possibility for DOS attacks
using hash tables and we indeed shouldn't make these applications
slower.

Apr 01 2015

"Kagamin" <spam here.lot> writes:

On Wednesday, 1 April 2015 at 12:05:37 UTC, Johannes Pfau wrote:
 It's probably more a problem for vibe-d or other
 server-like applications. Those should make sure to use 
 DOS-safe hash
 tables. For most applications there's no possibility for DOS 
 attacks
 using hash tables and we indeed shouldn't make these 
 applications
 slower.

The vulnerability presentation suggests perl solution (random 
hash seed) is good enough, it doesn't slow down anything. The 
seed can be left zero and initialized by an application as 
needed. One can also use a longer key and add more its bits 
every, say, 10 bytes of hashed data, not sure if it will make any 
difference.

Apr 02 2015

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 04/02/2015 02:10 PM, Kagamin wrote:
 
 The vulnerability presentation suggests perl solution (random hash seed)
 is good enough, it doesn't slow down anything. The seed can be left zero
 and initialized by an application as needed. One can also use a longer
 key and add more its bits every, say, 10 bytes of hashed data, not sure
 if it will make any difference.

A global random hash seed would work, but it needs to be accessible for
reproducing test cases (druntime DRT option or in core.runtime).

Apr 03 2015

"w0rp" <devw0rp gmail.com> writes:

On Friday, 3 April 2015 at 20:41:26 UTC, Martin Nowak wrote:
 On 04/02/2015 02:10 PM, Kagamin wrote:
 
 The vulnerability presentation suggests perl solution (random 
 hash seed)
 is good enough, it doesn't slow down anything. The seed can be 
 left zero
 and initialized by an application as needed. One can also use 
 a longer
 key and add more its bits every, say, 10 bytes of hashed data, 
 not sure
 if it will make any difference.

 A global random hash seed would work, but it needs to be 
 accessible for
 reproducing test cases (druntime DRT option or in core.runtime).

At least for a library hashmap, you could provide a hash seed at 
compile time, or as some static value and maybe lie about purity.

Apr 03 2015

"Kagamin" <spam here.lot> writes:

On Friday, 3 April 2015 at 20:41:26 UTC, Martin Nowak wrote:
 A global random hash seed would work, but it needs to be 
 accessible for
 reproducing test cases (druntime DRT option or in core.runtime).

I think, leave the seed zero and only provide a function to 
change it:
extern(C) void _d_setHashSeed(int seed);
Then applications will set it to whatever value they want. The 
function shouldn't be broadly exposed as only server frameworks 
will need it.

Apr 04 2015

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 04/04/2015 05:12 PM, Kagamin wrote:
 
 I think, leave the seed zero and only provide a function to change it:
 extern(C) void _d_setHashSeed(int seed);

Sounds good, accepting patches :).
https://issues.dlang.org/show_bug.cgi?id=14414

Apr 05 2015

"thedeemon" <dlang thedeemon.com> writes:

On Sunday, 29 March 2015 at 22:32:34 UTC, Martin Nowak wrote:
 Always use open addressing when implementing a hash table.
 https://github.com/D-Programming-Language/dmd/pull/4088
 https://github.com/higgsjs/Higgs/pull/170


Here's a variant of a open addressing hash table (Robin Hood one) 
that uses std.container.Array instead of relying on GC:
https://bitbucket.org/infognition/robinhood/src/
(see rbhash.d, the rest is just tests)
Not documented yet, sadly.

Here are some reflections in this, describing the approach and 
comparing with other implementations (built-in AAs and linear 
probing hashtable from vibe.d):
http://www.infognition.com/blog/2014/on_robin_hood_hashing.html

Apparently with default hash functions for strings such hash 
tables get very slow due to clustering. With other key types they 
are pretty fast.

Mar 30 2015

"Martin Nowak" <code dawg.eu> writes:

On Monday, 30 March 2015 at 11:23:13 UTC, thedeemon wrote:
 Here's a variant of a open addressing hash table (Robin Hood 
 one) that uses std.container.Array instead of relying on GC:
 https://bitbucket.org/infognition/robinhood/src/
 (see rbhash.d, the rest is just tests)
 Not documented yet, sadly.

You shouldn't write when you read. Robin Hood sounds like a good 
idea, but it really isn't. Keep your load factor reasonable and 
distribute values evenly, then you don't need a LRU lookup.

 Apparently with default hash functions for strings such hash 
 tables get very slow due to clustering. With other key types 
 they are pretty fast.

That's one reason why you should use quadratic probing for hash 
tables, but you should also use a good hash function. MurmurHash2 
(not 3) is particularly fast, even for very small keys.

The simplest and most efficient way to do quadratic probing is to 
use triangular numbers and a power of 2 sized hash table.

Mar 31 2015

"Martin Nowak" <code dawg.eu> writes:

On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:
 The simplest and most efficient way to do quadratic probing is 
 to use triangular numbers and a power of 2 sized hash table.

I'd be glad if someone implemented that for our current AA.
https://issues.dlang.org/show_bug.cgi?id=14385

Mar 31 2015

"thedeemon" <dlang thedeemon.com> writes:

On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:

 Robin Hood sounds like a good idea, but it really isn't. Keep 
 your load factor reasonable and distribute values evenly, then 
 you don't need a LRU lookup.

Is there a D version of a hash table with open addressing and 
quadratic probing? It would be interesting to compare speeds.
I like Robin Hood for cache-friendliness, and it works quite well 
for many combinations of key-value types.

Mar 31 2015

"w0rp" <devw0rp gmail.com> writes:

On Wednesday, 1 April 2015 at 06:31:28 UTC, thedeemon wrote:
 On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:

 Robin Hood sounds like a good idea, but it really isn't. Keep 
 your load factor reasonable and distribute values evenly, then 
 you don't need a LRU lookup.

 Is there a D version of a hash table with open addressing and 
 quadratic probing? It would be interesting to compare speeds.
 I like Robin Hood for cache-friendliness, and it works quite 
 well for many combinations of key-value types.

Now there is. I just changed the hashmap in my container library 
to use open addressing and quadratic probing. There's a benchmark 
program in the library for testing it in comparison to the 
standard associative array. I tested using 2.066 with 
optimisations on, and it seems to be about the same.

https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d

Now, I am not the most fantastic Computer Science guy in the 
world, and I probably got a few things wrong. If anyone would 
like to look at my code and point out mistakes, please do. I will 
add any improvements suggested.

Apr 03 2015

"w0rp" <devw0rp gmail.com> writes:

On Friday, 3 April 2015 at 16:12:00 UTC, w0rp wrote:
 On Wednesday, 1 April 2015 at 06:31:28 UTC, thedeemon wrote:
 On Tuesday, 31 March 2015 at 21:17:04 UTC, Martin Nowak wrote:

 Robin Hood sounds like a good idea, but it really isn't. Keep 
 your load factor reasonable and distribute values evenly, 
 then you don't need a LRU lookup.

 Is there a D version of a hash table with open addressing and 
 quadratic probing? It would be interesting to compare speeds.
 I like Robin Hood for cache-friendliness, and it works quite 
 well for many combinations of key-value types.

 Now there is. I just changed the hashmap in my container 
 library to use open addressing and quadratic probing. There's a 
 benchmark program in the library for testing it in comparison 
 to the standard associative array. I tested using 2.066 with 
 optimisations on, and it seems to be about the same.

 https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d

 Now, I am not the most fantastic Computer Science guy in the 
 world, and I probably got a few things wrong. If anyone would 
 like to look at my code and point out mistakes, please do. I 
 will add any improvements suggested.

Also, I changed the API slightly.

1. I removed setDefault for the standard associative array type. 
Now it's only there for my type. The module now only deals with 
the added library type.
2. I changed my range functions to match the names for range 
functions in 2.067, so they are now named byKey, byValue, and 
byKeyValue. I left out 'keys' and 'values', which I think are 
mistakes, when you can do byKey.array yourself, etc.
3. I renamed ItemRange to KeyValueRange. (My ranges for the 
hashmaps are all named, which is supposed to make it easier to 
build higher order ranges from them.)
4. I added a constructor for the hashmaps which lets you maybe 
avoid some allocations by providing a minimum size, where the 
hashmap will probably use a size larger than the one you provide.
5. I started using prime sizes for the hashmaps, because 
quadratic probing doesn't seem to work without primes.

I think that's the bigger changes at least.

Apr 03 2015

"thedeemon" <dlang thedeemon.com> writes:

On Friday, 3 April 2015 at 16:12:00 UTC, w0rp wrote:
 Is there a D version of a hash table with open addressing and 
 quadratic probing?


 Now there is. 
 https://github.com/w0rp/dstruct/blob/master/source/dstruct/map.d

Great! I'll experiment with it and do some comparisons.

Apr 03 2015

Martin Nowak <code+news.digitalmars dawg.eu> writes:

On 04/03/2015 06:11 PM, w0rp wrote:
 
 Now, I am not the most fantastic Computer Science guy in the world, and
 I probably got a few things wrong. If anyone would like to look at my
 code and point out mistakes, please do. I will add any improvements
 suggested.

You should use triangular numbers and power of 2 bucket sizes instead of
quadratic numbers and prime sized buckets, because that guarantees full
utilisation of the buckets and a minimal period of the numbers.

The necessary loop is really trivial.

https://github.com/D-Programming-Language/dmd/blob/f234c39a0e633fc9a0b5474fe2def76be9a373ef/src/root/stringtable.c#L162

If you replace

    i = (i + j) & (tabledim - 1);

with this

    i = (i + 1) & (tabledim - 1);

you get linear probing btw.

Apr 03 2015

"w0rp" <devw0rp gmail.com> writes:

On Friday, 3 April 2015 at 20:35:40 UTC, Martin Nowak wrote:
 On 04/03/2015 06:11 PM, w0rp wrote:
 
 Now, I am not the most fantastic Computer Science guy in the 
 world, and
 I probably got a few things wrong. If anyone would like to 
 look at my
 code and point out mistakes, please do. I will add any 
 improvements
 suggested.

 You should use triangular numbers and power of 2 bucket sizes 
 instead of
 quadratic numbers and prime sized buckets, because that 
 guarantees full
 utilisation of the buckets and a minimal period of the numbers.

 The necessary loop is really trivial.

 https://github.com/D-Programming-Language/dmd/blob/f234c39a0e633fc9a0b5474fe2def76be9a373ef/src/root/stringtable.c#L162

 If you replace

     i = (i + j) & (tabledim - 1);

 with this

     i = (i + 1) & (tabledim - 1);

 you get linear probing btw.

I have pushed again, and now the hashmap uses powers of two and 
triangular numbers, per your suggestion. I noticed a small 
improvement in speed, probably coming from the & for modulo 
trick. If you would like to test the maps with the benchmark 
program you can run the following:

dub run -q --build=release --config=benchmark_hashmap

You can use the --compiler switch for dub to try different 
compilers.

Apr 03 2015

"Martin Nowak" <code dawg.eu> writes:

On Sunday, 29 March 2015 at 22:32:34 UTC, Martin Nowak wrote:
 I guess that's the counterpart to `get(key, default)` that also 
 inserts
 the default value. That's a very much needed primitive for our 
 AA,
 because currently you end up doing at least 1 unnecessary 
 lookup.

https://issues.dlang.org/show_bug.cgi?id=14386

Mar 31 2015

D Programming

C/C++ Programming

Other

digitalmars.D - I submitted my container library to code.dlang.org