digitalmars.D - About built-in AAs
- bearophile (62/62) Aug 16 2011 I have some comments, that I see as problems, about the built-in AAs.
- Walter Bright (5/10) Aug 16 2011 It wasn't changed to simplify code, it was changed because it was faster...
- bearophile (7/12) Aug 16 2011 I think there are search trees like the Red-Black ones that guarantee a ...
- Walter Bright (2/3) Aug 16 2011 Just feed it more data.
- bearophile (4/7) Aug 16 2011 If you feed it more data, even if all items produce collisions because they...
- Andrew Wiley (8/19) Aug 16 2011 The problem here is that algorithmic complexity doesn't really tell the
- Josh Simmons (10/33) Aug 16 2011 There's a lot more to making a hash-map robust than changing the
- Andrei Alexandrescu (11/18) Aug 16 2011 Let's please stop this. Many of us, including yourself, noticed the
- Walter Bright (7/14) Aug 16 2011 Also, I should point out that the switch from binary trees to linear lis...
- Andrei Alexandrescu (9/28) Aug 17 2011 Though that's a good point, people would usually recompile projects if
- bearophile (5/6) Aug 17 2011 OK. This sub-thread about Hash resilience against attacks is a minor det...
- Steven Schveighoffer (7/32) Aug 17 2011 Yes, but let's not forget the one valid request out of all of this -- if...
- Andrei Alexandrescu (4/37) Aug 17 2011 Yah, we should put that on the list of things to do on the Wiki. Wait,
- Jesse Phillips (4/46) Aug 17 2011 Added, do we also want to add don't have compiler generate opCmp
- Steven Schveighoffer (15/60) Aug 17 2011 So actually, it's technically a joint effort of the compiler and druntim...
- Jonathan M Davis (5/40) Aug 17 2011 But then we can't change the hash table type to one that needs opCmp if ...
- Steven Schveighoffer (16/62) Aug 17 2011 I think that's a choice we should embrace. AFAIK, no *builtin* hash
- Andrej Mitrovic (2/2) Aug 17 2011 I'm all for whatever fixes hashes and makes them properly CTFE'able,
- Jonathan M Davis (9/86) Aug 17 2011 I'm not necessarily arguing that we should keep requiring opCmp (and cer...
- Sean Kelly (6/8) Aug 17 2011 if trees are no longer being used, opEquals should be used instead of ...
- bearophile (5/6) Aug 17 2011 OK, but I'd like to note that what I have written comes from some experi...
- Walter Bright (12/16) Aug 17 2011 I don't think it's a sensible point of view. If you supply your own hash...
- bearophile (5/6) Aug 17 2011 Thank you for your answers. And I agree that the current situation is ov...
- Sean Kelly (10/17) Aug 17 2011 more practical ones, like receiving help from the compiler if I am using...
- Jesse Phillips (4/10) Aug 17 2011 What I got from it was that it didn't use his hash until he provided opC...
- bearophile (14/15) Aug 18 2011 What you are talking about is interesting. But I was asking in the point...
- Sean Kelly (13/38) Aug 18 2011 This sounds like an application design issue rather than a language issu...
- Josh Simmons (14/42) Aug 18 2011
- Steven Schveighoffer (13/24) Aug 17 2011 And uses much less space -- only one pointer per node vs. two.
- Walter Bright (2/7) Aug 17 2011 Actually, that was a large factor in the speed increase.
- kennytm (11/20) Aug 16 2011 They use runtime typeid functions. Some V[K] type and methods are conver...
I have some comments, that I see as problems, about the built-in AAs.

1) I have written code similar to this, do you think it's correct?

import std.stdio;
struct F {
    uint x;
    hash_t toHash() { puts("*"); return x; }
    const bool opEquals(ref const(F) f) { return x == f.x; }
}
void main() {
    auto data = [F(1), F(2), F(3), F(4), F(5), F(6)];
    int[F] aa;
    foreach (i, f; data)
        aa[f] = i;
}

After a good amount of debugging I have added those puts, and I've seen that those toHash/opEquals are never called; it's the second time I waste time on this problem. In my opinion this is not acceptable in a language like D. If a user wants to add the hashing protocol to a class/struct then he/she will probably add a toHash. So if a toHash() is present, the compiler has to give a compile-time error if all the other parts of the hash protocol are not present or if they are not correct (this currently means a correct opEquals and opCmp). This is quite important. See also:
http://d.puremagic.com/issues/show_bug.cgi?id=3844
http://d.puremagic.com/issues/show_bug.cgi?id=4290

-------------------

2) Originally built-in AAs used a search tree to resolve collisions, but I think for some time now they have used a simple linked list (this has the advantage of simplifying the code, but it has the disadvantage that hashing is now able to become O(n^2), allowing certain attacks on D code that were impossible before). So why are D AAs still requiring an opCmp for user-defined hashable structs/classes? Why don't we remove this need (and change the D docs)?
-------------------

3) This program finally uses the user-defined hash:

import std.stdio;
struct F {
    uint x;
    hash_t toHash() { puts("*"); return x; }
    const bool opEquals(ref const(F) f) { return x == f.x; }
    const int opCmp(ref const F f) { puts("$"); return x - f.x; }
}
void main() {
    auto data = [F(1), F(2), F(3), F(3), F(5), F(1)];
    int[F] aa;
    foreach (i, f; data)
        aa[f] = i;
    writeln(aa);
}

It outputs:
$
$
F:1 F:3 F:5 F:4

The output shows that the AA is using opCmp() instead of opEquals(). If the AA is hash based and resolves collisions with an unordered linked list, then why isn't it calling opEquals() instead? Generally opEquals() is faster than opCmp().

-------------------

4) Now and then I compare the running time of C++0x code that uses <unordered_map> with the running time of the same operations done with D AAs, and I don't see good timings for D. Are D AAs true templates or do they use some runtime typeid function? If they aren't true templates, isn't it better to change this?

Bye,
bearophile
Aug 16 2011
On 8/16/2011 3:09 PM, bearophile wrote:
> 2) Originally built-in AAs used a search tree to resolve collisions, but I think since some time they use a simple linked list (this has the advantage of simplifying the code, but it has the disadvantage that now hashing is able to become O(n^2), allowing certain attacks to D code that were impossible before).

It wasn't changed to simplify code, it was changed because it was faster. As for "attacks", if your code is subject to people sending it specially crafted input in the hopes of making it run slowly, you can make specially crafted input to make trees run slowly, too.
Aug 16 2011
Walter Bright:
> It wasn't changed to simplify code, it was changed because it was faster.

Right.

> As for "attacks", if your code is subject to people sending it specially crafted input in the hopes of making it run slowly, you can make specially crafted input to make trees run slowly, too.

I think there are search trees, like the Red-Black ones, that guarantee an O(n ln n) worst case. Am I wrong?

> Since operations such as inserting, deleting, and finding values require worst-case time proportional to the height of the tree, this theoretical upper bound on the height allows red-black trees to be efficient in the worst-case, unlike ordinary binary search trees.

From: http://en.wikipedia.org/wiki/Red%E2%80%93black_tree#Properties

Bye,
bearophile
Aug 16 2011
On 8/16/2011 5:06 PM, bearophile wrote:
> I think there are search trees like the Red-Black ones that guarantee an O(n ln n) worst case. Am I wrong?

Just feed it more data.
Aug 16 2011
Walter Bright:
>> I think there are search trees like the Red-Black ones that guarantee an O(n ln n) worst case. Am I wrong?
> Just feed it more data.

If you feed it more data, even if all items produce collisions because they all hash to the same bucket, if you use Red-Black trees to handle the items in the same bucket you keep having O(n ln n) behaviour, which is usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives trouble way before the physical RAM is exhausted.

Bye,
bearophile
Aug 16 2011
On Tue, Aug 16, 2011 at 7:29 PM, bearophile <bearophileHUGS lycos.com> wrote:
> Walter Bright:
>>> I think there are search trees like the Red-Black ones that guarantee an O(n ln n) worst case. Am I wrong?
>> Just feed it more data.
> If you feed it more data, even if all items produce collisions because they all hash to the same bucket, if you use Red-Black trees to handle the items in the same bucket you keep having a O(n ln n) behaviour, that's usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives troubles way before the physical RAM is exhausted.

The problem here is that algorithmic complexity doesn't really tell the whole story. We can talk about what "should" be faster all day, but unless we can prove that there's really a problem here, this doesn't seem like it's worth worrying about. And if it was changed in the past because the linked lists turned out to be faster, I'd guess that actual benchmarking was probably used back then to determine which was better.
Aug 16 2011
On Wed, Aug 17, 2011 at 12:40 PM, Andrew Wiley <wiley.andrew.j gmail.com> wrote:
> On Tue, Aug 16, 2011 at 7:29 PM, bearophile <bearophileHUGS lycos.com> wrote:
>> Walter Bright:
>>>> I think there are search trees like the Red-Black ones that guarantee an O(n ln n) worst case. Am I wrong?
>>> Just feed it more data.
>> If you feed it more data, even if all items produce collisions because they all hash to the same bucket, if you use Red-Black trees to handle the items in the same bucket you keep having a O(n ln n) behaviour, that's usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives troubles way before the physical RAM is exhausted.
> The problem here is that algorithmic complexity doesn't really tell the whole story. We can talk about what "should" be faster all day, but unless we can prove that there's really a problem here, this doesn't seem like it's worth worrying about. And if it was changed in the past because the linked lists turned out to be faster, I'd guess that actual benchmarking was probably used back then to determine which was better.

There's a lot more to making a hash-map robust than changing the lookup scheme. There's also a large variety of collision resolution methods that all have their pros and cons depending on the data-set and lookup patterns you're dealing with. For a built-in AA system I feel simplicity is a much more achievable goal than trying to solve everybody's specific problems. Hence I feel that the current solution is probably a good choice. Just my two cents anyway.

Josh.
Aug 16 2011
On 8/16/11 9:29 PM, bearophile wrote:
> Walter Bright:
>>> I think there are search trees like the Red-Black ones that guarantee an O(n ln n) worst case. Am I wrong?
>> Just feed it more data.
> If you feed it more data, even if all items produce collisions because they all hash to the same bucket, if you use Red-Black trees to handle the items in the same bucket you keep having a O(n ln n) behaviour, that's usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives troubles way before the physical RAM is exhausted.
> Bye, bearophile

Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-linked-list collision handling marked an improvement. Now a lot of data structure designs have a worst case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Thanks,
Andrei
Aug 16 2011
On 8/16/2011 9:15 PM, Andrei Alexandrescu wrote:
> Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-list collision handling marked an improvement. Now a lot of data structure designs have a worst-case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Also, I should point out that the switch from binary trees to linear lists involved zero change in user code - not even a recompile. Hence, any person who is worried about Bearophile's scenario, or who believes they can come up with a faster hash, can swap in their own and prove it. The hash implementation was made to be opaque to the user, and pluggable, and this proved it.
Aug 16 2011
On 8/17/11 12:43 AM, Walter Bright wrote:
> On 8/16/2011 9:15 PM, Andrei Alexandrescu wrote:
>> Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-list collision handling marked an improvement. Now a lot of data structure designs have a worst-case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.
> Also, I should point out that the switch from binary trees to linear lists involved zero change in user code - not even a recompile. Hence, any person who is worried about Bearophile's scenario, or that believe they can come up with a faster hash, can swap it with their own and prove it. The hash implementation was made to be opaque to the user, and pluggable, and this proved it.

Though that's a good point, people would usually recompile projects if they want to relink, so the point is not as strong as it might have been. I still very strongly believe we need to continue aggressively migrating built-in hashtables from pluggable to templated. Pluggable hashtables incur indirect calls in core functions - the bane of all speed optimizers. We can't afford that cost for such a useful data structure.

Andrei
Aug 17 2011
Andrei Alexandrescu:
> Let's please stop this.

OK. This sub-thread about hash resilience against attacks is a minor detail of my original post. kennytm has answered my point 4, but I'd like some answers to the first three points: regarding the error messages for an incomplete hash protocol, the possible removal of opCmp from the hash protocol, and the usage of opCmp instead of opEquals in the current hash protocol.

Bye,
bearophile
Aug 17 2011
On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 8/16/11 9:29 PM, bearophile wrote:
>> Walter Bright:
>>>> I think there are search trees like the Red-Black ones that guarantee an O(n ln n) worst case. Am I wrong?
>>> Just feed it more data.
>> If you feed it more data, even if all items produce collisions because they all hash to the same bucket, if you use Red-Black trees to handle the items in the same bucket you keep having a O(n ln n) behaviour, that's usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives troubles way before the physical RAM is exhausted.
>> Bye, bearophile
> Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-list collision handling marked an improvement. Now a lot of data structure designs have a worst-case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.

Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

-Steve
Aug 17 2011
On 8/17/11 9:04 AM, Steven Schveighoffer wrote:On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Yah, we should put that on the list of things to do on the Wiki. Wait, where the heck is that? :o) AndreiOn 8/16/11 9:29 PM, bearophile wrote:Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used insted of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change. -SteveWalter Bright:Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-list collision handling marked an improvement. Now a lot of data structure designs have a worst-case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.If you feed it more data, even if all items pruce collision because they all hash to the same bucket, if you use Red-Black trees to handle the items in the same bucket you keep having a O(n ln n) behaviour, that's usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives troubles way before the physical RAM is exhausted. Bye, bearophileI think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. I am wrong?Just feed it more data.
Aug 17 2011
On Wed, 17 Aug 2011 10:46:00 -0500, Andrei Alexandrescu wrote:On 8/17/11 9:04 AM, Steven Schveighoffer wrote:Added, do we also want to add don't have compiler generate opCmp automatically? http://www.prowiki.org/wiki4d/wiki.cgi?LanguageDevelOn Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Yah, we should put that on the list of things to do on the Wiki. Wait, where the heck is that? :o) AndreiOn 8/16/11 9:29 PM, bearophile wrote:Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used insted of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change. -SteveWalter Bright:Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-list collision handling marked an improvement. Now a lot of data structure designs have a worst-case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.If you feed it more data, even if all items pruce collision because they all hash to the same bucket, if you use Red-Black trees to handle the items in the same bucket you keep having a O(n ln n) behaviour, that's usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives troubles way before the physical RAM is exhausted. Bye, bearophileI think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. I am wrong?Just feed it more data.
Aug 17 2011
On Wed, 17 Aug 2011 21:07:29 -0400, Jesse Phillips <jessekphillips+d gmail.com> wrote:On Wed, 17 Aug 2011 10:46:00 -0500, Andrei Alexandrescu wrote:So actually, it's technically a joint effort of the compiler and druntime. The compiler doesn't *reject* creating AA's using keys that don't define opCmp. It doesn't actually generate a function. But druntime *does* do a "default" opCmp if no opCmp is defined. See here: https://github.com/D-Programming-Language/druntime/blob/master/src/object_.d#L955 In essence, this "bug" is fixed by one of two things: 1. Using a fully templated AA implementation instead of a RTTI-based one (which avoids using the TypeInfo.compare function). 2. Not using opCmp anymore. Since 2 is already happening, I think we are good. I don't know how we should "fix" the TypeInfo_Struct function. Should it just throw an exception if xOpCmp isn't set? -SteveOn 8/17/11 9:04 AM, Steven Schveighoffer wrote:Added, do we also want to add don't have compiler generate opCmp automatically?On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Yah, we should put that on the list of things to do on the Wiki. Wait, where the heck is that? :o) AndreiOn 8/16/11 9:29 PM, bearophile wrote:Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used insted of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change. -SteveWalter Bright:Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-list collision handling marked an improvement. Now a lot of data structure designs have a worst-case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. 
If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.If you feed it more data, even if all items pruce collision because they all hash to the same bucket, if you use Red-Black trees to handle the items in the same bucket you keep having a O(n ln n) behaviour, that's usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives troubles way before the physical RAM is exhausted. Bye, bearophileI think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. I am wrong?Just feed it more data.
Aug 17 2011
On Wednesday, August 17, 2011 10:04:21 Steven Schveighoffer wrote:On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:But then we can't change the hash table type to one that needs opCmp if we need to later. That might be acceptable, but it makes it so that we can't transparently change the implementation again if we decide that we need to. - Jonathan M DavisOn 8/16/11 9:29 PM, bearophile wrote:Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used insted of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.Walter Bright:Let's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared to other languages. Switching to singly-list collision handling marked an improvement. Now a lot of data structure designs have a worst-case that makes them perform worse than others. If you worry about attacks, please implement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables being slower than Python's, thus closing a years-long cycle.If you feed it more data, even if all items pruce collision because they all hash to the same bucket, if you use Red-Black trees to handle the items in the same bucket you keep having a O(n ln n) behaviour, that's usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives troubles way before the physical RAM is exhausted. Bye, bearophileI think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. I am wrong?Just feed it more data.
Aug 17 2011
On Wed, 17 Aug 2011 11:47:59 -0400, Jonathan M Davis <jmdavisProg gmx.com> wrote:On Wednesday, August 17, 2011 10:04:21 Steven Schveighoffer wrote:I think that's a choice we should embrace. AFAIK, no *builtin* hash implementations use trees for buckets in any language I'm aware of (I'm sure someone will find one though :). The precedent is to require opHash and opEquals, not opCmp. It just makes more sense for builtin hash tables to allow the most possible key types it can. Also, currently, if opCmp doesn't exist the *COMPILER MAKES ONE UP*, which is totally unacceptable. So if you define opEquals and not opCmp, as bearophile points out, your specifically defined opEquals is not even used, and some made-up approximation is used instead! It's one thing to make up opEquals, that is pretty easy to get reasonably right. It's something entirely different to invent an opCmp, especially for types which have no ordering! -SteveOn Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:But then we can't change the hash table type to one that needs opCmp if we need to later. That might be acceptable, but it makes it so that we can't transparently change the implementation again if we decide that we need to.On 8/16/11 9:29 PM, bearophile wrote:handleWalter Bright:If you feed it more data, even if all items pruce collision because they all hash to the same bucket, if you use Red-Black trees toI think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. I am wrong?Just feed it more data.otherthe items in the same bucket you keep having a O(n ln n) behaviour, that's usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives troubles way before the physical RAM is exhausted. Bye, bearophileLet's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared tolanguages. 
Switching to singly-list collision handling marked an improvement. Now a lot of data structure designs have a worst-casethatmakes them perform worse than others. If you worry about attacks,pleaseimplement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables beingslowerthan Python's, thus closing a years-long cycle.Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used insted of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.
Aug 17 2011
I'm all for whatever fixes hashes and makes them properly CTFE'able, IOW no more "expression is not constant" for hash declarations please!
Aug 17 2011
On Wednesday, August 17, 2011 08:59 Steven Schveighoffer wrote:On Wed, 17 Aug 2011 11:47:59 -0400, Jonathan M Davis <jmdavisProg gmx.com> wrote:I'm not necessarily arguing that we should keep requiring opCmp (and certainly having the compiler generate one is not good IMHO). I'm just pointing out that that would mean that we'd be putting further limitations on our ability to change the implementation later if we decide that we need to. As long as we think that the benefits outweigh the costs, then removing the need for opCmp and AA's probably is what we should do. But regardless, having the compiler create an opCmp for you seems like highly broken behavior. - Jonathan M DavisOn Wednesday, August 17, 2011 10:04:21 Steven Schveighoffer wrote:I think that's a choice we should embrace. AFAIK, no *builtin* hash implementations use trees for buckets in any language I'm aware of (I'm sure someone will find one though :). The precedent is to require opHash and opEquals, not opCmp. It just makes more sense for builtin hash tables to allow the most possible key types it can. Also, currently, if opCmp doesn't exist the *COMPILER MAKES ONE UP*, which is totally unacceptable. So if you define opEquals and not opCmp, as bearophile points out, your specifically defined opEquals is not even used, and some made-up approximation is used instead! It's one thing to make up opEquals, that is pretty easy to get reasonably right. It's something entirely different to invent an opCmp, especially for types which have no ordering!On Wed, 17 Aug 2011 00:15:38 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:But then we can't change the hash table type to one that needs opCmp if we need to later. 
That might be acceptable, but it makes it so that we can't transparently change the implementation again if we decide that we need to.On 8/16/11 9:29 PM, bearophile wrote:handleWalter Bright:If you feed it more data, even if all items pruce collision because they all hash to the same bucket, if you use Red-Black trees toI think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. I am wrong?Just feed it more data.otherthe items in the same bucket you keep having a O(n ln n) behaviour, that's usually fast enough. With Python and the new D AAs you instead get a quadratic one. This quadratic behaviour gives troubles way before the physical RAM is exhausted. Bye, bearophileLet's please stop this. Many of us, including yourself, noticed the relatively poor performance of D's previous hashtables compared tolanguages. Switching to singly-list collision handling marked an improvement. Now a lot of data structure designs have a worst-casethatmakes them perform worse than others. If you worry about attacks,pleaseimplement your own hashtable. If we switch back to the old implementation, you'll complain again about D's hashtables beingslowerthan Python's, thus closing a years-long cycle.Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used insted of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.
Aug 17 2011
On Aug 17, 2011, at 7:04 AM, Steven Schveighoffer wrote:
> Yes, but let's not forget the one valid request out of all of this -- if trees are no longer being used, opEquals should be used instead of opCmp. This allows more possible key types (which don't define an ordering). I think this would be a simple druntime change.

Yup. It would make the AAs faster too. I'd love to deprecate the requirement for opCmp with AAs.
Aug 17 2011
Andrei Alexandrescu:
> Let's please stop this.

OK, but I'd like to note that what I have written comes from some experiments too:
http://is.gd/L4U3Kp

Hugs,
bearophile
Aug 17 2011
On 8/17/2011 11:33 AM, bearophile wrote:
> Andrei Alexandrescu:
>> Let's please stop this.
> OK, but I'd like to note that what I have written comes from some experiments too:
> http://is.gd/L4U3Kp

I don't think it's a sensible point of view. If you supply your own hash function, the onus is on you to produce a suitable one. It is not the language's fault if the hash tables run poorly when your hash function always returns the same value.

Secondly, I seriously doubt this is any sort of viable "attack" vector. If you are allowing users to write system code (which D is) and remotely run it on your system, a slow hash map is the least of your security concerns. Heck, you could just write for(;;){} and call it a day. Nothing clever required.

And the linear version is much FASTER and uses significantly LESS memory in real-world situations, which is why we switched to it.

Bottom line, I don't think there's an actual problem here.
Aug 17 2011
Walter:
> Bottom line, I don't think there's an actual problem here.

Thank you for your answers. And I agree that the current situation is overall better than the previous one.

My original first post of this thread was about other problems, rather more practical ones, like receiving help from the compiler if I am using the hash protocol badly, etc. :-)

Bye,
bearophile
Aug 17 2011
On Aug 17, 2011, at 2:36 PM, bearophile wrote:
> Walter:
>> Bottom line, I don't think there's an actual problem here.
> Thank you for your answers. And I agree that the current situation is overall better than the previous one.
> My original first post of this thread was about other problems, rather more practical ones, like receiving help from the compiler if I am using the hash protocol badly, etc. :-)

This would be a run-time issue, unless you're asking the compiler to verify your hash algorithm at compile-time :-p I'd actually like to have some introspection functionality so I could find out the average chain length, max chain length, etc (basically what's provided by the unordered containers from C++11), but the user would still have to query this stuff to know that something was wrong.
Aug 17 2011
On Wed, 17 Aug 2011 15:40:25 -0700, Sean Kelly wrote:
> This would be a run-time issue, unless you're asking the compiler to verify your hash algorithm at compile-time :-p I'd actually like to have some introspection functionality so I could find out the average chain length, max chain length, etc (basically what's provided by the unordered containers from C++11), but the user would still have to query this stuff to know that something was wrong.

What I got from it was that it didn't use his hash until he provided opCmp too, so the compiler should complain when it isn't going to use the hash you provide.
Aug 17 2011
Sean Kelly:
> This would be a run-time issue, unless you're asking the compiler to verify your hash algorithm at compile-time :-p I'd actually like to have some introspection functionality so I could find out the average chain length, max chain length, etc (basically what's provided by the unordered containers from C++11), but the user would still have to query this stuff to know that something was wrong.

What you are talking about is interesting. But what I was asking in point 1 of my original post is something different and simpler; it's a fully static thing. This was my original post of this thread:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=142709

What I was asking in that point 1 is:

1) If the user writes wrong signatures for any of the methods needed for the hash protocol (currently they are toHash(), opEquals() and opCmp()), then I'd like the compiler to tell me with a compile-time error. I think DMD is already mostly doing this.

2) If the user adds a toHash method to a class or struct, then the compiler must statically require the presence of all the other methods needed by the hash protocol. Currently DMD is not doing this: if you add correct toHash() and opEquals() methods, the compiler will silently not use them. I think this isn't acceptable in a good language.

--------------

In the successive points of my original post I asked two other things: pushing AAs to a fully templated implementation, and totally removing the need for opCmp in the hash protocol.

--------------

Extra: I presume there is no way to ask (on request) the language to use a stack-style memory allocator for the built-in AAs instead of the normal GC. But once built-in AAs are fully templated, I presume it's not too hard to give an interface to them from Phobos functions too. Then it becomes possible to let them use a different memory allocator.

Bye,
bearophile
Aug 18 2011
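[Editor's note: a minimal sketch of the hash protocol bearophile's point 2 describes, written for illustration only. The exact required signatures varied across 2011-era dmd releases; the weak hash and the struct name are made up for the example.]

```d
import std.stdio;

struct Key
{
    int x;

    // Custom hash: deliberately weak so collisions are easy to see.
    size_t toHash() const nothrow @safe
    {
        return x & 0xF;
    }

    // Custom equality, used to distinguish colliding keys in a bucket.
    bool opEquals(ref const Key other) const
    {
        return x == other.x;
    }

    // In the dmd of the time, leaving this out meant the AA silently
    // ignored toHash/opEquals above and used a compiler-generated
    // comparison instead -- the behavior bearophile objects to.
    int opCmp(ref const Key other) const
    {
        return (x > other.x) - (x < other.x);
    }
}

void main()
{
    int[Key] aa;
    aa[Key(1)]  = 10;
    aa[Key(17)] = 20;   // collides with Key(1): 17 & 0xF == 1
    writeln(aa.length); // 2 distinct keys despite the hash collision
}
```

The point of the sketch: opEquals alone should be enough to resolve colliding keys in a list-based bucket; opCmp is only needed if buckets are ordered structures such as trees.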
This sounds like an application design issue rather than a language issue. Do any languages use a pool of hash routines like this?

Sent from my iPhone

On Aug 17, 2011, at 5:06 PM, Josh Simmons <simmons.44 gmail.com> wrote:
> On Thu, Aug 18, 2011 at 8:40 AM, Sean Kelly <sean invisibleduck.org> wrote:
>> On Aug 17, 2011, at 2:36 PM, bearophile wrote:
>>> Walter:
>>>> Bottom line, I don't think there's an actual problem here.
>>> Thank you for your answers. And I agree that the current situation is overall better than the previous one.
>>> My original first post of this thread was about other problems, rather more practical ones, like receiving help from the compiler if I am using the hash protocol badly, etc. :-)
>> This would be a run-time issue, unless you're asking the compiler to verify your hash algorithm at compile-time :-p I'd actually like to have some introspection functionality so I could find out the average chain length, max chain length, etc (basically what's provided by the unordered containers from C++11), but the user would still have to query this stuff to know that something was wrong.
>
> The security issue is basically a DoS one; for example, if you know a web server is using a specific hash and collision resolution method to store message headers, you can pass headers that all hash to buckets that provide worst-case behavior. In this instance, universal hashing -- where a hash function is chosen randomly from a pool of hashes -- combined with good algorithmic complexity means the attacker is unable to do this reliably.
>
> Unrelated though, I'm quite a fan of hopscotch hashing at the moment, in theory at least. It'd be interesting to get a few different resolution schemes working and compare their performance on various workloads.
Aug 18 2011
On Fri, Aug 19, 2011 at 1:43 AM, Sean Kelly <sean invisibleduck.org> wrote:
> This sounds like an application design issue rather than a language issue. Do any languages use a pool of hash routines like this?
>
> On Aug 17, 2011, at 5:06 PM, Josh Simmons <simmons.44 gmail.com> wrote:
>> The security issue is basically a DoS one; for example, if you know a web server is using a specific hash and collision resolution method to store message headers, you can pass headers that all hash to buckets that provide worst-case behavior. In this instance, universal hashing -- where a hash function is chosen randomly from a pool of hashes -- combined with good algorithmic complexity means the attacker is unable to do this reliably.
>>
>> Unrelated though, I'm quite a fan of hopscotch hashing at the moment, in theory at least. It'd be interesting to get a few different resolution schemes working and compare their performance on various workloads.

I doubt it, I was just noting the issue, not suggesting it should be fixed.

Cheers,
Josh.
Aug 18 2011
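[Editor's note: the universal-hashing idea Josh mentions can be sketched in a few lines of D. This is illustrative only, not a proposal for druntime; the mixing constant and function name are made up, and `unpredictableSeed` is from modern Phobos `std.random`.]

```d
import std.random;

// Seed chosen at random once per process: an attacker who cannot
// observe it cannot precompute a set of keys that all collide.
immutable size_t seed;

shared static this()
{
    seed = unpredictableSeed;
}

// Seeded multiplicative string hash (illustrative constants).
size_t seededHash(string s)
{
    size_t h = seed;
    foreach (c; s)
        h = h * 31 + c;
    return h;
}
```

Randomizing the seed per process is the lighter-weight cousin of picking a whole function from a pool; both defeat offline precomputation of colliding inputs.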
On Tue, 16 Aug 2011 20:06:19 -0400, bearophile <bearophileHUGS lycos.com> wrote:
> Walter Bright:
>> It wasn't changed to simplify code, it was changed because it was faster.

And it uses much less space -- only one pointer per node vs. two.

>> As for "attacks", if your code is subject to people sending it specially crafted input in the hopes of making it run slowly, you can make specially crafted input to make trees run slowly, too.

Right. I'd add that removing the requirement for opCmp is a reasonable request -- this opens up keys to significantly more data types (those which don't have a defined order).

> I think there are search trees like the Red-Black ones that guarantee a O(n ln n) worst case. I am wrong?

Red-Black trees would be overkill here. The ideal hash table has only one element per bucket, and at most 2 or 3. This is why you expand the bucket array even though you only have an element-to-bucket ratio of .75 or so. Consider that for one or two (or even sometimes three) elements, a linked list has identical topology, and its insertion/removal algorithm is much less complex.

-Steve
Aug 17 2011
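[Editor's note: Steve's space argument can be made concrete with a sketch of the two node layouts. The names and fields are illustrative, not druntime's actual `aaA.d` layout.]

```d
// A chained-bucket node needs one link pointer...
struct ListNode(K, V)
{
    K key;
    V value;
    size_t hash;     // cached hash of the key
    ListNode* next;  // one pointer per node
}

// ...while a binary tree node needs two.
struct TreeNode(K, V)
{
    K key;
    V value;
    size_t hash;
    TreeNode* left, right; // two pointers per node
}

// Per-node, the list layout is strictly smaller.
static assert(ListNode!(int, int).sizeof < TreeNode!(int, int).sizeof);
```

With a load factor of ~.75, most buckets hold zero or one element, so the extra pointer of the tree node buys nothing while costing memory on every entry.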
On 8/17/2011 7:00 AM, Steven Schveighoffer wrote:
> On Tue, 16 Aug 2011 20:06:19 -0400, bearophile <bearophileHUGS lycos.com> wrote:
>> Walter Bright:
>>> It wasn't changed to simplify code, it was changed because it was faster.
> And it uses much less space -- only one pointer per node vs. two.

Actually, that was a large factor in the speed increase.
Aug 17 2011
bearophile <bearophileHUGS lycos.com> wrote:
> Now and then I compare the running time of C++0x code that uses <unordered_map> with the running time of the same operations done with D AAs, and I don't see good timings for D. Are D AAs true templates or do they use some runtime typeid function? If they aren't true templates, isn't it better to change this?
> Bye,
> bearophile

They use runtime typeid functions. A V[K] type and its methods are converted to AssociativeArray!(K, V), which itself is a proxy for those extern(C) aaGetX() functions. Some methods will resolve to those aaGetX functions directly in DMD. These functions can only know what types they're working on via the typeid.

IMO, in DMD, V[K]'s methods shouldn't be special. They should be treated just as a syntactic substitution for the AssociativeArray!(K, V) type. The latter shall implement opIndex etc. for its normal operation. Druntime gets to decide whether to continue using the C interface or turn it into a concrete templated structure. This also solves bug 5350.
Aug 16 2011
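[Editor's note: the "syntactic substitution" idea in the last message could look roughly like the sketch below. It is illustrative only -- the struct name and methods are made up, and for brevity it delegates to the built-in AA rather than owning its buckets. The point is the shape: operations resolve at compile time through templated operator overloads instead of typeid-driven extern(C) calls.]

```d
struct AssocArray(K, V)
{
    private V[K] impl; // the sketch delegates to the built-in AA

    // Operator overloads let the compiler lower aa[k] and aa[k] = v
    // to fully typed template code, no runtime typeid needed.
    ref V opIndex(K key) { return impl[key]; }
    void opIndexAssign(V value, K key) { impl[key] = value; }

    @property size_t length() const { return impl.length; }
}

unittest
{
    AssocArray!(string, int) aa;
    aa["one"] = 1;
    assert(aa["one"] == 1 && aa.length == 1);
}
```

Under this scheme V[K] becomes pure sugar for AssocArray!(K, V), and druntime is free to swap the implementation behind the template without touching the compiler.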