digitalmars.D - Robustness of the built in associative array
Hello,

The latest "Implicit fn template instantiation" thread contains code that
contains a hidden and possibly hard to find bug. Rewriting the code with a
non-templated version for clarity:

  void increment(int[char[]] map, char[] key) {
      if (auto t = (key in map))
          (*t)++;
      else
          map[key] = 1;
  }

To understand the code, one first has to know whether the associative array
is a value type or a reference type. It turns out it is neither.

It is definitely not a value type. The above function will in most cases
alter the original associative array and work as expected. But it is not a
reference type either -- in the few cases where adding a new key to the AA
invokes a rehash, all hell breaks loose. The original associative array is
suddenly *corrupted*! (The way to make the function work as expected is to
make the map an inout argument, but that is beside my point.)

Internally, an AA is implemented as an aaA*[] -- a dynamic array of pointers
to aaA nodes. The first node of the array contains metadata (the total number
of keys), and the remaining [1..$] nodes contain the hash buckets. When
passing an AA to a function, the aaA*[] gets passed.

I assume the reason for this implementation is efficiency. All you need to do
an efficient lookup is the array of buckets. But this implementation comes at
a cost in robustness. To make use of AAs, you have to follow a special rule:

- Never add an element to or rehash the AA unless you know you hold the only
  reference.

This means that even though the AA behaves as a reference type, the data can
not be shared unless all references are read-only. I cannot help but see this
bastardization of being neither value nor reference type as a design mistake.

A simple fix is to change the AA implementation so that AAs become MyAA*
instead of aaA*[], with MyAA defined as:

  struct MyAA {
      AANode[] buckets;
      uint numberOfNodes;
  }

You get a very slight additional lookup cost (double dereferencing to get to
the buckets), but make the AA a pure reference type without any caveats and
problems.

My questions are:

1) Why is the D built in array type designed in such a fragile way? Should D
not be a robust language? Do users have to resort to writing library wrappers
around D's types to get robustness? If so:

2) Why even keep the builtin AA? With the addition of an opIn() overload, you
could get *exactly* the same functionality from a library implementation:

  Map!(char[], int) map;

rather than:

  int[char[]] map;

Cons:
- This slight syntax difference.

Pros:
- The library version would be transparently replaceable by other
  implementations when the data or other conditions have different
  algorithmic requirements.
- The library implementation could easily be more efficient than the current
  built in version by taking advantage of templates.
- The library implementation could without problems be made to work with
  types that lack opCmp.
- The library version would not have any strange quirks and gotchas
  (dangerous usage cases like the above).
- Greater library/language decoupling.
- The possibility of implementing an UnsafeMap type with high performance but
  less safety for the few cases where the performance really matters.

In my opinion, D's lack of protection (const arrays, etc.) and robustness
against programming mistakes is D's greatest weakness.

My contribution to the slogan thread: "D - Programming without a lifeline" :)

/Oskar
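Purely as an illustration (not code from the original thread), here is a
minimal sketch of how the corruption can be triggered. It assumes D1-era
Phobos (std.string.toString, writefln), and the number of insertions needed
to actually force a rehash depends on the runtime:

  import std.stdio;
  import std.string;

  void addMany(int[char[]] map)   // no inout: only the aaA*[] handle is copied
  {
      // Insert enough new keys that the runtime eventually rehashes.
      for (int i = 0; i < 10000; i++)
          map[std.string.toString(i)] = i;
  }

  void main()
  {
      int[char[]] original;
      original["seed"] = 0;
      addMany(original);
      // After a rehash inside addMany, `original` may still point at the
      // old bucket array -- lookups through it can now misbehave or crash.
      writefln(original.length);
  }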
Mar 24 2006
In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
> Hello,
> The latest "Implicit fn template instantiation" thread contains code that
> contains a hidden and possibly hard to find bug.
> [...]

Good catch!

> 2) Why even keep the builtin AA? With the addition of an opIn() overload,
> you could get *exactly* the same functionality from a library implementation

Nope, in fact, you couldn't. The kicker is that the builtin AA uses some
syntax that is currently _impossible_ to duplicate in custom types. Eg. the
following:

I find it rather confusing and frustrating. In my current project I ended up
making my own template AA to get around some of the deficiencies of the
builtin type. Mostly I wanted to avoid double lookups, and easily allow
custom hashers without encapsulating the key in a struct/class (think
case-insensitive char[] keys.) The result was also faster and more memory
efficient than the builtin AA.

Nick
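For illustration only, a rough sketch of the kind of templated AA Nick
describes, with the hash and equality functions as alias template parameters
so that a custom hasher (e.g. for case-insensitive char[] keys) needs no
wrapper struct around the key. The names and layout are hypothetical, not
Nick's actual code:

  struct HashMap(K, V, alias hashFn, alias eqFn)
  {
      struct Node { K key; V value; Node* next; }
      Node*[] buckets;

      void insert(K key, V value)
      {
          if (buckets.length == 0)
              buckets.length = 97;                 // initial table size
          size_t i = hashFn(key) % buckets.length;
          for (Node* n = buckets[i]; n; n = n.next)
              if (eqFn(n.key, key)) { n.value = value; return; }
          Node* node = new Node;                   // one allocation per insert
          node.key = key; node.value = value; node.next = buckets[i];
          buckets[i] = node;
      }

      // Single traversal: returns a pointer to the value, or null if absent.
      // This is the role the thread would like an opIn() overload to play.
      V* find(K key)
      {
          if (buckets.length == 0) return null;
          size_t i = hashFn(key) % buckets.length;
          for (Node* n = buckets[i]; n; n = n.next)
              if (eqFn(n.key, key)) return &n.value;
          return null;
      }
  }

A case-insensitive map would then look something like
HashMap!(char[], int, ciHash, ciEq), with ciHash and ciEq being hypothetical
helper functions.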
Mar 24 2006
Nick wrote:
> In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
>> Hello,
>> The latest "Implicit fn template instantiation" thread contains code that
>> contains a hidden and possibly hard to find bug.
>> [...]
> Good catch!

Indeed. It seems roughly similar to how arrays are handled: in place
modifications persist, but anything causing a realloc does not. But the
original should never be rendered unusable.

>> 2) Why even keep the builtin AA? With the addition of an opIn() overload,
>> you could get *exactly* the same functionality from a library
>> implementation
> Nope, in fact, you couldn't. The kicker is that the builtin AA uses some
> syntax that is currently _impossible_ to duplicate in custom types. Eg.
> the following:

Yup, not having a bona fide reference type in D is problematic. Though you
could always use pointers :-p

> I find it rather confusing and frustrating. In my current project I ended
> up making my own template AA to get around some of the deficiencies of the
> builtin type. Mostly I wanted to avoid double lookups, and easily allow
> custom hashers without encapsulating the key in a struct/class (think
> case-insensitive char[] keys.) The result was also faster and more memory
> efficient than the builtin AA.

This would also *almost* allow us to get rid of opCmp(Object) and
opEquals(Object) in the Object base class, which would be very nice. I
definitely do like having built in AAs, but I do find some of the
consequences irritating.

Sean
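For comparison, the dynamic-array behaviour Sean refers to looks roughly like
this (an illustrative sketch, not from the thread):

  void modify(int[] arr)
  {
      arr[0] = -1;   // in-place write: the caller sees this
      arr ~= 99;     // growth: only the local slice's length changes, and a
                     // realloc here quietly re-seats the local slice
  }

  void demo()
  {
      int[] a = new int[4];
      modify(a);
      assert(a[0] == -1);     // the in-place modification persisted
      assert(a.length == 4);  // the append did not
      // Unlike the AA case, `a` itself is still perfectly usable afterwards.
  }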
Mar 24 2006
Sean Kelly wrote:
> Nick wrote:
>> In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
>>> 2) Why even keep the builtin AA? With the addition of an opIn() overload,
>>> you could get *exactly* the same functionality from a library
>>> implementation
>> Nope, in fact, you couldn't. The kicker is that the builtin AA uses some
>> syntax that is currently _impossible_ to duplicate in custom types. Eg.
>> the following:
> Yup, not having a bona fide reference type in D is problematic. Though you
> could always use pointers :-p

Consider the consequences of using a pointer in the above example. ;)

>> I find it rather confusing and frustrating. In my current project I ended
>> up making my own template AA to get around some of the deficiencies of the
>> builtin type. Mostly I wanted to avoid double lookups, and easily allow
>> custom hashers without encapsulating the key in a struct/class (think
>> case-insensitive char[] keys.) The result was also faster and more memory
>> efficient than the builtin AA.
> This would also *almost* allow us to get rid of opCmp(Object) and
> opEquals(Object) in the Object base class, which would be very nice. I
> definately do like having built in AAs, but I do find some of the
> consequences irritating.

Is there any way to retain the syntax and ease of use of the built in AA with
a library implementation allowing the flexibility of choosing:
- implementation
- hash function
- comparison function
?

/Oskar
Mar 24 2006
In article <e01a21$te2$5 digitaldaemon.com>, Oskar Linde says...
> Is there any way to retain the syntax and ease of use of the built in AA
> with a library implementation allowing the flexibility of choosing:
> - implementation
> - hash function
> - comparison function
> ?

I think the STL containers do a very good job at being both flexible and
usable, covering most normal cases of use. The thing missing from D to make a
STL-like library would be the reference return type. Some of the syntax would
still be impossible to duplicate, eg.

  int[int] ii;
  ii[1]++;        // This inserts the key '1'
  int i = ii[2];  // This does NOT insert '2'

but I always thought this was the wrong way to do it anyway. (Eg. what should
happen when the value is a struct and you use ii[3].foo?)

Nick
Mar 25 2006
Nick wrote:
> but I always thought this was the wrong way to do it anyway. (Eg. what
> should happen when the value is a struct and you use ii[3].foo?)

Raise an exception?

Wolfgang Draxinger
Mar 28 2006
Nick wrote:
> I think the STL containers do a very good job at being both flexible and
> usable, covering most normal cases of use. The thing missing from D to make
> a STL-like library would be the reference return type.

Well, there's just the little problem that C++ '&' reference types being
returned from functions are not safe either. A C++ reference is IMHO nothing
better than a strongly typed pointer. Example:

  class foo
  {
      foo(int bar) { a = new int[bar]; }
      int *a;

      // ignore the possible bounds error ;-)
      int & operator [] (int i) { return a[i]; }

      int & something()
      {
          int hello = 123;
          return hello;
      }
  };

This isn't a bit better than using a pointer instead of a reference.

The real problem currently is that the index expression can be used as an
lvalue, but opIndex has no support for that yet. But why not solve it this
way: the code

  aa[i]++;

compiles into execution of

  aa.opIndexAssign(aa.opIndex(i) + 1, i);

Now for the support of user defined AAs I'd like to suggest a new overload of
opIndex and opIndexAssign, or new operator functions opKey and opKeyAssign,
which take a parameter of the key type as first parameter. Also I'd introduce
a base AA type, so that one can easily derive from it. To use such a user
defined AA class I'd like to suggest the following syntax:

  int[char[]]!MyAA an_AA;

The idea is that any sort of AA is in fact an implicit instantiation of the
builtin AA template, i.e.

  int[char[]] an_AA;

is the same as

  int[char[]]!__builtinAA an_AA;

Instead of a struct I'd use classes for this. AAs are dynamic and thus
allocated on the heap anyway. But making them a class would easily allow
catching such ambiguous things as Oskar found. So logically an AA class would
be declared as

  class __builtinAA(T, K) { }

and a derived class as

  class MyAA(T, K) : __builtinAA!(T, K) { }

or even

  class IntToStringAA : __builtinAA!(char[], int) { }

Eventually, derive __builtinAA from a common base, non-templated AA type, so
that RTTI clearly recognizes it as an AA, rather than deriving from a
templated class.

How about this?

Wolfgang Draxinger
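Applied by hand to a user-defined type, the lowering Wolfgang proposes would
look something like this (the Counter type and its members are made up for
the example):

  struct Counter
  {
      int[16] data;
      int  opIndex(int i)              { return data[i]; }
      void opIndexAssign(int v, int i) { data[i] = v; }
  }

  void use()
  {
      Counter c;
      // c[3]++;  // not rewritten for user types today; the proposal would
      //          // have the compiler emit the call below instead:
      c.opIndexAssign(c.opIndex(3) + 1, 3);
  }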
Mar 28 2006
In article <e0169e$25hq$1 digitaldaemon.com>, Sean Kelly says...
> Indeed. It seems roughly similar to how arrays are handled: in place
> modifications persist, but anything causing a realloc does not. But the
> original should never be rendered unusable.

True. Except that adding an element to an array (when no realloc occurs) does
not change the original either.

> This would also *almost* allow us to get rid of opCmp(Object) and
> opEquals(Object) in the Object base class, which would be very nice. I
> definately do like having built in AAs, but I do find some of the
> consequences irritating.

Sort of OT, but I've studied the builtin AA code and have come to the
conclusion that opCmp() isn't needed NOW either. In the binary tree
implementation that Walter used, he first compares the hash values, and only
if those match does he call opCmp(). Since actual hash collisions should be
pretty uncommon(*), opCmp is almost never called(**). In the rare case of a
hash collision, one could make a special case and eg. always place it in the
left node.

(*) It is uncommon for decent hash functions. It seems the builtin AA is
optimized to also work with very bad hashers (in which case it approaches a
pure binary tree.) I guess this isn't a bad feature for a generic AA, but it
is also some of the reason why my custom hash map outperformed it (I didn't
use btrees at all.)

(**) Just did a test with an english dictionary of 173528 words and the
standard D string hasher. opCmp() was called 1840 times, slightly over 1%.

Nick
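In pseudocode, the comparison order Nick describes is roughly the following
(an illustration, not the actual aaA.d source):

  // Compare the stored hashes first; consult opCmp only on a true collision.
  int compareKeys(uint hashA, Object keyA, uint hashB, Object keyB)
  {
      if (hashA != hashB)
          return hashA < hashB ? -1 : 1;
      return keyA.opCmp(keyB);   // reached only when the hashes collide
  }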
Mar 25 2006
In article <e0360o$28ep$1 digitaldaemon.com>, Nick says...
> (*) It is uncommon for decent hash functions. It seems the builtin AA is
> optimized to also work with very bad hashers (in which case it approaches a
> pure binary tree.) I guess this isn't a bad feature for a generic AA, but
> it is also some of the reason why my custom hash map outperformed it (I
> didn't use btrees at all.)

Just curious - what portion of the perf. diff. was that? What data types? And
in your estimation what accounts for the rest of the diff. (e.g. are you
pre-allocating a bunch of space for the data (as opposed to the key) or not
using the GC)?

Thanks,

- Dave
Mar 26 2006
In article <e06hpc$vb6$1 digitaldaemon.com>, Dave says...
> Just curious - what portion of the perf. diff. was that?

I don't remember exactly. I did a btree implementation, but when I saw that
it was slightly slower than the (simpler) list implementation (and definitely
not faster), I just scrapped it.

Note that from a theoretical viewpoint, the btrees only make sense when the
average bucket size is larger than 2. This practically only happens when the
lookup table is too small compared to the number of elements, or when you use
a bad hash function. Both probably will occur "in the wild", though. The DMD
AA should keep the btrees. I knew my data beforehand, so I didn't need them.

> What data types?

The speed tests were done with char[] keys and int values.

> And in your estimation what accounts for the rest of the diff.

For lookups, hard to say. I found that using a simple power-of-two table size
and a bitmask for lookups was just as good as messing about with prime
numbers, but only gave a minuscule performance increase. Perhaps templated
code just means less overhead. In any case, the difference was very small,
and the builtin AA is very fast.

Inserts, however, are a different story. The builtin AA starts with a small
table size (97), and resizes (rehashes) at regular intervals as you insert
elements. Being able to specify the approximate number of elements _in
advance_ (thus avoiding the rehashes) gave me a 3x-5x performance increase
when inserting.

> (e.g. are you pre-allocating a bunch of space for the data (as opposed to
> the key) or not using the GC)?

Nope, the default allocator is the GC, and one new node struct is new'ed for
each insert. (But with a malloc or region allocator it becomes much faster.)

Nick
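The two bucket-indexing schemes Nick compares differ only in how a hash value
is reduced to a bucket index (illustrative sketch):

  size_t bucketIndexPow2(uint hash, size_t tableSize)   // tableSize is a power of two
  {
      return hash & (tableSize - 1);   // cheap bitmask
  }

  size_t bucketIndexPrime(uint hash, size_t tableSize)  // tableSize is prime, e.g. 97
  {
      return hash % tableSize;         // integer modulo
  }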
Mar 27 2006
In article <e08fi6$ouf$1 digitaldaemon.com>, Nick says...
> In article <e06hpc$vb6$1 digitaldaemon.com>, Dave says...
>> Just curious - what portion of the perf. diff. was that?
> <snip, see: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/35954>
> Nick

Thanks for that info.
Mar 27 2006
Sean Kelly wrote:
>>> 2) Why even keep the builtin AA? With the addition of an opIn() overload,
>>> you could get *exactly* the same functionality from a library
>>> implementation
>> Nope, in fact, you couldn't. The kicker is that the builtin AA uses some
>> syntax that is currently _impossible_ to duplicate in custom types. Eg.
>> the following:
> Yup, not having a bona fide reference type in D is problematic. Though you
> could always use pointers :-p

WHOOOA there. "Reference type" is a pretty overloaded term that means two
distinct things, which I hope you already know:
* Reference types (as in vs. value types)
* Reference types or references (as in C++ references)

I don't know any unambiguous terminology for these two terms, there may not
even be one (correct me if wrong). However, nowadays, most of the time when
one says "reference types" it means the first thing (me included). So please,
when you speak of the C++ references, please disambiguate the term (usually
called just "references", although this is still very close to the other
term).

As for your point: yes, this demonstrates a good usage scenario for a "C++
reference". Should there be such a feature in D? Or can this problem be
solved in an alternate way? Is the added complexity of this feature worth it?
I don't know, I cannot come to a conclusion. :/

--
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Mar 26 2006
This is just a perfect demonstration of the problem with D value types. A way
to solve this would be to extend the inout keyword to function return and
variables, such that [] would look like:

  inout T opIndex(size_t i)

Hong

In article <e0169e$25hq$1 digitaldaemon.com>, Sean Kelly says...
> Nick wrote:
>> In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
>>> Hello,
>>> The latest "Implicit fn template instantiation" thread contains code that
>>> contains a hidden and possibly hard to find bug.
>>> [...]
>> Good catch!
>
> Indeed. It seems roughly similar to how arrays are handled: in place
> modifications persist, but anything causing a realloc does not. But the
> original should never be rendered unusable.
>
>>> 2) Why even keep the builtin AA? With the addition of an opIn() overload,
>>> you could get *exactly* the same functionality from a library
>>> implementation
>> Nope, in fact, you couldn't. The kicker is that the builtin AA uses some
>> syntax that is currently _impossible_ to duplicate in custom types. Eg.
>> the following:
>
> Yup, not having a bona fide reference type in D is problematic. Though you
> could always use pointers :-p
>
>> I find it rather confusing and frustrating. In my current project I ended
>> up making my own template AA to get around some of the deficiencies of the
>> builtin type. Mostly I wanted to avoid double lookups, and easily allow
>> custom hashers without encapsulating the key in a struct/class (think
>> case-insensitive char[] keys.) The result was also faster and more memory
>> efficient than the builtin AA.
>
> This would also *almost* allow us to get rid of opCmp(Object) and
> opEquals(Object) in the Object base class, which would be very nice. I
> definately do like having built in AAs, but I do find some of the
> consequences irritating.
>
> Sean
Mar 28 2006
Nick wrote:
> In article <e00v0m$te2$3 digitaldaemon.com>, Oskar Linde says...
>> Hello,
>> The latest "Implicit fn template instantiation" thread contains code that
>> contains a hidden and possibly hard to find bug.
>> [...]
> Good catch!
>
>> 2) Why even keep the builtin AA? With the addition of an opIn() overload,
>> you could get *exactly* the same functionality from a library
>> implementation
> Nope, in fact, you couldn't. The kicker is that the builtin AA uses some
> syntax that is currently _impossible_ to duplicate in custom types. Eg. the
> following:

You are correct. I don't know how I could have forgotten that. D's lack of a
reference return type has implications on properties too.

> I find it rather confusing and frustrating. In my current project I ended
> up making my own template AA to get around some of the deficiencies of the
> builtin type. Mostly I wanted to avoid double lookups, and easily allow
> custom hashers without encapsulating the key in a struct/class (think
> case-insensitive char[] keys.) The result was also faster and more memory
> efficient than the builtin AA.

I have similar experiences implementing a custom templated hash-table.

/Oskar
Mar 24 2006
Oskar Linde wrote:
> Hello,
>
> The latest "Implicit fn template instantiation" thread contains code that
> contains a hidden and possibly hard to find bug. Rewriting the code with a
> non-templated version for clarity:
>
>   void increment(int[char[]] map, char[] key) {
>       if (auto t = (key in map))
>           (*t)++;
>       else
>           map[key] = 1;
>   }
>
> To understand the code, one first has to know whether the associative array
> is a value type or a reference type. It turns out it is neither.
>
> It is definitely not a value type. The above function will in most cases
> alter the original associative array and work as expected. But it is not a
> reference type either -- in the few cases where adding a new key to the AA
> invokes a rehash, all hell breaks loose. The original associative array is
> suddenly *corrupted*! (The way to make the function work as expected is to
> make the map an inout argument, but that is beside my point.)
>
> Internally, an AA is implemented as an aaA*[] -- a dynamic array of
> pointers to aaA nodes. The first node of the array contains metadata (the
> total number of keys), and the remaining [1..$] nodes contain the hash
> buckets. When passing an AA to a function, the aaA*[] gets passed.
>
> I assume the reason for this implementation is efficiency. All you need to
> do an efficient lookup is the array of buckets. But this implementation
> comes at a cost in robustness. To make use of AAs, you have to follow a
> special rule:
>
> - Never add an element to or rehash the AA unless you know you hold the
>   only reference.
>
> This means that even though the AA behaves as a reference type, the data
> can not be shared unless all references are read-only. I cannot help but
> see this bastardization of being neither value nor reference type as a
> design mistake.

Oskar, thanks for pointing that out. That was one issue that I hadn't noticed
yet, and I agree wholeheartedly that it is quite a design problem.

As a side note: I would prefer saying that these are more like broken
reference types, rather than saying they are not reference types, since that
may not be the best way to describe it (not being reference or value types is
more the nature of the problem of static arrays). Anyway, in any case your
description of the problem is quite clear.

Also, this illustrates another issue, which is the tricky reference semantics
of dynamic arrays. Dynamic arrays are reference types, but they have a
"catch", in that unlike all other reference types, there is an operation
other than assignment that changes the identity of the array. It is the
length change operation. But what is *really bad* is that it is not
"deterministic"[1] whether the resulting array from a length change will be a
brand new one or not. The identity (or perhaps better said, the contents) can
be all new, or partially the same.

[1] "deterministic" is not the 100% correct word (as it is more the opposite
of random), but I'm missing a better term.

> A simple fix is to change the AA implementation so that AAs become MyAA*
> instead of aaA*[], with MyAA defined as:
>
>   struct MyAA {
>       AANode[] buckets;
>       uint numberOfNodes;
>   }
>
> You get a very slight additional lookup cost (double dereferencing to get
> to the buckets), but make the AA a pure reference type without any caveats
> and problems.
>
> My questions are:
>
> 1) Why is the D built in array type designed in such a fragile way? Should
> D not be a robust language? Do users have to resort to writing library
> wrappers around D's types to get robustness? If so:

I'd like to know that too. One more problem to the list, and yet people are
already thinking about 1.0 and D slogans and whatnot. :/

--
Bruno Medeiros - CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
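The non-determinism Bruno describes can be seen in a small sketch like this
(whether the two slices still share memory after the grow depends entirely on
what the runtime decides to do):

  void lengthChange()
  {
      int[] a = new int[8];
      int[] b = a;
      b.length = 16;   // may extend in place, or may reallocate and copy
      b[0] = 42;
      // a[0] is 42 only if the grow happened in place; neither outcome can
      // be relied upon.
  }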
Mar 26 2006
"Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote in message news:e00v0m$te2$3 digitaldaemon.com...It turns out it is neither. It is definitely not a value type. The above function will in most cases alter the original associative array and work as expected. But it is not a reference type either -- in the few cases where adding a new key to the AA invokes a rehash, all hell breaks loose. The original associative array is suddenly *corrupted*! (The way to make the function work as expected is to make the map an inout argument, but that is beside my point.)I hadn't realized that could happen, and it clearly needs to be fixed.A simple fix is to change the AA implementation so that AAs become MyAA* instead of aaA*[], with MyAA defined as: struct MyAA { AANode[] buckets; uint numberOfNodes; } You get a very slight additional lookup cost (double dereferencing to get to the buckets), but make the AA a pure reference type without any caveats and problems.I'll make that change.
Mar 27 2006