digitalmars.D - Complexity in docs
- Tomás Rossi (11/11) Nov 18 2005 Is there somewhere in the docs something about the complexity of operati...
- Oskar Linde (7/12) Nov 18 2005 Associative arrays are implemented as hash-tables, i.e. O(1).
- Tomás Rossi (6/18) Nov 18 2005 Ok a hash table, don't know why didn't think about it before. Certainly ...
- Walter Bright (6/10) Nov 19 2005 indices
- Chris (5/10) Nov 18 2005 the implimentation for them can be found in dmd/src/phobos/internal.
- Tomás Rossi (4/14) Nov 18 2005 Ok, but why should I be looking at the implementation when searching for
- Chris (6/8) Nov 18 2005 I just mentioned it as an fyi. I do agree with you that complexity
- J C Calvarese (8/17) Nov 18 2005 Unfortunately, that wasn't the first time someone got yelled at for tryi...
- J C Calvarese (11/27) Nov 18 2005 If you're reading the docs and it occurs to you that some important info...
- Sean Kelly (12/22) Nov 18 2005 AA's are implemented as a chaining hash table with binary trees
- Manfred Nowak (21/25) Nov 18 2005 A clear maybe.
- Sean Kelly (7/16) Nov 18 2005 The number of rehashes/reallocations/whatever in any amortized constant
- Sean Kelly (4/9) Nov 18 2005 I take it back :p There is a constant bounding those functions. Should...
- Munchgreeble (26/35) Nov 19 2005 Hi,
- Manfred Nowak (13/16) Nov 19 2005 [...]
- Sean Kelly (16/23) Nov 19 2005 Actually, the C++ map is implemented as a balanced binary tree, which
- Manfred Nowak (21/22) Nov 19 2005 [...]
- Sean Kelly (8/25) Nov 19 2005 Oops, I was unclear. I meant that there is a constant bound for
- Manfred Nowak (12/15) Nov 19 2005 Using Hashing, the time for the access of an element can be bounded
- Sean Kelly (9/21) Nov 19 2005 I was thinking that if the size of the hash table doubles every time a
- Manfred Nowak (10/12) Nov 19 2005 [...]
- Sean Kelly (6/20) Nov 19 2005 I don't think so, though my math is a bit rusty in this area. My
- Lionello Lunesu (18/22) Nov 21 2005 OR when a bucket is (too) full? Is this true? I have no idea about the i...
- Oskar Linde (37/58) Nov 21 2005 Looking at the source (internal/aaA.d), the only place where an automati...
- Lionello Lunesu (28/59) Nov 22 2005 Hi,
- Walter Bright (7/11) Nov 19 2005 I've been using variations of that for 25 years. Every once in a while, ...
- Sean Kelly (7/20) Nov 19 2005 It does. And the memory cost of an additonal pointer per node isn't a
- Walter Bright (10/23) Nov 19 2005 I
- Manfred Nowak (6/7) Nov 19 2005 [...]
- Sean Kelly (6/14) Nov 19 2005 I think that can be considered a rebalancing so long as the insertions
- Manfred Nowak (8/10) Nov 20 2005 Correct. But this leads only to a randomly balanced tree and I doubt,
- Walter Bright (15/25) Nov 23 2005 Yes, that's true.
- Sean Kelly (5/12) Nov 23 2005 Word occurrence tests seem to be a pretty reasonable way to measure AA
Is there somewhere in the docs something about the complexity of operations? For example: what is the complexity of associative array operations? (I could guess it's implemented as a map (upon a tree) so it has map complexities, but that's a guess.) IMO (as Stroustrup did with the STL) it's a MUST to specify the complexity of non-trivial operations, and since D has complex (non-constant-time) features inside the language (e.g. associative arrays), it should stipulate the time complexity of operations (and space complexity as well, if relevant).
Regards
Tom
Nov 18 2005
In article <dll6iq$1n1u$1 digitaldaemon.com>, Tomás Rossi says...
> Is there somewhere in the docs something about the complexity of operations?
Not as far as I know.
> For example: What is the complexity for associative array operations? (I could guess it's implemented as a map (upon a tree) so it has map complexities but that's a guess).
Associative arrays are implemented as hash-tables, i.e. O(1). (IIRC, the hash table nodes are sorted for each bucket, so even if all indices hash to the same value, you get O(log n). But doing binary search is probably reducing the performance when you have a good hash function.)
/Oskar
Nov 18 2005
In article <dll7fq$1nn2$1 digitaldaemon.com>, Oskar Linde says...
> In article <dll6iq$1n1u$1 digitaldaemon.com>, Tomás Rossi says...
Ok, a hash table; I don't know why I didn't think of it before. Certainly I write without thinking first :). But still, my concern could be other people's concern, so it should be documented, as should other non-trivial features/operations (if there are more).
Tom
> > Is there somewhere in the docs something about the complexity of operations?
> Not as far as I know.
> > For example: What is the complexity for associative array operations? (I could guess it's implemented as a map (upon a tree) so it has map complexities but that's a guess).
> Associative arrays are implemented as hash-tables, i.e. O(1). (IIRC, the hash table nodes are sorted for each bucket, so even if all indices hash to the same value, you get O(log n). But doing binary search is probably reducing the performance when you have a good hash function.)
Nov 18 2005
"Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote in message news:dll7fq$1nn2$1 digitaldaemon.com...Associative arrays are implemented as hash-tables, i.e. O(1). (IIRC, the hash table nodes are sorted for each bucket, so even if allindiceshash to the same value, you get O(log n). But doing binary search isprobablyreducing the performance when you have a good hash function.)The binary tree part is only for dealing with collisions. The rest is O(1), and as you say, the better the hash function, the more O(1) it will be.
Nov 19 2005
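In outline, the scheme Walter and Oskar describe (hash straight to a bucket, fall back to ordered comparisons only on a collision) looks something like the sketch below. It is an illustration of the idea only, with invented names, not the internal/aaA.d source:

    // "Hash to a bucket, resolve collisions with a binary tree", in outline.
    // Illustrative only; not the actual internal/aaA.d implementation.
    struct Node
    {
        Node* left;
        Node* right;
        int   key;     // a concrete key type just for the example
        int   value;
    }

    Node* lookup(Node*[] buckets, int key, uint hash)
    {
        Node* n = buckets[hash % buckets.length];   // O(1) bucket selection

        // Only colliding entries are walked; with a good hash function this
        // loop runs at most a handful of times.
        while (n !is null)
        {
            if (key == n.key)
                return n;
            n = (key < n.key) ? n.left : n.right;   // ordering stands in for opCmp
        }
        return null;   // not found
    }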
On Fri, 18 Nov 2005 18:29:46 +0000 (UTC), Tomás Rossi <Tomás_member pathlink.com> wrote:
> Is there somewhere in the docs something about the complexity of operations? For example: What is the complexity for associative array operations? (I could guess it's implemented as a map (upon a tree) so it has map complexities but that's a guess).
The implementation for them can be found in dmd/src/phobos/internal. Like Oskar said, it's a hashtable.
Chris
Nov 18 2005
In article <8a9sn1553k8psuebi8ff1msi0inb4i4dcu 4ax.com>, Chris says...
> On Fri, 18 Nov 2005 18:29:46 +0000 (UTC), Tomás Rossi <Tomás_member pathlink.com> wrote:
Ok, but why should I be looking at the implementation when searching for complexities? Shouldn't this be documented (or am I getting it all wrong)?
Tom
> > Is there somewhere in the docs something about the complexity of operations? For example: What is the complexity for associative array operations? (I could guess it's implemented as a map (upon a tree) so it has map complexities but that's a guess).
> The implementation for them can be found in dmd/src/phobos/internal. Like Oskar said, it's a hashtable.
Nov 18 2005
On Fri, 18 Nov 2005 19:19:38 +0000 (UTC), Tomás Rossi <Tomás_member pathlink.com> wrote:
> Ok, but why should I be looking at the implementation when searching for complexities? Shouldn't this be documented (or am I getting it all wrong)?
I just mentioned it as an FYI. I do agree with you that complexity should be documented. It would give you a general idea of what to stay away from when writing high-performance code.
Chris
Nov 18 2005
In article <uhfsn1p10c1jan97n6lv7f8kd8eqp1t9k0 4ax.com>, Chris says...
> On Fri, 18 Nov 2005 19:19:38 +0000 (UTC), Tomás Rossi <Tomás_member pathlink.com> wrote:
Unfortunately, that wasn't the first time someone got yelled at for trying to help. ;)
> > Ok, but why should I be looking at the implementation when searching for complexities? Shouldn't this be documented (or am I getting it all wrong)?
> I just mentioned it as an FYI.
> I do agree with you that complexity should be documented. It would give you a general idea of what to stay away from when writing high-performance code.
> Chris
It's not in the official docs yet (and it'll probably be quite a while before it is), but it's already in the unofficial docs (due to the generous contributions from this newsgroup):
http://www.prowiki.org/wiki4d/wiki.cgi?DocComments/Arrays#ComplexityofAssociativeArrayOperations
jcc7
Nov 18 2005
In article <dll9ga$1pmq$1 digitaldaemon.com>, Tomás Rossi says...
> In article <8a9sn1553k8psuebi8ff1msi0inb4i4dcu 4ax.com>, Chris says...
If you're reading the docs and it occurs to you that some important information is lacking, you can click on the "Comments" link at the top of the page. In the wiki, you can add a comment about what you think needs to be added (whether it's something you know or something you'd like to know). Walter knows these wiki comment pages are there and he integrates new material from them from time to time.
Home | Search | D | Comments <---
Or you can ask here; maybe someone has already figured out where the answer is by studying the code of Phobos or the front end.
jcc7
> > On Fri, 18 Nov 2005 18:29:46 +0000 (UTC), Tomás Rossi <Tomás_member pathlink.com> wrote:
> Ok, but why should I be looking at the implementation when searching for complexities? Shouldn't this be documented (or am I getting it all wrong)?
> Tom
> > > Is there somewhere in the docs something about the complexity of operations? For example: What is the complexity for associative array operations? (I could guess it's implemented as a map (upon a tree) so it has map complexities but that's a guess).
> > The implementation for them can be found in dmd/src/phobos/internal. Like Oskar said, it's a hashtable.
Nov 18 2005
Tomás Rossi wrote:
> Is there somewhere in the docs something about the complexity of operations? For example: What is the complexity for associative array operations? (I could guess it's implemented as a map (upon a tree) so it has map complexities but that's a guess).
AA's are implemented as a chaining hash table with binary trees replacing the slists (thus the need for opCmp instead of opEquals for stored objects). So AA operations should be O(1). Personally, I don't understand the choice of btrees over slists, as it seems to expect that the hashing algorithm won't provide a good distribution. I'd rather lose the need for an extra pointer or two per node if possible. Though I suppose the tree could be implemented in an array to save memory as well.
> IMO (as Stroustrup did with the STL) it's a MUST to specify the complexity of non-trivial operations, and since D has complex (non-constant-time) features inside the language (e.g. associative arrays), it should stipulate the time complexity of operations (and space complexity as well, if relevant).
Agreed. Complexity constraints should be present in documentation whenever possible. This is just another thing that has been ignored in favor of more substantial updates.
Sean
Nov 18 2005
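Sean's remark about opCmp shows up directly when a class is used as an AA key: the runtime hashes the key via toHash() and orders colliding keys via opCmp(). A minimal sketch of such a key type follows; the method signatures are the D1-era Object ones (an assumption, since they shifted slightly between compiler releases):

    import std.string;   // cmp()

    // A user-defined AA key type. The built-in AA hashes the key with
    // toHash() and, on a bucket collision, orders keys with opCmp().
    class NameKey
    {
        char[] name;

        this(char[] n) { name = n; }

        override uint toHash()
        {
            // any reasonable string hash will do for the sketch
            uint h = 0;
            foreach (char c; name)
                h = h * 31 + c;
            return h;
        }

        override int opCmp(Object o)
        {
            NameKey rhs = cast(NameKey) o;
            if (rhs is null)
                return -1;
            return std.string.cmp(name, rhs.name);
        }
    }

    // usage sketch:
    // int[NameKey] table;
    // table[new NameKey("walter")] = 1;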
Sean Kelly wrote:
[...]
> So AA operations should be O(1).
A clear maybe.
> Personally, I don't understand the choice of btrees over slists as it seems to expect that the hashing algorithm won't provide a good distribution.
The hash function Walter uses is the address of the element inserted, which is as fast as can be, and bad in terms of distribution over the table. But although the distribution is bad, the probability is close to zero that any bin contains more than 16 elements. Although the unbalanced binary tree Walter has chosen for the resolution of collisions is bad as hell in theory, it performs very well under the constraints given.
Let a be the number of accesses to the n elements in the AA and let r be the number of rehashes of the AA. Then the runtime in total is O(a + r*n), the runtime for a single access is O(n) because an automatic rehash might occur, and the amortized runtime for a single access is O(1 + r/ae), where ae is the minimal number of accesses to an element in the AA; of course ae >= 1.
Therefore the time bound for an access to an element in the AA is amortized O(1) if and only if the number of rehashes of the AA can be bounded by a constant.
-manfred
Nov 18 2005
Manfred Nowak wrote:
> Sean Kelly wrote:
> [...]
Oops, I forgot the "amortized."
> > So AA operations should be O(1).
> A clear maybe.
> Therefore the time bound for an access to an element in the AA is amortized O(1) if and only if the number of rehashes of the AA can be bounded by a constant.
The number of rehashes/reallocations/whatever in any amortized constant time algorithm tends to be a (slowly growing) function of N (which isn't a constant), so by that rationale they can't be amortized O(1). What am I missing here?
Sean
Nov 18 2005
Sean Kelly wrote:
> The number of rehashes/reallocations/whatever in any amortized constant time algorithm tends to be a (slowly growing) function of N (which isn't a constant), so by that rationale they can't be amortized O(1). What am I missing here?
I take it back :p There is a constant bounding those functions. Should have given it some thought before posting.
Sean
Nov 18 2005
In article <dlmhh2$2o35$1 digitaldaemon.com>, Sean Kelly says...
> Sean Kelly wrote:
Hi,
I could be missing something here, but complexity information gives you an idea of how *scalable* an implementation is, not how *fast* it is. It's not uncommon for lower order solutions to a problem to be really slow and higher order solutions to be very quick - it just depends how big N is. I might have a great O(1) time algorithm for some application, but maybe it takes four days to run. If you have an O(N^2) algorithm for the same thing that runs in, let's say, 10ms + 2ms * N^2, N is going to have to get pretty big before you decide to switch over to the O(1) implementation. That's why you mostly only see the first order term given. The only point in going into more detail is to let you know about corner cases - it might just be that your particular application is going to push the wrong buttons and not end up with the typical performance (e.g. you might get O(N) instead of O(1)) - extra terms are not useful for comparing execution speeds between two algorithms.
If raw runtime execution speed is what you're interested in, you're just going to have to profile the code and see what works fastest for your application/target platform. Take a look at this performance comparison for example; it demonstrates that even for hashmaps, the performance varies quite widely among implementations:
http://www.tommti-systems.de/go.html?http://www.tommti-systems.de/main-Dateien/reviews/languages/benchmarks.html
Clearly they're all O(1) algorithms, but surprisingly the C++ implementation is the slowest even though it has the potential to be fastest.
Hope that's useful
Cheers
Munch
> > The number of rehashes/reallocations/whatever in any amortized constant time algorithm tends to be a (slowly growing) function of N (which isn't a constant), so by that rationale they can't be amortized O(1). What am I missing here?
> I take it back :p There is a constant bounding those functions. Should have given it some thought before posting.
> Sean
Nov 19 2005
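To put numbers on Munch's made-up example: a flat four-day O(1) algorithm against 10ms + 2ms * N^2 breaks even at roughly N = 13,000, so the quadratic algorithm wins for anything smaller. A throwaway D snippet using his illustrative figures (not real measurements):

    import std.math;    // sqrt()
    import std.stdio;   // writefln()

    void main()
    {
        double flatCost = 4.0 * 24 * 3600 * 1000;   // "four days", in ms
        // quadratic algorithm: 10ms + 2ms * N^2; solve for the break-even N
        double breakEven = sqrt((flatCost - 10) / 2);
        writefln("quadratic is faster up to N of about %.0f", breakEven);
        // prints roughly 13145
    }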
Munchgreeble wrote:
[...]
> If raw runtime execution speed is what you're interested in, you're just going to have to profile the code and see what works fastest for your application/target platform.
[...]
Very true, but if and only if you know that platform. This is not the case with a language that is intended to provide bare-metal access to the underlying hardware, where even the types of the hardware in general are unknown.
Therefore there might be only two choices:
1) restrict the target hardware, or
2) provide a mechanism to adapt the implementation of the critical language features to the hardware.
Number two is the way to go, but this feature is missing in D.
-manfred
Nov 19 2005
Munchgreeble wrote:
> Take a look at this performance comparison for example; it demonstrates that even for hashmaps, the performance varies quite widely among implementations:
> http://www.tommti-systems.de/go.html?http://www.tommti-systems.de/main-Dateien/reviews/languages/benchmarks.html
> Clearly they're all O(1) algorithms, but surprisingly the C++ implementation is the slowest even though it has the potential to be fastest.
Actually, the C++ map is implemented as a balanced binary tree, which has O(log N) find/insert performance. The hash-based associative containers won't be an official part of the standard until around 2009, when the next C++ standard is published. For what it's worth, I've written an implementation of std::tr1::unordered_map which beats the Dinkumware implementation of std::map quite soundly in memory usage, cache misses, and test execution speed given some average to large-sized datasets. I can't remember the exact proportion, but I think it was about twice as good for all 3 metrics, which would put it on par with the other TR1 implementations floating around; I expect them all to perform approximately as well as mine, given that I didn't do anything too terribly fancy (std::vector on top of a single std::list).
Sean
Nov 19 2005
Sean Kelly wrote:
[...]
> There is a constant bounding those functions.
[...]
I do not agree.
The amortized upper bound for a single access to an element in the AA is O(1 + r/ae) as shown. The number of rehashes is summed up from the number ra of automatic rehashes and the number rm of manual rehashes: r = ra + rm.
From the implementation no bound can be shown for rm. But even if it can be assumed that rm == 0, it can be deduced from the implementation that ra is lower bounded by Omega(log n).
From the implementation also no lower bound can be shown for ae except that an element must at least be inserted; therefore ae is lower bounded by Omega(1).
Because a quotient is upper bounded by the upper bound of its dividend divided by the lower bound of its divisor, the division ra/ae is upper bounded by O(log n).
To reach an amortized upper bound of O(1), the minimal number ae of accesses to an element must be shown to be lower bounded by Omega(log n), which is impossible from the implementation.
-manfred
Nov 19 2005
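For reference, Manfred's bookkeeping can be collected in one place (reading the number of automatic rehashes as Theta(log n), which follows from the geometric table growth; he states only the Omega direction):

    total runtime:     O(a + r*n),  with r = ra + rm,  ra = Theta(log n),  ae >= 1
    amortized access:  O(1 + r/ae) <= O(1 + (log n)/1) = O(log n)

So the amortized bound tightens to O(1) only if ae = Omega(log n), or if the rehash cost is charged differently, which is where the doubling argument in the following posts comes in.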
Manfred Nowak wrote:
> Sean Kelly wrote:
> [...]
Oops, I was unclear. I meant that there is a constant bound for amortized constant time functions in general. And while there are hashing algorithms that I'm fairly certain could be bounded, I don't know enough about the DMD implementation to say. My question was about the concept in general, not DMD's AAs in particular.
So what controls when a rehashing occurs in this implementation?
Sean
> > There is a constant bounding those functions.
> [...]
> I do not agree. The amortized upper bound for a single access to an element in the AA is O(1 + r/ae) as shown. The number of rehashes is summed up from the number ra of automatic rehashes and the number rm of manual rehashes: r = ra + rm. From the implementation no bound can be shown for rm. But even if it can be assumed that rm == 0, it can be deduced from the implementation that ra is lower bounded by Omega(log n).
Nov 19 2005
Sean Kelly wrote:
[...]
> And while there are hashing algorithms that I'm fairly certain could be bounded,
Using hashing, the time for the access of an element can be bounded by a constant if and only if the number of elements to store in the table is known beforehand. This seems to be true for every machine that will ever exist, because the number of atoms in the universe seems to be bounded by a constant as time goes by. But I do not believe that you meant that bound.
> So what controls when a rehashing occurs in this implementation?
An automatic rehash occurs whenever the number of elements exceeds the table size by a factor of four. Currently an address space of 2**32 is assumed, which results in at most 14 automatic rehashes.
-manfred
Nov 19 2005
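In code, the trigger Manfred describes comes down to a check like the one below after each insertion. This is a paraphrase of the described behaviour with invented names, not the literal internal/aaA.d source:

    // Sketch of the automatic rehash trigger described above.
    struct Node
    {
        Node* left;    // unbalanced binary tree used for collision resolution
        Node* right;
        // key, value and hash would live here
    }

    struct Table
    {
        Node*[] buckets;   // one tree root per slot
        size_t  nodes;     // total number of stored elements
    }

    // Grow the bucket array. A real implementation would also re-insert
    // every node into its new slot; that part is elided in this sketch.
    void grow(Table* t)
    {
        t.buckets.length = t.buckets.length * 2 + 1;
    }

    // Called after each insertion: keeps the average chain length at or
    // below four, matching the factor-of-four trigger described above.
    void maybeRehash(Table* t)
    {
        if (t.nodes > t.buckets.length * 4)
            grow(t);
    }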
Manfred Nowak wrote:
> Sean Kelly wrote:
> [...]
I was thinking that if the size of the hash table doubles every time a rehash occurs (i.e. if there are log N rehashes for a table of size N), then the frequency of rehashes should approach zero as N approaches infinity. Amortizing the cost of these rehashes over all N insertions seems like it should yield an amortized constant time insert, just as push_back operations on std::vector in C++ are said to occur in amortized constant time.
Sean
> > And while there are hashing algorithms that I'm fairly certain could be bounded,
> Using hashing, the time for the access of an element can be bounded by a constant if and only if the number of elements to store in the table is known beforehand. This seems to be true for every machine that will ever exist, because the number of atoms in the universe seems to be bounded by a constant as time goes by. But I do not believe that you meant that bound.
Nov 19 2005
Sean Kelly wrote:
[...]
> Amortizing the cost of these rehashes over all N insertions seems like it should yield an amortized constant time insert
[...]
Agreed, because I did not take into account that the size of the table changes at every rehash. Of course the sum over 2**-i (from i=1 to infinity) is bounded by a constant.
Sometimes facts generalized without thought turn out to be plain preconceptions :-(
-manfred
Nov 19 2005
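The doubling argument is easy to check numerically: if the table grows geometrically and every rehash touches every element, the total work over N insertions stays within a small constant factor of N. A quick D sketch (the starting size, growth factor, and load-factor-4 trigger are simplified assumptions):

    import std.stdio;   // writefln()

    void main()
    {
        uint N = 1_000_000;
        uint capacity = 16;     // assumed starting table size
        ulong rehashWork = 0;   // elements moved by all rehashes so far

        for (uint n = 1; n <= N; n++)
        {
            if (n > capacity * 4)    // the load-factor-4 trigger discussed above
            {
                rehashWork += n;     // a rehash re-inserts every element held so far
                capacity *= 2;       // assumed geometric growth
            }
        }

        // Total work is N plain inserts plus the rehash work; the ratio
        // settles near a small constant, which is the amortized O(1) claim.
        writefln("total work / N = %s", cast(double)(N + rehashWork) / N);
    }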
Manfred Nowak wrote:
> Sean Kelly wrote:
> [...]
I don't think so, though my math is a bit rusty in this area. My conclusions regarding all this were more from a general sense of the concepts than because I worked it out on paper.
> > Amortizing the cost of these rehashes over all N insertions seems like it should yield an amortized constant time insert
> [...]
> Agreed, because I did not take into account that the size of the table changes at every rehash. Of course the sum over 2**-i (from i=1 to infinity) is bounded by a constant.
> Sometimes facts generalized without thought turn out to be plain preconceptions :-(
Agreed.
Sean
Nov 19 2005
> An automatic rehash occurs whenever the number of elements exceeds the table size by a factor of four. Currently an address space of 2**32 is assumed, which results in at most 14 automatic rehashes.
> -manfred
OR when a bucket is (too) full? Is this true? I have no idea about the inner workings of D's hash-table. If so, what happens if after a rehash the bucket is still (too) full? Does it rehash again? Where can I find more info?
In my hash-table I use a single array of slots (always 2^n) and keep track of a "distance", which is the highest distance from an element to its expected slot (if an element wants to be on [0] but [0..3] is full, it will end up on [3] and the distance is 3). Isn't this safer and more efficient* than a fixed bucket size of 16?
* space efficient: I don't need to allocate buckets but simply use empty slots in the array
* time efficient: I have direct control over the "distance" and can force a rehash if it gets too high.
(Interestingly, it happens that after a rehash the "distance" turns out higher than before! Since the hash-function makes no guarantees, I'd guess it's just a case of bad luck, as opposed to buggy code.)
I probably got a wrong idea about D's AA. I'll wait for more info : )
Lionello.
Nov 21 2005
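Lionello's scheme is open addressing with linear probing plus a recorded worst-case probe distance. A minimal sketch of the insert path under the assumptions he states (power-of-two slot array; growth and duplicate keys left out; names invented):

    struct Slot
    {
        bool used;
        uint key;    // plain uint keys keep the sketch short
        int  value;
    }

    struct ProbeTable
    {
        Slot[] slots;      // length is always a power of two
        size_t distance;   // worst displacement seen so far

        void insert(uint key, int value, uint hash)
        {
            size_t mask = slots.length - 1;
            size_t want = hash & mask;          // the slot the element "wants"

            for (size_t d = 0; d < slots.length; d++)
            {
                size_t i = (want + d) & mask;   // probe forward, wrapping around
                if (!slots[i].used)
                {
                    slots[i].used = true;
                    slots[i].key = key;
                    slots[i].value = value;
                    if (d > distance)
                        distance = d;           // track the worst displacement
                    // a real table would force a rehash here once 'distance'
                    // (or the fill ratio) passes some threshold
                    return;
                }
                // duplicate keys are not handled in this sketch
            }
            // table full: a real implementation would grow and rehash
        }
    }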
Hi,
In article <dlske4$2k2d$1 digitaldaemon.com>, Lionello Lunesu says...
Looking at the source (internal/aaA.d), the only place where an automatic rehash gets done is when the average bucket contains more than 4 elements.
> > An automatic rehash occurs whenever the number of elements exceeds the table size by a factor of four. Currently an address space of 2**32 is assumed, which results in at most 14 automatic rehashes.
> > -manfred
> OR when a bucket is (too) full? Is this true? I have no idea about the inner workings of D's hash-table.
> If so, what happens if after a rehash the bucket is still (too) full? Does it rehash again? Where can I find more info? In my hash-table I use a single array of slots (always 2^n) and keep track of a "distance", which is the highest distance from an element to its expected slot (if an element wants to be on [0] but [0..3] is full, it will end up on [3] and the distance is 3). Isn't this safer and more efficient* than a fixed bucket size of 16?
What do you mean by a fixed bucket size of 16?
> * space efficient: I don't need to allocate buckets but simply use empty slots in the array
Yes, this is obviously more space efficient storage if your highest allowed distance is high enough*. No additional pointers or other (non-constant) overhead, if I understand you correctly.
*) This is a time/space trade-off. In order to be as fast as a traditional hash table, you may need to make the table so much larger that the memory requirements are even higher than for the traditional implementation. You are also much more reliant on a good hash function. It would be very interesting to see time/space comparisons of such a hash table to a more traditional one for a number of real life cases.
> * time efficient: I have direct control over the "distance" and can force a rehash if it gets too high.
I'm hardly an expert on hash tables, but in my experience, the only way to tell if it is faster for a particular application is to compare it to the alternative implementation. Some points though:
- You are more sensitive to collisions, since you have not only the collisions within a bucket, but also collisions with nearby buckets.
- You have no guarantee that a rehashing (and resizing) will improve the maximum distance. To get around this, you will need an additional condition that limits rehashing to, for instance, when the hash table size < a*n. But adding such a condition will remove your guarantee of a maximum highest distance.
- If m is your maximum distance, lookup is O(m) vs O(log m) for a hash table that keeps balanced trees for each bucket.
- Deletes are more expensive and need to move more data.
In conclusion, I think (guess) you are more sensitive to the performance of the hash function. I would be very interested in seeing a comparison with other methods on realistic data.
I think D's hash table implementation is safer for general purpose use, because it gracefully degrades to O(log n) when the hash function is poor. It also minimizes the number of calls to comparison and hash functions, which may, depending on the implementations, be a good thing.
> (Interestingly, it happens that after a rehash the "distance" turns out higher than before! Since the hash-function makes no guarantees, I'd guess it's just a case of bad luck, as opposed to buggy code.)
Yes, I would assume such cases could happen.
Regards,
/Oskar
Nov 21 2005
Hi,
I found this: http://en.wikipedia.org/wiki/Hashtable#Chaining describes a hash-table with balanced trees, like the one used by D. For everyone's information.
> What do you mean by a fixed bucket size of 16?
Sorry, I'd understood from some other post that the bucket's size is fixed. But indeed, it makes no sense to fix it if it's a tree.
Exactly. The implementation is extremely simple. That's what I love about it. But yes, it's O(n) if you have many collisions.
> > * space efficient: I don't need to allocate buckets but simply use empty slots in the array
> Yes, this is obviously more space efficient storage if your highest allowed distance is high enough*. No additional pointers or other (non-constant) overhead, if I understand you correctly.
> - You are more sensitive to collisions, since you have not only the collisions within a bucket, but also collisions with nearby buckets.
This is true. I've tested a couple of different hash functions, and the better hash-function nearly always wins (even if it's slower). For example, for a string-map I've compared CRC32 against Java's string-hash (hash=7; foreach(c) hash=hash*31+c;) and the CRC32 won, even though it's much slower. The entropy of Java's string-hash was improved by multiplying with some magic number (sqrt(5)-1, read about it somewhere) and that's the one I'm using ATM.
> - You have no guarantee that a rehashing (and resizing) will improve the maximum distance. To get around this, you will need an additional condition that limits rehashing to, for instance, when the hash table size < a*n. But adding such a condition will remove your guarantee of a maximum highest distance.
This is true. However, in practice the distance never exceeds 8 (for <50% occupation of the hash-table), so the impact of the linear search is kept to a minimum.
> - If m is your maximum distance, lookup is O(m) vs O(log m) for a hash table that keeps balanced trees for each bucket.
Yeah, you're right. Indeed, only a specific example can compare two hash implementations.
> - Deletes are more expensive and need to move more data.
Why exactly are deletes more expensive? The operation is the same as with insertion: jump to the (hash) position, walk some (up to 'distance'), delete the (key,value) pair (if any). The downside is that a look-up for a key not in the table always walks up to the max 'distance'.
> In conclusion, I think (guess) you are more sensitive to the performance of the hash function. I would be very interested in seeing a comparison with other methods on realistic data.
That sounds like a nice D exercise!
> I think D's hash table implementation is safer for general purpose use, because it gracefully degrades to O(log n) when the hash function is poor. It also minimizes the number of calls to comparison and hash functions, which may, depending on the implementations, be a good thing.
Indeed, I can imagine it's faster for general cases, but it's certainly more complex (the code, I mean). Good thing it's built-in then ; )
L.
Nov 22 2005
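The string hash Lionello describes, with a multiplicative step to improve the entropy, might look roughly like this in D. His exact constant and how he uses the mixed bits aren't given, so the golden-ratio value floor(2^32 * (sqrt(5)-1)/2) = 2654435769 and the top-bits indexing below are assumptions:

    // Java-style string hash followed by a multiplicative mixing step,
    // in the spirit of the description above. Illustrative only.
    uint stringHash(char[] s)
    {
        uint h = 7;
        foreach (char c; s)
            h = h * 31 + c;
        return h * 2654435769u;   // Fibonacci/multiplicative hashing step
    }

    // With multiplicative hashing the top bits are the well-mixed ones, so a
    // power-of-two table would typically index with a right shift:
    uint slotFor(char[] s, uint log2TableSize)
    {
        return stringHash(s) >> (32 - log2TableSize);
    }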
"Manfred Nowak" <svv1999 hotmail.com> wrote in message news:Xns9713199CF924Esvv1999hotmailcom 63.105.9.61...Although the unbalanced binary tree Walter has choosen for the resolution of collisions is bad as hell in theory it performs very well under the constraints given.I've been using variations of that for 25 years. Every once in a while, I try out some new table method that is theoretically supposed to be faster. They always turn out to be slower. So back to the binary tree! I've also found it scales very well, from a few entries to tens of thousands of entries.
Nov 19 2005
Walter Bright wrote:"Manfred Nowak" <svv1999 hotmail.com> wrote in message news:Xns9713199CF924Esvv1999hotmailcom 63.105.9.61...It does. And the memory cost of an additonal pointer per node isn't a big deal, though it may be an issue for some extreme cases. Out of curiosity, is the binary tree implementation balanced in any way, or is it a plain old btree? Given the relatively small expected chain size I'd expect the latter, but I figured I'd ask anyway. SeanAlthough the unbalanced binary tree Walter has choosen for the resolution of collisions is bad as hell in theory it performs very well under the constraints given.I've been using variations of that for 25 years. Every once in a while, I try out some new table method that is theoretically supposed to be faster. They always turn out to be slower. So back to the binary tree! I've also found it scales very well, from a few entries to tens of thousands of entries.
Nov 19 2005
"Sean Kelly" <sean f4.ca> wrote in message news:dloh13$1glb$1 digitaldaemon.com...Walter Bright wrote:II've been using variations of that for 25 years. Every once in a while,faster.try out some new table method that is theoretically supposed to bethousandsThey always turn out to be slower. So back to the binary tree! I've also found it scales very well, from a few entries to tens ofD AA's are general purpose, which means they are meant to do a decent job for a wide range of applications. For very specific or extreme cases, it's likely you can do better with a custom implementation.of entries.It does. And the memory cost of an additonal pointer per node isn't a big deal, though it may be an issue for some extreme cases.Out of curiosity, is the binary tree implementation balanced in any way, or is it a plain old btree? Given the relatively small expected chain size I'd expect the latter, but I figured I'd ask anyway.It gets rebalanced when a rehash happens. Rebalancing it on the fly is one of those theoretical improvements that in practice makes things much worse.
Nov 19 2005
Walter Bright wrote:
[...]
> It gets rebalanced when a rehash happens.
[...]
I do not see any balancing in the implementation. It's plain inserting.
-manfred
Nov 19 2005
Manfred Nowak wrote:
> Walter Bright wrote:
> [...]
I think that can be considered a rebalancing so long as the insertions don't occur in the same order as they did originally. Also, the contents of each tree may change on a rehash, since some elements may end up in different buckets, correct?
Sean
> > It gets rebalanced when a rehash happens.
> [...]
> I do not see any balancing in the implementation. It's plain inserting.
Nov 19 2005
Sean Kelly wrote:
[...]
> Also, the contents of each tree may change on a rehash, since some elements may end up in different buckets, correct?
Correct. But this leads only to a randomly balanced tree, and I doubt that the factor of 3 for the space requirement of the tree against the space requirement of a 16-element array is justified by the time gain.
However, I assume that Walter has already tested that.
-manfred
Nov 20 2005
"Manfred Nowak" <svv1999 hotmail.com> wrote in message news:Xns9714624647748svv1999hotmailcom 63.105.9.61...Sean Kelly wrote: [...]Yes, that's true.Also, the contents of each tree may change on a rehash, since some elements may end up in different buckets, correct?Correct. But this leads only to a randomly balanced treeand I doubt, that the factor of 3 for the space requirement of the tree against the space requirement of a 16 element array is justified by the time gain.AA's are usually used when performance matters more than memory consumption. Regular arrays are used where memory consumption is paramount. Thus, it is appropriate to trade memory for speed in AA algorithm design.However, I assume, that Walter has already tested that.It's easy to contrive test cases designed to make any particular scheme look good and any other particular scheme look bad. The current AA scheme has worked well for me for many years on all kinds of data. For example, it's used in DMDScript, and the performance of DMDScript is heavilly dependent on the AA algorithm, as all objects in Javascript are AA's. DMDScript is, by a factor of 2, the fastest Javascript interpreter available. The AA algorithm is all contained in internal\aaA.d. None of the rest of the language or library has any dependency on it. This means anyone can plug in another algorithm and give it a try.
Nov 23 2005
Walter Bright wrote:
> It's easy to contrive test cases designed to make any particular scheme look good and any other particular scheme look bad. The current AA scheme has worked well for me for many years on all kinds of data. For example, it's used in DMDScript, and the performance of DMDScript is heavily dependent on the AA algorithm, as all objects in Javascript are AA's. DMDScript is, by a factor of 2, the fastest Javascript interpreter available.
Word occurrence tests seem to be a pretty reasonable way to measure AA performance. Out of curiosity, I think I'll extend my C++ test to the D AA and see how it compares.
Sean
Nov 23 2005
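A word occurrence test of the kind Sean mentions is only a few lines with the built-in AA; something along these lines, using 2005-era Phobos calls (the file name is a placeholder):

    import std.file;    // read()
    import std.stdio;   // writefln()
    import std.string;  // split()

    void main()
    {
        // Count word occurrences in some input text.
        char[] text = cast(char[]) std.file.read("input.txt");

        uint[char[]] counts;   // the built-in AA being exercised
        foreach (char[] word; std.string.split(text))
        {
            if (word in counts)
                counts[word]++;
            else
                counts[word] = 1;
        }

        writefln("%d distinct words", counts.length);
    }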