## digitalmars.D - An interesting data structure with search time O(sqrt n)

- Andrei Alexandrescu (37/37) Nov 30 2015 Okasaki's book is a continued inspiration of data structures and algorit...
- H. S. Teoh via Digitalmars-d (8/18) Nov 30 2015 Interesting indeed.
- Andrei Alexandrescu (31/46) Nov 30 2015 (Heaps offer only linear search time. You may take advantage of the heap...
- H. S. Teoh via Digitalmars-d (20/54) Nov 30 2015 I'm not so sure about that. At the micro level, yes, following pointers
- Timon Gehr (3/5) Nov 30 2015 How do you build in O(n)? (The initial array is assumed to be completely...
- Timon Gehr (3/8) Nov 30 2015 (I meant to say: There aren't any assumptions on the initial ordering of...
- Andrei Alexandrescu (5/14) Nov 30 2015 That's quite challenging. (My O(n) estimate was off the cuff and
- Andrei Alexandrescu (44/59) Dec 01 2015 OK, I think I have an answer to this in the form of an efficient algorit...
- Navin (20/91) Dec 01 2015 Nice to see this interesting post and learn.
- Chris Wright (77/98) Dec 02 2015 You can usually change an offline datastructure into an amortized online...
- Timon Gehr (40/101) Dec 02 2015 Assume the initial array is sorted from largest to smallest. This will
- Timon Gehr (3/7) Dec 02 2015 It should actually be (⌊i/2⌋-1)² here, but this does not change the...
- Timon Gehr (2/10) Dec 02 2015 Oops. No, it was actually right. Sorry for the noise. :-)
- Andrei Alexandrescu (2/4) Dec 02 2015 Galloping search ftw (good locality, log complexity). -- Andrei
- Andrei Alexandrescu (16/20) Dec 03 2015 Thanks. My math-fu is not good enough to properly follow the argument,
- Timon Gehr (70/92) Dec 03 2015 There was actually an (asymptotically inconsequential) error in the
- Jason Causey (46/49) Dec 10 2015 Hello all! I happened by this thread (from Hacker News) and the
- Dmitry Olshansky (8/36) Nov 30 2015 Reminds me of Van Emde Boas layout which is however fractal in nature:
- Andrei Alexandrescu (2/7) Nov 30 2015 Thanks, I'll look these up! -- Andrei
- Torin (6/13) Nov 30 2015 Isn't the size of each heap a little larger than O(sqrt n)? The
- Titus Nicolae (14/16) Nov 30 2015 On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu
- Torin (6/22) Nov 30 2015 I think Titus and I noticed the same thing at the same time.
- Titus Nicolae (6/22) Nov 30 2015 This might give a better balanced structure.
- Andrei Alexandrescu (2/23) Nov 30 2015 This is very interesting. Thanks, and thanks! -- Andrei
- Andrei Alexandrescu (3/17) Nov 30 2015 Yes, thanks! I just wrote about this on reddit (before having seen
- Timon Gehr (29/31) Nov 30 2015 (Assuming the bucket sizes actually grow linearly.) It seems to me that
- Sriram Srinivasan (2/2) Nov 30 2015 The key search phrase is "cache oblivious data structures". One
- Emil Kirschner (5/16) Dec 01 2015 Interesting, but isn't that basically a skip list with uneven
- Andrei Alexandrescu (2/18) Dec 01 2015 Doesn't look like a skip list to me. -- Andrei
- Marko Mikulicic (9/12) Dec 01 2015 Hi,
- Andrei Alexandrescu (2/11) Dec 01 2015 Noted, thanks! -- Andrei
- rsw0x (5/8) Dec 01 2015 Sort of reminds me of a modified Hashed Array Tree — not keen on

Okasaki's book is a continued inspiration of data structures and algorithms. I was thinking along the following lines: typical collections are searchable in linear time. Then we have highly structured collections that feature logarithmic search time. But there seems to be nothing in between. So I was thinking, what would be a data structure that allows O(sqrt n) search times? After a little tinkering, here's what I came up with. Start with a simple array of data. Then mentally decompose that array into a concatenation of smaller arrays: first has size 1, second has size 4, third has size 9, then 16, 25, 36, ... Generally the size of these imaginary subarrays grows quadratically. And they're adjacent to each other. The last array may be incomplete. Example: we decompose an array of size 35 into: an array of size 1 followed by arrays of sizes 4, 9, 16, 5 (last being an incomplete fragment of an array of size 25). Now each of these arrays we organize as a max heap. Moreover, we arrange data such that the maximums of these heaps are in INCREASING order. That means the smallest element of the entire (initial) array is at the first position, then followed by the next 4 smallest organized in a max heap, followed by the next 9 smallest organized in a max heap, ... etc. There are a total of O(sqrt n) heaps, and each has O(sqrt n) elements. That's the layout! Now, to search we look at one heap at a time. Whenever the maximum element (first element in the subarray) is smaller than the searched value, we skip over that entire heap and go to the next one. In the worst case we'll skip O(sqrt n) heaps. When the max value in a heap is less than the searched element, we found the heap and we run a linear search among O(sqrt n) elements. The structure is nice for sorting and in particular parallel sorting because each subarray can be sorted independently - there's no migration into or out of each subarray. Inside each subarray, of course heapsort would be a good choice because it takes advantage of the existing max heap. I haven't worked out the math for insertion and deletion yet, but they seem to also be around O(sqrt n) or O(log(n) * sqrt(n)). Just wanted to share with you and discuss what seems to be an interesting data structure. Please share any thoughts! Andrei

Nov 30 2015

On Mon, Nov 30, 2015 at 03:13:11PM -0500, Andrei Alexandrescu via Digitalmars-d wrote:Okasaki's book is a continued inspiration of data structures and algorithms. I was thinking along the following lines: typical collections are searchable in linear time. Then we have highly structured collections that feature logarithmic search time. But there seems to be nothing in between. So I was thinking, what would be a data structure that allows O(sqrt n) search times? After a little tinkering, here's what I came up with.Interesting indeed. It leaves me wondering, though, what's the point of having an O(sqrt n) collection? What are the advantages? Why would I use this structure instead of, say, a traditional array heap with O(log n) search time? T -- Why are you blatanly misspelling "blatant"? -- Branden Robinson

Nov 30 2015

On 11/30/15 3:20 PM, H. S. Teoh via Digitalmars-d wrote:On Mon, Nov 30, 2015 at 03:13:11PM -0500, Andrei Alexandrescu via Digitalmars-d wrote:(Heaps offer only linear search time. You may take advantage of the heap structure to skip portions of the array, but on average and in the worst case the search is still O(n). So I assume you meant "sorted array or one of the classic search trees".) The motivation starts with a desire to use arrays as the fundamental layout. There have been many trends in computing as of late, among which: cache hierarchies are here to stay and contiguous layout is preferable. The short of it is, arrays are king. No two ways about it - following pointers is a losing strategy when there's an alternative. A variety of new data structures (Clojure's arrays, heaps with tombstones) avoid classic pointer-based data structures in favor of adding structure on top of arrays. So now if we consider thinking, "how do we organize an array for good search performance" we have a spectrum. Generally we also care about insertion and removal. At one end of the spectrum there's doing nothing at all - that means O(1) build (nothing to do), O(n) search, O(1) insert (just append it), O(n) removal. Not very nice. At the other end, the absolute best structuring on top of an array for search is sorting. With sorting you get great O(log n) search performance. But the others are not nice - O(n log n) build, O(n) add, O(n) remove. So now consider my square heaps. We have O(n) build time (just a bunch of heapifications) and O(sqrt n) search. Then (again I haven't worked out the math yet) let's assume insertion and removal are both O(sqrt n). Then you get something less structured than full sorting, but also just good enough to offer the same complexity for each of search, insert, and delete. That would be pretty neat. AndreiOkasaki's book is a continued inspiration of data structures and algorithms. I was thinking along the following lines: typical collections are searchable in linear time. Then we have highly structured collections that feature logarithmic search time. But there seems to be nothing in between. So I was thinking, what would be a data structure that allows O(sqrt n) search times? After a little tinkering, here's what I came up with.Interesting indeed. It leaves me wondering, though, what's the point of having an O(sqrt n) collection? What are the advantages? Why would I use this structure instead of, say, a traditional array heap with O(log n) search time?

Nov 30 2015

On Mon, Nov 30, 2015 at 03:57:24PM -0500, Andrei Alexandrescu via Digitalmars-d wrote:On 11/30/15 3:20 PM, H. S. Teoh via Digitalmars-d wrote:Right, I wrote that without thinking. :-POn Mon, Nov 30, 2015 at 03:13:11PM -0500, Andrei Alexandrescu via Digitalmars-d wrote:(Heaps offer only linear search time. You may take advantage of the heap structure to skip portions of the array, but on average and in the worst case the search is still O(n). So I assume you meant "sorted array or one of the classic search trees".)Okasaki's book is a continued inspiration of data structures and algorithms. I was thinking along the following lines: typical collections are searchable in linear time. Then we have highly structured collections that feature logarithmic search time. But there seems to be nothing in between. So I was thinking, what would be a data structure that allows O(sqrt n) search times? After a little tinkering, here's what I came up with.Interesting indeed. It leaves me wondering, though, what's the point of having an O(sqrt n) collection? What are the advantages? Why would I use this structure instead of, say, a traditional array heap with O(log n) search time?The motivation starts with a desire to use arrays as the fundamental layout. There have been many trends in computing as of late, among which: cache hierarchies are here to stay and contiguous layout is preferable. The short of it is, arrays are king. No two ways about it - following pointers is a losing strategy when there's an alternative. A variety of new data structures (Clojure's arrays, heaps with tombstones) avoid classic pointer-based data structures in favor of adding structure on top of arrays.I'm not so sure about that. At the micro level, yes, following pointers for a tree / graph / etc., that could easily fit in a small/medium sized array is a losing strategy, due to cache misses, hardware prefetchers, etc.. When you're dealing with larger amounts of data, though, things become less clear-cut. If your array size is on the order of MB's or GB's, pointers start looking a lot more attractive. IMO in the long run what will win is data structures that use tree or pointer based implementations in the large scale, but built on cache-friendly array-based structures below a certain level of granularity. I agree with you, though, that array-based structures of intermediate big-O complexities are very interesting. If you can come up with an array structure that has the same complexity for all common operations, that could be useful as a quick-fix drop in solution in cases where top performance isn't required, but you'd like something better than O(n) array scanning. T -- Philosophy: how to make a career out of daydreaming.

Nov 30 2015

On 11/30/2015 09:57 PM, Andrei Alexandrescu wrote:So now consider my square heaps. We have O(n) build time (just a bunch of heapifications) and O(sqrt n) search.How do you build in O(n)? (The initial array is assumed to be completely unordered, afaict.)

Nov 30 2015

On 12/01/2015 03:33 AM, Timon Gehr wrote:On 11/30/2015 09:57 PM, Andrei Alexandrescu wrote:(I meant to say: There aren't any assumptions on the initial ordering of the array elements.)So now consider my square heaps. We have O(n) build time (just a bunch of heapifications) and O(sqrt n) search.How do you build in O(n)? (The initial array is assumed to be completely unordered, afaict.)

Nov 30 2015

On 11/30/15 9:47 PM, Timon Gehr wrote:On 12/01/2015 03:33 AM, Timon Gehr wrote:That's quite challenging. (My O(n) estimate was off the cuff and possibly wrong.) Creating the structure entails simultaneously solving the selection problem (find the k smallest elements) for several values of k. I'll post here if I come up with something. -- AndreiOn 11/30/2015 09:57 PM, Andrei Alexandrescu wrote:(I meant to say: There aren't any assumptions on the initial ordering of the array elements.)So now consider my square heaps. We have O(n) build time (just a bunch of heapifications) and O(sqrt n) search.How do you build in O(n)? (The initial array is assumed to be completely unordered, afaict.)

Nov 30 2015

On 12/01/2015 12:13 AM, Andrei Alexandrescu wrote:On 11/30/15 9:47 PM, Timon Gehr wrote:OK, I think I have an answer to this in the form of an efficient algorithm. First off: sizes 1+3+5+7+... seem a great choice, I'll use that for the initial implementation (thanks Titus!). Second: the whole max heap is a red herring - min heap is just as good, and in fact better. When doing the search just overshoot by one then go back one heap to the left and do the final linear search in there. So the structure we're looking at is an array of adjacent min-heaps of sizes 1, 3, 5, etc. The heaps are ordered (the maximum of heap k is less than or equal to the minimum of heap k+1). Question is how do we build such an array of heaps in place starting from an unstructured array of size n. One simple approach is to just sort the array in O(n log n). This satisfies all properties - all adjacent subsequences are obviously ordered, and any subsequence has the min heap property. As an engineering approach we may as well stop here - sorting is a widely studied and well implemented algorithm. However, we hope to get away with less work because we don't quite need full sorting. Here's the intuition: the collection of heaps can be seen as one large heap that has a DAG structure (as opposed to a tree). In the DAG, the root of heap k+1 is the child of all leaves of heap k (see http://imgur.com/I366GYS which shows the DAG for the 1, 3, 7, and 7 heaps). Clearly getting this structure to respect the heap property is all that's needed for everything to work - so we simply apply the classic heapify algorithm to it. It seems it can be applied almost unchanged - starting from the end, sift each element down the DAG. This looks efficient and minimal; I doubt there's any redundant work. However, getting bounds for complexity of this will be tricky. Classic heapify is tricky, too - it seems to have complexity O(n log n) but in fact has complexity O(n) - see nice discussion at http://stackoverflow.com/questions/9755721/how-can-building-a-heap-be on-time-complexity. When applying heapify to the DAG, there's more restrictions and the paths are longer, so a sliver more than O(n) is expected. Anyway, this looks ready for a preliminary implementation and some more serious calculations. One more interesting thing: the heap heads are sorted, so when searching, the heap corresponding to the searched item can be found using binary search. That makes that part of the search essentially negligible - the lion's share will be the linear search on the last mile. In turn, that suggests that more heaps that are smaller would be a better choice. (At an extreme, if we have an array of heaps each proportional to log(n), then we get search time O(log n) even though the array is not entirely sorted.) AndreiOn 12/01/2015 03:33 AM, Timon Gehr wrote:That's quite challenging. (My O(n) estimate was off the cuff and possibly wrong.) Creating the structure entails simultaneously solving the selection problem (find the k smallest elements) for several values of k. I'll post here if I come up with something. -- Andrei

Dec 01 2015

On Tuesday, 1 December 2015 at 22:48:56 UTC, Andrei Alexandrescu wrote:On 12/01/2015 12:13 AM, Andrei Alexandrescu wrote:Nice to see this interesting post and learn. I have a few questions. 1) This is offline datastructure since you don't know how the elements of the future are going to be ie dynamic. ie later elements from n to 2n can break or change your heaps as such in worst case or is it a dynamic data structure ? 2) Searching in min or max heaps is bad isn't it ? Lets say we choose max heaps. Now we have the root as 10^9 in the second last heap ie around n^2 elements. The children of it are 4*10^8 and 5*10^8 . If i'm searching for say 4.5 *10^8 my job is easy but if i'm searching for 1000, i have to search in both the subtrees and it goes linear and becomes around n^2 in the worst case. Did i overlook anything ? Instead of heaps, a single sorted array or breaking them into a series of sorted arrays ie skip lists kind of stuff would be fine if it just want a offline Data structure ? or is this some domain specific data Structure where you only/cre want max/min in some sequence ?On 11/30/15 9:47 PM, Timon Gehr wrote:OK, I think I have an answer to this in the form of an efficient algorithm. First off: sizes 1+3+5+7+... seem a great choice, I'll use that for the initial implementation (thanks Titus!). Second: the whole max heap is a red herring - min heap is just as good, and in fact better. When doing the search just overshoot by one then go back one heap to the left and do the final linear search in there. So the structure we're looking at is an array of adjacent min-heaps of sizes 1, 3, 5, etc. The heaps are ordered (the maximum of heap k is less than or equal to the minimum of heap k+1). Question is how do we build such an array of heaps in place starting from an unstructured array of size n. One simple approach is to just sort the array in O(n log n). This satisfies all properties - all adjacent subsequences are obviously ordered, and any subsequence has the min heap property. As an engineering approach we may as well stop here - sorting is a widely studied and well implemented algorithm. However, we hope to get away with less work because we don't quite need full sorting. Here's the intuition: the collection of heaps can be seen as one large heap that has a DAG structure (as opposed to a tree). In the DAG, the root of heap k+1 is the child of all leaves of heap k (see http://imgur.com/I366GYS which shows the DAG for the 1, 3, 7, and 7 heaps). Clearly getting this structure to respect the heap property is all that's needed for everything to work - so we simply apply the classic heapify algorithm to it. It seems it can be applied almost unchanged - starting from the end, sift each element down the DAG. This looks efficient and minimal; I doubt there's any redundant work. However, getting bounds for complexity of this will be tricky. Classic heapify is tricky, too - it seems to have complexity O(n log n) but in fact has complexity O(n) - see nice discussion at http://stackoverflow.com/questions/9755721/how-can-building-a-heap-be on-time-complexity. When applying heapify to the DAG, there's more restrictions and the paths are longer, so a sliver more than O(n) is expected. Anyway, this looks ready for a preliminary implementation and some more serious calculations. One more interesting thing: the heap heads are sorted, so when searching, the heap corresponding to the searched item can be found using binary search. That makes that part of the search essentially negligible - the lion's share will be the linear search on the last mile. In turn, that suggests that more heaps that are smaller would be a better choice. (At an extreme, if we have an array of heaps each proportional to log(n), then we get search time O(log n) even though the array is not entirely sorted.) Andrei

Dec 01 2015

On Wed, 02 Dec 2015 05:58:24 +0000, Navin wrote:Nice to see this interesting post and learn. I have a few questions. 1) This is offline datastructure since you don't know how the elements of the future are going to be ie dynamic. ie later elements from n to 2n can break or change your heaps as such in worst case or is it a dynamic data structure ?You can usually change an offline datastructure into an amortized online one. (I think.) You can usually change an amortized online algorithm into a deamortized one with a series of ugly hacks and large increases in constants. I think you can make insertions work and not be too terrible with an amortized algorithm. Something like: To insert a value X into the square heap, find the largest heap with a head equal to X. If there is none, find the smallest heap with a head greater than X. If there is none, choose the final heap. If that heap is not full, insert X into it. If it is full, trigger a rebalance. 'Rebalance' is an operation we use to ensure that each intermediate heap is exactly half full (rounding down) and the final heap is no more than half full. We start at a given heap. We calculate its surplus -- the number of elements it has in excess of half its capacity. We scan to the right (increasing heap sizes) until we find a heap with at least enough spare capacity to hold the surpluses of all the prior heaps we've scanned. If we reach the end of the square, we create a new heap. This is the target heap. Now we repeatedly bubble up values from lower heaps to higher ones until no heap between the one we tried to insert into and the target heap has any surplus. This costs us in the worst case something like O(n * log(n) * sqrt(n)). I haven't done the full analysis. But after doing a large rebalance, you shouldn't need to do a large rebalance for quite a while, so the average case, even on adversarial data, should be not so horrible. Deletions are simple: find the element and remove it from its heap. This can, however, reduce a heap to less than half filled. (This means I can insert and delete elements in an adversarial pattern to force all elements into one heap.) A similar rebalance algorithm is needed, but it's harder on this side because we're using max heaps everywhere and need to fill lower heaps.2) Searching in min or max heaps is bad isn't it ?Linear time in the worst case. If you're looking for something in a max heap that happens to be the smallest element, it can be in any leaf node, which gives you half the tree to look through. And the average isn't much better. However, if we can guarantee that an element is in one heap inside this square heap, it costs us O(sqrt n) to search that heap since that heap has no more than sqrt(n) elements.Lets say we choose max heaps. Now we have the root as 10^9 in the second last heap ie around n^2 elements. The children of it are 4*10^8 and 5*10^8 . If i'm searching for say 4.5 *10^8 my job is easy but if i'm searching for 1000, i have to search in both the subtrees and it goes linear and becomes around n^2 in the worst case. Did i overlook anything ? Instead of heaps, a single sorted array or breaking them into a series of sorted arrays ie skip lists kind of stuff would be fine if it just want a offline Data structure ?A skiplist is great. Problem is, it's random access to memory. That's bad enough at the best of times, and it's utter garbage if you're storing data on spinning disks. Even if you store it as a sorted list rather than a linked list, which means you never insert anything ever, it's O(log n/ B) seeks and reads to find an element. If you store it as a linked list, it's O(log n) reads. A square heap with an insertion algorithm as I describe gives you O(1) seeks and O(sqrt n/B) reads. (Here, B stands for block size, which in this case is the number of data elements you will naturally get for free or almost for free after reading one byte. When I was doing data structure research, the rule of thumb was that a block is roughly 1MB for spinning disks. Divide that by element size to get B. We track seeks separately because they're expensive. 9ms is a reasonable seek time. Comparatively, reading data sequentially is extremely cheap, especially if you can inform the disk scheduler that you are going to perform sequential reads. Seeks are expensive no matter what, even if you can inform the scheduler in advance -- and the skiplist doesn't let you do that. Even if you switch to SSDs, SSDs tend to be ~4x faster on sequential reads than random reads.) So if you have 64-byte data structures, for instance, and you've got a billion on disk, you have sixteen seeks for a skiplist on average, which costs you 144ms. (You can drop that a bit by keeping the top of the skiplist in memory.) The square heap, on the other hand? You've got a total of ~1500 heaps, so you can store their offsets and heads in memory. That means you identify the heap containing the element you're searching for without touching disk, and scanning the heap costs one seek and a scan. You're scanning 700K elements on average, which means you need to look through forty or so blocks. So the square heap could well be more efficient here. It costs ~3x the number of reads, but they're all sequential, so while it's close, the square heap probably wins out on average. But this isn't yet a viable real-world algorithm for most things because of amortized inserts that promise to be expensive even after they're deamortized. And I haven't even sketched out an algorithm for deletes. At least range queries will typically be reasonably fast.or is this some domain specific data Structure where you only/cre want max/min in some sequence ?

Dec 02 2015

On 12/01/2015 11:48 PM, Andrei Alexandrescu wrote:On 12/01/2015 12:13 AM, Andrei Alexandrescu wrote:Assume the initial array is sorted from largest to smallest. This will be the worst case for this construction algorithm, as each element will be sifted all the way down to leaf level of the last heap. For simplicity, let's assume that n = m² for m odd, such that the sizes of the heaps are 1,3,…,m. To sift some element down one entire heap of size i takes ≥ log₂(i) steps. Ignoring the work performed in the heaps each elements starts in (which is obviously at most O(n) in total), we can lower bound the performed work; for the heap of size j, we sift j elements down all the following heaps of sizes i=j+1,i=j+2,…,i=m: ∑ⱼ[j∈{1,3,…,m}]·j·∑ᵢ[j+1≤i≤m]·log(i) = ∑ᵢ∑ⱼ[j∈{1,3,…,m}]·[j+1≤i≤m]·j·log(i) = ∑ᵢ[2≤i≤m]·log(i)·∑ⱼ[j∈{1,3,…,m}]·[j+1≤i] j = ∑ᵢ[2≤i≤m]·log(i)·∑ⱼ∑ₖ[j=2·k+1]·[1≤j≤m]·[j≤i-1] j = ∑ᵢ[2≤i≤m]·log(i)·∑ⱼ∑ₖ[j=2·k+1]·[1≤2·k+1≤i-1]·(2·k+1) = ∑ᵢ[2≤i≤m]·log(i)·∑ₖ[1≤2·k+1≤i-1]·(2·k+1) = ∑ᵢ[2≤i≤m]·log(i)·∑ₖ[0≤k≤⌊i/2⌋-1]·(2·k+1) = ∑ᵢ[2≤i≤m]·log(i)·⌊i/2⌋² ≥ Ω(m³) = Ω(n^(3/2)). I.e. sorting is asymptotically faster, and probably also in practice. Moreover, the construction algorithm would be slower even if sifting down one entire heap would run in constant time. However, for your adapted data structure with min-heaps, implementing insert and remove in O(√n̅ log n) is now easy using your "sifting in a DAG" approach. One way to prove a useful lower bound on the time it must take to build might be to observe that building recursively can be used for sorting.One more interesting thing: the heap heads are sorted, so when searching, the heap corresponding to the searched item can be found using binary search. That makes that part of the search essentially negligible - the lion's share will be the linear search on the last mile. In turn, that suggests that more heaps that are smaller would be a better choice. (At an extreme, if we have an array of heaps each proportional to log(n), then we get search time O(log n) even though the array is not entirely sorted.) AndreiThe linear search can have way better locality, so it is likely that actually the binary search dominates for some data sets.

Dec 02 2015

On 12/02/2015 12:26 PM, Timon Gehr wrote:... ∑ᵢ[2≤i≤m]·log(i)·∑ₖ[0≤k≤⌊i/2⌋-1]·(2·k+1) = ∑ᵢ[2≤i≤m]·log(i)·⌊i/2⌋²It should actually be (⌊i/2⌋-1)² here, but this does not change the asymptotics.

Dec 02 2015

On 12/02/2015 03:29 PM, Timon Gehr wrote:On 12/02/2015 12:26 PM, Timon Gehr wrote:Oops. No, it was actually right. Sorry for the noise. :-)... ∑ᵢ[2≤i≤m]·log(i)·∑ₖ[0≤k≤⌊i/2⌋-1]·(2·k+1) = ∑ᵢ[2≤i≤m]·log(i)·⌊i/2⌋²It should actually be (⌊i/2⌋-1)² here, but this does not change the asymptotics.

Dec 02 2015

On 12/02/2015 06:26 AM, Timon Gehr wrote:The linear search can have way better locality, so it is likely that actually the binary search dominates for some data sets.Galloping search ftw (good locality, log complexity). -- Andrei

Dec 02 2015

On 12/02/2015 06:26 AM, Timon Gehr wrote:Assume the initial array is sorted from largest to smallest. This will be the worst case for this construction algorithm, as each element will be sifted all the way down to leaf level of the last heap.[...]Ω(n^(3/2)).Thanks. My math-fu is not good enough to properly follow the argument, but that bound sounds high; one intuition is that there shouldn't be more work done than in sorting - fewer swaps, fewer comparisons, less resulting structure. For example, heapifying the array 1, 2, 3, 4, 5, 6, 7, 8, 9 in 1+3+5 max heaps results in the array 9, 8, 6, 7, 5, 4, 3, 2, 1 which is "less sorted" than the fully sorted array. (I know, that doesn't prove anything.) At this point I need to either get to the bottom of the math or put together an experimental rig that counts comparisons and swaps. (BTW I figured one reason for which these DAG heaps work at all is that there's never a need to sift an element up between heaps; that would be inefficient because the root of a subheap has multiple parents. Sifting down is efficient because each node has at most two children.) Andrei

Dec 03 2015

On 12/03/2015 11:44 PM, Andrei Alexandrescu wrote:On 12/02/2015 06:26 AM, Timon Gehr wrote:There was actually an (asymptotically inconsequential) error in the initial expression. Fixed version: Ignoring the work performed in the heaps each elements starts in (which is obviously at most O(n)), we can lower bound the performed work; for the heap of size j, we sift j elements down all the following heaps of sizes i=j+2,i=j+4,…,i=m). ∑ⱼ[j∈{1,3,…,m}]·j·∑ᵢ[i∈{j+2,j+4,…,m}]·log(i) = ∑ᵢ∑ⱼ[j∈{1,3,…,m}]·[i∈{j+2,j+4,…,m}]·j·log(i) = ∑ᵢ∑ⱼ∑ₖ∑ₗ[i=2·k+1]·[j=2·l+1]·[1≤j≤m]·[j+2≤i≤m]·j·log(i) = ∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ⱼ∑ₗ[j=2·l+1]·[1≤j≤m]·[j+2≤i]·j = ∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ⱼ∑ₗ[j=2·l+1]·[1≤j≤i-2]·j = ∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ⱼ∑ₗ[j=2·l+1]·[1≤2·l+1≤i-2]·(2·l+1) = ∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ₗ[1≤2·l+1≤i-2]·(2·l+1) = ∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ₗ[0≤l≤(i-3)/2]·(2·l+1) = ∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ₗ[0≤l≤(i-3)/2]·(2·l+1) = ∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·((i-1)/2)² = ∑ₖ[3≤2·k+1≤m]·log(2·k+1)·k² = ∑ₖ[1≤k≤(m-1)/2]·log(2·k+1)·k² ≥ Ω(m³) = Ω(n^(3/2)). Less formally: The heap of size 3 is sifted through by 1=1² element from the first heap. The heap of size 5 is sifted through by 1+3=2² elements from the first and second heaps The heap of size 7 is sifted through by 1+3+5=3² elements from the first three heaps. ... The heap of size 2·k+1 is sifted through by k² elements. ... The heap of size m is sifted through by ((m-1)/2)² elements. This gives the last sum in the above derivation: sifting through heap of size 2·k+1 costs at least log(2·k+1) v ∑ₖ[1≤k≤(m-1)/2]·log(2·k+1)·k² ^ k ranges from 1 to (m-1)/2. The factor k² is the number of elements sifting through the heap of size 2·k+1. The sum can be lower bounded by 1+2²+3²+…+((m-1)/2)² = (m+1)·(m-1)·(m-3)/24+(m+1)·(m-1)/8 = (m³-m)/24 ≥ Ω(m³).Assume the initial array is sorted from largest to smallest. This will be the worst case for this construction algorithm, as each element will be sifted all the way down to leaf level of the last heap.[...]Ω(n^(3/2)).Thanks. My math-fu is not good enough to properly follow the argument,but that bound sounds high; one intuition is that there shouldn't be more work done than in sorting - fewer swaps, fewer comparisons, less resulting structure. ...That intuition is spot on, but the sorting algorithm that the construction algorithm imitates most closely is a version of insertion sort.For example, heapifying the array 1, 2, 3, 4, 5, 6, 7, 8, 9 in 1+3+5 max heaps results in the array 9, 8, 6, 7, 5, 4, 3, 2, 1 which is "less sorted" than the fully sorted array. (I know, that doesn't prove anything.) ...It provides motivation to try and find another construction algorithm that runs asymptotically faster than n log n, or to prove that n log n is asymptotically optimal.At this point I need to either get to the bottom of the math orThe basic argument is not too involved: If each element needs to be sifted all the way down to the last heap, then for each of √n̅ heaps, you need to sift all its ~√n̅ elements down ~√n̅ other heaps, which requires more than ≈n^(3/2) operations. The above derivation makes this rigorous.put together an experimental rig that counts comparisons and swaps. (BTW I figured one reason for which these DAG heaps work at all is that there's never a need to sift an element up between heaps; that would be inefficient because the root of a subheap has multiple parents. Sifting down is efficient because each node has at most two children.) ...Good point. I had declared victory on the "insert" front too early.

Dec 03 2015

Hello all! I happened by this thread (from Hacker News) and the idea of this data structure got stuck in my head. I did some scribbling on paper and decided that it could be interesting to code... On Thursday, 3 December 2015 at 22:44:26 UTC, Andrei Alexandrescu wrote:At this point I need to either get to the bottom of the math or put together an experimental rig that counts comparisons and swaps.I've built a test implementation (in C++ ... I hope that isn't too distasteful on a D forum, but it's what I'm most comfortable with) here: https://github.com/jcausey-astate/HeapArray I chose to use a Min-Max heap[1] for the heaps themselves (this buys me O(1) min and max access). This solves the problem of making insert and delete follow the same pattern (to insert, ripple the max value in a full partition to the next one; in a delete, fill in the "hole" with min value from previous partition). I wasn't able to come up with anything better than pre-sorting the whole thing, then running the Floyd "make heap" on each partition. So, the whole thing costs O(n*lg(n) + n)) to make a new structure on top of an existing array. This is still faster than doing a top-down (add every element) make-heap though. I agree with Timon's analysis[2] on that. I also agree with Andrei that I have a "gut feeling" that there could be a faster setup algorithm, since we really just need to shuffle values into the right partitions, not actually fully sort them... But that would require knowing exactly what the partition "pivot" values were in advance, which would require searching, and 'round we go. My code is still rough, only works for ints, and needs some documentation and cleanup, but it does show that our hunches were more or less correct: This thing is much faster than linear for searching, but not as fast as logarithmic. It also performs OK when adding new values and performing lots of searches (when compared with a plain old array or vector where you have to linear search every time). My Github README has some charts, but I'll link a couple here: Search times (vector VS this structure VS multiset) https://plot.ly/~jcausey-astate/7/search-timing-vector-vs-heaparray-vs-multiset/ Time to add N values starting with an empty structure (vector VS this structure VS multiset) https://plot.ly/~jcausey-astate/18/fill-container-dynamically-heaparray-vs-vector-and-multiset/ Time to initially build the structure given an already-populated array (vector VS this structure VS multiset): https://plot.ly/~jcausey-astate/35/build-data-structure-on-existing-array-all-at-once/ [1]: http://www.akira.ruc.dk/~keld/teaching/algoritmedesign_f03/Artikler/02../Atkinson86.pdf [2]: http://forum.dlang.org/post/n3qqkm$2c6t$1 digitalmars.com

Dec 10 2015

On 30-Nov-2015 23:13, Andrei Alexandrescu wrote:Okasaki's book is a continued inspiration of data structures and algorithms. I was thinking along the following lines: typical collections are searchable in linear time. Then we have highly structured collections that feature logarithmic search time. But there seems to be nothing in between. So I was thinking, what would be a data structure that allows O(sqrt n) search times? After a little tinkering, here's what I came up with. Start with a simple array of data. Then mentally decompose that array into a concatenation of smaller arrays: first has size 1, second has size 4, third has size 9, then 16, 25, 36, ... Generally the size of these imaginary subarrays grows quadratically. And they're adjacent to each other. The last array may be incomplete. Example: we decompose an array of size 35 into: an array of size 1 followed by arrays of sizes 4, 9, 16, 5 (last being an incomplete fragment of an array of size 25). Now each of these arrays we organize as a max heap. Moreover, we arrange data such that the maximums of these heaps are in INCREASING order. That means the smallest element of the entire (initial) array is at the first position, then followed by the next 4 smallest organized in a max heap, followed by the next 9 smallest organized in a max heap, ... etc. There are a total of O(sqrt n) heaps, and each has O(sqrt n) elements. That's the layout! Now, to search we look at one heap at a time. Whenever the maximum element (first element in the subarray) is smaller than the searched value, we skip over that entire heap and go to the next one. In the worst case we'll skip O(sqrt n) heaps. When the max value in a heap is less than the searched element, we found the heap and we run a linear search among O(sqrt n) elements.Reminds me of Van Emde Boas layout which is however fractal in nature: sqrt(N) pieces each having sqrt(N) element are subdivided again into sqrt(sqrt(N)) pieces and so on. Not sure if you have seen, but see also cache-oblivious data-structures: http://erikdemaine.org/papers/BRICS2002/paper.pdf -- Dmitry Olshansky

Nov 30 2015

On 11/30/15 3:29 PM, Dmitry Olshansky wrote:Reminds me of Van Emde Boas layout which is however fractal in nature: sqrt(N) pieces each having sqrt(N) element are subdivided again into sqrt(sqrt(N)) pieces and so on. Not sure if you have seen, but see also cache-oblivious data-structures: http://erikdemaine.org/papers/BRICS2002/paper.pdfThanks, I'll look these up! -- Andrei

Nov 30 2015

On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote:Now each of these arrays we organize as a max heap. Moreover, we arrange data such that the maximums of these heaps are in INCREASING order. That means the smallest element of the entire (initial) array is at the first position, then followed by the next 4 smallest organized in a max heap, followed by the next 9 smallest organized in a max heap, ... etc. There are a total of O(sqrt n) heaps, and each has O(sqrt n) elements.Isn't the size of each heap a little larger than O(sqrt n)? The total number of elements you can hold in k heaps is equal to the kth square pyramidal number, which is of size O(k^3), so the largest heap is of size k^2 = O(n^(2/3))

Nov 30 2015

On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote: ...smallest organized in a max heap, ... etc. There are a total of O(sqrt n) heaps, and each has O(sqrt n) elements.... Hi Andrei, If I'm not mistaken, the number of heaps is proportional to n^(1/3) not n^(1/2) 1+2^2+..+n^2 is proportional to n^3 https://en.wikipedia.org/wiki/Square_pyramidal_number If n heaps have n^3 elements, rewrite n^3=m, then m elements will be stored in m^(1/3) heaps Interesting data structure! need to look over the details some more Titus

Nov 30 2015

On Monday, 30 November 2015 at 23:19:44 UTC, Titus Nicolae wrote:On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote: ...I think Titus and I noticed the same thing at the same time. You could achieve O(sqrt n) lookup time by having each heap contain ceil(sqrt(n)) elements (except for the last one), giving you O(sqrt n) heaps of size O(sqrt n), but it may make insertion trickier as you try to keep the heaps the same size.smallest organized in a max heap, ... etc. There are a total of O(sqrt n) heaps, and each has O(sqrt n) elements.... Hi Andrei, If I'm not mistaken, the number of heaps is proportional to n^(1/3) not n^(1/2) 1+2^2+..+n^2 is proportional to n^3 https://en.wikipedia.org/wiki/Square_pyramidal_number If n heaps have n^3 elements, rewrite n^3=m, then m elements will be stored in m^(1/3) heaps Interesting data structure! need to look over the details some more Titus

Nov 30 2015

On Monday, 30 November 2015 at 23:19:44 UTC, Titus Nicolae wrote:On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote: ...This might give a better balanced structure. Sum of odd numbers 1+3+5+..+(2n-1) = n^2 For N total elements, there are sqrt(n) heaps, and 2*sqrt(n) elements in the largest heap Andrei: La multi ani! :)smallest organized in a max heap, ... etc. There are a total of O(sqrt n) heaps, and each has O(sqrt n) elements.... Hi Andrei, If I'm not mistaken, the number of heaps is proportional to n^(1/3) not n^(1/2) 1+2^2+..+n^2 is proportional to n^3 https://en.wikipedia.org/wiki/Square_pyramidal_number If n heaps have n^3 elements, rewrite n^3=m, then m elements will be stored in m^(1/3) heaps Interesting data structure! need to look over the details some more Titus

Nov 30 2015

On 11/30/2015 06:32 PM, Titus Nicolae wrote:On Monday, 30 November 2015 at 23:19:44 UTC, Titus Nicolae wrote:This is very interesting. Thanks, and thanks! -- Andrei

Nov 30 2015

On 11/30/2015 06:19 PM, Titus Nicolae wrote:

Nov 30 2015

On 11/30/2015 09:13 PM, Andrei Alexandrescu wrote:I haven't worked out the math for insertion and deletion yet, but they seem to also be around O(sqrt n) or O(log(n) * sqrt(n)).(Assuming the bucket sizes actually grow linearly.) It seems to me that for insertion O(√n̅ log n) is very easy to achieve, but not for deletion. (For insertion, one simple strategy that satisfies the bound is to repeatedly move the maximum to the next heap [1].) Deletion leaves a hole in one heap, which should be filled by the minimum element of the next heap, etc. The minimum cannot be extracted efficiently from a max-heap. [1] struct Fancy(T){ T[] a; void insert(T t){ size_t i=1,j=0; for(;j+i<=a.length;j+=i,i++){ if(!(t<a[j])) continue; swap(t,a[j]); for(size_t c=0,w=1;w<i;c=w,w=2*c+1){ if(w+1<i&&a[j+w]<a[j+w+1]) w++; if(!(a[j+c]<a[j+w])) break; swap(a[j+c],a[j+w]); } } a~=t; for(auto c=a.length-1-j;c;c/=2){ if(!(a[j+c/2]<a[j+c])) break; swap(a[j+c/2],a[j+c]); } } }

Nov 30 2015

The key search phrase is "cache oblivious data structures". One of the cache-friendly layouts is the van Emde Boas tree.

Nov 30 2015

On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote:Okasaki's book is a continued inspiration of data structures and algorithms. Start with a simple array of data. Then mentally decompose that array into a concatenation of smaller arrays: first has size 1, second has size 4, third has size 9, then 16, 25, 36, ... Generally the size of these imaginary subarrays grows quadratically. And they're adjacent to each other. The last array may be incomplete. ........ Please share any thoughts! AndreiInteresting, but isn't that basically a skip list with uneven sub-lists? insert could be quite complex unless the entire data structure is backed by a continuous or a continuous linked list.

Dec 01 2015

On 12/01/2015 04:50 AM, Emil Kirschner wrote:On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote:Doesn't look like a skip list to me. -- AndreiOkasaki's book is a continued inspiration of data structures and algorithms. Start with a simple array of data. Then mentally decompose that array into a concatenation of smaller arrays: first has size 1, second has size 4, third has size 9, then 16, 25, 36, ... Generally the size of these imaginary subarrays grows quadratically. And they're adjacent to each other. The last array may be incomplete. ........ Please share any thoughts! AndreiInteresting, but isn't that basically a skip list with uneven sub-lists? insert could be quite complex unless the entire data structure is backed by a continuous or a continuous linked list.

Dec 01 2015

On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote:Okasaki's book is a continued inspiration of data structures and algorithms. [...]Hi, You might find relevant ideas in https://cs.uwaterloo.ca/research/tr/1979/CS-79-31.pdf and http://www.sciencedirect.com/science/article/pii/0022000080900379 . Cheers, Marko

Dec 01 2015

On 12/1/15 5:19 AM, Marko Mikulicic wrote:On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote:Noted, thanks! -- AndreiOkasaki's book is a continued inspiration of data structures and algorithms. [...]Hi, You might find relevant ideas in https://cs.uwaterloo.ca/research/tr/1979/CS-79-31.pdf and http://www.sciencedirect.com/science/article/pii/0022000080900379 .

Dec 01 2015

On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote:Okasaki's book is a continued inspiration of data structures and algorithms. [...]Sort of reminds me of a modified Hashed Array Tree — not keen on the name, I think "Bucket Array" would have been better. https://en.wikipedia.org/wiki/Hashed_array_tree

Dec 01 2015