digitalmars.D - [OT] n-way union
- Andrei Alexandrescu (24/24) May 23 2009 This is somewhat OT but I think it's an interesting problem. Consider
- Sean Kelly (14/42) May 23 2009 It seems like there are two basic options: either merge the arrays (as
- Sean Kelly (9/9) May 23 2009 I have a better idea which is pretty much in line with the heap-based
- Sean Kelly (2/11) May 23 2009 hehe... "pretty much in line" turns out to be "exactly the same." Oh we...
- Jason House (4/35) May 23 2009 My first thought was to throw everything in a heap map with key=r.front ...
- BCS (8/39) May 23 2009 in pseudocode
- Don (16/60) May 26 2009 In practice, I bet it'll go faster with a couple more steps:
- Andrei Alexandrescu (9/37) May 26 2009 Correct. Unfortunately my heap primitives didn't quite allow that so
- Don (11/50) May 26 2009 Depending on what "going through each element" means. Certainly, if you
- Andrei Alexandrescu (5/15) May 26 2009 Oh, I see. So yes, comparisons can be reduced. So how exactly do you
- Don (10/25) May 26 2009 That was my original thought, but it's easy to come up with cases that
- Jason House (3/21) May 26 2009 This conversation made me realize a simple optimization (forgive me if i...
- Andrei Alexandrescu (7/39) May 26 2009 Cool! That can actually be implemented as a micro-optimization with the
- Frits van Bommel (4/24) May 26 2009 If you keep it separate, you only need one comparison to determine the c...
- Andrei Alexandrescu (3/29) May 26 2009 Good point, and also I don't need to worry about adding new heap primiti...
- Robert Fraser (2/34) May 23 2009 I'd be interested if there's some way to remove duplicates.
- Georg Wrede (38/65) May 25 2009 If we'd know anything about the data, such as, the max value is always
- Andrei Alexandrescu (3/40) May 25 2009 You can assume that each array is sorted.
- Georg Wrede (4/45) May 25 2009 Err, you didn't comment on my algorithm, at the end. So, either it is
- Jason House (2/48) May 25 2009 Don't get offended, he didn't respond to anyone else's algorithm either....
- Andrei Alexandrescu (11/12) May 25 2009 I just wanted to bring what I think is an interesting problem to the
- Jason House (3/17) May 25 2009 I bet BCS had the advantage of a real keyboard ;)
- BCS (2/4) May 25 2009 you are 89% correct: http://en.wikipedia.org/wiki/Acer_Aspire_One
- Andrei Alexandrescu (20/26) May 25 2009 I am sorry, at the first two reads I did not understand your algorithm.
- Georg Wrede (71/101) May 25 2009 Canonical one. Righhhhhtttttt......
- Andrei Alexandrescu (21/32) May 25 2009 [snip]
- Georg Wrede (4/7) May 26 2009 Maybe I didn't understand BCS's algorithm. To me it looked slower. I'll
- Georg Wrede (4/7) May 26 2009 Maybe I misunderstood the point of this OT thread. One could believe the...
- bearophile (11/22) May 25 2009 Better to yield lazily the output items.
- Andrei Alexandrescu (4/10) May 25 2009 I don't think that applies here. Unstable removal is O(1) and restoring
- bearophile (9/10) May 25 2009 It's better for nWayUnion to be a lazy iterable.
- Andrei Alexandrescu (5/8) May 25 2009 The policy of std.algorithm is to not surreptitiously allocate memory if...
This is somewhat OT but I think it's an interesting problem. Consider the following data:

double[][] a = [
    [ 1, 4, 7, 8 ],
    [ 1, 7 ],
    [ 1, 7, 8 ],
    [ 4 ],
    [ 7 ],
];

We want to compute an n-way union, i.e., efficiently span all elements in all arrays in a, in sorted order. You can assume that each individual array in a is sorted. The output of the n-way union should be:

auto witness = [ 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8 ];
assert(equal(nWayUnion(a), witness[]));

The STL and std.algorithm have set_union, which does that for two sets: for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But n-way unions pose additional challenges. What would be a fast algorithm? (Imagine a could be a very large range of ranges.) Needless to say, nWayUnion is a range :o). Finally, why would anyone care for something like this?

Andrei
May 23 2009
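For contrast with the merge-based solutions proposed below, here is a minimal D sketch of the obvious baseline: flatten everything and sort from scratch. That costs O(n log n) for n total elements and throws away the fact that each input array is already sorted, which is exactly what the answers in this thread try to exploit.

import std.algorithm : equal, sort;
import std.array : join;

void main()
{
    double[][] a = [ [1, 4, 7, 8], [1, 7], [1, 7, 8], [4], [7] ];
    auto all = join(a); // concatenate all inputs into one array
    sort(all);          // sort, ignoring their pre-sortedness
    double[] witness = [1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8];
    assert(equal(all, witness));
}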
Andrei Alexandrescu wrote:This is somewhat OT but I think it's an interesting problem. Consider the following data: double[][] a = [ [ 1, 4, 7, 8 ], [ 1, 7 ], [ 1, 7, 8], [ 4 ], [ 7 ], ]; We want to compute an n-way union, i.e., efficiently span all elements in all arrays in a, in sorted order. You can assume that each individual array in a is sorted. The output of n-way union should be: auto witness = [ 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8 ]; assert(equal(nWayUnion(a), witness[])); The STL and std.algorithm have set_union that does that for two sets: for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But n-way unions poses additional challenges. What would be a fast algorithm? (Imagine a could be a very large range of ranges).It seems like there are two basic options: either merge the arrays (as per merge sort) and deal with multiple passes across the same data elements or insert the elements from each array into a single destination array and deal with a bunch of memmove operations. The merge option is kind of interesting because it could benefit from a parallel range of sorts. Each front() op could actually return a range which contained the front element of each non-empty range. Pass this to a min() op accepting a range and drop the min element into the destination. The tricky part would be working things in such a way that the originating range could have popFront() called once the insertion had occurred.Needless to say, nWayUnion is a range :o). Finally, why would anyone care for something like this?Other than mergeSort? I'd think that being able to perform union of N sets in unison would be a nice way to eliminate arbitrary restrictions.
May 23 2009
I have a better idea which is pretty much in line with the heap-based solutions already posted. Place all the ranges in a min priority queue where the priority of a range is equal to the value of its top element. Pop the top value off the range at the front of the queue. If that range is now empty then remove it from the queue. If it contains more elements then fix up the priority queue to restore the invariant (since the priority of the head range may have changed). The temporary space would just be of size N, where N is the number of ranges, and the number of duplicate compares is minimized.
May 23 2009
Sean Kelly wrote:I have a better idea which is pretty much in line with the heap-based solutions already posted. Place all the ranges in a min priority queue where the priority of a range is equal to the value of its top element. Pop the value off the heap at the front of the queue. If the heap is empty then remove it. If it contains more elements then fix up the priority queue to restore the invariant (since the priority of the head range may have changed). The temporary space would just be of size N where N is the number of ranges, and the number of duplicate compares is minimized.hehe... "pretty much in line" turns out to be "exactly the same." Oh well.
May 23 2009
Andrei Alexandrescu Wrote:This is somewhat OT but I think it's an interesting problem. Consider the following data: double[][] a = [ [ 1, 4, 7, 8 ], [ 1, 7 ], [ 1, 7, 8], [ 4 ], [ 7 ], ]; We want to compute an n-way union, i.e., efficiently span all elements in all arrays in a, in sorted order. You can assume that each individual array in a is sorted. The output of n-way union should be: auto witness = [ 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8 ]; assert(equal(nWayUnion(a), witness[])); The STL and std.algorithm have set_union that does that for two sets: for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But n-way unions poses additional challenges. What would be a fast algorithm? (Imagine a could be a very large range of ranges).My first thought was to throw everything in a heap map with key=r.front and value=r optimal.Needless to say, nWayUnion is a range :o). Finally, why would anyone care for something like this? Andrei
May 23 2009
Hello Andrei,This is somewhat OT but I think it's an interesting problem. Consider the following data: double[][] a = [ [ 1, 4, 7, 8 ], [ 1, 7 ], [ 1, 7, 8], [ 4 ], [ 7 ], ]; We want to compute an n-way union, i.e., efficiently span all elements in all arrays in a, in sorted order. You can assume that each individual array in a is sorted. The output of n-way union should be: auto witness = [ 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8 ]; assert(equal(nWayUnion(a), witness[])); The STL and std.algorithm have set_union that does that for two sets: for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But n-way unions poses additional challenges. What would be a fast algorithm? (Imagine a could be a very large range of ranges). Needless to say, nWayUnion is a range :o). Finally, why would anyone care for something like this? Andrei

in pseudocode:

1. build a min-heap out of the arrays, based on the first element of each array
2. remove the first value from the array at the root of the heap
3. if that array is now empty, drop it
4. if not empty, re-heap
5. goto 2 if anything is left
May 23 2009
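A compilable D sketch of these steps, exposing the result as a range. The struct name NWayUnion, the double[][] representation, and the hand-rolled sift-down are illustrative choices made here; this is not the actual std.algorithm implementation Andrei mentions later in the thread.

import std.algorithm : equal;

// Min-heap of the non-empty input arrays, keyed on each array's front.
struct NWayUnion
{
    double[][] heap;

    this(double[][] ranges)
    {
        foreach (r; ranges)
            if (r.length) heap ~= r;
        // Step 1: build the heap bottom-up.
        for (size_t i = heap.length / 2; i-- > 0; )
            siftDown(i);
    }

    @property bool empty() const { return heap.length == 0; }
    @property double front() const { return heap[0][0]; } // current minimum

    void popFront()
    {
        // Step 2: remove the first value from the array at the root.
        heap[0] = heap[0][1 .. $];
        if (heap[0].length == 0)
        {
            // Step 3: that array is exhausted, drop it.
            heap[0] = heap[$ - 1];
            heap = heap[0 .. $ - 1];
        }
        // Step 4: re-heap; only the root can violate the invariant.
        if (heap.length) siftDown(0);
    }

    private void siftDown(size_t i)
    {
        for (;;)
        {
            size_t c = 2 * i + 1;                 // left child
            if (c >= heap.length) break;
            if (c + 1 < heap.length && heap[c + 1][0] < heap[c][0])
                ++c;                              // smaller of the two children
            if (heap[i][0] <= heap[c][0]) break;  // invariant restored
            auto t = heap[i]; heap[i] = heap[c]; heap[c] = t;
            i = c;
        }
    }
}

void main()
{
    double[][] a = [ [1, 4, 7, 8], [1, 7], [1, 7, 8], [4], [7] ];
    double[] witness = [1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8];
    assert(equal(NWayUnion(a), witness));
}

Each popFront does O(log m) work for m arrays, so spanning all n elements costs O(n log m), the bound discussed further down the thread.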
BCS wrote:Hello Andrei,In practice, I bet it'll go faster with a couple more steps: 2. remove the first value (from array X) from the root of the heap, 2a. Determine new root of the heap (don't need to do a full re-heap yet). 2b. move all values from X until empty or >= new root. 4. reheap There are interesting cases like: [5,6,7,8,9,9,......9] [30, 31,32,33,34...38] [1,1,1,1,1,1,1,1,1,1,...1] [10,11,12,13,14,15,16,16,......., 16] where in theory, the total number of comparisons required is a function only of the number of arrays, and is independent of the actual number of elements. If it's common to have runs where consecutive values come from a single array, and the arrays are random access, you could do MUCH better than O(n).This is somewhat OT but I think it's an interesting problem. Consider the following data: double[][] a = [ [ 1, 4, 7, 8 ], [ 1, 7 ], [ 1, 7, 8], [ 4 ], [ 7 ], ]; We want to compute an n-way union, i.e., efficiently span all elements in all arrays in a, in sorted order. You can assume that each individual array in a is sorted. The output of n-way union should be: auto witness = [ 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8 ]; assert(equal(nWayUnion(a), witness[])); The STL and std.algorithm have set_union that does that for two sets: for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But n-way unions poses additional challenges. What would be a fast algorithm? (Imagine a could be a very large range of ranges). Needless to say, nWayUnion is a range :o). Finally, why would anyone care for something like this? Andreiin sudocode 1. build a min heap out of the elements based on the first element of each array 2. remove the first value from the root of the heap, 3. if empty array, drop it 4. if not empty re-heap 5. goto 2 if anything left
May 26 2009
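A sketch of Don's steps 2a/2b against the NWayUnion layout above; the name drainRun and its output-array interface are invented here for illustration. After the run is drained, the caller drops the root array if it emptied and does a single sift-down, instead of one re-heap per element.

// Drain a run of values from the root array without re-heaping each time.
double[] drainRun(double[][] heap, double[] output)
{
    // 2a: the smaller child of the root is the would-be new minimum.
    size_t c = (heap.length > 2 && heap[2][0] < heap[1][0]) ? 2 : 1;
    // 2b: move values from the root array X while they don't exceed it.
    while (heap[0].length && (heap.length == 1 || heap[0][0] <= heap[c][0]))
    {
        output ~= heap[0][0];
        heap[0] = heap[0][1 .. $];
    }
    return output; // caller: drop heap[0] if empty, then sift down once
}

void main()
{
    // A valid min-heap of three arrays (fronts 1, 5, 7).
    double[][] heap = [ [1.0, 1, 1, 2, 9], [5.0, 6], [7.0] ];
    double[] output;
    output = drainRun(heap, output);
    assert(output == [1.0, 1, 1, 2]); // the whole run, one compare each
    assert(heap[0] == [9.0]);         // 9 stays: it exceeds the new root, 5
}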
Don wrote:BCS wrote:Correct. Unfortunately my heap primitives didn't quite allow that so currently I do it the "hard" way: pop from top, remove top from heap, reinsert (if not empty). But this is definitely an optimization worth doing. I think I'll implement it.in sudocode 1. build a min heap out of the elements based on the first element of each array 2. remove the first value from the root of the heap, 3. if empty array, drop it 4. if not empty re-heap 5. goto 2 if anything leftIn practice, I bet it'll go faster with a couple more steps: 2. remove the first value (from array X) from the root of the heap, 2a. Determine new root of the heap (don't need to do a full re-heap yet). 2b. move all values from X until empty or >= new root. 4. reheapThere are interesting cases like: [5,6,7,8,9,9,......9] [30, 31,32,33,34...38] [1,1,1,1,1,1,1,1,1,1,...1] [10,11,12,13,14,15,16,16,......., 16] where in theory, the total number of comparisons required is a function only of the number of arrays, and is independent of the actual number of elements. If it's common to have runs where consecutive values come from a single array, and the arrays are random access, you could do MUCH better than O(n).I think you can't do better. Complexity is O(n) at a minimum because you need to go through each element. The heap-based algorithm has O(n log m). What you could improve on is the log m factor (m = the number of sets). Andrei
May 26 2009
Andrei Alexandrescu wrote:Don wrote:Depending on what "going through each element" means. Certainly, if you need to move them, you can't beat O(n) moves, but you can definitely get O(m*m) < < O(n) comparisons in the best case I described. And that would also be the number of splicing operations. You could approximate this without hurting the worst-case too much, by (for example) once you get a run of more than K entries from the same array, step forward 2 entries; if that's still less than the root, copy both to the output, then step forward 4 entries, etc. That should give a best-case O(m log n) comparisons/splices. Which might not be of any practical use, of course.BCS wrote:Correct. Unfortunately my heap primitives didn't quite allow that so currently I do it the "hard" way: pop from top, remove top from heap, reinsert (if not empty). But this is definitely an optimization worth doing. I think I'll implement it.in sudocode 1. build a min heap out of the elements based on the first element of each array 2. remove the first value from the root of the heap, 3. if empty array, drop it 4. if not empty re-heap 5. goto 2 if anything leftIn practice, I bet it'll go faster with a couple more steps: 2. remove the first value (from array X) from the root of the heap, 2a. Determine new root of the heap (don't need to do a full re-heap yet). 2b. move all values from X until empty or >= new root. 4. reheapThere are interesting cases like: [5,6,7,8,9,9,......9] [30, 31,32,33,34...38] [1,1,1,1,1,1,1,1,1,1,...1] [10,11,12,13,14,15,16,16,......., 16] where in theory, the total number of comparisons required is a function only of the number of arrays, and is independent of the actual number of elements. If it's common to have runs where consecutive values come from a single array, and the arrays are random access, you could do MUCH better than O(n).I think you can't do better. Complexity is O(n) at a minimum because you need to go through each element. The heap-based algorithm has O(n log m). What you could improve on is the log m factor (m = the number of sets).
May 26 2009
Don wrote:Andrei Alexandrescu wrote:Oh, I see. So yes, comparisons can be reduced. So how exactly do you think that can be done? By binary searching the upper bound of the head to find a long run of identical elements? AndreiI think you can't do better. Complexity is O(n) at a minimum because you need to go through each element. The heap-based algorithm has O(n log m). What you could improve on is the log m factor (m = the number of sets).Depending on what "going through each element" means. Certainly, if you need to move them, you can't beat O(n) moves, but you can definitely get O(m*m) < < O(n) comparisons in the best case I described. And that would also be the number of splicing operations.
May 26 2009
Andrei Alexandrescu wrote:Don wrote:That was my original thought, but it's easy to come up with cases that would perform really badly -- so that you do a binary search for every two elements. Which was why I suggested repeated doubling of the stride distance, I think you can prove that's never worse than a linear search of every element. Although, if it was worth doing, I'd imagine that implementations of STL set_union would already be doing it. Anyway, it's a really interesting problem.Andrei Alexandrescu wrote:Oh, I see. So yes, comparisons can be reduced. So how exactly do you think that can be done? By binary searching the upper bound of the head to find a long run of identical elements?I think you can't do better. Complexity is O(n) at a minimum because you need to go through each element. The heap-based algorithm has O(n log m). What you could improve on is the log m factor (m = the number of sets).Depending on what "going through each element" means. Certainly, if you need to move them, you can't beat O(n) moves, but you can definitely get O(m*m) < < O(n) comparisons in the best case I described. And that would also be the number of splicing operations.
May 26 2009
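A sketch of the doubling-stride probe Don describes, in the galloping style familiar from timsort's merge: grow the step exponentially, then binary-search the boundary, so a run of k elements costs O(log k) comparisons instead of k. The name countLeadingLE and its slice interface are invented for this example.

// How many leading elements of arr are <= limit, in O(log k) comparisons.
size_t countLeadingLE(const(double)[] arr, double limit)
{
    if (arr.length == 0 || arr[0] > limit) return 0;
    // Gallop: double the probe distance while elements stay <= limit.
    size_t lo = 0, hi = 1;
    while (hi < arr.length && arr[hi] <= limit) { lo = hi; hi *= 2; }
    if (hi > arr.length) hi = arr.length;
    // Binary search for the boundary between lo and hi.
    while (lo + 1 < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] <= limit) lo = mid; else hi = mid;
    }
    return lo + 1; // arr[0 .. lo + 1] are all <= limit
}

void main()
{
    assert(countLeadingLE([5.0, 6, 7, 8, 9, 9, 9], 9) == 7);
    assert(countLeadingLE([1.0, 1, 2, 9], 5) == 3);
    assert(countLeadingLE([10.0, 11, 12], 9) == 0);
}

In the heap loop, something like countLeadingLE(heap[0], newRootFront) would give the length of the run to splice out in one step, and the worst case degrades gracefully: a failed gallop wastes only a couple of extra comparisons over a plain linear scan.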
Andrei Alexandrescu Wrote:Don wrote:This conversation made me realize a simple optimization (forgive me if it's what Don was implying) Instead of storing m ranges in a heap, store only m-1 ranges. The range that would be the min if the heap is kept separate. Each pop operation will compare the new front with the min heap element. This allows avoiding the re-heaping when the next element comes from the same range. If s is the number of times the source range must switch, then the performance is O( s*log(m) + n - s). The initial heaping is O(m*log(m)), but that's never worse than O(s*log(m)). Technically, the re-heaping steps are log(m-1), but that isn't important with big O.Andrei Alexandrescu wrote:Oh, I see. So yes, comparisons can be reduced. So how exactly do you think that can be done? By binary searching the upper bound of the head to find a long run of identical elements? AndreiI think you can't do better. Complexity is O(n) at a minimum because you need to go through each element. The heap-based algorithm has O(n log m). What you could improve on is the log m factor (m = the number of sets).Depending on what "going through each element" means. Certainly, if you need to move them, you can't beat O(n) moves, but you can definitely get O(m*m) < < O(n) comparisons in the best case I described. And that would also be the number of splicing operations.
May 26 2009
Jason House wrote:Andrei Alexandrescu Wrote:Cool! That can actually be implemented as a micro-optimization with the classic heap: right after you extract from the top (arr[0]), you look at its children (which are arr[1] and arr[2]). If both front()'s are still no less than yours, everything stays as it was and you're home free. I'll implement this, thanks. AndreiDon wrote:This conversation made me realize a simple optimization (forgive me if it's what Don was implying) Instead of storing m ranges in a heap, store only m-1 ranges. The range that would be the min if the heap is kept separate. Each pop operation will compare the new front with the min heap element. This allows avoiding the re-heaping when the next element comes from the same range. If s is the number of times the source range must switch, then the performance is O( s*log(m) + n - s). The initial heaping is O(m*log(m)), but that's never worse than O(s*log(m)). Technically, the re-heaping steps are log(m-1), but that isn't important with big O.Andrei Alexandrescu wrote:Oh, I see. So yes, comparisons can be reduced. So how exactly do you think that can be done? By binary searching the upper bound of the head to find a long run of identical elements? AndreiI think you can't do better. Complexity is O(n) at a minimum because you need to go through each element. The heap-based algorithm has O(n log m). What you could improve on is the log m factor (m = the number of sets).Depending on what "going through each element" means. Certainly, if you need to move them, you can't beat O(n) moves, but you can definitely get O(m*m) < < O(n) comparisons in the best case I described. And that would also be the number of splicing operations.
May 26 2009
Andrei Alexandrescu wrote:Jason House wrote:If you keep it separate, you only need one comparison to determine the current range still contains the minimum element. Just saying :).This conversation made me realize a simple optimization (forgive me if it's what Don was implying) Instead of storing m ranges in a heap, store only m-1 ranges. The range that would be the min if the heap is kept separate. Each pop operation will compare the new front with the min heap element. This allows avoiding the re-heaping when the next element comes from the same range. If s is the number of times the source range must switch, then the performance is O( s*log(m) + n - s). The initial heaping is O(m*log(m)), but that's never worse than O(s*log(m)). Technically, the re-heaping steps are log(m-1), but that isn't important with big O.Cool! That can actually be implemented as a micro-optimization with the classic heap: right after you extract from the top (arr[0]), you look at its children (which are arr[1] and arr[2]). If both front()'s are still no less than yours, everything stays as it was and you're home free. I'll implement this, thanks.
May 26 2009
Frits van Bommel wrote:Andrei Alexandrescu wrote:Good point, and also I don't need to worry about adding new heap primitives. AndreiJason House wrote:If you keep it separate, you only need one comparison to determine the current range still contains the minimum element. Just saying :).This conversation made me realize a simple optimization (forgive me if it's what Don was implying) Instead of storing m ranges in a heap, store only m-1 ranges. The range that would be the min if the heap is kept separate. Each pop operation will compare the new front with the min heap element. This allows avoiding the re-heaping when the next element comes from the same range. If s is the number of times the source range must switch, then the performance is O( s*log(m) + n - s). The initial heaping is O(m*log(m)), but that's never worse than O(s*log(m)). Technically, the re-heaping steps are log(m-1), but that isn't important with big O.Cool! That can actually be implemented as a micro-optimization with the classic heap: right after you extract from the top (arr[0]), you look at its children (which are arr[1] and arr[2]). If both front()'s are still no less than yours, everything stays as it was and you're home free. I'll implement this, thanks.
May 26 2009
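A sketch of the variant this exchange converges on: the array currently supplying the minimum lives outside a min-heap of the other m-1 arrays, so the common case costs a single comparison per pop. It assumes the layout of the NWayUnion sketch earlier; the name NWayUnion2 and all details are again illustrative, not the code Andrei ended up writing.

import std.algorithm : equal;

struct NWayUnion2
{
    double[] cur;    // the array currently supplying the minimum
    double[][] heap; // min-heap of the remaining non-empty arrays

    this(double[][] ranges)
    {
        foreach (r; ranges)
            if (r.length) heap ~= r;
        for (size_t i = heap.length / 2; i-- > 0; )
            siftDown(i);
        if (heap.length)
        {
            // Pull the initial minimum out of the heap.
            cur = heap[0];
            heap[0] = heap[$ - 1];
            heap = heap[0 .. $ - 1];
            if (heap.length) siftDown(0);
        }
    }

    @property bool empty() const { return cur.length == 0; }
    @property double front() const { return cur[0]; }

    void popFront()
    {
        cur = cur[1 .. $];
        // Cheap path: one comparison while cur still holds the minimum.
        if (cur.length && (heap.length == 0 || cur[0] <= heap[0][0]))
            return;
        if (heap.length == 0)
            return; // cur exhausted and nothing else left: range is empty
        if (cur.length)
        {
            // cur lost: swap it with the heap's root...
            auto t = cur; cur = heap[0]; heap[0] = t;
        }
        else
        {
            // ...or cur is exhausted: promote the root and shrink the heap.
            cur = heap[0];
            heap[0] = heap[$ - 1];
            heap = heap[0 .. $ - 1];
        }
        if (heap.length) siftDown(0); // the s*log(m) part of Jason's bound
    }

    private void siftDown(size_t i)
    {
        for (;;)
        {
            size_t c = 2 * i + 1;
            if (c >= heap.length) break;
            if (c + 1 < heap.length && heap[c + 1][0] < heap[c][0]) ++c;
            if (heap[i][0] <= heap[c][0]) break;
            auto t = heap[i]; heap[i] = heap[c]; heap[c] = t;
            i = c;
        }
    }
}

void main()
{
    double[][] a = [ [1, 4, 7, 8], [1, 7], [1, 7, 8], [4], [7] ];
    double[] witness = [1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8];
    assert(equal(NWayUnion2(a), witness));
}

In Jason's terms, the cheap path covers the n-s pops that stay within one source, and the swap-plus-sift path covers the s switches, giving O(s*log(m) + n - s).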
Andrei Alexandrescu wrote:This is somewhat OT but I think it's an interesting problem. Consider the following data: double[][] a = [ [ 1, 4, 7, 8 ], [ 1, 7 ], [ 1, 7, 8], [ 4 ], [ 7 ], ]; We want to compute an n-way union, i.e., efficiently span all elements in all arrays in a, in sorted order. You can assume that each individual array in a is sorted. The output of n-way union should be: auto witness = [ 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8 ]; assert(equal(nWayUnion(a), witness[])); The STL and std.algorithm have set_union that does that for two sets: for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But n-way unions poses additional challenges. What would be a fast algorithm? (Imagine a could be a very large range of ranges). Needless to say, nWayUnion is a range :o). Finally, why would anyone care for something like this? AndreiI'd be interested if there's some way to remove duplicates.
May 23 2009
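Since the merged output is sorted, duplicates arrive adjacently, so a uniq-style pass right after the union removes them; std.algorithm's uniq does this generically for sorted input. A hand-rolled sketch over the NWayUnion range from above:

void main()
{
    double[][] a = [ [1, 4, 7, 8], [1, 7], [1, 7, 8], [4], [7] ];
    double[] distinct;
    // Sorted output means equal elements are consecutive:
    // keep only the first of each run.
    foreach (x; NWayUnion(a))
        if (distinct.length == 0 || distinct[$ - 1] != x)
            distinct ~= x;
    assert(distinct == [1.0, 4, 7, 8]);
}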
Andrei Alexandrescu wrote:This is somewhat OT but I think it's an interesting problem. Consider the following data: double[][] a = [ [ 1, 4, 7, 8 ], [ 1, 7 ], [ 1, 7, 8], [ 4 ], [ 7 ], ]; We want to compute an n-way union, i.e., efficiently span all elements in all arrays in a, in sorted order. You can assume that each individual array in a is sorted. The output of n-way union should be: auto witness = [ 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8 ]; assert(equal(nWayUnion(a), witness[])); The STL and std.algorithm have set_union that does that for two sets: for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But n-way unions poses additional challenges. What would be a fast algorithm? (Imagine a could be a very large range of ranges). Needless to say, nWayUnion is a range :o).If we'd know anything about the data, such as, the max value is always smaller than the total number of elements in the subarrays, then we'd probably more easily invent a decent algorithm. But the totally general algorithm has to be more inefficient. And constructing (not worst-case, but) tough-case data is trivial. For example, take a thousand subarrays, each a thousand elements long, containing random uints from the inclusive range 0..uint.max. It's likely that you then can pop the first element in a[0], but the second element is larger, and the first element in a[1] is also larger. This results in that a is still sorted, except that a[0] may belong to a later position. On the other hand, in your example, after popping all the ones, a is severely out of order. --------------------- You probably have to have an array of structs, where each struct contains a reference to a subarray, and a copy of the first value of this subarray. Sort the array of structs, and beginning with the first struct, pop from the pointed-to subarray as long as it contains elements equal to the first one. Then pop from the subarray that belongs to the following struct, and so on, until a subarray's first element already is larger. While popping, you obviously update the value copies, and append to an output array. Then sort the array of structs, trim it in case some of the structs now point to empty subarrays, and start over. This is only moderately efficient, but had Stepanov or anybody else come up with something essentially better, you'd have heard of it already. The only upshot of this algorithm is that the array a doesn't have to be sorted initially. (The subarrays, of course, still do.) The algorithm could be enhanced by, instead of an array of structs, having a singly linked list of these structs. Each time you pop, you move the relevant struct to another list, poppedStructs. When done, you only have to sort the poppedStructs list, and then /merge/ it with the list of unpopped structs, which still is sorted. This would obviously drastically improve efficiency, since it is likely that what needs to be sorted now is only a few items per round. Hmmm. Maybe this isn't all that inefficient, after all...
May 25 2009
Georg Wrede wrote:Andrei Alexandrescu wrote:You can assume that each array is sorted. AndreiThis is somewhat OT but I think it's an interesting problem. Consider the following data: double[][] a = [ [ 1, 4, 7, 8 ], [ 1, 7 ], [ 1, 7, 8], [ 4 ], [ 7 ], ]; We want to compute an n-way union, i.e., efficiently span all elements in all arrays in a, in sorted order. You can assume that each individual array in a is sorted. The output of n-way union should be: auto witness = [ 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8 ]; assert(equal(nWayUnion(a), witness[])); The STL and std.algorithm have set_union that does that for two sets: for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But n-way unions poses additional challenges. What would be a fast algorithm? (Imagine a could be a very large range of ranges). Needless to say, nWayUnion is a range :o).If we'd know anything about the data, such as, the max value is always smaller than the total number of elements in the subarrays, then we'd probably more easily invent a decent algorithm. But the totally general algorithm has to be more inefficient. And constructing (not worst-case, but) tough-case data is trivial. For example, take a thousand subarrays, each a thousand elements long, containing random uints from the inclusive range 0..uint.max.
May 25 2009
Andrei Alexandrescu wrote:Georg Wrede wrote:Err, you didn't comment on my algorithm, at the end. So, either it is worthless to an extent not deserving even a dismissal, or you didn't read the rest of the post.Andrei Alexandrescu wrote:You can assume that each array is sorted.This is somewhat OT but I think it's an interesting problem. Consider the following data: double[][] a = [ [ 1, 4, 7, 8 ], [ 1, 7 ], [ 1, 7, 8], [ 4 ], [ 7 ], ]; We want to compute an n-way union, i.e., efficiently span all elements in all arrays in a, in sorted order. You can assume that each individual array in a is sorted. The output of n-way union should be: auto witness = [ 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8 ]; assert(equal(nWayUnion(a), witness[])); The STL and std.algorithm have set_union that does that for two sets: for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But n-way unions poses additional challenges. What would be a fast algorithm? (Imagine a could be a very large range of ranges). Needless to say, nWayUnion is a range :o).If we'd know anything about the data, such as, the max value is always smaller than the total number of elements in the subarrays, then we'd probably more easily invent a decent algorithm. But the totally general algorithm has to be more inefficient. And constructing (not worst-case, but) tough-case data is trivial. For example, take a thousand subarrays, each a thousand elements long, containing random uints from the inclusive range 0..uint.max.
May 25 2009
Georg Wrede Wrote:Andrei Alexandrescu wrote:Don't get offended, he didn't respond to anyone else's algorithm either.Georg Wrede wrote:Err, you didn't comment on my algorithm, at the end. So, either it is worthless to an extent not deserving even a dismissal, or you didn't read the rest of the post.Andrei Alexandrescu wrote:You can assume that each array is sorted.This is somewhat OT but I think it's an interesting problem. Consider the following data: double[][] a = [ [ 1, 4, 7, 8 ], [ 1, 7 ], [ 1, 7, 8], [ 4 ], [ 7 ], ]; We want to compute an n-way union, i.e., efficiently span all elements in all arrays in a, in sorted order. You can assume that each individual array in a is sorted. The output of n-way union should be: auto witness = [ 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8 ]; assert(equal(nWayUnion(a), witness[])); The STL and std.algorithm have set_union that does that for two sets: for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But n-way unions poses additional challenges. What would be a fast algorithm? (Imagine a could be a very large range of ranges). Needless to say, nWayUnion is a range :o).If we'd know anything about the data, such as, the max value is always smaller than the total number of elements in the subarrays, then we'd probably more easily invent a decent algorithm. But the totally general algorithm has to be more inefficient. And constructing (not worst-case, but) tough-case data is trivial. For example, take a thousand subarrays, each a thousand elements long, containing random uints from the inclusive range 0..uint.max.
May 25 2009
Jason House wrote:Don't get offended, he didn't respond to anyone else's algorithm either.I just wanted to bring what I think is an interesting problem to the table; my response shouldn't be used as a yardstick for one's ability to define a good algorithm. Now that I feel pressured to give an answer, I'd say that BCS's answer is right on the money. It's the exact algorithm I put in std.algorithm (not checked in yet), and it works leanly and meanly on a large data set as we speak. Sean's, bearophile's, and your answers were also very close to BCS's, just that yours was a tad less detailed and the others came a bit later. Andrei
May 25 2009
Andrei Alexandrescu Wrote:Jason House wrote:Don't get offended, he didn't respond to anyone else's algorithm either.I just wanted to bring what I think is an interesting problem to the table; my response shouldn't be used as a yardstick for one's ability to define a good algorithm. Now that I feel pressured to give an answer,Sorry if I made you feel pressured. I didn't expect a response.I'd say that BCS's answer is right on the money. It's the exact algorithm I put in std.algorithm (not checked in yet), and it works leanly and meanly on a large data set as we speak. Sean's, bearophile's, and your answers were also very close to BCS's, just that yours was a tad less detailed and the others came a bit later.I bet BCS had the advantage of a real keyboard ;)
May 25 2009
Hello Jason,I bet BCS had the advantage of a real keyboard ;)you are 89% correct: http://en.wikipedia.org/wiki/Acer_Aspire_One
May 25 2009
Georg Wrede wrote:Andrei Alexandrescu wrote:I am sorry, at the first two reads I did not understand your algorithm. I noticed that it is more complicated than the canonical one, and also it necessitates storage to hold the entire output. At the third read I did understand it, or at least I understood how you mean it to work. It has at least one bug here: ======== Sort the array of structs, and beginning with the first struct, pop from the pointed-to subarray as long as it contains elements equal to the first one. Then pop from the subarray that belongs to the following struct, and so on, until a subarray's first element already is larger. ======== It would fail on the data: double[][] a = [ [ 1, 1, 4, 7, 8 ], [ 7 ], ]; You will pop all 1s and then you'll pop 7. You should be popping 4. AndreiYou can assume that each array is sorted.Err, you didn't comment on my algorithm, at the end. So, either it is worthless to an extent not deserving even a dismissal, or you didn't read the rest of the post.
May 25 2009
Andrei Alexandrescu wrote:Georg Wrede wrote:Andrei Alexandrescu wrote:I am sorry, at the first two reads I did not understand your algorithm. I noticed that it is more complicated than the canonical oneCanonical one. Righhhhhtttttt...... From the outset I thought that you did have a solution in mind. And I thought that the pause of a couple of days before any serious attempts only meant that everybody else understood it, too. So, nobody wanted to spend 5 days inventing something which would only be dismissed "because Knuth did it better 50 years ago".and also it necessitates storage to hold the entire output.I never said whether it'd be on stack or not. An array (or a linked list) of structs can be placed anywhere. Especially if you know (i.e. can easily count) the total number of items, you can allocate up front, be it on the heap or the stack.At the third read I did understand it, or at least I understood how you mean it to work. It has at least one bug here: ======== Sort the array of structs, and beginning with the first struct, pop from the pointed-to subarray as long as it contains elements equal to the first one. Then pop from the subarray that belongs to the following struct, and so on, until a subarray's first element already is larger. ======== It would fail on the data: double[][] a = [ [ 1, 1, 4, 7, 8 ], [ 7 ], ]; You will pop all 1s and then you'll pop 7. You should be popping 4.Sigh. So you didn't read that part carefully. You also didn't read the following part, at all. I admit, I relied too much on your (perceived, superior) ability, and thus described the algorithm top-level only. My first thought was to actually write the whole thing in D2, but then I remembered that (as an old man), I code things at most at 10% the speed of some of the young and strong here. I didn't want to spend 2 days on something I would demand of myself to do in 2 hours. Hence the pseudocode. My algorithm would pop both the 1's. Then it would see that the 4 is bigger and give up. It would then go to a[0+1], and find nothing there (i.e. 7 is larger than 1), so it'd consider this round done. Now, where I didn't "follow description" was, my algorithm was insensitive to there being several instances of the same value in a subarray. (My mistake, the "1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8" thing led me astray.) BUT IF YOU LOOK CAREFULLY at my code, you'll see that it actually doesn't make a difference. Why???? Because, having popped a number of items from a number of sub-arrays, the sub-arrays aren't in any order at all. So, there arises a need to do some sorting. INCIDENTALLY, which means, "You will pop all 1s and then you'll pop 7" *definitely* means you didn't read my post carefully. No biggie, my university Pascal teacher did the same thing. Since I'm a slow coder, I used to both present my algorithms in "regular peoples' language", and my math proofs in writing, instead of a lot of hairy equations. Both guys gave me 0's for a number of things, and then I had to set up an appointment to see both to explain and clarify. (The one thing I remember as a killer was about my thesis (STL), which I submitted, then went to Spain for 3 weeks, returned on an overnight flight with no sleep at all, and at 8am went to see my score, and found myself disqualified. I went straight to the professor and confronted him. 
He explained that this assignment meant one had to prove the case for a small value, prove it for the next value, then prove that if it was valid for both, then it would have to be valid for the one following them. When I said that "this is exactly what I've done here", he went silent for 10 minutes, then blushed, and gave me a "barely accepted score"... Sure, my first idea was to invent and quote real D2 code. But I thought that being a "regular" programmer, /everything/ in the code would be *rewritten* anyway, so why spend 4 days writing something I can describe while I'm f**ing reading the assignment. Then it turns out I get omitted the first time, the second time I'm not understood, the 3rd time I'm not understood either, and the 4th time I'm misunderstood. Were it not for D2, I'd suggest the reader ease up a bit. Save his and the audience's time. Getting it (err, reading it) right the first time (even if reading at half the speed) is much more efficient (for "himself", not to speak of the entire community and their combined wasted effort) than spawning a whole tree of response-subtrees, which (in the worst case) he has to answer individually. (So everybody loses.) =================================== Which all means, if somebody else actually did read my description, and did implement it, Andrei'd have a hard time beating their code. (As an additional hint, to save extra speed, one wouldn't wait till the turn is over, and then sort the poppedItems, but instead, insert-sort them while iterating over the structs list.) At the same time, one could keep a tab on the next "least-value" while "popping", so as not to have to compare to integers between the current and the next actual poppable value.
May 25 2009
Georg Wrede wrote:Andrei Alexandrescu wrote:I am sorry, at the first two reads I did not understand your algorithm. I noticed that it is more complicated than the canonical oneCanonical one. Righhhhhtttttt...... From the outset I thought that you did have a solution in mind. And I thought that the pause of a couple of days before any serious attempts only meant that everybody else understood it, too. So, nobody wanted to spend 5 days inventing something which would only be dismissed "because Knuth did it better 50 years ago".[snip] I don't even know how to start answering this and particularly what follows (30% of it I can't even figure where it comes from and where it goes), but allow me to retort with just a few short and clear points.

1. You were not set up. I did have a solution in mind, but it's not from Knuth or some other classic. I didn't actually find it anywhere (which is why I consider the problem interesting to discuss in this group), but it's rather simple and clear because five people in this group came to it independently.

2. I'm not sure - are you reproaching me that I didn't sit down to understand your algorithm (and pardon me, it wasn't the clearest exposition there is and it still has more than a few kinks to be ironed out)? And are you comparing the situation with your professor etc.? And am I supposed to thank you for not making a biggie out of it? Well your instructor was supposed to read your work, and I am not by any stretch of imagination. That was rather inappropriate.

3. If you are alleging that if somebody implemented your algorithm, it would run faster than BCS's, you are simply mistaken. Yours simply does more work than necessary, and no amount of buts and ifs can cover for that.

Andrei
May 25 2009
Andrei Alexandrescu wrote:3. If you are alleging that if somebody implemented your algorithm, it would run faster than BCS's, you are simply mistaken. Yours simply does more work than necessary, and no amount of buts and ifs can cover for that.Maybe I didn't understand BCS's algorithm. To me it looked slower. I'll have to wait till I see the one actually implemented, and do a comparison. If nothing else, I'll then see how much mine is slower.
May 26 2009
Andrei Alexandrescu wrote:Well your instructor was supposed to read your work, and I am not by any stretch of imagination. That was rather inappropriate.Maybe I misunderstood the point of this OT thread. One could believe the thread was created to look at alternative algorithms by people.Finally, why would anyone care for something like this?Has this been answered already? What was the answer?
May 26 2009
Georg Wrede:You probably have to have an array of structs, where each struct contains a reference to a subarray, and a copy of the first value of this subarray.Keeping a cached copy of the first item of the subarray ("subrange") may be better in some situations and worse (compared to keeping just a reference/pointer) in other situations. And it's not very simple to design a data structure able to adapt itself dynamically to two such situations.Sort the array of structs, and beginning with the first struct, pop from the pointed-to subarray as long as it contains elements equal to the first one. Then pop from the subarray that belongs to the following struct, and so on, until a subarray's first element already is larger. While popping, you obviously update the value copies, and append to an output array.Better to yield lazily the output items. And probably keeping them in a heap is more efficient than scanning them all.The algorithm could be enhanced by, instead of an array of structs, having a singly linked list of these structs.Today linked lists are mostly dead. In most situations arrays (or arrays with holes, plus a few other smart things) are more efficient, both in memory and speed. -------------------- For Andrei: in my last post I have suggested to remove from the heap the references to the sub-ranges as soon as they are exhausted. But lots of tests of mine have shown me that's often not the most efficient strategy: http://www.fantascienza.net/leonardo/ar/list_deletions.html So better things are possible. The most refined strategy may be overkill, but some medium strategy may be OK. Bye, bearophile
May 25 2009
bearophile wrote:For Andrei: in my last post I have suggested to remove from the heap the references to the sub-ranges as soon they are exhausted. But lot of tests of mine have shown me that's often not the most efficient strategy: http://www.fantascienza.net/leonardo/ar/list_deletions.html So better things are possible. The most refined strategy may be overkill, but some medium strategy may be OK.I don't think that applies here. Unstable removal is O(1) and restoring the heap property is O(log n) in theory and very close to O(1) in practice. Andrei
May 25 2009
Andrei Alexandrescu:Needless to say, nWayUnion is a range :o).It's better for nWayUnion to be a lazy iterable. As probably others have already said, you can keep a heap of pointers, and compute the heap invariant using as opCmp a function that compares the values they point to. When nWayUnion yields an item, the corresponding pointer goes one step forward and the heap invariant has to be restored. When one array finishes, the corresponding pointer is removed before restoring the heap invariant. The algorithm can be generalized: the input can be an array (random access range? memory mapped file and arrays are among the most important cases) of sorted iterables (even lazy ones), that is an array of lazy sorted ranges. The most general input is a lazy range of lazy sorted ranges; in this situation nWayUnion has to first "duplicate" it into an eager array of such ranges, and then turn it into a heap as before. I guess in most situations you don't have more than thousands of such sorted iterables, so creating such a temporary array isn't too bad. (nWayUnion may also have a small fixed size array of pointers/indexes on the stack, for the simple case of an array of 3-5 arrays). Bye, bearophile
May 25 2009
bearophile wrote:The most general input is a lazy range of lazy sorted ranges; in this situation nWayUnion has to first "duplicate" it into an eager array of such ranges, and then turn it into a heap as before. I guess in most situations you don't have more than thousands of such sorted iterables, so creating such a temporary array isn't too bad. (nWayUnion may also have a small fixed size array of pointers/indexes on the stack, for the simple case of an array of 3-5 arrays).The policy of std.algorithm is to not surreptitiously allocate memory if it can help it. So the user is left responsible for creating a range of ranges with random access. Andrei
May 25 2009