
digitalmars.D - [OT] n-way union

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
This is somewhat OT but I think it's an interesting problem. Consider 
the following data:

double[][] a =
[
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
];

We want to compute an n-way union, i.e., efficiently span all elements 
in all arrays in a, in sorted order. You can assume that each individual 
array in a is sorted. The output of n-way union should be:

auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
];
assert(equal(nWayUnion(a), witness[]));

The STL and std.algorithm have set_union that does that for two sets: 
for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
an n-way union poses additional challenges. What would be a fast 
algorithm? (Imagine a could be a very large range of ranges).
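
For concreteness, the two-way case looks something like this (a sketch using std.algorithm's setUnion as it behaved around the time of this thread, the D counterpart of STL's set_union); folding it pairwise over m arrays works, but does roughly O(n*m) work in the worst case, which is exactly what an n-way version should avoid:

import std.algorithm : equal, setUnion;

unittest
{
    double[][] a = [[1, 4, 7, 8], [1, 7], [1, 7, 8], [4], [7]];
    // two-way union of the first two arrays, as described above
    assert(equal(setUnion(a[0], a[1]), [1.0, 1, 4, 7, 7, 8]));
}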

Needless to say, nWayUnion is a range :o).

Finally, why would anyone care for something like this?


Andrei
May 23 2009
next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:
 
 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];
 
 We want to compute an n-way union, i.e., efficiently span all elements 
 in all arrays in a, in sorted order. You can assume that each individual 
 array in a is sorted. The output of n-way union should be:
 
 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));
 
 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).
It seems like there are two basic options: either merge the arrays (as per merge sort) and deal with multiple passes across the same data elements, or insert the elements from each array into a single destination array and deal with a bunch of memmove operations.

The merge option is kind of interesting because it could benefit from a parallel range of sorts. Each front() op could actually return a range which contained the front element of each non-empty range. Pass this to a min() op accepting a range and drop the min element into the destination. The tricky part would be working things in such a way that the originating range could have popFront() called once the insertion had occurred.
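
Something like this, as a very rough sketch (naiveMerge is a made-up name, and the linear scan over the fronts means O(n*m) comparisons, so it's the shape of the idea rather than a fast version):

import std.range;

double[] naiveMerge(double[][] a)
{
    a = a.dup;                  // copy the outer array so the caller's slices stay put
    double[] result;
    for (;;)
    {
        size_t best = size_t.max;
        foreach (i, r; a)       // the "parallel range" of fronts
            if (!r.empty && (best == size_t.max || r.front < a[best].front))
                best = i;
        if (best == size_t.max)
            break;              // every sub-range is exhausted
        result ~= a[best].front;    // drop the min element into the destination
        a[best].popFront();         // popFront on the originating range
    }
    return result;
}
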
 Needless to say, nWayUnion is a range :o).
 
 Finally, why would anyone care for something like this?
Other than mergeSort? I'd think that being able to perform union of N sets in unison would be a nice way to eliminate arbitrary restrictions.
May 23 2009
parent reply Sean Kelly <sean invisibleduck.org> writes:
I have a better idea which is pretty much in line with the heap-based 
solutions already posted.  Place all the ranges in a min priority queue 
where the priority of a range is equal to the value of its top element. 
Pop the value off the range at the front of the queue. If that range is now empty, remove it from the queue. If it still contains elements, fix up the priority queue to restore the invariant (since the priority of the head range may have changed). The temporary space would just be of size N, where N is the number of ranges, and the number of duplicate compares is minimized.
May 23 2009
parent Sean Kelly <sean invisibleduck.org> writes:
Sean Kelly wrote:
 I have a better idea which is pretty much in line with the heap-based 
 solutions already posted.  Place all the ranges in a min priority queue 
 where the priority of a range is equal to the value of its top element. 
  Pop the value off the range at the front of the queue. If that range is now 
 empty, remove it from the queue. If it still contains elements, fix up the 
 priority queue to restore the invariant (since the priority of the head 
 range may have changed). The temporary space would just be of size N, 
 where N is the number of ranges, and the number of duplicate compares is 
 minimized.
hehe... "pretty much in line" turns out to be "exactly the same." Oh well.
May 23 2009
prev sibling next sibling parent Jason House <jason.james.house gmail.com> writes:
Andrei Alexandrescu Wrote:

 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:
 
 double[][] a =
 [
      [ 1, 4, 7, 8 ],
      [ 1, 7 ],
      [ 1, 7, 8],
      [ 4 ],
      [ 7 ],
 ];
 
 We want to compute an n-way union, i.e., efficiently span all elements 
 in all arrays in a, in sorted order. You can assume that each individual 
 array in a is sorted. The output of n-way union should be:
 
 auto witness = [
      1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));
 
 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).
My first thought was to throw everything in a heap map with key=r.front and value=r optimal.
 Needless to say, nWayUnion is a range :o).
 
 Finally, why would anyone care for something like this?
 
 
 Andrei
May 23 2009
prev sibling next sibling parent reply BCS <none anon.com> writes:
Hello Andrei,

 This is somewhat OT but I think it's an interesting problem. Consider
 the following data:
 
 double[][] a =
 [
 [ 1, 4, 7, 8 ],
 [ 1, 7 ],
 [ 1, 7, 8],
 [ 4 ],
 [ 7 ],
 ];
 We want to compute an n-way union, i.e., efficiently span all elements
 in all arrays in a, in sorted order. You can assume that each
 individual array in a is sorted. The output of n-way union should be:
 
 auto witness = [
 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));
 
 The STL and std.algorithm have set_union that does that for two sets:
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But
 n-way unions poses additional challenges. What would be a fast
 algorithm? (Imagine a could be a very large range of ranges).
 
 Needless to say, nWayUnion is a range :o).
 
 Finally, why would anyone care for something like this?
 
 Andrei
 
In pseudocode:

1. build a min heap out of the elements based on the first element of each array
2. remove the first value from the root of the heap,
3. if empty array, drop it
4. if not empty, re-heap
5. goto 2 if anything left
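
And roughly the same thing in D, leaning on std.container's heapify/BinaryHeap (front, replaceFront, removeFront); just a sketch of the steps above, not the std.algorithm version:

import std.algorithm : filter;
import std.array : array;
import std.container : heapify;
import std.range;

double[] nWayUnionEager(double[][] a)
{
    // 1. min-heap of the non-empty arrays, keyed on each one's first element
    auto heap = heapify!((x, y) => x.front > y.front)(a.filter!(r => !r.empty).array);

    double[] result;
    while (heap.length)
    {
        auto r = heap.front;        // 2. the array holding the current minimum
        result ~= r.front;
        r.popFront();
        if (r.empty)
            heap.removeFront();     // 3. array exhausted: drop it
        else
            heap.replaceFront(r);   // 4. re-heap with the advanced array
    }                               // 5. repeat while anything is left
    return result;
}

unittest
{
    double[][] a = [[1, 4, 7, 8], [1, 7], [1, 7, 8], [4], [7]];
    assert(nWayUnionEager(a) == [1.0, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8]);
}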
May 23 2009
parent reply Don <nospam nospam.com> writes:
BCS wrote:
 Hello Andrei,
 
 This is somewhat OT but I think it's an interesting problem. Consider
 the following data:

 double[][] a =
 [
 [ 1, 4, 7, 8 ],
 [ 1, 7 ],
 [ 1, 7, 8],
 [ 4 ],
 [ 7 ],
 ];
 We want to compute an n-way union, i.e., efficiently span all elements
 in all arrays in a, in sorted order. You can assume that each
 individual array in a is sorted. The output of n-way union should be:

 auto witness = [
 1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));

 The STL and std.algorithm have set_union that does that for two sets:
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But
 n-way unions poses additional challenges. What would be a fast
 algorithm? (Imagine a could be a very large range of ranges).

 Needless to say, nWayUnion is a range :o).

 Finally, why would anyone care for something like this?

 Andrei
 in pseudocode

 1. build a min heap out of the elements based on the first element of each array
 2. remove the first value from the root of the heap,
 3. if empty array, drop it
 4. if not empty re-heap
 5. goto 2 if anything left
In practice, I bet it'll go faster with a couple more steps:

2. remove the first value (from array X) from the root of the heap,
2a. Determine new root of the heap (don't need to do a full re-heap yet).
2b. move all values from X until empty or >= new root.
4. reheap

There are interesting cases like:

[5,6,7,8,9,9,......9]
[30, 31,32,33,34...38]
[1,1,1,1,1,1,1,1,1,1,...1]
[10,11,12,13,14,15,16,16,......., 16]

where in theory, the total number of comparisons required is a function only of the number of arrays, and is independent of the actual number of elements. If it's common to have runs where consecutive values come from a single array, and the arrays are random access, you could do MUCH better than O(n).
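
Sketched on top of the heapify-based loop a couple of posts up (same std.container assumptions; removeAny and insert are assumed to behave as documented, and the function name is made up): pull the winning array X off the heap, drain it while it stays below the new root, then put what's left back, so the heap is only touched once per run:

import std.algorithm : filter;
import std.array : array;
import std.container : heapify;
import std.range;

double[] nWayUnionRuns(double[][] a)
{
    auto heap = heapify!((x, y) => x.front > y.front)(a.filter!(r => !r.empty).array);
    double[] result;
    while (heap.length)
    {
        auto x = heap.removeAny();              // 2.  array X currently holding the minimum
        if (heap.empty) { result ~= x; break; } //     only X left: copy its tail and stop
        immutable newRoot = heap.front.front;   // 2a. peek at the new root, no full re-heap
        do
        {
            result ~= x.front;                  // 2b. drain X while it stays below newRoot
            x.popFront();
        } while (!x.empty && x.front < newRoot);
        if (!x.empty)
            heap.insert(x);                     // 4.  re-heap only once per run
    }
    return result;
}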
May 26 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 BCS wrote:
 in pseudocode

 1. build a min heap out of the elements based on the first element of 
 each array
 2. remove the first value from the root of the heap,
 3. if empty array, drop it
 4. if not empty re-heap
 5. goto 2 if anything left
 In practice, I bet it'll go faster with a couple more steps:
 2. remove the first value (from array X) from the root of the heap,
 2a. Determine new root of the heap (don't need to do a full re-heap yet).
 2b. move all values from X until empty or >= new root.
 4. reheap
Correct. Unfortunately my heap primitives didn't quite allow that so currently I do it the "hard" way: pop from top, remove top from heap, reinsert (if not empty). But this is definitely an optimization worth doing. I think I'll implement it.
 There are interesting cases like:
 [5,6,7,8,9,9,......9]
 [30, 31,32,33,34...38]
 [1,1,1,1,1,1,1,1,1,1,...1]
 [10,11,12,13,14,15,16,16,......., 16]
 
 where in theory, the total number of comparisons required is a function 
 only of the number of arrays, and is independent of the actual number of 
 elements.
 If it's common to have runs where consecutive values come from a single 
 array, and the arrays are random access, you could do MUCH better than 
 O(n).
I think you can't do better. Complexity is O(n) at a minimum because you need to go through each element. The heap-based algorithm has O(n log m). What you could improve on is the log m factor (m = the number of sets).

Andrei
May 26 2009
parent reply Don <nospam nospam.com> writes:
Andrei Alexandrescu wrote:
 Don wrote:
 BCS wrote:
 in pseudocode

 1. build a min heap out of the elements based on the first element of 
 each array
 2. remove the first value from the root of the heap,
 3. if empty array, drop it
 4. if not empty re-heap
 5. goto 2 if anything left
 In practice, I bet it'll go faster with a couple more steps:
 2. remove the first value (from array X) from the root of the heap,
 2a. Determine new root of the heap (don't need to do a full re-heap yet).
 2b. move all values from X until empty or >= new root.
 4. reheap
Correct. Unfortunately my heap primitives didn't quite allow that so currently I do it the "hard" way: pop from top, remove top from heap, reinsert (if not empty). But this is definitely an optimization worth doing. I think I'll implement it.
 There are interesting cases like:
 [5,6,7,8,9,9,......9]
 [30, 31,32,33,34...38]
 [1,1,1,1,1,1,1,1,1,1,...1]
 [10,11,12,13,14,15,16,16,......., 16]

 where in theory, the total number of comparisons required is a 
 function only of the number of arrays, and is independent of the 
 actual number of elements.
 If it's common to have runs where consecutive values come from a 
 single array, and the arrays are random access, you could do MUCH 
 better than O(n).
I think you can't do better. Complexity is O(n) at a minimum because you need to go through each element. The heap-based algorithm has O(n log m). What you could improve on is the log m factor (m = the number of sets).
Depending on what "going through each element" means. Certainly, if you need to move them, you can't beat O(n) moves, but you can definitely get O(m*m) << O(n) comparisons in the best case I described. And that would also be the number of splicing operations.

You could approximate this without hurting the worst case too much, by (for example) once you get a run of more than K entries from the same array, step forward 2 entries; if that's still less than the root, copy both to the output, then step forward 4 entries, etc. That should give a best-case O(m log n) comparisons/splices. Which might not be of any practical use, of course.
May 26 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 Andrei Alexandrescu wrote:
 I think you can't do better. Complexity is O(n) at a minimum because 
 you need to go through each element. The heap-based algorithm has O(n 
 log m). What you could improve on is the log m factor (m = the number 
 of sets).
Depending on what "going through each element" means. Certainly, if you need to move them, you can't beat O(n) moves, but you can definitely get O(m*m) << O(n) comparisons in the best case I described. And that would also be the number of splicing operations.
Oh, I see. So yes, comparisons can be reduced. So how exactly do you think that can be done? By binary searching the upper bound of the head to find a long run of identical elements?

Andrei
May 26 2009
next sibling parent Don <nospam nospam.com> writes:
Andrei Alexandrescu wrote:
 Don wrote:
 Andrei Alexandrescu wrote:
 I think you can't do better. Complexity is O(n) at a minimum because 
 you need to go through each element. The heap-based algorithm has O(n 
 log m). What you could improve on is the log m factor (m = the number 
 of sets).
Depending on what "going through each element" means. Certainly, if you need to move them, you can't beat O(n) moves, but you can definitely get O(m*m) << O(n) comparisons in the best case I described. And that would also be the number of splicing operations.
Oh, I see. So yes, comparisons can be reduced. So how exactly do you think that can be done? By binary searching the upper bound of the head to find a long run of identical elements?
That was my original thought, but it's easy to come up with cases that would perform really badly -- so that you do a binary search for every two elements. Which was why I suggested repeated doubling of the stride distance; I think you can prove that's never worse than a linear search of every element. Although, if it were worth doing, I'd imagine that implementations of STL set_union would already be doing it.

Anyway, it's a really interesting problem.
May 26 2009
prev sibling parent reply Jason House <jason.james.house gmail.com> writes:
Andrei Alexandrescu Wrote:

 Don wrote:
 Andrei Alexandrescu wrote:
 I think you can't do better. Complexity is O(n) at a minimum because 
 you need to go through each element. The heap-based algorithm has O(n 
 log m). What you could improve on is the log m factor (m = the number 
 of sets).
Depending on what "going through each element" means. Certainly, if you need to move them, you can't beat O(n) moves, but you can definitely get O(m*m) << O(n) comparisons in the best case I described. And that would also be the number of splicing operations.
Oh, I see. So yes, comparisons can be reduced. So how exactly do you think that can be done? By binary searching the upper bound of the head to find a long run of identical elements? Andrei
This conversation made me realize a simple optimization (forgive me if it's what Don was implying).

Instead of storing m ranges in a heap, store only m-1 ranges; the range that currently holds the minimum is kept separate, outside the heap. Each pop operation compares that range's new front with the heap's min element. This avoids re-heaping when the next element comes from the same range. If s is the number of times the source range must switch, then the performance is O(s*log(m) + n - s). The initial heaping is O(m*log(m)), but that's never worse than O(s*log(m)). Technically, the re-heaping steps are log(m-1), but that isn't important with big O.
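
A rough sketch of that, reusing the heapify-based loop from earlier in the thread (the function name is invented, and removeAny/replaceFront are assumed to behave as std.container documents them):

import std.algorithm : filter;
import std.array : array;
import std.container : heapify;
import std.range;

double[] nWayUnionCached(double[][] a)
{
    auto heap = heapify!((x, y) => x.front > y.front)(a.filter!(r => !r.empty).array);
    if (heap.empty) return null;
    auto cur = heap.removeAny();        // the m-th range lives outside the heap
    double[] result;
    for (;;)
    {
        result ~= cur.front;
        cur.popFront();
        if (cur.empty)
        {
            if (heap.empty) break;
            cur = heap.removeAny();     // source switch: one O(log m) heap op
        }
        else if (!heap.empty && heap.front.front < cur.front)
        {
            auto next = heap.front;     // the heap now holds the minimum
            heap.replaceFront(cur);     // one O(log m) fix-up per switch
            cur = next;
        }
        // otherwise cur still holds the minimum and the heap is untouched
    }
    return result;
}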
May 26 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Jason House wrote:
 Andrei Alexandrescu Wrote:
 
 Don wrote:
 Andrei Alexandrescu wrote:
 I think you can't do better. Complexity is O(n) at a minimum
 because you need to go through each element. The heap-based
 algorithm has O(n log m). What you could improve on is the log
 m factor (m = the number of sets).
Depending on what "going through each element" means. Certainly, if you need to move them, you can't beat O(n) moves, but you can definitely get O(m*m) << O(n) comparisons in the best case I described. And that would also be the number of splicing operations.
Oh, I see. So yes, comparisons can be reduced. So how exactly do you think that can be done? By binary searching the upper bound of the head to find a long run of identical elements? Andrei
This conversation made me realize a simple optimization (forgive me if it's what Don was implying) Instead of storing m ranges in a heap, store only m-1 ranges. The range that would be the min if the heap is kept separate. Each pop operation will compare the new front with the min heap element. This allows avoiding the re-heaping when the next element comes from the same range. If s is the number of times the source range must switch, then the performance is O( s*log(m) + n - s). The initial heaping is O(m*log(m)), but that's never worse than O(s*log(m)). Technically, the re-heaping steps are log(m-1), but that isn't important with big O.
Cool! That can actually be implemented as a micro-optimization with the classic heap: right after you extract from the top (arr[0]), you look at its children (which are arr[1] and arr[2]). If both front()'s are still no less than yours, everything stays as it was and you're home free. I'll implement this, thanks.

Andrei
May 26 2009
parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Andrei Alexandrescu wrote:
 Jason House wrote:
 This conversation made me realize a simple optimization (forgive me
 if it's what Don was implying)

 Instead of storing m ranges in a heap, store only m-1 ranges. The
 range that would be the min if the heap is kept separate. Each pop
 operation will compare the new front with the min heap element. This
 allows avoiding the re-heaping when the next element comes from the
 same range. If s is the number of times the source range must switch,
 then the performance is O( s*log(m) + n - s). The initial heaping is
 O(m*log(m)), but that's never worse than O(s*log(m)).  Technically,
 the re-heaping steps are log(m-1), but that isn't important with big
 O.
Cool! That can actually be implemented as a micro-optimization with the classic heap: right after you extract from the top (arr[0]), you look at its children (which are arr[1] and arr[2]). If both front()'s are still no less than yours, everything stays as it was and you're home free. I'll implement this, thanks.
If you keep it separate, you only need one comparison to determine whether the current range still contains the minimum element. Just saying :).
May 26 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Frits van Bommel wrote:
 Andrei Alexandrescu wrote:
 Jason House wrote:
 This conversation made me realize a simple optimization (forgive me
 if it's what Don was implying)

 Instead of storing m ranges in a heap, store only m-1 ranges. The
 range that would be the min if the heap is kept separate. Each pop
 operation will compare the new front with the min heap element. This
 allows avoiding the re-heaping when the next element comes from the
 same range. If s is the number of times the source range must switch,
 then the performance is O( s*log(m) + n - s). The initial heaping is
 O(m*log(m)), but that's never worse than O(s*log(m)).  Technically,
 the re-heaping steps are log(m-1), but that isn't important with big
 O.
Cool! That can actually be implemented as a micro-optimization with the classic heap: right after you extract from the top (arr[0]), you look at its children (which are arr[1] and arr[2]). If both front()'s are still no less than yours, everything stays as it was and you're home free. I'll implement this, thanks.
If you keep it separate, you only need one comparison to determine the current range still contains the minimum element. Just saying :).
Good point, and also I don't need to worry about adding new heap primitives.

Andrei
May 26 2009
prev sibling next sibling parent Robert Fraser <fraserofthenight gmail.com> writes:
Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:
 
 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];
 
 We want to compute an n-way union, i.e., efficiently span all elements 
 in all arrays in a, in sorted order. You can assume that each individual 
 array in a is sorted. The output of n-way union should be:
 
 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));
 
 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).
 
 Needless to say, nWayUnion is a range :o).
 
 Finally, why would anyone care for something like this?
 
 
 Andrei
I'd be interested if there's some way to remove duplicates.
May 23 2009
prev sibling next sibling parent reply Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:
 
 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];
 
 We want to compute an n-way union, i.e., efficiently span all elements 
 in all arrays in a, in sorted order. You can assume that each individual 
 array in a is sorted. The output of n-way union should be:
 
 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));
 
 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).
 
 Needless to say, nWayUnion is a range :o).
If we knew anything about the data, such as that the max value is always smaller than the total number of elements in the subarrays, then we could probably invent a decent algorithm more easily. But the totally general algorithm has to be less efficient. And constructing (not worst-case, but) tough-case data is trivial. For example, take a thousand subarrays, each a thousand elements long, containing random uints from the inclusive range 0..uint.max. It's likely that you can then pop the first element of a[0], but the second element is larger, and the first element of a[1] is also larger. The result is that a is still sorted, except that a[0] may belong at a later position. On the other hand, in your example, after popping all the ones, a is severely out of order.

---------------------

You probably have to have an array of structs, where each struct contains a reference to a subarray, and a copy of the first value of this subarray.

Sort the array of structs, and beginning with the first struct, pop from the pointed-to subarray as long as it contains elements equal to the first one. Then pop from the subarray that belongs to the following struct, and so on, until a subarray's first element already is larger. While popping, you obviously update the value copies, and append to an output array. Then sort the array of structs, trim it in case some of the structs now point to empty subarrays, and start over.

This is only moderately efficient, but had Stepanov or anybody else come up with something essentially better, you'd have heard of it already. The only upshot of this algorithm is that the array a doesn't have to be sorted initially. (The subarrays, of course, still do.)

The algorithm could be enhanced by, instead of an array of structs, having a singly linked list of these structs. Each time you pop, you move the relevant struct to another list, poppedStructs. When done, you only have to sort the poppedStructs list, and then /merge/ it with the list of unpopped structs, which is still sorted. This would obviously drastically improve efficiency, since it is likely that what needs to be sorted now is only a few items per round.

Hmmm. Maybe this isn't all that inefficient, after all...
May 25 2009
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:

 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];

 We want to compute an n-way union, i.e., efficiently span all elements 
 in all arrays in a, in sorted order. You can assume that each 
 individual array in a is sorted. The output of n-way union should be:

 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));

 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).

 Needless to say, nWayUnion is a range :o).
If we'd know anything about the data, such as, the max value is always smaller than the total number of elements in the subarrays, then we'd probably more easily invent a decent algorithm. But the totally general algorithm has to be more inefficient. And constructing (not worst-case, but) tough-case data is trivial. For example, take a thousand subarrays, each a thousand elements long, containing random uints from the inclusive range 0..uint.max.
You can assume that each array is sorted.

Andrei
May 25 2009
parent reply Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:

 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];

 We want to compute an n-way union, i.e., efficiently span all 
 elements in all arrays in a, in sorted order. You can assume that 
 each individual array in a is sorted. The output of n-way union 
 should be:

 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));

 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).

 Needless to say, nWayUnion is a range :o).
If we'd know anything about the data, such as, the max value is always smaller than the total number of elements in the subarrays, then we'd probably more easily invent a decent algorithm. But the totally general algorithm has to be more inefficient. And constructing (not worst-case, but) tough-case data is trivial. For example, take a thousand subarrays, each a thousand elements long, containing random uints from the inclusive range 0..uint.max.
You can assume that each array is sorted.
Err, you didn't comment on my algorithm, at the end. So, either it is worthless to an extent not deserving even a dismissal, or you didn't read the rest of the post.
May 25 2009
next sibling parent reply Jason House <jason.james.house gmail.com> writes:
Georg Wrede Wrote:

 Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 This is somewhat OT but I think it's an interesting problem. Consider 
 the following data:

 double[][] a =
 [
     [ 1, 4, 7, 8 ],
     [ 1, 7 ],
     [ 1, 7, 8],
     [ 4 ],
     [ 7 ],
 ];

 We want to compute an n-way union, i.e., efficiently span all 
 elements in all arrays in a, in sorted order. You can assume that 
 each individual array in a is sorted. The output of n-way union 
 should be:

 auto witness = [
     1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8
 ];
 assert(equal(nWayUnion(a), witness[]));

 The STL and std.algorithm have set_union that does that for two sets: 
 for example, set_union(a[0], a[1]) outputs [ 1, 1, 4, 7, 7, 8 ]. But 
 n-way unions poses additional challenges. What would be a fast 
 algorithm? (Imagine a could be a very large range of ranges).

 Needless to say, nWayUnion is a range :o).
If we'd know anything about the data, such as, the max value is always smaller than the total number of elements in the subarrays, then we'd probably more easily invent a decent algorithm. But the totally general algorithm has to be more inefficient. And constructing (not worst-case, but) tough-case data is trivial. For example, take a thousand subarrays, each a thousand elements long, containing random uints from the inclusive range 0..uint.max.
You can assume that each array is sorted.
Err, you didn't comment on my algorithm, at the end. So, either it is worthless to an extent not deserving even a dismissal, or you didn't read the rest of the post.
Don't get offended; he didn't respond to anyone else's algorithm either.
May 25 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Jason House wrote:
 Don't get offended, he didn't respond to anyone else's algorithm either. 
I just wanted to bring what I think is an interesting problem to the table; my response shouldn't be used as a yardstick for one's ability to define a good algorithm.

Now that I feel pressured to give an answer, I'd say that BCS's answer is right on the money. It's the exact algorithm I put in std.algorithm (not checked in yet), and it works leanly and meanly on a large data set as we speak. Sean's, bearophile's, and your answers were also very close to BCS's, just that yours was a tad less detailed and the others came a bit later.

Andrei
May 25 2009
parent reply Jason House <jason.james.house gmail.com> writes:
Andrei Alexandrescu Wrote:

 Jason House wrote:
 Don't get offended, he didn't respond to anyone else's algorithm either. 
I just wanted to bring what I think is an interesting problem to the table; my response shouldn't be used as a yardstick for one's ability to define a good algorithm. Now that I feel pressured to give an answer,
Sorry if I made you feel pressured. I didn't expect a response.
 I'd say that BCS's answer 
 is right on the money. It's the exact algorithm I put in std.algorithm 
 (not checked in yet), and it works leanly and meanly on a large data set 
 as we speak. Sean's, bearophile's, and your answers were also very close 
 to BCS's, just that yours was a tad less detailed and the others came a 
 bit later.
I bet BCS had the advantage of a real keyboard ;)
May 25 2009
parent BCS <none anon.com> writes:
Hello Jason,

 I bet BCS had the advantage of a real keyboard ;)
 
you are 89% correct: http://en.wikipedia.org/wiki/Acer_Aspire_One
May 25 2009
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 You can assume that each array is sorted.
Err, you didn't comment on my algorithm, at the end. So, either it is worthless to an extent not deserving even a dismissal, or you didn't read the rest of the post.
I am sorry, at the first two reads I did not understand your algorithm. I noticed that it is more complicated than the canonical one, and also that it necessitates storage to hold the entire output. At the third read I did understand it, or at least I understood how you mean it to work. It has at least one bug here:

========
Sort the array of structs, and beginning with the first struct, pop from the pointed-to subarray as long as it contains elements equal to the first one. Then pop from the subarray that belongs to the following struct, and so on, until a subarray's first element already is larger.
========

It would fail on the data:

double[][] a =
[
     [ 1, 1, 4, 7, 8 ],
     [ 7 ],
];

You will pop all 1s and then you'll pop 7. You should be popping 4.

Andrei
May 25 2009
parent reply Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 You can assume that each array is sorted.
Err, you didn't comment on my algorithm, at the end. So, either it is worthless to an extent not deserving even a dismissal, or you didn't read the rest of the post.
I am sorry, at the first two reads I did not understand your algorithm. I noticed that it is more complicated than the canonical one
Canonical one. Righhhhhtttttt......

From the outset I thought that you did have a solution in mind. And I thought that the pause of a couple of days before any serious attempts only meant that everybody else understood it, too. So, nobody wanted to spend 5 days inventing something which would only be dismissed "because Knuth did it better 50 years ago".
 and also it necessitates storage to hold the entire output.
I never said whether it'd be on stack or not. An array (or a linked list) of structs can be placed anywhere. Especially if you know (i.e. can easily count) the total number of items, you can allocate up front, be it on the heap or the stack.
 At the third read I did understand it, or at least I understood how
 you mean it to work. It has at least one bug here:
 
 ========
 Sort the array of structs, and beginning with the first struct, pop from 
 the pointed-to subarray as long as it contains elements equal to the 
 first one. Then pop from the subarray that belongs to following struct, 
 and so on, until a subbarray's first element already is larger.
 ========
 
 It would fail on the data:
 
 double[][] a =
 [
     [ 1, 1, 4, 7, 8 ],
     [ 7 ],
 ];
 
 You will pop all 1s and then you'll pop 7. You should be popping 4.
Sigh. So you didn't read that part carefully. You also didn't read the following part, at all. I admit, I relied too much on your (perceived, superior) ability, and thus described the algorithm top-level only. My first thought was to actually write the whole thing in D2, but then I remembered that (as an old man) I code things at most at 10% the speed of some of the young and strong here. I didn't want to spend 2 days on something I would demand of myself to do in 2 hours. Hence the pseudocode.

My algorithm would pop both the 1's. Then it would see that the 4 is bigger and give up. It would then go to a[0+1], and find nothing there (i.e. 7 is larger than 1), so it'd consider this round done.

Now, where I didn't "follow description" was, my algorithm was insensitive to there being several instances of the same value in a subarray. (My mistake, the "1, 1, 1, 4, 4, 7, 7, 7, 7, 8, 8" thing led me astray.) BUT IF YOU LOOK CAREFULLY at my code, you'll see that it actually doesn't make a difference. Why???? Because, having popped a number of items from a number of sub-arrays, the sub-arrays aren't in any order at all. So, there arises a need to do some sorting. INCIDENTALLY, which means, "You will pop all 1s and then you'll pop 7" *definitely* means you didn't read my post carefully.

No biggie, my university Pascal teacher did the same thing. Since I'm a slow coder, I used to both present my algorithms in "regular peoples' language", and my math proofs in writing, instead of a lot of hairy equations. Both guys gave me 0's for a number of things, and then I had to set up an appointment to see both to explain and clarify. (The one thing I remember as a killer was about my thesis (STL), which I submitted, then went to Spain for 3 weeks, returned on an overnight flight with no sleep at all, and at 8am went to see my score, and found myself disqualified. I went straight to the professor and confronted him. He explained that this assignment meant one had to prove the case for a small value, prove it for the next value, then prove that if it was valid for both, then it would have to be valid for the one following them. When I said that "this is exactly what I've done here", he went silent for 10 minutes, then blushed, and gave me a "barely accepted" score...)

Sure, my first idea was to invent and quote real D2 code. But I thought that, being a "regular" programmer, /everything/ in the code would be *rewritten* anyway, so why spend 4 days writing something I can describe while I'm f**ing reading the assignment. Then it turns out I get omitted the first time, the second time I'm not understood, the 3rd time I'm not understood either, and the 4th time I'm misunderstood.

Were it not for D2, I'd suggest the reader ease up a bit. Save his and the audience's time. Getting it (err, reading it) right the first time (even if reading at half the speed) is much more efficient (for "himself", not to speak of the entire community and their combined wasted effort) than spawning a whole tree of response-subtrees, which (in the worst case) he has to answer individually. (So everybody loses.)

===================================

Which all means, if somebody else actually did read my description, and did implement it, Andrei'd have a hard time beating their code. (As an additional hint, to save extra speed, one wouldn't wait till the turn is over and then sort the poppedItems, but instead insert-sort them while iterating over the structs list.)

At the same time, one could keep a tab on the next "least value" while popping, so as not to have to compare to values between the current and the next actual poppable value.
May 25 2009
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Georg Wrede wrote:
 Andrei Alexandrescu wrote:
 I am sorry, at the first two reads I did not understand your 
 algorithm. I noticed that it is more complicated than the canonical one
Canonical one. Righhhhhtttttt...... From the outset I thought that you did have a solution in mind. And I thought that the pause of a couple of days before any seroious attemnts only meant that everybody else understood it, too. So, nobody wanted to spend 5 days inventing something which would only be dismissed "because Knuth did it better 50 years ago".
[snip] I don't even know how to start answering this and particularly what follows (30% of it I can't even figure out where it comes from and where it goes), but allow me to retort with just a few short and clear points.

1. You were not set up. I did have a solution in mind, but it's not from Knuth or some other classic. I didn't actually find it anywhere (which is why I consider the problem interesting to discuss in this group), but it's rather simple and clear because five people in this group came to it independently.

2. I'm not sure - are you reproaching me for not sitting down to understand your algorithm (and pardon me, it wasn't the clearest exposition there is and it still has more than a few kinks to be ironed out)? And are you comparing the situation with your professor etc.? And am I supposed to thank you for not making a biggie out of it? Well, your instructor was supposed to read your work, and I am not by any stretch of imagination. That was rather inappropriate.

3. If you are alleging that if somebody implemented your algorithm, it would run faster than BCS's, you are simply mistaken. Yours simply does more work than necessary, and no amount of buts and ifs can cover for that.

Andrei
May 25 2009
next sibling parent Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 3. If you are alleging that if somebody implemented your algorithm, it 
 would run faster than BCS's, you are simply mistaken. Yours simply does 
 more work than necessary, and no amount of buts and ifs can cover for that.
Maybe I didn't understand BCS's algorithm. To me it looked slower. I'll have to wait till I see the one actually implemented, and do a comparison. If nothing else, I'll then see how much mine is slower.
May 26 2009
prev sibling parent Georg Wrede <georg.wrede iki.fi> writes:
Andrei Alexandrescu wrote:
 Well your instructor was supposed to read your work, and I am not by
 any stretch of imagination. That was rather inappropriate.
Maybe I misunderstood the point of this OT thread. One could believe the thread was created to look at alternative algorithms by people.
 Finally, why would anyone care for something like this?
Has this been answered already? What was the answer?
May 26 2009
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Georg Wrede:
 You probably have to have an array of structs, where each struct 
 contains a reference to a subarray, and a copy of the first value of 
 this subarray.
Keeping a cached copy of the first item of the subarray ("subrange") may be better in some situations and worse in others (compared to keeping just a reference/pointer). And it's not very simple to design a data structure able to adapt itself dynamically to the two situations.
 Sort the array of structs, and beginning with the first struct, pop from 
 the pointed-to subarray as long as it contains elements equal to the 
 first one. Then pop from the subarray that belongs to following struct, 
 and so on, until a subbarray's first element already is larger.
 While popping, you obviously update the value copies, and append to an 
 output array.
Better to yield the output items lazily. And keeping them in a heap is probably more efficient than scanning them all.
 The algorithm could be enhanced by, instead of an array of structs, 
 having a singly linked list of these structs.
Today linked lists are mostly dead. In most situations arrays (or arrays with holes, plus a few other smart things) are more efficient, both in memory and in speed.

--------------------

For Andrei: in my last post I suggested removing from the heap the references to the sub-ranges as soon as they are exhausted. But a lot of tests of mine have shown me that's often not the most efficient strategy: http://www.fantascienza.net/leonardo/ar/list_deletions.html So better things are possible. The most refined strategy may be overkill, but some medium strategy may be OK.

Bye,
bearophile
May 25 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 For 	Andrei: in my last post I have suggested to remove from the heap
 the references to the sub-ranges as soon they are exhausted. But lot
 of tests of mine have shown me that's often not the most efficient
 strategy: http://www.fantascienza.net/leonardo/ar/list_deletions.html
  So better things are possible. The most refined strategy may be
 overkill, but some medium strategy may be OK.
I don't think that applies here. Unstable removal is O(1) and restoring the heap property is O(log n) in theory and very close to O(1) in practice.

Andrei
May 25 2009
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

Needless to say, nWayUnion is a range :o).<
It's better for nWayUnion to be a lazy iterable. As others have probably already said, you can keep a heap of pointers, and maintain the heap invariant using, as opCmp, a function that compares the values they point to. When nWayUnion yields an item, the corresponding pointer goes one step forward and the heap invariant has to be restored. When one array finishes, its pointer is removed before restoring the heap invariant.

The algorithm can be generalized: the input can be an array (random access range? memory-mapped files and arrays are among the most important cases) of sorted iterables (even lazy ones), that is, an array of lazy sorted ranges. The most general input is a lazy range of lazy sorted ranges; in this situation nWayUnion has to first "duplicate" it into an eager array of such ranges, and then turn it into a heap as before. I guess in most situations you don't have more than thousands of such sorted iterables, so creating such a temporary array isn't too bad. (nWayUnion may also keep a small fixed-size array of pointers/indexes on the stack, for the simple case of an array of 3-5 arrays.)

Bye,
bearophile
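
A minimal sketch of that lazy shape, using slices instead of raw pointers and std.container's heapify/BinaryHeap (the names byFront and lazyNWayUnion are made up, and replaceFront/removeFront are assumed to behave as documented):

import std.algorithm : filter;
import std.array : array;
import std.container : BinaryHeap, heapify;
import std.range;

// The sub-range whose front is smaller should surface at the heap's front.
bool byFront(double[] x, double[] y) { return x.front > y.front; }

// Lazy n-way union: the heap of sub-ranges is built once, and the merged
// stream is exposed through empty/front/popFront without being materialized.
auto lazyNWayUnion(double[][] a)
{
    static struct LazyUnion
    {
        BinaryHeap!(double[][], byFront) heap;

        @property bool empty() { return heap.empty; }
        @property double front() { return heap.front.front; }

        void popFront()
        {
            auto r = heap.front;
            r.popFront();
            if (r.empty) heap.removeFront();  // this sub-range is finished: drop it
            else heap.replaceFront(r);        // otherwise restore the invariant
        }
    }

    return LazyUnion(heapify!byFront(a.filter!(r => !r.empty).array));
}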
May 25 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 The most general input is a lazy range of lazy sorted ranges; in this
 situation nWayUnion has to first "duplicate" it into an eager array of such
 ranges, and then turn it into a heap as before.
 I guess in most situations you don't have more than thousands of such sorted
 iterables, so creating such a temporary array isn't too bad.
 (nWayUnion may also keep a small fixed-size array of pointers/indexes on the
 stack, for the simple case of an array of 3-5 arrays.)
The policy of std.algorithm is to not surreptitiously allocate memory if it can help it. So the user is left responsible for creating a range of ranges with random access.

Andrei
May 25 2009