digitalmars.D.learn - Ranges/algorithms for aggregation

=?UTF-8?B?Ikx1w61z?= Marques" (40/40) Mar 21 2014 Is there a neat way to do this transformation with ranges and

bearophile (10/24) Mar 21 2014 What is the desired output data structure? An associative array

=?UTF-8?B?Ikx1w61z?= Marques" (16/24) Mar 21 2014 I'm doing this for a D newbie, to teach him the range/algorithmic

bearophile (11/18) Mar 21 2014 Yes, it happens, But it's a problem, because often if you know

=?UTF-8?B?Ikx1w61z?= Marques" (4/9) Mar 21 2014 The number of keys is large and unbounded (not just three as in

Justin Whear (16/59) Mar 21 2014 This pull request[1] for groupBy has been hanging around for a year now,...

H. S. Teoh (20/38) Mar 21 2014 Be aware, though, that groupBy only compares *adjacent* elements for

=?UTF-8?B?Ikx1w61z?= Marques" (4/11) Mar 21 2014 I think that's why Justin used sort. The hashGroupBy proposed by

=?UTF-8?B?Ikx1w61z?= Marques" (4/7) Mar 21 2014 I was thinking, we don't even need the full power of sort. Is

bearophile (5/8) Mar 21 2014 I don't know any, how is it supposed to know where to group

=?UTF-8?B?Ikx1w61z?= Marques" (11/13) Mar 21 2014 It would swap elements, like sort, so it doesn't need to put them

bearophile (5/10) Mar 22 2014 I think to swap items it must know where to swap them to. And you

=?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:

Is there a neat way to do this transformation with ranges and 
std.algorithms?

     Input:
     -------
     B foo
     B bar
     C ble
     B big
     A begga

     Output: (aggregated and sorted on length)
     -------
     B -> [foo, bar, big]
     C -> [ble]
     A -> [begga]

The most obvious way (to me) to do this without standard 
algorithms is with an AA to the aggregation. The most obvious way 
(to me) to do this with std.algorithms is:

     B foo
     B bar
     C ble
     B big
     A begga

     =>

     [B, foo]
     [B, bar]
     [C, ble]
     [B, big]
     [A, begga]

     =>

     [B, foo]
     [B, bar]
     [B, big]
     [C, ble]
     [A, begga]

     =>

     B -> [foo, bar, big]
     C -> [ble]
     A -> [begga]

But this seems wasteful on memory. Is there a better way to do 
this in a more algorithmic way?

Mar 21 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Luís Marques:

 Is there a neat way to do this transformation with ranges and 
 std.algorithms?

     Input:
     -------
     B foo
     B bar
     C ble
     B big
     A begga

     Output: (aggregated and sorted on length)
     -------
     B -> [foo, bar, big]
     C -> [ble]
     A -> [begga]

What is the desired output data structure? An associative array 
of dynamic arrays? Or is a dynamic arrays of dynamic arrays of 
2-tuples enough?

There are various ways to solve your problem.

Related:
https://d.puremagic.com/issues/show_bug.cgi?id=5968
https://d.puremagic.com/issues/show_bug.cgi?id=9842

Bye,
bearophile

Mar 21 2014

=?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:

On Friday, 21 March 2014 at 15:38:23 UTC, bearophile wrote:
    Output: (aggregated and sorted on length)
    -------
    B -> [foo, bar, big]
    C -> [ble]
    A -> [begga]

 What is the desired output data structure? An associative array 
 of dynamic arrays? Or is a dynamic arrays of dynamic arrays of 
 2-tuples enough?

I'm doing this for a D newbie, to teach him the range/algorithmic 
approach. The function he wrote to output the result of that 
transformation takes as an input an array of arrays (the latter), 
but he builds that input iteratively using an AA of arrays (the 
former). I asked him about that  mismatch and at the time he told 
me that "for now" he only needed the latter, suggesting he had 
other future plans where he might need the AA, but I'm not sure. 
So let's just say that the client is unclear on his requirements, 
which does happen in the real world anyway :-).

In any case, I think the hashGroupBy is what I was asking about 
:-). Neat. (did anyone actually implement it?)

I'm not sure how if "a dynamic arrays of dynamic arrays of 
2-tuples" sufficed that would help with the intermediate step, if 
we wanted to avoid the sorting step. Did you have anything in 
particular in mind there?

Mar 21 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Luís Marques:

 So let's just say that the client is unclear on his 
 requirements, which does happen in the real world anyway :-).

Yes, it happens, But it's a problem, because often if you know 
what you need you can produce the results more efficiently :-)


 (did anyone actually implement it?)

Not yet.


 I'm not sure how if "a dynamic arrays of dynamic arrays of 
 2-tuples" sufficed that would help with the intermediate step, 
 if we wanted to avoid the sorting step. Did you have anything 
 in particular in mind there?

I think this problem needs a sorting or a hash. One possible 
solution, if you don't need an associative array as output, is to 
use a multiSort followed by a building of groups using slicing. 
It could be efficient enough. Later you search the keys with a 
some kind of binary search.

Bye,
bearophile

Mar 21 2014

=?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:

On Friday, 21 March 2014 at 16:04:45 UTC, bearophile wrote:
 I think this problem needs a sorting or a hash. One possible 
 solution, if you don't need an associative array as output, is 
 to use a multiSort followed by a building of groups using 
 slicing. It could be efficient enough. Later you search the 
 keys with a some kind of binary search.

The number of keys is large and unbounded (not just three as in 
my example), so I guess this multiSort approach would not be 
practical, right? I think we really need the hashGroupBy.

Mar 21 2014

Justin Whear <justin economicmodeling.com> writes:

On Fri, 21 Mar 2014 15:22:13 +0000, Luís Marques wrote:

 Is there a neat way to do this transformation with ranges and
 std.algorithms?
 
      Input:
      -------
      B foo B bar C ble B big A begga
 
      Output: (aggregated and sorted on length)
      -------
      B -> [foo, bar, big]
      C -> [ble]
      A -> [begga]
 
 The most obvious way (to me) to do this without standard algorithms is
 with an AA to the aggregation. The most obvious way (to me) to do this
 with std.algorithms is:
 
      B foo B bar C ble B big A begga
 
      =>
 
      [B, foo]
      [B, bar]
      [C, ble]
      [B, big]
      [A, begga]
 
      =>
 
      [B, foo]
      [B, bar]
      [B, big]
      [C, ble]
      [A, begga]
 
      =>
 
      B -> [foo, bar, big]
      C -> [ble]
      A -> [begga]
 
 But this seems wasteful on memory. Is there a better way to do this in a
 more algorithmic way?

This pull request[1] for groupBy has been hanging around for a year now, 
driving me to copy-and-paste the implementation into a couple of my 
projects.  Using it, you could do this:

auto tuples = ... // get your list of (B, foo), (B, bar), etc.
auto output = tuples.sort!`a[0] < b[0]`
                    .groupBy!`a[0] == b[0]`;
// output is a range of:
//    [
//     [(A, begga)],
//     [(B, foo), (B, bar), (B, big)],
//     [(C, ble)]
//    ]

The advantage being that output isn't an array at all but a lazy range of 
lazy ranges.

1 https://github.com/D-Programming-Language/phobos/pull/1186

Mar 21 2014

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Mar 21, 2014 at 04:10:12PM +0000, Justin Whear wrote:
[...]
 This pull request[1] for groupBy has been hanging around for a year
 now, driving me to copy-and-paste the implementation into a couple of
 my projects.  Using it, you could do this:
 
 auto tuples = ... // get your list of (B, foo), (B, bar), etc.
 auto output = tuples.sort!`a[0] < b[0]`
                     .groupBy!`a[0] == b[0]`;
 // output is a range of:
 //    [
 //     [(A, begga)],
 //     [(B, foo), (B, bar), (B, big)],
 //     [(C, ble)]
 //    ]
 
 The advantage being that output isn't an array at all but a lazy range
 of lazy ranges.
 
 1 https://github.com/D-Programming-Language/phobos/pull/1186

Be aware, though, that groupBy only compares *adjacent* elements for
equivalence; it does not sort the input. So if your input has equivalent
elements interspersed with non-equivalent elements, you will have the
equivalent elements split into multiple runs in the output.

Example:
	auto data = [
		tuple(1, "a")
		tuple(1, "b")
		tuple(2, "c")
		tuple(1, "d")
	];
	writeln(data.groupBy!((a,b) => a[0] == b[0]));

Will output:
	[[tuple(1, "a"), tuple(1, "b")], [tuple(2, "c")], [tuple(1, "d")]]

Which may not be what the OP wants.


T

-- 
Unix was not designed to stop people from doing stupid things, because that
would also stop them from doing clever things. -- Doug Gwyn

Mar 21 2014

=?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:

On Friday, 21 March 2014 at 16:53:46 UTC, H. S. Teoh wrote:
 Be aware, though, that groupBy only compares *adjacent* 
 elements for
 equivalence; it does not sort the input. So if your input has 
 equivalent
 elements interspersed with non-equivalent elements, you will 
 have the
 equivalent elements split into multiple runs in the output.

I think that's why Justin used sort. The hashGroupBy proposed by 
bearophile would avoid the sort and the additional memory usage 
though, so that would be even better.

Mar 21 2014

=?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:

On Friday, 21 March 2014 at 17:20:38 UTC, Luís Marques wrote:
 I think that's why Justin used sort. The hashGroupBy proposed 
 by bearophile would avoid the sort and the additional memory 
 usage though, so that would be even better.

I was thinking, we don't even need the full power of sort. Is 
there a standard algorithm that makes elements with equal keys be 
in sequence, but that otherwise is less expensive than sort?

Mar 21 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Luís Marques:

 I was thinking, we don't even need the full power of sort. Is 
 there a standard algorithm that makes elements with equal keys 
 be in sequence, but that otherwise is less expensive than sort?

I don't know any, how is it supposed to know where to group 
items? Usually you build an associative array for that.

Bye,
bearophile

Mar 21 2014

=?UTF-8?B?Ikx1w61z?= Marques" <luis luismarques.eu> writes:

On Saturday, 22 March 2014 at 01:08:11 UTC, bearophile wrote:
 how is it supposed to know where to group items? Usually you 
 build an associative array for that.

It would swap elements, like sort, so it doesn't need to put them 
anywhere, just permute them. The advantage is this:

Input: [7, 3, 7, 1, 1, 1, 1]
Output sort: [1, 1, 1, 1, 3, 7, 7]
Output groupSort: [3, 7, 7, 1, 1, 1, 1]

groupSort (or whatever it would be called) only makes one swap, 
while sort makes a lot of them. So groupSort is a lot cheaper. 
I'm not sure what the asymptotic time complexity of groupSort is, 
at this moment's notice (I guess it would depend on what strategy 
it would use).

Mar 21 2014

"bearophile" <bearophileHUGS lycos.com> writes:

Luís Marques:

 It would swap elements, like sort, so it doesn't need to put 
 them anywhere, just permute them. The advantage is this:

 Input: [7, 3, 7, 1, 1, 1, 1]
 Output sort: [1, 1, 1, 1, 3, 7, 7]
 Output groupSort: [3, 7, 7, 1, 1, 1, 1]

I think to swap items it must know where to swap them to. And you 
can't do that efficiently if the groups are unsorted.

Bye,
bearophile

Mar 22 2014

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Ranges/algorithms for aggregation