## digitalmars.D - groupBy predicates: unary, binary, or both?

• Andrei Alexandrescu (38/38) Dec 29 2013 I'm working on
• monarch_dodra (24/24) Dec 29 2013 There is precedence for this in phobos with schwatzSort: It use a
• Andrei Alexandrescu (9/23) Dec 29 2013 Hm so that's different from both my suggestions. It's also more
• bearophile (13/22) Dec 29 2013 In similar cases the simplest solution is to have two functions
• Andrei Alexandrescu (3/10) Dec 29 2013 I think hash joins should come first.
• Timon Gehr (9/21) Dec 29 2013 What hash join algorithm do you have in mind that does not use
• w0rp (19/19) Jul 03 2014 I'm sorry to bump up an old thread, but whatever happened to
• w0rp (2/3) Jul 03 2014 That's suppose to be a : there sorry, not a comma.
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```I'm working on
https://github.com/D-Programming-Language/phobos/pull/1186, which is
somewhat important; "group by" is a powerful operation. Along the way, I
stumbled upon an interesting issue in which I wanted to consult the
community.

Usually groupBy is used to find runs of equivalent elements in a range.
For example:

[1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6].groupBy()

yields the range [[1, 1, 1], [2, 2], [3, 3, 3], [4], [5, 5], [6]].

As is usual with Phobos algorithms, groupBy accepts a predicate. The
default (as illustrated above) is "a = b", i.e. all elements in a group
are equal to one another.

Equality is transitive and commutative. But there are useful cases in
which predicates are not commutative. Consider we want to find strictly
monotonic subranges in the range. We'd write:

auto r = [1, 3, 2, 4, 5, 1].groupBy!"a < b";

That should produce [[1, 3], [2, 4, 5], [1]]. For non-strict monotonic
runs, the predicate would be "a <= b" etc. All that is pretty awesome.

However, that makes life a bit tougher for the algorithm - it must only
compare adjacent elements only. In the case of "a = b", it suffices to
save the first element in a group and compare it against every other
element in the group.

Meanwhile, a very similar pull request
(https://github.com/D-Programming-Language/phobos/pull/1453) uses unary
predicates, i.e. an optional transformation function that is then used
in conjunction with "==" to decide which elements belong in the same group.

Unary predicates make life simpler for the algorithm (save the transform
of the first element, then compare it against the transform of the next
etc) and are often easier to write by the end user, too (e.g. just write
"a.length" instead of "a.length == b.length" to group by length).

So I was thinking to allow both cases, with the understanding that
grouping by unary predicates uses "==" for comparison whereas grouping
by binary predicates looks at adjacent elements to figure out group
membership. That approach would, however, preclude the use of string
lambdas (because deducing arity for string lambdas is possible, but
quite unwieldy).

What do you think?

Andrei
```
Dec 29 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
```There is precedence for this in phobos with schwatzSort: It use a
unaryFun, followed by a binaryFun.

I've found this works quite well: More often than not, a unary
fun is more than enough (the default "a < b"/"a == b" is fine).
And in the cases where you *do* need binary comparison, then
having a unary fun anywas is fine. EG:

"MHklfsJKGsfd".byGroup!(std.uni.tolower, "a < b")();

I think it is both an elegant and efficient approach. That's what
I'd go for.

-------

I think we should handle non-commutative predicates. It's not
actually *that* much more complicated. Instead of storing in each
Group a value + range, simply save the range twice, and have one
range be 1 step ahead of the first. I think a few phobos algos do
this. minPos does something kind of similar.

The "overhead" is an initial extra save yes, but the cost of each
pop is to simply pop both ranges (no extra saves there). You
never actually have to make element copies (which may fail if the
range holds const elements to boot...). Also, if the ElementType
is big, then storing an extra range has good chances to actually
take up *less* room.

My stance is that supporting this would be a powerful and useful
tool, and I don't think there'd be that much overhead either, so
I say let's go for it.
```
Dec 29 2013
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 12/29/13 2:20 AM, monarch_dodra wrote:
There is precedence for this in phobos with schwatzSort: It use a
unaryFun, followed by a binaryFun.

I've found this works quite well: More often than not, a unary fun is
more than enough (the default "a < b"/"a == b" is fine). And in the
cases where you *do* need binary comparison, then having a unary fun
anywas is fine. EG:

"MHklfsJKGsfd".byGroup!(std.uni.tolower, "a < b")();

I think it is both an elegant and efficient approach. That's what I'd go
for.

Hm so that's different from both my suggestions. It's also more
difficult to explain, though indeed there's precedent with schwartz.

I think we should handle non-commutative predicates. It's not actually
*that* much more complicated. Instead of storing in each Group a value +
range, simply save the range twice, and have one range be 1 step ahead
of the first. I think a few phobos algos do this. minPos does something
kind of similar.

The existing pull request already supports non-commutative predicates by
means of copying the front. I'd thought about using the
step-behind-range trick but that's less efficient than a copy in many
cases. It's more general though as you point out, so I'll probably
switch to that.

Andrei
```
Dec 29 2013
"bearophile" <bearophileHUGS lycos.com> writes:
```Andrei Alexandrescu:

which is somewhat important; "group by" is a powerful operation.

I agree, it's a quite commonly useful operation.

So I was thinking to allow both cases, with the understanding
that grouping by unary predicates uses "==" for comparison
whereas grouping by binary predicates looks at adjacent
elements to figure out group membership. That approach would,
however, preclude the use of string lambdas (because deducing
arity for string lambdas is possible, but quite unwieldy).

In similar cases the simplest solution is to have two functions
with different (but similar) names.

However, that makes life a bit tougher for the algorithm - it
must only compare adjacent elements only.

But another common need is to group by equivalent classes using
an associative array. Both C# and Clojure have this functionality
(on LINQ and in the functions toolkit). This means that a
"hashGroup" builds an associative array inside and returns it.

http://d.puremagic.com/issues/show_bug.cgi?id=5968
http://d.puremagic.com/issues/show_bug.cgi?id=9842

Bye,
bearophile
```
Dec 29 2013
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
```On 12/29/13 2:27 AM, bearophile wrote:
But another common need is to group by equivalent classes using an
associative array. Both C# and Clojure have this functionality (on LINQ
and in the functions toolkit). This means that a "hashGroup" builds an
associative array inside and returns it.

http://d.puremagic.com/issues/show_bug.cgi?id=5968
http://d.puremagic.com/issues/show_bug.cgi?id=9842

I think hash joins should come first.

Andrei
```
Dec 29 2013
Timon Gehr <timon.gehr gmx.ch> writes:
```On 12/29/2013 05:17 PM, Andrei Alexandrescu wrote:
On 12/29/13 2:27 AM, bearophile wrote:
But another common need is to group by equivalent classes using an
associative array. Both C# and Clojure have this functionality (on LINQ
and in the functions toolkit). This means that a "hashGroup" builds an
associative array inside and returns it.

http://d.puremagic.com/issues/show_bug.cgi?id=5968
http://d.puremagic.com/issues/show_bug.cgi?id=9842

I think hash joins should come first.

Andrei

What hash join algorithm do you have in mind that does not use
hashGroupBy as a component?

I think accumAssocArray or similar would be quite handy as well, as it
is more general. A naive implementation r.hashGroupBy!f would then for
example be given by:

r.map!(a=>tuple(f(a),a)) // range of key-value pairs
.accumAssocArray!((a,b)=>a~=b) // method of combination
((ElementType!(typeof(r))[]).init); // initial value
```
Dec 29 2013
"w0rp" <devw0rp gmail.com> writes:
```I'm sorry to bump up an old thread, but whatever happened to
this? I was just using 'groupby' in Python for a common data
migration pattern I've developed...

# Build a dictionary of parent -> child relationships
# from a datbase query, for quick and dirty data migration.
# This runs in linear time.
{
y[0], tuple(y[1])
for y in
groupby(
some_query_func(
"SELECT parent_id, ... "
# ...
"ORDER BY parent_id "
),
lambda x: x["parent_id"]
)
}

... and I was thinking "it would be nice to have this in D, too!"
```
Jul 03 2014
"w0rp" <devw0rp gmail.com> writes:
```On Thursday, 3 July 2014 at 10:48:53 UTC, w0rp wrote:
y[0], tuple(y[1])

That's suppose to be a : there sorry, not a comma.
```
Jul 03 2014