digitalmars.D - [WORK] groupBy is in! Next: aggregate

Andrei Alexandrescu (28/28) Jan 23 2015 So H.S. Teoh awesomely took

Justin Whear (5/10) Jan 23 2015 This is great news. It seems like every time I make use of component
H. S. Teoh via Digitalmars-d (25/46) Jan 23 2015 Unfortunately it doesn't work in pure/@safe/nothrow code because of

Andrei Alexandrescu (4/47) Jan 23 2015 Clever! Or, conversely, I'm not that bright! Yes, this is awesome.

H. S. Teoh via Digitalmars-d (28/44) Jan 23 2015 [...]

Andrei Alexandrescu (4/7) Jan 23 2015 open https://github.com/D-Programming-Language/phobos/pulls

H. S. Teoh via Digitalmars-d (13/21) Jan 23 2015 [...]
=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (3/6) Jan 23 2015 Haha, that's funny!

Ary Borenszweig (11/39) Jan 23 2015 In most languages group by yields a tuple of {group key, group values}.

Andrei Alexandrescu (8/9) Jan 23 2015 Interesting, thanks. Looks like we're at a net loss of information with

Laeeth Isharc (23/33) Jan 23 2015 groupby hack below ? I haven't yet read the source code and

bearophile (5/7) Jan 23 2015 I'm saying this since some years... (and those languages probably

Andrei Alexandrescu (3/7) Jan 23 2015 At some point in the coming years the pain is bound to become unbearable...

H. S. Teoh via Digitalmars-d (7/18) Jan 23 2015 ^^^^

Andrei Alexandrescu (2/16) Jan 23 2015 I knew that pun won't go unremarked! :o) -- Andrei

"Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> (7/64) Jan 23 2015 You are talking about two different functions here. group by and

H. S. Teoh via Digitalmars-d (11/14) Jan 23 2015 [...]

Andrei Alexandrescu (4/15) Jan 23 2015 We already have partition() functions that actually partition a range

Ary Borenszweig (3/20) Jan 24 2015 Another name might be chunkBy: it returns chunks that are grouped by

H. S. Teoh via Digitalmars-d (5/27) Jan 24 2015 Incidentally, that was the original name I implemented it under.

Olivier Grant (11/14) Jan 25 2015 In ruby, the closest to D's currently-named groupBy method is a

MattCoder (9/25) Jan 23 2015 Sorry if this a dumb question, but since you're grouping an array

H. S. Teoh via Digitalmars-d (25/56) Jan 23 2015 [...]

bearophile (5/14) Jan 23 2015 Let's allocate, creating an associative array inside the grouping

Ary Borenszweig (3/13) Jan 24 2015 All languages I know do this for `group by` (because of the complexity

MattCoder (3/37) Jan 23 2015 Alright and thanks for the whole explanation.

Andrei Alexandrescu (2/29) Jan 23 2015 Yah, that would be partition(). -- Andrei

Russel Winder via Digitalmars-d (18/31) Jan 26 2015 On Fri, 2015-01-23 at 10:08 -0800, Andrei Alexandrescu via Digitalmars-d

bearophile (4/6) Jan 26 2015 Nope...

H. S. Teoh via Digitalmars-d (8/13) Jan 26 2015 [...]

"Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> (5/15) Jan 26 2015 Andrei had a point about `partition` being used already. I liked

"Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= (3/7) Jan 26 2015 Does it return slices?

Andrei Alexandrescu (7/17) Jan 26 2015 My suggestion was to keep the name but change the code of your groupBy

Ary Borenszweig (5/25) Jan 26 2015 That's much more harder to implement than what it does right now. I

Andrei Alexandrescu (5/35) Jan 26 2015 The implementation right now is quite interesting but not complicated,

zeljkog (3/44) Jan 26 2015 I think std.experimental.algorithm.groupBy is one option.

Phil (2/2) Jan 26 2015 I also found the behaviour confusing given the name. I like

H. S. Teoh via Digitalmars-d (22/39) Jan 26 2015 Huh, what? I think there's some misunderstanding here. The unary version

Andrei Alexandrescu (23/59) Jan 26 2015 Here's how. Basically the binary-predicate version has only Boolean keys...

"Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> (28/31) Jan 26 2015 The current implementation has a certain beauty and does not lend

Laeeth Isharc (28/36) Jan 26 2015 T: you are good with algorithms. In many applications you have a

H. S. Teoh via Digitalmars-d (58/63) Jan 26 2015 [...]

Andrei Alexandrescu (18/20) Jan 26 2015 I agree. That said, perhaps a change of name would avoid confusion. I

Dicebot (6/6) Jan 26 2015 Don't forget that there is already

"Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> (5/11) Jan 26 2015 This seems valid to me. I'd suggest to keep the name and

Laeeth Isharc (41/52) Jan 26 2015 Thank you for the thoughtful reply.
Laeeth Isharc (15/19) Jan 26 2015 Read the docs now - they are perfect within the context of the

Russel Winder via Digitalmars-d (17/29) Jan 26 2015 What name do you think works for this operation?

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

So H.S. Teoh awesomely took 
https://github.com/D-Programming-Language/phobos/pull/2878 to 
completion. We now have a working and fast relational "group by" facility.

See it at work!

----


void main()
{
     import std.algorithm, std.stdio;
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .writeln;
}
----

[[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]

The next step is to define an aggregate() function, which is a lot 
similar to reduce() but works on ranges of ranges and aggregates a 
function over each group. Continuing the previous example:

     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .aggregate!max
         .writeln;

should print:

[453, 600, 929, 812, 529, 768]

The aggregate function should support aggregating several functions at 
once, e.g. aggregate!(min, max) etc.

Takers?


Andrei

Jan 23 2015

Justin Whear <justin economicmodeling.com> writes:

On Fri, 23 Jan 2015 10:08:30 -0800, Andrei Alexandrescu wrote:

 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group by"
 facility.
 

This is great news.  It seems like every time I make use of component 
programming, I need groupBy at least once.  I have a D file with an old 
copy of a groupBy implementation (I think it's Andrei's original stab at 
it) and it gets copied around to the various projects.

Jan 23 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Fri, Jan 23, 2015 at 10:08:30AM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group by"
 facility.

Unfortunately it doesn't work in pure/ safe/nothrow code because of
limitations in the current RefCounted implementation.


[...]
 The next step is to define an aggregate() function, which is a lot
 similar to reduce() but works on ranges of ranges and aggregates a
 function over each group. Continuing the previous example:
 
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .aggregate!max
         .writeln;
 
 should print:
 
 [453, 600, 929, 812, 529, 768]
 
 The aggregate function should support aggregating several functions at
 once, e.g. aggregate!(min, max) etc.
 
 Takers?

[...]

Isn't that just a simple matter of defining aggregate() in terms of
map() and reduce()?  Working example:

	import std.algorithm.comparison : max;
	import std.algorithm.iteration;
	import std.stdio;
	
	auto aggregate(alias func, RoR)(RoR ror) {
		return ror.map!(reduce!func);
	}
	
	void main() {
	    [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
	         .groupBy!(a => a & 1)
	         .aggregate!max
	         .writeln;
	}

Output is as expected.


T

-- 
Verbing weirds language. -- Calvin (& Hobbes)

Jan 23 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/23/15 10:29 AM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jan 23, 2015 at 10:08:30AM -0800, Andrei Alexandrescu via
Digitalmars-d wrote:
 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group by"
 facility.

 Unfortunately it doesn't work in pure/ safe/nothrow code because of
 limitations in the current RefCounted implementation.


 [...]
 The next step is to define an aggregate() function, which is a lot
 similar to reduce() but works on ranges of ranges and aggregates a
 function over each group. Continuing the previous example:

      [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
          .groupBy!(a => a & 1)
          .aggregate!max
          .writeln;

 should print:

 [453, 600, 929, 812, 529, 768]

 The aggregate function should support aggregating several functions at
 once, e.g. aggregate!(min, max) etc.

 Takers?

 [...]

 Isn't that just a simple matter of defining aggregate() in terms of
 map() and reduce()?  Working example:

 	import std.algorithm.comparison : max;
 	import std.algorithm.iteration;
 	import std.stdio;
 	
 	auto aggregate(alias func, RoR)(RoR ror) {
 		return ror.map!(reduce!func);
 	}
 	
 	void main() {
 	    [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
 	         .groupBy!(a => a & 1)
 	         .aggregate!max
 	         .writeln;
 	}

 Output is as expected.

Clever! Or, conversely, I'm not that bright! Yes, this is awesome. 
Probably the actual name "aggregate" should be defined even with that 
trivial implementation to help folks like me :o). -- Andrei

Jan 23 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Fri, Jan 23, 2015 at 10:29:13AM -0800, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jan 23, 2015 at 10:08:30AM -0800, Andrei Alexandrescu via
Digitalmars-d wrote:

[...]
 The next step is to define an aggregate() function, which is a lot
 similar to reduce() but works on ranges of ranges and aggregates a
 function over each group. Continuing the previous example:
 
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .aggregate!max
         .writeln;
 
 should print:
 
 [453, 600, 929, 812, 529, 768]
 
 The aggregate function should support aggregating several functions
 at once, e.g. aggregate!(min, max) etc.


[...]

Here's a working variadic implementation:

	import std.algorithm.comparison : max, min;
	import std.algorithm.iteration;
	import std.stdio;
	
	template aggregate(funcs...) {
		auto aggregate(RoR)(RoR ror) {
			return ror.map!(reduce!funcs);
		}
	}
	
	void main() {
	    [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
	         .groupBy!(a => a & 1)
	         .aggregate!(max,min)
	         .writeln;
	}

Output (kinda ugly, but it works):

	[Tuple!(int, int)(453, 293), Tuple!(int, int)(600, 600), Tuple!(int, int)(929,
339), Tuple!(int, int)(812, 222), Tuple!(int, int)(529, 529), Tuple!(int,
int)(768, 768)]


Of course, it will require a little more polish before merging into
Phobos, but the core implementation is nowhere near the complexity of
groupBy.


T

-- 
The best compiler is between your ears. -- Michael Abrash

Jan 23 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/23/15 10:34 AM, H. S. Teoh via Digitalmars-d wrote:
 Of course, it will require a little more polish before merging into
 Phobos, but the core implementation is nowhere near the complexity of
 groupBy.

open https://github.com/D-Programming-Language/phobos/pulls

[F5]... [F5]... [F5]...


Andrei

Jan 23 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Fri, Jan 23, 2015 at 10:47:28AM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/23/15 10:34 AM, H. S. Teoh via Digitalmars-d wrote:
Of course, it will require a little more polish before merging into
Phobos, but the core implementation is nowhere near the complexity of
groupBy.

 
 open https://github.com/D-Programming-Language/phobos/pulls
 
 [F5]... [F5]... [F5]...

[...]

void main() {
	foreach (iota(0 .. 60 * 60 * F5sPerSecond))
		writeln("[F5]...");

	writeln(q"ENDMSG

	https://github.com/D-Programming-Language/phobos/pull/2899

	ENDMG);
}

;-)


T

-- 
No! I'm not in denial!

Jan 23 2015

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:

On Friday, 23 January 2015 at 18:47:29 UTC, Andrei Alexandrescu 
wrote:
 open https://github.com/D-Programming-Language/phobos/pulls

 [F5]... [F5]... [F5]...


 Andrei

Haha, that's funny!

Jan 23 2015

Ary Borenszweig <ary esperanto.org.ar> writes:

On 1/23/15 3:08 PM, Andrei Alexandrescu wrote:
 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group by" facility.

 See it at work!

 ----


 void main()
 {
      import std.algorithm, std.stdio;
      [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
          .groupBy!(a => a & 1)
          .writeln;
 }
 ----

 [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]

 The next step is to define an aggregate() function, which is a lot
 similar to reduce() but works on ranges of ranges and aggregates a
 function over each group. Continuing the previous example:

      [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
          .groupBy!(a => a & 1)
          .aggregate!max
          .writeln;

 should print:

 [453, 600, 929, 812, 529, 768]

 The aggregate function should support aggregating several functions at
 once, e.g. aggregate!(min, max) etc.

 Takers?


 Andrei

In most languages group by yields a tuple of {group key, group values}.

For example (Ruby or Crystal):

a = [1, 4, 2, 4, 5, 2, 3, 7, 9]
groups = a.group_by { |x| x % 3 }




Java: 
http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-

SQL: http://www.w3schools.com/sql/sql_groupby.asp

So I'm not sure "groupBy" is a good name for this.

Jan 23 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/23/15 12:19 PM, Ary Borenszweig wrote:
 In most languages group by yields a tuple of {group key, group values}.

Interesting, thanks. Looks like we're at a net loss of information with 
our current approach.

 quickfur, do you think you could expose a tuple with "key" and 
"values"? The former would be the function value, the latter would be 
what we offer right now.

That would apply only to the unary version of groupBy.


Andrei

Jan 23 2015

"Laeeth Isharc" <laeethnospam nospam.laeeth.com> writes:

On Friday, 23 January 2015 at 20:28:32 UTC, Andrei Alexandrescu 
wrote:
 On 1/23/15 12:19 PM, Ary Borenszweig wrote:
 In most languages group by yields a tuple of {group key, group 
 values}.

 Interesting, thanks. Looks like we're at a net loss of 
 information with our current approach.

  quickfur, do you think you could expose a tuple with "key" and 
 "values"? The former would be the function value, the latter 
 would be what we offer right now.

 That would apply only to the unary version of groupBy.


 Andrei

groupby hack below ?  I haven't yet read the source code and 
don't feel I understand ranges deeply enough to know if this will 
work in the general case.  But it at least works for the example 
(I think).


Laeeth.




void main()
{
     import std.algorithm, std.stdio, std.range;
     auto index=[293, 453, 600, 929, 339, 812, 222, 680, 529, 768];
     auto vals=[	1,		2,	3,	4,	5,		6,	7,	8,	9,		10];
     auto zippy=zip(index,vals);

     zippy.groupBy!(a=> a[0] & 1)
         .writeln;
}


[[Tuple!(int, int)(293, 1), Tuple!(int, int)(453, 2)], 
[Tuple!(int, int)(600, 3)], [Tuple!(int, int)(929, 4), 
Tuple!(int, int)(339, 5)], [Tuple!(int, int)(812, 6), Tuple!(int, 
int)(222, 7), Tuple!(int, int)(680, 8)], [Tuple!(int, int)(529, 
9)], [Tuple!(int, int)(768, 10)]]

Jan 23 2015

"bearophile" <bearophileHUGS lycos.com> writes:

Ary Borenszweig:

 In most languages group by yields a tuple of {group key, group 
 values}.

I'm saying this since some years... (and those languages probably 
don't use sorting to perform the aggregation).

Bye,
bearophile

Jan 23 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/23/15 12:38 PM, bearophile wrote:
 Ary Borenszweig:

 In most languages group by yields a tuple of {group key, group values}.

 I'm saying this since some years... (and those languages probably don't
 use sorting to perform the aggregation).

At some point in the coming years the pain is bound to become unbearable 
so you'll give the gift of a pull request. -- Andrei

Jan 23 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Fri, Jan 23, 2015 at 12:40:01PM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/23/15 12:38 PM, bearophile wrote:
Ary Borenszweig:

In most languages group by yields a tuple of {group key, group
values}.

I'm saying this since some years... (and those languages probably
don't use sorting to perform the aggregation).

 
 At some point in the coming years the pain is bound to become
 unbearable so you'll give the gift of a pull request. -- Andrei

    ^^^^
    (!!!)


*cowers and hides from the rotten tomatoes...*


T

-- 
It won't be covered in the book. The source code has to be useful for
something, after all. -- Larry Wall

Jan 23 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/23/15 1:34 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jan 23, 2015 at 12:40:01PM -0800, Andrei Alexandrescu via
Digitalmars-d wrote:
 On 1/23/15 12:38 PM, bearophile wrote:
 Ary Borenszweig:

 In most languages group by yields a tuple of {group key, group
 values}.

 I'm saying this since some years... (and those languages probably
 don't use sorting to perform the aggregation).

 At some point in the coming years the pain is bound to become
 unbearable so you'll give the gift of a pull request. -- Andrei

      ^^^^
      (!!!)

I knew that pun won't go unremarked! :o) -- Andrei

Jan 23 2015

"Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:

On Friday, 23 January 2015 at 20:19:31 UTC, Ary Borenszweig wrote:
 On 1/23/15 3:08 PM, Andrei Alexandrescu wrote:
 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group 
 by" facility.

 See it at work!

 ----


 void main()
 {
     import std.algorithm, std.stdio;
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .writeln;
 }
 ----

 [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]

 The next step is to define an aggregate() function, which is a 
 lot
 similar to reduce() but works on ranges of ranges and 
 aggregates a
 function over each group. Continuing the previous example:

     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .aggregate!max
         .writeln;

 should print:

 [453, 600, 929, 812, 529, 768]

 The aggregate function should support aggregating several 
 functions at
 once, e.g. aggregate!(min, max) etc.

 Takers?


 Andrei

 In most languages group by yields a tuple of {group key, group 
 values}.

 For example (Ruby or Crystal):

 a = [1, 4, 2, 4, 5, 2, 3, 7, 9]
 groups = a.group_by { |x| x % 3 }



 http://www.dotnetperls.com/groupby

 Java: 
 http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-

 SQL: http://www.w3schools.com/sql/sql_groupby.asp

 So I'm not sure "groupBy" is a good name for this.

You are talking about two different functions here. group by and 
partition by. The function that has been implemented is often 
called partition by.

The best example I know of:

https://clojuredocs.org/clojure.core/group-by
https://clojuredocs.org/clojure.core/partition-by

Jan 23 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote:
[...]
 You are talking about two different functions here. group by and
 partition by. The function that has been implemented is often called
 partition by.

[...]

It's not too late to rename it, since we haven't released it yet. We
still have a little window of time to make this change if necessary.
Andrei?

Returning each group as a tuple sounds like a distinct, albeit related,
function. It can probably be added separately.


T

-- 
Why waste time reinventing the wheel, when you could be reinventing the engine?
-- Damian Conway

Jan 23 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/23/15 1:36 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote:
 [...]
 You are talking about two different functions here. group by and
 partition by. The function that has been implemented is often called
 partition by.

 [...]

 It's not too late to rename it, since we haven't released it yet. We
 still have a little window of time to make this change if necessary.
 Andrei?

 Returning each group as a tuple sounds like a distinct, albeit related,
 function. It can probably be added separately.

We already have partition() functions that actually partition a range 
into two subranges, so adding partitionBy with a different meaning may 
be confusing. -- Andrei

Jan 23 2015

Ary Borenszweig <ary esperanto.org.ar> writes:

On 1/23/15 8:54 PM, Andrei Alexandrescu wrote:
 On 1/23/15 1:36 PM, H. S. Teoh via Digitalmars-d wrote:
 On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote:
 [...]
 You are talking about two different functions here. group by and
 partition by. The function that has been implemented is often called
 partition by.

 [...]

 It's not too late to rename it, since we haven't released it yet. We
 still have a little window of time to make this change if necessary.
 Andrei?

 Returning each group as a tuple sounds like a distinct, albeit related,
 function. It can probably be added separately.

 We already have partition() functions that actually partition a range
 into two subranges, so adding partitionBy with a different meaning may
 be confusing. -- Andrei

Another name might be chunkBy: it returns chunks that are grouped by 
some logic.

Jan 24 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Sun, Jan 25, 2015 at 01:39:59AM -0300, Ary Borenszweig via Digitalmars-d
wrote:
 On 1/23/15 8:54 PM, Andrei Alexandrescu wrote:
On 1/23/15 1:36 PM, H. S. Teoh via Digitalmars-d wrote:
On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote:
[...]
You are talking about two different functions here. group by and
partition by. The function that has been implemented is often
called partition by.

[...]

It's not too late to rename it, since we haven't released it yet. We
still have a little window of time to make this change if necessary.
Andrei?

Returning each group as a tuple sounds like a distinct, albeit
related, function. It can probably be added separately.

We already have partition() functions that actually partition a range
into two subranges, so adding partitionBy with a different meaning
may be confusing. -- Andrei

 
 Another name might be chunkBy: it returns chunks that are grouped by
 some logic.

Incidentally, that was the original name I implemented it under.


T

-- 
What do you get if you drop a piano down a mineshaft? A flat minor.

Jan 24 2015

"Olivier Grant" <olivier.grant gmail.com> writes:

 We already have partition() functions that actually partition a 
 range into two subranges, so adding partitionBy with a 
 different meaning may be confusing. -- Andrei

In ruby, the closest to D's currently-named groupBy method is a 
set of three methods:

slice_before
slice_after
slice_when

http://ruby-doc.org/core-2.2.0/Enumerable.html#method-i-slice_when

Your example in ruby would be:

2.2.0 > [293, 453, 600, 929, 339, 812, 222, 680, 529, 
768].slice_when { |x,y| x & 1 != y & 1 }.to_a
=> [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]

O.

Jan 25 2015

"MattCoder" <stop spam.com> writes:

On Friday, 23 January 2015 at 18:08:30 UTC, Andrei Alexandrescu 
wrote:
 So H.S. Teoh awesomely took 
 https://github.com/D-Programming-Language/phobos/pull/2878 to 
 completion. We now have a working and fast relational "group 
 by" facility.

 See it at work!

 ----


 void main()
 {
     import std.algorithm, std.stdio;
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .writeln;
 }
 ----

 [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]

Sorry if this a dumb question, but since you're grouping an array 
according some rule, this shouldn't be:

[293, 453, 929, 339, 529][600, 812, 222, 680, 768]

?

Because then you have the array of "trues" and "falses" according 
the condition (a & 1).

Matheus.

Jan 23 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Fri, Jan 23, 2015 at 09:56:11PM +0000, MattCoder via Digitalmars-d wrote:
 On Friday, 23 January 2015 at 18:08:30 UTC, Andrei Alexandrescu wrote:
So H.S. Teoh awesomely took
https://github.com/D-Programming-Language/phobos/pull/2878 to
completion.  We now have a working and fast relational "group by"
facility.

See it at work!

----


void main()
{
    import std.algorithm, std.stdio;
    [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
        .groupBy!(a => a & 1)
        .writeln;
}
----

[[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]

 
 Sorry if this a dumb question, but since you're grouping an array
 according some rule, this shouldn't be:
 
 [293, 453, 929, 339, 529][600, 812, 222, 680, 768]
 
 ?
 
 Because then you have the array of "trues" and "falses" according the
 condition (a & 1).

[...]

It's kind of a misnomer, because it actually only considers *consecutive
runs* of equivalent elements; it doesn't look at the whole range before
deciding what goes in which group. So technically it should be called
consecutiveGroupBy or consecutivePartitionBy, but that's too big a
mouthful to be a usable name.

What you describe could be an interesting candidate to add, though. It
could iterate over distinct values of the predicate, and traverse the
forward range (input ranges obviously can't work unless you allocate,
which makes it no longer lazy) each time. This, however, has O(n*k)
complexity where k is the number of distinct predicate values. If it's
anything bigger than bool or a relatively small enum, it would be
impractical. Moreover, the requirement to traverse the range multiple
times kinda sux; you might as well just call filter() with different
expected values yourself, in a loop. In fact, you could ostensibly
implement it something along these lines:

	auto globalPartition(alias pred, R)(R range) {
		alias Values = typeof(pred(range.front, range.front));

		return iota(Values.min, Values.max)
			.map!(v => range.save.filter!(pred(e) == v));
	}


T

-- 
Who told you to swim in Crocodile Lake without life insurance??

Jan 23 2015

"bearophile" <bearophileHUGS lycos.com> writes:

H. S. Teoh:

 What you describe could be an interesting candidate to add, 
 though. It
 could iterate over distinct values of the predicate, and 
 traverse the
 forward range (input ranges obviously can't work unless you 
 allocate,
 which makes it no longer lazy) each time. This, however, has 
 O(n*k)
 complexity where k is the number of distinct predicate values.

Let's allocate, creating an associative array inside the grouping 
function :-)

Bye,
bearophile

Jan 23 2015

Ary Borenszweig <ary esperanto.org.ar> writes:

On 1/23/15 7:30 PM, bearophile wrote:
 H. S. Teoh:

 What you describe could be an interesting candidate to add, though. It
 could iterate over distinct values of the predicate, and traverse the
 forward range (input ranges obviously can't work unless you allocate,
 which makes it no longer lazy) each time. This, however, has O(n*k)
 complexity where k is the number of distinct predicate values.

 Let's allocate, creating an associative array inside the grouping
 function :-)

 Bye,
 bearophile

All languages I know do this for `group by` (because of the complexity 
involved), and I think it's ok to do so.

Jan 24 2015

"MattCoder" <stop spam.com> writes:

On Friday, 23 January 2015 at 22:20:19 UTC, H. S. Teoh wrote:
 It's kind of a misnomer, because it actually only considers 
 *consecutive
 runs* of equivalent elements; it doesn't look at the whole 
 range before
 deciding what goes in which group. So technically it should be 
 called
 consecutiveGroupBy or consecutivePartitionBy, but that's too 
 big a
 mouthful to be a usable name.

 What you describe could be an interesting candidate to add, 
 though. It
 could iterate over distinct values of the predicate, and 
 traverse the
 forward range (input ranges obviously can't work unless you 
 allocate,
 which makes it no longer lazy) each time. This, however, has 
 O(n*k)
 complexity where k is the number of distinct predicate values. 
 If it's
 anything bigger than bool or a relatively small enum, it would 
 be
 impractical. Moreover, the requirement to traverse the range 
 multiple
 times kinda sux; you might as well just call filter() with 
 different
 expected values yourself, in a loop. In fact, you could 
 ostensibly
 implement it something along these lines:

 	auto globalPartition(alias pred, R)(R range) {
 		alias Values = typeof(pred(range.front, range.front));

 		return iota(Values.min, Values.max)
 			.map!(v => range.save.filter!(pred(e) == v));
 	}
...

Alright and thanks for the whole explanation.

Matheus.

Jan 23 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/23/15 1:56 PM, MattCoder wrote:
 On Friday, 23 January 2015 at 18:08:30 UTC, Andrei Alexandrescu wrote:
 So H.S. Teoh awesomely took
 https://github.com/D-Programming-Language/phobos/pull/2878 to
 completion. We now have a working and fast relational "group by"
 facility.

 See it at work!

 ----


 void main()
 {
     import std.algorithm, std.stdio;
     [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
         .groupBy!(a => a & 1)
         .writeln;
 }
 ----

 [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]

 Sorry if this a dumb question, but since you're grouping an array
 according some rule, this shouldn't be:

 [293, 453, 929, 339, 529][600, 812, 222, 680, 768]

 ?

 Because then you have the array of "trues" and "falses" according the
 condition (a & 1).

Yah, that would be partition(). -- Andrei

Jan 23 2015

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Fri, 2015-01-23 at 10:08 -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
[…]

 
 void main()
 {
      import std.algorithm, std.stdio;
      [293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
          .groupBy!(a => a & 1)
          .writeln;
 }
 ----
 
 [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]
 

[…]

I think I must be missing something, for me the result of a groupBy
operation on the above input data should be:

[1:[293, 453, 929, 339, 529], 0:[600, 812, 222, 680, 768]]

i.e. a map with keys being the cases and values being the values that
meet the case. In this example a & 1 asks for cases "lowest bit 0 or 1"
aka "odd or even".

There is nothing wrong with the semantics of the result above, but is
it's name "group by" as understood by the rest of the world?


-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Jan 26 2015

"bearophile" <bearophileHUGS lycos.com> writes:

Russel Winder:

 but is it's name "group by" as understood by the rest of the 
 world?

Nope...

Bye,
bearophile

Jan 26 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:
 Russel Winder:
 
but is it's name "group by" as understood by the rest of the world?

 
 Nope...

[...]

I proposed to rename it but it got shot down. *shrug*

We still have a short window of time to sort this out, before 2.067 is
released...


T

-- 
Don't drink and derive. Alcohol and algebra don't mix.

Jan 26 2015

"Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:

On Monday, 26 January 2015 at 16:13:40 UTC, H. S. Teoh wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via 
 Digitalmars-d wrote:
 Russel Winder:
 
but is it's name "group by" as understood by the rest of the 
world?

 
 Nope...

 [...]

 I proposed to rename it but it got shot down. *shrug*

Andrei had a point about `partition` being used already. I liked 
Oliver's suggestion to go with slice-something. `sliceBy` might 
be worth considering. It even hints at the (efficient) 
implementation.

Jan 26 2015

"Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= writes:

On Monday, 26 January 2015 at 16:44:20 UTC, Ulrich Küttler wrote:
 Andrei had a point about `partition` being used already. I 
 liked Oliver's suggestion to go with slice-something. `sliceBy` 
 might be worth considering. It even hints at the (efficient) 
 implementation.

Does it return slices?

If not, pick a different verb, e.g. "split".

Jan 26 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:
 Russel Winder:

 but is it's name "group by" as understood by the rest of the world?

 Nope...

 [...]

 I proposed to rename it but it got shot down. *shrug*

 We still have a short window of time to sort this out, before 2.067 is
 released...

My suggestion was to keep the name but change the code of your groupBy 
implementation to return tuple(key, lazyValues) instead of just 
lazyValues. That needs to happen only for binary predicates; unary 
predicates will all have alternating true/false keys.

Seems that would please everyone.


Andrei

Jan 26 2015

Ary Borenszweig <ary esperanto.org.ar> writes:

On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:
 On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d
 wrote:
 Russel Winder:

 but is it's name "group by" as understood by the rest of the world?

 Nope...

 [...]

 I proposed to rename it but it got shot down. *shrug*

 We still have a short window of time to sort this out, before 2.067 is
 released...

 My suggestion was to keep the name but change the code of your groupBy
 implementation to return tuple(key, lazyValues) instead of just
 lazyValues. That needs to happen only for binary predicates; unary
 predicates will all have alternating true/false keys.

 Seems that would please everyone.


 Andrei

That's much more harder to implement than what it does right now. I 
don't know how you'll manage to do the lazyValues thing: you'd need to 
make multiple passes in the range.

Again, other languages return an associative array in this case.

Jan 26 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/26/15 9:50 AM, Ary Borenszweig wrote:
 On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:
 On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d
 wrote:
 Russel Winder:

 but is it's name "group by" as understood by the rest of the world?

 Nope...

 [...]

 I proposed to rename it but it got shot down. *shrug*

 We still have a short window of time to sort this out, before 2.067 is
 released...

 My suggestion was to keep the name but change the code of your groupBy
 implementation to return tuple(key, lazyValues) instead of just
 lazyValues. That needs to happen only for binary predicates; unary
 predicates will all have alternating true/false keys.

 Seems that would please everyone.


 Andrei

 That's much more harder to implement than what it does right now. I
 don't know how you'll manage to do the lazyValues thing: you'd need to
 make multiple passes in the range.

The implementation right now is quite interesting but not complicated, 
and achieves lazy grouping in a single pass.

 Again, other languages return an associative array in this case.

I think our approach is superior.


Andrei

Jan 26 2015

zeljkog <zeljkog home.com> writes:

On 26.01.15 19:05, Andrei Alexandrescu wrote:
 On 1/26/15 9:50 AM, Ary Borenszweig wrote:
 On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:
 On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d
 wrote:
 Russel Winder:

 but is it's name "group by" as understood by the rest of the world?

 Nope...

 [...]

 I proposed to rename it but it got shot down. *shrug*

 We still have a short window of time to sort this out, before 2.067 is
 released...

 My suggestion was to keep the name but change the code of your groupBy
 implementation to return tuple(key, lazyValues) instead of just
 lazyValues. That needs to happen only for binary predicates; unary
 predicates will all have alternating true/false keys.

 Seems that would please everyone.


 Andrei

 That's much more harder to implement than what it does right now. I
 don't know how you'll manage to do the lazyValues thing: you'd need to
 make multiple passes in the range.

 
 The implementation right now is quite interesting but not complicated, 
 and achieves lazy grouping in a single pass.
 
 Again, other languages return an associative array in this case.

 
 I think our approach is superior.
 
 
 Andrei

I think std.experimental.algorithm.groupBy is one option.
To postpone thing.

Jan 26 2015

"Phil" <phil.j.ellison gmail.com> writes:

I also found the behaviour confusing given the name. I like 
ChunkBy.

Jan 26 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Mon, Jan 26, 2015 at 02:50:16PM -0300, Ary Borenszweig via Digitalmars-d
wrote:
 On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:
On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d
wrote:
Russel Winder:

but is it's name "group by" as understood by the rest of the world?

Nope...




[...]
My suggestion was to keep the name but change the code of your
groupBy implementation to return tuple(key, lazyValues) instead of
just lazyValues. That needs to happen only for binary predicates;
unary predicates will all have alternating true/false keys.


Huh, what? I think there's some misunderstanding here. The unary version
of the current groupBy translates to a binary predicate:

	groupBy!(a => a.field)

is equivalent to:

	groupBy!((a, b) => a.field == b.field)

I don't see how this has anything to do with alternating keys.


[...]
 That's much more harder to implement than what it does right now. I
 don't know how you'll manage to do the lazyValues thing: you'd need to
 make multiple passes in the range.
 
 Again, other languages return an associative array in this case.

I think we're talking past each other here. What groupBy currently does
is to group elements by evaluating the predicate on *consecutive runs*
of elements. What some people seem to demand is a function that groups
elements by *global evaluation* of the predicate over all elements.
These two are similar but divergent functions, and conflating them is
not helping this discussion in any way.

If "group by" in other languages refers to the latter function, then
that means "groupBy" is poorly-named and we need to come up with a
better name for it. Changing it to return tuples and what-not seems to
be beating around the bush to me.


T

-- 
Computers are like a jungle: they have monitor lizards, rams, mice, c-moss,
binary trees... and bugs.

Jan 26 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/26/15 10:11 AM, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 02:50:16PM -0300, Ary Borenszweig via Digitalmars-d
wrote:
 On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d
 wrote:
 Russel Winder:

 but is it's name "group by" as understood by the rest of the world?

 Nope...




 [...]
 My suggestion was to keep the name but change the code of your
 groupBy implementation to return tuple(key, lazyValues) instead of
 just lazyValues. That needs to happen only for binary predicates;
 unary predicates will all have alternating true/false keys.


 Huh, what? I think there's some misunderstanding here. The unary version
 of the current groupBy translates to a binary predicate:

 	groupBy!(a => a.field)

 is equivalent to:

 	groupBy!((a, b) => a.field == b.field)

 I don't see how this has anything to do with alternating keys.

Here's how. Basically the binary-predicate version has only Boolean keys 
that may be false or true. They will alternate because it's the change 
that triggers creation of a new group. In this example:

[293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
     .groupBy!((a, b) => (a & 3) == (b & 3))

the groupBy function has no information about the result of a & 3. All 
it "sees" is the result of the predicate: true, false, true, false...

HOWEVER, if you write it like this:

[293, 453, 600, 929, 339, 812, 222, 680, 529, 768]
     .groupBy!(a => (a & 3))

then groupBy sees the actual value of the function and can emit the 
proper key.

So the key (ahem) here is to make groupBy with unary predicate different 
from groupBy with binary predicate. The former returns the tuple, the 
latter is unchanged. Makes sense?

 [...]
 That's much more harder to implement than what it does right now. I
 don't know how you'll manage to do the lazyValues thing: you'd need to
 make multiple passes in the range.

 Again, other languages return an associative array in this case.

 I think we're talking past each other here. What groupBy currently does
 is to group elements by evaluating the predicate on *consecutive runs*
 of elements. What some people seem to demand is a function that groups
 elements by *global evaluation* of the predicate over all elements.
 These two are similar but divergent functions, and conflating them is
 not helping this discussion in any way.

Agreed.

 If "group by" in other languages refers to the latter function, then
 that means "groupBy" is poorly-named and we need to come up with a
 better name for it. Changing it to return tuples and what-not seems to
 be beating around the bush to me.

I like our notion of groupBy the same way I like the notion that 
something must be a random-access range in order to be sorted. (Other 
languages give the illusion they sort streams by internally converting 
them to arrays.) D offers better control, better flexibility, and richer 
semantics.


Andrei

Jan 26 2015

"Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:

On Monday, 26 January 2015 at 18:31:05 UTC, Andrei Alexandrescu 
wrote:
 So the key (ahem) here is to make groupBy with unary predicate 
 different from groupBy with binary predicate. The former 
 returns the tuple, the latter is unchanged. Makes sense?

The current implementation has a certain beauty and does not lend 
itself easily to a split in two different versions. Also, the 
additional key return value might be unnecessary and unexpected 
in many cases.

You can easily build the extended groupBy function out of the 
current one, if you only calculate the predicate beforehand. If 
the predicate is expensive, this might be sensible anyway.

auto groupByStar(alias pred, Range)(Range r)
{
   return r.map!(pred, "a")
	.groupBy!(a => a[0])
	.map!(inner => tuple(inner.front[0], inner.map!"a[1]"))
	;
}

void main()
{
   auto a = iota(10);
   foreach (r; a.groupByStar!"a/3") {
	writeln(r[0], " ", r[1]);
   }
}

I do not know if this solution is any better. The above might be 
a terrible idea.

Note: I failed to compile `r.groupBy!"a[0]"` with

iteration.d(1648): Error: function expected before (), not "a[0]" 
of type string

Jan 26 2015

"Laeeth Isharc" <Laeeth.nospam nospam-laeeth.com> writes:

 If "group by" in other languages refers to the latter function, 
 then
 that means "groupBy" is poorly-named and we need to come up 
 with a
 better name for it. Changing it to return tuples and what-not 
 seems to
 be beating around the bush to me.


 T

T: you are good with algorithms.  In many applications you have a 
bunch of results and want to summarise them.  This is often what 
the corporate manager is doing with Excel pivot tables, and it is 
what the groupby function is used for in pandas.  See here for a 
simple tutorial.

http://wesmckinney.com/blog/?p=125

And here for a summary of what pandas can do with data:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html

Is there any reason why we shouldn't add to Phobos: median, 
ranking, stddev, variance, correlation, covariance, skew, 
kurtosis, quantile, moving average, exp mov average, rolling 
window (see pandas)?

I personally am fine with the implementation we have (although as 
Ray Dalio would say. I haven't yet earned the right that you 
should care what I think).  All that it means is that you need to 
sort on multi key your results first before passing to groupby.

My question is how much is lost by doing it in two steps (sort, 
groupby) rather than one.  I don't think all that much, but it is 
not my field,  I am also not that bothered, because this comes at 
the end of processing, not within the inner loop, so for me I 
don't think it makes a difference for now.  If data sets reach 
commoncrawl type sizes then it might be different, although I 
will take D over java any day, warts and all.

In any case, the documentation should be very clear on what 
groupby does, and how the user can do what he might be expecting 
to achieve, coming from a different framework.

It would be interesting to benchmark D against pandas (which is 
implemented in cython for the key bits) and see how we look.

Jan 26 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Mon, Jan 26, 2015 at 07:59:10PM +0000, Laeeth Isharc via Digitalmars-d wrote:
[...]
 My question is how much is lost by doing it in two steps (sort,
 groupby) rather than one.

[...]

I'm not sure I understand what you mean by "lost"... I assume you're not
talking about data loss, since that's not applicable either way. In any
case, as with all things, there's a tradeoff.

Building a hash table of groups, as some people demand, has the
advantage that the grouping predicate is applied globally, and no
sorting (O(n log n)) is needed. The disadvantage is that the entire data
set has to fit in memory, and you cannot process it lazily. Before
you've seen all of the data, you cannot know whether there are more
distinct groups to come, or whether currently known groups have more
members. Once the data size gets too large, you run into trouble. Plus,
some people are adverse to gratuitous GC use in Phobos algorithms.
Perhaps some of this is misplaced, but that's the perception.

The current implementation has the advantage that it requires very
little memory to work, and can process data lazily. It only needs to see
a tiny portion of the entire data set to do its work. Each group
returned is also lazy, so it doesn't need to store the whole group in
its entirety. It can handle 10GB long groups in a 100GB data set
streamed over the network with very little memory, whereas such a
monstrous thing would be infeasibly resource-hungry if you need to
allocate an on-disk hashtable (very inefficient!), iterate over the
entire dataset, and then be I/O bound as you iterate over each group
loading group members from disk. The big disadvantage, however, is that
if your data is not sorted, then the groupings returned won't be quite
what you'd expect, since the predicate is evaluated only over
consecutive runs of elements, not globally over the entire data set.

Which one is better? That depends on your specific application and its
needs. For some applications, there is simply no way around the fact
that you have a large dataset with randomly-ordered elements, so the
sorting has to be done *somewhere*. For other applications, you *can*
make stream your data in sorted order -- or perhaps it doesn't care if
groupings aren't global -- so you can take advantage of the fact that
the current groupBy is extremely cheap. It requires minimal memory to do
its work, and it can do so inside a pipeline without requiring anything
more than an input range.

Basically, std.algorithm is (currently) primarily focused on linear
algorithms.  While there *are* some sorting algorithms and sublinear
algorithms that take advantage of sorted data, it is still primarily
focused on linear processing. There is no good abstraction as yet for
more complex processing like multidimensional or graph traversals. As a
result, it is best suited for stream-based processing.

Generally speaking, most application code is primarily concerned with
linear processing (copy this text from here to there, scan this text for
some linear string AKA search keyword, render this line of text to the
screen, draw this line around the window, move this item along in the
processing queue, etc.). Non-linear computations generally take place
only in dedicated computing modules of limited scope in the application,
where one tends to write dedicated algorithms rather than reuse stock
generic algorithms.


 In any case, the documentation should be very clear on what groupby
 does, and how the user can do what he might be expecting to achieve,
 coming from a different framework.

[...]

As far as I know, the current groupBy docs explain quite clearly what it
does.  If you find it still inadequate or unclear, please file bugs
against it so that we can look into improving the docs.


T

-- 
GEEK = Gatherer of Extremely Enlightening Knowledge

Jan 26 2015

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 1/26/15 12:31 PM, H. S. Teoh via Digitalmars-d wrote:
 As far as I know, the current groupBy docs explain quite clearly what it
 does.

I agree. That said, perhaps a change of name would avoid confusion. I 
like groupBy as we currently define it, it's straight within the spirit 
of D. However, some people may legitimately expect it does global 
processing and groups all identical keys together.

So here's what I think we should do:

1. Rename our current groupBy to chunkBy

2. Change code to have chunkBy with binary predicates return tuples, as 
discussed

3. Add a method called groupEquivalent to SortedRange. It takes no 
predicate (because predicate already is embedded in SortedRange). It 
groups together elements that are equivalent according to the sorting 
order. Returns a range of ranges (similar to what groupBy returns now).

4. Perhaps add a global chunkEquivalent as an alias for groupBy!((a, b) 
=> a == b).

So then in order to get relational group-by semantics by sorting, one 
would write range.sort!(predicate).groupEquivalent...


Andrei

Jan 26 2015

"Dicebot" <public dicebot.lv> writes:

Don't forget that there is already 

groupBy in this context (only groups consequitive elements). It 
would be totally awkward to have those named differently (which 
was why I have suggested to change the name from chunkBy to 
groupBy in the very beginning)

Jan 26 2015

"Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> writes:

On Monday, 26 January 2015 at 20:58:36 UTC, Dicebot wrote:
 Don't forget that there is already 

 groupBy in this context (only groups consequitive elements). It 
 would be totally awkward to have those named differently (which 
 was why I have suggested to change the name from chunkBy to 
 groupBy in the very beginning)

This seems valid to me. I'd suggest to keep the name and 
implementation as it is and add the "common" groupBy function as 
an example.

https://github.com/D-Programming-Language/phobos/pull/2915

Jan 26 2015

"Laeeth Isharc" <Laeeth.nospam nospam-laeeth.com> writes:

Thank you for the thoughtful reply.

I meant lost in terms of extra processing time and memory 
consumption in order to achieve the fairly common use case of a 
groupby pivot table or pandas style (ie what you get if you sort 
the data by group and then run D groupby)

If you first sort the data, and then run groupby, then it seems 
likely in some cases this will be more expensive than a straight 
pandas style groupby., and my question was how much more 
expensive.

If you needed to sort the data anyway, because you were producing 
a report, this doesn't matter.  But if you didn't need it sorted 
because you just wanted to produce statistics on each group (eg 
median, inter quartile difference) then it might.

I hadn't thought clearly before about in memory vs not.

I guess for pandas style groupby, having to sort a dataset 
externally just to use groupby is a comparative disaster because 
it seems to involve much more I/O than is needed.  If I am not 
mistaken, this might be one reason to have also at some point a 
separate function that does pandas groupby efficiently for  data 
that may not be held in memory.

Can you explain why for hashgroupby the entire dataset needs to 
fit in memory?  If I were writing this naively, one could have an 
associative array of lists of relevant items hashed by group.  
You go through each line once, add the index to the list for that 
group, then at the end sort the group keys and return indices or 
whatever the slice equivalent is of the lines matching each 
group.  But I may be missing some obvious points.

Ie my imagined use case has large data for each line, but the 
hash table and lists would fit in memory.

If thinking of a case where the data per line is not that large, 
but there are so many lines that it becomes an external sort then 
I see your point.  However I suggest that the use case I describe 
- whilst not the only one - is important enough at least to 
briefly consider.

"The  big disadvantage, however,
 is that
 if your data is not sorted, then the groupings returned won't 
 be quite
 what you'd expect, since the predicate is evaluated only over
 consecutive runs of elements, not globally over the entire data 
 set."

Well - it is a completely different function...

I really do believe that pandas style dataframe type processing 
could be an interesting area for D, and most of what is needed to 
construct this is already here.


 As far as I know, the current groupBy docs explain quite 
 clearly what it
 does.  If you find it still inadequate or unclear, please file 
 bugs
 against it so that we can look into improving the docs.


Sorry - I had hadn't thought to check!  Will do so now.


Laeeth.

Jan 26 2015

"Laeeth Isharc" <laeethnospam nospamlaeeth.com> writes:

 As far as I know, the current groupBy docs explain quite 
 clearly what it
 does.  If you find it still inadequate or unclear, please file 
 bug against it so that we can look into improving the docs.

Read the docs now - they are perfect within the context of the 
style of documentation (and these documents themselves have 
certainly been improved over the past months).

It is very clear to me, but I am not sure it would be to everyone 
I work with.  The formal library description may not be the place 
for this, but I do think many less sophisticated readers would 
find the style below more accessible and easier to follow (this 
is not about groupBy, but a broader question of documentation to 
supplement the formal reference description for the standard 
library):

http://pandas.pydata.org/pandas-docs/stable/groupby.html

I appreciate that it is all very well to suggest improvements but 
someone actually has to do the work.  Recognizing the direction 
of what might be helpful is perhaps a start though.  I might be 
on my own with this view, but I don't think I am.

Jan 26 2015

Russel Winder via Digitalmars-d <digitalmars-d puremagic.com> writes:

On Mon, 2015-01-26 at 08:11 -0800, H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wr=

ote:
 Russel Winder:
=20
but is it's name "group by" as understood by the rest of the world?

=20
 Nope...

 [...]
=20
 I proposed to rename it but it got shot down. *shrug*

What name do you think works for this operation?

 We still have a short window of time to sort this out, before 2.067 is
 released...

To be honest, given the confirmation that the semantics of this
operation and that of the one other languages call groupBy are
different, then 2.067 should not go out with this operation called
groupBy.
=20

--=20
Russel.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.n=
et
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Jan 26 2015

D Programming

C/C++ Programming

Other

digitalmars.D - [WORK] groupBy is in! Next: aggregate