digitalmars.D - [WORK] groupBy is in! Next: aggregate
- Andrei Alexandrescu (28/28) Jan 23 2015 So H.S. Teoh awesomely took
- Justin Whear (5/10) Jan 23 2015 This is great news. It seems like every time I make use of component
- H. S. Teoh via Digitalmars-d (25/46) Jan 23 2015 Unfortunately it doesn't work in pure/@safe/nothrow code because of
- Andrei Alexandrescu (4/47) Jan 23 2015 Clever! Or, conversely, I'm not that bright! Yes, this is awesome.
- H. S. Teoh via Digitalmars-d (28/44) Jan 23 2015 [...]
- Andrei Alexandrescu (4/7) Jan 23 2015 open https://github.com/D-Programming-Language/phobos/pulls
- H. S. Teoh via Digitalmars-d (13/21) Jan 23 2015 [...]
- =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (3/6) Jan 23 2015 Haha, that's funny!
- Ary Borenszweig (11/39) Jan 23 2015 In most languages group by yields a tuple of {group key, group values}.
- Andrei Alexandrescu (8/9) Jan 23 2015 Interesting, thanks. Looks like we're at a net loss of information with
- Laeeth Isharc (23/33) Jan 23 2015 groupby hack below ? I haven't yet read the source code and
- bearophile (5/7) Jan 23 2015 I'm saying this since some years... (and those languages probably
- Andrei Alexandrescu (3/7) Jan 23 2015 At some point in the coming years the pain is bound to become unbearable...
- H. S. Teoh via Digitalmars-d (7/18) Jan 23 2015 ^^^^
- Andrei Alexandrescu (2/16) Jan 23 2015 I knew that pun won't go unremarked! :o) -- Andrei
- "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> (7/64) Jan 23 2015 You are talking about two different functions here. group by and
- H. S. Teoh via Digitalmars-d (11/14) Jan 23 2015 [...]
- Andrei Alexandrescu (4/15) Jan 23 2015 We already have partition() functions that actually partition a range
- Ary Borenszweig (3/20) Jan 24 2015 Another name might be chunkBy: it returns chunks that are grouped by
- H. S. Teoh via Digitalmars-d (5/27) Jan 24 2015 Incidentally, that was the original name I implemented it under.
- Olivier Grant (11/14) Jan 25 2015 In ruby, the closest to D's currently-named groupBy method is a
- MattCoder (9/25) Jan 23 2015 Sorry if this a dumb question, but since you're grouping an array
- H. S. Teoh via Digitalmars-d (25/56) Jan 23 2015 [...]
- bearophile (5/14) Jan 23 2015 Let's allocate, creating an associative array inside the grouping
- Ary Borenszweig (3/13) Jan 24 2015 All languages I know do this for `group by` (because of the complexity
- MattCoder (3/37) Jan 23 2015 Alright and thanks for the whole explanation.
- Andrei Alexandrescu (2/29) Jan 23 2015 Yah, that would be partition(). -- Andrei
- Russel Winder via Digitalmars-d (18/31) Jan 26 2015 On Fri, 2015-01-23 at 10:08 -0800, Andrei Alexandrescu via Digitalmars-d
- bearophile (4/6) Jan 26 2015 Nope...
- H. S. Teoh via Digitalmars-d (8/13) Jan 26 2015 [...]
- "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> (5/15) Jan 26 2015 Andrei had a point about `partition` being used already. I liked
- "Ola Fosheim =?UTF-8?B?R3LDuHN0YWQi?= (3/7) Jan 26 2015 Does it return slices?
- Andrei Alexandrescu (7/17) Jan 26 2015 My suggestion was to keep the name but change the code of your groupBy
- Ary Borenszweig (5/25) Jan 26 2015 That's much more harder to implement than what it does right now. I
- Andrei Alexandrescu (5/35) Jan 26 2015 The implementation right now is quite interesting but not complicated,
- zeljkog (3/44) Jan 26 2015 I think std.experimental.algorithm.groupBy is one option.
- Phil (2/2) Jan 26 2015 I also found the behaviour confusing given the name. I like
- H. S. Teoh via Digitalmars-d (22/39) Jan 26 2015 Huh, what? I think there's some misunderstanding here. The unary version
- Andrei Alexandrescu (23/59) Jan 26 2015 Here's how. Basically the binary-predicate version has only Boolean keys...
- "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> (28/31) Jan 26 2015 The current implementation has a certain beauty and does not lend
- Laeeth Isharc (28/36) Jan 26 2015 T: you are good with algorithms. In many applications you have a
- H. S. Teoh via Digitalmars-d (58/63) Jan 26 2015 [...]
- Andrei Alexandrescu (18/20) Jan 26 2015 I agree. That said, perhaps a change of name would avoid confusion. I
- Dicebot (6/6) Jan 26 2015 Don't forget that there is already
- "Ulrich =?UTF-8?B?S8O8dHRsZXIi?= <kuettler gmail.com> (5/11) Jan 26 2015 This seems valid to me. I'd suggest to keep the name and
- Laeeth Isharc (41/52) Jan 26 2015 Thank you for the thoughtful reply.
- Laeeth Isharc (15/19) Jan 26 2015 Read the docs now - they are perfect within the context of the
- Russel Winder via Digitalmars-d (17/29) Jan 26 2015 What name do you think works for this operation?
So H.S. Teoh awesomely took https://github.com/D-Programming-Language/phobos/pull/2878 to completion. We now have a working and fast relational "group by" facility. See it at work! ---- void main() { import std.algorithm, std.stdio; [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .writeln; } ---- [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]] The next step is to define an aggregate() function, which is a lot similar to reduce() but works on ranges of ranges and aggregates a function over each group. Continuing the previous example: [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!max .writeln; should print: [453, 600, 929, 812, 529, 768] The aggregate function should support aggregating several functions at once, e.g. aggregate!(min, max) etc. Takers? Andrei
Jan 23 2015
On Fri, 23 Jan 2015 10:08:30 -0800, Andrei Alexandrescu wrote:So H.S. Teoh awesomely took https://github.com/D-Programming-Language/phobos/pull/2878 to completion. We now have a working and fast relational "group by" facility.This is great news. It seems like every time I make use of component programming, I need groupBy at least once. I have a D file with an old copy of a groupBy implementation (I think it's Andrei's original stab at it) and it gets copied around to the various projects.
Jan 23 2015
On Fri, Jan 23, 2015 at 10:08:30AM -0800, Andrei Alexandrescu via Digitalmars-d wrote:So H.S. Teoh awesomely took https://github.com/D-Programming-Language/phobos/pull/2878 to completion. We now have a working and fast relational "group by" facility.Unfortunately it doesn't work in pure/ safe/nothrow code because of limitations in the current RefCounted implementation. [...]The next step is to define an aggregate() function, which is a lot similar to reduce() but works on ranges of ranges and aggregates a function over each group. Continuing the previous example: [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!max .writeln; should print: [453, 600, 929, 812, 529, 768] The aggregate function should support aggregating several functions at once, e.g. aggregate!(min, max) etc. Takers?[...] Isn't that just a simple matter of defining aggregate() in terms of map() and reduce()? Working example: import std.algorithm.comparison : max; import std.algorithm.iteration; import std.stdio; auto aggregate(alias func, RoR)(RoR ror) { return ror.map!(reduce!func); } void main() { [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!max .writeln; } Output is as expected. T -- Verbing weirds language. -- Calvin (& Hobbes)
Jan 23 2015
On 1/23/15 10:29 AM, H. S. Teoh via Digitalmars-d wrote:On Fri, Jan 23, 2015 at 10:08:30AM -0800, Andrei Alexandrescu via Digitalmars-d wrote:Clever! Or, conversely, I'm not that bright! Yes, this is awesome. Probably the actual name "aggregate" should be defined even with that trivial implementation to help folks like me :o). -- AndreiSo H.S. Teoh awesomely took https://github.com/D-Programming-Language/phobos/pull/2878 to completion. We now have a working and fast relational "group by" facility.Unfortunately it doesn't work in pure/ safe/nothrow code because of limitations in the current RefCounted implementation. [...]The next step is to define an aggregate() function, which is a lot similar to reduce() but works on ranges of ranges and aggregates a function over each group. Continuing the previous example: [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!max .writeln; should print: [453, 600, 929, 812, 529, 768] The aggregate function should support aggregating several functions at once, e.g. aggregate!(min, max) etc. Takers?[...] Isn't that just a simple matter of defining aggregate() in terms of map() and reduce()? Working example: import std.algorithm.comparison : max; import std.algorithm.iteration; import std.stdio; auto aggregate(alias func, RoR)(RoR ror) { return ror.map!(reduce!func); } void main() { [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!max .writeln; } Output is as expected.
Jan 23 2015
On Fri, Jan 23, 2015 at 10:29:13AM -0800, H. S. Teoh via Digitalmars-d wrote:On Fri, Jan 23, 2015 at 10:08:30AM -0800, Andrei Alexandrescu via Digitalmars-d wrote:[...][...] Here's a working variadic implementation: import std.algorithm.comparison : max, min; import std.algorithm.iteration; import std.stdio; template aggregate(funcs...) { auto aggregate(RoR)(RoR ror) { return ror.map!(reduce!funcs); } } void main() { [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!(max,min) .writeln; } Output (kinda ugly, but it works): [Tuple!(int, int)(453, 293), Tuple!(int, int)(600, 600), Tuple!(int, int)(929, 339), Tuple!(int, int)(812, 222), Tuple!(int, int)(529, 529), Tuple!(int, int)(768, 768)] Of course, it will require a little more polish before merging into Phobos, but the core implementation is nowhere near the complexity of groupBy. T -- The best compiler is between your ears. -- Michael AbrashThe next step is to define an aggregate() function, which is a lot similar to reduce() but works on ranges of ranges and aggregates a function over each group. Continuing the previous example: [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!max .writeln; should print: [453, 600, 929, 812, 529, 768] The aggregate function should support aggregating several functions at once, e.g. aggregate!(min, max) etc.
Jan 23 2015
On 1/23/15 10:34 AM, H. S. Teoh via Digitalmars-d wrote:Of course, it will require a little more polish before merging into Phobos, but the core implementation is nowhere near the complexity of groupBy.open https://github.com/D-Programming-Language/phobos/pulls [F5]... [F5]... [F5]... Andrei
Jan 23 2015
On Fri, Jan 23, 2015 at 10:47:28AM -0800, Andrei Alexandrescu via Digitalmars-d wrote:On 1/23/15 10:34 AM, H. S. Teoh via Digitalmars-d wrote:[...] void main() { foreach (iota(0 .. 60 * 60 * F5sPerSecond)) writeln("[F5]..."); writeln(q"ENDMSG https://github.com/D-Programming-Language/phobos/pull/2899 ENDMG); } ;-) T -- No! I'm not in denial!Of course, it will require a little more polish before merging into Phobos, but the core implementation is nowhere near the complexity of groupBy.open https://github.com/D-Programming-Language/phobos/pulls [F5]... [F5]... [F5]...
Jan 23 2015
On Friday, 23 January 2015 at 18:47:29 UTC, Andrei Alexandrescu wrote:open https://github.com/D-Programming-Language/phobos/pulls [F5]... [F5]... [F5]... AndreiHaha, that's funny!
Jan 23 2015
On 1/23/15 3:08 PM, Andrei Alexandrescu wrote:So H.S. Teoh awesomely took https://github.com/D-Programming-Language/phobos/pull/2878 to completion. We now have a working and fast relational "group by" facility. See it at work! ---- void main() { import std.algorithm, std.stdio; [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .writeln; } ---- [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]] The next step is to define an aggregate() function, which is a lot similar to reduce() but works on ranges of ranges and aggregates a function over each group. Continuing the previous example: [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!max .writeln; should print: [453, 600, 929, 812, 529, 768] The aggregate function should support aggregating several functions at once, e.g. aggregate!(min, max) etc. Takers? AndreiIn most languages group by yields a tuple of {group key, group values}. For example (Ruby or Crystal): a = [1, 4, 2, 4, 5, 2, 3, 7, 9] groups = a.group_by { |x| x % 3 } Java: http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function- SQL: http://www.w3schools.com/sql/sql_groupby.asp So I'm not sure "groupBy" is a good name for this.
Jan 23 2015
On 1/23/15 12:19 PM, Ary Borenszweig wrote:In most languages group by yields a tuple of {group key, group values}.Interesting, thanks. Looks like we're at a net loss of information with our current approach. quickfur, do you think you could expose a tuple with "key" and "values"? The former would be the function value, the latter would be what we offer right now. That would apply only to the unary version of groupBy. Andrei
Jan 23 2015
On Friday, 23 January 2015 at 20:28:32 UTC, Andrei Alexandrescu wrote:On 1/23/15 12:19 PM, Ary Borenszweig wrote:groupby hack below ? I haven't yet read the source code and don't feel I understand ranges deeply enough to know if this will work in the general case. But it at least works for the example (I think). Laeeth. void main() { import std.algorithm, std.stdio, std.range; auto index=[293, 453, 600, 929, 339, 812, 222, 680, 529, 768]; auto vals=[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; auto zippy=zip(index,vals); zippy.groupBy!(a=> a[0] & 1) .writeln; } [[Tuple!(int, int)(293, 1), Tuple!(int, int)(453, 2)], [Tuple!(int, int)(600, 3)], [Tuple!(int, int)(929, 4), Tuple!(int, int)(339, 5)], [Tuple!(int, int)(812, 6), Tuple!(int, int)(222, 7), Tuple!(int, int)(680, 8)], [Tuple!(int, int)(529, 9)], [Tuple!(int, int)(768, 10)]]In most languages group by yields a tuple of {group key, group values}.Interesting, thanks. Looks like we're at a net loss of information with our current approach. quickfur, do you think you could expose a tuple with "key" and "values"? The former would be the function value, the latter would be what we offer right now. That would apply only to the unary version of groupBy. Andrei
Jan 23 2015
Ary Borenszweig:In most languages group by yields a tuple of {group key, group values}.I'm saying this since some years... (and those languages probably don't use sorting to perform the aggregation). Bye, bearophile
Jan 23 2015
On 1/23/15 12:38 PM, bearophile wrote:Ary Borenszweig:At some point in the coming years the pain is bound to become unbearable so you'll give the gift of a pull request. -- AndreiIn most languages group by yields a tuple of {group key, group values}.I'm saying this since some years... (and those languages probably don't use sorting to perform the aggregation).
Jan 23 2015
On Fri, Jan 23, 2015 at 12:40:01PM -0800, Andrei Alexandrescu via Digitalmars-d wrote:On 1/23/15 12:38 PM, bearophile wrote:^^^^ (!!!) *cowers and hides from the rotten tomatoes...* T -- It won't be covered in the book. The source code has to be useful for something, after all. -- Larry WallAry Borenszweig:At some point in the coming years the pain is bound to become unbearable so you'll give the gift of a pull request. -- AndreiIn most languages group by yields a tuple of {group key, group values}.I'm saying this since some years... (and those languages probably don't use sorting to perform the aggregation).
Jan 23 2015
On 1/23/15 1:34 PM, H. S. Teoh via Digitalmars-d wrote:On Fri, Jan 23, 2015 at 12:40:01PM -0800, Andrei Alexandrescu via Digitalmars-d wrote:I knew that pun won't go unremarked! :o) -- AndreiOn 1/23/15 12:38 PM, bearophile wrote:^^^^ (!!!)Ary Borenszweig:At some point in the coming years the pain is bound to become unbearable so you'll give the gift of a pull request. -- AndreiIn most languages group by yields a tuple of {group key, group values}.I'm saying this since some years... (and those languages probably don't use sorting to perform the aggregation).
Jan 23 2015
On Friday, 23 January 2015 at 20:19:31 UTC, Ary Borenszweig wrote:On 1/23/15 3:08 PM, Andrei Alexandrescu wrote:You are talking about two different functions here. group by and partition by. The function that has been implemented is often called partition by. The best example I know of: https://clojuredocs.org/clojure.core/group-by https://clojuredocs.org/clojure.core/partition-bySo H.S. Teoh awesomely took https://github.com/D-Programming-Language/phobos/pull/2878 to completion. We now have a working and fast relational "group by" facility. See it at work! ---- void main() { import std.algorithm, std.stdio; [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .writeln; } ---- [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]] The next step is to define an aggregate() function, which is a lot similar to reduce() but works on ranges of ranges and aggregates a function over each group. Continuing the previous example: [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .aggregate!max .writeln; should print: [453, 600, 929, 812, 529, 768] The aggregate function should support aggregating several functions at once, e.g. aggregate!(min, max) etc. Takers? AndreiIn most languages group by yields a tuple of {group key, group values}. For example (Ruby or Crystal): a = [1, 4, 2, 4, 5, 2, 3, 7, 9] groups = a.group_by { |x| x % 3 } http://www.dotnetperls.com/groupby Java: http://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function- SQL: http://www.w3schools.com/sql/sql_groupby.asp So I'm not sure "groupBy" is a good name for this.
Jan 23 2015
On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote: [...]You are talking about two different functions here. group by and partition by. The function that has been implemented is often called partition by.[...] It's not too late to rename it, since we haven't released it yet. We still have a little window of time to make this change if necessary. Andrei? Returning each group as a tuple sounds like a distinct, albeit related, function. It can probably be added separately. T -- Why waste time reinventing the wheel, when you could be reinventing the engine? -- Damian Conway
Jan 23 2015
On 1/23/15 1:36 PM, H. S. Teoh via Digitalmars-d wrote:On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote: [...]We already have partition() functions that actually partition a range into two subranges, so adding partitionBy with a different meaning may be confusing. -- AndreiYou are talking about two different functions here. group by and partition by. The function that has been implemented is often called partition by.[...] It's not too late to rename it, since we haven't released it yet. We still have a little window of time to make this change if necessary. Andrei? Returning each group as a tuple sounds like a distinct, albeit related, function. It can probably be added separately.
Jan 23 2015
On 1/23/15 8:54 PM, Andrei Alexandrescu wrote:On 1/23/15 1:36 PM, H. S. Teoh via Digitalmars-d wrote:Another name might be chunkBy: it returns chunks that are grouped by some logic.On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote: [...]We already have partition() functions that actually partition a range into two subranges, so adding partitionBy with a different meaning may be confusing. -- AndreiYou are talking about two different functions here. group by and partition by. The function that has been implemented is often called partition by.[...] It's not too late to rename it, since we haven't released it yet. We still have a little window of time to make this change if necessary. Andrei? Returning each group as a tuple sounds like a distinct, albeit related, function. It can probably be added separately.
Jan 24 2015
On Sun, Jan 25, 2015 at 01:39:59AM -0300, Ary Borenszweig via Digitalmars-d wrote:On 1/23/15 8:54 PM, Andrei Alexandrescu wrote:Incidentally, that was the original name I implemented it under. T -- What do you get if you drop a piano down a mineshaft? A flat minor.On 1/23/15 1:36 PM, H. S. Teoh via Digitalmars-d wrote:Another name might be chunkBy: it returns chunks that are grouped by some logic.On Fri, Jan 23, 2015 at 08:44:05PM +0000, via Digitalmars-d wrote: [...]We already have partition() functions that actually partition a range into two subranges, so adding partitionBy with a different meaning may be confusing. -- AndreiYou are talking about two different functions here. group by and partition by. The function that has been implemented is often called partition by.[...] It's not too late to rename it, since we haven't released it yet. We still have a little window of time to make this change if necessary. Andrei? Returning each group as a tuple sounds like a distinct, albeit related, function. It can probably be added separately.
Jan 24 2015
We already have partition() functions that actually partition a range into two subranges, so adding partitionBy with a different meaning may be confusing. -- AndreiIn ruby, the closest to D's currently-named groupBy method is a set of three methods: slice_before slice_after slice_when http://ruby-doc.org/core-2.2.0/Enumerable.html#method-i-slice_when Your example in ruby would be: 2.2.0 > [293, 453, 600, 929, 339, 812, 222, 680, 529, 768].slice_when { |x,y| x & 1 != y & 1 }.to_a => [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]] O.
Jan 25 2015
On Friday, 23 January 2015 at 18:08:30 UTC, Andrei Alexandrescu wrote:So H.S. Teoh awesomely took https://github.com/D-Programming-Language/phobos/pull/2878 to completion. We now have a working and fast relational "group by" facility. See it at work! ---- void main() { import std.algorithm, std.stdio; [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .writeln; } ---- [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]Sorry if this a dumb question, but since you're grouping an array according some rule, this shouldn't be: [293, 453, 929, 339, 529][600, 812, 222, 680, 768] ? Because then you have the array of "trues" and "falses" according the condition (a & 1). Matheus.
Jan 23 2015
On Fri, Jan 23, 2015 at 09:56:11PM +0000, MattCoder via Digitalmars-d wrote:On Friday, 23 January 2015 at 18:08:30 UTC, Andrei Alexandrescu wrote:[...] It's kind of a misnomer, because it actually only considers *consecutive runs* of equivalent elements; it doesn't look at the whole range before deciding what goes in which group. So technically it should be called consecutiveGroupBy or consecutivePartitionBy, but that's too big a mouthful to be a usable name. What you describe could be an interesting candidate to add, though. It could iterate over distinct values of the predicate, and traverse the forward range (input ranges obviously can't work unless you allocate, which makes it no longer lazy) each time. This, however, has O(n*k) complexity where k is the number of distinct predicate values. If it's anything bigger than bool or a relatively small enum, it would be impractical. Moreover, the requirement to traverse the range multiple times kinda sux; you might as well just call filter() with different expected values yourself, in a loop. In fact, you could ostensibly implement it something along these lines: auto globalPartition(alias pred, R)(R range) { alias Values = typeof(pred(range.front, range.front)); return iota(Values.min, Values.max) .map!(v => range.save.filter!(pred(e) == v)); } T -- Who told you to swim in Crocodile Lake without life insurance??So H.S. Teoh awesomely took https://github.com/D-Programming-Language/phobos/pull/2878 to completion. We now have a working and fast relational "group by" facility. See it at work! ---- void main() { import std.algorithm, std.stdio; [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .writeln; } ---- [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]Sorry if this a dumb question, but since you're grouping an array according some rule, this shouldn't be: [293, 453, 929, 339, 529][600, 812, 222, 680, 768] ? Because then you have the array of "trues" and "falses" according the condition (a & 1).
Jan 23 2015
H. S. Teoh:What you describe could be an interesting candidate to add, though. It could iterate over distinct values of the predicate, and traverse the forward range (input ranges obviously can't work unless you allocate, which makes it no longer lazy) each time. This, however, has O(n*k) complexity where k is the number of distinct predicate values.Let's allocate, creating an associative array inside the grouping function :-) Bye, bearophile
Jan 23 2015
On 1/23/15 7:30 PM, bearophile wrote:H. S. Teoh:All languages I know do this for `group by` (because of the complexity involved), and I think it's ok to do so.What you describe could be an interesting candidate to add, though. It could iterate over distinct values of the predicate, and traverse the forward range (input ranges obviously can't work unless you allocate, which makes it no longer lazy) each time. This, however, has O(n*k) complexity where k is the number of distinct predicate values.Let's allocate, creating an associative array inside the grouping function :-) Bye, bearophile
Jan 24 2015
On Friday, 23 January 2015 at 22:20:19 UTC, H. S. Teoh wrote:It's kind of a misnomer, because it actually only considers *consecutive runs* of equivalent elements; it doesn't look at the whole range before deciding what goes in which group. So technically it should be called consecutiveGroupBy or consecutivePartitionBy, but that's too big a mouthful to be a usable name. What you describe could be an interesting candidate to add, though. It could iterate over distinct values of the predicate, and traverse the forward range (input ranges obviously can't work unless you allocate, which makes it no longer lazy) each time. This, however, has O(n*k) complexity where k is the number of distinct predicate values. If it's anything bigger than bool or a relatively small enum, it would be impractical. Moreover, the requirement to traverse the range multiple times kinda sux; you might as well just call filter() with different expected values yourself, in a loop. In fact, you could ostensibly implement it something along these lines: auto globalPartition(alias pred, R)(R range) { alias Values = typeof(pred(range.front, range.front)); return iota(Values.min, Values.max) .map!(v => range.save.filter!(pred(e) == v)); } ...Alright and thanks for the whole explanation. Matheus.
Jan 23 2015
On 1/23/15 1:56 PM, MattCoder wrote:On Friday, 23 January 2015 at 18:08:30 UTC, Andrei Alexandrescu wrote:Yah, that would be partition(). -- AndreiSo H.S. Teoh awesomely took https://github.com/D-Programming-Language/phobos/pull/2878 to completion. We now have a working and fast relational "group by" facility. See it at work! ---- void main() { import std.algorithm, std.stdio; [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .writeln; } ---- [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]]Sorry if this a dumb question, but since you're grouping an array according some rule, this shouldn't be: [293, 453, 929, 339, 529][600, 812, 222, 680, 768] ? Because then you have the array of "trues" and "falses" according the condition (a & 1).
Jan 23 2015
On Fri, 2015-01-23 at 10:08 -0800, Andrei Alexandrescu via Digitalmars-d wrote: […]void main() { import std.algorithm, std.stdio; [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => a & 1) .writeln; } ---- [[293, 453], [600], [929, 339], [812, 222, 680], [529], [768]][…] I think I must be missing something, for me the result of a groupBy operation on the above input data should be: [1:[293, 453, 929, 339, 529], 0:[600, 812, 222, 680, 768]] i.e. a map with keys being the cases and values being the values that meet the case. In this example a & 1 asks for cases "lowest bit 0 or 1" aka "odd or even". There is nothing wrong with the semantics of the result above, but is it's name "group by" as understood by the rest of the world? -- Russel. ============================================================================= Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.net 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Jan 26 2015
Russel Winder:but is it's name "group by" as understood by the rest of the world?Nope... Bye, bearophile
Jan 26 2015
On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:Russel Winder:[...] I proposed to rename it but it got shot down. *shrug* We still have a short window of time to sort this out, before 2.067 is released... T -- Don't drink and derive. Alcohol and algebra don't mix.but is it's name "group by" as understood by the rest of the world?Nope...
Jan 26 2015
On Monday, 26 January 2015 at 16:13:40 UTC, H. S. Teoh wrote:On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:Andrei had a point about `partition` being used already. I liked Oliver's suggestion to go with slice-something. `sliceBy` might be worth considering. It even hints at the (efficient) implementation.Russel Winder:[...] I proposed to rename it but it got shot down. *shrug*but is it's name "group by" as understood by the rest of the world?Nope...
Jan 26 2015
On Monday, 26 January 2015 at 16:44:20 UTC, Ulrich Küttler wrote:Andrei had a point about `partition` being used already. I liked Oliver's suggestion to go with slice-something. `sliceBy` might be worth considering. It even hints at the (efficient) implementation.Does it return slices? If not, pick a different verb, e.g. "split".
Jan 26 2015
On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:My suggestion was to keep the name but change the code of your groupBy implementation to return tuple(key, lazyValues) instead of just lazyValues. That needs to happen only for binary predicates; unary predicates will all have alternating true/false keys. Seems that would please everyone. AndreiRussel Winder:[...] I proposed to rename it but it got shot down. *shrug* We still have a short window of time to sort this out, before 2.067 is released...but is it's name "group by" as understood by the rest of the world?Nope...
Jan 26 2015
On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:That's much more harder to implement than what it does right now. I don't know how you'll manage to do the lazyValues thing: you'd need to make multiple passes in the range. Again, other languages return an associative array in this case.On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:My suggestion was to keep the name but change the code of your groupBy implementation to return tuple(key, lazyValues) instead of just lazyValues. That needs to happen only for binary predicates; unary predicates will all have alternating true/false keys. Seems that would please everyone. AndreiRussel Winder:[...] I proposed to rename it but it got shot down. *shrug* We still have a short window of time to sort this out, before 2.067 is released...but is it's name "group by" as understood by the rest of the world?Nope...
Jan 26 2015
On 1/26/15 9:50 AM, Ary Borenszweig wrote:On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:The implementation right now is quite interesting but not complicated, and achieves lazy grouping in a single pass.On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:That's much more harder to implement than what it does right now. I don't know how you'll manage to do the lazyValues thing: you'd need to make multiple passes in the range.On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:My suggestion was to keep the name but change the code of your groupBy implementation to return tuple(key, lazyValues) instead of just lazyValues. That needs to happen only for binary predicates; unary predicates will all have alternating true/false keys. Seems that would please everyone. AndreiRussel Winder:[...] I proposed to rename it but it got shot down. *shrug* We still have a short window of time to sort this out, before 2.067 is released...but is it's name "group by" as understood by the rest of the world?Nope...Again, other languages return an associative array in this case.I think our approach is superior. Andrei
Jan 26 2015
On 26.01.15 19:05, Andrei Alexandrescu wrote:On 1/26/15 9:50 AM, Ary Borenszweig wrote:I think std.experimental.algorithm.groupBy is one option. To postpone thing.On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:The implementation right now is quite interesting but not complicated, and achieves lazy grouping in a single pass.On 1/26/15 8:11 AM, H. S. Teoh via Digitalmars-d wrote:That's much more harder to implement than what it does right now. I don't know how you'll manage to do the lazyValues thing: you'd need to make multiple passes in the range.On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:My suggestion was to keep the name but change the code of your groupBy implementation to return tuple(key, lazyValues) instead of just lazyValues. That needs to happen only for binary predicates; unary predicates will all have alternating true/false keys. Seems that would please everyone. AndreiRussel Winder:[...] I proposed to rename it but it got shot down. *shrug* We still have a short window of time to sort this out, before 2.067 is released...but is it's name "group by" as understood by the rest of the world?Nope...Again, other languages return an associative array in this case.I think our approach is superior. Andrei
Jan 26 2015
I also found the behaviour confusing given the name. I like ChunkBy.
Jan 26 2015
On Mon, Jan 26, 2015 at 02:50:16PM -0300, Ary Borenszweig via Digitalmars-d wrote:On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:[...]On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:Russel Winder:but is it's name "group by" as understood by the rest of the world?Nope...Huh, what? I think there's some misunderstanding here. The unary version of the current groupBy translates to a binary predicate: groupBy!(a => a.field) is equivalent to: groupBy!((a, b) => a.field == b.field) I don't see how this has anything to do with alternating keys. [...]My suggestion was to keep the name but change the code of your groupBy implementation to return tuple(key, lazyValues) instead of just lazyValues. That needs to happen only for binary predicates; unary predicates will all have alternating true/false keys.That's much more harder to implement than what it does right now. I don't know how you'll manage to do the lazyValues thing: you'd need to make multiple passes in the range. Again, other languages return an associative array in this case.I think we're talking past each other here. What groupBy currently does is to group elements by evaluating the predicate on *consecutive runs* of elements. What some people seem to demand is a function that groups elements by *global evaluation* of the predicate over all elements. These two are similar but divergent functions, and conflating them is not helping this discussion in any way. If "group by" in other languages refers to the latter function, then that means "groupBy" is poorly-named and we need to come up with a better name for it. Changing it to return tuples and what-not seems to be beating around the bush to me. T -- Computers are like a jungle: they have monitor lizards, rams, mice, c-moss, binary trees... and bugs.
Jan 26 2015
On 1/26/15 10:11 AM, H. S. Teoh via Digitalmars-d wrote:On Mon, Jan 26, 2015 at 02:50:16PM -0300, Ary Borenszweig via Digitalmars-d wrote:Here's how. Basically the binary-predicate version has only Boolean keys that may be false or true. They will alternate because it's the change that triggers creation of a new group. In this example: [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!((a, b) => (a & 3) == (b & 3)) the groupBy function has no information about the result of a & 3. All it "sees" is the result of the predicate: true, false, true, false... HOWEVER, if you write it like this: [293, 453, 600, 929, 339, 812, 222, 680, 529, 768] .groupBy!(a => (a & 3)) then groupBy sees the actual value of the function and can emit the proper key. So the key (ahem) here is to make groupBy with unary predicate different from groupBy with binary predicate. The former returns the tuple, the latter is unchanged. Makes sense?On 1/26/15 2:34 PM, Andrei Alexandrescu wrote:[...]On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wrote:Russel Winder:but is it's name "group by" as understood by the rest of the world?Nope...Huh, what? I think there's some misunderstanding here. The unary version of the current groupBy translates to a binary predicate: groupBy!(a => a.field) is equivalent to: groupBy!((a, b) => a.field == b.field) I don't see how this has anything to do with alternating keys.My suggestion was to keep the name but change the code of your groupBy implementation to return tuple(key, lazyValues) instead of just lazyValues. That needs to happen only for binary predicates; unary predicates will all have alternating true/false keys.[...]Agreed.That's much more harder to implement than what it does right now. I don't know how you'll manage to do the lazyValues thing: you'd need to make multiple passes in the range. Again, other languages return an associative array in this case.I think we're talking past each other here. What groupBy currently does is to group elements by evaluating the predicate on *consecutive runs* of elements. What some people seem to demand is a function that groups elements by *global evaluation* of the predicate over all elements. These two are similar but divergent functions, and conflating them is not helping this discussion in any way.If "group by" in other languages refers to the latter function, then that means "groupBy" is poorly-named and we need to come up with a better name for it. Changing it to return tuples and what-not seems to be beating around the bush to me.I like our notion of groupBy the same way I like the notion that something must be a random-access range in order to be sorted. (Other languages give the illusion they sort streams by internally converting them to arrays.) D offers better control, better flexibility, and richer semantics. Andrei
Jan 26 2015
On Monday, 26 January 2015 at 18:31:05 UTC, Andrei Alexandrescu wrote:So the key (ahem) here is to make groupBy with unary predicate different from groupBy with binary predicate. The former returns the tuple, the latter is unchanged. Makes sense?The current implementation has a certain beauty and does not lend itself easily to a split in two different versions. Also, the additional key return value might be unnecessary and unexpected in many cases. You can easily build the extended groupBy function out of the current one, if you only calculate the predicate beforehand. If the predicate is expensive, this might be sensible anyway. auto groupByStar(alias pred, Range)(Range r) { return r.map!(pred, "a") .groupBy!(a => a[0]) .map!(inner => tuple(inner.front[0], inner.map!"a[1]")) ; } void main() { auto a = iota(10); foreach (r; a.groupByStar!"a/3") { writeln(r[0], " ", r[1]); } } I do not know if this solution is any better. The above might be a terrible idea. Note: I failed to compile `r.groupBy!"a[0]"` with iteration.d(1648): Error: function expected before (), not "a[0]" of type string
Jan 26 2015
If "group by" in other languages refers to the latter function, then that means "groupBy" is poorly-named and we need to come up with a better name for it. Changing it to return tuples and what-not seems to be beating around the bush to me. TT: you are good with algorithms. In many applications you have a bunch of results and want to summarise them. This is often what the corporate manager is doing with Excel pivot tables, and it is what the groupby function is used for in pandas. See here for a simple tutorial. http://wesmckinney.com/blog/?p=125 And here for a summary of what pandas can do with data: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html Is there any reason why we shouldn't add to Phobos: median, ranking, stddev, variance, correlation, covariance, skew, kurtosis, quantile, moving average, exp mov average, rolling window (see pandas)? I personally am fine with the implementation we have (although as Ray Dalio would say. I haven't yet earned the right that you should care what I think). All that it means is that you need to sort on multi key your results first before passing to groupby. My question is how much is lost by doing it in two steps (sort, groupby) rather than one. I don't think all that much, but it is not my field, I am also not that bothered, because this comes at the end of processing, not within the inner loop, so for me I don't think it makes a difference for now. If data sets reach commoncrawl type sizes then it might be different, although I will take D over java any day, warts and all. In any case, the documentation should be very clear on what groupby does, and how the user can do what he might be expecting to achieve, coming from a different framework. It would be interesting to benchmark D against pandas (which is implemented in cython for the key bits) and see how we look.
Jan 26 2015
On Mon, Jan 26, 2015 at 07:59:10PM +0000, Laeeth Isharc via Digitalmars-d wrote: [...]My question is how much is lost by doing it in two steps (sort, groupby) rather than one.[...] I'm not sure I understand what you mean by "lost"... I assume you're not talking about data loss, since that's not applicable either way. In any case, as with all things, there's a tradeoff. Building a hash table of groups, as some people demand, has the advantage that the grouping predicate is applied globally, and no sorting (O(n log n)) is needed. The disadvantage is that the entire data set has to fit in memory, and you cannot process it lazily. Before you've seen all of the data, you cannot know whether there are more distinct groups to come, or whether currently known groups have more members. Once the data size gets too large, you run into trouble. Plus, some people are adverse to gratuitous GC use in Phobos algorithms. Perhaps some of this is misplaced, but that's the perception. The current implementation has the advantage that it requires very little memory to work, and can process data lazily. It only needs to see a tiny portion of the entire data set to do its work. Each group returned is also lazy, so it doesn't need to store the whole group in its entirety. It can handle 10GB long groups in a 100GB data set streamed over the network with very little memory, whereas such a monstrous thing would be infeasibly resource-hungry if you need to allocate an on-disk hashtable (very inefficient!), iterate over the entire dataset, and then be I/O bound as you iterate over each group loading group members from disk. The big disadvantage, however, is that if your data is not sorted, then the groupings returned won't be quite what you'd expect, since the predicate is evaluated only over consecutive runs of elements, not globally over the entire data set. Which one is better? That depends on your specific application and its needs. For some applications, there is simply no way around the fact that you have a large dataset with randomly-ordered elements, so the sorting has to be done *somewhere*. For other applications, you *can* make stream your data in sorted order -- or perhaps it doesn't care if groupings aren't global -- so you can take advantage of the fact that the current groupBy is extremely cheap. It requires minimal memory to do its work, and it can do so inside a pipeline without requiring anything more than an input range. Basically, std.algorithm is (currently) primarily focused on linear algorithms. While there *are* some sorting algorithms and sublinear algorithms that take advantage of sorted data, it is still primarily focused on linear processing. There is no good abstraction as yet for more complex processing like multidimensional or graph traversals. As a result, it is best suited for stream-based processing. Generally speaking, most application code is primarily concerned with linear processing (copy this text from here to there, scan this text for some linear string AKA search keyword, render this line of text to the screen, draw this line around the window, move this item along in the processing queue, etc.). Non-linear computations generally take place only in dedicated computing modules of limited scope in the application, where one tends to write dedicated algorithms rather than reuse stock generic algorithms.In any case, the documentation should be very clear on what groupby does, and how the user can do what he might be expecting to achieve, coming from a different framework.[...] As far as I know, the current groupBy docs explain quite clearly what it does. If you find it still inadequate or unclear, please file bugs against it so that we can look into improving the docs. T -- GEEK = Gatherer of Extremely Enlightening Knowledge
Jan 26 2015
On 1/26/15 12:31 PM, H. S. Teoh via Digitalmars-d wrote:As far as I know, the current groupBy docs explain quite clearly what it does.I agree. That said, perhaps a change of name would avoid confusion. I like groupBy as we currently define it, it's straight within the spirit of D. However, some people may legitimately expect it does global processing and groups all identical keys together. So here's what I think we should do: 1. Rename our current groupBy to chunkBy 2. Change code to have chunkBy with binary predicates return tuples, as discussed 3. Add a method called groupEquivalent to SortedRange. It takes no predicate (because predicate already is embedded in SortedRange). It groups together elements that are equivalent according to the sorting order. Returns a range of ranges (similar to what groupBy returns now). 4. Perhaps add a global chunkEquivalent as an alias for groupBy!((a, b) => a == b). So then in order to get relational group-by semantics by sorting, one would write range.sort!(predicate).groupEquivalent... Andrei
Jan 26 2015
Don't forget that there is already groupBy in this context (only groups consequitive elements). It would be totally awkward to have those named differently (which was why I have suggested to change the name from chunkBy to groupBy in the very beginning)
Jan 26 2015
On Monday, 26 January 2015 at 20:58:36 UTC, Dicebot wrote:Don't forget that there is already groupBy in this context (only groups consequitive elements). It would be totally awkward to have those named differently (which was why I have suggested to change the name from chunkBy to groupBy in the very beginning)This seems valid to me. I'd suggest to keep the name and implementation as it is and add the "common" groupBy function as an example. https://github.com/D-Programming-Language/phobos/pull/2915
Jan 26 2015
Thank you for the thoughtful reply. I meant lost in terms of extra processing time and memory consumption in order to achieve the fairly common use case of a groupby pivot table or pandas style (ie what you get if you sort the data by group and then run D groupby) If you first sort the data, and then run groupby, then it seems likely in some cases this will be more expensive than a straight pandas style groupby., and my question was how much more expensive. If you needed to sort the data anyway, because you were producing a report, this doesn't matter. But if you didn't need it sorted because you just wanted to produce statistics on each group (eg median, inter quartile difference) then it might. I hadn't thought clearly before about in memory vs not. I guess for pandas style groupby, having to sort a dataset externally just to use groupby is a comparative disaster because it seems to involve much more I/O than is needed. If I am not mistaken, this might be one reason to have also at some point a separate function that does pandas groupby efficiently for data that may not be held in memory. Can you explain why for hashgroupby the entire dataset needs to fit in memory? If I were writing this naively, one could have an associative array of lists of relevant items hashed by group. You go through each line once, add the index to the list for that group, then at the end sort the group keys and return indices or whatever the slice equivalent is of the lines matching each group. But I may be missing some obvious points. Ie my imagined use case has large data for each line, but the hash table and lists would fit in memory. If thinking of a case where the data per line is not that large, but there are so many lines that it becomes an external sort then I see your point. However I suggest that the use case I describe - whilst not the only one - is important enough at least to briefly consider. "The big disadvantage, however,is that if your data is not sorted, then the groupings returned won't be quite what you'd expect, since the predicate is evaluated only over consecutive runs of elements, not globally over the entire data set."Well - it is a completely different function... I really do believe that pandas style dataframe type processing could be an interesting area for D, and most of what is needed to construct this is already here.As far as I know, the current groupBy docs explain quite clearly what it does. If you find it still inadequate or unclear, please file bugs against it so that we can look into improving the docs.Sorry - I had hadn't thought to check! Will do so now. Laeeth.
Jan 26 2015
As far as I know, the current groupBy docs explain quite clearly what it does. If you find it still inadequate or unclear, please file bug against it so that we can look into improving the docs.Read the docs now - they are perfect within the context of the style of documentation (and these documents themselves have certainly been improved over the past months). It is very clear to me, but I am not sure it would be to everyone I work with. The formal library description may not be the place for this, but I do think many less sophisticated readers would find the style below more accessible and easier to follow (this is not about groupBy, but a broader question of documentation to supplement the formal reference description for the standard library): http://pandas.pydata.org/pandas-docs/stable/groupby.html I appreciate that it is all very well to suggest improvements but someone actually has to do the work. Recognizing the direction of what might be helpful is perhaps a start though. I might be on my own with this view, but I don't think I am.
Jan 26 2015
On Mon, 2015-01-26 at 08:11 -0800, H. S. Teoh via Digitalmars-d wrote:On Mon, Jan 26, 2015 at 11:26:04AM +0000, bearophile via Digitalmars-d wr=ote:What name do you think works for this operation?Russel Winder: =20[...] =20 I proposed to rename it but it got shot down. *shrug*but is it's name "group by" as understood by the rest of the world?=20 Nope...We still have a short window of time to sort this out, before 2.067 is released...To be honest, given the confirmation that the semantics of this operation and that of the one other languages call groupBy are different, then 2.067 should not go out with this operation called groupBy.=20--=20 Russel. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.n= et 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
Jan 26 2015