digitalmars.D - groupBy predicates: unary, binary, or both?
- Andrei Alexandrescu (38/38) Dec 29 2013 I'm working on
- monarch_dodra (24/24) Dec 29 2013 There is precedence for this in phobos with schwatzSort: It use a
- Andrei Alexandrescu (9/23) Dec 29 2013 Hm so that's different from both my suggestions. It's also more
- bearophile (13/22) Dec 29 2013 In similar cases the simplest solution is to have two functions
- Andrei Alexandrescu (3/10) Dec 29 2013 I think hash joins should come first.
- Timon Gehr (9/21) Dec 29 2013 What hash join algorithm do you have in mind that does not use
I'm working on https://github.com/D-Programming-Language/phobos/pull/1186, which is somewhat important; "group by" is a powerful operation. Along the way, I stumbled upon an interesting issue in which I wanted to consult the community. Usually groupBy is used to find runs of equivalent elements in a range. For example: [1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 6].groupBy() yields the range [[1, 1, 1], [2, 2], [3, 3, 3], [4], [5, 5], [6]]. As is usual with Phobos algorithms, groupBy accepts a predicate. The default (as illustrated above) is "a = b", i.e. all elements in a group are equal to one another. Equality is transitive and commutative. But there are useful cases in which predicates are not commutative. Consider we want to find strictly monotonic subranges in the range. We'd write: auto r = [1, 3, 2, 4, 5, 1].groupBy!"a < b"; That should produce [[1, 3], [2, 4, 5], [1]]. For non-strict monotonic runs, the predicate would be "a <= b" etc. All that is pretty awesome. However, that makes life a bit tougher for the algorithm - it must only compare adjacent elements only. In the case of "a = b", it suffices to save the first element in a group and compare it against every other element in the group. Meanwhile, a very similar pull request (https://github.com/D-Programming-Language/phobos/pull/1453) uses unary predicates, i.e. an optional transformation function that is then used in conjunction with "==" to decide which elements belong in the same group. Unary predicates make life simpler for the algorithm (save the transform of the first element, then compare it against the transform of the next etc) and are often easier to write by the end user, too (e.g. just write "a.length" instead of "a.length == b.length" to group by length). So I was thinking to allow both cases, with the understanding that grouping by unary predicates uses "==" for comparison whereas grouping by binary predicates looks at adjacent elements to figure out group membership. That approach would, however, preclude the use of string lambdas (because deducing arity for string lambdas is possible, but quite unwieldy). What do you think? Andrei
Dec 29 2013
There is precedence for this in phobos with schwatzSort: It use a unaryFun, followed by a binaryFun. I've found this works quite well: More often than not, a unary fun is more than enough (the default "a < b"/"a == b" is fine). And in the cases where you *do* need binary comparison, then having a unary fun anywas is fine. EG: "MHklfsJKGsfd".byGroup!(std.uni.tolower, "a < b")(); I think it is both an elegant and efficient approach. That's what I'd go for. ------- I think we should handle non-commutative predicates. It's not actually *that* much more complicated. Instead of storing in each Group a value + range, simply save the range twice, and have one range be 1 step ahead of the first. I think a few phobos algos do this. minPos does something kind of similar. The "overhead" is an initial extra save yes, but the cost of each pop is to simply pop both ranges (no extra saves there). You never actually have to make element copies (which may fail if the range holds const elements to boot...). Also, if the ElementType is big, then storing an extra range has good chances to actually take up *less* room. My stance is that supporting this would be a powerful and useful tool, and I don't think there'd be that much overhead either, so I say let's go for it.
Dec 29 2013
On 12/29/13 2:20 AM, monarch_dodra wrote:There is precedence for this in phobos with schwatzSort: It use a unaryFun, followed by a binaryFun. I've found this works quite well: More often than not, a unary fun is more than enough (the default "a < b"/"a == b" is fine). And in the cases where you *do* need binary comparison, then having a unary fun anywas is fine. EG: "MHklfsJKGsfd".byGroup!(std.uni.tolower, "a < b")(); I think it is both an elegant and efficient approach. That's what I'd go for.Hm so that's different from both my suggestions. It's also more difficult to explain, though indeed there's precedent with schwartz.I think we should handle non-commutative predicates. It's not actually *that* much more complicated. Instead of storing in each Group a value + range, simply save the range twice, and have one range be 1 step ahead of the first. I think a few phobos algos do this. minPos does something kind of similar.The existing pull request already supports non-commutative predicates by means of copying the front. I'd thought about using the step-behind-range trick but that's less efficient than a copy in many cases. It's more general though as you point out, so I'll probably switch to that. Andrei
Dec 29 2013
Andrei Alexandrescu:which is somewhat important; "group by" is a powerful operation.I agree, it's a quite commonly useful operation.So I was thinking to allow both cases, with the understanding that grouping by unary predicates uses "==" for comparison whereas grouping by binary predicates looks at adjacent elements to figure out group membership. That approach would, however, preclude the use of string lambdas (because deducing arity for string lambdas is possible, but quite unwieldy).In similar cases the simplest solution is to have two functions with different (but similar) names.However, that makes life a bit tougher for the algorithm - it must only compare adjacent elements only.But another common need is to group by equivalent classes using (on LINQ and in the functions toolkit). This means that a "hashGroup" builds an associative array inside and returns it. See also: http://d.puremagic.com/issues/show_bug.cgi?id=5968 http://d.puremagic.com/issues/show_bug.cgi?id=9842 Bye, bearophile
Dec 29 2013
On 12/29/13 2:27 AM, bearophile wrote:But another common need is to group by equivalent classes using an and in the functions toolkit). This means that a "hashGroup" builds an associative array inside and returns it. See also: http://d.puremagic.com/issues/show_bug.cgi?id=5968 http://d.puremagic.com/issues/show_bug.cgi?id=9842I think hash joins should come first. Andrei
Dec 29 2013
On 12/29/2013 05:17 PM, Andrei Alexandrescu wrote:On 12/29/13 2:27 AM, bearophile wrote:What hash join algorithm do you have in mind that does not use hashGroupBy as a component? I think accumAssocArray or similar would be quite handy as well, as it is more general. A naive implementation r.hashGroupBy!f would then for example be given by: r.map!(a=>tuple(f(a),a)) // range of key-value pairs .accumAssocArray!((a,b)=>a~=b) // method of combination ((ElementType!(typeof(r))[]).init); // initial valueBut another common need is to group by equivalent classes using an and in the functions toolkit). This means that a "hashGroup" builds an associative array inside and returns it. See also: http://d.puremagic.com/issues/show_bug.cgi?id=5968 http://d.puremagic.com/issues/show_bug.cgi?id=9842I think hash joins should come first. Andrei
Dec 29 2013
I'm sorry to bump up an old thread, but whatever happened to this? I was just using 'groupby' in Python for a common data migration pattern I've developed... { y[0], tuple(y[1]) for y in groupby( some_query_func( "SELECT parent_id, ... " "ORDER BY parent_id " ), lambda x: x["parent_id"] ) } ... and I was thinking "it would be nice to have this in D, too!"
Jul 03 2014
On Thursday, 3 July 2014 at 10:48:53 UTC, w0rp wrote:y[0], tuple(y[1])That's suppose to be a : there sorry, not a comma.
Jul 03 2014