www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - how to get top N distinct elements from range?

reply "Andrea Fontana" <nospam example.com> writes:
I wonder if exists a way to get top n distinct elements from a 
range (infinite too!)

A (not efficient) way to to this is range.array.sort.uniq.take(n) 
but it's a bit overkill, it sorts elements, and of course doesn't 
work with infinite ranges. Am i missing any function?
Mar 08 2013
next sibling parent Ary Borenszweig <ary esperanto.org.ar> writes:
On 3/8/13 10:33 AM, Andrea Fontana wrote:
 I wonder if exists a way to get top n distinct elements from a range
 (infinite too!)
Top from an infinite range? o_O
Mar 08 2013
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Andrea Fontana:

 I wonder if exists a way to get top n distinct elements from a 
 range (infinite too!)

 A (not efficient) way to to this is 
 range.array.sort.uniq.take(n) but it's a bit overkill, it sorts 
 elements, and of course doesn't work with infinite ranges. Am i 
 missing any function?
What does it mean "top"? I think you are not missing functions. One solution (not tested): bool[ForeachType!(typeof(myRange))] mySet; foreach (item; myRange) { mySet[item] = true; if (mySet.length >= n) break; } auto top = mySet.byKey; Bye, bearophile
Mar 08 2013
parent "Andrea Fontana" <nospam example.com> writes:
On Friday, 8 March 2013 at 14:10:55 UTC, bearophile wrote:
 Andrea Fontana:

 I wonder if exists a way to get top n distinct elements from a 
 range (infinite too!)

 A (not efficient) way to to this is 
 range.array.sort.uniq.take(n) but it's a bit overkill, it 
 sorts elements, and of course doesn't work with infinite 
 ranges. Am i missing any function?
What does it mean "top"? I think you are not missing functions. One solution (not tested): bool[ForeachType!(typeof(myRange))] mySet; foreach (item; myRange) { mySet[item] = true; if (mySet.length >= n) break; } auto top = mySet.byKey; Bye, bearophile
Yes I wonder if a lazy function like this function was in phobos. It's quite common, and I didn't want to reinvent wheel :)
Mar 08 2013
prev sibling next sibling parent reply "jerro" <a a.com> writes:
On Friday, 8 March 2013 at 13:33:24 UTC, Andrea Fontana wrote:
 I wonder if exists a way to get top n distinct elements from a 
 range (infinite too!)
It's impossible to do that for infinite ranges
 A (not efficient) way to to this is 
 range.array.sort.uniq.take(n) but it's a bit overkill, it sorts 
 elements, and of course doesn't work with infinite ranges. Am i 
 missing any function?
You could do something like (untested code): for(size_t m = 0, mnext = n; ; n = mnext, mnext = min(2 * mnext, range.length)) { partialSort!condition(range[m .. $], mnext - m); if(range[0 .. mnext].uniq.count >= n) break; assert(mnext < range.length); } auto topN = range.uniq.take(n); You will need a random access range with slicing for that, and it will modify it. If I am not mistaken, this has time complexity O(log(ndup / n) * range.length), where ndup is the number of all elements (including duplicates) with values among the top n.
Mar 08 2013
parent reply "Andrea Fontana" <nospam example.com> writes:
On Friday, 8 March 2013 at 14:43:29 UTC, jerro wrote:
 On Friday, 8 March 2013 at 13:33:24 UTC, Andrea Fontana wrote:
 I wonder if exists a way to get top n distinct elements from a 
 range (infinite too!)
It's impossible to do that for infinite ranges
Why? sequence!"n*2".myTopDistinct!"a==b"(3); will give [2,4,6] We just need to keep a list of "just-seen" elements But I don't know how to make it lazy
Mar 08 2013
next sibling parent reply "Ivan Kazmenko" <gassa mail.ru> writes:
 I wonder if exists a way to get top n distinct elements from 
 a range (infinite too!)
It's impossible to do that for infinite ranges
Why? sequence!"n*2".myTopDistinct!"a==b"(3); will give [2,4,6]
I don't quite get it, do you want the least n elements? What result would you want if the sequence was the following: sequence!"-n*2".myTopDistinct!"a==b"(3); The sequence goes like [-2, -4, -6, -8, -10, ...] ad infinitum. To clarify a doubt, do you mean "top" as in "front", or "top" as the "greatest" or "least" in terms of comparison? Or maybe "the first n distinct elements found"? If so, why "sort" in the original post? It could bring middle elements to front.
Mar 08 2013
parent reply "Andrea Fontana" <nospam example.com> writes:
On Friday, 8 March 2013 at 16:12:37 UTC, Ivan Kazmenko wrote:
 I wonder if exists a way to get top n distinct elements from 
 a range (infinite too!)
It's impossible to do that for infinite ranges
Why? sequence!"n*2".myTopDistinct!"a==b"(3); will give [2,4,6]
I don't quite get it, do you want the least n elements? What result would you want if the sequence was the following: sequence!"-n*2".myTopDistinct!"a==b"(3);
Infinite loop as: sequence!"n*2".filter!"a>1"; and a lot of other cases.
 The sequence goes like [-2, -4, -6, -8, -10, ...] ad infinitum.

 To clarify a doubt, do you mean "top" as in "front", or "top" 
 as the "greatest" or "least" in terms of comparison?  Or maybe 
 "the first n distinct elements found"?  If so, why "sort" in 
 the original post?  It could bring middle elements to front.
I mean "the first n distinct elements found". Example was a mistake. Uniq works with sorted range (if you want to get distinct element) so I sort it to make it works. But I should not :) Whoops. Bearophile got it btw. So, is there any lazy way to do it?
Mar 08 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Andrea Fontana:

 So, is there any lazy way to do it?
Take my code and put it inside a struct, add empty, popFront and front. Otherwise return an impure take of a filter closure from a function that keeps the set. Something like (untested): auto firstDistinct(Range)(Range r, size_t n) { bool[ForeachType!(Range)] mySet; return f.filter!((k) { if (k in mySet) return false; mySet[k] = true; return true; }).take(n); } Bye, bearophile
Mar 08 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
 Otherwise return an impure take of a filter closure from a 
 function that keeps the set. Something like (untested):
It seems to work: import std.stdio, std.range, std.algorithm, std.traits; auto firstDistinct(Range)(Range r, in size_t n) { bool[ForeachType!Range] mySet; return r.filter!((k) { if (k in mySet) return false; mySet[k] = true; return true; }).take(n); } void main() { immutable data = [1, 0, 0, 0, 0, 3, 3, 0, 3, 3, 0, 2, 3, 2, 3, 2, 3, 1, 1, 3, 2, 1, 0, 0, 0, 1, 0, 3, 0, 3]; foreach (i; 0 .. 6) data.firstDistinct(i).writeln; } Bye, bearophile
Mar 08 2013
next sibling parent reply "Chris Cain" <clcain uncg.edu> writes:
On Friday, 8 March 2013 at 18:17:22 UTC, bearophile wrote:
 Otherwise return an impure take of a filter closure from a 
 function that keeps the set. Something like (untested):
It seems to work: import std.stdio, std.range, std.algorithm, std.traits; auto firstDistinct(Range)(Range r, in size_t n) { bool[ForeachType!Range] mySet; return r.filter!((k) { if (k in mySet) return false; mySet[k] = true; return true; }).take(n); }
I have to say, that's a great example of beautiful simplicity in programming. Bravo.
Mar 08 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/8/13 1:34 PM, Chris Cain wrote:
 On Friday, 8 March 2013 at 18:17:22 UTC, bearophile wrote:
 auto firstDistinct(Range)(Range r, in size_t n) {
 bool[ForeachType!Range] mySet;

 return r.filter!((k) {
 if (k in mySet)
 return false;
 mySet[k] = true;
 return true;
 }).take(n);
 }
I have to say, that's a great example of beautiful simplicity in programming. Bravo.
Yah, impressive. Nice work bearophile. Andrei
Mar 09 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 Yah,
There are 3 problems: - In a successive post I have also added a template constraint, so Range is an InputRange. - firstDistinct() packs two behaviours that are better kept separated, it's like a takeDistinct. It's better to remove the final take, and return all distinct items. If a user wants the first ones, he/she/shi will add a take after it. This is why later in this thread we have used distinct(). In very few cases I'd like some extra functions in Phobos like flatMap and amap, because they are very common operations. But in most cases it's better to keep only the simplest algorithms (like distinct and take), so you can combine them in many different ways. - Successive posts have shown that the code doesn't work. After several tests I have opened this: http://d.puremagic.com/issues/show_bug.cgi?id=9674 Please take a look at it. - I have recently written a little D program, about something else, that I think you will appreciate even more than that firstDistinct/distinct :-) Bye, bearophile
Mar 09 2013
prev sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
It needs a constraint:


auto firstDistinct(Range)(Range r, in size_t n)
if (isInputRange!Range) {
     bool[ForeachType!Range] mySet;

     return r.filter!((k) {
         if (k in mySet)
             return false;
         mySet[k] = true;
         return true;
     }).take(n);
}


Bye,
bearophile
Mar 08 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
 auto firstDistinct(Range)(Range r, in size_t n)
 if (isInputRange!Range) {
     bool[ForeachType!Range] mySet;

     return r.filter!((k) {
         if (k in mySet)
             return false;
         mySet[k] = true;
         return true;
     }).take(n);
 }
I think the standard library of Ada2012 has bounded containers, so you can use them to stack-allocate a hash set that can contain up to n items (plus some more free cells to keep its load factor low enough). In Rust there are fully safe stack-allocated closures. Combining the two things it's possible to create a function like that firstDistinct() with zero heap allocations, that is probably fast. Bye, bearophile
Mar 08 2013
parent reply "Andrea Fontana" <nospam example.com> writes:
On Friday, 8 March 2013 at 19:36:04 UTC, bearophile wrote:
 auto firstDistinct(Range)(Range r, in size_t n)
 if (isInputRange!Range) {
    bool[ForeachType!Range] mySet;

    return r.filter!((k) {
        if (k in mySet)
            return false;
        mySet[k] = true;
        return true;
    }).take(n);
 }
I think the standard library of Ada2012 has bounded containers, so you can use them to stack-allocate a hash set that can contain up to n items (plus some more free cells to keep its load factor low enough). In Rust there are fully safe stack-allocated closures. Combining the two things it's possible to create a function like that firstDistinct() with zero heap allocations, that is probably fast. Bye, bearophile
Something goes wrong by the way. I removed "n" template params and "take(n)" (i can do after distinct() call, isn't it the same?). Try this (sorted for better reading): iota(100).map!((x) => uniform(0,10))().distinct().array.sort.writeln; iota(100).map!((x) => uniform(0,10))().array.distinct().array.sort.writeln; output: [0, 0, 3, 3, 4, 4, 4, 6, 7, 8] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Mar 08 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Andrea Fontana:

 Something goes wrong by the way. I removed "n" template params 
 and "take(n)" (i can do after distinct() call, isn't it the 
 same?).
I think take() is not the cause. See this program: import std.stdio, std.range, std.algorithm, std.traits, std.random; auto distinct(Range)(Range r) if (isInputRange!Range) { bool[ForeachType!Range] mySet; return r.filter!((k) { if (k in mySet) return false; mySet[k] = true; return true; }); } void main() { 5.iota.map!((_) { auto x = uniform(0, 10); write("*"); return x; }).distinct.writeln; } It outputs something like: *[*2*, *0*, *2**, *1] You expect that output to contain only 5 stars. Maybe map() or filter() is the cause of this. Bye, bearophile
Mar 08 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
 You expect that output to contain only 5 stars.
Better seen with this code: import std.stdio, std.range, std.algorithm, std.traits, std.random; auto distinct(Range)(Range r) if (isInputRange!Range) { bool[ForeachType!Range] mySet; return r.filter!((k) { if (k in mySet) return false; mySet[k] = true; return true; }); } void main() { 5.iota.map!((_) { auto x = uniform(0, 10); write("*"); return x; }).distinct.array; } Bye, bearophile
Mar 08 2013
prev sibling parent reply "Andrea Fontana" <nospam example.com> writes:
On Friday, 8 March 2013 at 22:36:35 UTC, Andrea Fontana wrote:
 On Friday, 8 March 2013 at 19:36:04 UTC, bearophile wrote:
 auto firstDistinct(Range)(Range r, in size_t n)
 if (isInputRange!Range) {
   bool[ForeachType!Range] mySet;

   return r.filter!((k) {
       if (k in mySet)
           return false;
       mySet[k] = true;
       return true;
   }).take(n);
 }
I think the standard library of Ada2012 has bounded containers, so you can use them to stack-allocate a hash set that can contain up to n items (plus some more free cells to keep its load factor low enough). In Rust there are fully safe stack-allocated closures. Combining the two things it's possible to create a function like that firstDistinct() with zero heap allocations, that is probably fast. Bye, bearophile
Something goes wrong by the way. I removed "n" template params and "take(n)" (i can do after distinct() call, isn't it the same?). Try this (sorted for better reading): iota(100).map!((x) => uniform(0,10))().distinct().array.sort.writeln; iota(100).map!((x) => uniform(0,10))().array.distinct().array.sort.writeln; output: [0, 0, 3, 3, 4, 4, 4, 6, 7, 8] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Here the answer: auto r=iota(100).map!((x) => uniform(0,10))(); writeln(r.front," ", r.front, " ", r.front, " ", r.front); But i think "front" was "cached", but it seems not...
Mar 08 2013
next sibling parent "Andrea Fontana" <nospam example.com> writes:
On Friday, 8 March 2013 at 23:37:20 UTC, Andrea Fontana wrote:
 On Friday, 8 March 2013 at 22:36:35 UTC, Andrea Fontana wrote:
 On Friday, 8 March 2013 at 19:36:04 UTC, bearophile wrote:
 auto firstDistinct(Range)(Range r, in size_t n)
 if (isInputRange!Range) {
  bool[ForeachType!Range] mySet;

  return r.filter!((k) {
      if (k in mySet)
          return false;
      mySet[k] = true;
      return true;
  }).take(n);
 }
I think the standard library of Ada2012 has bounded containers, so you can use them to stack-allocate a hash set that can contain up to n items (plus some more free cells to keep its load factor low enough). In Rust there are fully safe stack-allocated closures. Combining the two things it's possible to create a function like that firstDistinct() with zero heap allocations, that is probably fast. Bye, bearophile
Something goes wrong by the way. I removed "n" template params and "take(n)" (i can do after distinct() call, isn't it the same?). Try this (sorted for better reading): iota(100).map!((x) => uniform(0,10))().distinct().array.sort.writeln; iota(100).map!((x) => uniform(0,10))().array.distinct().array.sort.writeln; output: [0, 0, 3, 3, 4, 4, 4, 6, 7, 8] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Here the answer: auto r=iota(100).map!((x) => uniform(0,10))(); writeln(r.front," ", r.front, " ", r.front, " ", r.front); But i think "front" was "cached", but it seems not...
This is my "struct" version that helped me to findout that problem. auto distinct(Range)(Range rs) if (isInputRange!(Unqual!Range)) { return DistinctResult!Range(rs); } struct DistinctResult(Range) if (isInputRange!(Unqual!Range)) { alias Unqual!Range T; T range; bool[ElementType!T] justSeen; property front() { return range.front; } property empty() { return range.empty; } void popFront() { justSeen[range.front] = true; while(true) { range.popFront(); if (range.empty || range.front !in justSeen) break; } } }
Mar 08 2013
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Andrea Fontana:

 But i think "front" was "cached", but it seems not...	
The caching of map front was recently removed. More test code: import std.stdio, std.range, std.algorithm, std.traits, std.random; void main() { 5.iota.map!((_) { auto x = uniform(0, 10); write("*"); return x; }).filter!(_ => true).array; } Prints: ********** Bye, bearophile
Mar 08 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Simpler:


import std.stdio, std.range, std.algorithm;

void main() {
     iota(5).map!((x) {
         write("*");
         return x;
     }).filter!(_ => true).array;
}
Mar 08 2013
parent reply "Andrea Fontana" <nospam example.com> writes:
On Friday, 8 March 2013 at 23:44:40 UTC, bearophile wrote:
 Simpler:


 import std.stdio, std.range, std.algorithm;

 void main() {
     iota(5).map!((x) {
         write("*");
         return x;
     }).filter!(_ => true).array;
 }
Chech this: 5.iota.map!((_) { auto x = uniform(0, 1000); writeln(" MAP-1 = ", x); return x; }).filter!(_ => true).map!((x){ writeln(" MAP-2 = ", x); return x; })().array.writeln; Here some function calls front() internally (filter? map?) to check some conditions and triggers random creation again.
Mar 08 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
This is a shortened version of the filter(), it calls pred only n 
times, but it calls _input.front n*2 times. If _input.front is 
not deterministic (that means it's not pure, as in the case of 
distinct filtering function) it gives wrong results:


struct FilterResult(alias pred, Range) {
     Range _input;

     this(Range r) {
         _input = r;
         while (!_input.empty && !pred(_input.front)) {
             _input.popFront();
         }
     }

      property bool empty() { return _input.empty; }

     void popFront() {
         do {
             _input.popFront();
         } while (!_input.empty && !pred(_input.front));
     }

      property auto ref front() {
         return _input.front;
     }
}


This isn't exactly a bug of filter(), it's a design decision of 
not caching _input.front.

Bye,
bearophile
Mar 08 2013
parent reply "Andrea Fontana" <nospam example.com> writes:
On Saturday, 9 March 2013 at 01:00:26 UTC, bearophile wrote:
 This is a shortened version of the filter(), it calls pred only 
 n times, but it calls _input.front n*2 times. If _input.front 
 is not deterministic (that means it's not pure, as in the case 
 of distinct filtering function) it gives wrong results:


 struct FilterResult(alias pred, Range) {
     Range _input;

     this(Range r) {
         _input = r;
         while (!_input.empty && !pred(_input.front)) {
             _input.popFront();
         }
     }

      property bool empty() { return _input.empty; }

     void popFront() {
         do {
             _input.popFront();
         } while (!_input.empty && !pred(_input.front));
     }

      property auto ref front() {
         return _input.front;
     }
 }


 This isn't exactly a bug of filter(), it's a design decision of 
 not caching _input.front.

 Bye,
 bearophile
Not funny btw :) It's not so easy to find out. And it's not easy to write a "cached map" function if you use classes or some pointers as ElementType...
Mar 08 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Andrea Fontana:

 Not funny btw :) It's not so easy to find out. And it's not 
 easy to write a "cached map" function if you use classes or 
 some pointers as ElementType...
Not caching _input.front in filter() means that filter() assumes the front() of its argument range to act as a pure function (even if this isn't enforced by the type system). At least this should be documented in std.algorithm ddocs... Python filter() does not work that way: from sys import stdout def fun(x): stdout.write("*") return x list(filter(lambda x: True, map(fun, xrange(5)))) Prints: ***** Bye, bearophile
Mar 08 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
 from sys import stdout
 def fun(x):
     stdout.write("*")
     return x
 list(filter(lambda x: True, map(fun, xrange(5))))
Sorry, I meant: from itertools import imap, ifilter from sys import stdout def fun(x): stdout.write("*") return x list(ifilter(lambda x: True, imap(fun, xrange(5))))
Mar 08 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
With this modified filter (that caches _input.front) it seems to 
work:


import std.stdio, std.range, std.algorithm, std.traits, 
std.random;

struct FilterResult2(alias pred, Range) {
     import std.traits: ForeachType;

     Range _input;
     ForeachType!Range rFront;

     this(Range r) {
         _input = r;

         while (!_input.empty) {
             this.rFront = _input.front;
             if (pred(this.rFront))
                 break;
             _input.popFront();
         }
     }

      property bool empty() { return _input.empty; }

     void popFront() {
         do {
             _input.popFront();
             if (_input.empty)
                 break;
             this.rFront = _input.front;
             if (pred(this.rFront))
                 break;
         } while (true);
     }

      property auto ref front() {
         return this.rFront;
     }
}

FilterResult2!(pred, Range) filter2(alias pred, Range)(Range rs) {
     return FilterResult2!(pred, Range)(rs);
}


auto distinct(Range)(Range r)
if (isInputRange!Range) {
     bool[ForeachType!Range] mySet;

     return r.filter2!((k) {
         if (k in mySet)
             return false;
         mySet[k] = true;
         return true;
     });
}

void main() {
     100.iota.map!(_ => uniform(0, 
10)).distinct.array.sort.writeln;
}


Is it a good idea to replace std.algorithm.filter with something 
like that filter2 (plus some missing methods)?

Bye,
bearophile
Mar 08 2013
parent reply "Andrea Fontana" <nospam example.com> writes:
On Saturday, 9 March 2013 at 01:36:53 UTC, bearophile wrote:
 Is it a good idea to replace std.algorithm.filter with 
 something like that filter2 (plus some missing methods)?

 Bye,
 bearophile
Maybe two versions (filter and cachedFilter) or a bool template param? I was thinking about pure front() too: but I think it's a wrong assumption. The right assumption would be that front should return the same value until popFront is called again. It can read front value lazily from front() call. It can do a lot of impure things (lol) but it shouldn't change front "randomly" at each call. I would improve distinct to support an alias pred = "a < b" to build a bst instead of an AA. Or just a field like distinct!"a.id" (that should work with aa backend)
Mar 09 2013
next sibling parent reply "Andrea Fontana" <nospam example.com> writes:
On Saturday, 9 March 2013 at 09:13:26 UTC, Andrea Fontana wrote:
 On Saturday, 9 March 2013 at 01:36:53 UTC, bearophile wrote:
 Is it a good idea to replace std.algorithm.filter with 
 something like that filter2 (plus some missing methods)?

 Bye,
 bearophile
Maybe two versions (filter and cachedFilter) or a bool template param? I was thinking about pure front() too: but I think it's a wrong assumption. The right assumption would be that front should return the same value until popFront is called again. It can read front value lazily from front() call. It can do a lot of impure things (lol) but it shouldn't change front "randomly" at each call. I would improve distinct to support an alias pred = "a < b" to build a bst instead of an AA. Or just a field like distinct!"a.id" (that should work with aa backend)
I almost got it: auto distinct(alias fun = "a", Range)(Range r) if (isInputRange!Range) { alias unaryFun!fun _fun; bool[ReturnType!_fun] mySet; return r.filter2!((k) { if (_fun(k) in mySet) return false; mySet[_fun(k)] = true; return true; }); } ReturnType!_fun doesn't work, if i set to the right type, function works as expected
Mar 09 2013
parent reply "Andrea Fontana" <nospam example.com> writes:
On Saturday, 9 March 2013 at 09:32:43 UTC, Andrea Fontana wrote:
 On Saturday, 9 March 2013 at 09:13:26 UTC, Andrea Fontana wrote:
 On Saturday, 9 March 2013 at 01:36:53 UTC, bearophile wrote:
 Is it a good idea to replace std.algorithm.filter with 
 something like that filter2 (plus some missing methods)?

 Bye,
 bearophile
Maybe two versions (filter and cachedFilter) or a bool template param? I was thinking about pure front() too: but I think it's a wrong assumption. The right assumption would be that front should return the same value until popFront is called again. It can read front value lazily from front() call. It can do a lot of impure things (lol) but it shouldn't change front "randomly" at each call. I would improve distinct to support an alias pred = "a < b" to build a bst instead of an AA. Or just a field like distinct!"a.id" (that should work with aa backend)
I almost got it: auto distinct(alias fun = "a", Range)(Range r) if (isInputRange!Range) { alias unaryFun!fun _fun; bool[ReturnType!_fun] mySet; return r.filter2!((k) { if (_fun(k) in mySet) return false; mySet[_fun(k)] = true; return true; }); } ReturnType!_fun doesn't work, if i set to the right type, function works as expected
auto distinct(alias fun = "a", Range)(Range r) if (isInputRange!Range) { alias unaryFun!fun _fun; bool[typeof(_fun((ElementType!Range).init))] mySet; return r.filter2!((k) { if (_fun(k) in mySet) return false; mySet[_fun(k)] = true; return true; }); } struct User { int id; string name; } void main() { User[] users = [ {1, "first"}, {2, "second"}, {3, "first"}]; users.distinct!("a.id")().writeln; users.distinct!("a.name")().writeln; } this works. Is this: bool[typeof(_fun((ElementType!Range).init))] mySet; the right way?
Mar 09 2013
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Andrea Fontana:

 Is this:
 bool[typeof(_fun((ElementType!Range).init))] mySet;
 	
 the right way?
I don't know if that's the best way, but I think it's OK. (I think it should even go in the D wiki, so if it's not the best to do it, someone will improve it.) Bye, bearophile
Mar 09 2013
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 03/09/2013 10:35 AM, Andrea Fontana wrote:
 ...

 this works. Is this:
 bool[typeof(_fun((ElementType!Range).init))] mySet;

 the right way?
It's ok, but the redundant parentheses around the element type are confusing.
Mar 09 2013
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Andrea Fontana:

 Maybe two versions (filter and cachedFilter) or a bool template 
 param?
In the meantime I have filed the problem, I hope Andrei will take a look: http://d.puremagic.com/issues/show_bug.cgi?id=9674 Bye, bearophile
Mar 09 2013
prev sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Andrea Fontana:

 Here the answer:

 auto r=iota(100).map!((x) => uniform(0,10))();
 writeln(r.front," ", r.front, " ", r.front, " ", r.front);
I think that's not the answer. I think the answer is a bug in filter(). Bye, bearophile
Mar 08 2013
parent "Andrea Fontana" <nospam example.com> writes:
On Friday, 8 March 2013 at 23:51:48 UTC, bearophile wrote:
 Andrea Fontana:

 Here the answer:

 auto r=iota(100).map!((x) => uniform(0,10))();
 writeln(r.front," ", r.front, " ", r.front, " ", r.front);
I think that's not the answer. I think the answer is a bug in filter(). Bye, bearophile
My struct version never work if map doesn't cache. Every time I call (inside popFront() too!) front() gives different values. If i do array.distinct() it works fine: an array is filled with random values and they will never changes again.
Mar 08 2013
prev sibling parent "jerro" <a a.com> writes:
On Friday, 8 March 2013 at 15:53:56 UTC, Andrea Fontana wrote:
 On Friday, 8 March 2013 at 14:43:29 UTC, jerro wrote:
 On Friday, 8 March 2013 at 13:33:24 UTC, Andrea Fontana wrote:
 I wonder if exists a way to get top n distinct elements from 
 a range (infinite too!)
It's impossible to do that for infinite ranges
Why? sequence!"n*2".myTopDistinct!"a==b"(3); will give [2,4,6]
I didn't understand you correctly then. I thought that top n distinct elements meant n largest distinct elements. Now, I'm assuming that you just need first n distinct elements.
 We just need to keep a list of "just-seen" elements
 But I don't know how to make it lazy
You could write your own range that uses an associative array to check for duplicates: struct Unique(R) { R r; bool[elementType!R] seen; property front(){ return r.front; } property empty(){ return r.empty; } void popFront() { seen[r] = bool.init; while(r.front in seen) r.popFront(); } } I don't think there's anything like this in Phobos.
Mar 08 2013
prev sibling next sibling parent reply "Ivan Kazmenko" <gassa mail.ru> writes:
 I wonder if exists a way to get top n distinct elements from a 
 range (infinite too!)
 A (not efficient) way to to this is 
 range.array.sort.uniq.take(n) but it's a bit overkill, it sorts 
 elements, and of course doesn't work with infinite ranges. Am i 
 missing any function?
For an arbitrary infinite range, it is impossible from the mathematical point of view. An impossible example would be doing that with the infinite range of natural numbers, stored in bignums. For a finite range, you can iterate over the range while maintaining a collection of the top n elements and discarding duplicates as you find them. For example, using a RedBlackTree as the collection, you will get O(log(n) * range.length) total time and O(n) total memory used. If n is small (on the order of 10s), an array will suffice, resulting in O(n * range.length) total time but with lower constant factor.
Mar 08 2013
parent reply "Chris Cain" <clcain uncg.edu> writes:
On Friday, 8 March 2013 at 15:04:32 UTC, Ivan Kazmenko wrote:
 For a finite range, you can iterate over the range while 
 maintaining a collection of the top n elements and discarding 
 duplicates as you find them.  For example, using a RedBlackTree 
 as the collection, you will get O(log(n) * range.length) total 
 time and O(n) total memory used.  If n is small (on the order 
 of 10s), an array will suffice, resulting in O(n * 
 range.length) total time but with lower constant factor.
You can also use an array if you use a BinaryHeap. If you use a heap then it should have a much lower constant factor than a red-black tree so, even though they would both have O(m*log(n)) (where n is the number of elements you want the top of and m is the size of the collection) running time, a heap would be faster.
Mar 08 2013
parent "Ivan Kazmenko" <gassa mail.ru> writes:
 For a finite range, you can iterate over the range while 
 maintaining a collection of the top n elements and discarding 
 duplicates as you find them.  For example, using a 
 RedBlackTree as the collection, you will get O(log(n) * 
 range.length) total time and O(n) total memory used.  If n is 
 small (on the order of 10s), an array will suffice, resulting 
 in O(n * range.length) total time but with lower constant 
 factor.
You can also use an array if you use a BinaryHeap. If you use a heap then it should have a much lower constant factor than a red-black tree so, even though they would both have O(m*log(n)) (where n is the number of elements you want the top of and m is the size of the collection) running time, a heap would be faster.
My first thought was about a heap, too, but it does not allow to search for a duplicate in O(log(n)) as it guarantees only partial ordering.
Mar 08 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-03-08 14:33, Andrea Fontana wrote:
 I wonder if exists a way to get top n distinct elements from a range
 (infinite too!)

 A (not efficient) way to to this is range.array.sort.uniq.take(n) but
 it's a bit overkill, it sorts elements, and of course doesn't work with
 infinite ranges. Am i missing any function?
There's a function called "topN" in std.algorithm, I don't know if that's what you're looking for. http://dlang.org/phobos/std_algorithm.html#topN -- /Jacob Carlborg
Mar 08 2013