www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - sortUniq

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
There's this classic patter on Unix: |sort|uniq, i.e. sort some data and 
only display the unique elements.

What would be a better integrated version - one that does sorting and 
uniq in one shot? I suspect the combination could be quite a bit better 
than doing the two in sequence.

A few google searches didn't yield much. Ideas?


Thanks,

Andrei
Jan 22 2015
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jan 22, 2015 at 01:40:56PM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 There's this classic patter on Unix: |sort|uniq, i.e. sort some data
 and only display the unique elements.
 
 What would be a better integrated version - one that does sorting and
 uniq in one shot? I suspect the combination could be quite a bit
 better than doing the two in sequence.
 
 A few google searches didn't yield much. Ideas?
[...] Off the top of my head: a sorting algorithm that swallows identical elements? I've never actually seen such an algorithm before, though. Generally, sorting algorithms try not to output less elements than they were given. :-P T -- Marketing: the art of convincing people to pay for what they didn't need before which you fail to deliver after.
Jan 22 2015
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Thursday, 22 January 2015 at 21:40:57 UTC, Andrei Alexandrescu 
wrote:
 There's this classic patter on Unix: |sort|uniq, i.e. sort some 
 data and only display the unique elements.

 What would be a better integrated version - one that does 
 sorting and uniq in one shot? I suspect the combination could 
 be quite a bit better than doing the two in sequence.
I think the only way you can go lower than O(n log n) is pigeonholing (counting sort without the counting), but for generic keys you need AAs which aren't O(1) (and slow). Maybe heapsort (you could drop duplicates while building the heap), but that needs an extra allocation.
Jan 22 2015
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 There's this classic patter on Unix: |sort|uniq, i.e. sort some 
 data and only display the unique elements.
In Bugzilla I've asked for a hashGroup, it returns a built-in associative array. Bye, bearophile
Jan 22 2015
prev sibling next sibling parent reply Justin Whear <justin economicmodeling.com> writes:
On Thu, 22 Jan 2015 13:40:56 -0800, Andrei Alexandrescu wrote:

 There's this classic patter on Unix: |sort|uniq, i.e. sort some data and
 only display the unique elements.
 
 What would be a better integrated version - one that does sorting and
 uniq in one shot? I suspect the combination could be quite a bit better
 than doing the two in sequence.
 
 A few google searches didn't yield much. Ideas?
 
 
 Thanks,
 
 Andrei
Efficiency improvement-wise, perhaps a generalization of a counting sort (http://en.wikipedia.org/wiki/Counting_sort), see "Variant algorithms."
Jan 22 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/22/15 2:06 PM, Justin Whear wrote:
 On Thu, 22 Jan 2015 13:40:56 -0800, Andrei Alexandrescu wrote:

 There's this classic patter on Unix: |sort|uniq, i.e. sort some data and
 only display the unique elements.

 What would be a better integrated version - one that does sorting and
 uniq in one shot? I suspect the combination could be quite a bit better
 than doing the two in sequence.

 A few google searches didn't yield much. Ideas?


 Thanks,

 Andrei
Efficiency improvement-wise, perhaps a generalization of a counting sort (http://en.wikipedia.org/wiki/Counting_sort), see "Variant algorithms."
Thanks. I'm thinking of something based on general comparisons. Something like a modified quicksort with fat pivot: A. Partition into three subranges: R1<pivot, R2=pivot, R3>pivot B. Recurse on R1 obtaining R11, a subrange of R1 containing only unique elements. C. Sort-move-unique R2 into the portion after R11, obtaining the result. A and B seem fine, I'm unclear about C. It would be an algorithm that sorts and simultaneously moves data in a range that overlaps the input. Andrei
Jan 22 2015
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jan 22, 2015 at 02:29:58PM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
[...]
 Thanks. I'm thinking of something based on general comparisons.
 Something like a modified quicksort with fat pivot:
 
 A. Partition into three subranges: R1<pivot, R2=pivot, R3>pivot
 
 B. Recurse on R1 obtaining R11, a subrange of R1 containing only
 unique elements.
 
 C. Sort-move-unique R2 into the portion after R11, obtaining the
 result.
 
 A and B seem fine, I'm unclear about C. It would be an algorithm that
 sorts and simultaneously moves data in a range that overlaps the
 input.
[...] Wait, if R1 and R3, after the recursion, are sorted and contain only unique elements, doesn't that mean that we can just collapse R2 into a single element along with the endpoints of R1 and R3? So if R1[$-1] < pivot, then we're done, otherwise coalesce it with the pivot, and likewise with R3[0]. T -- Famous last words: I wonder what will happen if I do *this*...
Jan 22 2015
next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Thursday, 22 January 2015 at 22:53:59 UTC, H. S. Teoh via 
Digitalmars-d wrote:
 So if R1[$-1] < pivot,
This should be always true, no? R1 might be empty, though.
Jan 22 2015
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/22/15 2:51 PM, H. S. Teoh via Digitalmars-d wrote:
 Wait, if R1 and R3, after the recursion, are sorted and contain only
 unique elements, doesn't that mean that we can just collapse R2 into a
 single element along with the endpoints of R1 and R3? So if R1[$-1] <
 pivot, then we're done, otherwise coalesce it with the pivot, and
 likewise with R3[0].
Right, forgot to mention the pivot. You don't recurse on the right, only the left; that way you massage the unique stuff in R1 toward the left in the possibly smaller range R11. Then the challenge is to sort, uinique-fy, and move in one fell swoop from R3 (padded with the pivot to the left) into the portion adjacent to R11. Example: say the range has 100 elements. Then we divide like this: r[0 .. 30] is <pivot r[30 .. 34] is =pivot r[34 .. 100] is >pivot First, recurse on r[0 .. 30] getting a smaller range that's sorted and unique. Say r[0 .. 24]. The step needed now is to take r[33 .. 100], which is ONE copy of the pivot plus everything greater than pivot, and in one shot deposit it starting at r[24], sorted and uniquefied. I'm thinking of a modified heapsort that progressively creates the heap at r[24], or something like that. Andrei
Jan 22 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jan 22, 2015 at 04:10:45PM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/22/15 2:51 PM, H. S. Teoh via Digitalmars-d wrote:
Wait, if R1 and R3, after the recursion, are sorted and contain only
unique elements, doesn't that mean that we can just collapse R2 into
a single element along with the endpoints of R1 and R3? So if R1[$-1]
< pivot, then we're done, otherwise coalesce it with the pivot, and
likewise with R3[0].
Right, forgot to mention the pivot. You don't recurse on the right, only the left; that way you massage the unique stuff in R1 toward the left in the possibly smaller range R11. Then the challenge is to sort, uinique-fy, and move in one fell swoop from R3 (padded with the pivot to the left) into the portion adjacent to R11. Example: say the range has 100 elements. Then we divide like this: r[0 .. 30] is <pivot r[30 .. 34] is =pivot r[34 .. 100] is >pivot First, recurse on r[0 .. 30] getting a smaller range that's sorted and unique. Say r[0 .. 24]. The step needed now is to take r[33 .. 100], which is ONE copy of the pivot plus everything greater than pivot, and in one shot deposit it starting at r[24], sorted and uniquefied. I'm thinking of a modified heapsort that progressively creates the heap at r[24], or something like that.
[...] Well, that would work. Recurse on left to get a uniqSorted left segment, then modified heapsort on right (instead of heapsorting in place, deposit extracted heap elements into the target range). The question, though, is whether this hybrid recursion actually wins us anything, since heapsort is known to be somewhat slower than quicksort. In the worst case, you choose a poor pivot and get *both* the worst case of quicksort and the inferior performance of heapsort compounded together. Unless there's a way to modify heapsort to be more efficient as well? Actually, another idea occurred to me: suppose we modify partition() instead, so that instead of partitioning in-place, it deposits partitioned elements into a target (possibly overlapping) range. Then we modify the right-recursion, so that its partitioning step actually simultaneously moves the range over to the target area for us. Not sure if this will eliminate the copying cost, but it might, since partition has to traverse the entire subrange anyway. T -- Старый друг лучше новых двух.
Jan 22 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/22/15 4:26 PM, H. S. Teoh via Digitalmars-d wrote:
 The question, though, is whether this hybrid recursion actually wins us
 anything, since heapsort is known to be somewhat slower than quicksort.
 In the worst case, you choose a poor pivot and get*both*  the worst case
 of quicksort and the inferior performance of heapsort compounded
 together.
The glass-half-full view of this is you combine the advantages of quicksort and heapsort :o).
 Unless there's a way to modify heapsort to be more efficient as well?

 Actually, another idea occurred to me: suppose we modify partition()
 instead, so that instead of partitioning in-place, it deposits
 partitioned elements into a target (possibly overlapping) range. Then we
 modify the right-recursion, so that its partitioning step actually
 simultaneously moves the range over to the target area for us. Not sure
 if this will eliminate the copying cost, but it might, since partition
 has to traverse the entire subrange anyway.
That's interesting! Andrei
Jan 22 2015
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jan 22, 2015 at 04:33:28PM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/22/15 4:26 PM, H. S. Teoh via Digitalmars-d wrote:
The question, though, is whether this hybrid recursion actually wins
us anything, since heapsort is known to be somewhat slower than
quicksort.  In the worst case, you choose a poor pivot and get*both*
the worst case of quicksort and the inferior performance of heapsort
compounded together.
The glass-half-full view of this is you combine the advantages of quicksort and heapsort :o).
Well, I've been in the field long enough to realize that putting things together more often than not causes increase in complexity (usually accompanied by poorer performance) rather than the other way round. It's part of why algorithms research is so hard... but also so rewarding, when you do hit upon a winning combination that actually reduces complexity and/or improves performance.
Unless there's a way to modify heapsort to be more efficient as well?

Actually, another idea occurred to me: suppose we modify partition()
instead, so that instead of partitioning in-place, it deposits
partitioned elements into a target (possibly overlapping) range. Then
we modify the right-recursion, so that its partitioning step actually
simultaneously moves the range over to the target area for us. Not
sure if this will eliminate the copying cost, but it might, since
partition has to traverse the entire subrange anyway.
That's interesting!
[...] Here's a crude first-impressions sketch of how this modified partition might work: - As we move up the range, if the current element is less than the pivot, we deposit it into the growing left partition which is in the target location rather than the original location. - If the current element is equal to the pivot, we ignore it (this eliminates the pivot and its duplicates altogether from the left/right partitions). - The tricky case is if it's greater than the pivot. In the normal quicksort partition, we do nothing -- it will get moved later if it's occupying the space that belongs to the left partition. However, in our case, we need to move it *somewhere*, otherwise it will get lost if it doesn't lie in the overlapping area of the target range and the range being partitioned. One idea is to move it to the end of the target range, so that the two partitions grow from opposite ends of the target range, and they should meet in the middle (otherwise the target range is of the wrong size). However, the details of how this can be done is still murky. I'll have to think more about this. T -- Talk is cheap. Whining is actually free. -- Lars Wirzenius
Jan 22 2015
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jan 22, 2015 at 02:51:38PM -0800, H. S. Teoh via Digitalmars-d wrote:
 On Thu, Jan 22, 2015 at 02:29:58PM -0800, Andrei Alexandrescu via
Digitalmars-d wrote:
 [...]
 Thanks. I'm thinking of something based on general comparisons.
 Something like a modified quicksort with fat pivot:
 
 A. Partition into three subranges: R1<pivot, R2=pivot, R3>pivot
 
 B. Recurse on R1 obtaining R11, a subrange of R1 containing only
 unique elements.
 
 C. Sort-move-unique R2 into the portion after R11, obtaining the
 result.
 
 A and B seem fine, I'm unclear about C. It would be an algorithm that
 sorts and simultaneously moves data in a range that overlaps the
 input.
[...] Wait, if R1 and R3, after the recursion, are sorted and contain only unique elements, doesn't that mean that we can just collapse R2 into a single element along with the endpoints of R1 and R3? So if R1[$-1] < pivot, then we're done, otherwise coalesce it with the pivot, and likewise with R3[0].
[...] Working proof of concept: import std.algorithm; import std.array; import std.stdio; // Returns: final index of pivot size_t partition(R)(R r) { auto pivot = r[0]; swap(r[0], r[$-1]); // move pivot to end auto leftIdx = 0; foreach (i; 0 .. r.length) { if (r[i] < pivot) swap(r[i], r[leftIdx++]); } swap(r[leftIdx], r[$-1]); return leftIdx; } R uniqueSort(R)(R r) { if (r.empty) return r; // Partition auto pivotIdx = partition(r); auto pivot = r[pivotIdx]; // Recurse auto left = uniqueSort(r[0 .. pivotIdx]); auto right = uniqueSort(r[pivotIdx + 1 .. $]); // Coalesce left ~ pivot ~ right. if (!left.empty && left[$-1] == pivot) left = left[0 .. $-1]; if (!right.empty && right[0] == pivot) right = right[1 .. $]; r[left.length] = pivot; auto vacant = copy(right, r[left.length + 1 .. $]); return r[0 .. $ - vacant.length]; } unittest { auto data = [4, 0, 9, 7, 2, 5, 1, 1, 3, 4, 2, 0, 9, 5, 3, 6, 1, 7, 4, 8]; auto sorted = data.uniqueSort(); assert(sorted == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]); } I'm not sure about the performance of this algorithm, though, because the call to copy() might become prohibitive once you add up all the recursive calls to uniqueSort(). Also, I only got one test case working; no idea if there are subtle bugs that are lurking in there somewhere. Please test with more test cases. :-) T -- "No, John. I want formats that are actually useful, rather than over-featured megaliths that address all questions by piling on ridiculous internal links in forms which are hideously over-complex." -- Simon St. Laurent on xml-dev
Jan 22 2015
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Thursday, 22 January 2015 at 23:35:32 UTC, H. S. Teoh via 
Digitalmars-d wrote:
 On Thu, Jan 22, 2015 at 02:51:38PM -0800, H. S. Teoh via 
 Digitalmars-d wrote:
 Wait, if R1 and R3, after the recursion, are sorted and 
 contain only
 unique elements, doesn't that mean that we can just collapse 
 R2 into a
 single element along with the endpoints of R1 and R3? So if 
 R1[$-1] <
 pivot, then we're done, otherwise coalesce it with the pivot, 
 and
 likewise with R3[0].
I'm pretty sure that when Andrei said:
 C. Sort-move-unique R2 into the portion after R11, obtaining 
 the
 result.
He meant R3, not R2. R2 is obviously collapsed to one element (the pivot).
 Working proof of concept:
I think this will actually have worse performance than sort+uniq, since you're now shuffling half of the elements at each recursion depth - an additional O(n log n).
Jan 22 2015
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Thu, Jan 22, 2015 at 11:52:18PM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
 On Thursday, 22 January 2015 at 23:35:32 UTC, H. S. Teoh via Digitalmars-d
 wrote:
On Thu, Jan 22, 2015 at 02:51:38PM -0800, H. S. Teoh via Digitalmars-d
wrote:
Wait, if R1 and R3, after the recursion, are sorted and contain only
unique elements, doesn't that mean that we can just collapse R2 into
a single element along with the endpoints of R1 and R3? So if
R1[$-1] < pivot, then we're done, otherwise coalesce it with the
pivot, and likewise with R3[0].
I'm pretty sure that when Andrei said:
 C. Sort-move-unique R2 into the portion after R11, obtaining > the
 result.
He meant R3, not R2. R2 is obviously collapsed to one element (the pivot).
Ah, that's why I misunderstood him.
Working proof of concept:
I think this will actually have worse performance than sort+uniq, since you're now shuffling half of the elements at each recursion depth - an additional O(n log n).
That's what I thought. At each step, you have to coalesce the elements by copying `right` over. So that adds up to quite a cost. Actually, I just realized that the code has an unnecessary check for left[$-1] being equal to pivot: this is actually impossible, since partition() makes sure that the left half of the range is strictly less than the pivot, so it cannot possibly contain the pivot. So we can eliminate that check. But it doesn't solve the problem of excessive copying. Hmm... actually, I think it might be possible to modify partition so that it eliminates duplicated pivots for you. However, this still doesn't solve the recursive case where `left` might have shrunk, thus necessitating `right` be shifted over afterwards. A modified heapsort might be a better candidate for uniqueSort() than quicksort, it looks like. T -- Music critic: "That's an imitation fugue!"
Jan 22 2015
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/22/15 3:52 PM, Vladimir Panteleev wrote:
 On Thursday, 22 January 2015 at 23:35:32 UTC, H. S. Teoh via
 Digitalmars-d wrote:
 On Thu, Jan 22, 2015 at 02:51:38PM -0800, H. S. Teoh via Digitalmars-d
 wrote:
 Wait, if R1 and R3, after the recursion, are sorted and contain only
 unique elements, doesn't that mean that we can just collapse R2 into a
 single element along with the endpoints of R1 and R3? So if R1[$-1] <
 pivot, then we're done, otherwise coalesce it with the pivot, and
 likewise with R3[0].
I'm pretty sure that when Andrei said:
 C. Sort-move-unique R2 into the portion after R11, obtaining > the
 result.
He meant R3, not R2. R2 is obviously collapsed to one element (the pivot).
Yes, thanks and sorry for the distraction.
 Working proof of concept:
I think this will actually have worse performance than sort+uniq, since you're now shuffling half of the elements at each recursion depth - an additional O(n log n).
Yah, the key here is to exploit the combination to get better performance than composing the two. So you must do surgery on the core algorithms involved. Andrei
Jan 22 2015
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 What would be a better integrated version - one that does 
 sorting and uniq in one shot? I suspect the combination could 
 be quite a bit better than doing the two in sequence.
What I need most is a "groping by hashing", since years. A unique sort can be handy, but it's less useful that the first. Bye, bearophile
Jan 22 2015
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/22/15 4:06 PM, bearophile wrote:
 Andrei Alexandrescu:

 What would be a better integrated version - one that does sorting and
 uniq in one shot? I suspect the combination could be quite a bit
 better than doing the two in sequence.
What I need most is a "groping by hashing", since years. A unique sort can be handy, but it's less useful that the first.
I agree a hash uniq would be nice. -- Andrei
Jan 22 2015
prev sibling parent reply Paul O'Neil <redballoon36 gmail.com> writes:
On 01/22/2015 07:06 PM, bearophile wrote:
 Andrei Alexandrescu:
 
 What would be a better integrated version - one that does sorting and
 uniq in one shot? I suspect the combination could be quite a bit
 better than doing the two in sequence.
What I need most is a "groping by hashing", since years. A unique sort can be handy, but it's less useful that the first. Bye, bearophile
Is that different than groupBy? -- Paul O'Neil Github / IRC: todayman
Jan 22 2015
parent "bearophile" <bearophileHUGS lycos.com> writes:
Paul O'Neil:

 Is that different than groupBy?
It works with hashing, and doesn't require a sorting, so it's different. On the other hand it's just few lines of code long, and it's quite easy to write... Bye, bearophile
Jan 22 2015
prev sibling next sibling parent reply "Xinok" <xinok live.com> writes:
On Thursday, 22 January 2015 at 21:40:57 UTC, Andrei Alexandrescu 
wrote:
 There's this classic patter on Unix: |sort|uniq, i.e. sort some 
 data and only display the unique elements.

 What would be a better integrated version - one that does 
 sorting and uniq in one shot? I suspect the combination could 
 be quite a bit better than doing the two in sequence.

 A few google searches didn't yield much. Ideas?


 Thanks,

 Andrei
My thought is that uniq on a sorted list is only an O(n) operation, so it's not an expensive operation by any means. If there's to be any benefit to a sortUniq, it has to be capable of reducing work incurred during the sorting process; else you're just going to end up with something less efficient. One solution may be RedBlackTree, which has an option to disallow duplicate elements. This container has three useful properties: (1) The tree grows one element at a time. This is in opposition to other algorithms like quicksort or heapsort in which you must operate on the entire set of elements at once. (2) Removing duplicates only requires a single comparison per element, thus retaining the worst-case of |sort|uniq. (3) Duplicate elements are removed immediately. Inserting an element into a balanced tree with N elements is an O(lg n) operation, so the smaller n is, the less work that is required. The shortcoming of RedBlackTree is that it's not very fast. However, I don't know any other O(n lg n) sorting algorithm which has these properties.
Jan 22 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jan 23, 2015 at 12:59:51AM +0000, Xinok via Digitalmars-d wrote:
 On Thursday, 22 January 2015 at 21:40:57 UTC, Andrei Alexandrescu wrote:
There's this classic patter on Unix: |sort|uniq, i.e. sort some data
and only display the unique elements.

What would be a better integrated version - one that does sorting and
uniq in one shot? I suspect the combination could be quite a bit
better than doing the two in sequence.
[...]
 My thought is that uniq on a sorted list is only an O(n) operation, so
 it's not an expensive operation by any means. If there's to be any
 benefit to a sortUniq, it has to be capable of reducing work incurred
 during the sorting process; else you're just going to end up with
 something less efficient.
 
 One solution may be RedBlackTree, which has an option to disallow
 duplicate elements. This container has three useful properties:
The problem with RedBlackTree is that it's cache-unfriendly due to the memory allocation and per-node pointers. One reason quicksort performs so well is because it works in-place, and the partition operation accesses elements bilinearly (i.e., two linear traversals over the same stretch of memory interleaved with each other), so it's very likely that most of the accesses will hit the cache (possibly even all, if hardware prefetching is present). Whereas with RedBlackTree, it's jumping all over memory, so there will be a higher rate of cache misses and the resulting slow RAM fetches. Plus, memory allocation is slooow, and is best avoided where possible. T -- What's a "hot crossed bun"? An angry rabbit.
Jan 22 2015
parent "Laeeth Isharc" <Laeeth.nospam nospam-laeeth.com> writes:
 The problem with RedBlackTree is that it's cache-unfriendly due 
 to the
 memory allocation and per-node pointers. One reason quicksort 
 performs
 so well is because it works in-place, and the partition 
 operation
 accesses elements bilinearly (i.e., two linear traversals over 
 the same
 stretch of memory interleaved with each other), so it's very 
 likely that
 most of the accesses will hit the cache (possibly even all, if 
 hardware
 prefetching is present).  Whereas with RedBlackTree, it's 
 jumping all
 over memory, so there will be a higher rate of cache misses and 
 the
 resulting slow RAM fetches. Plus, memory allocation is slooow, 
 and is
 best avoided where possible.
In an interview Stepanov said he would have used b* trees instead for this reason if he were writing STL today. Here is a (dated) benchmark comparison of hash libraries: https://attractivechaos.wordpress.com/2008/08/28/comparison-of-hash-table-libraries/
Jan 24 2015
prev sibling parent reply zeljkog <zeljkog home.com> writes:
On 22.01.15 22:40, Andrei Alexandrescu wrote:
 There's this classic patter on Unix: |sort|uniq, i.e. sort some data and
 only display the unique elements.
Loosely-related, compiling ... auto ret = arr.filter!(myFilter()); ... I got: Error: closures are not yet supported in CTFE Using struct with opCall directly also does not pass. Did I miss something? Is it near?
Jan 23 2015
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
zeljkog:

 Loosely-related, compiling

 ...
 auto ret = arr.filter!(myFilter());
 ...

 I got:

 Error: closures are not yet supported in CTFE

 Using struct with opCall directly also does not pass.

 Did I miss something?
 Is it near?
Please show a minimal nonworking example in the D.learn newsgroup. Bye, bearophile
Jan 23 2015
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/23/15 7:04 AM, zeljkog wrote:
 On 22.01.15 22:40, Andrei Alexandrescu wrote:
 There's this classic patter on Unix: |sort|uniq, i.e. sort some data and
 only display the unique elements.
Loosely-related, compiling ... auto ret = arr.filter!(myFilter()); ... I got: Error: closures are not yet supported in CTFE Using struct with opCall directly also does not pass. Did I miss something? Is it near?
Must be you put that code at top level and the compiler tried to interpret it during compilation. -- Andrei
Jan 23 2015
parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Fri, Jan 23, 2015 at 09:40:42AM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/23/15 7:04 AM, zeljkog wrote:
On 22.01.15 22:40, Andrei Alexandrescu wrote:
There's this classic patter on Unix: |sort|uniq, i.e. sort some data
and only display the unique elements.
Loosely-related, compiling ... auto ret = arr.filter!(myFilter()); ... I got: Error: closures are not yet supported in CTFE Using struct with opCall directly also does not pass. Did I miss something? Is it near?
Must be you put that code at top level and the compiler tried to interpret it during compilation. -- Andrei
I think what he's trying to do is to call a function that returns a delegate, and use that delegate to instantiate the filter template. AFAIK I've never seen code like this before, and it looks like the compiler isn't prepared to handle this. T -- I am not young enough to know everything. -- Oscar Wilde
Jan 23 2015
parent reply zeljkog <zeljkog home.com> writes:
On 23.01.15 18:48, H. S. Teoh via Digitalmars-d wrote:
 
 I think what he's trying to do is to call a function that returns a
 delegate, and use that delegate to instantiate the filter template.
 AFAIK I've never seen code like this before, and it looks like the
 compiler isn't prepared to handle this.
 
Yes, I tried to use filter for unique, need closure. I think there are many applications for this pattern.
Jan 23 2015
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/23/15 10:05 AM, zeljkog wrote:
 On 23.01.15 18:48, H. S. Teoh via Digitalmars-d wrote:
 I think what he's trying to do is to call a function that returns a
 delegate, and use that delegate to instantiate the filter template.
 AFAIK I've never seen code like this before, and it looks like the
 compiler isn't prepared to handle this.
Yes, I tried to use filter for unique, need closure. I think there are many applications for this pattern.
Please post a complete snippet then. Thanks! -- Andrei
Jan 23 2015
parent reply zeljkog <zeljkog home.com> writes:
On 23.01.15 19:13, Andrei Alexandrescu wrote:
 On 1/23/15 10:05 AM, zeljkog wrote:
 On 23.01.15 18:48, H. S. Teoh via Digitalmars-d wrote:
 I think what he's trying to do is to call a function that returns a
 delegate, and use that delegate to instantiate the filter template.
 AFAIK I've never seen code like this before, and it looks like the
 compiler isn't prepared to handle this.
Yes, I tried to use filter for unique, need closure. I think there are many applications for this pattern.
Please post a complete snippet then. Thanks! -- Andrei
import std.stdio, std.algorithm; auto unique(){ bool[int] c; return (int a){ if (a in c) return false; else{ c[a] = true; return true; } }; } void main() { [1, 5, 5, 2, 1, 5, 6, 6].filter!(unique()).writeln; }
Jan 23 2015
next sibling parent reply "ixid" <adamsibson hotmail.com> writes:
 import std.stdio, std.algorithm;

 auto unique(){
     bool[int] c;
     return (int a){
         if (a in c)
             return false;
         else{
             c[a] = true;
             return true;
         }
     };
 }

 void main()
 {
     [1, 5, 5, 2, 1, 5, 6, 6].filter!(unique()).writeln;
 }
Just add static to the closure. auto unique(){ static bool[int] c; return (int a){ if (a in c) return false; else{ c[a] = true; return true; } }; }
Jan 24 2015
parent "ixid" <adamsibson hotmail.com> writes:
Ignore me, sorry! Obviously that will not work for reuse of 
unique.
Jan 24 2015
prev sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Friday, 23 January 2015 at 18:22:08 UTC, zeljkog wrote:
 On 23.01.15 19:13, Andrei Alexandrescu wrote:
 On 1/23/15 10:05 AM, zeljkog wrote:
 On 23.01.15 18:48, H. S. Teoh via Digitalmars-d wrote:
 I think what he's trying to do is to call a function that 
 returns a
 delegate, and use that delegate to instantiate the filter 
 template.
 AFAIK I've never seen code like this before, and it looks 
 like the
 compiler isn't prepared to handle this.
Yes, I tried to use filter for unique, need closure. I think there are many applications for this pattern.
Please post a complete snippet then. Thanks! -- Andrei
import std.stdio, std.algorithm; auto unique(){ bool[int] c; return (int a){ if (a in c) return false; else{ c[a] = true; return true; } }; } void main() { [1, 5, 5, 2, 1, 5, 6, 6].filter!(unique()).writeln; }
auto f = unique(); [1, 5, 5, 2, 1, 5, 6, 6].filter!(f).writeln; // [1, 5, 2, 6] Filter needs an alias, and you cannot alias an R-value (it has no symbol).
Jan 24 2015
next sibling parent reply zeljkog <zeljkog home.com> writes:
On 24.01.15 12:50, Peter Alexander wrote:
 auto f = unique();
 [1, 5, 5, 2, 1, 5, 6, 6].filter!(f).writeln;  // [1, 5, 2, 6]
 
 Filter needs an alias, and you cannot alias an R-value (it has no symbol).
Yes, I see :) But I think this should be supported by std.algorithm: import std.stdio, std.algorithm; struct Uniq{ bool[int] c; bool opCall(int a){ if (a in c) return false; else{ c[a] = true; return true; } } } void main() { [1, 5, 5, 2, 1, 5, 6, 6].filter!Uniq.writeln; }
Jan 24 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/24/15 4:13 AM, zeljkog wrote:
 On 24.01.15 12:50, Peter Alexander wrote:
 auto f = unique();
 [1, 5, 5, 2, 1, 5, 6, 6].filter!(f).writeln;  // [1, 5, 2, 6]

 Filter needs an alias, and you cannot alias an R-value (it has no symbol).
Yes, I see :) But I think this should be supported by std.algorithm: import std.stdio, std.algorithm; struct Uniq{ bool[int] c; bool opCall(int a){ if (a in c) return false; else{ c[a] = true; return true; } } } void main() { [1, 5, 5, 2, 1, 5, 6, 6].filter!Uniq.writeln; }
I can't make sense of this - where is Uniq supposed to be instantiated? -- Andrei
Jan 24 2015
parent reply zeljkog <zeljkog home.com> writes:
On 24.01.15 15:53, Andrei Alexandrescu wrote:
 On 1/24/15 4:13 AM, zeljkog wrote:
 On 24.01.15 12:50, Peter Alexander wrote:
 auto f = unique();
 [1, 5, 5, 2, 1, 5, 6, 6].filter!(f).writeln;  // [1, 5, 2, 6]

 Filter needs an alias, and you cannot alias an R-value (it has no
 symbol).
Yes, I see :) But I think this should be supported by std.algorithm: import std.stdio, std.algorithm; struct Uniq{ bool[int] c; bool opCall(int a){ if (a in c) return false; else{ c[a] = true; return true; } } } void main() { [1, 5, 5, 2, 1, 5, 6, 6].filter!Uniq.writeln; }
I can't make sense of this - where is Uniq supposed to be instantiated? -- Andrei
Require following changes in std.algorithm.filter. original: template filter(alias predicate) if (is(typeof(unaryFun!predicate))) { auto filter(Range)(Range range) if (isInputRange!(Unqual!Range)) { return FilterResult!(unaryFun!predicate, Range)(range); } } private struct FilterResult(alias pred, Range) ... changed (to run this case): template filter(alias predicate) if (is(typeof(unaryFun!predicate))) { auto filter(Range)(Range range) if (isInputRange!(Unqual!Range)) { static if (is(predicate == struct)) return FilterResult!(predicate, Range)(range); else return FilterResult!(unaryFun!predicate, Range)(range); } } private struct FilterResult(alias predicate, Range) { static if (is(predicate == struct)) predicate pred; else alias pred = predicate; ...
Jan 24 2015
next sibling parent zeljkog <zeljkog home.com> writes:
On 24.01.15 16:17, zeljkog wrote:
 template filter(alias predicate) if (is(typeof(unaryFun!predicate)))
Should be: template filter(alias predicate) if (is(typeof(unaryFun!predicate)) || is(predicate == struct))
Jan 24 2015
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/24/15 7:17 AM, zeljkog wrote:
 private struct FilterResult(alias predicate, Range)
 {
      static if (is(predicate == struct))
          predicate pred;
      else
          alias pred = predicate;
 ...
The problem is here - the creation of an empty object is unlikely to be of use to the caller. We should, however, support the case when the caller passes a stateful predicate into the call. -- Andrei
Jan 24 2015
parent reply zeljkog <zeljkog home.com> writes:
On 24.01.15 18:02, Andrei Alexandrescu wrote:
 On 1/24/15 7:17 AM, zeljkog wrote:
 private struct FilterResult(alias predicate, Range)
 {
      static if (is(predicate == struct))
          predicate pred;
      else
          alias pred = predicate;
 ...
The problem is here - the creation of an empty object is unlikely to be of use to the caller. We should, however, support the case when the caller passes a stateful predicate into the call. -- Andrei
I think empty object could be useful, but perhaps there is better :) Until now I concluded: 1. Resulting range must be output range. Right? I don't know is it considerable problem? 2. unaryFun can pass down struct (with opCall check?)
Jan 24 2015
parent reply zeljkog <zeljkog home.com> writes:
On 24.01.15 18:26, zeljkog wrote:
 Until now I concluded:
 1. Resulting range must be output range. Right?
Should be input range.
Jan 24 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/24/15 9:34 AM, zeljkog wrote:
 On 24.01.15 18:26, zeljkog wrote:
 Until now I concluded:
 1. Resulting range must be output range. Right?
Should be input range.
For filter it's a forward range if input is forward or better. -- Andrei
Jan 24 2015
next sibling parent reply zeljkog <zeljkog home.com> writes:
On 24.01.15 19:35, Andrei Alexandrescu wrote:
 On 1/24/15 9:34 AM, zeljkog wrote:
 On 24.01.15 18:26, zeljkog wrote:
 Until now I concluded:
 1. Resulting range must be output range. Right?
Should be input range.
For filter it's a forward range if input is forward or better. -- Andrei
But here there is potential problem with shared state after save?
Jan 24 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/24/15 10:41 AM, zeljkog wrote:
 On 24.01.15 19:35, Andrei Alexandrescu wrote:
 On 1/24/15 9:34 AM, zeljkog wrote:
 On 24.01.15 18:26, zeljkog wrote:
 Until now I concluded:
 1. Resulting range must be output range. Right?
Should be input range.
For filter it's a forward range if input is forward or better. -- Andrei
But here there is potential problem with shared state after save?
It's not a problem - save saves the state of the iteration, not the data being iterated. -- Andrei
Jan 24 2015
parent zeljkog <zeljkog home.com> writes:
On 24.01.15 19:47, Andrei Alexandrescu wrote:
 But here there is potential problem with shared state after save?
It's not a problem - save saves the state of the iteration, not the data being iterated. -- Andrei
If you pass stateful object with opCall or closure it becomes part of "state of the iteration"? Consider: import std.stdio, std.algorithm; struct Uniq{ bool[int] c; bool opCall(int a){ if (a in c) return false; else{ c[a] = true; return true; } } } void main() { Uniq a; auto arr = [1, 5, 5, 2, 1, 5, 6, 6]; auto r = arr.filter!a; auto r1 = r.save(); r.writeln; r1.writeln; } output: [1, 2, 6] [5]
Jan 24 2015
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Jan 24, 2015 at 10:35:13AM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/24/15 9:34 AM, zeljkog wrote:
On 24.01.15 18:26, zeljkog wrote:
Until now I concluded:
1. Resulting range must be output range. Right?
Should be input range.
For filter it's a forward range if input is forward or better. -- Andrei
Hmm. Couldn't filter be made bidirectional too, if the underlying range is also bidirectional? After all, filtering doesn't depend on the surrounding elements, so in theory you should be able to popBack as well. T -- "How are you doing?" "Doing what?"
Jan 24 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/24/15 11:00 AM, H. S. Teoh via Digitalmars-d wrote:
 On Sat, Jan 24, 2015 at 10:35:13AM -0800, Andrei Alexandrescu via
Digitalmars-d wrote:
 On 1/24/15 9:34 AM, zeljkog wrote:
 On 24.01.15 18:26, zeljkog wrote:
 Until now I concluded:
 1. Resulting range must be output range. Right?
Should be input range.
For filter it's a forward range if input is forward or better. -- Andrei
Hmm. Couldn't filter be made bidirectional too, if the underlying range is also bidirectional? After all, filtering doesn't depend on the surrounding elements, so in theory you should be able to popBack as well.
Andrei
Jan 24 2015
parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Jan 24, 2015 at 11:20:55AM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
 On 1/24/15 11:00 AM, H. S. Teoh via Digitalmars-d wrote:
On Sat, Jan 24, 2015 at 10:35:13AM -0800, Andrei Alexandrescu via Digitalmars-d
wrote:
[...]
For filter it's a forward range if input is forward or better. --
Andrei
Hmm. Couldn't filter be made bidirectional too, if the underlying range is also bidirectional? After all, filtering doesn't depend on the surrounding elements, so in theory you should be able to popBack as well.
[...] Ah, I should have known, since I do remember seeing this when splitting std.algorithm. I guess it didn't really register beyond "this function belongs in std.algorithm.iteration". :-P It's kinda sad that we can't merge the two functions and make it so that the initial scanning of .back only happens lazily. I mean, in theory, this should be easy: struct Filter { R range; ... property auto back() { while (!range.empty && !pred(range.back)) range.popBack(); return range.back; } void popBack() { range.popBack(); } } But I remember there was a huge discussion over whether .front/.back/.empty should do any non-trivial work. The current implementation appears to be the consequence of shoehorning all work into popFront/popBack. But in my mind, these methods ought to be abstract, and the user shouldn't be able to see any difference in behaviour regardless of where the real work is actually done. T -- People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like. Otherwise the programs they write will be pretty weird. -- D. Knuth
Jan 24 2015
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/24/15 3:50 AM, Peter Alexander wrote:
 On Friday, 23 January 2015 at 18:22:08 UTC, zeljkog wrote:
 On 23.01.15 19:13, Andrei Alexandrescu wrote:
 On 1/23/15 10:05 AM, zeljkog wrote:
 On 23.01.15 18:48, H. S. Teoh via Digitalmars-d wrote:
 I think what he's trying to do is to call a function that returns a
 delegate, and use that delegate to instantiate the filter template.
 AFAIK I've never seen code like this before, and it looks like the
 compiler isn't prepared to handle this.
Yes, I tried to use filter for unique, need closure. I think there are many applications for this pattern.
Please post a complete snippet then. Thanks! -- Andrei
import std.stdio, std.algorithm; auto unique(){ bool[int] c; return (int a){ if (a in c) return false; else{ c[a] = true; return true; } }; } void main() { [1, 5, 5, 2, 1, 5, 6, 6].filter!(unique()).writeln; }
auto f = unique(); [1, 5, 5, 2, 1, 5, 6, 6].filter!(f).writeln; // [1, 5, 2, 6] Filter needs an alias, and you cannot alias an R-value (it has no symbol).
Hmmm... we do allow rvalues sometimes (e.g. for strings). I think we could and should relax the rule to allow rvalues in this case, too. -- Andrei
Jan 24 2015
parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 24 January 2015 at 14:53:02 UTC, Andrei Alexandrescu
wrote:
 On 1/24/15 3:50 AM, Peter Alexander wrote:
 On Friday, 23 January 2015 at 18:22:08 UTC, zeljkog wrote:
 On 23.01.15 19:13, Andrei Alexandrescu wrote:
 On 1/23/15 10:05 AM, zeljkog wrote:
 On 23.01.15 18:48, H. S. Teoh via Digitalmars-d wrote:
 I think what he's trying to do is to call a function that 
 returns a
 delegate, and use that delegate to instantiate the filter 
 template.
 AFAIK I've never seen code like this before, and it looks 
 like the
 compiler isn't prepared to handle this.
Yes, I tried to use filter for unique, need closure. I think there are many applications for this pattern.
Please post a complete snippet then. Thanks! -- Andrei
import std.stdio, std.algorithm; auto unique(){ bool[int] c; return (int a){ if (a in c) return false; else{ c[a] = true; return true; } }; } void main() { [1, 5, 5, 2, 1, 5, 6, 6].filter!(unique()).writeln; }
auto f = unique(); [1, 5, 5, 2, 1, 5, 6, 6].filter!(f).writeln; // [1, 5, 2, 6] Filter needs an alias, and you cannot alias an R-value (it has no symbol).
Hmmm... we do allow rvalues sometimes (e.g. for strings). I think we could and should relax the rule to allow rvalues in this case, too. -- Andrei
I was wrong. R-values work, as long as they are compile time constants. The problem here is that closures don't yet work in CTFE (as the error says). Pulling it out as a local variable works because the alias then binds to the symbol rather than the value.
Jan 24 2015
prev sibling parent zeljkog <zeljkog home.com> writes:
On 23.01.15 19:05, zeljkog wrote:
 On 23.01.15 18:48, H. S. Teoh via Digitalmars-d wrote:
 I think what he's trying to do is to call a function that returns a
 delegate, and use that delegate to instantiate the filter template.
 AFAIK I've never seen code like this before, and it looks like the
 compiler isn't prepared to handle this.
Yes, I tried to use filter for unique, need closure. I think there are many applications for this pattern.
It comes down to a question: Why structs with opCall are not supported as template parameters in std.algorithm together with functions? Like this: filter!SomeStructWithOpCall It is simple addition. Or is it with reason?
Jan 24 2015