www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Is there something like a consuming take?

reply berni <someone somewhere.com> writes:
I want to copy the first n items of a range to an array, removing 
these items from the range.

This works:

 foreach (i;0..n)
 {
    data ~= r.front;
    r.popFront();
 }
but looks a little bit arkward. I came up with this now:
 data = r.take(n).array;
This works partly, because the values of r are not consumed. So I have to call afterwards:
 r = r.drop(n);
Now I wonder, if it is possible to do this with one single call, something like data = r.take_consuming(n).array; Does there something like this exist?
Jul 06 2019
parent reply a11e99z <black80 bk.ru> writes:
On Saturday, 6 July 2019 at 11:20:50 UTC, berni wrote:
 I want to copy the first n items of a range to an array, I came 
 up with this now:
 data = r.take(n).array;
This works partly, because the values of r are not consumed. So I have to call afterwards:
 r = r.drop(n);
Now I wonder, if it is possible to do this with one single call, something like data = r.take_consuming(n).array; Does there something like this exist?
sure auto take_consuming( R )( ref R r, int cnt ) { auto tmp = r.take( cnt ).array; r = r.drop( cnt ); return tmp; } don't thank
Jul 06 2019
parent reply berni <someone somewhere.com> writes:
On Saturday, 6 July 2019 at 11:48:51 UTC, a11e99z wrote:
 sure
 auto take_consuming( R )( ref R r, int cnt ) {
     auto tmp = r.take( cnt ).array;
     r = r.drop( cnt );
     return tmp;
 }
 don't thank
Doesn't look like what I'm looking for, as it is exactly the same I allready found. Maybe I need to explain, what I dislike with this approach: take() calls popFront n times and drop() calls popFront another n times giving a total of 2n times (depending on the underlying range, this might cause lot's of calulcations be done twice. The first version with the foreach loop calls popFront only n times.
Jul 06 2019
next sibling parent a11e99z <black80 bk.ru> writes:
On Saturday, 6 July 2019 at 12:10:13 UTC, berni wrote:
 On Saturday, 6 July 2019 at 11:48:51 UTC, a11e99z wrote:

 Maybe I need to explain, what I dislike with this approach: 
 take() calls popFront n times and drop() calls popFront another 
 n times giving a total of 2n times (depending on the underlying 
 range, this might cause lot's of calulcations be done twice. 
 The first version with the foreach loop calls popFront only n 
 times.
auto take_consuming( R )( ref R r, int cnt ) { import std.range.primitives : hasSlicing; static if (hasSlicing!R) { // without allocations auto tmp = r[0..cnt]; r = r[cnt..$]; // or r.popFronN( cnt ); // O(1) return tmp; } else { // loop range once auto tmp = uninitializedArray!( ElementType!R[])( cnt); int k = 0; for (; !r.empty && k<cnt; ++k, r.popFront) tmp[ k] = r.front; return tmp[ 0..k]; } }
Jul 06 2019
prev sibling parent reply berni <someone somewhere.com> writes:
Now it's getting weird. Meanwhile I encountered, that take() 
sometimes consumes and sometimes not. Where can I learn, what is 
the reason behind this behavior? And how can I handle this?
Jul 06 2019
next sibling parent reply berni <someone somewhere.com> writes:
A small example showing this strange behaviour:

import std.stdio;
import std.algorithm.iteration;
import std.range;

enum BUFFER_SIZE = 1024;

void main(string[] args)
{
    auto a = (new File(args[1]))
        .byChunk(BUFFER_SIZE)
        .joiner;

    writeln(a.take(5));
    writeln(a);
}
Using a file, containing the bytes 1 to 10 I get:
[ 1, 2, 3, 4, 5 ]
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
take does not consume. When I now change BUFFER_SIZE to 2 I get:
[ 1, 2, 3, 4, 5 ]
[ 5, 6, 7, 8, 9, 10 ]
Now the first two buffers have been consumend and the third ([5, 6]) not. Feels like a bug in Phobos. But maybe I do not understand, what's happening and this is correct behaviour. Can anyone explain or confirm, that this is a bug?
Jul 06 2019
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Saturday, 6 July 2019 at 14:40:23 UTC, berni wrote:
        .byChunk(BUFFER_SIZE)
byChunk is defined to reuse its buffer between calls. http://dpldocs.info/experimental-docs/std.stdio.byChunk.1.html#examples This means previous contents are overwritten when you advance.
 When I now change BUFFER_SIZE to 2 I get:

[ 1, 2, 3, 4, 5 ]
[ 5, 6, 7, 8, 9, 10 ]
Now the first two buffers have been consumend and the third ([5, 6]) not.
So here, the take call gave you a view into the first 5 elements. It read one and two and printed them, then byChunk.popFront was called, overwriting the buffer with 3,4 and take passed that to writeln, then popFront again, overwriting with 5,6. writeln printed out 5, and take, having finished its work, left the buffer alone in its state. Now, you print the other thing, which still has 5,6 in the buffer, then popFront, overwrites with 7,8, etc and so on. So this is a case of input range behavior - always consuming the underlying file - combined with buffering of two elements at once, leaving 5,6 behind, and the reuse of the buffer meaning you see that 5,6 again on the next call.
Jul 06 2019
parent berni <someone somewhere.com> writes:
On Saturday, 6 July 2019 at 14:48:04 UTC, Adam D. Ruppe wrote:
 [...]
 So this is a case of input range behavior - always consuming 
 the underlying file - combined with buffering of two elements 
 at once, leaving 5,6 behind, and the reuse of the buffer 
 meaning you see that 5,6 again on the next call.
Thanks for clearifing what happens. In my oppinion the behaviour of take() should be better defined. It's clear, that take() returns a range with the first n elements of the underlaying range (and that is done lazily). But it's not specified what happens with the underlaying range. As the behaviour is unpredictable (or at least hard to predict), one should assume, that the underlaying range is completely destroyed by take(). This makes take() much less usefull, than it could be, in my eyes. :-(
Jul 06 2019
prev sibling next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Saturday, 6 July 2019 at 14:12:36 UTC, berni wrote:
 Meanwhile I encountered, that take() sometimes consumes and 
 sometimes not.
It depends on what you're passing. So take is defined as just getting the first N elements from the given range. So what happens next depends on what it is "taking" from (I don't like the name "take" exactly because that implies, well, taking. What the function really does is more like "view into first N elements". With input ranges, iterating over them consumes it implicitly, so you could say that take *always* consumes input ranges (at least once it gets iterated over). But, it will frequently consume a copy of the range instead of the one you have at the top level, since ranges are passed by value. If you want it to be seen outside, `ref` is the general answer. You might find success with refRange in some cases. http://dpldocs.info/experimental-docs/std.range.refRange.html But if you are passing something with slicing, like a plain array, take will never actually consume it, and instead just slice the input, even if it is ref. This gets the view of those first elements in cheaper way. So going back to your original definition:
 I want to copy the first n items of a range to an array, 
 removing these items from the range.
As far as I know, none of the std.range functions are defined to do this. I'd probably write your own that: 1) takes the range by ref so changes are visible outside 2) iterates over it with popFront 3) returns the copy that should fulfill all the requirements. you could slightly optimize arrays (or other hasSlicing things) like int[] yourFunction(ref int[] arr, int n) { auto ret = arr[0 .. n]; arr = arr[n .. $]; return ret; } that is, just slicing and consuming in one go for each side and then you don't even have to actually copy it, just return the slice.
Jul 06 2019
prev sibling next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, July 6, 2019 8:12:36 AM MDT berni via Digitalmars-d-learn 
wrote:
 Now it's getting weird. Meanwhile I encountered, that take()
 sometimes consumes and sometimes not. Where can I learn, what is
 the reason behind this behavior? And how can I handle this?
take _always_ consumes the range that it's given. The problem is that some types of ranges are implicitly saved when they're copied, whereas others aren't, and when a range is implicitly saved when it's copied, you end up with the copy being consumed. Dynamic arrays are implicitly saved when they're copied, so the range you pass is saved, and the copy is consumed instead of the original. In generic code, you have to assume that once a range has been copied, you can't use it anymore (just the copy) precisely because the semantics of copying differ depending on the type of the range. You can only use a range after copying it if you know what type of range you're dealing with and how it behaves. So, you can rely on a dynamic array implicitly saving when it's passed to take, but in generic code, you really shouldn't be using a range again once you pass it to take, because what actually happens is dependent on the type of the range. In general what this means is that if you pass a range to a function, and you then want to use the range again afterwards, you need to call save when passing it to the function, and otherwise, you just assume that it's consumed and don't use it again. You certainly don't pass it to a function with the expectation of some elements being consumed and then continue to use the rest of the range unless the function takes its argument by ref or by pointer (which relatively few range-based functions do). If you want a function that's guaranteed to not implicitly copy a range, then it needs to accept the argument by ref or take a pointer to it. In the case where a function doesn't have to do the work lazily, ref would work, but in a case like take where you're returning a wrapper range, pointers would be required. So, a version of take that didn't ever copy the range it's given and thus never risked implicitly saving the range it was passed would have to either take a pointer to it or take it by ref and then take the address of the ref. In either case, the code using such a take would then have to ensure that the original range didn't leave scope and get destroyed before the take range was consumed, or the take range would then refer to invalid memory. - Jonathan M Davis
Jul 06 2019
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Saturday, 6 July 2019 at 18:17:26 UTC, Jonathan M Davis wrote:
 take _always_ consumes the range that it's given
not if it hasSlicing. see http://dpldocs.info/experimental-docs/source/std.range.d.html#L2015 but yeah otherwise i agree with you
Jul 06 2019
next sibling parent reply berni <someone somewhere.com> writes:
I start to understand: A (lazy) range is something like a 
voucher: byChunk() does not provide the data immediately, but 
only a voucher to hand you a chunk (an array of 2 bytes in my 
example above) a time. This voucher is passed to joiner(), which 
keeps that voucher and hands me out a new voucher for a byte a 
time (called "a" in my example above). This voucher is now passed 
to take(), whick again keeps that voucher and hands me out yet an 
other voucher for a byte a time, but limits itself to 5 items.

Up to now nothing has happend, but the exchange of that vouchers. 
Now when writeln() is called, it shows take() that voucher and 
says: "Give me the first of your numbers". take() uses his 
voucher to do the same with joiner(), which uses its voucher to 
get the first chunk of byChunk(), takes the first byte of that 
chunk and hands it to take() and take() hands it to writeln().

Now writeln() uses it's voucher once again, to get the next byte. 
take() asks joiner() again, but this time joiner() does not ask 
byChunk(), because it still has one number left in the chunk it 
got when asking the last time. Now joiner() uses this second byte 
to hand it back. And so on.

After writeln() finished I'm doing something evil: I use a 
private copy of that voucher, that I allready handed to take() to 
use it myself. That's sort of stealing back. And hence the 
results are more or less unpredictable. Especially, when I use 
this copy, before take() finished it's job, as Jonathan M Davis 
points out.

My misthinking was to assume, that my copy of the voucher is 
still the same as the voucher, that take() owns after having 
handed out the 5 items, so I can use it to get the other items. 
If it where possible to ask take() to hand me back it's voucher, 
I could do it with this voucher, but I don't see a possibility to 
get that voucher back.

What I actually want to do, when using take(), is not to get a 
new voucher, but I want to keep my voucher and just use this 
voucher to get the first 5 items - take() doesn't do that, so 
I've got to write my own function which cannot be lazy.

Or better: I'd like to hand in my voucher and get back two 
vouchers, one for the first 5 bytes and one for the rest. That's 
actually, what I thought, take() is doing...
Jul 06 2019
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, July 6, 2019 11:02:15 PM MDT berni via Digitalmars-d-learn 
wrote:
 Or better: I'd like to hand in my voucher and get back two
 vouchers, one for the first 5 bytes and one for the rest. That's
 actually, what I thought, take() is doing...
Without slicing, that's impossible without iterating over the elements multiple times. It would be possible to have a function like take that did something like return a tuple of two ranges where the first one is the taken elements, and the second one is the rest, but without slicing, that could only be done by calling save so that the two ranges are independent copies of the original range, and then the range that doesn't have the elements that were taken off has to internally pop off all of the elements that were taken so that its first element is the first one that wasn't taken. There's no way to magically skip the taken elements for the second range without slicing. The alternative is to manually iterate through all of the elements that you want to take, doing whatever you're going to do with them, and then doing whatever you want to do with rest of the range after that. Then iteration only occurs once. But that means that you don't have a range of the taken elements and can't do something like pass them to another algorithm as a range without constructing a new range from them, and if you're going to do that, you might as well just call save, pass the range to take, and the call drop on the original range to pop off the elements that were taken. - Jonathan M Davis
Jul 07 2019
parent reply berni <someone somewhere.com> writes:
On Sunday, 7 July 2019 at 09:01:53 UTC, Jonathan M Davis wrote:
 Without slicing, that's impossible without iterating over the 
 elements multiple times.
That's what I thought too, but meanwhile I think it is possible. To get it working, the second range needs to know about the first one and when the second one is queried the first time it has to notify the first one, that it has to remove all remaining items from the underlying range (and in case it still needs them, save them) and never use it again. The advantage of this setup over copying the items immediately is some sort of laziness. It might happen, that the second one will never query or it might happen, that the first one finished allready when the second one queries. In this cases no additional memory is needed. Even if the second one is queried before the first one finished, the needed buffer might be much smaller.
Jul 07 2019
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Sunday, July 7, 2019 12:36:42 PM MDT berni via Digitalmars-d-learn wrote:
 On Sunday, 7 July 2019 at 09:01:53 UTC, Jonathan M Davis wrote:
 Without slicing, that's impossible without iterating over the
 elements multiple times.
That's what I thought too, but meanwhile I think it is possible. To get it working, the second range needs to know about the first one and when the second one is queried the first time it has to notify the first one, that it has to remove all remaining items from the underlying range (and in case it still needs them, save them) and never use it again. The advantage of this setup over copying the items immediately is some sort of laziness. It might happen, that the second one will never query or it might happen, that the first one finished allready when the second one queries. In this cases no additional memory is needed. Even if the second one is queried before the first one finished, the needed buffer might be much smaller.
Having one range know about the other isn't enough. That just means that the take range would tell the other range that it had popped an element off, and then the other would know that it had to pop an element off. That still involves popping the same element on different ranges twice. In order to have popFront only called once, they have to be the same range, so when take pops an element off, it pops an element off of the other range. If both the take range and the drop range are wrappers, then it would be possible to do something with pointers to the original range where the number of elements that has been popped off is kept track of, and as long as the take range pops them all first, the drop range can just continue iterating through the original range, but if any elements from the drop range got iterated first, then it would have to save the range and start popping from there not to screw up the take range. And since pointers to the original range would have to be used to avoid having independent copies, you run the risk of all of this being screwed up by a piece of code popping an element from the original range at some point. Also, all of that machinery would likely quickly become far more expensive than simply saving the range to call take on it and then calling drop on the original. Yes, that pops the same elements multiple times, but wrapper ranges already end up calling chains of popFronts, fronts, and emptys as they iterate through a range. You rarely really get a single popFront call for a popFront unless the optimizer has actually managed to optimize out all of the layers. You can try mucking around with your own implemention of take if you want, but I'd suggest that you're just better off taking the approach that it's expected that if you use take, you're going to have to iterate through those elements twice and that if that's not acceptable, it's better to do whatever you need to do with the take range by manually operating on each of the elements rather than having them be a range. Unless random-access ranges with slicing are involved, trying to do anything that effectively slices a range without iterating over the elements multiple times is almost certainly going to be a mess. At some point, if you really want slicing semantics, then you need to make sure that your range actually has slicing rather than trying to act like it does when it doesn't. - Jonathan M Davis
Jul 07 2019
parent reply berni <someone somewhere.com> writes:
On Sunday, 7 July 2019 at 21:55:17 UTC, Jonathan M Davis wrote:
 Having one range know about the other isn't enough. That just 
 means that the take range would tell the other range that it 
 had popped an element off, and then the other would know that 
 it had to pop an element off. That still involves popping the 
 same element on different ranges twice.
Sounds like you didn't understand what I meant. So I tried to implement what I had in mind and I have to confess, that there is something about ranges that I do not understand yet:
import std.stdio;
import std.array;

void main()
{
    auto a = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15];

    call(a,5);
}

void call(T)(ref T range, int c)
{
    struct Take
    {
        T* take_range;
        int count;

         property bool empty=false;
         property auto front() { return (*take_range).front; }
        void popFront()
        {
            (*take_range).popFront();
            if (--count<=0) empty=true;
        }

        T* get_range()
        {
            assert(count==0);
            while (!empty) popFront();
            return take_range;
        }
    }

    auto take = Take(&range,c);

//    writeln(take);
    while (!take.empty) { write(take.front); take.popFront(); }

    auto original_range = take.get_range();

    writeln(*original_range);
}
Running this program leads to
12345[6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
as expected. But when I replace that loop at the bottom by writeln(take) the assertion in get_range fails (count==5). As all the functions are called in the same order by both the loop and writeln, I confess, that writeln somehow manages to use a copy of "take" although this isn't a ForwardRange and copying should therefore not be possible.
 You can try mucking around with your own implemention of take 
 if you want, but I'd suggest that you're just better off taking 
 the approach that it's expected that if you use take,
I't not about improving or doing things different. It's just about me understanding, what happens.
Jul 09 2019
parent berni <someone somewhere.com> writes:
Hurray, it works! :-)

https://run.dlang.io/is/2GMq34

I have to use classes to avoid copying when arguments are passed 
to a function. (And yes, there should of course be much more 
checks, especially when there are to few elements in the original 
range. And it could be speed improved and so on... I wanted to 
keep it simple.)
Jul 09 2019
prev sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, July 6, 2019 12:58:52 PM MDT Adam D. Ruppe via Digitalmars-d-
learn wrote:
 On Saturday, 6 July 2019 at 18:17:26 UTC, Jonathan M Davis wrote:
 take _always_ consumes the range that it's given
not if it hasSlicing. see http://dpldocs.info/experimental-docs/source/std.range.d.html#L2015 but yeah otherwise i agree with you
True enough, though if the code is generic, it still pretty much has to assume that the range was consumed. - Jonathan M Davis
Jul 07 2019
prev sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Saturday, July 6, 2019 12:17:26 PM MDT Jonathan M Davis via Digitalmars-
d-learn wrote:
 On Saturday, July 6, 2019 8:12:36 AM MDT berni via Digitalmars-d-learn

 wrote:
 Now it's getting weird. Meanwhile I encountered, that take()
 sometimes consumes and sometimes not. Where can I learn, what is
 the reason behind this behavior? And how can I handle this?
take _always_ consumes the range that it's given. The problem is that some types of ranges are implicitly saved when they're copied, whereas others aren't, and when a range is implicitly saved when it's copied, you end up with the copy being consumed. Dynamic arrays are implicitly saved when they're copied, so the range you pass is saved, and the copy is consumed instead of the original. In generic code, you have to assume that once a range has been copied, you can't use it anymore (just the copy) precisely because the semantics of copying differ depending on the type of the range. You can only use a range after copying it if you know what type of range you're dealing with and how it behaves. So, you can rely on a dynamic array implicitly saving when it's passed to take, but in generic code, you really shouldn't be using a range again once you pass it to take, because what actually happens is dependent on the type of the range. In general what this means is that if you pass a range to a function, and you then want to use the range again afterwards, you need to call save when passing it to the function, and otherwise, you just assume that it's consumed and don't use it again. You certainly don't pass it to a function with the expectation of some elements being consumed and then continue to use the rest of the range unless the function takes its argument by ref or by pointer (which relatively few range-based functions do). If you want a function that's guaranteed to not implicitly copy a range, then it needs to accept the argument by ref or take a pointer to it. In the case where a function doesn't have to do the work lazily, ref would work, but in a case like take where you're returning a wrapper range, pointers would be required. So, a version of take that didn't ever copy the range it's given and thus never risked implicitly saving the range it was passed would have to either take a pointer to it or take it by ref and then take the address of the ref. In either case, the code using such a take would then have to ensure that the original range didn't leave scope and get destroyed before the take range was consumed, or the take range would then refer to invalid memory.
Another thing to consider about lazy ranges that take pointers to avoid implicit saving and ensure that they consume the original is that if the original has anything popped from it before the wrapper range is fully consumed, then that will screw up the wrapper range, because the elements in the range it's wrapping will have changed. Wrapper ranges are really designed with the idea that they're operating on their own copy of the range. Even with take as it's currently implemented, you risk bugs if you pass it a range which is a reference type without calling save, because if you don't fully consume the take range before using the original range again, you end up screwing up the elements in the take range when you pop anything off of the original range. It really doesn't work well to have a wrapper range which doesn't have its own independent copy of the range to iterate over. The only case where it works cleanly to have a wrapper range like take gives you and not have to iterate through the elements again to have a range starting at the first element after the take range is to have a random-access range that has slicing, in which case, you can just slice it, and take isn't required. Without a random-access range with slicing, if you want to only iterate through the elements once, you really should just be iterating through the range manually rather than using a wrapper range. - Jonathan M Davis
Jul 06 2019