digitalmars.D.learn - byChunk odd behavior?
- Hanh (26/26) Mar 22 2016 Hi all,
- Hanh (4/30) Mar 22 2016 I have the feeling that it's related to the forward only nature
- Taylor Hillegeist (26/52) Mar 22 2016 I dont know if this helps, but it looks like since take three
- Ali Çehreli (11/32) Mar 22 2016 I don't understand the issue fully but byChunk() will treat every
- cy (30/32) Mar 22 2016 Never use an input range twice. So, here's how to use it twice:
- Hanh (18/18) Mar 22 2016 Thanks for your help everyone.
- Chris Wright (8/10) Mar 23 2016 import std.range, std.array;
- cym13 (16/34) Mar 23 2016 Doing *anything* to a range invalidates it (or at least you
Hi all,

I'm trying to process a rather large file as an InputRange and have run into something strange with byChunk / take.

    import std.algorithm, std.array, std.range, std.stdio;

    void test() {
        auto file = new File("test.txt");
        auto input = file.byChunk(2).joiner;
        input.take(3).array;
        foreach (char c; input) {
            writeln(c);
        }
    }

Let's say test.txt contains "123456". The output will be

    3
    4
    5
    6

The "take" consumed one chunk from the file, but if I increase the chunk size to 4, then it won't. It looks like if "take" spans two chunks, it affects the input range; otherwise it doesn't.

Actually, what is the easiest way to read a large file as a stream? My file contains a bunch of serialized messages of variable length.

Thanks,
--h
Mar 22 2016
On Tuesday, 22 March 2016 at 07:17:41 UTC, Hanh wrote:
> It looks like if "take" spans two chunks, it affects the input range; otherwise it doesn't.

I have the feeling that it's related to the forward-only nature of an InputRange. All would be OK with a take(N) + popFrontN method. I'm going to keep looking.
Mar 22 2016
On Tuesday, 22 March 2016 at 07:17:41 UTC, Hanh wrote:
> The "take" consumed one chunk from the file, but if I increase the chunk size to 4, then it won't.

I don't know if this helps, but it looks like, since take(3) doesn't consume the whole chunk, that chunk is not removed from the range.

    import std.stdio;
    import std.algorithm;
    import std.range;

    void main() {
        auto file = stdin;
        auto input = file.byChunk(2).joiner;
        foreach (char c; input.take(3).array) {
            writeln(c);
        }
        foreach (char c; input) {
            writeln(c);
        }
    }

Produces:

    1
    2
    3   < Got data but didn't eat the chunk.
    3
    4
    5
    6
Mar 22 2016
On 03/22/2016 12:17 AM, Hanh wrote:
> The "take" consumed one chunk from the file, but if I increase the chunk size to 4, then it won't.

I don't fully understand the issue, but byChunk() treats every character in the file as data, so even the newline character(s) are included.

> Actually, what is the easiest way to read a large file as a stream? My file contains a bunch of serialized messages of variable length.

If it's a text file, I think I would start with File.byLine (or byLineCopy). Then it depends on how the messages are laid out. One per line? Do you know the size at the start? etc.

Alternatively, use (or examine) one of the great D serialization modules out there. :) (We already need something like this in the standard library, which I think some people are already working on.)

Ali
Mar 22 2016
On Tuesday, 22 March 2016 at 07:17:41 UTC, Hanh wrote:
> input.take(3).array;
> foreach (char c; input) {

Never use an input range twice. So, here's how to use it twice:

If it's a "forward range" you can use save() to get a copy to use later (but none of the std.stdio.* ranges implement that). You can also use "std.range.tee" to send the results to an "output range" (something implementing put(K)(K)) while iterating over them.

tee can't produce two input ranges, because without caching all iterated items in memory, only one range can request items on demand; the other must take them passively.

You could write a thing that takes an InputRange and produces a ForwardRange by caching those items in memory, but at that point you might as well use .array and get the whole thing.

ByChunk is an input range (not a forward range), so there's undefined behavior when you use it twice. No bugs there, since it wasn't meant to be reused anyway. What it does is cache the last seen chunk, first iterate over that, then read more chunks from the file. So every time you iterate, you'll get that same last chunk.

It's also tricky to use input ranges after mutating their underlying data structure. If you seek in the file, for instance, then a previously created ByChunk will produce the chunk it has cached, and only then start reading chunks from that exact position in the file. For a range over some sort of list, if you delete the current item in the list, should the range produce the previous item? The next item? null?

So, as a general rule, never use input ranges twice, and never use them after mutating the underlying data structure. Just recreate them if you want to do something twice, or use tee as mentioned above.
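As a rough illustration of the save() and tee points above, here is a minimal sketch. The helper name `peek` is made up for this example and is not part of Phobos; iota stands in for any forward range, and the static assert only checks that the range returned by byChunk offers no save().

```d
import std.range;          // iota, take, tee, isForwardRange (and std.array's array)
import std.stdio : File;

/// Hypothetical helper (not in Phobos): copy the first n elements of a
/// forward range into an array. It works on range.save, so even a
/// reference-type forward range is left where it was.
auto peek(R)(R range, size_t n)
if (isForwardRange!R)
{
    return range.save.take(n).array;
}

void main()
{
    // iota is a forward range, so looking ahead is safe.
    auto numbers = iota(0, 6);
    assert(peek(numbers, 3) == [0, 1, 2]);
    assert(numbers.front == 0);   // still at the start

    // The range returned by byChunk is only an input range: no save().
    static assert(!isForwardRange!(typeof(File.init.byChunk(4096))));

    // tee observes each element passively while a single consumer
    // drives the iteration.
    int sum = 0;
    auto copy = iota(1, 4).tee!(a => sum += a).array;
    assert(copy == [1, 2, 3]);
    assert(sum == 6);
}
```

Because `peek` is constrained to forward ranges, passing it a byChunk-based range fails at compile time instead of silently misbehaving.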
Mar 22 2016
Thanks for your help everyone.

I agree that the issue is due to the misuse of an InputRange, but what are the semantics of 'take' when applied to an InputRange? It seems that calling it invalidates the range; in which case, what is the recommended way to get a few bytes and keep on advancing?

For instance, to read a ushort, I use

    range.read!(ushort)()

Unfortunately, it reads a single value. For now, I use a loop

    foreach (i; 0..N) {
        buffer[i] = range.front;
        range.popFront();
    }

Is there a more idiomatic way to do the same thing?

In Scala, 'take' consumes bytes from the iterator. So the same code would be

    buffer = range.take(N).toArray
Mar 22 2016
On Wed, 23 Mar 2016 03:17:05 +0000, Hanh wrote:
> In Scala, 'take' consumes bytes from the iterator. So the same code would be
>     buffer = range.take(N).toArray

    import std.range, std.array;
    auto bytes = byteRange.takeExactly(N).array;

There's also take(N), but if the range contains fewer than N elements, it will only give you as many as the range contains. If you're trying to deserialize something, takeExactly is probably better.

http://dpldocs.info/experimental-docs/std.range.takeExactly.html
http://dpldocs.info/experimental-docs/std.array.array.1.html
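The difference between the two can be seen on a small in-memory array. This is only a sketch; the three-element `data` array is invented for the example.

```d
import std.array : array;
import std.range : take, takeExactly;

void main()
{
    int[] data = [1, 2, 3];

    // take is lenient: asking for more than is available yields what is there.
    assert(data.take(5).array == [1, 2, 3]);

    // takeExactly promises exactly N elements, so the caller must make
    // sure the range really holds that many.
    assert(data.takeExactly(2).array == [1, 2]);

    // Neither call advanced the original array.
    assert(data.length == 3);
}
```

For deserialization, a short read is usually an error, which is why a helper that refuses to return fewer than N elements tends to be the safer default.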
Mar 23 2016
On Wednesday, 23 March 2016 at 03:17:05 UTC, Hanh wrote:
> I agree that the issue is due to the misuse of an InputRange, but what are the semantics of 'take' when applied to an InputRange? It seems that calling it invalidates the range; in which case, what is the recommended way to get a few bytes and keep on advancing?

Doing *anything* to a range invalidates it (or at least you should expect it to); a range is read-once. Never reuse a range. Some ranges can be saved in order to use a copy, but never expect a range to be implicitly reusable.

> For instance, to read a ushort, I use
>     range.read!(ushort)()
> Unfortunately, it reads a single value. For now, I use a loop
>     foreach (i; 0..N) { buffer[i] = range.front; range.popFront(); }
> Is there a more idiomatic way to do the same thing?

Two ways, the first one being for reference:

    import std.range : enumerate;

    // enumerate yields (index, element) pairs, in that order
    foreach (index, element; range.enumerate) {
        buffer[index] = element;
    }

And the other one:

> In Scala, 'take' consumes bytes from the iterator. So the same code would be
>     buffer = range.take(N).toArray

Then just do that!

    import std.range, std.array;
    auto buffer = range.take(N).array;

    auto example = iota(0, 200, 5).take(5).array;
    assert(example == [0, 5, 10, 15, 20]);
Mar 23 2016
On Wednesday, 23 March 2016 at 19:07:34 UTC, cym13 wrote:
> Then just do that!
>
>     import std.range, std.array;
>     auto buffer = range.take(N).array;

Well, that's what I do in the first post but you can't call it twice with an InputRange.

    auto buffer1 = range.take(4).array; // ok
    range.popFrontN(4);                 // not ok
    auto buffer2 = range.take(4).array; // not ok
Mar 24 2016
On Thursday, 24 March 2016 at 07:52:27 UTC, Hanh wrote:
> Well, that's what I do in the first post but you can't call it twice with an InputRange.
>
>     auto buffer1 = range.take(4).array; // ok
>     range.popFrontN(4);                 // not ok
>     auto buffer2 = range.take(4).array; // not ok

Please, take some time to reread cy's answer above.

    void main(string[] args) {
        import std.range;
        import std.array;
        import std.algorithm;

        auto range = iota(0, 25, 5);

        // Will not consume (forward ranges only)
        //
        // Note however that range elements are not stored in any way by default
        // so reusing the range will also need you to recompute them each time!
        auto buffer1 = range.save.take(4).array;
        assert(buffer1 == [0, 5, 10, 15]);

        // The solution to the recomputation problem, and often the best way to
        // handle range reuse, is to store them in an array
        //
        // This is reusable at will with no redundant computation
        auto arr = range.save.array;
        assert(arr == [0, 5, 10, 15, 20]);

        // And it has a range interface too
        auto buffer2 = arr.take(4).array;
        assert(buffer2 == [0, 5, 10, 15]);

        // This consumes
        auto buffer3 = range.take(4).array;
        assert(buffer3 == [0, 5, 10, 15]);
    }
Mar 25 2016
On Friday, 25 March 2016 at 08:01:04 UTC, cym13 wrote:
>     // This consumes
>     auto buffer3 = range.take(4).array;
>     assert(buffer3 == [0, 5, 10, 15]);
> }

Thanks for your help. However, the last statement is incorrect. I am in fact looking for a version of 'take' that consumes the InputRange.

You can see it by doing a second take afterwards.

    auto buffer3 = range.take(4).array;
    assert(buffer3 == [0, 5, 10, 15]);
    auto buffer4 = range.take(4).array;
    assert(buffer4 == [0, 5, 10, 15]);

I haven't clearly explained my main goal. I have a large binary file that I need to deserialize. It's not my file and it's in a custom but simple format, so I would prefer not to depend on a third-party serializer library, but I will look into that.

I was thinking along the lines of:

1. Open file
2. Map a byChunk.joiner to read by chunks and present an iterator interface
3. Read data with std.bitmanip/read functions

Step 3 works fine as long as items are single scalar values. bitmanip doesn't have array readers. Obviously, I could loop, but then I thought that for the case of a ubyte[], there would be a shortcut that I don't know about.

Thanks,
--h
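For a sense of how step 3 could handle variable-length items, here is a sketch under stated assumptions. It swaps the byChunk.joiner stream for data already held in a ubyte[], where slicing gives the array shortcut directly, and it invents a message layout (a big-endian ushort length followed by that many payload bytes) purely for illustration; the real file format is not described in the thread. `nextMessage` is a hypothetical name.

```d
import std.bitmanip : read;
import std.exception : enforce;

/// Hypothetical layout, assumed for this example only:
/// a big-endian ushort length, then that many payload bytes.
ubyte[] nextMessage(ref ubyte[] data)
{
    // std.bitmanip.read takes the range by ref and consumes two bytes
    // (big-endian is its default byte order).
    immutable len = data.read!ushort();
    enforce(data.length >= len, "truncated message");

    // With an array, the payload is just a slice; then advance past it.
    auto payload = data[0 .. len];
    data = data[len .. $];
    return payload;
}

void main()
{
    ubyte[] data = [0, 2, 0xAA, 0xBB, 0, 1, 0xCC];
    assert(nextMessage(data) == [0xAA, 0xBB]);
    assert(nextMessage(data) == [0xCC]);
    assert(data.length == 0);
}
```

This only applies when a whole message (or the whole file) fits in memory; for a true stream, the elements still have to be popped one at a time.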
Mar 25 2016
On Saturday, 26 March 2016 at 02:28:53 UTC, Hanh wrote:
> Thanks for your help. However, the last statement is incorrect. I am in fact looking for a version of 'take' that consumes the InputRange.

Sorry, it seems I completely misunderstood your goal. I thought that take() consumed its input (which mostly only shows that I really am careful about not reusing ranges). Writing a take that consumes shouldn't be difficult though:

    import std.range, std.traits;

    Take!R takeConsume(R)(auto ref R input, size_t n)
        if (isInputRange!(Unqual!R) && !isInfinite!(Unqual!R))
    {
        auto buffer = input.take(n);
        input = input.drop(buffer.walkLength);
        return buffer;
    }

but I think going with std.bitmanip/read may be the easiest in the end.
Mar 26 2016
On Saturday, 26 March 2016 at 08:34:04 UTC, cym13 wrote:
> Writing a take that consumes shouldn't be difficult though:
> ...
> but I think going with std.bitmanip/read may be the easiest in the end.

Turns out bitmanip is actually using a loop.

    foreach(ref e; bytes) {
        e = range.front;
        range.popFront();
    }

By the way, in your code above you are actually reusing the range: take is followed by drop and it won't work on an input range like 'byChunk'. That's the problem I ran into (see first post).
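One way to get a consuming take without touching the range twice is to wrap that same loop in a helper that takes the range by ref. The sketch below is an assumption-laden illustration, not a Phobos function: the name `takeConsume` is borrowed from the post above but the body is different, and `chunks(2).joiner` over an in-memory array stands in for `file.byChunk(2).joiner` so the example runs without a file.

```d
import std.algorithm : joiner;
import std.range;   // chunks, ElementType, isInputRange, empty/front/popFront

/// Hypothetical helper: pops up to n elements off `range` (passed by ref,
/// so the caller's range really advances) and returns them as an array.
ElementType!R[] takeConsume(R)(ref R range, size_t n)
if (isInputRange!R)
{
    ElementType!R[] result;
    result.reserve(n);
    for (; n > 0 && !range.empty; --n)
    {
        result ~= range.front;
        range.popFront();
    }
    return result;
}

void main()
{
    ubyte[] bytes = [1, 2, 3, 4, 5, 6];

    // Stand-in for file.byChunk(2).joiner.
    auto input = bytes.chunks(2).joiner;

    assert(takeConsume(input, 3) == [1, 2, 3]);   // spans two chunks
    assert(takeConsume(input, 2) == [4, 5]);      // continues where it left off
    assert(takeConsume(input, 5) == [6]);         // short read at the end
    assert(input.empty);
}
```

Each element is read with front and removed with popFront exactly once, so the helper does not depend on save() or on copying the range.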
Mar 26 2016