www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Can't recreate a range?

reply Casey <noman email.com> writes:
So, I'm trying to run some tests and I had code that looks 
similar to this:

unittest
{
         auto range = 
readStream(File("test_data/multiple.xml").byLine);
         int count = 0;
         while (!range.empty)
         {
                 count++;
                 range.popFront();
         }
         assert(count == 3);
}

However, if I have a second block above/below it doing the same 
thing, the last assert will fail because in the second block it 
appears the ranges have been joined together.  E.g.:

unittest
{
         auto range = 
readStream(File("test_data/multiple.xml").byLine);
         int count = 0;
         while (!range.empty)
         {
                 count++;
                 range.popFront();
         }
         assert(count == 3); // Passes
}

unittest
{
         auto range = 
readStream(File("test_data/another.xml").byLine);
         int count = 0;
         while (!range.empty)
         {
                 count++;
                 range.popFront();
         }
         assert(count == 2); // Fails
}

To me, it looks like range is not  being overwritten.  The code 
for readStream looks similar to this:

auto readStream(Range)(auto ref Range r) if 
(isInputRange!(Unqual!Range))
{
         struct StreamRange(Range)
         {
                 alias R = Unqual!Range;
                 R _input;

                 this(R input)
                 {
                         this._input = input;
                 }

                 bool empty()
                 {
                         return this._input.empty;
                 }

                 string front()
                 {
                         // Do stuff...
                 }

                 void popFront()
                 {
                 }
         }

         return StreamRange!(Range)(r);
}

I feel like I'm missing something obscure and it's driving me a 
bit batty.  Any clue as to why this is happening?  I'd like to 
not have to worry about creating new variable names between 
tests.  To me, it seems like each unittest block is independent 
of each other and I haven't come across anything that contradicts 
that.  However, I didn't find anything that confirms it either.

Thanks.
Apr 29 2020
next sibling parent reply Paul Backus <snarwin gmail.com> writes:
On Wednesday, 29 April 2020 at 20:43:20 UTC, Casey wrote:
 So, I'm trying to run some tests and I had code that looks 
 similar to this:
[...]
 I feel like I'm missing something obscure and it's driving me a 
 bit batty.  Any clue as to why this is happening?  I'd like to 
 not have to worry about creating new variable names between 
 tests.  To me, it seems like each unittest block is independent 
 of each other and I haven't come across anything that 
 contradicts that.  However, I didn't find anything that 
 confirms it either.

 Thanks.
The code you posted looks correct to me. If you can post a complete example that reproduces the problem, it will be much easier for others to help you debug.
Apr 29 2020
parent reply Casey <noman email.com> writes:
Here's a minimal code example that duplicates the issue:

import std.array, std.range, std.stdio, std.traits, std.string;

auto readStream(Range)(auto ref Range r) if 
(isInputRange!(Unqual!Range))
{
	struct StreamRange(Range)
	{
		alias R = Unqual!Range;
		R _input;

		auto buff = appender!string;

		this(R input)
		{
			this._input = input;
		}

		bool empty()
		{
			return this._input.empty;
		}

		string front()
		{
			if (buff.capacity == 0)
			{
				bool iterate = true;
				bool doCapture = false;
				buff.reserve(1000);

				while (iterate)
				{
					if (this._input.empty)
						break;

					auto value = this._input.front;
					if (value.strip == "<main>")
					{
						doCapture = true;
						buff.put(value.strip);
						buff.put("\n");
					}
					else if (value.strip == "</main>")
					{
						buff.put(value.strip);
						doCapture = false;
						iterate = false;
					}
					else if (doCapture)
					{
						buff.put(value.strip);
						buff.put("\n");
					}
					this._input.popFront();
				}
			}
			return buff[];
		}

		void popFront()
		{
			buff = appender!string;
		}
	}

	return StreamRange!(Range)(r);
}

unittest
{
	auto range = readStream(File("test1.xml").byLine);
	int count = 0;
	while (!range.empty)
	{
		writeln(range.front);
		count++;
		writeln("Current count: ", count);
		range.popFront();
	}
	assert(count == 1);
}

unittest
{
	auto range = readStream(File("test2.xml").byLine);
	int count = 0;
	while (!range.empty)
	{
		writeln(range.front);
		count++;
		writeln("Current count: ", count);
		range.popFront();
	}
	assert(count == 1);
}


Here are the two XML files (very simple):

<?xml version="1.0">
<main>
	<text>Here is some text for the first file.</text>
</main>

<?xml version="1.0">
<main>
	<text>Here is some text for the second file.</text>
</main>

I've even tried putting wrapping the code that is being tested in 
a function and looping over it with different parameters 
(filename and expected count) without any luck either.  I've also 
tried declaring the file handle separate and ensuring it was 
closed.  No luck.
Apr 30 2020
parent reply Paul Backus <snarwin gmail.com> writes:
On Thursday, 30 April 2020 at 13:04:47 UTC, Casey wrote:
 Here's a minimal code example that duplicates the issue:

 import std.array, std.range, std.stdio, std.traits, std.string;

 auto readStream(Range)(auto ref Range r) if 
 (isInputRange!(Unqual!Range))
 {
 	struct StreamRange(Range)
 	{
 		alias R = Unqual!Range;
 		R _input;

 		auto buff = appender!string;
Using a default value like this means that it will be shared among all instances of the struct. Instead, you should set `buff = appender!string` in the constructor, so that each struct will have its own appender.
Apr 30 2020
next sibling parent Simen =?UTF-8?B?S2rDpnLDpXM=?= <simen.kjaras gmail.com> writes:
On Thursday, 30 April 2020 at 13:23:25 UTC, Paul Backus wrote:
 On Thursday, 30 April 2020 at 13:04:47 UTC, Casey wrote:
 Here's a minimal code example that duplicates the issue:

 import std.array, std.range, std.stdio, std.traits, std.string;

 auto readStream(Range)(auto ref Range r) if 
 (isInputRange!(Unqual!Range))
 {
 	struct StreamRange(Range)
 	{
 		alias R = Unqual!Range;
 		R _input;

 		auto buff = appender!string;
Using a default value like this means that it will be shared among all instances of the struct. Instead, you should set `buff = appender!string` in the constructor, so that each struct will have its own appender.
Yup, that's the one. No need to assign it at all, in fact - the line can be replaced with Appender!string buff; And things just work. -- Simen
Apr 30 2020
prev sibling parent reply Casey <noman email.com> writes:
On Thursday, 30 April 2020 at 13:23:25 UTC, Paul Backus wrote:
 Using a default value like this means that it will be shared 
 among all instances of the struct. Instead, you should set 
 `buff = appender!string` in the constructor, so that each 
 struct will have its own appender.
I'll give it a try when I get back to it (fixing lint issues), but are you sure that's the issue? In popFront, I recreate the appender. So, the appender should be clear before the empty check after it processes the last of the data from _input. Still a good thing to do as a best practice; I'm just wondering if something else is causing an issue and even if initializing the appender in the constructor solves the problem, I'd be suspect of future issues cropping up.
Apr 30 2020
next sibling parent Casey <noman email.com> writes:
On Thursday, 30 April 2020 at 15:42:03 UTC, Casey wrote:
 I'll give it a try when I get back to it (fixing lint issues), 
 but are you sure that's the issue?  In popFront, I recreate the 
 appender.  So, the appender should be clear before the empty 
 check after it processes the last of the data from _input.  
 Still a good thing to do as a best practice; I'm just wondering 
 if something else is causing an issue and even if initializing 
 the appender in the constructor solves the problem, I'd be 
 suspect of future issues cropping up.
O.K. It's working now, but I suspect there may still be something else going on. The more I think of it, the more I suspect there's a newline at the end of the file that isn't accounted for until after the last tag is read from the file, so the input is not empty, so my code's empty check fails. Regardless, thanks for the help!
Apr 30 2020
prev sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/30/20 11:42 AM, Casey wrote:
 On Thursday, 30 April 2020 at 13:23:25 UTC, Paul Backus wrote:
 Using a default value like this means that it will be shared among all 
 instances of the struct. Instead, you should set `buff = 
 appender!string` in the constructor, so that each struct will have its 
 own appender.
I'll give it a try when I get back to it (fixing lint issues), but are you sure that's the issue?  In popFront, I recreate the appender.  So, the appender should be clear before the empty check after it processes the last of the data from _input.  Still a good thing to do as a best practice; I'm just wondering if something else is causing an issue and even if initializing the appender in the constructor solves the problem, I'd be suspect of future issues cropping up.
I would say part of the issue is that you are doing all your work in front and not popFront. What happens is that Appender is a pointer-to-implementation struct, and the compiler will allocate the first one shared amongst all initial StreamRange instances. On your first call to front, it's going to utilize that shared one. Then on the call to popFront, it will reset it to another one. For the second unittest, in your first call to front, it notices that it's already been filled, so it doesn't do any work (and returns the existing buffer). another problem, your empty condition is based on the input, which violates range expectations. indeed, on the first call to front, the range all of a sudden becomes empty. I'd say: 1. move your work to the popFront function (you then need to call popFront once before returning the range in your factory method). 2. change empty to check if the buffer is empty instead of the input. Some other observations: 3. You don't need a further Range parameter for the nested struct, it can use the template parameter from the containing function. 4. Make it a static struct, or else it will retain a pointer to the function stack frame needlessly. -Steve
Apr 30 2020
next sibling parent reply Casey <noman email.com> writes:
On Thursday, 30 April 2020 at 16:21:05 UTC, Steven Schveighoffer 
wrote:
 I would say part of the issue is that you are doing all your 
 work in front and not popFront.

 What happens is that Appender is a pointer-to-implementation 
 struct, and the compiler will allocate the first one shared 
 amongst all initial StreamRange instances.

 On your first call to front, it's going to utilize that shared 
 one. Then on the call to popFront, it will reset it to another 
 one.

 For the second unittest, in your first call to front, it 
 notices that it's already been filled, so it doesn't do any 
 work (and returns the existing buffer).
Interesting. I'll take this into account. I was putting the work into front because I didn't want to do the work until it was requested. Putting the work in popFront makes more sense in some ways, but the fact you have to call it before getting any records seems like it would break normal range algorithms. (Please correct me if I'm wrong) I'm wondering if putting the work into it's own method and calling it one from the constructor and from popFront the rest of the way.
 another problem, your empty condition is based on the input, 
 which violates range expectations. indeed, on the first call to 
 front, the range all of a sudden becomes empty.
I was about to argue the point, but after thinking about the use of popFront to execute the work, that would work.
 3. You don't need a further Range parameter for the nested 
 struct, it can use the template parameter from the containing 
 function.
 4. Make it a static struct, or else it will retain a pointer to 
 the function stack frame needlessly.
I'll definitely try that out. Regarding 3, I think that was in the example code I used, so I just went with it.
Apr 30 2020
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/30/20 2:03 PM, Casey wrote:
 Interesting.  I'll take this into account.  I was putting the work into 
 front because I didn't want to do the work until it was requested.  
 Putting the work in popFront makes more sense in some ways, but the fact 
 you have to call it before getting any records seems like it would break 
 normal range algorithms.  (Please correct me if I'm wrong)  I'm 
 wondering if putting the work into it's own method and calling it one 
 from the constructor and from popFront the rest of the way.
You don't necessarily have to do it in a separate function, because until you return it from your factory method (the only way you can create such a thing), it technically doesn't need to be "sane". So what I mean is at the bottom of your function you have: auto readStream(Range)(auto ref Range r) if (isInputRange!(Unqual!Range)) { ... // define your struct auto result = StreamRange(r); if(!r.empty) result.popFront(); // prime range return result; } There is a valid argument to be made about not doing any work until requested. You could do this inside front, but in that case, I would say you have an additional boolean that checks if it has ever been run, and run popFront once (setting that bool to false). But I don't necessarily like that, because that puts an extra burden on front. front is supposed to be non-mutating, called as much as you need (ideally it should be inout). Practically, you can make it do work, but really the bulk of the work should be done in popFront. Note that you will have to check for priming in popFront as well, because if you did for instance: auto r = readStream(...); r.popFront(); // skip first element? This would not do what is expected. In fact, thinking about it some more, you should be able to call popFront as many times as you want and have it skip that many elements without calling front once. Your range will not do that. -Steve
Apr 30 2020
prev sibling parent reply Paul Backus <snarwin gmail.com> writes:
On Thursday, 30 April 2020 at 16:21:05 UTC, Steven Schveighoffer 
wrote:
 I would say part of the issue is that you are doing all your 
 work in front and not popFront.
[...]
 I'd say:

 1. move your work to the popFront function (you then need to 
 call popFront once before returning the range in your factory 
 method).
 2. change empty to check if the buffer is empty instead of the 
 input.
Doing work in popFront instead of front is usually an anti-pattern, since it forces eager evaluation of the next element even when that element is never used. You should only do this if there's no reasonable way to avoid it.
Apr 30 2020
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Apr 30, 2020 at 06:05:55PM +0000, Paul Backus via Digitalmars-d-learn
wrote:
[...]
 Doing work in popFront instead of front is usually an anti-pattern,
 since it forces eager evaluation of the next element even when that
 element is never used. You should only do this if there's no
 reasonable way to avoid it.
Really?? IME, usually you *need* to do work in popFront instead of front, because otherwise .empty wouldn't be able to tell whether there is another element or not. E.g., in filtering a range based on some criterion on each element, you can't defer computing the next element until .front because you can't predict whether there will be another element that won't be dropped by popFront. Also, for ranges based on generator functions, if .front is lazy then you need to keep extra baggage around your range to indicate whether or not the generator has been invoked yet; it's easier to just always compute the next element eagerly and cache it, and .front just returns the cached data. Even when the range involves some expensive per-element computation, I find that it's simpler to just compute and cache in .popFront instead of adding extra baggage to .front to know when the computation has already been performed. I'm hard-pressed to come up with an example where deferring computation to .front is a good idea! T -- A mathematician is a device for turning coffee into theorems. -- P. Erdos
Apr 30 2020
next sibling parent Casey <noman email.com> writes:
On Thursday, 30 April 2020 at 18:30:14 UTC, H. S. Teoh wrote:
 On Thu, Apr 30, 2020 at 06:05:55PM +0000, Paul Backus via 
 Digitalmars-d-learn wrote: [...]
 Doing work in popFront instead of front is usually an 
 anti-pattern, since it forces eager evaluation of the next 
 element even when that element is never used. You should only 
 do this if there's no reasonable way to avoid it.
Really?? IME, usually you *need* to do work in popFront instead of front, because otherwise .empty wouldn't be able to tell whether there is another element or not. E.g., in filtering a range based on some criterion on each element, you can't defer computing the next element until .front because you can't predict whether there will be another element that won't be dropped by popFront. Also, for ranges based on generator functions, if .front is lazy then you need to keep extra baggage around your range to indicate whether or not the generator has been invoked yet; it's easier to just always compute the next element eagerly and cache it, and .front just returns the cached data. Even when the range involves some expensive per-element computation, I find that it's simpler to just compute and cache in .popFront instead of adding extra baggage to .front to know when the computation has already been performed. I'm hard-pressed to come up with an example where deferring computation to .front is a good idea! T
Well, I just remembered that I had a tab open to this: http://ddili.org/ders/d.en/ranges.html Reading, you see the following: * empty: specifies whether the range is empty; it must return true when the range is considered to be empty, and false otherwise * front: provides access to the element at the beginning of the range * popFront(): shortens the range from the beginning by removing the first element Looking at that, if popFront shortens the range, then to me it sounds like the work should be done in popFront.
Apr 30 2020
prev sibling next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/30/20 2:30 PM, H. S. Teoh wrote:
 On Thu, Apr 30, 2020 at 06:05:55PM +0000, Paul Backus via Digitalmars-d-learn
wrote:
 [...]
 Doing work in popFront instead of front is usually an anti-pattern,
 since it forces eager evaluation of the next element even when that
 element is never used. You should only do this if there's no
 reasonable way to avoid it.
Really?? IME, usually you *need* to do work in popFront instead of front, because otherwise .empty wouldn't be able to tell whether there is another element or not. E.g., in filtering a range based on some criterion on each element, you can't defer computing the next element until .front because you can't predict whether there will be another element that won't be dropped by popFront. Also, for ranges based on generator functions, if .front is lazy then you need to keep extra baggage around your range to indicate whether or not the generator has been invoked yet; it's easier to just always compute the next element eagerly and cache it, and .front just returns the cached data. Even when the range involves some expensive per-element computation, I find that it's simpler to just compute and cache in .popFront instead of adding extra baggage to .front to know when the computation has already been performed. I'm hard-pressed to come up with an example where deferring computation to .front is a good idea!
There could be expensive operations done to construct the element that aren't necessary to iterate the range, but generally you achieve this by using something like map and cache. But really the only valid use case for this is if you create a range but never use it. Some may need this, but I think a wrapper would be a better solution than enforcing a certain usage pattern. -Steve
Apr 30 2020
prev sibling parent reply Paul Backus <snarwin gmail.com> writes:
On Thursday, 30 April 2020 at 18:30:14 UTC, H. S. Teoh wrote:
 On Thu, Apr 30, 2020 at 06:05:55PM +0000, Paul Backus via 
 Digitalmars-d-learn wrote: [...]
 Doing work in popFront instead of front is usually an 
 anti-pattern, since it forces eager evaluation of the next 
 element even when that element is never used. You should only 
 do this if there's no reasonable way to avoid it.
Really?? IME, usually you *need* to do work in popFront instead of front, because otherwise .empty wouldn't be able to tell whether there is another element or not. E.g., in filtering a range based on some criterion on each element, you can't defer computing the next element until .front because you can't predict whether there will be another element that won't be dropped by popFront.
There are certainly cases where it can't be avoided. Filter is one; file I/O (as Steven Schveighoffer pointed out) is another one. Obviously, you gotta do what you gotta do. Still, I think that as long as work *can* be deferred to .front, it should be. That's the essence of lazy evaluation: only do your computation once you're absolutely sure it's necessary.
 Also, for ranges based on generator functions, if .front is 
 lazy then you need to keep extra baggage around your range to 
 indicate whether or not the generator has been invoked yet; 
 it's easier to just always compute the next element eagerly and 
 cache it, and .front just returns the cached data.
std.range.generate is actually a perfect example of the problem with doing work in popFront. Because it has to call popFront both on construction *and* after every element, consuming n elements will call the generator function n+1 times. The final call is completely wasted.
Apr 30 2020
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/30/20 6:39 PM, Paul Backus wrote:
 On Thursday, 30 April 2020 at 18:30:14 UTC, H. S. Teoh wrote:
 Also, for ranges based on generator functions, if .front is lazy then 
 you need to keep extra baggage around your range to indicate whether 
 or not the generator has been invoked yet; it's easier to just always 
 compute the next element eagerly and cache it, and .front just returns 
 the cached data.
std.range.generate is actually a perfect example of the problem with doing work in popFront. Because it has to call popFront both on construction *and* after every element, consuming n elements will call the generator function n+1 times. The final call is completely wasted.
generate used to do everything on front. But that meant it wasn't a true range as multiple calls to front generated different data (popFront was a noop). I fixed it a while ago. It should be the status quo to do all work in popFront, and then you should wrap if you need different behavior. I'm ok with something like this (I'm kind of surprised something like this doesn't exist already): struct LazyRange(R) { R src; bool frontEvaluated; void sync() { if(!frontEvaluated) { src.popFront; frontEvaluated = true; } } auto front() { sync(); return src.front; } bool empty() { sync(); return src.empty; } void popFront() { sync(); frontEvaluated = false; } } -Steve
Apr 30 2020
prev sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 4/30/20 2:05 PM, Paul Backus wrote:
 On Thursday, 30 April 2020 at 16:21:05 UTC, Steven Schveighoffer wrote:
 I would say part of the issue is that you are doing all your work in 
 front and not popFront.
[...]
 I'd say:

 1. move your work to the popFront function (you then need to call 
 popFront once before returning the range in your factory method).
 2. change empty to check if the buffer is empty instead of the input.
Doing work in popFront instead of front is usually an anti-pattern, since it forces eager evaluation of the next element even when that element is never used. You should only do this if there's no reasonable way to avoid it.
In the general scheme of things, this only makes sense for the first front access or empty access. From then on, popFront should do the work. In most cases, you can't tell a range is empty unless you've done the work to determine the next element. After you call popFront, you need to call empty before calling front, or else you may be using an invalid range. So all the work is going to be done anyway. A good example, byLine. How do you know that there is no more data until you attempt to read the next line from the stream? At which point, you have to do all the work. -Steve P.S. I see H.S. Teoh said much of the same thing, but I had already typed this out before reading it ;)
Apr 30 2020
prev sibling parent reply Simen =?UTF-8?B?S2rDpnLDpXM=?= <simen.kjaras gmail.com> writes:
On Wednesday, 29 April 2020 at 20:43:20 UTC, Casey wrote:
                 void popFront()
                 {
                 }
I mean, it might be you messed up in posting this, but having an empty popFront and expecting it to do something is a tad optimistic. Apart from that, it seems like the code should do what you want it to. What's the value of count when the code asserts? I'm afeared we'll need some code that actually compiles and shows off the issue to give any more answers. -- Simen
Apr 29 2020
parent Casey <noman email.com> writes:
On Wednesday, 29 April 2020 at 22:32:00 UTC, Simen Kjærås wrote:
 I mean, it might be you messed up in posting this, but having 
 an empty popFront and expecting it to do something is a tad 
 optimistic.
I was just trying to get the example to a very minimal state. I just added more descriptive code that replicates the bug.
 What's the value of count when the code asserts?
The count ends up being the count of the previous test + the count of the current test. In the code I uploaded, the first test will get 1, but the second test will get 2. I'm printing out the output of the range to force it to actually perform the loop, so you can see that the contents of the first file are printed twice.
Apr 30 2020