digitalmars.D - let's talk about output ranges

Adam D. Ruppe (60/66) Feb 06 2014 yea. I think it is time we hashed out output ranges the same way

H. S. Teoh (90/152) Feb 06 2014 +1. I think output ranges (as a concept, not necessarily what's

Adam D. Ruppe (66/66) Feb 06 2014 I just slapped this together in the other thread, let me
Adam D. Ruppe (66/94) Feb 06 2014 That could be ok but I agree that having to check is a pain.

Adam D. Ruppe (7/13) Feb 06 2014 similar to my finish

Dmitry Olshansky (12/20) Feb 06 2014 Just throwing a bit of related thoughts that came and go up w.r.t.

"Adam D. Ruppe" <destructionator gmail.com> writes:

splitting from the ARC thread

On Thursday, 6 February 2014 at 15:47:48 UTC, Johannes Pfau wrote:
 As they do not keep references there's no need for ARC or GC,
 we just need a way to tell every function how it should 
 allocate.

yea. I think it is time we hashed out output ranges the same way 
we've done with input ranges. Right now, output ranges just have 
put(T). I think it would be useful to add extensions similarly to 
random access ranges, infinite ranges, etc..


What I want to see work is:
char[3] buffer;
char[] got = toLowerWithRange(buffer, "FOO"); // if we allow it 
w/o slicing
assert(got.ptr is buffer.ptr);
assert(got == "foo");

char[] got = toLowerWithRange(buffer, "FOOL"); // throws, out no 
more space

char[] got = toLowerWithRange(buffer[], "FOO"); // same result as 
above
char[] got = toLowerWithRange(buffer[], "FOOL"); // MAyBE 
reallocates

(I call it toLowerWithRange just to avoid any overloading 
ambiguities when reusing the toLower name. Names aren't important 
to me though.)


What do we have here? When passed a static array, it must not 
grow beyond its length - the only allocation allowed is the 
exception it throws.

When passed a slice though, things are a bit more interesting. I 
think then it should try to put it directly in the slice, but if 
that isn't possible, go ahead and reallocate.

Though that's not necessarily sane either. Maybe slice should 
also throw if there isn't enough room.


toLower would prolly just call put(), with std.array or std.range 
defining it for static vs dynamic arrays.


With user defined types, this behavior would be tweakable, 
similarly to how input ranges might hasLength!T. Other 
similarities:

* A slice should also be a random access output range. Not all 
values are generated in order, one char at a time.

* Allocators should *offer* output ranges, but not necessarily 
*be* output ranges, similarly to the relationship between 
containers and input ranges (a view into the container). This 
could argue that passing a static buffer without slicing it is 
wrong

* We want this stuff to just work without a bajillion explicit 
wrappers, again, like input ranges.

* Output ranges would probably make the most sense to pass places 
by ref (as is the case with them most the time right now).



Eh, I gotta go, but let's start talking about this.

BTW for the growable thing, I wrote this little array thingy:
http://arsdnet.net/dcode/array2.d

it uses a static buffer up to an estimated size, then switches to 
GC. The idea being it can use infinite length, but for the 
majority of the input, uses stack space and thus avoids 
allocating at all. I'd like to see this thing be easy to use with 
phobos functions (perhaps with some additions but the same basic 
idea) as I find this pattern of guessing with a prealloced stack 
buffer very useful.


 Exceptions currently can't work on a system without GC cause we 
 always use 'throw new' and nobody ever explicitly frees 
 Exceptions.

Exceptions only come in exceptional circumstances anyway, so I 
think they are ok to be an... exception. LOL

but seriously, the exception path is rarely fully optimized in 
the first place. If you want to avoid them, use nothrow or 
preallocate them or something.

Feb 06 2014

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Feb 06, 2014 at 04:21:51PM +0000, Adam D. Ruppe wrote:
 splitting from the ARC thread
 
 On Thursday, 6 February 2014 at 15:47:48 UTC, Johannes Pfau wrote:
As they do not keep references there's no need for ARC or GC,
we just need a way to tell every function how it should allocate.

 
 yea. I think it is time we hashed out output ranges the same way
 we've done with input ranges. Right now, output ranges just have
 put(T). I think it would be useful to add extensions similarly to
 random access ranges, infinite ranges, etc..

+1. I think output ranges (as a concept, not necessarily what's
currently defined in Phobos) are far underused, and have a lot of
untapped potential.


 What I want to see work is:
 char[3] buffer;
 char[] got = toLowerWithRange(buffer, "FOO"); // if we allow it w/o
 slicing
 assert(got.ptr is buffer.ptr);
 assert(got == "foo");
 
 char[] got = toLowerWithRange(buffer, "FOOL"); // throws, out no
 more space
 
 char[] got = toLowerWithRange(buffer[], "FOO"); // same result as
 above
 char[] got = toLowerWithRange(buffer[], "FOOL"); // MAyBE
 reallocates
 
 (I call it toLowerWithRange just to avoid any overloading
 ambiguities when reusing the toLower name. Names aren't important to
 me though.)
 
 What do we have here? When passed a static array, it must not grow
 beyond its length - the only allocation allowed is the exception it
 throws.

Would it make sense to have a .full method/property analogous to input
ranges' .empty? Perhaps something like:

	struct MyOutputRange {
		void put(...);
		 property bool full() { ... }
	}

Though it may not be a good idea for the caller to constantly have to
check if something is full before writing to it. Maybe having put()
throw is the better way (then output ranges that allocate will just
allocate when full without the caller needing to worry about it).


 When passed a slice though, things are a bit more interesting. I
 think then it should try to put it directly in the slice, but if
 that isn't possible, go ahead and reallocate.
 
 Though that's not necessarily sane either. Maybe slice should also
 throw if there isn't enough room.

I think we're jumping the gun to implementation details here. What about
other logical concepts that output ranges should support? One thing that
the current output range API doesn't do very well is chaining. For
example, if you don't care about allocation strategy and just use the
GC, you can write:

	auto result = "mystring".toUpper.translate("abc", "def");

But once you have to pass an explicit output range to it, you can't
chain things anymore:

	auto result1 = ArcOutputRange!string();
	"mystring".toUpper(result1);	// need to pass in result explicitly

	auto result2 = ArcOutputRange!string();
	result.data.translate("abc", "def", result2);

This is a big usability hindrance. Ideally we'd want to write something
like:

	auto result = "mystring".toUpper(ArcOutputRange!string())
				.translate("abc", "def");

But I'm not sure how this can be made to work.


 toLower would prolly just call put(), with std.array or std.range
 defining it for static vs dynamic arrays.

Yes.


 With user defined types, this behavior would be tweakable, similarly
 to how input ranges might hasLength!T. Other similarities:
 
 * A slice should also be a random access output range. Not all
 values are generated in order, one char at a time.

So we should extend put() to take an index, then?

	struct MyOutputRange {
		void put(T item);	// put item at current index
		void put(T item, size_t idx); // put item at specified index
	}


 * Allocators should *offer* output ranges, but not necessarily *be*
 output ranges, similarly to the relationship between containers and
 input ranges (a view into the container). This could argue that
 passing a static buffer without slicing it is wrong

An allocator is definitely not an output range! You don't put data into
an allocator, you get a memory buffer *from* an allocator that you can
put data into. Big difference. We shouldn't conflate the two.

That's why I said in another post, that algorithms that produce data
into a data sink should not care what an allocator is; they should take
an output range. Logically, it makes more sense: what you want is
something to put your result data into, you don't care how it's
allocated, or even whether it's a chunk of memory (it may be a socket or
pipe or whatever).  Doing it this way also eliminates some needless
allocations -- why do you need to allocate a string if all you want is
to take input data, transform it to lowercase, and write to stdout? Let
stdout do the buffering, and let toLower send the data to stdout
directly. Calling an allocator from toLower essentially amounts to
buffering the data twice.


 * We want this stuff to just work without a bajillion explicit
 wrappers, again, like input ranges.

Yes, which is why I brought up the issue of how to chain function calls
that takes output ranges. That's a big usability issue that we need to
address if output ranges are going to be as cool as input ranges were.


 * Output ranges would probably make the most sense to pass places by
 ref (as is the case with them most the time right now).

They should probably be *always* passed by ref, otherwise you could end
up with some pathological behaviour of data from multiple sources
overwriting each other because they were operating on copies of output
ranges instead of references to a single one.

Also, delegates and function pointers should be treated as output ranges
as well (Phobos should define .put and whatever other needed methods for
them via UFCS).


[...]
Exceptions currently can't work on a system without GC cause we
always use 'throw new' and nobody ever explicitly frees Exceptions.

 
 Exceptions only come in exceptional circumstances anyway, so I think
 they are ok to be an... exception. LOL
 
 but seriously, the exception path is rarely fully optimized in the
 first place. If you want to avoid them, use nothrow or preallocate
 them or something.

Yeah, you could do something like this:

	void myFailureProneAlgorithm(T...)(T args) {
		auto prealloc_exception = ARCAllocator!Exception();
		auto stacktraceBuffer = /* preallocate stacktrace here */;

		auto result = doBigComplicatedCalculation();
		if (result.hasError) {
			// Fill in exception info without allocation
			prealloc_exception.msg = result.err_msg;
			createStackTrace(stacktraceBuffer);
			prealloc_exception.stacktrace = stacktraceBuffer;

			// N.B.: no 'new'.
			throw prealloc_exception;
		}
		return result;
	}

	void main() {
		try {
			auto input = ...;
			auto output = myFailureProneAlgorithm(input);
		} catch(Exception e) {
			... // you get a statically allocated exception here
		}
	}

Doesn't solve the case where you call some library function that throws,
though. :-(


T

-- 
One disk to rule them all, One disk to find them. One disk to bring them all
and in the darkness grind them. In the Land of Redmond where the shadows lie.
-- The Silicon Valley Tarot

Feb 06 2014

"Adam D. Ruppe" <destructionator gmail.com> writes:

I just slapped this together in the other thread, let me 
copy/paste it here, talk about it a bit, then I'll finish reading 
what you wrote:

struct GCSink(T) {
     // so this is a reference type
     private struct Impl {
         T[] data;
         void put(T t) { data ~= t; }
         T[] finish() { return data; }
     }
     Impl* impl;
     alias impl this;
     void start() {
         impl = new Impl;
     }
}

struct StaticSink(T) {
     T[] container;
     this(T[] c) { container = c; }
     size_t size;
     void start() { size = 0; }
     void put(T t) { container[size++] = t; }
     T[] finish() { return container[0 .. size]; }
}
StaticSink!T staticSink(T)(T[] t) {
     return StaticSink!T(t);
}

T[] toUpper(T, OR = GCSink!T)(in T[] data, OR output = OR()) {
     output.start();
     foreach(d; data)
        output.put(d & ~0x20);
     return output.finish();
}

void main() {
     import std.stdio;
     writeln(toUpper("cool")); // default: GC
     char[10] buffer;
     auto received = toUpper("cool", staticSink(buffer[])); // 
custom static sink
     assert(buffer.ptr is received.ptr);
     assert(received == "COOL");
}

====

In addition to put(), I also added start() and finish(). There's 
precedent for this in Phobos already: the std.digest output 
ranges have methods like this. Like std.digest, put builds it up, 
then finish returns the final product.

These wouldn't be required functions on all ranges. Finish might 
even return null or void, if it was a sink into a write function 
or something. It could also close a file or something like that.

But, if it does build into an array, finish ought to be defined 
to return an analogous input range (or slice) into the data it 
just built for easier chaining and basic simple usage.

(Appender in phobos has a kinda similar thing called data, but I 
think this breaks things since it might be called at any time. 
Finish, on the other hand, could be defined that any puts after 
it are invalid.

start is used to restart things. Calling start at any time should 
reset the range to reuse the buffer.


I used the Impl* on the GC one so the default worked. Ref with a 
default argument would fail because it isn't an lvalue, so that 
would kill the simplicity.


I called staticSink on the buffer to build the range... which my 
first post said I wanted to avoid but I kind like it this way 
better. It isn't a hassle to call and keeps a nice separation 
between the view and the container.

Feb 06 2014

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Thursday, 6 February 2014 at 18:10:53 UTC, H. S. Teoh wrote:
 Would it make sense to have a .full method/property analogous 
 to input ranges' .empty? Perhaps something like:

That could be ok but I agree that having to check is a pain. 
Besides, what are you going to do if it is full? Suppose I call

char[1] b;
toUpper("lol", staticSink(b[]));


The best toUpper could do is truncate or throw.... and I think 
that would be better encoded in the range itself so the caller 
can decide.

auto got = toUpper("lol", staticSinkTruncating(b[]));
if(got.length < "lol".length) {
    // we could perhaps call it again to process the rest
}

(put in the truncating one would just check its own length and 
noop if it is full)

Otherwise, throwing range violations/out of memory exceptions are 
what would most likely happen anyway.

 One thing that
 the current output range API doesn't do very well is chaining.

Indeed. In my other post, I just wrote about finish. Finish 
serves to flush the buffer (digests or compression algorithms for 
example might need to be padded to block size), could finalize 
things (suppose an appender which just puts the pieces into a 
static array, then calls join all at once at the end), and could 
also just generally return the result.

There are some cases where returning from finish doesn't make 
sense, such as if you sunk to a file, you wouldn't keep an array 
of the contents around... but finish is still potentially useful 
in that it could close the file or release a lock. (Of course, 
dtors could do that too. But destructors can never return data to 
the user - that's where finish is special.)

Anyway, not all output ranges would offer finish and not all 
would return T[]. But not all input ranges offer opSlice either 
so we're still in analogous territory.

 This is a big usability hindrance. Ideally we'd want to write 
 something
 like:

 	auto result = "mystring".toUpper(ArcOutputRange!string())
 				.translate("abc", "def");

 But I'm not sure how this can be made to work.

hmmm.... finish doesn't account for all that.... well, I guess it 
could by returning a range.

tbh toUpper might be better as a higher-order input range. Like 
alias toUpper = map!charToUpper(...). Those chain, they don't 
allocate, and they are well-defined right now.

Then at the end we build the result lazily and just put it all at 
once into the output range.

"mystring".toUpper.translate("abc","def").array(ArcOutputRange!string());



Yeah, I actually think that's the way to go. And calling .array 
at the end is nothing new to Phobos anyway. I'd be a bit weird 
doing it with toUpper but I think it really is the best fit.

(BTW I would be PISSED of toUpper actually changed like this. 
It'd break a bunch of code and I don't really care that toUpper 
allocates. I want it to just work. But we could offer equivalent 
functionality via per-character functions and map so we don't 
have to break code to offer the new options.)

 So we should extend put() to take an index, then?

that would work.

 An allocator is definitely not an output range!

yup, and I don't think a static array is either. A static array 
is neither an input range, since you can't do a = a[1..$]. But 
offering easy getters for such is easy and it rox.


 into a data sink should not care what an allocator is; they 
 should take an output range.

Actually, I think they should generate lazy input ranges whenever 
possible. Then only at the end do we send it to the output range. 
It's just input ranges aren't allowed to allocate, that would 
kill their complexity guarantee, so we need an example of a 
function which *must* allocate up front.

They want the random access output range. Otherwise we can just 
put at the end.

 Let
 stdout do the buffering, and let toLower send the data to stdout
 directly. Calling an allocator from toLower essentially amounts 
 to buffering the data twice.

yes

 They should probably be *always* passed by ref, otherwise you 
 could end up with some pathological behaviour of data from 
 multiple sources overwriting each other because they were 
 operating on copies of output ranges instead of references to a 
 single one.

That won't necessarily work though, you can't have a ref default 
parameter. But we can use pimpl or something to force a regular 
struct to be a ref item. Lazy initialization can be surprising, 
but we deal with that already with array slices so I think it is 
ok.

 Also, delegates and function pointers should be treated as 
 output ranges as well (Phobos should define .put and whatever
 other needed methods for them via UFCS).

Yes, indeed.


 Doesn't solve the case where you call some library function 
 that throws, though. :-(

at least there's nothrow if it is really that important to us.

Feb 06 2014

"Adam D. Ruppe" <destructionator gmail.com> writes:

In the other thread:

Andrei said:
.flush() or .done() to mark the end of several writes

similar to my finish

 .clear() to clear the range (useful if e.g. it's implemented as 
 a
 slice with appending)

similar to my start, but start is allowed to do required 
initializtions too whereas clear doesn't sound as required.


Dmitry said:
 .reserve(n) to notify underlying sink that it n items are
 coming (it should preallocate etc.)

yes, this may also be a noop on some objects

Feb 06 2014

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

06-Feb-2014 20:21, Adam D. Ruppe пишет:
 splitting from the ARC thread

 On Thursday, 6 February 2014 at 15:47:48 UTC, Johannes Pfau wrote:
 As they do not keep references there's no need for ARC or GC,
 we just need a way to tell every function how it should allocate.

 yea. I think it is time we hashed out output ranges the same way we've
 done with input ranges. Right now, output ranges just have put(T). I
 think it would be useful to add extensions similarly to random access
 ranges, infinite ranges, etc..

Just throwing a bit of related thoughts that came and go up w.r.t. 
output ranges.

We really should add multiplexer (one sink writes to many) and 
demultiplexer adapters (by applying n-ary predicate and putting the 
result into a single sink).

And a 3rd filter primitive, as a wrapper of a sink that applies some 
predicate before forwarding it down the line.

Together with other suggestions would make output ranges *much* more 
powerful/useful.

-- 
Dmitry Olshansky

Feb 06 2014

D Programming

C/C++ Programming

Other

digitalmars.D - let's talk about output ranges