www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - let's talk about output ranges

reply "Adam D. Ruppe" <destructionator gmail.com> writes:
splitting from the ARC thread

On Thursday, 6 February 2014 at 15:47:48 UTC, Johannes Pfau wrote:
 As they do not keep references there's no need for ARC or GC,
 we just need a way to tell every function how it should 
 allocate.
yea. I think it is time we hashed out output ranges the same way we've done with input ranges. Right now, output ranges just have put(T). I think it would be useful to add extensions similarly to random access ranges, infinite ranges, etc.. What I want to see work is: char[3] buffer; char[] got = toLowerWithRange(buffer, "FOO"); // if we allow it w/o slicing assert(got.ptr is buffer.ptr); assert(got == "foo"); char[] got = toLowerWithRange(buffer, "FOOL"); // throws, out no more space char[] got = toLowerWithRange(buffer[], "FOO"); // same result as above char[] got = toLowerWithRange(buffer[], "FOOL"); // MAyBE reallocates (I call it toLowerWithRange just to avoid any overloading ambiguities when reusing the toLower name. Names aren't important to me though.) What do we have here? When passed a static array, it must not grow beyond its length - the only allocation allowed is the exception it throws. When passed a slice though, things are a bit more interesting. I think then it should try to put it directly in the slice, but if that isn't possible, go ahead and reallocate. Though that's not necessarily sane either. Maybe slice should also throw if there isn't enough room. toLower would prolly just call put(), with std.array or std.range defining it for static vs dynamic arrays. With user defined types, this behavior would be tweakable, similarly to how input ranges might hasLength!T. Other similarities: * A slice should also be a random access output range. Not all values are generated in order, one char at a time. * Allocators should *offer* output ranges, but not necessarily *be* output ranges, similarly to the relationship between containers and input ranges (a view into the container). This could argue that passing a static buffer without slicing it is wrong * We want this stuff to just work without a bajillion explicit wrappers, again, like input ranges. * Output ranges would probably make the most sense to pass places by ref (as is the case with them most the time right now). Eh, I gotta go, but let's start talking about this. BTW for the growable thing, I wrote this little array thingy: http://arsdnet.net/dcode/array2.d it uses a static buffer up to an estimated size, then switches to GC. The idea being it can use infinite length, but for the majority of the input, uses stack space and thus avoids allocating at all. I'd like to see this thing be easy to use with phobos functions (perhaps with some additions but the same basic idea) as I find this pattern of guessing with a prealloced stack buffer very useful.
 Exceptions currently can't work on a system without GC cause we 
 always use 'throw new' and nobody ever explicitly frees 
 Exceptions.
Exceptions only come in exceptional circumstances anyway, so I think they are ok to be an... exception. LOL but seriously, the exception path is rarely fully optimized in the first place. If you want to avoid them, use nothrow or preallocate them or something.
Feb 06 2014
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 06, 2014 at 04:21:51PM +0000, Adam D. Ruppe wrote:
 splitting from the ARC thread
 
 On Thursday, 6 February 2014 at 15:47:48 UTC, Johannes Pfau wrote:
As they do not keep references there's no need for ARC or GC,
we just need a way to tell every function how it should allocate.
yea. I think it is time we hashed out output ranges the same way we've done with input ranges. Right now, output ranges just have put(T). I think it would be useful to add extensions similarly to random access ranges, infinite ranges, etc..
+1. I think output ranges (as a concept, not necessarily what's currently defined in Phobos) are far underused, and have a lot of untapped potential.
 What I want to see work is:
 char[3] buffer;
 char[] got = toLowerWithRange(buffer, "FOO"); // if we allow it w/o
 slicing
 assert(got.ptr is buffer.ptr);
 assert(got == "foo");
 
 char[] got = toLowerWithRange(buffer, "FOOL"); // throws, out no
 more space
 
 char[] got = toLowerWithRange(buffer[], "FOO"); // same result as
 above
 char[] got = toLowerWithRange(buffer[], "FOOL"); // MAyBE
 reallocates
 
 (I call it toLowerWithRange just to avoid any overloading
 ambiguities when reusing the toLower name. Names aren't important to
 me though.)
 
 What do we have here? When passed a static array, it must not grow
 beyond its length - the only allocation allowed is the exception it
 throws.
Would it make sense to have a .full method/property analogous to input ranges' .empty? Perhaps something like: struct MyOutputRange { void put(...); property bool full() { ... } } Though it may not be a good idea for the caller to constantly have to check if something is full before writing to it. Maybe having put() throw is the better way (then output ranges that allocate will just allocate when full without the caller needing to worry about it).
 When passed a slice though, things are a bit more interesting. I
 think then it should try to put it directly in the slice, but if
 that isn't possible, go ahead and reallocate.
 
 Though that's not necessarily sane either. Maybe slice should also
 throw if there isn't enough room.
I think we're jumping the gun to implementation details here. What about other logical concepts that output ranges should support? One thing that the current output range API doesn't do very well is chaining. For example, if you don't care about allocation strategy and just use the GC, you can write: auto result = "mystring".toUpper.translate("abc", "def"); But once you have to pass an explicit output range to it, you can't chain things anymore: auto result1 = ArcOutputRange!string(); "mystring".toUpper(result1); // need to pass in result explicitly auto result2 = ArcOutputRange!string(); result.data.translate("abc", "def", result2); This is a big usability hindrance. Ideally we'd want to write something like: auto result = "mystring".toUpper(ArcOutputRange!string()) .translate("abc", "def"); But I'm not sure how this can be made to work.
 toLower would prolly just call put(), with std.array or std.range
 defining it for static vs dynamic arrays.
Yes.
 With user defined types, this behavior would be tweakable, similarly
 to how input ranges might hasLength!T. Other similarities:
 
 * A slice should also be a random access output range. Not all
 values are generated in order, one char at a time.
So we should extend put() to take an index, then? struct MyOutputRange { void put(T item); // put item at current index void put(T item, size_t idx); // put item at specified index }
 * Allocators should *offer* output ranges, but not necessarily *be*
 output ranges, similarly to the relationship between containers and
 input ranges (a view into the container). This could argue that
 passing a static buffer without slicing it is wrong
An allocator is definitely not an output range! You don't put data into an allocator, you get a memory buffer *from* an allocator that you can put data into. Big difference. We shouldn't conflate the two. That's why I said in another post, that algorithms that produce data into a data sink should not care what an allocator is; they should take an output range. Logically, it makes more sense: what you want is something to put your result data into, you don't care how it's allocated, or even whether it's a chunk of memory (it may be a socket or pipe or whatever). Doing it this way also eliminates some needless allocations -- why do you need to allocate a string if all you want is to take input data, transform it to lowercase, and write to stdout? Let stdout do the buffering, and let toLower send the data to stdout directly. Calling an allocator from toLower essentially amounts to buffering the data twice.
 * We want this stuff to just work without a bajillion explicit
 wrappers, again, like input ranges.
Yes, which is why I brought up the issue of how to chain function calls that takes output ranges. That's a big usability issue that we need to address if output ranges are going to be as cool as input ranges were.
 * Output ranges would probably make the most sense to pass places by
 ref (as is the case with them most the time right now).
They should probably be *always* passed by ref, otherwise you could end up with some pathological behaviour of data from multiple sources overwriting each other because they were operating on copies of output ranges instead of references to a single one. Also, delegates and function pointers should be treated as output ranges as well (Phobos should define .put and whatever other needed methods for them via UFCS). [...]
Exceptions currently can't work on a system without GC cause we
always use 'throw new' and nobody ever explicitly frees Exceptions.
Exceptions only come in exceptional circumstances anyway, so I think they are ok to be an... exception. LOL but seriously, the exception path is rarely fully optimized in the first place. If you want to avoid them, use nothrow or preallocate them or something.
Yeah, you could do something like this: void myFailureProneAlgorithm(T...)(T args) { auto prealloc_exception = ARCAllocator!Exception(); auto stacktraceBuffer = /* preallocate stacktrace here */; auto result = doBigComplicatedCalculation(); if (result.hasError) { // Fill in exception info without allocation prealloc_exception.msg = result.err_msg; createStackTrace(stacktraceBuffer); prealloc_exception.stacktrace = stacktraceBuffer; // N.B.: no 'new'. throw prealloc_exception; } return result; } void main() { try { auto input = ...; auto output = myFailureProneAlgorithm(input); } catch(Exception e) { ... // you get a statically allocated exception here } } Doesn't solve the case where you call some library function that throws, though. :-( T -- One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie. -- The Silicon Valley Tarot
Feb 06 2014
next sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
I just slapped this together in the other thread, let me 
copy/paste it here, talk about it a bit, then I'll finish reading 
what you wrote:

struct GCSink(T) {
     // so this is a reference type
     private struct Impl {
         T[] data;
         void put(T t) { data ~= t; }
         T[] finish() { return data; }
     }
     Impl* impl;
     alias impl this;
     void start() {
         impl = new Impl;
     }
}

struct StaticSink(T) {
     T[] container;
     this(T[] c) { container = c; }
     size_t size;
     void start() { size = 0; }
     void put(T t) { container[size++] = t; }
     T[] finish() { return container[0 .. size]; }
}
StaticSink!T staticSink(T)(T[] t) {
     return StaticSink!T(t);
}

T[] toUpper(T, OR = GCSink!T)(in T[] data, OR output = OR()) {
     output.start();
     foreach(d; data)
        output.put(d & ~0x20);
     return output.finish();
}

void main() {
     import std.stdio;
     writeln(toUpper("cool")); // default: GC
     char[10] buffer;
     auto received = toUpper("cool", staticSink(buffer[])); // 
custom static sink
     assert(buffer.ptr is received.ptr);
     assert(received == "COOL");
}

====

In addition to put(), I also added start() and finish(). There's 
precedent for this in Phobos already: the std.digest output 
ranges have methods like this. Like std.digest, put builds it up, 
then finish returns the final product.

These wouldn't be required functions on all ranges. Finish might 
even return null or void, if it was a sink into a write function 
or something. It could also close a file or something like that.

But, if it does build into an array, finish ought to be defined 
to return an analogous input range (or slice) into the data it 
just built for easier chaining and basic simple usage.

(Appender in phobos has a kinda similar thing called data, but I 
think this breaks things since it might be called at any time. 
Finish, on the other hand, could be defined that any puts after 
it are invalid.

start is used to restart things. Calling start at any time should 
reset the range to reuse the buffer.


I used the Impl* on the GC one so the default worked. Ref with a 
default argument would fail because it isn't an lvalue, so that 
would kill the simplicity.


I called staticSink on the buffer to build the range... which my 
first post said I wanted to avoid but I kind like it this way 
better. It isn't a hassle to call and keeps a nice separation 
between the view and the container.
Feb 06 2014
prev sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 6 February 2014 at 18:10:53 UTC, H. S. Teoh wrote:
 Would it make sense to have a .full method/property analogous 
 to input ranges' .empty? Perhaps something like:
That could be ok but I agree that having to check is a pain. Besides, what are you going to do if it is full? Suppose I call char[1] b; toUpper("lol", staticSink(b[])); The best toUpper could do is truncate or throw.... and I think that would be better encoded in the range itself so the caller can decide. auto got = toUpper("lol", staticSinkTruncating(b[])); if(got.length < "lol".length) { // we could perhaps call it again to process the rest } (put in the truncating one would just check its own length and noop if it is full) Otherwise, throwing range violations/out of memory exceptions are what would most likely happen anyway.
 One thing that
 the current output range API doesn't do very well is chaining.
Indeed. In my other post, I just wrote about finish. Finish serves to flush the buffer (digests or compression algorithms for example might need to be padded to block size), could finalize things (suppose an appender which just puts the pieces into a static array, then calls join all at once at the end), and could also just generally return the result. There are some cases where returning from finish doesn't make sense, such as if you sunk to a file, you wouldn't keep an array of the contents around... but finish is still potentially useful in that it could close the file or release a lock. (Of course, dtors could do that too. But destructors can never return data to the user - that's where finish is special.) Anyway, not all output ranges would offer finish and not all would return T[]. But not all input ranges offer opSlice either so we're still in analogous territory.
 This is a big usability hindrance. Ideally we'd want to write 
 something
 like:

 	auto result = "mystring".toUpper(ArcOutputRange!string())
 				.translate("abc", "def");

 But I'm not sure how this can be made to work.
hmmm.... finish doesn't account for all that.... well, I guess it could by returning a range. tbh toUpper might be better as a higher-order input range. Like alias toUpper = map!charToUpper(...). Those chain, they don't allocate, and they are well-defined right now. Then at the end we build the result lazily and just put it all at once into the output range. "mystring".toUpper.translate("abc","def").array(ArcOutputRange!string()); Yeah, I actually think that's the way to go. And calling .array at the end is nothing new to Phobos anyway. I'd be a bit weird doing it with toUpper but I think it really is the best fit. (BTW I would be PISSED of toUpper actually changed like this. It'd break a bunch of code and I don't really care that toUpper allocates. I want it to just work. But we could offer equivalent functionality via per-character functions and map so we don't have to break code to offer the new options.)
 So we should extend put() to take an index, then?
that would work.
 An allocator is definitely not an output range!
yup, and I don't think a static array is either. A static array is neither an input range, since you can't do a = a[1..$]. But offering easy getters for such is easy and it rox.
 into a data sink should not care what an allocator is; they 
 should take an output range.
Actually, I think they should generate lazy input ranges whenever possible. Then only at the end do we send it to the output range. It's just input ranges aren't allowed to allocate, that would kill their complexity guarantee, so we need an example of a function which *must* allocate up front. They want the random access output range. Otherwise we can just put at the end.
 Let
 stdout do the buffering, and let toLower send the data to stdout
 directly. Calling an allocator from toLower essentially amounts 
 to buffering the data twice.
yes
 They should probably be *always* passed by ref, otherwise you 
 could end up with some pathological behaviour of data from 
 multiple sources overwriting each other because they were 
 operating on copies of output ranges instead of references to a 
 single one.
That won't necessarily work though, you can't have a ref default parameter. But we can use pimpl or something to force a regular struct to be a ref item. Lazy initialization can be surprising, but we deal with that already with array slices so I think it is ok.
 Also, delegates and function pointers should be treated as 
 output ranges as well (Phobos should define .put and whatever
 other needed methods for them via UFCS).
Yes, indeed.
 Doesn't solve the case where you call some library function 
 that throws, though. :-(
at least there's nothrow if it is really that important to us.
Feb 06 2014
parent "Adam D. Ruppe" <destructionator gmail.com> writes:
In the other thread:

Andrei said:
.flush() or .done() to mark the end of several writes
similar to my finish
 .clear() to clear the range (useful if e.g. it's implemented as 
 a
 slice with appending)
similar to my start, but start is allowed to do required initializtions too whereas clear doesn't sound as required. Dmitry said:
 .reserve(n) to notify underlying sink that it n items are
 coming (it should preallocate etc.)
yes, this may also be a noop on some objects
Feb 06 2014
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
06-Feb-2014 20:21, Adam D. Ruppe пишет:
 splitting from the ARC thread

 On Thursday, 6 February 2014 at 15:47:48 UTC, Johannes Pfau wrote:
 As they do not keep references there's no need for ARC or GC,
 we just need a way to tell every function how it should allocate.
yea. I think it is time we hashed out output ranges the same way we've done with input ranges. Right now, output ranges just have put(T). I think it would be useful to add extensions similarly to random access ranges, infinite ranges, etc..
Just throwing a bit of related thoughts that came and go up w.r.t. output ranges. We really should add multiplexer (one sink writes to many) and demultiplexer adapters (by applying n-ary predicate and putting the result into a single sink). And a 3rd filter primitive, as a wrapper of a sink that applies some predicate before forwarding it down the line. Together with other suggestions would make output ranges *much* more powerful/useful. -- Dmitry Olshansky
Feb 06 2014