digitalmars.D - Ranges of char and wchar
- Andrei Alexandrescu (23/23) May 08 2014 A discussion is building around
- H. S. Teoh via Digitalmars-d (51/73) May 08 2014 I've thought about input ranges vs. output ranges for a bit. I think it
- Walter Bright (4/16) May 08 2014 I think you nailed it.
- Nick Sabalausky (21/24) May 08 2014 I think that actually raises an issue worth considering. I posit that
- =?UTF-8?B?Ikx1w61z?= Marques" (8/23) May 08 2014 I agree with H. S. Teoh. Indeed, I was thinking of trying to
- Andrei Alexandrescu (12/28) May 08 2014 Interesting. So then the range returned by format() will save everything...
- H. S. Teoh via Digitalmars-d (9/39) May 08 2014 [...]
- H. S. Teoh via Digitalmars-d (52/79) May 08 2014 [...]
- =?UTF-8?B?Ikx1w61z?= Marques" (13/21) May 08 2014 You might think that's bad, but there's nothing in that example
- Kagamin (4/7) May 09 2014 Another issue is iteration might be not repeatable, especially if
- Peter Alexander (10/21) May 10 2014 void fun(int[] a)
- Walter Bright (7/10) May 08 2014 The setExtension() function is itself not very important, but what is im...
- H. S. Teoh via Digitalmars-d (25/38) May 08 2014 I think having setExtension() return an input range is the best
- Andrei Alexandrescu (6/18) May 08 2014 4. Rule of least surprise
- Walter Bright (8/11) May 08 2014 It's true that when I first encountered C#'s LINQ, I was surprised that ...
- Jacob Carlborg (6/14) May 09 2014 Don't know if it helps, but we could add a UDA indicating which
- Nick Sabalausky (21/23) May 08 2014 Isn't eagerness what we have array() for?
- Walter Bright (14/16) May 08 2014 The lazy vs eager issue seems at first blush to be an esoteric, potayto-...
- Nick Sabalausky (5/6) May 08 2014 Or yield()-ing inside the eager one's sink.
- Walter Bright (2/7) May 08 2014 The data is still supplied to it.
- Nick Sabalausky (3/11) May 08 2014 I think I've lost track of what you mean here. What data is still
- Walter Bright (4/16) May 08 2014 When creating an engine vs calculating the result, the former does not r...
- John Colvin (18/46) May 08 2014 A little thing I've been thinking about on a related note:
- Andrei Alexandrescu (25/25) May 08 2014 On 5/8/14, 10:46 AM, Andrei Alexandrescu wrote:
- Andrei Alexandrescu (5/9) May 08 2014 The larger point here (I forgot) is that user code that's written in the...
- Walter Bright (14/14) May 08 2014 Classification of functions in http://dlang.org/phobos/std_path.html:
- Jesse Phillips (11/23) May 13 2014 I'm with H. S. Teoh that we should prefer the input range
A discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation. First instance of the pull request copied the inputs into an output range. The second instance (right now) creates an input range that lazily creates the result. The element type of that range is the encoding type of the first argument (i.e. char or wchar most of the time). This is different from string/wstring/etc element-wise iteration because it'll be done code unit-wise, not code point-wise. We need a robust idiom for doing such string manipulation without allocation, for which setExtension is just an example. Going the output range route has nice things going for it because the output range decides the encoding in advance and then accepts via put() calls any encoding, with only the minimum transcoding needed. However output range means the string operation will be done eagerly, whereas lazy has advantages (nice piping, saving on work etc). On the other hand, there's the risk of becoming "more catholic than the Pope" by insisting on lazy string processing. Most string operations are eager, and insisting on a general framework for lazy encoded operations on strings may be an exaggeration. D{iscuss,estroy}! Andrei
May 08 2014
On Thu, May 08, 2014 at 10:46:12AM -0700, Andrei Alexandrescu via Digitalmars-d wrote:A discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation. First instance of the pull request copied the inputs into an output range. The second instance (right now) creates an input range that lazily creates the result.I've thought about input ranges vs. output ranges for a bit. I think it doesn't make sense for functions that process data to take an output range: output ranges are data sinks, and should only be used for the endpoint of a data processing pipeline. Since the string function doesn't know whether or not it's the last in a pipeline (only the calling code can know this), it should return an input range. If the user code wants to put the result into an output range, then it should simply use std.algorithm.copy. This way, you maximize the usability of the function -- it can participate in UFCS chains, compose with other std.algorithm functions, etc.. [...]We need a robust idiom for doing such string manipulation without allocation, for which setExtension is just an example. Going the output range route has nice things going for it because the output range decides the encoding in advance and then accepts via put() calls any encoding, with only the minimum transcoding needed.The problem with this approach is that it hampers usage in UFCS pipelines.However output range means the string operation will be done eagerly, whereas lazy has advantages (nice piping, saving on work etc). On the other hand, there's the risk of becoming "more catholic than the Pope" by insisting on lazy string processing. Most string operations are eager, and insisting on a general framework for lazy encoded operations on strings may be an exaggeration.[...] In terms of usability, my opinion is that it makes most sense to return an input range. Let the user decide when the result should be copied into an output range (via std.algorithm.copy). Compare the following for constructing a path from a directory name, a filename, and an extension: Case 1: setExtension takes an output range: // Look how ugly this is: string dirname = ...; string filename = ...; // Need temp buffer to store result char[128] result; char[] outputRange = result[]; dirname.copy(outputRange); setExtension(filename, ".ext", outputRange); writeln(result); Case 2: setExtension takes an input range: // Look how clean this is: string dirname = ...; string filename = ...; writeln(chain(dirname, setExtension(filename, ".ext"))); In case 1, the user has to manually create various intermediate buffers to store intermediate results. I used a trivial example here, but in application code, the processing you need is usually far more complex. This means creating lots of intermediate buffers, making sure you link the right ones together, etc.. The code becomes very verbose, and becomes a maintenance nightmare (which of the tmp1, tmp2, tmp3 buffers refer to which fragment of the result again? Oh oops, I think I passed the wrong output range to setExtension). In case 2, the user decides when a buffer is needed and when it's not. The function calls chain very nicely. The code is more readable, and easy to maintain (and needless allocations -- including temporary static buffers -- are eliminated). T -- Nearly all men can stand adversity, but if you want to test a man's character, give him power. -- Abraham Lincoln
May 08 2014
On 5/8/2014 11:19 AM, H. S. Teoh via Digitalmars-d wrote:In case 1, the user has to manually create various intermediate buffers to store intermediate results. I used a trivial example here, but in application code, the processing you need is usually far more complex. This means creating lots of intermediate buffers, making sure you link the right ones together, etc.. The code becomes very verbose, and becomes a maintenance nightmare (which of the tmp1, tmp2, tmp3 buffers refer to which fragment of the result again? Oh oops, I think I passed the wrong output range to setExtension). In case 2, the user decides when a buffer is needed and when it's not. The function calls chain very nicely. The code is more readable, and easy to maintain (and needless allocations -- including temporary static buffers -- are eliminated).I think you nailed it. Being able to eliminate temporary buffers is a big win. The fastest way to manage allocated memory is to not need allocated memory.
May 08 2014
On 5/8/2014 2:19 PM, H. S. Teoh via Digitalmars-d wrote:This way, you maximize the usability of the function -- it can participate in UFCS chains, compose with other std.algorithm functions, etc..I think that actually raises an issue worth considering. I posit that the results of an InputRange setExtention don't make any sense with most std.algorithm functions: What's an InputRange? It's a set of elements. So what are the "elements" of a setExtention? Uhh... The result of setExtention really just has exactly one "logical" element: the resulting filesystem path. But that's not necessarily the *actual* result. The actual result is the "elements" of "however the hell setExtention's internal algorithm happened to feel like plopping them out". What are you going to zip() them, map() them, sort() them? All completely meaningless. The "elements" of setExtention's InputRange are NOT defined. At least not in any well-encapsulated way. And really, they SHOULDN'T be defined. So we have setExtention, and anything that follows it's model, returning what's more or less an undefined result that's basically useless for pretty much anything besides plopping straight into array() or copy(). If the big benefit of going InputRange instead of OutputRange for such functions is just simply call chaining, then maybe we're solving the wrong problem and should instead be looking at ways to improve the usage of OutputRange-based functions.
May 08 2014
On Thursday, 8 May 2014 at 18:21:32 UTC, H. S. Teoh via Digitalmars-d wrote:I've thought about input ranges vs. output ranges for a bit. I think it doesn't make sense for functions that process data to take an output range: output ranges are data sinks, and should only be used for the endpoint of a data processing pipeline. Since the string function doesn't know whether or not it's the last in a pipeline (only the calling code can know this), it should return an input range. If the user code wants to put the result into an output range, then it should simply use std.algorithm.copy.I agree with H. S. Teoh. Indeed, I was thinking of trying to create an alternative version of std.format which returned an InputRange, instead of taking an OutputRange. The benefit of this becomes obvious when you want to use std.format() in your own ranges, which currently impairs laziness, pipelining, avoiding memory allocations, etc.
May 08 2014
On 5/8/14, 1:41 PM, "LuÃs Marques" <luis luismarques.eu>" wrote:On Thursday, 8 May 2014 at 18:21:32 UTC, H. S. Teoh via Digitalmars-d wrote:Interesting. So then the range returned by format() will save everything passed to it, which means... int fun(int[] a) { auto before = format("Before: %s", a); foreach (ref e; a) ++e; auto after = format("After: %s", a); writeln(before, "\n--->\n", after); } *ouch* AndreiI've thought about input ranges vs. output ranges for a bit. I think it doesn't make sense for functions that process data to take an output range: output ranges are data sinks, and should only be used for the endpoint of a data processing pipeline. Since the string function doesn't know whether or not it's the last in a pipeline (only the calling code can know this), it should return an input range. If the user code wants to put the result into an output range, then it should simply use std.algorithm.copy.I agree with H. S. Teoh. Indeed, I was thinking of trying to create an alternative version of std.format which returned an InputRange, instead of taking an OutputRange. The benefit of this becomes obvious when you want to use std.format() in your own ranges, which currently impairs laziness, pipelining, avoiding memory allocations, etc.
May 08 2014
On Thu, May 08, 2014 at 02:38:18PM -0700, Andrei Alexandrescu via Digitalmars-d wrote:On 5/8/14, 1:41 PM, "Luís Marques" <luis luismarques.eu>" wrote:[...] Yeah, I'm not so sure it's a good idea for std.format to be lazy. But there are valid use cases for both, so now I'm wondering if we should adopt Walter's idea of using a naming convention to distinguish between algorithms (lazy) and functions (eager). T -- WINDOWS = Will Install Needless Data On Whole System -- CompuManOn Thursday, 8 May 2014 at 18:21:32 UTC, H. S. Teoh via Digitalmars-d wrote:Interesting. So then the range returned by format() will save everything passed to it, which means... int fun(int[] a) { auto before = format("Before: %s", a); foreach (ref e; a) ++e; auto after = format("After: %s", a); writeln(before, "\n--->\n", after); } *ouch*I've thought about input ranges vs. output ranges for a bit. I think it doesn't make sense for functions that process data to take an output range: output ranges are data sinks, and should only be used for the endpoint of a data processing pipeline. Since the string function doesn't know whether or not it's the last in a pipeline (only the calling code can know this), it should return an input range. If the user code wants to put the result into an output range, then it should simply use std.algorithm.copy.I agree with H. S. Teoh. Indeed, I was thinking of trying to create an alternative version of std.format which returned an InputRange, instead of taking an OutputRange. The benefit of this becomes obvious when you want to use std.format() in your own ranges, which currently impairs laziness, pipelining, avoiding memory allocations, etc.
May 08 2014
On Thu, May 08, 2014 at 02:54:32PM -0700, H. S. Teoh via Digitalmars-d wrote:On Thu, May 08, 2014 at 02:38:18PM -0700, Andrei Alexandrescu via Digitalmars-d wrote:[...]On 5/8/14, 1:41 PM, "Luís Marques" <luis luismarques.eu>" wrote:[...] On second thoughts, perhaps std.format *should* be lazy, or at least, the core implementation should be lazy, and std.format should be an eager wrapper around it (using std.algorithm.copy to copy the result of the underlying formatting engine to the output range). The problem is, the current design of std.format is heavily dependent on output ranges (e.g., user-defined types that define a toString that takes a delegate as an output range), and changing this now is going to cause large-scale breakage of existing code. I don't see any clean way of continuing to support toString(delegate) without introducing yet more hidden allocations (to buffer the output of toString, for example), if we switch std.format to a lazy implementation. The only way I can think of to do this is to have language-level support for yield(), then you can hook things up this way: module std.format; auto format(A...)(string fmt, A args) { struct Result { ... property auto front() { // assume typeof(args[i]) = user-defined type args[i].toString((const(char)[] data) { // This switches to a different // fiber than the one running // toString. yield(data); }); ... } ... } return Result(...); } module usercode; struct T { void toString(scope void delegate(const(char)[]) sink) { ... // Each of these calls will be yielded by the // sink, so they will "block" until the outer // caller has processes the new data. sink("abc"); sink("def"); ... } } This requires heavy-duty yield() support in the language, though. I don't know how likely such a thing will make it into the language... T -- Winners never quit, quitters never win. But those who never quit AND never win are idiots.[...] Yeah, I'm not so sure it's a good idea for std.format to be lazy. But there are valid use cases for both, so now I'm wondering if we should adopt Walter's idea of using a naming convention to distinguish between algorithms (lazy) and functions (eager).I agree with H. S. Teoh. Indeed, I was thinking of trying to create an alternative version of std.format which returned an InputRange, instead of taking an OutputRange. The benefit of this becomes obvious when you want to use std.format() in your own ranges, which currently impairs laziness, pipelining, avoiding memory allocations, etc.Interesting. So then the range returned by format() will save everything passed to it, which means... int fun(int[] a) { auto before = format("Before: %s", a); foreach (ref e; a) ++e; auto after = format("After: %s", a); writeln(before, "\n--->\n", after); } *ouch*
May 08 2014
On Thursday, 8 May 2014 at 21:38:12 UTC, Andrei Alexandrescu wrote:int fun(int[] a) { auto before = format("Before: %s", a); foreach (ref e; a) ++e; auto after = format("After: %s", a); writeln(before, "\n--->\n", after); } *ouch*You might think that's bad, but there's nothing in that example that is specific to formatting; that kind of behavior can happen with any lazy range taking input by ref. (I guess that's one drawback of providing functional programming without immutability?) To be clear, I'm not saying that the idea was to make the existing format function lazy, break existing code and surprise people with that change in semantics. format() can be an eager wrapper for the new lazy functionality. And eagerness/laziness should be part of the specification IMO, even more than asymptotic complexity should be part of the STL spec, since (bug > performance bug).
May 08 2014
On Thursday, 8 May 2014 at 22:27:11 UTC, LuÃs Marques wrote:(I guess that's one drawback of providing functional programming without immutability?)Another issue is iteration might be not repeatable, especially if not destructing, in D one can save a range.
May 09 2014
On Thursday, 8 May 2014 at 21:38:12 UTC, Andrei Alexandrescu wrote:Interesting. So then the range returned by format() will save everything passed to it, which means... int fun(int[] a) { auto before = format("Before: %s", a); foreach (ref e; a) ++e; auto after = format("After: %s", a); writeln(before, "\n--->\n", after); } *ouch* Andreivoid fun(int[] a) { auto before = a.map!(a => a + 1); foreach (ref e; a) ++e; auto after = a.map!(a => a + 1); writeln(before, "\n--->\n", after); } Laziness and mutable references are not the best of friends.
May 10 2014
On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:A discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation.The setExtension() function is itself not very important, but what is important is an example for how to put together ranges. Some design goals: 1. purity, safe, nothrow, nogc 2. composability 3. have them work in a consistent way, so there's less for a user to learn
May 08 2014
On Thu, May 08, 2014 at 11:27:54AM -0700, Walter Bright via Digitalmars-d wrote:On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:I think having setExtension() return an input range is the best solution. It will satisfy all 3 requirements: it's easy to make it pure, safe, nothrow, nogc (since it lazily generates the result); it's easy to compose with UFCS chains, and it's consistent with the rest of the range-based functions in Phobos. Having setExtension (or any string function) take an output range would break (2) -- you have to allocate intermediate buffers to serve as output ranges if you want to do further processing of the result. If setExtension returns an input range, then you can just use std.algorithm.copy to copy the result to an output range should you need to. Going the other way, you can't easily convert an output range into an input range. There is the issue of decoding, though. Perhaps setExtension should take a compile-time argument specifying which char type is desired in the result? So you could do: string filename, ext; wstring pathname_w = filename.setExtension!wchar(ext).array; dstring pathname_d = filename.setExtension!dchar(ext).array; This way, if the output char type is identical to the input char type, the function can bypass decoding. T -- "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next. -- (Stolen from the net)A discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation.The setExtension() function is itself not very important, but what is important is an example for how to put together ranges. Some design goals: 1. purity, safe, nothrow, nogc 2. composability 3. have them work in a consistent way, so there's less for a user to learn
May 08 2014
On 5/8/14, 11:27 AM, Walter Bright wrote:On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:4. Rule of least surprise Some may be surprised that auto c = setExtension(a, b) doesn't actually just do it. So changing a and/or b before using c would yield surprising result for someone coming from a background of eager-string languages. AndreiA discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation.The setExtension() function is itself not very important, but what is important is an example for how to put together ranges. Some design goals: 1. purity, safe, nothrow, nogc 2. composability 3. have them work in a consistent way, so there's less for a user to learn
May 08 2014
On 5/8/2014 2:20 PM, Andrei Alexandrescu wrote:Some may be surprised that auto c = setExtension(a, b) doesn't actually just do it. So changing a and/or b before using c would yield surprising result for someone coming from a background of eager-string languages.lazy. It's also true that most of std.algorithm is lazy. Apart from coming up with a new naming convention (and renaming algorithms in Phobos), I don't see any obvious solution to what's lazy and what's not. One possibility is to informally (i.e. in the documentation rather than the core language spec) call something an 'algorithm' if it is lazy and 'function' if it is eager.
May 08 2014
On 08/05/14 23:33, Walter Bright wrote:it was lazy. It's also true that most of std.algorithm is lazy. Apart from coming up with a new naming convention (and renaming algorithms in Phobos), I don't see any obvious solution to what's lazy and what's not. One possibility is to informally (i.e. in the documentation rather than the core language spec) call something an 'algorithm' if it is lazy and 'function' if it is eager.Don't know if it helps, but we could add a UDA indicating which functions are lazy. This wouldn't require any renaming of existing functions. -- /Jacob Carlborg
May 09 2014
On 5/8/2014 1:46 PM, Andrei Alexandrescu wrote:However output range means the string operation will be done eagerly, whereas lazy has advantages (nice piping, saving on work etc).Isn't eagerness what we have array() for? Also, while naming is often just bikeshedding, in this case I think we really need to address it. It's unfortunate that the new InputRange version can't be overloaded with the non-range version, because distinguishing them with "setExt" vs "setExtention" is just a non-option as far as I'm concerned. But IMO that's the minor stuff, I think there's a bigger elephant in the room here: The various benefits of doing it as an InputRange instead of OutputRange, combined with the terrible mess that resulted from converting such a straightforward function from OutputRange to InputRange is yet another clear indication that we REALLY need a simple algorithms exploding into messes like that to become the norm in D? I certainly hope not. I'd rather just go back to opApply. I'm convinced this is a real problem for D moving forward, and I don't like continually seeing it get swept under the same "too unimportant for right now" rug as things that actually belong under there (like named
May 08 2014
On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:However output range means the string operation will be done eagerly, whereas lazy has advantages (nice piping, saving on work etc).The lazy vs eager issue seems at first blush to be an esoteric, potayto-potahto bikeshed. But I think it's more interesting than that. Lazy means that the computation is divided into two pieces: 1. creating an engine to do the calculations 2. feeding the data to the engine to generate a result This, of course, reminds me of regex. By creating the regex engine separately, and then using that engine for multiple sets of data, one can afford to spend extra computational effort at optimizing the engine. Creating the engine can even be done at compile time (as Dmitry's most excellent regex engine does). Engines can make it easier to parallelize computation. And lastly, it's easy to make an eager version by wrapping a lazy one. But to make a lazy version from an eager one means reimplementing it.
May 08 2014
On 5/8/2014 2:46 PM, Walter Bright wrote:But to make a lazy version from an eager one means reimplementing it.Or yield()-ing inside the eager one's sink. And note also there is such as thing as a stackless "fiber", so I'm not certain a full-fledged context-switching fiber would necessarily be required (although it might - stackless fibers do have limitations).
May 08 2014
On 5/8/2014 12:22 PM, Nick Sabalausky wrote:On 5/8/2014 2:46 PM, Walter Bright wrote:The data is still supplied to it.But to make a lazy version from an eager one means reimplementing it.Or yield()-ing inside the eager one's sink.
May 08 2014
On 5/8/2014 4:13 PM, Walter Bright wrote:8/2014 12:22 PM, Nick Sabalausky wrote:I think I've lost track of what you mean here. What data is still supplied to what, and what point does that illustrate?On 5/8/2014 2:46 PM, Walter Bright wrote:The data is still supplied to it.But to make a lazy version from an eager one means reimplementing it.Or yield()-ing inside the eager one's sink.
May 08 2014
On 5/8/2014 4:24 PM, Nick Sabalausky wrote:On 5/8/2014 4:13 PM, Walter Bright wrote:When creating an engine vs calculating the result, the former does not require the actual data to be supplied. regex is a classic example of that.8/2014 12:22 PM, Nick Sabalausky wrote:I think I've lost track of what you mean here. What data is still supplied to what, and what point does that illustrate?On 5/8/2014 2:46 PM, Walter Bright wrote:The data is still supplied to it.But to make a lazy version from an eager one means reimplementing it.Or yield()-ing inside the eager one's sink.
May 08 2014
On Thursday, 8 May 2014 at 17:46:06 UTC, Andrei Alexandrescu wrote:A discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation. First instance of the pull request copied the inputs into an output range. The second instance (right now) creates an input range that lazily creates the result. The element type of that range is the encoding type of the first argument (i.e. char or wchar most of the time). This is different from string/wstring/etc element-wise iteration because it'll be done code unit-wise, not code point-wise. We need a robust idiom for doing such string manipulation without allocation, for which setExtension is just an example. Going the output range route has nice things going for it because the output range decides the encoding in advance and then accepts via put() calls any encoding, with only the minimum transcoding needed. However output range means the string operation will be done eagerly, whereas lazy has advantages (nice piping, saving on work etc). On the other hand, there's the risk of becoming "more catholic than the Pope" by insisting on lazy string processing. Most string operations are eager, and insisting on a general framework for lazy encoded operations on strings may be an exaggeration. D{iscuss,estroy}! AndreiA little thing I've been thinking about on a related note: someOutputRange <| someInputRange; equivalent to someInputRange |> someOutputRange; which is conceptually the same as someInputRange.copy(someOutputRange); perhaps implemented by introducing new operators opSink (in the output range) and opSource (in the input range). These would be invoked as a pair, opSource returning an input range which is passed to opSink. If not implemented, opSource defaults to `auto opSource() { return this; }` or even `alias opSource = this;`. opSink would default to a std.algorithm.copy style thing (can't really be std.algorithm.copy itself because you don't want the dependency). I'm not sure what ground this really breaks, if any, but it sure looks nice to me.
May 08 2014
On 5/8/14, 10:46 AM, Andrei Alexandrescu wrote: [snip] Here's a list of algorithms in std.algorithm that would need to be looked at if we want to handle ranges of char and wchar properly (at least versions with predicates). When I say "looked at" I mean "modified appropriately to handle ranges of char and wchar in addition to strings". ======================== Searching: all any canFind count countUntil commonPrefix endsWith find findAdjacent findAmong findSkip findSplit findSplitAfter findSplitBefore minCount minPos mismatch skipOver startsWith until Comparison: cmp equal levenshteinDistance levenshteinDistanceAndPath mismatch Iteration: filter filterBidirectional group map reduce splitter uniq Sorting: completeSort isPartitioned isSorted makeIndex nextPermutation nextEvenPermutation partialSort partition partition3 schwartzSort sort topN topNCopy Set operations: (probably none, because set ops are rare on strings) Mutation: bringToFront copy fill initializeAll moveAll moveSome remove reverse strip stripLeft stripRight swapRanges uninitializedFill ======================== Of these, only a fraction (maybe 60-70%) are actually used with strings. Still that leaves a significant amount of work to do if we want std.algorithm to work smoothly with ranges of char/wchar. Here "smoothly" means "if it compiles and runs it won't do something UTF-goofy". Andrei
May 08 2014
On 5/8/14, 3:33 PM, Andrei Alexandrescu wrote:Of these, only a fraction (maybe 60-70%) are actually used with strings. Still that leaves a significant amount of work to do if we want std.algorithm to work smoothly with ranges of char/wchar. Here "smoothly" means "if it compiles and runs it won't do something UTF-goofy".The larger point here (I forgot) is that user code that's written in the style of std.algorithm would need similar adjustments in order to work with ranges of char/wchar. Andrie
May 08 2014
Classification of functions in http://dlang.org/phobos/std_path.html: 1. Functions that allocate GC memory and return an array: dirName, setExtension, defaultExtension, buildPath, buildNormalizedPath, absolutePath, relativePath, expandTilde These are prime candidates to be converted into algorithms. 2. Functions that return a slice of their input arrays: baseName, rootName, driveName, stripDrive, extension, Should be enhanced to accept RandomAccessRange and produce RandomAccessRange 3. Functions that return a range: pathSplitter Enhance to accept RandomAccessRange 4. Functions that take an array and compute a value: isRooted, isAbsolute, filenameCmp, globMatch, isValidPath These are reducers, and should be enhanced to accept input ranges.
May 08 2014
On Thursday, 8 May 2014 at 17:46:06 UTC, Andrei Alexandrescu wrote:A discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation. First instance of the pull request copied the inputs into an output range. The second instance (right now) creates an input range that lazily creates the result. The element type of that range is the encoding type of the first argument (i.e. char or wchar most of the time). This is different from string/wstring/etc element-wise iteration because it'll be done code unit-wise, not code point-wise.I'm with H. S. Teoh that we should prefer the input range approach. My limited experience with output ranges suggests to me that they really need to be the end point, where the data is actually leaving the program (writing to disk, over a network). And as Teoh mentions, if you need it in specific memory you can copy an input range into it. D is already fairly lazy, and .array() is great for doing eager evaluation into the GC. For strings having something similar which allows for the original encoding to remain would be good.
May 13 2014