digitalmars.D - std.experimental.collections.rcstring and its integration in Phobos
- Seb (24/24) Jul 17 2018 So we managed to revive the rcstring project and it's already a
- Jonathan M Davis (10/34) Jul 17 2018 If it's not a range by default, why would you expect _anything_ which
- Seb (9/22) Jul 17 2018 Well, there are few cases where the range type doesn't matter and
- Jonathan M Davis (25/49) Jul 17 2018 That effectively means treating rcstring as a range of char by default
- Seb (11/45) Jul 18 2018 Well, the problem of it being a range of char is that this might
- Jonathan M Davis (34/42) Jul 18 2018 I don't know. I'm fine with it not being a range and leaving it up to th...
- Andrei Alexandrescu (3/5) Jul 17 2018 Many functions do not care about the range aspect, but do care about the...
- Jonathan M Davis (20/25) Jul 17 2018 It doesn't care about strings either. It operates on a range of characte...
- Andrea Fontana (2/13) Jul 18 2018 This makes sense for me too.
- Jacob Carlborg (4/8) Jul 17 2018 I vote for .by!char to be the default.
- rikki cattermole (11/19) Jul 17 2018 I'm thinking .as!T
- Seb (6/13) Jul 18 2018 The problem here is this would also lead to very confusing
- Eugene Wissner (5/20) Jul 18 2018 Therefore it shouldn't compile at all, but
- sarn (8/12) Jul 18 2018 +1 to requiring an explicit byCodeUnit or whatever.
- Jacob Carlborg (5/11) Jul 18 2018 How about not giving access to operate on individual characters. If they...
- jmh530 (26/29) Jul 17 2018 I'm glad this is getting worked on. It feels like something that
- Seb (13/46) Jul 18 2018 Well AFAICT the idea is that with RCIAllocator (or its
- Jon Degenhardt (20/32) Jul 17 2018 I don't know the goals/role rcstring is expected to play,
So we managed to revive the rcstring project and it's already a PR for Phobos: https://github.com/dlang/phobos/pull/6631 (still WIP though)

The current approach in short:

- uses the new @nogc, @safe and nothrow Array from the collections library (check Eduardo's DConf18 talk)
- uses reference counting
- _no_ range by default (it needs an explicit `.by!{d,w,}char`) (as in no auto-decoding by default)

Still to be done:

- integration in Phobos (the current idea is to generate additional overloads for rcstring)
- performance
- use of static immutable rcstring in fully @nogc code
- extensive testing

Especially the "seamless" integration in Phobos will be challenging. I made a rough listing of all symbols that one would expect to be usable with an rcstring type (https://gist.github.com/wilzbach/d74712269f889827cff6b2c7a08d07f8). It's more than 200. As rcstring isn't a range by default, but one expects `"foo".rcstring.equal("foo")` to work, overloads for all these symbols would need to be added.

What do you think about this approach? Do you have a better idea?
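For a concrete picture, here is a hedged usage sketch of the proposed interface. The `rcstring` helper and the `by!` view come from the PR discussion above; the module path is still WIP and the exact signatures may change before merging, so treat this as pseudocode rather than a working program:

```d
import std.algorithm.comparison : equal;
// Hypothetical import; dlang/phobos#6631 is still WIP:
// import std.experimental.collections.rcstring;

void example() @nogc @safe nothrow
{
    auto s = "foo".rcstring;          // reference-counted, no GC allocation
    // rcstring is deliberately *not* a range; a view must be requested:
    assert(s.by!char.equal("foo"));   // range of UTF-8 code units
    assert(s.by!dchar.equal("foo"));  // range of code points (explicit decoding)
    // assert(s.equal("foo"));        // only works via one of the ~200 proposed overloads
}
```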
Jul 17 2018
On Tuesday, July 17, 2018 15:21:30 Seb via Digitalmars-d wrote:
> So we managed to revive the rcstring project and it's already a PR for Phobos: https://github.com/dlang/phobos/pull/6631 (still WIP though) [...] As rcstring isn't a range by default, but one expects `"foo".rcstring.equal("foo")` to work, overloads for all these symbols would need to be added. What do you think about this approach? Do you have a better idea?

If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? IMHO, if it's not a range, then range-based functions shouldn't work with it, and I don't see how they even _can_ work with it unless you assume code units, or code points, or graphemes as the default. If it's designed to not be a range, then it should be up to the programmer to call the appropriate function on it to get the appropriate range type for a particular use case, in which case, you really shouldn't need to add much of any overloads for it.

- Jonathan M Davis
Jul 17 2018
On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis wrote:
> If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? [...] you really shouldn't need to add much of any overloads for it.

Well, there are a few cases where the range type doesn't matter and one can simply compare bytes, e.g.

- equal (e.g. "ä" == "ä" <=> [195, 164] == [195, 164])
- commonPrefix
- find
- ...

Of course this assumes that there's no normalization necessary, but the current auto-decoding assumes this too.
Jul 17 2018
On Tuesday, July 17, 2018 17:28:19 Seb via Digitalmars-d wrote:
> Well, there are a few cases where the range type doesn't matter and one can simply compare bytes, e.g. equal (e.g. "ä" == "ä" <=> [195, 164] == [195, 164]), commonPrefix, find, ...

That effectively means treating rcstring as a range of char by default rather than not treating it as a range by default. And if we then do that only with functions that overload on rcstring rather than making rcstring actually a range of char, then why aren't we just treating it as a range of char in general? IMHO, the fact that so many algorithms currently special-case on arrays of characters is one reason that auto-decoding has been a disaster, and adding a bunch of overloads for rcstring is just compounding the problem. Algorithms should properly support arbitrary ranges of characters, and then rcstring can be passed to them by calling one of the functions on it to get a range of code units, code points, or graphemes to get an actual range - either that, or rcstring should default to being a range of char. Going halfway and making it work with some functions via overloads really doesn't make sense.

Now, if we're talking about functions that really operate on strings and not ranges of characters (and thus do stuff like append), then that becomes a different question, but we've mostly been trying to move away from functions like that in Phobos.

> Of course this assumes that there's no normalization necessary, but the current auto-decoding assumes this too.

You can still normalize with auto-decoding (the code units - and thus code points - are in a specific order even when encoded, and that order can be normalized), and really, anyone who wants fully correct string comparisons needs to be normalizing their strings. With that in mind, rcstring probably should support normalization of its internal representation.

- Jonathan M Davis
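The normalization point can be made concrete with today's std.uni, independently of rcstring. This is plain Phobos D:

```d
import std.uni : normalize, NFC;

void main()
{
    string composed   = "\u00F1";   // "ñ" as a single precomposed code point
    string decomposed = "n\u0303";  // "n" followed by a combining tilde

    // Bytewise (and code-point-wise) the two strings differ:
    assert(composed != decomposed);

    // After NFC normalization they compare equal:
    assert(normalize!NFC(decomposed) == composed);
}
```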
Jul 17 2018
On Tuesday, 17 July 2018 at 18:09:13 UTC, Jonathan M Davis wrote:
> That effectively means treating rcstring as a range of char by default rather than not treating it as a range by default. [...] Going halfway and making it work with some functions via overloads really doesn't make sense.

Well, the problem of it being a range of char is that this might lead to very confusing behavior, e.g.

```
"ä".rcstring.split.join("|") == �|�
```

So we probably shouldn't go this route either. The idea of adding overloads was to introduce a bit of user convenience, s.t. they don't have to say readText("foo".rcstring.by!char) all the time.

> With that in mind, rcstring probably should support normalization of its internal representation.

It currently doesn't support this out of the box, but it's a very valid point and I added it to the list.
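The failure mode behind those "�" replacement characters is easy to reproduce with today's byCodeUnit: any operation that handles UTF-8 code units individually can cut a multi-byte sequence apart. A minimal illustration in plain D (using retro as a stand-in for any code-unit-level shuffling):

```d
import std.array : array;
import std.range : retro;
import std.utf : byCodeUnit;

void main()
{
    string s = "ä";                 // U+00E4 encodes as two UTF-8 code units
    assert(s.length == 2);
    assert(cast(const(ubyte)[]) s == [0xC3, 0xA4]);

    // Reversing code units splits the sequence: [0xA4, 0xC3] is invalid
    // UTF-8, which renders as "�" replacement characters when displayed.
    auto reversed = s.byCodeUnit.retro.array;
    assert(cast(const(ubyte)[]) reversed == [0xA4, 0xC3]);
}
```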
Jul 18 2018
On Wednesday, July 18, 2018 12:15:52 Seb via Digitalmars-d wrote:
> Well, the problem of it being a range of char is that this might lead to very confusing behavior, e.g. "ä".rcstring.split.join("|") == �|� So we probably shouldn't go this route either.

I don't know. I'm fine with it not being a range and leaving it up to the programmer, but part of the point here is that the programmer needs to understand Unicode well enough to be able to do the right thing in cases like this or they're screwed anyway. And if strings (of any variety) operate as ranges of code units by default, the fact that there's a problem when someone screws it up is going to be a lot more obvious. Forcing people to call a function like by!char or by!dchar still requires that they deal with all of this. It just makes it explicit. And that's not necessarily a bad idea, but if someone is going to be confused by something like split splitting in the middle of code points, they're going to be in trouble with the by function anyway.

> The idea of adding overloads was to introduce a bit of user convenience, s.t. they don't have to say readText("foo".rcstring.by!char) all the time.

They wouldn't be doing anything that verbose anyway. In that case, you'd just pass the string literal. At most, they'd be doing something like readText(str.by!char); And of course, readText is definitely _not_ @nogc. But regardless, these are functions that are designed to be generic and take ranges of characters rather than strings, and adding overloads for specific types just because we don't want to call the function to get a range over them seems like it's going in totally the wrong direction. It means adding a lot of overloads, and we already have quite a mess thanks to all of the special-casing that we have to avoid auto-decoding without getting into adding yet another set of overloads for rcstring.

We've put in the effort to genericize these functions and make many of these functions work with ranges of characters rather than strings, and I really don't think that we should start adding overloads for specific string types just because we don't want to treat them as ranges directly. I'd honestly rather see an rcstring type that was just treated as a range of char than see us adding overloads for rcstring. That's what arrays of char should have been treated as in the first place, and we already have to do stuff like byCodeUnit for strings anyway, so having to do by!char or by!dchar really doesn't seem like a big deal to me - especially if the alternative is adding a bunch of overloads.

- Jonathan M Davis
Jul 18 2018
On 7/17/18 12:58 PM, Jonathan M Davis wrote:
> If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly?

Many functions do not care about the range aspect, but do care about the string aspect. Consider e.g. chdir.
Jul 17 2018
On Tuesday, July 17, 2018 22:45:33 Andrei Alexandrescu via Digitalmars-d wrote:
> Many functions do not care about the range aspect, but do care about the string aspect. Consider e.g. chdir.

It doesn't care about strings either. It operates on a range of characters. If a function is just taking a value as input and isn't storing it or mutating its elements, then a range of characters works perfectly fine and is more flexible than any particular type - and IMHO shouldn't then be having overloads for particular ranges of characters or string types if we can avoid it. If we're talking about a function that's really operating on a string as a string and doing things like appending as opposed to doing range-based operations, then maybe overloading for other string types makes sense rather than requiring an array of characters. But if it's just taking a string and reading it? That has no need to operate on strings specifically and should be operating on a range of characters - something that we've been moving towards with Phobos. As such, I don't think that it generally makes sense for functions in Phobos to be explicitly accepting rcstring unless it's actually a range. If it's not actually a range, then such functions should already work with it by calling the appropriate function to get a range over it without needing to special-case anything.

- Jonathan M Davis
Jul 17 2018
On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis wrote:
> If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? [...]

This makes sense for me too.
Jul 18 2018
On 2018-07-17 17:21, Seb wrote:
> - _no_ range by default (it needs an explicit `.by!{d,w,}char`) (as in no auto-decoding by default)
> What do you think about this approach? Do you have a better idea?

I vote for .by!char to be the default.

-- /Jacob Carlborg
Jul 17 2018
On 18/07/2018 5:41 AM, Jacob Carlborg wrote:
> I vote for .by!char to be the default.

I'm thinking .as!T

So we can cover ubyte/char/wchar/dchar, string/wstring/dstring all in one. I think whatever we expose as the default for string/wstring/dstring however should be settable. e.g.

```
struct RCString(DefaultStringType = string)
{
    alias as!DefaultStringType this;
}
```

Which is a perfect example of what my named parameter DIP is for ;)
Jul 17 2018
On Tuesday, 17 July 2018 at 17:41:05 UTC, Jacob Carlborg wrote:
> I vote for .by!char to be the default.

The problem here is this would also lead to very confusing behavior for newcomers, e.g.

```
"ä".split.join("|") == �|�
```
Jul 18 2018
On Wednesday, 18 July 2018 at 11:37:33 UTC, Seb wrote:
> The problem here is this would also lead to very confusing behavior for newcomers, e.g. "ä".split.join("|") == �|�

Therefore it shouldn't compile at all, but

```
rcstring("ä")[].split("|")
```

or

```
rcstring("ä").byCodePoint.split("|")
```
Jul 18 2018
On Wednesday, 18 July 2018 at 12:03:02 UTC, Eugene Wissner wrote:
> Therefore it shouldn't compile at all, but rcstring("ä")[].split("|") or rcstring("ä").byCodePoint.split("|")

+1 to requiring an explicit byCodeUnit or whatever. For every "obvious" way to interpret a string as a range, you can find an application where the obvious code is surprisingly buggy. (BTW, rcstring("ä").byCodePoint.split("|") is buggy for characters made of multiple codepoints. Canonicalisation doesn't fix it because many characters just don't have a single-codepoint form.)
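The multi-codepoint caveat can be checked with Phobos' std.uni.byGrapheme, in plain D:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "é" built from 'e' + U+0301 (combining acute accent): a code-point-level
    // operation can still separate the base letter from its combining mark.
    string s = "e\u0301";
    assert(s.walkLength == 2);             // two code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1);  // one user-perceived character
}
```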
Jul 18 2018
On 2018-07-18 13:37, Seb wrote:
> The problem here is this would also lead to very confusing behavior for newcomers, e.g. "ä".split.join("|") == �|�

How about not giving access to operate on individual characters? If they need to do that, they should operate on an array of bytes. Too controversial?

-- /Jacob Carlborg
Jul 18 2018
On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
> So we managed to revive the rcstring project and it's already a PR for Phobos: [snip]

I'm glad this is getting worked on. It feels like something that D has been working towards for a while. Unfortunately, I haven't (yet) watched the collections video at DConf and don't see a presentation on the website. Because of that, I don't really understand some of the design decisions.

For instance, I also don't really understand how RCIAllocator is different from the old IAllocator (the documentation could use some work, IMO). It looks like RCIAllocator is part of what drives the reference counting semantics, but it also looks like Array has some support for reference counting, like addRef, that invokes RCIAllocator somehow. But Array also has some support for gc_allocator as the default, so my cursory examination suggests that Array is not really intended to be an RCArray...

So at that point I started wondering why not just have String as an alias of Array, akin to how D does it for dynamic arrays to strings currently. If there is stuff in rcstring now that isn't in Array, then that could be included in Array as a compile-time specialization for the relevant types (at the cost of bloating Array). And then leave it up to the user how to allocate.

I think part of the above design decision connects in with why rcstring stores the data as ubytes, even for wchar and dchar. Recent comments suggest that it is related to auto-decoding. My sense is that an rcstring that does not have auto-decoding, even if it requires more work to get working with Phobos, is a better solution over the long run.
Jul 17 2018
On Tuesday, 17 July 2018 at 18:43:47 UTC, jmh530 wrote:
> For instance, I also don't really understand how RCIAllocator is different from the old IAllocator (the documentation could use some work, IMO). It looks like RCIAllocator is part of what drives the reference counting semantics,

Well, AFAICT the idea with RCIAllocator (or its convenience function allocatorObject) is that you can convert any allocator to a reference-counted one, e.g.

```
RCIAllocator a = allocatorObject(Mallocator.instance);
```

> but it also looks like Array has some support for reference counting, like addRef, that invokes RCIAllocator somehow. But Array also has some support for gc_allocator as the default, so my cursory examination suggests that Array is not really intended to be an RCArray...

Yes, Array is a reference-counted Array, but it also has a reference-counted allocator.

> So at that point I started wondering why not just have String as an alias of Array, akin to how D does it for dynamic arrays to strings currently. [...] And then leave it up to the user how to allocate.

There's lots of stuff in rcstring related to better interoperability with existing strings, e.g. you just want `"foo".rcstring == "foo"` to work.

> I think part of the above design decision connects in with why rcstring stores the data as ubytes, even for wchar and dchar. Recent comments suggest that it is related to auto-decoding.

Yes, rcstring doesn't do any auto-decoding and hence stores its data as a ubyte array.

> My sense is that an rcstring that does not have auto-decoding, even if it requires more work to get working with Phobos, is a better solution over the long run.

What do you mean by this?
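A slightly fuller sketch of the allocatorObject pattern using today's std.experimental.allocator; the make/dispose usage here is an illustration of the reference-counted wrapper, not something taken from the collections PR:

```d
import std.experimental.allocator : allocatorObject, dispose, make, RCIAllocator;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // Wrap the (stateless) Mallocator in a reference-counted runtime interface:
    RCIAllocator a = allocatorObject(Mallocator.instance);

    int* p = a.make!int(42);     // allocate through the wrapped allocator
    scope (exit) a.dispose(p);   // deterministic cleanup, no GC involved
    assert(*p == 42);
}
```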
Jul 18 2018
On Wednesday, 18 July 2018 at 11:56:39 UTC, Seb wrote:
> Yes, rcstring doesn't do any auto-decoding and hence stores its data as a ubyte array. [...] What do you mean by this?

Just that there are a lot of complaints about D's auto-decoding of strings. Not doing any auto-decoding seems like a good long-run design decision, even if it makes some things more difficult.
Jul 18 2018
On Wednesday, 18 July 2018 at 11:56:39 UTC, Seb wrote:
> Yes, Array is a reference-counted Array, but it also has a reference-counted allocator.

I see. Is it really a good idea to make the ownership/lifetime strategy part of the container? What happens when you want to make @nogc collections for lists, trees, etc.? You have to make multiple versions for unique/ref counted/some new strategy? I would think it is more generic to have it as a separate wrapper that handles the ownership/lifetime strategy, like what exists in automem and C++'s smart pointers... though automem looks like it has a separate type for Unique_Array rather than including it in Unique... so I suppose that potentially has the same issue...
Jul 18 2018
On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
> So we managed to revive the rcstring project and it's already a PR for Phobos: https://github.com/dlang/phobos/pull/6631 (still WIP though) [snip] What do you think about this approach? Do you have a better idea?

I don't know the goals/role rcstring is expected to play, especially wrt existing string/character facilities. Perhaps you could describe more?

Strings are central to many applications, so I'm wondering about things like whether rcstring is intended as a replacement for string that would be used by most new programs, and whether applications would use arrays and ranges of char together with rcstring, or rcstring would be used for everything.

Perhaps it's too early for these questions, and the current goal is simpler. For example, adding a meaningful collection class that is @nogc, @safe and ref-counted that can be used as a proving ground for the newer memory management facilities being developed. Such simpler goals would be quite reasonable.

What's got me wondering about the larger questions are the comments about ranges and autodecoding. If rcstring is intended as a vehicle for general @nogc handling of character data and/or for reducing the impact of autodecoding, then it makes sense to consider it from those perspectives.

--Jon
Jul 17 2018
On Wednesday, 18 July 2018 at 03:40:08 UTC, Jon Degenhardt wrote:
> I don't know the goals/role rcstring is expected to play, especially wrt existing string/character facilities. Perhaps you could describe more?

Sorry for the brevity yesterday. One of the long-term pain points of D is that usage of string can't be @nogc. rcstring is intended to be a drop-in @nogc replacement for everywhere where string is currently used (that's the idea, at least).

> Strings are central to many applications, so I'm wondering about things like whether rcstring is intended as a replacement for string that would be used by most new programs,

Yes, that's the long-term goal. An opt-in @nogc string class. There's no plan to do sth. disruptive like replacing the `alias string = immutable(char)[];` declaration in druntime. However, we might move rcstring to druntime at some point, s.t. e.g. Exceptions or asserts can use @nogc strings.

> and whether applications would use arrays and ranges of char together with rcstring, or rcstring would be used for everything.

That point is still open for discussion, but at the moment rcstring isn't a range and the user has to declare what kind of range he/she wants with e.g. `.by!char`. However, one current idea is that for some use cases (e.g. comparison) it might not matter and an application could add overloads for rcstrings. The current idea is to do the same thing for Phobos - though I have to say that I'm not really looking forward to adding 200 overloads to Phobos :/

> Perhaps it's too early for these questions, and the current goal is simpler. For example, adding a meaningful collection class that is @nogc, @safe and ref-counted that can be used as a proving ground for the newer memory management facilities being developed.

That's the long-term goal of the collections project. However, with rcstring being the first big use case for it, the idea was to push rcstring forward and by that discover all remaining issues with the Array class. Also the interface of rcstring is rather contained (and doesn't expose the underlying storage to the user), which allows us to iterate over/improve upon the Array design.

> Such simpler goals would be quite reasonable. What's got me wondering about the larger questions are the comments about ranges and autodecoding. If rcstring is intended as a vehicle for general @nogc handling of character data and/or for reducing the impact of autodecoding, then it makes sense to consider it from those perspectives.

Hehe, it's intended to solve both problems (auto-decoding by default and @nogc) at the same time. However, it looks to me like there isn't a good solution to the auto-decoding problem that is convenient to use for the user and doesn't sacrifice on performance.
Jul 18 2018
On Wednesday, 18 July 2018 at 12:10:04 UTC, Seb wrote:
> That point is still open for discussion, but at the moment rcstring isn't a range and the user has to declare what kind of range he/she wants with e.g. `.by!char`. However, one current idea is that for some use cases (e.g. comparison) it might not matter and an application could add overloads for rcstrings.

Maybe I misunderstood, but you mean that for comparisons the encoding doesn't matter only, right? But that does not preclude normalization, e.g. Unicode defines U+00F1 as equal to the sequence U+006E U+0303, and that would work as long as they're normalized (from what I understand at least) and regardless of whether you compare char/wchar/dchars.

> The current idea is to do the same thing for Phobos - though I have to say that I'm not really looking forward to adding 200 overloads to Phobos :/

How about a compile time flag that can make things more convenient:

```
auto str1 = latin1("literal");

rcstring!Latin1 latin1string(string str)
{
    return rcstring!Latin1(str);
}

auto str2 = utf8("åsm");

// ...

struct rcstring(Encoding = Unknown)
{
    ubyte[] data;
    bool normalized = false;

    static if (is(Encoding == Latin1))
    {
        // by char range interface implementation
    }
    else static if (is(Encoding == Utf8))
    {
        // byGrapheme range interface implementation?
    }
    else
    {
        // no range interface implementation
    }

    bool opEquals()(auto ref const S lhs) const
    {
        static if (!is(Encoding == Latin1))
        {
            return data == lhs.data;
        }
        else
        {
            return normalized() == lhs.normalized();
        }
    }
}
```

And now most ranges will work correctly. And then some of the algorithms that don't need to use byGrapheme but just need normalized code points to work correctly can do that, and that seems like all the special handling you'll need inside range algorithms?

Then:

```
readText("foo".latin1);
"ä".utf8.split.join("|");
```

??

Cheers,
- Ali
Jul 18 2018
On Wednesday, 18 July 2018 at 22:44:33 UTC, aliak wrote:
> How about a compile time flag that can make things more convenient: [...]

I like this approach; `rcstring.by!` is too verbose for my taste and quite annoying for day-to-day usage. I think rcstring should be aliased by concrete implementations like ansi, utf8, utf16, utf32. Those aliases should be ranges and maybe subtype their respective string, wstring, dstring so they can be transparently used for non-range-based APIs (this requires DIP1000 for @safe).

The take-away is that rcstring by itself does not satisfy the usability criteria, and probably should focus on performance and flexibility to be used as a building block for higher level constructs that are easier to use and safer in regards to how they work with the string type they hold.
Jul 19 2018