
digitalmars.D - std.experimental.collections.rcstring and its integration in Phobos

reply Seb <seb wilzba.ch> writes:
So we managed to revive the rcstring project and it's already a 
PR for Phobos:

https://github.com/dlang/phobos/pull/6631 (still WIP though)

The current approach in short:

- uses the new @nogc, @safe and nothrow Array from the 
collections library (check Eduardo's DConf18 talk)
- uses reference counting
- _no_ range by default (it needs an explicit `.by!{d,w,}char`) 
(as in no auto-decoding by default)
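To make the "no range by default" point concrete, usage might look like the sketch below. This is illustrative only: `rcstring` and `by!` are the WIP PR's names and may still change; `canFind` is the existing std.algorithm function.

```d
import std.algorithm.searching : canFind;

void main()
{
    auto s = "Hello, Mars".rcstring; // reference-counted storage
    // rcstring is deliberately NOT a range; pick an iteration scheme first:
    assert(s.by!char.canFind('M'));  // iterate by UTF-8 code units
    // s.by!wchar -> UTF-16 code units, s.by!dchar -> decoded code points
}
```

Nothing here is final; the point is only that no range operation works until `by!…` chooses code units, UTF-16 units, or code points.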

Still to be done:

- integration in Phobos (the current idea is to generate 
additional overloads for rcstring)
- performance
- use of static immutable rcstring in fully @nogc code
- extensive testing

Especially the "seamless" integration in Phobos will be 
challenging.
I made a rough listing of all symbols that one would expect to be 
usable with an rcstring type 
(https://gist.github.com/wilzbach/d74712269f889827cff6b2c7a08d07f8). It's more
than 200.
As rcstring isn't a range by default, but one expects 
`"foo".rcstring.equal("foo")` to work, overloads for all these 
symbols would need to be added.
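A generated overload could look roughly like this sketch (hypothetical shape; `by!char` as in the PR, `byCodeUnit` is the real std.utf function):

```d
import std.algorithm.comparison : equal;
import std.utf : byCodeUnit;

// Hypothetical generated overload: compare an rcstring against a string
// by UTF-8 code units, so `"foo".rcstring.equal("foo")` compiles even
// though rcstring itself is not a range.
bool equal(rcstring a, string b)
{
    return .equal(a.by!char, b.byCodeUnit);
}
```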

What do you think about this approach? Do you have a better idea?
Jul 17 2018
next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, July 17, 2018 15:21:30 Seb via Digitalmars-d wrote:
 So we managed to revive the rcstring project and it's already a
 PR for Phobos:

 https://github.com/dlang/phobos/pull/6631 (still WIP though)

 The current approach in short:

 - uses the new @nogc, @safe and nothrow Array from the
 collections library (check Eduardo's DConf18 talk)
 - uses reference counting
 - _no_ range by default (it needs an explicit `.by!{d,w,}char`)
 (as in no auto-decoding by default)

 Still to be done:

 - integration in Phobos (the current idea is to generate
 additional overloads for rcstring)
 - performance
 - use of static immutable rcstring in fully @nogc
 - extensive testing

 Especially the "seamless" integration in Phobos will be
 challenging.
 I made a rough listing of all symbols that one would expect to be
 usable with an rcstring type
 (https://gist.github.com/wilzbach/d74712269f889827cff6b2c7a08d07f8). It's
 more than 200. As rcstring isn't a range by default, but one expects
 `"foo".rcstring.equal("foo")` to work, overloads for all these
 symbols would need to be added.

 What do you think about this approach? Do you have a better idea?
If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? IMHO, if it's not a range, then range-based functions shouldn't work with it, and I don't see how they even _can_ work with it unless you assume code units, or code points, or graphemes as the default. If it's designed to not be a range, then it should be up to the programmer to call the appropriate function on it to get the appropriate range type for a particular use case, in which case, you really shouldn't need to add much of any overloads for it. - Jonathan M Davis
Jul 17 2018
next sibling parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis wrote:
 On Tuesday, July 17, 2018 15:21:30 Seb via Digitalmars-d wrote:
 [...]
If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? IMHO, if it's not a range, then range-based functions shouldn't work with it, and I don't see how they even _can_ work with it unless you assume code units, or code points, or graphemes as the default. If it's designed to not be a range, then it should be up to the programmer to call the appropriate function on it to get the appropriate range type for a particular use case, in which case, you really shouldn't need to add much of any overloads for it. - Jonathan M Davis
Well, there are a few cases where the range type doesn't matter and one can simply compare bytes, e.g.

- equal (e.g. "ä" == "ä" <=> [195, 164] == [195, 164])
- commonPrefix
- find
- ...

Of course this assumes that there's no normalization necessary, but the current auto-decoding assumes this too.
Jul 17 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, July 17, 2018 17:28:19 Seb via Digitalmars-d wrote:
 On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis wrote:
 On Tuesday, July 17, 2018 15:21:30 Seb via Digitalmars-d wrote:
 [...]
If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? IMHO, if it's not a range, then range-based functions shouldn't work with it, and I don't see how they even _can_ work with it unless you assume code units, or code points, or graphemes as the default. If it's designed to not be a range, then it should be up to the programmer to call the appropriate function on it to get the appropriate range type for a particular use case, in which case, you really shouldn't need to add much of any overloads for it. - Jonathan M Davis
Well, there are a few cases where the range type doesn't matter and one can simply compare bytes, e.g.

- equal (e.g. "ä" == "ä" <=> [195, 164] == [195, 164])
- commonPrefix
- find
- ...
That effectively means treating rcstring as a range of char by default rather than not treating it as a range by default. And if we then do that only with functions that overload on rcstring rather than making rcstring actually a range of char, then why aren't we just treating it as a range of char in general?

IMHO, the fact that so many algorithms currently special-case on arrays of characters is one reason that auto-decoding has been a disaster, and adding a bunch of overloads for rcstring is just compounding the problem. Algorithms should properly support arbitrary ranges of characters, and then rcstring can be passed to them by calling one of the functions on it to get a range of code units, code points, or graphemes to get an actual range - either that, or rcstring should default to being a range of char. Going halfway and making it work with some functions via overloads really doesn't make sense.

Now, if we're talking about functions that really operate on strings and not ranges of characters (and thus do stuff like append), then that becomes a different question, but we've mostly been trying to move away from functions like that in Phobos.
 Of course this assumes that there's no normalization necessary,
 but the current auto-decoding assumes this too.
You can still normalize with auto-decoding (the code units - and thus code points - are in a specific order even when encoded, and that order can be normalized), and really, anyone who wants fully correct string comparisons needs to be normalizing their strings. With that in mind, rcstring probably should support normalization of its internal representation. - Jonathan M Davis
Jul 17 2018
parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 17 July 2018 at 18:09:13 UTC, Jonathan M Davis wrote:
 On Tuesday, July 17, 2018 17:28:19 Seb via Digitalmars-d wrote:
 On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis 
 wrote:
 [...]
Well, there are a few cases where the range type doesn't matter and one can simply compare bytes, e.g.

- equal (e.g. "ä" == "ä" <=> [195, 164] == [195, 164])
- commonPrefix
- find
- ...
That effectively means treating rcstring as a range of char by default rather than not treating it as a range by default. And if we then do that only with functions that overload on rcstring rather than making rcstring actually a range of char, then why aren't we just treating it as a range of char in general?

IMHO, the fact that so many algorithms currently special-case on arrays of characters is one reason that auto-decoding has been a disaster, and adding a bunch of overloads for rcstring is just compounding the problem. Algorithms should properly support arbitrary ranges of characters, and then rcstring can be passed to them by calling one of the functions on it to get a range of code units, code points, or graphemes to get an actual range - either that, or rcstring should default to being a range of char. Going halfway and making it work with some functions via overloads really doesn't make sense.
Well, the problem with it being a range of char is that this might lead to very confusing behavior, e.g.

"ä".rcstring.split.join("|") == �|�

So we probably shouldn't go this route either.

The idea of adding overloads was to introduce a bit of user convenience, so that they don't have to write

readText("foo".rcstring.by!char)

all the time.
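The mojibake above falls out of UTF-8 itself, and plain Phobos shows why (std.string.representation and std.utf.byCodeUnit are existing APIs; only the rcstring line above is from the WIP PR):

```d
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    // "ä" is one character but two UTF-8 code units:
    assert("ä".representation == [0xC3, 0xA4]);
    assert("ä".byCodeUnit.length == 2); // range of char, no auto-decoding
    // Splitting between those two units leaves [0xC3] and [0xA4], and
    // neither half is valid UTF-8 on its own - hence the "�" output.
}
```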
 You can still normalize with auto-decoding (the code units - 
 and thus code points - are in a specific order even when 
 encoded, and that order can be normalized), and really, anyone 
 who wants fully correct string comparisons needs to be 
 normalizing their strings. With that in mind, rcstring probably 
 should support normalization of its internal representation.
It currently doesn't support this out of the box, but it's a very valid point and I added it to the list.
Jul 18 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, July 18, 2018 12:15:52 Seb via Digitalmars-d wrote:
 Well, the problem of it being a range of char is that this might
 lead to very confusing behavior, e.g.

 "ä".rcstring.split.join("|") == �|�

 So we probably shouldn't go this route either.
I don't know. I'm fine with it not being a range and leaving it up to the programmer, but part of the point here is that the programmer needs to understand Unicode well enough to be able to do the right thing in cases like this or they're screwed anyway. And if strings (of any variety) operate as ranges of code units by default, the fact that there's a problem when someone screws it up is going to be a lot more obvious.

Forcing people to call a function like by!char or by!dchar still requires that they deal with all of this. It just makes it explicit. And that's not necessarily a bad idea, but if someone is going to be confused by something like split splitting in the middle of code points, they're going to be in trouble with the by function anyway.
 The idea of adding overloads was to introduce a bit of
 user-convenience, s.t. they don't have to say

 readText("foo".rcstring.by!char)

 all the time.
They wouldn't be doing anything that verbose anyway. In that case, you'd just pass the string literal. At most, they'd be doing something like

readText(str.by!char);

And of course, readText is definitely _not_ @nogc.

But regardless, these are functions that are designed to be generic and take ranges of characters rather than strings, and adding overloads for specific types just because we don't want to call the function to get a range over them seems like it's going in totally the wrong direction. It means adding a lot of overloads, and we already have quite a mess thanks to all of the special-casing that we have to avoid auto-decoding without getting into adding yet another set of overloads for rcstring. We've put in the effort to genericize these functions and make many of them work with ranges of characters rather than strings, and I really don't think that we should start adding overloads for specific string types just because we don't want to treat them as ranges directly.

I'd honestly rather see an rcstring type that was just treated as a range of char than see us adding overloads for rcstring. That's what arrays of char should have been treated as in the first place, and we already have to do stuff like byCodeUnit for strings anyway, so having to do by!char or by!dchar really doesn't seem like a big deal to me - especially if the alternative is adding a bunch of overloads.

- Jonathan M Davis
Jul 18 2018
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/17/18 12:58 PM, Jonathan M Davis wrote:
 If it's not a range by default, why would you expect_anything_  which
 operates on ranges to work with rcstring directly?
Many functions do not care about the range aspect, but do care about the string aspect. Consider e.g. chdir.
Jul 17 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, July 17, 2018 22:45:33 Andrei Alexandrescu via Digitalmars-d 
wrote:
 On 7/17/18 12:58 PM, Jonathan M Davis wrote:
 If it's not a range by default, why would you expect_anything_  which
 operates on ranges to work with rcstring directly?
Many functions do not care about the range aspect, but do care about the string aspect. Consider e.g. chdir.
It doesn't care about strings either. It operates on a range of characters. If a function is just taking a value as input and isn't storing it or mutating its elements, then a range of characters works perfectly fine and is more flexible than any particular type - and IMHO shouldn't then be having overloads for particular ranges of characters or string types if we can avoid it.

If we're talking about a function that's really operating on a string as a string and doing things like appending as opposed to doing range-based operations, then maybe overloading for other string types makes sense rather than requiring an array of characters. But if it's just taking a string and reading it? That has no need to operate on strings specifically and should be operating on a range of characters - something that we've been moving towards with Phobos.

As such, I don't think that it generally makes sense for functions in Phobos to be explicitly accepting rcstring unless it's actually a range. If it's not actually a range, then such functions should already work with it by calling the appropriate function to get a range over it without needing to special-case anything.

- Jonathan M Davis
Jul 17 2018
prev sibling parent Andrea Fontana <nospam example.com> writes:
On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis wrote:
 If it's not a range by default, why would you expect _anything_ 
 which operates on ranges to work with rcstring directly? IMHO, 
 if it's not a range, then range-based functions shouldn't work 
 with it, and I don't see how they even _can_ work with it 
 unless you assume code units, or code points, or graphemes as 
 the default. If it's designed to not be a range, then it should 
 be up to the programmer to call the appropriate function on it 
 to get the appropriate range type for a particular use case, in 
 which case, you really shouldn't need to add much of any 
 overloads for it.

 - Jonathan M Davis
This makes sense to me too.
Jul 18 2018
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2018-07-17 17:21, Seb wrote:

 - _no_ range by default (it needs an explicit `.by!{d,w,}char`) (as in 
 no auto-decoding by default)
 
 What do you think about this approach? Do you have a better idea?
I vote for .by!char to be the default.

-- 
/Jacob Carlborg
Jul 17 2018
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 18/07/2018 5:41 AM, Jacob Carlborg wrote:
 On 2018-07-17 17:21, Seb wrote:
 
 - _no_ range by default (it needs an explicit `.by!{d,w,}char`) (as in 
 no auto-decoding by default)

 What do you think about this approach? Do you have a better idea?
I vote for .by!char to be the default.
I'm thinking .as!T

So we can cover ubyte/char/wchar/dchar and string/wstring/dstring all in one.

I think whatever we expose as the default for string/wstring/dstring however should be settable, e.g.

```
struct RCString(DefaultStringType = string)
{
    alias .as!DefaultStringType this;
}
```

Which is a perfect example of what my named parameter DIP is for ;)
Jul 17 2018
prev sibling parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 17 July 2018 at 17:41:05 UTC, Jacob Carlborg wrote:
 On 2018-07-17 17:21, Seb wrote:

 - _no_ range by default (it needs an explicit 
 `.by!{d,w,}char`) (as in no auto-decoding by default)
 
 What do you think about this approach? Do you have a better 
 idea?
I vote for .by!char to be the default.
The problem here is this would also lead to very confusing behavior for newcomers, e.g.

```
"ä".split.join("|") == �|�
```
Jul 18 2018
next sibling parent reply Eugene Wissner <belka caraus.de> writes:
On Wednesday, 18 July 2018 at 11:37:33 UTC, Seb wrote:
 On Tuesday, 17 July 2018 at 17:41:05 UTC, Jacob Carlborg wrote:
 On 2018-07-17 17:21, Seb wrote:

 - _no_ range by default (it needs an explicit 
 `.by!{d,w,}char`) (as in no auto-decoding by default)
 
 What do you think about this approach? Do you have a better 
 idea?
I vote for .by!char to be the default.
The problem here is this would also lead to very confusing behavior for newcomers, e.g.

```
"ä".split.join("|") == �|�
```
Therefore it shouldn't compile at all, but

rcstring("ä")[].split("|")

or

rcstring("ä").byCodePoint.split("|")
Jul 18 2018
parent sarn <sarn theartofmachinery.com> writes:
On Wednesday, 18 July 2018 at 12:03:02 UTC, Eugene Wissner wrote:
 Therefore it shouldn't compile at all, but

 rcstring("ä")[].split("|")

 or

 rcstring("ä").byCodePoint.split("|")
+1 to requiring an explicit byCodeUnit or whatever. For every "obvious" way to interpret a string as a range, you can find an application where the obvious code is surprisingly buggy.

(BTW, rcstring("ä").byCodePoint.split("|") is buggy for characters made of multiple code points. Canonicalisation doesn't fix it because many characters just don't have a single-codepoint form.)
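The code point/grapheme gap is easy to check with today's std.uni and std.range:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'g' + U+0308 combining diaeresis: no precomposed single-codepoint form
    auto s = "g\u0308";
    assert(s.walkLength == 2);            // two code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // one user-perceived character
}
```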
Jul 18 2018
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2018-07-18 13:37, Seb wrote:

 The problem here is this would also lead to very confusing behavior for 
 newcomers, e.g.
 
 ```
 "ä".split.join("|") == �|�
 ```
How about not giving access to operate on individual characters? If they need to do that, they should operate on an array of bytes. Too controversial?

-- 
/Jacob Carlborg
Jul 18 2018
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
 So we managed to revive the rcstring project and it's already a 
 PR for Phobos:

 [snip]
I'm glad this is getting worked on. It feels like something that D has been working towards for a while. Unfortunately, I haven't (yet) watched the collections video at DConf and don't see a presentation on the website. Because of that, I don't really understand some of the design decisions.

For instance, I also don't really understand how RCIAllocator is different from the old IAllocator (the documentation could use some work, IMO). It looks like RCIAllocator is part of what drives the reference counting semantics, but it also looks like Array has some support for reference counting, like addRef, that invokes RCIAllocator somehow. But Array also has some support for gc_allocator as the default, so my cursory examination suggests that Array is not really intended to be an RCArray...

So at that point I started wondering why not just have String as an alias of Array, akin to how D does it for dynamic arrays to strings currently. If there is stuff in rcstring now that isn't in Array, then that could be included in Array as a compile-time specialization for the relevant types (at the cost of bloating Array). And then leave it up to the user how to allocate.

I think part of the above design decision connects in with why rcstring stores the data as ubytes, even for wchar and dchar. Recent comments suggest that it is related to auto-decoding. My sense is that an rcstring that does not have auto-decoding, even if it requires more work to get working with Phobos, is a better solution over the long run.
Jul 17 2018
parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 17 July 2018 at 18:43:47 UTC, jmh530 wrote:
 On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
 So we managed to revive the rcstring project and it's already 
 a PR for Phobos:

 [snip]
I'm glad this is getting worked on. It feels like something that D has been working towards for a while. Unfortunately, I haven't (yet) watched the collections video at Dconf and don't see a presentation on the website. Because of that, I don't really understand some of the design decisions. For instance, I also don't really understand how RCIAllocator is different from the old IAllocator (the documentation could use some work, IMO). It looks like RCIAllocator is part of what drives the reference counting semantics,
Well, AFAICT the idea is that with RCIAllocator (or its convenience function allocatorObject) you can convert any allocator to a reference-counted one, e.g.

RCIAllocator a = allocatorObject(Mallocator.instance);
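For concreteness, a minimal self-contained use of that conversion (allocatorObject, RCIAllocator and Mallocator are existing std.experimental.allocator symbols):

```d
import std.experimental.allocator : allocatorObject, RCIAllocator;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // Wrap a plain allocator in a reference-counted interface object:
    RCIAllocator a = allocatorObject(Mallocator.instance);
    auto mem = a.allocate(64);
    scope (exit) a.deallocate(mem);
    assert(mem.length == 64);
}
```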
 but it also looks like Array has some support for reference 
 counting, like addRef, that invoke RCIAllocator somehow. But 
 Array also has some support for gc_allocator as the default, so 
 my cursory examination suggests that Array is not really 
 intended to be an RCArray...
Yes, Array is a reference-counted Array, but it also has a reference-counted allocator.
 So at that point I started wondering why not just have String 
 as an alias of Array, akin to how D does it for dynamic arrays 
 to strings currently. If there is stuff in rcstring now that 
 isn't in Array, then that could be included in Array as a 
 compile-time specialization for the relevant types (at the cost 
 of bloating Array). And then leave it up to the user how to 
 allocate.
There's lots of stuff in rcstring related to better interoperability with existing strings, e.g. you just want `"foo".rcstring == "foo"` to work.
 I think part of the above design decision connects in with why 
 rcstring stores the data as ubytes, even for wchar and dchar. 
 Recent comments suggest that it is related to auto-decoding.
Yes, rcstring doesn't do any auto-decoding and hence stores its data as a ubyte array.
 My sense is that an rcstring that does not have auto-decoding, 
 even if it requires more work to get working with phobos is a 
 better solution over the long-run.
What do you mean by this?
Jul 18 2018
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 18 July 2018 at 11:56:39 UTC, Seb wrote:
 [snip]

 I think part of the above design decision connects in with why 
 rcstring stores the data as ubytes, even for wchar and dchar. 
 Recent comments suggest that it is related to auto-decoding.
Yes rcstring doesn't do any auto-decoding and hence stores its data as an ubyte array.
 My sense is that an rcstring that does not have auto-decoding, 
 even if it requires more work to get working with phobos is a 
 better solution over the long-run.
What do you mean by this?
Just that there are a lot of complaints about D's auto-decoding of strings. Not doing any auto-decoding seems like a good long-run design decision, even if it makes some things more difficult.
Jul 18 2018
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 18 July 2018 at 11:56:39 UTC, Seb wrote:
 [snip]

 Yes, Array is a reference-counted Array, but it also has a 
 reference-counted allocator.
I see. Is it really a good idea to make the ownership/lifetime strategy part of the container? What happens when you want to make @nogc collections for lists, trees, etc.? You have to make multiple versions for unique/ref counted/some new strategy?

I would think it is more generic to have it be a separate wrapper that handles the ownership/lifetime strategy, like what exists in automem and C++'s smart pointers... though automem looks like it has a separate type for Unique_Array rather than including it in Unique... so I suppose that potentially has the same issue...
Jul 18 2018
prev sibling parent reply Jon Degenhardt <noreply noreply.com> writes:
On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
 So we managed to revive the rcstring project and it's already a 
 PR for Phobos:

 https://github.com/dlang/phobos/pull/6631 (still WIP though)

 The current approach in short:

 - uses the new @nogc, @safe and nothrow Array from the 
 collections library (check Eduardo's DConf18 talk)
 - uses reference counting
 - _no_ range by default (it needs an explicit `.by!{d,w,}char`) 
 (as in no auto-decoding by default)

 [snip]

 What do you think about this approach? Do you have a better 
 idea?
I don't know the goals/role rcstring is expected to play, especially wrt existing string/character facilities. Perhaps you could describe more?

Strings are central to many applications, so I'm wondering about things like whether rcstring is intended as a replacement for string that would be used by most new programs, and whether applications would use arrays and ranges of char together with rcstring, or rcstring would be used for everything.

Perhaps it's too early for these questions, and the current goal is simpler. For example, adding a meaningful collection class that is @nogc, @safe and ref-counted that can be used as a proving ground for the newer memory management facilities being developed. Such simpler goals would be quite reasonable.

What's got me wondering about the larger questions are the comments about ranges and autodecoding. If rcstring is intended as a vehicle for general @nogc handling of character data and/or for reducing the impact of autodecoding, then it makes sense to consider it from those perspectives.

--Jon
Jul 17 2018
parent reply Seb <seb wilzba.ch> writes:
On Wednesday, 18 July 2018 at 03:40:08 UTC, Jon Degenhardt wrote:
 On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
 So we managed to revive the rcstring project and it's already 
 a PR for Phobos:

 https://github.com/dlang/phobos/pull/6631 (still WIP though)

 The current approach in short:

 - uses the new @nogc, @safe and nothrow Array from the 
 collections library (check Eduardo's DConf18 talk)
 - uses reference counting
 - _no_ range by default (it needs an explicit 
 `.by!{d,w,}char`) (as in no auto-decoding by default)

 [snip]

 What do you think about this approach? Do you have a better 
 idea?
I don't know the goals/role rcstring is expected to play, especially wrt existing string/character facilities. Perhaps you could describe more?
Sorry for the brevity yesterday. One of the long-term pain points of D is that usage of string can't be @nogc. rcstring is intended to be a drop-in @nogc replacement everywhere string is currently used (that's the idea, at least).
 Strings are central to many applications, so I'm wondering 
 about things like whether rcstring is intended as a replacement 
 for string that would be used by most new programs,
Yes, that's the long-term goal: an opt-in @nogc string class. There's no plan to do something disruptive like replacing the `alias string = immutable(char)[];` declaration in druntime. However, we might move rcstring to druntime at some point, so that e.g. Exceptions or asserts can use @nogc strings.
 and whether applications would use arrays and ranges of char 
 together with rcstring, or rcstring would be used for 
 everything.
That point is still open for discussion, but at the moment rcstring isn't a range, and the user has to declare what kind of range he/she wants with e.g. `.by!char`. However, one current idea is that for some use cases (e.g. comparison) it might not matter, and an application could add overloads for rcstrings. The idea is to do the same thing for Phobos - though I have to say that I'm not really looking forward to adding 200 overloads to Phobos :/
 Perhaps its too early for these questions, and the current goal 
 is simpler. For example, adding a meaningful collection class 
 that is  nogc,  safe and ref-counted that be used as a proving 
 ground for the newer memory management facilities being 
 developed.
That's the long-term goal of the collections project. However, with rcstring being the first big use case for it, the idea was to push rcstring forward and by that discover all remaining issues with the Array class. Also the interface of rcstring is rather contained (and doesn't expose the underlying storage to the user), which allows us to iterate over/improve upon the Array design.
 Such simpler goals would be quite reasonable. What's got me 
 wondering about the larger questions are the comments about 
 ranges and autodecoding. If rcstring is intended as a vehicle 
 for general @nogc handling of character data and/or for 
 reducing the impact of autodecoding, then it makes sense to 
 consider from those perspectives.
Hehe, it's intended to solve both problems (auto-decoding by default and @nogc) at the same time. However, it looks to me like there isn't a good solution to the auto-decoding problem that is convenient for the user and doesn't sacrifice performance.
Jul 18 2018
parent reply aliak <something something.com> writes:
On Wednesday, 18 July 2018 at 12:10:04 UTC, Seb wrote:
 On Wednesday, 18 July 2018 at 03:40:08 UTC, Jon Degenhardt [...]
 and whether applications would use arrays and ranges of char 
 together with rcstring, or rcstring would be used for 
 everything.
That point is still open for discussion, but at the moment rcstring isn't a range and the user has to declare what kind of range he/she wants with e.g. `.by!char` However, one current idea is that for some use cases (e.g. comparison) it might not matter and an application could add overloads for rcstrings.
Maybe I misunderstood, but you mean that for comparisons only the encoding doesn't matter, right? But that does not preclude normalization; e.g. Unicode defines U+00F1 as equal to the sequence U+006E U+0303, and that comparison would only work as long as both sides are normalized (from what I understand at least), regardless of whether you compare char/wchar/dchars.
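For reference, std.uni.normalize (NFC by default) handles exactly this case today:

```d
import std.uni : normalize;

void main()
{
    string a = "\u00F1";  // precomposed ñ (U+00F1)
    string b = "n\u0303"; // 'n' followed by U+0303 combining tilde
    assert(a != b);                       // raw code units differ
    assert(normalize(a) == normalize(b)); // equal after NFC normalization
}
```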
 The current idea is to do the same this for Phobos - though I 
 have to say that I'm not really looking forward to adding 200 
 overloads to Phobos :/

 Perhaps its too early for these questions, and the current 
 goal is simpler. For example, adding a meaningful collection 
 class that is @nogc, @safe and ref-counted that can be used as a 
 proving ground for the newer memory management facilities 
 being developed.
That's the long-term goal of the collections project. However, with rcstring being the first big use case for it, the idea was to push rcstring forward and by that discover all remaining issues with the Array class. Also the interface of rcstring is rather contained (and doesn't expose the underlying storage to the user), which allows us to iterate over/improve upon the Array design.
 Such simpler goals would be quite reasonable. What's got me 
 wondering about the larger questions are the comments about 
 ranges and autodecoding. If rcstring is intended as a vehicle 
 for general  nogc handling of character data and/or for 
 reducing the impact of autodecoding, then it makes sense to 
 consider from those perspectives.
Hehe, it's intended to solve both problems (auto-decoding by default and @nogc) at the same time. However, it looks to me like there isn't a good solution to the auto-decoding problem that is convenient for the user and doesn't sacrifice performance.
How about a compile time flag that can make things more convenient:

auto str1 = latin1("literal");

rcstring!Latin1 latin1string(string str)
{
    return rcstring!Latin1(str);
}

auto str2 = utf8("åsm");

// ...

struct rcstring(Encoding = Unknown)
{
    ubyte[] data;
    bool normalized = false;

    static if (is(Encoding == Latin1))
    {
        // by char range interface implementation
    }
    else static if (is(Encoding == Utf8))
    {
        // byGrapheme range interface implementation?
    }
    else
    {
        // no range interface implementation
    }

    bool opEquals()(auto ref const rcstring lhs) const
    {
        static if (is(Encoding == Latin1))
        {
            return data == lhs.data; // single-byte encoding: bytes suffice
        }
        else
        {
            return normalized() == lhs.normalized();
        }
    }
}

And now most ranges will work correctly. And then some of the algorithms that don't need to use byGrapheme but just need normalized code points to work correctly can do that, and that seems like all the special handling you'll need inside range algorithms?

Then:

readText("foo".latin1);

"ä".utf8.split.join("|");

??

Cheers,
- Ali
Jul 18 2018
parent Radu <void null.pt> writes:
On Wednesday, 18 July 2018 at 22:44:33 UTC, aliak wrote:
 On Wednesday, 18 July 2018 at 12:10:04 UTC, Seb wrote:
 On Wednesday, 18 July 2018 at 03:40:08 UTC, Jon Degenhardt 
 [...]
 [...]
That point is still open for discussion, but at the moment rcstring isn't a range and the user has to declare what kind of range he/she wants with e.g. `.by!char` However, one current idea is that for some use cases (e.g. comparison) it might not matter and an application could add overloads for rcstrings.
Maybe I misunderstood but you mean that for comparisons the encoding doesn't matter only right? But that does not preclude normalization, e.g. unicode defines U+00F1 as equal to the sequence U+006E U+0303 and that would work as long as they're normalized (from what I understand at least) and regardless of whether you compare char/wchar/dchars.
 The current idea is to do the same this for Phobos - though I 
 have to say that I'm not really looking forward to adding 200 
 overloads to Phobos :/

 [...]
That's the long-term goal of the collections project. However, with rcstring being the first big use case for it, the idea was to push rcstring forward and by that discover all remaining issues with the Array class. Also the interface of rcstring is rather contained (and doesn't expose the underlying storage to the user), which allows us to iterate over/improve upon the Array design.
 [...]
Hehe, it's intended to solve both problems (auto-decoding by default and nogc) at the same time. However, it looks like to me like there isn't a good solution to the auto-decoding problem that is convenient to use for the user and doesn't sacrifice on performance.
 [...]
I like this approach; `rcstring.by!` is too verbose for my taste and quite annoying for day-to-day usage.

I think rcstring should be aliased by concrete implementations like ansi, utf8, utf16, utf32. Those aliases should be ranges and maybe subtype their respective string, wstring, dstring so they can be transparently used for non-range based APIs (this requires DIP1000 for @safe).

The takeaway is that rcstring by itself does not satisfy the usability criteria, and should probably focus on performance and flexibility to be used as a building block for higher level constructs that are easier to use and safer in regards to how they work with the string type they hold.
Jul 19 2018