
digitalmars.D - std.experimental.collections.rcstring and its integration in Phobos

reply Seb <seb wilzba.ch> writes:
So we managed to revive the rcstring project and it's already a 
PR for Phobos:

https://github.com/dlang/phobos/pull/6631 (still WIP though)

The current approach in short:

- uses the new @nogc, @safe and nothrow Array from the 
collections library (check Eduardo's DConf18 talk)
- uses reference counting
- _no_ range by default (it needs an explicit `.by!{d,w,}char`) 
(as in no auto-decoding by default)
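To make the "no range by default" point concrete, usage might look like the sketch below. This is illustrative only: `rcstring` and `by!` are the WIP PR's names and may still change; `canFind` is the existing std.algorithm function.

```d
import std.algorithm.searching : canFind;

void main()
{
    auto s = "Hello, Mars".rcstring; // reference-counted storage
    // rcstring is deliberately NOT a range; pick an iteration scheme first:
    assert(s.by!char.canFind('M'));  // iterate by UTF-8 code units
    // s.by!wchar -> UTF-16 code units, s.by!dchar -> decoded code points
}
```

Nothing here is final; the point is only that no range operation works until `by!…` chooses code units, UTF-16 units, or code points.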

Still to be done:

- integration in Phobos (the current idea is to generate 
additional overloads for rcstring)
- performance
- use of static immutable rcstring in fully @nogc code
- extensive testing

Especially the "seamless" integration in Phobos will be 
challenging.
I made a rough listing of all symbols that one would expect to be 
usable with an rcstring type 
(https://gist.github.com/wilzbach/d74712269f889827cff6b2c7a08d07f8). It's more
than 200.
As rcstring isn't a range by default, but one expects 
`"foo".rcstring.equal("foo")` to work, overloads for all these 
symbols would need to be added.
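A generated overload could look roughly like this sketch (hypothetical shape; `by!char` as in the PR, `byCodeUnit` is the real std.utf function):

```d
import std.algorithm.comparison : equal;
import std.utf : byCodeUnit;

// Hypothetical generated overload: compare an rcstring against a string
// by UTF-8 code units, so `"foo".rcstring.equal("foo")` compiles even
// though rcstring itself is not a range.
bool equal(rcstring a, string b)
{
    return .equal(a.by!char, b.byCodeUnit);
}
```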

What do you think about this approach? Do you have a better idea?
Jul 17 2018
next sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, July 17, 2018 15:21:30 Seb via Digitalmars-d wrote:
 So we managed to revive the rcstring project and it's already a
 PR for Phobos:

 https://github.com/dlang/phobos/pull/6631 (still WIP though)

 The current approach in short:

 - uses the new @nogc, @safe and nothrow Array from the
 collections library (check Eduardo's DConf18 talk)
 - uses reference counting
 - _no_ range by default (it needs an explicit `.by!{d,w,}char`)
 (as in no auto-decoding by default)

 Still to be done:

 - integration in Phobos (the current idea is to generate
 additional overloads for rcstring)
 - performance
 - use of static immutable rcstring in fully @nogc
 - extensive testing

 Especially the "seamless" integration in Phobos will be
 challenging.
 I made a rough listing of all symbols that one would expect to be
 usable with an rcstring type
 (https://gist.github.com/wilzbach/d74712269f889827cff6b2c7a08d07f8). It's
 more than 200. As rcstring isn't a range by default, but one expects
 `"foo".rcstring.equal("foo")` to work, overloads for all these
 symbols would need to be added.

 What do you think about this approach? Do you have a better idea?
If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? IMHO, if it's not a range, then range-based functions shouldn't work with it, and I don't see how they even _can_ work with it unless you assume code units, or code points, or graphemes as the default. If it's designed to not be a range, then it should be up to the programmer to call the appropriate function on it to get the appropriate range type for a particular use case, in which case, you really shouldn't need to add much of any overloads for it. - Jonathan M Davis
Jul 17 2018
next sibling parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis wrote:
 On Tuesday, July 17, 2018 15:21:30 Seb via Digitalmars-d wrote:
 [...]
If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? IMHO, if it's not a range, then range-based functions shouldn't work with it, and I don't see how they even _can_ work with it unless you assume code units, or code points, or graphemes as the default. If it's designed to not be a range, then it should be up to the programmer to call the appropriate function on it to get the appropriate range type for a particular use case, in which case, you really shouldn't need to add much of any overloads for it. - Jonathan M Davis
Well, there are a few cases where the range type doesn't matter and one can simply compare bytes, e.g.

- equal (e.g. "ä" == "ä" <=> [195, 164] == [195, 164])
- commonPrefix
- find
- ...

Of course this assumes that there's no normalization necessary, but the current auto-decoding assumes this too.
Jul 17 2018
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, July 17, 2018 17:28:19 Seb via Digitalmars-d wrote:
 On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis wrote:
 On Tuesday, July 17, 2018 15:21:30 Seb via Digitalmars-d wrote:
 [...]
If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? IMHO, if it's not a range, then range-based functions shouldn't work with it, and I don't see how they even _can_ work with it unless you assume code units, or code points, or graphemes as the default. If it's designed to not be a range, then it should be up to the programmer to call the appropriate function on it to get the appropriate range type for a particular use case, in which case, you really shouldn't need to add much of any overloads for it. - Jonathan M Davis
Well, there are a few cases where the range type doesn't matter and one can simply compare bytes, e.g.

- equal (e.g. "ä" == "ä" <=> [195, 164] == [195, 164])
- commonPrefix
- find
- ...
That effectively means treating rcstring as a range of char by default rather than not treating it as a range by default. And if we then do that only with functions that overload on rcstring rather than making rcstring actually a range of char, then why aren't we just treating it as a range of char in general?

IMHO, the fact that so many algorithms currently special-case on arrays of characters is one reason that auto-decoding has been a disaster, and adding a bunch of overloads for rcstring is just compounding the problem. Algorithms should properly support arbitrary ranges of characters, and then rcstring can be passed to them by calling one of the functions on it to get a range of code units, code points, or graphemes to get an actual range - either that, or rcstring should default to being a range of char. Going halfway and making it work with some functions via overloads really doesn't make sense.

Now, if we're talking about functions that really operate on strings and not ranges of characters (and thus do stuff like append), then that becomes a different question, but we've mostly been trying to move away from functions like that in Phobos.
 Of course this assumes that there's no normalization necessary,
 but the current auto-decoding assumes this too.
You can still normalize with auto-decoding (the code units - and thus code points - are in a specific order even when encoded, and that order can be normalized), and really, anyone who wants fully correct string comparisons needs to be normalizing their strings. With that in mind, rcstring probably should support normalization of its internal representation. - Jonathan M Davis
Jul 17 2018
parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 17 July 2018 at 18:09:13 UTC, Jonathan M Davis wrote:
 On Tuesday, July 17, 2018 17:28:19 Seb via Digitalmars-d wrote:
 On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis 
 wrote:
 [...]
Well, there are a few cases where the range type doesn't matter and one can simply compare bytes, e.g.

- equal (e.g. "ä" == "ä" <=> [195, 164] == [195, 164])
- commonPrefix
- find
- ...
That effectively means treating rcstring as a range of char by default rather than not treating it as a range by default. And if we then do that only with functions that overload on rcstring rather than making rcstring actually a range of char, then why aren't we just treating it as a range of char in general?

IMHO, the fact that so many algorithms currently special-case on arrays of characters is one reason that auto-decoding has been a disaster, and adding a bunch of overloads for rcstring is just compounding the problem. Algorithms should properly support arbitrary ranges of characters, and then rcstring can be passed to them by calling one of the functions on it to get a range of code units, code points, or graphemes to get an actual range - either that, or rcstring should default to being a range of char. Going halfway and making it work with some functions via overloads really doesn't make sense.
Well, the problem with it being a range of char is that this might lead to very confusing behavior, e.g.

"ä".rcstring.split.join("|") == �|�

So we probably shouldn't go this route either.

The idea of adding overloads was to introduce a bit of user convenience, so that they don't have to write

readText("foo".rcstring.by!char)

all the time.
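The mojibake above falls out of UTF-8 itself, and plain Phobos shows why (std.string.representation and std.utf.byCodeUnit are existing APIs; only the rcstring line above is from the WIP PR):

```d
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    // "ä" is one character but two UTF-8 code units:
    assert("ä".representation == [0xC3, 0xA4]);
    assert("ä".byCodeUnit.length == 2); // range of char, no auto-decoding
    // Splitting between those two units leaves [0xC3] and [0xA4], and
    // neither half is valid UTF-8 on its own - hence the "�" output.
}
```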
 You can still normalize with auto-decoding (the code units - 
 and thus code points - are in a specific order even when 
 encoded, and that order can be normalized), and really, anyone 
 who wants fully correct string comparisons needs to be 
 normalizing their strings. With that in mind, rcstring probably 
 should support normalization of its internal representation.
It currently doesn't support this out of the box, but it's a very valid point and I added it to the list.
Jul 18 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Wednesday, July 18, 2018 12:15:52 Seb via Digitalmars-d wrote:
 Well, the problem of it being a range of char is that this might
 lead to very confusing behavior, e.g.

 "ä".rcstring.split.join("|") == �|�

 So we probably shouldn't go this route either.
I don't know. I'm fine with it not being a range and leaving it up to the programmer, but part of the point here is that the programmer needs to understand Unicode well enough to be able to do the right thing in cases like this or they're screwed anyway. And if strings (of any variety) operate as ranges of code units by default, the fact that there's a problem when someone screws it up is going to be a lot more obvious.

Forcing people to call a function like by!char or by!dchar still requires that they deal with all of this. It just makes it explicit. And that's not necessarily a bad idea, but if someone is going to be confused by something like split splitting in the middle of code points, they're going to be in trouble with the by function anyway.
 The idea of adding overloads was to introduce a bit of
 user-convenience, s.t. they don't have to say

 readText("foo".rcstring.by!char)

 all the time.
They wouldn't be doing anything that verbose anyway. In that case, you'd just pass the string literal. At most, they'd be doing something like

readText(str.by!char);

And of course, readText is definitely _not_ @nogc.

But regardless, these are functions that are designed to be generic and take ranges of characters rather than strings, and adding overloads for specific types just because we don't want to call the function to get a range over them seems like it's going in totally the wrong direction. It means adding a lot of overloads, and we already have quite a mess thanks to all of the special-casing that we have to avoid auto-decoding without getting into adding yet another set of overloads for rcstring. We've put in the effort to genericize these functions and make many of them work with ranges of characters rather than strings, and I really don't think that we should start adding overloads for specific string types just because we don't want to treat them as ranges directly.

I'd honestly rather see an rcstring type that was just treated as a range of char than see us adding overloads for rcstring. That's what arrays of char should have been treated as in the first place, and we already have to do stuff like byCodeUnit for strings anyway, so having to do by!char or by!dchar really doesn't seem like a big deal to me - especially if the alternative is adding a bunch of overloads.

- Jonathan M Davis
Jul 18 2018
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/17/18 12:58 PM, Jonathan M Davis wrote:
 If it's not a range by default, why would you expect_anything_  which
 operates on ranges to work with rcstring directly?
Many functions do not care about the range aspect, but do care about the string aspect. Consider e.g. chdir.
Jul 17 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Tuesday, July 17, 2018 22:45:33 Andrei Alexandrescu via Digitalmars-d 
wrote:
 On 7/17/18 12:58 PM, Jonathan M Davis wrote:
 If it's not a range by default, why would you expect_anything_  which
 operates on ranges to work with rcstring directly?
Many functions do not care about the range aspect, but do care about the string aspect. Consider e.g. chdir.
It doesn't care about strings either. It operates on a range of characters. If a function is just taking a value as input and isn't storing it or mutating its elements, then a range of characters works perfectly fine and is more flexible than any particular type - and IMHO shouldn't then be having overloads for particular ranges of characters or string types if we can avoid it.

If we're talking about a function that's really operating on a string as a string and doing things like appending as opposed to doing range-based operations, then maybe overloading for other string types makes sense rather than requiring an array of characters. But if it's just taking a string and reading it? That has no need to operate on strings specifically and should be operating on a range of characters - something that we've been moving towards with Phobos.

As such, I don't think that it generally makes sense for functions in Phobos to be explicitly accepting rcstring unless it's actually a range. If it's not actually a range, then such functions should already work with it by calling the appropriate function to get a range over it without needing to special-case anything.

- Jonathan M Davis
Jul 17 2018
prev sibling parent Andrea Fontana <nospam example.com> writes:
On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis wrote:
 If it's not a range by default, why would you expect _anything_ 
 which operates on ranges to work with rcstring directly? IMHO, 
 if it's not a range, then range-based functions shouldn't work 
 with it, and I don't see how they even _can_ work with it 
 unless you assume code units, or code points, or graphemes as 
 the default. If it's designed to not be a range, then it should 
 be up to the programmer to call the appropriate function on it 
 to get the appropriate range type for a particular use case, in 
 which case, you really shouldn't need to add much of any 
 overloads for it.

 - Jonathan M Davis
This makes sense to me too.
Jul 18 2018
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2018-07-17 17:21, Seb wrote:

 - _no_ range by default (it needs an explicit `.by!{d,w,}char`) (as in 
 no auto-decoding by default)
 
 What do you think about this approach? Do you have a better idea?
I vote for .by!char to be the default.

-- 
/Jacob Carlborg
Jul 17 2018
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 18/07/2018 5:41 AM, Jacob Carlborg wrote:
 On 2018-07-17 17:21, Seb wrote:
 
 - _no_ range by default (it needs an explicit `.by!{d,w,}char`) (as in 
 no auto-decoding by default)

 What do you think about this approach? Do you have a better idea?
I vote for .by!char to be the default.
I'm thinking .as!T

So we can cover ubyte/char/wchar/dchar and string/wstring/dstring all in one.

I think whatever we expose as the default for string/wstring/dstring however should be settable, e.g.

```
struct RCString(DefaultStringType = string)
{
    alias .as!DefaultStringType this;
}
```

Which is a perfect example of what my named parameter DIP is for ;)
Jul 17 2018
prev sibling parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 17 July 2018 at 17:41:05 UTC, Jacob Carlborg wrote:
 On 2018-07-17 17:21, Seb wrote:

 - _no_ range by default (it needs an explicit 
 `.by!{d,w,}char`) (as in no auto-decoding by default)
 
 What do you think about this approach? Do you have a better 
 idea?
I vote for .by!char to be the default.
The problem here is this would also lead to very confusing behavior for newcomers, e.g.

```
"ä".split.join("|") == �|�
```
Jul 18 2018
next sibling parent reply Eugene Wissner <belka caraus.de> writes:
On Wednesday, 18 July 2018 at 11:37:33 UTC, Seb wrote:
 On Tuesday, 17 July 2018 at 17:41:05 UTC, Jacob Carlborg wrote:
 On 2018-07-17 17:21, Seb wrote:

 - _no_ range by default (it needs an explicit 
 `.by!{d,w,}char`) (as in no auto-decoding by default)
 
 What do you think about this approach? Do you have a better 
 idea?
I vote for .by!char to be the default.
The problem here is this would also lead to very confusing behavior for newcomers, e.g.

```
"ä".split.join("|") == �|�
```
Therefore it shouldn't compile at all, but

rcstring("ä")[].split("|")

or

rcstring("ä").byCodePoint.split("|")
Jul 18 2018
parent sarn <sarn theartofmachinery.com> writes:
On Wednesday, 18 July 2018 at 12:03:02 UTC, Eugene Wissner wrote:
 Therefore it shouldn't compile at all, but

 rcstring("ä")[].split("|")

 or

 rcstring("ä").byCodePoint.split("|")
+1 to requiring an explicit byCodeUnit or whatever. For every "obvious" way to interpret a string as a range, you can find an application where the obvious code is surprisingly buggy.

(BTW, rcstring("ä").byCodePoint.split("|") is buggy for characters made of multiple code points. Canonicalisation doesn't fix it because many characters just don't have a single-codepoint form.)
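The code point/grapheme gap is easy to check with today's std.uni and std.range:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'g' + U+0308 combining diaeresis: no precomposed single-codepoint form
    auto s = "g\u0308";
    assert(s.walkLength == 2);            // two code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // one user-perceived character
}
```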
Jul 18 2018
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2018-07-18 13:37, Seb wrote:

 The problem here is this would also lead to very confusing behavior for 
 newcomers, e.g.
 
 ```
 "ä".split.join("|") == �|�
 ```
How about not giving access to operate on individual characters? If they need to do that, they should operate on an array of bytes. Too controversial?

-- 
/Jacob Carlborg
Jul 18 2018
prev sibling next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
 So we managed to revive the rcstring project and it's already a 
 PR for Phobos:

 [snip]
I'm glad this is getting worked on. It feels like something that D has been working towards for a while. Unfortunately, I haven't (yet) watched the collections video at DConf and don't see a presentation on the website. Because of that, I don't really understand some of the design decisions.

For instance, I also don't really understand how RCIAllocator is different from the old IAllocator (the documentation could use some work, IMO). It looks like RCIAllocator is part of what drives the reference counting semantics, but it also looks like Array has some support for reference counting, like addRef, that invokes RCIAllocator somehow. But Array also has some support for gc_allocator as the default, so my cursory examination suggests that Array is not really intended to be an RCArray...

So at that point I started wondering why not just have String as an alias of Array, akin to how D does it for dynamic arrays to strings currently. If there is stuff in rcstring now that isn't in Array, then that could be included in Array as a compile-time specialization for the relevant types (at the cost of bloating Array). And then leave it up to the user how to allocate.

I think part of the above design decision connects in with why rcstring stores the data as ubytes, even for wchar and dchar. Recent comments suggest that it is related to auto-decoding. My sense is that an rcstring that does not have auto-decoding, even if it requires more work to get working with Phobos, is a better solution over the long run.
Jul 17 2018
parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 17 July 2018 at 18:43:47 UTC, jmh530 wrote:
 On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
 So we managed to revive the rcstring project and it's already 
 a PR for Phobos:

 [snip]
I'm glad this is getting worked on. It feels like something that D has been working towards for a while. Unfortunately, I haven't (yet) watched the collections video at Dconf and don't see a presentation on the website. Because of that, I don't really understand some of the design decisions. For instance, I also don't really understand how RCIAllocator is different from the old IAllocator (the documentation could use some work, IMO). It looks like RCIAllocator is part of what drives the reference counting semantics,
Well, AFAICT the idea is that with RCIAllocator (or its convenience function allocatorObject) you can convert any allocator to a reference-counted one, e.g.

RCIAllocator a = allocatorObject(Mallocator.instance);
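For concreteness, a minimal self-contained use of that conversion (allocatorObject, RCIAllocator and Mallocator are existing std.experimental.allocator symbols):

```d
import std.experimental.allocator : allocatorObject, RCIAllocator;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // Wrap a plain allocator in a reference-counted interface object:
    RCIAllocator a = allocatorObject(Mallocator.instance);
    auto mem = a.allocate(64);
    scope (exit) a.deallocate(mem);
    assert(mem.length == 64);
}
```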
 but it also looks like Array has some support for reference 
 counting, like addRef, that invoke RCIAllocator somehow. But 
 Array also has some support for gc_allocator as the default, so 
 my cursory examination suggests that Array is not really 
 intended to be an RCArray...
Yes, Array is a reference-counted Array, but it also has a reference-counted allocator.
 So at that point I started wondering why not just have String 
 as an alias of Array, akin to how D does it for dynamic arrays 
 to strings currently. If there is stuff in rcstring now that 
 isn't in Array, then that could be included in Array as a 
 compile-time specialization for the relevant types (at the cost 
 of bloating Array). And then leave it up to the user how to 
 allocate.
There's lots of stuff in rcstring related to better interoperability with existing strings, e.g. you just want `"foo".rcstring == "foo"` to work.
 I think part of the above design decision connects in with why 
 rcstring stores the data as ubytes, even for wchar and dchar. 
 Recent comments suggest that it is related to auto-decoding.
Yes, rcstring doesn't do any auto-decoding and hence stores its data as a ubyte array.
 My sense is that an rcstring that does not have auto-decoding, 
 even if it requires more work to get working with phobos is a 
 better solution over the long-run.
What do you mean by this?
Jul 18 2018
next sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 18 July 2018 at 11:56:39 UTC, Seb wrote:
 [snip]

 I think part of the above design decision connects in with why 
 rcstring stores the data as ubytes, even for wchar and dchar. 
 Recent comments suggest that it is related to auto-decoding.
Yes rcstring doesn't do any auto-decoding and hence stores its data as an ubyte array.
 My sense is that an rcstring that does not have auto-decoding, 
 even if it requires more work to get working with phobos is a 
 better solution over the long-run.
What do you mean by this?
Just that there are a lot of complaints about D's auto-decoding of strings. Not doing any auto-decoding seems like a good long-run design decision, even if it makes some things more difficult.
Jul 18 2018
prev sibling parent jmh530 <john.michael.hall gmail.com> writes:
On Wednesday, 18 July 2018 at 11:56:39 UTC, Seb wrote:
 [snip]

 Yes, Array is a reference-counted Array, but it also has a 
 reference-counted allocator.
I see. Is it really a good idea to make the ownership/lifetime strategy part of the container? What happens when you want to make @nogc collections for lists, trees, etc.? You have to make multiple versions for unique/ref counted/some new strategy?

I would think it is more generic to have it be a separate wrapper that handles the ownership/lifetime strategy, like what exists in automem and C++'s smart pointers... though automem looks like it has a separate type for Unique_Array rather than including it in Unique... so I suppose that potentially has the same issue...
Jul 18 2018
prev sibling parent reply Jon Degenhardt <noreply noreply.com> writes:
On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
 So we managed to revive the rcstring project and it's already a 
 PR for Phobos:

 https://github.com/dlang/phobos/pull/6631 (still WIP though)

 The current approach in short:

 - uses the new @nogc, @safe and nothrow Array from the 
 collections library (check Eduardo's DConf18 talk)
 - uses reference counting
 - _no_ range by default (it needs an explicit `.by!{d,w,}char`) 
 (as in no auto-decoding by default)

 [snip]

 What do you think about this approach? Do you have a better 
 idea?
I don't know the goals/role rcstring is expected to play, especially wrt existing string/character facilities. Perhaps you could describe more?

Strings are central to many applications, so I'm wondering about things like whether rcstring is intended as a replacement for string that would be used by most new programs, and whether applications would use arrays and ranges of char together with rcstring, or rcstring would be used for everything.

Perhaps it's too early for these questions, and the current goal is simpler. For example, adding a meaningful collection class that is @nogc, @safe and ref-counted that can be used as a proving ground for the newer memory management facilities being developed. Such simpler goals would be quite reasonable.

What's got me wondering about the larger questions are the comments about ranges and autodecoding. If rcstring is intended as a vehicle for general @nogc handling of character data and/or for reducing the impact of autodecoding, then it makes sense to consider it from those perspectives.

--Jon
Jul 17 2018
parent reply Seb <seb wilzba.ch> writes:
On Wednesday, 18 July 2018 at 03:40:08 UTC, Jon Degenhardt wrote:
 On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
 So we managed to revive the rcstring project and it's already 
 a PR for Phobos:

 https://github.com/dlang/phobos/pull/6631 (still WIP though)

 The current approach in short:

 - uses the new @nogc, @safe and nothrow Array from the 
 collections library (check Eduardo's DConf18 talk)
 - uses reference counting
 - _no_ range by default (it needs an explicit 
 `.by!{d,w,}char`) (as in no auto-decoding by default)

 [snip]

 What do you think about this approach? Do you have a better 
 idea?
I don't know the goals/role rcstring is expected to play, especially wrt existing string/character facilities. Perhaps you could describe more?
Sorry for the brevity yesterday. One of the long-term pain points of D is that usage of string can't be @nogc. rcstring is intended to be a drop-in @nogc replacement everywhere string is currently used (that's the idea, at least).
 Strings are central to many applications, so I'm wondering 
 about things like whether rcstring is intended as a replacement 
 for string that would be used by most new programs,
Yes, that's the long-term goal: an opt-in @nogc string class. There's no plan to do something disruptive like replacing the `alias string = immutable(char)[];` declaration in druntime. However, we might move rcstring to druntime at some point, so that e.g. Exceptions or asserts can use @nogc strings.
 and whether applications would use arrays and ranges of char 
 together with rcstring, or rcstring would be used for 
 everything.
That point is still open for discussion, but at the moment rcstring isn't a range, and the user has to declare what kind of range he/she wants with e.g. `.by!char`. However, one current idea is that for some use cases (e.g. comparison) it might not matter, and an application could add overloads for rcstrings. The idea is to do the same thing for Phobos - though I have to say that I'm not really looking forward to adding 200 overloads to Phobos :/
 Perhaps its too early for these questions, and the current goal 
 is simpler. For example, adding a meaningful collection class 
 that is  nogc,  safe and ref-counted that be used as a proving 
 ground for the newer memory management facilities being 
 developed.
That's the long-term goal of the collections project. However, with rcstring being the first big use case for it, the idea was to push rcstring forward and by that discover all remaining issues with the Array class. Also the interface of rcstring is rather contained (and doesn't expose the underlying storage to the user), which allows us to iterate over/improve upon the Array design.
 Such simpler goals would be quite reasonable. What's got me 
 wondering about the larger questions are the comments about 
 ranges and autodecoding. If rcstring is intended as a vehicle 
 for general @nogc handling of character data and/or for 
 reducing the impact of autodecoding, then it makes sense to 
 consider from those perspectives.
Hehe, it's intended to solve both problems (auto-decoding by default and @nogc) at the same time. However, it looks to me like there isn't a good solution to the auto-decoding problem that is convenient for the user and doesn't sacrifice performance.
Jul 18 2018
parent reply aliak <something something.com> writes:
On Wednesday, 18 July 2018 at 12:10:04 UTC, Seb wrote:
 On Wednesday, 18 July 2018 at 03:40:08 UTC, Jon Degenhardt [...]
 and whether applications would use arrays and ranges of char 
 together with rcstring, or rcstring would be used for 
 everything.
That point is still open for discussion, but at the moment rcstring isn't a range and the user has to declare what kind of range he/she wants with e.g. `.by!char` However, one current idea is that for some use cases (e.g. comparison) it might not matter and an application could add overloads for rcstrings.
Maybe I misunderstood, but you mean that for comparisons only the encoding doesn't matter, right? But that does not preclude normalization; e.g. Unicode defines U+00F1 as equal to the sequence U+006E U+0303, and that comparison would only work as long as both sides are normalized (from what I understand at least), regardless of whether you compare char/wchar/dchars.
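For reference, std.uni.normalize (NFC by default) handles exactly this case today:

```d
import std.uni : normalize;

void main()
{
    string a = "\u00F1";  // precomposed ñ (U+00F1)
    string b = "n\u0303"; // 'n' followed by U+0303 combining tilde
    assert(a != b);                       // raw code units differ
    assert(normalize(a) == normalize(b)); // equal after NFC normalization
}
```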
 The current idea is to do the same this for Phobos - though I 
 have to say that I'm not really looking forward to adding 200 
 overloads to Phobos :/

 Perhaps its too early for these questions, and the current 
 goal is simpler. For example, adding a meaningful collection 
 class that is @nogc, @safe and ref-counted that can be used as a 
 proving ground for the newer memory management facilities 
 being developed.
That's the long-term goal of the collections project. However, with rcstring being the first big use case for it, the idea was to push rcstring forward and by that discover all remaining issues with the Array class. Also the interface of rcstring is rather contained (and doesn't expose the underlying storage to the user), which allows us to iterate over/improve upon the Array design.
 Such simpler goals would be quite reasonable. What's got me 
 wondering about the larger questions are the comments about 
 ranges and autodecoding. If rcstring is intended as a vehicle 
 for general  nogc handling of character data and/or for 
 reducing the impact of autodecoding, then it makes sense to 
 consider from those perspectives.
Hehe, it's intended to solve both problems (auto-decoding by default and @nogc) at the same time. However, it looks to me like there isn't a good solution to the auto-decoding problem that is convenient for the user and doesn't sacrifice performance.
How about a compile time flag that can make things more convenient:

auto str1 = latin1("literal");

rcstring!Latin1 latin1string(string str)
{
    return rcstring!Latin1(str);
}

auto str2 = utf8("åsm");

// ...

struct rcstring(Encoding = Unknown)
{
    ubyte[] data;
    bool normalized = false;

    static if (is(Encoding == Latin1))
    {
        // by char range interface implementation
    }
    else static if (is(Encoding == Utf8))
    {
        // byGrapheme range interface implementation?
    }
    else
    {
        // no range interface implementation
    }

    bool opEquals()(auto ref const rcstring lhs) const
    {
        static if (is(Encoding == Latin1))
        {
            return data == lhs.data; // single-byte encoding: bytes suffice
        }
        else
        {
            return normalized() == lhs.normalized();
        }
    }
}

And now most ranges will work correctly. And then some of the algorithms that don't need to use byGrapheme but just need normalized code points to work correctly can do that, and that seems like all the special handling you'll need inside range algorithms?

Then:

readText("foo".latin1);

"ä".utf8.split.join("|");

??

Cheers,
- Ali
Jul 18 2018
parent Radu <void null.pt> writes:
On Wednesday, 18 July 2018 at 22:44:33 UTC, aliak wrote:
 On Wednesday, 18 July 2018 at 12:10:04 UTC, Seb wrote:
 On Wednesday, 18 July 2018 at 03:40:08 UTC, Jon Degenhardt 
 [...]
 [...]
That point is still open for discussion, but at the moment rcstring isn't a range and the user has to declare what kind of range he/she wants with e.g. `.by!char` However, one current idea is that for some use cases (e.g. comparison) it might not matter and an application could add overloads for rcstrings.
Maybe I misunderstood but you mean that for comparisons the encoding doesn't matter only right? But that does not preclude normalization, e.g. unicode defines U+00F1 as equal to the sequence U+006E U+0303 and that would work as long as they're normalized (from what I understand at least) and regardless of whether you compare char/wchar/dchars.
 The current idea is to do the same this for Phobos - though I 
 have to say that I'm not really looking forward to adding 200 
 overloads to Phobos :/

 [...]
That's the long-term goal of the collections project. However, with rcstring being the first big use case for it, the idea was to push rcstring forward and by that discover all remaining issues with the Array class. Also the interface of rcstring is rather contained (and doesn't expose the underlying storage to the user), which allows us to iterate over/improve upon the Array design.
 [...]
Hehe, it's intended to solve both problems (auto-decoding by default and nogc) at the same time. However, it looks like to me like there isn't a good solution to the auto-decoding problem that is convenient to use for the user and doesn't sacrifice on performance.
 [...]
I like this approach; `rcstring.by!` is too verbose for my taste and quite annoying for day-to-day usage.

I think rcstring should be aliased by concrete implementations like ansi, utf8, utf16, utf32. Those aliases should be ranges and maybe subtype their respective string, wstring, dstring so they can be transparently used for non-range based APIs (this requires DIP1000 for @safe).

The takeaway is that rcstring by itself does not satisfy the usability criteria, and should probably focus on performance and flexibility to be used as a building block for higher level constructs that are easier to use and safer in regards to how they work with the string type they hold.
Jul 19 2018