digitalmars.D - Relaxing the definition of isSomeString and isNarrowString
- Andrei Alexandrescu (14/14) Aug 23 2014 Currently char[], wchar[], dchar[] and qualified variants fulfill the
- H. S. Teoh via Digitalmars-d (11/28) Aug 23 2014 [...]
- Andrei Alexandrescu (79/104) Aug 24 2014 I've read the thread now, thanks. I think that discussion has lost
- "Marc Schütz" <schuetzm gmx.net> (6/18) Aug 24 2014 Any reason why this is restricted to strings? I.e., is there
- Kiith-Sa (5/23) Aug 24 2014 We already have an 'RCArray' (std.container.Array) although it's
- "Marc Schütz" <schuetzm gmx.net> (7/31) Aug 24 2014 Well, if `isSomeString` were extended as Andrei suggests (and the
- Andrei Alexandrescu (4/7) Aug 24 2014 It uses RefCounted:
- Andrew Godfrey (11/11) Aug 24 2014 The OP and the question of auto-decoding share the same root
- "Marc Schütz" <schuetzm gmx.net> (10/20) Aug 24 2014 ... of course this cannot be guaranteed. In fact, even the
- ketmar via Digitalmars-d (6/6) Aug 24 2014 On Sun, 24 Aug 2014 18:28:51 +0000
- Dmitry Olshansky (9/15) Aug 24 2014 Just make a function that translates UTF-8 into Latin-1, use ubyte for
- ketmar via Digitalmars-d (13/15) Aug 24 2014 i don't want ubytes. but there is nothing bad with assigning latin-1
- Dmitry Olshansky (7/18) Aug 24 2014 char != byte, and compiler will reject pointer and array assignments of
- Andrew Godfrey (11/36) Aug 24 2014 Both your answers are at the level of the compiler/language spec.
- "Marc Schütz" <schuetzm gmx.net> (3/16) Aug 25 2014 Right, not just pedantically, but crucially - didn't read it
- Andrew Godfrey (5/23) Aug 25 2014 Oops, I did it myself. I meant code unit.
- H. S. Teoh via Digitalmars-d (10/13) Aug 24 2014 [...]
- Vladimir Panteleev (4/9) Aug 24 2014 This doesn't get you far in practice if you want to actually
- Jonathan M Davis (16/25) Aug 24 2014 Well, all of the non-string specific stuff (like find) will work
- H. S. Teoh via Digitalmars-d (11/19) Aug 24 2014 We need a major revamp of std.encoding.
- Andrei Alexandrescu (8/23) Aug 24 2014 Thanks for asking. Strings are not special, but are a good place to
- Dmitry Olshansky (43/85) Aug 24 2014 I'd be damned, if I didn't measure. Do not assume things if there is
- Dmitry Olshansky (11/17) Aug 24 2014 The rest of post is mostly in line with my reasoning (except for nobody
- Andrei Alexandrescu (20/36) Aug 24 2014 I don't know of any std::vector that uses the small string optimization....
- Dmitry Olshansky (15/42) Aug 24 2014 This time it's me who must be wrong.
- monarch_dodra (10/15) Aug 25 2014 Dmitry already replied, but I want to stress the reply. That's
- Andrei Alexandrescu (8/22) Aug 25 2014 That escalated quickly. Chillax.
- monarch_dodra (12/16) Aug 25 2014 Sorry. It's just that I've been seeing this claim way too much
- monarch_dodra (2/3) Aug 25 2014 https://github.com/D-Programming-Language/phobos/pull/2464
- monarch_dodra (23/42) Aug 25 2014 One issue this proposal seems to forget (and it's a problem that
- Andrei Alexandrescu (3/5) Aug 25 2014 It will.
- monarch_dodra (3/9) Aug 25 2014 Awesome.
- Jonathan M Davis (4/15) Aug 25 2014 There's a PR currently in review which fixes this (though IIRC
- Jonathan M Davis (39/54) Aug 23 2014 I don't know. My first reaction is to be very, very worried about
- Andrei Alexandrescu (13/46) Aug 24 2014 I guess unittests are for that kind of stuff. One good start would be to...
- "Marc Schütz" <schuetzm gmx.net> (10/25) Aug 24 2014 IMO in general it's a good proposal; I expected `isSomeString` to
- Andrei Alexandrescu (5/31) Aug 24 2014 Yah, length should be there. I think ptr is good to require because I
- Dmitry Olshansky (4/15) Aug 24 2014 .ptr would be nice '@system' addition to any contiguous ranges.
- Peter Alexander (8/14) Aug 24 2014 Hmm, you also need .length, but it needs to return the length of
- monarch_dodra (10/26) Aug 25 2014 One issue is that strings are "auto decoded", yet a range of
Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements:

* are random access ranges
* element type is some character
* offer .ptr as a @system property that yields a pointer to the first character

This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?

Andrei
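The three bullets can be sketched as a template constraint. A hypothetical relaxed trait - the name `isStringLike` and the exact checks below are my own illustration, not Phobos code:

```d
import std.range.primitives : ElementEncodingType;
import std.traits : isSomeChar;

// Hypothetical relaxed trait; the name and the exact checks are illustrative.
enum isStringLike(T) = isSomeChar!(ElementEncodingType!T) // element type is some character
    && is(typeof(T.init[0]))                              // random access
    && is(typeof(T.init.length) : size_t)                 // has a length
    && is(typeof(T.init.ptr));                            // .ptr to the first character

static assert(isStringLike!(char[]));
static assert(isStringLike!string);
static assert(isStringLike!(wchar[]));
static assert(!isStringLike!(int[]));
```

Built-in arrays pass because indexing, .length and .ptr all exist on them; a user-defined string would pass simply by exposing the same surface.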
Aug 23 2014
On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via Digitalmars-d wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?[...] Recently there has been another heated debate on Github about whether ranges should auto-decode at all: https://github.com/D-Programming-Language/phobos/pull/2423 Jonathan is about to write up a DIP for phasing out auto-decoding. Given that, I think we need to decide which direction to take, lest we waste energy on something that will be going away soon. T -- I don't trust computers, I've spent too long programming to think that they can get anything right. -- James Miller
Aug 23 2014
On 8/23/14, 6:39 PM, H. S. Teoh via Digitalmars-d wrote:On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via Digitalmars-d wrote:I've read the thread now, thanks. I think that discussion has lost perspective. There is talk about how decoding is slow, yet nobody (like, literally not one soul) looked at measuring and optimizing decoding. Everybody just assumes it's unacceptably slow; the irony is, it is, but mostly because it's not engineered for speed. Look e.g. at https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. That's a memory allocation each and every time decodeImpl is called. I'm not kidding. Take a look at http://goo.gl/p5pl3D vs. http://goo.gl/YL2iFN. It's egregious. The entire function, which should be small and nimble, is a behemoth. True, that function is only called for non-ASCII sequences, but the caller code can definitely be improved for speed (I see e.g. unnecessary pass of strings by reference etc). A bunch of good work can be done to improve efficiency of decoding tremendously. If the DIP is based on Jonathan's comments in the pull request, I think it's a waste of everybody's time. It's a complete rewiring of the standard library fundamentals and a huge inconvenience for all users, for a slight improvement of a few scenarios. UTF strings are bidirectional ranges of variable-length encoded code units. Algorithms work with those. That is the matter of fact, and what makes UTF strings tick in D. It's the reason Python has a large schism between 2 and 3 over UTF strings and D is doing well, thank you very much. Considering D is faring a lot better than all languages I know of in the UTF realm, if auto-decoding was a mistake, well it wasn't a bad one. I suggest we focus efforts on solving real problems with creative solutions. Endless hand-wringing over small fish and forever redoing what's already there is not the way to progress. 
Speaking of which, this is why I discuss isNarrowString: I have recently gotten convinced of a number of things. First, there will always be a category of users and applications who will NOT, for reasons good and bad, accept garbage collection. They just won't, and we can discuss how that sometimes is a bummer reflecting on human nature, but I don't see how framing the problem as a user education issue is pushing D forward. It follows we need to improve proper use of D for people who will not have the GC, by means of superior libraries. And if you think of it this is very D-ish: we have a powerful, incredibly expressive language at our hands, designed explicitly to make powerful abstractions easy to define. So the right direction of exploration is better libraries. One other thing I realized is that avoiding the GC by means of APIs based on output ranges Will Not Work(tm). Not all allocations can be successfully hoisted to the user in that manner. Output ranges are fine if the sole focus is producing sequential output, but the reality is that often more complex structures need to be allocated for things to work, and the proper way to hoist that into the client is by providing allocator services, not output ranges. Consider warp itself: it reads files and writes files, which is all linear and nice, but part of the processing involves storing macros, which are a complex internal data structure (e.g. a hash table). The linear output can be hoisted nicely to the client, but not the intricate allocation needed by macros. Continuing to pull on that thread, it follows that we need to provide and thoroughly document data structures for resource management, derived from tried and true practices in C++: std::string, std::unique_ptr, std::shared_ptr. For example, we have RefCounted. It just works transparently in a bunch of situations, but very few users actually know about it. Furthermore, it has imperfections and rough edges. 
Improving RefCounted (better code, better documentation, maybe DIPs prompting language improvements that make it better) would be real progress. To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles is a modern, emergent pattern that D could use to great effect. (Speaking of which: Some, but not all, types in std.container use reference counting. One other great area of improvement would be to guarantee that everything in std.container is reference counted. Containers are the perfect candidate for reference counting - they are typically large enough to make the reference counting overhead negligible by comparison with the typical work on them.) I'd like RCString to benefit from the optimizations for built-in strings, and that's why I was looking at relaxing isSomeString/isNarrowString. AndreiCurrently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. 
I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?[...] Recently there has been another heated debate on Github about whether ranges should auto-decode at all: https://github.com/D-Programming-Language/phobos/pull/2423 Jonathan is about to write up a DIP for phasing out auto-decoding. Given that, I think we need to decide which direction to take, lest we waste energy on something that will be going away soon.
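To make the slice-semantics point concrete, here is a toy skeleton in the spirit of what's described. RCString itself is unpublished, so every name and detail below is a guess layered on std.typecons.RefCounted:

```d
import std.typecons : RefCounted;

// Toy sketch only; the real RCString is unpublished and surely differs.
struct ToyRCString
{
    private RefCounted!(char[]) buf; // shared, reference-counted buffer
    private size_t lo, hi;           // this handle's window into the buffer

    this(const(char)[] s)
    {
        buf = RefCounted!(char[])(s.dup);
        hi = s.length;
    }

    @property size_t length() { return hi - lo; }
    char opIndex(size_t i) { return buf.refCountedPayload[lo + i]; }

    // Slicing copies the handle, not the characters: slice semantics
    // over a shared, counted payload.
    ToyRCString opSlice(size_t a, size_t b)
    {
        auto r = this;
        r.lo = lo + a;
        r.hi = lo + b;
        return r;
    }
}

unittest
{
    auto s = ToyRCString("hello");
    auto t = s[1 .. 4];
    assert(t.length == 3 && t[0] == 'e'); // same buffer, new window
}
```

The copy in opSlice bumps the reference count via RefCounted's postblit; when the last handle dies, the buffer can be freed deterministically, matching the behavior described above.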
Aug 24 2014
On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles and is a modern, emergent pattern that D could use to great effect.Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
Aug 24 2014
On Sunday, 24 August 2014 at 12:49:30 UTC, Marc Schütz wrote:On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:We already have an 'RCArray' (std.container.Array) although it's not perfect. We don't have a reference counted string type that would work with the string operation functions.To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles and is a modern, emergent pattern that D could use to great effect.Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
Aug 24 2014
On Sunday, 24 August 2014 at 13:04:10 UTC, Kiith-Sa wrote:On Sunday, 24 August 2014 at 12:49:30 UTC, Marc Schütz wrote:Well, if `isSomeString` were extended as Andrei suggests (and the string functions adapted a bit), I don't see why it shouldn't work. But I'm not sure std.container.Array really is refcounted. The documentation doesn't say so, and it shows neither a postblit nor an opAssign.On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:We already have an 'RCArray' (std.container.Array) although it's not perfect. We don't have a reference counted string type that would work with the string operation functions.To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles and is a modern, emergent pattern that D could use to great effect.Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
Aug 24 2014
On 8/24/14, 6:20 AM, "Marc Schütz" <schuetzm gmx.net> wrote:But I'm not sure std.container.Array really is refcounted. The documentation doesn't say so, and it shows neither a postblit nor an opAssign.It uses RefCounted: https://github.com/D-Programming-Language/phobos/blob/master/std/container/array.d#L169 Andrei
Aug 24 2014
The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point? E.g. If interoperating with C code, they will never make the mistake of using these types for a non-string byte/word array? If and only if this is true, then D has done well and I'm unafraid of duck-typing here.
Aug 24 2014
On Sunday, 24 August 2014 at 18:19:45 UTC, Andrew Godfrey wrote:The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?It has to, because it is required by the specification. But ...E.g. If interoperating with C code, they will never make the mistake of using these types for a non-string byte/word array?... of course this cannot be guaranteed. In fact, even the druntime currently just assumes that program arguments and environment variables are UTF8 encoded on Unix, AFAIK. This is true in most cases, but of course not guaranteed. Potentially also problematic are the functions taking filenames. In Unix, filenames are just opaque arrays of bytes, but those functions take `string` parameters, i.e. assuming UTF8 encoding. This could force the user to place non-UTF8 sequences into strings.
Aug 24 2014
On Sun, 24 Aug 2014 18:28:51 +0000 via Digitalmars-d <digitalmars-d puremagic.com> wrote: and D parser unconditionally assumes that source code is in UTF-8 too. which makes great PITA for me: i can't use strings (any string-like literals, actually) in Latin-1 (for example) without ugly "\x..\x..\x..". hate it.
Aug 24 2014
24-Aug-2014 22:36, ketmar via Digitalmars-d writes:On Sun, 24 Aug 2014 18:28:51 +0000 via Digitalmars-d <digitalmars-d puremagic.com> wrote: and D parser unconditionally assumes that source code is in UTF-8 too. which makes great PITA for me: i can't use strings (any string-like literals, actually) in Latin-1 (for example) without ugly "\x..\x..\x..". hate it.Just make a function that translates UTF-8 into Latin-1, use ubyte for Latin-1. Then this neat trick will keep the literal a compile-time constant: enum latin(alias str) = toLatin1(str); (where toLatin1 is some normal function that does the translation) Then use it as latin!"some-string-in-latin1" -- Dmitry Olshansky
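toLatin1 is left undefined above; a minimal CTFE-able sketch of it (my assumption: any code point above U+00FF is simply an error):

```d
// Sketch of the helper left undefined above: maps code points
// U+0000..U+00FF to single Latin-1 bytes; anything else is rejected.
ubyte[] toLatin1(string s)
{
    ubyte[] r;
    foreach (dchar c; s) // auto-decodes the UTF-8 source into code points
    {
        assert(c <= 0xFF, "not representable in Latin-1");
        r ~= cast(ubyte) c;
    }
    return r;
}

enum latin(alias str) = toLatin1(str);

// "é" is U+00E9: two bytes in UTF-8, but a single 0xE9 byte in Latin-1.
static assert(latin!"é".length == 1 && latin!"é"[0] == 0xE9);
```

Because the enum forces the call through CTFE, the Latin-1 bytes are baked in at compile time, which is what makes the trick a drop-in replacement for the "\x..\x..\x.." literals.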
Aug 24 2014
On Mon, 25 Aug 2014 01:23:57 +0400 Dmitry Olshansky via Digitalmars-d <digitalmars-d puremagic.com> wrote:Just make a function that translates UTF-8 into Latin-1, use ubyte for Latin-1.i don't want ubytes. but there is nothing bad with assigning latin-1 text to ordinary string after conversion. ;-) the downside is memory consumption. i have ALOT of ctfe-magic, compiling some modules easily hits memory limits (yeah, x86; no, i don't planning to switch to x86_64 soon), and calling tr-functions trashes memory even more (i'm not even talking about code readability here). what i really want is "verbatim" string literal, which scanner, lexer and parser just passes "as-is", byte-to-byte, without trying to decode it as UTF-8. or (this can be easier to implement, i think) at least something like pragma(encoding, "..."). with at least two working values: "utf-8" and "native".
Aug 24 2014
24-Aug-2014 22:19, Andrew Godfrey writes:The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.E.g. If interoperating with C code, they will never make the mistake of using these types for a non-string byte/word array?char != byte, and compiler will reject pointer and array assignments of byte* to char*, ubyte[] to char[] etc. Values themselves are convertible, so would work with implicit conversion.If and only if this is true, then D has done well and I'm unafraid of duck-typing here.
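The rules above in compilable form (a minimal sketch; the static asserts confirm exactly which conversions the compiler rejects):

```d
void main()
{
    ubyte[] raw = [72, 105];

    // Arrays and pointers do NOT convert implicitly between ubyte and char:
    static assert(!is(ubyte[] : char[]));
    static assert(!is(ubyte* : char*));

    // Individual values DO convert implicitly (both are 8-bit):
    char c = raw[0];
    assert(c == 'H');
}
```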
Aug 24 2014
On Sunday, 24 August 2014 at 18:43:36 UTC, Dmitry Olshansky wrote:24-Aug-2014 22:19, Andrew Godfrey пишет:Both your answers are at the level of the compiler/language spec. Relevant yes, but not complete. E.g. How often will people manually converting a .h file, convert C "const char *" correctly to either something char-based or something ubyte-based, depending on whether it represents utf-8 code points? How often will they even know? With wchar it's probably even worse, because of API's that use one type but depending on other parameters, the string elements can be utf-16 code points or glyph indices.The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.E.g. If interoperating with C code, they will never make the mistake of using these types for a non-string byte/word array?char != byte, and compiler will reject pointer and array assignments of byte* to char*, ubyte[] to char[] etc. Values themselves are convertible, so would work with implicit conversion.If and only if this is true, then D has done well and I'm unafraid of duck-typing here.
Aug 24 2014
On Sunday, 24 August 2014 at 18:43:36 UTC, Dmitry Olshansky wrote:24-Aug-2014 22:19, Andrew Godfrey пишет:Right, not just pedantically, but crucially - didn't read it correctly.The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.
Aug 25 2014
On Monday, 25 August 2014 at 08:07:58 UTC, Marc Schütz wrote:On Sunday, 24 August 2014 at 18:43:36 UTC, Dmitry Olshansky wrote:Oops, I did it myself. I meant code unit. Anyway, see my earlier reply - I wasn't asking what the language spec says. I'm asking about the reality given what happens when D code interfaces with e.g. C/C++ code.24-Aug-2014 22:19, Andrew Godfrey пишет:Right, not just pedantically, but crucially - didn't read it correctly.The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.
Aug 25 2014
On Sun, Aug 24, 2014 at 06:19:44PM +0000, Andrew Godfrey via Digitalmars-d wrote: [...]So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?[...] In D, an array of char, wchar, or dchar always means a Unicode encoding. Non-Unicode encodings should be represented as ubyte[] (resp. ushort[] or uint[], if such exist) instead. T -- If you want to solve a problem, you need to address its root cause, not just its symptoms. Otherwise it's like treating cancer with Tylenol...
Aug 24 2014
On Monday, 25 August 2014 at 01:31:35 UTC, H. S. Teoh via Digitalmars-d wrote:In D, an array of char, wchar, or dchar always means a Unicode encoding. Non-Unicode encodings should be represented as ubyte[] (resp. ushort[] or ulong[], if such exist) instead.This doesn't get you far in practice if you want to actually operate on the text.
Aug 24 2014
On Monday, 25 August 2014 at 02:40:20 UTC, Vladimir Panteleev wrote:On Monday, 25 August 2014 at 01:31:35 UTC, H. S. Teoh via Digitalmars-d wrote:Well, all of the non-string specific stuff (like find) will work just fine, but since all of the string-specific functions assume UTF-8, UTF-16, or UTF-32, you'll have to convert it. We can't really do otherwise, because you have to know what encoding you're dealing with to operate on it as a string, and that means that you need to either call specific functions which expect the encoding that you're using, or you need types specific to those encodings (in which case, you wouldn't use ubyte[] and the like directly). We do need better support for other encodings, but I don't think that it really costs us anything to treat char as UTF-8, wchar as UTF-16, and dchar as UTF-32 and require that other encodings use different representations. - Jonathan M DavisIn D, an array of char, wchar, or dchar always means a Unicode encoding. Non-Unicode encodings should be represented as ubyte[] (resp. ushort[] or ulong[], if such exist) instead.This doesn't get you far in practice if you want to actually operate on the text.
Aug 24 2014
On Mon, Aug 25, 2014 at 02:40:19AM +0000, Vladimir Panteleev via Digitalmars-d wrote:On Monday, 25 August 2014 at 01:31:35 UTC, H. S. Teoh via Digitalmars-d wrote:We need a major revamp of std.encoding. I envision it to use CTFE and template magic to make it possible to statically optimize manipulating strings of any supported encoding via statically-wrapped non-UTF string types, as well as dynamic types for dealing with runtime-decided encodings. If possible, I envision being able to use std.conv.to to convert between encodings via compile-time customization hooks for to!(). T -- Live and learn - you'll still die a fool.In D, an array of char, wchar, or dchar always means a Unicode encoding. Non-Unicode encodings should be represented as ubyte[] (resp. ushort[] or ulong[], if such exist) instead.This doesn't get you far in practice if you want to actually operate on the text.
Aug 24 2014
On 8/24/14, 5:49 AM, "Marc Schütz" <schuetzm gmx.net> wrote:On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:Thanks for asking. Strings are not special, but are a good place to start because (a) they are ubiquitous so good bang for the buck, (b) they don't have indirections to worry about, (c) they have a readily available built-in to compare against. Also RCString is different from e.g. std.container.Array in that the former has slice semantics whereas the latter has pure reference semantics. AndreiTo that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles and is a modern, emergent pattern that D could use to great effect.Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
Aug 24 2014
24-Aug-2014 16:24, Andrei Alexandrescu writes:On 8/23/14, 6:39 PM, H. S. Teoh via Digitalmars-d wrote:On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via Digitalmars-d wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?[...] Recently there has been another heated debate on Github about whether ranges should auto-decode at all: https://github.com/D-Programming-Language/phobos/pull/2423 Jonathan is about to write up a DIP for phasing out auto-decoding. Given that, I think we need to decide which direction to take, lest we waste energy on something that will be going away soon.I've read the thread now, thanks. I think that discussion has lost perspective. There is talk about how decoding is slow, yet nobody (like, literally not one soul) looked at measuring and optimizing decoding. Everybody just assumes it's unacceptably slow; the irony is, it is, but mostly because it's not engineered for speed.
I'd be damned, if I didn't measure. Do not assume things if there is little actual data. Take a peek at my tests (yeah, rough stuff but it does the job). https://github.com/DmitryOlshansky/gsoc-bench-2012 Let's count the number of codepoints in a file (loading it into memory before measuring stuff): =================== decode-only [arwiki], 368312 hits, 3.16312, 205.58 Mb/s noop [arwiki], 599328 hits, 0.67872, 958.11 Mb/s ==================== decode-only [ruwiki], 523414 hits, 4.54544, 206.23 Mb/s noop [ruwiki], 884374 hits, 0.7898, 1186.92 Mb/s ==================== decode-only [dewiki], 1773464 hits, 5.63744, 318.28 Mb/s noop [dewiki], 1607454 hits, 2.1322, 841.52 Mb/s With that I can say that decode *is* engineered for speed. It processes about 200-300 million codepoints per second for me, compared with a baseline of 840 Mbytes/sec for merely testing whether each byte is 0x20. With LDC both numbers are about 2 times better but the picture stays the same. As you can see there is a significant gap there and decoding is not yet memory-bound. It's still a cost that hinders top-performance code.Look e.g. at https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. That's a memory allocation each and every time decodeImpl is called. I'm not kidding. Take a look at http://goo.gl/p5pl3D vs.You are completely wrong on this count. I'm horrified to learn that even D experts are unaware of how `enum` works, or you've got to be kidding me. enum doesn't allocate but serves as a `copy-paste` token, effectively putting said literal everywhere it is used. Now look at the usage pattern: https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1173 (where 'i' is a constant from a static foreach) The compiler sees it as: [(1 << 7) - 1, (1 << 11) - 1, (1 << 16) - 1, (1 << 21) - 1][0] and const-folds it accordingly.I'd like RCString to benefit of the optimizations for built-in strings, and that's why I was looking at relaxing isSomeString/isNarrowString.On this I agree, isSomeString is an ugly half-assed hack if we can't have other kinds of strings. 
Also consider Array!char, which in fact should be pretty close to RCString. I'm of the opinion that there are 2 ways out of 'decode by default sucks': a) Make it finally generic and clearly state what a DecodableRange is. Stick with that everywhere. b) To hell with that and just remove implicit decoding. Else it really-really sucks. -- Dmitry Olshansky
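For reference, the relaxed trait Andrei describes could be sketched roughly as follows (hypothetical name `isStringLike`; not Phobos code). Note the wrinkle that under today's auto-decoding rules narrow strings do not satisfy `isRandomAccessRange`, so a real trait would have to special-case them the way `isNarrowString` does now.

```d
import std.range : isRandomAccessRange, ElementEncodingType;
import std.traits : isSomeChar;

// Hypothetical sketch of the relaxed check - not actual Phobos code:
// random access, character element type, and a .ptr property.
enum isStringLike(R) = isRandomAccessRange!R
    && isSomeChar!(ElementEncodingType!R)
    && is(typeof(R.init.ptr));

static assert(isStringLike!(dchar[]));
static assert(!isStringLike!(int[])); // has .ptr, but not characters

// Today's auto-decoding means char[] is not random access,
// which is exactly part of the problem under discussion.
static assert(!isRandomAccessRange!(char[]));

void main() {}
```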
Aug 24 2014
24-Aug-2014 16:24, Andrei Alexandrescu wrote:(Speaking of which: Some, but not all, types in std.container use reference counting. One other great area of improvement would be to guarantee that everything in std.container is reference counted. Containers are the perfect candidate for reference counting - they are typically large enough to make the reference counting overhead negligible by comparison with the typical work on them.)The rest of the post is mostly in line with my reasoning (except for the 'nobody measures stuff' part, which is just bullshit). Speaking of data structures I find just about the opposite. Most data structures are small, which must be the fact so fondly used by C++ vector: small-string optimization. Only very few data structures are large in a given program, and usually correspond to some global tables and repositories. Others are either short-lived byproducts of input processing or small data sets attached to some global entity. -- Dmitry Olshansky
Aug 24 2014
On 8/24/14, 6:16 AM, Dmitry Olshansky wrote:24-Aug-2014 16:24, Andrei Alexandrescu wrote:Apologies for that, thanks for setting me straight.(Speaking of which: Some, but not all, types in std.container use reference counting. One other great area of improvement would be to guarantee that everything in std.container is reference counted. Containers are the perfect candidate for reference counting - they are typically large enough to make the reference counting overhead negligible by comparison with the typical work on them.)The rest of the post is mostly in line with my reasoning (except for the 'nobody measures stuff' part, which is just bullshit).Speaking of data structures I find just about the opposite. Most data structures are small, which must be the fact so fondly used by C++ vector: small-string optimization. Only very few data structures are large in a given program, and usually correspond to some global tables and repositories. Others are either short-lived byproducts of input processing or small data sets attached to some global entity.I don't know of any std::vector that uses the small string optimization. std::string does ubiquitously because (a) strings are often handled as values, and (b) C++11 put refcounted strings into illegality (forced mistake), therefore robbing implementers of an important optimization. In a way both C++ and D got it "wrong". Arrays/containers are entity types - they have identity and should be manipulated most often by reference. Presence of pass-by-value of containers in C++ programs, save for rvalue optimization purposes, is suspicious. In contrast, strings are value types - they are handled most often as a unit and passed by value, just like e.g. numbers. C++ made both containers and strings value types, so it needs forever to look over its shoulder about n00bs copying large containers unwittingly. It also does a fair amount of unneeded string copying, and optimizing string-based C++ code is nontrivial. 
D made both arrays and strings slices, a data structure made highly expressive by the garbage collector but that occasionally confuses people. With std.refcounted.RCString and std.container.Array we get both abstractions "right". Andrei
Aug 24 2014
24-Aug-2014 21:59, Andrei Alexandrescu wrote:On 8/24/14, 6:16 AM, Dmitry Olshansky wrote:This time it's me who must be wrong. Yet I see that this is a recognized need: https://github.com/facebook/folly/blob/master/folly/small_vector.h LLVM folks seem to do the same. With that in mind small containers might be better as a special value type.24-Aug-2014 16:24, Andrei Alexandrescu wrote:I don't know of any std::vector that uses the small string optimization.Speaking of data structures I find just about the opposite. Most data structures are small, which must be the fact so fondly used by C++ vector: small-string optimization. Only very few data structures are large in a given program, and usually correspond to some global tables and repositories. Others are either short-lived byproducts of input processing or small data sets attached to some global entity.std::string does ubiquitously because (a) strings are often handled as values, and (b) C++11 put refcounted strings into illegality (forced mistake), therefore robbing implementers of an important optimization. In a way both C++ and D got it "wrong". Arrays/containers are entity types - they have identity and should be manipulated most often by reference. Presence of pass-by-value of containers in C++ programs, save for rvalue optimization purposes, is suspicious.Agreed.In contrast, strings are value types - they are handled most often as a unit and passed by value, just like e.g. numbers.Indeed, just note that it would be real nice not to actually copy strings. In fact it seems that copy-on-write (with ref-counting) for big strings and small-string optimization for small ones is an almost ideal solution. In D we could have non-atomic (thread-local) copy-on-write, which should be quite fast.C++ made both containers and strings value types, so it needs forever to look over its shoulder about n00bs copying large containers unwittingly. 
It also does a fair amount of unneeded string copying, and optimizing string-based C++ code is nontrivial. D made both arrays and strings slices, a data structure made highly expressive by the garbage collector but that occasionally confuses people. With std.refcounted.RCString and std.container.Array we get both abstractions "right".Good points. -- Dmitry Olshansky
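The small-string optimization Dmitry mentions can be sketched in a few lines of D (purely illustrative; real implementations such as folly's fbstring are far more involved): short strings live inline in the struct, and only long ones touch the heap.

```d
// Illustrative sketch of the small-string optimization - not a real
// string type. Short strings are stored in an inline buffer; longer
// ones are copied to a heap-allocated slice.
struct SmallString
{
    enum smallCap = 15;
    union
    {
        char[smallCap + 1] small; // inline buffer for short strings
        char[] large;             // heap slice for long strings
    }
    ubyte len;     // used length when inline (0 .. smallCap)
    bool isLarge;  // discriminates the union

    this(const(char)[] s)
    {
        if (s.length <= smallCap)
        {
            small[0 .. s.length] = s[];
            len = cast(ubyte) s.length;
        }
        else
        {
            large = s.dup; // only long strings allocate
            isLarge = true;
        }
    }

    const(char)[] slice() const
    {
        return isLarge ? large : small[0 .. len];
    }
}

void main()
{
    assert(SmallString("hi").slice == "hi");
    assert(!SmallString("hi").isLarge); // short: no heap allocation
    auto s = SmallString("a string comfortably longer than fifteen chars");
    assert(s.isLarge && s.slice[0 .. 8] == "a string");
}
```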
Aug 24 2014
On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:Look e.g. at https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. That's a memory allocation each and every time decodeImpl is called. I'm not kidding. Take a look at http://goo.gl/p5pl3D vs. http://goo.gl/YL2iFN. It's egregious.Dmitry already replied, but I want to stress the reply. That's complete BS. decodeImpl does *not* allocate. That line of code has been commented on, tested and benched. The enum is only ever used for CT indexing, which will *not* allocate. I'm repeatedly seeing people complain that decoding is slow. Maybe. And that it "allocates". But that's BS. A lot of people have been repeating this false information, and I find it worrisome that both you and Walter have now made this claim.
Aug 25 2014
On 8/25/14, 1:21 PM, monarch_dodra wrote:On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:That escalated quickly. Chillax. The use of that enum array is fine but somewhat fragile - a small refactoring may trigger an allocation. Trying a static immutable array may be in order. The larger point is there's significant optimization work that can be done on decoding before looking at other changes. AndreiLook e.g. at https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. That's a memory allocation each and every time decodeImpl is called. I'm not kidding. Take a look at http://goo.gl/p5pl3D vs. http://goo.gl/YL2iFN. It's egregious.Dmitry already replied, but I want to stress the reply. That's complete BS. decodeImpl does *not* allocate. That line of code has been commented on, tested and benched. The enum is only ever used for CT indexing, which will *not* allocate. I'm repeatedly seeing people complain that decoding is slow. Maybe. And that it "allocates". But that's BS. A lot of people have been repeating this false information now, and I find it worrysome that both you and Walter have made this claim now.
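Andrei's "fragile" remark can be made concrete (sketch with made-up names, not the actual std.utf code): the enum array only stays allocation-free while the index is a compile-time constant. Index it with a run-time value and the array literal is pasted into the expression, which can allocate on every call; a static immutable table sits in the data segment and never allocates.

```d
// Why the enum array is fragile: a run-time index forces the pasted
// array literal to materialize (potentially allocating each call),
// while a static immutable table never allocates.
enum enumTable = [10, 20, 30, 40];
static immutable int[4] immTable = [10, 20, 30, 40];

int viaEnum(size_t i) { return enumTable[i]; } // may allocate per call
int viaImm(size_t i)  { return immTable[i]; }  // never allocates

void main()
{
    assert(viaEnum(2) == 30);
    assert(viaImm(2) == 30);
}
```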
Aug 25 2014
On Monday, 25 August 2014 at 21:11:52 UTC, Andrei Alexandrescu wrote:That escalated quickly. Chillax.Sorry. It's just that I've been seeing this claim way too frequently, especially in learn, where there have been too many threads about trying to avoid decoding. Most often, for the wrong reasons.The use of that enum array is fine but somewhat fragile - a small refactoring may trigger an allocation.Declaring it as a simple 4-member TypeTuple would eliminate the (potential) issue, as a TT *can't* be runtime indexed. Seems like a good compromise.Trying a static immutable array may be in order.That would actually create an object, and potentially prevent optimizations (I think, maybe). The TT seems cleaner to me anyways. I'll create a pull for it.
Aug 25 2014
On Monday, 25 August 2014 at 22:04:34 UTC, monarch_dodra wrote:I'll create a pull for it.https://github.com/D-Programming-Language/phobos/pull/2464
Aug 25 2014
On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles is a modern, emergent pattern that D could use to great effect. (Speaking of which: Some, but not all, types in std.container use reference counting. One other great area of improvement would be to guarantee that everything in std.container is reference counted. Containers are the perfect candidate for reference counting - they are typically large enough to make the reference counting overhead negligible by comparison with the typical work on them.)One issue this proposal seems to forget (and it's a problem that transcends D) is that the GC does not finalize structs. Your RC proposal is fine and good for strings, because the individual chars don't have destructors. But unless we migrate *everything* to using RC, we'd still be leaking non-memory resources. For example, "File" is reference counted, and I've seen people time and time again get had, because they use a "File[]". Oops. IMO, this is a big issue. Are there any plans to tackle this problem? Another issue we encounter a lot with reference type objects that do RC is one of initial initialization/allocation. Currently, D does not have default constructors (for structs). I understand why. But it makes it painfully difficult to implement run-time initialization with no arguments, while avoiding user errors. 
This has been a problem time and time again for objects such as Appender, or T[U], in that aliasing only happens *after* the first operation. Has this been discussed again yet? We have "T.init". Why couldn't we have default construction, that can be explicitly skipped with "T a = T.init;"? I realize this derails the conversation a bit, but I think it is related enough to warrant mentioning.
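The Appender aliasing surprise described above is easy to reproduce (behavior as of Phobos at the time; worth re-verifying on a current compiler): because the payload is only allocated on first use, a copy taken before any operation does not share state with the original.

```d
import std.array : appender;

void main()
{
    auto a = appender!string();
    auto b = a;      // copies a still-unallocated payload
    a.put("hello");
    b.put("world");  // b allocates its own buffer - no sharing
    assert(a.data == "hello");
    assert(b.data == "world");

    // Copying *after* the first operation does alias:
    auto c = appender!string();
    c.put("x");
    auto d = c;      // now the payload exists and is shared
    d.put("y");
    assert(c.data == "xy");
}
```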
Aug 25 2014
On 8/25/14, 1:35 PM, monarch_dodra wrote:One issue this proposal seems to forget (and it's a problem that transcends D), is that the GC does not finalize structs.It will. Andrei
Aug 25 2014
On Monday, 25 August 2014 at 21:12:49 UTC, Andrei Alexandrescu wrote:On 8/25/14, 1:35 PM, monarch_dodra wrote:Awesome.One issue this proposal seems to forget (and it's a problem that transcends D), is that the GC does not finalize structs.It will. Andrei
Aug 25 2014
On Monday, 25 August 2014 at 22:00:30 UTC, monarch_dodra wrote:On Monday, 25 August 2014 at 21:12:49 UTC, Andrei Alexandrescu wrote:There's a PR currently in review which fixes this (though IIRC there were some bugs with it that were being worked out). - Jonathan M DavisOn 8/25/14, 1:35 PM, monarch_dodra wrote:Awesome.One issue this proposal seems to forget (and it's a problem that transcends D), is that the GC does not finalize structs.It will. Andrei
Aug 25 2014
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a @system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?I don't know. My first reaction is to be very, very worried about this. We made it so that isSomeString, isIntegral, etc. test for the specific type and not implicit conversion precisely to avoid problems where the type is assumed to be one of the built-in types, and the code then does weird things or just doesn't compile. If we had done this with isSomeString originally, I'm sure that we could make it work, but you're talking about a type that isn't an array, and I expect that there's a lot of code out there that assumes that something that passes isSomeString is an array. And there's a good chance that a lot of the functions that do won't react well to a user-defined type. I think that if we go that route, we should probably create a new trait for it rather than reuse isSomeString. In addition, I'd be very concerned about using ptr like that, because the main use case for that aside from trying to avoid bounds checks is to pass to a C function, and that's going to need ptr to give you an actual array, in which case, I start wondering why we're not just using arrays. Also, what optimizations are you looking for here? To avoid auto-decoding? Something else? 
A lot of the compiler/druntime/Phobos devs are coming to the conclusion that the auto-decoding was a mistake and that we should look at removing it if we can and change it so that strings are treated as ranges of their element type rather than dchar. I'm even in the process of working on a DIP for just that, and making the transition would be nowhere near as bad as I was initially expecting it to be. I think that we can actually do it fairly cleanly. If we did that, then I'm not sure what gain there would be in doing something like this to isSomeString or even in creating a new trait. Simply having a random-access range whose element type is a character type would do it already. The only part that it might miss out on is using ptr to skip bounds checking, and I don't know how much sense that makes with a user-defined type anyway given that ptr would have to be giving you a pointer to an array to work wherever ptr works for arrays. - Jonathan M Davis
Aug 23 2014
On 8/23/14, 7:05 PM, Jonathan M Davis wrote:I don't know. My first reaction is to be very, very worried about this. We made it so that isSomeString, isIntegral, etc. test for the specific type and not implicit conversion precisely to avoid problems where the type is assumed to be one of the built-in types, and the code then does weird things or just doesn't compile.I guess unittests are for that kind of stuff. One good start would be to see what kind of assumptions are made by algorithms specialized on isSomeString.If we had done this with isSomeString originally, I'm sure that we could make it work, but you're talking about a type that isn't an array, and I expect that there's a lot of code out there that assumes that something that passes isSomeString is an array. And there's a good chance that a lot of the functions that do won't react well to a user-defined type. I think that if we go that route, we should probably create a new trait for it rather than reuse isSomeString.Yah, that would be a safe path but would complicate checks.In addition, I'd be very concerned about using ptr like that, because the main use case for that aside from trying to avoid bounds checks is to pass to a C function, and that's going to need ptr to give you an actual array, in which case, I start wondirng why we're not just using arrays.Because the abstraction I'm thinking of is reference counted.Also, what optimizations are you looking for here? To avoid auto-decoding? Something else?It's just taking advantage of the work that's already been done in std.algorithm and elsewhere.A lot of the compiler/druntime/Phobos devs are coming to the conclusion that the auto-decoding was a mistake and that we should look at removing it if we can and change it so that strings are treated as ranges of their element type rather than dchar. I'm even in the process of working on a DIP for just that, and making the transition would be nowhere near as bad as I was initially expecting it to be. 
I think that we can actually do it fairly cleanly. If we did that, then I'm not sure what gain there would be in doing something like this to isSomeString or even in creating a new trait. Simply having a random-access range whose element type is a character type would do it already. The only part that it might miss out on is using ptr to skip bounds checking, and I don't know how much sense that makes with a user-defined type anyway given that ptr would have to be giving you a pointer to an array to work wherever ptr works for arrays.As I wrote at length a few minutes ago, I think such a DIP would be a waste of time and energy. I encourage everyone to work on pushing things forward with new work instead of forever fiddling with what we have "for the last time". Andrei
Aug 24 2014
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?IMO in general it's a good proposal; I expected `isSomeString` to return `true` for higher-order ranges over strings, and was surprised that it didn't. But `.ptr` must only be included if the elements are indeed contiguous in memory. For example, `std.algorithm.filter` applied to a string should not provide `.ptr`, but should IMO be treated as a string nevertheless. Related: What about `.length`? RA ranges can also be infinite...
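Marc's contiguity point is easy to check: a lazy adapter such as `filter` over a string satisfies the character-range requirements but exposes no `.ptr`, since its elements are not contiguous in memory.

```d
import std.algorithm : filter;
import std.array : array;
import std.uni : isAlpha;

void main()
{
    auto r = "a1b2c3".filter!(c => c.isAlpha);
    // The lazy filter result offers no .ptr...
    static assert(!is(typeof(r.ptr)));
    // ...but it is still a range of (decoded) characters.
    assert(r.array == "abc"d);
}
```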
Aug 24 2014
On 8/24/14, 6:00 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:Yah, length should be there. I think ptr is good to require because I speculate many optimizations take advantage of the characters being contiguous. AndreiCurrently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?IMO in general it's a good proposal; I expected `isSomeString` to return `true` for higher-order ranges over strings, and was surprised that it didn't. But `.ptr` must only be included if the elements are indeed contiguous in memory. For example, `std.algorithm.filter` applied to a string should not provide `.ptr`, but should IMO be treated as a string nevertheless. Related: What about `.length`? RA ranges can also be infinite...
Aug 24 2014
24-Aug-2014 05:06, Andrei Alexandrescu wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a @system property that offers a pointer to the first character.ptr would be a nice '@system' addition to any contiguous ranges. -- Dmitry Olshansky
Aug 24 2014
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a @system property that offers a pointer to the first characterHmm, you also need .length, but it needs to return the length of the encoding (in code units), not of the string. There might be custom string types where .length returns the number of code points, but where .ptr points to code units. Narrow-string-optimized functions will also be looking for an opSlice that is indexed by code units.
Aug 24 2014
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a @system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts? AndreiOne issue is that strings are "auto decoded", yet a range of "char" is not. I don't see why ".ptr" is needed (or ".length", for that matter). We could very well have a range of chars that offers neither .length nor .ptr, but should still be handled like a sequence of decoded characters. I don't see how your proposal caters to that more generic problem: that a UD char range is not auto-decoded (or, more generally, that it will be handled *differently* from a string).
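The asymmetry described above can be demonstrated directly: the range primitives auto-decode built-in strings to dchar, while a user-defined range over the same chars keeps its element type (illustrative sketch).

```d
import std.range : ElementType;

// A minimal user-defined input range over the same bytes as a string.
struct CharRange
{
    string s;
    bool empty() const { return s.length == 0; }
    char front() const { return s[0]; }
    void popFront() { s = s[1 .. $]; }
}

// Built-in strings are auto-decoded to dchar by the range primitives...
static assert(is(ElementType!string == dchar));
// ...but an equivalent user-defined char range is not decoded at all.
static assert(is(ElementType!CharRange == char));

void main() {}
```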
Aug 25 2014