
digitalmars.D - Relaxing the definition of isSomeString and isNarrowString

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Currently char[], wchar[], dchar[] and qualified variants fulfill the 
requirements of isSomeString. Also, char[], wchar[] and qualified 
variants fulfill the requirements of isNarrowString.

Various algorithms in Phobos test for these traits to optimize away UTF 
decoding where unnecessary.

I'm thinking of relaxing the definitions to all types that fulfill the 
following requirements:

* are random access ranges
* element type is some character
* offer .ptr as a @system property that offers a pointer to the first 
character

This would allow us to generalize the notion of string and offer 
optimizations for user-defined, not only built-in, strings. Thoughts?


Andrei
Aug 23 2014
next sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via Digitalmars-d
wrote:
 Currently char[], wchar[], dchar[] and qualified variants fulfill the
 requirements of isSomeString. Also, char[], wchar[] and qualified
 variants fulfill the requirements of isNarrowString.
 
 Various algorithms in Phobos test for these traits to optimize away
 UTF decoding where unnecessary.
 
 I'm thinking of relaxing the definitions to all types that fulfill the
 following requirements:
 
 * are random access ranges
 * element type is some character
 * offer .ptr as a @system property that offers a pointer to the first
 character
 
 This would allow us to generalize the notion of string and offer
 optimizations for user-defined, not only built-in, strings. Thoughts?
[...]

Recently there has been another heated debate on Github about whether ranges should auto-decode at all:

https://github.com/D-Programming-Language/phobos/pull/2423

Jonathan is about to write up a DIP for phasing out auto-decoding. Given that, I think we need to decide which direction to take, lest we waste energy on something that will be going away soon.


T

-- 
I don't trust computers, I've spent too long programming to think that they can get anything right. -- James Miller
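For context, a minimal illustration of the auto-decoding behavior under debate (using std.utf.byCodeUnit, added to Phobos later, to opt out):

```d
import std.range.primitives : front;
import std.utf : byCodeUnit;

void main()
{
    string s = "é";                      // one code point, two UTF-8 code units
    assert(s.length == 2);               // .length counts code units
    assert(s.front == '\u00E9');         // range primitives auto-decode to dchar
    assert(s.byCodeUnit.front == 0xC3);  // byCodeUnit opts out of decoding
}
```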
Aug 23 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/23/14, 6:39 PM, H. S. Teoh via Digitalmars-d wrote:
 On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via
Digitalmars-d wrote:
 Currently char[], wchar[], dchar[] and qualified variants fulfill the
 requirements of isSomeString. Also, char[], wchar[] and qualified
 variants fulfill the requirements of isNarrowString.

 Various algorithms in Phobos test for these traits to optimize away
 UTF decoding where unnecessary.

 I'm thinking of relaxing the definitions to all types that fulfill the
 following requirements:

 * are random access ranges
 * element type is some character
 * offer .ptr as a @system property that offers a pointer to the first
 character

 This would allow us to generalize the notion of string and offer
 optimizations for user-defined, not only built-in, strings. Thoughts?
[...]

Recently there has been another heated debate on Github about whether ranges should auto-decode at all:

https://github.com/D-Programming-Language/phobos/pull/2423

Jonathan is about to write up a DIP for phasing out auto-decoding. Given that, I think we need to decide which direction to take, lest we waste energy on something that will be going away soon.
I've read the thread now, thanks.

I think that discussion has lost perspective. There is talk about how decoding is slow, yet nobody (like, literally not one soul) looked at measuring and optimizing decoding. Everybody just assumes it's unacceptably slow; the irony is, it is, but mostly because it's not engineered for speed. Look e.g. at https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. That's a memory allocation each and every time decodeImpl is called. I'm not kidding. Take a look at http://goo.gl/p5pl3D vs. http://goo.gl/YL2iFN. It's egregious. The entire function, which should be small and nimble, is a behemoth. True, that function is only called for non-ASCII sequences, but the caller code can definitely be improved for speed (I see e.g. unnecessary passing of strings by reference, etc). A bunch of good work can be done to improve the efficiency of decoding tremendously.

If the DIP is based on Jonathan's comments in the pull request, I think it's a waste of everybody's time. It's a complete rewiring of the standard library fundamentals and a huge inconvenience for all users, for a slight improvement of a few scenarios. UTF strings are bidirectional ranges of variable-length encoded code units. Algorithms work with those. That is the matter of fact, and what makes UTF strings tick in D. It's the reason Python has a large schism between 2 and 3 over UTF strings while D is doing well, thank you very much. Considering D is faring a lot better than all languages I know of in the UTF realm, if auto-decoding was a mistake, well, it wasn't a bad one.

I suggest we focus efforts on solving real problems with creative solutions. Endless hand-wringing over small fish and forever redoing what's already there is not the way to progress. Speaking of which, this is why I discuss isNarrowString: I have recently gotten convinced of a number of things.
First, there will always be a category of users and applications who will NOT, for reasons good and bad, accept garbage collection. They just won't, and we can discuss how that sometimes is a bummer reflecting on human nature, but I don't see how framing the problem as a user education issue pushes D forward. It follows that we need to improve proper use of D for people who will not have the GC, by means of superior libraries. And if you think of it, this is very D-ish: we have a powerful, incredibly expressive language at our hands, designed explicitly to make powerful abstractions easy to define. So the right direction of exploration is better libraries.

One other thing I realized is that avoiding the GC by means of APIs based on output ranges Will Not Work(tm). Not all allocations can be successfully hoisted to the user in that manner. Output ranges are fine if the sole focus is producing sequential output, but the reality is that often more complex structures need to be allocated for things to work, and the proper way to hoist that into the client is by providing allocator services, not output ranges. Consider warp itself: it reads files and writes files, which is all linear and nice, but part of the processing involves storing macros, which form a complex internal data structure (e.g. a hash table). The linear output can be hoisted nicely to the client, but not the intricate allocation needed by macros.

Continuing to pull on that thread, it follows that we need to provide and thoroughly document data structures that support resource management, derived from tried-and-true practices in C++: std::string, std::unique_ptr, std::shared_ptr. For example, we have RefCounted. It just works transparently in a bunch of situations, but very few users actually know about it. Furthermore, it has imperfections and rough edges. Improving RefCounted (better code, better documentation, maybe DIPs prompting language improvements that make it better) would be real progress.
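For readers unfamiliar with it, a minimal sketch of std.typecons.RefCounted's deterministic destruction (the `Payload` struct and its counter are demo-only inventions):

```d
import std.typecons : RefCounted;

struct Payload
{
    static int destroyed;    // demo-only counter of payload destructions
    int value;
    ~this() { ++destroyed; }
}

void main()
{
    {
        auto a = RefCounted!Payload(42);
        auto b = a;                      // copying bumps the reference count
        assert(a.value == 42 && b.value == 42);
    }   // last handle gone: payload destroyed deterministically, not by GC
    assert(Payload.destroyed >= 1);
}
```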
To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use the GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that picks up litter and cycles is a modern, emergent pattern that D could use to great effect.

(Speaking of which: Some, but not all, types in std.container use reference counting. One other great area of improvement would be to guarantee that everything in std.container is reference counted. Containers are the perfect candidate for reference counting - they are typically large enough to make the reference counting overhead negligible by comparison with the typical work done on them.)

I'd like RCString to benefit from the optimizations for built-in strings, and that's why I was looking at relaxing isSomeString/isNarrowString.


Andrei
Aug 24 2014
next sibling parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu 
wrote:
 To that end I'm working on RCString, an industrial-strength 
 string type that's much like string, just reference counted and 
 with configurable allocation. It's safe, too - user code cannot 
 casually extract references to string internals. By default 
 allocation would use GC's primitives; one thing I learned to 
 appreciate about our GC is its ability to free memory 
 explicitly, which means RCString will free memory 
 deterministically most of the time, yet if you leak some (e.g. 
 by having RCString class members) the GC will pick up the 
 litter. I think reference counting backed up by a GC that lifts 
 litter and cycles and is a modern, emergent pattern that D 
 could use to great effect.
Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
Aug 24 2014
next sibling parent reply "Kiith-Sa" <kiithsacmp gmail.com> writes:
On Sunday, 24 August 2014 at 12:49:30 UTC, Marc Schütz wrote:
 On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu 
 wrote:
 To that end I'm working on RCString, an industrial-strength 
 string type that's much like string, just reference counted 
 and with configurable allocation. It's safe, too - user code 
 cannot casually extract references to string internals. By 
 default allocation would use GC's primitives; one thing I 
 learned to appreciate about our GC is its ability to free 
 memory explicitly, which means RCString will free memory 
 deterministically most of the time, yet if you leak some (e.g. 
 by having RCString class members) the GC will pick up the 
 litter. I think reference counting backed up by a GC that 
 lifts litter and cycles and is a modern, emergent pattern that 
 D could use to great effect.
Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
We already have an 'RCArray' (std.container.Array) although it's not perfect. We don't have a reference counted string type that would work with the string operation functions.
Aug 24 2014
parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Sunday, 24 August 2014 at 13:04:10 UTC, Kiith-Sa wrote:
 On Sunday, 24 August 2014 at 12:49:30 UTC, Marc Schütz wrote:
 On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu 
 wrote:
 To that end I'm working on RCString, an industrial-strength 
 string type that's much like string, just reference counted 
 and with configurable allocation. It's safe, too - user code 
 cannot casually extract references to string internals. By 
 default allocation would use GC's primitives; one thing I 
 learned to appreciate about our GC is its ability to free 
 memory explicitly, which means RCString will free memory 
 deterministically most of the time, yet if you leak some 
 (e.g. by having RCString class members) the GC will pick up 
 the litter. I think reference counting backed up by a GC that 
 lifts litter and cycles and is a modern, emergent pattern 
 that D could use to great effect.
Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
We already have an 'RCArray' (std.container.Array) although it's not perfect. We don't have a reference counted string type that would work with the string operation functions.
Well, if `isSomeString` were extended as Andrei suggests (and the string functions adapted a bit), I don't see why it shouldn't work. But I'm not sure std.container.Array really is refcounted. The documentation doesn't say so, and it shows neither a postblit nor an opAssign.
Aug 24 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/24/14, 6:20 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 But I'm not sure std.container.Array really is refcounted. The
 documentation doesn't say so, and it shows neither a postblit nor an
 opAssign.
It uses RefCounted:

https://github.com/D-Programming-Language/phobos/blob/master/std/container/array.d#L169


Andrei
Aug 24 2014
parent reply "Andrew Godfrey" <X y.com> writes:
The OP and the question of auto-decoding share the same root 
problem: Even though D does a lot better with UTF than other 
languages I've used, it still confuses characters with code 
points somewhat. "Element type is some character" is an example 
from OP. So clarify for me:
If a programmer makes an array of either 'char' or 'wchar', does 
that always, unambiguously, mean a UTF8 or UTF16 code point? E.g. 
If interoperating with C code, they will never make the mistake 
of using these types for a non-string byte/word array?

If and only if this is true, then D has done well and I'm 
unafraid of duck-typing here.
Aug 24 2014
next sibling parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Sunday, 24 August 2014 at 18:19:45 UTC, Andrew Godfrey wrote:
 The OP and the question of auto-decoding share the same root 
 problem: Even though D does a lot better with UTF than other 
 languages I've used, it still confuses characters with code 
 points somewhat. "Element type is some character" is an example 
 from OP. So clarify for me:
 If a programmer makes an array of either 'char' or 'wchar', 
 does that always, unambiguously, mean a UTF8 or UTF16 code 
 point?
It has to, because it is required by the specification. But ...
 E.g. If interoperating with C code, they will never make the 
 mistake of using these types for a non-string byte/word array?
... of course this cannot be guaranteed. In fact, even the druntime currently just assumes that program arguments and environment variables are UTF8 encoded on Unix, AFAIK. This is true in most cases, but of course not guaranteed. Potentially also problematic are the functions taking filenames. In Unix, filenames are just opaque arrays of bytes, but those functions take `string` parameters, i.e. assuming UTF8 encoding. This could force the user to place non-UTF8 sequences into strings.
Aug 24 2014
parent reply ketmar via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Sun, 24 Aug 2014 18:28:51 +0000
via Digitalmars-d <digitalmars-d puremagic.com> wrote:

and the D parser unconditionally assumes that source code is in UTF-8 too,
which is a great PITA for me: i can't use strings (any string-like
literals, actually) in Latin-1 (for example) without ugly
"\x..\x..\x..". hate it.
Aug 24 2014
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Aug-2014 22:36, ketmar via Digitalmars-d writes:
 On Sun, 24 Aug 2014 18:28:51 +0000
 via Digitalmars-d <digitalmars-d puremagic.com> wrote:

 and D parser unconditionally assumes that source code is in UTF-8 too.
 which makes great PITA for me: i can't use strings (any string-like
 literals, actually) in Latin-1 (for example) without ugly
 "\x..\x..\x..". hate it.
Just make a function that translates UTF-8 into Latin-1, and use ubyte for Latin-1. Then this neat trick will keep the literal a compile-time constant:

enum latin(alias str) = toLatin1(str);

(where toLatin1 is a normal function that does the translation)

Then use it as:

latin!"some-string-in-latin1"

-- 
Dmitry Olshansky
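A possible toLatin1 to complete the trick (a hypothetical sketch; any UTF-8 to Latin-1 transcoder usable in CTFE would do):

```d
// Hypothetical toLatin1: transcodes a UTF-8 string to Latin-1 bytes.
// It is CTFE-able, so the `latin` template below stays a compile-time
// constant. Characters outside Latin-1 are rejected with an assert.
ubyte[] toLatin1(string s)
{
    ubyte[] result;
    foreach (dchar c; s)    // foreach over string decodes UTF-8 code points
    {
        assert(c <= 0xFF, "character not representable in Latin-1");
        result ~= cast(ubyte) c;
    }
    return result;
}

enum latin(alias str) = toLatin1(str);

// 'é' is U+00E9: two code units in UTF-8, one byte (0xE9) in Latin-1.
static assert(latin!"héllo" == [0x68, 0xE9, 0x6C, 0x6C, 0x6F]);

void main() {}
```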
Aug 24 2014
parent ketmar via Digitalmars-d <digitalmars-d puremagic.com> writes:
On Mon, 25 Aug 2014 01:23:57 +0400
Dmitry Olshansky via Digitalmars-d <digitalmars-d puremagic.com> wrote:

 Just make a function that translates UTF-8 into Latin-1, use ubyte
 for Latin-1.
i don't want ubytes. but there is nothing bad about assigning latin-1 text to an ordinary string after conversion. ;-) the downside is memory consumption. i have ALOT of ctfe-magic, compiling some modules easily hits memory limits (yeah, x86; no, i'm not planning to switch to x86_64 soon), and calling tr-functions trashes memory even more (i'm not even talking about code readability here).

what i really want is a "verbatim" string literal, which scanner, lexer and parser just pass "as-is", byte-to-byte, without trying to decode it as UTF-8. or (this can be easier to implement, i think) at least something like pragma(encoding, "..."), with at least two working values: "utf-8" and "native".
Aug 24 2014
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Aug-2014 22:19, Andrew Godfrey writes:
 The OP and the question of auto-decoding share the same root problem:
 Even though D does a lot better with UTF than other languages I've used,
 it still confuses characters with code points somewhat. "Element type is
 some character" is an example from OP. So clarify for me:
 If a programmer makes an array of either 'char' or 'wchar', does that
 always, unambiguously, mean a UTF8 or UTF16 code point?
Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.
 E.g. If
 interoperating with C code, they will never make the mistake of using
 these types for a non-string byte/word array?
char != byte, and the compiler will reject pointer and array assignments of byte* to char*, ubyte[] to char[], etc. Values themselves are convertible, so those would work with implicit conversion.
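A quick illustration of that asymmetry (arrays and pointers are rejected; single values convert):

```d
void main()
{
    ubyte[] bytes = [0x41, 0x42];

    // Array and pointer assignments across char/ubyte are compile errors:
    static assert(!__traits(compiles, { char[] s = bytes; }));
    static assert(!__traits(compiles, { char* p = bytes.ptr; }));

    // Individual values convert implicitly in both directions:
    char c = bytes[0];
    ubyte u = 'A';
    assert(c == 'A' && u == 0x41);
}
```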
 If and only if this is true, then D has done well and I'm unafraid of
 duck-typing here.
-- Dmitry Olshansky
Aug 24 2014
next sibling parent "Andrew Godfrey" <x y.com> writes:
On Sunday, 24 August 2014 at 18:43:36 UTC, Dmitry Olshansky wrote:
24-Aug-2014 22:19, Andrew Godfrey writes:
 The OP and the question of auto-decoding share the same root 
 problem:
 Even though D does a lot better with UTF than other languages 
 I've used,
 it still confuses characters with code points somewhat. 
 "Element type is
 some character" is an example from OP. So clarify for me:
 If a programmer makes an array of either 'char' or 'wchar', 
 does that
 always, unambiguously, mean a UTF8 or UTF16 code point?
Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.
 E.g. If
 interoperating with C code, they will never make the mistake 
 of using
 these types for a non-string byte/word array?
char != byte, and compiler will reject pointer and array assignments of byte* to char*, ubyte[] to char[] etc. Values themselves are convertible, so would work with implicit conversion.
 If and only if this is true, then D has done well and I'm 
 unafraid of
 duck-typing here.
Both your answers are at the level of the compiler/language spec. Relevant, yes, but not complete. E.g. how often will people manually converting a .h file translate C "const char *" correctly to either something char-based or something ubyte-based, depending on whether it represents UTF-8 code points? How often will they even know?

With wchar it's probably even worse, because of APIs that use one type but where, depending on other parameters, the string elements can be UTF-16 code points or glyph indices.
Aug 24 2014
prev sibling parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Sunday, 24 August 2014 at 18:43:36 UTC, Dmitry Olshansky wrote:
24-Aug-2014 22:19, Andrew Godfrey writes:
 The OP and the question of auto-decoding share the same root 
 problem:
 Even though D does a lot better with UTF than other languages 
 I've used,
 it still confuses characters with code points somewhat. 
 "Element type is
 some character" is an example from OP. So clarify for me:
 If a programmer makes an array of either 'char' or 'wchar', 
 does that
 always, unambiguously, mean a UTF8 or UTF16 code point?
Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.
Right, not just pedantically, but crucially - didn't read it correctly.
Aug 25 2014
parent "Andrew Godfrey" <X y.com> writes:
On Monday, 25 August 2014 at 08:07:58 UTC, Marc Schütz wrote:
 On Sunday, 24 August 2014 at 18:43:36 UTC, Dmitry Olshansky 
 wrote:
24-Aug-2014 22:19, Andrew Godfrey writes:
 The OP and the question of auto-decoding share the same root 
 problem:
 Even though D does a lot better with UTF than other languages 
 I've used,
 it still confuses characters with code points somewhat. 
 "Element type is
 some character" is an example from OP. So clarify for me:
 If a programmer makes an array of either 'char' or 'wchar', 
 does that
 always, unambiguously, mean a UTF8 or UTF16 code point?
Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.
Right, not just pedantically, but crucially - didn't read it correctly.
Oops, I did it myself. I meant code unit. Anyway, see my earlier reply - I wasn't asking what the language spec says. I'm asking about the reality given what happens when D code interfaces with e.g. C/C++ code.
Aug 25 2014
prev sibling parent reply "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, Aug 24, 2014 at 06:19:44PM +0000, Andrew Godfrey via Digitalmars-d
wrote:
[...]
 So clarify for me: If a programmer makes an array of either 'char' or
 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code
 point?
[...]

In D, an array of char, wchar, or dchar always means a Unicode encoding. Non-Unicode encodings should be represented as ubyte[] (resp. ushort[] or uint[], if such exist) instead.


T

-- 
If you want to solve a problem, you need to address its root cause, not just its symptoms. Otherwise it's like treating cancer with Tylenol...
Aug 24 2014
parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Monday, 25 August 2014 at 01:31:35 UTC, H. S. Teoh via 
Digitalmars-d wrote:
 In D, an array of char, wchar, or dchar always means a Unicode 
 encoding.
 Non-Unicode encodings should be represented as ubyte[] (resp. 
 ushort[]
 or uint[], if such exist) instead.
This doesn't get you far in practice if you want to actually operate on the text.
Aug 24 2014
next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Monday, 25 August 2014 at 02:40:20 UTC, Vladimir Panteleev 
wrote:
 On Monday, 25 August 2014 at 01:31:35 UTC, H. S. Teoh via 
 Digitalmars-d wrote:
 In D, an array of char, wchar, or dchar always means a Unicode 
 encoding.
 Non-Unicode encodings should be represented as ubyte[] (resp. 
 ushort[]
 or uint[], if such exist) instead.
This doesn't get you far in practice if you want to actually operate on the text.
Well, all of the non-string-specific stuff (like find) will work just fine, but since all of the string-specific functions assume UTF-8, UTF-16, or UTF-32, you'll have to convert it. We can't really do otherwise, because you have to know what encoding you're dealing with to operate on it as a string, and that means that you need to either call specific functions which expect the encoding that you're using, or you need types specific to those encodings (in which case, you wouldn't use ubyte[] and the like directly).

We do need better support for other encodings, but I don't think that it really costs us anything to treat char as UTF-8, wchar as UTF-16, and dchar as UTF-32 and require that other encodings use different representations.

- Jonathan M Davis
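To illustrate the split: generic algorithms are encoding-agnostic and accept ubyte[], while string-specific functions require character types:

```d
import std.algorithm.searching : canFind, find;

void main()
{
    // Text in some unknown single-byte encoding, stored as raw bytes:
    ubyte[] data = [0x48, 0x49, 0x21];   // "HI!" in ASCII-compatible encodings

    // Generic algorithms work fine on ubyte[]:
    assert(data.canFind(0x49));
    assert(find(data, 0x49) == [0x49, 0x21]);

    // String-specific functions (std.string, std.uni, ...) expect
    // char/wchar/dchar and would require an explicit conversion first.
}
```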
Aug 24 2014
prev sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Mon, Aug 25, 2014 at 02:40:19AM +0000, Vladimir Panteleev via Digitalmars-d
wrote:
 On Monday, 25 August 2014 at 01:31:35 UTC, H. S. Teoh via Digitalmars-d
 wrote:
In D, an array of char, wchar, or dchar always means a Unicode
encoding.  Non-Unicode encodings should be represented as ubyte[]
(resp. ushort[] or uint[], if such exist) instead.
This doesn't get you far in practice if you want to actually operate on the text.
We need a major revamp of std.encoding. I envision it using CTFE and template magic to make it possible to statically optimize manipulating strings of any supported encoding via statically-wrapped non-UTF string types, as well as dynamic types for dealing with encodings decided at runtime. If possible, I envision being able to use std.conv.to to convert between encodings via compile-time customization hooks for to!().


T

-- 
Live a century, learn a century - and you'll still die a fool. (Russian proverb)
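std.encoding already offers a starting point for this; e.g. transcoding between UTF-8 and Latin-1 with the current API (a sketch):

```d
import std.encoding : Latin1String, transcode;

void main()
{
    string utf = "héllo";      // 6 UTF-8 code units, 5 code points
    Latin1String latin;
    transcode(utf, latin);     // existing std.encoding facility
    assert(latin.length == 5); // 5 Latin-1 bytes

    string back;
    transcode(latin, back);    // and back to UTF-8
    assert(back == "héllo");
}
```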
Aug 24 2014
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/24/14, 5:49 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:
 To that end I'm working on RCString, an industrial-strength string
 type that's much like string, just reference counted and with
 configurable allocation. It's safe, too - user code cannot casually
 extract references to string internals. By default allocation would
 use GC's primitives; one thing I learned to appreciate about our GC is
 its ability to free memory explicitly, which means RCString will free
 memory deterministically most of the time, yet if you leak some (e.g.
 by having RCString class members) the GC will pick up the litter. I
 think reference counting backed up by a GC that lifts litter and
 cycles and is a modern, emergent pattern that D could use to great
 effect.
Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
Thanks for asking. Strings are not special, but are a good place to start because (a) they are ubiquitous, so good bang for the buck; (b) they don't have indirections to worry about; (c) they have a readily available built-in to compare against.

Also, RCString is different from e.g. std.container.Array in that the former has slice semantics whereas the latter has pure reference semantics.


Andrei
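The semantic difference in a nutshell (copies of Array share storage, whereas slice references can be rebound independently):

```d
import std.container : Array;

void main()
{
    // std.container.Array: reference semantics - copies share one payload.
    auto a = Array!int(1, 2, 3);
    auto b = a;
    b[0] = 42;
    assert(a[0] == 42);          // the change is visible through both handles

    // Built-in slices: rebinding one reference never affects the other.
    int[] s = [1, 2, 3];
    int[] t = s;
    t = t ~ 4;                   // ~ allocates a fresh array for t
    assert(s == [1, 2, 3]);      // s is unaffected
}
```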
Aug 24 2014
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Aug-2014 16:24, Andrei Alexandrescu writes:
 On 8/23/14, 6:39 PM, H. S. Teoh via Digitalmars-d wrote:
 On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via
 Digitalmars-d wrote:
 Currently char[], wchar[], dchar[] and qualified variants fulfill the
 requirements of isSomeString. Also, char[], wchar[] and qualified
 variants fulfill the requirements of isNarrowString.

 Various algorithms in Phobos test for these traits to optimize away
 UTF decoding where unnecessary.

 I'm thinking of relaxing the definitions to all types that fulfill the
 following requirements:

 * are random access ranges
 * element type is some character
 * offer .ptr as a @system property that offers a pointer to the first
 character

 This would allow us to generalize the notion of string and offer
 optimizations for user-defined, not only built-in, strings. Thoughts?
[...]

Recently there has been another heated debate on Github about whether ranges should auto-decode at all:

https://github.com/D-Programming-Language/phobos/pull/2423

Jonathan is about to write up a DIP for phasing out auto-decoding. Given that, I think we need to decide which direction to take, lest we waste energy on something that will be going away soon.
I've read the thread now, thanks. I think that discussion has lost perspective. There is talk about how decoding is slow, yet nobody (like, literally not one soul) looked at measuring and optimizing decoding. Everybody just assumes it's unacceptably slow; the irony is, it is, but mostly because it's not engineered for speed.
I'd be damned if I didn't measure. Do not assume things if there is little actual data. Take a peek at my tests (yeah, rough stuff, but it does the job):

https://github.com/DmitryOlshansky/gsoc-bench-2012

Let's count the number of codepoints in a file (loading it into memory before measuring stuff):

====================
decode-only [arwiki], 368312 hits, 3.16312, 205.58 Mb/s
noop        [arwiki], 599328 hits, 0.67872, 958.11 Mb/s
====================
decode-only [ruwiki], 523414 hits, 4.54544, 206.23 Mb/s
noop        [ruwiki], 884374 hits, 0.7898, 1186.92 Mb/s
====================
decode-only [dewiki], 1773464 hits, 5.63744, 318.28 Mb/s
noop        [dewiki], 1607454 hits, 2.1322, 841.52 Mb/s

With that I can say that decode *is* engineered for speed. It processes about 200-300 million codepoints per second for me. Compare that with a baseline of 840 Mbytes/sec for just testing whether each byte is > 0x20.

With LDC both numbers are about 2 times better, but the picture stays the same. As you can see there is a significant gap there, and decoding is not yet memory-bound. It's still a cost that hinders top-performance code.
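The decode-only measurement amounts to a loop like this (a simplified sketch of the benchmark's inner loop, not the exact harness code):

```d
import std.utf : decode;

// Walk the string one UTF-8 sequence at a time, decoding each into a
// code point, as in the "decode-only" rows above.
size_t countCodePoints(string s)
{
    size_t n, i;
    while (i < s.length)
    {
        decode(s, i);    // decodes one sequence and advances i past it
        ++n;
    }
    return n;
}

void main()
{
    assert("héllo".length == 6);            // code units
    assert(countCodePoints("héllo") == 5);  // code points
}
```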
 Look e.g. at
 https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074.
 That's a memory allocation each and every time decodeImpl is called. I'm
 not kidding. Take a look at http://goo.gl/p5pl3D vs.
You are completely wrong on this count. I'm horrified to learn that even D experts are unaware of how `enum` works - or you've got to be kidding me. enum doesn't allocate but serves as a `copy-paste` token, effectively putting said literal everywhere it is used.

Now look at the usage pattern:

https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1173

(where 'i' is a constant from a static foreach)

The compiler sees it as:

[(1 << 7) - 1, (1 << 11) - 1, (1 << 16) - 1, (1 << 21) - 1][0]

and const-folds it accordingly.
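Both observations are in fact compatible; a small demonstration (`utfMax` is a stand-in for the table in std.utf, not its actual name):

```d
enum utfMax = [(1 << 7) - 1, (1 << 11) - 1, (1 << 16) - 1, (1 << 21) - 1];

// Indexing with a compile-time constant is const-folded to a plain
// integer; no array is materialized at run time for these:
enum asciiMax = utfMax[0];
static assert(asciiMax == 127);
static assert(utfMax[3] == 2_097_151);

void main()
{
    // By contrast, using the enum array as a run-time value pastes the
    // array literal here and allocates it afresh - Andrei's observation.
    auto copy = utfMax;
    assert(copy.length == 4 && copy[0] == asciiMax);
}
```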
 I'd like RCString to benefit of the optimizations for built-in strings,
 and that's why I was looking at relaxing isSomeString/isNarrowString.
On this I agree - isSomeString is an ugly, half-assed hack if we can't have other kinds of strings. Also consider Array!char, which in fact should be pretty close to RCString.

I'm of the opinion that there are 2 ways out of 'decode by default sucks':

a) Make it finally generic and clearly state what a DecodableRange is. Stick with that everywhere.
b) To hell with that and just remove implicit decoding.

Else it really-really sucks.

-- 
Dmitry Olshansky
Aug 24 2014
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Aug-2014 16:24, Andrei Alexandrescu writes:
 (Speaking of which: Some, but not all, types in std.container use
 reference counting. One other great area of improvement would be to
 guarantee that everything is std.container is reference counted.
 Containers are the perfect candidate for reference counting - they are
 typically large enough to make the reference counting overhead
 negligible by comparison with the typical work on them.)
The rest of the post is mostly in line with my reasoning (except for "nobody measures stuff", which is just bullshit).

Speaking of data structures, I find just about the opposite. Most data structures are small, which must be the fact so fondly used by C++ vector: small-string optimization. Only very few data structures are large in a given program, and those usually correspond to some global tables and repositories. Others are either short-lived byproducts of input processing or small data sets attached to some global entity.

-- 
Dmitry Olshansky
Aug 24 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/24/14, 6:16 AM, Dmitry Olshansky wrote:
24-Aug-2014 16:24, Andrei Alexandrescu writes:
 (Speaking of which: Some, but not all, types in std.container use
 reference counting. One other great area of improvement would be to
 guarantee that everything is std.container is reference counted.
 Containers are the perfect candidate for reference counting - they are
 typically large enough to make the reference counting overhead
 negligible by comparison with the typical work on them.)
The rest of the post is mostly in line with my reasoning (except for "nobody measures stuff", which is just bullshit).
Apologies for that, thanks for setting me straight.
 Speaking of data-structures I find just about the opposite. Most data
 structure are small, which must be the fact so fondly used by C++
 vector: small-string optimization. Only very few data-structures are
 large in a given program, and usually correspond to some global tables
 and repositories. Others are either short lived byproduct of input
 processing or are small data-sets attached to some global entity.
I don't know of any std::vector that uses the small string optimization. std::string does ubiquitously because (a) strings are often handled as values, and (b) C++11 put refcounted strings into illegality (forced mistake) therefore robbing implementers of an important optimization.

In a way both C++ and D got it "wrong". Arrays/containers are entity types - they have identity and should be manipulated most often by reference. Presence of pass-by-value of containers in C++ programs, save for rvalue optimization purposes, is suspicious. In contrast, strings are value types - they are handled most often as a unit and passed by value, just like e.g. numbers.

C++ made both containers and strings value types, so it needs forever to look over its shoulder about n00bs copying large containers unwittingly. It also does a fair amount of unneeded string copying, and optimizing string-based C++ code is nontrivial. D made both arrays and strings slices, a data structure made highly expressive by the garbage collector but one that occasionally confuses people. With std.refcounted.RCString and std.container.Array we get both abstractions "right".

Andrei
Aug 24 2014
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Aug-2014 21:59, Andrei Alexandrescu пишет:
 On 8/24/14, 6:16 AM, Dmitry Olshansky wrote:
 24-Aug-2014 16:24, Andrei Alexandrescu пишет:

 Speaking of data-structures I find just about the opposite. Most data
 structure are small, which must be the fact so fondly used by C++
 vector: small-string optimization. Only very few data-structures are
 large in a given program, and usually correspond to some global tables
 and repositories. Others are either short lived byproduct of input
 processing or are small data-sets attached to some global entity.
I don't know of any std::vector that uses the small string optimization.
This time it's me who must be wrong. Yet I see that this is a recognized need: https://github.com/facebook/folly/blob/master/folly/small_vector.h LLVM folks seem to do the same. With that in mind, small containers might be better as a special value type.
 std::string does ubiquitously because (a) strings are often handled as
 values, and (b) C++11 put refcounted strings into illegality (forced
 mistake) therefore robbing implementers of an important optimization.

 In a way both C++ and D got it "wrong". Arrays/containers are entity
 types - they have identity and should be manipulated most often by
 reference. Presence of pass-by-value of containers in C++ programs, save
 for rvalue optimization purposes, is suspicious.
Agreed.
 In contrast, strings
 are value types - they are handled most often as a unit and passed by
 value, just like e.g. numbers.
Indeed, just note that it would be really nice not to actually copy strings. In fact it seems that copy-on-write (with ref-counting) for big strings and small-string optimization for small ones is almost an ideal solution. In D we could have non-atomic (thread-local) copy-on-write, which should be quite fast.
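The thread-local copy-on-write idea could be sketched roughly like this; all names are invented for illustration (this is not RCString), and the non-atomic count assumes the string never crosses threads:

```d
// Rough sketch of non-atomic (thread-local) copy-on-write.
struct CowString
{
    private static struct Payload { size_t refs; char[] data; }
    private Payload* p;

    this(const(char)[] s) { p = new Payload(1, s.dup); }
    this(this) { if (p) ++p.refs; }   // copying is cheap: just bump the count
    ~this() { if (p) --p.refs; }      // the last owner could free eagerly here

    void opIndexAssign(char c, size_t i)
    {
        if (p.refs > 1)               // shared with someone: copy before writing
        {
            --p.refs;
            p = new Payload(1, p.data.dup);
        }
        p.data[i] = c;
    }
}
```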
 C++ made both containers and strings value types, so it needs forever to
 look over its shoulder about n00bs copying large containers unwittingly.
 It also does a fair amount of unneeded string copying, and optimizing
 string-based C++ code is nontrivial. D made both arrays and strings
 slices, a data structure made highly expressive by the garbage collector
 but that occasionally confuses people. With std.refcounted.RCString and
 std.container.Array we get both abstractions "right".
Good points. -- Dmitry Olshansky
Aug 24 2014
prev sibling next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu 
wrote:
 Look e.g. at 
 https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. 
 That's a memory allocation each and every time decodeImpl is 
 called. I'm not kidding. Take a look at http://goo.gl/p5pl3D 
 vs. http://goo.gl/YL2iFN. It's egregious.
Dmitry already replied, but I want to stress the reply. That's complete BS. decodeImpl does *not* allocate. That line of code has been commented on, tested and benched. The enum is only ever used for CT indexing, which will *not* allocate. I'm repeatedly seeing people complain that decoding is slow. Maybe. And that it "allocates". But that's BS. A lot of people have been repeating this false information, and I find it worrisome that both you and Walter have now made this claim.
Aug 25 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/25/14, 1:21 PM, monarch_dodra wrote:
 On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:
 Look e.g. at
 https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074.
 That's a memory allocation each and every time decodeImpl is called.
 I'm not kidding. Take a look at http://goo.gl/p5pl3D vs.
 http://goo.gl/YL2iFN. It's egregious.
Dmitry already replied, but I want to stress the reply. That's complete BS. decodeImpl does *not* allocate. That line of code has been commented on, tested and benched. The enum is only ever used for CT indexing, which will *not* allocate. I'm repeatedly seeing people complain that decoding is slow. Maybe. And that it "allocates". But that's BS. A lot of people have been repeating this false information, and I find it worrisome that both you and Walter have now made this claim.
That escalated quickly. Chillax. The use of that enum array is fine but somewhat fragile - a small refactoring may trigger an allocation. Trying a static immutable array may be in order. The larger point is there's significant optimization work that can be done on decoding before looking at other changes. Andrei
Aug 25 2014
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 25 August 2014 at 21:11:52 UTC, Andrei Alexandrescu 
wrote:
 That escalated quickly. Chillax.
Sorry. It's just that I've been seeing this claim way too frequently, especially in learn, where there have been too many threads about trying to avoid decoding. Most often, for the wrong reasons.
 The use of that enum array is fine but somewhat fragile - a 
 small refactoring may trigger an allocation.
Declaring it as a simple 4-member TypeTuple would eliminate the (potential) issue, as a TT *can't* be runtime indexed. Seems like a good compromise.
 Trying a static immutable array may be in order.
That would actually create an object, and potentially prevent optimizations (I think, maybe). The TT seems cleaner to me anyways. I'll create a pull for it.
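The compromise could look roughly like this; the masks mirror the enum array in std/utf.d's decodeImpl, but the function name here is illustrative:

```d
// Sketch: replacing the enum array with a TypeTuple (std.meta.AliasSeq in
// later releases) makes runtime indexing a compile error, so no hidden
// array literal can ever be allocated.
import std.typetuple : TypeTuple;

// the same masks as the enum array in std/utf.d's decodeImpl
alias bitMask = TypeTuple!((1 << 7) - 1, (1 << 5) - 1, (1 << 4) - 1, (1 << 3) - 1);

char maskOff(size_t i)(char c) // i must be known at compile time
{
    return cast(char)(c & bitMask[i]); // compile-time indexing only
}
```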
Aug 25 2014
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 25 August 2014 at 22:04:34 UTC, monarch_dodra wrote:
 I'll create a pull for it.
https://github.com/D-Programming-Language/phobos/pull/2464
Aug 25 2014
prev sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu 
wrote:
 To that end I'm working on RCString, an industrial-strength 
 string type that's much like string, just reference counted and 
 with configurable allocation. It's safe, too - user code cannot 
 casually extract references to string internals. By default 
 allocation would use GC's primitives; one thing I learned to 
 appreciate about our GC is its ability to free memory 
 explicitly, which means RCString will free memory 
 deterministically most of the time, yet if you leak some (e.g. 
 by having RCString class members) the GC will pick up the 
 litter. I think reference counting backed up by a GC that lifts 
 litter and cycles is a modern, emergent pattern that D 
 could use to great effect.

 (Speaking of which: Some, but not all, types in std.container 
 use reference counting. One other great area of improvement 
 would be to guarantee that everything is std.container is 
 reference counted. Containers are the perfect candidate for 
 reference counting - they are typically large enough to make 
 the reference counting overhead negligible by comparison with 
 the typical work on them.)
One issue this proposal seems to forget (and it's a problem that transcends D) is that the GC does not finalize structs. Your RC proposal is fine and good for strings, because the individual chars don't have destructors. But unless we migrate *everything* to using RC, we'd still be leaking non-memory resources. For example, "File" is reference counted, and I've seen people time and time again get had, because they use a "File[]". Oops. Imo, this is a big issue. Are there any plans to tackle this problem?

Another issue we encounter a lot with reference type objects that do RC is one of initial initialization/allocation. Currently, D does not have default constructors. I understand why. But it makes it painfully difficult to implement run-time initialization with no arguments while avoiding user errors. This has been a problem time and time again for objects such as Appender or T[U], in that aliasing only happens *after* the first operation. Has this been discussed again yet? We have "T.init". Why couldn't we have default construction, which can be explicitly skipped with "T a = T.init;"? I realize this derails the conversation a bit, but I think it is related enough to warrant mentioning.
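The File[] pitfall mentioned above, as a minimal illustration:

```d
// File is a reference-counted struct, but when the GC collects the array
// below it does not (at the time of this thread) run the struct
// destructors, so the OS handles leak until the process exits instead of
// being closed deterministically.
import std.stdio : File;

void leaky()
{
    File[] logs;
    foreach (name; ["a.txt", "b.txt"])
        logs ~= File(name, "w");
    // `logs` goes out of scope; the GC reclaims the memory eventually,
    // but File.~this may never run, so both handles stay open.
}
```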
Aug 25 2014
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/25/14, 1:35 PM, monarch_dodra wrote:
 One issue this proposal seems to forget (and it's a problem that
 transcends D), is that the GC does not finalize structs.
It will. Andrei
Aug 25 2014
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 25 August 2014 at 21:12:49 UTC, Andrei Alexandrescu 
wrote:
 On 8/25/14, 1:35 PM, monarch_dodra wrote:
 One issue this proposal seems to forget (and it's a problem 
 that
 transcends D), is that the GC does not finalize structs.
It will. Andrei
Awesome.
Aug 25 2014
parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Monday, 25 August 2014 at 22:00:30 UTC, monarch_dodra wrote:
 On Monday, 25 August 2014 at 21:12:49 UTC, Andrei Alexandrescu 
 wrote:
 On 8/25/14, 1:35 PM, monarch_dodra wrote:
 One issue this proposal seems to forget (and it's a problem 
 that
 transcends D), is that the GC does not finalize structs.
It will. Andrei
Awesome.
There's a PR currently in review which fixes this (though IIRC there were some bugs with it that were being worked out). - Jonathan M Davis
Aug 25 2014
prev sibling next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu 
wrote:
 Currently char[], wchar[], dchar[] and qualified variants 
 fulfill the requirements of isSomeString. Also, char[], wchar[] 
 and qualified variants fulfill the requirements of 
 isNarrowString.

 Various algorithms in Phobos test for these traits to optimize 
 away UTF decoding where unnecessary.

 I'm thinking of relaxing the definitions to all types that 
 fulfill the following requirements:

 * are random access ranges
 * element type is some character
 * offer .ptr as a @system property that offers a pointer to the 
 first character

 This would allow us to generalize the notion of string and 
 offer optimizations for user-defined, not only built-in, 
 strings. Thoughts?
I don't know. My first reaction is to be very, very worried about this. We made it so that isSomeString, isIntegral, etc. test for the specific type and not implicit conversion precisely to avoid problems where the type is assumed to be one of the built-in types, and the code then does weird things or just doesn't compile.

If we had done this with isSomeString originally, I'm sure that we could make it work, but you're talking about a type that isn't an array, and I expect that there's a lot of code out there that assumes that something that passes isSomeString is an array. And there's a good chance that a lot of the functions that do won't react well to a user-defined type. I think that if we go that route, we should probably create a new trait for it rather than reuse isSomeString.

In addition, I'd be very concerned about using ptr like that, because the main use case for that aside from trying to avoid bounds checks is to pass to a C function, and that's going to need ptr to give you an actual array, in which case I start wondering why we're not just using arrays.

Also, what optimizations are you looking for here? To avoid auto-decoding? Something else? A lot of the compiler/druntime/Phobos devs are coming to the conclusion that the auto-decoding was a mistake and that we should look at removing it if we can and change it so that strings are treated as ranges of their element type rather than dchar. I'm even in the process of working on a DIP for just that, and making the transition would be nowhere near as bad as I was initially expecting it to be. I think that we can actually do it fairly cleanly. If we did that, then I'm not sure what gain there would be in doing something like this to isSomeString or even in creating a new trait. Simply having a random-access range whose element type is a character type would do it already. The only part that it might miss out on is using ptr to skip bounds checking, and I don't know how much sense that makes with a user-defined type anyway given that ptr would have to be giving you a pointer to an array to work wherever ptr works for arrays.

- Jonathan M Davis
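A sketch of such a new trait (the name isSomeStringLike is invented, not in Phobos). Note that narrow strings themselves would fail this check today, because auto-decoding makes Phobos deny them random access and length - which is part of the very friction being discussed:

```d
// Hypothetical trait following Andrei's three requirements.
import std.range : ElementEncodingType, hasLength, isRandomAccessRange;
import std.traits : isSomeChar;

enum isSomeStringLike(T) = isRandomAccessRange!T
    && hasLength!T
    && isSomeChar!(ElementEncodingType!T)
    && is(typeof(T.init.ptr) : const(ElementEncodingType!T)*);

// dchar[] is a true random-access range with .ptr, so it qualifies...
static assert(isSomeStringLike!(dchar[]));
// ...but char[] does not: auto-decoding makes Phobos reject it as a
// random-access range in the first place.
static assert(!isSomeStringLike!(char[]));
```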
Aug 23 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/23/14, 7:05 PM, Jonathan M Davis wrote:
 I don't know. My first reaction is to be very, very worried about this.
 We made it so that isSomeString, isIntegral, etc. test for the specific
 type and not implicit conversion precisely to avoid problems where the
 type is assumed to be one of the built-in types, and the code then does
 weird things or just doesn't compile.
I guess unittests are for that kind of stuff. One good start would be to see what kind of assumptions are made by algorithms specialized on isSomeString.
 If we had done this with isSomeString originally, I'm sure that we could
 make it work, but you're talking about a type that isn't an array, and I
 expect that there's a lot of code out there that assumes that something
 that passes isSomeString is an array. And there's a good chance that a
 lot of the functions that do won't react well to a user-defined type.

 I think that if we go that route, we should probably create a new trait
 for it rather than reuse isSomeString.
Yah, that would be a safe path but would complicate checks.
 In addition, I'd be very concerned about using ptr like that, because
 the main use case for that aside from trying to avoid bounds checks is
 to pass to a C function, and that's going to need ptr to give you an
 actual array, in which case I start wondering why we're not just using
 arrays.
Because the abstraction I'm thinking of is reference counted.
 Also, what optimizations are you looking for here? To avoid
 auto-decoding? Something else?
It's just taking advantage of the work that's already been done in std.algorithm and elsewhere.
 A lot of the compiler/druntime/Phobos
 devs are coming to the conclusion that the auto-decoding was a mistake
 and that we should look at removing it if we can and change it so that
 strings are treated as ranges of their element type rather than dchar.
 I'm even in the process of working on a DIP for just that, and making
 the transition would be nowhere near as bad as I was initially expecting
 it to be. I think that we can actually do it fairly cleanly. If we did
 that, then I'm not sure what gain there would be in doing something like
 this to isSomeString or even in creating a new trait. Simply having a
 random-access range whose element type is a character type would do it
 already. The only part that it might miss out on is using ptr to skip
 bounds checking, and I don't know how much sense that makes with a
 user-defined type anyway given that ptr would have to be giving you a
 pointer to an array to work wherever ptr works for arrays.
As I wrote at length a few minutes ago, I think such a DIP would be a waste of time and energy. I encourage everyone to work on pushing things forward with new work instead of forever fiddling with what we have "for the last time". Andrei
Aug 24 2014
prev sibling next sibling parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu 
wrote:
 Currently char[], wchar[], dchar[] and qualified variants 
 fulfill the requirements of isSomeString. Also, char[], wchar[] 
 and qualified variants fulfill the requirements of 
 isNarrowString.

 Various algorithms in Phobos test for these traits to optimize 
 away UTF decoding where unnecessary.

 I'm thinking of relaxing the definitions to all types that 
 fulfill the following requirements:

 * are random access ranges
 * element type is some character
 * offer .ptr as a @system property that offers a pointer to the 
 first character

 This would allow us to generalize the notion of string and 
 offer optimizations for user-defined, not only built-in, 
 strings. Thoughts?
IMO in general it's a good proposal; I expected `isSomeString` to return `true` for higher-order ranges over strings, and was surprised that it didn't. But `.ptr` must only be included if the elements are indeed contiguous in memory. For example, `std.algorithm.filter` applied to a string should not provide `.ptr`, but should IMO be treated as a string nevertheless. Related: What about `.length`? RA ranges can also be infinite...
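Marc's point can be shown with a short sketch (behavior as of Phobos with auto-decoding):

```d
// A filtered string is still a range of characters, but its elements are
// no longer contiguous in memory, so it exposes no .ptr.
import std.algorithm : filter;
import std.uni : isAlpha;

void demo()
{
    auto r = "a,b,c".filter!isAlpha;   // lazily skips the commas
    static assert(!is(typeof(r.ptr))); // no contiguous storage, no .ptr
}
```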
Aug 24 2014
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 8/24/14, 6:00 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:
 On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:
 Currently char[], wchar[], dchar[] and qualified variants fulfill the
 requirements of isSomeString. Also, char[], wchar[] and qualified
 variants fulfill the requirements of isNarrowString.

 Various algorithms in Phobos test for these traits to optimize away
 UTF decoding where unnecessary.

 I'm thinking of relaxing the definitions to all types that fulfill the
 following requirements:

 * are random access ranges
 * element type is some character
 * offer .ptr as a @system property that offers a pointer to the first
 character

 This would allow us to generalize the notion of string and offer
 optimizations for user-defined, not only built-in, strings. Thoughts?
IMO in general it's a good proposal; I expected `isSomeString` to return `true` for higher-order ranges over strings, and was surprised that it didn't. But `.ptr` must only be included if the elements are indeed contiguous in memory. For example, `std.algorithm.filter` applied to a string should not provide `.ptr`, but should IMO be treated as a string nevertheless. Related: What about `.length`? RA ranges can also be infinite...
Yah, length should be there. I think ptr is good to require because I speculate many optimizations take advantage of the characters being contiguous. Andrei
Aug 24 2014
prev sibling next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
24-Aug-2014 05:06, Andrei Alexandrescu пишет:
 Currently char[], wchar[], dchar[] and qualified variants fulfill the
 requirements of isSomeString. Also, char[], wchar[] and qualified
 variants fulfill the requirements of isNarrowString.

 Various algorithms in Phobos test for these traits to optimize away UTF
 decoding where unnecessary.

 I'm thinking of relaxing the definitions to all types that fulfill the
 following requirements:

 * are random access ranges
 * element type is some character
 * offer .ptr as a @system property that offers a pointer to the first
 character
.ptr would be a nice '@system' addition to any contiguous range. -- Dmitry Olshansky
Aug 24 2014
prev sibling next sibling parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu 
wrote:
 I'm thinking of relaxing the definitions to all types that 
 fulfill the following requirements:

 * are random access ranges
 * element type is some character
 * offer .ptr as a @system property that offers a pointer to the 
 first character
Hmm, you also need .length, but it needs to return the length of the encoding (in code units), not the length of the string (in code points). There might be custom string types where .length returns the number of code points, but where .ptr points to code units. Narrow-string-optimized functions will also be looking for an opSlice that is indexed by code units.
Aug 24 2014
prev sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu 
wrote:
 Currently char[], wchar[], dchar[] and qualified variants 
 fulfill the requirements of isSomeString. Also, char[], wchar[] 
 and qualified variants fulfill the requirements of 
 isNarrowString.

 Various algorithms in Phobos test for these traits to optimize 
 away UTF decoding where unnecessary.

 I'm thinking of relaxing the definitions to all types that 
 fulfill the following requirements:

 * are random access ranges
 * element type is some character
 * offer .ptr as a  system property that offers a pointer to the 
 first character

 This would allow us to generalize the notion of string and 
 offer optimizations for user-defined, not only built-in, 
 strings. Thoughts?


 Andrei
One issue is that strings are "auto decoded", yet a range of "char" is not. I don't see why ".ptr" is needed (or ".length", for that matter). We could very well have a range of chars that doesn't offer .length or .ptr, but should still be handled like a sequence of decoded characters. I don't see how your proposal caters to that more generic problem: that a UD char range is not auto-decoded (or more generally, that it will be handled *differently* from a string).
Aug 25 2014