digitalmars.D - Relaxing the definition of isSomeString and isNarrowString
- Andrei Alexandrescu (14/14) Aug 23 2014 Currently char[], wchar[], dchar[] and qualified variants fulfill the
- H. S. Teoh via Digitalmars-d (11/28) Aug 23 2014 [...]
- Andrei Alexandrescu (79/104) Aug 24 2014 I've read the thread now, thanks. I think that discussion has lost
- "Marc Schütz" <schuetzm gmx.net> (6/18) Aug 24 2014 Any reason why this is restricted to strings? I.e., is there
- Kiith-Sa (5/23) Aug 24 2014 We already have an 'RCArray' (std.container.Array) although it's
- "Marc Schütz" <schuetzm gmx.net> (7/31) Aug 24 2014 Well, if `isSomeString` were extended as Andrei suggests (and the
- Andrei Alexandrescu (4/7) Aug 24 2014 It uses RefCounted:
- Andrew Godfrey (11/11) Aug 24 2014 The OP and the question of auto-decoding share the same root
- "Marc Schütz" <schuetzm gmx.net> (10/20) Aug 24 2014 ... of course this cannot be guaranteed. In fact, even the
- ketmar via Digitalmars-d (6/6) Aug 24 2014 On Sun, 24 Aug 2014 18:28:51 +0000
- Dmitry Olshansky (9/15) Aug 24 2014 Just make a function that translates UTF-8 into Latin-1, use ubyte for
- ketmar via Digitalmars-d (13/15) Aug 24 2014 i don't want ubytes. but there is nothing bad with assigning latin-1
- Dmitry Olshansky (7/18) Aug 24 2014 char != byte, and compiler will reject pointer and array assignments of
- Andrew Godfrey (11/36) Aug 24 2014 Both your answers are at the level of the compiler/language spec.
- "Marc Schütz" <schuetzm gmx.net> (3/16) Aug 25 2014 Right, not just pedantically, but crucially - didn't read it
- Andrew Godfrey (5/23) Aug 25 2014 Oops, I did it myself. I meant code unit.
- H. S. Teoh via Digitalmars-d (10/13) Aug 24 2014 [...]
- Vladimir Panteleev (4/9) Aug 24 2014 This doesn't get you far in practice if you want to actually
- Jonathan M Davis (16/25) Aug 24 2014 Well, all of the non-string specific stuff (like find) will work
- H. S. Teoh via Digitalmars-d (11/19) Aug 24 2014 We need a major revamp of std.encoding.
- Andrei Alexandrescu (8/23) Aug 24 2014 Thanks for asking. Strings are not special, but are a good place to
- Dmitry Olshansky (43/85) Aug 24 2014 I'd be damned, if I didn't measure. Do not assume things if there is
- Dmitry Olshansky (11/17) Aug 24 2014 The rest of post is mostly in line with my reasoning (except for nobody
- Andrei Alexandrescu (20/36) Aug 24 2014 I don't know of any std::vector that uses the small string optimization....
- Dmitry Olshansky (15/42) Aug 24 2014 This time it's me who must be wrong.
- monarch_dodra (10/15) Aug 25 2014 Dmitry already replied, but I want to stress the reply. That's
- Andrei Alexandrescu (8/22) Aug 25 2014 That escalated quickly. Chillax.
- monarch_dodra (12/16) Aug 25 2014 Sorry. It's just that I've been seeing this claim way too much
- monarch_dodra (2/3) Aug 25 2014 https://github.com/D-Programming-Language/phobos/pull/2464
- monarch_dodra (23/42) Aug 25 2014 One issue this proposal seems to forget (and it's a problem that
- Andrei Alexandrescu (3/5) Aug 25 2014 It will.
- monarch_dodra (3/9) Aug 25 2014 Awesome.
- Jonathan M Davis (4/15) Aug 25 2014 There's a PR currently in review which fixes this (though IIRC
- Jonathan M Davis (39/54) Aug 23 2014 I don't know. My first reaction is to be very, very worried about
- Andrei Alexandrescu (13/46) Aug 24 2014 I guess unittests are for that kind of stuff. One good start would be to...
- "Marc Schütz" <schuetzm gmx.net> (10/25) Aug 24 2014 IMO in general it's a good proposal; I expected `isSomeString` to
- Andrei Alexandrescu (5/31) Aug 24 2014 Yah, length should be there. I think ptr is good to require because I
- Dmitry Olshansky (4/15) Aug 24 2014 .ptr would be nice '@system' addition to any contiguous ranges.
- Peter Alexander (8/14) Aug 24 2014 Hmm, you also need .length, but it needs to return the length of
- monarch_dodra (10/26) Aug 25 2014 One issue is that strings are "auto decoded", yet a range of
Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements:

* are random access ranges
* element type is some character
* offer .ptr as a @system property that yields a pointer to the first character

This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?

Andrei
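The three bullets can be sketched as a template constraint. A hypothetical relaxed trait - the name `isStringLike` and the exact checks below are my own illustration, not Phobos code:

```d
import std.range.primitives : ElementEncodingType;
import std.traits : isSomeChar;

// Hypothetical relaxed trait; the name and the exact checks are illustrative.
enum isStringLike(T) = isSomeChar!(ElementEncodingType!T) // element type is some character
    && is(typeof(T.init[0]))                              // random access
    && is(typeof(T.init.length) : size_t)                 // has a length
    && is(typeof(T.init.ptr));                            // .ptr to the first character

static assert(isStringLike!(char[]));
static assert(isStringLike!string);
static assert(isStringLike!(wchar[]));
static assert(!isStringLike!(int[]));
```

Built-in arrays pass because indexing, .length and .ptr all exist on them; a user-defined string would pass simply by exposing the same surface.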
Aug 23 2014
On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via Digitalmars-d wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?[...] Recently there has been another heated debate on Github about whether ranges should auto-decode at all: https://github.com/D-Programming-Language/phobos/pull/2423 Jonathan is about to write up a DIP for phasing out auto-decoding. Given that, I think we need to decide which direction to take, lest we waste energy on something that will be going away soon. T -- I don't trust computers, I've spent too long programming to think that they can get anything right. -- James Miller
Aug 23 2014
On 8/23/14, 6:39 PM, H. S. Teoh via Digitalmars-d wrote:On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via Digitalmars-d wrote:I've read the thread now, thanks. I think that discussion has lost perspective. There is talk about how decoding is slow, yet nobody (like, literally not one soul) looked at measuring and optimizing decoding. Everybody just assumes it's unacceptably slow; the irony is, it is, but mostly because it's not engineered for speed. Look e.g. at https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. That's a memory allocation each and every time decodeImpl is called. I'm not kidding. Take a look at http://goo.gl/p5pl3D vs. http://goo.gl/YL2iFN. It's egregious. The entire function, which should be small and nimble, is a behemoth. True, that function is only called for non-ASCII sequences, but the caller code can definitely be improved for speed (I see e.g. unnecessary pass of strings by reference etc). A bunch of good work can be done to improve efficiency of decoding tremendously. If the DIP is based on Jonathan's comments in the pull request, I think it's a waste of everybody's time. It's a complete rewiring of the standard library fundamentals and a huge inconvenience for all users, for a slight improvement of a few scenarios. UTF strings are bidirectional ranges of variable-length encoded code units. Algorithms work with those. That is the matter of fact, and what makes UTF strings tick in D. It's the reason Python has a large schism between 2 and 3 over UTF strings and D is doing well, thank you very much. Considering D is faring a lot better than all languages I know of in the UTF realm, if auto-decoding was a mistake, well it wasn't a bad one. I suggest we focus efforts on solving real problems with creative solutions. Endless hand-wringing over small fish and forever redoing what's already there is not the way to progress. 
Speaking of which, this is why I discuss isNarrowString: I have recently gotten convinced of a number of things. First, there will always be a category of users and applications who will NOT, for reasons good and bad, accept garbage collection. They just won't, and we can discuss how that sometimes is a bummer reflecting on human nature, but I don't see how framing the problem as a user education issue is pushing D forward. It follows we need to improve proper use of D for people who will not have the GC, by means of superior libraries. And if you think of it this is very D-ish: we have a powerful, incredibly expressive language at our hands, designed explicitly to make powerful abstractions easy to define. So the right direction of exploration is better libraries. One other thing I realized is that avoiding the GC by means of APIs based on output ranges Will Not Work(tm). Not all allocations can be successfully hoisted to the user in that manner. Output ranges are fine if the sole focus is producing sequential output, but the reality is that often more complex structures need to be allocated for things to work, and the proper way to hoist that into the client is by providing allocator services, not output ranges. Consider warp itself: it reads files and writes files, which is all linear and nice, but part of the processing involves storing macros, which are a complex internal data structure (e.g. a hash table). The linear output can be hoisted nicely to the client, but not the intricate allocation needed by macros. Continuing to pull on that thread, it follows that we need to provide and thoroughly document data structures for resource management, derived from tried and true practices in C++: std::string, std::unique_ptr, std::shared_ptr. For example, we have RefCounted. It just works transparently in a bunch of situations, but very few users actually know about it. Furthermore, it has imperfections and rough edges. 
Improving RefCounted (better code, better documentation, maybe DIPs prompting language improvements that make it better) would be real progress. To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles is a modern, emergent pattern that D could use to great effect. (Speaking of which: Some, but not all, types in std.container use reference counting. One other great area of improvement would be to guarantee that everything in std.container is reference counted. Containers are the perfect candidate for reference counting - they are typically large enough to make the reference counting overhead negligible by comparison with the typical work on them.) I'd like RCString to benefit from the optimizations for built-in strings, and that's why I was looking at relaxing isSomeString/isNarrowString. AndreiCurrently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. 
I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?[...] Recently there has been another heated debate on Github about whether ranges should auto-decode at all: https://github.com/D-Programming-Language/phobos/pull/2423 Jonathan is about to write up a DIP for phasing out auto-decoding. Given that, I think we need to decide which direction to take, lest we waste energy on something that will be going away soon.
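To make the slice-semantics point concrete, here is a toy skeleton in the spirit of what's described. RCString itself is unpublished, so every name and detail below is a guess layered on std.typecons.RefCounted:

```d
import std.typecons : RefCounted;

// Toy sketch only; the real RCString is unpublished and surely differs.
struct ToyRCString
{
    private RefCounted!(char[]) buf; // shared, reference-counted buffer
    private size_t lo, hi;           // this handle's window into the buffer

    this(const(char)[] s)
    {
        buf = RefCounted!(char[])(s.dup);
        hi = s.length;
    }

    @property size_t length() { return hi - lo; }
    char opIndex(size_t i) { return buf.refCountedPayload[lo + i]; }

    // Slicing copies the handle, not the characters: slice semantics
    // over a shared, counted payload.
    ToyRCString opSlice(size_t a, size_t b)
    {
        auto r = this;
        r.lo = lo + a;
        r.hi = lo + b;
        return r;
    }
}

unittest
{
    auto s = ToyRCString("hello");
    auto t = s[1 .. 4];
    assert(t.length == 3 && t[0] == 'e'); // same buffer, new window
}
```

The copy in opSlice bumps the reference count via RefCounted's postblit; when the last handle dies, the buffer can be freed deterministically, matching the behavior described above.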
Aug 24 2014
On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles and is a modern, emergent pattern that D could use to great effect.Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
Aug 24 2014
On Sunday, 24 August 2014 at 12:49:30 UTC, Marc Schütz wrote:On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:We already have an 'RCArray' (std.container.Array) although it's not perfect. We don't have a reference counted string type that would work with the string operation functions.To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles and is a modern, emergent pattern that D could use to great effect.Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
Aug 24 2014
On Sunday, 24 August 2014 at 13:04:10 UTC, Kiith-Sa wrote:On Sunday, 24 August 2014 at 12:49:30 UTC, Marc Schütz wrote:Well, if `isSomeString` were extended as Andrei suggests (and the string functions adapted a bit), I don't see why it shouldn't work. But I'm not sure std.container.Array really is refcounted. The documentation doesn't say so, and it shows neither a postblit nor an opAssign.On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:We already have an 'RCArray' (std.container.Array) although it's not perfect. We don't have a reference counted string type that would work with the string operation functions.To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles and is a modern, emergent pattern that D could use to great effect.Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
Aug 24 2014
On 8/24/14, 6:20 AM, "Marc Schütz" <schuetzm gmx.net> wrote:But I'm not sure std.container.Array really is refcounted. The documentation doesn't say so, and it shows neither a postblit nor an opAssign.It uses RefCounted: https://github.com/D-Programming-Language/phobos/blob/master/std/container/array.d#L169 Andrei
Aug 24 2014
The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point? E.g. If interoperating with C code, they will never make the mistake of using these types for a non-string byte/word array? If and only if this is true, then D has done well and I'm unafraid of duck-typing here.
Aug 24 2014
On Sunday, 24 August 2014 at 18:19:45 UTC, Andrew Godfrey wrote:The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?It has to, because it is required by the specification. But ...E.g. If interoperating with C code, they will never make the mistake of using these types for a non-string byte/word array?... of course this cannot be guaranteed. In fact, even the druntime currently just assumes that program arguments and environment variables are UTF8 encoded on Unix, AFAIK. This is true in most cases, but of course not guaranteed. Potentially also problematic are the functions taking filenames. In Unix, filenames are just opaque arrays of bytes, but those functions take `string` parameters, i.e. assuming UTF8 encoding. This could force the user to place non-UTF8 sequences into strings.
Aug 24 2014
On Sun, 24 Aug 2014 18:28:51 +0000 via Digitalmars-d <digitalmars-d puremagic.com> wrote: and D parser unconditionally assumes that source code is in UTF-8 too. which makes great PITA for me: i can't use strings (any string-like literals, actually) in Latin-1 (for example) without ugly "\x..\x..\x..". hate it.
Aug 24 2014
24-Aug-2014 22:36, ketmar via Digitalmars-d writes:On Sun, 24 Aug 2014 18:28:51 +0000 via Digitalmars-d <digitalmars-d puremagic.com> wrote: and D parser unconditionally assumes that source code is in UTF-8 too. which makes great PITA for me: i can't use strings (any string-like literals, actually) in Latin-1 (for example) without ugly "\x..\x..\x..". hate it.Just make a function that translates UTF-8 into Latin-1, use ubyte for Latin-1. Then this neat trick will keep the literal a compile-time constant: enum latin(alias str) = toLatin1(str); (where toLatin1 is some normal function that does the translation) Then use it as latin!"some-string-in-latin1" -- Dmitry Olshansky
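toLatin1 is left undefined above; a minimal CTFE-able sketch of it (my assumption: any code point above U+00FF is simply an error):

```d
// Sketch of the helper left undefined above: maps code points
// U+0000..U+00FF to single Latin-1 bytes; anything else is rejected.
ubyte[] toLatin1(string s)
{
    ubyte[] r;
    foreach (dchar c; s) // auto-decodes the UTF-8 source into code points
    {
        assert(c <= 0xFF, "not representable in Latin-1");
        r ~= cast(ubyte) c;
    }
    return r;
}

enum latin(alias str) = toLatin1(str);

// "é" is U+00E9: two bytes in UTF-8, but a single 0xE9 byte in Latin-1.
static assert(latin!"é".length == 1 && latin!"é"[0] == 0xE9);
```

Because the enum forces the call through CTFE, the Latin-1 bytes are baked in at compile time, which is what makes the trick a drop-in replacement for the "\x..\x..\x.." literals.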
Aug 24 2014
On Mon, 25 Aug 2014 01:23:57 +0400 Dmitry Olshansky via Digitalmars-d <digitalmars-d puremagic.com> wrote:Just make a function that translates UTF-8 into Latin-1, use ubyte for Latin-1.i don't want ubytes. but there is nothing bad with assigning latin-1 text to ordinary string after conversion. ;-) the downside is memory consumption. i have ALOT of ctfe-magic, compiling some modules easily hits memory limits (yeah, x86; no, i don't planning to switch to x86_64 soon), and calling tr-functions trashes memory even more (i'm not even talking about code readability here). what i really want is "verbatim" string literal, which scanner, lexer and parser just passes "as-is", byte-to-byte, without trying to decode it as UTF-8. or (this can be easier to implement, i think) at least something like pragma(encoding, "..."). with at least two working values: "utf-8" and "native".
Aug 24 2014
24-Aug-2014 22:19, Andrew Godfrey writes:The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.E.g. If interoperating with C code, they will never make the mistake of using these types for a non-string byte/word array?char != byte, and compiler will reject pointer and array assignments of byte* to char*, ubyte[] to char[] etc. Values themselves are convertible, so would work with implicit conversion.If and only if this is true, then D has done well and I'm unafraid of duck-typing here.
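The rules above in compilable form (a minimal sketch; the static asserts confirm exactly which conversions the compiler rejects):

```d
void main()
{
    ubyte[] raw = [72, 105];

    // Arrays and pointers do NOT convert implicitly between ubyte and char:
    static assert(!is(ubyte[] : char[]));
    static assert(!is(ubyte* : char*));

    // Individual values DO convert implicitly (both are 8-bit):
    char c = raw[0];
    assert(c == 'H');
}
```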
Aug 24 2014
On Sunday, 24 August 2014 at 18:43:36 UTC, Dmitry Olshansky wrote:24-Aug-2014 22:19, Andrew Godfrey пишет:Both your answers are at the level of the compiler/language spec. Relevant yes, but not complete. E.g. How often will people manually converting a .h file, convert C "const char *" correctly to either something char-based or something ubyte-based, depending on whether it represents utf-8 code points? How often will they even know? With wchar it's probably even worse, because of API's that use one type but depending on other parameters, the string elements can be utf-16 code points or glyph indices.The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.E.g. If interoperating with C code, they will never make the mistake of using these types for a non-string byte/word array?char != byte, and compiler will reject pointer and array assignments of byte* to char*, ubyte[] to char[] etc. Values themselves are convertible, so would work with implicit conversion.If and only if this is true, then D has done well and I'm unafraid of duck-typing here.
Aug 24 2014
On Sunday, 24 August 2014 at 18:43:36 UTC, Dmitry Olshansky wrote:24-Aug-2014 22:19, Andrew Godfrey пишет:Right, not just pedantically, but crucially - didn't read it correctly.The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.
Aug 25 2014
On Monday, 25 August 2014 at 08:07:58 UTC, Marc Schütz wrote:On Sunday, 24 August 2014 at 18:43:36 UTC, Dmitry Olshansky wrote:Oops, I did it myself. I meant code unit. Anyway, see my earlier reply - I wasn't asking what the language spec says. I'm asking about the reality given what happens when D code interfaces with e.g. C/C++ code.24-Aug-2014 22:19, Andrew Godfrey пишет:Right, not just pedantically, but crucially - didn't read it correctly.The OP and the question of auto-decoding share the same root problem: Even though D does a lot better with UTF than other languages I've used, it still confuses characters with code points somewhat. "Element type is some character" is an example from OP. So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?Yes, pedantically - UTF-8 and UTF-16 code _units_. dchar is a codepoint.
Aug 25 2014
On Sun, Aug 24, 2014 at 06:19:44PM +0000, Andrew Godfrey via Digitalmars-d wrote: [...]So clarify for me: If a programmer makes an array of either 'char' or 'wchar', does that always, unambiguously, mean a UTF8 or UTF16 code point?[...] In D, an array of char, wchar, or dchar always means a Unicode encoding. Non-Unicode encodings should be represented as ubyte[] (resp. ushort[] or uint[], if such exist) instead. T -- If you want to solve a problem, you need to address its root cause, not just its symptoms. Otherwise it's like treating cancer with Tylenol...
Aug 24 2014
On Monday, 25 August 2014 at 01:31:35 UTC, H. S. Teoh via Digitalmars-d wrote:In D, an array of char, wchar, or dchar always means a Unicode encoding. Non-Unicode encodings should be represented as ubyte[] (resp. ushort[] or ulong[], if such exist) instead.This doesn't get you far in practice if you want to actually operate on the text.
Aug 24 2014
On Monday, 25 August 2014 at 02:40:20 UTC, Vladimir Panteleev wrote:On Monday, 25 August 2014 at 01:31:35 UTC, H. S. Teoh via Digitalmars-d wrote:Well, all of the non-string specific stuff (like find) will work just fine, but since all of the string-specific functions assume UTF-8, UTF-16, or UTF-32, you'll have to convert it. We can't really do otherwise, because you have to know what encoding you're dealing with to operate on it as a string, and that means that you need to either call specific functions which expect the encoding that you're using, or you need types specific to those encodings (in which case, you wouldn't use ubyte[] and the like directly). We do need better support for other encodings, but I don't think that it really costs us anything to treat char as UTF-8, wchar as UTF-16, and dchar as UTF-32 and require that other encodings use different representations. - Jonathan M DavisIn D, an array of char, wchar, or dchar always means a Unicode encoding. Non-Unicode encodings should be represented as ubyte[] (resp. ushort[] or ulong[], if such exist) instead.This doesn't get you far in practice if you want to actually operate on the text.
Aug 24 2014
On Mon, Aug 25, 2014 at 02:40:19AM +0000, Vladimir Panteleev via Digitalmars-d wrote:On Monday, 25 August 2014 at 01:31:35 UTC, H. S. Teoh via Digitalmars-d wrote:We need a major revamp of std.encoding. I envision it to use CTFE and template magic to make it possible to statically optimize manipulating strings of any supported encoding via statically-wrapped non-UTF string types, as well as dynamic types for dealing with runtime-decided encodings. If possible, I envision being able to use std.conv.to to convert between encodings via compile-time customization hooks for to!(). T -- Live and learn - you'll still die a fool.In D, an array of char, wchar, or dchar always means a Unicode encoding. Non-Unicode encodings should be represented as ubyte[] (resp. ushort[] or ulong[], if such exist) instead.This doesn't get you far in practice if you want to actually operate on the text.
Aug 24 2014
On 8/24/14, 5:49 AM, "Marc Schütz" <schuetzm gmx.net> wrote:On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:Thanks for asking. Strings are not special, but are a good place to start because (a) they are ubiquitous so good bang for the buck, (b) they don't have indirections to worry about, (c) they have a readily available built-in to compare against. Also RCString is different from e.g. std.container.Array in that the former has slice semantics whereas the latter has pure reference semantics. AndreiTo that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles and is a modern, emergent pattern that D could use to great effect.Any reason why this is restricted to strings? I.e., is there something special about strings (apart from auto-decoding) that doesn't apply to arrays in general? If not, wouldn't an RCArray be a better idea?
Aug 24 2014
24-Aug-2014 16:24, Andrei Alexandrescu writes:On 8/23/14, 6:39 PM, H. S. Teoh via Digitalmars-d wrote:On Sat, Aug 23, 2014 at 06:06:37PM -0700, Andrei Alexandrescu via Digitalmars-d wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?[...] Recently there has been another heated debate on Github about whether ranges should auto-decode at all: https://github.com/D-Programming-Language/phobos/pull/2423 Jonathan is about to write up a DIP for phasing out auto-decoding. Given that, I think we need to decide which direction to take, lest we waste energy on something that will be going away soon.I've read the thread now, thanks. I think that discussion has lost perspective. There is talk about how decoding is slow, yet nobody (like, literally not one soul) looked at measuring and optimizing decoding. Everybody just assumes it's unacceptably slow; the irony is, it is, but mostly because it's not engineered for speed.
I'd be damned, if I didn't measure. Do not assume things if there is little actual data. Take a peek at my tests (yeah, rough stuff but it does the job). https://github.com/DmitryOlshansky/gsoc-bench-2012 Let's count the number of codepoints in a file (loading it into memory before measuring stuff): =================== decode-only [arwiki], 368312 hits, 3.16312, 205.58 Mb/s noop [arwiki], 599328 hits, 0.67872, 958.11 Mb/s ==================== decode-only [ruwiki], 523414 hits, 4.54544, 206.23 Mb/s noop [ruwiki], 884374 hits, 0.7898, 1186.92 Mb/s ==================== decode-only [dewiki], 1773464 hits, 5.63744, 318.28 Mb/s noop [dewiki], 1607454 hits, 2.1322, 841.52 Mb/s With that I can say that decode *is* engineered for speed. It processes about 200-300 million codepoints per second for me, compared with a baseline of 840 Mbytes/sec for merely testing whether each byte is 0x20. With LDC both numbers are about 2 times better but the picture stays the same. As you can see there is a significant gap there and decoding is not yet memory-bound. It's still a cost that hinders top-performance code.Look e.g. at https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. That's a memory allocation each and every time decodeImpl is called. I'm not kidding. Take a look at http://goo.gl/p5pl3D vs.You are completely wrong on this count. I'm horrified to learn that even D experts are unaware of how `enum` works, or you've got to be kidding me. enum doesn't allocate but serves as a `copy-paste` token, effectively putting said literal everywhere it is used. Now look at the usage pattern: https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1173 (where 'i' is a constant from a static foreach) The compiler sees it as: [(1 << 7) - 1, (1 << 11) - 1, (1 << 16) - 1, (1 << 21) - 1][0] and const-folds it accordingly.I'd like RCString to benefit of the optimizations for built-in strings, and that's why I was looking at relaxing isSomeString/isNarrowString.On this I agree, isSomeString is an ugly half-assed hack if we can't have other kinds of strings. 
Also consider Array!char, which in fact should be pretty close to RCString. I'm of the opinion that there are 2 ways out of 'decode by default sucks': a) Make it finally generic and clearly state what a DecodableRange is. Stick with that everywhere. b) To hell with that and just remove implicit decoding. Else it really-really sucks. -- Dmitry Olshansky
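For reference, the relaxed trait Andrei describes could be sketched roughly as follows (hypothetical name `isStringLike`; not Phobos code). Note the wrinkle that under today's auto-decoding rules narrow strings do not satisfy `isRandomAccessRange`, so a real trait would have to special-case them the way `isNarrowString` does now.

```d
import std.range : isRandomAccessRange, ElementEncodingType;
import std.traits : isSomeChar;

// Hypothetical sketch of the relaxed check - not actual Phobos code:
// random access, character element type, and a .ptr property.
enum isStringLike(R) = isRandomAccessRange!R
    && isSomeChar!(ElementEncodingType!R)
    && is(typeof(R.init.ptr));

static assert(isStringLike!(dchar[]));
static assert(!isStringLike!(int[])); // has .ptr, but not characters

// Today's auto-decoding means char[] is not random access,
// which is exactly part of the problem under discussion.
static assert(!isRandomAccessRange!(char[]));

void main() {}
```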
Aug 24 2014
24-Aug-2014 16:24, Andrei Alexandrescu wrote:(Speaking of which: Some, but not all, types in std.container use reference counting. One other great area of improvement would be to guarantee that everything in std.container is reference counted. Containers are the perfect candidate for reference counting - they are typically large enough to make the reference counting overhead negligible by comparison with the typical work on them.)The rest of the post is mostly in line with my reasoning (except for the 'nobody measures stuff' part, which is just bullshit). Speaking of data structures I find just about the opposite. Most data structures are small, which must be the fact so fondly used by C++ vector: small-string optimization. Only very few data structures are large in a given program, and usually correspond to some global tables and repositories. Others are either short-lived byproducts of input processing or small data sets attached to some global entity. -- Dmitry Olshansky
Aug 24 2014
On 8/24/14, 6:16 AM, Dmitry Olshansky wrote:24-Aug-2014 16:24, Andrei Alexandrescu wrote:Apologies for that, thanks for setting me straight.(Speaking of which: Some, but not all, types in std.container use reference counting. One other great area of improvement would be to guarantee that everything in std.container is reference counted. Containers are the perfect candidate for reference counting - they are typically large enough to make the reference counting overhead negligible by comparison with the typical work on them.)The rest of the post is mostly in line with my reasoning (except for the 'nobody measures stuff' part, which is just bullshit).Speaking of data structures I find just about the opposite. Most data structures are small, which must be the fact so fondly used by C++ vector: small-string optimization. Only very few data structures are large in a given program, and usually correspond to some global tables and repositories. Others are either short-lived byproducts of input processing or small data sets attached to some global entity.I don't know of any std::vector that uses the small string optimization. std::string does ubiquitously because (a) strings are often handled as values, and (b) C++11 put refcounted strings into illegality (forced mistake), therefore robbing implementers of an important optimization. In a way both C++ and D got it "wrong". Arrays/containers are entity types - they have identity and should be manipulated most often by reference. Presence of pass-by-value of containers in C++ programs, save for rvalue optimization purposes, is suspicious. In contrast, strings are value types - they are handled most often as a unit and passed by value, just like e.g. numbers. C++ made both containers and strings value types, so it needs forever to look over its shoulder about n00bs copying large containers unwittingly. It also does a fair amount of unneeded string copying, and optimizing string-based C++ code is nontrivial. 
D made both arrays and strings slices, a data structure made highly expressive by the garbage collector but that occasionally confuses people. With std.refcounted.RCString and std.container.Array we get both abstractions "right". Andrei
Aug 24 2014
24-Aug-2014 21:59, Andrei Alexandrescu wrote:On 8/24/14, 6:16 AM, Dmitry Olshansky wrote:This time it's me who must be wrong. Yet I see that this is a recognized need: https://github.com/facebook/folly/blob/master/folly/small_vector.h LLVM folks seem to do the same. With that in mind small containers might be better as a special value type.24-Aug-2014 16:24, Andrei Alexandrescu wrote:I don't know of any std::vector that uses the small string optimization.Speaking of data structures I find just about the opposite. Most data structures are small, which must be the fact so fondly used by C++ vector: small-string optimization. Only very few data structures are large in a given program, and usually correspond to some global tables and repositories. Others are either short-lived byproducts of input processing or small data sets attached to some global entity.std::string does ubiquitously because (a) strings are often handled as values, and (b) C++11 put refcounted strings into illegality (forced mistake), therefore robbing implementers of an important optimization. In a way both C++ and D got it "wrong". Arrays/containers are entity types - they have identity and should be manipulated most often by reference. Presence of pass-by-value of containers in C++ programs, save for rvalue optimization purposes, is suspicious.Agreed.In contrast, strings are value types - they are handled most often as a unit and passed by value, just like e.g. numbers.Indeed, just note that it would be real nice not to actually copy strings. In fact it seems that copy-on-write (with ref-counting) for big strings and small-string optimization for small ones is an almost ideal solution. In D we could have non-atomic (thread-local) copy-on-write, which should be quite fast.C++ made both containers and strings value types, so it needs forever to look over its shoulder about n00bs copying large containers unwittingly. 
It also does a fair amount of unneeded string copying, and optimizing string-based C++ code is nontrivial. D made both arrays and strings slices, a data structure made highly expressive by the garbage collector but that occasionally confuses people. With std.refcounted.RCString and std.container.Array we get both abstractions "right".Good points. -- Dmitry Olshansky
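The small-string optimization Dmitry mentions can be sketched in a few lines of D (purely illustrative; real implementations such as folly's fbstring are far more involved): short strings live inline in the struct, and only long ones touch the heap.

```d
// Illustrative sketch of the small-string optimization - not a real
// string type. Short strings are stored in an inline buffer; longer
// ones are copied to a heap-allocated slice.
struct SmallString
{
    enum smallCap = 15;
    union
    {
        char[smallCap + 1] small; // inline buffer for short strings
        char[] large;             // heap slice for long strings
    }
    ubyte len;     // used length when inline (0 .. smallCap)
    bool isLarge;  // discriminates the union

    this(const(char)[] s)
    {
        if (s.length <= smallCap)
        {
            small[0 .. s.length] = s[];
            len = cast(ubyte) s.length;
        }
        else
        {
            large = s.dup; // only long strings allocate
            isLarge = true;
        }
    }

    const(char)[] slice() const
    {
        return isLarge ? large : small[0 .. len];
    }
}

void main()
{
    assert(SmallString("hi").slice == "hi");
    assert(!SmallString("hi").isLarge); // short: no heap allocation
    auto s = SmallString("a string comfortably longer than fifteen chars");
    assert(s.isLarge && s.slice[0 .. 8] == "a string");
}
```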
Aug 24 2014
On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:Look e.g. at https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. That's a memory allocation each and every time decodeImpl is called. I'm not kidding. Take a look at http://goo.gl/p5pl3D vs. http://goo.gl/YL2iFN. It's egregious.Dmitry already replied, but I want to stress the reply. That's complete BS. decodeImpl does *not* allocate. That line of code has been commented on, tested and benched. The enum is only ever used for CT indexing, which will *not* allocate. I'm repeatedly seeing people complain that decoding is slow. Maybe. And that it "allocates". But that's BS. A lot of people have been repeating this false information, and I find it worrisome that both you and Walter have now made this claim.
Aug 25 2014
On 8/25/14, 1:21 PM, monarch_dodra wrote:On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:That escalated quickly. Chillax. The use of that enum array is fine but somewhat fragile - a small refactoring may trigger an allocation. Trying a static immutable array may be in order. The larger point is there's significant optimization work that can be done on decoding before looking at other changes. AndreiLook e.g. at https://github.com/D-Programming-Language/phobos/blob/master/std/utf.d#L1074. That's a memory allocation each and every time decodeImpl is called. I'm not kidding. Take a look at http://goo.gl/p5pl3D vs. http://goo.gl/YL2iFN. It's egregious.Dmitry already replied, but I want to stress the reply. That's complete BS. decodeImpl does *not* allocate. That line of code has been commented on, tested and benched. The enum is only ever used for CT indexing, which will *not* allocate. I'm repeatedly seeing people complain that decoding is slow. Maybe. And that it "allocates". But that's BS. A lot of people have been repeating this false information now, and I find it worrysome that both you and Walter have made this claim now.
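Andrei's "fragile" remark can be made concrete (sketch with made-up names, not the actual std.utf code): the enum array only stays allocation-free while the index is a compile-time constant. Index it with a run-time value and the array literal is pasted into the expression, which can allocate on every call; a static immutable table sits in the data segment and never allocates.

```d
// Why the enum array is fragile: a run-time index forces the pasted
// array literal to materialize (potentially allocating each call),
// while a static immutable table never allocates.
enum enumTable = [10, 20, 30, 40];
static immutable int[4] immTable = [10, 20, 30, 40];

int viaEnum(size_t i) { return enumTable[i]; } // may allocate per call
int viaImm(size_t i)  { return immTable[i]; }  // never allocates

void main()
{
    assert(viaEnum(2) == 30);
    assert(viaImm(2) == 30);
}
```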
Aug 25 2014
On Monday, 25 August 2014 at 21:11:52 UTC, Andrei Alexandrescu wrote:That escalated quickly. Chillax.Sorry. It's just that I've been seeing this claim way too frequently, especially in learn, where there have been too many threads about trying to avoid decoding. Most often, for the wrong reasons.The use of that enum array is fine but somewhat fragile - a small refactoring may trigger an allocation.Declaring it as a simple 4-member TypeTuple would eliminate the (potential) issue, as a TT *can't* be runtime indexed. Seems like a good compromise.Trying a static immutable array may be in order.That would actually create an object, and potentially prevent optimizations (I think, maybe). The TT seems cleaner to me anyways. I'll create a pull for it.
Aug 25 2014
On Monday, 25 August 2014 at 22:04:34 UTC, monarch_dodra wrote:I'll create a pull for it.https://github.com/D-Programming-Language/phobos/pull/2464
Aug 25 2014
On Sunday, 24 August 2014 at 12:24:03 UTC, Andrei Alexandrescu wrote:To that end I'm working on RCString, an industrial-strength string type that's much like string, just reference counted and with configurable allocation. It's safe, too - user code cannot casually extract references to string internals. By default allocation would use GC's primitives; one thing I learned to appreciate about our GC is its ability to free memory explicitly, which means RCString will free memory deterministically most of the time, yet if you leak some (e.g. by having RCString class members) the GC will pick up the litter. I think reference counting backed up by a GC that lifts litter and cycles is a modern, emergent pattern that D could use to great effect. (Speaking of which: Some, but not all, types in std.container use reference counting. One other great area of improvement would be to guarantee that everything in std.container is reference counted. Containers are the perfect candidate for reference counting - they are typically large enough to make the reference counting overhead negligible by comparison with the typical work on them.)One issue this proposal seems to forget (and it's a problem that transcends D) is that the GC does not finalize structs. Your RC proposal is fine and good for strings, because the individual chars don't have destructors. But unless we migrate *everything* to using RC, we'd still be leaking non-memory resources. For example, "File" is reference counted, and I've seen people time and time again get had, because they use a "File[]". Oops. IMO, this is a big issue. Are there any plans to tackle this problem? Another issue we encounter a lot with reference type objects that do RC is one of initial initialization/allocation. Currently, D does not have default constructors (for structs). I understand why. But it makes it painfully difficult to implement run-time initialization with no arguments, while avoiding user errors. 
This has been a problem time and time again for objects such as Appender, or T[U], in that aliasing only happens *after* the first operation. Has this been discussed again yet? We have "T.init". Why couldn't we have default construction, that can be explicitly skipped with "T a = T.init;"? I realize this derails the conversation a bit, but I think it is related enough to warrant mentioning.
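The Appender aliasing surprise described above is easy to reproduce (behavior as of Phobos at the time; worth re-verifying on a current compiler): because the payload is only allocated on first use, a copy taken before any operation does not share state with the original.

```d
import std.array : appender;

void main()
{
    auto a = appender!string();
    auto b = a;      // copies a still-unallocated payload
    a.put("hello");
    b.put("world");  // b allocates its own buffer - no sharing
    assert(a.data == "hello");
    assert(b.data == "world");

    // Copying *after* the first operation does alias:
    auto c = appender!string();
    c.put("x");
    auto d = c;      // now the payload exists and is shared
    d.put("y");
    assert(c.data == "xy");
}
```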
Aug 25 2014
On 8/25/14, 1:35 PM, monarch_dodra wrote:One issue this proposal seems to forget (and it's a problem that transcends D), is that the GC does not finalize structs.It will. Andrei
Aug 25 2014
On Monday, 25 August 2014 at 21:12:49 UTC, Andrei Alexandrescu wrote:On 8/25/14, 1:35 PM, monarch_dodra wrote:Awesome.One issue this proposal seems to forget (and it's a problem that transcends D), is that the GC does not finalize structs.It will. Andrei
Aug 25 2014
On Monday, 25 August 2014 at 22:00:30 UTC, monarch_dodra wrote:On Monday, 25 August 2014 at 21:12:49 UTC, Andrei Alexandrescu wrote:There's a PR currently in review which fixes this (though IIRC there were some bugs with it that were being worked out). - Jonathan M DavisOn 8/25/14, 1:35 PM, monarch_dodra wrote:Awesome.One issue this proposal seems to forget (and it's a problem that transcends D), is that the GC does not finalize structs.It will. Andrei
Aug 25 2014
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a @system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?I don't know. My first reaction is to be very, very worried about this. We made it so that isSomeString, isIntegral, etc. test for the specific type and not implicit conversion precisely to avoid problems where the type is assumed to be one of the built-in types, and the code then does weird things or just doesn't compile. If we had done this with isSomeString originally, I'm sure that we could make it work, but you're talking about a type that isn't an array, and I expect that there's a lot of code out there that assumes that something that passes isSomeString is an array. And there's a good chance that a lot of the functions that do won't react well to a user-defined type. I think that if we go that route, we should probably create a new trait for it rather than reuse isSomeString. In addition, I'd be very concerned about using ptr like that, because the main use case for that aside from trying to avoid bounds checks is to pass to a C function, and that's going to need ptr to give you an actual array, in which case, I start wondering why we're not just using arrays. Also, what optimizations are you looking for here? To avoid auto-decoding? Something else? 
A lot of the compiler/druntime/Phobos devs are coming to the conclusion that the auto-decoding was a mistake and that we should look at removing it if we can and change it so that strings are treated as ranges of their element type rather than dchar. I'm even in the process of working on a DIP for just that, and making the transition would be nowhere near as bad as I was initially expecting it to be. I think that we can actually do it fairly cleanly. If we did that, then I'm not sure what gain there would be in doing something like this to isSomeString or even in creating a new trait. Simply having a random-access range whose element type is a character type would do it already. The only part that it might miss out on is using ptr to skip bounds checking, and I don't know how much sense that makes with a user-defined type anyway given that ptr would have to be giving you a pointer to an array to work wherever ptr works for arrays. - Jonathan M Davis
Aug 23 2014
On 8/23/14, 7:05 PM, Jonathan M Davis wrote:I don't know. My first reaction is to be very, very worried about this. We made it so that isSomeString, isIntegral, etc. test for the specific type and not implicit conversion precisely to avoid problems where the type is assumed to be one of the built-in types, and the code then does weird things or just doesn't compile.I guess unittests are for that kind of stuff. One good start would be to see what kind of assumptions are made by algorithms specialized on isSomeString.If we had done this with isSomeString originally, I'm sure that we could make it work, but you're talking about a type that isn't an array, and I expect that there's a lot of code out there that assumes that something that passes isSomeString is an array. And there's a good chance that a lot of the functions that do won't react well to a user-defined type. I think that if we go that route, we should probably create a new trait for it rather than reuse isSomeString.Yah, that would be a safe path but would complicate checks.In addition, I'd be very concerned about using ptr like that, because the main use case for that aside from trying to avoid bounds checks is to pass to a C function, and that's going to need ptr to give you an actual array, in which case, I start wondirng why we're not just using arrays.Because the abstraction I'm thinking of is reference counted.Also, what optimizations are you looking for here? To avoid auto-decoding? Something else?It's just taking advantage of the work that's already been done in std.algorithm and elsewhere.A lot of the compiler/druntime/Phobos devs are coming to the conclusion that the auto-decoding was a mistake and that we should look at removing it if we can and change it so that strings are treated as ranges of their element type rather than dchar. I'm even in the process of working on a DIP for just that, and making the transition would be nowhere near as bad as I was initially expecting it to be. 
I think that we can actually do it fairly cleanly. If we did that, then I'm not sure what gain there would be in doing something like this to isSomeString or even in creating a new trait. Simply having a random-access range whose element type is a character type would do it already. The only part that it might miss out on is using ptr to skip bounds checking, and I don't know how much sense that makes with a user-defined type anyway given that ptr would have to be giving you a pointer to an array to work wherever ptr works for arrays.As I wrote at length a few minutes ago, I think such a DIP would be a waste of time and energy. I encourage everyone to work on pushing things forward with new work instead of forever fiddling with what we have "for the last time". Andrei
Aug 24 2014
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?IMO in general it's a good proposal; I expected `isSomeString` to return `true` for higher-order ranges over strings, and was surprised that it didn't. But `.ptr` must only be included if the elements are indeed contiguous in memory. For example, `std.algorithm.filter` applied to a string should not provide `.ptr`, but should IMO be treated as a string nevertheless. Related: What about `.length`? RA ranges can also be infinite...
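Marc's contiguity point is easy to check: a lazy adapter such as `filter` over a string satisfies the character-range requirements but exposes no `.ptr`, since its elements are not contiguous in memory.

```d
import std.algorithm : filter;
import std.array : array;
import std.uni : isAlpha;

void main()
{
    auto r = "a1b2c3".filter!(c => c.isAlpha);
    // The lazy filter result offers no .ptr...
    static assert(!is(typeof(r.ptr)));
    // ...but it is still a range of (decoded) characters.
    assert(r.array == "abc"d);
}
```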
Aug 24 2014
On 8/24/14, 6:00 AM, "Marc Schütz" <schuetzm gmx.net>" wrote:On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:Yah, length should be there. I think ptr is good to require because I speculate many optimizations take advantage of the characters being contiguous. AndreiCurrently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts?IMO in general it's a good proposal; I expected `isSomeString` to return `true` for higher-order ranges over strings, and was surprised that it didn't. But `.ptr` must only be included if the elements are indeed contiguous in memory. For example, `std.algorithm.filter` applied to a string should not provide `.ptr`, but should IMO be treated as a string nevertheless. Related: What about `.length`? RA ranges can also be infinite...
Aug 24 2014
24-Aug-2014 05:06, Andrei Alexandrescu wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a @system property that offers a pointer to the first character.ptr would be a nice '@system' addition to any contiguous ranges. -- Dmitry Olshansky
Aug 24 2014
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a @system property that offers a pointer to the first characterHmm, you also need .length, but it needs to return the length of the encoding (in code units), not of the string. There might be custom string types where .length returns the number of code points, but where .ptr points to code units. Narrow-string-optimized functions will also be looking for an opSlice that is indexed by code units.
Aug 24 2014
On Sunday, 24 August 2014 at 01:06:31 UTC, Andrei Alexandrescu wrote:Currently char[], wchar[], dchar[] and qualified variants fulfill the requirements of isSomeString. Also, char[], wchar[] and qualified variants fulfill the requirements of isNarrowString. Various algorithms in Phobos test for these traits to optimize away UTF decoding where unnecessary. I'm thinking of relaxing the definitions to all types that fulfill the following requirements: * are random access ranges * element type is some character * offer .ptr as a @system property that offers a pointer to the first character This would allow us to generalize the notion of string and offer optimizations for user-defined, not only built-in, strings. Thoughts? AndreiOne issue is that strings are "auto decoded", yet a range of "char" is not. I don't see why ".ptr" is needed (or ".length", for that matter). We could very well have a range of chars that offers neither .length nor .ptr, but should still be handled like a sequence of decoded characters. I don't see how your proposal caters to that more generic problem: that a UD char range is not auto-decoded (or, more generally, that it will be handled *differently* from a string).
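The asymmetry described above can be demonstrated directly: the range primitives auto-decode built-in strings to dchar, while a user-defined range over the same chars keeps its element type (illustrative sketch).

```d
import std.range : ElementType;

// A minimal user-defined input range over the same bytes as a string.
struct CharRange
{
    string s;
    bool empty() const { return s.length == 0; }
    char front() const { return s[0]; }
    void popFront() { s = s[1 .. $]; }
}

// Built-in strings are auto-decoded to dchar by the range primitives...
static assert(is(ElementType!string == dchar));
// ...but an equivalent user-defined char range is not decoded at all.
static assert(is(ElementType!CharRange == char));

void main() {}
```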
Aug 25 2014