
digitalmars.D - Making all strings UTF ranges has some risk of WTF

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
It's no secret that string et al. are not a magic recipe for writing 
correct Unicode code. However, things are pretty good and could be 
further improved by operating the following changes in std.array and 
std.range:

- make front() and back() for UTF-8 and UTF-16 automatically decode the 
first and last Unicode character

- make popFront() and popBack() skip one entire Unicode character 
(instead of just one code unit)

- alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings

- change hasLength to return false for UTF-8 and UTF-16 strings

These changes effectively make UTF-8 and UTF-16 bidirectional ranges, 
with the quirk that you still have a sort of a random-access operator.
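As a rough sketch of what the first two changes mean in code (not the actual Phobos implementation), front and popFront for a string could be written in terms of std.utf's decode and stride:

import std.utf : decode, stride;

// front: decode and return the first code point as a dchar
@property dchar front(string s)
{
    size_t i = 0;
    return decode(s, i);
}

// popFront: drop one whole code point, not just one code unit
void popFront(ref string s)
{
    s = s[stride(s, 0) .. $];
}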

I'm very strongly in favor of this change. Bidirectional strings allow 
beautiful correct algorithms to be written that handle encoded strings 
without any additional effort; with these changes, everything applicable 
of std.algorithm works out of the box (with the appropriate fixes here 
and there), which is really remarkable.

The remaining WTF is the length property. Traditionally, a range 
offering length also implies the expectation that a range of length n 
allows you to call popFront n times and then assert that the range is 
empty. However, if you check e.g. hasLength!string it will yield false, 
although the string does have an accessible member by that name and of 
the appropriate type.

Although Phobos always checks its assumptions, people might occasionally 
write code that just uses .length without checking hasLength. Then, 
they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.

(The "real" length of the range is not stored, but can be computed by 
using str.walkLength() in std.range.)
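For a concrete illustration of the mismatch under the proposed semantics:

import std.range : walkLength;

unittest
{
    string s = "résumé";         // 6 characters; each 'é' is 2 UTF-8 code units
    assert(s.length == 8);        // .length still counts code units
    assert(walkLength(s) == 6);   // the element count requires an O(n) walk
}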

What can be done about that? I see a number of solutions:

(a) Do not operate the change at all.

(b) Operate the change and mention that in range algorithms you should 
check hasLength and only then use "length" under the assumption that it 
really means "elements count".

(c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define 
a different name for that. Any other name (codeUnits, codes etc.) would 
do. The entire point is to not make algorithms believe strings have a 
.length property.

(d) Have std.range define a distinct property called e.g. "count" and 
then specialize it appropriately. Then change all references to .length 
in std.algorithm and elsewhere to .count.

What would you do? Any ideas are welcome.
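For concreteness, option (b) in a generic algorithm might look like this sketch (elementCount is just an illustrative helper, not a proposal):

import std.range;

size_t elementCount(R)(R r)
{
    static if (hasLength!R)
        return r.length;       // O(1), and length really means element count
    else
        return walkLength(r);  // e.g. UTF-8/UTF-16 strings: walk and decode, O(n)
}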


Andrei
Feb 03 2010
next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Wed, 03 Feb 2010 21:00:21 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:
 (a) Do not operate the change at all.

 (b) Operate the change and mention that in range algorithms you should  
 check hasLength and only then use "length" under the assumption that it  
 really means "elements count".

 (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define  
 a different name for that. Any other name (codeUnits, codes etc.) would  
 do. The entire point is to not make algorithms believe strings have a  
 .length property.

 (d) Have std.range define a distinct property called e.g. "count" and  
 then specialize it appropriately. Then change all references to .length  
 in std.algorithm and elsewhere to .count.

 What would you do? Any ideas are welcome.
I like b) and d), with a slight preference for d. I think the benefits of strings being encoding correct and able to use std.algorithm outweighs the disadvantages. And making char[] different from T[] is going to play havoc with templated algorithms. Another alternative is to remove the char types from the language and implement them as library ranges.
Feb 03 2010
prev sibling next sibling parent reply ZY Zhou <ZY Zhou.comm> writes:
Andrei Alexandrescu Wrote:
 What can be done about that? I see a number of solutions:
 
 (a) Do not operate the change at all.
 
 (b) Operate the change and mention that in range algorithms you should 
 check hasLength and only then use "length" under the assumption that it 
 really means "elements count".
 
 (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define 
 a different name for that. Any other name (codeUnits, codes etc.) would 
 do. The entire point is to not make algorithms believe strings have a 
 .length property.
 
 (d) Have std.range define a distinct property called e.g. "count" and 
 then specialize it appropriately. Then change all references to .length 
 in std.algorithm and elsewhere to .count.
 
 What would you do? Any ideas are welcome.
 
 Andrei
I choose (b)
Feb 03 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
ZY Zhou wrote:
 Andrei Alexandrescu Wrote:
 What can be done about that? I see a number of solutions:

 (a) Do not operate the change at all.

 (b) Operate the change and mention that in range algorithms you should 
 check hasLength and only then use "length" under the assumption that it 
 really means "elements count".

 (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define 
 a different name for that. Any other name (codeUnits, codes etc.) would 
 do. The entire point is to not make algorithms believe strings have a 
 .length property.

 (d) Have std.range define a distinct property called e.g. "count" and 
 then specialize it appropriately. Then change all references to .length 
 in std.algorithm and elsewhere to .count.

 What would you do? Any ideas are welcome.

 Andrei
I choose (b)
Perfect. Then you'll be glad to know that I changed all of Phobos to support it. All unittests pass now, but I suspect there are a couple of bugs left. The change will be part of the next release. I'll commit during the weekend. This is a very exciting development, I was very unhappy about the interaction between std.algorithm and strings. Andrei
Feb 03 2010
prev sibling next sibling parent reply Chad J <chadjoan __spam.is.bad__gmail.com> writes:
Andrei Alexandrescu wrote:
 ...
 
 What can be done about that? I see a number of solutions:
 
 (a) Do not operate the change at all.
 
 (b) Operate the change and mention that in range algorithms you should
 check hasLength and only then use "length" under the assumption that it
 really means "elements count".
 
 (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
 a different name for that. Any other name (codeUnits, codes etc.) would
 do. The entire point is to not make algorithms believe strings have a
 .length property.
 
 (d) Have std.range define a distinct property called e.g. "count" and
 then specialize it appropriately. Then change all references to .length
 in std.algorithm and elsewhere to .count.
 
 What would you do? Any ideas are welcome.
 
 
 Andrei
I'm leaning towards (c) here.

To me the .length on char[] and wchar[] are kinda like doing this:

struct SomePOD { int a, b; double y; }

SomePOD pod;
auto len = pod.length;
assert(len == 16); // true.

I'll admit it's not a perfect analogy. What I'm playing on here is that the .length on char[] and wchar[] returns the /size of/ the string in bytes rather than the /length/ of the string in number of (well-formed) characters.

Unfortunately .sizeof is supposed to return the size of the string's reference (8 bytes on x86 systems) and not the size of the string, IIRC. So that's taken.

So perhaps a .bytes or .nbytes property. Maybe make it work for arrays of structs and things like that too. A tuple (or any container) of non-homogeneous elements could probably benefit from this property as well.

Given such a property being available, I wouldn't miss .length at all. It's quite misleading.
Feb 03 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Chad J wrote:
 Andrei Alexandrescu wrote:
 ...

 What can be done about that? I see a number of solutions:

 (a) Do not operate the change at all.

 (b) Operate the change and mention that in range algorithms you should
 check hasLength and only then use "length" under the assumption that it
 really means "elements count".

 (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
 a different name for that. Any other name (codeUnits, codes etc.) would
 do. The entire point is to not make algorithms believe strings have a
 .length property.

 (d) Have std.range define a distinct property called e.g. "count" and
 then specialize it appropriately. Then change all references to .length
 in std.algorithm and elsewhere to .count.

 What would you do? Any ideas are welcome.


 Andrei
I'm leaning towards (c) here. To me the .length on char[] and wchar[] are kinda like doing this: struct SomePOD { int a, b; double y; } SomePOD pod; auto len = pod.length; assert(len == 16); // true. I'll admit it's not a perfect analogy. What I'm playing on here is that the .length on char[] and wchar[] returns the /size of/ the string in bytes rather than the /length/ of the string in number of (well-formed) characters. Unfortunately .sizeof is supposed to return the size of the string's reference (8 bytes on x86 systems) and not the size of the string, IIRC. So that's taken. So perhaps a .bytes or .nbytes property. Maybe make it work for arrays of structs and things like that too. A tuple (or any container) of non-homogeneous elements could probably benefit from this property as well. Given such a property being available, I wouldn't miss .length at all. It's quite misleading.
I hear you. Actually, to either quench or add to the confusion, .length for wstring returns the length in 16-bit units, not bytes. Andrei
Feb 03 2010
parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
Andrei Alexandrescu wrote:
 ...
 
 I hear you. Actually, to either quench or add to the confusion, .length
 for wstring returns the length in 16-bit units, not bytes.
 
 Andrei
0.o ... kay.
Feb 03 2010
prev sibling next sibling parent reply grauzone <none example.net> writes:
Andrei Alexandrescu wrote:
 What can be done about that? I see a number of solutions:
 
 (a) Do not operate the change at all.
 
 (b) Operate the change and mention that in range algorithms you should 
 check hasLength and only then use "length" under the assumption that it 
 really means "elements count".
 
 (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define 
 a different name for that. Any other name (codeUnits, codes etc.) would 
 do. The entire point is to not make algorithms believe strings have a 
 .length property.
 
 (d) Have std.range define a distinct property called e.g. "count" and 
 then specialize it appropriately. Then change all references to .length 
 in std.algorithm and elsewhere to .count.
 
 What would you do? Any ideas are welcome.
Change the type of string literals from char[] (or whatever the string type is in D2) to a wrapper struct defined in object.d:

struct string {
    char[] raw;
}

Now string.length is invalid, and you don't have to do weird stuff as in (b) or (c).

From here on, you could do two things:
1. add accessor methods to string like string classes in other languages do
2. leave the wrapper struct as it is (just add the required range foo), and require the user to use either a) the range API (with utf-8 decoding etc.) or b) access the raw "byte" string with string.raw.

I really liked how strings were simply char[]'s, but now with immutable, there's a lot of noise around it anyway, and there's no real value to strings being array slices anymore. Making the user deal directly with utf-8 was probably a bad idea to begin with.
Feb 03 2010
next sibling parent Trass3r <un known.com> writes:
Am 04.02.2010, 04:05 Uhr, schrieb grauzone <none example.net>:

 Andrei Alexandrescu wrote:
 What can be done about that? I see a number of solutions:
  (a) Do not operate the change at all.
  (b) Operate the change and mention that in range algorithms you should  
 check hasLength and only then use "length" under the assumption that it  
 really means "elements count".
  (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and  
 define a different name for that. Any other name (codeUnits, codes  
 etc.) would do. The entire point is to not make algorithms believe  
 strings have a .length property.
  (d) Have std.range define a distinct property called e.g. "count" and  
 then specialize it appropriately. Then change all references to .length  
 in std.algorithm and elsewhere to .count.
  What would you do? Any ideas are welcome.
Definitely against (c)+(d).
 Change the type of string literals from char[] (or whatever the string  
 type is in D2) to a wrapper struct defined in object.d:

 struct string {
      char[] raw;
 }
That sounds like a really reasonable way to me.
Feb 04 2010
prev sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
grauzone <none example.net> wrote:
 Change the type of string literals from char[] (or whatever the string  
 type is in D2) to a wrapper struct defined in object.d:

 struct string {
      char[] raw;
 }

 Now string.length is invalid, and you don't have to do weird stuff as in  
 (b) or (c).

  From here on, you could do 2 things:
 1. add accessor methods to string like string classes in other languages  
 do
 2. leave the wrapper struct as it is (just add the required range foo),  
 and require the user to use either a) the range API (with utf-8 decoding  
 etc.) or b) access the raw "byte" string with string.raw.
This seems to me a job for @disable:

struct string {
    immutable( char )[] payload;
    alias payload this;
    @disable int length( ) { return 0; }
}

-- Simen
Feb 04 2010
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 It's no secret that string et al. are not a magic recipe for writing 
 correct Unicode code.
I'm concerned it would be slow. Most operations on strings do not need to decode the unicode characters, for example, find, startsWith, etc., do not. Decoding then doing find, startsWith, etc., will be considerably slower.
Feb 03 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Walter Bright wrote:
 Andrei Alexandrescu wrote:
 It's no secret that string et al. are not a magic recipe for writing 
 correct Unicode code.
I'm concerned it would be slow. Most operations on strings do not need to decode the unicode characters, for example, find, startsWith, etc., do not. Decoding then doing find, startsWith, etc., will be considerably slower.
I thought you were going to say that, but fortunately it's easy to special-case certain algorithms for strings during compilation. In fact I already did - for example, Boyer-Moore searching would be very difficult to rewrite for variable-length characters, but there's no need for it. I special-cased that algorithm. I believe this is a good strategy.

Andrei
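The compile-time special-casing described here could look roughly like the following sketch; beginsWith is a made-up name and the real std.algorithm code differs:

import std.range;
import std.traits : isSomeString;

bool beginsWith(R)(R haystack, R needle)
{
    static if (isSomeString!R)
    {
        // Same-width UTF strings can be compared code unit by code unit;
        // no decoding is needed for an exact prefix match.
        return haystack.length >= needle.length
            && haystack[0 .. needle.length] == needle;
    }
    else
    {
        // Generic range path: element-by-element comparison.
        for (; !needle.empty; needle.popFront(), haystack.popFront())
        {
            if (haystack.empty || haystack.front != needle.front)
                return false;
        }
        return true;
    }
}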
Feb 03 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Walter Bright wrote:
 Andrei Alexandrescu wrote:
 It's no secret that string et al. are not a magic recipe for writing 
 correct Unicode code.
I'm concerned it would be slow. Most operations on strings do not need to decode the unicode characters, for example, find, startsWith, etc., do not. Decoding then doing find, startsWith, etc., will be considerably slower.
Oh, one more thing: doing mixed-width searches would require decoding. Andrei
Feb 03 2010
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 Oh, one more thing: doing mixed-width searches would require decoding.
Or a conversion before the loop starts of the search term.
Feb 03 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Walter Bright wrote:
 Andrei Alexandrescu wrote:
 Oh, one more thing: doing mixed-width searches would require decoding.
Or a conversion before the loop starts of the search term.
That triggers memory allocation plus the same cost. It's not likely to work any better. Andrei
Feb 03 2010
parent Ben Hanson <Ben.Hanson tfbplc.co.uk> writes:
Andrei Alexandrescu Wrote:

 Walter Bright wrote:
 Andrei Alexandrescu wrote:
 Oh, one more thing: doing mixed-width searches would require decoding.
Or a conversion before the loop starts of the search term.
That triggers memory allocation plus the same cost. It's not likely to work any better. Andrei
Exactly. Please don't go down the route that Microsoft did for regex case insensitive searching (at least when I last looked at it) - they made the *input string* all lower case... D'oh! Regards, Ben
Feb 04 2010
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 It's no secret that string et al. are not a magic recipe for writing
 correct Unicode code. However, things are pretty good and could be
 further improved by operating the following changes in std.array and
 std.range:
 - make front() and back() for UTF-8 and UTF-16 automatically decode the
 first and last Unicode character
 - make popFront() and popBack() skip one entire Unicode character
 (instead of just one code unit)
 - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
 - change hasLength to return false for UTF-8 and UTF-16 strings
 These changes effectively make UTF-8 and UTF-16 bidirectional ranges,
 with the quirk that you still have a sort of a random-access operator.
 I'm very strongly in favor of this change. Bidirectional strings allow
 beautiful correct algorithms to be written that handle encoded strings
 without any additional effort; with these changes, everything applicable
 of std.algorithm works out of the box (with the appropriate fixes here
 and there), which is really remarkable.
 The remaining WTF is the length property. Traditionally, a range
 offering length also implies the expectation that a range of length n
 allows you to call popFront n times and then assert that the range is
 empty. However, if you check e.g. hasLength!string it will yield false,
 although the string does have an accessible member by that name and of
 the appropriate type.
 Although Phobos always checks its assumptions, people might occasionally
 write code that just uses .length without checking hasLength. Then,
 they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
 (The "real" length of the range is not stored, but can be computed by
 using str.walkLength() in std.range.)
 What can be done about that? I see a number of solutions:
 (a) Do not operate the change at all.
 (b) Operate the change and mention that in range algorithms you should
 check hasLength and only then use "length" under the assumption that it
 really means "elements count".
 (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
 a different name for that. Any other name (codeUnits, codes etc.) would
 do. The entire point is to not make algorithms believe strings have a
 .length property.
 (d) Have std.range define a distinct property called e.g. "count" and
 then specialize it appropriately. Then change all references to .length
 in std.algorithm and elsewhere to .count.
 What would you do? Any ideas are welcome.
 Andrei
I personally would find this extremely annoying because most of the code I write that involves strings is scientific computing code that will never be internationalized, let alone released to the general public. I basically just use ASCII because it's all I need and if your UTF-8 string contains only ASCII characters, it can be treated as random-access. I don't know how many people out there are in similar situations, but I doubt they'll be too happy.

On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on top of immutable(ubyte)[] and call it AsciiString. Once alias this gets fully debugged, I could even make it implicitly convert to immutable(char)[].
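A rough sketch of that wrapper idea (AsciiString and toUTF8 are hypothetical names, not anything in Phobos):

struct AsciiString
{
    immutable(ubyte)[] data;
    alias data this;              // length, indexing and slicing stay O(1)

    // Every ASCII byte is also a valid UTF-8 code unit, so for pure-ASCII
    // data the reinterpretation below is sound.
    string toUTF8() { return cast(string) data; }
}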
Feb 03 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
dsimcha wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
 It's no secret that string et al. are not a magic recipe for writing
 correct Unicode code. However, things are pretty good and could be
 further improved by operating the following changes in std.array and
 std.range:
 - make front() and back() for UTF-8 and UTF-16 automatically decode the
 first and last Unicode character
 - make popFront() and popBack() skip one entire Unicode character
 (instead of just one code unit)
 - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
 - change hasLength to return false for UTF-8 and UTF-16 strings
 These changes effectively make UTF-8 and UTF-16 bidirectional ranges,
 with the quirk that you still have a sort of a random-access operator.
 I'm very strongly in favor of this change. Bidirectional strings allow
 beautiful correct algorithms to be written that handle encoded strings
 without any additional effort; with these changes, everything applicable
 of std.algorithm works out of the box (with the appropriate fixes here
 and there), which is really remarkable.
 The remaining WTF is the length property. Traditionally, a range
 offering length also implies the expectation that a range of length n
 allows you to call popFront n times and then assert that the range is
 empty. However, if you check e.g. hasLength!string it will yield false,
 although the string does have an accessible member by that name and of
 the appropriate type.
 Although Phobos always checks its assumptions, people might occasionally
 write code that just uses .length without checking hasLength. Then,
 they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
 (The "real" length of the range is not stored, but can be computed by
 using str.walkLength() in std.range.)
 What can be done about that? I see a number of solutions:
 (a) Do not operate the change at all.
 (b) Operate the change and mention that in range algorithms you should
 check hasLength and only then use "length" under the assumption that it
 really means "elements count".
 (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
 a different name for that. Any other name (codeUnits, codes etc.) would
 do. The entire point is to not make algorithms believe strings have a
 .length property.
 (d) Have std.range define a distinct property called e.g. "count" and
 then specialize it appropriately. Then change all references to .length
 in std.algorithm and elsewhere to .count.
 What would you do? Any ideas are welcome.
 Andrei
I personally would find this extremely annoying because most of the code I write that involves strings is scientific computing code that will never be internationalized, let alone released to the general public. I basically just use ASCII because it's all I need and if your UTF-8 string contains only ASCII characters, it can be treated as random-access. I don't know how many people out there are in similar situations, but I doubt they'll be too happy. On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on top of immutable(ubyte)[] and call it AsciiString. Once alias this gets fully debugged, I could even make it implicitly convert to immutable(char)[].
It's definitely going to be easy to use all sensible algorithms with immutable(ubyte)[]. But even if you go with string, there should be no problem at all. Remember, telling ASCII from UTF is one mask and one test away, and the way Walter and I wrote virtually all related routines was to special-case ASCII. In most cases I don't think you'll notice a decrease in performance. Andrei
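The "one mask and one test" is simply the high bit of the code unit; for illustration (isAsciiCodeUnit is just an illustrative name):

// An ASCII code unit has its most significant bit clear, so it can be
// handled without any decoding.
bool isAsciiCodeUnit(char c)
{
    return (c & 0x80) == 0;
}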
Feb 03 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 03 Feb 2010 23:41:02 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 dsimcha wrote:
  I personally would find this extremely annoying because most of the  
 code I write
 that involves strings is scientific computing code that will never be
 internationalized, let alone released to the general public.  I  
 basically just use
 ASCII because it's all I need and if your UTF-8 string contains only  
 ASCII
 characters, it can be treated as random-access.  I don't know how many  
 people out
 there are in similar situations, but I doubt they'll be too happy.
  On the other hand, I guess it wouldn't be hard to write a simple  
 wrapper struct on
 top of immutable(ubyte)[] and call it AsciiString.  Once alias this  
 gets fully
 debugged, I could even make it implicitly convert to immutable(char)[].
It's definitely going to be easy to use all sensible algorithms with immutable(ubyte)[]. But even if you go with string, there should be no problem at all. Remember, telling ASCII from UTF is one mask and one test away, and the way Walter and I wrote virtually all related routines was to special-case ASCII. In most cases I don't think you'll notice a decrease in performance.
I'm in the same camp as dsimcha; I generally write all my apps assuming ASCII strings (most are internal tools anyway). Can the compiler help make ASCII strings easier to use? I.e., this already works:

wstring s = "hello"; // converts to immutable(wchar)[]

what about this?

asciistring a = "hello";  // converts to immutable(ubyte)[] (or immutable(ASCIIChar)[])
asciistring a = "\uFBCD"; // error, requires cast.

The only issue that remains to be resolved then is the upgradability that ASCII characters currently enjoy for utf8. I.e. I can call any utf-8 accepting function with an ASCII string, but not an ASCII-string-accepting function with utf-8 data. Ideally, there should be a 7-bit ASCII character type that implicitly upconverts to char, and can be initialized with a string literal.

In addition, you are putting D's utf8 char even further away from C's ASCII char. It would be nice to separate compatible C strings from D strings. At some point, I should be able to designate that a function (even a C function) takes only ASCII data, and the compiler should disallow passing general utf8 data into it. This involves either renaming D's char to keep source closer to C, or rewriting C function signatures to reflect the difference.

-Steve
Feb 08 2010
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-02-03 21:00:21 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 It's no secret that string et al. are not a magic recipe for writing 
 correct Unicode code.
 
 [...]
 
 What would you do? Any ideas are welcome.
UTF-8 and UTF-16 encodings are interesting beasts. If you have a UTF-8 string and want to search for an occurrence of that string in another UTF-8 string, you don't have to decode each multi-byte code point: a binary comparison is enough. If you're counting the number of code points, all you need is to count the code units that are not continuation bytes (those of the form 10xxxxxx). If on the other hand you're applying a character-by-character transformation, then you need to fully decode each character, unless you're only interested in transforming characters from the lower non-multibyte subrange of the encoding (which happens quite often).

Clearly, I don't think there's a one-size-fits-all way to iterate over string arrays. Fully decoding each code unit is clearly the most costly method; it shouldn't be required when it's not necessary. I think we need to be able to represent char[] and wchar[] as a range of dchar to deal with cases where you want to iterate over Unicode code points, but I'd let the programmer ultimately decide what to do.

As for .length, I'll say that removing this property would make it hard to write low-level code. For instance, if I copy a string into a buffer, I need to know the length in bytes (array.length * sizeof(array[0])), not the number of characters. So it doesn't make much sense to disable .length.

So my answer would be mostly to leave things as they are. Perhaps the char[] and wchar[] as dchar ranges could be aliased to string and wstring, but that'd definitely be a blow to the philosophy of strings as simple arrays. You'd also still need to be able to access the actual array underneath. And will all the implicit conversions still work? I'm really not sure it's worth it, but perhaps.

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
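As a small illustration of the no-decoding count mentioned above (assuming well-formed UTF-8; codePointCount is an illustrative helper):

size_t codePointCount(const(char)[] s)
{
    size_t n = 0;
    foreach (char c; s)             // iterate code units
        if ((c & 0xC0) != 0x80)     // skip 0b10xxxxxx continuation bytes
            ++n;
    return n;
}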
Feb 03 2010
parent Chad J <chadjoan __spam.is.bad__gmail.com> writes:
Michel Fortin wrote:
 ...
 
 As for .length, I'll say that removing this property would make it hard
 to write low-level code. For instance, if I copy a string into a buffer,
 I need to know the length in bytes (array.length * sizeof(array[0])),
 not the number of characters. So it doesn't make much sense to disable
 .length.
 
 ...
 
This would be under the condition that there is another property that does the same (or similar) thing. It's the part where he said, "and define a different name for that". You could also just reinterpret them as byte[] or ubyte[]. Then they will behave in a low level way. No surprises.
Feb 04 2010
prev sibling next sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
Andrei Alexandrescu wrote:
 - make front() and back() for UTF-8 and UTF-16 automatically decode the
 first and last Unicode character
 
 - make popFront() and popBack() skip one entire Unicode character
 (instead of just one code unit)
 
 - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
 
 - change hasLength to return false for UTF-8 and UTF-16 strings
These are all fine for a dedicated string type. They're horrible for generic arrays, for the following reasons:

- They break generic code.
- They make it impossible to manipulate an array of code units as an array of code units.

-- Rainer Deyke - rainerd eldwood.com
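For example, a template like the following hypothetical helper stops compiling under the proposal, because the range element type of char[] becomes dchar rather than char:

import std.range;

E firstElement(E)(E[] arr)
{
    // Under the proposal, front yields dchar for char[]/wchar[], and dchar
    // does not implicitly convert back to E == char, so this breaks.
    return arr.front;
}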
Feb 03 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Rainer Deyke wrote:
 Andrei Alexandrescu wrote:
 - make front() and back() for UTF-8 and UTF-16 automatically decode the
 first and last Unicode character

 - make popFront() and popBack() skip one entire Unicode character
 (instead of just one code unit)

 - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings

 - change hasLength to return false for UTF-8 and UTF-16 strings
These are all fine for a dedicated string type. They're horrible for generic arrays, for the following reasons: - They break generic code. - They make it impossible to manipulate an array of code units as an array of code units.
Arrays of char and wchar are not quite generic - they are definitely UTF strings. Andrei
Feb 03 2010
parent reply Rainer Deyke <rainerd eldwood.com> writes:
Andrei Alexandrescu wrote:
 Arrays of char and wchar are not quite generic - they are definitely UTF
 strings.
A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a generic array of utf-8 code units. Sometimes these code units line up to form valid unicode code points, sometimes they don't. If you want a data type that always contains a valid utf-8 string, don't call it 'char[]'. It's misleading, it breaks generic code, and it renders built-in arrays useless for when you actually want an array of utf-8 code units. It's the same mistake as std::vector<bool> in C++, but much worse. -- Rainer Deyke - rainerd eldwood.com
Feb 03 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Rainer Deyke wrote:
 Andrei Alexandrescu wrote:
 Arrays of char and wchar are not quite generic - they are definitely UTF
 strings.
A 'char' is a single utf-8 code unit. A 'char[]' is (or should be) a generic array of utf-8 code units. Sometimes these code units line up to form valid unicode code points, sometimes they don't. If you want a data type that always contains a valid utf-8 string, don't call it 'char[]'. It's misleading, it breaks generic code, and it renders built-in arrays useless for when you actually want an array of utf-8 code units. It's the same mistake as std::vector<bool> in C++, but much worse.
I agree up to the assessment of the size of the problem and a couple of other points. I've had a great time writing utf code in D with string. Getting back to C++'s std::string really put things in perspective. If your purpose is to store some disparate utf-8 code units (a need that I've never had), I see no problem with storing them as ubyte[]. Andrei
Feb 03 2010
prev sibling next sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
Andrei Alexandrescu wrote:
 It's no secret that string et al. are not a magic recipe for writing
 correct Unicode code. However, things are pretty good and could be
 further improved by operating the following changes in std.array and
 std.range:

 - make front() and back() for UTF-8 and UTF-16 automatically decode the
 first and last Unicode character
They would yield dchar, right? Wouldn't that cause trouble in templated code?
 - make popFront() and popBack() skip one entire Unicode character
 (instead of just one code unit)
That's perfectly fine, because the opposite operations do "encode":

    string s = "ağ";
    assert(s.length == 3);
    s ~= 'ş';
    assert(s.length == 5);
 - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
Ok.
 - change hasLength to return false for UTF-8 and UTF-16 strings
I don't understand that one. strings have lengths. Adding and removing does not alter length by 1 for those types. I don't think it's a big deal. It is already so in the language for those types. dstring does not have that problem and could be used when by-1 change is desired.
 (b) Operate the change and mention that in range algorithms you should
 check hasLength and only then use "length" under the assumption that it
 really means "elements count".
The change sounds ok and hasLength should yield true. Or... can it return an enum { no, kind_of, yes } ;)

Current utf.decode takes the index by reference and modifies it by the amount. Could popFront() do something similar? I think that's it: front() and popFront() are separated for cohesion. What is causing trouble here is the separation of "by-N" from popFront(). You are concerned that the user assumes popFront() will reduce the length by 1. I think that is the problem here. How about something like:

    // returns the amount by which the next popFront() will reduce length
    int nextStep();

Ali
Feb 03 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Ali Çehreli wrote:
 Andrei Alexandrescu wrote:
  > It's no secret that string et al. are not a magic recipe for writing
  > correct Unicode code. However, things are pretty good and could be
  > further improved by operating the following changes in std.array and
  > std.range:
  >
  > - make front() and back() for UTF-8 and UTF-16 automatically decode the
  > first and last Unicode character
 
 They would yield dchar, right? Wouldn't that cause trouble in templated 
 code?
Yes, dchar. There was some figuring out in parts of Phobos, but the gains are well worth it. The simplifications are enormous. Until now, Phobos didn't hit the nail on the head with simple encoding/decoding/transcoding primitives. There were many attempts in std.utf, std.encoding, and std.string - all very clunky to use. Now I can just write s.front to get the first dchar of any string, and s.popFront to drop it. Very simple!
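In other words, iterating any string by code point reduces to a loop like this (a small usage sketch):

import std.array;   // supplies empty/front/popFront for arrays

void eachCodePoint(string s)
{
    while (!s.empty)
    {
        dchar c = s.front;   // decoded code point
        // ... use c ...
        s.popFront();        // advance by one whole code point
    }
}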
  > - make popFront() and popBack() skip one entire Unicode character
  > (instead of just one code unit)
 
 That's perfectly fine, because the opposite operations do "encode":
 
     string s = "ağ";
     assert(s.length == 3);
     s ~= 'ş';
     assert(s.length == 5);
 
  > - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
 
 Ok.
 
  > - change hasLength to return false for UTF-8 and UTF-16 strings
 
 I don't understand that one. strings have lengths. Adding and removing 
 does not alter length by 1 for those types. I don't think it's a big 
 deal. It is already so in the language for those types. dstring does not 
 have that problem and could be used when by-1 change is desired.
hasLength is a property used by range algorithms to tell them that a range stores the length with a particular meaning (the number of elements). It is perfectly fine that strings don't obey hasLength but do expose .length - it's just that it has different semantics.
  > (b) Operate the change and mention that in range algorithms you should
  > check hasLength and only then use "length" under the assumption that it
  > really means "elements count".
 
 The change sounds ok and hasLength should yield true. Or... can it 
 return an enum { no, kind_of, yes } ;)
 
 Current utf.decode takes the index by reference and modifies it by the 
 amount. Could popFront() do something similar?
I think we could dedicate a special function for that. In fact it does exist I think - it's called stride(). Andrei
Feb 03 2010
prev sibling parent Daniel Keep <daniel.keep.lists gmail.com> writes:
Ali Çehreli wrote:
 ...
 
 The change sounds ok and hasLength should yield true. Or... can it
 return an enum { no, kind_of, yes } ;)
 
 ...
 
 Ali
I believe what you're after is this: enum Bool { True, False, FileNotFound }
Feb 04 2010
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
I'd like D to give the right results and to be not bug-prone by default (even
if it's a bit slower) and do something more optimized but maybe faster when I
want it and I know what I am doing. So I like the idea of making strings more
correct.

I liked the idea of D strings acting like normal arrays, but they are different
data structures, so it's better to accept them as something different, even if
similar (especially UTF32 ones). So a specific struct, named like String, can
be used to represent a string. String literals too return such struct.

In the C language strlen() is O(n), so people can learn that finding the length
of most strings is O(n) in D too (UTF32 strings today are not common; if D gains
success in the future, UTF32 strings may become the only strings used - such
things have happened many times in computer history). The String structs can
cache the length and hash value inside themselves the first time such values
are computed (so each String is a struct 4 words long).

Time ago you (Andrei) told me that you don't like a function like len() to have
a different computational complexity, like O(n) or O(1), according to the
input. But a simple possibility is for the "foo".length property to return the
length of the walk over all chars, which is O(n) the first time you call it (it's
O(1) on UTF32 strings). Some algorithms or some parts of code (and string
literals, etc.) can know the length of the string they return or use, so
they can put this length value inside the String's cached value and there's
never a need to compute it.

The struct String that represents the string supports the [] operator too
(there can be a MutableString too, which is not used often), but it usually
walks the string to find the right character to return. A method like
"foo".ubyte(i) can then be used to return the i-th ubyte of the string. So initially
this [] is done with just an O(n) walk; later some performance optimization can
be added, like a simple skip tree based on indexes.
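A minimal sketch of that caching scheme, with a plain cached length standing in for the skip tree (hypothetical code, not a worked-out proposal):

import std.range : walkLength;

struct String
{
    immutable(char)[] str;        // the UTF-8 data (2 words)
    private size_t cachedLength;  // 0 doubles as "not yet computed"
    private size_t cachedHash;    // filled in lazily, the same way

    @property size_t length()
    {
        if (cachedLength == 0 && str.length != 0)
            cachedLength = walkLength(str);   // O(n) only the first time
        return cachedLength;
    }
}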

Bye,
bearophile
Feb 04 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
 The String structs can cache inside themselves the length and hash value the
first time such
 values are computed (so each String is a struct 4 words long).
So it's not really immutable. It contains an immutable dynamic array that can be called "str", plus two mutable words. To access the underlying 8/16/32-bit units it's composed of, you can use the [] of the str attribute:

    "foo".str[i]

Bye,
bearophile
Feb 04 2010
prev sibling parent Ali Çehreli <acehreli yahoo.com> writes:
bearophile wrote:
 Time ago you (Andrei) told me that you don't like a function like 
 len() to have a different computational complexity, like O(n) or O(1), according to the input.

This reminds me of an excellent talk by Matt Austern on STL's singly-linked lists. One of the interesting points of the design was about the length of the singly-linked list; in the end, he decided not to provide one. I think his point was that getting the length of the data structure was not one of the main operations of a singly-linked list. I agree that users who really need the length can wrap the list in a struct that stores it.

Although there doesn't seem to be a slide dedicated to that point, the presentation is here: http://www.accu-usa.org/Slides/SinglyLinkedLists.ppt

Ali
Feb 04 2010
prev sibling next sibling parent Jason House <jason.james.house gmail.com> writes:
Andrei Alexandrescu Wrote:

 It's no secret that string et al. are not a magic recipe for writing 
 correct Unicode code. However, things are pretty good and could be 
 further improved by operating the following changes in std.array and 
 std.range:
 
 These changes effectively make UTF-8 and UTF-16 bidirectional ranges, 
 with the quirk that you still have a sort of a random-access operator.
 
 I'm very strongly in favor of this change. Bidirectional strings allow 
 beautiful correct algorithms to be written that handle encoded strings 
 without any additional effort; with these changes, everything applicable 
 of std.algorithm works out of the box (with the appropriate fixes here 
 and there), which is really remarkable.
 
 The remaining WTF is the length property. Traditionally, a range 
 offering length also implies the expectation that a range of length n 
 allows you to call popFront n times and then assert that the range is 
 empty. However, if you check e.g. hasLength!string it will yield false, 
 although the string does have an accessible member by that name and of 
 the appropriate type.
 
 Although Phobos always checks its assumptions, people might occasionally 
 write code that just uses .length without checking hasLength. Then, 
 they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
 
 (The "real" length of the range is not stored, but can be computed by 
 using str.walkLength() in std.range.)
 
 What can be done about that? I see a number of solutions:
The underlying array of byte-sized data fragments is an implementation detail. hasLength is a kludge. Follow good OO design and hide the implementation details from the standard interface!

I would use a struct for UTF8 and UTF16 strings, and add a method to get the raw array. That allows simple, compiler-enforced usage while still allowing special casing on the raw data. As an added bonus, this method can generalize to other variable-width range elements.
Feb 04 2010
prev sibling next sibling parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
 What can be done about that? I see a number of solutions:

 (a) Do not operate the change at all.

 (b) Operate the change and mention that in range algorithms you should  
 check hasLength and only then use "length" under the assumption that it  
 really means "elements count".

 (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define  
 a different name for that. Any other name (codeUnits, codes etc.) would  
 do. The entire point is to not make algorithms believe strings have a  
 .length property.

 (d) Have std.range define a distinct property called e.g. "count" and  
 then specialize it appropriately. Then change all references to .length  
 in std.algorithm and elsewhere to .count.

 What would you do? Any ideas are welcome.
Of the above, I feel (b) is the correct solution, and I understand it has already been implemented in svn. -- Simen
Feb 04 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Simen kjaeraas:
 Of the above, I feel (b) is the correct solution, and I understand it
 has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-) Bye, bearophile
Feb 04 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Simen kjaeraas:
 Of the above, I feel (b) is the correct solution, and I understand
 it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach. Generally it feels like the new state of affairs is a solid improvement. One recurring problem has been that some code has assumed that ElementType!SomeString has the width of one encoding unit. That assumption is no longer true so I had to change such code with typeof(SomeString.init[0]). Probably I'll abstract that as CodeUnit!SomeString in std.traits. I also found some bugs; for example Levenshtein distance was erroneous because it didn't operate at character level. The fix using front and popFront was very simple. Regarding defining an entire new struct for strings, I think that's a sensible approach. With the new operators in tow, UString (universal string) that traffics in dchar and makes representation a detail would be nicely implementable. It could even have mutable elements at dchar granularity. My feeling is, however, that at this point too much toothpaste is out of the tube for that to happen in D2. That would be offset if current strings were unbearable, but I think they're working very well. Andrei
Feb 04 2010
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 bearophile wrote:
 Simen kjaeraas:
 Of the above, I feel (b) is the correct solution, and I understand
 it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.
Has any thought been given to foreach? Currently all these work for strings:

    foreach (c; "abc") { }        // typeof(c) is 'char'
    foreach (char c; "abc") { }
    foreach (wchar c; "abc") { }
    foreach (dchar c; "abc") { }

I'm concerned about the first case, where the element type is implicit. The implicit element type is (currently) the code unit. If the range uses code points ('dchar') as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too.

I think this would fit nicely. I was surprised at first when learning D and noticed that foreach didn't do this, that I had to explicitly ask for it.

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
Feb 04 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Michel Fortin wrote:
 On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> said:
 
 bearophile wrote:
 Simen kjaeraas:
 Of the above, I feel (b) is the correct solution, and I understand
 it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.
Has any thought been given to foreach? Currently all these work for strings: foreach (c; "abc") { } // typeof(c) is 'char' foreach (char c; "abc") { } foreach (wchar c; "abc") { } foreach (dchar c; "abc") { } I'm concerned about the first case where the element type is implicit. The implicit element type is (currently) the code units. If the range use code points 'dchar' as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first when learning D and I noticed that foreach didn't do this, that I had to explicitly has for it.
This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andrei
Feb 04 2010
next sibling parent reply Don <nospam nospam.com> writes:
Andrei Alexandrescu wrote:
 Michel Fortin wrote:
 On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> said:

 bearophile wrote:
 Simen kjaeraas:
 Of the above, I feel (b) is the correct solution, and I understand
 it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.
Has any thought been given to foreach? Currently all these work for strings: foreach (c; "abc") { } // typeof(c) is 'char' foreach (char c; "abc") { } foreach (wchar c; "abc") { } foreach (dchar c; "abc") { } I'm concerned about the first case where the element type is implicit. The implicit element type is (currently) the code units. If the range use code points 'dchar' as the element type, then I think foreach needs to be changed so that the default element type is 'dchar' too (in the first line of my example). Having ranges and foreach disagree on this would be very inconsistent. Of course you should be allowed to iterate using 'char' and 'wchar' too. I think this would fit nicely. I was surprised at first when learning D and I noticed that foreach didn't do this, that I had to explicitly has for it.
This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andrei
We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression. It makes me wonder if the char, wchar types actually make any sense.

If char[] is actually a UTF string, then char[] ~ char should be permitted ONLY if char can be implicitly converted to dchar. Otherwise, you're performing

    cast(char[])(cast(ubyte[])s ~ cast(ubyte)c)

which will not necessarily result in a valid unicode string.

I suspect that string, wstring should have been the primary types and had a .codepoints property, which returned a ubyte[] resp. ushort[] reference to the data. It's too late, of course. The extra value you get by having a specific type for 'this is a code point for a UTF8 string' seems to be very minor, compared to just using a ubyte.
Feb 04 2010
next sibling parent reply Jerry Quinn <jlquinn optonline.net> writes:
Don Wrote:
 We seem to be approaching the point where char[], wchar[] and dchar[] 
 are all arrays of dchar, but with different levels of compression.
 It makes me wonder if the char, wchar types actually make any sense.
 If char[] is actually a UTF string, then char[] ~ char should be 
 permitted ONLY if char can be implicitly converted to dchar. Otherwise, 
 you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will 
 not necessarily result in a valid unicode string.
Well, if you're working with a LOT of text, you may be mmapping GB's of UTF-8 text. Yes, this does happen. You better be able to handle it in a sane manner, i.e. not reallocating the memory to read the data in. So, there is a definite need for casting to array of char, and dealing with the inevitable stray non-unicode char in that mess. Real-world text processing can be a messy affair. It probably requires walking such an array and returning slices cast to char after they've been validated.
Feb 04 2010
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Jerry Quinn (jlquinn optonline.net)'s article
 Don Wrote:
 We seem to be approaching the point where char[], wchar[] and dchar[]
 are all arrays of dchar, but with different levels of compression.
 It makes me wonder if the char, wchar types actually make any sense.
 If char[] is actually a UTF string, then char[] ~ char should be
 permitted ONLY if char can be implicitly converted to dchar. Otherwise,
 you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will
 not necessarily result in a valid unicode string.
Well, if you're working with a LOT of text, you may be mmapping GB's of UTF-8
 text. Yes, this does happen. You better be able to handle it in a sane manner, i.e. not reallocating the memory to read the data in. So, there is a definite need for casting to array of char, and dealing with the inevitable stray non-unicode char in that mess.

Welcome to the world of DNA sequence manipulation.
Feb 04 2010
prev sibling next sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
Don wrote:
 I suspect that string, wstring should have been the primary types and
 had a .codepoints property, which returned a ubyte[] resp. ushort[]
 reference to the data. It's too late, of course. The extra value you get
 by having a specific type for 'this is a code point for a UTF8 string'
 seems to be very minor, compared to just using a ubyte.
If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics. -- Rainer Deyke - rainerd eldwood.com
Feb 04 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Rainer Deyke wrote:
 Don wrote:
 I suspect that string, wstring should have been the primary types and
 had a .codepoints property, which returned a ubyte[] resp. ushort[]
 reference to the data. It's too late, of course. The extra value you get
 by having a specific type for 'this is a code point for a UTF8 string'
 seems to be very minor, compared to just using a ubyte.
If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics.
One idea I've had for a while was to have a universal string type:

struct UString {
    union {
        char[] utf8;
        wchar[] utf16;
        dchar[] utf32;
    }
    enum Discriminator { utf8, utf16, utf32 };
    Discriminator kind;
    IntervalTree!(size_t) skip;
    ...
}

The IntervalTree stores the skip amounts that must be added for a given index in the string. For ASCII strings that would be null. Then its size grows with the number of multibyte characters. Beyond a threshold, representation is transparently switched to utf16 or utf32 as needed and the tree becomes smaller or null again.

In an advanced implementation the discriminator and the tree could be stored at negative offset, and the tree could be compressed taking advantage of its limited size. That would make UString quite low-overhead while offering a staunchly dchar-based interface.

I don't mind at all using string, but I also think UString would be a good extra abstraction.

Andrei
Feb 04 2010
next sibling parent grauzone <none example.net> writes:
Andrei Alexandrescu wrote:
 Rainer Deyke wrote:
 Don wrote:
 I suspect that string, wstring should have been the primary types and
 had a .codepoints property, which returned a ubyte[] resp. ushort[]
 reference to the data. It's too late, of course. The extra value you get
 by having a specific type for 'this is a code point for a UTF8 string'
 seems to be very minor, compared to just using a ubyte.
If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics.
One idea I've had for a while was to have a universal string type: struct UString { union { char[] utf8; wchar[] utf16; dchar[] utf32; } enum Discriminator { utf8, utf16, utf32 }; Discriminator kind; IntervalTree!(size_t) skip; ... }
You mean like this? http://www.dprogramming.com/mtext.php
Feb 04 2010
prev sibling next sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
Andrei Alexandrescu wrote:
 One idea I've had for a while was to have a universal string type:
 
 struct UString {
     union {
         char[] utf8;
         wchar[] utf16;
         dchar[] utf32;
     }
     enum Discriminator { utf8, utf16, utf32 };
     Discriminator kind;
     IntervalTree!(size_t) skip;
     ...
 }
 
 The IntervalTree stores the skip amounts that must be added for a given
 index in the string. For ASCII strings that would be null. Then its size
 grows with the number of multibyte characters. Beyond a threshold,
 representation is transparently switched to utf16 or utf32 as needed and
 the tree becomes smaller or null again.
Although I see some potential in a universal string type, I don't think
this is the right implementation strategy. I'd rather have my short
strings in utf-32 (optimized for speed) and my long strings in
utf-8/utf-16 (optimized for memory usage).

-- 
Rainer Deyke - rainerd eldwood.com
Feb 04 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Rainer Deyke wrote:
 Andrei Alexandrescu wrote:
 One idea I've had for a while was to have a universal string type:

 struct UString {
     union {
         char[] utf8;
         wchar[] utf16;
         dchar[] utf32;
     }
     enum Discriminator { utf8, utf16, utf32 };
     Discriminator kind;
     IntervalTree!(size_t) skip;
     ...
 }

 The IntervalTree stores the skip amounts that must be added for a given
 index in the string. For ASCII strings that would be null. Then its size
 grows with the number of multibyte characters. Beyond a threshold,
 representation is transparently switched to utf16 or utf32 as needed and
 the tree becomes smaller or null again.
Although I see some potential in a universal string type, I don't think this is the right implementation strategy. I'd rather have my short strings in utf-32 (optimized for speed) and my long strings in utf-8/utf-16 (optimized for memory usage).
The definition I outlined does not specify or constrain the strategy of changing the discriminator. Andrei
Feb 04 2010
prev sibling parent Michael Rynn <michaelrynn optusnet.com.au> writes:
On Thu, 04 Feb 2010 18:41:48 -0700, Rainer Deyke wrote:

 Andrei Alexandrescu wrote:
 One idea I've had for a while was to have a universal string type:
 
 struct UString {
     union {
         char[] utf8;
         wchar[] utf16;
         dchar[] utf32;
     }
     enum Discriminator { utf8, utf16, utf32 }; Discriminator kind;
     IntervalTree!(size_t) skip;
     ...
 }
 
Firstly, for such "augmented types" in D, such as strings, bignums or any
future ideas, it is great to have the facilities for creating them with a
struct, so that they can be used elsewhere without regard to whether they
are built in as compiler specials or live in the library. What is there
for struct now is good and getting better in D2, but I still feel a little
insecure about how to make a really optimal implementation that is as good
as a built-in type the compiler understands. The DPL book has been a help
for this.

Programmers will want to use raw char[], wchar[] and dchar[] for whatever
reasons, with their "simple" behaviours, so they should not be made
unavailable purely because more sophisticated types for Unicode strings
are creatable.

I have made a UString implementation, similar to the above, but I played a
different trick. I wanted it to also maintain a terminating null char so
it could be passed to Windows API functions, in particular the 16-bit W
interfaces.

struct UString_char {
    char[] str_;
    /// ... lots of good D type stuff, constructor and assign conversions, access

    size_t length() {
        return str_.length - 1; // hide terminating null
    }
}

struct UString_wchar {
    wchar[] str_;
    /// ditto D type stuff
}

struct UString_dchar {
    dchar[] str_;
}

// throw in void[] for charity (although no one will need it)
struct UString_void {
    void[] ptr_;
}

enum UStringType { UC_CHAR, UC_WCHAR, UC_DCHAR }

struct UString {
    union {
        UString_void  vstr;
        UString_char  cstr;
        UString_wchar wstr;
        UString_dchar dstr;
    }
    UStringType ztype; // type tag to track what we are
}

I could then choose individual components by themselves, where
appropriate, and even get them working. In D2, immutable doesn't work for
the str_ array while appending or fiddling with the null terminator. I did
not get associative arrays working with UString as a key, and have not
tried since. I also made a class version called VString containing the
union. There are a lot of issues.

I also must acknowledge the prior art of the mtext code and its MString
structure type. I was partly inspired by seeing it, and by how complex it
was to do nearly everything. When I last checked, mtext was kind of broken
for recent D1 and D2 compilers, and I did not want to fix it. I admit I
did not like the complexity of the direct union { char[], wchar[], dchar[] };
splitting it up into separately usable structs seems to me to give three
times the potential for the same price.

The advantage of using struct for such types is that it may help bring
about the perfection of such a POD-based "type creation" facility. I note
from looking at some of the Phobos D2 code, e.g. std.array, that this
seems to be attempted in places. Nearly all the more interesting D types,
arrays and maps, are equivalent to smallish POD types, at least 2-3 times
the machine word size (32/64 bit). Making it all work and understandable,
and avoiding WTF bugs, is a big challenge.
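As a minimal usage sketch of the trailing-null idea (makeW is a
hypothetical helper, not part of the code above; it assumes UString_wchar
hides the terminator from length the same way UString_char does):

import std.utf : toUTF16;

// Hypothetical helper: build a UString_wchar whose buffer carries the
// hidden terminating null, so str_.ptr can go straight to a W API.
UString_wchar makeW(string s)
{
    UString_wchar u;
    u.str_ = toUTF16(s).dup ~ '\0';   // transcode, then append the hidden null
    return u;
}

unittest
{
    auto u = makeW("hello");
    assert(u.str_.length == 6);        // five code units plus the terminator
    assert(u.str_[$ - 1] == '\0');     // ready for W APIs via u.str_.ptr
}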
Feb 05 2010
prev sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-02-04 18:16:55 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 Rainer Deyke wrote:
 Don wrote:
 I suspect that string, wstring should have been the primary types and
 had a .codepoints property, which returned a ubyte[] resp. ushort[]
 reference to the data. It's too late, of course. The extra value you get
 by having a specific type for 'this is a code point for a UTF8 string'
 seems to be very minor, compared to just using a ubyte.
If it's not too late to completely change the semantics of char[], then it's also not too late to dump 'char' completely. If it /is/ too late to remove 'char', then 'char[]' should retain the current semantics and a new string type should be added for the new semantics.
 One idea I've had for a while was to have a universal string type:
 
 struct UString {
     union {
         char[] utf8;
         wchar[] utf16;
         dchar[] utf32;
     }
     enum Discriminator { utf8, utf16, utf32 };
     Discriminator kind;
     IntervalTree!(size_t) skip;
     ...
 }
That's a nice concept, but it seems to me that it adds much overhead to
improve a rather niche area. It's not often that you need to access
characters by index. Generally, when you do, it's because you've already
parsed the string and want to return to a previous location, in which case
you're better off saving the range, or the index in code units rather than
the index in code points, at the time you first parse.

But I have to say I'm quite satisfied with the way D handles strings in
general. Easy access to code points and direct access to the data is quite
handy; I think it fits very well with a low-level language.

I'd say that in general, when manipulating strings, I rarely need to
bother about code points. Most of the time I'm just searching for
ASCII-range markers when parsing, so I can search for them directly as
code units, not bothering at all about multi-byte characters. If I'm
looking for a substring, then I can search by code units too. It's just
for the fancier stuff (case-insensitive searching, character
transformation) that it becomes necessary to work with code points. That's
why I'm a little wary about your changes in that area: I fear they'll make
the common case of working with code units more difficult to deal with.
But I won't judge before I see.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
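A small illustration of why the code-unit search is safe in UTF-8 (the
example string is arbitrary): every code unit of a multi-byte sequence has
its high bit set, so a byte-wise scan for an ASCII marker can never match
in the middle of one.

void main()
{
    string s = "prix: 10€;";             // the euro sign is three UTF-8 code units

    // Byte-wise scan for the ASCII ';' marker. The multi-byte '€' cannot
    // cause a false match, because all of its code units are >= 0x80,
    // while ';' is < 0x80.
    size_t i = 0;
    while (i < s.length && s[i] != ';')
        ++i;

    assert(i == 11 && s[i] == ';');      // found at code-unit index 11
}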
Feb 04 2010
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 Andrei Alexandrescu wrote:
 Michel Fortin wrote:
 On 2010-02-04 12:19:42 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> said:

 bearophile wrote:
 Simen kjaeraas:
 Of the above, I feel (b) is the correct solution, and I understand
 it has already been implemented in svn.
Yes, I presume he was mostly looking for a justification of his ideas he has already accepted and even partially implemented :-)
I am ready to throw away the implementation as soon as a better idea comes around. As other times, I operated the change to see how things feel with the new approach.
Has any thought been given to foreach? Currently all these work for
strings:

    foreach (c; "abc") { } // typeof(c) is 'char'
    foreach (char c; "abc") { }
    foreach (wchar c; "abc") { }
    foreach (dchar c; "abc") { }

I'm concerned about the first case where the element type is implicit.
The implicit element type is (currently) the code units. If the range
uses code points 'dchar' as the element type, then I think foreach needs
to be changed so that the default element type is 'dchar' too (in the
first line of my example). Having ranges and foreach disagree on this
would be very inconsistent. Of course you should be allowed to iterate
using 'char' and 'wchar' too.

I think this would fit nicely. I was surprised at first when learning D
and I noticed that foreach didn't do this, that I had to explicitly ask
for it.
This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andrei
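For concreteness, a small example of the difference being discussed (the
expected output is noted in comments; the non-ASCII literal is just for
illustration):

import std.stdio;

void main()
{
    string s = "é";                    // one code point, two UTF-8 code units

    foreach (c; s)                     // today: c is char, iterates code units
        writefln("unit:  %02x", c);    // prints two bytes: c3, a9

    foreach (dchar c; s)               // proposed default: iterate code points
        writefln("point: %04x", c);    // prints 00e9 once
}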
We seem to be approaching the point where char[], wchar[] and dchar[] are all arrays of dchar, but with different levels of compression.
That is a good way to look at things.
 It makes me wonder if the char, wchar types actually make any sense.
 If char[] is actually a UTF string, then char[] ~ char should be 
 permitted ONLY if char can be implicitly converted to dchar. Otherwise, 
 you're performing cast(char[])(cast(ubyte[])s ~ cast(ubyte)c) which will 
 not necessarily result in a valid unicode string.
Well, as it's been mentioned, sometimes you may assemble a string out of
individual characters. Probably that case is rare enough to warrant a
cast. Note that today char is already convertible to dchar (there's no
checking).
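For the record, a short example of the hole Don describes (the byte value
is arbitrary; the validate call is just one way to surface the problem):

import std.utf : validate;

void main()
{
    char[] s = "abc".dup;
    s ~= cast(char) 0xE9;      // a lone Latin-1 'é' byte: accepted by ~,
                               // but not a valid UTF-8 sequence on its own
    validate(s);               // throws a UTF exception at run time
}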
 I suspect that string, wstring should have been the primary types and 
 had a .codepoints property, which returned a ubyte[] resp. ushort[] 
 reference to the data. It's too late, of course. The extra value you get 
 by having a specific type for 'this is a code point for a UTF8 string' 
 seems to be very minor, compared to just using a ubyte.
What we can do is to have to!(const ubyte[]) work for all UTF8 strings and to!(const ushort[]) work for all UTF16 strings. That view is correct and safe. Also, it's not difficult to add a .codepoints pseudo-property. Andrei
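A sketch of what such a pseudo-property could look like (the name follows
Don's suggestion; none of this is existing Phobos code):

// Hypothetical free functions, callable as s.codepoints thanks to D's
// member-call syntax for arrays; they expose the raw code units of the
// string, which is the safe reinterpretation mentioned above.
const(ubyte)[] codepoints(const(char)[] s)
{
    return cast(const(ubyte)[]) s;
}

const(ushort)[] codepoints(const(wchar)[] s)
{
    return cast(const(ushort)[]) s;
}

unittest
{
    string s = "é";
    assert(s.codepoints.length == 2);   // two UTF-8 code units
}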
Feb 04 2010
prev sibling parent Justin Johansson <no spam.com> writes:
Andrei Alexandrescu wrote:
 Has any thought been given to foreach? Currently all these work for 
 strings:

     foreach (c; "abc") { } // typeof(c) is 'char'
     foreach (char c; "abc") { }
     foreach (wchar c; "abc") { }
     foreach (dchar c; "abc") { }

 I'm concerned about the first case where the element type is implicit. 
 The implicit element type is (currently) the code units. If the range 
 uses code points 'dchar' as the element type, then I think foreach 
 needs to be changed so that the default element type is 'dchar' too 
 (in the first line of my example). Having ranges and foreach disagree 
 on this would be very inconsistent. Of course you should be allowed to 
 iterate using 'char' and 'wchar' too.

 I think this would fit nicely. I was surprised at first when learning 
 D and I noticed that foreach didn't do this, that I had to explicitly 
 ask for it.
This is a good point. I'm in favor of changing the language to make the implicit type dchar. Andrei
I concur. It's great to see consensus moving in this direction.

For too long Java has suffered from the error that a short (i.e. a UTF-16
code unit) is just about as good as a full Unicode code point (i.e. a
UTF-32 "code unit"). As a result, the near-enough-is-good-enough 16-bit
Java APIs mean that programmers either forget (at best) or become slack
(at worst) in dealing with valid Unicode characters. Part of this also
stems from the culture that if it ain't ASCII or in a Western character
set (BMP), who cares.

As a matter of taste, I'd prefer to see a dchar Unicode code point
officially acknowledged/ordained as "unichar", though I guess there is
always the alias resort for pedants like myself.

Cheers,
Justin Johansson
Feb 04 2010
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 I am ready to throw away the implementation as soon as a better idea 
 comes around. As other times, I operated the change to see how things 
 feel with the new approach.
And just to be sure: you are better than me because you actually implement things, while I am here just complaining all the time :o) So thank you for your work. Later, bearophile
Feb 04 2010
prev sibling parent reply Steve Teale <steve.teale britseyeview.com> writes:
Andrei Alexandrescu Wrote:

 
 What would you do? Any ideas are welcome.
 
Andrei, congratulations on starting the most interesting thread I have seen on dm.D for as long as I can remember. It has managed to stay really focused, and has produced a lot of interesting suggestions. I don't envy you the task of sorting through them - but you did ask. From what you've seen so far, are we talking Phobos or language issues, and if both, in what proportion? Steve.
Feb 05 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Steve Teale wrote:
 Andrei Alexandrescu Wrote:
 
 What would you do? Any ideas are welcome.
Andrei, congratulations on starting the most interesting thread I have seen on dm.D for as long as I can remember. It has managed to stay really focused, and has produced a lot of interesting suggestions. I don't envy you the task of sorting through them - but you did ask. From what you've seen so far, are we talking Phobos or language issues, and if both, in what proportion?
A bit of both, with an emphasis on library. Andrei
Feb 05 2010