digitalmars.D - Beginner not getting "string"
- Nick (24/24) Aug 29 2010 Reading Andrei's book and something seems amiss:
- dsimcha (14/34) Aug 29 2010 Basically, the reason is because you can't have a regular array of code ...
- Andrei Alexandrescu (39/61) Aug 29 2010 (Background for others: code point == actual conceptual character, code
- Nick Sabalausky (3/4) Aug 29 2010 Please tell me that's a typo. This isn't the era of UCS-2.
- BCS (6/13) Aug 29 2010 Even if choosing to use UTF-16 is a bad idea (and I'm in no position to ...
- Nick Sabalausky (4/15) Aug 29 2010 What I meant is that I'm fairly sure a wchar should be a code *unit* (ju...
- BCS (6/25) Aug 29 2010 Given that I had to read three differnt pages twice each before I locate...
- Andrei Alexandrescu (3/8) Aug 29 2010 Yah, typo.
- Steven Schveighoffer (6/13) Aug 30 2010 char[] x;
Reading Andrei's book and something seems amiss:

1. A char in D is a code *unit*, not a code point. Considering that code units are generally used to encode in an encoding, I would have expected the type for a code unit to be byte or something similar, as far from code points as possible. In my mind, Unicode characters, aka chars, are code points.

2. Thus a string in D is an array of code *units*, although in Unicode a string is really an array of code points.

3. Iterating a string in D is wrong by default, iterating over code units instead of characters (code points). Even worse, the error does not appear until you put some non-ASCII text in there.

4. All string-processing calls (like sort, toupper, split and such) are by default wrong on non-ASCII strings. Wrong without any error, warning or anything.

So I guess my question is: why, in a language with the power and expressiveness of D, in our day and age, would one choose such an exposed, fragile implementation of string that ensures that the default code one writes for text manipulation is most likely wrong? I18N is one of the first things I judge a new language by, and so far D is... puzzling. I don't know much about D, so I am probably just not getting it, but can you please point me to some rationale behind these string design decisions?

Thanks!
Nick
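[A minimal sketch of the pitfall in point 3, assuming UTF-8 source encoding; the sample text is arbitrary:]

```d
// Sketch: default iteration over a string visits code units, not
// code points; the counts below assume UTF-8 source encoding.
import std.stdio;

void main()
{
    string s = "héllo";       // 'é' occupies two UTF-8 code units
    assert(s.length == 6);    // length counts code units, not characters
    foreach (c; s)            // c is inferred as immutable(char): a code unit
        write(cast(int) c, ' ');  // 'é' shows up as two separate numbers
    writeln();
}
```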
Aug 29 2010
== Quote from Nick (nick example.com)'s article
> 1. A char in D is a code *unit* not a code point. Considering that code
> units are generally used to encode in an encoding, I would have expected
> that the type for a code unit to be byte or something similar, as far
> from code points as possible. In my mind, Unicode characters, aka chars,
> are code points.

Basically, the reason is that you can't have a regular array of code points; you'd need to maintain some additional data structures. These can easily be built on top of an array of code units. You can't build an array of code units on top of an array of code points, at least not efficiently.

> 2. Thus a string in D is an array of code *units*, although in Unicode a
> string is really an array of code points.

This is admittedly a wart. However, when you use std.range, ElementType!(string) == dchar, so iterating with range primitives does what you'd think it should.

> 3. Iterating a string in D is wrong by default, iterating over code units
> instead of characters (code points). Even worse, the error does not
> appear until you put some non-ASCII text in there.

See the answer to (2).

> 4. All string-processing calls (like sort, toupper, split and such) are
> by default wrong on non-ASCII strings. Wrong without any error, warning
> or anything.

If these don't work right for non-ASCII strings then it's a bug, not a design issue. Please file bug reports (one per function).

> So I guess my question is why, in a language with the power and
> expressiveness of D, in our day and age, would one choose such an
> exposed, fragile implementation of string that ensures that the default
> code one writes for text manipulation is most likely wrong? I18N is one
> of the first things I judge a new language by and so far D is... puzzling.

Part of it was a silly design error that became too hard to change. Most of it, though, is to avoid abstraction inversion (http://en.wikipedia.org/wiki/Abstraction_inversion) by providing access to the lower-level aspects of Unicode strings.
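[The ElementType!(string) == dchar point can be sketched like this, assuming a reasonably recent std.range:]

```d
// Sketch: via std.range, strings expose dchar elements, so range-based
// iteration decodes whole code points rather than raw code units.
import std.range, std.stdio;

void main()
{
    static assert(is(ElementType!string == dchar));
    string s = "héllo";
    assert(s.length == 6);      // code units (array view)
    assert(s.walkLength == 5);  // code points (range view)
    foreach (dchar c; s)
        write(c, ' ');          // whole characters, decoded on the fly
    writeln();
}
```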
Aug 29 2010
On 08/29/2010 12:44 PM, Nick wrote:
> 1. A char in D is a code *unit* not a code point. Considering that code
> units are generally used to encode in an encoding, I would have expected
> that the type for a code unit to be byte or something similar, as far
> from code points as possible. In my mind, Unicode characters, aka chars,
> are code points.

(Background for others: code point == actual conceptual character; code unit == the smallest unit of encoding: one byte for UTF-8, two bytes for UTF-16, four bytes for UTF-32. In UTF-32, code units are chosen to be equal to code points.)

Indeed, D's char is a UTF-8 code unit, and wchar is a UTF-16 code point. (dchar is at the same time a UTF-32 code unit and a Unicode code point.) Making the type of a code unit byte would considerably weaken the expressive power, because a byte[] could be considered either untyped data or UTF-encoded data without a static means to differentiate between the two. This would be largely obviated by making string an elaborate type, but there are considerable advantages to having string be a regular array type.

> 2. Thus a string in D is an array of code *units*, although in Unicode a
> string is really an array of code points.

In Unicode a string is generally a _sequence_ of code points. Due to the variable-length encoding enacted by UTF-8 and UTF-16, it would be difficult to emulate array semantics on such representations.

> 3. Iterating a string in D is wrong by default, iterating over code units
> instead of characters (code points). Even worse, the error does not
> appear until you put some non-ASCII text in there.

It's been discussed before that foreach (c; str) should set by default the type of c to dchar. I agree. That being said, iterating a string with the formal iteration mechanism defined by std.range is always correct and moves one code point at a time. So what I can advise is to use foreach (dchar c; str). Other than that, everything should work properly.

> 4. All string-processing calls (like sort, toupper, split and such) are
> by default wrong on non-ASCII strings. Wrong without any error, warning
> or anything.

You'll be glad to hear that this assumption is false.

1. sort does not compile for char[] or wchar[]. The reason is that char[] and wchar[] do not obey the random-access requirements.

2. All overloads of split work correctly with non-ASCII strings. If you find anything that doesn't, that's a bug in the implementation, not in the design. I also recommend you look up splitter in std.algorithm.

> So I guess my question is why, in a language with the power and
> expressiveness of D, in our day and age, would one choose such an
> exposed, fragile implementation of string that ensures that the default
> code one writes for text manipulation is most likely wrong? I18N is one
> of the first things I judge a new language by and so far D is...
> puzzling. I don't know much about D so I am probably just not getting it
> but can you please point me to some rationale behind these string design
> decisions?

Support of UTF in D could be better, but it definitely compares favorably to that in many other languages (including all languages that I know). The choice of array clarifies the representation and offers random access to individual code units, which is sometimes necessary for efficient manipulation. However, the formal range interface offers bidirectional access to code points. As I mentioned elsewhere, I could not find an edit distance implementation for any language other than D that works directly on UTF-encoded inputs. And it's not special-cased: the same implementation works, e.g., for lists of integers.

Andrei
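[A quick check of the split claim; the sample text is arbitrary, and std.array.split is assumed:]

```d
// Sketch: splitting a non-ASCII string on an ASCII separator never
// breaks a multi-byte UTF-8 sequence, since separator bytes cannot
// occur inside one.
import std.array, std.stdio;

void main()
{
    string s = "grüne Äpfel hier";
    auto words = s.split(" ");
    assert(words.length == 3);
    assert(words[1] == "Äpfel");  // multi-byte characters survive intact
    writeln(words);
}
```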
Aug 29 2010
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message news:i5e88k$2p0t$1 digitalmars.com...
> wchar is a UTF-16 code point.

Please tell me that's a typo. This isn't the era of UCS-2.
Aug 29 2010
Hello Nick,

> "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
> news:i5e88k$2p0t$1 digitalmars.com...
>> wchar is a UTF-16 code point.
>
> Please tell me that's a typo. This isn't the era of UCS-2.

Even if choosing to use UTF-16 is a bad idea (and I'm in no position to say it is), being able to read/write it to interact with things that already use it is a good idea.

-- 
... <IXOYE><
Aug 29 2010
"BCS" <none anon.com> wrote in message news:a6268ff1afa68cd1591044967ce news.digitalmars.com...
> Hello Nick,
>
>>> wchar is a UTF-16 code point.
>>
>> Please tell me that's a typo. This isn't the era of UCS-2.
>
> Even if choosing to use UTF-16 is a bad idea (and I'm in no position to
> say it is), being able to read/write it to interact with things that
> already use it is a good idea.

What I meant is that I'm fairly sure a wchar should be a code *unit* (just like char is a code unit), not a code *point*.
Aug 29 2010
Hello Nick,

> What I meant is that I'm fairly sure a wchar should be a code *unit*
> (just like char is a code unit), not a code *point*.

Given that I had to read three different pages twice each before I located a clear statement of which was which (talk about boring reading!), I'd be willing to guess that that was a slip-up.

-- 
... <IXOYE><
Aug 29 2010
On 08/29/2010 02:01 PM, Nick Sabalausky wrote:
> "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message
> news:i5e88k$2p0t$1 digitalmars.com...
>> wchar is a UTF-16 code point.
>
> Please tell me that's a typo. This isn't the era of UCS-2.

Yah, typo.

Andrei
Aug 29 2010
On Sun, 29 Aug 2010 14:17:44 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> On 08/29/2010 12:44 PM, Nick wrote:
>> 4. All string-processing calls (like sort, toupper, split and such) are
>> by default wrong on non-ASCII strings. Wrong without any error, warning
>> or anything.
>
> You'll be glad to hear that this assumption is false.
>
> 1. sort does not compile for char[] or wchar[]. The reason is that
> char[] and wchar[] do not obey the random-access requirements.

char[] x;
x.sort; // compiles with dmd 2.048

Is a deprecation planned?

-Steve
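[A sketch of the distinction being pointed out: the library sort Andrei meant is rejected for char[], while the old built-in .sort array property (since deprecated in later compilers) still compiled:]

```d
// Sketch: std.algorithm.sort rejects char[] because a narrow string is
// not a random-access range of its (dchar) elements; the built-in
// .sort property predated that check.
import std.algorithm : sort;

void main()
{
    char[] x = "hello".dup;
    static assert(!__traits(compiles, sort(x)));  // library sort: refused
    // x.sort;  // built-in property: still accepted by dmd 2.048
}
```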
Aug 30 2010