digitalmars.D.learn - Unicode problems?
- Trass3r (6/6) Feb 16 2009 Wikipedia states that D still has some Unicode problems:
- Daniel Keep (11/18) Feb 16 2009 They're not bugs, if that's what you mean. It's just a side-effect of
- Chris Nicholson-Sauls (5/29) Feb 16 2009 I use UTF-32, at least occasionally. In cases where I specifically
- Lutger (6/13) Feb 16 2009 I think it's a point of view thing to call that unintuitive, but otherwi...
Wikipedia states that D still has some Unicode problems: "Operations on Unicode strings are unintuitive (compiler accepts Unicode source code, standard library and foreach constructs operate on UTF-8, but string slicing and length property operate on bytes rather than characters)." Is this information correct?
Feb 16 2009
Trass3r wrote:
> Wikipedia states that D still has some Unicode problems: "Operations on Unicode strings are unintuitive (compiler accepts Unicode source code, standard library and foreach constructs operate on UTF-8, but string slicing and length property operate on bytes rather than characters)." Is this information correct?

They're not bugs, if that's what you mean. It's just a side-effect of how Unicode works.

http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

Long story short: they operate on bytes because operating on actual code points can't be done efficiently [1].

-- Daniel

[1] Given that strings are implemented as arrays with a given, non-changing width and that you're not using UTF-32 which no one does because it's too big and that we don't add some fancy caching stuff to char[] arrays specifically, blah blah blah.
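
To illustrate the behaviour discussed here, a minimal sketch follows. It assumes a current D2 compiler with Phobos (the thread predates it, so details may differ from what was available in 2009), and countCodePoints is a hypothetical helper written for this example, not a library function.

```d
import std.stdio : writeln;

/// Hypothetical helper: counts code points by letting foreach
/// decode the UTF-8 string into dchar values, one per iteration.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s) // decoding foreach: O(n) walk over the string
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo"; // "héllo"; é occupies two UTF-8 code units

    // .length counts UTF-8 code units (bytes for char arrays), in O(1).
    assert(s.length == 6);

    // Counting code points requires decoding the whole string.
    assert(countCodePoints(s) == 5);

    // Slice indices are code-unit offsets too: é sits at offsets 1 and 2.
    assert(s[0 .. 1] == "h");
    assert(s[1 .. 3] == "\u00E9");

    writeln(s.length, " code units, ", countCodePoints(s), " code points");
}
```

The point of the sketch is the cost difference: length and slicing stay constant-time array operations, while anything expressed in code points has to walk the string.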
Feb 16 2009
Daniel Keep wrote:
> Trass3r wrote:
>> Wikipedia states that D still has some Unicode problems: "Operations on Unicode strings are unintuitive (compiler accepts Unicode source code, standard library and foreach constructs operate on UTF-8, but string slicing and length property operate on bytes rather than characters)." Is this information correct?
>
> They're not bugs, if that's what you mean. It's just a side-effect of how Unicode works.
>
> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
>
> Long story short: they operate on bytes because operating on actual code points can't be done efficiently [1].
>
> -- Daniel
>
> [1] Given that strings are implemented as arrays with a given, non-changing width and that you're not using UTF-32 which no one does because it's too big and that we don't add some fancy caching stuff to char[] arrays specifically, blah blah blah.

I use UTF-32, at least occasionally. In cases where I specifically expect/encourage multilingual support/use, it can simplify matters greatly, where those otherwise inefficient operations become common.

-- Chris Nicholson-Sauls
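
A rough sketch of the UTF-32 approach, assuming D2 and Phobos' std.conv.to; the toUtf32 wrapper is a hypothetical name used only for this example. With dstring, each array element is one code point, so length, indexing and slicing all work in code points, at the price of four bytes per element.

```d
import std.conv : to;
import std.stdio : writeln;

/// Hypothetical wrapper: transcodes a UTF-8 string to UTF-32.
dstring toUtf32(string s)
{
    return to!dstring(s);
}

void main()
{
    string utf8 = "h\u00E9llo";      // "héllo"
    dstring utf32 = toUtf32(utf8);

    // One array element per code point.
    assert(utf32.length == 5);
    assert(utf32[1] == '\u00E9');
    assert(utf32[1 .. 3] == "\u00E9l"d);

    // The trade-off: 4 bytes per code point instead of 1 to 4.
    assert(utf8.length * char.sizeof == 6);
    assert(utf32.length * dchar.sizeof == 20);

    writeln(utf32.length, " code points in ", utf32.length * dchar.sizeof, " bytes");
}
```

Note that a code point is still not necessarily what a reader perceives as one character (combining marks are separate code points), so UTF-32 simplifies indexing without removing every subtlety.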
Feb 16 2009
Trass3r wrote:
> Wikipedia states that D still has some Unicode problems: "Operations on Unicode strings are unintuitive (compiler accepts Unicode source code, standard library and foreach constructs operate on UTF-8, but string slicing and length property operate on bytes rather than characters)." Is this information correct?

I think it's a point of view thing to call that unintuitive, but otherwise I can't find anything incorrect in it. Except maybe that "..operate on bytes" should be "..operate on code units"? It doesn't mean that D has Unicode problems though.
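
The bytes versus code units distinction can be sketched as follows, assuming D2; byteSize is a hypothetical helper defined only for this example. For char arrays a code unit happens to be one byte, but for wchar and dchar arrays the length property counts 2-byte and 4-byte units respectively.

```d
import std.stdio : writeln;

/// Hypothetical helper: storage size in bytes of a string's code units.
size_t byteSize(T)(const(T)[] s)
{
    return s.length * T.sizeof;
}

void main()
{
    // U+00E9 (é) in the three encodings D supports natively.
    string  s8  = "\u00E9";
    wstring s16 = "\u00E9"w;
    dstring s32 = "\u00E9"d;

    assert(s8.length  == 2 && byteSize(s8)  == 2); // 2 UTF-8 code units
    assert(s16.length == 1 && byteSize(s16) == 2); // 1 UTF-16 code unit
    assert(s32.length == 1 && byteSize(s32) == 4); // 1 UTF-32 code unit

    // U+1F600 lies outside the Basic Multilingual Plane.
    assert("\U0001F600".length  == 4); // 4 UTF-8 code units
    assert("\U0001F600"w.length == 2); // surrogate pair in UTF-16
    assert("\U0001F600"d.length == 1); // single UTF-32 code unit

    writeln(s8.length, " ", s16.length, " ", s32.length);
}
```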
Feb 16 2009