digitalmars.D.learn - Unicode problems?

Trass3r <mrmocool gmx.de> writes:

Wikipedia states that D still has some Unicode problems:
"Operations on Unicode strings are unintuitive (compiler accepts Unicode  
source code, standard library and foreach constructs operate on UTF-8, but  
string slicing and length property operate on bytes rather than  
characters)."

Is this information correct?

Feb 16 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

Trass3r wrote:
 Wikipedia states that D still has some Unicode problems:
 "Operations on Unicode strings are unintuitive (compiler accepts Unicode
 source code, standard library and foreach constructs operate on UTF-8,
 but string slicing and length property operate on bytes rather than
 characters)."
 
 Is this information correct?

They're not bugs, if that's what you mean.  It's just a side-effect of
how Unicode works.

http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

Long story short: they operate on bytes because operating on actual code
points can't be done efficiently [1].

  -- Daniel

[1] Given that strings are implemented as arrays with a fixed element
width, that you're not using UTF-32 (which no one does because it's too
big), and that we don't add some fancy caching stuff to char[] arrays
specifically, blah blah blah.

Feb 16 2009
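
To make the point above concrete, here is a minimal sketch, assuming a current D2 compiler with Phobos (the thread itself dates from 2009). The helper name countCodePoints is made up for illustration; std.utf.count is the Phobos function that does the same job. It shows that .length and slicing count UTF-8 code units in constant time, while counting code points has to walk the whole string.

```d
import std.stdio : writeln;
import std.utf : count;

/// Hypothetical helper: counts code points by letting foreach decode.
/// This walks the whole string, so it is O(n), unlike .length.
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)   // asking for dchar makes foreach decode UTF-8
        ++n;
    return n;
}

void main()
{
    // "h", U+00E9 (e with acute accent), "llo"; U+00E9 takes two bytes in UTF-8.
    string s = "h\u00E9llo";

    assert(s.length == 6);            // code units, O(1)
    assert(countCodePoints(s) == 5);  // code points, O(n)
    assert(count(s) == 5);            // same result via Phobos

    // Slice indices are code-unit offsets: the accented letter is s[1 .. 3].
    // A slice such as s[0 .. 2] would cut it in half and yield invalid UTF-8.
    assert(s[1 .. 3] == "\u00E9");

    writeln(s.length, " code units, ", countCodePoints(s), " code points");
}
```

The escape \u00E9 is used instead of the literal character so the example does not depend on how the source file is saved.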

Chris Nicholson-Sauls <ibisbasenji gmail.com> writes:

Daniel Keep wrote:
 
 Trass3r wrote:
 Wikipedia states that D still has some Unicode problems:
 "Operations on Unicode strings are unintuitive (compiler accepts Unicode
 source code, standard library and foreach constructs operate on UTF-8,
 but string slicing and length property operate on bytes rather than
 characters)."

 Is this information correct?

 
 They're not bugs, if that's what you mean.  It's just a side-effect of
 how Unicode works.
 
 http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
 
 Long story short: they operate on bytes because operating on actual code
 points can't be done efficiently [1].
 
   -- Daniel
 
 [1] Given that strings are implemented as arrays with a fixed element
 width, that you're not using UTF-32 (which no one does because it's too
 big), and that we don't add some fancy caching stuff to char[] arrays
 specifically, blah blah blah.

I use UTF-32, at least occasionally.  In cases where I specifically 
expect/encourage multilingual support/use, it can simplify matters 
greatly, where those otherwise inefficient operations become common.

-- Chris Nicholson-Sauls

Feb 16 2009
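
As a rough illustration of the UTF-32 trade-off described above, the sketch below assumes a current D2 compiler with Phobos; the helper name toUtf32 is hypothetical. With a dstring, every element is one code point, so .length, indexing, and slicing all work per code point, at the cost of four bytes per element.

```d
import std.conv : to;
import std.stdio : writeln;

/// Hypothetical helper: converts UTF-8 text to UTF-32.
dstring toUtf32(string s)
{
    return s.to!dstring;
}

void main()
{
    // "na", U+00EF, "ve ", then three CJK characters (U+65E5 U+672C U+8A9E).
    string  utf8  = "na\u00EFve \u65E5\u672C\u8A9E";
    dstring utf32 = toUtf32(utf8);

    assert(utf8.length == 16);   // UTF-8 code units (bytes)
    assert(utf32.length == 9);   // one element per code point

    // Indexing and slicing by code point are O(1) on a dstring.
    assert(utf32[2] == '\u00EF');
    assert(utf32[6 .. 9] == "\u65E5\u672C\u8A9E"d);

    // The price: 4 bytes per code point, even for ASCII.
    assert(utf32.length * dchar.sizeof == 36);

    writeln(utf8.length, " bytes as UTF-8, ",
            utf32.length * dchar.sizeof, " bytes as UTF-32");
}
```

Note that a code point is still not necessarily what a reader perceives as one character (combining marks are separate code points), so UTF-32 simplifies matters without removing every such question.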

Lutger <lutger.blijdestijn gmail.com> writes:

Trass3r wrote:

 Wikipedia states that D still has some Unicode problems:
 "Operations on Unicode strings are unintuitive (compiler accepts Unicode  
 source code, standard library and foreach constructs operate on UTF-8, but  
 string slicing and length property operate on bytes rather than  
 characters)."
 
 Is this information correct?

I think it's a matter of perspective whether to call that unintuitive, but otherwise I
can't find anything incorrect in it, except maybe that "operate on bytes"
should be "operate on code units". It doesn't mean that D has Unicode
problems, though.

Feb 16 2009
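
The "code units" versus "bytes" distinction can be sketched as follows, assuming a current D2 compiler; byteSize is a hypothetical helper, not a library function. For char strings the two numbers coincide, but for wstring and dstring .length still counts code units, which are two and four bytes wide respectively.

```d
import std.stdio : writefln;

/// Hypothetical helper: storage size in bytes
/// (number of code units times the size of one code unit).
size_t byteSize(T)(const(T)[] s)
{
    return s.length * T.sizeof;
}

void main()
{
    // U+1D11E (musical symbol G clef), a code point outside the BMP.
    string  s8  = "\U0001D11E";
    wstring s16 = "\U0001D11E"w;
    dstring s32 = "\U0001D11E"d;

    assert(s8.length  == 4);   // four UTF-8 code units
    assert(s16.length == 2);   // a UTF-16 surrogate pair
    assert(s32.length == 1);   // one UTF-32 code unit

    // All three encodings happen to need 4 bytes for this code point,
    // so .length equals the byte count only for the char-based string.
    assert(byteSize(s8)  == 4);
    assert(byteSize(s16) == 4);
    assert(byteSize(s32) == 4);

    writefln("UTF-8: %s units, UTF-16: %s units, UTF-32: %s unit",
             s8.length, s16.length, s32.length);
}
```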
