
digitalmars.D - get the facts: string

Thomas Kuehne <thomas-dloop kuehne.cn> writes:

This post tries to sum up some of the facts about Unicode, encodings
and strings.

0) The concept of a "character" is language dependent.
(-> glyph, glyph cluster, ligature ...)

1) Every Unicode code point can be encoded in each of the 5 common UTFs.
(-> UTF-8, UTF-16-BE, UTF-16-LE, UTF-32-BE, UTF-32-LE)

2) Not every character can be represented in the various ANSI and OEM
character sets.

3) The character set of a terminal/shell can be changed on the fly.

4) A code point isn't always a complete character.
(Yes, a UTF-32 fragment isn't always a character.)

5) Some characters can be represented by different code point sequences
and thus different sequences of code point fragments.

6) Slicing gets difficult if strings are NUL-terminated, as in C.

7) Slicing gets difficult if strings begin with a BOM.

8) Java's String concept hides a few transcodings and requires either
a VM or opAssign.

9) Not every system uses the same fragment size, let alone the same encoding.

10) Data has to be exchanged between different systems.

11) String processing is usually not a performance problem unless
the application is dedicated to text processing or a lot of transcodings
occur.
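
Facts 4 and 5 can be checked in a few lines. The following is only a minimal sketch in Python, assuming nothing beyond the standard unicodedata module; the helper names code_points and same_text are made up for this illustration. It spells "é" in two ways and shows that the strings differ code point by code point yet compare equal once both are normalized to NFC.

```python
import unicodedata

# Two spellings of the same visible character "é".
precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT


def code_points(s):
    """Return the code points of s as U+XXXX labels (illustrative helper)."""
    return [f"U+{ord(c):04X}" for c in s]


def same_text(a, b):
    """Compare two strings after normalizing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(code_points(precomposed))            # ['U+00E9']
print(code_points(decomposed))             # ['U+0065', 'U+0301']
print(precomposed == decomposed)           # False: different code point sequences
print(same_text(precomposed, decomposed))  # True: same character after NFC
```

Note that NFC only covers canonical equivalence; it does not by itself tell you where one user-perceived character ends and the next begins.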

further reading: http://www.unicode.org

Have a look at ICU to see some Unicode string processing <g>
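
Facts 1, 2, 7 and 9 can also be tried out without ICU. This is a rough sketch in Python using only the standard codecs machinery; the names UTFS, encoded_sizes and fits are invented for the example, and cp1252 and cp437 merely stand in for "an ANSI" and "an OEM" character set.

```python
import codecs

# The five common UTFs from fact 1, by their Python codec names.
UTFS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def encoded_sizes(text):
    """Map each UTF to the number of bytes text occupies in it."""
    return {name: len(text.encode(name)) for name in UTFS}


def fits(text, charset):
    """Return True if every character of text exists in the legacy charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


sample = "a\u20ac\U0001d11e"  # "a", EURO SIGN, MUSICAL SYMBOL G CLEF

# Facts 1 and 9: every UTF can hold the text, but sizes and layouts differ.
print(encoded_sizes(sample))

# Fact 2: the euro sign exists in cp1252 but not in cp437.
print(fits("\u20ac", "cp1252"))  # True
print(fits("\u20ac", "cp437"))   # False

# Fact 7: a naive slice of BOM-prefixed data returns the BOM, not the text.
with_bom = codecs.BOM_UTF8 + "abc".encode("utf-8")
print(with_bom[:3] == codecs.BOM_UTF8)  # True
print(with_bom.decode("utf-8-sig"))     # abc (the utf-8-sig codec drops the BOM)
```

The point of the size table is only that byte offsets computed for one encoding are meaningless in another, so slices have to be taken after decoding.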

Thomas


Nov 26 2005
John Reimer <terminal.node gmail.com> writes:
Thanks, Thomas. Nice summary. I think I may actually get to understand some of this finally. :)
Nov 26 2005
Derek Parnell <derek psych.ward> writes:
On Sat, 26 Nov 2005 21:39:03 +0000 (UTC), Thomas Kuehne wrote:

 This post tries to sum up some of the facts about Unicode, encodings
 and strings.

Thanks Thomas, this is very neat.

Now what is Walter going to do about it with respect to D? I suspect nothing. It is up to each coder to decide how to handle Unicode when using D, so there will be a myriad of solutions to the issues, and some will be better than others. The C/C++ world prevails. Such a pity.

-- 
Derek Parnell
Melbourne, Australia
27/11/2005 9:26:14 AM
Nov 26 2005