digitalmars.D - get the facts: string
- Thomas Kuehne (32/32) Nov 26 2005 -----BEGIN PGP SIGNED MESSAGE-----
- John Reimer (3/54) Nov 26 2005 Thanks, Thomas. Nice summary. I think I may actually get to understand
- Derek Parnell (11/16) Nov 26 2005 Thanks Thomas,
This post tries to sum up some of the facts about Unicode, encodings and strings.

0) The concept of a "character" is language dependent. (-> glyph, glyph cluster, ligature ...)
1) Every Unicode code point can be encoded in each of the 5 common UTFs. (-> UTF-8, UTF-16-BE, UTF-16-LE, UTF-32-BE, UTF-32-LE)
2) Not every character can be represented in the different ANSI and OEM character sets.
3) The character set of a terminal/shell can be changed on the fly.
4) A code point isn't always a complete character. (Yes, a UTF-32 fragment isn't always a character.)
5) Some characters can be represented by different code point sequences and thus different sequences of code point fragments.
6) Slicing gets difficult if strings are null-terminated like in C.
7) Slicing gets difficult if strings begin with a BOM.
8) Java's String concept hides a few transcodings and requires either a VM or opAssign.
9) Not every system uses the same fragment size, let alone the same encoding.
10) Data has to be exchanged between different systems.
11) String processing is usually not a performance problem unless the application is dedicated to text processing or a lot of transcodings occur.

Further reading: http://www.unicode.org
Have a look at ICU to see some Unicode string processing <g>

Thomas
Nov 26 2005
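To make points 1, 4, 5 and 9 of the list above a bit more concrete, here is a minimal sketch. It is written in Python rather than D, purely because the standard library is enough to show the effects; the helper `code_units` and the constant names are made up for this illustration and are not part of any library.

```python
import unicodedata

# Point 1: one code point (U+20AC EURO SIGN) in each of the five common UTFs.
EURO = "\u20ac"
ENCODED = {
    name: EURO.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}
assert ENCODED["utf-8"] == b"\xe2\x82\xac"
assert ENCODED["utf-16-be"] == b"\x20\xac"
assert ENCODED["utf-16-le"] == b"\xac\x20"
assert ENCODED["utf-32-be"] == b"\x00\x00\x20\xac"
assert ENCODED["utf-32-le"] == b"\xac\x20\x00\x00"

# Point 9: the fragment (code unit) size differs between the UTFs.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}  # bytes per code unit


def code_units(text, encoding):
    """Hypothetical helper: number of code units `text` needs in `encoding`."""
    return len(text.encode(encoding)) // UNIT_SIZE[encoding]


CLEF = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane
assert code_units(CLEF, "utf-8") == 4
assert code_units(CLEF, "utf-16-le") == 2  # a surrogate pair
assert code_units(CLEF, "utf-32-le") == 1

# Points 4 and 5: the same visible character as two different code point sequences.
PRECOMPOSED = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
assert PRECOMPOSED != DECOMPOSED
assert len(PRECOMPOSED) == 1 and len(DECOMPOSED) == 2  # counted in code points
assert unicodedata.normalize("NFC", DECOMPOSED) == PRECOMPOSED
assert unicodedata.normalize("NFD", PRECOMPOSED) == DECOMPOSED

# Slicing by code point can split a character: the accent is lost here.
assert DECOMPOSED[:1] == "e"
```

The last assertion is the practical consequence of point 4: even with one code point per index, a slice can cut a character in half, so comparing or slicing text reliably needs normalization or grapheme-aware handling on top of the encoding.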
Thomas Kuehne wrote:
This post tries to sum up some of the facts about Unicode, encodings and strings. [...]

Thanks, Thomas. Nice summary. I think I may actually get to understand some of this finally. :)
Nov 26 2005
On Sat, 26 Nov 2005 21:39:03 +0000 (UTC), Thomas Kuehne wrote:
This post tries to sum up some of the facts about Unicode, encodings and strings.

Thanks Thomas,
this is very neat. Now, what is Walter going to do about it with respect to D? I suspect nothing. It is up to each coder to decide how to handle Unicode when using D, so there will be a myriad of solutions to the issues, and some will be better than others. The C/C++ world prevails. Such a pity.

--
Derek Parnell
Melbourne, Australia
27/11/2005 9:26:14 AM
Nov 26 2005