digitalmars.D - "There is no character" (str type)
- Anders F Björklund (29/37) Oct 11 2004 Like most other newcomers to Unicode,
- Jaap Geurts (2/5) Oct 11 2004 Thank you for sharing that with us. Now it is suddenly clear to me too.
- Walter (6/11) Oct 11 2004 It reminds me of when a good friend of mine patiently explained to me al...
Like most other newcomers to Unicode, I had some trouble with the UTF types... Then I found this passage on the ICU home page:

> Often, a user thinks of a "character" as a complete unit in a language, like an 'Ä', while it may be represented with multiple Unicode code points including a base character and combining marks. (See the Unicode standard for details.) This often requires users to index and pass strings (UnicodeString or UChar *) with multiple code units or code points. It cannot be done with single-integer character types. Indexing of such "characters" is done with the BreakIterator class (in C: ubrk_ functions).

[note: they talk about the ICU types]

Which explained to me that sometimes a single "character" is not enough *anyway*, and that I should be thinking in strings and code units...

And suddenly, me and all D's new char types are friends again! It makes perfect sense to have UTF-8/UTF-16/UTF-32 types in D. I just have to get out of the "uniform code unit size"-think.

I guess that of my own Latin-1 text, about 99% is ASCII*... Which sounds like a good reason to have UTF-8 the default ? (I also read that of Unicode text, 99% is U+0000 to U+FFFF)

Another major advantage of UTF-8 (besides half the size) over UTF-16 is that it is endian-agnostic. No more BE/LE! (and none of that pesky ASCII-breaking "BOM" crap either)

So if my text is mostly ASCII / ISO Latin-1, I just use char[]. If most of my text is Unicode, I use wchar[]. And should I ever need to access a single Unicode code point, then I have dchar.

Now all I want is a string type ALIAS, and all things are spiffy. Can we have either a "str" or "string" alias added, for char[] ? Pretty-please ? (Hey, it worked for the "bool" alias for bit...)

"void main(str[] args)"

--anders

PS. Now I just have to remember to dimension my D strings as char[max * 2] (for Latin-1) or even char[max * 4]... And how to loop over an array of potential "surrogates".

PPS. * 10% of my own Swedish characters are non-ASCII (ÅÄÖ). But as a programmer, I usually write things in English. Except for my last name, which accounts for the 1%. :-)
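The difference between code units, code points and user-perceived "characters" described above can be sketched in D roughly as follows. This is only an illustration, not part of the original post: it assumes a current D compiler, where `string`, `wstring` and `dstring` are built-in aliases for immutable `char[]`, `wchar[]` and `dchar[]` (with a 2004 compiler you would spell out the array types), and the sample strings are made up for the example.

```d
import std.stdio;

void main()
{
    // The same text, with a precomposed 'ö' (U+00F6), in the three encodings.
    string  s8  = "Bj\u00F6rklund";   // UTF-8  code units (char)
    wstring s16 = "Bj\u00F6rklund"w;  // UTF-16 code units (wchar)
    dstring s32 = "Bj\u00F6rklund"d;  // UTF-32 code units (dchar)

    // .length counts code units, not "characters".
    writeln(s8.length);   // 10: 'ö' needs two UTF-8 code units
    writeln(s16.length);  // 9
    writeln(s32.length);  // 9

    // A foreach with a dchar loop variable decodes the UTF-8 on the fly,
    // so it visits code points rather than code units.
    size_t codePoints = 0;
    foreach (dchar c; s8)
        ++codePoints;
    writeln(codePoints);  // 9

    // 'Ä' written as a base letter plus a combining diaeresis (U+0308):
    // one "character" to the reader, but two code points.
    string decomposed = "A\u0308";
    size_t n = 0;
    foreach (dchar c; decomposed)
        ++n;
    writeln(decomposed.length);  // 3 code units
    writeln(n);                  // 2 code points
}
```

The last case is the one the ICU passage is about: even a dchar cannot hold that 'Ä', so anything that needs to index whole "characters" has to work on strings, for example with a break iterator.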
Oct 11 2004
> And suddenly, me and all D's new char types are friends again! It makes perfect sense to have UTF-8/UTF-16/UTF-32 types in D. I just have to get out of the "uniform code unit size"-think.

Thank you for sharing that with us. Now it is suddenly clear to me too.

Jaap
Oct 11 2004
"Jaap Geurts" <jaapsen hotmail.com> wrote in message news:ckeb2r$1mcv$1 digitaldaemon.com...It reminds me of when a good friend of mine patiently explained to me all about instructions, opcodes, registers, stacks, and program counters. It was all gibberish to me, until there was one moment when suddenly it all came together and made perfect sense.And suddenly, me and all D's new char types are friends again! It makes perfect sense to have UTF-8/UTF-16/UTF-32 types in D. I just have to get out of the "uniform code unit size"-think.Thank you for sharing that with us. Now is is suddenly clear to me too.
Oct 11 2004