digitalmars.D - "There is no character" (str type)

reply Anders F Björklund <afb algonet.se> writes:
Like most other newcomers to Unicode,
I had some trouble with the UTF types...

Then I found this passage on the ICU home page:
 Often, a user thinks of a "character" as a complete unit in a
 language, like an 'Ä', while it may be represented with multiple
 Unicode code points including a base character and combining marks.
 (See the Unicode standard for details.) This often requires users to
 index and pass strings (UnicodeString or UChar *) with multiple code
 units or code points. It cannot be done with single-integer character
 types. Indexing of such "characters" is done with the BreakIterator
 class (in C: ubrk_ functions). [note: they talk about the ICU types]
Which explained to me that sometimes a single "character" is not enough *anyway*, and that I should be thinking in strings and code units...

And suddenly, me and all D's new char types are friends again! It makes perfect sense to have UTF-8/UTF-16/UTF-32 types in D. I just have to get out of the "uniform code unit size"-think.

I guess that of my own Latin-1 text, about 99% is ASCII*... Which sounds like a good reason to have UTF-8 as the default? (I also read that of Unicode text, 99% is U+0000 to U+FFFF.)

Another major advantage of UTF-8 over UTF-16 (besides being half the size for mostly-ASCII text) is that it is endian-agnostic. No more BE/LE! (And none of that pesky ASCII-breaking "BOM" crap either.)

So if my text is mostly ASCII / ISO Latin-1, I just use char[]. If most of my text is Unicode, I use wchar[]. And should I ever need to access a single Unicode code point, then I have dchar.

Now all I want is a string type ALIAS, and all things are spiffy. Can we have either a "str" or a "string" alias added for char[]? Pretty-please? (Hey, it worked for the "bool" alias for bit...)

"void main(str[] args)"

--anders

PS. Now I just have to remember to dimension my D strings as char[max * 2] (for Latin-1) or even char[max * 4]... And work out how to loop over an array of potential "surrogates".

PPS. * 10% of my own Swedish characters are non-ASCII (ÅÄÖ). But as a programmer, I usually write things in English. Except for my last name, which accounts for the 1%. :-)
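
[Editor's sketch] The difference between code units, code points and user-visible "characters" described above can be seen in a few lines of D. This is only an illustration, and it assumes a current D compiler, where `string`, `wstring` and `dstring` are built-in aliases for immutable `char`, `wchar` and `dchar` arrays (no such alias existed when this post was written). The helper `countCodePoints` is a hypothetical name made up for this example.

```d
import std.stdio : writeln;

// Hypothetical helper: count code points by letting foreach
// decode the UTF-8 code units into dchar values.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // The same nine-letter name stored with three code unit sizes.
    string  utf8  = "Björklund";   // char code units
    wstring utf16 = "Björklund"w;  // wchar code units
    dstring utf32 = "Björklund"d;  // dchar code units

    writeln(utf8.length);            // 10: 'ö' takes two UTF-8 code units
    writeln(utf16.length);           // 9
    writeln(utf32.length);           // 9
    writeln(countCodePoints(utf8));  // 9

    // 'Ä' written as a base letter plus a combining diaeresis (U+0308):
    // one user-visible "character", two code points, three UTF-8 code units.
    string decomposed = "A\u0308";
    writeln(decomposed.length);            // 3
    writeln(countCodePoints(decomposed));  // 2

    // A code point above U+FFFF needs a surrogate pair in UTF-16.
    wstring clef = "\U0001D11E"w;
    writeln(clef.length);                  // 2
}
```

Note that `.length` always counts code units, which is why a buffer for Latin-1 text held as UTF-8 may need up to two `char` per letter. Decoding with `foreach (dchar c; ...)` handles multi-unit sequences, including surrogate pairs when the source is a `wchar` array, but it still yields code points, not the combined "characters" that the ICU passage delegates to a break iterator.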
Oct 11 2004
parent reply "Jaap Geurts" <jaapsen hotmail.com> writes:
 And suddenly, me and all D's new char types are friends again!
 It makes perfect sense to have UTF-8/UTF-16/UTF-32 types in D.
 I just have to get out of the "uniform code unit size"-think.
Thank you for sharing that with us. Now it is suddenly clear to me too.

Jaap
Oct 11 2004
parent "Walter" <newshound digitalmars.com> writes:
"Jaap Geurts" <jaapsen hotmail.com> wrote in message
news:ckeb2r$1mcv$1 digitaldaemon.com...
 And suddenly, me and all D's new char types are friends again!
 It makes perfect sense to have UTF-8/UTF-16/UTF-32 types in D.
 I just have to get out of the "uniform code unit size"-think.
Thank you for sharing that with us. Now it is suddenly clear to me too.
It reminds me of when a good friend of mine patiently explained to me all about instructions, opcodes, registers, stacks, and program counters. It was all gibberish to me, until there was one moment when suddenly it all came together and made perfect sense.
Oct 11 2004