digitalmars.D - "There is no character" (str type)

Anders F Björklund <afb algonet.se> writes:

Like most other newcomers to Unicode,
I had some trouble with the UTF types...

Then I found this passage on the ICU home page:
 Often, a user thinks of a "character" as a complete unit in a
 language, like an 'Ä', while it may be represented with multiple
 Unicode code points including a base character and combining marks.
 (See the Unicode standard for details.) This often requires users to
 index and pass strings (UnicodeString or UChar *) with multiple code
 units or code points. It cannot be done with single-integer character
 types. Indexing of such "characters" is done with the BreakIterator
 class (in C: ubrk_ functions). [note: they talk about the ICU types]

Which explained to me that sometimes a single
"character" is not enough *anyway*, and that I
should be thinking in strings and code units...
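
To make the quoted point concrete, here is a minimal sketch in present-day D (not the 2004 compiler this thread was written against). It counts the three things that get loosely called "characters": code units, code points, and user-perceived characters. The `Counts` struct and `countAll` function are names invented for this illustration; `walkLength` and `byGrapheme` come from Phobos (`std.range` and `std.uni`).

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Three different answers to "how long is this string?"
struct Counts
{
    size_t codeUnits;   // UTF-8 bytes stored in the array
    size_t codePoints;  // Unicode code points after decoding
    size_t graphemes;   // user-perceived "characters"
}

/// Hypothetical helper: measures a UTF-8 string all three ways.
Counts countAll(string s)
{
    return Counts(s.length, s.walkLength, s.byGrapheme.walkLength);
}

void main()
{
    import std.stdio : writeln;

    // 'A' followed by U+0308 COMBINING DIAERESIS displays as one "Ä":
    // 3 code units, 2 code points, 1 grapheme.
    writeln(countAll("A\u0308"));

    // The precomposed U+00C4 looks the same on screen:
    // 2 code units, 1 code point, 1 grapheme.
    writeln(countAll("\u00C4"));
}
```

Neither a char nor a dchar can hold the first "Ä", which is the sense in which there is no single-integer character type.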

And suddenly, me and all D's new char types are friends again!
It makes perfect sense to have UTF-8/UTF-16/UTF-32 types in D.
I just have to get out of the "uniform code unit size"-think.


I guess that of my own Latin-1 text, about 99% is ASCII*...
Which sounds like a good reason to have UTF-8 the default ?
(I also read that of Unicode text, 99% is U+0000 to U+FFFF)

Another major advantage of UTF-8 (besides half the size)
over UTF-16 is that it is endian-agnostic. No more BE/LE!
(and none of that pesky ASCII-breaking "BOM" crap either)
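
A small sketch of the byte-order point, in present-day D. The helpers `toBE`, `toLE` and `utf8Bytes` are made up for this example; only `representation` (from `std.string`) is a library call. It serializes the same code point, U+00F6, as UTF-16 in both byte orders and as UTF-8.

```d
import std.string : representation;

/// Hypothetical helper: one UTF-16 code unit as big-endian bytes.
ubyte[2] toBE(wchar u)
{
    return [cast(ubyte)(u >> 8), cast(ubyte)(u & 0xFF)];
}

/// Hypothetical helper: the same code unit as little-endian bytes.
ubyte[2] toLE(wchar u)
{
    return [cast(ubyte)(u & 0xFF), cast(ubyte)(u >> 8)];
}

/// The raw bytes of a UTF-8 string; there is only one possible order.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

void main()
{
    import std.stdio : writeln;

    wchar u = '\u00F6'; // 'ö'
    auto be = toBE(u);  // bytes 00 F6
    auto le = toLE(u);  // bytes F6 00
    writeln(be, " ", le);

    // UTF-8 is defined as a byte sequence, so 'ö' is always C3 B6.
    writeln(utf8Bytes("\u00F6"));
}
```

A reader of UTF-16 has to be told (or has to detect via a BOM) which of the two layouts it is looking at; a reader of UTF-8 does not.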

So if my text is mostly ascii / iso-latin-1, I just use char[].
If most of my text is unicode, I use wchar[]. And should I ever
need to access a single Unicode code point, then I have dchar.
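
As a rough illustration of that trade-off, the sketch below (present-day D; the `units8`, `units16` and `units32` helpers are invented names) counts how many code units the same text needs as char[], wchar[] and dchar[]. `toUTF16` and `toUTF32` are from `std.utf`.

```d
import std.utf : toUTF16, toUTF32;

/// Code units needed when the text is stored as UTF-8 (char[]).
size_t units8(string s)  { return s.length; }

/// Code units needed when the text is stored as UTF-16 (wchar[]).
size_t units16(string s) { return s.toUTF16.length; }

/// Code units needed when the text is stored as UTF-32 (dchar[]).
size_t units32(string s) { return s.toUTF32.length; }

void main()
{
    import std.stdio : writefln;

    // ASCII, Latin-1 'ö', the euro sign, and a code point above U+FFFF.
    foreach (s; ["abc", "\u00F6", "\u20AC", "\U0001D11E"])
        writefln("%s: %s / %s / %s", s, units8(s), units16(s), units32(s));
}
```

Multiply the counts by 1, 2 and 4 bytes respectively to compare storage. Only the dchar[] form has exactly one code unit per code point in every case.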


Now all I want is a string type ALIAS, and all things are spiffy.
Can we have either a "str" or a "string" alias added, for char[] ?
Pretty-please ? (Hey, it worked for the "bool" alias for bit...)

"void main(str[] args)"
--anders
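
For readers on a current compiler: the D runtime's object module does declare `string` as an alias for immutable(char)[], so only the shorter `str` spelling needs to be written by hand. A minimal sketch follows, in which `str` and `joinArgs` are names made up for this example.

```d
import std.array : join;

// Hypothetical short alias, as asked for in the post. `string` itself is
// already an alias for immutable(char)[] in present-day D.
alias str = string;

/// Hypothetical helper: joins the arguments with single spaces.
str joinArgs(str[] args)
{
    return args.join(" ");
}

// An alias is transparent to the compiler, so this is an ordinary main.
void main(str[] args)
{
    import std.stdio : writeln;

    writeln(joinArgs(args));
}
```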


PS.  Now I just have to remember to dimension my D strings
      as char[max * 2] (for Latin-1) or even char[max * 4]...
      And how to loop over an array of potential "surrogates".
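
A sketch of both points in the PS, in present-day D with invented helper names (`latin1ToUtf8`, `countCodePoints`). It converts Latin-1 bytes into a buffer sized for the two-bytes-per-character worst case, then walks the result one code point at a time. Strictly speaking, UTF-8 has multi-byte sequences rather than surrogates (those belong to UTF-16), but the looping problem is the same.

```d
/// Hypothetical helper: converts Latin-1 bytes to UTF-8.
/// Each Latin-1 byte needs at most two UTF-8 code units.
char[] latin1ToUtf8(const(ubyte)[] latin1)
{
    auto buf = new char[latin1.length * 2]; // worst-case size
    size_t n = 0;
    foreach (b; latin1)
    {
        if (b < 0x80)
        {
            buf[n++] = cast(char) b;                  // ASCII: one code unit
        }
        else
        {
            buf[n++] = cast(char)(0xC0 | (b >> 6));   // lead byte
            buf[n++] = cast(char)(0x80 | (b & 0x3F)); // continuation byte
        }
    }
    return buf[0 .. n]; // trim to the part actually used
}

/// Hypothetical helper: counts code points in a UTF-8 array.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    // Asking foreach for a dchar makes it decode multi-byte sequences.
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    import std.stdio : writeln;

    ubyte[] latin1 = [0x42, 0x6A, 0xF6, 0x72, 0x6B]; // "Björk" in Latin-1
    auto utf8 = latin1ToUtf8(latin1);
    writeln(utf8, ": ", utf8.length, " code units, ",
            countCodePoints(utf8), " code points");
}
```

The char[max * 4] case mentioned above covers arbitrary code points, since a single code point takes at most four UTF-8 code units.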

PPS. * 10% of my own Swedish characters are non-ASCII. (åäö)
      But as a programmer, I usually write things in English.
      Except for my last name, which accounts for the 1%. :-)

Oct 11 2004

"Jaap Geurts" <jaapsen hotmail.com> writes:

 And suddenly, me and all D's new char types are friends again!
 It makes perfect sense to have UTF-8/UTF-16/UTF-32 types in D.
 I just have to get out of the "uniform code unit size"-think.

Thank you for sharing that with us. Now it is suddenly clear to me too.

Jaap

Oct 11 2004

"Walter" <newshound digitalmars.com> writes:

"Jaap Geurts" <jaapsen hotmail.com> wrote in message
news:ckeb2r$1mcv$1 digitaldaemon.com...
 And suddenly, me and all D's new char types are friends again!
 It makes perfect sense to have UTF-8/UTF-16/UTF-32 types in D.
 I just have to get out of the "uniform code unit size"-think.

 Thank you for sharing that with us. Now it is suddenly clear to me too.

It reminds me of when a good friend of mine patiently explained to me all
about instructions, opcodes, registers, stacks, and program counters. It was
all gibberish to me, until there was one moment when suddenly it all came
together and made perfect sense.

Oct 11 2004
