digitalmars.D - UTF-8 char[] consistency
- Jaap Geurts (24/24) Sep 25 2004 Hi all,
- Ben Hinkle (13/62) Sep 25 2004 which string functions specifically? What do you mean by "fail"?
- Jaap Geurts (7/23) Sep 26 2004 I tried the wchar[] and dchar[] and that works just fine. But because I ...
- David L. Davis (8/17) Sep 26 2004 Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and yo...
- Jaap Geurts (7/11) Sep 27 2004 can
- Jaap Geurts (17/39) Sep 27 2004 David,
- Arcane Jill (29/35) Sep 28 2004 Sorry to leap into the middle of your conversation with David, but that ...
- David L. Davis (21/31) Sep 28 2004 Jaap: Currently for anything unicode based, I've been waiting on work th...
- David L. Davis (6/6) Sep 29 2004 Everyone: Oops!!! Sorry about the repost everyone. I had a bad storm in ...
- Arcane Jill (48/52) Sep 29 2004 Unlike UTF-8, UTF-16 is very cunning - and this is basically because Uni...
- Arcane Jill (6/9) Sep 29 2004 Erratum.
- David L. Davis (6/6) Sep 29 2004 Arcane Jill: Thxs as always for the clear insight! I now have a better
- Ben Hinkle (16/29) Sep 26 2004 to
- Jaap Geurts (12/20) Sep 27 2004 have.
- Arcane Jill (52/71) Sep 26 2004 Cool.
- Benjamin Herr (19/21) Sep 26 2004 So can we not just drop char and char[]s and define some standard string...
- Thomas Kuehne (13/19) Sep 26 2004 I guess you didn't (yet) dive into Unicode?
- Arcane Jill (34/44) Sep 27 2004 True enough. The best definition of "character" I have ever encountered ...
- Benjamin Herr (11/31) Sep 27 2004 I only theoretically dealed with Unicode (so, no). I had not idea I am
- Thomas Kuehne (13/23) Sep 27 2004 UTF-8/16/32 only deal with one codepoint at a time(except for some
- Jaap Geurts (28/47) Sep 27 2004 I see. If that is the way it is. Than I'll use functions operating on
- Arcane Jill (28/41) Sep 27 2004 Most Unicode platforms use UTF-16, including the ICU library. It follows
- Thomas Kuehne (6/15) Sep 27 2004 Guess you missed the extended CJK part. There are names of living person...
- Arcane Jill (30/45) Sep 28 2004 I will freely admit that I don't speak Chinese and don't know the intric...
- Benjamin Herr (6/9) Sep 28 2004 I guess I really do not get it. I thought I was just told that
- Sean Kelly (10/18) Sep 28 2004 I think what Jill was saying is that in most cases, UTF-16 will represen...
- Arcane Jill (21/31) Sep 29 2004 Yes, exactly. And to some extent, the same is also true of UTF-8 if your
- Thomas Kuehne (10/17) Sep 27 2004 Potentially codepoints are 64 bit. The highes currently assigned codepoi...
- Arcane Jill (9/10) Sep 29 2004 First I've heard of it. Do you have a source for this information?
- Arcane Jill (32/40) Sep 28 2004 Head out to www.unicode.org and check out their various FAQs. They do a ...
- J C Calvarese (6/63) Sep 29 2004 Cool. I added this to a wiki page:
- Ben Hinkle (22/37) Sep 27 2004 not my
- Arcane Jill (6/42) Sep 29 2004 I posted this yesterday:
Hi all,

I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set. I have some trouble with the way D handles the char[].length property. If I make a string as follows

    char[] s = "câu này có những chữ cái tiếng việt";

then the length property (s.length) reports the number of bytes, not the number of characters, as I would expect to happen. Returning the number of bytes is what I would expect for a byte[]. Therefore I still need to use a strlen function to determine the correct string length. One of the implications is that most *string* handling functions in the phobos library depend on the length property and thus fail.

There are some solutions to this, without modifying the language:
1. use special functions to do the work.
2. make a string class.
3. convert everything internally to UTF-16, convert it back to UTF-8 before output.

1. The special functions would work but are troublesome because the phobos functions cannot be used (i.e. they have to be rewritten).
2. The string class doesn't work well because the opAssign function cannot be overridden, and thus the following cannot be done: String s = "hello"; I know that it can be done slightly differently (String s = new String("hello");) but I'd like it to be as seamless as possible. However the phobos functions still don't work and have to be included in the class. Wasn't Walter against a String class?? ;)
3. Converting everything is not very efficient, and requires non-transparent extra work.

I'd suggest the following:
1. The char[] needs to be treated by the D compiler as a string array, not as a byte array, or
2. Implement a special String datatype (has been discussed earlier and Walter is against it.)

Also, a lot of phobos functions are missing for wide (wchar) and double-wide (dchar) character operations. E.g. wchar[] ljustify(wchar[], int width); is not available, and many more are not available for the larger char types.

Regards,
Jaap
---
D programming from Vietnam
Sep 25 2004
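To put numbers on the byte-versus-character distinction Jaap describes, here is a minimal D sketch; the byte counts assume the literal is stored in precomposed (NFC) form, and toUTF32 is assumed to behave as in the std.utf of the time:

    import std.utf;

    void main()
    {
        char[] s = "chữ";        // three characters: c, h, ữ
        // .length counts UTF-8 code units (bytes); 'ữ' (U+1EEF) takes 3 bytes
        assert(s.length == 5);
        // converting to UTF-32 yields one dchar per code point,
        // so its length is the character count Jaap is after
        dchar[] d = toUTF32(s);
        assert(d.length == 3);
    }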
Jaap Geurts wrote:
> Hi all, I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set. I have some trouble with the way D handles the char[].length property.

If this isn't in some FAQ it should be.

> If I make a string as follows char[] s = "câu này có những chữ cái tiếng việt"; Then the length property (s.length) reports the number of bytes not the number of characters as I would expect to happen. The length property would return the number of bytes for the byte[]. Therefore I still need to use a strlen function to determine the correct string length. One of the implications is that most *string* handling functions in the phobos library depend on the length property and thus fail.

which string functions specifically? What do you mean by "fail"?

> There are some solutions to this, without modifying the language: 1. use special functions to do the work. 2. make a string class. 3. convert everything internally to UTF-16, convert it back to UTF-8 before output.

4. use dchar[] (or possibly wchar[] if you know the unicode codepoints in your string will fit in a wchar).

> 1. The special functions would work but are troublesome because the phobos functions cannot be used (i.e. they have to be rewritten). 2. The string class doesn't work well because the opAssign function cannot be overridden, and thus the following cannot be done: String s = "hello"; I know that it can be done slightly differently (String s = new String("hello");) but I'd like it to be as seamless as possible. However the phobos functions still don't work and have to be included in the class. Wasn't Walter against a String class?? ;) 3. Converting everything is not very efficient, and requires non-transparent extra work. I'd suggest the following: 1. The char[] needs to be treated by the D compiler as a string array, not as a byte array, or 2. Implement a special String datatype (has been discussed earlier and Walter is against it.)

Have you tried using dchar[] or wchar[] in your app? Someone has made wstring.d which is the wchar equivalent to std.string (maybe it works for dchar, too, I don't remember exactly). And AJ and some others are working on expanding the unicode support - see www.dsource.org.

> Also, a lot of phobos functions are missing for wide and double character operations. E.g. wchar[] ljustify(wchar[], int width); is not available and many more are not available for larger char sets.

I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.

> Regards, Jaap --- D programming from Vietnam
Sep 25 2004
On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:

>> If I make a string as follows char[] s = "câu này có những chữ cái tiếng việt"; Then the length property (s.length) reports the number of bytes not the number of characters as I would expect to happen. The length property would return the number of bytes for the byte[]. Therefore I still need to use a strlen function to determine the correct string length. One of the implications is that most *string* handling functions in the phobos library depend on the length property and thus fail.
>
> which string functions specifically? What do you mean by "fail"?

They report the incorrect length. It reports the byte count, not the actual character count, as I would expect because it's an array of char. If I'm right, for a char[] s; array, requesting its length (s.length) should report a wcslen(s) of some sort. But the current implementation doesn't.

> 4. use dchar[] (or possibly wchar[] if you know the unicode codepoints in your string will fit in a wchar).

I tried the wchar[] and dchar[] and that works just fine. But because I program under linux it would be nice if I could keep all my internal data in a consistent format, which is utf-8 for unix based systems. It seems a little odd to have to convert it to utf-16 each time I need to know the length of a string. Of course the occasional conversion is unavoidable, because sometimes if one wants to insert a utf-8 encoded character into a string, one has to fit a wchar into a char[], I realize that.

> I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.

If someone is reading this and knows where the wstring.d is, can you please point me to it?

Thanks, Jaap --- D programming from Vietnam
Sep 26 2004
In article <opsexriepv2saxk9 krd8833t>, Jaap Geurts says...
> On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
>> I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.
>
> If someone is reading this and knows where the wstring.d is, can you please point me to it? Thanks, Jaap --- D programming from Vietnam

Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and you can find it here: http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.html

Please, let me know if there's any missing std.string.d function(s) that you need, and I'll work on getting them in as soon as possible.

David L.
-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 26 2004
"David L. Davis" <SpottedTiger yahoo.com> wrote in message news:cj7aih$mq5$1 digitaldaemon.com...Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and youcanfind here:http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.htmlPlease, let me know if there's any missing std.string.d function(s) thatyouneed, and I'll work on getting them in as soon as possible.If I find bugs or I need other functions, I'll submit my ideas to you. Thanks, David.
Sep 27 2004
David,

I've examined your wstring library, and noticed that the case (islower, isupper) family of functions cannot do other languages than plain latin ascii. Am I right in this? What is needed, I guess, is for the user to supply a conversion table (are the functions in phobos suitable?). I don't know enough about locale support in OS's, but if it is not available there we'd have to code it into the lib. I'll do some probing about how to code it first, and if you wish I can provide you the one for Vietnamese.

Regards,
Jaap

"David L. Davis" <SpottedTiger yahoo.com> wrote in message news:cj7aih$mq5$1 digitaldaemon.com...
> In article <opsexriepv2saxk9 krd8833t>, Jaap Geurts says...
>> On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:
>>> I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.
>>
>> If someone is reading this and knows where the wstring.d is, can you please point me to it? Thanks, Jaap --- D programming from Vietnam
>
> Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and you can find it here: http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.html
>
> Please, let me know if there's any missing std.string.d function(s) that you need, and I'll work on getting them in as soon as possible.
>
> David L.
> -------------------------------------------------------------------
> "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 27 2004
In article <cjal85$1oia$1 digitaldaemon.com>, Jaap Geurts says...
> What is needed I guess is for the user to supply a conversion table (are the functions in phobos suitable?).

Sorry to leap into the middle of your conversation with David, but that is not so. What you need to do is go to www.dsource.org and look for a project called Deimos. Therein, you will find a library called etc.unicode, in source code form. Development of this library has been halted, in favor of ICU, but etc.unicode /does/ do simple casing. (And don't be fooled by the word "simple" - which only means that the function works on characters, not strings, (so it can't uppercase "ß" to "SS") and that it doesn't know that Turkish, Azeri and Lithuanian have non-standard casing rules. It is "simple casing" as opposed to "full casing", that's all). The relevant prototypes are those of getSimpleUppercaseMapping() and getSimpleLowercaseMapping(). You do not need to specify a locale, because, if the locale is anything other than Turkish, Azeri or Lithuanian, the casing will be done correctly.

> I don't know enough about locale support in OS's but if it is not available there we'd have to code it into the lib.

It is a common misconception that casing is locale sensitive. In Unicode, in general, it is not. Okay, so (as mentioned above) Turkish, Azeri and Lithuanian are different, but that is a small enough number that I prefer to think of it as being "locale-independent with three exceptions". I think the misconception arises because the C functions toupper(), tolower() etc. are dependent on something /called/ locale, but which is in fact more closely related to encoding scheme. These ctype functions need to do this because C's chars are only eight bits wide. This logic does not apply to Unicode, and certainly not to the functions in etc.unicode and the forthcoming ICU port.

> I'll do some probing about how to code it first and if you wish I can provide you the one for Vietnamese.

The Unicode standard does not regard Vietnamese as an exception to the standard lookups, so etc.unicode is all you need.

Arcane Jill
Sep 28 2004
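A sketch of how those etc.unicode mappings might be used. The dchar-in, dchar-out signatures are an assumption here, inferred from the description above rather than taken from the actual Deimos source:

    import etc.unicode;   // Deimos library, as described above

    void main()
    {
        // simple casing maps one character to one character, no locale needed
        dchar up = getSimpleUppercaseMapping('á');   // assumed signature
        assert(up == 'Á');
        // "simple" also means no expansion: 'ß' stays 'ß', because its
        // full uppercase "SS" would change the string's length
        assert(getSimpleUppercaseMapping('ß') == 'ß');
    }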
In article <cjal85$1oia$1 digitaldaemon.com>, Jaap Geurts says...
> David, I've examined your wstring library, and noticed that the case (islower, isupper) family of functions cannot do other languages than plain latin ascii. Am I right in this? What is needed, I guess, is for the user to supply a conversion table (are the functions in phobos suitable?). I don't know enough about locale support in OS's, but if it is not available there we'd have to code it into the lib. I'll do some probing about how to code it first, and if you wish I can provide you the one for Vietnamese. Regards, Jaap

Jaap: Currently for anything unicode based, I've been waiting on work that Arcane Jill is doing. StringW.d was mainly created to make it easier to work with 16-bit characters (string.d made it a real pain...you nearly have to cast everything), and hopefully in turn it will work with Windows' 16-bit wide character API functions. But at this point I haven't tested it, plus I don't understand enough to know the real difference between 16-bit characters and unicode characters (some real example data and code would be helpful in this area...Jill?, Ben?, and/or anyone?). Anywayz, needless to say I've mirrored string.d functions like tolower(), toupper() and my very own asciiProperCase() functions to still work on ascii characters only. In my last reply I mainly meant to point you to where stringw.d could be found, in case you found it useful, and to let you know that if you needed anything that string.d had that's missing in it...I would add it. I hope I didn't give the impression that it did unicode? Also, I'm afraid I don't know much about "locale support" either. But if you do something in that area I wouldn't mind taking a look at it. :))

Good Luck in your project,
David L.
-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 28 2004
Everyone: Oops!!! Sorry about the repost everyone. I had a bad storm in my area last night and my connection to the internet wasn't working right, so I didn't think my message had gotten posted. Again sorry. David L. ------------------------------------------------------------------- "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 29 2004
In article <cje51f$q8t$1 digitaldaemon.com>, David L. Davis says...
> plus I don't understand enough to know the real difference between 16-bit characters and unicode characters (some real example data and code would be helpful in this area...Jill?, Ben?, and/or anyone?).

Unlike UTF-8, UTF-16 is very cunning - and this is basically because Unicode and UTF-16 were designed together, to work with each other. Here's how it works - there are two different perspectives: the 16-bit perspective, and the 21-bit perspective.

In the 21-bit perspective, characters run from U+0000 to U+10FFFF - /but/ the range U+D800 to U+DFFF is illegal and invalid. There are /no/ Unicode characters in this range. Any application built to view the Unicode world from this point of view should be prepared to correctly handle and display all valid characters (which excludes U+D800 to U+DFFF).

In the 16-bit perspective, characters run from U+0000 to U+FFFF - and, in this world, the range U+D800 to U+DFFF is just hunky dory. In this perspective, they are called "surrogate characters". They always occur in pairs, with a high surrogate (a character in the range U+D800 to U+DBFF) always immediately followed by a low surrogate (a character in the range U+DC00 to U+DFFF). There are plenty of applications built to view the Unicode world from this point of view (in particular, legacy applications written before Unicode 3.0, when all Unicode characters actually /were/ 16 bits wide).

Let's take an example: the Unicode character U+1D11E (musical symbol G clef). When viewed by an application which sees 21-bit wide characters, what you see is U+1D11E, which you interpret as a single character, and display as ... well ... as musical symbol G clef. A legacy 16-bit-Unicode application looking at the same text file (assuming it to have been saved in UTF-16) will see two "characters": U+D874 followed by U+DD1E. (These are the UTF-16 fragments which together represent U+1D11E). Such an application may safely interpret these wchars as "unknown character" followed by "unknown character", and nothing will break. A slightly more sophisticated application might even interpret them as "high surrogate" followed by "low surrogate", and still nothing would break. These pseudo-characters would likely both display as "unknown character" glyphs, but some fonts may give high surrogates a different glyph from low surrogates. (And, indeed, the Mac's "last chance" fallback font will actually display each pseudo-character as a tiny little hex representation of its codepoint!)

Of course, all of this will fail completely if UTF-8 is used instead of UTF-16. In UTF-8, the representation of U+1D11E is: F0 9D 84 9E. Every UTF-8-aware application will decode this as 0x1D11E, and an application which is unaware of characters beyond U+FFFF would fall over badly here. (It might even truncate it to U+D11E: Hangul syllable TYAELM). But of course, you can still transcode into UTF-16 and deal with it that way - which is another reason why UTF-16 is very good for the internal workings of an application.

Arcane Jill

PS. It is worth noting that the vast majority of fonts available today which are either free or come bundled with an OS do not render characters beyond U+FFFF at all. In fact, I have yet to find /even one/ free font which contains U+1D11E (musical symbol G clef). [I would be very happy to be shown to be wrong on this point - anyone know of one?]
This means that if you stick such characters in a web page, nobody will be able to see them - so you'll have to use a gif after all. :( Unicode may be the future, but sadly it is not the present.
Sep 29 2004
In article <cje7o0$rj6$1 digitaldaemon.com>, Arcane Jill says...
> A legacy 16-bit-Unicode application looking at the same text file (assuming it to have been saved in UTF-16) will see two "characters": U+D874 followed by U+DD1E. (These are the UTF-16 fragments which together represent U+1D11E).

Erratum. Whoops! UTF-16 for 1D11E is actually D834 followed by DD1E. (That'll teach me not to try UTF-16 transcoding by hand in future!) The logic of the post still holds, however.

Jill
Sep 29 2004
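The arithmetic behind that correction is mechanical enough to check in a few lines of D; this is a sketch of the standard surrogate-pair computation, with nothing taken from the posts except the values:

    void main()
    {
        dchar c = 0x1D11E;                   // musical symbol G clef
        uint  v = c - 0x10000;               // 20-bit offset: 0x0D11E
        wchar hi = cast(wchar)(0xD800 + (v >> 10));    // high surrogate
        wchar lo = cast(wchar)(0xDC00 + (v & 0x3FF));  // low surrogate
        // matches the erratum: D834 DD1E, not D874 DD1E
        assert(hi == 0xD834 && lo == 0xDD1E);
    }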
Arcane Jill: Thxs as always for the clear insight! I now have a better understanding of how 16-bit characters (aka UTF-16 / wchar[]) and Unicode (v3.0 / v4.0) match against one another. :)) I hope your ICU conversion work is coming along fine. ------------------------------------------------------------------- "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"
Sep 29 2004
"Jaap Geurts" <jaapsen hotmail.com> wrote in message news:opsexriepv2saxk9 krd8833t...On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4 juno.com> wrote:toIf I make a string as follows char[] s = "câu này có nh?ng ch? cái ti?ng vi?t"; Then the length property (s.length) reports the number of bytes not the number of characters as I would expect to happen. The length property would return the number of bytes for the byte[]. Therefore I still needtheuse a strlen function to determine the correct string length. One ofactual character count, as I would expect because it's an array of char. If I'm right for a char[] s; array and then requesting its length s.length; should report a wcslen(s) of some sort. But the curren't implementation doesn't. That is by design. Out of curiosity, what are you doing with your strings that require the number of characters? Usually one just deals with string fragments and it doesn't matter how long it is (either in characters or in bytes). In a perfect world your expectation of having a one-to-one mapping between array indexing and character indexing would clearly be nice to have. But the current design is (in Walter's opinion - and I agree with him) the best we can do given the imperfect world we find ourselves in and given D's design goals.The report the incorrect length. It reports the byte count not the theimplications is that most *string* handling functions in the phobos library depend on the length property and thus fail.which string functions specifically? What do you mean by "fail"?
Sep 26 2004
"Ben Hinkle" <bhinkle mathworks.com> wrote in message news:cj7eb6$ole$1 digitaldaemon.com...That is by design. Out of curiosity, what are you doing with your strings that require the number of characters? Usually one just deals with string fragments and it doesn't matter how long it is (either in characters or in bytes). In a perfect world your expectation of having a one-to-one mapping between array indexing and character indexing would clearly be nice tohave.But the current design is (in Walter's opinion - and I agree with him) the best we can do given the imperfect world we find ourselves in and givenD'sdesign goals.If this is by design than fine. Who am I to change it. It is just because I need to insert characters into existing strings. I see. Moreover if char[] does behave the way it currently does it will be fast, but it probably won't if it had to interpret the array as UTF-8. But then I see little difference between byte[] and char[]. They are basically the same and can be interpreted ambiguously. Something that Walter wanted to prevent if I remember correctly. Jaap
Sep 27 2004
In article <opsevonsdl2saxk9 krd8833t>, Jaap Geurts says...
> Hi all,

Hi.

> I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set.

Cool.

> I have some trouble with the way D handles the char[].length property.

length does what it does. What you need is a character count, which is something different.

> Therefore I still need to use a strlen function to determine the correct string length.

Okay, here's one:

And some overloads to complete the set:

> One of the implications is that most *string* handling functions in the phobos library depend on the length property and thus fail.

Phobos is not really geared up for Unicode yet. The string handling functions are defined to work only for ASCII. What you need is Unicode string handling. D doesn't have that yet. There is a third party Unicode library called ICU (International Components for Unicode) which I'm trying to port to D, but it's slow work, partly because I've got too much else on at the moment.

> There are some solutions to this, without modifying the language: 1. use special functions to do the work. 2. make a string class. 3. convert everything internally to UTF-16, convert it back to UTF-8 before output.

Option 3 won't work in general. In general, you'll need to convert everything internally to UTF-32, not UTF-16. Of course, if it's just for Vietnamese, UTF-16 will be fine.

> 1. The special functions would work but are troublesome because the phobos functions cannot be used (i.e. they have to be rewritten).

True.

> 2. The string class doesn't work well because the opAssign function cannot be overridden, and thus the following cannot be done: String s = "hello"; I know that it can be done slightly differently (String s = new String("hello");) but I'd like it to be as seamless as possible. However the phobos functions still don't work and have to be included in the class. Wasn't Walter against a String class?? ;)

I've had exactly the same problem with a completely different class. I would very much like to see implicit constructors in D, so we could do: String s = "hello"; But this sort of thing is down to Walter, and he doesn't consider it a priority.

> 3. Converting everything is not very efficient, and requires non-transparent extra work. I'd suggest the following: 1. The char[] needs to be treated by the D compiler as a string array, not as a byte array,

That's just not possible. A char is a UTF-8 fragment, not a Unicode character. They're just not the same.

> or 2. Implement a special String datatype (has been discussed earlier and Walter is against it.)

This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.

> Also, a lot of phobos functions are missing for wide and double character operations. E.g. wchar[] ljustify(wchar[], int width); is not available and many more are not available for larger char sets.

Again, ICU will fill in these gaps. I wish I could bring you better news, but at least these things are on their way and will get here eventually.

Arcane Jill
Sep 26 2004
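The code that followed "Okay, here's one:" was stripped by the archive. A minimal sketch of what such a character-counting function could look like in D, assuming the input is already valid UTF-8:

    // counts code points in a UTF-8 string by skipping continuation bytes,
    // which always have the bit pattern 10xxxxxx
    uint charCount(char[] s)
    {
        uint n = 0;
        for (size_t i = 0; i < s.length; i++)
        {
            if ((s[i] & 0xC0) != 0x80)
                n++;
        }
        return n;
    }

    unittest
    {
        // 5 bytes, 3 characters (assuming the precomposed form of 'ữ')
        assert(charCount("chữ") == 3);
    }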
Arcane Jill wrote:
> This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.

So can we not just drop char and char[]s and define some standard string class to be used for unicode strings (preferably one returning dchars when prompted for individual characters)?

I mean, strings via easy-to-use arrays were one of those nifty ideas that attract me to D. No freaky libraries to remember, just intuitive things that work the same for all kinds of arrays. But having strings implemented as character arrays is cool only as long as I can actually use that char[]-string like an array and get characters out of it by using the [] operator. Beyond that, it is just an annoyingly inconsistent analogy. Also it appears confusing to me that some string operations are supposed to be done with array operations, while others are defined in std.string. Now it seems far easier to have a string class that wraps all this.

I apologise if my uneducated ranting is far below the average level of insight that is to be available here, and I apologise for the slight offtopicness, and I apologise for bringing this up long after the case to ditch char.

-ben
Sep 26 2004
Benjamin Herr <ben 0x539.de> wrote:
> So can we not just drop char and char[]s and define some standard string class to be used for unicode strings (preferably one returning dchars when prompted for individual characters)?

I guess you didn't (yet) dive into Unicode? A "character" is something quite complicated.

1) it can consist of one codepoint like 0x41 "A"
2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x2CA "Á"
3) especially in Hangul/Korean a "character" might be a sequence of 1 up to 4 codepoints.
4) upper/lowercase conversion is dependent on the language used: Up1 -> Down1, Down2

The above points out only some of the basics you'd have to implement in your string class.

Thomas
Sep 26 2004
In article <cj8kb0$22n5$1 digitaldaemon.com>, Thomas Kuehne says...
> A "character" is something quite complicated.

True enough. The best definition of "character" I have ever encountered is this: A "character" is anything the Unicode Consortium say is a character! More official definitions such as "the smallest unit of information having semantic meaning" just don't hold up under close examination, as it's too easy to find counterexamples. The problem arises because Unicode started its life as the union of many existing legacy "character sets", each of which had their own different idea of what a "character" was.

> 1) it can consist of one codepoint like 0x41 "A"
> 2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x2CA "Á"
> 3) especially in Hangul/Korean a "character" might be a sequence of 1 up to 4 codepoints.

Actually, you're talking about graphemes and/or glyphs, not characters. There is, in fact, a precise one-to-one correspondence between codepoints and characters. A grapheme, on the other hand, may consist of one or more characters combined together (for example 'A' + combining-acute-accent = 'Á', as per your example); a glyph may consist of one or more graphemes ligated together (for example 'a' + zero-width-joiner + 'e' = 'æ'). And just to be even more pedantic, your statement "two different codepoint sequences can be /equal/" should really read "two different codepoint sequences can be /canonically equivalent/". Equal means equal.

> 4) upper/lowercase conversion is dependent on the language used: Up1 -> Down1, Down2

..though currently only Turkish, Lithuanian and Azeri are non-standard. As far as casing is concerned, locale is /almost/ ignorable. The functions getSimpleUppercaseMapping() and getSimpleLowercaseMapping() in etc.unicode will work fine for all languages apart from these few non-standard exceptions listed above. A bigger problem with casing is that (for example) uppercase "ß" is "SS" - that is, strings can get longer when you case-convert them. Even etc.unicode doesn't deal with that (because it got aborted in favor of ICU before full casing was implemented). You're probably thinking of collation (sort order), which varies /greatly/ from language to language.

> The above points out only some of the basics you'd have to implement in your string class.

I think the original poster was only talking about character counting, and the related problem of locating character boundaries in a UTF array. That's relatively easy, and can be hand-coded without too much trouble. The more complex stuff like casing, collation, equivalence, grapheme boundary identification, etc., is probably best left to an external library.

Arcane Jill
Sep 27 2004
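To make the "canonically equivalent, not equal" point concrete, here is a small D sketch. Note that it uses U+0301 (combining acute accent), the actual canonical decomposition of 'Á', rather than the 0x2CA cited above:

    void main()
    {
        dchar[] precomposed, decomposed;
        precomposed ~= 0x00C1;           // "Á" as a single code point
        decomposed  ~= 0x0041;           // "A"
        decomposed  ~= 0x0301;           // combining acute accent
        // array comparison works on code points, so these are not equal,
        // even though they are canonically equivalent as text
        assert(precomposed != decomposed);
    }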
Thomas Kuehne wrote:
> I guess you didn't (yet) dive into Unicode? A "character" is something quite complicated.

I only theoretically dealt with Unicode (so, no). I had no idea I am so far off, though.

> 1) it can consist of one codepoint like 0x41 "A"

Sounds easy so far.

> 2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x2CA "Á"

Is it not invalid, at least with utf8, to use anything but the least `large' representation?

> [...] The above points out only some of the basics you'd have to implement in your string class.

I either have to implement it in my string class, or I have to do it `by hand' every time I need any of this functionality. Which is why I suggested a class (or even just a struct, to keep the semantics closer to the standard arrays), after all.

-ben
Sep 27 2004
Benjamin Herr wrote:
> Is it not invalid, at least with utf8, to use anything but the least `large' representation?

UTF-8/16/32 only deal with one codepoint at a time (except for some checking). The codepoint sequences above would be U+0000C1 "Á" and U+000041 U+0002CA "Á". These are different Normalization Forms. (http://www.unicode.org/reports/tr15/)

> I either have to implement it in my string class, or I have to do it `by hand' every time I need any of this functionality.

If you ensure that the input only contains Latin (French/German...) / Greek / Cyrillic fully in NFC/NFKC, you can assume for most cases that 1 dchar == 1 character. If you really need full string handling, I suppose you could assist Arcane Jill with porting the ICU.

Thomas
Sep 27 2004
>> I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set.
>
> Cool.

I see. If that is the way it is, then I'll use functions operating on strings.

>> I have some trouble with the way D handles the char[].length property.
>
> length does what it does. What you need is a character count, which is something different.

>> Therefore I still need to use a strlen function to determine the correct string length.
>
> Okay, here's one:

Thanks for the code examples.

> Phobos is not really geared up for Unicode yet. The string handling functions are defined to work only for ASCII.

I noticed. I'll use David's (Spotted Tiger) stringw.d and complement it if necessary.

>> 3. convert everything internally to UTF-16, convert it back to UTF-8 before output.
>
> Option 3 won't work in general. In general, you'll need to convert everything internally to UTF-32, not UTF-16. Of course, if it's just for Vietnamese, UTF-16 will be fine.

Strangely enough, the Windows32 API uses UTF-16 as its encoding.

> That's just not possible. A char is a UTF-8 fragment, not a Unicode character. They're just not the same.

I understand the issues, and UTF-8 in particular was actually designed with backwards compatibility in mind. (For C uses the zero char as the terminator. Had the world programmed in Pascal, then we probably wouldn't have UTF-8.)

>> or 2. Implement a special String datatype (has been discussed earlier and Walter is against it.)
>
> This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.

But (if my memory serves me well) this is exactly what Walter wanted to prevent: a multitude of String classes all doing the same but having slightly different interfaces. Should this be part of the language or the Phobos library, don't you think? UTF-8 will always require a class of some sort... I'm not trying to put oil on the fire, but isn't this an important aspect for version 1.0?

Jaap
--
D Programming from Vietnam.
Sep 27 2004
In article <cj90u4$b46$1 digitaldaemon.com>, Jaap Geurts says...
> Strangely enough, the Windows32 API uses UTF-16 as its encoding.

Most Unicode platforms use UTF-16, including the ICU library. It follows logically, therefore, that on these platforms - /including/ the Windows API - you cannot use array indexing to find the nth character.

But there is method in this madness. The Unicode characters from U+10000 upwards are not characters from living languages. By and large, it is generally considered "harmless" to regard such characters as if they were two characters. For example, consider the character U+1D11E (musical symbol G clef) - does it really /matter/ if your application perceives instead U+D874 followed by U+D11C? It won't affect casing, sorting or anything like that because the character isn't part of any living language script. From the point of view of most general purpose algorithms, it's just another shape to draw, like a WingDings symbol. So UTF-16 is simply the best space/speed compromise for the majority of real-life languages.

> I understand the issues, and UTF-8 in particular was actually designed with backwards compatibility in mind. (For C uses the zero char as the terminator. Had the world programmed in Pascal, then we probably wouldn't have UTF-8.)

The compatibility is with ASCII, not with C. There is no Unicode meaning of U+0000, apart from "some sort of application-dependent control character".

>> This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.
>
> But (if my memory serves me well) this is exactly what Walter wanted to prevent: a multitude of String classes all doing the same but having slightly different interfaces. Should this be part of the language or the Phobos library, don't you think?

Well, ICU is not really anything to do with D. It was originally a Java API, then got ported to C and C++. We'll have it in D, too, eventually. It's not my fault if ICU defines a string class. But I don't think Walter will be complaining - the ICU class isn't a simple "replacement" or "alternative" to char[] - it provides full Unicode functionality, in a way that char[] doesn't. I don't think we'll be seeing "a multitude of String classes" either. To be honest, I don't think even ICU's UnicodeString class will ever become any kind of D "standard", because you won't be able to do implicit casting to/from it.

> UTF-8 will always require a class of some sort...

Well, I'm more inclined to the view that truly internationalized software just won't use UTF-8 at all. UTF-16 is much more manageable for this sort of thing. UTF-8 can do the job, but it's mainly intended for text which is "mostly ASCII".

Arcane Jill
Sep 27 2004
Arcane Jill wrote:
> But there is method in this madness. The Unicode characters from U+10000 upwards are not characters from living languages. By and large, it is generally considered "harmless" to regard such characters as if they were two characters. For example, consider the character U+1D11E (musical symbol G clef) - does it really /matter/ if your application perceives instead U+D874 followed by U+D11C? It won't affect casing, sorting or anything like that because the character isn't part of any living language script.

Guess you missed the extended CJK part. There are names of living persons that can only be encoded using post-U+FFFF stuff. As a consequence it does affect the sorting and "character"/"glyph"/"grapheme"/"codepoint"/"what-so-ever" count algorithms.

Thomas
Sep 27 2004
In article <cj97fh$t7f$1 digitaldaemon.com>, Thomas Kuehne says...
> Guess you missed the extended CJK part. There are names of living persons that can only be encoded using post-U+FFFF stuff. As a consequence it does affect the sorting and "character"/"glyph"/"grapheme"/"codepoint"/"what-so-ever" count algorithms.

I will freely admit that I don't speak Chinese and don't know the intricacies of CJK. But that isn't really what I was trying to get at. Yes, obviously, if an app wants to be general, it must use proper character access via library functions. All I really meant was that if you pretend UTF-16 fragments are characters then your application will /usually/ behave sensibly. That's all. Me, I'm all in favor of proper character iteration. It's just that a lot of apps are going to want a quick-and-dirty shortcut that works more often than not, and UTF-16 is exactly that.

So, there are characters in the >U+FFFF range which are used in proper names? I didn't know that. But how badly does that change things? Does it affect casing? I suppose the answer to that depends on whether or not CJK characters /have/ case. Do they? Does it affect sorting? Not in general, since collation is a function of the /user's preferences/, not the script (that is, if an English user sorts Czechoslovakian text, they will expect to see it in "English order", not "Czechoslovakian order"), so only applications which are (a) fully internationalized, or (b) written for CJK users specifically, will need to care. For the rest of the world, two "unknown character" glyphs is not that much worse than one.

So I'd summarize as:
*) If you want to write a fully internationalized app, you need to be using a proper Unicode library, but
*) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16 bits wide.

In other words, yes, you're right. But we can usually cheat. Anyway, this sort of conversation goes on all the time on the Unicode public forum. If you want to talk about this in depth, I suggest we move the discussion there.

Arcane Jill
Sep 28 2004
Arcane Jill wrote:
> *) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16 bits wide.

I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

-ben
Sep 28 2004
In article <cjc38q$2jna$2 digitaldaemon.com>, Benjamin Herr says...
> I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

I think what Jill was saying is that in most cases, UTF-16 will represent any character you care about with a single wchar (ie. in 16 bits). So if you code an application to use wchars you can generally pretend as if there is a 1 to 1 correspondence between wchars and characters. It's *possible* that some users (Chinese perhaps) could break your application, but if this isn't your target market then it may not be a concern. I think the point is that if you're worried that dchars will use up too much memory, you can usually get away with pretending UTF-16 is not a multi-char encoding scheme.

Sean
Sep 28 2004
In article <cjc7pb$2n3d$1 digitaldaemon.com>, Sean Kelly says...
> I think what Jill was saying is that in most cases, UTF-16 will represent any character you care about with a single wchar (ie. in 16 bits). So if you code an application to use wchars you can generally pretend as if there is a 1 to 1 correspondence between wchars and characters. It's *possible* that some users (Chinese perhaps) could break your application, but if this isn't your target market then it may not be a concern. I think the point is that if you're worried that dchars will use up too much memory, you can usually get away with pretending UTF-16 is not a multi-char encoding scheme.

Yes, exactly. And to some extent, the same is also true of UTF-8 if your application only cares about ASCII. /Many/ algorithms will work just fine if you pretend that /UTF-8/ is a character set, and that a char[] is an actual string of 8-bit-wide "characters". For example: concatenation (strcat, ~); finding a character or a substring (strchr, strstr, find); splitting on boundaries determined by strchr/strstr/find; tokenizing using ASCII separators such as space or tab; identification of C/C++/D comments; parsing XML; ... the list is endless. So long as you don't try to interpret or manipulate the characters you don't "understand", these encodings are robust enough to withstand most other manipulations.

The major reason for preferring UTF-16 over UTF-8, however, is that UTF-16 is likely to contain over 99% of all characters in which you are likely to be interested. The same cannot be said of UTF-8, whose single-code-unit range contains only ASCII characters. The major reason for preferring UTF-16 over UTF-32 is that you get a lot of wasted space with UTF-32. As noted above, >99% of your characters will only need two bytes, so that's two bytes of zeroes for each such character. Even the >U+FFFF characters are still guaranteed to have /over one third/ of their bits unused. UTF-32 text files (and strings), therefore, /will/ have between a third and a half (and maybe even more if the text is mostly ASCII) of all of their bits wasted.

So it's just a space/speed compromise, that's all. But a pretty good one in most cases.

Jill
Sep 29 2004
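A sketch of why those byte-level operations are safe: in UTF-8, every byte of a multi-byte sequence is >= 0x80, so a byte that compares equal to an ASCII delimiter really is that delimiter. The helper below is hypothetical, not from the thread:

    // splits a UTF-8 string on ASCII spaces without decoding it;
    // a multi-byte sequence can never contain a byte equal to ' '
    char[][] splitOnSpace(char[] s)
    {
        char[][] parts;
        size_t start = 0;
        for (size_t i = 0; i < s.length; i++)
        {
            if (s[i] == ' ')
            {
                parts ~= s[start .. i];
                start = i + 1;
            }
        }
        parts ~= s[start .. s.length];
        return parts;
    }

    unittest
    {
        assert(splitOnSpace("chữ cái").length == 2);
    }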
Benjamin Herr wrote:
>> *) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16 bits wide.
>
> I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

Potentially, codepoints are 64 bit. The highest currently assigned codepoint fits in 32 bit. For the majority of living languages the codepoints fit in 16 bit. The bit-size of a codepoint has nothing to do with multi-codepoint "chars". Again: if you ensure that neither Korean/Hebrew/Arabic, (zero-width) joiners nor combining accents are used, you might treat a 16-bit char as a "character" in most cases. Exceptions: sorting, display and advanced text analysis.

Thomas
Sep 27 2004
In article <cjc8ag$2nb2$1 digitaldaemon.com>, Thomas Kuehne says...
> Potentially, codepoints are 64 bit.

First I've heard of it. Do you have a source for this information? So far as I am aware, the UC are /adamant/ that they will never go beyond 21 bits. Programming languages tend to use 32 bits because (a) 32 bits is a more natural length for computers, and (b) they're not taking chances - once upon a time the UC thought that 16 bits would be sufficient. But I have never heard /anyone/ claim that codepoints are potentially 64 bits before. Whence does this originate?

Arcane Jill
Sep 29 2004
In article <cjc38q$2jna$2 digitaldaemon.com>, Benjamin Herr says...
> I guess I really do not get it. I thought I was just told that codepoints might be only 16-bits wide but that I always have to account for multi-codepointy chars? *more clueless*

Head out to www.unicode.org and check out their various FAQs. They do a much better job at explaining things than I. For what it's worth, here's my potted summary:

"code unit" = the technical name for a single primitive fragment of either UTF-8, UTF-16 or UTF-32 (that is, the value held in a single char, wchar or dchar). I tend to use the phrases UTF-8 fragment, UTF-16 fragment and UTF-32 fragment to express this concept.

"code point" = the technical name for the numerical value associated with a character. In Unicode, valid codepoints go from 0 to 0x10FFFF inclusive. In D, a codepoint can only be stored in a dchar.

"character" = officially, the smallest unit of textual information with semantic meaning. Practically speaking, this means either (a) a control code; (b) something printable; or (c) a combiner, such as an accent you can place over another character. Every character has a unique codepoint. Conversely, every codepoint in the range 0 to 0x10FFFF corresponds to a unique Unicode character. Characters are conventionally written as "U+" followed by the codepoint in hexadecimal (for example U+20AC, the euro sign, which is the character corresponding to codepoint 0x20AC).

As an observation, over 99% of all the characters you are likely to use, and which are involved in text processing, will occur in the range U+0000 to U+FFFF. Therefore an array of sixteen-bit values interpreted as characters will likely be sufficient for most purposes. (A UTF-16 string may be interpreted in this way). If you want that extra 1%, as some apps will, you'll need to go the whole hog and recognise characters all the way up to U+10FFFF.

"grapheme" = a printable base character which may have been modified by zero or more combining characters (for example 'a' followed by combining-acute-accent).

"glyph" = one or more graphemes glued together to form a single printable symbol. The Unicode character zero-width-joiner usually acts as the glue.

For more detailed information, as I suggested above, please feel free to go to the Unicode website, and get all the details from the people who organize the whole thing.

Arcane Jill
Sep 28 2004
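The code unit / code point distinction maps directly onto D's three character types; a small sketch using the euro sign mentioned above (the counts are facts of the encodings, not taken from the post):

    void main()
    {
        char[]  u8  = "\u20AC";   // the euro sign as UTF-8
        wchar[] u16 = "\u20AC";   // the same character as UTF-16
        dchar[] u32 = "\u20AC";   // and as UTF-32
        assert(u8.length  == 3);  // three UTF-8 code units: E2 82 AC
        assert(u16.length == 1);  // one UTF-16 code unit: 20AC
        assert(u32.length == 1);  // one code point
    }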
Arcane Jill wrote:
> Head out to www.unicode.org and check out their various FAQs. They do a much better job at explaining things than I. For what it's worth, here's my potted summary:
> [snip]

Cool. I added this to a wiki page: http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

--
Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Sep 29 2004
[snip]
>> But (if my memory serves me well) this is exactly what Walter wanted to prevent: a multitude of String classes all doing the same but having slightly different interfaces. Should this be part of the language or the Phobos library, don't you think?
>
> Well, ICU is not really anything to do with D. It was originally a Java API, then got ported to C and C++. We'll have it in D, too, eventually. It's not my fault if ICU defines a string class. But I don't think Walter will be complaining - the ICU class isn't a simple "replacement" or "alternative" to char[] - it provides full Unicode functionality, in a way that char[] doesn't. I don't think we'll be seeing "a multitude of String classes" either. To be honest, I don't think even ICU's UnicodeString class will ever become any kind of D "standard", because you won't be able to do implicit casting to/from it.

Is there a link to the String class API? I'm curious to see what the differences are from a function-based API. Is the basic difference that the String's encoding is determined at runtime? Maybe a struct would be better than a class:

    struct ICUString {
        enum Encoding { UTF8, UTF16, UTF32, ... };
        uint len;
        void* data;
        Encoding encoding;
        ... member functions like opIndex, etc ...
    }

... and functions like std.string with ICUString instead of char[] or wchar[] or dchar[]...

[snip]
Sep 27 2004
In article <cjd924$81h$1 digitaldaemon.com>, David L. Davis says...
> Jaap: Currently for anything unicode based, I've been waiting on work that Arcane Jill is doing.
> [snip]

I posted this yesterday: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/11206 - I hope it's helpful.

> -------------------------------------------------------------------
> "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"

Nice quote!

Jill
Sep 29 2004