
D - Unicode in D

reply globalization guy <globalization_member pathlink.com> writes:
I think you'll be making a big mistake if you adopt C's obsolete char == byte
concept of strings. Savvy language designers these days realize that, like int's
and float's, char's should be a fundamental data type at a higher-level of
abstraction than raw bytes. The model that most modern language designers are
turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

If you do so, you make it possible for strings in your language to have a
single, canonical form that all APIs use, instead of the nightmare that C/C++
programmers face when passing string parameters ("now, let's see, is this a
char* or a const char* or an ISO C++ string or an ISO wstring or a wchar_t* or a
char[] or a wchar_t[] or an instance of one of countless string classes...?").
The fact that not just every library but practically every project feels the
need to reinvent its own string type is proof of the need for a good, solid,
canonical form built right into the language.

Most language designers these days either get this from the start or they later
figure it out and have to screw up their language with multiple string types.

Having canonical UTF-16 chars and strings internally does not mean that you
can't deal with other character encodings externally. You can convert to
canonical form on import and convert back to some legacy encoding on export.

When you create the strings yourself, or when they are created in Java or
Javascript or default XML or most new text protocols, no conversion will be
necessary. It will only be needed for legacy data (or a very lightweight switch
between UTF-8 and UTF-16). And for those cases where you have to work with
legacy data and yet don't want to incur the overhead of encoding conversion in
and out, you can still treat the external strings as byte arrays instead of
strings, assuming you have a "byte" data type, and do direct byte manipulation
on them. That's essentially what you would have been doing anyway if you had
used the old char == byte model I see in your docs. You just call it "byte"
instead of "char" so it doesn't end up being your default string type.

Having a modern UTF-16 char type, separate from arrays of "byte", gives you a
consistency that allows for the creation of great libraries (since text is such
a central data type). Languages that got this right from the start find that
their libraries universally use a single string type. Perl figured it out pretty
late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's never
clear which CPAN modules will work and which ones will fail, so you have to use
pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

I hope you'll consider making this change to your design. Have an 8-bit unsigned
"byte" type and a 16-bit unsigned UTF-16 "char" and forget about this "8-bit
char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff or I'm
quite sure you'll later regret it. C/C++ are in that sorry state for legacy
reasons only, not because their designers were foolish, but any new language
that intentionally copies that "design" is likely to regret that decision.
Jan 16 2003
next sibling parent reply "Paul Sheer" <psheer icon.co.za> writes:
On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:

 I think you'll be making a big mistake if you adopt C's obsolete char == byte
what about embedded work? this needs to be lightweight

in any case, a 16 bit character set doesn't hold all the charsets needed by the
world's languages, but a 20 bit charset (UTF-8) is overkill. then again, most
programmers get by with 8 bits 99% of the time. So you need to give people
options.

-paul
Jan 16 2003
next sibling parent reply Theodore Reed <rizen surreality.us> writes:
On Thu, 16 Jan 2003 14:40:15 +0200
"Paul Sheer" <psheer icon.co.za> wrote:

 On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:
 
 I think you'll be making a big mistake if you adopt C's obsolete
 char == byte
what about embedded work? this needs to be lightweight in any case, a 16 bit
character set doesn't hold all the charsets needed by the worlds languages, but
a 20 bit charset (UTF-8) is overkill. then again, most programmers get by with
8 bits 99% of the time. So you need to give people options. -paul
But the default option should be UTF-8 with a module available for
conversion. (I tend to stay away from UTF-16 because of endian issues.)

Also, I'm not sure where you're getting the 20-bit part. UTF-8 can encode
everything in the Unicode 32-bit range. (Although it takes like 8 bytes
towards the end.)

UTF-8 also addresses the lightweight bit, as long as you aren't using
non-English characters, but even if you are, they aren't that much longer.
And it's better than having to deal with 50 million 8-bit encodings.

FWIW, I wholeheartedly support Unicode strings in D.

--
Theodore Reed (rizen/bancus) -==- http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"We have committed a greater crime, and for this crime there is no name.
What punishment awaits us if it be discovered we know not, for no such
crime has come in the memory of men and there are no laws to provide for
it." -- Equality 7-2521, Ayn Rand's Anthem
Jan 16 2003
next sibling parent reply "Sean L. Palmer" <seanpalmer directvinternet.com> writes:
I'm all for UTF-8.  Most fonts don't come anywhere close to having all the
glyphs anyway, but it's still nice to use an encoding that actually has a
real definition (whereas "byte" has no meaning whatsoever and could mean
ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.)  UTF-8 allows you the full unicode
range but the part that we use everyday just takes 1 byte per char, like
usual.  I believe it even maps almost 1:1 to ASCII in that range.

You cannot however make a UTF-8 data type.  By definition each character may
take more than one byte.  But you don't make arrays of characters, you make
arrays of character building blocks (bytes) that are interpreted as
characters.

Anyway we'd need some automated way to step through the array one character
at a time.  Maybe string could be an array of bytes that pretends that it's
an array of 32-bit unicode characters?
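For reference, that is roughly how it later worked out in D. The following is
only a sketch using present-day Phobos (std.utf postdates this thread); it
steps through a UTF-8 array one decoded character at a time:

    import std.stdio;
    import std.utf : decode;

    void main()
    {
        string s = "héllo";              // UTF-8 bytes in memory
        size_t i = 0;
        while (i < s.length)
        {
            dchar c = decode(s, i);      // decodes one code point, advances i past it
            writefln("U+%04X", cast(uint) c);
        }
        // or simply:  foreach (dchar c; s) { ... }   -- foreach does the decoding
    }

The array is still just bytes; only the iteration pretends it is a sequence of
32-bit characters, which is essentially what is being described here.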

Sean

"Theodore Reed" <rizen surreality.us> wrote in message
news:20030116081437.1a593197.rizen surreality.us...
 On Thu, 16 Jan 2003 14:40:15 +0200
 "Paul Sheer" <psheer icon.co.za> wrote:

 On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:

 I think you'll be making a big mistake if you adopt C's obsolete
 char == byte
what about embedded work? this needs to be lightweight in any case, a 16 bit character set doesn't hold all the charsets needed by the worlds languages, but a 20 bit charset (UTF-8) is overkill. then again, most programmers get by with 8 bits 99% of the time. So you need to give people options.
But the default option should be UTF-8 with a module available for conversion. (I tend to stay away from UTF-16 because of endian issues.) Also, I'm not sure where you're getting the 20-bit part. UTF-8 can encode everything in the Unicode 32-bit range. (Although it takes like 8 bytes towards the end.) UTF-8 also addresses the lightweight bit, as long as you aren't using non-English characters, but even if you are, they aren't that much longer. And it's better than having to deal with 50 million 8-bit encodings. FWIW, I wholeheartedly support Unicode strings in D. -- Theodore Reed (rizen/bancus) -==- http://www.surreality.us/ ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~ "We have committed a greater crime, and for this crime there is no name. What punishment awaits us if it be discovered we know not, for no such crime has come in the memory of men and there are no laws to provide for it." -- Equality 7-2521, Ayn Rand's Anthem
Jan 16 2003
next sibling parent reply Theodore Reed <rizen surreality.us> writes:
On Thu, 16 Jan 2003 09:49:58 -0800
"Sean L. Palmer" <seanpalmer directvinternet.com> wrote:

 I'm all for UTF-8.  Most fonts don't come anywhere close to having all
 the glyphs anyway, but it's still nice to use an encoding that
 actually has a real definition (whereas "byte" has no meaning
 whatsoever and could mean ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.) 
 UTF-8 allows you the full unicode range but the part that we use
 everyday just takes 1 byte per char, like usual.  I believe it even
 maps almost 1:1 to ASCII in that range.
AFAIK, Unicode between 0 and 127 is the exact same thing as ASCII.

--
Theodore Reed (rizen/bancus) -==- http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"The word of Sin is Restriction. O man! refuse not thy wife, if she will!
O lover, if thou wilt, depart! There is no bond that can unite the divided
but love: all else is a curse. Accursed! Accursed be it to the aeons!
Hell." -- Liber AL vel Legis, 1:41
Jan 16 2003
parent Alix Pexton <Alix thedjournal.com> writes:
Theodore Reed wrote:
 On Thu, 16 Jan 2003 09:49:58 -0800
 "Sean L. Palmer" <seanpalmer directvinternet.com> wrote:
 
 
I'm all for UTF-8.  Most fonts don't come anywhere close to having all
the glyphs anyway, but it's still nice to use an encoding that
actually has a real definition (whereas "byte" has no meaning
whatsoever and could mean ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.) 
UTF-8 allows you the full unicode range but the part that we use
everyday just takes 1 byte per char, like usual.  I believe it even
maps almost 1:1 to ASCII in that range.
AFAIK, Unicode between 0 and 127 is the exact same thing as ASCII.
As I see it there are two issues here. Firstly, there is the ability to read
and manipulate text streams that are encoded in one of the many
multi-byte/variable-width formats. Secondly, there is allowing code itself to
be written in mb/vw formats.

The first can be achieved (though perhaps not transparently) using a library,
while the second obviously requires work to be done on the front end of the
compiler. The front end is freely available under the gpl/artistic licences,
and I don't think it would be difficult to augment it with mb/vw support.
However, this doesn't give us an integrated solution, such as you might find
in other languages, but it is a start.

Alix Pexton
Webmaster - "the D journal"
www.thedjournal.com

PS who needs mb/vw when we have lojban ;)
Jan 16 2003
prev sibling parent globalization guy <globalization_member pathlink.com> writes:
In article <b06r0m$1l3u$1 digitaldaemon.com>, Sean L. Palmer says...
I'm all for UTF-8.  Most fonts don't come anywhere close to having all the
glyphs anyway,...
Modern font systems cover different Unicode ranges with different fonts. A font
that contains all the Unicode glyphs is of very limited use. (It tends to be
useful for primitive tools that assume a single font for all glyphs. Such tools
are being superseded by modern tools, though, and the complexities of rendering
are being delegated to central rendering subsystems.)
... but it's still nice to use an encoding that actually has a
real definition (whereas "byte" has no meaning whatsoever and could mean
ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.)  UTF-8 allows you the full unicode
range but the part that we use everyday just takes 1 byte per char, like
usual.  
I'd be careful about the "part we use everyday" idea. I don't really know who's
involved in this "D" project, but big company developers tend to work more and
more in systems that handle a rich range of characters. The reason is that
that's what their companies need to do every day, whether the developers do
personally or not. That's what is swirling around the Internet every day.

It is true, though, that for Westerners, ASCII characters occur more commonly,
so UTF-8 has a sort of "poor man's compression" advantage that is often useful.
 I believe it even maps almost 1:1 to ASCII in that range.

You cannot however make a UTF-8 data type.  By definition each character may
take more than one byte.  But you don't make arrays of characters, you make
arrays of character building blocks (bytes) that are interpreted as
characters.
No, you make arrays of UTF-16 code units. When you need to do work with arrays
of characters, UTF-16 is a better choice than UTF-8, though UTF-8 is better for
data interchange with unknown recipients.
Anyway we'd need some automated way to step through the array one character
at a time.  Maybe string could be an array of bytes that pretends that it's
an array of 32-bit unicode characters?
UTF-16. That's what it's for. UTF-32 is not practical for most purposes that involve large amounts of text.
Jan 16 2003
prev sibling next sibling parent reply globalization guy <globalization_member pathlink.com> writes:
In article <20030116081437.1a593197.rizen surreality.us>, Theodore Reed says...
On Thu, 16 Jan 2003 14:40:15 +0200
"Paul Sheer" <psheer icon.co.za> wrote:

But the default option should be UTF-8 with a module available for
conversion. (I tend to stay away from UTF-16 because of endian issues.)
The default (and only) form should be UTF-16 in the language itself. There is
no endianness issue unless data is serialized. Serialization is a type of
output like printing on paper, and I'm not suggesting serializing into UTF-16
by default. UTF-8 is the way to go for that. I'm only talking about the "model"
used by the programming language.

Another way to look at it is to consider int's. Do you try to avoid the int
data type? It has exactly the same endianness issues as UTF-16.
Also, I'm not sure where you're getting the 20-bit part. UTF-8 can
encode everything in the Unicode 32-bit range. (Although it takes like 8
bytes towards the end.) 
He's right, actually. Unicode has a range of slightly over 20 bits. (1M + 62K, to be exact.) Originally, Unicode had a 16-bit range and ISO 10646 had a 31 bit range (not 32), but both now have converged on a little over 20.
UTF-8 also addresses the lightweight bit, as long as you aren't using
non-English characters, but even if you are, they aren't that much
longer. 
So does UTF-16 because although Western characters take a little more space than with UTF-8, processing is lighter weight, and that is usually more significant.
 And it's better than having to deal with 50 million 8-bit
encodings.
Amen to that! Talk about heavyweight...
FWIW, I wholeheartedly support Unicode strings in D.
Yes, indeed. It is a real benefit to give the users because with Unicode
strings as standard, you get libraries that can take a lot of the really arcane
issues off the programmers' shoulders (and put them on the library authors'
shoulders, where tough stuff belongs). When D programmers then deal with
Unicode XML or HTML, they can just send the strings to the libraries, confident
that the "Unicode stuff" will be taken care of.

That's the kind of advantage modern developers get from Java that they don't
get from good ol' C.
Jan 16 2003
parent Paul Stanton <Paul_member pathlink.com> writes:
In article <b07jht$22v4$1 digitaldaemon.com>, globalization guy says...

That's the kind of advantage modern developers get from Java that they don't get
from good ol' C.
provided solaris/jvm is configured correctly by (friggin) service provider
Jan 16 2003
prev sibling parent "Serge K" <skarebo programmer.net> writes:
 UTF-8 can encode everything in the Unicode 32-bit range.
 (Although it takes like 8 bytes towards the end.)
0x00..0x7F        --> 1 byte  - ASCII
0x80..0x7FF       --> 2 bytes - Latin extended, Greek, Cyrillic, Hebrew, Arabic, etc...
0x800..0xFFFF     --> 3 bytes - most of the scripts in use.
0x10000..0x10FFFF --> 4 bytes - rare/dead/... scripts
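For the curious, present-day Phobos (which postdates this thread) exposes this
table directly through std.utf.codeLength; a minimal sketch:

    import std.stdio;
    import std.utf : codeLength;

    void main()
    {
        // one code point from each row of the table above
        foreach (dchar c; "A\u00E9\u0905\U0001F600"d)
            writefln("U+%06X needs %d UTF-8 byte(s)", cast(uint) c, codeLength!char(c));
        // prints 1, 2, 3 and 4 bytes respectively
    }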
Jan 16 2003
prev sibling parent globalization guy <globalization_member pathlink.com> writes:
In article <b065i9$19aa$1 digitaldaemon.com>, Paul Sheer says...
On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:

 I think you'll be making a big mistake if you adopt C's obsolete char == byte
what about embedded work? this needs to be lightweight
Good questions. I think you'll find if you sniff around that more and more
embedded work is going to Unicode. The reason is that it is inevitable that
any successful device that deals with natural language will be required to
handle more and more characters as its market expands. When you add new
characters by changing character sets, you get a high marginal cost per market
and you still can't handle mixed language scenarios (which have become very
common due to the Internet.) When you add new characters by *adding* character
sets, you lose all of your "lightweight" benefits.

I attended a Unicode conference once where there was a separate embedded
systems conference going on in the same building. By the end of the conference,
we had almost merged, at least in the hallways. ;-)

Unicode, done right, gives you universality at a fraction of the cost of
patchwork solutions to worldwide markets. Even in English, the range of
characters being demanded by customers has continued to grow. It grew beyond
ASCII years ago, and has now gone beyond Latin-1. MS Windows had to add a
proprietary extension to Latin-1 before they gave up entirely and went full
Unicode, as did Apple with OS X, Sun with Java, Perl, HTML 4....
in any case, a 16 bit character set doesn't hold all
the charsets needed by the worlds languages, but a
20 bit charset (UTF-8) is overkill. then again, most
programmers get by with 8 bits 99% of the time. So you
need to give people options.

-paul
UTF-16 isn't a 16-bit character set. It's a 16-bit encoding of a character set
that has an enormous repertoire. There is room for well over a million
characters in the Universal Character Set (shared by Unicode and ISO 10646),
and many of those "characters" are actually components meant to be combined
with others to create a truly enormous variety of what most people think of as
"characters". It is no longer correct to assume a 1:1 correspondence between a
Unicode character and a glyph you see on a screen or on paper. (And that
correspondence was lost way back when TrueType was created anyway). The length
of a string in these modern times is an abstract concept, not a physical one,
when dealing with natural language. The nice 1:1 correspondences between code
point / character / glyph are still available for artificial symbols created as
sequences of ASCII printing characters, though, and that is true even in UTF-16
Unicode.

Unicode certainly does have room for all of the world's character sets. It is a
superset of them all -- with "all" meaning those considered significant by the
various national bodies represented in ISO and all of the industrial bodies
providing input to the Unicode Technical Committee. It's not a universal
superset in an absolute sense.

When you say "most programmers get by with 8 bits 99% of the time", I think you
may be thinking a bit too narrowly. The composition of programmers has become
more international than perhaps you realize, and the change isn't slowing down.
Even in the West, most major companies have moved to Unicode *to solve their
own problems*. MS programmers can't get by with 8-bits. Neither can Apple's, or
Sun's, or Oracle's, or IBM's....

Another thing to consider is that programmers use the tools that exist,
naturally. For a long time, major programming languages had the fundamental
equivalence of byte and char at their core. Many people who got by with 8-bits
did so because there was no practical alternative. These days, there are
alternatives, and modern languages need to be designed to take advantage of all
the great advantages that come along with using Unicode.

Jan 16 2003
prev sibling next sibling parent "Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:
Hi,

I have been thinking about this issue too, and I also think that Unicode
strings should be a prime concern of D. And, yes, UTF-8 is the way to go.
I would very much like to see a string using canonical UTF-8 encoding being
built right into the language, as a class with value semantics.

What we are faced with is:

1. We need char and wchar_t for compatibility with APIs.
2. We need good Unicode support.
3. We need a memory efficient representation of strings.
4. We need easy manipulation of strings.

There are two fundamental types of text data: a character and a string.
Also, Java uses two kinds of strings: a String class for storing strings,
and a StringBuffer for manipulating strings. This separation solves many
problems.

I believe that:

- A single character should be represented using 32-bit UCS-4 with native
endianness - like the wchar_t commonly seen on UNIX. It probably should be a
struct in order to avoid the overhead of a vtbl, and still support character
methods such as isUpper() and toUpper().

- A non-modifiable string should be stored using UTF-8. By non-modifiable I
mean that it does not allow individual characters to be manipulated, but it
does allow reassignment. Read-only forward character iterators could also be
supported in an efficient manner. As has already been stated, such strings
would in most cases be as memory efficient as C's char arrays. This also
addresses Walter's concern about performance issues with CPU caches. But it
also means that the concept of using arrays simply is not good enough. This
string class should also provide functionality such as a collate() method.

- A modifiable string should support manipulation of individual characters,
and could likely be an array of UCS-4 characters.

Methods should be provided for converting to/from char* and wchar_t*
(whether it is 16- or 32-bit) as needed for supporting C APIs. Some will
argue that this would involve too many conversions. However, if you are
using char* today on Windows, Windows will do this conversion all the time,
and you probably do not notice. And if it really becomes a bottleneck,
optimization would be simple in most cases - just cache the converted
string. And if you are only concerned with using C APIs - use the C string
functions such as strcat()/wcscat() or specialized classes.
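Present-day Phobos (well after this thread) ended up providing roughly these
conversions; a minimal sketch, shown only to illustrate converting at the API
boundary:

    import std.stdio;
    import std.utf : toUTF16, toUTF32;

    void main()
    {
        string  s8  = "héllo";       // UTF-8, as stored internally
        wstring s16 = toUTF16(s8);   // UTF-16, e.g. for Win32 "wide" APIs
        dstring s32 = toUTF32(s8);   // UCS-4 / UTF-32, one code point per element
        writeln(s8.length, " ", s16.length, " ", s32.length);   // 6 5 5
    }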

In addition, character encoders could be provided for whatever representation
is needed. I myself would like support for US-ASCII, EBCDIC, ISO-8859,
UTF-7, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, and US-ASCII/ISO-8859
with encoding of characters as in HTML (I don't remember what standard this
is called, but it specifies characters using "&somename;"). Others would
have different needs, so it should be simple to implement a new character
encoder/decoder.

Regards,
Martin M. Pedersen.

"globalization guy" <globalization_member pathlink.com> wrote in message
news:b05pdd$13bv$1 digitaldaemon.com...
 I think you'll be making a big mistake if you adopt C's obsolete char ==
byte
 concept of strings. Savvy language designers these days realize that, like
int's
 and float's, char's should be a fundamental data type at a higher-level of
 abstraction than raw bytes. The model that most modern language designers
are
 turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

 If you do so, you make it possible for strings in your language to have a
 single, canonical form that all APIs use. Instead of the nightmare that
C/C++
 programmers face when passing string parameters ("now, let's see, is this
a
 char* or a const char* or an ISO C++ string or an ISO wstring or a
wchar_t* or a
 char[] or a wchar_t[] or an instance of one of countless string
classes...?).
 The fact that not just every library but practically every project feels
the
 need to reinvent its own string type is proof of the need for a good,
solid,
 canonical form built right into the language.

 Most language designers these days either get this from the start of they
later
 figure it out and have to screw up their language with multiple string
types.
 Having canonical UTF-16 chars and strings internally does not mean that
you
 can't deal with other character encodings externally. You can can convert
to
 canonical form on import and convert back to some legacy encoding on
export.
 When you create the strings yourself, or when they are created in Java or
 Javascript or default XML or most new text protocols, no conversion will
be
 necessary. It will only be needed for legacy data (or a very lightweight
switch
 between UTF-8 and UTF-16). And for those cases where you have to work with
 legacy data and yet don't want to incur the overhead of encoding
conversion in
 and out, you can still treat the external strings as byte arrays instead
of
 strings, assuming you have a "byte" data type, and do direct byte
manipulation
 on them. That's essentially what you would have been doing anyway if you
had
 used the old char == byte model I see in your docs. You just call it
"byte"
 instead of "char" so it doesn't end up being your default string type.

 Having a modern UTF-16 char type, separate from arrays of "byte", gives
you a
 consistency that allows for the creation of great libraries (since text is
such

start, and
 their libraries universally use a single string type. Perl figured it out
pretty
 late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's
never
 clear which CPAN modules will work and which ones will fail, so you have
to use
 pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

 I hope you'll consider making this change to your design. Have an 8-bit
unsigned
 "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this
"8-bit
 char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff
or I'm
 quite sure you'll later regret it. C/C++ are in that sorry state for
legacy
 reasons only, not because their designers were foolish, but any new
language
 that intentionally copies that "design" is likely to regret that decision.
Jan 16 2003
prev sibling next sibling parent reply "Walter" <walter digitalmars.com> writes:
You make some great points. I have to ask, though, why UTF-16 as opposed to
UTF-8?
Jan 16 2003
parent reply globalization guy <globalization_member pathlink.com> writes:
In article <b08cdr$2fld$1 digitaldaemon.com>, Walter says...
You make some great points. I have to ask, though, why UTF-16 as opposed to
UTF-8?
Good question, and actually it's not an open and shut case. UTF-8 would not be
a big mistake, but it might not be quite as good as UTF-16.

The biggest reason I think UTF-16 has the edge is that I think you'll probably
want to treat your strings as arrays of characters on many occasions, and
that's *almost* as easy to do with UTF-16 as with ASCII. It's really not very
practical with UTF-8, though.

UTF-16 characters are almost always a single 16-bit code unit. Once in a
billion characters or so, you get a character that is composed of two
"surrogates". Sort of like half characters. Your code does have to keep this
exceptional case in mind and handle it when necessary, though that is usually
the type of problem you delegate to the standard library. In most cases, a
function can just think of each surrogate as a character and not worry that it
might be just half of the representation of a character -- as long as the two
don't get separated. In almost all cases, though, you can think of a character
as a single 16-bit entity, which is almost as simple as thinking of it as a
single 8-bit entity. You can do bit operations on them and other C-like things
and it should be very efficient.

Unlike UTF-16's two cases, one of which is very rare, UTF-8 has four cases,
three of which are very common. All of your code needs to do a good job with
those three cases. Only the fourth can be considered exceptional. (Of course it
has to be handled, too, but it is like the exceptional UTF-16 case, where you
don't have to optimize for it because it rarely occurs). Most strings will tend
to have mixed-width characters, so a model of an array of elements isn't a very
good one.

You can still implement your language with accessors that reach into a UTF-8
string and parse out the right character when you say "str[5]", but it will be
further removed from the physical implementation than if you use UTF-16. For a
somewhat lower-level language like "D", this probably isn't a very good fit.

The main benefit of UTF-8 is when exchanging text data with arbitrary external
parties. UTF-8 has no endianness problem, so you don't have to worry about the
*internal* memory model of the recipient. It has some other features that make
it easier to digest by legacy systems that can only handle ASCII. They won't
work right outside ASCII, but they'll often work for ASCII and they'll fail
more gracefully than would be the case with UTF-16 (that is likely to contain
embedded \0 bytes.)

None of these issues are relevant to your own program's *internal* text model.
Internally, you're not worried about endianness. (You don't worry about the
endianness of your int variables, do you?) You don't have to worry about losing
a byte in RAM, etc.

When talking to external APIs, you'll still have to output in a form that the
API can handle. Win32 APIs want UTF-16. Mac APIs want UTF-16. Java APIs want
UTF-16, as do .Net APIs. Unix APIs are problematic, since there are so many and
they aren't coordinated by a single body. Some will only be able to handle
ASCII, others will be upgraded to UTF-8. I don't think the Unix system APIs
will become UTF-16 because legacy is such a ball and chain in the Unix world,
but the process is underway to upgrade the standard system encoding for all
major Linux distributions to UTF-8.

If Linux APIs (and probably most Unix APIs eventually) are of primary
importance, UTF-8 is still a possibility. I'm not totally ruling it out. It
wouldn't hurt you much to use UTF-8 internally, but accessing strings as arrays
of characters would require sort of a virtual string model that doesn't match
the physical model quite as closely as you could get with UTF-16. The
additional abstraction might have more overhead than you would prefer
internally. If it's a choice between internal inefficiency and inefficiency
when calling external APIs, I would usually go for the latter.

Most language designers who understand internationalization have decided to go
with UTF-16 for languages that have their own rich set of internal libraries,
and they have mechanisms for calling external APIs that convert the string
encodings.
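To make the surrogate case concrete, here is a sketch in present-day D
(std.utf, which postdates this thread), shown only to illustrate the mechanism
being described:

    import std.stdio;
    import std.utf : count;

    void main()
    {
        wstring w = "A\U0001F600"w;      // 'A' plus a character outside the BMP
        writeln(w.length);                // 3 UTF-16 code units (1 + surrogate pair)
        writeln(count(w));                // 2 characters (code points)
        foreach (dchar c; w)              // foreach reassembles the surrogate pair
            writefln("U+%06X", cast(uint) c);
    }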
Jan 17 2003
parent reply "Walter" <walter digitalmars.com> writes:
I read your post with great interest. However, I'm leaning towards UTF-8 for
the following reasons (some of which you've covered):

1) In googling around and reading various articles, it seems that UTF-8 is
gaining momentum as the encoding of choice, including html.

2) Linux is moving towards UTF-8 permeating the OS. Doing UTF-8 in D means
that D will mesh naturally with Linux system api's.

3) Is Win32's "wide char" really UTF-16, including the multi word encodings?

4) I like the fact of no endianness issues, which is important when writing
files and transmitting text - it's much more important an issue than the
endianness of ints.

5) 16 bit accesses on Intel CPUs can be pretty slow compared to byte or
dword accesses (varies by CPU type).

6) Sure, UTF-16 reduces the frequency of multi character encodings, but the
code to deal with it must still be there and must still execute.

7) I've converted some large Java text processing apps to C++, and converted
the Java 16 bit char's to using UTF-8. That change resulted in *substantial*
performance improvements.

8) I suspect that 99% of the text processed in computers is ascii. UTF-8 is
a big win in memory and speed for processing english text.

9) A lot of diverse systems and lightweight embedded systems need to work
with 8 bit chars. Going to UTF-16 would, I think, reduce the scope of
applications and systems that D would be useful for. Going to UTF-8 would
make it as broad as possible.

10) Interestingly, making char[] in D to be UTF-8 does not seem to step on
or prevent dealing with wchar_t[] arrays being UTF-16.

11) I'm not convinced the char[i] indexing problem will be a big one. Most
operations done on ASCII strings remain unchanged for UTF-8, including
things like sorting & searching (see the sketch after the link below).

See http://www.cl.cam.ac.uk/~mgk25/unicode.html
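A sketch of why point 11 holds, using present-day D and std.string (purely
illustrative, not the 2003 library): a multi-byte UTF-8 sequence never contains
a byte below 0x80, so searching for ASCII needles with plain byte operations
still works:

    import std.stdio;
    import std.string : indexOf;

    void main()
    {
        string s = "naïve café";       // 'ï' and 'é' are two bytes each in UTF-8
        writeln(indexOf(s, "café"));   // 7  -- a byte offset, and a correct match
        writeln(indexOf(s, 'v'));      // 4  -- ASCII bytes never appear inside
                                       //       a multi-byte sequence
    }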

"globalization guy" <globalization_member pathlink.com> wrote in message
news:b09qpe$aff$1 digitaldaemon.com...
 In article <b08cdr$2fld$1 digitaldaemon.com>, Walter says...
You make some great points. I have to ask, though, why UTF-16 as opposed
to
UTF-8?
Good question, and actually it's not an open and shut case. UTF-8 would
not be a
 big mistake, but it might not be quite as good as UTF-16.

 The biggest reason I think UTF-16 has the edge is that I think you'll
probably
 want to treat your strings as arrays of characters on many occasions, and
that's
 *almost* as easy to do with UTF-16 as with ASCII. It's really not very
practical
 with UTF-8, though.

 UTF-16 characters are almost always a single 16-bit code unit. Once in a
billion
 characters or so, you get a character that is composed of two
"surrogates". Sort
 of like half characters. Your code does have to keep this exceptional case
in
 mind and handle it when necessary, though that is usually the type of
problem
 you delegate to the standard library. In most cases, a function can just
think
 of each surrogate as a character and not worry that it might be just half
of the
 representation of a character -- as long as the two don't get separated.
In
 almost all cases, though, you can think of a character as a single 16-bit
 entity, which is almost as simple as thinking of it as a single 8-bit
entity.
 You can do bit operations on them and other C-like things and it should be
very
 efficient.

 Unlike UTF-16's two cases, one of which is very rare, UTF-8 has four
cases,
 three of which are very common. All of your code needs to do a good job
with
 those three cases. Only the fourth can be considered exceptional. (Of
course it
 has to be handled, too, but it is like the exceptional UTF-16 case, where
you
 don't have to optimize for it because it rarely occurs). Most strings will
tend
 to have mixed-width characters, so a model of an array of elements isn't a
very
 good one.

 You can still implement your language with accessors that reach into a
UTF-8
 string and parse out the right character when you say "str[5]", but it
will be
 further removed from the physical implementation than if you use UTF-16.
For a
 somewhat lower-level language like "D", this probably isn't a very good
fit.
 The main benefit of UTF-8 is when exchanging text data with arbitrary
external
 parties. UTF-8 has no endianness problem, so you don't have to worry about
the
 *internal* memory model of the recipient. It has some other features that
make
 it easier to digest by legacy systems that can only handle ASCII. They
won't
 work right outside ASCII, but they'll often work for ASCII and they'll
fail more
 gracefully than would be the case with UTF-16 (that is likely to contain
 embedded \0 bytes.)

 None of these issues are relevant to your own program's *internal* text
model.
 Internally, you're not worried about endianness. (You don't worry about
the
 endianness of your int variables, do you?) You don't have to worry about
losing
 a byte in RAM, etc.

 When talking to external APIs, you'll still have to output in a form that
the
 API can handle. Win32 APIs want UTF-16. Mac APIs want UTF-16. Java APIs
want
 UTF-16, as do .Net APIs. Unix APIs are problematic, since there are so
many and
 they aren't coordinated by a single body. Some will only be able to handle
 ASCII, others will be upgraded to UTF-8. I don't think the Unix system
APIs will
 become UTF-16 because legacy is such a ball and chain in the Unix world,
but the
 process is underway to upgrade the standard system encoding for all major
Linux
 distributions to UTF-8.

 If Linux APIs (and probably most Unix APIs eventually) are of primary
 importance, UTF-8 is still a possibility. I'm not totally ruling it out.
It
 wouldn't hurt you much to use UTF-8 internally, but accessing strings as
arrays
 of characters would require sort of a virtual string model that doesn't
match
 the physical model quite as closely as you could get with UTF-16. The
additional
 abstraction might have more overhead than you would prefer internally. If
it's a
 choice between internal inefficiency and inefficiency when calling
external
 APIs, I would usually go for the latter.

 Most language designers who understand internationalization have decided
to go
 with UTF-16 for languages that have their own rich set of internal
libraries,
 and they have mechanisms for calling external APIs that convert the string
 encodings.
Jan 17 2003
next sibling parent reply Burton Radons <loth users.sourceforge.net> writes:
Walter wrote:
 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on
 or prevent dealing with wchar_t[] arrays being UTF-16.
You're planning on making this a part of char[]?  I was thinking of generating
a StringUTF8 instance during compilation, but whatever.

I think we should kill off wchar if we go in this direction.  The char/wchar
conflict is probably the worst part of D's design right now as it doesn't fit
well with the rest of the language (limited and ambiguous overloading), and it
would provide absolutely nothing that char doesn't already encapsulate.  If
you need different encodings, use a library.
 11) I'm not convinced the char[i] indexing problem will be a big one. Most
 operations done on ascii strings remain unchanged for UTF-8, including
 things like sorting & searching.
It's not such a speed hit any longer that all code absolutely must use slicing
and iterators to be useful.

12) UTF-8 doesn't embed ANY control characters, so it can interface with
unintelligent C libraries natively.  That's not a minor advantage when you're
trying to get people to switch to it!
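A minimal illustration of point 12 in present-day D (toStringz and the C
runtime binding are modern Phobos names, used here only as a sketch): no byte
of a multi-byte UTF-8 sequence ever looks like a NUL or any other ASCII byte,
so an oblivious C routine handles the string untouched.

    import std.string : toStringz;
    import core.stdc.stdio : puts;

    void main()
    {
        string s = "héllo, wörld";   // UTF-8; no NUL or control bytes inside
        puts(s.toStringz);           // an unintelligent C routine just passes the bytes through
    }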
Jan 17 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Burton Radons" <loth users.sourceforge.net> wrote in message
news:b0a6rl$i4m$1 digitaldaemon.com...
 Walter wrote:
 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on
 or prevent dealing with wchar_t[] arrays being UTF-16.
You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.
I think making char[] a UTF-8 is the right way.
 I think we should kill off wchar if we go in this direction.  The
 char/wchar conflict is probably the worst part of D's design right now
 as it doesn't fit well with the rest of the language (limited and
 ambiguous overloading), and it would provide absolutely nothing that
 char doesn't already encapsulate.  If you need different encodings, use
 a library.
I agree that the char/wchar conflict is a screwup in D's design, and one I've not been happy with. UTF-8 offers a way out. wchar_t should still be retained, though, for interfacing with the win32 api.
 11) I'm not convinced the char[i] indexing problem will be a big one. Most
 operations done on ascii strings remain unchanged for UTF-8, including
 things like sorting & searching.
It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.
Interestingly, if foreach is done right, iterating through char[] will work right, UTF-8 or not.
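That is in fact how it turned out: in present-day D (long after this thread)
the element type chosen in foreach decides whether you see raw code units or
decoded characters. A small sketch:

    import std.stdio;

    void main()
    {
        string s = "naïve";                  // 6 bytes of UTF-8
        foreach (char b; s)                  // element type char: raw bytes
            writef("%02X ", cast(ubyte) b);
        writeln();
        foreach (dchar c; s)                 // element type dchar: decoded UTF-8
            writef("U+%04X ", cast(uint) c);
        writeln();
    }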
 12) UTF-8 doesn't embed ANY control characters, so it can interface with
 unintelligent C libraries natively.  That's not a minor advantage when
 you're trying to get people to switch to it!
You're right.
Jan 17 2003
parent reply "Mike Wynn" <mike.wynn l8night.co.uk> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b0a7ft$iei$1 digitaldaemon.com...
 "Burton Radons" <loth users.sourceforge.net> wrote in message
 news:b0a6rl$i4m$1 digitaldaemon.com...
 Walter wrote:
 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on
 or prevent dealing with wchar_t[] arrays being UTF-16.
You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.
I think making char[] a UTF-8 is the right way.
I would be more in favor of a String class that was UTF-8 internally. The
problem with UTF-8 is that the number of bytes and the number of chars are
dependent on the data. char[] to me implies an array of chars, so

    char[] foo = "aa\u0555";

is 4 bytes, but only 3 chars. So what is foo[2]? And what if I set
foo[1] = '\u0467'? And what about wanting 8 bit ascii strings?

If you are going UTF-8 then think about the minor extension Java added to the
encoding by allowing a two byte 0, which allows embedded 0 in strings without
messing up the C strlen (which returns the byte length).
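For what it's worth, present-day D answers these questions as follows; a
sketch using modern Phobos (std.utf.count is a real function, the rest is only
illustration):

    import std.stdio;
    import std.utf : count;

    void main()
    {
        string foo = "aa\u0555";
        writeln(foo.length);                      // 4  -- bytes of UTF-8
        writeln(count(foo));                      // 3  -- characters (code points)
        writefln("0x%02X", cast(ubyte) foo[2]);   // 0xD5 -- just the first byte of U+0555
    }

In other words, indexing stays byte-based and cheap, while counting characters
is an explicit, linear-time operation.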
 I think we should kill off wchar if we go in this direction.  The
 char/wchar conflict is probably the worst part of D's design right now
 as it doesn't fit well with the rest of the language (limited and
 ambiguous overloading), and it would provide absolutely nothing that
 char doesn't already encapsulate.  If you need different encodings, use
 a library.
 I agree that the char/wchar conflict is a screwup in D's design, and one
 I've not been happy with. UTF-8 offers a way out. wchar_t should still be
 retained, though, for interfacing with the win32 api.

 11) I'm not convinced the char[i] indexing problem will be a big one. Most
 operations done on ascii strings remain unchanged for UTF-8, including
 things like sorting & searching.
It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.
 Interestingly, if foreach is done right, iterating through char[] will work
 right, UTF-8 or not.
 12) UTF-8 doesn't embed ANY control characters, so it can interface with
 unintelligent C libraries natively.  That's not a minor advantage when
 you're trying to get people to switch to it!
You're right.
Jan 17 2003
parent "Walter" <walter digitalmars.com> writes:
UTF-8 does lead to the problem of what is meant by:

    char[] c;
    c[5]

Is it the 5th byte of c[], or the 5th decoded 32 bit character? Saying it's
the 5th decoded character has all kinds of implications for slicing and
.length.
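For comparison, in present-day D the answer became "the 5th byte"; getting the
5th decoded character is an explicit index conversion. A sketch (std.utf's
toUTFindex is a real Phobos function; the example itself is only illustrative):

    import std.stdio;
    import std.utf : toUTFindex;

    void main()
    {
        string c = "añcdéf";
        writefln("0x%02X", cast(ubyte) c[5]);  // 0xC3 -- first byte of 'é': c[5] is a byte
        size_t i = toUTFindex(c, 5);           // byte index where code point #5 starts
        writeln(c[i]);                         // 'f' -- the 5th decoded character (0-based)
    }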

8 bit ascii isn't a problem, just cast it to a byte[], as in:
    byte[] b = cast(byte[])c;

I'm not sure about the Java 00 issue; I didn't think Java supported UTF-8. D
does not have the "what to do about embedded 0" problem, as the length is
carried along separately.

"Mike Wynn" <mike.wynn l8night.co.uk> wrote in message
news:b0a8eg$ivc$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:b0a7ft$iei$1 digitaldaemon.com...
 "Burton Radons" <loth users.sourceforge.net> wrote in message
 news:b0a6rl$i4m$1 digitaldaemon.com...
 Walter wrote:
 10) Interestingly, making char[] in D to be UTF-8 does not seem to
step
 on
 or prevent dealing with wchar_t[] arrays being UTF-16.
You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.
I think making char[] a UTF-8 is the right way.
I would be more in favor of a String class that was utf8 internally the problem with utf8 is that the the number of bytes and the number of chars are dependant on the data char[] to me implies an array of char's so char [] foo ="aa"\0x0555; is 4 bytes, but only 3 chars so what is foo[2] ? and what if I set foo[1] = \0x467; and what about wanting 8 bit ascii strings ? if you are going UTF8 then think about the minor extension Java added to
the
 encoding by allowing a two byte 0, which allows embedded 0 in strings
 without messing up the C strlen (which returns the byte length).
Jan 17 2003
prev sibling next sibling parent reply "Mike Wynn" <mike.wynn l8night.co.uk> writes:
 6) Sure, UTF-16 reduces the frequency of multi character encodings, but the
 code to deal with it must still be there and must still execute.
I was under the impression UTF-16 was glyph based, so each char (16 bits) was
a glyph of some form. Not all glyphs cause the graphics to move to the next
char, so accents can be encoded as a postfix to the char they are over/under,
and charsets like Chinese have sequences that generate the correct visual
representation.

UTF-8 is just a way to encode UTF-16 so that it is compatible with ascii:
0..127 map to 0..127, then 128..256 are used as special values identifying
multi-byte values. The string can be processed as 8bit ascii by software
without problem; only the visual representation changes, since 128..256 on dos
are the box drawing and intl chars.

However, a 3 UTF-16 char sequence will encode to 3 UTF-8 encoded sequences, and
if they are all >127 then that would be 6 or more bytes. So if you consider the
3 UTF-16 values to be one "char", then the UTF-8 side should also consider the
6 or more byte sequence as one "char" rather than 3 "chars".
Jan 17 2003
parent reply "Serge K" <skarebo programmer.net> writes:
 I was under the impression UTF-16 was glyph based, so each char (16bits) was
 a glyph of some form, not all glyph cause the graphics to move to the next
 char, so accents can be encoded as a postfix to the char they are over/under
 and charsets like chinesse have sequences that generate the correct visual
 reprosentation;
First, UTF-16 is just one of the many standard encodings for Unicode.
UTF-16 allows more than 16-bit characters - with surrogates it can represent
all >1M codes.
(Unicode v2 used UCS-2, which is a 16-bit-only encoding.)
 I was under the impression UTF-16 was glyph based
from The Unicode Standard, ch2 General Structure
http://www.unicode.org/uni2book/ch02.pdf

"Characters, not glyphs - The Unicode Standard encodes characters, not glyphs.
The Unicode Standard draws a distinction between characters, which are the
smallest components of written language that have semantic value, and glyphs,
which represent the shapes that characters can have when they are rendered or
displayed. Various relationships may exist between characters and glyphs: a
single glyph may correspond to a single character, or to a number of
characters, or multiple glyphs may result from a single character."

btw, there are many precomposed characters in the Unicode which can be
represented with combining characters as well. ( [â] and [a,(combining ^)] -
equally valid representations for [a with circumflex] ).
Jan 16 2003
parent "Mike Wynn" <mike.wynn l8night.co.uk> writes:
 First, UTF-16 is just one of the many standard encodings for the Unicode.
 UTF-16 allows more then 16bit characters - with surrogates it can represent
 all >1M codes.
 (Unicode v2 used UCS-2 which is 16bit-only encoding)
right, me getting confused. too many tla's too many standards (as ever).
 I was under the impression UTF-16 was glyph based
from The Unicode Standard, ch2 General Structure
http://www.unicode.org/uni2book/ch02.pdf

"Characters, not glyphs - The Unicode Standard encodes characters, not glyphs.
The Unicode Standard draws a distinction between characters, which are the
smallest components of written language that have semantic value, and glyphs,
which represent the shapes that characters can have when they are rendered or
displayed. Various relationships may exist between characters and glyphs: a
single glyph may correspond to a single character, or to a number of
characters, or multiple glyphs may result from a single character."

btw, there are many precomposed characters in the Unicode which can be
represented with combining characters as well. ( [â] and [a,(combining ^)] -
equally valid representations for [a with circumflex] ).
so if I read this right ... (been using UTF-8 for ages and ignored what it
represents, keeps me sane (er) ) I can't understand arabic file names anyway :)

so a string (no matter how it's encoded) contains 3 lengths: the byte length,
the number of unicode entities (16 bit UCS-2), and the number of "characters".

so cât as UTF-8 is 4 bytes, as UTF-16 is 6 bytes, it's 3 UCS-2 entities, and 3
"characters"; but if the â was [a,(combining ^)] and not the single â UCS-2
value, then cât would be 5 bytes as UTF-8, 8 bytes as UTF-16, 4 UCS-2 entities,
but still 3 "characters".

which is why I think String should be a class, not a thing[]. you should be
able to get a utf8 encoded byte[], utf-16 short[], UCS-2 short[] (for
win32/api), (32 bit unicode) int[] (for linux) and ideally a Character[] from
the string. how a String is stored (utf8, utf16 or 32bit/64bit values) is only
relevant for performance, and different people will want different internal
representations, but semantically they should all be the same.

this is all another reason why I also think that arrays should be templated
classes that have an index method (operator []) so the Character[] from the
string can modify the String it represents.

Mike.
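Those lengths (plus user-visible characters as a fourth) can be measured
directly in present-day D; a sketch using std.uni and std.utf from modern
Phobos, shown only to illustrate the distinctions Mike is drawing:

    import std.stdio;
    import std.range : walkLength;
    import std.uni : byGrapheme, normalize;
    import std.utf : count;

    void main()
    {
        string composed   = "c\u00E2t";    // "cât" with â as one code point
        string decomposed = "ca\u0302t";   // "cât" as a + combining circumflex
        // bytes, code points, user-visible characters (graphemes)
        writeln(composed.length,   " ", count(composed),   " ", composed.byGrapheme.walkLength);   // 4 3 3
        writeln(decomposed.length, " ", count(decomposed), " ", decomposed.byGrapheme.walkLength); // 5 4 3
        writeln(normalize(decomposed) == composed);   // true -- NFC composition is the default
    }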
Jan 18 2003
prev sibling parent reply "Serge K" <skarebo programmer.net> writes:
 3) Is Win32's "wide char" really UTF-16, including the multi word
encodings?

WinXP, WinCE : UTF-16
Win2K : was UCS-2, but some service pack made it UTF-16
WinNT4 : UCS-2
Win9x : must die.
 5) 16 bit accesses on Intel CPUs can be pretty slow compared to byte or
 dword accesses (varies by CPU type).
16bit prefix can slow down instruction decoding (mostly for Intel CPUs, but P4 uses pre-decoded instructions anyhow), while instruction processing is more cache-branch-sensitive.
 6) Sure, UTF-16 reduces the frequency of multi character encodings, but the
 code to deal with it must still be there and must still execute.
Just an idea: the string class may have 2 values for the string length:
1 - number of "units" ( 8 bit for UTF-8, 16 bit for UTF-16 )
2 - number of characters.
If these numbers are equal, the string processing library may use simplified
and faster functions.
 7) I've converted some large Java text processing apps to C++, and converted
 the Java 16 bit char's to using UTF-8. That change resulted in *substantial*
 performance improvements.

 8) I suspect that 99% of the text processed in computers is ascii. UTF-8 is
 a big win in memory and speed for processing english text.
You think that 99% of computer users are English speaking? Think again...

btw, something about UTF-8 & UTF-16 efficiency:
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results

For latin script based languages - UTF-8 takes ~51% less space than UTF-16.
For greek (expect the same for cyrillic) - ~88% - not that much better than UTF-16.
For japanese, chinese, korean, hindi - 115%..140% - UTF-16 is more space efficient.
Jan 16 2003
parent "Walter" <walter digitalmars.com> writes:
"Serge K" <skarebo programmer.net> wrote in message
news:b0anmt$r7g$1 digitaldaemon.com...
 3) Is Win32's "wide char" really UTF-16, including the multi word
encodings?
WinXP, WinCE : UTF-16
Win2K : was UCS-2, but some service pack made it UTF-16
WinNT4 : UCS-2
Win9x : must die.
LOL! Looking forward, then, one can treat it as UTF-16.
 8) I suspect that 99% of the text processed in computers is ascii. UTF-8 is
 a big win in memory and speed for processing english text.
 You think that 99% of computer users are English speaking?
Not at all. But the text processed - yes. I imagine it would be pretty tough
to come by figures for that which are better than speculation.
 something about UTF-8 & UTF-16 efficiency:
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results
 For latin script based languages - UTF-8 takes ~51% less space than UTF-16.
 For greek (expect the same for cyrillic) - ~88% - not that better than UTF-16.
 For japanese, chinese, korean, hindi - 115%..140% - UTF-16 is more space efficient.
Thanks for the info. That's about what I would have guessed. Another valuable statistic would be how well UTF-8 compressed with LZW as opposed to the same thing in UTF-16.
Jan 17 2003
prev sibling parent reply "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> writes:
"globalization guy" <globalization_member pathlink.com> escreveu na mensagem
news:b05pdd$13bv$1 digitaldaemon.com...
 I think you'll be making a big mistake if you adopt C's obsolete char ==
byte
 concept of strings. Savvy language designers these days realize that, like
int's
 and float's, char's should be a fundamental data type at a higher-level of
 abstraction than raw bytes. The model that most modern language designers
are
 turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

 If you do so, you make it possible for strings in your language to have a
 single, canonical form that all APIs use. Instead of the nightmare that
C/C++
 programmers face when passing string parameters ("now, let's see, is this
a
 char* or a const char* or an ISO C++ string or an ISO wstring or a
wchar_t* or a
 char[] or a wchar_t[] or an instance of one of countless string
classes...?).
 The fact that not just every library but practically every project feels
the
 need to reinvent its own string type is proof of the need for a good,
solid,
 canonical form built right into the language.

 Most language designers these days either get this from the start of they
later
 figure it out and have to screw up their language with multiple string
types.
 Having canonical UTF-16 chars and strings internally does not mean that
you
 can't deal with other character encodings externally. You can can convert
to
 canonical form on import and convert back to some legacy encoding on
export.
 When you create the strings yourself, or when they are created in Java or
 Javascript or default XML or most new text protocols, no conversion will
be
 necessary. It will only be needed for legacy data (or a very lightweight
switch
 between UTF-8 and UTF-16). And for those cases where you have to work with
 legacy data and yet don't want to incur the overhead of encoding
conversion in
 and out, you can still treat the external strings as byte arrays instead
of
 strings, assuming you have a "byte" data type, and do direct byte
manipulation
 on them. That's essentially what you would have been doing anyway if you
had
 used the old char == byte model I see in your docs. You just call it
"byte"
 instead of "char" so it doesn't end up being your default string type.

 Having a modern UTF-16 char type, separate from arrays of "byte", gives
you a
 consistency that allows for the creation of great libraries (since text is
such

start, and
 their libraries universally use a single string type. Perl figured it out
pretty
 late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's
never
 clear which CPAN modules will work and which ones will fail, so you have
to use
 pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

 I hope you'll consider making this change to your design. Have an 8-bit
unsigned
 "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this
"8-bit
 char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff
or I'm
 quite sure you'll later regret it. C/C++ are in that sorry state for
legacy
 reasons only, not because their designers were foolish, but any new
language
 that intentionally copies that "design" is likely to regret that decision.
Hi,

There was a thread a year ago in the smalleiffel mailing list (starting at
http://groups.yahoo.com/group/smalleiffel/message/4075 ) about unicode strings
in Eiffel. It's a quite interesting read about the problems of adding
string-like Unicode classes. The main point is that true Unicode support is
very difficult to achieve; just some libraries provide good, correct and
complete unicode encoders/decoders/renderers/etc.

While I agree that some Unicode support is a necessity today (my mother tongue
is brazilian portuguese so I use non-ascii characters everyday), we can't just
add some base types and pretend everything is all right. We won't correct
incorrectly written code with a primitive unicode string. Most programmers
don't think about unicode when they develop their software, so almost every
line of code dealing with text contains some assumptions about the character
sets being used. Java has a primitive 16 bit char, but basic library functions
(because they need good performance) use incorrect code for string handling
stuff (the correct classes are in java.text, providing means to correctly
collate strings). Sometimes we are just using plain old ASCII but we're bitten
by the encoding issues. And when we need true unicode support, the libraries
trick us into believing everything is ok.

IMO D should support a simple char array to deal with ASCII (as it does today)
and some kind of standard library module to deal with unicode glyphs and text.
This could be included in phobos or even in deimos. Any volunteers? With this
we could force the programmer to deal with another set of tools (albeit
similar) when dealing with each kind of string: ASCII or unicode. This module
should allow creation of variable sized strings and glyphs through an opaque
ADT. Each kind of usage has different semantics and optimization strategies
(e.g. Boyer-Moore is good for ASCII but with unicode the space and time usage
are worse).

Best regards,
Daniel Yokomiso.

P.S.: I had to write some libraries and components (EJBs) in several Java
projects to deal with data-transfer in plain ASCII (communication with IBM
mainframes). Each day I dreamed of using a language with simple one byte
character strings, without problems with encoding and endianness (Solaris vs.
Linux vs. Windows NT have some nice "features" in their JVMs if you aren't
careful when writing Java code that uses "ASCII" Strings). But Java has a 16
bit character type and a SIGNED byte type, both awkward for this usage. A
language shouldn't get in the way of simple code.

"Never argue with an idiot. They drag you down to their level then beat you
with experience."
Jan 17 2003
parent reply "Walter" <walter digitalmars.com> writes:
I once wrote a large project that dealt with mixed ascii and unicode. There
was bug after bug when the two collided. Finally, I threw in the towel and
made the entire program unicode - every string in it.

The trouble in D is that in the current scheme, everything dealing with text
has to be written twice, once for char[] and again for wchar_t[]. In C,
there's that wretched tchar.h to swap back and forth. It may just be easier
in the long run to just make UTF-8 the native type, and then at least try
and make sure the standard D library is correct.

-Walter
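
For concreteness, here is a minimal sketch of the "write it once" idea in the
D that later took this route (the foreach form and the std.stdio import are
assumptions relative to this 2003 discussion, not anything proposed above):
one routine serves every UTF-8 char[] string, and only code that cares about
code points has to look at the byte patterns.

    import std.stdio : writeln;

    // Count Unicode code points in a UTF-8 char[] by counting the bytes
    // that are not continuation bytes (10xxxxxx).
    size_t codePointCount(const(char)[] s)
    {
        size_t n = 0;
        foreach (char c; s)
            if ((c & 0xC0) != 0x80)
                ++n;
        return n;
    }

    void main()
    {
        const(char)[] s = "naïve";                   // 6 bytes, 5 code points
        writeln(s.length, " ", codePointCount(s));   // prints: 6 5
    }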

"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
news:b0agdq$ni9$1 digitaldaemon.com...
 "globalization guy" <globalization_member pathlink.com> escreveu na
mensagem
 news:b05pdd$13bv$1 digitaldaemon.com...
 I think you'll be making a big mistake if you adopt C's obsolete char ==
byte
 concept of strings. Savvy language designers these days realize that,
like
 int's
 and float's, char's should be a fundamental data type at a higher-level
of
 abstraction than raw bytes. The model that most modern language
designers
 are
 turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

 If you do so, you make it possible for strings in your language to have
a
 single, canonical form that all APIs use. Instead of the nightmare that
C/C++
 programmers face when passing string parameters ("now, let's see, is
this
 a
 char* or a const char* or an ISO C++ string or an ISO wstring or a
wchar_t* or a
 char[] or a wchar_t[] or an instance of one of countless string
classes...?).
 The fact that not just every library but practically every project feels
the
 need to reinvent its own string type is proof of the need for a good,
solid,
 canonical form built right into the language.

 Most language designers these days either get this from the start of
they
 later
 figure it out and have to screw up their language with multiple string
types.
 Having canonical UTF-16 chars and strings internally does not mean that
you
 can't deal with other character encodings externally. You can can
convert
 to
 canonical form on import and convert back to some legacy encoding on
export.
 When you create the strings yourself, or when they are created in Java
or

 Javascript or default XML or most new text protocols, no conversion will
be
 necessary. It will only be needed for legacy data (or a very lightweight
switch
 between UTF-8 and UTF-16). And for those cases where you have to work
with
 legacy data and yet don't want to incur the overhead of encoding
conversion in
 and out, you can still treat the external strings as byte arrays instead
of
 strings, assuming you have a "byte" data type, and do direct byte
manipulation
 on them. That's essentially what you would have been doing anyway if you
had
 used the old char == byte model I see in your docs. You just call it
"byte"
 instead of "char" so it doesn't end up being your default string type.

 Having a modern UTF-16 char type, separate from arrays of "byte", gives
you a
 consistency that allows for the creation of great libraries (since text
is
 such

start, and
 their libraries universally use a single string type. Perl figured it
out
 pretty
 late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's
never
 clear which CPAN modules will work and which ones will fail, so you have
to use
 pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

 I hope you'll consider making this change to your design. Have an 8-bit
unsigned
 "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this
"8-bit
 char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff
or I'm
 quite sure you'll later regret it. C/C++ are in that sorry state for
legacy
 reasons only, not because their designers were foolish, but any new
language
 that intentionally copies that "design" is likely to regret that
decision.

 Hi,

     There was a thread a year ago in the smalleiffel mailing list
(starting
 at http://groups.yahoo.com/group/smalleiffel/message/4075 ) about unicode
 strings in Eiffel. It's a quite interesting read about the problems of
 adding string-like Unicode classes. The main point is that true Unicode
 support is very difficult to achieve just some libraries provide good,
 correct and complete unicode encoders/decoders/renderers/etc.
     While I agree that some Unicode support is a necessity today (main
 mother tongue is brazilian portuguese so I use non-ascii characters
 everyday), we can't just add some base types and pretend everything is
 allright. We won't correct incorrect written code with a primitive unicode
 string. Most programmers don't think about unicode when they develop their
 software, so almost every line of code dealing with texts contain some
 assumptions about the character sets being used. Java has a primitive 16
bit
 char, but basic library functions (because they need good performance) use
 incorrect code for string handling stuff (the correct classes are in
 java.text, providing means to correctly collate strings). Some times we
are
 just using plain old ASCII but we're bitten by the encoding issues. And
when
 we need to deal with true unicode support the libraries tricky us into
 believing everything is ok.
     IMO D should support a simple char array to deal with ASCII (as it
does
 today) and some kind of standard library module to deal with unicode
glyphs
 and text. This could be included in phobos or even in deimos. Any
 volunteers? With this we could force the programmer to deal with another
set
 of tools (albeit similar) when dealing with each kind of string: ASCII or
 unicode. This module should allow creation of variable sized string and
 glyphs through an opaque ADT. Each kind of usage has different semantics
and
 optimization strategies (e.g. Boyer-Moore is good for ASCII but with
unicode
 the space and time usage are worse).

     Best regards,
     Daniel Yokomiso.

 P.S.: I had to written some libraries and components (EJBs) in several
Java
 projects to deal with data-transfer in plain ASCII (communication with IBM
 mainframes). Each day I dreamed of using a language with simple one byte
 character strings, without problems with encoding and endianess (Solaris
vs.
 Linux vs. Windows NT have some nice "features" in their JVMs if you aren't
 careful when writing Java code that uses "ASCII" String). But Java has a
16
 bit character type and a SIGNED byte type, both awkward for this usage. A
 language shouldn't get in the way of simple code.

 "Never argue with an idiot. They drag you down to their level then beat
you
 with experience."


 ---
 Outgoing mail is certified Virus Free.
 Checked by AVG anti-virus system (http://www.grisoft.com).
 Version: 6.0.443 / Virus Database: 248 - Release Date: 11/1/2003
Jan 17 2003
next sibling parent reply "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> writes:
"Walter" <walter digitalmars.com> escreveu na mensagem
news:b0b0up$vk7$1 digitaldaemon.com...
 I once wrote a large project that dealt with mixed ascii and unicode. There
 was bug after bug when the two collided. Finally, I threw in the towel and
 made the entire program unicode - every string in it.

 The trouble in D is that in the current scheme, everything dealing with text
 has to be written twice, once for char[] and again for wchar_t[]. In C,
 there's that wretched tchar.h to swap back and forth. It may just be easier
 in the long run to just make UTF-8 the native type, and then at least try
 and make sure the standard D library is correct.

 -Walter
[snip]

Hi,

    Current D uses char[] as the string type. If we declare each char to be
UTF-8 we'll have all the problems of what "myString[13] = someChar;" means.
I think an opaque string datatype may be better in this case. We could have
a glyph datatype that represents one unicode glyph in UTF-8 encoding, and
use it together with a string class. Also I don't think a mutable string
type is a good idea. Python and Java use immutable strings, and this leads
to better programs (you don't need to worry about copying your strings when
you get or give them). Some nice tricks, like caching hashCode results for
strings, are possible because the values won't change. We could also provide
a mutable string class.
    If this is the way to go we need lots of test cases, especially from
people with experience writing unicode libraries. The Unicode spec has lots
of particularities, like correct regular expression support, that may lead
to subtle bugs.

    Best regards,
    Daniel Yokomiso.

"Before you criticize someone, walk a mile in their shoes. That way you're a
mile away and you have their shoes, too."
Jan 18 2003
next sibling parent Theodore Reed <rizen surreality.us> writes:
On Sat, 18 Jan 2003 12:51:42 -0300
"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote:

      Current D uses char[] as the string type. If we declare each char to be
  UTF-8 we'll have all the problems with what does "myString[13] =
  someChar;" means. I think a opaque string datatype may be better in
  this case. We could have a glyph datatype that represents one unicode
  glyph in UTF-8 encoding, and use it together with a string class. Also
So what does "myString[13] = someGlyph" mean? char doesn't have to be a byte, we can have another data byte for that. -- Theodore Reed (rizen/bancus) -==- http://www.surreality.us/ ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~ "Yesterday no longer exists Tomarrow's forever a day away And we are cell-mates, held together in the shoreless stream that is today."
Jan 18 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
news:b0bpq9$1d3d$1 digitaldaemon.com...
     Current D uses char[] as the string type. If we declare each char to be
 UTF-8 we'll have all the problems with what does "myString[13] = someChar;"
 means. I think a opaque string datatype may be better in this case. We could
 have a glyph datatype that represents one unicode glyph in UTF-8 encoding,
 and use it together with a string class.
I'm thinking that myString[13] should simply set the byte at myString[13].
Trying to fiddle with the multibyte stuff with simple array access semantics
just looks to be too confusing and error prone. Access to the unicode
characters in it would be via a function or property.
 Also I don't think a mutable string
 type is a good idea. Python and Java use immutable strings, and this leads
 to better programs (you don't need to worry about copying your strings when
 you get or give them). Some nice tricks, like caching hashCode results for
 strings are possible, because the values won't change. We could also provide
 a mutable string class.
I think the copy-on-write approach to strings is the right idea.
Unfortunately, if done by the language semantics, it can have severe adverse
performance results (think of a toupper() function, copying the string again
each time a character is converted). Using it instead as a coding style,
which is currently how it's done in Phobos, seems to work well. My
javascript implementation (DMDScript) does cache the hash for each string,
and that works well for the semantics of javascript. But I don't think it is
appropriate for a lower-level language like D to do as much for strings.
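
A sketch of copy-on-write as a coding convention rather than a language rule
(the helper name is assumed and it is ASCII-only to stay short; this is not
the actual Phobos source): the input is aliased until the first character
that actually changes, so an already-uppercase string is never copied.

    // Returns the input unchanged if no character needs converting,
    // otherwise copies it exactly once and edits the copy in place.
    char[] toUpperCOW(char[] s)
    {
        char[] r = s;                      // alias the input to start with
        foreach (i, char c; s)
        {
            if (c >= 'a' && c <= 'z')
            {
                if (r is s)
                    r = s.dup;             // first change: copy once
                r[i] = cast(char)(c - ('a' - 'A'));
            }
        }
        return r;
    }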
     If this is the way to go we need lots of test cases, specially from
 people with experience writing unicode libraries. The Unicode spec has lots
 of particularities, like correct regular expression support, that may lead
 to subtle bugs.
Regular expression implementations naturally lend themselves to subtle bugs :-(. Having a good test suite is a lifesaver.
Jan 18 2003
next sibling parent reply Burton Radons <loth users.sourceforge.net> writes:
Walter wrote:
 "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
 news:b0bpq9$1d3d$1 digitaldaemon.com...
 
Current D uses char[] as the string type. If we declare each char to be
UTF-8 we'll have all the problems with what does "myString[13] = someChar;"
means. I think a opaque string datatype may be better in this case. We could
have a glyph datatype that represents one unicode glyph in UTF-8 encoding,
and use it together with a string class.
I'm thinking that myString[13] should simply set the byte at myString[13].
Trying to fiddle with the multibyte stuff with simple array access semantics
just looks to be too confusing and error prone. To access the unicode
characters from it would be via a function or property.
I disagree. Returning the character makes indexing expensive, but it has the
expected result and for the most part hides the fact that compaction is
going on automatically; the only rule change is that indexed assignment can
invalidate any slices and copies, which isn't any worse than D's current
rules. Then char.size will be 4 and char.max will be 0x10FFFF or 0x7FFFFFFF,
depending upon whether we use UNICODE or ISO-10646 for our UTF-8.

I also think that incrementing a char pointer should read the data to
determine how many bytes it needs to skip. It should be as transparent as
possible!

If it can't be transparent, then it should use a class or be limited: no
indexing, no char pointers. I don't like either option.

[snip]
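
A sketch of the decoding step such transparency needs (the names are
hypothetical and malformed input is not validated): the lead byte tells how
many bytes the character occupies, so indexing can return a full code point
and "increment the pointer" means "skip stride bytes".

    struct Decoded { dchar c; size_t stride; }

    // Decode the code point that starts at byte offset i of a well-formed
    // UTF-8 array, along with the number of bytes it occupies.
    Decoded decodeAt(const(char)[] s, size_t i)
    {
        ubyte b = cast(ubyte) s[i];
        if (b < 0x80)
            return Decoded(cast(dchar) b, 1);        // plain ASCII
        size_t len = (b & 0xE0) == 0xC0 ? 2
                   : (b & 0xF0) == 0xE0 ? 3 : 4;
        uint c = b & (0x7F >> len);                  // payload bits of the lead byte
        foreach (j; 1 .. len)
            c = (c << 6) | (cast(ubyte) s[i + j] & 0x3F);  // 6 bits per trailing byte
        return Decoded(cast(dchar) c, len);
    }

With something along these lines, reading "character at offset i" is
decodeAt(s, i).c, and advancing by one character skips decodeAt(s, i).stride
bytes.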
Jan 18 2003
parent "Walter" <walter digitalmars.com> writes:
"Burton Radons" <loth users.sourceforge.net> wrote in message
news:b0cgdd$1t4o$1 digitaldaemon.com...
 [snip]
Obviously, this needs more thought by me.
Jan 18 2003
prev sibling next sibling parent reply "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> writes:
"Walter" <walter digitalmars.com> escreveu na mensagem
news:b0c66n$1mq6$1 digitaldaemon.com...
 "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
 news:b0bpq9$1d3d$1 digitaldaemon.com...
     Current D uses char[] as the string type. If we declare each char to
be
 UTF-8 we'll have all the problems with what does "myString[13] =
someChar;"
 means. I think a opaque string datatype may be better in this case. We
could
 have a glyph datatype that represents one unicode glyph in UTF-8
encoding,
 and use it together with a string class.
I'm thinking that myString[13] should simply set the byte at myString[13]. Trying to fiddle with the multibyte stuff with simple array access
semantics
 just looks to be too confusing and error prone. To access the unicode
 characters from it would be via a function or property.
That's why I think it should be an opaque, immutable data type.
 Also I don't think a mutable string
 type is a good idea. Python and Java use immutable strings, and this leads
 to better programs (you don't need to worry about copying your strings when
 you get or give them). Some nice tricks, like caching hashCode results for
 strings are possible, because the values won't change. We could also provide
 a mutable string class.
I think the copy-on-write approach to strings is the right idea.
Unfortunately, if done by the language semantics, it can have severe adverse
performance results (think of a toupper() function, copying the string again
each time a character is converted). Using it instead as a coding style,
which is currently how it's done in Phobos, seems to work well. My
javascript implementation (DMDScript) does cache the hash for each string,
and that works well for the semantics of javascript. But I don't think it is
appropriate for a lower-level language like D to do as much for strings.

     If this is the way to go we need lots of test cases, specially from
 people with experience writing unicode libraries. The Unicode spec has lots
 of particularities, like correct regular expression support, that may lead
 to subtle bugs.
Regular expression implementations naturally lend themselves to subtle bugs
:-(. Having a good test suite is a lifesaver.
    Not if you write a "correct" regular expression implementation. If you
implement it right from scratch using simple NFAs you probably won't have
any headaches. I've implemented a toy regex machine in Java based on Mark
Jason Dominus' excellent article "How Regexes Work" at
http://perl.plover.com/Regex/ It's very simple and quite fast as it's a dumb
implementation without any kind of optimizations (4 times slower than a fast
bytecode regex interpreter in Java,
http://jakarta.apache.org/regexp/index.html). Also the source code is a lot
cleaner. BTW I've written a unit test suite based on the Jakarta Regexp set
of tests. I can port it to D if you like and use it with your regex
implementation.
Jan 18 2003
parent "Walter" <walter digitalmars.com> writes:
"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
news:b0cond$222q$1 digitaldaemon.com...
 BTW I've written a unit test suite based on Jakarta Regexp set of tests. I
 can port it to D if you like and use it with your regex implementation.
At the moment I'm using Spencer's regex test suite augmented with a bunch of new test vectors. More testing is better, so yes I'm interested in better & more comprehensive tests.
Jan 18 2003
prev sibling parent reply "Sean L. Palmer" <seanpalmer directvinternet.com> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b0c66n$1mq6$1 digitaldaemon.com...
 I think the copy-on-write approach to strings is the right idea.
 Unfortunately, if done by the language semantics, it can have severe adverse
 performance results (think of a toupper() function, copying the string again
 each time a character is converted). Using it instead as a coding style,
Copy-on-write usually doesn't copy unless there's more than one live
reference to the string.  If you're actively modifying it, it'll only make
one copy until you distribute the new reference.  Of course that means
reference counting.  Perhaps the GC could store info about string use.
Jan 19 2003
parent Ilya Minkov <midiclub 8ung.at> writes:
Sean L. Palmer wrote:
 "Walter" <walter digitalmars.com> wrote in message
 news:b0c66n$1mq6$1 digitaldaemon.com...
 
I think the copy-on-write approach to strings is the right idea.
Unfortunately, if done by the language semantics, it can have severe adverse
performance results (think of a toupper() function, copying the string again
each time a character is converted). Using it instead as a coding style,
Copy-on-write usually doesn't copy unless there's more than one live
reference to the string.  If you're actively modifying it, it'll only make
one copy until you distribute the new reference.  Of course that means
reference counting.  Perhaps the GC could store info about string use.
That's not gonna work, because there's no reliable way you can get this data
from the GC outside a mark phase. The Delphi string implementation is
ref-counted, and is said to be extremely slow. So it's better to copy and
forget the rest than to count at every assignment. You'll just have one more
reason to optimise the GC then. :)

IMO, the amount of copying should be limited by merging the operations
together.
Jan 20 2003
prev sibling parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b0b0up$vk7$1 digitaldaemon.com...
 I once wrote a large project that dealt with mixed ascii and unicode. There
 was bug after bug when the two collided. Finally, I threw in the towel and
 made the entire program unicode - every string in it.

 The trouble in D is that in the current scheme, everything dealing with text
 has to be written twice, once for char[] and again for wchar_t[]. In C,
 there's that wretched tchar.h to swap back and forth. It may just be easier
 in the long run to just make UTF-8 the native type, and then at least try
 and make sure the standard D library is correct.
I've gotten a little confused reading this thread. Here are some questions
swimming in my head:
1) What does it mean to make UTF-8 the native type?
2) What is char.size?
3) Does char[] differ from byte[] or is it a typedef?
4) How does one get a UTF-16 encoding of a char[], or get the length, or get
the 5th character, or set the 5th character to a given unicode character
(expressed in UTF-16, say)?

Here are my guesses at the answers:
1) string literals are encoded in UTF-8
2) char.size = 8
3) it's a typedef
4) through the library or directly if you know enough about the char[] you
are manipulating.

Is this correct?

thanks,
-Ben
 -Walter

 "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
 news:b0agdq$ni9$1 digitaldaemon.com...
 "globalization guy" <globalization_member pathlink.com> escreveu na
mensagem
 news:b05pdd$13bv$1 digitaldaemon.com...
 I think you'll be making a big mistake if you adopt C's obsolete char
==
 byte
 concept of strings. Savvy language designers these days realize that,
like
 int's
 and float's, char's should be a fundamental data type at a
higher-level
 of
 abstraction than raw bytes. The model that most modern language
designers
 are
 turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

 If you do so, you make it possible for strings in your language to
have
 a
 single, canonical form that all APIs use. Instead of the nightmare
that
 C/C++
 programmers face when passing string parameters ("now, let's see, is
this
 a
 char* or a const char* or an ISO C++ string or an ISO wstring or a
wchar_t* or a
 char[] or a wchar_t[] or an instance of one of countless string
classes...?).
 The fact that not just every library but practically every project
feels
 the
 need to reinvent its own string type is proof of the need for a good,
solid,
 canonical form built right into the language.

 Most language designers these days either get this from the start of
they
 later
 figure it out and have to screw up their language with multiple string
types.
 Having canonical UTF-16 chars and strings internally does not mean
that
 you
 can't deal with other character encodings externally. You can can
convert
 to
 canonical form on import and convert back to some legacy encoding on
export.
 When you create the strings yourself, or when they are created in Java
or

 Javascript or default XML or most new text protocols, no conversion
will
 be
 necessary. It will only be needed for legacy data (or a very
lightweight
 switch
 between UTF-8 and UTF-16). And for those cases where you have to work
with
 legacy data and yet don't want to incur the overhead of encoding
conversion in
 and out, you can still treat the external strings as byte arrays
instead
 of
 strings, assuming you have a "byte" data type, and do direct byte
manipulation
 on them. That's essentially what you would have been doing anyway if
you
 had
 used the old char == byte model I see in your docs. You just call it
"byte"
 instead of "char" so it doesn't end up being your default string type.

 Having a modern UTF-16 char type, separate from arrays of "byte",
gives
 you a
 consistency that allows for the creation of great libraries (since
text
 is
 such

start, and
 their libraries universally use a single string type. Perl figured it
out
 pretty
 late and as a result, with the addition of UTF-8 to Perl in v. 5.6,
it's
 never
 clear which CPAN modules will work and which ones will fail, so you
have
 to use
 pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

 I hope you'll consider making this change to your design. Have an
8-bit
 unsigned
 "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this
"8-bit
 char plus 16-bit wide char on Win32 and 32-bit wide char on Linux"
stuff
 or I'm
 quite sure you'll later regret it. C/C++ are in that sorry state for
legacy
 reasons only, not because their designers were foolish, but any new
language
 that intentionally copies that "design" is likely to regret that
decision.

 Hi,

     There was a thread a year ago in the smalleiffel mailing list
(starting
 at http://groups.yahoo.com/group/smalleiffel/message/4075 ) about
unicode
 strings in Eiffel. It's a quite interesting read about the problems of
 adding string-like Unicode classes. The main point is that true Unicode
 support is very difficult to achieve just some libraries provide good,
 correct and complete unicode encoders/decoders/renderers/etc.
     While I agree that some Unicode support is a necessity today (main
 mother tongue is brazilian portuguese so I use non-ascii characters
 everyday), we can't just add some base types and pretend everything is
 allright. We won't correct incorrect written code with a primitive
unicode
 string. Most programmers don't think about unicode when they develop
their
 software, so almost every line of code dealing with texts contain some
 assumptions about the character sets being used. Java has a primitive 16
bit
 char, but basic library functions (because they need good performance)
use
 incorrect code for string handling stuff (the correct classes are in
 java.text, providing means to correctly collate strings). Some times we
are
 just using plain old ASCII but we're bitten by the encoding issues. And
when
 we need to deal with true unicode support the libraries tricky us into
 believing everything is ok.
     IMO D should support a simple char array to deal with ASCII (as it
does
 today) and some kind of standard library module to deal with unicode
glyphs
 and text. This could be included in phobos or even in deimos. Any
 volunteers? With this we could force the programmer to deal with another
set
 of tools (albeit similar) when dealing with each kind of string: ASCII
or
 unicode. This module should allow creation of variable sized string and
 glyphs through an opaque ADT. Each kind of usage has different semantics
and
 optimization strategies (e.g. Boyer-Moore is good for ASCII but with
unicode
 the space and time usage are worse).

     Best regards,
     Daniel Yokomiso.

 P.S.: I had to written some libraries and components (EJBs) in several
Java
 projects to deal with data-transfer in plain ASCII (communication with
IBM
 mainframes). Each day I dreamed of using a language with simple one byte
 character strings, without problems with encoding and endianess (Solaris
vs.
 Linux vs. Windows NT have some nice "features" in their JVMs if you
aren't
 careful when writing Java code that uses "ASCII" String). But Java has a
16
 bit character type and a SIGNED byte type, both awkward for this usage.
A
 language shouldn't get in the way of simple code.

 "Never argue with an idiot. They drag you down to their level then beat
you
 with experience."


 ---
 Outgoing mail is certified Virus Free.
 Checked by AVG anti-virus system (http://www.grisoft.com).
 Version: 6.0.443 / Virus Database: 248 - Release Date: 11/1/2003
Jan 18 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Ben Hinkle" <bhinkle mathworks.com> wrote in message
news:b0bvoh$1hm5$1 digitaldaemon.com...
 I've gotten a little confused reading this thread. Here are some questions
 swimming in my head:
 1) What does it mean to make UTF-8 the native type?
From a compiler standpoint, all it really means is that string literals are encoded as UTF-8. The real support for it will be in the runtime library, such as UTF-8 support in printf().
 2) What is char.size?
It'll be 1.
 3) Does char[] differ from byte[] or is it a typedef?
It differs in that it can be overloaded differently, and the compiler recognizes char[] as special when doing casts to other array types - it can do conversions between UTF-8 and UTF-16, for example.
 4) How does one get a UTF-16 encoding of a char[],
At the moment, I'm thinking:

    wchar[] w;
    char[] c;
    w = cast(wchar[])c;

to do a UTF-8 to UTF-16 conversion.
 or get the length,
To get the length in bytes:

    c.length

to get the length in UCS-4 characters, perhaps:

    c.nchars

??
 or get
 the 5th character, or set the 5th character to a given unicode character
 (expressed in UTF-16, say)?
Probably a library function.
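
For reference, a sketch of how those pieces look in the D and Phobos that
eventually shipped (std.utf.count, the decoding foreach and dchar-append are
later features, so treat the exact names here as assumptions, not part of
this proposal):

    import std.utf : count;

    void demo()
    {
        char[] c = "héllo".dup;      // UTF-8 storage

        wchar[] w;
        foreach (dchar ch; c)        // foreach decodes UTF-8 into code points
            w ~= ch;                 // appending a dchar to wchar[] encodes UTF-16

        auto bytes = c.length;       // length in UTF-8 code units (bytes): 6
        auto chars = count(c);       // length in code points: 5
    }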
Jan 18 2003
next sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
The best way to handle Unicode is, as a previous poster suggested, to make
UTF-16 the default and tack on ASCII conversions in the runtime library.  Not
the other way around.  Legacy stuff should be runtime lib, modern stuff
built-in.  Otherwise we are building a language on outdated standards.

I don't like typecasting hacks or half-measures.  Besides, typecasting by
definition should not change the size of its argument.

Mark
Jan 18 2003
parent reply "Walter" <walter digitalmars.com> writes:
You're probably right, the typecasting hack is inconsistent enough with the
way the rest of the language works that it's probably a bad idea.

As for why UTF-16 instead of UTF-8, why do you find it preferable?

"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b0ccek$1qnh$1 digitaldaemon.com...
 The best way to handle Unicode is, as a previous poster suggested, to make
 UTF-16 the default and tack on ASCII conversions in the runtime library. Not
 the other way around.  Legacy stuff should be runtime lib, modern stuff
 built-in.  Otherwise we are building a language on outdated standards.

 I don't like typecasting hacks or half-measures.  Besides, typecasting by
 definition should not change the size of its argument.

 Mark
Jan 18 2003
parent reply Mark Evans <Mark_member pathlink.com> writes:
Walter asked,
As for why UTF-16 instead of UTF-8, why do you find it preferable?
If one wants to do serious internationalized applications it is mandatory.
China, Japan, India for example.  China and India by themselves encompass
hundreds of languages and dialects that use non-Western glyphs.

My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and
SGML folks) complain that in their language work, not even UTF-16 is good
enough.  They push for 32 bits!

I would not go that far, but UTF-16 is a very sensible, capable format for
the majority of languages.

Mark
Jan 20 2003
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b0itlo$2a46$1 digitaldaemon.com...
 If one wants to do serious internationalized applications it is mandatory.
 China, Japan, India for example.  China and India by themselves encompass
 hundreds of languages and dialects that use non-Western glyphs.
UTF-8 can handle that.
 My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and
 SGML folks) complain that in their language work, not even UTF-16 is good
 enough.  They push for 32 bits!
UTF-16 has 2^20 characters in it. UTF-8 has 2^31 characters.
 I would not go that far, but UTF-16 is a very sensible, capable format for
 the majority of languages.
The only advantage it has over UTF-8 is it is more compact for some
languages. UTF-8 is more compact for the rest.
Jan 21 2003
parent reply Mark Evans <Mark_member pathlink.com> writes:
Well OK I should have been clearer.  You are right about sheer numerical
quantity, but read the FAQ at Unicode.org (excerpted below).  Numerical quantity
at the price of variable-width codes is a headache.  UTF-16 has variable width,
but not as variable as UTF-8, and nowhere near as frequently.

UTF-16 is the Windows standard.  It's a sweet spot for Unicode, which was
originally a pure 16-bit design.  The Unicode leaders advocate UTF-16 and I
accept their wisdom.

The "real deal" with UTF-8 is that it's a retrofit to accommodate legacy ASCII
that we all know and love.  So again I would argue that UTF-8 qualifies in a
certain sense as "legacy support," and should therefore go in the runtime, not
the core code.

I'd go even further and not use 'char' with any meaning other than UTF-16.  I
never liked the Windows char/wchar goofiness.  A language should only have one
type of char and the runtimes can support conversions of language-standard chars
to other formats.  Trying to shimmy 'alternative characters' into C was a bad
idea.  The wonderful thing about designing a new language is that you can do it
right.  (Implementation details at http://www.unicode.org/reports/tr27/ )

Mark

http://www.unicode.org/faq/utf_bom.html
-----------------------------------------------
"Most Unicode APIs are using UTF-16."
-----------------------------------------------
"UTF-8 will be most common on the web. UTF16, UTF16LE, UTF16BE are used by Java
and Windows."
[BE and LE mean Big Endian and Little Endian.]
-----------------------------------------------
"Unicode was originally designed as a pure 16-bit encoding, aimed at
representing all modern scripts."
-----------------------------------------------
[UTF-8 can have anywhere from 1 to 4 code blocks so it's highly variable.
UTF-16 almost always has one code block, and in rare 1% cases, two; but no more.
This is important in the Asian context:]
"East Asians (Chinese, Japanese, and Koreans) are ... are well acquainted with
the problems that variable-width codes ... have caused....With UTF-16,
relatively few characters require 2 units. The vast majority of characters in
common use are single code units.  Even in East Asian text, the incidence of
surrogate pairs should be well less than 1% of all text storage on average."
-----------------------------------------------
"Furthermore, both Unicode and ISO 10646 have policies in place that formally
limit even the UTF-32 encoding form to the integer range that can be expressed
with UTF-16 (or 21 significant bits)."
-----------------------------------------------
"We don't anticipate a general switch to UTF-32 storage for a long time (if
ever)....The chief selling point for Unicode was providing a representation for
all the world's characters.... These features were enough to swing industry to
the side of using Unicode (UTF-16)."
-----------------------------------------------
Jan 21 2003
parent Mark Evans <Mark_member pathlink.com> writes:
Quick follow-up.  Even the extra space in UTF-8 will probably not be used in the
future, and UTF-8 vs. UTF-16 are going to be neck-and-neck in terms of
storage/performance over time.  So I see no compelling reason for UTF-8 except
its legacy ties to 7-bit ASCII.  I think of UTF-8 as "ASCII with Unicode paint."

Mark

http://www-106.ibm.com/developerworks/library/utfencodingforms/
"Storage vs. performance
Both UTF-8 and UTF-16 are substantially more compact than UTF-32, when averaging
over the world's text in computers. UTF-8 is currently more compact than UTF-16
on average, although it is not particularly suited for East-Asian text because
it occupies about 3 bytes of storage per code point. UTF-8 will probably end up
as about the same as UTF-16 over time, and may end up being less compact on
average as computers continue to make inroads into East and South Asia. Both
UTF-8 and UTF-16 offer substantial advantages over UTF-32 in terms of storage
requirements."

http://czyborra.com/utf/
"Actually, UTF-8 continues to represent up to 31 bits with up to 6 bytes, but it
is generally expected that the one million code points of the 20 bits offered by
UTF-16 and 4-byte UTF-8 will suffice to cover all characters and that we will
never get to see any Unicode character definitions beyond that."
Jan 21 2003
prev sibling parent reply Ilya Minkov <midiclub tiscali.de> writes:
Mark Evans wrote:
 Walter asked,
 
As for why UTF-16 instead of UTF-8, why do you find it preferable?
If one wants to do serious internationalized applications it is mandatory. China, Japan, India for example. China and India by themselves encompass hundreds of languages and dialects that use non-Western glyphs. My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and SGML folks) complain that in their language work, not even UTF-16 is good enough. They push for 32 bits!
Could someone explain to me *what's the difference*?

I thought there was one unicode set, which encodes *everything*. Then, there
are different "wrappings" of it, like UTF-8, UTF-16 and so on. They do the
same by assigning blocks, where multiple "characters" of 8, 16, or however
many bits compose a final character value. And a lot of optimisation can be
done, because it is not likely that each next symbol will be from a
different language, since natural language usually consists of words,
sentences, and so on. In UTF-8 there are sequences, consisting of
header-data, where the header encodes the language/code and the length of
the text, so that some data is generalized and need not be transferred with
every symbol, and so that a character in a certain encoding can take as many
target system characters as it needs.

As far as I understood, UTF-7 is the shortest encoding for latin text, but
it would be less optimal for some multi-hundred-character sets than a
generally wider encoding.

Please, someone correct me if i'm wrong. But if i'm right, Russian, arabic,
and other "tiny" alphabets would only experience a minor "fat-ratio" with
UTF-8, since they require few, not many more, symbols than latin. That is,
only headers and no further overhead.

Can anyone tell me: taken the same newspaper article in chinese, japanese,
or some other "wide" language, encoded in UTF-7, 8, 16, 32 and so on: how
much space would it take? Which languages suffer more and which less from
"small" UTF encodings?

-i.
 
 I would not go that far, but UTF-16 is a very sensible, capable format for the
 majority of languages.
 
 Mark
 
 
Jan 22 2003
next sibling parent Ilya Minkov <midiclub tiscali.de> writes:
Ilya Minkov wrote:
 Could someone explain me *what's the difference*? ...
I see my understanding was confirmed.
 Can anyone tell me: taken the same newspaper article in chinese, 
 japanese, or some other "wide" language, encoded in UTF7, 8, 16, 32 and 
 so on: how much space would it take? Which languages suffer more and 
 which less from "small" UTF encodigs?
This one remains. -i.
Jan 22 2003
prev sibling next sibling parent Mark Evans <Mark_member pathlink.com> writes:
Ilya Minkov says...
Could someone explain me *what's the difference*?
Take the trouble to read through the links supplied in the previous posts before asking redundant questions like this. Mark
Jan 22 2003
prev sibling parent reply Theodore Reed <rizen surreality.us> writes:
On Wed, 22 Jan 2003 15:26:56 +0100
Ilya Minkov <midiclub tiscali.de> wrote:

 sentences, and so on. In UFT8 there are sequences, consisting of
 header-data, where header encodes the language/code and the length of
 the text, so that some data is generalized and need not be tranferred
 with every symbol, and so that a character in a certain encoding can
 take as many target system characters is it needs.
That's not how UTF-8 works (although I've thought an RLE scheme like the one
you describe would be pretty good). In UTF-8 a glyph can be 1-4 bytes. If
the unicode value is below 0x80, it takes one byte. If it's between 0x80 and
0x7FF (inclusive), it takes two, etc.
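
Those boundaries, written out as a small helper (a sketch with an assumed
name; the thresholds follow the UTF-8 definition):

    // Number of bytes UTF-8 needs for a given code point.
    size_t utf8Bytes(dchar c)
    {
        if (c < 0x80)    return 1;   // ASCII
        if (c < 0x800)   return 2;   // most other alphabetic scripts
        if (c < 0x10000) return 3;   // rest of the Basic Multilingual Plane
        return 4;                    // supplementary planes
    }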
 As far as I understood, UTF7 is the shortest encoding for latin text, 
 but it would be less optimal for some multi-hunderd-character sets
 than a generally wider encoding.
Quite less than optimal.
 Please, someone correct me if i'm wrong. But if i'm right, Russian, 
 arabic, and other "tiny" alphabets would only experience a minor 
 "fat-ratio" with UTF8, since they requiere less not many more symbols 
 than latin. That is, only headers and no further overhead.
Most western alphabets would take 1-2 bytes per char. I think Arabic would take 3.
 Can anyone tell me: taken the same newspaper article in chinese, 
 japanese, or some other "wide" language, encoded in UTF7, 8, 16, 32
 and so on: how much space would it take? Which languages suffer more
 and which less from "small" UTF encodigs?
UTF-8 just flat takes less space over all. At most, it takes 4 bytes per
glyph, and for many glyphs it takes less.

The issue isn't really the space. It's the difficulty in dealing with an
encoding where you don't know how long the next glyph will be without
reading it. (Which also means that in order to access the glyph in the
middle, you have to start scanning from the front.)

-- 
Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"I hold it to be the inalienable right of anybody to go to hell in his own
way." -- Robert Frost
Jan 22 2003
next sibling parent Ilya Minkov <midiclub 8ung.at> writes:
Then considering UTF-16 might make sense...


I think there is a way to optimise UTF8 though: pre-scan the string and 
record character width changes in an array.
Jan 22 2003
prev sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
In UTF-8 a glyph can be 1-4 bytes.
Only if you live within the same dynamic range as UTF-16.  To get the full
effective UTF-8 dynamic range of 32 bits, UTF-8 employs up to six bytes.
With 4 bytes it has the same range as UTF-16.

"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for
the use of five- and six-byte sequences to encode characters that are
outside the range of the Unicode character set."
http://www.unicode.org/reports/tr27
The issue isn't really the space.
It's the difficulty in dealing with an encoding where you don't know how
long the next glyph will be without reading it.
Exactly. UTF-16 can have at most one extra code (in roughly 1% of cases). So
you have either one 16-bit word, or two.  UTF-8 is the absolute worst
encoding in this regard.  UTF-32 is the best (constant size).

The main selling point for D is that UTF-16 is the standard for Windows.
Windows is built on it.  Knowing Microsoft, they probably use a "slightly
modified Microsoft version" of UTF-16... that would not surprise me at all.

Mark
Jan 23 2003
parent reply "Serge K" <skarebo programmer.net> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b0qp5g$n73$1 digitaldaemon.com...
In UTF-8 a glyph can be 1-4 bytes.
Only if you live within the same dynamic range as UTF-16. To get the full
effective UTF-8 dynamic range of 32 bits, UTF-8 employs up to six bytes.
With 4 bytes it has the same range as UTF-16.
Actually, UTF-8, UTF-16 and UTF-32 all have the same range: [0..10FFFFh].
The UTF-8 encoding method can be extended up to six bytes max to encode the
UCS-4 character set, but that is way beyond Unicode.
 "The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows
for the
 use of five- and six-byte sequences to encode characters that are outside
the
 range of the Unicode character set."  http://www.unicode.org/reports/tr27
Please, do not post truncated citations. "The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters."
The issue isn't really the space.
It's the difficulty in dealing with an encoding where you don't know how
long the next glyph will be without reading it.
 Exactly. UTF-16 can have at most one extra code (in roughly 1% of cases). So
 you have either one 16-bit word, or two.  UTF-8 is the absolute worst
 encoding in this regard.  UTF-32 is the best (constant size).
For real-world applications, UTF-16 strings need those surrogates only to
access the CJK Ideograph extensions (~43000 characters). In most cases a
UTF-16 string can be treated as an array of UCS-2 characters.

A string object can store its length both in 16-bit units and in characters:
if these numbers are equal, it's a UCS-2 string with no surrogates inside.
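
A sketch of that fast-path test (the names are assumed): scan once for
surrogate code units; if there are none, the 16-bit length and the character
length coincide and plain indexing is safe.

    // True if the UTF-16 array contains any surrogate code units.
    bool hasSurrogates(const(wchar)[] s)
    {
        foreach (wchar u; s)
            if (u >= 0xD800 && u <= 0xDFFF)
                return true;
        return false;
    }
    // If !hasSurrogates(s): s.length is the character count and s[i] is the
    // i-th character; otherwise fall back to surrogate-aware routines.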
 The main selling point for D is that UTF-16 is the standard for Windows.
 Windows is built on it.  Knowing Microsoft, they probably use a "slightly
 modified Microsoft version" of UTF-16... that would not surprise me at all.
Surprise... It's a regular UTF-16. >8-P (Starting with Win2K+sp.)
WinNT 3.x & 4 support UCS-2 only, since that was the Unicode 2.0 encoding.

Any efficient prog. language must use UTF-16 for its Windows implementation -
otherwise it has to convert strings for any API function requiring string
parameters...
Jan 27 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Serge K" <skarebo programmer.net> wrote in message
news:b17cd6$2n1l$1 digitaldaemon.com...
 Any efficient prog. language must use UTF-16 for Windows implementation -
 otherwise it have to convert strings for any API function requiring string
 parameters...
Not necessarily. While Win32 is now fully UTF-16 internally, and apparently
converts the strings in "A" api functions to UTF-16, because UTF-16 uses
double the memory it can still be far more efficient for an app to do all
its computation with UTF-8, and then convert when calling the windows api.
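
A sketch of that "convert only at the boundary" style, using the helper
later added to Phobos' std.utf (toUTF16z is an assumption relative to the D
of this thread, and the MessageBoxW call site is only indicated in a
comment):

    import std.utf : toUTF16z;   // UTF-8 -> zero-terminated UTF-16

    void showTitle(const(char)[] title)
    {
        const(wchar)* wtitle = toUTF16z(title);
        // ...pass wtitle to a "W" API here, e.g. MessageBoxW(null, wtitle, wtitle, 0);
    }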
Feb 03 2003
next sibling parent reply Theodore Reed <rizen surreality.us> writes:
On Mon, 3 Feb 2003 15:37:37 -0800
"Walter" <walter digitalmars.com> wrote:

 
 "Serge K" <skarebo programmer.net> wrote in message
 news:b17cd6$2n1l$1 digitaldaemon.com...
 Any efficient prog. language must use UTF-16 for Windows
 implementation - otherwise it have to convert strings for any API
 function requiring string parameters...
 Not necessarily. While Win32 is now fully UTF-16 internally, and apparently
 converts the strings in "A" api functions to UTF-16, because UTF-16 uses
 double the memory it can still be far more efficient for an app to do all
 its computation with UTF-8, and then convert when calling the windows api.
Plus, UTF-8 is pretty standard for Unicode on Linux. I believe BeOS used it,
too, although I could be wrong. I don't know what OSX uses, nor other
unices.

My point is that choosing a standard by what the underlying platform uses is
a bad idea.

-- 
Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"[...] for plainly, although every work of art is an expression, not every
expression is a work of art." -- DeWitt H. Parker, "The Principles of
Aesthetics"
Feb 04 2003
parent Mark Evans <Mark_member pathlink.com> writes:
My point is that choosing a standard by what the underlying platform
uses is a bad idea.
I agree with this remark, but think there are plenty of platform-independent
reasons for UTF-16.  The fact that Windows uses it just cements the case.

Mark
Feb 13 2003
prev sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
Walter says...
 Any efficient prog. language must use UTF-16 for Windows implementation -
 otherwise it have to convert strings for any API function requiring string
 parameters...
Not necessarily. While Win32 is now fully UTF-16 internally, and apparently
converts the strings in "A" api functions to UTF-16, because UTF-16 uses
double the memory it can still be far more efficient for an app to do all
its computation with UTF-8, and then convert when calling the windows api.
Memory is cheap and getting cheaper, but processor time never loses value.

The supposition that UTF-8 needs less space is flawed anyway.  For some
languages, yes -- but not all.  My earlier citations indicate that
long-term, averaging over all languages, UTF-8 and UTF-16 will require
equivalent memory storage.

UTF-8 code is also harder to write because UTF-8 is just more complicated
than UTF-16.  The only reason for its popularity is that it's a fig leaf for
people who really want to use ASCII.  They can use ASCII and call it UTF-8.
Not very forward-thinking.

Microsoft had good reasons for selecting UTF-16 and D should follow suit.
Other languages are struggling with Unicode support, and it would be nice to
have one language out up front in this area.

Mark
Feb 13 2003
parent "Serge K" <skarebo programmer.net> writes:
 The supposition that UTF-8 needs less space is flawed anyway.  For some
 languages, yes -- but not all.  My earlier citations indicate that
 long-term, averaging over all languages, UTF-8 and UTF-16 will require
 equivalent memory storage.

 UTF-8 code is also harder to write because UTF-8 is just more complicated
 than UTF-16.  The only reason for its popularity is that it's a fig leaf
 for people who really want to use ASCII.  They can use ASCII and call it
 UTF-8.  Not very forward-thinking.

 Microsoft had good reasons for selecting UTF-16 and D should follow suit.
 Other languages are struggling with Unicode support, and it would be nice
 to have one language out up front in this area.

 Mark
http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/index.html?dwzone=unicode
["Forms of Unicode", Mark Davis, IBM developer and President of the Unicode
Consortium, IBM]

"Storage vs. performance
Both UTF-8 and UTF-16 are substantially more compact than UTF-32, when
averaging over the world's text in computers. UTF-8 is currently more
compact than UTF-16 on average, although it is not particularly suited for
East-Asian text because it occupies about 3 bytes of storage per code point.
UTF-8 will probably end up as about the same as UTF-16 over time, and may
end up being less compact on average as computers continue to make inroads
into East and South Asia. Both UTF-8 and UTF-16 offer substantial advantages
over UTF-32 in terms of storage requirements."

{ btw, about storage: I've converted a 300KB text file (a russian book) into
UTF-8 - it took about ~1.85 bytes per character. The little compression
compared to UTF-16 comes mostly from "spaces" and punctuation marks, and it
is hardly worth the processing complexity. }

"Code-point boundaries, iteration, and indexing are very fast with UTF-32.
Code-point boundaries, accessing code points at a given offset, and
iteration involve a few extra machine instructions for UTF-16; UTF-8 is a
bit more cumbersome."

{ Occurrence of the UTF-16 surrogates in real texts is estimated as <1% for
CJK languages. Other scripts encoded in "higher planes" cover very rare or
dead languages and some special symbols (like modern & old music symbols).
So, if the String object can identify the absence of surrogates, faster
functions can be used in most of the cases. The same optimization works for
UTF-8, but only in the US-nivers (even the British pound takes 2 bytes.. 8-) }

"Ultimately, the choice of which encoding format to use will depend heavily
on the programming environment. For systems that only offer 8-bit strings
currently, but are multi-byte enabled, UTF-8 may be the best choice. For
systems that do not care about storage requirements, UTF-32 may be best. For
systems such as Windows, Java, or ICU that use UTF-16 strings already,
UTF-16 is the obvious choice. Even if they have not yet upgraded to fully
support surrogates, they will be before long. If the programming environment
is not an issue, UTF-16 is recommended as a good compromise between
elegance, performance, and storage."
Feb 16 2003
prev sibling next sibling parent Burton Radons <loth users.sourceforge.net> writes:
Walter wrote:
 "Ben Hinkle" <bhinkle mathworks.com> wrote in message
 news:b0bvoh$1hm5$1 digitaldaemon.com...
 
4) How does one get a UTF-16 encoding of a char[],
At the moment, I'm thinking:

    wchar[] w;
    char[] c;
    w = cast(wchar[])c;

to do a UTF-8 to UTF-16 conversion.
This is less complex than "w = toWideStringz(c);" somehow? I can't speak for
anyone else, but this won't help my work with dig at all - I already have to
preprocess any strings sent to the API with toStringz, while the public
interface will still use char[]. So constant casting is the name of the game
by necessity, and if I want to be conservative I have to cache the
conversion and delete it anyway. Calling these APIs directly, when this
casting becomes a win, just doesn't happen to me.
Jan 18 2003
prev sibling parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b0c66n$1mq6$2 digitaldaemon.com...
 "Ben Hinkle" <bhinkle mathworks.com> wrote in message
 news:b0bvoh$1hm5$1 digitaldaemon.com...
 I've gotten a little confused reading this thread. Here are some
questions
 swimming in my head:
 1) What does it mean to make UTF-8 the native type?
From a compiler standpoint, all it really means is that string literals
are
 encoded as UTF-8. The real support for it will be in the runtime library,
 such as UTF-8 support in printf().

 2) What is char.size?
It'll be 1.
D'oh! char.size=8 is a tad big ;)
 3) Does char[] differ from byte[] or is it a typedef?
It differs in that it can be overloaded differently, and the compiler
recognizes char[] as special when doing casts to other array types - it can
do conversions between UTF-8 and UTF-16, for example.
The semantics of casting (across all of D) need to be nice and predictable.
I'd hate to track down a bug because a cast that I thought was trivial
turned out to allocate new memory and copy data around...
 4) How does one get a UTF-16 encoding of a char[],
At the moment, I'm thinking: wchar[] w; char[] c; w = cast(wchar[])c; to do
a UTF-8 to UTF-16 conversion.
 or get the length,
To get the length in bytes: c.length; to get the length in UCS-4 characters,
perhaps: c.nchars ??
Could arrays (or some types that want to have array-like behavior) have some
semantics that distinguish between the memory layout and the array indexing
and length? Another example of this comes up in sparse matrices, where you
want to have an array-like thing that has a non-trivial memory layout.
Perhaps not full-blown operator overloading for [] and .length, etc - but
some kind of special syntax to differentiate between running around in the
memory layout or running around in the "high-level interface".
 or get
 the 5th character, or set the 5th character to a given unicode character
 (expressed in UTF-16, say)?
Probably a library function.
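
A sketch of the kind of split Ben describes, using the operator overloading
D later acquired (the type and its members are hypothetical): [] and .length
speak in characters, while .bytes exposes the raw memory layout.

    struct Utf8String
    {
        char[] bytes;                      // the memory layout

        @property size_t length() const   // high-level length, in code points
        {
            size_t n;
            foreach (char c; bytes)
                if ((c & 0xC0) != 0x80)
                    ++n;
            return n;
        }

        dchar opIndex(size_t n) const      // n-th code point, found by scanning
        {
            foreach (dchar c; bytes)
                if (n-- == 0)
                    return c;
            assert(0, "index out of range");
        }
    }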
Jan 18 2003
parent reply Shannon Mann <Shannon_member pathlink.com> writes:
I've read through what I could find on the thread about char[] 
and I find myself disagreeing with the idea that char[n] should
return the n'th byte, regardless of the width of a character.

My reasons are simple.  When I have an array of, say, ints, I don't
expect that int[n] will give me the n'th byte of the array of numbers.
I fully expect that the n'th integer will be what I get.

I see no reason why this should not hold for arrays of characters.

I do expect that there are times when it would be useful to access
an array of TYPE (where TYPE is int, char, etc) at the byte level, 
but it strikes me that some interface between an array of TYPE 
elements and that array as an array of BYTE's (i.e. using the byte 
type) would be VERY USEFUL, and would address concerns in wanting 
to access characters in their raw byte form.  Indexing of the 
equivalent of a byte pointer to a TYPE array, perhaps formulated
in syntactic sugar, would achieve this.  I would personally prefer
a language-specific way to byte access an aggregate rather than
use pointers to achieve what the language should provide anyway.

Please note that the above statements stand REGARDLESS of the encoding
chosen, be it UTF-8 or 16 or whatever.
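
A sketch of that byte-level view using D's array reinterpret cast (the
function name is assumed, the byte values in the comments assume a
little-endian machine, and none of this is specific to any proposal above):

    void byteViews()
    {
        int[] ints = [0x11223344];
        ubyte[] raw = cast(ubyte[]) ints;     // same memory: 4 bytes, 0x44 0x33 0x22 0x11
        assert(raw.length == 4);

        char[] text = "hé".dup;
        ubyte[] units = cast(ubyte[]) text;   // UTF-8 code units: 0x68 0xC3 0xA9
        assert(units.length == 3);
    }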
Feb 05 2003
parent "Sean L. Palmer" <seanpalmer directvinternet.com> writes:
The solution here is to use a char *iterator* instead of using char
*indexing*.  char indexing will be very slow.  char iteration will be very
fast.

D needs a good iterator concept.  It has a good array concept already, but
arrays are not the solution to everything.  For instance, serial input or
output can't easily be indexed.  You don't do:  serial_port[47] = character;
you do:  serial_port.write(character).  Those are like iterators (ok well at
least in STL, input iterators and output iterators were part of the iterator
family).
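
A sketch of such an iterator over code points (the type is hypothetical;
empty/front/popFront is the shape D's foreach can consume): iteration
decodes as it walks, so nothing ever scans from the start of the string.

    struct CodePoints
    {
        const(char)[] s;

        bool empty() const { return s.length == 0; }

        dchar front() const              // decode the first code point
        {
            foreach (dchar c; s)
                return c;
            assert(0);
        }

        void popFront()                  // skip the bytes of the first code point
        {
            size_t stride = 1;
            while (stride < s.length && (s[stride] & 0xC0) == 0x80)
                ++stride;
            s = s[stride .. $];
        }
    }
    // usage:  foreach (c; CodePoints(str)) { ... }  visits each character once.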

Sean

"Shannon Mann" <Shannon_member pathlink.com> wrote in message
news:b1rb8q$5i7$1 digitaldaemon.com...
 [snip]
Feb 05 2003