
D - Unicode in D

reply globalization guy <globalization_member pathlink.com> writes:
I think you'll be making a big mistake if you adopt C's obsolete char == byte
concept of strings. Savvy language designers these days realize that, like int's
and float's, char's should be a fundamental data type at a higher-level of
abstraction than raw bytes. The model that most modern language designers are
turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

If you do so, you make it possible for strings in your language to have a
single, canonical form that all APIs use, instead of the nightmare that C/C++
programmers face when passing string parameters ("now, let's see, is this a
char* or a const char* or an ISO C++ string or an ISO wstring or a wchar_t* or a
char[] or a wchar_t[] or an instance of one of countless string classes...?").
The fact that not just every library but practically every project feels the
need to reinvent its own string type is proof of the need for a good, solid,
canonical form built right into the language.

Most language designers these days either get this from the start or they later
figure it out and have to screw up their language with multiple string types.

Having canonical UTF-16 chars and strings internally does not mean that you
can't deal with other character encodings externally. You can convert to
canonical form on import and convert back to some legacy encoding on export.

When you create the strings yourself, or when they are created in Java or
Javascript or default XML or most new text protocols, no conversion will be
necessary. It will only be needed for legacy data (or a very lightweight switch
between UTF-8 and UTF-16). And for those cases where you have to work with
legacy data and yet don't want to incur the overhead of encoding conversion in
and out, you can still treat the external strings as byte arrays instead of
strings, assuming you have a "byte" data type, and do direct byte manipulation
on them. That's essentially what you would have been doing anyway if you had
used the old char == byte model I see in your docs. You just call it "byte"
instead of "char" so it doesn't end up being your default string type.

Having a modern UTF-16 char type, separate from arrays of "byte", gives you a
consistency that allows for the creation of great libraries (since text is such
a central data type). Languages that got this right from the start find that
their libraries universally use a single string type. Perl figured it out pretty
late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's never
clear which CPAN modules will work and which ones will fail, so you have to use
pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

I hope you'll consider making this change to your design. Have an 8-bit unsigned
"byte" type and a 16-bit unsigned UTF-16 "char" and forget about this "8-bit
char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff or I'm
quite sure you'll later regret it. C/C++ are in that sorry state for legacy
reasons only, not because their designers were foolish, but any new language
that intentionally copies that "design" is likely to regret that decision.
Jan 16 2003
next sibling parent reply "Paul Sheer" <psheer icon.co.za> writes:
On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:

 I think you'll be making a big mistake if you adopt C's obsolete char == byte
what about embedded work? this needs to be lightweight

in any case, a 16 bit character set doesn't hold all the charsets needed by the
world's languages, but a 20 bit charset (UTF-8) is overkill. then again, most
programmers get by with 8 bits 99% of the time. So you need to give people
options.

-paul
Jan 16 2003
next sibling parent reply Theodore Reed <rizen surreality.us> writes:
On Thu, 16 Jan 2003 14:40:15 +0200
"Paul Sheer" <psheer icon.co.za> wrote:

 On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:
 
 I think you'll be making a big mistake if you adopt C's obsolete
 char == byte
what about embedded work? this needs to be lightweight in any case, a 16 bit
character set doesn't hold all the charsets needed by the worlds languages, but
a 20 bit charset (UTF-8) is overkill. then again, most programmers get by with
8 bits 99% of the time. So you need to give people options. -paul
But the default option should be UTF-8 with a module available for
conversion. (I tend to stay away from UTF-16 because of endian issues.)

Also, I'm not sure where you're getting the 20-bit part. UTF-8 can encode
everything in the Unicode 32-bit range. (Although it takes like 8 bytes
towards the end.)

UTF-8 also addresses the lightweight bit, as long as you aren't using
non-English characters, but even if you are, they aren't that much longer.
And it's better than having to deal with 50 million 8-bit encodings.

FWIW, I wholeheartedly support Unicode strings in D.

--
Theodore Reed (rizen/bancus) -==- http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"We have committed a greater crime, and for this crime there is no name.
What punishment awaits us if it be discovered we know not, for no such
crime has come in the memory of men and there are no laws to provide for
it." -- Equality 7-2521, Ayn Rand's Anthem
Jan 16 2003
next sibling parent reply "Sean L. Palmer" <seanpalmer directvinternet.com> writes:
I'm all for UTF-8.  Most fonts don't come anywhere close to having all the
glyphs anyway, but it's still nice to use an encoding that actually has a
real definition (whereas "byte" has no meaning whatsoever and could mean
ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.)  UTF-8 allows you the full unicode
range but the part that we use everyday just takes 1 byte per char, like
usual.  I believe it even maps almost 1:1 to ASCII in that range.

You cannot however make a UTF-8 data type.  By definition each character may
take more than one byte.  But you don't make arrays of characters, you make
arrays of character building blocks (bytes) that are interpreted as
characters.

Anyway we'd need some automated way to step through the array one character
at a time.  Maybe string could be an array of bytes that pretends that it's
an array of 32-bit unicode characters?
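For reference, that is roughly how it later worked out in D. The following is
only a sketch using present-day Phobos (std.utf postdates this thread); it
steps through a UTF-8 array one decoded character at a time:

    import std.stdio;
    import std.utf : decode;

    void main()
    {
        string s = "héllo";              // UTF-8 bytes in memory
        size_t i = 0;
        while (i < s.length)
        {
            dchar c = decode(s, i);      // decodes one code point, advances i past it
            writefln("U+%04X", cast(uint) c);
        }
        // or simply:  foreach (dchar c; s) { ... }   -- foreach does the decoding
    }

The array is still just bytes; only the iteration pretends it is a sequence of
32-bit characters, which is essentially what is being described here.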

Sean

"Theodore Reed" <rizen surreality.us> wrote in message
news:20030116081437.1a593197.rizen surreality.us...
 On Thu, 16 Jan 2003 14:40:15 +0200
 "Paul Sheer" <psheer icon.co.za> wrote:

 On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:

 I think you'll be making a big mistake if you adopt C's obsolete
 char == byte
what about embedded work? this needs to be lightweight in any case, a 16 bit character set doesn't hold all the charsets needed by the worlds languages, but a 20 bit charset (UTF-8) is overkill. then again, most programmers get by with 8 bits 99% of the time. So you need to give people options.
But the default option should be UTF-8 with a module available for conversion. (I tend to stay away from UTF-16 because of endian issues.) Also, I'm not sure where you're getting the 20-bit part. UTF-8 can encode everything in the Unicode 32-bit range. (Although it takes like 8 bytes towards the end.) UTF-8 also addresses the lightweight bit, as long as you aren't using non-English characters, but even if you are, they aren't that much longer. And it's better than having to deal with 50 million 8-bit encodings. FWIW, I wholeheartedly support Unicode strings in D. -- Theodore Reed (rizen/bancus) -==- http://www.surreality.us/ ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~ "We have committed a greater crime, and for this crime there is no name. What punishment awaits us if it be discovered we know not, for no such crime has come in the memory of men and there are no laws to provide for it." -- Equality 7-2521, Ayn Rand's Anthem
Jan 16 2003
next sibling parent reply Theodore Reed <rizen surreality.us> writes:
On Thu, 16 Jan 2003 09:49:58 -0800
"Sean L. Palmer" <seanpalmer directvinternet.com> wrote:

 I'm all for UTF-8.  Most fonts don't come anywhere close to having all
 the glyphs anyway, but it's still nice to use an encoding that
 actually has a real definition (whereas "byte" has no meaning
 whatsoever and could mean ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.) 
 UTF-8 allows you the full unicode range but the part that we use
 everyday just takes 1 byte per char, like usual.  I believe it even
 maps almost 1:1 to ASCII in that range.
AFAIK, Unicode between 0 and 127 is the exact same thing as ASCII.

--
Theodore Reed (rizen/bancus) -==- http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"The word of Sin is Restriction. O man! refuse not thy wife, if she will!
O lover, if thou wilt, depart! There is no bond that can unite the divided
but love: all else is a curse. Accursed! Accursed be it to the aeons!
Hell." -- Liber AL vel Legis, 1:41
Jan 16 2003
parent Alix Pexton <Alix thedjournal.com> writes:
Theodore Reed wrote:
 On Thu, 16 Jan 2003 09:49:58 -0800
 "Sean L. Palmer" <seanpalmer directvinternet.com> wrote:
 
 
I'm all for UTF-8.  Most fonts don't come anywhere close to having all
the glyphs anyway, but it's still nice to use an encoding that
actually has a real definition (whereas "byte" has no meaning
whatsoever and could mean ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.) 
UTF-8 allows you the full unicode range but the part that we use
everyday just takes 1 byte per char, like usual.  I believe it even
maps almost 1:1 to ASCII in that range.
AFAIK, Unicode between 0 and 127 is the exact same thing as ASCII.
As I see it there are two issues here. Firstly, there is the ability to read
and manipulate text streams that are encoded in one of the many
multi-byte/variable-width formats. Secondly, there is allowing code itself to
be written in mb/vw formats.

The first can be achieved (though perhaps not transparently) using a library,
while the second obviously requires work to be done on the front end of the
compiler. The front end is freely available under the gpl/artistic licences,
and I don't think it would be difficult to augment it with mb/vw support.
However, this doesn't give us an integrated solution, such as you might find
in other languages, but it is a start.

Alix Pexton
Webmaster - "the D journal"
www.thedjournal.com

PS who needs mb/vw when we have lojban ;)
Jan 16 2003
prev sibling parent globalization guy <globalization_member pathlink.com> writes:
In article <b06r0m$1l3u$1 digitaldaemon.com>, Sean L. Palmer says...
I'm all for UTF-8.  Most fonts don't come anywhere close to having all the
glyphs anyway,...
Modern font systems cover different Unicode ranges with different fonts. A font
that contains all the Unicode glyphs is of very limited use. (It tends to be
useful for primitive tools that assume a single font for all glyphs. Such tools
are being superseded by modern tools, though, and the complexities of rendering
are being delegated to central rendering subsystems.)
... but it's still nice to use an encoding that actually has a
real definition (whereas "byte" has no meaning whatsoever and could mean
ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.)  UTF-8 allows you the full unicode
range but the part that we use everyday just takes 1 byte per char, like
usual.  
I'd be careful about the "part we use everyday" idea. I don't really know who's
involved in this "D" project, but big company developers tend to work more and
more in systems that handle a rich range of characters. The reason is that
that's what their companies need to do every day, whether the developers do
personally or not. That's what is swirling around the Internet every day.

It is true, though, that for Westerners, ASCII characters occur more commonly,
so UTF-8 has a sort of "poor man's compression" advantage that is often useful.
 I believe it even maps almost 1:1 to ASCII in that range.

You cannot however make a UTF-8 data type.  By definition each character may
take more than one byte.  But you don't make arrays of characters, you make
arrays of character building blocks (bytes) that are interpreted as
characters.
No, you make arrays of UTF-16 code units. When you need to do work with arrays
of characters, UTF-16 is a better choice than UTF-8, though UTF-8 is better for
data interchange with unknown recipients.
Anyway we'd need some automated way to step through the array one character
at a time.  Maybe string could be an array of bytes that pretends that it's
an array of 32-bit unicode characters?
UTF-16. That's what it's for. UTF-32 is not practical for most purposes that involve large amounts of text.
Jan 16 2003
prev sibling next sibling parent reply globalization guy <globalization_member pathlink.com> writes:
In article <20030116081437.1a593197.rizen surreality.us>, Theodore Reed says...
On Thu, 16 Jan 2003 14:40:15 +0200
"Paul Sheer" <psheer icon.co.za> wrote:

But the default option should be UTF-8 with a module available for
conversion. (I tend to stay away from UTF-16 because of endian issues.)
The default (and only) form should be UTF-16 in the language itself. There is
no endianness issue unless data is serialized. Serialization is a type of
output like printing on paper, and I'm not suggesting serializing into UTF-16
by default. UTF-8 is the way to go for that. I'm only talking about the "model"
used by the programming language.

Another way to look at it is to consider int's. Do you try to avoid the int
data type? It has exactly the same endianness issues as UTF-16.
Also, I'm not sure where you're getting the 20-bit part. UTF-8 can
encode everything in the Unicode 32-bit range. (Although it takes like 8
bytes towards the end.) 
He's right, actually. Unicode has a range of slightly over 20 bits. (1M + 62K, to be exact.) Originally, Unicode had a 16-bit range and ISO 10646 had a 31 bit range (not 32), but both now have converged on a little over 20.
UTF-8 also addresses the lightweight bit, as long as you aren't using
non-English characters, but even if you are, they aren't that much
longer. 
So does UTF-16 because although Western characters take a little more space than with UTF-8, processing is lighter weight, and that is usually more significant.
 And it's better than having to deal with 50 million 8-bit
encodings.
Amen to that! Talk about heavyweight...
FWIW, I wholeheartedly support Unicode strings in D.
Yes, indeed. It is a real benefit to give the users because with Unicode
strings as standard, you get libraries that can take a lot of the really arcane
issues off the programmers' shoulders (and put them on the library authors'
shoulders, where tough stuff belongs). When D programmers then deal with
Unicode XML or HTML, they can just send the strings to the libraries, confident
that the "Unicode stuff" will be taken care of.

That's the kind of advantage modern developers get from Java that they don't
get from good ol' C.
Jan 16 2003
parent Paul Stanton <Paul_member pathlink.com> writes:
In article <b07jht$22v4$1 digitaldaemon.com>, globalization guy says...

That's the kind of advantage modern developers get from Java that they don't get
from good ol' C.
provided solaris/jvm is configured correctly by (friggin) service provider
Jan 16 2003
prev sibling parent "Serge K" <skarebo programmer.net> writes:
 UTF-8 can encode everything in the Unicode 32-bit range.
 (Although it takes like 8 bytes towards the end.)
0x00..0x7F        --> 1 byte  - ASCII
0x80..0x7FF       --> 2 bytes - Latin extended, Greek, Cyrillic, Hebrew, Arabic, etc...
0x800..0xFFFF     --> 3 bytes - most of the scripts in use.
0x10000..0x10FFFF --> 4 bytes - rare/dead/... scripts
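For the curious, present-day Phobos (which postdates this thread) exposes this
table directly through std.utf.codeLength; a minimal sketch:

    import std.stdio;
    import std.utf : codeLength;

    void main()
    {
        // one code point from each row of the table above
        foreach (dchar c; "A\u00E9\u0905\U0001F600"d)
            writefln("U+%06X needs %d UTF-8 byte(s)", cast(uint) c, codeLength!char(c));
        // prints 1, 2, 3 and 4 bytes respectively
    }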
Jan 16 2003
prev sibling parent globalization guy <globalization_member pathlink.com> writes:
In article <b065i9$19aa$1 digitaldaemon.com>, Paul Sheer says...
On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:

 I think you'll be making a big mistake if you adopt C's obsolete char == byte
what about embedded work? this needs to be lightweight
Good questions. I think you'll find if you sniff around that more and more
embedded work is going to Unicode. The reason is that it is inevitable that
any successful device that deals with natural language will be required to
handle more and more characters as its market expands. When you add new
characters by changing character sets, you get a high marginal cost per market
and you still can't handle mixed language scenarios (which have become very
common due to the Internet.) When you add new characters by *adding* character
sets, you lose all of your "lightweight" benefits.

I attended a Unicode conference once where there was a separate embedded
systems conference going on in the same building. By the end of the conference,
we had almost merged, at least in the hallways. ;-)

Unicode, done right, gives you universality at a fraction of the cost of
patchwork solutions to worldwide markets. Even in English, the range of
characters being demanded by customers has continued to grow. It grew beyond
ASCII years ago, and has now gone beyond Latin-1. MS Windows had to add a
proprietary extension to Latin-1 before they gave up entirely and went full
Unicode, as did Apple with OS X, Sun with Java, Perl, HTML 4....
in any case, a 16 bit character set doesn't hold all
the charsets needed by the worlds languages, but a
20 bit charset (UTF-8) is overkill. then again, most
programmers get by with 8 bits 99% of the time. So you
need to give people options.

-paul
UTF-16 isn't a 16-bit character set. It's a 16-bit encoding of a character set
that has an enormous repertoire. There is room for well over a million
characters in the Universal Character Set (shared by Unicode and ISO 10646),
and many of those "characters" are actually components meant to be combined
with others to create a truly enormous variety of what most people think of as
"characters". It is no longer correct to assume a 1:1 correspondence between a
Unicode character and a glyph you see on a screen or on paper. (And that
correspondence was lost way back when TrueType was created anyway). The length
of a string in these modern times is an abstract concept, not a physical one,
when dealing with natural language. The nice 1:1 correspondences between code
point / character / glyph are still available for artificial symbols created as
sequences of ASCII printing characters, though, and that is true even in UTF-16
Unicode.

Unicode certainly does have room for all of the world's character sets. It is a
superset of them all -- with "all" meaning those considered significant by the
various national bodies represented in ISO and all of the industrial bodies
providing input to the Unicode Technical Committee. It's not a universal
superset in an absolute sense.

When you say "most programmers get by with 8 bits 99% of the time", I think you
may be thinking a bit too narrowly. The composition of programmers has become
more international than perhaps you realize, and the change isn't slowing down.
Even in the West, most major companies have moved to Unicode *to solve their
own problems*. MS programmers can't get by with 8-bits. Neither can Apple's, or
Sun's, or Oracle's, or IBM's....

Another thing to consider is that programmers use the tools that exist,
naturally. For a long time, major programming languages had the fundamental
equivalence of byte and char at their core. Many people who got by with 8-bits
did so because there was no practical alternative. These days, there are
alternatives, and modern languages need to be designed to take advantage of all
the great advantages that come along with using Unicode.

Jan 16 2003
prev sibling next sibling parent "Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:
Hi,

I have been thinking about this issue too, and I also think that Unicode
strings should be a prime concern of D. And, yes, UTF-8 is the way to go.
I would very much like to see a string using canonical UTF-8 encoding being
built right into the language, as a class with value semantics.

What we are faced with is:

1. We need char and wchar_t for compatibility with APIs.
2. We need good Unicode support.
3. We need a memory efficient representation of strings.
4. We need easy manipulation of strings.

There are two fundamental types of text data: a character and a string.
Also, Java uses two kinds of strings: a String class for storing strings,
and a StringBuffer for manipulating strings. This separation solves many
problems.

I believe that:

- A single character should be represented using 32-bit UCS-4 with native
endianness - like the wchar_t commonly seen on UNIX. It probably should be a
struct in order to avoid the overhead of a vtbl, and still support character
methods such as isUpper() and toUpper().

- A non-modifiable string should be stored using UTF-8. By non-modifiable I
mean that it does not allow individual characters to be manipulated, but it
does allow reassignment. Read-only forward character iterators could also be
supported in an efficient manner. As has already been stated, such strings
would in most cases be as memory efficient as C's char arrays. This also
addresses Walter's concern about performance issues with CPU caches. But it
also means that the concept of using arrays simply is not good enough. This
string class should also provide functionality such as a collate() method.

- A modifiable string should support manipulation of individual characters,
and could likely be an array of UCS-4 characters.

Methods should be provided for converting to/from char* and wchar_t*
(whether it is 16- or 32-bit) as needed for supporting C APIs. Some will
argue that this would involve too many conversions. However, if you are
using char* today on Windows, Windows will do this conversion all the time,
and you probably do not notice. And if it really becomes a bottleneck,
optimization would be simple in most cases - just cache the converted
string. And if you are only concerned with using C APIs - use the C string
functions such as strcat()/wcscat() or specialized classes.
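Present-day Phobos (well after this thread) ended up providing roughly these
conversions; a minimal sketch, shown only to illustrate converting at the API
boundary:

    import std.stdio;
    import std.utf : toUTF16, toUTF32;

    void main()
    {
        string  s8  = "héllo";       // UTF-8, as stored internally
        wstring s16 = toUTF16(s8);   // UTF-16, e.g. for Win32 "wide" APIs
        dstring s32 = toUTF32(s8);   // UCS-4 / UTF-32, one code point per element
        writeln(s8.length, " ", s16.length, " ", s32.length);   // 6 5 5
    }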

In addition, character encoders could be provided for whatever representation
is needed. I myself would like support for US-ASCII, EBCDIC, ISO-8859,
UTF-7, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, and US-ASCII/ISO-8859
with encoding of characters as in HTML (I don't remember what standard this
is called, but it specifies characters using "&somename;"). Others would
have different needs, so it should be simple to implement a new character
encoder/decoder.

Regards,
Martin M. Pedersen.

"globalization guy" <globalization_member pathlink.com> wrote in message
news:b05pdd$13bv$1 digitaldaemon.com...
 I think you'll be making a big mistake if you adopt C's obsolete char ==
byte
 concept of strings. Savvy language designers these days realize that, like
int's
 and float's, char's should be a fundamental data type at a higher-level of
 abstraction than raw bytes. The model that most modern language designers
are
 turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

 If you do so, you make it possible for strings in your language to have a
 single, canonical form that all APIs use. Instead of the nightmare that
C/C++
 programmers face when passing string parameters ("now, let's see, is this
a
 char* or a const char* or an ISO C++ string or an ISO wstring or a
wchar_t* or a
 char[] or a wchar_t[] or an instance of one of countless string
classes...?).
 The fact that not just every library but practically every project feels
the
 need to reinvent its own string type is proof of the need for a good,
solid,
 canonical form built right into the language.

 Most language designers these days either get this from the start of they
later
 figure it out and have to screw up their language with multiple string
types.
 Having canonical UTF-16 chars and strings internally does not mean that
you
 can't deal with other character encodings externally. You can can convert
to
 canonical form on import and convert back to some legacy encoding on
export.
 When you create the strings yourself, or when they are created in Java or
 Javascript or default XML or most new text protocols, no conversion will
be
 necessary. It will only be needed for legacy data (or a very lightweight
switch
 between UTF-8 and UTF-16). And for those cases where you have to work with
 legacy data and yet don't want to incur the overhead of encoding
conversion in
 and out, you can still treat the external strings as byte arrays instead
of
 strings, assuming you have a "byte" data type, and do direct byte
manipulation
 on them. That's essentially what you would have been doing anyway if you
had
 used the old char == byte model I see in your docs. You just call it
"byte"
 instead of "char" so it doesn't end up being your default string type.

 Having a modern UTF-16 char type, separate from arrays of "byte", gives
you a
 consistency that allows for the creation of great libraries (since text is
such

start, and
 their libraries universally use a single string type. Perl figured it out
pretty
 late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's
never
 clear which CPAN modules will work and which ones will fail, so you have
to use
 pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

 I hope you'll consider making this change to your design. Have an 8-bit
unsigned
 "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this
"8-bit
 char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff
or I'm
 quite sure you'll later regret it. C/C++ are in that sorry state for
legacy
 reasons only, not because their designers were foolish, but any new
language
 that intentionally copies that "design" is likely to regret that decision.
Jan 16 2003
prev sibling next sibling parent reply "Walter" <walter digitalmars.com> writes:
You make some great points. I have to ask, though, why UTF-16 as opposed to
UTF-8?
Jan 16 2003
parent reply globalization guy <globalization_member pathlink.com> writes:
In article <b08cdr$2fld$1 digitaldaemon.com>, Walter says...
You make some great points. I have to ask, though, why UTF-16 as opposed to
UTF-8?
Good question, and actually it's not an open and shut case. UTF-8 would not be
a big mistake, but it might not be quite as good as UTF-16.

The biggest reason I think UTF-16 has the edge is that I think you'll probably
want to treat your strings as arrays of characters on many occasions, and
that's *almost* as easy to do with UTF-16 as with ASCII. It's really not very
practical with UTF-8, though.

UTF-16 characters are almost always a single 16-bit code unit. Once in a
billion characters or so, you get a character that is composed of two
"surrogates". Sort of like half characters. Your code does have to keep this
exceptional case in mind and handle it when necessary, though that is usually
the type of problem you delegate to the standard library. In most cases, a
function can just think of each surrogate as a character and not worry that it
might be just half of the representation of a character -- as long as the two
don't get separated. In almost all cases, though, you can think of a character
as a single 16-bit entity, which is almost as simple as thinking of it as a
single 8-bit entity. You can do bit operations on them and other C-like things
and it should be very efficient.

Unlike UTF-16's two cases, one of which is very rare, UTF-8 has four cases,
three of which are very common. All of your code needs to do a good job with
those three cases. Only the fourth can be considered exceptional. (Of course it
has to be handled, too, but it is like the exceptional UTF-16 case, where you
don't have to optimize for it because it rarely occurs). Most strings will tend
to have mixed-width characters, so a model of an array of elements isn't a very
good one.

You can still implement your language with accessors that reach into a UTF-8
string and parse out the right character when you say "str[5]", but it will be
further removed from the physical implementation than if you use UTF-16. For a
somewhat lower-level language like "D", this probably isn't a very good fit.

The main benefit of UTF-8 is when exchanging text data with arbitrary external
parties. UTF-8 has no endianness problem, so you don't have to worry about the
*internal* memory model of the recipient. It has some other features that make
it easier to digest by legacy systems that can only handle ASCII. They won't
work right outside ASCII, but they'll often work for ASCII and they'll fail
more gracefully than would be the case with UTF-16 (that is likely to contain
embedded \0 bytes.)

None of these issues are relevant to your own program's *internal* text model.
Internally, you're not worried about endianness. (You don't worry about the
endianness of your int variables, do you?) You don't have to worry about losing
a byte in RAM, etc.

When talking to external APIs, you'll still have to output in a form that the
API can handle. Win32 APIs want UTF-16. Mac APIs want UTF-16. Java APIs want
UTF-16, as do .Net APIs. Unix APIs are problematic, since there are so many and
they aren't coordinated by a single body. Some will only be able to handle
ASCII, others will be upgraded to UTF-8. I don't think the Unix system APIs
will become UTF-16 because legacy is such a ball and chain in the Unix world,
but the process is underway to upgrade the standard system encoding for all
major Linux distributions to UTF-8.

If Linux APIs (and probably most Unix APIs eventually) are of primary
importance, UTF-8 is still a possibility. I'm not totally ruling it out. It
wouldn't hurt you much to use UTF-8 internally, but accessing strings as arrays
of characters would require sort of a virtual string model that doesn't match
the physical model quite as closely as you could get with UTF-16. The
additional abstraction might have more overhead than you would prefer
internally. If it's a choice between internal inefficiency and inefficiency
when calling external APIs, I would usually go for the latter.

Most language designers who understand internationalization have decided to go
with UTF-16 for languages that have their own rich set of internal libraries,
and they have mechanisms for calling external APIs that convert the string
encodings.
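To make the surrogate case concrete, here is a sketch in present-day D
(std.utf, which postdates this thread), shown only to illustrate the mechanism
being described:

    import std.stdio;
    import std.utf : count;

    void main()
    {
        wstring w = "A\U0001F600"w;      // 'A' plus a character outside the BMP
        writeln(w.length);                // 3 UTF-16 code units (1 + surrogate pair)
        writeln(count(w));                // 2 characters (code points)
        foreach (dchar c; w)              // foreach reassembles the surrogate pair
            writefln("U+%06X", cast(uint) c);
    }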
Jan 17 2003
parent reply "Walter" <walter digitalmars.com> writes:
I read your post with great interest. However, I'm leaning towards UTF-8 for
the following reasons (some of which you've covered):

1) In googling around and reading various articles, it seems that UTF-8 is
gaining momentum as the encoding of choice, including html.

2) Linux is moving towards UTF-8 permeating the OS. Doing UTF-8 in D means
that D will mesh naturally with Linux system api's.

3) Is Win32's "wide char" really UTF-16, including the multi word encodings?

4) I like the fact of no endianness issues, which is important when writing
files and transmitting text - it's much more important an issue than the
endianness of ints.

5) 16 bit accesses on Intel CPUs can be pretty slow compared to byte or
dword accesses (varies by CPU type).

6) Sure, UTF-16 reduces the frequency of multi character encodings, but the
code to deal with it must still be there and must still execute.

7) I've converted some large Java text processing apps to C++, and converted
the Java 16 bit char's to using UTF-8. That change resulted in *substantial*
performance improvements.

8) I suspect that 99% of the text processed in computers is ascii. UTF-8 is
a big win in memory and speed for processing english text.

9) A lot of diverse systems and lightweight embedded systems need to work
with 8 bit chars. Going to UTF-16 would, I think, reduce the scope of
applications and systems that D would be useful for. Going to UTF-8 would
make it as broad as possible.

10) Interestingly, making char[] in D to be UTF-8 does not seem to step on
or prevent dealing with wchar_t[] arrays being UTF-16.

11) I'm not convinced the char[i] indexing problem will be a big one. Most
operations done on ASCII strings remain unchanged for UTF-8, including
things like sorting & searching (see the sketch after the link below).

See http://www.cl.cam.ac.uk/~mgk25/unicode.html
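A sketch of why point 11 holds, using present-day D and std.string (purely
illustrative, not the 2003 library): a multi-byte UTF-8 sequence never contains
a byte below 0x80, so searching for ASCII needles with plain byte operations
still works:

    import std.stdio;
    import std.string : indexOf;

    void main()
    {
        string s = "naïve café";       // 'ï' and 'é' are two bytes each in UTF-8
        writeln(indexOf(s, "café"));   // 7  -- a byte offset, and a correct match
        writeln(indexOf(s, 'v'));      // 4  -- ASCII bytes never appear inside
                                       //       a multi-byte sequence
    }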

"globalization guy" <globalization_member pathlink.com> wrote in message
news:b09qpe$aff$1 digitaldaemon.com...
 In article <b08cdr$2fld$1 digitaldaemon.com>, Walter says...
You make some great points. I have to ask, though, why UTF-16 as opposed
to
UTF-8?
Good question, and actually it's not an open and shut case. UTF-8 would
not be a
 big mistake, but it might not be quite as good as UTF-16.

 The biggest reason I think UTF-16 has the edge is that I think you'll
probably
 want to treat your strings as arrays of characters on many occasions, and
that's
 *almost* as easy to do with UTF-16 as with ASCII. It's really not very
practical
 with UTF-8, though.

 UTF-16 characters are almost always a single 16-bit code unit. Once in a
billion
 characters or so, you get a character that is composed of two
"surrogates". Sort
 of like half characters. Your code does have to keep this exceptional case
in
 mind and handle it when necessary, though that is usually the type of
problem
 you delegate to the standard library. In most cases, a function can just
think
 of each surrogate as a character and not worry that it might be just half
of the
 representation of a character -- as long as the two don't get separated.
In
 almost all cases, though, you can think of a character as a single 16-bit
 entity, which is almost as simple as thinking of it as a single 8-bit
entity.
 You can do bit operations on them and other C-like things and it should be
very
 efficient.

 Unlike UTF-16's two cases, one of which is very rare, UTF-8 has four
cases,
 three of which are very common. All of your code needs to do a good job
with
 those three cases. Only the fourth can be considered exceptional. (Of
course it
 has to be handled, too, but it is like the exceptional UTF-16 case, where
you
 don't have to optimize for it because it rarely occurs). Most strings will
tend
 to have mixed-width characters, so a model of an array of elements isn't a
very
 good one.

 You can still implement your language with accessors that reach into a
UTF-8
 string and parse out the right character when you say "str[5]", but it
will be
 further removed from the physical implementation than if you use UTF-16.
For a
 somewhat lower-level language like "D", this probably isn't a very good
fit.
 The main benefit of UTF-8 is when exchanging text data with arbitrary
external
 parties. UTF-8 has no endianness problem, so you don't have to worry about
the
 *internal* memory model of the recipient. It has some other features that
make
 it easier to digest by legacy systems that can only handle ASCII. They
won't
 work right outside ASCII, but they'll often work for ASCII and they'll
fail more
 gracefully than would be the case with UTF-16 (that is likely to contain
 embedded \0 bytes.)

 None of these issues are relevant to your own program's *internal* text
model.
 Internally, you're not worried about endianness. (You don't worry about
the
 endianness of your int variables, do you?) You don't have to worry about
losing
 a byte in RAM, etc.

 When talking to external APIs, you'll still have to output in a form that
the
 API can handle. Win32 APIs want UTF-16. Mac APIs want UTF-16. Java APIs
want
 UTF-16, as do .Net APIs. Unix APIs are problematic, since there are so
many and
 they aren't coordinated by a single body. Some will only be able to handle
 ASCII, others will be upgraded to UTF-8. I don't think the Unix system
APIs will
 become UTF-16 because legacy is such a ball and chain in the Unix world,
but the
 process is underway to upgrade the standard system encoding for all major
Linux
 distributions to UTF-8.

 If Linux APIs (and probably most Unix APIs eventually) are of primary
 importance, UTF-8 is still a possibility. I'm not totally ruling it out.
It
 wouldn't hurt you much to use UTF-8 internally, but accessing strings as
arrays
 of characters would require sort of a virtual string model that doesn't
match
 the physical model quite as closely as you could get with UTF-16. The
additional
 abstraction might have more overhead than you would prefer internally. If
it's a
 choice between internal inefficiency and inefficiency when calling
external
 APIs, I would usually go for the latter.

 Most language designers who understand internationalization have decided
to go
 with UTF-16 for languages that have their own rich set of internal
libraries,
 and they have mechanisms for calling external APIs that convert the string
 encodings.
Jan 17 2003
next sibling parent reply Burton Radons <loth users.sourceforge.net> writes:
Walter wrote:
 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on
 or prevent dealing with wchar_t[] arrays being UTF-16.
You're planning on making this a part of char[]?  I was thinking of generating
a StringUTF8 instance during compilation, but whatever.

I think we should kill off wchar if we go in this direction.  The char/wchar
conflict is probably the worst part of D's design right now as it doesn't fit
well with the rest of the language (limited and ambiguous overloading), and it
would provide absolutely nothing that char doesn't already encapsulate.  If
you need different encodings, use a library.
 11) I'm not convinced the char[i] indexing problem will be a big one. Most
 operations done on ascii strings remain unchanged for UTF-8, including
 things like sorting & searching.
It's not such a speed hit any longer that all code absolutely must use slicing
and iterators to be useful.

12) UTF-8 doesn't embed ANY control characters, so it can interface with
unintelligent C libraries natively.  That's not a minor advantage when you're
trying to get people to switch to it!
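A minimal illustration of point 12 in present-day D (toStringz and the C
runtime binding are modern Phobos names, used here only as a sketch): no byte
of a multi-byte UTF-8 sequence ever looks like a NUL or any other ASCII byte,
so an oblivious C routine handles the string untouched.

    import std.string : toStringz;
    import core.stdc.stdio : puts;

    void main()
    {
        string s = "héllo, wörld";   // UTF-8; no NUL or control bytes inside
        puts(s.toStringz);           // an unintelligent C routine just passes the bytes through
    }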
Jan 17 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Burton Radons" <loth users.sourceforge.net> wrote in message
news:b0a6rl$i4m$1 digitaldaemon.com...
 Walter wrote:
 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on
 or prevent dealing with wchar_t[] arrays being UTF-16.
You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.
I think making char[] a UTF-8 is the right way.
 I think we should kill off wchar if we go in this direction.  The
 char/wchar conflict is probably the worst part of D's design right now
 as it doesn't fit well with the rest of the language (limited and
 ambiguous overloading), and it would provide absolutely nothing that
 char doesn't already encapsulate.  If you need different encodings, use
 a library.
I agree that the char/wchar conflict is a screwup in D's design, and one I've not been happy with. UTF-8 offers a way out. wchar_t should still be retained, though, for interfacing with the win32 api.
 11) I'm not convinced the char[i] indexing problem will be a big one. Most
 operations done on ascii strings remain unchanged for UTF-8, including
 things like sorting & searching.
It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.
Interestingly, if foreach is done right, iterating through char[] will work right, UTF-8 or not.
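That is in fact how it turned out: in present-day D (long after this thread)
the element type chosen in foreach decides whether you see raw code units or
decoded characters. A small sketch:

    import std.stdio;

    void main()
    {
        string s = "naïve";                  // 6 bytes of UTF-8
        foreach (char b; s)                  // element type char: raw bytes
            writef("%02X ", cast(ubyte) b);
        writeln();
        foreach (dchar c; s)                 // element type dchar: decoded UTF-8
            writef("U+%04X ", cast(uint) c);
        writeln();
    }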
 12) UTF-8 doesn't embed ANY control characters, so it can interface with
 unintelligent C libraries natively.  That's not a minor advantage when
 you're trying to get people to switch to it!
You're right.
Jan 17 2003
parent reply "Mike Wynn" <mike.wynn l8night.co.uk> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b0a7ft$iei$1 digitaldaemon.com...
 "Burton Radons" <loth users.sourceforge.net> wrote in message
 news:b0a6rl$i4m$1 digitaldaemon.com...
 Walter wrote:
 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on
 or prevent dealing with wchar_t[] arrays being UTF-16.
You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.
I think making char[] a UTF-8 is the right way.
I would be more in favor of a String class that was UTF-8 internally. The
problem with UTF-8 is that the number of bytes and the number of chars are
dependent on the data. char[] to me implies an array of chars, so

    char[] foo = "aa\u0555";

is 4 bytes, but only 3 chars. So what is foo[2]? And what if I set
foo[1] = '\u0467'? And what about wanting 8 bit ascii strings?

If you are going UTF-8 then think about the minor extension Java added to the
encoding by allowing a two byte 0, which allows embedded 0 in strings without
messing up the C strlen (which returns the byte length).
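For what it's worth, present-day D answers these questions as follows; a
sketch using modern Phobos (std.utf.count is a real function, the rest is only
illustration):

    import std.stdio;
    import std.utf : count;

    void main()
    {
        string foo = "aa\u0555";
        writeln(foo.length);                      // 4  -- bytes of UTF-8
        writeln(count(foo));                      // 3  -- characters (code points)
        writefln("0x%02X", cast(ubyte) foo[2]);   // 0xD5 -- just the first byte of U+0555
    }

In other words, indexing stays byte-based and cheap, while counting characters
is an explicit, linear-time operation.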
 I think we should kill off wchar if we go in this direction.  The
 char/wchar conflict is probably the worst part of D's design right now
 as it doesn't fit well with the rest of the language (limited and
 ambiguous overloading), and it would provide absolutely nothing that
 char doesn't already encapsulate.  If you need different encodings, use
 a library.
 I agree that the char/wchar conflict is a screwup in D's design, and one
 I've not been happy with. UTF-8 offers a way out. wchar_t should still be
 retained, though, for interfacing with the win32 api.

 11) I'm not convinced the char[i] indexing problem will be a big one. Most
 operations done on ascii strings remain unchanged for UTF-8, including
 things like sorting & searching.
It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.
 Interestingly, if foreach is done right, iterating through char[] will work
 right, UTF-8 or not.
 12) UTF-8 doesn't embed ANY control characters, so it can interface with
 unintelligent C libraries natively.  That's not a minor advantage when
 you're trying to get people to switch to it!
You're right.
Jan 17 2003
parent "Walter" <walter digitalmars.com> writes:
UTF-8 does lead to the problem of what is meant by:

    char[] c;
    c[5]

Is it the 5th byte of c[], or the 5th decoded 32 bit character? Saying it's
the 5th decoded character has all kinds of implications for slicing and
.length.
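For comparison, in present-day D the answer became "the 5th byte"; getting the
5th decoded character is an explicit index conversion. A sketch (std.utf's
toUTFindex is a real Phobos function; the example itself is only illustrative):

    import std.stdio;
    import std.utf : toUTFindex;

    void main()
    {
        string c = "añcdéf";
        writefln("0x%02X", cast(ubyte) c[5]);  // 0xC3 -- first byte of 'é': c[5] is a byte
        size_t i = toUTFindex(c, 5);           // byte index where code point #5 starts
        writeln(c[i]);                         // 'f' -- the 5th decoded character (0-based)
    }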

8 bit ascii isn't a problem, just cast it to a byte[], as in:
    byte[] b = cast(byte[])c;

I'm not sure about the Java 00 issue; I didn't think Java supported UTF-8. D
does not have the "what to do about embedded 0" problem, as the length is
carried along separately.

"Mike Wynn" <mike.wynn l8night.co.uk> wrote in message
news:b0a8eg$ivc$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:b0a7ft$iei$1 digitaldaemon.com...
 "Burton Radons" <loth users.sourceforge.net> wrote in message
 news:b0a6rl$i4m$1 digitaldaemon.com...
 Walter wrote:
 10) Interestingly, making char[] in D to be UTF-8 does not seem to
step
 on
 or prevent dealing with wchar_t[] arrays being UTF-16.
You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.
I think making char[] a UTF-8 is the right way.
I would be more in favor of a String class that was utf8 internally the problem with utf8 is that the the number of bytes and the number of chars are dependant on the data char[] to me implies an array of char's so char [] foo ="aa"\0x0555; is 4 bytes, but only 3 chars so what is foo[2] ? and what if I set foo[1] = \0x467; and what about wanting 8 bit ascii strings ? if you are going UTF8 then think about the minor extension Java added to
the
 encoding by allowing a two byte 0, which allows embedded 0 in strings
 without messing up the C strlen (which returns the byte length).
Jan 17 2003
prev sibling next sibling parent reply "Mike Wynn" <mike.wynn l8night.co.uk> writes:
 6) Sure, UTF-16 reduces the frequency of multi character encodings, but the
 code to deal with it must still be there and must still execute.
I was under the impression UTF-16 was glyph based, so each char (16 bits) was
a glyph of some form. Not all glyphs cause the graphics to move to the next
char, so accents can be encoded as a postfix to the char they are over/under,
and charsets like Chinese have sequences that generate the correct visual
representation.

UTF-8 is just a way to encode UTF-16 so that it is compatible with ascii:
0..127 map to 0..127, then 128..256 are used as special values identifying
multi-byte values. The string can be processed as 8bit ascii by software
without problem; only the visual representation changes, since 128..256 on dos
are the box drawing and intl chars.

However, a 3 UTF-16 char sequence will encode to 3 UTF-8 encoded sequences, and
if they are all >127 then that would be 6 or more bytes. So if you consider the
3 UTF-16 values to be one "char", then the UTF-8 side should also consider the
6 or more byte sequence as one "char" rather than 3 "chars".
Jan 17 2003
parent reply "Serge K" <skarebo programmer.net> writes:
 I was under the impression UTF-16 was glyph based, so each char (16bits) was
 a glyph of some form, not all glyph cause the graphics to move to the next
 char, so accents can be encoded as a postfix to the char they are over/under
 and charsets like chinesse have sequences that generate the correct visual
 reprosentation;
First, UTF-16 is just one of the many standard encodings for Unicode.
UTF-16 allows more than 16-bit characters - with surrogates it can represent
all >1M codes.
(Unicode v2 used UCS-2, which is a 16-bit-only encoding.)
 I was under the impression UTF-16 was glyph based
from The Unicode Standard, ch2 General Structure
http://www.unicode.org/uni2book/ch02.pdf

"Characters, not glyphs - The Unicode Standard encodes characters, not glyphs.
The Unicode Standard draws a distinction between characters, which are the
smallest components of written language that have semantic value, and glyphs,
which represent the shapes that characters can have when they are rendered or
displayed. Various relationships may exist between characters and glyphs: a
single glyph may correspond to a single character, or to a number of
characters, or multiple glyphs may result from a single character."

btw, there are many precomposed characters in the Unicode which can be
represented with combining characters as well. ( [â] and [a,(combining ^)] -
equally valid representations for [a with circumflex] ).
Jan 16 2003
parent "Mike Wynn" <mike.wynn l8night.co.uk> writes:
 First, UTF-16 is just one of the many standard encodings for the Unicode.
 UTF-16 allows more then 16bit characters - with surrogates it can represent
 all >1M codes.
 (Unicode v2 used UCS-2 which is 16bit-only encoding)
right, me getting confused. too many tla's too many standards (as ever).
 I was under the impression UTF-16 was glyph based
from The Unicode Standard, ch2 General Structure
http://www.unicode.org/uni2book/ch02.pdf

"Characters, not glyphs - The Unicode Standard encodes characters, not glyphs.
The Unicode Standard draws a distinction between characters, which are the
smallest components of written language that have semantic value, and glyphs,
which represent the shapes that characters can have when they are rendered or
displayed. Various relationships may exist between characters and glyphs: a
single glyph may correspond to a single character, or to a number of
characters, or multiple glyphs may result from a single character."

btw, there are many precomposed characters in the Unicode which can be
represented with combining characters as well. ( [â] and [a,(combining ^)] -
equally valid representations for [a with circumflex] ).
so if I read this right ... (been using UTF-8 for ages and ignored what it
represents, keeps me sane (er) ) I can't understand arabic file names anyway :)

so a string (no matter how it's encoded) contains 3 lengths: the byte length,
the number of unicode entities (16 bit UCS-2), and the number of "characters".

so cât as UTF-8 is 4 bytes, as UTF-16 is 6 bytes, it's 3 UCS-2 entities, and 3
"characters"; but if the â was [a,(combining ^)] and not the single â UCS-2
value, then cât would be 5 bytes as UTF-8, 8 bytes as UTF-16, 4 UCS-2 entities,
but still 3 "characters".

which is why I think String should be a class, not a thing[]. you should be
able to get a utf8 encoded byte[], utf-16 short[], UCS-2 short[] (for
win32/api), (32 bit unicode) int[] (for linux) and ideally a Character[] from
the string. how a String is stored (utf8, utf16 or 32bit/64bit values) is only
relevant for performance, and different people will want different internal
representations, but semantically they should all be the same.

this is all another reason why I also think that arrays should be templated
classes that have an index method (operator []) so the Character[] from the
string can modify the String it represents.

Mike.
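Those lengths (plus user-visible characters as a fourth) can be measured
directly in present-day D; a sketch using std.uni and std.utf from modern
Phobos, shown only to illustrate the distinctions Mike is drawing:

    import std.stdio;
    import std.range : walkLength;
    import std.uni : byGrapheme, normalize;
    import std.utf : count;

    void main()
    {
        string composed   = "c\u00E2t";    // "cât" with â as one code point
        string decomposed = "ca\u0302t";   // "cât" as a + combining circumflex
        // bytes, code points, user-visible characters (graphemes)
        writeln(composed.length,   " ", count(composed),   " ", composed.byGrapheme.walkLength);   // 4 3 3
        writeln(decomposed.length, " ", count(decomposed), " ", decomposed.byGrapheme.walkLength); // 5 4 3
        writeln(normalize(decomposed) == composed);   // true -- NFC composition is the default
    }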
Jan 18 2003
prev sibling parent reply "Serge K" <skarebo programmer.net> writes:
 3) Is Win32's "wide char" really UTF-16, including the multi word
encodings?

WinXP, WinCE : UTF-16
Win2K : was UCS-2, but some service pack made it UTF-16
WinNT4 : UCS-2
Win9x : must die.
 5) 16 bit accesses on Intel CPUs can be pretty slow compared to byte or
 dword accesses (varies by CPU type).
16bit prefix can slow down instruction decoding (mostly for Intel CPUs, but P4 uses pre-decoded instructions anyhow), while instruction processing is more cache-branch-sensitive.
 6) Sure, UTF-16 reduces the frequency of multi character encodings, but the
 code to deal with it must still be there and must still execute.
Just an idea: the string class may have 2 values for the string length:
1 - number of "units" ( 8 bit for UTF-8, 16 bit for UTF-16 )
2 - number of characters.
If these numbers are equal, the string processing library may use simplified
and faster functions.
 7) I've converted some large Java text processing apps to C++, and converted
 the Java 16 bit char's to using UTF-8. That change resulted in *substantial*
 performance improvements.

 8) I suspect that 99% of the text processed in computers is ascii. UTF-8 is
 a big win in memory and speed for processing english text.
You think that 99% of computer users are English speaking? Think again...

btw, something about UTF-8 & UTF-16 efficiency:
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results

For latin script based languages - UTF-8 takes ~51% less space than UTF-16.
For greek (expect the same for cyrillic) - ~88% - not that much better than UTF-16.
For japanese, chinese, korean, hindi - 115%..140% - UTF-16 is more space efficient.
Jan 16 2003
parent "Walter" <walter digitalmars.com> writes:
"Serge K" <skarebo programmer.net> wrote in message
news:b0anmt$r7g$1 digitaldaemon.com...
 3) Is Win32's "wide char" really UTF-16, including the multi word
encodings?
WinXP, WinCE : UTF-16
Win2K : was UCS-2, but some service pack made it UTF-16
WinNT4 : UCS-2
Win9x : must die.
LOL! Looking forward, then, one can treat it as UTF-16.
 8) I suspect that 99% of the text processed in computers is ascii. UTF-8 is
 a big win in memory and speed for processing english text.
 You think that 99% of computer users are English speaking?
Not at all. But the text processed - yes. I imagine it would be pretty tough
to come by figures for that which are better than speculation.
 something about UTF-8 & UTF-16 efficiency:
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results
 For latin script based languages - UTF-8 takes ~51% less space than UTF-16.
 For greek (expect the same for cyrillic) - ~88% - not that better than UTF-16.
 For japanese, chinese, korean, hindi - 115%..140% - UTF-16 is more space efficient.
Thanks for the info. That's about what I would have guessed. Another valuable statistic would be how well UTF-8 compressed with LZW as opposed to the same thing in UTF-16.
Jan 17 2003
prev sibling parent reply "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> writes:
"globalization guy" <globalization_member pathlink.com> escreveu na mensagem
news:b05pdd$13bv$1 digitaldaemon.com...
 I think you'll be making a big mistake if you adopt C's obsolete char ==
byte
 concept of strings. Savvy language designers these days realize that, like
int's
 and float's, char's should be a fundamental data type at a higher-level of
 abstraction than raw bytes. The model that most modern language designers
are
 turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

 If you do so, you make it possible for strings in your language to have a
 single, canonical form that all APIs use. Instead of the nightmare that
C/C++
 programmers face when passing string parameters ("now, let's see, is this
a
 char* or a const char* or an ISO C++ string or an ISO wstring or a
wchar_t* or a
 char[] or a wchar_t[] or an instance of one of countless string
classes...?).
 The fact that not just every library but practically every project feels
the
 need to reinvent its own string type is proof of the need for a good,
solid,
 canonical form built right into the language.

 Most language designers these days either get this from the start of they
later
 figure it out and have to screw up their language with multiple string
types.
 Having canonical UTF-16 chars and strings internally does not mean that
you
 can't deal with other character encodings externally. You can can convert
to
 canonical form on import and convert back to some legacy encoding on
export.
 When you create the strings yourself, or when they are created in Java or
 Javascript or default XML or most new text protocols, no conversion will
be
 necessary. It will only be needed for legacy data (or a very lightweight
switch
 between UTF-8 and UTF-16). And for those cases where you have to work with
 legacy data and yet don't want to incur the overhead of encoding
conversion in
 and out, you can still treat the external strings as byte arrays instead
of
 strings, assuming you have a "byte" data type, and do direct byte
manipulation
 on them. That's essentially what you would have been doing anyway if you
had
 used the old char == byte model I see in your docs. You just call it
"byte"
 instead of "char" so it doesn't end up being your default string type.

 Having a modern UTF-16 char type, separate from arrays of "byte", gives
you a
 consistency that allows for the creation of great libraries (since text is
such

start, and
 their libraries universally use a single string type. Perl figured it out
pretty
 late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's
never
 clear which CPAN modules will work and which ones will fail, so you have
to use
 pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

 I hope you'll consider making this change to your design. Have an 8-bit
unsigned
 "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this
"8-bit
 char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff
or I'm
 quite sure you'll later regret it. C/C++ are in that sorry state for
legacy
 reasons only, not because their designers were foolish, but any new
language
 that intentionally copies that "design" is likely to regret that decision.
Hi,

There was a thread a year ago in the smalleiffel mailing list (starting at
http://groups.yahoo.com/group/smalleiffel/message/4075 ) about unicode strings
in Eiffel. It's a quite interesting read about the problems of adding
string-like Unicode classes. The main point is that true Unicode support is
very difficult to achieve; just some libraries provide good, correct and
complete unicode encoders/decoders/renderers/etc.

While I agree that some Unicode support is a necessity today (my mother tongue
is brazilian portuguese so I use non-ascii characters everyday), we can't just
add some base types and pretend everything is all right. We won't correct
incorrectly written code with a primitive unicode string. Most programmers
don't think about unicode when they develop their software, so almost every
line of code dealing with text contains some assumptions about the character
sets being used. Java has a primitive 16 bit char, but basic library functions
(because they need good performance) use incorrect code for string handling
stuff (the correct classes are in java.text, providing means to correctly
collate strings). Sometimes we are just using plain old ASCII but we're bitten
by the encoding issues. And when we need true unicode support, the libraries
trick us into believing everything is ok.

IMO D should support a simple char array to deal with ASCII (as it does today)
and some kind of standard library module to deal with unicode glyphs and text.
This could be included in phobos or even in deimos. Any volunteers? With this
we could force the programmer to deal with another set of tools (albeit
similar) when dealing with each kind of string: ASCII or unicode. This module
should allow creation of variable sized strings and glyphs through an opaque
ADT. Each kind of usage has different semantics and optimization strategies
(e.g. Boyer-Moore is good for ASCII but with unicode the space and time usage
are worse).

Best regards,
Daniel Yokomiso.

P.S.: I had to write some libraries and components (EJBs) in several Java
projects to deal with data-transfer in plain ASCII (communication with IBM
mainframes). Each day I dreamed of using a language with simple one byte
character strings, without problems with encoding and endianness (Solaris vs.
Linux vs. Windows NT have some nice "features" in their JVMs if you aren't
careful when writing Java code that uses "ASCII" Strings). But Java has a 16
bit character type and a SIGNED byte type, both awkward for this usage. A
language shouldn't get in the way of simple code.

"Never argue with an idiot. They drag you down to their level then beat you
with experience."
Jan 17 2003
parent reply "Walter" <walter digitalmars.com> writes:
I once wrote a large project that dealt with mixed ascii and unicode. There
was bug after bug when the two collided. Finally, I threw in the towel and
made the entire program unicode - every string in it.

The trouble in D is that in the current scheme, everything dealing with text
has to be written twice, once for char[] and again for wchar_t[]. In C,
there's that wretched tchar.h to swap back and forth. It may just be easier
in the long run to just make UTF-8 the native type, and then at least try
and make sure the standard D library is correct.

-Walter
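
For concreteness, here is a minimal sketch of the "write it once" idea in the
D that later took this route (the foreach form and the std.stdio import are
assumptions relative to this 2003 discussion, not anything proposed above):
one routine serves every UTF-8 char[] string, and only code that cares about
code points has to look at the byte patterns.

    import std.stdio : writeln;

    // Count Unicode code points in a UTF-8 char[] by counting the bytes
    // that are not continuation bytes (10xxxxxx).
    size_t codePointCount(const(char)[] s)
    {
        size_t n = 0;
        foreach (char c; s)
            if ((c & 0xC0) != 0x80)
                ++n;
        return n;
    }

    void main()
    {
        const(char)[] s = "naïve";                   // 6 bytes, 5 code points
        writeln(s.length, " ", codePointCount(s));   // prints: 6 5
    }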

"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
news:b0agdq$ni9$1 digitaldaemon.com...
 "globalization guy" <globalization_member pathlink.com> escreveu na
mensagem
 news:b05pdd$13bv$1 digitaldaemon.com...
 I think you'll be making a big mistake if you adopt C's obsolete char ==
byte
 concept of strings. Savvy language designers these days realize that,
like
 int's
 and float's, char's should be a fundamental data type at a higher-level
of
 abstraction than raw bytes. The model that most modern language
designers
 are
 turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

 If you do so, you make it possible for strings in your language to have
a
 single, canonical form that all APIs use. Instead of the nightmare that
C/C++
 programmers face when passing string parameters ("now, let's see, is
this
 a
 char* or a const char* or an ISO C++ string or an ISO wstring or a
wchar_t* or a
 char[] or a wchar_t[] or an instance of one of countless string
classes...?).
 The fact that not just every library but practically every project feels
the
 need to reinvent its own string type is proof of the need for a good,
solid,
 canonical form built right into the language.

 Most language designers these days either get this from the start of
they
 later
 figure it out and have to screw up their language with multiple string
types.
 Having canonical UTF-16 chars and strings internally does not mean that
you
 can't deal with other character encodings externally. You can can
convert
 to
 canonical form on import and convert back to some legacy encoding on
export.
 When you create the strings yourself, or when they are created in Java
or

 Javascript or default XML or most new text protocols, no conversion will
be
 necessary. It will only be needed for legacy data (or a very lightweight
switch
 between UTF-8 and UTF-16). And for those cases where you have to work
with
 legacy data and yet don't want to incur the overhead of encoding
conversion in
 and out, you can still treat the external strings as byte arrays instead
of
 strings, assuming you have a "byte" data type, and do direct byte
manipulation
 on them. That's essentially what you would have been doing anyway if you
had
 used the old char == byte model I see in your docs. You just call it
"byte"
 instead of "char" so it doesn't end up being your default string type.

 Having a modern UTF-16 char type, separate from arrays of "byte", gives
you a
 consistency that allows for the creation of great libraries (since text
is
 such

start, and
 their libraries universally use a single string type. Perl figured it
out
 pretty
 late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's
never
 clear which CPAN modules will work and which ones will fail, so you have
to use
 pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

 I hope you'll consider making this change to your design. Have an 8-bit
unsigned
 "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this
"8-bit
 char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff
or I'm
 quite sure you'll later regret it. C/C++ are in that sorry state for
legacy
 reasons only, not because their designers were foolish, but any new
language
 that intentionally copies that "design" is likely to regret that
decision.

 Hi,

     There was a thread a year ago in the smalleiffel mailing list
(starting
 at http://groups.yahoo.com/group/smalleiffel/message/4075 ) about unicode
 strings in Eiffel. It's a quite interesting read about the problems of
 adding string-like Unicode classes. The main point is that true Unicode
 support is very difficult to achieve just some libraries provide good,
 correct and complete unicode encoders/decoders/renderers/etc.
     While I agree that some Unicode support is a necessity today (main
 mother tongue is brazilian portuguese so I use non-ascii characters
 everyday), we can't just add some base types and pretend everything is
 allright. We won't correct incorrect written code with a primitive unicode
 string. Most programmers don't think about unicode when they develop their
 software, so almost every line of code dealing with texts contain some
 assumptions about the character sets being used. Java has a primitive 16
bit
 char, but basic library functions (because they need good performance) use
 incorrect code for string handling stuff (the correct classes are in
 java.text, providing means to correctly collate strings). Some times we
are
 just using plain old ASCII but we're bitten by the encoding issues. And
when
 we need to deal with true unicode support the libraries tricky us into
 believing everything is ok.
     IMO D should support a simple char array to deal with ASCII (as it
does
 today) and some kind of standard library module to deal with unicode
glyphs
 and text. This could be included in phobos or even in deimos. Any
 volunteers? With this we could force the programmer to deal with another
set
 of tools (albeit similar) when dealing with each kind of string: ASCII or
 unicode. This module should allow creation of variable sized string and
 glyphs through an opaque ADT. Each kind of usage has different semantics
and
 optimization strategies (e.g. Boyer-Moore is good for ASCII but with
unicode
 the space and time usage are worse).

     Best regards,
     Daniel Yokomiso.

 P.S.: I had to written some libraries and components (EJBs) in several
Java
 projects to deal with data-transfer in plain ASCII (communication with IBM
 mainframes). Each day I dreamed of using a language with simple one byte
 character strings, without problems with encoding and endianess (Solaris
vs.
 Linux vs. Windows NT have some nice "features" in their JVMs if you aren't
 careful when writing Java code that uses "ASCII" String). But Java has a
16
 bit character type and a SIGNED byte type, both awkward for this usage. A
 language shouldn't get in the way of simple code.

 "Never argue with an idiot. They drag you down to their level then beat
you
 with experience."


 ---
 Outgoing mail is certified Virus Free.
 Checked by AVG anti-virus system (http://www.grisoft.com).
 Version: 6.0.443 / Virus Database: 248 - Release Date: 11/1/2003
Jan 17 2003
next sibling parent reply "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> writes:
"Walter" <walter digitalmars.com> escreveu na mensagem
news:b0b0up$vk7$1 digitaldaemon.com...
 I once wrote a large project that dealt with mixed ascii and unicode. There
 was bug after bug when the two collided. Finally, I threw in the towel and
 made the entire program unicode - every string in it.

 The trouble in D is that in the current scheme, everything dealing with text
 has to be written twice, once for char[] and again for wchar_t[]. In C,
 there's that wretched tchar.h to swap back and forth. It may just be easier
 in the long run to just make UTF-8 the native type, and then at least try
 and make sure the standard D library is correct.

 -Walter
[snip]

Hi,

    Current D uses char[] as the string type. If we declare each char to be
UTF-8 we'll have all the problems of what "myString[13] = someChar;" means.
I think an opaque string datatype may be better in this case. We could have
a glyph datatype that represents one unicode glyph in UTF-8 encoding, and
use it together with a string class. Also I don't think a mutable string
type is a good idea. Python and Java use immutable strings, and this leads
to better programs (you don't need to worry about copying your strings when
you get or give them). Some nice tricks, like caching hashCode results for
strings, are possible because the values won't change. We could also provide
a mutable string class.
    If this is the way to go we need lots of test cases, especially from
people with experience writing unicode libraries. The Unicode spec has lots
of particularities, like correct regular expression support, that may lead
to subtle bugs.

    Best regards,
    Daniel Yokomiso.

"Before you criticize someone, walk a mile in their shoes. That way you're a
mile away and you have their shoes, too."
Jan 18 2003
next sibling parent Theodore Reed <rizen surreality.us> writes:
On Sat, 18 Jan 2003 12:51:42 -0300
"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote:

      Current D uses char[] as the string type. If we declare each char to be
  UTF-8 we'll have all the problems with what does "myString[13] =
  someChar;" means. I think a opaque string datatype may be better in
  this case. We could have a glyph datatype that represents one unicode
  glyph in UTF-8 encoding, and use it together with a string class. Also
So what does "myString[13] = someGlyph" mean? char doesn't have to be a byte, we can have another data byte for that. -- Theodore Reed (rizen/bancus) -==- http://www.surreality.us/ ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~ "Yesterday no longer exists Tomarrow's forever a day away And we are cell-mates, held together in the shoreless stream that is today."
Jan 18 2003
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
news:b0bpq9$1d3d$1 digitaldaemon.com...
     Current D uses char[] as the string type. If we declare each char to be
 UTF-8 we'll have all the problems with what does "myString[13] = someChar;"
 means. I think a opaque string datatype may be better in this case. We could
 have a glyph datatype that represents one unicode glyph in UTF-8 encoding,
 and use it together with a string class.
I'm thinking that myString[13] should simply set the byte at myString[13].
Trying to fiddle with the multibyte stuff with simple array access semantics
just looks to be too confusing and error prone. Access to the unicode
characters in it would be via a function or property.
 Also I don't think a mutable string
 type is a good idea. Python and Java use immutable strings, and this leads
 to better programs (you don't need to worry about copying your strings when
 you get or give them). Some nice tricks, like caching hashCode results for
 strings are possible, because the values won't change. We could also provide
 a mutable string class.
I think the copy-on-write approach to strings is the right idea.
Unfortunately, if done by the language semantics, it can have severe adverse
performance results (think of a toupper() function, copying the string again
each time a character is converted). Using it instead as a coding style,
which is currently how it's done in Phobos, seems to work well. My
javascript implementation (DMDScript) does cache the hash for each string,
and that works well for the semantics of javascript. But I don't think it is
appropriate for a lower-level language like D to do as much for strings.
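
A sketch of copy-on-write as a coding convention rather than a language rule
(the helper name is assumed and it is ASCII-only to stay short; this is not
the actual Phobos source): the input is aliased until the first character
that actually changes, so an already-uppercase string is never copied.

    // Returns the input unchanged if no character needs converting,
    // otherwise copies it exactly once and edits the copy in place.
    char[] toUpperCOW(char[] s)
    {
        char[] r = s;                      // alias the input to start with
        foreach (i, char c; s)
        {
            if (c >= 'a' && c <= 'z')
            {
                if (r is s)
                    r = s.dup;             // first change: copy once
                r[i] = cast(char)(c - ('a' - 'A'));
            }
        }
        return r;
    }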
     If this is the way to go we need lots of test cases, specially from
 people with experience writing unicode libraries. The Unicode spec has lots
 of particularities, like correct regular expression support, that may lead
 to subtle bugs.
Regular expression implementations naturally lend themselves to subtle bugs :-(. Having a good test suite is a lifesaver.
Jan 18 2003
next sibling parent reply Burton Radons <loth users.sourceforge.net> writes:
Walter wrote:
 "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
 news:b0bpq9$1d3d$1 digitaldaemon.com...
 
Current D uses char[] as the string type. If we declare each char to be
UTF-8 we'll have all the problems with what does "myString[13] = someChar;"
means. I think a opaque string datatype may be better in this case. We could
have a glyph datatype that represents one unicode glyph in UTF-8 encoding,
and use it together with a string class.
I'm thinking that myString[13] should simply set the byte at myString[13].
Trying to fiddle with the multibyte stuff with simple array access semantics
just looks to be too confusing and error prone. To access the unicode
characters from it would be via a function or property.
I disagree. Returning the character makes indexing expensive, but it has the
expected result and for the most part hides the fact that compaction is
going on automatically; the only rule change is that indexed assignment can
invalidate any slices and copies, which isn't any worse than D's current
rules. Then char.size will be 4 and char.max will be 0x10FFFF or 0x7FFFFFFF,
depending upon whether we use UNICODE or ISO-10646 for our UTF-8.

I also think that incrementing a char pointer should read the data to
determine how many bytes it needs to skip. It should be as transparent as
possible!

If it can't be transparent, then it should use a class or be limited: no
indexing, no char pointers. I don't like either option.

[snip]
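
A sketch of the decoding step such transparency needs (the names are
hypothetical and malformed input is not validated): the lead byte tells how
many bytes the character occupies, so indexing can return a full code point
and "increment the pointer" means "skip stride bytes".

    struct Decoded { dchar c; size_t stride; }

    // Decode the code point that starts at byte offset i of a well-formed
    // UTF-8 array, along with the number of bytes it occupies.
    Decoded decodeAt(const(char)[] s, size_t i)
    {
        ubyte b = cast(ubyte) s[i];
        if (b < 0x80)
            return Decoded(cast(dchar) b, 1);        // plain ASCII
        size_t len = (b & 0xE0) == 0xC0 ? 2
                   : (b & 0xF0) == 0xE0 ? 3 : 4;
        uint c = b & (0x7F >> len);                  // payload bits of the lead byte
        foreach (j; 1 .. len)
            c = (c << 6) | (cast(ubyte) s[i + j] & 0x3F);  // 6 bits per trailing byte
        return Decoded(cast(dchar) c, len);
    }

With something along these lines, reading "character at offset i" is
decodeAt(s, i).c, and advancing by one character skips decodeAt(s, i).stride
bytes.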
Jan 18 2003
parent "Walter" <walter digitalmars.com> writes:
"Burton Radons" <loth users.sourceforge.net> wrote in message
news:b0cgdd$1t4o$1 digitaldaemon.com...
 [snip]
Obviously, this needs more thought by me.
Jan 18 2003
prev sibling next sibling parent reply "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> writes:
"Walter" <walter digitalmars.com> escreveu na mensagem
news:b0c66n$1mq6$1 digitaldaemon.com...
 "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
 news:b0bpq9$1d3d$1 digitaldaemon.com...
     Current D uses char[] as the string type. If we declare each char to
be
 UTF-8 we'll have all the problems with what does "myString[13] =
someChar;"
 means. I think a opaque string datatype may be better in this case. We
could
 have a glyph datatype that represents one unicode glyph in UTF-8
encoding,
 and use it together with a string class.
I'm thinking that myString[13] should simply set the byte at myString[13]. Trying to fiddle with the multibyte stuff with simple array access
semantics
 just looks to be too confusing and error prone. To access the unicode
 characters from it would be via a function or property.
That's why I think it should be an opaque, immutable data type.
 Also I don't think a mutable string
 type is a good idea. Python and Java use immutable strings, and this leads
 to better programs (you don't need to worry about copying your strings when
 you get or give them). Some nice tricks, like caching hashCode results for
 strings are possible, because the values won't change. We could also provide
 a mutable string class.
I think the copy-on-write approach to strings is the right idea.
Unfortunately, if done by the language semantics, it can have severe adverse
performance results (think of a toupper() function, copying the string again
each time a character is converted). Using it instead as a coding style,
which is currently how it's done in Phobos, seems to work well. My
javascript implementation (DMDScript) does cache the hash for each string,
and that works well for the semantics of javascript. But I don't think it is
appropriate for a lower-level language like D to do as much for strings.

     If this is the way to go we need lots of test cases, specially from
 people with experience writing unicode libraries. The Unicode spec has lots
 of particularities, like correct regular expression support, that may lead
 to subtle bugs.
Regular expression implementations naturally lend themselves to subtle bugs
:-(. Having a good test suite is a lifesaver.
    Not if you write a "correct" regular expression implementation. If you
implement it right from scratch using simple NFAs you probably won't have
any headaches. I've implemented a toy regex machine in Java based on Mark
Jason Dominus' excellent article "How Regexes Work" at
http://perl.plover.com/Regex/ It's very simple and quite fast as it's a dumb
implementation without any kind of optimizations (4 times slower than a fast
bytecode regex interpreter in Java,
http://jakarta.apache.org/regexp/index.html). Also the source code is a lot
cleaner. BTW I've written a unit test suite based on the Jakarta Regexp set
of tests. I can port it to D if you like and use it with your regex
implementation.
Jan 18 2003
parent "Walter" <walter digitalmars.com> writes:
"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
news:b0cond$222q$1 digitaldaemon.com...
 BTW I've written a unit test suite based on Jakarta Regexp set of tests. I
 can port it to D if you like and use it with your regex implementation.
At the moment I'm using Spencer's regex test suite augmented with a bunch of new test vectors. More testing is better, so yes I'm interested in better & more comprehensive tests.
Jan 18 2003
prev sibling parent reply "Sean L. Palmer" <seanpalmer directvinternet.com> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b0c66n$1mq6$1 digitaldaemon.com...
 I think the copy-on-write approach to strings is the right idea.
 Unfortunately, if done by the language semantics, it can have severe adverse
 performance results (think of a toupper() function, copying the string again
 each time a character is converted). Using it instead as a coding style,
Copy-on-write usually doesn't copy unless there's more than one live
reference to the string.  If you're actively modifying it, it'll only make
one copy until you distribute the new reference.  Of course that means
reference counting.  Perhaps the GC could store info about string use.
Jan 19 2003
parent Ilya Minkov <midiclub 8ung.at> writes:
Sean L. Palmer wrote:
 "Walter" <walter digitalmars.com> wrote in message
 news:b0c66n$1mq6$1 digitaldaemon.com...
 
I think the copy-on-write approach to strings is the right idea.
Unfortunately, if done by the language semantics, it can have severe adverse
performance results (think of a toupper() function, copying the string again
each time a character is converted). Using it instead as a coding style,
Copy-on-write usually doesn't copy unless there's more than one live
reference to the string.  If you're actively modifying it, it'll only make
one copy until you distribute the new reference.  Of course that means
reference counting.  Perhaps the GC could store info about string use.
That's not gonna work, because there's no reliable way you can get this data
from the GC outside a mark phase. The Delphi string implementation is
ref-counted, and is said to be extremely slow. So it's better to copy and
forget the rest than to count at every assignment. You'll just have one more
reason to optimise the GC then. :)

IMO, the amount of copying should be limited by merging the operations
together.
Jan 20 2003
prev sibling parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b0b0up$vk7$1 digitaldaemon.com...
 I once wrote a large project that dealt with mixed ascii and unicode. There
 was bug after bug when the two collided. Finally, I threw in the towel and
 made the entire program unicode - every string in it.

 The trouble in D is that in the current scheme, everything dealing with text
 has to be written twice, once for char[] and again for wchar_t[]. In C,
 there's that wretched tchar.h to swap back and forth. It may just be easier
 in the long run to just make UTF-8 the native type, and then at least try
 and make sure the standard D library is correct.
I've gotten a little confused reading this thread. Here are some questions
swimming in my head:
1) What does it mean to make UTF-8 the native type?
2) What is char.size?
3) Does char[] differ from byte[] or is it a typedef?
4) How does one get a UTF-16 encoding of a char[], or get the length, or get
the 5th character, or set the 5th character to a given unicode character
(expressed in UTF-16, say)?

Here are my guesses at the answers:
1) string literals are encoded in UTF-8
2) char.size = 8
3) it's a typedef
4) through the library or directly if you know enough about the char[] you
are manipulating.

Is this correct?

thanks,
-Ben
 -Walter

 "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message
 news:b0agdq$ni9$1 digitaldaemon.com...
 "globalization guy" <globalization_member pathlink.com> escreveu na
mensagem
 news:b05pdd$13bv$1 digitaldaemon.com...
 I think you'll be making a big mistake if you adopt C's obsolete char
==
 byte
 concept of strings. Savvy language designers these days realize that,
like
 int's
 and float's, char's should be a fundamental data type at a
higher-level
 of
 abstraction than raw bytes. The model that most modern language
designers
 are
 turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

 If you do so, you make it possible for strings in your language to
have
 a
 single, canonical form that all APIs use. Instead of the nightmare
that
 C/C++
 programmers face when passing string parameters ("now, let's see, is
this
 a
 char* or a const char* or an ISO C++ string or an ISO wstring or a
wchar_t* or a
 char[] or a wchar_t[] or an instance of one of countless string
classes...?).
 The fact that not just every library but practically every project
feels
 the
 need to reinvent its own string type is proof of the need for a good,
solid,
 canonical form built right into the language.

 Most language designers these days either get this from the start of
they
 later
 figure it out and have to screw up their language with multiple string
types.
 Having canonical UTF-16 chars and strings internally does not mean
that
 you
 can't deal with other character encodings externally. You can can
convert
 to
 canonical form on import and convert back to some legacy encoding on
export.
 When you create the strings yourself, or when they are created in Java
or

 Javascript or default XML or most new text protocols, no conversion
will
 be
 necessary. It will only be needed for legacy data (or a very
lightweight
 switch
 between UTF-8 and UTF-16). And for those cases where you have to work
with
 legacy data and yet don't want to incur the overhead of encoding
conversion in
 and out, you can still treat the external strings as byte arrays
instead
 of
 strings, assuming you have a "byte" data type, and do direct byte
manipulation
 on them. That's essentially what you would have been doing anyway if
you
 had
 used the old char == byte model I see in your docs. You just call it
"byte"
 instead of "char" so it doesn't end up being your default string type.

 Having a modern UTF-16 char type, separate from arrays of "byte",
gives
 you a
 consistency that allows for the creation of great libraries (since
text
 is
 such

start, and
 their libraries universally use a single string type. Perl figured it
out
 pretty
 late and as a result, with the addition of UTF-8 to Perl in v. 5.6,
it's
 never
 clear which CPAN modules will work and which ones will fail, so you
have
 to use
 pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

 I hope you'll consider making this change to your design. Have an
8-bit
 unsigned
 "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this
"8-bit
 char plus 16-bit wide char on Win32 and 32-bit wide char on Linux"
stuff
 or I'm
 quite sure you'll later regret it. C/C++ are in that sorry state for
legacy
 reasons only, not because their designers were foolish, but any new
language
 that intentionally copies that "design" is likely to regret that
decision.

 Hi,

     There was a thread a year ago in the smalleiffel mailing list
(starting
 at http://groups.yahoo.com/group/smalleiffel/message/4075 ) about
unicode
 strings in Eiffel. It's a quite interesting read about the problems of
 adding string-like Unicode classes. The main point is that true Unicode
 support is very difficult to achieve just some libraries provide good,
 correct and complete unicode encoders/decoders/renderers/etc.
     While I agree that some Unicode support is a necessity today (main
 mother tongue is brazilian portuguese so I use non-ascii characters
 everyday), we can't just add some base types and pretend everything is
 allright. We won't correct incorrect written code with a primitive
unicode
 string. Most programmers don't think about unicode when they develop
their
 software, so almost every line of code dealing with texts contain some
 assumptions about the character sets being used. Java has a primitive 16
bit
 char, but basic library functions (because they need good performance)
use
 incorrect code for string handling stuff (the correct classes are in
 java.text, providing means to correctly collate strings). Some times we
are
 just using plain old ASCII but we're bitten by the encoding issues. And
when
 we need to deal with true unicode support the libraries tricky us into
 believing everything is ok.
     IMO D should support a simple char array to deal with ASCII (as it
does
 today) and some kind of standard library module to deal with unicode
glyphs
 and text. This could be included in phobos or even in deimos. Any
 volunteers? With this we could force the programmer to deal with another
set
 of tools (albeit similar) when dealing with each kind of string: ASCII
or
 unicode. This module should allow creation of variable sized string and
 glyphs through an opaque ADT. Each kind of usage has different semantics
and
 optimization strategies (e.g. Boyer-Moore is good for ASCII but with
unicode
 the space and time usage are worse).

     Best regards,
     Daniel Yokomiso.

 P.S.: I had to written some libraries and components (EJBs) in several
Java
 projects to deal with data-transfer in plain ASCII (communication with
IBM
 mainframes). Each day I dreamed of using a language with simple one byte
 character strings, without problems with encoding and endianess (Solaris
vs.
 Linux vs. Windows NT have some nice "features" in their JVMs if you
aren't
 careful when writing Java code that uses "ASCII" String). But Java has a
16
 bit character type and a SIGNED byte type, both awkward for this usage.
A
 language shouldn't get in the way of simple code.

 "Never argue with an idiot. They drag you down to their level then beat
you
 with experience."


 ---
 Outgoing mail is certified Virus Free.
 Checked by AVG anti-virus system (http://www.grisoft.com).
 Version: 6.0.443 / Virus Database: 248 - Release Date: 11/1/2003
Jan 18 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Ben Hinkle" <bhinkle mathworks.com> wrote in message
news:b0bvoh$1hm5$1 digitaldaemon.com...
 I've gotten a little confused reading this thread. Here are some questions
 swimming in my head:
 1) What does it mean to make UTF-8 the native type?
From a compiler standpoint, all it really means is that string literals are encoded as UTF-8. The real support for it will be in the runtime library, such as UTF-8 support in printf().
 2) What is char.size?
It'll be 1.
 3) Does char[] differ from byte[] or is it a typedef?
It differs in that it can be overloaded differently, and the compiler recognizes char[] as special when doing casts to other array types - it can do conversions between UTF-8 and UTF-16, for example.
 4) How does one get a UTF-16 encoding of a char[],
At the moment, I'm thinking:

    wchar[] w;
    char[] c;
    w = cast(wchar[])c;

to do a UTF-8 to UTF-16 conversion.
 or get the length,
To get the length in bytes:

    c.length

to get the length in UCS-4 characters, perhaps:

    c.nchars

??
 or get
 the 5th character, or set the 5th character to a given unicode character
 (expressed in UTF-16, say)?
Probably a library function.
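
For reference, a sketch of how those pieces look in the D and Phobos that
eventually shipped (std.utf.count, the decoding foreach and dchar-append are
later features, so treat the exact names here as assumptions, not part of
this proposal):

    import std.utf : count;

    void demo()
    {
        char[] c = "héllo".dup;      // UTF-8 storage

        wchar[] w;
        foreach (dchar ch; c)        // foreach decodes UTF-8 into code points
            w ~= ch;                 // appending a dchar to wchar[] encodes UTF-16

        auto bytes = c.length;       // length in UTF-8 code units (bytes): 6
        auto chars = count(c);       // length in code points: 5
    }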
Jan 18 2003
next sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
The best way to handle Unicode is, as a previous poster suggested, to make
UTF-16 the default and tack on ASCII conversions in the runtime library.  Not
the other way around.  Legacy stuff should be runtime lib, modern stuff
built-in.  Otherwise we are building a language on outdated standards.

I don't like typecasting hacks or half-measures.  Besides, typecasting by
definition should not change the size of its argument.

Mark
Jan 18 2003
parent reply "Walter" <walter digitalmars.com> writes:
You're probably right, the typecasting hack is inconsistent enough with the
way the rest of the language works that it's probably a bad idea.

As for why UTF-16 instead of UTF-8, why do you find it preferable?

"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b0ccek$1qnh$1 digitaldaemon.com...
 The best way to handle Unicode is, as a previous poster suggested, to make
 UTF-16 the default and tack on ASCII conversions in the runtime library. Not
 the other way around.  Legacy stuff should be runtime lib, modern stuff
 built-in.  Otherwise we are building a language on outdated standards.

 I don't like typecasting hacks or half-measures.  Besides, typecasting by
 definition should not change the size of its argument.

 Mark
Jan 18 2003
parent reply Mark Evans <Mark_member pathlink.com> writes:
Walter asked,
As for why UTF-16 instead of UTF-8, why do you find it preferable?
If one wants to do serious internationalized applications it is mandatory.
China, Japan, India for example.  China and India by themselves encompass
hundreds of languages and dialects that use non-Western glyphs.

My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and
SGML folks) complain that in their language work, not even UTF-16 is good
enough.  They push for 32 bits!

I would not go that far, but UTF-16 is a very sensible, capable format for
the majority of languages.

Mark
Jan 20 2003
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b0itlo$2a46$1 digitaldaemon.com...
 If one wants to do serious internationalized applications it is mandatory.
 China, Japan, India for example.  China and India by themselves encompass
 hundreds of languages and dialects that use non-Western glyphs.
UTF-8 can handle that.
 My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and
 SGML folks) complain that in their language work, not even UTF-16 is good
 enough.  They push for 32 bits!
UTF-16 has 2^20 characters in it. UTF-8 has 2^31 characters.
 I would not go that far, but UTF-16 is a very sensible, capable format for
 the majority of languages.
The only advantage it has over UTF-8 is it is more compact for some
languages. UTF-8 is more compact for the rest.
Jan 21 2003
parent reply Mark Evans <Mark_member pathlink.com> writes:
Well OK I should have been clearer.  You are right about sheer numerical
quantity, but read the FAQ at Unicode.org (excerpted below).  Numerical quantity
at the price of variable-width codes is a headache.  UTF-16 has variable width,
but not as variable as UTF-8, and nowhere near as frequently.

UTF-16 is the Windows standard.  It's a sweet spot for Unicode, which was
originally a pure 16-bit design.  The Unicode leaders advocate UTF-16 and I
accept their wisdom.

The "real deal" with UTF-8 is that it's a retrofit to accommodate legacy ASCII
that we all know and love.  So again I would argue that UTF-8 qualifies in a
certain sense as "legacy support," and should therefore go in the runtime, not
the core code.

I'd go even further and not use 'char' with any meaning other than UTF-16.  I
never liked the Windows char/wchar goofiness.  A language should only have one
type of char and the runtimes can support conversions of language-standard chars
to other formats.  Trying to shimmy 'alternative characters' into C was a bad
idea.  The wonderful thing about designing a new language is that you can do it
right.  (Implementation details at http://www.unicode.org/reports/tr27/ )

Mark

http://www.unicode.org/faq/utf_bom.html
-----------------------------------------------
"Most Unicode APIs are using UTF-16."
-----------------------------------------------
"UTF-8 will be most common on the web. UTF16, UTF16LE, UTF16BE are used by Java
and Windows."
[BE and LE mean Big Endian and Little Endian.]
-----------------------------------------------
"Unicode was originally designed as a pure 16-bit encoding, aimed at
representing all modern scripts."
-----------------------------------------------
[UTF-8 can have anywhere from 1 to 4 code blocks so it's highly variable.
UTF-16 almost always has one code block, and in rare 1% cases, two; but no more.
This is important in the Asian context:]
"East Asians (Chinese, Japanese, and Koreans) are ... are well acquainted with
the problems that variable-width codes ... have caused....With UTF-16,
relatively few characters require 2 units. The vast majority of characters in
common use are single code units.  Even in East Asian text, the incidence of
surrogate pairs should be well less than 1% of all text storage on average."
-----------------------------------------------
"Furthermore, both Unicode and ISO 10646 have policies in place that formally
limit even the UTF-32 encoding form to the integer range that can be expressed
with UTF-16 (or 21 significant bits)."
-----------------------------------------------
"We don't anticipate a general switch to UTF-32 storage for a long time (if
ever)....The chief selling point for Unicode was providing a representation for
all the world's characters.... These features were enough to swing industry to
the side of using Unicode (UTF-16)."
-----------------------------------------------
Jan 21 2003
parent Mark Evans <Mark_member pathlink.com> writes:
Quick follow-up.  Even the extra space in UTF-8 will probably not be used in the
future, and UTF-8 vs. UTF-16 are going to be neck-and-neck in terms of
storage/performance over time.  So I see no compelling reason for UTF-8 except
its legacy ties to 7-bit ASCII.  I think of UTF-8 as "ASCII with Unicode paint."

Mark

http://www-106.ibm.com/developerworks/library/utfencodingforms/
"Storage vs. performance
Both UTF-8 and UTF-16 are substantially more compact than UTF-32, when averaging
over the world's text in computers. UTF-8 is currently more compact than UTF-16
on average, although it is not particularly suited for East-Asian text because
it occupies about 3 bytes of storage per code point. UTF-8 will probably end up
as about the same as UTF-16 over time, and may end up being less compact on
average as computers continue to make inroads into East and South Asia. Both
UTF-8 and UTF-16 offer substantial advantages over UTF-32 in terms of storage
requirements."

http://czyborra.com/utf/
"Actually, UTF-8 continues to represent up to 31 bits with up to 6 bytes, but it
is generally expected that the one million code points of the 20 bits offered by
UTF-16 and 4-byte UTF-8 will suffice to cover all characters and that we will
never get to see any Unicode character definitions beyond that."
Jan 21 2003
prev sibling parent reply Ilya Minkov <midiclub tiscali.de> writes:
Mark Evans wrote:
 Walter asked,
 
As for why UTF-16 instead of UTF-8, why do you find it preferable?
If one wants to do serious internationalized applications it is mandatory. China, Japan, India for example. China and India by themselves encompass hundreds of languages and dialects that use non-Western glyphs. My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and SGML folks) complain that in their language work, not even UTF-16 is good enough. They push for 32 bits!
Could someone explain to me *what's the difference*?

I thought there was one unicode set, which encodes *everything*. Then, there
are different "wrappings" of it, like UTF-8, UTF-16 and so on. They do the
same by assigning blocks, where multiple "characters" of 8, 16, or however
many bits compose a final character value. And a lot of optimisation can be
done, because it is not likely that each next symbol will be from a
different language, since natural language usually consists of words,
sentences, and so on. In UTF-8 there are sequences, consisting of
header-data, where the header encodes the language/code and the length of
the text, so that some data is generalized and need not be transferred with
every symbol, and so that a character in a certain encoding can take as many
target system characters as it needs.

As far as I understood, UTF-7 is the shortest encoding for latin text, but
it would be less optimal for some multi-hundred-character sets than a
generally wider encoding.

Please, someone correct me if i'm wrong. But if i'm right, Russian, arabic,
and other "tiny" alphabets would only experience a minor "fat-ratio" with
UTF-8, since they require few, not many more, symbols than latin. That is,
only headers and no further overhead.

Can anyone tell me: taken the same newspaper article in chinese, japanese,
or some other "wide" language, encoded in UTF-7, 8, 16, 32 and so on: how
much space would it take? Which languages suffer more and which less from
"small" UTF encodings?

-i.
 
 I would not go that far, but UTF-16 is a very sensible, capable format for the
 majority of languages.
 
 Mark
 
 
Jan 22 2003
next sibling parent Ilya Minkov <midiclub tiscali.de> writes:
Ilya Minkov wrote:
 Could someone explain me *what's the difference*? ...
I see my understanding was confirmed.
 Can anyone tell me: taken the same newspaper article in chinese, 
 japanese, or some other "wide" language, encoded in UTF7, 8, 16, 32 and 
 so on: how much space would it take? Which languages suffer more and 
 which less from "small" UTF encodigs?
This one remains. -i.
Jan 22 2003
prev sibling next sibling parent Mark Evans <Mark_member pathlink.com> writes:
Ilya Minkov says...
Could someone explain me *what's the difference*?
Take the trouble to read through the links supplied in the previous posts before asking redundant questions like this. Mark
Jan 22 2003
prev sibling parent reply Theodore Reed <rizen surreality.us> writes:
On Wed, 22 Jan 2003 15:26:56 +0100
Ilya Minkov <midiclub tiscali.de> wrote:

 sentences, and so on. In UFT8 there are sequences, consisting of
 header-data, where header encodes the language/code and the length of
 the text, so that some data is generalized and need not be tranferred
 with every symbol, and so that a character in a certain encoding can
 take as many target system characters is it needs.
That's not how UTF-8 works (although I've thought an RLE scheme like the one
you describe would be pretty good). In UTF-8 a glyph can be 1-4 bytes. If
the unicode value is below 0x80, it takes one byte. If it's between 0x80 and
0x7FF (inclusive), it takes two, etc.
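
Those boundaries, written out as a small helper (a sketch with an assumed
name; the thresholds follow the UTF-8 definition):

    // Number of bytes UTF-8 needs for a given code point.
    size_t utf8Bytes(dchar c)
    {
        if (c < 0x80)    return 1;   // ASCII
        if (c < 0x800)   return 2;   // most other alphabetic scripts
        if (c < 0x10000) return 3;   // rest of the Basic Multilingual Plane
        return 4;                    // supplementary planes
    }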
 As far as I understood, UTF7 is the shortest encoding for latin text, 
 but it would be less optimal for some multi-hunderd-character sets
 than a generally wider encoding.
Quite less than optimal.
 Please, someone correct me if i'm wrong. But if i'm right, Russian, 
 arabic, and other "tiny" alphabets would only experience a minor 
 "fat-ratio" with UTF8, since they requiere less not many more symbols 
 than latin. That is, only headers and no further overhead.
Most western alphabets would take 1-2 bytes per char. I think Arabic would take 3.
 Can anyone tell me: taken the same newspaper article in chinese, 
 japanese, or some other "wide" language, encoded in UTF7, 8, 16, 32
 and so on: how much space would it take? Which languages suffer more
 and which less from "small" UTF encodigs?
UTF-8 just flat takes less space over all. At most, it takes 4 bytes per
glyph, and for many glyphs it takes less.

The issue isn't really the space. It's the difficulty in dealing with an
encoding where you don't know how long the next glyph will be without
reading it. (Which also means that in order to access the glyph in the
middle, you have to start scanning from the front.)

-- 
Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"I hold it to be the inalienable right of anybody to go to hell in his own
way." -- Robert Frost
Jan 22 2003
next sibling parent Ilya Minkov <midiclub 8ung.at> writes:
Then considering UTF-16 might make sense...


I think there is a way to optimise UTF8 though: pre-scan the string and 
record character width changes in an array.
Jan 22 2003
prev sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
In UTF-8 a glyph can be 1-4 bytes.
Only if you live within the same dynamic range as UTF-16.  To get the full
effective UTF-8 dynamic range of 32 bits, UTF-8 employs up to six bytes.
With 4 bytes it has the same range as UTF-16.

"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for
the use of five- and six-byte sequences to encode characters that are
outside the range of the Unicode character set."
http://www.unicode.org/reports/tr27
The issue isn't really the space.
It's the difficulty in dealing with an encoding where you don't know how
long the next glyph will be without reading it.
Exactly. UTF-16 can have at most one extra code (in roughly 1% of cases). So
you have either one 16-bit word, or two.  UTF-8 is the absolute worst
encoding in this regard.  UTF-32 is the best (constant size).

The main selling point for D is that UTF-16 is the standard for Windows.
Windows is built on it.  Knowing Microsoft, they probably use a "slightly
modified Microsoft version" of UTF-16... that would not surprise me at all.

Mark
Jan 23 2003
parent reply "Serge K" <skarebo programmer.net> writes:
"Mark Evans" <Mark_member pathlink.com> wrote in message
news:b0qp5g$n73$1 digitaldaemon.com...
In UTF-8 a glyph can be 1-4 bytes.
Only if you live within the same dynamic range as UTF-16. To get the full
effective UTF-8 dynamic range of 32 bits, UTF-8 employs up to six bytes.
With 4 bytes it has the same range as UTF-16.
Actually, UTF-8, UTF-16 and UTF-32 all have the same range: [0..10FFFFh].
The UTF-8 encoding method can be extended up to six bytes max to encode the
UCS-4 character set, but that is way beyond Unicode.
 "The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows
for the
 use of five- and six-byte sequences to encode characters that are outside
the
 range of the Unicode character set."  http://www.unicode.org/reports/tr27
Please, do not post truncated citations. "The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters."
The issue isn't really the space.
It's the difficulty in dealing with an encoding where you don't know how
long the next glyph will be without reading it.
 Exactly. UTF-16 can have at most one extra code (in roughly 1% of cases). So
 you have either one 16-bit word, or two.  UTF-8 is the absolute worst
 encoding in this regard.  UTF-32 is the best (constant size).
For real-world applications, UTF-16 strings need those surrogates only to
access the CJK Ideograph extensions (~43000 characters). In most cases a
UTF-16 string can be treated as an array of UCS-2 characters.

A string object can store its length both in 16-bit units and in characters:
if these numbers are equal, it's a UCS-2 string with no surrogates inside.
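
A sketch of that fast-path test (the names are assumed): scan once for
surrogate code units; if there are none, the 16-bit length and the character
length coincide and plain indexing is safe.

    // True if the UTF-16 array contains any surrogate code units.
    bool hasSurrogates(const(wchar)[] s)
    {
        foreach (wchar u; s)
            if (u >= 0xD800 && u <= 0xDFFF)
                return true;
        return false;
    }
    // If !hasSurrogates(s): s.length is the character count and s[i] is the
    // i-th character; otherwise fall back to surrogate-aware routines.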
 The main selling point for D is that UTF-16 is the standard for Windows.
 Windows is built on it.  Knowing Microsoft, they probably use a "slightly
 modified Microsoft version" of UTF-16... that would not surprise me at all.
Surprise... It's a regular UTF-16. >8-P (Starting with Win2K+sp.)
WinNT 3.x & 4 support UCS-2 only, since that was the Unicode 2.0 encoding.

Any efficient prog. language must use UTF-16 for its Windows implementation -
otherwise it has to convert strings for any API function requiring string
parameters...
Jan 27 2003
parent reply "Walter" <walter digitalmars.com> writes:
"Serge K" <skarebo programmer.net> wrote in message
news:b17cd6$2n1l$1 digitaldaemon.com...
 Any efficient prog. language must use UTF-16 for Windows implementation -
 otherwise it have to convert strings for any API function requiring string
 parameters...
Not necessarily. While Win32 is now fully UTF-16 internally, and apparently
converts the strings in "A" api functions to UTF-16, because UTF-16 uses
double the memory it can still be far more efficient for an app to do all
its computation with UTF-8, and then convert when calling the windows api.
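
A sketch of that "convert only at the boundary" style, using the helper
later added to Phobos' std.utf (toUTF16z is an assumption relative to the D
of this thread, and the MessageBoxW call site is only indicated in a
comment):

    import std.utf : toUTF16z;   // UTF-8 -> zero-terminated UTF-16

    void showTitle(const(char)[] title)
    {
        const(wchar)* wtitle = toUTF16z(title);
        // ...pass wtitle to a "W" API here, e.g. MessageBoxW(null, wtitle, wtitle, 0);
    }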
Feb 03 2003
next sibling parent reply Theodore Reed <rizen surreality.us> writes:
On Mon, 3 Feb 2003 15:37:37 -0800
"Walter" <walter digitalmars.com> wrote:

 
 "Serge K" <skarebo programmer.net> wrote in message
 news:b17cd6$2n1l$1 digitaldaemon.com...
 Any efficient prog. language must use UTF-16 for Windows
 implementation - otherwise it have to convert strings for any API
 function requiring string parameters...
 Not necessarily. While Win32 is now fully UTF-16 internally, and apparently
 converts the strings in "A" api functions to UTF-16, because UTF-16 uses
 double the memory it can still be far more efficient for an app to do all
 its computation with UTF-8, and then convert when calling the windows api.
Plus, UTF-8 is pretty standard for Unicode on Linux. I believe BeOS used it,
too, although I could be wrong. I don't know what OSX uses, nor other
unices.

My point is that choosing a standard by what the underlying platform uses is
a bad idea.

-- 
Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"[...] for plainly, although every work of art is an expression, not every
expression is a work of art." -- DeWitt H. Parker, "The Principles of
Aesthetics"
Feb 04 2003
parent Mark Evans <Mark_member pathlink.com> writes:
My point is that choosing a standard by what the underlying platform
uses is a bad idea.
I agree with this remark, but think there are plenty of platform-independent
reasons for UTF-16.  The fact that Windows uses it just cements the case.

Mark
Feb 13 2003
prev sibling parent reply Mark Evans <Mark_member pathlink.com> writes:
Walter says...
 Any efficient prog. language must use UTF-16 for Windows implementation -
 otherwise it have to convert strings for any API function requiring string
 parameters...
Not necessarily. While Win32 is now fully UTF-16 internally, and apparently
converts the strings in "A" api functions to UTF-16, because UTF-16 uses
double the memory it can still be far more efficient for an app to do all
its computation with UTF-8, and then convert when calling the windows api.
Memory is cheap and getting cheaper, but processor time never loses value.

The supposition that UTF-8 needs less space is flawed anyway.  For some
languages, yes -- but not all.  My earlier citations indicate that
long-term, averaging over all languages, UTF-8 and UTF-16 will require
equivalent memory storage.

UTF-8 code is also harder to write because UTF-8 is just more complicated
than UTF-16.  The only reason for its popularity is that it's a fig leaf for
people who really want to use ASCII.  They can use ASCII and call it UTF-8.
Not very forward-thinking.

Microsoft had good reasons for selecting UTF-16 and D should follow suit.
Other languages are struggling with Unicode support, and it would be nice to
have one language out up front in this area.

Mark
Feb 13 2003
parent "Serge K" <skarebo programmer.net> writes:
 The supposition that UTF-8 needs less space is flawed anyway.  For some
 languages, yes -- but not all.  My earlier citations indicate that
 long-term, averaging over all languages, UTF-8 and UTF-16 will require
 equivalent memory storage.

 UTF-8 code is also harder to write because UTF-8 is just more complicated
 than UTF-16.  The only reason for its popularity is that it's a fig leaf
 for people who really want to use ASCII.  They can use ASCII and call it
 UTF-8.  Not very forward-thinking.

 Microsoft had good reasons for selecting UTF-16 and D should follow suit.
 Other languages are struggling with Unicode support, and it would be nice
 to have one language out up front in this area.

 Mark
http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/index.html?dwzone=unicode
["Forms of Unicode", Mark Davis, IBM developer and President of the Unicode
Consortium, IBM]

"Storage vs. performance
Both UTF-8 and UTF-16 are substantially more compact than UTF-32, when
averaging over the world's text in computers. UTF-8 is currently more
compact than UTF-16 on average, although it is not particularly suited for
East-Asian text because it occupies about 3 bytes of storage per code point.
UTF-8 will probably end up as about the same as UTF-16 over time, and may
end up being less compact on average as computers continue to make inroads
into East and South Asia. Both UTF-8 and UTF-16 offer substantial advantages
over UTF-32 in terms of storage requirements."

{ btw, about storage: I've converted a 300KB text file (a russian book) into
UTF-8 - it took about ~1.85 bytes per character. The little compression
compared to UTF-16 comes mostly from "spaces" and punctuation marks, and it
is hardly worth the processing complexity. }

"Code-point boundaries, iteration, and indexing are very fast with UTF-32.
Code-point boundaries, accessing code points at a given offset, and
iteration involve a few extra machine instructions for UTF-16; UTF-8 is a
bit more cumbersome."

{ Occurrence of the UTF-16 surrogates in real texts is estimated as <1% for
CJK languages. Other scripts encoded in "higher planes" cover very rare or
dead languages and some special symbols (like modern & old music symbols).
So, if the String object can identify the absence of surrogates, faster
functions can be used in most of the cases. The same optimization works for
UTF-8, but only in the US-nivers (even the British pound takes 2 bytes.. 8-) }

"Ultimately, the choice of which encoding format to use will depend heavily
on the programming environment. For systems that only offer 8-bit strings
currently, but are multi-byte enabled, UTF-8 may be the best choice. For
systems that do not care about storage requirements, UTF-32 may be best. For
systems such as Windows, Java, or ICU that use UTF-16 strings already,
UTF-16 is the obvious choice. Even if they have not yet upgraded to fully
support surrogates, they will be before long. If the programming environment
is not an issue, UTF-16 is recommended as a good compromise between
elegance, performance, and storage."
Feb 16 2003
prev sibling next sibling parent Burton Radons <loth users.sourceforge.net> writes:
Walter wrote:
 "Ben Hinkle" <bhinkle mathworks.com> wrote in message
 news:b0bvoh$1hm5$1 digitaldaemon.com...
 
4) How does one get a UTF-16 encoding of a char[],
At the moment, I'm thinking:

    wchar[] w;
    char[] c;
    w = cast(wchar[])c;

to do a UTF-8 to UTF-16 conversion.
This is less complex than "w = toWideStringz(c);" somehow? I can't speak for
anyone else, but this won't help my work with dig at all - I already have to
preprocess any strings sent to the API with toStringz, while the public
interface will still use char[]. So constant casting is the name of the game
by necessity, and if I want to be conservative I have to cache the
conversion and delete it anyway. Calling these APIs directly, when this
casting becomes a win, just doesn't happen to me.
Jan 18 2003
prev sibling parent reply "Ben Hinkle" <bhinkle mathworks.com> writes:
"Walter" <walter digitalmars.com> wrote in message
news:b0c66n$1mq6$2 digitaldaemon.com...
 "Ben Hinkle" <bhinkle mathworks.com> wrote in message
 news:b0bvoh$1hm5$1 digitaldaemon.com...
 I've gotten a little confused reading this thread. Here are some
questions
 swimming in my head:
 1) What does it mean to make UTF-8 the native type?
From a compiler standpoint, all it really means is that string literals
are
 encoded as UTF-8. The real support for it will be in the runtime library,
 such as UTF-8 support in printf().

 2) What is char.size?
It'll be 1.
D'oh! char.size=8 is a tad big ;)
 3) Does char[] differ from byte[] or is it a typedef?
It differs in that it can be overloaded differently, and the compiler
recognizes char[] as special when doing casts to other array types - it can
do conversions between UTF-8 and UTF-16, for example.
The semantics of casting (across all of D) need to be nice and predictable.
I'd hate to track down a bug because a cast that I thought was trivial
turned out to allocate new memory and copy data around...
 4) How does one get a UTF-16 encoding of a char[],
At the moment, I'm thinking: wchar[] w; char[] c; w = cast(wchar[])c; to do
a UTF-8 to UTF-16 conversion.
 or get the length,
To get the length in bytes: c.length; to get the length in UCS-4 characters,
perhaps: c.nchars ??
Could arrays (or some types that want to have array-like behavior) have some
semantics that distinguish between the memory layout and the array indexing
and length? Another example of this comes up in sparse matrices, where you
want to have an array-like thing that has a non-trivial memory layout.
Perhaps not full-blown operator overloading for [] and .length, etc - but
some kind of special syntax to differentiate between running around in the
memory layout or running around in the "high-level interface".
 or get
 the 5th character, or set the 5th character to a given unicode character
 (expressed in UTF-16, say)?
Probably a library function.
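
A sketch of the kind of split Ben describes, using the operator overloading
D later acquired (the type and its members are hypothetical): [] and .length
speak in characters, while .bytes exposes the raw memory layout.

    struct Utf8String
    {
        char[] bytes;                      // the memory layout

        @property size_t length() const   // high-level length, in code points
        {
            size_t n;
            foreach (char c; bytes)
                if ((c & 0xC0) != 0x80)
                    ++n;
            return n;
        }

        dchar opIndex(size_t n) const      // n-th code point, found by scanning
        {
            foreach (dchar c; bytes)
                if (n-- == 0)
                    return c;
            assert(0, "index out of range");
        }
    }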
Jan 18 2003
parent reply Shannon Mann <Shannon_member pathlink.com> writes:
I've read through what I could find on the thread about char[] 
and I find myself disagreeing with the idea that char[n] should
return the n'th byte, regardless of the width of a character.

My reasons are simple.  When I have an array of, say, ints, I don't
expect that int[n] will give me the n'th byte of the array of numbers.
I fully expect that the n'th integer will be what I get.

I see no reason why this should not hold for arrays of characters.

I do expect that there are times when it would be useful to access
an array of TYPE (where TYPE is int, char, etc) at the byte level, 
but it strikes me that some interface between an array of TYPE 
elements and that array as an array of BYTE's (i.e. using the byte 
type) would be VERY USEFUL, and would address concerns in wanting 
to access characters in their raw byte form.  Indexing of the 
equivalent of a byte pointer to a TYPE array, perhaps formulated
in syntactic sugar, would achieve this.  I would personally prefer
a language-specific way to byte access an aggregate rather than
use pointers to achieve what the language should provide anyway.

Please note that the above statements stand REGARDLESS of the encoding
chosen, be it UTF-8 or 16 or whatever.
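
A sketch of that byte-level view using D's array reinterpret cast (the
function name is assumed, the byte values in the comments assume a
little-endian machine, and none of this is specific to any proposal above):

    void byteViews()
    {
        int[] ints = [0x11223344];
        ubyte[] raw = cast(ubyte[]) ints;     // same memory: 4 bytes, 0x44 0x33 0x22 0x11
        assert(raw.length == 4);

        char[] text = "hé".dup;
        ubyte[] units = cast(ubyte[]) text;   // UTF-8 code units: 0x68 0xC3 0xA9
        assert(units.length == 3);
    }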
Feb 05 2003
parent "Sean L. Palmer" <seanpalmer directvinternet.com> writes:
The solution here is to use a char *iterator* instead of using char
*indexing*.  char indexing will be very slow.  char iteration will be very
fast.

D needs a good iterator concept.  It has a good array concept already, but
arrays are not the solution to everything.  For instance, serial input or
output can't easily be indexed.  You don't do:  serial_port[47] = character;
you do:  serial_port.write(character).  Those are like iterators (ok well at
least in STL, input iterators and output iterators were part of the iterator
family).
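
A sketch of such an iterator over code points (the type is hypothetical;
empty/front/popFront is the shape D's foreach can consume): iteration
decodes as it walks, so nothing ever scans from the start of the string.

    struct CodePoints
    {
        const(char)[] s;

        bool empty() const { return s.length == 0; }

        dchar front() const              // decode the first code point
        {
            foreach (dchar c; s)
                return c;
            assert(0);
        }

        void popFront()                  // skip the bytes of the first code point
        {
            size_t stride = 1;
            while (stride < s.length && (s[stride] & 0xC0) == 0x80)
                ++stride;
            s = s[stride .. $];
        }
    }
    // usage:  foreach (c; CodePoints(str)) { ... }  visits each character once.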

Sean

"Shannon Mann" <Shannon_member pathlink.com> wrote in message
news:b1rb8q$5i7$1 digitaldaemon.com...
 [snip]
Feb 05 2003