D - Unicode in D
- globalization guy (41/41) Jan 16 2003 I think you'll be making a big mistake if you adopt C's obsolete char ==...
- Paul Sheer (8/9) Jan 16 2003 what about embedded work? this needs to be lightweight
- Theodore Reed (19/31) Jan 16 2003 But the default option should be UTF-8 with a module available for
- Sean L. Palmer (16/47) Jan 16 2003 I'm all for UTF-8. Most fonts don't come anywhere close to having all t...
- Theodore Reed (10/17) Jan 16 2003 AFAIK, Unicode between 0 and 127 is the exact same thing as ASCII.
- Alix Pexton (16/31) Jan 16 2003 As I see it there are two issues here. Firstly there is the ability to
- globalization guy (18/73) Jan 16 2003 Modern font systems cover different Unicode ranges with different fonts....
- globalization guy (23/36) Jan 16 2003 The default (and only) form should be UTF-16 in the language itself. The...
- Paul Stanton (2/4) Jan 16 2003 provided solaris/jvm is configured correctly by (friggin) service provid...
- Serge K (8/10) Jan 16 2003 0x00..0x7F --> 1 byte
- globalization guy (48/57) Jan 16 2003 Good questions. I think you'll find if you sniff around that more and mo...
- Martin M. Pedersen (81/122) Jan 16 2003 Hi,
- Walter (2/2) Jan 16 2003 You make some great points. I have to ask, though, why UTF-16 as opposed...
- globalization guy (60/62) Jan 17 2003 Good question, and actually it's not an open and shut case. UTF-8 would ...
- Walter (77/140) Jan 17 2003 I read your post with great interest. However, I'm leaning towards UTF-8...
- Burton Radons (14/19) Jan 17 2003 You're planning on making this a part of char[]? I was thinking of
- Walter (11/30) Jan 17 2003 on
- Mike Wynn (16/18) Jan 17 2003 I was under the impression UTF-16 was glyph based, so each char (16bits)...
- Serge K (20/26) Jan 16 2003 First, UTF-16 is just one of the many standard encodings for the Unicode...
- Mike Wynn (25/44) Jan 18 2003 ^)] -
- Serge K (28/38) Jan 16 2003 encodings?
- Walter (10/27) Jan 17 2003 LOL! Looking forward, then, one can treat it as UTF-16.
- Daniel Yokomiso (80/121) Jan 17 2003 byte
- Walter (37/164) Jan 17 2003 I once wrote a large project that dealt with mixed ascii and unicode. Th...
- Daniel Yokomiso (29/38) Jan 18 2003 There
- Theodore Reed (11/17) Jan 18 2003 So what does "myString[13] = someGlyph" mean? char doesn't have to be a
- Walter (22/37) Jan 18 2003 be
- Burton Radons (13/26) Jan 18 2003 I disagree. Returning the character makes indexing expensive, but it
- Walter (8/33) Jan 18 2003 someChar;"
- Daniel Yokomiso (26/63) Jan 18 2003 encoding,
- Walter (6/9) Jan 18 2003 with
- Sean L. Palmer (8/12) Jan 19 2003 adverse
- Ilya Minkov (9/25) Jan 20 2003 That's not gonna work, because there's no reliable way you can get this
- Ben Hinkle (46/220) Jan 18 2003 There
- Walter (19/29) Jan 18 2003 From a compiler standpoint, all it really means is that string literals ...
- Mark Evans (7/7) Jan 18 2003 The best way to handle Unicode is, as a previous poster suggested, to ma...
- Walter (6/13) Jan 18 2003 You're probably right, the typecasting hack is inconsistent enough with ...
- Mark Evans (10/11) Jan 20 2003 If one wants to do serious internationalized applications it is mandator...
- Walter (9/17) Jan 21 2003 UTF-8 can handle that.
- Mark Evans (47/47) Jan 21 2003 Well OK I should have been clearer. You are right about sheer numerical
- Mark Evans (20/20) Jan 21 2003 Quick follow-up. Even the extra space in UTF-8 will probably not be use...
- Ilya Minkov (25/44) Jan 22 2003 Could someone explain me *what's the difference*? I thought there was
- Ilya Minkov (4/9) Jan 22 2003 This one remains.
- Mark Evans (4/5) Jan 22 2003 Take the trouble to read through the links supplied in the previous post...
- Theodore Reed (21/37) Jan 22 2003 That's not how UTF-8 works (although I've thought a RLE scheme like the
- Ilya Minkov (3/3) Jan 22 2003 Then considering UTF-16 might make sense...
- Mark Evans (13/17) Jan 23 2003 Only if you live within the same dynamic range as UTF-16. To get the fu...
- Serge K (30/46) Jan 27 2003 With 4
- Walter (6/9) Feb 03 2003 Not necessarilly. While Win32 is now fully UTF-16 internally, and appare...
- Theodore Reed (13/27) Feb 04 2003 Plus, UTF-8 is pretty standard for Unicode on Linux. I believe BeOS used
- Mark Evans (3/5) Feb 13 2003 I agree with this remark, but think there are plenty of platform-indepen...
- Mark Evans (14/21) Feb 13 2003 Memory is cheap and getting cheaper, but procesor time never loses value...
- Serge K (46/58) Feb 16 2003 than
- Burton Radons (8/19) Jan 18 2003 This is less complex than "w = toWideStringz(c);" somehow? I can't
- Ben Hinkle (16/45) Jan 18 2003 questions
- Shannon Mann (19/19) Feb 05 2003 I've read through what I could find on the thread about char[]
- Sean L. Palmer (12/31) Feb 05 2003 The solution here is to use a char *iterator* instead of using char
I think you'll be making a big mistake if you adopt C's obsolete char == byte concept of strings. Savvy language designers these days realize that, like int's and float's, char's should be a fundamental data type at a higher level of abstraction than raw bytes. The model that most modern language designers are turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

If you do so, you make it possible for strings in your language to have a single, canonical form that all APIs use. Instead of the nightmare that C/C++ programmers face when passing string parameters ("now, let's see, is this a char* or a const char* or an ISO C++ string or an ISO wstring or a wchar_t* or a char[] or a wchar_t[] or an instance of one of countless string classes...?"). The fact that not just every library but practically every project feels the need to reinvent its own string type is proof of the need for a good, solid, canonical form built right into the language. Most language designers these days either get this from the start or they later figure it out and have to screw up their language with multiple string types.

Having canonical UTF-16 chars and strings internally does not mean that you can't deal with other character encodings externally. You can convert to canonical form on import and convert back to some legacy encoding on export. When you create the strings yourself, or when they are created in Java or Javascript or default XML or most new text protocols, no conversion will be necessary. It will only be needed for legacy data (or a very lightweight switch between UTF-8 and UTF-16). And for those cases where you have to work with legacy data and yet don't want to incur the overhead of encoding conversion in and out, you can still treat the external strings as byte arrays instead of strings, assuming you have a "byte" data type, and do direct byte manipulation on them. That's essentially what you would have been doing anyway if you had used the old char == byte model I see in your docs. You just call it "byte" instead of "char" so it doesn't end up being your default string type.

Having a modern UTF-16 char type, separate from arrays of "byte", gives you a consistency that allows for the creation of great libraries (since text is such a fundamental type). Languages that built this in from the start have libraries that universally use a single string type. Perl figured it out pretty late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's never clear which CPAN modules will work and which ones will fail, so you have to use pragmas ("use utf-8" vs. "use bytes") and do lots of testing.

I hope you'll consider making this change to your design. Have an 8-bit unsigned "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this "8-bit char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff or I'm quite sure you'll later regret it. C/C++ are in that sorry state for legacy reasons only, not because their designers were foolish, but any new language that intentionally copies that "design" is likely to regret that decision.
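[As a rough illustration of the split argued for here (raw bytes vs. a canonical 16-bit char), a minimal sketch in later D2-era syntax; latin1ToUtf16 is hypothetical and only marks where an explicit import-time conversion would go.]

    import std.stdio;

    void main()
    {
        ubyte[] raw = [0x48, 0x69];     // legacy data: just octets, no implied encoding
        wstring text = "Résumé"w;       // canonical text: UTF-16 code units (wchar)

        // Importing legacy data into the canonical form is an explicit step.
        // wstring imported = latin1ToUtf16(raw);   // hypothetical decoder

        writefln("bytes: %s, UTF-16 code units: %s", raw.length, text.length);
    }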
Jan 16 2003
On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:
> I think you'll be making a big mistake if you adopt C's obsolete char == byte

what about embedded work? this needs to be lightweight

in any case, a 16 bit character set doesn't hold all the charsets needed
by the worlds languages, but a 20 bit charset (UTF-8) is overkill.
then again, most programmers get by with 8 bits 99% of the time.
So you need to give people options.

-paul
Jan 16 2003
On Thu, 16 Jan 2003 14:40:15 +0200 "Paul Sheer" <psheer icon.co.za> wrote:
> On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:
>> I think you'll be making a big mistake if you adopt C's obsolete char == byte
> what about embedded work? this needs to be lightweight
> in any case, a 16 bit character set doesn't hold all the charsets needed
> by the worlds languages, but a 20 bit charset (UTF-8) is overkill.
> then again, most programmers get by with 8 bits 99% of the time.
> So you need to give people options.

But the default option should be UTF-8 with a module available for conversion. (I tend to stay away from UTF-16 because of endian issues.)

Also, I'm not sure where you're getting the 20-bit part. UTF-8 can encode everything in the Unicode 32-bit range. (Although it takes like 8 bytes towards the end.)

UTF-8 also addresses the lightweight bit, as long as you aren't using non-English characters, but even if you are, they aren't that much longer. And it's better than having to deal with 50 million 8-bit encodings.

FWIW, I wholeheartedly support Unicode strings in D.

-- 
Theodore Reed (rizen/bancus) -==- http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"We have committed a greater crime, and for this crime there is no name. What punishment awaits us if it be discovered we know not, for no such crime has come in the memory of men and there are no laws to provide for it." -- Equality 7-2521, Ayn Rand's Anthem
Jan 16 2003
I'm all for UTF-8. Most fonts don't come anywhere close to having all the glyphs anyway, but it's still nice to use an encoding that actually has a real definition (whereas "byte" has no meaning whatsoever and could mean ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.) UTF-8 allows you the full unicode range but the part that we use everyday just takes 1 byte per char, like usual. I believe it even maps almost 1:1 to ASCII in that range.

You cannot however make a UTF-8 data type. By definition each character may take more than one byte. But you don't make arrays of characters, you make arrays of character building blocks (bytes) that are interpreted as characters. Anyway we'd need some automated way to step through the array one character at a time. Maybe string could be an array of bytes that pretends that it's an array of 32-bit unicode characters?

Sean

"Theodore Reed" <rizen surreality.us> wrote in message news:20030116081437.1a593197.rizen surreality.us...
> But the default option should be UTF-8 with a module available for
> conversion. (I tend to stay away from UTF-16 because of endian issues.) [...]
Jan 16 2003
On Thu, 16 Jan 2003 09:49:58 -0800 "Sean L. Palmer" <seanpalmer directvinternet.com> wrote:
> I'm all for UTF-8. Most fonts don't come anywhere close to having all the
> glyphs anyway, but it's still nice to use an encoding that actually has a
> real definition (whereas "byte" has no meaning whatsoever and could mean
> ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.) UTF-8 allows you the full unicode
> range but the part that we use everyday just takes 1 byte per char, like
> usual. I believe it even maps almost 1:1 to ASCII in that range.

AFAIK, Unicode between 0 and 127 is the exact same thing as ASCII.

-- 
Theodore Reed (rizen/bancus) -==- http://www.surreality.us/
~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"The word of Sin is Restriction. O man! refuse not thy wife, if she will! O lover, if thou wilt, depart! There is no bond that can unite the divided but love: all else is a curse. Accursed! Accursed be it to the aeons! Hell." -- Liber AL vel Legis, 1:41
Jan 16 2003
Theodore Reed wrote:
> On Thu, 16 Jan 2003 09:49:58 -0800 "Sean L. Palmer" <seanpalmer directvinternet.com> wrote:
>> I'm all for UTF-8. Most fonts don't come anywhere close to having all the
>> glyphs anyway, but it's still nice to use an encoding that actually has a
>> real definition [...]
> AFAIK, Unicode between 0 and 127 is the exact same thing as ASCII.

As I see it there are two issues here. Firstly there is the ability to read and manipulate text streams that are encoded in one of the many multi-byte/variable-width formats, and secondly there is allowing code to be written in mb/vw formats. The first can be achieved (though perhaps not transparently) using a library, while the second obviously requires work to be done on the front end of the compiler. The front end is freely available under the gpl/artistic licences, and I don't think it would be difficult to augment it with mb/vw support. However, this doesn't give us an integrated solution, such as you might find in other languages, but it is a start.

Alix Pexton
Webmaster - "the D journal"
www.thedjournal.com

PS who need mb/vw when we have lojban ;)
Jan 16 2003
In article <b06r0m$1l3u$1 digitaldaemon.com>, Sean L. Palmer says...
> I'm all for UTF-8. Most fonts don't come anywhere close to having all the
> glyphs anyway,...

Modern font systems cover different Unicode ranges with different fonts. A font that contains all the Unicode glyphs is of very limited use. (It tends to be useful for primitive tools that assume a single font for all glyphs. Such tools are being superseded by modern tools, though, and the complexities of rendering are being delegated to central rendering subsystems.)

> ... but it's still nice to use an encoding that actually has a real
> definition (whereas "byte" has no meaning whatsoever and could mean ANSI,
> DOS OEM, ASCII-7, UTF-8, or MBCS.) UTF-8 allows you the full unicode range
> but the part that we use everyday just takes 1 byte per char, like usual.

I'd be careful about the "part we use everyday" idea. I don't really know who's involved in this "D" project, but big company developers tend to work more and more in systems that handle a rich range of characters. The reason is because that's what their company needs to do every day, whether they do personally or not. That's what is swirling around the Internet every day. It is true, though, that for Westerners, ASCII characters occur more commonly, so UTF-8 has a sort of "poor man's compression" advantage that is often useful.

> I believe it even maps almost 1:1 to ASCII in that range. You cannot
> however make a UTF-8 data type. By definition each character may take more
> than one byte. But you don't make arrays of characters, you make arrays of
> character building blocks (bytes) that are interpreted as characters.

No, you make arrays of UTF-16 code units. When you need to do work with arrays of characters UTF-16 is a better choice than UTF-8, though UTF-8 is better for data interchange with unknown recipients.

> Anyway we'd need some automated way to step through the array one
> character at a time. Maybe string could be an array of bytes that pretends
> that it's an array of 32-bit unicode characters?

UTF-16. That's what it's for. UTF-32 is not practical for most purposes that involve large amounts of text.

> Sean
> [...]
Jan 16 2003
In article <20030116081437.1a593197.rizen surreality.us>, Theodore Reed says...
> On Thu, 16 Jan 2003 14:40:15 +0200 "Paul Sheer" <psheer icon.co.za> wrote:
> But the default option should be UTF-8 with a module available for
> conversion. (I tend to stay away from UTF-16 because of endian issues.)

The default (and only) form should be UTF-16 in the language itself. There is no endianness issue unless data is serialized. Serialization is a type of output like printing on paper, and I'm not suggesting serializing into UTF-16 by default. UTF-8 is the way to go for that. I'm only talking about the "model" used by the programming language.

Another way to look at it is to consider int's. Do you try to avoid the int data type? It has exactly the same endianness issues as UTF-16.

> Also, I'm not sure where you're getting the 20-bit part. UTF-8 can encode
> everything in the Unicode 32-bit range. (Although it takes like 8 bytes
> towards the end.)

He's right, actually. Unicode has a range of slightly over 20 bits. (1M + 62K, to be exact.) Originally, Unicode had a 16-bit range and ISO 10646 had a 31 bit range (not 32), but both now have converged on a little over 20.

> UTF-8 also addresses the lightweight bit, as long as you aren't using
> non-English characters, but even if you are, they aren't that much longer.

So does UTF-16 because although Western characters take a little more space than with UTF-8, processing is lighter weight, and that is usually more significant.

> And it's better than having to deal with 50 million 8-bit encodings.

Amen to that! Talk about heavyweight...

> FWIW, I wholeheartedly support Unicode strings in D.

Yes, indeed. It is a real benefit to give the users because with Unicode strings as standard, you get libraries that can take a lot of the really arcane issues off the programmers' shoulders (and put them on the library authors' shoulders, where tough stuff belongs). When D programmers then deal with Unicode XML, HTML, etc., they can just send the strings to the libraries, confident that the "Unicode stuff" will be taken care of. That's the kind of advantage modern developers get from Java that they don't get from good ol' C.
Jan 16 2003
In article <b07jht$22v4$1 digitaldaemon.com>, globalization guy says...
> That's the kind of advantage modern developers get from Java that they
> don't get from good ol' C.

provided solaris/jvm is configured correctly by (friggin) service provider
Jan 16 2003
> UTF-8 can encode everything in the Unicode 32-bit range. (Although it
> takes like 8 bytes towards the end.)

0x00..0x7F        --> 1 byte  - ASCII
0x80..0x7FF       --> 2 bytes - Latin extended, Greek, Cyrillic, Hebrew, Arabic, etc...
0x800..0xFFFF     --> 3 bytes - most of the scripts in use.
0x10000..0x10FFFF --> 4 bytes - rare/dead/... scripts
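[A hand-rolled sketch (not a library API) that mirrors the table above, computing how many UTF-8 bytes a given code point needs:]

    size_t utf8Length(dchar c)
    {
        if (c <= 0x7F)   return 1;   // ASCII
        if (c <= 0x7FF)  return 2;   // Latin extended, Greek, Cyrillic, ...
        if (c <= 0xFFFF) return 3;   // rest of the Basic Multilingual Plane
        return 4;                    // 0x10000..0x10FFFF
    }

    unittest
    {
        assert(utf8Length('A') == 1);
        assert(utf8Length('é') == 2);
        assert(utf8Length('€') == 3);
        assert(utf8Length('\U00010000') == 4);
    }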
Jan 16 2003
In article <b065i9$19aa$1 digitaldaemon.com>, Paul Sheer says...
> On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:
>> I think you'll be making a big mistake if you adopt C's obsolete char == byte
> what about embedded work? this needs to be lightweight

Good questions. I think you'll find if you sniff around that more and more embedded work is going to Unicode. The reason is because it is inevitable that any successful device that deals with natural language will be required to handle more and more characters as its market expands. When you add new characters by changing character sets, you get a high marginal cost per market and you still can't handle mixed language scenarios (which have become very common due to the Internet.) When you add new characters by *adding* character sets, you lose all of your "lightweight" benefits.

I attended a Unicode conference once where there was a separate embedded systems conference going on in the same building. By the end of the conference, we had almost merged, at least in the hallways. ;-)

Unicode, done right, gives you universality at a fraction of the cost of patchwork solutions to worldwide markets. Even in English, the range of characters being demanded by customers has continued to grow. It grew beyond ASCII years ago, and has now gone beyond Latin-1. MS Windows had to add a proprietary extension to Latin-1 before they gave up entirely and went full Unicode, as did Apple with OS X, Sun with Java, Perl, HTML 4....

> in any case, a 16 bit character set doesn't hold all the charsets needed
> by the worlds languages, but a 20 bit charset (UTF-8) is overkill.
> then again, most programmers get by with 8 bits 99% of the time.
> So you need to give people options.
> -paul

UTF-16 isn't a 16-bit character set. It's a 16-bit encoding of a character set that has an enormous repertoire. There is room for well over a million characters in the Universal Character Set (shared by Unicode and ISO 10646), and many of those "characters" are actually components meant to be combined with others to create a truly enormous variety of what most people think of as "characters". It is no longer correct to assume a 1:1 correspondence between a Unicode character and a glyph you see on a screen or on paper. (And that correspondence was lost way back when TrueType was created anyway). The length of a string in these modern times is an abstract concept, not a physical one, when dealing with natural language. The nice 1:1 correspondences between code point / character / glyph are still available for artificial symbols created as sequences of ASCII printing characters, though, and that is true even in UTF-16 Unicode.

Unicode certainly does have room for all of the world's character sets. It is a superset of them all -- with "all" meaning those considered significant by the various national bodies represented in ISO and all of the industrial bodies providing input to the Unicode Technical Committee. It's not a universal superset in an absolute sense.

When you say "most programmers get by with 8 bits 99% of the time", I think you may be thinking a bit too narrowly. The composition of programmers has become more international than perhaps you realize, and the change isn't slowing down. Even in the West, most major companies have moved to Unicode *to solve their own problems*. MS programmers can't get by with 8-bits. Neither can Apple's, or Sun's, or Oracle's, or IBM's....

Another thing to consider is that programmers use the tools that exist, naturally.
For a long time, major programming languages had the fundamental equivalence of byte and char at their core. Many people who got by with 8-bits did so because there was no practical alternative. These days, there are, and modern languages need to be designed to take advantage of all the great advantages that come along with using Unicode.
Jan 16 2003
Hi,

I have been thinking about this issue too, and also I think that Unicode strings should be a prime concern of D. And, yes, UTF-8 is the way to go. I would very much like to see a string using canonical UTF-8 encoding being built right into the language, as a class with value semantics.

What we are faced with is:

1. We need char and wchar_t for compatibility with APIs.
2. We need good Unicode support.
3. We need a memory efficient representation of strings.
4. We need the ability to easily manipulate strings.

There are two fundamental types of text data: a character and a string. Also, Java uses two kinds of strings: a String class for storing strings, and a StringBuffer for manipulating strings. This separation solves many problems. I believe that:

- A single character should be represented using 32-bit UCS-4 using native endianness - like the wchar_t commonly seen on UNIX. It probably should be a struct in order to avoid the overhead of a vtbl, and still support character methods such as isUpper() and toUpper().

- A non-modifiable string should be stored using UTF-8. By non-modifiable I mean that they do not allow individual characters to be manipulated, but they do allow reassignment. Read-only forward character iterators could also be supported in an efficient manner. As it has already been stated, they would in most cases be as memory efficient as C's char arrays. This also addresses Walter's concern of performance issues with CPU caches. But it also means that the concept of using arrays simply is not good enough. This string class should also provide functionality such as a collate() method.

- A modifiable string should support manipulation of individual characters, and could likely be an array of UCS-4 characters. Methods should be provided for converting to/from char* and wchar_t* (whether it is 16- or 32-bit) as needed for supporting C APIs. Some will argue that this would involve too many conversions. However, if you are using char* today on Windows, Windows will do this conversion all the time, and you probably do not notice. And if it really becomes a bottle-neck, optimization would be simple in most cases - just cache the converted string. And if you are only concerned with using C APIs - use the C string functions such as strcat()/wcscat() or specialized classes.

In addition, character encoders could be provided for whatever representation is needed. I myself would like support for US-ASCII, EBCDIC, ISO-8859, UTF-7, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, and US-ASCII/ISO-8859 with encoding of characters as in HTML (I don't remember what standard this is called, but it specifies characters using "&somename;"). Others would have different needs, so it should be simple to implement a new character encoder/decoder.

Regards,
Martin M. Pedersen.

"globalization guy" <globalization_member pathlink.com> wrote in message news:b05pdd$13bv$1 digitaldaemon.com...
> I think you'll be making a big mistake if you adopt C's obsolete char == byte
> concept of strings. [...]
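[A rough sketch, with hypothetical names and later D2-era syntax, of the three-part design proposed above: a 32-bit character value type, an immutable UTF-8 string, and a modifiable UCS-4 buffer.]

    struct Character
    {
        dchar value;                         // one UCS-4 code point, native endian

        bool isUpper() const { return value >= 'A' && value <= 'Z'; }  // ASCII-only stub
        Character toUpper() const
        {
            if (value >= 'a' && value <= 'z')
                return Character(cast(dchar)(value - 32));
            return this;
        }
    }

    struct Text                              // non-modifiable, stored as UTF-8
    {
        private immutable(char)[] utf8;
        size_t byteLength() const { return utf8.length; }
        // collate(), read-only forward iteration, encoders, etc. would live here
    }

    struct TextBuffer                        // modifiable, one dchar per character
    {
        dchar[] chars;
        void opIndexAssign(dchar c, size_t i) { chars[i] = c; }
    }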
Jan 16 2003
You make some great points. I have to ask, though, why UTF-16 as opposed to UTF-8?
Jan 16 2003
In article <b08cdr$2fld$1 digitaldaemon.com>, Walter says...
> You make some great points. I have to ask, though, why UTF-16 as opposed to UTF-8?

Good question, and actually it's not an open and shut case. UTF-8 would not be a big mistake, but it might not be quite as good as UTF-16.

The biggest reason I think UTF-16 has the edge is that I think you'll probably want to treat your strings as arrays of characters on many occasions, and that's *almost* as easy to do with UTF-16 as with ASCII. It's really not very practical with UTF-8, though.

UTF-16 characters are almost always a single 16-bit code unit. Once in a billion characters or so, you get a character that is composed of two "surrogates". Sort of like half characters. Your code does have to keep this exceptional case in mind and handle it when necessary, though that is usually the type of problem you delegate to the standard library. In most cases, a function can just think of each surrogate as a character and not worry that it might be just half of the representation of a character -- as long as the two don't get separated. In almost all cases, though, you can think of a character as a single 16-bit entity, which is almost as simple as thinking of it as a single 8-bit entity. You can do bit operations on them and other C-like things and it should be very efficient.

Unlike UTF-16's two cases, one of which is very rare, UTF-8 has four cases, three of which are very common. All of your code needs to do a good job with those three cases. Only the fourth can be considered exceptional. (Of course it has to be handled, too, but it is like the exceptional UTF-16 case, where you don't have to optimize for it because it rarely occurs). Most strings will tend to have mixed-width characters, so a model of an array of elements isn't a very good one.

You can still implement your language with accessors that reach into a UTF-8 string and parse out the right character when you say "str[5]", but it will be further removed from the physical implementation than if you use UTF-16. For a somewhat lower-level language like "D", this probably isn't a very good fit.

The main benefit of UTF-8 is when exchanging text data with arbitrary external parties. UTF-8 has no endianness problem, so you don't have to worry about the *internal* memory model of the recipient. It has some other features that make it easier to digest by legacy systems that can only handle ASCII. They won't work right outside ASCII, but they'll often work for ASCII and they'll fail more gracefully than would be the case with UTF-16 (that is likely to contain embedded \0 bytes.)

None of these issues are relevant to your own program's *internal* text model. Internally, you're not worried about endianness. (You don't worry about the endianness of your int variables, do you?) You don't have to worry about losing a byte in RAM, etc.

When talking to external APIs, you'll still have to output in a form that the API can handle. Win32 APIs want UTF-16. Mac APIs want UTF-16. Java APIs want UTF-16, as do .Net APIs. Unix APIs are problematic, since there are so many and they aren't coordinated by a single body. Some will only be able to handle ASCII, others will be upgraded to UTF-8. I don't think the Unix system APIs will become UTF-16 because legacy is such a ball and chain in the Unix world, but the process is underway to upgrade the standard system encoding for all major Linux distributions to UTF-8.
If Linux APIs (and probably most Unix APIs eventually) are of primary importance, UTF-8 is still a possibility. I'm not totally ruling it out. It wouldn't hurt you much to use UTF-8 internally, but accessing strings as arrays of characters would require sort of a virtual string model that doesn't match the physical model quite as closely as you could get with UTF-16. The additional abstraction might have more overhead than you would prefer internally. If it's a choice between internal inefficiency and inefficiency when calling external APIs, I would usually go for the latter. Most language designers who understand internationalization have decided to go with UTF-16 for languages that have their own rich set of internal libraries, and they have mechanisms for calling external APIs that convert the string encodings.
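[A small sketch of the two UTF-16 cases described above, assuming strings are arrays of UTF-16 code units (wchar[] in D): most code units stand alone, and only a leading (high) surrogate must be combined with the following low surrogate.]

    dchar decodeAt(const(wchar)[] s, size_t i)
    {
        wchar u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF)               // rare case: high surrogate
        {
            wchar lo = s[i + 1];                      // must be 0xDC00..0xDFFF
            return cast(dchar)(0x10000 + (((u - 0xD800) << 10) | (lo - 0xDC00)));
        }
        return u;                                     // common case: one code unit
    }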
Jan 17 2003
I read your post with great interest. However, I'm leaning towards UTF-8 for the following reasons (some of which you've covered):

1) In googling around and reading various articles, it seems that UTF-8 is gaining momentum as the encoding of choice, including html.

2) Linux is moving towards UTF-8 permeating the OS. Doing UTF-8 in D means that D will mesh naturally with Linux system api's.

3) Is Win32's "wide char" really UTF-16, including the multi word encodings?

4) I like the fact of no endianness issues, which is important when writing files and transmitting text - it's much more important an issue than the endianness of ints.

5) 16 bit accesses on Intel CPUs can be pretty slow compared to byte or dword accesses (varies by CPU type).

6) Sure, UTF-16 reduces the frequency of multi character encodings, but the code to deal with it must still be there and must still execute.

7) I've converted some large Java text processing apps to C++, and converted the Java 16 bit char's to using UTF-8. That change resulted in *substantial* performance improvements.

8) I suspect that 99% of the text processed in computers is ascii. UTF-8 is a big win in memory and speed for processing english text.

9) A lot of diverse systems and lightweight embedded systems need to work with 8 bit chars. Going to UTF-16 would, I think, reduce the scope of applications and systems that D would be useful for. Going to UTF-8 would make it as broad as possible.

10) Interestingly, making char[] in D to be UTF-8 does not seem to step on or prevent dealing with wchar_t[] arrays being UTF-16.

11) I'm not convinced the char[i] indexing problem will be a big one. Most operations done on ascii strings remain unchanged for UTF-8, including things like sorting & searching.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html

"globalization guy" <globalization_member pathlink.com> wrote in message news:b09qpe$aff$1 digitaldaemon.com...
> Good question, and actually it's not an open and shut case. UTF-8 would not
> be a big mistake, but it might not be quite as good as UTF-16. [...]
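[A small illustration of point 11: because a UTF-8 trail byte can never be confused with an ASCII byte or a lead byte, a plain byte-wise substring search over char[] finds exactly the same matches a character-aware search would. A minimal sketch in later D syntax:]

    import std.stdio;

    ptrdiff_t findBytes(const(char)[] haystack, const(char)[] needle)
    {
        if (needle.length == 0 || needle.length > haystack.length)
            return -1;
        foreach (i; 0 .. haystack.length - needle.length + 1)
            if (haystack[i .. i + needle.length] == needle)
                return cast(ptrdiff_t) i;     // byte offset of the match
        return -1;
    }

    void main()
    {
        string s = "naïve café";              // string literals are stored as UTF-8
        writefln("café found at byte offset %s", findBytes(s, "café"));
    }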
Jan 17 2003
Walter wrote:
> 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on
> or prevent dealing with wchar_t[] arrays being UTF-16.

You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.

I think we should kill off wchar if we go in this direction. The char/wchar conflict is probably the worst part of D's design right now as it doesn't fit well with the rest of the language (limited and ambiguous overloading), and it would provide absolutely nothing that char doesn't already encapsulate. If you need different encodings, use a library.

> 11) I'm not convinced the char[i] indexing problem will be a big one. Most
> operations done on ascii strings remain unchanged for UTF-8, including
> things like sorting & searching.

It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.

12) UTF-8 doesn't embed ANY control characters, so it can interface with unintelligent C libraries natively. That's not a minor advantage when you're trying to get people to switch to it!
Jan 17 2003
"Burton Radons" <loth users.sourceforge.net> wrote in message news:b0a6rl$i4m$1 digitaldaemon.com...Walter wrote:on10) Interestingly, making char[] in D to be UTF-8 does not seem to stepI think making char[] a UTF-8 is the right way.or prevent dealing with wchar_t[] arrays being UTF-16.You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.I think we should kill off wchar if we go in this direction. The char/wchar conflict is probably the worst part of D's design right now as it doesn't fit well with the rest of the language (limited and ambiguous overloading), and it would provide absolutely nothing that char doesn't already encapsulate. If you need different encodings, use a library.I agree that the char/wchar conflict is a screwup in D's design, and one I've not been happy with. UTF-8 offers a way out. wchar_t should still be retained, though, for interfacing with the win32 api.Most11) I'm not convinced the char[i] indexing problem will be a big one.Interestingly, if foreach is done right, iterating through char[] will work right, UTF-8 or not.operations done on ascii strings remain unchanged for UTF-8, including things like sorting & searching.It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.12) UTF-8 doesn't embed ANY control characters, so it can interface with unintelligent C libraries natively. That's not a minor advantage when you're trying to get people to switch to it!You're right.
Jan 17 2003
"Walter" <walter digitalmars.com> wrote in message news:b0a7ft$iei$1 digitaldaemon.com..."Burton Radons" <loth users.sourceforge.net> wrote in message news:b0a6rl$i4m$1 digitaldaemon.com...stepWalter wrote:10) Interestingly, making char[] in D to be UTF-8 does not seem toonI would be more in favor of a String class that was utf8 internally the problem with utf8 is that the the number of bytes and the number of chars are dependant on the data char[] to me implies an array of char's so char [] foo ="aa"\0x0555; is 4 bytes, but only 3 chars so what is foo[2] ? and what if I set foo[1] = \0x467; and what about wanting 8 bit ascii strings ? if you are going UTF8 then think about the minor extension Java added to the encoding by allowing a two byte 0, which allows embedded 0 in strings without messing up the C strlen (which returns the byte length).I think making char[] a UTF-8 is the right way.or prevent dealing with wchar_t[] arrays being UTF-16.You're planning on making this a part of char[]? I was thinking of generating a StringUTF8 instance during compilation, but whatever.I think we should kill off wchar if we go in this direction. The char/wchar conflict is probably the worst part of D's design right now as it doesn't fit well with the rest of the language (limited and ambiguous overloading), and it would provide absolutely nothing that char doesn't already encapsulate. If you need different encodings, use a library.I agree that the char/wchar conflict is a screwup in D's design, and one I've not been happy with. UTF-8 offers a way out. wchar_t should still be retained, though, for interfacing with the win32 api.workMost11) I'm not convinced the char[i] indexing problem will be a big one.Interestingly, if foreach is done right, iterating through char[] willoperations done on ascii strings remain unchanged for UTF-8, including things like sorting & searching.It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.right, UTF-8 or not.12) UTF-8 doesn't embed ANY control characters, so it can interface with unintelligent C libraries natively. That's not a minor advantage when you're trying to get people to switch to it!You're right.
Jan 17 2003
UTF-8 does lead to the problem of what is meant by:

    char[] c;
    c[5]

Is it the 5th byte of c[], or the 5th decoded 32 bit character? Saying it's the 5th decoded character has all kinds of implications for slicing and .length.

8 bit ascii isn't a problem, just cast it to a byte[], as in:

    byte[] b = cast(byte[])c;

I'm not sure about the Java 00 issue, I didn't think Java supported UTF-8. D does not have the "what to do about embedded 0" problem, as the length is carried along separately.

"Mike Wynn" <mike.wynn l8night.co.uk> wrote in message news:b0a8eg$ivc$1 digitaldaemon.com...
> I would be more in favor of a String class that was utf8 internally. The
> problem with utf8 is that the number of bytes and the number of chars are
> dependent on the data. [...]
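[A sketch of the two possible meanings discussed above: s[i] below is the i-th UTF-8 code unit (constant time), while nthCharacter is a hypothetical helper that walks the array to reach the i-th decoded character (linear time). The dchar-typed foreach decoding is the behaviour later D settled on.]

    dchar nthCharacter(const(char)[] s, size_t n)
    {
        size_t seen = 0;
        foreach (dchar c; s)                  // decodes UTF-8 as it walks
            if (seen++ == n)
                return c;
        assert(0, "index out of range");
    }

    unittest
    {
        string s = "cât";
        assert(s[1] == 0xC3);                 // 2nd byte: lead byte of 'â'
        assert(nthCharacter(s, 1) == 'â');    // 2nd character: 'â' itself
    }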
Jan 17 2003
> 6) Sure, UTF-16 reduces the frequency of multi character encodings, but the
> code to deal with it must still be there and must still execute.

I was under the impression UTF-16 was glyph based, so each char (16 bits) was a glyph of some form; not all glyphs cause the graphics to move to the next char, so accents can be encoded as a postfix to the char they are over/under, and charsets like chinese have sequences that generate the correct visual representation.

UTF-8 is just a way to encode UTF-16 so that it is compatible with ascii: 0..127 map to 0..127, then 128..256 are used as special values identifying multi byte values. The string can be processed as 8bit ascii by software without problem, only the visual representation changes; 128..256 on dos are the box drawing and intl chars.

However, a 3 UTF-16 char sequence will encode to 3 utf 8 encoded sequences, and if they are all >127 then that would be 6 or more bytes. So if you consider the 3 UTF-16 values to be one "char" then the UTF8 should also consider the 6 or more byte sequence as one "char" rather than 3 "chars".
Jan 17 2003
> I was under the impression UTF-16 was glyph based, so each char (16 bits) was
> a glyph of some form; not all glyphs cause the graphics to move to the next
> char, so accents can be encoded as a postfix to the char they are over/under,
> and charsets like chinese have sequences that generate the correct visual
> representation.

First, UTF-16 is just one of the many standard encodings for the Unicode. UTF-16 allows more than 16bit characters - with surrogates it can represent all >1M codes. (Unicode v2 used UCS-2 which is 16bit-only encoding)

> I was under the impression UTF-16 was glyph based

from The Unicode Standard, ch2 General Structure http://www.unicode.org/uni2book/ch02.pdf

"Characters, not glyphs - The Unicode Standard encodes characters, not glyphs. The Unicode Standard draws a distinction between characters, which are the smallest components of written language that have semantic value, and glyphs, which represent the shapes that characters can have when they are rendered or displayed. Various relationships may exist between characters and glyphs: a single glyph may correspond to a single character, or to a number of characters, or multiple glyphs may result from a single character."

btw, there are many precomposed characters in the Unicode which can be represented with combining characters as well. ( [â] and [a,(combining ^)] - equally valid representations for [a with circumflex] ).
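[A small example of the two equally valid representations mentioned above, written as dchar arrays (later D syntax) so the code point counts are visible; they draw the same glyph but compare unequal until a normalization step maps one onto the other.]

    void main()
    {
        dstring precomposed = "\u00E2"d;      // [â]              : 1 code point
        dstring decomposed  = "a\u0302"d;     // [a, combining ^] : 2 code points

        assert(precomposed.length == 1);
        assert(decomposed.length == 2);
        assert(precomposed != decomposed);    // equal only after normalization
    }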
Jan 16 2003
> First, UTF-16 is just one of the many standard encodings for the Unicode.
> UTF-16 allows more than 16bit characters - with surrogates it can represent
> all >1M codes. (Unicode v2 used UCS-2 which is 16bit-only encoding)

right, me getting confused. too many tla's too many standards (as ever).

> btw, there are many precomposed characters in the Unicode which can be
> represented with combining characters as well. ( [â] and [a,(combining ^)] -
> equally valid representations for [a with circumflex] ).

so if I read this right ... (been using UTF8 for ages and ignored what it represents, keeps me sane (er) ) I can't understand arabic file names anyway :)

so a string (no matter how its encoded) contains 3 lengths: the byte length, the number of unicode entities (16 bit UCS-2) and the number of "characters". so cât as UTF8 is 4 bytes, as UTF-16 is 6 bytes, its 3 UCS-2 entities, and 3 "characters". but if the â was [a,(combining ^)] not the single â UCS-2 value then cât would be UTF8 is 8+ bytes, as UTF-16 is 8 bytes, its 4 UCS-2 entities, but still 3 "characters".

which is why I think String should be a class not a thing[]. you should be able to get a utf8 encoded byte[], utf-16 short[], UCS-2 short[] (for win32/api), (32 bit unicode) int[] (for linux) and ideally a Character[] from the string. how a String is stored utf8, utf16 or 32bit/64bit values is only relevant for performance and different people will want different internal representations. but semantically they should be all the same.

this is all another reason why I also think that arrays should be templated classes that have an index method (operator []) so the Character[] from the string can modify the String it represents.

Mike.
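[A rough sketch (hypothetical class, nothing that exists in a library) of the idea above: one String object that can report its byte length, its UTF-16 code-unit count and its character count regardless of how it is stored internally. Characters are counted as code points here; merging combining sequences into one "character" is left out of the sketch.]

    class String
    {
        private dchar[] data;                 // one possible internal form

        this(const(dchar)[] s) { data = s.dup; }

        size_t characterCount() { return data.length; }   // code points

        size_t utf16Length()                  // UTF-16 code units
        {
            size_t n = 0;
            foreach (dchar c; data)
                n += (c > 0xFFFF) ? 2 : 1;    // surrogate pair above the BMP
            return n;
        }

        size_t utf8Length()                   // bytes when encoded as UTF-8
        {
            size_t n = 0;
            foreach (dchar c; data)
                n += (c <= 0x7F) ? 1 : (c <= 0x7FF) ? 2 : (c <= 0xFFFF) ? 3 : 4;
            return n;
        }
    }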
Jan 18 2003
> 3) Is Win32's "wide char" really UTF-16, including the multi word encodings?

WinXP, WinCE : UTF-16
Win2K : was UCS-2, but some service pack made it UTF-16
WinNT4 : UCS-2
Win9x : must die.

> 5) 16 bit accesses on Intel CPUs can be pretty slow compared to byte or
> dword accesses (varies by CPU type).

16bit prefix can slow down instruction decoding (mostly for Intel CPUs, but P4 uses pre-decoded instructions anyhow), while instruction processing is more cache-branch-sensitive.

> 6) Sure, UTF-16 reduces the frequency of multi character encodings, but the
> code to deal with it must still be there and must still execute.

Just an idea : string class may have 2 values for the string length:
1 - number of "units" ( 8bit for UTF-8, 16bit for UTF-16 )
2 - number of characters.
In case if these numbers are equal, string processing library may use simplified and faster functions.

> 7) I've converted some large Java text processing apps to C++, and converted
> the Java 16 bit char's to using UTF-8. That change resulted in *substantial*
> performance improvements.
> 8) I suspect that 99% of the text processed in computers is ascii. UTF-8 is
> a big win in memory and speed for processing english text.

You think, that 99% of the computer users - english speaking? Think again...

btw, something about UTF-8 & UTF-16 efficiency:
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results

For latin script based languages - UTF-8 takes ~51% less space than UTF-16.
For greek (expect the same for cyrillic) - ~88% - not that better than UTF-16.
For japanese, chinese, korean, hindi - 115%..140% - UTF-16 is more space efficient.
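[A minimal sketch of the two-length idea above for the UTF-8 case: when the unit count equals the character count the text is pure ASCII, and indexing can stay on the simple fast path. Names are made up for illustration; later D syntax.]

    struct CountedString
    {
        string units;          // UTF-8 code units
        size_t characters;     // decoded character count, computed once on creation

        dchar charAt(size_t i)
        {
            if (characters == units.length)   // fast path: 1 unit per character
                return units[i];
            size_t seen = 0;                  // slow path: decode up to i
            foreach (dchar c; units)
                if (seen++ == i)
                    return c;
            assert(0, "index out of range");
        }
    }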
Jan 16 2003
"Serge K" <skarebo programmer.net> wrote in message news:b0anmt$r7g$1 digitaldaemon.com...LOL! Looking forward, then, one can treat it as UTF-16.3) Is Win32's "wide char" really UTF-16, including the multi wordencodings? WinXP, WinCE : UTF-16 Win2K : was UCS-2, but some service pack made it UTF-16 WinNT4 : UCS-2 Win9x : must die.Not at all. But the text processed - yes. But I imagine it would be pretty tough to come by figures for that that are better than speculation.8) I suspect that 99% of the text processed in computers is ascii. UTF-8isa big win in memory and speed for processing english text.You think, that 99% of the computer users - english speaking?something about UTF-8 & UTF-16 efficiency:http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results For latin script based languages - UTF-8 takes ~51% less space thanUTF-16.For greek (expect the same for cyrillic)- ~88% - not that better than UTF-16. For japanese, chinese, korean, hindi - 115%..140% - UTF-16 is more space efficient.Thanks for the info. That's about what I would have guessed. Another valuable statistic would be how well UTF-8 compressed with LZW as opposed to the same thing in UTF-16.
Jan 17 2003
"globalization guy" <globalization_member pathlink.com> escreveu na mensagem news:b05pdd$13bv$1 digitaldaemon.com...I think you'll be making a big mistake if you adopt C's obsolete char ==byteconcept of strings. Savvy language designers these days realize that, likeint'sand float's, char's should be a fundamental data type at a higher-level of abstraction than raw bytes. The model that most modern language designersareturning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit. If you do so, you make it possible for strings in your language to have a single, canonical form that all APIs use. Instead of the nightmare thatC/C++programmers face when passing string parameters ("now, let's see, is thisachar* or a const char* or an ISO C++ string or an ISO wstring or awchar_t* or achar[] or a wchar_t[] or an instance of one of countless stringclasses...?).The fact that not just every library but practically every project feelstheneed to reinvent its own string type is proof of the need for a good,solid,canonical form built right into the language. Most language designers these days either get this from the start of theylaterfigure it out and have to screw up their language with multiple stringtypes.Having canonical UTF-16 chars and strings internally does not mean thatyoucan't deal with other character encodings externally. You can can converttocanonical form on import and convert back to some legacy encoding onexport.When you create the strings yourself, or when they are created in Java orJavascript or default XML or most new text protocols, no conversion willbenecessary. It will only be needed for legacy data (or a very lightweightswitchbetween UTF-8 and UTF-16). And for those cases where you have to work with legacy data and yet don't want to incur the overhead of encodingconversion inand out, you can still treat the external strings as byte arrays insteadofstrings, assuming you have a "byte" data type, and do direct bytemanipulationon them. That's essentially what you would have been doing anyway if youhadused the old char == byte model I see in your docs. You just call it"byte"instead of "char" so it doesn't end up being your default string type. Having a modern UTF-16 char type, separate from arrays of "byte", givesyou aconsistency that allows for the creation of great libraries (since text issuchstart, andtheir libraries universally use a single string type. Perl figured it outprettylate and as a result, with the addition of UTF-8 to Perl in v. 5.6, it'sneverclear which CPAN modules will work and which ones will fail, so you haveto usepragmas ("use utf-8" vs. "use bytes") and do lots of testing. I hope you'll consider making this change to your design. Have an 8-bitunsigned"byte" type and a 16-bit unsigned UTF-16 "char" and forget about this"8-bitchar plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuffor I'mquite sure you'll later regret it. C/C++ are in that sorry state forlegacyreasons only, not because their designers were foolish, but any newlanguagethat intentionally copies that "design" is likely to regret that decision.Hi, There was a thread a year ago in the smalleiffel mailing list (starting at http://groups.yahoo.com/group/smalleiffel/message/4075 ) about unicode strings in Eiffel. It's a quite interesting read about the problems of adding string-like Unicode classes. The main point is that true Unicode support is very difficult to achieve just some libraries provide good, correct and complete unicode encoders/decoders/renderers/etc. 
While I agree that some Unicode support is a necessity today (my mother tongue is Brazilian Portuguese, so I use non-ascii characters every day), we can't just add some base types and pretend everything is all right. We won't correct incorrectly written code with a primitive unicode string. Most programmers don't think about unicode when they develop their software, so almost every line of code dealing with text contains some assumptions about the character sets being used. Java has a primitive 16 bit char, but basic library functions (because they need good performance) use incorrect code for string handling stuff (the correct classes are in java.text, providing means to correctly collate strings). Sometimes we are just using plain old ASCII but we're bitten by the encoding issues. And when we need to deal with true unicode support, the libraries trick us into believing everything is ok.

IMO D should support a simple char array to deal with ASCII (as it does today) and some kind of standard library module to deal with unicode glyphs and text. This could be included in phobos or even in deimos. Any volunteers? With this we could force the programmer to deal with another set of tools (albeit similar) when dealing with each kind of string: ASCII or unicode. This module should allow creation of variable sized strings and glyphs through an opaque ADT. Each kind of usage has different semantics and optimization strategies (e.g. Boyer-Moore is good for ASCII but with unicode the space and time usage are worse).

Best regards,
Daniel Yokomiso.

P.S.: I had to write some libraries and components (EJBs) in several Java projects to deal with data transfer in plain ASCII (communication with IBM mainframes). Each day I dreamed of using a language with simple one byte character strings, without problems with encoding and endianness (Solaris vs. Linux vs. Windows NT have some nice "features" in their JVMs if you aren't careful when writing Java code that uses "ASCII" String). But Java has a 16 bit character type and a SIGNED byte type, both awkward for this usage. A language shouldn't get in the way of simple code.

"Never argue with an idiot. They drag you down to their level then beat you with experience."
Jan 17 2003
I once wrote a large project that dealt with mixed ascii and unicode. There was bug after bug when the two collided. Finally, I threw in the towel and made the entire program unicode - every string in it.

The trouble in D is that in the current scheme, everything dealing with text has to be written twice, once for char[] and again for wchar_t[]. In C, there's that wretched tchar.h to swap back and forth. It may just be easier in the long run to just make UTF-8 the native type, and then at least try and make sure the standard D library is correct. -Walter

"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message news:b0agdq$ni9$1 digitaldaemon.com...
[snip]
Jan 17 2003
"Walter" <walter digitalmars.com> escreveu na mensagem news:b0b0up$vk7$1 digitaldaemon.com...I once wrote a large project that dealt with mixed ascii and unicode.Therewas bug after bug when the two collided. Finally, I threw in the towel and made the entire program unicode - every string in it. The trouble in D is that in the current scheme, everything dealing withtexthas to be written twice, once for char[] and again for wchar_t[]. In C, there's that wretched tchar.h to swap back and forth. It may just beeasierin the long run to just make UTF-8 the native type, and then at least try and make sure the standard D library is correct. -Walter[snip] Hi, Current D uses char[] as the string type. If we declare each char to be UTF-8 we'll have all the problems with what does "myString[13] = someChar;" means. I think a opaque string datatype may be better in this case. We could have a glyph datatype that represents one unicode glyph in UTF-8 encoding, and use it together with a string class. Also I don't think a mutable string type is a good idea. Python and Java use immutable strings, and this leads to better programs (you don't need to worry about copying your strings when you get or give them). Some nice tricks, like caching hashCode results for strings are possible, because the values won't change. We could also provide a mutable string class. If this is the way to go we need lots of test cases, specially from people with experience writing unicode libraries. The Unicode spec has lots of particularities, like correct regular expression support, that may lead to subtle bugs. Best regards, Daniel Yokomiso. "Before you criticize someone, walk a mile in their shoes. That way you're a mile away and you have their shoes, too." --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.443 / Virus Database: 248 - Release Date: 10/1/2003
Jan 18 2003
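A minimal sketch of the immutable, hash-caching string object Daniel describes, in present-day D. The names (CachedString, hash) are invented for illustration; this is not a proposal for an actual library type.

    import std.stdio;

    // An immutable string wrapper that computes its hash at most once.
    // Because the contents never change, the cached value stays valid.
    struct CachedString
    {
        private immutable(char)[] data;
        private size_t cachedHash;
        private bool hashed;

        this(string s) { data = s; }

        string text() const { return data; }

        size_t hash()
        {
            if (!hashed)
            {
                // Simple FNV-1a style hash; any stable hash would do.
                size_t h = 2166136261u;
                foreach (ubyte b; cast(const(ubyte)[]) data)
                    h = (h ^ b) * 16777619u;
                cachedHash = h;
                hashed = true;
            }
            return cachedHash;
        }
    }

    void main()
    {
        auto s = CachedString("immutable strings can cache their hash");
        writeln(s.hash() == s.hash());   // true; computed only once
    }

Because the wrapped data can never change, the cached hash can be handed out forever without invalidation, which is the property Daniel points to in Java and Python strings.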
On Sat, 18 Jan 2003 12:51:42 -0300 "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote:
> Current D uses char[] as the string type. If we declare each char to be UTF-8 we'll have all the problems with what "myString[13] = someChar;" means. I think an opaque string datatype may be better in this case. We could have a glyph datatype that represents one unicode glyph in UTF-8 encoding, and use it together with a string class. [...]

So what does "myString[13] = someGlyph" mean? char doesn't have to be a byte, we can have another data type for that.

--
Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/
 ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"Yesterday no longer exists
Tomarrow's forever a day away
And we are cell-mates, held together
in the shoreless stream that is today."
Jan 18 2003
"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message news:b0bpq9$1d3d$1 digitaldaemon.com...Current D uses char[] as the string type. If we declare each char tobeUTF-8 we'll have all the problems with what does "myString[13] =someChar;"means. I think a opaque string datatype may be better in this case. Wecouldhave a glyph datatype that represents one unicode glyph in UTF-8 encoding, and use it together with a string class.I'm thinking that myString[13] should simply set the byte at myString[13]. Trying to fiddle with the multibyte stuff with simple array access semantics just looks to be too confusing and error prone. To access the unicode characters from it would be via a function or property.Also I don't think a mutable string type is a good idea. Python and Java use immutable strings, and this leads to better programs (you don't need to worry about copying your stringswhenyou get or give them). Some nice tricks, like caching hashCode results for strings are possible, because the values won't change. We could alsoprovidea mutable string class.I think the copy-on-write approach to strings is the right idea. Unfortunately, if done by the language semantics, it can have severe adverse performance results (think of a toupper() function, copying the string again each time a character is converted). Using it instead as a coding style, which is currently how it's done in Phobos, seems to work well. My javascript implementation (DMDScript) does cache the hash for each string, and that works well for the semantics of javascript. But I don't think it is appropriate for lower level language like D to do as much for strings.If this is the way to go we need lots of test cases, specially from people with experience writing unicode libraries. The Unicode spec haslotsof particularities, like correct regular expression support, that may lead to subtle bugs.Regular expression implementations naturally lend themselves to subtle bugs :-(. Having a good test suite is a lifesaver.
Jan 18 2003
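A sketch of the split Walter is describing, in present-day D: s[i] stays a plain byte access, and reading a whole character at a byte offset goes through a function. The decodeAt helper is hypothetical and hand-rolled; it assumes well-formed UTF-8 and does no error checking.

    import std.stdio;

    // Hypothetical helper: decode the code point whose first byte is at index i.
    // Assumes well-formed UTF-8; no error handling.
    dchar decodeAt(const(char)[] s, size_t i)
    {
        uint b = s[i];
        if (b < 0x80) return cast(dchar) b;             // ASCII: one byte
        int extra = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1;  // trailing bytes
        uint c = b & (0x3F >> extra);                   // payload bits of the lead byte
        foreach (k; 1 .. extra + 1)
            c = (c << 6) | (s[i + k] & 0x3F);
        return cast(dchar) c;
    }

    void main()
    {
        string s = "päy";              // 'ä' occupies two bytes
        writeln(cast(uint) s[0]);      // byte-level access: 112 == 'p'
        writeln(decodeAt(s, 1));       // character-level access: ä
        assert(decodeAt(s, 1) == 0x00E4);
    }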
Walter wrote:"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message news:b0bpq9$1d3d$1 digitaldaemon.com...I disagree. Returning the character makes indexing expensive, but it has the expectant result and for the most part hides the fact that compaction is going on automatically; the only rule change is that indexed assignment can invalidate any slices and copies, which isn't any worse than D's current rules. Then char.size will be 4 and char.max will be 0x10FFFF or 0x7FFFFFFF, depending upon whether we use UNICODE or ISO-10646 for our UTF-8. I also think that incrementing a char pointer should read the data to determine how many bytes it needs to skip. It should be as transparent as possible! If it can't be transparent, then it should use a class or be limited: no indexing, no char pointers. I don't like either option. [snip]Current D uses char[] as the string type. If we declare each char to be UTF-8 we'll have all the problems with what does "myString[13] = someChar;" means. I think a opaque string datatype may be better in this case. We could have a glyph datatype that represents one unicode glyph in UTF-8 encoding, and use it together with a string class.I'm thinking that myString[13] should simply set the byte at myString[13]. Trying to fiddle with the multibyte stuff with simple array access semantics just looks to be too confusing and error prone. To access the unicode characters from it would be via a function or property.
Jan 18 2003
"Burton Radons" <loth users.sourceforge.net> wrote in message news:b0cgdd$1t4o$1 digitaldaemon.com...Walter wrote:someChar;""Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message news:b0bpq9$1d3d$1 digitaldaemon.com...Current D uses char[] as the string type. If we declare each char to be UTF-8 we'll have all the problems with what does "myString[13] =couldmeans. I think a opaque string datatype may be better in this case. Weencoding,have a glyph datatype that represents one unicode glyph in UTF-8myString[13].and use it together with a string class.I'm thinking that myString[13] should simply set the byte atsemanticsTrying to fiddle with the multibyte stuff with simple array accessObviously, this needs more thought by me.just looks to be too confusing and error prone. To access the unicode characters from it would be via a function or property.I disagree. Returning the character makes indexing expensive, but it has the expectant result and for the most part hides the fact that compaction is going on automatically; the only rule change is that indexed assignment can invalidate any slices and copies, which isn't any worse than D's current rules. Then char.size will be 4 and char.max will be 0x10FFFF or 0x7FFFFFFF, depending upon whether we use UNICODE or ISO-10646 for our UTF-8. I also think that incrementing a char pointer should read the data to determine how many bytes it needs to skip. It should be as transparent as possible! If it can't be transparent, then it should use a class or be limited: no indexing, no char pointers. I don't like either option.
Jan 18 2003
"Walter" <walter digitalmars.com> escreveu na mensagem news:b0c66n$1mq6$1 digitaldaemon.com..."Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message news:b0bpq9$1d3d$1 digitaldaemon.com...encoding,Current D uses char[] as the string type. If we declare each char tobeUTF-8 we'll have all the problems with what does "myString[13] =someChar;"means. I think a opaque string datatype may be better in this case. Wecouldhave a glyph datatype that represents one unicode glyph in UTF-8semanticsand use it together with a string class.I'm thinking that myString[13] should simply set the byte at myString[13]. Trying to fiddle with the multibyte stuff with simple array accessjust looks to be too confusing and error prone. To access the unicode characters from it would be via a function or property.That's why I think it should be a opaque, immutable, data-type.leadsAlso I don't think a mutable string type is a good idea. Python and Java use immutable strings, and thisforto better programs (you don't need to worry about copying your stringswhenyou get or give them). Some nice tricks, like caching hashCode resultsadversestrings are possible, because the values won't change. We could alsoprovidea mutable string class.I think the copy-on-write approach to strings is the right idea. Unfortunately, if done by the language semantics, it can have severeperformance results (think of a toupper() function, copying the stringagaineach time a character is converted). Using it instead as a coding style, which is currently how it's done in Phobos, seems to work well. My javascript implementation (DMDScript) does cache the hash for each string, and that works well for the semantics of javascript. But I don't think itisappropriate for lower level language like D to do as much for strings.leadIf this is the way to go we need lots of test cases, specially from people with experience writing unicode libraries. The Unicode spec haslotsof particularities, like correct regular expression support, that maybugsto subtle bugs.Regular expression implementations naturally lend themselves to subtle:-(. Having a good test suite is a lifesaver.Not if you write a "correct" regular expression implementation. If you implement right from scratch using simple NFAs you probably won't have any headaches. I've implemented a toy regex machine in Java based on Mark Jason Dominus excelent article "How Regexes work" at http://perl.plover.com/Regex/ It's very simple and quite fast as it's a dumb implementation without any kind of optimizations (4 times slower than a fast bytecode regex interpreter in Java, http://jakarta.apache.org/regexp/index.html). Also the sourcecode is lot's of times cleaner. BTW I've written a unit test suite based on Jakarta Regexp set of tests. I can port it to D if you like and use it with your regex implementation. --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). Version: 6.0.443 / Virus Database: 248 - Release Date: 10/1/2003
Jan 18 2003
"Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message news:b0cond$222q$1 digitaldaemon.com...BTW I've written a unit test suite based on Jakarta Regexp set of tests. I can port it to D if you like and use itwithyour regex implementation.At the moment I'm using Spencer's regex test suite augmented with a bunch of new test vectors. More testing is better, so yes I'm interested in better & more comprehensive tests.
Jan 18 2003
"Walter" <walter digitalmars.com> wrote in message news:b0c66n$1mq6$1 digitaldaemon.com...I think the copy-on-write approach to strings is the right idea. Unfortunately, if done by the language semantics, it can have severeadverseperformance results (think of a toupper() function, copying the stringagaineach time a character is converted). Using it instead as a coding style,Copy-on-write usually doesn't copy unless there's more than one live reference to the string. If you're actively modifying it, it'll only make one copy until you distribute the new reference. Of course that means reference counting. Perhaps the GC could store info about string use.
Jan 19 2003
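A rough illustration of the reference-counted copy-on-write Sean sketches, in present-day D. CowString and its fields are invented names; a real implementation would also have to deal with thread safety and with the GC interaction Ilya raises in the next post.

    import std.stdio;

    // Copy-on-write buffer: writers copy only if someone else still
    // holds a reference to the same payload. Illustrative only.
    struct CowString
    {
        private static struct Payload { char[] data; int refs; }
        private Payload* p;

        this(const(char)[] s)
        {
            p = new Payload(s.dup, 1);
        }

        this(this) { if (p) p.refs++; }       // postblit: another live reference

        ~this() { if (p) p.refs--; }

        void put(size_t i, char c)
        {
            if (p.refs > 1)                    // someone else sees this buffer
            {
                p.refs--;
                p = new Payload(p.data.dup, 1); // make a private copy once
            }
            p.data[i] = c;
        }

        const(char)[] text() const { return p.data; }
    }

    void main()
    {
        auto a = CowString("toupper me");
        auto b = a;              // no copy yet, just another reference
        b.put(0, 'T');           // the copy happens here, once
        writeln(a.text());       // "toupper me"
        writeln(b.text());       // "Toupper me"
    }

With this scheme a toupper-style loop copies at most once, when the first character is written, rather than once per character.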
Sean L. Palmer wrote:
> "Walter" <walter digitalmars.com> wrote in message news:b0c66n$1mq6$1 digitaldaemon.com...
> > I think the copy-on-write approach to strings is the right idea. Unfortunately, if done by the language semantics, it can have severe adverse performance results (think of a toupper() function, copying the string again each time a character is converted). Using it instead as a coding style,
>
> Copy-on-write usually doesn't copy unless there's more than one live reference to the string. If you're actively modifying it, it'll only make one copy until you distribute the new reference. Of course that means reference counting. Perhaps the GC could store info about string use.

That's not gonna work, because there's no reliable way you can get this data from the GC outside a mark phase. The Delphi string implementation is ref-counted, and is said to be extremely slow. So it's better to copy and forget the rest than to count at every assignment. You'll just have one more reason to optimise the GC then. :)

IMO, the amount of copying should be limited by merging the operations together.
Jan 20 2003
"Walter" <walter digitalmars.com> wrote in message news:b0b0up$vk7$1 digitaldaemon.com...I once wrote a large project that dealt with mixed ascii and unicode.Therewas bug after bug when the two collided. Finally, I threw in the towel and made the entire program unicode - every string in it. The trouble in D is that in the current scheme, everything dealing withtexthas to be written twice, once for char[] and again for wchar_t[]. In C, there's that wretched tchar.h to swap back and forth. It may just beeasierin the long run to just make UTF-8 the native type, and then at least try and make sure the standard D library is correct.I've gotten a little confused reading this thread. Here are some questions swimming in my head: 1) What does it mean to make UTF-8 the native type? 2) What is char.size? 3) Does char[] differ from byte[] or is it a typedef? 4) How does one get a UTF-16 encoding of a char[], or get the length, or get the 5th character, or set the 5th character to a given unicode character (expressed in UTF-16, say)? Here are my guesses to the answers: 1) string literals are encoded in UTF-8 2) char.size = 8 3) it's a typedef 4) through the library or directly if you know enough about the char[] you are manipulating. Is this correct? thanks, -Ben-Walter "Daniel Yokomiso" <daniel_yokomiso yahoo.com.br> wrote in message news:b0agdq$ni9$1 digitaldaemon.com...=="globalization guy" <globalization_member pathlink.com> escreveu namensagemnews:b05pdd$13bv$1 digitaldaemon.com...I think you'll be making a big mistake if you adopt C's obsolete charhigher-levelbytelikeconcept of strings. Savvy language designers these days realize that,int'sand float's, char's should be a fundamental data type at aofhavedesignersabstraction than raw bytes. The model that most modern languageareturning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit. If you do so, you make it possible for strings in your language toathatsingle, canonical form that all APIs use. Instead of the nightmarefeelsC/C++thisprogrammers face when passing string parameters ("now, let's see, isachar* or a const char* or an ISO C++ string or an ISO wstring or awchar_t* or achar[] or a wchar_t[] or an instance of one of countless stringclasses...?).The fact that not just every library but practically every projectthatthetheyneed to reinvent its own string type is proof of the need for a good,solid,canonical form built right into the language. Most language designers these days either get this from the start oflaterfigure it out and have to screw up their language with multiple stringtypes.Having canonical UTF-16 chars and strings internally does not meanwillyouconvertcan't deal with other character encodings externally. You can cantoorcanonical form on import and convert back to some legacy encoding onexport.When you create the strings yourself, or when they are created in JavaJavascript or default XML or most new text protocols, no conversionlightweightbenecessary. It will only be needed for legacy data (or a veryinsteadswitchwithbetween UTF-8 and UTF-16). And for those cases where you have to worklegacy data and yet don't want to incur the overhead of encodingconversion inand out, you can still treat the external strings as byte arraysyouofstrings, assuming you have a "byte" data type, and do direct bytemanipulationon them. That's essentially what you would have been doing anyway ifgiveshadused the old char == byte model I see in your docs. 
You just call it"byte"instead of "char" so it doesn't end up being your default string type. Having a modern UTF-16 char type, separate from arrays of "byte",textyou aconsistency that allows for the creation of great libraries (sinceisit'ssuchoutstart, andtheir libraries universally use a single string type. Perl figured itprettylate and as a result, with the addition of UTF-8 to Perl in v. 5.6,haveneverclear which CPAN modules will work and which ones will fail, so you8-bitto usepragmas ("use utf-8" vs. "use bytes") and do lots of testing. I hope you'll consider making this change to your design. Have anstuffunsigned"byte" type and a 16-bit unsigned UTF-16 "char" and forget about this"8-bitchar plus 16-bit wide char on Win32 and 32-bit wide char on Linux"unicodeor I'mdecision.quite sure you'll later regret it. C/C++ are in that sorry state forlegacyreasons only, not because their designers were foolish, but any newlanguagethat intentionally copies that "design" is likely to regret that(startingHi, There was a thread a year ago in the smalleiffel mailing listat http://groups.yahoo.com/group/smalleiffel/message/4075 ) aboutunicodestrings in Eiffel. It's a quite interesting read about the problems of adding string-like Unicode classes. The main point is that true Unicode support is very difficult to achieve just some libraries provide good, correct and complete unicode encoders/decoders/renderers/etc. While I agree that some Unicode support is a necessity today (main mother tongue is brazilian portuguese so I use non-ascii characters everyday), we can't just add some base types and pretend everything is allright. We won't correct incorrect written code with a primitivetheirstring. Most programmers don't think about unicode when they developusesoftware, so almost every line of code dealing with texts contain some assumptions about the character sets being used. Java has a primitive 16bitchar, but basic library functions (because they need good performance)orincorrect code for string handling stuff (the correct classes are in java.text, providing means to correctly collate strings). Some times wearejust using plain old ASCII but we're bitten by the encoding issues. Andwhenwe need to deal with true unicode support the libraries tricky us into believing everything is ok. IMO D should support a simple char array to deal with ASCII (as itdoestoday) and some kind of standard library module to deal with unicodeglyphsand text. This could be included in phobos or even in deimos. Any volunteers? With this we could force the programmer to deal with anothersetof tools (albeit similar) when dealing with each kind of string: ASCIIIBMunicode. This module should allow creation of variable sized string and glyphs through an opaque ADT. Each kind of usage has different semanticsandoptimization strategies (e.g. Boyer-Moore is good for ASCII but withunicodethe space and time usage are worse). Best regards, Daniel Yokomiso. P.S.: I had to written some libraries and components (EJBs) in severalJavaprojects to deal with data-transfer in plain ASCII (communication witharen'tmainframes). Each day I dreamed of using a language with simple one byte character strings, without problems with encoding and endianess (Solarisvs.Linux vs. Windows NT have some nice "features" in their JVMs if youAcareful when writing Java code that uses "ASCII" String). But Java has a16bit character type and a SIGNED byte type, both awkward for this usage.language shouldn't get in the way of simple code. "Never argue with an idiot. 
Jan 18 2003
"Ben Hinkle" <bhinkle mathworks.com> wrote in message news:b0bvoh$1hm5$1 digitaldaemon.com...I've gotten a little confused reading this thread. Here are some questions swimming in my head: 1) What does it mean to make UTF-8 the native type?From a compiler standpoint, all it really means is that string literals are encoded as UTF-8. The real support for it will be in the runtime library, such as UTF-8 support in printf().2) What is char.size?It'll be 1.3) Does char[] differ from byte[] or is it a typedef?It differs in that it can be overloaded differently, and the compiler recognizes char[] as special when doing casts to other array types - it can do conversions between UTF-8 and UTF-16, for example.4) How does one get a UTF-16 encoding of a char[],At the moment, I'm thinking: wchar[] w; char[] c; w = cast(wchar[])c; to do a UTF-8 to UTF-16 conversion.or get the length,To get the length in bytes: c.length to get the length in USC-4 characters, perhaps: c.nchars ??or get the 5th character, or set the 5th character to a given unicode character (expressed in UTF-16, say)?Probably a library function.
Jan 18 2003
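What the proposed cast(wchar[]) would have to do under the hood, written out as an explicit function in present-day D (which ended up exposing this as a library conversion in std.utf rather than through a cast). The hand-rolled decode/encode below assumes well-formed input and skips all error handling.

    import std.stdio;

    // Sketch: decode each UTF-8 sequence to a code point, then emit one wchar,
    // or a surrogate pair for code points above 0xFFFF. No validation.
    wchar[] utf8ToUtf16(const(char)[] s)
    {
        wchar[] result;
        size_t i = 0;
        while (i < s.length)
        {
            uint b = s[i];
            int extra = b < 0x80 ? 0 : b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1;
            uint c = extra == 0 ? b : b & (0x3F >> extra);
            foreach (k; 1 .. extra + 1)
                c = (c << 6) | (s[i + k] & 0x3F);
            i += extra + 1;

            if (c <= 0xFFFF)
                result ~= cast(wchar) c;
            else
            {
                c -= 0x10000;                                  // 20 bits remain
                result ~= cast(wchar)(0xD800 | (c >> 10));     // high surrogate
                result ~= cast(wchar)(0xDC00 | (c & 0x3FF));   // low surrogate
            }
        }
        return result;
    }

    void main()
    {
        string s = "caf\u00E9 \U0001D11E";   // "café " plus a musical G clef
        wchar[] w = utf8ToUtf16(s);
        writefln("%s UTF-8 bytes became %s UTF-16 code units", s.length, w.length);
    }

Spelling it out shows why the cast is not free: it has to allocate and re-encode the whole string.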
The best way to handle Unicode is, as a previous poster suggested, to make UTF-16 the default and tack on ASCII conversions in the runtime library. Not the other way around. Legacy stuff should be runtime lib, modern stuff built-in. Otherwise we are building a language on outdated standards. I don't like typecasting hacks or half-measures. Besides, typecasting by definition should not change the size of its argument. Mark
Jan 18 2003
You're probably right, the typecasting hack is inconsistent enough with the way the rest of the language works that it's probably a bad idea. As for why UTF-16 instead of UTF-8, why do you find it preferable?

"Mark Evans" <Mark_member pathlink.com> wrote in message news:b0ccek$1qnh$1 digitaldaemon.com...
> The best way to handle Unicode is, as a previous poster suggested, to make UTF-16 the default and tack on ASCII conversions in the runtime library. Not the other way around. Legacy stuff should be runtime lib, modern stuff built-in. Otherwise we are building a language on outdated standards.
>
> I don't like typecasting hacks or half-measures. Besides, typecasting by definition should not change the size of its argument.
>
> Mark
Jan 18 2003
Walter asked,
> As for why UTF-16 instead of UTF-8, why do you find it preferable?

If one wants to do serious internationalized applications it is mandatory. China, Japan, India for example. China and India by themselves encompass hundreds of languages and dialects that use non-Western glyphs.

My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and SGML folks) complain that in their language work, not even UTF-16 is good enough. They push for 32 bits! I would not go that far, but UTF-16 is a very sensible, capable format for the majority of languages.

Mark
Jan 20 2003
"Mark Evans" <Mark_member pathlink.com> wrote in message news:b0itlo$2a46$1 digitaldaemon.com...If one wants to do serious internationalized applications it is mandatory. China, Japan, India for example. China and India by themselves encompass hundreds of languages and dialects that use non-Western glyphs.UTF-8 can handle that.My contacts at the SIL linguistics center in Dallas (heavy-duty Unicodeand SGMLfolks) complain that in their language work, not even UTF-16 is goodenough.They push for 32 bits!UTF-16 has 2^20 characters in it. UTF-8 has 2^31 characters.I would not go that far, but UTF-16 is a very sensible, capable format forthemajority of languages.The only advantage it has over UTF-8 is it is more compact for some languages. UTF-8 is more compact for the rest.
Jan 21 2003
Well OK I should have been clearer. You are right about sheer numerical quantity, but read the FAQ at Unicode.org (excerpted below). Numerical quantity at the price of variable-width codes is a headache. UTF-16 has variable width, but not as variable as UTF-8, and nowhere near as frequently. UTF-16 is the Windows standard. It's a sweet spot for Unicode, which was originally a pure 16-bit design. The Unicode leaders advocate UTF-16 and I accept their wisdom. The "real deal" with UTF-8 is that it's a retrofit to accommodate legacy ASCII that we all know and love. So again I would argue that UTF-8 qualifies in a certain sense as "legacy support," and should therefore go in the runtime, not the core code. I'd go even further and not use 'char' with any meaning other than UTF-16. I never liked the Windows char/wchar goofiness. A language should only have one type of char and the runtimes can support conversions of language-standard chars to other formats. Trying to shimmy 'alternative characters' into C was a bad idea. The wonderful thing about designing a new language is that you can do it right. (Implementation details at http://www.unicode.org/reports/tr27/ ) Mark http://www.unicode.org/faq/utf_bom.html ----------------------------------------------- "Most Unicode APIs are using UTF-16." ----------------------------------------------- "UTF-8 will be most common on the web. UTF16, UTF16LE, UTF16BE are used by Java and Windows." [BE and LE mean Big Endian and Little Endian.] ----------------------------------------------- "Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts." ----------------------------------------------- [UTF-8 can have anywhere from 1 to 4 code blocks so it's highly variable. UTF-16 almost always has one code block, and in rare 1% cases, two; but no more. This is important in the Asian context:] "East Asians (Chinese, Japanese, and Koreans) are ... are well acquainted with the problems that variable-width codes ... have caused....With UTF-16, relatively few characters require 2 units. The vast majority of characters in common use are single code units. Even in East Asian text, the incidence of surrogate pairs should be well less than 1% of all text storage on average." ----------------------------------------------- "Furthermore, both Unicode and ISO 10646 have policies in place that formally limit even the UTF-32 encoding form to the integer range that can be expressed with UTF-16 (or 21 significant bits)." ----------------------------------------------- "We don't anticipate a general switch to UTF-32 storage for a long time (if ever)....The chief selling point for Unicode was providing a representation for all the world's characters.... These features were enough to swing industry to the side of using Unicode (UTF-16)." -----------------------------------------------
Jan 21 2003
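The "one unit or two" behaviour described above comes from surrogate pairs. A small sketch in present-day D of how one code point above U+FFFF is split into two UTF-16 code units and put back together; purely illustrative.

    import std.stdio;

    void main()
    {
        // U+1D11E (musical symbol G clef) does not fit in a single 16-bit unit.
        dchar c = 0x1D11E;

        // Split into a surrogate pair...
        uint v = c - 0x10000;                         // 20 bits left
        wchar hi = cast(wchar)(0xD800 | (v >> 10));   // high (lead) surrogate
        wchar lo = cast(wchar)(0xDC00 | (v & 0x3FF)); // low (trail) surrogate

        // ...and reassemble the original code point.
        dchar back = cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));

        writefln("U+%05X -> %04X %04X -> U+%05X",
                 cast(uint) c, cast(uint) hi, cast(uint) lo, cast(uint) back);
        assert(back == c);
    }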
Quick follow-up. Even the extra space in UTF-8 will probably not be used in the future, and UTF-8 vs. UTF-16 are going to be neck-and-neck in terms of storage/performance over time. So I see no compelling reason for UTF-8 except its legacy ties to 7-bit ASCII. I think of UTF-8 as "ASCII with Unicode paint." Mark http://www-106.ibm.com/developerworks/library/utfencodingforms/ "Storage vs. performance Both UTF-8 and UTF-16 are substantially more compact than UTF-32, when averaging over the world's text in computers. UTF-8 is currently more compact than UTF-16 on average, although it is not particularly suited for East-Asian text because it occupies about 3 bytes of storage per code point. UTF-8 will probably end up as about the same as UTF-16 over time, and may end up being less compact on average as computers continue to make inroads into East and South Asia. Both UTF-8 and UTF-16 offer substantial advantages over UTF-32 in terms of storage requirements." http://czyborra.com/utf/ "Actually, UTF-8 continues to represent up to 31 bits with up to 6 bytes, but it is generally expected that the one million code points of the 20 bits offered by UTF-16 and 4-byte UTF-8 will suffice to cover all characters and that we will never get to see any Unicode character definitions beyond that."
Jan 21 2003
Mark Evans wrote:
> Walter asked,
> > As for why UTF-16 instead of UTF-8, why do you find it preferable?
>
> If one wants to do serious internationalized applications it is mandatory. China, Japan, India for example. China and India by themselves encompass hundreds of languages and dialects that use non-Western glyphs.
>
> My contacts at the SIL linguistics center in Dallas (heavy-duty Unicode and SGML folks) complain that in their language work, not even UTF-16 is good enough. They push for 32 bits!
>
> I would not go that far, but UTF-16 is a very sensible, capable format for the majority of languages.
>
> Mark

Could someone explain me *what's the difference*? I thought there was one unicode set, which encodes *everything*. Then, there are different "wrappings" of it, like UTF-8, UTF-16 and so on. They do the same by assigning blocks, where multiple "characters" of 8, 16, or however many bits compose a final character value. And a lot of optimisation can be done, because it is not likely that each next symbol will be from a different language, since natural language usually consists of words, sentences, and so on. In UTF-8 there are sequences, consisting of header-data, where the header encodes the language/code and the length of the text, so that some data is generalized and need not be transferred with every symbol, and so that a character in a certain encoding can take as many target system characters as it needs. As far as I understood, UTF-7 is the shortest encoding for latin text, but it would be less optimal for some multi-hundred-character sets than a generally wider encoding.

Please, someone correct me if i'm wrong. But if i'm right, Russian, arabic, and other "tiny" alphabets would only experience a minor "fat-ratio" with UTF-8, since they require less, not many more, symbols than latin. That is, only headers and no further overhead.

Can anyone tell me: taken the same newspaper article in chinese, japanese, or some other "wide" language, encoded in UTF-7, 8, 16, 32 and so on: how much space would it take? Which languages suffer more and which less from "small" UTF encodings?

-i.
Jan 22 2003
Ilya Minkov wrote:
> Could someone explain me *what's the difference*? ...

I see myself approved.

> Can anyone tell me: taken the same newspaper article in chinese, japanese, or some other "wide" language, encoded in UTF-7, 8, 16, 32 and so on: how much space would it take? Which languages suffer more and which less from "small" UTF encodings?

This one remains.

-i.
Jan 22 2003
Ilya Minkov says...
> Could someone explain me *what's the difference*?

Take the trouble to read through the links supplied in the previous posts before asking redundant questions like this.

Mark
Jan 22 2003
On Wed, 22 Jan 2003 15:26:56 +0100 Ilya Minkov <midiclub tiscali.de> wrote:
> sentences, and so on. In UTF-8 there are sequences, consisting of header-data, where the header encodes the language/code and the length of the text, so that some data is generalized and need not be transferred with every symbol, and so that a character in a certain encoding can take as many target system characters as it needs.

That's not how UTF-8 works (although I've thought an RLE scheme like the one you describe would be pretty good). In UTF-8 a glyph can be 1-4 bytes. If the unicode value is below 0x80, it takes one byte. If it's between 0x80 and 0x7FF (inclusive), it takes two, etc.

> As far as I understood, UTF-7 is the shortest encoding for latin text, but it would be less optimal for some multi-hundred-character sets than a generally wider encoding.

Quite less than optimal.

> Please, someone correct me if i'm wrong. But if i'm right, Russian, arabic, and other "tiny" alphabets would only experience a minor "fat-ratio" with UTF-8, since they require less, not many more, symbols than latin. That is, only headers and no further overhead.

Most western alphabets would take 1-2 bytes per char. I think Arabic would take 3.

> Can anyone tell me: taken the same newspaper article in chinese, japanese, or some other "wide" language, encoded in UTF-7, 8, 16, 32 and so on: how much space would it take? Which languages suffer more and which less from "small" UTF encodings?

UTF-8 just flat takes less space over all. At most, it takes 4 bytes per glyph, and for many, it takes less. The issue isn't really the space. It's the difficulty in dealing with an encoding where you don't know how long the next glyph will be without reading it. (Which also means that in order to access the glyph in the middle, you have to start scanning from the front.)

--
Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/
 ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"I hold it to be the inalienable right of anybody to go to hell in his own way." -- Robert Frost
Jan 22 2003
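A sketch of the length rule Theodore describes, as a hand-rolled UTF-8 encoder in present-day D. Error handling (surrogate values, code points past U+10FFFF) is deliberately left out, and the function name is made up.

    import std.stdio;

    // Encode one code point; the branch taken shows which size class it falls in.
    ubyte[] encodeUtf8(dchar c)
    {
        uint v = c;
        if (v < 0x80)    return [cast(ubyte) v];                     // 1 byte: ASCII
        if (v < 0x800)   return [cast(ubyte)(0xC0 | (v >> 6)),
                                 cast(ubyte)(0x80 | (v & 0x3F))];    // 2 bytes
        if (v < 0x10000) return [cast(ubyte)(0xE0 | (v >> 12)),
                                 cast(ubyte)(0x80 | ((v >> 6) & 0x3F)),
                                 cast(ubyte)(0x80 | (v & 0x3F))];    // 3 bytes
        return [cast(ubyte)(0xF0 | (v >> 18)),
                cast(ubyte)(0x80 | ((v >> 12) & 0x3F)),
                cast(ubyte)(0x80 | ((v >> 6) & 0x3F)),
                cast(ubyte)(0x80 | (v & 0x3F))];                     // 4 bytes
    }

    void main()
    {
        writeln(encodeUtf8('A').length);                 // 1 - ASCII
        writeln(encodeUtf8(cast(dchar) 0x00E9).length);  // 2 - é
        writeln(encodeUtf8(cast(dchar) 0x0639).length);  // 2 - Arabic 'ain
        writeln(encodeUtf8(cast(dchar) 0x4E2D).length);  // 3 - CJK
        writeln(encodeUtf8(cast(dchar) 0x1D11E).length); // 4 - above U+FFFF
    }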
Then considering UTF-16 might make sense... I think there is a way to optimise UTF8 though: pre-scan the string and record character width changes in an array.
Jan 22 2003
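One way to read the pre-scan idea above, in present-day D: a single pass builds an array of byte offsets, one per character, so that "the n-th character" becomes a table lookup plus a short decode. The helper name and layout are invented.

    import std.stdio;

    // One pass over the UTF-8 bytes, recording where each code point starts.
    // Assumes well-formed input.
    size_t[] buildIndex(const(char)[] s)
    {
        size_t[] starts;
        size_t i = 0;
        while (i < s.length)
        {
            starts ~= i;
            uint b = s[i];
            i += b < 0x80 ? 1 : b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : 2;
        }
        return starts;
    }

    void main()
    {
        string s = "aäあ𝄞";                // 1-, 2-, 3- and 4-byte characters
        size_t[] idx = buildIndex(s);
        writeln(idx);                       // [0, 1, 3, 6]
        writeln(s[idx[2] .. idx[3]]);       // the bytes of the third character: あ
    }

The cost is the extra array and the need to rebuild it whenever the underlying bytes change.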
> In UTF-8 a glyph can be 1-4 bytes.

Only if you live within the same dynamic range as UTF-16. To get the full effective UTF-8 dynamic range of 32 bits, UTF-8 employs up to six bytes. With 4 bytes it has the same range as UTF-16.

"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set."
http://www.unicode.org/reports/tr27

> The issue isn't really the space. It's the difficulty in dealing with an encoding where you don't know how long the next glyph will be without reading it.

Exactly. UTF-16 can have at most one extra code (in roughly 1% of cases). So you have either one 16-bit word, or two. UTF-8 is the absolute worst encoding in this regard. UTF-32 is the best (constant size).

The main selling point for D is that UTF-16 is the standard for Windows. Windows is built on it. Knowing Microsoft, they probably use a "slightly modified Microsoft version" of UTF-16...that would not surprise me at all.

Mark
Jan 23 2003
"Mark Evans" <Mark_member pathlink.com> wrote in message news:b0qp5g$n73$1 digitaldaemon.com...With 4In UTF-8 a glyph can be 1-4 bytes.Only if you live within the same dynamic range as UTF-16. To get the full effective UTF-8 dynamic range of 32 bits, UTF-8 employs up to six bytes.bytes it has the same range as UTF-16.Actually, UTF-8, UTF-16 and UTF-32 - all have the same range : [0..10FFFFh] UTF-8 encoding method can be extended up to six bytes max. to encode UCS-4 character set, but it is way beyond Unicode."The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allowsfor theuse of five- and six-byte sequences to encode characters that are outsidetherange of the Unicode character set." http://www.unicode.org/reports/tr27Please, do not post truncated citations. "The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters."SoThe issue isn't really the space. It's the difficulty in dealing with an encoding where you don't know how long the next glyph will be without reading it.Exactly. UTF-16 can have at most one extra code (in roughly 1% of cases).you have either one 16-bit word, or two. UTF-8 is the absolute worstencodingin this regard. UTF-32 is the best (constant size).For the real world applications UTF-16 strings have to use those surrogates only to access CJK Ideographs extensions (~43000 characters). In most of the cases UTF-16 string can be treated as an array of the UCS-2 characters. String object can include its length in 16bit units and in characters : if these numbers are equal - it's an UCS-2 string, no surrogates inside.The main selling point for D is that UTF-16 is the standard for Windows.Windowsis built on it. Knowing Microsoft, they probably use a "slightly modified Microsoft version" of UTF-16...that would not surprise me at all.Surprise... It's a regular UTF-16. >8-P (Starting with Win2K+sp.) WinNT 3.x & 4 support UCS-2 only - since it was Unicode 2.0 encoding. Any efficient prog. language must use UTF-16 for Windows implementation - otherwise it have to convert strings for any API function requiring string parameters...
Jan 27 2003
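Serge's "length in 16-bit units vs. length in characters" trick, sketched in present-day D: count code points once, and if the two lengths match the string contains no surrogates and can be indexed unit by unit. countChars is a made-up helper with no error handling.

    import std.stdio;

    // Count code points in UTF-16 data: every unit except a low (trail)
    // surrogate starts a new character.
    size_t countChars(const(wchar)[] s)
    {
        size_t chars = 0;
        foreach (wchar u; s)
            if (u < 0xDC00 || u > 0xDFFF)
                chars++;
        return chars;
    }

    void main()
    {
        wstring plain    = "Unicode 2.0 text"w;
        wstring withClef = "clef: \U0001D11E"w;

        // Equal counts: pure UCS-2, safe to index unit by unit.
        writeln(plain.length == countChars(plain));        // true
        writeln(withClef.length == countChars(withClef));  // false
    }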
"Serge K" <skarebo programmer.net> wrote in message news:b17cd6$2n1l$1 digitaldaemon.com...Any efficient prog. language must use UTF-16 for Windows implementation - otherwise it have to convert strings for any API function requiring string parameters...Not necessarilly. While Win32 is now fully UTF-16 internally, and apparently converts the strings in "A" api functions to UTF-16, because UTF-16 uses double the memory it can still be far more efficient for an app to do all its computation with UTF-8, and then convert when calling the windows api.
Feb 03 2003
On Mon, 3 Feb 2003 15:37:37 -0800 "Walter" <walter digitalmars.com> wrote:
> Not necessarily. While Win32 is now fully UTF-16 internally, and apparently converts the strings in "A" api functions to UTF-16, because UTF-16 uses double the memory it can still be far more efficient for an app to do all its computation with UTF-8, and then convert when calling the windows api.

Plus, UTF-8 is pretty standard for Unicode on Linux. I believe BeOS used it, too, although I could be wrong. I don't know what OSX uses, nor other unices. My point is that choosing a standard by what the underlying platform uses is a bad idea.

--
Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/
 ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"[...] for plainly, although every work of art is an expression, not every expression is a work of art." -- DeWitt H. Parker, "The Principles of Aesthetics"
Feb 04 2003
> My point is that choosing a standard by what the underlying platform uses is a bad idea.

I agree with this remark, but think there are plenty of platform-independent reasons for UTF-16. The fact that Windows uses it just cements the case.

Mark
Feb 13 2003
Walter says...
> > Any efficient prog. language must use UTF-16 for its Windows implementation - otherwise it has to convert strings for any API function requiring string parameters...
>
> Not necessarily. While Win32 is now fully UTF-16 internally, and apparently converts the strings in "A" api functions to UTF-16, because UTF-16 uses double the memory it can still be far more efficient for an app to do all its computation with UTF-8, and then convert when calling the windows api.

Memory is cheap and getting cheaper, but processor time never loses value.

The supposition that UTF-8 needs less space is flawed anyway. For some languages, yes -- but not all. My earlier citations indicate that long-term, averaging over all languages, UTF-8 and UTF-16 will require equivalent memory storage.

UTF-8 code is also harder to write because UTF-8 is just more complicated than UTF-16. The only reason for its popularity is that it's a fig leaf for people who really want to use ASCII. They can use ASCII and call it UTF-8. Not very forward-thinking.

Microsoft had good reasons for selecting UTF-16 and D should follow suit. Other languages are struggling with Unicode support, and it would be nice to have one language out up front in this area.

Mark
Feb 13 2003
> The supposition that UTF-8 needs less space is flawed anyway. For some languages, yes -- but not all. My earlier citations indicate that long-term, averaging over all languages, UTF-8 and UTF-16 will require equivalent memory storage.
>
> UTF-8 code is also harder to write because UTF-8 is just more complicated than UTF-16. The only reason for its popularity is that it's a fig leaf for people who really want to use ASCII. They can use ASCII and call it UTF-8. Not very forward-thinking.
>
> Microsoft had good reasons for selecting UTF-16 and D should follow suit. Other languages are struggling with Unicode support, and it would be nice to have one language out up front in this area.
>
> Mark

http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/index.html?dwzone=unicode
["Forms of Unicode", Mark Davis, IBM developer and President of the Unicode Consortium, IBM]

"Storage vs. performance
Both UTF-8 and UTF-16 are substantially more compact than UTF-32, when averaging over the world's text in computers. UTF-8 is currently more compact than UTF-16 on average, although it is not particularly suited for East-Asian text because it occupies about 3 bytes of storage per code point. UTF-8 will probably end up as about the same as UTF-16 over time, and may end up being less compact on average as computers continue to make inroads into East and South Asia. Both UTF-8 and UTF-16 offer substantial advantages over UTF-32 in terms of storage requirements."

{ btw, about storage : I've converted a 300KB text file (a russian book) into UTF-8 - it took about ~1.85 bytes per character. The little compression compared to UTF-16 comes mostly from "spaces" and punctuation marks, but it's hardly worth the processing complexity. }

"Code-point boundaries, iteration, and indexing are very fast with UTF-32. Code-point boundaries, accessing code points at a given offset, and iteration involve a few extra machine instructions for UTF-16; UTF-8 is a bit more cumbersome."

{ Occurrence of the UTF-16 surrogates in real texts is estimated as <1% for CJK languages. Other scripts encoded in "higher planes" cover very rare or dead languages and some special symbols (like modern & old music symbols). So, if a String object can identify absence of the surrogates - faster functions can be used in most of the cases. The same optimization works for UTF-8, but only in the US-niverse (even the British pound takes 2 bytes.. 8-) }

"Ultimately, the choice of which encoding format to use will depend heavily on the programming environment. For systems that only offer 8-bit strings currently, but are multi-byte enabled, UTF-8 may be the best choice. For systems that do not care about storage requirements, UTF-32 may be best. For systems such as Windows, Java, or ICU that use UTF-16 strings already, UTF-16 is the obvious choice. Even if they have not yet upgraded to fully support surrogates, they will be before long. If the programming environment is not an issue, UTF-16 is recommended as a good compromise between elegance, performance, and storage."
Feb 16 2003
Walter wrote:"Ben Hinkle" <bhinkle mathworks.com> wrote in message news:b0bvoh$1hm5$1 digitaldaemon.com...This is less complex than "w = toWideStringz(c);" somehow? I can't speak for anyone else, but this won't help my work with dig at all - I already have to preprocess any strings sent to the API with toStringz, while the public interface will still use char[]. So constant casting is the name of the game by necessity, and if I want to be conservative I have to cache the conversion and delete it anyway. Calling these APIs directly, when this casting becomes a win, just doesn't happen to me.4) How does one get a UTF-16 encoding of a char[],At the moment, I'm thinking: wchar[] w; char[] c; w = cast(wchar[])c; to do a UTF-8 to UTF-16 conversion.
Jan 18 2003
"Walter" <walter digitalmars.com> wrote in message news:b0c66n$1mq6$2 digitaldaemon.com..."Ben Hinkle" <bhinkle mathworks.com> wrote in message news:b0bvoh$1hm5$1 digitaldaemon.com...questionsI've gotten a little confused reading this thread. Here are someareswimming in my head: 1) What does it mean to make UTF-8 the native type?From a compiler standpoint, all it really means is that string literalsencoded as UTF-8. The real support for it will be in the runtime library, such as UTF-8 support in printf().D'oh! char.size=8 is a tad big ;)2) What is char.size?It'll be 1.can3) Does char[] differ from byte[] or is it a typedef?It differs in that it can be overloaded differently, and the compiler recognizes char[] as special when doing casts to other array types - itdo conversions between UTF-8 and UTF-16, for example.The semantics of casting (across all of D) needs to be nice and predictable. I'd hate to track down a bug because a cast that I thought was trivial turned out to allocate new memory and copy data around...Could arrays (or some types that want to have array-like behavior) have some semantics that distinguish between the memory layout and the array indexing and length? Another example of this comes up in sparse matrices, where you want to have an array-like thing that has a non-trivial memory layout. Perhaps not full-blown operator overloading for [] and .length, etc - but some kind of special syntax to differentiate between running around in the memory layout or running around in the "high-level interface".4) How does one get a UTF-16 encoding of a char[],At the moment, I'm thinking: wchar[] w; char[] c; w = cast(wchar[])c; to do a UTF-8 to UTF-16 conversion.or get the length,To get the length in bytes: c.length to get the length in USC-4 characters, perhaps: c.nchars ??or get the 5th character, or set the 5th character to a given unicode character (expressed in UTF-16, say)?Probably a library function.
Jan 18 2003
I've read through what I could find on the thread about char[] and I find myself disagreeing with the idea that char[n] should return the n'th byte, regardless of the width of a character. My reasons are simple. When I have an array of, say, ints, I don't expect that int[n] will give me the n'th byte of the array of numbers. I fully expect that the n'th integer will be what I get. I see no reason why this should not hold for arrays of characters. I do expect that there are times when it would be useful to access an array of TYPE (where TYPE is int, char, etc) at the byte level, but it strikes me that some interface between an array of TYPE elements and that array as an array of BYTE's (i.e. using the byte type) would be VERY USEFUL, and would address concerns in wanting to access characters in their raw byte form. Indexing of the equivalent of a byte pointer to a TYPE array, perhaps formulated in syntactic sugar, would achieve this. I would personally prefer a language-specific way to byte access an aggregate rather than use pointers to achieve what the language should provide anyway. Please note that the above statements stand REGARDLESS of the encoding chosen, be it UTF-8 or 16 or whatever.
Feb 05 2003
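The byte-level view Shannon asks for falls out naturally from array casts in present-day D; a tiny sketch showing the same dchar array read both as whole characters and as raw bytes. This is an illustration of the idea, not a proposal for new syntax.

    import std.stdio;

    void main()
    {
        dchar[] text = ['D', 'ä'];     // two characters, four bytes each

        // Element view: indexing works in whole characters.
        writeln(text[1]);               // ä

        // Byte view: the same memory reinterpreted as raw bytes.
        ubyte[] raw = cast(ubyte[]) text;
        writeln(raw.length);            // 8
        writeln(raw);                   // [68, 0, 0, 0, 228, 0, 0, 0] on little-endian
    }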
The solution here is to use a char *iterator* instead of using char *indexing*. char indexing will be very slow. char iteration will be very fast.

D needs a good iterator concept. It has a good array concept already, but arrays are not the solution to everything. For instance, serial input or output can't easily be indexed. You don't do: serial_port[47] = character; you do: serial_port.write(character). Those are like iterators (ok well at least in STL, input iterators and output iterators were part of the iterator family).

Sean

"Shannon Mann" <Shannon_member pathlink.com> wrote in message news:b1rb8q$5i7$1 digitaldaemon.com...
[snip]
Feb 05 2003
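A sketch of the iterator idea in present-day D: a tiny forward-only walker that steps through UTF-8 one character at a time, so iteration never pays the cost of random indexing. (Today's D gets much the same effect when you foreach over a string with a dchar loop variable.) The type and its methods are invented for illustration and assume well-formed input.

    import std.stdio;

    // Forward-only character iterator over UTF-8 bytes.
    struct Utf8Iter
    {
        const(char)[] s;
        size_t i;

        bool empty() const { return i >= s.length; }

        dchar front() const
        {
            uint b = s[i];
            if (b < 0x80) return cast(dchar) b;
            int extra = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1;
            uint c = b & (0x3F >> extra);
            foreach (k; 1 .. extra + 1)
                c = (c << 6) | (s[i + k] & 0x3F);
            return cast(dchar) c;
        }

        void popFront()
        {
            uint b = s[i];
            i += b < 0x80 ? 1 : b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : 2;
        }
    }

    void main()
    {
        for (auto it = Utf8Iter("fünf 元"); !it.empty(); it.popFront())
            writeln(it.front());        // one character per line
    }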