D - Unicode
- Scott Egan (2/2) Apr 14 2004 Would it have been better just to stick to Unicode internally, and left ...
- Ilya Minkov (7/9) Apr 14 2004 By Walter's convention, all char[] are UTF-8, and where the standard
- Scott Egan (5/14) Apr 14 2004 Fine, but UTF-8 sucks as about as much as ASN.1 - why not just stick wit...
- Sigbjørn Lund Olsen (20/22) Apr 14 2004 char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.
- Hauke Duden (3/8) Apr 14 2004 This is not correct. dchar is UTF-32 and wchar is UTF-16.
- Ben Hinkle (3/5) Apr 14 2004 heh. I can never remember which one is which either.
- Walter (3/4) Apr 14 2004 LOL! Wish I'd thought of that!
- Scott Egan (10/32) Apr 15 2004 Given the intent of D to maintain some of the low level 'system' capabil...
- Ben Hinkle (6/10) Apr 15 2004 Efficiency would depend on the application - one that copies
- Scott Egan (19/29) Apr 15 2004 I've done some more homework and have a few other points:
- Serge K (2/4) Apr 16 2004 Actually, it uses UTF-16.
- Sigbjørn Lund Olsen (18/40) Apr 18 2004 No, as far as I know UTF-8/UTF-16, it expects the "storage class" to be
- Ben Hinkle (6/8) Apr 14 2004 UTF-8 is a compromise between Unicode support and C's character model.
- J C Calvarese (6/17) Apr 14 2004 Since it has come up before, I've made a list of some of these threads:
Would it have been better just to stick to Unicode internally, and left any conversion to the IO classes?
Apr 14 2004
Scott Egan schrieb:

> Would it have been better just to stick to Unicode internally, and left any conversion to the IO classes?

By Walter's convention, all char[] are UTF-8, and where the standard library doesn't obey this and interprets it as ANSI/ASCII/whatever, that is to be considered a bug. He has stated it at least 10 times already. And char[] is to be the standard way of exchanging Unicode strings within D programmes. There are also dchar and wchar for other Unicode encodings.

-eye
Apr 14 2004
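To make the convention above concrete, here is a minimal sketch in present-day D (the 2004-era Phobos API differed in spots, and `string`/`wstring`/`dstring` are the later aliases for the three array types the post mentions). It shows the same text held in all three encodings, using `std.utf` for the conversions:

```d
import std.utf : toUTF16, toUTF32;

void main()
{
    // A string literal is UTF-8 code units (char[] / string).
    string s8 = "h\u00E9llo";    // "héllo": 'é' needs two UTF-8 bytes
    wstring s16 = toUTF16(s8);   // UTF-16 code units (wchar)
    dstring s32 = toUTF32(s8);   // UTF-32 code points (dchar)

    assert(s8.length  == 6);  // 6 bytes for 5 characters
    assert(s16.length == 5);  // every char here fits in one wchar
    assert(s32.length == 5);  // always one dchar per code point
}
```

Note that `.length` counts code units, not characters — which is exactly the distinction the rest of this thread argues about.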
Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with UCS-2 (UTF-16?), ie straight 16-bit chars?

"Ilya Minkov" <minkov cs.tum.edu> wrote in message news:c5jfuf$1ibt$1 digitaldaemon.com...

> Scott Egan schrieb:
>> Would it have been better just to stick to Unicode internally, and left any conversion to the IO classes?
>
> By Walter's convention, all char[] are UTF-8, and where the standard library doesn't obey this and interprets it as ANSI/ASCII/whatever, that is to be considered a bug. He has stated it at least 10 times already. And char[] is to be the standard way of exchanging Unicode strings within D programmes. There are also dchar and wchar for other Unicode encodings.
>
> -eye
Apr 14 2004
Scott Egan wrote:

> Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with UCS-2 (UTF-16?), ie straight 16bit chars?

char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

Of the three Unicode transformation formats, personally, I find UTF-8 and UTF-32 to be the most appealing, due to UTF-8's backwards compatibility with the widespread (if lacking) ASCII format, and UTF-32 because it's by far the most straightforward of the formats. UTF-16 is *not* straight 16bit chars: there are roughly 1 million defined Unicode code points, and you often need two UTF-16 shorts to map to a character. Only UTF-32 can offer a straightforward 'same sized block' character encoding within the Unicode standard, and personally, I can't see what UTF-16 has to offer compared to the two alternatives in the Unicode standard, aside from size savings in some languages. But, of course, UTF-8 (and probably UTF-32) 'suck' out of the box, don't they?

At any rate, the three formats seem to all be in use in different spheres of computing, and thus all have their proper place in a generic programming language. Furthermore, the standard library has (afaik - I haven't used them) functions for converting between the three formats.

Cheers,
Sigbjørn Lund Olsen
Apr 14 2004
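The surrogate-pair point above is easy to demonstrate. A minimal sketch in present-day D, using a character outside the Basic Multilingual Plane (U+1D11E, MUSICAL SYMBOL G CLEF) to show that UTF-16 is not fixed-width while UTF-32 is:

```d
void main()
{
    // U+1D11E lies outside the BMP, so UTF-16 must encode it
    // as a surrogate pair rather than a single 16-bit unit.
    string  u8  = "\U0001D11E";
    wstring u16 = "\U0001D11E";
    dstring u32 = "\U0001D11E";

    assert(u8.length  == 4);  // four UTF-8 bytes
    assert(u16.length == 2);  // two UTF-16 code units (surrogate pair)
    assert(u32.length == 1);  // one fixed-size UTF-32 code point
}
```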
Sigbjørn Lund Olsen wrote:

>> Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with UCS-2 (UTF-16?), ie straight 16bit chars?
>
> char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.

This is not correct. dchar is UTF-32 and wchar is UTF-16.

Hauke
Apr 14 2004
>> char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.
>
> This is not correct. dchar is UTF-32 and wchar is UTF-16.

heh. I can never remember which one is which either. How about changing dchar to wwchar for "weally wide char", which is scalable to any number of bytes - weally weally wide char, etc ;-)
Apr 14 2004
"Ben Hinkle" <bhinkle4 juno.com> wrote in message news:c5jun3$28gi$1 digitaldaemon.com...

> How about changing dchar to wwchar for "weally wide char",

LOL! Wish I'd thought of that!
Apr 14 2004
Given the intent of D to maintain some of the low level 'system' capability, I'd rather just use UTF-32 if it came down to it. The fixed size representation it offers has surely got to make dealing with strings more efficient and faster (and easier to muck around with). The various stream libraries could be left to take care of any necessary conversions.

However, that said, I'll drop it.

"Sigbjørn Lund Olsen" <sigbjorn lundolsen.net> wrote in message news:c5jlp2$1r51$1 digitaldaemon.com...

> Scott Egan wrote:
>> Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with UCS-2 (UTF-16?), ie straight 16bit chars?
>
> char[] for UTF-8, dchar[] for UTF-16, wchar[] for UTF-32.
>
> Of the three Unicode transformation formats, personally, I find UTF-8 and UTF-32 to be the most appealing, due to UTF-8's backwards compatibility with the widespread (if lacking) ASCII format, and UTF-32 because it's by far the most straightforward of the formats. UTF-16 is *not* straight 16bit chars: there are roughly 1 million defined Unicode code points, and you often need two UTF-16 shorts to map to a character. Only UTF-32 can offer a straightforward 'same sized block' character encoding within the Unicode standard, and personally, I can't see what UTF-16 has to offer compared to the two alternatives in the Unicode standard, aside from size savings in some languages. But, of course, UTF-8 (and probably UTF-32) 'suck' out of the box, don't they?
>
> At any rate, the three formats seem to all be in use in different spheres of computing, and thus all have their proper place in a generic programming language. Furthermore, the standard library has (afaik - I haven't used them) functions for converting between the three formats.
>
> Cheers,
> Sigbjørn Lund Olsen
Apr 15 2004
On Thu, 15 Apr 2004 19:26:47 +1000, "Scott Egan" <scotte tpg.com.aux> wrote:

> Given the intent of D to maintain some of the low level 'system' capability, I'd rather just use UTF-32 if it came down to it. The fixed size representation it offers has surely got to make dealing with strings more efficient and faster (and easier to muck around with).

Efficiency would depend on the application - one that copies strings a lot would slow down significantly, assuming most strings would fit "nicely" in UTF-16 or UTF-8. Walter's experience has been that more programs copy strings than index them as characters.
Apr 15 2004
I've done some more homework and have a few other points:

Walter's experience may be that programmers copy strings, but have you looked at the library lately? It's full of index work. BTW none of the string library is Unicode compatible; it just treats the char[] as arrays of single bytes (as my 'split' offering does ;). If char is supposed to be UTF-8 then the system needs to be aware of supplementary chars etc (doesn't it???) for correct word boundary and capitalisation efforts. I would also expect that it would be very easy to produce invalid Unicode streams with some of the functions.

And...

Why not use just the Basic Multilingual Plane (Plane 0) and code it as UCS-2? Or, given that the Unicode standard is 21 bits, just use the fixed width UTF-32?

Now I will shut up!

"Ben Hinkle" <bhinkle4 juno.com> wrote in message news:2tus70dgshsjb5seh2hcrfvl3raj2mui20 4ax.com...

> On Thu, 15 Apr 2004 19:26:47 +1000, "Scott Egan" <scotte tpg.com.aux> wrote:
>> Given the intent of D to maintain some of the low level 'system' capability, I'd rather just use UTF-32 if it came down to it. The fixed size representation it offers has surely got to make dealing with strings more efficient and faster (and easier to muck around with).
>
> Efficiency would depend on the application - one that copies strings a lot would slow down significantly, assuming most strings would fit "nicely" in UTF-16 or UTF-8. Walter's experience has been that more programs copy strings than index them as characters.
Apr 15 2004
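The worry above about producing invalid Unicode streams is real when code slices char[] by byte index. A minimal sketch in present-day D (std.utf in 2004-era Phobos had a similar but not identical API) showing how a byte-level slice can split a multi-byte sequence, and how `std.utf.validate` catches it:

```d
import std.utf : validate, UTFException;

void main()
{
    string s = "h\u00E9llo";    // 'é' is a two-byte UTF-8 sequence

    // Slicing by byte index can cut a multi-byte character in
    // half, yielding an invalid UTF-8 stream.
    auto broken = s[0 .. 2];    // 'h' plus the first byte of 'é'

    bool caught = false;
    try
        validate(broken);       // throws on malformed UTF-8
    catch (UTFException)
        caught = true;
    assert(caught);
}
```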
> Why not use just the Basic Multilingual Plane (Plane 0) and code it as UCS-2

Actually, it uses UTF-16. Just like Windows does nowadays.
Apr 16 2004
Scott Egan wrote:

> I've done some more homework and have a few other points:
>
> Walter's experience may be that programmers copy strings, but have you looked at the library lately? It's full of index work. BTW none of the string library is Unicode compatible; it just treats the char[] as arrays of single bytes (as my 'split' offering does ;). If char is supposed to be UTF-8 then the system needs to be aware of supplementary chars etc (doesn't it???) for correct word boundary and capitalisation efforts. I would also expect that it would be very easy to produce invalid Unicode streams with some of the functions.

No, as far as I know UTF-8/UTF-16, it expects the "storage class" to be containers of a certain bit width. That is, it does not expect 'char' to represent a character - it would be just part of a character. A more semantically correct name for 'char' would be 'utf8byte' but some would think that too wordy. Personally it's one of the first things I alias.

> And... Why not use just the Basic Multilingual Plane (Plane 0) and code it as UCS-2? Or, given that the Unicode standard is 21 bits, just use the fixed width UTF-32? Now I will shut up!

Sometimes space is a consideration. If I had a database of English text, let's say a couple of billion characters, well, I *know* I pretty much only need ASCII codes except in rare cases, and since I want to have as much of the database cached in memory at any given time to serve said English text faster, I would rather have the text encoded as UTF-8 than UTF-32. In many cases you'll find that a particular encoding may be more appropriate than another, even if several encodings are appealing in their design. D gives you choice, and that's good. I like to think that the programmer knows better than a language designer what she wishes to do.

Cheers,
Sigbjørn Lund Olsen
Apr 18 2004
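The point that 'char' is a code unit rather than a character shows up directly in D's foreach: iterating a char[] as char walks the UTF-8 bytes, while asking for dchar decodes whole code points on the fly. A minimal sketch in present-day D:

```d
void main()
{
    string s = "d\u00E9j\u00E0";    // "déjà": 4 characters, 6 bytes

    size_t units, points;
    foreach (char c; s)  ++units;   // steps through UTF-8 code units
    foreach (dchar c; s) ++points;  // decodes code points on the fly

    assert(units  == 6);  // 'é' and 'à' take two bytes each
    assert(points == 4);  // one dchar per character
}
```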
"Scott Egan" <scotte tpg.com.aux> wrote in message news:c5jhsa$1l7v$1 digitaldaemon.com...

> Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with UCS-2 (UTF-16?), ie straight 16bit chars?

UTF-8 is a compromise between Unicode support and C's character model. Unicode hasn't flared up on the newsgroup in a while so you might have to look back a while to find Walter's arguments for and against the various ideas.
Apr 14 2004
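The "compromise with C's character model" is visible in how cheaply a UTF-8 D string crosses into C: ASCII text is byte-for-byte valid UTF-8, so only NUL termination is needed. A minimal sketch in present-day D (`std.string.toStringz` and the C bindings are the modern names):

```d
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    // ASCII-range UTF-8 is unchanged from the C view of the bytes,
    // so a D string can be handed to C APIs once NUL-terminated.
    string s = "hello";
    const(char)* p = toStringz(s);
    assert(strlen(p) == 5);
}
```

A UTF-16 or UTF-32 internal representation would instead force a conversion at every C boundary.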
Ben Hinkle wrote:

> "Scott Egan" <scotte tpg.com.aux> wrote in message news:c5jhsa$1l7v$1 digitaldaemon.com...
>> Fine, but UTF-8 sucks about as much as ASN.1 - why not just stick with UCS-2 (UTF-16?), ie straight 16bit chars?
>
> UTF-8 is a compromise between Unicode support and C's character model. Unicode hasn't flared up on the newsgroup in a while so you might have to look back a while to find Walter's arguments for and against the various ideas.

Since it has come up before, I've made a list of some of these threads:
http://www.wikiservice.at/d/wiki.cgi?UnicodeIssues

-- 
Justin
http://jcc_7.tripod.com/d/
Apr 14 2004