digitalmars.D - char[] initialization
- Andrew Fedoniouk (14/14) Jul 29 2006 Could somebody shed light on the subject:
- kris (3/15) Jul 29 2006 Try google?
- Hasan Aljudy (12/31) Jul 29 2006 I don't understand why the compiler should initialize variables to
- Derek (9/43) Jul 29 2006 I believe that D's philosophy is that all datatypes are initialized to
- Hasan Aljudy (2/48) Jul 29 2006 I know .. I was asking "but why?" :(
- Robert Atkinson (9/61) Jul 29 2006 The intent I believe is to signal the programmer as soon as possible
- Hasan Aljudy (8/77) Jul 29 2006 Still missing my point.
- Carlos Santander (8/18) Jul 29 2006 The issue here is, a "reasonable valid default" will change from one app...
- Walter Bright (46/67) Jul 29 2006 That's right. Also, given:
- Andrew Fedoniouk (41/56) Jul 29 2006 Thanks, Kris.
- Carlos Santander (4/10) Jul 29 2006 But D's chars are UTF-8, not Latin-1 nor any other, so I don't think thi...
- Andrew Fedoniouk (15/24) Jul 29 2006 UTF-8 is a multibyte transport encoding of full 21-bit UNICODE codepoint...
- Carlos Santander (5/36) Jul 29 2006 My bad, then. I should've said char[] instead of char. Frits and Walter ...
- Frits van Bommel (16/37) Jul 29 2006 Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it...
- Andrew Fedoniouk (16/55) Jul 29 2006 Sorry but this is wrong. "UTF-8 codepoint" is a non-sense.
- Walter Bright (12/37) Jul 29 2006 "the value FFFF is guaranteed not to be a Unicode character at all"
- Andrew Fedoniouk (9/47) Jul 29 2006 1) What "UTF-8 character" means exactly?
- Walter Bright (8/22) Jul 29 2006 For an exact answer, the spec is: http://www.ietf.org/rfc/rfc3629.txt
- Andrew Fedoniouk (16/38) Jul 29 2006 Sorry but I understand what UCS character means
- Walter Bright (3/5) Jul 29 2006 This was all hashed out years ago. It's too late to start renaming basic...
- Andrew Fedoniouk (11/16) Jul 29 2006 I am not asking to rename anything.
- Unknown W. Brackets (7/29) Jul 29 2006 But even prior, this:
- Walter Bright (9/23) Jul 29 2006 char's have been initialized to 0xFF for years now, it was a bug that
- Unknown W. Brackets (41/101) Jul 29 2006 Andrew,
- Andrew Fedoniouk (68/106) Jul 29 2006 No objections with this.
- Walter Bright (8/20) Jul 29 2006 Pragmas are implementation defined behavior in C and C++, meaning they
- Andrew Fedoniouk (31/50) Jul 29 2006 What does it mean "UTF-8 ... supports ...every human language" ?
- Walter Bright (12/46) Jul 29 2006 I'm sure there are bugs in the library UTF-8 support. But they are bugs,...
- Andrew Fedoniouk (13/16) Jul 29 2006 Sorry but this is a bit optimistic.
- Walter Bright (17/34) Jul 29 2006 No matter, it is far easier to write a UTF-8 isword function than one
- Andrew Fedoniouk (40/74) Jul 29 2006 Sorry, did you try to write such a function (isword)?
- kris (2/108) Jul 29 2006
- Hasan Aljudy (3/12) Jul 30 2006 That's great, I'd be glad to help with anything if you need help with
- Walter Bright (38/98) Jul 30 2006 With code pages, it isn't so straightforward (especially if you've got
- Paolo Invernizzi (4/7) Jul 30 2006 LOL!!!
- John Reimer (5/12) Jul 30 2006 Okay, that clears things up. Now we know that UTF is a conspiracy for
- kris (2/22) Jul 30 2006 And created on the back of a napkin in a New Jersey diner ... way to go,...
- Unknown W. Brackets (14/26) Jul 30 2006 It's true that in HTML, attribute names were limited to a subset of
- Chris Miller (3/5) Jul 30 2006 Even body language? :)
- Unknown W. Brackets (85/249) Jul 29 2006 2. Sorry, an array of char (a single char is one single 8 bit octet)
- Andrew Fedoniouk (40/124) Jul 29 2006 "your definition is either lax or wrong"
- Unknown W. Brackets (28/32) Jul 29 2006 It really sounds to me like you're looking for UCS-2, then (e.g. as used...
- Andrew Fedoniouk (30/63) Jul 29 2006 Well, lets speak in terms of javascript if it is easier:
- Unknown W. Brackets (17/107) Jul 30 2006 Yes, you're right, most of the time I wouldn't (although a significant
- Bruno Medeiros (8/15) Jul 30 2006 Which, speaking of which, shouldn't that be a compile time error? The
- Unknown W. Brackets (6/22) Jul 30 2006 Eek! Yes, I would say (in my humble opinion) that this should be a
- Bruno Medeiros (16/24) Jul 30 2006 You mentioned "8-bit octet" repeatedly in various posts. That's
- Unknown W. Brackets (12/37) Jul 30 2006 I use that terminology because I've read too many RFCs (consider the FTP...
- Walter Bright (2/6) Jul 30 2006 I confess I often misuse the terminology.
- Derek (35/36) Jul 30 2006 Andrew and others,
- Walter Bright (9/36) Jul 30 2006 Thank you for the insightful summary of the situation.
- Unknown W. Brackets (6/14) Jul 30 2006 Indeed; this is the same situation as with XML transmission over the
- Oskar Linde (20/58) Jul 30 2006 Thank you for the clear summary.
- Bruno Medeiros (13/52) Jul 30 2006 Good summary. Additionally I'd like to say that, to hold 'KOI-8'
- Serg Kovrov (10/10) Jul 31 2006 Maybe I missed the point here, correct me if I misunderstood.
- Oskar Linde (20/30) Jul 31 2006 Having char[].length return something other than the actual number of
- Serg Kovrov (13/22) Jul 31 2006 Yes, I see. Thats why I do not like much char[] as substitute for string
- Frits van Bommel (5/11) Jul 31 2006 Store where? You can't put it in the array data itself without breaking
- Serg Kovrov (5/17) Jul 31 2006 Need to say that I no not have an idea where to store it, neither where
- Frits van Bommel (6/25) Jul 31 2006 The length is stored in the reference, but the character count would not...
- Hasan Aljudy (2/38) Jul 31 2006 I say this calls for a proper *standard* String class ...
- Oskar Linde (9/37) Jul 31 2006 The question is, how often do you need it? Especially if you are not
- Serg Kovrov (2/46) Jul 31 2006 You've got some valid points, I just showed mine.
- Walter Bright (2/12) Jul 31 2006 std.utf.toUCSindex(s, s.length) will also give the character count.
- Thomas Kuehne (14/35) Jul 31 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Andrew Fedoniouk (12/53) Jul 31 2006 Right, Thomas,
- Thomas Kuehne (17/64) Aug 02 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Andrew Fedoniouk (33/74) Jul 31 2006 Derek thanks for summarizing all this but I will put it as following.
- Walter Bright (33/51) Jul 31 2006 I disagree the characterization that it is "extremely difficult" to use
- Andrew Fedoniouk (38/89) Jul 31 2006 Sorry but strings in DMDScript are quite different in terms of
- Derek Parnell (25/48) Jul 31 2006 For what its worth, to do *character* manipulation I convert strings to
- Andrew Fedoniouk (13/57) Jul 31 2006 Derek, using dchar (ultimate char) is perfectly fine in DBuild(*)
- John Reimer (12/18) Jul 31 2006 Really, Andrew, you are getting carried away in your demands. You almos...
- Andrew Fedoniouk (9/30) Aug 01 2006 :D
- "Rémy J. A. Mouëza" (6/46) Aug 01 2006 As "dbile" in French, pronounced something like "day bill". One has to
- Walter Bright (40/110) Jul 31 2006 ECMAScript 262-3 (Javascript) defines the source character set to be
- Andrew Fedoniouk (40/151) Aug 01 2006 Walter, please, forget about such thing as "character set is UTF-16"
- Walter Bright (14/41) Aug 01 2006 encoded using UTF-16.
- Andrew Fedoniouk (39/80) Aug 01 2006 (Hope this long dialog will help all of us to better understand what UNI...
- Derek Parnell (33/55) Aug 01 2006 Andrew is correct. In UTF-16, characters are variable length, from 2 to ...
- Andrew Fedoniouk (4/57) Aug 01 2006 Yes, Derek, this will be probably near the ideal.
- Regan Heath (40/116) Aug 01 2006 Yet, I don't find it at all difficult to think of them like so:
- Andrew Fedoniouk (6/124) Aug 01 2006 Another option will be to change char.init to 0 and forget about the pro...
- Unknown W. Brackets (4/12) Aug 01 2006 I'm trying to understand why this 0 thing is such an issue. If your
- Andrew Fedoniouk (17/29) Aug 01 2006 Declaration of char.init == 0 pretty much means that
- Oskar Linde (10/30) Aug 02 2006 You mean data with other encodings that still want to use the std.string...
- Unknown W. Brackets (7/51) Aug 02 2006 I fail to understand why I want another ambiguous type in my
- Derek Parnell (16/19) Aug 01 2006 I think the issue is more that Andrew wants to have hex-FF as a legitima...
- Andrew Fedoniouk (14/18) Aug 02 2006 What does it mean uninitialized? They *are* initialized.
- Walter Bright (3/5) Aug 02 2006 Yes, I found two bugs in my own code with it that would have been hidden...
- Derek Parnell (40/58) Aug 02 2006 Andrew, I will assume you are not trying to be difficult but that maybe
- Derek Parnell (19/86) Aug 01 2006 Me too, but that's probably because I've not been immersed in C/C++ for ...
- Regan Heath (4/8) Aug 01 2006 Good point. I neglected to mention that.
- kris (7/16) Aug 01 2006 Sure, although char, utf8, utf16, utf32 are much better choices, IMHO :)
- Walter Bright (3/25) Aug 02 2006 If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid
- Derek Parnell (11/37) Aug 02 2006 Huh??? I said "UCS-2 is a subset of Unicode characters" Did you miss tha...
- Walter Bright (7/33) Aug 02 2006 I saw it, but that statement is not the same as "UCS-2 is a subset of
- Walter Bright (42/107) Aug 01 2006 The only thing that UTF-16 adds are semantics for characters that are
- Andrew Fedoniouk (27/80) Aug 01 2006 There is no such thing as surrogate pair in UCS-2.
- Unknown W. Brackets (37/39) Aug 01 2006 Andrew, I think there's a misunderstanding here. Perhaps it's a
- Andrew Fedoniouk (17/20) Aug 01 2006 Consider this:
- Oskar Linde (8/21) Aug 02 2006 Not really surprising. Had you compiled this in a C program (you are
- Derek Parnell (18/46) Aug 02 2006 No, not surprised, just wondering why you didn't code it correctly thoug...
- Unknown W. Brackets (13/45) Aug 02 2006 Why would I ever use strncat() in a D program?
- Unknown W. Brackets (2/53) Aug 02 2006
- kris (14/16) Aug 01 2006 Actually, it doesn't help at all, Andrew ~ some of it is thoroughly
- Bruno Medeiros (8/46) Aug 03 2006 Uh, the statement "BMP is a subset of UTF-16" means that you can read a
Could somebody shed light on the subject: according to http://digitalmars.com/d/type.html characters in D are initialized with the following values char -> 0xFF wchar -> 0xFFFF dchar -> 0x0000FFFF What is the idea of having strings initialized with a valid character code instead of 0? And that 0xFFFF... why was this special character (see Basic Multilingual Plane) selected? To avoid use of strcat & co. on D strings? (Sorry if it was discussed before) Andrew Fedoniouk. http://terrainformatica.com
Jul 29 2006
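For reference, a minimal D sketch demonstrating the default initializers under discussion; this assumes a DMD/Phobos of that era and just prints the numeric values from the spec page:

    import std.stdio;

    void main()
    {
        char c;   // no explicit initializer: gets the type's default
        wchar w;
        dchar d;
        writefln(cast(uint) c);  // 255   (0xFF)
        writefln(cast(uint) w);  // 65535 (0xFFFF)
        writefln(cast(uint) d);  // 65535 (0x0000FFFF)
    }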
Andrew Fedoniouk wrote:Could somebody shed light on the subject: According to http://digitalmars.com/d/type.html characters in D are getting initialized by following values char -> 0xFF wchar -> 0xFFFF dchar -> 0x0000FFFF what is the idea to have string initialized by valid character code instead of 0?Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
Jul 29 2006
kris wrote:Andrew Fedoniouk wrote:I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.Could somebody shed light on the subject: According to http://digitalmars.com/d/type.html characters in D are getting initialized by following values char -> 0xFF wchar -> 0xFFFF dchar -> 0x0000FFFF what is the idea to have string initialized by valid character code instead of 0?Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
Jul 29 2006
On Sat, 29 Jul 2006 06:29:21 -0600, Hasan Aljudy wrote:kris wrote:I believe that D's philosophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values. -- Derek Parnell Melbourne, Australia "Down with mediocrity!"Andrew Fedoniouk wrote:I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.Could somebody shed light on the subject: according to http://digitalmars.com/d/type.html characters in D are initialized with the following values char -> 0xFF wchar -> 0xFFFF dchar -> 0x0000FFFF What is the idea of having strings initialized with a valid character code instead of 0?Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
Jul 29 2006
Derek wrote:On Sat, 29 Jul 2006 06:29:21 -0600, Hasan Aljudy wrote:I know .. I was asking "but why?" :(kris wrote:I believe that D's philopsophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values.Andrew Fedoniouk wrote:I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.Could somebody shed light on the subject: According to http://digitalmars.com/d/type.html characters in D are getting initialized by following values char -> 0xFF wchar -> 0xFFFF dchar -> 0x0000FFFF what is the idea to have string initialized by valid character code instead of 0?Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
Jul 29 2006
Hasan Aljudy wrote:Derek wrote:The intent I believe is to signal the programmer as soon as possible showing they have missed something. In C/C++ an un-initialised variable can easily survive thousands of debug runs until it 'initialises' to a completely wrong value. Most often on a release build and a end-users system. Take floats. By starting at NaN, from the very start you'll know you missed initialising it. You'll catch the error earlier in your debug process.On Sat, 29 Jul 2006 06:29:21 -0600, Hasan Aljudy wrote:I know .. I was asking "but why?" :(kris wrote:I believe that D's philopsophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values.Andrew Fedoniouk wrote:I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.Could somebody shed light on the subject: According to http://digitalmars.com/d/type.html characters in D are getting initialized by following values char -> 0xFF wchar -> 0xFFFF dchar -> 0x0000FFFF what is the idea to have string initialized by valid character code instead of 0?Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
Jul 29 2006
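As a hedged illustration of the NaN argument: a forgotten float initialization stays visible because NaN propagates through arithmetic, whereas a silent 0 default would masquerade as valid data:

    import std.stdio;

    void main()
    {
        float balance;        // initialization forgotten: starts as NaN
        balance += 100.0f;    // NaN propagates through the arithmetic
        writefln(balance);    // prints "nan", so the mistake surfaces early
    }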
Robert Atkinson wrote:Hasan Aljudy wrote:Still missing my point. in C/C++ that's a problem because un-initialized variables carry garbage. in D, it's not; if you init them to a reasonable valid default, this problem won't exist anymore. If un-initializing is bad just for its own sake .. then the compiler should detect it and issue an error/warning, otherwise it should default to a reasonable valid value; in this case, zero for chars and floats.Derek wrote:The intent I believe is to signal the programmer as soon as possible showing they have missed something. In C/C++ an un-initialised variable can easily survive thousands of debug runs until it 'initialises' to a completely wrong value. Most often on a release build and a end-users system. Take floats. By starting at NaN, from the very start you'll know you missed initialising it. You'll catch the error earlier in your debug process.On Sat, 29 Jul 2006 06:29:21 -0600, Hasan Aljudy wrote:I know .. I was asking "but why?" :(kris wrote:I believe that D's philopsophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values.Andrew Fedoniouk wrote:I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.Could somebody shed light on the subject: According to http://digitalmars.com/d/type.html characters in D are getting initialized by following values char -> 0xFF wchar -> 0xFFFF dchar -> 0x0000FFFF what is the idea to have string initialized by valid character code instead of 0?Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
Jul 29 2006
Hasan Aljudy wrote:Still missing my point. in C/C++ that's a problem because un-initialized variables carry garbage. in D, it's not; if you init them to a reasonable valid default, this problem won't exist anymore. If un-initializing is bad just for its own sake .. then the compiler should detect it and issue an error/warning, otherwise it should default to a reasonable valid value; in this case, zero for chars and floats.The issue here is, a "reasonable valid default" will change from one app to the other, one function to the next, one variable to another, so the intention here is to force the developer to be explicit about his/her intentions. Walter has said in the past that if there was a NaN for int/long/etc, he'd use that instead of 0. -- Carlos Santander Bernal
Jul 29 2006
Carlos Santander wrote:Hasan Aljudy escribi:That's right. Also, given: int x; foo(x); it is impossible for the maintenance programmer to distinguish between: 1) x is meant to be 0 2) the original programmer forgot to initialize x to 3, and there's a bug in the program Ok, fine, so why doesn't the compiler just squawk about referencing uninitialized variables? Consider: int x; ... if (...) { x = 3; ... } ... if (...) { ... foo(x); } There is no way for the compiler to determine that x in foo(x) is always initialized. So it must assume otherwise, and squawk about it. So how does our harried programmer fix it? int x = some-random-value; ... if (...) { x = 3; ... } ... if (...) { ... foo(x); } The compiler is now happy, but pity the poor maintenance programmer. He notices the some-random-value, and wonders what that value means. He analyzes the code, and discovers that that value is never used. Was it intended to be used? Did some previous maintenance programmer break the code? What's going on here? My take on programming languages is that the semantics should have the obvious meaning - i.e. if the programmer initializes a variable to a value, that value should have meaning. He should not have to initialize a variable because of some subtle *side effect* such initialization has. Programmers should not be required to add dead assignments, unreachable code, etc., just to keep the compiler happy.Still missing my point. in C/C++ that's a problem because un-initialized variables carry garbage. in D, it's not; if you init them to a reasonable valid default, this problem won't exist anymore. If un-initializing is bad just for its own sake .. then the compiler should detect it and issue an error/warning, otherwise it should default to a reasonable valid value; in this case, zero for chars and floats.The issue here is, a "reasonable valid default" will change from one app to the other, one function to the next, one variable to another, so the intention here is force the developer to be explicit about his/her intentions. Walter has said in the past that if there was a NAN for int/long/etc, he'd use that instead of 0.
Jul 29 2006
"kris" <foo bar.com> wrote in message news:eaf9ei$2m7$1 digitaldaemon.com...Andrew Fedoniouk wrote:Thanks, Kris. To Walter: Following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html): "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character. This codepoint will remain forever unassigned, precisely so that it may be used for purposes such as this." is just wrong. 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from R-zone: {U+FFF0..U+FFFF} - region assigned already. 2) For char[] selection of 0xFF is wrong and even worse. For example character with code 0xFF in Latin-I encoding is "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point. For example in KOI-8 encoding 0xFF is officially assigned value. What is the point of current initializaton? If you are doing intialization already and this intialization is a part of specification so why not to use official "Nul" values in this case? You are doing the same for floats - you are using NaNs there (Null value for floats). Why not to use the same for chars? I think I understand your intention, 0xFF is sort of debug values in Visual C++: 0xCDCDCDCD - Allocated in heap, but not initialized 0xDDDDDDDD - Released heap memory. 0xFDFDFDFD - "NoMansLand" fences automatically placed at boundary of heap memory. Should never be overwritten. If you do overwrite one, you're probably walking off the end of an array. 0xCCCCCCCC - Allocated on stack, but not initialized but this is far from concept of null codepoint in character encodings. Andrew Fedoniouk. http://terrainformatica.comCould somebody shed light on the subject: According to http://digitalmars.com/d/type.html characters in D are getting initialized by following values char -> 0xFF wchar -> 0xFFFF dchar -> 0x0000FFFF what is the idea to have string initialized by valid character code instead of 0?Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
Jul 29 2006
Andrew Fedoniouk wrote:2) For char[] the selection of 0xFF is wrong and even worse. For example, the character with code 0xFF in Latin-1 encoding is "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point. For example, in KOI-8 encoding 0xFF is an officially assigned value.But D's chars are UTF-8, not Latin-1 nor any other, so I don't think this applies. -- Carlos Santander Bernal
Jul 29 2006
"Carlos Santander" <csantander619 gmail.com> wrote in message news:eagiip$1lad$3 digitaldaemon.com...Andrew Fedoniouk escribi:UTF-8 is a multibyte transport encoding of full 21-bit UNICODE codepoint. Strictly speaking single byte in UTF-8 sequence cannot be named as char[acter] char as typename implies that value of its type contains some complete codepoint (assumed that information about codepage is stored somewhere or is known at the point of use) I mean that "UTF-8 characrter" (if it makes any sense at all) as type is always char[] and not a single char. 0xFF as a char initialization value implies that D char is not supposed to handle single byte character encodings at all. Is this an original intention? Andrew Fedoniouk. http://terrainformatica.com2) For char[] selection of 0xFF is wrong and even worse. For example character with code 0xFF in Latin-I encoding is "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point. For example in KOI-8 encoding 0xFF is officially assigned value.But D's chars are UTF-8, not Latin-1 nor any other, so I don't think this applies.
Jul 29 2006
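A small sketch of the multibyte point: one non-ASCII code point occupies several D chars, while a single dchar holds the complete code point:

    void main()
    {
        char[] s = "é";         // one code point (U+00E9), two UTF-8 bytes
        assert(s.length == 2);  // .length counts octets, not characters
        dchar d = 'é';          // a dchar holds the whole code point
        assert(d == 0x00E9);
    }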
Andrew Fedoniouk wrote:"Carlos Santander" <csantander619 gmail.com> wrote in message news:eagiip$1lad$3 digitaldaemon.com...My bad, then. I should've said char[] instead of char. Frits and Walter wrote better responses, anyway, so I'll leave this as is. -- Carlos Santander BernalAndrew Fedoniouk wrote:UTF-8 is a multibyte transport encoding of full 21-bit Unicode code points. Strictly speaking, a single byte in a UTF-8 sequence cannot be called a char[acter]. char as a type name implies that a value of this type contains some complete code point (assuming that information about the codepage is stored somewhere or is known at the point of use). I mean that a "UTF-8 character" (if it makes any sense at all) as a type is always char[] and not a single char. 0xFF as a char initialization value implies that D's char is not supposed to handle single-byte character encodings at all. Is this the original intention? Andrew Fedoniouk. http://terrainformatica.com2) For char[] the selection of 0xFF is wrong and even worse. For example, the character with code 0xFF in Latin-1 encoding is "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point. For example, in KOI-8 encoding 0xFF is an officially assigned value.But D's chars are UTF-8, not Latin-1 nor any other, so I don't think this applies.
Jul 29 2006
Andrew Fedoniouk wrote:To Walter: The following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html): "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character. This codepoint will remain forever unassigned, precisely so that it may be used for purposes such as this." is just wrong. 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from R-zone: {U+FFF0..U+FFFF} - a region already assigned.Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it forms the subrange of the "Noncharacters" (see http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are "intended for process internal uses, but are not permitted for interchange". 0xFFFF specifically is marked "<not a character> - the value FFFF is guaranteed not to be a Unicode character at all". So yes, it's assigned - for exactly such a purpose as D is using it for :).2) For char[] the selection of 0xFF is wrong and even worse. For example, the character with code 0xFF in Latin-1 encoding is "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point. For example, in KOI-8 encoding 0xFF is an officially assigned value.First of all, non-Unicode encodings are irrelevant. 'char' is a UTF-8 codepoint (I think that's the correct term). It's not a Unicode character (though some Unicode characters are encoded as a single UTF-8 codepoint, specifically anything up to 0x80 IIRC). 0xFF is indeed a valid Unicode character, but that doesn't mean that character is encoded as a byte with value 0xFF in UTF-8 (which char[]s represent). 0xFF is in fact one of the byte values that *cannot* occur in a valid UTF-8 text.
Jul 29 2006
"Frits van Bommel" <fvbommel REMwOVExCAPSs.nl> wrote in message news:eagjcd$1m1t$1 digitaldaemon.com...Andrew Fedoniouk wrote:Sorry but this is wrong. "UTF-8 codepoint" is a non-sense. In common practice Code Point is a: (1) A numerical index (or position) in an encoding table used for encoding characters. (2) Synonym for Unicode scalar value. As rule one code point represented by single glyph while represented to human.To Walter: Following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html): "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character. This codepoint will remain forever unassigned, precisely so that it may be used for purposes such as this." is just wrong. 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from R-zone: {U+FFF0..U+FFFF} - region assigned already.Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it forms the subrange of the "Noncharacters" (see http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are "intended for process internal uses, but are not permitted for interchange". 0xFFFF specifically is marked "<not a character> - the value FFFF if guaranteed not to be a Unicode character at all". So yes, it's assigned - for exactly such a purpose as D is using it for :).2) For char[] selection of 0xFF is wrong and even worse. For example character with code 0xFF in Latin-I encoding is "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point. For example in KOI-8 encoding 0xFF is officially assigned value.First of all, non-Unicode encodings are irrelevant. 'char' is a UTF-8 codepoint (I think that's the correct term).It's not a Unicode character (though some Unicode characters are encoded as a single UTF-8 codepoint, specifically anything up to 0x80 IIRC). 0xFF is indeed a valid Unicode character, but that doesn't mean that character is encoded as a byte with value 0xFF in UTF-8 (which char[]s represent). 0xFF is in fact one of the byte values that *cannot* occur in a valid UTF-8 text.Sorry, but element of UTF-8 encoded sequence is a byte (octet) and not a char. char as a type historically means type for storing character code points. 0xFF is assigned and legal value in many encodings. Either use different name for this "D char" - let's say utf8byte or use char in the meaning "code point value" - thus initialize it by NUL value common for all known encodings. Andrew Fedoniouk. http://terrainformatica.com
Jul 29 2006
Andrew Fedoniouk wrote:Following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html): "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character. This codepoint will remain forever unassigned, precisely so that it may be used for purposes such as this." is just wrong. 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from R-zone: {U+FFF0..U+FFFF} - region assigned already."the value FFFF is guaranteed not to be a Unicode character at all" http://www.unicode.org/charts/PDF/UFFF0.pdf2) For char[] selection of 0xFF is wrong and even worse. For example character with code 0xFF in Latin-I encoding is "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point. For example in KOI-8 encoding 0xFF is officially assigned value.char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode U00FF is not encoded into UTF-8 as FF. "The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txtWhat is the point of current initializaton?The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.If you are doing intialization already and this intialization is a part of specification so why not to use official "Nul" values in this case?Because 0 is a valid UTF-8 character.You are doing the same for floats - you are using NaNs there (Null value for floats). Why not to use the same for chars?The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.
Jul 29 2006
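A minimal sketch of the "flush out errors" behavior, assuming the D1-era std.utf with its UtfException: an untouched char[] fails UTF-8 validation immediately, because 0xFF can never occur in well-formed UTF-8:

    import std.stdio;
    import std.utf;

    void main()
    {
        char[] s = new char[4];  // every element starts as char.init == 0xFF
        try
            validate(s);         // throws: 0xFF is not valid in UTF-8
        catch (UtfException e)
            writefln("uninitialized data caught");
    }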
"Walter Bright" <newshound digitalmars.com> wrote in message news:eagk1o$1mph$1 digitaldaemon.com...Andrew Fedoniouk wrote:1) What "UTF-8 character" means exactly? 2) In ASCII char(0) is officially NUL. Why not to initialize strings by null?Following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html): "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character. This codepoint will remain forever unassigned, precisely so that it may be used for purposes such as this." is just wrong. 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from R-zone: {U+FFF0..U+FFFF} - region assigned already."the value FFFF is guaranteed not to be a Unicode character at all" http://www.unicode.org/charts/PDF/UFFF0.pdf2) For char[] selection of 0xFF is wrong and even worse. For example character with code 0xFF in Latin-I encoding is "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point. For example in KOI-8 encoding 0xFF is officially assigned value.char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode U00FF is not encoded into UTF-8 as FF. "The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txtWhat is the point of current initializaton?The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.If you are doing intialization already and this intialization is a part of specification so why not to use official "Nul" values in this case?Because 0 is a valid UTF-8 character.I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?' Are you saying that I cannot use char[] to represen russian text in D? Andrew Fedoniouk. http://terrainformatica.comYou are doing the same for floats - you are using NaNs there (Null value for floats). Why not to use the same for chars?The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.
Jul 29 2006
Andrew Fedoniouk wrote:For an exact answer, the spec is: http://www.ietf.org/rfc/rfc3629.txt There isn't much to it.1) What "UTF-8 character" means exactly?What is the point of current initializaton?The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.If you are doing intialization already and this intialization is a part of specification so why not to use official "Nul" values in this case?Because 0 is a valid UTF-8 character.2) In ASCII char(0) is officially NUL. Why not to initialize strings by null?Because 0 characters are valid UTF-8 values. By using an invalid UTF-8 value, we can flush out bugs from uninitialized data.I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?' Are you saying that I cannot use char[] to represen russian text in D?char[] is for UTF-8 encoded text only. For other encoding systems, use ubyte[]. But rest assured that Russian (and every other language) has a defined encoding in UTF-8, which is why it was selected for D.
Jul 29 2006
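A hedged sketch of the ubyte[] approach: keep KOI-8R text as ubyte[] and convert it to UTF-8 char[] at the boundary. The koi8High table is hypothetical and would have to be filled in with the 128 Unicode code points of the KOI8-R upper half before use:

    import std.utf;

    // Hypothetical mapping for KOI8-R bytes 0x80..0xFF; must be populated
    // with the corresponding Unicode code points before use.
    dchar[128] koi8High;

    char[] fromKoi8(ubyte[] src)
    {
        char[] result;
        foreach (ubyte b; src)
        {
            dchar d = (b < 0x80) ? b : koi8High[b - 0x80];
            encode(result, d);  // appends the UTF-8 encoding of d
        }
        return result;
    }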
"Walter Bright" <newshound digitalmars.com> wrote in message news:eagmrk$1pn9$1 digitaldaemon.com...Andrew Fedoniouk wrote:Sorry but I understand what UCS character means but what exactly is "UTF-8 character" you are using? Is this 1) a single octet in UTF-8 sequence or 2) is a sequence of octets representing one unicode character (21 bit value)For an exact answer, the spec is: http://www.ietf.org/rfc/rfc3629.txt There isn't much to it.1) What "UTF-8 character" means exactly?What is the point of current initializaton?The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.If you are doing intialization already and this intialization is a part of specification so why not to use official "Nul" values in this case?Because 0 is a valid UTF-8 character.Oh.... 0 as a value of UTF-8 octet can represent only single value character with codepoint 0x00000000. In plain English: UTF-8 encoded strings cannot contain zeros in the middle.2) In ASCII char(0) is officially NUL. Why not to initialize strings by null?Because 0 characters are valid UTF-8 values. By using an invalid UTF-8 value, we can flush out bugs from uninitialized data.Sorry but char[acter] in plain english means character - index of some human readable glyph in some table like ASCII, KOI-8, MAC-ASCII, whatever. Element of UTF-8 sequence is an octet. I think you should rename 'char' type to 'octet' if D/Phobos intended to support only UTF-8. Andrew.I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?' Are you saying that I cannot use char[] to represen russian text in D?char[] is for UTF-8 encoded text only. For other encoding systems, use ubyte[]. But rest assured that Russian (and every other language) has a defined encoding in UTF-8, which is why it was selected for D.
Jul 29 2006
Andrew Fedoniouk wrote:Element of UTF-8 sequence is an octet. I think you should rename 'char' type to 'octet' if D/Phobos intended to support only UTF-8.This was all hashed out years ago. It's too late to start renaming basic types.
Jul 29 2006
"Walter Bright" <newshound digitalmars.com> wrote in message news:eagufo$2knt$1 digitaldaemon.com...Andrew Fedoniouk wrote:I am not asking to rename anything. Could you please just remove this weird 0xFF initialization for char arrays? ( as it was prior to .162 buld ) This is the whole point. If you will do this then current char type can be used for representation of single byte encodings as it stands - character. Andrew Fedoniouk. http://terrainformatica.comElement of UTF-8 sequence is an octet. I think you should rename 'char' type to 'octet' if D/Phobos intended to support only UTF-8.This was all hashed out years ago. It's too late to start renaming basic types.
Jul 29 2006
But even prior, this: char c; writefln(cast(size_t) c); Would have given you 255, not 0. This has been true for quite some time. The fact that it did not happen for arrays in the same way was, as far as I know, a bug. Actually, I didn't even realize that got fixed. -[Unknown]"Walter Bright" <newshound digitalmars.com> wrote in message news:eagufo$2knt$1 digitaldaemon.com...Andrew Fedoniouk wrote:I am not asking to rename anything. Could you please just remove this weird 0xFF initialization for char arrays? ( as it was prior to .162 buld ) This is the whole point. If you will do this then current char type can be used for representation of single byte encodings as it stands - character. Andrew Fedoniouk. http://terrainformatica.comElement of UTF-8 sequence is an octet. I think you should rename 'char' type to 'octet' if D/Phobos intended to support only UTF-8.This was all hashed out years ago. It's too late to start renaming basic types.
Jul 29 2006
Andrew Fedoniouk wrote:"Walter Bright" <newshound digitalmars.com> wrote in message news:eagufo$2knt$1 digitaldaemon.com...Ok, but you did say "I think you should rename..." <g>Andrew Fedoniouk wrote:I am not asking to rename anything.Element of UTF-8 sequence is an octet. I think you should rename 'char' type to 'octet' if D/Phobos intended to support only UTF-8.This was all hashed out years ago. It's too late to start renaming basic types.Could you please just remove this weird 0xFF initialization for char arrays? ( as it was prior to .162 buld )char's have been initialized to 0xFF for years now, it was a bug that some array initializations didn't do it.This is the whole point. If you will do this then current char type can be used for representation of single byte encodings as it stands - character.? I don't understand what's standing in the way of that now. And values from 0..7F are single byte UTF-8 encodings and can be stored in a char. BTW, you can do this: typedef char mychar = 0; mychar[] a = new mychar[100]; // mychar[] will be initialized to 0
Jul 29 2006
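Building on the typedef above, a hedged sketch of how one could get a single-byte-encoding string type with a 0 default today; the name koi8char is made up for illustration:

    // A distinct array element type for single-byte encoded text that
    // default-initializes to 0 instead of char's 0xFF.
    typedef ubyte koi8char = 0;

    void main()
    {
        koi8char[] s = new koi8char[16];  // all elements start at 0
        s[0] = cast(koi8char) 0xFF;       // fine: 0xFF is a legal KOI8-R code
        assert(s[1] == 0);
    }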
Andrew, I think it will make a lot more sense if you keep these things in mind... (I'm sure you already know all of them, I'm just listing them out since they're crucial and must be thought of together): 1. char, wchar, and dchar are separate types. 2. char contains UTF-8 bytes. It may not contain UTF-16, UCS-2, KOI-8R, or any other encoding. It must contain UTF-8. 3. wchar contains UTF-16. It is similar to char in every other way (may not contain any other encoding than UTF-16, not even UCS-2.) 4. dchar contains UTF-32 code points. It may not contain any other sort of encoding, again. 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use ubyte/byte or some other method. It is not valid to use char. 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 string. Since char can only contain UTF-8 strings, it represents invalid data if it contains such an 8-bit octet. 7. Code points are the characters in Unicode; they are "compressed", so to speak, in encodings such as UTF-8 and UTF-16. UCS-2 and UCS-4 (UTF-32) contain full code points. 8. If you were to examine the bytes in a wchar string, it may be possible that the 8-bit octet sequence "FF" might show up. Nonetheless, since char cannot be used for UTF-16, this doesn't matter. 9. For the above reason, wchar (UTF-16) uses FFFF. This character is similar to FF for UTF-8. Given the above, I think I might answer your questions: 1. UTF-8 character here could mean an 8-bit octet or code point. In this case, they are both the same and represent a perfectly valid character in a string. 2. ASCII does not matter; char is not ASCII. It happens that ASCII bytes 0 to 127 correspond to the same code points in Unicode, and the same characters in UTF-8. 3. It does not matter; KOI-8R encoded strings should not be placed in char arrays. You should use UTF-8 or another encoding for your Russian text. 4. If you wish to use KOI-8R (or any other encoding not based on Unicode) you should not be using char arrays, which are meant for Unicode-related encodings only. Obviously this is far different from C, but that's the good thing about D in many ways ;). Thanks, -[Unknown]"Walter Bright" <newshound digitalmars.com> wrote in message news:eagk1o$1mph$1 digitaldaemon.com...Andrew Fedoniouk wrote:1) What does "UTF-8 character" mean exactly? 2) In ASCII char(0) is officially NUL. Why not initialize strings with NUL?The following assumption ( http://www.digitalmars.com/d/archives/digitalmars/D/3239.html): "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character. This codepoint will remain forever unassigned, precisely so that it may be used for purposes such as this." is just wrong. 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from R-zone: {U+FFF0..U+FFFF} - a region already assigned."the value FFFF is guaranteed not to be a Unicode character at all" http://www.unicode.org/charts/PDF/UFFF0.pdf2) For char[] the selection of 0xFF is wrong and even worse. For example, the character with code 0xFF in Latin-1 encoding is "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point. For example, in KOI-8 encoding 0xFF is an officially assigned value.char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode U00FF is not encoded into UTF-8 as FF. "The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txtWhat is the point of the current initialization?The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.If you are doing initialization already, and this initialization is a part of the specification, why not use official "Nul" values in this case?Because 0 is a valid UTF-8 character.I don't get it, sorry. In KOI-8R (Russian) encoding 0xFF is the letter 'Ъ'. Are you saying that I cannot use char[] to represent Russian text in D? Andrew Fedoniouk. http://terrainformatica.comYou are doing the same for floats - you are using NaNs there (the Null value for floats). Why not use the same for chars?The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.
Jul 29 2006
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message news:eagn4d$1q1t$1 digitaldaemon.com...Andrew, I think it will make a lot more sense if you keep these things in mind... (I'm sure you already know all of them, I'm just listing them out since they're crucial and must be thought of together): 1. char, wchar, and dchar are separate types.No objections with this.2. char contains UTF-8 bytes. It may not contain UTF-16, UCS-2, KOI-8R, or any other encoding. It must contain UTF-8.Sorry but plural form "char contains UTF-8 bytes" is wrong. What you think char means: 1) char is an octet (byte) - member of utf-8 sequence -or- 2) char is code point of some character in some character table. ? Probably I am treating English too literally but char(acter) is not an UTF-8 byte. And never was. char is an index of some glyph in some encoding table. This is common definition used everywhere.3. wchar contains UTF-16. It is similar to char in every other way (may not contain any other encoding than UTF-16, not even UCS-2.)What is wchar (uint16) for you: 1) wchar as is an index of a Unicode scalar value in Basic Multilingual Plane (BMP) -or- 2) is a uint16 value - member of UTF-16 sequence. ?4. dchar contains UTF-32 code points. It may not contain any other sort of encoding, again.Oh..... UTF-32 (as any other utfs) is a transformation format - group name of two different encodings UTF-32BE and UTF-32LE. UTF-32 code point is a non-sense. UTF-32 defines of how to encode Unicode code point in again sequence of four bytes - octets. I would define this thing as dchar ( better name is uchar ) is type for representing full set of Unicode Code Points (21bit value). Pleas note: "transformation format" (UTF) is not by any means a "manipulation format". Representation of text in memory suitable for manipulation (e.g. text processing) is different as rule. You cannot use utf-8 encoded russian text for analysis. No way.5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use ubyte/byte or some other method. It is not valid to use char.Vice versa. For utf-8 encoded strings you should use byte[] and for strings using single byte encodings you should use char.6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 string. Since char can only contain UTF-8 strings, it represents invalid data if it contains such an 8-bit octet.No objections with that, for UTF-8 octet sequences 0xFF is invalid value of octet in the sequence. But please note: in the sequence of octets.7. Code points are the characters in Unicode; they are "compressed", so to speak, in encodings such as UTF-8 and UTF-16. USC-2 and USC-4 (UTF-32) contain full code points.Sorry, but USC-4 *is not* UTF-32 http://www.unicode.org/reports/tr19/tr19-9.html I will ask again: What: char c = 'a'; means for you? And following in C/C++: #pragma(encoding,"KOI-8R") char c = '?'; ?8. If you were to examine the bytes in a wchar string, it may be possible that the 8-bit octet sequence "FF" might show up. Nonetheless, since char cannot be used for UTF-16, this doesn't matter.Not clear what you mean here. Could you clarify? Especially last statement.9. For the above reason, wchar (UTF-16) uses FFFF. This character is similar to FF for UTF-8. Given the above, I think I might answer your questions: 1. UTF-8 character here could mean an 8-bit octet of code point. In this case, they are both the same and represent a perfectly valid character in a string.Sorry I am not buying following: "UTF-8 character" and "8-bit octet of code point"2. 
ASCII does not matter; char is not ASCII. It happens that ASCII bytes 0 to 127 correspond to the same code points in Unicode, and the same characters in UTF-8."ASCII does not matter"... for whom?3. It does not matter; KOI-8R encoded strings should not be placed in char arrays. You should use UTF-8 or another encoding for your Russian text."You should use UTF-8 or another encoding for your Russian text." Thanks. Advice from my side: Let me know when you will visit Russia. I will ask representatives of russian developer community and web authors to meet you. Advice per se: You should wear a helmet.4. If you wish to use KOI-8R (or any other encoding not based on Unicode) you should not be using char arrays, which are meant for Unicode-related encodings only.The same advice as above.Obviously this is by far different from C, but that's the good thing about D in many ways ;).In Israel they have an old saying: "Not a human for Saturday but Saturday for human". I do have practical experience in writnig text processing software in encodings other than "US-ASCII" and have heard your advices about UTF-8 usage with interest. Please don't take all of this personal - no intention to harm anybody. Honestly and with smile :) Andrew.
Jul 29 2006
Andrew Fedoniouk wrote:I will ask again: What: char c = 'a'; means for you? And following in C/C++: #pragma(encoding,"KOI-8R") char c = '?'; ?Pragmas are implementation defined behavior in C and C++, meaning they are unportable and rather useless. Not only that, char's themselves are implementation defined, and so it is very difficult to write portable code that deals with anything other than a-zA-Z0-9 and a few other characters. In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.
Jul 29 2006
"Walter Bright" <newshound digitalmars.com> wrote in message news:eagut9$2l96$1 digitaldaemon.com...Andrew Fedoniouk wrote:What does it mean "UTF-8 ... supports ...every human language" ? It allows to encode - yes. But in runtime support means quite different thing and I am pretty sure you know what I mean here. In Java as we know UTF-8 is used for representing string literals inside .class files but being loaded they became vectors of Java chars - unicode BMP codepoints (ushort). And this serves almost all character cases. Exceptions like: it is not trivial to do effectively processing of single byte encoded things there - you need to rewrite the whole set of functions to handle this. Please don't think that UTF-8 is a panacea. For example in China they use GB2312 encoding to represent almost 7000 Chinese characters in active use now. This is strictly 2 bytes enconding and don't even try to ask them to switch to UTF-8 (3 bytes as a rule). This will increase their internet traffic by 1/3. Same apply to Europe. E.g. in Russia there are 32 characters in alphabet and it is just enough to have one byte encoding for English/Russian text. It makes no sense to send over the wire two bytes (russian in utf-8) instead of one for the sites like lib.ru. Sorry but guys are paying there for each byte downloaded from Internet. This apply to almost all countries except of US and Canada. Andrew Fedoniouk. http://terrainformatica.comI will ask again: What: char c = 'a'; means for you? And following in C/C++: #pragma(encoding,"KOI-8R") char c = '?'; ?Pragmas are implementation defined behavior in C and C++, meaning they are unportable and rather useless. Not only that, char's themselves are implementation defined, and so it is very difficult to write portable code that deals with anything other than a-zA-Z0-9 and a few other characters. In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.
Jul 29 2006
Andrew Fedoniouk wrote:We both know what UTF-8 is and does.In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.What does it mean "UTF-8 ... supports ...every human language" ? It allows to encode - yes.But in runtime support means quite different thing and I am pretty sure you know what I mean here.I'm sure there are bugs in the library UTF-8 support. But they are bugs, are fixable, and not fundamental problems. As you find any, please post them to bugzilla.In Java as we know UTF-8 is used for representing string literals inside .class files but being loaded they became vectors of Java chars - unicode BMP codepoints (ushort). And this serves almost all character cases. Exceptions like: it is not trivial to do effectively processing of single byte encoded things there - you need to rewrite the whole set of functions to handle this. Please don't think that UTF-8 is a panacea.I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.For example in China they use GB2312 encoding to represent almost 7000 Chinese characters in active use now. This is strictly 2 bytes enconding and don't even try to ask them to switch to UTF-8 (3 bytes as a rule). This will increase their internet traffic by 1/3. Same apply to Europe. E.g. in Russia there are 32 characters in alphabet and it is just enough to have one byte encoding for English/Russian text. It makes no sense to send over the wire two bytes (russian in utf-8) instead of one for the sites like lib.ru. Sorry but guys are paying there for each byte downloaded from Internet. This apply to almost all countries except of US and Canada.If one needs to use a custom encoding, use ubyte[] or ushort[]. If one needs to be universal, use char[], wchar[], or dchar[]. And for what it's worth, D isn't a web transmission protocol. I don't see any problem with a D program converting its input from Format X to UTF for internal processing, and then converting its output back to X or Y or Z.
Jul 29 2006
Sorry, but this is a bit optimistic. D/samples/wc.exe out of the box will fail on Russian texts. It will fail on almost all Eastern texts, even if they are in UTF-8 encoding. The meaning of 'word' is different there. The statement "string literals in D are only UTF-8 encoded" is not conceptually better than "string literals in C are encoded using the codepage defined by pragma(codepage,...)". The same, by the way, applies to most Java compilers - they accept texts in various single-byte encodings. (Why *I* am telling this to *you*? :-) Andrew.Please don't think that UTF-8 is a panacea.I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
Jul 29 2006
Andrew Fedoniouk wrote:No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.Sorry but this is a bit optimistic. D/samples/wc.exe from the box will fail on russian texts. It will fail on almost all Eastern texts. Even they will be in UTF-8 encoding. Meaning of 'word' is different there.Please don't think that UTF-8 is a panacea.I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.Having statement "string literals in D are only UTF-8 encoded" is not conceptually better than "string literals in C are encoded by using codepage defined by pragma(codepage,...)".It is conceptually better because UTF-8 is completely defined and covers all human languages. Codepages are not completely defined, do not cover asian languages, rely on non-standard compiler extensions, and in fact you cannot even rely on *ASCII* being supported by any particular C or C++ compiler. (It could be EBCDIC or any encoding invented by the compiler vendor.) Code pages have another disastrous problem - it's impossible to mix languages. I have an academic text in front of me written in a mixture of german, french, and latin. How's that going to work with code pages? Code pages are obsolete yesterday's technology, and I'm not sorry to see them go.Same by the way applied to most of Java compilers they accepts texts in various singlebyte encodings. (Why *I* am telling this to *you*? :-)The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
Jul 29 2006
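A hedged sketch of the isword point: with a dchar loop variable, foreach decodes the UTF-8 sequence for you, and std.uni.isUniAlpha gives a cross-language letter test. (This covers letters only, not full word-boundary rules, which is Andrew's caveat.)

    import std.uni;  // isUniAlpha: Unicode alphabetic classification

    // Count words in UTF-8 text; the dchar loop variable makes D
    // decode each UTF-8 sequence into a full code point.
    int wordCount(char[] text)
    {
        int words = 0;
        bool inWord = false;
        foreach (dchar c; text)
        {
            if (isUniAlpha(c))
            {
                if (!inWord) { inWord = true; words++; }
            }
            else
                inWord = false;
        }
        return words;
    }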
"Walter Bright" <newshound digitalmars.com> wrote in message news:eah9st$2v1o$1 digitaldaemon.com...Andrew Fedoniouk wrote:Sorry, did you try to write such a function (isword)? (You need the whole set of character classification tables to accomplish this - utf-8 will not help you)No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.Sorry but this is a bit optimistic. D/samples/wc.exe from the box will fail on russian texts. It will fail on almost all Eastern texts. Even they will be in UTF-8 encoding. Meaning of 'word' is different there.Please don't think that UTF-8 is a panacea.I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.I am not saying that you shall avoid use of UTF-8 encoding. If you have mix of say english, russian and chinese on some page the only way to deliver this to the user is to use some (universal) unicode transport encoding. But to render this thing on the screen is completely different story. Consider this: attribute names in html (sgml) represented by ascii codes only - you don't need utf-8 processing to deal with them at all. You also cannot use utf-8 for storing attribute values generally speaking. Attribute values participate in CSS selector analysis and some selectors require char by char (char as a code point and not a D char) access. There are only few academic cases where you can use utf-8 literally (as a sequence of utf-8 bytes) *in runtime*. D source code compilation is one of such things - you can store content of string literals in utf-8 form - you don't need to analyze their content.Having statement "string literals in D are only UTF-8 encoded" is not conceptually better than "string literals in C are encoded by using codepage defined by pragma(codepage,...)".It is conceptually better because UTF-8 is completely defined and covers all human languages. Codepages are not completely defined, do not cover asian languages, rely on non-standard compiler extensions, and in fact you cannot even rely on *ASCII* being supported by any particular C or C++ compiler. (It could be EBCDIC or any encoding invented by the compiler vendor.) Code pages have another disastrous problem - it's impossible to mix languages. I have an academic text in front of me written in a mixture of german, french, and latin. How's that going to work with code pages?Code pages are obsolete yesterday's technology, and I'm not sorry to see them go.Sorry but US is the first country which will ask "what a ...?" on demand to send always four bytes instead of one. UTF-8 encoding is "traffic friendly" only for 1/10 of population on the Earth (English speaking people). Others just don't want to pay that price. Sorry you or not sorry it is irrelevant for code pages existence. They will be forever untill all of us will not speak on Esperanto. ( Currently I am doing right-to-left support in the engine - Arabic and Hebrew - trust me - probably I have more things to say "sorry" about )Walter, where did you get that magic UTF-16 ? Doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html mentions that input of Java compiler is sequence of Unicode (Code Points). And how this input sequence is encoded, utf-8, utf-16, koi8r - it does not matter at all and spec is silent about this - human is in its rights to choose encoding his/her terminal/keyboard supports. Andrew Fedoniouk. 
http://terrainformatica.comSame by the way applied to most of Java compilers they accepts texts in various singlebyte encodings. (Why *I* am telling this to *you*? :-)The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
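For what it's worth, a hedged sketch of the char-by-char access Andrew is
asking for, assuming the std.utf conversion functions of the era (the
function and variable names are invented for the example): transcode the
UTF-8 to UTF-32 once, index freely, and re-encode for transport.

    import std.utf;   // toUTF32, toUTF8

    void caretDemo()
    {
        char[]  utf8 = "тест";         // 4 Cyrillic characters, 8 UTF-8 bytes
        dchar[] text = toUTF32(utf8);  // one array element per code point
        assert(text.length == 4);

        // Caret positioning, selector analysis, etc. become plain
        // array indexing once the text is in code-point form.
        dchar third = text[2];

        char[] back = toUTF8(text);    // re-encode for storage/transport
        assert(back.length == 8);
    }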
Jul 29 2006
Is there a doctor in the house?
Jul 29 2006
Andrew Fedoniouk wrote:
> (Currently I am implementing right-to-left support in the engine - Arabic
> and Hebrew - trust me, I probably have more things to say "sorry" about.)

That's great. I'd be glad to help with anything if you need help with regard
to Arabic (I'm a native Arabic speaker).
Jul 30 2006
Andrew Fedoniouk wrote:"Walter Bright" <newshound digitalmars.com> wrote in message news:eah9st$2v1o$1 digitaldaemon.com...I have written isUniAlpha, which is the same thing.Andrew Fedoniouk wrote:Sorry, did you try to write such a function (isword)?No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.Sorry but this is a bit optimistic. D/samples/wc.exe from the box will fail on russian texts. It will fail on almost all Eastern texts. Even they will be in UTF-8 encoding. Meaning of 'word' is different there.Please don't think that UTF-8 is a panacea.I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.(You need the whole set of character classification tables to accomplish this - utf-8 will not help you)With code pages, it isn't so straightforward (especially if you've got things like shift-JIS too). With code pages, a program can't even accept a text file unless you tell it what page the text is in.I am not saying that you shall avoid use of UTF-8 encoding. If you have mix of say english, russian and chinese on some page the only way to deliver this to the user is to use some (universal) unicode transport encoding. But to render this thing on the screen is completely different story.Fortunately, rendering is the job of the operating system - and I don't see how rendering with code pages would be any easier.Consider this: attribute names in html (sgml) represented by ascii codes only - you don't need utf-8 processing to deal with them at all. You also cannot use utf-8 for storing attribute values generally speaking. Attribute values participate in CSS selector analysis and some selectors require char by char (char as a code point and not a D char) access.I'd be surprised at that, since UTF-8 is a documented, supported HTML page encoding method. But if UTF-8 doesn't work for you, you can use wchar (UTF-16) or dchar (UTF-32), or ubyte (for anything else).There are only few academic cases where you can use utf-8 literally (as a sequence of utf-8 bytes) *in runtime*. D source code compilation is one of such things - you can store content of string literals in utf-8 form - you don't need to analyze their content.D identifiers can be unicode alphas, which means the UTF-8 must be decoded. The DMC++ compiler supports various code page source file possibilities, including some of the asian language multibyte encodings. I find that UTF-8 is a lot easier to work with, as the UTF-8 designers learned from the mistakes of the earlier multibyte encodings.I'll make a prediction that the huge benefits of UTF will outweigh the downside, and that code pages will increasingly fall into disuse. Note also supports EUC or SJIS, but not other code pages). Windows is (internally) completely unicode (the code page face it shows is done by a translation layer on I/O). In an increasingly multicultural and global economy, applications that cannot simultaneously handle multiple languages are going to be at a severe disadvantage. Another problem with code pages is when you're presented with a text file, what code page is it in? There's no way for a program to tell, unless there's some other transmission of associated metadata. With UTF, that's no problem.Code pages are obsolete yesterday's technology, and I'm not sorry to see them go.Sorry but US is the first country which will ask "what a ...?" on demand to send always four bytes instead of one. 
UTF-8 encoding is "traffic friendly" only for 1/10 of population on the Earth (English speaking people). Others just don't want to pay that price.Sorry you or not sorry it is irrelevant for code pages existence. They will be forever untill all of us will not speak on Esperanto. ( Currently I am doing right-to-left support in the engine - Arabic and Hebrew - trust me - probably I have more things to say "sorry" about )No problem, I believe you <g>.Java Language Specification Third Edition Chapter 3.2: "The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding." It is, of course, entirely reasonable for a Java compiler to have extensions to recognize other encodings and automatically convert them internally to UTF-16 before lexical analysis. "One Encoding to rule them all, One Encoding to replace them, One Encoding to handle them all and in the darkness bind them" -- UTF TolkienWalter, where did you get that magic UTF-16 ? Doc: http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html mentions that input of Java compiler is sequence of Unicode (Code Points). And how this input sequence is encoded, utf-8, utf-16, koi8r - it does not matter at all and spec is silent about this - human is in its rights to choose encoding his/her terminal/keyboard supports.Same by the way applied to most of Java compilers they accepts texts in various singlebyte encodings. (Why *I* am telling this to *you*? :-)The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
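A small sketch of the decoding Walter mentions for identifiers, assuming
std.utf.decode and std.uni.isUniAlpha as found in the D1 library; the
function and its simplified identifier rule (ignoring digits) are
hypothetical:

    import std.utf;   // decode
    import std.uni;   // isUniAlpha

    // Measure the leading identifier in src, the way a lexer might:
    // decode each UTF-8 sequence to a code point, then classify it.
    size_t identifierLength(char[] src)
    {
        size_t i = 0;
        while (i < src.length)
        {
            size_t next = i;
            dchar c = decode(src, next);   // advances next past the sequence
            if (c != '_' && !isUniAlpha(c))
                break;
            i = next;
        }
        return i;   // byte length of the identifier prefix
    }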
Jul 30 2006
Walter Bright wrote:
> "One Encoding to rule them all, One Encoding to replace them,
> One Encoding to handle them all and in the darkness bind them"
> -- UTF Tolkien

LOL!!!

---
Paolo
Jul 30 2006
On Sun, 30 Jul 2006 03:25:22 -0700, Paolo Invernizzi
<arathorn NOSPAM_fastwebnet.it> wrote:
>> "One Encoding to rule them all, One Encoding to replace them,
>> One Encoding to handle them all and in the darkness bind them"
>> -- UTF Tolkien
>
> LOL!!!

Okay, that clears things up. Now we know that UTF is a conspiracy for world
domination. ;)

-JJR
Jul 30 2006
John Reimer wrote:
> Okay, that clears things up. Now we know that UTF is a conspiracy for
> world domination. ;)

And created on the back of a napkin in a New Jersey diner ... way to go, Ken
Jul 30 2006
Andrew Fedoniouk wrote:
> Consider this: attribute names in HTML (SGML) are represented by ASCII
> codes only - you don't need UTF-8 processing to deal with them at all.

It's true that in HTML, attribute names were limited to a subset of the
characters available for use in the document - namely, as mentioned,
alpha-type characters (/[A-Za-z][A-Za-z0-9\.\-]*/). You couldn't even use
accented chars. However (in the case of HTML), you were required to use
specific (English) attribute names anyway for HTML to validate; it's really
not a significant limitation. Few people used SGML for anything else.

XML allows for Unicode attribute and element names, PIs, CDATA, PCDATA, etc.,
and, of course, allows you to reference any Unicode code point.

We could also talk about the limitations of horse-drawn carriages, and how
they can only go a certain speed... nonetheless, we have cars now, so I'm not
terribly worried about HTML's technical limitations anymore.

-[Unknown]
Jul 30 2006
On Sat, 29 Jul 2006 20:37:56 -0400, Walter Bright
<newshound digitalmars.com> wrote:
> In D, char[] is a UTF-8 sequence. It's well defined, and therefore
> portable. It supports every human language.

Even body language? :)
Jul 30 2006
> Sorry, but the plural form "char contains UTF-8 bytes" is wrong. What do
> you think char means: 1) char is an octet (byte) - a member of a UTF-8
> sequence - or 2) char is the code point of some character in some character
> table? Probably I am treating English too literally, but a char(acter) is
> not a UTF-8 byte, and never was. char is an index of some glyph in some
> encoding table. This is the common definition used everywhere.

2. Sorry - an array of char (a single char is one octet) contains UTF-8 code
units, which are 8-bit octets. A single character, in UTF-8 encoding, may be
1 byte, 2 bytes, etc. Thus one char MAY NOT hold every single Unicode code
point; you may need an array of multiple chars (bytes) to hold a single code
point. This is not just what it means to me; this is what it means. A char is
a single octet in a UTF-8 sequence. Chars ARE NOT by any means code points.
I'm sorry that I did not specify "array", but I fear you are being pedantic
here; I'm sure you knew what I meant.

I'm afraid calling char an index to a glyph is dangerous, because it could be
mistaken. Again, a single char CANNOT represent code points at or above 128,
because it is only ONE byte. A single char therefore may not represent a
glyph all of the time, but rather represents one byte of a UTF-8 sequence,
which may be used (along with the other necessary bytes) to decode the
entirety of the code point. I hope I'm not being overly pedantic here, but I
think your definition is either lax or wrong - though that is only by its
reading in English.

> What is wchar (uint16) for you: 1) an index of a Unicode scalar value in
> the Basic Multilingual Plane (BMP) - or 2) a uint16 value - a member of a
> UTF-16 sequence?

3. The latter: a single wchar may not represent full code points alone;
arrays of wchars must be used for some characters.

> UTF-32 (like any other UTF) is a transformation format - a group name for
> two different encodings, UTF-32BE and UTF-32LE. "UTF-32 code point" is
> nonsense. I would define it so: dchar (a better name would be uchar) is a
> type for representing the full set of Unicode code points (a 21-bit value).
> Please note: a "transformation format" (UTF) is not by any means a
> "manipulation format". You cannot use UTF-8 encoded Russian text for
> analysis. No way.

4. I was ignoring endianness issues for simplicity. My point here is that a
UTF-32 character directly represents a code point. Sorry again for the
non-pedantic laxness in my wording.

> Vice versa. For UTF-8 encoded strings you should use byte[], and for
> strings using single-byte encodings you should use char.

5. Wrong. There is no vice versa. You may use byte or ubyte arrays for your
UTF-8 encoded strings and so forth, but in case you didn't realize, I was
trying to say this: *char is not for single-byte encodings. char is ONLY for
UTF-8. char may not be used for any other encoding unless you wish to have
problems. char is not the same as in other languages, e.g. C.* If you want an
8-bit octet value (such as a character in any encoding, single-byte or
otherwise) you should not be using char; that is what byte and ubyte are for.
It is expected that the chars in an array will follow a specific sequence -
that is, that they will be encoded in UTF-8. It is not possible to guarantee
this if you use other encodings, which is why writefln() will fail in such
cases.

> No objections to that: for UTF-8 octet sequences, 0xFF is an invalid octet
> value in the sequence. But please note: in the sequence of octets.

6. Correct. And a single char (an octet in a sequence encoded as UTF-8) may
never be FF, because no octet anywhere in a valid UTF-8 sequence may be FF.
Remember, char is not a code point; it is a single octet in a sequence.

> Sorry, but UCS-4 *is not* UTF-32:
> http://www.unicode.org/reports/tr19/tr19-9.html
> I will ask again: what does char c = 'a'; mean to you? And the following
> in C/C++: #pragma(encoding,"KOI-8R") char c = '?'; ?

7. My mistake. I always consider them roughly the same (and for some reason I
thought that they had been made the same; but I assume your link is current).
Your first code sample defines a single UTF-8 character, 'a'. It is lucky you
did not try:

    char c = '蝿';

(Hopefully this character gets sent through to you properly; I will be
sending this message as UTF-8 if my client allows it.) That would have
failed: a char cannot hold such a character, whose code point is outside the
range 0 - 127. You would need to use an array of chars, or etc. Your second
example means nothing to me; I don't really care for such pragmas or for
putting untranslated text directly in source code, and have never dealt with
it.

> Not clear what you mean here. Could you clarify? Especially the last
> statement.

8. You may not use a single char or an array of chars to represent UTF-16;
they may only represent UTF-8. If you wish to use UTF-16, you must use
wchars.

> Sorry, I am not buying the following: "UTF-8 character" and "8-bit octet
> of code point".

1 (the second): they are the same - do you not agree? A 0 is a zero is a
zero. It doesn't matter what he means.

> "ASCII does not matter"... for whom?

2 (the second): rules about ASCII do not apply to char, just as rules in
Portugal do not dissuade me here in Los Angeles.

> "You should use UTF-8 or another encoding for your Russian text." Thanks.
> Advice from my side: let me know when you will visit Russia. I will ask
> representatives of the Russian developer community and web authors to meet
> you. Advice per se: you should wear a helmet.

3 (the second): I have led the development of multilingual software used by
quite a large number of people. I also helped coordinate, and later
interfaced with the assigned coordinator of, its translation. This software
was translated into Thai, Chinese (simplified and traditional), Russian,
Italian, Spanish, Japanese, Catalan, and several other languages - more than
twenty, anyway. At first I was suggesting that everyone use their own
encoding, and handling that (sometimes painfully) in the code. I would
sometimes get comments about using Unicode instead (from the translators, who
would have preferred this). This software now uses UTF-8 and remains
translated into these languages. So, while I have not been to Russia
(although I have worked with numerous Russian developers, consumers, and
translators), I would tend to disagree with your assertion. Also, I do not
like helmets.

> In Israel they have an old saying: "Not a human for Saturday, but Saturday
> for human". I do have practical experience writing text-processing
> software in encodings other than US-ASCII, and have heard your advice
> about UTF-8 usage with interest. Please don't take any of this personally
> - no intention to harm anybody. Honestly, and with a smile :)

Obviously, I mean nothing to be taken personally either; we are only talking
about UTF-8, Unicode, its usage in D, and being pedantic ;). And helmets - we
touched that subject too. But not about each other, really.

Thanks,
-[Unknown]
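To make the 0xFF/0xFFFF points concrete, a short sketch assuming the default
initializers D documented at the time (char.init == 0xFF, wchar.init ==
0xFFFF, dchar.init == 0xFFFF); the '蝿' lines restate the point-7 example:

    void initDemo()
    {
        char  c;              // default-initialized to 0xFF:
        wchar w;              // 0xFF is never valid inside UTF-8,
        dchar d;              // and 0xFFFF is not a Unicode character
        assert(c == 0xFF);
        assert(w == 0xFFFF);
        assert(d == 0xFFFF);

        // char c2 = '蝿';    // rejected: the code point does not fit in one char
        dchar ok = '蝿';      // a dchar holds any single code point
        char[] s = "蝿";      // stored as a three-byte UTF-8 sequence
        assert(s.length == 3);
    }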
Jul 29 2006
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message news:eah49h$2pi8$1 digitaldaemon.com...2. Sorry, an array of char (a single char is one single 8 bit octet) contains UTF-8 bytes which are 8-bit octets. A single character, in UTF-8 encoding, may be 1 byte, 2 bytes, etc. Thus, one char MAY NOT hold every single Unicode code point. You may need an array of multiple chars (bytes) to hold a single code point. This is not what it means to me; this is what it means. A char is a single 8-bit octet in a UTF-8 sequence. They ARE NOT by any means code points. I'm sorry that I did not specify "array", but I fear you are being pedantic here; I'm sure you knew what I meant. A char is a single byte in a UTF-8 sequence. I'm afraid I think calling it an index to a glyph is dangerous, because it could be mistaken. Again, a single char CANNOT represent code points above and including 128 because it is only ONE byte. A single char therefore may not represent a glyph all of the time, but rather will represent a byte in the sequence of UTF-8 which may be used to decode (along with other necessary bytes) the entirity of the code point. I hope I'm not being overly pedantic here, but I think your definition is either lax or wrong. But, that is only by its reading in English."your definition is either lax or wrong" Which one?represent full code points alone. Arrays of wchars must be used for some 4. I was ignoring endianess issues for simplicity. My point here is that a UTF-32 character directly represents a code point. Sorry again for the non-pedantic laxness in my wording.5. Wrong. There is no vice versa. You may use byte or ubyte arrays for your UTF-8 encoded strings and so forth. In case you didn't realize I was trying to say this: *char is not for single byte encodings. char is ONLY for UTF-8. char may not be used for any other encoding unless you wish to have problems. char is not the same as in other languages, e.g. C.* If you wish for a 8-bit octet value (such as a character in any encoding; single byte or otherwise) you should not be using a char. That is not a correct usage for them, that is what byte and ubyte are for. It is expected that chars in an array will follow a specific sequence; that is, that they will be encoded in UTF-8. It is not possible to guarantee this if you use other encodings, which is why writefln() will fail in such cases. 6. Correct. And a single char (8-bit octet in a sequence of UTF-8 octets encoded such) may never be FF because no single 8-bit octet anywhere in a valid UTF-8 sequence may be FF. Remember, char is not a code point. It is a single 8-bit octet in a sequence. 7. My mistake. I always consider them roughly the same (and for some reason I thought that they had been made the same; but I assume your link is current.) Your first code sample defines a single UTF-8 character, 'a'. It is lucky you did not try: char c = '?'; (hopefully this character gets sent through to you properly; I will be sending this message UTF-8 if my client allows it.) Because that would have failed. A char cannot hold such a character, which has a code point outside the range 0 - 127. You would either need to use an array of chars, or etc. Your second example means nothing to me. I don't really care for such pragmas or putting untranslated text directly in source code, and have never dealt with it. 8. You may not use a single char or an array of chars to represent UTF-16. It may only represent UTF-8. If you wish to use UTF-16, you must use wchars. the same - do you not agree? 
A 0 is a zero is a zero. It doesn't matter what he means. 2 (the second): rules about ASCII do not apply to char. Just as rules in Portugal do not dissuade me here in Los Angeles. 3 (the second): I have lead the development of a multi-lingual software which was used by quite a large sum of people. I also helped coordinate, and later interface with the assigned coordinator of translation. This software was translated into Thai, Chinese (simple and traditional), Russian, Italian, Spanish, Japanese, Catalan, and several other languages. More than twenty anyway. At first I was suggesting that everyone use their own encoding and handling that (sometimes painfully) in the code. I would sometimes get comments about using Unicode instead (from the translators who would have preferred this.) This software now uses UTF-8 and remains translated in these languages. So, while I have not been to Russia (although I have worked with numerous Russian developers, consumers, and translators) I would tend to disagree with your assertion. Also I do not like helmets. Obviously, I mean nothing to be taken personally as well; we are only talking about UTF-8, Unicode, its usage in D, and being pedantic ;). And helmets, we touched that subject too. But not about each other, really. Thanks, -[Unknown]Ok. Let's make second round Some defintions: Unicode Code Point is an integer value (21bit used) - index in global Unicode table. Such global encoding table maintained by international Unicode Consortium. With some exceptions each code point there has correspondent glyph in "global super font". There are two types of encodings used for Unicode Code Points: 1) transport encodings - example UTF. Main purpose - transport/transfer. 2) manipulation encodings - mapping of ranges of Unicode Code Points to diapasons 0..0xFF, 0..0xFFFF and 0..0xFFFFFFFF. Transport encodings are used for transfer and long term storage of character data - texts. Manipulation encoding are used in programming for effective implementation of text processing functions. As a rule manipulation encoding maps some fragment (or two) of Unicode Code Point set to the range 0..0xFF and 0..0xFFFF. Main charcteristic of such mapping: each value of character vector (string) there is in 1:1 relationship with the correspondent codepoint in Unicode set. Main idea of such encoding - character at some index in string (vector) represents one code point in full. I think that motivation of having manipulation encodings is simple and everyone understands it. Think about how you will implement caret positioning in editbox for example. So statement: "char[] in D supposed to hold only UTF-8 encoded text" immediately leads us to "D is not designed for effective text processing". Is this logic clear? Again - let char be a char in D as it is now. Just don't initialize it by 0xFF please. And let us be a bit carefull with our utf-8 expectations - yes, it is almost ideal transport encoding, but it is completely useless for text manipulation purposes - too expensive. (last message on the subject) Andrew Fedoniouk. http://terrainformatica.com
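A sketch of the 1:1 "manipulation encoding" property Andrew describes, using
a hypothetical KOI8-R-to-Unicode table (the table contents and the names are
illustrative, not a real mapping):

    // One 256-entry table maps every single-byte KOI8-R value straight
    // to a Unicode code point - the 1:1 property Andrew describes.
    dchar[256] koi8rToUnicode;   // hypothetical; filled in elsewhere

    dchar[] decodeKoi8r(ubyte[] text)
    {
        dchar[] result = new dchar[text.length];  // one unit per character
        foreach (size_t i, ubyte b; text)
            result[i] = koi8rToUnicode[b];
        return result;            // indexable character by character
    }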
Jul 29 2006
Andrew Fedoniouk wrote:
> So the statement "char[] in D is supposed to hold only UTF-8 encoded text"
> immediately leads us to "D is not designed for effective text processing".
> Is this logic clear?

It really sounds to me like you're looking for UCS-2, then (e.g. as used in
JavaScript, etc.). For that, length calculation (which is what I presume you
mean) is inexpensive.

As to your assertion above, I disagree. What I think you meant was: "char[]
is not designed for effective multi-byte text processing." I will agree that
wchar[] would be much better in that case, and even that limiting it to UCS-2
(which is, afaik, a subset of UTF-16) would probably make things
significantly easier to work with. Nonetheless, I was only commenting on how
D is currently designed and implemented. Perhaps there was some
misunderstanding here.

Even so, I don't see how initializing it to FF causes any problem. I think
everyone understands that char[] is meant to hold UTF-8, and if you don't
like that or don't want to use it, there are other methods available to you
(heh, you can even use UTF-32!). I don't see that the initialization of these
variables will cause anyone any problems. The only time I want such a
variable initialized to 0 is when I use a numeric type, not a character type
(and then, I try to use = 0 anyway).

It seems like what you may want to do is simply this:

    typedef ushort ucs2_t = 0;

And use that type. Mission accomplished. Or, use various different encodings
- in which case I humbly suggest:

    typedef ubyte latin1_t = 0;
    typedef ushort ucs2_t = 0;
    typedef ubyte koi8r_t = 0;
    typedef ubyte big5_t = 0;

And so on, so on, so on...

-[Unknown]
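A quick check of what this typedef buys, relying on the D rule that array
elements are filled with the element type's default initializer (the demo
function is invented; the typedefs are from the post):

    typedef ubyte koi8r_t = 0;

    void typedefDemo()
    {
        koi8r_t[] line = new koi8r_t[16];  // elements start at 0, not 0xFF
        assert(line[0] == 0);

        char[] s = new char[16];           // by contrast, chars start at 0xFF
        assert(s[0] == 0xFF);
    }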
Jul 29 2006
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message news:eahcqu$4d$1 digitaldaemon.com...It really sounds to me like you're looking for UCS-2, then (e.g. as used in JavaScript, etc.) For that, length calculation (which is what I presume you mean) is inexpensive.Well, lets speak in terms of javascript if it is easier: String.substr(start, end)... What these start, end means for you? I don't think that you will be interested in indexes of bytes in utf-8 sequence.As to your below assertion, I disagree. What I think you meant was: "char[] is not designed for effective multi-byte text processing."What is "multi-byte text processing"? processing of text - sequence of codepoints of the alphabet? What is 'multi-byte' there doing? Multi-byte I beleive you mean is a method of encoding of codepoints for transmission. Is this correct? You need real codepoints to do something meaningfull with them... How these codepoints are stored in memory: as byte, word or dword depends on your task, amount of memory you have and alphabet you are using. E.g. if you are counting frequency of russian words used in internet you'd better do not do this in Java - twice as expensive as in C without any need. So phrase "multi-byte text processing" is fuzzy on this end. (Seems like I am not clear enough with my subset of English.)I will agree that wchar[] would be much better in that case, and even that limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would probably make things significantly easier to work with. Nonetheless, I was only commenting on how D is currently designed and implemented. Perhaps there was some misunderstanding here. Even so, I don't see how initializing it to FF makes any problem. I think everyone understands that char[] is meant to hold UTF-8, and if you don't like that or don't want to use it, there are other methods available to you (heh, you can even use UTF-32!) I don't see that the initialization of these variables will cause anyone any problems. The only time I want such a variable initialized to 0 is when I use a numeric type, not a character type (and then, I try to use = 0 anyway.) It seems like what you may want to do is simply this: typedef ushort ucs2_t = 0; And use that type. Mission accomplished. Or, use various different encodings - in which case I humbly suggest: typedef ubyte latin1_t = 0; typedef ushort ucs2_t = 0; typedef ubyte koi8r_t = 0; typedef ubyte big5_t = 0; And so on, so on, so on... -[Unknown]I like the last statement "..., so on, so on..." Sounds promissing enough. Just for information: strlen(const char* str) works with *all* single byte encodings in C. For multi-bytes (e.g. utf-8 ) it returns length of the sequence in octets. But these are not chars in terms of C strictly speaking but bytes - unsigned chars.So statement: "char[] in D supposed to hold only UTF-8 encoded text" immediately leads us to "D is not designed for effective text processing". Is this logic clear?
Jul 29 2006
Andrew Fedoniouk wrote:
> Well, let's speak in terms of JavaScript if it is easier:
> String.substr(start, end)... What do these start and end mean to you? I
> don't think that you will be interested in indexes of bytes in a UTF-8
> sequence.

Yes, you're right, most of the time I wouldn't (although a significant
portion of the time, I would). Even so, this is why I would use UCS-2, and
not UTF-8.

Why are you hung up on char[]? My point is that char[] is only trouble when
you're dealing with text that is not ISO-8859-1. I'm a great fan of
localization and internationalization, but in all honesty the largest part of
my text processing/analysis is with such strings. Generally, user input I
don't analyze. Caret placement I leave to be handled by the libraries I use.
That is, when I use char[].

So again, I will agree that, in D, char[] is not a good choice for strings
you expect to contain possibly-internationalized data.

I'm perfectly aware of what strlen (and str.length in D) do... it's similar
to what they do in practically all other languages (unless you know the
encoding is UCS-2, etc.). For example, I work with PHP a lot, and it doesn't
even have (in the versions I support) built-in support for Unicode. This
makes text processing fun!

-[Unknown]
Jul 30 2006
Unknown W. Brackets wrote:
> char c = '蝿';
> That would have failed: a char cannot hold such a character, whose code
> point is outside the range 0 - 127. You would need to use an array of
> chars, or etc.

Which, speaking of which, shouldn't that be a compile-time error? The
compiler allows all kinds of *char mingling:

    dchar dc = '蝿';
    char sc = dc;    // :-(

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Jul 30 2006
Bruno Medeiros wrote:
> Which, speaking of which, shouldn't that be a compile-time error?

Eek! Yes, I would say (in my humble opinion) that this should be a
compile-time error. Obviously down-casting is more complicated. I think the
case of chars is much more obvious/clear than the case of ints, but then it's
also a special case.

-[Unknown]
Jul 30 2006
Unknown W. Brackets wrote:
> 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8
> string. Since char can only contain UTF-8 strings, it represents invalid
> data if it contains such an 8-bit octet.

You mentioned "8-bit octet" repeatedly in various posts. That's redundant: an
"octet" is an 8-bit value. There are no "16-bit octets" and no "8-bit
hextets" or stuff like that :P . I hope you knew that and were just
distracted, but you kept saying it :) .

> 1. UTF-8 character here could mean an 8-bit octet of code point. In this
> case, they are both the same and represent a perfectly valid character in
> a string.

A "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly, a "UTF-16
hextet" is called a UTF-16 'code unit'. A UTF-8 code unit holds a Unicode
code point if the code point is < 128; otherwise multiple UTF-8 code units
are needed to encode that code point. The confusion between 'code unit' and
'code point' is a long-standing one.

A "UTF-8 character" is a slightly ambiguous term. Does it mean a UTF-8 code
unit, or does it mean a Unicode character/code point encoded in a UTF-8
sequence?

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Jul 30 2006
Bruno Medeiros wrote:
> You mentioned "8-bit octet" repeatedly in various posts. That's redundant:
> an "octet" is an 8-bit value.

I use that terminology because I've read too many RFCs (consider the FTP RFC)
- they all say "8-bit octet". Anyway, I was trying to be completely clear.

Code unit. Yeah, I knew it was "code something" but it slipped my mind. I was
sure that he'd either correct me or that "8-bit octet" etc. would remain
clear. I hate it when I forget such obvious terms.

Anyway, my point in what you're quoting is very context-dependent. Walter
mentioned that "0 is a valid UTF-8 character." Andrew asked what this meant,
so I explained that in this case (as you also clarified) it doesn't make any
difference. Regardless, it's a valid [whatever it is] and that meaning is not
unclear.

-[Unknown]
Jul 30 2006
Unknown W. Brackets wrote:
> Walter mentioned that "0 is a valid UTF-8 character." Andrew asked what
> this meant, so I explained that in this case (as you also clarified) it
> doesn't make any difference. Regardless, it's a valid [whatever it is] and
> that meaning is not unclear.

I confess I often misuse the terminology.
Jul 30 2006
On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:
> ... but this is far from the concept of a null codepoint in character
> encodings.

Andrew and others,

I've read through these posts a few times now, trying to understand the
various points of view being presented. I keep getting the feeling that some
people are deliberately trying *not* to understand what other people are
saying. This is a sad situation.

Andrew seems to be stating ...
(a) char[] arrays should be allowed to hold encodings other than UTF-8, and
thus initializing them with hex-FF byte values is not useful.
(b) UTF-8 encoding is not an efficient encoding for text analysis.
(c) UTF encodings are not optimized for data transmission (they contain
redundant data in many contexts).
(d) The D type called 'char' may not have been the best name to use if it is
meant to contain only UTF-8 octets.

I, and many others including Walter, would probably agree to (b), (c) and
(d). However, considering (b) and (c), UTF has benefits that outweigh these
issues, and there are ways to compensate for them too. Point (d) is a
casualty of history, and to change the language now to rename 'char' to
anything else would be counterproductive. But feel free to implement your own
flavour of D. <g>

Back to point (a)... The fact is, char[] is designed to hold UTF-8 encodings,
so don't try to force anything else into such arrays. If you wish to use some
other encoding, then use a more appropriate data structure for it. For
example, to hold 'KOI-8' encodings of Russian text, I would recommend using
ubyte[] instead. To transform char[] to any other encoding you will have to
provide the functions to do that, as I don't think it is Walter's or D's
responsibility to do it.

The point of initializing UTF-8 strings with illegal values is to help detect
coding or logical mistakes, and a leading octet with the value hex-FF in a
UTF-8 encoded Unicode codepoint *is* illegal. If you must store an octet of
hex-FF, then use ubyte[] arrays to do it.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
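A sketch of how the hex-FF initializer actually surfaces such mistakes,
assuming std.utf.validate and its UtfException as found in the D1 library;
the demo itself is illustrative:

    import std.utf;   // validate throws UtfException on malformed UTF-8

    void demo()
    {
        char[] s = new char[4];   // elements default to 0xFF
        validate(s);              // throws: 0xFF never occurs in valid UTF-8
    }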
Jul 30 2006
Derek wrote:
> (c) UTF encodings are not optimized for data transmission (they contain
> redundant data in many contexts).

Thank you for the insightful summary of the situation.

I suspect, though, that (c) might be moot, since it is my understanding that
most actual data transmission equipment automatically compresses the data
stream, and so the redundancy of UTF-8 is minimized. Text itself tends to be
highly compressible on top of that. Furthermore, because of the rate of
expansion and declining costs of bandwidth, the cost of extra bytes is
declining at the same time that the cost of the inflexibility of code pages
is increasing.
Jul 30 2006
Walter Bright wrote:
> I suspect, though, that (c) might be moot, since it is my understanding
> that most actual data transmission equipment automatically compresses the
> data stream, and so the redundancy of UTF-8 is minimized.

Indeed; this is the same situation as with XML transmission over the web. It
contains a huge amount of redundancy, and compresses so well that I've seen
it do better than binary-based formats. Although, I'm afraid that most of the
time this compression isn't automatic, and too often is not done at all.

-[Unknown]
Jul 30 2006
Derek wrote:
> (d) The D type called 'char' may not have been the best name to use if it
> is meant to contain only UTF-8 octets.

Thank you for the clear summary. Apart from the obvious (d), I think there
are two reasons this char confusion comes up now and then.

1. The documentation may not be clear enough on the point that char is really
only meant to represent a UTF-8 code unit (or ASCII character) and that
char[] is a UTF-8 encoded string. It seems this needs to be stressed more.
People coming from C will automatically assume the D char is a C char
equivalent. It should be mentioned that dchar is the only type that can
represent any Unicode character, while char is a character only in ASCII. The
C-to-D type conversion table doesn't help either:
http://www.digitalmars.com/d/ctod.html
It should say something like:

    char => char (UTF-8 and ASCII strings)
            ubyte (other byte-based encodings)

2. All string functions in Phobos work only on char[] (and in some cases
wchar[] and dchar[]), making the tools for working with other string
encodings extremely limited. This is easily remedied by a templated string
library, such as what I have proposed earlier.

/Oskar
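A taste of the templated approach Oskar mentions - one routine instantiated
for each UTF array type, relying on foreach-decoding. This sketch (names
invented) assumes the D1-era function template syntax and only hints at what
a full library would provide:

    // Count characters in any UTF-encoded array type.
    size_t charCount(Char)(Char[] str)
    {
        size_t n = 0;
        foreach (dchar c; str)   // decodes char[], wchar[] and dchar[] alike
            n++;
        return n;
    }

    // usage:
    //   charCount!(char)("тест");    // 4
    //   charCount!(wchar)("тест"w);  // 4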
Jul 30 2006
Derek wrote:
> For example, to hold 'KOI-8' encodings of Russian text, I would recommend
> using ubyte[] instead.

Good summary. Additionally, I'd like to say that, to hold 'KOI-8' encodings,
you could create a typedef instead of just using a ubyte:

    typedef ubyte koi8char;

Thus you are able to express in the code what the encoding of such a ubyte
is, as it is part of the type information. And then the program is able to
work with it:

    koi8char toUpper(koi8char ch) { ...
    int wordCount(koi8char[] str) { ...
    dchar[] toUTF32(koi8char[] str) { ...

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Jul 30 2006
Maybe I missed the point here; correct me if I misunderstood. This is how I
see the problem with char[] as a UTF-8 *string*: the length of an array of
chars is not always the count of characters, but rather the size of the array
in bytes, which makes no sense to me. For that purpose I would like to see
separate properties.

For example:

    char[] str = "тест";

The word "test" in Russian - 4 Cyrillic characters - would give you
str.length 8, which makes the length property useless unless you are sure the
string contains Latin characters only.
Jul 31 2006
Serg Kovrov wrote:
> Maybe I missed the point here; correct me if I misunderstood.

You have understood correctly.

> This is how I see the problem with char[] as a UTF-8 *string*: the length
> of an array of chars is not always the count of characters, but rather the
> size of the array in bytes, which makes no sense to me. For that purpose I
> would like to see separate properties.

Having char[].length return something other than the actual number of char
units would break its array semantics.

> The word "test" in Russian - 4 Cyrillic characters - would give you
> str.length 8, which makes the length property useless unless you are sure
> the string contains Latin characters only.

It is actually not very often that you need to count the number of characters
as opposed to the number of (UTF-8) code units. Counting the number of
characters is also a rather expensive operation. All the ordinary operations
(searching, slicing, concatenation, sub-string search, etc.) operate on code
units rather than characters.

It is easy to implement your own character count though:

    size_t count(char[] arr)
    {
        size_t n = 0;
        foreach (dchar c; arr)
            n++;
        return n;
    }

    assert("тест".count() == 4);

Also note that:

    assert("тест"d.length == 4);

/Oskar
Jul 31 2006
* Oskar Linde:
> Having char[].length return something other than the actual number of char
> units would break its array semantics.

Yes, I see. That's why I don't much like char[] as a substitute for a string
type.

> It is actually not very often that you need to count the number of
> characters as opposed to the number of (UTF-8) code units.

Why not use separate properties for that?

> Counting the number of characters is also a rather expensive operation.

Indeed. Storing it once as a property (and updating it as needed) is better
than calculating it each time you need it.

> All the ordinary operations (searching, slicing, concatenation, sub-string
> search, etc.) operate on code units rather than characters.

Yes, that's a tough one. If you want to slice an array, use the array unit
count for that. But if you want to slice a *string* (substring, search, etc.)
use the character count for that.

Maybe there should be interchangeable types - string and char[] - with
different length, slice, find, etc. behaviors? I mean, it could be the same
actual type, but with different contexts for its properties. And besides,
"string", as opposed to "char[]", is more pleasant to my eyes =)
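For the character-indexed slicing Serg wants, a dchar[] already behaves that
way, since each element is one code point; a small sketch:

    void sliceDemo()
    {
        dchar[] text = "тест"d.dup;       // one element per character
        dchar[] firstTwo = text[0 .. 2];  // slice by characters, not bytes
        assert(firstTwo.length == 2);
    }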
Jul 31 2006
Serg Kovrov wrote:
> Indeed. Storing it once as a property (and updating it as needed) is
> better than calculating it each time you need it.

Store it where? You can't put it in the array data itself without breaking
slicing, and putting it in the reference introduces problems with it getting
out of date if the array is modified through another reference (without
enforcing COW, that is).
Jul 31 2006
* Frits van Bommel:
> Store it where? You can't put it in the array data itself without breaking
> slicing, and putting it in the reference introduces problems with it
> getting out of date if the array is modified through another reference
> (without enforcing COW, that is).

I have to say that I have no idea where to store it, nor where the current
length property is stored. I'm really glad the compiler does it for me. As a
language user I just want to be confident that the compiler does it wisely,
and focus on my domain problems.
Jul 31 2006
Serg Kovrov wrote:
> I have to say that I have no idea where to store it, nor where the current
> length property is stored.

The length is stored in the reference, but the character count would depend
not only on the memory location and size (which the reference holds) but also
on the data it holds (at least for char and wchar), which may be accessed
through different references as well. That's the problem I was pointing out.
Jul 31 2006
Serg Kovrov wrote: * Oskar Linde: Having char[].length return something other than the actual number of char units would break its array semantics.
Yes, I see. That's why I do not much like char[] as a substitute for a string type.
It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units.
Why not use separate properties for that?
Counting the number of characters is also a rather expensive operation.
Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.
All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters.
Yes, that's a tough one. If you want to slice an array - use the array unit count for that. But if you want to slice a *string* (substring, search, etc) - use the character count for that. Maybe there should be interchangeable types - string and char[] - with different length, slice, find, etc. behaviors? I mean it could be the same actual type, but with different contexts for properties. And besides, string as opposed to char[] is more pleasant for my eyes =)
I say this calls for a proper *standard* String class ... <g>
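For the record, a bare-bones sketch of what such a wrapper might look like (hypothetical, not a concrete proposal):

class String
{
    private char[] data;

    this(char[] s) { data = s; }

    size_t codeUnits() { return data.length; }

    // the expensive per-character count, hidden behind a method
    size_t characters()
    {
        size_t n = 0;
        foreach (dchar ch; data) n++;
        return n;
    }

    char[] raw() { return data; }   // escape hatch back to the plain array
}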
Jul 31 2006
Serg Kovrov wrote: * Oskar Linde: Having char[].length return something other than the actual number of char units would break its array semantics.
Yes, I see. That's why I do not much like char[] as a substitute for a string type.
It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units.
Why not use separate properties for that?
The question is, how often do you need it? Especially if you are not indexing by character.
Counting the number of characters is also a rather expensive operation.
Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.
All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters.
Yes, that's a tough one. If you want to slice an array - use the array unit count for that. But if you want to slice a *string* (substring, search, etc) - use the character count for that.
Why? Code unit indices will work equally well for substrings, searching etc.
Maybe there should be interchangeable types - string and char[] - with different length, slice, find, etc. behaviors? I mean it could be the same actual type, but with different contexts for properties.
Indexing a UTF-8 encoded string by character rather than code unit is expensive in either time or memory. If you for some reason need character indexing, use a dchar[].
And besides, string as opposed to char[] is more pleasant for my eyes =)
There is always alias.
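A short sketch of both suggestions (std.utf.toUTF32 does the conversion; the alias is purely cosmetic):

import std.utf;

alias char[] string;   // same type underneath, friendlier name

void main()
{
    string s = "тест";        // still indexed by code unit
    dchar[] d = toUTF32(s);   // one array element per code point
    assert(d.length == 4);
    assert(d[1] == 'е');      // direct character indexing works here
}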
Jul 31 2006
* Oskar Linde: Serg Kovrov wrote: * Oskar Linde: Having char[].length return something other than the actual number of char units would break its array semantics.
Yes, I see. That's why I do not much like char[] as a substitute for a string type.
It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units.
Why not use separate properties for that?
The question is, how often do you need it? Especially if you are not indexing by character.
Counting the number of characters is also a rather expensive operation.
Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.
All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters.
Yes, that's a tough one. If you want to slice an array - use the array unit count for that. But if you want to slice a *string* (substring, search, etc) - use the character count for that.
Why? Code unit indices will work equally well for substrings, searching etc.
Maybe there should be interchangeable types - string and char[] - with different length, slice, find, etc. behaviors? I mean it could be the same actual type, but with different contexts for properties.
Indexing a UTF-8 encoded string by character rather than code unit is expensive in either time or memory. If you for some reason need character indexing, use a dchar[].
And besides, string as opposed to char[] is more pleasant for my eyes =)
There is always alias.
You've got some valid points, I just showed mine.
Jul 31 2006
Oskar Linde wrote: It is easy to implement your own character count though: size_t count(char[] arr) { size_t n = 0; foreach(dchar ch; arr) n++; return n; } assert("тест".count() == 4);
std.utf.toUCSindex(s, s.length) will also give the character count.
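A minimal usage example of that (toUCSindex maps a code-unit index to a character index):

import std.utf;

void main()
{
    char[] s = "тест";
    assert(toUCSindex(s, s.length) == 4);  // character count of the whole string
    assert(toUCSindex(s, 2) == 1);         // code unit 2 starts the second character
}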
Jul 31 2006
Oskar Linde schrieb am 2006-07-31: Serg Kovrov wrote:
I hate to be pedantic but dchar[] can only be used to count the code points - not the characters. A "character" can be composed of more than one code point/dchar. This feature is frequently used for accents, marks and some Asian scripts. -> http://www.unicode.org
Thomas
For example, char[] str = "тест"; the word "test" in Russian - 4 Cyrillic characters - would give you str.length 8, which makes this length property of no use if you are not sure the string contains Latin characters only.
It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters. It is easy to implement your own character count though: size_t count(char[] arr) { size_t n = 0; foreach(dchar ch; arr) n++; return n; } assert("тест".count() == 4); Also note that: assert("тест"d.length == 4);
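To make the combining-mark point concrete (U+0308 is the combining diaeresis; both arrays below render as one user-perceived character):

void main()
{
    dchar[] composed   = "\u00E4"d.dup;   // "ä" as one precomposed code point
    dchar[] decomposed = "a\u0308"d.dup;  // "a" plus a combining diaeresis
    assert(composed.length   == 1);
    assert(decomposed.length == 2);       // two code points, one "character"
}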
Jul 31 2006
"Thomas Kuehne" <thomas-dloop kuehne.cn> wrote in message news:ls52q3-3o8.ln1 birke.kuehne.cn...-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Oskar Linde schrieb am 2006-07-31:Right, Thomas, umlaut as a separate code point can exist so A with umlaut can be represented by two code points. But as far as I remember the intention was and is to have in Unicode also all full forms like "A-with-umlaut" So you can always "compress" multi code point forms into single point counterparts. This way "????"d.length == 4 will be true - it is just depeneds on your text parser. Andrew.Serg Kovrov wrote:I hate to be pedantic but dchar[] can only be used to count the code points - not the characters. A "character" can be composed by more than one code point/dchar. This feature is frequent used for accents, marks and some Asian scripts. - -> http://www.unicode.orgFor example, char[] str = "????"; word "test" in russian - 4 cyrillic characters, would give you str.length 8, which make no use of this length property if you not sure that string is latin characters only.It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters. It is easy to implement your own character count though: size_t count(char[] arr) { size_t c = 0; foreach(dchar c;arr) c++; return c; } assert("????".count() == 4); Also note that: assert("????"d.length == 4);Thomas -----BEGIN PGP SIGNATURE----- iD8DBQFEzmhrLK5blCcjpWoRAnJhAJ0VKD2sD++PkR0hnFfGIAgFxn8OGgCeLg0Y mp2vyHbFrwExwr3h6/etjWc= =9RLJ -----END PGP SIGNATURE-----
Jul 31 2006
Andrew Fedoniouk schrieb am 2006-07-31: "Thomas Kuehne" <thomas-dloop kuehne.cn> wrote in message news:ls52q3-3o8.ln1 birke.kuehne.cn...
Oskar Linde schrieb am 2006-07-31:
I hate to be pedantic but dchar[] can only be used to count the code points - not the characters. A "character" can be composed of more than one code point/dchar. This feature is frequently used for accents, marks and some Asian scripts. -> http://www.unicode.org
Right, Thomas, an umlaut as a separate code point can exist, so A-with-umlaut can be represented by two code points. But as far as I remember the intention was and is to have in Unicode also all full forms like "A-with-umlaut".
I won't argue about the intention here. Post this statement on <unicode unicode.org> (http://www.unicode.org/consortium/distlist.html) and let's see the various responses ;)
So you can always "compress" multi-code-point forms into single-code-point counterparts.
Not always. For a common use case see
Thomas
Aug 02 2006
Derek, thanks for summarizing all this, but I will put it as follows. There are two types of text encodings for two distinct use cases:
1) transport/storage encodings - one Unicode code point represented as multiple code units of the encoded sequence (e.g. UTF). string.length returns the length in code units of the encoding - not characters.
2) manipulation encodings - one Unicode code point represented as one and only one element of the sequence (e.g. one byte, word or dword). string.length here returns the length in code points (mapped character glyphs).
The problem as I can see it is this: D proposes to use a transport encoding for manipulation purposes, which is the main problem here, IMO - transport encodings are not designed for manipulation - it is extremely difficult to use them for manipulation in practice, as we may see.
One more problem: encodings like UTF-8 and UTF-16 are almost useless with, let's say, the Windows API, say the TextOutA and TextOutW functions. Neither one of them will accept D's char[] and wchar[] directly.
- ***A functions in Windows take a byte string (LPSTR) and the current codepage id to render text. ( byte + codepage = Unicode code point )
- ***W functions in Windows use LPWSTR things which are sequences of code points from the Unicode Basic Multilingual Plane (BMP). ( cast(dword) word = Unicode code point ) Only a few functions in the Windows API treat LPWSTR as UTF-16.
-----------------
"D strings are utf encoded sequences only" is a design mistake, IMO. On disk (serialized form) - yes. But not in memory for manipulation, please.
Andrew Fedoniouk. http://terrainformatica.com
"Derek" <derek psyc.ward> wrote in message news:177u058vq8cdj.koexsq99n112.dlg 40tude.net... On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:
... but this is far from the concept of a null codepoint in character encodings.
Andrew and others, I've read through these posts a few times now, trying to understand the various points of view being presented. I keep getting the feeling that some people are deliberately trying *not* to understand what other people are saying. This is a sad situation. Andrew seems to be stating ...
(a) char[] arrays should be allowed to hold encodings other than UTF-8, and thus initializing them with hex-FF byte values is not useful.
(b) UTF-8 encoding is not an efficient encoding for text analysis.
(c) UTF encodings are not optimized for data transmission (they contain redundant data in many contexts).
(d) The D type called 'char' may not have been the best name to use if it is meant to be used to contain only UTF-8 octets.
I, and many others including Walter, would probably agree to (b), (c) and (d). However, considering (b) and (c), UTF has benefits that outweigh these issues and there are ways to compensate for these too. Point (d) is a casualty of history and to change the language now to rename 'char' to anything else would be counterproductive. But feel free to implement your own flavour of D. <g>
Back to point (a)... The fact is, char[] is designed to hold UTF-8 encodings so don't try to force anything else into such arrays. If you wish to use some other encodings, then use a more appropriate data structure for it. For example, to hold 'KOI-8' encodings of Russian text, I would recommend using ubyte[] instead. To transform char[] to any other encoding you will have to provide the functions to do that, as I don't think it is Walter's or D's responsibility to do it. The point of initializing UTF-8 strings with illegal values is to help detect coding or logical mistakes.
And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode codepoint *is* illegal. If you must store an octet of hex-FF then use ubyte[] arrays to do it. -- Derek Parnell Melbourne, Australia "Down with mediocrity!"
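For what it's worth, the conversion glue for the W functions is one call in Phobos; a sketch of driving TextOutW from a char[] (the Win32 prototype is declared inline here so the sketch stands alone):

import std.utf;

// the usual Win32 prototype, declared here for self-containment
extern (Windows) int TextOutW(void* hdc, int x, int y, wchar* lpString, int cch);

void draw(void* dc, char[] text)
{
    wchar[] w = toUTF16(text);   // UTF-8 -> UTF-16, no code page involved
    TextOutW(dc, 0, 0, w.ptr, cast(int) w.length);
}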
Jul 31 2006
Andrew Fedoniouk wrote: The problem as I can see it is this: D proposes to use a transport encoding for manipulation purposes, which is the main problem here, IMO - transport encodings are not designed for manipulation - it is extremely difficult to use them for manipulation in practice, as we may see.
I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem. It's also certainly easier than codepage-based multibyte designs like shift-JIS (I used to write code for shift-JIS).
Encodings like UTF-8 and UTF-16 are almost useless with, let's say, the Windows API, say the TextOutA and TextOutW functions. Neither one of them will accept D's char[] and wchar[] directly. - ***A functions in Windows take a byte string (LPSTR) and the current codepage id to render text. ( byte + codepage = Unicode code point )
Win9x only supports the A functions, and Phobos does a translation of the output into the Win9x code page when running on Win9x. Of course, this fails when one has characters not supported by Win9x, but code pages aren't going to help that either. Win9x is obsolete anyway, and there's no reason to cripple a new language by accommodating the failures of an obsolete system. When running on NT or later Windows, the W functions are used instead, which work directly with UTF-16. Later Windows also support UTF-8 with the A functions.
- ***W functions in Windows use LPWSTR things which are sequences of code points from the Unicode Basic Multilingual Plane (BMP). ( cast(dword) word = Unicode code point ) Only a few functions in the Windows API treat LPWSTR as UTF-16.
BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages. So, the W functions can and do take UTF-16 directly, and in fact the Phobos implementation does use the W functions, transmitting wchar[] to them, and it works fine. The neat thing about Phobos is it adapts to whether you are using Win9x, full 32 bit Windows, or Linux, and adjusts the char output accordingly so it "just works."
----------------- "D strings are utf encoded sequences only" is a design mistake, IMO. On disk (serialized form) - yes. But not in memory for manipulation, please.
There isn't any better method of handling international character sets in a portable way. Code pages have serious, crippling, unfixable problems - including all the downsides of multibyte systems (because the Asian code pages are multibyte).
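The foreach support mentioned here, for reference - the compiler decodes UTF-8 on the fly when the loop variable is typed dchar:

void main()
{
    char[] s = "aтz";   // 1-, 2- and 1-unit characters
    foreach (size_t i, dchar c; s)
    {
        // i is the code-unit index where each character starts: 0, 1, 3
        // c is the fully decoded code point
    }
}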
Jul 31 2006
"Walter Bright" <newshound digitalmars.com> wrote in message news:eam1ec$10e1$1 digitaldaemon.com...Andrew Fedoniouk wrote:Sorry but strings in DMDScript are quite different in terms of 0) there are no such thing as char in JavaScript. 1) strings are Strings - not vectors of octets - js::string[] and d::char[] are different things. 2) are not supposed to be used by any OS API. 3) there are 12 or so methods of String class in JS - limited perimeter - what model you've choosen to store them is irrelevant - in some implementations they represented even by list of fixed runs.The problem as I can see is this: D propose to use transport encoding for manipulation purposes which is main problem imo here - transport encodings are not designed for the manipulation - it is extremely difficult to use them for manipualtion in practice as we may see.I disagree the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.It's also certainly easier than codepage based multibyte designs like shift-JIS (I used to write code for shift-JIS).You are not right here. TextOutA and TextOutW are both supported by Win98. And intention in Harmonia was to use only those ***W functions which come out of the box on Win98 (without need of MSLU)Encoding like UTF-8 and UTF-16 are almost useless with let's say Windows API, say TextOutA and TextOutW functions. Neither one of them will accept D's char[] and wchar[] directly. - ***A functions in Windows take byte string (LPSTR) and current codepage id to render text. ( byte + codepage = Unicode Code Point )Win9x only supports the A functions,and Phobos does a translation of the output into the Win9x code page when running on Win9x. Of course, this fails when one has characters not supported by Win9x, but code pages aren't going to help that either. Win9x is obsolete anyway, and there's no reason to cripple a new language by accommodating the failures of an obsolete system.There is a huge market of embedded devices. If you think that computer evolution expands only in more-ram-speed direction than you are in trouble. http://www.litepc.com/graphics/eossystem.jpgWhen running on NT or later Windows, the W functions are used instead which work directly with UTF-16. Later Windows also support UTF-8 with the A functions.http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspxSorry this scares me "BMP is a proper subset of UTF-16" UTF-16 is a group name of *byte stream encodings* (UTF-16LE and UTF-16BE) of Unicode Code Set. BTW: which one of this UTFs D uses? Platform dependent I beleive.- ***W functions in Windows use LPWSTR things which are sequence of codepoints from Unicode Basic Multilingual Plane (BMP). ( cast(dword) word = Unicode Code Point ) Only few functions in Windows API treat LPWSTR as UTF-16.BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.So, the W functions can and do take UTF-16 directly, and in fact the Phobos implementation does use the W functions, transmitting wchar[] to them, and it works fine. 
The neat thing about Phobos is it adapts to whether you are using Win9x, full 32 bit Windows, or Linux, and adjusts the char output accordingly so it "just works."
It should work well. Efficient, I mean.
The language shall be agnostic to the meaning of char as much as possible. It shall not prevent you from writing effective algorithms.
We are speaking in different languages: A: "strings are utf encoded sequences only" is a design mistake. W: "use any encoding other than utf" is a design mistake. Different meaning, eh?
Forget about codepages. Let those who are aware of them deal with them efficiently. "Codepage" (c) Walter (e.g. ASCII) is an efficient way of representing text. That is it. Others who can afford the full set will work with full 21-bit values. Practically it is enough to have 16 bits (BMP) but...
Andrew Fedoniouk. http://terrainformatica.com
----------------- "D strings are utf encoded sequences only" is a design mistake, IMO. On disk (serialized form) - yes. But not in memory for manipulation, please.
There isn't any better method of handling international character sets in a portable way. Code pages have serious, crippling, unfixable problems - including all the downsides of multibyte systems (because the Asian code pages are multibyte).
Jul 31 2006
On Mon, 31 Jul 2006 18:23:19 -0700, Andrew Fedoniouk wrote: "Walter Bright" <newshound digitalmars.com> wrote in message news:eam1ec$10e1$1 digitaldaemon.com... Andrew Fedoniouk wrote:
The problem as I can see it is this: D proposes to use a transport encoding for manipulation purposes, which is the main problem here, IMO - transport encodings are not designed for manipulation - it is extremely difficult to use them for manipulation in practice, as we may see.
I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
Sorry, but strings in DMDScript are quite different, in that: 0) there is no such thing as char in JavaScript. 1) strings are Strings - not vectors of octets - js::string[] and d::char[] are different things. 2) they are not supposed to be used by any OS API. 3) there are 12 or so methods of the String class in JS - a limited perimeter - so what model you've chosen to store them with is irrelevant - in some implementations they are represented even by a list of fixed runs.
For what it's worth, to do *character* manipulation I convert strings to UTF-32, do my stuff and convert back to the initial format.
char[] somefunc(char[] x) { return std.utf.toUTF8( somefunc( std.utf.toUTF32(x) ) ); }
wchar[] somefunc(wchar[] x) { return std.utf.toUTF16( somefunc( std.utf.toUTF32(x) ) ); }
dchar[] somefunc(dchar[] x) { dchar[] result; ... return result; }
This seems to work fast enough for my purposes. DBuild (nee Build) uses this a lot.
-- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 1/08/2006 11:45:36 AM
Jul 31 2006
"Derek Parnell" <derek nomail.afraid.org> wrote in message news:8n0koj5wjiio.qwc8ok4mrvr3$.dlg 40tude.net...On Mon, 31 Jul 2006 18:23:19 -0700, Andrew Fedoniouk wrote:Derek, using dchar (ultimate char) is perfectly fine in DBuild(*) circumstances - you are parsing - not dealing with OS in each line. Using dchar has drawback - you need to recreate all string primitive ops from scratch including RegExp, etc. Again dchar is ok - the only not ok is a strange selection for dchar null/nothing/nihil/nil/whatever value. (* dbuild does not sound good in russian - very close to idiot in medical meaning consider builDer/buildDer/creaDor for example - with red D in the middle - stylish at least) Andrew."Walter Bright" <newshound digitalmars.com> wrote in message news:eam1ec$10e1$1 digitaldaemon.com...For what its worth, to do *character* manipulation I convert strings to UTF-32, do my stuff and convert back to the initial format. char[] somefunc(char[] x) { return std.utf.toUTF8( somefunc( std.utf.toUTF32(x) ) ); } wchar[] somefunc(wchar[] x) { return std.utf.toUTF16( somefunc( std.utf.toUTF32(x) ) ); } dchar[] somefunc(dchar[] x) { dchar[] result; ... return result; } This seems to work fast enough for my purposes. DBuild (nee Build) uses this a lot. --Andrew Fedoniouk wrote:Sorry but strings in DMDScript are quite different in terms of 0) there are no such thing as char in JavaScript. 1) strings are Strings - not vectors of octets - js::string[] and d::char[] are different things. 2) are not supposed to be used by any OS API. 3) there are 12 or so methods of String class in JS - limited perimeter - what model you've choosen to store them is irrelevant - in some implementations they represented even by list of fixed runs.The problem as I can see is this: D propose to use transport encoding for manipulation purposes which is main problem imo here - transport encodings are not designed for the manipulation - it is extremely difficult to use them for manipualtion in practice as we may see.I disagree the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
Jul 31 2006
On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk <news terrainformatica.com> wrote:
(* dbuild does not sound good in Russian - it is very close to 'idiot' in the medical sense; consider builDer/buildDer/creaDor, for example - with a red D in the middle - stylish at least) Andrew.
Really, Andrew, you are getting carried away in your demands. You almost sound self-centered :). dbuild is not made for Russians only. Almost any English word conceived for a name might just have some sort of bad connotation in any one of the thousands of languages in this world. Why should anyone feel obligated to accommodate your culture here? I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here. That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D -JJR
Jul 31 2006
"John Reimer" <terminal.node gmail.com> wrote in message news:op.tdlccd0b6gr7xp epsilon-alpha...On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk <news terrainformatica.com> wrote::D BTW: debilita [lat.] as a word with many variations is used in almost all laguages directly derived from latin. You can say d'buil' on streets of say Munich and they will undersatnd you. Trust me , free beer will be yours. So it is far from russian-centric :-P Andrew.(* dbuild does not sound good in russian - very close to idiot in medical meaning consider builDer/buildDer/creaDor for example - with red D in the iddle - stylish at least) Andrew.Really, Andrew, you are getting carried away in your demands. You almost sound self-centered :). dbuild is not made for Russians only. Almost any English word conceived for a name might just have some sort of bad connotation in any one of the thousands of languages in this world. Why should anyone feel obligated to accomodate your culture here? I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here. That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D
Aug 01 2006
Andrew Fedoniouk wrote: "John Reimer" <terminal.node gmail.com> wrote in message news:op.tdlccd0b6gr7xp epsilon-alpha... On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk <news terrainformatica.com> wrote:
(* dbuild does not sound good in Russian - it is very close to 'idiot' in the medical sense; consider builDer/buildDer/creaDor, for example - with a red D in the middle - stylish at least) Andrew.
Really, Andrew, you are getting carried away in your demands. You almost sound self-centered :). dbuild is not made for Russians only. Almost any English word conceived for a name might just have some sort of bad connotation in any one of the thousands of languages in this world. Why should anyone feel obligated to accommodate your culture here? I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here. That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D
:D BTW: debilita [lat.] as a word with many variations is used in almost all languages directly derived from Latin. You can say d'buil' on the streets of, say, Munich and they will understand you. Trust me, free beer will be yours. So it is far from Russian-centric :-P Andrew.
As "débile" in French, pronounced something like "day bill". One has to correctly pronounce the ending D of dbuild to disambiguate it, but since we generally know what we're speaking about in an IT-related discussion, it should be OK, or even funny if we ambiguously pronounce it in the presence of humorous enough people.
Aug 01 2006
Andrew Fedoniouk wrote:"Walter Bright" <newshound digitalmars.com> wrote in message news:eam1ec$10e1$1 digitaldaemon.com...ECMAScript 262-3 (Javascript) defines the source character set to be UTF-16, and the source character set is what JS programs manipulate for strings and characters.Andrew Fedoniouk wrote:Sorry but strings in DMDScript are quite different in terms of 0) there are no such thing as char in JavaScript.The problem as I can see is this: D propose to use transport encoding for manipulation purposes which is main problem imo here - transport encodings are not designed for the manipulation - it is extremely difficult to use them for manipualtion in practice as we may see.I disagree the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.1) strings are Strings - not vectors of octets - js::string[] and d::char[] are different things. 2) are not supposed to be used by any OS API. 3) there are 12 or so methods of String class in JS - limited perimeter - what model you've choosen to store them is irrelevant - in some implementations they represented even by list of fixed runs.I agree how it's stored in the JS implementation is irrelevant. My point was that in DMDScript they are stored as utf-8 strings, and they work with only minor extra effort - DMDScript implements all the string handling functions JS defines.You're right in that Win98 exports a small handful of W functions without MSLU - but what those W functions actually do under the hood is translate the data based on the current code page and then call the corresponding A function. In other words, the Win9x W functions are rather pointless and don't support characters that are not in the current code page anyway. MSLU extends the same poor behavior to a bunch more pseudo W functions. This is why Phobos does not call W functions under Win9x. Conversely, the A functions under NT and later translate the characters to - you guessed it - UTF-16 and then call the corresponding W function. This is why Phobos under NT does not call the A functions.You are not right here. TextOutA and TextOutW are both supported by Win98. And intention in Harmonia was to use only those ***W functions which come out of the box on Win98 (without need of MSLU)- ***A functions in Windows take byte string (LPSTR) and current codepage id to render text. ( byte + codepage = Unicode Code Point )Win9x only supports the A functions,I agree there's a huge ecosystem of 32 bit embedded processors. And D works fine with Win9x - it just isn't crippled by Win9x's shortcomings.Win9x is obsolete anyway, and there's no reason to cripple a new language by accommodating the failures of an obsolete system.There is a huge market of embedded devices. If you think that computer evolution expands only in more-ram-speed direction than you are in trouble. http://www.litepc.com/graphics/eossystem.jpgThat is consistent with what I wrote about it.When running on NT or later Windows, the W functions are used instead which work directly with UTF-16. Later Windows also support UTF-8 with the A functions.http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspxD has been used for many years with foreign languages under Windows. If UTF-16 didn't work with Windows, I think it would have come up by now <g>. 
As for whether it is LE or BE, it is whatever the local platform is, just like ints, shorts, longs, etc. are.Sorry this scares me "BMP is a proper subset of UTF-16" UTF-16 is a group name of *byte stream encodings* (UTF-16LE and UTF-16BE) of Unicode Code Set. BTW: which one of this UTFs D uses? Platform dependent I beleive.- ***W functions in Windows use LPWSTR things which are sequence of codepoints from Unicode Basic Multilingual Plane (BMP). ( cast(dword) word = Unicode Code Point ) Only few functions in Windows API treat LPWSTR as UTF-16.BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.Yes.So, the W functions can and do take UTF-16 directly, and in fact the Phobos implementation does use the W functions, transmitting wchar[] to them, and it works fine. The neat thing about Phobos is it adapts to whether you are using Win9x, full 32 bit Windows, or Linux, and adjusts the char output accordingly so it "just works."It should work well. Efficent I mean.The language shall be agnostic to the meaning of char as much as possible.That's C/C++'s approach, and it does not work very well. Check out tchar.h, there's a lovely disaster <g>. For another, just try using std::string with shift-JIS.It shall not prevent you to write effective algorithms.Does UTF-8 prevent writing effective algorithms? I don't see how. DMDScript works, and is faster than any other JS implementation out there, including my own C++ version <g>. And frankly, my struggles with trying to internationalize C++ code for DMDScript is what led to D's support for UTF. The D implementation is shorter, simpler, and faster than the C++ one (which uses wchar's).Practically it is enough to have 16 (BMP) but...I agree you can write code using BMP and ignore surrogate pairs today, and you'll probably never notice the bugs. But sooner or later, the surrogate pair problem is going to show up. Windows, Java, and Javascript have all had to go back and redo to deal with surrogate pairs.
Jul 31 2006
"Walter Bright" <newshound digitalmars.com> wrote in message news:eamql8$1jgc$1 digitaldaemon.com...Andrew Fedoniouk wrote:Walter, please, forget about such thing as "character set is UTF-16" it is a non-sense. Regarding ECMA-262: "A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 2.1 or later, and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form..." It is quite different from your interpretation. Compiler accepts input stream as either BMP codes or full unicode set encoded using UTF-16. There is no mentioning that String[n] will return you utf-16 code unit. That will be weird."Walter Bright" <newshound digitalmars.com> wrote in message news:eam1ec$10e1$1 digitaldaemon.com...ECMAScript 262-3 (Javascript) defines the source character set to be UTF-16, and the source character set is what JS programs manipulate for strings and characters.Andrew Fedoniouk wrote:Sorry but strings in DMDScript are quite different in terms of 0) there are no such thing as char in JavaScript.The problem as I can see is this: D propose to use transport encoding for manipulation purposes which is main problem imo here - transport encodings are not designed for the manipulation - it is extremely difficult to use them for manipualtion in practice as we may see.I disagree the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.Again it is up to you how they are stored internally and what you did there. In D situation is completely different - there is a char and char[] opened to all winds.1) strings are Strings - not vectors of octets - js::string[] and d::char[] are different things. 2) are not supposed to be used by any OS API. 3) there are 12 or so methods of String class in JS - limited perimeter - what model you've choosen to store them is irrelevant - in some implementations they represented even by list of fixed runs.I agree how it's stored in the JS implementation is irrelevant. My point was that in DMDScript they are stored as utf-8 strings, and they work with only minor extra effort - DMDScript implements all the string handling functions JS defines.I wouldn't be so pessimistic about Win98 :)You're right in that Win98 exports a small handful of W functions without MSLU - but what those W functions actually do under the hood is translate the data based on the current code page and then call the corresponding A function. In other words, the Win9x W functions are rather pointless and don't support characters that are not in the current code page anyway. MSLU extends the same poor behavior to a bunch more pseudo W functions. This is why Phobos does not call W functions under Win9x.You are not right here. TextOutA and TextOutW are both supported by Win98. And intention in Harmonia was to use only those ***W functions which come out of the box on Win98 (without need of MSLU)- ***A functions in Windows take byte string (LPSTR) and current codepage id to render text. ( byte + codepage = Unicode Code Point )Win9x only supports the A functions,Conversely, the A functions under NT and later translate the characters to - you guessed it - UTF-16 and then call the corresponding W function. This is why Phobos under NT does not call the A functions.Ok. 
And how do you call the A functions? Do you use the proposed koi8chars, latin1chars, etc.? You are using char for that. But wait, char cannot contain anything other than UTF-8 :-P
No doubt about it.
I agree there's a huge ecosystem of 32 bit embedded processors. And D works fine with Win9x - it just isn't crippled by Win9x's shortcomings.
That is consistent with what I wrote about it.
D has been used for many years with foreign languages under Windows. If UTF-16 didn't work with Windows, I think it would have come up by now <g>. As for whether it is LE or BE, it is whatever the local platform is, just like ints, shorts, longs, etc. are.
Why? JavaScript for example has no such thing as char. String.charAt() returns guess what? Correct - a String object. No char - no problem :D Why do they need to redefine anything then? Again - let people decide what char is and how to interpret it. And that will be it. Phobos can work with UTF-8/16 and satisfy you and other UTF-masochists (no offence implied). Ordinary people will do their own strings anyway. Just give them opAssign and a dtor in structs and you will see an explosion of perfect strings. Changing the char init value to 0 will not harm anybody but will allow char to be used for other than UTF-8 purposes - it is only one of the 40 or so encodings in active use anyway. For persistence purposes (in a compiled EXE) UTF is probably the best choice. But at runtime - please, not at the language level. Educated IMO, of course. Andrew.
Yes.
So, the W functions can and do take UTF-16 directly, and in fact the Phobos implementation does use the W functions, transmitting wchar[] to them, and it works fine. The neat thing about Phobos is it adapts to whether you are using Win9x, full 32 bit Windows, or Linux, and adjusts the char output accordingly so it "just works."
It should work well. Efficient, I mean.
The language shall be agnostic to the meaning of char as much as possible.
That's C/C++'s approach, and it does not work very well. Check out tchar.h, there's a lovely disaster <g>. For another, just try using std::string with shift-JIS.
It shall not prevent you from writing effective algorithms.
Does UTF-8 prevent writing effective algorithms? I don't see how.
DMDScript works, and is faster than any other JS implementation out there, including my own C++ version <g>. And frankly, my struggles with trying to internationalize C++ code for DMDScript are what led to D's support for UTF. The D implementation is shorter, simpler, and faster than the C++ one (which uses wchar's).
Practically it is enough to have 16 bits (BMP) but...
I agree you can write code using BMP and ignore surrogate pairs today, and you'll probably never notice the bugs. But sooner or later, the surrogate pair problem is going to show up. Windows, Java, and Javascript have all had to go back and redo things to deal with surrogate pairs.
Aug 01 2006
Andrew Fedoniouk wrote: The compiler accepts the input stream as either BMP codes or the full Unicode set encoded using UTF-16.
BMP is a subset of UTF-16.
There is no mention that String[n] will return a UTF-16 code unit. That would be weird.
String.charCodeAt() will give you the utf-16 code unit.
Take a look at std.file for an example.
Conversely, the A functions under NT and later translate the characters to - you guessed it - UTF-16 and then call the corresponding W function. This is why Phobos under NT does not call the A functions.
Ok. And how do you call the A functions?
See String.fromCharCode() and String.charCodeAt()
Windows, Java, and Javascript have all had to go back and redo things to deal with surrogate pairs.
Why? JavaScript for example has no such thing as char. String.charAt() returns guess what? Correct - a String object. No char - no problem :D
Again - let people decide what char is and how to interpret it. And that will be it.
I've already explained the problems C/C++ have with that. They're real problems, bad and unfixable enough that there are official proposals to add new UTF basic types to C++.
Phobos can work with UTF-8/16 and satisfy you and other UTF-masochists (no offence implied).
C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).
Ordinary people will do their own strings anyway. Just give them opAssign and a dtor in structs and you will see an explosion of perfect strings. Changing the char init value to 0 will not harm anybody but will allow char to be used for other than UTF-8 purposes - it is only one of the 40 or so encodings in active use anyway. For persistence purposes (in a compiled EXE) UTF is probably the best choice. But at runtime - please, not at the language level.
ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
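A sketch of that ubyte[] approach (KOI8-R byte values from the standard table; the typedef keeps the encoding from silently mixing with char[]):

typedef ubyte koi8char;        // distinct type, not implicitly a char
alias koi8char[] koi8string;

void main()
{
    static koi8char[4] word = [0xD4, 0xC5, 0xD3, 0xD4];  // "тест" in KOI8-R
    koi8string s = word.dup;
    assert(s.length == 4);     // one code unit per character in this encoding
}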
Aug 01 2006
(Hope this long dialog will help all of us to better understand what UNICODE is) "Walter Bright" <newshound digitalmars.com> wrote in message news:eao5st$2r1f$1 digitaldaemon.com... Andrew Fedoniouk wrote:
The compiler accepts the input stream as either BMP codes or the full Unicode set encoded using UTF-16.
BMP is a subset of UTF-16.
Walter, with deepest respect, but it is not. They are two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking. If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes you are in trouble. See: the sequence of two words D834 DD1E as UTF-16 will give you one Unicode character with code 0x1D11E ( musical G clef ). And the same sequence interpreted as a UCS-2 sequence will give you two (invalid, non-printable, but still) character codes. You will get a different string length, at the least.
You mean here?: char* namez = toMBSz(name); h = CreateFileA(namez,GENERIC_WRITE,0,null,CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,cast(HANDLE)null); char* here is far from a UTF-8 sequence.
There is no mention that String[n] will return a UTF-16 code unit. That would be weird.
String.charCodeAt() will give you the utf-16 code unit.
Take a look at std.file for an example.
Conversely, the A functions under NT and later translate the characters to - you guessed it - UTF-16 and then call the corresponding W function. This is why Phobos under NT does not call the A functions.
Ok. And how do you call the A functions?
ECMA-262: String.prototype.charCodeAt (pos) Returns a number (a nonnegative integer less than 2^16) representing the code point value of the character at position pos in the string.... As you may see, it returns a (Unicode) *code point* from the BMP set, which is far from the UTF-16 code unit you declared above. Relaxing "a nonnegative integer less than 2^16" to "a nonnegative integer less than 2^21" will not harm anybody. Or at least the probability of harm is vanishingly small.
See String.fromCharCode() and String.charCodeAt()
Windows, Java, and Javascript have all had to go back and redo things to deal with surrogate pairs.
Why? JavaScript for example has no such thing as char. String.charAt() returns guess what? Correct - a String object. No char - no problem :D
Basic types of what?
Again - let people decide what char is and how to interpret it. And that will be it.
I've already explained the problems C/C++ have with that. They're real problems, bad and unfixable enough that there are official proposals to add new UTF basic types to C++.
Because char in C is not supposed to hold multi-byte encodings. At least std::string is a strictly single-byte thing by definition. And this is perfectly fine. There is wchar_t for holding the OS-supported range in full. On Win32, wchar_t is 16 bit (UCS-2 legacy), and in GCC/*nix it is 32 bit.
Phobos can work with UTF-8/16 and satisfy you and other UTF-masochists (no offence implied).
C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).
Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea? Andrew.
Ordinary people will do their own strings anyway. Just give them opAssign and a dtor in structs and you will see an explosion of perfect strings.
Changing the char init value to 0 will not harm anybody but will allow char to be used for other than UTF-8 purposes - it is only one of the 40 or so encodings in active use anyway. For persistence purposes (in a compiled EXE) UTF is probably the best choice. But at runtime - please, not at the language level.
ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
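Andrew's D834 DD1E example, spelled out in D (std.utf.encode appends the encoded form of a code point):

import std.utf;

void main()
{
    wchar[] w;
    encode(w, cast(dchar) 0x1D11E);   // musical G clef, outside the BMP
    assert(w.length == 2);            // one character, two UTF-16 code units
    assert(w[0] == 0xD834 && w[1] == 0xDD1E);   // the surrogate pair
}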
Aug 01 2006
On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote: (Hope this long dialog will help all of us to better understand what UNICODE is) "Walter Bright" <newshound digitalmars.com> wrote in message news:eao5st$2r1f$1 digitaldaemon.com... Andrew Fedoniouk wrote:
Walter, with deepest respect, but it is not. They are two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking.
The compiler accepts the input stream as either BMP codes or the full Unicode set encoded using UTF-16.
BMP is a subset of UTF-16.
Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now. ...
ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?
Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions. My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake. I would have liked something more like ...
char ==> An unsigned 8-bit byte. An alias for ubyte.
schar ==> A UTF-8 code unit.
wchar ==> A UTF-16 code unit.
dchar ==> A UTF-32 code unit.
char[] ==> A 'C' string
schar[] ==> A UTF-8 string
wchar[] ==> A UTF-16 string
dchar[] ==> A UTF-32 string
And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[]. -- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 1:08:51 PM
Aug 01 2006
"Derek Parnell" <derek nomail.afraid.org> wrote in message news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg 40tude.net...On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:Yes, Derek, this will be probably near the ideal. Andrew.(Hope this long dialog will help all of us to better understand what UNICODE is) "Walter Bright" <newshound digitalmars.com> wrote in message news:eao5st$2r1f$1 digitaldaemon.com...Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now. ...Andrew Fedoniouk wrote:Walter with deepest respect but it is not. Two different things. UTF-16 is a variable-length enconding - byte stream. Unicode BMP is a range of numbers strictly speaking.Compiler accepts input stream as either BMP codes or full unicode setencoded using UTF-16. BMP is a subset of UTF-16.Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions. My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems that the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake. I would have liked something more like ... char ==> An unsigned 8-bit byte. An alias for ubyte. schar ==> A UTF-8 code unit. wchar ==> A UTF-16 code unit. dchar ==> A UTF-32 code unit. char[] ==> A 'C' string schar[] ==> A UTF-8 string wchar[] ==> A UTF-16 string dchar[] ==> A UTF-32 string And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].ubyte[] will enable you to use any encoding you wish - and that's what it's there for.Thus the whole set of Windows API headers (and std.c.string for example) seen in D has to be rewrited to accept ubyte[]. As char in D is not char in C Is this the idea?
Aug 01 2006
On Tue, 1 Aug 2006 21:04:10 -0700, Andrew Fedoniouk <news terrainformatica.com> wrote: "Derek Parnell" <derek nomail.afraid.org> wrote in message news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg 40tude.net... On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
Yes, Derek, this is probably near the ideal.
(Hope this long dialog will help all of us to better understand what UNICODE is) "Walter Bright" <newshound digitalmars.com> wrote in message news:eao5st$2r1f$1 digitaldaemon.com...
Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now. ...
Andrew Fedoniouk wrote: Walter, with deepest respect, but it is not. They are two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking.
The compiler accepts the input stream as either BMP codes or the full Unicode set encoded using UTF-16. BMP is a subset of UTF-16.
Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions. My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake. I would have liked something more like ...
char ==> An unsigned 8-bit byte. An alias for ubyte.
schar ==> A UTF-8 code unit.
wchar ==> A UTF-16 code unit.
dchar ==> A UTF-32 code unit.
char[] ==> A 'C' string
schar[] ==> A UTF-8 string
wchar[] ==> A UTF-16 string
dchar[] ==> A UTF-32 string
And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].
ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?
Yet, I don't find it at all difficult to think of them like so:
ubyte ==> An unsigned 8-bit byte.
char ==> A UTF-8 code unit.
wchar ==> A UTF-16 code unit.
dchar ==> A UTF-32 code unit.
ubyte[] ==> A 'C' string
char[] ==> A UTF-8 string
wchar[] ==> A UTF-16 string
dchar[] ==> A UTF-32 string
If you want to program in D you _will_ have to readjust your thinking in some areas, this is one of them. All you have to realise is that 'char' in D is not the same as 'char' in C. In quick and dirty ASCII-only applications I can adjust my thinking further:
char ==> An ASCII character
char[] ==> An ASCII string
I do however agree that C functions used in D should be declared like: int strlen(ubyte* s); and not like (as they currently are): int strlen(char* s); The problem with this is that the code: char[] s = "test"; strlen(s) would produce a compile error, and require a cast or a conversion function (toMBSz perhaps, which in many cases will not need to do anything). Of course the purists would say "That's perfectly correct, strlen cannot tell you the length of a UTF-8 string, only its byte count", but at the same time it would be nice (for quick and dirty ASCII-only programs) if it worked. Is it possible to declare them like this: int strlen(void* s); and for char[] to be implicitly 'paintable' as void*, as char[] is already implicitly 'paintable' as void[]? It seems like it would nicely solve the problem of people seeing: int strlen(char* s); and thinking D's char is the same as C's char, without introducing a painful need for cast or conversion in simple ASCII-only situations. Regan
Aug 01 2006
"Regan Heath" <regan netwin.co.nz> wrote in message news:optdm2gghi23k2f5 nrage...On Tue, 1 Aug 2006 21:04:10 -0700, Andrew Fedoniouk <news terrainformatica.com> wrote:Another option will be to change char.init to 0 and forget about the problem left it as it is now. Some good string implementation will contain encoding field in string instance if needed. Andrew."Derek Parnell" <derek nomail.afraid.org> wrote in message news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg 40tude.net...Yet, I don't find it at all difficult to think of them like so: ubyte ==> An unsigned 8-bit byte. char ==> A UTF-8 code unit. wchar ==> A UTF-16 code unit. dchar ==> A UTF-32 code unit. ubyte[] ==> A 'C' string char[] ==> A UTF-8 string wchar[] ==> A UTF-16 string dchar[] ==> A UTF-32 string If you want to program in D you _will_ have to readjust your thinking in some areas, this is one of them. All you have to realise is that 'char' in D is not the same as 'char' in C. In quick and dirty ASCII only applications I can adjust my thinking further: char ==> An ASCII character char[] ==> An ASCII string I do however agree that C functions used in D should be declared like: int strlen(ubyte* s); and not like (as they currently are): int strlen(char* s); The problem with this is that the code: char[] s = "test"; strlen(s) would produce a compile error, and require a cast or a conversion function (toMBSz perhaps, which in many cases will not need to do anything). Of course the purists would say "That's perfectly correct, strlen cannot tell you the length of a UTF-8 string, only it's byte count", but at the same time it would be nice (for quick and dirty ASCII only programs) if it worked. Is it possible to declare them like this? int strlen(void* s); and for char[] to be implicitly 'paintable' as void* as char[] is already implicitly 'paintable' as void[]? It seems like it would nicely solve the problem of people seeing: int strlen(char* s); and thinking D's char is the same as C's char without introducing a painful need for cast or conversion in simple ASCII only situations. ReganOn Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:Yes, Derek, this will be probably near the ideal.(Hope this long dialog will help all of us to better understand what UNICODE is) "Walter Bright" <newshound digitalmars.com> wrote in message news:eao5st$2r1f$1 digitaldaemon.com...Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now. ...Andrew Fedoniouk wrote:Walter with deepest respect but it is not. Two different things. UTF-16 is a variable-length enconding - byte stream. Unicode BMP is a range of numbers strictly speaking.Compiler accepts input stream as either BMP codes or full unicode setencoded using UTF-16. BMP is a subset of UTF-16.Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions. My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems that the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake. 
I would have liked something more like ...

char    ==> An unsigned 8-bit byte. An alias for ubyte.
schar   ==> A UTF-8 code unit.
wchar   ==> A UTF-16 code unit.
dchar   ==> A UTF-32 code unit.
char[]  ==> A 'C' string
schar[] ==> A UTF-8 string
wchar[] ==> A UTF-16 string
dchar[] ==> A UTF-32 string

And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].

ubyte[] will enable you to use any encoding you wish - and that's what it's there for.

Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?
Aug 01 2006
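Regan's "byte count" point is easy to demonstrate. A minimal sketch, assuming a D1-era compiler and Phobos (the string literal is arbitrary): strlen sees UTF-8 code units, while a foreach with a dchar loop variable decodes whole code points.

    import std.c.string;   // strlen
    import std.string;     // toStringz

    void main()
    {
        char[] s = "naïve";                   // 'ï' takes two UTF-8 bytes
        size_t bytes = strlen(toStringz(s));  // code units: 6
        size_t chars = 0;
        foreach (dchar c; s)                  // foreach decodes UTF-8 here
            chars++;                          // code points: 5
        assert(bytes == 6 && chars == 5);
    }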
I'm trying to understand why this 0 thing is such an issue. If your second statement is valid, it makes the first moot - 0 or no 0. Why does it matter, then?

-[Unknown]

Another option will be to change char.init to 0 and forget about the problem, or leave it as it is now. A good string implementation will contain an encoding field in the string instance if needed. Andrew.
Aug 01 2006
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message news:eapdsg$qeo$1 digitaldaemon.com...I'm trying to understand why this 0 thing is such an issue. If your second statement is valid, it makes the first moot - 0 or no 0. Why does it matter, then?Declaration of char.init == 0 pretty much means that D has no strict requirement that char[] shall contain only UTF-8 encoded sequences but any other encodings suitable for the application. char.init == 0 will resolve situation we see in Phobos now. char[] de facto is used for other than utf-8 encodings. char.init == 0 tells everybody that char can also be used for representing unicode *code points* with asuumption that offset value (mapping on full Unicode set, aka codepage) is stored somewhere in application or well known to it. char.init == 0 also highlights the fact that it is safe to use char[] as C string processing functions and passing them to non D modules and libraries. Is it UTF-8 encoded or not - does not matter - type is universal enough. Andrew.-[Unknown]Another option will be to change char.init to 0 and forget about the problem left it as it is now. Some good string implementation will contain encoding field in string instance if needed. Andrew.
Aug 01 2006
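What the current default actually does is simple to observe. A minimal sketch, assuming a D1-era compiler where char.init is 0xFF and the integral types default to zero:

    import std.stdio;

    void main()
    {
        char c;                // default-initialized to char.init
        char[4] buf;           // every element gets char.init too
        ubyte b;               // integral types default to 0 instead

        assert(c == char.init);
        assert(char.init == 0xFF);   // deliberately not valid UTF-8
        assert(b == 0);
        writefln("char.init = %02X", cast(ubyte) char.init);
    }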
Andrew Fedoniouk wrote:

"Unknown W. Brackets" <unknown simplemachines.org> wrote in message news:eapdsg$qeo$1 digitaldaemon.com...

Why is this good?

I'm trying to understand why this 0 thing is such an issue. If your second statement is valid, it makes the first moot - 0 or no 0. Why does it matter, then?

Declaring char.init == 0 pretty much means that D has no strict requirement that char[] shall contain only UTF-8 encoded sequences; it may hold any other encoding suitable for the application.

char.init == 0 will resolve the situation we see in Phobos now: char[] is de facto used for encodings other than UTF-8.

You mean data with other encodings that still want to use the std.string functions? I have written template versions that replace (almost) all std.string functions that do not rely on encoding.

char.init == 0 tells everybody that char can also be used for representing Unicode *code points*, with the assumption that the offset value (the mapping onto the full Unicode set, a.k.a. the codepage) is stored somewhere in the application or well known to it.

Maybe it would tell people that. A good thing it isn't so, then. Again, why do you want to store non-UTF-8 data in a char[]? What is wrong with ubyte[] or a suitable typedef?

char.init == 0 also highlights the fact that it is safe to use char[] with C string processing functions and to pass char[]s to non-D modules and libraries. Whether it is UTF-8 encoded or not does not matter - the type is universal enough.

I can't see how that would make it considerably safer.

/Oskar
Aug 02 2006
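Oskar's typedef suggestion is straightforward to realize. A sketch with hypothetical names - koi8_u and koi8string are inventions here - relying on D's typedef creating a distinct type, so KOI-8 bytes cannot silently flow into UTF-8 parameters:

    typedef ubyte koi8_u;        // distinct type: not interchangeable with ubyte
    alias koi8_u[] koi8string;   // hypothetical name for KOI-8 encoded text

    void printUtf8(char[] s) { /* expects valid UTF-8 */ }

    void main()
    {
        ubyte[] raw = [cast(ubyte) 0xEB, 0xCF, 0xD4];  // some KOI-8 bytes
        koi8string s = cast(koi8string) raw;           // explicit, deliberate cast
        // printUtf8(s);   // would not compile: koi8_u[] is not char[]
    }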
I fail to understand why I want another ambiguous type in my programming. I am glad that when I type "int", I know I have a number and not a pointer. I am glad that when I type char, I again know what I have. No guesswork. Your proposals sound like shooting myself in the foot. No fun. I'll take that helmet you offered first.

-[Unknown]

"Unknown W. Brackets" <unknown simplemachines.org> wrote in message news:eapdsg$qeo$1 digitaldaemon.com...

I'm trying to understand why this 0 thing is such an issue. If your second statement is valid, it makes the first moot - 0 or no 0. Why does it matter, then?

Declaring char.init == 0 pretty much means that D has no strict requirement that char[] shall contain only UTF-8 encoded sequences; it may hold any other encoding suitable for the application. char.init == 0 will resolve the situation we see in Phobos now: char[] is de facto used for encodings other than UTF-8. char.init == 0 tells everybody that char can also be used for representing Unicode *code points*, with the assumption that the offset value (the mapping onto the full Unicode set, a.k.a. the codepage) is stored somewhere in the application or well known to it. char.init == 0 also highlights the fact that it is safe to use char[] with C string processing functions and to pass char[]s to non-D modules and libraries. Whether it is UTF-8 encoded or not does not matter - the type is universal enough.

Andrew.

-[Unknown]

Another option will be to change char.init to 0 and forget about the problem, or leave it as it is now. A good string implementation will contain an encoding field in the string instance if needed. Andrew.
Aug 02 2006
On Tue, 01 Aug 2006 22:40:56 -0700, Unknown W. Brackets wrote:

I'm trying to understand why this 0 thing is such an issue. If your second statement is valid, it makes the first moot - 0 or no 0. Why does it matter, then?

I think the issue is more that Andrew wants to have hex-FF as a legitimate byte value anywhere in a char[] variable. He misses the point that the purpose of not allowing it is so we can detect uninitialized UTF-8 strings at run-time.

Andrew, just use ubyte[] variables and you won't have a problem, apart from conversions between code-pages and Unicode <G>. In D, ubyte[] is the data structure designed to hold variable-length arrays of unsigned bytes, which is exactly what you need to implement the kind of strings you have in KOI-8 encoding.

-- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 4:24:27 PM
Aug 01 2006
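The run-time detection Derek refers to is what std.utf.validate provides. A minimal sketch, assuming D1-era Phobos; the 0xFF fill makes a never-written buffer fail loudly:

    import std.utf;   // validate() throws UtfException on malformed UTF-8

    void main()
    {
        char[8] buf;            // still all 0xFF: never written by our code
        bool caught = false;
        try
        {
            validate(buf);      // 0xFF cannot begin any UTF-8 sequence
        }
        catch (UtfException e)
        {
            caught = true;
        }
        assert(caught);
    }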
I think the issue is more that Andrew wants to have hex-FF as a legitimate byte value anywhere in a char[] variable. He misses the point that the purpose of not allowing it is so we can detect uninitialized UTF-8 strings at run-time.

What does "uninitialized" mean? They *are* initialized. This is the main point. For any type you can declare an initial value. I bet you choose not non-existent values for, say, enums, but some really meaningful default values. Having strings filled with FFs means that you will get problems of a different kind - partially initialized strings. Could you tell me, have you ever had a situation where FF-filled strings helped you to find a problem? And if yes, how is that in principle different from catching strings filled with zeros? Can anyone here say that these FFs helped to find a problem?

Andrew.
Aug 02 2006
Andrew Fedoniouk wrote:

Can anyone here say that these FFs helped to find a problem?

Yes, I found two bugs in my own code with it that would have been hidden with the 0 initialization.
Aug 02 2006
On Wed, 2 Aug 2006 00:08:42 -0700, Andrew Fedoniouk wrote:

I think the issue is more that Andrew wants to have hex-FF as a legitimate byte value anywhere in a char[] variable. He misses the point that the purpose of not allowing it is so we can detect uninitialized UTF-8 strings at run-time.

What does "uninitialized" mean? They *are* initialized.

Andrew, I will assume you are not trying to be difficult, but that maybe your English is a bit too literal. Of course, in the clinical sense they are initialized, because data is moved into them before your code has a chance to do anything. However, when I say "detect uninitialized UTF-8 strings" I mean "detect UTF-8 strings that have not been initialized by your own code". Is that better?

This is the main point. For any type you can declare an initial value. I bet you choose not non-existent values for, say, enums, but some really meaningful default values.

Huh??? Now you are being difficult. The purpose of enums is to have them initialized to values that make sense in their context. But the default values for enums generally work for me, as the exact value doesn't really matter in most cases.

enum AccountType { Savings, Investment, FixedLoan, Club, LineOfCredit }

I really don't care what values the compiler assigns to these enums. Sure, I could choose specific values, but it doesn't really matter.

Having strings filled with FFs means that you will get problems of a different kind - partially initialized strings.

Huh???? Why would I always get partially initialized strings, as you imply? And even if I did, then having 0xFF in them is going to help me track down some stupid code that I wrote.

Could you tell me, have you ever had a situation where FF-filled strings helped you to find a problem?

No. I haven't made that kind of mistake yet with my code.

And if yes, how is that in principle different from catching strings filled with zeros?

Because if I found a 0x00 in a string, I wouldn't know if it was legitimate or not.

Can anyone here say that these FFs helped to find a problem?

But if I found 0xFF I would know straight away that I've made a mistake somewhere. Actually, come to think about it, I did make a mistake once when my code was incorrectly interpreting a BOM in a text file. I loaded the file as if it was UTF-8, but it should have been UTF-16. DMD correctly told me I had a bad UTF string when I tried to write it out.

-- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 5:49:46 PM
Aug 02 2006
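Derek's BOM mix-up is cheap to guard against. A rough sketch of sniffing the byte-order mark before choosing a decoder; the file name and the bare reporting are placeholders:

    import std.file;    // read()
    import std.stdio;   // writefln

    void main()
    {
        ubyte[] raw = cast(ubyte[]) std.file.read("text.dat");  // placeholder name
        if (raw.length >= 3 && raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF)
            writefln("UTF-8 BOM");
        else if (raw.length >= 2 && raw[0] == 0xFF && raw[1] == 0xFE)
            writefln("UTF-16 little-endian BOM");
        else if (raw.length >= 2 && raw[0] == 0xFE && raw[1] == 0xFF)
            writefln("UTF-16 big-endian BOM");
        else
            writefln("no BOM - encoding must be known some other way");
    }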
On Wed, 02 Aug 2006 16:22:54 +1200, Regan Heath wrote:

Me too, but that's probably because I've not been immersed in C/C++ for the last 20-odd years ;-) I "think in D" now, and char[] is a UTF-8 string in my mind.

Yet, I don't find it at all difficult to think of them like so:

ubyte   ==> An unsigned 8-bit byte.
char    ==> A UTF-8 code unit.
wchar   ==> A UTF-16 code unit.
dchar   ==> A UTF-32 code unit.
ubyte[] ==> A 'C' string
char[]  ==> A UTF-8 string
wchar[] ==> A UTF-16 string
dchar[] ==> A UTF-32 string

char    ==> An unsigned 8-bit byte. An alias for ubyte.
schar   ==> A UTF-8 code unit.
wchar   ==> A UTF-16 code unit.
dchar   ==> A UTF-32 code unit.
char[]  ==> A 'C' string
schar[] ==> A UTF-8 string
wchar[] ==> A UTF-16 string
dchar[] ==> A UTF-32 string

And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].

Yes, Derek, this will probably be near the ideal.

If you want to program in D you _will_ have to readjust your thinking in some areas; this is one of them. All you have to realise is that 'char' in D is not the same as 'char' in C.

True, but Walter seems hell-bent on easing the transition to D for C/C++ refugees.

In quick and dirty ASCII-only applications I can adjust my thinking further:

char   ==> An ASCII character
char[] ==> An ASCII string

I do however agree that C functions used in D should be declared like:

int strlen(ubyte* s);

and not like (as they currently are):

int strlen(char* s);

The problem with this is that the code:

char[] s = "test";
strlen(s)

would produce a compile error, and require a cast or a conversion function (toMBSz perhaps, which in many cases will not need to do anything). Of course the purists would say "That's perfectly correct, strlen cannot tell you the length of a UTF-8 string, only its byte count", but at the same time it would be nice (for quick and dirty ASCII-only programs) if it worked.

And I'm a wannabe purist <G>

Is it possible to declare them like this?

int strlen(void* s);

and for char[] to be implicitly 'paintable' as void*, as char[] is already implicitly 'paintable' as void[]? It seems like that would nicely solve the problem of people seeing:

int strlen(char* s);

and thinking D's char is the same as C's char, without introducing a painful need for casts or conversions in simple ASCII-only situations.

It's the zero-terminator for C strings that will get in the way. We need a nice way of getting the compiler to ensure C strings are always terminated correctly.

-- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 2:48:43 PM
Aug 01 2006
On Wed, 2 Aug 2006 14:55:11 +1000, Derek Parnell <derek nomail.afraid.org> wrote:Is the zero-terminator for C strings that will get in the way. We need a nice way of getting the compiler to ensure C-strings are always terminated correctly.Good point. I neglected to mention that. Regan
Aug 01 2006
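The terminator concern is what std.string.toStringz addresses. A minimal sketch, assuming D1-era Phobos; toStringz returns a pointer guaranteed to end in '\0' (copying if it must):

    import std.string;     // toStringz
    import std.c.string;   // strlen

    void main()
    {
        char[] s = "test";
        char* p = toStringz(s);   // guaranteed '\0'-terminated for C callees
        assert(strlen(p) == 4);
    }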
Derek Parnell wrote: [snip]

char    ==> An unsigned 8-bit byte. An alias for ubyte.
schar   ==> A UTF-8 code unit.
wchar   ==> A UTF-16 code unit.
dchar   ==> A UTF-32 code unit.
char[]  ==> A 'C' string
schar[] ==> A UTF-8 string
wchar[] ==> A UTF-16 string
dchar[] ==> A UTF-32 string

Sure, although char, utf8, utf16, utf32 are much better choices, IMHO :) I'd be game to have them changed at this stage. It's not much more than some (extensive) global replacements. Don't think there's much need to check each instance. There's a nice shareware tool called "Active Search & Replace" which I've recently found to be very helpful in this regard.
Aug 01 2006
Derek Parnell wrote:

On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?

(Hope this long dialog will help all of us to better understand what UNICODE is)

"Walter Bright" <newshound digitalmars.com> wrote in message news:eao5st$2r1f$1 digitaldaemon.com...

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6, but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.

Andrew Fedoniouk wrote: Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking.

Compiler accepts input stream as either BMP codes or full unicode set encoded using UTF-16. BMP is a subset of UTF-16.
Aug 02 2006
On Wed, 02 Aug 2006 00:11:26 -0700, Walter Bright wrote:

Derek Parnell wrote:

Huh??? I said "UCS-2 is a subset of Unicode characters". Did you miss that?

On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?

(Hope this long dialog will help all of us to better understand what UNICODE is)

"Walter Bright" <newshound digitalmars.com> wrote in message news:eao5st$2r1f$1 digitaldaemon.com...

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6, but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.

Andrew Fedoniouk wrote: Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking.

Compiler accepts input stream as either BMP codes or full unicode set encoded using UTF-16. BMP is a subset of UTF-16.

UTF-16 is not a subset, as it can be used to encode every Unicode code point. UCS-2 is a subset, as it can *not* encode code points that are outside of the "basic multilingual plane" (aka BMP).

-- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 5:43:18 PM
Aug 02 2006
Derek Parnell wrote:

On Wed, 02 Aug 2006 00:11:26 -0700, Walter Bright wrote:

Derek Parnell wrote:

Huh??? I said "UCS-2 is a subset of Unicode characters". Did you miss that?

I saw it, but that statement is not the same as "UCS-2 is a subset of UTF-16". The issue I was talking about is "BMP [UCS-2] is a subset of UTF-16", to which Andrew keeps replying "it is not". You said "Andrew is correct", so I inferred you were agreeing that UCS-2 is not a subset of UTF-16.

On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?

"Walter Bright" <newshound digitalmars.com> wrote in message news:eao5st$2r1f$1 digitaldaemon.com...

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6, but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT had implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.

Andrew Fedoniouk wrote: Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking.

Compiler accepts input stream as either BMP codes or full unicode set encoded using UTF-16. BMP is a subset of UTF-16.

UTF-16 is not a subset, as it can be used to encode every Unicode code point. UCS-2 is a subset, as it can *not* encode code points that are outside of the "basic multilingual plane" (aka BMP).

I think you and I are in agreement.
Aug 02 2006
Andrew Fedoniouk wrote:

"Walter Bright" <newshound digitalmars.com> wrote in message

The only thing that UTF-16 adds is semantics for characters that are invalid in BMP. That makes UTF-16 a superset. It doesn't matter if you're strictly speaking, or if the jargon is different. UTF-16 is a superset of BMP, once you cut past the jargon and look at the underlying reality.

BMP is a subset of UTF-16.

Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking. If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes you are in trouble. See: the sequence of two words D834 DD1E as UTF-16 will give you one Unicode character with code 0x1D11E (musical G clef). The same sequence interpreted as a UCS-2 sequence will give you two (invalid, non-printable, but still) character codes. You will get a different length for the string, at the very least.

You could argue that for clarity namez should have been written as a ubyte*, but in the above code it would make no difference.

You mean here?

char* namez = toMBSz(name);
h = CreateFileA(namez, GENERIC_WRITE, 0, null, CREATE_ALWAYS,
    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, cast(HANDLE) null);

char* here is far from a UTF-8 sequence.

Ok. And how do you call A functions?

Take a look at std.file for an example.

There is no difference.

ECMA-262: String.prototype.charCodeAt(pos) returns a number (a nonnegative integer less than 2^16) representing the code point value of the character at position pos in the string... As you may see, it returns a (Unicode) *code point* from the BMP set, which is far from the UTF-16 code unit you've declared above.

See String.fromCharCode() and String.charCodeAt().

Windows, Java, and JavaScript have all had to go back and redo things to deal with surrogate pairs.

Why? JavaScript, for example, has no such thing as a char. String.charAt() returns guess what? Correct - a String object. No char - no problem :D

Relaxing "a nonnegative integer less than 2^16" to "a nonnegative integer less than 2^21" will not harm anybody. Or at least the probability of that is vanishingly small.

It'll break any code trying to deal with surrogate pairs.

Basic types for utf-8 and utf-16. Ironically, they wind up being very much like D's char and wchar types.

Basic types of what?

Again - let people decide what char is and how to interpret it, and that will be it.

I've already explained the problems C/C++ have with that. They're real problems, bad and unfixable enough that there are official proposals to add new UTF basic types to C++.

Standard functions in the C standard library to deal with multibyte encodings have been there since 1989. Compiler extensions to deal with Shift-JIS and other multibyte encodings have been there since the mid-80's. They don't work very well, but nevertheless they are there and supported.

Because char in C is not supposed to hold multi-byte encodings.

Phobos can work with utf-8/16 and satisfy you and other UTF-masochists (no offence implied).

C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).

At least std::string is a strictly single-byte thing by definition. And this is perfectly fine.

As long as you're dealing with ASCII only <g>. That world has been left behind, though.

There is wchar_t for holding the OS-supported range in full. On Win32 wchar_t is 16-bit (UCS-2 legacy), and in GCC/*nix it is 32-bit.

That's just the trouble with wchar_t. It's implementation-defined, which means its use is non-portable. The Win32 version cannot handle surrogate pairs as a single character. Linux has the opposite problem - you can't have UTF-16 strings in any non-kludgy way. Trying to write internationalized code with wchar_t that works correctly on both Win32 and Linux is an exercise in frustration. What you wind up doing is abstracting away the char type - giving up on help from the standard libraries and writing your own text processing code from scratch. I've been through this with real projects. It doesn't "work just fine", and it is a lot of extra work. Translating the code to D is nice; you essentially give that whole mess the heave-ho.

BTW, you talked earlier a lot about memory efficiency. Linux's 32-bit wchar_t eats memory like nothing else.

Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C.

You're right that a C char isn't a D char. All that means is one must be careful, when calling C functions that take char*'s, to pass data in the form that particular C function expects. This is true for all of C's data types - even int.

Is this the idea?

The vast majority (perhaps even all) of C standard string handling functions that accept char* will work with UTF-8 without modification.
Aug 01 2006
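The D834 DD1E example can be checked mechanically. A minimal sketch of the UTF-16 surrogate-pair arithmetic from the Unicode spec; the variable names are arbitrary:

    void main()
    {
        wchar hi = 0xD834;   // high (lead) surrogate
        wchar lo = 0xDD1E;   // low (trail) surrogate

        // The UTF-16 decoding rule for a surrogate pair:
        dchar cp = cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
        assert(cp == 0x1D11E);   // MUSICAL SYMBOL G CLEF

        // Read naively as UCS-2, the same four bytes are two "characters",
        // both invalid - and the string length comes out different.
    }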
There is no such thing as a surrogate pair in UCS-2. A JS string does not hold UTF-16 code units - only full code points. See the spec.

As you may see, it returns a (Unicode) *code point* from the BMP set, which is far from the UTF-16 code unit you've declared above.

There is no difference.

Relaxing "a nonnegative integer less than 2^16" to "a nonnegative integer less than 2^21" will not harm anybody. Or at least the probability of that is vanishingly small.

It'll break any code trying to deal with surrogate pairs.

C string functions can be used with multibyte encodings for one sole reason: all byte encodings define the character with code 0 as the NUL character. No encoding in practical use lets a byte with value 0 appear in the middle of a sequence. They were all built with C string processing in mind.

Standard functions in the C standard library to deal with multibyte encodings have been there since 1989. Compiler extensions to deal with Shift-JIS and other multibyte encodings have been there since the mid-80's. They don't work very well, but nevertheless they are there and supported.

Because char in C is not supposed to hold multi-byte encodings.

C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).

Phobos can work with utf-8/16 and satisfy you and other UTF-masochists (no offence implied).

At least std::string is a strictly single-byte thing by definition. And this is perfectly fine.

As long as you're dealing with ASCII only <g>. That world has been left behind, though.

Agree. As I said - if you need efficiency, use byte/word encodings plus a mapping. dchar is no better than wchar_t on Linux. Please don't say that I should use UTF-8 for that - it simply does not work in my cases; it is too expensive.

There is wchar_t for holding the OS-supported range in full. On Win32 wchar_t is 16-bit (UCS-2 legacy), and in GCC/*nix it is 32-bit.

That's just the trouble with wchar_t. It's implementation-defined, which means its use is non-portable. The Win32 version cannot handle surrogate pairs as a single character. Linux has the opposite problem - you can't have UTF-16 strings in any non-kludgy way. Trying to write internationalized code with wchar_t that works correctly on both Win32 and Linux is an exercise in frustration. What you wind up doing is abstracting away the char type - giving up on help from the standard libraries and writing your own text processing code from scratch. I've been through this with real projects. It doesn't "work just fine", and it is a lot of extra work. Translating the code to D is nice; you essentially give that whole mess the heave-ho. BTW, you talked earlier a lot about memory efficiency. Linux's 32-bit wchar_t eats memory like nothing else.

Correct. As I said, because 0 is NUL in UTF-8 too - not 0xFF or anything else exotic.

Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C.

You're right that a C char isn't a D char. All that means is one must be careful, when calling C functions that take char*'s, to pass data in the form that particular C function expects. This is true for all of C's data types - even int.

Is this the idea?

The vast majority (perhaps even all) of C standard string handling functions that accept char* will work with UTF-8 without modification. No rewrite required.

You've implied all this doesn't work, by saying things must be rewritten, that it's extremely difficult to deal with UTF-8, that BMP is not a subset of UTF-16, etc. This is not my experience at all. If you've got some persuasive code examples, I'd like to see them.

I am not saying that things "must be rewritten". Sorry, but it is you who proposes to rewrite all the string processing functions of the standard library mankind has today. Or I don't quite understand your idea with UTFs. Java did change the string world by introducing just one char (a single UCS-2 code point), with no variations. Is that good or bad? From the uniformity point of view - good. For efficiency - bad. I've seen a lot of reinvented char-as-byte wheels in professional packages.

Andrew.
Aug 01 2006
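Andrew's NUL observation is also why Walter's claim holds for UTF-8 specifically. A sketch showing a byte-wise C search working on multi-byte UTF-8 text; the strstr prototype is declared by hand here rather than assumed to be in std.c.string:

    import std.string;   // toStringz

    extern (C) char* strstr(char*, char*);   // the classic C byte-wise search

    void main()
    {
        // UTF-8 never embeds a 0 byte inside a multi-byte sequence, so
        // NUL-terminated, byte-wise C routines pass it through untouched.
        char* s = toStringz("naïve café");
        assert(strstr(s, toStringz("café")) !is null);
    }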
Andrew, I think there's a misunderstanding here. Perhaps it's a language thing. Let me define two things for you, in English, by my understanding of them. I was born in Utah and raised in Los Angeles as a native speaker, so hopefully these definitions aren't far from the standard understanding.

Default: a setting, value, or situation which persists unless action is taken otherwise; such a thing that happens unless overridden or canceled.

Null: something which has no current setting, value, or situation (but could have one); the absence of a setting, value, or situation.

Therefore, I should conclude that "default" and "null" are very different concepts. The fact that C strings are null-terminated, and that encodings provide for a "null" character (or code point or muffin or whatever they care to call them), does not logically necessitate that this provides for a default, or logically default, value.

It is true that, given the above definitions, it would not be wrong for the default to be null. That would fit the definitions above perfectly. However, so would a value of ' ' (which might be the default in some language out there.) It would seem logical that 0 could be used as the default, but then, as Walter pointed out... this can (and tends to) hide bugs which will bite you eventually.

Let us suppose you had a string displayed somewhere. It is possible, were it blank, that you might not notice it. Next let us suppose this space were filled with "?", "`", "ﮘ", or "ß" characters. Do you think you would be more, or less, likely to notice it?

Next, let us suppose that this character could be (in cases) detectable as invalid. Again, note that 0 is not invalid, and may appear in strings. This sounds even better. So a default value of 0 does not, from an implementation or practical point of view, seem to make much sense to me.

In fact, I think a default value of "42" for int makes sense (surely it reminds you of what six by nine is.) But maybe that's because I never leave things at their defaults. It's like writing a story where you expect the reader to think everyone has brown eyes unless you say otherwise.

-[Unknown]

Correct. As I said, because 0 is NUL in UTF-8 too - not 0xFF or anything else exotic.
Aug 01 2006
But maybe that's because I never leave things at their defaults. It's like writing a story where you expect the reader to think everyone has brown eyes unless you say otherwise.

Consider this:

char[6] buf;
strncpy(buf, "1234567", 5);

What will be the content of your buffer? The answer is: 12345\xff. Surprise? It is. In modern D a reliable implementation of this has to be:

char[6] buf; // memset(buf,0xFF,6); under the hood.
uint n = strncpy(buf, "1234567", 5);
buf[n] = 0;

if you are going to use this with non-D modules. Needless to say, this is a bit redundant. If D initializes that memory in any case, why do you need this uint n and buf[n] = 0? Don't tell me, please, that this is because you spent your childhood in Boy Scout camps and picked up some high principles there. Let's put such matters aside - this is a purely technical discussion.

Andrew.
Aug 01 2006
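One correction to the snippet above: C's strncpy returns char*, not a count, so "uint n = strncpy(...)" would not compile against the real prototype. A sketch that does compile with std.c.string, showing the buffer state Andrew describes:

    import std.c.string;   // char* strncpy(char*, char*, size_t)

    void main()
    {
        char[6] buf;                       // filled with 0xFF by default
        strncpy(buf.ptr, "1234567", 5);    // copies 5 bytes, adds no terminator
        assert(buf[4] == '5');
        assert(buf[5] == char.init);       // still 0xFF: not yet a C string
        buf[5] = '\0';                     // terminate explicitly for C callees
    }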
Andrew Fedoniouk wrote:

But maybe that's because I never leave things at their defaults. It's like writing a story where you expect the reader to think everyone has brown eyes unless you say otherwise.

Consider this:

char[6] buf;
strncpy(buf, "1234567", 5);

What will be the content of your buffer? The answer is: 12345\xff. Surprise? It is.

Not really surprising. Had you compiled this in a C program (you are using C functions, after all), you would have gotten: 12345\x?? <- some garbage. Not a zero-terminated string. My manual for strncpy explicitly states: "if there is no null byte among the first n bytes of src, the result will not be null-terminated."

/Oskar
Aug 02 2006
On Tue, 1 Aug 2006 23:45:26 -0700, Andrew Fedoniouk wrote:

But maybe that's because I never leave things at their defaults. It's like writing a story where you expect the reader to think everyone has brown eyes unless you say otherwise.

Consider this:

char[6] buf;
strncpy(buf, "1234567", 5);

What will be the content of your buffer? The answer is: 12345\xff. Surprise? It is.

No, not surprised; just wondering why you didn't code it correctly, though. If you insist on using C functions then it should be coded ...

extern(C) uint strncpy(ubyte*, ubyte*, uint);
ubyte[6] buf;
strncpy(buf.ptr, cast(ubyte*)"1234567", 5);

In modern D a reliable implementation of this has to be:

char[6] buf; // memset(buf,0xFF,6); under the hood.
uint n = strncpy(buf, "1234567", 5);
buf[n] = 0;

Well, that is debatable. I'd do it more like ...

char[6] buf;                          // An array of UTF-8 code units.
uint n = strncpy(buf, "1234567", 5);  // Replace the first 5 code units.
buf[n..$] = 0;                        // Set the remaining code units to zero.

if you are going to use this with non-D modules. Needless to say, this is a bit redundant. If D initializes that memory in any case, why do you need this uint n and buf[n] = 0? Don't tell me, please, that this is because you spent your childhood in Boy Scout camps and picked up some high principles there. Let's put such matters aside - this is a purely technical discussion.

Exactly. And technically you should be using ubyte[] and not char[].

-- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 2/08/2006 4:57:15 PM
Aug 02 2006
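For completeness, the same job needs no C function at all. A D-native sketch using bounds-checked slice copies (the 5-byte limit mirrors the examples above):

    void main()
    {
        char[6] buf;
        char[] src = "1234567";
        size_t n = src.length < 5 ? src.length : 5;
        buf[0 .. n] = src[0 .. n];   // bounds-checked slice copy
        buf[n .. $] = '\0';          // zero the tail only if C interop needs it
    }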
Why would I ever use strncat() in a D program? Consider this: if you do not wear a helmet while riding a motorcycle (read: I don't like helmets) you could break your head and die. Guess what? I don't ride motorcycles. Problem solved.

I don't like null-terminated strings. I think they are the root of much evil. To me, describing why having 0 as a default benefits null-terminated strings is like describing how having fewer police helps burglars. Obviously I'm being over-dramatic, but I remain unconvinced...

Also, I did spend (some of) my childhood in Boy Scout camps, and I did learn many principles (none of which related to programming in the slightest.) I mean that literally. But you're right, that's beside the point.

-[Unknown]

Consider this:

char[6] buf;
strncpy(buf, "1234567", 5);

What will be the content of your buffer? The answer is: 12345\xff. Surprise? It is. In modern D a reliable implementation of this has to be:

char[6] buf; // memset(buf,0xFF,6); under the hood.
uint n = strncpy(buf, "1234567", 5);
buf[n] = 0;

if you are going to use this with non-D modules. Needless to say, this is a bit redundant. If D initializes that memory in any case, why do you need this uint n and buf[n] = 0? Don't tell me, please, that this is because you spent your childhood in Boy Scout camps and picked up some high principles there. Let's put such matters aside - this is a purely technical discussion.

Andrew.
Aug 02 2006
Correction: strncpy(). They're all evil.

-[Unknown]
Aug 02 2006
Andrew Fedoniouk wrote:

(Hope this long dialog will help all of us to better understand what UNICODE is)

Actually, it doesn't help at all, Andrew ~ some of it is thoroughly misguided, and some is "cleverly" slanted purely for the benefit of the author. In truth, this thread would be the last place one would look to learn from an entirely unbiased opinion; one with only the reader's education in mind. There are infinitely more useful places to go for that sort of thing. For those who have an interest, this tiny selection may help:

http://icu.sourceforge.net/docs/papers/forms_of_unicode/
http://www.hackcraft.net/xmlUnicode/
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.unicode.org/unicode/faq/utf_bom.html
http://en.wikipedia.org/wiki/UTF-8
http://www.joelonsoftware.com/articles/Unicode.html
Aug 01 2006
Andrew Fedoniouk wrote:

"Walter Bright" <newshound digitalmars.com> wrote in message news:eao5st$2r1f$1 digitaldaemon.com...

Uh, the statement "BMP is a subset of UTF-16" means that you can read a BMP sequence as a UTF-16 sequence, not the opposite, as you said: "If you will treat a utf-16 sequence as a sequence of UCS-2 (BMP)".

Andrew Fedoniouk wrote: Walter, with deepest respect, but it is not. Two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking. If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes you are in trouble. See:

Compiler accepts input stream as either BMP codes or full unicode set encoded using UTF-16. BMP is a subset of UTF-16.

Just a note: not to ubyte[], but to ubyte*.

-- Bruno Medeiros - MSc in CS/E student http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea? Andrew.

Ordinary people will do their own strings anyway. Just give them opAssign and a dtor in structs and you will see an explosion of perfect strings. Changing the char init value to 0 will not harm anybody, but will allow char to be used for purposes other than UTF-8 - it is only one of the 40 or so encodings in active use anyway. For persistence purposes (in a compiled EXE) UTF is probably the best choice. But at runtime - please, not at the language level.

ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
Aug 03 2006