
digitalmars.D - char[] initialization

reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
Could somebody shed light on the subject:

According to http://digitalmars.com/d/type.html

characters in D are initialized with the following values

char -> 0xFF
wchar -> 0xFFFF
dchar -> 0x0000FFFF

What is the idea behind initializing strings with a valid character code instead of 0?

And that 0xFFFF.... Why was this special character (see Basic Multilingual Plane) selected?

To avoid the use of strcat & co. on D strings?

(Sorry if it was discussed before)

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
parent reply kris <foo bar.com> writes:
Andrew Fedoniouk wrote:
 Could somebody shed light on the subject:
 
 According to http://digitalmars.com/d/type.html
 
 characters in D are initialized with the following values
 
 char -> 0xFF
 wchar -> 0xFFFF
 dchar -> 0x0000FFFF
 
 What is the idea behind initializing strings with a valid character code instead of 0?
Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
Jul 29 2006
next sibling parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
kris wrote:
[snip]
Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
I don't understand why the compiler should initialize variables to illegal values!!

OK, is it because you have to initialize variables explicitly? Just WHY?

As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly?

That just makes no sense to me.
Jul 29 2006
parent reply Derek <derek psyc.ward> writes:
On Sat, 29 Jul 2006 06:29:21 -0600, Hasan Aljudy wrote:

[snip]
I don't understand why the compiler should initialize variables to illegal values!! OK, is it because you have to initialize variables explicitly? Just WHY? As far as I know, the notion that non-initialized variables are bad is a side-effect of the C (and C++) language, because non-inited variables are garbage. However, in D (and Java .. and others), vars are always initialized. So, if the compiler can init variables to good defaults, why should it still be considered a bad habit not to init variables explicitly? That just makes no sense to me.
I believe that D's philosophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values.

--
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Jul 29 2006
parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
Derek wrote:
[snip]
I believe that D's philosophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values.
I know .. I was asking "but why?" :(
Jul 29 2006
parent reply Robert Atkinson <Robert.Atkinson NO.gmail.com.SPAM> writes:
[snip]
I believe that D's philosophy is that all datatypes are initialized to 'invalid' values if they possibly can be. The ones that can't are integers, bytes, and bools. References, floating point values, and characters are initialized to 'wrong' values.
I know .. I was asking "but why?" :(
The intent, I believe, is to signal the programmer as soon as possible that they have missed something. In C/C++ an un-initialised variable can easily survive thousands of debug runs until it 'initialises' to a completely wrong value - most often on a release build and an end-user's system.

Take floats. By starting at NaN, you'll know from the very start that you missed initialising it. You'll catch the error earlier in your debug process.
Jul 29 2006
parent reply Hasan Aljudy <hasan.aljudy gmail.com> writes:
Robert Atkinson wrote:
[snip]
The intent, I believe, is to signal the programmer as soon as possible that they have missed something. In C/C++ an un-initialised variable can easily survive thousands of debug runs until it 'initialises' to a completely wrong value - most often on a release build and an end-user's system. Take floats. By starting at NaN, you'll know from the very start that you missed initialising it. You'll catch the error earlier in your debug process.
Still missing my point.

In C/C++ that's a problem because un-initialized variables carry garbage. In D, it's not; if you init them to a reasonable valid default, this problem won't exist anymore.

If un-initializing is bad just for its own sake .. then the compiler should detect it and issue an error/warning; otherwise it should default to a reasonable valid value - in this case, zero for chars and floats.
Jul 29 2006
parent reply Carlos Santander <csantander619 gmail.com> writes:
Hasan Aljudy escribió:
 
 
 Still missing my point.
 in C/C++ that's a problem because un-initialized variables carry garbage.
 in D, it's not; if you init them to a reasonable valid default, this 
 problem won't exist anymore.
 
 If un-initializing is bad just for its own sake .. then the compiler 
 should detect it and issue an error/warning, otherwise it should default 
 to a reasonable valid value; in this case, zero for chars and floats.
The issue here is, a "reasonable valid default" will change from one app to the other, one function to the next, one variable to another, so the intention here is to force the developer to be explicit about his/her intentions. Walter has said in the past that if there was a NaN for int/long/etc, he'd use that instead of 0.

--
Carlos Santander Bernal
Jul 29 2006
parent Walter Bright <newshound digitalmars.com> writes:
Carlos Santander wrote:
[snip]
The issue here is, a "reasonable valid default" will change from one app to the other, one function to the next, one variable to another, so the intention here is to force the developer to be explicit about his/her intentions. Walter has said in the past that if there was a NaN for int/long/etc, he'd use that instead of 0.
That's right. Also, given:

int x;
foo(x);

it is impossible for the maintenance programmer to distinguish between:

1) x is meant to be 0
2) the original programmer forgot to initialize x to 3, and there's a bug in the program

Ok, fine, so why doesn't the compiler just squawk about referencing uninitialized variables? Consider:

int x;
...
if (...)
{
    x = 3;
    ...
}
...
if (...)
{
    ...
    foo(x);
}

There is no way for the compiler to determine that x in foo(x) is always initialized. So it must assume otherwise, and squawk about it. So how does our harried programmer fix it?

int x = some-random-value;
...
if (...)
{
    x = 3;
    ...
}
...
if (...)
{
    ...
    foo(x);
}

The compiler is now happy, but pity the poor maintenance programmer. He notices the some-random-value, and wonders what that value means. He analyzes the code, and discovers that that value is never used. Was it intended to be used? Did some previous maintenance programmer break the code? What's going on here?

My take on programming languages is that the semantics should have the obvious meaning - i.e. if the programmer initializes a variable to a value, that value should have meaning. He should not have to initialize a variable because of some subtle *side effect* such initialization has. Programmers should not be required to add dead assignments, unreachable code, etc., just to keep the compiler happy.
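For illustration, a minimal D sketch of the defaults under discussion - char.init is 0xFF and floating point values start as NaN; the exact printed forms assume a D1-era Phobos writefln:

import std.stdio;

void main()
{
    char c;                 // default-initialized to char.init
    double d;               // default-initialized to double.init
    writefln(cast(int) c);  // 255 (0xFF), an invalid UTF-8 byte
    writefln(d);            // nan
}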
Jul 29 2006
prev sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"kris" <foo bar.com> wrote in message news:eaf9ei$2m7$1 digitaldaemon.com...
[snip]
Try google? http://www.digitalmars.com/d/archives/digitalmars/D/3239.html
Thanks, Kris.

To Walter:

The following assumption (http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):

"codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, it is guaranteed by the Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character. This codepoint will remain forever unassigned, precisely so that it may be used for purposes such as this."

is just wrong.

1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from the R-zone: {U+FFF0..U+FFFF} - a region already assigned.

2) For char[] the selection of 0xFF is wrong and even worse. For example, the character with code 0xFF in the Latin-1 encoding is "y diaeresis". In many European languages and Far East encodings 0xFF is a valid code point. For example, in the KOI-8 encoding 0xFF is an officially assigned value.

What is the point of the current initialization? If you are doing initialization already, and this initialization is part of the specification, why not use the official "NUL" values in this case? You are doing the same for floats - you are using NaNs there (the null value for floats). Why not use the same for chars?

I think I understand your intention; 0xFF is a sort of debug value, like these in Visual C++:

0xCDCDCDCD - Allocated in heap, but not initialized
0xDDDDDDDD - Released heap memory.
0xFDFDFDFD - "NoMansLand" fences automatically placed at boundary of heap memory. Should never be overwritten. If you do overwrite one, you're probably walking off the end of an array.
0xCCCCCCCC - Allocated on stack, but not initialized

but this is far from the concept of a null code point in character encodings.

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
next sibling parent reply Carlos Santander <csantander619 gmail.com> writes:
Andrew Fedoniouk escribió:
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is a 
 valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.
 
But D's chars are UTF-8, not Latin-1 nor any other, so I don't think this applies.

--
Carlos Santander Bernal
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Carlos Santander" <csantander619 gmail.com> wrote in message 
news:eagiip$1lad$3 digitaldaemon.com...
[snip]
But D's chars are UTF-8, not Latin-1 nor any other, so I don't think this applies.
UTF-8 is a multibyte transport encoding of full 21-bit Unicode code points.

Strictly speaking, a single byte of a UTF-8 sequence cannot be called a char[acter]: char as a type name implies that a value of that type contains some complete code point (assuming that information about the codepage is stored somewhere or is known at the point of use).

I mean that a "UTF-8 character" (if that makes any sense at all) as a type is always char[] and never a single char.

0xFF as a char initialization value implies that D's char is not supposed to handle single-byte character encodings at all. Is this the original intention?

Andrew Fedoniouk.
http://terrainformatica.com
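For illustration, a minimal sketch of the char[] vs. char point, assuming a D1-era compiler where string literals convert to char[]:

import std.stdio;

void main()
{
    char[] s = "\u00FF";      // U+00FF is encoded as two UTF-8 bytes
    writefln(s.length);       // 2 (0xC3, 0xBF)
    dchar d = '\u00FF';       // a dchar holds the complete code point
    writefln(cast(uint) d);   // 255
}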
Jul 29 2006
parent Carlos Santander <csantander619 gmail.com> writes:
Andrew Fedoniouk escribió:
[snip]
My bad, then. I should've said char[] instead of char. Frits and Walter wrote better responses, anyway, so I'll leave this as is.

--
Carlos Santander Bernal
Jul 29 2006
prev sibling next sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Andrew Fedoniouk wrote:
 To Walter:
 
 Following assumption ( 
 http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):
 
 "codepoint U+FFFF is not a legitimate Unicode character, and, furthermore, 
 it is guaranteed by the
 Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode character.
 This codepoint will remain forever unassigned, precisely so that it may be 
 used
 for purposes such as this."
 
 is just wrong.
 
 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
 R-zone: {U+FFF0..U+FFFF} - region assigned already.
Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it forms the subrange of the "Noncharacters" (see http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are "intended for process internal uses, but are not permitted for interchange". 0xFFFF specifically is marked "<not a character> - the value FFFF is guaranteed not to be a Unicode character at all".

So yes, it's assigned - for exactly such a purpose as D is using it for :).
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is a 
 valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.
First of all, non-Unicode encodings are irrelevant. 'char' is a UTF-8 codepoint (I think that's the correct term). It's not a Unicode character (though some Unicode characters are encoded as a single UTF-8 codepoint, specifically anything up to 0x80 IIRC).

0xFF is indeed a valid Unicode character, but that doesn't mean that character is encoded as a byte with value 0xFF in UTF-8 (which char[]s represent). 0xFF is in fact one of the byte values that *cannot* occur in a valid UTF-8 text.
Jul 29 2006
parent "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Frits van Bommel" <fvbommel REMwOVExCAPSs.nl> wrote in message 
news:eagjcd$1m1t$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
[snip]
 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
 R-zone: {U+FFF0..U+FFFF} - region assigned already.
Yep, 0xFFFF is in the "Specials" range. In fact, together with 0xFFFE it forms the subrange of the "Noncharacters" (see http://www.unicode.org/charts/PDF/UFFF0.pdf, at the end). These are "intended for process internal uses, but are not permitted for interchange". 0xFFFF specifically is marked "<not a character> - the value FFFF if guaranteed not to be a Unicode character at all". So yes, it's assigned - for exactly such a purpose as D is using it for :).
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is 
 a valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.
First of all, non-Unicode encodings are irrelevant. 'char' is a UTF-8 codepoint (I think that's the correct term).
Sorry, but this is wrong. "UTF-8 code point" is nonsense. In common practice, a code point is:

(1) a numerical index (or position) in an encoding table used for encoding characters;
(2) a synonym for Unicode scalar value.

As a rule, one code point is rendered as a single glyph when presented to a human.
 It's not a Unicode character (though some Unicode characters are encoded 
 as a single UTF-8 codepoint, specifically anything up to 0x80 IIRC).
 0xFF is indeed a valid Unicode character, but that doesn't mean that 
 character is encoded as a byte with value 0xFF in UTF-8 (which char[]s 
 represent). 0xFF is in fact one of the byte values that *cannot* occur in 
 a valid UTF-8 text.
Sorry, but an element of a UTF-8 encoded sequence is a byte (octet) and not a char. char as a type historically means a type for storing character code points, and 0xFF is an assigned and legal value in many encodings.

Either use a different name for this "D char" - say, utf8byte - or use char in the meaning of "code point value", and thus initialize it with the NUL value common to all known encodings.

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
prev sibling next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
[snip]
 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
 R-zone: {U+FFF0..U+FFFF} - region assigned already.
"the value FFFF is guaranteed not to be a Unicode character at all" http://www.unicode.org/charts/PDF/UFFF0.pdf
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is a 
 valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.
char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode character U+00FF is not encoded into UTF-8 as FF: "The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txt
 What is the point of current initializaton?
The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
 If you are doing intialization already
 and this intialization is a part of specification so why not to use
 official "Nul" values in this case?
Because 0 is a valid UTF-8 character.
 You are doing the same for floats - you are using NaNs there
  (Null value for floats). Why not to use the same for chars?
The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.
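A small sketch of how the invalid default gets flushed out, assuming Phobos' std.utf.validate and UtfException behave as documented in D1:

import std.utf;

void main()
{
    char[] s = new char[3];   // every element defaults to 0xFF
    try
    {
        validate(s);          // throws: 0xFF never occurs in well-formed UTF-8
        assert(0);            // not reached
    }
    catch (UtfException e)
    {
        // uninitialized data is caught instead of masquerading as text
    }
}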
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eagk1o$1mph$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
[snip]
 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
 R-zone: {U+FFF0..U+FFFF} - region assigned already.
"the value FFFF is guaranteed not to be a Unicode character at all" http://www.unicode.org/charts/PDF/UFFF0.pdf
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is 
 a valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.
char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode U00FF is not encoded into UTF-8 as FF. "The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txt
 What is the point of current initializaton?
The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
 If you are doing intialization already
 and this intialization is a part of specification so why not to use
 official "Nul" values in this case?
Because 0 is a valid UTF-8 character.
1) What "UTF-8 character" means exactly? 2) In ASCII char(0) is officially NUL. Why not to initialize strings by null?
 You are doing the same for floats - you are using NaNs there
  (Null value for floats). Why not to use the same for chars?
The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.
I don't get it, sorry. In the KOI-8R (Russian) encoding, 0xFF is the letter '?'. Are you saying that I cannot use char[] to represent Russian text in D?

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 What is the point of current initializaton?
The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
 If you are doing intialization already
 and this intialization is a part of specification so why not to use
 official "Nul" values in this case?
Because 0 is a valid UTF-8 character.
1) What "UTF-8 character" means exactly?
For an exact answer, the spec is: http://www.ietf.org/rfc/rfc3629.txt There isn't much to it.
 2) In ASCII char(0) is officially NUL. Why not to initialize strings
 by null?
Because 0 characters are valid UTF-8 values. By using an invalid UTF-8 value, we can flush out bugs from uninitialized data.
 I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?'
 Are you saying that I cannot use char[] to represen russian text in D?
char[] is for UTF-8 encoded text only. For other encoding systems, use ubyte[]. But rest assured that Russian (and every other language) has a defined encoding in UTF-8, which is why it was selected for D.
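A sketch of that separation under D1 semantics, with made-up byte values standing in for real KOI-8R text:

void main()
{
    // Raw single-byte encoded text: not UTF-8, so not a char[]
    ubyte[] koi8r = [0xFF, 0xC1, 0xD7];   // placeholder KOI-8R bytes

    // Unicode text: char[] must hold well-formed UTF-8
    char[] utf8 = "\u043F\u0440\u0438\u0432\u0435\u0442";   // Russian "privet"
}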
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eagmrk$1pn9$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 What is the point of current initializaton?
The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
 If you are doing intialization already
 and this intialization is a part of specification so why not to use
 official "Nul" values in this case?
Because 0 is a valid UTF-8 character.
1) What "UTF-8 character" means exactly?
For an exact answer, the spec is: http://www.ietf.org/rfc/rfc3629.txt There isn't much to it.
Sorry, I understand what a UCS character means, but what exactly is the "UTF-8 character" you are using? Is it:

1) a single octet of a UTF-8 sequence, or
2) a sequence of octets representing one Unicode character (a 21-bit value)?
 2) In ASCII char(0) is officially NUL. Why not to initialize strings
 by null?
Because 0 characters are valid UTF-8 values. By using an invalid UTF-8 value, we can flush out bugs from uninitialized data.
Oh.... A 0 octet in UTF-8 can represent only the single character with code point 0x00000000. In plain English: UTF-8 encoded strings cannot contain zero bytes in the middle.
 I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?'
 Are you saying that I cannot use char[] to represen russian text in D?
char[] is for UTF-8 encoded text only. For other encoding systems, use ubyte[]. But rest assured that Russian (and every other language) has a defined encoding in UTF-8, which is why it was selected for D.
Sorry, but char[acter] in plain English means character - an index of some human-readable glyph in some table like ASCII, KOI-8, MAC-ASCII, whatever. An element of a UTF-8 sequence is an octet. I think you should rename the 'char' type to 'octet' if D/Phobos is intended to support only UTF-8.

Andrew.
Jul 29 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 Element of UTF-8 sequence is an octet.  I think you should rename
 'char' type to 'octet' if D/Phobos intended to support only UTF-8.
This was all hashed out years ago. It's too late to start renaming basic types.
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eagufo$2knt$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Element of UTF-8 sequence is an octet.  I think you should rename
 'char' type to 'octet' if D/Phobos intended to support only UTF-8.
This was all hashed out years ago. It's too late to start renaming basic types.
I am not asking to rename anything. Could you please just remove this weird 0xFF initialization for char arrays? (As it was prior to the .162 build.)

This is the whole point. If you do this, then the current char type can be used to represent single-byte encodings as it stands - as characters.

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
next sibling parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
But even prior to .162, this:

char c;
writefln(cast(size_t) c);

Would have given you 255, not 0.  This has been true for quite some 
time.  The fact that it did not happen for arrays in the same way was, 
as far as I know, a bug.  Actually, I didn't even realize that got fixed.

-[Unknown]


 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eagufo$2knt$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Element of UTF-8 sequence is an octet.  I think you should rename
 'char' type to 'octet' if D/Phobos intended to support only UTF-8.
This was all hashed out years ago. It's too late to start renaming basic types.
I am not asking to rename anything. Could you please just remove this weird 0xFF initialization for char arrays? ( as it was prior to .162 buld ) This is the whole point. If you will do this then current char type can be used for representation of single byte encodings as it stands - character. Andrew Fedoniouk. http://terrainformatica.com
Jul 29 2006
prev sibling parent Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eagufo$2knt$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Element of UTF-8 sequence is an octet.  I think you should rename
 'char' type to 'octet' if D/Phobos intended to support only UTF-8.
This was all hashed out years ago. It's too late to start renaming basic types.
I am not asking to rename anything.
Ok, but you did say "I think you should rename..." <g>
 Could you please just remove this weird 0xFF initialization
 for char arrays? ( as it was prior to .162 buld )
chars have been initialized to 0xFF for years now; it was a bug that some array initializations didn't do it.
 This is the whole point. If you will do this
 then current char type can be used for
 representation of single byte encodings as it stands -
 character.
? I don't understand what's standing in the way of that now. And values from 0..7F are single-byte UTF-8 encodings and can be stored in a char.

BTW, you can do this:

typedef char mychar = 0;

mychar[] a = new mychar[100];   // a[] will be initialized to 0
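A runnable version of that typedef trick, under the same D1 semantics described above:

import std.stdio;

typedef char mychar = 0;   // new type with a custom default initializer

void main()
{
    mychar[] a = new mychar[100];
    writefln(cast(int) a[0]);   // 0, unlike char's default of 255
}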
Jul 29 2006
prev sibling parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
Andrew,

I think it will make a lot more sense if you keep these things in 
mind... (I'm sure you already know all of them, I'm just listing them 
out since they're crucial and must be thought of together):

1. char, wchar, and dchar are separate types.

2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, 
or any other encoding.  It must contain UTF-8.

3. wchar contains UTF-16.  It is similar to char in every other way (may 
not contain any other encoding than UTF-16, not even UCS-2.)

4. dchar contains UTF-32 code points.  It may not contain any other sort 
of encoding, again.

5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use 
ubyte/byte or some other method.  It is not valid to use char.

6. The FF byte (8-bit octet sequence) may never appear in any valid 
UTF-8 string.  Since char can only contain UTF-8 strings, it represents 
invalid data if it contains such an 8-bit octet.

7. Code points are the characters in Unicode; they are "compressed", so 
to speak, in encodings such as UTF-8 and UTF-16.  UCS-2 and UCS-4 
(UTF-32) contain full code points.

8. If you were to examine the bytes in a wchar string, it may be 
possible that the 8-bit octet sequence "FF" might show up.  Nonetheless, 
since char cannot be used for UTF-16, this doesn't matter.

9. For the above reason, wchar (UTF-16) uses FFFF.  This character is 
similar to FF for UTF-8.

Given the above, I think I might answer your questions:

1. "UTF-8 character" here could mean an 8-bit octet or a code point.  In 
this case, they are both the same and represent a perfectly valid 
character in a string.

2. ASCII does not matter; char is not ASCII.  It happens that ASCII 
bytes 0 to 127 correspond to the same code points in Unicode, and the 
same characters in UTF-8.

3. It does not matter; KOI-8R encoded strings should not be placed in 
char arrays.  You should use UTF-8 or another encoding for your Russian 
text.

4. If you wish to use KOI-8R (or any other encoding not based on 
Unicode) you should not be using char arrays, which are meant for 
Unicode-related encodings only.

Obviously this is by far different from C, but that's the good thing 
about D in many ways ;).
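To make the three types concrete, here is a minimal sketch using Phobos' std.utf conversions, assuming the D1-era toUTF16/toUTF32 signatures:

import std.utf;

void main()
{
    char[]  u8  = "h\u00E9llo";   // UTF-8: the e-acute takes two bytes
    wchar[] u16 = toUTF16(u8);    // UTF-16 code units
    dchar[] u32 = toUTF32(u8);    // one full code point per element
    assert(u8.length  == 6);
    assert(u16.length == 5);
    assert(u32.length == 5);
}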

Thanks,
-[Unknown]



 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eagk1o$1mph$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Following assumption ( 
 http://www.digitalmars.com/d/archives/digitalmars/D/3239.html):

 "codepoint U+FFFF is not a legitimate Unicode character, and, 
 furthermore, it is guaranteed by the
 Unicode Consortium that 0xFFFF will NEVER be a legitimate Unicode 
 character.
 This codepoint will remain forever unassigned, precisely so that it may 
 be used
 for purposes such as this."

 is just wrong.

 1) 0xFFFF is a valid UNICODE character - it is one of the "Specials" from
 R-zone: {U+FFF0..U+FFFF} - region assigned already.
"the value FFFF is guaranteed not to be a Unicode character at all" http://www.unicode.org/charts/PDF/UFFF0.pdf
 2) For char[] selection of 0xFF is wrong and even worse.
 For example character with code 0xFF in Latin-I encoding is
 "y diaeresis". In many European languages and Far East encodings 0xFF is 
 a valid code point.
 For example in KOI-8 encoding 0xFF is officially assigned value.
char[] is not Unicode, it is UTF-8. For UTF-8, 0xFF is not a valid value. The Unicode U00FF is not encoded into UTF-8 as FF. "The octet values C0, C1, F5 to FF never appear." http://www.ietf.org/rfc/rfc3629.txt
 What is the point of current initializaton?
The point is to initialize it with an invalid value, in order to flush out uninitialized data errors.
 If you are doing intialization already
 and this intialization is a part of specification so why not to use
 official "Nul" values in this case?
Because 0 is a valid UTF-8 character.
1) What "UTF-8 character" means exactly? 2) In ASCII char(0) is officially NUL. Why not to initialize strings by null?
 You are doing the same for floats - you are using NaNs there
  (Null value for floats). Why not to use the same for chars?
The FF initialization does correspond (as close as we can get) with NaN for floats. 0 can masquerade as legitimate data, FF cannot.
I don't get it, sorry. In KOI-8R (Russian) enconding 0xFF is letter '?' Are you saying that I cannot use char[] to represen russian text in D? Andrew Fedoniouk. http://terrainformatica.com
Jul 29 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
news:eagn4d$1q1t$1 digitaldaemon.com...
 Andrew,

 I think it will make a lot more sense if you keep these things in mind... 
 (I'm sure you already know all of them, I'm just listing them out since 
 they're crucial and must be thought of together):

 1. char, wchar, and dchar are separate types.
No objections with this.
 2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, 
 or any other encoding.  It must contain UTF-8.
Sorry, but the plural form "char contains UTF-8 bytes" is wrong. What do you think char means:

1) char is an octet (byte) - a member of a UTF-8 sequence, or
2) char is the code point of some character in some character table?

Probably I am treating English too literally, but a char(acter) is not a UTF-8 byte, and never was. A char is an index of some glyph in some encoding table. This is the common definition used everywhere.
 3. wchar contains UTF-16.  It is similar to char in every other way (may 
 not contain any other encoding than UTF-16, not even UCS-2.)
What is wchar (uint16) for you:

1) an index of a Unicode scalar value in the Basic Multilingual Plane (BMP), or
2) a uint16 value - a member of a UTF-16 sequence?
 4. dchar contains UTF-32 code points.  It may not contain any other sort 
 of encoding, again.
Oh..... UTF-32 (like any other UTF) is a transformation format - a group name for two different encodings, UTF-32BE and UTF-32LE. "UTF-32 code point" is a non-sense. UTF-32 defines how to encode a Unicode code point as, again, a sequence of four bytes - octets.

I would define this thing as: dchar (a better name would be uchar) is the type for representing the full set of Unicode code points (a 21-bit value).

Please note: a "transformation format" (UTF) is not by any means a "manipulation format". The representation of text in memory suitable for manipulation (e.g. text processing) is, as a rule, different. You cannot use UTF-8 encoded Russian text for analysis. No way.
 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use 
 ubyte/byte or some other method.  It is not valid to use char.
Vice versa. For UTF-8 encoded strings you should use byte[], and for strings using single-byte encodings you should use char.
 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 
 string.  Since char can only contain UTF-8 strings, it represents invalid 
 data if it contains such an 8-bit octet.
No objections there: for UTF-8 octet sequences, 0xFF is an invalid value for an octet in the sequence. But please note: in a sequence of octets.
 7. Code points are the characters in Unicode; they are "compressed", so to 
 speak, in encodings such as UTF-8 and UTF-16.  USC-2 and USC-4 (UTF-32) 
 contain full code points.
Sorry, but UCS-4 *is not* UTF-32: http://www.unicode.org/reports/tr19/tr19-9.html

I will ask again: what does

char c = 'a';

mean for you? And the following, in C/C++:

#pragma(encoding,"KOI-8R")

char c = '?';

?
 8. If you were to examine the bytes in a wchar string, it may be possible 
 that the 8-bit octet sequence "FF" might show up.  Nonetheless, since char 
 cannot be used for UTF-16, this doesn't matter.
It is not clear what you mean here. Could you clarify? Especially the last statement.
 9. For the above reason, wchar (UTF-16) uses FFFF.  This character is 
 similar to FF for UTF-8.

 Given the above, I think I might answer your questions:

 1. UTF-8 character here could mean an 8-bit octet of code point.  In this 
 case, they are both the same and represent a perfectly valid character in 
 a string.
Sorry, I am not buying the following: "UTF-8 character" and "8-bit octet of code point".
 2. ASCII does not matter; char is not ASCII.  It happens that ASCII bytes 
 0 to 127 correspond to the same code points in Unicode, and the same 
 characters in UTF-8.
"ASCII does not matter"... for whom?
 3. It does not matter; KOI-8R encoded strings should not be placed in char 
 arrays.  You should use UTF-8 or another encoding for your Russian text.
"You should use UTF-8 or another encoding for your Russian text." Thanks. Advice from my side: Let me know when you will visit Russia. I will ask representatives of russian developer community and web authors to meet you. Advice per se: You should wear a helmet.
 4. If you wish to use KOI-8R (or any other encoding not based on Unicode) 
 you should not be using char arrays, which are meant for Unicode-related 
 encodings only.
The same advice as above.
 Obviously this is by far different from C, but that's the good thing about 
 D in many ways ;).
In Israel they have an old saying: "Not a human for Saturday, but Saturday for a human".

I do have practical experience in writing text processing software in encodings other than "US-ASCII", and I have heard your advice about UTF-8 usage with interest.

Please don't take any of this personally - no intention to harm anybody. Honestly, and with a smile :)

Andrew.
Jul 29 2006
next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 I will ask again:
 
 What:
 char c = 'a';
 means for you?
 And following in C/C++:
 
 #pragma(encoding,"KOI-8R")
 
 char c = '?';
 
 ?
Pragmas are implementation-defined behavior in C and C++, meaning they are unportable and rather useless. Not only that, chars themselves are implementation-defined, and so it is very difficult to write portable code that deals with anything other than a-zA-Z0-9 and a few other characters.

In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.
Jul 29 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eagut9$2l96$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
[snip]
Pragmas are implementation defined behavior in C and C++, meaning they are unportable and rather useless. Not only that, char's themselves are implementation defined, and so it is very difficult to write portable code that deals with anything other than a-zA-Z0-9 and a few other characters. In D, char[] is a UTF-8 sequence. It's well defined, and therefore portable. It supports every human language.
What does it mean "UTF-8 ... supports ...every human language" ? It allows to encode - yes. But in runtime support means quite different thing and I am pretty sure you know what I mean here. In Java as we know UTF-8 is used for representing string literals inside .class files but being loaded they became vectors of Java chars - unicode BMP codepoints (ushort). And this serves almost all character cases. Exceptions like: it is not trivial to do effectively processing of single byte encoded things there - you need to rewrite the whole set of functions to handle this. Please don't think that UTF-8 is a panacea. For example in China they use GB2312 encoding to represent almost 7000 Chinese characters in active use now. This is strictly 2 bytes enconding and don't even try to ask them to switch to UTF-8 (3 bytes as a rule). This will increase their internet traffic by 1/3. Same apply to Europe. E.g. in Russia there are 32 characters in alphabet and it is just enough to have one byte encoding for English/Russian text. It makes no sense to send over the wire two bytes (russian in utf-8) instead of one for the sites like lib.ru. Sorry but guys are paying there for each byte downloaded from Internet. This apply to almost all countries except of US and Canada. Andrew Fedoniouk. http://terrainformatica.com
Jul 29 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 In D, char[] is a UTF-8 sequence. It's well defined, and therefore 
 portable. It supports every human language.
What does it mean "UTF-8 ... supports ...every human language" ? It allows to encode - yes.
We both know what UTF-8 is and does.
 But in runtime support means quite different thing
 and I am pretty sure you know what I mean here.
I'm sure there are bugs in the library UTF-8 support. But they are bugs, are fixable, and not fundamental problems. As you find any, please post them to bugzilla.
 In Java as we know UTF-8 is used for representing
 string literals inside .class files but being loaded they
 became vectors of Java chars - unicode BMP codepoints
 (ushort). And this serves almost all character cases.
 Exceptions like: it is not trivial to do effectively
 processing of single byte encoded things there - you need
 to rewrite the whole set of functions to handle this.
 
 Please don't think that UTF-8 is a panacea.
I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
 For example in China they use GB2312 encoding
 to represent almost 7000 Chinese characters in active use now.
 This is strictly 2 bytes enconding and
 don't even try to ask them to switch to UTF-8
 (3 bytes as a rule). This will increase their internet
 traffic by 1/3.
 
 Same apply to Europe. E.g. in Russia
 there are 32 characters in alphabet and it is
 just enough to have one byte encoding for
 English/Russian text. It makes no sense
 to send over the wire two bytes (russian in utf-8)
 instead of one for the sites like lib.ru.
 
 Sorry but guys are paying there for each byte
 downloaded from Internet. This apply
 to almost all countries except of US and Canada.
If one needs to use a custom encoding, use ubyte[] or ushort[]. If one needs to be universal, use char[], wchar[], or dchar[]. And for what it's worth, D isn't a web transmission protocol. I don't see any problem with a D program converting its input from Format X to UTF for internal processing, and then converting its output back to X or Y or Z.
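A hedged sketch of that convert-on-input approach for Latin-1, where every byte value equals its Unicode code point. latin1ToUtf8 is a hypothetical helper, not a Phobos function; std.utf.encode is assumed to have its D1 signature:

import std.utf;

// Hypothetical helper: widen Latin-1 bytes into UTF-8 text.
char[] latin1ToUtf8(ubyte[] input)
{
    char[] result;
    foreach (ubyte b; input)
        encode(result, cast(dchar) b);   // a Latin-1 byte equals its code point
    return result;
}

void main()
{
    ubyte[] latin1 = [0x68, 0xFF];        // 'h', 'y diaeresis' in Latin-1
    char[] utf8 = latin1ToUtf8(latin1);
    assert(utf8.length == 3);             // 0xFF widens to two UTF-8 bytes
}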
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 Please don't think that UTF-8 is a panacea.
I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
Sorry, but this is a bit optimistic. D/samples/wc.exe will fail out of the box on Russian texts. It will fail on almost all Eastern texts, even if they are in UTF-8 encoding - the meaning of 'word' is different there.

Having the statement "string literals in D are only UTF-8 encoded" is not conceptually better than "string literals in C are encoded using the codepage defined by pragma(codepage,...)".

The same, by the way, applies to most Java compilers: they accept texts in various single-byte encodings. (Why am *I* telling this to *you*? :-)
Jul 29 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 Please don't think that UTF-8 is a panacea.
I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
Sorry but this is a bit optimistic. D/samples/wc.exe from the box will fail on russian texts. It will fail on almost all Eastern texts. Even they will be in UTF-8 encoding. Meaning of 'word' is different there.
No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.
 Having statement "string literals in D are only
 UTF-8 encoded" is not conceptually better than
 "string literals in C are encoded by using codepage defined
 by pragma(codepage,...)".
It is conceptually better because UTF-8 is completely defined and covers all human languages. Code pages are not completely defined, do not cover Asian languages, rely on non-standard compiler extensions, and in fact you cannot even rely on *ASCII* being supported by any particular C or C++ compiler. (It could be EBCDIC or any encoding invented by the compiler vendor.)

Code pages have another disastrous problem - it's impossible to mix languages. I have an academic text in front of me written in a mixture of German, French, and Latin. How's that going to work with code pages?

Code pages are obsolete yesterday's technology, and I'm not sorry to see them go.
 Same by the way applied to most of Java compilers
 they accepts texts in various singlebyte encodings.
 (Why *I* am telling this to *you*? :-)
The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eah9st$2v1o$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Please don't think that UTF-8 is a panacea.
I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
Sorry but this is a bit optimistic. D/samples/wc.exe from the box will fail on russian texts. It will fail on almost all Eastern texts. Even they will be in UTF-8 encoding. Meaning of 'word' is different there.
No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.
Sorry, did you try to write such a function (isword)? (You need the whole set of character classification tables to accomplish this - UTF-8 will not help you.)
 Having statement "string literals in D are only
 UTF-8 encoded" is not conceptually better than
 "string literals in C are encoded by using codepage defined
 by pragma(codepage,...)".
It is conceptually better because UTF-8 is completely defined and covers all human languages. Codepages are not completely defined, do not cover asian languages, rely on non-standard compiler extensions, and in fact you cannot even rely on *ASCII* being supported by any particular C or C++ compiler. (It could be EBCDIC or any encoding invented by the compiler vendor.) Code pages have another disastrous problem - it's impossible to mix languages. I have an academic text in front of me written in a mixture of german, french, and latin. How's that going to work with code pages?
I am not saying that you should avoid using the UTF-8 encoding. If you have a mix of, say, English, Russian and Chinese on some page, the only way to deliver it to the user is to use some (universal) Unicode transport encoding. But rendering this thing on the screen is a completely different story.

Consider this: attribute names in HTML (SGML) are represented by ASCII codes only - you don't need UTF-8 processing to deal with them at all. Generally speaking, you also cannot use UTF-8 for storing attribute values: attribute values participate in CSS selector analysis, and some selectors require char-by-char access (char as a code point, not a D char).

There are only a few academic cases where you can use UTF-8 literally (as a sequence of UTF-8 bytes) *at runtime*. D source code compilation is one of them - you can store the content of string literals in UTF-8 form, because you don't need to analyze their content.
 Code pages are obsolete yesterday's technology, and I'm not sorry to see 
 them go.
Sorry, but the US would be the first country to ask "what the ...?" on a demand to always send four bytes instead of one. The UTF-8 encoding is "traffic friendly" only for the 1/10 of the population on Earth that speaks English. Others just don't want to pay that price.

Sorry or not, that is irrelevant to the existence of code pages. They will be around forever, until all of us speak Esperanto.

(Currently I am doing right-to-left support in the engine - Arabic and Hebrew - trust me, I probably have more things to say "sorry" about.)
 Same by the way applied to most of Java compilers
 they accepts texts in various singlebyte encodings.
 (Why *I* am telling this to *you*? :-)
The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
Walter, where did you get that magic UTF-16?

The doc http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html mentions that the input of the Java compiler is a sequence of Unicode code points. How this input sequence is encoded - UTF-8, UTF-16, KOI8-R - does not matter at all; the spec is silent about it, and a human is within his/her rights to choose whatever encoding his/her terminal/keyboard supports.

Andrew Fedoniouk.
http://terrainformatica.com
Jul 29 2006
next sibling parent kris <foo bar.com> writes:
Is there a doctor in the house?



Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eah9st$2v1o$1 digitaldaemon.com...
 
Andrew Fedoniouk wrote:

Please don't think that UTF-8 is a panacea.
I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
Sorry but this is a bit optimistic. D/samples/wc.exe from the box will fail on russian texts. It will fail on almost all Eastern texts. Even they will be in UTF-8 encoding. Meaning of 'word' is different there.
No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.
Sorry, did you try to write such a function (isword)? (You need the whole set of character classification tables to accomplish this - utf-8 will not help you)
Having statement "string literals in D are only
UTF-8 encoded" is not conceptually better than
"string literals in C are encoded by using codepage defined
by pragma(codepage,...)".
It is conceptually better because UTF-8 is completely defined and covers all human languages. Codepages are not completely defined, do not cover asian languages, rely on non-standard compiler extensions, and in fact you cannot even rely on *ASCII* being supported by any particular C or C++ compiler. (It could be EBCDIC or any encoding invented by the compiler vendor.) Code pages have another disastrous problem - it's impossible to mix languages. I have an academic text in front of me written in a mixture of german, french, and latin. How's that going to work with code pages?
I am not saying that you shall avoid use of UTF-8 encoding. If you have mix of say english, russian and chinese on some page the only way to deliver this to the user is to use some (universal) unicode transport encoding. But to render this thing on the screen is completely different story. Consider this: attribute names in html (sgml) represented by ascii codes only - you don't need utf-8 processing to deal with them at all. You also cannot use utf-8 for storing attribute values generally speaking. Attribute values participate in CSS selector analysis and some selectors require char by char (char as a code point and not a D char) access. There are only few academic cases where you can use utf-8 literally (as a sequence of utf-8 bytes) *in runtime*. D source code compilation is one of such things - you can store content of string literals in utf-8 form - you don't need to analyze their content.
Code pages are obsolete yesterday's technology, and I'm not sorry to see 
them go.
Sorry but US is the first country which will ask "what a ...?" on demand to send always four bytes instead of one. UTF-8 encoding is "traffic friendly" only for 1/10 of population on the Earth (English speaking people). Others just don't want to pay that price. Sorry you or not sorry it is irrelevant for code pages existence. They will be forever untill all of us will not speak on Esperanto. ( Currently I am doing right-to-left support in the engine - Arabic and Hebrew - trust me - probably I have more things to say "sorry" about )
Same by the way applied to most of Java compilers
they accepts texts in various singlebyte encodings.
(Why *I* am telling this to *you*? :-)
The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
Walter, where did you get that magic UTF-16? The doc http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html mentions that the input of a Java compiler is a sequence of Unicode code points. And how this input sequence is encoded - utf-8, utf-16, koi8r - does not matter at all; the spec is silent about this - a human is within his/her rights to choose whatever encoding his/her terminal/keyboard supports. Andrew Fedoniouk. http://terrainformatica.com
Jul 29 2006
prev sibling next sibling parent Hasan Aljudy <hasan.aljudy gmail.com> writes:
Andrew Fedoniouk wrote:
 ( Currently I am doing right-to-left support in the engine - Arabic and 
 Hebrew -
 trust me - probably I have more things to say "sorry" about )
 
That's great, I'd be glad to help with anything if you need help with regard to Arabic (I'm a native Arabic speaker).
 
 Andrew Fedoniouk.
 http://terrainformatica.com
 
 
Jul 30 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eah9st$2v1o$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Please don't think that UTF-8 is a panacea.
I don't. But it's way better than C/C++, because you can rely on it and your code will work with different languages out of the box.
Sorry but this is a bit optimistic. D/samples/wc.exe out of the box will fail on Russian texts. It will fail on almost all Eastern texts, even when they are in UTF-8 encoding. The meaning of 'word' is different there.
No matter, it is far easier to write a UTF-8 isword function than one that will work on all possible character encoding methods.
Sorry, did you try to write such a function (isword)?
I have written isUniAlpha, which is the same thing.
 (You need the whole set of character classification tables
 to accomplish this - utf-8 will not help you)
With code pages, it isn't so straightforward (especially if you've got things like shift-JIS too). With code pages, a program can't even accept a text file unless you tell it what page the text is in.
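A minimal sketch of such a scan in D (assuming Phobos' std.uni.isUniAlpha; the foreach with a dchar loop variable is what performs the UTF-8 decoding):

import std.uni; // isUniAlpha

// Counts words in a UTF-8 string; foreach with a dchar loop
// variable decodes the char[] one code point at a time.
size_t wordCount(char[] text)
{
    size_t words = 0;
    bool inWord = false;
    foreach (dchar c; text)
    {
        if (isUniAlpha(c))
        {
            if (!inWord)
                words++;
            inWord = true;
        }
        else
            inWord = false;
    }
    return words;
}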
 I am not saying that you shall avoid use of UTF-8 encoding.
 If you have mix of say english, russian and chinese on some page
 the only way to deliver this to the user is to use some (universal)
 unicode transport encoding.
 But to render this thing on the screen is completely different
 story.
Fortunately, rendering is the job of the operating system - and I don't see how rendering with code pages would be any easier.
 Consider this: attribute names in html (sgml) represented by
 ascii codes only - you don't need utf-8 processing to deal with them at all.
 You also cannot use utf-8 for storing attribute values generally speaking.
 Attribute values participate in CSS selector analysis and some selectors
 require char by char (char as a code point and not a D char) access.
I'd be surprised at that, since UTF-8 is a documented, supported HTML page encoding method. But if UTF-8 doesn't work for you, you can use wchar (UTF-16) or dchar (UTF-32), or ubyte (for anything else).
 There are only few academic cases where you can use utf-8 literally
 (as a sequence of utf-8 bytes) *in runtime*. D source code compilation
 is one of such things - you can store content of string literals in utf-8 
 form -
 you don't need to analyze their content.
D identifiers can be unicode alphas, which means the UTF-8 must be decoded. The DMC++ compiler supports various code page source file possibilities, including some of the asian language multibyte encodings. I find that UTF-8 is a lot easier to work with, as the UTF-8 designers learned from the mistakes of the earlier multibyte encodings.
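A sketch of that decoding step (assuming Phobos' std.utf.decode, which advances an index past the code units of one code point; identifierEnd is a hypothetical helper, not compiler code):

import std.uni; // isUniAlpha
import std.utf; // decode

// Scans past an identifier in UTF-8 source, allowing unicode
// alphas; returns the index just after the identifier.
size_t identifierEnd(char[] src, size_t i)
{
    while (i < src.length)
    {
        size_t j = i;
        dchar c = decode(src, j); // j advances past the code units
        if (c != '_' && !isUniAlpha(c) && !(c >= '0' && c <= '9'))
            break;
        i = j;
    }
    return i;
}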
 Code pages are obsolete yesterday's technology, and I'm not sorry to see 
 them go.
Sorry but the US is the first country which will ask "what a ...?" on a demand to always send four bytes instead of one. UTF-8 encoding is "traffic friendly" only for 1/10 of the population of the Earth (English speaking people). Others just don't want to pay that price.
I'll make a prediction that the huge benefits of UTF will outweigh the downside, and that code pages will increasingly fall into disuse. (Note also that some compilers support EUC or SJIS, but not other code pages.) Windows is (internally) completely unicode (the code page face it shows is done by a translation layer on I/O). In an increasingly multicultural and global economy, applications that cannot simultaneously handle multiple languages are going to be at a severe disadvantage. Another problem with code pages is when you're presented with a text file, what code page is it in? There's no way for a program to tell, unless there's some other transmission of associated metadata. With UTF, that's no problem.
 Whether you are sorry or not, it is irrelevant to the existence of code pages.
 They will be around forever, until all of us speak Esperanto.
 
 ( Currently I am doing right-to-left support in the engine - Arabic and 
 Hebrew -
 trust me - probably I have more things to say "sorry" about )
No problem, I believe you <g>.
 The same, by the way, applies to most Java compilers -
 they accept texts in various single-byte encodings.
 (Why *I* am telling this to *you*? :-)
The compiler may accept it as an extension, but the Java *language* is defined to work with UTF-16 source text only. (Java calls them 'char's, even though there may be multi-char encodings.)
Walter, where did you get that magic UTF-16? The doc http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html mentions that the input of a Java compiler is a sequence of Unicode code points. And how this input sequence is encoded - utf-8, utf-16, koi8r - does not matter at all; the spec is silent about this - a human is within his/her rights to choose whatever encoding his/her terminal/keyboard supports.
Java Language Specification Third Edition Chapter 3.2: "The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding." It is, of course, entirely reasonable for a Java compiler to have extensions to recognize other encodings and automatically convert them internally to UTF-16 before lexical analysis.

"One Encoding to rule them all, One Encoding to replace them,
One Encoding to handle them all and in the darkness bind them"
-- UTF Tolkien
Jul 30 2006
next sibling parent reply Paolo Invernizzi <arathorn NOSPAM_fastwebnet.it> writes:
LOL!!!

---
Paolo

Walter Bright wrote:

 "One Encoding to rule them all, One Encoding to replace them,
 One Encoding to handle them all and in the darkness bind them"
 -- UTF Tolkien
Jul 30 2006
parent reply "John Reimer" <terminal.node gmail.com> writes:
On Sun, 30 Jul 2006 03:25:22 -0700, Paolo Invernizzi  
<arathorn NOSPAM_fastwebnet.it> wrote:

 LOL!!!

 ---
 Paolo

 Walter Bright wrote:

 "One Encoding to rule them all, One Encoding to replace them,
 One Encoding to handle them all and in the darkness bind them"
 -- UTF Tolkien
Okay, that clears things up. Now we know that UTF is a conspiracy for world domination. ;) -JJR
Jul 30 2006
parent kris <foo bar.com> writes:
John Reimer wrote:
 On Sun, 30 Jul 2006 03:25:22 -0700, Paolo Invernizzi  
 <arathorn NOSPAM_fastwebnet.it> wrote:
 
 LOL!!!

 ---
 Paolo

 Walter Bright wrote:

 "One Encoding to rule them all, One Encoding to replace them,
 One Encoding to handle them all and in the darkness bind them"
 -- UTF Tolkien
Okay, that clears things up. Now we know that UTF is a conspiracy for world domination. ;) -JJR
And created on the back of a napkin in a New Jersey diner ... way to go, Ken
Jul 30 2006
prev sibling parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
It's true that in HTML, attribute names were limited to a subset of 
characters available for use in the document.  Namely, as mentioned, 
alpha-type characters (/[A-Za-z][A-Za-z0-9\.\-]*/.)  You couldn't even 
use accented chars.

However (in the case of HTML), you were required to use specific 
(English) attribute names anyway for HTML to validate; it's really not a 
significant limitation.  Few people used SGML for anything else.

XML allows for Unicode attribute and element names... PIs, CDATA, 
PCDATA, etc.  And, of course, allows you to reference any Unicode code 
point.

We could also talk about the limitations of horse-drawn carriages, and 
how they can only go a certain speed... nonetheless, we have cars now, 
so I'm not terribly worried about HTML's technical limitations anymore.

-[Unknown]


 Consider this: attribute names in html (sgml) represented by
 ascii codes only - you don't need utf-8 processing to deal with them 
 at all.
 You also cannot use utf-8 for storing attribute values generally 
 speaking.
 Attribute values participate in CSS selector analysis and some selectors
 require char by char (char as a code point and not a D char) access.
I'd be surprised at that, since UTF-8 is a documented, supported HTML page encoding method. But if UTF-8 doesn't work for you, you can use wchar (UTF-16) or dchar (UTF-32), or ubyte (for anything else).
Jul 30 2006
prev sibling parent "Chris Miller" <chris dprogramming.com> writes:
On Sat, 29 Jul 2006 20:37:56 -0400, Walter Bright  
<newshound digitalmars.com> wrote:

 In D, char[] is a UTF-8 sequence. It's well defined, and therefore  
 portable. It supports every human language.
Even body language? :)
Jul 30 2006
prev sibling parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
2. Sorry, an array of char (a single char is one single 8 bit octet) 
contains UTF-8 bytes which are 8-bit octets.

A single character, in UTF-8 encoding, may be 1 byte, 2 bytes, etc. 
Thus, one char MAY NOT hold every single Unicode code point.  You may 
need an array of multiple chars (bytes) to hold a single code point.

This is not what it means to me; this is what it means.  A char is a 
single 8-bit octet in a UTF-8 sequence.  They ARE NOT by any means code 
points.

I'm sorry that I did not specify "array", but I fear you are being 
pedantic here; I'm sure you knew what I meant.

A char is a single byte in a UTF-8 sequence.  I'm afraid I think calling 
it an index to a glyph is dangerous, because it could be misunderstood. 
Again, a single char CANNOT represent code points at or above 128, 
because it is only ONE byte.

A single char therefore may not represent a glyph all of the time, but 
rather will represent a byte in the sequence of UTF-8 which may be used 
to decode (along with other necessary bytes) the entirety of the code point.

I hope I'm not being overly pedantic here, but I think your definition 
is either lax or wrong.  But, that is only by its reading in English.


3. Similarly, a single wchar may not represent full code points alone. 
Arrays of wchars must be used for some characters (UTF-16 is also a 
variable-width encoding.)

4. I was ignoring endianess issues for simplicity.  My point here is 
that a UTF-32 character directly represents  a code point.  Sorry again 
for the non-pedantic laxness in my wording.

5. Wrong.  There is no vice versa.  You may use byte or ubyte arrays for 
your UTF-8 encoded strings and so forth.

In case you didn't realize I was trying to say this:

*char is not for single byte encodings.  char is ONLY for UTF-8.  char 
may not be used for any other encoding unless you wish to have problems. 
  char is not the same as in other languages, e.g. C.*

If you wish for a 8-bit octet value (such as a character in any 
encoding; single byte or otherwise) you should not be using a char. 
That is not a correct usage for them, that is what byte and ubyte are for.

It is expected that chars in an array will follow a specific sequence; 
that is, that they will be encoded in UTF-8.  It is not possible to 
guarantee this if you use other encodings, which is why writefln() will 
fail in such cases.

6.  Correct.  And a single char (8-bit octet in a sequence of UTF-8 
octets encoded such) may never be FF because no single 8-bit octet 
anywhere in a valid UTF-8 sequence may be FF.  Remember, char is not a 
code point.  It is a single 8-bit octet in a sequence.

7. My mistake.  I always consider them roughly the same (and for some 
reason I thought that they had been made the same; but I assume your 
link is current.)

Your first code sample defines a single UTF-8 character, 'a'.  It is 
lucky you did not try:

char c = '蝿';

(hopefully this character gets sent through to you properly; I will be 
sending this message UTF-8 if my client allows it.)

Because that would have failed.  A char cannot hold such a character, 
which has a code point outside the range 0 - 127.  You would either need 
to use an array of chars, or etc.
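As an illustrative sketch (U+877F is in the Basic Multilingual Plane, so a single wchar does suffice), the forms that can hold that character:

char[] a = "蝿"; // UTF-8: three code units
wchar w = '蝿';  // one UTF-16 code unit (BMP character)
dchar d = '蝿';  // always exactly one code point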

Your second example means nothing to me.  I don't really care for such 
pragmas or putting untranslated text directly in source code, and have 
never dealt with it.

8. You may not use a single char or an array of chars to represent 
UTF-16.  It may only represent UTF-8.  If you wish to use UTF-16, you 
must use wchars.


1 (the second): as I said, they are the same - do you not agree?  A 0 is 
a zero is a zero.  It doesn't matter what he means.

2 (the second): rules about ASCII do not apply to char.  Just as rules 
in Portugal do not dissuade me here in Los Angeles.

3 (the second): I have led the development of a multi-lingual software 
which was used by quite a large number of people.  I also helped 
coordinate, and later interfaced with, the assigned coordinator of 
translation.  This software was translated into Thai, Chinese (simplified 
and traditional), Russian, Italian, Spanish, Japanese, Catalan, and 
several other languages.  More than twenty anyway.

At first I was suggesting that everyone use their own encoding and 
handling that (sometimes painfully) in the code.  I would sometimes get 
comments about using Unicode instead (from the translators who would 
have preferred this.)  This software now uses UTF-8 and remains 
translated in these languages.

So, while I have not been to Russia (although I have worked with 
numerous Russian developers, consumers, and translators) I would tend to 
disagree with your assertion.  Also I do not like helmets.

Obviously, I mean nothing to be taken personally as well; we are only 
talking about UTF-8, Unicode, its usage in D, and being pedantic ;). 
And helmets, we touched that subject too.  But not about each other, really.

Thanks,
-[Unknown]


 "Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
 news:eagn4d$1q1t$1 digitaldaemon.com...
 Andrew,

 I think it will make a lot more sense if you keep these things in mind... 
 (I'm sure you already know all of them, I'm just listing them out since 
 they're crucial and must be thought of together):

 1. char, wchar, and dchar are separate types.
No objections with this.
 2. char contains UTF-8 bytes.  It may not contain UTF-16, UCS-2, KOI-8R, 
 or any other encoding.  It must contain UTF-8.
Sorry, but the plural form "char contains UTF-8 bytes" is wrong. What do you think char means: 1) char is an octet (byte) - a member of a utf-8 sequence - or - 2) char is the code point of some character in some character table? Probably I am treating English too literally, but a char(acter) is not a UTF-8 byte. And never was. A char is an index of some glyph in some encoding table. This is the common definition used everywhere.
 3. wchar contains UTF-16.  It is similar to char in every other way (may 
 not contain any other encoding than UTF-16, not even UCS-2.)
What is wchar (uint16) to you: 1) wchar is an index of a Unicode scalar value in the Basic Multilingual Plane (BMP) - or - 2) a uint16 value - a member of a UTF-16 sequence?
 4. dchar contains UTF-32 code points.  It may not contain any other sort 
 of encoding, again.
Oh..... UTF-32 (like any other UTF) is a transformation format - a group name for two different encodings, UTF-32BE and UTF-32LE. "UTF-32 code point" is nonsense. UTF-32 defines how to encode a Unicode code point in, again, a sequence of four bytes - octets. I would define this thing as: dchar (a better name is uchar) is a type for representing the full set of Unicode Code Points (a 21-bit value). Please note: "transformation format" (UTF) is not by any means a "manipulation format". The representation of text in memory suitable for manipulation (e.g. text processing) is different, as a rule. You cannot use utf-8 encoded Russian text for analysis. No way.
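As an illustrative aside, a code point outside the BMP makes the difference between the three D types visible:

void example()
{
    dchar g = '\U0001D11E';    // one 21-bit code point
    wchar[] w = "\U0001D11E"w; // two UTF-16 code units (a surrogate pair)
    char[] c = "\U0001D11E"c;  // four UTF-8 code units
    assert(w.length == 2 && c.length == 4);
}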
 5. For other encodings, such as ISO-8859-1 or KOI-8R, you should use 
 ubyte/byte or some other method.  It is not valid to use char.
Vice versa. For utf-8 encoded strings you should use byte[] and for strings using single byte encodings you should use char.
 6. The FF byte (8-bit octet sequence) may never appear in any valid UTF-8 
 string.  Since char can only contain UTF-8 strings, it represents invalid 
 data if it contains such an 8-bit octet.
No objections to that: for UTF-8 octet sequences, 0xFF is an invalid octet value in the sequence. But please note: in the sequence of octets.
 7. Code points are the characters in Unicode; they are "compressed", so to 
 speak, in encodings such as UTF-8 and UTF-16.  UCS-2 and UCS-4 (UTF-32) 
 contain full code points.
Sorry, but UCS-4 *is not* UTF-32: http://www.unicode.org/reports/tr19/tr19-9.html I will ask again: what does char c = 'a'; mean to you? And the following in C/C++: #pragma(encoding,"KOI-8R") char c = '?'; ?
 8. If you were to examine the bytes in a wchar string, it may be possible 
 that the 8-bit octet sequence "FF" might show up.  Nonetheless, since char 
 cannot be used for UTF-16, this doesn't matter.
Not clear what you mean here. Could you clarify? Especially last statement.
 9. For the above reason, wchar (UTF-16) uses FFFF.  This character is 
 similar to FF for UTF-8.

 Given the above, I think I might answer your questions:

 1. UTF-8 character here could mean an 8-bit octet of code point.  In this 
 case, they are both the same and represent a perfectly valid character in 
 a string.
Sorry, I am not buying the following: "UTF-8 character" and "8-bit octet of code point".
 2. ASCII does not matter; char is not ASCII.  It happens that ASCII bytes 
 0 to 127 correspond to the same code points in Unicode, and the same 
 characters in UTF-8.
"ASCII does not matter"... for whom?
 3. It does not matter; KOI-8R encoded strings should not be placed in char 
 arrays.  You should use UTF-8 or another encoding for your Russian text.
"You should use UTF-8 or another encoding for your Russian text." Thanks. Advice from my side: Let me know when you will visit Russia. I will ask representatives of russian developer community and web authors to meet you. Advice per se: You should wear a helmet.
 4. If you wish to use KOI-8R (or any other encoding not based on Unicode) 
 you should not be using char arrays, which are meant for Unicode-related 
 encodings only.
The same advice as above.
 Obviously this is by far different from C, but that's the good thing about 
 D in many ways ;).
In Israel they have an old saying: "Not a human for Saturday but Saturday for a human". I do have practical experience in writing text processing software in encodings other than "US-ASCII" and have heard your advice about UTF-8 usage with interest. Please don't take all of this personally - no intention to harm anybody. Honestly and with a smile :) Andrew.
Jul 29 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
news:eah49h$2pi8$1 digitaldaemon.com...
 2. Sorry, an array of char (a single char is one single 8 bit octet) 
 contains UTF-8 bytes which are 8-bit octets.

 A single character, in UTF-8 encoding, may be 1 byte, 2 bytes, etc. Thus, 
 one char MAY NOT hold every single Unicode code point.  You may need an 
 array of multiple chars (bytes) to hold a single code point.

 This is not what it means to me; this is what it means.  A char is a 
 single 8-bit octet in a UTF-8 sequence.  They ARE NOT by any means code 
 points.

 I'm sorry that I did not specify "array", but I fear you are being 
 pedantic here; I'm sure you knew what I meant.

 A char is a single byte in a UTF-8 sequence.  I'm afraid I think calling 
 it an index to a glyph is dangerous, because it could be misunderstood. Again, 
 a single char CANNOT represent code points at or above 128, because 
 it is only ONE byte.

 A single char therefore may not represent a glyph all of the time, but 
 rather will represent a byte in the sequence of UTF-8 which may be used to 
 decode (along with other necessary bytes) the entirety of the code point.

 I hope I'm not being overly pedantic here, but I think your definition is 
 either lax or wrong.  But, that is only by its reading in English.
"your definition is either lax or wrong" Which one?

 3. Similarly, a single wchar may not represent full code points alone. 
 Arrays of wchars must be used for some characters (UTF-16 is also a 
 variable-width encoding.)
 4. I was ignoring endianess issues for simplicity.  My point here is that 
 a UTF-32 character directly represents  a code point.  Sorry again for the 
 non-pedantic laxness in my wording.
 5. Wrong.  There is no vice versa.  You may use byte or ubyte arrays for 
 your UTF-8 encoded strings and so forth.

 In case you didn't realize I was trying to say this:

 *char is not for single byte encodings.  char is ONLY for UTF-8.  char may 
 not be used for any other encoding unless you wish to have problems. char 
 is not the same as in other languages, e.g. C.*

 If you wish for a 8-bit octet value (such as a character in any encoding; 
 single byte or otherwise) you should not be using a char. That is not a 
 correct usage for them, that is what byte and ubyte are for.

 It is expected that chars in an array will follow a specific sequence; 
 that is, that they will be encoded in UTF-8.  It is not possible to 
 guarantee this if you use other encodings, which is why writefln() will 
 fail in such cases.

 6.  Correct.  And a single char (8-bit octet in a sequence of UTF-8 octets 
 encoded such) may never be FF because no single 8-bit octet anywhere in a 
 valid UTF-8 sequence may be FF.  Remember, char is not a code point.  It 
 is a single 8-bit octet in a sequence.

 7. My mistake.  I always consider them roughly the same (and for some 
 reason I thought that they had been made the same; but I assume your link 
 is current.)

 Your first code sample defines a single UTF-8 character, 'a'.  It is lucky 
 you did not try:

 char c = '蝿';

 (hopefully this character gets sent through to you properly; I will be 
 sending this message UTF-8 if my client allows it.)

 Because that would have failed.  A char cannot hold such a character, 
 which has a code point outside the range 0 - 127.  You would either need 
 to use an array of chars, or etc.

 Your second example means nothing to me.  I don't really care for such 
 pragmas or putting untranslated text directly in source code, and have 
 never dealt with it.

 8. You may not use a single char or an array of chars to represent UTF-16. 
 It may only represent UTF-8.  If you wish to use UTF-16, you must use 
 wchars.


 1 (the second): as I said, they are the same - do you not agree?  A 0 is 
 a zero is a zero.  It doesn't matter what he means.

 2 (the second): rules about ASCII do not apply to char.  Just as rules in 
 Portugal do not dissuade me here in Los Angeles.

 3 (the second): I have led the development of a multi-lingual software 
 which was used by quite a large number of people.  I also helped coordinate, 
 and later interfaced with, the assigned coordinator of translation.  This 
 software was translated into Thai, Chinese (simplified and traditional), 
 Russian, Italian, Spanish, Japanese, Catalan, and several other languages. 
 More than twenty anyway.

 At first I was suggesting that everyone use their own encoding and 
 handling that (sometimes painfully) in the code.  I would sometimes get 
 comments about using Unicode instead (from the translators who would have 
 preferred this.)  This software now uses UTF-8 and remains translated in 
 these languages.

 So, while I have not been to Russia (although I have worked with numerous 
 Russian developers, consumers, and translators) I would tend to disagree 
 with your assertion.  Also I do not like helmets.

 Obviously, I mean nothing to be taken personally as well; we are only 
 talking about UTF-8, Unicode, its usage in D, and being pedantic ;). And 
 helmets, we touched that subject too.  But not about each other, really.

 Thanks,
 -[Unknown]
Ok. Let's make a second round.

Some definitions:

A Unicode Code Point is an integer value (21 bits used) - an index in the global Unicode table. Such a global encoding table is maintained by the international Unicode Consortium. With some exceptions, each code point there has a corresponding glyph in a "global super font".

There are two types of encodings used for Unicode Code Points:
1) transport encodings - example: UTF. Main purpose - transport/transfer.
2) manipulation encodings - mappings of ranges of Unicode Code Points to the ranges 0..0xFF, 0..0xFFFF and 0..0xFFFFFFFF.

Transport encodings are used for transfer and long term storage of character data - texts. Manipulation encodings are used in programming for effective implementation of text processing functions.

As a rule a manipulation encoding maps some fragment (or two) of the Unicode Code Point set to the range 0..0xFF or 0..0xFFFF. The main characteristic of such a mapping: each value in the character vector (string) is in a 1:1 relationship with the corresponding codepoint in the Unicode set. The main idea of such an encoding - the character at some index in the string (vector) represents one code point in full.

I think the motivation for having manipulation encodings is simple and everyone understands it. Think about how you would implement caret positioning in an editbox, for example.

So the statement "char[] in D is supposed to hold only UTF-8 encoded text" immediately leads us to "D is not designed for effective text processing". Is this logic clear?

Again - let char be a char in D as it is now. Just don't initialize it with 0xFF, please. And let us be a bit careful with our utf-8 expectations - yes, it is an almost ideal transport encoding, but it is completely useless for text manipulation purposes - too expensive.

(last message on the subject)

Andrew Fedoniouk.
http://terrainformatica.com
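For the manipulation case described above (e.g. caret positioning), a sketch of the usual D answer - convert the transport form to a fixed-width form first, assuming Phobos' std.utf.toUTF32:

import std.utf; // toUTF32

void example()
{
    char[] utf8 = "тест";          // 8 UTF-8 code units
    dchar[] fixed = toUTF32(utf8); // 4 code points
    assert(fixed.length == 4);
    dchar third = fixed[2];        // O(1) indexing by code point
}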
Jul 29 2006
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
It really sounds to me like you're looking for UCS-2, then (e.g. as used 
in JavaScript, etc.)  For that, length calculation (which is what I 
presume you mean) is inexpensive.

As to your below assertion, I disagree.  What I think you meant was:

"char[] is not designed for effective multi-byte text processing."

I will agree that wchar[] would be much better in that case, and even 
that limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would 
probably make things significantly easier to work with.

Nonetheless, I was only commenting on how D is currently designed and 
implemented.  Perhaps there was some misunderstanding here.

Even so, I don't see how initializing it to FF makes any problem.  I 
think everyone understands that char[] is meant to hold UTF-8, and if 
you don't like that or don't want to use it, there are other methods 
available to you (heh, you can even use UTF-32!)

I don't see that the initialization of these variables will cause anyone 
any problems.  The only time I want such a variable initialized to 0 is 
when I use a numeric type, not a character type (and then, I try to use 
= 0 anyway.)

It seems like what you may want to do is simply this:

typedef ushort ucs2_t = 0;

And use that type.  Mission accomplished.  Or, use various different 
encodings - in which case I humbly suggest:

typedef ubyte latin1_t = 0;
typedef ushort ucs2_t = 0;
typedef ubyte koi8r_t = 0;
typedef ubyte big5_t = 0;

And so on, so on, so on...
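A sketch of what one of these hypothetical typedefs buys: a distinct type with its own (zero) initializer, separate from char's 0xFF:

typedef ubyte koi8r_t = 0; // distinct, zero-initialized byte type

void example()
{
    koi8r_t[] russian = new koi8r_t[16]; // a KOI-8R encoded buffer
    assert(russian[0] == 0);             // initialized to 0, not 0xFF
}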

-[Unknown]


 So statement: "char[] in D supposed to hold only UTF-8 encoded text"
 immediately leads us to "D is not designed for effective text processing".
 
 Is this logic clear?
Jul 29 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
news:eahcqu$4d$1 digitaldaemon.com...
 It really sounds to me like you're looking for UCS-2, then (e.g. as used 
 in JavaScript, etc.)  For that, length calculation (which is what I 
 presume you mean) is inexpensive.
Well, let's speak in terms of javascript if it is easier: String.substring(start, end)... What do these start, end mean to you? I don't think that you will be interested in the indexes of bytes in a utf-8 sequence.
 As to your below assertion, I disagree.  What I think you meant was:

 "char[] is not designed for effective multi-byte text processing."
What is "multi-byte text processing"? processing of text - sequence of codepoints of the alphabet? What is 'multi-byte' there doing? Multi-byte I beleive you mean is a method of encoding of codepoints for transmission. Is this correct? You need real codepoints to do something meaningfull with them... How these codepoints are stored in memory: as byte, word or dword depends on your task, amount of memory you have and alphabet you are using. E.g. if you are counting frequency of russian words used in internet you'd better do not do this in Java - twice as expensive as in C without any need. So phrase "multi-byte text processing" is fuzzy on this end. (Seems like I am not clear enough with my subset of English.)
 I will agree that wchar[] would be much better in that case, and even that 
 limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would probably 
 make things significantly easier to work with.

 Nonetheless, I was only commenting on how D is currently designed and 
 implemented.  Perhaps there was some misunderstanding here.

 Even so, I don't see how initializing it to FF makes any problem.  I think 
 everyone understands that char[] is meant to hold UTF-8, and if you don't 
 like that or don't want to use it, there are other methods available to 
 you (heh, you can even use UTF-32!)

 I don't see that the initialization of these variables will cause anyone 
 any problems.  The only time I want such a variable initialized to 0 is 
 when I use a numeric type, not a character type (and then, I try to use = 
 0 anyway.)

 It seems like what you may want to do is simply this:

 typedef ushort ucs2_t = 0;

 And use that type.  Mission accomplished.  Or, use various different 
 encodings - in which case I humbly suggest:

 typedef ubyte latin1_t = 0;
 typedef ushort ucs2_t = 0;
 typedef ubyte koi8r_t = 0;
 typedef ubyte big5_t = 0;

 And so on, so on, so on...

 -[Unknown]
I like the last statement "..., so on, so on...". Sounds promising enough. Just for information: strlen(const char* str) works with *all* single byte encodings in C. For multi-byte encodings (e.g. utf-8) it returns the length of the sequence in octets. But these are not chars in terms of C strictly speaking, but bytes - unsigned chars.
 So statement: "char[] in D supposed to hold only UTF-8 encoded text"
 immediately leads us to "D is not designed for effective text 
 processing".

 Is this logic clear? 
Jul 29 2006
parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Yes, you're right, most of the time I wouldn't (although a significant 
portion of the time, I would.)  Even so, this is why I would use UCS-2, 
and not UTF-8.  Why are you held up on char[]?

My point is that char[] is only trouble when you're dealing with text 
that is not ISO-8859-1.  I'm a great fan of localization and 
internationalization, but in all honesty the largest part of my text 
processing/analysis is with such strings.

Generally, user input I don't analyze.  Caret placement I leave to be 
handled by the libraries I use.  That is, when I use char[].

So again, I will agree that, in D, char[] is not a good choice for 
strings you are expecting to contain possibly-internationalized data.

I'm perfectly aware of what strlen (and str.length in D) do... it's 
similar to what they do in practically all other languages (unless you 
know the encoding is UCS-2, etc.)  For example, I work with PHP a lot 
and it doesn't even have (with the versions I support) built-in support 
for Unicode.  This makes text processing fun!

-[Unknown]


 "Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
 news:eahcqu$4d$1 digitaldaemon.com...
 It really sounds to me like you're looking for UCS-2, then (e.g. as used 
 in JavaScript, etc.)  For that, length calculation (which is what I 
 presume you mean) is inexpensive.
Well, let's speak in terms of javascript if it is easier: String.substring(start, end)... What do these start, end mean to you? I don't think that you will be interested in the indexes of bytes in a utf-8 sequence.
 As to your below assertion, I disagree.  What I think you meant was:

 "char[] is not designed for effective multi-byte text processing."
What is "multi-byte text processing"? processing of text - sequence of codepoints of the alphabet? What is 'multi-byte' there doing? Multi-byte I beleive you mean is a method of encoding of codepoints for transmission. Is this correct? You need real codepoints to do something meaningfull with them... How these codepoints are stored in memory: as byte, word or dword depends on your task, amount of memory you have and alphabet you are using. E.g. if you are counting frequency of russian words used in internet you'd better do not do this in Java - twice as expensive as in C without any need. So phrase "multi-byte text processing" is fuzzy on this end. (Seems like I am not clear enough with my subset of English.)
 I will agree that wchar[] would be much better in that case, and even that 
 limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would probably 
 make things significantly easier to work with.

 Nonetheless, I was only commenting on how D is currently designed and 
 implemented.  Perhaps there was some misunderstanding here.

 Even so, I don't see how initializing it to FF makes any problem.  I think 
 everyone understands that char[] is meant to hold UTF-8, and if you don't 
 like that or don't want to use it, there are other methods available to 
 you (heh, you can even use UTF-32!)

 I don't see that the initialization of these variables will cause anyone 
 any problems.  The only time I want such a variable initialized to 0 is 
 when I use a numeric type, not a character type (and then, I try to use = 
 0 anyway.)

 It seems like what you may want to do is simply this:

 typedef ushort ucs2_t = 0;

 And use that type.  Mission accomplished.  Or, use various different 
 encodings - in which case I humbly suggest:

 typedef ubyte latin1_t = 0;
 typedef ushort ucs2_t = 0;
 typedef ubyte koi8r_t = 0;
 typedef ubyte big5_t = 0;

 And so on, so on, so on...

 -[Unknown]
I like the last statement "..., so on, so on...". Sounds promising enough. Just for information: strlen(const char* str) works with *all* single byte encodings in C. For multi-byte encodings (e.g. utf-8) it returns the length of the sequence in octets. But these are not chars in terms of C strictly speaking, but bytes - unsigned chars.
 So statement: "char[] in D supposed to hold only UTF-8 encoded text"
 immediately leads us to "D is not designed for effective text 
 processing".

 Is this logic clear? 
Jul 30 2006
prev sibling parent reply Bruno Medeiros <brunodomedeirosATgmail SPAM.com> writes:
Unknown W. Brackets wrote:
 
 char c = '蝿';
 
 
 Because that would have failed.  A char cannot hold such a character, 
 which has a code point outside the range 0 - 127.  You would either need 
 to use an array of chars, or etc.
Which, speaking of which, shouldn't that be a compile time error? The compiler allows all kinds of *char mingling:

dchar dc = '蝿';
char sc = dc; // :-(

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
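A hedged sketch of the explicit spelling such a narrowing could require, assuming Phobos' std.utf.encode (which appends the UTF-8 code units for one code point):

import std.utf; // encode

void example()
{
    dchar dc = '蝿';
    char[] buf;
    encode(buf, dc);         // appends the UTF-8 sequence for dc
    assert(buf.length == 3); // U+877F encodes to three code units
}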
Jul 30 2006
parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Eek!  Yes, I would say (in my humble opinion) that this should be a 
compile-time error.

Obviously down-casting is more complicated.  I think the case of chars 
is much more obvious/clear than the case of ints, but then it's also a 
special-case.

-[Unknown]


 Unknown W. Brackets wrote:
 char c = '蝿';


 Because that would have failed.  A char cannot hold such a character, 
 which has a code point outside the range 0 - 127.  You would either 
 need to use an array of chars, or etc.
Which, speaking of which, shouldn't that be a compile time error? The compiler allows all kinds of *char mingling: dchar dc = '蝿'; char sc = dc; // :-(
Jul 30 2006
prev sibling parent reply Bruno Medeiros <brunodomedeirosATgmail SPAM.com> writes:
Unknown W. Brackets wrote:
 6. The FF byte (8-bit octet sequence) may never appear in any valid 
 UTF-8 string.  Since char can only contain UTF-8 strings, it represents 
 invalid data if it contains such an 8-bit octet.
 
You mentioned "8-bit octet" repeatedly in various posts. That's redundant: An "octet" is an 8-bit value. There are no "16-bit octets" and no "8-bit hextets" or stuff like that :P . I hope you knew that and were just distracted, but you kept saying that :) .
 1. UTF-8 character here could mean an 8-bit octet of code point.  In 
 this case, they are both the same and represent a perfectly valid 
 character in a string.
 
An "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly a "UTF-16 hextet" is called a UTF-16 'code unit'. An UTF-8 code unit holds a Unicode code point if the code point is <128. Otherwise multiple UTF-8 code units are needed to encode that code point. The confusion between 'code unit' and 'code point' is a long standing one. An "UTF-8 character" is a slighty ambiguous term. Does it a mean a UTF-8 code unit, or does it mean an Unicode character/codepoint encoded in a UTF-8 sequence? -- Bruno Medeiros - MSc in CS/E student http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Jul 30 2006
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
I use that terminology because I've read too many RFCs (consider the FTP 
RFC) - they all say "8-bit octet".  Anyway, I'm trying to be completely 
clear.

Code unit.  Yeah, I knew it was code something but it slipped my mind. 
I was sure that he'd either correct me or 8-bit octet/etc. would remain 
clear.  I hate it when I forget such obvious terms.

Anyway, my point in what you're quoting is very context-dependent. 
Walter mentioned that "0 is a valid UTF-8 character."  Andrew asked what 
this meant, so I explained that in this case (as you also clarified) it 
doesn't make any difference.  Regardless, it's a valid [whatever it is] 
and that meaning is not unclear.

-[Unknown]


 Unknown W. Brackets wrote:
 6. The FF byte (8-bit octet sequence) may never appear in any valid 
 UTF-8 string.  Since char can only contain UTF-8 strings, it 
 represents invalid data if it contains such an 8-bit octet.
You mentioned "8-bit octet" repeatedly in various posts. That's redundant: An "octet" is an 8-bit value. There are no "16-bit octets" and no "8-bit hextets" or stuff like that :P . I hope you knew that and were just distracted, but you kept saying that :) .
 1. UTF-8 character here could mean an 8-bit octet of code point.  In 
 this case, they are both the same and represent a perfectly valid 
 character in a string.
An "UTF-8 octet" is also called a UTF-8 'code unit'. Similarly a "UTF-16 hextet" is called a UTF-16 'code unit'. An UTF-8 code unit holds a Unicode code point if the code point is <128. Otherwise multiple UTF-8 code units are needed to encode that code point. The confusion between 'code unit' and 'code point' is a long standing one. An "UTF-8 character" is a slighty ambiguous term. Does it a mean a UTF-8 code unit, or does it mean an Unicode character/codepoint encoded in a UTF-8 sequence?
Jul 30 2006
parent Walter Bright <newshound digitalmars.com> writes:
Unknown W. Brackets wrote:
 Walter mentioned that "0 is a valid UTF-8 character."  Andrew asked what 
 this meant, so I explained that in this case (as you also clarified) it 
 doesn't make any difference.  Regardless, it's a valid [whatever it is] 
 and that meaning is not unclear.
I confess I often misuse the terminology.
Jul 30 2006
prev sibling parent reply Derek <derek psyc.ward> writes:
On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:


 ... but this is far from concept of null codepoint in character encodings.
Andrew and others, I've read through these posts a few times now, trying to understand the various points of view being presented. I keep getting the feeling that some people are deliberately trying *not* to understand what other people are saying. This is a sad situation.

Andrew seems to be stating ...
(a) char[] arrays should be allowed to hold encodings other than UTF-8, and thus initializing them with hex-FF byte values is not useful.
(b) UTF-8 encoding is not an efficient encoding for text analysis.
(c) UTF encodings are not optimized for data transmission (they contain redundant data in many contexts).
(d) The D type called 'char' may not have been the best name to use if it is meant to be used to contain only UTF-8 octets.

I, and many others including Walter, would probably agree to (b), (c) and (d). However, considering (b) and (c), UTF has benefits that outweigh these issues and there are ways to compensate for these too. Point (d) is a casualty of history and to change the language now to rename 'char' to anything else would be counter productive now. But feel free to implement your own flavour of D.<g>

Back to point (a)... The fact is, char[] is designed to hold UTF-8 encodings so don't try to force anything else into such arrays. If you wish to use some other encodings, then use a more appropriate data structure for it. For example, to hold 'KOI-8' encodings of Russian text, I would recommend using ubyte[] instead. To transform char[] to any other encoding you will have to provide the functions to do that, as I don't think it is Walter's or D's responsibility to do it. The point of initializing UTF-8 strings with illegal values is to help detect coding or logical mistakes. And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode codepoint *is* illegal. If you must store an octet of hex-FF then use ubyte[] arrays to do it.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
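A small sketch of that detection idea as it plays out in D (char.init is 0xFF, so untouched char data is visibly invalid UTF-8):

void example()
{
    char c;
    assert(c == 0xFF);        // illegal anywhere in a UTF-8 sequence
    char[] buf = new char[4]; // every element starts as 0xFF
    assert(buf[0] == 0xFF);   // forgetting to fill it is detectable
}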
Jul 30 2006
next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Derek wrote:
 Andrew seems to be stating ...
 (a) char[] arrays should be allowed to hold encodings other than UTF-8, and
 thus initializing them with hex-FF byte values is not useful.
 (b) UTF-8 encoding is not an efficient encoding for text analysis.
 (c) UTF encodings are not optimized for data transmission (they contain
 redundant data in many contexts).
 (d) The D type called 'char' may not have been the best name to use if it
 is meant to be used to contain only UTF-8 octets.
 
 I, and many others including Walter, would probably agree to (b), (c) and
 (d). However, considering (b) and (c), UTF has benefits that outweigh these
 issues and there are ways to compensate for these too. Point (d) is a
 casualty of history and to change the language now to rename 'char' to
 anything else would be counter productive now. But feel free to implement
 your own flavour of D.<g>
 
 Back to point (a)... The fact is, char[] is designed to hold UTF-8
 encodings so don't try to force anything else into such arrays. If you wish
 to use some other encodings, then use a more appropriate data structure for
 it. For example, to hold 'KOI-8' encodings of Russian text, I would
 recommend using ubyte[] instead. To transform char[] to any other encoding
 you will have to provide the functions to do that, as I don't think it is
 Walter's or D's responsibility to do it. The point of initializing UTF-8
 strings with illegal values is to help detect coding or logical mistakes.
 And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode
 codepoint *is* illegal. If you must store an octet of hex-FF then use
 ubyte[] arrays to do it.
Thank you for the insightful summary of the situation. I suspect, though, that (c) might be moot since it is my understanding that most actual data transmission equipment automatically compresses the data stream, and so the redundancy of the UTF-8 is minimized. Text itself tends to be highly compressible on top of that. Furthermore, because of the rate of expansion and declining costs of bandwidth, the cost of extra bytes is declining at the same time that the cost of the inflexibility of code pages is increasing.
Jul 30 2006
parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Indeed; this is the same situation as with XML transmission over the 
web.  It contains a huge amount of redundancy, and compresses so well 
that I've seen it do better than binary-based formats.

Although, I'm afraid that most of the time this compression isn't 
necessarily automatic, and too often is not done.

-[Unknown]


 I suspect, though, that (c) might be moot since it is my understanding 
 that most actual data transmission equipment automatically compresses 
 the data stream, and so the redundancy of the UTF-8 is minimized. Text 
 itself tends to be highly compressible on top of that.
 
 Furthermore, because of the rate of expansion and declining costs of 
 bandwidth, the cost of extra bytes is declining at the same time that 
 the cost of the inflexibility of code pages is increasing.
Jul 30 2006
prev sibling next sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Derek wrote:
 On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:
 
 
 ... but this is far from concept of null codepoint in character encodings.
Andrew and others, I've read through these posts a few times now, trying to understand the various points of view being presented. I keep getting the feeling that some people are deliberately trying *not* to understand what other people are saying. This is a sad situation.

Andrew seems to be stating ...
(a) char[] arrays should be allowed to hold encodings other than UTF-8, and thus initializing them with hex-FF byte values is not useful.
(b) UTF-8 encoding is not an efficient encoding for text analysis.
(c) UTF encodings are not optimized for data transmission (they contain redundant data in many contexts).
(d) The D type called 'char' may not have been the best name to use if it is meant to be used to contain only UTF-8 octets.

I, and many others including Walter, would probably agree to (b), (c) and (d). However, considering (b) and (c), UTF has benefits that outweigh these issues and there are ways to compensate for these too. Point (d) is a casualty of history and to change the language now to rename 'char' to anything else would be counter productive now. But feel free to implement your own flavour of D.<g>

Back to point (a)... The fact is, char[] is designed to hold UTF-8 encodings so don't try to force anything else into such arrays. If you wish to use some other encodings, then use a more appropriate data structure for it. For example, to hold 'KOI-8' encodings of Russian text, I would recommend using ubyte[] instead. To transform char[] to any other encoding you will have to provide the functions to do that, as I don't think it is Walter's or D's responsibility to do it. The point of initializing UTF-8 strings with illegal values is to help detect coding or logical mistakes. And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode codepoint *is* illegal. If you must store an octet of hex-FF then use ubyte[] arrays to do it.
Thank you for the clear summary. Apart from the obvious (d), I think there are two reasons this char confusion comes up now and then.

1. The documentation may not be clear enough on the point that char is really only meant to represent a UTF-8 code unit (or ASCII character) and that char[] is a UTF-8 encoded string. It seems it needs to be more stressed. People coming from C will automatically assume the D char is a C char equivalent. It should be mentioned that dchar is the only type that can represent any Unicode character, while char is a character only in ASCII. The C to D type conversion table doesn't help either: http://www.digitalmars.com/d/ctod.html It should say something like:

char => char  (UTF-8 and ASCII strings)
        ubyte (other byte based encodings)

2. All string functions in Phobos work only on char[] (and in some cases wchar[] and dchar[]), making the tools for working with other string encodings extremely limited. This is easily remedied by a templated string library, such as what I have proposed earlier.

/Oskar
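A sketch of that templated direction (foreach with a dchar loop variable decodes char[], wchar[] and dchar[] alike, so one template can serve all three widths):

// Counts code points in a string of any UTF width.
size_t countCodepoints(Char)(Char[] str)
{
    size_t n = 0;
    foreach (dchar c; str)
        n++;
    return n;
}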
Jul 30 2006
prev sibling next sibling parent Bruno Medeiros <brunodomedeirosATgmail SPAM.com> writes:
Derek wrote:
 On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:
 
 
 ... but this is far from concept of null codepoint in character encodings.
Andrew and others, I've read through these posts a few times now, trying to understand the various points of view being presented. I keep getting the feeling that some people are deliberately trying *not* to understand what other people are saying. This is a sad situation.

Andrew seems to be stating ...
(a) char[] arrays should be allowed to hold encodings other than UTF-8, and thus initializing them with hex-FF byte values is not useful.
(b) UTF-8 encoding is not an efficient encoding for text analysis.
(c) UTF encodings are not optimized for data transmission (they contain redundant data in many contexts).
(d) The D type called 'char' may not have been the best name to use if it is meant to be used to contain only UTF-8 octets.

I, and many others including Walter, would probably agree to (b), (c) and (d). However, considering (b) and (c), UTF has benefits that outweigh these issues and there are ways to compensate for these too. Point (d) is a casualty of history and to change the language now to rename 'char' to anything else would be counter productive now. But feel free to implement your own flavour of D.<g>

Back to point (a)... The fact is, char[] is designed to hold UTF-8 encodings so don't try to force anything else into such arrays. If you wish to use some other encodings, then use a more appropriate data structure for it. For example, to hold 'KOI-8' encodings of Russian text, I would recommend using ubyte[] instead. To transform char[] to any other encoding you will have to provide the functions to do that, as I don't think it is Walter's or D's responsibility to do it. The point of initializing UTF-8 strings with illegal values is to help detect coding or logical mistakes. And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode codepoint *is* illegal. If you must store an octet of hex-FF then use ubyte[] arrays to do it.
Good summary. Additionally I'd like to say that, to hold 'KOI-8' encodings, you could create a typedef instead of just using a ubyte:

typedef ubyte koi8char;

Thus you are able to express in the code what the encoding of such a ubyte is, as it is part of the type information. And then the program is able to work with it:

koi8char toUpper(koi8char ch) { ...
int wordCount(koi8char[] str) { ...
dchar[] toUTF32(koi8char[] str) { ...

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
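A hedged sketch of what one of those conversion functions could look like; the two table entries below are illustrative only - a real toUTF32 needs the full 128-entry KOI-8R upper half:

typedef ubyte koi8char;

dchar[] toUTF32(koi8char[] str)
{
    dchar[] result;
    foreach (koi8char ch; str)
    {
        dchar d;
        if (ch < 0x80)       d = cast(dchar) ch; // ASCII maps directly
        else if (ch == 0xC1) d = '\u0430';       // KOI-8R 'а'
        else if (ch == 0xC5) d = '\u0435';       // KOI-8R 'е'
        else                 d = '\uFFFD';       // replacement character
        result ~= d;
    }
    return result;
}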
Jul 30 2006
prev sibling next sibling parent reply Serg Kovrov <user domain.invalid> writes:
Maybe I missed the point here, correct me if I misunderstood.

This is how I see the problem with char[] as a utf-8 *string*. The length 
of an array of chars is not always the count of characters, but rather the 
size of the array in bytes. This makes no sense to me. For that purpose I 
would like to see separate properties.

For example,
char[] str = "тест";
word "test" in russian - 4 cyrillic characters, would give you 
str.length 8, which make no use of this length property if you not sure 
that string is latin characters only.
Jul 31 2006
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Serg Kovrov wrote:
 Maybe I missed the point here, correct me if I misunderstood.
You have understood correctly.
 This is how I see the problem with char[] as a utf-8 *string*. The length 
 of an array of chars is not always the count of characters, but rather the 
 size of the array in bytes. This makes no sense to me. For that purpose I 
 would like to see separate properties.
Having char[].length return something other than the actual number of char-units would break its array semantics.
 For example,
 char[] str = "тест";
 word "test" in russian - 4 cyrillic characters, would give you 
 str.length 8, which make no use of this length property if you not sure 
 that string is latin characters only.
It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters.

It is easy to implement your own character count though:

size_t count(char[] arr)
{
    size_t n = 0;
    foreach (dchar c; arr) // decodes UTF-8 one code point at a time
        n++;
    return n;
}

assert("тест".count() == 4);

Also note that:

assert("тест"d.length == 4);

/Oskar
Jul 31 2006
next sibling parent reply Serg Kovrov <user domain.invalid> writes:
* Oskar Linde:
 Having char[].length return something other than the actual number
 of char-units would break its array semantics.
Yes, I see. That's why I do not much like char[] as a substitute for a string type.
 It is actually not very often that you need to count the number
 of characters as opposed to the number of (UTF-8) code units.
Why not use separate properties for that?
 Counting the number of characters is also a rather expensive
 operation. 
Indeed. Storing it once as a property (and updating as needed) is better than calculating it each time you need it.
 All the ordinary operations (searching, slicing, concatenation, 
 sub-string  search, etc) operate on code units rather than
 characters.
Yes, that's a tough one. If you want to slice an array - use the array unit count for that. But if you want to slice a *string* (substring, search, etc) - use the character count for that. Maybe there should be interchangeable types - string and char[], with different length, slice, find, etc. behaviors? I mean it could be the same actual type, but with different contexts for properties. And besides, string as opposed to char[] is more pleasant to my eyes =)
Jul 31 2006
next sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Serg Kovrov wrote:
 * Oskar Linde:
 Counting the number of characters is also a rather expensive
 operation. 
Indeed. Storing it once as a property (and updating as needed) is better than calculating it each time you need it.
Store where? You can't put it in the array data itself without breaking slicing, and putting it in the reference introduces problems with it getting out of date if the array is modified through another reference (without enforcing COW, that is).
Jul 31 2006
parent reply Serg Kovrov <user domain.invalid> writes:
* Frits van Bommel:
 Serg Kovrov wrote:
 * Oskar Linde:
 Counting the number of characters is also a rather expensive
 operation. 
Indeed. Storing it once as a property (and updating as needed) is better than calculating it each time you need it.
Store where? You can't put it in the array data itself without breaking slicing, and putting it in the reference introduces problems with it getting out of date if the array is modified through another reference (without enforcing COW, that is).
I have to say that I do not have an idea where to store it, nor where the current length property is stored. I'm really glad that the compiler does it for me. As a language user I just want to be confident that the compiler does it wisely, and focus on my domain problems.
Jul 31 2006
parent Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Serg Kovrov wrote:
 * Frits van Bommel:
 Serg Kovrov wrote:
 * Oskar Linde:
 Counting the number of characters is also a rather expensive
 operation. 
Indeed. Storing it once as a property (and updating as needed) is better than calculating it each time you need it.
Store where? You can't put it in the array data itself without breaking slicing, and putting it in the reference introduces problems with it getting out of date if the array is modified through another reference (without enforcing COW, that is).
I have to say that I do not have an idea where to store it, nor where the current length property is stored. I'm really glad that the compiler does it for me. As a language user I just want to be confident that the compiler does it wisely, and focus on my domain problems.
The length is stored in the reference, but the character count would not only depend on the memory location and size (which the reference holds) but also on the data it holds (at least for char and wchar), which may be accessed through different references as well. That's the problem I was pointing out.
Jul 31 2006
prev sibling next sibling parent Hasan Aljudy <hasan.aljudy gmail.com> writes:
Serg Kovrov wrote:
 * Oskar Linde:
 
 Having char[].length return something other than the actual number
 of char-units would break its array semantics.
Yes, I see. That's why I don't much like char[] as a substitute for a string type.
 It is actually not very often that you need to count the number
 of characters as opposed to the number of (UTF-8) code units.
Why not use separate properties for that?
 Counting the number of characters is also a rather expensive
 operation. 
Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.
 All the ordinary operations (searching, slicing, concatenation, 
 sub-string  search, etc) operate on code units rather than
 characters.
Yes, that's a tough one. If you want to slice an array, use the array's unit count for that. But if you want to slice a *string* (substring, search, etc.), use the character count for that. Maybe there should be interchangeable types, string and char[], with different length, slice, find, etc. behaviors? I mean, it could be the same actual type, but with different contexts for properties. And besides, string as opposed to char[] is more pleasant for my eyes =)
I say this calls for a proper *standard* String class ... <g>
Jul 31 2006
prev sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Serg Kovrov wrote:
 * Oskar Linde:
 Having char[].length return something other than the actual number
 of char-units would break its array semantics.
Yes, I see. That's why I don't much like char[] as a substitute for a string type.
 It is actually not very often that you need to count the number
 of characters as opposed to the number of (UTF-8) code units.
Why not use separate properties for that?
 Counting the number of characters is also a rather expensive
 operation. 
Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.
The question is, how often do you need it? Especially if you are not indexing by character.
 All the ordinary operations (searching, slicing, concatenation, 
 sub-string  search, etc) operate on code units rather than
 characters.
Yes, that's a tough one. If you want to slice an array, use the array's unit count for that. But if you want to slice a *string* (substring, search, etc.), use the character count for that.
Why? Code unit indices will work equally well for substrings, searching etc.
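For instance, a minimal sketch of how searching and slicing by code-unit index work out in practice (std.string.find returns a code-unit index):

import std.string;

void main()
{
    char[] s = "тест и ещё";            // UTF-8 text
    int i = std.string.find(s, "и");    // index in code units
    char[] tail = s[i .. s.length];     // slicing by code units keeps valid UTF-8
    assert(tail[0 .. 2] == "и");        // "и" occupies 2 code units
}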
 Maybe there should be interchangeable types, string and char[], with
 different length, slice, find, etc. behaviors? I mean, it could be the same
 actual type, but with different contexts for properties.
Indexing an UTF-8 encoded string by character rather than code unit is expensive in either time or memory. If you for some reason need character indexing, use a dchar[].
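A minimal illustration of that trade-off - dchar[] buys constant-time indexing by paying four bytes per code point:

void main()
{
    dchar[] s = "тест"d;   // one array element per code point
    assert(s.length == 4);
    assert(s[1] == 'е');   // O(1) indexing by character
}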
 And besides, string as opposite to char[] is more pleasant for my eyes =)
There is always alias.
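That is, a one-line sketch of the alias in question (the name `string` here is just an illustration):

alias char[] string;        // the prettier name becomes a synonym for char[]

string greeting = "hello";  // still a plain char[] underneath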
Jul 31 2006
parent Serg Kovrov <user domain.invalid> writes:
* Oskar Linde:
 Serg Kovrov wrote:
 * Oskar Linde:
 Having char[].length return something other than the actual number
 of char-units would break its array semantics.
Yes, I see. That's why I don't much like char[] as a substitute for a string type.
 It is actually not very often that you need to count the number
 of characters as opposed to the number of (UTF-8) code units.
Why not use separate properties for that?
 Counting the number of characters is also a rather expensive
 operation. 
Indeed. Storing it once as a property (and updating it as needed) is better than calculating it each time you need it.
The question is, how often do you need it? Especially if you are not indexing by character.
 All the ordinary operations (searching, slicing, concatenation, 
 sub-string  search, etc) operate on code units rather than
 characters.
Yes, that's a tough one. If you want to slice an array, use the array's unit count for that. But if you want to slice a *string* (substring, search, etc.), use the character count for that.
Why? Code unit indices will work equally well for substrings, searching etc.
 Maybe there should be interchangeable types, string and char[], with
 different length, slice, find, etc. behaviors? I mean, it could be the same
 actual type, but with different contexts for properties.
Indexing an UTF-8 encoded string by character rather than code unit is expensive in either time or memory. If you for some reason need character indexing, use a dchar[].
 And besides, string as opposite to char[] is more pleasant for my eyes =)
There is always alias.
You've got some valid points, I just showed mine.
Jul 31 2006
prev sibling next sibling parent Walter Bright <newshound digitalmars.com> writes:
Oskar Linde wrote:
 It is easy to implement your own character count though:
 
 size_t count(char[] arr) {
     size_t n = 0;
     foreach(dchar c; arr)   // renamed: the loop variable must not shadow the counter
         n++;
     return n;
 }
 
 assert("тест".count() == 4);
std.utf.toUCSindex(s, s.length) will also give the character count.
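A quick check of that, as a minimal sketch:

import std.utf;

void main()
{
    char[] s = "тест";                            // 4 characters, 8 code units
    assert(s.length == 8);
    assert(std.utf.toUCSindex(s, s.length) == 4); // character count
}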
Jul 31 2006
prev sibling parent reply Thomas Kuehne <thomas-dloop kuehne.cn> writes:
 Oskar Linde wrote on 2006-07-31:
 Serg Kovrov wrote:
 For example,
 char[] str = "тест";
 word "test" in Russian - 4 Cyrillic characters - would give you
 str.length 8, which makes this length property useless if you are not
 sure that the string contains Latin characters only.
It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters. It is easy to implement your own character count though:

size_t count(char[] arr) {
    size_t n = 0;
    foreach(dchar c; arr)
        n++;
    return n;
}

assert("тест".count() == 4);

Also note that:

assert("тест"d.length == 4);
I hate to be pedantic, but dchar[] can only be used to count the code points - not the characters. A "character" can be composed of more than one code point/dchar. This feature is frequently used for accents, marks and some Asian scripts. -> http://www.unicode.org

Thomas
Jul 31 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Thomas Kuehne" <thomas-dloop kuehne.cn> wrote in message 
news:ls52q3-3o8.ln1 birke.kuehne.cn...
 Oskar Linde wrote on 2006-07-31:
 Serg Kovrov wrote:
 For example,
 char[] str = "тест";
 word "test" in Russian - 4 Cyrillic characters - would give you
 str.length 8, which makes this length property useless if you are not
 sure that the string contains Latin characters only.
It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters. It is easy to implement your own character count though:

size_t count(char[] arr) {
    size_t n = 0;
    foreach(dchar c; arr)
        n++;
    return n;
}

assert("тест".count() == 4);

Also note that:

assert("тест"d.length == 4);
I hate to be pedantic, but dchar[] can only be used to count the code points - not the characters. A "character" can be composed of more than one code point/dchar. This feature is frequently used for accents, marks and some Asian scripts. -> http://www.unicode.org
Right, Thomas, an umlaut as a separate code point can exist, so A-with-umlaut can be represented by two code points. But as far as I remember the intention was, and is, to have in Unicode also all full (precomposed) forms like "A-with-umlaut". So you can always "compress" multi-code-point forms into single-point counterparts. This way "тест"d.length == 4 will be true - it just depends on your text parser.

Andrew.
 Thomas


Jul 31 2006
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:
 Andrew Fedoniouk wrote on 2006-07-31:
 "Thomas Kuehne" <thomas-dloop kuehne.cn> wrote in message 
 news:ls52q3-3o8.ln1 birke.kuehne.cn...
 Oskar Linde wrote on 2006-07-31:
 Serg Kovrov wrote:
 For example,
 char[] str = "тест";
 word "test" in Russian - 4 Cyrillic characters - would give you
 str.length 8, which makes this length property useless if you are not
 sure that the string contains Latin characters only.
It is actually not very often that you need to count the number of characters as opposed to the number of (UTF-8) code units. Counting the number of characters is also a rather expensive operation. All the ordinary operations (searching, slicing, concatenation, sub-string search, etc) operate on code units rather than characters. It is easy to implement your own character count though:

size_t count(char[] arr) {
    size_t n = 0;
    foreach(dchar c; arr)
        n++;
    return n;
}

assert("тест".count() == 4);

Also note that:

assert("тест"d.length == 4);
I hate to be pedantic, but dchar[] can only be used to count the code points - not the characters. A "character" can be composed of more than one code point/dchar. This feature is frequently used for accents, marks and some Asian scripts. -> http://www.unicode.org
Right, Thomas, an umlaut as a separate code point can exist, so A-with-umlaut can be represented by two code points. But as far as I remember the intention was, and is, to have in Unicode also all full (precomposed) forms like "A-with-umlaut".
I won't argue about the intention here. Post this statement on <unicode unicode.org> (http://www.unicode.org/consortium/distlist.html) and let's see the various responses ;)
 So you can always "compress" multi-code-point forms into
 single-point counterparts.
Not always. For a common use case see

Thomas
Aug 02 2006
prev sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
Derek, thanks for summarizing all this, but I will put it as follows.

There are two types of text encodings for two distinct use cases:
  1) transport/storage encodings - one Unicode code point
      represented as multiple code units of the encoded sequence (e.g. UTF);
      string.length returns the length in code units of the encoding - not
      characters.

  2) manipulation encodings - one Unicode code point represented
      as one and only one element of the sequence (e.g. one byte, word or
      dword);
      string.length here returns the length in code points (mapped character
      glyphs).

The problem as I see it is this:
D proposes to use a transport encoding for manipulation purposes,
which is the main problem here, IMO - transport encodings are not
designed for manipulation - it is extremely difficult to use
them for manipulation in practice, as we may see.

One more problem:

Encodings like UTF-8 and UTF-16 are almost useless
with, let's say, the Windows API - say the TextOutA and TextOutW functions.
Neither of them will accept D's char[] and wchar[] directly.

- ***A  functions in Windows take byte string (LPSTR) and current
  codepage id  to render text. ( byte + codepage = Unicode Code Point )

- ***W functions in Windows use LPWSTR things which are
  sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
  (  cast(dword) word  = Unicode Code Point )
  Only few functions in Windows API treat LPWSTR as UTF-16.

-----------------
"D strings are utf encoded sequences only" is a design mistake, IMO.
On disk (serialized form) - yes. But not in memory for manipulation please.

Andrew Fedoniouk.
http://terrainformatica.com



"Derek" <derek psyc.ward> wrote in message 
news:177u058vq8cdj.koexsq99n112.dlg 40tude.net...
 On Sat, 29 Jul 2006 13:27:14 -0700, Andrew Fedoniouk wrote:


 ... but this is far from concept of null codepoint in character 
 encodings.
Andrew and others, I've read through these posts a few times now, trying to understand the various points of view being presented. I keep getting the feeling that some people are deliberately trying *not* to understand what other people are saying. This is a sad situation.

Andrew seems to be stating ...

(a) char[] arrays should be allowed to hold encodings other than UTF-8, and thus initializing them with hex-FF byte values is not useful.
(b) UTF-8 encoding is not an efficient encoding for text analysis.
(c) UTF encodings are not optimized for data transmission (they contain redundant data in many contexts).
(d) The D type called 'char' may not have been the best name to use if it is meant to be used to contain only UTF-8 octets.

I, and many others including Walter, would probably agree to (b), (c) and (d). However, considering (b) and (c), UTF has benefits that outweigh these issues and there are ways to compensate for these too. Point (d) is a casualty of history, and to change the language now to rename 'char' to anything else would be counterproductive. But feel free to implement your own flavour of D. <g>

Back to point (a)... The fact is, char[] is designed to hold UTF-8 encodings, so don't try to force anything else into such arrays. If you wish to use some other encodings, then use a more appropriate data structure for it. For example, to hold 'KOI-8' encodings of Russian text, I would recommend using ubyte[] instead. To transform char[] to any other encoding you will have to provide the functions to do that, as I don't think it is Walter's or D's responsibility to do it.

The point of initializing UTF-8 strings with illegal values is to help detect coding or logical mistakes. And a leading octet with the value of hex-FF in a UTF-8 encoded Unicode codepoint *is* illegal. If you must store an octet of hex-FF then use ubyte[] arrays to do it.

--
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
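For example, a sketch of that recommendation (the file name and the koi8ToUtf8 helper are hypothetical):

import std.file;

void main()
{
    // Raw KOI8-R bytes belong in ubyte[], not char[]:
    ubyte[] koi8Text = cast(ubyte[]) std.file.read("russian.koi8");

    // Converting to UTF-8 for D's string routines would need a
    // 256-entry mapping table, e.g. a user-supplied helper:
    // char[] utf8Text = koi8ToUtf8(koi8Text);
}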
Jul 31 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipualtion in practice as we may see.
I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem. It's also certainly easier than codepage-based multibyte designs like shift-JIS (I used to write code for shift-JIS).
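A minimal sketch of that foreach support - the loop variable's type selects the decoding:

import std.stdio;

void main()
{
    char[] s = "Grüße";       // 5 characters in 7 UTF-8 code units
    foreach (dchar c; s)      // foreach decodes code points on the fly
        writefln("U+%04X", cast(uint) c);
}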
 Encoding like UTF-8 and UTF-16 are almost useless
 with let's say Windows API, say TextOutA and TextOutW functions.
 Neither one of them will accept D's char[] and wchar[] directly.
 
 - ***A  functions in Windows take byte string (LPSTR) and current
   codepage id  to render text. ( byte + codepage = Unicode Code Point )
Win9x only supports the A functions, and Phobos does a translation of the output into the Win9x code page when running on Win9x. Of course, this fails when one has characters not supported by Win9x, but code pages aren't going to help that either. Win9x is obsolete anyway, and there's no reason to cripple a new language by accommodating the failures of an obsolete system. When running on NT or later Windows, the W functions are used instead which work directly with UTF-16. Later Windows also support UTF-8 with the A functions.
 - ***W functions in Windows use LPWSTR things which are
   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
   (  cast(dword) word  = Unicode Code Point )
   Only few functions in Windows API treat LPWSTR as UTF-16.
BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages. So, the W functions can and do take UTF-16 directly, and in fact the Phobos implementation does use the W functions, transmitting wchar[] to them, and it works fine. The neat thing about Phobos is it adapts to whether you are using Win9x, full 32 bit Windows, or Linux, and adjusts the char output accordingly so it "just works."
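A sketch of such a call (the MessageBoxW binding is declared inline here for illustration; the real declaration lives in the Windows API headers):

import std.utf;

extern (Windows) int MessageBoxW(void* hwnd, wchar* text, wchar* caption, uint type);

void main()
{
    char[] msg = "héllo";    // UTF-8 inside the program
    // toUTF16z re-encodes to zero-terminated UTF-16 for the W API:
    MessageBoxW(null, toUTF16z(msg), toUTF16z("D"), 0 /* MB_OK */);
}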
 -----------------
 "D strings are utf encoded sequences only" is a design mistake, IMO.
 On disk (serialized form) - yes. But not in memory for manipulation please.
There isn't any better method of handling international character sets in a portable way. Code pages have serious, crippling, unfixable problems - including all the downsides of multibyte systems (because the Asian code pages are multibyte).
Jul 31 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eam1ec$10e1$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipualtion in practice as we may see.
I disagree the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
Sorry, but strings in DMDScript are quite different, in these terms:

0) there is no such thing as char in JavaScript;
1) strings are Strings - not vectors of octets - js::string[] and d::char[] are different things;
2) they are not supposed to be used by any OS API;
3) there are 12 or so methods of the String class in JS - a limited perimeter - so what model you've chosen to store them is irrelevant; in some implementations they are represented even by a list of fixed runs.
 It's also certainly easier than codepage based multibyte designs like 
 shift-JIS (I used to write code for shift-JIS).

 Encoding like UTF-8 and UTF-16 are almost useless
 with let's say Windows API, say TextOutA and TextOutW functions.
 Neither one of them will accept D's char[] and wchar[] directly.

 - ***A  functions in Windows take byte string (LPSTR) and current
   codepage id  to render text. ( byte + codepage = Unicode Code Point )
Win9x only supports the A functions,
You are not right here. TextOutA and TextOutW are both supported by Win98. And the intention in Harmonia was to use only those ***W functions which come out of the box on Win98 (without the need for MSLU).
 and Phobos does a translation of the output into the Win9x code page when 
 running on Win9x. Of course, this fails when one has characters not 
 supported by Win9x, but code pages aren't going to help that either.

 Win9x is obsolete anyway, and there's no reason to cripple a new language 
 by accommodating the failures of an obsolete system.
There is a huge market of embedded devices. If you think that computer evolution expands only in the more-RAM-and-speed direction, then you are in trouble. http://www.litepc.com/graphics/eossystem.jpg
 When running on NT or later Windows, the W functions are used instead 
 which work directly with UTF-16. Later Windows also support UTF-8 with the 
 A functions.
http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx
 - ***W functions in Windows use LPWSTR things which are
   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
   (  cast(dword) word  = Unicode Code Point )
   Only few functions in Windows API treat LPWSTR as UTF-16.
BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.
Sorry, this scares me: "BMP is a proper subset of UTF-16". UTF-16 is a group name for *byte stream encodings* (UTF-16LE and UTF-16BE) of the Unicode code set. BTW: which one of these UTFs does D use? Platform-dependent, I believe.
 So, the W functions can and do take UTF-16 directly, and in fact the 
 Phobos implementation does use the W functions, transmitting wchar[] to 
 them, and it works fine.

 The neat thing about Phobos is it adapts to whether you are using Win9x, 
 full 32 bit Windows, or Linux, and adjusts the char output accordingly so 
 it "just works."
It should work well. Efficiently, I mean. The language shall be agnostic to the meaning of char as much as possible. It shall not prevent you from writing efficient algorithms.
 -----------------
 "D strings are utf encoded sequences only" is a design mistake, IMO.
 On disk (serialized form) - yes. But not in memory for manipulation 
 please.
There isn't any better method of handling international character sets in a portable way. Code pages have serious, crippling, unfixable problems - including all the downsides of multibyte systems (because the Asian code pages are multibyte).
We are speaking different languages:

A: "strings are utf encoded sequences only" is a design mistake.
W: "use any encoding other than utf" is a design mistake.

Different meaning, eh?

Forget about codepages. Let those who are aware of them deal with them efficiently. "Codepage" (c) Walter (e.g. ASCII) is an efficient way of representing text. That is it. Others who can afford the full set will work with full 21-bit values. Practically it is enough to have 16 bits (BMP), but...

Andrew Fedoniouk.
http://terrainformatica.com
Jul 31 2006
next sibling parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Mon, 31 Jul 2006 18:23:19 -0700, Andrew Fedoniouk wrote:

 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eam1ec$10e1$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipualtion in practice as we may see.
I disagree the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
 Sorry, but strings in DMDScript are quite different, in these terms:
 0) there is no such thing as char in JavaScript;
 1) strings are Strings - not vectors of octets - js::string[] and d::char[] are different things;
 2) they are not supposed to be used by any OS API;
 3) there are 12 or so methods of the String class in JS - a limited perimeter - so what model you've chosen to store them is irrelevant; in some implementations they are represented even by a list of fixed runs.
For what it's worth, to do *character* manipulation I convert strings to UTF-32, do my stuff and convert back to the initial format.

char[] somefunc(char[] x)
{
    return std.utf.toUTF8( somefunc( std.utf.toUTF32(x) ) );
}

wchar[] somefunc(wchar[] x)
{
    return std.utf.toUTF16( somefunc( std.utf.toUTF32(x) ) );
}

dchar[] somefunc(dchar[] x)
{
    dchar[] result;
    ...
    return result;
}

This seems to work fast enough for my purposes. DBuild (nee Build) uses this a lot.

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
1/08/2006 11:45:36 AM
Jul 31 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Derek Parnell" <derek nomail.afraid.org> wrote in message 
news:8n0koj5wjiio.qwc8ok4mrvr3$.dlg 40tude.net...
 On Mon, 31 Jul 2006 18:23:19 -0700, Andrew Fedoniouk wrote:

 "Walter Bright" <newshound digitalmars.com> wrote in message
 news:eam1ec$10e1$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipualtion in practice as we may see.
I disagree the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
 Sorry, but strings in DMDScript are quite different, in these terms:
 0) there is no such thing as char in JavaScript;
 1) strings are Strings - not vectors of octets - js::string[] and d::char[] are different things;
 2) they are not supposed to be used by any OS API;
 3) there are 12 or so methods of the String class in JS - a limited perimeter - so what model you've chosen to store them is irrelevant; in some implementations they are represented even by a list of fixed runs.
 For what it's worth, to do *character* manipulation I convert strings to UTF-32, do my stuff and convert back to the initial format.

 char[] somefunc(char[] x)
 {
     return std.utf.toUTF8( somefunc( std.utf.toUTF32(x) ) );
 }

 wchar[] somefunc(wchar[] x)
 {
     return std.utf.toUTF16( somefunc( std.utf.toUTF32(x) ) );
 }

 dchar[] somefunc(dchar[] x)
 {
     dchar[] result;
     ...
     return result;
 }

 This seems to work fast enough for my purposes. DBuild (nee Build) uses this a lot.
 --
Derek, using dchar (the ultimate char) is perfectly fine in DBuild's(*) circumstances - you are parsing, not dealing with the OS on each line. Using dchar has a drawback, though - you need to recreate all string primitive ops from scratch, including RegExp, etc. Again, dchar is OK - the only thing not OK is the strange selection of dchar's null/nothing/nihil/nil/whatever value.

(* dbuild does not sound good in Russian - it is very close to "idiot" in the medical sense; consider builDer/buildDer/creaDor, for example - with a red D in the middle - stylish at least.)

Andrew.
Jul 31 2006
parent reply "John Reimer" <terminal.node gmail.com> writes:
On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk  
<news terrainformatica.com> wrote:

 (* dbuild does not sound good in russian - very close to idiot in medical
 meaning
 consider builDer/buildDer/creaDor for example - with red D in the middle  
 -
 stylish at least)

 Andrew.
Really, Andrew, you are getting carried away in your demands. You almost sound self-centered :). dbuild is not made for Russians only. Almost any English word conceived for a name might just have some sort of bad connotation in any one of the thousands of languages in this world. Why should anyone feel obligated to accommodate your culture here? I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here. That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D

-JJR
Jul 31 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"John Reimer" <terminal.node gmail.com> wrote in message 
news:op.tdlccd0b6gr7xp epsilon-alpha...
 On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk 
 <news terrainformatica.com> wrote:

 (* dbuild does not sound good in Russian - it is very close to "idiot" in the
 medical sense; consider builDer/buildDer/creaDor, for example - with a red D
 in the middle - stylish at least.)

 Andrew.
Really, Andrew, you are getting carried away in your demands. You almost sound self-centered :). dbuild is not made for Russians only. Almost any English word conceived for a name might just have some sort of bad connotation in any one of the thousands of languages in this world. Why should anyone feel obligated to accommodate your culture here? I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here. That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D
:D BTW: debilita [lat.], as a word with many variations, is used in almost all languages directly derived from Latin. You can say d'buil' on the streets of, say, Munich and they will understand you. Trust me, free beer will be yours. So it is far from Russian-centric :-P

Andrew.
Aug 01 2006
parent =?ISO-8859-1?Q?=22R=E9my_J=2E_A=2E_Mou=EBza=22?= writes:
Andrew Fedoniouk wrote:
 "John Reimer" <terminal.node gmail.com> wrote in message 
 news:op.tdlccd0b6gr7xp epsilon-alpha...
 
On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk 
<news terrainformatica.com> wrote:


(* dbuild does not sound good in Russian - it is very close to "idiot" in the
medical sense; consider builDer/buildDer/creaDor, for example - with a red D
in the middle - stylish at least.)

Andrew.
Really, Andrew, you are getting carried away in your demands. You almost sound self-centered :). dbuild is not made for Russians only. Almost any English word conceived for a name might just have some sort of bad connotation in any one of the thousands of languages in this world. Why should anyone feel obligated to accommodate your culture here? I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here. That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D
:D BTW: debilita [lat.], as a word with many variations, is used in almost all languages directly derived from Latin. You can say d'buil' on the streets of, say, Munich and they will understand you. Trust me, free beer will be yours. So it is far from Russian-centric :-P Andrew.
As "débile" in French, pronounced something like "day bill". One has to correctly pronounce the ending D of dbuild to disambiguate it, but since we generally know what we're speaking about in an IT-related discussion, it should be OK - or even funny, if we ambiguously pronounce it in the presence of humorous enough people.
Aug 01 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eam1ec$10e1$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipualtion in practice as we may see.
I disagree the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
 Sorry, but strings in DMDScript are quite different, in these terms:
 0) there is no such thing as char in JavaScript.
ECMAScript 262-3 (Javascript) defines the source character set to be UTF-16, and the source character set is what JS programs manipulate for strings and characters.
 1) strings are Strings - not vectors of octets - js::string[] and d::char[]
 are different things;
 2) they are not supposed to be used by any OS API;
 3) there are 12 or so methods of the String class in JS - a limited perimeter -
 so what model you've chosen to store them is irrelevant;
 in some implementations they are represented even by a list of fixed runs.
I agree that how it's stored in the JS implementation is irrelevant. My point was that in DMDScript they are stored as utf-8 strings, and they work with only minor extra effort - DMDScript implements all the string handling functions JS defines.
 - ***A  functions in Windows take byte string (LPSTR) and current
   codepage id  to render text. ( byte + codepage = Unicode Code Point )
Win9x only supports the A functions,
You are not right here. TextOutA and TextOutW are both supported by Win98. And the intention in Harmonia was to use only those ***W functions which come out of the box on Win98 (without the need for MSLU).
You're right in that Win98 exports a small handful of W functions without MSLU - but what those W functions actually do under the hood is translate the data based on the current code page and then call the corresponding A function. In other words, the Win9x W functions are rather pointless and don't support characters that are not in the current code page anyway. MSLU extends the same poor behavior to a bunch more pseudo W functions. This is why Phobos does not call W functions under Win9x. Conversely, the A functions under NT and later translate the characters to - you guessed it - UTF-16 and then call the corresponding W function. This is why Phobos under NT does not call the A functions.
 Win9x is obsolete anyway, and there's no reason to cripple a new language 
 by accommodating the failures of an obsolete system.
There is a huge market of embedded devices. If you think that computer evolution expands only in the more-RAM-and-speed direction, then you are in trouble. http://www.litepc.com/graphics/eossystem.jpg
I agree there's a huge ecosystem of 32 bit embedded processors. And D works fine with Win9x - it just isn't crippled by Win9x's shortcomings.
 When running on NT or later Windows, the W functions are used instead 
 which work directly with UTF-16. Later Windows also support UTF-8 with the 
 A functions.
http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx
That is consistent with what I wrote about it.
 - ***W functions in Windows use LPWSTR things which are
   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
   (  cast(dword) word  = Unicode Code Point )
   Only few functions in Windows API treat LPWSTR as UTF-16.
BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.
Sorry, this scares me: "BMP is a proper subset of UTF-16". UTF-16 is a group name for *byte stream encodings* (UTF-16LE and UTF-16BE) of the Unicode code set. BTW: which one of these UTFs does D use? Platform-dependent, I believe.
D has been used for many years with foreign languages under Windows. If UTF-16 didn't work with Windows, I think it would have come up by now <g>. As for whether it is LE or BE, it is whatever the local platform is, just like ints, shorts, longs, etc. are.
 So, the W functions can and do take UTF-16 directly, and in fact the 
 Phobos implementation does use the W functions, transmitting wchar[] to 
 them, and it works fine.

 The neat thing about Phobos is it adapts to whether you are using Win9x, 
 full 32 bit Windows, or Linux, and adjusts the char output accordingly so 
 it "just works."
 It should work well. Efficiently, I mean.
Yes.
 The language shall be agnostic to the meaning of char as much as possible.
That's C/C++'s approach, and it does not work very well. Check out tchar.h, there's a lovely disaster <g>. For another, just try using std::string with shift-JIS.
 It shall not prevent you from writing efficient algorithms.
Does UTF-8 prevent writing effective algorithms? I don't see how. DMDScript works, and is faster than any other JS implementation out there, including my own C++ version <g>. And frankly, my struggles with trying to internationalize C++ code for DMDScript is what led to D's support for UTF. The D implementation is shorter, simpler, and faster than the C++ one (which uses wchar's).
 Practically it is enough to have 16 bits (BMP), but...
I agree you can write code using BMP and ignore surrogate pairs today, and you'll probably never notice the bugs. But sooner or later, the surrogate pair problem is going to show up. Windows, Java, and Javascript have all had to go back and redo to deal with surrogate pairs.
Jul 31 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eamql8$1jgc$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eam1ec$10e1$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The problem as I can see is this:
 D propose to use transport encoding for manipulation purposes
 which is main problem imo here - transport encodings are not
 designed for the manipulation - it is extremely difficult to use
 them for manipualtion in practice as we may see.
I disagree the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
 Sorry, but strings in DMDScript are quite different, in these terms:
 0) there is no such thing as char in JavaScript.
ECMAScript 262-3 (Javascript) defines the source character set to be UTF-16, and the source character set is what JS programs manipulate for strings and characters.
Walter, please forget about such a thing as "the character set is UTF-16" - it is nonsense.

Regarding ECMA-262:
"A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 2.1 or later, and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form..."

It is quite different from your interpretation. The compiler accepts the input stream as either BMP codes or the full Unicode set encoded using UTF-16.

There is no mention that String[n] will return a UTF-16 code unit. That would be weird.
 1) strings are Strings - not vectors of octets - js::string[] and d::char[]
 are different things;
 2) they are not supposed to be used by any OS API;
 3) there are 12 or so methods of the String class in JS - a limited perimeter -
 so what model you've chosen to store them is irrelevant;
 in some implementations they are represented even by a list of fixed runs.
I agree that how it's stored in the JS implementation is irrelevant. My point was that in DMDScript they are stored as utf-8 strings, and they work with only minor extra effort - DMDScript implements all the string handling functions JS defines.
Again, it is up to you how they are stored internally and what you did there. In D the situation is completely different - there are char and char[], open to all winds.
 - ***A  functions in Windows take byte string (LPSTR) and current
   codepage id  to render text. ( byte + codepage = Unicode Code Point )
Win9x only supports the A functions,
You are not right here. TextOutA and TextOutW are both supported by Win98. And intention in Harmonia was to use only those ***W functions which come out of the box on Win98 (without need of MSLU)
You're right in that Win98 exports a small handful of W functions without MSLU - but what those W functions actually do under the hood is translate the data based on the current code page and then call the corresponding A function. In other words, the Win9x W functions are rather pointless and don't support characters that are not in the current code page anyway. MSLU extends the same poor behavior to a bunch more pseudo W functions. This is why Phobos does not call W functions under Win9x.
I wouldn't be so pessimistic about Win98 :)
 Conversely, the A functions under NT and later translate the characters 
 to - you guessed it - UTF-16 and then call the corresponding W function. 
 This is why Phobos under NT does not call the A functions.
OK. And how do you call the A functions? Do you use the proposed koi8chars, latin1chars, etc.? You are using char for that. But wait, char cannot contain anything other than utf-8 :-P
 Win9x is obsolete anyway, and there's no reason to cripple a new 
 language by accommodating the failures of an obsolete system.
There is a huge market of embedded devices. If you think that computer evolution expands only in more-ram-speed direction than you are in trouble. http://www.litepc.com/graphics/eossystem.jpg
I agree there's a huge ecosystem of 32 bit embedded processors. And D works fine with Win9x - it just isn't crippled by Win9x's shortcomings.
 When running on NT or later Windows, the W functions are used instead 
 which work directly with UTF-16. Later Windows also support UTF-8 with 
 the A functions.
http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx
That is consistent with what I wrote about it.
No doubts about it.
 - ***W functions in Windows use LPWSTR things which are
   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
   (  cast(dword) word  = Unicode Code Point )
   Only few functions in Windows API treat LPWSTR as UTF-16.
BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.
Sorry, this scares me: "BMP is a proper subset of UTF-16". UTF-16 is a group name for *byte stream encodings* (UTF-16LE and UTF-16BE) of the Unicode code set. BTW: which one of these UTFs does D use? Platform-dependent, I believe.
D has been used for many years with foreign languages under Windows. If UTF-16 didn't work with Windows, I think it would have come up by now <g>. As for whether it is LE or BE, it is whatever the local platform is, just like ints, shorts, longs, etc. are.
 So, the W functions can and do take UTF-16 directly, and in fact the 
 Phobos implementation does use the W functions, transmitting wchar[] to 
 them, and it works fine.

 The neat thing about Phobos is it adapts to whether you are using Win9x, 
 full 32 bit Windows, or Linux, and adjusts the char output accordingly 
 so it "just works."
It should work well. Efficent I mean.
Yes.
 The language shall be agnostic to the meaning of char as much as 
 possible.
That's C/C++'s approach, and it does not work very well. Check out tchar.h, there's a lovely disaster <g>. For another, just try using std::string with shift-JIS.
 It shall not prevent you to write effective algorithms.
Does UTF-8 prevent writing effective algorithms? I don't see how. DMDScript works, and is faster than any other JS implementation out there, including my own C++ version <g>. And frankly, my struggles with trying to internationalize C++ code for DMDScript is what led to D's support for UTF. The D implementation is shorter, simpler, and faster than the C++ one (which uses wchar's).
 Practically it is enough to have 16 (BMP) but...
I agree you can write code using BMP and ignore surrogate pairs today, and you'll probably never notice the bugs. But sooner or later, the surrogate pair problem is going to show up. Windows, Java, and Javascript have all had to go back and redo to deal with surrogate pairs.
Why? JavaScript, for example, has no such thing as char. String.charAt() returns - guess what? Correct - a String object. No char - no problem :D Why do they need to redefine anything then?

Again - let people decide what char is and how to interpret it, and that will be it. Phobos can work with utf-8/16 and satisfy you and other UTF-masochists (no offence implied). Ordinary people will do their own strings anyway. Just give them opAssign and dtor in structs and you will see an explosion of perfect strings.

Changing char's init value to 0 will not harm anybody but will allow use of char for other than utf-8 purposes - utf-8 is only one of the 40 or so encodings in active use anyway.

For persistence purposes (in compiled EXE) utf is probably the best choice. But at runtime - please, not at the language level.

Educated IMO, of course.

Andrew.
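For reference, a minimal check of the init values under discussion (per the spec page cited at the start of this thread):

void main()
{
    char c; wchar w; dchar d;
    assert(c == 0xFF);     // char.init: an invalid UTF-8 code unit
    assert(w == 0xFFFF);   // wchar.init: not a valid character
    assert(d == 0xFFFF);   // dchar.init, per the documented defaults
}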
Aug 01 2006
parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 The compiler accepts the input stream as either BMP codes or the full Unicode set
 encoded using UTF-16.
BMP is a subset of UTF-16.
 There is no mention that String[n] will return a UTF-16 code
 unit. That would be weird.
String.charCodeAt() will give you the utf-16 code unit.
 Conversely, the A functions under NT and later translate the characters 
 to - you guessed it - UTF-16 and then call the corresponding W function. 
 This is why Phobos under NT does not call the A functions.
Ok. And how do you call A functions?
Take a look at std.file for an example.
 Windows, Java, and Javascript have all 
 had to go back and redo to deal with surrogate pairs.
Why? JavaScript, for example, has no such thing as char. String.charAt() returns - guess what? Correct - a String object. No char - no problem :D Why do they need to redefine anything then?
See String.fromCharCode() and String.charCodeAt()
 Again - let people decide what char is and how to interpret it, and that
 will be it.
I've already explained the problems C/C++ have with that. They're real problems, bad and unfixable enough that there are official proposals to add new UTF basic types to C++.
 Phobos can work with utf-8/16 and satisfy you and other UTF-masochists (no 
 offence implied).
C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).
 Ordinary people will do their own strings anyway. Just give them opAssign
 and dtor in structs and you will see an explosion of perfect strings.

 Changing char's init value to 0 will not harm anybody but will allow use of
 char for other than utf-8 purposes - utf-8 is only one of the 40 or so
 encodings in active use anyway.

 For persistence purposes (in compiled EXE) utf is probably the best choice.
 But at runtime - please, not at the language level.
ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
Aug 01 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
(Hope this long dialog will help all of us to better understand what UNICODE 
is)

"Walter Bright" <newshound digitalmars.com> wrote in message 
news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The compiler accepts the input stream as either BMP codes or the full Unicode set
 encoded using UTF-16.
BMP is a subset of UTF-16.
Walter, with deepest respect, it is not. They are two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking.

If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes, you are in trouble. See: the sequence of two words D834 DD1E as UTF-16 will give you one Unicode character with code 0x1D11E (musical G clef). And the same sequence interpreted as a UCS-2 sequence will give you two (invalid, non-printable, but still) character codes. You will get a different length for the string, at least.
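A minimal sketch of exactly that length difference (the \U escape produces the surrogate pair in a UTF-16 literal):

import std.utf;

void main()
{
    wchar[] pair = "\U0001D11E"w;    // stored as the surrogate pair D834 DD1E
    assert(pair.length == 2);        // two units if read naively as UCS-2
    assert(pair[0] == 0xD834 && pair[1] == 0xDD1E);

    dchar[] decoded = toUTF32(pair); // proper UTF-16 decoding
    assert(decoded.length == 1);     // one character: U+1D11E, musical G clef
    assert(decoded[0] == 0x1D11E);
}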
 There is no mention that String[n] will return a UTF-16 code
 unit. That would be weird.
String.charCodeAt() will give you the utf-16 code unit.
 Conversely, the A functions under NT and later translate the characters 
 to - you guessed it - UTF-16 and then call the corresponding W function. 
 This is why Phobos under NT does not call the A functions.
Ok. And how do you call A functions?
Take a look at std.file for an example.
You mean here?:

char* namez = toMBSz(name);
h = CreateFileA(namez, GENERIC_WRITE, 0, null, CREATE_ALWAYS,
    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, cast(HANDLE) null);

The char* here is far from a UTF-8 sequence.
 Windows, Java, and Javascript have all had to go back and redo to deal 
 with surrogate pairs.
Why? JavaScript for example has no such things as char. String.charAt() returns guess what? Correct - String object. No char - no problem :D
See String.fromCharCode() and String.charCodeAt()
ECMA-262:

String.prototype.charCodeAt(pos)
"Returns a number (a nonnegative integer less than 2^16) representing the code point value of the character at position pos in the string...."

As you may see, it returns a (Unicode) *code point* from the BMP set, but that is far from the UTF-16 code unit you declared above.

Relaxing "a nonnegative integer less than 2^16" to "a nonnegative integer less than 2^21" would not harm anybody. Or at least the probability of harm is vanishingly small.
 Again - let people decide what char is and how to interpret it, and
 that will be it.
I've already explained the problems C/C++ have with that. They're real problems, bad and unfixable enough that there are official proposals to add new UTF basic types to C++.
Basic types of what?
 Phobos can work with utf-8/16 and satisfy you and other UTF-masochists 
 (no offence implied).
C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).
Because char in C is not supposed to hold multi-byte encodings. At least std::string is strictly a single-byte thing by definition. And this is perfectly fine. There is wchar_t for holding the OS-supported range in full. On Win32 wchar_t is 16-bit (UCS-2 legacy) and in GCC/*nix it is 32-bit.
 Ordinary people will do their own strings anyway. Just give them opAssign
 and dtor in structs and you will see an explosion of perfect strings.

 Changing char's init value to 0 will not harm anybody but will allow use of
 char for other than utf-8 purposes - utf-8 is only one of the 40 or so
 encodings in active use anyway.

 For persistence purposes (in compiled EXE) utf is probably the best choice.
 But at runtime - please, not at the language level.
ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?

Andrew.
Aug 01 2006
next sibling parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

 (Hope this long dialog will help all of us to better understand what UNICODE 
 is)
 
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The compiler accepts the input stream as either BMP codes or the full Unicode set
 encoded using UTF-16.
BMP is a subset of UTF-16.
Walter, with deepest respect, it is not. They are two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking.
Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6, but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT implemented UCS-2 but not UTF-16; Windows 2000 and above support UTF-16. ...
 ubyte[] will enable you to use any encoding you wish - and that's what 
 it's there for.
 Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?
Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions.

My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake.

I would have liked something more like ...

  char    ==> An unsigned 8-bit byte. An alias for ubyte.
  schar   ==> A UTF-8 code unit.
  wchar   ==> A UTF-16 code unit.
  dchar   ==> A UTF-32 code unit.
  char[]  ==> A 'C' string
  schar[] ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 1:08:51 PM
Aug 01 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Derek Parnell" <derek nomail.afraid.org> wrote in message 
news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg 40tude.net...
 On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

 (Hope this long dialog will help all of us to better understand what 
 UNICODE
 is)

 "Walter Bright" <newshound digitalmars.com> wrote in message
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The compiler accepts the input stream as either BMP codes or the full Unicode set
 encoded using UTF-16.
BMP is a subset of UTF-16.
Walter, with deepest respect, it is not. They are two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking.
 Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6, but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT implemented UCS-2 but not UTF-16; Windows 2000 and above support UTF-16. ...
 ubyte[] will enable you to use any encoding you wish - and that's what
 it's there for.
 Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?
 Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions.

 My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake.

 I would have liked something more like ...

   char    ==> An unsigned 8-bit byte. An alias for ubyte.
   schar   ==> A UTF-8 code unit.
   wchar   ==> A UTF-16 code unit.
   dchar   ==> A UTF-32 code unit.
   char[]  ==> A 'C' string
   schar[] ==> A UTF-8 string
   wchar[] ==> A UTF-16 string
   dchar[] ==> A UTF-32 string

 And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].
Yes, Derek, this will probably be near the ideal.

Andrew.
Aug 01 2006
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Tue, 1 Aug 2006 21:04:10 -0700, Andrew Fedoniouk  
<news terrainformatica.com> wrote:
 "Derek Parnell" <derek nomail.afraid.org> wrote in message
 news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg 40tude.net...
 On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

 (Hope this long dialog will help all of us to better understand what
 UNICODE
 is)

 "Walter Bright" <newshound digitalmars.com> wrote in message
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 The compiler accepts the input stream as either BMP codes or the full Unicode set
 encoded using UTF-16.
BMP is a subset of UTF-16.
Walter, with deepest respect, it is not. They are two different things. UTF-16 is a variable-length encoding - a byte stream. Unicode BMP is a range of numbers, strictly speaking.
 Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6, but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT implemented UCS-2 but not UTF-16; Windows 2000 and above support UTF-16. ...
 ubyte[] will enable you to use any encoding you wish - and that's what
 it's there for.
 Thus the whole set of Windows API headers (and std.c.string, for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?
 Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions.

 My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake.

 I would have liked something more like ...

   char    ==> An unsigned 8-bit byte. An alias for ubyte.
   schar   ==> A UTF-8 code unit.
   wchar   ==> A UTF-16 code unit.
   dchar   ==> A UTF-32 code unit.
   char[]  ==> A 'C' string
   schar[] ==> A UTF-8 string
   wchar[] ==> A UTF-16 string
   dchar[] ==> A UTF-32 string

 And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].
Yes, Derek, this will be probably near the ideal.
Yet, I don't find it at all difficult to think of them like so:

  ubyte ==> An unsigned 8-bit byte.
  char  ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  ubyte[] ==> A 'C' string
  char[]  ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

If you want to program in D you _will_ have to readjust your thinking in some areas, and this is one of them. All you have to realise is that 'char' in D is not the same as 'char' in C.

In quick and dirty ASCII-only applications I can adjust my thinking further:

  char   ==> An ASCII character
  char[] ==> An ASCII string

I do however agree that C functions used in D should be declared like:

  int strlen(ubyte* s);

and not like (as they currently are):

  int strlen(char* s);

The problem with this is that the code:

  char[] s = "test";
  strlen(s);

would produce a compile error, and require a cast or a conversion function (toMBSz perhaps, which in many cases will not need to do anything).

Of course the purists would say "That's perfectly correct, strlen cannot tell you the length of a UTF-8 string, only its byte count", but at the same time it would be nice (for quick and dirty ASCII-only programs) if it worked.

Is it possible to declare them like this?

  int strlen(void* s);

and for char[] to be implicitly 'paintable' as void*, as char[] is already implicitly 'paintable' as void[]?

It seems like it would nicely solve the problem of people seeing:

  int strlen(char* s);

and thinking D's char is the same as C's char, without introducing a painful need for casts or conversions in simple ASCII-only situations.

Regan
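For the record, here is how the current declarations behave in practice; a minimal sketch assuming the D1-era Phobos modules (std.c.string declares strlen with char*, and std.string.toStringz supplies the trailing 0):

  import std.c.string;   // declares C's strlen as size_t strlen(char* s)
  import std.string;     // toStringz: copies if needed and appends the 0

  void main()
  {
      char[] s = "test";
      // Compiles as-is because strlen is declared with char*. Note that
      // it returns the UTF-8 code unit (byte) count, not the character
      // count, so it is only "right" for ASCII data like this.
      size_t n = strlen(toStringz(s));
      assert(n == 4);
  }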
Aug 01 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Regan Heath" <regan netwin.co.nz> wrote in message 
news:optdm2gghi23k2f5 nrage...
[snip]
Another option would be to change char.init to 0 and forget about the problem, or leave it as it is now. A good string implementation will contain an encoding field in the string instance if needed.

Andrew.
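A minimal sketch of that kind of string, carrying its encoding alongside the bytes (illustrative only; the names here are invented):

  enum Encoding { ASCII, UTF8, KOI8R, CP1251 }

  struct EncodedString
  {
      ubyte[]  data;  // raw code units; their meaning depends on enc
      Encoding enc;   // records which encoding 'data' uses
  }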
Aug 01 2006
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
I'm trying to understand why this 0 thing is such an issue.  If your 
second statement is valid, it makes the first moot - 0 or no 0.  Why 
does it matter, then?

-[Unknown]


Another option would be to change char.init to 0 and forget about the problem, or leave it as it is now. A good string implementation will contain an encoding field in the string instance if needed.
 
 Andrew.
 
 
 
Aug 01 2006
next sibling parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
"Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
news:eapdsg$qeo$1 digitaldaemon.com...
 I'm trying to understand why this 0 thing is such an issue.  If your 
 second statement is valid, it makes the first moot - 0 or no 0.  Why does 
 it matter, then?
Declaring char.init == 0 pretty much means that D has no strict requirement that char[] shall contain only UTF-8 encoded sequences; it may contain any other encoding suitable for the application.

char.init == 0 would resolve the situation we see in Phobos now: char[] is de facto used for encodings other than UTF-8.

char.init == 0 tells everybody that char can also be used for representing Unicode *code points*, with the assumption that the offset value (the mapping onto the full Unicode set, aka the codepage) is stored somewhere in the application or well known to it.

char.init == 0 also highlights the fact that it is safe to pass char[] to C string processing functions and to non-D modules and libraries. Whether it is UTF-8 encoded or not does not matter; the type is universal enough.

Andrew.
 -[Unknown]


Another option would be to change char.init to 0 and forget about the problem, or leave it as it is now. A good string implementation will contain an encoding field in the string instance if needed.

 Andrew.

 
Aug 01 2006
next sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Andrew Fedoniouk wrote:
 "Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
 news:eapdsg$qeo$1 digitaldaemon.com...
 I'm trying to understand why this 0 thing is such an issue.  If your 
 second statement is valid, it makes the first moot - 0 or no 0.  Why does 
 it matter, then?
Declaring char.init == 0 pretty much means that D has no strict requirement that char[] shall contain only UTF-8 encoded sequences; it may contain any other encoding suitable for the application.
Why is this good?
char.init == 0 would resolve the situation we see in Phobos now: char[] is de facto used for encodings other than UTF-8.
You mean data with other encodings that still want to use the std.string functions? I have written template versions that replace (almost) all std.string functions that do not rely on encoding.
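Presumably something along these lines: a sketch of an encoding-agnostic routine parameterised over the code unit type (illustrative only, not the actual templates):

  // Works identically for char[], wchar[], dchar[] or ubyte[] data,
  // since it never interprets the code units as characters.
  size_t find(T)(T[] haystack, T needle)
  {
      foreach (i, c; haystack)
          if (c == needle)
              return i;
      return size_t.max;  // not found
  }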
char.init == 0 tells everybody that char can also be used for representing Unicode *code points*, with the assumption that the offset value (the mapping onto the full Unicode set, aka the codepage) is stored somewhere in the application or well known to it.
Maybe it would tell people that. A good thing it isn't so, then. Again, why do you want to store non-UTF-8 data in a char[]? What is wrong with ubyte[] or a suitable typedef?
char.init == 0 also highlights the fact that it is safe to pass char[] to C string processing functions and to non-D modules and libraries. Whether it is UTF-8 encoded or not does not matter; the type is universal enough.
I can't see how that would make it considerably safer. /Oskar
Aug 02 2006
prev sibling parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
I fail to understand why I want another ambiguous type in my 
programming.  I am glad that when I type "int", I know I have a number 
and not a pointer.

I am glad that when I type char, I again know what I have.  No 
guesswork.  Your proposals sound like shooting myself in the foot.

No fun.  I'll take that helmet you offered first.

-[Unknown]


 "Unknown W. Brackets" <unknown simplemachines.org> wrote in message 
 news:eapdsg$qeo$1 digitaldaemon.com...
 I'm trying to understand why this 0 thing is such an issue.  If your 
 second statement is valid, it makes the first moot - 0 or no 0.  Why does 
 it matter, then?
Declaration of char.init == 0 pretty much means that D has no strict requirement that char[] shall contain only UTF-8 encoded sequences but any other encodings suitable for the application. char.init == 0 will resolve situation we see in Phobos now. char[] de facto is used for other than utf-8 encodings. char.init == 0 tells everybody that char can also be used for representing unicode *code points* with asuumption that offset value (mapping on full Unicode set, aka codepage) is stored somewhere in application or well known to it. char.init == 0 also highlights the fact that it is safe to use char[] as C string processing functions and passing them to non D modules and libraries. Is it UTF-8 encoded or not - does not matter - type is universal enough. Andrew.
 -[Unknown]


 Another option will be to change char.init to 0 and forget about the 
 problem
 left it as it is now.  Some good string implementation will
 contain encoding field in string instance if needed.

 Andrew.
Aug 02 2006
prev sibling parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Tue, 01 Aug 2006 22:40:56 -0700, Unknown W. Brackets wrote:

 I'm trying to understand why this 0 thing is such an issue.  If your 
 second statement is valid, it makes the first moot - 0 or no 0.  Why 
 does it matter, then?
I think the issue is more that Andrew wants to have hex-FF as a legitimate byte value anywhere in a char[] variable. He misses the point that the purpose of not allowing it is so we can detect uninitialized UTF-8 strings at run-time.

Andrew, just use ubyte[] variables and you won't have a problem, apart from conversions between code-pages and Unicode <G>. In D, ubyte[] is the data structure designed to hold variable-length arrays of unsigned bytes, which is exactly what you need to implement the kind of strings you have in KOI-8 encoding.

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 4:24:27 PM
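For example, a sketch of the kind of distinct type Derek means (the names koi8char and koi8string are invented for illustration):

  typedef ubyte koi8char;       // a distinct type in D1: the compiler
                                // will not silently mix it with char
  alias koi8char[] koi8string;  // a KOI-8 encoded string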
Aug 01 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
I think the issue is more that Andrew wants to have hex-FF as a legitimate byte value anywhere in a char[] variable. He misses the point that the purpose of not allowing it is so we can detect uninitialized UTF-8 strings at run-time.
What does "uninitialized" mean? They *are* initialized. This is the main point. For any type you can declare an initial value. I bet you are choosing not non-existent values for, say, enums, but some really meaningful default values.

Having strings filled with 0xFFs means that you will get problems of a different kind: partially initialized strings.

Could you tell me, have you ever had a situation where 0xFF-filled strings helped you to find a problem? And if yes, how is it in principle different from catching strings filled with zeros?

Can anyone here say that these 0xFF fills helped to find a problem?

Andrew.
Aug 02 2006
next sibling parent Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
Can anyone here say that these 0xFF fills helped to find a problem?
Yes, I found two bugs in my own code with it that would have been hidden with the 0 initialization.
Aug 02 2006
prev sibling parent Derek Parnell <derek nomail.afraid.org> writes:
On Wed, 2 Aug 2006 00:08:42 -0700, Andrew Fedoniouk wrote:

I think the issue is more that Andrew wants to have hex-FF as a legitimate byte value anywhere in a char[] variable. He misses the point that the purpose of not allowing it is so we can detect uninitialized UTF-8 strings at run-time.
What does "uninitialized" mean? They *are* initialized.
Andrew, I will assume you are not trying to be difficult but that maybe your English is a bit too literal. Of course in the clinical sense they are initialized because data is moved into them before your code has a chance to do anything. However, when I say "detected uninitialized UTF-8 strings" I mean "detect UTF-8 strings that have not been initialized by your own code". Is that better?
This is the main point. For any type you can declare an initial value. I bet you are choosing not non-existent values for, say, enums, but some really meaningful default values.
Huh??? Now you are being difficult. The purpose of enums is to have them initialized to values that make sense in their context. But the default values for enums generally work for me, as the exact value doesn't really matter in most cases.

  enum AccountType { Savings, Investment, FixedLoan, Club, LineOfCredit }

I really don't care what values the compiler assigns to these enums. Sure, I could choose specific values, but it doesn't really matter.
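(A D enum variable does indeed default to its first member; a trivial sketch:)

  enum AccountType { Savings, Investment, FixedLoan, Club, LineOfCredit }

  void main()
  {
      AccountType acct;                     // takes AccountType.init
      assert(acct == AccountType.Savings);  // .init is the first member
  }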
Having strings filled with 0xFFs means that you will get problems of a different kind: partially initialized strings.
Huh???? Why would I always get partially initialized strings, as you imply? And even if I did, then having 0xFF in them is going to help me track down some stupid code that I wrote.
Could you tell me, have you ever had a situation where 0xFF-filled strings helped you to find a problem?
No. I haven't made that kind of mistake yet with my code.
And if yes, how is it in principle different from catching strings filled with zeros?
Because if I found a 0x00 in a string, I wouldn't know if it's legitimate or not.
Can anyone here say that these 0xFF fills helped to find a problem?
But if I found 0xFF I would know straight away that I've made a mistake somewhere.

Actually, come to think of it, I did make a mistake once when my code was incorrectly interpreting a BOM in a text file. I loaded the file as if it was UTF-8 but it should have been UTF-16. DMD correctly told me I had a bad UTF string when I tried to write it out.

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 5:49:46 PM
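That detection is easy to reproduce; a quick sketch assuming D1-era std.utf, where validate throws on malformed UTF-8 and a never-written char buffer is all 0xFF:

  import std.utf;

  void main()
  {
      char[4] buf;            // each element is char.init, i.e. 0xFF
      bool caught = false;
      try
      {
          validate(buf);      // 0xFF can never occur in valid UTF-8
      }
      catch (Exception e)     // Phobos throws a UTF exception here
      {
          caught = true;
      }
      assert(caught);         // a 0-filled buffer would pass silently
  }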
Aug 02 2006
prev sibling parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Wed, 02 Aug 2006 16:22:54 +1200, Regan Heath wrote:

  char  ==> An unsigned 8-bit byte. An alias for ubyte.
  schar ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  char[] ==> A 'C' string
  schar[] ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

 And then have built-in conversions between the UTF encodings. So if  
 people
 want to continue to use code from C/C++ that uses code-pages or similar
 they can stick with char[].
Yes, Derek, this would probably be near the ideal.
Yet, I don't find it at all difficult to think of them like so:

  ubyte ==> An unsigned 8-bit byte.
  char  ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  ubyte[] ==> A 'C' string
  char[]  ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string
Me too, but that's probably because I've not been immersed in C/C++ for the last 20 odd years ;-) I "think in D" now and char[] is a UTF-8 string in my mind.
 If you want to program in D you _will_ have to readjust your thinking in  
some areas, and this is one of them.
 All you have to realise is that 'char' in D is not the same as 'char' in C.
True, but Walter seems hell-bent on easing the transition to D for C/C++ refugees.
 In quick and dirty ASCII only applications I can adjust my thinking  
 further:
 
    char   ==> An ASCII character
    char[] ==> An ASCII string
 
 I do however agree that C functions used in D should be declared like:
    int strlen(ubyte* s);
 
 and not like (as they currently are):
    int strlen(char* s);
 
 The problem with this is that the code:
    char[] s = "test";
    strlen(s)
 
 would produce a compile error, and require a cast or a conversion function  
 (toMBSz perhaps, which in many cases will not need to do anything).
 
 Of course the purists would say "That's perfectly correct, strlen cannot  
 tell you the length of a UTF-8 string, only it's byte count", but at the  
 same time it would be nice (for quick and dirty ASCII only programs) if it  
 worked.
And I'm a wannabe purist <G>
 Is it possible to declare them like this?
    int strlen(void* s);
 
 and for char[] to be implicitly 'paintable' as void* as char[] is already  
 implicitly 'paintable' as void[]?
 
 It seems like it would nicely solve the problem of people seeing:
    int strlen(char* s);
 
 and thinking D's char is the same as C's char without introducing a  
 painful need for cast or conversion in simple ASCII only situations.
It is the zero-terminator for C strings that will get in the way. We need a nice way of getting the compiler to ensure C strings are always terminated correctly.

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 2:48:43 PM
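Phobos's toStringz already does roughly that job; a minimal hand-rolled sketch of the idea (illustrative only, not the library implementation):

  char* toCString(char[] s)
  {
      char[] copy = s ~ "\0";  // concatenation allocates; last unit is 0
      return copy.ptr;         // safe to hand to C for the copy's lifetime
  }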
Aug 01 2006
parent "Regan Heath" <regan netwin.co.nz> writes:
On Wed, 2 Aug 2006 14:55:11 +1000, Derek Parnell <derek nomail.afraid.org>  
wrote:
It is the zero-terminator for C strings that will get in the way. We need a nice way of getting the compiler to ensure C strings are always terminated correctly.
Good point. I neglected to mention that. Regan
Aug 01 2006
prev sibling next sibling parent kris <foo bar.com> writes:
Derek Parnell wrote:
[snip]
   char  ==> An unsigned 8-bit byte. An alias for ubyte.
   schar ==> A UTF-8 code unit.
   wchar ==> A UTF-16 code unit.
   dchar ==> A UTF-32 code unit.
 
   char[] ==> A 'C' string 
   schar[] ==> A UTF-8 string
   wchar[] ==> A UTF-16 string
   dchar[] ==> A UTF-32 string
Sure, although char, utf8, utf16 and utf32 are much better choices, IMHO :) I'd be game to have them changed at this stage. It's not much more than some (extensive) global replacements; I don't think there's much need to check each instance. There's a nice shareware tool called "Active Search & Replace" which I've recently found very helpful in this regard.
Aug 01 2006
prev sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
 
 (Hope this long dialog will help all of us to better understand what UNICODE 
 is)

 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set
encoded using UTF-16. BMP is a subset of UTF-16.
Walter, with deepest respect, it is not. They are two different things. UTF-16 is a variable-length encoding, a byte stream. Unicode BMP is, strictly speaking, a range of numbers.
Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT implemented UCS-2 but not UTF-16; Windows 2000 and above support UTF-16 now.
If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?
Aug 02 2006
parent reply Derek Parnell <derek nomail.afraid.org> writes:
On Wed, 02 Aug 2006 00:11:26 -0700, Walter Bright wrote:

[snip]
Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6 but that has changed). UCS-2 is a subset of Unicode characters that are all represented by 2-byte integers. Windows NT implemented UCS-2 but not UTF-16; Windows 2000 and above support UTF-16 now.
If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?
Huh??? I said "UCS-2 is a subset of Unicode characters". Did you miss that?

UTF-16 is not a subset, as it can be used to encode every Unicode code point. UCS-2 is a subset, as it can *not* encode code points that are outside of the "basic multilingual plane" (aka BMP).

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 5:43:18 PM
Aug 02 2006
parent Walter Bright <newshound digitalmars.com> writes:
Derek Parnell wrote:
 On Wed, 02 Aug 2006 00:11:26 -0700, Walter Bright wrote:
[snip]
If UCS-2 is not a subset of UTF-16, what UCS-2 sequences are not valid UTF-16?
Huh??? I said "UCS-2 is a subset of Unicode characters". Did you miss that?
I saw it, but that statement is not the same as "UCS-2 is a subset of UTF-16". The issue I was talking about is "BMP [UCS-2] is a subset of UTF-16", which Andrew keeps replying "it is not". You said "Andrew is correct", so I inferred you were agreeing that UCS-2 is not a subset of UTF-16.
 UTF-16 is not a subset as it can be used to encode every Unicode code
 point. UCS-2 is a subset as it can *not* encode code points that are
 outside of the "basic multilingual plane" (aka BMP). 
I think you and I are in agreement.
Aug 02 2006
prev sibling next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 BMP is a subset of UTF-16.
Walter, with deepest respect, it is not. They are two different things. UTF-16 is a variable-length encoding, a byte stream. Unicode BMP is, strictly speaking, a range of numbers. If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes you are in trouble. See: the sequence of two words D834 DD1E, read as UTF-16, gives you one Unicode character with code 0x1D11E (musical G clef). The same sequence interpreted as a UCS-2 sequence gives you two (invalid, non-printable, but still) character codes. You will get a different length for the string, at least.
The only thing that UTF-16 adds is semantics for sequences that are invalid in BMP. That makes UTF-16 a superset. It doesn't matter if you're speaking strictly, or if the jargon is different: UTF-16 is a superset of BMP, once you cut past the jargon and look at the underlying reality.
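A sketch of that G clef example in D, assuming the D1-era std.utf.decode:

  import std.utf;

  void main()
  {
      wchar[] s = "\U0001D11E"w;   // musical G clef, U+1D11E
      assert(s.length == 2);        // stored as the pair D834 DD1E
      assert(s[0] == 0xD834 && s[1] == 0xDD1E);

      size_t i = 0;
      dchar c = decode(s, i);       // consumes the whole surrogate pair
      assert(c == 0x1D11E && i == 2);
  }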
 Ok. And how do you call A functions?
Take a look at std.file for an example.
You mean here?

  char* namez = toMBSz(name);
  h = CreateFileA(namez, GENERIC_WRITE, 0, null, CREATE_ALWAYS,
      FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, cast(HANDLE)null);

The char* here is far from a UTF-8 sequence.
You could argue that for clarity namez should have been written as a ubyte*, but in the above code it would make no difference.
 Windows, Java, and Javascript have all had to go back and redo to deal 
 with surrogate pairs.
Why? JavaScript, for example, has no such thing as char. String.charAt() returns, guess what? Correct: a String object. No char, no problem :D
See String.fromCharCode() and String.charCodeAt()
ECMA-262, String.prototype.charCodeAt(pos): "Returns a number (a nonnegative integer less than 2^16) representing the code point value of the character at position pos in the string...."

As you may see, it returns a (Unicode) *code point* from the BMP set, but that is far from the UTF-16 code unit you've declared above.
There is no difference.
 Relaxing "a nonnegative integer less than 2^16" to
 "a nonnegative integer less than 2^21" will not harm anybody.
 Or at least such probability is vanishingly small.
It'll break any code trying to deal with surrogate pairs.
Again, let people decide what char is and how to interpret it. And that will be it.
I've already explained the problems C/C++ have with that. They're real problems, bad and unfixable enough that there are official proposals to add new UTF basic types to C++.
Basic types of what?
Basic types for UTF-8 and UTF-16. Ironically, they wind up being very much like D's char and wchar types.
 Phobos can work with utf-8/16 and satisfy you and other UTF-masochists 
 (no offence implied).
C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).
Because char in C is not supposed to hold multibyte encodings.
Standard functions in the C standard library to deal with multibyte encodings have been there since 1989. Compiler extensions to deal with shift-JIS and other multibyte encodings have been there since the mid 80's. They don't work very well, but nevertheless, are there and supported.
At least std::string is strictly a single-byte thing by definition. And this is perfectly fine.
As long as you're dealing with ASCII only <g>. That world has been left behind, though.
 There is wchar_t for holding OS supported range in full.
 On Win32 - wchar_t is 16bit (UCS-2 legacy) and in GCC/*nix it is 32bit.
That's just the trouble with wchar_t. It's implementation defined, which means its use is non-portable. The Win32 version cannot handle surrogate pairs as a single character. Linux has the opposite problem - you can't have UTF-16 strings in any non-kludgy way. Trying to write internationalized code with wchar_t that works correctly on both Win32 and Linux is an exercise in frustration. What you wind up doing is abstracting away the char type - giving up on help from the standard libraries and writing your own text processing code from scratch. I've been through this with real projects. It doesn't work just fine, and is a lot of extra work. Translating the code to D is nice, you essentially give that whole mess a heave-ho. BTW, you talked earlier a lot about memory efficiency. Linux's 32 bit wchar_t eats memory like nothing else.
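(By contrast, D pins those sizes down; a trivial sketch:)

  void main()
  {
      assert(wchar.sizeof == 2);  // a UTF-16 code unit on every platform
      assert(dchar.sizeof == 4);  // a UTF-32 code unit on every platform
  }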
Thus the whole set of Windows API headers (and std.c.string for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C.
You're right that a C char isn't a D char. All that means is one must be careful when calling C functions that take char*'s to pass it data in the form that particular C function expects. This is true for all C's data types - even int.
 Is this the idea?
The vast majority (perhaps even all) of C standard string handling functions that accept char* will work with UTF-8 without modification. No rewrite required. You've implied all this doesn't work, by saying things must be rewritten, that it's extremely difficult to deal with UTF-8, that BMP is not a subset of UTF-16, etc. This is not my experience at all. If you've got some persuasive code examples, I'd like to see them.
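To make that concrete, a small sketch (again assuming D1-era std.c.string and std.utf; D string literals are guaranteed a trailing 0 in memory, which is what lets strlen work here):

  import std.c.string;   // C's strlen, unmodified
  import std.utf;        // toUTF32, for a real character count

  void main()
  {
      char[] s = "héllo";              // 5 characters, 6 UTF-8 bytes
      assert(strlen(s.ptr) == 6);      // strlen happily counts the bytes
      assert(toUTF32(s).length == 5);  // counting characters means decoding
  }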
Aug 01 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 As you may see it is returning (unicode) *code point* from BMP set
 but it is far from UTF-16 code unit you've declared above.
There is no difference.
 Relaxing "a nonnegative integer less than 2^16" to
 "a nonnegative integer less than 2^21" will not harm anybody.
 Or at least such probability is vanishingly small.
It'll break any code trying to deal with surrogate pairs.
There is no such thing as a surrogate pair in UCS-2. A JS string does not hold UTF-16 code units, only full code points. See the spec.
 Phobos can work with utf-8/16 and satisfy you and other UTF-masochists
(no offence implied).
C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).
Because char in C is not supposed to hold multibyte encodings.
Standard functions in the C standard library to deal with multibyte encodings have been there since 1989. Compiler extensions to deal with shift-JIS and other multibyte encodings have been there since the mid 80's. They don't work very well, but nevertheless, are there and supported.
At least std::string is strictly a single-byte thing by definition. And this is perfectly fine.
As long as you're dealing with ASCII only <g>. That world has been left behind, though.
C string functions can be used with multibyte encodings for one sole reason: all byte encodings define the character with code 0 as the NUL character. No encoding in practical use allows a byte with code 0 to appear in the middle of a sequence. They were all built with C string processing in mind.
 There is wchar_t for holding OS supported range in full.
 On Win32 - wchar_t is 16bit (UCS-2 legacy) and in GCC/*nix it is 32bit.
That's just the trouble with wchar_t. It's implementation defined, which means its use is non-portable. The Win32 version cannot handle surrogate pairs as a single character. Linux has the opposite problem - you can't have UTF-16 strings in any non-kludgy way. Trying to write internationalized code with wchar_t that works correctly on both Win32 and Linux is an exercise in frustration. What you wind up doing is abstracting away the char type - giving up on help from the standard libraries and writing your own text processing code from scratch. I've been through this with real projects. It doesn't work just fine, and is a lot of extra work. Translating the code to D is nice, you essentially give that whole mess a heave-ho. BTW, you talked earlier a lot about memory efficiency. Linux's 32 bit wchar_t eats memory like nothing else.
Agreed. As I said: if you need efficiency, use byte/word encodings plus a mapping. dchar is no better than wchar_t on Linux. Please don't say that I should use UTF-8 for that; it simply does not work in my cases, it is too expensive.
Thus the whole set of Windows API headers (and std.c.string for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C.
You're right that a C char isn't a D char. All that means is one must be careful when calling C functions that take char*'s to pass it data in the form that particular C function expects. This is true for all C's data types - even int.
 Is this the idea?
The vast majority (perhaps even all) of C standard string handling functions that accept char* will work with UTF-8 without modification. No rewrite required.
Correct. As I said, that's because 0 is NUL in UTF-8 too. Not 0xFF or anything else exotic.
 You've implied all this doesn't work, by saying things must be rewritten, 
 that it's extremely difficult to deal with UTF-8, that BMP is not a subset 
 of UTF-16, etc. This is not my experience at all. If you've got some 
 persuasive code examples, I'd like to see them.
I am not saying it "must be rewritten". Sorry, but it is you who proposes to rewrite all the string processing functions of the standard library that mankind has today. Or else I don't quite understand your idea with the UTFs.

Java did change the string world by introducing just char (a single UCS-2 code point), with no variations. Is that good or bad? From a uniformity point of view, good. For efficiency, bad. I've seen a lot of reinvented char-as-byte wheels in professional packages.

Andrew.
Aug 01 2006
parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
Andrew, I think there's a misunderstanding here.  Perhaps it's a 
language thing.

Let me define two things for you, in English, by my understanding of 
them.  I was born in Utah and raised in Los Angeles as a native speaker, 
so hopefully these definitions aren't far from the standard understanding.

Default: a setting, value, or situation which persists unless action is 
taken otherwise; such a thing that happens unless overridden or canceled.

Null: something which has no current setting, value, or situation (but 
could have one); the absence of a setting, value, or situation.

Therefore, I should conclude that "default" and "null" are very 
different concepts.

The fact that C strings are null terminated, and that encodings provide 
for a "null" character (or code point or muffin or whatever they care to 
call them) does not logically necessitate that this provides for a 
default, or logically default, value.

It is true that, by the above definitions, it would not be wrong for the
default to be null.  That would fit the definitions above perfectly. 
However, so would a value of ' ' (which might be the default in some 
language out there.)

It would seem logical that 0 could be used as the default, but then as 
Walter pointed out... this can (and tends to) hide bugs which will bite 
you eventually.

Let us suppose you were to have a string displayed in a place.  It is 
possible, were it blank, that you might not notice it.  Next let us 
suppose this space were filled with "?", "`", "ﮘ", or "ß" characters.

Do you think you would be more, or less likely to notice it?

Next, let us suppose that this character could be (in cases) detectable 
as invalid.  Again note that 0 is not invalid, and may appear in 
strings.  This sounds even better.

So a default value of 0 does not, from an implementation or practical 
point of view, seem to make much sense to me.  In fact, I think a 
default value of "42" for int makes sense (surely it reminds you of what 
six by nine is.)

But maybe that's because I never leave things at their defaults.  It's 
like writing a story where you expect the reader to think everyone has 
brown eyes unless you say otherwise.

-[Unknown]


Correct. As I said, that's because 0 is NUL in UTF-8 too. Not 0xFF or anything else exotic.
Aug 01 2006
parent reply "Andrew Fedoniouk" <news terrainformatica.com> writes:
 But maybe that's because I never leave things at their defaults.  It's
 like writing a story where you expect the reader to think everyone has 
 brown eyes unless you say otherwise.
Consider this:

  char[6] buf;
  strncpy(buf, "1234567", 5);

What will be the content of your buffer? The answer is: 12345\xff. Surprise? It is.

In modern D a reliable implementation of this shall be:

  char[6] buf;   // memset(buf, 0xFF, 6); under the hood.
  uint n = strncpy(buf, "1234567", 5);
  buf[n] = 0;

if you are going to use this with non-D modules.

Needless to say, this is a bit redundant. If D in any case initializes that memory, why do you need this uint n and buf[n] = 0;?

Don't tell me please that this is because you spent your childhood in Boy Scout camps and got some high principles there. Let's put such matters aside; this is a purely technical discussion.

Andrew.
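For contrast, a sketch of the plain D way, which sidesteps the C functions and the terminator bookkeeping entirely, because slices carry their own length:

  void main()
  {
      char[] src = "1234567";
      char[] buf = src[0 .. 5].dup;  // "12345"; no 0 terminator needed
      assert(buf.length == 5);
      assert(buf == "12345");
  }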
Aug 01 2006
next sibling parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Andrew Fedoniouk wrote:
 But maybe that's because I never leave things at their defaults.  It's
 like writing a story where you expect the reader to think everyone has 
 brown eyes unless you say otherwise.
Consider this:

  char[6] buf;
  strncpy(buf, "1234567", 5);

What will be the content of your buffer? The answer is: 12345\xff. Surprise? It is.
Not really surprising. Had you compiled this in a C program (you are using C functions, after all), you would have gotten 12345\x?? (some garbage), not a zero-terminated string. My manual for strncpy explicitly states: "if there is no null byte among the first n bytes of src, the result will not be null-terminated."

/Oskar
Aug 02 2006
prev sibling next sibling parent Derek Parnell <derek nomail.afraid.org> writes:
On Tue, 1 Aug 2006 23:45:26 -0700, Andrew Fedoniouk wrote:

 But maybe that's because I never leave things at their defaults.  It's
 like writing a story where you expect the reader to think everyone has 
 brown eyes unless you say otherwise.
Consider this:

  char[6] buf;
  strncpy(buf, "1234567", 5);

What will be the content of your buffer? The answer is: 12345\xff. Surprise? It is.
No, not surprised, just wondering why you didn't code it correctly, though. If you insist on using C functions then it should be coded ...

  extern(C) uint strncpy(ubyte*, ubyte*, uint);

  ubyte[6] buf;
  strncpy(buf.ptr, cast(ubyte*)"1234567", 5);
 In modern D reliable implementation of this shall be as:
 
 char[6] buf; // memset(buf,0xFF,6); under the hood.
 uint n = strncpy(buf, "1234567", 5);
 buf[n] = 0;
Well, that is debatable. I'd do it more like ...

  char[6] buf;                          // An array of UTF-8 code units.
  uint n = strncpy(buf, "1234567", 5);  // Replace the first 5 code units.
  buf[n .. $] = 0;                      // Set remaining code units to zero.
 if you are going to use this with non D modules.
 
 Needless to say that this is a bit redundant.
 
 If D in any case initializes that memory why you need
 this uint n and buf[n] = 0; ?
 
 Don't tell me please that this is because your spent
 your childhood in boyscout camps and got some high principles.
 Lets' put aside that matters - it is purely technical discussion.
Exactly. And technically you should be using ubyte[] and not char[].

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 4:57:15 PM
Aug 02 2006
prev sibling parent reply "Unknown W. Brackets" <unknown simplemachines.org> writes:
Why would I ever use strncat() in a D program?

Consider this: if you do not wear a helmet while riding a motorcycle 
(read: I don't like helmets) you could break your head and die.  Guess 
what?  I don't ride motorcycles.  Problem solved.

I don't like null-terminated strings. I think they are the root of much evil. Describing why having 0 as a default benefits null-terminated strings is, to me, like describing how having fewer police helps burglars. Obviously I'm being over-dramatic, but I remain unconvinced...

Also I did spend (some of) my childhood in Boy Scout camps and I did 
learn many principles (none of which related to programming in the 
slightest.)  I mean that literally.  But you're right, that's beside the 
point.

-[Unknown]


[snip]
Aug 02 2006
parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Correction: strncpy().  They're all evil.

-[Unknown]


 Why would I ever use strncat() in a D program?
 
 Consider this: if you do not wear a helmet while riding a motorcycle 
 (read: I don't like helmets) you could break your head and die.  Guess 
 what?  I don't ride motorcycles.  Problem solved.
 
 I don't like null terminated strings.  I think they are the root of much 
 evil.  Describing why having 0 as a default benefits null terminated 
 strings is like describing how having less police help burglars to me. 
 Obviously I'm being over-dramatic, but I remain unconvinced...
 
 Also I did spend (some of) my childhood in Boy Scout camps and I did 
 learn many principles (none of which related to programming in the 
 slightest.)  I mean that literally.  But you're right, that's beside the 
 point.
 
 -[Unknown]
 
 
 But maybe that's because I never leave things at their defaults.  It's
 like writing a story where you expect the reader to think everyone 
 has brown eyes unless you say otherwise.
Consider this: char[6] buf; strncpy(buf, "1234567", 5); What will be a content of you buffer? Answer is: 12345\xff . Surprise? It is. In modern D reliable implementation of this shall be as: char[6] buf; // memset(buf,0xFF,6); under the hood. uint n = strncpy(buf, "1234567", 5); buf[n] = 0; if you are going to use this with non D modules. Needless to say that this is a bit redundant. If D in any case initializes that memory why you need this uint n and buf[n] = 0; ? Don't tell me please that this is because your spent your childhood in boyscout camps and got some high principles. Lets' put aside that matters - it is purely technical discussion. Andrew.
Aug 02 2006
prev sibling next sibling parent kris <foo bar.com> writes:
Andrew Fedoniouk wrote:
 (Hope this long dialog will help all of us to better understand what UNICODE 
 is)
Actually, it doesn't help at all, Andrew ~ some of it is thoroughly misguided, and some is "cleverly" slanted purely for the benefit of the author. In truth, this thread would be the last place one would look for an entirely unbiased opinion; one with only the reader's education in mind. There are infinitely more useful places to go for that sort of thing. For those who have an interest, this tiny selection may help:

http://icu.sourceforge.net/docs/papers/forms_of_unicode/
http://www.hackcraft.net/xmlUnicode/
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.unicode.org/unicode/faq/utf_bom.html
http://en.wikipedia.org/wiki/UTF-8
http://www.joelonsoftware.com/articles/Unicode.html
Aug 01 2006
prev sibling parent Bruno Medeiros <brunodomedeirosATgmail SPAM.com> writes:
Andrew Fedoniouk wrote:
 "Walter Bright" <newshound digitalmars.com> wrote in message 
 news:eao5st$2r1f$1 digitaldaemon.com...
 Andrew Fedoniouk wrote:
 Compiler accepts input stream as either BMP codes or full unicode set
encoded using UTF-16. BMP is a subset of UTF-16.
Walter, with deepest respect, it is not. They are two different things. UTF-16 is a variable-length encoding, a byte stream. Unicode BMP is, strictly speaking, a range of numbers. If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes you are in trouble. See:
Uh, the statement "BMP is a subset of UTF-16" means that you can read a BMP sequence as an UTF-16 sequence, not the opposite as you said: "If you will treat utf-16 sequence as a sequence of UCS-2 (BMP)".
Ordinary people will do their own strings anyway. Just give them opAssign and dtor in structs and you will see an explosion of perfect strings. That ...
Changing the char init value to 0 will not harm anybody but will allow char to be used for purposes other than UTF-8; it is only one of the 40 encodings in active use anyway.

For persistence purposes (in a compiled EXE) UTF is probably the best choice. But at runtime, please, not at the language level.
ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
Thus the whole set of Windows API headers (and std.c.string for example) seen in D has to be rewritten to accept ubyte[], as char in D is not char in C. Is this the idea?

Andrew.
Just a note: not to ubyte[] but to ubyte*.

--
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
Aug 03 2006