digitalmars.D.learn - What is the legal range of chars?

"monarch_dodra" <monarchdodra gmail.com> writes:

I know a "binary" char can hold the values 0 to 0xFF. However, 
I'm wondering about the cases where a codepoint can fit inside a 
char. For example, 'ç' is represented by 0xe7, which technically 
fits inside a char.

This is illegal:
char c = 'ç';
But this works:
char c = cast(char)'ç';
assert(c == 'ç');

... it "works"... but is it legal?

--------
The root of the question though is actually this: If I have a 
string, and somebody asks me to find the character "char c" in 
that string. Is it legal to iterate on the string char by char, 
until I find c exactly, or do I have to take onto account that 
some troll may have decided to put a wchar inside my char...?

Basically:
string myFind(string s, char c)
{
     foreach(i, char sc ; s)
         if(sc == c)
             return s[i .. $];
     return s[$ .. $];
}
assert(myFind("aça", cast(char)'ç') == "ça");

The assert above will fail. But whose fault is it? Is it a wrong 
call, or a wrong implementation?

Jun 19 2013

"Ali Çehreli" <acehreli yahoo.com> writes:

On 06/19/2013 05:34 AM, monarch_dodra wrote:

> I know a "binary" char can hold the values 0 to 0xFF. However, I'm
> wondering about the cases where a codepoint can fit inside a char. For
> example, 'ç' is represented by 0xe7, which technically fits inside a
> char.

'ç' is represented by 0xe7 in an encoding that is not UTF-8. :)

That would be a special agreement between the producer and the consumer 
of that string. Otherwise, 0xe7 is not 'ç'. I recommend ubyte[] for 
those cases.

In UTF-8, 0xe7 is the first byte of a 3-byte sequence that encodes a single code point:

import std.stdio;

void main()
{
     char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
     writeln(a);
}

Prints a Chinese character:

abc瀀
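
To make the point concrete, here is a minimal sketch of how 'ç' (U+00E7) is actually stored in a UTF-8 string, and what the cast from the original question leaves behind. The helper name utf8Units is made up for this illustration; representation, codeLength and stride are the Phobos functions from std.string and std.utf. It should build with something like "dmd -unittest -main".

```d
import std.string : representation;
import std.utf : codeLength, stride;

// Hypothetical helper: how many UTF-8 code units one code point needs.
size_t utf8Units(dchar c)
{
    return codeLength!char(c);
}

unittest
{
    // U+00E7 ('ç') is stored as the two code units 0xC3 0xA7.
    immutable ubyte[] expected = [0xC3, 0xA7];
    assert("\u00E7".representation == expected);
    assert(utf8Units('\u00E7') == 2);

    // A lead byte of 0xE7 announces a three-unit sequence (here U+7000).
    assert(stride("\xE7\x80\x80", 0) == 3);

    // The cast keeps only the low byte, so the char holds a lone 0xE7,
    // which never appears in the UTF-8 encoding of 'ç'.
    assert(cast(char) '\u00E7' == 0xE7);
}
```

That is why searching "aça" for the code unit 0xE7 finds nothing: the string contains 0xC3 0xA7 instead.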

Ali

Jun 19 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 19 June 2013 at 15:13:23 UTC, Ali Çehreli wrote:
> On 06/19/2013 05:34 AM, monarch_dodra wrote:
>
>> I know a "binary" char can hold the values 0 to 0xFF. However, I'm
>> wondering about the cases where a codepoint can fit inside a char. For
>> example, 'ç' is represented by 0xe7, which technically fits inside a
>> char.
>
> 'ç' is represented by 0xe7 in an encoding that is not UTF-8. :)
>
> That would be a special agreement between the producer and the
> consumer of that string. Otherwise, 0xe7 is not 'ç'. I
> recommend ubyte[] for those cases.
>
> In UTF-8, 0xe7 is the first byte of a 3-byte sequence that encodes a single code point:
>
> import std.stdio;
>
> void main()
> {
>      char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
>      writeln(a);
> }
>
> Prints a Chinese character:
>
> abc瀀
>
> Ali

Hum... well, that's true for UTF-8 strings, if the _codeunit_ 
0xe7 appears, it is not 'ç'.

But when handling a 'char', there is no encoding, it "should" be 
raw _codepoint_.

I'm not really sure *if* these cases should be handled, nor how :/

Jun 19 2013

"anonymous" <anonymous example.com> writes:

On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
> Hum... well, that's true for UTF-8 strings, if the _codeunit_
> 0xe7 appears, it is not 'ç'.
>
> But when handling a 'char', there is no encoding, it "should"
> be raw _codepoint_.

No, char is a UTF8 code unit.
Code unit and code point become synonymous in UTF32, so dchar is
a code point.

Jun 19 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
>> Hum... well, that's true for UTF-8 strings, if the _codeunit_
>> 0xe7 appears, it is not 'ç'.
>>
>> But when handling a 'char', there is no encoding, it "should"
>> be raw _codepoint_.
>
> No, char is a UTF8 code unit.
> Code unit and code point become synonymous in UTF32, so dchar is
> a code point.

Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is 
the only case where a code unit is guaranteed to be a code point. For both 
char (UTF-8) and wchar (UTF-16), the number of code units in a code point is 
variable, and in the case of UTF-8, any code point which isn't an ASCII 
character is multiple code units. Wikipedia and TDPL both have a nice chart 
showing the valid values for UTF-8 and how many code units are in a code point 
for each set of values:

http://en.wikipedia.org/wiki/UTF-8#Description

- Jonathan M Davis
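
Applied to the myFind example from the first post, one way to follow this is to take the needle as a dchar and let foreach decode the string. This is only a sketch: the name myFindCodePoint is hypothetical, and it assumes the input is valid UTF-8 (decoding invalid input in a foreach throws a UTFException).

```d
// Hypothetical variant of myFind: the needle is a code point (dchar),
// not a code unit (char).
string myFindCodePoint(string s, dchar c)
{
    // With a dchar loop variable, foreach decodes UTF-8 on the fly;
    // i is the index of the first code unit of each code point.
    foreach (i, dchar sc; s)
        if (sc == c)
            return s[i .. $];
    return s[$ .. $];
}

unittest
{
    // 'ç' written as \u00E7 so the example does not depend on file encoding.
    assert(myFindCodePoint("a\u00E7a", '\u00E7') == "\u00E7a");
    assert(myFindCodePoint("abc", 'b') == "bc");
    assert(myFindCodePoint("abc", '\u00E7') == "");
}
```

Searching code unit by code unit with a char needle remains fine when the needle is ASCII, since ASCII values never occur inside a multi-unit UTF-8 sequence.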

Jun 19 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 19 June 2013 at 17:48:49 UTC, Jonathan M Davis wrote:
> On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
>> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
>>> Hum... well, that's true for UTF-8 strings, if the _codeunit_
>>> 0xe7 appears, it is not 'ç'.
>>>
>>> But when handling a 'char', there is no encoding, it "should"
>>> be raw _codepoint_.
>>
>> No, char is a UTF8 code unit.
>> Code unit and code point become synonymous in UTF32, so dchar is
>> a code point.
>
> Exactly. char, wchar, and dchar are all code _units_, and dchar
> (UTF-32) is the only case where a code unit is guaranteed to be a
> code point. For both char (UTF-8) and wchar (UTF-16), the number of
> code units in a code point is variable, and in the case of UTF-8,
> any code point which isn't an ASCII character is multiple code
> units. Wikipedia and TDPL both have a nice chart showing the valid
> values for UTF-8 and how many code units are in a code point for
> each set of values:
>
> http://en.wikipedia.org/wiki/UTF-8#Description
>
> - Jonathan M Davis

Well, there is still ambiguity when you have a standalone char if 
it is holding a (partially truncated) code unit, or a partial 
code point.

If I write:
     char  c = '\xDF'; //0b11011111; //Lead UTF-8 2 byte encoding
     wchar w = 'ß';    //0b11011111; \u00DF
     assert(c == w);

The assert passes. Yet 'c' is just the partial of a 2 byte 
sequence, and not 'ß'.

In any case, this conversation gave me the answers I was looking 
for in the context of the original question.

Jun 19 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, June 19, 2013 21:22:00 monarch_dodra wrote:
> Well, there is still ambiguity when you have a standalone char if
> it is holding a (partially truncated) code unit, or a partial
> code point.
>
> If I write:
>     char  c = '\xDF'; //0b11011111; //Lead UTF-8 2 byte encoding
>     wchar w = 'ß';    //0b11011111; \u00DF
>     assert(c == w);
>
> The assert passes. Yet 'c' is just the partial of a 2 byte
> sequence, and not 'ß'.

Well, it's fundamentally broken to compare char and wchar unless you know that 
both of the values being compared are ASCII values. They're different 
encodings.

> In any case, this conversation gave me the answers I was looking
> for in the context of the original question.

Good to hear.

- Jonathan M Davis
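
As a rough illustration of the difference, the sketch below compares text across encodings by decoding both sides to code points first, instead of comparing raw code units. The function name sameCodePoints is hypothetical; byDchar and equal are the Phobos functions from std.utf and std.algorithm.comparison.

```d
import std.algorithm.comparison : equal;
import std.utf : byDchar;

// Hypothetical helper: true if a UTF-8 string and a UTF-16 string
// contain the same sequence of code points.
bool sameCodePoints(string a, wstring b)
{
    return equal(a.byDchar, b.byDchar);
}

unittest
{
    // Comparing a char with a wchar only compares their integer values,
    // which is why the assert quoted above passes.
    char  c = '\xDF';
    wchar w = '\u00DF';
    assert(c == w);

    // Decoding first compares what the text actually says.
    assert(sameCodePoints("\u00DF", "\u00DF"w));
    assert(!sameCodePoints("ss", "\u00DF"w));
}
```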

Jun 19 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 19 June 2013 at 20:09:54 UTC, Jonathan M Davis wrote:
> Good to hear.
>
> - Jonathan M Davis

Resurrecting this thread for a related question: What is the 
legal range a dchar can hold?

Is it 0 .. 0x110000, or basically, just 0 .. 2^^32?

For example, writing this:
dchar d = 0x110000;
Will result in:
Error: cannot implicitly convert expression (1114112) of type int 
to dchar.

Yet:
uint v = uint.max;
dchar d = v; //Perfectly fine.

So the question is: While I know you *can* put anything you want 
in a dchar, is it actually legal? Is the "dchar d = 0x110000;" 
thing just the whole "value range propagation" thing being weird? 
A bit more background on the thing would be nice.
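
For background, here is a small sketch of how the range can be checked at run time. It relies on dchar.max being 0x10FFFF (which would also explain why the compiler rejects the 0x110000 literal) and on std.utf.isValidDchar; the wrapper name isUnicodeScalar is made up for this example. It does not settle whether storing an out-of-range value is "legal", only that Phobos treats such values as invalid.

```d
import std.utf : isValidDchar;

// Hypothetical wrapper: true if v is a value Phobos accepts as a dchar,
// i.e. at most 0x10FFFF and not a surrogate (0xD800 .. 0xDFFF).
bool isUnicodeScalar(uint v)
{
    return isValidDchar(cast(dchar) v);
}

unittest
{
    // The largest code point Unicode defines.
    assert(dchar.max == 0x10FFFF);

    assert(isUnicodeScalar(0x10FFFF));
    assert(!isUnicodeScalar(0x110000));  // beyond the Unicode range
    assert(!isUnicodeScalar(0xD800));    // surrogate, not a code point on its own
    assert(!isUnicodeScalar(uint.max));  // fits in 32 bits, but not valid
}
```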

Jul 28 2013
