digitalmars.D.learn - What is the legal range of chars?

reply "monarch_dodra" <monarchdodra gmail.com> writes:
I know a "binary" char can hold the values 0 to 0xFF. However, 
I'm wondering about the cases where a codepoint can fit inside a 
char. For example, 'ç' is represented by 0xe7, which technically 
fits inside a char.

This is illegal:
char c = 'ç';
But this works:
char c = cast(char)'ç';
assert(c == 'ç');

... it "works"... but is it legal?

--------
The root of the question though is actually this: If I have a 
string, and somebody asks me to find the character "char c" in 
that string. Is it legal to iterate on the string char by char, 
until I find c exactly, or do I have to take into account that 
some troll may have decided to put a wchar inside my char...?

Basically:
string myFind(string s, char c)
{
     foreach(i, char sc ; s)
         if(sc == c)
             return s[i .. $];
     return s[$ .. $];
}
assert(myFind("aça", cast(char)'ç') == "ça");

The assert above will fail. But whose fault is it? Is it a wrong 
call, or a wrong implementation?
Jun 19 2013
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 06/19/2013 05:34 AM, monarch_dodra wrote:

 I know a "binary" char can hold the values 0 to 0xFF. However, I'm
 wondering about the cases where a codepoint can fit inside a char. For
 example, 'ç' is represented by 0xe7, which technically fits inside a
 char.

'ç' is represented by 0xe7 in an encoding that is not UTF-8. :) That
would be a special agreement between the producer and the consumer of
that string. Otherwise, 0xe7 is not 'ç'. I recommend ubyte[] for those
cases.

In UTF-8, 0xe7 is the first byte of a 3-byte sequence that encodes a
single code point:

import std.stdio;

void main()
{
     char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
     writeln(a);
}

Prints a Chinese character:

abc瀀

Ali
Jun 19 2013
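A minimal sketch of the point made in the reply above, using only built-in string types and `std.stdio`. It dumps the UTF-8 code units of "aça" and checks that none of them is 0xE7, which is why a char-by-char search for `cast(char)'ç'` comes up empty. The variable names are illustrative only.

```d
import std.stdio;

void main()
{
    string s = "aça";

    // In UTF-8, 'ç' (U+00E7) is stored as two code units: 0xC3 0xA7.
    assert(s.length == 4);
    assert(s[1] == 0xC3 && s[2] == 0xA7);

    // No code unit in the string equals 0xE7, so comparing each char
    // against cast(char)'ç' can never match.
    foreach (char cu; s)
        assert(cu != 0xE7);

    // Asking foreach for dchar makes it decode whole code points.
    size_t codePoints;
    foreach (dchar cp; s)
        ++codePoints;
    assert(codePoints == 3);

    // Print the raw code units in hex.
    foreach (char cu; s)
        writef("%02x ", cast(ubyte) cu);
    writeln();
}
```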
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 19 June 2013 at 15:13:23 UTC, Ali Çehreli wrote:
 On 06/19/2013 05:34 AM, monarch_dodra wrote:

 I know a "binary" char can hold the values 0 to 0xFF. However, I'm
 wondering about the cases where a codepoint can fit inside a char. For
 example, 'ç' is represented by 0xe7, which technically fits inside a
 char.

 'ç' is represented by 0xe7 in an encoding that is not UTF-8. :) That
 would be a special agreement between the producer and the consumer of
 that string. Otherwise, 0xe7 is not 'ç'. I recommend ubyte[] for those
 cases.

 In UTF-8, 0xe7 is the first byte of a 3-byte sequence that encodes a
 single code point:

 import std.stdio;

 void main()
 {
      char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
      writeln(a);
 }

 Prints a Chinese character:

 abc瀀

 Ali

Hum... well, that's true for UTF-8 strings, if the _codeunit_ 
0xe7 appears, it is not 'ç'.

But when handling a 'char', there is no encoding, it "should" 
be raw _codepoint_.

I'm not really sure *if* these cases should be handled, nor how :/
Jun 19 2013
parent reply "anonymous" <anonymous example.com> writes:
On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
 Hum... well, that's true for UTF-8 strings, if the _codeunit_ 
 0xe7 appears, it is not 'ç'.

 But when handling a 'char', there is no encoding, it "should" 
 be raw _codepoint_.
No, char is a UTF8 code unit. Code unit and code point become synonymous in UTF32, so dchar is a code point.
Jun 19 2013
parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
 On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
 Hum... well, that's true for UTF-8 strings, if the _codeunit_
 0xe7 appears, it is not 'ç'.
 
 But when handling a 'char', there is no encoding, it "should"
 be raw _codepoint_.
 No, char is a UTF8 code unit. Code unit and code point become
 synonymous in UTF32, so dchar is a code point.

Exactly. char, wchar, and dchar are all code _units_, and dchar
(UTF-32) is the only case where a code unit is guaranteed to be a code
point. For both char (UTF-8) and wchar (UTF-16), the number of code
units in a code point is variable, and in the case of UTF-8, any code
point which isn't an ASCII character is multiple code units.

Wikipedia and TDPL both have a nice chart showing the valid values for
UTF-8 and how many code units are in a code point for each set of
values:

http://en.wikipedia.org/wiki/UTF-8#Description

- Jonathan M Davis
Jun 19 2013
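One way to act on the explanation above, as a sketch rather than a definitive answer to the original question: let `myFind` take a `dchar` and have `foreach` decode the string, so the comparison happens between code points. The `dchar` parameter is a change from the original signature and is an assumption about what the caller wants. `std.utf.codeLength` is used only to show how many code units each code point needs.

```d
import std.utf : codeLength;

// Variant of myFind that searches for a code point instead of a code unit.
string myFind(string s, dchar c)
{
    // With a dchar loop variable, foreach decodes the UTF-8 sequence;
    // i is the index of the code unit where the code point starts.
    foreach (i, dchar cp; s)
        if (cp == c)
            return s[i .. $];
    return s[$ .. $];
}

void main()
{
    assert(myFind("aça", 'ç') == "ça");
    assert(myFind("aça", 'z') == "");

    // Code units needed per code point in UTF-8 (char) and UTF-16 (wchar).
    assert(codeLength!char('a') == 1);
    assert(codeLength!char('ç') == 2);
    assert(codeLength!char('瀀') == 3);
    assert(codeLength!wchar('ç') == 1);
    assert(codeLength!wchar('\U0001F600') == 2); // outside the BMP
}
```

Slicing at `i` is safe here because `i` always lands on the first code unit of a sequence, never in the middle of one.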
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 19 June 2013 at 17:48:49 UTC, Jonathan M Davis 
wrote:
 On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
 On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra 
 wrote:
 Hum... well, that's true for UTF-8 strings, if the _codeunit_
 0xe7 appears, it is not 'ç'.
 
 But when handling a 'char', there is no encoding, it "should"
 be raw _codepoint_.
 No, char is a UTF8 code unit. Code unit and code point become
 synonymous in UTF32, so dchar is a code point.

 Exactly. char, wchar, and dchar are all code _units_, and dchar
 (UTF-32) is the only case where a code unit is guaranteed to be a code
 point. For both char (UTF-8) and wchar (UTF-16), the number of code
 units in a code point is variable, and in the case of UTF-8, any code
 point which isn't an ASCII character is multiple code units.

 Wikipedia and TDPL both have a nice chart showing the valid values for
 UTF-8 and how many code units are in a code point for each set of
 values:

 http://en.wikipedia.org/wiki/UTF-8#Description

 - Jonathan M Davis

Well, there is still ambiguity when you have a standalone char if 
it is holding a (partially truncated) code unit, or a partial 
code point.

If I write:
char c = '\xDF'; //0b11011111; //Lead UTF-8 2 byte encoding
wchar w = 'ß'; //0b11011111; \u00DF
assert(c == w);

The assert passes. Yet 'c' is just the partial of a 2 byte 
sequence, and not 'ß'.

In any case, this conversation gave me the answers I was looking 
for in the context of the original question.
Jun 19 2013
parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, June 19, 2013 21:22:00 monarch_dodra wrote:
 Well, there is still ambiguity when you have a standalone char if
 it is holding a (partially truncated) code unit, or a partial
 code point.
 
 If I write:
 char c = '\xDF'; //0b11011111; //Lead UTF-8 2 byte encoding
 wchar w = 'ß'; //0b11011111; \u00DF
 assert(c == w);
 
 The assert passes. Yet 'c' is just the partial of a 2 byte
 sequence, and not 'ß'.
Well, it's fundamentally broken to compare char and wchar unless you know that both of the values being compared are ASCII values. They're different encodings.
 In any case, this conversation gave me the answers I was looking
 for in the context of the original question.
Good to hear.

- Jonathan M Davis
Jun 19 2013
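A hedged sketch of the alternative to comparing `char` against `wchar` directly: decode both sides to `dchar` with `std.utf.decode` and compare the resulting code points. The variable names are illustrative.

```d
import std.utf : decode;

void main()
{
    string  s8  = "ß";  // UTF-8:  two code units, 0xC3 0x9F
    wstring s16 = "ß"w; // UTF-16: one code unit, 0x00DF

    // The raw code units differ between the two encodings...
    assert(s8.length == 2 && s16.length == 1);
    assert(s8[0] != s16[0]);

    // ...but the decoded code points are equal.
    size_t i8, i16;
    dchar a = decode(s8, i8);
    dchar b = decode(s16, i16);
    assert(a == b && a == 0x00DF);

    // decode advances each index past the sequence it consumed.
    assert(i8 == 2 && i16 == 1);
}
```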
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Wednesday, 19 June 2013 at 20:09:54 UTC, Jonathan M Davis 
wrote:
 Good to hear.

 - Jonathan M Davis

Resurrecting this thread for a related question: What is the legal 
range a dchar can hold? Is it 0 .. 0x110000, or basically just 
0 .. 2^^32?

For example, writing this:
dchar d = 0x110000;
Will result in:
Error: cannot implicitly convert expression (1114112) of type int to dchar

Yet:
uint v = uint.max;
dchar d = v; //Perfectly fine.

So the question is: While I know you *can* put anything you want in 
a dchar, is it actually legal? Is the "dchar d = 0x110000;" thing 
just the whole "value range propagation" thing being weird?

A bit more background on the thing would be nice.
Jul 28 2013
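The thread ends without a reply to this, so the following is only a pointer, not an authoritative ruling on what is "legal". In D, `dchar.max` is 0x10FFFF, which is consistent with the compiler rejecting the literal 0x110000, and `std.utf.isValidDchar` reports whether a value is a valid Unicode code point. The explicit casts below are there so the sketch does not depend on implicit conversion rules.

```d
import std.utf : isValidDchar;

void main()
{
    // The largest Unicode code point is U+10FFFF, and that is dchar.max.
    static assert(dchar.max == 0x10FFFF);

    assert(isValidDchar(cast(dchar) 0x10FFFF));
    assert(!isValidDchar(cast(dchar) 0x110000)); // past the Unicode range
    assert(!isValidDchar(cast(dchar) 0xD800));   // UTF-16 surrogate

    // A dchar is still 32 bits wide, so an out-of-range value can be
    // stored in it; it is just not a valid code point.
    uint v = uint.max;
    dchar d = cast(dchar) v;
    assert(!isValidDchar(d));
}
```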