digitalmars.D.learn - What is the legal range of chars?
- monarch_dodra (27/27) Jun 19 2013 I know a "binary" char can hold the values 0 to 0xFF. However,
- Ali Çehreli (16/19) Jun 19 2013 char.
- monarch_dodra (6/27) Jun 19 2013 Hum... well, that's true for UTF-8 strings, if the _codeunit_
- anonymous (4/8) Jun 19 2013 No, char is a UTF8 code unit.
- Jonathan M Davis (10/20) Jun 19 2013 Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32)...
- monarch_dodra (13/41) Jun 19 2013 Well, there is still ambiguity when you have a standalone char if
- Jonathan M Davis (6/19) Jun 19 2013 Well, it's fundamentally broken to compare char and wchar unless you kno...
- monarch_dodra (17/19) Jul 28 2013 Resurrecting this thread for a related question: What is the
I know a "binary" char can hold the values 0 to 0xFF. However, I'm wondering about the cases where a codepoint can fit inside a char. For example, 'ç' is represented by 0xe7, which technically fits inside a char.

This is illegal:

    char c = 'ç';

But this works:

    char c = cast(char)'ç';
    assert(c == 'ç');

... it "works"... but is it legal?

--------

The root of the question though is actually this: If I have a string, and somebody asks me to find the character "char c" in that string, is it legal to iterate on the string char by char until I find c exactly, or do I have to take into account that some troll may have decided to put a wchar inside my char...?

Basically:

    string myFind(string s, char c)
    {
        foreach(i, char sc ; s)
            if(sc == c)
                return s[i .. $];
        return s[$ .. $];
    }

    assert(myFind("aça", cast(char)'ç') == "ça");

The assert above will fail. But whose fault is it? Is it a wrong call, or a wrong implementation?
Jun 19 2013
On 06/19/2013 05:34 AM, monarch_dodra wrote:
> I know a "binary" char can hold the values 0 to 0xFF. However, I'm wondering about the cases where a codepoint can fit inside a char. For example, 'ç' is represented by 0xe7, which technically fits inside a char.

'ç' is represented by 0xe7 in an encoding that is not UTF-8. :) That would be a special agreement between the producer and the consumer of that string. Otherwise, 0xe7 is not 'ç'.

I recommend ubyte[] for those cases.

In UTF-8, 0xe7 is the first byte of a 3-byte sequence that encodes a single code point:

    import std.stdio;

    void main()
    {
        char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
        writeln(a);
    }

Prints a Chinese character:

    abc瀀

Ali
Jun 19 2013
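To make the byte-level point above concrete, here is a minimal sketch that decodes the same three code units with `std.utf.decode` and reports how many of them one code point consumed. The helper name `firstCodePoint` is made up for illustration and is not from the thread.

```d
import std.stdio : writefln;
import std.utf : decode;

// Hypothetical helper: decode the first code point of a UTF-8 string and
// report how many code units (bytes) it occupied.
dchar firstCodePoint(string s, out size_t unitsUsed)
{
    size_t index = 0;
    dchar cp = decode(s, index); // advances index past the whole sequence
    unitsUsed = index;
    return cp;
}

void main()
{
    string s = "\xe7\x80\x80"; // the three code units from the post above
    size_t n;
    dchar cp = firstCodePoint(s, n);
    writefln("U+%04X took %s code units", cast(uint) cp, n);
    assert(cp == 0x7000 && n == 3); // U+7000 is the character 瀀
}
```

Taken on its own, the 0xe7 code unit carries no character; only the full three-unit sequence does.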
On Wednesday, 19 June 2013 at 15:13:23 UTC, Ali Çehreli wrote:
> On 06/19/2013 05:34 AM, monarch_dodra wrote:
>> I know a "binary" char can hold the values 0 to 0xFF. However, I'm wondering about the cases where a codepoint can fit inside a char. For example, 'ç' is represented by 0xe7, which technically fits inside a char.
>
> 'ç' is represented by 0xe7 in an encoding that is not UTF-8. :) That would be a special agreement between the producer and the consumer of that string. Otherwise, 0xe7 is not 'ç'.
>
> I recommend ubyte[] for those cases.
>
> In UTF-8, 0xe7 is the first byte of a 3-byte code point:
>
>     import std.stdio;
>
>     void main()
>     {
>         char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
>         writeln(a);
>     }
>
> Prints a Chinese character:
>
>     abc瀀
>
> Ali

Hum... well, that's true for UTF-8 strings: if the _codeunit_ 0xe7 appears, it is not 'ç'. But when handling a 'char', there is no encoding; it "should" be a raw _codepoint_.

I'm not really sure *if* these cases should be handled, nor how :/
Jun 19 2013
On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
> Hum... well, that's true for UTF-8 strings, if the _codeunit_ 0xe7 appears, it is not 'ç'. But when handling a 'char', there is no encoding, it "should" be raw _codepoint_.

No, char is a UTF8 code unit. Code unit and code point become synonymous in UTF32, so dchar is a code point.
Jun 19 2013
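If char is a code unit and dchar a code point, one possible way to write the search from the original post is to take the needle as a dchar and let foreach decode the string. This is only a sketch of that idea, reusing the `myFind` name from the question; it is not a fix anyone in the thread proposed.

```d
// Hypothetical variant of myFind: the needle is a dchar (a code point),
// and foreach decodes the UTF-8 string one code point at a time.
string myFind(string s, dchar c)
{
    foreach (i, dchar sc; s) // i indexes the first code unit of each sequence
        if (sc == c)
            return s[i .. $];
    return s[$ .. $];
}

void main()
{
    assert(myFind("aça", 'ç') == "ça");
    assert(myFind("aça", 'z') == "");
}
```

Because the index refers to the start of the encoded sequence, the returned slice never begins in the middle of a multi-unit code point.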
On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
>> Hum... well, that's true for UTF-8 strings, if the _codeunit_ 0xe7 appears, it is not 'ç'. But when handling a 'char', there is no encoding, it "should" be raw _codepoint_.
>
> No, char is a UTF8 code unit. Code unit and code point become synonymous in UTF32, so dchar is a code point.

Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is the only case where a code unit is guaranteed to be a code point. For both char (UTF-8) and wchar (UTF-16), the number of code units in a code point is variable, and in the case of UTF-8, any code point which isn't an ASCII character is multiple code units.

Wikipedia and TDPL both have a nice chart showing the valid values for UTF-8 and how many code units are in a code point for each set of values:

http://en.wikipedia.org/wiki/UTF-8#Description

- Jonathan M Davis
Jun 19 2013
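The variable number of code units per code point can be checked directly. The sketch below assumes the source file is saved as UTF-8 and uses `std.utf.codeLength`; the sample characters are arbitrary.

```d
import std.utf : codeLength;

void main()
{
    // The same single code point stored in the three encodings.
    string  s8  = "ç";  // UTF-8
    wstring s16 = "ç"w; // UTF-16
    dstring s32 = "ç"d; // UTF-32

    assert(s8.length == 2);  // two char code units: 0xC3 0xA7
    assert(s16.length == 1); // one wchar code unit
    assert(s32.length == 1); // one dchar code unit

    // Number of code units needed to encode one code point.
    assert(codeLength!char('a') == 1);           // ASCII
    assert(codeLength!char('ç') == 2);
    assert(codeLength!char('瀀') == 3);
    assert(codeLength!wchar('\U0001F600') == 2); // surrogate pair in UTF-16
}
```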
On Wednesday, 19 June 2013 at 17:48:49 UTC, Jonathan M Davis wrote:
> On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
>> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
>>> Hum... well, that's true for UTF-8 strings, if the _codeunit_ 0xe7 appears, it is not 'ç'. But when handling a 'char', there is no encoding, it "should" be raw _codepoint_.
>>
>> No, char is a UTF8 code unit. Code unit and code point become synonymous in UTF32, so dchar is a code point.
>
> Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is the only case where a code unit is guaranteed to be a code point. For both char (UTF-8) and wchar (UTF-16), the number of code units in a code point is variable, and in the case of UTF-8, any code point which isn't an ASCII characters is multiple code units.
>
> Wikipedia and TDPL both have a nice chart showing the valid values for UTF-8 and how many code units are in a code point for each set of values:
>
> http://en.wikipedia.org/wiki/UTF-8#Description
>
> - Jonathan M Davis

Well, there is still ambiguity when you have a standalone char if it is holding a (partially truncated) code unit, or a partial code point. If I write:

    char c = '\xDF'; //0b11011111; //Lead UTF-8 2 byte encoding
    wchar w = 'ß'; //0b11011111; \u00DF
    assert(c == w);

The assert passes. Yet 'c' is just the partial of a 2 byte sequence, and not 'ß'.

In any case, this conversation gave me the answers I was looking for in the context of the original question.
Jun 19 2013
On Wednesday, June 19, 2013 21:22:00 monarch_dodra wrote:
> Well, there is still ambiguity when you have a standalone char if it is holding a (paritally truncated) code unit, or a partial code point. If I write:
>
>     char c = '\xDF'; //0b11011111; //Lead UTF-8 2 byte encoding
>     wchar w = 'ß'; //0b11011111; \u00DF
>     assert(c == w);
>
> The assert passes. Yet 'c' is just the partial of a 2 byte sequence, and not 'ß'.

Well, it's fundamentally broken to compare char and wchar unless you know that both of the values being compared are ASCII values. They're different encodings.

> In any case, this conversation gave me the answers I was looking for in the context of the original question.

Good to hear.

- Jonathan M Davis
Jun 19 2013
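The ASCII-only rule above could be captured in a small guard. The helper `sameAscii` below is invented for illustration; it only accepts a match when both code units are below 0x80, the range where UTF-8 and UTF-16 agree.

```d
// Hypothetical helper: compare a UTF-8 code unit with a UTF-16 code unit
// only when both are ASCII, where the two encodings coincide.
bool sameAscii(char c, wchar w)
{
    return c < 0x80 && w < 0x80 && c == w;
}

void main()
{
    char  c = '\xDF'; // lead byte of a 2-byte UTF-8 sequence
    wchar w = 'ß';    // U+00DF as a single UTF-16 code unit

    assert(c == w);           // numerically equal, as in the quoted example
    assert(!sameAscii(c, w)); // but not treated as a match by the guard
    assert(sameAscii('a', 'a'));
}
```

For anything outside ASCII, decoding both sides to dchar before comparing is the safer route.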
On Wednesday, 19 June 2013 at 20:09:54 UTC, Jonathan M Davis wrote:
> Good to hear.
>
> - Jonathan M Davis

Resurrecting this thread for a related question: What is the legal range a dchar can hold? Is it 0 .. 0x110000, or basically, just 0 .. 2^^32?

For example, writing this:

    dchar d = 0x110000;

Will result in:

    Error: cannot implicitly convert expression (1114112) of type int to dchar.

Yet:

    uint v = uint.max;
    dchar d = v; //Perfectly fine.

So the question is: While I know you *can* put anything you want in a dchar, is it actually legal? Is the "dchar d = 0x110000;" thing just the whole "value range propagation" thing being weird?

A bit more background on the thing would be nice.
Jul 28 2013
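The thread as archived ends without an answer, so the following is only a sketch of how the distinction between "fits in 32 bits" and "is a valid code point" can be probed with `std.utf.isValidDchar`. The explicit casts are there so the example does not depend on which implicit conversions the compiler accepts.

```d
import std.utf : isValidDchar;

void main()
{
    // The type's own upper bound is the last Unicode code point.
    static assert(dchar.max == 0x10FFFF);

    assert(isValidDchar(cast(dchar) 0x10FFFF));
    assert(!isValidDchar(cast(dchar) 0x110000)); // one past the last code point
    assert(!isValidDchar(cast(dchar) 0xD800));   // surrogate, not a code point

    // A dchar variable can physically hold any 32-bit pattern,
    // but that does not make the value valid UTF-32.
    dchar d = cast(dchar) uint.max;
    assert(!isValidDchar(d));
}
```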