digitalmars.D - Characters in D
- Eugene (10/10) Nov 02 2019 Hello!
- Adam D. Ruppe (6/11) Nov 02 2019 The individual char can only hold those, but a group of chars can
- Eugene (3/14) Nov 02 2019 Um. It is not obvious at all. What's mean spread across multiple
- user4567 (6/26) Nov 02 2019 it's encoded in UTF-8, for example the **string** "п" takes 2
- Eugene (4/31) Nov 02 2019 "п" is represented by two code units, but "п"d is represented by
- Adam D. Ruppe (5/7) Nov 02 2019 The one character there is multiple utf-8 code units, thus
- Jacob Carlborg (10/12) Nov 02 2019 This might be a bit confusing. But the type of a character literal
Hello! The book "Programming in D" says: "Variable of type char can only hold letters that are in the ASCII table" (section 15.4, Character literals). So why does the following code compile and run?

    char[] cyrillics = "привет".dup;
    writeln(cyrillics.idup);

Cyrillic characters are not in the ASCII table, so why does this work? The following code, on the other hand, behaves the way the book describes:

    char[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т']; // does not compile
Nov 02 2019
On Saturday, 2 November 2019 at 15:44:49 UTC, Eugene wrote:
> "Variable of type char can only hold letters that are in the ASCII table". (section 15.4 Character literals) So why there is executed next code?

The individual char can only hold those, but a group of chars can hold anything.

> char[] cyrillics = "привет".dup;

This works because the "string" has multi-char groupings.

> char[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т']; // not compiled

And this doesn't, because you are specifying individual items there, so it can't just spread them across multiple bytes.
Nov 02 2019
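A minimal sketch of the point above, assuming a reasonably recent D compiler: no single `char` in the copied array is an ASCII letter, yet the group as a whole decodes to the six Cyrillic letters. The counter `letters` is introduced here for illustration and is not from the thread.

```d
import std.stdio : writeln;

void main()
{
    // .dup copies the UTF-8 code units of the string literal.
    char[] cyrillics = "привет".dup;

    // Every individual code unit is a non-ASCII byte (0x80 or above) ...
    foreach (char unit; cyrillics)
        assert(unit >= 0x80);

    // ... but iterating as dchar decodes each multi-char group into one letter.
    size_t letters;
    foreach (dchar c; cyrillics)
        ++letters;
    assert(letters == 6);

    writeln(cyrillics); // writes the 12 bytes to stdout unchanged
}
```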
On Saturday, 2 November 2019 at 15:54:02 UTC, Adam D. Ruppe wrote:
> and this doesn't because you are specifying individual items there so it can't just spread them across multiple bytes

Um. That is not obvious at all. What does "spread across multiple bytes" mean?
Nov 02 2019
On Saturday, 2 November 2019 at 18:09:01 UTC, Eugene wrote:
> Um. It is not obvious at all. What's mean spread across multiple bytes?

It's encoded in UTF-8. For example, the **string** "п" takes 2 `char`s, although it's only one grapheme.

    assert("привет".length == 12); // encoded as UTF-8
    assert("привет"d.length == 6); // UTF-32: each dchar is 4 bytes and can hold a Cyrillic character
Nov 02 2019
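The same distinction can be sketched with Phobos, assuming `std.range.walkLength` and `std.uni.byGrapheme` are available as in current releases: `.length` counts code units, while walking the string counts decoded code points or graphemes.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "привет";

    assert(s.length == 12);               // UTF-8 code units (bytes)
    assert(s.walkLength == 6);            // code points, decoded while walking
    assert(s.byGrapheme.walkLength == 6); // graphemes (user-perceived characters)

    assert("привет"w.length == 6);        // UTF-16: one wchar per letter here
    assert("привет"d.length == 6);        // UTF-32: one dchar per code point
    static assert(dchar.sizeof == 4);
}
```

For this word the code point and grapheme counts coincide; with combining marks they can differ.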
On Saturday, 2 November 2019 at 18:26:57 UTC, user4567 wrote:
> it's encoded in UTF-8, for example the **string** "п" takes 2 `char`s, although it's only one grapheme.

"п" is represented by two code units, but "п"d is represented by one code point, hence 12 and 6 respectively. The dup function works on code units and copies them into a char[]. So?
Nov 02 2019
The in-memory representation of your привет will look different depending on the encoding format:

    // UTF-8: you cannot put 'п' in a single char, so it is encoded as 2 bytes: 0xd0 0xbf
    char[] cyrillics = [0xd0, 0xbf, 0xd1, 0x80, 0xd0, 0xb8, 0xd0, 0xb2, 0xd0, 0xb5, 0xd1, 0x82];

    // UTF-16: a wchar has enough space to accommodate any letter from привет
    wchar[] cyrillics = [0x043f, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442];
    // or, equivalently, because each letter fits in a wchar:
    wchar[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т'];

    // UTF-32: a dchar has enough space to accommodate any letter from привет
    dchar[] cyrillics = [0x0000043f, 0x00000440, 0x00000438, 0x00000432, 0x00000435, 0x00000442];
    // or, equivalently, because each letter fits in a dchar:
    dchar[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т'];
Nov 02 2019
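One way to check these byte values is to compare the raw storage of each literal against the listed numbers. This is a sketch assuming `std.string.representation`, which views a string as an array of `ubyte`, `ushort`, or `uint`; the names `utf8`, `utf16`, and `utf32` are made up for the example.

```d
import std.string : representation;

void main()
{
    immutable ubyte[] utf8 = [0xd0, 0xbf, 0xd1, 0x80, 0xd0, 0xb8,
                              0xd0, 0xb2, 0xd0, 0xb5, 0xd1, 0x82];
    immutable ushort[] utf16 = [0x043f, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442];
    immutable uint[] utf32 = [0x043f, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442];

    // Same text, three encodings, three different storage layouts.
    assert("привет".representation == utf8);
    assert("привет"w.representation == utf16);
    assert("привет"d.representation == utf32);
}
```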
On Saturday, 2 November 2019 at 18:45:50 UTC, Eugene wrote:
> "п" is represented by two code units, but "п"d is represented by one code point, therefore 12 and 6 respectively. Function dup manipulates by code units and represents their to char[]. So?

Oh, I see what you're asking. At first we thought you hadn't understood the implications of encoding. So it's just a rule: if you use `char` literals, they must be ASCII. The rationale could be that this rule avoids bad surprises about the length of the array; otherwise I can't imagine anything else. Only the original designers (so Bright) would know the exact rationale... I can't say more.
Nov 02 2019
On Saturday, 2 November 2019 at 20:49:15 UTC, user4567 wrote:
> So it's just a rule. If you use `char` literals they must be ascii.

Actually, if I understand correctly, you asked why there isn't an implicit encoding.
Nov 02 2019
On Saturday, 2 November 2019 at 20:53:30 UTC, user4567 wrote:
> Actually you asked why isn't there an implicit encoding if I understand correctly.

That would require special cases in the compiler and in the language semantics. Implicit encoding would only be possible when a char literal is an array element. Special cases in the semantics are not nice, IMO: "here we are in an array, so the literal can be expanded to several bytes; here we're not in an array, so it's not allowed". You see? Not nice, because it's confusing.
Nov 02 2019
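Since the encoding is not implicit, it has to be requested explicitly. A rough sketch of what that could look like, assuming `std.utf.encode` with its `char[4]` buffer overload; `encodeAll` is a hypothetical helper written for this example, not something from the thread or from Phobos.

```d
import std.utf : encode;

/// Hypothetical helper: UTF-8 encodes each code point and concatenates the results.
char[] encodeAll(const(dchar)[] letters)
{
    char[] result;
    foreach (dchar c; letters)
    {
        char[4] buf;
        immutable n = encode(buf, c); // number of code units written to buf
        result ~= buf[0 .. n];
    }
    return result;
}

void main()
{
    // Six literals go in, twelve code units come out.
    char[] cyrillics = encodeAll(['п', 'р', 'и', 'в', 'е', 'т']);
    assert(cyrillics.length == 12);
    assert(cyrillics == "привет");
}
```

The length change from 6 to 12 is visible at the call site here, which is the kind of surprise the rule discussed above would otherwise hide.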
On Saturday, 2 November 2019 at 20:58:06 UTC, user4567 wrote:
> That would require special cases in the compiler and language semantics. Implicit encoding would only be possible when a char literal is an array element.

Even worse: the special case would only work in dynamic arrays and not in static arrays.
Nov 02 2019
On Saturday, 2 November 2019 at 20:59:17 UTC, user4567 wrote:
> Even worse. The special case would only work in dynamic arrays and not static arrays.

👍 Yes, since only in a dynamic array could the code units (bytes) spread out. Thanks.
Nov 03 2019
On Saturday, 2 November 2019 at 18:09:01 UTC, Eugene wrote:
> Um. It is not obvious at all. What's mean spread across multiple bytes?

The one character there is multiple UTF-8 code units, thus multiple bytes. But with the ['x', 'y'] syntax you are being specific that each thing must be one unit.
Nov 02 2019
On 2019-11-02 16:44, Eugene wrote:
> next code is ok according on what is written in book: char[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т']; //not compiled

This might be a bit confusing, but the type of a character literal changes depending on its content:

    static assert(is(typeof('a') == char));
    static assert(is(typeof('п') == wchar));
    static assert(is(typeof('😊') == dchar));

So the type you have on the right side does not match the type specified on the left side.

--
/Jacob Carlborg
Nov 02 2019
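Building on the literal types above, one possible way to make the array form compile is to declare an element type wide enough for the literals and transcode afterwards. This sketch assumes `std.utf.toUTF8`; the variable names `wide`, `full`, and `utf8` are illustrative only.

```d
import std.utf : toUTF8;

void main()
{
    // A non-ASCII literal such as 'п' is a wchar, so it does not convert to char.
    static assert(is(typeof('п') == wchar));
    static assert(!__traits(compiles, { char c = 'п'; }));
    static assert(!__traits(compiles, { char[] a = ['п', 'р']; }));

    // Wider element types accept the same literals.
    wchar[] wide = ['п', 'р', 'и', 'в', 'е', 'т'];
    dchar[] full = ['п', 'р', 'и', 'в', 'е', 'т'];
    assert(wide.length == 6 && full.length == 6);

    // Transcoding to UTF-8 is an explicit step and doubles the length here.
    string utf8 = toUTF8(full);
    assert(utf8 == "привет");
    assert(utf8.length == 12);
}
```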