digitalmars.D - Characters in D

Eugene <lecom yandex.ru> writes:

Hello!

In the book "Programming in D" it is written:
"Variable of type char can only hold letters that are in the 
ASCII table". (section 15.4 Character literals)
So why does the following code compile and run?
char[] cyrillics = "привет".dup;
writeln(cyrillics.idup);

Cyrillic characters are not in the ASCII table. Why does this work?

The following code, on the other hand, behaves as the book says:
char[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т']; // does not compile

Nov 02 2019

Adam D. Ruppe <destructionator gmail.com> writes:

On Saturday, 2 November 2019 at 15:44:49 UTC, Eugene wrote:
 "Variable of type char can only hold letters that are in the 
 ASCII table". (section 15.4 Character literals)
 So why does the following code compile and run?

The individual char can only hold those, but a group of chars can 
hold anything.

 char[] cyrillics = "привет".dup;

this works because the "string" has multi-char groupings

 char[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т']; //not

and this doesn't because you are specifying individual items 
there so it can't just spread them across multiple bytes
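
A minimal sketch of the difference described above, assuming a recent D compiler; the variable names are illustrative and not from the thread. It checks that the string literal is stored as 12 UTF-8 code units, and uses `__traits(compiles, ...)` to confirm that the element-by-element form is rejected:

```d
void main()
{
    // The compiler encodes the string literal as UTF-8:
    // 6 letters become 12 chars (code units).
    char[] fromString = "привет".dup;
    assert(fromString.length == 12);

    // In an array literal every element must fit in a single char,
    // so a Cyrillic letter is rejected at compile time.
    static assert(!__traits(compiles, { char[] a = ['п', 'р']; }));

    // ASCII letters fit in one char each, so this form is accepted.
    char[] ascii = ['h', 'i'];
    assert(ascii.length == 2);
}
```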

Nov 02 2019

Eugene <lecom yandex.ru> writes:

On Saturday, 2 November 2019 at 15:54:02 UTC, Adam D. Ruppe wrote:
 char[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т']; //not

 and this doesn't because you are specifying individual items 
 there so it can't just spread them across multiple bytes

Um. It is not obvious at all. What does "spread across multiple 
bytes" mean?

Nov 02 2019

user4567 <user4567 1234.te> writes:

On Saturday, 2 November 2019 at 18:09:01 UTC, Eugene wrote:
 Um. It is not obvious at all. What does "spread across 
 multiple bytes" mean?

It's encoded in UTF-8. For example, the **string** "п" takes 2 
`char`s, although it's only one grapheme.

     // UTF-8: each Cyrillic letter takes two char code units
     assert("привет".length == 12);
     // UTF-32: each dchar is 4 bytes and can hold any Cyrillic letter
     assert("привет"d.length == 6);
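
To make the code unit versus code point distinction concrete, here is a small sketch. The helper names `codeUnits` and `codePoints` are hypothetical, introduced only for illustration; `walkLength` is the Phobos range primitive, which decodes a narrow string as it iterates:

```d
import std.range : walkLength;

/// Number of UTF-8 code units (bytes) in s.
size_t codeUnits(string s)
{
    return s.length;
}

/// Number of code points in s; walkLength decodes the UTF-8 while iterating.
size_t codePoints(string s)
{
    return s.walkLength;
}

void main()
{
    assert(codeUnits("привет") == 12);
    assert(codePoints("привет") == 6);

    // A foreach loop with a dchar variable also decodes on the fly.
    size_t count;
    foreach (dchar c; "привет")
        ++count;
    assert(count == 6);
}
```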

Nov 02 2019

Eugene <lecom yandex.ru> writes:

On Saturday, 2 November 2019 at 18:26:57 UTC, user4567 wrote:
 It's encoded in UTF-8. For example, the **string** "п" takes 2 
 `char`s, although it's only one grapheme.

     // UTF-8: each Cyrillic letter takes two char code units
     assert("привет".length == 12);
     // UTF-32: each dchar is 4 bytes and can hold any Cyrillic letter
     assert("привет"d.length == 6);

"п" is represented by two code units, but "п"d is represented by 
one code point, hence 12 and 6 respectively. The dup function 
operates on code units and returns them as a char[]. So?

Nov 02 2019

Rumbu <rumbu rumbu.ro> writes:

The memory representation of your привет will look different 
depending on the encoding format:

// UTF-8: you cannot put 'п' in a single char, so it will be
// encoded as 2 bytes: 0xd0 0xbf
char[] cyrillics = [0xd0, 0xbf, 0xd1, 0x80, 0xd0, 0xb8, 0xd0, 
0xb2, 0xd0, 0xb5, 0xd1, 0x82];


// UTF-16: a wchar has enough space to accommodate any letter from
// привет
wchar[] cyrillics = [0x043f, 0x0440, 0x0438, 0x0432, 0x0435, 
0x0442];
// or - this is the same because each letter will fit in a wchar:
wchar[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т'];


// UTF-32: a dchar has enough space to accommodate any letter from
// привет
dchar[] cyrillics = [0x0000043f, 0x00000440, 0x00000438, 
0x00000432, 0x00000435, 0x00000442];
// or - this is the same because each letter will fit in a dchar:
dchar[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т'];
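
The values listed above can be checked with a short program. This is only a sketch: the variable names are made up for illustration, and it relies on `std.string.representation`, which exposes the code units of a string as plain integers:

```d
import std.string : representation;

void main()
{
    string  s8  = "привет";   // UTF-8
    wstring s16 = "привет"w;  // UTF-16
    dstring s32 = "привет"d;  // UTF-32

    // UTF-8: 12 code units, and 'п' is the byte pair 0xd0 0xbf.
    immutable(ubyte)[] units8 = s8.representation;
    assert(units8.length == 12);
    assert(units8[0] == 0xd0 && units8[1] == 0xbf);

    // UTF-16: one 16-bit code unit per letter here.
    immutable(ushort)[] units16 = s16.representation;
    assert(units16.length == 6);
    assert(units16[0] == 0x043f);

    // UTF-32: one 32-bit code unit per code point.
    immutable(uint)[] units32 = s32.representation;
    assert(units32.length == 6);
    assert(units32[0] == 0x043f);
}
```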

Nov 02 2019

user4567 <user4567 1234.te> writes:

On Saturday, 2 November 2019 at 18:45:50 UTC, Eugene wrote:
 "п" is represented by two code units, but "п"d is represented 
 by one code point, hence 12 and 6 respectively. The dup 
 function operates on code units and returns them as a char[]. 
 So?

Oh, I see what you're asking. At first we thought that you hadn't 
grasped the implications of encoding. So it's just a rule: if you 
use `char` literals, they must be ASCII.

The rationale could be that this rule avoids bad surprises about 
the length of the array; otherwise I can't imagine anything else. 
Only the original designers (so Bright) would know the exact 
rationale... can't say more.

Nov 02 2019

user4567 <user4567 1234.te> writes:

On Saturday, 2 November 2019 at 20:49:15 UTC, user4567 wrote:
 Oh, I see what you're asking. At first we thought that you 
 hadn't grasped the implications of encoding. So it's just a rule: 
 if you use `char` literals, they must be ASCII.

 The rationale could be that this rule avoids bad surprises about 
 the length of the array; otherwise I can't imagine anything 
 else. Only the original designers (so Bright) would know the 
 exact rationale... can't say more.

Actually, you asked why there isn't an implicit encoding, if I 
understand correctly.

Nov 02 2019

user4567 <user4567 1234.te> writes:

On Saturday, 2 November 2019 at 20:53:30 UTC, user4567 wrote:
 Actually, you asked why there isn't an implicit encoding, if I 
 understand correctly.

That would require special cases in the compiler and language 
semantics. Implicit encoding would only be possible when a char 
literal is an array element. Special cases in semantics are not 
nice IMO: "here we are in an array so the literal can be expanded 
to several bytes, here we're not in an array so it's not allowed". 
You see? Not nice, because it's confusing.

Nov 02 2019

user4567 <user4567 1234.te> writes:

On Saturday, 2 November 2019 at 20:58:06 UTC, user4567 wrote:
 That would require special cases in the compiler and language 
 semantics. Implicit encoding would only be possible when a char 
 literal is an array element. Special cases in semantics are not 
 nice IMO: "here we are in an array so the literal can be 
 expanded to several bytes, here we're not in an array so it's 
 not allowed". You see? Not nice, because it's confusing.

Even worse. The special case would only work in dynamic arrays 
and not static arrays.

Nov 02 2019

Eugene <lecom yandex.ru> writes:

On Saturday, 2 November 2019 at 20:59:17 UTC, user4567 wrote:
 Even worse. The special case would only work in dynamic arrays 
 and not static arrays.

👍
Yes, since it is a dynamic array, the code units (bytes) get spread across it. Thanks.

Nov 03 2019

Adam D. Ruppe <destructionator gmail.com> writes:

On Saturday, 2 November 2019 at 18:09:01 UTC, Eugene wrote:
 Um. It is not obvious at all. What does "spread across 
 multiple bytes" mean?

That one character is made of multiple UTF-8 code units, and thus 
multiple bytes.

But with the ['x', 'y'] syntax you are being specific that each 
thing must be one unit.

Nov 02 2019

Jacob Carlborg <doob me.com> writes:

On 2019-11-02 16:44, Eugene wrote:

 The following code, on the other hand, behaves as the book says:
 char[] cyrillics = ['п', 'р', 'и', 'в', 'е', 'т']; // does not compile

This might be a bit confusing, but the type of a character literal 
changes depending on its content:

static assert(is(typeof('a') == char));
static assert(is(typeof('п') == wchar));
static assert(is(typeof('😊') == dchar));

So the type you have on the right side does not match the type specified 
on the left side.
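
If the goal is still a `char[]` built from individual letters, one option is to collect them as `dchar` and encode explicitly. The sketch below is not from the thread: `encodeLetters` is a hypothetical helper name, while `toUTF8` is the Phobos function from `std.utf`:

```d
import std.utf : toUTF8;

/// Encodes individual code points as UTF-8 and returns a mutable char[].
char[] encodeLetters(const(dchar)[] letters)
{
    return letters.toUTF8.dup;
}

void main()
{
    // 'п' needs two UTF-8 code units, so the literal is typed wchar.
    static assert(is(typeof('п') == wchar));

    // dchar elements are wide enough for any code point.
    dchar[] letters = ['п', 'р', 'и', 'в', 'е', 'т'];

    char[] cyrillics = encodeLetters(letters);
    assert(cyrillics == "привет");
    assert(cyrillics.length == 12); // 6 letters, 2 code units each
}
```

The encoding step is explicit here, which matches the point made earlier in the thread that the language does not encode array elements implicitly.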

-- 
/Jacob Carlborg

Nov 02 2019
