digitalmars.D.learn - Some questions about strings

Denis <noreply noserver.lan> writes:

I have a few questions about how strings are stored.

- First, is there any difference between string, wstring and 
dstring? For example, a 3-byte Unicode character literal can be 
assigned to a variable of any of these types, then printed, etc, 
without errors.

- Are the characters of a string stored in memory by their 
Unicode codepoint(s), as opposed to some other encoding?

- Assuming that the answer to the first question is "no 
difference", do strings always allocate 4 bytes per codepoint?

- Can a series of codepoints, appropriately padded to the 
required width, and terminated by a null character, be directly 
assigned to a string WITHOUT GOING THROUGH A DECODING / ENCODING 
TRANSLATION?

The last question gets to the heart of what I'd ultimately like 
to accomplish and avoid.

Thanks for your help.

Jun 21 2020

Adam D. Ruppe <destructionator gmail.com> writes:

On Monday, 22 June 2020 at 03:17:54 UTC, Denis wrote:
 - First, is there any difference between string, wstring and 
 dstring?

Yes, they encode the same content differently in the bytes. If 
you cast it to ubyte[] and print that out you can see the 
difference.
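A minimal sketch of that suggestion, assuming U+20AC (the euro sign) as the sample character; `rawBytes` is a hypothetical helper name, not something from Phobos:

```d
import std.stdio : writeln;

// Hypothetical helpers: view each string type as its raw bytes.
// The cast reinterprets the existing memory; nothing is transcoded.
const(ubyte)[] rawBytes(string s)  { return cast(const(ubyte)[]) s; }
const(ubyte)[] rawBytes(wstring s) { return cast(const(ubyte)[]) s; }
const(ubyte)[] rawBytes(dstring s) { return cast(const(ubyte)[]) s; }

void main()
{
    string  s = "\u20AC"c; // UTF-8:  three 1-byte code units
    wstring w = "\u20AC"w; // UTF-16: one 2-byte code unit
    dstring d = "\u20AC"d; // UTF-32: one 4-byte code unit

    writeln(rawBytes(s)); // [226, 130, 172]
    writeln(rawBytes(w)); // two bytes; their order depends on endianness
    writeln(rawBytes(d)); // four bytes; their order depends on endianness
}
```

The same code point ends up as 3, 2, and 4 bytes respectively, which is the difference between the three types.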

 - Are the characters of a string stored in memory by their 
 Unicode codepoint(s), as opposed to some other encoding?

No, they are encoded in UTF-8, UTF-16, or UTF-32 for string, 
wstring, and dstring respectively.

 - Can a series of codepoints, appropriately padded to the 
 required width, and terminated by a null character, be directly 
 assigned to a string WITHOUT GOING THROUGH A DECODING / 
 ENCODING TRANSLATION?

No, they must be encoded. Unicode code points are an abstract 
concept, much like numbers, and must be encoded somehow to exist 
in memory.

Jun 21 2020

Denis <noreply noserver.lan> writes:

On Monday, 22 June 2020 at 03:24:37 UTC, Adam D. Ruppe wrote:
 On Monday, 22 June 2020 at 03:17:54 UTC, Denis wrote:
 - First, is there any difference between string, wstring and 
 dstring?

 Yes, they encode the same content differently in the bytes. If 
 you cast it to ubyte[] and print that out you can see the 
 difference.

 - Are the characters of a string stored in memory by their 
 Unicode codepoint(s), as opposed to some other encoding?

 no, they are encoded in utf-8, 16, or 32 for string, wstring, 
 and dstring respectively.

 - Can a series of codepoints, appropriately padded to the 
 required width, and terminated by a null character, be 
 directly assigned to a string WITHOUT GOING THROUGH A DECODING 
 / ENCODING TRANSLATION?

 no, they must be encoded. Unicode code points are an abstract 
 concept that must be encoded somehow to exist in memory 
 (similar to the idea of a number).

OK, then that actually simplifies what's needed, because I won't 
need to decode the UTF-8, only validate it.

My code reads a UTF-8 encoded file into a buffer and validates, 
byte by byte, the UTF-8 encoding along with some additional 
validation. If I simply return the UTF-8 encoded string, there 
won't be another decoding/encoding done -- correct?
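For comparison, here is a rough sketch of the validate-only idea using `std.utf.validate` from Phobos in place of a hand-written byte-by-byte check. The helper name `asValidatedText` and the sample buffers are made up for illustration. `validate` inspects the bytes in place and throws on malformed input, and the returned slice refers to the same memory as the buffer, so no re-encoded copy is produced:

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical helper: check that a buffer holds well-formed UTF-8 and
// hand it back typed as text, without copying or re-encoding it.
const(char)[] asValidatedText(const(ubyte)[] buffer)
{
    auto text = cast(const(char)[]) buffer; // reinterpret; same memory
    validate(text);                         // throws UTFException if malformed
    return text;
}

void main()
{
    // 'a' followed by U+20AC encoded as UTF-8.
    const(ubyte)[] good = [0x61, 0xE2, 0x82, 0xAC];
    auto text = asValidatedText(good);
    assert(text.length == good.length);
    assert(cast(const(void)*) text.ptr is cast(const(void)*) good.ptr);

    // The same sequence cut off in the middle of the 3-byte character.
    const(ubyte)[] bad = [0x61, 0xE2, 0x82];
    assertThrown!UTFException(asValidatedText(bad));
}
```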

Jun 21 2020

Adam D. Ruppe <destructionator gmail.com> writes:

On Monday, 22 June 2020 at 03:43:58 UTC, Denis wrote:
 My code reads a UTF-8 encoded file into a buffer and validates, 
 byte by byte, the UTF-8 encoding along with some additional 
 validation. If I simply return the UTF-8 encoded string, there 
 won't be another decoding/encoding done -- correct?

Yeah, D doesn't do extra work when you are just passing stuff 
around, only when you specifically ask for it by calling a 
function or maybe doing a foreach (it depends on whether you ask 
for char or dchar as the foreach element type).
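A small sketch of the foreach case; the two counting functions are hypothetical names used only to make the difference visible:

```d
import std.stdio : writeln;

// Asking for char walks the UTF-8 code units as stored; nothing is decoded.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Asking for dchar makes foreach decode one code point per iteration.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "a\u20ACb"; // 'a', U+20AC (3 bytes in UTF-8), 'b'
    writeln(countCodeUnits(s));  // 5
    writeln(countCodePoints(s)); // 3
}
```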

Jun 21 2020

Denis <noreply noserver.lan> writes:

On Monday, 22 June 2020 at 03:49:01 UTC, Adam D. Ruppe wrote:
 On Monday, 22 June 2020 at 03:43:58 UTC, Denis wrote:
 My code reads a UTF-8 encoded file into a buffer and 
 validates, byte by byte, the UTF-8 encoding along with some 
 additional validation. If I simply return the UTF-8 encoded 
 string, there won't be another decoding/encoding done -- 
 correct?

 Yeah D doesn't do extra work when you are just passing stuff 
 around, only when you specifically ask for it by calling a 
 function or maybe doing foreach (depends on if you ask for char 
 or dchar in the foreach type)

Excellent. I'm trying to make this efficient, so I'm doing all of 
the validation together, without using any external functions 
(apart from the buffer reads).

Thanks!

Jun 21 2020

Ali Çehreli <acehreli yahoo.com> writes:

On 6/21/20 8:17 PM, Denis wrote:
 I have a few questions about how strings are stored.
 - First, is there any difference between string, wstring and dstring?

string is immutable(char)[]
wstring is immutable(wchar)[]
dstring is immutable(dchar)[]

char is 1 byte: UTF-8 code unit
wchar is 2 bytes: UTF-16 code unit
dchar is 4 bytes: UTF-32 code unit

 For example, a 3-byte Unicode character literal can be assigned to a
 variable of any of these types, then printed, etc, without errors.

You can reveal some of the mystery by looking at their .length property. 
Additionally, foreach will visit these types element-by-element: char, 
wchar, and dchar, respectively.

 - Are the characters of a string stored in memory by their Unicode
 codepoint(s), as opposed to some other encoding?

As UTF encodings; nothing else.

 - Assuming that the answer to the first question is "no difference", do
 strings always allocate 4 bytes per codepoint?

No. They always allocate sufficient bytes to represent the code points 
in their respective UTF encodings. dstring is the only one where the 
number of code points equals the number of elements: UTF-32 code units, 
each being 4 bytes.
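To make the element counts concrete, here is a sketch that assumes U+1F600 as a sample code point outside the Basic Multilingual Plane; `codeUnitCounts` is a hypothetical helper:

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical helper: how many elements (code units) the same text
// occupies as a string, a wstring, and a dstring.
size_t[3] codeUnitCounts(dstring text)
{
    return [text.to!string.length, text.to!wstring.length, text.length];
}

void main()
{
    // A single code point: 4 UTF-8 units, 2 UTF-16 units, 1 UTF-32 unit.
    writeln(codeUnitCounts("\U0001F600"d)); // [4, 2, 1]

    // Element sizes in bytes for char, wchar, and dchar.
    writeln(char.sizeof, " ", wchar.sizeof, " ", dchar.sizeof); // 1 2 4
}
```

Only the dstring count matches the number of code points.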

 - Can a series of codepoints, appropriately padded to the required
 width, and terminated by a null character,

A null character is not required, but one may appear as part of a string.

 be directly assigned to a
 string WITHOUT GOING THROUGH A DECODING / ENCODING TRANSLATION?

It will go through decoding/encoding.

 The last question gets to the heart of what I'd ultimately like to
 accomplish and avoid.

 Thanks for your help.

There is also the infamous "auto decoding" of Phobos algorithms (which 
is seen as a mistake). I think one tool to get away from auto decoding of 
strings is std.string.representation:

   https://dlang.org/phobos/std_string.html#.representation

Because it returns a type that is not a string, there is no auto 
decoding to speak of. :)
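A short sketch of the difference, using `walkLength` from std.range as a stand-in for any range algorithm:

```d
import std.range : ElementType, walkLength;
import std.stdio : writeln;
import std.string : representation;

void main()
{
    string s = "a\u20ACb"; // 5 UTF-8 code units, 3 code points

    // Seen as a range, a string yields decoded dchar elements.
    static assert(is(ElementType!string == dchar));

    // representation gives the same data typed as immutable(ubyte)[].
    auto bytes = s.representation;
    static assert(is(typeof(bytes) == immutable(ubyte)[]));

    writeln(s.walkLength);     // 3: the range primitives decode the string
    writeln(bytes.walkLength); // 5: plain array elements, nothing to decode
}
```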

Ali

Jun 21 2020

Denis <noreply noserver.lan> writes:

On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
:
 string is immutable(char)[]
 wstring is immutable(wchar)[]
 dstring is immutable(dchar)[]

Got it now. This is the critical piece I missed: I understand the 
relations between the char types and the UTF encodings (thanks to 
your book). But I mistakenly thought that the string types were 
different.

 You can reveal some of the mystery by looking at their .length 
 property. Additionally, foreach will visit these types 
 element-by-element: char, wchar, and dchar, respectively.

I did not try this test -- my bad.

 null character is not required but may be a part of the strings.

The terminating null character was one of the reasons I thought 
strings were different from char arrays. Now I know better.

Thank you for these clarifications.
Denis

Jun 21 2020

Mike Parker <aldacron gmail.com> writes:

On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:
 On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
 :
 string is immutable(char)[]
 wstring is immutable(wchar)[]
 dstring is immutable(dchar)[]

 Got it now. This is the critical piece I missed: I understand 
 the relations between the char types and the UTF encodings 
 (thanks to your book). But I mistakenly thought that the string 
 types were different.

They're aliases in object.d:

https://github.com/dlang/druntime/blob/master/src/object.d#L35
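The aliases can also be confirmed at compile time. In this small sketch the variable names are arbitrary:

```d
// string, wstring, and dstring are aliases for arrays of immutable
// code units, so these checks hold at compile time.
static assert(is(string  == immutable(char)[]));
static assert(is(wstring == immutable(wchar)[]));
static assert(is(dstring == immutable(dchar)[]));

void main()
{
    immutable(char)[] a = "hello";
    string b = a;     // same type, so no conversion is involved
    char[] m = a.dup; // a mutable copy has a different type: char[]

    static assert(is(typeof(a) == typeof(b)));
    static assert(!is(typeof(m) == string));
    assert(b == m);   // the contents still compare equal
}
```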

Jun 21 2020

Denis <noreply noserver.lan> writes:

On Monday, 22 June 2020 at 04:32:32 UTC, Mike Parker wrote:
 On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:
 On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
 :
 string is immutable(char)[]
 wstring is immutable(wchar)[]
 dstring is immutable(dchar)[]

 Got it now. This is the critical piece I missed: I understand 
 the relations between the char types and the UTF encodings 
 (thanks to your book). But I mistakenly thought that the 
 string types were different.

 They're aliases in object.d:

 https://github.com/dlang/druntime/blob/master/src/object.d#L35

Right at the top and plain as day too... ;)

I appreciate the link to the source -- thanks!

Jun 21 2020

Jacob Carlborg <doob me.com> writes:

On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:

 The terminating null character was one of the reasons I thought 
 strings were different from char arrays. Now I know better.

String **literals** have a terminating null character, to help 
with integrating with C functions. But this null character will 
disappear when manipulating strings.

You cannot assume that a function parameter of type `string` will 
have a terminating null character, but calling `printf` with a 
string literal is fine:

printf("foobar\n"); // this works because string literals have a terminating null character
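For a string that is not a literal, one common approach is `std.string.toStringz`, which returns a pointer to null-terminated data. A minimal sketch, where `cLength` is a hypothetical helper added only to show the round trip through a C function:

```d
import core.stdc.stdio : printf;
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: pass a D string to a C function that expects
// null-terminated data, here strlen.
size_t cLength(string s)
{
    return strlen(s.toStringz);
}

void main()
{
    // A literal is stored with a trailing '\0', so it can go straight to C.
    printf("foobar\n");

    // A string built at run time carries no such guarantee ...
    string name = "foo";
    string message = name ~ "bar\n";

    // ... so get a null-terminated version before handing it to C.
    printf("%s", message.toStringz);

    assert(cLength(message) == message.length);
}
```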

--
/Jacob Carlborg

Jun 22 2020

Denis <noreply noserver.lan> writes:

On Monday, 22 June 2020 at 09:06:35 UTC, Jacob Carlborg wrote:

 String **literals** have a terminating null character, to help 
 with integrating with C functions. But this null character will 
 disappear when manipulating strings.

 You cannot assume that a function parameter of type `string` 
 will have a terminating null character, but calling `printf` 
 with a string literal is fine:

 printf("foobar\n"); // this works because string literals have a terminating null character

OK, it makes sense that the null terminator would be added where 
compatibility with C is required.

Good to know.

Jun 22 2020
