digitalmars.D.learn - Some questions about strings

reply Denis <noreply noserver.lan> writes:
I have a few questions about how strings are stored.

- First, is there any difference between string, wstring and 
dstring? For example, a 3-byte Unicode character literal can be 
assigned to a variable of any of these types, then printed, etc, 
without errors.

- Are the characters of a string stored in memory by their 
Unicode codepoint(s), as opposed to some other encoding?

- Assuming that the answer to the first question is "no 
difference", do strings always allocate 4 bytes per codepoint?

- Can a series of codepoints, appropriately padded to the 
required width, and terminated by a null character, be directly 
assigned to a string WITHOUT GOING THROUGH A DECODING / ENCODING 
TRANSLATION?

The last question gets to the heart of what I'd ultimately like 
to accomplish and avoid.

Thanks for your help.
Jun 21 2020
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 22 June 2020 at 03:17:54 UTC, Denis wrote:
 - First, is there any difference between string, wstring and 
 dstring?
Yes, they encode the same content as different byte sequences. If you cast each one to ubyte[] and print it, you can see the difference.
 - Are the characters of a string stored in memory by their 
 Unicode codepoint(s), as opposed to some other encoding?
No, they are encoded as UTF-8, UTF-16, or UTF-32 for string, wstring, and dstring respectively.
 - Can a series of codepoints, appropriately padded to the 
 required width, and terminated by a null character, be directly 
 assigned to a string WITHOUT GOING THROUGH A DECODING / 
 ENCODING TRANSLATION?
No, they must be encoded. Unicode code points are an abstract concept that has to be encoded somehow to exist in memory (much like a number needs some concrete representation).
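To make the cast-to-ubyte[] suggestion concrete, here is a minimal sketch. The helper name `rawBytes` is made up for this example, and the euro sign is just one convenient character that takes three bytes in UTF-8.

```d
import std.stdio : writefln;

// Hypothetical helper: view the code units of any string type as raw bytes.
// The cast reinterprets the same memory; nothing is decoded or copied.
const(ubyte)[] rawBytes(T)(T[] text)
{
    return cast(const(ubyte)[]) text;
}

void main()
{
    // U+20AC EURO SIGN: one code point, stored differently in each type.
    string  s = "€";
    wstring w = "€"w;
    dstring d = "€"d;

    writefln("UTF-8 : %(%02X %)", rawBytes(s)); // 3 bytes: E2 82 AC
    writefln("UTF-16: %(%02X %)", rawBytes(w)); // 2 bytes, order depends on the machine
    writefln("UTF-32: %(%02X %)", rawBytes(d)); // 4 bytes, order depends on the machine
}
```

The same code point occupies 3, 2, and 4 bytes respectively, which is the difference between the three types.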
Jun 21 2020
parent reply Denis <noreply noserver.lan> writes:
On Monday, 22 June 2020 at 03:24:37 UTC, Adam D. Ruppe wrote:
 On Monday, 22 June 2020 at 03:17:54 UTC, Denis wrote:
 - First, is there any difference between string, wstring and 
 dstring?
 Yes, they encode the same content as different byte sequences. If you cast each one to ubyte[] and print it, you can see the difference.
 - Are the characters of a string stored in memory by their 
 Unicode codepoint(s), as opposed to some other encoding?
 No, they are encoded as UTF-8, UTF-16, or UTF-32 for string, wstring, and dstring respectively.
 - Can a series of codepoints, appropriately padded to the 
 required width, and terminated by a null character, be 
 directly assigned to a string WITHOUT GOING THROUGH A DECODING 
 / ENCODING TRANSLATION?
 No, they must be encoded. Unicode code points are an abstract concept that has to be encoded somehow to exist in memory (much like a number needs some concrete representation).
OK, then that actually simplifies what's needed, because I won't need to decode the UTF-8, only validate it. My code reads a UTF-8 encoded file into a buffer and validates, byte by byte, the UTF-8 encoding along with some additional validation. If I simply return the UTF-8 encoded string, there won't be another decoding/encoding done -- correct?
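For comparison with a hand-written validator, the sketch below does the same "validate, then hand back the bytes as a string" step using Phobos. It is an illustration, not the code discussed in this thread: `asValidatedString` is a hypothetical name, and `std.utf.validate` stands in for the custom byte-by-byte check. The cast only changes the static type of the slice, so the returned string points at the original buffer.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: treat an already-read buffer as a string.
// The cast reinterprets the slice; no bytes are copied or re-encoded.
string asValidatedString(immutable(ubyte)[] bytes)
{
    auto text = cast(string) bytes;
    validate(text); // throws UTFException on malformed UTF-8
    return text;
}

void main()
{
    // "Déjà" encoded as UTF-8.
    immutable(ubyte)[] good = [0x44, 0xC3, 0xA9, 0x6A, 0xC3, 0xA0];
    auto text = asValidatedString(good);
    assert(text == "Déjà");
    assert(text.ptr is cast(immutable(char)*) good.ptr); // same memory

    // A two-byte sequence cut off after its first byte.
    immutable(ubyte)[] bad = [0x44, 0xC3];
    bool rejected = false;
    try
    {
        asValidatedString(bad);
    }
    catch (UTFException e)
    {
        rejected = true;
    }
    assert(rejected);
}
```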
Jun 21 2020
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Monday, 22 June 2020 at 03:43:58 UTC, Denis wrote:
 My code reads a UTF-8 encoded file into a buffer and validates, 
 byte by byte, the UTF-8 encoding along with some additional 
 validation. If I simply return the UTF-8 encoded string, there 
 won't be another decoding/encoding done -- correct?
Yeah, D doesn't do extra work when you are just passing stuff around. It only decodes when you specifically ask for it, by calling a function or maybe by doing a foreach (depending on whether you ask for char or dchar as the foreach element type).
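A small sketch of the foreach point, with made-up function names: asking for `char` walks the stored UTF-8 code units as they are, while asking for `dchar` makes the compiler decode code points on the fly. Assigning or passing the string copies only the pointer and length.

```d
// Hypothetical helpers for illustration only.

// Element type char: visits the UTF-8 code units, no decoding.
size_t codeUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Element type dchar: the loop decodes UTF-8 into code points.
size_t codePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "a€";            // 'a' is 1 byte, '€' is 3 bytes in UTF-8
    assert(codeUnits(s) == 4);
    assert(codePoints(s) == 2);

    string t = s;               // no copy of the characters, no re-encoding
    assert(t.ptr is s.ptr);
    assert(t.length == s.length);
}
```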
Jun 21 2020
parent Denis <noreply noserver.lan> writes:
On Monday, 22 June 2020 at 03:49:01 UTC, Adam D. Ruppe wrote:
 On Monday, 22 June 2020 at 03:43:58 UTC, Denis wrote:
 My code reads a UTF-8 encoded file into a buffer and 
 validates, byte by byte, the UTF-8 encoding along with some 
 additional validation. If I simply return the UTF-8 encoded 
 string, there won't be another decoding/encoding done -- 
 correct?
 Yeah, D doesn't do extra work when you are just passing stuff around. It only decodes when you specifically ask for it, by calling a function or maybe by doing a foreach (depending on whether you ask for char or dchar as the foreach element type).
Excellent. I'm trying to make this efficient, so I'm doing all of the validation together, without using any external functions (apart from the buffer reads). Thanks!
Jun 21 2020
prev sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 6/21/20 8:17 PM, Denis wrote:
 I have a few questions about how strings are stored.
 - First, is there any difference between string, wstring and dstring?
string is char[] (more precisely immutable(char)[]; the same applies to the other two)
wstring is wchar[]
dstring is dchar[]

char is 1 byte: UTF-8 code unit
wchar is 2 bytes: UTF-16 code unit
dchar is 4 bytes: UTF-32 code unit
 For example, a 3-byte Unicode character literal can be assigned to a
 variable of any of these types, then printed, etc, without errors.
You can reveal some of the mystery by looking at their .length property. Additionally, foreach will visit these types element-by-element: char, wchar, and dchar, respectively.
 - Are the characters of a string stored in memory by their Unicode
 codepoint(s), as opposed to some other encoding?
As UTF encodings; nothing else.
 - Assuming that the answer to the first question is "no difference", do
 strings always allocate 4 bytes per codepoint?
No. They always allocate sufficient bytes to represent the code points in their respective UTF encodings. dstring is the only one where the number of code points equals the number of elements: UTF-32 code units, each being 4 bytes.
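As a rough illustration of the sizes involved (the helper `byteSize` is a name invented for this sketch), `.length` counts code units, so the byte size is `.length` times the size of the element type. `walkLength` from std.range is used here to count code points in the UTF-8 string.

```d
import std.range : walkLength;

// Hypothetical helper: bytes occupied by the code units of a string.
size_t byteSize(T)(T[] text)
{
    return text.length * T.sizeof;
}

void main()
{
    // Five code points; 'é' needs two code units in UTF-8.
    string  s = "héllo";
    wstring w = "héllo"w;
    dstring d = "héllo"d;

    assert(s.length == 6 && byteSize(s) == 6);
    assert(w.length == 5 && byteSize(w) == 10);
    assert(d.length == 5 && byteSize(d) == 20);
    assert(s.walkLength == 5); // counts code points, not code units

    // U+1F600 lies outside the Basic Multilingual Plane, so even UTF-16
    // needs two code units for it; only dstring stays at one element.
    assert("\U0001F600".length == 4);
    assert("\U0001F600"w.length == 2);
    assert("\U0001F600"d.length == 1);
}
```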
 - Can a series of codepoints, appropriately padded to the required
 width, and terminated by a null character,
A null character is not required, but one may be part of a string.
 be directly assigned to a
 string WITHOUT GOING THROUGH A DECODING / ENCODING TRANSLATION?
It will go through decoding/encoding.
 The last question gets to the heart of what I'd ultimately like to
 accomplish and avoid.

 Thanks for your help.
There is also the infamous "auto decoding" of Phobos algorithms (which is considered a mistake). I think one tool to get away from auto decoding of strings is std.string.representation:

https://dlang.org/phobos/std_string.html#.representation

Because it returns a type that is not a string, there is no auto decoding to speak of. :)

Ali
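A short sketch of what that looks like in practice, assuming a compiler and Phobos with the default auto-decoding behaviour. `std.utf.byCodeUnit` is shown as a second option; it is not mentioned above and is included only for comparison.

```d
import std.range.primitives : front, walkLength;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    string s = "€uro"; // '€' is 3 bytes in UTF-8, so 6 code units in total

    // Range primitives on a string decode: front yields a whole code point.
    static assert(is(typeof(s.front) == dchar));
    assert(s.walkLength == 4);

    // representation() returns the same memory typed as immutable(ubyte)[],
    // so range algorithms see plain bytes and never decode.
    auto raw = s.representation;
    static assert(is(typeof(raw) == immutable(ubyte)[]));
    assert(raw.length == 6);
    assert(raw.front == 0xE2);
    assert(raw.ptr is cast(immutable(ubyte)*) s.ptr);

    // byCodeUnit keeps char as the element type but also skips decoding.
    assert(s.byCodeUnit.walkLength == 6);
}
```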
Jun 21 2020
parent reply Denis <noreply noserver.lan> writes:
On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
:
 string is char[]
 wstring is wchar[]
 dstring is dchar[]
Got it now. This is the critical piece I missed: I understand the relations between the char types and the UTF encodings (thanks to your book). But I mistakenly thought that the string types were different.
 You can reveal some of the mystery by looking at their .length 
 property. Additionally, foreach will visit these types 
 element-by-element: char, wchar, and dchar, respectively.
I did not try this test -- my bad.
 null character is not required but may be a part of the strings.
The terminating null character was one of the reasons I thought strings were different from char arrays. Now I know better.

Thank you for these clarifications.

Denis
Jun 21 2020
next sibling parent reply Mike Parker <aldacron gmail.com> writes:
On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:
 On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
 :
 string is char[]
 wstring is wchar[]
 dstring is dchar[]
 Got it now. This is the critical piece I missed: I understand the relations between the char types and the UTF encodings (thanks to your book). But I mistakenly thought that the string types were different.
They're aliases in object.d: https://github.com/dlang/druntime/blob/master/src/object.d#L35
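The aliases can also be checked from user code without opening the source. A minimal sketch:

```d
// Each check is evaluated at compile time; the program builds only if
// the string types really are aliases for immutable character arrays.
static assert(is(string  == immutable(char)[]));
static assert(is(wstring == immutable(wchar)[]));
static assert(is(dstring == immutable(dchar)[]));

// Code unit sizes in bytes.
static assert(char.sizeof == 1);
static assert(wchar.sizeof == 2);
static assert(dchar.sizeof == 4);

void main()
{
    // Because they are the same type, no conversion happens here.
    immutable(char)[] a = "same type";
    string b = a;
    assert(b is a); // same pointer and length
}
```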
Jun 21 2020
parent Denis <noreply noserver.lan> writes:
On Monday, 22 June 2020 at 04:32:32 UTC, Mike Parker wrote:
 On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:
 On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
 :
 string is char[]
 wstring is wchar[]
 dstring is dchar[]
 Got it now. This is the critical piece I missed: I understand the relations between the char types and the UTF encodings (thanks to your book). But I mistakenly thought that the string types were different.
 They're aliases in object.d: https://github.com/dlang/druntime/blob/master/src/object.d#L35
Right at the top and plain as day too... ;) I appreciate the link to the source -- thanks!
Jun 21 2020
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:

 The terminating null character was one of the reasons I thought 
 strings were different from char arrays. Now I know better.
String **literals** have a terminating null character, to help with integrating with C functions. But this null character will disappear when manipulating strings.

You cannot assume that a function parameter of type `string` will have a terminating null character, but calling `printf` with a string literal is fine:

printf("foobar\n"); // this will work since string literals have a terminating null character

--
/Jacob Carlborg
Jun 22 2020
parent Denis <noreply noserver.lan> writes:
On Monday, 22 June 2020 at 09:06:35 UTC, Jacob Carlborg wrote:

 String **literals** have a terminating null character, to help 
 with integrating with C functions. But this null character will 
 disappear when manipulating strings.

 You cannot assume that a function parameter of type `string` 
 will have a terminating null character, but calling `printf` 
 with a string literal is fine:

 printf("foobar\n"); // this will work since string literals 
 have a terminating null character
OK, it makes sense that the null terminator would be added where compatibility with C is required. Good to know.
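For strings that are not literals, one common approach is `std.string.toStringz`, which returns a null-terminated pointer for a C function to use. The sketch below is only an illustration; `cLength` is a hypothetical helper name.

```d
import core.stdc.stdio : printf;
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: ask C's strlen for the length of a D string.
// toStringz supplies the terminating null that a slice may lack.
size_t cLength(string s)
{
    return strlen(s.toStringz);
}

void main()
{
    printf("foobar\n"); // a literal converts implicitly to const(char)*

    string literal = "foobar";
    assert(literal.length == 6);          // the terminator is not counted
    assert(literal.ptr[6] == '\0');       // but it is stored after the literal

    string slice = literal[0 .. 3];       // "foo", same memory as the literal
    assert(slice.ptr[3] == 'b');          // no null after the slice
    assert(cLength(slice) == 3);

    printf("%s\n", slice.toStringz);      // safe to pass to C
}
```

Passing `slice.ptr` straight to `printf` with `%s` would print past the end of the slice, which is why the conversion is needed for anything other than a literal.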
Jun 22 2020