digitalmars.D.learn - Some questions about strings
- Denis (16/16) Jun 21 2020 I have a few questions about how strings are stored.
- Adam D. Ruppe (9/17) Jun 21 2020 Yes, they encode the same content differently in the bytes. If
- Denis (7/24) Jun 21 2020 OK, then that actually simplifies what's needed, because I won't
- Adam D. Ruppe (5/9) Jun 21 2020 Yeah D doesn't do extra work when you are just passing stuff
- Denis (5/15) Jun 21 2020 Excellent. I'm trying to make this efficient, so I'm doing all of
- Ali Çehreli (25/39) Jun 21 2020 string is immutable(char)[]
- Denis (11/18) Jun 21 2020 Got it now. This is the critical piece I missed: I understand the
- Mike Parker (3/12) Jun 21 2020 They're aliases in object.d:
- Denis (3/17) Jun 21 2020 Right at the top and plain as day too... ;)
- Jacob Carlborg (11/13) Jun 22 2020 String **literals** have a terminating null character, to help
- Denis (4/12) Jun 22 2020 OK, it makes sense that the null terminator would be added where
I have a few questions about how strings are stored.

- First, is there any difference between string, wstring and dstring? For example, a 3-byte Unicode character literal can be assigned to a variable of any of these types, then printed, etc, without errors.
- Are the characters of a string stored in memory by their Unicode codepoint(s), as opposed to some other encoding?
- Assuming that the answer to the first question is "no difference", do strings always allocate 4 bytes per codepoint?
- Can a series of codepoints, appropriately padded to the required width, and terminated by a null character, be directly assigned to a string WITHOUT GOING THROUGH A DECODING / ENCODING TRANSLATION?

The last question gets to the heart of what I'd ultimately like to accomplish and avoid.

Thanks for your help.
Jun 21 2020
On Monday, 22 June 2020 at 03:17:54 UTC, Denis wrote:
> - First, is there any difference between string, wstring and dstring?

Yes, they encode the same content differently in the bytes. If you cast it to ubyte[] and print that out you can see the difference.

> - Are the characters of a string stored in memory by their Unicode codepoint(s), as opposed to some other encoding?

no, they are encoded in utf-8, 16, or 32 for string, wstring, and dstring respectively.

> - Can a series of codepoints, appropriately padded to the required width, and terminated by a null character, be directly assigned to a string WITHOUT GOING THROUGH A DECODING / ENCODING TRANSLATION?

no, they must be encoded. Unicode code points are an abstract concept that must be encoded somehow to exist in memory (similar to the idea of a number).
Jun 21 2020
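One way to see the difference described above is to look at the code units of the same text in each string type. This is a minimal sketch, assuming std.string.representation as a stand-in for the cast to ubyte[] (it returns the same memory as an integer array and keeps it immutable). The helper names utf8Units, utf16Units and utf32Units are made up for illustration.

```d
import std.string : representation;

// Code units of a string in each encoding, viewed as raw integers.
// No copy is made; the result refers to the same memory as the argument.
immutable(ubyte)[]  utf8Units(string s)   { return s.representation; }
immutable(ushort)[] utf16Units(wstring s) { return s.representation; }
immutable(uint)[]   utf32Units(dstring s) { return s.representation; }

unittest
{
    // U+20AC EURO SIGN: three code units in UTF-8, one in UTF-16 and UTF-32.
    assert(utf8Units("\u20AC")   == [0xE2, 0x82, 0xAC]);
    assert(utf16Units("\u20AC"w) == [0x20AC]);
    assert(utf32Units("\u20AC"d) == [0x20AC]);
}
```

The unittest block can be run with something like `dmd -unittest -main -run file.d`.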
On Monday, 22 June 2020 at 03:24:37 UTC, Adam D. Ruppe wrote:
> On Monday, 22 June 2020 at 03:17:54 UTC, Denis wrote:
>> - First, is there any difference between string, wstring and dstring?
> Yes, they encode the same content differently in the bytes. If you cast it to ubyte[] and print that out you can see the difference.
>> - Are the characters of a string stored in memory by their Unicode codepoint(s), as opposed to some other encoding?
> no, they are encoded in utf-8, 16, or 32 for string, wstring, and dstring respectively.
>> - Can a series of codepoints, appropriately padded to the required width, and terminated by a null character, be directly assigned to a string WITHOUT GOING THROUGH A DECODING / ENCODING TRANSLATION?
> no, they must be encoded. Unicode code points are an abstract concept that must be encoded somehow to exist in memory (similar to the idea of a number).

OK, then that actually simplifies what's needed, because I won't need to decode the UTF-8, only validate it.

My code reads a UTF-8 encoded file into a buffer and validates, byte by byte, the UTF-8 encoding along with some additional validation. If I simply return the UTF-8 encoded string, there won't be another decoding/encoding done -- correct?
Jun 21 2020
On Monday, 22 June 2020 at 03:43:58 UTC, Denis wrote:
> My code reads a UTF-8 encoded file into a buffer and validates, byte by byte, the UTF-8 encoding along with some additional validation. If I simply return the UTF-8 encoded string, there won't be another decoding/encoding done -- correct?

Yeah, D doesn't do extra work when you are just passing stuff around, only when you specifically ask for it by calling a function or maybe doing foreach (it depends on whether you ask for char or dchar as the foreach type).
Jun 21 2020
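The foreach remark can be illustrated with a short sketch. The function names countCodeUnits and countCodePoints are hypothetical; the point is only that the element type named in the foreach decides whether the string is decoded.

```d
// Number of loop iterations when foreach is asked for char elements.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s)   // no decoding: visits each UTF-8 code unit
        ++n;
    return n;
}

// Number of loop iterations when foreach is asked for dchar elements.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)  // decodes the UTF-8 on the fly, one code point per step
        ++n;
    return n;
}

unittest
{
    string s = "a\u20AC";              // 'a' followed by EURO SIGN
    assert(countCodeUnits(s) == 4);    // 1 + 3 UTF-8 code units
    assert(countCodePoints(s) == 2);

    string t = s;                      // copies pointer and length only
    assert(t.ptr is s.ptr);            // same memory, nothing re-encoded
}
```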
On Monday, 22 June 2020 at 03:49:01 UTC, Adam D. Ruppe wrote:
> On Monday, 22 June 2020 at 03:43:58 UTC, Denis wrote:
>> My code reads a UTF-8 encoded file into a buffer and validates, byte by byte, the UTF-8 encoding along with some additional validation. If I simply return the UTF-8 encoded string, there won't be another decoding/encoding done -- correct?
> Yeah, D doesn't do extra work when you are just passing stuff around, only when you specifically ask for it by calling a function or maybe doing foreach (it depends on whether you ask for char or dchar as the foreach type).

Excellent. I'm trying to make this efficient, so I'm doing all of the validation together, without using any external functions (apart from the buffer reads). Thanks!
Jun 21 2020
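For comparison with the hand-written validation described above, here is a rough sketch of the "validate, then return the same bytes as a string" idea. It swaps in std.utf.validate for the custom byte-by-byte check, so it is not the approach from the post, and toValidatedString is a hypothetical name. It assumes the caller keeps no other mutable reference to the buffer.

```d
import std.exception : assumeUnique;
import std.utf : validate, UTFException;

// Checks that bytes holds well-formed UTF-8, then reinterprets the same
// memory as a string. Nothing is copied or re-encoded.
string toValidatedString(ubyte[] bytes)
{
    auto chars = cast(char[]) bytes;  // same memory, viewed as UTF-8 code units
    validate(chars);                  // throws UTFException on malformed input
    return assumeUnique(chars);       // assumption: no other mutable references exist
}

unittest
{
    import std.exception : assertThrown;

    ubyte[] good = [0xE2, 0x82, 0xAC];           // EURO SIGN in UTF-8
    auto p = good.ptr;
    string s = toValidatedString(good);
    assert(s == "\u20AC");
    assert(cast(void*) s.ptr == cast(void*) p);  // still the original buffer

    ubyte[] bad = [0x61, 0xFF];                  // 0xFF never appears in UTF-8
    assertThrown!UTFException(toValidatedString(bad));
}
```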
On 6/21/20 8:17 PM, Denis wrote:

> I have a few questions about how strings are stored.
> - First, is there any difference between string, wstring and dstring?

string is immutable(char)[]
wstring is immutable(wchar)[]
dstring is immutable(dchar)[]

char is 1 byte: UTF-8 code unit
wchar is 2 bytes: UTF-16 code unit
dchar is 4 bytes: UTF-32 code unit

> For example, a 3-byte Unicode character literal can be assigned to a variable of any of these types, then printed, etc, without errors.

You can reveal some of the mystery by looking at their .length property. Additionally, foreach will visit these types element-by-element: char, wchar, and dchar, respectively.

> - Are the characters of a string stored in memory by their Unicode codepoint(s), as opposed to some other encoding?

As UTF encodings; nothing else.

> - Assuming that the answer to the first question is "no difference", do strings always allocate 4 bytes per codepoint?

No. They always allocate sufficient bytes to represent the code points in their respective UTF encodings. dstring is the only one where the number of code points equals the number of elements: UTF-32 code units, each being 4 bytes.

> - Can a series of codepoints, appropriately padded to the required width, and terminated by a null character,

A null character is not required but may be a part of the strings.

> be directly assigned to a string WITHOUT GOING THROUGH A DECODING / ENCODING TRANSLATION?

It will go through decoding/encoding.

> The last question gets to the heart of what I'd ultimately like to accomplish and avoid. Thanks for your help.

There is also the infamous "auto decoding" of Phobos algorithms (which is widely seen as a mistake). I think one tool to get away from auto decoding of strings is std.string.representation:

https://dlang.org/phobos/std_string.html#.representation

Because it returns a type that is not a string, there is no auto decoding to speak of. :)

Ali
Jun 21 2020
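The .length and representation suggestions above can be tried with a small sketch like the following. It assumes std.range.walkLength as an example of a Phobos range algorithm that auto-decodes; the sample text is arbitrary.

```d
import std.range : walkLength;
import std.string : representation;

unittest
{
    // The same two code points ('a' and EURO SIGN) in each string type.
    string  s = "a\u20AC";
    wstring w = "a\u20AC"w;
    dstring d = "a\u20AC"d;

    // .length counts code units, not code points.
    assert(s.length == 4);   // 1 + 3 UTF-8 code units
    assert(w.length == 2);   // 2 UTF-16 code units
    assert(d.length == 2);   // 2 UTF-32 code units

    // Range algorithms auto-decode a string into dchar elements...
    assert(s.walkLength == 2);
    // ...but not the integer array returned by representation.
    assert(s.representation.walkLength == 4);
}
```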
On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
> string is immutable(char)[]
> wstring is immutable(wchar)[]
> dstring is immutable(dchar)[]

Got it now. This is the critical piece I missed: I understand the relations between the char types and the UTF encodings (thanks to your book). But I mistakenly thought that the string types were different.

> You can reveal some of the mystery by looking at their .length property. Additionally, foreach will visit these types element-by-element: char, wchar, and dchar, respectively.

I did not try this test -- my bad.

> null character is not required but may be a part of the strings.

The terminating null character was one of the reasons I thought strings were different from char arrays. Now I know better.

Thank you for these clarifications.

Denis
Jun 21 2020
On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:
> On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
>> string is immutable(char)[]
>> wstring is immutable(wchar)[]
>> dstring is immutable(dchar)[]
> Got it now. This is the critical piece I missed: I understand the relations between the char types and the UTF encodings (thanks to your book). But I mistakenly thought that the string types were different.

They're aliases in object.d:

https://github.com/dlang/druntime/blob/master/src/object.d#L35
Jun 21 2020
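The aliases can be checked at compile time. The sketch below assumes nothing beyond the language itself; note that the element types are immutable, so a mutable char[] is a distinct type and needs a copy (or a cast) to become a string. The helper fromMutable is a made-up name.

```d
// These lines compile only if the aliases are exactly as written.
static assert(is(string  == immutable(char)[]));
static assert(is(wstring == immutable(wchar)[]));
static assert(is(dstring == immutable(dchar)[]));

// Code unit sizes in bytes.
static assert(char.sizeof  == 1);
static assert(wchar.sizeof == 2);
static assert(dchar.sizeof == 4);

// A mutable buffer is not a string; .idup makes an immutable copy.
string fromMutable(const(char)[] buf)
{
    return buf.idup;
}

unittest
{
    char[] b = ['h', 'i'];
    assert(fromMutable(b) == "hi");
}
```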
On Monday, 22 June 2020 at 04:32:32 UTC, Mike Parker wrote:
> On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:
>> On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
>>> string is immutable(char)[]
>>> wstring is immutable(wchar)[]
>>> dstring is immutable(dchar)[]
>> Got it now. This is the critical piece I missed: I understand the relations between the char types and the UTF encodings (thanks to your book). But I mistakenly thought that the string types were different.
> They're aliases in object.d:
> https://github.com/dlang/druntime/blob/master/src/object.d#L35

Right at the top and plain as day too... ;)

I appreciate the link to the source -- thanks!
Jun 21 2020
On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:
> The terminating null character was one of the reasons I thought strings were different from char arrays. Now I know better.

String **literals** have a terminating null character, to help with integrating with C functions. But this null character will disappear when manipulating strings.

You cannot assume that a function parameter of type `string` will have a terminating null character, but calling `printf` with a string literal is fine:

    printf("foobar\n"); // this will work since string literals have a terminating null character

--
/Jacob Carlborg
Jun 22 2020
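A sketch of what this means in practice when a `string` that is not a literal has to reach a C function. It assumes std.string.toStringz as the way to obtain a null-terminated copy; cLength is a hypothetical helper that calls C's strlen.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Length as seen by C, for a D string that may not be null-terminated.
size_t cLength(string s)
{
    return strlen(s.toStringz);  // toStringz guarantees a trailing '\0'
}

unittest
{
    string whole = "foobar";
    string part = whole[0 .. 3];     // slice: the next byte in memory is 'b', not '\0'
    assert(cLength(part) == 3);

    // Literals are null-terminated and convert to const(char)* implicitly.
    assert(strlen("foobar") == 6);
    assert("foobar".ptr[6] == '\0'); // the terminator sits just past .length
}
```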
On Monday, 22 June 2020 at 09:06:35 UTC, Jacob Carlborg wrote:
> String **literals** have a terminating null character, to help with integrating with C functions. But this null character will disappear when manipulating strings.
>
> You cannot assume that a function parameter of type `string` will have a terminating null character, but calling `printf` with a string literal is fine:
>
>     printf("foobar\n"); // this will work since string literals have a terminating null character

OK, it makes sense that the null terminator would be added where compatibility with C is required. Good to know.
Jun 22 2020