digitalmars.D.learn - size of a string in bytes
- Nestor (17/17) Jan 28 2017 Hi,
- rikki cattermole (11/27) Jan 28 2017 A few misconceptions going on here.
- Nestor (9/49) Jan 28 2017 I do not want string length or code points. Perhaps I didn't
- rikki cattermole (12/57) Jan 28 2017 .length
- Ivan Kazmenko (14/18) Jan 28 2017 As said, the byte count is indeed string.length.
- Nestor (7/18) Jan 28 2017 Thank you Ivan,
- Adam D. Ruppe (11/13) Jan 28 2017 Not true in the language, but the Phobos library does treat char
- ag0aep6g (13/19) Jan 28 2017 In D, a `char` is a UTF-8 code unit. Its size is one byte,
- Nestor (3/18) Jan 28 2017 Very good explanation.
- H. S. Teoh via Digitalmars-d-learn (10/16) Jan 28 2017 The .length property of a string is the number of bytes used to store
Hi,

One can get the length of a string easily; however, since strings are UTF-8, sometimes characters take more than one byte. I would like to know how many bytes a string takes, but this code didn't work as I expected:

-----
import std.stdio;

void main()
{
    string mystring1;
    string mystring2 = "A string of just 48 characters for testing size.";
    writeln(mystring1.sizeof);
    writeln(mystring2.sizeof);
}
-----

In both cases the size is 8, so apparently sizeof is giving me just the default size of a string type and not the size of the variable in memory, which is what I want. Ideas?
Jan 28 2017
On 29/01/2017 3:51 AM, Nestor wrote:
> [...] In both cases the size is 8, so apparently sizeof is giving me just the default size of a string type and not the size of the variable in memory, which is what I want. Ideas?

A few misconceptions going on here.

A string element is not a grapheme; it is a char (a UTF-8 code unit), which is one byte. So what you want is mystring.length.

Now, sizeof is not telling you about the elements; it's telling you how big the reference to them is, specifically length + pointer. It would have been 16 if you had compiled in 64-bit mode, for example.

If you want to know about graphemes and code points, that is another story. For that you'll want std.uni [0] and std.utf [1].

[0] http://dlang.org/phobos/std_uni.html
[1] http://dlang.org/phobos/std_utf.html
Jan 28 2017
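The difference between .sizeof and .length described in the reply above can be sketched as follows. This is a minimal illustration rather than anything posted in the thread; the variable names `empty` and `text` are made up for the example.

```d
import std.stdio : writeln;

void main()
{
    string empty;
    string text = "A string of just 48 characters for testing size.";

    // .sizeof describes the slice itself (a length plus a pointer),
    // so it is the same for every string, whatever its contents.
    static assert(string.sizeof == size_t.sizeof + (void*).sizeof);
    writeln(empty.sizeof == text.sizeof); // true

    // .length is a property of the value: the number of char elements.
    writeln(empty.length); // 0
    writeln(text.length);  // 48

    // Each char is one byte, so this is the size of the contents in bytes.
    writeln(text.length * char.sizeof); // 48
}
```

On a 32-bit target the slice is 8 bytes and on a 64-bit target it is 16 bytes, which matches the numbers mentioned in the thread.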
On Saturday, 28 January 2017 at 14:56:03 UTC, rikki cattermole wrote:
> [...] So what you want is mystring.length [...] If you want to know about graphemes and code points that is another story.

I do not want string length or code points. Perhaps I didn't explain myself.

I want to know the variable's size in memory. For example, say I have a UTF-8 string of only 2 characters, but each of them takes 2 bytes. The string length would be 2, but the content of the string would take 4 bytes in memory (excluding overhead for type size).

How can I get that?
Jan 28 2017
On 29/01/2017 4:32 AM, Nestor wrote:
> I want to know variable size in memory. [...] How can I get that?

.length

You are misunderstanding: a char will always be exactly one byte in size. Check [0] for proof.

Keep in mind the definition of string [1]:

alias immutable(char)[] string;

There is nothing fancy going on.

The "characters" you were asking about are actually graphemes as per the Unicode standard. A grapheme can be multiple bytes and code points in size, but a char cannot.

[0] http://dlang.org/spec/type.html
[1] https://github.com/dlang/druntime/blob/master/src/object.d
Jan 28 2017
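The claims in the reply above can be checked by the compiler itself. The following is a small sketch, not code from the thread; the variable names in `main` are illustrative only.

```d
import std.stdio : writeln;

// These checks run at compile time and do not depend on any string's contents.
static assert(char.sizeof == 1);  // a char is one UTF-8 code unit: always one byte
static assert(wchar.sizeof == 2); // UTF-16 code unit
static assert(dchar.sizeof == 4); // UTF-32 code unit
static assert(is(string == immutable(char)[]));

void main()
{
    string s = "abc";

    // Indexing a string yields a single char (one byte), never a multi-byte unit.
    immutable(char) first = s[0];
    writeln(first.sizeof); // 1
    writeln(s.length);     // 3
}
```

If any of the static asserts were false, the program would fail to compile.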
On Saturday, 28 January 2017 at 15:32:33 UTC, Nestor wrote:
> I want to know variable size in memory. For example, say I have an UTF-8 string of only 2 characters, but each of them takes 2 bytes. string length would be 2, but the content of the string would take 4 bytes in memory (excluding overhead for type size).

As said, the byte count is indeed string.length. The number of code points can be found by std.range.walkLength, but be aware it takes O(answer) time to compute.

Example:

-----
import std.range, std.stdio;

void main ()
{
    auto s = "Привет!";
    writeln (s.length); // 13 bytes
    writeln (s.walkLength); // 7 code points
}
-----

Ivan Kazmenko.
Jan 28 2017
On Saturday, 28 January 2017 at 16:01:38 UTC, Ivan Kazmenko wrote:
> As said, the byte count is indeed string.length. The number of code points can be found by std.range.walkLength [...]

Thank you Ivan,

I believe I saw somewhere that in D a char was not necessarily the same as a ubyte because chars sometimes take more than one byte. So, since a string is an array of chars, I thought length behaved like walkLength (which I had not seen); in other words, that it simply returned the number of elements in the array.
Jan 28 2017
On Saturday, 28 January 2017 at 18:04:58 UTC, Nestor wrote:
> I believe I saw somewhere that in D a char was not necessarily the same as an ubyte because chars sometimes take more than one byte

Not true in the language, but the Phobos library does treat char and ubyte differently because of the multi-char things. The built-in .length on a string and indexing all work the same as with bytes.

Note that .length on a wstring or dstring (UTF-16 or UTF-32) is not in bytes, but in words. So wstring.length = number of wchars = number of 16-bit items, and a dstring's items are 32-bit. That is exactly the same as ushort[].length or int[].length: it is the length in elements, so if you actually want the byte length, you'd cast it first or something.
Jan 28 2017
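The point about wstring and dstring above can be sketched as follows. The helper `byteLength` is a hypothetical name introduced only for this example; the cast to `const(ubyte)[]` is one way of doing the "cast it first" that the reply mentions.

```d
import std.stdio : writeln;

/// Hypothetical helper: bytes occupied by the elements of any array.
size_t byteLength(T)(T[] arr)
{
    return arr.length * T.sizeof;
}

void main()
{
    string  s = "Привет!";  // UTF-8
    wstring w = "Привет!"w; // UTF-16
    dstring d = "Привет!"d; // UTF-32

    // .length counts elements (code units), whose size differs per string type.
    writeln(s.length, " ", w.length, " ", d.length); // 13 7 7

    // Multiplying by the element size gives the byte count.
    writeln(byteLength(s), " ", byteLength(w), " ", byteLength(d)); // 13 14 28

    // Alternatively, reinterpret the array as bytes; the cast rescales the length.
    auto raw = cast(const(ubyte)[]) w;
    writeln(raw.length); // 14
}
```

For a plain string both approaches give the same number as .length, since each char is one byte.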
On Saturday, 28 January 2017 at 18:04:58 UTC, Nestor wrote:
> I believe I saw somewhere that in D a char was not neccesarrily the same as an ubyte because chars sometimes take more than one byte,

In D, a `char` is a UTF-8 code unit. Its size is one byte, exactly and always.

A `char` is not a "character" in the common meaning of the word. There's a more specialized word for "character" as a visual unit: grapheme. For example, 'Ä' is a grapheme (a visual unit, a "character"), but there is no single `char` for it. To encode 'Ä' in UTF-8, a sequence of multiple code units is used.

> so since a string is an array of chars, I thought length behaved like walkLength (which I had not seen), in other words, that it simply returned the amount of elements in the array.

The elements of a `string` are (immutable) `char`s. That is, `string` is an array of UTF-8 code units. It's not an array of graphemes.

A `string`'s .length gives you the number of `char`s in it, i.e. the number of UTF-8 code units, i.e. the number of bytes.
Jan 28 2017
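The three units discussed above (code units, code points, graphemes) can be counted side by side for the 'Ä' example. This is an illustrative sketch, not code from the thread; `report` is a made-up helper, and the two spellings of 'Ä' are written with escapes so that the result does not depend on how an editor normalizes the source file.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

/// Hypothetical helper: prints the three different "lengths" of a string.
void report(string s)
{
    writeln(s.length, " code units, ",
            s.walkLength, " code points, ",
            s.byGrapheme.walkLength, " graphemes");
}

void main()
{
    // Precomposed U+00C4: 2 code units, 1 code point, 1 grapheme.
    report("\u00C4");

    // 'A' followed by combining diaeresis U+0308:
    // 3 code units, 2 code points, 1 grapheme.
    report("A\u0308");
}
```

Both strings are displayed as a single 'Ä', yet only the grapheme count agrees between them; .length reports the bytes actually stored in each case.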
On Saturday, 28 January 2017 at 19:09:01 UTC, ag0aep6g wrote:
> In D, a `char` is a UTF-8 code unit. Its size is one byte, exactly and always. [...]
> A `string`'s .length gives you the number of `char`s in it, i.e. the number of UTF-8 code units, i.e. the number of bytes.

Very good explanation. Thank you all for making this clear to me.
Jan 28 2017
On Sat, Jan 28, 2017 at 03:32:33PM +0000, Nestor via Digitalmars-d-learn wrote:
[...]
> I do not want string length or code points. Perhaps I didn't explain myself.

The .length property of a string is the number of bytes used to store the string.

> I want to know variable size in memory. For example, say I have an UTF-8 string of only 2 characters, but each of them takes 2 bytes. string length would be 2, but the content of the string would take 4 bytes in memory (excluding overhead for type size).

What you call "string length" is called grapheme count in D. What you want is the .length property. The number of bytes in a UTF-8 string is the same thing as the number of code units (note: do not confuse with code points, which is something else).

--T
Jan 28 2017
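The scenario from the question (two characters of two bytes each) can be tried directly. The sketch below assumes 'é' (U+00E9) as the two-byte character, since the thread does not name one; `std.string.representation` is used only to show the raw bytes.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.string : representation;

void main()
{
    // Two characters, each encoded as two bytes in UTF-8 (assumed example: 'é').
    string s = "\u00E9\u00E9";

    writeln(s.length);         // 4: bytes, i.e. UTF-8 code units
    writeln(s.walkLength);     // 2: code points
    writeln(s.representation); // [195, 169, 195, 169]: the bytes themselves
}
```

So the number asked for, 4, is exactly what .length returns, and the 2 comes from walkLength.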