digitalmars.D.learn - length of string result not as expected
- jicman (11/11) Aug 13 2013 Greetings.
- Jonathan M Davis (8/21) Aug 13 2013 length gives you the length of the array, which is 39, because it contai...
- jicman (4/29) Aug 13 2013 thanks, Jonathan. That looks like D2, since D1 does not have
- Adam D. Ruppe (20/22) Aug 13 2013 Your code looks like D1...
- Jacob Carlborg (21/29) Aug 14 2013 In D1 you can easily implement walkLength yourself:
- Jeremy DeHaan (19/30) Aug 13 2013 What version of DMD are you using? This code doesn't even compile
- jicman (5/40) Aug 13 2013 This is D1. Forgot to mention that. I am still in the old ages.
Greetings.

    import std.stdio;

    void main()
    {
        char[] str = "不良反應事件和產品客訴報告"; // 13 chinese characters...
        writefln(str.length);
    }

This program returns 39. I expected it to return 13. How do I find out the exact number of characters that I have in a char[] variable?

Thanks.

josé
Aug 13 2013
On Wednesday, August 14, 2013 04:53:34 jicman wrote:
> Greetings.
>
>     import std.stdio;
>
>     void main()
>     {
>         char[] str = "不良反應事件和產品客訴報告"; // 13 chinese characters...
>         writefln(str.length);
>     }
>
> this program returns 39. I expected to return 13. How do I know the exact length of the characters that I have in a char[] variable? Thanks.

length gives you the length of the array, which is 39, because it contains 39 chars. If you want to know the number of code points in the string as opposed to the number of code units (char is a UTF-8 code unit), then use std.range.walkLength, e.g.

    writeln(walkLength(str));

It'll iterate through the string and count up the number of code points.

- Jonathan M Davis
Aug 13 2013
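For readers on D2, the distinction drawn in the reply above between code units and code points can be sketched as a complete program. This is only a minimal illustration, assuming a D2 compiler with Phobos; the helper name codePointCount is hypothetical and not part of any library.

```d
import std.range : walkLength;
import std.stdio : writeln;

/// Hypothetical helper: number of Unicode code points in a UTF-8 string.
size_t codePointCount(string s)
{
    // A string is iterated as a range of dchar, so walkLength
    // counts decoded code points rather than UTF-8 code units.
    return walkLength(s);
}

void main()
{
    string str = "不良反應事件和產品客訴報告";

    writeln(str.length);          // 39 (UTF-8 code units)
    writeln(codePointCount(str)); // 13 (code points)
}
```

Note that walkLength has to decode the whole string, so it is linear in the string's length, whereas .length is a constant-time property of the array.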
On Wednesday, 14 August 2013 at 03:00:00 UTC, Jonathan M Davis wrote:
> length gives you the length of the array, which is 39, because it contains 39 chars. If you want to know the number of code points in the string as opposed to the number of code units (char is a UTF-8 code unit), then use std.range.walkLength, e.g.
>
>     writeln(walkLength(str));
>
> It'll iterate through the string and count up the number of code points.
>
> - Jonathan M Davis

Thanks, Jonathan. That looks like D2, since D1 does not have std.range in its Phobos library.
Aug 13 2013
On Wednesday, 14 August 2013 at 02:53:43 UTC, jicman wrote:
> How do I know the exact length of the characters that I have in a char[] variable? Thanks.

Your code looks like D1...

In D1 or D2:

    import std.utf;
    dstring s2 = toUTF32(str);
    writeln(s2.length); // 13

In D2 you can do it a little more efficiently like this:

    import std.range;
    writeln(walkLength(str)); // 13

The reason it shows 39 instead of 13 is that the char[] is UTF-8, and Chinese characters are multi-byte characters in UTF-8. The .length property gives the number of elements in the array, which are bytes in UTF-8. dstring uses UTF-32, which has a consistent size for each code point. A code point isn't technically quite the same as a character, but it is close enough that it works here.

Bottom line though, char[] for non-English text tends to have a longer length than you expect because a lot of characters are multi-byte in UTF-8. If you use dstring, the length is more consistent.
Aug 13 2013
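To make the encoding comparison in the reply above concrete, and to show the caveat that a code point is not always a full character, here is a small sketch. It assumes D2 with Phobos (std.conv.to and std.uni.byGrapheme); the helper name graphemeCount and the "e plus combining accent" sample are illustrative choices, not something from the thread.

```d
import std.conv : to;
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

/// Hypothetical helper: number of user-perceived characters (graphemes).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "不良反應事件和產品客訴報告";

    writeln(s.length);               // 39 (UTF-8 code units)
    writeln(to!wstring(s).length);   // 13 (UTF-16 code units)
    writeln(to!dstring(s).length);   // 13 (UTF-32 code units = code points)

    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string e = "e\u0301";

    writeln(e.length);           // 3 (UTF-8 code units)
    writeln(e.walkLength);       // 2 (code points)
    writeln(graphemeCount(e));   // 1 (grapheme)
}
```

The UTF-16 count matches the code point count here only because all thirteen characters are in the Basic Multilingual Plane; characters outside it take two UTF-16 code units each.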
On 2013-08-14 05:05, Adam D. Ruppe wrote:
> Your code looks like D1...
>
> In D1 or D2:
>
>     import std.utf;
>     dstring s2 = toUTF32(str);
>     writeln(s2.length); // 13
>
> In D2 you can do it a little more efficiently like this:
>
>     import std.range;
>     writeln(walkLength(str)); // 13

In D1 you can easily implement walkLength yourself:

    import std.utf;

    // Counts code points by hopping from one UTF sequence to the next.
    size_t walkLength (C) (C[] arr)
    {
        size_t i;   // index of the current code unit
        size_t len; // number of code points seen so far

        while (i < arr.length)
        {
            // stride gives the number of code units in the
            // sequence that starts at index i
            i += arr.stride(i);
            len++;
        }

        return len;
    }

    void main ()
    {
        auto a = "不良反應事件和產品客訴報告";
        assert(walkLength(a) == 13);
    }

-- 
/Jacob Carlborg
Aug 14 2013
On Wednesday, 14 August 2013 at 02:53:43 UTC, jicman wrote:
> Greetings.
>
>     import std.stdio;
>
>     void main()
>     {
>         char[] str = "不良反應事件和產品客訴報告"; // 13 chinese characters...
>         writefln(str.length);
>     }
>
> this program returns 39. I expected to return 13. How do I know the exact length of the characters that I have in a char[] variable? Thanks.
>
> josé

What version of DMD are you using? This code doesn't even compile for me. It gives me errors about not being able to convert type string to char[], as it should, since a string literal is immutable data. To test the code I changed char[] to string. I also got an error for "writefln(str.length);" so I just changed that to "writeln(str.length);"

Anyway, from what I understand, the reason you get this is that each of those characters needs more than a single 8-bit representation. D's chars are UTF-8, so it takes more than a single char to store the data needed to represent one of the Chinese characters. str.length will give you the length of the string with respect to each char it contains. You have 13 characters in your string, but you need 39 chars to store the data to represent them.

Alternatively, you can use a different encoding to see the actual number of characters in your string, e.g. wstring or dstring. I usually use dstrings when working with Unicode personally.
Aug 13 2013
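The two compile errors mentioned in the reply above are D2-specific. A minimal sketch of the D2 adjustments follows, assuming a D2 compiler with Phobos: the literal is stored as string, a mutable char[] is obtained with .dup when one is really needed, and writefln is given a format string.

```d
import std.stdio : writefln, writeln;

void main()
{
    // In D2 a string literal is immutable(char)[], i.e. `string`.
    string str = "不良反應事件和產品客訴報告";

    // A mutable char[] requires an explicit copy of the literal.
    char[] copy = str.dup;

    writeln(str.length);          // 39
    writefln("%s", copy.length);  // 39
}
```

Either way the length is still reported in UTF-8 code units, so the count stays at 39.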
On Wednesday, 14 August 2013 at 03:16:08 UTC, Jeremy DeHaan wrote:
> What version of DMD are you using? This code doesn't even compile for me. It gives me errors about not being able to convert type string to char[], as it should, since a string literal is immutable data. To test the code I changed char[] to string. I also got an error for "writefln(str.length);" so I just changed that to "writeln(str.length);"
>
> Anyway, from what I understand, the reason you get this is that each of those characters needs more than a single 8-bit representation. D's chars are UTF-8, so it takes more than a single char to store the data needed to represent one of the Chinese characters. str.length will give you the length of the string with respect to each char it contains. You have 13 characters in your string, but you need 39 chars to store the data to represent them.
>
> Alternatively, you can use a different encoding to see the actual number of characters in your string, e.g. wstring or dstring. I usually use dstrings when working with Unicode personally.

This is D1. Forgot to mention that. I am still in the old ages. :-)

Thanks for the insight. I figured that much, but I need to now go and try to figure out what to do with both the western character set and the Asian, Hebrew, etc. ones. Thanks.
Aug 13 2013