digitalmars.D.learn - length of string result not as expected

reply "jicman" <cabrera wrc.xerox.com> writes:
Greetings.

import std.stdio;

void main()
{
   char[] str = "不良反應事件和產品客訴報告"; // 13 Chinese characters...
   writefln(str.length);
}

This program returns 39.  I expected it to return 13.  How do I 
find the actual number of characters that I have in a char[] 
variable?  Thanks.

josé
Aug 13 2013
next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Wednesday, August 14, 2013 04:53:34 jicman wrote:
 this program returns 39. I expected to return 13. How do I know
 the exact length of the characters that I have in a char[]
 variable? Thanks.
length gives you the length of the array, which is 39, because it contains 39 chars. If you want to know the number of code points in the string as opposed to the number of code units (char is a UTF-8 code unit), then use std.range.walkLength, e.g.

    writeln(walkLength(str));

It'll iterate through the string and count up the number of code points.

- Jonathan M Davis
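For anyone following along on D2, the difference between the two counts can be sketched as below. This is a minimal example, not part of the original reply; the helper names codeUnitCount and codePointCount are made up for illustration, and the variable is declared as string because D2 string literals are immutable.

```d
import std.range : walkLength;
import std.stdio : writeln;

/// Number of UTF-8 code units (bytes) in the string.
size_t codeUnitCount(string s)
{
    return s.length;
}

/// Number of code points; walkLength decodes the string as it iterates.
size_t codePointCount(string s)
{
    return s.walkLength;
}

void main()
{
    string str = "不良反應事件和產品客訴報告";

    writeln(codeUnitCount(str));   // 39
    writeln(codePointCount(str));  // 13
}
```

Note that walkLength has to visit the whole string, so it is O(n), whereas .length is a stored value.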
Aug 13 2013
parent "jicman" <cabrera wrc.xerox.com> writes:
On Wednesday, 14 August 2013 at 03:00:00 UTC, Jonathan M Davis 
wrote:
 If you want to know the number of code points in the string as opposed
 to the number of code units (char is a UTF-8 code unit), then use
 std.range.walkLength.
Thanks, Jonathan. That looks like D2, since D1 does not have std.range in its Phobos library.
Aug 13 2013
prev sibling next sibling parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Wednesday, 14 August 2013 at 02:53:43 UTC, jicman wrote:
 know the exact length of the characters that I have in a char[] 
 variable?  Thanks.
Your code looks like D1...

In D1 or D2:

    import std.utf;
    dstring s2 = toUTF32(str);
    writeln(s2.length); // 13

In D2 you can do it a little more efficiently like this:

    import std.range;
    writeln(walkLength(str)); // 13

The reason it shows 39 instead of 13 is that the char[] is UTF-8, and Chinese characters are multi-byte characters in UTF-8. The .length property gives the number of elements in the array, which are bytes in UTF-8.

dstring uses UTF-32, which has a consistent size for each code point. A code point isn't technically quite the same as a character, but it's close enough that it works here.

Bottom line though: char[] for non-English text tends to have a longer length than you expect because a lot of characters are multi-byte in UTF-8. If you use dstring, the length is more consistent.
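To illustrate the remark that a code point is not quite the same thing as a character, here is a small D2 sketch. It assumes a reasonably recent Phobos that provides std.uni.byGrapheme; the helper names codePoints and graphemes are hypothetical and only serve the example.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : toUTF32;

/// Code points: transcode to UTF-32 and take the array length.
size_t codePoints(string s)
{
    return toUTF32(s).length;
}

/// Graphemes: what a reader would usually call "characters".
size_t graphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 (combining acute accent); displays as one character.
    string s = "e\u0301";

    writeln(s.length);       // 3 UTF-8 code units
    writeln(codePoints(s));  // 2 code points
    writeln(graphemes(s));   // 1 grapheme
}
```

For the Chinese text in this thread all three notions of "character" agree at 13, which is why counting code points is good enough here.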
Aug 13 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-08-14 05:05, Adam D. Ruppe wrote:

 Your code looks like D1...

 in D1 or D2:
 import std.utf;
 dstring s2 = toUTF32(str);
 writeln(s2.length); // 13


 in D2 you can do it a little more efficiently like this:

 import std.range;
 writeln(walkLength(str)); // 13
In D1 you can easily implement walkLength yourself:

    import std.utf;

    size_t walkLength (C) (C[] arr)
    {
        size_t i;
        size_t len;

        while (i < arr.length)
        {
            i += arr.stride(i);
            len++;
        }

        return len;
    }

    void main ()
    {
        auto a = "不良反應事件和產品客訴報告";
        assert(walkLength(a) == 13);
    }

-- 
/Jacob Carlborg
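A different way to get the same count, shown here as a D2 sketch rather than as a replacement for the code above, is to let foreach do the decoding: iterating a char array with a dchar loop variable yields one code point per iteration. The function name countCodePoints is made up for this example, and the sketch has only been reasoned through for D2, not checked against a D1 compiler.

```d
import std.stdio : writeln;

/// Counts code points by letting foreach decode the UTF-8 input
/// one dchar at a time. Invalid UTF-8 makes the decoding throw.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    writeln(countCodePoints("不良反應事件和產品客訴報告")); // 13
    writeln(countCodePoints("abc"));                        // 3
}
```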
Aug 14 2013
prev sibling parent reply "Jeremy DeHaan" <dehaan.jeremiah gmail.com> writes:
On Wednesday, 14 August 2013 at 02:53:43 UTC, jicman wrote:
 this program returns 39.  I expected to return 13.  How do I 
 know the exact length of the characters that I have in a char[] 
 variable?  Thanks.
What version of DMD are you using? This code doesn't even compile for me. It gives me errors about not being able to convert type string to char[], as it should, since a string literal is immutable data. To test the code I changed char[] to string. I also got an error for "writefln(str.length);" so I just changed that to "writeln(str.length);"

Anyway, from what I understand, the reason you get this is that each of those characters needs more than a single 8-bit representation. D's chars are UTF-8 code units, so it takes more than a single char to store the data needed to represent one of the Chinese characters. str.length will give you the length of the string with respect to each char it contains. You have 13 characters in your string, but you need 39 chars to store the data to represent them.

Alternatively, you can use a different encoding to see the actual number of characters in your string, e.g. wstring or dstring. I usually use dstrings when working with Unicode personally.
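As a rough D2 illustration of how the length depends on the encoding, the sketch below converts the same text to wstring and dstring with std.conv.to and compares the lengths. The helper names are hypothetical. One caveat worth showing: wstring is UTF-16, so its length only matches the character count while every character fits in a single wchar.

```d
import std.conv : to;
import std.stdio : writeln;

/// Number of code units the text occupies in each UTF encoding.
size_t utf8Units(string s)  { return s.length; }
size_t utf16Units(string s) { return s.to!wstring.length; }
size_t utf32Units(string s) { return s.to!dstring.length; }

void main()
{
    string cjk = "不良反應事件和產品客訴報告";
    writeln(utf8Units(cjk), " ", utf16Units(cjk), " ", utf32Units(cjk));    // 39 13 13

    // U+20000 lies outside the Basic Multilingual Plane, so UTF-16
    // needs a surrogate pair (two wchars) to store it.
    string rare = "\U00020000";
    writeln(utf8Units(rare), " ", utf16Units(rare), " ", utf32Units(rare)); // 4 2 1
}
```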
Aug 13 2013
parent "jicman" <cabrera wrc.xerox.com> writes:
On Wednesday, 14 August 2013 at 03:16:08 UTC, Jeremy DeHaan wrote:
 What version of DMD are you using? This code doesn't even compile for me.
This is D1. Forgot to mention that. I am still in the old ages. :-) Thanks for the insight. I figured that much, but now I need to go and figure out what to do with Western character sets as well as Asian, Hebrew, etc. Thanks.
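For mixed Western, Hebrew and Asian text, the number of bytes per character varies within a single string. The D2 sketch below (not verified on D1) prints the UTF-8 width of each code point using std.utf.codeLength; the helper name utf8Width and the sample string are made up for illustration.

```d
import std.stdio : writefln;
import std.utf : codeLength;

/// Number of UTF-8 code units needed to encode one code point.
size_t utf8Width(dchar c)
{
    return codeLength!char(c);
}

void main()
{
    // Latin 'a', Latin 'é' (U+00E9), Hebrew shin (U+05E9), CJK U+5831.
    string text = "a\u00E9\u05E9\u5831";

    foreach (dchar c; text)
        writefln("U+%04X takes %s UTF-8 code unit(s)", cast(uint) c, utf8Width(c));
}
```

Because the widths differ (1, 2, 2 and 3 here), indexing or slicing a char[] by character position needs decoding, while counting can be done with any of the approaches shown earlier in the thread.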
Aug 13 2013