digitalmars.D.learn - length of string result not as expected


"jicman" <cabrera wrc.xerox.com> writes:

Greetings.

import std.stdio;

void main()
{
   char[] str = "不良反應事件和產品客訴報告"; // 13 Chinese characters...
   writefln(str.length);
}

This program prints 39. I expected it to print 13. How do I find 
the number of characters, rather than bytes, in a char[] 
variable? Thanks.

josé

Aug 13 2013

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Wednesday, August 14, 2013 04:53:34 jicman wrote:
 Greetings.
 
 import std.stdio;
 
 void main()
 {
 char[] str = "不良反應事件和產品客訴報告"; // 13 Chinese characters...
 writefln(str.length);
 }
 
 this program returns 39. I expected to return 13. How do I know
 the exact length of the characters that I have in a char[]
 variable? Thanks.

length gives you the length of the array, which is 39, because it contains 39 
chars. If you want to know the number of code points in the string as opposed 
to the number of code units (char is a UTF-8 code unit), then use 
std.range.walkLength. e.g.

writeln(walkLength(str));

It'll iterate through the string and count up the number of code points.
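
For reference, a minimal complete D2 program contrasting the two counts might look like the sketch below. It assumes a D2 compiler, so the variable is declared as string rather than char[] and writeln is used instead of writefln.

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // D2: string literals are immutable, so use string rather than char[].
    string str = "不良反應事件和產品客訴報告";

    writeln(str.length);      // 39: UTF-8 code units (bytes)
    writeln(walkLength(str)); // 13: code points
}
```

walkLength has to decode the string from start to end, so it is O(n), while .length is O(1).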

- Jonathan M Davis

Aug 13 2013

"jicman" <cabrera wrc.xerox.com> writes:

On Wednesday, 14 August 2013 at 03:00:00 UTC, Jonathan M Davis 
wrote:
 On Wednesday, August 14, 2013 04:53:34 jicman wrote:
 Greetings.
 
 import std.stdio;
 
 void main()
 {
 char[] str = "不良反應事件和產品客訴報告"; // 13 Chinese characters...
 writefln(str.length);
 }
 
 this program returns 39. I expected to return 13. How do I know
 the exact length of the characters that I have in a char[]
 variable? Thanks.

 length gives you the length of the array, which is 39, because 
 it contains 39
 chars. If you want to know the number of code points in the 
 string as opposed
 to the number of code units (char is a UTF-8 code unit), then 
 use
 std.range.walkLength. e.g.

 writeln(walkLength(str));

 It'll iterate through the string and count up the number of 
 code points.

 - Jonathan M Davis

Thanks, Jonathan. That looks like D2, since D1 does not have 
std.range in its Phobos library.

Aug 13 2013

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Wednesday, 14 August 2013 at 02:53:43 UTC, jicman wrote:
 know the exact length of the characters that I have in a char[] 
 variable?  Thanks.

Your code looks like D1...

in D1 or D2:
import std.utf; // toUTF32 is declared in std.utf, not std.uni
dstring s2 = toUTF32(str);
writeln(s2.length); // 13


in D2 you can do it a little more efficiently like this:

import std.range;
writeln(walkLength(str)); // 13



The reason it shows 39 instead of 13 is that the char[] is UTF-8, 
and Chinese characters are multi-byte characters in UTF-8. The 
.length property gives the number of elements in the array, which 
are UTF-8 code units (bytes).

dstring uses UTF-32, which has a consistent size for each code 
point. Which isn't technically quite the same as a character 
actually, but close enough that it works here.
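
To illustrate the gap between code points and characters, here is a small D2 sketch. It is an assumption-laden example rather than anything from the thread: it uses std.uni.byGrapheme, which only exists in newer D2 releases of Phobos, and a sample string chosen for illustration (an "e" followed by a combining accent).

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
    string s = "e\u0301";

    writeln(s.length);                // 3: UTF-8 code units
    writeln(s.walkLength);            // 2: code points
    writeln(s.byGrapheme.walkLength); // 1: grapheme cluster
}
```

For the Chinese string in the question the code point count and the grapheme count are both 13, which is why counting code points is close enough there.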


Bottom line though, char[] for non-English text tends to have a 
longer length than you expect because a lot of characters are 
multi-byte in UTF-8. If you use dstring, the length is the number 
of code points, so it is more consistent.

Aug 13 2013

Jacob Carlborg <doob me.com> writes:

On 2013-08-14 05:05, Adam D. Ruppe wrote:

 Your code looks like D1...

 in D1 or D2:
 import std.utf;
 dstring s2 = toUTF32(str);
 writeln(s2.length); // 13


 in D2 you can do it a little more efficiently like this:

 import std.range;
 writeln(walkLength(str)); // 13

In D1 you can easily implement walkLength yourself:

import std.utf;

size_t walkLength (C) (C[] arr)
{
     size_t i;
     size_t len;

     while (i < arr.length)
     {
         i += arr.stride(i);
         len++;
     }

     return len;
}

void main ()
{
     auto a = "不良反應事件和產品客訴報告";
     assert(walkLength(a) == 13);
}
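
Another way to get the same count, sketched here as an alternative and not as part of the post above, is to let foreach do the decoding: iterating a UTF-8 array with a dchar loop variable yields one code point per iteration. The helper name countCodePoints is made up for this example, and the signature is written for D2; in D1 the parameter would presumably be plain char[].

```d
import std.stdio : writeln;

// Hypothetical helper: counts code points by letting foreach
// decode the UTF-8 code units into dchar values.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    writeln(countCodePoints("不良反應事件和產品客訴報告")); // 13
    writeln(countCodePoints("jos\u00E9"));                 // 4
}
```

This avoids any library dependency, at the cost of decoding every code point even though only the count is needed.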

-- 
/Jacob Carlborg

Aug 14 2013

"Jeremy DeHaan" <dehaan.jeremiah gmail.com> writes:

On Wednesday, 14 August 2013 at 02:53:43 UTC, jicman wrote:
 Greetings.

 import std.stdio;

 void main()
 {
   char[] str = "不良反應事件和產品客訴報告"; // 13 Chinese characters...
   writefln(str.length);
 }

 this program returns 39.  I expected to return 13.  How do I 
 know the exact length of the characters that I have in a char[] 
 variable?  Thanks.

 josé

What version of DMD are you using? This code doesn't even compile 
for me. It gives me errors about not being able to convert type 
string to char[], like it should since a string literal is 
immutable data. To test the code I changed char[] to string. I 
also got an error for "writefln(str.length);" so I just changed 
that to "writeln(str.length);"

Anyways, from what I understand, the reason you get this is 
because each of those characters needs more than a single 8-bit 
code unit. D's chars are UTF-8, so that means it takes more 
than a single char to store the data needed to represent one of 
the Chinese characters. str.length will give you the length of 
the string with respect to each char it contains. You have 13 
characters in your string, but you need 39 chars to store the 
data to represent them.

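Alternatively, you can use a different encoding to see the 
actual number of characters in your string, e.g. wstring or 
dstring. Note that wstring is UTF-16, so its length only matches 
for characters that fit in a single 16-bit code unit, as these 
do. I usually use dstrings when working with Unicode personally.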
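
As a rough D2 sketch of how the three string types compare, the example below converts with std.conv.to. The second string, a single character outside the Basic Multilingual Plane, is an assumption added for illustration and does not come from the thread.

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    string s = "不良反應事件和產品客訴報告";
    writeln(s.length);             // 39: UTF-8 code units
    writeln(to!wstring(s).length); // 13: UTF-16 code units
    writeln(to!dstring(s).length); // 13: UTF-32 code units

    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
    string clef = "\U0001D11E";
    writeln(clef.length);             // 4
    writeln(to!wstring(clef).length); // 2: a surrogate pair
    writeln(to!dstring(clef).length); // 1
}
```

Only dstring gives one element per code point in every case, which is the trade-off for using four bytes per element.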

Aug 13 2013

"jicman" <cabrera wrc.xerox.com> writes:

On Wednesday, 14 August 2013 at 03:16:08 UTC, Jeremy DeHaan wrote:
 On Wednesday, 14 August 2013 at 02:53:43 UTC, jicman wrote:
 Greetings.

 import std.stdio;

 void main()
 {
  char[] str = "不良反應事件和產品客訴報告"; // 13 Chinese characters...
  writefln(str.length);
 }

 this program returns 39.  I expected to return 13.  How do I 
 know the exact length of the characters that I have in a 
 char[] variable?  Thanks.

 josé

 What version of DMD are you using? This code doesn't even 
 compile for me. It gives me errors about not being able to 
 convert type string to char[], like it should since a string 
 literal is immutable data. To test the code I changed char[] to 
 string. I also got an error for "writefln(str.length);" so I 
 just changed that to "writeln(str.length);"

 Anyways, from what I understand, the reason you get this is 
 because each of those characters needs more than a single 
 8-bit code unit. D's chars are UTF-8, so that means it 
 takes more than a single char to store the data needed to 
 represent one of the chinese characters. str.length will give 
 you the length of the string with respect to each char it 
 contains. You have 13 characters in your string, but you need 
 39 chars to store the data to represent them.

 Alternatively,  you can use a different encoding to see the 
 actual number of characters in your string, eg. wstring or 
 dstring. I usually use dstrings when working with unicode 
 personally.

This is D1. Forgot to mention that. I am still in the old ages. 
:-) Thanks for the insight. I figured that much, but I need to 
now go and try to figure out what to do with both Western 
character sets as well as Asian, Hebrew, etc. Thanks.

Aug 13 2013
