digitalmars.D.learn - The length of strings vs. # of chars vs. sizeof


Charles Hixson <charleshixsn earthlink.net> writes:

I've read and re-read the documentation, but I can't decide whether a 
UTF-8 character that takes multiple bytes to express counts as one or 
multiple values in length and sizeof.  Sizeof seems to presume that all 
entries are the same length, but otherwise it seems to be the property I 
need.  (I suppose that I could just enter a string that I know is 
multi-byte chars, but it sure would be better if I could find out from 
the documentation.)  I'm pretty certain that it just counts as one 
character for indexing, so length would almost need to also count the 
number of characters rather than bytes.

Sizeof *should* be the correct property, and I've been assuming that it 
is, but I'm a bit afraid that I'll run across some unexpected character 
and it won't act the way I think it should.  And the documentation reads 
ambiguously.

Does anyone just *know* the answer?  (And if so, could they make the 
documentation explicit?)

Nov 01 2009

Rainer Deyke <rainerd eldwood.com> writes:

Charles Hixson wrote:
 I've read and re-read the documentation, but I can't decide whether a
 UTF-8 character that takes multiple bytes to express counts as one or
 multiple values in length and sizeof.  Sizeof seems to presume that all
 entries are the same length, but otherwise it seems to be the property I
 need.  (I suppose that I could just enter a string that I know is
 multi-byte chars, but it sure would be better if I could find out from
 the documentation.)  I'm pretty certain that it just counts as one
 character for indexing, so length would almost need to also count the
 number of characters rather than bytes.

Strings are just arrays of code units.  Their length is the number of
elements (i.e. code units) they contain, just like other arrays.  A code
point may comprise multiple code units, and a logical character may
comprise multiple code points.  The latter is true even with dchar/utf-32.
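
A minimal sketch of the point above, assuming a current D compiler and Phobos. The literals and variable names are illustrative only; `\u00E9` is the precomposed "é" and `\u0301` is the combining acute accent.

```d
import std.stdio;

void main()
{
    // The same text in the three string types. Only the code unit size differs.
    string  s = "h\u00E9llo";   // UTF-8:  é takes 2 code units
    wstring w = "h\u00E9llo"w;  // UTF-16: é takes 1 code unit
    dstring d = "h\u00E9llo"d;  // UTF-32: é takes 1 code unit

    writeln(s.length); // 6 code units (bytes)
    writeln(w.length); // 5 code units
    writeln(d.length); // 5 code units (here also 5 code points)

    // Decomposed form: 'e' followed by a combining accent.
    // One logical character, but two code points even in UTF-32.
    dstring e = "e\u0301"d;
    writeln(e.length); // 2
}
```

In every case `.length` counts array elements (code units), never logical characters.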


-- 
Rainer Deyke - rainerd eldwood.com

Nov 01 2009

Jérôme M. Berger <jeberger free.fr> writes:

Rainer Deyke wrote:
 Charles Hixson wrote:
 I've read and re-read the documentation, but I can't decide whether a
 UTF-8 character that takes multiple bytes to express counts as one or
 multiple values in length and sizeof.  Sizeof seems to presume that all
 entries are the same length, but otherwise it seems to be the property I
 need.  (I suppose that I could just enter a string that I know is
 multi-byte chars, but it sure would be better if I could find out from
 the documentation.)  I'm pretty certain that it just counts as one
 character for indexing, so length would almost need to also count the
 number of characters rather than bytes.

 Strings are just arrays of code units.  Their length is the number of
 elements (i.e. code units) they contain, just like other arrays.  A code
 point may comprise multiple code units, and a logical character may
 comprise multiple code points.  The latter is true even with dchar/utf-32.

	So, in UTF-8, length is the number of bytes in the string and
sizeof is 8 (on 32-bit systems).

		Jerome
-- 
mailto:jeberger free.fr
http://jeberger.free.fr
Jabber: jeberger jabber.fr

Nov 01 2009

Rainer Deyke <rainerd eldwood.com> writes:

Jérôme M. Berger wrote:
 Rainer Deyke wrote:
 Strings are just arrays of code units.  Their length is the number of
 elements (i.e. code units) they contain, just like other arrays.  A code
 point may comprise multiple code units, and a logical character may
 comprise multiple code points.  The latter is true even with
 dchar/utf-32.

     So, in UTF-8, length is the number of bytes in the string and sizeof
 is 8 (on 32-bit systems).

Yes.
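
A short sketch of the distinction, assuming a current D compiler. The variable names are illustrative. The point is that `.sizeof` on a dynamic array measures the slice itself (a length plus a pointer), not the text it refers to.

```d
import std.stdio;

void main()
{
    string s = "h\u00E9llo";

    // A dynamic array is a (length, pointer) pair, so its .sizeof is fixed:
    // 8 on 32-bit targets, 16 on 64-bit targets, whatever the contents.
    static assert(string.sizeof == 2 * size_t.sizeof);
    writeln(s.sizeof);

    // The storage used by the text is length times the code unit size.
    writeln(s.length * char.sizeof);   // 6 bytes of UTF-8

    wstring w = "h\u00E9llo"w;
    writeln(w.length * wchar.sizeof);  // 10 bytes of UTF-16

    // Only for static arrays does .sizeof cover the elements themselves.
    char[5] fixed;
    writeln(fixed.sizeof);             // 5
}
```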


-- 
Rainer Deyke - rainerd eldwood.com

Nov 01 2009

Jesse Phillips <jessekphillips gmail.com> writes:

On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:

 Does anyone just *know* the answer?  (And if so, could they make the
 documentation explicit?)

I believe the documentation you are looking for is:

http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

It is more about understanding UTF than it is about learning strings.

Nov 01 2009

Rainer Deyke <rainerd eldwood.com> writes:

Jesse Phillips wrote:
 I believe the documentation you are looking for is:
 
 http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
 
 It is more about understanding UTF than it is about learning strings.

One thing that page fails to mention is that D has no awareness of
anything higher-level than code points.  In particular:
  - dchar contains a code point, not a logical character.
  - D has no awareness of canonical forms and precomposed/decomposed
characters (at the language level).  (Some characters can be represented
as either one or two code points.  D does not know that these are
supposed to represent the same character.)
  - Although D stops you from outputting an incomplete code point, it
does not stop you from outputting an incomplete logical character.

Also, some D library functions only work on the ASCII subset of utf-8.
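
To illustrate the precomposed/decomposed point, here is a sketch that assumes a current Phobos. `std.uni.normalize` and `std.uni.byGrapheme` were added to the library well after this thread, so treat them as an assumption about your toolchain; the language itself still compares strings code unit by code unit.

```d
import std.range : walkLength;
import std.stdio;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    string composed   = "\u00E9";   // é as one code point
    string decomposed = "e\u0301";  // e + combining acute accent

    // Plain comparison is by code unit, so the two forms differ.
    writeln(composed == decomposed);                 // false

    // Normalizing to the same form makes them comparable.
    writeln(normalize!NFC(decomposed) == composed);  // true

    writeln(decomposed.length);                      // 3 code units
    writeln(decomposed.walkLength);                  // 2 code points
    writeln(decomposed.byGrapheme.walkLength);       // 1 logical character
}
```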


-- 
Rainer Deyke - rainerd eldwood.com

Nov 01 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

Rainer Deyke wrote:
 Jesse Phillips wrote:
 I believe the documentation you are looking for is:

 http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

 It is more about understanding UTF than it is about learning strings.

 
 One thing that page fails to mention is that D has no awareness of
 anything higher-level than code points.  In particular:
   - dchar contains a code point, not a logical character.
   - D has no awareness of canonical forms and precomposed/decomposed
 characters (at the language level).  (Some characters can be represented
 as either one or two code points.  D does not know that these are
 supposed to represent the same character.)
   - Although D stops you from outputting an incomplete code point, it
 does not stop you from outputting an incomplete logical character.
 
 Also, some D library functions only work on the ASCII subset of utf-8.

Well, it *is* on a Wiki.

Nov 01 2009

Charles Hixson <charleshixsn earthlink.net> writes:

Jesse Phillips wrote:
 On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:

 Does anyone just *know* the answer?  (And if so, could they make the
 documentation explicit?)

 I believe the documentation you are looking for is:

 http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

 It is more about understanding UTF than it is about learning strings.

Thanks, that does appear to be the answer.

So if a string is too long, and I shorten it by one character, I'd 
better test it with std.utf.validate(str).  If it doesn't throw an 
error, it's ok.  Otherwise shorten it again and retry.

I hope I understood this correctly.  (I'm sure there's a more elegant 
way to do this, but here I'm going for a simple approach, as I should 
rarely be encountering this problem.)
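
The shorten-and-retry idea could be sketched like this, assuming `std.utf.validate` throws `UTFException` on a sequence cut off at the end of the string. `dropLast` is a hypothetical helper name, not a library function.

```d
import std.stdio;
import std.utf : UTFException, validate;

/// Removes code units from the end of `s` until what remains is valid
/// UTF-8 again, which drops exactly one code point from a valid input.
string dropLast(string s)
{
    while (s.length > 0)
    {
        s = s[0 .. $ - 1];       // drop one code unit
        try
        {
            validate(s);         // throws if the tail is incomplete
            return s;
        }
        catch (UTFException)
        {
            // The cut fell inside a multi-byte sequence; shorten again.
        }
    }
    return s;
}

void main()
{
    writeln(dropLast("ab").length);       // 1
    writeln(dropLast("h\u00E9").length);  // 1, both bytes of é are removed
}
```

This revalidates the whole string on every retry, which is fine for occasional use but not for long strings in a loop.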

Nov 02 2009

rmcguire <rjmcguire gmail.com> writes:

Charles Hixson <charleshixsn earthlink.net> wrote:
 
 Jesse Phillips wrote:
 On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:

 Does anyone just *know* the answer?  (And if so, could they make the
 documentation explicit?)

 I believe the documentation you are looking for is:

 http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

 It is more about understanding UTF than it is about learning strings.

 Thanks, that does appear to be the answer.
 
 So if a string is too long, and I shorten it by one character, I'd 
 better test it with std.utf.validate(str).  If it doesn't throw an 
 error, it's ok.  Otherwise shorten it again and retry.
 
 I hope I understood this correctly.  (I'm sure there's a more elegant 
 way to do this, but here I'm going for a simple approach, as I should 
 rarely be encountering this problem.)
 
 

As far as I know, if you want to shorten a utf8 string you just check the 
first bit of the last byte to see if it's 0. If it's 0, go back further 
until you find a byte that starts with 1, and then remove that byte too.

All characters start with a byte that starts with 1; the number of 1s in 
the first byte of the character tells you how many bytes are in the character.

Hope that helps, but you should find a library that already has a 
"shorten my string" function.

-Rory

Nov 03 2009

Bill Baxter <wbaxter gmail.com> writes:

On Tue, Nov 3, 2009 at 2:47 AM, rmcguire <rjmcguire gmail.com> wrote:
 Charles Hixson <charleshixsn earthlink.net> wrote:

 Jesse Phillips wrote:
 On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:

 Does anyone just *know* the answer?  (And if so, could they make the
 documentation explicit?)

 I believe the documentation you are looking for is:

 http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

 It is more about understanding UTF than it is about learning strings.

 Thanks, that does appear to be the answer.

 So if a string is too long, and I shorten it by one character, I'd
 better test it with std.utf.validate(str).  If it doesn't throw an
 error, it's ok.  Otherwise shorten it again and retry.

 I hope I understood this correctly.  (I'm sure there's a more elegant
 way to do this, but here I'm going for a simple approach, as I should
 rarely be encountering this problem.)

 As far as I know if you want to shorten a utf8 string you just check the
 first bit of the last byte to see if its 0. If its 0 go back further
 until you find a byte that starts with 1, and then remove that byte too.

 All characters start with a byte that starts with 1, the number of 1s in
 the first byte of the character tell you how many bytes in the character.

 Hope that helps, but you should find a library that already has a
 "shorten my string" function.

It's explained well in Andrei's book.
0* -- single byte character
11* -- first byte of multi-byte char
10* -- subsequent byte of multi-byte char
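
Using those three bit patterns, a truncation routine could look like the sketch below. It assumes the input is already valid UTF-8; `isContinuation` and `truncateUtf8` are hypothetical names chosen for illustration, not Phobos functions.

```d
import std.stdio;

/// True if `c` is a subsequent byte of a multi-byte character (10xxxxxx).
bool isContinuation(char c)
{
    return (c & 0xC0) == 0x80;
}

/// Returns the longest prefix of `s` that is at most `maxUnits` code units
/// long and does not end in the middle of a character.
string truncateUtf8(string s, size_t maxUnits)
{
    if (s.length <= maxUnits)
        return s;

    size_t end = maxUnits;
    // s[end] is the first byte that would be dropped. If it is a
    // continuation byte, the cut falls inside a character, so back up
    // until the cut sits just before a 0* or 11* byte.
    while (end > 0 && isContinuation(s[end]))
        --end;
    return s[0 .. end];
}

void main()
{
    string s = "h\u00E9llo";             // bytes: 68 C3 A9 6C 6C 6F
    writeln(truncateUtf8(s, 2).length);  // 1, a cut at 2 would split é
    writeln(truncateUtf8(s, 3).length);  // 3
}
```

This only protects code points. A combining sequence such as e + U+0301 can still be split.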

--bb

Nov 03 2009

rmcguire <rjmcguire gmail.com> writes:

Bill Baxter <wbaxter gmail.com> wrote:
 
 On Tue, Nov 3, 2009 at 2:47 AM, rmcguire <rjmcguire gmail.com> wrote:
 Charles Hixson <charleshixsn earthlink.net> wrote:

 Jesse Phillips wrote:
 On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:

 Does anyone just *know* the answer?  (And if so, could they make the
 documentation explicit?)

 I believe the documentation you are looking for is:

 http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

 It is more about understanding UTF than it is about learning strings.

 Thanks, that does appear to be the answer.

 So if a string is too long, and I shorten it by one character, I'd
 better test it with std.utf.validate(str).  If it doesn't throw an
 error, it's ok.  Otherwise shorten it again and retry.

 I hope I understood this correctly.  (I'm sure there's a more elegant
 way to do this, but here I'm going for a simple approach, as I should
 rarely be encountering this problem.)

 As far as I know if you want to shorten a utf8 string you just check the
 first bit of the last byte to see if its 0. If its 0 go back further
 until you find a byte that starts with 1, and then remove that byte too.

 All characters start with a byte that starts with 1, the number of 1s in
 the first byte of the character tell you how many bytes in the character.

 Hope that helps, but you should find a library that already has a
 "shorten my string" function.

 
 It's explained well in Andrei's book.
 0* -- single byte character
 11* -- first byte of multi-byte char
 10* -- subsequent byte of multi-byte char
 
 --bb
 

:) forgot about that, it's been a while since I played with utf8.

made a Hessian serializer in C.

-Rory

Nov 03 2009

Charles Hixson <charleshixsn earthlink.net> writes:

Bill Baxter wrote:
 On Tue, Nov 3, 2009 at 2:47 AM, rmcguire <rjmcguire gmail.com> wrote:
 Charles Hixson <charleshixsn earthlink.net> wrote:

 Jesse Phillips wrote:
 On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:

 Does anyone just *know* the answer?  (And if so, could they make the
 documentation explicit?)

 I believe the documentation you are looking for is:

 http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

 It is more about understanding UTF than it is about learning strings.

 Thanks, that does appear to be the answer.

 So if a string is too long, and I shorten it by one character, I'd
 better test it with std.utf.validate(str).  If it doesn't throw an
 error, it's ok.  Otherwise shorten it again and retry.

 I hope I understood this correctly.  (I'm sure there's a more elegant
 way to do this, but here I'm going for a simple approach, as I should
 rarely be encountering this problem.)

 As far as I know if you want to shorten a utf8 string you just check the
 first bit of the last byte to see if its 0. If its 0 go back further
 until you find a byte that starts with 1, and then remove that byte too.

 All characters start with a byte that starts with 1, the number of 1s in
 the first byte of the character tell you how many bytes in the character.

 Hope that helps, but you should find a library that already has a
 "shorten my string" function.

 It's explained well in Andrei's book.
 0* -- single byte character
 11* -- first byte of multi-byte char
 10* -- subsequent byte of multi-byte char

 --bb

Thanks.  That's a much better answer.

Nov 03 2009
