digitalmars.D.announce - On the meaning of string.length

"Adam D. Ruppe" <destructionator gmail.com> writes:

string.length returns the value it does with some rationale 
defending code units instead of "characters" - basically, I typed 
up a defense of D's string-as-array behavior.

To my surprise, my answer got an enormous number of votes* so I 
decided to post it to reddit too.

http://www.reddit.com/r/programming/comments/2mqghp/why_does_stringlength_count_code_units_instead_of/

It's really encouraging to me that there's been such a 
positive response. The question comes up here every so often too, 
with people saying string.length should give the number of 
characters, and of course the automatic UTF decoding done in 
Phobos comes up from time to time.

It looks like D, the language, made the right decisions here.

This reddit comment applies to the phobos thing though:

"Most people like to pick on surrogate pairs here, and decry 
languages which don't handle them "properly", but I think it's 
important to point out that handling surrogate pairs as a single 
character doesn't in any way fix the underlying issue -- many 
multiple-codepoint sequences are a single logical glyph even if 
you use 32 bit wide chars."
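
To make the quoted point concrete, here is a small sketch that counts code units per UTF encoding. It is in Python purely for illustration; the sample strings and the `count_units` helper are assumptions made for this example and come from neither D nor the reddit thread.

```python
def count_units(text: str) -> dict:
    """Return how many code units text needs in each UTF encoding."""
    return {
        "utf8": len(text.encode("utf-8")),            # 1-byte code units
        "utf16": len(text.encode("utf-16-le")) // 2,  # 2-byte code units
        "utf32": len(text.encode("utf-32-le")) // 4,  # 4-byte units, one per code point
    }

samples = {
    "ascii": "abc",
    "surrogate pair": "\U0001F600",  # one code point outside the BMP
    "combining": "e\u0301",          # 'e' + COMBINING ACUTE ACCENT, one printed glyph
}

for name, text in samples.items():
    print(name, count_units(text))
```

The surrogate-pair sample needs two UTF-16 code units but only one UTF-32 unit, while the combining sample still needs two units in UTF-32, so wider chars alone don't make "length" match what is printed.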


I know this has been said a lot of times... but I think the auto 
decoding in phobos was and is a mistake. The bigger question is 
what I posited on stackoverflow: "Moreover, what's the point? Why 
does these metrics matter?" Similarly with std.algorithm on 
strings: why would you ever want to call sort on a string? Well, 
I can think of a few reasons, like checking the frequency of 
letters, but I think we should see what happens if Phobos changes 
from autodecoding to a compile error wherever it would occur. Then 
we can fix it by casting to .representation or whatever to work 
with code units, or by manually adding a .utfDecode to work with 
dchars, and make the decision explicitly.

That'd offer a way forward and I suspect would break less code 
than we might think.
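
A rough sketch of what "making the decision explicitly" could look like, written in Python rather than D so it stands on its own. The helpers `code_units` and `code_points` are made-up names for this example: the first loosely plays the role of .representation, the second that of the hypothetical .utfDecode mentioned above. Neither is a Phobos API.

```python
from collections import Counter

def code_units(text: str) -> bytes:
    """Explicit code-unit view: the raw UTF-8 bytes of the text."""
    return text.encode("utf-8")

def code_points(text: str) -> list:
    """Explicit decoded view: one list element per Unicode code point."""
    return list(text)

word = "\u00e9t\u00e9"  # "été", written with precomposed U+00E9

units = code_units(word)
points = code_points(word)

print(len(units))                      # 5 code units (UTF-8 bytes)
print(len(points))                     # 3 code points
print(Counter(points).most_common(1))  # letter frequency over code points

# Sorting the code units blindly breaks the encoding ...
try:
    bytes(sorted(units)).decode("utf-8")
    sorted_units_valid = True
except UnicodeDecodeError:
    sorted_units_valid = False
print(sorted_units_valid)              # False

# ... while sorting code points keeps each one intact.
print("".join(sorted(points)))
```

Either view can be the right one depending on the task; the point of the sketch is only that the caller picks one by name instead of getting decoding implicitly.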


* stack overflow votes are a silly thing, a somewhat easy answer 
like this gets a bazillion whereas difficult questions with 
difficult answers get me one, maybe two votes. oh well.
Nov 19 2014
"Upvoter" <Upvoter nowhere.fr> writes:
One more upvote. I agree when you say auto decoding is a good choice. Additionally, it allows good compatibility with the Linux API, as opposed to the Windows API, since the Unicode versions of the Windows functions take WideChars as string parameters (always two bytes). And finally, for someone like me who writes software for his own use, UTF-8 doesn't change anything, since I'm French and every char fits in one byte...
Nov 19 2014
Walter Bright <newshound2 digitalmars.com> writes:
On 11/19/2014 7:06 AM, Upvoter wrote:
 On Wednesday, 19 November 2014 at 14:33:05 UTC, Adam D. Ruppe wrote:
  I think the auto decoding in phobos was and is a mistake.
 I agree when you say auto decoding is a good choice.
Uh-oh!
Nov 19 2014
Ary Borenszweig <ary esperanto.org.ar> writes:
On 11/19/14, 11:33 AM, Adam D. Ruppe wrote:

 returns the value it does with some rationale defending code units
 instead of "characters" - basically, I typed up a defense of D's
 string-as-array behavior.
In Ruby `length` returns the number of unicode characters and `bytesize` returns the number of bytes. I prefer this use of the names.
Nov 19 2014
"Adam D. Ruppe" <destructionator gmail.com> writes:
On Wednesday, 19 November 2014 at 21:00:50 UTC, Ary Borenszweig 
wrote:
 In Ruby `length` returns the number of unicode characters
What is a unicode character? Even in utf-32, one printed character might be made up of two unicode code points. Or sometimes, two printed characters might come from a single code point.
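
Both cases can be shown with a short sketch using Python's standard `unicodedata` module. The two sample strings are choices made for this illustration, not examples from the thread.

```python
import unicodedata

# One printed character made of two code points:
# 'e' followed by U+0301 COMBINING ACUTE ACCENT.
combined = "e\u0301"
print(len(combined))                                # 2
print(len(unicodedata.normalize("NFC", combined)))  # 1, because precomposed U+00E9 exists

# Two printed characters from a single code point:
# U+FB01 LATIN SMALL LIGATURE FI.
ligature = "\ufb01"
print(len(ligature))                                # 1
print(unicodedata.normalize("NFKD", ligature))      # fi
```

Normalization only collapses a combining sequence when a precomposed form happens to exist, so it is not a general way to count printed characters.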
Nov 20 2014
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
20-Nov-2014 16:50, Adam D. Ruppe wrote:
 On Wednesday, 19 November 2014 at 21:00:50 UTC, Ary Borenszweig wrote:
  In Ruby `length` returns the number of unicode characters
 What is a unicode character? Even in utf-32, one printed character might be made up of two unicode code points. Or sometimes, two printed characters might come from a single code point.
Perl goes for grapheme cluster as character. I'd say that's probably the closest thing to it. Sadly, being a systems language, we can't go so far as to create a per-process table of cached graphemes and then use the index in that table as the "character" ;)

--
Dmitry Olshansky
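
For the curious, the table idea might look roughly like the sketch below. It is Python, not D, and everything in it is hypothetical: `GraphemeTable` and `simple_clusters` are invented for the example, and the clustering rule (attach combining marks to the preceding code point) is a deliberately crude stand-in for real grapheme segmentation as defined by UAX #29.

```python
import unicodedata

def simple_clusters(text: str) -> list:
    """Crude clustering: glue combining marks onto the previous code point.

    Assumption for illustration only; real grapheme rules are more involved.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

class GraphemeTable:
    """Per-process table that maps each distinct cluster to a small index."""

    def __init__(self):
        self._index = {}    # cluster -> index
        self._entries = []  # index -> cluster

    def intern(self, cluster: str) -> int:
        idx = self._index.get(cluster)
        if idx is None:
            idx = len(self._entries)
            self._index[cluster] = idx
            self._entries.append(cluster)
        return idx

    def lookup(self, idx: int) -> str:
        return self._entries[idx]

table = GraphemeTable()
text = "e\u0301te\u0301"  # "été" spelled with combining accents
encoded = [table.intern(c) for c in simple_clusters(text)]
print(encoded)            # [0, 1, 0]
print(len(encoded))       # 3 "characters"
```

With such a table, the length of the index array would match the number of clusters, at the cost of shared mutable state and an extra lookup on every access.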
Nov 21 2014