digitalmars.D.learn - Sort characters in string

Fredrik Boulund (15/15) Dec 06 2017 Hi,

Jonathan M Davis (54/70) Dec 06 2017 Okay. "Narrow strings" are not considered random access and aren't

Ola Fosheim Grøstad (5/7) Dec 06 2017 I don't think the standard says that? Isn't this only because the

Jonathan M Davis (18/25) Dec 06 2017 It's most definitely the case right now, and given how Unicode decoding
Steven Schveighoffer (7/15) Dec 06 2017 The current unicode encoding has 2 million different code points. I'd

Patrick Schluter (7/26) Dec 07 2017 2,097,152 possible codepoints. As of [Unicode 10] only 136,690

Patrick Schluter (7/14) Dec 07 2017 No. Unicode uses only 21 bits and it is very unlikely to change

Fredrik Boulund (5/16) Dec 06 2017 Thanks so much for such an elaborate reply! I was really close
Patrick Schluter (4/6) Dec 07 2017 YDNRC, 1 - 4 code units for UTF-8. Unicode is defined only up to

Biotronic (18/22) Dec 06 2017 Yeah, narrow (non-UTF-32) strings are not random-access, since

Fredrik Boulund (5/15) Dec 06 2017 Also very useful information! Thanks. I was just realizing that
Dgame (5/28) Dec 06 2017 Or you simply do

Fredrik Boulund (4/8) Dec 06 2017 This is so strange. I was dead sure I tried that but it failed

Ali Çehreli (9/18) Dec 06 2017 As a general comment, sorting a string does not make sense in general

H. S. Teoh (13/28) Dec 06 2017 [...]

Mengu (3/12) Dec 07 2017 if you're like me, you probably forgot an import :)

Fredrik Boulund <fredrik.boulund gmail.com> writes:

Hi,

I'm having some trouble sorting the individual characters in a 
string. Searching around, I found this thread 
(http://forum.dlang.org/post/mailman.612.1331659665.4860.digitalmars-d-learn puremagic.com)
about a similar issue, but it feels quite old so I wanted
to check if there is a clear-cut way of sorting the characters of a string
nowadays?

I was expecting to be able to do something like this:

string word = "longword";
writeln(sort(word));

But that doesn't work because I guess a string is not the type of 
range required for sort?
I tried converting it to a char[] array (to!(char[])(word)) but 
that doesn't seem to work either. I get:

Error: template std.algorithm.sorting.sort cannot deduce function 
from argument types !()(char[])

Best regards,
Fredrik

Dec 06 2017

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Wednesday, December 06, 2017 08:59:09 Fredrik Boulund via Digitalmars-d-learn wrote:
 Hi,

 I'm having some trouble sorting the individual characters in a
 string. Searching around, I found this thread
 (http://forum.dlang.org/post/mailman.612.1331659665.4860.digitalmars-d-learn puremagic.com)
 about a similar issue, but it feels quite old so I
 wanted to check if there is a clear cut way of sorting the characters of
 a string nowadays?

 I was expecting to be able to do something like this:

 string word = "longword";
 writeln(sort(word));

 But that doesn't work because I guess a string is not the type of
 range required for sort?
 I tried converting it to a char[] array (to!(char[])(word)) but
 that doesn't seem to work either. I get:

 Error: template std.algorithm.sorting.sort cannot deduce function
 from argument types !()(char[])

Okay. "Narrow strings" are not considered random access and aren't
considered to have length by the range API. This is because a code unit in
UTF-8 or UTF-16 is not guaranteed to be a full code point (IIRC, 1 - 6 code
units for UTF-8 and 1 - 2 for UTF-16), and slicing them risks cutting into
the middle of a code point. UTF-32 on the other hand is guaranteed to have a
code unit be a full code point. So, arrays of char and wchar are not
considered random access or to have length by the range API, but arrays of
dchar are. As part of that, front and back return a dchar for all string
types, so front, popFront, back, and popBack do decoding of code units to
code points and never cut up code points. The idea was to make it so that
Unicode correctness would be guaranteed by default.
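A minimal sketch of what the paragraph above describes, assuming a Phobos release in which auto-decoding is still the default. It uses only the range traits from std.range.primitives; nothing here is specific to the original question.

```d
import std.range.primitives : ElementType, hasLength, isRandomAccessRange,
    walkLength;

// Narrow strings are not random-access ranges and have no range length.
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);
static assert(!isRandomAccessRange!(wchar[]));

// Arrays of dchar are.
static assert(isRandomAccessRange!(dchar[]));
static assert(hasLength!(dchar[]));

// front decodes, so the range element type is dchar for every string type.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));

void main()
{
    string s = "\u00E9";        // é: one code point, two UTF-8 code units
    assert(s.length == 2);      // the array length counts code units
    assert(s.walkLength == 1);  // range iteration counts code points
}
```

The static asserts are the interesting part: sort is rejected for string and char[] because those traits are false, not because of anything in the data.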

While this design was well-intended, it doesn't really work, and it was a
huge mistake. While a code point is something that can actually be displayed
on the screen, it's not guaranteed to be a full character (e.g. the letter a
with a subscript of 2 would be multiple code points). You need a grapheme
cluster to get a full character, and that potentially means multiple code
points. So, having ranges of characters operate at the code point level does
not guarantee Unicode correctness. You could still cut up a character (it
would just be at the code point level rather than the code unit level). For
full-on Unicode correctness by default, everything would have to operate at
the grapheme level, which would be horribly inefficient (especially for
ASCII).
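To make the three levels concrete, here is a small sketch that counts the same text as code units, code points, and graphemes. The sample string (a plus a combining acute accent) is an illustrative choice, not the subscript example mentioned above.

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'a' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "a\u0301";

    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // grapheme clusters
}
```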

In reality, whether an algorithm needs to operate at the code unit, code
point, or grapheme level depends on what it's doing (e.g. so long as the
Unicode is normalized properly, find can operate at the code unit level, but
if you want to actually sort a range of full Unicode characters properly,
you'd need to operate at the grapheme level). So, what the range API
_should_ do is just treat strings as ranges of their element type and let the
programmer wrap them as necessary to deal with code points or graphemes when
appropriate. And ultimately, the programmer just plain needs to know what
they're doing, and a solution that reasonably guarantees Unicode correctness
really doesn't exist (not if you care about efficiency anyway).
Unfortunately, changing the range API at this point would break a lot of
code, and no one has figured out how to do it without breaking code. So, it
currently seems likely that we'll forever be stuck with this design mistake.

So, the best way to work around it depends on what you're doing.
std.utf.byCodeUnit creates a wrapper range that turns all strings into
ranges of their actual element types rather than dchar. So, that's one
solution. Another is std.string.representation, which casts a string to the
corresponding integral type (e.g. typeof("foo".representation) is
immutable(ubyte)[]). And std.uni has stuff for better handling code points
or graphemes if need be.
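A short sketch of what those two workarounds hand back, again assuming current Phobos behaviour. The alias name Units is only for illustration.

```d
import std.range.primitives : ElementType, hasLength, isRandomAccessRange;
import std.string : representation;
import std.utf : byCodeUnit;

// representation reinterprets the code units as integers of the same width.
static assert(is(typeof("foo".representation) == immutable(ubyte)[]));
static assert(is(typeof("foo"w.representation) == immutable(ushort)[]));

// byCodeUnit wraps a narrow string in a random-access range of code units.
alias Units = typeof("foo".byCodeUnit);
static assert(isRandomAccessRange!Units);
static assert(hasLength!Units);
static assert(is(ElementType!Units == immutable char));

void main()
{
    // h, é (two UTF-8 code units), l, l, o
    string s = "h\u00E9llo";
    assert(s.representation.length == 6);
    assert(s.byCodeUnit.length == 6);
}
```

Both views work on code units, so neither is safe to sort unless the content is known to be ASCII.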

If you have a string, and you _know_ that it's only ASCII, then either use
representation or byCodeUnit to wrap it for the call to sort, but it _will_
have to be mutable, so string won't actually work. e.g.

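import std.algorithm.sorting : sort;
import std.string : representation;

char[] str = "hello world".dup; // mutable copy; sort works in place
sort(str.representation);       // sorts the underlying ubyte[] view
// str is now sorted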

However, if the string could actually contain Unicode, then you'll have to
use std.uni's grapheme support to wrap the string in a range that operates
at the grapheme level; otherwise you could rip pieces of characters apart.

- Jonathan M Davis

Dec 06 2017

Ola Fosheim Grøstad writes:

On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis 
wrote:
 UTF-32 on the other hand is guaranteed to have a code unit be a 
 full code point.

I don't think the standard says that? Isn't this only because the 
current set is small enough to fit? So this may change as Unicode 
grows?

Dec 06 2017

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Wednesday, December 06, 2017 09:34:48 Ola Fosheim Grøstad via Digitalmars-d-learn wrote:
 On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis wrote:
 UTF-32 on the other hand is guaranteed to have a code unit be a
 full code point.

 I don't think the standard says that? Isn't this only because the
 current set is small enough to fit? So this may change as Unicode
 grows?

It's most definitely the case right now, and given how Unicode decoding
works, I don't see how it could ever be the case that a UTF-32 code unit
would not be a code point - not without breaking all of the Unicode handling
in existence. And per wikipedia's short article on code points

----------------
The Unicode code space is divided into seventeen planes (the basic
multilingual plane, and 16 supplementary planes), each with 65,536 (= 2^16)
code points. Thus the total size of the Unicode code space is 17 × 65,536 =
1,114,112.
----------------

And uint.max is 4,294,967,295, leaving about 3855x space to grow into even
if they kept adding more code point values by adding more planes or however
that works.
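The arithmetic above can be checked at compile time. This is just a sketch; the enum names are made up for the illustration.

```d
import std.stdio : writeln;

// Size of the Unicode code space: 17 planes of 2^16 code points each.
enum planes = 17;
enum planeSize = 1 << 16;            // 65,536
enum codeSpace = planes * planeSize; // 1,114,112

static assert(codeSpace == 1_114_112);
static assert(codeSpace == 0x10FFFF + 1); // highest code point is U+10FFFF
static assert(dchar.max == 0x10FFFF);     // D's dchar has the same upper bound

// How many times the code space fits into a 32-bit code unit.
static assert(uint.max / codeSpace == 3855);

void main()
{
    writeln(uint.max / codeSpace); // 3855
}
```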

I'd have to go digging through the actual standard to know for sure what it
actually guarantees though.

- Jonathan M Davis

Dec 06 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 12/6/17 4:34 AM, Ola Fosheim Grøstad wrote:
 On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis wrote:
 UTF-32 on the other hand is guaranteed to have a code unit be a full 
 code point.

 
 I don't think the standard says that? Isn't this only because the 
 current set is small enough to fit? So this may change as Unicode grows?
 
 

The current unicode encoding has 2 million different code points. I'd 
say we'll all be dead and so will our great great great grandchildren by 
the time unicode amasses more than 2 billion codepoints :)

Also, UTF8 has been standardized to only have up to 4 code units per 
code point. The encoding scheme allows more, but the standard restricts it.

-Steve

Dec 06 2017

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Wednesday, 6 December 2017 at 15:12:22 UTC, Steven 
Schveighoffer wrote:
 On 12/6/17 4:34 AM, Ola Fosheim Grøstad wrote:
 On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M 
 Davis wrote:
 UTF-32 on the other hand is guaranteed to have a code unit be 
 a full code point.

 
 I don't think the standard says that? Isn't this only because 
 the current set is small enough to fit? So this may change as 
 Unicode grows?
 
 

 The current unicode encoding has 2 million different code 
 points.

2,097,152 possible codepoints. As of [Unicode 10] only 136,690 
codepoints have been assigned.

 I'd say we'll all be dead and so will our great great
 great grandchildren by the time unicode amasses more than 2 
 billion codepoints :)

So there's enough time even before the current range is even 
filled.

 Also, UTF8 has been standardized to only have up to 4 code 
 units per code point. The encoding scheme allows more, but the 
 standard restricts it.



[Unicode 10]: http://www.unicode.org/versions/Unicode10.0.0/

Dec 07 2017

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Wednesday, 6 December 2017 at 09:34:48 UTC, Ola Fosheim 
Grøstad wrote:
 On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis 
 wrote:
 UTF-32 on the other hand is guaranteed to have a code unit be 
 a full code point.

 I don't think the standard says that? Isn't this only because 
 the current set is small enough to fit? So this may change as 
 Unicode grows?

No. Unicode uses only 21 bits and it is very unlikely to change 
anytime soon as barely 17 are really used. This means the current 
range can be grown by more than 16 times what it is now. So 
definitely, one UTF-32 code unit is guaranteed to hold any 
codepoint, forever.

Dec 07 2017

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis 
wrote:

 If you have a string, and you _know_ that it's only ASCII, then 
 either use representation or byCodeUnit to wrap it for the call 
 to sort, but it _will_ have to be mutable, so string won't 
 actually work. e.g.

 char[] str = "hello world".dup;
 sort(str.representation);
 // str is now sorted

 However, if the string could actually contain Unicode, then 
 you'll have to use std.uni's grapheme support to wrap the 
 string in a range that operates at the grapheme level; 
 otherwise you could rip pieces of characters apart.

Thanks so much for such an elaborate reply! I was really close 
with str.representation in my code, and the little code example 
you gave helped me fix my problem! Thanks a lot

Dec 06 2017

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Wednesday, 6 December 2017 at 09:24:33 UTC, Jonathan M Davis 
wrote:
 a full code point (IIRC, 1 - 6 code units for UTF-8 and 1 - 2 
 for UTF-16),

YDNRC, 1 - 4 code units for UTF-8. Unicode is defined only up to 
U+10FFFF. Everything above is illegal.
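For anyone who wants to check the 1 - 4 range themselves, a small sketch using std.utf. The sample code points are arbitrary picks, one per encoded length.

```d
import std.utf : codeLength, encode, isValidDchar;

void main()
{
    // Number of UTF-8 code units needed per code point.
    assert(codeLength!char('a') == 1);
    assert(codeLength!char('\u00E9') == 2);     // é
    assert(codeLength!char('\u20AC') == 3);     // €
    assert(codeLength!char('\U0001F4A9') == 4); // 💩

    // UTF-16 needs at most two code units (a surrogate pair).
    assert(codeLength!wchar('\U0001F4A9') == 2);

    // encode writes into a 4-element buffer, which is all UTF-8 ever needs.
    char[4] buf;
    assert(encode(buf, '\U0001F4A9') == 4);

    // Values above U+10FFFF are not valid code points.
    assert(isValidDchar(cast(dchar) 0x10FFFF));
    assert(!isValidDchar(cast(dchar) 0x110000));
}
```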

Dec 07 2017

Biotronic <simen.kjaras gmail.com> writes:

On Wednesday, 6 December 2017 at 08:59:09 UTC, Fredrik Boulund 
wrote:
 string word = "longword";
 writeln(sort(word));

 But that doesn't work because I guess a string is not the type 
 of range required for sort?

Yeah, narrow (non-UTF-32) strings are not random-access, since 
characters like 💩 take up more than one code unit, and so "💩"[0] 
returns an invalid piece of a character instead of a full 
character.

In addition, sort does in-place sorting, so the input range is 
changed. Since D strings are immutable(char)[], changing the 
elements is disallowed. So in total, you'll need to convert from 
a string (immutable(char)[]) to a dchar[]. std.conv.to to the 
rescue:

     import std.stdio : writeln;
     import std.conv : to;
     import std.algorithm.sorting : sort;

     string word = "longword";
     writeln(sort(word.to!(dchar[]))); // dglnoorw

--
   Biotronic

Dec 06 2017

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Wednesday, 6 December 2017 at 09:25:20 UTC, Biotronic wrote:
 In addition, sort does in-place sorting, so the input range is 
 changed. Since D strings are immutable(char)[], changing the 
 elements is disallowed. So in total, you'll need to convert 
 from a string (immutable(char)[]) to a dchar[]. std.conv.to to 
 the rescue:

     import std.stdio : writeln;
     import std.conv : to;
     import std.algorithm.sorting : sort;

     string word = "longword";
     writeln(sort(word.to!(dchar[]))); // dglnoorw


Also very useful information! Thanks. I was just realizing that 
sort was in-place as I finished writing my first post. It got me 
really confused as I expected it to return a sorted array (but I 
do realize that is a strange assumption to make).

Dec 06 2017

Dgame <r.schuett.1987 gmail.com> writes:

On Wednesday, 6 December 2017 at 09:25:20 UTC, Biotronic wrote:
 On Wednesday, 6 December 2017 at 08:59:09 UTC, Fredrik Boulund 
 wrote:
 string word = "longword";
 writeln(sort(word));

 But that doesn't work because I guess a string is not the type 
 of range required for sort?

 Yeah, narrow (non-UTF-32) strings are not random-access, since 
 characters like 💩 take up more than one code unit, and so 
 "💩"[0] returns an invalid piece of a character instead of a  
 full character.

 In addition, sort does in-place sorting, so the input range is 
 changed. Since D strings are immutable(char)[], changing the 
 elements is disallowed. So in total, you'll need to convert 
 from a string (immutable(char)[]) to a dchar[]. std.conv.to to 
 the rescue:

     import std.stdio : writeln;
     import std.conv : to;
     import std.algorithm.sorting : sort;

     string word = "longword";
     writeln(sort(word.to!(dchar[]))); // dglnoorw

 --
   Biotronic

Or you simply do
----
writeln("longword".array.sort);
----
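Spelled out with the imports it needs, the one-liner looks roughly like the sketch below. The helper name sortedCodePoints is hypothetical. The key point is that .array on a narrow string decodes it into a mutable dchar[], which sort accepts.

```d
import std.algorithm.sorting : sort;
import std.array : array;
import std.conv : to;
import std.stdio : writeln;

// .array auto-decodes a narrow string into a freshly allocated dchar[].
static assert(is(typeof("longword".array) == dchar[]));

/// Hypothetical helper: returns the code points of s in sorted order.
string sortedCodePoints(string s)
{
    dchar[] points = s.array; // mutable copy, one element per code point
    sort(points);             // in-place sort on a random-access range
    return points.to!string;  // re-encode as UTF-8
}

void main()
{
    writeln(sortedCodePoints("longword")); // dglnoorw
}
```

This sorts code points, not user-perceived characters, so combining marks can end up attached to a different letter.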

Dec 06 2017

Fredrik Boulund <fredrik.boulund gmail.com> writes:

On Wednesday, 6 December 2017 at 10:42:31 UTC, Dgame wrote:

 Or you simply do
 ----
 writeln("longword".array.sort);
 ----

This is so strange. I was dead sure I tried that but it failed 
for some reason. But after trying it just now it also seems to 
work just fine. Thanks! :)

Dec 06 2017

Ali Çehreli <acehreli yahoo.com> writes:

On 12/06/2017 04:43 AM, Fredrik Boulund wrote:
 On Wednesday, 6 December 2017 at 10:42:31 UTC, Dgame wrote:

 Or you simply do
 ----
 writeln("longword".array.sort);
 ----

 This is so strange. I was dead sure I tried that but it failed for some
 reason. But after trying it just now it also seems to work just fine.
 Thanks! :)

As a general comment, sorting a string does not make sense in general 
when Unicode is involved. For example, there may be combining diacriticals:

     // Three code points: e, a, and combining acute accent U+0301
     writeln("ea\u0301".array.sort);

prints

aé

So, the accent moves from a to e, which probably is not the intention.

Ali

Dec 06 2017

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Dec 06, 2017 at 10:32:03AM -0800, Ali Çehreli via Digitalmars-d-learn wrote:
 On 12/06/2017 04:43 AM, Fredrik Boulund wrote:
 On Wednesday, 6 December 2017 at 10:42:31 UTC, Dgame wrote:

 Or you simply do
 ----
 writeln("longword".array.sort);
 ----

 This is so strange. I was dead sure I tried that but it failed for
 some reason. But after trying it just now it also seems to work just
 fine.  Thanks! :)

 
 As a general comment, sorting a string does not make sense in general
 when Unicode is involved.

[...]

Yeah... in general, you need to decide exactly what kind of sorting you
intend.  If you intend to sort individual graphemes (i.e., what we
normally think of as "characters"), you need to segment the string into
graphemes with .byGrapheme and then sort it as an array/range of
graphemes.  Sorting Unicode code points is probably not what you want,
and sorting code units is probably never what you want (unless you're
doing byte frequency analysis on UTF-8 or something :-P).
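A rough sketch of the grapheme route, under some loud assumptions: the helper name sortedGraphemes is hypothetical, and the ordering used here (comparing the code points inside each cluster) is just one simple choice. It is not locale-aware collation and does no normalization.

```d
import std.algorithm.comparison : cmp;
import std.algorithm.sorting : sort;
import std.array : array;
import std.conv : text;
import std.uni : byCodePoint, byGrapheme;

/// Hypothetical helper: sorts whole grapheme clusters, keeping each intact.
string sortedGraphemes(string s)
{
    auto clusters = s.byGrapheme.array; // Grapheme[]: random access, mutable

    // Assumption: order clusters by their code point sequences.
    clusters.sort!((a, b) => cmp(a[], b[]) < 0);

    return clusters.byCodePoint.text;   // flatten back into a string
}

void main()
{
    // e, then a with a combining acute accent (U+0301)
    string s = "ea\u0301";

    // The accent stays attached to the a instead of moving to the e.
    assert(sortedGraphemes(s) == "a\u0301e");
}
```

Compared with sorting code points, the cluster a + U+0301 moves as a unit, which is usually closer to what "sort the characters" means.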

Unicode is a tricky beast.


T

-- 
What do you get if you drop a piano down a mineshaft? A flat minor.

Dec 06 2017

Mengu <mengukagan gmail.com> writes:

On Wednesday, 6 December 2017 at 12:43:09 UTC, Fredrik Boulund 
wrote:
 On Wednesday, 6 December 2017 at 10:42:31 UTC, Dgame wrote:

 Or you simply do
 ----
 writeln("longword".array.sort);
 ----

 This is so strange. I was dead sure I tried that but it failed 
 for some reason. But after trying it just now it also seems to 
 work just fine. Thanks! :)

if you're like me, you probably forgot an import :)

Dec 07 2017
