digitalmars.D.learn - sorting a string


Namal <sotis22 mail.ru> writes:

Is there an 'easy' way to sort a string in D, like there is in 
Python? Also, how can I remove whitespace between characters if 
I am reading a line from a file and saving it as a string?

     string[] buffer;

     foreach (line ; File("test.txt").byLine)
         buffer ~= line.to!string;

Jul 14 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 7/14/17 11:06 AM, Namal wrote:
 Is there an 'easy' way to sort a string in D, like there is in 
 Python? Also, how can I remove whitespace between characters if I am 
 reading a line from a file and saving it as a string?
 
      string[] buffer;
 
      foreach (line ; File("test.txt").byLine)
          buffer ~= line.to!string;

import std.algorithm: filter;
import std.conv: to;
import std.uni: isWhite;

line.filter!(c => !c.isWhite).to!string;

Be warned, this is going to be a bit slow, but that's the cost of 
autodecoding.

If you are looking for just removing ascii whitespace, you can do it a 
bit more efficiently, but it's not easy.
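
As a rough sketch of the ASCII-only route (not code from this thread; `removeAsciiWhite` is a hypothetical helper name), one option is to loop over the raw UTF-8 code units and skip decoding entirely. It assumes only ASCII whitespace should be dropped. Every byte of a multi-byte sequence is 0x80 or above, so std.ascii.isWhite never matches it and non-ASCII text passes through unchanged.

```d
import std.array : appender;
import std.ascii : isWhite;

// Hypothetical helper: drops ASCII whitespace only and copies every
// other code unit through untouched.
string removeAsciiWhite(const(char)[] line)
{
    auto result = appender!string();
    foreach (char c; line) // char loop variable: iterate code units, no decoding
    {
        if (!isWhite(c))
            result.put(c);
    }
    return result.data;
}
```

In the loop from the original question this would be used as `buffer ~= removeAsciiWhite(line);`.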

About the string sorting, you'd have to be more specific.

-Steve

Jul 14 2017

Namal <sotis22 mail.ru> writes:

On Friday, 14 July 2017 at 15:15:42 UTC, Steven Schveighoffer 
wrote:


 import std.algorithm: filter;
 import std.uni: isWhite;

 line.filter!(c => !c.isWhite).to!string;

 be warned, this is going to be a bit slow, but that's the cost 
 of autodecoding.

 If you are looking for just removing ascii whitespace, you can 
 do it a bit more efficiently, but it's not easy.

 About the string sorting, you'd have to be more specific.

 -Steve

Thx Steve! By sorting string I mean a function or series of 
functions that sorts a string by ASCII code, "cabA" to "Aabc" for 
instance.

Jul 14 2017

Anton Fediushin <fediushin.anton yandex.ru> writes:

On Friday, 14 July 2017 at 15:56:49 UTC, Namal wrote:
 Thx Steve! By sorting string I mean a function or series of 
 functions that sorts a string by ASCII code, "cabA" to "Aabc" 
 for instance.

import std.algorithm : sort;
import std.stdio : writeln;

"cabA".dup.sort.writeln;

`dup` is used because a string cannot be modified, so a copy of 
the string is sorted instead.

Jul 14 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 7/14/17 12:43 PM, Anton Fediushin wrote:
 On Friday, 14 July 2017 at 15:56:49 UTC, Namal wrote:
 Thx Steve! By sorting string I mean a function or series of functions 
 that sorts a string by ASCII code, "cabA" to "Aabc" for instance.

 
 import std.algorithm : sort;
 import std.stdio : writeln;
 
 "cabA".dup.sort.writeln;
 
 `dup` is used because a string cannot be modified, so a copy of the 
 string is sorted instead.

Don't do this, because it's not what you think. It's not actually 
calling std.algorithm.sort, but the builtin array sort property. This 
will be going away soon.

Annoyingly, because of autodecoding, you have to convert to ubyte[] via 
representation to do it the "proper" way:

import std.string: representation, assumeUTF;
import std.algorithm: sort;

auto bytes = line.representation.dup; // mutable ubyte[] copy of the code units
bytes.sort(); // parentheses, so this is std.algorithm.sort and not the builtin property
auto result = bytes.assumeUTF; // result is now char[]
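
Wrapped up as a function, the same idea might look like the sketch below. `sortAscii` is a hypothetical name, and the sketch assumes the input is pure ASCII: sorting the bytes of multi-byte UTF-8 sequences would scramble them into invalid UTF-8. The final `idup` makes a fresh immutable copy, so the returned string shares no memory with the mutable buffer.

```d
import std.algorithm : sort;
import std.string : assumeUTF, representation;

// Hypothetical helper: sorts by code unit value, so it is only
// meaningful for ASCII input.
string sortAscii(string s)
{
    auto bytes = s.representation.dup; // mutable ubyte[] copy
    bytes.sort();                      // ubyte[] is a random-access range
    return bytes.assumeUTF.idup;       // char[] -> independent string
}
```

With this, `sortAscii("cabA")` yields "Aabc", as in the example from the question.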

-Steve

Jul 14 2017

Anton Fediushin <fediushin.anton yandex.ru> writes:

On Friday, 14 July 2017 at 17:23:41 UTC, Steven Schveighoffer 
wrote:
 Don't do this, because it's not what you think. It's not 
 actually calling std.algorithm.sort, but the builtin array sort 
 property. This will be going away soon.

This sucks. I know that `.sort` will be removed, but I thought 
it wouldn't break any code.

 With 2.075, it won't compile even without the parentheses, 
 because a char[] is not an array according to std.algorithm...

But why? This should be true for `char[]`, shouldn't it?
-----
if ((ss == SwapStrategy.unstable && (hasSwappableElements!Range 
|| hasAssignableElements!Range) || ss != SwapStrategy.unstable && 
hasAssignableElements!Range) && isRandomAccessRange!Range && 
hasSlicing!Range && hasLength!Range)
-----
(It's from 
https://dlang.org/phobos/std_algorithm_sorting.html#sort)

Jul 14 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 7/14/17 3:50 PM, Anton Fediushin wrote:
 On Friday, 14 July 2017 at 17:23:41 UTC, Steven Schveighoffer wrote:
 Don't do this, because it's not what you think. It's not actually 
 calling std.algorithm.sort, but the builtin array sort property. This 
 will be going away soon.

 
 This sucks. I know that `.sort` will be removed, but I thought it wouldn't 
 break any code.
 
 With 2.075, it won't compile even without the parentheses, because a 
 char[] is not an array according to std.algorithm...

 
 But why? This should be true for `char[]`, shouldn't it?
 -----
 if ((ss == SwapStrategy.unstable && (hasSwappableElements!Range || 
 hasAssignableElements!Range) || ss != SwapStrategy.unstable && 
 hasAssignableElements!Range) && isRandomAccessRange!Range && 
 hasSlicing!Range && hasLength!Range)
 -----
 (It's from https://dlang.org/phobos/std_algorithm_sorting.html#sort)

import std.range.primitives;

static assert(!isRandomAccessRange!(char[]));
static assert(!hasSlicing!(char[]));
static assert(!hasAssignableElements!(char[]));
static assert(!hasSwappableElements!(char[]));
static assert(!hasLength!(char[]));

It's because of autodecoding :) Phobos does not view char[] as an array, 
but rather a range of decoded dchar elements. It causes no end of problems.

-Steve

Jul 14 2017

ag0aep6g <anonymous example.com> writes:

On 07/14/2017 09:50 PM, Anton Fediushin wrote:
 But why? This should be true for `char[]`, shouldn't it?
 -----
 if ((ss == SwapStrategy.unstable && (hasSwappableElements!Range || 
 hasAssignableElements!Range) || ss != SwapStrategy.unstable && 
 hasAssignableElements!Range) && isRandomAccessRange!Range && 
 hasSlicing!Range && hasLength!Range)
 -----

No, those are all false. char[] is treated as a range of code points 
(dchar), not code units (char). It's decoded on-the-fly ("auto decoding").

Code unit sequences are variable in length. So to get the n-th code 
point you have to decode everything before it. That's too expensive to 
be considered random access.

Same for swappable/assignable elements: When the new sequence has a 
different length than the old one, you have to move everything that 
follows to make room or to close the gap. Too expensive to be accepted 
by the templates.
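
For contrast, here is a small sketch (an illustration, not code from the posts above) of the same traits applied to `ubyte[]`, which is what `representation` hands back for a `char[]`. Nothing is decoded there, so every requirement in the `sort` constraint holds.

```d
import std.range.primitives;
import std.string : representation;

// ubyte[] is an ordinary array as far as the range traits are concerned.
static assert(isRandomAccessRange!(ubyte[]));
static assert(hasSlicing!(ubyte[]));
static assert(hasLength!(ubyte[]));
static assert(hasSwappableElements!(ubyte[]));
static assert(hasAssignableElements!(ubyte[]));

// representation only changes the static type: char[] becomes ubyte[].
static assert(is(typeof("abc".dup.representation) == ubyte[]));
```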

As for why auto-decoding is a thing:

The idea was to treat char[] and friends as ranges of visual characters 
to have a robust default where you don't accidentally mess your strings 
up. Unfortunately, a code point isn't always what's commonly understood 
as a character (grapheme), and the whole thing falls apart. It's a 
historical accident, really.

Here's a huge discussion thread about maybe getting rid of it:

http://forum.dlang.org/post/nh2o9i$hr0$1@digitalmars.com

And here's what I took away from that:

http://forum.dlang.org/post/nirpdo$167i$1@digitalmars.com

Jul 14 2017

Jonathan M Davis via Digitalmars-d-learn writes:

On Friday, July 14, 2017 7:50:17 PM MDT Anton Fediushin via Digitalmars-d-learn wrote:
 On Friday, 14 July 2017 at 17:23:41 UTC, Steven Schveighoffer wrote:
 Don't do this, because it's not what you think. It's not
 actually calling std.algorithm.sort, but the builtin array sort
 property. This will be going away soon.

 This sucks. I know that `.sort` will be removed, but I thought
 it wouldn't break any code.

 With 2.075, it won't compile even without the parentheses,
 because a char[] is not an array according to std.algorithm...

 But why? This should be true for `char[]`, shouldn't it?
 -----
 if ((ss == SwapStrategy.unstable && (hasSwappableElements!Range
 || hasAssignableElements!Range) || ss != SwapStrategy.unstable &&
 hasAssignableElements!Range) && isRandomAccessRange!Range &&
 hasSlicing!Range && hasLength!Range)
 -----
 (It's from
 https://dlang.org/phobos/std_algorithm_sorting.html#sort)

It has to do with Unicode and Andrei's attempt to make ranges Unicode
correct without having to think about it. It was a nice thought, but it
really doesn't work, and it causes a number of annoying problems. As a quick
explanation, in Unicode, you have code units, which combine to make code
points. In UTF-8, a code unit is 8 bits. In UTF-16, it's 16 bits, and in
UTF-32, it's 32 bits. char is a UTF-8 code unit. wchar is a UTF-16 code
unit, and dchar is a UTF-32 code unit. 32 bits is enough to represent any
code point, so a valid dchar is not only a code unit, it's guaranteed to be
a code point. So, indexing or slicing a dstring will not cut into the middle
of any code points. However, because UTF-8 and UTF-16 potentially require
multiple code units in order to represent a code point, you can't just
arbitrarily index a string or wstring (or arbitrarily slice either of them)
or you risk breaking up a code point.

To avoid this problem and guarantee Unicode correctness, Andrei made it so
that the range API does not treat strings or wstrings as either random
access or sliceable. So, isRandomAccessRange!string and hasSlicing!string
are false. And front returns dchar for all string types, because it decodes
the first code point from the code units, which means that popFront could
pop one char or wchar off, or it could pop several. Similarly, because the
number of code points can't be known without iterating them,
hasLength!string is false.
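
A minimal sketch of that behaviour, assuming a Phobos version where autodecoding is still in effect (`unitsPopped` is a hypothetical helper, not a library function): the element type of a string range is dchar, and a single popFront may remove more than one char.

```d
import std.range.primitives : ElementType, front, popFront;

// For string and wstring alike, the range element type is dchar,
// not the code unit type.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));
static assert(is(typeof("".front) == dchar));

// Hypothetical helper: number of code units removed by one popFront.
size_t unitsPopped(string s)
{
    immutable before = s.length;
    s.popFront(); // drops every code unit of the first code point
    return before - s.length;
}
```

For example, `unitsPopped("abc")` is 1, while `unitsPopped("\u00E4bc")` is 2, because U+00E4 takes two code units in UTF-8.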

This does prevent you from blindly doing things to your strings using the
range API which will make them have invalid code points, but it also makes
it really annoying for stuff like sort, because sort requires a random
access range to sort, but char[] and wchar[] are not considered to be random
access ranges, because they're considered to be ranges of dchar rather than
char or wchar. It also hurts performance, because many algorithms don't
actually need to decode the code points - which is why many Phobos functions
have special overloads for strings; they do the decoding only when required
or skip it entirely.

To make matters worse, not only is all of this frustrating to deal with, but
it's not even fully correct. When Andrei added the range API to Phobos, he
thought that code points were the character you see on the screen (and
annoyingly, Unicode _does_ call them characters for some stupid reason), but
they really aren't. They do represent printable items, but they don't just
include characters such as A. They also include stuff like accents or
subscripts such that you sometimes actually need multiple code points to
represent an actual character on the screen - and a group of code points
which represents a character or glyph that you'd see on the screen is called
a grapheme cluster. So, to be fully correct, ranges would have to decode
clear to grapheme clusters by default, which would be horribly inefficient. The
correct way to handle Unicode requires that the programmer understand it
well enough to know when they should be operating at the code unit level,
when they should be operating at the code point level, and when they should
be operating at the grapheme level. You really can't automate it -
especially not if you want to be efficient. And you rarely want the code
point level, making Andrei's choice particularly bad.
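
To make the three levels concrete, here is a small sketch that counts the same string at each of them. `unicodeLengths` is a hypothetical helper written for this illustration. It assumes UTF-8 input and, for the middle number, that autodecoding is active.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: [code units, code points, graphemes] of a string.
size_t[3] unicodeLengths(string s)
{
    size_t[3] result = [
        s.length,                // code units: plain array length
        s.walkLength,            // code points: the range API autodecodes
        s.byGrapheme.walkLength  // graphemes: explicit opt-in via std.uni
    ];
    return result;
}
```

For "e\u0301" (an e followed by a combining acute accent, which renders as a single character), this gives 3 code units, 2 code points and 1 grapheme.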

So really, ranges should not be treating strings in a special manner; they
should be treated as ranges of code units and require the programmer to wrap
them in ranges of code points or graphemes as appropriate. But
unfortunately, making that change now would break a lot of code. So, we seem
to be stuck. The result is that you're forced to either specialize your
functions on strings or use functions like std.string.representation or
std.utf.byCodeUnit to work around the problem. And of course, this whole
issue is incredibly confusing to anyone coming to D - especially those who
aren't well-versed in Unicode. :(

- Jonathan M Davis

Jul 14 2017

Gerald Jansen <gjansenXXX XXXownmail.net> writes:

On Friday, 14 July 2017 at 20:52:54 UTC, Jonathan M Davis wrote:
 And of course, this whole issue is incredibly confusing to 
 anyone
 coming to D - especially those who aren't well-versed in 
 Unicode.

Right on. Thanks for your very clear summary (the whole thing, 
not just the last line!). Much appreciated.

Jul 15 2017

Namal <sotis22 mail.ru> writes:

On Friday, 14 July 2017 at 17:23:41 UTC, Steven Schveighoffer 
wrote:

 import std.string: representation, assumeUTF;
 import std.algorithm: sort;

 auto bytes = line.representation.dup;
 bytes.sort;
 auto result = bytes.assumeUTF; // result is now char[]


Why does it have to be char[]?

  auto bytes = line.representation.dup;
  bytes.sort;
  string result = bytes.assumeUTF;

works too.

Jul 14 2017

ag0aep6g <anonymous example.com> writes:

On 07/15/2017 04:33 AM, Namal wrote:
 Why does it have to be char[]?
 
   auto bytes = line.representation.dup;
   bytes.sort;
   string result = bytes.assumeUTF;
 
 works too.

That's a compiler bug. The code should not compile, because now you can 
mutate `result`'s elements through `bytes`. But `result`'s elements are 
supposed to be immutable.

I've filed an issue: https://issues.dlang.org/show_bug.cgi?id=17654

Jul 14 2017

Namal <sotis22 mail.ru> writes:

On Friday, 14 July 2017 at 16:43:42 UTC, Anton Fediushin wrote:
 On Friday, 14 July 2017 at 15:56:49 UTC, Namal wrote:
 Thx Steve! By sorting string I mean a function or series of 
 functions that sorts a string by ASCII code, "cabA" to "Aabc" 
 for instance.

 import std.algorithm : sort;
 import std.stdio : writeln;

 "cabA".dup.sort.writeln;

 `dup` is used because a string cannot be modified, so a copy of 
 the string is sorted instead.

Thx a lot. One final question: if I do it like that, I get a 
deprecation warning:

use std.algorithm.sort instead of .sort property

Wasn't .sort() the proper way to use it? Because that won't 
compile.

Jul 14 2017

Seb <seb wilzba.ch> writes:

On Friday, 14 July 2017 at 17:28:29 UTC, Namal wrote:
 On Friday, 14 July 2017 at 16:43:42 UTC, Anton Fediushin wrote:
 On Friday, 14 July 2017 at 15:56:49 UTC, Namal wrote:
 Thx Steve! By sorting string I mean a function or series of 
 functions that sorts a string by ASCII code, "cabA" to "Aabc" 
 for instance.

 import std.algorithm : sort;
 import std.stdio : writeln;

 "cabA".dup.sort.writeln;

 `dup` is used because a string cannot be modified, so a copy of 
 the string is sorted instead.

 Thx a lot. One final question: if I do it like that, I get a 
 deprecation warning:

 use std.algorithm.sort instead of .sort property

 Wasn't .sort() the proper way to use it? Because that won't 
 compile.

With 2.075 you won't need this anymore, as the builtin properties 
have finally been removed:

https://dlang.org/changelog/2.075.0_pre.html#removeArrayProps

Jul 14 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 7/14/17 1:42 PM, Seb wrote:
 On Friday, 14 July 2017 at 17:28:29 UTC, Namal wrote:
 On Friday, 14 July 2017 at 16:43:42 UTC, Anton Fediushin wrote:
 On Friday, 14 July 2017 at 15:56:49 UTC, Namal wrote:
 Thx Steve! By sorting string I mean a function or series of 
 functions that sorts a string by ASCII code, "cabA" to "Aabc" for 
 instance.

 import std.algorithm : sort;
 import std.stdio : writeln;

 "cabA".dup.sort.writeln;

 `dup` is used because a string cannot be modified, so a copy of the 
 string is sorted instead.

 Thx a lot. One final question: if I do it like that, I get a 
 deprecation warning:

 use std.algorithm.sort instead of .sort property

 Wasn't .sort() the proper way to use it? Because that won't compile.

 
 With 2.075 you won't need this anymore, as the builtin properties have 
 finally been removed:
 
 https://dlang.org/changelog/2.075.0_pre.html#removeArrayProps

With 2.075, it won't compile even without the parentheses, because a 
char[] is not an array according to std.algorithm...

See my other post.

-Steve

Jul 14 2017
