digitalmars.D.learn - char array weirdness


Jack Stouffer <jack jackstouffer.com> writes:

void main () {
     import std.range.primitives;
     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
     pragma(msg, ElementEncodingType!(typeof(val)));
     pragma(msg, typeof(val.front));
}

prints

     char
     dchar

Why?

Mar 28 2016

Anon <anon anon.anon> writes:

On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
 void main () {
     import std.range.primitives;
     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
     pragma(msg, ElementEncodingType!(typeof(val)));
     pragma(msg, typeof(val.front));
 }

 prints

     char
     dchar

 Why?

Unicode! `char` is UTF-8, which means a character can be from 1 
to 4 bytes. val.front gives a `dchar` (UTF-32), consuming those 
bytes and giving you a sensible value.

Mar 28 2016

Jack Stouffer <jack jackstouffer.com> writes:

On Monday, 28 March 2016 at 22:43:26 UTC, Anon wrote:
 On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
 void main () {
     import std.range.primitives;
     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
     pragma(msg, ElementEncodingType!(typeof(val)));
     pragma(msg, typeof(val.front));
 }

 prints

     char
     dchar

 Why?

 Unicode! `char` is UTF-8, which means a character can be from 1 
 to 4 bytes. val.front gives a `dchar` (UTF-32), consuming those 
 bytes and giving you a sensible value.

But the value fits into a char; a dchar is a waste of space. Why 
on Earth would a different type be given for the front value than 
the type of the elements themselves?

Mar 28 2016

Anon <anon anon.anon> writes:

On Monday, 28 March 2016 at 22:49:28 UTC, Jack Stouffer wrote:
 On Monday, 28 March 2016 at 22:43:26 UTC, Anon wrote:
 On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
 void main () {
     import std.range.primitives;
     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 
 's'];
     pragma(msg, ElementEncodingType!(typeof(val)));
     pragma(msg, typeof(val.front));
 }

 prints

     char
     dchar

 Why?

 Unicode! `char` is UTF-8, which means a character can be from 
 1 to 4 bytes. val.front gives a `dchar` (UTF-32), consuming 
 those bytes and giving you a sensible value.

 But the value fits into a char;

The compiler doesn't know that, and it isn't true in general. You 
could have, for example, U+3042 in your char[]. That would be 
encoded as three chars. It wouldn't make sense (or be correct) 
for val.front to yield '\xe3' (the first byte of U+3042 in UTF-8).

 a dchar is a waste of space.

If you're processing Unicode text, you *need* to use that space. 
Any because you're using ranges, it is only 3 extra bytes, 
anyway. It isn't going to hurt on modern systems.

 Why on Earth would a different type be given for the front 
 value than the type of the elements themselves?

Unicode. A single char cannot hold a Unicode code point. A single 
dchar can.
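
A minimal sketch of the U+3042 case described above, assuming a default Phobos build with auto-decoding enabled. The variable name is illustrative only; the point is that indexing sees one code unit while `front` sees the whole code point.

```d
import std.range.primitives : front;

void main()
{
    // U+3042 (HIRAGANA LETTER A) occupies three UTF-8 code units: E3 81 82.
    char[] val = "\u3042".dup;

    assert(val.length == 3);        // the array holds three chars
    assert(val[0] == 0xE3);         // indexing yields only the first raw byte
    assert(val[1] == 0x81 && val[2] == 0x82);

    // front decodes all three code units into one code point.
    assert(val.front == '\u3042');
    static assert(is(typeof(val.front) == dchar));
}
```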

Mar 28 2016

Anon <anon anon.anon> writes:

On Monday, 28 March 2016 at 23:06:49 UTC, Anon wrote:
 Any because you're using ranges,

*And because you're using ranges,

Mar 28 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/28/16 7:06 PM, Anon wrote:
 The compiler doesn't know that, and it isn't true in general. You could
 have, for example, U+3042 in your char[]. That would be encoded as three
 chars. It wouldn't make sense (or be correct) for val.front to yield
 '\xe3' (the first byte of U+3042 in UTF-8).

I just want to interject to say that the compiler understands that 
char[] is an array of char code units just fine. It's Phobos that has a 
strange interpretation of it.

-Steve

Mar 28 2016

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Mon, Mar 28, 2016 at 10:49:28PM +0000, Jack Stouffer via Digitalmars-d-learn
wrote:
 On Monday, 28 March 2016 at 22:43:26 UTC, Anon wrote:
On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
void main () {
    import std.range.primitives;
    char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
    pragma(msg, ElementEncodingType!(typeof(val)));
    pragma(msg, typeof(val.front));
}

prints

    char
    dchar

Why?

Unicode! `char` is UTF-8, which means a character can be from 1 to 4
bytes. val.front gives a `dchar` (UTF-32), consuming those bytes and
giving you a sensible value.

 
 But the value fits into a char; a dchar is a waste of space. Why on
 Earth would a different type be given for the front value than the
 type of the elements themselves?

Welcome to the world of auto-decoding.  Phobos ranges always treat any
string / wstring / dstring as a range of dchar, even if it's encoded as
UTF-8.

The pros and cons of auto-decoding have been debated to death several
times already. Walter hates it and wishes to get rid of it, but so far
Andrei has refused to budge.  Personally I lean on the side of killing
auto-decoding, but it seems unlikely to change at this point.  (But you
never know... if enough people revolt against it, maybe there's a small
chance Andrei could be convinced...)

For the time being, I'd recommend std.utf.byCodeUnit as a workaround.
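
As a rough illustration of that workaround (a sketch, assuming default Phobos with auto-decoding on), wrapping the array in byCodeUnit keeps the element type at char:

```d
import std.range.primitives : ElementType, front;
import std.utf : byCodeUnit;

void main()
{
    char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];

    // Plain char[]: the range primitives decode, so front is a dchar.
    static assert(is(typeof(val.front) == dchar));

    // Wrapped with byCodeUnit: the elements stay char and nothing is decoded.
    auto units = val.byCodeUnit;
    static assert(is(ElementType!(typeof(units)) == char));

    assert(units.front == '1');
    assert(units.length == 9);   // the wrapper also exposes length and indexing
    assert(units[2] == 'h');
}
```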


T

-- 
Those who don't understand D are condemned to reinvent it, poorly. -- Daniel N

Mar 28 2016

ag0aep6g <anonymous example.com> writes:

On 29.03.2016 00:49, Jack Stouffer wrote:
 But the value fits into a char; a dchar is a waste of space. Why on
 Earth would a different type be given for the front value than the type
 of the elements themselves?

UTF-8 strings are decoded by the range primitives. That is, `front` 
returns one Unicode code point (type dchar) that's pieced together from 
up to four UTF-8 code units (type char). A code point does not fit into 
the 8 bits of a char.

Mar 28 2016

Jonathan M Davis via Digitalmars-d-learn writes:

On Monday, March 28, 2016 16:02:26 H. S. Teoh via Digitalmars-d-learn wrote:
 For the time being, I'd recommend std.utf.byCodeUnit as a workaround.

Yeah, though as I've started using it, I've quickly found that enough
of Phobos doesn't support it yet, that it's problematic. e.g.

https://issues.dlang.org/show_bug.cgi?id=15800

The situation will improve, but for the moment, the most reliable thing is
still to use strings as ranges of dchar but special case functions for them
so that they avoid decoding where necessary. The main problem is places like
filter where if you _know_ that you're just dealing with ASCII but the code
has to treat the string as a range of dchar anyway, because it has to decode
to match what's expected of auto-decoding. To some extent, using
std.string.representation gets around that, but it runs into problems
similar to those of byCodeUnit.

So, we have a ways to go.

- Jonathan M Davis

Mar 28 2016

Jonathan M Davis via Digitalmars-d-learn writes:

On Monday, March 28, 2016 22:34:31 Jack Stouffer via Digitalmars-d-learn 
wrote:
 void main () {
      import std.range.primitives;
      char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
      pragma(msg, ElementEncodingType!(typeof(val)));
      pragma(msg, typeof(val.front));
 }

 prints

      char
      dchar

 Why?

static assert(is(ElementType!(typeof(val)) == dchar));

The range API considers all strings to have an element type of dchar. char,
wchar, and dchar are UTF code units - UTF-8, UTF-16, and UTF-32
respectively. One or more code units make up a code point, which is actually
something displayable but not necessarily what you'd call a character (e.g.
it could be an accent). One or more code points then make up a grapheme,
which is really what a displayable character is. When Andrei designed the
range API, he didn't know about graphemes - just code units and code points,
so he thought that code points were guaranteed to be full characters and
decided that that's what we'd operate on for correctness' sake.

In the case of UTF-8, a code point is made up of 1 - 4 code units of 8 bits
each. In the case of UTF-16, a code point is made up of 1 - 2 code units of
16 bits each. And in the case of UTF-32, a code unit is guaranteed to be a
single code point. So, by having the range API decode UTF-8 and UTF-16 to
UTF-32, strings then become ranges of dchar and avoid having code points
chopped up by stuff like slicing. So, while a code point is not actually
guaranteed to be a full character, certain classes of bugs are prevented by
operating on ranges of code points rather than code units. Of course, for
full correctness, graphemes need to be taken into account, and some
algorithms generally don't care whether they're operating on code units,
code points, or graphemes (e.g. find on code units generally works quite
well, whereas something like filter would be a complete disaster if you're
not actually dealing with ASCII).
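
To make the code unit counts concrete, here is a small sketch (assuming default Phobos with auto-decoding enabled; the variable names are made up for illustration). It encodes one code point from outside the Basic Multilingual Plane in all three UTF forms and shows how a code-unit slice can cut it in half:

```d
import std.exception : assertThrown;
import std.range : walkLength;
import std.utf : UTFException, validate;

void main()
{
    // U+1F600 lies outside the Basic Multilingual Plane.
    string  s8  = "\U0001F600";
    wstring s16 = "\U0001F600"w;
    dstring s32 = "\U0001F600"d;

    // .length counts code units, which differ per encoding.
    assert(s8.length  == 4);
    assert(s16.length == 2);
    assert(s32.length == 1);

    // As ranges, all three contain exactly one (decoded) element.
    assert(s8.walkLength  == 1);
    assert(s16.walkLength == 1);
    assert(s32.walkLength == 1);

    // Slicing works on code units, so it can chop a code point apart;
    // the resulting fragment is not valid UTF-8.
    assertThrown!UTFException(validate(s8[0 .. 2]));
}
```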

Arrays of char and wchar are termed "narrow strings" - hence isNarrowString
is true for them (but not arrays of dchar) - and the range API does not
consider them to have slicing, be random access, or have length, because as
ranges of dchar, those operations would be O(n) rather than O(1). However,
because of this mess of whether an algorithm works best when operating on
code units or code points and the desire to avoid decoding to code points if
unnecessary, many algorithms special case narrow strings in order to
operate on them more efficiently. So, ElementEncodingType was introduced for
such cases. ElementType gives you the element type of the range, and for
everything but narrow strings ElementEncodingType is the same as
ElementType, but in the case of narrow strings, whereas ElementType is
dchar, ElementEncodingType is the actual element type of the array - hence
why ElementEncodingType!(typeof(val)) is char in your code above.
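
The traits mentioned above can be checked at compile time. This is only a sketch, and it assumes a default Phobos build where auto-decoding is enabled:

```d
import std.range.primitives;
import std.traits : isNarrowString;

// Arrays of char and wchar are narrow strings; arrays of dchar are not.
static assert( isNarrowString!(char[]));
static assert( isNarrowString!(wchar[]));
static assert(!isNarrowString!(dchar[]));

// For narrow strings the range element type and the storage type differ.
static assert(is(ElementType!(char[]) == dchar));
static assert(is(ElementEncodingType!(char[]) == char));

// For everything else the two traits agree.
static assert(is(ElementType!(int[]) == int));
static assert(is(ElementEncodingType!(int[]) == int));

// The range API hides length, slicing and random access on narrow strings.
static assert(!hasLength!(char[]));
static assert(!hasSlicing!(char[]));
static assert(!isRandomAccessRange!(char[]));
static assert( isBidirectionalRange!(char[]));
static assert( isRandomAccessRange!(dchar[]));

void main() {}
```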

The correct way to deal with this is really to understand Unicode well
enough to know when you should be dealing at the code unit, code point, or
grapheme level and write your code accordingly, but that's not exactly easy.
So, in some respects, just operating on strings as dchar simplifies things
and reduces bugs relating to breaking up code points, but it does come with
an efficiency cost, and it does make the range API more confusing when it
comes to operating on narrow strings. And it isn't even fully correct,
because it doesn't take graphemes into account. But it's what we're stuck
with at this point.

std.utf provides byCodeUnit and byChar to iterate by code unit or specific
character types, and std.uni provides byGrapheme for iterating by grapheme
(along with plenty of other helper functions). So, the tools to deal with
ranges of characters more precisely are there, but they do require some
understanding of Unicode, and they don't always interact with the rest of
Phobos very well, since they're newer (e.g. std.conv.to doesn't fully work
with byCodeUnit yet, even though it works with ranges of dchar just fine).
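
A short sketch of those tools side by side, assuming default Phobos with auto-decoding enabled. The sample string is an assumption chosen so that code units, code points and graphemes all give different counts:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byChar, byCodeUnit, byDchar, byWchar;

void main()
{
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string s = "e\u0301";

    assert(s.byCodeUnit.walkLength == 3); // 1 + 2 UTF-8 code units
    assert(s.byChar.walkLength     == 3); // UTF-8 code units again
    assert(s.byWchar.walkLength    == 2); // two UTF-16 code units
    assert(s.byDchar.walkLength    == 2); // two code points
    assert(s.walkLength            == 2); // auto-decoding: also code points
    assert(s.byGrapheme.walkLength == 1); // one user-perceived character
}
```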

- Jonathan M Davis

Mar 28 2016

Jack Stouffer <jack jackstouffer.com> writes:

On Monday, 28 March 2016 at 23:07:22 UTC, Jonathan M Davis wrote:
 ...

Thanks for the detailed responses. I think I'll compile this info 
and put it in a blog post so people can just point to it when 
someone else is confused.

Mar 28 2016

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Mon, Mar 28, 2016 at 04:07:22PM -0700, Jonathan M Davis via
Digitalmars-d-learn wrote:
[...]
 The range API considers all strings to have an element type of dchar.
 char, wchar, and dchar are UTF code units - UTF-8, UTF-16, and UTF-32
 respectively. One or more code units make up a code point, which is
 actually something displayable but not necessarily what you'd call a
 character (e.g.  it could be an accent). One or more code points then
 make up a grapheme, which is really what a displayable character is.
 When Andrei designed the range API, he didn't know about graphemes -
 just code units and code points, so he thought that code points were
 guaranteed to be full characters and decided that that's what we'd
 operate on for correctness' sake.

[...]

Unfortunately, the fact that the default is *not* to use graphemes makes
working with non-European language strings pretty much just as ugly and
error-prone as working with bare char's in European language strings.

You gave the example of filter() returning wrong results when used with
a range of chars (if we didn't have autodecoding), but the same can be
said of using filter() *with* autodecoding on a string that contains
combining diacritics: your diacritics may get randomly reattached to
stuff they weren't originally attached to, or you may end up with wrong
sequences of Unicode code points (e.g. diacritics not attached to any
grapheme). Using filter() on Korean text, even with autodecoding, will
pretty much produce garbage. And so on.
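
A small sketch of the diacritic problem, assuming default Phobos with auto-decoding enabled. The sample word is made up for illustration; filtering out 'e' at the code point level leaves the combining accents behind, where they attach to the preceding letters:

```d
import std.algorithm.iteration : filter;
import std.array : array;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // Two accented letters, each written as 'e' + U+0301 COMBINING ACUTE ACCENT.
    string s = "re\u0301sume\u0301";

    // Auto-decoded filter works on code points: the 'e's go away, but the
    // accents stay and now follow 'r' and 'm' instead.
    auto byCodePoint = s.filter!(c => c != 'e').array;
    assert(byCodePoint == "r\u0301sum\u0301"d);

    // Filtering whole graphemes removes each base letter together with its accent.
    assert(s.byGrapheme.walkLength == 6);
    assert(s.byGrapheme.filter!(g => g[0] != 'e').walkLength == 4);
}
```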

So in short, we're paying a performance cost for something that's only
arguably better but still not quite there, and this cost is attached to
almost *everything* you do with strings, regardless of whether you need
to (e.g., when you know you're dealing with pure ASCII data).  Even when
dealing with non-ASCII Unicode data, in many cases autodecoding
introduces a constant (and unnecessary!) overhead.  E.g., searching for
a non-ASCII character is equivalent to a substring search on the encoded
form of the character, and there is no good reason why Phobos couldn't
have done this instead of autodecoding every character while scanning
the string.  Regexes on Unicode strings could possibly be faster if the
regex engine internally converted literals in the regex into their
equivalent encoded forms and did the scanning without decoding. (IIRC
Dmitry did remark in some PR some time ago, to the effect that the regex
engine has been optimized to the point where the cost of autodecoding is
becoming visible, and the next step might be to bypass autodecoding.)
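
The substring-search idea can be sketched as follows. `findEncoded` is a hypothetical helper written for this illustration, not a Phobos function; it encodes the needle once and then searches the raw UTF-8 bytes, which is safe because no valid UTF-8 sequence can begin in the middle of another one.

```d
import std.algorithm.searching : find;
import std.string : representation;
import std.utf : encode;

/// Hypothetical helper: find `needle` in `haystack` without decoding the haystack.
string findEncoded(string haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle);   // encode the needle once
    // Plain substring search over raw code units (ubyte, so no auto-decoding).
    auto rest = haystack.representation.find(buf[0 .. len].representation);
    return cast(string) rest;
}

void main()
{
    string text = "na\u00EFve caf\u00E9";

    assert(findEncoded(text, '\u00E9') == "\u00E9");
    assert(findEncoded(text, '\u00EF') == "\u00EFve caf\u00E9");
    assert(findEncoded(text, 'z').length == 0);

    // Same result as the auto-decoding search on the string itself.
    assert(text.find('\u00E9') == findEncoded(text, '\u00E9'));
}
```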

I argue that auto-decoding, as currently implemented, is a net minus,
even though I realize this is unlikely to change in this lifetime. It
charges a constant performance overhead yet still does not guarantee
things will behave as the user would expect (i.e., treat the string as
graphemes rather than code points).


T

-- 
We are in class, we are supposed to be learning, we have a teacher... Is it too
much that I expect him to teach me??? -- RL

Mar 28 2016

Marco Leise <Marco.Leise gmx.de> writes:

Am Mon, 28 Mar 2016 16:29:50 -0700
schrieb "H. S. Teoh via Digitalmars-d-learn"
<digitalmars-d-learn puremagic.com>:

 […] your diacritics may get randomly reattached to
 stuff they weren't originally attached to, or you may end up with wrong
 sequences of Unicode code points (e.g. diacritics not attached to any
 grapheme). Using filter() on Korean text, even with autodecoding, will
 pretty much produce garbage. And so on.

I'm on the same page here. If it ain't ASCII parsable, you
*have* to make a conscious decision about whether you need
code units or graphemes. I've yet to find out about the use
cases for auto-decoded code-points though.

 So in short, we're paying a performance cost for something that's only
 arguably better but still not quite there, and this cost is attached to
 almost *everything* you do with strings, regardless of whether you need
 to (e.g., when you know you're dealing with pure ASCII data).

An unconscious decision made by the library that yields the
least likely intended and expected result? Let me think ...
mhhh ... that's worse than iterating by char. No talking
back :p.

-- 
Marco

Mar 29 2016

Jonathan M Davis via Digitalmars-d-learn writes:

On Monday, March 28, 2016 16:29:50 H. S. Teoh via Digitalmars-d-learn wrote:
 On Mon, Mar 28, 2016 at 04:07:22PM -0700, Jonathan M Davis via
 Digitalmars-d-learn wrote: [...]

 The range API considers all strings to have an element type of dchar.
 char, wchar, and dchar are UTF code units - UTF-8, UTF-16, and UTF-32
 respectively. One or more code units make up a code point, which is
 actually something displayable but not necessarily what you'd call a
 character (e.g.  it could be an accent). One or more code points then
 make up a grapheme, which is really what a displayable character is.
 When Andrei designed the range API, he didn't know about graphemes -
 just code units and code points, so he thought that code points were
 guaranteed to be full characters and decided that that's what we'd
 operate on for correctness' sake.

 [...]

 Unfortunately, the fact that the default is *not* to use graphemes makes
 working with non-European language strings pretty much just as ugly and
 error-prone as working with bare char's in European language strings.

Yeah. Operating at the code point level instead of the code unit level is
correct for more text than just operating at the code unit level (especially
if you're dealing with char rather than wchar), but ultimately, it's
definitely not correct, and there's plenty of text that will be processed
incorrectly as code points.

 I argue that auto-decoding, as currently implemented, is a net minus,
 even though I realize this is unlikely to change in this lifetime. It
 charges a constant performance overhead yet still does not guarantee
 things will behave as the user would expect (i.e., treat the string as
 graphemes rather than code points).

I totally agree, and I think that _most_ of the Phobos devs agree at this
point. It's Andrei that doesn't. But we have the twin problems of figuring
out how to convince him and how to deal with the fact that changing it would
break a lot of code. Unicode is disgusting to deal with if you want to
deal with it correctly _and_ be efficient about it, but hiding it doesn't
really work.

I think that the first steps are to make it so that the algorithms in Phobos
will operate just fine on ranges of char and wchar in addition to dchar and
move towards making it irrelevant wherever we can. Some functions (like
filter) are going to have to be told what level to operate at and would be a
serious problem if/when we switched away from auto-decoding, but many others
(such as find) can be made not to care while still operating on Unicode
correctly. And if we can get the amount of code impacted low enough (at least
as far as Phobos goes), then maybe we can find a way to switch away from
auto-decoding. Ultimately though, I fear that we're stuck with it and that
we'll just have to figure out how to make it work well for those who know
what they're doing while minimizing the performance impact of auto-decoding
on those who don't know what they're doing as much as we reasonably can.

- Jonathan M Davis

Mar 29 2016

Basile B. <b2.temp gmx.com> writes:

On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
 void main () {
     import std.range.primitives;
     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
     pragma(msg, ElementEncodingType!(typeof(val)));
     pragma(msg, typeof(val.front));
 }

 prints

     char
     dchar

 Why?

I've seen you so many times as a reviewer on dlang that I believe 
this Q is a joke.
Even if obviously nobody can know everything...

https://www.youtube.com/watch?v=l97MxTx0nzs

seriously you didn't know that auto decoding is on and that it 
gives you a dchar...

Mar 29 2016

Jack Stouffer <jack jackstouffer.com> writes:

On Tuesday, 29 March 2016 at 23:15:26 UTC, Basile B. wrote:
 I've seen you so many times as a reviewer on dlang that I believe 
 this Q is a joke.
 Even if obviously nobody can know everything...

 https://www.youtube.com/watch?v=l97MxTx0nzs

 seriously you didn't know that auto decoding is on and that it 
 gives you a dchar...

It's not a joke. This is the first time I've run into this 
problem in my code. I just started using D more and more in my 
work and I've never written anything that was really string heavy.

Mar 29 2016

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Tue, Mar 29, 2016 at 11:15:26PM +0000, Basile B. via Digitalmars-d-learn
wrote:
 On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
void main () {
    import std.range.primitives;
    char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
    pragma(msg, ElementEncodingType!(typeof(val)));
    pragma(msg, typeof(val.front));
}

prints

    char
    dchar

Why?

 
 I've seen you so many times as a reviewer on dlang that I believe this Q
 is a joke.
 Even if obviously nobody can know everything...
 
 https://www.youtube.com/watch?v=l97MxTx0nzs
 
 seriously you didn't know that auto decoding is on and that it gives
 you a dchar...

Believe it or not, it was only last year (IIRC, maybe the year before)
that Walter "discovered" that Phobos does autodecoding, and got pretty
upset over it.  If even Walter wasn't aware of this for that long...

I used to be in favor of autodecoding, but more and more, I'm seeing
that it was a bad choice.  It's a special case to how ranges normally
work, and this special case has caused a ripple of exceptional corner
cases to percolate throughout all Phobos code, leaving behind a
string(!) of bugs over the years that, certainly, eventually got
addressed, but nevertheless it shows that something didn't quite fit in.
It also left behind a trail of additional complexity to deal with these
special cases that made Phobos harder to understand and maintain.

It's a performance bottleneck for string-processing code, which is a
pity because D could have stood a chance of beating C/C++ at string
processing (which suffers from the extensive need to call strlen and strdup). But in
spite of this heavy price we *still* don't guarantee correctness. On the
spectrum of speed (don't decode at all) vs. correctness (segment by
graphemes, not by code units or code points) autodecoding lands in the
anemic middle where you get neither speed nor full correctness.

The saddest part of it all is that this is unlikely to change because
people have gotten so uptight about the specter of breaking existing
code, in spite of the repeated experiences of newbies (and
not-so-newbies like Walter himself!) wondering why strings have
ElementType == dchar instead of char, usually followed by concerns over
the performance overhead.


T

-- 
Designer clothes: how to cover less by paying more.

Mar 29 2016

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 3/29/16 7:42 PM, H. S. Teoh via Digitalmars-d-learn wrote:
 On Tue, Mar 29, 2016 at 11:15:26PM +0000, Basile B. via Digitalmars-d-learn
wrote:
 On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
 void main () {
     import std.range.primitives;
     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
     pragma(msg, ElementEncodingType!(typeof(val)));
     pragma(msg, typeof(val.front));
 }

 prints

     char
     dchar

 Why?

 I've seen you so many times as a reviewer on dlang that I believe this Q
 is a joke.
 Even if obviously nobody can know everything...

 https://www.youtube.com/watch?v=l97MxTx0nzs

 seriously you didn't know that auto decoding is on and that it gives
 you a dchar...

 Believe it or not, it was only last year (IIRC, maybe the year before)
 that Walter "discovered" that Phobos does autodecoding, and got pretty
 upset over it.  If even Walter wasn't aware of this for that long...

Phobos treats narrow strings (wchar[], char[]) as ranges of dchar. It 
was discovered that auto decoding strings isn't always the smartest 
thing to do, especially for performance.

So you get things like this: 
https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm/searching.d#L1622

That's right. Phobos insists that auto decoding must happen for narrow 
strings. Except that's not the best thing to do so it inserts lots of 
exceptions -- for narrow strings.

Mind blown?

-Steve

Mar 29 2016

Basile B. <b2.temp gmx.com> writes:

On Wednesday, 30 March 2016 at 00:05:29 UTC, Steven Schveighoffer 
wrote:
 On 3/29/16 7:42 PM, H. S. Teoh via Digitalmars-d-learn wrote:
 On Tue, Mar 29, 2016 at 11:15:26PM +0000, Basile B. via 
 Digitalmars-d-learn wrote:
 On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
 void main () {
     import std.range.primitives;
     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 
 's'];
     pragma(msg, ElementEncodingType!(typeof(val)));
     pragma(msg, typeof(val.front));
 }

 prints

     char
     dchar

 Why?

 I've seen you so many times as a reviewer on dlang that I 
 believe this Q
 is a joke.
 Even if obviously nobody can know everything...

 https://www.youtube.com/watch?v=l97MxTx0nzs

 seriously you didn't know that auto decoding is on and that 
 it gives
 you a dchar...

 Believe it or not, it was only last year (IIRC, maybe the year 
 before)
 that Walter "discovered" that Phobos does autodecoding, and 
 got pretty
 upset over it.  If even Walter wasn't aware of this for that 
 long...

 Phobos treats narrow strings (wchar[], char[]) as ranges of 
 dchar. It was discovered that auto decoding strings isn't 
 always the smartest thing to do, especially for performance.

 So you get things like this: 
 https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm/searching.d#L1622

 That's right. Phobos insists that auto decoding must happen for 
 narrow strings. Except that's not the best thing to do so it 
 inserts lots of exceptions -- for narrow strings.

 Mind blown?

 -Steve

https://www.youtube.com/watch?v=JKQwgpaLR6o

Listen to this then it'll be more clear.

Mar 29 2016

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Tue, Mar 29, 2016 at 08:05:29PM -0400, Steven Schveighoffer via
Digitalmars-d-learn wrote:
[...]
Phobos treats narrow strings (wchar[], char[]) as ranges of dchar. It
was discovered that auto decoding strings isn't always the smartest
thing to do, especially for performance.

So you get things like this:
https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm/searching.d#L1622

That's right. Phobos insists that auto decoding must happen for narrow
strings. Except that's not the best thing to do so it inserts lots of
exceptions -- for narrow strings.

Mind blown?

[...]

Mind not blown. Mostly because I've seen many, many instances of similar
code in Phobos. It's what I was alluding to when I said that
special-casing strings has caused a ripple of exceptional cases to
percolate throughout Phobos, increasing code complexity and making
things very hard to maintain. I mean, honestly, just look at that code
as linked above. Can anyone honestly claim that this is maintainable
code? For something so trivial as linear search of strings, that's some
heavy hackery just to make strings work, as contrasted with, say, the
one-line call to simpleMindedFind(). Who would have thought linear
string searching would require a typecast, a trusted hack, and
templates named "force" just to make things work?

It's code like this -- and its pervasive ugliness throughout Phobos --
that slowly eroded my original pro-autodecoding stance. It's becoming
clearer and clearer to me that it's just not pulling its own weight
against the dramatic increase in Phobos code complexity, nevermind the
detrimental performance consequences.

--
Obviously, some things aren't very obvious.

Mar 29 2016

Jack Stouffer <jack jackstouffer.com> writes:

On Tuesday, 29 March 2016 at 23:42:07 UTC, H. S. Teoh wrote:
 Believe it or not, it was only last year (IIRC, maybe the year 
 before) that Walter "discovered" that Phobos does autodecoding, 
 and got pretty upset over it.  If even Walter wasn't aware of 
 this for that long...

The link (I think this is what you're referring to) to that 
discussion: 
http://forum.dlang.org/post/lfbg06$30kh$1@digitalmars.com

It's a shame Walter never got his way. Special casing ranges like 
this is a huge mistake.

Mar 29 2016

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Wed, Mar 30, 2016 at 03:22:48AM +0000, Jack Stouffer via Digitalmars-d-learn
wrote:
 On Tuesday, 29 March 2016 at 23:42:07 UTC, H. S. Teoh wrote:
Believe it or not, it was only last year (IIRC, maybe the year
before) that Walter "discovered" that Phobos does autodecoding, and
got pretty upset over it.  If even Walter wasn't aware of this for
that long...

 
 The link (I think this is what you're referring to) to that
 discussion: http://forum.dlang.org/post/lfbg06$30kh$1@digitalmars.com
 
 It's a shame Walter never got his way. Special casing ranges like this
 is a huge mistake.

To be fair, one *could* make a case for autodecoding, if it was done
right, i.e., segmenting by graphemes, which is what is really expected
by users when they think of "characters". This would allow users to
truly think in terms of characters (in the intuitive sense) when they
work with strings. However, segmenting by graphemes is, in general,
quite expensive, and few algorithms actually need to do this. Most don't
need to -- a pretty large part of string processing consists of looking
for certain markers, mostly punctuation and control characters, and
treating the stuff in between as opaque data. If we didn't have
autodecoding, it would be a simple matter of searching for sentinel
substrings.  This also indicates that most of the work done by
autodecoding is unnecessary -- it's wasted work since most of the string
data is treated opaquely anyway.
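
As a rough sketch of that kind of marker-driven processing (the record format and field names here are invented for illustration), splitting on an ASCII delimiter can run over the raw code units and leave the non-ASCII payload untouched:

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : splitter;
import std.string : representation;

void main()
{
    // Hypothetical record: ASCII markers around opaque, non-ASCII payload.
    string line = "name=J\u00FCrgen;city=K\u00F6ln";

    // representation yields immutable(ubyte)[], so nothing is decoded while
    // scanning for the ';' marker.
    auto fields = line.representation.splitter(cast(ubyte) ';');

    assert(fields.equal([
        "name=J\u00FCrgen".representation,
        "city=K\u00F6ln".representation,
    ]));

    // Each field is still valid UTF-8 and can be viewed as a string again.
    foreach (f; fields)
        assert((cast(string) f).length > 0);
}
```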

The unfortunate situation in Phobos currently is that we are not
doing it right (segmenting by graphemes), *and* we're inefficient
because we're constantly decoding all that data that the application is
mostly going to treat as opaque data anyway.  It's the worst of both
worlds.
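
A small sketch of the three levels being contrasted here (code units, code points, graphemes), in case it helps. The function names are made up for illustration; the Phobos calls (`walkLength`, `byGrapheme`) are the ordinary ones. The sample string spells the "ë" as a plain "e" followed by a combining diaeresis (U+0308), which is where the three counts diverge.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helpers, one per level of segmentation.
size_t codeUnits(string s)  { return s.length; }                // UTF-8 bytes
size_t codePoints(string s) { return s.walkLength; }            // auto-decoded dchars
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // user-perceived characters

void main()
{
    string s = "noe\u0308l"; // 'e' + combining diaeresis

    writeln(codeUnits(s));  // 6: U+0308 takes two bytes in UTF-8
    writeln(codePoints(s)); // 5: what auto-decoding iterates over
    writeln(graphemes(s));  // 4: what a reader would call "characters"
}
```

So auto-decoding pays for decoding and still reports 5 where a user would count 4.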

I wish we could get consensus for implementing Walter's plan to phase
out autodecoding (as proposed in the linked thread above).


T

-- 
Freedom of speech: the whole world has no right *not* to hear my spouting off!

Mar 29 2016

Jack Stouffer <jack jackstouffer.com> writes:

On Wednesday, 30 March 2016 at 05:16:04 UTC, H. S. Teoh wrote:
 If we didn't have autodecoding, would be a simple matter of 
 searching for sentinel substrings.  This also indicates that 
 most of the work done by autodecoding is unnecessary -- it's 
 wasted work since most of the string data is treated opaquely 
 anyway.

Just to drive this point home, I made a very simple benchmark. 
Iterating over code points when you don't need to is 100x slower 
than iterating over code units.

import std.datetime;
import std.stdio;
import std.array;
import std.utf;
import std.uni;

enum testCount = 1_000_000;
enum var = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent justo ante, vehicula in felis vitae, finibus tincidunt dolor. Fusce sagittis.";

void test()
{
     auto a = var.array;
}

void test2()
{
     auto a = var.byCodeUnit.array;
}

void test3()
{
     auto a = var.byGrapheme.array;
}

void main()
{
     import std.conv : to;
     auto r = benchmark!(test, test2, test3)(testCount);
     auto result = to!Duration(r[0] / testCount);
     auto result2 = to!Duration(r[1] / testCount);
     auto result3 = to!Duration(r[2] / testCount);

     writeln("auto-decoding", "\t\t", result);
     writeln("byCodeUnit", "\t\t", result2);
     writeln("byGrapheme", "\t\t", result3);
}


$ ldc2 -O3 -release -boundscheck=off test.d
$ ./test
auto-decoding	        1 μs
byCodeUnit		0 hnsecs
byGrapheme		11 μs

Mar 30 2016

ag0aep6g <anonymous example.com> writes:

On 30.03.2016 19:30, Jack Stouffer wrote:
 Just to drive this point home, I made a very simple benchmark. Iterating
 over code points when you don't need to is 100x slower than iterating
 over code units.

[...]
 enum testCount = 1_000_000;
 enum var = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
 Praesent justo ante, vehicula in felis vitae, finibus tincidunt dolor.
 Fusce sagittis.";

 void test()
 {
      auto a = var.array;
 }

 void test2()
 {
      auto a = var.byCodeUnit.array;
 }

 void test3()
 {
      auto a = var.byGrapheme.array;
 }

[...]
 $ ldc2 -O3 -release -boundscheck=off test.d
 $ ./test
 auto-decoding            1 μs
 byCodeUnit        0 hnsecs
 byGrapheme        11 μs

When byCodeUnit takes no time at all, isn't 1µs infinitely slower, 
rather than 100 times slower? And I think auto-decoding's 1µs is so low 
that noise is going to mess with any ratios you make.

byCodeUnit taking no time at all suggests that it's been optimized away 
completely. To avoid that, don't hardcode the test data, and make some 
output that depends on the calculations being actually done. There was a 
little thread about this recently:
http://forum.dlang.org/post/sdmdwyhfgmbppfflkljz forum.dlang.org

I think creating arrays from the ranges is relatively costly and noisy, 
and it's not of interest when you want to compare iteration speed.

Mar 30 2016

Jack Stouffer <jack jackstouffer.com> writes:

On Wednesday, 30 March 2016 at 22:49:24 UTC, ag0aep6g wrote:
 When byCodeUnit takes no time at all, isn't 1µs infinitely 
 slower, rather than 100 times slower? And I think auto-decoding's 
 1µs is so low that noise is going to mess with any ratios you make.

It's not that it's taking no time at all, it's just that it's 
less than 1 hecto-nanosecond, which is the smallest unit that 
benchmark works with.

Observe what happens when the times are no longer averaged. I 
also made some other changes to the script:

import std.datetime;
import std.stdio;
import std.array;
import std.utf;
import std.uni;

enum testCount = 1_000_000;

void test(char[] var)
{
     auto a = var.array;
}

void test2(char[] var)
{
     auto a = var.byCodeUnit.array;
}

void test3(char[] var)
{
     auto a = var.byGrapheme.array;
}

void main()
{
     import std.conv : to;
     import std.random : uniform;
     import std.string : assumeUTF;

     // random string
     ubyte[] data;
     foreach (_; 0 .. 200)
     {
         data ~= cast(ubyte) uniform(33, 126);
     }

     auto result = to!Duration(
         benchmark!(() => test(data.assumeUTF))(testCount)[0]);
     auto result2 = to!Duration(
         benchmark!(() => test2(data.assumeUTF))(testCount)[0]);
     auto result3 = to!Duration(
         benchmark!(() => test3(data.assumeUTF))(testCount)[0]);

     writeln("auto-decoding", "\t\t", result);
     writeln("byCodeUnit", "\t\t", result2);
     writeln("byGrapheme", "\t\t", result3);
}

$ ldc2 -O3 -release -boundscheck=off test.d
$ ./test
auto-decoding		1 sec, 757 ms, and 946 μs
byCodeUnit		87 ms, 731 μs, and 8 hnsecs
byGrapheme		14 secs, 769 ms, 796 μs, and 6 hnsecs

Mar 30 2016

ag0aep6g <anonymous example.com> writes:

On 31.03.2016 07:40, Jack Stouffer wrote:
 $ ldc2 -O3 -release -boundscheck=off test.d
 $ ./test
 auto-decoding        1 sec, 757 ms, and 946 μs
 byCodeUnit        87 ms, 731 μs, and 8 hnsecs
 byGrapheme        14 secs, 769 ms, 796 μs, and 6 hnsecs

So the auto-decoding version takes about twenty times as long as the 
non-decoding one (1758 / 88 ≅ 20).

I still think the allocations from the `.array` calls should be 
eliminated to see how just iterating compares.

Here's a quick edit to get rid of the `.array`s:
----
uint accumulator = 0;
void test(char[] var)
{
     foreach (dchar d; var) accumulator += d;
}
void test2(char[] var)
{
     foreach (c; var.byCodeUnit) accumulator += c;
}
----

I get these timings then:
----
auto-decoding           642 ms, 969 μs, and 1 hnsec
byCodeUnit              84 ms, 980 μs, and 3 hnsecs
----
And 643 / 85 ≅ 8.
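
For anyone who wants to rerun the comparison, a self-contained sketch along the same lines follows. It is an adaptation, not the exact script used for the timings above: the function names `sumDecoded` and `sumCodeUnits` are made up, and it imports `benchmark` from `std.datetime.stopwatch`, which is where it lives in recent Phobos releases (the scripts in this thread use the older `std.datetime` one). The input is generated at run time and a checksum is printed, so the loops are less likely to be optimized away.

```d
import std.datetime.stopwatch : benchmark;
import std.random : uniform;
import std.stdio : writeln;
import std.utf : byCodeUnit;

uint accumulator = 0;

/// Sums code points; foreach with dchar decodes the UTF-8 on the fly.
uint sumDecoded(const(char)[] s)
{
    uint sum = 0;
    foreach (dchar d; s) sum += d;
    return sum;
}

/// Sums raw code units; no decoding takes place.
uint sumCodeUnits(const(char)[] s)
{
    uint sum = 0;
    foreach (c; s.byCodeUnit) sum += c;
    return sum;
}

void main()
{
    // Random printable ASCII, built at run time rather than hardcoded.
    char[] data;
    foreach (_; 0 .. 200)
        data ~= cast(char) uniform(33, 126);

    auto times = benchmark!(
        () { accumulator += sumDecoded(data); },
        () { accumulator += sumCodeUnits(data); })(1_000_000);

    writeln("auto-decoding\t", times[0]);
    writeln("byCodeUnit\t", times[1]);
    writeln("checksum\t", accumulator); // forces the sums to be computed
}
```

The absolute numbers will depend on compiler, flags, and machine, so only the ratio between the two lines is worth comparing.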

Mar 31 2016

Jack Stouffer <jack jackstouffer.com> writes:

On Thursday, 31 March 2016 at 12:49:57 UTC, ag0aep6g wrote:
 I get these timings then:
 ----
 auto-decoding           642 ms, 969 μs, and 1 hnsec
 byCodeUnit              84 ms, 980 μs, and 3 hnsecs
 ----
 And 643 / 85 ≅ 8.

Ok, so not as bad as 100x, but still not great by any means. I 
think I will do some investigation into why array of dchar is so 
much slower than calling array with char[].

Mar 31 2016
