digitalmars.D - "tuple unpacking" with zip?

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

In Python I can do:

ints = [1, 2, 3]
chars = ['a', 'b', 'c']
for i, c in zip(ints, chars):
    print(i, c)

Output:

1 a
2 b
3 c

But in D if I try:

import std.stdio, std.range;
void main ()
{
    int [] ints = [1, 2, 3];
    char [] chars = ['a', 'b', 'c'];
    foreach(int i, char c; zip(ints, chars))
        writeln(i, ' ', c);
}

I get at the foreach line:

Error: cannot infer argument types

But if I read the grammar at 
http://dlang.org/statement.html#ForeachStatement correctly, the foreach 
syntax does permit any number of identifiers, so I'm guessing that the 
limitation is with http://dlang.org/phobos/std_range.html#zip which says 
items can only be accessed by indexing.

What would need to change in std.range.Zip to get the expected functionality?

--

Oct 21 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Wednesday, 21 October 2015 at 10:08:24 UTC, Shriramana Sharma 
wrote:
 In Python I can do:

 ints = [1, 2, 3]
 chars = ['a', 'b', 'c']
 for i, c in zip(ints, chars):
     print(i, c)

 Output:

 1 a
 2 b
 3 c

 But in D if I try:

 import std.stdio, std.range;
 void main ()
 {
     int [] ints = [1, 2, 3];
     char [] chars = ['a', 'b', 'c'];
     foreach(int i, char c; zip(ints, chars))
         writeln(i, ' ', c);
 }

 I get at the foreach line:

 Error: cannot infer argument types

 But if I read the grammar at 
 http://dlang.org/statement.html#ForeachStatement correctly, the 
 foreach syntax does permit any number of identifiers, so I'm 
 guessing that the limitation is with 
 http://dlang.org/phobos/std_range.html#zip which says items can 
 only be accessed by indexing.

 What would need to change in std.range.Zip to get the expected 
 functionality?

static assert(is(ElementType!string == dchar));

     foreach(int i, dchar c; zip(ints, chars))
or
     foreach(i, c; zip(ints, chars))

will work fine.
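
As a minimal sketch of the fix described above, here is the program from the question with only the loop header changed. The variable names come from the question; the static asserts are added here purely to show which types the compiler infers.

```d
import std.range : zip;
import std.stdio : writeln;

void main()
{
    int[] ints = [1, 2, 3];
    char[] chars = ['a', 'b', 'c'];

    // char[] is autodecoded when used as a range, so zip(ints, chars)
    // yields Tuple!(int, dchar). Declaring the second variable as dchar
    // matches that element type.
    foreach (int i, dchar c; zip(ints, chars))
        writeln(i, ' ', c);

    // Leaving the types out lets the compiler infer them.
    foreach (i, c; zip(ints, chars))
    {
        static assert(is(typeof(i) == int));
        static assert(is(typeof(c) == dchar));
    }
}
```

The first loop prints the same three lines as the Python version.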

Oct 21 2015

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

John Colvin wrote:

 static assert(is(ElementType!string == dchar));

But this is false, no? Since ElementType!string is char and not dchar?

 foreach(int i, dchar c; zip(ints, chars))
 or
 foreach(i, c; zip(ints, chars))

What's the difference between char and dchar in this particular context?

--

Oct 21 2015

John Colvin <john.loughran.colvin gmail.com> writes:

On Wednesday, 21 October 2015 at 12:07:12 UTC, Shriramana Sharma 
wrote:
 John Colvin wrote:

 static assert(is(ElementType!string == dchar));

 But this is false, no? Since ElementType!string is char and not 
 dchar?

No. char[], wchar[] and dchar[] all have ElementType dchar. 
Strings are special for ranges. It's a bad mistake, but it is 
what it is and apparently won't be changed.
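
The special-casing can be checked at compile time. This is a small sketch assuming the autodecoding Phobos discussed in this thread: ElementType reports what the range primitives yield, while ElementEncodingType reports what the array actually stores.

```d
import std.range.primitives : ElementEncodingType, ElementType;

// Viewed as ranges, all three string flavours yield decoded code points.
static assert(is(ElementType!(char[]) == dchar));
static assert(is(ElementType!(wchar[]) == dchar));
static assert(is(ElementType!(dchar[]) == dchar));

// Viewed as arrays, they still store their own code unit type.
static assert(is(ElementEncodingType!(char[]) == char));
static assert(is(ElementEncodingType!(wchar[]) == wchar));
static assert(is(typeof((char[]).init[0]) == char));

void main() {}
```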

 foreach(int i, dchar c; zip(ints, chars))
 or
 foreach(i, c; zip(ints, chars))

 What's the difference between char and dchar in this particular context?

See http://dlang.org/type.html and  
http://dlang.org/arrays.html#strings

Oct 21 2015

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

John Colvin wrote:

 But this is false, no? Since ElementType!string is char and not
 dchar?

 
 No. char[], wchar[] and dchar[] all have ElementType dchar.
 Strings are special for ranges. It's a bad mistake, but it is
 what it is and apparently won't be changed.

Why is it a mistake? That seems a very sane thing, although somewhat quirky. 
Since ElementType is a Range primitive, and apparently iterating through a 
string as a range will produce each semantically meaningful Unicode 
character rather than each UTF-8 or UTF-16 codepoint, it does make sense to 
do this.

 foreach(int i, dchar c; zip(ints, chars))
 or
 foreach(i, c; zip(ints, chars))

 What's the difference between char and dchar in this particular context?

 
 See http://dlang.org/type.html and
 http://dlang.org/arrays.html#strings

Actually I found the explanation at 
http://dlang.org/phobos/std_range_primitives.html#ElementType.

--

Oct 21 2015

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, 21 October 2015 at 14:13:43 UTC, Shriramana Sharma 
wrote:
 John Colvin wrote:

 But this is false, no? Since ElementType!string is char and 
 not dchar?

 
 No. char[], wchar[] and dchar[] all have ElementType dchar. 
 Strings are special for ranges. It's a bad mistake, but it is 
 what it is and apparently won't be changed.

 Why is it a mistake? That seems a very sane thing, although 
 somewhat quirky. Since ElementType is a Range primitive, and 
 apparently iterating through a string as a range will produce 
 each semantically meaningful Unicode character rather than each 
 UTF-8 or UTF-16 codepoint, it does make sense to do this.

LOL. This could open up a huge discussion if you're not careful. 
A code point is not necessarily a full character. Operating on 
individual code units is generally wrong, because you frequently 
need multiple code units to get a full code point. Similarly, to 
get a full character - what's called a grapheme - you sometimes 
need multiple code points. To make matters even worse, the same 
grapheme can often be represented by different combinations of 
code points (e.g. an accented e can be represented as a single 
code point or it could be represented with the code point for e 
and the code point for the accent - and depending on the Unicode 
normalization form being used, the order of those code points 
could differ).
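
The three levels can be counted side by side. The sketch below uses the accented e from the paragraph above in its precomposed and decomposed forms; the two variable names are illustrative only.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    string precomposed = "\u00E9";  // e-acute as a single code point
    string decomposed = "e\u0301";  // e followed by a combining acute accent

    // Code units (UTF-8 bytes): what .length counts on a string.
    assert(precomposed.length == 2);
    assert(decomposed.length == 3);

    // Code points: what an autodecoded string range iterates over.
    assert(precomposed.walkLength == 1);
    assert(decomposed.walkLength == 2);

    // Graphemes: both spellings are one user-perceived character.
    assert(precomposed.byGrapheme.walkLength == 1);
    assert(decomposed.byGrapheme.walkLength == 1);

    // The two spellings compare unequal until they are normalized.
    assert(precomposed != decomposed);
    assert(normalize!NFC(decomposed) == precomposed);
}
```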

So, operating at the code point level does _not_ actually make 
your program correct. It gets you closer, but you're still 
operating on pieces of characters - and it's arguably more 
pernicious, because more of the common characters "just work" 
while still not ensuring that all of them work, making it harder 
to catch when you screw it up.

However, operating at the grapheme level is incredibly expensive. 
In fact, operating at the code point level is often unnecessarily 
expensive. So, if you care about efficiency, you want to be 
operating at the code unit level as much as possible. And because 
most string code doesn't actually need to operate on individual 
characters, operating at the code unit level is 
frequently enough (especially if your strings have had their code 
points normalized so that the same characters will always result 
in the same sequence of code units).

So, what we have with Phobos is neither fast nor correct. It's 
constantly decoding code points when it's completely unnecessary 
(Phobos has to special case its algorithms for strings all over 
the place to avoid unnecessary decoding). And because ranges deal 
at the code point level by default, they're not correct. Really, 
code should either be operating at the code unit level or the 
grapheme level. You're getting the worst of both worlds when 
operating at the code point level.

Rather, what's really needed is for the programmer to know enough 
about Unicode to know when they should be operating on code 
units or graphemes (or occasionally code points) and then 
explicitly do that - which is why we have 
std.utf.byCodeUnit/byChar/byWchar/byDchar and 
std.uni.byCodePoint/byGrapheme. But as soon as you use those, you 
lose out on the specializations that operate on arrays as well as 
any other code that specifically operates on arrays - even when 
you just want to operate on a char[] as a range of char.
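
For the zip example that started this thread, opting into code units could look like the sketch below. It assumes std.utf.byCodeUnit as shipped with Phobos; with it, the loop header from the original question compiles unchanged because the element type is char again.

```d
import std.range : ElementType, zip;
import std.stdio : writeln;
import std.utf : byCodeUnit;

void main()
{
    int[] ints = [1, 2, 3];
    char[] chars = ['a', 'b', 'c'];

    // byCodeUnit wraps the array in a range that does not autodecode,
    // so its element type is the code unit type, char.
    auto units = chars.byCodeUnit;
    static assert(is(ElementType!(typeof(units)) == char));

    // Same loop header as in the original question.
    foreach (int i, char c; zip(ints, units))
        writeln(i, ' ', c);
}
```

The trade-off described above still applies: units is a wrapper range, not a char[], so code that specifically expects an array no longer accepts it.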

The reality of the matter is that _most_ algorithms would work 
just fine with treating char[] as a range of char so long as they 
do explicit decoding when necessary (and it often wouldn't be 
necessary), but instead, we're constantly autodecoding, because 
that's what front and popFront do for arrays of char or wchar.
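
What front and popFront do for a narrow string can be seen directly. This is a small sketch assuming the autodecoding range primitives in std.range.primitives; the string literal is just an example.

```d
import std.range.primitives : front, popFront;

void main()
{
    string s = "\u00E9a";  // e-acute (two UTF-8 code units) followed by 'a'

    // front decodes and returns a whole code point as a dchar...
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == '\u00E9');

    // ...while plain indexing sees the raw first code unit.
    assert(s[0] == 0xC3);

    // popFront skips every code unit of the decoded code point.
    s.popFront();
    assert(s == "a");
}
```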

When Andrei came up with the current scheme, he didn't know about 
graphemes. He thought that code points were always full 
characters. And if that were the case, the way Phobos works would 
make sense. It might be slower by default, but it would be 
correct, and you could special-case on strings to operate on them 
more efficiently if you needed the extra efficiency. However, 
because code points are _not_ necessarily full characters, we're 
taking an efficiency hit without getting full correctness. 
Instead, we're getting the illusion of correctness. It's like how 
Andrei explained in TDPL that UTF-16 is worse than UTF-8, because 
it's harder to catch when you screw up and chop a character in 
half. Only, it turns out that that applies to UTF-32 as well.
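
The UTF-32 case can be sketched as follows: slicing between two code points always produces valid UTF-32, so nothing flags that a character was cut in half. The example string is the decomposed accented e mentioned earlier in this post.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : validate;

void main()
{
    dstring s = "e\u0301"d;  // one user-perceived character, two code points
    assert(s.length == 2);
    assert(s.byGrapheme.walkLength == 1);

    // Taking the first code point silently drops the accent.
    dstring chopped = s[0 .. 1];
    validate(chopped);       // does not throw: the slice is valid UTF-32
    assert(chopped == "e"d);
}
```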

- Jonathan M Davis

Oct 21 2015

Jacob Carlborg <doob me.com> writes:

On 2015-10-21 16:13, Shriramana Sharma wrote:

 Why is it a mistake? That seems a very sane thing, although somewhat quirky.
 Since ElementType is a Range primitive, and apparently iterating through a
 string as a range will produce each semantically meaningful Unicode
 character rather than each UTF-8 or UTF-16 codepoint, it does make sense to
 do this.

The short answer is: it's not 100% correct, it's slower and not always 
needed.

-- 
/Jacob Carlborg

Oct 21 2015

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 21-Oct-2015 20:35, Jacob Carlborg wrote:
 On 2015-10-21 16:13, Shriramana Sharma wrote:

 Why is it a mistake? That seems a very sane thing, although somewhat
 quirky.
 Since ElementType is a Range primitive, and apparently iterating
 through a
 string as a range will produce each semantically meaningful Unicode
 character rather than each UTF-8 or UTF-16 codepoint, it does make
 sense to
 do this.

 The short answer is: it's not 100% correct, it's slower and not always
 needed.

Allow me to correct it to:
- decoding alone often is not enough
- it's slow
- in many cases not decoding is possible and fast

Therefore many std.algorithm functions avoid decoding behind the scenes.

-- 
Dmitry Olshansky

Oct 21 2015

Shriramana Sharma <samjnaa_dont_spam_me gmail.com> writes:

Shriramana Sharma wrote:

 iterating through a
 string as a range will produce each semantically meaningful Unicode
 character rather than each UTF-8 or UTF-16 codepoint, it does make sense
 to do this.

Dear me... I meant UTF-8 encoded byte, rather than "codepoint", since all 
characters have codepoints, but not all codepoints (such as the surrogates) 
correspond to characters.

--

Oct 21 2015

anonymous <anonymous example.com> writes:

On Wednesday, October 21, 2015 06:21 PM, Shriramana Sharma wrote:

 Dear me... I meant UTF-8 encoded byte, rather than "codepoint",

Also known as: code unit.

Oct 21 2015

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 21-Oct-2015 19:21, Shriramana Sharma wrote:
 Shriramana Sharma wrote:

 iterating through a
 string as a range will produce each semantically meaningful Unicode
 character rather than each UTF-8 or UTF-16 codepoint, it does make sense
 to do this.

 Dear me... I meant UTF-8 encoded byte, rather than "codepoint", since all
 characters have codepoints, but not all codepoints (such as the surrogates)
 correspond to characters.

Aye, careful here. Unicode is a slippery road... Not even talking of 
code units and code points, there are things like "abstract character" 
and "user-perceived character". Well, I tried my best to summarize most 
of it at:
http://dlang.org/phobos/std_uni.html

-- 
Dmitry Olshansky

Oct 21 2015
