
digitalmars.D - The Case For Autodecode

ag0aep6g <anonymous example.com> writes:
This is mostly me trying to make sense of the discussion.

So everyone hates autodecoding. But Andrei seems to hate it a good bit 
less than everyone else. As far as I could follow, he has one reason for 
that, which might not be clear to everyone:

char converts implicitly to dchar, so the compiler lets you search for a 
dchar in a range of chars. But that gives nonsensical results. For 
example, you won't find 'ö' in "ö".byChar, but you will find '¶' in 
there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in 
UTF-8).
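
For concreteness, here is that example as runnable code (a minimal 
sketch; byChar is std.utf.byChar, and the needle comparison goes through 
the implicit char-to-dchar conversion described above):

----
import std.algorithm.searching : canFind;
import std.utf : byChar;

void main()
{
    // "ö" is [0xC3, 0xB6] in UTF-8; no single code unit equals U+00F6
    assert(!canFind("ö".byChar, 'ö'));
    // but the trail byte 0xB6 compares equal to '¶' (U+00B6)
    assert(canFind("ö".byChar, '¶'));
}
----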

The same does not happen when searching for a grapheme in a range of 
code points, because you just can't do that accidentally. dchar does not 
implicitly convert to std.uni.Grapheme.

So autodecoding shields the user from one surprising aspect of narrow 
strings, and indeed this one kind of problem does not exist with code 
points.

So:
code units - a lot of surprises
code points - a lot of surprises minus one

I don't think this makes autodecoding actually desirable, but I do think 
it prevents a mistake that could otherwise be common.

The issue could also be avoided by making char not convert implicitly to 
dchar. I would like that, but it would of course be another substantial 
breaking change.

At Andrei: Apologies if I'm misrepresenting your position. If you have 
other arguments in favor of autodecoding, they haven't gotten through to me.

At everyone: Apologies if I'm just stating the obvious here. I needed 
this pointed out, and it happened in the depths of the other thread. So 
maybe this is an aspect others haven't considered either.

Finally, this is not the only argument in favor of *keeping* 
autodecoding, of course. Not wanting to break user code is the big one 
there, I guess.
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 7:24 AM, ag0aep6g wrote:
 This is mostly me trying to make sense of the discussion.

 So everyone hates autodecoding. But Andrei seems to hate it a good bit
 less than everyone else. As far as I could follow, he has one reason for
 that, which might not be clear to everyone:
I don't hate autodecoding. What I hate is that char[] autodecodes. If strings were some auto-decoding type that wasn't immutable(char)[], that would be absolutely fine with me. In fact, I see this as the only way to fix this, since it shouldn't break any code.
 char converts implicitly to dchar, so the compiler lets you search for a
 dchar in a range of chars. But that gives nonsensical results. For
 example, you won't find 'ö' in  "ö".byChar, but you will find '¶' in
 there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in
 UTF-8).
Question: why couldn't the compiler emit (in non-release builds) a 
runtime check to make sure you aren't converting non-ASCII characters to 
dchar? That is, like out-of-bounds checking, but for char -> dchar 
conversions, or any other invalid mechanism?

Yep, it's going to kill a lot of performance. But it's going to catch a 
lot of problems.

One thing to point out here is that autodecoding only happens on arrays, 
and even then only in certain cases.

-Steve
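
For illustration, a sketch of what such a check might look like if 
conversions were lowered to a runtime helper (the function is 
hypothetical, not part of druntime):

----
dchar checkedToDchar(char c)
{
    // a non-ASCII code unit on its own is part of a multibyte
    // sequence; widening it to dchar changes its meaning
    assert(c < 0x80, "char -> dchar conversion of non-ASCII code unit");
    return c;
}
----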
Jun 03 2016
Kagamin <spam here.lot> writes:
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
 Finally, this is not the only argument in favor of *keeping* 
 autodecoding, of course. Not wanting to break user code is the 
 big one there, I guess.
A lot of the discussion is disagreement on what correct Unicode support 
means. I see 4 possible meanings here:

1. Implemented according to spec.
2. Provides level 1 Unicode support.
3. Provides level 2 Unicode support.
4. Achieves the goal of Unicode, i.e. text processing according to 
natural language rules.
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 03:56 PM, Kagamin wrote:
 A lot of discussion is disagreement on understanding of correctness of
 unicode support. I see 4 possible meanings here:
 1. Implemented according to spec.
 2. Provides level 1 unicode support.
 3. Provides level 2 unicode support.
 4. Achieves the goal of unicode, i.e. text processing according to
 natural language rules.
Speaking of that, the document that Walter dug up [1], which talks about 
support levels, is about regular expression engines in particular. It's 
not about general language support. The version he linked to is also 
pretty old.

A more recent revision [2] calls level 1 (code points) the "minimally 
useful level of support", speaks warmly about level 2 (graphemes), and 
says that level 3 (locale-dependent behavior) is "only useful for 
specific applications".

[1] http://unicode.org/reports/tr18/tr18-5.1.html
[2] http://www.unicode.org/reports/tr18/tr18-17.html
Jun 03 2016
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
 This is mostly me trying to make sense of the discussion.

 So everyone hates autodecoding. But Andrei seems to hate it a 
 good bit less than everyone else. As far as I could follow, he 
 has one reason for that, which might not be clear to everyone:

 char converts implicitly to dchar, so the compiler lets you 
 search for a dchar in a range of chars. But that gives 
 nonsensical results. For example, you won't find 'ö' in  
 "ö".byChar, but you will find '¶' in there ('¶' is U+00B6, 'ö' 
 is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in UTF-8).
You mean that '¶' is represented internally as 1 byte 0xB6 and that it 
can be handled as such without error? This would mean that char literals 
are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.

Sorry if I misunderstood, I'm only starting to learn D.
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 1:51 PM, Patrick Schluter wrote:
 On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
 This is mostly me trying to make sense of the discussion.

 So everyone hates autodecoding. But Andrei seems to hate it a good bit
 less than everyone else. As far as I could follow, he has one reason
 for that, which might not be clear to everyone:

 char converts implicitly to dchar, so the compiler lets you search for
 a dchar in a range of chars. But that gives nonsensical results. For
 example, you won't find 'ö' in  "ö".byChar, but you will find '¶' in
 there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6
 in UTF-8).
 You mean that '¶' is represented internally as 1 byte 0xB6 and that it
 can be handled as such without error? This would mean that char literals
 are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.
 Sorry if I misunderstood, I'm only starting to learn D.
Not if '¶' is a dchar.

What is happening in the example is that find is looking at the 
"ö".byChar range and saying "hm... can I compare dchar('¶') to char? 
Well, char implicitly casts to dchar, so I'm good!", but a direct cast 
of the bits from char does NOT mean the same thing as a dchar. It has to 
go through a decoding first.

The real problem here is that char implicitly casts to dchar. That 
should not be allowed.

-Steve
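
To spell out the bit-widening problem (a minimal sketch, assuming it 
runs inside a function):

----
char c = "ö"[1];  // 0xB6, the trail byte of 'ö' in UTF-8
dchar d = c;      // implicit conversion just widens the bits
assert(d == '¶'); // U+00B6: a different character entirely
----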
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 08:36 PM, Steven Schveighoffer wrote:
 but a direct cast
 of the bits from char does NOT mean the same thing as a dchar.
That gives me an idea. A bitwise reinterpretation of int to float is 
nonsensical, too. Yet int implicitly converts to float and (for small 
values) preserves the meaning. I mean, implicit conversion doesn't have 
to mean bitwise reinterpretation.

How about replacing non-standalone code units with the replacement 
character (U+FFFD) in implicit widening conversions? For example:

----
char c = "ö"[0];
wchar w = c;
assert(w == '\uFFFD');
----

Would probably just be a band-aid, though.
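
For concreteness, the proposed rule as a conversion helper (a sketch; 
the function name is made up):

----
wchar widenChar(char c)
{
    // code units below 0x80 stand alone and keep their meaning;
    // everything else is part of a multibyte sequence
    return c < 0x80 ? cast(wchar)c : '\uFFFD';
}
----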
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 2:55 PM, ag0aep6g wrote:
 On 06/03/2016 08:36 PM, Steven Schveighoffer wrote:
 but a direct cast
 of the bits from char does NOT mean the same thing as a dchar.
 That gives me an idea. A bitwise reinterpretation of int to float is
 nonsensical, too. Yet int implicitly converts to float and (for small
 values) preserves the meaning. I mean, implicit conversion doesn't have
 to mean bitwise reinterpretation.
I'm pretty sure the CPU handles this, though.
 How about replacing non-standalone code units with replacement character
 (U+FFFD) in implicit widening conversions?

 For example:

 ----
 char c = "ö"[0];
 wchar w = c;
 assert(w == '\uFFFD');
 ----

 Would probably just be band-aid, though.
Except many chars *do* properly convert. This should work:

char c = 'a';
dchar d = c;
assert(d == 'a');

As I mentioned in my earlier reply, some kind of "bounds checking" for 
the conversion could be a possibility.

Hm... an interesting possibility:

dchar _dchar_convert(char c)
{
    return cast(int)cast(byte)c; // get sign extension for non-ASCII
}

-Steve
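
A quick check of what that does to a non-ASCII code unit (a sketch, 
assuming it runs inside a function together with the helper above):

----
assert(_dchar_convert('a') == 'a');       // ASCII survives unchanged
dchar d = _dchar_convert(cast(char)0xC3); // lead byte of 'ö'
assert(d == 0xFFFFFFC3);                  // above U+10FFFF, so invalid
----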
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 3:09 PM, Steven Schveighoffer wrote:
 Hm... an interesting possiblity:

 dchar _dchar_convert(char c)
 {
    return cast(int)cast(byte)c; // get sign extension for non-ASCII
 }
Allows this too:

dchar d = char.init; // calls conversion function
assert(d == dchar.init);

:)

-Steve
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 3:12 PM, Steven Schveighoffer wrote:
 On 6/3/16 3:09 PM, Steven Schveighoffer wrote:
 Hm... an interesting possiblity:

 dchar _dchar_convert(char c)
 {
    return cast(int)cast(byte)c; // get sign extension for non-ASCII
 }
 Allows this too:

 dchar d = char.init; // calls conversion function
 assert(d == dchar.init);
Hm... actually doesn't work. dchar.init is 0x0000ffff.

-Steve
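
Checking the arithmetic (a sketch, inside a function):

----
dchar d = cast(int)cast(byte)char.init; // char.init is 0xFF
assert(d == 0xFFFFFFFF);                // sign-extended to all 1s
assert(dchar.init == 0xFFFF);           // so the two differ
----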
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:
 Except many chars *do* properly convert. This should work:

 char c = 'a';
 dchar d = c;
 assert(d == 'a');
Yeah, that's what I meant by "standalone code unit". Code units that on their own represent a code point would not be touched.
 As I mentioned in my earlier reply, some kind of "bounds checking" for
 the conversion could be a possibility.

 Hm... an interesting possiblity:

 dchar _dchar_convert(char c)
 {
     return cast(int)cast(byte)c; // get sign extension for non-ASCII
 }
So when the char's most significant bit is set, this fills the upper 
bits of the dchar with 1s, right? And a set most significant bit in a 
char means it's part of a multibyte sequence, while in a dchar it means 
that the dchar is invalid, because they only go up to U+10FFFF. Huh. 
Neat.

Does it work for char -> wchar, too?
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 3:52 PM, ag0aep6g wrote:
 On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:
 Except many chars *do* properly convert. This should work:

 char c = 'a';
 dchar d = c;
 assert(d == 'a');
 Yeah, that's what I meant by "standalone code unit". Code units that on
 their own represent a code point would not be touched.
But you can get a standalone code unit that is part of a coded sequence 
quite easily:

void foo(string s)
{
    auto x = s[0];
    dchar d = x;
}
 As I mentioned in my earlier reply, some kind of "bounds checking" for
 the conversion could be a possibility.

 Hm... an interesting possiblity:

 dchar _dchar_convert(char c)
 {
     return cast(int)cast(byte)c; // get sign extension for non-ASCII
 }
 So when the char's most significant bit is set, this fills the upper
 bits of the dchar with 1s, right? And a set most significant bit in a
 char means it's part of a multibyte sequence, while in a dchar it means
 that the dchar is invalid, because they only go up to U+10FFFF. Huh.
 Neat.
An interesting thing is that I think the CPU can do this for us.
 Does it work for for char -> wchar, too?
It does not. 0xffff is a valid code point, and I think so are all the 
other values that would result. In fact, I think there are no invalid 
code units for wchar.

Of course, a surrogate pair requires another code unit to be valid, so 
we can at least promote a char to a wchar in the surrogate pair range 
(and always in the low or high surrogate range, so a naive transcoding 
of a char range to wchar will result in an invalid sequence if there are 
any non-ASCII characters).

So we need the most efficient logic that does this:

if(c & 0x80)
    return wchar(0xd800 + c);
else
    return wchar(c);

More expensive, but more correct!

wchar to dchar conversion is pretty sound, as the surrogate pairs are 
invalid code points for dchar.

-Steve
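
Wrapped up as a helper with a quick check (a sketch; the function name 
is made up):

----
wchar charToWchar(char c)
{
    // map non-ASCII code units into the high-surrogate range,
    // which is never valid as a standalone wchar
    return (c & 0x80) ? cast(wchar)(0xD800 + c) : cast(wchar)c;
}

unittest
{
    assert(charToWchar('a') == 'a');
    assert(charToWchar(cast(char)0xB6) == 0xD8B6); // unpaired surrogate
}
----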
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:
 But you can get a standalone code unit that is part of a coded sequence
 quite easily

 foo(string s)
 {
     auto x = s[0];
     dchar d = x;
 }
I don't think we're disagreeing on anything. I'm calling UTF-8 code 
units below 0x80 "standalone" code units. They're never part of 
multibyte sequences. Your _dchar_convert returns them unscathed. Higher 
code units are always part of multibyte sequences (or invalid already). 
Your function returns invalid code points for them.

_dchar_convert does exactly what I meant, except that I had in mind 
returning the replacement character for non-standalone code units. But I 
see that that may not be feasible, and it's probably not necessary.

[...]
 So we need most efficient logic that does this:

 if(c & 0x80)
      return wchar(0xd800 + c);
Is this going to be faster than returning a constant invalid wchar?
 else
      return wchar(c);

 More expensive, but more correct!

 wchar to dchar conversion is pretty sound, as the surrogate pairs are
 invalid code points for dchar.

 -Steve
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 4:39 PM, ag0aep6g wrote:
 On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:
 But you can get a standalone code unit that is part of a coded sequence
 quite easily

 foo(string s)
 {
     auto x = s[0];
     dchar d = x;
 }
 I don't think we're disagreeing on anything. I'm calling UTF-8 code
 units below 0x80 "standalone" code units. They're never part of
 multibyte sequences. Your _dchar_convert returns them unscathed.
Ah, I thought you meant standalone as in it was assigned to a standalone char variable vs. part of an array or range. My mistake. Re-reading your original message, I see that should have been clear to me...
 So we need most efficient logic that does this:

 if(c & 0x80)
      return wchar(0xd800 + c);
 Is this going to be faster than returning a constant invalid wchar?
No, but I like the idea of preserving the erroneous character you tried 
to convert.

But is there an invalid wchar? I looked through the Wikipedia article on 
UTF-16, and it didn't seem to say there was one.

If we use U+FFFD, that signifies a coding problem but is still a valid 
code point. However, a wchar in the D800 - DBFF range without being 
followed by a code unit in the DC00 - DFFF range is an invalid sequence. 
D throws if it encounters such a thing.

-Steve
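
That last claim is easy to check with std.utf.validate, which throws a 
UTFException on invalid sequences (a small sketch):

----
import std.exception : assertThrown;
import std.utf : UTFException, validate;

unittest
{
    wchar[] s = [cast(wchar)0xD800]; // unpaired high surrogate
    assertThrown!UTFException(validate(s));
}
----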
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 11:13 PM, Steven Schveighoffer wrote:
 No, but I like the idea of preserving the erroneous character you tried
 to convert.
Makes sense.
 But is there an invalid wchar? I looked through the wikipedia article on
 UTF 16, and it didn't seem to say there was one.

 If we use U+FFFD, that signifies a coding problem but is still a valid
 code point. However, doing a wchar in the D800 - D8FF range without
 being followed by a code unit in the DC00 - DFFF range is an invalid
 sequence. D throws if it encounters such a thing.
The Unicode FAQ has an answer to this exact question, but it also only 
says that "[u]npaired surrogates are invalid" [1].

It also mentions "noncharacters" which are "permanently reserved [...] 
for internal use". "For example, they might be used internally as a 
particular kind of object placeholder in a string." [2] - Not too bad.

And then there is the replacement character, of course. "[U]sed to 
replace an incoming character whose value is unknown or unrepresentable 
in Unicode" [3].

[1] http://www.unicode.org/faq/utf_bom.html#utf16-7
[2] http://www.unicode.org/faq/private_use.html#noncharacters
[3] http://www.fileformat.info/info/unicode/char/0fffd/index.htm
Jun 03 2016
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 3 June 2016 at 20:18:31 UTC, Steven Schveighoffer 
wrote:
 On 6/3/16 3:52 PM, ag0aep6g wrote:
 On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:
 Except many chars *do* properly convert. This should work:

 char c = 'a';
 dchar d = c;
 assert(d == 'a');
Yeah, that's what I meant by "standalone code unit". Code units that on their own represent a code point would not be touched.
 But you can get a standalone code unit that is part of a coded sequence
 quite easily:

 void foo(string s)
 {
     auto x = s[0];
     dchar d = x;
 }
 As I mentioned in my earlier reply, some kind of "bounds 
 checking" for
 the conversion could be a possibility.

 Hm... an interesting possiblity:

 dchar _dchar_convert(char c)
 {
     return cast(int)cast(byte)c; // get sign extension for 
 non-ASCII
 }
 So when the char's most significant bit is set, this fills the upper
 bits of the dchar with 1s, right? And a set most significant bit in a
 char means it's part of a multibyte sequence, while in a dchar it means
 that the dchar is invalid, because they only go up to U+10FFFF. Huh.
 Neat.
 An interesting thing is that I think the CPU can do this for us.
 Does it work for for char -> wchar, too?
 It does not. 0xffff is a valid code point, and I think so are all the
 other values that would result. In fact, I think there are no invalid
 code units for wchar.
https://codepoints.net/specials

U+ffff would be fine, better at least than a surrogate.
Jun 04 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/4/16 4:57 AM, Patrick Schluter wrote:
 On Friday, 3 June 2016 at 20:18:31 UTC, Steven Schveighoffer wrote:
 On 6/3/16 3:52 PM, ag0aep6g wrote:
 Does it work for for char -> wchar, too?
 It does not. 0xffff is a valid code point, and I think so are all the
 other values that would result. In fact, I think there are no invalid
 code units for wchar.
 https://codepoints.net/specials

 U+ffff would be fine, better at least than a surrogate.
U+ffff is still a valid code point, even if it's not assigned any 
Unicode character. But the result would be U+ff80 to U+ffff, and I'm 
sure some of those are valid code points.

-Steve
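
Spelling out where that range comes from, if the sign-extension trick 
were applied to wchar (a sketch, inside a function):

----
char c = 0x80;                    // lowest non-ASCII code unit
wchar w = cast(short)cast(byte)c; // sign-extend to 16 bits
assert(w == 0xFF80);              // a valid code point, so no error marker
----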
Jun 04 2016
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 3 June 2016 at 18:36:45 UTC, Steven Schveighoffer 
wrote:
 The real problem here is that char implicitly casts to dchar. 
 That should not be allowed.
Indeed.
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 07:51 PM, Patrick Schluter wrote:
 You mean that '¶' is represented internally as 1 byte 0xB6 and that it
 can be handled as such without error? This would mean that char literals
 are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.
 Sorry if I misunderstood, I'm only starting to learn D.
There is no single char for '¶', that's right, and D gets that right. 
That's not what happens. But there is a single wchar for it. wchar is a 
UTF-16 code unit, 2 bytes. UTF-16 encodes '¶' as a single code unit, so 
that's correct.

The problem is that you can accidentally search for a wchar in a range 
of chars. Every char is compared to the wchar by numeric value. But the 
numeric values of a char don't mean the same as those of a wchar, so you 
get nonsensical results.

A similar implicit conversion lets you search for a large number in a 
byte[]:

----
byte[] arr = [1, 2, 3];
foreach(x; arr)
    if (x == 1000)
        writeln("found it!");
----

You won't ever find 1000 in a byte[], of course. The byte type simply 
can't store the value. But you can compare a byte with an int. And that 
comparison is meaningful, unlike the comparison of a char with a wchar.

You can also produce false positives with numeric types, by mixing 
signed and unsigned types:

----
int[] arr = [1, -1, 3];
foreach(x; arr)
    if (x == uint.max)
        writeln("found it!");
----

uint.max is a large number, -1 is a small number. They're considered 
equal here because of an implicit conversion that messes with the 
meaning of the bits.

False negatives are not possible with numeric types. At least not in the 
same way as with differently sized Unicode code units.
Jun 03 2016
Observer <here somewhere.net> writes:
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
 Finally, this is not the only argument in favor of *keeping* 
 autodecoding, of course. Not wanting to break user code is the 
 big one there, I guess.
I'm not familiar with the details of autodecoding, but one thing strikes 
me about this whole discussion. It seems to me that it is just nibbling 
around the edges of how one should implement full Unicode support. And 
it seems to me that that topic, and how autodecoding plays into it, 
won't be properly understood except by comparison with mature software 
that has undergone many years of testing and revision. Two examples 
stand out to me:

* Perl 5 has undergone a gradual evolution, over many releases, to get 
this right. It might also be the case that Perl 6 is even cleaner.

* The International Components for Unicode (ICU) package, with supported 
libraries for C, C++, and Java. This is the industry-standard definition 
of what it means to handle Unicode in these languages. See 
http://site.icu-project.org/ for details.

Both of these implementations have seen many years of real-world use, so 
I would tend to look to them for guidance over trying to develop my own 
opinion based on some small set of particular use cases I might happen 
to have encountered.
Jun 04 2016