
digitalmars.D - The Case For Autodecode

ag0aep6g <anonymous example.com> writes:
This is mostly me trying to make sense of the discussion.

So everyone hates autodecoding. But Andrei seems to hate it a good bit 
less than everyone else. As far as I could follow, he has one reason for 
that, which might not be clear to everyone:

char converts implicitly to dchar, so the compiler lets you search for a 
dchar in a range of chars. But that gives nonsensical results. For 
example, you won't find 'ö' in "ö".byChar, but you will find '¶' in 
there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in 
UTF-8).
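
For concreteness, here is that example as runnable code (a minimal 
sketch; byChar is std.utf.byChar, and the needle comparison goes through 
the implicit char-to-dchar conversion described above):

----
import std.algorithm.searching : canFind;
import std.utf : byChar;

void main()
{
    // "ö" is [0xC3, 0xB6] in UTF-8; no single code unit equals U+00F6
    assert(!canFind("ö".byChar, 'ö'));
    // but the trail byte 0xB6 compares equal to '¶' (U+00B6)
    assert(canFind("ö".byChar, '¶'));
}
----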

The same does not happen when searching for a grapheme in a range of 
code points, because you just can't do that accidentally. dchar does not 
implicitly convert to std.uni.Grapheme.

So autodecoding shields the user from one surprising aspect of narrow 
strings, and indeed this one kind of problem does not exist with code 
points.

So:
code units - a lot of surprises
code points - a lot of surprises minus one

I don't think this makes autodecoding actually desirable, but I do think 
it prevents a mistake that could otherwise be common.

The issue could also be avoided by making char not convert implicitly to 
dchar. I would like that, but it would of course be another substantial 
breaking change.

At Andrei: Apologies if I'm misrepresenting your position. If you have 
other arguments in favor of autodecoding, they haven't gotten through to me.

At everyone: Apologies if I'm just stating the obvious here. I needed 
this pointed out, and it happened in the depths of the other thread. So 
maybe this is an aspect others haven't considered either.

Finally, this is not the only argument in favor of *keeping* 
autodecoding, of course. Not wanting to break user code is the big one 
there, I guess.
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 7:24 AM, ag0aep6g wrote:
 This is mostly me trying to make sense of the discussion.

 So everyone hates autodecoding. But Andrei seems to hate it a good bit
 less than everyone else. As far as I could follow, he has one reason for
 that, which might not be clear to everyone:
I don't hate autodecoding. What I hate is that char[] autodecodes. If strings were some auto-decoding type that wasn't immutable(char)[], that would be absolutely fine with me. In fact, I see this as the only way to fix this, since it shouldn't break any code.
 char converts implicitly to dchar, so the compiler lets you search for a
 dchar in a range of chars. But that gives nonsensical results. For
 example, you won't find 'ö' in  "ö".byChar, but you will find '¶' in
 there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in
 UTF-8).
Question: why couldn't the compiler emit (in non-release builds) a 
runtime check to make sure you aren't converting non-ASCII characters to 
dchar? That is, like out-of-bounds checking, but for char -> dchar 
conversions, or any other invalid mechanism?

Yep, it's going to kill a lot of performance. But it's going to catch a 
lot of problems.

One thing to point out here is that autodecoding only happens on arrays, 
and even then only in certain cases.

-Steve
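
For illustration, a sketch of what such a check might look like if 
conversions were lowered to a runtime helper (the function is 
hypothetical, not part of druntime):

----
dchar checkedToDchar(char c)
{
    // a non-ASCII code unit on its own is part of a multibyte
    // sequence; widening it to dchar changes its meaning
    assert(c < 0x80, "char -> dchar conversion of non-ASCII code unit");
    return c;
}
----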
Jun 03 2016
Kagamin <spam here.lot> writes:
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
 Finally, this is not the only argument in favor of *keeping* 
 autodecoding, of course. Not wanting to break user code is the 
 big one there, I guess.
A lot of the discussion is disagreement on what correct Unicode support 
means. I see 4 possible meanings here:

1. Implemented according to spec.
2. Provides level 1 Unicode support.
3. Provides level 2 Unicode support.
4. Achieves the goal of Unicode, i.e. text processing according to 
natural language rules.
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 03:56 PM, Kagamin wrote:
 A lot of discussion is disagreement on understanding of correctness of
 unicode support. I see 4 possible meanings here:
 1. Implemented according to spec.
 2. Provides level 1 unicode support.
 3. Provides level 2 unicode support.
 4. Achieves the goal of unicode, i.e. text processing according to
 natural language rules.
Speaking of that, the document that Walter dug up [1], which talks about 
support levels, is about regular expression engines in particular. It's 
not about general language support. The version he linked to is also 
pretty old.

A more recent revision [2] calls level 1 (code points) the "minimally 
useful level of support", speaks warmly about level 2 (graphemes), and 
says that level 3 (locale-dependent behavior) is "only useful for 
specific applications".

[1] http://unicode.org/reports/tr18/tr18-5.1.html
[2] http://www.unicode.org/reports/tr18/tr18-17.html
Jun 03 2016
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
 This is mostly me trying to make sense of the discussion.

 So everyone hates autodecoding. But Andrei seems to hate it a 
 good bit less than everyone else. As far as I could follow, he 
 has one reason for that, which might not be clear to everyone:

 char converts implicitly to dchar, so the compiler lets you 
 search for a dchar in a range of chars. But that gives 
 nonsensical results. For example, you won't find 'ö' in  
 "ö".byChar, but you will find '¶' in there ('¶' is U+00B6, 'ö' 
 is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in UTF-8).
You mean that '¶' is represented internally as 1 byte 0xB6 and that it 
can be handled as such without error? This would mean that char literals 
are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.

Sorry if I misunderstood, I'm only starting to learn D.
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 1:51 PM, Patrick Schluter wrote:
 On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
 This is mostly me trying to make sense of the discussion.

 So everyone hates autodecoding. But Andrei seems to hate it a good bit
 less than everyone else. As far as I could follow, he has one reason
 for that, which might not be clear to everyone:

 char converts implicitly to dchar, so the compiler lets you search for
 a dchar in a range of chars. But that gives nonsensical results. For
 example, you won't find 'ö' in  "ö".byChar, but you will find '¶' in
 there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6
 in UTF-8).
 You mean that '¶' is represented internally as 1 byte 0xB6 and that it
 can be handled as such without error? This would mean that char literals
 are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.
 Sorry if I misunderstood, I'm only starting to learn D.
Not if '¶' is a dchar.

What is happening in the example is that find is looking at the 
"ö".byChar range and saying "hm... can I compare dchar('¶') to char? 
Well, char implicitly casts to dchar, so I'm good!", but a direct cast 
of the bits from char does NOT mean the same thing as a dchar. It has to 
go through a decoding first.

The real problem here is that char implicitly casts to dchar. That 
should not be allowed.

-Steve
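
To spell out the bit-widening problem (a minimal sketch, assuming it 
runs inside a function):

----
char c = "ö"[1];  // 0xB6, the trail byte of 'ö' in UTF-8
dchar d = c;      // implicit conversion just widens the bits
assert(d == '¶'); // U+00B6: a different character entirely
----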
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 08:36 PM, Steven Schveighoffer wrote:
 but a direct cast
 of the bits from char does NOT mean the same thing as a dchar.
That gives me an idea. A bitwise reinterpretation of int to float is 
nonsensical, too. Yet int implicitly converts to float and (for small 
values) preserves the meaning. I mean, implicit conversion doesn't have 
to mean bitwise reinterpretation.

How about replacing non-standalone code units with the replacement 
character (U+FFFD) in implicit widening conversions? For example:

----
char c = "ö"[0];
wchar w = c;
assert(w == '\uFFFD');
----

Would probably just be a band-aid, though.
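
For concreteness, the proposed rule as a conversion helper (a sketch; 
the function name is made up):

----
wchar widenChar(char c)
{
    // code units below 0x80 stand alone and keep their meaning;
    // everything else is part of a multibyte sequence
    return c < 0x80 ? cast(wchar)c : '\uFFFD';
}
----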
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 2:55 PM, ag0aep6g wrote:
 On 06/03/2016 08:36 PM, Steven Schveighoffer wrote:
 but a direct cast
 of the bits from char does NOT mean the same thing as a dchar.
 That gives me an idea. A bitwise reinterpretation of int to float is
 nonsensical, too. Yet int implicitly converts to float and (for small
 values) preserves the meaning. I mean, implicit conversion doesn't have
 to mean bitwise reinterpretation.
I'm pretty sure the CPU handles this, though.
 How about replacing non-standalone code units with replacement character
 (U+FFFD) in implicit widening conversions?

 For example:

 ----
 char c = "ö"[0];
 wchar w = c;
 assert(w == '\uFFFD');
 ----

 Would probably just be band-aid, though.
Except many chars *do* properly convert. This should work:

char c = 'a';
dchar d = c;
assert(d == 'a');

As I mentioned in my earlier reply, some kind of "bounds checking" for 
the conversion could be a possibility.

Hm... an interesting possibility:

dchar _dchar_convert(char c)
{
    return cast(int)cast(byte)c; // get sign extension for non-ASCII
}

-Steve
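
A quick check of what that does to a non-ASCII code unit (a sketch, 
assuming it runs inside a function together with the helper above):

----
assert(_dchar_convert('a') == 'a');       // ASCII survives unchanged
dchar d = _dchar_convert(cast(char)0xC3); // lead byte of 'ö'
assert(d == 0xFFFFFFC3);                  // above U+10FFFF, so invalid
----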
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 3:09 PM, Steven Schveighoffer wrote:
 Hm... an interesting possiblity:

 dchar _dchar_convert(char c)
 {
    return cast(int)cast(byte)c; // get sign extension for non-ASCII
 }
Allows this too:

dchar d = char.init; // calls conversion function
assert(d == dchar.init);

:)

-Steve
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 3:12 PM, Steven Schveighoffer wrote:
 On 6/3/16 3:09 PM, Steven Schveighoffer wrote:
 Hm... an interesting possiblity:

 dchar _dchar_convert(char c)
 {
    return cast(int)cast(byte)c; // get sign extension for non-ASCII
 }
 Allows this too:

 dchar d = char.init; // calls conversion function
 assert(d == dchar.init);
Hm... actually doesn't work. dchar.init is 0x0000ffff.

-Steve
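
Checking the arithmetic (a sketch, inside a function):

----
dchar d = cast(int)cast(byte)char.init; // char.init is 0xFF
assert(d == 0xFFFFFFFF);                // sign-extended to all 1s
assert(dchar.init == 0xFFFF);           // so the two differ
----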
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:
 Except many chars *do* properly convert. This should work:

 char c = 'a';
 dchar d = c;
 assert(d == 'a');
Yeah, that's what I meant by "standalone code unit". Code units that on their own represent a code point would not be touched.
 As I mentioned in my earlier reply, some kind of "bounds checking" for
 the conversion could be a possibility.

 Hm... an interesting possiblity:

 dchar _dchar_convert(char c)
 {
     return cast(int)cast(byte)c; // get sign extension for non-ASCII
 }
So when the char's most significant bit is set, this fills the upper 
bits of the dchar with 1s, right? And a set most significant bit in a 
char means it's part of a multibyte sequence, while in a dchar it means 
that the dchar is invalid, because they only go up to U+10FFFF. Huh. 
Neat.

Does it work for char -> wchar, too?
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 3:52 PM, ag0aep6g wrote:
 On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:
 Except many chars *do* properly convert. This should work:

 char c = 'a';
 dchar d = c;
 assert(d == 'a');
 Yeah, that's what I meant by "standalone code unit". Code units that on
 their own represent a code point would not be touched.
But you can get a standalone code unit that is part of a coded sequence 
quite easily:

void foo(string s)
{
    auto x = s[0];
    dchar d = x;
}
 As I mentioned in my earlier reply, some kind of "bounds checking" for
 the conversion could be a possibility.

 Hm... an interesting possiblity:

 dchar _dchar_convert(char c)
 {
     return cast(int)cast(byte)c; // get sign extension for non-ASCII
 }
 So when the char's most significant bit is set, this fills the upper
 bits of the dchar with 1s, right? And a set most significant bit in a
 char means it's part of a multibyte sequence, while in a dchar it means
 that the dchar is invalid, because they only go up to U+10FFFF. Huh.
 Neat.
An interesting thing is that I think the CPU can do this for us.
 Does it work for for char -> wchar, too?
It does not. 0xffff is a valid code point, and I think so are all the 
other values that would result. In fact, I think there are no invalid 
code units for wchar.

Of course, a surrogate pair requires another code unit to be valid, so 
we can at least promote a char to a wchar in the surrogate pair range 
(and always in the low or high surrogate range, so a naive transcoding 
of a char range to wchar will result in an invalid sequence if there are 
any non-ASCII characters).

So we need the most efficient logic that does this:

if(c & 0x80)
    return wchar(0xd800 + c);
else
    return wchar(c);

More expensive, but more correct!

wchar to dchar conversion is pretty sound, as the surrogate pairs are 
invalid code points for dchar.

-Steve
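
Wrapped up as a helper with a quick check (a sketch; the function name 
is made up):

----
wchar charToWchar(char c)
{
    // map non-ASCII code units into the high-surrogate range,
    // which is never valid as a standalone wchar
    return (c & 0x80) ? cast(wchar)(0xD800 + c) : cast(wchar)c;
}

unittest
{
    assert(charToWchar('a') == 'a');
    assert(charToWchar(cast(char)0xB6) == 0xD8B6); // unpaired surrogate
}
----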
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:
 But you can get a standalone code unit that is part of a coded sequence
 quite easily

 foo(string s)
 {
     auto x = s[0];
     dchar d = x;
 }
I don't think we're disagreeing on anything. I'm calling UTF-8 code 
units below 0x80 "standalone" code units. They're never part of 
multibyte sequences. Your _dchar_convert returns them unscathed. Higher 
code units are always part of multibyte sequences (or invalid already). 
Your function returns invalid code points for them.

_dchar_convert does exactly what I meant, except that I had in mind 
returning the replacement character for non-standalone code units. But I 
see that that may not be feasible, and it's probably not necessary.

[...]
 So we need most efficient logic that does this:

 if(c & 0x80)
      return wchar(0xd800 + c);
Is this going to be faster than returning a constant invalid wchar?
 else
      return wchar(c);

 More expensive, but more correct!

 wchar to dchar conversion is pretty sound, as the surrogate pairs are
 invalid code points for dchar.

 -Steve
Jun 03 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/16 4:39 PM, ag0aep6g wrote:
 On 06/03/2016 10:18 PM, Steven Schveighoffer wrote:
 But you can get a standalone code unit that is part of a coded sequence
 quite easily

 foo(string s)
 {
     auto x = s[0];
     dchar d = x;
 }
 I don't think we're disagreeing on anything. I'm calling UTF-8 code
 units below 0x80 "standalone" code units. They're never part of
 multibyte sequences. Your _dchar_convert returns them unscathed.
Ah, I thought you meant standalone as in it was assigned to a standalone char variable vs. part of an array or range. My mistake. Re-reading your original message, I see that should have been clear to me...
 So we need most efficient logic that does this:

 if(c & 0x80)
      return wchar(0xd800 + c);
 Is this going to be faster than returning a constant invalid wchar?
No, but I like the idea of preserving the erroneous character you tried 
to convert.

But is there an invalid wchar? I looked through the Wikipedia article on 
UTF-16, and it didn't seem to say there was one.

If we use U+FFFD, that signifies a coding problem but is still a valid 
code point. However, a wchar in the D800 - DBFF range without being 
followed by a code unit in the DC00 - DFFF range is an invalid sequence. 
D throws if it encounters such a thing.

-Steve
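
That last claim is easy to check with std.utf.validate, which throws a 
UTFException on invalid sequences (a small sketch):

----
import std.exception : assertThrown;
import std.utf : UTFException, validate;

unittest
{
    wchar[] s = [cast(wchar)0xD800]; // unpaired high surrogate
    assertThrown!UTFException(validate(s));
}
----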
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 11:13 PM, Steven Schveighoffer wrote:
 No, but I like the idea of preserving the erroneous character you tried
 to convert.
Makes sense.
 But is there an invalid wchar? I looked through the wikipedia article on
 UTF 16, and it didn't seem to say there was one.

 If we use U+FFFD, that signifies a coding problem but is still a valid
 code point. However, doing a wchar in the D800 - D8FF range without
 being followed by a code unit in the DC00 - DFFF range is an invalid
 sequence. D throws if it encounters such a thing.
The Unicode FAQ has an answer to this exact question, but it also only 
says that "[u]npaired surrogates are invalid" [1].

It also mentions "noncharacters" which are "permanently reserved [...] 
for internal use". "For example, they might be used internally as a 
particular kind of object placeholder in a string." [2] - Not too bad.

And then there is the replacement character, of course. "[U]sed to 
replace an incoming character whose value is unknown or unrepresentable 
in Unicode" [3].

[1] http://www.unicode.org/faq/utf_bom.html#utf16-7
[2] http://www.unicode.org/faq/private_use.html#noncharacters
[3] http://www.fileformat.info/info/unicode/char/0fffd/index.htm
Jun 03 2016
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 3 June 2016 at 20:18:31 UTC, Steven Schveighoffer 
wrote:
 On 6/3/16 3:52 PM, ag0aep6g wrote:
 On 06/03/2016 09:09 PM, Steven Schveighoffer wrote:
 Except many chars *do* properly convert. This should work:

 char c = 'a';
 dchar d = c;
 assert(d == 'a');
Yeah, that's what I meant by "standalone code unit". Code units that on their own represent a code point would not be touched.
 But you can get a standalone code unit that is part of a coded sequence
 quite easily:

 void foo(string s)
 {
     auto x = s[0];
     dchar d = x;
 }
 As I mentioned in my earlier reply, some kind of "bounds 
 checking" for
 the conversion could be a possibility.

 Hm... an interesting possiblity:

 dchar _dchar_convert(char c)
 {
     return cast(int)cast(byte)c; // get sign extension for 
 non-ASCII
 }
 So when the char's most significant bit is set, this fills the upper
 bits of the dchar with 1s, right? And a set most significant bit in a
 char means it's part of a multibyte sequence, while in a dchar it means
 that the dchar is invalid, because they only go up to U+10FFFF. Huh.
 Neat.
 An interesting thing is that I think the CPU can do this for us.
 Does it work for for char -> wchar, too?
 It does not. 0xffff is a valid code point, and I think so are all the
 other values that would result. In fact, I think there are no invalid
 code units for wchar.
https://codepoints.net/specials

U+ffff would be fine, better at least than a surrogate.
Jun 04 2016
Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/4/16 4:57 AM, Patrick Schluter wrote:
 On Friday, 3 June 2016 at 20:18:31 UTC, Steven Schveighoffer wrote:
 On 6/3/16 3:52 PM, ag0aep6g wrote:
 Does it work for for char -> wchar, too?
 It does not. 0xffff is a valid code point, and I think so are all the
 other values that would result. In fact, I think there are no invalid
 code units for wchar.
 https://codepoints.net/specials

 U+ffff would be fine, better at least than a surrogate.
U+ffff is still a valid code point, even if it's not assigned any 
Unicode character. But the result would be U+ff80 to U+ffff, and I'm 
sure some of those are valid code points.

-Steve
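
Spelling out where that range comes from, if the sign-extension trick 
were applied to wchar (a sketch, inside a function):

----
char c = 0x80;                    // lowest non-ASCII code unit
wchar w = cast(short)cast(byte)c; // sign-extend to 16 bits
assert(w == 0xFF80);              // a valid code point, so no error marker
----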
Jun 04 2016
Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 3 June 2016 at 18:36:45 UTC, Steven Schveighoffer 
wrote:
 The real problem here is that char implicitly casts to dchar. 
 That should not be allowed.
Indeed.
Jun 03 2016
ag0aep6g <anonymous example.com> writes:
On 06/03/2016 07:51 PM, Patrick Schluter wrote:
 You mean that '¶' is represented internally as 1 byte 0xB6 and that it
 can be handled as such without error? This would mean that char literals
 are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.
 Sorry if I misunderstood, I'm only starting to learn D.
There is no single char for '¶', that's right, and D gets that right. 
That's not what happens. But there is a single wchar for it. wchar is a 
UTF-16 code unit, 2 bytes. UTF-16 encodes '¶' as a single code unit, so 
that's correct.

The problem is that you can accidentally search for a wchar in a range 
of chars. Every char is compared to the wchar by numeric value. But the 
numeric values of a char don't mean the same as those of a wchar, so you 
get nonsensical results.

A similar implicit conversion lets you search for a large number in a 
byte[]:

----
byte[] arr = [1, 2, 3];
foreach(x; arr)
    if (x == 1000)
        writeln("found it!");
----

You won't ever find 1000 in a byte[], of course. The byte type simply 
can't store the value. But you can compare a byte with an int. And that 
comparison is meaningful, unlike the comparison of a char with a wchar.

You can also produce false positives with numeric types, by mixing 
signed and unsigned types:

----
int[] arr = [1, -1, 3];
foreach(x; arr)
    if (x == uint.max)
        writeln("found it!");
----

uint.max is a large number, -1 is a small number. They're considered 
equal here because of an implicit conversion that messes with the 
meaning of the bits.

False negatives are not possible with numeric types. At least not in the 
same way as with differently sized Unicode code units.
Jun 03 2016
Observer <here somewhere.net> writes:
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
 Finally, this is not the only argument in favor of *keeping* 
 autodecoding, of course. Not wanting to break user code is the 
 big one there, I guess.
I'm not familiar with the details of autodecoding, but one thing strikes 
me about this whole discussion. It seems to me that it is just nibbling 
around the edges of how one should implement full Unicode support. And 
it seems to me that that topic, and how autodecoding plays into it, 
won't be properly understood except by comparison with mature software 
that has undergone many years of testing and revision. Two examples 
stand out to me:

* Perl 5 has undergone a gradual evolution, over many releases, to get 
this right. It might also be the case that Perl 6 is even cleaner.

* The International Components for Unicode (ICU) package, with supported 
libraries for C, C++, and Java. This is the industry-standard definition 
of what it means to handle Unicode in these languages. See 
http://site.icu-project.org/ for details.

Both of these implementations have seen many years of real-world use, so 
I would tend to look to them for guidance over trying to develop my own 
opinion based on some small set of particular use cases I might happen 
to have encountered.
Jun 04 2016