digitalmars.D - dchar undefined behaviour
- tsbockman (30/31) Oct 22 2015 While working on updating and improving Lionello Lunesu's
- Walter Bright (6/15) Oct 23 2015 I think that ship has sailed. Illegal values in a dchar are not UB. Maki...
- tsbockman (6/12) Oct 23 2015 That makes sense to me. I think the language would have to work a
- Vladimir Panteleev (8/17) Oct 23 2015 That doesn't sound right. In fact, this puts into question why
- Anon (8/16) Oct 23 2015 Unless UTF-16 is deprecated and completely removed from all
- Dmitry Olshansky (8/22) Oct 24 2015 Exactly. Unicode officially limited UTf-8 to 10FFFF in Unicode 6.0 or
While working on updating and improving Lionello Lunesu's value range propagation patch, I ran into a propagation-related issue with the dchar type. The patch adds VRP-based compile-time evaluation of integer type comparisons, where possible.

This caused the following issue: the compiler will now optimize out attempts to handle invalid, out-of-range dchar values. For example:

    dchar c = cast(dchar) uint.max;

    if (c > 0x10FFFF)
        writeln("invalid");
    else
        writeln("OK");

With constant folding for integer comparisons, the above will print "OK", when it should print "invalid". The predicate (c > 0x10FFFF) is simply *assumed* to be false, because the current starting range.imax for a dchar expression is dchar.max.

So, this leads to the question: is making use of dchar values greater than dchar.max considered undefined behaviour, or not?

1. If it is UB, then there is quite a lot of D code (including std.uni) which must be corrected to use uint instead of dchar when dealing with values which could possibly fall outside the officially supported range.

2. If it is not UB, then the compiler needs to be updated to stop assuming that dchar values greater than dchar.max are impossible. This basically just means removing some of dchar's special treatment, and running it through more of the same code paths as uint.

Option 1 could still make sense if people think code which might have to deal with invalid code points can be isolated sufficiently from other Unicode processing.
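[Editor's note: a minimal sketch of the workaround implied by option 1, using a hypothetical helper name (isValidCodePoint is not a Phobos function). Routing the comparison through uint side-steps the dchar-specific range assumption, so the check cannot be folded away.]

    import std.stdio;

    // Hypothetical helper: checks a possibly-invalid code point.
    // The cast to uint avoids the VRP assumption that a dchar
    // expression can never exceed dchar.max.
    bool isValidCodePoint(dchar c)
    {
        return cast(uint) c <= 0x10FFFF;
    }

    void main()
    {
        dchar c = cast(dchar) uint.max;
        writeln(isValidCodePoint(c) ? "OK" : "invalid"); // prints "invalid"
    }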
Oct 22 2015
On 10/22/2015 6:31 PM, tsbockman wrote:
> So, this leads to the question: is making use of dchar values greater
> than dchar.max considered undefined behaviour, or not?
>
> 1. If it is UB, then there is quite a lot of D code (including std.uni)
> which must be corrected to use uint instead of dchar when dealing with
> values which could possibly fall outside the officially supported range.
>
> 2. If it is not UB, then the compiler needs to be updated to stop
> assuming that dchar values greater than dchar.max are impossible. This
> basically just means removing some of dchar's special treatment, and
> running it through more of the same code paths as uint.

I think that ship has sailed. Illegal values in a dchar are not UB. Making it UB would result in the surprising behavior which you've noted.

Also, this segues into what to do about string, wstring, and dstring with invalid sequences in them. Currently, functions define what they do with invalid sequences. Making it UB would be a burden to programmers.
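[Editor's note: a minimal sketch (not actual Phobos code) of the "defined behaviour" approach described above: rather than treating illegal dchar values as UB, a function states what it does with them, here replacing them with U+FFFD, the Unicode replacement character.]

    // Replaces out-of-range values and UTF-16 surrogates with U+FFFD,
    // so callers get a documented result instead of undefined behaviour.
    dchar sanitize(dchar c)
    {
        uint u = c; // dchar converts implicitly to uint, avoiding the
                    // VRP folding discussed upthread
        if (u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF))
            return '\uFFFD';
        return c;
    }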
Oct 23 2015
On Friday, 23 October 2015 at 12:17:22 UTC, Walter Bright wrote:
> I think that ship has sailed. Illegal values in a dchar are not UB.
> Making it UB would result in the surprising behavior which you've noted.
>
> Also, this segues into what to do about string, wstring, and dstring
> with invalid sequences in them. Currently, functions define what they
> do with invalid sequences. Making it UB would be a burden to
> programmers.

That makes sense to me. I think the language would have to work a lot harder to block the creation of invalid dchar values to justify the current VRP assumption. Updating the compiler to drop that assumption should be easy, so far.
Oct 23 2015
On Friday, 23 October 2015 at 01:31:47 UTC, tsbockman wrote:
> dchar c = cast(dchar) uint.max;
>
> if (c > 0x10FFFF)
>     writeln("invalid");
> else
>     writeln("OK");
>
> With constant folding for integer comparisons, the above will print
> "OK", when it should print "invalid". The predicate (c > 0x10FFFF) is
> simply *assumed* to be false, because the current starting range.imax
> for a dchar expression is dchar.max.

That doesn't sound right. In fact, this puts into question why dchar.max is at the value it is now. It might be the maximum in the current version of Unicode, but this seems like a completely pointless restriction that breaks forward-compatibility with future Unicode versions: D programs compiled today may be unable to work with Unicode text in the future because of an artificial limitation.
Oct 23 2015
On Friday, 23 October 2015 at 21:22:38 UTC, Vladimir Panteleev wrote:
> That doesn't sound right. In fact, this puts into question why
> dchar.max is at the value it is now. It might be the maximum in the
> current version of Unicode, but this seems like a completely pointless
> restriction that breaks forward-compatibility with future Unicode
> versions: D programs compiled today may be unable to work with Unicode
> text in the future because of an artificial limitation.

Unless UTF-16 is deprecated and completely removed from all systems everywhere, there is no way for the Unicode Consortium to increase the limit beyond U+10FFFF. That limit is not arbitrary, but based on the technical limitations of what UTF-16 can actually represent. UTF-8 and UTF-32 both have room for expansion, but have been defined to match UTF-16's limitations.
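[Editor's note: a small worked example of the limit described above. A UTF-16 surrogate pair carries 10 bits in each half, encoding 0x10000 + (high10 << 10 | low10), so with both halves at their maximum (0x3FF) the largest encodable code point is exactly U+10FFFF.]

    void main()
    {
        import std.stdio;

        // Largest code point a UTF-16 surrogate pair can encode:
        enum uint maxPair = 0x10000 + ((0x3FF << 10) | 0x3FF);
        writefln("U+%X", maxPair); // prints U+10FFFF
        static assert(maxPair == 0x10FFFF);
    }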
Oct 23 2015
On 24-Oct-2015 02:45, Anon wrote:
> On Friday, 23 October 2015 at 21:22:38 UTC, Vladimir Panteleev wrote:
>> That doesn't sound right. In fact, this puts into question why
>> dchar.max is at the value it is now. It might be the maximum in the
>> current version of Unicode, but this seems like a completely pointless
>> restriction that breaks forward-compatibility with future Unicode
>> versions: D programs compiled today may be unable to work with Unicode
>> text in the future because of an artificial limitation.
>
> Unless UTF-16 is deprecated and completely removed from all systems
> everywhere, there is no way for the Unicode Consortium to increase the
> limit beyond U+10FFFF. That limit is not arbitrary, but based on the
> technical limitations of what UTF-16 can actually represent. UTF-8 and
> UTF-32 both have room for expansion, but have been defined to match
> UTF-16's limitations.

Exactly. Unicode officially limited UTF-8 to 10FFFF in Unicode 6.0 or so. Previously it was expected to (maybe) expand beyond that, but it was decided to stay with 10FFFF pretty much indefinitely because of UTF-16.

Also, only ~114k code points have assigned meaning; we are looking at 900K+ unassigned values reserved today.

-- 
Dmitry Olshansky
Oct 24 2015