digitalmars.D.bugs - [Issue 14919] New: utf error

https://issues.dlang.org/show_bug.cgi?id=14919

          Issue ID: 14919
           Summary: utf error
           Product: D
           Version: D2
          Hardware: x86_64
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: dmd
          Assignee: nobody puremagic.com
          Reporter: code dawg.eu

Related/Alternative to issue 14519 (see
https://issues.dlang.org/show_bug.cgi?id=14519#c24).

When I `readText` a file, a lot of time is already spent on UTF validation.
But we don't take advantage of that and revalidate UTF in almost every
algorithm.
The idea from issue 14519 to substitute a replacement character for invalid
chars makes the validation a little cheaper (because of the cost of dmd's
exception handling, see issue 12442) but still incurs a high overhead.

I suggest that we make a clean distinction: ubyte[] is unvalidated data, and
all char[]/wchar[]/dchar[] strings are treated as valid.
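
A minimal sketch of what that boundary could look like with today's Phobos,
assuming validation happens exactly once, at the point where raw bytes become a
string. `toValidString` is a hypothetical helper named here only for
illustration; `std.utf.validate` is the existing check, which throws
`UTFException` on malformed input.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper (not part of Phobos): the single place where
/// unvalidated bytes are checked and become a string.
string toValidString(immutable(ubyte)[] raw)
{
    auto s = cast(string) raw; // reinterpret the bytes, no copy
    validate(s);               // throws UTFException on invalid UTF-8
    return s;
}

void main()
{
    // Well-formed input passes through unchanged.
    immutable(ubyte)[] good = [0x68, 0x69];
    assert(toValidString(good) == "hi");

    // 0xFF can never appear in UTF-8, so this is rejected at the boundary.
    immutable(ubyte)[] bad = [0xFF, 0xFE];
    bool threw = false;
    try
    {
        toValidString(bad);
    }
    catch (UTFException)
    {
        threw = true;
    }
    assert(threw);
}
```

Under this proposal, algorithms that receive a `string` from such a boundary
would be allowed to skip their own validation.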

The compiler already checks string literals, and a few of the string-reading
functions do so as well. Unfortunately, byLine and readln currently don't
validate UTF.

This could be a much more performant approach to correct UTF handling.

--
Aug 13 2015