digitalmars.D.bugs - [Issue 15949] New: Improve readText handling of byte order mark (BOM)

https://issues.dlang.org/show_bug.cgi?id=15949

          Issue ID: 15949
           Summary: Improve readText handling of byte order mark (BOM)
           Product: D
           Version: D2
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: phobos
          Assignee: nobody puremagic.com
          Reporter: Jesse.K.Phillips+D gmail.com

Problem:

I've hit this many times on Windows. I try to read in a file with
std.file.readText and get: "Syntax error at line 0"

This is because some Microsoft program has inserted a UTF-8 byte order mark
(BOM) at the beginning of the file (0xEF 0xBB 0xBF). That said, readText really
shouldn't automatically convert a file's content based on the BOM it finds.

Suggested fix:

I think readText should validate and skip the BOM. It should check that the BOM
is not UTF-16LE (0xFF 0xFE), UTF-16BE (0xFE 0xFF), UTF-32LE (0xFF 0xFE 0x00
0x00), or UTF-32BE (0x00 0x00 0xFE 0xFF). If it is one of those, it should throw
an exception stating that the file uses that encoding and will not be converted
to a UTF-8 string.
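
A minimal sketch of that check, written against raw bytes rather than against
readText itself. The names Bom, hasPrefix, detectBom and textFromUtf8Bytes are
hypothetical and not part of Phobos; only std.utf.validate, std.exception and
std.string.representation are existing library calls. UTF-32LE is tested before
UTF-16LE because 0xFF 0xFE is a prefix of 0xFF 0xFE 0x00 0x00.

```d
import std.exception : assertThrown, enforce;
import std.string : representation;
import std.utf : validate;

// Hypothetical classification of a leading byte order mark.
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
immutable ubyte[] bomUtf16le = [0xFF, 0xFE];
immutable ubyte[] bomUtf16be = [0xFE, 0xFF];
immutable ubyte[] bomUtf32le = [0xFF, 0xFE, 0x00, 0x00];
immutable ubyte[] bomUtf32be = [0x00, 0x00, 0xFE, 0xFF];

// True if `data` begins with the bytes in `mark`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] mark)
{
    return data.length >= mark.length && data[0 .. mark.length] == mark;
}

Bom detectBom(const(ubyte)[] data)
{
    // The 4-byte UTF-32LE mark must be tested before the 2-byte UTF-16LE mark.
    if (hasPrefix(data, bomUtf32le)) return Bom.utf32le;
    if (hasPrefix(data, bomUtf32be)) return Bom.utf32be;
    if (hasPrefix(data, bomUtf8))    return Bom.utf8;
    if (hasPrefix(data, bomUtf16le)) return Bom.utf16le;
    if (hasPrefix(data, bomUtf16be)) return Bom.utf16be;
    return Bom.none;
}

// Skips a UTF-8 BOM, rejects UTF-16/UTF-32 BOMs, then validates the rest.
string textFromUtf8Bytes(immutable(ubyte)[] data)
{
    immutable bom = detectBom(data);
    enforce(bom == Bom.none || bom == Bom.utf8,
        "file starts with a UTF-16 or UTF-32 BOM and will not be converted to UTF-8");
    if (bom == Bom.utf8)
        data = data[bomUtf8.length .. $];
    auto text = cast(string) data;
    validate(text); // throws UTFException on malformed UTF-8
    return text;
}

unittest
{
    // "\uFEFF" encodes to EF BB BF in a UTF-8 string literal.
    assert(textFromUtf8Bytes("\uFEFFabc".representation) == "abc");
    assert(textFromUtf8Bytes("abc".representation) == "abc");

    immutable ubyte[] utf16 = [0xFF, 0xFE, 0x68, 0x00];
    assertThrown(textFromUtf8Bytes(utf16));
}
```

One caveat with this ordering: a UTF-16LE file whose first character after the
BOM is U+0000 would be classified as UTF-32LE. Both cases are rejected here, so
only the wording of the error would be affected.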

The corresponding std.file.readText!wstring and std.file.readText!dstring
should perform equivalent validation. If the byte order can be changed at no
cost, then that should be done.
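
For the wstring case, the byte order change could look roughly like the sketch
below. It covers only the byte-swapping part, not the rejection of UTF-8 or
UTF-32 marks, and it assumes the file contents are already available as UTF-16
code units. textFromUtf16Units is a hypothetical name, not a Phobos function;
std.bitmanip.swapEndian, std.exception.assumeUnique and std.utf.validate are
existing library calls. The swap is a single pass over the code units.

```d
import std.bitmanip : swapEndian;
import std.exception : assumeUnique;
import std.utf : validate;

// Hypothetical helper: skip a UTF-16 BOM and fix the byte order if needed.
wstring textFromUtf16Units(const(wchar)[] units)
{
    enum wchar nativeBom  = 0xFEFF; // BOM as seen in the native byte order
    enum wchar swappedBom = 0xFFFE; // BOM as seen in the opposite byte order

    if (units.length == 0)
        return null;

    wchar[] result;
    if (units[0] == nativeBom)
    {
        // Byte order already matches: just drop the BOM.
        result = units[1 .. $].dup;
    }
    else if (units[0] == swappedBom)
    {
        // Opposite byte order: drop the BOM and swap every code unit.
        result = units[1 .. $].dup;
        foreach (ref u; result)
            u = swapEndian(u);
    }
    else
    {
        // No BOM: assume the native byte order.
        result = units.dup;
    }

    validate(result); // throws UTFException on malformed UTF-16
    return assumeUnique(result);
}

unittest
{
    assert(textFromUtf16Units("\uFEFFhi"w) == "hi"w);

    // 'h' (0x0068) and 'i' (0x0069) stored in the opposite byte order.
    wchar[] swapped = [cast(wchar) 0xFFFE, cast(wchar) 0x6800, cast(wchar) 0x6900];
    assert(textFromUtf16Units(swapped) == "hi"w);
}
```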


1. https://en.wikipedia.org/wiki/Byte_order_mark

--
Apr 21 2016