
digitalmars.D - UTF-8 requested. BOM is for UTF-16

solidstate1991 <laszloszeremi outlook.com> writes:
This is what I get when I try to run a unittest on one of my 
projects.

Here's the file that generates this error: 
https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d
Sep 15
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Sunday, September 15, 2019 5:50:27 PM MDT solidstate1991 via Digitalmars-d wrote:
 This is what I get when I try to run a unittest on one of my
 projects.

 Here's the file that generates this error:
 https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d
Well, without spending a fair bit of time digging into it, I can't say for sure what's going on, but the file reading stuff in Phobos doesn't tend to do much with BOMs. Rather, it tends to assume the encoding based on the type. So, for instance, readText expects the file to be in UTF-8 if it's told to provide an array of char, and it expects the file to be in UTF-16 with the native byte order of the machine (so, usually UTF-16LE) if it's told to provide an array of wchar (and of course, UTF-32 if it's told to provide an array of dchar). As such, if it's told to read UTF-8, and it finds that the file is UTF-16, then you're going to get a UTFException, because the data is invalid UTF-8.

I don't know for sure that that's what's happening here, because there doesn't appear to be a call to read or readText in that module, and I don't have time right now to go digging to see what it's actually doing. But odds are that whatever is reading in the file is going to have to either cycle through the possible UTF encodings until it finds one that doesn't throw (and then convert that to UTF-8 if that's what's desired), or it's going to have to look to see whether there's a BOM, and then read the file in based on the BOM (or lack thereof).

As things stand, Phobos does a good job when the encoding is known ahead of time, but it's far more annoying to use when it isn't.

- Jonathan M Davis
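
The "look to see whether there's a BOM" step could be sketched roughly as follows. This is only an illustration: sniffBOM is a hypothetical helper name, the sample byte arrays are made up rather than taken from the project in question, and it assumes a Phobos version that provides std.encoding.getBOM.

```d
import std.encoding : BOM, getBOM;
import std.stdio : writeln;

/// Hypothetical helper: reports which BOM (if any) starts `data`.
/// BOM.none means no known BOM was found.
BOM sniffBOM(const(ubyte)[] data)
{
    return getBOM(data).schema;
}

void main()
{
    // Made-up sample inputs: "hi" with a UTF-8 BOM, with a UTF-16LE BOM,
    // and without any BOM.
    immutable ubyte[] utf8Sample    = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    immutable ubyte[] utf16leSample = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00];
    immutable ubyte[] plainSample   = [0x68, 0x69];

    writeln(sniffBOM(utf8Sample));    // utf8
    writeln(sniffBOM(utf16leSample)); // utf16le
    writeln(sniffBOM(plainSample));   // none
}
```

A caller would typically pass the result of std.file.read (cast to ubyte[]) and then branch on the returned value to decide how to decode the rest of the file.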
Sep 16
Gregor Mückl <gregormueckl gmx.de> writes:
On Sunday, 15 September 2019 at 23:50:27 UTC, solidstate1991 
wrote:
 This is what I get when I try to run a unittest on one of my 
 projects.

 Here's the file that generates this error: 
 https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d
Since the Unicode BOM is a code point like any other, it can be expressed in UTF-8. There are occasional text files that have a BOM even though they are encoded as UTF-8. I feel that readText is too strict in that case. It should just silently ignore the BOM and continue, because the BOM is meaningless there. Throwing an exception seems completely wrong to me.
Sep 16
Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, September 16, 2019 1:27:52 PM MDT Gregor Mückl via Digitalmars-d 
wrote:
 On Sunday, 15 September 2019 at 23:50:27 UTC, solidstate1991 wrote:
 This is what I get when I try to run a unittest on one of my
 projects.

 Here's the file that generates this error:
 https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d
 Since the Unicode BOM is a code point like any other, it can be expressed in UTF-8. There are occasional text files that have a BOM even though they are encoded as UTF-8. I feel that readText is too strict in that case. It should just silently ignore the BOM and continue, because the BOM is meaningless there. Throwing an exception seems completely wrong to me.
The BOMs for UTF-16 and UTF-32 are not valid UTF-8. e.g. this program will print false for all 4 BOMs:

    import std.encoding;
    import std.stdio;
    import std.utf;

    void main()
    {
        // Reinterpret each BOM's byte sequence as UTF-8 and validate it.
        foreach(i; [BOM.utf16be, BOM.utf16le, BOM.utf32be, BOM.utf32le])
            writeln(isValidUTF(cast(const char[])bomTable[i].sequence));
    }

    // Returns true if str is valid UTF-8, false if validation throws.
    bool isValidUTF(const(char)[] str)
    {
        try
        {
            validate(str);
            return true;
        }
        catch(UTFException)
            return false;
    }

The BOM for UTF-8 is valid UTF-8, but the others aren't. And it would be invalid to have a BOM for UTF-16 or UTF-32 and then treat the text as UTF-8 anyway. A program is free to ignore the BOM and then treat the rest of the text in whatever encoding it wants, but if the BOM doesn't match the actual encoding, then the text is invalid Unicode.

If you want to ignore the BOM, it's trivial enough to write a wrapper around std.file.read which does that. What Phobos is missing is a function akin to readText which looks at the BOM and then converts the text to the requested encoding instead of throwing, but that gets a bit more involved, and it probably wouldn't work with more than UTF-8, UTF-16, and UTF-32. It also becomes problematic if you want it to handle all of the cases where the text is UTF-16 or UTF-32 but doesn't have a BOM, since that requires examining the text and guessing (as well as potentially having to deal with reversing the endianness of the encoding). Regardless, I definitely wouldn't expect a function like that to ignore the BOM if it's present.

- Jonathan M Davis
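
For illustration, a readText-like function that looks at the BOM and converts to UTF-8 might be sketched like this. It is not part of Phobos: the names assemble, checkedToUTF8, decodeByBOM, and readAnyUTF are hypothetical, input without a BOM is simply assumed to be UTF-8 (no guessing is attempted), and anything other than UTF-8, UTF-16, and UTF-32 is rejected.

```d
import std.encoding : BOM, getBOM;
import std.exception : enforce;
import std.file : read;
import std.stdio : writeln;
import std.utf : toUTF8, validate;

/// Assembles code units of type C from raw bytes in the given byte order,
/// so the result does not depend on the endianness of the host machine.
C[] assemble(C)(const(ubyte)[] bytes, bool littleEndian)
{
    enforce(bytes.length % C.sizeof == 0, "truncated code unit at end of data");
    auto units = new C[](bytes.length / C.sizeof);
    foreach (i, ref unit; units)
    {
        uint value = 0;
        foreach (j; 0 .. C.sizeof)
        {
            immutable shift = 8 * (littleEndian ? j : C.sizeof - 1 - j);
            value |= cast(uint) bytes[i * C.sizeof + j] << shift;
        }
        unit = cast(C) value;
    }
    return units;
}

/// Validates the code units (throws UTFException if invalid), then
/// transcodes them to UTF-8.
string checkedToUTF8(C)(const(C)[] units)
{
    validate(units);
    return toUTF8(units);
}

/// Hypothetical helper: strips the BOM and decodes the rest accordingly.
/// Assumption: data without a BOM is treated as UTF-8.
string decodeByBOM(const(ubyte)[] data)
{
    immutable bom = getBOM(data);
    auto payload = data[bom.sequence.length .. $];

    switch (bom.schema)
    {
        case BOM.none:
        case BOM.utf8:
            return checkedToUTF8(cast(const(char)[]) payload);
        case BOM.utf16le:
            return checkedToUTF8(assemble!wchar(payload, true));
        case BOM.utf16be:
            return checkedToUTF8(assemble!wchar(payload, false));
        case BOM.utf32le:
            return checkedToUTF8(assemble!dchar(payload, true));
        case BOM.utf32be:
            return checkedToUTF8(assemble!dchar(payload, false));
        default:
            throw new Exception("unsupported BOM");
    }
}

/// Hypothetical wrapper around std.file.read.
string readAnyUTF(string path)
{
    return decodeByBOM(cast(const(ubyte)[]) read(path));
}

void main()
{
    // Made-up inputs: "hi" in three encodings, each with its BOM.
    immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    immutable ubyte[] utf16le = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00];
    immutable ubyte[] utf16be = [0xFE, 0xFF, 0x00, 0x68, 0x00, 0x69];

    writeln(decodeByBOM(utf8));    // hi
    writeln(decodeByBOM(utf16le)); // hi
    writeln(decodeByBOM(utf16be)); // hi
}
```

Building the code units byte by byte keeps the sketch independent of the host's endianness, at the cost of being slower than a cast plus an optional byte swap.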
Sep 16
Gregor Mückl <gregormueckl gmx.de> writes:
On Monday, 16 September 2019 at 22:21:17 UTC, Jonathan M Davis 
wrote:
 The BOM for UTF-8 is valid UTF-8, but the others aren't. And it 
 would be invalid to have a BOM for UTF-16 or UTF-32 and then 
 treat the text as UTF-8 anyway. A program is free to ignore the 
 BOM and then treat the rest of the text in whatever encoding it 
 wants, but if the BOM doesn't match the actual encoding, then 
 the text is invalid Unicode.
Ah, I misunderstood. Apologies for that. My initial mistake was that I didn't quite understand that readText essentially ends up checking actual BOM encodings as a sanity check. So, yes, Phobos isn't breaking any standard here.

This is a point where adhering to Unicode terminology prevents confusion. There is only one BOM. But it is selected specifically so that it is encoded differently in each encoding scheme. To make that work, the endianness-reversed decoded versions of the BOM are defined to be illegal code points. It is entirely legal to have a UTF-8 *encoded* BOM at the start of a UTF-8 string, but that is a sequence of bytes that is distinct from the UTF-16 little endian *encoded* BOM, which is also distinct from a UTF-16 big endian *encoded* BOM, etc.
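
The "one BOM, different byte sequences" point can be checked with a small sketch that encodes the single code point U+FEFF in each scheme. The helper names bomAsUTF8 and bomAsUTF16 are hypothetical; std.utf.encode and the std.bitmanip conversion functions are assumed to behave as documented.

```d
import std.bitmanip : nativeToBigEndian, nativeToLittleEndian;
import std.stdio : writefln;
import std.utf : encode;

/// Hypothetical helper: the bytes of U+FEFF when encoded as UTF-8.
ubyte[] bomAsUTF8()
{
    char[4] buf;
    immutable n = encode(buf, '\uFEFF'); // number of UTF-8 code units written
    return cast(ubyte[]) buf[0 .. n].dup;
}

/// Hypothetical helper: the bytes of U+FEFF as UTF-16 in the given byte order.
ubyte[] bomAsUTF16(bool littleEndian)
{
    wchar[2] buf;
    encode(buf, '\uFEFF'); // U+FEFF fits in a single UTF-16 code unit
    immutable ubyte[2] bytes = littleEndian
        ? nativeToLittleEndian(buf[0])
        : nativeToBigEndian(buf[0]);
    return bytes[].dup;
}

void main()
{
    writefln("UTF-8:    %(%02X %)", bomAsUTF8());       // EF BB BF
    writefln("UTF-16LE: %(%02X %)", bomAsUTF16(true));  // FF FE
    writefln("UTF-16BE: %(%02X %)", bomAsUTF16(false)); // FE FF
}
```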
Sep 17
Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Sunday, 15 September 2019 at 23:50:27 UTC, solidstate1991 
wrote:
 This is what I get when I try to run a unittest on one of my 
 projects.

 Here's the file that generates this error: 
 https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d
I think the short answer everyone is trying to get to is: you need to use readText!wstring(file), because the file claims to be UTF-16 and reading it as UTF-8 would be wrong. However, the code shown has no readText, so we don't know exactly where your issue lies.
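
A minimal sketch of that suggestion, assuming the file really is UTF-16 in the machine's native byte order. The helper name readUTF16File and the scratch file name are hypothetical. Since readText validates the data but leaves a leading BOM in the returned string, the sketch strips it by hand before converting to UTF-8.

```d
import std.file : readText, remove, tempDir, write;
import std.path : buildPath;
import std.stdio : writeln;
import std.utf : toUTF8;

/// Hypothetical helper: reads a native-endian UTF-16 file and returns
/// its contents as UTF-8, without the leading BOM (if there is one).
string readUTF16File(string path)
{
    wstring text = readText!wstring(path);
    if (text.length && text[0] == '\uFEFF')
        text = text[1 .. $];
    return toUTF8(text);
}

void main()
{
    // Create a scratch UTF-16 file (native byte order) that starts with a BOM.
    immutable path = buildPath(tempDir(), "bom_demo_utf16.txt");
    write(path, "\uFEFFhello"w);
    scope(exit) remove(path);

    writeln(readUTF16File(path)); // hello
}
```

If the file's byte order does not match the machine's, readText!wstring is expected to throw rather than swap bytes, so a file from another platform would still need separate handling.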
Sep 18