digitalmars.D - UTF-8 requested. BOM is for UTF-16

solidstate1991 <laszloszeremi outlook.com> writes:

This is what I get when I try to run a unittest on one of my 
projects.

Here's the file that generates this error: 
https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d

Sep 15 2019

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Sunday, September 15, 2019 5:50:27 PM MDT solidstate1991 via Digitalmars-d
wrote:
 This is what I get when I try to run a unittest on one of my
 projects.

 Here's the file that generates this error:
 https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d

Well, without spending a fair bit of time digging into it, I can't say for
sure what's going on, but the file reading stuff in Phobos doesn't tend to
do much with BOMs. Rather, it tends to assume the encoding based on the
type. So, for instance, readText expects the file to be in UTF-8 if it's
told to provide an array of char, and it expects the file to be in UTF-16
with the native encoding of the machine (so, usually UTF-16LE) if it's told
to provide an array of wchar (and of course, UTF-32 if it's told to provide an
array of dchar). As such, if it's told to read UTF-8, and it finds that the
file is UTF-16, then you're going to get a UTFException, because the data is
invalid UTF-8.
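
As an illustration of that behaviour (this is not code from the project in question), here is a minimal sketch. The helper readsAsUtf8 and the scratch file name are hypothetical; the only Phobos calls assumed are std.file.readText, std.file.write, std.file.remove and std.utf.UTFException.

```d
import std.file : readText, remove, write;
import std.utf : UTFException;

/// Hypothetical helper: reports whether readText accepts the file as UTF-8.
bool readsAsUtf8(string path)
{
    try
    {
        // The default result type of readText is string, i.e. UTF-8.
        auto text = readText(path);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Hypothetical scratch file name, used only for this illustration.
    enum path = "bom_demo_utf16.txt";

    // Dump native-endian UTF-16 code units, starting with the BOM (U+FEFF).
    wstring content = "\uFEFFhi"w;
    write(path, content);
    scope (exit) remove(path);

    // The UTF-16 bytes are rejected when char data is requested...
    assert(!readsAsUtf8(path));
    // ...but the same bytes are accepted when wchar data is requested.
    assert(readText!wstring(path).length > 0);
}
```

The file is written in the machine's native byte order, so the sketch should behave the same on little-endian and big-endian systems.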

I don't know for sure that that's what's happening here, because there
doesn't appear to be a call to read or readText in that module, and I don't
have time right now to go digging to see what it's actually doing. But odds
are that whatever is reading in the file is going to have to either cycle
through the possible UTF encodings until it finds one that doesn't throw (and
then convert that to UTF-8 if that's what's desired), or it's going to have
to look to see whether there's a BOM, and then read the file in based on the
BOM (or lack thereof). As things stand, Phobos does a good job when the
encoding is known ahead of time, but it's far more annoying to use when it
isn't.
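
A minimal sketch of the second option, looking for a BOM before deciding how to read the data, could use std.encoding.getBOM. The helper name detectBOM is hypothetical, and in real code the bytes would come from something like cast(const(ubyte)[]) std.file.read(path) rather than from literals.

```d
import std.encoding : BOM, getBOM;

/// Hypothetical helper: names the BOM, if any, at the start of the raw bytes.
BOM detectBOM(const(ubyte)[] data)
{
    return getBOM(data).schema;
}

void main()
{
    // "hi" in a few encodings, with and without a BOM.
    immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    immutable ubyte[] utf16le = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00];
    immutable ubyte[] utf16be = [0xFE, 0xFF, 0x00, 0x68, 0x00, 0x69];
    immutable ubyte[] plain   = [0x68, 0x69];

    assert(detectBOM(utf8) == BOM.utf8);
    assert(detectBOM(utf16le) == BOM.utf16le);
    assert(detectBOM(utf16be) == BOM.utf16be);
    assert(detectBOM(plain) == BOM.none); // no BOM: the encoding must be guessed
}
```

Detection only tells you what the file claims to be; the caller still has to decode the remaining bytes accordingly.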

- Jonathan M Davis

Sep 16 2019

Gregor Mückl <gregormueckl gmx.de> writes:

On Sunday, 15 September 2019 at 23:50:27 UTC, solidstate1991 
wrote:
 This is what I get when I try to run a unittest on one of my 
 projects.

 Here's the file that generates this error: 
 https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d

Since the Unicode BOM is a set of code points like any other, 
they can be expressed in UTF-8. There are the occasional text 
files that have a BOM even though they are encoded as UTF-8. I 
feel that readText is too strict in that case. It should just 
silently ignore the BOM and continue because it is just 
meaningless. Throwing an exception seems completely wrong to me.

Sep 16 2019

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Monday, September 16, 2019 1:27:52 PM MDT Gregor Mückl via Digitalmars-d
wrote:
 On Sunday, 15 September 2019 at 23:50:27 UTC, solidstate1991
 wrote:
 This is what I get when I try to run a unittest on one of my
 projects.

 Here's the file that generates this error:
 https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d

 Since the Unicode BOM is a set of code points like any other,
 they can be expressed in UTF-8. There are the occasional text
 files that have a BOM even though they are encoded as UTF-8. I
 feel that readText is too strict in that case. It should just
 silently ignore the BOM and continue because it is just
 meaningless. Throwing an exception seems completely wrong to me.

The BOMs for UTF-16 and UTF-32 are not valid UTF-8. e.g. this program will
print false for all 4 BOMs:

    import std.encoding;
    import std.stdio;
    import std.utf;

    void main()
    {
        foreach(i; [BOM.utf16be, BOM.utf16le, BOM.utf32be, BOM.utf32le])
            writeln(isValidUTF(cast(const char[])bomTable[i].sequence));
    }

    bool isValidUTF(const(char)[] str)
    {
        try
        {
            validate(str);
            return true;
        }
        catch(UTFException)
            return false;
    }

The BOM for UTF-8 is valid UTF-8, but the others aren't. And it would be
invalid to have a BOM for UTF-16 or UTF-32 and then treat the text as UTF-8
anyway. A program is free to ignore the BOM and then treat the rest of the
text in whatever encoding it wants, but if the BOM doesn't match the actual
encoding, then the text is invalid Unicode.

If you want to ignore the BOM, it's trivial enough to write a wrapper around
std.file.read which does that.
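
A rough sketch of such a wrapper, assuming std.encoding.getBOM is used to find the BOM. The names stripBOM and readWithoutBOM and the scratch file name are hypothetical.

```d
import std.encoding : getBOM;
import std.file : read, remove, write;

/// Hypothetical helper: returns the bytes that follow a leading BOM, if any.
const(ubyte)[] stripBOM(const(ubyte)[] data)
{
    immutable bom = getBOM(data);
    // For data without a BOM the matched sequence is empty, so nothing is cut.
    return data[bom.sequence.length .. $];
}

/// Hypothetical wrapper around std.file.read that ignores a leading BOM.
const(ubyte)[] readWithoutBOM(string path)
{
    return stripBOM(cast(const(ubyte)[]) read(path));
}

void main()
{
    // Hypothetical scratch file name, used only for this illustration.
    enum path = "bom_demo_strip.txt";

    // "hi" as UTF-8, preceded by the UTF-8 BOM (EF BB BF).
    immutable ubyte[] bytes = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    write(path, bytes);
    scope (exit) remove(path);

    auto text = cast(const(char)[]) readWithoutBOM(path);
    assert(text == "hi");
}
```

Note that this only discards the BOM. Interpreting the remaining bytes in the right encoding is still up to the caller.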

What Phobos is missing is a function akin to readText which looks at the BOM
and then converts the text to the requested encoding instead of throwing,
but that gets a bit more involved, and it probably wouldn't work with more
than UTF-8, UTF-16, and UTF-32. It also becomes problematic if you want it
to handle all of the cases where the text is UTF-16 or UTF-32 but doesn't
have a BOM, since that requires examining the text and guessing (as well as
potentially having to deal with reversing the endianness of the encoding).
Regardless, I definitely wouldn't expect a function like that to ignore the
BOM if it's present.
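
To give an idea of what such a function might involve, here is a sketch under stated assumptions, not a proposal for Phobos: it works on bytes already read into memory, handles only UTF-8, UTF-16 and UTF-32, and treats input without a BOM as UTF-8 instead of guessing. All helper names (toUnits, asUtf8, decodeByBOM) are hypothetical.

```d
import std.encoding : BOM, getBOM;
import std.exception : enforce;
import std.utf : toUTF8, validate;

/// Hypothetical helper: assembles code units of type T from raw bytes.
T[] toUnits(T)(const(ubyte)[] bytes, bool bigEndian)
{
    enforce(bytes.length % T.sizeof == 0, "input ends in the middle of a code unit");
    auto units = new T[bytes.length / T.sizeof];
    foreach (i, ref unit; units)
    {
        uint value = 0;
        foreach (j; 0 .. T.sizeof)
        {
            // Visit the bytes of this unit from most to least significant.
            immutable k = bigEndian ? j : T.sizeof - 1 - j;
            value = (value << 8) | bytes[i * T.sizeof + k];
        }
        unit = cast(T) value;
    }
    return units;
}

/// Hypothetical helper: validates the code units and re-encodes them as UTF-8.
string asUtf8(C)(const(C)[] text)
{
    validate(text); // throws UTFException on malformed input
    return toUTF8(text);
}

/// Hypothetical helper: decodes raw bytes to UTF-8 according to their BOM.
string decodeByBOM(const(ubyte)[] data)
{
    immutable bom = getBOM(data);
    auto payload = data[bom.sequence.length .. $];

    switch (bom.schema)
    {
        // Assumption of this sketch: no BOM means UTF-8.
        case BOM.none, BOM.utf8: return asUtf8(cast(const(char)[]) payload);
        case BOM.utf16le: return asUtf8(toUnits!wchar(payload, false));
        case BOM.utf16be: return asUtf8(toUnits!wchar(payload, true));
        case BOM.utf32le: return asUtf8(toUnits!dchar(payload, false));
        case BOM.utf32be: return asUtf8(toUnits!dchar(payload, true));
        default: throw new Exception("unsupported BOM");
    }
}

void main()
{
    immutable ubyte[] le = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00]; // "hi", UTF-16LE
    immutable ubyte[] be = [0xFE, 0xFF, 0x00, 0x68, 0x00, 0x69]; // "hi", UTF-16BE
    assert(decodeByBOM(le) == "hi");
    assert(decodeByBOM(be) == "hi");
}
```

Assembling the code units byte by byte sidesteps the endianness problem at the cost of a copy; it does nothing for BOM-less UTF-16 or UTF-32, which is the part that requires guessing.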

- Jonathan M Davis

Sep 16 2019

Gregor Mückl <gregormueckl gmx.de> writes:

On Monday, 16 September 2019 at 22:21:17 UTC, Jonathan M Davis 
wrote:
 The BOM for UTF-8 is valid UTF-8, but the others aren't. And it 
 would be invalid to have a BOM for UTF-16 or UTF-32 and then 
 treat the text as UTF-8 anyway. A program is free to ignore the 
 BOM and then treat the rest of the text in whatever encoding it 
 wants, but if the BOM doesn't match the actual encoding, then 
 the text is invalid Unicode.

Ah, I misunderstood. Apologies for that. My initial mistake was 
that I didn't quite understand that readText essentially ends up 
checking actual BOM encodings as a sanity check. So, yes, Phobos 
isn't breaking any standard here.

This is a point where adhering to Unicode terminology prevents 
confusion. There is only one BOM. But it is selected specifically 
so that it is encoded differently in each encoding scheme. To 
make that work, the endianness-reversed decoded versions of the BOM 
are defined to be illegal code points. It is entirely legal to 
have UTF-8 *encoded* BOM at the start of a UTF-8 string, but that 
is a sequence of bytes that is distinct from the UTF-16 little 
endian *encoded* BOM, which is also distinct from a UTF-16 big 
endian *encoded* BOM etc.
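
To make the distinction concrete, here is a small sketch that lists the byte sequence of the BOM in each encoding scheme, taken from std.encoding.bomTable. The helper name bomBytes is hypothetical.

```d
import std.encoding : BOM, bomTable;
import std.stdio : writefln;

/// Hypothetical helper: the bytes of the BOM as encoded in the given scheme.
immutable(ubyte)[] bomBytes(BOM schema)
{
    return bomTable[schema].sequence;
}

void main()
{
    // Print each scheme next to the bytes its encoded BOM consists of.
    foreach (schema; [BOM.utf8, BOM.utf16be, BOM.utf16le, BOM.utf32be, BOM.utf32le])
        writefln("%-8s %(%02X %)", schema, bomBytes(schema));

    // The one code point U+FEFF, encoded as UTF-8, is exactly the UTF-8 BOM.
    assert(cast(immutable(ubyte)[]) "\uFEFF" == bomBytes(BOM.utf8));
    // The two UTF-16 forms are the same two bytes in opposite order.
    assert(bomBytes(BOM.utf16be) == [0xFE, 0xFF]);
    assert(bomBytes(BOM.utf16le) == [0xFF, 0xFE]);
}
```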

Sep 17 2019

Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:

On Sunday, 15 September 2019 at 23:50:27 UTC, solidstate1991 
wrote:
 This is what I get when I try to run a unittest on one of my 
 projects.

 Here's the file that generates this error: 
 https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d


I think the short answer everyone is trying to get to is:

You need to use readText!wstring(file) because the file claims to 
be UTF-16 and reading it as UTF-8 would be wrong.
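
A minimal sketch of that, assuming the file is UTF-16 in the machine's native byte order and that UTF-8 is what the rest of the program wants. The helper name readUtf16AsUtf8 and the scratch file name are hypothetical.

```d
import std.file : readText, remove, write;
import std.utf : toUTF8;

/// Hypothetical helper: reads a native-endian UTF-16 file and returns UTF-8.
string readUtf16AsUtf8(string path)
{
    wstring text = readText!wstring(path);
    // Drop a leading U+FEFF in case the BOM is still part of the text.
    if (text.length > 0 && text[0] == '\uFEFF')
        text = text[1 .. $];
    return toUTF8(text);
}

void main()
{
    // Hypothetical scratch file name, used only for this illustration.
    enum path = "bom_demo_fix.txt";

    // Dump native-endian UTF-16 code units, starting with the BOM (U+FEFF).
    wstring content = "\uFEFFhi"w;
    write(path, content);
    scope (exit) remove(path);

    assert(readUtf16AsUtf8(path) == "hi");
}
```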

However, the code shown has no readText, so we don't know exactly 
where your issue lies.

Sep 18 2019
