digitalmars.D.learn - std.file: read, readText and UTF-8 decoding

Uranuz <neuranuz gmail.com> writes:

Hello!
I have a strange problem. I am trying to parse XML files and 
extract some information from them.
I use Jonathan M Davis's dxml library for this. My problem is 
that I have multiple XML files made by different people around 
the world. Some of these files were created with a Byte Order 
Mark (BOM) and some without. dxml expects no BOM at the start 
of the string.
At first I tried to read the file with std.file.readText. It looks 
like it doesn't decode the file in any way and doesn't remove the 
BOM, so dxml failed to parse it. This seems strange to me, because 
I expected a "text" function to decode the data to UTF-8. Then I 
read that this behavior is at least documented:
"""
...However, no width or endian conversions are performed. So, if 
the width or endianness of the characters in the given file 
differ from the width or endianness of the element type of S, 
then validation will fail.
"""
So that's OK, but I understood that "readText" is not useful 
for me.
So I tried to use plain "read", which returns "void[]". The 
problem is that I still don't understand which function I should 
use to convert this to a string with proper UTF-8 decoding, BOM 
removal, etc.
Could you please help me clarify this?
P.S. The function readText looks odd in std.file, because you 
cannot specify which encoding to use to decode the file, and the 
logic of how it decodes is unclear...

Sep 21 2023

Uranuz <neuranuz gmail.com> writes:

Addition:
The current solution to this problem that I found is below. 
I just check for the BOM manually: I get the length of 
bom.sequence and remove that many bytes from the beginning. But I 
don't think it's a convenient solution, because who knows how many 
other issues with UTF could happen. And I don't think it's right 
to handle them on the side of users of the standard D library... 
I think there should be a solution "out of the box". It might not 
be very efficient, but it should at least "just work" without 
extra steps...
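import std.encoding : getBOM;
import std.file : read;
import std.stdio : writeln;

string[] getGroupsFromFile(string groupFilePath)
{
	writeln("Parse file ", groupFilePath);
	string[] groupNames = [];
	// Read the raw bytes; no decoding or validation happens here.
	auto rawContent = cast(ubyte[]) read(groupFilePath);
	// bom.sequence is empty when the file has no BOM.
	auto bom = getBOM(rawContent);
	// Drop the BOM bytes; this assumes the rest is UTF-8.
	string content = cast(string) rawContent[bom.sequence.length..$];
	writeln("Content:\n", content);

	//... work with XML
	return groupNames;
}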

Sep 21 2023

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

readText works great as long as you know that you're dealing with files with
a specific encoding and without a BOM (which is very often true when dealing
with text files on *nix systems where they're using UTF-8), but it's not so
great when you're reading files where you have no clue what their encoding
is going to be (and it's worse on Windows where they unfortunately are much
more likely to be UTF-16). Phobos does give you the tools to solve the
problem, but it doesn't currently make it as easy as it arguably should be.
std.encoding has the pieces that you're missing here.

https://dlang.org/phobos/std_encoding.html#BOM
https://dlang.org/phobos/std_encoding.html#getBOM

You'll need to do something like

    import std.encoding : BOM, getBOM;
    import std.file : read;

    auto data = read(file);
    immutable bom = getBOM(cast(ubyte[])data).schema;

to get the BOM. Then you can compare the BOM against BOM.utf8, BOM.utf16le,
etc. so that you know what type to cast the data array to (string, wstring,
etc.). Then you can remove the BOM with something like

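    import std.range.primitives : ElementType, empty, isForwardRange, save;
    import std.traits : isSomeChar;

    R stripBOM(R)(R range)
        if(isForwardRange!R && isSomeChar!(ElementType!R))
    {
        import std.utf : decodeFront, UseReplacementDchar;
        if(range.empty)
            return range;
        // Keep a copy, since decodeFront pops the decoded character.
        auto orig = range.save;
        immutable c = range.decodeFront!(UseReplacementDchar.yes)();
        // Drop the first character only if it was a BOM.
        return c == '\uFEFF' ? range : orig;
    }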

And then you either operate on the array with its current encoding type,
convert it to the desired string type (e.g. to!string) or wrap it in a type
that converts it as you parse it (e.g. std.utf.byChar).
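
Putting the steps above together might look roughly like the following. This is only a minimal sketch, not an existing Phobos API: the name readTextAsUTF8 is hypothetical, a file without a BOM is assumed to be UTF-8, and only UTF-8 and UTF-16 are handled (anything else throws). It slices the BOM bytes off using bom.sequence.length, as in the workaround earlier in the thread, and assembles the UTF-16 code units byte by byte so that it does not depend on the endianness of the machine it runs on.

```d
import std.conv : to;
import std.encoding : BOM, getBOM;
import std.exception : enforce;
import std.file : read;
import std.utf : validate;

/// Hypothetical helper (not part of Phobos): reads a file, uses its BOM
/// (if any) to pick the encoding, drops the BOM and returns UTF-8 text.
string readTextAsUTF8(string path)
{
    auto data = cast(ubyte[]) read(path);
    immutable bom = getBOM(data);
    // bom.sequence is empty when no BOM was found.
    auto payload = data[bom.sequence.length .. $];

    // Assumption: a file without a BOM is treated as UTF-8.
    if (bom.schema == BOM.none || bom.schema == BOM.utf8)
    {
        auto text = cast(string) payload;
        validate(text); // throws UTFException on invalid UTF-8
        return text;
    }

    enforce(bom.schema == BOM.utf16le || bom.schema == BOM.utf16be,
            "unsupported encoding in " ~ path);
    enforce(payload.length % 2 == 0, "truncated UTF-16 data in " ~ path);

    // Build each UTF-16 code unit from its two bytes in file order.
    immutable lowFirst = bom.schema == BOM.utf16le;
    auto units = new wchar[payload.length / 2];
    foreach (i, ref unit; units)
    {
        immutable first = payload[2 * i];
        immutable second = payload[2 * i + 1];
        unit = cast(wchar) (lowFirst ? (second << 8) | first
                                     : (first << 8) | second);
    }
    // to!string transcodes UTF-16 to UTF-8.
    return units.to!string;
}
```

The resulting string carries no BOM, so it could then be handed to a parser that expects none.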

Alternatively, you can just read the very beginning of the file and grab the
BOM that way and then call readText with the correct type after you've
figured out the file's encoding.
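
Checking only the start of the file could be sketched like this. The helper name sniffBOM is hypothetical, and the four-byte buffer is an assumption that is enough for the UTF-8, UTF-16 and UTF-32 BOMs.

```d
import std.encoding : BOM, getBOM;
import std.stdio : File;

/// Hypothetical helper: reports which BOM (if any) a file starts with,
/// reading at most four bytes instead of the whole file.
BOM sniffBOM(string path)
{
    ubyte[4] buffer;
    // rawRead returns the slice of the buffer that was actually filled,
    // which is shorter than four bytes for very small files.
    auto head = File(path, "rb").rawRead(buffer[]);
    return getBOM(head).schema;
}
```

The result can then be compared against BOM.utf8, BOM.utf16le, etc. to decide whether to call readText!string, readText!wstring and so on. As noted in this thread, readText does no endian conversion and does not strip the BOM, so that still has to be handled separately.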

readText should currently handle the BOM correctly insofar as it checks
whether you made the correct choice when you told it whether you wanted a
string, wstring, etc., but since it reads in the entire file, it's not a
great plan to try it with each encoding (catching each exception in turn)
until you get the right one, and it doesn't strip the BOM off for you.

So, Phobos probably should get some new functionality to handle this better,
but it's at least possible to make it work with what's there.

- Jonathan M Davis

Sep 21 2023

Uranuz <neuranuz gmail.com> writes:

OK. Thanks for the response. I wish there were some API to 
handle it "out of the box". Do I need to open an issue or 
something so that this isn't forgotten?

Sep 21 2023

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Friday, September 22, 2023 12:28:39 AM MDT Uranuz via Digitalmars-d-learn 
wrote:
 OK. Thanks for the response. I wish there were some API to
 handle it "out of the box". Do I need to open an issue or
 something so that this isn't forgotten?

You can open an issue if you want, though I don't know how much that will
help it be remembered given how many issues there are to sort through.

I'll probably get around to writing something eventually (particularly since
this issue is more likely to come up when using dxml than with many other
use cases), but I have a variety of items on my todo list.

- Jonathan M Davis

Sep 22 2023
