
digitalmars.D.learn - Files and UTF

reply Mike Surette <mjsurette gmail.com> writes:
In my efforts to learn D I am writing some code to read files in 
different UTF encodings with the aim of having them end up as 
UTF-8 internally. As a start I have the following code:

import std.stdio;
import std.file;

void main(string[] args)
{
     if (args.length == 2)
     {
         if (args[1].exists && args[1].isFile)
         {
             auto f = File(args[1]);
             writeln(args[1]);

             for (auto i = 1; i <= 3; ++i)
                 write(f.readln);
         }
     }
}

It works well, outputting the file name and the first three lines of the
file properly, regardless of the encoding of the file. The exception is
when the file is UTF-16: with both LE and BE encodings, two characters
representing the BOM are printed.

I assume that write detects the encoding of the string returned by
readln and prints it correctly, rather than readln converting its input
to a consistent encoding. Is this correct?

Is there a way to remove the BOM from the input buffer and still 
know the encoding of the file?

Is there a D idiomatic way to do what I want to do?

Mike
Aug 05 2020
parent reply WebFreak001 <d.forum webfreak.org> writes:
On Wednesday, 5 August 2020 at 17:39:36 UTC, Mike Surette wrote:
 In my efforts to learn D I am writing some code to read files 
 in different UTF encodings with the aim of having them end up 
 as UTF-8 internally. As a start I have the following code:

 import std.stdio;
 import std.file;

 void main(string[] args)
 {
     if (args.length == 2)
     {
         if (args[1].exists && args[1].isFile)
         {
             auto f = File(args[1]);
             writeln(args[1]);

             for (auto i = 1; i <= 3; ++i)
                 write(f.readln);
         }
     }
 }

 It works well, outputting the file name and the first three lines of
 the file properly, regardless of the encoding of the file. The
 exception is when the file is UTF-16: with both LE and BE encodings,
 two characters representing the BOM are printed.

 I assume that write detects the encoding of the string returned by
 readln and prints it correctly, rather than readln converting its
 input to a consistent encoding. Is this correct?

 Is there a way to remove the BOM from the input buffer and 
 still know the encoding of the file?

 Is there a D idiomatic way to do what I want to do?

 Mike
all strings in D are _assumed_ to be UTF-8, so your I/O reading function needs to check that it is actually UTF-8. File/File.readln does not do that, so you are actually getting UTF-16 bytes in your string, not UTF-8 bytes.

What you are seeing through writeln is not fully correct: if you only test English characters, there are null bytes (0) next to each English character byte, which aren't rendered in the console. You can verify this with this simple code:

auto s = File("test.txt").readln;
writefln("%(%02x %)", s.representation);

result:

ff fe 68 00 65 00 6c 00 6c 00 6f 00 20 00 77 00 6f 00 72 00 6c 00 64 00 21 00 0d 00 0a

You can see there is a UTF-16 LE BOM in there (ff fe), followed by the English characters, each encoded as its character byte plus a null byte.

Basically what you want to do is write a function converting the input encoding to UTF-8 so you can use it in D strings. If you want to get into this yourself, there is std.encoding which offers you most of the functionality:

https://dlang.org/phobos/std_encoding.html

You can use the getBOM function to try to determine the encoding by BOM and can then convert it to UTF-8 from the source encoding using the `transcode` function or manually using the encoding classes. If you don't have a BOM you need some kind of algorithm to determine the encoding of your file. If you don't want to do that and just want to check if it's UTF-8, use the `std.utf : validate` function, which throws if your string is not valid UTF-8 and otherwise does nothing.

If you only support UTF-8, UTF-16 and possibly UTF-32, you can use just the std.utf methods after determining the BOM to lazily convert without allocating memory (useful if you only go over your string linearly without going back):

import std;

void main()
{
    // readln is actually rather unsafe for this! You should use std.file.read
    // or File.rawRead or read byte chunks instead. For chunking you need to adjust
    // the encode API however and probably make a helper struct with a small buffer.
    string s = File("test.txt").readln;
    // need to remove the line terminator before encoding
    // (it's encoded in UTF-8, potentially after UTF-16)
    if (s.length)
        s = s[0 .. $ - 1];
    string e = encodeUTF8(s.representation);
    writefln("%s\n(%(%02x %))", s, s.representation);
    writefln("%s\n(%(%02x %))", e, e.representation);
}

string encodeUTF8(immutable(ubyte)[] bytes)
{
    auto bom = getBOM(bytes);
    bytes = bytes[bom.sequence.length .. $];
    switch (bom.schema)
    {
    // optionally we could validate, but we just trust because there is a UTF-8 BOM
    case BOM.utf8:
        return cast(string) bytes;
    case BOM.utf16le:
        return convertUTF!wchar(bytes, true);
    case BOM.utf16be:
        return convertUTF!wchar(bytes, false);
    case BOM.utf32le:
        return convertUTF!dchar(bytes, true);
    case BOM.utf32be:
        return convertUTF!dchar(bytes, false);
    default:
        string input = cast(string) bytes;
        validate(input);
        return input;
    }
}

private string convertUTF(T)(scope const(ubyte)[] bytes, bool littleEndian)
{
    // T.sizeof expecting 2 or 4 (which divided maps to 0 and 1)
    enum name = ["UTF-16", "UTF-32"][T.sizeof / 4];
    alias Int = AliasSeq!(ushort, uint)[T.sizeof / 4];

    if (bytes.length % T.sizeof != 0)
        throw new Exception("File is " ~ name ~ ", but got "
            ~ bytes.length.to!string ~ " bytes, which is not a multiple of "
            ~ T.sizeof.stringof ~ "!");

    scope Int[] units = (cast(Int*) bytes.ptr)[0 .. bytes.length / T.sizeof];

    version (LittleEndian)
        bool swap = !littleEndian; // CPU is little endian, swap if file is big endian
    else
        bool swap = littleEndian; // CPU is big endian, swap if file is little endian

    if (swap)
        swapAllEndian(units);
    scope wstr = cast(const(T)[]) units;
    auto ret = wstr.toUTF8;
    // because we are operating in-place, we need to revert to keep memory consistent
    // if you don't use the byte data anywhere else, you could omit this
    // (note this could be unsafe though)
    if (swap)
        swapAllEndian(units);
    return ret;
}

private void swapAllEndian(T)(T[] data)
{
    // TODO: could probably optimize this with SIMD instructions
    foreach (i; 0 .. data.length)
        data[i] = swapEndian(data[i]);
}

Example library if you want to guess the encoding without a BOM:
https://code.dlang.org/packages/libguess-d

Used API docs:
https://dlang.org/phobos/std_bitmanip.html#swapEndian <- swapping BE/LE for native encoding
https://dlang.org/phobos/std_encoding.html <- BOM detection, transcoding capabilities for encodings other than UTF
https://dlang.org/phobos/std_utf.html <- low level UTF-8 encoding/decoding, lazy decoding, validation
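For completeness: once the bytes have been reinterpreted as one of D's UTF string types (and byte-swapped to native endianness), std.encoding's `transcode` alone can do the UTF-to-UTF conversion without the manual encodeUTF8/convertUTF machinery above. A minimal sketch (the file name and string contents here are just placeholders):

```d
import std.encoding : transcode;

void main()
{
    // transcode converts directly between the UTF string types
    wstring w = "héllo wörld"w; // UTF-16 in memory
    string s;
    transcode(w, s);            // UTF-16 -> UTF-8
    assert(s == "héllo wörld");

    dstring d;
    transcode(s, d);            // UTF-8 -> UTF-32
    assert(d == "héllo wörld"d);
}
```

This only helps after the endianness issue is dealt with, since wstring/dstring are always native-endian.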
Aug 06 2020
parent reply WebFreak001 <d.forum webfreak.org> writes:
On Thursday, 6 August 2020 at 07:19:37 UTC, WebFreak001 wrote:
 [...]
In line 11 in my example code this makes a better, safer if than `if (s.length)`:

if (s.length && s[$ - 1] == '\n')
    s = s[0 .. $ - 1];

Note that I only need to do this because of the readln API; it would be much safer and more correct to instead load the whole file into memory in a byte[].
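As an aside, `std.string.chomp` does exactly this check-and-slice, and also copes with a \r\n terminator; a minimal sketch:

```d
import std.string : chomp;

void main()
{
    // chomp removes a single trailing line terminator, if present
    assert(chomp("hello world!\n") == "hello world!");
    assert(chomp("hello world!\r\n") == "hello world!"); // \r\n removed as a unit
    assert(chomp("hello world!") == "hello world!");     // no terminator: unchanged
}
```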
Aug 06 2020
parent Mike Surette <mjsurette gmail.com> writes:
On Thursday, 6 August 2020 at 07:24:23 UTC, WebFreak001 wrote:
 On Thursday, 6 August 2020 at 07:19:37 UTC, WebFreak001 wrote:
 [...]
 In line 11 in my example code this makes a better, safer if than `if (s.length)`:

 if (s.length && s[$ - 1] == '\n')
     s = s[0 .. $ - 1];

 Note that I only need to do this because of the readln API; it would be much safer and more correct to instead load the whole file into memory in a byte[].
Thanks for the detailed answer. Now to study it. Mike
Aug 06 2020