
digitalmars.D.learn - Files and UTF

reply Mike Surette <mjsurette gmail.com> writes:
In my efforts to learn D I am writing some code to read files in 
different UTF encodings with the aim of having them end up as 
UTF-8 internally. As a start I have the following code:

import std.stdio;
import std.file;

void main(string[] args)
{
     if (args.length == 2)
     {
         if (args[1].exists && args[1].isFile)
         {
             auto f = File(args[1]);
             writeln(args[1]);

             for (auto i = 1; i <= 3; ++i)
                 write(f.readln);
         }
     }
}

It works well, outputting the file name and the first three lines of the
file properly, regardless of the encoding of the file. The exception is
when the file is UTF-16: with both LE and BE encodings, two characters
representing the BOM are printed.

I assume that write detects the encoding of the string returned by
readln and prints it correctly, rather than readln converting its input
to a consistent encoding. Is this correct?

Is there a way to remove the BOM from the input buffer and still 
know the encoding of the file?

Is there a D idiomatic way to do what I want to do?

Mike
Aug 05 2020
parent reply WebFreak001 <d.forum webfreak.org> writes:
On Wednesday, 5 August 2020 at 17:39:36 UTC, Mike Surette wrote:
 In my efforts to learn D I am writing some code to read files 
 in different UTF encodings with the aim of having them end up 
 as UTF-8 internally. As a start I have the following code:

 import std.stdio;
 import std.file;

 void main(string[] args)
 {
     if (args.length == 2)
     {
         if (args[1].exists && args[1].isFile)
         {
             auto f = File(args[1]);
             writeln(args[1]);

             for (auto i = 1; i <= 3; ++i)
                 write(f.readln);
         }
     }
 }

 It works well, outputting the file name and the first three lines of
 the file properly, regardless of the encoding of the file. The
 exception is when the file is UTF-16: with both LE and BE encodings,
 two characters representing the BOM are printed.

 I assume that write detects the encoding of the string returned by
 readln and prints it correctly, rather than readln converting its
 input to a consistent encoding. Is this correct?

 Is there a way to remove the BOM from the input buffer and 
 still know the encoding of the file?

 Is there a D idiomatic way to do what I want to do?

 Mike
all strings in D are _assumed_ to be UTF-8, so your I/O reading function needs to check that it is actually UTF-8. File/File.readln does not do that, so you are actually getting UTF-16 bytes in your string, not UTF-8 bytes.

What you are seeing through writeln is not fully correct: if you only test English characters, there are null bytes (0) next to each English character byte, which aren't rendered in the console. You can verify this with this simple code:

auto s = File("test.txt").readln;
writefln("%(%02x %)", s.representation);

result:

ff fe 68 00 65 00 6c 00 6c 00 6f 00 20 00 77 00 6f 00 72 00 6c 00 64 00 21 00 0d 00 0a

You can see there is a UTF-16 LE BOM in there (ff fe), followed by the English characters, each encoded as its character byte plus a null byte.

Basically what you want to do is write a function converting the input encoding to UTF-8 so you can use it in D strings. If you want to get into this yourself, there is std.encoding which offers you most of the functionality:

https://dlang.org/phobos/std_encoding.html

You can use the getBOM function to try to determine the encoding by BOM and can then convert it to UTF-8 from the source encoding using the `transcode` function or manually using the encoding classes. If you don't have a BOM you need some kind of algorithm to determine the encoding of your file. If you don't want to do that and just want to check if it's UTF-8, use the `std.utf : validate` function, which throws if your string is not valid UTF-8 and otherwise does nothing.

If you only support UTF-8, UTF-16 and possibly UTF-32, you can use just the std.utf methods after determining the BOM to lazily convert without allocating memory (useful if you only go over your string linearly without going back):

import std;

void main()
{
    // readln is actually rather unsafe for this! You should use std.file.read
    // or File.rawRead or read byte chunks instead. For chunking you need to adjust
    // the encode API however and probably make a helper struct with a small buffer.
    string s = File("test.txt").readln;
    // need to remove the line terminator before encoding
    // (it's encoded in UTF-8, potentially after UTF-16)
    if (s.length)
        s = s[0 .. $ - 1];
    string e = encodeUTF8(s.representation);
    writefln("%s\n(%(%02x %))", s, s.representation);
    writefln("%s\n(%(%02x %))", e, e.representation);
}

string encodeUTF8(immutable(ubyte)[] bytes)
{
    auto bom = getBOM(bytes);
    bytes = bytes[bom.sequence.length .. $];
    switch (bom.schema)
    {
    // optionally we could validate, but we just trust because there is a UTF-8 BOM
    case BOM.utf8:
        return cast(string) bytes;
    case BOM.utf16le:
        return convertUTF!wchar(bytes, true);
    case BOM.utf16be:
        return convertUTF!wchar(bytes, false);
    case BOM.utf32le:
        return convertUTF!dchar(bytes, true);
    case BOM.utf32be:
        return convertUTF!dchar(bytes, false);
    default:
        string input = cast(string) bytes;
        validate(input);
        return input;
    }
}

private string convertUTF(T)(scope const(ubyte)[] bytes, bool littleEndian)
{
    // T.sizeof expecting 2 or 4 (which divided maps to 0 and 1)
    enum name = ["UTF-16", "UTF-32"][T.sizeof / 4];
    alias Int = AliasSeq!(ushort, uint)[T.sizeof / 4];

    if (bytes.length % T.sizeof != 0)
        throw new Exception("File is " ~ name ~ ", but got "
            ~ bytes.length.to!string ~ " bytes, which is not a multiple of "
            ~ T.sizeof.stringof ~ "!");

    scope Int[] units = (cast(Int*) bytes.ptr)[0 .. bytes.length / T.sizeof];

    version (LittleEndian)
        bool swap = !littleEndian; // CPU is little endian, swap if file is big endian
    else
        bool swap = littleEndian; // CPU is big endian, swap if file is little endian

    if (swap)
        swapAllEndian(units);
    scope wstr = cast(const(T)[]) units;
    auto ret = wstr.toUTF8;
    // because we are operating in-place, we need to revert to keep memory consistent
    // if you don't use the byte data anywhere else, you could omit this
    // (note this could be unsafe though)
    if (swap)
        swapAllEndian(units);
    return ret;
}

private void swapAllEndian(T)(T[] data)
{
    // TODO: could probably optimize this with SIMD instructions
    foreach (i; 0 .. data.length)
        data[i] = swapEndian(data[i]);
}

Example library if you want to guess the encoding without a BOM:
https://code.dlang.org/packages/libguess-d

Used API docs:
https://dlang.org/phobos/std_bitmanip.html#swapEndian <- swapping BE/LE for native encoding
https://dlang.org/phobos/std_encoding.html <- BOM detection, transcoding capabilities for encodings other than UTF
https://dlang.org/phobos/std_utf.html <- low level UTF-8 encoding/decoding, lazy decoding, validation
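For completeness: once the bytes have been reinterpreted as one of D's UTF string types (and byte-swapped to native endianness), std.encoding's `transcode` alone can do the UTF-to-UTF conversion without the manual encodeUTF8/convertUTF machinery above. A minimal sketch (the file name and string contents here are just placeholders):

```d
import std.encoding : transcode;

void main()
{
    // transcode converts directly between the UTF string types
    wstring w = "héllo wörld"w; // UTF-16 in memory
    string s;
    transcode(w, s);            // UTF-16 -> UTF-8
    assert(s == "héllo wörld");

    dstring d;
    transcode(s, d);            // UTF-8 -> UTF-32
    assert(d == "héllo wörld"d);
}
```

This only helps after the endianness issue is dealt with, since wstring/dstring are always native-endian.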
Aug 06 2020
parent reply WebFreak001 <d.forum webfreak.org> writes:
On Thursday, 6 August 2020 at 07:19:37 UTC, WebFreak001 wrote:
 [...]
In line 11 in my example code this makes a better, safer if than `if (s.length)`:

if (s.length && s[$ - 1] == '\n')
    s = s[0 .. $ - 1];

Note that I only need to do this because of the readln API; it would be much safer and more correct to instead load the whole file into memory in a byte[].
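As an aside, `std.string.chomp` does exactly this check-and-slice, and also copes with a \r\n terminator; a minimal sketch:

```d
import std.string : chomp;

void main()
{
    // chomp removes a single trailing line terminator, if present
    assert(chomp("hello world!\n") == "hello world!");
    assert(chomp("hello world!\r\n") == "hello world!"); // \r\n removed as a unit
    assert(chomp("hello world!") == "hello world!");     // no terminator: unchanged
}
```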
Aug 06 2020
parent Mike Surette <mjsurette gmail.com> writes:
On Thursday, 6 August 2020 at 07:24:23 UTC, WebFreak001 wrote:
 On Thursday, 6 August 2020 at 07:19:37 UTC, WebFreak001 wrote:
 [...]
 In line 11 in my example code this makes a better, safer if than `if (s.length)`:

 if (s.length && s[$ - 1] == '\n')
     s = s[0 .. $ - 1];

 Note that I only need to do this because of the readln API; it would be much safer and more correct to instead load the whole file into memory in a byte[].
Thanks for the detailed answer. Now to study it. Mike
Aug 06 2020