digitalmars.D.learn - UTFException when reading a file

Head Scratcher <filter stumped.com> writes:
I am using readText to read a file into a string. I am getting a 
UTFException on the file. It is probably because the file has an 
extended ANSI character that is not UTF-8.

How can I read the file and convert the string into proper UTF-8 
in memory without an exception?
Jan 11
Adam D. Ruppe <destructionator gmail.com> writes:
On Friday, 11 January 2019 at 19:45:05 UTC, Head Scratcher wrote:
> How can I read the file and convert the string into proper
> UTF-8 in memory without an exception?
Use regular read() instead of readText, and then convert it using another function. Phobos has std.encoding, which offers a transcode function:

http://dpldocs.info/experimental-docs/std.encoding.transcode.html

You would cast to the input type:

---
import std.encoding;
import std.file;

void main() {
    string s;
    // the read here replaces your readText
    // and the cast tells what encoding it has now
    transcode(cast(Latin1String) read("ooooo.d"), s);

    import std.stdio;
    // and after that, the utf-8 string is in s
    writeln(s);
}
---

Or, since I didn't like the Phobos module for my web scraping needs, I made my own:

https://github.com/adamdruppe/arsd/blob/master/characterencodings.d

Just drop that file in your build and call this function:

http://dpldocs.info/experimental-docs/arsd.characterencodings.convertToUtf8Lossy.html

---
import arsd.characterencodings;
import std.file;

void main() {
    string s = convertToUtf8Lossy(read("ooooo.d"), "iso_8859-1");
    // you can now use s
}
---

Just change the encoding string to whatever encoding the file actually has.

But it is possible neither my module nor the Phobos one has the encoding you need...
Jan 11
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Jan 11, 2019 at 07:45:05PM +0000, Head Scratcher via
Digitalmars-d-learn wrote:
> I am using readText to read a file into a string. I am getting a
> UTFException on the file. It is probably because the file has an
> extended ANSI character that is not UTF-8.
> How can I read the file and convert the string into proper UTF-8 in
> memory without an exception?
What's the encoding of the file? Without knowing the original encoding, there is no way to get UTF-8 out of it without the risk of some data being lost / garbled.

Take a look at std.encoding to see if your file's encoding is already supported. If not, you may have to read the file in binary and do the conversion into UTF-8 yourself. Or use an external program to re-encode your file into UTF-8. On Posix systems, the 'recode' utility will help you do this.

T

--
To err is human; to forgive is not our policy. -- Samuel Adler
Jan 11
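
For the plainest case, the "read the file in binary and do the conversion yourself" route mentioned above can be sketched in a few lines. This assumes the file is ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point; `latin1ToUtf8` is a hypothetical helper name, not a Phobos function, and other encodings such as Windows-1252 would need a lookup table instead.

```d
import std.utf : encode;

/// Hypothetical helper: converts ISO-8859-1 bytes to a UTF-8 string.
/// Assumes Latin-1 input, where byte N is code point U+00NN.
string latin1ToUtf8(const(ubyte)[] bytes)
{
    char[] result;
    foreach (b; bytes)
    {
        // std.utf.encode appends the UTF-8 encoding of the code point
        encode(result, cast(dchar) b);
    }
    return result.idup;
}

void main()
{
    import std.stdio : writeln;

    // "café" as Latin-1 bytes; 0xE9 alone is not valid UTF-8
    immutable ubyte[] sample = [0x63, 0x61, 0x66, 0xE9];
    writeln(latin1ToUtf8(sample));
}
```

For a real file, the bytes would come from `cast(const(ubyte)[]) read(filename)` with `std.file.read`.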
Dennis <dkorpel gmail.com> writes:
On Friday, 11 January 2019 at 19:45:05 UTC, Head Scratcher wrote:
> How can I read the file and convert the string into proper
> UTF-8 in memory without an exception?
You have multiple options:

```
import std.file: read;
import std.encoding: transcode, Windows1252String;

auto ansiStr = cast(Windows1252String) read(filename);
string utf8string;
transcode(ansiStr, utf8string);
```
If it's ANSI.

```
import std.file: read;
import std.encoding: sanitize;

auto sanitized = (cast(string) read(filename)).sanitize;
```
If it's incorrect UTF-8, eager.

```
import std.exception: handle, RangePrimitive;
import std.file: read;
import std.utf: UTFException;

auto str = cast(string) read(filename); // raw file contents, not yet validated
auto handled = str.handle!(UTFException, RangePrimitive.access,
    (e, r) => ' '); // Replace invalid code points with spaces
```
If it's incorrect UTF-8, lazy.

See:
https://dlang.org/phobos/std_encoding.html#transcode
https://dlang.org/phobos/std_exception.html#handle
Jan 11
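
The suggestions in this thread can be combined into one fallback routine: accept the bytes as they are if they are already valid UTF-8, and otherwise transcode them. This is only a sketch; `readAsUtf8` is a hypothetical name, and Windows-1252 as the fallback is an assumption about what "extended ANSI" means for the file in question.

```d
import std.encoding : transcode, Windows1252String;
import std.file : read;
import std.utf : validate, UTFException;

/// Hypothetical helper: returns the file contents as UTF-8.
/// Assumes that anything which is not valid UTF-8 is Windows-1252.
string readAsUtf8(string filename)
{
    auto raw = read(filename); // raw bytes, no UTF validation yet
    try
    {
        auto text = cast(string) raw;
        validate(text); // throws UTFException on malformed UTF-8
        return text;
    }
    catch (UTFException e)
    {
        // Not UTF-8: reinterpret the same bytes as Windows-1252 and convert
        string converted;
        transcode(cast(Windows1252String) raw, converted);
        return converted;
    }
}

void main(string[] args)
{
    import std.stdio : writeln;

    if (args.length > 1)
        writeln(readAsUtf8(args[1]));
}
```

As noted earlier in the thread, a wrong guess about the source encoding will not raise an error here; it will silently produce garbled text.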