
digitalmars.D.learn - Proper way to fix

reply Dr.No <jckj33 gmail.com> writes:
I'm reading, line by line, a CSV file provided by the user, which is 
assumed to be UTF-8. But a user provided an ANSI file, which resulted 
in this error:

core.exception.UnicodeException src\rt\util\utf.d(292): invalid 
UTF-8 sequence

(It happened when the user took the originally UTF-8 encoded file 
generated by another application, made some edits in an editor whose 
name I don't know, and saved it, unaware that the editor was changing 
the encoding to ANSI.)

My question is: what's the proper way to solve this? Using toUTF8 
didn't help:
 while((line = csvFile.readln().toUTF8) !is null) {
I didn't find a way to explicitly set the encoding with std.stdio.File 
so that it reads as UTF-8 regardless of whether the file is ANSI or 
already UTF-8. I don't want to convert the whole file to UTF-8: the 
CSV file can be large, and that might take quite a while. And I'd have 
to do the conversion on a temporary copy of the file (which would make 
things even slower) to avoid touching the user's original file.

I thought about writing my own readLine() with std.stdio.File.byChunk, 
taking bytes until a '\n' byte is seen, then treating the result as 
UTF-8 and returning it. But I'd rather not reinvent the wheel and use 
something native, if possible. Any ideas?
Apr 06 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Friday, April 06, 2018 16:10:56 Dr.No via Digitalmars-d-learn wrote:
> I'm reading line by line the lines from a CSV file provided by
> the user which is assumed to be UTF8. But an user has provided an
> ANSI file which resulted in the error:
>
> core.exception.UnicodeException src\rt\util\utf.d(292): invalid
> UTF-8 sequence
>
> (it happend when the user took the originally UTF8 encoded file
> generated by another application, made some edit using an editor
> (which I don't know the name) then saved not aware it was changing
> the encoding to ANSI. My question is: what's the proper way to
> solve that? using toUTF8 didn't solve:
>
>  while((line = csvFile.readln().toUTF8) !is null) {
>
> I didn't find a way to set explicitly the encoding with
> std.stdio.File to set to UTF8 regardless it's an ANSI or already
> UTF8. I don't want to conver the whole file to UTF8, the CSV file
> can be large and might take quite while. And if I do so to a
> temporary copy the file (which will make things even more slow) to
> avoid touch user's original file. I thought in writing my own
> readLine() with std.stdio.File.byChunk to take as many bytes as
> possible until '\n' byte is seen, treat it as UTF8 and return. But
> I'd like to not reinvent the wheel and use something native, if
> possible. Any ideas?
In general, Phobos pretty much requires that text files be in UTF-8, or 
that they be in UTF-16 or UTF-32 with the native endianness. Certainly, 
char, wchar, and dchar are assumed by both the language and the library 
to be valid UTF-8, UTF-16, and UTF-32 respectively, and if you do 
anything with a string as a range, by default, it will decode the code 
units to code points, meaning that it will validate the UTF.

If you want to read in anything that is not going to be valid UTF, then 
you're going to have to read it in as ubytes rather than as characters. 
If you're dealing with text that is supposed to be valid UTF but might 
contain invalid characters, then you can use std.utf.byCodeUnit to 
treat a string as a range of code units where it replaces invalid UTF 
with the Unicode replacement character, but you'd still have to read in 
the text as bytes (e.g. readText validates that the text is proper 
UTF).

std.utf.byUTF could be used instead of byCodeUnit to convert to UTF-8, 
UTF-16, or UTF-32, and invalid UTF will be replaced with the 
replacement character (so the input doesn't have to be the target 
encoding, but it does have to be one of those three encodings and not 
something else). byUTF8 is basically the version of byUTF!char / byChar 
that returns a string rather than a lazy range, which is why it would 
not have worked for you if the file is not UTF-8, UTF-16, or UTF-32.

If you're dealing with an encoding other than UTF-8, UTF-16, or UTF-32, 
and you've read the data in as ubytes, then std.encoding _might_ help, 
but it's not a great module and doesn't support a lot of encodings. On 
the other hand, if by ANSI, you mean the encoding that Windows uses 
with the A versions of its functions, then you can use 
std.windows.charset.fromMBSz to convert it to string, though it will 
need to be a null-terminated immutable(char)* to work with fromMBSz, so 
somewhat stupidly, it basically needs to be a null-terminated string 
that isn't valid UTF-8 that you pass using str.ptr or &str[0].
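To illustrate the byUTF route described above, here is a minimal sketch 
(the file name "data.csv" is hypothetical): byLine hands back the raw 
bytes of each line as char[] without validating them, and byUTF!char 
then lazily re-encodes, substituting U+FFFD for any invalid sequence 
instead of throwing:

```d
import std.stdio : File, writeln;
import std.utf : byUTF;
import std.array : array;

void main()
{
    auto csvFile = File("data.csv"); // hypothetical input file
    foreach (line; csvFile.byLine)
    {
        // byLine does not validate UTF-8; byUTF!char decodes the code
        // units itself and, by default, replaces each invalid sequence
        // with the Unicode replacement character U+FFFD rather than
        // throwing a UTFException.
        auto cleaned = line.byUTF!char.array; // now valid UTF-8
        writeln(cleaned);
    }
}
```

Note that this sanitizes rather than converts: bytes from an ANSI code 
page won't be turned into the characters the user typed, just into 
replacement characters, so the program no longer crashes.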
If the encoding you're dealing with is not one that works with 
fromMBSz, and it is not one of the few supported by std.encoding, then 
you're on your own.

- Jonathan M Davis
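For the Windows case mentioned above, a hedged sketch of the fromMBSz 
route (Windows-only, and it assumes the input really is in the system 
ANSI code page; the sample bytes are made up):

```d
// Windows-only: std.windows.charset does not exist on other platforms.
import std.stdio : writeln;
import std.windows.charset : fromMBSz;

void main()
{
    // Pretend these bytes came off disk in the Windows ANSI code page
    // (e.g. CP-1252), so they are not necessarily valid UTF-8.
    char[] line = ['C', 'a', 'f', cast(char) 0xE9]; // "Café" in CP-1252
    line ~= '\0'; // fromMBSz requires a null-terminated buffer
    string utf8 = fromMBSz(cast(immutable(char)*) line.ptr);
    writeln(utf8); // re-encoded as valid UTF-8
}
```

The second parameter of fromMBSz is a code page and defaults to the 
system default (CP_ACP), which is what "ANSI" usually means on Windows.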
Apr 06 2018