
digitalmars.D.learn - Proper way to fix

reply Dr.No <jckj33 gmail.com> writes:
I'm reading, line by line, a CSV file provided by the user, which is 
assumed to be UTF-8. But a user provided an ANSI file, which resulted 
in this error:

core.exception.UnicodeException src\rt\util\utf.d(292): invalid 
UTF-8 sequence

(It happened when the user took the originally UTF-8 encoded file 
generated by another application, made some edits in an editor whose 
name I don't know, and saved it, unaware that the editor was changing 
the encoding to ANSI.)

My question is: what's the proper way to solve this? Using toUTF8 
didn't help:
 while((line = csvFile.readln().toUTF8) !is null) {
I didn't find a way to explicitly set the encoding with std.stdio.File 
so that it reads as UTF-8 regardless of whether the file is ANSI or 
already UTF-8. I don't want to convert the whole file to UTF-8: the 
CSV file can be large, and that might take quite a while. And I'd have 
to do the conversion on a temporary copy of the file (which would make 
things even slower) to avoid touching the user's original file.

I thought about writing my own readLine() with std.stdio.File.byChunk, 
taking bytes until a '\n' byte is seen, then treating the result as 
UTF-8 and returning it. But I'd rather not reinvent the wheel and use 
something native, if possible. Any ideas?
Apr 06 2018
parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Friday, April 06, 2018 16:10:56 Dr.No via Digitalmars-d-learn wrote:
> I'm reading line by line the lines from a CSV file provided by
> the user which is assumed to be UTF8. But an user has provided an
> ANSI file which resulted in the error:
>
> core.exception.UnicodeException src\rt\util\utf.d(292): invalid
> UTF-8 sequence
>
> (it happend when the user took the originally UTF8 encoded file
> generated by another application, made some edit using an editor
> (which I don't know the name) then saved not aware it was changing
> the encoding to ANSI. My question is: what's the proper way to
> solve that? using toUTF8 didn't solve:
>
>  while((line = csvFile.readln().toUTF8) !is null) {
>
> I didn't find a way to set explicitly the encoding with
> std.stdio.File to set to UTF8 regardless it's an ANSI or already
> UTF8. I don't want to conver the whole file to UTF8, the CSV file
> can be large and might take quite while. And if I do so to a
> temporary copy the file (which will make things even more slow) to
> avoid touch user's original file. I thought in writing my own
> readLine() with std.stdio.File.byChunk to take as many bytes as
> possible until '\n' byte is seen, treat it as UTF8 and return. But
> I'd like to not reinvent the wheel and use something native, if
> possible. Any ideas?
In general, Phobos pretty much requires that text files be in UTF-8, or 
that they be in UTF-16 or UTF-32 with the native endianness. Certainly, 
char, wchar, and dchar are assumed by both the language and the library 
to be valid UTF-8, UTF-16, and UTF-32 respectively, and if you do 
anything with a string as a range, by default, it will decode the code 
units to code points, meaning that it will validate the UTF.

If you want to read in anything that is not going to be valid UTF, then 
you're going to have to read it in as ubytes rather than as characters. 
If you're dealing with text that is supposed to be valid UTF but might 
contain invalid characters, then you can use std.utf.byCodeUnit to 
treat a string as a range of code units where it replaces invalid UTF 
with the Unicode replacement character, but you'd still have to read in 
the text as bytes (e.g. readText validates that the text is proper 
UTF).

std.utf.byUTF could be used instead of byCodeUnit to convert to UTF-8, 
UTF-16, or UTF-32, and invalid UTF will be replaced with the 
replacement character (so the input doesn't have to be the target 
encoding, but it does have to be one of those three encodings and not 
something else). byUTF8 is basically the version of byUTF!char / byChar 
that returns a string rather than a lazy range, which is why it would 
not have worked for you if the file is not UTF-8, UTF-16, or UTF-32.

If you're dealing with an encoding other than UTF-8, UTF-16, or UTF-32, 
and you've read the data in as ubytes, then std.encoding _might_ help, 
but it's not a great module and doesn't support a lot of encodings. On 
the other hand, if by ANSI, you mean the encoding that Windows uses 
with the A versions of its functions, then you can use 
std.windows.charset.fromMBSz to convert it to string, though it will 
need to be a null-terminated immutable(char)* to work with fromMBSz, so 
somewhat stupidly, it basically needs to be a null-terminated string 
that isn't valid UTF-8 that you pass using str.ptr or &str[0].
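To illustrate the byUTF route described above, here is a minimal sketch 
(the file name "data.csv" is hypothetical): byLine hands back the raw 
bytes of each line as char[] without validating them, and byUTF!char 
then lazily re-encodes, substituting U+FFFD for any invalid sequence 
instead of throwing:

```d
import std.stdio : File, writeln;
import std.utf : byUTF;
import std.array : array;

void main()
{
    auto csvFile = File("data.csv"); // hypothetical input file
    foreach (line; csvFile.byLine)
    {
        // byLine does not validate UTF-8; byUTF!char decodes the code
        // units itself and, by default, replaces each invalid sequence
        // with the Unicode replacement character U+FFFD rather than
        // throwing a UTFException.
        auto cleaned = line.byUTF!char.array; // now valid UTF-8
        writeln(cleaned);
    }
}
```

Note that this sanitizes rather than converts: bytes from an ANSI code 
page won't be turned into the characters the user typed, just into 
replacement characters, so the program no longer crashes.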
If the encoding you're dealing with is not one that works with 
fromMBSz, and it is not one of the few supported by std.encoding, then 
you're on your own.

- Jonathan M Davis
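For the Windows case mentioned above, a hedged sketch of the fromMBSz 
route (Windows-only, and it assumes the input really is in the system 
ANSI code page; the sample bytes are made up):

```d
// Windows-only: std.windows.charset does not exist on other platforms.
import std.stdio : writeln;
import std.windows.charset : fromMBSz;

void main()
{
    // Pretend these bytes came off disk in the Windows ANSI code page
    // (e.g. CP-1252), so they are not necessarily valid UTF-8.
    char[] line = ['C', 'a', 'f', cast(char) 0xE9]; // "Café" in CP-1252
    line ~= '\0'; // fromMBSz requires a null-terminated buffer
    string utf8 = fromMBSz(cast(immutable(char)*) line.ptr);
    writeln(utf8); // re-encoded as valid UTF-8
}
```

The second parameter of fromMBSz is a code page and defaults to the 
system default (CP_ACP), which is what "ANSI" usually means on Windows.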
Apr 06 2018