digitalmars.D.learn - Parsing a UTF-16LE file line by line?
- Nestor (8/8) Jan 04 2017 Hi,
- Daniel Kozák via Digitalmars-d-learn (5/17) Jan 04 2017 can you show your code, byLine should works ok, and post some example
- Daniel Kozák via Digitalmars-d-learn (5/22) Jan 04 2017 Ok, I've done some testing and you are right byLine is broken, so
- Nestor (3/5) Jan 04 2017 A bug? I was under the impression that this function was
- pineapple (13/19) Jan 04 2017 I'm not sure if this works quite as intended, but I was at least
- rumbu (2/14) Jan 05 2017 fwide is not implemented in Windows:
- pineapple (4/21) Jan 06 2017 That's odd. It was on Windows 7 64-bit that I put together and
- Mike Wey (6/27) Jan 06 2017 Are you compiling a 32bit binary? Because in that case you would be
- Nestor (28/58) Jan 15 2017 After some testing I realized that byLine was not the one
- Nestor (5/32) Jan 15 2017 By the way, when caught, the exception says it's in file
- Daniel Kozák via Digitalmars-d-learn (4/67) Jan 15 2017 This is because byLine does return range, so until you do something with...
- Nestor (5/7) Jan 15 2017 I see. So correcting my original doubt:
- Era Scarecrow (15/19) Jan 16 2017 Could... roll your own? Although if you wanted it to be UTF-8
- Nestor (8/28) Jan 17 2017 Thanks, but unfortunately this function does not produce proper
- Era Scarecrow (11/20) Jan 26 2017 I thought you wanted to get line by line of contents, which
- Nestor (6/8) Jan 28 2017 AFAIK in some cases the BOM takes up to 4 bytes (FOR UTF-32), so
- Patrick Schluter (2/10) Jan 29 2017 On UTF-8 files the BOM is 3 bytes long.
- Jack Applegame (3/5) Jan 26 2017 Maybe I'm wrong, but I think it's thread safe. Because static
- Era Scarecrow (3/8) Jan 27 2017 Perhaps, but fibers or other instances of sharing the buffer
- Daniel Kozák via Digitalmars-d-learn (16/22) Jan 04 2017 Impression is nice but there is nothing about it, so anyone who will
- Steven Schveighoffer (9/16) Jan 05 2017 I have not tested much with UTF16 and std.stdio, but I don't believe the...
Hi, I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error: "Invalid UTF-8 sequence (at index 1)" How can I achieve what I want, without loading the entire file into memory? Thanks in advance.
Jan 04 2017
Nestor via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote on Wed, Jan 4, 2017 at 12:03:
> Hi, I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error: "Invalid UTF-8 sequence (at index 1)" How can I achieve what I want, without loading the entire file into memory? Thanks in advance.

Can you show your code? byLine should work OK. Please also post an example of a UTF-16LE file that does not work.
Jan 04 2017
Daniel Kozák <kozzi11 gmail.com> wrote on Wed, Jan 4, 2017 at 6:33:
> can you show your code, byLine should works ok, and post some example of utf16-le file which does not works

OK, I've done some testing and you are right: byLine is broken, so please file a bug.
Jan 04 2017
On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
> Ok, I've done some testing and you are right byLine is broken, so please file a bug

A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.
Jan 04 2017
On Wednesday, 4 January 2017 at 19:20:31 UTC, Nestor wrote:
> A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.

I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.

    import std.stdio;
    import core.stdc.wchar_ : fwide;

    void main(){
        auto file = File("UTF-16LE encoded file.txt");
        fwide(file.getFP(), 1); // request wide orientation before the first read
        foreach(line; file.byLine){
            // note: readln here consumes a further line on every iteration
            writeln(file.readln);
        }
    }
Jan 04 2017
> I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.

fwide is not implemented in Windows: https://msdn.microsoft.com/en-us/library/aa985619.aspx
Jan 05 2017
On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
> fwide is not implemented in Windows: https://msdn.microsoft.com/en-us/library/aa985619.aspx

That's odd. It was on Windows 7 64-bit that I put together and tested that example, and calling fwide definitely had an effect on program behavior.
Jan 06 2017
On 01/06/2017 11:33 AM, pineapple wrote:
> That's odd. It was on Windows 7 64-bit that I put together and tested that example, and calling fwide definitely had an effect on program behavior.

Are you compiling a 32-bit binary? Because in that case you would be using the Digital Mars C runtime, which might have an implementation of fwide.

--
Mike Wey
Jan 06 2017
On Friday, 6 January 2017 at 11:42:17 UTC, Mike Wey wrote:
> Are you compiling a 32bit binary? Because in that case you would be using the digital mars c runtime which might have an implementation for fwide.

After some testing I realized that byLine was not the one failing, but any string manipulation done to the obtained line. Compile the following example with and without -debug and run to see what I mean:

    import std.stdio, std.string;

    enum EXIT_SUCCESS = 0, EXIT_FAILURE = 1;

    int main() {
        version(Windows) {
            import core.sys.windows.wincon;
            SetConsoleOutputCP(65001); // 65001 is the UTF-8 code page
        }
        auto f = File("utf16le.txt", "r");
        foreach (line; f.byLine()) try {
            string s;
            debug s = cast(string)strip(line); // this is the one causing problems
            if (1 > s.length) continue;
            writeln(s);
        } catch(Exception e) {
            writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, e.line);
            return EXIT_FAILURE;
        }
        return EXIT_SUCCESS;
    }
Jan 15 2017
On Sunday, 15 January 2017 at 14:48:12 UTC, Nestor wrote:
> After some testing I realized that byLine was not the one failing, but any string manipulation done to the obtained line.

By the way, when caught, the exception says it's in file src/phobos/std/utf.d line 1217, but that file only has 784 lines. That's quite odd. (I am compiling with dmd 2.072.2)
Jan 15 2017
On Sun, 15 Jan 2017 14:48:12 +0000, Nestor via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote:
> After some testing I realized that byLine was not the one failing, but any string manipulation done to the obtained line.

This is because byLine returns a range, so until you do something with it, it does not cause any harm :)
Jan 15 2017
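A small illustration of the point above, as a hedged sketch rather than a description of Phobos internals: as far as I can tell, byLine hands back the file's bytes typed as char[] without checking them, and the error only appears once something decodes them as UTF-8. The helper name `isValidUtf8` is made up for this example; `std.utf.validate` is the real Phobos function it wraps.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `raw` decodes cleanly as UTF-8.
bool isValidUtf8(const(ubyte)[] raw)
{
    try
    {
        // Reinterpret the bytes as UTF-8 code units and decode them all.
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "é" followed by LF, encoded as UTF-16LE (no BOM): E9 00 0A 00
    immutable ubyte[] utf16le = [0xE9, 0x00, 0x0A, 0x00];
    // The same text encoded as UTF-8: C3 A9 0A
    immutable ubyte[] utf8 = [0xC3, 0xA9, 0x0A];

    assert(!isValidUtf8(utf16le)); // 0xE9 starts a 3-byte sequence, 0x00 cannot continue it
    assert(isValidUtf8(utf8));
}
```

Merely holding or copying the bytes raises nothing; the exception comes from the decoding step, which is consistent with strip being the call that failed in the example.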
On Sunday, 15 January 2017 at 16:29:23 UTC, Daniel Kozák wrote:
> This is because byLine does return range, so until you do something with that it does not cause any harm :)

I see. So, correcting my original question: how can I parse a UTF-16LE file line by line (producing a proper string in each iteration) without loading the entire file into memory?
Jan 15 2017
On Sunday, 15 January 2017 at 19:48:04 UTC, Nestor wrote:
> I see. So correcting my original doubt: How could I parse an UTF16LE file line by line (producing a proper string in each iteration) without loading the entire file into memory?

Could... roll your own? Although if you wanted UTF-8 output instead, that would require a second pass or, better yet, changing how i is iterated.

    char[] getLine16LE(File inp = stdin) {
        static char[1024*4] buffer; //4k reusable buffer, NOT thread safe
        int i;
        // read one 2-byte code unit at a time until EOF
        while(inp.rawRead(buffer[i .. i+2]) != null) {
            if (buffer[i] == '\n') break; // tests only the low byte of the code unit
            i+=2;
        }
        return buffer[0 .. i]; // raw UTF-16LE bytes, not UTF-8
    }
Jan 16 2017
On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
> Could... roll your own? Although if you wanted it to be UTF-8 output instead would require a second pass or better yet changing how the i iterated.

Thanks, but unfortunately this function does not produce proper UTF-8 strings; as a matter of fact, the output even starts with the BOM. Also, it doesn't handle CRLF, and even for LF-terminated lines it doesn't seem to work for lines other than the first. I guess I have to code encoding detection, buffered reading, and transcoding by hand. The only problem is that the result could be sub-optimal, which is why I was looking for a built-in solution.
Jan 17 2017
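For readers following along, here is one possible shape for the hand-rolled approach described above: read the file in fixed-size chunks, assemble 16-bit code units, drop a leading BOM, split on LF (trimming a preceding CR), and transcode each line to UTF-8. This is a minimal sketch under stated assumptions, not a tested library routine: the function name `eachLineUtf16LE` and the callback style are invented for the example, the input is assumed to really be UTF-16LE, and only `File.byChunk` and `std.utf.toUTF8` are existing Phobos calls.

```d
import std.stdio : File, writeln;
import std.utf : toUTF8;

/// Hypothetical helper: calls `sink` once per line of a UTF-16LE file,
/// passing the line as a UTF-8 string without the line terminator.
void eachLineUtf16LE(File f, scope void delegate(string) sink)
{
    wchar[] pending;      // code units of the line being assembled
    bool haveLow = false; // true when the low byte of a code unit is waiting
    ubyte low;
    bool atStart = true;  // used to skip a BOM at the very beginning

    void flush()
    {
        // Treat CRLF like LF by dropping a trailing CR.
        if (pending.length && pending[$ - 1] == '\r')
            pending = pending[0 .. $ - 1];
        sink(toUTF8(pending)); // throws UTFException on broken surrogate pairs
        pending.length = 0;
    }

    // Only one chunk is held in memory at a time.
    foreach (ubyte[] chunk; f.byChunk(4096))
    {
        foreach (b; chunk)
        {
            if (!haveLow) { low = b; haveLow = true; continue; }
            haveLow = false;
            immutable wchar unit = cast(wchar)(low | (b << 8)); // little-endian
            if (atStart)
            {
                atStart = false;
                if (unit == 0xFEFF) continue; // skip the byte order mark
            }
            if (unit == '\n') flush();
            else pending ~= unit;
        }
    }
    if (pending.length) flush(); // last line without a trailing LF
}

void main(string[] args)
{
    if (args.length < 2) return; // no file given: nothing to do
    eachLineUtf16LE(File(args[1], "rb"), (string line) { writeln(line); });
}
```

The sketch trades speed for clarity: it appends one code unit at a time and allocates a new string per line, so it is a starting point rather than an optimized reader.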
On Tuesday, 17 January 2017 at 11:40:15 UTC, Nestor wrote:
> Thanks, but unfortunately this function does not produce proper UTF8 strings, as a matter of fact the output even starts with the BOM. Also it doesn't handle CRLF, and even for LF terminated lines it doesn't seem to work for lines other than the first.

I thought you wanted to get the contents line by line, which would then remain as UTF-16. Translating between the two types shouldn't be hard; probably to!string, or a foreach that appends the code units to a char array, would convert to UTF-8. Skipping the BOM is just a matter of skipping the first two bytes identifying it...

> I guess I have to code encoding detection, buffered read, and transcoding by hand, the only problem is that the result could be sub-optimal, which is why I was looking for a built-in solution.

Maybe. Honestly I'm not nearly as familiar with the library or functions as I would love to be, so often home-made solutions seem more prevalent until I learn the lingo. A disadvantage of being self-taught.
Jan 26 2017
On Friday, 27 January 2017 at 04:26:31 UTC, Era Scarecrow wrote:
> Skipping the BOM is just a matter of skipping the first two bytes identifying it...

AFAIK in some cases the BOM takes up to 4 bytes (for UTF-32), so when the input encoding is unknown one must perform some kind of detection in order to apply the correct transcoding later. I thought by now dmd had this functionality built in and exposed, since the compiler itself seems to do it for source code units.
Jan 28 2017
On Saturday, 28 January 2017 at 15:40:24 UTC, Nestor wrote:
> AFAIK in some cases the BOM takes up to 4 bytes (FOR UTF-32), so when input encoding is unknown one must perform some kind of detection in order to apply the correct transcoding later.

On UTF-8 files the BOM is 3 bytes long.
Jan 29 2017
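To make the BOM lengths mentioned above concrete (2 bytes for UTF-16, 3 for UTF-8, 4 for UTF-32), here is a hedged sketch of detection from the first bytes of a file. The names `Encoding`, `BomInfo` and `detectBom` are invented for this example and are not Phobos APIs; the byte patterns themselves are the standard Unicode ones.

```d
/// Hypothetical result types for this sketch.
enum Encoding { unknown, utf8, utf16le, utf16be, utf32le, utf32be }

struct BomInfo
{
    Encoding encoding; // what the BOM indicates
    size_t length;     // how many leading bytes to skip
}

/// Inspects the first bytes of a file and reports the BOM found, if any.
BomInfo detectBom(const(ubyte)[] h)
{
    // UTF-32LE is checked before UTF-16LE because FF FE 00 00 also
    // begins with the UTF-16LE pattern FF FE.
    if (h.length >= 4 && h[0] == 0xFF && h[1] == 0xFE && h[2] == 0x00 && h[3] == 0x00)
        return BomInfo(Encoding.utf32le, 4);
    if (h.length >= 4 && h[0] == 0x00 && h[1] == 0x00 && h[2] == 0xFE && h[3] == 0xFF)
        return BomInfo(Encoding.utf32be, 4);
    if (h.length >= 3 && h[0] == 0xEF && h[1] == 0xBB && h[2] == 0xBF)
        return BomInfo(Encoding.utf8, 3);
    if (h.length >= 2 && h[0] == 0xFF && h[1] == 0xFE)
        return BomInfo(Encoding.utf16le, 2);
    if (h.length >= 2 && h[0] == 0xFE && h[1] == 0xFF)
        return BomInfo(Encoding.utf16be, 2);
    return BomInfo(Encoding.unknown, 0); // no BOM: the encoding must be guessed or given
}

void main()
{
    ubyte[] utf16leFile = [0xFF, 0xFE, 0x41, 0x00]; // BOM + "A"
    ubyte[] utf8File = [0xEF, 0xBB, 0xBF, 0x41];    // BOM + "A"

    assert(detectBom(utf16leFile) == BomInfo(Encoding.utf16le, 2));
    assert(detectBom(utf8File) == BomInfo(Encoding.utf8, 3));
}
```

One known ambiguity remains: a UTF-16LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE by its first four bytes, so this check is a heuristic and files without a BOM need some other hint.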
On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
> static char[1024*4] buffer; //4k reusable buffer, NOT thread safe

Maybe I'm wrong, but I think it's thread safe, because static mutable non-shared variables are stored in TLS.
Jan 26 2017
On Friday, 27 January 2017 at 07:02:52 UTC, Jack Applegame wrote:
> Maybe I'm wrong, but I think it's thread safe. Because static mutable non-shared variables are stored in TLS.

Perhaps, but fibers or other instances of sharing the buffer wouldn't be safe/reliable, at least not for long.
Jan 27 2017
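Both points in this exchange can be checked with a short experiment. The sketch below is illustrative only: `scratch` is a made-up stand-in for the 4k buffer in getLine16LE and `firstByteSeenByWorker` is a hypothetical helper. It relies on two facts about D: a static variable without `shared` or `__gshared` gets one copy per thread, and `char.init` is 0xFF.

```d
import core.thread : Thread;

/// Hypothetical stand-in for the reusable buffer in getLine16LE.
char[] scratch()
{
    static char[8] buffer; // no `shared` or `__gshared`: one copy per thread
    return buffer[];
}

/// Starts a second thread and reports what it found in its own buffer.
char firstByteSeenByWorker()
{
    char seen;
    auto worker = new Thread({
        seen = scratch()[0]; // the worker's copy, untouched by the caller
        scratch()[0] = 'B';  // modifies only the worker's copy
    });
    worker.start();
    worker.join();
    return seen;
}

void main()
{
    scratch()[0] = 'A';
    assert(firstByteSeenByWorker() == char.init); // the worker saw a fresh buffer
    assert(scratch()[0] == 'A');                  // and did not disturb this thread's copy

    // Within one thread every call returns the same storage, so a slice
    // obtained earlier changes when the buffer is reused.
    auto first = scratch();
    auto second = scratch();
    second[0] = 'Z';
    assert(first[0] == 'Z');
}
```

The second half is the hazard raised for fibers: they run on the thread that calls them and so share its thread-local storage, which means a line returned to one fiber can be overwritten when another fiber reads the next line.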
Nestor via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> wrote on Wed, Jan 4, 2017 at 8:20:
> A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.

The impression is nice, but the documentation says nothing about it, so anyone who reads the docs will expect it to work with any encoding. From the docs I also see that there is a way to select the encoding, and even the Terminator and its type, and this does not work, so I expect it is a bug.

Another weird behaviour: when you read a file as wstring, it will try to decode it as UTF-8 and then encode it to UTF-16. Even if that works (for UTF-8 files) and you end up with wstring lines (wstring[]), when you try to save them they are automatically saved as UTF-8. WTF, this is really wrong, and if it is intended it should be documented better. Right now it is really hard to work with dlang stdio.

But I hope it will be deprecated someday and replaced with something that supports ranges and async I/O.
Jan 04 2017
On 1/4/17 6:03 AM, Nestor wrote:
> Hi, I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error: "Invalid UTF-8 sequence (at index 1)" How can I achieve what I want, without loading the entire file into memory? Thanks in advance.

I have not tested much with UTF-16 and std.stdio, but I don't believe the underlying FILE * being used by Phobos has good support for it. In my testing, for instance, byLine with a non-ASCII delimiter didn't work at all. On Windows 64-bit, MSVC simply ignores any attempts to change the width of the stream. I wouldn't hold out much hope for this to be fixed.

-Steve
Jan 05 2017