digitalmars.D.learn - Want to read a whole file as utf-8

"Foo" <Foo test.de> writes:
How can I do that without any GC allocation? Nothing in std.file 
seems to be marked with @nogc.

I'm asking because it seems very complicated to do that in C++. 
Maybe D is a better choice; in that case we would probably move our 
whole project from C++ to D.
Feb 03 2015
FG <home fgda.pl> writes:
On 2015-02-03 at 19:53, Foo wrote:
> How can I do that without any GC allocation? Nothing in std.file seems to be
> marked with @nogc
>
> I'm asking since it seems very complicated to do that with C++, maybe D is a
> better choice, then we would probably move our whole project from C++ to D.
Looks like std.stdio isn't marked with @nogc all the way either. So for now the temporary solution would be to use std.c.stdio: get the file size, malloc a buffer large enough for it [1], use fread to fill it, assign it to a char[] slice, and use std.utf.decode to consume the text... Oh wait, decode isn't @nogc either. FFS, what now?

[1] I assume the file is small. Otherwise there would be an extra step: when you get near the end of the buffer, you move the rest of the data to the front, read new data after it, and continue decoding.
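
The steps listed in this post (get the size, malloc a buffer, fill it with fread) could look roughly like the sketch below. It is only a sketch under a few assumptions: `readWholeFile` is a made-up name, the file is assumed to fit in memory and not to change while it is read, and the C functions are imported from core.stdc.stdio and core.stdc.stdlib, whose declarations are marked @nogc nothrow. Nothing here checks that the bytes are valid UTF-8.

```d
import core.stdc.stdio : FILE, SEEK_END, SEEK_SET, fclose, fopen, fread, fseek, ftell;
import core.stdc.stdlib : free, malloc;

/// Hypothetical helper: reads a whole file into a malloc'ed buffer.
/// Returns null on failure. The caller releases the result with free(result.ptr).
char[] readWholeFile(const(char)* path) @nogc nothrow
{
    FILE* f = fopen(path, "rb");
    if (f is null)
        return null;
    scope (exit) fclose(f);

    // Find the size by seeking to the end, then rewind.
    if (fseek(f, 0, SEEK_END) != 0)
        return null;
    immutable long size = ftell(f);
    if (size < 0 || fseek(f, 0, SEEK_SET) != 0)
        return null;

    // Allocate one extra byte so that an empty file still gets a non-null buffer.
    auto p = cast(char*) malloc(cast(size_t) size + 1);
    if (p is null)
        return null;

    immutable size_t got = fread(p, 1, cast(size_t) size, f);
    if (got != cast(size_t) size)
    {
        free(p);
        return null;
    }
    return p[0 .. got];
}
```

A caller would write something like `auto text = readWholeFile("test_file.txt");` followed by `scope (exit) free(text.ptr);`, which keeps the lifetime of the buffer deterministic.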
Feb 03 2015
"Tobias Pankrath" <tobias pankrath.net> writes:
On Tuesday, 3 February 2015 at 19:44:49 UTC, FG wrote:
> [...] assign it to a char[] slice and std.utf.decode to consume the text...
> Oh wait, decode isn't @nogc either. FFS, what now?
Use std.utf.validate instead of decode. It will only allocate one exception if necessary.
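
For reference, a minimal sketch of how std.utf.validate is typically wrapped. The helper name `isValidUtf8` is made up for this illustration. Note that this does not make the check @nogc: validate reports malformed input by throwing a UTFException, and that exception is allocated by the GC.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: true if `text` passes std.utf.validate.
/// Not @nogc: a UTFException is thrown (and allocated) on malformed input.
bool isValidUtf8(const(char)[] text)
{
    try
    {
        validate(text);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```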
Feb 03 2015
FG <home fgda.pl> writes:
On 2015-02-03 at 20:50, Tobias Pankrath wrote:
> Use std.utf.validate instead of decode. It will only allocate one exception if
> necessary.
Looks to me like it uses decode internally...

But Foo, do you have to use @nogc? It still looks like a work in progress, and a missing @nogc annotation doesn't mean that the GC is actually involved in the function. It will probably take several months for the obvious @nogc parts of the standard library to get annotated, and much longer to get rid of unnecessary uses of the GC. So maybe the solution for now is to check the source code of the function in question with one's own set of eyeballs and decide if it's good enough for use, i.e. doesn't leak too much?
Feb 03 2015
"Foo" <Foo test.de> writes:
On Tuesday, 3 February 2015 at 19:56:37 UTC, FG wrote:
> [...] But Foo, do you have to use @nogc? [...]
Yes, we don't want to use a GC. We want deterministic lifetimes. I'm not the boss, but I support the idea.

@Nordlöw: Neither of them can be marked with @nogc. :/
Feb 03 2015
"Foo" <Foo test.de> writes:
On Tuesday, 3 February 2015 at 19:44:49 UTC, FG wrote:
> [...] assign it to a char[] slice and std.utf.decode to consume the text... [...]
How would I use decoding for that? Isn't there a way to read the file as UTF-8 or, even better, as Unicode?
Feb 03 2015
FG <home fgda.pl> writes:
On 2015-02-04 at 00:07, Foo wrote:
> How would I use decoding for that? Isn't there a way to read the file as utf8
> or even better, as unicode?
Well, apparently the UTF-8-aware foreach loop still works just fine. This program shows the file size and the number of Unicode code points:

import core.stdc.stdio;

int main() @nogc
{
    enum bufSize = 64000;
    char[bufSize] buffer;
    size_t bytesRead, count;

    FILE* f = fopen("test.d", "r");
    if (!f)
        return 1;
    bytesRead = fread(buffer.ptr, 1, bufSize, f);
    fclose(f); // close right away so the early returns below don't leak the handle
    if (bytesRead > bufSize - 1) {
        printf("File is too big");
        return 1;
    }
    if (!bytesRead)
        return 2;
    // A dchar loop variable over char[] decodes the UTF-8 on the fly.
    foreach (dchar d; buffer[0..bytesRead])
        count++;
    // Both values are below bufSize, so the casts to int are safe.
    printf("read %d bytes, %d unicode characters\n", cast(int) bytesRead, cast(int) count);
    return 0;
}

Outputs for example this:

read 838 bytes, 829 unicode characters

(It would be more complicated if it had to process bigger files.)
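
The difference between the two numbers in that output comes from multi-byte sequences: a foreach with a dchar loop variable runs once per code point, not once per byte. A small sketch of just that behaviour, with a made-up helper name `countCodePoints`:

```d
/// Hypothetical helper: counts code points by letting foreach decode UTF-8.
size_t countCodePoints(const(char)[] text)
{
    size_t count = 0;
    foreach (dchar c; text) // one iteration per decoded code point
        ++count;
    return count;
}

unittest
{
    // "Nordl\u00F6w" takes 8 bytes in UTF-8 but contains 7 code points.
    assert("Nordl\u00F6w".length == 8);
    assert(countCodePoints("Nordl\u00F6w") == 7);
}
```

The helper is deliberately left without attributes. The decoding loop can throw on malformed input, so it cannot be nothrow as written.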
Feb 03 2015
"Namespace" <rswhite4 gmail.com> writes:
On Tuesday, 3 February 2015 at 23:55:19 UTC, FG wrote:
> Well, apparently the utf-8-aware foreach loop still works just fine. [...]
To use a foreach loop is such a nice idea! Thank you very much. :) That's my code now:

----
private:

static import m3.m3;
static import core.stdc.stdio;

alias printf = core.stdc.stdio.printf;

public:

@trusted
@nogc
auto readFile(in string filename) nothrow {
    import core.stdc.stdio : FILE, SEEK_END, SEEK_SET, fopen, fclose, fseek, ftell, fread;

    // filename must be zero-terminated (string literals are)
    FILE* f = fopen(filename.ptr, "rb");
    fseek(f, 0, SEEK_END);
    immutable size_t fsize = ftell(f);
    fseek(f, 0, SEEK_SET);

    char[] str = m3.m3.make!(char[])(fsize);
    fread(str.ptr, fsize, 1, f);
    fclose(f);

    return str;
}

@trusted
@nogc
@property
dstring toUTF32(in char[] s) {
    dchar[] r = m3.m3.make!(dchar[])(s.length); // r will never be longer than s
    size_t n = 0; // code point index; the foreach index would be a byte offset into s
    foreach (dchar c; s) {
        r[n++] = c;
    }

    return cast(dstring) r[0 .. n];
}

@nogc
void main() {
    auto str = readFile("test_file.txt");
    scope(exit) m3.m3.destruct(str);

    auto str2 = str.toUTF32;
    printf("%d : %d\n", cast(int) str[0], cast(int) str2[0]);
}
----

m3 is my own module and means "manual memory management", three m's so m3. If we use D (which is now much more likely), that will be our core module for memory management.
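
One detail worth knowing when decoding into a separate array this way: in `foreach (i, dchar c; s)` the index `i` is the byte offset of `c` inside `s`, not a running count of code points, so the two drift apart after the first multi-byte sequence. The sketch below keeps its own counter instead. `decodeInto` is a made-up name, and the caller-supplied buffer stands in for whatever allocator is used.

```d
/// Hypothetical helper: decodes UTF-8 `src` into `dst` and returns the used part.
/// `dst` needs at least src.length elements (there are never more code points than bytes).
dchar[] decodeInto(const(char)[] src, dchar[] dst)
{
    size_t n = 0; // code point index, counted separately from the byte offset
    foreach (dchar c; src)
        dst[n++] = c;
    return dst[0 .. n];
}

unittest
{
    // The foreach index reports byte offsets: 'a' at 0, U+00F6 at 1, 'b' at 3.
    size_t[3] offsets;
    size_t k = 0;
    foreach (size_t i, dchar c; "a\u00F6b")
        offsets[k++] = i;
    assert(offsets[0] == 0 && offsets[1] == 1 && offsets[2] == 3);

    dchar[4] buf;
    auto r = decodeInto("a\u00F6b", buf[]);
    assert(r.length == 3);
    assert(r == "a\u00F6b"d);
}
```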
Feb 03 2015
FG <home fgda.pl> writes:
On 2015-02-04 at 01:56, Namespace wrote:
>      FILE* f = fopen(filename.ptr, "rb");
>      fseek(f, 0, SEEK_END);
>      immutable size_t fsize = ftell(f);
>      fseek(f, 0, SEEK_SET);
That's quite a smart way to get the size of the file.

I started with std.file.getSize (which obviously isn't marked as @nogc) and ended up with the monstrosity below (which I have only compiled on Windows), so I decided not to mention it in my previous post. It wouldn't have been the point anyway, since I only showed an example with a single-fill fixed buffer. But here it is, rendered useless by your code:

long getFileSize(const char* cName) @nogc
{
    version(Windows)
    {
        import core.sys.windows.windows;
        WIN32_FILE_ATTRIBUTE_DATA fad;
        if (!GetFileAttributesExA(cName, GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, &fad))
            return -1;
        ULARGE_INTEGER li;
        li.LowPart = fad.nFileSizeLow;
        li.HighPart = fad.nFileSizeHigh;
        return li.QuadPart;
    }
    else version(Posix)
    {
        import core.sys.posix.sys.stat;
        stat_t statbuf = void;
        if (stat(cName, &statbuf))
            return -1;
        return statbuf.st_size;
    }
}
Feb 03 2015
"Foo" <Foo test.de> writes:
Since I'm now almost finished, I'm glad to show you my work: 
https://github.com/Dgame/m3
You're free to use it or to contribute to it.
Feb 04 2015
"Tobias Pankrath" <tobias pankrath.net> writes:
On Tuesday, 3 February 2015 at 23:07:03 UTC, Foo wrote:
> How would I use decoding for that? Isn't there a way to read the file as
> utf8 or even better, as unicode?
Arrays of char, wchar and dchar are supposed to be UTF strings, and of course you can just read them from a file using a C function. You just need to make sure they are valid UTF before passing them on to other parts of Phobos. What do you mean by "as unicode"?
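
Since the Phobos validation routines discussed in this thread are not @nogc, one option is a hand-written check run on the buffer before it is handed to other code. The following is a sketch of such a check, not a Phobos function: the name `isWellFormedUtf8` is made up, and it only tests well-formedness (lead and continuation bytes, overlong forms, surrogates, and the U+10FFFF limit).

```d
/// Hypothetical helper: true if `s` is well-formed UTF-8. Never allocates or throws.
bool isWellFormedUtf8(const(char)[] s) @nogc nothrow pure @safe
{
    size_t i = 0;
    while (i < s.length)
    {
        immutable uint b0 = s[i];
        size_t extra; // number of continuation bytes expected
        uint cp;      // code point being assembled
        uint min;     // smallest code point allowed for this sequence length

        if (b0 < 0x80) { i++; continue; } // ASCII
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min = 0x10000; }
        else return false; // stray continuation byte or invalid lead byte

        if (s.length - i <= extra)
            return false; // sequence is cut off at the end of the buffer

        foreach (k; 1 .. extra + 1)
        {
            immutable uint b = s[i + k];
            if ((b & 0xC0) != 0x80)
                return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return false;                     // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate range
        if (cp > 0x10FFFF) return false;                // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

Because it reports problems through its return value instead of an exception, it can be called from @nogc nothrow code on a buffer filled by fread.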
Feb 03 2015
"Nordlöw" <per.nordlow gmail.com> writes:
On Tuesday, 3 February 2015 at 18:53:28 UTC, Foo wrote:
> How can I do that without any GC allocation? Nothing in 
> std.file seems to be marked with @nogc
>
> I'm asking since it seems very complicated to do that with C++, 
> maybe D is a better choice, then we would probably move our 
> whole project from C++ to D.
My module https://github.com/nordlow/justd/blob/master/mmfile_ex.d together with https://github.com/nordlow/justd/blob/master/bylines.d is about as low-level as you can get in D.
Feb 03 2015