digitalmars.D.learn - Want to read a whole file as utf-8
- Foo (5/5) Feb 03 2015 How can I do that without any GC allocation? Nothing in std.file
- FG (10/12) Feb 03 2015 Looks like std.stdio isn't marked with @nogc all the way either.
- Tobias Pankrath (3/22) Feb 03 2015 Use std.utf.validate instead of decode. It will only allocate one
- FG (9/10) Feb 03 2015 Looks to me like it uses decode internally...
- Foo (4/21) Feb 03 2015 Yes, we don't want to use a GC. We want determinsitic life times.
- Foo (3/22) Feb 03 2015 How would I use decoding for that? Isn't there a way to read the
- FG (27/28) Feb 03 2015 Well, apparently the utf-8-aware foreach loop still works just fine.
- Namespace (45/78) Feb 03 2015 To use a foreach loop is such a nice idea! Thank you very much. :)
- Tobias Pankrath (6/34) Feb 03 2015 Arrays of char, wchar and dchar are supposed to be UTF strings
- "Nordlöw" (6/11) Feb 03 2015 My module
How can I do that without any GC allocation? Nothing in std.file seems to be marked with @nogc. I'm asking since it seems very complicated to do that with C++; maybe D is a better choice, in which case we would probably move our whole project from C++ to D.
Feb 03 2015
On 2015-02-03 at 19:53, Foo wrote:
> How can I do that without any GC allocation? Nothing in std.file seems to be marked with @nogc. I'm asking since it seems very complicated to do that with C++; maybe D is a better choice, in which case we would probably move our whole project from C++ to D.

Looks like std.stdio isn't marked with @nogc all the way either. So for now the temporary solution would be to use std.c.stdio: get the file size, malloc a buffer large enough for it [1], use std.c.stdio.fread to fill it, assign it to a char[] slice and use std.utf.decode to consume the text... Oh wait, decode isn't @nogc either. FFS, what now?

[1] I assume the file is small; otherwise there would be an extra step involved where, after nearing the end of the buffer, you move the rest of the data to the front, read new data after it, and continue decoding.
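A minimal sketch of the approach described above, assuming the file size is already known (how to obtain it without the GC comes up later in the thread); the function name readAll and the choice of returning a malloc-backed char[] slice are assumptions for illustration, not code from the thread:

import core.stdc.stdio : FILE, fread;
import core.stdc.stdlib : malloc;

// Fill a malloc'ed buffer from an open file and view it as a char[] slice.
// The caller owns the memory and must free(result.ptr) when done.
char[] readAll(FILE* f, size_t size) @nogc nothrow
{
    auto p = cast(char*) malloc(size);
    if (p is null)
        return null;
    immutable got = fread(p, 1, size, f); // may be shorter on error or EOF
    return p[0 .. got];
}

The decoding/validation step is what the rest of the thread is about, so it is left out here.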
Feb 03 2015
On Tuesday, 3 February 2015 at 19:44:49 UTC, FG wrote:
> On 2015-02-03 at 19:53, Foo wrote:
>> How can I do that without any GC allocation? Nothing in std.file seems to be marked with @nogc. I'm asking since it seems very complicated to do that with C++; maybe D is a better choice, in which case we would probably move our whole project from C++ to D.
>
> Looks like std.stdio isn't marked with @nogc all the way either. So for now the temporary solution would be to use std.c.stdio: get the file size, malloc a buffer large enough for it [1], use std.c.stdio.fread to fill it, assign it to a char[] slice and use std.utf.decode to consume the text... Oh wait, decode isn't @nogc either. FFS, what now?
>
> [1] I assume the file is small; otherwise there would be an extra step involved where, after nearing the end of the buffer, you move the rest of the data to the front, read new data after it, and continue decoding.

Use std.utf.validate instead of decode. It will only allocate one exception if necessary.
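A short illustration of the suggestion, assuming the raw bytes have already been read into a char slice somehow. validate itself cannot be @nogc (the UTFException it may throw is GC-allocated), but that is its only allocation and it only happens on malformed input:

import std.utf : validate, UTFException;

// Returns true if the slice is well-formed UTF-8, false otherwise.
bool isValidUtf8(const(char)[] text)
{
    try
    {
        validate(text);        // throws UTFException at the first bad sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}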
Feb 03 2015
On 2015-02-03 at 20:50, Tobias Pankrath wrote:
> Use std.utf.validate instead of decode. It will only allocate one exception if necessary.

Looks to me like it uses decode internally...

But Foo, do you have to use @nogc? It still looks like it's work in progress, and the lack of it doesn't mean that the GC is actually involved in the function. It will probably take several months for the obvious @nogc parts of the standard library to get annotated, and much longer to get rid of unnecessary use of the GC. So maybe the solution for now is to verify the source code of the function in question with one's own set of eyeballs and decide if it's good enough for use, i.e. doesn't leak too much?
Feb 03 2015
On Tuesday, 3 February 2015 at 19:56:37 UTC, FG wrote:
> On 2015-02-03 at 20:50, Tobias Pankrath wrote:
>> Use std.utf.validate instead of decode. It will only allocate one exception if necessary.
>
> Looks to me like it uses decode internally...
>
> But Foo, do you have to use @nogc? It still looks like it's work in progress, and the lack of it doesn't mean that the GC is actually involved in the function. It will probably take several months for the obvious @nogc parts of the standard library to get annotated, and much longer to get rid of unnecessary use of the GC. So maybe the solution for now is to verify the source code of the function in question with one's own set of eyeballs and decide if it's good enough for use, i.e. doesn't leak too much?

Yes, we don't want to use a GC. We want deterministic lifetimes. I'm not the boss, but I support the idea.

@Nordlöw: Neither of them can be marked with @nogc. :/
Feb 03 2015
On Tuesday, 3 February 2015 at 19:44:49 UTC, FG wrote:
> On 2015-02-03 at 19:53, Foo wrote:
>> How can I do that without any GC allocation? Nothing in std.file seems to be marked with @nogc. I'm asking since it seems very complicated to do that with C++; maybe D is a better choice, in which case we would probably move our whole project from C++ to D.
>
> Looks like std.stdio isn't marked with @nogc all the way either. So for now the temporary solution would be to use std.c.stdio: get the file size, malloc a buffer large enough for it [1], use std.c.stdio.fread to fill it, assign it to a char[] slice and use std.utf.decode to consume the text... Oh wait, decode isn't @nogc either. FFS, what now?
>
> [1] I assume the file is small; otherwise there would be an extra step involved where, after nearing the end of the buffer, you move the rest of the data to the front, read new data after it, and continue decoding.

How would I use decoding for that? Isn't there a way to read the file as UTF-8 or, even better, as unicode?
Feb 03 2015
On 2015-02-04 at 00:07, Foo wrote:
> How would I use decoding for that? Isn't there a way to read the file as UTF-8 or, even better, as unicode?

Well, apparently the UTF-8-aware foreach loop still works just fine. This program shows the file size and the number of unicode glyphs, or whatever they are called:

import core.stdc.stdio;

int main() @nogc
{
    const int bufSize = 64000;
    char[bufSize] buffer;
    size_t bytesRead, count;

    FILE* f = core.stdc.stdio.fopen("test.d", "r");
    if (!f)
        return 1;

    bytesRead = fread(cast(void*)buffer, 1, bufSize, f);
    if (bytesRead > bufSize - 1) { printf("File is too big"); return 1; }
    if (!bytesRead)
        return 2;

    foreach (dchar d; buffer[0 .. bytesRead])
        count++;

    printf("read %d bytes, %d unicode characters\n", bytesRead, count);
    fclose(f);
    return 0;
}

Outputs for example this:

read 838 bytes, 829 unicode characters

(It would be more complicated if it had to process bigger files.)
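The "bigger files" case from footnote [1] could look roughly like this. It is only a sketch under the assumption of well-formed UTF-8 input, with a made-up buffer size and file name; the point is the extra step of carrying an incomplete trailing sequence over to the next read instead of decoding it too early:

import core.stdc.stdio : FILE, fopen, fclose, fread, printf;
import core.stdc.string : memmove;

int main() @nogc
{
    enum bufSize = 4096;
    char[bufSize] buffer;
    size_t carried = 0;   // bytes of an unfinished sequence kept from the last chunk
    size_t count = 0;

    FILE* f = fopen("test.d", "r");
    if (!f) return 1;

    for (;;)
    {
        immutable got = fread(buffer.ptr + carried, 1, bufSize - carried, f);
        if (got == 0) break;
        immutable len = carried + got;

        // Back up over continuation bytes (10xxxxxx) to the last lead byte,
        // then check whether the sequence it starts is completely in the buffer.
        size_t complete = len;
        size_t lead = len;
        while (lead > 0 && (buffer[lead - 1] & 0xC0) == 0x80)
            --lead;
        if (lead > 0)
        {
            immutable ubyte b = cast(ubyte) buffer[lead - 1];
            immutable size_t need = (b & 0x80) == 0 ? 1
                                  : (b & 0xE0) == 0xC0 ? 2
                                  : (b & 0xF0) == 0xE0 ? 3 : 4;
            if (len - (lead - 1) < need)
                complete = lead - 1;   // cut-off sequence: decode it next time
        }

        foreach (dchar d; buffer[0 .. complete])
            count++;

        carried = len - complete;
        if (carried)
            memmove(buffer.ptr, buffer.ptr + complete, carried);
    }
    fclose(f);
    printf("%d unicode characters\n", cast(int) count);
    return 0;
}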
Feb 03 2015
On Tuesday, 3 February 2015 at 23:55:19 UTC, FG wrote:
> On 2015-02-04 at 00:07, Foo wrote:
>> How would I use decoding for that? Isn't there a way to read the file as UTF-8 or, even better, as unicode?
>
> Well, apparently the UTF-8-aware foreach loop still works just fine. This program shows the file size and the number of unicode glyphs, or whatever they are called:
>
> import core.stdc.stdio;
>
> int main() @nogc
> {
>     const int bufSize = 64000;
>     char[bufSize] buffer;
>     size_t bytesRead, count;
>
>     FILE* f = core.stdc.stdio.fopen("test.d", "r");
>     if (!f)
>         return 1;
>
>     bytesRead = fread(cast(void*)buffer, 1, bufSize, f);
>     if (bytesRead > bufSize - 1) { printf("File is too big"); return 1; }
>     if (!bytesRead)
>         return 2;
>
>     foreach (dchar d; buffer[0 .. bytesRead])
>         count++;
>
>     printf("read %d bytes, %d unicode characters\n", bytesRead, count);
>     fclose(f);
>     return 0;
> }
>
> Outputs for example this:
>
> read 838 bytes, 829 unicode characters
>
> (It would be more complicated if it had to process bigger files.)

To use a foreach loop is such a nice idea! Thank you very much. :)

That's my code now:

----
private:

static import m3.m3;
static import core.stdc.stdio;

alias printf = core.stdc.stdio.printf;

public:

@trusted @nogc auto readFile(in string filename) nothrow
{
    import std.c.stdio : FILE, SEEK_END, SEEK_SET, fopen, fclose, fseek, ftell, fread;

    FILE* f = fopen(filename.ptr, "rb");
    fseek(f, 0, SEEK_END);
    immutable size_t fsize = ftell(f);
    fseek(f, 0, SEEK_SET);

    char[] str = m3.m3.make!(char[])(fsize);
    fread(str.ptr, fsize, 1, f);
    fclose(f);

    return str;
}

@trusted @nogc @property dstring toUTF32(in char[] s)
{
    dchar[] r = m3.m3.make!(dchar[])(s.length); // r will never be longer than s
    foreach (immutable size_t i, dchar c; s)
    {
        r[i] = c;
    }

    return cast(dstring) r;
}

@nogc void main()
{
    auto str = readFile("test_file.txt");
    scope(exit) m3.m3.destruct(str);

    auto str2 = str.toUTF32;

    printf("%d : %d\n", cast(int) str[0], cast(int) str2[0]);
}
----

m3 is my own module and means "manual memory management"; three m's, so m3. If we use D (which is now much more likely), that will be our core module for memory management.
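The example depends on Namespace's m3 module for make and destruct. For readers without it, a rough stand-in on top of the C heap could look like the following; this is only a guess at the interface used above (the real https://github.com/Dgame/m3 may differ), shown just so the snippet can be tried:

// Hypothetical stand-in, file m3/m3.d — not the actual Dgame/m3 code.
module m3.m3;

import core.stdc.stdlib : malloc, free;

// Allocate an uninitialized slice of n elements of element type E on the C heap.
T make(T : E[], E)(size_t n) @trusted @nogc nothrow
{
    auto p = cast(E*) malloc(E.sizeof * n);
    return p is null ? null : p[0 .. n];
}

// Free a slice previously obtained from make and null it out.
void destruct(T : E[], E)(ref T arr) @trusted @nogc nothrow
{
    free(cast(void*) arr.ptr);
    arr = null;
}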
Feb 03 2015
On 2015-02-04 at 01:56, Namespace wrote:
> FILE* f = fopen(filename.ptr, "rb");
> fseek(f, 0, SEEK_END);
> immutable size_t fsize = ftell(f);
> fseek(f, 0, SEEK_SET);

That's quite a smart way to get the size of the file. I started with std.file.getSize (which obviously isn't marked as @nogc) and ended up with the monstrosity below (which I have only compiled on Windows), so I decided not to mention it in my previous post. Wouldn't be the point anyway, since I have only shown an example with a single-fill fixed buffer. But here it is, rendered useless by your code:

long getFileSize(const char* cName) @nogc
{
    version (Windows)
    {
        import core.sys.windows.windows;

        WIN32_FILE_ATTRIBUTE_DATA fad;
        if (!GetFileAttributesExA(cName, GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, &fad))
            return -1;
        ULARGE_INTEGER li;
        li.LowPart = fad.nFileSizeLow;
        li.HighPart = fad.nFileSizeHigh;
        return li.QuadPart;
    }
    else version (Posix)
    {
        import core.sys.posix.sys.stat;

        stat_t statbuf = void;
        if (stat(cName, &statbuf))
            return -1;
        return statbuf.st_size;
    }
}
Feb 03 2015
Since I'm now almost finished, I'm glad to show you my work:

https://github.com/Dgame/m3

You're free to use it or to contribute to it.
Feb 04 2015
On Tuesday, 3 February 2015 at 23:07:03 UTC, Foo wrote:
> On Tuesday, 3 February 2015 at 19:44:49 UTC, FG wrote:
>> On 2015-02-03 at 19:53, Foo wrote:
>>> How can I do that without any GC allocation? Nothing in std.file seems to be marked with @nogc. I'm asking since it seems very complicated to do that with C++; maybe D is a better choice, in which case we would probably move our whole project from C++ to D.
>>
>> Looks like std.stdio isn't marked with @nogc all the way either. So for now the temporary solution would be to use std.c.stdio: get the file size, malloc a buffer large enough for it [1], use std.c.stdio.fread to fill it, assign it to a char[] slice and use std.utf.decode to consume the text... Oh wait, decode isn't @nogc either. FFS, what now?
>>
>> [1] I assume the file is small; otherwise there would be an extra step involved where, after nearing the end of the buffer, you move the rest of the data to the front, read new data after it, and continue decoding.
>
> How would I use decoding for that? Isn't there a way to read the file as UTF-8 or, even better, as unicode?

Arrays of char, wchar and dchar are supposed to be UTF strings, and of course you can just read them from a file using a C function. You'd just need to make sure they are valid UTF before passing them on to other parts of Phobos. What do you mean with "as unicode"?
Feb 03 2015
On Tuesday, 3 February 2015 at 18:53:28 UTC, Foo wrote:
> How can I do that without any GC allocation? Nothing in std.file seems to be marked with @nogc. I'm asking since it seems very complicated to do that with C++; maybe D is a better choice, in which case we would probably move our whole project from C++ to D.

My module

https://github.com/nordlow/justd/blob/master/mmfile_ex.d

together with

https://github.com/nordlow/justd/blob/master/bylines.d

is about as low-level as you can get in D.
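For comparison, plain std.mmfile from Phobos already gets close to a zero-copy read. The sketch below is not Nordlöw's code, and the MmFile object itself is still a small GC allocation, but the file contents are memory-mapped by the OS rather than copied onto the GC heap (the file name is just an example):

import std.mmfile : MmFile;
import std.utf : validate;

void example()
{
    auto mmf = new MmFile("test.d");        // opens and maps the file read-only
    scope(exit) destroy(mmf);               // unmap deterministically
    auto text = cast(const(char)[]) mmf[];  // view the mapped bytes as UTF-8
    validate(text);                         // throws UTFException if malformed
    // use text only while the mapping is alive
}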
Feb 03 2015