digitalmars.D.learn - Want to read a whole file as utf-8

Foo (5/5) Feb 03 2015 How can I do that without any GC allocation? Nothing in std.file

FG (10/12) Feb 03 2015 Looks like std.stdio isn't marked with @nogc all the way either.

Tobias Pankrath (3/22) Feb 03 2015 Use std.utf.validate instead of decode. It will only allocate one

FG (9/10) Feb 03 2015 Looks to me like it uses decode internally...

Foo (4/21) Feb 03 2015 Yes, we don't want to use a GC. We want deterministic lifetimes.

Foo (3/22) Feb 03 2015 How would I use decoding for that? Isn't there a way to read the

FG (27/28) Feb 03 2015 Well, apparently the utf-8-aware foreach loop still works just fine.

Namespace (45/78) Feb 03 2015 Using a foreach loop is such a nice idea! Thank you very much. :)

FG (25/29) Feb 03 2015 That's quite a smart way to get the size of the file.

Foo (3/3) Feb 04 2015 Since I'm now almost finished, I'm glad to show you my work:

Tobias Pankrath (6/34) Feb 03 2015 Arrays of char, wchar and dchar are supposed to be UTF strings

Nordlöw (6/11) Feb 03 2015 My module

"Foo" <Foo test.de> writes:

How can I do that without any GC allocation? Nothing in std.file
seems to be marked with @nogc.

I'm asking because this seems very complicated to do in C++.
Maybe D is a better choice; in that case we would probably move our
whole project from C++ to D.

Feb 03 2015

FG <home fgda.pl> writes:

On 2015-02-03 at 19:53, Foo wrote:
 How can I do that without any GC allocation? Nothing in std.file seems to be
 marked with @nogc

 I'm asking since it seems very complicated to do that with C++, maybe D is a
 better choice, then we would probably move our whole project from C++ to D.

Looks like std.stdio isn't marked with @nogc all the way either.

So for now the temporary solution would be to use std.c.stdio.
Get the file size, malloc a buffer large enough for it[1],
use std.c.stdio.fread to fill it, assign it to a char[] slice
and use std.utf.decode to consume the text...

Oh wait, decode isn't @nogc either. FFS, what now?


[1] I assume the file is small, otherwise there would be an extra step
involved where after nearing the end of the buffer you move the rest
of the data to the front, read new data after it, and continue decoding.
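
The steps above (get the size, malloc a buffer, fread into it, slice it) can be sketched roughly as follows. This is only a sketch under assumptions: `readWholeFile` is a hypothetical name, the path must be a zero-terminated C string, and the file is assumed to be a regular, seekable file small enough to fit in memory.

```d
import core.stdc.stdio : FILE, SEEK_END, SEEK_SET, fclose, fopen, fread, fseek, ftell;
import core.stdc.stdlib : free, malloc;

/// Hypothetical helper: reads a whole file into a malloc'ed buffer.
/// Returns null on failure. The caller owns the result and must
/// release it with free(result.ptr).
char[] readWholeFile(const(char)* path) @nogc nothrow
{
    FILE* f = fopen(path, "rb");
    if (f is null)
        return null;
    scope (exit) fclose(f);

    // Find the size by seeking to the end and asking for the position.
    if (fseek(f, 0, SEEK_END) != 0)
        return null;
    immutable len = ftell(f);
    if (len < 0 || fseek(f, 0, SEEK_SET) != 0)
        return null;

    immutable size = cast(size_t) len;
    // Allocate one extra byte so an empty file still yields a non-null pointer.
    auto p = cast(char*) malloc(size + 1);
    if (p is null)
        return null;

    // fread may deliver fewer bytes than requested; slice to what arrived.
    immutable got = fread(p, 1, size, f);
    return p[0 .. got];
}
```

The returned slice is plain `char[]`, so it can be iterated or validated as UTF-8 afterwards; nothing here decodes or checks the bytes.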

Feb 03 2015

"Tobias Pankrath" <tobias pankrath.net> writes:

On Tuesday, 3 February 2015 at 19:44:49 UTC, FG wrote:
 On 2015-02-03 at 19:53, Foo wrote:
 How can I do that without any GC allocation? Nothing in
 std.file seems to be marked with @nogc

 I'm asking since it seems very complicated to do that with
 C++, maybe D is a better choice, then we would probably move
 our whole project from C++ to D.

 Looks like std.stdio isn't marked with @nogc all the way either.

 So for now the temporary solution would be to use std.c.stdio.
 Get the file size, malloc a buffer large enough for it[1],
 use std.c.stdio.fread to fill it, assign it to a char[] slice
 and std.utf.decode to consume the text...

 Oh wait, decode isn't @nogc either. FFS, what now?


 [1] I assume the file is small, otherwise there would be an extra step
 involved where after nearing the end of the buffer you move the rest
 of the data to the front, read new data after it, and continue decoding.

Use std.utf.validate instead of decode. It will only allocate one 
exception if necessary.
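
A minimal sketch of that suggestion, assuming a wrapper named `isValidUtf8` (a hypothetical name, not part of Phobos). `std.utf.validate` signals malformed input by throwing a `UTFException`, so the wrapper itself cannot be marked @nogc:

```d
import std.utf : UTFException, validate;

/// Hypothetical wrapper: true when `text` is well-formed UTF-8.
/// Not @nogc, because validate reports errors by throwing.
bool isValidUtf8(const(char)[] text)
{
    try
    {
        validate(text);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

On valid input no exception is created, so the cost of the exception is only paid for broken files.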

Feb 03 2015

FG <home fgda.pl> writes:

On 2015-02-03 at 20:50, Tobias Pankrath wrote:
 Use std.utf.validate instead of decode. It will only allocate one exception if
 necessary.

Looks to me like it uses decode internally...

But Foo, do you have to use @nogc? It still looks like it's work in progress,
and lack of it doesn't mean that the GC is actually involved in the function.
It will probably take several months for the obvious @nogc parts of the std lib
to get annotated, and much longer to get rid of unnecessary use of the GC.
So maybe the solution for now is to verify the source code of the function in
question with one's own set of eyeballs and decide if it's good enough for use,
i.e. doesn't leak too much?

Feb 03 2015

"Foo" <Foo test.de> writes:

On Tuesday, 3 February 2015 at 19:56:37 UTC, FG wrote:
 On 2015-02-03 at 20:50, Tobias Pankrath wrote:
 Use std.utf.validate instead of decode. It will only allocate 
 one exception if necessary.

 Looks to me like it uses decode internally...

 But Foo, do you have to use @nogc? It still looks like it's work in progress,
 and lack of it doesn't mean that the GC is actually involved in the function.
 It will probably take several months for the obvious @nogc parts of the std lib
 to get annotated, and much longer to get rid of unnecessary use of the GC.
 So maybe the solution for now is to verify the source code of the function in
 question with one's own set of eyeballs and decide if it's good enough for use,
 i.e. doesn't leak too much?

Yes, we don't want to use a GC. We want deterministic lifetimes.
I'm not the boss, but I support the idea.

@Nordlöw: Neither of them can be marked with @nogc. :/

Feb 03 2015

"Foo" <Foo test.de> writes:

On Tuesday, 3 February 2015 at 19:44:49 UTC, FG wrote:
 On 2015-02-03 at 19:53, Foo wrote:
 How can I do that without any GC allocation? Nothing in
 std.file seems to be marked with @nogc

 I'm asking since it seems very complicated to do that with
 C++, maybe D is a better choice, then we would probably move
 our whole project from C++ to D.

 Looks like std.stdio isn't marked with @nogc all the way either.

 So for now the temporary solution would be to use std.c.stdio.
 Get the file size, malloc a buffer large enough for it[1],
 use std.c.stdio.fread to fill it, assign it to a char[] slice
 and std.utf.decode to consume the text...

 Oh wait, decode isn't @nogc either. FFS, what now?


 [1] I assume the file is small, otherwise there would be an extra step
 involved where after nearing the end of the buffer you move the rest
 of the data to the front, read new data after it, and continue decoding.

How would I use decoding for that? Isn't there a way to read the
file as UTF-8 or, even better, as Unicode?

Feb 03 2015

FG <home fgda.pl> writes:

On 2015-02-04 at 00:07, Foo wrote:
 How would I use decoding for that? Isn't there a way to read the file as UTF-8
 or, even better, as Unicode?

Well, apparently the UTF-8-aware foreach loop still works just fine.
This program shows the file size and the number of Unicode code points
(which is what a foreach with a dchar loop variable iterates over):

     import core.stdc.stdio;

     int main() @nogc
     {
         enum bufSize = 64000;
         char[bufSize] buffer = void;   // filled by fread below
         size_t count;
         // "rb": no newline translation, so bytesRead matches the file size
         FILE* f = fopen("test.d", "rb");
         if (!f)
             return 1;
         scope (exit) fclose(f);        // also runs on the early returns
         immutable bytesRead = fread(buffer.ptr, 1, bufSize, f);
         if (bytesRead == bufSize) {    // buffer is full: the file may not fit
             printf("File is too big\n");
             return 1;
         }
         if (!bytesRead)
             return 2;
         // a dchar loop variable makes foreach decode UTF-8 on the fly
         foreach (dchar d; buffer[0 .. bytesRead])
             count++;
         printf("read %zu bytes, %zu unicode characters\n", bytesRead, count);
         return 0;
     }

Outputs for example this: read 838 bytes, 829 unicode characters

(It would be more complicated if it had to process bigger files.)
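
For bigger files, one possible approach is to refill a fixed buffer and carry an unfinished multi-byte sequence over to the next round. The sketch below is an assumption about how that could look, not FG's code: `completePrefix` and `countCodePoints` are hypothetical names, the 4096-byte buffer size is arbitrary, and, like the program above, it relies on the compiler accepting the decoding foreach inside @nogc code. Malformed input still makes the foreach throw.

```d
import core.stdc.stdio : FILE, fclose, fopen, fread;
import core.stdc.string : memmove;

/// Length of the longest prefix of `buf` that does not stop in the middle
/// of a multi-byte UTF-8 sequence (assumes otherwise well-formed input).
size_t completePrefix(const(char)[] buf) @nogc nothrow pure @safe
{
    size_t i = buf.length;
    // Step back over at most three continuation bytes (10xxxxxx).
    while (i > 0 && buf.length - i < 3 && (buf[i - 1] & 0xC0) == 0x80)
        --i;
    if (i == 0)
        return buf.length;
    immutable lead = buf[i - 1];
    // Total sequence length announced by the lead byte.
    immutable size_t need =
        lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : lead >= 0xC0 ? 2 : 1;
    immutable have = buf.length - (i - 1);
    return have < need ? i - 1 : buf.length;
}

/// Counts the code points of a file of any size with a fixed buffer.
/// Returns -1 if the file cannot be opened.
long countCodePoints(const(char)* path) @nogc
{
    FILE* f = fopen(path, "rb");
    if (f is null)
        return -1;
    scope (exit) fclose(f);

    char[4096] buffer = void;
    size_t carried = 0; // bytes of an unfinished sequence from the last round
    long count = 0;
    for (;;)
    {
        immutable got = fread(buffer.ptr + carried, 1, buffer.length - carried, f);
        immutable filled = carried + got;
        if (filled == 0)
            break;
        // Once nothing more arrives, decode whatever is left.
        immutable usable = got == 0 ? filled : completePrefix(buffer[0 .. filled]);
        foreach (dchar c; buffer[0 .. usable])
            ++count;
        // Move the unfinished tail to the front and read new data after it.
        carried = filled - usable;
        memmove(buffer.ptr, buffer.ptr + usable, carried);
    }
    return count;
}
```

At most three bytes are ever carried over, so the buffer always has room for new data.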

Feb 03 2015

"Namespace" <rswhite4 gmail.com> writes:

On Tuesday, 3 February 2015 at 23:55:19 UTC, FG wrote:
 On 2015-02-04 at 00:07, Foo wrote:
 How would I use decoding for that? Isn't there a way to read
 the file as UTF-8 or, even better, as Unicode?

 Well, apparently the UTF-8-aware foreach loop still works just
 fine.
 This program shows the file size and the number of Unicode
 code points:

     import core.stdc.stdio;

     int main() @nogc
     {
         enum bufSize = 64000;
         char[bufSize] buffer = void;
         size_t count;
         FILE* f = fopen("test.d", "rb");
         if (!f)
             return 1;
         scope (exit) fclose(f);
         immutable bytesRead = fread(buffer.ptr, 1, bufSize, f);
         if (bytesRead == bufSize) {
             printf("File is too big\n");
             return 1;
         }
         if (!bytesRead)
             return 2;
         foreach (dchar d; buffer[0 .. bytesRead])
             count++;
         printf("read %zu bytes, %zu unicode characters\n", bytesRead, count);
         return 0;
     }

 Outputs for example this: read 838 bytes, 829 unicode characters

 (It would be more complicated if it had to process bigger files.)

Using a foreach loop is such a nice idea! Thank you very much. :)

That's my code now:
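----
private:

static import m3.m3;
static import core.stdc.stdio;
alias printf = core.stdc.stdio.printf;

public:

@trusted
@nogc
char[] readFile(in string filename) nothrow {
	import core.stdc.stdio : FILE, SEEK_END, SEEK_SET, fopen, fclose,
		fseek, ftell, fread;

	// Note: filename.ptr is only zero-terminated for string literals.
	FILE* f = fopen(filename.ptr, "rb");
	if (!f)
		return null;
	scope(exit) fclose(f);

	fseek(f, 0, SEEK_END);
	immutable len = ftell(f); // -1 on error
	fseek(f, 0, SEEK_SET);
	if (len < 0)
		return null;
	immutable size_t fsize = cast(size_t) len;

	char[] str = m3.m3.make!(char[])(fsize);
	// Read fsize items of 1 byte; keep only what was actually read.
	immutable size_t got = fread(str.ptr, 1, fsize, f);

	return str[0 .. got];
}

@trusted
@nogc
@property
dstring toUTF32(in char[] s) {
	// r will never be longer than s
	dchar[] r = m3.m3.make!(dchar[])(s.length);
	// The index foreach would supply is the byte offset into s,
	// so count the decoded code points separately.
	size_t n = 0;
	foreach (dchar c; s) {
		r[n++] = c;
	}

	return cast(dstring) r[0 .. n];
}

@nogc
void main() {
	auto str = readFile("test_file.txt");
	scope(exit) m3.m3.destruct(str);
	if (str.length == 0)
		return; // missing or empty file

	auto str2 = str.toUTF32;
	printf("%d : %d\n", cast(int) str[0], cast(int) str2[0]);
}
----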

m3 is my own module; the name stands for "manual memory management":
three m's, hence m3. If we use D (which is now much more likely), it
will be our core module for memory management.

Feb 03 2015

FG <home fgda.pl> writes:

On 2015-02-04 at 01:56, Namespace wrote:
      FILE* f = fopen(filename.ptr, "rb");
      fseek(f, 0, SEEK_END);
      immutable size_t fsize = ftell(f);
      fseek(f, 0, SEEK_SET);

That's quite a smart way to get the size of the file.

I started with std.file.getSize (which obviously isn't marked as @nogc) and
ended up with the monstrosity below (which I have only compiled on Windows), so
I decided not to mention it in my previous post. It wouldn't have been the point
anyway, since I only showed an example with a single-fill fixed buffer. But here
it is, rendered useless by your code:

     long getFileSize(const char* cName) @nogc
     {
         version(Windows)
         {
             import core.sys.windows.windows;
             WIN32_FILE_ATTRIBUTE_DATA fad;
             if (!GetFileAttributesExA(cName,
                     GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, &fad))
                 return -1;
             // combine the two 32-bit halves into one 64-bit size
             ULARGE_INTEGER li;
             li.LowPart = fad.nFileSizeLow;
             li.HighPart = fad.nFileSizeHigh;
             return li.QuadPart;
         }
         else version(Posix)
         {
             import core.sys.posix.sys.stat;
             stat_t statbuf = void;   // filled in by stat()
             if (stat(cName, &statbuf))
                 return -1;
             return statbuf.st_size;
         }
         else
             static assert(0, "getFileSize: unsupported platform");
     }

Feb 03 2015

"Foo" <Foo test.de> writes:

Since I'm now almost finished, I'm glad to show you my work: 
https://github.com/Dgame/m3
You're free to use it or to contribute to it.

Feb 04 2015

"Tobias Pankrath" <tobias pankrath.net> writes:

On Tuesday, 3 February 2015 at 23:07:03 UTC, Foo wrote:
 On Tuesday, 3 February 2015 at 19:44:49 UTC, FG wrote:
 On 2015-02-03 at 19:53, Foo wrote:
 How can I do that without any GC allocation? Nothing in
 std.file seems to be marked with @nogc

 I'm asking since it seems very complicated to do that with
 C++, maybe D is a better choice, then we would probably move
 our whole project from C++ to D.

 Looks like std.stdio isn't marked with @nogc all the way
 either.

 So for now the temporary solution would be to use std.c.stdio.
 Get the file size, malloc a buffer large enough for it[1],
 use std.c.stdio.fread to fill it, assign it to a char[] slice
 and std.utf.decode to consume the text...

 Oh wait, decode isn't @nogc either. FFS, what now?


 [1] I assume the file is small, otherwise there would be an extra step
 involved where after nearing the end of the buffer you move the rest
 of the data to the front, read new data after it, and continue decoding.

 How would I use decoding for that? Isn't there a way to read
 the file as UTF-8 or, even better, as Unicode?

Arrays of char, wchar and dchar are supposed to be UTF strings,
and of course you can just read them from a file using a C
function. You'd just need to make sure they are valid UTF before
passing them on to other parts of Phobos.
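
Since the thread is about avoiding the GC, one way to do that check without exceptions is a small hand-written validator. The following is only a sketch under assumptions: `isWellFormedUtf8` is a hypothetical name and not a Phobos function, and it follows the usual UTF-8 well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF).

```d
/// Hypothetical helper: true if `s` is well-formed UTF-8.
/// Never throws and never allocates.
bool isWellFormedUtf8(const(char)[] s) @nogc nothrow pure @safe
{
    size_t i = 0;
    while (i < s.length)
    {
        immutable uint b = s[i];
        if (b < 0x80) // plain ASCII
        {
            ++i;
            continue;
        }

        size_t extra; // continuation bytes expected after the lead byte
        uint cp;      // code point being assembled
        uint min;     // smallest code point allowed for this length
        if (b >= 0xC2 && b <= 0xDF)      { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else
            return false; // stray continuation byte or invalid lead byte

        if (s.length - i <= extra)
            return false; // sequence is cut off at the end of the input

        foreach (k; 1 .. extra + 1)
        {
            immutable uint c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong encodings, surrogates and out-of-range values.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```

A buffer that passes this check can then be handed to code that assumes valid UTF-8, for example a decoding foreach.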

What do you mean by "as unicode"?

Feb 03 2015

"Nordlöw" <per.nordlow gmail.com> writes:

On Tuesday, 3 February 2015 at 18:53:28 UTC, Foo wrote:
 How can I do that without any GC allocation? Nothing in
 std.file seems to be marked with @nogc

 I'm asking since it seems very complicated to do that with C++,
 maybe D is a better choice, then we would probably move our
 whole project from C++ to D.

My module

https://github.com/nordlow/justd/blob/master/mmfile_ex.d

together with

https://github.com/nordlow/justd/blob/master/bylines.d

is about as low-level as you can get in D.

Feb 03 2015
