digitalmars.D - Is there any reasons to not use "mmap" to read files?

rempas (11/11) Feb 06 2022 This should have probably been posted in the "Learn" section but

Elronnd (9/13) Feb 06 2022 Performance is weird, and depends a lot on your access patterns

rempas (6/14) Feb 06 2022 Thank you! I will actually make a compiler so it will just open

Temtaime (5/21) Feb 06 2022 Perso i'm almost always use mmap for opening large files for r/w.

rempas (5/9) Feb 06 2022 Thank you! For how big files are we talking about? Also like

Ali Çehreli (24/32) Feb 06 2022 So big that they can't fit in memory. For example, I benefit from mmap

rempas (11/36) Feb 06 2022 Thank you! I will have that in mind in case I want to do

Temtaime (6/20) Feb 06 2022 Windows has its own API to mmap files. There's no need to

rempas (5/10) Feb 06 2022 Yeah, I'm really glad that it works for you but I will have to

H. S. Teoh (6/16) Feb 06 2022 Just read the Phobos source code for std.mmfile. Phobos code is very

IGotD- (2/4) Feb 06 2022 Phobos/druntime is like an #ifdef hell but with version instead.

rempas (8/9) Feb 06 2022 Actually, I personally tried to read some files (like stdio.d and

H. S. Teoh (8/15) Feb 07 2022 I tried reading GLIB source code once. I will never ever do it again.

rempas (6/10) Feb 12 2022 Don't be so sure about that! Everything "GNU" seems to be bloated

rempas (5/8) Feb 06 2022 I suppose this goes to what I said that I need to properly learn

Ali Çehreli (4/5) Feb 06 2022 Yes. I misspoke: What I meant was std.mmfile handles the differences

rempas (3/6) Feb 06 2022 Cool! That what I would expect from a library tbh so I'm glad it

Patrick Schluter (8/19) Feb 06 2022 mmap has quite the overhead to set up the page table for a file.

rempas (5/12) Feb 06 2022 Thank you! After all I've heard, I will probably stick with

Steven Schveighoffer (14/25) Feb 07 2022 Will mmap be faster than fgetc? Almost certainly.

norm (3/19) Feb 07 2022 +1 iopipe !
rempas (5/19) Feb 12 2022 Thanks for your time Steve! I will do proper testing like you

sarn (8/19) Feb 08 2022 One reason to use read/write based I/O by default is that it's

rempas (4/11) Feb 12 2022 Yeah, thank you! I will open and read the whole file anyways so

user1234 (3/14) Feb 08 2022 `std.file.readText()` is just fine... your really want to do an

rempas (6/8) Feb 12 2022 Good point! I was really wondering if "fgetc" does a system call

user1234 (9/19) Feb 12 2022 I think that nowadays fgetc does not make sense anymore, maybe in

Basile B. (34/54) Feb 12 2022 The problem with phobos and if used to program a compiler is

Ali Çehreli (7/8) Feb 12 2022 I have a related experience: I realized that very many ftell() calls

H. S. Teoh (12/22) Feb 12 2022 [...]

Steven Schveighoffer (12/32) Feb 12 2022 `ftell` does not *need* to do a system call to get the current file
Patrick Schluter (6/28) Feb 13 2022 fread reads from its internal buffer when it can. By default it

Steven Schveighoffer (8/36) Feb 13 2022 If you seek within the buffer it could potentially leave the buffer

Patrick Schluter (8/16) Feb 13 2022 ftell() and fseek() use a syscall but also trigger that the next

Steven Schveighoffer (9/24) Feb 12 2022 Just a clarification here -- `fgetc` does NOT do an OS system call for

Florian Weimer (7/7) Feb 13 2022 One issue that hasn't been mentioned so far is that if the input

rempas (3/10) Feb 13 2022 Thank you for the info! That's very important and I'll keep in in

rempas <rempas tutanota.com> writes:

This should have probably been posted in the "Learn" section but 
I thought that it is an advanced topic so maybe people other than 
me may learn something too. So here we go!

I'm planning to make a change to my program to use "mmap" to read 
the contents of a file rather than "fgetc". This is because I 
learned that "mmap" can do it faster. The thing is, are there any 
problems that can occur when using "mmap"? I need to know now 
because changing this means changing the design of the program, 
and that is not something pleasant to do, so I want to be sure 
that I won't have to change back in the future (when the project 
will be even bigger).

Feb 06 2022

Elronnd <elronnd elronnd.net> writes:

On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 I'm planning to make a change to my program to use "mmap" to 
 the contents of a file rather than "fgetc".  This is because I 
 learned that "mmap" can do it faster.  The thing is, are there 
 any problems that can occur when using "mmap"?

Performance is weird, and depends a lot on your access patterns 
and constraints.  Mmap is not universally fast and, I would 
argue, really only makes sense in a few constrained 
circumstances.  I would not switch to mmap just because you heard 
it was faster; only consider switching if you know i/o is a 
bottleneck for your application and know mmap is the solution.

https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf  
recent, good read.

Feb 06 2022

rempas <rempas tutanota.com> writes:

On Sunday, 6 February 2022 at 10:08:24 UTC, Elronnd wrote:
 Performance is weird, and depends a lot on your access patterns 
 and constraints.  Mmap is not universally fast and, I would 
 argue, really only makes sense in a few constrained 
 circumstances.  I would not switch to mmap just because you 
 heard it was faster; only consider switching if you know i/o is 
 a bottleneck for your application and know mmap is the solution.

 https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf  
 recent, good read.

Thank you! I will actually make a compiler, so it will just open 
and read the requested files. I don't know if the database 
example you linked will be similar to my case (I will of course 
read it tho), so I guess I have to do my research just to be 
sure.

Feb 06 2022

Temtaime <temtaime gmail.com> writes:

On Sunday, 6 February 2022 at 10:48:01 UTC, rempas wrote:
 On Sunday, 6 February 2022 at 10:08:24 UTC, Elronnd wrote:
 Performance is weird, and depends a lot on your access 
 patterns and constraints.  Mmap is not universally fast and, I 
 would argue, really only makes sense in a few constrained 
 circumstances.  I would not switch to mmap just because you 
 heard it was faster; only consider switching if you know i/o 
 is a bottleneck for your application and know mmap is the 
 solution.

 https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf  
 recent, good read.

 Thank you! I will actually make a compiler so it will just open 
 and read the requested files. I don't know if the database 
 example you linked will be similar to my case (I will of course 
 read it tho) so I have to make my research I guess just to be 
 sure.

Personally, I almost always use mmap for opening large files for 
r/w. It IS faster.
The exceptions are small files that can be read into memory using 
std.file.read, for example.

Feb 06 2022

rempas <rempas tutanota.com> writes:

On Sunday, 6 February 2022 at 10:53:49 UTC, Temtaime wrote:
 Perso i'm almost always use mmap for opening large files for 
 r/w. It IS faster.
 Exception are small ones that can be read into the memory using 
 std.file.read for example.

Thank you! How big are the files we are talking about? Also, as 
another guy told me in another (C) forum, "mmap" is for Unix 
systems, so do you know if Windows or MacOS can emulate that 
behavior with their memory allocation system calls?

Feb 06 2022

Ali Çehreli <acehreli yahoo.com> writes:

On 2/6/22 04:21, rempas wrote:
 On Sunday, 6 February 2022 at 10:53:49 UTC, Temtaime wrote:
 Perso i'm almost always use mmap for opening large files for r/w. It
 IS faster.


Ditto.

 how big files are we talking about?

So big that they can't fit in memory. For example, I benefit from mmap 
on a 16G system where a file would be 30G.

As others said, it depends on the use case. If the entire file will be 
read anyway especially in sequential order, then mmap may not have much 
benefit. In my use case though it is common to just read unknown small 
amounts of bytes from unknown places of the huge file. (Say, 5G total 
out of a 30G.)

Instead of my making multiple reads to those interesting parts of the 
file, mmap handles everything transparently: Just mmap the whole thing 
as a single array and access parts of that memory as needed.

One huge improvement is to add madvise(2) system call to the picture to 
tell the system the exact amount of memory that will be touched so the 
OS reads in a single shot. Otherwise, the system reads by a default 
amount, which I think is 4K, which can turn out to be pathetically slow 
e.g. when the file is accessed over a slow network. (Why read 4K when 
the need is just 200 bytes, and why read in 4K steps when the need is 
already known to be 1M?)
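
A minimal C sketch of the approach described above, assuming a POSIX system: map the whole file read-only, advise the kernel about the one range that will be touched, then access it like an array. The helper name `read_span_mmap` is made up for illustration, and `MADV_WILLNEED` is only one plausible choice of advice; the post does not say which flag was used.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes madvise() on glibc */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: copy `len` bytes at `offset` from the file at `path`
 * into `out` through a read-only mapping. Returns 0 on success, -1 on error. */
static int read_span_mmap(const char *path, size_t offset, size_t len, void *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    size_t size = (size_t)st.st_size;
    if (offset > size || len > size - offset) {
        close(fd);
        return -1;
    }

    unsigned char *base = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return -1;

    /* madvise() wants a page-aligned start, so round the offset down. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = offset - offset % page;
    /* Advisory only: ask the kernel to fetch exactly this range up front. */
    (void)madvise(base + start, offset + len - start, MADV_WILLNEED);

    memcpy(out, base + offset, len);
    munmap(base, size);
    return 0;
}
```

A real program would keep the mapping alive across many lookups instead of mapping and unmapping per call; the single-call shape here only keeps the sketch short.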

 Also like another guy
 told me in another (C) forum, "mmap" is for Unix systems so do you know
 if Windows or MacOS can emulate that behavior with their memory
 allocation system calls?

I haven't used mmap on Windows but it's in Phobos, so it should work. 
After all, mmap uses the virtual memory system of the OS and non-ancient 
Windows versions do use virtual memory and std.mmfile does include 
'version (Windows)' sections; so, yes. :)

Ali

Feb 06 2022

rempas <rempas tutanota.com> writes:

On Sunday, 6 February 2022 at 16:45:59 UTC, Ali Çehreli wrote:
 So big that they can't fit in memory. For example, I benefit 
 from mmap on a 16G system where a file would be 30G.

Oh, this small...

 As others said, it depends on the use case. If the entire file 
 will be read anyway especially in sequential order, then mmap 
 may not have much benefit. In my use case though it is common 
 to just read unknown small amounts of bytes from unknown places 
 of the huge file. (Say, 5G total out of a 30G.)

 Instead of my making multiple reads to those interesting parts 
 of the file, mmap handles everything transparently: Just mmap 
 the whole thing as a single array and access parts of that 
 memory as needed.

Thank you! I will have that in mind in case I want to do 
something like that in the future. In my use-case tho, I will 
read the whole file.

 One huge improvement is to add madvise(2) system call to the 
 picture to tell the system the exact amount of memory that will 
 be touched so the OS reads in a single shot. Otherwise, the 
 system reads by a default amount, which I think is 4K, which 
 can turn out to be pathetically slow e.g. when the file is 
 accessed over a slow network. (Why read 4K when the need is 
 just 200 bytes and why read in 4K steps when the need is 
 already to be 1M?)

 I haven't used mmap on Windows but it's in Phobos, so it should 
 work. After all, mmap uses the virtual memory system of the OS 
 and non-ancient Windows versions do use virtual memory and 
 std.mmfile does include 'version (windows)' sections; so, yes. 
 :)

 Ali

"mmap" is a system call that doesn't exist (natively) on Windows. 
I don't know what D does with Phobos (which I'm not gonna use 
anyway) but even if it works (how?), I will end up creating my 
own library so I'm in the same spot. "madvise" seems cool, I'll 
check it out! Thanks! In the end, I like advising and telling 
others how to do their work, XD!

Feb 06 2022

Temtaime <temtaime gmail.com> writes:

On Sunday, 6 February 2022 at 18:14:51 UTC, rempas wrote:
 On Sunday, 6 February 2022 at 16:45:59 UTC, Ali Çehreli wrote:
 [...]

 Oh, this small...

 [...]

 Thank you! I will have that in mind in case I want to do 
 something like that in the future. In my use-case tho, I will 
 read the whole file.

 [...]

 "mmap" is a system call that doesn't exist (natively) on 
 Windows. I don't know what D does with Phobos (which I'm not 
 gonna use anyway) but even if it works (how?), I will end up 
 creating my own library so I'm in the same spot. "madvise" 
 seems cool, I'll check it out! Thanks! In the end, I like 
 advising and telling others how to do their work, XD!

Windows has its own API to mmap files. There's no need to 
reinvent the wheel, phobos MmFile works for me without any 
problems.
Maybe there's no flush function, but for my use cases it's not so 
critical.

Feb 06 2022

rempas <rempas tutanota.com> writes:

On Sunday, 6 February 2022 at 18:33:59 UTC, Temtaime wrote:
 Windows has its own API to mmap files. There's no need to 
 reinvent the wheel, phobos MmFile works for me without any 
 problems.
 Maybe there's no flush function, but for my use cases it's not 
 so critical.

Yeah, I'm really glad that it works for you, but I will have to 
create a library for my compiler, so I need to properly learn 
how things work so I'll know what I'm doing when the time 
comes. So yeah...

Feb 06 2022

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sun, Feb 06, 2022 at 06:47:45PM +0000, rempas via Digitalmars-d wrote:
 On Sunday, 6 February 2022 at 18:33:59 UTC, Temtaime wrote:
 Windows has its own API to mmap files. There's no need to reinvent
 the wheel, phobos MmFile works for me without any problems.  Maybe
 there's no flush function, but for my use cases it's not so
 critical.

 
 Yeah, I'm really glad that it works for you but I will have to create
 a library for my compiler so there is a need to properly learn how
 things work so I'll know what I'm doing when the times comes. So
 yeah...

Just read the Phobos source code for std.mmfile. Phobos code is very
readable compared to most typical standard libraries.


T

-- 
Turning your clock 15 minutes ahead won't cure lateness---you're just making
time go faster!

Feb 06 2022

IGotD- <nise nise.com> writes:

On Sunday, 6 February 2022 at 20:12:39 UTC, H. S. Teoh wrote:
 Just read the Phobos source code for std.mmfile. Phobos code is 
 very readable compared to most typical standard libraries.

Phobos/druntime is like an #ifdef hell but with version instead.

Feb 06 2022

rempas <rempas tutanota.com> writes:

On Sunday, 6 February 2022 at 20:48:09 UTC, IGotD- wrote:
 Phobos/druntime is like an #ifdef hell but with version instead.

Actually, I personally tried to read some files (like stdio.d and 
conv.d) and while I didn't find them super user friendly, they 
are WAY clearer and easier to read than GLIB! I don't know if 
every libc's header files are like that in every OS, and I'm also 
a super super n00b when it comes to reading other people's source 
code, so maybe H. S. Teoh is just better than us at reading code, 
idk...

Feb 06 2022

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Mon, Feb 07, 2022 at 07:16:55AM +0000, rempas via Digitalmars-d wrote:
 On Sunday, 6 February 2022 at 20:48:09 UTC, IGotD- wrote:
 Phobos/druntime is like an #ifdef hell but with version instead.

 
 Actually, I personally tried to read some files (like stdio.d and
 conv.d) and while I didn't found them super user friendly, they are
 WAY more clear and easy to read then GLIB!

I tried reading GLIB source code once. I will never ever do it again.
:-P


 I don't know if every libc's header files are like that in every OS

[...]

If it's in C? Yeah, they all look like that.


T

-- 
Shin: (n.) A device for finding furniture in the dark.

Feb 07 2022

rempas <rempas tutanota.com> writes:

On Monday, 7 February 2022 at 18:31:42 UTC, H. S. Teoh wrote:
 I tried reading GLIB source code once. I will never ever do it 
 again. :-P

C!!!! You gotta love it, lol!


 If it's in C? Yeah, they all look like that.


 T

Don't be so sure about that! Everything "GNU" seems to be bloated 
but try to read some *BSD libc source code. It's both a little 
bit more readable and more organized, minimal and simple to 
understand.

Feb 12 2022

rempas <rempas tutanota.com> writes:

On Sunday, 6 February 2022 at 20:12:39 UTC, H. S. Teoh wrote:
 Just read the Phobos source code for std.mmfile. Phobos code is 
 very readable compared to most typical standard libraries.


 T

I suppose this goes to what I said that I need to properly learn 
how things work right? Well I mean, how Windows does it with the 
system call. I don't think that it is necessary to read the 
Phobos source code. But regardless, thanks for the suggestion!

Feb 06 2022

Ali Çehreli <acehreli yahoo.com> writes:

On 2/6/22 10:14, rempas wrote:

 "mmap" is a system call that doesn't exist (natively) on Windows.

Yes. I misspoke: What I meant was std.mmfile handles the differences 
automatically between systems.

Ali

Feb 06 2022

rempas <rempas tutanota.com> writes:

On Sunday, 6 February 2022 at 18:53:13 UTC, Ali Çehreli wrote:
 Yes. I misspoke: What I meant was std.mmfile handles the 
 differences automatically between systems.

 Ali

Cool! That's what I would expect from a library tbh, so I'm glad it 
works like that!

Feb 06 2022

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 This should have probably been posted in the "Learn" section 
 but I thought that it is an advanced topic so maybe people 
 other than me may learn something too. So here we go!

 I'm planning to make a change to my program to use "mmap" to 
 the contents of a file rather than "fgetc". This is because I 
 learned that "mmap" can do it faster. The thing is, are there 
 any problems that can occur when using "mmap"? I need to know 
 now because changing this means changing the design of the 
 program and this is not something pleasant to do so I want to 
 be sure that I won't have to change back in the future (where 
 the project will be even bigger).

mmap has quite a bit of overhead to set up the page table for a 
file. This means that for small files, open/read/write calls (and 
stdio, which builds on them) are faster.

The other issue with mmap is that if you use string functions on 
the mapped region, you have to make sure that the file contains a 
0 byte; otherwise these functions may overshoot into unmapped 
pages and crash the application.
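
A minimal C sketch of the safer habit, assuming the mapping is always handled as a pointer plus a length: `memchr` stops at the length it is given, whereas `strchr` and `strlen` stop only at a 0 byte. The helper name `first_line_len` is made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the first line in buf[0..size), not
 * counting the newline. memchr never reads past `size`, so this is safe
 * on data that has no terminating 0 byte, such as a mapped file. */
static size_t first_line_len(const char *buf, size_t size)
{
    const char *nl = memchr(buf, '\n', size);
    return nl ? (size_t)(nl - buf) : size;
}
```

The overshoot is most likely when the file size is an exact multiple of the page size; otherwise the rest of the last mapped page is zero-filled, which happens to stop the string functions.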

Feb 06 2022

rempas <rempas tutanota.com> writes:

On Sunday, 6 February 2022 at 12:52:45 UTC, Patrick Schluter 
wrote:
 mmap has quite the overhead to set up the page table for a 
 file. This means for small files, open/read/write calls (and 
 stdio which build on it) are faster.

 The other issue with mmap is if you use string functions on the 
 mapped part, you have to make sure that there are 0 bytes in 
 the file or else you risk these functions to overshoot to 
 unmapped pages and crashing the application.

Thank you! After all I've heard, I will probably stick with 
"read". The files I'm going to read are going to be a few 
kilobytes (megabytes at worst), so I should probably be fine.

Feb 06 2022

Steven Schveighoffer <schveiguy gmail.com> writes:

On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 This should have probably been posted in the "Learn" section 
 but I thought that it is an advanced topic so maybe people 
 other than me may learn something too. So here we go!

 I'm planning to make a change to my program to use "mmap" to 
 the contents of a file rather than "fgetc". This is because I 
 learned that "mmap" can do it faster. The thing is, are there 
 any problems that can occur when using "mmap"? I need to know 
 now because changing this means changing the design of the 
 program and this is not something pleasant to do so I want to 
 be sure that I won't have to change back in the future (where 
 the project will be even bigger).

Will mmap be faster than fgetc? Almost certainly.

Will it be faster than other i/o systems? Possibly not.

For my i/o system, [iopipe](https://github.com/schveiguy/iopipe), 
every array is also an iopipe, so switching between mmap and file 
i/o is trivial. See [my talk in 
2017](https://dconf.org/2017/talks/schveighoffer.html) where I 
switched to mmap while on stage to show the difference.

IMO, the best way to determine which is better is to try it and 
measure. Having an i/o system that allows easy switching is 
helpful.

For sure, depending on your other tasks in your program, 
improving the file i/o might be insignificant.

-Steve

Feb 07 2022

norm <norm.rowtree gmail.com> writes:

On Tuesday, 8 February 2022 at 03:33:11 UTC, Steven Schveighoffer 
wrote:
 On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 [...]

 Will mmap be faster than fgetc? Almost certainly.

 Will it be faster than other i/o systems? Possibly not.

 for my i/o system 
 [iopipe](https://github.com/schveiguy/iopipe), every array is 
 also an iopipe, so switching between mmap and file i/o is 
 trivial. See [my talk in 
 2017](https://dconf.org/2017/talks/schveighoffer.html) where I 
 switched to mmap while on stage to show the difference.

 IMO, the best way to determine which is better is to try it and 
 measure. Having an i/o system that allows easy switching is 
 helpful.

 For sure, depending on your other tasks in your program, 
 improving the file i/o might be insignificant.

 -Steve

+1 iopipe !

Feb 07 2022

rempas <rempas tutanota.com> writes:

On Tuesday, 8 February 2022 at 03:33:11 UTC, Steven Schveighoffer 
wrote:
 Will mmap be faster than fgetc? Almost certainly.

 Will it be faster than other i/o systems? Possibly not.

 for my i/o system 
 [iopipe](https://github.com/schveiguy/iopipe), every array is 
 also an iopipe, so switching between mmap and file i/o is 
 trivial. See [my talk in 
 2017](https://dconf.org/2017/talks/schveighoffer.html) where I 
 switched to mmap while on stage to show the difference.

 IMO, the best way to determine which is better is to try it and 
 measure. Having an i/o system that allows easy switching is 
 helpful.

 For sure, depending on your other tasks in your program, 
 improving the file i/o might be insignificant.

 -Steve

Thanks for your time Steve! I will do proper testing like you 
suggested and see! It will take some time but I think it's worth 
it rather than randomly choose between one of them :)

Feb 12 2022

sarn <sarn theartofmachinery.com> writes:

On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 This should have probably been posted in the "Learn" section 
 but I thought that it is an advanced topic so maybe people 
 other than me may learn something too. So here we go!

 I'm planning to make a change to my program to use "mmap" to 
 the contents of a file rather than "fgetc". This is because I 
 learned that "mmap" can do it faster. The thing is, are there 
 any problems that can occur when using "mmap"? I need to know 
 now because changing this means changing the design of the 
 program and this is not something pleasant to do so I want to 
 be sure that I won't have to change back in the future (where 
 the project will be even bigger).

One reason to use read/write based I/O by default is that it's 
more versatile.  It's kind of like an input range versus a 
random-access range in Phobos.

```
// Could not map file /dev/stdin (Invalid argument)
auto f = new MmFile("/dev/stdin");
```

Feb 08 2022

rempas <rempas tutanota.com> writes:

On Tuesday, 8 February 2022 at 21:37:29 UTC, sarn wrote:
 One reason to use read/write based I/O by default is that it's 
 more versatile.  It's kind of like an input range versus a 
 random access in Phobos.

 ```
 // Could not map file /dev/stdin (Invalid argument)
 auto f = new MmFile("/dev/stdin");
 ```

Yeah, thank you! I will open and read the whole file anyway, so 
it seems that it makes sense to try this method and then 
measure my program in the future to see! Have a nice day!

Feb 12 2022

user1234 <user1234 12.de> writes:

On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 This should have probably been posted in the "Learn" section 
 but I thought that it is an advanced topic so maybe people 
 other than me may learn something too. So here we go!

 I'm planning to make a change to my program to use "mmap" to 
 the contents of a file rather than "fgetc". This is because I 
 learned that "mmap" can do it faster. The thing is, are there 
 any problems that can occur when using "mmap"? I need to know 
 now because changing this means changing the design of the 
 program and this is not something pleasant to do so I want to 
 be sure that I won't have to change back in the future (where 
 the project will be even bigger).

`std.file.readText()` is just fine... do you really want to make 
an OS call with `fgetc` for every single byte that has to be read?

Feb 08 2022

rempas <rempas tutanota.com> writes:

On Wednesday, 9 February 2022 at 02:07:05 UTC, user1234 wrote:
 `std.file.readText()` is just fine... your really want to do an 
 os with call `fgetc` for every single byte that has to be read ?

Good point! I was really wondering if "fgetc" does a system call 
every single time that it is called or if the text is buffered 
just like with "printf". I will use "read" in any case just to be 
sure tho. I don't want to use Phobos tho so I cannot use 
"file.readText". Thank you for your time!

Feb 12 2022

user1234 <user1234 12.de> writes:

On Saturday, 12 February 2022 at 13:17:19 UTC, rempas wrote:
 On Wednesday, 9 February 2022 at 02:07:05 UTC, user1234 wrote:
 `std.file.readText()` is just fine... your really want to do 
 an os with call `fgetc` for every single byte that has to be 
 read ?

 Good point! I was really wondering if "fgetc" does a system 
 call every single time that it is called or if the text is 
 buffered just like with "printf". I will use "read" in any case 
 just to be sure tho. I don't want to use Phobos tho so I cannot 
 use "file.readText". Thank you for your time!

I think that nowadays fgetc does not make sense anymore; maybe it 
did in the past, when the amount of available memory was very 
limited... Source files are 100 KB tops. You can load 100 of 
them and the footprint is still small. What will likely consume 
the most is the AST.

Otherwise readText is easy to translate: it's just fopen, then 
fread, then fclose, plus a few error checks. Not a big deal.
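
A minimal C sketch of that translation, assuming nothing beyond the C standard library. The function name `read_whole_file` is made up for illustration, and this is not how Phobos implements `readText`; it only follows the fopen/fread/fclose outline above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the whole file at `path` into a malloc'd
 * buffer. On success returns the buffer (with a 0 byte appended for
 * convenience) and stores the byte count in *out_len; on failure returns
 * NULL. The caller frees the buffer. */
static char *read_whole_file(const char *path, size_t *out_len)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    size_t cap = 4096, len = 0, n;
    char *buf = malloc(cap + 1);
    if (!buf) {
        fclose(f);
        return NULL;
    }

    /* Grow the buffer as needed; no fseek/ftell, so pipes work too. */
    while ((n = fread(buf + len, 1, cap - len, f)) > 0) {
        len += n;
        if (len == cap) {
            char *tmp = realloc(buf, cap * 2 + 1);
            if (!tmp) {
                free(buf);
                fclose(f);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (ferror(f)) {
        free(buf);
        fclose(f);
        return NULL;
    }
    fclose(f);
    buf[len] = '\0';
    *out_len = len;
    return buf;
}
```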

Feb 12 2022

Basile B. <b2.temp gmx.com> writes:

On Saturday, 12 February 2022 at 16:48:26 UTC, user1234 wrote:
 On Saturday, 12 February 2022 at 13:17:19 UTC, rempas wrote:
 On Wednesday, 9 February 2022 at 02:07:05 UTC, user1234 wrote:
 `std.file.readText()` is just fine... your really want to do 
 an os with call `fgetc` for every single byte that has to be 
 read ?

 Good point! I was really wondering if "fgetc" does a system 
 call every single time that it is called or if the text is 
 buffered just like with "printf". I will use "read" in any 
 case just to be sure tho. I don't want to use Phobos tho so I 
 cannot use "file.readText". Thank you for your time!

 I think that nowadays fgetc does not make sense anymore, maybe 
 in the past when the amount of memory available was very 
 reduced... source files are 100 kb top. You can load 100 of 
 them, the fingerprint is still small. What will likely consume 
 the more is the AST.

 Otherwise readText is easy to translate, it's just fopen then 
 fread then fclose, + a few checks for the errors, not a big 
 deal to translate.

The problem with Phobos, if it is used to write a compiler, is 
_dynamic arrays_, because of how they are managed.

With Styx I had used Phobos because I knew the memory management 
was designed to work similarly with arrays, i.e. functions can 
return arrays, but using the "sink" style would not have caused 
any problem (by "sink" style I mean when the buffer is owned by 
the calling frame and passed as a parameter, e.g. like in many 
C-style APIs).

Then the amount of phobos code to translate in order to bootstrap 
[was 
minimal](https://gitlab.com/styx-lang/styx/-/raw/master/src/system.sx):

std.path:

- isAbsolute
- isDir
- isFile
- dirName
- baseName
- exists
- cwd
- dirEntries
- setExtension

std.file:

- read (or readText)
- write (not even used I realize now)

std.process

- pipeProcess (actually just used to optionally --run after 
compile)

std.getopt

- getopt (tho libc functions for that could have been used... dmd 
itself doesn't have any special functions for the arg processing 
in the driver IIRC)

Add to this a few things from libc and unistd and you're good. 
You don't need more.

Feb 12 2022

Ali Çehreli <acehreli yahoo.com> writes:

On 2/12/22 05:17, rempas wrote:

 a system call every single time

I have a related experience: I realized that the very many ftell() calls 
that I was making were very costly. I saved a lot of time after 
realizing that I did not need to make the calls because I could maintain 
a 'long' variable to keep track of where I was in the file.

I assumed ftell() would do the same but apparently not.

Ali

Feb 12 2022

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Feb 12, 2022 at 07:01:09PM -0800, Ali Çehreli via Digitalmars-d wrote:
 On 2/12/22 05:17, rempas wrote:
 
 a system call every single time

 
 I have a related experience: I realized that very many ftell() calls
 that I were making were very costly. I saved a lot of time after
 realizing that I did not need to make the calls because I could
 maintain a 'long' variable to keep track of where I was in the file.
 
 I assumed ftell() would do the same but apparently not.

[...]

I think the reason is the ftell involves an OS API call, because fread()
uses the underlying read() syscall which reads from where it left off
last, and there could be multiple threads reading from the same file
descriptor, so the only way for fseek/ftell to work correctly is via a
syscall into the kernel.  Obviously, this would be expensive, as it
would involve a kernel context-switch as well as acquiring and releasing
a lock on the file descriptor.


T

-- 
Too many people have open minds but closed eyes.

Feb 12 2022

Steven Schveighoffer <schveiguy gmail.com> writes:

On 2/12/22 10:13 PM, H. S. Teoh wrote:
 On Sat, Feb 12, 2022 at 07:01:09PM -0800, Ali Çehreli via Digitalmars-d wrote:
 On 2/12/22 05:17, rempas wrote:

 a system call every single time

 I have a related experience: I realized that very many ftell() calls
 that I were making were very costly. I saved a lot of time after
 realizing that I did not need to make the calls because I could
 maintain a 'long' variable to keep track of where I was in the file.

 I assumed ftell() would do the same but apparently not.

 [...]
 
 I think the reason is the ftell involves an OS API call, because fread()
 uses the underlying read() syscall which reads from where it left off
 last, and there could be multiple threads reading from the same file
 descriptor, so the only way for fseek/ftell to work correctly is via a
 syscall into the kernel.  Obviously, this would be expensive, as it
 would involve a kernel context-switch as well as acquiring and releasing
 a lock on the file descriptor.

`ftell` does not *need* to do a system call to get the current file 
position. But otherwise it has to store the offset of the file somewhere 
(which it does not). In fact, if you move the file pointer underneath 
(by using another thread to read from it, or e.g. with `lseek`), you 
will completely invalidate what `ftell` returns (try it!)

What `ftell` basically does is make a system call to `lseek` to get the 
current file position, then subtract the difference between the current 
buffer offset and the buffer size.

This is not the same for `fgetc`. That only depends on the buffer, and 
not anything from the OS (after the buffer is filled).
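
A minimal C sketch of the workaround mentioned earlier in the thread, assuming a single reader on the stream: keep the position in a local counter while reading with `fgetc`, and call `ftell` only when an absolute position is really needed. The helper name `find_byte_offset` is made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: number of bytes read from `f` before the first
 * occurrence of `target`, counted from the stream position at the time of
 * the call. Returns -1 if `target` is not found before end of file. */
static long find_byte_offset(FILE *f, int target)
{
    long pos = 0; /* tracked locally instead of calling ftell() per byte */
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c == target)
            return pos;
        pos++;
    }
    return -1;
}
```

Whether the saving matters depends on the libc in use, so it is still worth measuring.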

-Steve

Feb 12 2022

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Sunday, 13 February 2022 at 03:13:43 UTC, H. S. Teoh wrote:
 On Sat, Feb 12, 2022 at 07:01:09PM -0800, Ali Çehreli via 
 Digitalmars-d wrote:
 On 2/12/22 05:17, rempas wrote:
 
 a system call every single time

 
 I have a related experience: I realized that very many ftell() 
 calls that I were making were very costly. I saved a lot of 
 time after realizing that I did not need to make the calls 
 because I could maintain a 'long' variable to keep track of 
 where I was in the file.
 
 I assumed ftell() would do the same but apparently not.

 [...]

 I think the reason is the ftell involves an OS API call, 
 because fread() uses the underlying read() syscall which reads 
 from where it left off last, and there could be multiple 
 threads reading from the same file descriptor, so the only way 
 for fseek/ftell to work correctly is via a syscall into the 
 kernel.  Obviously, this would be expensive, as it would 
 involve a kernel context-switch as well as acquiring and 
 releasing a lock on the file descriptor.

fread reads from its internal buffer when it can. By default it 
uses 1 page (4096 bytes on x86 and ARM). After a seek operation 
it will always try to fill the buffer with 4096 bytes (of course 
the read syscall might return less). As long as the reads are 
within the buffer fread() will not invoke a read syscall.

Feb 13 2022

Steven Schveighoffer <schveiguy gmail.com> writes:

On 2/13/22 6:02 AM, Patrick Schluter wrote:
 On Sunday, 13 February 2022 at 03:13:43 UTC, H. S. Teoh wrote:
 On Sat, Feb 12, 2022 at 07:01:09PM -0800, Ali Çehreli via 
 Digitalmars-d wrote:
 On 2/12/22 05:17, rempas wrote:

 a system call every single time

 I have a related experience: I realized that very many ftell() calls 
 that I were making were very costly. I saved a lot of time after 
 realizing that I did not need to make the calls because I could 
 maintain a 'long' variable to keep track of where I was in the file.

 I assumed ftell() would do the same but apparently not.

 [...]

 I think the reason is the ftell involves an OS API call, because 
 fread() uses the underlying read() syscall which reads from where it 
 left off last, and there could be multiple threads reading from the 
 same file descriptor, so the only way for fseek/ftell to work 
 correctly is via a syscall into the kernel.  Obviously, this would be 
 expensive, as it would involve a kernel context-switch as well as 
 acquiring and releasing a lock on the file descriptor.

 fread reads from its internal buffer when it can. By default it uses 1 
 page (4096 bytes on x86 and ARM). After a seek operation it will always 
 try to fill the buffer with 4096 bytes (of course the read syscall might 
 return less). As long as the reads are within the buffer fread() will 
 not invoke a read syscall.

If you seek within the buffer it could potentially leave the buffer 
alone. But it chooses to flush the buffer completely. Not sure why it 
does that. It's not to keep the buffer topped up: it tries to read the 
full buffer again at that point (meaning it discarded all the buffered data).

This could be potentially really slow if you were skipping a few bytes 
at a time using fseek, as it would reload the entire buffer every seek.

-Steve

Feb 13 2022

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Sunday, 13 February 2022 at 03:01:09 UTC, Ali Çehreli wrote:
 On 2/12/22 05:17, rempas wrote:

 a system call every single time

 I have a related experience: I realized that very many ftell() 
 calls that I were making were very costly. I saved a lot of 
 time after realizing that I did not need to make the calls 
 because I could maintain a 'long' variable to keep track of 
 where I was in the file.

 I assumed ftell() would do the same but apparently not.

ftell() and fseek() use a syscall, and they also force the next 
stdio read call (fgets, fgetc, fread, fscanf, etc.) to 
refill its internal buffer from scratch. If you run 
itrace on an app that calls fseek (ftell is often implemented as 
a relative seek by 0), you will see the buffer being read again after each seek.

That's why one should avoid using seek when working with buffered 
stdio.
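
One way to follow this advice is sketched below, using only standard C. The type `tracked_file` and the helpers `tracked_getc`, `tracked_skip` and `tracked_demo` are hypothetical names invented for this example. The idea is to keep your own position counter (as Ali described) and to skip short distances by consuming bytes instead of calling fseek, so the stdio buffer is never thrown away.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical wrapper: a stream plus a position we maintain ourselves. */
typedef struct {
    FILE *fp;
    long pos;
} tracked_file;

/* Reads one character and advances our own counter; no ftell() needed. */
int tracked_getc(tracked_file *t)
{
    int c = fgetc(t->fp);
    if (c != EOF)
        t->pos++;
    return c;
}

/* Skips up to n bytes by consuming them from the buffer instead of seeking.
   Returns the number of bytes actually skipped. */
long tracked_skip(tracked_file *t, long n)
{
    long skipped = 0;
    while (skipped < n && tracked_getc(t) != EOF)
        skipped++;
    return skipped;
}

/* Small self-check: returns 1 if our counter agrees with ftell(). */
int tracked_demo(void)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return 0;
    for (int i = 0; i < 1000; i++)
        fputc('a' + i % 26, f);
    rewind(f);

    tracked_file t = { f, 0 };
    tracked_getc(&t);      /* consume byte 0 */
    tracked_skip(&t, 99);  /* consume bytes 1..99 */

    int ok = t.pos == 100
          && ftell(f) == 100
          && tracked_getc(&t) == 'a' + 100 % 26;
    fclose(f);
    return ok;
}
```

This only pays off for short forward skips. For a long jump a real seek is still the better choice, since reading and discarding megabytes costs more than one buffer refill.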

Feb 13 2022

Steven Schveighoffer <schveiguy gmail.com> writes:

On 2/8/22 9:07 PM, user1234 wrote:
 On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 This should have probably been posted in the "Learn" section but I 
 thought that it is an advanced topic so maybe people other than me may 
 learn something too. So here we go!

 I'm planning to make a change to my program to use "mmap" to the 
 contents of a file rather than "fgetc". This is because I learned that 
 "mmap" can do it faster. The thing is, are there any problems that can 
 occur when using "mmap"? I need to know now because changing this 
 means changing the design of the program and this is not something 
 pleasant to do so I want to be sure that I won't have to change back 
 in the future (where the project will be even bigger).

 
 `std.file.readText()` is just fine... your really want to do an os with 
 call `fgetc` for every single byte that has to be read ?

Just a clarification here -- `fgetc` does NOT do an OS system call for 
every character. It's a C library function, which uses a `FILE *`. And 
this is not a new development -- my ANSI C book from 1988 talks about 
how `FILE` has a buffer.

While it does not do a system call (unless the buffer is empty and it 
needs to fill the buffer), it's still an opaque call, which might cost a 
decent amount if you are reading by character.
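
A rough way to see the buffering at work is sketched here, assuming a POSIX system. The function name `count_refills` is made up for this example. It reads a scratch file one character at a time with `fgetc` and counts how often the kernel offset of the underlying descriptor moves, which only happens when stdio goes back to the OS for another buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes `total` bytes to a scratch file, reads them
   back with fgetc(), and counts how many times the descriptor's kernel
   offset changed. Stores the number of characters read in *chars_read.
   Returns the refill count, or -1 on failure. */
long count_refills(long total, long *chars_read)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    for (long i = 0; i < total; i++)
        fputc('x', f);
    rewind(f);

    int fd = fileno(f);
    off_t last = lseek(fd, 0, SEEK_CUR);
    long refills = 0;
    *chars_read = 0;

    while (fgetc(f) != EOF) {
        (*chars_read)++;
        /* The probe itself is a syscall per character; it is here only to
           observe the offset and would not exist in real code. */
        off_t now = lseek(fd, 0, SEEK_CUR);
        if (now != last) {
            refills++;
            last = now;
        }
    }

    fclose(f);
    return refills;
}
```

With a 64 KiB file the refill count should be a small fraction of the character count. The exact number depends on the buffer size the C library picks.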

-Steve

Feb 12 2022

Florian Weimer <fw deneb.enyo.de> writes:

One issue that hasn't been mentioned so far is that if the input 
file is truncated, accessing its memory-mapped view results in 
`SIGBUS` on Linux and other systems. (I think Windows prevents 
truncation instead.)

In theory, it is possible to intercept that signal and turn it 
into something else (Java does that), but I don't think the [D 
implementation](https://github.com/dlang/phobos/blob/master/std/mmfile.d) does
that.
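
For anyone who wants to see the truncation problem first-hand, here is a minimal sketch, assuming Linux (or another POSIX system that behaves the same way). The names `read_after_truncate` and `on_sigbus` and the `/tmp` path template are made up for this example. It maps one page of a scratch file, truncates the file to zero length, and then touches the mapping. A handler converts the resulting `SIGBUS` into a return value via `siglongjmp`, which is one possible way to intercept the signal and not a description of what Java or Phobos do.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf bus_jump;

/* Jump back to the sigsetjmp() point instead of letting the process die. */
static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(bus_jump, 1);
}

/* Returns 1 if touching the mapping after truncation raised SIGBUS,
   0 if the read went through, -1 if the setup failed. */
int read_after_truncate(void)
{
    char path[] = "/tmp/mmap_demo_XXXXXX"; /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file disappears once fd is closed */

    long page = sysconf(_SC_PAGESIZE);
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    void *map = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    volatile const char *p = map;

    struct sigaction sa, old;
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGBUS, &sa, &old);

    /* Simulate another process shrinking the file under us. */
    int result = -1;
    if (ftruncate(fd, 0) == 0) {
        if (sigsetjmp(bus_jump, 1) == 0) {
            char c = p[0]; /* page is now past end of file */
            (void)c;
            result = 0;
        } else {
            result = 1;    /* arrived here from the signal handler */
        }
    }

    sigaction(SIGBUS, &old, NULL);
    munmap(map, (size_t)page);
    close(fd);
    return result;
}
```

In a real program the handler would need to know which mapping faulted, and jumping out of a signal handler is fragile, so treat this as a demonstration of the failure mode and not as a recommended recovery strategy.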

Feb 13 2022

rempas <rempas tutanota.com> writes:

On Sunday, 13 February 2022 at 12:55:43 UTC, Florian Weimer wrote:
 One issue that hasn't been mentioned so far is that if the 
 input file is truncated, accessing its memory-mapped view 
 results in `SIGBUS` on Linux and other systems. (I think 
 Windows prevents truncation instead.)

 In theory, it is possible to intercept that signal and turn it 
 into something else (Java does that), but I don't think the [D 
 implementation](https://github.com/dlang/phobos/blob/master/std/mmfile.d) does
that.

Thank you for the info! That's very important and I'll keep it in 
mind!

Feb 13 2022
