
digitalmars.D - Is there any reasons to not use "mmap" to read files?

reply rempas <rempas tutanota.com> writes:
This should have probably been posted in the "Learn" section but 
I thought that it is an advanced topic so maybe people other than 
me may learn something too. So here we go!

I'm planning to make a change to my program to use "mmap" to the 
contents of a file rather than "fgetc". This is because I learned 
that "mmap" can do it faster. The thing is, are there any 
problems that can occur when using "mmap"? I need to know now 
because changing this means changing the design of the program 
and this is not something pleasant to do so I want to be sure 
that I won't have to change back in the future (where the project 
will be even bigger).
Feb 06 2022
next sibling parent reply Elronnd <elronnd elronnd.net> writes:
On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 I'm planning to make a change to my program to use "mmap" to 
 the contents of a file rather than "fgetc".  This is because I 
 learned that "mmap" can do it faster.  The thing is, are there 
 any problems that can occur when using "mmap"?
Performance is weird, and depends a lot on your access patterns and constraints. Mmap is not universally fast and, I would argue, really only makes sense in a few constrained circumstances. I would not switch to mmap just because you heard it was faster; only consider switching if you know i/o is a bottleneck for your application and know mmap is the solution.

A recent, good read: https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf
Feb 06 2022
parent reply rempas <rempas tutanota.com> writes:
On Sunday, 6 February 2022 at 10:08:24 UTC, Elronnd wrote:
 Performance is weird, and depends a lot on your access patterns 
 and constraints.  Mmap is not universally fast and, I would 
 argue, really only makes sense in a few constrained 
 circumstances.  I would not switch to mmap just because you 
 heard it was faster; only consider switching if you know i/o is 
 a bottleneck for your application and know mmap is the solution.

 https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf  
 recent, good read.
Thank you! I am actually writing a compiler, so it will just open and read the requested files. I don't know whether the database case in the paper you linked is similar to mine (I will of course read it, though), so I guess I have to do my own research just to be sure.
Feb 06 2022
parent reply Temtaime <temtaime gmail.com> writes:
On Sunday, 6 February 2022 at 10:48:01 UTC, rempas wrote:
 On Sunday, 6 February 2022 at 10:08:24 UTC, Elronnd wrote:
 Performance is weird, and depends a lot on your access 
 patterns and constraints.  Mmap is not universally fast and, I 
 would argue, really only makes sense in a few constrained 
 circumstances.  I would not switch to mmap just because you 
 heard it was faster; only consider switching if you know i/o 
 is a bottleneck for your application and know mmap is the 
 solution.

 https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf  
 recent, good read.
Thank you! I will actually make a compiler so it will just open and read the requested files. I don't know if the database example you linked will be similar to my case (I will of course read it tho) so I have to make my research I guess just to be sure.
Personally, I almost always use mmap for opening large files for r/w. It IS faster. The exceptions are small files that can be read into memory using std.file.read, for example.
Feb 06 2022
parent reply rempas <rempas tutanota.com> writes:
On Sunday, 6 February 2022 at 10:53:49 UTC, Temtaime wrote:
 Perso i'm almost always use mmap for opening large files for 
 r/w. It IS faster.
 Exception are small ones that can be read into the memory using 
 std.file.read for example.
Thank you! How big are the files we are talking about? Also, as someone told me in another (C) forum, "mmap" is for Unix systems, so do you know if Windows or macOS can emulate that behavior with their memory allocation system calls?
Feb 06 2022
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 2/6/22 04:21, rempas wrote:
 On Sunday, 6 February 2022 at 10:53:49 UTC, Temtaime wrote:
 Perso i'm almost always use mmap for opening large files for r/w. It
 IS faster.
Ditto.
 how big files are we talking about?
So big that they can't fit in memory. For example, I benefit from mmap on a 16G system where a file would be 30G.

As others said, it depends on the use case. If the entire file will be read anyway, especially in sequential order, then mmap may not have much benefit. In my use case, though, it is common to read unknown small amounts of bytes from unknown places of the huge file. (Say, 5G total out of 30G.)

Instead of my making multiple reads to those interesting parts of the file, mmap handles everything transparently: Just mmap the whole thing as a single array and access parts of that memory as needed.

One huge improvement is to add the madvise(2) system call to the picture to tell the system the exact amount of memory that will be touched, so the OS reads it in a single shot. Otherwise, the system reads by a default amount, which I think is 4K, which can turn out to be pathetically slow, e.g. when the file is accessed over a slow network. (Why read 4K when the need is just 200 bytes, and why read in 4K steps when the need is already known to be 1M?)
 Also like another guy
 told me in another (C) forum, "mmap" is for Unix systems so do you know
 if Windows or MacOS can emulate that behavior with their memory
 allocation system calls?
I haven't used mmap on Windows but it's in Phobos, so it should work. After all, mmap uses the virtual memory system of the OS, non-ancient Windows versions do use virtual memory, and std.mmfile does include 'version (Windows)' sections; so, yes. :)

Ali
Feb 06 2022
parent reply rempas <rempas tutanota.com> writes:
On Sunday, 6 February 2022 at 16:45:59 UTC, Ali Çehreli wrote:
 So big that they can't fit in memory. For example, I benefit 
 from mmap on a 16G system where a file would be 30G.
Oh, this small...
 As others said, it depends on the use case. If the entire file 
 will be read anyway especially in sequential order, then mmap 
 may not have much benefit. In my use case though it is common 
 to just read unknown small amounts of bytes from unknown places 
 of the huge file. (Say, 5G total out of a 30G.)

 Instead of my making multiple reads to those interesting parts 
 of the file, mmap handles everything transparently: Just mmap 
 the whole thing as a single array and access parts of that 
 memory as needed.
Thank you! I will have that in mind in case I want to do something like that in the future. In my use-case tho, I will read the whole file.
 One huge improvement is to add madvise(2) system call to the 
 picture to tell the system the exact amount of memory that will 
 be touched so the OS reads in a single shot. Otherwise, the 
 system reads by a default amount, which I think is 4K, which 
 can turn out to be pathetically slow e.g. when the file is 
 accessed over a slow network. (Why read 4K when the need is 
 just 200 bytes and why read in 4K steps when the need is 
 already to be 1M?)

 I haven't used mmap on Windows but it's in Phobos, so it should 
 work. After all, mmap uses the virtual memory system of the OS 
 and non-ancient Windows versions do use virtual memory and 
 std.mmfile does include 'version (windows)' sections; so, yes. 
 :)

 Ali
"mmap" is a system call that doesn't exist (natively) on Windows. I don't know what D does with Phobos (which I'm not gonna use anyway) but even if it works (how?), I will end up creating my own library so I'm in the same spot. "madvise" seems cool, I'll check it out! Thanks! In the end, I like advising and telling others how to do their work, XD!
Feb 06 2022
next sibling parent reply Temtaime <temtaime gmail.com> writes:
On Sunday, 6 February 2022 at 18:14:51 UTC, rempas wrote:
 On Sunday, 6 February 2022 at 16:45:59 UTC, Ali Çehreli wrote:
 [...]
Oh, this small...
 [...]
Thank you! I will have that in mind in case I want to do something like that in the future. In my use-case tho, I will read the whole file.
 [...]
"mmap" is a system call that doesn't exist (natively) on Windows. I don't know what D does with Phobos (which I'm not gonna use anyway) but even if it works (how?), I will end up creating my own library so I'm in the same spot. "madvise" seems cool, I'll check it out! Thanks! In the end, I like advising and telling others how to do their work, XD!
Windows has its own API to mmap files. There's no need to reinvent the wheel, phobos MmFile works for me without any problems. Maybe there's no flush function, but for my use cases it's not so critical.
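For readers who want to see what such a wrapper boils down to without Phobos, here is a rough C sketch. It assumes the Win32 file-mapping calls (`CreateFileMappingA`, `MapViewOfFile`) on Windows and `mmap` elsewhere. `map_file_readonly` and `unmap_file_view` are hypothetical names, not the Phobos API, and error reporting is reduced to returning `NULL`.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

/* Map a whole file read-only. Returns NULL on failure (including empty files). */
const void *map_file_readonly(const char *path, size_t *len)
{
#ifdef _WIN32
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return NULL;

    LARGE_INTEGER size;
    HANDLE mapping = NULL;
    const void *view = NULL;
    if (GetFileSizeEx(file, &size) && size.QuadPart > 0 &&
        (unsigned long long)size.QuadPart <= SIZE_MAX) {
        mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
        if (mapping != NULL)
            view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    }
    if (view != NULL)
        *len = (size_t)size.QuadPart;
    if (mapping != NULL)
        CloseHandle(mapping);   /* an existing view keeps the mapping alive */
    CloseHandle(file);
    return view;
#else
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    void *view = MAP_FAILED;
    if (fstat(fd, &st) == 0 && st.st_size > 0)
        view = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (view == MAP_FAILED)
        return NULL;
    *len = (size_t)st.st_size;
    return view;
#endif
}

void unmap_file_view(const void *view, size_t len)
{
#ifdef _WIN32
    (void)len;
    UnmapViewOfFile(view);
#else
    munmap((void *)view, len);
#endif
}
```

Only the POSIX branch is exercised on a Unix machine; the Windows branch is written from the documented Win32 signatures and should be treated as untested.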
Feb 06 2022
parent reply rempas <rempas tutanota.com> writes:
On Sunday, 6 February 2022 at 18:33:59 UTC, Temtaime wrote:
 Windows has its own API to mmap files. There's no need to 
 reinvent the wheel, phobos MmFile works for me without any 
 problems.
 Maybe there's no flush function, but for my use cases it's not 
 so critical.
Yeah, I'm really glad that it works for you, but I will have to create a library for my compiler, so I need to properly learn how things work so that I'll know what I'm doing when the time comes. So yeah...
Feb 06 2022
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, Feb 06, 2022 at 06:47:45PM +0000, rempas via Digitalmars-d wrote:
 On Sunday, 6 February 2022 at 18:33:59 UTC, Temtaime wrote:
 Windows has its own API to mmap files. There's no need to reinvent
 the wheel, phobos MmFile works for me without any problems.  Maybe
 there's no flush function, but for my use cases it's not so
 critical.
Yeah, I'm really glad that it works for you but I will have to create a library for my compiler so there is a need to properly learn how things work so I'll know what I'm doing when the times comes. So yeah...
Just read the Phobos source code for std.mmfile. Phobos code is very readable compared to most typical standard libraries.

T

--
Turning your clock 15 minutes ahead won't cure lateness---you're just making time go faster!
Feb 06 2022
next sibling parent reply IGotD- <nise nise.com> writes:
On Sunday, 6 February 2022 at 20:12:39 UTC, H. S. Teoh wrote:
 Just read the Phobos source code for std.mmfile. Phobos code is 
 very readable compared to most typical standard libraries.
Phobos/druntime is like an #ifdef hell but with version instead.
Feb 06 2022
parent reply rempas <rempas tutanota.com> writes:
On Sunday, 6 February 2022 at 20:48:09 UTC, IGotD- wrote:
 Phobos/druntime is like an #ifdef hell but with version instead.
Actually, I personally tried to read some files (like stdio.d and conv.d) and while I didn't find them super user friendly, they are WAY clearer and easier to read than GLIB! I don't know if every libc's header files are like that in every OS, and I'm also a super super n00b when it comes to reading other people's source code, so maybe H. S. Teoh is just better than us at reading code, idk...
Feb 06 2022
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Feb 07, 2022 at 07:16:55AM +0000, rempas via Digitalmars-d wrote:
 On Sunday, 6 February 2022 at 20:48:09 UTC, IGotD- wrote:
 Phobos/druntime is like an #ifdef hell but with version instead.
Actually, I personally tried to read some files (like stdio.d and conv.d) and while I didn't found them super user friendly, they are WAY more clear and easy to read then GLIB!
I tried reading GLIB source code once. I will never ever do it again. :-P
 I don't know if every libc's header files are like that in every OS
[...]

If it's in C? Yeah, they all look like that.

T

--
Shin: (n.) A device for finding furniture in the dark.
Feb 07 2022
parent rempas <rempas tutanota.com> writes:
On Monday, 7 February 2022 at 18:31:42 UTC, H. S. Teoh wrote:
 I tried reading GLIB source code once. I will never ever do it 
 again. :-P
C!!!! You gotta love it, lol!
 If it's in C? Yeah, they all look like that.


 T
Don't be so sure about that! Everything "GNU" seems to be bloated but try to read some *BSD libc source code. It's both a little bit more readable and more organized, minimal and simple to understand.
Feb 12 2022
prev sibling parent rempas <rempas tutanota.com> writes:
On Sunday, 6 February 2022 at 20:12:39 UTC, H. S. Teoh wrote:
 Just read the Phobos source code for std.mmfile. Phobos code is 
 very readable compared to most typical standard libraries.


 T
I suppose this relates to what I said about needing to properly learn how things work, right? Well, I mean how Windows does it with its system call. I don't think that it is necessary to read the Phobos source code for that. But regardless, thanks for the suggestion!
Feb 06 2022
prev sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 2/6/22 10:14, rempas wrote:

 "mmap" is a system call that doesn't exist (natively) on Windows.
Yes. I misspoke: What I meant was std.mmfile handles the differences automatically between systems.

Ali
Feb 06 2022
parent rempas <rempas tutanota.com> writes:
On Sunday, 6 February 2022 at 18:53:13 UTC, Ali Çehreli wrote:
 Yes. I misspoke: What I meant was std.mmfile handles the 
 differences automatically between systems.

 Ali
Cool! That's what I would expect from a library tbh, so I'm glad it works like that!
Feb 06 2022
prev sibling next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 This should have probably been posted in the "Learn" section 
 but I thought that it is an advanced topic so maybe people 
 other than me may learn something too. So here we go!

 I'm planning to make a change to my program to use "mmap" to 
 the contents of a file rather than "fgetc". This is because I 
 learned that "mmap" can do it faster. The thing is, are there 
 any problems that can occur when using "mmap"? I need to know 
 now because changing this means changing the design of the 
 program and this is not something pleasant to do so I want to 
 be sure that I won't have to change back in the future (where 
 the project will be even bigger).
mmap has quite some overhead to set up the page table for a file. This means that for small files, open/read/write calls (and stdio, which builds on them) are faster.

The other issue with mmap is that if you use string functions on the mapped part, you have to make sure that there are 0 bytes in the file, or else you risk these functions overshooting into unmapped pages and crashing the application.
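One way around the terminator problem is to never use the NUL-terminated string functions on a mapping and to pass the mapped length explicitly instead. A small C sketch using `memchr`, which takes a byte count; `count_lines` and `find_byte` are illustrative names only:

```c
#include <stddef.h>
#include <string.h>

/* Count '\n' bytes in buf[0 .. len) without ever reading past len. */
size_t count_lines(const char *buf, size_t len)
{
    size_t lines = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        if (nl == NULL)
            break;              /* no further newline inside the bounds */
        lines++;
        p = nl + 1;
    }
    return lines;
}

/* Length-bounded stand-in for strchr: returns NULL if c is not in buf[0 .. len). */
const char *find_byte(const char *buf, size_t len, char c)
{
    return memchr(buf, (unsigned char)c, len);
}
```

The same functions work unchanged on a buffer filled by `read` or `fread`, so the scanning code does not need to know where the bytes came from.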
Feb 06 2022
parent rempas <rempas tutanota.com> writes:
On Sunday, 6 February 2022 at 12:52:45 UTC, Patrick Schluter 
wrote:
 mmap has quite the overhead to set up the page table for a 
 file. This means for small files, open/read/write calls (and 
 stdio which build on it) are faster.

 The other issue with mmap is if you use string functions on the 
 mapped part, you have to make sure that there are 0 bytes in 
 the file or else you risk these functions to overshoot to 
 unmapped pages and crashing the application.
Thank you! After all I've heard, I will probably stick with "read". The files I'm going to read are going to be some kilobytes (megabytes at worst), so I should probably be fine.
Feb 06 2022
prev sibling next sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 This should have probably been posted in the "Learn" section 
 but I thought that it is an advanced topic so maybe people 
 other than me may learn something too. So here we go!

 I'm planning to make a change to my program to use "mmap" to 
 the contents of a file rather than "fgetc". This is because I 
 learned that "mmap" can do it faster. The thing is, are there 
 any problems that can occur when using "mmap"? I need to know 
 now because changing this means changing the design of the 
 program and this is not something pleasant to do so I want to 
 be sure that I won't have to change back in the future (where 
 the project will be even bigger).
Will mmap be faster than fgetc? Almost certainly.

Will it be faster than other i/o systems? Possibly not.

For my i/o system [iopipe](https://github.com/schveiguy/iopipe), every array is also an iopipe, so switching between mmap and file i/o is trivial. See [my talk in 2017](https://dconf.org/2017/talks/schveighoffer.html) where I switched to mmap while on stage to show the difference.

IMO, the best way to determine which is better is to try it and measure. Having an i/o system that allows easy switching is helpful.

For sure, depending on your other tasks in your program, improving the file i/o might be insignificant.

-Steve
Feb 07 2022
next sibling parent norm <norm.rowtree gmail.com> writes:
On Tuesday, 8 February 2022 at 03:33:11 UTC, Steven Schveighoffer 
wrote:
 On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 [...]
Will mmap be faster than fgetc? Almost certainly. Will it be faster than other i/o systems? Possibly not. for my i/o system [iopipe](https://github.com/schveiguy/iopipe), every array is also an iopipe, so switching between mmap and file i/o is trivial. See [my talk in 2017](https://dconf.org/2017/talks/schveighoffer.html) where I switched to mmap while on stage to show the difference. IMO, the best way to determine which is better is to try it and measure. Having an i/o system that allows easy switching is helpful. For sure, depending on your other tasks in your program, improving the file i/o might be insignificant. -Steve
+1 iopipe !
Feb 07 2022
prev sibling parent rempas <rempas tutanota.com> writes:
On Tuesday, 8 February 2022 at 03:33:11 UTC, Steven Schveighoffer 
wrote:
 Will mmap be faster than fgetc? Almost certainly.

 Will it be faster than other i/o systems? Possibly not.

 for my i/o system 
 [iopipe](https://github.com/schveiguy/iopipe), every array is 
 also an iopipe, so switching between mmap and file i/o is 
 trivial. See [my talk in 
 2017](https://dconf.org/2017/talks/schveighoffer.html) where I 
 switched to mmap while on stage to show the difference.

 IMO, the best way to determine which is better is to try it and 
 measure. Having an i/o system that allows easy switching is 
 helpful.

 For sure, depending on your other tasks in your program, 
 improving the file i/o might be insignificant.

 -Steve
Thanks for your time, Steve! I will do proper testing like you suggested and see! It will take some time, but I think it's worth it rather than randomly choosing one of them :)
Feb 12 2022
prev sibling next sibling parent reply sarn <sarn theartofmachinery.com> writes:
On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 This should have probably been posted in the "Learn" section 
 but I thought that it is an advanced topic so maybe people 
 other than me may learn something too. So here we go!

 I'm planning to make a change to my program to use "mmap" to 
 the contents of a file rather than "fgetc". This is because I 
 learned that "mmap" can do it faster. The thing is, are there 
 any problems that can occur when using "mmap"? I need to know 
 now because changing this means changing the design of the 
 program and this is not something pleasant to do so I want to 
 be sure that I won't have to change back in the future (where 
 the project will be even bigger).
One reason to use read/write based I/O by default is that it's more versatile. It's kind of like an input range versus a random access range in Phobos.

```
// Could not map file /dev/stdin (Invalid argument)
auto f = new MmFile("/dev/stdin");
```
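The same limitation shows up one level down: a pipe or terminal generally cannot be mapped, while `read` works on any readable descriptor. A hedged C sketch, assuming POSIX, that maps regular non-empty files and falls back to a `read` loop for everything else. `input_buf`, `load_fd` and `release_input` are hypothetical names.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct {
    const char *data;
    size_t len;
    int is_mapped;      /* 1: release with munmap, 0: release with free */
} input_buf;

/* Load everything readable from fd. Returns 0 on success, -1 on failure. */
int load_fd(int fd, input_buf *out)
{
    struct stat st;
    if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0) {
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            out->data = p;
            out->len = (size_t)st.st_size;
            out->is_mapped = 1;
            return 0;
        }
    }

    /* Fallback for pipes, terminals, empty files, or a failed mmap. */
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return -1;
    for (;;) {
        if (len == cap) {               /* buffer full: double it */
            char *grown = realloc(buf, cap * 2);
            if (grown == NULL) {
                free(buf);
                return -1;
            }
            buf = grown;
            cap *= 2;
        }
        ssize_t n = read(fd, buf + len, cap - len);
        if (n < 0 && errno == EINTR)
            continue;                   /* interrupted by a signal: retry */
        if (n < 0) {
            free(buf);
            return -1;
        }
        if (n == 0)
            break;                      /* end of input */
        len += (size_t)n;
    }
    out->data = buf;
    out->len = len;
    out->is_mapped = 0;
    return 0;
}

void release_input(input_buf *in)
{
    if (in->is_mapped)
        munmap((void *)in->data, in->len);
    else
        free((void *)in->data);
    in->data = NULL;
    in->len = 0;
}
```

With this shape, `load_fd(STDIN_FILENO, &in)` works whether standard input is a redirected file or a pipe.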
Feb 08 2022
parent rempas <rempas tutanota.com> writes:
On Tuesday, 8 February 2022 at 21:37:29 UTC, sarn wrote:
 One reason to use read/write based I/O by default is that it's 
 more versatile.  It's kind of like an input range versus a 
 random access in Phobos.

 ```
 // Could not map file /dev/stdin (Invalid argument)
 auto f = new MmFile("/dev/stdin");
 ```
Yeah, thank you! I will open and read the whole file anyway, so it seems that it makes sense to try this method and then measure my program in the future to see! Have a nice day!
Feb 12 2022
prev sibling next sibling parent reply user1234 <user1234 12.de> writes:
On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 This should have probably been posted in the "Learn" section 
 but I thought that it is an advanced topic so maybe people 
 other than me may learn something too. So here we go!

 I'm planning to make a change to my program to use "mmap" to 
 the contents of a file rather than "fgetc". This is because I 
 learned that "mmap" can do it faster. The thing is, are there 
 any problems that can occur when using "mmap"? I need to know 
 now because changing this means changing the design of the 
 program and this is not something pleasant to do so I want to 
 be sure that I won't have to change back in the future (where 
 the project will be even bigger).
`std.file.readText()` is just fine... do you really want to make an OS call with `fgetc` for every single byte that has to be read?
Feb 08 2022
next sibling parent reply rempas <rempas tutanota.com> writes:
On Wednesday, 9 February 2022 at 02:07:05 UTC, user1234 wrote:
 `std.file.readText()` is just fine... your really want to do an 
 os with call `fgetc` for every single byte that has to be read ?
Good point! I was really wondering if "fgetc" does a system call every single time that it is called or if the text is buffered just like with "printf". I will use "read" in any case just to be sure tho. I don't want to use Phobos tho so I cannot use "file.readText". Thank you for your time!
Feb 12 2022
next sibling parent reply user1234 <user1234 12.de> writes:
On Saturday, 12 February 2022 at 13:17:19 UTC, rempas wrote:
 On Wednesday, 9 February 2022 at 02:07:05 UTC, user1234 wrote:
 `std.file.readText()` is just fine... your really want to do 
 an os with call `fgetc` for every single byte that has to be 
 read ?
Good point! I was really wondering if "fgetc" does a system call every single time that it is called or if the text is buffered just like with "printf". I will use "read" in any case just to be sure tho. I don't want to use Phobos tho so I cannot use "file.readText". Thank you for your time!
I think that nowadays fgetc does not make sense anymore; maybe in the past, when the amount of memory available was very limited... Source files are 100 kB at most. You can load 100 of them and the footprint is still small. What will likely consume the most is the AST.

Otherwise readText is easy to translate: it's just fopen, then fread, then fclose, plus a few checks for errors. Not a big deal to translate.
Feb 12 2022
parent Basile B. <b2.temp gmx.com> writes:
On Saturday, 12 February 2022 at 16:48:26 UTC, user1234 wrote:
 On Saturday, 12 February 2022 at 13:17:19 UTC, rempas wrote:
 On Wednesday, 9 February 2022 at 02:07:05 UTC, user1234 wrote:
 `std.file.readText()` is just fine... your really want to do 
 an os with call `fgetc` for every single byte that has to be 
 read ?
Good point! I was really wondering if "fgetc" does a system call every single time that it is called or if the text is buffered just like with "printf". I will use "read" in any case just to be sure tho. I don't want to use Phobos tho so I cannot use "file.readText". Thank you for your time!
I think that nowadays fgetc does not make sense anymore, maybe in the past when the amount of memory available was very reduced... source files are 100 kb top. You can load 100 of them, the fingerprint is still small. What will likely consume the more is the AST. Otherwise readText is easy to translate, it's just fopen then fread then fclose, + a few checks for the errors, not a big deal to translate.
The problem with Phobos, if it is used to program a compiler, is _dynamic arrays_, because of how they are managed. With Styx I had used Phobos because I knew the memory management was designed to work similarly with arrays, i.e. functions can return arrays, but using the "sink" style would not have caused any problem (by "sink" style I mean when the buffer is owned by the calling frame and passed as a parameter, e.g. like in many C-style APIs).

Then the amount of Phobos code to translate in order to bootstrap [was minimal](https://gitlab.com/styx-lang/styx/-/raw/master/src/system.sx):

std.paths:
- isAbsolute
- isDir
- isFile
- dirName
- baseName
- exists
- cwd
- dirEntries
- setExtension

std.files:
- read (or readText)
- write (not even used I realize now)

std.process
- pipeProcess (actually just used to optionally --run after compile)

std.getopt
- getopt (tho libc functions for that could have been used... dmd itself doesn't have any special functions for the arg processing in the driver IIRC)

Add to this a few things from libc and unistd and you're good. You don't need more.
Feb 12 2022
prev sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 2/12/22 05:17, rempas wrote:

 a system call every single time
I have a related experience: I realized that the very many ftell() calls that I was making were very costly. I saved a lot of time after realizing that I did not need to make the calls because I could maintain a 'long' variable to keep track of where I was in the file. I assumed ftell() would do the same, but apparently not.

Ali
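The bookkeeping described here can be as small as a wrapper that bumps a counter on every successful read. A C sketch with invented names (`counted_file`, `counted_getc`, `counted_read`), assuming the stream starts at offset 0, is opened in binary mode, and is only ever read through the wrapper:

```c
#include <stdio.h>

typedef struct {
    FILE *fp;
    long pos;           /* bytes consumed so far through this wrapper */
} counted_file;

/* Like fgetc, but keeps pos up to date. */
int counted_getc(counted_file *cf)
{
    int c = fgetc(cf->fp);
    if (c != EOF)
        cf->pos++;
    return c;
}

/* Like fread with an element size of 1, but keeps pos up to date. */
size_t counted_read(counted_file *cf, void *dst, size_t n)
{
    size_t got = fread(dst, 1, n, cf->fp);
    cf->pos += (long)got;
    return got;
}
```

Reading `cf.pos` is then a plain memory load, with no library call and no system call involved.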
Feb 12 2022
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Feb 12, 2022 at 07:01:09PM -0800, Ali Çehreli via Digitalmars-d wrote:
 On 2/12/22 05:17, rempas wrote:
 
 a system call every single time
I have a related experience: I realized that very many ftell() calls that I were making were very costly. I saved a lot of time after realizing that I did not need to make the calls because I could maintain a 'long' variable to keep track of where I was in the file. I assumed ftell() would do the same but apparently not.
[...]

I think the reason is that ftell involves an OS API call, because fread() uses the underlying read() syscall, which reads from where it left off last, and there could be multiple threads reading from the same file descriptor, so the only way for fseek/ftell to work correctly is via a syscall into the kernel. Obviously, this would be expensive, as it would involve a kernel context-switch as well as acquiring and releasing a lock on the file descriptor.

T

--
Too many people have open minds but closed eyes.
Feb 12 2022
next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 2/12/22 10:13 PM, H. S. Teoh wrote:
 On Sat, Feb 12, 2022 at 07:01:09PM -0800, Ali Çehreli via Digitalmars-d wrote:
 On 2/12/22 05:17, rempas wrote:

 a system call every single time
I have a related experience: I realized that very many ftell() calls that I were making were very costly. I saved a lot of time after realizing that I did not need to make the calls because I could maintain a 'long' variable to keep track of where I was in the file. I assumed ftell() would do the same but apparently not.
[...] I think the reason is the ftell involves an OS API call, because fread() uses the underlying read() syscall which reads from where it left off last, and there could be multiple threads reading from the same file descriptor, so the only way for fseek/ftell to work correctly is via a syscall into the kernel. Obviously, this would be expensive, as it would involve a kernel context-switch as well as acquiring and releasing a lock on the file descriptor.
`ftell` does not *need* to do a system call to get the current file position. But otherwise it has to store the offset of the file somewhere (which it does not). In fact, if you move the file pointer underneath (by using another thread to read from it, or e.g. with `lseek`), you will completely invalidate what `ftell` returns (try it!)

What `ftell` basically does is make a system call to `lseek` to get the current file position, and then subtract the difference between the current buffer offset and the buffer size.

This is not the same for `fgetc`. That only depends on the buffer, and not on anything from the OS (after the buffer is filled).

-Steve
Feb 12 2022
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 13 February 2022 at 03:13:43 UTC, H. S. Teoh wrote:
 On Sat, Feb 12, 2022 at 07:01:09PM -0800, Ali Çehreli via 
 Digitalmars-d wrote:
 On 2/12/22 05:17, rempas wrote:
 
 a system call every single time
I have a related experience: I realized that very many ftell() calls that I were making were very costly. I saved a lot of time after realizing that I did not need to make the calls because I could maintain a 'long' variable to keep track of where I was in the file. I assumed ftell() would do the same but apparently not.
[...] I think the reason is the ftell involves an OS API call, because fread() uses the underlying read() syscall which reads from where it left off last, and there could be multiple threads reading from the same file descriptor, so the only way for fseek/ftell to work correctly is via a syscall into the kernel. Obviously, this would be expensive, as it would involve a kernel context-switch as well as acquiring and releasing a lock on the file descriptor.
fread reads from its internal buffer when it can. By default it uses 1 page (4096 bytes on x86 and ARM). After a seek operation it will always try to fill the buffer with 4096 bytes (of course the read syscall might return less). As long as the reads are within the buffer fread() will not invoke a read syscall.
Feb 13 2022
parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 2/13/22 6:02 AM, Patrick Schluter wrote:
 On Sunday, 13 February 2022 at 03:13:43 UTC, H. S. Teoh wrote:
 On Sat, Feb 12, 2022 at 07:01:09PM -0800, Ali Çehreli via 
 Digitalmars-d wrote:
 On 2/12/22 05:17, rempas wrote:

 a system call every single time
I have a related experience: I realized that very many ftell() calls that I were making were very costly. I saved a lot of time after realizing that I did not need to make the calls because I could maintain a 'long' variable to keep track of where I was in the file. I assumed ftell() would do the same but apparently not.
[...] I think the reason is the ftell involves an OS API call, because fread() uses the underlying read() syscall which reads from where it left off last, and there could be multiple threads reading from the same file descriptor, so the only way for fseek/ftell to work correctly is via a syscall into the kernel.  Obviously, this would be expensive, as it would involve a kernel context-switch as well as acquiring and releasing a lock on the file descriptor.
fread reads from its internal buffer when it can. By default it uses 1 page (4096 bytes on x86 and ARM). After a seek operation it will always try to fill the buffer with 4096 bytes (of course the read syscall might return less). As long as the reads are within the buffer fread() will not invoke a read syscall.
If you seek within the buffer it could potentially leave the buffer alone. But it chooses to flush the buffer completely. Not sure why it does that. It's not so it can keep the data filled, it tries to read the full buffer at that point (meaning it removed all the buffered data).

This could be potentially really slow if you were skipping a few bytes at a time using fseek, as it would reload the entire buffer every seek.

-Steve
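If the skips are small, one workaround is to consume the bytes through the buffer instead of seeking. A C sketch, assuming (as described above) that the C library discards its buffer on `fseek`; `skip_bytes` is an illustrative helper, not a standard function:

```c
#include <stdio.h>

/* Skip up to n bytes by reading and discarding them.
   Returns the number of bytes actually skipped (less than n at end of file). */
size_t skip_bytes(FILE *f, size_t n)
{
    char scratch[256];
    size_t skipped = 0;

    while (skipped < n) {
        size_t want = n - skipped;
        if (want > sizeof scratch)
            want = sizeof scratch;
        size_t got = fread(scratch, 1, want, f);
        skipped += got;
        if (got < want)
            break;              /* end of file or read error */
    }
    return skipped;
}
```

For large jumps a real seek is still the better tool, since discarding means every skipped byte is read and copied; where the crossover lies is something to measure rather than assume.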
Feb 13 2022
prev sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 13 February 2022 at 03:01:09 UTC, Ali Çehreli wrote:
 On 2/12/22 05:17, rempas wrote:

 a system call every single time
I have a related experience: I realized that very many ftell() calls that I were making were very costly. I saved a lot of time after realizing that I did not need to make the calls because I could maintain a 'long' variable to keep track of where I was in the file. I assumed ftell() would do the same but apparently not.
ftell() and fseek() use a syscall, but they also cause the next stdio read call (fgets, fgetc, fread, fscanf etc.) to systematically fill its internal buffer again. If you make an itrace on an app with an fseek (ftell is often implemented as a relative seek of 0), you will see something like

That's why one should avoid using seek when working with buffered stdio.
Feb 13 2022
prev sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 2/8/22 9:07 PM, user1234 wrote:
 On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
 This should have probably been posted in the "Learn" section but I 
 thought that it is an advanced topic so maybe people other than me may 
 learn something too. So here we go!

 I'm planning to make a change to my program to use "mmap" to the 
 contents of a file rather than "fgetc". This is because I learned that 
 "mmap" can do it faster. The thing is, are there any problems that can 
 occur when using "mmap"? I need to know now because changing this 
 means changing the design of the program and this is not something 
 pleasant to do so I want to be sure that I won't have to change back 
 in the future (where the project will be even bigger).
`std.file.readText()` is just fine... your really want to do an os with call `fgetc` for every single byte that has to be read ?
Just a clarification here -- `fgetc` does NOT do an OS system call for every character. It's a C library function, which uses a `FILE *`. And this is not a new development -- my ANSI C book from 1988 talks about how `FILE` has a buffer.

While it does not do a system call (unless the buffer is empty and it needs to fill the buffer), it's still an opaque call, which might cost a decent amount if you are reading by character.

-Steve
Feb 12 2022
prev sibling parent reply Florian Weimer <fw deneb.enyo.de> writes:
One issue that hasn't been mentioned so far is that if the input 
file is truncated, accessing its memory-mapped view results in 
`SIGBUS` on Linux and other systems. (I think Windows prevents 
truncation instead.)

In theory, it is possible to intercept that signal and turn it 
into something else (Java does that), but I don't think the [D 
implementation](https://github.com/dlang/phobos/blob/master/std/mmfile.d) does
that.
Feb 13 2022
parent rempas <rempas tutanota.com> writes:
On Sunday, 13 February 2022 at 12:55:43 UTC, Florian Weimer wrote:
 One issue that hasn't been mentioned so far is that if the 
 input file is truncated, accessing is memory-mapped view 
 results in `SIGBUS` on Linux and other systems. (I think 
 Windows prevents truncation instead.)

 In theory, it is possible to intercept that signal and turn it 
 into something else (Java does that), but I don't think the [D 
 implementation](https://github.com/dlang/phobos/blob/master/std/mmfile.d) does
that.
Thank you for the info! That's very important and I'll keep it in mind!
Feb 13 2022