digitalmars.D - Is there any reasons to not use "mmap" to read files?
- rempas (11/11) Feb 06 2022 This should have probably been posted in the "Learn" section but
- Elronnd (9/13) Feb 06 2022 Performance is weird, and depends a lot on your access patterns
- rempas (6/14) Feb 06 2022 Thank you! I will actually make a compiler so it will just open
- Temtaime (5/21) Feb 06 2022 Perso i'm almost always use mmap for opening large files for r/w.
- rempas (5/9) Feb 06 2022 Thank you! For how big files are we talking about? Also like
- Ali Çehreli (24/32) Feb 06 2022 So big that they can't fit in memory. For example, I benefit from mmap
- rempas (11/36) Feb 06 2022 Thank you! I will have that in mind in case I want to do
- Temtaime (6/20) Feb 06 2022 Windows has its own API to mmap files. There's no need to
- rempas (5/10) Feb 06 2022 Yeah, I'm really glad that it works for you but I will have to
- H. S. Teoh (6/16) Feb 06 2022 Just read the Phobos source code for std.mmfile. Phobos code is very
- IGotD- (2/4) Feb 06 2022 Phobos/druntime is like an #ifdef hell but with version instead.
- rempas (8/9) Feb 06 2022 Actually, I personally tried to read some files (like stdio.d and
- H. S. Teoh (8/15) Feb 07 2022 I tried reading GLIB source code once. I will never ever do it again.
- rempas (6/10) Feb 12 2022 Don't be so sure about that! Everything "GNU" seems to be bloated
- rempas (5/8) Feb 06 2022 I suppose this goes to what I said that I need to properly learn
- Ali Çehreli (4/5) Feb 06 2022 Yes. I misspoke: What I meant was std.mmfile handles the differences
- rempas (3/6) Feb 06 2022 Cool! That what I would expect from a library tbh so I'm glad it
- Patrick Schluter (8/19) Feb 06 2022 mmap has quite the overhead to set up the page table for a file.
- rempas (5/12) Feb 06 2022 Thank you! After all I've heard, I will probably stick with
- Steven Schveighoffer (14/25) Feb 07 2022 Will mmap be faster than fgetc? Almost certainly.
- norm (3/19) Feb 07 2022 +1 iopipe !
- rempas (5/19) Feb 12 2022 Thanks for your time Steve! I will do proper testing like you
- sarn (8/19) Feb 08 2022 One reason to use read/write based I/O by default is that it's
- rempas (4/11) Feb 12 2022 Yeah, thank you! I will open and read the whole file anyways so
- user1234 (3/14) Feb 08 2022 `std.file.readText()` is just fine... your really want to do an
- rempas (6/8) Feb 12 2022 Good point! I was really wondering if "fgetc" does a system call
- user1234 (9/19) Feb 12 2022 I think that nowadays fgetc does not make sense anymore, maybe in
- Basile B. (34/54) Feb 12 2022 The problem with phobos and if used to program a compiler is
- Ali Çehreli (7/8) Feb 12 2022 I have a related experience: I realized that very many ftell() calls
- H. S. Teoh (12/22) Feb 12 2022 [...]
- Steven Schveighoffer (12/32) Feb 12 2022 `ftell` does not *need* to do a system call to get the current file
- Patrick Schluter (6/28) Feb 13 2022 fread reads from its internal buffer when it can. By default it
- Steven Schveighoffer (8/36) Feb 13 2022 If you seek within the buffer it could potentially leave the buffer
- Patrick Schluter (8/16) Feb 13 2022 ftell() and fseek() use a syscall but also trigger that the next
- Steven Schveighoffer (9/24) Feb 12 2022 Just a clarification here -- `fgetc` does NOT do an OS system call for
- Florian Weimer (7/7) Feb 13 2022 One issue that hasn't been mentioned so far is that if the input
- rempas (3/10) Feb 13 2022 Thank you for the info! That's very important and I'll keep in in
This should probably have been posted in the "Learn" section, but I thought it is an advanced topic, so maybe people other than me may learn something too. So here we go!

I'm planning to change my program to use "mmap" to read the contents of a file rather than "fgetc". This is because I learned that "mmap" can do it faster. The thing is, are there any problems that can occur when using "mmap"? I need to know now because changing this means changing the design of the program, which is not pleasant to do, so I want to be sure that I won't have to change back in the future (when the project will be even bigger).
Feb 06 2022
On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
> I'm planning to make a change to my program to use "mmap" to the contents of a file rather than "fgetc". This is because I learned that "mmap" can do it faster. The thing is, are there any problems that can occur when using "mmap"?

Performance is weird, and depends a lot on your access patterns and constraints. Mmap is not universally fast and, I would argue, really only makes sense in a few constrained circumstances. I would not switch to mmap just because you heard it was faster; only consider switching if you know i/o is a bottleneck for your application and know mmap is the solution.

https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf is a recent, good read.
Feb 06 2022
On Sunday, 6 February 2022 at 10:08:24 UTC, Elronnd wrote:
> Performance is weird, and depends a lot on your access patterns and constraints. [...]

Thank you! I will actually make a compiler, so it will just open and read the requested files. I don't know if the database example you linked will be similar to my case (I will of course read it tho), so I guess I have to do my own research just to be sure.
Feb 06 2022
On Sunday, 6 February 2022 at 10:48:01 UTC, rempas wrote:
> On Sunday, 6 February 2022 at 10:08:24 UTC, Elronnd wrote:
>> Performance is weird, and depends a lot on your access patterns and constraints. [...]
>
> Thank you! I will actually make a compiler so it will just open and read the requested files. [...]

Personally, I almost always use mmap for opening large files for r/w. It IS faster. The exceptions are small files that can be read into memory using std.file.read, for example.
Feb 06 2022
On Sunday, 6 February 2022 at 10:53:49 UTC, Temtaime wrote:
> Perso i'm almost always use mmap for opening large files for r/w. It IS faster. Exception are small ones that can be read into the memory using std.file.read for example.

Thank you! How big are the files we are talking about? Also, as another guy told me in another (C) forum, "mmap" is for Unix systems, so do you know if Windows or macOS can emulate that behavior with their memory allocation system calls?
Feb 06 2022
On 2/6/22 04:21, rempas wrote:
> On Sunday, 6 February 2022 at 10:53:49 UTC, Temtaime wrote:
>> Perso i'm almost always use mmap for opening large files for r/w. It IS faster.

Ditto.

> how big files are we talking about?

So big that they can't fit in memory. For example, I benefit from mmap on a 16G system where a file would be 30G.

As others said, it depends on the use case. If the entire file will be read anyway, especially in sequential order, then mmap may not have much benefit. In my use case, though, it is common to read unknown small amounts of bytes from unknown places of the huge file. (Say, 5G total out of 30G.) Instead of my making multiple reads to those interesting parts of the file, mmap handles everything transparently: Just mmap the whole thing as a single array and access parts of that memory as needed.

One huge improvement is to add the madvise(2) system call to the picture to tell the system the exact amount of memory that will be touched so the OS reads it in a single shot. Otherwise, the system reads by a default amount, which I think is 4K, which can turn out to be pathetically slow, e.g. when the file is accessed over a slow network. (Why read 4K when the need is just 200 bytes, and why read in 4K steps when the need is already known to be 1M?)

> Also like another guy told me in another (C) forum, "mmap" is for Unix systems so do you know if Windows or MacOS can emulate that behavior with their memory allocation system calls?

I haven't used mmap on Windows but it's in Phobos, so it should work. After all, mmap uses the virtual memory system of the OS, non-ancient Windows versions do use virtual memory, and std.mmfile does include 'version (Windows)' sections; so, yes. :)

Ali
Feb 06 2022
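The random-access pattern Ali describes (map the whole file, touch only the parts you need) might look roughly like the sketch below. It is only an illustration, assuming Phobos' `std.mmfile` is acceptable; the function name `readSlice` is made up for this example, and the `madvise(2)` hint mentioned above is deliberately left out.

```d
import std.mmfile : MmFile;

/// Hypothetical helper: returns a copy of `len` bytes starting at `offset`,
/// read through a read-only memory mapping of the file at `path`.
ubyte[] readSlice(string path, size_t offset, size_t len)
{
    // The single-argument constructor maps the file read-only.
    auto mmf = new MmFile(path);
    scope (exit) destroy(mmf); // unmap as soon as we are done

    // Slicing only touches the pages that back [offset, offset + len).
    auto view = cast(const(ubyte)[]) mmf[offset .. offset + len];

    // Copy the bytes out, because the view is invalid after unmapping.
    return view.dup;
}
```

A real program would more likely keep the mapping alive and work on the view directly instead of copying; the copy here just keeps the example self-contained.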
On Sunday, 6 February 2022 at 16:45:59 UTC, Ali Çehreli wrote:
> So big that they can't fit in memory. For example, I benefit from mmap on a 16G system where a file would be 30G.

Oh, this small...

> As others said, it depends on the use case. If the entire file will be read anyway especially in sequential order, then mmap may not have much benefit. [...]

Thank you! I will have that in mind in case I want to do something like that in the future. In my use-case tho, I will read the whole file.

> One huge improvement is to add madvise(2) system call to the picture [...]
>
> I haven't used mmap on Windows but it's in Phobos, so it should work. [...]

"mmap" is a system call that doesn't exist (natively) on Windows. I don't know what D does with Phobos (which I'm not gonna use anyway) but even if it works (how?), I will end up creating my own library so I'm in the same spot. "madvise" seems cool, I'll check it out! Thanks! In the end, I like advising and telling others how to do their work, XD!
Feb 06 2022
On Sunday, 6 February 2022 at 18:14:51 UTC, rempas wrote:
> [...]
> "mmap" is a system call that doesn't exist (natively) on Windows. I don't know what D does with Phobos (which I'm not gonna use anyway) but even if it works (how?), I will end up creating my own library so I'm in the same spot. [...]

Windows has its own API to mmap files. There's no need to reinvent the wheel; Phobos' MmFile works for me without any problems. Maybe there's no flush function, but for my use cases it's not so critical.
Feb 06 2022
On Sunday, 6 February 2022 at 18:33:59 UTC, Temtaime wrote:
> Windows has its own API to mmap files. There's no need to reinvent the wheel, phobos MmFile works for me without any problems. [...]

Yeah, I'm really glad that it works for you, but I will have to create a library for my compiler, so I need to properly learn how things work so I'll know what I'm doing when the time comes. So yeah...
Feb 06 2022
On Sun, Feb 06, 2022 at 06:47:45PM +0000, rempas via Digitalmars-d wrote:
> On Sunday, 6 February 2022 at 18:33:59 UTC, Temtaime wrote:
>> Windows has its own API to mmap files. There's no need to reinvent the wheel, phobos MmFile works for me without any problems. [...]
>
> Yeah, I'm really glad that it works for you but I will have to create a library for my compiler so there is a need to properly learn how things work so I'll know what I'm doing when the times comes. So yeah...

Just read the Phobos source code for std.mmfile. Phobos code is very readable compared to most typical standard libraries.

T

--
Turning your clock 15 minutes ahead won't cure lateness---you're just making time go faster!
Feb 06 2022
On Sunday, 6 February 2022 at 20:12:39 UTC, H. S. Teoh wrote:
> Just read the Phobos source code for std.mmfile. Phobos code is very readable compared to most typical standard libraries.

Phobos/druntime is like an #ifdef hell but with version instead.
Feb 06 2022
On Sunday, 6 February 2022 at 20:48:09 UTC, IGotD- wrote:
> Phobos/druntime is like an #ifdef hell but with version instead.

Actually, I personally tried to read some files (like stdio.d and conv.d) and while I didn't find them super user friendly, they are WAY clearer and easier to read than GLIB! I don't know if every libc's header files are like that in every OS, and I'm also a super super n00b when it comes to reading other people's source code, so maybe H. S. Teoh is just better than us at reading code, idk...
Feb 06 2022
On Mon, Feb 07, 2022 at 07:16:55AM +0000, rempas via Digitalmars-d wrote:
> On Sunday, 6 February 2022 at 20:48:09 UTC, IGotD- wrote:
>> Phobos/druntime is like an #ifdef hell but with version instead.
>
> Actually, I personally tried to read some files (like stdio.d and conv.d) and while I didn't found them super user friendly, they are WAY more clear and easy to read then GLIB!

I tried reading GLIB source code once. I will never ever do it again. :-P

> I don't know if every libc's header files are like that in every OS
[...]

If it's in C? Yeah, they all look like that.

T

--
Shin: (n.) A device for finding furniture in the dark.
Feb 07 2022
On Monday, 7 February 2022 at 18:31:42 UTC, H. S. Teoh wrote:
> I tried reading GLIB source code once. I will never ever do it again. :-P

C!!!! You gotta love it, lol!

> If it's in C? Yeah, they all look like that.

Don't be so sure about that! Everything "GNU" seems to be bloated, but try to read some *BSD libc source code. It's both a little bit more readable and more organized, minimal and simple to understand.
Feb 12 2022
On Sunday, 6 February 2022 at 20:12:39 UTC, H. S. Teoh wrote:
> Just read the Phobos source code for std.mmfile. Phobos code is very readable compared to most typical standard libraries.

I suppose this goes back to what I said, that I need to properly learn how things work, right? Well, I mean how Windows does it with the system call. I don't think that it is necessary to read the Phobos source code. But regardless, thanks for the suggestion!
Feb 06 2022
On 2/6/22 10:14, rempas wrote:
> "mmap" is a system call that doesn't exist (natively) on Windows.

Yes. I misspoke: What I meant was std.mmfile handles the differences automatically between systems.

Ali
Feb 06 2022
On Sunday, 6 February 2022 at 18:53:13 UTC, Ali Çehreli wrote:
> Yes. I misspoke: What I meant was std.mmfile handles the differences automatically between systems.

Cool! That's what I would expect from a library tbh, so I'm glad it works like that!
Feb 06 2022
On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
> [...] The thing is, are there any problems that can occur when using "mmap"? [...]

mmap has quite the overhead to set up the page table for a file. This means that for small files, open/read/write calls (and stdio, which builds on them) are faster.

The other issue with mmap is that if you use string functions on the mapped part, you have to make sure that there are 0 bytes in the file, or else you risk these functions overshooting into unmapped pages and crashing the application.
Feb 06 2022
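To illustrate the 0-byte pitfall Patrick mentions: C string functions such as `strlen` keep scanning until they hit a terminator, whereas a D slice carries its own length. A minimal sketch, assuming `std.mmfile` is used and with `countLines` being a made-up name for this example, would bound every access by the slice length:

```d
import std.mmfile : MmFile;

/// Hypothetical helper: counts newline bytes in a mapped file.
/// The loop is bounded by the slice length, so it never depends on a
/// terminating 0 byte and never reads past the end of the mapping.
size_t countLines(string path)
{
    auto mmf = new MmFile(path);
    scope (exit) destroy(mmf);

    // View the mapping as text; no copy is made here.
    auto text = cast(const(char)[]) mmf[];

    size_t n = 0;
    foreach (char c; text) // iterate code units, no UTF decoding
        if (c == '\n')
            ++n;
    return n;
}
```

The same idea applies when calling into C: prefer length-taking functions such as `memchr` over terminator-seeking ones when the buffer is a mapped file.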
On Sunday, 6 February 2022 at 12:52:45 UTC, Patrick Schluter wrote:
> mmap has quite the overhead to set up the page table for a file. This means for small files, open/read/write calls (and stdio which build on it) are faster. [...]

Thank you! After all I've heard, I will probably stick with "read". The files I'm going to read are going to be some kilobytes (megabytes at worst), so I should probably be fine.
Feb 06 2022
On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
> [...] I'm planning to make a change to my program to use "mmap" to the contents of a file rather than "fgetc". This is because I learned that "mmap" can do it faster. [...]

Will mmap be faster than fgetc? Almost certainly.

Will it be faster than other i/o systems? Possibly not.

For my i/o system [iopipe](https://github.com/schveiguy/iopipe), every array is also an iopipe, so switching between mmap and file i/o is trivial. See [my talk in 2017](https://dconf.org/2017/talks/schveighoffer.html) where I switched to mmap while on stage to show the difference.

IMO, the best way to determine which is better is to try it and measure. Having an i/o system that allows easy switching is helpful.

For sure, depending on your other tasks in your program, improving the file i/o might be insignificant.

-Steve
Feb 07 2022
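For the "try it and measure" advice, a very small timing harness is usually enough to get a first impression. The following is only a sketch, assuming Phobos' `std.datetime.stopwatch` is available; the name `bestOf` is invented for this example and is not part of any library.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;

/// Hypothetical helper: runs `work` `runs` times and returns the fastest
/// wall-clock duration. With `runs == 0` it returns `Duration.max`.
Duration bestOf(scope void delegate() work, size_t runs = 5)
{
    auto best = Duration.max;
    foreach (_; 0 .. runs)
    {
        auto sw = StopWatch(AutoStart.yes);
        work();
        sw.stop();
        if (sw.peek < best)
            best = sw.peek;
    }
    return best;
}
```

One would pass a delegate that reads a representative input file once with each candidate approach and compare the results. Keep in mind that after the first run the file is most likely served from the OS page cache, so these numbers say little about cold-cache behavior.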
On Tuesday, 8 February 2022 at 03:33:11 UTC, Steven Schveighoffer wrote:
> [...] for my i/o system [iopipe](https://github.com/schveiguy/iopipe), every array is also an iopipe, so switching between mmap and file i/o is trivial. [...]

+1 iopipe!
Feb 07 2022
On Tuesday, 8 February 2022 at 03:33:11 UTC, Steven Schveighoffer wrote:
> [...] IMO, the best way to determine which is better is to try it and measure. Having an i/o system that allows easy switching is helpful. [...]

Thanks for your time, Steve! I will do proper testing like you suggested and see! It will take some time, but I think it's worth it rather than randomly choosing one of them :)
Feb 12 2022
On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
> [...] The thing is, are there any problems that can occur when using "mmap"? [...]

One reason to use read/write based I/O by default is that it's more versatile. It's kind of like an input range versus a random access range in Phobos.

```
// Could not map file /dev/stdin (Invalid argument)
auto f = new MmFile("/dev/stdin");
```
Feb 08 2022
On Tuesday, 8 February 2022 at 21:37:29 UTC, sarn wrote:
> One reason to use read/write based I/O by default is that it's more versatile. [...]

Yeah, thank you! I will open and read the whole file anyway, so it seems that it makes sense to try this method and then measure my program in the future to see! Have a nice day!
Feb 12 2022
On Sunday, 6 February 2022 at 09:40:48 UTC, rempas wrote:
> [...] I'm planning to make a change to my program to use "mmap" to the contents of a file rather than "fgetc". [...]

`std.file.readText()` is just fine... do you really want to make an OS call with `fgetc` for every single byte that has to be read?
Feb 08 2022
On Wednesday, 9 February 2022 at 02:07:05 UTC, user1234 wrote:
> `std.file.readText()` is just fine... your really want to do an os with call `fgetc` for every single byte that has to be read ?

Good point! I was really wondering if "fgetc" does a system call every single time that it is called or if the text is buffered just like with "printf". I will use "read" in any case just to be sure tho. I don't want to use Phobos tho, so I cannot use "file.readText". Thank you for your time!
Feb 12 2022
On Saturday, 12 February 2022 at 13:17:19 UTC, rempas wrote:
> On Wednesday, 9 February 2022 at 02:07:05 UTC, user1234 wrote:
>> `std.file.readText()` is just fine... your really want to do an os with call `fgetc` for every single byte that has to be read ?
>
> Good point! I was really wondering if "fgetc" does a system call every single time that it is called or if the text is buffered just like with "printf". I will use "read" in any case just to be sure tho. I don't want to use Phobos tho so I cannot use "file.readText". Thank you for your time!

I think that nowadays fgetc does not make sense anymore; maybe it did in the past, when the amount of available memory was very limited... Source files are 100 kB at most. You can load 100 of them and the footprint is still small. What will likely consume the most memory is the AST.

Otherwise readText is easy to translate: it's just fopen, then fread, then fclose, plus a few checks for errors. Not a big deal to translate.
Feb 12 2022
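The "fopen then fread then fclose" translation mentioned above could be sketched as follows, using only the C stdio bindings from druntime and no Phobos. This is an illustration, not what `std.file.readText` actually does internally; `readWhole` is a made-up name, and the size is found by seeking, which is an assumption that only holds for regular files (it will not work for pipes or `/dev/stdin`).

```d
import core.stdc.stdio : FILE, fclose, fopen, fread, fseek, ftell,
    SEEK_END, SEEK_SET;

/// Hypothetical helper: reads a whole regular file into a GC-allocated
/// buffer. Returns null on any error (an empty file also yields an
/// empty array, so callers that care must distinguish the two cases).
ubyte[] readWhole(const(char)* path)
{
    FILE* f = fopen(path, "rb");
    if (f is null)
        return null;
    scope (exit) fclose(f);

    // Find the file size once by seeking to the end, then rewind.
    if (fseek(f, 0, SEEK_END) != 0)
        return null;
    auto size = ftell(f);
    if (size < 0)
        return null;
    if (fseek(f, 0, SEEK_SET) != 0)
        return null;

    // One fread call for the whole file instead of one call per byte.
    auto buf = new ubyte[](cast(size_t) size);
    if (fread(buf.ptr, 1, buf.length, f) != buf.length)
        return null;
    return buf;
}
```

A lexer can then walk the returned array by index, which avoids a function call per character altogether.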
On Saturday, 12 February 2022 at 16:48:26 UTC, user1234 wrote:
> [...] Otherwise readText is easy to translate, it's just fopen then fread then fclose, + a few checks for the errors, not a big deal to translate.

The problem with Phobos, if it is used to program a compiler, is _dynamic arrays_, because of how they are managed. With Styx I had used Phobos because I knew the memory management was designed to work similarly with arrays, i.e. functions can return arrays, but using the "sink" style would not have caused any problem (by "sink" style I mean when the buffer is owned by the calling frame and passed as a parameter, e.g. like in many C-style APIs).

Then the amount of Phobos code to translate in order to bootstrap [was minimal](https://gitlab.com/styx-lang/styx/-/raw/master/src/system.sx):

std.paths:
- isAbsolute
- isDir
- isFile
- dirName
- baseName
- exists
- cwd
- dirEntries
- setExtension

std.files:
- read (or readText)
- write (not even used I realize now)

std.process
- pipeProcess (actually just used to optionally --run after compile)

std.getopt
- getopt (tho libc functions for that could have been used... dmd itself doesn't have any special functions for the arg processing in the driver IIRC)

Add to this a few things from libc and unistd and you're good. You don't need more.
Feb 12 2022
On 2/12/22 05:17, rempas wrote:
> a system call every single time

I have a related experience: I realized that the very many ftell() calls that I was making were very costly. I saved a lot of time after realizing that I did not need to make the calls because I could maintain a 'long' variable to keep track of where I was in the file. I assumed ftell() would do the same but apparently not.

Ali
Feb 12 2022
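Ali's trick of keeping a 'long' variable instead of calling ftell() could be sketched like this. It is only an illustration under two assumptions: the file is opened in binary mode at offset 0, and every read goes through the wrapper. `CountingReader` is a made-up name for this example.

```d
import core.stdc.stdio : EOF, FILE, fgetc;

/// Hypothetical wrapper around a FILE* that counts consumed bytes, so
/// the current position is known without calling ftell().
struct CountingReader
{
    FILE* file;
    long offset; // bytes consumed so far through this wrapper

    /// Returns the next byte, or EOF at end of file or on error.
    int next()
    {
        int c = fgetc(file);
        if (c != EOF)
            ++offset; // only count bytes that were actually delivered
        return c;
    }
}
```

If other code reads from or seeks on the same `FILE*` behind the wrapper's back, the counter silently goes stale, which is the price paid for skipping the call.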
On Sat, Feb 12, 2022 at 07:01:09PM -0800, Ali Çehreli via Digitalmars-d wrote:
> On 2/12/22 05:17, rempas wrote:
>> a system call every single time
>
> I have a related experience: I realized that very many ftell() calls that I were making were very costly. I saved a lot of time after realizing that I did not need to make the calls because I could maintain a 'long' variable to keep track of where I was in the file. I assumed ftell() would do the same but apparently not.
[...]

I think the reason is that ftell involves an OS API call, because fread() uses the underlying read() syscall, which reads from where it left off last, and there could be multiple threads reading from the same file descriptor, so the only way for fseek/ftell to work correctly is via a syscall into the kernel. Obviously, this would be expensive, as it would involve a kernel context-switch as well as acquiring and releasing a lock on the file descriptor.

T

--
Too many people have open minds but closed eyes.
Feb 12 2022
On 2/12/22 10:13 PM, H. S. Teoh wrote:
> I think the reason is the ftell involves an OS API call, because fread() uses the underlying read() syscall which reads from where it left off last, and there could be multiple threads reading from the same file descriptor, so the only way for fseek/ftell to work correctly is via a syscall into the kernel. [...]

`ftell` does not *need* to do a system call to get the current file position. But otherwise it has to store the offset of the file somewhere (which it does not). In fact, if you move the file pointer underneath (by using another thread to read from it, or e.g. with `lseek`), you will completely invalidate what `ftell` returns (try it!)

What `ftell` basically does is do a system call to `lseek` to get the current file position, and then subtract the difference between the buffer size and the current buffer offset.

This is not the same for `fgetc`. That only depends on the buffer, and not on anything from the OS (after the buffer is filled).

-Steve
Feb 12 2022
On Sunday, 13 February 2022 at 03:13:43 UTC, H. S. Teoh wrote:
> I think the reason is the ftell involves an OS API call, because fread() uses the underlying read() syscall which reads from where it left off last [...]

fread reads from its internal buffer when it can. By default it uses 1 page (4096 bytes on x86 and ARM). After a seek operation it will always try to fill the buffer with 4096 bytes (of course the read syscall might return less). As long as the reads are within the buffer, fread() will not invoke a read syscall.
Feb 13 2022
On 2/13/22 6:02 AM, Patrick Schluter wrote:
> fread reads from its internal buffer when it can. By default it uses 1 page (4096 bytes on x86 and ARM). After a seek operation it will always try to fill the buffer with 4096 bytes (of course the read syscall might return less). As long as the reads are within the buffer fread() will not invoke a read syscall.

If you seek within the buffer, it could potentially leave the buffer alone. But it chooses to flush the buffer completely. Not sure why it does that. It's not so that it can keep the data filled: it tries to read the full buffer at that point (meaning it has removed all the buffered data).

This could potentially be really slow if you were skipping a few bytes at a time using fseek, as it would reload the entire buffer on every seek.

-Steve
Feb 13 2022
On Sunday, 13 February 2022 at 03:01:09 UTC, Ali Çehreli wrote:
> On 2/12/22 05:17, rempas wrote:
>> a system call every single time
>
> I have a related experience: I realized that very many ftell() calls that I were making were very costly. [...]

ftell() and fseek() use a syscall, but they also cause the next stdio read call (fgets, fgetc, fread, fscanf etc.) to systematically fill its internal buffer again. If you make an itrace on an app with an fseek (ftell is often implemented by using a relative seek of 0), you will see something like

[...]

That's why one should avoid using seek when working with buffered stdio.
Feb 13 2022
On 2/8/22 9:07 PM, user1234 wrote:
> `std.file.readText()` is just fine... your really want to do an os with call `fgetc` for every single byte that has to be read ?

Just a clarification here -- `fgetc` does NOT do an OS system call for every character. It's a C library function, which uses a `FILE *`. And this is not a new development -- my ANSI C book from 1988 talks about how `FILE` has a buffer.

While it does not do a system call (unless the buffer is empty and it needs to fill the buffer), it's still an opaque call, which might cost a decent amount if you are reading by character.

-Steve
Feb 12 2022
One issue that hasn't been mentioned so far is that if the input file is truncated, accessing its memory-mapped view results in `SIGBUS` on Linux and other systems. (I think Windows prevents truncation instead.) In theory, it is possible to intercept that signal and turn it into something else (Java does that), but I don't think the [D implementation](https://github.com/dlang/phobos/blob/master/std/mmfile.d) does that.
Feb 13 2022
On Sunday, 13 February 2022 at 12:55:43 UTC, Florian Weimer wrote:
> One issue that hasn't been mentioned so far is that if the input file is truncated, accessing is memory-mapped view results in `SIGBUS` on Linux and other systems. [...]

Thank you for the info! That's very important and I'll keep it in mind!
Feb 13 2022