digitalmars.D - [WORK] std.file.update function
- Andrei Alexandrescu (31/31) Sep 18 2016 There are quite a few situations in rdmd and dmd generally when we
- Andrei Alexandrescu (5/7) Sep 18 2016 Forgot to mention a situation here: if you change the source code of a
- rikki cattermole (2/9) Sep 18 2016 How does this compare against doing a checksum comparison on the file?
- Andrei Alexandrescu (2/12) Sep 18 2016 Favorably :o). -- Andrei
- rikki cattermole (3/16) Sep 18 2016 Confirmed in doing the checksum myself.
- Chris Wright (2/21) Sep 18 2016 You have an operating system that automatically checksums every file?
- R (3/5) Oct 18 2016 There are a few filesystems that keep checksums of blocks, but I
- Patrick Schluter (2/8) Oct 18 2016 zfs , btrfs. If the checksum's accessible is anoher story.
- Walter Bright (12/18) Sep 18 2016 The compiler currently creates the complete object file in a buffer, the...
- Jacob Carlborg (16/18) Sep 18 2016 You already mentioned in an other post [1] that the compiler could do
- Stefan Koch (8/10) Sep 18 2016 I'd like that as well.
- Walter Bright (12/14) Sep 18 2016 A major part of the problem (that working with Optlink has made painfull...
- ketmar (5/8) Sep 19 2016 yeah. there is a reason for absense of 100500 hobbyst FOSS
- Andrei Alexandrescu (18/41) Sep 19 2016 Great. In that case, if the target .o file already exists, it should be
- Stefan Koch (10/14) Sep 19 2016 Only if the TOC is unchanged.
- Walter Bright (19/40) Sep 19 2016 That's right. I was just referring to the idea of incrementally writing ...
- Stefan Koch (6/11) Sep 18 2016 If so we need it in druntime.
- Chris Wright (10/10) Sep 18 2016 This will produce different behavior with hard links. With hard links,
- Andrei Alexandrescu (5/15) Sep 18 2016 That's exactly right, and such considerations should also go in the
- Brad Roberts via Digitalmars-d (5/22) Sep 18 2016 This is nice in the case of no changes, but problematic in the case of
- Walter Bright (9/13) Sep 18 2016 As for compilation, I bet considerable speed increases could be had by n...
- Andrei Alexandrescu (4/8) Sep 19 2016 Good point, should be also part of the doco or a flag with update (e.g.
- Walter Bright (6/6) Sep 19 2016 One way to implement it is to open the existing file as a memory-mapped ...
There are quite a few situations in rdmd and dmd generally when we compute a dependency structure over sets of files. Based on that, we write new files that overwrite old, obsoleted files. Those changes in turn trigger other dependencies to go stale, so more building is done, etc. The simplest case is: a source file is changed, therefore a new object file is produced, therefore a new executable is produced. And it only gets more involved.

We've discussed before using a simple method to avoid unnecessary stale dependencies when it's possible that a certain file won't, in fact, change contents:

1. Do all work on the side in a separate file, e.g. file.ext.tmp
2. Compare the new file with the old file file.ext
3. If they're identical, delete file.ext.tmp; otherwise, rename file.ext.tmp into file.ext

There is actually an even better way at the application level. Consider a function in std.file:

    update(S, Range)(S name, Range data);

update does something interesting: it opens the file "name" for reading AND writing, then reads data from the range _and_ the file. For as long as the data and the contents in the file agree, it just moves reading along. At the first difference between the data and the file contents, it starts writing the data into the file through the end of the range.

So this makes zero writes (and leaves the "last modified" time intact) if the file has the same content as the data. Better yet, if it so happens that the file and the data share a prefix, there's less writing going on, which IIRC is faster for most filesystems. Saving on writes happens to be particularly nice on new solid-state drives.

Who wants to take this on, with testing, measurements, etc.? It's a cool mini project.

Andrei
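For concreteness, a minimal sketch of the idea in D follows. It is specialized to a byte slice rather than a generic range, skips error handling, and falls back to a plain write when the file shrinks (a real implementation would truncate instead); the name and signature are illustrative, not a settled API.

    import std.file : exists, read, write;
    import std.stdio : File;

    // Sketch of the proposed update(): zero writes when the on-disk content
    // already equals `data`, otherwise rewrite only from the first differing
    // byte onward.
    void update(string name, const(ubyte)[] data)
    {
        if (!exists(name))
        {
            write(name, data);          // no old file yet: plain write
            return;
        }

        auto old = cast(const(ubyte)[]) read(name);

        // Length of the common prefix of the old content and the new data.
        size_t i;
        immutable n = old.length < data.length ? old.length : data.length;
        while (i < n && old[i] == data[i]) ++i;

        if (i == old.length && i == data.length)
            return;                     // identical: no write, mtime stays intact

        if (data.length < old.length)
        {
            write(name, data);          // shrinking would need a truncate; keep it simple
            return;
        }

        auto f = File(name, "r+b");     // open for read/write without truncating
        f.seek(cast(long) i);
        f.rawWrite(data[i .. $]);       // write only the differing suffix
    }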
Sep 18 2016
On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
> Simplest case is - source file is being changed, therefore a new object file is being produced, therefore a new executable is being produced.

Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. -- Andrei
Sep 18 2016
On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
> On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
>> Simplest case is - source file is being changed, therefore a new object file is being produced, therefore a new executable is being produced.
>
> Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. -- Andrei

How does this compare against doing a checksum comparison on the file?
Sep 18 2016
On 9/18/16 11:24 AM, rikki cattermole wrote:
> On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
>> On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
>>> Simplest case is - source file is being changed, therefore a new object file is being produced, therefore a new executable is being produced.
>>
>> Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. -- Andrei
>
> How does this compare against doing a checksum comparison on the file?

Favorably :o). -- Andrei
Sep 18 2016
On 19/09/2016 3:41 AM, Andrei Alexandrescu wrote:
> On 9/18/16 11:24 AM, rikki cattermole wrote:
>> On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
>>> On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
>>>> Simplest case is - source file is being changed, therefore a new object file is being produced, therefore a new executable is being produced.
>>>
>>> Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. -- Andrei
>>
>> How does this compare against doing a checksum comparison on the file?
>
> Favorably :o). -- Andrei

Confirmed in doing the checksum myself. However I have not compared against OS provided checksum.
Sep 18 2016
On Mon, 19 Sep 2016 04:24:41 +1200, rikki cattermole wrote:
> On 19/09/2016 3:41 AM, Andrei Alexandrescu wrote:
>> On 9/18/16 11:24 AM, rikki cattermole wrote:
>>> On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
>>>> On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
>>>>> Simplest case is - source file is being changed, therefore a new object file is being produced, therefore a new executable is being produced.
>>>>
>>>> Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. -- Andrei
>>>
>>> How does this compare against doing a checksum comparison on the file?
>>
>> Favorably :o). -- Andrei
>
> Confirmed in doing the checksum myself. However I have not compared against OS provided checksum.

You have an operating system that automatically checksums every file?
Sep 18 2016
On Monday, 19 September 2016 at 02:57:01 UTC, Chris Wright wrote:
> You have an operating system that automatically checksums every file?

There are a few filesystems that keep checksums of blocks, but I don't see one that keeps file checksums.
Oct 18 2016
On Tuesday, 18 October 2016 at 13:51:48 UTC, R wrote:
> On Monday, 19 September 2016 at 02:57:01 UTC, Chris Wright wrote:
>> You have an operating system that automatically checksums every file?
>
> There are a few filesystems that keep checksums of blocks, but I don't see one that keeps file checksums.

ZFS, btrfs. Whether the checksum is accessible is another story.
Oct 18 2016
On 9/18/2016 8:20 AM, Andrei Alexandrescu wrote:
> On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
>> Simplest case is - source file is being changed, therefore a new object file is being produced, therefore a new executable is being produced.
>
> Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. --

The compiler currently creates the complete object file in a buffer, then writes the buffer to a file in one command. The reason is mostly because the object file format isn't incremental; the beginning is written last and the body gets backpatched as the compilation progresses. I can't really see a compilation producing an object file where the first half of it matches the previous object file and the second half is different, because of the file format.

Interestingly, the win32 .lib format is designed for incredibly slow floppy disks, in that updating the library need not read/write every disk sector.

I'd love to design our own high speed formats, but then they'd be incompatible with everybody else's.
Sep 18 2016
On 2016-09-19 07:16, Walter Bright wrote:
> I'd love to design our own high speed formats, but then they'd be incompatible with everybody else's.

You already mentioned in another post [1] that the compiler could do the linking as well. In that case you would need to write some form of linker. Then I suggest developing the linker as a library, supporting all formats DMD currently supports. The library could be used both directly from DMD and to build an external linker.

When we have our own linker we could create our own format too, without having to worry about compatibility. I guess we would need to create other tools for the new format as well, like object dumpers, but I assume that's a natural thing to do anyway. Bundle that with something like musl libc and we will have our own complete tool chain. It would also be easier to add support for cross-compiling.

[1] http://forum.dlang.org/post/nrnsn7$1h3k$1 digitalmars.com

-- /Jacob Carlborg
Sep 18 2016
On Monday, 19 September 2016 at 05:16:37 UTC, Walter Bright wrote:
> I'd love to design our own high speed formats, but then they'd be incompatible with everybody else's.

I'd like that as well. I recently had a look at the ELF and the COFF file formats; both are definitely in need of rework and a dust-off :-) There are some nice things we could do if we had certain features on every platform, wrt. linking and symbol tables. However, the maintenance burden is a bit heavy; we don't have enough manpower as it is.
Sep 18 2016
On 9/18/2016 11:33 PM, Stefan Koch wrote:
> However, the maintenance burden is a bit heavy; we don't have enough manpower as it is.

A major part of the problem (that working with Optlink has made painfully clear) is that although linking is conceptually a rather trivial task, the people who've designed the file formats have an unending love of making trivial things exceedingly complicated. Furthermore, the weird things about the format are 98% undocumented lore. DMD still has problems generating "correct" DWARF debug info because its correctness is not defined by the spec, but by lore and the idiosyncratic way that gcc emits it.

Doing a linker inside DMD means that object files imported from other C/C++ compilers have to be correctly interpreted. I could do it, but I couldn't do that and continue to work on D.
Sep 18 2016
On Monday, 19 September 2016 at 06:53:47 UTC, Walter Bright wrote:
> Doing a linker inside DMD means that object files imported from other C/C++ compilers have to be correctly interpreted. I could do it, but I couldn't do that and continue to work on D.

yeah. there is a reason for the absence of 100500 hobbyist FOSS linkers. ;-) contrary to what it may look like, correct linking is a really hard task, and mostly not fun to write too. people usually try, and then just silently return to binutils. ;-)
Sep 19 2016
On 09/19/2016 01:16 AM, Walter Bright wrote:
> On 9/18/2016 8:20 AM, Andrei Alexandrescu wrote:
>> Simplest case is - source file is being changed, therefore a new object file is being produced, therefore a new executable is being produced.
>>
>> Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. --
>
> The compiler currently creates the complete object file in a buffer, then writes the buffer to a file in one command. The reason is mostly because the object file format isn't incremental, the beginning is written last and the body gets backpatched as the compilation progresses.

Great. In that case, if the target .o file already exists, it should be compared against the buffer. If identical, there should be no write and the timestamp of the .o file should stay the same.

I need to re-emphasize this kind of stuff is important for tooling. Many files get recompiled to identical object files - e.g. the many innocent bystanders in a dense dependency structure when one module changes. We also embed documentation in source files. Being disciplined about reflecting actual changes in the actual file operations is very helpful for tools that track file writes and/or timestamps.

> I can't really see a compilation producing an object file where the first half of it matches the previous object file and the second half is different, because of the file format.

Interesting. What happens e.g. if one makes a change to a function whose generated code is somewhere in the middle of the object file? If it doesn't alter the call graph, doesn't the new .o file share a common prefix with the old one?

> Interestingly, the win32 .lib format is designed for incredibly slow floppy disks, in that updating the library need not read/write every disk sector. I'd love to design our own high speed formats, but then they'd be incompatible with everybody else's.

This (and the subsequent considerations) is drifting off-topic. This is about getting a useful function off the ground, and sadly is degenerating into yet another off-topic debate leading to no progress.

Andrei
Sep 19 2016
On Monday, 19 September 2016 at 14:04:03 UTC, Andrei Alexandrescu wrote:
> Interesting. What happens e.g. if one makes a change to a function whose generated code is somewhere in the middle of the object file? If it doesn't alter the call graph, doesn't the new .o file share a common prefix with the old one?

Only if the TOC is unchanged. There are a lot of common sections in the same order but with different offsets. We would need some binary patching method, but I am unaware of filesystems supporting this. Microsoft's incremental linking mechanism makes use of thunks so it can avoid changing the header, IIRC. But all of this needs codegen to adapt.
Sep 19 2016
On 9/19/2016 7:04 AM, Andrei Alexandrescu wrote:
> On 09/19/2016 01:16 AM, Walter Bright wrote:
>> The compiler currently creates the complete object file in a buffer, then writes the buffer to a file in one command. The reason is mostly because the object file format isn't incremental, the beginning is written last and the body gets backpatched as the compilation progresses.
>
> Great. In that case, if the target .o file already exists, it should be compared against the buffer. If identical, there should be no write and the timestamp of the .o file should stay the same.

That's right. I was just referring to the idea of incrementally writing and comparing, which is a great idea for sequential file writing, but likely won't work for the object file case. I think it is distinct enough to merit a separate library function. Note that we already have:

Adding another "writeIfDifferent()" function would be a good thing. The range-based incremental one should go into std.stdio. Any case where writing is much more costly than reading (such as the SSD drives you mentioned, and the new Seagate "archival" drives) would make your technique a good one. It works even for memory; I've used it in code to reduce swapping, as in:

    if (*p != newvalue) *p = newvalue;

> I need to re-emphasize this kind of stuff is important for tooling. Many files get recompiled to identical object files - e.g. the many innocent bystanders in a dense dependency structure when one module changes. We also embed documentation in source files. Being disciplined about reflecting actual changes in the actual file operations is very helpful for tools that track file writes and/or timestamps.

That's right.

> On 09/19/2016 01:16 AM, Walter Bright wrote:
>> I can't really see a compilation producing an object file where the first half of it matches the previous object file and the second half is different, because of the file format.
>
> Interesting. What happens e.g. if one makes a change to a function whose generated code is somewhere in the middle of the object file? If it doesn't alter the call graph, doesn't the new .o file share a common prefix with the old one?

Two things:

1. The object file starts out with a header that contains file offsets to the various tables and sections. Changing the size of any of the pieces in the file changes the header, and will likely require moving pieces around to make room.

2. Writing an object file can mean "backpatching" what was written earlier, as a declaration one assumed was external turns out to be internal.
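As a rough illustration of the whole-buffer variant named above, here is a hedged sketch; the writeIfDifferent name is taken from the post, but the signature and body are assumptions rather than an existing Phobos API.

    import std.file : exists, read, write;

    // Skip the write entirely (and so preserve the file's timestamp) when
    // the buffer matches what is already on disk.
    void writeIfDifferent(string name, const(void)[] buffer)
    {
        if (exists(name)
            && cast(const(ubyte)[]) read(name) == cast(const(ubyte)[]) buffer)
            return;                     // identical content: leave the file alone
        write(name, buffer);
    }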
Sep 19 2016
On Sunday, 18 September 2016 at 15:17:31 UTC, Andrei Alexandrescu wrote:
> There are quite a few situations in rdmd and dmd generally when we compute a dependency structure over sets of files. Based on that, we write new files that overwrite old, obsoleted files. Those changes in turn trigger other dependencies to go stale so more building is done etc.

If so, we need it in druntime. Introducing Phobos into ddmd is still considered a no-no. Personally I am pretty torn; without range-specific optimizations in dmd, they do incur more overhead than they should.
Sep 18 2016
This will produce different behavior with hard links. With hard links, the temporary file mechanism you mention will result in the old file being accessible via the other path. With your recommended strategy, the data accessible from both paths is updated. That's probably acceptable, and hard links aren't used that much anyway. Obviously, if you have to overwrite large portions of the file, it's going to be faster to just write it. This is just for cases when you can get speedups down the line by not updating write timestamps, or when you know a large portion of the file is unchanged and the file is cached, or you're using a disk that sucks at writing data.
Sep 18 2016
On 9/18/16 12:15 PM, Chris Wright wrote:
> This will produce different behavior with hard links. With hard links, the temporary file mechanism you mention will result in the old file being accessible via the other path. With your recommended strategy, the data accessible from both paths is updated. That's probably acceptable, and hard links aren't used that much anyway.

Awesome, this should be part of the docs.

> Obviously, if you have to overwrite large portions of the file, it's going to be faster to just write it. This is just for cases when you can get speedups down the line by not updating write timestamps, or when you know a large portion of the file is unchanged and the file is cached, or you're using a disk that sucks at writing data.

That's exactly right, and such considerations should also go in the function documentation. Wanna go for it?

Andrei
Sep 18 2016
On 9/18/2016 8:17 AM, Andrei Alexandrescu via Digitalmars-d wrote:
> There is actually an even better way at the application level. Consider a function in std.file:
>
>     update(S, Range)(S name, Range data);
>
> update does something interesting: it opens the file "name" for reading AND writing, then reads data from the range _and_ the file. For as long as the data and the contents in the file agree, it just moves reading along. At the first difference between the data and the file contents, it starts writing the data into the file through the end of the range.
>
> So this makes zero writes (and leaves the "last modified" time intact) if the file has the same content as the data. Better yet, if it so happens that the file and the data share a prefix, there's less writing going on, which IIRC is faster for most filesystems. Saving on writes happens to be particularly nice on new solid-state drives.
>
> Who wants to take this on, with testing, measurements, etc.? It's a cool mini project.
>
> Andrei

This is nice in the case of no changes, but problematic in the case of some changes. The standard write-new-then-rename technique never has either file in a half-right state. The file is atomically either old or new and nothing in between. This can be critical.
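For contrast, a minimal sketch of the write-new-then-rename technique described here, using only std.file; the ".tmp" suffix and the writeAtomic name are illustrative choices, not anything specified in the thread.

    import std.file : rename, write;

    // Build the new content off to the side, then atomically replace the
    // target: readers see either the complete old file or the complete new
    // one, never a partially updated mix.
    void writeAtomic(string name, const(void)[] data)
    {
        auto tmp = name ~ ".tmp";       // illustrative temporary name
        write(tmp, data);
        rename(tmp, name);              // atomic replace on the same filesystem
    }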
Sep 18 2016
On 9/18/2016 7:05 PM, Brad Roberts via Digitalmars-d wrote:This is nice in the case of no changes, but problematic in the case of some changes. The standard write new, rename technique never has either file in a half-right state. The file is atomically either old or new and nothing in between. This can be critical.As for compilation, I bet considerable speed increases could be had by never writing object files at all. (Not only does it save the read/write file time, but it saves the encoding into the object file format and decoding of that format.) Have the compiler do the linking directly. dmd already does this for generating library files directly, and it's been very successful (although sometimes I suspect nobody has noticed(!) which is actually a good thing). It took surprisingly little code to make that work, though doing a link step would be far more work.
Sep 18 2016
On 09/18/2016 10:05 PM, Brad Roberts via Digitalmars-d wrote:
> This is nice in the case of no changes, but problematic in the case of some changes. The standard write-new-then-rename technique never has either file in a half-right state. The file is atomically either old or new and nothing in between. This can be critical.

Good point, this should also be part of the doco, or a flag with update (e.g. Yes.atomic). Alternative: the caller may wish to rename the file prior to the operation and then rename it back after the operation. -- Andrei
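A sketch of what such a flag could look like with std.typecons.Flag; the signature and the placeholder bodies are assumptions, and the only point is the Yes.atomic / No.atomic call syntax.

    import std.typecons : Flag, No, Yes;

    // Hypothetical signature letting callers opt into atomic replacement.
    void update(string name, const(void)[] data,
                Flag!"atomic" atomic = No.atomic)
    {
        import std.file : rename, write;
        if (atomic)
        {
            auto tmp = name ~ ".tmp";   // illustrative temporary name
            write(tmp, data);
            rename(tmp, name);          // whole-file atomic replace
        }
        else
        {
            write(name, data);          // placeholder for the in-place update path
        }
    }

    // Usage: update("app.o", buffer, Yes.atomic);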
Sep 19 2016
One way to implement it is to open the existing file as a memory-mapped file. Memory-mapped files only get paged into memory as the memory is referenced. So if you did a memcmp(oldfile, newfile, size), it will stop once the first difference is found, and the rest of the file is never read. Also, only the changed pages of the memory-mapped file have to be written. On large files, this could be a big savings.
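A hedged sketch of this memory-mapped approach using std.mmfile, assuming for simplicity that the new content has the same length as the existing file (growing or shrinking it would need extra handling):

    import std.mmfile : MmFile;

    // Map the existing file read/write and store bytes only where they
    // differ, so unchanged pages are never dirtied or written back.
    void updateMapped(string name, const(ubyte)[] data)
    {
        auto mm = new MmFile(name, MmFile.Mode.readWrite, data.length, null);
        scope(exit) destroy(mm);        // unmap (and flush) promptly
        auto bytes = cast(ubyte[]) mm[];

        foreach (i, b; data)
            if (bytes[i] != b)
                bytes[i] = b;
    }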
Sep 19 2016