
digitalmars.D - [WORK] std.file.update function

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
There are quite a few situations in rdmd and dmd generally when we 
compute a dependency structure over sets of files. Based on that, we 
write new files that overwrite old, obsoleted files. Those changes in 
turn trigger other dependencies to go stale so more building is done etc.

Simplest case is - source file is being changed, therefore a new object 
file is being produced, therefore a new executable is being produced. 
And it only gets more involved.

We've discussed before using a simple method to avoid unnecessary stale 
dependencies when it's possible that a certain file won't, in fact, 
change contents:

1. Do all work on the side in a separate file e.g. file.ext.tmp

2. Compare the new file with the old file file.ext

3. If they're identical, delete file.ext.tmp; otherwise, rename 
file.ext.tmp into file.ext
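
A minimal sketch of this write-aside-and-compare scheme in D could look like
the following; the helper name, the ".tmp" suffix, and the whole-file
comparison are illustrative assumptions, not an existing API:

    import std.file : exists, read, remove, rename, write;

    // Hypothetical helper for the three steps above.
    void replaceIfChanged(string name, const(void)[] newContent)
    {
        auto tmp = name ~ ".tmp";
        write(tmp, newContent);                   // 1. do all work on the side
        if (exists(name) &&
            cast(const(ubyte)[]) read(name) == cast(const(ubyte)[]) read(tmp))
            remove(tmp);                          // 2-3. identical: discard tmp
        else
            rename(tmp, name);                    // different: move into place
    }

Because the final step is a rename, the destination is always either the
complete old file or the complete new one, never a half-written mix.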

There is actually an even better way at the application level. Consider 
a function in std.file:

update(S, Range)(S name, Range data);

updateFile does something interesting: it opens the file "name" for 
reading AND writing, then reads data from the Range _and_ the file. For 
as long as the data and the contents in the file agree, it just moves 
reading along. At the first difference between the data and the file 
contents, it starts writing the data into the file through the end of the 
range.

So this makes zero writes (and leaves the "last modified time" intact) 
if the file has the same content as the data. Better yet, if it so 
happens that the file and the data have the same prefix, there's less 
writing going on, which IIRC is faster for most filesystems. Saving on 
writes happens to be particularly nice on new solid-state drives.
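
To make the idea concrete, here is a minimal sketch specialized to a byte
slice instead of a generic range; the name, the missing truncation handling,
and other details are assumptions rather than a finished design:

    import std.file : exists, write;
    import std.stdio : File;

    // Hypothetical sketch of the proposed update(); not an existing API.
    void update(string name, const(ubyte)[] data)
    {
        if (!exists(name))
        {
            write(name, data);          // no old file: plain write
            return;
        }

        auto f = File(name, "r+b");     // open for reading AND writing
        ubyte[4096] buf;
        size_t pos = 0;                 // length of the matching prefix

        // Move reading along while the file and the data agree.
        outer: while (pos < data.length)
        {
            auto chunk = f.rawRead(buf[]);
            if (chunk.length == 0)
                break;                  // file is shorter than the data
            foreach (i, b; chunk)
            {
                if (pos + i >= data.length || b != data[pos + i])
                {
                    pos += i;
                    break outer;        // first difference found
                }
            }
            pos += chunk.length;
        }

        // Same content and length: zero writes, timestamp left intact.
        if (pos == data.length && f.size == data.length)
            return;

        f.seek(pos);                    // back to the first difference
        if (pos < data.length)
            f.rawWrite(data[pos .. $]); // write only the differing suffix
        // A real version would also truncate here if the old file was longer.
    }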

Who wants to take this on, with testing, measurements etc.? It's a cool mini 
project.


Andrei
Sep 18 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.
Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. -- Andrei
Sep 18 2016
next sibling parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.
Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. -- Andrei
How does this compare against doing a checksum comparison on the file?
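
For reference, the checksum-based alternative might be sketched like this
(MD5 via std.digest is an arbitrary choice); note that it still has to read
and hash the whole old file, whereas the prefix comparison can stop at the
first differing byte:

    import std.digest.md : md5Of;
    import std.file : read;

    // Illustrative only: does the on-disk file already match the new data?
    bool sameContentByChecksum(string name, const(ubyte)[] newContent)
    {
        return md5Of(cast(const(ubyte)[]) read(name)) == md5Of(newContent);
    }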
Sep 18 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/18/16 11:24 AM, rikki cattermole wrote:
 On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.
Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. -- Andrei
How does this compare against doing a checksum comparison on the file?
Favorably :o). -- Andrei
Sep 18 2016
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 19/09/2016 3:41 AM, Andrei Alexandrescu wrote:
 On 9/18/16 11:24 AM, rikki cattermole wrote:
 On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.
Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. -- Andrei
How does this compare against doing a checksum comparison on the file?
Favorably :o). -- Andrei
Confirmed by doing the checksum myself. However, I have not compared against an OS-provided checksum.
Sep 18 2016
parent reply Chris Wright <dhasenan gmail.com> writes:
On Mon, 19 Sep 2016 04:24:41 +1200, rikki cattermole wrote:

 On 19/09/2016 3:41 AM, Andrei Alexandrescu wrote:
 On 9/18/16 11:24 AM, rikki cattermole wrote:
 On 19/09/2016 3:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new
 object file is being produced, therefore a new executable is being
 produced.
Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. -- Andrei
How does this compare against doing a checksum comparison on the file?
Favorably :o). -- Andrei
Confirmed by doing the checksum myself. However, I have not compared against an OS-provided checksum.
You have an operating system that automatically checksums every file?
Sep 18 2016
parent reply R <rjmcguire gmail.com> writes:
On Monday, 19 September 2016 at 02:57:01 UTC, Chris Wright wrote:

 You have an operating system that automatically checksums every 
 file?
There are a few filesystems that keep checksums of blocks, but I don't see one that keeps file checksums.
Oct 18 2016
parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 18 October 2016 at 13:51:48 UTC, R wrote:
 On Monday, 19 September 2016 at 02:57:01 UTC, Chris Wright 
 wrote:

 You have an operating system that automatically checksums 
 every file?
There are a few filesystems that keep checksums of blocks, but I don't see one that keeps file checksums.
zfs, btrfs. Whether the checksum is accessible is another story.
Oct 18 2016
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/18/2016 8:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.
Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. --
The compiler currently creates the complete object file in a buffer, then writes the buffer to a file in one command. The reason is mostly because the object file format isn't incremental; the beginning is written last and the body gets backpatched as the compilation progresses. I can't really see a compilation producing an object file where the first half of it matches the previous object file and the second half is different, because of the file format.

Interestingly, the win32 .lib format is designed for incredibly slow floppy disks, in that updating the library need not read/write every disk sector.

I'd love to design our own high speed formats, but then they'd be incompatible with everybody else's.
Sep 18 2016
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2016-09-19 07:16, Walter Bright wrote:

 I'd love to design our own high speed formats, but then they'd be
 incompatible with everybody else's.
You already mentioned in another post [1] that the compiler could do the linking as well. In that case you would need to write some form of linker.

I suggest developing the linker as a library, supporting all formats DMD currently supports. The library could be used both directly from DMD and to build an external linker. When we have our own linker we could create our own format too, without having to worry about compatibility.

I guess we would need to create other tools for the new format as well, like object dumpers, but I assume that's a natural thing to do anyway. Bundle that with something like musl libc and we will have our own complete tool chain. It would also be easier to add support for cross-compiling.

[1] http://forum.dlang.org/post/nrnsn7$1h3k$1 digitalmars.com

-- /Jacob Carlborg
Sep 18 2016
prev sibling next sibling parent reply Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 19 September 2016 at 05:16:37 UTC, Walter Bright wrote:
 I'd love to design our own high speed formats, but then they'd 
 be incompatible with everybody else's.
I'd like that as well. I recently had a look at the ELF and the COFF file formats; both are definitely in need of rework and a dust-off :-) There are some nice things we could do if we had certain features on every platform, wrt. linking and symbol tables. However, the maintenance burden is a bit heavy; we don't have enough manpower as it is.
Sep 18 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/18/2016 11:33 PM, Stefan Koch wrote:
 However, the maintenance burden is a bit heavy; we don't have enough manpower as
 it is.
A major part of the problem (that working with Optlink has made painfully clear) is that although linking is conceptually a rather trivial task, the people who've designed the file formats have an unending love of making trivial things exceedingly complicated. Furthermore, the weird things about the format are 98% undocumented lore. DMD still has problems generating "correct" Dwarf debug info because its correctness is not defined by the spec, but by lore and the idiosyncratic way that gcc emits it.

Doing a linker inside DMD means that object files imported from other C/C++ compilers have to be correctly interpreted. I could do it, but I couldn't do that and continue to work on D.
Sep 18 2016
parent ketmar <ketmar ketmar.no-ip.org> writes:
On Monday, 19 September 2016 at 06:53:47 UTC, Walter Bright wrote:
 Doing a linker inside DMD means that object files imported from 
 other C/C++ compilers have to be correctly interpreted. I could 
 do it, but I couldn't do that and continue to work on D.
yeah. there is a reason for the absence of 100500 hobbyist FOSS linkers. ;-) contrary to what it may look like, correct linking is a really hard task, and mostly not fun to write too. people usually try, and then just silently return to binutils. ;-)
Sep 19 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 09/19/2016 01:16 AM, Walter Bright wrote:
 On 9/18/2016 8:20 AM, Andrei Alexandrescu wrote:
 On 09/18/2016 11:17 AM, Andrei Alexandrescu wrote:
 Simplest case is - source file is being changed, therefore a new object
 file is being produced, therefore a new executable is being produced.
Forgot to mention a situation here: if you change the source code of a module without influencing the object file (e.g. documentation, certain style changes, unittests in non-unittest builds etc) there'd be no linking upon rebuilding. --
The compiler currently creates the complete object file in a buffer, then writes the buffer to a file in one command. The reason is mostly because the object file format isn't incremental, the beginning is written last and the body gets backpatched as the compilation progresses.
Great. In that case, if the target .o file already exists, it should be compared against the buffer. If identical, there should be no write and the timestamp of the .o file should stay the same.

I need to re-emphasize that this kind of stuff is important for tooling. Many files get recompiled to identical object files - e.g. the many innocent bystanders in a dense dependency structure when one module changes. We also embed documentation in source files. Being disciplined about reflecting actual changes in the actual file operations is very helpful for tools that track file writes and/or timestamps.
 I can't really see a compilation producing an object file where the
 first half of it matches the previous object file and the second half is
 different, because of the file format.
Interesting. What happens e.g. if one makes a change to a function whose generated code is somewhere in the middle of the object file? If it doesn't alter the call graph, doesn't the new .o file share a common prefix with the old one?
 Interestingly, the win32 .lib format is designed for incredibly slow
 floppy disks, in that updating the library need not read/write every
 disk sector.

 I'd love to design our own high speed formats, but then they'd be
 incompatible with everybody else's.
This (and the subsequent considerations) is drifting off-topic. This is about getting a useful function off the ground, and sadly is degenerating into yet another off-topic debate leading to no progress. Andrei
Sep 19 2016
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Monday, 19 September 2016 at 14:04:03 UTC, Andrei Alexandrescu 
wrote:
 Interesting. What happens e.g. if one makes a change to a 
 function whose generated code is somewhere in the middle of the 
 object file? If it doesn't alter the call graph, doesn't the 
 new .o file share a common prefix with the old one?
Only if the TOC is unchanged. There are a lot of common sections in the same order but with different offsets; we would need some binary patching method, but I am unaware of file systems supporting this. Microsoft's incremental linking mechanism makes use of thunks so it can avoid changing the header, iirc. But all of this needs codegen to adapt.
Sep 19 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/19/2016 7:04 AM, Andrei Alexandrescu wrote:
 On 09/19/2016 01:16 AM, Walter Bright wrote:
 The compiler currently creates the complete object file in a buffer,
 then writes the buffer to a file in one command. The reason is mostly
 because the object file format isn't incremental, the beginning is
 written last and the body gets backpatched as the compilation progresses.
Great. In that case, if the target .o file already exists, it should be compared against the buffer. If identical, there should be no write and the timestamp of the .o file should stay the same.
That's right. I was just referring to the idea of incrementally writing and comparing, which is a great idea for sequential file writing but likely won't work for the object file case. I think it is distinct enough to merit a separate library function. Note that we already have:

Adding another "writeIfDifferent()" function would be a good thing. The range-based incremental one should go into std.stdio.

Any case where writing is much more costly than reading (such as the SSD drives you mentioned, and the new Seagate "archival" drives) would make your technique a good one. It works even for memory; I've used it in code to reduce swapping, as in:

    if (*p != newvalue)
        *p = newvalue;
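
For reference, a whole-buffer writeIfDifferent() along these lines might be
sketched as follows; the signature is an assumption, not an existing Phobos
API:

    import std.file : exists, read, write;

    // Hypothetical helper: write only when the contents actually change,
    // so an unchanged file keeps its "last modified" timestamp.
    void writeIfDifferent(string name, const(void)[] newContent)
    {
        if (exists(name) &&
            cast(const(ubyte)[]) read(name) == cast(const(ubyte)[]) newContent)
            return;                     // identical: no write at all
        write(name, newContent);
    }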
 I need to re-emphasize this kind of stuff is important for tooling. Many files
 get recompiled to identical object files - e.g. the many innocent bystanders in
 a dense dependency structure when one module changes. We also embed
 documentation in source files. Being disciplined about reflecting actual
changes
 in the actual file operations is very helpful for tools that track file writes
 and/or timestamps.
That's right.
 I can't really see a compilation producing an object file where the
 first half of it matches the previous object file and the second half is
 different, because of the file format.
Interesting. What happens e.g. if one makes a change to a function whose generated code is somewhere in the middle of the object file? If it doesn't alter the call graph, doesn't the new .o file share a common prefix with the old one?
Two things:

1. The object file starts out with a header that contains file offsets to the various tables and sections. Changing the size of any of the pieces in the file changes the header, and will likely require moving pieces around to make room.

2. Writing an object file can mean "backpatching" what was written earlier, as a declaration one assumed was external turns out to be internal.
Sep 19 2016
prev sibling next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Sunday, 18 September 2016 at 15:17:31 UTC, Andrei Alexandrescu 
wrote:
 There are quite a few situations in rdmd and dmd generally when 
 we compute a dependency structure over sets of files. Based on 
 that, we write new files that overwrite old, obsoleted files. 
 Those changes in turn trigger other dependencies to go stale so 
 more building is done etc.
If so, we need it in druntime; introducing Phobos into ddmd is still considered a no-no. Personally I am pretty torn: without range-specific optimizations in dmd, ranges do incur more overhead than they should.
Sep 18 2016
prev sibling next sibling parent reply Chris Wright <dhasenan gmail.com> writes:
This will produce different behavior with hard links. With hard links, 
the temporary file mechanism you mention will result in the old file 
being accessible via the other path. With your recommended strategy, the 
data accessible from both paths is updated.

That's probably acceptable, and hard links aren't used that much anyway.

Obviously, if you have to overwrite large portions of the file, it's 
going to be faster to just write it. This is just for cases when you can 
get speedups down the line by not updating write timestamps, or when you 
know a large portion of the file is unchanged and the file is cached, or 
you're using a disk that sucks at writing data.
Sep 18 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/18/16 12:15 PM, Chris Wright wrote:
 This will produce different behavior with hard links. With hard links,
 the temporary file mechanism you mention will result in the old file
 being accessible via the other path. With your recommended strategy, the
 data accessible from both paths is updated.

 That's probably acceptable, and hard links aren't used that much anyway.
Awesome, this should be part of the docs.
 Obviously, if you have to overwrite large portions of the file, it's
 going to be faster to just write it. This is just for cases when you can
 get speedups down the line by not updating write timestamps, or when you
 know a large portion of the file is unchanged and the file is cached, or
 you're using a disk that sucks at writing data.
That's exactly right, and such considerations should also go in the function documentation. Wanna go for it? Andrei
Sep 18 2016
prev sibling next sibling parent reply Brad Roberts via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 9/18/2016 8:17 AM, Andrei Alexandrescu via Digitalmars-d wrote:
 There is actually an even better way at the application level. Consider
 a function in std.file:

 update(S, Range)(S name, Range data);

 updateFile does something interesting: it opens the file "name" for
 reading AND writing, then reads data from the Range _and_ the file. For
 as long as the data and the contents in the file agree, it just moves
 reading along. At the first difference between the data and the file
 contents, starts writing the data into the file through the end of the
 range.

 So this makes zero writes (and leaves the "last modified time" intact)
 if the file has the same content as the data. Better yet, if it so
 happens that the file and the data have the same prefix, there's less
 writing going on, which IIRC is faster for most filesystems. Saving on
 writes happens to be particularly nice on new solid-state drives.

 Who wants to take this with testing, measurements etc? It's a cool mini
 project.


 Andrei
This is nice in the case of no changes, but problematic in the case of some changes. The standard write new, rename technique never has either file in a half-right state. The file is atomically either old or new and nothing in between. This can be critical.
Sep 18 2016
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/18/2016 7:05 PM, Brad Roberts via Digitalmars-d wrote:
 This is nice in the case of no changes, but problematic in the case of some
 changes.  The standard write new, rename technique never has either file in a
 half-right state.  The file is atomically either old or new and nothing in
 between.  This can be critical.
As for compilation, I bet considerable speed increases could be had by never writing object files at all. (Not only does it save the read/write file time, but it saves the encoding into the object file format and decoding of that format.) Have the compiler do the linking directly.

dmd already does this for generating library files directly, and it's been very successful (although sometimes I suspect nobody has noticed(!), which is actually a good thing). It took surprisingly little code to make that work, though doing a link step would be far more work.
Sep 18 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 09/18/2016 10:05 PM, Brad Roberts via Digitalmars-d wrote:
 This is nice in the case of no changes, but problematic in the case of
 some changes.  The standard write new, rename technique never has either
 file in a half-right state.  The file is atomically either old or new
 and nothing in between.  This can be critical.
Good point, this should also be part of the doco, or a flag with update (e.g. Yes.atomic). Alternative: the caller may wish to rename the file prior to the operation and then rename it back after the operation. -- Andrei
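
As a purely illustrative sketch of how such a flag might look with
std.typecons.Flag (no such overload exists yet):

    import std.typecons : Flag, No, Yes;

    alias Atomic = Flag!"atomic";

    // Hypothetical wrapper: the flag selects between strategies.
    void update(string name, const(ubyte)[] data, Atomic atomic = No.atomic)
    {
        if (atomic)
        {
            // build the new contents on the side and rename into place
        }
        else
        {
            // compare against the existing file and overwrite from the
            // first difference on, as described at the top of the thread
        }
    }

    // usage: update("app.o", objectBytes, Yes.atomic);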
Sep 19 2016
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
One way to implement it is to open the existing file as a memory-mapped file. 
Memory-mapped files only get paged into memory as the memory is referenced. So 
if you did a memcmp(oldfile, newfile, size), it will stop once the first 
difference is found, and the rest of the file is never read.

Also, only the changed pages of the memory-mapped file have to be written. On 
large files, this could be a big savings.
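
A sketch of this memory-mapped variant using std.mmfile; creation and
resizing of the file are glossed over, and the details are assumptions:

    import std.mmfile : MmFile;

    // Illustrative only: assumes the file exists and data.length is the
    // desired final size.
    void updateViaMmap(string name, const(ubyte)[] data)
    {
        auto mm = new MmFile(name, MmFile.Mode.readWrite, data.length, null);
        auto bytes = cast(ubyte[]) mm[];    // pages fault in only when touched

        foreach (i, b; data)
        {
            if (bytes[i] != b)
                bytes[i] = b;               // dirty only the pages that differ
        }
        // Only the dirtied pages are written back when the mapping goes away.
    }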
Sep 19 2016