digitalmars.D - Fast file copy (threaded or not?)
- Marco Leise (60/60) Sep 01 2011 I split the discussion with Andrei about the benefit of a multi-threaded...
- Vladimir Panteleev (10/11) Sep 01 2011 I wouldn't advise using memory-mapped files under the hood for anything ...
- Marco Leise (4/13) Sep 01 2011 So cached data from memory-mapped files is not handled the same way that...
- Jonathan M Davis (12/86) Sep 01 2011 I would point out that regardless of what happens with performance with
- Marco Leise (9/25) Sep 01 2011 I guess you are right. Neither mine nor Andrei's expectations were met. ...
- Johannes Pfau (11/74) Sep 01 2011 Related link:
- Marco Leise (12/20) Sep 01 2011 As far as I can tell the write buffer is influenced by several settings,...
- Brad Roberts (3/10) Sep 01 2011 mmap has an issue with files larger than the mappable address space. No...
- Johannes Pfau (9/72) Sep 02 2011 I changed the threaded implementation a little so that it doesn't
- zeljkog (2/26) Sep 02 2011 Looking at this code, should be StopWatch.peek() defined as property?
- Jonathan M Davis (5/41) Sep 02 2011 Why? It's name isn't a noun, and conceptually, it's not really a propert...
- Walter Bright (5/9) Sep 02 2011 On Windows, we should just stick with the Windows CopyFile function:
- Andrej Mitrovic (6/11) Sep 02 2011 I've given OP's code a few test runs but I just get inconsistent
- Marco Leise (8/23) Sep 02 2011 Yeah, to get consistent results we'd need at minimum:
I split the discussion with Andrei about the benefit of a multi-threaded file copy routine into its own thread. This is about copying a file from and to the same HDD - a mechanical disk with seek times. My testing showed that Andrei is correct with the assumption that the kernel can optimize the small reads and writes of a multi-threaded application. I had to use large buffers of up to 64 MB with my "single-threaded, 100% synchronized writes" version before the simple multi-threaded version from Johannes Pfau showed 4.3% overhead during a 512 MB copy operation.

Some more things I've experimented with:

- using only system API calls instead of D wrappers: The difference is close to background noise.
- direct I/O for writing, as used by databases: This worked pretty well, but you may not want to use it for reading, as it bypasses the file cache. A file that is already cached would be copied more slowly as a result.
- memory maps: Kernel memory is shared with user space. This approach does not allocate memory in the application; it just makes pages of files directly accessible in user space. Once mapped, the whole copy operation comes down to a single 'memcpy' call.
- splice (zero-copy): This is a Linux system call that allows memory operations inside the kernel to be controlled from user space. The benefit is that the CPU never copies this memory from kernel to user space. Unfortunately the copy operation goes like this: "source file -> pipe, pipe -> destination file". A pipe is a hard-coded 64 KB buffer, so it is not easy to move large chunks of data in a single call to splice(); 512 MB are still divided into 16,000+ calls. Although splice looks promising, it suffers from too many context switches.

I had the best results with direct I/O and synchronized writes for buffer sizes from 8 MB onwards, but I found this to be too complex and probably system-dependent. So I settled with the memory-mapped version, which I rewrote using Phobos instead of POSIX calls, so it should run equally well on all platforms and is 5 lines of code at its core:

----------------------------------------------------------------------
import std.datetime, std.exception, std.stdio, std.mmfile;

void main(string[] args)
{
    if (!enforce(args.length == 3,
                 { stderr.writefln("%s SOURCE DEST", args[0]); }))
        return;

    auto sw = StopWatch();
    sw.start();

    // Map the source read-only and the destination read-write at the
    // source's size, then let a single array copy do the work.
    auto src = new MmFile(args[1], MmFile.Mode.Read, 0, null, 0);
    auto dst = new MmFile(args[2], MmFile.Mode.ReadWriteNew, src.length,
                          null, src.length);
    auto data = dst[];
    data[] = src[];
    dst.flush();

    sw.stop();
    writefln("Copied %s bytes in %s msec (%s kB/s)", src.length,
             sw.peek().msecs,
             1_000_000 * src.length / (1024 * sw.peek().usecs));
}
----------------------------------------------------------------------

This leaves it up to the kernel how to interleave disk reads and writes.

- Marco
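For comparison, here is a minimal sketch of the splice() loop described above (Linux only, untested; this is not the code used for the measurements). splice(2) may not be declared in the druntime headers, so the prototype below is a hand-written binding with ssize_t/loff_t simplified to long (assuming a 64-bit build), and error handling is reduced to errnoEnforce:

----------------------------------------------------------------------
version (linux)
{
    import core.sys.posix.fcntl : open, O_RDONLY, O_WRONLY, O_CREAT, O_TRUNC;
    import core.sys.posix.unistd : close, pipe;
    import std.conv : octal;
    import std.exception : errnoEnforce;
    import std.string : toStringz;

    // Manual binding; splice(2) is Linux-specific and not assumed to be
    // present in druntime. Offsets are passed as null, so the file
    // positions of the descriptors are used.
    extern (C) nothrow @nogc
    long splice(int fd_in, long* off_in, int fd_out, long* off_out,
                size_t len, uint flags);

    void spliceCopy(string srcPath, string dstPath)
    {
        int srcFd = open(srcPath.toStringz, O_RDONLY);
        errnoEnforce(srcFd >= 0);
        scope (exit) close(srcFd);

        int dstFd = open(dstPath.toStringz, O_WRONLY | O_CREAT | O_TRUNC,
                         octal!644);
        errnoEnforce(dstFd >= 0);
        scope (exit) close(dstFd);

        int[2] p;
        errnoEnforce(pipe(p) == 0);
        scope (exit) { close(p[0]); close(p[1]); }

        for (;;)
        {
            // Pull up to 64 KB from the source file into the pipe...
            long inPipe = splice(srcFd, null, p[1], null, 64 * 1024, 0);
            errnoEnforce(inPipe >= 0);
            if (inPipe == 0) break; // EOF

            // ...then push everything in the pipe out to the destination.
            while (inPipe > 0)
            {
                long written = splice(p[0], null, dstFd, null,
                                      cast(size_t) inPipe, 0);
                errnoEnforce(written > 0);
                inPipe -= written;
            }
        }
    }
}
----------------------------------------------------------------------

Every 64 KB round trip through the pipe costs at least two system calls, which is where the context-switch overhead mentioned above comes from.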
Sep 01 2011
On Thu, 01 Sep 2011 23:13:19 +0300, Marco Leise <Marco.Leise gmx.de> wrote:

> So I settled with the memory-mapped version,

I wouldn't advise using memory-mapped files under the hood for anything without prior extensive testing in low-memory conditions. The kernel will be reluctant to drop pages in the file that have already been read/written. Some hinting APIs could be used, but these are not portable or reliable. (I recently had to rewrite one of my programs which used memory-mapped files, with this as one of the reasons.)

-- 
Best regards,
  Vladimir                            mailto:vladimir thecybershadow.net
Sep 01 2011
On 01.09.2011 22:43, Vladimir Panteleev <vladimir thecybershadow.net> wrote:

> I wouldn't advise using memory-mapped files under the hood for anything
> without prior extensive testing in low-memory conditions. The kernel will
> be reluctant to drop pages in the file that have already been
> read/written. [...]

So cached data from memory-mapped files is not handled the same way that the cache from normal reads/writes is handled? Good catch, I'll remember that.
Sep 01 2011
On Thursday, September 01, 2011 13:13 Marco Leise wrote:

> [original post snipped]

I would point out that regardless of what happens with performance with synchronous vs asynchronous I/O on a single HDD, it's pretty much a guarantee that in the general case asynchronous I/O is going to be faster when dealing with different HDDs. So, while we should definitely get hard data, unless copying asynchronously on a single hard drive is significantly worse than copying synchronously, it's pretty much a given that we'd want to go with asynchronous I/O by default.

If it were found that asynchronous I/O was significantly worse on a single HDD, then that makes the question much more interesting, but as long as it's at least close to - if not better than - synchronous I/O on the same HDD, asynchronous I/O would be the way to go.

- Jonathan M Davis
Sep 01 2011
On 01.09.2011 22:38, Jonathan M Davis <jmdavisProg gmx.com> wrote:

> I would point out that regardless of what happens with performance with
> synchronous vs asynchronous I/O on a single HDD, it's pretty much a
> guarantee that in the general case asynchronous I/O is going to be faster
> when dealing with different HDDs. [...]

I guess you are right. Neither my nor Andrei's expectations were met. I/O from multiple threads to a single device is handled remarkably well on today's systems. While it looked like a no-go to me and to others on the net, we see no major difference in the performance of the two approaches with typical buffer sizes and Phobos routines. If you want to go for the extra 5% in some cases, you can go for that 100 MB buffer, OS-specific functions and file usage hints, but that's never good for a standard library routine that is meant to be short, solid and portable.
Sep 01 2011
Marco Leise wrote:

> [original post snipped]

Related link:
http://www.devshed.com/c/a/BrainDump/Advising-the-Linux-Kernel-on-File-IO/

More related information: the Linux maximum readahead buffer is 128 KB (but I think that can be overridden). There seems to be no per-file limit for the write buffer; the only limit seems to be the memory available for caching (for example, in my case with 3 GB of RAM, 1118 MB are available for the write cache).

-- 
Johannes Pfau
Sep 01 2011
On 01.09.2011 23:55, Johannes Pfau <spam example.com> wrote:

> Related link:
> http://www.devshed.com/c/a/BrainDump/Advising-the-Linux-Kernel-on-File-IO/
>
> More related information: the Linux maximum readahead buffer is 128 KB
> (but I think that can be overridden). [...]

As far as I can tell the write buffer is influenced by several settings, the free RAM and timers :). It will be different for virtually every environment. I didn't know about the readahead buffer, though. I read that you can double it with the POSIX_FADV_SEQUENTIAL advice on the file. But to be honest, this probably has little effect unless you process the data while reading tiny blocks of it -> one large read is faster than lots of small reads. I tried the POSIX_FADV_SEQUENTIAL flag on my copy routine and it had no observable effect. POSIX_FADV_NOREUSE doesn't seem to be implemented :D. A full DMA copy from file descriptor to file descriptor would be nice, or adjustable pipe sizes, so that splice() can do more work in the background.
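For illustration, a hedged sketch of how such a hint could be applied from D (Linux only, untested). posix_fadvise and the POSIX_FADV_* constants may not be declared in the druntime headers, so they are written out by hand here with the usual glibc values and off_t simplified to long (assuming a 64-bit build); whether the kernel actually honours the hint is, as noted above, a separate question:

----------------------------------------------------------------------
version (linux)
{
    import core.sys.posix.fcntl : open, O_RDONLY;
    import std.exception : errnoEnforce;
    import std.string : toStringz;

    // Manual binding; availability in druntime may vary.
    extern (C) nothrow @nogc
    int posix_fadvise(int fd, long offset, long len, int advice);

    enum POSIX_FADV_SEQUENTIAL = 2; // assumed glibc value
    enum POSIX_FADV_DONTNEED   = 4; // assumed glibc value

    int openForSequentialRead(string path)
    {
        int fd = open(path.toStringz, O_RDONLY);
        errnoEnforce(fd >= 0);
        // Hint that the whole file will be read front to back; the kernel
        // may respond by enlarging the readahead window, or do nothing.
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        return fd;
    }
}
----------------------------------------------------------------------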
Sep 01 2011
On 9/1/2011 1:13 PM, Marco Leise wrote:

> Although splice looks promising, it suffers from too many context switches.
> I had the best results with direct I/O and synchronized writes for buffer
> sizes from 8 MB onwards, but I found this to be too complex and probably
> system-dependent. So I settled with the memory-mapped version, which I
> rewrote using Phobos instead of POSIX calls, so it should run equally well
> on all platforms and is 5 lines of code at its core.

mmap has an issue with files larger than the mappable address space. Not that it's hard to handle that case, but it does complicate the code in ways that the other options don't.
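A minimal sketch of how that case could be handled by mapping the files in fixed-size windows instead of all at once (POSIX only, untested; the 256 MB window is an arbitrary choice for illustration, offsets stay page-aligned because the window is a multiple of the page size, and error handling is reduced to errnoEnforce):

----------------------------------------------------------------------
version (Posix)
{
    import core.stdc.string : memcpy;
    import core.sys.posix.fcntl : open, O_RDONLY, O_RDWR, O_CREAT, O_TRUNC;
    import core.sys.posix.sys.mman : mmap, munmap, MAP_FAILED, MAP_SHARED,
                                     PROT_READ, PROT_WRITE;
    import core.sys.posix.sys.types : off_t;
    import core.sys.posix.unistd : close, ftruncate;
    import std.algorithm.comparison : min;
    import std.conv : octal;
    import std.exception : errnoEnforce;
    import std.file : getSize;
    import std.string : toStringz;

    void mmapCopyWindowed(string srcPath, string dstPath,
                          size_t window = 256 * 1024 * 1024)
    {
        immutable ulong total = getSize(srcPath);

        int srcFd = open(srcPath.toStringz, O_RDONLY);
        errnoEnforce(srcFd >= 0);
        scope (exit) close(srcFd);

        int dstFd = open(dstPath.toStringz, O_RDWR | O_CREAT | O_TRUNC,
                         octal!644);
        errnoEnforce(dstFd >= 0);
        scope (exit) close(dstFd);

        // The destination must have its final size before it can be mapped.
        errnoEnforce(ftruncate(dstFd, cast(off_t) total) == 0);

        for (ulong off = 0; off < total; off += window)
        {
            size_t len = cast(size_t) min(cast(ulong) window, total - off);

            void* s = mmap(null, len, PROT_READ, MAP_SHARED,
                           srcFd, cast(off_t) off);
            errnoEnforce(s != MAP_FAILED);
            scope (exit) munmap(s, len);

            void* d = mmap(null, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                           dstFd, cast(off_t) off);
            errnoEnforce(d != MAP_FAILED);
            scope (exit) munmap(d, len);

            // Copy one window; the guards unmap both views per iteration.
            memcpy(d, s, len);
        }
    }
}
----------------------------------------------------------------------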
Sep 01 2011
Marco Leise wrote:

> [original post snipped]

I changed the threaded implementation a little so that it doesn't allocate buffers dynamically:
https://gist.github.com/1188128

I hope I didn't screw up there. The idea is to have two buffers: at the same time, the source file is read into one buffer while the other buffer is written out to the destination. When both the read and the write have finished, the buffers are swapped.

-- 
Johannes Pfau
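This is not the code from the gist, but a minimal sketch of the same double-buffer idea using std.parallelism: the previously filled buffer is written in a background task while the foreground thread reads the next chunk, and the buffers swap roles each iteration (buffer size and error handling simplified for illustration):

----------------------------------------------------------------------
import std.algorithm.mutation : swap;
import std.parallelism : task, taskPool;
import std.stdio : File;

void copyDoubleBuffered(string srcPath, string dstPath,
                        size_t bufSize = 4 * 1024 * 1024)
{
    auto src = File(srcPath, "rb");
    auto dst = File(dstPath, "wb");
    auto bufA = new ubyte[bufSize];
    auto bufB = new ubyte[bufSize];

    // Prime the pipeline with the first read.
    auto filled = src.rawRead(bufA);
    while (filled.length)
    {
        // Hand the filled buffer to a background task for writing...
        auto writer = task!((File f, ubyte[] chunk) { f.rawWrite(chunk); })
                          (dst, filled);
        taskPool.put(writer);

        // ...and read the next chunk into the other buffer meanwhile.
        auto next = src.rawRead(bufB);

        writer.yieldForce();  // wait until the write has finished
        swap(bufA, bufB);     // reuse the just-written buffer for reading
        filled = next;
    }
}
----------------------------------------------------------------------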
Sep 02 2011
Marco Leise Wrote:

> [code from the original post snipped]

Looking at this code, shouldn't StopWatch.peek() be defined as a property?
Sep 02 2011
On Friday, September 02, 2011 04:58:46 zeljkog wrote:

> Looking at this code, shouldn't StopWatch.peek() be defined as a property?

Why? Its name isn't a noun, and conceptually, it's not really a property. You're "peeking" at the current time elapsed. That's very much an action, not a property.

- Jonathan M Davis
Sep 02 2011
On 9/1/2011 1:13 PM, Marco Leise wrote:

> I split the discussion with Andrei about the benefit of a multi-threaded
> file copy routine into its own thread. This is about copying a file from
> and to the same HDD - a mechanical disk with seek times.

On Windows, we should just stick with the Windows CopyFile function:
http://msdn.microsoft.com/en-us/library/aa363851(v=vs.85).aspx

And let the MS guys do their thing. Presumably they will do what works best on Windows.
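A minimal sketch of what calling CopyFile directly from D could look like (Windows only, untested; the wrapper name copyFileWin is made up for illustration, and the imports assume the core.sys.windows bindings that ship with druntime):

----------------------------------------------------------------------
version (Windows)
{
    import core.sys.windows.winbase : CopyFileW, GetLastError;
    import std.conv : to;
    import std.exception : enforce;
    import std.utf : toUTF16z;

    void copyFileWin(string from, string dest, bool failIfExists = false)
    {
        // CopyFileW returns nonzero on success; buffering, readahead and
        // attribute handling are left entirely to the OS.
        enforce(CopyFileW(from.toUTF16z, dest.toUTF16z, failIfExists) != 0,
                "CopyFileW failed, error " ~ GetLastError().to!string);
    }
}
----------------------------------------------------------------------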
Sep 02 2011
On 9/2/11, Walter Bright <newshound2 digitalmars.com> wrote:

> On Windows, we should just stick with the Windows CopyFile function:
> http://msdn.microsoft.com/en-us/library/aa363851(v=vs.85).aspx
>
> And let the MS guys do their thing. Presumably they will do what works
> best on Windows.

I've given the OP's code a few test runs but I just get inconsistent results. Sometimes the async version is twice as fast, other times a simple call via system("copy file1 file2") is faster. Anyway, I'm assuming the MS devs optimized copying beyond the little snippet we have here.. :p
Sep 02 2011
On 02.09.2011 16:08, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:

> I've given the OP's code a few test runs but I just get inconsistent
> results. Sometimes the async version is twice as fast, other times a
> simple call via system("copy file1 file2") is faster. Anyway, I'm
> assuming the MS devs optimized copying beyond the little snippet we have
> here.. :p

Yeah, to get consistent results we'd need at minimum:

- a fixed target location on disk (sectors towards the end are ~2x slower; this can be ensured by not truncating/erasing the target on every run)
- the ability to disable or clear the read cache (possible on Linux)
- giving the process real-time I/O priority
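A hedged sketch of the cache-clearing step from the list above (Linux only, requires root, untested): flush dirty pages first, then ask the kernel to drop its clean page, dentry and inode caches through the standard /proc knob.

----------------------------------------------------------------------
version (linux)
{
    import std.file : write;
    import std.process : executeShell;

    void dropCaches()
    {
        executeShell("sync");                     // flush dirty pages first
        write("/proc/sys/vm/drop_caches", "3\n"); // 3 = pagecache + dentries + inodes
    }
}
----------------------------------------------------------------------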
Sep 02 2011