
digitalmars.D - Fast file copy (threaded or not?)

"Marco Leise" <Marco.Leise gmx.de> writes:
I split the discussion with Andrei about the benefit of a multi-threaded  
file copy routine to its own thread.
This is about copying a file from and to the same HDD - a mechanical disk  
with seek times.

My testing showed that Andrei is correct in assuming that the kernel
can optimize the small reads and writes of a multi-threaded
application. I had to use large buffers of up to 64 MB with my
"single-threaded 100% synchronized writes" version before the simple
multi-threaded version from Johannes Pfau showed 4.3% overhead during a
512 MB copy operation.
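
For readers who want something concrete to compare against, here is a minimal sketch of a single-threaded copy loop with one large reusable buffer. It is not the code used for the measurements above: the name `copyChunked` and the 8 MB default are assumptions made for illustration, and the sketch does not force synchronous writes.

```d
import std.exception : enforce;
import std.stdio : File;

/// Copies srcPath to dstPath through one reusable buffer and returns the
/// number of bytes copied. The 8 MB default is an arbitrary assumption.
ulong copyChunked(string srcPath, string dstPath,
                  size_t bufferSize = 8 * 1024 * 1024)
{
    enforce(bufferSize > 0, "bufferSize must be non-zero");

    auto src = File(srcPath, "rb");
    auto dst = File(dstPath, "wb");
    auto buffer = new ubyte[](bufferSize);
    ulong total = 0;

    for (;;)
    {
        // rawRead returns the slice of `buffer` that was actually filled;
        // an empty slice means end of file.
        auto chunk = src.rawRead(buffer);
        if (chunk.length == 0)
            break;
        dst.rawWrite(chunk);
        total += chunk.length;
    }
    dst.flush();
    return total;
}
```

Because the file is processed one buffer at a time, this variant has no trouble with files larger than the address space; the buffer size is the only knob.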

Some more things I've experimented with:

- using only system API calls instead of D wrappers:
   The difference is close to background noise

- direct I/O for writing as used by databases:
   This worked pretty well, but you may not want to use it for
   reading as it bypasses the file cache. A file that is already
   cached would be copied slower as a result.

- memory maps:
   Kernel memory is shared with userspace. This approach does
   not allocate memory in the application. It just makes pages
   of files directly accessible in user space. Once mapped, the
   whole copy operation comes down to a single 'memcpy' call.

- splice (zero-copy):
   This is a Linux system call that allows memory operations inside
   the kernel to be controlled from user space. The benefit is
   that the CPU never copies this memory from kernel to
   user space. Unfortunately the copy operation goes like this:
   "source file -> pipe, pipe -> destination file"
   A pipe buffer holds 64 KB by default, so it is not easy to move
   large chunks of data in a single call to splice(). 512 MB are
   still divided into 16,000+ calls (8,192 chunks of 64 KB, two
   calls each).
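
The two-step splice route described in the last item can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was benchmarked: it is Linux-only, the `splice` binding is declared by hand, and `copyWithSplice` and its 64 KB chunk size are names and values chosen for this example.

```d
version (linux):

import core.sys.posix.unistd : close, pipe;
import std.exception : errnoEnforce;
import std.stdio : File;

// Hand-written binding for the Linux splice(2) system call (assumption:
// declared here so the sketch does not depend on druntime shipping one).
extern (C) nothrow @nogc
ptrdiff_t splice(int fdIn, long* offIn, int fdOut, long* offOut,
                 size_t len, uint flags);

/// Copies srcPath to dstPath without moving the data through user space.
/// Returns the number of bytes copied.
ulong copyWithSplice(string srcPath, string dstPath,
                     size_t chunkSize = 64 * 1024)
{
    auto src = File(srcPath, "rb");
    auto dst = File(dstPath, "wb");

    int[2] fds; // fds[0] is the read end, fds[1] the write end
    errnoEnforce(pipe(fds) == 0, "pipe() failed");
    scope (exit)
    {
        close(fds[0]);
        close(fds[1]);
    }

    ulong total = 0;
    for (;;)
    {
        // Step 1: source file -> pipe (at most one pipe buffer per call).
        auto moved = splice(src.fileno, null, fds[1], null, chunkSize, 0);
        errnoEnforce(moved >= 0, "splice(file -> pipe) failed");
        if (moved == 0)
            break; // end of file

        // Step 2: pipe -> destination file; loop until the pipe is drained.
        auto pending = cast(size_t) moved;
        while (pending > 0)
        {
            auto written = splice(fds[0], null, dst.fileno, null, pending, 0);
            errnoEnforce(written > 0, "splice(pipe -> file) failed");
            pending -= written;
        }
        total += moved;
    }
    return total;
}
```

Every chunk costs at least two system calls, which is where the call count quoted above comes from.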

Although splice looks promising, it suffers from too many context switches.
I had the best results with direct I/O and synchronized writes for
buffer sizes from 8 MB onwards, but I found this to be too complex and
probably system dependent. So I settled on the memory-mapped version,
which I rewrote using Phobos instead of POSIX calls, so it should run
equally well on all platforms and is 5 lines of code at its core:

----------------------------------------------------------------------

import std.datetime, std.exception, std.stdio, std.mmfile;

void main(string[] args)
{
     if (!enforce(args.length == 3, {
         stderr.writefln("%s SOURCE DEST", args[0]);
     })) return;

     auto sw = StopWatch();
     sw.start();

     auto src = new MmFile(args[1], MmFile.Mode.Read, 0, null, 0);
     auto dst = new MmFile(args[2], MmFile.Mode.ReadWriteNew, src.length,
                           null, src.length);
     auto data = dst[];
     data[] = src[];
     dst.flush();

     sw.stop();
     writefln("Copied %s bytes in %s msec (%s kB/s)", src.length,
              sw.peek().msecs,
              1_000_000 * src.length / (1024 * sw.peek().usecs));
}

----------------------------------------------------------------------

This leaves it up to the kernel how to interleave disk reads and writes.

- Marco
Sep 01 2011
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Thu, 01 Sep 2011 23:13:19 +0300, Marco Leise <Marco.Leise gmx.de> wrote:

 So I settled with the memory mapped version,

I wouldn't advise using memory-mapped files under the hood for anything
without prior extensive testing in low-memory conditions. The kernel will
be reluctant to drop pages in the file that have already been read/written.
Some hinting APIs could be used, but these are not portable or reliable.
(I recently had to rewrite one of my programs which used memory-mapped
files with this as one of the reasons).

-- 
Best regards,
 Vladimir
 mailto:vladimir thecybershadow.net
Sep 01 2011
"Marco Leise" <Marco.Leise gmx.de> writes:
On 01.09.2011 at 22:43, Vladimir Panteleev
<vladimir thecybershadow.net> wrote:

 I wouldn't advise using memory-mapped files under the hood for anything
 without prior extensive testing in low-memory conditions. The kernel will
 be reluctant to drop pages in the file that have already been
 read/written. Some hinting APIs could be used, but these are not portable
 or reliable. (I recently had to rewrite one of my programs which used
 memory-mapped files with this as one of the reasons).

So cached data from memory-mapped files is not handled the same way as
cached data from normal reads/writes? Good catch, I'll remember that.
Sep 01 2011
"Jonathan M Davis" <jmdavisProg gmx.com> writes:
I would point out that regardless of what happens with performance with
synchronous vs asynchronous I/O on a single HDD, it's pretty much a
guarantee that in the general case asynchronous I/O is going to be faster
when dealing with different HDDs. So, while we should definitely get hard
data, unless copying asynchronously on a single hard drive is significantly
worse than copying synchronously, then it's pretty much a given that we'd
want to go with asynchronous I/O by default.

If it were found that asynchronous I/O was significantly better on a single
HDD, then that makes the question much more interesting, but as long as
it's at least close - if not better - than synchronous I/O on the same HDD,
then asynchronous I/O would be the way to go.

- Jonathan M Davis
Sep 01 2011
"Marco Leise" <Marco.Leise gmx.de> writes:
On 01.09.2011 at 22:38, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 I would point out that regardless of what happens with performance with
 synchronous vs asynchronous I/O on a single HDD, it's pretty much a
 guarantee that in the general case asynchronous I/O is going to be faster
 when dealing with different HDDs. So, while we should definitely get hard
 data, unless copying asynchronously on a single hard drive is
 significantly worse than copying synchronously, then it's pretty much a
 given that we'd want to go with asynchronous I/O by default. If it were
 found that asynchronous I/O was significantly better on a single HDD, then
 that makes the question much more interesting, but as long as it's at
 least close - if not better - than synchronous I/O on the same HDD, then
 asynchronous I/O would be the way to go.

 - Jonathan M Davis
I guess you are right. Neither mine nor Andrei's expectations were met. I/O from multiple threads to a single device is handled remarkably well on today's systems. While it looked to me and others on the net like a no-go, we see no major difference in the performance of both approaches with typical buffer sizes and Phobos routines. If you want to go for the extra 5% in some cases you can go for that 100 MB buffer, OS specific functions and file usage hints, but that's never good for a standard library routine that is meant to be short, solid and portable.
Sep 01 2011
Johannes Pfau <spam example.com> writes:
Related link:
http://www.devshed.com/c/a/BrainDump/Advising-the-Linux-Kernel-on-File-IO/

More related information:

Linux maximum readahead buffer is 128KB (but I think that can be
overridden).

It seems like there's no 'per file' limit for the write buffer. The
only limit seems to be the memory available for caching (for example,
in my case with 3GB of RAM, 1118MB are available for the write cache).

-- 
Johannes Pfau
Sep 01 2011
"Marco Leise" <Marco.Leise gmx.de> writes:
On 01.09.2011 at 23:55, Johannes Pfau <spam example.com> wrote:

 Related link:
 http://www.devshed.com/c/a/BrainDump/Advising-the-Linux-Kernel-on-File-IO/

 More related information:

 Linux maximum readahead buffer is 128KB (but I think that can be
 overwritten).

 It seems like there's no 'per file' limit for the write buffer. The
 only limit seems to be the memory available for caching (for example,
 in my case with 3GB of ram 1118MB are available for the write cache)
As far as I can tell, the write buffer is influenced by several settings,
the free RAM and timers :). It will be different for virtually every
environment. I didn't know about the readahead buffer, though. I read that
you can double it with the POSIX_FADV_SEQUENTIAL advice on the file. But to
be honest, this probably has little effect unless you process the data
while reading tiny blocks of it -> one large read is faster than lots of
small reads. I tried this POSIX_FADV_SEQUENTIAL flag on my copy routine and
it had no observable influence. POSIX_FADV_NOREUSE doesn't seem to be
implemented :D .

A full DMA copy from file descriptor to file descriptor would be nice, or
adjustable pipe sizes, so that splice() can do more stuff in the background.
Sep 01 2011
Brad Roberts <braddr puremagic.com> writes:
On 9/1/2011 1:13 PM, Marco Leise wrote:
 
 Although splice looks promising it suffers from too many context switches.
 I had the best results with direct I/O and using synchronized writes for
 buffer sizes from 8 MB onwards, but I found this to be too complex and
 probably system dependent. So I settled with the memory mapped version,
 that I rewrote using Phobos instead of POSIX calls, so it should run
 equally well on all platforms and is 5 lines of code at its core:

 - Marco

mmap has an issue with files larger than the mappable address space. Not
that it's hard to handle that case, but it does complicate the code in ways
that the other options don't.
Sep 01 2011
Johannes Pfau <spam example.com> writes:
I changed the threaded implementation a little so that it doesn't allocate
buffers dynamically:
https://gist.github.com/1188128

I hope I didn't screw up there. The idea is to have 2 buffers. Then, at the
same time, one buffer is read from and one buffer is written to. When both
read & write are finished, the buffers are swapped.

-- 
Johannes Pfau
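
The two-buffer idea can be illustrated with a short sketch. This is not the code from the gist: it is a minimal version built on `std.parallelism`, and the names `writeChunk` and `copyDoubleBuffered` as well as the 1 MB default are assumptions made for this example.

```d
import std.exception : enforce;
import std.parallelism : task, taskPool;
import std.stdio : File;

// Runs on a worker thread: writes one filled buffer to the destination.
void writeChunk(File dst, const(ubyte)[] chunk)
{
    dst.rawWrite(chunk);
}

/// Copies srcPath to dstPath with two fixed buffers: while one buffer is
/// written out by a worker, the other is filled from the source.
ulong copyDoubleBuffered(string srcPath, string dstPath,
                         size_t bufferSize = 1024 * 1024)
{
    enforce(bufferSize > 0, "bufferSize must be non-zero");

    auto src = File(srcPath, "rb");
    auto dst = File(dstPath, "wb");
    ubyte[][2] buffers = [new ubyte[](bufferSize), new ubyte[](bufferSize)];

    size_t current = 0;
    ubyte[] filled = src.rawRead(buffers[current]);
    ulong total = 0;

    while (filled.length > 0)
    {
        // Hand the filled buffer to the task pool for writing...
        auto writer = task!writeChunk(dst, filled);
        taskPool.put(writer);

        // ...and meanwhile read the next chunk into the other buffer.
        current = 1 - current;
        ubyte[] next = src.rawRead(buffers[current]);

        // Wait for the write to finish before the buffers swap roles.
        writer.yieldForce();
        total += filled.length;
        filled = next;
    }
    dst.flush();
    return total;
}
```

Both buffers are allocated once up front, so nothing is allocated per chunk apart from whatever the task pool does internally.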
Sep 02 2011
zeljkog <zeljkog nospam.com> writes:
Marco Leise Wrote:
 
 void main(string[] args)
 {
      if (!enforce(args.length == 3, {
          stderr.writefln("%s SOURCE DEST", args[0]);
      })) return;
 
      auto sw = StopWatch();
      sw.start();
 
      auto src = new MmFile(args[1], MmFile.Mode.Read, 0, null, 0);
      auto dst = new MmFile(args[2], MmFile.Mode.ReadWriteNew, src.length,
                            null, src.length);
      auto data = dst[];
      data[] = src[];
      dst.flush();
 
      sw.stop();
      writefln("Copied %s bytes in %s msec (%s kB/s)", src.length,
               sw.peek().msecs,
               1_000_000 * src.length / (1024 * sw.peek().usecs));
 }
 
 - Marco
Looking at this code, should StopWatch.peek() be defined as a property?
Sep 02 2011
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Friday, September 02, 2011 04:58:46 zeljkog wrote:
 Looking at this code, should StopWatch.peek() be defined as a property?
Why? Its name isn't a noun, and conceptually, it's not really a property.
You're "peeking" at the current time elapsed. That's very much an action,
not a property.

- Jonathan M Davis
Sep 02 2011
Walter Bright <newshound2 digitalmars.com> writes:
On 9/1/2011 1:13 PM, Marco Leise wrote:
 I split the discussion with Andrei about the benefit of a multi-threaded file
 copy routine to its own thread.
 This is about copying a file from and to the same HDD - a mechanical disk with
 seek times.
On Windows, we should just stick with the Windows CopyFile function:

http://msdn.microsoft.com/en-us/library/aa363851(v=vs.85).aspx

And let the MS guys do their thing. Presumably they will do what works best
on Windows.
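
For reference, Phobos already follows this route: on Windows, `std.file.copy` hands the work to the Win32 `CopyFileW` call, while on POSIX systems it falls back to its own read/write loop. A minimal sketch of relying on it (the wrapper name `copyWithPlatformRoutine` is made up for this example):

```d
import std.file : copy, getSize;

/// Thin wrapper that lets the platform copy routine do the work and
/// returns the size of the resulting file in bytes.
ulong copyWithPlatformRoutine(string srcPath, string dstPath)
{
    copy(srcPath, dstPath); // overwrites dstPath if it already exists
    return getSize(dstPath);
}
```

The trade-off is that the caller gives up control over buffer sizes and threading in exchange for whatever the operating system does internally.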
Sep 02 2011
Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 9/2/11, Walter Bright <newshound2 digitalmars.com> wrote:
 On Windows, we should just stick with the Windows CopyFile function:

 http://msdn.microsoft.com/en-us/library/aa363851(v=vs.85).aspx

 And let the MS guys do their thing. Presumably they will do what works
 best on Windows.

I've given OP's code a few test runs but I just get inconsistent results. Sometimes the async version is twice as fast, other times a simple call via system("copy file1 file2") is faster. Anyway, I'm assuming the MS devs optimized copying beyond the little snippet we have here.. :p
Sep 02 2011
"Marco Leise" <Marco.Leise gmx.de> writes:
On 02.09.2011 at 16:08, Andrej Mitrovic
<andrej.mitrovich gmail.com> wrote:

 I've given OP's code a few test runs but I just get inconsistent results.
 Sometimes the async version is twice as fast, other times a simple call
 via system("copy file1 file2") is faster. Anyway, I'm assuming the MS devs
 optimized copying beyond the little snippet we have here.. :p

Yeah, to get consistent results we'd need at minimum:

- fixed target location on disk (sectors to the end are ~2x slower, can be
  ensured by not truncating/erasing the target on every run)
- ability to disable / clear the read cache (possible on Linux)
- give the process real-time I/O priority
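
None of these controls is implemented in the sketch below; it only covers the remaining piece of a fair comparison, which is repeating each measurement and looking at the spread rather than a single run. It assumes a current Phobos where `StopWatch` lives in `std.datetime.stopwatch`, and `timeRuns` and `median` are hypothetical helper names.

```d
import core.time : Duration;
import std.algorithm.sorting : sort;
import std.datetime.stopwatch : AutoStart, StopWatch;

/// Runs `op` `runs` times and returns the timings sorted ascending.
Duration[] timeRuns(scope void delegate() op, size_t runs)
{
    auto timings = new Duration[](runs);
    foreach (ref t; timings)
    {
        auto sw = StopWatch(AutoStart.yes);
        op();
        t = sw.peek();
    }
    sort(timings);
    return timings;
}

/// Middle element of an already sorted, non-empty list of timings.
Duration median(const Duration[] sorted)
{
    assert(sorted.length > 0, "need at least one timing");
    return sorted[$ / 2];
}
```

Comparing the fastest run, the median and the slowest run of each copy routine gives a first impression of how much of a measured difference is just noise.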
Sep 02 2011