digitalmars.D - Fast file copy (threaded or not?)


"Marco Leise" <Marco.Leise gmx.de> writes:

I split the discussion with Andrei about the benefit of a multi-threaded
file copy routine into its own thread.
This is about copying a file from and to the same HDD - a mechanical disk
with seek times.

My testing showed that Andrei is correct in assuming that the
kernel can optimize the small reads and writes of a multi-threaded
application. I had to use large buffers of up to 64 MB with my
"single-threaded 100% synchronized writes" version before the simple
multi-threaded version from Johannes Pfau showed 4.3% overhead during a
512 MB copy operation.

Some more things I've experimented with:

- using only system API calls instead of D wrappers:
   The difference is close to background noise

- direct I/O for writing as used by databases:
   This worked pretty well, but you may not want to use it for
   reading as it bypasses the file cache. A file that is already
   cached would be copied slower as a result.

- memory maps:
   Kernel memory is shared with userspace. This approach does
   not allocate memory in the application. It just makes pages
   of files directly accessible in user space. Once mapped, the
   whole copy operation comes down to a single 'memcpy' call.

- splice (zero-copy):
   This is a Linux system call that allows memory operations inside
   the kernel to be controlled from user space. The benefit is
   that the CPU never copies this memory from kernel to
   user space. Unfortunately the copy operation goes like this:
   "source file -> pipe, pipe -> destination file"
   A pipe is a hard-coded 64 KB buffer, so it is not easy to move
   large chunks of data in a single call to splice(). 512 MB are
   still divided into 16,000+ calls.
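
The call count for splice can be checked with a little arithmetic. This is a minimal sketch, assuming every 64 KB chunk costs one splice() into the pipe and one splice() out of it; `spliceCallCount` is a hypothetical helper written for illustration, not a library function.

```d
/// Number of splice() calls needed to move `fileSize` bytes when every
/// chunk has to pass through a pipe that holds at most `pipeSize` bytes.
/// Assumption: one call fills the pipe, a second call drains it.
ulong spliceCallCount(ulong fileSize, ulong pipeSize)
{
    // Chunks needed, rounding up for a partial last chunk.
    immutable chunks = (fileSize + pipeSize - 1) / pipeSize;
    // Each chunk costs two calls: file -> pipe, then pipe -> file.
    return 2 * chunks;
}
```

For 512 MB and a 64 KB pipe this gives 2 * 8192 = 16384 calls.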

Although splice looks promising, it suffers from too many context switches.
I had the best results with direct I/O and synchronized writes for
buffer sizes from 8 MB onwards, but I found this to be too complex and
probably system dependent. So I settled on the memory-mapped version,
which I rewrote using Phobos instead of POSIX calls, so it should run
equally well on all platforms and is 5 lines of code at its core:

----------------------------------------------------------------------

import std.datetime, std.exception, std.stdio, std.mmfile;

void main(string[] args)
{
     if (!enforce(args.length == 3, {
         stderr.writefln("%s SOURCE DEST", args[0]);
     })) return;

     auto sw = StopWatch();
     sw.start();

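     // Map both files completely; the destination is created with the
     // source's length so the slice copy below fits exactly.
     auto src = new MmFile(args[1], MmFile.Mode.Read, 0, null, 0);
     auto dst = new MmFile(args[2], MmFile.Mode.ReadWriteNew, src.length,
                           null, src.length);
     auto data = dst[];
     data[] = src[];   // the whole copy is a single slice assignment
     dst.flush();

     sw.stop();
     writefln("Copied %s bytes in %s msec (%s kB/s)", src.length,
              sw.peek().msecs,
              1_000_000 * src.length / (1024 * sw.peek().usecs));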
}

----------------------------------------------------------------------

This leaves it up to the kernel how to interleave disk reads and writes.

- Marco

Sep 01 2011

"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

On Thu, 01 Sep 2011 23:13:19 +0300, Marco Leise <Marco.Leise gmx.de> wrote:

 So I settled with the memory mapped version,

I wouldn't advise using memory-mapped files under the hood for anything
without prior extensive testing in low-memory conditions. The kernel will
be reluctant to drop pages in the file that have already been
read/written. Some hinting APIs could be used, but these are not portable
or reliable. (I recently had to rewrite one of my programs that used
memory-mapped files, with this as one of the reasons.)

-- 
Best regards,
  Vladimir                            mailto:vladimir thecybershadow.net

Sep 01 2011

"Marco Leise" <Marco.Leise gmx.de> writes:

On 01.09.2011 at 22:43, Vladimir Panteleev
<vladimir thecybershadow.net> wrote:

 On Thu, 01 Sep 2011 23:13:19 +0300, Marco Leise <Marco.Leise gmx.de>  
 wrote:

 So I settled with the memory mapped version,

 I wouldn't advise using memory-mapped files under the hood for anything  
 without prior extensive testing in low-memory conditions. The kernel  
 will be reluctant to drop pages in the file that have already been  
 read/written. Some hinting APIs could be used, but these are not  
 portable or reliable. (I recently had to rewrite one of my programs  
 which used memory-mapped files with this as one of the reasons).

So cached data from memory-mapped files is not handled the same way that  
cache from normal reads/writes is handled? Good catch, I'll remember that.

Sep 01 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Thursday, September 01, 2011 13:13 Marco Leise wrote:
 I split the discussion with Andrei about the benefit of a multi-threaded
 file copy routine to its own thread.
 This is about copying a file from and to the same HDD - a mechanical disk
 with seek times.
 

I would point out that regardless of what happens with performance with
synchronous vs asynchronous I/O on a single HDD, it's pretty much a guarantee
that in the general case asynchronous I/O is going to be faster when dealing
with different HDDs. So, while we should definitely get hard data, unless
copying asynchronously on a single hard drive is significantly worse than
copying synchronously, it's pretty much a given that we'd want to go with
asynchronous I/O by default. If it were found that asynchronous I/O was
significantly better on a single HDD, that would make the question much more
interesting, but as long as it's at least close to - if not better than -
synchronous I/O on the same HDD, asynchronous I/O would be the way to go.

- Jonathan M Davis

Sep 01 2011

"Marco Leise" <Marco.Leise gmx.de> writes:

On 01.09.2011 at 22:38, Jonathan M Davis <jmdavisProg gmx.com> wrote:

 I would point out that regardless of what happens with performance with
 synchronous vs asynchronous I/O on a single HDD, it's pretty much a
 guarantee that in the general case asynchronous I/O is going to be faster
 when dealing with different HDDs. So, while we should definitely get hard
 data, unless copying asynchronously on a single hard drive is
 significantly worse than copying synchronously, it's pretty much a given
 that we'd want to go with asynchronous I/O by default. If it were found
 that asynchronous I/O was significantly better on a single HDD, that
 would make the question much more interesting, but as long as it's at
 least close to - if not better than - synchronous I/O on the same HDD,
 asynchronous I/O would be the way to go.

 - Jonathan M Davis

I guess you are right. Neither my expectations nor Andrei's were met.
I/O from multiple threads to a single device is handled remarkably well on
today's systems. While it looked like a no-go to me and others on the net,
we see no major difference in performance between the two approaches with
typical buffer sizes and Phobos routines. If you want to go for the extra
5% in some cases, you can use that 100 MB buffer, OS-specific functions
and file usage hints, but that's never good for a standard library routine
that is meant to be short, solid and portable.

Sep 01 2011

Johannes Pfau <spam example.com> writes:

Marco Leise wrote:
I split the discussion with Andrei about the benefit of a
multi-threaded file copy routine to its own thread.
This is about copying a file from and to the same HDD - a mechanical
disk with seek times.


Related link:
http://www.devshed.com/c/a/BrainDump/Advising-the-Linux-Kernel-on-File-IO/

More related information:

Linux maximum readahead buffer is 128 KB (but I think that can be
overridden).

It seems like there's no 'per file' limit for the write buffer. The
only limit seems to be the memory available for caching (for example,
in my case with 3 GB of RAM, 1118 MB are available for the write cache).

-- 
Johannes Pfau

Sep 01 2011

"Marco Leise" <Marco.Leise gmx.de> writes:

On 01.09.2011 at 23:55, Johannes Pfau <spam example.com> wrote:

 Related link:
 http://www.devshed.com/c/a/BrainDump/Advising-the-Linux-Kernel-on-File-IO/

 More related information:

 Linux maximum readahead buffer is 128KB (but I think that can be
 overwritten).

 It seems like there's no 'per file' limit for the write buffer. The
 only limit seems to be the memory available for caching (for example,
 in my case with 3GB of ram 1118MB are available for the write cache)

As far as I can tell the write buffer is influenced by several settings,
the free RAM and timers :). It will be different for virtually every
environment. I didn't know about the readahead buffer, though. I read that
you can double it with the POSIX_FADV_SEQUENTIAL advice on the file.
But to be honest, this probably has little effect unless you
process the data while reading tiny blocks of it, since one large read is
faster than lots of small reads. I tried the POSIX_FADV_SEQUENTIAL flag
in my copy routine and it had no observable influence. POSIX_FADV_NOREUSE
doesn't seem to be implemented :D . A full DMA copy from file descriptor
to file descriptor would be nice, or adjustable pipe sizes, so that
splice() can do more work in the background.
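
For anyone who wants to repeat the POSIX_FADV_SEQUENTIAL experiment from D, here is a minimal sketch. It assumes 64-bit Linux (where off_t is a 64-bit integer) and declares the C function by hand instead of relying on a druntime binding; `adviseSequential` is a hypothetical helper name, and the constant value 2 is the one used by the Linux headers.

```d
import std.stdio : File;

// Hand-written binding for the C library function posix_fadvise().
// Assumption: 64-bit Linux, where off_t is a 64-bit signed integer.
extern (C) int posix_fadvise(int fd, long offset, long len, int advice)
    nothrow @nogc;

enum POSIX_FADV_SEQUENTIAL = 2; // value from <fcntl.h> on Linux

/// Tells the kernel that `f` will be read sequentially.
/// Returns 0 on success, otherwise an error number.
int adviseSequential(ref File f)
{
    // Offset 0 with length 0 means "from the start to the end of the file".
    return posix_fadvise(f.fileno, 0, 0, POSIX_FADV_SEQUENTIAL);
}
```

The advice is only a hint; the kernel is free to ignore it, which would match the "no observable influence" result above.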

Sep 01 2011

Brad Roberts <braddr puremagic.com> writes:

On 9/1/2011 1:13 PM, Marco Leise wrote:
 
 Although splice looks promising, it suffers from too many context switches.
 I had the best results with direct I/O and synchronized writes for
 buffer sizes from 8 MB onwards, but I found this to be too complex and
 probably system dependent. So I settled on the memory-mapped version,
 which I rewrote using Phobos instead of POSIX calls, so it should run
 equally well on all platforms and is 5 lines of code at its core:

 
 - Marco

mmap has an issue with files larger than the mappable address space. Not that
it's hard to handle that case, but it does complicate the code in ways that
the other options don't have problems with.
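
One way to handle that case is to map the files one window at a time. This is a minimal sketch, assuming the `window` parameter of std.mmfile's MmFile behaves as documented and is a multiple of the system's allocation granularity; `copyWindowed` is a hypothetical name, the `Mode.read` / `Mode.readWriteNew` spellings are the ones used by current Phobos (the code earlier in the thread uses the older capitalized names), and empty source files are not handled.

```d
import std.algorithm.comparison : min;
import std.mmfile : MmFile;

/// Copies `from` to `to` through memory maps without ever mapping the
/// whole file: only a few windows of each file are mapped at a time.
void copyWindowed(string from, string to, size_t window = 64 * 1024 * 1024)
{
    auto src = new MmFile(from, MmFile.Mode.read, 0, null, window);
    scope (exit) destroy(src);   // unmap and close deterministically
    immutable ulong total = src.length;

    auto dst = new MmFile(to, MmFile.Mode.readWriteNew, total, null, window);
    scope (exit) destroy(dst);   // unmap and close deterministically

    for (ulong off = 0; off < total; off += window)
    {
        immutable ulong end = min(off + window, total);
        // Slicing remaps the window if [off, end) is not mapped yet.
        auto target = cast(ubyte[]) dst[off .. end];
        target[] = cast(const(ubyte)[]) src[off .. end];
    }
}
```

The extra loop and the window bookkeeping are the complication mentioned above; the plain read/write variants need neither.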

Sep 01 2011

Johannes Pfau <spam example.com> writes:

Marco Leise wrote:
I split the discussion with Andrei about the benefit of a
multi-threaded file copy routine to its own thread.
This is about copying a file from and to the same HDD - a mechanical
disk with seek times.


I changed the threaded implementation a little so that it doesn't
allocate buffers dynamically:
https://gist.github.com/1188128

I hope I didn't screw up there. The idea is to have 2 buffers. Then, at
the same time, one buffer is read from and one buffer is written to.
When both read & write are finished, the buffers are swapped.
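
The two-buffer scheme can be sketched as follows. This is not the code from the gist, only a minimal illustration under assumptions: it uses std.parallelism's task pool for the writer, and `copyDoubleBuffered` and `writeChunk` are hypothetical names.

```d
import std.parallelism : task, taskPool;
import std.stdio : File;

// Run by a worker thread: writes one filled buffer to the destination.
void writeChunk(File dst, const(ubyte)[] chunk)
{
    dst.rawWrite(chunk);
}

/// Copies `from` to `to` with two fixed buffers: while one buffer is
/// being written, the other one is being filled by the next read.
void copyDoubleBuffered(string from, string to, size_t bufSize = 1024 * 1024)
{
    auto src = File(from, "rb");
    auto dst = File(to, "wb");

    // Both buffers are allocated once, up front.
    ubyte[][2] bufs = [new ubyte[bufSize], new ubyte[bufSize]];
    size_t cur = 0;

    auto chunk = src.rawRead(bufs[cur]);   // fill the first buffer
    while (chunk.length > 0)
    {
        auto writer = task!writeChunk(dst, chunk);
        taskPool.put(writer);                     // write in the background
        auto next = src.rawRead(bufs[1 - cur]);   // read into the other buffer
        writer.yieldForce();                      // wait until the write is done
        cur = 1 - cur;                            // swap the buffers
        chunk = next;
    }
    dst.close();   // close explicitly; the tasks hold copies of the handle
}
```

Reads and writes overlap only one chunk at a time here, so larger `bufSize` values trade memory for fewer synchronization points.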
-- 
Johannes Pfau

Sep 02 2011

zeljkog <zeljkog nospam.com> writes:

Marco Leise Wrote:
 
 void main(string[] args)
 {
      if (!enforce(args.length == 3, {
          stderr.writefln("%s SOURCE DEST", args[0]);
      })) return;
 
      auto sw = StopWatch();
      sw.start();
 
      auto src = new MmFile(args[1], MmFile.Mode.Read, 0, null, 0);
      auto dst = new MmFile(args[2], MmFile.Mode.ReadWriteNew, src.length,
                            null, src.length);
      auto data = dst[];
      data[] = src[];
      dst.flush();
 
      sw.stop();
      writefln("Copied %s bytes in %s msec (%s kB/s)", src.length,
               sw.peek().msecs,
               1_000_000 * src.length / (1024 * sw.peek().usecs));
 }
 
 - Marco

Looking at this code, should StopWatch.peek() be defined as a property?

Sep 02 2011

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Friday, September 02, 2011 04:58:46 zeljkog wrote:
 Marco Leise Wrote:
 void main(string[] args)
 {
      if (!enforce(args.length == 3, {
          stderr.writefln("%s SOURCE DEST", args[0]);
      })) return;
 
      auto sw = StopWatch();
      sw.start();
 
      auto src = new MmFile(args[1], MmFile.Mode.Read, 0, null, 0);
      auto dst = new MmFile(args[2], MmFile.Mode.ReadWriteNew, src.length,
                            null, src.length);
      auto data = dst[];
      data[] = src[];
      dst.flush();
 
      sw.stop();
      writefln("Copied %s bytes in %s msec (%s kB/s)", src.length,
               sw.peek().msecs,
               1_000_000 * src.length / (1024 * sw.peek().usecs));
 }
 
 - Marco

 
 Looking at this code, should StopWatch.peek() be defined as a property?

Why? Its name isn't a noun, and conceptually, it's not really a property.
You're "peeking" at the current time elapsed. That's very much an action, not
a property.

- Jonathan M Davis

Sep 02 2011

Walter Bright <newshound2 digitalmars.com> writes:

On 9/1/2011 1:13 PM, Marco Leise wrote:
 I split the discussion with Andrei about the benefit of a multi-threaded file
 copy routine to its own thread.
 This is about copying a file from and to the same HDD - a mechanical disk with
 seek times.

On Windows, we should just stick with the Windows CopyFile function:

http://msdn.microsoft.com/en-us/library/aa363851(v=vs.85).aspx

And let the MS guys do their thing. Presumably they will do what works best on 
Windows.
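
For comparison runs, the existing Phobos routine can be timed directly. This is a minimal sketch, assuming that std.file.copy delegates to the OS copy routine on Windows (as far as I know it does); `timedCopy` is a hypothetical helper, and the StopWatch used here is the one from std.datetime.stopwatch in current Phobos.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.file : copy;

/// Copies `from` to `to` with std.file.copy and returns the elapsed time.
Duration timedCopy(string from, string to)
{
    auto sw = StopWatch(AutoStart.yes);
    copy(from, to);   // platform-specific implementation inside Phobos
    sw.stop();
    return sw.peek();
}
```

A caller could print the result with `timedCopy("a", "b").total!"msecs"`.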

Sep 02 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 9/2/11, Walter Bright <newshound2 digitalmars.com> wrote:
 On Windows, we should just stick with the Windows CopyFile function:

 http://msdn.microsoft.com/en-us/library/aa363851(v=vs.85).aspx

 And let the MS guys do their thing. Presumably they will do what works best
 on Windows.

I've given OP's code a few test runs but I just get inconsistent
results. Sometimes the async version is twice as fast, other times a
simple call via system("copy file1 file2") is faster.

Anyway, I'm assuming the MS devs optimized copying beyond the little
snippet we have here.. :p

Sep 02 2011

"Marco Leise" <Marco.Leise gmx.de> writes:

On 02.09.2011 at 16:08, Andrej Mitrovic
<andrej.mitrovich gmail.com> wrote:

 On 9/2/11, Walter Bright <newshound2 digitalmars.com> wrote:
 On Windows, we should just stick with the Windows CopyFile function:

 http://msdn.microsoft.com/en-us/library/aa363851(v=vs.85).aspx

 And let the MS guys do their thing. Presumably they will do what works
 best on Windows.

 I've given OP's code a few test runs but I just get inconsistent
 results. Sometimes the async version is twice as fast, other times a
 simple call via system("copy file1 file2") is faster.

 Anyway, I'm assuming the MS devs optimized copying beyond the little
 snippet we have here.. :p

Yeah, to get consistent results we'd need at minimum:
- a fixed target location on disk
   (sectors toward the end of the disk are ~2x slower; this
    can be ensured by not truncating/erasing the target on every run)
- the ability to disable / clear the read cache (possible on Linux)
- real-time I/O priority for the process

Sep 02 2011
