
digitalmars.D - memcpy vs slice copy

reply bearophile <bearophileHUGS lycos.com> writes:
While doing some string processing I've seen some unusual timings compared to
the C code, so I have written this to see the situation better.
When USE_MEMCPY is false this little benchmark runs about 3+ times slower:

import std.c.stdlib: malloc;
import std.c.string: memcpy;
import std.stdio: writefln;

const bool USE_MEMCPY = true;

void main() {
    const int N = 100_000_000;

    auto h = "hello\n";
    char* ptr = cast(char*)malloc(N * h.length);
    char* p = ptr;

    for (int i; i < (h.length * N); i += h.length) {
        static if (USE_MEMCPY)
            memcpy(p, h.ptr, h.length);
        else
            p[0 .. h.length] = h;

        p += h.length;
    }

    if (N <= 100) // to see if it works decrease N
        writefln(ptr[0 .. N * h.length]);
}

As usual with benchmarks I may be missing things (especially because now it's
quite late here), so keep your eyes open.

Bye,
bearophile
Mar 14 2009
next sibling parent reply Moritz Warning <moritzwarning web.de> writes:
On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:

 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
I did a little benchmark:

ldc -release -O5
true:  0.51
false: 0.63

dmd -release -O
true:  4.47
false: 3.58

I don't see a very big difference between slice copying and memcpy
(but between compilers).

Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
Mar 15 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Moritz Warning:
 I don't see a very big difference between slice copying and memcpy (but 
 between compilers).
I have taken the times again. My timings, best of 4:

true:  1.33 s
false: 4.28 s

I have used dmd 1.041, with Phobos, on WinXP 32 bit, 2 GB RAM, CPU
Core 2 at 2 GHz.

This may be another little benchmark for Robert Clipsham :-)

Bye,
bearophile
Mar 15 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
For reference here's a simple C version:

#include "stdlib.h"
#include "string.h"
#include "stdio.h"

#define N 100000000
#define L 6

char h[L] = "hello\n";

int main() {
    char *ptr;
    if (N <= 100)
        ptr = malloc(N * L + 1); // the +1 is for the final printing
    else
        ptr = malloc(N * L);

    char *p = ptr;
    int i;

    for (i = 0; i < N; i++) {
        memcpy(p, h, L);
        p += L;
    }

    // to see if it works decrease N
    if (N <= 100) {
        ptr[N * L] = '\0';
        printf("%s", ptr);
    }
    return 0;
}

It takes 0.59 s to run on the same PC and OS as before, with both GCC
and LLVM-GCC.
(Note that the loop in this code is a bit different now; the end result
is the same.)

The ASM of the inner loop:

L:  movl    _h, %eax
    movl    %eax, (%edx)
    movzwl  _h+4, %eax
    movw    %ax, 4(%edx)
    addl    $6, %edx
    cmpl    %ecx, %edx
    jne L

Bye,
bearophile
Mar 15 2009
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sun, 15 Mar 2009 10:31:10 -0400, bearophile wrote:

 The ASM of the inner loop:
 
 L:  movl    _h, %eax
     movl    %eax, (%edx)
     movzwl  _h+4, %eax
     movw    %ax, 4(%edx)
     addl    $6, %edx
     cmpl    %ecx, %edx
     jne L
Obviously, a memcpy intrinsic is at work here. DMD calls an actual CRT function instead.
Mar 15 2009
parent bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 Obviously, a memcpy intrinsic is at work here.<
Yes, gcc is able to recognize some calls to C library functions and
replace them with intrinsics. I think LDC too uses an intrinsic to copy
the memory of a slice.

This isn't a very interesting benchmark; there's nothing much new to
see here.

Bye,
bearophile
Mar 15 2009
prev sibling next sibling parent Moritz Warning <moritzwarning web.de> writes:
On Sun, 15 Mar 2009 13:17:50 +0000, Moritz Warning wrote:

 On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
 
 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
 I did a little benchmark:

 ldc -release -O5
 true:  0.51
 false: 0.63

 dmd -release -O
 true:  4.47
 false: 3.58
My system is 64 bit, with dmd 1.041 and ldc from trunk. Since dmd
produces 32 bit binaries only, this might account for a factor of two
in the timings.
Mar 15 2009
prev sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:

 On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
 
 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
 I did a little benchmark:

 ldc -release -O5
 true:  0.51
 false: 0.63

 dmd -release -O
 true:  4.47
 false: 3.58

 I don't see a very big difference between slice copying and memcpy
 (but between compilers).

 Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
The original benchmark swapped insanely on my 1GB laptop so I've cut
the number of iterations in half, to 50_000_000. Compiled with
-O -release -inline. Results:

slice:  2.31
memcpy: 0.73

That's 3 times difference. Disassembly:

slice:

L31:  mov  ECX,EDX
      mov  EAX,6
      lea  ESI,010h[ESP]
      mov  ECX,EAX
      mov  EDI,EDX
      rep  movsb
      add  EDX,6
      add  EBX,6
      cmp  EBX,011E1A300h
      jb   L31

memcpy:

L35:  push 6
      lea  ECX,014h[ESP]
      push ECX
      push EBX
      call near ptr _memcpy
      add  EBX,6
      add  ESI,6
      add  ESP,0Ch
      cmp  ESI,011E1A300h
      jb   L35

Seems like rep movsb is /way/ sub-optimal for copying data.
Mar 15 2009
parent reply Don <nospam nospam.com> writes:
Sergey Gromov wrote:
 Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
 
 On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:

 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
 I did a little benchmark:

 ldc -release -O5
 true:  0.51
 false: 0.63

 dmd -release -O
 true:  4.47
 false: 3.58

 I don't see a very big difference between slice copying and memcpy
 (but between compilers).

 Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
 The original benchmark swapped insanely on my 1GB laptop so I've cut
 the number of iterations in half, to 50_000_000. Compiled with
 -O -release -inline. Results:

 slice:  2.31
 memcpy: 0.73

 That's 3 times difference. Disassembly:

 slice:

 L31:  mov  ECX,EDX
       mov  EAX,6
       lea  ESI,010h[ESP]
       mov  ECX,EAX
       mov  EDI,EDX
       rep  movsb
       add  EDX,6
       add  EBX,6
       cmp  EBX,011E1A300h
       jb   L31

 memcpy:

 L35:  push 6
       lea  ECX,014h[ESP]
       push ECX
       push EBX
       call near ptr _memcpy
       add  EBX,6
       add  ESI,6
       add  ESP,0Ch
       cmp  ESI,011E1A300h
       jb   L35

 Seems like rep movsb is /way/ sub-optimal for copying data.
Definitely! The difference ought to be bigger than a factor of 3.
Which means that memcpy probably isn't anywhere near optimal, either.

rep movsd is always 4 times quicker than rep movsb. There's a range of
lengths for which rep movsd is optimal; outside that range, there are
other options which are even faster. So there's a factor of 4-8
speedup available on most memory copies. Low-hanging fruit! <g>
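To make that concrete, here's a sketch of a dword-based copy in
DMD-style x86 inline asm (illustrative only, assuming a 32 bit target;
copyDwords is a made-up name, not a runtime routine):

// A sketch only: bulk copy with rep movsd, rep movsb for the tail.
void copyDwords(void* dst, void* src, size_t n) {
    asm {
        push ESI;      // ESI/EDI must be preserved
        push EDI;
        mov ESI, src;
        mov EDI, dst;
        mov ECX, n;
        mov EDX, ECX;
        shr ECX, 2;    // number of whole dwords
        rep;
        movsd;         // 4 bytes per iteration instead of 1
        mov ECX, EDX;
        and ECX, 3;    // 0..3 leftover bytes
        rep;
        movsb;
        pop EDI;
        pop ESI;
    }
}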
Mar 16 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Don:
Which means that memcpy probably isn't anywhere near optimal, either.<
Some time ago I read an article by AMD showing that on modern CPUs
there are indeed ways to go much faster, using vector asm instructions,
loop unrolling and explicit cache prefetching (but it's useful with
longer arrays only). Profile-driven optimization can tell you whether a
particular call site usually copies a lot of data, and pick that kind
of copy for it; for smaller copies it's overkill: slow, and too much
code for the cache. As an alternative the programmer may add an
annotation to choose the copy strategy, but that is not nice.

Bye,
bearophile
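The strategy choice can be sketched as a length-based dispatch. The
thresholds below are invented for illustration, and memcpy stands in
for the tuned prefetching/vector routine the article describes:

import std.c.string: memcpy;

// A sketch of picking a copy strategy by length; real thresholds
// would come from profiling.
void smartCopy(void* dst, void* src, size_t n) {
    if (n < 16) {
        // tiny: a plain byte loop avoids any call overhead
        ubyte* d = cast(ubyte*)dst;
        ubyte* s = cast(ubyte*)src;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if (n < 64 * 1024) {
        memcpy(dst, src, n);  // mid-size: the CRT routine is fine
    } else {
        // large: an AMD-style unrolled, prefetching, vector copy
        // would go here; memcpy is only a stand-in
        memcpy(dst, src, n);
    }
}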
Mar 16 2009
next sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Mon, Mar 16, 2009 at 8:43 AM, bearophile <bearophileHUGS lycos.com> wrote:
 Don:
Which means that memcpy probably isn't anywhere near optimal, either.<
 Some time ago I read an article by AMD showing that on modern CPUs
 there are indeed ways to go much faster, using vector asm
 instructions, loop unrolling and explicit cache prefetching
I'm actually kind of shocked that, given the prevalence of memory block
copy operations, more CPUs haven't implemented it as a basic
instruction.  Yes, RISC is nice, but geez, this seems like a
no-brainer.
Mar 16 2009
parent reply BCS <none anon.com> writes:
Hello Jarrett,

 I'm actually kind of shocked that, given the prevalence of memory block
 copy operations, more CPUs haven't implemented it as a basic
 instruction.  Yes, RISC is nice, but geez, this seems like a
 no-brainer.
 
How about memory-to-memory DMA? Why even make the CPU wait for it to
finish?
Mar 16 2009
next sibling parent reply Don <nospam nospam.com> writes:
BCS wrote:
 Hello Jarrett,
 
 I'm actually kind of shocked that, given the prevalence of memory block
 copy operations, more CPUs haven't implemented it as a basic
 instruction.  Yes, RISC is nice, but geez, this seems like a
 no-brainer.
 How about memory-to-memory DMA? Why even make the CPU wait for it to
 finish?
That actually happens, to some extent, on some CPUs. If you get everything right, the data only goes to the outermost cache, and never reaches the CPU itself.
Mar 16 2009
parent BCS <none anon.com> writes:
Hello Don,

 BCS wrote:
 
 Hello Jarrett,
 
 I'm actually kind of shocked that, given the prevalence of memory
 block copy operations, more CPUs haven't implemented it as a
 basic instruction.  Yes, RISC is nice, but geez, this seems like a
 no-brainer.
 
 How about memory-to-memory DMA? Why even make the CPU wait for it to
 finish?
 That actually happens, to some extent, on some CPUs. If you get
 everything right, the data only goes to the outermost cache, and
 never reaches the CPU itself.
I was thinking of something even more aggressive. For instance, I
wonder how much more silicon it would take to allow copies that happen
to be on the same DIMM to never even touch the motherboard. I can
think of some cases where this would be very useful (large matrix
transposition, for one, would work well if the memory is set up for
arbitrary striding).
Mar 16 2009
prev sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Mon, Mar 16, 2009 at 3:29 PM, BCS <none anon.com> wrote:
 I'm actually kind of shocked that, given the prevalence of memory block
 copy operations, more CPUs haven't implemented it as a basic
 instruction.  Yes, RISC is nice, but geez, this seems like a
 no-brainer.
 How about memory-to-memory DMA? Why even make the CPU wait for it to
 finish?
Sure, then you have to worry about waiting for the *DMA* to finish.

If it's a copy whose result is largely unimportant to the
immediately-following code (copying a backbuffer into a frontbuffer,
for example), it doesn't matter. But for copying large value types, I'd
think it's pretty important that the copy is semantically atomic.
Mar 16 2009
parent BCS <none anon.com> writes:
Hello Jarrett,

 On Mon, Mar 16, 2009 at 3:29 PM, BCS <none anon.com> wrote:
 
 I'm actually kind of shocked that, given the prevalence of memory
 block copy operations, more CPUs haven't implemented it as a
 basic instruction.  Yes, RISC is nice, but geez, this seems like a
 no-brainer.
 
 How about memory-to-memory DMA? Why even make the CPU wait for it to
 finish?
 Sure, then you have to worry about waiting for the *DMA* to finish.

 If it's a copy whose result is largely unimportant to the
 immediately-following code (copying a backbuffer into a frontbuffer,
 for example), it doesn't matter. But for copying large value types,
 I'd think it's pretty important that the copy is semantically atomic.
As long as either the CPU can voluntarily block, or the MMU can block
access until the DMA is done, it's a free gain: the CPU can do
/something/ it wouldn't otherwise be able to.
Mar 16 2009
prev sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
bearophile wrote:
 Don:
 Which means that memcpy probably isn't anywhere near optimal, either.<
 Some time ago I read an article by AMD showing that on modern CPUs
 there are indeed ways to go much faster, using vector asm instructions,
 loop unrolling and explicit cache prefetching (but it's useful with
 longer arrays only). Profile-driven optimization can tell you whether a
 particular call site usually copies a lot of data, and pick that kind
 of copy for it; for smaller copies it's overkill: slow, and too much
 code for the cache. As an alternative the programmer may add an
 annotation to choose the copy strategy, but that is not nice.

 Bye,
 bearophile
You could probably get good results by seeing how long the array is, though.
Mar 16 2009
parent Don <nospam nospam.com> writes:
Christopher Wright wrote:
 bearophile wrote:
 Don:
 Which means that memcpy probably isn't anywhere near optimal, either.<
 Some time ago I read an article by AMD showing that on modern CPUs
 there are indeed ways to go much faster, using vector asm instructions,
 loop unrolling and explicit cache prefetching (but it's useful with
 longer arrays only). Profile-driven optimization can tell you whether a
 particular call site usually copies a lot of data, and pick that kind
 of copy for it; for smaller copies it's overkill: slow, and too much
 code for the cache. As an alternative the programmer may add an
 annotation to choose the copy strategy, but that is not nice.
Not necessary. If the length is long enough to get benefit from that, the function call overhead for calling memcpy() is negligible. So you just need to include all the cases in memcpy.
 Bye,
 bearophile
You could probably get good results by seeing how long the array is, though.
Yes. If the length is known at compile time, and it's short, inline it
with special-case code for the small sizes. If it's long, just put in a
call to memcpy.

I looked at the code in the DMD backend; the generation of the memcpy
intrinsic is in cdmemcpy() in cod2.c. But as far as I can tell, by the
time that code is called, the length isn't a constant any more. So I
don't think it'd be easy to fix.

On Core2, rep movsb transfers 1 byte every 3 clocks -- ie 0.3 bytes per
clock (!). rep movsd/rep movsq transfers 1 dword/qword every 0.63
clocks -- ie best case is 13 bytes/clock. On AMD you get one transfer
per clock -- max 8 bytes per clock, 1 byte per clock with rep movsb.

So rep movsb is _unbelievably_ slow. A simple D for loop is probably
quicker.
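As a sketch of that idea, with the length as a compile-time template
parameter (an illustration only, not the backend fix):

import std.c.string: memcpy;

// Inline the small power-of-two sizes as single movs; fall back to
// memcpy where the call overhead is negligible anyway.
void copyFixed(size_t L)(void* dst, void* src) {
    static if (L == 1)
        *cast(ubyte*)dst = *cast(ubyte*)src;
    else static if (L == 2)
        *cast(ushort*)dst = *cast(ushort*)src;
    else static if (L == 4)
        *cast(uint*)dst = *cast(uint*)src;
    else static if (L == 8)
        *cast(ulong*)dst = *cast(ulong*)src;
    else
        memcpy(dst, src, L);
}

copyFixed!(4)(dst, src) then compiles down to a single mov, while an
odd size like 6 falls through to memcpy unless it gets a special case
of its own.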
Mar 17 2009
prev sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Mon, 16 Mar 2009 10:34:33 +0100, Don wrote:

 Sergey Gromov wrote:
 Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
 
 On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:

 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
 I did a little benchmark:

 ldc -release -O5
 true:  0.51
 false: 0.63

 dmd -release -O
 true:  4.47
 false: 3.58

 I don't see a very big difference between slice copying and memcpy
 (but between compilers).

 Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
 The original benchmark swapped insanely on my 1GB laptop so I've cut
 the number of iterations in half, to 50_000_000. Compiled with
 -O -release -inline. Results:

 slice:  2.31
 memcpy: 0.73

 That's 3 times difference. Disassembly:

 slice:

 L31:  mov  ECX,EDX
       mov  EAX,6
       lea  ESI,010h[ESP]
       mov  ECX,EAX
       mov  EDI,EDX
       rep  movsb
       add  EDX,6
       add  EBX,6
       cmp  EBX,011E1A300h
       jb   L31

 memcpy:

 L35:  push 6
       lea  ECX,014h[ESP]
       push ECX
       push EBX
       call near ptr _memcpy
       add  EBX,6
       add  ESI,6
       add  ESP,0Ch
       cmp  ESI,011E1A300h
       jb   L35

 Seems like rep movsb is /way/ sub-optimal for copying data.
 Definitely! The difference ought to be bigger than a factor of 3.
 Which means that memcpy probably isn't anywhere near optimal, either.

 rep movsd is always 4 times quicker than rep movsb. There's a range of
 lengths for which rep movsd is optimal; outside that range, there are
 other options which are even faster. So there's a factor of 4-8
 speedup available on most memory copies. Low-hanging fruit! <g>
Don't disregard the function call overhead. memcpy is called 50 M times, copying only 6 bytes per call.
Mar 16 2009
parent reply Don <nospam nospam.com> writes:
Sergey Gromov wrote:
 Mon, 16 Mar 2009 10:34:33 +0100, Don wrote:
 
 Sergey Gromov wrote:
 Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:

 On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:

 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
 I did a little benchmark:

 ldc -release -O5
 true:  0.51
 false: 0.63

 dmd -release -O
 true:  4.47
 false: 3.58

 I don't see a very big difference between slice copying and memcpy
 (but between compilers).

 Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
 The original benchmark swapped insanely on my 1GB laptop so I've cut
 the number of iterations in half, to 50_000_000. Compiled with
 -O -release -inline. Results:

 slice:  2.31
 memcpy: 0.73

 That's 3 times difference. Disassembly:

 slice:

 L31:  mov  ECX,EDX
       mov  EAX,6
       lea  ESI,010h[ESP]
       mov  ECX,EAX
       mov  EDI,EDX
       rep  movsb
       add  EDX,6
       add  EBX,6
       cmp  EBX,011E1A300h
       jb   L31

 memcpy:

 L35:  push 6
       lea  ECX,014h[ESP]
       push ECX
       push EBX
       call near ptr _memcpy
       add  EBX,6
       add  ESI,6
       add  ESP,0Ch
       cmp  ESI,011E1A300h
       jb   L35

 Seems like rep movsb is /way/ sub-optimal for copying data.
 Definitely! The difference ought to be bigger than a factor of 3.
 Which means that memcpy probably isn't anywhere near optimal, either.

 rep movsd is always 4 times quicker than rep movsb. There's a range of
 lengths for which rep movsd is optimal; outside that range, there are
 other options which are even faster. So there's a factor of 4-8
 speedup available on most memory copies. Low-hanging fruit! <g>
 Don't disregard the function call overhead. memcpy is called 50 M
 times, copying only 6 bytes per call.
Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's six bytes -- it's in the asm. Blimey. It should just be doing that as a direct sequence of loads and stores, for anything up to at least 8 bytes.
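For the 6-byte copy in this benchmark, that direct sequence would be
just two stores (a sketch in D, assuming x86, where unaligned word
access is fine):

// one dword store plus one word store -- two movs instead of
// rep movsb or a memcpy call
void copy6(void* dst, void* src) {
    *cast(uint*)dst = *cast(uint*)src;                  // bytes 0..3
    *cast(ushort*)(cast(ubyte*)dst + 4) =
        *cast(ushort*)(cast(ubyte*)src + 4);            // bytes 4..5
}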
Mar 16 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's 
 six bytes -- it's in the asm. Blimey. It should just be doing that as a 
 direct sequence of loads and stores, for anything up to at least 8 bytes.
The compiler will replace it with a simple mov if it is 1, 2, 4 or 8 bytes.
Mar 16 2009
next sibling parent Don <nospam nospam.com> writes:
Walter Bright wrote:
 Don wrote:
 Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's 
 six bytes -- it's in the asm. Blimey. It should just be doing that as 
 a direct sequence of loads and stores, for anything up to at least 8 
 bytes.
The compiler will replace it with a simple mov if it is 1, 2, 4 or 8 bytes.
That's good. In this case, though, it's a pessimisation: it's quicker
in debug mode than in release mode!
Mar 16 2009
prev sibling parent Sergey Gromov <snake.scaly gmail.com> writes:
Mon, 16 Mar 2009 11:36:50 -0700, Walter Bright wrote:

 Don wrote:
 Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's 
 six bytes -- it's in the asm. Blimey. It should just be doing that as a 
 direct sequence of loads and stores, for anything up to at least 8 bytes.
The compiler will replace it with a simple mov if it is 1, 2, 4 or 8 bytes.
It didn't, actually. I've just filed a patch which should fix this
issue:

http://d.puremagic.com/issues/show_bug.cgi?id=2750

For instance, the benchmark in the original post used to compile to

L31:  mov  ECX,EDX
      mov  EAX,6
      lea  ESI,010h[ESP]
      mov  ECX,EAX
      mov  EDI,EDX
      rep  movsb
      add  EDX,6
      add  EBX,6
      cmp  EBX,011E1A300h
      jb   L31

With my patch it's just

L3A:  lea  ESI,0Ch[ESP]
      mov  EDI,EAX
      movsd
      movsb
      movsb
      add  EAX,6
      add  ECX,6
      cmp  ECX,011E1A300h
      jb   L3A

as it probably was meant to be.
Mar 19 2009
prev sibling parent "Lionello Lunesu" <lionello lunesu.remove.com> writes:
This has been discussed before, to no avail.
http://d.puremagic.com/issues/show_bug.cgi?id=2313

L.
Mar 17 2009