digitalmars.D - memcpy vs slice copy
- bearophile (24/24) Mar 14 2009 While doing some string processing I've seen some unusual timings compar...
- Moritz Warning (12/16) Mar 15 2009 I did a little benchmark:
- bearophile (8/10) Mar 15 2009 I have taken the times again. My timings, best of 4:
- bearophile (38/38) Mar 15 2009 For reference here's a simple C version:
- Sergey Gromov (3/12) Mar 15 2009 Obviously, a memcpy intrinsic is at work here. DMD calls an actual CRT
- bearophile (6/7) Mar 15 2009 Yes, gcc is able to recognize some calls to C library functions and repl...
- Moritz Warning (4/20) Mar 15 2009 My system is 64bit with dmd 1.041 and ldc from trunk.
- Sergey Gromov (31/53) Mar 15 2009 The original benchmark swapped insanely on my 1GB laptop so I've cut the
- Don (8/67) Mar 16 2009 Definitely! The difference ought to be bigger than a factor of 3. Which
- bearophile (4/5) Mar 16 2009 Time ago I have read an article written by AMD that shows that indeed wi...
- Jarrett Billingsley (5/8) Mar 16 2009 I'm actually kind of shocked that given the prevalence of memory block
- BCS (2/7) Mar 16 2009 How about memory to memory DMA, Why even make the CPU wait for it to fin...
- Don (4/16) Mar 16 2009 That actually happens, to some extent, on some CPUs. If you get
- BCS (6/22) Mar 16 2009 I was thinking even more aggressive, for instance: I wonder how much mor...
- Jarrett Billingsley (7/13) Mar 16 2009 sh?
- BCS (4/20) Mar 16 2009 As long as either the CPU can voluntarily block or the MMU can block acc...
- Christopher Wright (2/9) Mar 16 2009 You could probably get good results by seeing how long the array is, tho...
- Don (19/36) Mar 17 2009 Not necessary. If the length is long enough to get benefit from that,
- Sergey Gromov (3/72) Mar 16 2009 Don't disregard the function call overhead. memcpy is called 50 M
- Don (4/76) Mar 16 2009 Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's
- Walter Bright (2/5) Mar 16 2009 The compiler will replace it with a simple mov if it is 1, 2, 4 or 8 byt...
- Don (4/11) Mar 16 2009 That's good.
- Sergey Gromov (27/33) Mar 19 2009 It didn't, actually.
- Lionello Lunesu (3/3) Mar 17 2009 This has been discussed before, to no avail.
While doing some string processing I've seen some unusual timings compared to the C code, so I have written this to see the situation better. When USE_MEMCPY is false this little benchmark runs about 3+ times slower:

import std.c.stdlib: malloc;
import std.c.string: memcpy;
import std.stdio: writefln;

const bool USE_MEMCPY = true;

void main() {
    const int N = 100_000_000;
    auto h = "hello\n";

    char* ptr = cast(char*)malloc(N * h.length);
    char* p = ptr;

    for (int i; i < (h.length * N); i += h.length) {
        static if (USE_MEMCPY)
            memcpy(p, h.ptr, h.length);
        else
            p[0 .. h.length] = h;
        p += h.length;
    }

    if (N <= 100) // to see if it works decrease N
        writefln(ptr[0 .. N * h.length]);
}

As usual with benchmarks I may be missing things (especially because now it's quite late here), so keep your eyes open.

Bye,
bearophile
Mar 14 2009
On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
> While doing some string processing I've seen some unusual timings
> compared to the C code, so I have written this to see the situation
> better. When USE_MEMCPY is false this little benchmark runs about 3+
> times slower:

I did a little benchmark:

ldc -release -O5
true:  0.51
false: 0.63

dmd -release -O
true:  4.47
false: 3.58

I don't see a very big difference between slice copying and memcpy (but between compilers).

Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
Mar 15 2009
Moritz Warning:
> I don't see a very big difference between slice copying and memcpy
> (but between compilers).

I have taken the times again. My timings, best of 4:

true:  1.33 s
false: 4.28 s

I have used dmd 1.041, with Phobos, on WinXP 32 bit, 2 GB RAM, CPU Core 2 at 2 GHz.
This may be another little benchmark for Robert Clipsham :-)

Bye,
bearophile
Mar 15 2009
For reference here's a simple C version:

#include "stdlib.h"
#include "string.h"
#include "stdio.h"

#define N 100000000
#define L 6

char h[L] = "hello\n";

int main() {
    char *ptr;

    if (N <= 100)
        ptr = malloc(N * L + 1); // the +1 is for the final printing
    else
        ptr = malloc(N * L);

    char *p = ptr;
    int i;

    for (i = 0; i < N; i++) {
        memcpy(p, h, L);
        p += L;
    }

    // to see if it works decrease N
    if (N <= 100) {
        ptr[N * L] = '\0';
        printf("%s", ptr);
    }

    return 0;
}

It takes 0.59 s to run on the same PC and OS as mine, using both GCC and LLVM-GCC. (Note that the loop in the code is a bit different now, but the end result is the same.)

The ASM of the inner loop:

L:  movl   _h, %eax
    movl   %eax, (%edx)
    movzwl _h+4, %eax
    movw   %ax, 4(%edx)
    addl   $6, %edx
    cmpl   %ecx, %edx
    jne    L

Bye,
bearophile
Mar 15 2009
Sun, 15 Mar 2009 10:31:10 -0400, bearophile wrote:
> The ASM of the inner loop:
>
> L:  movl   _h, %eax
>     movl   %eax, (%edx)
>     movzwl _h+4, %eax
>     movw   %ax, 4(%edx)
>     addl   $6, %edx
>     cmpl   %ecx, %edx
>     jne    L

Obviously, a memcpy intrinsic is at work here. DMD calls an actual CRT function instead.
Mar 15 2009
Sergey Gromov:
> Obviously, a memcpy intrinsic is at work here.

Yes, gcc is able to recognize some calls to C library functions and replace them with intrinsics. I think LDC too uses an intrinsic to copy the memory of a slice.

This isn't a very interesting benchmark; there's not much new to see here.

Bye,
bearophile
Mar 15 2009
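A minimal probe of the substitution bearophile describes (a sketch; the file and function names are hypothetical): when the length is a compile-time constant, gcc at -O2 treats the call as __builtin_memcpy and expands it inline instead of calling the CRT memcpy. Compiling with "gcc -O2 -S probe.c" shows output much like the listing above.

#include <string.h>

/* memcpy with a length the compiler can see at compile time: gcc
   expands this inline rather than emitting a call. */
void probe(char *dst, const char *src)
{
    memcpy(dst, src, 6);   /* becomes a movl plus a movzwl/movw pair */
}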
On Sun, 15 Mar 2009 13:17:50 +0000, Moritz Warning wrote:
> On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
>> While doing some string processing I've seen some unusual timings
>> compared to the C code, so I have written this to see the situation
>> better. When USE_MEMCPY is false this little benchmark runs about 3+
>> times slower:
>
> I did a little benchmark:
>
> ldc -release -O5
> true:  0.51
> false: 0.63
>
> dmd -release -O
> true:  4.47
> false: 3.58

My system is 64bit with dmd 1.041 and ldc from trunk. Since dmd produces only 32-bit binaries, this might account for a factor of two in the timings.
Mar 15 2009
Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
> On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
>> While doing some string processing I've seen some unusual timings
>> compared to the C code, so I have written this to see the situation
>> better. When USE_MEMCPY is false this little benchmark runs about 3+
>> times slower:
>
> I did a little benchmark:
>
> ldc -release -O5
> true:  0.51
> false: 0.63
>
> dmd -release -O
> true:  4.47
> false: 3.58
>
> I don't see a very big difference between slice copying and memcpy
> (but between compilers).

The original benchmark swapped insanely on my 1GB laptop so I've cut the number of iterations in half, to 50_000_000. Compiled with -O -release -inline. Results:

slice:  2.31
memcpy: 0.73

That's 3 times difference. Disassembly:

slice:
L31:    mov  ECX,EDX
        mov  EAX,6
        lea  ESI,010h[ESP]
        mov  ECX,EAX
        mov  EDI,EDX
        rep  movsb
        add  EDX,6
        add  EBX,6
        cmp  EBX,011E1A300h
        jb   L31

memcpy:
L35:    push 6
        lea  ECX,014h[ESP]
        push ECX
        push EBX
        call near ptr _memcpy
        add  EBX,6
        add  ESI,6
        add  ESP,0Ch
        cmp  ESI,011E1A300h
        jb   L35

Seems like rep movsb is /way/ sub-optimal for copying data.
Mar 15 2009
Sergey Gromov wrote:
> [...]
> The original benchmark swapped insanely on my 1GB laptop so I've cut
> the number of iterations in half, to 50_000_000. Compiled with
> -O -release -inline. Results:
>
> slice:  2.31
> memcpy: 0.73
>
> That's 3 times difference.
> [...]
> Seems like rep movsb is /way/ sub-optimal for copying data.

Definitely! The difference ought to be bigger than a factor of 3. Which means that memcpy probably isn't anywhere near optimal, either. rep movsd is always 4 times quicker than rep movsb. There's a range of lengths for which rep movsd is optimal; outside that range, there are other options which are even faster.

So there's a factor of 4-8 speedup available on most memory copies. Low-hanging fruit! <g>
Mar 16 2009
Don:
> Which means that memcpy probably isn't anywhere near optimal, either.

Some time ago I read an article by AMD showing that on modern CPUs there are indeed ways to go much faster, using vector asm instructions, loop unrolling and explicit cache prefetching (but it's useful only with longer arrays. Profile-driven optimization can tell you if a particular copy usually copies a lot of data, and then use that kind of copy, which is overkill/slow/too much code for the cache for smaller copies. As an alternative the programmer may add some annotation to choose what copy strategy to use, but this is not nice).

Bye,
bearophile
Mar 16 2009
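A minimal sketch of the kind of large-block copy loop the AMD article describes: SSE2 non-temporal stores plus software prefetch. The function name, the prefetch distance, and the requirement that both buffers be 16-byte aligned and n a multiple of 64 are all assumptions for the sake of brevity; a real implementation also needs scalar head and tail loops.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch: copy n bytes (n a multiple of 64, dst/src 16-byte aligned)
   using streaming stores that bypass the cache, prefetching the read
   stream a few cache lines ahead. */
void big_copy(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;

    for (i = 0; i < n / 16; i += 4) {
        /* hint the next part of the read stream into the CPU early */
        _mm_prefetch((const char *)(s + i) + 256, _MM_HINT_NTA);
        /* four 16-byte non-temporal stores per iteration (unrolled) */
        _mm_stream_si128(d + i,     _mm_load_si128(s + i));
        _mm_stream_si128(d + i + 1, _mm_load_si128(s + i + 1));
        _mm_stream_si128(d + i + 2, _mm_load_si128(s + i + 2));
        _mm_stream_si128(d + i + 3, _mm_load_si128(s + i + 3));
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}

As bearophile notes, this only pays off for long arrays: the streaming stores deliberately skip the cache, which is exactly wrong for a small copy whose destination is about to be read again.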
On Mon, Mar 16, 2009 at 8:43 AM, bearophile <bearophileHUGS lycos.com> wrote:
> Don:
>> Which means that memcpy probably isn't anywhere near optimal, either.
>
> Some time ago I read an article by AMD showing that on modern CPUs
> there are indeed ways to go much faster, using vector asm
> instructions, loop unrolling and explicit cache prefetching

I'm actually kind of shocked that, given the prevalence of memory block copy operations, more CPUs haven't implemented it as a basic instruction. Yes, RISC is nice, but geez, this seems like a no-brainer.
Mar 16 2009
Hello Jarrett,

> I'm actually kind of shocked that, given the prevalence of memory
> block copy operations, more CPUs haven't implemented it as a basic
> instruction. Yes, RISC is nice, but geez, this seems like a
> no-brainer.

How about memory-to-memory DMA? Why even make the CPU wait for it to finish?
Mar 16 2009
BCS wrote:
> Hello Jarrett,
>
>> I'm actually kind of shocked that, given the prevalence of memory
>> block copy operations, more CPUs haven't implemented it as a basic
>> instruction. Yes, RISC is nice, but geez, this seems like a
>> no-brainer.
>
> How about memory-to-memory DMA? Why even make the CPU wait for it to
> finish?

That actually happens, to some extent, on some CPUs. If you get everything right, the data only goes to the outermost cache, and never reaches the CPU itself.
Mar 16 2009
Hello Don,

> That actually happens, to some extent, on some CPUs. If you get
> everything right, the data only goes to the outermost cache, and
> never reaches the CPU itself.

I was thinking of something even more aggressive. For instance, I wonder how much more silicon it would take to allow copies that happen to be on the same DIMM to never even touch the motherboard? I can think of some cases where this would be very useful (large matrix transposition, for one, would work well if the memory is set up for arbitrary striding).
Mar 16 2009
On Mon, Mar 16, 2009 at 3:29 PM, BCS <none anon.com> wrote:
>> I'm actually kind of shocked that, given the prevalence of memory
>> block copy operations, more CPUs haven't implemented it as a basic
>> instruction. Yes, RISC is nice, but geez, this seems like a
>> no-brainer.
>
> How about memory-to-memory DMA? Why even make the CPU wait for it to
> finish?

Sure, then you have to worry about waiting for the *DMA* to finish. If it's a copy whose result is largely unimportant to the immediately-following code (copying a backbuffer into a frontbuffer, for example), it doesn't matter. But for copying large value types, I'd think it's pretty important that the copy is semantically atomic.
Mar 16 2009
Hello Jarrett,

> On Mon, Mar 16, 2009 at 3:29 PM, BCS <none anon.com> wrote:
>> How about memory-to-memory DMA? Why even make the CPU wait for it to
>> finish?
>
> Sure, then you have to worry about waiting for the *DMA* to finish.
> If it's a copy whose result is largely unimportant to the
> immediately-following code (copying a backbuffer into a frontbuffer,
> for example), it doesn't matter. But for copying large value types,
> I'd think it's pretty important that the copy is semantically atomic.

As long as either the CPU can voluntarily block or the MMU can block access until the DMA is done, it's a free gain: the CPU can do /something/ it wouldn't otherwise be able to do.
Mar 16 2009
bearophile wrote:
> Don:
>> Which means that memcpy probably isn't anywhere near optimal, either.
>
> Some time ago I read an article by AMD showing that on modern CPUs
> there are indeed ways to go much faster, using vector asm
> instructions, loop unrolling and explicit cache prefetching (but it's
> useful only with longer arrays. [...])

You could probably get good results by seeing how long the array is, though.
Mar 16 2009
Christopher Wright wrote:
> bearophile wrote:
>> Some time ago I read an article by AMD showing that on modern CPUs
>> there are indeed ways to go much faster, using vector asm
>> instructions, loop unrolling and explicit cache prefetching (but
>> it's useful only with longer arrays. Profile-driven optimization can
>> tell you if a particular copy usually copies a lot of data [...]. As
>> an alternative the programmer may add some annotation to choose what
>> copy strategy to use, but this is not nice).
>
> You could probably get good results by seeing how long the array is,
> though.

Not necessary. If the length is long enough to get benefit from that, the function call overhead for calling memcpy() is negligible. So you just need to include all the cases in memcpy.

Yes. If the length is known at compile time, and it's short, inline with special-case code for the small sizes. If it's long, just put in a call to memcpy.

I looked at the code in the DMD backend; the generation of the memcpy intrinsic is in cdmemcpy() in cod2.c. But as far as I can tell, by the time that code is called, the length isn't a constant any more. So I don't think it'd be easy to fix.

On Core2, rep movsb transfers 1 byte every 3 clocks -- ie 0.3 bytes per clock (!). rep movsd/rep movsq transfers 1 dword/qword every 0.63 clocks -- ie the best case is 13 bytes/clock. On AMD you get one transfer per clock -- max 8 bytes per clock, 1 byte per clock with rep movsb.

So rep movsb is _unbelievably_ slow. A simple D for loop is probably quicker.
Mar 17 2009
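A sketch of the dispatch Don describes (the helper name is hypothetical): short, power-of-two sizes become direct moves, everything else becomes a real memcpy call whose overhead is then amortized over a larger copy. When the size is a compile-time constant, the switch folds away entirely.

#include <string.h>
#include <stddef.h>

/* Hypothetical sketch of Don's scheme. Constant-size memcpy calls are
   lowered by any modern compiler to a single mov (memcpy is used here
   instead of pointer casts to stay legal under strict aliasing);
   everything else pays the library call, which is negligible for long
   copies. */
static inline void copy_dispatch(void *dst, const void *src, size_t n)
{
    switch (n) {
    case 1:  memcpy(dst, src, 1); break;   /* one byte mov      */
    case 2:  memcpy(dst, src, 2); break;   /* one 16-bit mov    */
    case 4:  memcpy(dst, src, 4); break;   /* one 32-bit mov    */
    case 8:  memcpy(dst, src, 8); break;   /* one 64-bit mov    */
    default: memcpy(dst, src, n); break;   /* real library call */
    }
}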
Mon, 16 Mar 2009 10:34:33 +0100, Don wrote:
> Sergey Gromov wrote:
>> [...]
>> slice:  2.31
>> memcpy: 0.73
>>
>> That's 3 times difference.
>> [...]
>> Seems like rep movsb is /way/ sub-optimal for copying data.
>
> Definitely! The difference ought to be bigger than a factor of 3.
> Which means that memcpy probably isn't anywhere near optimal, either.
> rep movsd is always 4 times quicker than rep movsb. There's a range
> of lengths for which rep movsd is optimal; outside that range, there
> are other options which are even faster.
>
> So there's a factor of 4-8 speedup available on most memory copies.
> Low-hanging fruit! <g>

Don't disregard the function call overhead. memcpy is called 50 M times, copying only 6 bytes per call.
Mar 16 2009
Sergey Gromov wrote:
> Mon, 16 Mar 2009 10:34:33 +0100, Don wrote:
>> [...]
>> So there's a factor of 4-8 speedup available on most memory copies.
>> Low-hanging fruit! <g>
>
> Don't disregard the function call overhead. memcpy is called 50 M
> times, copying only 6 bytes per call.

Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's six bytes -- it's in the asm. Blimey. It should just be doing that as a direct sequence of loads and stores, for anything up to at least 8 bytes.
Mar 16 2009
Don wrote:
> Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS
> it's six bytes -- it's in the asm. Blimey. It should just be doing
> that as a direct sequence of loads and stores, for anything up to at
> least 8 bytes.

The compiler will replace it with a simple mov if it is 1, 2, 4 or 8 bytes.
Mar 16 2009
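The 6-byte copy in the benchmark falls through the 1/2/4/8 special cases Walter mentions, but it can still be done without a call, as a 4-byte move plus a 2-byte move -- exactly what the GCC listing earlier in the thread does. A minimal sketch (the function name is hypothetical):

#include <string.h>

/* Copy exactly 6 bytes as two constant-size pieces. Both memcpy calls
   have compile-time-constant lengths, so the compiler lowers them to
   two mov instructions instead of library calls. */
static inline void copy6(void *dst, const void *src)
{
    memcpy(dst, src, 4);                                 /* movl */
    memcpy((char *)dst + 4, (const char *)src + 4, 2);   /* movw */
}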
Walter Bright wrote:
> Don wrote:
>> Oh. I didn't see it was only 6 bytes. [...]
>
> The compiler will replace it with a simple mov if it is 1, 2, 4 or 8
> bytes.

That's good.
In fact, in this case it's a pessimisation. It's quicker in debug mode than in release mode!
Mar 16 2009
Mon, 16 Mar 2009 11:36:50 -0700, Walter Bright wrote:
> Don wrote:
>> Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS
>> it's six bytes -- it's in the asm. Blimey. It should just be doing
>> that as a direct sequence of loads and stores, for anything up to at
>> least 8 bytes.
>
> The compiler will replace it with a simple mov if it is 1, 2, 4 or 8
> bytes.

It didn't, actually. I've just filed a patch which should fix this issue:

http://d.puremagic.com/issues/show_bug.cgi?id=2750

For instance, the benchmark in the original post was compiling to

L31:    mov  ECX,EDX
        mov  EAX,6
        lea  ESI,010h[ESP]
        mov  ECX,EAX
        mov  EDI,EDX
        rep  movsb
        add  EDX,6
        add  EBX,6
        cmp  EBX,011E1A300h
        jb   L31

With my patch it's just

L3A:    lea  ESI,0Ch[ESP]
        mov  EDI,EAX
        movsd
        movsb
        movsb
        add  EAX,6
        add  ECX,6
        cmp  ECX,011E1A300h
        jb   L3A

as it probably was meant to be.
Mar 19 2009
This has been discussed before, to no avail.

http://d.puremagic.com/issues/show_bug.cgi?id=2313

L.
Mar 17 2009