
digitalmars.D - Replacing C's memcpy with a D implementation

reply Mike Franklin <slavo5150 yahoo.com> writes:
I'm exploring the possibility of implementing some of the basic 
software building blocks (memcpy, memcmp, memmove, etc...) that D 
utilizes from the C library with D implementations.  There are 
many reasons to do this, one of which is to leverage information 
available at compile-time and in D's type system (type sizes, 
alignment, etc...) in order to optimize the implementation of 
these functions, and allow them to be used from @safe code.

The prevailing wisdom has been that there is no way to improve on 
C's memcpy implementation given that it has been micro-optimized 
to death over several decades by many talented members of the 
human race.

So, I threw the following benchmark together to try to get a clue 
about what I was up against, and in a very short time, I beat the 
snot out of C's memcpy.  The benefit seems to disappear as the array 
sizes increase, but I believe the vast majority of calls to 
memcpy are probably quite small.

import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;
ubyte[length] dst;
ubyte[length] src;
auto rnd = Random(42);
ubyte[] src2;
ubyte[] dst2;

void verifyResults()
{
     assert(memcmp(dst.ptr, src.ptr, length) == 0);
     assert(memcmp(dst2.ptr, src2.ptr, length) == 0);
}

void randomizeData()
{
     for(int i = 0; i < length; i++)
     {
         src[i] = uniform!ubyte;
         dst[i] = uniform!ubyte;
     }

     src2 = src;
     dst2 = dst;
}

void memcpyD()
{
     dst = src.dup;
}

void memcpyDstdAlg()
{
     copy(src2, dst2);
}

void memcpyC()
{
     memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive()
{
     for(int i = 0; i < length; i++)
     {
         dst[i] = src[i];
     }
}

void memcpyASM()
{
     auto s = src.ptr;
     auto d = dst.ptr;
     size_t len = length;
     asm pure nothrow @nogc
     {
         mov RSI, s;
         mov RDI, d;
         cld;
         mov RCX, len;
         rep;
         movsb;
     }
}

void main()
{
     // verify the integrity of the algorithm
     randomizeData();
     memcpyD();
     verifyResults();

     randomizeData();
     memcpyDstdAlg();
     verifyResults();

     randomizeData();
     memcpyC();
     verifyResults();

     randomizeData();
     memcpyNaive();
     verifyResults();

     randomizeData();
     memcpyASM();
     verifyResults();

     // test the performance of the algorithm
     auto r = benchmark!(memcpyD, memcpyDstdAlg, memcpyC, 
memcpyNaive, memcpyASM)(1000);
     Duration memcpyDResult = r[0];
     Duration memcpyDstdAlgResult = r[1];
     Duration memcpyCResult = r[2];
     Duration memcpyNaiveResult = r[3];
     Duration memcpyASMResult = r[4];

     writeln("memcpyD: ", memcpyDResult);
     writeln("memcpyDstdAlg: ", memcpyDstdAlgResult);
     writeln("memcpyC: ", memcpyCResult);
     writeln("memcpyNaive: ", memcpyNaiveResult);
     writeln("memcpyASM: ", memcpyASMResult);
}


------ Output --------
memcpyD:         1 ms, 772 μs, and 4 hnsecs
memcpyDstdAlg: 531 μs and 8 hnsecs
memcpyC:       371 μs and 3 hnsecs
memcpyNaive:    21 ms, 572 μs, and 2 hnsecs
memcpyASM:     119 μs and 6 hnsecs



I'm not experienced with this kind of programming, so I'm 
doubting these results.  Have I done something wrong?  Am I 
overlooking something?

Thanks,
Mike
Jun 10 2018
next sibling parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
 I'm exploring the possibility of implementing some of the basic 
 software building blocks (memcpy, memcmp, memmove, etc...) that 
 D utilizes from the C library with D implementations.  There 
 are many reasons to do this, one of which is to leverage 
 information available at compile-time and in D's type system 
 (type sizes, alignment, etc...) in order to optimize the 
 implementation of these functions, and allow them to be used 
 from @safe code.

 [...]
what compiler? what flags?
Jun 10 2018
parent Mike Franklin <slavo5150 yahoo.com> writes:
On Sunday, 10 June 2018 at 13:05:33 UTC, Nicholas Wilson wrote:
 On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
 I'm exploring the possibility of implementing some of the 
 basic software building blocks (memcpy, memcmp, memmove, 
 etc...) that D utilizes from the C library with D 
 implementations.  There are many reasons to do this, one of 
 which is to leverage information available at compile-time and 
 in D's type system (type sizes, alignment, etc...) in order to 
 optimize the implementation of these functions, and allow them 
 to be used from @safe code.

 [...]
what compiler? what flags?
dmd main.d

Arch Linux x86_64 DMD, with no flags.

I'm not compiling C's memcpy, and the significant D implementation is memcpyASM, which is all inline assembly, so I don't see how the compiler flags will make any difference, anyway.

Mike
Jun 10 2018
prev sibling next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
 D utilizes from the C library with D implementations.  There 
 are many reasons to do this, one of which is to leverage 
 information available at compile-time and in D's type system 
 (type sizes, alignment, etc...) in order to optimize the 
 implementation of these functions, and allow them to be used 
 from @safe code.
So keep in mind that memcpy is really a magical intrinsic anyway, and optimizers frequently don't actually call a function, but rather see the size and replace the instructions inline (for example, they might replace the call with just a couple of movs instead of something fancy).

And D already has it built in as well, for @safe etc.:

arr1[] = arr2[]; // the compiler makes this a memcpy; the optimizer can further do its magic

so be sure to check against that too.
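For reference, the built-in slice copy described above can be exercised like this (a minimal sketch; the function name is arbitrary):

```d
// Built-in slice copy: bounds-checked, usable from @safe code, and
// lowered to a block copy (memcpy) by the compiler.
void copyViaSlice() @safe
{
    int[8] a = [1, 2, 3, 4, 5, 6, 7, 8];
    int[8] b;
    b[] = a[];      // no explicit memcpy call needed
    assert(b == a);
}
```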
Jun 10 2018
next sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

 And D already has it built in as well for  safe etc:

 arr1[] = arr2[]; // the compiler makes this memcpy, the 
 optimzer can further do its magic

 so be sure to check against that too.
My intent is to use the D implementation in druntime. I'm trying to replace the runtime hooks with templates, so we don't have to rely so much on `TypeInfo` and we can begin to use more of D without linking in the runtime.

But one thing I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us. If I replace the runtime hook with a template, I need to be honest, and that means all the low-level code in druntime that the runtime hook depends on needs to be @safe, nothrow, and pure.

So I'm starting with the fundamental dependencies on the C library, trying to ensure they can be replaced with D implementations and used in D idiomatically. For example, memcpy might look like this:

void memcpy(T)(T[] dest, T[] src);

We can extract size, alignment, etc. at compile time. So, my goal is not user-facing code, but improving the druntime implementations.

Does that make sense?

Mike
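Such a templated building block might be sketched as follows (hypothetical; the name `copyD` and the specialization threshold are illustrative, not a proposed druntime API):

```d
// Hypothetical templated copy: the element size and alignment are known
// at compile time, so the implementation can specialize per type.
void copyD(T)(T[] dest, const(T)[] src) @safe
{
    assert(dest.length == src.length);
    static if (T.sizeof == 1)
    {
        // byte-sized elements: plain slice copy, lowered to a block copy
        dest[] = src[];
    }
    else
    {
        // wider elements: T.sizeof and T.alignof are compile-time
        // constants here, so a specialized path could be chosen
        dest[] = src[];
    }
}
```

Both branches compile down to the built-in slice copy today; the point is only that `static if` gives each instantiation a place to pick a size- or alignment-specific path.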
Jun 10 2018
parent reply Kagamin <spam here.lot> writes:
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
 There are many reasons to do this, one of which is to leverage 
 information available at compile-time and in D's type system 
 (type sizes, alignment, etc...) in order to optimize the 
 implementation of these functions, and allow them to be used 
 from @safe code.
In @safe code you just use assignment and array ops; the backend does the rest.

On Sunday, 10 June 2018 at 13:27:04 UTC, Mike Franklin wrote:
 But one thing I discovered is that while we can set an array's 
 length in @safe, nothrow, pure code, it gets lowered to a 
 runtime hook that is neither @safe, nothrow, nor pure; the 
 compiler is lying to us.
If the compiler can't get it right then who can?
Jun 10 2018
parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:
 On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
 There are many reasons to do this, one of which is to leverage 
 information available at compile-time and in D's type system 
 (type sizes, alignment, etc...) in order to optimize the 
 implementation of these functions, and allow them to be used 
 from @safe code.
In @safe code you just use assignment and array ops; the backend does the rest.

On Sunday, 10 June 2018 at 13:27:04 UTC, Mike Franklin wrote:
 But one thing I discovered is that while we can set an array's 
 length in @safe, nothrow, pure code, it gets lowered to a 
 runtime hook that is neither @safe, nothrow, nor pure; the 
 compiler is lying to us.
If the compiler can't get it right then who can?
The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure:

https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442

The backend is not involved.

Mike
Jun 10 2018
next sibling parent Mike Franklin <slavo5150 yahoo.com> writes:
On Monday, 11 June 2018 at 02:49:00 UTC, Mike Franklin wrote:

 The compiler implementation is faulty.  It rewrites the 
 expressions to an `extern(C)` runtime implementation that is 
 not @safe, nothrow, or pure:
 https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442
 The backend is not involved.
Also, understand that this lowering happens in the IR generation stage:

https://github.com/dlang/dmd/blob/3a79629988efd51d4dda9edb38a6701cd097da89/src/dmd/e2ir.d#L2616

so semantic analysis is not performed on the runtime call. If it were, it would fail to compile.

Mike
Jun 10 2018
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/10/2018 7:49 PM, Mike Franklin wrote:
 On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:
 If the compiler can't get it right then who can?
The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure:

https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442

The backend is not involved.
It's only faulty if there's a bug in it. It's essentially @trusted code.
Jun 10 2018
parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Monday, 11 June 2018 at 03:31:05 UTC, Walter Bright wrote:
 On 6/10/2018 7:49 PM, Mike Franklin wrote:
 On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:
 If the compiler can't get it right then who can?
The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure:

https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442

The backend is not involved.
It's only faulty if there's a bug in it. It's essentially @trusted code.
That only addresses the @safe attribute, and that code is much too complex for anyone to audit it and certify it as safe.

Exceptions are also not all handled, so there is no way it can pass as nothrow.

The runtime call needs to be replaced with a template that can take advantage of D's compile-time type information instead of `TypeInfo`, but doing so will force the implementation through semantic processing, which, in its current state, will fail to compile. It needs to be refactored, and creating @safe, nothrow, pure building blocks will help with that, hence the topic of this thread.

Mike
Jun 10 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/10/2018 8:43 PM, Mike Franklin wrote:
 That only addresses the @safe attribute, and that code is much too 
 complex for anyone to audit it and certify it as safe.
 
 Exceptions are also not all handled, so there is no way it can pass 
 as nothrow.
 
 The runtime call needs to be replaced with a template that can take 
 advantage of D's compile-time type information instead of `TypeInfo`, 
 but doing so will force the implementation through semantic 
 processing, which, in its current state, will fail to compile.  It 
 needs to be refactored, and creating @safe, nothrow, pure building 
 blocks will help with that, hence the topic of this thread.
Making it a template is not really necessary. The compiler knows if there is the possibility of it throwing based on the type; it doesn't need to infer it.

(I notice it is doing __doPostblit(). This looks wrong, D allows data to be moved. As far as I can tell with a perfunctory examination, that's the only "can throw" bit.)

Doing without TypeInfo is a worthy goal most of the time, but I don't think it is of much value here because it calls the garbage collector anyway, which requires TypeInfo.

Some obvious refactoring tasks for it would be to:

1. eliminate the gotos
2. use single assignment for variables
3. minimize the scopes of variables
4. use const
5. document the inputs and outputs
6. add test cases for the red lines
Jun 11 2018
next sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Monday, 11 June 2018 at 08:00:10 UTC, Walter Bright wrote:

 Making it a template is not really necessary. The compiler 
 knows if there is the possibility of it throwing based on the 
 type, it doesn't need to infer it.
There are other reasons to make it a template, though. For example, if it were a template, it would not rely on `TypeInfo` and could then be used in -betterC-like situations. For setting the length of an array, there are issues with memory allocation that hinder that, but I'm thinking beyond just setting the length of an array here; I just used that as an illustrative example.

I think there might also be optimization opportunities using templates, metaprogramming, and type introspection that are not possible with the current design.

So there are other reasons to pursue a template design for all of our runtime hooks, and if our memcpy, memcmp, etc. are also templated, it's turtles all the way down. We have to be careful about code bloat with templates, but I think that can be mitigated.

Mike
Jun 11 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/11/2018 1:12 AM, Mike Franklin wrote:
 On Monday, 11 June 2018 at 08:00:10 UTC, Walter Bright wrote:
 
 Making it a template is not really necessary. The compiler knows if there is 
 the possibility of it throwing based on the type, it doesn't need to infer it.
There are other reasons to make it a template, though.  For example, if it were a template, it would not rely on `TypeInfo` and could then be used in -betterC-like situations.
We have no design for this function that doesn't rely on the GC, and the GC needs TypeInfo. This function is not usable with betterC with or without the TypeInfo argument.
 I think there might also be optimization opportunities using templates, 
 metaprogramming, and type introspection, that are not currently possible with 
 the current design.
Just making it a template doesn't automatically enable any of this.
 So there are other reasons to pursue a template design for all of our runtime 
 hooks, and if our memcpy, memcmp, etc.. are also templated it't turtles all
the 
 way down.
memcpy and memcmp are already handled specially by modern compilers, have been for decades; I seriously doubt there's more oil in that well.
Jun 11 2018
next sibling parent Mike Franklin <slavo5150 yahoo.com> writes:
On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:

 We have no design for this function that doesn't rely on the 
 GC, and the GC needs TypeInfo. This function is not usable with 
 betterC with or without the TypeInfo argument.
I understand that. I was using `_d_arraysetlengthT` as an example of problems I encountered trying to convert these runtime hooks into templates, specifically illustrating how the compiler does not enforce the guarantees of @safe, nothrow, and pure, and, in fact, completely disregards them.

Converting *any* runtime hook, not just `_d_arraysetlengthT`, to a template comes with a whole host of problems because of the aforementioned issue. As soon as it's implemented as a template, it is semantically analyzed in the context of its caller, so @safe, nothrow, pure, etc. all need to be accommodated. It's good to have the compiler check for that stuff. It's unfortunate the current implementation does not, and that's the fault in the current implementation.

There are other runtime hooks that are good candidates for templates, even if `_d_arraysetlengthT` isn't one of them. Again, I just chose `_d_arraysetlengthT` as an example, because it is the one I had experience with.

Mike
Jun 11 2018
prev sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:

 I think there might also be optimization opportunities using 
 templates, metaprogramming, and type introspection, that are 
 not currently possible with the current design.
Just making it a template doesn't automatically enable any of this.
I think it does, because I can then generate specific code based on the type information at compile-time.
 So there are other reasons to pursue a template design for all 
 of our runtime hooks, and if our memcpy, memcmp, etc.. are 
 also templated it't turtles all the way down.
memcpy and memcmp are already handled specially by modern compilers, have been for decades; I seriously doubt there's more oil in that well.
Yes, I may fail to achieve any performance benefits, but even if I achieve parity with the current implementation, it will still be a win for me because I will no longer have to depend on the C standard library. And if I have to fork to use D that way, then that's what I'm going to do.

Mike
Jun 11 2018
parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:
 On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:

 I think there might also be optimization opportunities using 
 templates, metaprogramming, and type introspection, that are 
 not currently possible with the current design.
Just making it a template doesn't automatically enable any of this.
I think it does, because I can then generate specific code based on the type information at compile-time.
Also, before you do any more nay-saying, you might want to revisit this talk, https://www.youtube.com/watch?v=endKC3fDxqs, which demonstrates precisely the kind of benefits that can be achieved with these kinds of changes to the compiler/runtime interface.

Mike
Jun 11 2018
parent reply Johannes Pfau <nospam example.com> writes:
Am Mon, 11 Jun 2018 10:54:23 +0000 schrieb Mike Franklin:

 On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:
 On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:

 I think there might also be optimization opportunities using
 templates, metaprogramming, and type introspection, that are not
 currently possible with the current design.
Just making it a template doesn't automatically enable any of this.
I think it does, because I can then generate specific code based on the type information at compile-time.
Also, before you do any more nay-saying, you might want to revisit this talk https://www.youtube.com/watch?v=endKC3fDxqs which demonstrates precisely the kind of benefits that can be achieved with these kinds of changes to the compiler/runtime interface. Mike
I guess for most D runtime hooks, using templates is a good idea to enable inlining and further optimizations.

I understand that you actually need to reimplement memcpy, as in your microcontroller use case you don't want to have any C runtime. So you'll basically have to rewrite the C runtime parts D depends on.

However, I think for memcpy and similar functions you're probably better off keeping the C interface. This directly provides the benefit of compiler intrinsics/optimizations. And marking memcpy as nothrow/pure/@system/@nogc is simple either way: for the D implementation, the compiler will verify this for you; for the C implementation, you have to mark the function depending on the C implementation. But that's mostly trivial.

On a related note, I agree that the compiler sometimes cheats by ignoring attributes, especially when calling TypeInfo-related functions, and this is a huge problem. Runtime TypeInfo is not descriptive enough to fully represent the types, and whenever the compiler casts without properly checking first, there's the possibility of a problem.

-- Johannes
Jun 11 2018
parent Mike Franklin <slavo5150 yahoo.com> writes:
On Monday, 11 June 2018 at 18:34:58 UTC, Johannes Pfau wrote:

 I understand that you actually need to reimplement memcpy, as 
 in your microcontroller usecase you don't want to have any C 
 runtime. So you'll basically have to rewrite the C runtime 
 parts D depends on.

 However, I think for memcpy and similar functions you're 
 probably better off keeping the C interface. This directly 
 provides the benefit of compiler intrinsics/optimizations. And 
 marking memcpy as nothrow/pure/ system/nogc is simple* either 
 way. For the D implementation, the compiler will verify this 
 for you, for the C implementation, you have to mark the 
 function depending on the C implementation. But that's mostly 
 trivial.
My plans go beyond microcontrollers. Mostly, I'd like to be able to use more features of D without having to link in a pre-built runtime. This is especially convenient for cross-compiling scenarios. By replacing the runtime hooks with templates, and the software building blocks with D implementations, we'd no longer need to obtain a C toolchain and a pre-compiled druntime library just to get a build for our target. We'd just need a cross-compiler like LDC or GDC. After adding druntime to the import path, we'd have everything we need for our target. Try, today, to create an LDC cross-compiler for a Windows host and an ARM target like the Raspberry Pi. You'll quickly realize what a hassle it all is.

Another issue with memcpy being generated by compilers is that no one knows what it's actually doing without building and looking at the assembly. And the assembly will change based on what's being built. Have you tried to look at the GCC implementation? I have, and I never want to again. If the implementation were templated in D, we'd be able to see exactly what code would be generated, and when, by simply viewing the D runtime source code. It'd be easier to understand, predict, port, and enhance.

There isn't one single game-changing benefit this endeavor would bring, and I guess that's why it's difficult to comprehend, but collectively, all the little things it would enable would make it quite compelling.

Mike
Jun 11 2018
prev sibling next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/11/18 4:00 AM, Walter Bright wrote:

 (I notice it is doing __doPostblit(). This looks wrong, D allows data to 
 be moved. As far as I can tell with a perfunctory examination, that's 
 the only "can throw" bit.)
No, __doPostblit is necessary -- you are making a copy.

example:

File[] fs = new File[5];

fs[0] = ...; // initialize fs
auto fs2 = fs;
fs.length = 100;

At this point, fs points at a separate block from fs2. If you did not do postblit on this, then when one of those arrays is destroyed, the other will become invalid, as the File reference count would go to 0.

-Steve
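The same effect can be reproduced with a small reference-counted stand-in (a sketch; `Counted` is hypothetical, not std.stdio.File):

```d
// A minimal reference-counted type: the postblit keeps the count
// correct when elements are copied into a newly allocated block.
struct Counted
{
    int* count;   // shared reference count

    static Counted make()
    {
        Counted c;
        c.count = new int;
        *c.count = 1;
        return c;
    }

    this(this) { if (count) ++*count; }  // postblit: a copy adds an owner
    ~this()    { if (count) --*count; }
}
```

With this type, duplicating a `Counted[]` with `.dup` runs the postblit on each element, so both arrays co-own the resource; skipping the postblit would leave the count at 1, and the first array's destructors would drop it to 0 while the other array still referenced it.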
Jun 11 2018
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/11/2018 6:00 AM, Steven Schveighoffer wrote:
 No, __doPostblit is necessary -- you are making a copy.
 
 example:
 
 File[] fs = new File[5];
 
 fs[0] = ...; // initialize fs
 auto fs2 = fs;
 fs.length = 100;
 
 At this point, fs points at a separate block from fs2. If you did not do 
 postblit on this, then when one of those arrays is destroyed, the other will 
 become invalid, as the File reference count would go to 0.
Yes, you're right. This should probably go as a comment in the code in case it comes up again.
Jun 11 2018
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
https://github.com/dlang/druntime/pull/2213
Jun 11 2018
prev sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

 arr1[] = arr2[]; // the compiler makes this memcpy, the 
 optimzer can further do its magic
void memcpyD()
{
    dst = src.dup;
}

void memcpyD2()
{
    dst[] = src[];
}

-----
memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.

btw, what's the difference between the two? If you can't tell, I'm actually not a very good D programmer.

Mike
Jun 10 2018
next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 11/06/2018 1:45 AM, Mike Franklin wrote:
 On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
 
 arr1[] = arr2[]; // the compiler makes this memcpy, the optimzer can 
 further do its magic
 void memcpyD()
 {
     dst = src.dup;
malloc (for slice not static array)
 }
 
 void memcpyD2()
 {
      dst[] = src[];
memcpy
 }
 
 -----
 memcpyD: 1 ms, 725 μs, and 1 hnsec
 memcpyD2: 587 μs and 5 hnsecs
 memcpyASM: 119 μs and 5 hnsecs
 
 Still, the ASM version is much faster.
 
 btw, what's the difference between the two.  If you can't tell, I'm 
 actually not a very good D programmer.
 
 Mike
 
 
 
Jun 10 2018
prev sibling next sibling parent reply Seb <seb wilzba.ch> writes:
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:
 On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:

 arr1[] = arr2[]; // the compiler makes this memcpy, the 
 optimzer can further do its magic
 void memcpyD()
 {
     dst = src.dup;
 }
 
 void memcpyD2()
 {
     dst[] = src[];
 }
 
 -----
 memcpyD: 1 ms, 725 μs, and 1 hnsec
 memcpyD2: 587 μs and 5 hnsecs
 memcpyASM: 119 μs and 5 hnsecs
 
 Still, the ASM version is much faster.
 
 btw, what's the difference between the two?  If you can't tell, I'm 
 actually not a very good D programmer.
 
 Mike
I would increase the test data size, such that you get at least a few seconds. Otherwise the benchmarking won't tell you much, because it's subject to too much randomness.
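Another cheap way to cut that randomness is to repeat each measurement and keep the fastest run, since scheduler and cache interference only ever add time (a sketch; `bestOf` is a hypothetical helper, not part of Phobos):

```d
import std.algorithm.comparison : min;
import std.datetime.stopwatch : benchmark;
import core.time : Duration;

// Run `fun` in batches of `iters` calls, `trials` times, and report the
// fastest batch; the minimum is the least-disturbed measurement.
Duration bestOf(alias fun)(uint trials = 10, uint iters = 1000)
{
    auto best = Duration.max;
    foreach (_; 0 .. trials)
        best = min(best, benchmark!fun(iters)[0]);
    return best;
}
```

Usage would then be e.g. `writeln("memcpyC: ", bestOf!memcpyC());` in place of a single `benchmark!` call.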
Jun 10 2018
parent I love Ice Cream <IloveIcecream. icecreamsandwhich.com> writes:
Don't C implementations already do 90% of what you want? I 
thought most compilers know about and optimize these methods 
based on context. I thought they were *special* in the eyes of 
the compiler already. I think you are fighting a battle pitting 
40 years of tweaking against you...
Jun 10 2018
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/10/2018 6:45 AM, Mike Franklin wrote:
  void memcpyD()
 {
      dst = src.dup;
 }
Note that .dup is doing a GC memory allocation.
Jun 10 2018
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:
 On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:
 memcpyD: 1 ms, 725 μs, and 1 hnsec
 memcpyD2: 587 μs and 5 hnsecs
 memcpyASM: 119 μs and 5 hnsecs

 Still, the ASM version is much faster.
rep movsd is very CPU dependent and needs some preconditions to be fast. For relatively short memory blocks it sucks on many CPUs other than the latest Intels.

See what Agner Fog has to say about it:

16.10 String instructions (all processors)

String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to LOOP on all processors and to JECXZ on some processors.

REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. Always use the largest word size possible (DWORD in 32-bit mode, QWORD in 64-bit mode), and make sure that both source and destination are aligned by the word size. In many cases, however, it is faster to use XMM registers. Moving data in XMM registers is faster than REP MOVSD and REP STOSD in most cases, especially on older processors. See page 164 for details.

Note that while the REP MOVS instruction writes a word to the destination, it reads the next word from the source in the same clock cycle. You can have a cache bank conflict if bits 2-4 are the same in these two addresses on P2 and P3. In other words, you will get a penalty of one clock extra per iteration if ESI+WORDSIZE-EDI is divisible by 32. The easiest way to avoid cache bank conflicts is to align both source and destination by 8. Never use MOVSB or MOVSW in optimized code, not even in 16-bit mode.

On many processors, REP MOVS and REP STOS can perform fast by moving 16 bytes or an entire cache line at a time. This happens only when certain conditions are met. Depending on the processor, the conditions for fast string instructions are, typically, that the count must be high, both source and destination must be aligned, the direction must be forward, the distance between source and destination must be at least the cache line size, and the memory type for both source and destination must be either write-back or write-combining (you can normally assume the latter condition is met).

Under these conditions, the speed is as high as you can obtain with vector register moves, or even faster on some processors. While the string instructions can be quite convenient, it must be emphasized that other solutions are faster in many cases. If the above conditions for fast move are not met, then there is a lot to gain by using other methods. See page 164 for alternatives to REP MOVS.
Jun 10 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/10/2018 9:44 PM, Patrick Schluter wrote:
 See what Agner Fog has to say about it:
Thanks. Agner Fog gets the last word on this topic!
Jun 11 2018
parent David Nadlinger <code klickverbot.at> writes:
On Monday, 11 June 2018 at 08:02:42 UTC, Walter Bright wrote:
 On 6/10/2018 9:44 PM, Patrick Schluter wrote:
 See what Agner Fog has to say about it:
Thanks. Agner Fog gets the last word on this topic!
Well, Agner is rarely wrong indeed, but there is a limit to how much material a single person can keep up to date. On newer uarchs, `rep movsb` isn't slower than `rep movsd`, and often performs similar to the best SSE2 implementation (using NT stores). See "BeeOnRope"'s answer to this StackOverflow question for an in-depth discussion about this: https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy

AVX2 seems to offer extra performance beyond that, though, if it is available (for example if runtime feature detection is used). I believe I read a comment by Agner somewhere to that effect as well – a search engine will certainly be able to turn up more.

— David
Jun 17 2018
prev sibling next sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
 I'm not experienced with this kind of programming, so I'm 
 doubting these results.  Have I done something wrong?  Am I 
 overlooking something?
Hi,

I've spent a lot of time optimizing memcpy. One of the results was that on Intel ICC the compiler intrinsics were unbeatable. Please make one that guarantees the usage of the corresponding backend intrinsic, for example on LLVM.

The reasoning is that small areas of memory (statically known) will be elided to a few byte moves (inlined), whereas the larger ones will call the highly optimized C stdlib implementation. If you use ASM instead of IR, the optimization barriers and register spilling will probably make you less efficient.
Jun 10 2018
parent Mike Franklin <slavo5150 yahoo.com> writes:
On Sunday, 10 June 2018 at 13:17:53 UTC, Guillaume Piolat wrote:

 Please make one that guarantee the usage of the corresponding 
 backend intrinsic, for example on LLVM.
I tested with ldc and got similar results. I thought the implementation in C forwarded to the backend intrinsic. I think even LDC links with GCC, and the distribution of GCC is tuned for a given platform and architecture, so am I not already utilizing GCC's backend intrinsic?

Also, I'm not trying to find the holy grail of memcpy with this endeavor. I just want to find something as good as or better than what druntime is currently using, but implemented in D. Even if it's not optimal, if it's as good as what druntime is currently using, I want to replace those calls in the runtime with a D implementation.

Mike
Jun 10 2018
prev sibling next sibling parent reply David Nadlinger <code klickverbot.at> writes:
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
 I'm not experienced with this kind of programming, so I'm 
 doubting these results.  Have I done something wrong?  Am I 
 overlooking something?
You've just discovered that one can rarely be careful enough with what is benchmarked, or gather enough statistics. For example, check out the following output from running your program on macOS 10.12, compiled with LDC 1.8.0:

---
$ ./test
memcpyD: 2 ms, 570 μs, and 9 hnsecs
memcpyDstdAlg: 77 μs and 2 hnsecs
memcpyC: 74 μs and 1 hnsec
memcpyNaive: 76 μs and 4 hnsecs
memcpyASM: 145 μs and 5 hnsecs
$ ./test
memcpyD: 3 ms and 376 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 104 μs and 4 hnsecs
memcpyNaive: 72 μs and 2 hnsecs
memcpyASM: 181 μs and 8 hnsecs
$ ./test
memcpyD: 2 ms and 565 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 73 μs and 2 hnsecs
memcpyNaive: 71 μs and 9 hnsecs
memcpyASM: 145 μs and 3 hnsecs
$ ./test
memcpyD: 2 ms, 813 μs, and 8 hnsecs
memcpyDstdAlg: 81 μs and 2 hnsecs
memcpyC: 99 μs and 2 hnsecs
memcpyNaive: 74 μs and 2 hnsecs
memcpyASM: 149 μs and 1 hnsec
$ ./test
memcpyD: 2 ms, 593 μs, and 7 hnsecs
memcpyDstdAlg: 77 μs and 3 hnsecs
memcpyC: 75 μs
memcpyNaive: 77 μs and 2 hnsecs
memcpyASM: 145 μs and 5 hnsecs
---

Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest, followed by the ASM implementation. In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further. — David
Jun 10 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/10/2018 11:16 AM, David Nadlinger wrote:
 Because of the large amounts of noise, the only conclusion one can draw from 
 this is that memcpyD is the slowest,
Probably because it does a memory allocation.
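Walter's point can be shown directly: the original memcpyD copied via `.dup`, which allocates a fresh GC array on every call, so allocation rather than copying dominates its timing. An allocation-free variant is just slice assignment:

```d
enum length = 4096 * 2;
ubyte[length] src, dst;

// No allocation: copies src's bytes into dst's existing storage.
// (The .dup version allocated a new GC array on every call.)
void memcpyDNoAlloc()
{
    dst[] = src[];
}
```

Re-running the benchmark with this version should bring memcpyD in line with the other copy loops.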
 followed by the ASM implementation.
The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.
 In fact, memcpyC and memcpyNaive produce exactly the same machine code
(without 
 bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. 
 memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't 
 investigate any further.
This amply illustrates my other point that looking at the assembler generated is crucial to understanding what's happening.
Jun 10 2018
next sibling parent Temtaime <temtaime gmail.com> writes:
On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:
 On 6/10/2018 11:16 AM, David Nadlinger wrote:
 Because of the large amounts of noise, the only conclusion one 
 can draw from this is that memcpyD is the slowest,
Probably because it does a memory allocation.
 followed by the ASM implementation.
The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.
 In fact, memcpyC and memcpyNaive produce exactly the same 
 machine code (without bounds checking), as LLVM recognizes the 
 loop and lowers it into a memcpy. memcpyDstdAlg instead gets 
 turned into a vectorized loop, for reasons I didn't 
 investigate any further.
This amply illustrates my other point that looking at the assembler generated is crucial to understanding what's happening.
On some CPU architectures (for example Intel Atoms), rep movsb is the fastest memcpy.
Jun 10 2018
prev sibling parent reply David Nadlinger <code klickverbot.at> writes:
On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:
 On 6/10/2018 11:16 AM, David Nadlinger wrote:
 Because of the large amounts of noise, the only conclusion one 
 can draw from this is that memcpyD is the slowest,
Probably because it does a memory allocation.
Of course; that was already pointed out earlier in the thread.
 The CPU makers abandoned optimizing the REP instructions 
 decades ago, and just left the clunky implementations there for 
 backwards compatibility.
That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS?); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache lines. I've never measured that myself, though. — David
Jun 10 2018
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/10/2018 4:39 PM, David Nadlinger wrote:
 That's not entirely true. Intel started optimising some of the REP string 
 instructions again on Ivy Bridge and above. There is a CPUID bit to indicate 
 that (ERMS?); I'm sure the Optimization Manual has further details. From what 
 I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel 
 µarchs if the destination is aligned and the data is longer than a few cache 
 lines.
The drama of which instruction mix is faster on which CPU never abates!
Jun 10 2018
parent "Nick Sabalausky (Abscissa)" <SeeWebsiteToContactMe semitwist.com> writes:
On 06/10/2018 08:01 PM, Walter Bright wrote:
 On 6/10/2018 4:39 PM, David Nadlinger wrote:
 That's not entirely true. Intel started optimising some of the REP 
 string instructions again on Ivy Bridge and above. There is a CPUID 
 bit to indicate that (ERMS?); I'm sure the Optimization Manual has 
 further details. From what I remember, `rep movsb` is supposed to beat 
 an AVX loop on most recent Intel µarchs if the destination is aligned 
 and the data is longer than a few cache 
The drama of which instruction mix is faster on which CPU never abates!
In many ways, I really miss 80's machine architecture ;) So simple.
Jun 10 2018
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/10/2018 5:49 AM, Mike Franklin wrote:
 [...]
One source of entropy in the results is src and dst being global variables. Global variables in D are in TLS, and TLS access can be complex (many instructions) and is influenced by the -fPIC switch. Worse, global variable access is not optimized in dmd because of aliasing problems. The solution is to pass src, dst, and length to the copy function as function parameters (and make sure function inlining is off).

In light of this, I want to BEAT THE DEAD HORSE once again and assert that if the assembler generated by a benchmark is not examined, the results can be severely misleading. I've seen this happen again and again. In this case, TLS access is likely being benchmarked, not memcpy.

BTW, the relative timing of rep movsb can be highly dependent on which CPU chip you're using.
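A sketch of that fix, assuming the goal is simply to take TLS out of the measurement:

```d
// Module-level D variables are thread-local by default; __gshared gives
// one shared instance with plain (non-TLS) addressing.
__gshared ubyte[4096] gSrc, gDst;

// Taking the buffers as parameters keeps TLS address computation out of
// the copy loop being measured.
void copyParams(ubyte[] dst, const(ubyte)[] src) nothrow @nogc
{
    dst[] = src[];
}
```

Benchmarking `copyParams(gDst[], gSrc[])` (with inlining disabled, per the advice above) then times the copy itself rather than global-variable access.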
Jun 10 2018
prev sibling next sibling parent solidstate1991 <laszloszeremi outlook.com> writes:
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:
 void memcpyASM()
 {
     auto s = src.ptr;
     auto d = dst.ptr;
     size_t len = length;
      asm pure nothrow @nogc
     {
         mov RSI, s;
         mov RDI, d;
         cld;
         mov RCX, len;
         rep;
         movsb;
     }
 }
Protip: use SSE or AVX for even faster copying.
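A minimal sketch of that SSE direction using `core.simd` (assumptions: a SIMD-capable target such as x86-64, both buffers 16-byte aligned, and the length a multiple of 16; a real implementation needs preludes and tails for everything else):

```d
import core.simd : ubyte16;

// 16-byte-aligned buffers, so aligned MOVAPS/MOVDQA-class moves are legal.
align(16) __gshared ubyte[64] srcBuf, dstBuf;

// Copy 16 bytes per iteration through a SIMD vector register.
void copy16(ubyte* dst, const(ubyte)* src, size_t len) nothrow @nogc
{
    auto d = cast(ubyte16*) dst;
    auto s = cast(const(ubyte16)*) src;
    foreach (_; 0 .. len / 16)
        *d++ = *s++;
}
```

An AVX version would be the same shape with 32-byte vectors, gated on runtime feature detection.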
Jun 10 2018
prev sibling next sibling parent reply Mike Franklin <slavo5150 yahoo.com> writes:
I've modified the test based on the feedback so far, so here's 
what it looks like now:

import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;

void init(ref ubyte[] a)
{
     a.length = length;

     for(int i = 0; i < length; i++)
     {
         a[i] = uniform!ubyte;
     }
}

void verifyResults(ubyte[] a, ubyte[] b)
{
     assert(memcmp(a.ptr, b.ptr, length) == 0);
}

void memcpyD(ubyte[] dst, ubyte[] src)
{
     dst[] = src[];
}

void memcpyDstdAlg(ubyte[] dst, ubyte[] src)
{
     copy(src, dst);
}

void memcpyC(ubyte[] dst, ubyte[] src)
{
     memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive(ubyte[] dst, ubyte[] src)
{
     for(int i = 0; i < length; i++)
     {
         dst[i] = src[i];
     }
}

void memcpyASM(ubyte[] dst, ubyte[] src)
{
     auto s = src.ptr;
     auto d = dst.ptr;
     size_t len = length;
     asm pure nothrow @nogc
     {
         mov RSI, s;
         mov RDI, d;
         cld;
         mov RCX, len;
         rep;
         movsb;
     }
}

Duration benchmark(alias f)(ubyte[] dst, ubyte[] src, uint n)
{
     Duration result;
     auto sw = StopWatch(AutoStart.yes);

     sw.reset();
     foreach (_; 0 .. n)
     {
         f(dst, src);
     }
     result = sw.peek();

     return result;
}

void main()
{
     ubyte[] src;
     ubyte[] dst;

     // verify the integrity of the algorithm
     init(src);
     init(dst);
     memcpyD(dst, src);
     verifyResults(dst, src);

     init(src);
     init(dst);
     memcpyDstdAlg(dst, src);
     verifyResults(dst, src);

     init(src);
     init(dst);
     memcpyC(dst, src);
     verifyResults(dst, src);

     init(src);
     init(dst);
     memcpyNaive(dst, src);
     verifyResults(dst, src);

     init(src);
     init(dst);
     memcpyASM(dst, src);
     verifyResults(dst, src);

     // test the performance of the algorithm
     enum iterations = 1000;
     writeln("memcpyD: ", benchmark!memcpyD(dst, src, iterations));
     writeln("memcpyDstdAlg: ", benchmark!memcpyDstdAlg(dst, src, 
iterations));
     writeln("memcpyC: ", benchmark!memcpyC(dst, src, iterations));
     writeln("memcpyNaive: ", benchmark!memcpyNaive(dst, src, 
iterations));
     writeln("memcpyASM: ", benchmark!memcpyASM(dst, src, 
iterations));
}

The results on my Windows 10 machine (Intel Core i7-6700, 3.4GHz):
memcpyD: 127 ╬╝s and 3 hnsecs
memcpyDstdAlg: 195 ╬╝s and 9 hnsecs
memcpyC: 126 ╬╝s and 7 hnsecs
memcpyNaive: 17 ms, 974 ╬╝s, and 9 hnsecs
memcpyASM: 122 ╬╝s and 8 hnsecs
(Gotta love how windows displays μ)

The results running on Arch Linux 64-bit in a VirtualBox on the 
same Windows 10 machine:
memcpyD: 409 μs
memcpyDstdAlg: 400 μs
memcpyC: 404 μs and 4 hnsecs
memcpyNaive: 17 ms, 251 μs, and 6 hnsecs
memcpyASM: 162 μs and 8 hnsecs

The results appear more sane now, but it seems the behavior is 
highly platform dependent.  Still the ASM is doing well for my 
hardware.  If I run the test multiple times, I do see a lot of 
noise in the results, but each test seems to be affected 
proportionally, so I'm gaining a little more confidence in the 
benchmark.

I still need to analyze the assembly of C's memcpy (anyone know 
where I can find the source code?), test on more platforms, and 
test varying sizes, but I'm just collecting some initial data 
right now, to learn how to proceed.

I'd be interested in those with other platforms reporting back 
their results for their hardware, and of course suggestions for 
how to meet or beat C's memcpy with a pure D implementation.

Thanks for all the feedback so far.

Mike
Jun 10 2018
parent reply Basile B. <b2.b2.temp.temp gmx.gmx.com.com.com> writes:
On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:
 I've modified the test based on the feedback so far, so here's 
 what it looks like now:

 import std.datetime.stopwatch;
 import std.stdio;
 import core.stdc.string;
 import std.random;
 import std.algorithm;

 enum length = 4096 * 2;

 void init(ref ubyte[] a)
 {
     a.length = length;

     for(int i = 0; i < length; i++)
     {
         a[i] = uniform!ubyte;
     }
 }

 void verifyResults(ubyte[] a, ubyte[] b)
 {
     assert(memcmp(a.ptr, b.ptr, length) == 0);
 }

 void memcpyD(ubyte[] dst, ubyte[] src)
 {
     dst[] = src[];
 }

 void memcpyDstdAlg(ubyte[] dst, ubyte[] src)
 {
     copy(src, dst);
 }

 void memcpyC(ubyte[] dst, ubyte[] src)
 {
     memcpy(dst.ptr, src.ptr, length);
 }

 void memcpyNaive(ubyte[] dst, ubyte[] src)
 {
     for(int i = 0; i < length; i++)
     {
         dst[i] = src[i];
     }
 }

 void memcpyASM(ubyte[] dst, ubyte[] src)
 {
     auto s = src.ptr;
     auto d = dst.ptr;
     size_t len = length;
      asm pure nothrow @nogc
     {
         mov RSI, s;
         mov RDI, d;
         cld;
         mov RCX, len;
         rep;
         movsb;
     }
 }

 Duration benchmark(alias f)(ubyte[] dst, ubyte[] src, uint n)
 {
     Duration result;
     auto sw = StopWatch(AutoStart.yes);

     sw.reset();
     foreach (_; 0 .. n)
     {
         f(dst, src);
     }
     result = sw.peek();

     return result;
 }

 void main()
 {
     ubyte[] src;
     ubyte[] dst;

     // verify the integrity of the algorithm
     init(src);
     init(dst);
     memcpyD(dst, src);
     verifyResults(dst, src);

     init(src);
     init(dst);
     memcpyDstdAlg(dst, src);
     verifyResults(dst, src);

     init(src);
     init(dst);
     memcpyC(dst, src);
     verifyResults(dst, src);

     init(src);
     init(dst);
     memcpyNaive(dst, src);
     verifyResults(dst, src);

     init(src);
     init(dst);
     memcpyASM(dst, src);
     verifyResults(dst, src);

     // test the performance of the algorithm
     enum iterations = 1000;
     writeln("memcpyD: ", benchmark!memcpyD(dst, src, 
 iterations));
     writeln("memcpyDstdAlg: ", benchmark!memcpyDstdAlg(dst, 
 src, iterations));
     writeln("memcpyC: ", benchmark!memcpyC(dst, src, 
 iterations));
     writeln("memcpyNaive: ", benchmark!memcpyNaive(dst, src, 
 iterations));
     writeln("memcpyASM: ", benchmark!memcpyASM(dst, src, 
 iterations));
 }

 The results on my Windows 10 machine (Intel Core i7-6700, 
 3.4GHz):
 memcpyD: 127 ╬╝s and 3 hnsecs
 memcpyDstdAlg: 195 ╬╝s and 9 hnsecs
 memcpyC: 126 ╬╝s and 7 hnsecs
 memcpyNaive: 17 ms, 974 ╬╝s, and 9 hnsecs
 memcpyASM: 122 ╬╝s and 8 hnsecs
 (Gotta love how windows displays μ)

 The results running on Arch Linux 64-bit in a VirtualBox on the 
 same Windows 10 machine:
 memcpyD: 409 μs
 memcpyDstdAlg: 400 μs
 memcpyC: 404 μs and 4 hnsecs
 memcpyNaive: 17 ms, 251 μs, and 6 hnsecs
 memcpyASM: 162 μs and 8 hnsecs

 The results appear more sane now, but it seems the behavior is 
 highly platform dependent.  Still the ASM is doing well for my 
 hardware.  If I run the test multiple times, I do see a lot of 
 noise in the results, but each test seems to be affected 
 proportionally, so I'm gaining a little more confidence in the 
 benchmark.

 I still need to analyze the assembly of C's memcpy (anyone know 
 where I can find the source code?),
- default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
- default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c
- not used but interesting: https://github.com/esmil/musl/blob/master/src/string/memcpy.c
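The musl version linked above boils down to a word-at-a-time loop with a byte tail; a minimal D transliteration of that idea (not musl's actual code, and it assumes the target tolerates unaligned word loads, as x86 does):

```d
// Copy size_t-sized chunks, then mop up the remaining bytes.
// Like memcpy, behavior is undefined for overlapping buffers.
void memcpyWords(void* dest, const(void)* source, size_t n) nothrow @nogc
{
    auto d = cast(ubyte*) dest;
    auto s = cast(const(ubyte)*) source;
    for (; n >= size_t.sizeof; n -= size_t.sizeof)
    {
        *cast(size_t*) d = *cast(const(size_t)*) s;
        d += size_t.sizeof;
        s += size_t.sizeof;
    }
    while (n--)
        *d++ = *s++;
}
```

This is roughly the level of sophistication of the libgcc fallback; the glibc implementations that actually run on a desktop Linux box are far more elaborate.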
Jun 10 2018
next sibling parent Basile B. <b2.b2.temp.temp gmx.gmx.com.com.com> writes:
On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:
 On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:
 [...]
- default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
- default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c
- not used but interesting: https://github.com/esmil/musl/blob/master/src/string/memcpy.c
D slice style memcpy cg: https://github.com/dlang/dmd/blob/master/src/dmd/backend/cod2.c#L3526
Jun 11 2018
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/10/2018 8:34 PM, Basile B. wrote:
 - default win32 OMF: 
 https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
I think you mean: https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM
Jun 11 2018
parent Mike Franklin <slavo5150 yahoo.com> writes:
On Monday, 11 June 2018 at 08:05:14 UTC, Walter Bright wrote:
 On 6/10/2018 8:34 PM, Basile B. wrote:
 - default win32 OMF: 
 https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
I think you mean: https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM
Cool! And it's even Boost-licensed. I think I'll try to port that next, before I try the SSE and AVX implementations. Mike
Jun 11 2018
prev sibling parent David Nadlinger <code klickverbot.at> writes:
On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:
 - default linux: 
 https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c
To see what is executed when you call memcpy() on a regular GNU/Linux distro, you'd want to have a look at glibc instead. For example, the AVX2 and AVX512 implementations are in sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S and sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S, as forwarded to by memmove-avx2-unaligned-erms.S and memmove-avx512-unaligned-erms.S. (Pop quiz: Why might a separate "no-vzeroupper" variant be a good idea?) — David
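The multiarch dispatch David describes can be sketched in D with a function pointer resolved once at startup (glibc does this with IFUNC resolvers; here a module constructor plays that role, `core.cpuid.avx2` is assumed to report AVX2 support, and both bodies are placeholders using slice copy so the sketch stays runnable):

```d
import core.cpuid : avx2;

alias CopyFn = void function(ubyte[] dst, const(ubyte)[] src) nothrow @nogc;

// Placeholder implementations; a real copyAVX2 would run a 32-byte
// vector loop instead of plain slice assignment.
void copyBaseline(ubyte[] dst, const(ubyte)[] src) nothrow @nogc { dst[] = src[]; }
void copyAVX2(ubyte[] dst, const(ubyte)[] src) nothrow @nogc { dst[] = src[]; }

__gshared CopyFn bestCopy;

shared static this()
{
    // Resolve once, based on CPUID, before main() runs.
    bestCopy = avx2 ? &copyAVX2 : &copyBaseline;
}
```

After startup, every call through `bestCopy` goes straight to the variant chosen for the host CPU, with no per-call feature check.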
Jun 17 2018
prev sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
BTW the way memcpy is (was?) implemented in the C runtime coming 
from the Intel C++ compiler was really enlightening on the sheer 
difficulty of such a task.

First of all there isn't one loop but many depending on the 
source and destination alignment.

- If both are aligned on 16-byte boundaries, source and 
destination operands would be accessed with MOVAPS/MOVDQA, nothing special
- If only the source or destination was misaligned, the function 
would dispatch to a variant with the core loop loading 16-byte 
aligned and writing 16-byte unaligned, with the PALIGNR 
instruction. However, since PALIGNR can't take a runtime value, 
this variant was _replicated 16 times_.
- I don't remember the case where both source and destination are 
misaligned, but you can reduce it to the case above.

Each of these loops had a complicated prelude handling the first 
iteration, and those are very hard to get right by hand.

It was also the only piece of assembly I've seen that 
(apparently) successfully used the "prefetch" instructions.

This was just the SSE version, AVX was different.

I don't know if someone really wrote this code, or if it was all 
from intrinsics.
Jun 11 2018
parent Walter Bright <newshound2 digitalmars.com> writes:
On 6/11/2018 11:17 AM, Guillaume Piolat wrote:
 I don't know if someone really wrote this code, or if it was all from
intrinsics.
memcpy is so critical to success it is likely written by Intel itself to ensure every drop of perf is wrung out of the CPU. If I were Intel's CEO, I'd direct the CPU hardware guys to do this and give it away.
Jun 11 2018