digitalmars.D - Replacing C's memcpy with a D implementation
- Mike Franklin (119/119) Jun 10 2018 I'm exploring the possibility of implementing some of the basic
- Nicholas Wilson (2/11) Jun 10 2018 what compiler? what flags?
- Mike Franklin (8/20) Jun 10 2018 dmd main.d
- Adam D. Ruppe (10/16) Jun 10 2018 So keep in mind that memcpy is really a magical intrinsic anyway
- Mike Franklin (21/25) Jun 10 2018 My intent is to use the D implementation in the druntime. I'm
- Kagamin (5/14) Jun 10 2018 In safe code you just use assignment and array ops, backend does
- Mike Franklin (6/20) Jun 10 2018 The compiler implementation is faulty. It rewrites the
- Mike Franklin (5/9) Jun 10 2018 Also, understand that this lowering happens in the IR generation
- Walter Bright (2/8) Jun 10 2018 It's only faulty if there's a bug in it. It's essentially @trusted code.
- Mike Franklin (13/22) Jun 10 2018 That only addresses the @safe attribute, and that code is much
- Walter Bright (16/26) Jun 11 2018 Making it a template is not really necessary. The compiler knows if ther...
- Mike Franklin (15/18) Jun 11 2018 There are other reasons to make it a template, though. For
- Walter Bright (7/21) Jun 11 2018 We have no design for this function that doesn't rely on the GC, and the...
- Mike Franklin (19/22) Jun 11 2018 I understand that. I was using `_d_arraysetlengthT` as an
- Mike Franklin (9/20) Jun 11 2018 I think it does, because I can then generate specific code based
- Mike Franklin (7/16) Jun 11 2018 Also, before you do any more nay-saying, you might want to
- Johannes Pfau (19/37) Jun 11 2018 I guess for most D runtime hooks, using templates is a good idea to
- Mike Franklin (26/38) Jun 11 2018 My plans go beyond microcontrollers. Mostly, I'd like to be able
- Steven Schveighoffer (11/14) Jun 11 2018 No, __doPostblit is necessary -- you are making a copy.
- Walter Bright (3/16) Jun 11 2018 Yes, you're right. This should probably go as a comment in the code in c...
- Walter Bright (1/1) Jun 11 2018 https://github.com/dlang/druntime/pull/2213
- Mike Franklin (17/19) Jun 10 2018 void memcpyD()
- rikki cattermole (3/32) Jun 10 2018 memcpy
- Seb (5/24) Jun 10 2018 I would increase the test data size, s.t. you get a least a few
- I love Ice Cream (5/5) Jun 10 2018 Don't C implementations already do 90% of what you want? I
- Walter Bright (2/6) Jun 10 2018 Note that .dup is doing a GC memory allocation.
- Patrick Schluter (46/51) Jun 10 2018 rep movsd is very CPU dependent and needs some preconditions to be
- Walter Bright (2/3) Jun 11 2018 Thanks. Agner Fog gets the last word on this topic!
- David Nadlinger (13/16) Jun 17 2018 Well, Agner is rarely wrong indeed, but there is a limit to how
- Guillaume Piolat (11/14) Jun 10 2018 Hi,
- Mike Franklin (13/15) Jun 10 2018 I tested with ldc and got similar results. I thought the
- David Nadlinger (45/48) Jun 10 2018 You've just discovered the fact that one can rarely be careful
- Walter Bright (6/13) Jun 10 2018 The CPU makers abandoned optimizing the REP instructions decades ago, an...
- Temtaime (3/19) Jun 10 2018 On some cpu architectures(for example intel atoms) rep movsb is
- David Nadlinger (10/17) Jun 10 2018 That's not entirely true. Intel started optimising some of the
- Walter Bright (2/7) Jun 10 2018 The drama of which instruction mix is faster on which CPU never abates!
- Nick Sabalausky (Abscissa) (2/11) Jun 10 2018 In many ways, I really miss 80's machine architecture ;) So simple.
- Walter Bright (13/14) Jun 10 2018 One source of entropy in the results is src and dst being global variabl...
- solidstate1991 (2/17) Jun 10 2018 Protip: Use SSE or AVX for an even faster copying.
- Mike Franklin (131/131) Jun 10 2018 I've modified the test based on the feedback so far, so here's
- Basile B. (7/134) Jun 10 2018 - default win32 OMF:
- Basile B. (3/11) Jun 11 2018 D slice style memcpy cg:
- Walter Bright (3/5) Jun 11 2018 I think you mean:
- Mike Franklin (4/9) Jun 11 2018 Cool! and it's even boost licensed. I think I'll try to port
- David Nadlinger (12/14) Jun 17 2018 To see what is executed when you call memcpy() on a regular
- Guillaume Piolat (21/21) Jun 11 2018 BTW the way memcpy is(was?) implemented in the C runtime coming
- Walter Bright (4/5) Jun 11 2018 memcpy is so critical to success it is likely written by Intel itself to...
I'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations. There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.

The prevailing wisdom has been that there is no way to improve on C's memcpy implementation, given that it has been micro-optimized to death over several decades by many talented members of the human race. So, I threw the following benchmark together to try to get a clue about what I was up against, and in a very short time, I beat the snot out of C's memcpy. The benefit seems to disappear as the array sizes increase, but I believe the vast majority of calls to memcpy are probably quite small.

import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;

ubyte[length] dst;
ubyte[length] src;
auto rnd = Random(42);
ubyte[] src2;
ubyte[] dst2;

void verifyResults()
{
    assert(memcmp(dst.ptr, src.ptr, length) == 0);
    assert(memcmp(dst2.ptr, src2.ptr, length) == 0);
}

void randomizeData()
{
    for(int i = 0; i < length; i++)
    {
        src[i] = uniform!ubyte;
        dst[i] = uniform!ubyte;
    }

    src2 = src;
    dst2 = dst;
}

void memcpyD()
{
    dst = src.dup;
}

void memcpyDstdAlg()
{
    copy(src2, dst2);
}

void memcpyC()
{
    memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive()
{
    for(int i = 0; i < length; i++)
    {
        dst[i] = src[i];
    }
}

void memcpyASM()
{
    auto s = src.ptr;
    auto d = dst.ptr;
    size_t len = length;
    asm pure nothrow @nogc
    {
        mov RSI, s;
        mov RDI, d;
        cld;
        mov RCX, len;
        rep;
        movsb;
    }
}

void main()
{
    // verify the integrity of the algorithm
    randomizeData();
    memcpyD();
    verifyResults();

    randomizeData();
    memcpyDstdAlg();
    verifyResults();

    randomizeData();
    memcpyC();
    verifyResults();

    randomizeData();
    memcpyNaive();
    verifyResults();

    randomizeData();
    memcpyASM();
    verifyResults();

    // test the performance of the algorithm
    auto r = benchmark!(memcpyD, memcpyDstdAlg, memcpyC, memcpyNaive, memcpyASM)(1000);
    Duration memcpyDResult = r[0];
    Duration memcpyDstdAlgResult = r[1];
    Duration memcpyCResult = r[2];
    Duration memcpyNaiveResult = r[3];
    Duration memcpyASMResult = r[4];

    writeln("memcpyD: ", memcpyDResult);
    writeln("memcpyDstdAlg: ", memcpyDstdAlgResult);
    writeln("memcpyC: ", memcpyCResult);
    writeln("memcpyNaive: ", memcpyNaiveResult);
    writeln("memcpyASM: ", memcpyASMResult);
}

------ Output --------
memcpyD: 1 ms, 772 μs, and 4 hnsecs
memcpyDstdAlg: 531 μs and 8 hnsecs
memcpyC: 371 μs and 3 hnsecs
memcpyNaive: 21 ms, 572 μs, and 2 hnsecs
memcpyASM: 119 μs and 6 hnsecs

I'm not experienced with this kind of programming, so I'm doubting these results. Have I done something wrong? Am I overlooking something?

Thanks,
Mike
Jun 10 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:I'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations. There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from safe code. [...]what compiler? what flags?
Jun 10 2018
On Sunday, 10 June 2018 at 13:05:33 UTC, Nicholas Wilson wrote:On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:dmd main.d Arch Linux x86_64 DMD, with no flags. I'm not compiling C's memcpy, and the significant D implementation is memcpyASM which is all inline assembly, so I don't see how the compiler flags will make any difference, anyway. MikeI'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations. There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from safe code. [...]what compiler? what flags?
Jun 10 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:D utilizes from the C library with D implementations. There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.So keep in mind that memcpy is really a magical intrinsic anyway and optimizers frequently don't actually call a function, but rather see the size and replace the instructions inline (like it might replace it with just a couple movs instead of something fancy).

And D already has it built in as well for @safe etc:

arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic

so be sure to check against that too.
Jun 10 2018
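[Editor's note: for readers unfamiliar with the slice-copy syntax Adam mentions, here is a minimal @safe illustration; the variable names are made up.]

void sliceCopyDemo() @safe
{
    ubyte[8] a = [1, 2, 3, 4, 5, 6, 7, 8];
    ubyte[8] b;

    b[] = a[];          // element-wise copy; the compiler lowers this to a
                        // memcpy-style primitive for plain-old-data elements
    assert(b[3] == 4);

    // b[0 .. 4] = a[]; // mismatched lengths fail the runtime bounds check
}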
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:And D already has it built in as well for @safe etc: arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic so be sure to check against that too.My intent is to use the D implementation in the druntime. I'm trying to replace the runtime hooks with templates, so we don't have to rely so much on `TypeInfo` and we can begin to use more of D without linking in the runtime. But one thing I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us. If I replace the runtime hook with a template, I need to be honest, and that means all that low-level code in druntime that the runtime hook depends on needs to be @safe, nothrow, and pure. So I'm starting with the fundamental dependencies on the C library, trying to ensure they can be replaced with D implementations and used in D idiomatically.

For example, memcpy might look like this:

void memcpy(T)(T[] dest, T[] src);

We can extract size, alignment etc.. at compile-time. So, my goal is not for user-facing code, but for improving the druntime implementations. Does that make sense?

Mike
Jun 10 2018
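[Editor's note: as a sketch of how such a template could exploit compile-time information. The name, the size-based dispatch, and the alignment assumption are all hypothetical, not an actual druntime design.]

// Hypothetical sketch: a type-aware copy for plain-old-data element types.
// Because T is known at compile time, T.sizeof and T.alignof are constants
// the implementation (and the optimizer) can exploit.
void memcpyD(T)(T[] dest, const(T)[] src) @trusted pure nothrow @nogc
{
    assert(dest.length == src.length);

    static if (T.sizeof >= size_t.sizeof)
    {
        // elements at least word-sized: a plain element loop already moves
        // data in aligned, word-sized (or larger) chunks
        foreach (i; 0 .. src.length)
            dest[i] = src[i];
    }
    else
    {
        // tiny elements: move words, then a byte tail
        // (assumes word-aligned buffers; a real implementation would check)
        auto nbytes = src.length * T.sizeof;
        auto dw = cast(size_t*) dest.ptr;
        auto sw = cast(const(size_t)*) src.ptr;
        foreach (i; 0 .. nbytes / size_t.sizeof)
            dw[i] = sw[i];
        foreach (i; nbytes - (nbytes % size_t.sizeof) .. nbytes)
            (cast(ubyte*) dest.ptr)[i] = (cast(const(ubyte)*) src.ptr)[i];
    }
}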
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.In @safe code you just use assignment and array ops, backend does the rest. On Sunday, 10 June 2018 at 13:27:04 UTC, Mike Franklin wrote:But one thing I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us.If the compiler can't get it right then who can?
Jun 10 2018
On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.In @safe code you just use assignment and array ops, backend does the rest. On Sunday, 10 June 2018 at 13:27:04 UTC, Mike Franklin wrote:But one thing I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us.If the compiler can't get it right then who can?The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure:

https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442

The backend is not involved.

Mike
Jun 10 2018
On Monday, 11 June 2018 at 02:49:00 UTC, Mike Franklin wrote:The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure: https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442 The backend is not involved.Also, understand that this lowering happens in the IR generation stage:

https://github.com/dlang/dmd/blob/3a79629988efd51d4dda9edb38a6701cd097da89/src/dmd/e2ir.d#L2616

so semantic analysis is not performed on the runtime call. If it were, it would fail to compile.

Mike
Jun 10 2018
On 6/10/2018 7:49 PM, Mike Franklin wrote:On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:If the compiler can't get it right then who can?The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure: https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442 The backend is not involved.It's only faulty if there's a bug in it. It's essentially @trusted code.
Jun 10 2018
On Monday, 11 June 2018 at 03:31:05 UTC, Walter Bright wrote:On 6/10/2018 7:49 PM, Mike Franklin wrote:On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:If the compiler can't get it right then who can?The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure: https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442 The backend is not involved.It's only faulty if there's a bug in it. It's essentially @trusted code.That only addresses the @safe attribute, and that code is much too complex for anyone to audit it and certify it as @safe. Exceptions are also not all handled, so there is no way it can pass as nothrow.

The runtime call needs to be replaced with a template that can take advantage of D's compile-time type information instead of `TypeInfo`, but doing so will force the implementation through semantic processing, which, in its current state, will fail to compile. It needs to be refactored, and creating @safe, nothrow, pure building blocks will help with that, hence the topic of this thread.

Mike
Jun 10 2018
On 6/10/2018 8:43 PM, Mike Franklin wrote:That only addresses the @safe attribute, and that code is much too complex for anyone to audit it and certify it as @safe. Exceptions are also not all handled, so there is no way it can pass as nothrow. The runtime call needs to be replaced with a template that can take advantage of D's compile-time type information instead of `TypeInfo`, but doing so will force the implementation through semantic processing, which, in its current state, will fail to compile. It needs to be refactored, and creating @safe, nothrow, pure building blocks will help with that, hence the topic of this thread.Making it a template is not really necessary. The compiler knows if there is the possibility of it throwing based on the type; it doesn't need to infer it.

(I notice it is doing __doPostblit(). This looks wrong, D allows data to be moved. As far as I can tell with a perfunctory examination, that's the only "can throw" bit.)

Doing without TypeInfo is a worthy goal most of the time, but I don't think it is of much value here because it calls the garbage collector anyway, which requires TypeInfo.

Some obvious refactoring tasks for it would be to:

1. eliminate the gotos
2. use single assignment for variables
3. minimize the scopes of variables
4. use const
5. document the inputs and outputs
6. add test cases for the red lines
Jun 11 2018
On Monday, 11 June 2018 at 08:00:10 UTC, Walter Bright wrote:Making it a template is not really necessary. The compiler knows if there is the possibility of it throwing based on the type, it doesn't need to infer it.There are other reasons to make it a template, though. For example, if it were a template, it would not rely on `TypeInfo` and could then be used in -betterC-like situations. For setting the length of an array, there are issues with memory allocation that hinder that, but I'm thinking beyond just setting the length of an array here; I just used that as an illustrative example.

I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.

So there are other reasons to pursue a template design for all of our runtime hooks, and if our memcpy, memcmp, etc.. are also templated it's turtles all the way down. We have to be careful about code bloat with templates, but I think that can be mitigated.

Mike
Jun 11 2018
On 6/11/2018 1:12 AM, Mike Franklin wrote:On Monday, 11 June 2018 at 08:00:10 UTC, Walter Bright wrote:Making it a template is not really necessary. The compiler knows if there is the possibility of it throwing based on the type, it doesn't need to infer it.There are other reasons to make it a template, though. For example, if it were a template, it would not rely on `TypeInfo` and could then be used in -betterC-like situations.We have no design for this function that doesn't rely on the GC, and the GC needs TypeInfo. This function is not usable with betterC with or without the TypeInfo argument.I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.Just making it a template doesn't automatically enable any of this.So there are other reasons to pursue a template design for all of our runtime hooks, and if our memcpy, memcmp, etc.. are also templated it's turtles all the way down.memcpy and memcmp are already handled specially by modern compilers, have been for decades; I seriously doubt there's more oil in that well.
Jun 11 2018
On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:We have no design for this function that doesn't rely on the GC, and the GC needs TypeInfo. This function is not usable with betterC with or without the TypeInfo argument.I understand that. I was using `_d_arraysetlengthT` as an example of problems I encountered trying to convert these runtime hooks into templates, specifically illustrating how the compiler does not enforce the guarantees of @safe, nothrow, and pure, and, in fact, completely disregards them.

Converting *any* runtime hook, not just `_d_arraysetlengthT`, to a template comes with a whole host of problems because of the aforementioned issue. As soon as it's implemented as a template it is semantically analyzed in the context of its caller, so @safe, nothrow, pure, etc. all need to be accommodated. It's good to have the compiler check for that stuff; it's unfortunate the current implementation does not, and that's the fault in the current implementation.

There are other runtime hooks that are good candidates for templates, even if `_d_arraysetlengthT` isn't one of them. Again, I just chose `_d_arraysetlengthT` as an example, because it is the one I had experience with.

Mike
Jun 11 2018
On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.Just making it a template doesn't automatically enable any of this.I think it does, because I can then generate specific code based on the type information at compile-time.So there are other reasons to pursue a template design for all of our runtime hooks, and if our memcpy, memcmp, etc.. are also templated it's turtles all the way down.memcpy and memcmp are already handled specially by modern compilers, have been for decades; I seriously doubt there's more oil in that well.Yes, I may fail to achieve any performance benefits, but even if I achieve parity with the current implementation it will still be a win for me because I will no longer have to depend on the C standard library. And if I have to fork to use D that way, then that's what I'm going to do.

Mike
Jun 11 2018
On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.Just making it a template doesn't automatically enable any of this.I think it does, because I can then generate specific code based on the type information at compile-time.Also, before you do any more nay-saying, you might want to revisit this talk https://www.youtube.com/watch?v=endKC3fDxqs which demonstrates precisely the kind of benefits that can be achieved with these kinds of changes to the compiler/runtime interface.

Mike
Jun 11 2018
On Mon, 11 Jun 2018 10:54:23 +0000, Mike Franklin wrote:On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.Just making it a template doesn't automatically enable any of this.I think it does, because I can then generate specific code based on the type information at compile-time.Also, before you do any more nay-saying, you might want to revisit this talk https://www.youtube.com/watch?v=endKC3fDxqs which demonstrates precisely the kind of benefits that can be achieved with these kinds of changes to the compiler/runtime interface. MikeI guess for most D runtime hooks, using templates is a good idea to enable inlining and further optimizations. I understand that you actually need to reimplement memcpy, as in your microcontroller use case you don't want to have any C runtime. So you'll basically have to rewrite the C runtime parts D depends on.

However, I think for memcpy and similar functions you're probably better off keeping the C interface. This directly provides the benefit of compiler intrinsics/optimizations. And marking memcpy as nothrow/pure/@system/@nogc is simple either way. For the D implementation, the compiler will verify this for you; for the C implementation, you have to mark the function depending on the C implementation. But that's mostly trivial.

On a related note, I agree that the compiler sometimes cheats by ignoring attributes, especially when calling TypeInfo-related functions, and this is a huge problem. Runtime TypeInfo is not descriptive enough to fully represent the types, and whenever the compiler casts without properly checking first, there's the possibility of a problem.

-- Johannes
Jun 11 2018
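[Editor's note: the attribute-on-the-C-declaration approach Johannes describes is roughly what druntime's core.stdc.string already does; the declaration below is paraphrased from that module, not copied exactly.]

// Paraphrased from core.stdc.string: the extern(C) symbol is simply
// declared with the attributes the C implementation is trusted to satisfy.
extern (C) pure nothrow @nogc @system
void* memcpy(return scope void* s1, scope const void* s2, size_t n);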
On Monday, 11 June 2018 at 18:34:58 UTC, Johannes Pfau wrote:I understand that you actually need to reimplement memcpy, as in your microcontroller use case you don't want to have any C runtime. So you'll basically have to rewrite the C runtime parts D depends on. However, I think for memcpy and similar functions you're probably better off keeping the C interface. This directly provides the benefit of compiler intrinsics/optimizations. And marking memcpy as nothrow/pure/@system/@nogc is simple either way. For the D implementation, the compiler will verify this for you; for the C implementation, you have to mark the function depending on the C implementation. But that's mostly trivial.My plans go beyond microcontrollers. Mostly, I'd like to be able to use more features of D without having to link in a pre-built runtime. This is especially convenient for cross-compiling scenarios. By replacing the runtime hooks with templates, and the software building blocks with D implementations, we'd no longer need to obtain a C toolchain and a pre-compiled druntime library just to get a build for our target. We'd just need a cross-compiler like LDC or GDC. After adding druntime to the import path, we'd have everything we need for our target. Try, today, to create an LDC cross-compiler for a Windows host and an ARM target like the Raspberry Pi. You'll quickly realize what a hassle it all is.

Another issue with memcpy being generated by compilers is no one knows what it's actually doing without building and looking at the assembly. And the assembly will change based on what's being built. Have you tried to look at the GCC implementation? I have, and I never want to again. If the implementation were templated in D, we'd be able to see exactly what code would be generated, and when, by simply viewing the D runtime source code. It'd be easier to understand, predict, port, and enhance.

There isn't one single game-changing benefit this endeavor would bring, and I guess that's why it's difficult to comprehend, but collectively, all the little things it would enable would make it quite compelling.

Mike
Jun 11 2018
On 6/11/18 4:00 AM, Walter Bright wrote:(I notice it is doing __doPostblit(). This looks wrong, D allows data to be moved. As far as I can tell with a perfunctory examination, that's the only "can throw" bit.)No, __doPostblit is necessary -- you are making a copy.

example:

File[] fs = new File[5];
fs[0] = ...; // initialize fs
auto fs2 = fs;
fs.length = 100;

At this point, fs points at a separate block from fs2. If you did not do postblit on this, then when one of those arrays is destroyed, the other will become invalid, as the File reference count would go to 0.

-Steve
Jun 11 2018
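[Editor's note: a runnable way to watch the postblit fire on a length extension. The counter struct is hypothetical, and whether the block relocates (and therefore the exact copy count) depends on the allocator.]

import std.stdio : writeln;

struct Counted
{
    static int copies;        // counts postblit invocations (illustration only)
    int payload;
    this(this) { ++copies; }  // postblit: runs once per copied element
}

void main()
{
    auto a = new Counted[4];
    auto b = a;           // b shares a's block; no copy, no postblit
    a.length = 1000;      // too large to extend in place, so the runtime
                          // relocates a and postblits the 4 live elements
    writeln(Counted.copies); // prints 4 here; b still sees the old block
}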
On 6/11/2018 6:00 AM, Steven Schveighoffer wrote:No, __doPostblit is necessary -- you are making a copy.

example:

File[] fs = new File[5];
fs[0] = ...; // initialize fs
auto fs2 = fs;
fs.length = 100;

At this point, fs points at a separate block from fs2. If you did not do postblit on this, then when one of those arrays is destroyed, the other will become invalid, as the File reference count would go to 0.Yes, you're right. This should probably go as a comment in the code in case it comes up again.
Jun 11 2018
https://github.com/dlang/druntime/pull/2213
Jun 11 2018
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic

void memcpyD()
{
    dst = src.dup;
}

void memcpyD2()
{
    dst[] = src[];
}

-----
memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.

btw, what's the difference between the two? If you can't tell, I'm actually not a very good D programmer.

Mike
Jun 10 2018
On 11/06/2018 1:45 AM, Mike Franklin wrote:

    void memcpyD() { dst = src.dup; }

malloc (for slice not static array)

    void memcpyD2() { dst[] = src[]; }

memcpy

[...]
Jun 10 2018
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:[...]memcpyD: 1 ms, 725 μs, and 1 hnsec memcpyD2: 587 μs and 5 hnsecs memcpyASM: 119 μs and 5 hnsecs[...]I would increase the test data size, such that you get at least a few seconds. Otherwise the benchmarking won't tell you much, because it's subject to too much randomness.
Jun 10 2018
Don't C implementations already do 90% of what you want? I thought most compilers know about and optimize these methods based on context. I thought they were *special* in the eyes of the compiler already. I think you are fighting a battle pitting 40 years of tweaking against you...
Jun 10 2018
On 6/10/2018 6:45 AM, Mike Franklin wrote:void memcpyD() { dst = src.dup; }Note that .dup is doing a GC memory allocation.
Jun 10 2018
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:memcpyD: 1 ms, 725 μs, and 1 hnsec memcpyD2: 587 μs and 5 hnsecs memcpyASM: 119 μs and 5 hnsecs Still, the ASM version is much faster.rep movsd is very CPU dependent and needs some preconditions to be fast. For relatively short memory blocks it sucks on many CPUs other than the latest Intels. See what Agner Fog has to say about it:

16.10 String instructions (all processors)

String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to LOOP on all processors and to JECXZ on some processors. REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. Always use the largest word size possible (DWORD in 32-bit mode, QWORD in 64-bit mode), and make sure that both source and destination are aligned by the word size. In many cases, however, it is faster to use XMM registers. Moving data in XMM registers is faster than REP MOVSD and REP STOSD in most cases, especially on older processors. See page 164 for details.

Note that while the REP MOVS instruction writes a word to the destination, it reads the next word from the source in the same clock cycle. You can have a cache bank conflict if bit 2-4 are the same in these two addresses on P2 and P3. In other words, you will get a penalty of one clock extra per iteration if ESI+WORDSIZE-EDI is divisible by 32. The easiest way to avoid cache bank conflicts is to align both source and destination by 8. Never use MOVSB or MOVSW in optimized code, not even in 16-bit mode.

On many processors, REP MOVS and REP STOS can perform fast by moving 16 bytes or an entire cache line at a time. This happens only when certain conditions are met. Depending on the processor, the conditions for fast string instructions are, typically, that the count must be high, both source and destination must be aligned, the direction must be forward, the distance between source and destination must be at least the cache line size, and the memory type for both source and destination must be either write-back or write-combining (you can normally assume the latter condition is met). Under these conditions, the speed is as high as you can obtain with vector register moves or even faster on some processors.

While the string instructions can be quite convenient, it must be emphasized that other solutions are faster in many cases. If the above conditions for fast move are not met then there is a lot to gain by using other methods. See page 164 for alternatives to REP MOVS
Jun 10 2018
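[Editor's note: translated into the inline-asm style used earlier in the thread, Agner's "largest word size" advice looks roughly like the sketch below. It assumes x86_64, 8-byte-aligned pointers, and that DMD's inline assembler accepts movsq; it is not benchmarked.]

void memcpyQword(void* d, const(void)* s, size_t len) pure nothrow @nogc
{
    size_t qwords = len / 8;  // bulk of the copy in QWORD units
    size_t tail = len % 8;    // leftover bytes
    asm pure nothrow @nogc
    {
        mov RSI, s;
        mov RDI, d;
        cld;
        mov RCX, qwords;
        rep;
        movsq;                // QWORD moves for the bulk, per Agner's advice
        mov RCX, tail;
        rep;
        movsb;                // byte moves for the short tail
    }
}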
On 6/10/2018 9:44 PM, Patrick Schluter wrote:See what Agner Fog has to say about it:Thanks. Agner Fog gets the last word on this topic!
Jun 11 2018
On Monday, 11 June 2018 at 08:02:42 UTC, Walter Bright wrote:On 6/10/2018 9:44 PM, Patrick Schluter wrote:Well, Agner is rarely wrong indeed, but there is a limit to how much material a single person can keep up to date. On newer uarchs, `rep movsb` isn't slower than `rep movsd`, and often performs similar to the best SSE2 implementation (using NT stores). See "BeeOnRope"'s answer to this StackOverflow question for an in-depth discussion about this: https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy AVX2 seems to offer extra performance beyond that, though, if it is available (for example if runtime feature detection is used). I believe I read a comment by Agner somewhere to that effect as well – a search engine will certainly be able to turn up more. — DavidSee what Agner Fog has to say about it:Thanks. Agner Fog gets the last word on this topic!
Jun 17 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:I'm not experienced with this kind of programming, so I'm doubting these results. Have I done something wrong? Am I overlooking something?Hi, I've spent a lot of time optimizing memcpy. One of the results was that on Intel ICC the compiler intrinsics were unbeatable. Please make one that guarantees the usage of the corresponding backend intrinsic, for example on LLVM. The reasoning is that small areas of memory (statically known) will be elided to a few byte moves (inlined), whereas the larger ones will call the highly optimized C stdlib. If you use ASM instead of IR, the optimization barriers and register spilling will probably make you less efficient.
Jun 10 2018
On Sunday, 10 June 2018 at 13:17:53 UTC, Guillaume Piolat wrote:Please make one that guarantees the usage of the corresponding backend intrinsic, for example on LLVM.I tested with ldc and got similar results. I thought the implementation in C forwarded to the backend intrinsic. I think even LDC links with GCC, and the distribution of GCC is tuned for a given platform and architecture, so am I not already utilizing GCC's backend intrinsic? Also, I'm not trying to find the holy grail of memcpy with this endeavor. I just want to find something as good or better than what the druntime is currently using, but implemented in D. Even if it's not optimal, if it's as good as what druntime is currently using, I want to replace those calls in the runtime with a D implementation. Mike
Jun 10 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:I'm not experienced with this kind of programming, so I'm doubting these results. Have I done something wrong? Am I overlooking something?You've just discovered the fact that one can rarely be careful enough with what is benchmarked, or have enough statistics. For example, check out the following output from running your program on macOS 10.12, compiled with LDC 1.8.0:

---
$ ./test
memcpyD: 2 ms, 570 μs, and 9 hnsecs
memcpyDstdAlg: 77 μs and 2 hnsecs
memcpyC: 74 μs and 1 hnsec
memcpyNaive: 76 μs and 4 hnsecs
memcpyASM: 145 μs and 5 hnsecs

$ ./test
memcpyD: 3 ms and 376 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 104 μs and 4 hnsecs
memcpyNaive: 72 μs and 2 hnsecs
memcpyASM: 181 μs and 8 hnsecs

$ ./test
memcpyD: 2 ms and 565 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 73 μs and 2 hnsecs
memcpyNaive: 71 μs and 9 hnsecs
memcpyASM: 145 μs and 3 hnsecs

$ ./test
memcpyD: 2 ms, 813 μs, and 8 hnsecs
memcpyDstdAlg: 81 μs and 2 hnsecs
memcpyC: 99 μs and 2 hnsecs
memcpyNaive: 74 μs and 2 hnsecs
memcpyASM: 149 μs and 1 hnsec

$ ./test
memcpyD: 2 ms, 593 μs, and 7 hnsecs
memcpyDstdAlg: 77 μs and 3 hnsecs
memcpyC: 75 μs
memcpyNaive: 77 μs and 2 hnsecs
memcpyASM: 145 μs and 5 hnsecs
---

Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest, followed by the ASM implementation. In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.

— David
Jun 10 2018
On 6/10/2018 11:16 AM, David Nadlinger wrote:Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,Probably because it does a memory allocation.followed by the ASM implementation.The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.This amply illustrates my other point that looking at the assembler generated is crucial to understanding what's happening.
Jun 10 2018
On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:On 6/10/2018 11:16 AM, David Nadlinger wrote:Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,Probably because it does a memory allocation.followed by the ASM implementation.The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.This amply illustrates my other point that looking at the assembler generated is crucial to understanding what's happening.On some CPU architectures (for example Intel Atoms) rep movsb is the fastest memcpy.
Jun 10 2018
On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:On 6/10/2018 11:16 AM, David Nadlinger wrote:Of course; that was already pointed out earlier in the thread.Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,Probably because it does a memory allocation.The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS?); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache lines. I've never measured that myself, though. — David
Jun 10 2018
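[Editor's note: the feature bit David mentions is CPUID leaf 7, sub-leaf 0, EBX bit 9 (ERMS, "Enhanced REP MOVSB/STOSB" in the Intel manuals). A detection sketch in the same inline-asm style as the rest of the thread:]

bool hasERMS() nothrow @nogc
{
    uint features;
    asm nothrow @nogc
    {
        push RBX;           // RBX is callee-saved and CPUID clobbers it
        mov EAX, 7;         // structured extended feature flags leaf
        xor ECX, ECX;       // sub-leaf 0
        cpuid;
        mov features, EBX;
        pop RBX;
    }
    return (features & (1 << 9)) != 0; // EBX bit 9 = ERMS
}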
On 6/10/2018 4:39 PM, David Nadlinger wrote:That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS?); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache lines.The drama of which instruction mix is faster on which CPU never abates!
Jun 10 2018
On 06/10/2018 08:01 PM, Walter Bright wrote:On 6/10/2018 4:39 PM, David Nadlinger wrote:That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS?); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache lines.The drama of which instruction mix is faster on which CPU never abates!In many ways, I really miss 80's machine architecture ;) So simple.
Jun 10 2018
On 6/10/2018 5:49 AM, Mike Franklin wrote:[...]One source of entropy in the results is src and dst being global variables. Global variables in D are in TLS, and TLS access can be complex (many instructions) and is influenced by the -fPIC switch. Worse, global variable access is not optimized in dmd because of aliasing problems. The solution is to pass src, dst, and length to the copy function as function parameters (and make sure function inlining is off). In light of this, I want to BEAT THE DEAD HORSE once again and assert that if the assembler generated by a benchmark is not examined, the results can be severely misleading. I've seen this happen again and again. In this case, TLS access is likely being benchmarked, not memcpy. BTW, the relative timing of rep movsb can be highly dependent on which CPU chip you're using.
Jun 10 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:

    void memcpyASM()
    {
        auto s = src.ptr;
        auto d = dst.ptr;
        size_t len = length;
        asm pure nothrow @nogc
        {
            mov RSI, s;
            mov RDI, d;
            cld;
            mov RCX, len;
            rep;
            movsb;
        }
    }

Protip: Use SSE or AVX for even faster copying.
Jun 10 2018
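[Editor's note: in D that tip doesn't require inline asm; core.simd can express a vector copy directly. A minimal sketch, assuming both buffers are 16-byte aligned and len is a multiple of 16; a real implementation needs prologue/epilogue code for everything else.]

import core.simd : void16;

void memcpySSE(void* d, const(void)* s, size_t len) pure nothrow @nogc
{
    // assumes len % 16 == 0 and 16-byte alignment of both pointers
    auto dv = cast(void16*) d;
    auto sv = cast(const(void16)*) s;
    foreach (i; 0 .. len / 16)
        dv[i] = sv[i];  // each iteration is one 16-byte vector load + store
}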
I've modified the test based on the feedback so far, so here's what it looks like now:

import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;

void init(ref ubyte[] a)
{
    a.length = length;
    for(int i = 0; i < length; i++)
    {
        a[i] = uniform!ubyte;
    }
}

void verifyResults(ubyte[] a, ubyte[] b)
{
    assert(memcmp(a.ptr, b.ptr, length) == 0);
}

void memcpyD(ubyte[] dst, ubyte[] src)
{
    dst[] = src[];
}

void memcpyDstdAlg(ubyte[] dst, ubyte[] src)
{
    copy(src, dst);
}

void memcpyC(ubyte[] dst, ubyte[] src)
{
    memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive(ubyte[] dst, ubyte[] src)
{
    for(int i = 0; i < length; i++)
    {
        dst[i] = src[i];
    }
}

void memcpyASM(ubyte[] dst, ubyte[] src)
{
    auto s = src.ptr;
    auto d = dst.ptr;
    size_t len = length;
    asm pure nothrow @nogc
    {
        mov RSI, s;
        mov RDI, d;
        cld;
        mov RCX, len;
        rep;
        movsb;
    }
}

Duration benchmark(alias f)(ubyte[] dst, ubyte[] src, uint n)
{
    Duration result;
    auto sw = StopWatch(AutoStart.yes);

    sw.reset();
    foreach (_; 0 .. n)
    {
        f(dst, src);
    }
    result = sw.peek();

    return result;
}

void main()
{
    ubyte[] src;
    ubyte[] dst;

    // verify the integrity of the algorithm
    init(src);
    init(dst);
    memcpyD(dst, src);
    verifyResults(dst, src);

    init(src);
    init(dst);
    memcpyDstdAlg(dst, src);
    verifyResults(dst, src);

    init(src);
    init(dst);
    memcpyC(dst, src);
    verifyResults(dst, src);

    init(src);
    init(dst);
    memcpyNaive(dst, src);
    verifyResults(dst, src);

    init(src);
    init(dst);
    memcpyASM(dst, src);
    verifyResults(dst, src);

    // test the performance of the algorithm
    enum iterations = 1000;
    writeln("memcpyD: ", benchmark!memcpyD(dst, src, iterations));
    writeln("memcpyDstdAlg: ", benchmark!memcpyDstdAlg(dst, src, iterations));
    writeln("memcpyC: ", benchmark!memcpyC(dst, src, iterations));
    writeln("memcpyNaive: ", benchmark!memcpyNaive(dst, src, iterations));
    writeln("memcpyASM: ", benchmark!memcpyASM(dst, src, iterations));
}

The results on my Windows 10 machine (Intel Core i7-6700, 3.4GHz):

memcpyD: 127 ╬╝s and 3 hnsecs
memcpyDstdAlg: 195 ╬╝s and 9 hnsecs
memcpyC: 126 ╬╝s and 7 hnsecs
memcpyNaive: 17 ms, 974 ╬╝s, and 9 hnsecs
memcpyASM: 122 ╬╝s and 8 hnsecs

(Gotta love how Windows displays μ)

The results running on Arch Linux 64-bit in a VirtualBox on the same Windows 10 machine:

memcpyD: 409 μs
memcpyDstdAlg: 400 μs
memcpyC: 404 μs and 4 hnsecs
memcpyNaive: 17 ms, 251 μs, and 6 hnsecs
memcpyASM: 162 μs and 8 hnsecs

The results appear more sane now, but it seems the behavior is highly platform dependent. Still, the ASM is doing well for my hardware. If I run the test multiple times, I do see a lot of noise in the results, but each test seems to be affected proportionally, so I'm gaining a little more confidence in the benchmark.

I still need to analyze the assembly of C's memcpy (anyone know where I can find the source code?), test on more platforms, and test varying sizes, but I'm just collecting some initial data right now, to learn how to proceed. I'd be interested in those with other platforms reporting back their results for their hardware, and of course suggestions for how to meet or beat C's memcpy with a pure D implementation.

Thanks for all the feedback so far.

Mike
Jun 10 2018
On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:[...]I still need to analyze the assembly of C's memcpy (anyone know where I can find the source code?)[...]- default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
- default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c
- not used but interesting: https://github.com/esmil/musl/blob/master/src/string/memcpy.c
Jun 10 2018
On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:D slice style memcpy cg: https://github.com/dlang/dmd/blob/master/src/dmd/backend/cod2.c#L3526[...]- default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C - default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c - not used but interesting: https://github.com/esmil/musl/blob/master/src/string/memcpy.c
Jun 11 2018
On 6/10/2018 8:34 PM, Basile B. wrote:- default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.CI think you mean: https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM
Jun 11 2018
On Monday, 11 June 2018 at 08:05:14 UTC, Walter Bright wrote:On 6/10/2018 8:34 PM, Basile B. wrote:- default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.CI think you mean: https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASMCool! And it's even Boost-licensed. I think I'll try to port that next, before I try the SSE and AVX implementations. Mike
Jun 11 2018
On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:- default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.cTo see what is executed when you call memcpy() on a regular GNU/Linux distro, you'd want to have a look at glibc instead. For example, the AVX2 and AVX512 implementations are in sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S and sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S, as forwarded to by memmove-avx2-unaligned-erms.S and memmove-avx512-unaligned-erms.S. (Pop quiz: Why might a separate "no-vzeroupper" variant be a good idea?) — David
Jun 17 2018
BTW the way memcpy is (was?) implemented in the C runtime coming from the Intel C++ compiler was really enlightening about the sheer difficulty of such a task.

First of all, there isn't one loop but many, depending on the source and destination alignment.

- If both are aligned on 16-byte boundaries, source and destination operands would be copied with MOVAPS/MOVDQA, nothing special.

- If only the source or destination was misaligned, the function would dispatch to a variant with the core loop loading 16-byte aligned and writing 16-byte unaligned, with the PALIGNR instruction. However, since PALIGNR can't take a runtime value, this variant was _replicated 16 times_.

- I don't remember the case where both source and destination are misaligned, but you can reduce it to the above one.

Each of these loops had a complicated prelude that does the first iteration, and they are so hard to do by hand. It was also the only piece of assembly I've seen that (apparently) successfully used the "prefetch" instructions.

This was just the SSE version; AVX was different. I don't know if someone really wrote this code, or if it was all from intrinsics.
Jun 11 2018
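[Editor's note: the "prelude" Guillaume describes can be sketched compactly: one unaligned 16-byte copy absorbs the misaligned head, then the pointers are bumped to an aligned boundary for the bulk loop. Hypothetical helper, assuming len >= 16 and core.simd's loadUnaligned/storeUnaligned helpers; bulk loop and tail elided.]

import core.simd : void16, loadUnaligned, storeUnaligned;

// Hypothetical sketch of a destination-aligning prelude; assumes len >= 16.
void copyPrelude(ubyte* d, const(ubyte)* s, size_t len) nothrow @nogc
{
    // one unaligned 16-byte copy covers the misaligned head (it may overlap
    // the first aligned iteration, which is harmless for non-overlapping
    // source and destination)
    storeUnaligned(cast(void16*) d, loadUnaligned(cast(const(void16)*) s));

    // round d up to the next 16-byte boundary and continue from there
    size_t skew = 16 - (cast(size_t) d & 15);
    d += skew;
    s += skew;
    len -= skew;
    // ... aligned bulk loop (and byte tail) over d/s/len would follow
}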
On 6/11/2018 11:17 AM, Guillaume Piolat wrote:I don't know if someone really wrote this code, or if it was all from intrinsics.memcpy is so critical to success it is likely written by Intel itself to ensure every drop of perf is wrung out of the CPU. If I were Intel CEO, I'd direct the CPU hardware guys to do this and give it away.
Jun 11 2018