digitalmars.D - Replacing C's memcpy with a D implementation
- Mike Franklin (119/119) Jun 10 2018 I'm exploring the possibility of implementing some of the basic
- Nicholas Wilson (2/11) Jun 10 2018 what compiler? what flags?
- Mike Franklin (8/20) Jun 10 2018 dmd main.d
- Adam D. Ruppe (10/16) Jun 10 2018 So keep in mind that memcpy is really a magical intrinsic anyway
- Mike Franklin (21/25) Jun 10 2018 My intent is to use the D implementation in the druntime. I'm
- Kagamin (5/14) Jun 10 2018 In safe code you just use assignment and array ops, backend does
- Mike Franklin (6/20) Jun 10 2018 The compiler implementation is faulty. It rewrites the
- Mike Franklin (5/9) Jun 10 2018 Also, understand that this lowering happens in the IR generation
- Walter Bright (2/8) Jun 10 2018 It's only faulty if there's a bug in it. It's essentially @trusted code.
- Mike Franklin (13/22) Jun 10 2018 That only addresses the @safe attribute, and that code is much
- Walter Bright (16/26) Jun 11 2018 Making it a template is not really necessary. The compiler knows if ther...
- Mike Franklin (15/18) Jun 11 2018 There are other reasons to make it a template, though. For
- Walter Bright (7/21) Jun 11 2018 We have no design for this function that doesn't rely on the GC, and the...
- Mike Franklin (19/22) Jun 11 2018 I understand that. I was using `_d_arraysetlengthT` as an
- Mike Franklin (9/20) Jun 11 2018 I think it does, because I can then generate specific code based
- Mike Franklin (7/16) Jun 11 2018 Also, before you do any more nay-saying, you might want to
- Johannes Pfau (19/37) Jun 11 2018 I guess for most D runtime hooks, using templates is a good idea to
- Mike Franklin (26/38) Jun 11 2018 My plans go beyond microcontrollers. Mostly, I'd like to be able
- Steven Schveighoffer (11/14) Jun 11 2018 No, __doPostblit is necessary -- you are making a copy.
- Walter Bright (3/16) Jun 11 2018 Yes, you're right. This should probably go as a comment in the code in c...
- Walter Bright (1/1) Jun 11 2018 https://github.com/dlang/druntime/pull/2213
- Mike Franklin (17/19) Jun 10 2018 void memcpyD()
- rikki cattermole (3/32) Jun 10 2018 memcpy
- Seb (5/24) Jun 10 2018 I would increase the test data size, s.t. you get a least a few
- I love Ice Cream (5/5) Jun 10 2018 Don't C implementations already do 90% of what you want? I
- Walter Bright (2/6) Jun 10 2018 Note that .dup is doing a GC memory allocation.
- Patrick Schluter (46/51) Jun 10 2018 rep movsd is very CPU dependent and needs some preconditions to be
- Walter Bright (2/3) Jun 11 2018 Thanks. Agner Fog gets the last word on this topic!
- David Nadlinger (13/16) Jun 17 2018 Well, Agner is rarely wrong indeed, but there is a limit to how
- Guillaume Piolat (11/14) Jun 10 2018 Hi,
- Mike Franklin (13/15) Jun 10 2018 I tested with ldc and got similar results. I thought the
- David Nadlinger (45/48) Jun 10 2018 You've just discovered the fact that one can rarely be careful
- Walter Bright (6/13) Jun 10 2018 The CPU makers abandoned optimizing the REP instructions decades ago, an...
- Temtaime (3/19) Jun 10 2018 On some cpu architectures(for example intel atoms) rep movsb is
- David Nadlinger (10/17) Jun 10 2018 That's not entirely true. Intel started optimising some of the
- Walter Bright (2/7) Jun 10 2018 The drama of which instruction mix is faster on which CPU never abates!
- Nick Sabalausky (Abscissa) (2/11) Jun 10 2018 In many ways, I really miss 80's machine architecture ;) So simple.
- Walter Bright (13/14) Jun 10 2018 One source of entropy in the results is src and dst being global variabl...
- solidstate1991 (2/17) Jun 10 2018 Protip: Use SSE or AVX for an even faster copying.
- Mike Franklin (131/131) Jun 10 2018 I've modified the test based on the feedback so far, so here's
- Basile B. (7/134) Jun 10 2018 - default win32 OMF:
- Basile B. (3/11) Jun 11 2018 D slice style memcpy cg:
- Walter Bright (3/5) Jun 11 2018 I think you mean:
- Mike Franklin (4/9) Jun 11 2018 Cool! and it's even boost licensed. I think I'll try to port
- David Nadlinger (12/14) Jun 17 2018 To see what is executed when you call memcpy() on a regular
- Guillaume Piolat (21/21) Jun 11 2018 BTW the way memcpy is(was?) implemented in the C runtime coming
- Walter Bright (4/5) Jun 11 2018 memcpy is so critical to success it is likely written by Intel itself to...
I'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations. There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.

The prevailing wisdom has been that there is no way to improve on C's memcpy implementation, given that it has been micro-optimized to death over several decades by many talented members of the human race. So, I threw the following benchmark together to try to get a clue about what I was up against, and in a very short time, I beat the snot out of C's memcpy. The benefit seems to disappear as the array sizes increase, but I believe the vast majority of calls to memcpy are probably quite small.

import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;

ubyte[length] dst;
ubyte[length] src;
auto rnd = Random(42);
ubyte[] src2;
ubyte[] dst2;

void verifyResults()
{
    assert(memcmp(dst.ptr, src.ptr, length) == 0);
    assert(memcmp(dst2.ptr, src2.ptr, length) == 0);
}

void randomizeData()
{
    for(int i = 0; i < length; i++)
    {
        src[i] = uniform!ubyte;
        dst[i] = uniform!ubyte;
    }

    src2 = src;
    dst2 = dst;
}

void memcpyD()
{
    dst = src.dup;
}

void memcpyDstdAlg()
{
    copy(src2, dst2);
}

void memcpyC()
{
    memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive()
{
    for(int i = 0; i < length; i++)
    {
        dst[i] = src[i];
    }
}

void memcpyASM()
{
    auto s = src.ptr;
    auto d = dst.ptr;
    size_t len = length;
    asm pure nothrow @nogc
    {
        mov RSI, s;
        mov RDI, d;
        cld;
        mov RCX, len;
        rep;
        movsb;
    }
}

void main()
{
    // verify the integrity of the algorithm
    randomizeData();
    memcpyD();
    verifyResults();

    randomizeData();
    memcpyDstdAlg();
    verifyResults();

    randomizeData();
    memcpyC();
    verifyResults();

    randomizeData();
    memcpyNaive();
    verifyResults();

    randomizeData();
    memcpyASM();
    verifyResults();

    // test the performance of the algorithm
    auto r = benchmark!(memcpyD, memcpyDstdAlg, memcpyC, memcpyNaive, memcpyASM)(1000);
    Duration memcpyDResult = r[0];
    Duration memcpyDstdAlgResult = r[1];
    Duration memcpyCResult = r[2];
    Duration memcpyNaiveResult = r[3];
    Duration memcpyASMResult = r[4];

    writeln("memcpyD: ", memcpyDResult);
    writeln("memcpyDstdAlg: ", memcpyDstdAlgResult);
    writeln("memcpyC: ", memcpyCResult);
    writeln("memcpyNaive: ", memcpyNaiveResult);
    writeln("memcpyASM: ", memcpyASMResult);
}

------ Output --------
memcpyD: 1 ms, 772 μs, and 4 hnsecs
memcpyDstdAlg: 531 μs and 8 hnsecs
memcpyC: 371 μs and 3 hnsecs
memcpyNaive: 21 ms, 572 μs, and 2 hnsecs
memcpyASM: 119 μs and 6 hnsecs

I'm not experienced with this kind of programming, so I'm doubting these results. Have I done something wrong? Am I overlooking something?

Thanks,
Mike
Jun 10 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:I'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations. There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from safe code. [...]what compiler? what flags?
Jun 10 2018
On Sunday, 10 June 2018 at 13:05:33 UTC, Nicholas Wilson wrote:On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:dmd main.d Arch Linux x86_64 DMD, with no flags. I'm not compiling C's memcpy, and the significant D implementation is memcpyASM which is all inline assembly, so I don't see how the compiler flags will make any difference, anyway. MikeI'm exploring the possibility of implementing some of the basic software building blocks (memcpy, memcmp, memmove, etc...) that D utilizes from the C library with D implementations. There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from safe code. [...]what compiler? what flags?
Jun 10 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:D utilizes from the C library with D implementations. There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.So keep in mind that memcpy is really a magical intrinsic anyway and optimizers frequently don't actually call a function, but rather see the size and replace the instructions inline (like it might replace it with just a couple movs instead of something fancy).

And D already has it built in as well for @safe etc:

arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic

so be sure to check against that too.
Jun 10 2018
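[Editor's note: for readers unfamiliar with the slice-copy syntax Adam mentions, here is a minimal @safe illustration; the variable names are made up.]

void sliceCopyDemo() @safe
{
    ubyte[8] a = [1, 2, 3, 4, 5, 6, 7, 8];
    ubyte[8] b;

    b[] = a[];          // element-wise copy; the compiler lowers this to a
                        // memcpy-style primitive for plain-old-data elements
    assert(b[3] == 4);

    // b[0 .. 4] = a[]; // mismatched lengths fail the runtime bounds check
}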
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:And D already has it built in as well for @safe etc: arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic so be sure to check against that too.My intent is to use the D implementation in the druntime. I'm trying to replace the runtime hooks with templates, so we don't have to rely so much on `TypeInfo` and we can begin to use more of D without linking in the runtime. But one thing I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us. If I replace the runtime hook with a template, I need to be honest, and that means all that low-level code in druntime that the runtime hook depends on needs to be @safe, nothrow, and pure. So I'm starting with the fundamental dependencies on the C library, trying to ensure they can be replaced with D implementations and used in D idiomatically.

For example, memcpy might look like this:

void memcpy(T)(T[] dest, T[] src);

We can extract size, alignment etc.. at compile-time. So, my goal is not for user-facing code, but for improving the druntime implementations. Does that make sense?

Mike
Jun 10 2018
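[Editor's note: as a sketch of how such a template could exploit compile-time information. The name, the size-based dispatch, and the alignment assumption are all hypothetical, not an actual druntime design.]

// Hypothetical sketch: a type-aware copy for plain-old-data element types.
// Because T is known at compile time, T.sizeof and T.alignof are constants
// the implementation (and the optimizer) can exploit.
void memcpyD(T)(T[] dest, const(T)[] src) @trusted pure nothrow @nogc
{
    assert(dest.length == src.length);

    static if (T.sizeof >= size_t.sizeof)
    {
        // elements at least word-sized: a plain element loop already moves
        // data in aligned, word-sized (or larger) chunks
        foreach (i; 0 .. src.length)
            dest[i] = src[i];
    }
    else
    {
        // tiny elements: move words, then a byte tail
        // (assumes word-aligned buffers; a real implementation would check)
        auto nbytes = src.length * T.sizeof;
        auto dw = cast(size_t*) dest.ptr;
        auto sw = cast(const(size_t)*) src.ptr;
        foreach (i; 0 .. nbytes / size_t.sizeof)
            dw[i] = sw[i];
        foreach (i; nbytes - (nbytes % size_t.sizeof) .. nbytes)
            (cast(ubyte*) dest.ptr)[i] = (cast(const(ubyte)*) src.ptr)[i];
    }
}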
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.In @safe code you just use assignment and array ops, backend does the rest. On Sunday, 10 June 2018 at 13:27:04 UTC, Mike Franklin wrote:But one thing I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us.If the compiler can't get it right then who can?
Jun 10 2018
On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:There are many reasons to do this, one of which is to leverage information available at compile-time and in D's type system (type sizes, alignment, etc...) in order to optimize the implementation of these functions, and allow them to be used from @safe code.In @safe code you just use assignment and array ops, backend does the rest. On Sunday, 10 June 2018 at 13:27:04 UTC, Mike Franklin wrote:But one thing I discovered is that while we can set an array's length in @safe, nothrow, pure code, it gets lowered to a runtime hook that is neither @safe, nothrow, nor pure; the compiler is lying to us.If the compiler can't get it right then who can?The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure:

https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442

The backend is not involved.

Mike
Jun 10 2018
On Monday, 11 June 2018 at 02:49:00 UTC, Mike Franklin wrote:The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure: https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442 The backend is not involved.Also, understand that this lowering happens in the IR generation stage:

https://github.com/dlang/dmd/blob/3a79629988efd51d4dda9edb38a6701cd097da89/src/dmd/e2ir.d#L2616

so semantic analysis is not performed on the runtime call. If it were, it would fail to compile.

Mike
Jun 10 2018
On 6/10/2018 7:49 PM, Mike Franklin wrote:On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:If the compiler can't get it right then who can?The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure: https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442 The backend is not involved.It's only faulty if there's a bug in it. It's essentially @trusted code.
Jun 10 2018
On Monday, 11 June 2018 at 03:31:05 UTC, Walter Bright wrote:On 6/10/2018 7:49 PM, Mike Franklin wrote:On Sunday, 10 June 2018 at 15:12:27 UTC, Kagamin wrote:If the compiler can't get it right then who can?The compiler implementation is faulty. It rewrites the expressions to an `extern(C)` runtime implementation that is not @safe, nothrow, or pure: https://github.com/dlang/druntime/blob/706081f3cb23f4c597cc487ce16ad3d2ed021053/src/rt/lifetime.d#L1442 The backend is not involved.It's only faulty if there's a bug in it. It's essentially @trusted code.That only addresses the @safe attribute, and that code is much too complex for anyone to audit it and certify it as @safe. Exceptions are also not all handled, so there is no way it can pass as nothrow.

The runtime call needs to be replaced with a template that can take advantage of D's compile-time type information instead of `TypeInfo`, but doing so will force the implementation through semantic processing, which, in its current state, will fail to compile. It needs to be refactored, and creating @safe, nothrow, pure building blocks will help with that, hence the topic of this thread.

Mike
Jun 10 2018
On 6/10/2018 8:43 PM, Mike Franklin wrote:That only addresses the @safe attribute, and that code is much too complex for anyone to audit it and certify it as @safe. Exceptions are also not all handled, so there is no way it can pass as nothrow. The runtime call needs to be replaced with a template that can take advantage of D's compile-time type information instead of `TypeInfo`, but doing so will force the implementation through semantic processing, which, in its current state, will fail to compile. It needs to be refactored, and creating @safe, nothrow, pure building blocks will help with that, hence the topic of this thread.Making it a template is not really necessary. The compiler knows if there is the possibility of it throwing based on the type; it doesn't need to infer it.

(I notice it is doing __doPostblit(). This looks wrong, D allows data to be moved. As far as I can tell with a perfunctory examination, that's the only "can throw" bit.)

Doing without TypeInfo is a worthy goal most of the time, but I don't think it is of much value here because it calls the garbage collector anyway, which requires TypeInfo.

Some obvious refactoring tasks for it would be to:

1. eliminate the gotos
2. use single assignment for variables
3. minimize the scopes of variables
4. use const
5. document the inputs and outputs
6. add test cases for the red lines
Jun 11 2018
On Monday, 11 June 2018 at 08:00:10 UTC, Walter Bright wrote:Making it a template is not really necessary. The compiler knows if there is the possibility of it throwing based on the type, it doesn't need to infer it.There are other reasons to make it a template, though. For example, if it were a template, it would not rely on `TypeInfo` and could then be used in -betterC-like situations. For setting the length of an array, there are issues with memory allocation that hinder that, but I'm thinking beyond just setting the length of an array here; I just used that as an illustrative example.

I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.

So there are other reasons to pursue a template design for all of our runtime hooks, and if our memcpy, memcmp, etc.. are also templated it's turtles all the way down. We have to be careful about code bloat with templates, but I think that can be mitigated.

Mike
Jun 11 2018
On 6/11/2018 1:12 AM, Mike Franklin wrote:On Monday, 11 June 2018 at 08:00:10 UTC, Walter Bright wrote:Making it a template is not really necessary. The compiler knows if there is the possibility of it throwing based on the type, it doesn't need to infer it.There are other reasons to make it a template, though. For example, if it were a template, it would not rely on `TypeInfo` and could then be used in -betterC-like situations.We have no design for this function that doesn't rely on the GC, and the GC needs TypeInfo. This function is not usable with betterC with or without the TypeInfo argument.I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.Just making it a template doesn't automatically enable any of this.So there are other reasons to pursue a template design for all of our runtime hooks, and if our memcpy, memcmp, etc.. are also templated it's turtles all the way down.memcpy and memcmp are already handled specially by modern compilers, have been for decades; I seriously doubt there's more oil in that well.
Jun 11 2018
On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:We have no design for this function that doesn't rely on the GC, and the GC needs TypeInfo. This function is not usable with betterC with or without the TypeInfo argument.I understand that. I was using `_d_arraysetlengthT` as an example of problems I encountered trying to convert these runtime hooks into templates, specifically illustrating how the compiler does not enforce the guarantees of @safe, nothrow, and pure, and, in fact, completely disregards them.

Converting *any* runtime hook, not just `_d_arraysetlengthT`, to a template comes with a whole host of problems because of the aforementioned issue. As soon as it's implemented as a template it is semantically analyzed in the context of its caller, so @safe, nothrow, pure, etc. all need to be accommodated. It's good to have the compiler check for that stuff; it's unfortunate the current implementation does not, and that's the fault in the current implementation.

There are other runtime hooks that are good candidates for templates, even if `_d_arraysetlengthT` isn't one of them. Again, I just chose `_d_arraysetlengthT` as an example, because it is the one I had experience with.

Mike
Jun 11 2018
On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.Just making it a template doesn't automatically enable any of this.I think it does, because I can then generate specific code based on the type information at compile-time.So there are other reasons to pursue a template design for all of our runtime hooks, and if our memcpy, memcmp, etc.. are also templated it's turtles all the way down.memcpy and memcmp are already handled specially by modern compilers, have been for decades; I seriously doubt there's more oil in that well.Yes, I may fail to achieve any performance benefits, but even if I achieve parity with the current implementation it will still be a win for me because I will no longer have to depend on the C standard library. And if I have to fork to use D that way, then that's what I'm going to do.

Mike
Jun 11 2018
On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.Just making it a template doesn't automatically enable any of this.I think it does, because I can then generate specific code based on the type information at compile-time.Also, before you do any more nay-saying, you might want to revisit this talk https://www.youtube.com/watch?v=endKC3fDxqs which demonstrates precisely the kind of benefits that can be achieved with these kinds of changes to the compiler/runtime interface.

Mike
Jun 11 2018
On Mon, 11 Jun 2018 10:54:23 +0000, Mike Franklin wrote:On Monday, 11 June 2018 at 10:38:30 UTC, Mike Franklin wrote:On Monday, 11 June 2018 at 10:07:39 UTC, Walter Bright wrote:I think there might also be optimization opportunities using templates, metaprogramming, and type introspection, that are not currently possible with the current design.Just making it a template doesn't automatically enable any of this.I think it does, because I can then generate specific code based on the type information at compile-time.Also, before you do any more nay-saying, you might want to revisit this talk https://www.youtube.com/watch?v=endKC3fDxqs which demonstrates precisely the kind of benefits that can be achieved with these kinds of changes to the compiler/runtime interface. MikeI guess for most D runtime hooks, using templates is a good idea to enable inlining and further optimizations. I understand that you actually need to reimplement memcpy, as in your microcontroller use case you don't want to have any C runtime. So you'll basically have to rewrite the C runtime parts D depends on.

However, I think for memcpy and similar functions you're probably better off keeping the C interface. This directly provides the benefit of compiler intrinsics/optimizations. And marking memcpy as nothrow/pure/@system/@nogc is simple either way. For the D implementation, the compiler will verify this for you; for the C implementation, you have to mark the function depending on the C implementation. But that's mostly trivial.

On a related note, I agree that the compiler sometimes cheats by ignoring attributes, especially when calling TypeInfo-related functions, and this is a huge problem. Runtime TypeInfo is not descriptive enough to fully represent the types, and whenever the compiler casts without properly checking first, there's the possibility of a problem.

-- Johannes
Jun 11 2018
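[Editor's note: the attribute-on-the-C-declaration approach Johannes describes is roughly what druntime's core.stdc.string already does; the declaration below is paraphrased from that module, not copied exactly.]

// Paraphrased from core.stdc.string: the extern(C) symbol is simply
// declared with the attributes the C implementation is trusted to satisfy.
extern (C) pure nothrow @nogc @system
void* memcpy(return scope void* s1, scope const void* s2, size_t n);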
On Monday, 11 June 2018 at 18:34:58 UTC, Johannes Pfau wrote:I understand that you actually need to reimplement memcpy, as in your microcontroller use case you don't want to have any C runtime. So you'll basically have to rewrite the C runtime parts D depends on. However, I think for memcpy and similar functions you're probably better off keeping the C interface. This directly provides the benefit of compiler intrinsics/optimizations. And marking memcpy as nothrow/pure/@system/@nogc is simple either way. For the D implementation, the compiler will verify this for you; for the C implementation, you have to mark the function depending on the C implementation. But that's mostly trivial.My plans go beyond microcontrollers. Mostly, I'd like to be able to use more features of D without having to link in a pre-built runtime. This is especially convenient for cross-compiling scenarios. By replacing the runtime hooks with templates, and the software building blocks with D implementations, we'd no longer need to obtain a C toolchain and a pre-compiled druntime library just to get a build for our target. We'd just need a cross-compiler like LDC or GDC. After adding druntime to the import path, we'd have everything we need for our target. Try, today, to create an LDC cross-compiler for a Windows host and an ARM target like the Raspberry Pi. You'll quickly realize what a hassle it all is.

Another issue with memcpy being generated by compilers is no one knows what it's actually doing without building and looking at the assembly. And the assembly will change based on what's being built. Have you tried to look at the GCC implementation? I have, and I never want to again. If the implementation were templated in D, we'd be able to see exactly what code would be generated, and when, by simply viewing the D runtime source code. It'd be easier to understand, predict, port, and enhance.

There isn't one single game-changing benefit this endeavor would bring, and I guess that's why it's difficult to comprehend, but collectively, all the little things it would enable would make it quite compelling.

Mike
Jun 11 2018
On 6/11/18 4:00 AM, Walter Bright wrote:(I notice it is doing __doPostblit(). This looks wrong, D allows data to be moved. As far as I can tell with a perfunctory examination, that's the only "can throw" bit.)No, __doPostblit is necessary -- you are making a copy.

example:

File[] fs = new File[5];
fs[0] = ...; // initialize fs
auto fs2 = fs;
fs.length = 100;

At this point, fs points at a separate block from fs2. If you did not do postblit on this, then when one of those arrays is destroyed, the other will become invalid, as the File reference count would go to 0.

-Steve
Jun 11 2018
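[Editor's note: a runnable way to watch the postblit fire on a length extension. The counter struct is hypothetical, and whether the block relocates (and therefore the exact copy count) depends on the allocator.]

import std.stdio : writeln;

struct Counted
{
    static int copies;        // counts postblit invocations (illustration only)
    int payload;
    this(this) { ++copies; }  // postblit: runs once per copied element
}

void main()
{
    auto a = new Counted[4];
    auto b = a;           // b shares a's block; no copy, no postblit
    a.length = 1000;      // too large to extend in place, so the runtime
                          // relocates a and postblits the 4 live elements
    writeln(Counted.copies); // prints 4 here; b still sees the old block
}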
On 6/11/2018 6:00 AM, Steven Schveighoffer wrote:No, __doPostblit is necessary -- you are making a copy.

example:

File[] fs = new File[5];
fs[0] = ...; // initialize fs
auto fs2 = fs;
fs.length = 100;

At this point, fs points at a separate block from fs2. If you did not do postblit on this, then when one of those arrays is destroyed, the other will become invalid, as the File reference count would go to 0.Yes, you're right. This should probably go as a comment in the code in case it comes up again.
Jun 11 2018
https://github.com/dlang/druntime/pull/2213
Jun 11 2018
On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:arr1[] = arr2[]; // the compiler makes this memcpy, the optimizer can further do its magic

void memcpyD()
{
    dst = src.dup;
}

void memcpyD2()
{
    dst[] = src[];
}

-----
memcpyD: 1 ms, 725 μs, and 1 hnsec
memcpyD2: 587 μs and 5 hnsecs
memcpyASM: 119 μs and 5 hnsecs

Still, the ASM version is much faster.

btw, what's the difference between the two? If you can't tell, I'm actually not a very good D programmer.

Mike
Jun 10 2018
On 11/06/2018 1:45 AM, Mike Franklin wrote:

    void memcpyD() { dst = src.dup; }

malloc (for slice not static array)

    void memcpyD2() { dst[] = src[]; }

memcpy

[...]
Jun 10 2018
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:[...]memcpyD: 1 ms, 725 μs, and 1 hnsec memcpyD2: 587 μs and 5 hnsecs memcpyASM: 119 μs and 5 hnsecs[...]I would increase the test data size, such that you get at least a few seconds. Otherwise the benchmarking won't tell you much, because it's subject to too much randomness.
Jun 10 2018
Don't C implementations already do 90% of what you want? I thought most compilers know about and optimize these methods based on context. I thought they were *special* in the eyes of the compiler already. I think you are fighting a battle pitting 40 years of tweaking against you...
Jun 10 2018
On 6/10/2018 6:45 AM, Mike Franklin wrote:void memcpyD() { dst = src.dup; }Note that .dup is doing a GC memory allocation.
Jun 10 2018
On Sunday, 10 June 2018 at 13:45:54 UTC, Mike Franklin wrote:On Sunday, 10 June 2018 at 13:16:21 UTC, Adam D. Ruppe wrote:memcpyD: 1 ms, 725 μs, and 1 hnsec memcpyD2: 587 μs and 5 hnsecs memcpyASM: 119 μs and 5 hnsecs Still, the ASM version is much faster.rep movsd is very CPU dependent and needs some preconditions to be fast. For relatively short memory blocks it sucks on many CPUs other than the latest Intels. See what Agner Fog has to say about it:

16.10 String instructions (all processors)

String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to LOOP on all processors and to JECXZ on some processors. REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. Always use the largest word size possible (DWORD in 32-bit mode, QWORD in 64-bit mode), and make sure that both source and destination are aligned by the word size. In many cases, however, it is faster to use XMM registers. Moving data in XMM registers is faster than REP MOVSD and REP STOSD in most cases, especially on older processors. See page 164 for details.

Note that while the REP MOVS instruction writes a word to the destination, it reads the next word from the source in the same clock cycle. You can have a cache bank conflict if bit 2-4 are the same in these two addresses on P2 and P3. In other words, you will get a penalty of one clock extra per iteration if ESI+WORDSIZE-EDI is divisible by 32. The easiest way to avoid cache bank conflicts is to align both source and destination by 8. Never use MOVSB or MOVSW in optimized code, not even in 16-bit mode.

On many processors, REP MOVS and REP STOS can perform fast by moving 16 bytes or an entire cache line at a time. This happens only when certain conditions are met. Depending on the processor, the conditions for fast string instructions are, typically, that the count must be high, both source and destination must be aligned, the direction must be forward, the distance between source and destination must be at least the cache line size, and the memory type for both source and destination must be either write-back or write-combining (you can normally assume the latter condition is met). Under these conditions, the speed is as high as you can obtain with vector register moves or even faster on some processors.

While the string instructions can be quite convenient, it must be emphasized that other solutions are faster in many cases. If the above conditions for fast move are not met then there is a lot to gain by using other methods. See page 164 for alternatives to REP MOVS
Jun 10 2018
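[Editor's note: translated into the inline-asm style used earlier in the thread, Agner's "largest word size" advice looks roughly like the sketch below. It assumes x86_64, 8-byte-aligned pointers, and that DMD's inline assembler accepts movsq; it is not benchmarked.]

void memcpyQword(void* d, const(void)* s, size_t len) pure nothrow @nogc
{
    size_t qwords = len / 8;  // bulk of the copy in QWORD units
    size_t tail = len % 8;    // leftover bytes
    asm pure nothrow @nogc
    {
        mov RSI, s;
        mov RDI, d;
        cld;
        mov RCX, qwords;
        rep;
        movsq;                // QWORD moves for the bulk, per Agner's advice
        mov RCX, tail;
        rep;
        movsb;                // byte moves for the short tail
    }
}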
On 6/10/2018 9:44 PM, Patrick Schluter wrote:See what Agner Fog has to say about it:Thanks. Agner Fog gets the last word on this topic!
Jun 11 2018
On Monday, 11 June 2018 at 08:02:42 UTC, Walter Bright wrote:On 6/10/2018 9:44 PM, Patrick Schluter wrote:Well, Agner is rarely wrong indeed, but there is a limit to how much material a single person can keep up to date. On newer uarchs, `rep movsb` isn't slower than `rep movsd`, and often performs similar to the best SSE2 implementation (using NT stores). See "BeeOnRope"'s answer to this StackOverflow question for an in-depth discussion about this: https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy AVX2 seems to offer extra performance beyond that, though, if it is available (for example if runtime feature detection is used). I believe I read a comment by Agner somewhere to that effect as well – a search engine will certainly be able to turn up more. — DavidSee what Agner Fog has to say about it:Thanks. Agner Fog gets the last word on this topic!
Jun 17 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:I'm not experienced with this kind of programming, so I'm doubting these results. Have I done something wrong? Am I overlooking something?Hi, I've spent a lot of time optimizing memcpy. One of the results was that on Intel ICC the compiler intrinsics were unbeatable. Please make one that guarantees the usage of the corresponding backend intrinsic, for example on LLVM. The reasoning is that small areas of memory (statically known) will be elided to a few byte moves (inlined), whereas the larger ones will call the highly optimized C stdlib. If you use ASM instead of IR, the optimization barriers and register spilling will probably make you less efficient.
Jun 10 2018
On Sunday, 10 June 2018 at 13:17:53 UTC, Guillaume Piolat wrote:Please make one that guarantees the usage of the corresponding backend intrinsic, for example on LLVM.I tested with ldc and got similar results. I thought the implementation in C forwarded to the backend intrinsic. I think even LDC links with GCC, and the distribution of GCC is tuned for a given platform and architecture, so am I not already utilizing GCC's backend intrinsic? Also, I'm not trying to find the holy grail of memcpy with this endeavor. I just want to find something as good or better than what the druntime is currently using, but implemented in D. Even if it's not optimal, if it's as good as what druntime is currently using, I want to replace those calls in the runtime with a D implementation. Mike
Jun 10 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:I'm not experienced with this kind of programming, so I'm doubting these results. Have I done something wrong? Am I overlooking something?You've just discovered the fact that one can rarely be careful enough with what is benchmarked, or have enough statistics. For example, check out the following output from running your program on macOS 10.12, compiled with LDC 1.8.0:

---
$ ./test
memcpyD: 2 ms, 570 μs, and 9 hnsecs
memcpyDstdAlg: 77 μs and 2 hnsecs
memcpyC: 74 μs and 1 hnsec
memcpyNaive: 76 μs and 4 hnsecs
memcpyASM: 145 μs and 5 hnsecs

$ ./test
memcpyD: 3 ms and 376 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 104 μs and 4 hnsecs
memcpyNaive: 72 μs and 2 hnsecs
memcpyASM: 181 μs and 8 hnsecs

$ ./test
memcpyD: 2 ms and 565 μs
memcpyDstdAlg: 76 μs and 9 hnsecs
memcpyC: 73 μs and 2 hnsecs
memcpyNaive: 71 μs and 9 hnsecs
memcpyASM: 145 μs and 3 hnsecs

$ ./test
memcpyD: 2 ms, 813 μs, and 8 hnsecs
memcpyDstdAlg: 81 μs and 2 hnsecs
memcpyC: 99 μs and 2 hnsecs
memcpyNaive: 74 μs and 2 hnsecs
memcpyASM: 149 μs and 1 hnsec

$ ./test
memcpyD: 2 ms, 593 μs, and 7 hnsecs
memcpyDstdAlg: 77 μs and 3 hnsecs
memcpyC: 75 μs
memcpyNaive: 77 μs and 2 hnsecs
memcpyASM: 145 μs and 5 hnsecs
---

Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest, followed by the ASM implementation. In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.

— David
Jun 10 2018
On 6/10/2018 11:16 AM, David Nadlinger wrote:Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,Probably because it does a memory allocation.followed by the ASM implementation.The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.This amply illustrates my other point that looking at the assembler generated is crucial to understanding what's happening.
Jun 10 2018
On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:On 6/10/2018 11:16 AM, David Nadlinger wrote:Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,Probably because it does a memory allocation.followed by the ASM implementation.The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.In fact, memcpyC and memcpyNaive produce exactly the same machine code (without bounds checking), as LLVM recognizes the loop and lowers it into a memcpy. memcpyDstdAlg instead gets turned into a vectorized loop, for reasons I didn't investigate any further.This amply illustrates my other point that looking at the assembler generated is crucial to understanding what's happening.On some CPU architectures (for example Intel Atoms) rep movsb is the fastest memcpy.
Jun 10 2018
On Sunday, 10 June 2018 at 22:23:08 UTC, Walter Bright wrote:On 6/10/2018 11:16 AM, David Nadlinger wrote:Of course; that was already pointed out earlier in the thread.Because of the large amounts of noise, the only conclusion one can draw from this is that memcpyD is the slowest,Probably because it does a memory allocation.The CPU makers abandoned optimizing the REP instructions decades ago, and just left the clunky implementations there for backwards compatibility.That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS?); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache lines. I've never measured that myself, though. — David
Jun 10 2018
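[Editor's note: the feature bit David mentions is CPUID leaf 7, sub-leaf 0, EBX bit 9 (ERMS, "Enhanced REP MOVSB/STOSB" in the Intel manuals). A detection sketch in the same inline-asm style as the rest of the thread:]

bool hasERMS() nothrow @nogc
{
    uint features;
    asm nothrow @nogc
    {
        push RBX;           // RBX is callee-saved and CPUID clobbers it
        mov EAX, 7;         // structured extended feature flags leaf
        xor ECX, ECX;       // sub-leaf 0
        cpuid;
        mov features, EBX;
        pop RBX;
    }
    return (features & (1 << 9)) != 0; // EBX bit 9 = ERMS
}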
On 6/10/2018 4:39 PM, David Nadlinger wrote:That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS?); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache lines.The drama of which instruction mix is faster on which CPU never abates!
Jun 10 2018
On 06/10/2018 08:01 PM, Walter Bright wrote:On 6/10/2018 4:39 PM, David Nadlinger wrote:That's not entirely true. Intel started optimising some of the REP string instructions again on Ivy Bridge and above. There is a CPUID bit to indicate that (ERMS?); I'm sure the Optimization Manual has further details. From what I remember, `rep movsb` is supposed to beat an AVX loop on most recent Intel µarchs if the destination is aligned and the data is longer than a few cache lines.The drama of which instruction mix is faster on which CPU never abates!In many ways, I really miss 80's machine architecture ;) So simple.
Jun 10 2018
On 6/10/2018 5:49 AM, Mike Franklin wrote:[...]One source of entropy in the results is src and dst being global variables. Global variables in D are in TLS, and TLS access can be complex (many instructions) and is influenced by the -fPIC switch. Worse, global variable access is not optimized in dmd because of aliasing problems. The solution is to pass src, dst, and length to the copy function as function parameters (and make sure function inlining is off). In light of this, I want to BEAT THE DEAD HORSE once again and assert that if the assembler generated by a benchmark is not examined, the results can be severely misleading. I've seen this happen again and again. In this case, TLS access is likely being benchmarked, not memcpy. BTW, the relative timing of rep movsb can be highly dependent on which CPU chip you're using.
Jun 10 2018
On Sunday, 10 June 2018 at 12:49:31 UTC, Mike Franklin wrote:

    void memcpyASM()
    {
        auto s = src.ptr;
        auto d = dst.ptr;
        size_t len = length;
        asm pure nothrow @nogc
        {
            mov RSI, s;
            mov RDI, d;
            cld;
            mov RCX, len;
            rep;
            movsb;
        }
    }

Protip: Use SSE or AVX for even faster copying.
Jun 10 2018
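[Editor's note: in D that tip doesn't require inline asm; core.simd can express a vector copy directly. A minimal sketch, assuming both buffers are 16-byte aligned and len is a multiple of 16; a real implementation needs prologue/epilogue code for everything else.]

import core.simd : void16;

void memcpySSE(void* d, const(void)* s, size_t len) pure nothrow @nogc
{
    // assumes len % 16 == 0 and 16-byte alignment of both pointers
    auto dv = cast(void16*) d;
    auto sv = cast(const(void16)*) s;
    foreach (i; 0 .. len / 16)
        dv[i] = sv[i];  // each iteration is one 16-byte vector load + store
}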
I've modified the test based on the feedback so far, so here's what it looks like now:

import std.datetime.stopwatch;
import std.stdio;
import core.stdc.string;
import std.random;
import std.algorithm;

enum length = 4096 * 2;

void init(ref ubyte[] a)
{
    a.length = length;
    for(int i = 0; i < length; i++)
    {
        a[i] = uniform!ubyte;
    }
}

void verifyResults(ubyte[] a, ubyte[] b)
{
    assert(memcmp(a.ptr, b.ptr, length) == 0);
}

void memcpyD(ubyte[] dst, ubyte[] src)
{
    dst[] = src[];
}

void memcpyDstdAlg(ubyte[] dst, ubyte[] src)
{
    copy(src, dst);
}

void memcpyC(ubyte[] dst, ubyte[] src)
{
    memcpy(dst.ptr, src.ptr, length);
}

void memcpyNaive(ubyte[] dst, ubyte[] src)
{
    for(int i = 0; i < length; i++)
    {
        dst[i] = src[i];
    }
}

void memcpyASM(ubyte[] dst, ubyte[] src)
{
    auto s = src.ptr;
    auto d = dst.ptr;
    size_t len = length;
    asm pure nothrow @nogc
    {
        mov RSI, s;
        mov RDI, d;
        cld;
        mov RCX, len;
        rep;
        movsb;
    }
}

Duration benchmark(alias f)(ubyte[] dst, ubyte[] src, uint n)
{
    Duration result;
    auto sw = StopWatch(AutoStart.yes);

    sw.reset();
    foreach (_; 0 .. n)
    {
        f(dst, src);
    }
    result = sw.peek();

    return result;
}

void main()
{
    ubyte[] src;
    ubyte[] dst;

    // verify the integrity of the algorithm
    init(src);
    init(dst);
    memcpyD(dst, src);
    verifyResults(dst, src);

    init(src);
    init(dst);
    memcpyDstdAlg(dst, src);
    verifyResults(dst, src);

    init(src);
    init(dst);
    memcpyC(dst, src);
    verifyResults(dst, src);

    init(src);
    init(dst);
    memcpyNaive(dst, src);
    verifyResults(dst, src);

    init(src);
    init(dst);
    memcpyASM(dst, src);
    verifyResults(dst, src);

    // test the performance of the algorithm
    enum iterations = 1000;
    writeln("memcpyD: ", benchmark!memcpyD(dst, src, iterations));
    writeln("memcpyDstdAlg: ", benchmark!memcpyDstdAlg(dst, src, iterations));
    writeln("memcpyC: ", benchmark!memcpyC(dst, src, iterations));
    writeln("memcpyNaive: ", benchmark!memcpyNaive(dst, src, iterations));
    writeln("memcpyASM: ", benchmark!memcpyASM(dst, src, iterations));
}

The results on my Windows 10 machine (Intel Core i7-6700, 3.4GHz):

memcpyD: 127 ╬╝s and 3 hnsecs
memcpyDstdAlg: 195 ╬╝s and 9 hnsecs
memcpyC: 126 ╬╝s and 7 hnsecs
memcpyNaive: 17 ms, 974 ╬╝s, and 9 hnsecs
memcpyASM: 122 ╬╝s and 8 hnsecs

(Gotta love how Windows displays μ)

The results running on Arch Linux 64-bit in a VirtualBox on the same Windows 10 machine:

memcpyD: 409 μs
memcpyDstdAlg: 400 μs
memcpyC: 404 μs and 4 hnsecs
memcpyNaive: 17 ms, 251 μs, and 6 hnsecs
memcpyASM: 162 μs and 8 hnsecs

The results appear more sane now, but it seems the behavior is highly platform dependent. Still, the ASM is doing well for my hardware. If I run the test multiple times, I do see a lot of noise in the results, but each test seems to be affected proportionally, so I'm gaining a little more confidence in the benchmark.

I still need to analyze the assembly of C's memcpy (anyone know where I can find the source code?), test on more platforms, and test varying sizes, but I'm just collecting some initial data right now, to learn how to proceed. I'd be interested in those with other platforms reporting back their results for their hardware, and of course suggestions for how to meet or beat C's memcpy with a pure D implementation.

Thanks for all the feedback so far.

Mike
Jun 10 2018
On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:[...]I still need to analyze the assembly of C's memcpy (anyone know where I can find the source code?)[...]- default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C
- default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c
- not used but interesting: https://github.com/esmil/musl/blob/master/src/string/memcpy.c
Jun 10 2018
On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:On Monday, 11 June 2018 at 01:03:16 UTC, Mike Franklin wrote:D slice style memcpy cg: https://github.com/dlang/dmd/blob/master/src/dmd/backend/cod2.c#L3526[...]- default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.C - default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.c - not used but interesting: https://github.com/esmil/musl/blob/master/src/string/memcpy.c
Jun 11 2018
On 6/10/2018 8:34 PM, Basile B. wrote:- default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.CI think you mean: https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASM
Jun 11 2018
On Monday, 11 June 2018 at 08:05:14 UTC, Walter Bright wrote:On 6/10/2018 8:34 PM, Basile B. wrote:- default win32 OMF: https://github.com/DigitalMars/dmc/blob/master/src/core/MEMCCPY.CI think you mean: https://github.com/DigitalMars/dmc/blob/master/src/CORE32/MEMCPY.ASMCool! And it's even Boost-licensed. I think I'll try to port that next, before I try the SSE and AVX implementations. Mike
Jun 11 2018
On Monday, 11 June 2018 at 03:34:59 UTC, Basile B. wrote:- default linux: https://github.com/gcc-mirror/gcc/blob/master/libgcc/memcpy.cTo see what is executed when you call memcpy() on a regular GNU/Linux distro, you'd want to have a look at glibc instead. For example, the AVX2 and AVX512 implementations are in sysdeps/x86_64/multiarch/memmove-avx512-no-vzeroupper.S and sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S, as forwarded to by memmove-avx2-unaligned-erms.S and memmove-avx512-unaligned-erms.S. (Pop quiz: Why might a separate "no-vzeroupper" variant be a good idea?) — David
Jun 17 2018
BTW the way memcpy is (was?) implemented in the C runtime coming from the Intel C++ compiler was really enlightening about the sheer difficulty of such a task.

First of all, there isn't one loop but many, depending on the source and destination alignment.

- If both are aligned on 16-byte boundaries, source and destination operands would be copied with MOVAPS/MOVDQA, nothing special.

- If only the source or destination was misaligned, the function would dispatch to a variant with the core loop loading 16-byte aligned and writing 16-byte unaligned, with the PALIGNR instruction. However, since PALIGNR can't take a runtime value, this variant was _replicated 16 times_.

- I don't remember the case where both source and destination are misaligned, but you can reduce it to the above one.

Each of these loops had a complicated prelude that does the first iteration, and they are so hard to do by hand. It was also the only piece of assembly I've seen that (apparently) successfully used the "prefetch" instructions.

This was just the SSE version; AVX was different. I don't know if someone really wrote this code, or if it was all from intrinsics.
Jun 11 2018
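[Editor's note: the "prelude" Guillaume describes can be sketched compactly: one unaligned 16-byte copy absorbs the misaligned head, then the pointers are bumped to an aligned boundary for the bulk loop. Hypothetical helper, assuming len >= 16 and core.simd's loadUnaligned/storeUnaligned helpers; bulk loop and tail elided.]

import core.simd : void16, loadUnaligned, storeUnaligned;

// Hypothetical sketch of a destination-aligning prelude; assumes len >= 16.
void copyPrelude(ubyte* d, const(ubyte)* s, size_t len) nothrow @nogc
{
    // one unaligned 16-byte copy covers the misaligned head (it may overlap
    // the first aligned iteration, which is harmless for non-overlapping
    // source and destination)
    storeUnaligned(cast(void16*) d, loadUnaligned(cast(const(void16)*) s));

    // round d up to the next 16-byte boundary and continue from there
    size_t skew = 16 - (cast(size_t) d & 15);
    d += skew;
    s += skew;
    len -= skew;
    // ... aligned bulk loop (and byte tail) over d/s/len would follow
}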
On 6/11/2018 11:17 AM, Guillaume Piolat wrote:I don't know if someone really wrote this code, or if it was all from intrinsics.memcpy is so critical to success it is likely written by Intel itself to ensure every drop of perf is wrung out of the CPU. If I were Intel CEO, I'd direct the CPU hardware guys to do this and give it away.
Jun 11 2018