digitalmars.D - Best interface for memcpy() (and the string.h family of functions)

Stefanos Baziotis (11/11) May 29 2019 I'm a GSoC student (I'll post this week an update) in

Jonathan Marler (93/104) May 29 2019 The default memcpy signature is still pretty useful in many

Stefanos Baziotis (31/51) May 29 2019 I'm not sure about that. Does it really make sense to have such an

Jonathan Marler (28/81) May 29 2019 Sure. Any time you have a buffer whose type isn't known at

Stefanos Baziotis (20/60) May 29 2019 You want, because instantiation and inlining of specific types is

Jonathan Marler (8/37) May 29 2019 It doesn't make a difference whether the final memcpy is `void*`

Stefanos Baziotis (6/13) May 29 2019 This is what will prevent doing anything really useful in D.

Jonathan Marler (18/21) May 29 2019 You didn't answer the question. How would inlining the

Stefanos Baziotis (17/34) May 29 2019 I don't know how "benchmarks" does not answer a question. For me,

Jonathan Marler (15/53) May 29 2019 Yes that would be an answer, I guess I got confused when you

Stefanos Baziotis (19/33) May 29 2019 Great, you can see that in the benchmarks, memcpyD is faster than

Jonathan Marler (6/29) May 29 2019 I haven't benchmarked it yet but here's the changes I've made to

Stefanos Baziotis (46/50) May 29 2019 Good, this week I'm also working on alignment. (more

kinke (2/3) May 29 2019 It works fine with LDC, and I guess with GDC too.

welkam (8/9) May 29 2019 With D you can forward to best suiting implementation. What libc

kinke (30/33) May 29 2019 ref would only work when copying one instance at a time. Many

Stefanos Baziotis (12/17) May 29 2019 The current state is that we think that slices should be enough

kinke (11/17) May 29 2019 In D, there's no ugly and unsafe need to pass slices to memcpy,

Mike Franklin (8/13) May 29 2019 This is an important observation. My vision for the GSoC project

Stefanos Baziotis (8/20) May 30 2019 Not important. Because my thought was that a lot of users would

Mike Franklin (13/23) May 30 2019 If users need to copy blocks of memory they should first prefer

Mike Franklin (2/5) May 30 2019 should --> shouldn't

Stefanos Baziotis (25/31) May 30 2019 I agree with Walter on that. I don't think though that ref,

Kagamin (24/24) May 30 2019 IME partial copy primitives are lacking, so I use this:

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

I'm a GSoC student (I'll post an update this week) working on
the project "Independency of D from the C Standard Library".
Part of this project is a D implementation of the memcpy(), memset(),
etc. family of functions.

What do you think is the best interface for say memcpy()?

My initial pick was void memcpyD(T)(T* dst, const T* src), but it was
proposed that `ref` instead of pointers might be better.

Thanks,
Stefanos

May 29 2019

Jonathan Marler <johnnymarler gmail.com> writes:

On Wednesday, 29 May 2019 at 11:46:28 UTC, Stefanos Baziotis 
wrote:
 I'm a GSoC student (I'll post this week an update) in
 the project "Independency of D from the C Standard Library".
 Part of this project is a D implementation of the family of 
 functions
 memcpy(), memset() etc.

 What do you think is the best interface for say memcpy()?

 My initial pick was void memcpyD(T)(T* dst, const T* src), but 
 it was proposed
 that `ref` instead of pointers might be better.

 Thanks,
 Stefanos

The default memcpy signature is still pretty useful in many 
cases.  The original signature should still be implemented and 
available as a non-template function:

void memcpy(void* dst, void* src, size_t length);

For D, you should also create a template so developers don't 
have to cast to `void*` all the time, but it just forwards all 
calls to the real memcpy function like this:

void memcpy(T,U)(T* dst, U* src, size_t length)
{
     pragma(inline, true);
     memcpy(cast(void*)dst, cast(void*)src, length);
}

And there's no need to have a different name like `memcpyD`. The 
function behaves the same as libc's memcpy, and when you have 
libc available, you should use that implementation instead so you 
can leverage other people's work.

However, we also want to get type-safety and bounds-checking when 
we can.  So we should also provide a set of templates that 
accept D arrays, verify type-safety and bounds, then 
forward the call to memcpy.

/**
acopy - Array Copy
*/
void acopy(T,U)(T dst, U src) @trusted
if (isArrayLike!T && isArrayLike!U && dst[0].sizeof == src[0].sizeof)
in { assert(dst.length >= src.length, "copyFrom source length larger than destination"); } do
{
     pragma(inline, true);
     static assert (!__traits(isStaticArray, T), "acopy does not accept static arrays since they are passed by value");

     import whereever_memcpy_is: memcpy;
     memcpy(dst.ptr, src.ptr, src.length * ElementSizeForCopy!dst);
}
/// ditto
void acopy(T,U)(T dst, U src) @system
if (isArrayLike!T && isPointerLike!U && dst[0].sizeof == src[0].sizeof)
{
     pragma(inline, true);
     static assert (!__traits(isStaticArray, T), "acopy does not accept static arrays since they are passed by value");

     import whereever_memcpy_is: memcpy;
     memcpy(dst.ptr, src, dst.length * ElementSizeForCopy!dst);
}
/// ditto
void acopy(T,U)(T dst, U src) @system
if (isPointerLike!T && isArrayLike!U && dst[0].sizeof == src[0].sizeof)
{
     pragma(inline, true);

     import whereever_memcpy_is: memcpy;
     memcpy(dst, src.ptr, src.length * ElementSizeForCopy!dst);
}
/// ditto
void acopy(T,U)(T dst, U src, size_t size) @system
if (isPointerLike!T && isPointerLike!U && dst[0].sizeof == src[0].sizeof)
{
     pragma(inline, true);
     import whereever_memcpy_is: memcpy;
     memcpy(dst, src, size * ElementSizeForCopy!dst);
}


Note that the isArrayLike and isPointerLike and 
ElementSizeForCopy would probably look something like:


template isArrayLike(T)
{
     enum isArrayLike =
            is(typeof(T.init.length))
         && is(typeof(T.init.ptr))
         && is(typeof(T.init[0]));
}
template isPointerLike(T)
{
     enum isPointerLike =
            T.sizeof == (void*).sizeof
         && is(typeof(T.init[0]));
}

// The size of each array element.  If the actual size is 0, then it
// is assumed to be 1.
template ElementSizeForCopy(alias Array)
{
     static if (Array[0].sizeof == 0)
         enum ElementSizeForCopy = 1;
     else
         enum ElementSizeForCopy = Array[0].sizeof;
}

Note that everything here is an inline-template, so everything 
gets reduced to a single memcpy call and some bounds checks.

May 29 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Wednesday, 29 May 2019 at 15:41:42 UTC, Jonathan Marler wrote:
 The default memcpy signature is still pretty useful in many 
 cases.  The original signature should still be implemented and 
 available as a non-template function:

 void memcpy(void* dst, void* src, size_t length);

 For D, you should also create a template so developers don't 
 have to cast to `void*` all the time, but it just forwards all 
 calls to the real memcpy function like this:

 void memcpy(T,U)(T* dst, U* src, size_t length)
 {
     pragma(inline, true);
     memcpy(cast(void*)dst, cast(void*)src, length);
 }

 And there's no need to have a different name like `memcpyD`. 
 The function behaves the same as libc's memcpy, and when you 
 have libc available, you should use that implementation instead 
 so you can leverage other people's work.

I'm not sure about that. Does it really make sense to have such an
interface in the case where you don't have libc memcpy available?
Although, there is a discussion about such fallback functions. 
But I don't
know, I feel like it will encourage bad practices.

In the same way, I don't know about whether it should accept two 
different types.

 However, we also want to get type-safety and bounds-checking 
 when we can.  So we should also provide a set of templates 
 that accept D arrays, verify type-safety and bounds, 
 then forward the call to memcpy.

Those are good ideas. But I think all this could be done 
explicitly with
(ref T[] dst, ref T[] source). This makes a specific-to-arrays 
version,
which again I'm unsure if it is good to make specific cases.

Generally, all those things are up for discussion, I don't pretend
to have some definitive answer.

The thing with all this code depending on libc memcpy is that to 
my understanding,
the prospect is that libc will be removed. And this project is a 
step towards that
by making some better D versions (meaning, leveraging D features).
If the better version calls libc, then when libc
is finally removed, all this code will break. And because we 
encouraged
this bad practice, _a lot_ of code will break.
Which will then force people to write their D-version of 
memcpy(void *dst, const void *src, size_t len);
Which of course is bad because suddenly we lose all the D 
benefits and all the work that has been put into libc.

Best regards,
Stefanos

May 29 2019

Jonathan Marler <johnnymarler gmail.com> writes:

On Wednesday, 29 May 2019 at 17:35:03 UTC, Stefanos Baziotis 
wrote:
 On Wednesday, 29 May 2019 at 15:41:42 UTC, Jonathan Marler 
 wrote:
 The default memcpy signature is still pretty useful in many 
 cases.  The original signature should still be implemented and 
 available as a non-template function:

 void memcpy(void* dst, void* src, size_t length);

 For D, you should also create a template so developers don't 
 have to cast to `void*` all the time, but it just forwards all 
 calls to the real memcpy function like this:

 void memcpy(T,U)(T* dst, U* src, size_t length)
 {
     pragma(inline, true);
     memcpy(cast(void*)dst, cast(void*)src, length);
 }

 And there's no need to have a different name like `memcpyD`. 
 The function behaves the same as libc's memcpy, and when you 
 have libc available, you should use that implementation 
 instead so you can leverage other people's work.

 I'm not sure about that. Does it really make sense to have such 
 an
 interface in the case where you don't have libc memcpy 
 available?

Sure.  Any time you have a buffer whose type isn't known at 
compile-time and you need to copy between them.  For example, I 
have an audio program that copies buffers of audio, but the 
format of that buffer could be an array of floats or integers 
depending on the format that your audio hardware and OS support.
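
A minimal sketch of that situation, assuming a hypothetical `SampleFormat` enum and `copySamples` helper (neither comes from the audio program mentioned above). The element type is only known at run time, so the copy has to go through an untyped, size-based memcpy:

```d
import core.stdc.string : memcpy;

// Hypothetical sample formats; the names are illustrative only.
enum SampleFormat { int16, float32 }

// Bytes per sample for each format.
size_t sampleSize(SampleFormat fmt) pure nothrow @nogc @safe
{
    final switch (fmt)
    {
        case SampleFormat.int16:   return short.sizeof;
        case SampleFormat.float32: return float.sizeof;
    }
}

// Copies `count` samples without knowing the element type at compile time.
void copySamples(void* dst, const(void)* src, size_t count, SampleFormat fmt) nothrow @nogc
{
    memcpy(dst, src, count * sampleSize(fmt));
}

void main()
{
    float[4] a = [0.5f, 1.0f, 1.5f, 2.0f];
    float[4] b = 0;
    copySamples(b.ptr, a.ptr, a.length, SampleFormat.float32);
    assert(b == a);
}
```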

 Although, there is a discussion about such fallback functions. 
 But I don't
 know, I feel like it will encourage bad practices.

 In the same way, I don't know about whether it should accept 
 two different types.

Well that's why you have memcpy (for those who know what they're 
doing) and you have other functions for safe behavior.  But you 
don't want to instantiate a new version of memcpy for every type 
variation, that's why they all just forward the call to the real 
memcpy.

 However, we also want to get type-safety and bounds-checking 
 when we can.  So we should also provide a set of templates 
 that accept D arrays, verify type-safety and bounds, 
 then forward the call to memcpy.

 Those are good ideas. But I think all this could be done 
 explicitly with
 (ref T[] dst, ref T[] source). This makes a specific-to-arrays 
 version,
 which again I'm unsure if it is good to make specific cases.

Yes it could be done, but then you end up with N copies of your 
memcpy implementation, one for every combination of types.  
Your code size is going to explode.  You can certainly support 
the signature you provided, I just wouldn't have the 
implementation inside of that template, instead you should cast 
and forward to memcpy.

 The thing with all this code depending on libc memcpy is that 
 to my understanding,
 the prospect is that libc will be removed. And this project is 
 a step towards that
 by making some better D versions (meaning, leveraging D 
 features).

Right, which is why you use the libc version by default, and only 
use your own when libc is disabled.  This is what I do in my 
standard library https://github.com/marler8997/mar which works 
with or without libc.  I went through several designs for how to 
go about this memcpy solution and what I've provided you is the 
result of that.

 If the better version calls libc, then when libc
 is finally removed, all this code will break. And because we 
 encouraged
 this bad practice, _a lot_ of code will break.

How would it break?  If you remove libc, your module should now 
enable your implementation of memcpy.  And all the code that 
calls memcpy doesn't care whether it came from libc or from a D 
module.
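
As a rough illustration of that switch, here is a sketch that assumes a hypothetical version identifier `UseLibc` (for example `-version=UseLibc` with DMD). It is not how the mar library or druntime actually selects its implementation, and the fallback is deliberately naive:

```d
// `UseLibc` is a hypothetical version identifier, used here for illustration.
version (UseLibc)
{
    import core.stdc.string : memcpy;
}
else
{
    // Deliberately naive byte-by-byte fallback. It has D linkage, so it does
    // not clash with the C symbol of the same name.
    void* memcpy(void* dst, const(void)* src, size_t n) nothrow @nogc
    {
        auto d = cast(ubyte*) dst;
        auto s = cast(const(ubyte)*) src;
        foreach (i; 0 .. n)
            d[i] = s[i];
        return dst;
    }
}

void main()
{
    // Call sites look the same whichever branch was compiled in.
    int[3] src = [1, 2, 3];
    int[3] dst;
    memcpy(dst.ptr, src.ptr, src.sizeof);
    assert(dst == src);
}
```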

May 29 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Wednesday, 29 May 2019 at 17:45:59 UTC, Jonathan Marler wrote:
 I'm not sure about that. Does it really make sense to have 
 such an
 interface in the case where you don't have libc memcpy 
 available?

 Sure.  Any time you have a buffer whose type isn't known at 
 compile-time and you need to copy between them.  For example, I 
 have an audio program that copies buffers of audio, but the 
 format of that buffer could be an array of floats or integers 
 depending on the format that your audio hardware and OS support.

So, you copy ubyte*.

 Well that's why you have memcpy (for those who know what 
 they're doing) and you have other functions for safe behavior.  
 But you don't want to instantiate a new version of memcpy for 
 every type variation, that's why they all just forward the call 
 to the real memcpy.

You do want that, because instantiation and inlining for specific types is
what makes D memcpy fast. It is also what I hope will give better error
messages and instrumentation. That's yet to be seen, though; the most
important thing is the performance.

 Yes it could be done, but then you end up with N copies of your 
 memcpy implementation, one for every combination of types.  
 Your code size is going to explode.  You can certainly 
 support the signature you provided, I just wouldn't have the 
 implementation inside of that template, instead you should cast 
 and forward to memcpy.

Actually, code size for arrays is a very good reminder, thanks.

 The thing with all this code depending on libc memcpy is that 
 to my understanding,
 the prospect is that libc will be removed. And this project is 
 a step towards that
 by making some better D versions (meaning, leveraging D 
 features).

 Right, which is why you use the libc version by default, and 
 only use your own when libc is disabled.  This is what I do in 
 my standard library https://github.com/marler8997/mar which 
 works with or without libc.  I went through several designs for 
 how to go about this memcpy solution and what I've provided you 
 is the result of that.

 If the better version calls libc, then when libc
 is finally removed, all this code will break. And because we 
 encouraged
 this bad practice, _a lot_ of code will break.

 How would it break?  If you remove libc, your module should now 
 enable your implementation of memcpy.  And all the code that 
 calls memcpy doesn't care whether it came from libc or from a D 
 module.

My point is that you will write code differently depending on which memcpy
you have; that's why this "new memcpy" will have a different signature. To have
the best of both worlds, we would have to write our own
memcpy(void*, void*, size_t);.
And so, if you encourage the use of this interface (because hey, even if you
eventually don't have libc, your code will not crash), then when libc is not
present, the code will be slow.

May 29 2019

Jonathan Marler <johnnymarler gmail.com> writes:

On Wednesday, 29 May 2019 at 17:55:49 UTC, Stefanos Baziotis 
wrote:
 On Wednesday, 29 May 2019 at 17:45:59 UTC, Jonathan Marler 
 wrote:
 I'm not sure about that. Does it really make sense to have 
 such an
 interface in the case where you don't have libc memcpy 
 available?

 Sure.  Any time you have a buffer whose type isn't known at 
 compile-time and you need to copy between them.  For example, 
 I have an audio program that copies buffers of audio, but the 
 format of that buffer could be an array of floats or integers 
 depending on the format that your audio hardware and OS 
 support.

 So, you copy ubyte*.

It doesn't make a difference whether the final memcpy is `void*` 
or `byte*`.  The point is that it's one function, not a template, 
and you might as well use the same type that the real memcpy uses 
so you don't change the signature when you're not using libc.

 Well that's why you have memcpy (for those who know what 
 they're doing) and you have other functions for safe behavior.
  But you don't want to instantiate a new version of memcpy for 
 every type variation, that's why they all just forward the 
 call to the real memcpy.

 You want, because instantiation and inlining of specific types 
 is
 what makes D memcpy fast. And also, what I hope will make 
 better error
 messages and instrumentation. But that's yet to be seen, most 
 important
 is the performance.

You don't want to inline the memcpy implementation.  What makes 
you think that would be faster?

May 29 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Wednesday, 29 May 2019 at 18:00:57 UTC, Jonathan Marler wrote:
 It doesn't make a difference whether the final memcpy is 
 `void*` or `byte*`.

Yes.

 The point is that it's one function, not a template, and you 
 might as well use the same type that the real memcpy uses so 
 you don't change the signature when you're not using libc.

This is what will prevent doing anything really useful in D.
This is what I meant: to have that, you have to implement
the D version of libc memcpy.

 You don't want to inline the memcpy implementation.  What makes 
 you think that would be faster?

CTFE / introspection I hope and currently, benchmarks.

May 29 2019

Jonathan Marler <johnnymarler gmail.com> writes:

On Wednesday, 29 May 2019 at 18:04:07 UTC, Stefanos Baziotis 
wrote:
 You don't want to inline the memcpy implementation.  What 
 makes you think that would be faster?

 CTFE / introspection I hope and currently, benchmarks.

You didn't answer the question.  How would inlining the 
implementation of memcpy be faster? The implementation of memcpy 
doesn't need to know which types it is copying, so every call to 
it can have the exact same implementation.  You only need one 
instance of the implementation.  This means you can fine-tune it, 
many libc implementations will implement it in assembly because 
it's used so often and again, it doesn't need to know what types 
it is copying.  All it needs is 2 pointers and a size.  That's why in 
D, you should only create wrappers that ensure type-safety and 
bounds checking and then forward to the real implementation, and 
those wrappers should be inlined but not the memcpy 
implementation itself.

If you want to provide your own implementation of memcpy you can, 
but inlining your implementation into every call, when the 
implementation is truly type agnostic just results in code bloat 
with no benefit.

May 29 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Wednesday, 29 May 2019 at 18:14:11 UTC, Jonathan Marler wrote:
 You didn't answer the question.

I don't know how "benchmarks" does not answer a question. For me, 
it's
the most important answer.

 How would inlining the implementation of memcpy be faster? The 
 implementation of memcpy doesn't need to know which types it is 
 copying, so every call to it can have the exact same 
 implementation.  You only need one instance of the 
 implementation.  This means you can fine-tune it, many libc 
 implementations will implement it in assembly because it's used 
 so often and again, it doesn't need to know what types it is 
 copying.  All it needs is 2 pointers and a size.  That's why in D, 
 you should only create wrappers that ensure type-safety and 
 bounds checking and then forward to the real implementation, 
 and those wrappers should be inlined but not the memcpy 
 implementation itself.

 If you want to provide your own implementation of memcpy you 
 can, but inlining your implementation into every call, when the 
 implementation is truly type agnostic just results in code 
 bloat with no benefit.

It is typed currently, with benefits. It's not the same for every type, and our
idea is not to just forward the size. By inlining, you can get considerably
better performance, precisely because you inline, you don't just forward the
size, and you know information about the type.
Check this: 
https://github.com/JinShil/memcpyD/blob/master/memcpyd.d
And preferably, run it and see the asm generated.
Also, what should be considered is that types give you the info 
about alignment
and different implementations depending on this alignment.

May 29 2019

Jonathan Marler <johnnymarler gmail.com> writes:

On Wednesday, 29 May 2019 at 19:06:43 UTC, Stefanos Baziotis 
wrote:
 On Wednesday, 29 May 2019 at 18:14:11 UTC, Jonathan Marler 
 wrote:
 You didn't answer the question.

 I don't know how "benchmarks" does not answer a question. For 
 me, it's
 the most important answer.

Yes that would be an answer, I guess I got confused when you 
mentioned CTFE and introspection, I wasn't sure if "benchmarks" 
was referring to those features or to runtime benchmarks.  And 
it looks like Mike posted the benchmarks on that github link you 
sent.


 How would inlining the implementation of memcpy be faster? The 
 implementation of memcpy doesn't need to know which types it 
 is copying, so every call to it can have the exact same 
 implementation.  You only need one instance of the 
 implementation.  This means you can fine-tune it, many libc 
 implementations will implement it in assembly because it's 
 used so often and again, it doesn't need to know what types it 
 is copying.  All it needs is 2 pointers and a size.  That's why in 
 D, you should only create wrappers that ensure type-safety and 
 bounds checking and then forward to the real implementation, 
 and those wrappers should be inlined but not the memcpy 
 implementation itself.

 If you want to provide your own implementation of memcpy you 
 can, but inlining your implementation into every call, when 
 the implementation is truly type agnostic just results in code 
 bloat with no benefit.

 It is typed currently, with benefits. It's not the same for 
 every type and our
 idea is not to just forward the size. By inlining, you can get 
 quite better
 performance exactly because you inline and you don't just 
 forward the size and
 because you know info about the type.
 Check this: 
 https://github.com/JinShil/memcpyD/blob/master/memcpyd.d
 And preferably, run it and see the asm generated.
 Also, what should be considered is that types give you the info 
 about alignment
 and different implementations depending on this alignment.

It's true that if you can assume pointers are aligned on a 
particular boundary that you can be faster than memcpy which 
works with any alignment.  This must be what Mike is doing, 
though, I would then create only a few instances of memcpy that 
assume alignment on boundaries like 4, 8, 16.  And if you have a 
pointer or an array to a particular type, you can probably assume 
that pointer/array is aligned on that type's "alignof" property.

I think I will use this in my library.
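
One way the alignment idea could look, as a sketch only: the helper `copyOne` and the struct `Pair` are made up for illustration and are not taken from memcpyD or mar. The chunk width is chosen at compile time from `T.alignof`, so no runtime alignment check is needed:

```d
// Illustrative 8-byte-aligned type (on typical 64-bit targets).
struct Pair { long a; long b; }

// Copies one T, using the widest chunk type that T's alignment permits.
void copyOne(T)(T* dst, const(T)* src) nothrow @nogc
{
    static if (T.alignof >= 8 && T.sizeof % 8 == 0)
        alias Chunk = ulong;
    else static if (T.alignof >= 4 && T.sizeof % 4 == 0)
        alias Chunk = uint;
    else
        alias Chunk = ubyte;

    auto d = cast(Chunk*) dst;
    auto s = cast(const(Chunk)*) src;
    foreach (i; 0 .. T.sizeof / Chunk.sizeof)
        d[i] = s[i];
}

void main()
{
    Pair p = Pair(1, 2), q;
    copyOne(&q, &p);            // copied in wide chunks
    assert(q == p);

    ubyte[3] x = [7, 8, 9], y;
    copyOne(&y, &x);            // alignment 1, copied byte by byte
    assert(y == x);
}
```

Each distinct chunk choice still instantiates its own loop, so this only bounds code size if the number of alignment classes is kept small, as suggested above.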

May 29 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Wednesday, 29 May 2019 at 19:35:36 UTC, Jonathan Marler wrote:
 Yes that would be an answer, I guess I got confused when you 
 mentioned CTFE and introspection, I wasn't sure if "benchmarks" 
 was referring to those features or to runtime benchmarks.  And 
 looks like  Mike posted the benchmarks on that github link you 
 sent.

Great, you can see in the benchmarks that memcpyD is faster than libc memcpy
except for sizes larger than 32768. We hope that we can surpass those
as well; yesterday I did some simple inline SIMD things and
got better performance at 32768.
The previous work is of course Mike's, and those benchmark results
are in part because of inlining.

 It's true that if you can assume pointers are aligned on a 
 particular boundary that you can be faster than memcpy which 
 works with any alignment.  This must be what Mike is doing, 
 though, I would then create only a few instances of memcpy that 
 assume alignment on boundaries like 4, 8, 16.  And if you have 
 a pointer or an array to a particular type, you can probably 
 assume that pointer/array is aligned on that type's "alignof" 
 property.

This is, as I said, the alignment guarantee. I hope that I can 
get other benefits
from types also.
Also, hopefully we will do LDC / GDC specific things. Leverage 
the intrinsics for example.
I will put an update shortly, as the other students, explaining 
some of that, but I thought since we started it.. :p

 I think I will use this in my library.

Great! We hope that it will be useful and any feedback is 
appreciated!

May 29 2019

Jonathan Marler <johnnymarler gmail.com> writes:

On Wednesday, 29 May 2019 at 20:28:18 UTC, Stefanos Baziotis 
wrote:
 On Wednesday, 29 May 2019 at 19:35:36 UTC, Jonathan Marler 
 wrote:
[...]

 Great, you can see that in the benchmarks, memcpyD is faster 
 than libc memcpy
 except for sizes larger than 32768. We hope that we can surpass 
 those
 as well, as yesterday I did some simple inline SIMD things and 
 got better performance in 32768.
 But previous work is of course responsibility of Mike and those 
 benchmarks
 are in part because of inlining.

[...]

 This is, as I said, the alignment guarantee. I hope that I can 
 get other benefits
 from types also.
 Also, hopefully we will do LDC / GDC specific things. Leverage 
 the intrinsics for example.
 I will put an update shortly, as the other students, explaining 
 some of that, but I thought since we started it.. :p

 [...]

 Great! We hope that it will be useful and any feedback is 
 appreciated!

I haven't benchmarked it yet but here are the changes I've made to 
my standard library to also take advantage of alignment 
guarantees from typed pointers and arrays.

https://github.com/dragon-lang/mar/commit/bb096d2d4f489d47177f6a678b1d9bab756e3dc7

May 29 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Wednesday, 29 May 2019 at 23:27:35 UTC, Jonathan Marler wrote:
 I haven't benchmarked it yet but here's the changes I've made 
 to my standard library to also take advantage of alignment 
 guarantees from typed pointers and arrays.

 https://github.com/dragon-lang/mar/commit/bb096d2d4f489d47177f6a678b1d9bab756e3dc7

Good, this week I'm also working on alignment. (more 
specifically, mis-alignment).
Since you took the time anyway to play with alignment, you might 
find
SIMD instructions useful.
Take a look at Mike's memcpyD. The toy SIMD version I wrote yesterday,
which surpassed libc memcpy, was as simple as:

static foreach(i; 0 .. T.sizeof/32) {
     // Assuming RDI is 'dst' and RSI 'src'
     asm pure nothrow @nogc {
         vmovdqa YMM0, [RDI+i*32];
         vmovdqa [RSI+i*32], YMM0;
     }
}
/* instead of
static foreach(i; 0 .. T.sizeof/32)
{
     memcpyD((cast(S!32*)dst) + i, (cast(const S!32*)src) + i);
}
*/

Again, really simple and dumb, but effective. A couple of notes, 
so that you
don't have the headaches I had:
1) You can use `vmovdqu` (notice the 'u' at the end) for 
unaligned memory and
skip note 2.
2) `vmovdqa` assumes 32-byte aligned memory. Now, `align()` is kind of
buggy, so if you have a normal buffer on the stack that you want to align, this:
align(32) ubyte[32768] buf;
won't work.
One solution is to allocate memory on heap and do slight pointer 
arithmetic
to have it aligned.
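
A sketch of that workaround, assuming plain `malloc`/`free` from the C runtime and a made-up helper name `alignUp`. The idea is to over-allocate by `alignment - 1` bytes and round the pointer up:

```d
import core.stdc.stdlib : malloc, free;

// Rounds `p` up to the next multiple of `alignment` (must be a power of two).
void* alignUp(void* p, size_t alignment) nothrow @nogc
{
    const addr = cast(size_t) p;
    return cast(void*) ((addr + alignment - 1) & ~(alignment - 1));
}

void main()
{
    enum size = 32_768;
    enum alignment = 32;

    // Over-allocate so that an aligned block of `size` bytes always fits.
    void* raw = malloc(size + alignment - 1);
    assert(raw !is null);
    scope (exit) free(raw);     // free the original pointer, not the aligned one

    ubyte* buf = cast(ubyte*) alignUp(raw, alignment);
    assert(cast(size_t) buf % alignment == 0);
    buf[0 .. size] = 0;         // the whole aligned block is usable
}
```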

Last minute discovery:
Haha, the compiler flags I used were: -mcpu=avx -inline
With these flags, memcpyD is faster.
_Removing_ -inline resulted in faster code for libc memcpy. I'll 
have to take a closer look tomorrow.
(Oh, and the libc memcpy, it seems from disasm, achieves these 
results with sse3, so 128-bit instructions. I mean.. at least 
impressive).

May 29 2019

kinke <noone nowhere.com> writes:

On Thursday, 30 May 2019 at 00:55:54 UTC, Stefanos Baziotis wrote:
 Now, `align()` is kind of buggy

It works fine with LDC, and I guess with GDC too.

May 29 2019

welkam <wwwelkam gmail.com> writes:

On Wednesday, 29 May 2019 at 18:14:11 UTC, Jonathan Marler wrote:
 and then forward to the real implementation

With D you can forward to the best-suited implementation. What libc 
does is perform various runtime checks in order to figure out 
the best way of copying the provided input. With D it should 
be possible to make certain checks at compile time. Secondly, C's 
memcpy is a big function not because that's best for performance 
but because of convenience. With D we can have many smaller 
functions, selected by template magic.

May 29 2019

kinke <noone nowhere.com> writes:

On Wednesday, 29 May 2019 at 11:46:28 UTC, Stefanos Baziotis 
wrote:
 My initial pick was void memcpyD(T)(T* dst, const T* src), but 
 it was proposed
 that `ref` instead of pointers might be better.

ref would only work when copying one instance at a time. Many 
times, you'll want to copy a contiguous array of a length only 
known at runtime (and definitely NOT invoke memcpy in a loop, so 
that the implementation can e.g. use SIMD streaming when copying 
gazillions of 32-bit pixels).

I'd suggest a structure similar to this, minimizing bloat:

// int a, b;            memcpyD(&a, &b);
// int[4] a, b;         memcpyD(&a, &b);
// int[16] a; int[4] b; memcpyD!4(&a[8], b.ptr);
void memcpyD(size_t length = 1, T)(T* dst, const T* src)
{
     pickBestImpl!(T.alignof, length * T.sizeof)(dst, src);
}

void memcpyD(T)(T* dst, const T* src, size_t length)
{
     pickBestImpl!(T.alignof)(dst, src, length * T.sizeof);
}

private:

/* These 2 will probably share most logic, the first one just exploiting a
 * static size. A common mixin might come in handy (e.g., switching from
 * runtime-if to static-if).
 */
void pickBestImpl(size_t alignment, size_t size)(void* dst, const void* src);
void pickBestImpl(size_t alignment)(void* dst, const void* src, size_t size);
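
To make the structure above concrete, here is one possible filling-in of the two `pickBestImpl` overloads. The `Chunk` and `chunkedCopy` helpers and the unrolling threshold are assumptions made for this sketch, not part of the proposal, and the copy loops are intentionally simple:

```d
// Widest chunk type a pointer with the given alignment can safely use.
template Chunk(size_t alignment)
{
    static if (alignment >= 8)      alias Chunk = ulong;
    else static if (alignment >= 4) alias Chunk = uint;
    else static if (alignment >= 2) alias Chunk = ushort;
    else                            alias Chunk = ubyte;
}

// Shared helper: plain chunked loop (size is a multiple of C.sizeof here).
void chunkedCopy(C)(void* dst, const(void)* src, size_t size)
{
    auto d = cast(C*) dst;
    auto s = cast(const(C)*) src;
    foreach (i; 0 .. size / C.sizeof)
        d[i] = s[i];
}

// Static size: small copies are unrolled at compile time.
void pickBestImpl(size_t alignment, size_t size)(void* dst, const(void)* src)
{
    alias C = Chunk!alignment;
    static if (size <= 8 * C.sizeof)
    {
        auto d = cast(C*) dst;
        auto s = cast(const(C)*) src;
        static foreach (i; 0 .. size / C.sizeof)
            d[i] = s[i];
    }
    else
        chunkedCopy!C(dst, src, size);
}

// Runtime size.
void pickBestImpl(size_t alignment)(void* dst, const(void)* src, size_t size)
{
    chunkedCopy!(Chunk!alignment)(dst, src, size);
}

void memcpyD(size_t length = 1, T)(T* dst, const(T)* src)
{
    pickBestImpl!(T.alignof, length * T.sizeof)(dst, src);
}

void memcpyD(T)(T* dst, const(T)* src, size_t length)
{
    pickBestImpl!(T.alignof)(dst, src, length * T.sizeof);
}

void main()
{
    int a = 1, b = 2;
    memcpyD(&a, &b);                            // one instance, static size
    assert(a == 2);

    int[16] big = 0;
    int[4] small = [1, 2, 3, 4];
    memcpyD!4(&big[8], small.ptr);              // four ints, static size
    assert(big[8 .. 12] == small[]);

    int[] dyn = new int[4];
    memcpyD(dyn.ptr, small.ptr, small.length);  // length known only at run time
    assert(dyn == small[]);
}
```

Because the implementations are keyed on alignment and byte size rather than on `T`, types with the same alignment and size share one instantiation, which is where the reduction in bloat comes from.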

May 29 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Wednesday, 29 May 2019 at 20:50:45 UTC, kinke wrote:
 ref would only work when copying one instance at a time. Many 
 times, you'll want to copy a contiguous array of a length only 
 known at runtime (and definitely NOT invoke memcpy in a loop, 
 so that the implementation can e.g. use SIMD streaming when 
 copying gazillions of 32-bit pixels).

The current state is that we think that slices should be enough 
for this need.
Meaning, you don't need the third size parameter. In this case, 
ref is better. On the other hand, in other cases I think that pointers
are more intuitive. Again, of course, what _I_ think is of little
importance. That post was primarily made so that you, the community,
can give feedback on this.

Apart from that, I'm still sceptical about whether we should 
provide
a version with size..

May 29 2019

kinke <noone nowhere.com> writes:

On Thursday, 30 May 2019 at 00:18:06 UTC, Stefanos Baziotis wrote:
 The current state is that we think that slices should be enough 
 for this need.
 Meaning, you don't need the third size parameter. In this case, 
 ref is better. On the other, in other cases I think that 
 pointers
 are more intuitive.

In D, there's no ugly and unsafe need to pass slices to memcpy, 
as a simple `dst[] = src[]` can do the job much better, boiling 
down to a memcpy (with 3rd param) if T is a POD (and the two 
slices don't overlap, have the same length etc. if bounds checks 
are enabled).

Taking a slice by ref, if I understand you correctly, would 
firstly only work with slice lvalues (i.e., no `ptr[0..$-1]` 
rvalues), and secondly IMO be very confusing and bad for generic 
code, as I would expect the slice itself to be memcopied then, 
not its contents.
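
For reference, a small example of the built-in slice copy described above (variable names are illustrative):

```d
void main()
{
    int[] src = [1, 2, 3, 4];
    int[] dst = new int[4];

    dst[] = src[];                  // copies the elements, not the slice itself
    assert(dst == [1, 2, 3, 4]);
    assert(dst.ptr !is src.ptr);    // still two distinct memory blocks

    // Slicing a pointer yields an rvalue slice, which is fine on either side.
    int* p = src.ptr;
    dst[0 .. 2] = p[2 .. 4];
    assert(dst == [3, 4, 3, 4]);
}
```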

May 29 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Thursday, 30 May 2019 at 01:19:54 UTC, kinke wrote:

 In D, there's no ugly and unsafe need to pass slices to memcpy, 
 as a simple `dst[] = src[]` can do the job much better, boiling 
 down to a memcpy (with 3rd param) if T is a POD (and the two 
 slices don't overlap, have the same length etc. if bounds 
 checks are enabled).

This is an important observation.  My vision for the GSoC project 
was targeted primarily at druntime. D memcpy would rarely, if 
ever, be invoked directly by most users.  Expressions like 
`dst[] = src[]`, and other assignment expressions that require 
memcpy as part of their behavior, would be lowered by the compiler 
to the runtime memcpy template.

Mike
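
A rough sketch of what such typed entry points might look like is given here. The names `copyOne` and `copyMany` are hypothetical and are not the actual druntime interface; both simply forward to the C `memcpy` so the example stays short, whereas the GSoC project aims to replace that call with a D implementation.

```d
import core.stdc.string : memcpy;

/// Hypothetical entry point for a single object. The size is T.sizeof,
/// a compile-time constant the implementation could specialize on.
void copyOne(T)(ref T dst, ref const T src) @trusted nothrow @nogc
{
    memcpy(&dst, &src, T.sizeof);
}

/// Hypothetical entry point for arrays. The element type is known at
/// compile time, the element count only at runtime.
void copyMany(T)(T[] dst, const(T)[] src) @trusted nothrow @nogc
{
    assert(dst.length == src.length, "lengths must match");
    memcpy(dst.ptr, src.ptr, dst.length * T.sizeof);
}

unittest
{
    static struct Pixel { ubyte r, g, b, a; }

    Pixel a = Pixel(1, 2, 3, 4);
    Pixel b;
    copyOne(b, a);
    assert(b == a);

    int[3] x = [1, 2, 3];
    int[3] y;
    copyMany(y[], x[]);
    assert(y == x);
}
```

Under this assumption, an expression like `dst[] = src[]` on a POD element type would correspond to a call like `copyMany(dst, src)`.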

May 29 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Thursday, 30 May 2019 at 01:35:05 UTC, Mike Franklin wrote:
 On Thursday, 30 May 2019 at 01:19:54 UTC, kinke wrote:

 In D, there's no ugly and unsafe need to pass slices to 
 memcpy, as a simple `dst[] = src[]` can do the job much 
 better, boiling down to a memcpy (with 3rd param) if T is a 
 POD (and the two slices don't overlap, have the same length 
 etc. if bounds checks are enabled).

 This is an important observation.  My vision for the GSoC 
 project was targeted primarily at druntime. D memcpy would 
 rarely, if ever, be invoked directly by most users.

If we don't really target users, then that makes this:

 Apart from that, I'm still sceptical about whether we should 
 provide
 a version with size..

Not important. Because my thought was that a lot of users would
have some pointers a, b and somehow want to do: memcpy(a, b, 
for_some_size);

What I'm thinking is that yes, we decouple D from libc _in the D 
Runtime_.
But in general, users may still want that.

May 30 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Thursday, 30 May 2019 at 08:28:50 UTC, Stefanos Baziotis wrote:

 If we don't really target users, then that makes this:

 Apart from that, I'm still sceptical about whether we should 
 provide
 a version with size..

 Not important. Because my thought was that a lot of users would
 have some pointers a, b and somehow want to do: memcpy(a, b, 
 for_some_size);

 What I'm thinking is that yes, we decouple D from libc _in the D 
 Runtime_.
 But in general, users may still want that.

If users need to copy blocks of memory they should first prefer 
those D features that were added to improve upon C, so users 
don't have to resort to raw pointers, pointer arithmetic, 
managing sizes outside of arrays, etc.  See Walter's article "C's 
biggest mistake" for some perspective on that: 
http://www.drdobbs.com/architecture-and-design/cs-biggest-mistake/228701625
It's important when designing a D replacement for a C feature not 
to repeat C's mistakes.
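
As a small illustration of that preference, a pointer-plus-size call can be expressed through slices. This is only a sketch, and `copyElements` is a hypothetical name rather than an existing library function:

```d
/// Hypothetical helper: copies `n` elements from `src` to `dst`.
/// Slicing a pointer yields an array that carries its own length, so the
/// copy is a plain slice assignment. The caller must still guarantee that
/// both pointers refer to at least `n` valid elements, hence @system.
void copyElements(T)(T* dst, const(T)* src, size_t n) @system
{
    dst[0 .. n] = src[0 .. n];
}

unittest
{
    int[4] a = [1, 2, 3, 4];
    int[4] b;
    copyElements(b.ptr, a.ptr, a.length);
    assert(b == a);
}
```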

I wouldn't rule out a public interface in the future, but, at the 
moment, I don't see a compelling use case given that D has 
first-class arrays.  Regardless, a public interface should be 
required to achieve the goals of the GSoC project and could 
introduce controversy and other design complications.

Mike

May 30 2019

Mike Franklin <slavo5150 yahoo.com> writes:

On Thursday, 30 May 2019 at 09:10:11 UTC, Mike Franklin wrote:
 Regardless, a public interface should be 
 required to achieve the goals of the GSoC project and could 
 introduce controversy and other design complications.

should --> shouldn't

May 30 2019

Stefanos Baziotis <sdi1600105 di.uoa.gr> writes:

On Thursday, 30 May 2019 at 09:10:11 UTC, Mike Franklin wrote:
 If users need to copy blocks of memory they should first prefer 
 those D features that were added to improve upon C, so users 
 don't have to resort to raw pointers, pointer arithmetic, 
 managing sizes outside of arrays, etc.  See Walter's article 
 "C's biggest mistake" for some perspective on that: 
 http://www.drdobbs.com/architecture-and-design/cs-biggest-mistake/228701625
 It's important when designing a D replacement for a C feature not 
 to repeat C's mistakes.

I agree with Walter on that. I don't think, though, that ref, 
dynamic arrays as they are now, and the GC are the solution to 
that, or to low-level memory management in general.

I think that people are in 2 categories:

1) People that use these D features will probably never want to 
use memcpy() (directly) anyway.
2) People that use D more as a betterC will probably want to use
a memcpy() with pointers and possibly one more optional 
parameter, in which they will give size.

But, some important notes:

a) D moves in a certain direction, away from C, pointers etc., 
and towards ref and dynamic arrays. Agreeing with that is not 
important, but helping is.

b) If memcpy() targets (possibly only) the D Runtime, then it 
doesn't really care about the users in category 1) or 2), as they 
are on the user side.

So, I think the best option in this regard, especially given note 
a), is to use refs, unless there are serious implementation 
obstacles (which I doubt).

- Stefanos

May 30 2019

Kagamin <spam here.lot> writes:

IME partial copy primitives are lacking, so I use this:

/// Copy only as much as possible, return the copied data
T[] CopyHead(T)(T[] dst, in T[] src) pure
{
	if(dst.length>=src.length)return CopyAll(dst, src);
	CopyOverlap(dst, src[0..dst.length]);
	return dst;
}

/// Copy all input data, return the copied data
T[] CopyAll(T)(T[] dst, in T[] src) pure
{
	assert(dst.length>=src.length);
	dst=dst[0..src.length];
	CopyOverlap(dst, src);
	return dst;
}

/// Copy overlapping slices
void CopyOverlap(T)(T[] dst, in T[] src) pure
{
	import core.stdc.string:memmove;
	assert(dst.length==src.length,"same lengths required");
	byte[] dstBytes=cast(byte[])dst;
	memmove(dstBytes.ptr, src.ptr, dstBytes.length);
}

May 30 2019
