digitalmars.D.ldc - overlapping copy semantics question

Bruce Carneal <bcarneal gmail.com> writes:

Given the possibility of overlap, is it correct that the two copy 
functions in the godbolt example compile to identical code?

https://godbolt.org/z/jPhT5vhee

I think the @restrict function is compiled correctly since the 
compiler is free to assume no overlap, but the code generated for 
the unattributed variant appears to be off.  Perhaps I'm 
misunderstanding something wrt copy semantics?

As a point of reference, when compiled with gdc the code 
generated for the two functions is not identical.

May 20 2024

kinke <noone nowhere.com> writes:

I think the problem here is that you don't get the expected 
optimization to a memcpy (with -O3) when using the `@restrict` 
UDA, with the variant taking a D slice. So no correctness issue.

This boils down to the expected memcpy, apparently needing 
unpacking of D slices:
```
void cpr2(size_t srcLength, ubyte* src, @restrict ubyte* dst)
{
     foreach (i; 0 .. srcLength)
         dst[i] = src[i];
}
```

May 20 2024

Bruce Carneal <bcarneal gmail.com> writes:

On Monday, 20 May 2024 at 18:14:50 UTC, kinke wrote:
 I think the problem here is that you don't get the expected 
 optimization to a memcpy (with -O3) when using the `@restrict` 
 UDA, with the variant taking a D slice. So no correctness issue.

 This boils down to the expected memcpy, apparently needing 
 unpacking of D slices:
 ```
 void cpr2(size_t srcLength, ubyte* src, @restrict ubyte* dst)
 {
     foreach (i; 0 .. srcLength)
         dst[i] = src[i];
 }
 ```

I don't view that missed optimization as much of a problem, 
although I will note that gdc decided to issue a call to memmove 
for the @restrict slice code under -O3. The LDC cpr2 call out to 
memcpy for the lowered/non-slice variant seems entirely justified 
given @restrict.

What seems like a problem is emitting SIMD code for the vanilla 
(no attributes) cp() variant that doesn't produce the same result 
as a simple scalar loop would.  Consider:

values at location x: 0, 1, 2, 3, 4, ...
src at location x
dst at location x + 1

Shouldn't the vanilla scalar copy loop for the above just result 
in a bunch of zeros?  This is what I'd expect if a dead simple 
loop body were generated. If, on the other hand, you emit SIMD 
code for the loads and stores, as LDC is wont to do, you get 
something different.
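
A minimal sketch of the scalar expectation, assuming a slice-based copy loop like the one discussed (the names `cp` and `overlapDemo` and the 8-element buffer are illustrative, not taken from the godbolt source). With dst starting one element past src, each store feeds the next load, so the leading zero is propagated through the whole buffer:

```d
import std.stdio : writeln;

// Plain forward, element-by-element copy. No attributes are applied,
// so the compiler must allow for src and dst overlapping.
void cp(ubyte[] src, ubyte[] dst)
{
    foreach (i; 0 .. src.length)
        dst[i] = src[i];
}

// Hypothetical helper: builds the overlapping layout from the post
// (src at x, dst at x + 1) and returns the buffer after the copy.
ubyte[] overlapDemo()
{
    ubyte[] buf = [0, 1, 2, 3, 4, 5, 6, 7];
    cp(buf[0 .. $ - 1], buf[1 .. $]);
    return buf;
}

void main()
{
    writeln(overlapDemo()); // [0, 0, 0, 0, 0, 0, 0, 0]
}
```

Any code the compiler generates for the unattributed function has to reproduce this result for overlapping arguments.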

What am I missing?

May 20 2024

Bruce Carneal <bcarneal gmail.com> writes:

On Monday, 20 May 2024 at 23:09:35 UTC, Bruce Carneal wrote:
 On Monday, 20 May 2024 at 18:14:50 UTC, kinke wrote:
 ...

 What am I missing?

What I "missed" was the overlap check on entry to the vanilla 
code, which branches around the SIMD version when src and dst 
overlap.

Sorry for the noise.  Glad to learn about the (slight) 
optimization, though.
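
For readers following along, here is a rough hand-written sketch of that kind of guard, under stated assumptions: `cpGuarded` is a hypothetical name, and the disjointness test is a conservative stand-in, so the exact condition the compiler emits may differ. The idea is that the bulk path runs only when the two ranges cannot alias, and the simple scalar loop is kept as the fallback:

```d
// Hypothetical hand-written analogue of a guarded copy.
// Assumes dst.length >= src.length.
void cpGuarded(ubyte[] src, ubyte[] dst)
{
    const n = src.length;
    const s = cast(size_t) src.ptr;
    const d = cast(size_t) dst.ptr;

    // The ranges are disjoint if one ends before the other begins.
    if (s + n <= d || d + n <= s)
    {
        // Fast path: no overlap, so a block copy (which the backend
        // may turn into wide loads and stores) gives the same result.
        dst[0 .. n] = src[0 .. n];
    }
    else
    {
        // Slow path: overlap is possible, so keep strict
        // element-by-element forward order.
        foreach (i; 0 .. n)
            dst[i] = src[i];
    }
}
```

With the overlapping layout from earlier in the thread (dst one element past src) this takes the scalar path and produces the run of zeros; with separate buffers it takes the block-copy path and produces an ordinary copy.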

May 20 2024
