digitalmars.D.ldc - Work around conservative optimization


Kagamin <spam here.lot> writes:

uint load32_le(in ref ubyte[4] s)
{
	return s[0] | (s[1]<<8) | (s[2]<<16) | (s[3]<<24);
}

void store32_le(ref ubyte[4] dest, uint val)
{
	dest[0]=cast(byte)val;
	dest[1]=cast(byte)(val>>8);
	dest[2]=cast(byte)(val>>16);
	dest[3]=cast(byte)(val>>24);
}

The first function is optimized to one load, but the second 
remains as 4 stores. Is there a code pattern that gets around 
this?

Jun 02 2018

Johan Engelen <j j.nl> writes:

On Saturday, 2 June 2018 at 10:40:43 UTC, Kagamin wrote:
 uint load32_le(in ref ubyte[4] s)
 {
 	return s[0] | (s[1]<<8) | (s[2]<<16) | (s[3]<<24);
 }

 void store32_le(ref ubyte[4] dest, uint val)
 {
 	dest[0]=cast(byte)val;
 	dest[1]=cast(byte)(val>>8);
 	dest[2]=cast(byte)(val>>16);
 	dest[3]=cast(byte)(val>>24);
 }

 The first function is optimized to one load, but the second 
 remains as 4 stores. Is there a code pattern that gets around 
 this?

```
void store32_le_optim(ref ubyte[4] dest, uint val)
{
     import core.stdc.string;
     memcpy(&dest, &val, val.sizeof);
}
```

LLVM is not yet smart enough to merge these adjacent byte stores, but 
it does assume it may rely on standard memcpy semantics, so it can 
optimize the fixed-size memcpy call.

-Johan

Jun 02 2018

Johan Engelen <j j.nl> writes:

On Saturday, 2 June 2018 at 11:44:40 UTC, Johan Engelen wrote:
 ```
 void store32_le_optim(ref ubyte[4] dest, uint val)
 {
     import core.stdc.string;
     memcpy(&dest, &val, val.sizeof);
 }
 ```

This only works on little-endian machines, of course, but the 
proper version is easy:

```
void store32_le_optim(ref ubyte[4] dest, uint val)
{
     import core.stdc.string;
     ubyte[4] temp;
     temp[0]=cast(ubyte)val;
     temp[1]=cast(ubyte)(val>>8);
     temp[2]=cast(ubyte)(val>>16);
     temp[3]=cast(ubyte)(val>>24);
     memcpy(&dest, &temp, temp.sizeof);
}
```

See it in action for little-endian and big-endian targets: 
https://godbolt.org/g/QqcCpi

-Johan

Jun 02 2018

kinke <noone nowhere.com> writes:

On Saturday, 2 June 2018 at 18:32:37 UTC, Johan Engelen wrote:
 ```
 void store32_le_optim(ref ubyte[4] dest, uint val)
 {
     import core.stdc.string;
     ubyte[4] temp;
     temp[0]=cast(ubyte)val;
     temp[1]=cast(ubyte)(val>>8);
     temp[2]=cast(ubyte)(val>>16);
     temp[3]=cast(ubyte)(val>>24);
     memcpy(&dest, &temp, temp.sizeof);
 }
 ```

[For endian-ness conversion of integers, we have 
ldc.intrinsics.llvm_bswap().]

Jun 03 2018

kinke <noone nowhere.com> writes:

On Sunday, 3 June 2018 at 11:54:29 UTC, kinke wrote:
 [For endian-ness conversion of integers, we have 
 ldc.intrinsics.llvm_bswap().]

And, more portably, their core.bitop.bswap() aliases (restricted 
to uint and ulong though).
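
A minimal sketch of how bswap could be combined with the memcpy approach from earlier in the thread. The function name `store32_le_bswap` is hypothetical and not from the thread; the sketch assumes the predefined `BigEndian` version identifier is the right way to detect the target's byte order.

```d
import core.bitop : bswap;
import core.stdc.string : memcpy;

// Hypothetical name, for illustration only.
void store32_le_bswap(ref ubyte[4] dest, uint val)
{
    // Swap bytes only where the native order is not already little-endian.
    version (BigEndian)
        val = bswap(val);
    // memcpy instead of a direct uint store, so no alignment is assumed for dest.
    memcpy(&dest, &val, val.sizeof);
}

void main()
{
    ubyte[4] buf;
    store32_le_bswap(buf, 0x11223344);
    ubyte[4] expected = [0x44, 0x33, 0x22, 0x11];
    assert(buf == expected);
}
```

Unlike the byte-by-byte version, this one mentions endianness explicitly, which is the trade-off discussed in the replies.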

Jun 03 2018

"David Nadlinger" <code klickverbot.at> writes:

On 3 Jun 2018, at 12:54, kinke via digitalmars-d-ldc wrote:
 [For endian-ness conversion of integers, we have 
 ldc.intrinsics.llvm_bswap().]

At the risk of pointing out the obvious, I usually find it vastly 
preferable to just write the code in a way that's independent of the 
target platform's endianness – like in Johan's example – and let the 
optimizer deal with eliding the explicit handling if possible. Even 
DMD's optimizer can recognize those patterns just fine.

  — David

Jun 03 2018

kinke <noone nowhere.com> writes:

On Sunday, 3 June 2018 at 16:51:13 UTC, David Nadlinger wrote:
 On 3 Jun 2018, at 12:54, kinke via digitalmars-d-ldc wrote:
 [For endian-ness conversion of integers, we have 
 ldc.intrinsics.llvm_bswap().]

 At the risk of pointing out the obvious, I usually find it 
 vastly preferable to just write the code in a way that's 
 independent of the target platform's endianness – like in 
 Johan's example – and let the optimizer deal with eliding the 
 explicit handling if possible. Even DMD's optimizer can 
 recognize those patterns just fine.

No need to reinvent the wheel, the Phobos solution is trivial 
enough:

void store32_le(ref ubyte[4] dest, uint val)
{
     import std.bitmanip;
     dest = nativeToLittleEndian(val);
}

With `-O` there's no overhead on little-endian machines. On 
big-endian machines, however, you get a suboptimal non-inlined 
druntime call rather than the LLVM intrinsic directly.
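
For completeness, a sketch of the matching load using the Phobos counterpart, `std.bitmanip.littleEndianToNative`. The name `load32_le_phobos` is hypothetical, and the parameter is taken by value here purely to keep the example simple.

```d
import std.bitmanip : littleEndianToNative, nativeToLittleEndian;

// Hypothetical name, for illustration only.
uint load32_le_phobos(ubyte[4] s)
{
    // Interprets the four bytes as a little-endian uint on any host.
    return littleEndianToNative!uint(s);
}

void main()
{
    // Round trip: store as little-endian bytes, then load back.
    ubyte[4] buf = nativeToLittleEndian(0xDEADBEEFu);
    ubyte[4] expected = [0xEF, 0xBE, 0xAD, 0xDE];
    assert(buf == expected);
    assert(load32_le_phobos(buf) == 0xDEADBEEFu);
}
```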

Jun 03 2018

Kagamin <spam here.lot> writes:

On Sunday, 3 June 2018 at 11:54:29 UTC, kinke wrote:
 [For endian-ness conversion of integers, we have 
 ldc.intrinsics.llvm_bswap().]

You can't just store an integer directly on alignment-sensitive 
platforms, even little-endian ones.

Jun 04 2018

Kagamin <spam here.lot> writes:

On Saturday, 2 June 2018 at 18:32:37 UTC, Johan Engelen wrote:
 ```
 void store32_le_optim(ref ubyte[4] dest, uint val)
 {
     import core.stdc.string;
     ubyte[4] temp;
     temp[0]=cast(ubyte)val;
     temp[1]=cast(ubyte)(val>>8);
     temp[2]=cast(ubyte)(val>>16);
     temp[3]=cast(ubyte)(val>>24);
     memcpy(&dest, &temp, temp.sizeof);
 }
 ```

void store32_le_optim(ref ubyte[4] dest, uint val)
{
     import core.stdc.string;
     ubyte[4] temp;
     temp[0]=cast(ubyte)val;
     temp[1]=cast(ubyte)(val>>8);
     temp[2]=cast(ubyte)(val>>16);
     temp[3]=val>>24;
     dest=temp;
}

This works too, and it avoids memcpy, which CTFE doesn't support.
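
A small sketch of what CTFE compatibility buys here: the array-assignment version can be evaluated at compile time. The names `store32_le_ctfe`, `toLE` and `bytes` are hypothetical and only serve the illustration.

```d
// Same body as the array-assignment version above, under a hypothetical name.
void store32_le_ctfe(ref ubyte[4] dest, uint val)
{
    ubyte[4] temp;
    temp[0] = cast(ubyte) val;
    temp[1] = cast(ubyte)(val >> 8);
    temp[2] = cast(ubyte)(val >> 16);
    temp[3] = cast(ubyte)(val >> 24);
    dest = temp; // plain array assignment, no memcpy
}

// Hypothetical helper that returns the bytes so it can initialize a constant.
ubyte[4] toLE(uint val)
{
    ubyte[4] result;
    store32_le_ctfe(result, val);
    return result;
}

// Initializing an enum forces compile-time evaluation.
enum ubyte[4] bytes = toLE(0x11223344);
static assert(bytes[0] == 0x44 && bytes[3] == 0x11);

void main()
{
    // The same function still works at run time.
    ubyte[4] buf;
    store32_le_ctfe(buf, 0x11223344);
    assert(buf == bytes);
}
```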

Jun 04 2018

Kagamin <spam here.lot> writes:

On Saturday, 2 June 2018 at 11:44:40 UTC, Johan Engelen wrote:
 LLVM is not yet smart enough to optimize adjacent stores

I thought it tries to account for memory access errors: depending 
on how the processor checks memory access the first bytes might 
not be stored.

Jun 04 2018

Johan Engelen <j j.nl> writes:

On Monday, 4 June 2018 at 08:48:42 UTC, Kagamin wrote:
 On Saturday, 2 June 2018 at 11:44:40 UTC, Johan Engelen wrote:
 LLVM is not yet smart enough to optimize adjacent stores

 I thought it tries to account for memory access errors: 
 depending on how the processor checks memory access the first 
 bytes might not be stored.

Invalid accesses do not need to be taken into account, because 
they are UB. So it is perfectly legal to combine the stores. LLVM 
does do that in some cases, just not in this particular case. 
(GCC does optimize it well, by the way.)

- Johan

Jun 04 2018
