digitalmars.D - By ref and by pointer kills performance.

claptrap (79/79) Feb 12 2024 I was refactoring some code and changed a parameter from by

Richard (Rikki) Andrew Cattermole (21/21) Feb 12 2024 dmd having bad codegen here isn't a surprise, that is to be expected.

Johan (21/33) Feb 13 2024 I hope someone can find the link to some DConf talk (me or

Johan (3/22) Feb 13 2024 Found it:

ryuukk_ (4/4) Feb 13 2024 Wow, OOP people ruining performance for everybody, who would have

Richard (Rikki) Andrew Cattermole (4/8) Feb 13 2024 No, it isn't limited to classes, the same issue appears with reference

Richard (Rikki) Andrew Cattermole (14/51) Feb 13 2024 If you don't like the problem, change the problem.

Richard (Rikki) Andrew Cattermole (12/12) Feb 13 2024 After even more thinking, I'm reminded of an operator overload that I

Kagamin (13/22) Feb 16 2024 Shouldn't it be okay for immutable(uint*) to not access

Richard (Rikki) Andrew Cattermole (4/19) Feb 16 2024 That covers immutable, but not immutable that has become const.

Kagamin (4/5) Feb 16 2024 That's fine, const can't be restrict qualified unconditionally,

Richard (Rikki) Andrew Cattermole (33/39) Feb 16 2024 It is not fine.

Kagamin (3/5) Feb 16 2024 Then you won't be able to create a monitor for them.

Richard (Rikki) Andrew Cattermole (5/11) Feb 16 2024 Code will still call into the null monitor.
FeepingCreature (7/12) Feb 18 2024 I don't get this point. The monitor field can just be the single

FeepingCreature (9/28) Feb 18 2024 Okay I think I overread this, probably because it's absurd. Why

Patrick Schluter (5/27) Feb 13 2024 Is not a thread issue. The memory the pointers point to only

Richard (Rikki) Andrew Cattermole (15/17) Feb 13 2024 It can be a thread issue, but yes overlapping is another potential
Basile B. (10/43) Feb 13 2024 It's been proven that the main problem, i.e what defeated CSE (

Walter Bright (13/14) Feb 13 2024 I'm used to people saying that DMD doesn't do data flow analysis. It doe...

Richard (Rikki) Andrew Cattermole (13/32) Feb 13 2024 I'm aware that what dmd is doing is right. But it isn't using SIMD

Richard (Rikki) Andrew Cattermole (36/36) Feb 13 2024 I'll give an example for something I wrote up in a reply for a Reddit po...

Timon Gehr (41/61) Feb 14 2024 Well, my understanding is that storing a value that was read from a
Iain Buclaw (11/23) Feb 14 2024 This, here, is the answer.
max haughton (14/32) Feb 18 2024 It's probably true that not many can rattle off the legalese that

Richard (Rikki) Andrew Cattermole (35/40) Feb 18 2024 A couple of days ago I had a read of Joe Duffy's blog.

Timon Gehr (13/40) Feb 14 2024 An unsynchronized read of a location that is modified by another thread

Bruce Carneal (9/21) Feb 12 2024 To reuse the value the compiler would have to prove that the

jmh530 (23/32) Feb 13 2024 As a heads up, the LDC wiki page doesn't have restrict on it.

jmh530 (3/28) Feb 13 2024 Sorry, I should note that while `fillRestricted3` compiles, It
Bruce Carneal (16/52) Feb 13 2024 There was a little discussion a while back about @restrict being

jmh530 (3/10) Feb 13 2024 godbolt is definitely your friend when you can use it.

claptrap (4/14) Feb 13 2024 Ah OK makes sense. The restrict attribute will help with what I'm
Timon Gehr (3/6) Feb 14 2024 Not really, it only has to show that the value in the memory location

Bruce Carneal (4/10) Feb 14 2024 True. I jumped to the general case where I'd seen sub optimal

Patrick Schluter (5/46) Feb 13 2024 Yes, that's normal. The compiler cannot know from the declaration

deadalnix (5/10) Feb 13 2024 It's a major footgun without much benefit. In this case, one

claptrap <clap trap.com> writes:

I was refactoring some code and changed a parameter from by 
value, to by pointer, and saw the performance drop by 50%. This 
is a highly reduced example of what I found, but basically 
passing something into a function by reference or pointer seems 
to make the compilers (it affects both DMD and LDC) treat it as 
if its volatile and must be loaded from memory on every use. This 
also inhibits the auto-vectorization of code by LDC.

https://d.godbolt.org/z/oonq1drd9

```d
void fillBP(uint* value, uint* dest)
{
     dest[0] = *value;
     dest[1] = *value;
     dest[2] = *value;
     dest[3] = *value;
}
```
codegen DMD -->

                 push    RBP
                 mov     RBP,RSP
                 mov     ECX,[RSI]
                 mov     [RDI],ECX
                 mov     EDX,[RSI]
                 mov     4[RDI],EDX
                 mov     R8D,[RSI]
                 mov     8[RDI],R8D
                 mov     R9D,[RSI]
                 mov     0Ch[RDI],R9D
                 pop     RBP
                 ret

codgen LDC -->

         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 4], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 8], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 12], eax
         ret
```d
void fillBV(uint value, uint* dest)
{
     dest[0] = value;
     dest[1] = value;
     dest[2] = value;
     dest[3] = value;
}
```
codgen DMD -->

                 push    RBP
                 mov     RBP,RSP
                 mov     [RDI],ESI
                 mov     4[RDI],ESI
                 mov     8[RDI],ESI
                 mov     0Ch[RDI],ESI
                 pop     RBP
                 ret

codegen LDC -->

         movd    xmm0, edi
         pshufd  xmm0, xmm0, 0
         movdqu  xmmword ptr [rsi], xmm0
         ret

Interestingly if you do this...
```d
void fillBP(uint* value, uint* dest)
{
     uint tmp = *value;
     dest[0] = tmp;
     dest[1] = tmp;
     dest[2] = tmp;
     dest[3] = tmp;
}
```
You get identical code to the by value versions. (except the load 
from memory)

I'm not a compiler guy so maybe there's some rationale for this 
that I don't know but it seems like the compiler should be able 
to read "*value" once and cache it.

Feb 12 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

dmd having bad codegen here isn't a surprise, that is to be expected.

Now for ldc:

```d
void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
}
```

I expected that to not do the extra loads, but it did.

```d
void fillBP(immutable(uint*) value, uint* dest) {
	dest[0 .. 4][] = *value;
}
```

And that certainly should not be doing it either.
Even if it wasn't immutable.

For your code, because it is not immutable and therefore can be changed 
externally on another thread, the fact that the compiler has to do the 
loads is correct. This isn't a bug.

Feb 12 2024

Johan <j j.nl> writes:

On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be 
 expected.

 Now for ldc:

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
 }
 ```

 I expected that to not do the extra loads, but it did.

I hope someone can find the link to some DConf talk (me or 
Andrei) or forum post where I talk about why LDC assumes that 
`immutable(uint*)` points to _mutable_ (nota bene) data. The 
reason is the mutable thread synchronization field in immutable 
class variable storage (`__monitor`), combined with casting an 
immutable class to an array of immutable bytes.

Side-effects in-between immutable(uint*) lookup could run into a 
synchronization event on the immutable data (i.e. mutating it).
In the case of `fillBP` there are no side-effects possible 
between the reads, so it appears that indeed the optimization 
could be done. But a different thread might write to the data. I 
don't know how that data-race is then defined...
For the general case, side-effects are possible (e.g. a function 
call) so it is not possible to simply assume that `immutable` 
reference arguments never alias to other reference arguments; 
this complicates implementing the desired optimization.
I'm not saying it is impossible, it's just extra effort (and 
proof is needed).

-Johan

Feb 13 2024

Johan <j j.nl> writes:

On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote:
 On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be 
 expected.

 Now for ldc:

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
 }
 ```

 I expected that to not do the extra loads, but it did.

 I hope someone can find the link to some DConf talk (me or 
 Andrei)

Found it: 
https://youtu.be/-0jcE9B5kjs?list=PL9a7lgBtSQb-YCVj96v5vn1tEXPjKOPuB&t=403

Feb 13 2024

ryuukk_ <ryuukk.dev gmail.com> writes:

Wow, OOP people ruining performance for everybody, who would have 
thought..

That makes me worried now.. -betterC doesn't seem to change 
anything about optimization, or rather, lack of optimization..

Feb 13 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 14/02/2024 4:20 AM, ryuukk_ wrote:
 Wow, OOP people ruining performance for everybody, who would have thought..
 
 That makes me worried now.. -betterC doesn't seem to change anything 
 about optimization, or rather, lack of optimization..

No, it isn't limited to classes, the same issue appears with reference 
counting.

Immutable simply isn't providing the guarantee of immutable.

Feb 13 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 14/02/2024 2:30 AM, Johan wrote:
 On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be expected.

 Now for ldc:

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
 }
 ```

 I expected that to not do the extra loads, but it did.

 
 I hope someone can find the link to some DConf talk (me or Andrei) or 
 forum post where I talk about why LDC assumes that `immutable(uint*)` 
 points to _mutable_ (nota bene) data. The reason is the mutable thread 
 synchronization field in immutable class variable storage (`__monitor`), 
 combined with casting an immutable class to an array of immutable bytes.
 
 Side-effects in-between immutable(uint*) lookup could run into a 
 synchronization event on the immutable data (i.e. mutating it).
 In the case of `fillBP` there are no side-effects possible between the 
 reads, so it appears that indeed the optimization could be done. But a 
 different thread might write to the data. I don't know how that 
 data-race is then defined...
 For the general case, side-effects are possible (e.g. a function call) 
 so it is not possible to simply assume that `immutable` reference 
 arguments never alias to other reference arguments; this complicates 
 implementing the desired optimization.
 I'm not saying it is impossible, it's just extra effort (and proof is 
 needed).
 
 -Johan

If you don't like the problem, change the problem.

After thinking about it, I can agree for D classes immutable can't be 
known if it truely applies or not. A lock could still be taken in const 
code.

So the problem is:

For a type that is not a D class (COM, extern(C++) are not included in 
that definition), mark as immutable to LLVM IR so optimizations can be 
used wrt. immutable.

After all, to call an immutable this, method it must also be marked as 
const or immutable. The this pointer won't be allowed to be mutated 
unless someone did something bad (again, not the compiler's fault that 
the user went against it).

I don't think much thought needs to go into this.

Feb 13 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

After even more thinking, I'm reminded of an operator overload that I 
want for reference counting.

``opGoingIntoROM``.

The reason for reference counting is obvious, to turn off reference 
counting.

But it also applies here.

Having this knowledge on the monitor could turn off the actual 
lock/unlock mutation.

Okay this is compelling enough to report it.

https://issues.dlang.org/show_bug.cgi?id=24393

This is no longer an "I want". It's a genuine type guarantee issue of 
something we currently support.

Feb 13 2024

Kagamin <spam here.lot> writes:

On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote:
 I hope someone can find the link to some DConf talk (me or 
 Andrei) or forum post where I talk about why LDC assumes that 
 `immutable(uint*)` points to _mutable_ (nota bene) data. The 
 reason is the mutable thread synchronization field in immutable 
 class variable storage (`__monitor`), combined with casting an 
 immutable class to an array of immutable bytes.

 Side-effects in-between immutable(uint*) lookup could run into 
 a synchronization event on the immutable data (i.e. mutating 
 it).

Shouldn't it be okay for immutable(uint*) to not access 
`__monitor`? I think, `__monitor` should work outside of type 
safety, like reference counting, then immutability won't matter.

```
shared(Object.Monitor*) getMonitor(immutable Object o)
{
	size_t addr=cast(size_t)cast(void*)&o.__monitor;
	return cast(shared(Object.Monitor*))(addr-20+20);
}
```

Is it legal for optimizer to cast away `restrict` qualifier like 
this?

Feb 16 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 16/02/2024 10:34 PM, Kagamin wrote:
 On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote:
 
     I hope someone can find the link to some DConf talk (me or Andrei)
     or forum post where I talk about why LDC assumes that
     |immutable(uint*)| points to /mutable/ (nota bene) data. The reason
     is the mutable thread synchronization field in immutable class
     variable storage (|__monitor|), combined with casting an immutable
     class to an array of immutable bytes.
 
     Side-effects in-between immutable(uint*) lookup could run into a
     synchronization event on the immutable data (i.e. mutating it).
 
 Shouldn't it be okay for immutable(uint*) to not access |__monitor|? I 
 think, |__monitor| should work outside of type safety, like reference 
 counting, then immutability won't matter.

That covers immutable, but not immutable that has become const.

The only way I can see it working reliably is to simply turn off the 
mutex when it is immutable.

Feb 16 2024

Kagamin <spam here.lot> writes:

On Friday, 16 February 2024 at 09:56:50 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 That covers immutable, but not immutable that has become const.

That's fine, const can't be restrict qualified unconditionally, 
but immutable can.

Feb 16 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 16/02/2024 11:23 PM, Kagamin wrote:
 On Friday, 16 February 2024 at 09:56:50 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 That covers immutable, but not immutable that has become const.

 
 That's fine, const can't be restrict qualified unconditionally, but 
 immutable can.

It is not fine.

Immutable objects can be in read only memory, since nothing is allowed 
to modify the contents process wide.

Write to it, and see how the program segfaults.

Just to prove a point of what immutable enables a compiler to do:

```d
import std.stdio;
import core.sys.posix.sys.mman;
import core.sys.posix.unistd : sysconf, _SC_PAGESIZE;
import core.sys.posix.stdlib;

void main() {
     auto pagesize = sysconf(_SC_PAGESIZE);
     int* temp;
     posix_memalign(cast(void**)&temp, pagesize, 4 * pagesize);
     data = temp[0 .. pagesize];

     data[0 .. 2] = [1, 2];
     mprotect(data.ptr, int.sizeof * data.length, PROT_READ);
     writeln(data[0 .. 2]);

     data[0 .. 2] = [3, 4];
     writeln(data[0 .. 2]);
}

static int[] data;
```

```
[1, 2]
Error: program killed by signal 11
```

Read only memory like this is useful to kernels, it enables them to map 
the same ram ranges to multiple processes.

It provides a very strong guarantee, well beyond const and right now 
that guarantee isn't honored in the language and needs to be fixed 
(otherwise why have it at all?).

Feb 16 2024

Kagamin <spam here.lot> writes:

On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 Immutable objects can be in read only memory, since nothing is 
 allowed to modify the contents process wide.

Then you won't be able to create a monitor for them.

Feb 16 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 17/02/2024 4:12 AM, Kagamin wrote:
 On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) Andrew 
 Cattermole wrote:
 Immutable objects can be in read only memory, since nothing is allowed 
 to modify the contents process wide.

 
 Then you won't be able to create a monitor for them.

Code will still call into the null monitor.

Or we could have a way to disable it at runtime instead by setting a 
field to change behavior to cover all use cases (which is what I 
proposed ;) ).

Feb 16 2024

FeepingCreature <feepingcreature gmail.com> writes:

On Friday, 16 February 2024 at 15:12:58 UTC, Kagamin wrote:
 On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 Immutable objects can be in read only memory, since nothing is 
 allowed to modify the contents process wide.

 Then you won't be able to create a monitor for them.

I don't get this point. The monitor field can just be the single 
pointer that is not transitively immutable? I mean. The compiler 
knows *exactly* when you access the monitor field. It's never 
accessed manually! This is a pointer access that can be generated 
at exactly line of code in the compiler. Just drop immutable 
there.

Feb 18 2024

FeepingCreature <feepingcreature gmail.com> writes:

On Sunday, 18 February 2024 at 13:41:42 UTC, FeepingCreature 
wrote:
 On Friday, 16 February 2024 at 15:12:58 UTC, Kagamin wrote:
 On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 Immutable objects can be in read only memory, since nothing 
 is allowed to modify the contents process wide.

 Then you won't be able to create a monitor for them.

 I don't get this point. The monitor field can just be the 
 single pointer that is not transitively immutable? I mean. The 
 compiler knows *exactly* when you access the monitor field. 
 It's never accessed manually! This is a pointer access that can 
 be generated at exactly line of code in the compiler. Just drop 
 immutable there.

 I hope someone can find the link to some DConf talk (me or 
 Andrei) or forum post where I talk about why LDC assumes that 
 immutable(uint*) points to mutable (nota bene) data. The reason 
 is the mutable thread synchronization field in immutable class 
 variable storage (__monitor), combined with casting an 
 immutable class to an array of immutable bytes.

Okay I think I overread this, probably because it's absurd. Why 
is this idiom worth protecting? You're hardcasting *anyways*, 
it's forbidden in  safe *anyways*, just cast __monitor to uint* 
instead of immutable(uint*). If you're interacting with classes 
at such a deep level, asking you to remember that there's a 
mutable field in there even in immutable classes doesn't seem 
like a big ask.

Feb 18 2024

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be 
 expected.

 Now for ldc:

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
 }
 ```

 I expected that to not do the extra loads, but it did.

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
 	dest[0 .. 4][] = *value;
 }
 ```

 And that certainly should not be doing it either.
 Even if it wasn't immutable.

 For your code, because it is not immutable and therefore can be 
 changed externally on another thread, the fact that the 
 compiler has to do the loads is correct. This isn't a bug.

Is not a thread issue. The memory the pointers point to only 
needs to overlap and the loads are required to get the "right" 
result.

Feb 13 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 14/02/2024 4:14 AM, Patrick Schluter wrote:
 Is not a thread issue. The memory the pointers point to only needs to 
 overlap and the loads are required to get the "right" result.

It can be a thread issue, but yes overlapping is another potential 
situation where it can get over written. I prefer talking about threads, 
because they bring an entirely unknown element to the discussion and 
shutdown any localized assumptions.

In the case of immutable it should not be possible for immutable memory 
to be changed unless someone did something bad.

Immutable is a very strong guarantee at the process level, that there 
are no mutable pointers pointing at immutable memory.

Therefore the compiler is free to make that optimization without 
concern. If it breaks, it's the users fault for misusing the language.

"The second way is to cast data to immutable. When doing so, it is up to 
the programmer to ensure that any mutable references to the same data 
are not used to modify the data after the cast."

https://dlang.org/spec/const3.html#creating_immutable_data

Feb 13 2024

Basile B. <b2.temp gmx.com> writes:

On Tuesday, 13 February 2024 at 15:14:06 UTC, Patrick Schluter 
wrote:
 On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) 
 Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be 
 expected.

 Now for ldc:

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0] = *value;
      dest[1] = *value;
      dest[2] = *value;
      dest[3] = *value;
 }
 ```

 I expected that to not do the extra loads, but it did.

 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
 	dest[0 .. 4][] = *value;
 }
 ```

 And that certainly should not be doing it either.
 Even if it wasn't immutable.

 For your code, because it is not immutable and therefore can 
 be changed externally on another thread, the fact that the 
 compiler has to do the loads is correct. This isn't a bug.

 Is not a thread issue. The memory the pointers point to only 
 needs to overlap and the loads are required to get the "right" 
 result.

It's been proven that the main problem, i.e what defeated CSE ( 
the Common Sub Expression optim pass), is the possible overlap 
between the two parameter.

However there'is still a risk on the value, i.e it can be 
modified by another thread, like mentioned Johan IIRC. The 
destination of the pointer you pass as "value" might change 
between one of the fourth assignment.

To be frank I thought this was the problem.

Feb 13 2024

Walter Bright <newshound2 digitalmars.com> writes:

On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be expected.

I'm used to people saying that DMD doesn't do data flow analysis. It does. In 
fact, it is based on my OptimumC, which was the first C compiler on DOS to do 
DFA back in the 1980s.

This isn't a case of buggy DFA. It's a case of doing DFA correctly.

The issue is pointer aliasing. A pointer can point to anything, including const 
data. Therefore, storing through a pointer can alter any value that is
reachable 
via a pointer. Therefore, storing through a pointer invalidates any cached
value 
already read.

This is what you're seeing.

C99 tried to address this with __restrict, but few people use it or understand 
it. D didn't bother with it because people will inevitably misuse __restrict
and 
get their data mysteriously corrupted.

Feb 13 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 14/02/2024 5:49 PM, Walter Bright wrote:
 On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be expected.

 
 I'm used to people saying that DMD doesn't do data flow analysis. It 
 does. In fact, it is based on my OptimumC, which was the first C 
 compiler on DOS to do DFA back in the 1980s.
 
 This isn't a case of buggy DFA. It's a case of doing DFA correctly.
 
 The issue is pointer aliasing. A pointer can point to anything, 
 including const data. Therefore, storing through a pointer can alter any 
 value that is reachable via a pointer. Therefore, storing through a 
 pointer invalidates any cached value already read.
 
 This is what you're seeing.
 
 C99 tried to address this with __restrict, but few people use it or 
 understand it. D didn't bother with it because people will inevitably 
 misuse __restrict and get their data mysteriously corrupted.

I'm aware that what dmd is doing is right. But it isn't using SIMD 
either, so there is some performance left on the table.

A backend today vs 15 years ago is very different in terms of what they 
can do. Let alone one that last saw significant active development 30 
years ago.

I suggest that you have a chat with Bruce about this during the next 
meetup. We've been trying to explain to you just how backends have 
changed over the years from a users perspective. The ones today can only 
be described with one word: magical.

But I will say this: 20 years ago you needed inline assembly, 10 years 
ago you needed intrinsics, today you may not need to do anything special 
in D, although will in C++.

Feb 13 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

I'll give an example for something I wrote up in a reply for a Reddit post:

Using ldc I was able to get this code to do 8.8 million vectors in 
around 1 second.

```d
module strategy;

int run(byte[] a, byte[] b) {
     int sum = 0;

     for(size_t i; i < a.length; i += 64) {
         static foreach(j; 0 .. 64) {
             sum += cast(short)a[i + j] * cast(short)b[i + j];
         }
     }

     return sum;
}
```

To achieve the same thing using Go, they had to write assembly (see 
DotVNNI example).

https://sourcegraph.com/blog/slow-to-simd

LDC is able to get to almost the maximum a CPU can do, with what looks 
to be almost naive looking code. DMD cannot compete with this, nor 
should it aim to.

Here is what actual naive code looks like for this problem:

```d
module strategy;

int run(byte[] a, byte[] b) {
     assert(a.length == b.length);

     int sum = 0;

     foreach(i; 0 .. a.length) {
         sum += cast(short)a[i] * cast(short)b[i];
     }

     return sum;
}
```

7 million vectors per second.

They had to write assembly to get that speed. I got it, without doing 
anything special...

Feb 13 2024

Timon Gehr <timon.gehr gmx.ch> writes:

On 14.02.24 05:49, Walter Bright wrote:
 On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be expected.

 
 I'm used to people saying that DMD doesn't do data flow analysis. It 
 does. In fact, it is based on my OptimumC, which was the first C 
 compiler on DOS to do DFA back in the 1980s.
 
 This isn't a case of buggy DFA. It's a case of doing DFA correctly.
 
 The issue is pointer aliasing. A pointer can point to anything, 
 including const data. Therefore, storing through a pointer can alter any 
 value that is reachable via a pointer. Therefore, storing through a 
 pointer invalidates any cached value already read.
 
 This is what you're seeing.
 ...

Well, my understanding is that storing a value that was read from a 
`uint*` cannot invalidate a value read from the same `uint*` unless 
unaligned reads/writes are allowed or the pointer can point to itself.

Are unaligned reads/writes allowed in D? Can a `uint*` point to itself? 
Is there something else I missed?

How to make the following assertion fail by filling the `...` holes with 
code without invoking UB?

```d
void fillBP1(uint* value, uint* dest)pure{
     dest[0] = *value;
     dest[1] = *value;
     dest[2] = *value;
     dest[3] = *value;
}

void fillBP2(uint* value, uint* dest)pure{
     auto tmp = *value;
     dest[0] = tmp;
     dest[1] = tmp;
     dest[2] = tmp;
     dest[3] = tmp;
}

uint[4] distinguish(alias f)()pure{
     ...; // TODO
     uint[4] dest = ...; // TODO
     uint* value = ...;  // TODO
     f(value,dest.ptr);
     return dest;
}

void main(){
     import std.stdio;
     assert(distinguish!fillBP1()==distinguish!fillBP2());
     writeln(test!fillBP1);
     writeln(test!fillBP2);
}
```

 C99 tried to address this with __restrict, but few people use it or 
 understand it. D didn't bother with it because people will inevitably 
 misuse __restrict and get their data mysteriously corrupted.
 

My understanding is the case in the OB does not require `__restrict`. 
`value` can point to any of the 4 fields of `dest`, it is okay to only 
read it once even in that case. Of course, maybe the data flow analysis 
is more conservative than that, but I wouldn't necessarily point to this 
case as one where data flow analysis is working "correctly".

Feb 14 2024

Iain Buclaw <ibuclaw gdcproject.org> writes:

On Wednesday, 14 February 2024 at 04:49:28 UTC, Walter Bright 
wrote:
 This isn't a case of buggy DFA. It's a case of doing DFA 
 correctly.

 The issue is pointer aliasing. A pointer can point to anything, 
 including const data. Therefore, storing through a pointer can 
 alter any value that is reachable via a pointer. Therefore, 
 storing through a pointer invalidates any cached value already 
 read.

 This is what you're seeing.

 C99 tried to address this with __restrict, but few people use 
 it or understand it. D didn't bother with it because people 
 will inevitably misuse __restrict and get their data 
 mysteriously corrupted.

This, here, is the answer.

Moral of the story, use (gdc, ldc) vendor attributes if you 
really care.

```
import core.attribute;

void fillBP( restrict uint* dest, uint* value)
{
     //...
```

Feb 14 2024

max haughton <maxhaton gmail.com> writes:

On Wednesday, 14 February 2024 at 04:49:28 UTC, Walter Bright 
wrote:
 On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be 
 expected.

 I'm used to people saying that DMD doesn't do data flow 
 analysis. It does. In fact, it is based on my OptimumC, which 
 was the first C compiler on DOS to do DFA back in the 1980s.

 This isn't a case of buggy DFA. It's a case of doing DFA 
 correctly.

 The issue is pointer aliasing. A pointer can point to anything, 
 including const data. Therefore, storing through a pointer can 
 alter any value that is reachable via a pointer. Therefore, 
 storing through a pointer invalidates any cached value already 
 read.

 This is what you're seeing.

 C99 tried to address this with __restrict, but few people use 
 it or understand it. D didn't bother with it because people 
 will inevitably misuse __restrict and get their data 
 mysteriously corrupted.


It's probably true that not many can rattle off the legalese that 
defines restrict semantics correctly (I know I can't, although 
have to say I've never been bitten by it), but within a roughly 
C-like programming model its where a lot of performance comes 
from when at the bleeding edge so I don't know if not including 
it at all is particularly good either.

Thinking about it, it's actually probably mainly an issue for 
library users rather than the author who actually wrote restrict 
to begin with - I have wondered in the past whether some kind of 
`isolated` style type qualifier could actually help here i.e. one 
could provide the compiler good information and make it very 
obvious to the programmer.

Feb 18 2024

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 19/02/2024 2:58 AM, max haughton wrote:
 Thinking about it, it's actually probably mainly an issue for library 
 users rather than the author who actually wrote restrict to begin with - 
 I have wondered in the past whether some kind of |isolated| style type 
 qualifier could actually help here i.e. one could provide the compiler 
 good information and make it very obvious to the programmer.

A couple of days ago I had a read of Joe Duffy's blog.

I was very impressed with what he had to say, enough to immediately want 
to buy his concurrency book and start adding his blog articles as 
references to my existing proposals.

He was one of the developers Midori and had a lot to say about exception 
mechanisms and concurrency. Turns out we both have some similar 
conclusions, only he gained his about 15 years prior to me (which may or 
may not have contributed to my impression).

The concept of isolated was one such concept that he talked about in 
regards to concurrency and its safety. I had not thought about it, that 
it would be a useful tool for making memory ownership safe in this context.

But it also applies here consider:

```d
isolated(int[]) a, b;
func(a, b);

void func(scope int[] c, scope int[] d)  safe {}
```

Should that be callable with a and b? Yes. The graph doesn't matter, it 
doesn't escape.

Now what about the case where it does?

```d
int[] func(return scope int[] c, scope int[] d)  safe {
	return c;
}
```

Again yes. Only one parameter contributes to return, the return would 
therefore be part of ``a``'s graph.

Another interesting possibility is that if one parameter is isolated, 
that means the second parameter ``d`` is not in ``c``'s graph, because 
isolated is a form of transfer of ownership, except because ``c`` is 
scope, the transfer doesn't occur from the outside.

As a result both ``c`` and ``d`` would have `` restrict`` added.

This is on my todo list to research, of course how exactly it plays out 
with DIP1000 and the this pointer is unknown to me at this stage.

Feb 18 2024

Timon Gehr <timon.gehr gmx.ch> writes:

On 13.02.24 04:31, Richard (Rikki) Andrew Cattermole wrote:
 dmd having bad codegen here isn't a surprise, that is to be expected.
 
 Now for ldc:
 
 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
       dest[0] = *value;
       dest[1] = *value;
       dest[2] = *value;
       dest[3] = *value;
 }
 ```
 
 I expected that to not do the extra loads, but it did.
 
 ```d
 void fillBP(immutable(uint*) value, uint* dest) {
      dest[0 .. 4][] = *value;
 }
 ```
 
 And that certainly should not be doing it either.
 Even if it wasn't immutable.
 
 For your code, because it is not immutable and therefore can be changed 
 externally on another thread, the fact that the compiler has to do the 
 loads is correct. This isn't a bug.

An unsynchronized read of a location that is modified by another thread 
is a race condition and a race condition is UB, so this scenario is not 
an optimization blocker.

I think with this specific code reading back the pointers is 
unnecessary, because assigning `*value` to itself does not produce an 
observable difference in behavior given that it is also read and 
assigned to a different location at least once. (Unless LDC defines the 
behavior of unaligned reads/writes?)

Of course, if the code instead assigned `*value+1` to the entries of 
`dest`, the reads would become necessary.

In general, I think LDC and GDC are pretty conservative about optimizing 
based on UB, which I think has benefits for security.

Feb 14 2024

Bruce Carneal <bcarneal gmail.com> writes:

On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:
 I was refactoring some code and changed a parameter from by 
 value, to by pointer, and saw the performance drop by 50%. This 
 is a highly reduced example of what I found, but basically 
 passing something into a function by reference or pointer seems 
 to make the compilers (it affects both DMD and LDC) treat it as 
 if its volatile and must be loaded from memory on every use. 
 This also inhibits the auto-vectorization of code by LDC.

 https://d.godbolt.org/z/oonq1drd9

 ...

 I'm not a compiler guy so maybe there's some rationale for this 
 that I don't know but it seems like the compiler should be able 
 to read "*value" once and cache it.

To reuse the value the compiler would have to prove that the 
memory locations do not overlap.  FORTRAN does not have this 
problem, neither does ldc once you take responsibility for 
non-overlap with the  restrict attribute as seen here:

https://d.godbolt.org/z/z9vYndWqP

When loops are involved between potentially overlapping indexed 
arrays I've seen ldc go through the proof and do two versions of 
the code with a branch.

Feb 12 2024

jmh530 <john.michael.hall gmail.com> writes:

On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:
 [snip]

 To reuse the value the compiler would have to prove that the 
 memory locations do not overlap.  FORTRAN does not have this 
 problem, neither does ldc once you take responsibility for 
 non-overlap with the  restrict attribute as seen here:

 https://d.godbolt.org/z/z9vYndWqP

 When loops are involved between potentially overlapping indexed 
 arrays I've seen ldc go through the proof and do two versions 
 of the code with a branch.

As a heads up, the LDC wiki page doesn't have restrict on it.

https://wiki.dlang.org/LDC-specific_language_changes

Does LDC's  restrict only work with pointers directly and not 
slices? `fillRestricted2` doesn't compile (it only fails because 
`value` is a slice, not because `dest` is one. But 
`fillRestricted3` compiles just fine.

```d
void fillRestricted2( restrict uint[] value, uint[] dest)
{
     dest[0] = value[0];
     dest[1] = value[1];
     dest[2] = value[2];
     dest[3] = value[3];
}

void fillRestricted3( restrict uint* value, uint[] dest)
{
     dest[0] = value[0];
     dest[1] = value[1];
     dest[2] = value[2];
     dest[3] = value[3];
}

```

Feb 13 2024

jmh530 <john.michael.hall gmail.com> writes:

On Tuesday, 13 February 2024 at 15:40:23 UTC, jmh530 wrote:
 On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal 
 wrote:
 [...]

 As a heads up, the LDC wiki page doesn't have restrict on it.

 https://wiki.dlang.org/LDC-specific_language_changes

 Does LDC's  restrict only work with pointers directly and not 
 slices? `fillRestricted2` doesn't compile (it only fails 
 because `value` is a slice, not because `dest` is one. But 
 `fillRestricted3` compiles just fine.

 ```d
 void fillRestricted2( restrict uint[] value, uint[] dest)
 {
     dest[0] = value[0];
     dest[1] = value[1];
     dest[2] = value[2];
     dest[3] = value[3];
 }

 void fillRestricted3( restrict uint* value, uint[] dest)
 {
     dest[0] = value[0];
     dest[1] = value[1];
     dest[2] = value[2];
     dest[3] = value[3];
 }

 ```

Sorry, I should note that while `fillRestricted3` compiles, It 
has similar codegen as `fillBP`.

Feb 13 2024

Bruce Carneal <bcarneal gmail.com> writes:

On Tuesday, 13 February 2024 at 15:40:23 UTC, jmh530 wrote:
 On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal 
 wrote:
 [snip]

 To reuse the value the compiler would have to prove that the 
 memory locations do not overlap.  FORTRAN does not have this 
 problem, neither does ldc once you take responsibility for 
 non-overlap with the  restrict attribute as seen here:

 https://d.godbolt.org/z/z9vYndWqP

 When loops are involved between potentially overlapping 
 indexed arrays I've seen ldc go through the proof and do two 
 versions of the code with a branch.

 As a heads up, the LDC wiki page doesn't have restrict on it.

 https://wiki.dlang.org/LDC-specific_language_changes

 Does LDC's  restrict only work with pointers directly and not 
 slices? `fillRestricted2` doesn't compile (it only fails 
 because `value` is a slice, not because `dest` is one. But 
 `fillRestricted3` compiles just fine.

 ```d
 void fillRestricted2( restrict uint[] value, uint[] dest)
 {
     dest[0] = value[0];
     dest[1] = value[1];
     dest[2] = value[2];
     dest[3] = value[3];
 }

 void fillRestricted3( restrict uint* value, uint[] dest)
 {
     dest[0] = value[0];
     dest[1] = value[1];
     dest[2] = value[2];
     dest[3] = value[3];
 }

 ```

There was a little discussion a while back about  restrict being 
generalized to slices but, as you note, it's not there yet.

I typically use  restrict on slice .ptr fields passed into nested 
functions within an  trusted function.  Yes, that's a little 
clumsy but it lets you do bounds checks and the like before 
dropping into the hot loops.

If you're willing to tolerate the code bloat, and your code is 
simple, ldc will do the proof and auto-vec things for you.  
 restrict also seems to keep the auto-vec optimizer engaged in 
not-strictly-vanilla formulations.

If you have multiple sources and a single destination array note 
that you only need  restrict on the destination.  You're just 
trying to give the back end a clear view of dependencies.

As always, godbolt is your friend when optimizing known 
bottlenecks of this sort.

Feb 13 2024

jmh530 <john.michael.hall gmail.com> writes:

On Tuesday, 13 February 2024 at 15:56:59 UTC, Bruce Carneal wrote:
 [snip]

 There was a little discussion a while back about  restrict 
 being generalized to slices but, as you note, it's not there 
 yet.

Thanks for the detailed response!

 [snip]

 As always, godbolt is your friend when optimizing known 
 bottlenecks of this sort.

godbolt is definitely your friend when you can use it.

Feb 13 2024

claptrap <clap trap.com> writes:

On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:
 On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:

 To reuse the value the compiler would have to prove that the 
 memory locations do not overlap.  FORTRAN does not have this 
 problem, neither does ldc once you take responsibility for 
 non-overlap with the  restrict attribute as seen here:

 https://d.godbolt.org/z/z9vYndWqP

 When loops are involved between potentially overlapping indexed 
 arrays I've seen ldc go through the proof and do two versions 
 of the code with a branch.

Ah OK makes sense. The restrict attribute will help with what I'm 
doing.

thanks.

Feb 13 2024

Timon Gehr <timon.gehr gmx.ch> writes:

On 13.02.24 07:02, Bruce Carneal wrote:
 
 To reuse the value the compiler would have to prove that the memory 
 locations do not overlap.

Not really, it only has to show that the value in the memory location 
has not changed since the last read.

Feb 14 2024

Bruce Carneal <bcarneal gmail.com> writes:

On Wednesday, 14 February 2024 at 11:46:24 UTC, Timon Gehr wrote:
 On 13.02.24 07:02, Bruce Carneal wrote:
 
 To reuse the value the compiler would have to prove that the 
 memory locations do not overlap.

 Not really, it only has to show that the value in the memory 
 location has not changed since the last read.

True.  I jumped to the general case where I'd seen sub optimal 
code gen.  I should have commented on both the given special, 
stationary, case and the more general.

Feb 14 2024

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:
 I was refactoring some code and changed a parameter from by 
 value, to by pointer, and saw the performance drop by 50%. This 
 is a highly reduced example of what I found, but basically 
 passing something into a function by reference or pointer seems 
 to make the compilers (it affects both DMD and LDC) treat it as 
 if its volatile and must be loaded from memory on every use. 
 This also inhibits the auto-vectorization of code by LDC.

 https://d.godbolt.org/z/oonq1drd9

 ```d
 void fillBP(uint* value, uint* dest)
 {
     dest[0] = *value;
     dest[1] = *value;
     dest[2] = *value;
     dest[3] = *value;
 }
 ```
 codegen DMD -->

                 push    RBP
                 mov     RBP,RSP
                 mov     ECX,[RSI]
                 mov     [RDI],ECX
                 mov     EDX,[RSI]
                 mov     4[RDI],EDX
                 mov     R8D,[RSI]
                 mov     8[RDI],R8D
                 mov     R9D,[RSI]
                 mov     0Ch[RDI],R9D
                 pop     RBP
                 ret

 codgen LDC -->

         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 4], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 8], eax
         mov     eax, dword ptr [rdi]
         mov     dword ptr [rsi + 12], eax
         ret
 ```d

Yes, that's normal. The compiler cannot know from the declaration 
alone if your pointer overlaps. In C you can declare the pointers 
with restrict which will tell the compiler that the pointers 
don't overlap. I don't know why D doesn't support restrict.

Feb 13 2024

deadalnix <deadalnix gmail.com> writes:

On Tuesday, 13 February 2024 at 15:11:58 UTC, Patrick Schluter 
wrote:
 Yes, that's normal. The compiler cannot know from the 
 declaration alone if your pointer overlaps. In C you can 
 declare the pointers with restrict which will tell the compiler 
 that the pointers don't overlap. I don't know why D doesn't 
 support restrict.

It's a major footgun without much benefit. In this case, one 
could load the integers in a local variable and use the value 
from there, and you'd get teh same effet.

Feb 13 2024

D Programming

C/C++ Programming

Other

digitalmars.D - By ref and by pointer kills performance.