digitalmars.D - By ref and by pointer kills performance.
- claptrap (79/79) Feb 12 I was refactoring some code and changed a parameter from by
- Richard (Rikki) Andrew Cattermole (21/21) Feb 12 dmd having bad codegen here isn't a surprise, that is to be expected.
- Johan (21/33) Feb 13 I hope someone can find the link to some DConf talk (me or
- Johan (3/22) Feb 13 Found it:
- ryuukk_ (4/4) Feb 13 Wow, OOP people ruining performance for everybody, who would have
- Richard (Rikki) Andrew Cattermole (4/8) Feb 13 No, it isn't limited to classes, the same issue appears with reference
- Richard (Rikki) Andrew Cattermole (14/51) Feb 13 If you don't like the problem, change the problem.
- Richard (Rikki) Andrew Cattermole (12/12) Feb 13 After even more thinking, I'm reminded of an operator overload that I
- Kagamin (13/22) Feb 16 Shouldn't it be okay for immutable(uint*) to not access
- Richard (Rikki) Andrew Cattermole (4/19) Feb 16 That covers immutable, but not immutable that has become const.
- Kagamin (4/5) Feb 16 That's fine, const can't be restrict qualified unconditionally,
- Richard (Rikki) Andrew Cattermole (33/39) Feb 16 It is not fine.
- Kagamin (3/5) Feb 16 Then you won't be able to create a monitor for them.
- Richard (Rikki) Andrew Cattermole (5/11) Feb 16 Code will still call into the null monitor.
- FeepingCreature (7/12) Feb 18 I don't get this point. The monitor field can just be the single
- FeepingCreature (9/28) Feb 18 Okay I think I overread this, probably because it's absurd. Why
- Patrick Schluter (5/27) Feb 13 Is not a thread issue. The memory the pointers point to only
- Richard (Rikki) Andrew Cattermole (15/17) Feb 13 It can be a thread issue, but yes overlapping is another potential
- Basile B. (10/43) Feb 13 It's been proven that the main problem, i.e what defeated CSE (
- Walter Bright (13/14) Feb 13 I'm used to people saying that DMD doesn't do data flow analysis. It doe...
- Richard (Rikki) Andrew Cattermole (13/32) Feb 13 I'm aware that what dmd is doing is right. But it isn't using SIMD
- Richard (Rikki) Andrew Cattermole (36/36) Feb 13 I'll give an example for something I wrote up in a reply for a Reddit po...
- Timon Gehr (41/61) Feb 14 Well, my understanding is that storing a value that was read from a
- Iain Buclaw (11/23) Feb 14 This, here, is the answer.
- max haughton (14/32) Feb 18 It's probably true that not many can rattle off the legalese that
- Richard (Rikki) Andrew Cattermole (35/40) Feb 18 A couple of days ago I had a read of Joe Duffy's blog.
- Timon Gehr (13/40) Feb 14 An unsynchronized read of a location that is modified by another thread
- Bruce Carneal (9/21) Feb 12 To reuse the value the compiler would have to prove that the
- jmh530 (23/32) Feb 13 As a heads up, the LDC wiki page doesn't have restrict on it.
- jmh530 (3/28) Feb 13 Sorry, I should note that while `fillRestricted3` compiles, It
- Bruce Carneal (16/52) Feb 13 There was a little discussion a while back about @restrict being
- jmh530 (3/10) Feb 13 godbolt is definitely your friend when you can use it.
- claptrap (4/14) Feb 13 Ah OK makes sense. The restrict attribute will help with what I'm
- Timon Gehr (3/6) Feb 14 Not really, it only has to show that the value in the memory location
- Bruce Carneal (4/10) Feb 14 True. I jumped to the general case where I'd seen sub optimal
- Patrick Schluter (5/46) Feb 13 Yes, that's normal. The compiler cannot know from the declaration
- deadalnix (5/10) Feb 13 It's a major footgun without much benefit. In this case, one
I was refactoring some code and changed a parameter from by value, to by pointer, and saw the performance drop by 50%. This is a highly reduced example of what I found, but basically passing something into a function by reference or pointer seems to make the compilers (it affects both DMD and LDC) treat it as if its volatile and must be loaded from memory on every use. This also inhibits the auto-vectorization of code by LDC. https://d.godbolt.org/z/oonq1drd9 ```d void fillBP(uint* value, uint* dest) { dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } ``` codegen DMD --> push RBP mov RBP,RSP mov ECX,[RSI] mov [RDI],ECX mov EDX,[RSI] mov 4[RDI],EDX mov R8D,[RSI] mov 8[RDI],R8D mov R9D,[RSI] mov 0Ch[RDI],R9D pop RBP ret codgen LDC --> mov eax, dword ptr [rdi] mov dword ptr [rsi], eax mov eax, dword ptr [rdi] mov dword ptr [rsi + 4], eax mov eax, dword ptr [rdi] mov dword ptr [rsi + 8], eax mov eax, dword ptr [rdi] mov dword ptr [rsi + 12], eax ret ```d void fillBV(uint value, uint* dest) { dest[0] = value; dest[1] = value; dest[2] = value; dest[3] = value; } ``` codgen DMD --> push RBP mov RBP,RSP mov [RDI],ESI mov 4[RDI],ESI mov 8[RDI],ESI mov 0Ch[RDI],ESI pop RBP ret codegen LDC --> movd xmm0, edi pshufd xmm0, xmm0, 0 movdqu xmmword ptr [rsi], xmm0 ret Interestingly if you do this... ```d void fillBP(uint* value, uint* dest) { uint tmp = *value; dest[0] = tmp; dest[1] = tmp; dest[2] = tmp; dest[3] = tmp; } ``` You get identical code to the by value versions. (except the load from memory) I'm not a compiler guy so maybe there's some rationale for this that I don't know but it seems like the compiler should be able to read "*value" once and cache it.
Feb 12
dmd having bad codegen here isn't a surprise, that is to be expected. Now for ldc: ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } ``` I expected that to not do the extra loads, but it did. ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0 .. 4][] = *value; } ``` And that certainly should not be doing it either. Even if it wasn't immutable. For your code, because it is not immutable and therefore can be changed externally on another thread, the fact that the compiler has to do the loads is correct. This isn't a bug.
Feb 12
On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:dmd having bad codegen here isn't a surprise, that is to be expected. Now for ldc: ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } ``` I expected that to not do the extra loads, but it did.I hope someone can find the link to some DConf talk (me or Andrei) or forum post where I talk about why LDC assumes that `immutable(uint*)` points to _mutable_ (nota bene) data. The reason is the mutable thread synchronization field in immutable class variable storage (`__monitor`), combined with casting an immutable class to an array of immutable bytes. Side-effects in-between immutable(uint*) lookup could run into a synchronization event on the immutable data (i.e. mutating it). In the case of `fillBP` there are no side-effects possible between the reads, so it appears that indeed the optimization could be done. But a different thread might write to the data. I don't know how that data-race is then defined... For the general case, side-effects are possible (e.g. a function call) so it is not possible to simply assume that `immutable` reference arguments never alias to other reference arguments; this complicates implementing the desired optimization. I'm not saying it is impossible, it's just extra effort (and proof is needed). -Johan
Feb 13
On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote:On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:Found it: https://youtu.be/-0jcE9B5kjs?list=PL9a7lgBtSQb-YCVj96v5vn1tEXPjKOPuB&t=403dmd having bad codegen here isn't a surprise, that is to be expected. Now for ldc: ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } ``` I expected that to not do the extra loads, but it did.I hope someone can find the link to some DConf talk (me or Andrei)
Feb 13
Wow, OOP people ruining performance for everybody, who would have thought.. That makes me worried now.. -betterC doesn't seem to change anything about optimization, or rather, lack of optimization..
Feb 13
On 14/02/2024 4:20 AM, ryuukk_ wrote:Wow, OOP people ruining performance for everybody, who would have thought.. That makes me worried now.. -betterC doesn't seem to change anything about optimization, or rather, lack of optimization..No, it isn't limited to classes, the same issue appears with reference counting. Immutable simply isn't providing the guarantee of immutable.
Feb 13
On 14/02/2024 2:30 AM, Johan wrote:On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:If you don't like the problem, change the problem. After thinking about it, I can agree for D classes immutable can't be known if it truely applies or not. A lock could still be taken in const code. So the problem is: For a type that is not a D class (COM, extern(C++) are not included in that definition), mark as immutable to LLVM IR so optimizations can be used wrt. immutable. After all, to call an immutable this, method it must also be marked as const or immutable. The this pointer won't be allowed to be mutated unless someone did something bad (again, not the compiler's fault that the user went against it). I don't think much thought needs to go into this.dmd having bad codegen here isn't a surprise, that is to be expected. Now for ldc: ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } ``` I expected that to not do the extra loads, but it did.I hope someone can find the link to some DConf talk (me or Andrei) or forum post where I talk about why LDC assumes that `immutable(uint*)` points to _mutable_ (nota bene) data. The reason is the mutable thread synchronization field in immutable class variable storage (`__monitor`), combined with casting an immutable class to an array of immutable bytes. Side-effects in-between immutable(uint*) lookup could run into a synchronization event on the immutable data (i.e. mutating it). In the case of `fillBP` there are no side-effects possible between the reads, so it appears that indeed the optimization could be done. But a different thread might write to the data. I don't know how that data-race is then defined... For the general case, side-effects are possible (e.g. a function call) so it is not possible to simply assume that `immutable` reference arguments never alias to other reference arguments; this complicates implementing the desired optimization. I'm not saying it is impossible, it's just extra effort (and proof is needed). -Johan
Feb 13
After even more thinking, I'm reminded of an operator overload that I want for reference counting. ``opGoingIntoROM``. The reason for reference counting is obvious, to turn off reference counting. But it also applies here. Having this knowledge on the monitor could turn off the actual lock/unlock mutation. Okay this is compelling enough to report it. https://issues.dlang.org/show_bug.cgi?id=24393 This is no longer an "I want". It's a genuine type guarantee issue of something we currently support.
Feb 13
On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote:I hope someone can find the link to some DConf talk (me or Andrei) or forum post where I talk about why LDC assumes that `immutable(uint*)` points to _mutable_ (nota bene) data. The reason is the mutable thread synchronization field in immutable class variable storage (`__monitor`), combined with casting an immutable class to an array of immutable bytes. Side-effects in-between immutable(uint*) lookup could run into a synchronization event on the immutable data (i.e. mutating it).Shouldn't it be okay for immutable(uint*) to not access `__monitor`? I think, `__monitor` should work outside of type safety, like reference counting, then immutability won't matter. ``` shared(Object.Monitor*) getMonitor(immutable Object o) { size_t addr=cast(size_t)cast(void*)&o.__monitor; return cast(shared(Object.Monitor*))(addr-20+20); } ``` Is it legal for optimizer to cast away `restrict` qualifier like this?
Feb 16
On 16/02/2024 10:34 PM, Kagamin wrote:On Tuesday, 13 February 2024 at 13:30:11 UTC, Johan wrote: I hope someone can find the link to some DConf talk (me or Andrei) or forum post where I talk about why LDC assumes that |immutable(uint*)| points to /mutable/ (nota bene) data. The reason is the mutable thread synchronization field in immutable class variable storage (|__monitor|), combined with casting an immutable class to an array of immutable bytes. Side-effects in-between immutable(uint*) lookup could run into a synchronization event on the immutable data (i.e. mutating it). Shouldn't it be okay for immutable(uint*) to not access |__monitor|? I think, |__monitor| should work outside of type safety, like reference counting, then immutability won't matter.That covers immutable, but not immutable that has become const. The only way I can see it working reliably is to simply turn off the mutex when it is immutable.
Feb 16
On Friday, 16 February 2024 at 09:56:50 UTC, Richard (Rikki) Andrew Cattermole wrote:That covers immutable, but not immutable that has become const.That's fine, const can't be restrict qualified unconditionally, but immutable can.
Feb 16
On 16/02/2024 11:23 PM, Kagamin wrote:On Friday, 16 February 2024 at 09:56:50 UTC, Richard (Rikki) Andrew Cattermole wrote:It is not fine. Immutable objects can be in read only memory, since nothing is allowed to modify the contents process wide. Write to it, and see how the program segfaults. Just to prove a point of what immutable enables a compiler to do: ```d import std.stdio; import core.sys.posix.sys.mman; import core.sys.posix.unistd : sysconf, _SC_PAGESIZE; import core.sys.posix.stdlib; void main() { auto pagesize = sysconf(_SC_PAGESIZE); int* temp; posix_memalign(cast(void**)&temp, pagesize, 4 * pagesize); data = temp[0 .. pagesize]; data[0 .. 2] = [1, 2]; mprotect(data.ptr, int.sizeof * data.length, PROT_READ); writeln(data[0 .. 2]); data[0 .. 2] = [3, 4]; writeln(data[0 .. 2]); } static int[] data; ``` ``` [1, 2] Error: program killed by signal 11 ``` Read only memory like this is useful to kernels, it enables them to map the same ram ranges to multiple processes. It provides a very strong guarantee, well beyond const and right now that guarantee isn't honored in the language and needs to be fixed (otherwise why have it at all?).That covers immutable, but not immutable that has become const.That's fine, const can't be restrict qualified unconditionally, but immutable can.
Feb 16
On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) Andrew Cattermole wrote:Immutable objects can be in read only memory, since nothing is allowed to modify the contents process wide.Then you won't be able to create a monitor for them.
Feb 16
On 17/02/2024 4:12 AM, Kagamin wrote:On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) Andrew Cattermole wrote:Code will still call into the null monitor. Or we could have a way to disable it at runtime instead by setting a field to change behavior to cover all use cases (which is what I proposed ;) ).Immutable objects can be in read only memory, since nothing is allowed to modify the contents process wide.Then you won't be able to create a monitor for them.
Feb 16
On Friday, 16 February 2024 at 15:12:58 UTC, Kagamin wrote:On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) Andrew Cattermole wrote:I don't get this point. The monitor field can just be the single pointer that is not transitively immutable? I mean. The compiler knows *exactly* when you access the monitor field. It's never accessed manually! This is a pointer access that can be generated at exactly line of code in the compiler. Just drop immutable there.Immutable objects can be in read only memory, since nothing is allowed to modify the contents process wide.Then you won't be able to create a monitor for them.
Feb 18
On Sunday, 18 February 2024 at 13:41:42 UTC, FeepingCreature wrote:On Friday, 16 February 2024 at 15:12:58 UTC, Kagamin wrote:On Friday, 16 February 2024 at 11:10:13 UTC, Richard (Rikki) Andrew Cattermole wrote:I don't get this point. The monitor field can just be the single pointer that is not transitively immutable? I mean. The compiler knows *exactly* when you access the monitor field. It's never accessed manually! This is a pointer access that can be generated at exactly line of code in the compiler. Just drop immutable there.Immutable objects can be in read only memory, since nothing is allowed to modify the contents process wide.Then you won't be able to create a monitor for them.I hope someone can find the link to some DConf talk (me or Andrei) or forum post where I talk about why LDC assumes that immutable(uint*) points to mutable (nota bene) data. The reason is the mutable thread synchronization field in immutable class variable storage (__monitor), combined with casting an immutable class to an array of immutable bytes.Okay I think I overread this, probably because it's absurd. Why is this idiom worth protecting? You're hardcasting *anyways*, it's forbidden in safe *anyways*, just cast __monitor to uint* instead of immutable(uint*). If you're interacting with classes at such a deep level, asking you to remember that there's a mutable field in there even in immutable classes doesn't seem like a big ask.
Feb 18
On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:dmd having bad codegen here isn't a surprise, that is to be expected. Now for ldc: ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } ``` I expected that to not do the extra loads, but it did. ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0 .. 4][] = *value; } ``` And that certainly should not be doing it either. Even if it wasn't immutable. For your code, because it is not immutable and therefore can be changed externally on another thread, the fact that the compiler has to do the loads is correct. This isn't a bug.Is not a thread issue. The memory the pointers point to only needs to overlap and the loads are required to get the "right" result.
Feb 13
On 14/02/2024 4:14 AM, Patrick Schluter wrote:Is not a thread issue. The memory the pointers point to only needs to overlap and the loads are required to get the "right" result.It can be a thread issue, but yes overlapping is another potential situation where it can get over written. I prefer talking about threads, because they bring an entirely unknown element to the discussion and shutdown any localized assumptions. In the case of immutable it should not be possible for immutable memory to be changed unless someone did something bad. Immutable is a very strong guarantee at the process level, that there are no mutable pointers pointing at immutable memory. Therefore the compiler is free to make that optimization without concern. If it breaks, it's the users fault for misusing the language. "The second way is to cast data to immutable. When doing so, it is up to the programmer to ensure that any mutable references to the same data are not used to modify the data after the cast." https://dlang.org/spec/const3.html#creating_immutable_data
Feb 13
On Tuesday, 13 February 2024 at 15:14:06 UTC, Patrick Schluter wrote:On Tuesday, 13 February 2024 at 03:31:31 UTC, Richard (Rikki) Andrew Cattermole wrote:It's been proven that the main problem, i.e what defeated CSE ( the Common Sub Expression optim pass), is the possible overlap between the two parameter. However there'is still a risk on the value, i.e it can be modified by another thread, like mentioned Johan IIRC. The destination of the pointer you pass as "value" might change between one of the fourth assignment. To be frank I thought this was the problem.dmd having bad codegen here isn't a surprise, that is to be expected. Now for ldc: ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } ``` I expected that to not do the extra loads, but it did. ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0 .. 4][] = *value; } ``` And that certainly should not be doing it either. Even if it wasn't immutable. For your code, because it is not immutable and therefore can be changed externally on another thread, the fact that the compiler has to do the loads is correct. This isn't a bug.Is not a thread issue. The memory the pointers point to only needs to overlap and the loads are required to get the "right" result.
Feb 13
On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:dmd having bad codegen here isn't a surprise, that is to be expected.I'm used to people saying that DMD doesn't do data flow analysis. It does. In fact, it is based on my OptimumC, which was the first C compiler on DOS to do DFA back in the 1980s. This isn't a case of buggy DFA. It's a case of doing DFA correctly. The issue is pointer aliasing. A pointer can point to anything, including const data. Therefore, storing through a pointer can alter any value that is reachable via a pointer. Therefore, storing through a pointer invalidates any cached value already read. This is what you're seeing. C99 tried to address this with __restrict, but few people use it or understand it. D didn't bother with it because people will inevitably misuse __restrict and get their data mysteriously corrupted.
Feb 13
On 14/02/2024 5:49 PM, Walter Bright wrote:On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:I'm aware that what dmd is doing is right. But it isn't using SIMD either, so there is some performance left on the table. A backend today vs 15 years ago is very different in terms of what they can do. Let alone one that last saw significant active development 30 years ago. I suggest that you have a chat with Bruce about this during the next meetup. We've been trying to explain to you just how backends have changed over the years from a users perspective. The ones today can only be described with one word: magical. But I will say this: 20 years ago you needed inline assembly, 10 years ago you needed intrinsics, today you may not need to do anything special in D, although will in C++.dmd having bad codegen here isn't a surprise, that is to be expected.I'm used to people saying that DMD doesn't do data flow analysis. It does. In fact, it is based on my OptimumC, which was the first C compiler on DOS to do DFA back in the 1980s. This isn't a case of buggy DFA. It's a case of doing DFA correctly. The issue is pointer aliasing. A pointer can point to anything, including const data. Therefore, storing through a pointer can alter any value that is reachable via a pointer. Therefore, storing through a pointer invalidates any cached value already read. This is what you're seeing. C99 tried to address this with __restrict, but few people use it or understand it. D didn't bother with it because people will inevitably misuse __restrict and get their data mysteriously corrupted.
Feb 13
I'll give an example for something I wrote up in a reply for a Reddit post: Using ldc I was able to get this code to do 8.8 million vectors in around 1 second. ```d module strategy; int run(byte[] a, byte[] b) { int sum = 0; for(size_t i; i < a.length; i += 64) { static foreach(j; 0 .. 64) { sum += cast(short)a[i + j] * cast(short)b[i + j]; } } return sum; } ``` To achieve the same thing using Go, they had to write assembly (see DotVNNI example). https://sourcegraph.com/blog/slow-to-simd LDC is able to get to almost the maximum a CPU can do, with what looks to be almost naive looking code. DMD cannot compete with this, nor should it aim to. Here is what actual naive code looks like for this problem: ```d module strategy; int run(byte[] a, byte[] b) { assert(a.length == b.length); int sum = 0; foreach(i; 0 .. a.length) { sum += cast(short)a[i] * cast(short)b[i]; } return sum; } ``` 7 million vectors per second. They had to write assembly to get that speed. I got it, without doing anything special...
Feb 13
On 14.02.24 05:49, Walter Bright wrote:On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:Well, my understanding is that storing a value that was read from a `uint*` cannot invalidate a value read from the same `uint*` unless unaligned reads/writes are allowed or the pointer can point to itself. Are unaligned reads/writes allowed in D? Can a `uint*` point to itself? Is there something else I missed? How to make the following assertion fail by filling the `...` holes with code without invoking UB? ```d void fillBP1(uint* value, uint* dest)pure{ dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } void fillBP2(uint* value, uint* dest)pure{ auto tmp = *value; dest[0] = tmp; dest[1] = tmp; dest[2] = tmp; dest[3] = tmp; } uint[4] distinguish(alias f)()pure{ ...; // TODO uint[4] dest = ...; // TODO uint* value = ...; // TODO f(value,dest.ptr); return dest; } void main(){ import std.stdio; assert(distinguish!fillBP1()==distinguish!fillBP2()); writeln(test!fillBP1); writeln(test!fillBP2); } ```dmd having bad codegen here isn't a surprise, that is to be expected.I'm used to people saying that DMD doesn't do data flow analysis. It does. In fact, it is based on my OptimumC, which was the first C compiler on DOS to do DFA back in the 1980s. This isn't a case of buggy DFA. It's a case of doing DFA correctly. The issue is pointer aliasing. A pointer can point to anything, including const data. Therefore, storing through a pointer can alter any value that is reachable via a pointer. Therefore, storing through a pointer invalidates any cached value already read. This is what you're seeing. ...C99 tried to address this with __restrict, but few people use it or understand it. D didn't bother with it because people will inevitably misuse __restrict and get their data mysteriously corrupted.My understanding is the case in the OB does not require `__restrict`. `value` can point to any of the 4 fields of `dest`, it is okay to only read it once even in that case. Of course, maybe the data flow analysis is more conservative than that, but I wouldn't necessarily point to this case as one where data flow analysis is working "correctly".
Feb 14
On Wednesday, 14 February 2024 at 04:49:28 UTC, Walter Bright wrote:This isn't a case of buggy DFA. It's a case of doing DFA correctly. The issue is pointer aliasing. A pointer can point to anything, including const data. Therefore, storing through a pointer can alter any value that is reachable via a pointer. Therefore, storing through a pointer invalidates any cached value already read. This is what you're seeing. C99 tried to address this with __restrict, but few people use it or understand it. D didn't bother with it because people will inevitably misuse __restrict and get their data mysteriously corrupted.This, here, is the answer. Moral of the story, use (gdc, ldc) vendor attributes if you really care. ``` import core.attribute; void fillBP( restrict uint* dest, uint* value) { //... ```
Feb 14
On Wednesday, 14 February 2024 at 04:49:28 UTC, Walter Bright wrote:On 2/12/2024 7:31 PM, Richard (Rikki) Andrew Cattermole wrote:It's probably true that not many can rattle off the legalese that defines restrict semantics correctly (I know I can't, although have to say I've never been bitten by it), but within a roughly C-like programming model its where a lot of performance comes from when at the bleeding edge so I don't know if not including it at all is particularly good either. Thinking about it, it's actually probably mainly an issue for library users rather than the author who actually wrote restrict to begin with - I have wondered in the past whether some kind of `isolated` style type qualifier could actually help here i.e. one could provide the compiler good information and make it very obvious to the programmer.dmd having bad codegen here isn't a surprise, that is to be expected.I'm used to people saying that DMD doesn't do data flow analysis. It does. In fact, it is based on my OptimumC, which was the first C compiler on DOS to do DFA back in the 1980s. This isn't a case of buggy DFA. It's a case of doing DFA correctly. The issue is pointer aliasing. A pointer can point to anything, including const data. Therefore, storing through a pointer can alter any value that is reachable via a pointer. Therefore, storing through a pointer invalidates any cached value already read. This is what you're seeing. C99 tried to address this with __restrict, but few people use it or understand it. D didn't bother with it because people will inevitably misuse __restrict and get their data mysteriously corrupted.
Feb 18
On 19/02/2024 2:58 AM, max haughton wrote:Thinking about it, it's actually probably mainly an issue for library users rather than the author who actually wrote restrict to begin with - I have wondered in the past whether some kind of |isolated| style type qualifier could actually help here i.e. one could provide the compiler good information and make it very obvious to the programmer.A couple of days ago I had a read of Joe Duffy's blog. I was very impressed with what he had to say, enough to immediately want to buy his concurrency book and start adding his blog articles as references to my existing proposals. He was one of the developers Midori and had a lot to say about exception mechanisms and concurrency. Turns out we both have some similar conclusions, only he gained his about 15 years prior to me (which may or may not have contributed to my impression). The concept of isolated was one such concept that he talked about in regards to concurrency and its safety. I had not thought about it, that it would be a useful tool for making memory ownership safe in this context. But it also applies here consider: ```d isolated(int[]) a, b; func(a, b); void func(scope int[] c, scope int[] d) safe {} ``` Should that be callable with a and b? Yes. The graph doesn't matter, it doesn't escape. Now what about the case where it does? ```d int[] func(return scope int[] c, scope int[] d) safe { return c; } ``` Again yes. Only one parameter contributes to return, the return would therefore be part of ``a``'s graph. Another interesting possibility is that if one parameter is isolated, that means the second parameter ``d`` is not in ``c``'s graph, because isolated is a form of transfer of ownership, except because ``c`` is scope, the transfer doesn't occur from the outside. As a result both ``c`` and ``d`` would have `` restrict`` added. This is on my todo list to research, of course how exactly it plays out with DIP1000 and the this pointer is unknown to me at this stage.
Feb 18
On 13.02.24 04:31, Richard (Rikki) Andrew Cattermole wrote:dmd having bad codegen here isn't a surprise, that is to be expected. Now for ldc: ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } ``` I expected that to not do the extra loads, but it did. ```d void fillBP(immutable(uint*) value, uint* dest) { dest[0 .. 4][] = *value; } ``` And that certainly should not be doing it either. Even if it wasn't immutable. For your code, because it is not immutable and therefore can be changed externally on another thread, the fact that the compiler has to do the loads is correct. This isn't a bug.An unsynchronized read of a location that is modified by another thread is a race condition and a race condition is UB, so this scenario is not an optimization blocker. I think with this specific code reading back the pointers is unnecessary, because assigning `*value` to itself does not produce an observable difference in behavior given that it is also read and assigned to a different location at least once. (Unless LDC defines the behavior of unaligned reads/writes?) Of course, if the code instead assigned `*value+1` to the entries of `dest`, the reads would become necessary. In general, I think LDC and GDC are pretty conservative about optimizing based on UB, which I think has benefits for security.
Feb 14
On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:I was refactoring some code and changed a parameter from by value, to by pointer, and saw the performance drop by 50%. This is a highly reduced example of what I found, but basically passing something into a function by reference or pointer seems to make the compilers (it affects both DMD and LDC) treat it as if its volatile and must be loaded from memory on every use. This also inhibits the auto-vectorization of code by LDC. https://d.godbolt.org/z/oonq1drd9 ... I'm not a compiler guy so maybe there's some rationale for this that I don't know but it seems like the compiler should be able to read "*value" once and cache it.To reuse the value the compiler would have to prove that the memory locations do not overlap. FORTRAN does not have this problem, neither does ldc once you take responsibility for non-overlap with the restrict attribute as seen here: https://d.godbolt.org/z/z9vYndWqP When loops are involved between potentially overlapping indexed arrays I've seen ldc go through the proof and do two versions of the code with a branch.
Feb 12
On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:[snip] To reuse the value the compiler would have to prove that the memory locations do not overlap. FORTRAN does not have this problem, neither does ldc once you take responsibility for non-overlap with the restrict attribute as seen here: https://d.godbolt.org/z/z9vYndWqP When loops are involved between potentially overlapping indexed arrays I've seen ldc go through the proof and do two versions of the code with a branch.As a heads up, the LDC wiki page doesn't have restrict on it. https://wiki.dlang.org/LDC-specific_language_changes Does LDC's restrict only work with pointers directly and not slices? `fillRestricted2` doesn't compile (it only fails because `value` is a slice, not because `dest` is one. But `fillRestricted3` compiles just fine. ```d void fillRestricted2( restrict uint[] value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } void fillRestricted3( restrict uint* value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } ```
Feb 13
On Tuesday, 13 February 2024 at 15:40:23 UTC, jmh530 wrote:On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:Sorry, I should note that while `fillRestricted3` compiles, It has similar codegen as `fillBP`.[...]As a heads up, the LDC wiki page doesn't have restrict on it. https://wiki.dlang.org/LDC-specific_language_changes Does LDC's restrict only work with pointers directly and not slices? `fillRestricted2` doesn't compile (it only fails because `value` is a slice, not because `dest` is one. But `fillRestricted3` compiles just fine. ```d void fillRestricted2( restrict uint[] value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } void fillRestricted3( restrict uint* value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } ```
Feb 13
On Tuesday, 13 February 2024 at 15:40:23 UTC, jmh530 wrote:On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:There was a little discussion a while back about restrict being generalized to slices but, as you note, it's not there yet. I typically use restrict on slice .ptr fields passed into nested functions within an trusted function. Yes, that's a little clumsy but it lets you do bounds checks and the like before dropping into the hot loops. If you're willing to tolerate the code bloat, and your code is simple, ldc will do the proof and auto-vec things for you. restrict also seems to keep the auto-vec optimizer engaged in not-strictly-vanilla formulations. If you have multiple sources and a single destination array note that you only need restrict on the destination. You're just trying to give the back end a clear view of dependencies. As always, godbolt is your friend when optimizing known bottlenecks of this sort.[snip] To reuse the value the compiler would have to prove that the memory locations do not overlap. FORTRAN does not have this problem, neither does ldc once you take responsibility for non-overlap with the restrict attribute as seen here: https://d.godbolt.org/z/z9vYndWqP When loops are involved between potentially overlapping indexed arrays I've seen ldc go through the proof and do two versions of the code with a branch.As a heads up, the LDC wiki page doesn't have restrict on it. https://wiki.dlang.org/LDC-specific_language_changes Does LDC's restrict only work with pointers directly and not slices? `fillRestricted2` doesn't compile (it only fails because `value` is a slice, not because `dest` is one. But `fillRestricted3` compiles just fine. ```d void fillRestricted2( restrict uint[] value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } void fillRestricted3( restrict uint* value, uint[] dest) { dest[0] = value[0]; dest[1] = value[1]; dest[2] = value[2]; dest[3] = value[3]; } ```
Feb 13
On Tuesday, 13 February 2024 at 15:56:59 UTC, Bruce Carneal wrote:[snip] There was a little discussion a while back about restrict being generalized to slices but, as you note, it's not there yet.Thanks for the detailed response![snip] As always, godbolt is your friend when optimizing known bottlenecks of this sort.godbolt is definitely your friend when you can use it.
Feb 13
On Tuesday, 13 February 2024 at 06:02:47 UTC, Bruce Carneal wrote:On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:Ah OK makes sense. The restrict attribute will help with what I'm doing. thanks.To reuse the value the compiler would have to prove that the memory locations do not overlap. FORTRAN does not have this problem, neither does ldc once you take responsibility for non-overlap with the restrict attribute as seen here: https://d.godbolt.org/z/z9vYndWqP When loops are involved between potentially overlapping indexed arrays I've seen ldc go through the proof and do two versions of the code with a branch.
Feb 13
On 13.02.24 07:02, Bruce Carneal wrote:To reuse the value the compiler would have to prove that the memory locations do not overlap.Not really, it only has to show that the value in the memory location has not changed since the last read.
Feb 14
On Wednesday, 14 February 2024 at 11:46:24 UTC, Timon Gehr wrote:On 13.02.24 07:02, Bruce Carneal wrote:True. I jumped to the general case where I'd seen sub optimal code gen. I should have commented on both the given special, stationary, case and the more general.To reuse the value the compiler would have to prove that the memory locations do not overlap.Not really, it only has to show that the value in the memory location has not changed since the last read.
Feb 14
On Tuesday, 13 February 2024 at 02:11:45 UTC, claptrap wrote:I was refactoring some code and changed a parameter from by value, to by pointer, and saw the performance drop by 50%. This is a highly reduced example of what I found, but basically passing something into a function by reference or pointer seems to make the compilers (it affects both DMD and LDC) treat it as if its volatile and must be loaded from memory on every use. This also inhibits the auto-vectorization of code by LDC. https://d.godbolt.org/z/oonq1drd9 ```d void fillBP(uint* value, uint* dest) { dest[0] = *value; dest[1] = *value; dest[2] = *value; dest[3] = *value; } ``` codegen DMD --> push RBP mov RBP,RSP mov ECX,[RSI] mov [RDI],ECX mov EDX,[RSI] mov 4[RDI],EDX mov R8D,[RSI] mov 8[RDI],R8D mov R9D,[RSI] mov 0Ch[RDI],R9D pop RBP ret codgen LDC --> mov eax, dword ptr [rdi] mov dword ptr [rsi], eax mov eax, dword ptr [rdi] mov dword ptr [rsi + 4], eax mov eax, dword ptr [rdi] mov dword ptr [rsi + 8], eax mov eax, dword ptr [rdi] mov dword ptr [rsi + 12], eax ret ```dYes, that's normal. The compiler cannot know from the declaration alone if your pointer overlaps. In C you can declare the pointers with restrict which will tell the compiler that the pointers don't overlap. I don't know why D doesn't support restrict.
Feb 13
On Tuesday, 13 February 2024 at 15:11:58 UTC, Patrick Schluter wrote:Yes, that's normal. The compiler cannot know from the declaration alone if your pointer overlaps. In C you can declare the pointers with restrict which will tell the compiler that the pointers don't overlap. I don't know why D doesn't support restrict.It's a major footgun without much benefit. In this case, one could load the integers in a local variable and use the value from there, and you'd get teh same effet.
Feb 13