
digitalmars.D - core.stdc.stdatomic

reply Walter Bright <newshound2 digitalmars.com> writes:
There have been multiple requests to add this support. It can all be done
with a library implemented with some inline assembler.

Anyone want the glory of implementing this?
Nov 13 2023
next sibling parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
For the most part it looks like everything is in core.atomic already.

The only thing that looks like it isn't is ``atomic_signal_fence`` and 
that has to be an intrinsic as it is specifically for a backend instruction.

But a reminder: only dmd is missing the intrinsics necessary to do atomics
correctly without memory errors. Dmd is a write-off if you are doing
anything more than just reference counting with it.
Nov 13 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/13/2023 8:56 PM, Richard (Rikki) Andrew Cattermole wrote:
 For the most part it looks like everything is in core.atomic already.
 
 The only thing that looks like it isn't is ``atomic_signal_fence`` and that
 has to be an intrinsic as it is specifically for a backend instruction.
 
 But a reminder: only dmd is missing the intrinsics necessary to do atomics
 correctly without memory errors. Dmd is a write-off if you are doing anything
 more than just reference counting with it.
Aside from fence, they can all be done with simple functions. So why not?
Nov 14 2023
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 12:24 AM, Walter Bright wrote:
 Dmd is a write-off if you are doing anything more than just reference
 counting with it.
Aside from fence, they can all be done with simple functions. So why not?
The fence instructions are supported by DMD's inline assembler, so they can be done, too, as simple functions. Why is this a write-off?
Nov 14 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 14/11/2023 9:35 PM, Walter Bright wrote:
 The fence instructions are supported by DMD's inline assembler, so they 
 can be done, too, as simple functions.
 
 Why is this a write-off?
From the C11 spec for ``atomic_signal_fence``:

 NOTE 2 Compiler optimizations and reorderings of loads and stores are
 inhibited in the same way as with atomic_thread_fence, but the hardware
 fence instructions that atomic_thread_fence would have inserted are not
 emitted.

In other words, no instructions are emitted; it does not map to an x86
instruction.

I've said this before: dmd is a write-off for lock-free concurrent data
structures. Having ANY extra functions in the call stack can throw off the
timing and end in segfaults. It MUST be inlined! This will be true of any use
case for atomics beyond that of reference counting.
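For reference, a rough sketch of the distinction in core.atomic terms; ``atomicFence`` is existing druntime API, while the signal-fence variant is hypothetical here, named only to mirror C11:

```d
// Sketch only, not actual druntime code.
import core.atomic : atomicFence;

void threadFenceExample()
{
    // Compiler barrier plus a hardware fence (e.g. MFENCE or a locked op on
    // x86 for seq_cst) -- this one can be expressed as an ordinary function.
    atomicFence();
}

// A hypothetical atomic_signal_fence() would be a compiler barrier only:
// it emits no instruction at all, which is why it cannot be written as a
// plain library function and has to be an intrinsic the backend understands.
```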
Nov 14 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 12:58 AM, Richard (Rikki) Andrew Cattermole wrote:
 On 14/11/2023 9:35 PM, Walter Bright wrote:
 The fence instructions are supported by DMD's inline assembler, so they can be 
 done, too, as simple functions.

 Why is this a write-off?
From C11 spec for ``atomic_signal_fence``: NOTE 2 Compiler optimizations and reorderings of loads and stores are inhibited in the same way as with atomic_thread_fence, but the hardware fence instructions that atomic_thread_fence would have inserted are not emitted. In other words no instructions emitted, it does not map to an x86 instruction.
dmd does not reorder code around inline assembler instructions. So this is not a problem.
 I've said this before, dmd is a write-off for lock-free concurrent data 
 structures. Having ANY extra functions in the call stack can throw off the 
 timing and end in segfaults. It MUST be inlined! This will be true of any use 
 case for atomics that is beyond that of reference counting.
Correct multithreaded code is about doing things in the correct sequence; it
is not about timing. If synchronization code is dependent on instruction
timing, it is inevitably going to fail, because too many things affect timing.

Yes, dmd's code here will be significantly slower than an intrinsic, but I
don't see how it would be incorrect.

As a path forward for DMD (a rough sketch of step 1 follows below):

 1. implement core.stdc.stdatomic in terms of core.atomic and/or
    core.internal.atomic

 2. eventually add intrinsics to dmd to replace them
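A minimal sketch of what step 1 could look like, wrapping the existing core.atomic API behind C-style names; the aliases, function set, and the failure handling in the compare-exchange are illustrative assumptions, not the actual druntime code:

```d
// Hypothetical stdatomic-style wrappers over core.atomic (sketch only).
import core.atomic : atomicLoad, atomicStore, atomicFence, cas;

alias atomic_int = shared(int);   // illustrative stand-in for C's atomic_int

int atomic_load(ref atomic_int obj)
{
    return atomicLoad(obj);                // seq_cst by default, as in C
}

void atomic_store(ref atomic_int obj, int desired)
{
    atomicStore(obj, desired);
}

bool atomic_compare_exchange_strong(shared(int)* obj, int* expected, int desired)
{
    // core.atomic.cas only reports success/failure, so approximate C's
    // behaviour of writing the observed value back into *expected on failure.
    if (cas(obj, *expected, desired))
        return true;
    *expected = atomicLoad(*obj);
    return false;
}

void atomic_thread_fence()
{
    atomicFence();
}
```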
Nov 14 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 7:13 AM, Walter Bright wrote:
 Correct multithreaded code is about doing things in the correct 
 sequence, it is not about timing. If synchronization code is dependent 
 on instruction timing, it is inevitably going to fail because too many 
 things affect timing.
You have mostly caught on to the problems here that I have experienced.

About the only people who can write lock-free concurrent data structures
reliably work with kernels, and yes, kernels do have them. As a subject
matter they are only about 20 years old (30 for some key theory), which is
very young for data structures.

To quote Andrei about its significance:

 If you believe that's a fundamental enough question to award a prize to the
 answerer, so did others. In 2003, Maurice Herlihy was awarded the Edsger W.
 Dijkstra Prize in Distributed Computing for his seminal 1991 paper
 "Wait-Free Synchronization" (see http://www.podc.org/dijkstra/2003.html,
 which includes a link to the paper, too). In his tour-de-force paper,
 Herlihy proves which primitives are good and which are bad for building
 lock-free data structures. That brought some seemingly hot hardware
 architectures to instant obsolescence, while clarifying what synchronization
 primitives should be implemented in future hardware.

https://drdobbs.com/lock-free-data-structures/184401865

So it is timing-based: you have set points which act as synchronization
events, and then everything after them must work exactly the same on each
core. This is VERY HARD! I got grey hair because of dmd using inline
assembly, because function calls do not result in "exact" timings after
those synchronization points! But it was possible with ldc with a lot of
work, just not dmd.

I'm not the only one who has gone down this path:

https://github.com/MartinNowak/lock-free
https://github.com/mw66/liblfdsd
https://github.com/nin-jin/go.d
 As a path forward for DMD:

  1. implement core.stdc.stdatomic in terms of core.atomic and/or
     core.internal.atomic

  2. eventually add intrinsics to dmd to replace them
Almost. Everything that is in core.stdc.stdatomic should have similar codegen
to the C compiler; if it doesn't, that is a bug. This is what gives it the
value desired: the guarantee that it will line up.

So my proposal is almost the same, but applies to all compilers:

 1. Implement functions that wrap core.internal.atomic iff those function
    implementations are intrinsics, equivalent to an intrinsic, or use
    locking.

 2. Implement intrinsics in core.internal.atomic and then implement the
    corresponding wrapper function for stdatomic.
Nov 14 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 7:49 PM, Richard (Rikki) Andrew Cattermole wrote:
 So it is timing based, you have set points which act as synchronization events 
 and then everything after it must work exactly the same on each core. This is 
 VERY HARD!
Everything I've read about writing correct synchronization says it's not
about timing, it's about sequencing. For example,
https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691

Or maybe you and I are just misunderstanding terms.

For example, fences. Fences enforce memory-ordering constraints, not timing.
Happens-before and synchronizes-with are sequencing, not timing.
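A tiny illustration of the sequencing point in core.atomic terms (the variable and function names here are invented): the consumer acts only once it observes the release store, no matter how long either thread takes.

```d
// Release/acquire sketch: the ordering guarantee comes from the acquire load
// synchronizing-with the release store, not from any timing assumption.
import core.atomic : atomicLoad, atomicStore, MemoryOrder;

shared int payload;
shared bool ready;

void producer()
{
    atomicStore!(MemoryOrder.raw)(payload, 42);   // relaxed store of the data
    atomicStore!(MemoryOrder.rel)(ready, true);   // release: publishes payload
}

void consumer()
{
    while (!atomicLoad!(MemoryOrder.acq)(ready)) {}      // acquire: wait for the publish
    assert(atomicLoad!(MemoryOrder.raw)(payload) == 42); // guaranteed by happens-before
}
```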
Nov 14 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 7:32 PM, Walter Bright wrote:
 On 11/14/2023 7:49 PM, Richard (Rikki) Andrew Cattermole wrote:
 So it is timing based, you have set points which act as 
 synchronization events and then everything after it must work exactly 
 the same on each core. This is VERY HARD!
Everything I've read about writing correct synchronization says it's not about timing, it's about sequencing. For example, https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691 or maybe you and I are just misunderstanding terms. For example, fences. Fences enforce memory-ordering constraints, not timing. Happens-before and synchronizes-with are sequencing, not timing.
You have understood the simplified parts of the problem. The problem is when
concurrency is in action: multiple cores operating on the same memory in the
same time units. You reach a shifting-sands feeling where multiple facts can
be true at the same time on different cores and can be completely
contradictory. Memory can be mapped on one, and not on another.

Did I mention I have grey hair because of this? This might be a reason why I
have strong opinions about D foundations such as symbols, given that
experience ;)

Also: https://github.com/dlang/dmd/pull/15816
Nov 14 2023
prev sibling parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
Question: Why do people want another wrapper around some inline assembly 
that already exists in core.atomic?

Answer: they don't. This does not allow people to implement any new ideas.

We don't need another wrapper around the same inline assembly that has 
the exact same tradeoffs with inlinability (it can't be inlined) and 
without the ability to succinctly communicate with the backend to ensure 
codegen looks the way it needs to.

You want to port code from C? Use core.atomic. But wait, it has different 
behaviors? Well yeah... it's not designed around the intrinsics that 
stdatomic.h is, which are what give it any useful meaning.

See: ``kill_dependency``, ``atomic_init`` and ``atomic_signal_fence``.

Writing a wrapper around stdatomic.h would take probably 2 hours. You 
don't need to write any inline assembly; it's already done in 
core.atomic. But realistically all you're doing is changing some names 
and the order of parameters, with slightly different types.
Nov 14 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 12:51 AM, Richard (Rikki) Andrew Cattermole wrote:
 Question: Why do people want another wrapper around some inline assembly that 
 already exists in core.atomic?
Because they have existing carefully crafted code in C and want to translate it to D.
 Writing a wrapper around stdatomic.h would take probably 2 hours.
Great! That saves each stdatomic C user 2 hours who wants to get their code in D.
Nov 14 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 8:00 AM, Walter Bright wrote:
 On 11/14/2023 12:51 AM, Richard (Rikki) Andrew Cattermole wrote:
 Question: Why do people want another wrapper around some inline 
 assembly that already exists in core.atomic?
Because they have existing carefully crafted code in C and want to translate it to D.
 Writing a wrapper around stdatomic.h would take probably 2 hours.
Great! That saves each stdatomic C user 2 hours who wants to get their code in D.
So what I'm getting at here is that we can already take a best-effort
approach to this by swapping stdatomic for core.atomic, but that does not
bring the value people want from it.

The codegen must be similar: if the C compiler for a target uses a function
call, so can we; if it uses intrinsics with inlining, so must we. This way
the behavior will be similar, and the port can be a success. Otherwise you
are introducing new and potentially wrong behavior, which means a failure in
trying to fulfill your ideas.
Nov 14 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 7:07 PM, Richard (Rikki) Andrew Cattermole wrote:
 This way the behavior will be similar, and the port can be a success.
 Otherwise you are introducing new and potentially wrong behavior, which
 means a failure in trying to fulfill your ideas.
I do not understand why a function that consists of a FENCE instruction will be a failure compared to a FENCE instruction inlined.
Nov 14 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 7:33 PM, Walter Bright wrote:
 I do not understand why a function that consists of a FENCE instruction 
 will be a failure compared to a FENCE instruction inlined.
Yes, ideally a memory barrier wouldn't matter for how it is executed. But it
does matter for load/cas, because what was true when the operation executed
may no longer be true by the time the next operation is performed when deep
call chains are in use.
Nov 14 2023
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 10:59 PM, Richard (Rikki) Andrew Cattermole wrote:
 On 15/11/2023 7:33 PM, Walter Bright wrote:
 I do not understand why a function that consists of a FENCE instruction will 
 be a failure compared to a FENCE instruction inlined.
Yes, ideally a memory barrier wouldn't matter for how it is executed. But it does matter for load/cas. Because what was true when the operation executed may not be true any longer by the time the next operation is performed with deep calls in use.
Is this only a problem with load/cas?
Nov 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 10:33 PM, Walter Bright wrote:
 Is this only a problem with load/cas?
A store by itself should be ok. If you do any loading, as you do with cas or
atomicOp, then the timings might not be in any way predictable.

The general rule of thumb for this is: if an atomic operation is by itself
for a given block of memory, then you can ignore timings. If it is used in
conjunction with another atomic instruction (or more), you have to consider
whether timings can mess it up.

But wrt. load/cas, they are the two big primitives that are used very heavily
in the literature, so they need to be prioritized over the others for
implementing intrinsics. They will also be used very heavily in any function
of such data structures, from my own experience.

A random fun fact: one of the authors of the book I recommend on the subject,
"The Art of Multiprocessor Programming", teaches at the same university as
Roy!

https://shop.elsevier.com/books/the-art-of-multiprocessor-programming/herlihy/978-0-12-415950-1
Nov 15 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/15/2023 1:55 AM, Richard (Rikki) Andrew Cattermole wrote:
 A random fun fact, the book I recommend on the subject "The Art of 
 Multiprocessor Programming" one of the authors teachers at the same university 
 as Roy!
 
 https://shop.elsevier.com/books/the-art-of-multiprocessor-programming/herlihy/978-0-12-415950-1
Thanks for the book recommendation. The review comments, though, say its examples are all in Java. Being in Java, an interpreter, is it relevant to machine level programming?
Nov 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/11/2023 3:30 PM, Walter Bright wrote:
 Being in Java, an interpreter, is it relevant to machine level programming?
Yes it is. Java has a well-defined memory subsystem; it isn't a toy
interpreter. E.g.
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/package-summary.html

As a subject matter I would consider it in the category of cross-referencing
required. So even if it isn't the book you want to reference, it's still
worth cross-referencing against at times.

For $7 from thriftbooks it's worth the risk.
Nov 15 2023
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/15/2023 8:07 PM, Richard (Rikki) Andrew Cattermole wrote:
 For $7 from thriftbooks its worth a risk.
Indeed. I ordered it. Thanks!
Nov 15 2023
prev sibling parent reply claptrap <clap trap.com> writes:
On Wednesday, 15 November 2023 at 06:59:45 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 15/11/2023 7:33 PM, Walter Bright wrote:
 I do not understand why a function that consists of a FENCE 
 instruction will be a failure compared to a FENCE instruction 
 inlined.
Yes, ideally a memory barrier wouldn't matter for how it is executed. But it does matter for load/cas. Because what was true when the operation executed may not be true any longer by the time the next operation is performed with deep calls in use.
That doesn't make any sense. The whole point of CAS is that it is atomic;
immediately after it has completed you have no guarantees anyway, so what
difference does it make if it's wrapped in a function call?
Nov 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 10:55 PM, claptrap wrote:
 That doesnt make any sense, the whole point of CAS is that it is atomic, 
 immediately after it has completed you have no guarantees anyway, what 
 difference does it make if it's wrapped in a function call?
I understand that it seems like it does not make sense. Lock-free concurrent
data structures are a highly advanced topic that very few people in the world
today can implement successfully. About the only people who are qualified to
touch them for production software would be kernel developers for a specific
CPU family.

They rely quite heavily on the premise that atomic operations happen
immediately in the codegen and then, based upon the results, do set actions
in response. This is timing-based; it has to be precise or they will
interfere with each other. You do not have much leeway before you start
getting segfaults.

I only ever saw partial success with ldc after seven months of researching
them. For obvious reasons I do not recommend people going down this
particular path of study, because you are going to get burned pretty badly,
guaranteed.

Regardless, compilers like gcc have intrinsics for all of stdatomic. We need
to match that, otherwise what D supports will not line up with what the
system C compiler can offer in terms of use cases.

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
Nov 15 2023
next sibling parent reply DrDread <DrDread cheese.com> writes:
On Wednesday, 15 November 2023 at 10:26:32 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 They rely quite heavily on the premise that atomic operations 
 happen immediately in the codegen and then, based upon the 
 results, do set actions in response. This is timing-based; it has 
 to be precise or they will interfere with each other. You do 
 not have much leeway before you start getting segfaults.
Whoever writes code like that deserves all the problems it creates. This
assumes specific CPU behaviour and may be completely broken on other systems.
Nov 15 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/11/2023 2:26 AM, DrDread wrote:
 whoever write code like that deserves all the problems it creates.
 This assumes specific CPU behaviour and may be completely broken on 
 other systems.
Yup, that's lock-free concurrent data structures for you. You need
specialists to have any chance of them working reliably.
Nov 15 2023
prev sibling parent reply claptrap <clap trap.com> writes:
On Wednesday, 15 November 2023 at 10:26:32 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 15/11/2023 10:55 PM, claptrap wrote:
 I understand that it seems like it does not make sense. 
 Lock-free concurrent data structures are a highly advanced 
 topic, that very few people in the world today can implement 
 successfully. About the only people who are qualified to touch 
 them for production software would be kernel developers for a 
 specific cpu family.
I'm saying it doesn't make sense because I have worked on / implemented some
lock-free data structures. I've shipped software that relied on them. None of
the literature I've read ever had any algorithms that relied on "getting
things done quickly" after a CAS.

Fundamentally it can't work, since the thread can be interrupted immediately
after completing the CAS. So if your algorithm relies on something else
happening within a specific time frame after a CAS, it is not going to work.

So I'm looking for an explanation or a pointer to an algorithm that exhibits
what you describe, because it is counter to my experience.
 They rely quite heavily on the premise that atomic operations 
 happen immediately in the codegen and then based upon the 
 results do set actions in response. This is timing based it has 
 to be preciese or they will interfere with each other. You do 
 not have much lee-way before you start getting segfaults.
I can see that getting an update done quickly can help with contention, but
if the algorithm breaks when things aren't done quickly you're pretty much
screwed afaik. I mean there's no way to guarantee any sequence of
instructions gets completed within a given time frame, on x86 at least.
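For what it's worth, a minimal sketch of the kind of CAS retry loop the literature builds on, in core.atomic terms (the helper name is invented): correctness comes from the CAS succeeding or failing atomically, not from how quickly the surrounding code runs.

```d
// Lock-free increment via a CAS retry loop. The thread can be preempted at
// any point; the loop simply retries until its CAS wins, so there is no
// timing assumption anywhere.
import core.atomic : atomicLoad, cas;

void atomicIncrement(shared(int)* counter)
{
    for (;;)
    {
        int old = atomicLoad(*counter);
        if (cas(counter, old, old + 1))
            return;   // our update was applied atomically
        // otherwise another thread changed *counter first; retry with the new value
    }
}
```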
Nov 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/11/2023 2:48 AM, claptrap wrote:
 So im looking for an explanation or a pointer to an algorithm that 
 exhibits what you describe because it is counter to my experience.
By any chance did you use GC-based memory management for it? If you did, that
would explain it. GC-based memory management removes a lot of the complexity
surrounding the removal of elements.
Nov 15 2023
next sibling parent reply Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Wednesday, 15 November 2023 at 14:44:52 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 16/11/2023 2:48 AM, claptrap wrote:
 So im looking for an explanation or a pointer to an algorithm 
 that exhibits what you describe because it is counter to my 
 experience.
By any chance did you use GC based memory management for it? If you did, that would explain it. GC based memory management removes a lot of complexity surrounding removing of elements.
What's wrong with using the gc?
Nov 15 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/11/2023 5:41 AM, Imperatorn wrote:
 What's wrong with using the gc?
Nothing. In fact it is one of, if not the, best way to implement a lock-free
concurrent data structure without segfaults. It is of particular interest
because it turns a nearly impossible problem into a hard problem that can be
solved in a reasonable time frame.
Nov 15 2023
prev sibling parent reply claptrap <clap trap.com> writes:
On Wednesday, 15 November 2023 at 14:44:52 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 16/11/2023 2:48 AM, claptrap wrote:
 So im looking for an explanation or a pointer to an algorithm 
 that exhibits what you describe because it is counter to my 
 experience.
By any chance did you use GC based memory management for it? If you did, that would explain it. GC based memory management removes a lot of complexity surrounding removing of elements.
C++ and assembler. I used an MPSC "garbage queue" for freeing memory, and
there were points in the application where I knew it was safe to empty the
queue. So you could maybe see that as "GC-like".

But GC or not doesn't answer the main question: what LF algorithm depends on
a sequence of instructions being done immediately after a CAS? How do you
ever enforce that on x86?
Nov 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/11/2023 8:53 AM, claptrap wrote:
 C++ and assembler. I used a MPSC "garbage queue" for freeing memory, and 
 there were points in the application where I knew it was safe to empty 
 the queue. So you could maybe see that as "GC like".
If you had known-good points to release it at, yeah, that is a known-good
strategy. I went totally manual, without any such external assistance. So if
you want to know the area of the literature I was reading: it was anything
that did not apply outside help to deallocate.
 But GC or not doesn't answer the main question, what LF algorithm 
 depends on a a sequence of instructions being done immediately after a 
 CAS? How do you ever enforce that on x86?
That's the fun part: you can't enforce it on any ISA. Signals, interrupts.

If I hadn't been so distracted by the fact that things could work on ldc but
not on dmd, I would've realized that what I was trying to do couldn't work.
There is only one way to describe myself going down that path: a fool ;)
Nov 15 2023
parent reply claptrap <clap trap.com> writes:
On Thursday, 16 November 2023 at 04:25:52 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 But GC or not doesn't answer the main question, what LF 
 algorithm depends on a a sequence of instructions being done 
 immediately after a CAS? How do you ever enforce that on x86?
That's the fun part, you can't enforce it on any ISA.
I know, that's the point: you have zero timing guarantees, since your
instructions can be interrupted at any point. So any algorithm relying on
"timing" is doomed to fail. I.e. it makes no difference if the CAS is inline
or wrapped in a function call.

LF algorithms rely on a sequence of operations being done in a specific
order, and that order being coherent across cores/threads.
Nov 16 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 17/11/2023 11:50 AM, claptrap wrote:
 LF algorithms rely on a sequence of operations being done in a specific 
 order, and that order being coherent across cores/threads.
Yes, but there is a condition on this:

1. Each operation must be atomic, or:
2. Operate on atomically synchronized memory

But most importantly:

3. It must be predictable

When you don't inline, you get additional steps added that may not hold these
conditions: they can be non-atomic and operate on non-atomically synchronized
memory.

This is why it matters that it doesn't inline. It adds variability between
the steps, so that it isn't doing what you think the algorithm is doing. It
introduces unpredictability.
Nov 16 2023
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Friday, 17 November 2023 at 04:12:32 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 17/11/2023 11:50 AM, claptrap wrote:
 LF algorithms rely on a sequence of operations being done in a 
 specific order, and that order being coherent across 
 cores/threads.
Yes but there is a condition on this: 1. Each operation must be atomic, or: 2. Operate on an atomically synchronized memory But most importantly: 3. Must be predictable When you don't inline you get additional steps added that may not hold this condition. Where it can be not atomic and not operating on non-atomically synchronized memory. This is why it matters that it doesn't inline. It adds variability between the steps so that it isn't doing what you think the algorithm is doing. It introduces unpredictability.
When you rely on more than one operation being atomic, you are already on the
losing team. Just because two operations are right next to each other in the
machine code, it does not mean they will be executed right after each other.
Another thread or processor might invalidate the condition you established
with the first instruction that the other is relying on.

Therefore your algorithm is only correct if you are not relying on
predictable execution order.
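A small illustration of that point, sketched with core.atomic (names invented): two adjacent atomic operations do not add up to one atomic operation, which is exactly the gap a single CAS closes.

```d
import core.atomic : atomicLoad, atomicOp, cas;

shared int freeSlots;

// Broken: the load and the decrement are each atomic, but another thread can
// take the last slot between them, however close together the generated
// instructions happen to be.
bool tryTakeSlotBroken()
{
    if (atomicLoad(freeSlots) > 0)
    {
        atomicOp!"-="(freeSlots, 1);   // may drive freeSlots below zero
        return true;
    }
    return false;
}

// Correct: the check and the update form one atomic step; on interference we
// simply retry.
bool tryTakeSlot()
{
    for (;;)
    {
        int cur = atomicLoad(freeSlots);
        if (cur <= 0)
            return false;
        if (cas(&freeSlots, cur, cur - 1))
            return true;
    }
}
```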
Nov 17 2023
prev sibling parent reply claptrap <clap trap.com> writes:
On Friday, 17 November 2023 at 04:12:32 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 17/11/2023 11:50 AM, claptrap wrote:
 LF algorithms rely on a sequence of operations being done in a 
 specific order, and that order being coherent across 
 cores/threads.
Yes but there is a condition on this: 1. Each operation must be atomic, or: 2. Operate on an atomically synchronized memory
That's a given; I mean, the whole point is that any variable shared across
threads needs to be accessed atomically.
 But most importantly:
 3. Must be predictable

 When you don't inline you get additional steps added that may 
 not hold this condition. Where it can be not atomic and not 
 operating on non-atomically synchronized memory.
CAS operates on a specific memory location, and that address will be passed
into the function; there's no way for the function to change this to break
alignment and hence break the atomicity. In fact, on x86 alignment doesn't
matter anyway if you have a lock prefix. On ARM it does, IIRC, but ARM also
has less strict memory ordering.

What instructions the compiler inserts around the CAS instruction are
irrelevant; none of what they do can break the CAS. The only thing the
function has that can be used to barf things up is the memory location;
everything else it has is thread-local. So the only way it can break the CAS
is by reading or writing to the memory location in a non-atomic manner. It
literally has no instructions to do so unless the programmer tells it to.

I mean, if the compiler emits instructions to read or write to a pointer
without the programmer instructing it to do so, your compiler is broken and
you'll be getting segfaults everywhere anyway.
Nov 17 2023
parent reply IGotD- <nise nise.com> writes:
On Friday, 17 November 2023 at 10:25:31 UTC, claptrap wrote:
 What instructions the compiler inserts around the CAS 
 instruction are irrelevant, none of what they do can break the 
 CAS.
It is not irrelevant. You cannot break the CAS itself, as it is usually
implemented according to the SW ABI of the architecture. However, you must
instruct the compiler so that reordering optimizations don't spill over the
CAS implementation. If optimizations make instructions spill over to the
wrong side of the CAS, then you can end up with a non-working algorithm.

So not only do you perhaps need to insert a memory barrier instruction,
depending on the ISA, you also need to instruct the compiler not to reorder
across the atomic instructions. This is usually implicit, as atomic
operations are implemented as intrinsics which automatically do this for you.
Nov 17 2023
parent claptrap <clap trap.com> writes:
On Friday, 17 November 2023 at 12:41:39 UTC, IGotD- wrote:
 On Friday, 17 November 2023 at 10:25:31 UTC, claptrap wrote:
 What instructions the compiler inserts around the CAS 
 instruction are irrelevant, none of what they do can break the 
 CAS.
It is not irrelevant. You cannot break the CAS itself as it is usually implemented according to the SW ABI of the architecture. However, you must instruct the compiler so that reordering optimizations don't spill over the CAS implementations. If optimizations make instructions spill over to the wrong side of the CAS, then you possibly will end up in a non working algorithm.
You're missing the point. If your compiler is reordering instructions around
a CAS, it's broken. If it is doing that, then it's a problem whether you
wrapped the CAS in a function call or not.

And not only that, but the instructions for the function are all
thread-local. The only shared thing the function has is the address of the
atomic variable in memory. There's nothing in that situation the compiler can
reorder that would break multithreaded code that would not also break
single-threaded code.

If your CAS intrinsic or instruction needs a fence, it needs it whether
inline or wrapped in a function. Wrapping it in a function call doesn't
somehow cause ordering issues.
Nov 17 2023
prev sibling next sibling parent reply ryuukk_ <ryuukk.dev gmail.com> writes:
These should be builtins imo. LDC could reuse LLVM intrinsics, GDC could 
reuse GCC stuff, and DMD would reuse whatever is in ``core.atomic``.

https://llvm.org/docs/Atomics.html#libcalls-atomic

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
Nov 14 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
For ldc and gdc core.atomic already uses the backend intrinsics. It is 
only dmd that doesn't do that and hence does bad things.

Whatever is missing that stdatomic.h needs, should be added.

https://github.com/ldc-developers/ldc/blob/master/runtime/druntime/src/core/internal/atomic.d#L24
Nov 14 2023
prev sibling parent Denis Feklushkin <feklushkin.denis gmail.com> writes:
On Tuesday, 14 November 2023 at 03:53:09 UTC, Walter Bright wrote:
 There have been multiple requests to add this support. It can 
 all be done with a library implemented with some inline 
 assembler.

 Anyone want the glory of implementing this?
Probably, it was previously implied that core.* is for the internal needs of
druntime and Phobos? And the existence of core.stdc.* is a forced solution,
since using libc is the easiest way to create an interface between druntime
and the OSes.
Nov 14 2023