digitalmars.D - core.stdc.stdatomic
- Walter Bright (3/3) Nov 13 2023 There have been multiple requests to add this support. It can all be don...
- Richard (Rikki) Andrew Cattermole (6/6) Nov 13 2023 For the most part it looks like everything is in core.atomic already.
- Walter Bright (2/10) Nov 14 2023 Aside from fence, they can all be done with simple functions. So why not...
- Walter Bright (4/8) Nov 14 2023 The fence instructions are supported by DMD's inline assembler, so they ...
- Richard (Rikki) Andrew Cattermole (13/17) Nov 14 2023 From C11 spec for ``atomic_signal_fence``:
- Walter Bright (11/30) Nov 14 2023 dmd does not reorder code around inline assembler instructions. So this ...
- Richard (Rikki) Andrew Cattermole (36/44) Nov 14 2023 You have mostly caught on to the problems here that I have experienced.
- Walter Bright (8/11) Nov 14 2023 Everything I've read about writing correct synchronization says it's not...
- Richard (Rikki) Andrew Cattermole (12/27) Nov 14 2023 You have understood the simplified parts of the problem.
- Richard (Rikki) Andrew Cattermole (15/15) Nov 14 2023 Question: Why do people want another wrapper around some inline assembly...
- Walter Bright (4/7) Nov 14 2023 Because they have existing carefully crafted code in C and want to trans...
- Richard (Rikki) Andrew Cattermole (9/21) Nov 14 2023 So what I'm getting at here is that we can already do a best effort
- Walter Bright (3/6) Nov 14 2023 I do not understand why a function that consists of a FENCE instruction ...
- Richard (Rikki) Andrew Cattermole (5/7) Nov 14 2023 Yes, ideally a memory barrier wouldn't matter for how it is executed.
- Walter Bright (2/11) Nov 15 2023 Is this only a problem with load/cas?
- Richard (Rikki) Andrew Cattermole (16/17) Nov 15 2023 A store by itself should be ok.
- Walter Bright (4/9) Nov 15 2023 Thanks for the book recommendation. The review comments, though, say its...
- Richard (Rikki) Andrew Cattermole (9/10) Nov 15 2023 Yes it is.
- Walter Bright (2/3) Nov 15 2023 Indeed. I ordered it. Thanks!
- claptrap (6/15) Nov 15 2023 That doesn't make any sense, the whole point of CAS is that it is
- Richard (Rikki) Andrew Cattermole (19/22) Nov 15 2023 I understand that it seems like it does not make sense. Lock-free
- DrDread (5/10) Nov 15 2023 Whoever writes code like that deserves all the problems it creates.
- Richard (Rikki) Andrew Cattermole (3/6) Nov 15 2023 Yup, that's lock-free concurrent data structures for you. Need
- claptrap (19/31) Nov 15 2023 I'm saying it doesn't make sense because I have worked on /
- Richard (Rikki) Andrew Cattermole (4/6) Nov 15 2023 By any chance did you use GC based memory management for it?
- Imperatorn (3/10) Nov 15 2023 What's wrong with using the gc?
- Richard (Rikki) Andrew Cattermole (5/6) Nov 15 2023 Nothing. In fact it is one of if not the best way to implement a
- claptrap (9/16) Nov 15 2023 C++ and assembler. I used a MPSC "garbage queue" for freeing
- Richard (Rikki) Andrew Cattermole (12/18) Nov 15 2023 If you had known good points to release it, yeah that is a known to be
- claptrap (10/15) Nov 16 2023 I know, that's the point, you have zero timing guarantees since
- Richard (Rikki) Andrew Cattermole (12/14) Nov 16 2023 Yes but there is a condition on this:
- Stefan Koch (12/27) Nov 17 2023 when you rely on more than one operation being atomic you are
- claptrap (20/32) Nov 17 2023 That's a given, I mean the whole point is any variable sharing
- ryuukk_ (5/5) Nov 14 2023 These should be builtins imo, LDC could reuse LLVM intrinsics and
- Richard (Rikki) Andrew Cattermole (4/4) Nov 14 2023 For ldc and gdc core.atomic already uses the backend intrinsics. It is
- Denis Feklushkin (5/9) Nov 14 2023 Probably, previously it was implied that core.* is for own needs
There have been multiple requests to add this support. It can all be done with a library implemented with some inline assembler. Anyone want the glory of implementing this?
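A minimal sketch, not the real module, of what such wrappers could look like if they simply delegate to the existing core.atomic primitives; the C11-style names and the pragma(inline, true) usage are illustrative assumptions:

```d
// Hypothetical sketch only -- not the actual core.stdc.stdatomic.
// Each wrapper forwards to core.atomic, which already contains the
// per-platform implementations.
import core.atomic : atomicLoad, atomicStore, atomicFetchAdd;

pragma(inline, true)
T atomic_load(T)(ref shared T obj)
{
    return atomicLoad(obj);            // seq_cst by default, as in C11
}

pragma(inline, true)
void atomic_store(T, V)(ref shared T obj, V desired)
{
    atomicStore(obj, desired);
}

pragma(inline, true)
T atomic_fetch_add(T, V)(ref shared T obj, V operand)
{
    // atomicFetchAdd (newer druntime) returns the previous value,
    // matching C11 fetch_add semantics.
    return atomicFetchAdd(obj, operand);
}
```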
Nov 13 2023
For the most part it looks like everything is in core.atomic already. The only thing that looks like it isn't is ``atomic_signal_fence``, and that has to be an intrinsic, as it is specifically a directive to the backend. But reminder: only dmd is missing the intrinsics necessary to do atomics correctly without memory errors. Dmd is a write-off if you are doing anything more than just reference counting with it.
Nov 13 2023
On 11/13/2023 8:56 PM, Richard (Rikki) Andrew Cattermole wrote:
> For the most part it looks like everything is in core.atomic already. The only thing that looks like it isn't is ``atomic_signal_fence``, and that has to be an intrinsic, as it is specifically a directive to the backend. But reminder: only dmd is missing the intrinsics necessary to do atomics correctly without memory errors. Dmd is a write-off if you are doing anything more than just reference counting with it.

Aside from fence, they can all be done with simple functions. So why not?
Nov 14 2023
On 11/14/2023 12:24 AM, Walter Bright wrote:
>> Dmd is a write-off if you are doing anything more than just reference counting with it.
> Aside from fence, they can all be done with simple functions. So why not?

The fence instructions are supported by DMD's inline assembler, so they can be done, too, as simple functions. Why is this a write-off?
Nov 14 2023
On 14/11/2023 9:35 PM, Walter Bright wrote:
> The fence instructions are supported by DMD's inline assembler, so they can be done, too, as simple functions. Why is this a write-off?

From the C11 spec for ``atomic_signal_fence``:

NOTE 2 Compiler optimizations and reorderings of loads and stores are inhibited in the same way as with atomic_thread_fence, but the hardware fence instructions that atomic_thread_fence would have inserted are not emitted.

In other words, no instructions are emitted; it does not map to an x86 instruction.

I've said this before: dmd is a write-off for lock-free concurrent data structures. Having ANY extra functions in the call stack can throw off the timing and end in segfaults. It MUST be inlined! This will be true of any use case for atomics beyond reference counting.
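To make the distinction concrete, a hedged sketch (assumed names, not existing druntime code) of what the two fences demand of an implementation:

```d
import core.atomic : atomicFence;

void atomic_thread_fence()
{
    atomicFence();   // must emit a hardware fence, e.g. MFENCE / a locked op on x86
}

void atomic_signal_fence()
{
    // No instruction belongs here. The only requirement is that the
    // compiler does not move loads/stores across this point (it exists
    // for code that synchronizes with a signal handler on the same
    // thread), so an opaque, non-inlined function is not a faithful
    // implementation -- hence "it has to be an intrinsic".
}
```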
Nov 14 2023
On 11/14/2023 12:58 AM, Richard (Rikki) Andrew Cattermole wrote:
> On 14/11/2023 9:35 PM, Walter Bright wrote:
>> The fence instructions are supported by DMD's inline assembler, so they can be done, too, as simple functions. Why is this a write-off?
> From the C11 spec for ``atomic_signal_fence``:
> NOTE 2 Compiler optimizations and reorderings of loads and stores are inhibited in the same way as with atomic_thread_fence, but the hardware fence instructions that atomic_thread_fence would have inserted are not emitted.
> In other words, no instructions are emitted; it does not map to an x86 instruction.

dmd does not reorder code around inline assembler instructions. So this is not a problem.

> I've said this before: dmd is a write-off for lock-free concurrent data structures. Having ANY extra functions in the call stack can throw off the timing and end in segfaults. It MUST be inlined! This will be true of any use case for atomics beyond reference counting.

Correct multithreaded code is about doing things in the correct sequence, it is not about timing. If synchronization code is dependent on instruction timing, it is inevitably going to fail because too many things affect timing.

Yes, dmd's code here will be significantly slower than an intrinsic, but I don't see how it would be incorrect.

As a path forward for DMD:

1. implement core.stdc.stdatomic in terms of core.atomic and/or core.internal.atomic
2. eventually add intrinsics to dmd to replace them
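As an illustration of step 1, a hedged sketch of one such function built only from today's core.atomic; the name and signature follow C11 (for simple value types), and the failure-path re-load is where a plain library function diverges slightly from a true intrinsic:

```d
import core.atomic : cas, atomicLoad;

bool atomic_compare_exchange_strong(T)(shared(T)* obj, T* expected, T desired)
{
    if (cas(obj, *expected, desired))
        return true;

    // C11 semantics: on failure, *expected receives the current value.
    // Here that is a separate atomic load, whereas a hardware CMPXCHG
    // hands back the observed value directly -- one reason an intrinsic
    // is still the better long-term answer.
    *expected = atomicLoad(*obj);
    return false;
}
```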
Nov 14 2023
On 15/11/2023 7:13 AM, Walter Bright wrote:
> Correct multithreaded code is about doing things in the correct sequence, it is not about timing. If synchronization code is dependent on instruction timing, it is inevitably going to fail because too many things affect timing.

You have mostly caught on to the problems here that I have experienced.

About the only people who can write lock-free concurrent data structures reliably work on kernels, and yes, kernels do have them. As a subject matter they are only about 20 years old (30 for some of the key theory), which is very young for data structures.

To quote Andrei about its significance:

If you believe that's a fundamental enough question to award a prize to the answerer, so did others. In 2003, Maurice Herlihy was awarded the Edsger W. Dijkstra Prize in Distributed Computing for his seminal 1991 paper "Wait-Free Synchronization" (see http://www.podc.org/dijkstra/2003.html, which includes a link to the paper, too). In his tour-de-force paper, Herlihy proves which primitives are good and which are bad for building lock-free data structures. That brought some seemingly hot hardware architectures to instant obsolescence, while clarifying what synchronization primitives should be implemented in future hardware.

https://drdobbs.com/lock-free-data-structures/184401865

So it is timing based: you have set points which act as synchronization events, and then everything after them must work exactly the same on each core. This is VERY HARD!

I got grey hair because of dmd using inline assembly, because function calls do not result in "exact" timings after those synchronization points! But it was possible with ldc with a lot of work, just not dmd.

I'm not the only one who has gone down this path:

https://github.com/MartinNowak/lock-free
https://github.com/mw66/liblfdsd
https://github.com/nin-jin/go.d

> As a path forward for DMD:
> 1. implement core.stdc.stdatomic in terms of core.atomic and/or core.internal.atomic
> 2. eventually add intrinsics to dmd to replace them

Almost. Everything that is in core.stdc.stdatomic should have codegen similar to the C compiler's; if it doesn't, that is a bug. This is what gives it the value desired: the guarantee that it will line up.

So my proposal is almost the same, but applies to all compilers:

1. Implement functions that wrap core.internal.atomic iff those implementations are intrinsics, equivalent to an intrinsic, or use locking.
2. Implement intrinsics in core.internal.atomic and then implement the corresponding wrapper function for stdatomic.
Nov 14 2023
On 11/14/2023 7:49 PM, Richard (Rikki) Andrew Cattermole wrote:
> So it is timing based: you have set points which act as synchronization events, and then everything after them must work exactly the same on each core. This is VERY HARD!

Everything I've read about writing correct synchronization says it's not about timing, it's about sequencing. For example, https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691

Or maybe you and I are just misunderstanding terms. For example, fences. Fences enforce memory-ordering constraints, not timing. Happens-before and synchronizes-with are sequencing, not timing.
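For what it's worth, the standard textbook illustration of that sequencing guarantee, written against core.atomic (variable names are mine):

```d
import core.atomic;

shared int  data;
shared bool ready;

void producer()
{
    atomicStore!(MemoryOrder.raw)(data, 42);     // relaxed store of the payload
    atomicStore!(MemoryOrder.rel)(ready, true);  // release: publishes `data`
}

void consumer()
{
    while (!atomicLoad!(MemoryOrder.acq)(ready)) {}   // acquire: pairs with the release
    assert(atomicLoad!(MemoryOrder.raw)(data) == 42); // guaranteed, however long either thread takes
}
```

The acquire load that observes `true` is guaranteed to observe the write to `data` as well; wall-clock timing never enters into it.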
Nov 14 2023
On 15/11/2023 7:32 PM, Walter Bright wrote:
> On 11/14/2023 7:49 PM, Richard (Rikki) Andrew Cattermole wrote:
>> So it is timing based: you have set points which act as synchronization events, and then everything after them must work exactly the same on each core. This is VERY HARD!
> Everything I've read about writing correct synchronization says it's not about timing, it's about sequencing. For example, https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691
> Or maybe you and I are just misunderstanding terms. For example, fences. Fences enforce memory-ordering constraints, not timing. Happens-before and synchronizes-with are sequencing, not timing.

You have understood the simplified parts of the problem.

The problem is when concurrency is in action: multiple cores operating on the same memory in the same time units. You reach a shifting-sands feeling where multiple contradictory facts can be true at the same time on different cores. Memory can be mapped on one, and not on another.

Did I mention I have grey hair because of this? This might be a reason why I have strong opinions about D foundations such as symbols, since that experience ;)

Also: https://github.com/dlang/dmd/pull/15816
Nov 14 2023
Question: Why do people want another wrapper around some inline assembly that already exists in core.atomic?

Answer: they don't. This does not allow people to implement any new ideas. We don't need another wrapper around the same inline assembly that has the exact same tradeoffs with inlinability (it can't be inlined) and without the ability to succinctly communicate with the backend to ensure the codegen looks the way it needs to.

You want to port code from C? Use core.atomic. But wait, it has different behaviors? Well yeah... it is not designed around the intrinsics that stdatomic.h is, which are what give it any useful meaning. See: ``kill_dependency``, ``atomic_init`` and ``atomic_signal_fence``.

Writing a wrapper around stdatomic.h would take probably 2 hours. You don't need to write any inline assembly; it's already done in core.atomic. But realistically all you're doing is changing some names and the order of parameters, with slightly different types.
Nov 14 2023
On 11/14/2023 12:51 AM, Richard (Rikki) Andrew Cattermole wrote:
> Question: Why do people want another wrapper around some inline assembly that already exists in core.atomic?

Because they have existing carefully crafted code in C and want to translate it to D.

> Writing a wrapper around stdatomic.h would take probably 2 hours.

Great! That saves 2 hours for each stdatomic C user who wants to get their code into D.
Nov 14 2023
On 15/11/2023 8:00 AM, Walter Bright wrote:
> On 11/14/2023 12:51 AM, Richard (Rikki) Andrew Cattermole wrote:
>> Question: Why do people want another wrapper around some inline assembly that already exists in core.atomic?
> Because they have existing carefully crafted code in C and want to translate it to D.
>> Writing a wrapper around stdatomic.h would take probably 2 hours.
> Great! That saves 2 hours for each stdatomic C user who wants to get their code into D.

So what I'm getting at here is that we can already do a best-effort approach to this by swapping stdatomic to core.atomic, but it does not bring the value that makes people want to do so. The codegen must be similar: if the C compiler for a target uses a function call, so can we; if it uses intrinsics with inlining, so must we. This way the behavior will be similar, and the port can be a success. Otherwise you are introducing new and potentially wrong behavior, which means a failure in trying to fulfill your idea.
Nov 14 2023
On 11/14/2023 7:07 PM, Richard (Rikki) Andrew Cattermole wrote:
> This way the behavior will be similar, and the port can be a success. Otherwise you are introducing new and potentially wrong behavior, which means a failure in trying to fulfill your idea.

I do not understand why a function that consists of a FENCE instruction will be a failure compared to a FENCE instruction inlined.
Nov 14 2023
On 15/11/2023 7:33 PM, Walter Bright wrote:
> I do not understand why a function that consists of a FENCE instruction will be a failure compared to a FENCE instruction inlined.

Yes, ideally a memory barrier wouldn't matter for how it is executed. But it does matter for load/cas, because what was true when the operation executed may not be true any longer by the time the next operation is performed, when deep call chains are in use.
Nov 14 2023
On 11/14/2023 10:59 PM, Richard (Rikki) Andrew Cattermole wrote:
> On 15/11/2023 7:33 PM, Walter Bright wrote:
>> I do not understand why a function that consists of a FENCE instruction will be a failure compared to a FENCE instruction inlined.
> Yes, ideally a memory barrier wouldn't matter for how it is executed. But it does matter for load/cas, because what was true when the operation executed may not be true any longer by the time the next operation is performed, when deep call chains are in use.

Is this only a problem with load/cas?
Nov 15 2023
On 15/11/2023 10:33 PM, Walter Bright wrote:
> Is this only a problem with load/cas?

A store by itself should be ok. If you do any loading, like you do with cas or atomicOp, then the timings might not be in any way predictable.

The general rule of thumb for this is: if an atomic operation is by itself for a given block of memory, then you can ignore timings. If it is used in conjunction with another atomic instruction (or more), you have to consider whether timings can mess it up.

But wrt. load/cas, they are the two big primitives that are used very heavily in the literature, so they need to be prioritized over the others for implementing intrinsics. They will also be used very heavily in any such data structure function, from my own experience.

A random fun fact: for the book I recommend on the subject, "The Art of Multiprocessor Programming", one of the authors teaches at the same university as Roy! https://shop.elsevier.com/books/the-art-of-multiprocessor-programming/herlihy/978-0-12-415950-1
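A small sketch of that load-then-cas interplay (names are mine): the snapshot taken by the load can already be stale when the cas runs, and the retry loop is what makes that harmless:

```d
import core.atomic : atomicLoad, cas;

shared int tickets;
enum limit = 100;

/// Claims the next ticket, or returns -1 if none are left.
int claimTicket()
{
    for (;;)
    {
        int seen = atomicLoad(tickets);     // snapshot
        if (seen >= limit)
            return -1;
        if (cas(&tickets, seen, seen + 1))  // publish seen+1 only if still `seen`
            return seen;
        // another thread raced us between the load and the cas; retry
    }
}
```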
Nov 15 2023
On 11/15/2023 1:55 AM, Richard (Rikki) Andrew Cattermole wrote:
> A random fun fact: for the book I recommend on the subject, "The Art of Multiprocessor Programming", one of the authors teaches at the same university as Roy! https://shop.elsevier.com/books/the-art-of-multiprocessor-programming/herlihy/978-0-12-415950-1

Thanks for the book recommendation. The review comments, though, say its examples are all in Java. Being in Java, an interpreter, is it relevant to machine-level programming?
Nov 15 2023
On 16/11/2023 3:30 PM, Walter Bright wrote:
> Being in Java, an interpreter, is it relevant to machine-level programming?

Yes, it is. Java has a well-defined memory subsystem; it isn't a toy interpreter. I.e. https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/package-summary.html

As a subject matter I would consider it in the category of cross-referencing required. So even if it isn't the book you want to reference, it is still worth cross-referencing against at times.

For $7 from thriftbooks it's worth the risk.
Nov 15 2023
On 11/15/2023 8:07 PM, Richard (Rikki) Andrew Cattermole wrote:
> For $7 from thriftbooks it's worth the risk.

Indeed. I ordered it. Thanks!
Nov 15 2023
On Wednesday, 15 November 2023 at 06:59:45 UTC, Richard (Rikki) Andrew Cattermole wrote:
> On 15/11/2023 7:33 PM, Walter Bright wrote:
>> I do not understand why a function that consists of a FENCE instruction will be a failure compared to a FENCE instruction inlined.
> Yes, ideally a memory barrier wouldn't matter for how it is executed. But it does matter for load/cas, because what was true when the operation executed may not be true any longer by the time the next operation is performed, when deep call chains are in use.

That doesn't make any sense. The whole point of CAS is that it is atomic; immediately after it has completed you have no guarantees anyway, so what difference does it make if it's wrapped in a function call?
Nov 15 2023
On 15/11/2023 10:55 PM, claptrap wrote:
> That doesn't make any sense. The whole point of CAS is that it is atomic; immediately after it has completed you have no guarantees anyway, so what difference does it make if it's wrapped in a function call?

I understand that it seems like it does not make sense. Lock-free concurrent data structures are a highly advanced topic that very few people in the world today can implement successfully. About the only people who are qualified to touch them for production software would be kernel developers for a specific CPU family.

They rely quite heavily on the premise that atomic operations happen immediately in the codegen, and then based upon the results perform set actions in response. This is timing-based; it has to be precise or they will interfere with each other. You do not have much leeway before you start getting segfaults.

I only ever saw partial success with ldc after seven months of researching them. For obvious reasons I do not recommend people go down this particular path of study, because you are going to get burned pretty badly, guaranteed.

Regardless, compilers like gcc have intrinsics for all of stdatomic. We need to be matching that, otherwise what D supports will not line up with what the system C compiler can offer in terms of use cases.

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
Nov 15 2023
On Wednesday, 15 November 2023 at 10:26:32 UTC, Richard (Rikki) Andrew Cattermole wrote:
> They rely quite heavily on the premise that atomic operations happen immediately in the codegen, and then based upon the results perform set actions in response. This is timing-based; it has to be precise or they will interfere with each other. You do not have much leeway before you start getting segfaults.

Whoever writes code like that deserves all the problems it creates. This assumes specific CPU behaviour and may be completely broken on other systems.
Nov 15 2023
On 16/11/2023 2:26 AM, DrDread wrote:
> Whoever writes code like that deserves all the problems it creates. This assumes specific CPU behaviour and may be completely broken on other systems.

Yup, that's lock-free concurrent data structures for you. You need specialists to have any chance of them working reliably.
Nov 15 2023
On Wednesday, 15 November 2023 at 10:26:32 UTC, Richard (Rikki) Andrew Cattermole wrote:
> On 15/11/2023 10:55 PM, claptrap wrote:
> I understand that it seems like it does not make sense. Lock-free concurrent data structures are a highly advanced topic that very few people in the world today can implement successfully. About the only people who are qualified to touch them for production software would be kernel developers for a specific CPU family.

I'm saying it doesn't make sense because I have worked on / implemented some lock-free data structures. I've shipped software that relied on them. None of the literature I've read ever had any algorithms that relied on "getting things done quickly" after a CAS. Fundamentally it can't work, since the thread can be interrupted immediately after completing the CAS. So if your algorithm relies on something else happening within a specific time frame after a CAS, it is not going to work.

So I'm looking for an explanation or a pointer to an algorithm that exhibits what you describe, because it is counter to my experience.

> They rely quite heavily on the premise that atomic operations happen immediately in the codegen, and then based upon the results perform set actions in response. This is timing-based; it has to be precise or they will interfere with each other. You do not have much leeway before you start getting segfaults.

I can see that getting an update done quickly can help with contention, but if the algorithm breaks when things aren't done quickly you're pretty much screwed afaik. I mean there's no way to guarantee any sequence of instructions gets completed within a given time frame, on x86 at least.
Nov 15 2023
On 16/11/2023 2:48 AM, claptrap wrote:
> So I'm looking for an explanation or a pointer to an algorithm that exhibits what you describe, because it is counter to my experience.

By any chance did you use GC-based memory management for it? If you did, that would explain it. GC-based memory management removes a lot of the complexity surrounding removal of elements.
Nov 15 2023
On Wednesday, 15 November 2023 at 14:44:52 UTC, Richard (Rikki) Andrew Cattermole wrote:
> On 16/11/2023 2:48 AM, claptrap wrote:
>> So I'm looking for an explanation or a pointer to an algorithm that exhibits what you describe, because it is counter to my experience.
> By any chance did you use GC-based memory management for it? If you did, that would explain it. GC-based memory management removes a lot of the complexity surrounding removal of elements.

What's wrong with using the GC?
Nov 15 2023
On 16/11/2023 5:41 AM, Imperatorn wrote:
> What's wrong with using the GC?

Nothing. In fact it is one of the best ways, if not the best way, to implement a lock-free concurrent data structure without segfaults. It is of particular interest because it turns a nearly impossible problem into a hard problem that can be solved in a reasonable time frame.
Nov 15 2023
On Wednesday, 15 November 2023 at 14:44:52 UTC, Richard (Rikki) Andrew Cattermole wrote:
> On 16/11/2023 2:48 AM, claptrap wrote:
>> So I'm looking for an explanation or a pointer to an algorithm that exhibits what you describe, because it is counter to my experience.
> By any chance did you use GC-based memory management for it? If you did, that would explain it. GC-based memory management removes a lot of the complexity surrounding removal of elements.

C++ and assembler. I used an MPSC "garbage queue" for freeing memory, and there were points in the application where I knew it was safe to empty the queue. So you could maybe see that as "GC-like".

But GC or not doesn't answer the main question: what LF algorithm depends on a sequence of instructions being done immediately after a CAS? How do you ever enforce that on x86?
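For readers unfamiliar with the idea, a hedged sketch of such a garbage queue (all names and details are assumptions, not claptrap's code): any thread pushes retired nodes with a CAS loop, and the single consumer detaches the whole list at a point it knows is safe, then frees it without contention:

```d
import core.atomic : atomicLoad, atomicExchange, cas;
import core.stdc.stdlib : free;

struct Retired { void* payload; Retired* next; }

__gshared Retired* garbage;              // touched only through atomics

void retire(Retired* n)                  // multi-producer push
{
    for (;;)
    {
        Retired* old = atomicLoad(garbage);
        n.next = old;
        if (cas(&garbage, old, n))
            return;
    }
}

void drainAtSafePoint()                  // single consumer, at a known-safe point
{
    // Detach the whole list in one atomic step (atomicExchange is in
    // newer druntime); after this, no other thread can reach the nodes.
    Retired* list = atomicExchange(&garbage, cast(Retired*) null);
    while (list !is null)
    {
        auto next = list.next;
        free(list);
        list = next;
    }
}
```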
Nov 15 2023
On 16/11/2023 8:53 AM, claptrap wrote:
> C++ and assembler. I used an MPSC "garbage queue" for freeing memory, and there were points in the application where I knew it was safe to empty the queue. So you could maybe see that as "GC-like".

If you had known-good points to release it, yeah, that is a known-good strategy. I went totally manual, without any such external assistance. So if you want to know the area of the literature I was reading: it was anything that did not apply outside help to deallocate.

> But GC or not doesn't answer the main question: what LF algorithm depends on a sequence of instructions being done immediately after a CAS? How do you ever enforce that on x86?

That's the fun part: you can't enforce it on any ISA. Signals, interrupts. If I hadn't been so distracted by the fact that things could work on ldc but not on dmd, I would've realized that what I was trying to do couldn't work. There is only one way to describe myself going down that path: a fool ;)
Nov 15 2023
On Thursday, 16 November 2023 at 04:25:52 UTC, Richard (Rikki) Andrew Cattermole wrote:
>> But GC or not doesn't answer the main question: what LF algorithm depends on a sequence of instructions being done immediately after a CAS? How do you ever enforce that on x86?
> That's the fun part: you can't enforce it on any ISA.

I know, that's the point: you have zero timing guarantees, since your instructions can be interrupted at any point. So any algorithm relying on "timing" is doomed to fail. IE, it makes no difference if the CAS is inline or wrapped in a function call. LF algorithms rely on a sequence of operations being done in a specific order, and that order being coherent across cores/threads.
Nov 16 2023
On 17/11/2023 11:50 AM, claptrap wrote:
> LF algorithms rely on a sequence of operations being done in a specific order, and that order being coherent across cores/threads.

Yes, but there is a condition on this:

1. Each operation must be atomic, or:
2. Operate on atomically synchronized memory

But most importantly:

3. Must be predictable

When you don't inline, you get additional steps added that may not hold to these conditions, where a step can be non-atomic and operate on non-atomically synchronized memory. This is why it matters when it doesn't inline. It adds variability between the steps, so that it isn't doing what you think the algorithm is doing. It introduces unpredictability.
Nov 16 2023
On Friday, 17 November 2023 at 04:12:32 UTC, Richard (Rikki) Andrew Cattermole wrote:
> On 17/11/2023 11:50 AM, claptrap wrote:
>> LF algorithms rely on a sequence of operations being done in a specific order, and that order being coherent across cores/threads.
> Yes, but there is a condition on this:
> 1. Each operation must be atomic, or:
> 2. Operate on atomically synchronized memory
> But most importantly:
> 3. Must be predictable
> When you don't inline, you get additional steps added that may not hold to these conditions, where a step can be non-atomic and operate on non-atomically synchronized memory. This is why it matters when it doesn't inline. It adds variability between the steps, so that it isn't doing what you think the algorithm is doing. It introduces unpredictability.

When you rely on more than one operation being atomic, you are already on the losing team. Just because two operations are right next to each other in the machine code, it does not mean they will be executed right after each other. Another thread or processor might invalidate the condition you established with the first instruction that the other is relying on. Therefore your algorithm is only correct if you are not relying on predictable execution order.
Nov 17 2023
On Friday, 17 November 2023 at 04:12:32 UTC, Richard (Rikki) Andrew Cattermole wrote:
> On 17/11/2023 11:50 AM, claptrap wrote:
>> LF algorithms rely on a sequence of operations being done in a specific order, and that order being coherent across cores/threads.
> Yes, but there is a condition on this:
> 1. Each operation must be atomic, or:
> 2. Operate on atomically synchronized memory

That's a given, I mean the whole point is that any variable shared across threads needs to be accessed atomically.

> But most importantly:
> 3. Must be predictable
> When you don't inline, you get additional steps added that may not hold to these conditions, where a step can be non-atomic and operate on non-atomically synchronized memory.

CAS operates on a specific memory location, and that address will be passed into the function; there's no way for the function to change this to break alignment and hence break the atomicity. In fact, on x86 alignment doesn't matter anyway if you have a lock prefix. On Arm it does, IIRC, but Arm also has less strict memory ordering.

What instructions the compiler inserts around the CAS instruction are irrelevant; none of what they do can break the CAS. The only thing the function has that can be used to barf things up is the memory location; everything else it has is thread-local. So the only way it can break the CAS is by reading or writing to the memory location in a non-atomic manner, and it literally has no instructions to do so unless the programmer tells it to. I mean, if the compiler emits instructions to read or write a pointer without the programmer instructing it to do so, your compiler is broken and you'll be getting segfaults everywhere anyway.
Nov 17 2023
On Friday, 17 November 2023 at 10:25:31 UTC, claptrap wrote:
> What instructions the compiler inserts around the CAS instruction are irrelevant; none of what they do can break the CAS.

It is not irrelevant. You cannot break the CAS itself, as it is usually implemented according to the SW ABI of the architecture. However, you must instruct the compiler so that reordering optimizations don't spill over the CAS implementation. If optimizations make instructions spill over to the wrong side of the CAS, then you may end up with a non-working algorithm.

So not only do you perhaps need to insert a memory barrier instruction, depending on the ISA; you also need to instruct the compiler not to move operations across the atomic instructions. This is usually implicit, as atomic operations are implemented as intrinsics which automatically do this for you.
Nov 17 2023
On Friday, 17 November 2023 at 12:41:39 UTC, IGotD- wrote:
> On Friday, 17 November 2023 at 10:25:31 UTC, claptrap wrote:
>> What instructions the compiler inserts around the CAS instruction are irrelevant; none of what they do can break the CAS.
> It is not irrelevant. You cannot break the CAS itself, as it is usually implemented according to the SW ABI of the architecture. However, you must instruct the compiler so that reordering optimizations don't spill over the CAS implementation. If optimizations make instructions spill over to the wrong side of the CAS, then you may end up with a non-working algorithm.

You're missing the point. If your compiler is reordering instructions around a CAS, it's broken. If it is doing that, then it's a problem whether you wrapped the CAS in a function call or not.

And not only that, but the instructions for the function are all thread-local. The only shared thing the function has is the address of the atomic variable in memory. There's nothing in that situation the compiler can reorder that would break multithreaded code that would not also break single-threaded code.

If your CAS intrinsic or instruction needs a fence, it needs it whether inline or wrapped in a function. Wrapping it in a function call doesn't somehow cause ordering issues.
Nov 17 2023
These should be builtins imo. LDC could reuse LLVM intrinsics, GDC could reuse GCC stuff, and DMD would reuse whatever is in ``core.atomic``.

https://llvm.org/docs/Atomics.html#libcalls-atomic
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
Nov 14 2023
For ldc and gdc, core.atomic already uses the backend intrinsics. It is only dmd that doesn't do that and hence does bad things. Whatever is missing that stdatomic.h needs should be added.

https://github.com/ldc-developers/ldc/blob/master/runtime/druntime/src/core/internal/atomic.d#L24
Nov 14 2023
On Tuesday, 14 November 2023 at 03:53:09 UTC, Walter Bright wrote:
> There have been multiple requests to add this support. It can all be done with a library implemented with some inline assembler. Anyone want the glory of implementing this?

Probably, previously it was implied that core.* is for the internal needs of druntime and Phobos? And the existence of core.stdc.* is a forced solution, since using libc is the easiest way to create an interface between druntime and OSes.
Nov 14 2023