
digitalmars.D - core.stdc.stdatomic

reply Walter Bright <newshound2 digitalmars.com> writes:
There have been multiple requests to add this support. It can all be done
with a library implemented with some inline assembler.

Anyone want the glory of implementing this?
Nov 13 2023
next sibling parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
For the most part it looks like everything is in core.atomic already.

The only thing that looks like it isn't is ``atomic_signal_fence`` and 
that has to be an intrinsic as it is specifically for a backend instruction.

But a reminder: only dmd is missing the intrinsics necessary to do atomics
correctly without memory errors. Dmd is a write-off if you are doing
anything more than just reference counting with it.
Nov 13 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/13/2023 8:56 PM, Richard (Rikki) Andrew Cattermole wrote:
 For the most part it looks like everything is in core.atomic already.
 
 The only thing that looks like it isn't is ``atomic_signal_fence`` and that
 has to be an intrinsic as it is specifically for a backend instruction.
 
 But a reminder: only dmd is missing the intrinsics necessary to do atomics
 correctly without memory errors. Dmd is a write-off if you are doing anything
 more than just reference counting with it.
Aside from fence, they can all be done with simple functions. So why not?
Nov 14 2023
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 12:24 AM, Walter Bright wrote:
 Dmd is a write-off if you are doing anything more than just reference
 counting with it.
Aside from fence, they can all be done with simple functions. So why not?
The fence instructions are supported by DMD's inline assembler, so they can be done, too, as simple functions. Why is this a write-off?
Nov 14 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 14/11/2023 9:35 PM, Walter Bright wrote:
 The fence instructions are supported by DMD's inline assembler, so they 
 can be done, too, as simple functions.
 
 Why is this a write-off?
From the C11 spec for ``atomic_signal_fence``:

 NOTE 2 Compiler optimizations and reorderings of loads and stores are
 inhibited in the same way as with atomic_thread_fence, but the hardware
 fence instructions that atomic_thread_fence would have inserted are not
 emitted.

In other words, no instructions are emitted; it does not map to an x86
instruction.

I've said this before: dmd is a write-off for lock-free concurrent data
structures. Having ANY extra functions in the call stack can throw off the
timing and end in segfaults. It MUST be inlined! This will be true of any use
case for atomics beyond that of reference counting.
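For reference, a rough sketch of the distinction in core.atomic terms; ``atomicFence`` is existing druntime API, while the signal-fence variant is hypothetical here, named only to mirror C11:

```d
// Sketch only, not actual druntime code.
import core.atomic : atomicFence;

void threadFenceExample()
{
    // Compiler barrier plus a hardware fence (e.g. MFENCE or a locked op on
    // x86 for seq_cst) -- this one can be expressed as an ordinary function.
    atomicFence();
}

// A hypothetical atomic_signal_fence() would be a compiler barrier only:
// it emits no instruction at all, which is why it cannot be written as a
// plain library function and has to be an intrinsic the backend understands.
```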
Nov 14 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 12:58 AM, Richard (Rikki) Andrew Cattermole wrote:
 On 14/11/2023 9:35 PM, Walter Bright wrote:
 The fence instructions are supported by DMD's inline assembler, so they can be 
 done, too, as simple functions.

 Why is this a write-off?
From C11 spec for ``atomic_signal_fence``: NOTE 2 Compiler optimizations and reorderings of loads and stores are inhibited in the same way as with atomic_thread_fence, but the hardware fence instructions that atomic_thread_fence would have inserted are not emitted. In other words no instructions emitted, it does not map to an x86 instruction.
dmd does not reorder code around inline assembler instructions. So this is not a problem.
 I've said this before, dmd is a write-off for lock-free concurrent data 
 structures. Having ANY extra functions in the call stack can throw off the 
 timing and end in segfaults. It MUST be inlined! This will be true of any use 
 case for atomics that is beyond that of reference counting.
Correct multithreaded code is about doing things in the correct sequence; it
is not about timing. If synchronization code is dependent on instruction
timing, it is inevitably going to fail, because too many things affect timing.

Yes, dmd's code here will be significantly slower than an intrinsic, but I
don't see how it would be incorrect.

As a path forward for DMD (a rough sketch of step 1 follows below):

 1. implement core.stdc.stdatomic in terms of core.atomic and/or
    core.internal.atomic

 2. eventually add intrinsics to dmd to replace them
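A minimal sketch of what step 1 could look like, wrapping the existing core.atomic API behind C-style names; the aliases, function set, and the failure handling in the compare-exchange are illustrative assumptions, not the actual druntime code:

```d
// Hypothetical stdatomic-style wrappers over core.atomic (sketch only).
import core.atomic : atomicLoad, atomicStore, atomicFence, cas;

alias atomic_int = shared(int);   // illustrative stand-in for C's atomic_int

int atomic_load(ref atomic_int obj)
{
    return atomicLoad(obj);                // seq_cst by default, as in C
}

void atomic_store(ref atomic_int obj, int desired)
{
    atomicStore(obj, desired);
}

bool atomic_compare_exchange_strong(shared(int)* obj, int* expected, int desired)
{
    // core.atomic.cas only reports success/failure, so approximate C's
    // behaviour of writing the observed value back into *expected on failure.
    if (cas(obj, *expected, desired))
        return true;
    *expected = atomicLoad(*obj);
    return false;
}

void atomic_thread_fence()
{
    atomicFence();
}
```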
Nov 14 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 7:13 AM, Walter Bright wrote:
 Correct multithreaded code is about doing things in the correct 
 sequence, it is not about timing. If synchronization code is dependent 
 on instruction timing, it is inevitably going to fail because too many 
 things affect timing.
You have mostly caught on to the problems here that I have experienced.

About the only people who can write lock-free concurrent data structures
reliably work with kernels, and yes, kernels do have them. As a subject
matter they are only about 20 years old (30 for some key theory), which is
very young for data structures.

To quote Andrei about its significance:

 If you believe that's a fundamental enough question to award a prize to the
 answerer, so did others. In 2003, Maurice Herlihy was awarded the Edsger W.
 Dijkstra Prize in Distributed Computing for his seminal 1991 paper
 "Wait-Free Synchronization" (see http://www.podc.org/dijkstra/2003.html,
 which includes a link to the paper, too). In his tour-de-force paper,
 Herlihy proves which primitives are good and which are bad for building
 lock-free data structures. That brought some seemingly hot hardware
 architectures to instant obsolescence, while clarifying what synchronization
 primitives should be implemented in future hardware.

https://drdobbs.com/lock-free-data-structures/184401865

So it is timing-based: you have set points which act as synchronization
events, and then everything after them must work exactly the same on each
core. This is VERY HARD! I got grey hair because of dmd using inline
assembly, because function calls do not result in "exact" timings after
those synchronization points! But it was possible with ldc with a lot of
work, just not dmd.

I'm not the only one who has gone down this path:

https://github.com/MartinNowak/lock-free
https://github.com/mw66/liblfdsd
https://github.com/nin-jin/go.d
 As a path forward for DMD:

  1. implement core.stdc.stdatomic in terms of core.atomic and/or
     core.internal.atomic

  2. eventually add intrinsics to dmd to replace them
Almost. Everything that is in core.stdc.stdatomic should have similar codegen
to the C compiler; if it doesn't, that is a bug. This is what gives it the
value desired: the guarantee that it will line up.

So my proposal is almost the same, but applies to all compilers:

 1. Implement functions that wrap core.internal.atomic iff those function
    implementations are intrinsics, equivalent to an intrinsic, or use
    locking.

 2. Implement intrinsics in core.internal.atomic and then implement the
    corresponding wrapper function for stdatomic.
Nov 14 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 7:49 PM, Richard (Rikki) Andrew Cattermole wrote:
 So it is timing based, you have set points which act as synchronization events 
 and then everything after it must work exactly the same on each core. This is 
 VERY HARD!
Everything I've read about writing correct synchronization says it's not
about timing, it's about sequencing. For example,
https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691

Or maybe you and I are just misunderstanding terms.

For example, fences. Fences enforce memory-ordering constraints, not timing.
Happens-before and synchronizes-with are sequencing, not timing.
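A tiny illustration of the sequencing point in core.atomic terms (the variable and function names here are invented): the consumer acts only once it observes the release store, no matter how long either thread takes.

```d
// Release/acquire sketch: the ordering guarantee comes from the acquire load
// synchronizing-with the release store, not from any timing assumption.
import core.atomic : atomicLoad, atomicStore, MemoryOrder;

shared int payload;
shared bool ready;

void producer()
{
    atomicStore!(MemoryOrder.raw)(payload, 42);   // relaxed store of the data
    atomicStore!(MemoryOrder.rel)(ready, true);   // release: publishes payload
}

void consumer()
{
    while (!atomicLoad!(MemoryOrder.acq)(ready)) {}      // acquire: wait for the publish
    assert(atomicLoad!(MemoryOrder.raw)(payload) == 42); // guaranteed by happens-before
}
```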
Nov 14 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 7:32 PM, Walter Bright wrote:
 On 11/14/2023 7:49 PM, Richard (Rikki) Andrew Cattermole wrote:
 So it is timing based, you have set points which act as 
 synchronization events and then everything after it must work exactly 
 the same on each core. This is VERY HARD!
Everything I've read about writing correct synchronization says it's not about timing, it's about sequencing. For example, https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691 or maybe you and I are just misunderstanding terms. For example, fences. Fences enforce memory-ordering constraints, not timing. Happens-before and synchronizes-with are sequencing, not timing.
You have understood the simplified parts of the problem. The problem is when
concurrency is in action: multiple cores operating on the same memory in the
same time units. You reach a shifting-sands feeling where multiple facts can
be true at the same time on different cores and can be completely
contradictory. Memory can be mapped on one, and not on another.

Did I mention I have grey hair because of this? This might be a reason why I
have strong opinions about D foundations such as symbols, given that
experience ;)

Also: https://github.com/dlang/dmd/pull/15816
Nov 14 2023
prev sibling parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
Question: Why do people want another wrapper around some inline assembly 
that already exists in core.atomic?

Answer: they don't. This does not allow people to implement any new ideas.

We don't need another wrapper around the same inline assembly that has 
the exact same tradeoffs with inlinability (it can't be inlined) and 
without the ability to succinctly communicate with the backend to ensure 
codegen looks the way it needs to.

You want to port code from C? Use core.atomic. But wait, it has different 
behaviors? Well yeah... it's not designed around the intrinsics that 
stdatomic.h is, which are what give it any useful meaning.

See: ``kill_dependency``, ``atomic_init`` and ``atomic_signal_fence``.

Writing a wrapper around stdatomic.h would take probably 2 hours. You 
don't need to write any inline assembly; it's already done in 
core.atomic. But realistically all you're doing is changing some names 
and the order of parameters, with slightly different types.
Nov 14 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 12:51 AM, Richard (Rikki) Andrew Cattermole wrote:
 Question: Why do people want another wrapper around some inline assembly that 
 already exists in core.atomic?
Because they have existing carefully crafted code in C and want to translate it to D.
 Writing a wrapper around stdatomic.h would take probably 2 hours.
Great! That saves each stdatomic C user 2 hours who wants to get their code in D.
Nov 14 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 8:00 AM, Walter Bright wrote:
 On 11/14/2023 12:51 AM, Richard (Rikki) Andrew Cattermole wrote:
 Question: Why do people want another wrapper around some inline 
 assembly that already exists in core.atomic?
Because they have existing carefully crafted code in C and want to translate it to D.
 Writing a wrapper around stdatomic.h would take probably 2 hours.
Great! That saves each stdatomic C user 2 hours who wants to get their code in D.
So what I'm getting at here is that we can already take a best-effort
approach to this by swapping stdatomic for core.atomic, but that does not
bring the value people want from it.

The codegen must be similar: if the C compiler for a target uses a function
call, so can we; if it uses intrinsics with inlining, so must we. This way
the behavior will be similar, and the port can be a success. Otherwise you
are introducing new and potentially wrong behavior, which means a failure in
trying to fulfill your ideas.
Nov 14 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 7:07 PM, Richard (Rikki) Andrew Cattermole wrote:
 This way the behavior will be similar, and the port can be a success.
 Otherwise you are introducing new and potentially wrong behavior, which
 means a failure in trying to fulfill your ideas.
I do not understand why a function that consists of a FENCE instruction will be a failure compared to a FENCE instruction inlined.
Nov 14 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 7:33 PM, Walter Bright wrote:
 I do not understand why a function that consists of a FENCE instruction 
 will be a failure compared to a FENCE instruction inlined.
Yes, ideally a memory barrier wouldn't matter for how it is executed. But it
does matter for load/cas, because what was true when the operation executed
may no longer be true by the time the next operation is performed when deep
call chains are in use.
Nov 14 2023
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/14/2023 10:59 PM, Richard (Rikki) Andrew Cattermole wrote:
 On 15/11/2023 7:33 PM, Walter Bright wrote:
 I do not understand why a function that consists of a FENCE instruction will 
 be a failure compared to a FENCE instruction inlined.
Yes, ideally a memory barrier wouldn't matter for how it is executed. But it does matter for load/cas. Because what was true when the operation executed may not be true any longer by the time the next operation is performed with deep calls in use.
Is this only a problem with load/cas?
Nov 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 10:33 PM, Walter Bright wrote:
 Is this only a problem with load/cas?
A store by itself should be ok. If you do any loading, as you do with cas or
atomicOp, then the timings might not be in any way predictable.

The general rule of thumb for this is: if an atomic operation is by itself
for a given block of memory, then you can ignore timings. If it is used in
conjunction with another atomic instruction (or more), you have to consider
whether timings can mess it up.

But wrt. load/cas, they are the two big primitives that are used very heavily
in the literature, so they need to be prioritized over the others for
implementing intrinsics. They will also be used very heavily in any function
of such data structures, from my own experience.

A random fun fact: one of the authors of the book I recommend on the subject,
"The Art of Multiprocessor Programming", teaches at the same university as
Roy!

https://shop.elsevier.com/books/the-art-of-multiprocessor-programming/herlihy/978-0-12-415950-1
Nov 15 2023
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 11/15/2023 1:55 AM, Richard (Rikki) Andrew Cattermole wrote:
 A random fun fact, the book I recommend on the subject "The Art of 
 Multiprocessor Programming" one of the authors teachers at the same university 
 as Roy!
 
 https://shop.elsevier.com/books/the-art-of-multiprocessor-programming/herlihy/978-0-12-415950-1
Thanks for the book recommendation. The review comments, though, say its examples are all in Java. Being in Java, an interpreter, is it relevant to machine level programming?
Nov 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/11/2023 3:30 PM, Walter Bright wrote:
 Being in Java, an interpreter, is it relevant to machine level programming?
Yes it is. Java has a well-defined memory subsystem; it isn't a toy
interpreter. E.g.
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/package-summary.html

As a subject matter I would consider it in the category of cross-referencing
required. So even if it isn't the book you want to reference, it's still
worth cross-referencing against at times.

For $7 from thriftbooks it's worth the risk.
Nov 15 2023
parent Walter Bright <newshound2 digitalmars.com> writes:
On 11/15/2023 8:07 PM, Richard (Rikki) Andrew Cattermole wrote:
 For $7 from thriftbooks its worth a risk.
Indeed. I ordered it. Thanks!
Nov 15 2023
prev sibling parent reply claptrap <clap trap.com> writes:
On Wednesday, 15 November 2023 at 06:59:45 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 15/11/2023 7:33 PM, Walter Bright wrote:
 I do not understand why a function that consists of a FENCE 
 instruction will be a failure compared to a FENCE instruction 
 inlined.
Yes, ideally a memory barrier wouldn't matter for how it is executed. But it does matter for load/cas. Because what was true when the operation executed may not be true any longer by the time the next operation is performed with deep calls in use.
That doesn't make any sense. The whole point of CAS is that it is atomic;
immediately after it has completed you have no guarantees anyway, so what
difference does it make if it's wrapped in a function call?
Nov 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 15/11/2023 10:55 PM, claptrap wrote:
 That doesnt make any sense, the whole point of CAS is that it is atomic, 
 immediately after it has completed you have no guarantees anyway, what 
 difference does it make if it's wrapped in a function call?
I understand that it seems like it does not make sense. Lock-free concurrent
data structures are a highly advanced topic that very few people in the world
today can implement successfully. About the only people who are qualified to
touch them for production software would be kernel developers for a specific
CPU family.

They rely quite heavily on the premise that atomic operations happen
immediately in the codegen and then, based upon the results, do set actions
in response. This is timing-based; it has to be precise or they will
interfere with each other. You do not have much leeway before you start
getting segfaults.

I only ever saw partial success with ldc after seven months of researching
them. For obvious reasons I do not recommend people going down this
particular path of study, because you are going to get burned pretty badly,
guaranteed.

Regardless, compilers like gcc have intrinsics for all of stdatomic. We need
to match that, otherwise what D supports will not line up with what the
system C compiler can offer in terms of use cases.

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
Nov 15 2023
next sibling parent reply DrDread <DrDread cheese.com> writes:
On Wednesday, 15 November 2023 at 10:26:32 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 They rely quite heavily on the premise that atomic operations 
 happen immediately in the codegen and then, based upon the 
 results, do set actions in response. This is timing-based; it has 
 to be precise or they will interfere with each other. You do 
 not have much leeway before you start getting segfaults.
Whoever writes code like that deserves all the problems it creates. This
assumes specific CPU behaviour and may be completely broken on other systems.
Nov 15 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/11/2023 2:26 AM, DrDread wrote:
 whoever write code like that deserves all the problems it creates.
 This assumes specific CPU behaviour and may be completely broken on 
 other systems.
Yup, that's lock-free concurrent data structures for you. You need
specialists to have any chance of them working reliably.
Nov 15 2023
prev sibling parent reply claptrap <clap trap.com> writes:
On Wednesday, 15 November 2023 at 10:26:32 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 15/11/2023 10:55 PM, claptrap wrote:
 I understand that it seems like it does not make sense. 
 Lock-free concurrent data structures are a highly advanced 
 topic, that very few people in the world today can implement 
 successfully. About the only people who are qualified to touch 
 them for production software would be kernel developers for a 
 specific cpu family.
I'm saying it doesn't make sense because I have worked on / implemented some
lock-free data structures. I've shipped software that relied on them. None of
the literature I've read ever had any algorithms that relied on "getting
things done quickly" after a CAS.

Fundamentally it can't work, since the thread can be interrupted immediately
after completing the CAS. So if your algorithm relies on something else
happening within a specific time frame after a CAS, it is not going to work.

So I'm looking for an explanation or a pointer to an algorithm that exhibits
what you describe, because it is counter to my experience.
 They rely quite heavily on the premise that atomic operations 
 happen immediately in the codegen and then based upon the 
 results do set actions in response. This is timing based it has 
 to be preciese or they will interfere with each other. You do 
 not have much lee-way before you start getting segfaults.
I can see that getting an update done quickly can help with contention, but
if the algorithm breaks when things aren't done quickly you're pretty much
screwed afaik. I mean there's no way to guarantee any sequence of
instructions gets completed within a given time frame, on x86 at least.
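For what it's worth, a minimal sketch of the kind of CAS retry loop the literature builds on, in core.atomic terms (the helper name is invented): correctness comes from the CAS succeeding or failing atomically, not from how quickly the surrounding code runs.

```d
// Lock-free increment via a CAS retry loop. The thread can be preempted at
// any point; the loop simply retries until its CAS wins, so there is no
// timing assumption anywhere.
import core.atomic : atomicLoad, cas;

void atomicIncrement(shared(int)* counter)
{
    for (;;)
    {
        int old = atomicLoad(*counter);
        if (cas(counter, old, old + 1))
            return;   // our update was applied atomically
        // otherwise another thread changed *counter first; retry with the new value
    }
}
```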
Nov 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/11/2023 2:48 AM, claptrap wrote:
 So im looking for an explanation or a pointer to an algorithm that 
 exhibits what you describe because it is counter to my experience.
By any chance did you use GC-based memory management for it? If you did, that
would explain it. GC-based memory management removes a lot of the complexity
surrounding the removal of elements.
Nov 15 2023
next sibling parent reply Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Wednesday, 15 November 2023 at 14:44:52 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 16/11/2023 2:48 AM, claptrap wrote:
 So im looking for an explanation or a pointer to an algorithm 
 that exhibits what you describe because it is counter to my 
 experience.
By any chance did you use GC based memory management for it? If you did, that would explain it. GC based memory management removes a lot of complexity surrounding removing of elements.
What's wrong with using the gc?
Nov 15 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/11/2023 5:41 AM, Imperatorn wrote:
 What's wrong with using the gc?
Nothing. In fact it is one of, if not the, best way to implement a lock-free
concurrent data structure without segfaults. It is of particular interest
because it turns a nearly impossible problem into a hard problem that can be
solved in a reasonable time frame.
Nov 15 2023
prev sibling parent reply claptrap <clap trap.com> writes:
On Wednesday, 15 November 2023 at 14:44:52 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 16/11/2023 2:48 AM, claptrap wrote:
 So im looking for an explanation or a pointer to an algorithm 
 that exhibits what you describe because it is counter to my 
 experience.
By any chance did you use GC based memory management for it? If you did, that would explain it. GC based memory management removes a lot of complexity surrounding removing of elements.
C++ and assembler. I used an MPSC "garbage queue" for freeing memory, and
there were points in the application where I knew it was safe to empty the
queue. So you could maybe see that as "GC-like".

But GC or not doesn't answer the main question: what LF algorithm depends on
a sequence of instructions being done immediately after a CAS? How do you
ever enforce that on x86?
Nov 15 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 16/11/2023 8:53 AM, claptrap wrote:
 C++ and assembler. I used a MPSC "garbage queue" for freeing memory, and 
 there were points in the application where I knew it was safe to empty 
 the queue. So you could maybe see that as "GC like".
If you had known-good points to release it at, yeah, that is a known-good
strategy. I went totally manual, without any such external assistance. So if
you want to know the area of the literature I was reading: it was anything
that did not apply outside help to deallocate.
 But GC or not doesn't answer the main question, what LF algorithm 
 depends on a a sequence of instructions being done immediately after a 
 CAS? How do you ever enforce that on x86?
That's the fun part: you can't enforce it on any ISA. Signals, interrupts.

If I hadn't been so distracted by the fact that things could work on ldc but
not on dmd, I would've realized that what I was trying to do couldn't work.
There is only one way to describe myself going down that path: a fool ;)
Nov 15 2023
parent reply claptrap <clap trap.com> writes:
On Thursday, 16 November 2023 at 04:25:52 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 But GC or not doesn't answer the main question, what LF 
 algorithm depends on a a sequence of instructions being done 
 immediately after a CAS? How do you ever enforce that on x86?
That's the fun part, you can't enforce it on any ISA.
I know, that's the point: you have zero timing guarantees, since your
instructions can be interrupted at any point. So any algorithm relying on
"timing" is doomed to fail. I.e. it makes no difference if the CAS is inline
or wrapped in a function call.

LF algorithms rely on a sequence of operations being done in a specific
order, and that order being coherent across cores/threads.
Nov 16 2023
parent reply "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
On 17/11/2023 11:50 AM, claptrap wrote:
 LF algorithms rely on a sequence of operations being done in a specific 
 order, and that order being coherent across cores/threads.
Yes, but there is a condition on this:

1. Each operation must be atomic, or:
2. Operate on atomically synchronized memory

But most importantly:

3. It must be predictable

When you don't inline, you get additional steps added that may not hold these
conditions: they can be non-atomic and operate on non-atomically synchronized
memory.

This is why it matters that it doesn't inline. It adds variability between
the steps, so that it isn't doing what you think the algorithm is doing. It
introduces unpredictability.
Nov 16 2023
next sibling parent Stefan Koch <uplink.coder googlemail.com> writes:
On Friday, 17 November 2023 at 04:12:32 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 17/11/2023 11:50 AM, claptrap wrote:
 LF algorithms rely on a sequence of operations being done in a 
 specific order, and that order being coherent across 
 cores/threads.
Yes but there is a condition on this: 1. Each operation must be atomic, or: 2. Operate on an atomically synchronized memory But most importantly: 3. Must be predictable When you don't inline you get additional steps added that may not hold this condition. Where it can be not atomic and not operating on non-atomically synchronized memory. This is why it matters that it doesn't inline. It adds variability between the steps so that it isn't doing what you think the algorithm is doing. It introduces unpredictability.
When you rely on more than one operation being atomic, you are already on the
losing team. Just because two operations are right next to each other in the
machine code, it does not mean they will be executed right after each other.
Another thread or processor might invalidate the condition you established
with the first instruction that the other is relying on.

Therefore your algorithm is only correct if you are not relying on
predictable execution order.
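A small illustration of that point, sketched with core.atomic (names invented): two adjacent atomic operations do not add up to one atomic operation, which is exactly the gap a single CAS closes.

```d
import core.atomic : atomicLoad, atomicOp, cas;

shared int freeSlots;

// Broken: the load and the decrement are each atomic, but another thread can
// take the last slot between them, however close together the generated
// instructions happen to be.
bool tryTakeSlotBroken()
{
    if (atomicLoad(freeSlots) > 0)
    {
        atomicOp!"-="(freeSlots, 1);   // may drive freeSlots below zero
        return true;
    }
    return false;
}

// Correct: the check and the update form one atomic step; on interference we
// simply retry.
bool tryTakeSlot()
{
    for (;;)
    {
        int cur = atomicLoad(freeSlots);
        if (cur <= 0)
            return false;
        if (cas(&freeSlots, cur, cur - 1))
            return true;
    }
}
```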
Nov 17 2023
prev sibling parent reply claptrap <clap trap.com> writes:
On Friday, 17 November 2023 at 04:12:32 UTC, Richard (Rikki) 
Andrew Cattermole wrote:
 On 17/11/2023 11:50 AM, claptrap wrote:
 LF algorithms rely on a sequence of operations being done in a 
 specific order, and that order being coherent across 
 cores/threads.
Yes but there is a condition on this: 1. Each operation must be atomic, or: 2. Operate on an atomically synchronized memory
That's a given; I mean, the whole point is that any variable shared across
threads needs to be accessed atomically.
 But most importantly:
 3. Must be predictable

 When you don't inline you get additional steps added that may 
 not hold this condition. Where it can be not atomic and not 
 operating on non-atomically synchronized memory.
CAS operates on a specific memory location, and that address will be passed
into the function; there's no way for the function to change this to break
alignment and hence break the atomicity. In fact, on x86 alignment doesn't
matter anyway if you have a lock prefix. On ARM it does, IIRC, but ARM also
has less strict memory ordering.

What instructions the compiler inserts around the CAS instruction are
irrelevant; none of what they do can break the CAS. The only thing the
function has that can be used to barf things up is the memory location;
everything else it has is thread-local. So the only way it can break the CAS
is by reading or writing to the memory location in a non-atomic manner. It
literally has no instructions to do so unless the programmer tells it to.

I mean, if the compiler emits instructions to read or write to a pointer
without the programmer instructing it to do so, your compiler is broken and
you'll be getting segfaults everywhere anyway.
Nov 17 2023
parent reply IGotD- <nise nise.com> writes:
On Friday, 17 November 2023 at 10:25:31 UTC, claptrap wrote:
 What instructions the compiler inserts around the CAS 
 instruction are irrelevant, none of what they do can break the 
 CAS.
It is not irrelevant. You cannot break the CAS itself, as it is usually
implemented according to the SW ABI of the architecture. However, you must
instruct the compiler so that reordering optimizations don't spill over the
CAS implementation. If optimizations make instructions spill over to the
wrong side of the CAS, then you can end up with a non-working algorithm.

So not only do you perhaps need to insert a memory barrier instruction,
depending on the ISA, you also need to instruct the compiler not to reorder
across the atomic instructions. This is usually implicit, as atomic
operations are implemented as intrinsics which automatically do this for you.
Nov 17 2023
parent claptrap <clap trap.com> writes:
On Friday, 17 November 2023 at 12:41:39 UTC, IGotD- wrote:
 On Friday, 17 November 2023 at 10:25:31 UTC, claptrap wrote:
 What instructions the compiler inserts around the CAS 
 instruction are irrelevant, none of what they do can break the 
 CAS.
It is not irrelevant. You cannot break the CAS itself as it is usually implemented according to the SW ABI of the architecture. However, you must instruct the compiler so that reordering optimizations don't spill over the CAS implementations. If optimizations make instructions spill over to the wrong side of the CAS, then you possibly will end up in a non working algorithm.
You're missing the point. If your compiler is reordering instructions around
a CAS, it's broken. If it is doing that, then it's a problem whether you
wrapped the CAS in a function call or not.

And not only that, but the instructions for the function are all
thread-local. The only shared thing the function has is the address of the
atomic variable in memory. There's nothing in that situation the compiler can
reorder that would break multithreaded code that would not also break
single-threaded code.

If your CAS intrinsic or instruction needs a fence, it needs it whether
inline or wrapped in a function. Wrapping it in a function call doesn't
somehow cause ordering issues.
Nov 17 2023
prev sibling next sibling parent reply ryuukk_ <ryuukk.dev gmail.com> writes:
These should be builtins imo. LDC could reuse LLVM intrinsics, GDC could 
reuse GCC stuff, and DMD would reuse whatever is in ``core.atomic``.

https://llvm.org/docs/Atomics.html#libcalls-atomic

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
Nov 14 2023
parent "Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:
For ldc and gdc core.atomic already uses the backend intrinsics. It is 
only dmd that doesn't do that and hence does bad things.

Whatever is missing that stdatomic.h needs, should be added.

https://github.com/ldc-developers/ldc/blob/master/runtime/druntime/src/core/internal/atomic.d#L24
Nov 14 2023
prev sibling parent Denis Feklushkin <feklushkin.denis gmail.com> writes:
On Tuesday, 14 November 2023 at 03:53:09 UTC, Walter Bright wrote:
 There have been multiple requests to add this support. It can 
 all be done with a library implemented with some inline 
 assembler.

 Anyone want the glory of implementing this?
Probably, it was previously implied that core.* is for the internal needs of
druntime and Phobos? And the existence of core.stdc.* is a forced solution,
since using libc is the easiest way to create an interface between druntime
and the OSes.
Nov 14 2023