digitalmars.D - Wait-free thread communication
- Jin (27/27) Jan 08 2016 Idea: no mutex, no CAS, only one direction queues without any
- Nordlöw (5/9) Jan 08 2016 Very interesting. D has builtin unittests. You should add stress
- Jin (5/15) Jan 08 2016 I just add unit tests. But how to idiomatic implement benchmarks
- thedeemon (18/22) Jan 09 2016 Sorry, that's not looking any good.
- Jin (17/33) Jan 09 2016 No, jin.go creates new native thread on every call. And this is
- Andy Smith (8/36) Jan 09 2016 I'm a little worried you have no volatile writes or fences around
- Ola Fosheim Grøstad (4/10) Jan 09 2016 But not on ARM, so he should use atomic acquire-release semantics
- David Nadlinger (21/31) Jan 09 2016 Not only that. It's a problem on x86 as well because advanced
- Ola Fosheim Grøstad (17/28) Jan 09 2016 Yes, well, he had a sleep() in there that might prevent the
- David Nadlinger (10/13) Jan 10 2016 I wholeheartedly agree with that statement. However, if one
- Jin (3/9) Jan 09 2016 This works fine in my tests, but how to enforce memory barrier in
- Jin (5/11) Jan 09 2016 I just add atomic fence for push and take:
- Ola Fosheim Grøstad (16/27) Jan 09 2016 You need it in the tests.
- Jin (8/14) Jan 09 2016 If memory writes will reorder, then current tests will fail.
- Ola Fosheim Grøstad (2/7) Jan 09 2016 They have to be atomic if you want your code to be portable.
- Jin (3/4) Jan 09 2016 Do not want yet :-)
- Andy Smith (23/27) Jan 09 2016 Okay I've just refreshed my mental X86/64 memory model :-) What
- David Nadlinger (7/10) Jan 09 2016 It might work on DMD, but I know from experience (one-to-many
- Andy Smith (5/15) Jan 09 2016 Ah Cheers David... I think our messages crossed :-)
- David Nadlinger (6/7) Jan 09 2016 Either you are not calling it in the way you think you are, then,
- Martin Nowak (4/9) Jan 10 2016 Don't do this, memory fences are expensive.
- Martin Nowak (12/20) Jan 10 2016 Yes single-reader single-writer queue are the fastest way for
- Jin (3/10) Jan 13 2016 I am using Waiter for this
- Jin (3/3) Mar 27 2016 I just use this queues to implement Go like api for concurrency
- Dmitry Olshansky (4/7) Mar 28 2016 If nothing changed implementation-wise this is just data-racy queues :)
- Jin (2/10) Mar 28 2016 Why?
- Dmitry Olshansky (8/18) Mar 28 2016 All I see is a ring buffer with hand-wavy atomicFence on one of mutating...
- Jin (7/17) Mar 28 2016 Here the fibers are used.
Idea: no mutex, no CAS, only one-direction queues without any locks. My prototype (https://github.com/nin-jin/go.d) is up to 10x faster than std.concurrency.send/receive:

---
writers =512 readers =1
std.concurency milliseconds=1313
jin.go milliseconds=113
---

Implementation:

Queue! - a buffered, statically typed channel. One and only one thread can push data to a queue, and one and only one thread can take data from it. A thread blocks when it takes from an empty queue or pushes to a full queue.

Queues! - a list of queues with round-robin balancing.

go! - starts a new thread; in the future I want to implement something like goroutines, but I do not understand how yet. I need some good advice, I am a newbie in D :-)

Problems:

For blocking a thread I use a loop with Thread.sleep - this is a bad decision IMHO. On my system I can create only up to 1000 threads without errors. Fibers from core.thread or Tasks from std.parallelism could potentially resolve this problem, but how do I integrate them gracefully?

Can the API of jin-go be better? Which dub modules can help me?
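For illustration, here is a minimal sketch of the single-writer/single-reader ring buffer idea described above, written against core.atomic. The Spsc name, its methods and the acquire/release ordering are assumptions made for this sketch; it is not the actual jin.go code (which at this point uses plain reads and writes plus sleeping).

    import core.atomic;

    // Hypothetical single-producer single-consumer ring buffer (sketch).
    // Exactly one thread calls tryPut, exactly one other thread calls tryTake.
    // In practice the instance would be __gshared (or otherwise made visible
    // to both threads).
    struct Spsc(T, size_t capacity = 64)
    {
        private T[capacity] items;
        private shared size_t head; // advanced only by the consumer
        private shared size_t tail; // advanced only by the producer

        // Producer side: write the element first, then publish the new tail.
        bool tryPut(T value)
        {
            immutable t = atomicLoad!(MemoryOrder.raw)(tail);
            immutable n = (t + 1) % capacity;
            if (n == atomicLoad!(MemoryOrder.acq)(head))
                return false; // full
            items[t] = value;
            atomicStore!(MemoryOrder.rel)(tail, n);
            return true;
        }

        // Consumer side: read the element, then publish the new head.
        bool tryTake(ref T value)
        {
            immutable h = atomicLoad!(MemoryOrder.raw)(head);
            if (h == atomicLoad!(MemoryOrder.acq)(tail))
                return false; // empty
            value = items[h];
            atomicStore!(MemoryOrder.rel)(head, (h + 1) % capacity);
            return true;
        }

        bool empty() const { return atomicLoad(head) == atomicLoad(tail); }
        bool full() const { return (atomicLoad(tail) + 1) % capacity == atomicLoad(head); }
    }

Blocking push/take as described in the post can then be layered on top of tryPut/tryTake by retrying with some backoff (see further down the thread).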
Jan 08 2016
On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
> Idea: no mutex, no CAS, only one direction queues without any locks. My prototype (https://github.com/nin-jin/go.d) is up to 10x faster than std.concurrency.send/receive

Very interesting. D has builtin unittests. You should add stress unittests to assure that your logic is correct. You can start by searching for the keyword `unittest` in the std.concurrency module for advice on how to do this.
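A stress unittest in the spirit of this suggestion might look like the sketch below. The Queue!int type and its blocking push/take come from the description in the opening post; how the queue is constructed here is an assumption.

    unittest
    {
        import core.thread : Thread;

        enum N = 100_000;
        auto q = new Queue!int;      // assumed constructible like this

        // exactly one writer thread...
        auto writer = new Thread({
            foreach (i; 0 .. N)
                q.push(i);           // blocks while the buffer is full
        });
        writer.start();

        // ...and exactly one reader (the test thread itself)
        long sum = 0;
        foreach (i; 0 .. N)
            sum += q.take();         // blocks while the buffer is empty

        writer.join();
        assert(sum == cast(long) N * (N - 1) / 2);
    }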
Jan 08 2016
On Friday, 8 January 2016 at 18:07:49 UTC, Nordlöw wrote:
> On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
> > Idea: no mutex, no CAS, only one direction queues without any locks. My prototype (https://github.com/nin-jin/go.d) is up to 10x faster than std.concurrency.send/receive
> Very interesting. D has builtin unittests. You should add stress unittests to assure that your logic is correct. You can start by searching for keyword `unittest` in the std.concurrency module for advice how to do this.

I just added unit tests. But how do I idiomatically implement benchmarks (they must be compiled in release mode)? Currently I implement a main function in app.d and run it with "dub --build=release", but then nobody can import my module.
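One possible approach, sketched below with a hypothetical module name and version identifier (not something taken from the jin.go repository): keep the library free of main() and only compile a benchmark entry point when a version switch is set, e.g. via a "versions" entry in dub.json or -version= on the compiler command line.

    module jin.go.bench;  // hypothetical benchmark module

    version (jin_go_bench)
    void main()
    {
        import core.time : MonoTime;
        import std.stdio : writefln;

        auto started = MonoTime.currTime;
        // ... run the writers/readers scenario here ...
        auto elapsed = MonoTime.currTime - started;
        writefln("jin.go milliseconds=%s", elapsed.total!"msecs");
    }

Without the version identifier the module stays importable as an ordinary library; with it, dub (or dmd/ldc directly) produces the release-mode benchmark binary.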
Jan 08 2016
On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
> Idea: no mutex, no CAS, only one direction queues without any locks. My prototype (https://github.com/nin-jin/go.d) is up to 10x faster than std.concurrency.send/receive

Sorry, that's not looking any good. You call it wait-free when in fact it's just the opposite: if a queue buffer is full on push it just waits in Thread.sleep, which is 1) not wait-free at all, 2) very heavy (call to kernel, context switch). And when the buffer is not full/empty it works as a simple single-threaded queue, which means it's fine for using with fibers inside one thread but will not work correctly in a multithreaded setting.

Your benchmarks show a time difference in your favor just because you compare very different things: your queue is benchmarked in a single thread with fibers while std.concurrency is measured with multiple threads communicating with each other. Doing many switches between threads is much slower than switching between fibers in one thread, hence the time difference. It's not that your queue is any good, it's just that you measure it wrong.
Jan 09 2016
On Saturday, 9 January 2016 at 08:28:33 UTC, thedeemon wrote:
> Your benchmarks show time difference in your favor just because you compare very different things: your queue is benchmarked in single thread with fibers while std.concurrency is measured with multiple threads communicating with each other. Doing many switches between threads is much slower than switching between fibers in one thread, hence the time difference. It's not that your queue is any good, it's just you measure it wrong.

No, jin.go creates a new native thread on every call. And this is a problem :-) We can not create thousands of threads without errors. std.concurrency uses a mutex to synchronize access to the message queue, and the cost of synchronization is proportional to the count of threads.

On Saturday, 9 January 2016 at 08:28:33 UTC, thedeemon wrote:
> You call it wait-free when in fact it's just the opposite: if a queue buffer is full on push it just waits in Thread.sleep which is 1) not wait-free at all 2) very heavy (call to kernel, context switch) And when buffer is not full/empty it works as a simple single-threaded queue, which means it's fine for using with fibers inside one thread but will not work correctly in multithreaded setting.

Taking data from an empty queue and pushing data to a full queue is logically impossible for queues. We have 3 strategies for these cases:

1. Throwing a runtime exception, like std.concurrency.send does on a full message box by default.
2. Blocking the thread, like std.concurrency.receive does on an empty message box.
3. Skipping the action.

jin-go uses the blocking strategy by default, and you can check queue.empty and queue.full if you want to implement another strategy (see the sketch below).
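For illustration, the three strategies could be layered on top of a non-blocking tryPut like the one in the hypothetical Spsc sketch earlier in the thread (again, none of this is jin.go's actual code):

    import core.thread : Thread;
    import core.time : usecs;
    import std.exception : enforce;

    // 1. throw when the queue is full
    void putOrThrow(T)(ref Spsc!T q, T value)
    {
        enforce(q.tryPut(value), "queue is full");
    }

    // 2. block (here: sleep-and-retry) until there is room
    void putBlocking(T)(ref Spsc!T q, T value)
    {
        while (!q.tryPut(value))
            Thread.sleep(1.usecs);
    }

    // 3. skip the action, reporting whether it happened
    bool putOrSkip(T)(ref Spsc!T q, T value)
    {
        return q.tryPut(value);
    }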
Jan 09 2016
On Saturday, 9 January 2016 at 11:00:01 UTC, Jin wrote:
> No, jin.go creates new native thread on every call. And this is problem :-) We can not create thousand of threads without errors.

Ah, sorry, I misread the source. FiberScheduler got me distracted. Why is it there?
Jan 09 2016
On Saturday, 9 January 2016 at 13:55:16 UTC, thedeemon wrote:
> On Saturday, 9 January 2016 at 11:00:01 UTC, Jin wrote:
> > No, jin.go creates new native thread on every call. And this is problem :-) We can not create thousand of threads without errors.
> Ah, sorry, I misread the source. FiberScheduler got me distracted. Why is it there?

This is an artefact of experiments with fibers :-)
Jan 09 2016
On Friday, 8 January 2016 at 16:58:59 UTC, Jin wrote:
> Idea: no mutex, no CAS, only one direction queues without any locks. My prototype (https://github.com/nin-jin/go.d) is up to 10x faster than std.concurrency.send/receive
>
> ---
> writers =512 readers =1
> std.concurency milliseconds=1313
> jin.go milliseconds=113
> ---
>
> Realization: Queue! - buffered static typed channel. One and only one thread can push data to queue, and one and only one can take data from queue. Thread blocks when he takes from empty queue and pushes to fulfilled queue. Queues! - list of queues with round robin balancing go! - starts new thread, in future i want to implement something like goroutine, but i do not understand how yet. I need some good advice, i am newbie in D :-)
>
> Problems: For blocking thread i use loop with Thread.sleep - this is bad decision IMHO. On my system i can create only up to 1000 threads without errors. Fibers from core.thread or Tasks from std.parallelism potentially can resolve this problem, but how to integrate their gracefully. API of jin-go can be better? Which dub modules can help me?

I'm a little worried you have no volatile writes or fences around your code when you 'publish' an event using head/tail etc. It looks like it's working, but how are you ensuring no compiler/CPU reordering is occurring? Does x86_64 actually allow you to get away with this? I know its memory model is stricter than others...

Cheers, A.
Jan 09 2016
On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
> I'm a little worried you have no volatile writes or fences around your code when you 'publish' an event using head/tail etc. It looks like it's working but how are you ensuring no compiler/CPU reordering is occurring. Does x86_64 actually allow you to get away with this? I know its memory model is stricter than others...

But not on ARM, so he should use atomic acquire-release semantics on the indices for push/pull. I suggest porting over spsc_queue from Boost.
Jan 09 2016
On Saturday, 9 January 2016 at 15:16:43 UTC, Ola Fosheim Grøstad wrote:
> On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
> > I'm a little worried you have no volatile writes or fences around your code when you 'publish' an event using head/tail etc. It looks like it's working but how are you ensuring no compiler/CPU reordering is occurring. Does x86_64 actually allow you to get away with this? I know its memory model is stricter than others...
> But not on ARM, so he should use atomic acquire-release semantics on indices for push/pull.

Not only that. It's a problem on x86 as well, because advanced optimizers like those in GDC or LDC will happily assume that the members are not written to by another thread (because there would be a race otherwise) and cache the loads or even eliminate some stores. Any guarantees the processor would offer are irrelevant if the compiler has already reordered your operations.

It should be quite easy to see such effects. Just compile a simple test case on GDC or LDC with optimizations on.

> I suggest porting over spsc_queue from Boost.

That would certainly be a starting point, although a high-performance single-producer single-consumer queue is trivial to implement once you understand atomics.

Then again, I couldn't convince Andrei that a well-defined memory model (in which it even makes sense to talk about atomics in the first place) is important for D yet. Right now, you basically have to hope that everything works as in C++ – which is not a bad bet on GDC and LDC, and the weaker optimizer in DMD hides a lot of potential issues there – and stay away from things like `consume` loads that would depend on language semantics.

— David
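A minimal test case of the kind David mentions might look like this (purely illustrative; the names are made up). Built with optimizations on, a compiler may legitimately turn the first loop into a single load followed by an infinite loop, while the atomic version must re-read the flag on every iteration:

    import core.atomic : atomicLoad, MemoryOrder;

    __gshared bool doneUnsafe;   // plain memory, shared by convention only
    shared bool doneAtomic;

    void spinUnsafe()
    {
        // the optimizer may read doneUnsafe once and cache it in a register,
        // since nothing tells it another thread can modify the variable
        while (!doneUnsafe) {}
    }

    void spinAtomic()
    {
        // atomicLoad forces a fresh read (with acquire ordering) each time
        while (!atomicLoad!(MemoryOrder.acq)(doneAtomic)) {}
    }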
Jan 09 2016
On Saturday, 9 January 2016 at 17:42:41 UTC, David Nadlinger wrote:
> Not only that. It's a problem on x86 as well because advanced optimizers like those in GDC or LDC will happily assume that the members are not written to by another thread (because there would be a race otherwise) and cache the loads or even eliminate some stores. Any guarantees the processor would offer are irrelevant if the compiler has already reordered your operations. It should be quite easy to see such effects. Just compile a simple test case on GDC or LDC with optimizations on.

Yes, well, he had a sleep() in there that might prevent the compiler from moving instructions etc, but even if so... it is a bad thing to rely on. Explicit atomics document what is going on (even when not technically needed), and should always be used where you want atomic operations, I think.

> That would certainly be a starting point, although a high-performance single-producer single-consumer queue is trivial to implement once you understand atomics.

Yes. But my experience from writing custom multi-single queues is that it can end up harder than it looks to get it working and efficient. Don't be surprised if it takes 5-10x more time than anticipated to get it 100% right... So why not start with something that is correct? You still have to spend quite a bit of time to verify that the code is identical... but that is much easier than to "formally" prove that your own implementation is correct. (Intuition is often wrong in this area...)
Jan 09 2016
On Saturday, 9 January 2016 at 22:44:53 UTC, Ola Fosheim Grøstad wrote:
> Yes. But my experience from writing custom multi-single queues is that it can end up harder than it looks to get it working and efficient. […] (Intuition is often wrong in this area...)

I wholeheartedly agree with that statement. However, if one doesn't understand atomics well enough to correctly implement even a simple single-single queue, I'm not sure whether one would be able to correctly port an existing implementation either. Then again, the primitives might be similar enough between C++11 and D so that a literal translation is possible without understanding what is going on.

— David
Jan 10 2016
On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
> I'm a little worried you have no volatile writes or fences around your code when you 'publish' an event using head/tail etc. It looks like it's working but how are you ensuring no compiler/CPU reordering is occurring. Does x86_64 actually allow you to get away with this? I know its memory model is stricter than others...

This works fine in my tests, but how do I enforce a memory barrier in D?
Jan 09 2016
On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
> I'm a little worried you have no volatile writes or fences around your code when you 'publish' an event using head/tail etc. It looks like it's working but how are you ensuring no compiler/CPU reordering is occurring. Does x86_64 actually allow you to get away with this? I know its memory model is stricter than others...

I just added an atomic fence for push and take:

    this.messages[ this.tail ] = value;
    atomicFence;
    this.tail = ( this.tail + 1 ) % this.size;
Jan 09 2016
On Saturday, 9 January 2016 at 15:51:51 UTC, Jin wrote:
> On Saturday, 9 January 2016 at 14:20:18 UTC, Andy Smith wrote:
> > I'm a little worried you have no volatile writes or fences around your code when you 'publish' an event using head/tail etc. It looks like it's working but how are you ensuring no compiler/CPU reordering is occurring. Does x86_64 actually allow you to get away with this? I know its memory model is stricter than others...
> I just add atomic fence for push and take:
>
>     this.messages[ this.tail ] = value;
>     atomicFence;
>     this.tail = ( this.tail + 1 ) % this.size;

You need it in the tests.

I haven't used atomics in D, but you have

    atomicLoad(MemoryOrder ms = MemoryOrder.seq, T)(ref const shared T val)
    atomicStore(MemoryOrder ms = MemoryOrder.seq, T, V1)(ref shared T val, V1 newval)

The compiler should then be able to ignore it if the CPU handles it well enough, assuming the D implementation works. Something along the lines of:

    immutable i = atomicLoad!(MemoryOrder.raw)(tail);
    immutable n = (i + 1) % size;
    if (n == atomicLoad!(MemoryOrder.acq)(head)) return QUEUE_FULL;
    buffer[i] = ...;
    atomicStore!(MemoryOrder.rel)(tail, n);

Or something like that.
Jan 09 2016
On Saturday, 9 January 2016 at 16:05:34 UTC, Ola Fosheim Grøstad wrote:
> You need it in the tests.

If memory writes get reordered, then the current tests will fail. Memory reads can be in any order.

On Saturday, 9 January 2016 at 16:05:34 UTC, Ola Fosheim Grøstad wrote:
> I haven't used atomics in D, but you have
>
>     atomicLoad(MemoryOrder ms = MemoryOrder.seq, T)(ref const shared T val)
>     atomicStore(MemoryOrder ms = MemoryOrder.seq, T, V1)(ref shared T val, V1 newval)

I do not understand how to use them correctly, so I simply use atomicFence. Performance does not degrade.
Jan 09 2016
On Saturday, 9 January 2016 at 16:22:12 UTC, Jin wrote:
> On Saturday, 9 January 2016 at 16:05:34 UTC, Ola Fosheim Grøstad wrote:
> > You need it in the tests.
> If memory writes will reorder, then current tests will fail. Memory reads can be in any order.

They have to be atomic if you want your code to be portable.
Jan 09 2016
On Saturday, 9 January 2016 at 16:29:07 UTC, Ola Fosheim Grøstad wrote:
> They have to be atomic if you want your code to be portable.

Do not want yet :-)
Jan 09 2016
On Saturday, 9 January 2016 at 17:24:47 UTC, Jin wrote:
> On Saturday, 9 January 2016 at 16:29:07 UTC, Ola Fosheim Grøstad wrote:
> > They have to be atomic if you want your code to be portable.
> Do not want yet :-)

Okay, I've just refreshed my mental X86/64 memory model :-) What you've done should be okay from a CPU-reordering perspective. On x86/64, loads are not reordered with other loads and stores are not reordered with other stores. So from a CPU perspective you should be okay on x86/64.

There's still a grey area for me around possible *compiler* re-orderings... there's nothing to stop the compiler reordering those two lines with respect to each other, because from an 'as-if-sequential' perspective those two operations are commutative. Unfortunately I'm not well versed enough on the internals of the three main compiler versions to give an opinion on whether this will be a problem or not :-(

If your version is working for you on your platform/compiler, I'd recommend being a bit conservative and wrapping your code in appropriate version blocks until you know the answer to these questions for the other platforms/compilers (ARM, GDC, LDC etc.), and putting an appropriate explanation of why that's the case in your README.md. I think that's just a basic courtesy to other users, so you don't let them walk over a minefield without fair warning :-)

Cheers, A.
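A sketch of the kind of guard Andy suggests (the version identifier is the standard predefined one; the message text is made up):

    // module-level guard: refuse to build on platforms where the ordering
    // assumptions have not been checked yet
    version (X86_64) {}
    else static assert(false,
        "jin.go: the lock-free queue has only been validated on x86_64 so far");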
Jan 09 2016
On Saturday, 9 January 2016 at 17:44:26 UTC, Andy Smith wrote:
> Unfortunately I'm not well versed enough on the internals of the three main compiler versions to give an opinion on whether this will be a problem or not :-(

It might work on DMD, but I know from experience (one-to-many queues, i.e. single-consumer-multi-producer or the other way round) that the LDC optimizer will ruthlessly break code written in that "I know that it would work in assembly, let's just hope the compiler doesn't touch it" style.

— David
Jan 09 2016
On Saturday, 9 January 2016 at 17:49:46 UTC, David Nadlinger wrote:
> On Saturday, 9 January 2016 at 17:44:26 UTC, Andy Smith wrote:
> > Unfortunately I'm not well versed enough on the internals of the three main compiler versions to give an opinion on whether this will be a problem or not :-(
> It might work on DMD, but I know from experience (one-to-many queues, i.e. single-consumer-multi-producer or the other way round) that the LDC optimizer will ruthlessly break code written in that "I know that it would work in assembly, let's just hope the compiler doesn't touch it" style.

Ah, cheers David... I think our messages crossed :-)

Cheers, A.
Jan 09 2016
On Saturday, 9 January 2016 at 16:22:12 UTC, Jin wrote:
> So i simple use atomicFence. Performance does not degrade.

Either you are not calling it in the way you think you are, then, or your code is otherwise very unoptimized. You can definitely measure the impact of a full mfence on otherwise tight lock-free code.

— David
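A rough way to see the cost David refers to (an illustrative micro-benchmark, not taken from this thread; timings will vary and an optimizer may simplify the unfenced loop further):

    import core.atomic : atomicFence;
    import core.time : MonoTime;
    import std.stdio : writefln;

    void main()
    {
        enum N = 10_000_000;
        size_t sink;

        auto t0 = MonoTime.currTime;
        foreach (i; 0 .. N)
            sink += i;                // plain work, no fence
        auto plain = MonoTime.currTime - t0;

        t0 = MonoTime.currTime;
        foreach (i; 0 .. N)
        {
            sink += i;
            atomicFence();            // full barrier (mfence on x86) per iteration
        }
        auto fenced = MonoTime.currTime - t0;

        writefln("plain=%s fenced=%s (sink=%s)", plain, fenced, sink);
    }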
Jan 09 2016
On 01/09/2016 04:51 PM, Jin wrote:
> I just add atomic fence for push and take:
>
>     this.messages[ this.tail ] = value;
>     atomicFence;
>     this.tail = ( this.tail + 1 ) % this.size;

Don't do this, memory fences are expensive. This is what you need for a spsc queue:
https://github.com/MartinNowak/lock-free/commit/233739262c14e00866a60f4a6a86c1b979ac968b
Jan 10 2016
On 01/08/2016 05:58 PM, Jin wrote:
> Idea: no mutex, no CAS, only one direction queues without any locks. My prototype (https://github.com/nin-jin/go.d) is up to 10x faster than std.concurrency.send/receive

Yes, single-reader single-writer queues are the fastest way for inter-thread communication. You might have a look at my [lock-free](http://code.dlang.org/packages/lock-free) package for a correct implementation.

> Problems: For blocking thread i use loop with Thread.sleep - this is bad decision IMHO.

Have a look at this exponential backoff implementation for my GC spinlock PR:
https://github.com/D-Programming-Language/druntime/pull/1447/files#diff-fb5cbe06e1aaf83814ccf5ff08f05519R34

In general you need some sort of configurable or adaptive backoff, or you'll waste too much time context switching.

-Martin
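A sketch of the adaptive backoff Martin describes (the helper below is hypothetical and not taken from his PR): yield the time slice on the first retry, then sleep for exponentially growing intervals up to a cap.

    import core.thread : Thread;
    import core.time : usecs;

    // hypothetical helper; call it each time an operation has to retry
    void backoff(ref uint delayUsecs, uint maxUsecs = 1000)
    {
        if (delayUsecs == 0)
        {
            Thread.yield();       // cheap: just give up the time slice first
            delayUsecs = 1;
        }
        else
        {
            Thread.sleep(delayUsecs.usecs);
            delayUsecs = delayUsecs * 2 < maxUsecs ? delayUsecs * 2 : maxUsecs;
        }
    }

    // usage sketch: uint d = 0; while (!queue.tryPut(v)) backoff(d);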
Jan 10 2016
On Sunday, 10 January 2016 at 21:25:34 UTC, Martin Nowak wrote:
> > For blocking thread i use loop with Thread.sleep - this is bad decision IMHO.
> Have a look at this exponential backoff implementation for my GC spinlock PR. https://github.com/D-Programming-Language/druntime/pull/1447/files#diff-fb5cbe06e1aaf83814ccf5ff08f05519R34 In general you need some sort of configurable or adaptive backoff or you'll waste too much time context switching.

I am using Waiter for this:
https://github.com/nin-jin/go.d/blob/master/source/jin/go.d#L171
Jan 13 2016
I just use these queues to implement a Go-like API for concurrency (coroutines + channels): http://forum.dlang.org/thread/lcfnfnhjzonkdkeaumds forum.dlang.org
Mar 27 2016
On 27-Mar-2016 21:23, Jin wrote:
> I just use these queues to implement a Go-like API for concurrency (coroutines + channels): http://forum.dlang.org/thread/lcfnfnhjzonkdkeaumds forum.dlang.org

If nothing changed implementation-wise this is just data-racy queues :)

--
Dmitry Olshansky
Mar 28 2016
On Monday, 28 March 2016 at 16:39:45 UTC, Dmitry Olshansky wrote:
> On 27-Mar-2016 21:23, Jin wrote:
> > I just use these queues to implement a Go-like API for concurrency (coroutines + channels): http://forum.dlang.org/thread/lcfnfnhjzonkdkeaumds forum.dlang.org
> If nothing changed implementation-wise this is just data-racy queues :)

Why?
Mar 28 2016
On 28-Mar-2016 20:03, Jin wrote:
> On Monday, 28 March 2016 at 16:39:45 UTC, Dmitry Olshansky wrote:
> > On 27-Mar-2016 21:23, Jin wrote:
> > > I just use these queues to implement a Go-like API for concurrency (coroutines + channels): http://forum.dlang.org/thread/lcfnfnhjzonkdkeaumds forum.dlang.org
> > If nothing changed implementation-wise this is just data-racy queues :)
> Why?

All I see is a ring buffer with a hand-wavy atomicFence on one of the mutating operations. popFront is not protected at all. Also, force-yielding a thread is not a sane synchronization technique.

Overall, I suggest not labelling this as "wait free" code, as it's waaay far from what it takes to get that.

--
Dmitry Olshansky
Mar 28 2016
On Monday, 28 March 2016 at 17:16:14 UTC, Dmitry Olshansky wrote:
> > > If nothing changed implementation-wise this is just data-racy queues :)
> > Why?
> All I see is a ring buffer with a hand-wavy atomicFence on one of the mutating operations. popFront is not protected at all.

popFront does not need protection. It is atomic for the provider.

> Also, force-yielding a thread is not a sane synchronization technique.

Here the fibers are used.

> Overall, I suggest not labelling this as "wait free" code, as it's waaay far from what it takes to get that.

Each operation (clear, front, popFront, full, put) has a fixed number of steps if you check for clear before accessing front and check for full before put. If you do not check this, you will be blocked, of course. What do you expect? An exception? Ignoring the operation?
Mar 28 2016