
digitalmars.D - A very interesting slide deck comparing sync and async IO

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
https://www.mailinator.com/tymaPaulMultithreaded.pdf

Andrei
Mar 03 2016
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 3 March 2016 at 17:31:59 UTC, Andrei Alexandrescu 
wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf
Slide decks are so unbelievably bad at conveying information. But, looking through it, I think I agree. There's a reason why I stick with my cgi.d - using a simple process/thread model is so much easier and actually performs fine. fork() is very cheap on modern Linux and if you use it, the kernel takes care of basically all the hard stuff for you!
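As a hedged illustration of that process-per-connection model (this is not cgi.d's actual code; the port and reply are made up, and error handling plus child reaping are omitted), a fork()-based accept loop in D looks roughly like this:

```d
import core.sys.posix.arpa.inet : htons;
import core.sys.posix.netinet.in_ : sockaddr_in;
import core.sys.posix.sys.socket : AF_INET, SOCK_STREAM, accept, bind, listen, sockaddr, socket;
import core.sys.posix.unistd : _exit, close, fork, write;

void main()
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in addr;
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);            // INADDR_ANY is the zero default
    bind(listener, cast(sockaddr*) &addr, cast(uint) addr.sizeof);
    listen(listener, 128);

    while (true)
    {
        int client = accept(listener, null, null);
        if (fork() == 0)
        {
            // Child: the kernel handles scheduling and isolation for us.
            close(listener);
            enum msg = "hello\n";
            write(client, msg.ptr, msg.length);
            close(client);
            _exit(0);
        }
        close(client);                      // parent just keeps accepting
    }
}
```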
Mar 03 2016
prev sibling next sibling parent Brad Anderson <eco gnuk.net> writes:
On Thursday, 3 March 2016 at 17:31:59 UTC, Andrei Alexandrescu 
wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
Related to this, I watched a talk about user-level threads by another Google engineer a while back (also named Paul, but not the same person). https://www.youtube.com/watch?v=KXuZi9aeGTw It's been a while so I may have this a bit wrong, but as I remember it the gist was that OSes don't have enough information for efficient scheduling and context switching of threads, but if the OS lets the application handle it, the context switching cost drops significantly (down to roughly 200ns). I came across it while following some Rust development, which was a very interesting read: https://mail.mozilla.org/pipermail/rust-dev/2013-November/006550.html It's a bit over two years old now so I'm not sure what path Rust took.
Mar 03 2016
prev sibling next sibling parent Ali Çehreli <acehreli yahoo.com> writes:
On 03/03/2016 09:31 AM, Andrei Alexandrescu wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
Another interesting architecture is LMAX Disruptor: https://lmax-exchange.github.io/disruptor/

Ali
Mar 03 2016
prev sibling next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 3 March 2016 at 17:31:59 UTC, Andrei Alexandrescu 
wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
Not an expert on the subject, but FWIW:

1. This is from 2008.
2. Seems to be highly specific to Java.
3. The first benchmark essentially measures the overhead of fiber context switching and nothing else.
4. If I tried to write DFeed using this model, it would have been a complete trainwreck (there is too much global state and interaction between components to reason about using locks).
Mar 03 2016
parent deadalnix <deadalnix gmail.com> writes:
On Thursday, 3 March 2016 at 20:31:51 UTC, Vladimir Panteleev 
wrote:
 3. The first benchmark essentially measures the overhead of 
 fiber context switching and nothing else
Ha yes, forgot that. Many JVMs use fibers instead of threads internally.
Mar 03 2016
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 3 March 2016 at 17:31:59 UTC, Andrei Alexandrescu 
wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
A lot of the data presented is kind of skewed. For instance, the synchronization costs across cores are measured at 0% writes. It comes as no surprise that synchronizing read-only data is cheap, but I don't think the author is justified in concluding that synchronization is cheap in the general case.

"new threads are created if all existing are busy but only up to MAX_INT threads"

More likely only up to the point where you run out of file handles.

You also have to note that the context switches are measured on an Opteron that has an 81-cycle iret (it is 330 on Haswell), which would explain why he finds them way cheaper than expected. The Opteron also has 2x64kb of L1 cache, which probably reduces the cache thrashing due to context switches, at the cost of single-threaded performance (which doesn't seem like a very good tradeoff for a CPU). In short, the CPU used is good at context switches, especially compared to modern Intels.

The other results are not that surprising, like blocking IO being faster than non-blocking IO; the gain from async comes from multiplexing IO, not from making every individual IO operation faster. Or, in server terms, it'll change your DB-bound frontend into a CPU-bound one and offload scaling onto the DB. If you don't have another component limiting your server, going async is not going to improve things (like if you are already CPU bound).
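As a hedged aside (not from the slides; serveAll and the buffer size are made up for illustration), the multiplexing point can be seen with std.socket's select wrapper: a single thread watches many connections at once, so no individual read gets faster, you simply stop paying one blocked thread per idle connection:

```d
import std.socket : Socket, SocketSet;

// Serve many already-accepted connections from a single thread.
void serveAll(Socket[] clients)
{
    auto readSet = new SocketSet();
    ubyte[4096] buf;

    while (clients.length)
    {
        readSet.reset();
        foreach (c; clients)
            readSet.add(c);

        // One system call waits for activity on all connections at once.
        Socket.select(readSet, null, null);

        foreach (c; clients)
        {
            if (!readSet.isSet(c))
                continue;
            auto n = c.receive(buf[]);
            if (n <= 0)
                c.close();   // peer gone or error; real code would drop it
            // ... otherwise handle buf[0 .. n] for this connection ...
        }
    }
}
```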
Mar 03 2016
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 03/03/2016 06:01 PM, deadalnix wrote:

 2x64kb L1 cache, which probably reduce the cache trashing due
 to context switches
(I am speaking without measuring anything.) I imagine that lost cache is one of the biggest costs in thread switching. It would be great if a thread could pick what runs next, with something like "I'm done, now please switch to my reader". And that's exactly one of the benefits of fibers: two workers ping-pong back and forth without much risk of losing their cached data.

Is my assumption correct?

Ali
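A minimal sketch of that ping-pong idea using D's core.thread.Fiber (illustrative only; the names and the toy payload are made up): the writer hands control straight to its reader, so the value it just wrote is likely still hot in cache when the reader runs.

```d
import core.thread : Fiber;
import std.stdio : writeln;

void main()
{
    int sharedValue;
    Fiber reader;

    reader = new Fiber({
        foreach (i; 0 .. 3)
        {
            writeln("reader sees ", sharedValue);
            Fiber.yield();            // hand control back to the writer
        }
    });

    auto writer = new Fiber({
        foreach (i; 0 .. 3)
        {
            sharedValue = i * 10;     // freshly written, still in cache...
            reader.call();            // ...when the reader picks it up
            Fiber.yield();            // back to main between rounds
        }
    });

    foreach (i; 0 .. 3)
        writer.call();                // drive one ping-pong round per call
}
```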
Mar 03 2016
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 4 March 2016 at 03:14:01 UTC, Ali Çehreli wrote:
 I imagine that lost cache is one of the biggest costs in thread 
 switching. It would be great if a thread could select a thread 
 with something like "I'm done, now please switch to my reader". 
 And that's exactly one of the benefits of fibers: two workers 
 ping pong back and forth, without much risk of losing their 
 cached data.

 Is my assumption correct?

 Ali
The minimal cost of a context switch is one TLB miss (~300 cycles), one cache miss (~300) and an iret (~300). But then, you usually don't do a context switch for nothing, so some work has to be done in the context switch. This is where it gets very dependent on which system call you use.

During its work, the system call will have evicted various cache lines and TLB entries to put its own data there. That means that after the context switch is done, part of your cache is gone, and you'll need some time to get back up to full speed. You may think that it is the same with a library call, and to some extent it is, but much worse: as kernel and userspace do not share the same address space and access rights, so you typically get way more thrashing (you can't reuse TLB entries at all, for instance).

Actual numbers will vary from one chip to the other, but the general idea remains.
Mar 04 2016
parent reply Shachar Shemesh <shachar weka.io> writes:
For switching between threads, this seems wrong.

On 04/03/16 20:29, deadalnix wrote:

 The minimal cost of a context switch is one TLB miss (~300), one cache
 miss (~300)
Why do you claim the first two? Again, assuming threads of the same process, where the address space (minus the TLS) is the same.
 and iret (~300).
You have an iret whether you switched or not. Going into the kernel and coming back has that cost, whether we switched or not. In particular, since you need a system call in order to call "read", "write" and "accept", these are there regardless. BTW, using one thread per client, you need fewer system calls, which reduces that particular overhead.
 But then, you usually don't do context
 switch for nothing, so some work has to be done in the context switch.
 This is where it gets very dependent on which system call you use.

 During its work, the system call would have evicted various cache lines
 and TLB entries to put its own data there. That means that after the
 context switch is done, part of your cache is gone, and you'll need some
 time to start again running full speed.
I'm having a hard time following the discussion thread to understand what you are replying to here, but surely, you made the system call because you needed it to be done.
 You may think that it is the
 same with a library call, and to some extent it is, but much worse: as
 kernel and userspace do not share the same address space
I hate that myth! It is such a difficult one to kill. The kernel absolutely uses the same address space as the userspace (whatever user space happens to be mapped at the time). That is why we had the 3/1GB split in the bad old 32-bit days: the 1GB was a chunk of address space, bitten off the usable user-space address space, that has the same mapping regardless of which user space is mapped in. This allows the kernel to use the same virtual addresses without loading new TLB entries. In other words, performing a system call is not a context switch.
 and access rights,
That much is true, but it is merely a bit change in a register. It has no cache implications.
 so you typically get way more thrashing (you can't reuse TLB 
 entries at all for instance).
Like I said, not true. As a side note, some platforms have cookies inside the TLB, so you don't have to flush it even when you do an actual context switch, but I'm guessing we're discussing Intel here. Shachar
Mar 06 2016
parent Ola Fosheim Grøstad writes:
On Sunday, 6 March 2016 at 13:21:41 UTC, Shachar Shemesh wrote:
 You have an iret whether you switched or not. Going into the 
 kernel and coming back has that, whether we switched or not. In 
 particular, since you need a system call in order to call 
 "read", "write" and "accept", these are there regardless.
Yes, the rule of thumb seems to be that a regular syscall can be expected to have 200-1000 cycles of overhead, and 2000+ cycles on a virtualized setup. But some low-level interfaces allow you to submit multiple I/O ops in a single system call.
Mar 07 2016
prev sibling parent reply Ola Fosheim Grøstad writes:
On Friday, 4 March 2016 at 03:14:01 UTC, Ali Çehreli wrote:
 And that's exactly one of the benefits of fibers: two workers 
 ping pong back and forth, without much risk of losing their 
 cached data.

 Is my assumption correct?
Not if it is hyper-threaded, as pairs of threads are sharing resources. The only advantage of fibers is modelling, not performance. If you want max performance you need more control.

But typical real-world applications are far away from max theoretical performance, so maybe the impact isn't as great in a specific application for a given suboptimal pattern. I don't think you can measure or benchmark whether you are doing objectively well or not, in an absolute sense. For that you need to calculate the theoretical throughput and then see how far away you are in the real world and explain why. Otherwise you are just benchmarking a suboptimal pattern, which makes sense when optimizing an existing application, but makes less sense when designing a new database engine from scratch.
Mar 04 2016
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 4 March 2016 at 22:22:48 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 4 March 2016 at 03:14:01 UTC, Ali Çehreli wrote:.
 And that's exactly one of the benefits of fibers: two workers 
 ping pong back and forth, without much risk of losing their 
 cached data.

 Is my assumption correct?
Not if it is hyper-threaded, as pairs of threads are sharing resources.
Irrelevant.
 Otherwise you are just benchmarking a suboptimal pattern. Which 
 makes sense when optimizing an existing application, but makes 
 less sense when designing a new database engine from scratch.
Async does not make any sense in a DB. You want async when you are bound by a 3rd-party system. For instance, if you have a frontend server that is bound by DB access, async will allow you to shift your frontend code from DB bound to CPU bound, getting you much more bang for the buck in your frontends. Now, if your limiting factor is IO on your own machine, like say local disk access, then it's going to change nothing (in fact, it may make things much worse, as disks prefer sequential accesses).
Mar 04 2016
parent Ola Fosheim Grøstad writes:
On Friday, 4 March 2016 at 23:10:28 UTC, deadalnix wrote:
 Not if it is hyper-threaded, as pairs of threads are sharing 
 resources.
 Irrelevant.
I knew you would say that, but you are wrong.
 async do not make any sense on a DB. You want async when you 
 are bound by a 3rd party system.
Err... no. You want async when you are CPU-bound and are looking for better latency and more control. On Linux you can use async to bypass the kernel caches and access disk blocks directly, basically reimplementing in your user-space application things that are usually done in kernel space (at the cost of adding significant complexity to your application). What you get out of it depends on what you want, whether you use a distributed setup, the drivers, the hardware and many other factors.
Mar 05 2016
prev sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Fri, 04 Mar 2016 22:22:48 +0000, Ola Fosheim Grøstad wrote:

 On Friday, 4 March 2016 at 03:14:01 UTC, Ali Çehreli wrote:.
 And that's exactly one of the benefits of fibers: two workers ping pong
 back and forth, without much risk of losing their cached data.

 Is my assumption correct?
Not if it is hyper-threaded, as pairs of threads are sharing resources. The only advantage of fibers is modelling, not performance. If you want max performance you need more control.
You can get that control by interacting with the scheduler. Thread schedulers tend to be in the kernel and fiber schedulers tend to be in userspace, so as a practical matter, it should be easier to get that control with fibers. Assuming your framework gives you access to the scheduler or lets you write your own. D does.
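For the record, a minimal sketch of what that looks like with D's std.concurrency (hedged illustration, not anything from the post; the worker name is made up). Installing a FiberScheduler makes spawn() create fibers scheduled in userspace, and the Scheduler interface is the hook for plugging in your own policy:

```d
import std.concurrency;
import std.stdio : writeln;

void worker()
{
    // Under FiberScheduler this runs as a fiber, and receive() yields
    // the fiber instead of blocking a kernel thread.
    receive((int x) { writeln("worker got ", x); });
}

void main()
{
    scheduler = new FiberScheduler;   // swap in userspace scheduling
    scheduler.start({
        auto tid = spawn(&worker);    // a fiber, not a new kernel thread
        tid.send(42);
    });
}
```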
Mar 04 2016
parent Ola Fosheim Grøstad writes:
On Friday, 4 March 2016 at 23:26:02 UTC, Chris Wright wrote:
 You can get that control by interacting with the scheduler. 
 Thread schedulers tend to be in the kernel and fiber schedulers 
 tend to be in userspace, so as a practical matter, it should be 
 easier to get that control with fibers.
But in order to get the benefits of hyperthreading you need to schedule compatible workloads on the logical pair that shares a physical core, so you need to think about scheduling groups of workloads/fibers, not individual workloads/fibers.
Mar 04 2016
prev sibling next sibling parent reply Shachar Shemesh <shachar weka.io> writes:
On 03/03/16 19:31, Andrei Alexandrescu wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
You just stepped on a pet peeve of mine. NIO isn't async IO. It's non-blocking IO. Many people (including Microsoft's MSDN) confuse the two, but they are completely and utterly different, and in fact quite orthogonal (though the blocking async combination makes no sense, which is possibly the source of the confusion).

Blocking IO means that if a request cannot be served immediately, your thread blocks until it can be served. Try to read from a socket with no data, and the call to "read" will only return once data arrives.

Synchronous IO means that a call only returns once it can be said, on some level, to have been performed. If you call "send" on a socket in synchronous mode, then, depending on the socket type, it will return after it has sent out the information (without waiting for an ACK), or, at the very least, after having copied the information into the kernel's buffers. If you call an asynchronous version of send, the copying of the information into the socket may take place after the system call has already returned.

In Linux, for example, non-blocking mode is activated using fcntl (O_NONBLOCK). Asynchronous operations are separate and distinct system calls (aio_*, io_* and vmsplice). Using non-blocking mode introduces some challenges, but is, generally speaking, not very complicated (particularly when using fibers). Using asynchronous IO is considerably more complicated, as you need to keep track of which of your buffers currently have async operations in flight on them. Not many systems are built around async IO. This may change, as most of the user-space-only low-latency IO solutions (such as DPDK) are async by virtue of their hardware requirements.

On a completely different note, me and a colleague started a proof of concept to disprove the claim that blocking+threads is slower. We did manage to service several tens of thousands of simultaneous connections using the one thread per client mode, proving that the mere context switch is not the problem here. Sadly, we never got around to bringing the code and the tests up to the level where we can publish something, but, at least in the case where a system call is needed in order to decide when to switch fibers, I dispute the claim that non-blocking is inherently faster.

Shachar
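To make the Linux part concrete, here is a hedged sketch using D's POSIX bindings (makeNonBlocking and tryRead are made-up names for illustration): non-blocking mode is just a flag set with fcntl(O_NONBLOCK), after which the very same read() call returns EAGAIN instead of blocking, whereas true async IO would go through the separate aio_*/io_* submission calls.

```d
import core.stdc.errno : EAGAIN, EWOULDBLOCK, errno;
import core.sys.posix.fcntl : F_GETFL, F_SETFL, O_NONBLOCK, fcntl;
import core.sys.posix.unistd : read;

/// Flip an already-open descriptor into non-blocking mode.
bool makeNonBlocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

/// Same read() call as in blocking mode; it just refuses to wait.
ptrdiff_t tryRead(int fd, void[] buf)
{
    auto n = read(fd, buf.ptr, buf.length);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;   // nothing available yet; a fiber would yield here
    return n;       // byte count, 0 on EOF, or -1 on a real error
}
```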
Mar 03 2016
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Friday, 4 March 2016 at 06:55:29 UTC, Shachar Shemesh wrote:
 On 03/03/16 19:31, Andrei Alexandrescu wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf
On a completely different note, me and a colleague started a proof of concept to disprove the claim that blocking+threads is slower. We did manage to service several tens of thousands of simultaneous connections using the one thread per client mode, proving that the mere context switch is not the problem here. Sadly, we never got around to bringing the code and the tests up to the level where we can publish something, but, at least in the case where a system call is needed in order to decide when to switch fibers, I dispute the claim that non-blocking is inherently faster.
It's not inherently faster. It just scales better in real-world scenarios, assuming a limited number of worker threads. Blocking IO is actually often fine in synthetic tests and in server to server, but in real world situations where some clients have modem-level bandwidth, a blocking write can ruin you.
Mar 05 2016
parent Shachar Shemesh <shachar weka.io> writes:
On 05/03/16 20:50, Sean Kelly wrote:
 On Friday, 4 March 2016 at 06:55:29 UTC, Shachar Shemesh wrote:
 On 03/03/16 19:31, Andrei Alexandrescu wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf
On a completely different note, me and a colleague started a proof of concept to disprove the claim that blocking+threads is slower. We did manage to service several tens of thousands of simultaneous connections using the one thread per client mode, proving that the mere context switch is not the problem here. Sadly, we never got around to bringing the code and the tests up to the level where we can publish something, but, at least in the case where a system call is needed in order to decide when to switch fibers, I dispute the claim that non-blocking is inherently faster.
It's not inherently faster. It just scales better in real-world scenarios, assuming a limited number of worker threads. Blocking IO is actually often fine in synthetic tests and in server to server, but in real world situations where some clients have modem-level bandwidth, a blocking write can ruin you.
Not if it's in its own thread, or am I missing something?

Shachar
Mar 06 2016
prev sibling parent Sönke Ludwig <sludwig outerproduct.org> writes:
Am 03.03.2016 um 18:31 schrieb Andrei Alexandrescu:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
A few points that come to mind:

- Comparing random different high-level libraries is bound to give results that measure abstraction overhead/non-optimal system API use. Comparing on a JVM instead of bare metal might skew the results further (e.g. some JIT optimizations not kicking in due to the use of callbacks, or something like that). It would be interesting to redo the benchmark in C/D using plain system APIs.

- Comparing single-threaded NBIO to multi-threaded BIO is obviously wrong when measuring peak performance. NBIO should use a pool of one thread per core, each running an event/select loop, or alternatively one process per core. The "Make better use of multi-cores" pro-BIO argument is pointless for that same reason.

- Missing any hints about how the benchmark was performed (e.g. send()/recv() chunk size). For anything other than tiny packets, NBIO for sure is not measurably slower than BIO. Latency may be a bit worse, but that reverses once many connections come into play.

- The "simpler to write" argument also breaks down when adding fibers to the mix.

- The main argument for NBIO is that threads are relatively heavy system resources, context switches are rather expensive, and threads are limited in their number (irrespective of the amount of RAM). Depending on the kernel, the scheduler overhead may also grow with the number of threads. For small numbers of connections, blocking IO for sure is perfectly fine, as long as synchronization overhead isn't an issue.

- AIO/NBIO+fibers also allows further reducing the memory footprint by detaching the connection from the fiber between requests (e.g. for a keep-alive HTTP connection). This isn't possible with blocking IO.

- The optimal approach always depends on the system being modelled; NBIO+fibers simply gives the maximum flexibility in that regard. You can let fibers run in isolation on different threads, use synchronization between them, or you can have concurrency without CPU-level synchronization overhead within the same thread. Especially the latter can become really interesting with thread-local memory allocators etc. It also becomes really interesting in situations where thread synchronization gets difficult (lock-less structures).
Mar 05 2016