
digitalmars.D - A very interesting slide deck comparing sync and async IO

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
https://www.mailinator.com/tymaPaulMultithreaded.pdf

Andrei
Mar 03 2016
next sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 3 March 2016 at 17:31:59 UTC, Andrei Alexandrescu 
wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf
Slide decks are so unbelievably bad at conveying information. But, looking through it, I think I agree. There's a reason why I stick with my cgi.d - using a simple process/thread model is so much easier and actually performs fine. fork() is very cheap on modern Linux and if you use it, the kernel takes care of basically all the hard stuff for you!
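As a hedged illustration of that process-per-connection model (this is not cgi.d's actual code; the port and reply are made up, and error handling plus child reaping are omitted), a fork()-based accept loop in D looks roughly like this:

```d
import core.sys.posix.arpa.inet : htons;
import core.sys.posix.netinet.in_ : sockaddr_in;
import core.sys.posix.sys.socket : AF_INET, SOCK_STREAM, accept, bind, listen, sockaddr, socket;
import core.sys.posix.unistd : _exit, close, fork, write;

void main()
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in addr;
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);            // INADDR_ANY is the zero default
    bind(listener, cast(sockaddr*) &addr, cast(uint) addr.sizeof);
    listen(listener, 128);

    while (true)
    {
        int client = accept(listener, null, null);
        if (fork() == 0)
        {
            // Child: the kernel handles scheduling and isolation for us.
            close(listener);
            enum msg = "hello\n";
            write(client, msg.ptr, msg.length);
            close(client);
            _exit(0);
        }
        close(client);                      // parent just keeps accepting
    }
}
```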
Mar 03 2016
prev sibling next sibling parent Brad Anderson <eco gnuk.net> writes:
On Thursday, 3 March 2016 at 17:31:59 UTC, Andrei Alexandrescu 
wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
Related to this, I watched a talk about user-level threads by another Google engineer a while back (also named Paul, but not the same person). https://www.youtube.com/watch?v=KXuZi9aeGTw It's been a while so I may have this a bit wrong, but as I remember it the gist was that OSes don't have enough information for efficient scheduling and context switching of threads, but if the OS lets the application handle it, the context switching cost drops significantly (down to roughly 200ns). I came across it while following some Rust development, which was a very interesting read: https://mail.mozilla.org/pipermail/rust-dev/2013-November/006550.html It's a bit over two years old now so I'm not sure what path Rust took.
Mar 03 2016
prev sibling next sibling parent Ali Çehreli <acehreli yahoo.com> writes:
On 03/03/2016 09:31 AM, Andrei Alexandrescu wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
Another interesting architecture is LMAX Disruptor: https://lmax-exchange.github.io/disruptor/

Ali
Mar 03 2016
prev sibling next sibling parent reply Vladimir Panteleev <thecybershadow.lists gmail.com> writes:
On Thursday, 3 March 2016 at 17:31:59 UTC, Andrei Alexandrescu 
wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
Not an expert on the subject, but FWIW:

1. This is from 2008.
2. Seems to be highly specific to Java.
3. The first benchmark essentially measures the overhead of fiber context switching and nothing else.
4. If I tried to write DFeed using this model, it would have been a complete trainwreck (there is too much global state and interaction between components to reason about using locks).
Mar 03 2016
parent deadalnix <deadalnix gmail.com> writes:
On Thursday, 3 March 2016 at 20:31:51 UTC, Vladimir Panteleev 
wrote:
 3. The first benchmark essentially measures the overhead of 
 fiber context switching and nothing else
Ha yes, forgot that. Many JVMs use fibers instead of threads internally.
Mar 03 2016
prev sibling next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Thursday, 3 March 2016 at 17:31:59 UTC, Andrei Alexandrescu 
wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
A lot of the data presented is kind of skewed. For instance, the synchronization costs across cores are measured at 0% writes. It comes as no surprise that synchronizing read-only data is cheap, but I don't think the author is justified in concluding that synchronization is cheap in the general case.

"new threads are created if all existing are busy but only up to MAX_INT threads"

More likely only up to the point where you run out of file handles.

You also have to note that the context switches are measured on an Opteron that has an 81-cycle iret (it is 330 on Haswell), which would explain why he finds them way cheaper than expected. The Opteron also has 2x64kb of L1 cache, which probably reduces the cache thrashing due to context switches, at the cost of single-threaded performance (which doesn't seem like a very good tradeoff for a CPU). In short, the CPU used is good at context switches, especially compared to modern Intels.

The other results are not that surprising, like blocking IO being faster than non-blocking IO; the gain from async comes from multiplexing IO, not from making every individual IO operation faster. Or, in server terms, it'll change your DB-bound frontend into a CPU-bound one and offload scaling onto the DB. If you don't have another component limiting your server, going async is not going to improve things (like if you are already CPU bound).
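As a hedged aside (not from the slides; serveAll and the buffer size are made up for illustration), the multiplexing point can be seen with std.socket's select wrapper: a single thread watches many connections at once, so no individual read gets faster, you simply stop paying one blocked thread per idle connection:

```d
import std.socket : Socket, SocketSet;

// Serve many already-accepted connections from a single thread.
void serveAll(Socket[] clients)
{
    auto readSet = new SocketSet();
    ubyte[4096] buf;

    while (clients.length)
    {
        readSet.reset();
        foreach (c; clients)
            readSet.add(c);

        // One system call waits for activity on all connections at once.
        Socket.select(readSet, null, null);

        foreach (c; clients)
        {
            if (!readSet.isSet(c))
                continue;
            auto n = c.receive(buf[]);
            if (n <= 0)
                c.close();   // peer gone or error; real code would drop it
            // ... otherwise handle buf[0 .. n] for this connection ...
        }
    }
}
```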
Mar 03 2016
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 03/03/2016 06:01 PM, deadalnix wrote:

 2x64kb L1 cache, which probably reduce the cache trashing due
 to context switches
(I am speaking without measuring anything.) I imagine that lost cache is one of the biggest costs in thread switching. It would be great if a thread could pick what runs next, with something like "I'm done, now please switch to my reader". And that's exactly one of the benefits of fibers: two workers ping-pong back and forth without much risk of losing their cached data.

Is my assumption correct?

Ali
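A minimal sketch of that ping-pong idea using D's core.thread.Fiber (illustrative only; the names and the toy payload are made up): the writer hands control straight to its reader, so the value it just wrote is likely still hot in cache when the reader runs.

```d
import core.thread : Fiber;
import std.stdio : writeln;

void main()
{
    int sharedValue;
    Fiber reader;

    reader = new Fiber({
        foreach (i; 0 .. 3)
        {
            writeln("reader sees ", sharedValue);
            Fiber.yield();            // hand control back to the writer
        }
    });

    auto writer = new Fiber({
        foreach (i; 0 .. 3)
        {
            sharedValue = i * 10;     // freshly written, still in cache...
            reader.call();            // ...when the reader picks it up
            Fiber.yield();            // back to main between rounds
        }
    });

    foreach (i; 0 .. 3)
        writer.call();                // drive one ping-pong round per call
}
```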
Mar 03 2016
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 4 March 2016 at 03:14:01 UTC, Ali Çehreli wrote:
 I imagine that lost cache is one of the biggest costs in thread 
 switching. It would be great if a thread could select a thread 
 with something like "I'm done, now please switch to my reader". 
 And that's exactly one of the benefits of fibers: two workers 
 ping pong back and forth, without much risk of losing their 
 cached data.

 Is my assumption correct?

 Ali
The minimal cost of a context switch is one TLB miss (~300 cycles), one cache miss (~300) and an iret (~300). But then, you usually don't do a context switch for nothing, so some work has to be done in the context switch. This is where it gets very dependent on which system call you use.

During its work, the system call will have evicted various cache lines and TLB entries to put its own data there. That means that after the context switch is done, part of your cache is gone, and you'll need some time to get back up to full speed. You may think that it is the same with a library call, and to some extent it is, but much worse: as kernel and userspace do not share the same address space and access rights, so you typically get way more thrashing (you can't reuse TLB entries at all, for instance).

Actual numbers will vary from one chip to the other, but the general idea remains.
Mar 04 2016
parent reply Shachar Shemesh <shachar weka.io> writes:
For switching between threads, this seems wrong.

On 04/03/16 20:29, deadalnix wrote:

 The minimal cost of a context switch is one TLB miss (~300), one cache
 miss (~300)
Why do you claim the first two? Again, assuming threads of the same process, where the address space (minus the TLS) is the same.
 and iret (~300).
You have an iret whether you switched or not. Going into the kernel and coming back has that cost, whether we switched or not. In particular, since you need a system call in order to call "read", "write" and "accept", these are there regardless. BTW, using one thread per client, you need fewer system calls, which reduces that particular overhead.
 But then, you usually don't do context
 switch for nothing, so some work has to be done in the context switch.
 This is where it gets very dependent on which system call you use.

 During its work, the system call would have evicted various cache lines
 and TLB entries to put its own data there. That means that after the
 context switch is done, part of your cache is gone, and you'll need some
 time to start again running full speed.
I'm having a hard time following the discussion thread to understand what you are replying to here, but surely, you made the system call because you needed it to be done.
 You may think that it is the
 same with a library call, and to some extent it is, but much worse: as
 kernel and userspace do not share the same address space
I hate that myth! It is such a difficult one to kill. The kernel absolutely uses the same address space as the userspace (whatever user space happens to be mapped at the time). That is why we had the 3/1GB split in the bad old 32-bit days: the 1GB was a chunk of address space, bitten off the usable user-space address space, that has the same mapping regardless of which user space is mapped in. This allows the kernel to use the same virtual addresses without loading new TLB entries. In other words, performing a system call is not a context switch.
 and access rights,
That much is true, but it is merely a bit change in a register. It has no cache implications.
 so you typically get way more thrashing (you can't reuse TLB 
 entries at all for instance).
Like I said, not true. As a side note, some platforms have cookies inside the TLB, so you don't have to flush it even when you do an actual context switch, but I'm guessing we're discussing Intel here. Shachar
Mar 06 2016
parent Ola Fosheim Grøstad writes:
On Sunday, 6 March 2016 at 13:21:41 UTC, Shachar Shemesh wrote:
 You have an iret whether you switched or not. Going into the 
 kernel and coming back has that, whether we switched or not. In 
 particular, since you need a system call in order to call 
 "read", "write" and "accept", these are there regardless.
Yes, the rule of thumb seems to be that a regular syscall can be expected to have 200-1000 cycles of overhead, and 2000+ cycles on a virtualized setup. But some low-level interfaces allow you to submit multiple I/O ops in a single system call.
Mar 07 2016
prev sibling parent reply Ola Fosheim Grøstad writes:
On Friday, 4 March 2016 at 03:14:01 UTC, Ali Çehreli wrote:
 And that's exactly one of the benefits of fibers: two workers 
 ping pong back and forth, without much risk of losing their 
 cached data.

 Is my assumption correct?
Not if it is hyper-threaded, as pairs of threads are sharing resources. The only advantage of fibers is modelling, not performance. If you want max performance you need more control.

But typical real-world applications are far away from max theoretical performance, so maybe the impact isn't as great in a specific application for a given suboptimal pattern. I don't think you can measure or benchmark whether you are doing objectively well or not, in an absolute sense. For that you need to calculate the theoretical throughput and then see how far away you are in the real world and explain why. Otherwise you are just benchmarking a suboptimal pattern, which makes sense when optimizing an existing application, but makes less sense when designing a new database engine from scratch.
Mar 04 2016
next sibling parent reply deadalnix <deadalnix gmail.com> writes:
On Friday, 4 March 2016 at 22:22:48 UTC, Ola Fosheim Grøstad 
wrote:
 On Friday, 4 March 2016 at 03:14:01 UTC, Ali Çehreli wrote:.
 And that's exactly one of the benefits of fibers: two workers 
 ping pong back and forth, without much risk of losing their 
 cached data.

 Is my assumption correct?
Not if it is hyper-threaded, as pairs of threads are sharing resources.
Irrelevant.
 Otherwise you are just benchmarking a suboptimal pattern. Which 
 makes sense when optimizing an existing application, but makes 
 less sense when designing a new database engine from scratch.
Async does not make any sense in a DB. You want async when you are bound by a 3rd-party system. For instance, if you have a frontend server that is bound by DB access, async will allow you to shift your frontend code from DB bound to CPU bound, getting you much more bang for the buck in your frontends. Now, if your limiting factor is IO on your own machine, like say local disk access, then it's going to change nothing (in fact, it may make things much worse, as disks prefer sequential accesses).
Mar 04 2016
parent Ola Fosheim Grøstad writes:
On Friday, 4 March 2016 at 23:10:28 UTC, deadalnix wrote:
 Not if it is hyper-threaded, as pairs of threads are sharing 
 resources.
 Irrelevant.
I knew you would say that, but you are wrong.
 async do not make any sense on a DB. You want async when you 
 are bound by a 3rd party system.
Err... no. You want async when you are CPU-bound and are looking for better latency and more control. On Linux you can use async to bypass the kernel caches and access disk blocks directly, basically reimplementing in your user-space application things that are usually done in kernel space (at the cost of adding significant complexity to your application). What you get out of it depends on what you want, whether you use a distributed setup, the drivers, the hardware and many other factors.
Mar 05 2016
prev sibling parent reply Chris Wright <dhasenan gmail.com> writes:
On Fri, 04 Mar 2016 22:22:48 +0000, Ola Fosheim Grøstad wrote:

 On Friday, 4 March 2016 at 03:14:01 UTC, Ali Çehreli wrote:.
 And that's exactly one of the benefits of fibers: two workers ping pong
 back and forth, without much risk of losing their cached data.

 Is my assumption correct?
Not if it is hyper-threaded, as pairs of threads are sharing resources. The only advantage of fibers is modelling, not performance. If you want max performance you need more control.
You can get that control by interacting with the scheduler. Thread schedulers tend to be in the kernel and fiber schedulers tend to be in userspace, so as a practical matter, it should be easier to get that control with fibers. Assuming your framework gives you access to the scheduler or lets you write your own. D does.
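For the record, a minimal sketch of what that looks like with D's std.concurrency (hedged illustration, not anything from the post; the worker name is made up). Installing a FiberScheduler makes spawn() create fibers scheduled in userspace, and the Scheduler interface is the hook for plugging in your own policy:

```d
import std.concurrency;
import std.stdio : writeln;

void worker()
{
    // Under FiberScheduler this runs as a fiber, and receive() yields
    // the fiber instead of blocking a kernel thread.
    receive((int x) { writeln("worker got ", x); });
}

void main()
{
    scheduler = new FiberScheduler;   // swap in userspace scheduling
    scheduler.start({
        auto tid = spawn(&worker);    // a fiber, not a new kernel thread
        tid.send(42);
    });
}
```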
Mar 04 2016
parent Ola Fosheim Grøstad writes:
On Friday, 4 March 2016 at 23:26:02 UTC, Chris Wright wrote:
 You can get that control by interacting with the scheduler. 
 Thread schedulers tend to be in the kernel and fiber schedulers 
 tend to be in userspace, so as a practical matter, it should be 
 easier to get that control with fibers.
But in order to get the benefits of hyperthreading you need to schedule compatible workloads on the logical pair that shares a physical core, so you need to think about scheduling groups of workloads/fibers, not individual workloads/fibers.
Mar 04 2016
prev sibling next sibling parent reply Shachar Shemesh <shachar weka.io> writes:
On 03/03/16 19:31, Andrei Alexandrescu wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
You just stepped on a pet peeve of mine. NIO isn't async IO. It's non-blocking IO. Many people (including Microsoft's MSDN) confuse the two, but they are completely and utterly different, and in fact quite orthogonal (though the blocking async combination makes no sense, which is possibly the source of the confusion).

Blocking IO means that if a request cannot be served immediately, your thread blocks until it can be served. Try to read from a socket with no data, and the call to "read" will only return once data arrives.

Synchronous IO means that a call only returns once it can be said, on some level, to have been performed. If you call "send" on a socket in synchronous mode, then, depending on the socket type, it will return after it has sent out the information (without waiting for an ACK), or, at the very least, after having copied the information into the kernel's buffers. If you call an asynchronous version of send, the copying of the information into the socket may take place after the system call has already returned.

In Linux, for example, non-blocking mode is activated using fcntl (O_NONBLOCK). Asynchronous operations are separate and distinct system calls (aio_*, io_* and vmsplice). Using non-blocking mode introduces some challenges, but is, generally speaking, not very complicated (particularly when using fibers). Using asynchronous IO is considerably more complicated, as you need to keep track of which of your buffers currently have async operations in flight on them. Not many systems are built around async IO. This may change, as most of the user-space-only low-latency IO solutions (such as DPDK) are async by virtue of their hardware requirements.

On a completely different note, me and a colleague started a proof of concept to disprove the claim that blocking+threads is slower. We did manage to service several tens of thousands of simultaneous connections using the one thread per client mode, proving that the mere context switch is not the problem here. Sadly, we never got around to bringing the code and the tests up to the level where we can publish something, but, at least in the case where a system call is needed in order to decide when to switch fibers, I dispute the claim that non-blocking is inherently faster.

Shachar
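To make the Linux part concrete, here is a hedged sketch using D's POSIX bindings (makeNonBlocking and tryRead are made-up names for illustration): non-blocking mode is just a flag set with fcntl(O_NONBLOCK), after which the very same read() call returns EAGAIN instead of blocking, whereas true async IO would go through the separate aio_*/io_* submission calls.

```d
import core.stdc.errno : EAGAIN, EWOULDBLOCK, errno;
import core.sys.posix.fcntl : F_GETFL, F_SETFL, O_NONBLOCK, fcntl;
import core.sys.posix.unistd : read;

/// Flip an already-open descriptor into non-blocking mode.
bool makeNonBlocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

/// Same read() call as in blocking mode; it just refuses to wait.
ptrdiff_t tryRead(int fd, void[] buf)
{
    auto n = read(fd, buf.ptr, buf.length);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;   // nothing available yet; a fiber would yield here
    return n;       // byte count, 0 on EOF, or -1 on a real error
}
```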
Mar 03 2016
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Friday, 4 March 2016 at 06:55:29 UTC, Shachar Shemesh wrote:
 On 03/03/16 19:31, Andrei Alexandrescu wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf
On a completely different note, me and a colleague started a proof of concept to disprove the claim that blocking+threads is slower. We did manage to service several tens of thousands of simultaneous connections using the one thread per client mode, proving that the mere context switch is not the problem here. Sadly, we never got around to bringing the code and the tests up to the level where we can publish something, but, at least in the case where a system call is needed in order to decide when to switch fibers, I dispute the claim that non-blocking is inherently faster.
It's not inherently faster. It just scales better in real-world scenarios, assuming a limited number of worker threads. Blocking IO is actually often fine in synthetic tests and in server to server, but in real world situations where some clients have modem-level bandwidth, a blocking write can ruin you.
Mar 05 2016
parent Shachar Shemesh <shachar weka.io> writes:
On 05/03/16 20:50, Sean Kelly wrote:
 On Friday, 4 March 2016 at 06:55:29 UTC, Shachar Shemesh wrote:
 On 03/03/16 19:31, Andrei Alexandrescu wrote:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf
On a completely different note, me and a colleague started a proof of concept to disprove the claim that blocking+threads is slower. We did manage to service several tens of thousands of simultaneous connections using the one thread per client mode, proving that the mere context switch is not the problem here. Sadly, we never got around to bringing the code and the tests up to the level where we can publish something, but, at least in the case where a system call is needed in order to decide when to switch fibers, I dispute the claim that non-blocking is inherently faster.
It's not inherently faster. It just scales better in real-world scenarios, assuming a limited number of worker threads. Blocking IO is actually often fine in synthetic tests and in server to server, but in real world situations where some clients have modem-level bandwidth, a blocking write can ruin you.
Not if it's in its own thread, or am I missing something?

Shachar
Mar 06 2016
prev sibling parent Sönke Ludwig <sludwig outerproduct.org> writes:
Am 03.03.2016 um 18:31 schrieb Andrei Alexandrescu:
 https://www.mailinator.com/tymaPaulMultithreaded.pdf

 Andrei
A few points that come to mind:

- Comparing random different high-level libraries is bound to give results that measure abstraction overhead/non-optimal system API use. Comparing on a JVM instead of bare metal might skew the results further (e.g. some JIT optimizations not kicking in due to the use of callbacks, or something like that). It would be interesting to redo the benchmark in C/D using plain system APIs.

- Comparing single-threaded NBIO to multi-threaded BIO is obviously wrong when measuring peak performance. NBIO should use a pool of one thread per core, each running an event/select loop, or alternatively one process per core. The "Make better use of multi-cores" pro-BIO argument is pointless for that same reason.

- Missing any hints about how the benchmark was performed (e.g. send()/recv() chunk size). For anything other than tiny packets, NBIO for sure is not measurably slower than BIO. Latency may be a bit worse, but that reverses once many connections come into play.

- The "simpler to write" argument also breaks down when adding fibers to the mix.

- The main argument for NBIO is that threads are relatively heavy system resources, context switches are rather expensive, and threads are limited in their number (irrespective of the amount of RAM). Depending on the kernel, the scheduler overhead may also grow with the number of threads. For small numbers of connections, blocking IO for sure is perfectly fine, as long as synchronization overhead isn't an issue.

- AIO/NBIO+fibers also allows further reducing the memory footprint by detaching the connection from the fiber between requests (e.g. for a keep-alive HTTP connection). This isn't possible with blocking IO.

- The optimal approach always depends on the system being modelled; NBIO+fibers simply gives the maximum flexibility in that regard. You can let fibers run in isolation on different threads, use synchronization between them, or you can have concurrency without CPU-level synchronization overhead within the same thread. Especially the latter can become really interesting with thread-local memory allocators etc. It also becomes really interesting in situations where thread synchronization gets difficult (lock-less structures).
Mar 05 2016