
digitalmars.D.learn - NGINX Unit and vibe.d Integration Performance

reply Kyle Ingraham <kyle kyleingraham.com> writes:
Hi there,
I'm looking for help with the performance of an integration I'm 
trying to write between NGINX Unit and D. Here are two minimal 
demos I've put together:
- https://github.com/kyleingraham/unit-d-hello-world (NGINX 
Unit/D)
- https://github.com/kyleingraham/unit-vibed-hello-world (NGINX 
Unit/vibe.d)

The first integration achieves ~43k requests per second on my 
computer. That matches what I've been able to achieve with a 
minimal vibe.d project and is, I believe, the maximum my 
benchmark configuration on macOS can hit.

The second, though, only achieves ~20k requests per second. In that 
demo I try to make vibe.d's concurrency system available during 
request handling. NGINX Unit's event loop is run in its own 
thread. When requests arrive, Unit sends them to the main thread 
for handling on vibe.d's event loop. I've tried a few methods to 
increase performance but none have been successful:
- Batching new-request messages to minimize per-message send 
overhead. This increased latency and didn't improve throughput.
- Using vibe.d channels to pass requests. This achieved the same 
performance as message passing. I wasn't able to use the channel 
config that prioritized minimizing overhead as the API didn't 
jive with my use case.
- Using a lock-free queue 
(https://github.com/MartinNowak/lock-free) between threads with a 
loop in the vibe.d thread that constantly polled for requests. 
This method achieves ~43k requests per second but results in 
atrocious CPU usage (roughly sketched below).
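
For context, the lock-free queue variant boiled down to a drain task along 
these lines (a simplified sketch rather than the demo code; `LockFreeQueue`, 
`Request`, and `handleRequest` are placeholders):

```d
import vibe.core.core : runTask, yield;

void startDrainTask(shared(LockFreeQueue!Request) pending)
{
    runTask({
        for (;;)
        {
            Request req;
            // Spin until a request shows up. Latency stays low (~43k req/s)
            // but the task never blocks, which is where the CPU time goes.
            while (!pending.tryPop(req))
                yield(); // give other vibe.d tasks a turn between polls
            handleRequest(req);
        }
    });
}
```

That keeps up with the raw integration but pegs a core even when idle.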

~20k requests per second seems to be the best I can hit with all 
that I've tried. I know vibe.d can do better so I'm thinking 
there's something I'm missing. In profiling I can see that the 
vibe.d thread spends a third of its time in what seems to be 
event loop management code. Am I seeing the effects of Unit's and 
vibe.d's loops being 'out of sync', i.e. some slack time between a 
message being sent and it being acted upon? Is there a better way 
to integrate NGINX Unit with vibe.d?
Oct 27
next sibling parent reply ryuukk_ <ryuukk.dev gmail.com> writes:
Let's take a moment to appreciate how easy it was for you to use 
nginx unit from D

https://github.com/kyleingraham/unit-d-hello-world/blob/main/source/unit_integration.c

ImportC is great
Oct 27
parent Kyle Ingraham <kyle kyleingraham.com> writes:
On Monday, 28 October 2024 at 05:56:32 UTC, ryuukk_ wrote:
 ImportC is great
It really is. Most of my time setting it up was on getting include and linking flags working. Which is exactly what you’d run into using C from C.
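For reference, the dub settings amounted to something like this (the paths and library name here are illustrative rather than the exact values from the demo; dmd's -P= switch forwards flags to the C preprocessor used by ImportC):

```json
"dflags": ["-P=-I/usr/local/include"],
"lflags": ["-L/usr/local/lib"],
"libs": ["unit"]
```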
Oct 28
prev sibling next sibling parent reply Salih Dincer <salihdb hotmail.com> writes:
On Monday, 28 October 2024 at 01:06:58 UTC, Kyle Ingraham wrote:
  ...

 The second though only achieves ~20k requests per second. In 
 that demo I try to make vibe.d's concurrency system available 
 during request handling. NGINX Unit's event loop is run in its 
 own thread. When requests arrive, Unit sends them to the main 
 thread for handling on vibe.d's event loop. I've tried a few 
 methods to increase performance...
Apparently, vibe.d's event loop is not fully compatible with NGINX Unit's loop, causing performance loss. I wonder if it would be wise to use something like an IntrusiveQueue or a task pool to make them compatible? For example, something like this:

```d
alias IQ = IntrusiveQueue;

struct IntrusiveQueue(T)
{
    import core.atomic;

    private {
        T[] buffer;
        size_t head, tail;
        alias acq = MemoryOrder.acq;
        alias rel = MemoryOrder.rel;
    }
    size_t capacity;

    this(size_t capacity)
    {
        this.capacity = capacity;
        buffer.length = capacity;
    }

    alias push = enqueue;
    bool enqueue(T item)
    {
        auto currTail = tail.atomicLoad!acq;
        auto nextTail = (currTail + 1) % capacity;

        if (nextTail == head.atomicLoad!acq) return false; // queue full

        buffer[currTail] = item;
        atomicStore!rel(tail, nextTail);
        return true;
    }

    alias fetch = dequeue;
    bool dequeue(ref T item)
    {
        auto currHead = head.atomicLoad!acq;
        if (currHead == tail.atomicLoad!acq) return false; // queue empty

        auto nextHead = (currHead + 1) % capacity;
        item = buffer[currHead];
        atomicStore!rel(head, nextHead);
        return true;
    }
}

unittest
{
    enum start = 41;
    auto queue = IQ!int(10);
    queue.push(start);
    queue.push(start + 1);

    int item;
    if (queue.fetch(item)) assert(item == start);
    if (queue.fetch(item)) assert(item == start + 1);
}
```

SDB 79
Oct 28
parent reply Kyle Ingraham <kyle kyleingraham.com> writes:
On Monday, 28 October 2024 at 18:37:18 UTC, Salih Dincer wrote:
 Apparently, vibe.d's event loop is not fully compatible with 
 NGINX Unit's loop, causing performance loss. I wonder if it 
 would be wise to use something like an IntrusiveQueue or task 
 pool to make it compatible? For example, something like this:
 ...
You are right that they aren't compatible. Running them in the same thread was a no-go (which makes sense given they both want to control when code is run). How would you suggest reading from the queue you provided in the vibe.d thread? I tried something similar with [lock-free](https://code.dlang.org/packages/lock-free). It was easy to push into the queue efficiently from Unit's thread, but popping from it in vibe.d's was difficult:

- Polling too little killed performance and polling too often wrecked CPU usage.
- Using message passing reduced performance quite a bit.
- Batching reads was hard because it was tricky balancing performance for single requests with performance for streams of them.
Oct 28
parent reply Salih Dincer <salihdb hotmail.com> writes:
On Monday, 28 October 2024 at 19:57:41 UTC, Kyle Ingraham wrote:
 
 - Polling too little killed performance and too often wrecked 
 CPU usage.
 - Using message passing reduced performance quite a bit.
 - Batching reads was hard because it was tricky balancing 
 performance for single requests with performance for streams of 
 them.
Semaphore?

https://demirten-gitbooks-io.translate.goog/linux-sistem-programlama/content/semaphore/operations.html?_x_tr_sl=tr&_x_tr_tl=en&_x_tr_hl=tr&_x_tr_pto=wapp

SDB 79
Oct 28
next sibling parent Salih Dincer <salihdb hotmail.com> writes:
On Monday, 28 October 2024 at 20:53:32 UTC, Salih Dincer wrote:
 Semaphore?
Please see: https://dlang.org/phobos/core_sync_semaphore.html

SDB 79
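P.S. A bare sketch of the idea (the names are illustrative; note that a plain blocking wait() inside the vibe.d thread would also stall its event loop, so the consumer would need its own thread or a try-wait):

```d
import core.sync.semaphore : Semaphore;

__gshared Semaphore requestsReady;

shared static this() { requestsReady = new Semaphore(0); }

// Producer (Unit thread): after pushing a request onto the queue.
void signalRequest() { requestsReady.notify(); }

// Consumer: blocks until at least one request has been signalled.
void waitForRequest() { requestsReady.wait(); }
```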
Oct 28
prev sibling parent reply Kyle Ingraham <kyle kyleingraham.com> writes:
On Monday, 28 October 2024 at 20:53:32 UTC, Salih Dincer wrote:
 On Monday, 28 October 2024 at 19:57:41 UTC, Kyle Ingraham wrote:
 
 - Polling too little killed performance and too often wrecked 
 CPU usage.
 - Using message passing reduced performance quite a bit.
 - Batching reads was hard because it was tricky balancing 
 performance for single requests with performance for streams 
 of them.
Semaphore? https://demirten-gitbooks-io.translate.goog/linux-sistem-programlama/content/semaphore/operations.html?_x_tr_sl=tr&_x_tr_tl=en&_x_tr_hl=tr&_x_tr_pto=wapp SDB 79
I went back to try using a semaphore and ended up using a mutex, an event, and a lock-free queue. My aim was to limit the number of vibe.d events emitted in order to hopefully limit event loop overhead. It works as follows (a rough sketch follows the list):

- Requests come in on the Unit thread and are added to the lock-free queue.
- The Unit thread tries to obtain the mutex. If it cannot, it assumes request processing is in progress on the vibe.d thread and does not emit an event.
- The vibe.d thread waits on an event. Once one arrives, it obtains the mutex and pulls from the lock-free queue until it is empty.
- Once the queue is empty, the vibe.d thread releases the mutex and waits for another event.

This approach increased requests processed per event emitted/waited on from 1:1 to 10:1. It had no impact on event loop overhead however. The entire program still spends ~50% of its runtime in this function: https://github.com/vibe-d/eventcore/blob/0cdddc475965824f32d32c9e4a1dfa58bd616cc9/source/eventcore/drivers/posix/cfrunloop.d#L38. I'll see if I can get images here of my profiling. I'm sure I'm missing something obvious here.
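Roughly, the scheme looks like this (a condensed sketch, not the demo source; `Request`, `handleRequest`, and the `LockFreeQueue` type are placeholders, and the event is vibe.d's shared ManualEvent from vibe.core.sync):

```d
import core.sync.mutex : Mutex;
import vibe.core.core : runTask;
import vibe.core.sync : createSharedManualEvent;

void startIntegration(shared(LockFreeQueue!Request) queue)
{
    auto draining = new Mutex;
    auto requestsArrived = createSharedManualEvent();

    // vibe.d thread: long-running drain task on the main event loop.
    runTask({
        int seen = requestsArrived.emitCount;
        for (;;)
        {
            seen = requestsArrived.wait(seen); // sleeps until emit()
            draining.lock();
            scope (exit) draining.unlock();

            Request req;
            while (queue.tryPop(req)) // drain everything queued so far
                handleRequest(req);
        }
    });

    // Unit thread: called for every request Unit hands over.
    void onUnitRequest(Request req)
    {
        queue.push(req);
        if (draining.tryLock()) // no drain pass running? wake the task
        {
            draining.unlock();
            requestsArrived.emit();
        }
        // otherwise the in-progress drain pass will pick this request up
    }
}
```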
Oct 31
next sibling parent Kyle Ingraham <kyle kyleingraham.com> writes:
On Thursday, 31 October 2024 at 16:43:09 UTC, Kyle Ingraham wrote:
 This approach increased requests processed per events 
 emitted/waited from 1:1 to 10:1. This had no impact on event 
 loop overhead however. The entire program still spends ~50% of 
 its runtime in this function: 
 https://github.com/vibe-d/eventcore/blob/0cdddc475965824f32d32c9e4a1dfa58bd616cc9/source/eventcore/drivers/posix/cfrunloop.d#L38. I'll see if I can get images here of my profiling. I'm sure I'm missing something obvious here.
I forgot to add that once delays are added, my demonstrator and a program using vibe.d's web framework have similar performance numbers. Adding a 10ms sleep resulted in 600 req/s for my demonstrator and 630 req/s for vibe.d. It's encouraging to see the benefit of vibe.d's concurrency system once delays are added. I'd like to be able to use it without drastically affecting throughput in the no-delay case, however.
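(The delay was just a non-blocking sleep in the handler, roughly like this; the handler shown is illustrative:)

```d
import core.time : msecs;
import vibe.core.core : sleep;

void handleRequest(/* request */)
{
    sleep(10.msecs); // yields the fiber, so other requests keep being served
    // ...write the response as usual...
}
```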
Oct 31
prev sibling parent Kyle Ingraham <kyle kyleingraham.com> writes:
On Thursday, 31 October 2024 at 16:43:09 UTC, Kyle Ingraham wrote:
 ..I'll see if I can get images here of my profiling...
Here are the images as promised:

- A flame graph: https://blog.kyleingraham.com/wp-content/uploads/2024/10/screenshot-2024-10-30-at-11.47.57e280afpm.png
- A call tree: https://blog.kyleingraham.com/wp-content/uploads/2024/10/screenshot-2024-10-30-at-11.53.46e280afpm.png

In the flame graph there are two threads: Main Thread and thread_entryPoint. NGINX Unit runs in thread_entryPoint. vibe.d and my request handling code run in Main Thread. My request handling code is grouped under fiber_entryPoint within Main Thread. vibe.d's code is grouped under 'start' in Main Thread.
Oct 31
prev sibling parent reply Kyle Ingraham <kyle kyleingraham.com> writes:
On Monday, 28 October 2024 at 01:06:58 UTC, Kyle Ingraham wrote:
 I know vibe.d can do better so I'm thinking there's something 
 I'm missing.
Sönke Ludwig solved this for me here: https://github.com/vibe-d/vibe.d/issues/2807#issue-2630501194

The solution was to switch to a configuration for eventcore that uses kqueue directly instead of CFRunLoop. Doing that brought performance back to the stratosphere.

Solution from the GitHub issue:

"You can add an explicit sub configuration to dub.json:

```json
"dependencies": {
	"vibe-d": "~>0.10.1",
	"eventcore": "~>0.9.34"
},
"subConfigurations": {
	"eventcore": "kqueue"
},
```

Or you could pass --override-config=eventcore/kqueue to the dub invocation to try it out temporarily."

I elected to go with the command line flag approach.
Nov 02
parent reply Salih Dincer <salihdb hotmail.com> writes:
On Sunday, 3 November 2024 at 00:42:44 UTC, Kyle Ingraham wrote:
 
 "You can add an explicit sub configuration to dub.json:

 ```json
 "dependencies": {
 	"vibe-d": "~>0.10.1",
 	"eventcore": "~>0.9.34"
 },
 "subConfigurations": {
 	"eventcore": "kqueue"
 },
 ```
 Or you could pass --override-config=eventcore/kqueue to the dub 
 invocation to try it out temporarily."
I'm glad to hear that. In the world of software there is actually no problem that cannot be solved, except for the halting problem :)

SDB 79
Nov 04
parent monkyyy <crazymonkyyy gmail.com> writes:
On Monday, 4 November 2024 at 18:05:25 UTC, Salih Dincer wrote:
 
 I'm glad to hear that. In the world of software, there is 
 actually no problem that cannot be solved; except for the 
 halting problem :)
? I'd argue there are entire families of problems that are unsolvable; the halting problem may just be a root
Nov 04