digitalmars.D.learn - NGINX Unit and vibe.d Integration Performance
- Kyle Ingraham (37/37) Oct 27 Hi there,
- ryuukk_ (4/4) Oct 27 Let's take a moment to appreciate how easy it was for you to use
- Kyle Ingraham (4/5) Oct 28 It really is. Most of my time setting it up was on getting
- Salih Dincer (54/61) Oct 28 Apparently, vibe.d's event loop is not fully compatible with
- Kyle Ingraham (15/20) Oct 28 You are right that they aren't compatible. Running them in the
- Salih Dincer (4/11) Oct 28 Semaphore?
- Salih Dincer (3/4) Oct 28 Please see: https://dlang.org/phobos/core_sync_semaphore.html
- Kyle Ingraham (20/31) Oct 31 I went back to try using a semaphore and ended up using a mutex,
- Kyle Ingraham (8/13) Oct 31 I forgot to add that once you add delays to my demonstrator and a
- Kyle Ingraham (11/12) Oct 31 Here are images as promised:
- Kyle Ingraham (20/22) Nov 02 Sönke Ludwig solved this for me here:
- Salih Dincer (5/18) Nov 04 I'm glad to hear that. In the world of software, there is
- monkyyy (4/8) Nov 04 ?
Hi there, I'm looking for help with the performance of an integration I'm trying to write between NGINX Unit and D. Here are two minimal demos I've put together:

- https://github.com/kyleingraham/unit-d-hello-world (NGINX Unit/D)
- https://github.com/kyleingraham/unit-vibed-hello-world (NGINX Unit/vibe.d)

The first integration achieves ~43k requests per second on my computer. That matches what I've been able to achieve with a minimal vibe.d project and is, I believe, the maximum my benchmark configuration on macOS can hit. The second, though, only achieves ~20k requests per second. In that demo I try to make vibe.d's concurrency system available during request handling: NGINX Unit's event loop runs in its own thread, and when requests arrive, Unit sends them to the main thread for handling on vibe.d's event loop.

I've tried a few methods to increase performance, but none have been successful:

- Batching new-request messages to minimize message-passing overhead. This increased latency and didn't improve throughput.
- Using vibe.d channels to pass requests. This achieved the same performance as message passing. I wasn't able to use the channel config that prioritizes minimizing overhead because its API didn't fit my use case.
- Using a lock-free queue (https://github.com/MartinNowak/lock-free) between threads, with a loop in the vibe.d thread that constantly polled for requests. This achieves ~43k requests per second but results in atrocious CPU usage.

~20k requests per second seems to be the best I can hit with everything I've tried. I know vibe.d can do better, so I'm thinking there's something I'm missing. In profiling I can see that the vibe.d thread spends a third of its time in what seems to be event loop management code. Am I seeing the effects of Unit's and vibe.d's loops being 'out of sync', i.e. some slack time between a message being sent and it being acted upon? Is there a better way to integrate NGINX Unit with vibe.d?
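For illustration, the thread setup described above looks roughly like this (a minimal sketch; `runUnitLoop` is an assumed wrapper, not the demo's actual API):

```d
import core.thread : Thread;
import vibe.core.core : runApplication;

// Hypothetical wrapper around Unit's C event loop (driven via the
// ImportC'd unit_integration.c in the real demo).
void runUnitLoop() { /* drive NGINX Unit's event loop */ }

void main()
{
    // Unit's loop gets its own thread so it never competes with
    // vibe.d for control of the main thread.
    auto unitThread = new Thread(&runUnitLoop);
    unitThread.start();

    // vibe.d's event loop owns the main thread; incoming requests
    // are forwarded here from the Unit thread.
    runApplication();
}
```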
Oct 27
Let's take a moment to appreciate how easy it was for you to use NGINX Unit from D: https://github.com/kyleingraham/unit-d-hello-world/blob/main/source/unit_integration.c

ImportC is great.
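For anyone who hasn't tried ImportC: the C translation unit can be pulled in like any D module, something like this (a sketch; the module name is assumed from the repo layout):

```d
// The D compiler compiles the C file itself via ImportC, so the
// declarations in unit_integration.c become callable from D
// without hand-written bindings.
import unit_integration;
```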
Oct 27
On Monday, 28 October 2024 at 05:56:32 UTC, ryuukk_ wrote:
> ImportC is great.

It really is. Most of my time setting it up was spent getting include and linking flags working, which is exactly what you'd run into using C from C.
Oct 28
On Monday, 28 October 2024 at 01:06:58 UTC, Kyle Ingraham wrote:
> ... The second, though, only achieves ~20k requests per second. In that demo I try to make vibe.d's concurrency system available during request handling: NGINX Unit's event loop runs in its own thread, and when requests arrive, Unit sends them to the main thread for handling on vibe.d's event loop. I've tried a few methods to increase performance...

Apparently, vibe.d's event loop is not fully compatible with NGINX Unit's loop, causing performance loss. I wonder if it would be wise to use something like an IntrusiveQueue or a task pool to make them compatible? For example, something like this:

```d
alias IQ = IntrusiveQueue;

struct IntrusiveQueue(T)
{
    import core.atomic;

    private {
        T[] buffer;
        size_t head, tail;
        alias acq = MemoryOrder.acq;
        alias rel = MemoryOrder.rel;
    }

    size_t capacity;

    this(size_t capacity)
    {
        this.capacity = capacity;
        buffer.length = capacity;
    }

    alias push = enqueue;
    bool enqueue(T item)
    {
        auto currTail = tail.atomicLoad!acq;
        auto nextTail = (currTail + 1) % capacity;
        if (nextTail == head.atomicLoad!acq) return false; // queue full

        buffer[currTail] = item;
        atomicStore!rel(tail, nextTail);
        return true;
    }

    alias fetch = dequeue;
    bool dequeue(ref T item)
    {
        auto currHead = head.atomicLoad!acq;
        if (currHead == tail.atomicLoad!acq) return false; // queue empty

        auto nextHead = (currHead + 1) % capacity;
        item = buffer[currHead];
        atomicStore!rel(head, nextHead);
        return true;
    }
}

unittest
{
    enum start = 41;
    auto queue = IQ!int(10);
    queue.push(start);
    queue.push(start + 1);

    int item;
    if (queue.fetch(item)) assert(item == start);
    if (queue.fetch(item)) assert(item == start + 1);
}
```

SDB 79
Oct 28
On Monday, 28 October 2024 at 18:37:18 UTC, Salih Dincer wrote:
> Apparently, vibe.d's event loop is not fully compatible with NGINX Unit's loop, causing performance loss. I wonder if it would be wise to use something like an IntrusiveQueue or a task pool to make them compatible? For example, something like this: ...

You are right that they aren't compatible. Running them in the same thread was a no-go (which makes sense given they both want to control when code is run). How would you suggest reading from the queue you provided in the vibe.d thread? I tried something similar with [lock-free](https://code.dlang.org/packages/lock-free). It was easy to push into the queue efficiently from Unit's thread, but popping from it in vibe.d's was difficult:

- Polling too rarely killed performance; polling too often wrecked CPU usage.
- Using message passing reduced performance quite a bit.
- Batching reads was hard because balancing performance for single requests against performance for streams of them was tricky.
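To make the trade-off in the first point concrete, the polling variant looked roughly like this (a sketch; `Request`, `queue`, and `handleRequest` are stand-ins, and the sleep interval is the knob that trades latency for CPU):

```d
import core.time : usecs;
import vibe.core.core : runTask, sleep;

void startPollingLoop()
{
    runTask({
        Request req; // hypothetical request type
        while (true)
        {
            // Drain whatever the Unit thread has pushed so far.
            while (queue.fetch(req))
                handleRequest(req);

            // Too short: the loop spins and CPU usage explodes.
            // Too long: requests sit in the queue and throughput drops.
            sleep(50.usecs);
        }
    });
}
```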
Oct 28
On Monday, 28 October 2024 at 19:57:41 UTC, Kyle Ingraham wrote:
> - Polling too rarely killed performance; polling too often wrecked CPU usage.
> - Using message passing reduced performance quite a bit.
> - Batching reads was hard because balancing performance for single requests against performance for streams of them was tricky.

Semaphore? https://demirten-gitbooks-io.translate.goog/linux-sistem-programlama/content/semaphore/operations.html?_x_tr_sl=tr&_x_tr_tl=en&_x_tr_hl=tr&_x_tr_pto=wapp

SDB 79
Oct 28
On Monday, 28 October 2024 at 20:53:32 UTC, Salih Dincer wrote:
> Semaphore?

Please see: https://dlang.org/phobos/core_sync_semaphore.html

SDB 79
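The basic hand-off with `core.sync.semaphore` would look something like this (a minimal sketch, not tied to the demo's code; note that `wait()` blocks the whole thread, which matters if the consumer is the vibe.d event loop thread):

```d
import core.sync.semaphore : Semaphore;

__gshared Semaphore pending;

shared static this() { pending = new Semaphore(0); }

// Producer (e.g. the Unit thread): enqueue, then signal.
void produce()
{
    // ... push a request onto a shared queue ...
    pending.notify(); // wakes exactly one waiting consumer
}

// Consumer: sleeps in the kernel until a request is available.
void consume()
{
    while (true)
    {
        pending.wait(); // blocks the calling thread, not just a fiber
        // ... pop one request and handle it ...
    }
}
```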
Oct 28
On Monday, 28 October 2024 at 20:53:32 UTC, Salih Dincer wrote:
> Semaphore?

I went back to try using a semaphore and ended up using a mutex, an event, and a lock-free queue. My aim was to limit the number of vibe.d events emitted, to hopefully limit event loop overhead. It works as follows:

- Requests come in on the Unit thread and are added to the lock-free queue.
- The Unit thread tries to obtain the mutex. If it cannot, it assumes request processing is in progress on the vibe.d thread and does not emit an event.
- The vibe.d thread waits on an event. Once one arrives, it obtains the mutex and pulls from the lock-free queue until the queue is empty.
- Once the queue is empty, the vibe.d thread releases the mutex and waits for another event.

This approach increased requests processed per event emitted/waited on from 1:1 to 10:1. It had no impact on event loop overhead, however. The entire program still spends ~50% of its runtime in this function: https://github.com/vibe-d/eventcore/blob/0cdddc475965824f32d32c9e4a1dfa58bd616cc9/source/eventcore/drivers/posix/cfrunloop.d#L38. I'll see if I can get images of my profiling in here. I'm sure I'm missing something obvious.
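A minimal sketch of that coalescing scheme, assuming vibe-core's `createSharedManualEvent` for the cross-thread wakeup and the `IntrusiveQueue` from earlier in the thread as the queue (`Request` and `handleRequest` are hypothetical stand-ins):

```d
import core.sync.mutex : Mutex;
import vibe.core.sync : ManualEvent, createSharedManualEvent;

struct Request { int id; } // stand-in for the real request payload

__gshared IntrusiveQueue!Request queue;
__gshared Mutex drainLock;
shared ManualEvent wakeup;

shared static this()
{
    queue = IntrusiveQueue!Request(1024);
    drainLock = new Mutex;
    wakeup = createSharedManualEvent();
}

// Unit thread: enqueue, but only emit an event when the vibe.d
// thread isn't already draining (it holds drainLock while it works).
void onUnitRequest(Request req)
{
    queue.push(req);
    if (drainLock.tryLock())
    {
        drainLock.unlock();
        wakeup.emit(); // one wakeup can now cover many requests
    }
}

// vibe.d thread: long-lived task that drains the queue per wakeup.
void drainLoop()
{
    int seen = wakeup.emitCount;
    while (true)
    {
        seen = wakeup.wait(seen); // sleep until the next emit
        drainLock.lock();
        scope (exit) drainLock.unlock();

        Request req;
        while (queue.fetch(req))
            handleRequest(req); // hypothetical handler
    }
}
```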
Oct 31
On Thursday, 31 October 2024 at 16:43:09 UTC, Kyle Ingraham wrote:
> This approach increased requests processed per event emitted/waited on from 1:1 to 10:1. It had no impact on event loop overhead, however. ... I'm sure I'm missing something obvious.

I forgot to add that once you add delays, my demonstrator and a program using vibe.d's web framework post similar performance numbers. Adding a 10ms sleep resulted in 600 req/s for my demonstrator and 630 req/s for vibe.d. It's encouraging to see the benefit of vibe.d's concurrency system once delays are added. I'd like to be able to use it without drastically affecting throughput in the no-delay case, however.
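For context, the "10ms delay" case is simulated work in the handler along these lines (a sketch; vibe.d's `sleep` suspends only the current fiber, which is why throughput holds up under the delay):

```d
import core.time : msecs;
import vibe.core.core : sleep;

void handleWithDelay(/* request/response types elided */)
{
    sleep(10.msecs); // yields this fiber; other requests keep being served
    // ... write the response ...
}
```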
Oct 31
On Thursday, 31 October 2024 at 16:43:09 UTC, Kyle Ingraham wrote:
> ... I'll see if I can get images of my profiling in here. ...

Here are images as promised:

- A flame graph: https://blog.kyleingraham.com/wp-content/uploads/2024/10/screenshot-2024-10-30-at-11.47.57e280afpm.png
- A call tree: https://blog.kyleingraham.com/wp-content/uploads/2024/10/screenshot-2024-10-30-at-11.53.46e280afpm.png

In the flame graph there are two threads: Main Thread and thread_entryPoint. NGINX Unit runs in thread_entryPoint. vibe.d and my request handling code run in Main Thread. My request handling code is grouped under fiber_entryPoint within Main Thread; vibe.d's code is grouped under 'start'.
Oct 31
On Monday, 28 October 2024 at 01:06:58 UTC, Kyle Ingraham wrote:
> I know vibe.d can do better, so I'm thinking there's something I'm missing.

Sönke Ludwig solved this for me here: https://github.com/vibe-d/vibe.d/issues/2807#issue-2630501194

The solution was to switch eventcore to a configuration that uses kqueue directly instead of CFRunLoop. Doing that brought performance back to the stratosphere. Solution from the GitHub issue:

"You can add an explicit sub-configuration to dub.json:

```json
"dependencies": {
    "vibe-d": "~>0.10.1",
    "eventcore": "~>0.9.34"
},
"subConfigurations": {
    "eventcore": "kqueue"
},
```

Or you could pass --override-config=eventcore/kqueue to the dub invocation to try it out temporarily."

I elected to go with the command line flag approach.
Nov 02
On Sunday, 3 November 2024 at 00:42:44 UTC, Kyle Ingraham wrote:
> "You can add an explicit sub-configuration to dub.json: ... Or you could pass --override-config=eventcore/kqueue to the dub invocation to try it out temporarily."

I'm glad to hear that. In the world of software, there is actually no problem that cannot be solved; except for the halting problem :)

SDB 79
Nov 04
On Monday, 4 November 2024 at 18:05:25 UTC, Salih Dincer wrote:
> I'm glad to hear that. In the world of software, there is actually no problem that cannot be solved; except for the halting problem :)

? I'd argue there are entire families of problems that are unsolvable; the halting problem may just be a root.
Nov 04