Lwt multi-processing much more performant than eio multi-core?

I took a stab at porting the existing httpaf implementation used in the TechEmpower benchmarks to use httpun and eio in this PR.

To my great surprise I’m finding that Lwt manages ~2x the throughput at 1/10th of the latency at low concurrency levels. At higher concurrency levels Lwt loses out on latency but still manages to do almost 2x the throughput albeit at a higher system load.

Test results from my machine can be found in this gist. The tests can be run locally by cloning the TechEmpower benchmark suite and running: ./tfb --mode benchmark --type json --test httpaf

I'm posting this here in case someone wants to give my PR a once-over, both to see if I have made any silly mistakes and to check whether these results are in line with what people would expect.

2 Likes

There was another attempt at httpaf with Eio here, where it was a lot faster (see PR#24 for a graph). That was a bit hacky and used an early version of Eio, but it might give some hints.

You might be able to find out what’s going on with eio-trace. OCaml 5 performance shows how a similar-sounding problem (poor performance of the capnp-rpc Eio port) was tracked down.

1 Like

I tried an experiment, but this time with httpcats and miou.

From what I can see, lwt still wins on requests per second. I think this is mainly because, even though Miou offers a domain pool, there are still synchronization mechanisms between the domains that the lwt version simply does not have.

The use of Lwt_unix.fork (rather than Stdlib.Domain.spawn) also avoids synchronizing the OCaml major heap between domains. Furthermore, httpaf+lwt is really 32 executables (one per core) each acting as a web server, rather than a single executable running OCaml tasks in parallel.
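
To make that concrete, here is a minimal sketch of the pre-fork pattern (not the benchmark's actual code; run, workers and handle_connection are placeholder names): the parent creates the listening socket, forks one worker per core, and each worker runs its own independent Lwt event loop with its own GC.

```ocaml
(* Hypothetical pre-fork server: every process (parent included) accepts
   connections on the same inherited socket, so no cross-process
   synchronisation of the OCaml heap or of the Lwt task queue is needed. *)
let run ~workers ~port handle_connection =
  let fd = Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0 in
  Unix.setsockopt fd Unix.SO_REUSEADDR true;
  Unix.bind fd (Unix.ADDR_INET (Unix.inet_addr_any, port));
  Unix.listen fd 128;
  let worker () =
    let listen_fd = Lwt_unix.of_unix_file_descr fd in
    let rec accept_loop () =
      let open Lwt.Infix in
      Lwt_unix.accept listen_fd >>= fun (client, _addr) ->
      Lwt.async (fun () -> handle_connection client);
      accept_loop ()
    in
    Lwt_main.run (accept_loop ())
  in
  (* Fork before any Lwt machinery runs; each child is a fully independent
     runtime with its own minor/major heap and its own event loop. *)
  for _ = 2 to workers do
    match Lwt_unix.fork () with
    | 0 -> worker (); exit 0   (* child: accept_loop never returns *)
    | _pid -> ()
  done;
  worker ()
```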

However, I have noticed that httpcats does better than httpun+eio. Here is a summary table[1]. These are the average latencies reported by wrk/tfb:

| | httpaf+lwt | httpun+eio | httpcats |
|---|---|---|---|
| 8 clients, 8 threads | 19.05us | 327.14us | 32.54us |
| 512 clients, 32 threads | 8.5ms | 1.88ms | 1.07ms |
| 16 clients, 16 threads | 29.75us | 808.30us | 39.22us |
| 32 clients, 32 threads | 39.21us | 1.23ms | 64.83us |
| 64 clients, 32 threads | 425.44us | 1.26ms | 124.32us |
| 128 clients, 32 threads | 250.84us | 1.15ms | 263.56us |
| 256 clients, 32 threads | 2.51ms | 1.25ms | 471.59us |
| 512 clients, 32 threads (warmed) | 10.83ms | 1.89ms | 0.98ms |

Note that httpcats scales better with the number of clients than httpaf+lwt and httpun+eio (it has the lowest latency of the three at 512 clients). This may be because Miou asks the system for events (such as the arrival of a new connection) more often than lwt does. In fact, lwt tends to keep executing its pending OCaml tasks rather than periodically polling for new events, so it will simply prioritize handling an in-flight HTTP request over accepting a new connection.

These are the average requests per second reported by wrk/tfb:

| | httpaf+lwt | httpun+eio | httpcats |
|---|---|---|---|
| 8 clients, 8 threads | 51.26k req/s | 25.37k req/s | 33.29k req/s |
| 512 clients, 32 threads | 45.65k req/s | 14.56k req/s | 16.65k req/s |
| 16 clients, 16 threads | 35.22k req/s | 13.83k req/s | 27.49k req/s |
| 32 clients, 32 threads | 25.44k req/s | 13.37k req/s | 16.6k req/s |
| 64 clients, 32 threads | 38.12k req/s | 12.08k req/s | 17.45k req/s |
| 128 clients, 32 threads | 41.31k req/s | 13.27k req/s | 18.1k req/s |
| 256 clients, 32 threads | 43.96k req/s | 14.03k req/s | 17.96k req/s |
| 512 clients, 32 threads (warmed) | 44.78k req/s | 14.37k req/s | 16.82k req/s |

As I said, lwt outperforms the others, but you always have to keep in mind that its implementation consists of 32 programs (one per core, none of which share the same GC) handling all the requests, whereas httpun+eio and httpcats really do run 32 domains sharing the same major heap, with synchronization mechanisms (mutexes and condition variables) both in the OCaml runtime and in what eio or miou provide.

Furthermore, building an application that shares a global resource between all the HTTP request handlers spawned with Lwt_unix.fork may be more difficult than with httpun+eio or httpcats.

Finally, one last note: httpcats uses miou.unix, which relies on Unix.select. It is a fairly legitimate criticism that something other than select should be used, since it has quite a few limitations (in particular on the number of file descriptors it can manage), but it is also something that can easily be improved. At the very least, the design of Miou[2] lets you inject your own logic for system events, such as Solo5's for unikernels.

Above all, I want to mention that lwt seems to use libev in your example while eio uses io_uring. Despite Miou's penalty (due to Unix.select), the performance that httpcats offers is still interesting 🙂 [3].

Finally, if you would like to go further with HTTP, we are currently developing vif: a small web framework based on httpcats. EDIT: vif is very experimental; even though we continue to develop it, don't expect everything to work without a hitch!


  1. My CPU is an AMD Ryzen 9 7950X. ↩︎

  2. In particular, you might like to take the time to read this short tutorial explaining how to inject your own handling of system events; one could very easily imagine miou+io_uring. ↩︎

  3. Comparisons between schedulers are always difficult. As mentioned in the README.md of httpcats, having a well-defined and reproducible protocol that yields reliable metrics is a job in itself, and it goes much further than launching a simple program like wrk. ↩︎

8 Likes

Thank you both for your in-depth answers (including @talex5's blog post here)!

I read the first of your two blog posts and I think the problem might be the same, but I'm not sure because this is my first time reading eio traces. If I understood the blog post correctly, the fact that there is a suspend in between what I interpret as the beginning and the end of the write was how you identified the lack of buffered writes as the problem?

To me this looks to be the same thing then:

Would you agree?

If I understood the blog post correctly, the fact that there is a suspend in between what I interpret as the beginning and the end of the write was how you identified the lack of buffered writes as the problem?

I think the problem you mainly pointed out is that httpun+eio (but also httpcats) use domains instead of fork. In other words, when you test httpaf+lwt, you are mainly testing two schedulers: lwt and the system scheduler (i.e. Linux), which has to juggle 32 programs (for 32 cores).

As for httpun+eio and httpcats, you are testing eio for one and miou for the other, as well as the OCaml runtime (and its GC). All three can, for different reasons, suspend execution (to do a Unix.select, to promote values to the major heap, to re-synchronize with io_uring). These suspensions do not appear with httpaf+lwt because:

  1. the GC is not shared between processes
  2. the list of lwt tasks to be done is not shared between processes
  3. system events are not shared between processes

I don’t know the details of eio, but it is difficult to be fair in a comparison with httpaf+lwt, where the system can be smarter about scheduling with respect to what happens on the TCP/IP stack. Linux (and its scheduler) can directly wake an executable in response to events on the TCP/IP stack, to which it has direct access, whereas eio and miou must always make a syscall (io_uring or Unix.select) to find out what to do next.

Once again, httpaf+lwt outperforms httpun+eio and httpcats, but what about a resource (such as a database connection) shared between the different handlers that you launch via Lwt_unix.fork? This kind of global resource is quite common for a web server (the secret phrase for cookies, the database connection, the seed for the RNG, etc.).

Perhaps using Lwt_domain (instead of Lwt_unix.fork) would be more accurate and avoid comparing apples and oranges.
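
For what it's worth, here is a rough sketch of what that could look like, assuming the lwt_domain package and its Domainslib-backed pool (render_json and handle_request are hypothetical names, not the benchmark's code): one process, one Lwt event loop, and CPU-bound work detached to a pool of domains.

```ocaml
(* Placeholder for some CPU-bound work (serialisation, templating, ...). *)
let render_json (body : string) : string = String.uppercase_ascii body

(* One pool of worker domains shared by the whole process; global resources
   (DB connection, cookie secret, RNG seed, ...) stay in this single process,
   unlike with Lwt_unix.fork where each child gets its own copy. *)
let pool = Lwt_domain.setup_pool 4

(* Run the CPU-bound part on the pool without blocking the Lwt event loop. *)
let handle_request (body : string) : string Lwt.t =
  Lwt_domain.detach pool render_json body
```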

1 Like

I only see one write for each read there. I’d guess that the write call returned EWOULDBLOCK, and so Eio waited until the FD was ready for writing before calling it again. Though I’m not sure why it would need to block, unless the client was reading the output too slowly. The trace looks reasonable to me (except that it’s only using one of its two domains at a time). One minor thing is that it does the read before the write; it’s possible adding a yield to switch the ordering might help, but I doubt it will make much difference.

As @dinosaure said, using multiple domains instead of multiple processes can slow things down due to GC synchronisation, but I don’t see any GC in the part shown in your screenshot, so that wouldn’t appear to be the problem. eio-trace gc-stats will show if it’s spending a lot of time in GC. If GC is a problem, the second blog post looks at that.

I’d start by comparing a single-domain/single-thread Lwt process with a single-domain/single-fiber Eio one. Is Lwt still faster then? Then try increasing concurrency for both, but still with a single domain.
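
Roughly, the single-domain Eio side of such a comparison could look like this (just a sketch, not the benchmark's code; the canned response stands in for the real httpun handler):

```ocaml
(* Minimal single-domain Eio TCP server: no additional domains are requested,
   so accepting and serving all happen on one domain. *)
let () =
  Eio_main.run @@ fun env ->
  Eio.Switch.run @@ fun sw ->
  let net = Eio.Stdenv.net env in
  let addr = `Tcp (Eio.Net.Ipaddr.V4.any, 8080) in
  let socket = Eio.Net.listen ~sw ~reuse_addr:true ~backlog:128 net addr in
  (* No ~additional_domains argument here, unlike the multi-core setup. *)
  Eio.Net.run_server socket
    ~on_error:(fun ex -> Eio.traceln "connection failed: %a" Eio.Exn.pp ex)
    (fun flow _client_addr ->
       Eio.Flow.copy_string
         "HTTP/1.1 200 OK\r\ncontent-length: 2\r\n\r\nok" flow)
```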

It is worth noting that Lwt_unix.fork probably isn’t safe to use anyway. It calls Unix.fork, about which the OCaml reference manual says: “[Unix.fork] fails if the OCaml process is multi-core (any domain has been spawned). In addition, if any thread from the Thread module has been spawned, then the child process might be in a corrupted state.” As it happens, the Unix.exec* functions are also unsafe in the presence of Thread.t threads.

The problem is that Lwt automatically starts up new Thread.t threads when encountering blocking calls, and in recent versions of Lwt, Lwt_unix.set_default_async_method is no longer available to change this.

For a similar reason the Lwt_process module is unsafe.

1 Like

Ah, I had created a domain manager with 1 domain, but the parameter is called additional_domains… Just dropping it altogether got me down to a single domain.

Running wrk with a single thread and a single connection against a single lwt process and a single eio domain, lwt still comes out well ahead. This is after increasing the minor heap size for eio via OCAMLRUNPARAM=s=8192k, so it is not quite an apples-to-apples comparison, as eio should theoretically have an edge in throughput here.

❯ ./run-httpaf-bench.sh
Running 2s test @ http://192.168.1.123:8080/json
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    23.14us    5.52us 458.00us   98.16%
    Req/Sec    41.84k     1.33k   43.28k    76.19%
  Latency Distribution
     50%   22.00us
     75%   24.00us
     90%   25.00us
     99%   30.00us
  87456 requests in 2.10s, 12.59MB read
Requests/sec:  41661.70
Transfer/sec:      6.00MB
❯ ./run-httpun-bench.sh
Running 2s test @ http://192.168.1.123:8080/json
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    34.53us   31.29us   0.93ms   99.31%
    Req/Sec    29.71k     2.11k   30.92k    90.00%
  Latency Distribution
     50%   31.00us
     75%   32.00us
     90%   37.00us
     99%   52.00us
  59145 requests in 2.00s, 8.52MB read
Requests/sec:  29570.81
Transfer/sec:      4.26MB

GC STATS:
./trace.fxt:

Ring   GC/s     App/s    Total/s   %GC
  0    0.011    2.202    2.213     0.50

All    0.011    2.202    2.213     0.50

Which eio backend is in use here? Sometimes uring is blocked by a buffer size that is too small, so you need to test both the posix and uring backends.

The 'any' thing in the trace looks odd. I think this wrapper can be removed:

Its goal is to abort a read if read_closed is resolved. But that only happens here:

But if that happens, the shutdown should cause the read to end anyway. And it seems it only does that in response to getting EOF from the socket.

That made it a bit faster for me.

Looking at the strace output, this might be more important than I thought. When handling multiple connections, the Lwt version writes to all of them first. By the time it does the next read, the data is ready. But the Eio version does the read first, which always fails and needs to be rescheduled later. I didn’t test changing that though, as I’m not sure where in httpun the ordering is done.

I’ve tried both, and the posix backend is a single-digit percentage slower on my machine.

Unless I’ve misunderstood, the ordering happens here:

And switching the loops around didn’t seem to do much. Putting write first might actually have made it slower.