Has anyone done any performance benchmarks of writing servers in Lwt vs OCaml's native system threads? So far Lwt has been my default IO/threading library, as I assumed it to be more performant, and thus more lightweight, than OCaml threads. However, the documentation of the OCaml Thread module claims to be lightweight as well.
From what I understand, OCaml threads are an abstraction/binding over POSIX threads on Unix systems and Win32 system threads on Windows. For this discussion I am mostly interested in Linux/Unix, so I want to limit the discussion mainly to Linux/Unix POSIX threads vs Lwt.
One aspect of OCaml threads that I like is that if I want to expose a server/IO-based API to consumers of my API, I don't have to expose the Lwt monad, so the users of my API/library don't have to learn Lwt. However, this is a secondary concern. The primary concern is: can a threads-based server/IO program hold a candle to an Lwt-based solution?
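To illustrate the point about the monad leaking into the API, here is a hypothetical pair of interface fragments (the `read_line` function and module names are made up for the example):

```ocaml
(* api_lwt.mli — the Lwt monad appears in every signature,
   so all callers must program in Lwt style. *)
val read_line : Lwt_unix.file_descr -> string Lwt.t

(* api_threads.mli — plain blocking calls; callers run them
   in ordinary threads and never see a monad. *)
val read_line : Unix.file_descr -> string
```

With the second interface, consumers can use the library from straightforward direct-style code.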
Note, the program is mostly IO-bound, so it is mainly IO concurrency performance that is under consideration. However, ease of multicore programming is a bonus if possible. So I am basically looking to read/hear about other OCamlers' experiences with these two libraries.
carton (experimental; it needs some pins) proposes an abstraction over the scheduler, and its binaries use the Thread module. This software aims to be the next underlying engine for handling PACK files in ocaml-git.
From what I know, and from a close look at the current status of ocaml-git and what carton does, we are 3 times faster than before, but I did not do a real benchmark to see objectively where the main difference lies (I suspect it is mostly due to the latest version of decompress, even if I took the opportunity to make my own thread pool).
However, with this kind of abstraction, it would be easy to see the difference by extracting a large PACK file and providing one binary with Lwt and another with Thread.
Ah, thanks. If I understand correctly, POSIX threads is a "lightweight" thread library, right? And since the OCaml Thread module is based on POSIX threads [1], it is still considered "lightweight", since POSIX threads are considered lightweight, yes?
From my experimentation, yes, but again it's an intuition. I will try this weekend to provide an Lwt back-end for carton and see the difference. Again, the intuition may not hold, since I did some other work to optimize the underlying computation, like decompress.
@BikalGurung POSIX threads are not lightweight. They're just an interface to system threads. POSIX just means it's the Unix standard, and it's supported (or mostly supported) by Windows as well.
What has to be weighed is the cost of maintaining and updating Lwt's internal data structures vs the cost of system context switching. AFAIK, Lwt wins, but for IO-heavy loads you might not see a big difference.
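For contrast with the thread-per-connection approach, here is a minimal sketch of the single-threaded, event-loop side of the trade-off, using the stdlib's `Unix.select` (the portable analogue of epoll); this is roughly what Lwt does under the hood, minus the promise machinery. The echo behaviour is just for illustration:

```ocaml
(* Single-threaded event-loop sketch: one thread multiplexes the
   listening socket and all client sockets via Unix.select, echoing
   back whatever each client sends. *)
let event_loop listen_fd =
  let clients = ref [] in
  let buf = Bytes.create 4096 in
  while true do
    (* Negative timeout = block until some fd is readable. *)
    let readable, _, _ =
      Unix.select (listen_fd :: !clients) [] [] (-1.0) in
    List.iter
      (fun fd ->
        if fd = listen_fd then begin
          let c, _ = Unix.accept listen_fd in
          clients := c :: !clients
        end else
          let n = Unix.read fd buf 0 (Bytes.length buf) in
          if n = 0 then begin
            (* Client closed the connection. *)
            Unix.close fd;
            clients := List.filter (fun f -> f <> fd) !clients
          end else
            ignore (Unix.write fd buf 0 n))
      readable
  done
```

The "management code" cost mentioned above is visible here: the loop itself has to track which clients exist and what state each is in, work that the OS scheduler would otherwise do for you.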
For some time the C10K page was a good starting point to get lost in references. Sadly it no longer seems to be updated. I'd be curious how much things have changed nowadays.
In any case, over the past years more than one person has said to me "why don't you simply create a thread per connection/request/whatever? Linux is able to cope with thousands of threads". My cargo cult says it will be too heavy, but then I never measured or tried.
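For reference, the thread-per-connection approach is only a few lines with the stdlib. A minimal echo-server sketch (the port and buffer size are arbitrary; compile with the threads and unix libraries):

```ocaml
(* Thread-per-connection echo server sketch: each accepted client
   gets its own OS thread that blocks on read/write. *)
let handle_client fd =
  let buf = Bytes.create 4096 in
  let rec loop () =
    let n = Unix.read fd buf 0 (Bytes.length buf) in
    if n > 0 then begin
      ignore (Unix.write fd buf 0 n);   (* echo back *)
      loop ()
    end
  in
  (try loop () with _ -> ());
  (try Unix.close fd with _ -> ())

let serve port =
  let sock = Unix.(socket PF_INET SOCK_STREAM 0) in
  Unix.(setsockopt sock SO_REUSEADDR true);
  Unix.(bind sock (ADDR_INET (inet_addr_loopback, port)));
  Unix.listen sock 128;
  while true do
    let client, _ = Unix.accept sock in
    (* One OS thread per connection. *)
    ignore (Thread.create handle_client client)
  done
```

Whether this is "too heavy" at thousands of connections is exactly the thing to measure rather than guess.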
The point is that you can always try to make educated guesses about performance, but in the end: never believe, measure. You always end up being surprised.
Fix yourself a maximal request rate you want to support and a (not too) dummy per-request workload, then measure to see if you can meet your goal (and report your results here ;-)).
Also @BikalGurung, remember that the main advantage of multithreading-with-blocking over single-thread-with-non-blocking, as discussed in the resources you linked to, comes from the fact that modern architectures are incredibly concurrent, and most likely each thread can make good progress by itself. This isn't the case in OCaml due to the global runtime lock: you're still serializing everything back onto one thread. This nullifies almost all the advantages of multithreading, leaving you only with the cost of context switches vs epoll + management code in OCaml. That's why I don't think you'll benefit much either way: the OS scheduler isn't really doing what it's built for; it's just scheduling one thread to deal with all the IO.
However, if it really is true that the event model is or will become outdated (and I'm not sure whether that is the case), that would have implications for the design of the multicore project. I would also agree with you @BikalGurung that if the difference is negligible, avoiding the complexity of the Lwt monad is probably preferable.
Multithreading still has one significant benefit: you don't have to wrap your whole code in a monad. Depending on the application, that can be quite useful.
Since Lwt doesn't suspend a ready computation, I had to implement a slow version of Chinese whispers, so that we can actually measure the performance of the Lwt scheduler. In any case, even the slow version is 10 times faster than Go, and goroutines are only 1.5 times faster than POSIX threads in OCaml.
As stated in the SO post, the implementation follows the style of the benchmark, so real Lwt code would be even faster: usually you don't need these mailboxes, nor do you need to artificially defeat Lwt's optimizations.
As nice as this is to see, the discussion above specifically compared epoll to context switching. Since this code doesn't do any IO (AFAICT), I think we're still missing that dimension.
Indeed. That's what I am realizing too. The Lwt monad seems to leak all the way into the client code using your library: if your library uses Lwt, its consumers are also forced to learn and use Lwt monadic programming. This is why I wanted to investigate whether just using OCaml threads is good enough, and hence this discussion thread. Thanks for confirming the advantage.
It appears that your OCaml implementation of this benchmark doesn't pool threads? Typically, native-threading language runtimes either provide that capability (e.g. Java 1.1) or any nontrivial use provides it (e.g. the many, many thread pools in web-app servers). For instance, in Java it is straightforward to make a thread pool that is nearly impossible for application code to "get past to the underlying threads", and regardless, with a few simple rules it is easy for applications to code against a thread pool that has some set of APIs you have to use.
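For anyone unfamiliar with the pattern, here is a minimal fixed-size thread-pool sketch using only the stdlib (`Thread`, `Mutex`, `Condition`, `Queue`); the names and structure are illustrative, not taken from any library in this thread:

```ocaml
(* A fixed-size pool: n worker threads pull jobs from a shared queue. *)
type pool = {
  jobs : (unit -> unit) Queue.t;
  lock : Mutex.t;
  nonempty : Condition.t;
}

let rec worker p =
  Mutex.lock p.lock;
  (* The while-loop guards against spurious wakeups. *)
  while Queue.is_empty p.jobs do
    Condition.wait p.nonempty p.lock
  done;
  let job = Queue.pop p.jobs in
  Mutex.unlock p.lock;
  (try job () with _ -> ());
  worker p

let create n =
  let p = { jobs = Queue.create ();
            lock = Mutex.create ();
            nonempty = Condition.create () } in
  for _ = 1 to n do ignore (Thread.create worker p) done;
  p

let submit p job =
  Mutex.lock p.lock;
  Queue.push job p.jobs;
  Condition.signal p.nonempty;
  Mutex.unlock p.lock
```

With such a pool, the number of OS threads stays bounded no matter how many tasks the application submits.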
I don't have time right now to figure out your implementation and modify it to use a thread pool, but I'll push that onto my work queue.
Do you know how Golang's goroutines are implemented? Are they implemented as coroutines? If so, how do they interact with I/O? That is, how do they ensure that other coroutines continue when one of them is blocked on I/O? Also, how do they interact with the FFI? (The same question, really, but for built-in IO primitives you might imagine that Golang did something special, which it cannot do for the FFI.)
Without knowing how Golang's goroutines are implemented, it's not really possible to judge them and compare them to OCaml's various implementations.
ETA: Oh, hm, no, it doesn't appear that your benchmark needs thread pooling, so my first point is incorrect. And as I see, you're using 10k threads. At that point, native threads are surely the wrong choice.
I fail to see why I would need to know this. Metaphorically speaking, if I want to compare different means of transportation, I don't need to learn how glucose is transformed into adenosine triphosphate in a horse's cells, nor do I need to know anything about the cycles of internal combustion engines. What I need is to order them to move from point A to point B, and a good clock to measure the time taken.
The same applies to the Chinese whispers benchmark (which I didn't design, by the way). The benchmark just creates N concurrent tasks, each waiting for a number to increment, connects them with pipes, and then pushes 1 into the first pipe. The idea is to compare different means of concurrency, no matter how fair the comparison is. We could even create N system processes and connect them with pipe(2), or sockets. And again, this is a microbenchmark that doesn't involve any IO. We can implement it using OCaml continuations, or multicore delimited continuations, or a state machine, or whatever. We are just measuring how fast we can create N threads, suspend them, and then wake them up.
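The structure described above can be sketched with OS threads and Unix pipes (not the benchmark's actual code; this toy version passes a single byte, so the chain length is limited to 255 here, while the real benchmark uses tens of thousands of tasks):

```ocaml
(* Chinese-whispers sketch: n tasks chained by pipes. Each thread
   reads one byte from its input pipe, increments it, and writes it
   to the next pipe; pushing 0 into the first pipe yields n at the
   end of the chain. *)
let whispers n =
  let first_r, first_w = Unix.pipe () in
  let rec chain r i =
    if i = 0 then r
    else begin
      let next_r, next_w = Unix.pipe () in
      ignore (Thread.create (fun () ->
        let buf = Bytes.create 1 in
        ignore (Unix.read r buf 0 1);
        Bytes.set buf 0 (Char.chr (Char.code (Bytes.get buf 0) + 1));
        ignore (Unix.write next_w buf 0 1)) ());
      chain next_r (i - 1)
    end
  in
  let last_r = chain first_r n in
  ignore (Unix.write first_w (Bytes.make 1 '\000') 0 1);
  let buf = Bytes.create 1 in
  ignore (Unix.read last_r buf 0 1);
  Char.code (Bytes.get buf 0)
```

Timing `whispers n` for large `n` (with a wider message encoding) is essentially what the benchmark measures: the cost of creating, suspending, and waking N tasks.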
Even though MirageOS uses Lwt, most libraries want to be abstracted over it in some way; ocaml-tls and decompress (and httpaf) are good examples of that.