Lwt vs System threads

BikalGurung · January 13, 2020, 11:04am

Has anyone benchmarked servers written with lwt against servers written with ocaml native system threads? So far lwt has been my default IO/threads lib, as I assumed it to be more lightweight and thus more performant than ocaml threads. However, the documentation of the ocaml threads module claims to be lightweight as well.

From what I understand, ocaml threads is an abstraction/binding over posix threads in unix systems and win32 sys threads in windows. For this discussion I am mostly interested in linux/unix os so I want to limit the discussion mainly to linux/unix posix threads vs lwt.

One aspect of ocaml threads that I like is that if I want to expose a server/io based API to consumers of my library, I don’t have to expose the Lwt monad, and thus the users of my API/library don’t have to learn lwt. However, this is a secondary concern. The primary concern is whether a server/io program based on threads can hold a candle to an lwt based solution.

Note that the program is mostly IO bound, so mainly io concurrency performance is under consideration. However, ease of multi core programming is a bonus if possible. So I am basically looking to read/hear about other OCamlers’ experience with using these two libraries.

dinosaure · January 13, 2020, 11:23am

carton (experimental, and it needs some pins) proposes an abstraction over the scheduler, and its binaries use the Thread module. This software is meant to be the next underlying engine for handling PACK files in ocaml-git.

From what I know, and after a close look at the current state of ocaml-git and what carton does, we are 3 times faster than before, but I did not do a real benchmark to see objectively where the main difference comes from (I suspect it is mostly due to the latest version of decompress, even though I took the opportunity to write my own thread pool).

However, with this kind of abstraction, it would be easy to see the difference by extracting a large PACK file with one binary built on Lwt and another built on Thread.

yawaramin · January 13, 2020, 2:54pm

That documentation needs to be updated; see the issue “Update Thread module doc to clarify that it's used for system threads” (Issue #9206, ocaml/ocaml on GitHub).

BikalGurung · January 13, 2020, 3:07pm

Ah, thanks. If I understand correctly, posix threads is a “lightweight” thread library, right? And since the ocaml ‘Thread’ module is based on posix threads[1], it is still considered “lightweight” since posix threads are considered lightweight, yes?

[1] https://github.com/ocaml/ocaml/tree/trunk/otherlibs/systhreads

BikalGurung · January 13, 2020, 3:16pm

So if I understand you correctly, ocaml-git with the Threads lib is more performant than the lwt version. This is instructive, thanks.

dinosaure · January 13, 2020, 3:19pm

From my experimentation, yes, but again it’s an intuition. I will try this weekend to provide an Lwt back-end for carton and see the difference. Again, the intuition may not hold, since I also did other work and optimized underlying computations such as decompress.

BikalGurung · January 13, 2020, 3:21pm

Cool. I look forward to your experimentation results.

bluddy · January 13, 2020, 4:52pm

@BikalGurung POSIX threads are not lightweight. They’re just an interface to system threads. POSIX just means it’s the Unix standard, and it’s available on Windows as well, though only through compatibility layers rather than natively.

What has to be weighed is the cost of maintaining and updating the data structures for lwt vs the cost of system context switching. AFAIK, lwt wins, but for IO-heavy loads, you might not see a big difference.

BikalGurung · January 13, 2020, 7:01pm

Here is an interesting read. Perhaps this explains why your thread version is more performant than the lwt/epoll based version.

BikalGurung · January 13, 2020, 7:16pm

It seems context switching is more performant than epoll-based event dispatching these days, at least on Linux.

[1] TheServerSide | Your Java Community discussing server side development
[2] nptl(7) - Linux manual page

dbuenzli · January 13, 2020, 7:27pm

For a while the c10k page was a good starting point to get lost in references. Sadly it no longer seems to be updated. I’d be curious how much things have changed nowadays.

In any case, over the past years more than one person has asked me “why don’t you simply create a thread per connection/request/whatever, linux is able to cope with thousands of threads”. My cargo cult says it will be too heavy, but then I never measured nor tried.

The point is that you can always try to make educated guesses about performance, but in the end never believe, measure; you always end up being surprised.

Fix yourself a maximal request rate you want to support, a (not too) dummy per request workload and measure to see if you can meet your goal (and report your result here ;–))
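
A minimal sketch of that kind of measurement, using only the standard library. The names `handle_request`, `measure_rate` and `meets_goal` are hypothetical, and the workload is a placeholder. Note the assumption: `Sys.time` reports processor time, so this only suits a CPU-side dummy workload. For a real IO-bound server you would want wall-clock time (for example `Unix.gettimeofday`) measured around actual connections.

```ocaml
(* Hypothetical per-request workload: sum the bytes of a payload. *)
let handle_request (payload : Bytes.t) : int =
  let acc = ref 0 in
  Bytes.iter (fun c -> acc := !acc + Char.code c) payload;
  !acc

(* Run [work] [n] times and return the achieved rate in requests per
   CPU-second. Assumption: Sys.time is processor time, not wall-clock time.
   If the elapsed time rounds to zero the result is [infinity]. *)
let measure_rate ~n (work : unit -> 'a) : float =
  let t0 = Sys.time () in
  for _i = 1 to n do
    (* opaque_identity keeps the compiler from discarding the call *)
    ignore (Sys.opaque_identity (work ()))
  done;
  let elapsed = Sys.time () -. t0 in
  float_of_int n /. elapsed

(* Compare the measured rate with the goal fixed in advance. *)
let meets_goal ~target_rate ~n work =
  measure_rate ~n work >= target_rate

let () =
  let payload = Bytes.make 4096 'x' in
  let rate = measure_rate ~n:100_000 (fun () -> handle_request payload) in
  Printf.printf "%.0f requests per CPU-second\n" rate
```

The same harness shape can wrap a thread-based and an Lwt-based handler in turn, which is the comparison this thread is asking for.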

bluddy · January 13, 2020, 7:38pm

Also @BikalGurung, remember that the main advantage of multithreading with blocking vs single-thread with non-blocking, as discussed in the resources you linked to, comes from the fact that modern architectures are incredibly concurrent and, most likely, each thread can make good progress by itself. This isn’t the case in OCaml due to the global runtime lock: you’re always still serializing everything back onto one thread. This nullifies almost all the advantages of multithreading, leaving you only with the cost of context switches vs the cost of epoll plus management code in OCaml. That’s why I don’t think you’ll benefit much either way: the OS scheduler isn’t really doing what it’s built for, it’s just scheduling one thread to deal with all the IO.

However, if it really is true that the event model is or will become outdated (and I’m not sure what the case is), that would have implications for the design of the multicore project. I would also agree with you @BikalGurung that if the difference is negligible, avoiding the complexity of the lwt monad is probably preferable.

c-cube · January 13, 2020, 8:58pm

Multithreading still has one significant benefit: you don’t have to wrap your whole code in a monad. Depending on the application that can be quite useful.

ivg · January 13, 2020, 9:04pm

I did it here implementing the Chinese whispers benchmark using OCaml threads (with mailboxes) and Lwt. Here are the results, for 100k threads:

OCaml threads    : 1.581s
OCaml Lwt (fast) : 0.007s
OCaml Lwt (slow) : 0.111s
Go go-routines   : 0.952s

Since Lwt doesn’t suspend a ready computation I had to implement a slow version of Chinese whispers, so that we can indeed measure the performance of the Lwt scheduler. In any case, even the slow version is roughly 8.5 times faster than Go, and goroutines are only about 1.7 times faster than POSIX threads in OCaml.

As it is said in the SO post, the implementation is following the style of the benchmark, so real Lwt code would be even faster, as usually, you don’t need to use these mailboxes or artificially prune Lwt optimizations.

bluddy · January 13, 2020, 10:14pm

As nice as this is to see, the discussion above specifically compared epoll to context switching. Since this code doesn’t do any IO (AFAICT), I think we’re still missing that dimension.

BikalGurung · January 13, 2020, 10:50pm

Indeed. That’s what I am realizing too. The lwt monad seems to leak all the way into the client code using your library, i.e. if your lib uses lwt, the consumers of your lib/api are also forced to learn and use lwt monad programming. That is why I wanted to investigate whether just using ocaml threads is good enough, hence this discussion thread. Thanks for confirming the advantage.
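
To make the "leak" concrete, here is a small sketch. It deliberately does not use Lwt: `Promise` is a toy stand-in defined inline so the example needs no external library, and `Direct_lib`, `Monadic_lib` and `read_greeting` are hypothetical names. Its `return`, `bind` and `run` only mirror the roles that `Lwt.return`, `Lwt.bind` and `Lwt_main.run` play in real code.

```ocaml
(* Toy stand-in for a promise monad. This is NOT how Lwt is implemented. *)
module Promise : sig
  type 'a t
  val return : 'a -> 'a t
  val bind : 'a t -> ('a -> 'b t) -> 'b t
  val run : 'a t -> 'a
end = struct
  type 'a t = unit -> 'a
  let return x () = x
  let bind p f () = f (p ()) ()
  let run p = p ()
end

(* Hypothetical library in direct style (e.g. built on blocking threads):
   the result is a plain string. *)
module Direct_lib = struct
  let read_greeting (name : string) : string = "hello, " ^ name
end

(* The same hypothetical library written against the monad:
   the monad appears in the result type. *)
module Monadic_lib = struct
  let read_greeting (name : string) : string Promise.t =
    Promise.return ("hello, " ^ name)
end

(* A consumer of the direct-style library is ordinary code. *)
let shout_direct (name : string) : string =
  String.uppercase_ascii (Direct_lib.read_greeting name)

(* A consumer of the monadic library has to go through bind,
   and its own result type becomes a promise as well. *)
let shout_monadic (name : string) : string Promise.t =
  Promise.bind (Monadic_lib.read_greeting name) (fun s ->
    Promise.return (String.uppercase_ascii s))

let () =
  print_endline (shout_direct "ocaml");
  (* Only the outermost caller gets to leave the monad. *)
  print_endline (Promise.run (shout_monadic "ocaml"))
```

Every function between the library and the top-level `run` carries `Promise.t` in its signature, which is the propagation described above.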

Chet_Murthy · January 13, 2020, 11:43pm

It appears that your OCaml implementation of this benchmark doesn’t pool threads? Typically native-threading language runtimes either provide that capability (e.g. Java 1.1) or any nontrivial use provides it (e.g. the many, many thread-pools in web-app servers). For instance, in Java it is straightforward to make a thread-pool that is nearly-impossible for application code to “get past to the underlying threads”, and regardless, with a few simple rules it is easy for applications to code against a thread-pool that has some set of APIs you have to use.

I don’t have time right now to figure out your implementation and modify it to use a thread-pool, but I’ll push that onto my work-queue.

Do you know how Golang’s goroutines are implemented? Are they implemented as coroutines? If so, how do they interact with I/O? That is, how do they ensure that other coroutines continue when one of them is blocked on I/O? Also, how do they interact with FFI? (same question, really, but for built-in IO primitives, you might imagine that Golang did something special, which they cannot do for FFI).

Without knowing how Golang’s goroutines are implemented, it’s not really possible to judge them and compare them to OCaml’s various implementations.

ETA: Oh, hm, no, it doesn’t appear that your benchmark needs thread-pooling, so my first point is incorrect. And as I see, you’re using 10k threads. At that point, for sure native threads are just wrong.

ivg · January 14, 2020, 12:42pm

I fail to see why I would need to know this. Metaphorically speaking, if I would like to compare different means of transportation, I don’t need to learn how glucose is transformed into adenosine triphosphate in horse cells, nor do I need to know anything about the cycles of internal combustion engines. What I need is to order them to move from point A to point B, and a good clock to measure the time taken.

The same is applicable to the Chinese whispers benchmark (which I didn’t design, by the way). The benchmark just creates N concurrent tasks, each waiting for a number to increment, connects them with pipes, and then pushes 1 into the first pipe. The idea is to compare different means of concurrency, no matter how fair they are. We can even create N system processes and connect them with pipe(2), or sockets. And again, this is a microbenchmark that doesn’t involve any IO. We can implement it using OCaml continuations, or multicore delimited continuations, or a state machine, or whatever. We are just measuring how fast we can create N threads, suspend them, and then wake them up.
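
For readers who have not seen the benchmark, here is a minimal sketch of its shape. It is not the code that produced the numbers above, which used the Thread module with mailboxes and Lwt. This version uses plain callbacks and a run queue from the standard library, and every name in it (`mailbox`, `take`, `put`, `run`, `whispers`) is hypothetical.

```ocaml
(* A one-shot mailbox: it holds a pending value or a pending reader. *)
type 'a mailbox = {
  mutable value : 'a option;
  mutable waiter : ('a -> unit) option;
}

let create () = { value = None; waiter = None }

(* Queue of callbacks that are ready to run. Deferring wake-ups through
   this queue avoids nesting one task's execution inside another's. *)
let ready : (unit -> unit) Queue.t = Queue.create ()

(* "Suspend": register [k] to be run once a value arrives in [mb]. *)
let take mb k =
  match mb.value with
  | Some v -> mb.value <- None; Queue.push (fun () -> k v) ready
  | None -> mb.waiter <- Some k

(* "Wake up": deliver [v] and schedule the waiting task, if any. *)
let put mb v =
  match mb.waiter with
  | Some k -> mb.waiter <- None; Queue.push (fun () -> k v) ready
  | None -> mb.value <- Some v

(* Drain the run queue until no task is ready. *)
let run () =
  while not (Queue.is_empty ready) do
    (Queue.pop ready) ()
  done

(* Create [n] tasks chained by mailboxes. Each task waits for a number and
   passes it on incremented. Then push 1 into the first mailbox. *)
let whispers n =
  let first = create () in
  let rec chain input i =
    if i = 0 then input
    else begin
      let output = create () in
      take input (fun v -> put output (v + 1));
      chain output (i - 1)
    end
  in
  let last = chain first n in
  put first 1;
  let result = ref None in
  take last (fun v -> result := Some v);
  run ();
  !result
```

With `n` tasks the value read from the last mailbox is `n + 1`. Wrapping `whispers 100_000` in a timer gives the "create, suspend, wake up" cost for this particular scheduler, and the same structure can be rewritten with `Thread` or Lwt to compare them.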

dinosaure · January 14, 2020, 7:13pm

Even if MirageOS uses Lwt, most libraries aim to abstract over it in some way. ocaml-tls, decompress and httpaf are good examples of that.

bluddy · January 14, 2020, 7:37pm

Doesn’t matter. You have to program with a monadic paradigm or you’d starve pending IO.
