Lwt vs System threads

That needs to be updated, see this issue https://github.com/ocaml/ocaml/issues/9206

Ah, thanks. If I understand correctly, posix thread is a “lightweight” thread library right? And since ocaml ‘Thread’ module is based on posix thread[1], it is still considered “lightweight” since posix threads are considered lightweigth, yes?

[1] https://github.com/ocaml/ocaml/tree/trunk/otherlibs/systhreads

So if I understand you correctly, ocaml-git with the Threads lib is more performant than the lwt version. This is instructive, thanks.

From my experimentation, yes, but again it’s an intuition. I will try this week end to provide an Lwt back-end for carton and see the difference. Again, the intuition is possibly not true when I did some others works and optimize underlying computation like decompress.

Cool. :slight_smile: I look forward to your experimentation results.

@BikalGurung POSIX threads are not lightweight. They’re just an interface into system threads. POSIX just means it’s the unix standard, and it’s supported (or mostly supported) by windows as well.

What has to be weighed is the cost of maintaining and updating the data structures for lwt vs the cost of system context switching. AFAIK, lwt wins, but for IO-heavy loads, you might not see a big difference.

Here is an interesting read. Perhaps this explains why your thread version is more performant than the lwt/epoll based version.

1 Like

It seems context switching is quite performant than epoll event dispatching techniques these days. At least in linux it seems.

[1] https://www.theserverside.com/discussions/thread/26700.html
[2] http://man7.org/linux/man-pages/man7/nptl.7.html

During some interval in time the c10k page was a good starting point to get lost in references. Sadly it no longer seems to be updated. I’d be curious how much things have changed nowadays.

In any case over the past years more than one person mentioned to me “why don’t you simply create a thread per connection/request/whatever, linux is able to cope with thousands of threads”. My cargo cult says it will be to heavy but then I never measured nor tried.

The point is that you can always try to make educated guesses about performance but in the end never believe, measure, you always end up being suprised.

Fix yourself a maximal request rate you want to support, a (not too) dummy per request workload and measure to see if you can meet your goal (and report your result here ;–))

1 Like

Also @BikalGurung, remember that the main advantage of multithreading with blocking vs single-thread with non-blocking as discussed in the resources you linked to comes from the fact that modern architectures are incredibly concurrent and most likely, each thread can make good progress by itself. This isn’t the case in OCaml due to the global runtime lock – you’re always still serializing everything back on one thread. This nullifies almost all the advantages of multithreading, leaving you only with the cost of context-switches vs epoll +management code in OCaml. That’s why I don’t think you’ll benefit much either way: the OS scheduler isn’t really doing what it’s built for – it’s just scheduling one thread to deal with all the IO.

However, if it really is true that the event model is or will become outdated (and I’m not sure what the case is), that would have implications for the design of the multicore project. I would also agree with you @BikalGurung that if the difference is negligible, avoiding the complexity of the lwt monad is probably preferable.

Multithreading still has one significant benefit: you don’t have to wrap your whole code in a monad. Depending on the application that can be quite useful.

I did it here implementing the Chinese whispers benchmark using OCaml threads (with mailboxes) and Lwt. Here are the results, for 100k threads:

OCaml threads    : 1.581s
OCaml Lwt (fast) : 0.007s
OCaml Lwt (slow) : 0.111s
Go go-routines   : 0.952s

Since Lwt doesn’t suspend a ready computation I had to implement a slow version of Chinese whispers, so that we can indeed measure the performance of the Lwt scheduler. In any case, even a slow version is 10 times faster than Go, and goroutines are only 1.5 times faster than POSIX threads in OCaml.

As it is said in the SO post, the implementation is following the style of the benchmark, so real Lwt code would be even faster, as usually, you don’t need to use these mailboxes or artificially prune Lwt optimizations.

4 Likes

As nice as this is to see, the discussion above specifically compared epoll to context switching. Since this code doesn’t do any IO (AFAICT), I think we’re still missing that dimension.

Indeed. That’s what I am realizing too. The lwt monad seems to leak all the way into the client code using your library, i.e. if you lib uses lwt, the consumers of your lib/api are also forced to learn and use lwt monad programming. Which is why I wanted to investigate if just using ocaml threads is good enough and hence this discussion thread. Thanks for confirming the advantage.

  • It appears that your Ocaml implementation of this benchmark doesn’t pool threads? Typically native-threading language runtimes either provide that capability (e.g. Java 1.1) or any nontrivial use provides it (e.g. the many, many thread-pools in web-app servers). For instance, in Java it is straightforward to make a thread-pool that is nearly-impossible for application code to “get past to the underlying threads”, and regardless, with a few simple rules it is easy for applications to code against a thread-pool that has some set of APIs you have to use.
    I don’t have time right now to figure out your implementation and modify it to use a thread-pool, but I’ll push that onto my work-queue.

  • Do you know how Golang’s goroutines are implemented? Are they implemented as coroutines? If so, how do they interact with I/O? That is, how do they ensure that other coroutines continue when one of them is blocked on I/O? Also, how do they interact with FFI? (same question, really, but for built-in IO primitives, you might imagine that Golang did something special – which they cannot do for FFI).
    Without knowing how Golang’s goroutines are implemented, it’s not really possible to judge them and compare them to Ocaml’s various implementations.

ETA: Oh, hm, no, it doesn’t appear that your benchmark needs thread-pooling, so my first point is incorrect. And as I see, you’re using 10k threads. At that point, for sure you need to not be using native threads are just wrong.

I fail to see the reason why would I need to know this. Metaphorically speaking, if I would like to compare different means of transportation, I don’t need to learn how glucose is transformed into adenosine triphosphate in horse cells, neither do I need to know anything about the cycles of the internal combustion engines. What I need, is to order them to move from point A to point B and a good clock to measure the taken time.

The same is applicable to the Chinese whispers benchmark (which I didn’t design, by the way). The benchmark just creates N concurrent tasks, each waiting for a number to increment, connects them with pipes, and then pushes 1 into the first pipe. The idea is to compare different means of concurrency. No matter how fair they are. We can even create N system processes and connect them with pipe(8), or sockets. And again, this is a microbenchmark that doesn’t involve any IO. We can implement it using OCaml continuations, or multicore delimited continuations, or a state machine, or whatever. We are just measuring how fast we can create N threads, suspend them, and then wake up.

Even if MirageOS uses Lwt, most of libraries want to be abstracted over it in some way. ocaml-tls or decompress (or httpaf) are some good examples about that.

Doesn’t matter. You have to program with a monadic paradigm or you’d starve pending IO.

Cool. Thanks. This is not too bad for the threads version. I had assumed that threads at 10K would be significantly worse but contrary to my assumptions it seems the threads based or one thread per connection model is quite competitive.

[Full disclosure: I prefer to use threads/mutex/condvar instead of CSP/CML-like concurrency, mostly because my systems projects often involve a ton of C/C++ code, and it’s just a -necessity-. That said, I’ve built code using monads – specifically for a web-stress-tester that needed to execute “click-trail scripts” of some complexity.]

If you’re -actually- wanting to build systems that will service 10k connections, then I think you will have to go with that I/O monad. Sure, you can do it with threads, but you’re going to want epoll() and probably other things. You might want to have more than one native “worker” thread in order to soak up CPU, but you’re not going to want 10k threads. When you use native threads, you give up control over the implementation of “wait for an external event”, and that can be somewhat performance-critical. Also, going with native threads means that you have to worry about all the ways in which threads can hang, GC, etc. For instance, John Reppy went to some length in CML to ensure that “threads that provably could not make progress” would be GCed. As it turns out, at least as of a couple of years ago, Golang’s goroutines had not implemented that sort of functionality.

It’s actually worse than that. If (again) you’re looking to hit 10k connections, I suspect you’ll find that NO off-the-shelf I/O monad framework is enough – you’ll end up having to hack the internals of whatever you choose – because almost nobody has this sort of problem.

Which brings me to my punchline, which is: I think that, it’s not so useful to think about “10k conns” as a way of evaluating concurrency solutions. If you build a system that needs hundreds of threads, you’re probably already succeeding and can afford to revisit its implementation. I’d suggest that it makes much more sense to pick a concurrency framework based on other considerations, like easy access to native libraries, programmer experience with I/O monads (or serious willingness+time to learn), whether there are a lot of libraries that need to be rewritten in monadic style, error-handling, etc.

As I said, I wrote a rather complete I/O monad implementation for this web-stress-tester, and while it was “just fine/fine/fine” for writing code, I never used that framework again – typically I don’t need to support a thousand threads, and at that point, fugeddaboudit, I’m goin’ with thread/mutex/condvar.

3 Likes