Lwt vs System threads

From my experimentation, yes, but again it’s an intuition. I will try this week end to provide an Lwt back-end for carton and see the difference. Again, the intuition is possibly not true when I did some others works and optimize underlying computation like decompress.

Cool. :slight_smile: I look forward to your experimentation results.

@BikalGurung POSIX threads are not lightweight. They’re just an interface into system threads. POSIX just means it’s the unix standard, and it’s supported (or mostly supported) by windows as well.

What has to be weighed is the cost of maintaining and updating the data structures for lwt vs the cost of system context switching. AFAIK, lwt wins, but for IO-heavy loads, you might not see a big difference.

Here is an interesting read. Perhaps this explains why your thread version is more performant than the lwt/epoll based version.

1 Like

It seems context switching is quite performant than epoll event dispatching techniques these days. At least in linux it seems.

[1] https://www.theserverside.com/discussions/thread/26700.html
[2] http://man7.org/linux/man-pages/man7/nptl.7.html

During some interval in time the c10k page was a good starting point to get lost in references. Sadly it no longer seems to be updated. I’d be curious how much things have changed nowadays.

In any case over the past years more than one person mentioned to me “why don’t you simply create a thread per connection/request/whatever, linux is able to cope with thousands of threads”. My cargo cult says it will be to heavy but then I never measured nor tried.

The point is that you can always try to make educated guesses about performance but in the end never believe, measure, you always end up being suprised.

Fix yourself a maximal request rate you want to support, a (not too) dummy per request workload and measure to see if you can meet your goal (and report your result here ;–))

1 Like

Also @BikalGurung, remember that the main advantage of multithreading with blocking vs single-thread with non-blocking as discussed in the resources you linked to comes from the fact that modern architectures are incredibly concurrent and most likely, each thread can make good progress by itself. This isn’t the case in OCaml due to the global runtime lock – you’re always still serializing everything back on one thread. This nullifies almost all the advantages of multithreading, leaving you only with the cost of context-switches vs epoll +management code in OCaml. That’s why I don’t think you’ll benefit much either way: the OS scheduler isn’t really doing what it’s built for – it’s just scheduling one thread to deal with all the IO.

However, if it really is true that the event model is or will become outdated (and I’m not sure what the case is), that would have implications for the design of the multicore project. I would also agree with you @BikalGurung that if the difference is negligible, avoiding the complexity of the lwt monad is probably preferable.

Multithreading still has one significant benefit: you don’t have to wrap your whole code in a monad. Depending on the application that can be quite useful.

I did it here implementing the Chinese whispers benchmark using OCaml threads (with mailboxes) and Lwt. Here are the results, for 100k threads:

OCaml threads    : 1.581s
OCaml Lwt (fast) : 0.007s
OCaml Lwt (slow) : 0.111s
Go go-routines   : 0.952s

Since Lwt doesn’t suspend a ready computation I had to implement a slow version of Chinese whispers, so that we can indeed measure the performance of the Lwt scheduler. In any case, even a slow version is 10 times faster than Go, and goroutines are only 1.5 times faster than POSIX threads in OCaml.

As it is said in the SO post, the implementation is following the style of the benchmark, so real Lwt code would be even faster, as usually, you don’t need to use these mailboxes or artificially prune Lwt optimizations.

4 Likes

As nice as this is to see, the discussion above specifically compared epoll to context switching. Since this code doesn’t do any IO (AFAICT), I think we’re still missing that dimension.

Indeed. That’s what I am realizing too. The lwt monad seems to leak all the way into the client code using your library, i.e. if you lib uses lwt, the consumers of your lib/api are also forced to learn and use lwt monad programming. Which is why I wanted to investigate if just using ocaml threads is good enough and hence this discussion thread. Thanks for confirming the advantage.

  • It appears that your Ocaml implementation of this benchmark doesn’t pool threads? Typically native-threading language runtimes either provide that capability (e.g. Java 1.1) or any nontrivial use provides it (e.g. the many, many thread-pools in web-app servers). For instance, in Java it is straightforward to make a thread-pool that is nearly-impossible for application code to “get past to the underlying threads”, and regardless, with a few simple rules it is easy for applications to code against a thread-pool that has some set of APIs you have to use.
    I don’t have time right now to figure out your implementation and modify it to use a thread-pool, but I’ll push that onto my work-queue.

  • Do you know how Golang’s goroutines are implemented? Are they implemented as coroutines? If so, how do they interact with I/O? That is, how do they ensure that other coroutines continue when one of them is blocked on I/O? Also, how do they interact with FFI? (same question, really, but for built-in IO primitives, you might imagine that Golang did something special – which they cannot do for FFI).
    Without knowing how Golang’s goroutines are implemented, it’s not really possible to judge them and compare them to Ocaml’s various implementations.

ETA: Oh, hm, no, it doesn’t appear that your benchmark needs thread-pooling, so my first point is incorrect. And as I see, you’re using 10k threads. At that point, for sure you need to not be using native threads are just wrong.

I fail to see the reason why would I need to know this. Metaphorically speaking, if I would like to compare different means of transportation, I don’t need to learn how glucose is transformed into adenosine triphosphate in horse cells, neither do I need to know anything about the cycles of the internal combustion engines. What I need, is to order them to move from point A to point B and a good clock to measure the taken time.

The same is applicable to the Chinese whispers benchmark (which I didn’t design, by the way). The benchmark just creates N concurrent tasks, each waiting for a number to increment, connects them with pipes, and then pushes 1 into the first pipe. The idea is to compare different means of concurrency. No matter how fair they are. We can even create N system processes and connect them with pipe(8), or sockets. And again, this is a microbenchmark that doesn’t involve any IO. We can implement it using OCaml continuations, or multicore delimited continuations, or a state machine, or whatever. We are just measuring how fast we can create N threads, suspend them, and then wake up.

Even if MirageOS uses Lwt, most of libraries want to be abstracted over it in some way. ocaml-tls or decompress (or httpaf) are some good examples about that.

Doesn’t matter. You have to program with a monadic paradigm or you’d starve pending IO.

Cool. Thanks. This is not too bad for the threads version. I had assumed that threads at 10K would be significantly worse but contrary to my assumptions it seems the threads based or one thread per connection model is quite competitive.

[Full disclosure: I prefer to use threads/mutex/condvar instead of CSP/CML-like concurrency, mostly because my systems projects often involve a ton of C/C++ code, and it’s just a -necessity-. That said, I’ve built code using monads – specifically for a web-stress-tester that needed to execute “click-trail scripts” of some complexity.]

If you’re -actually- wanting to build systems that will service 10k connections, then I think you will have to go with that I/O monad. Sure, you can do it with threads, but you’re going to want epoll() and probably other things. You might want to have more than one native “worker” thread in order to soak up CPU, but you’re not going to want 10k threads. When you use native threads, you give up control over the implementation of “wait for an external event”, and that can be somewhat performance-critical. Also, going with native threads means that you have to worry about all the ways in which threads can hang, GC, etc. For instance, John Reppy went to some length in CML to ensure that “threads that provably could not make progress” would be GCed. As it turns out, at least as of a couple of years ago, Golang’s goroutines had not implemented that sort of functionality.

It’s actually worse than that. If (again) you’re looking to hit 10k connections, I suspect you’ll find that NO off-the-shelf I/O monad framework is enough – you’ll end up having to hack the internals of whatever you choose – because almost nobody has this sort of problem.

Which brings me to my punchline, which is: I think that, it’s not so useful to think about “10k conns” as a way of evaluating concurrency solutions. If you build a system that needs hundreds of threads, you’re probably already succeeding and can afford to revisit its implementation. I’d suggest that it makes much more sense to pick a concurrency framework based on other considerations, like easy access to native libraries, programmer experience with I/O monads (or serious willingness+time to learn), whether there are a lot of libraries that need to be rewritten in monadic style, error-handling, etc.

As I said, I wrote a rather complete I/O monad implementation for this web-stress-tester, and while it was “just fine/fine/fine” for writing code, I never used that framework again – typically I don’t need to support a thousand threads, and at that point, fugeddaboudit, I’m goin’ with thread/mutex/condvar.

3 Likes

Thanks @Chet_Murthy for your wonderful answer.

Indeed, I don’t think my current lwt solution have to support 10k concurrent connections. I am writing a lib to support FCGI protocol for my web applications. It is currently using Lwt after having recently learned lwt myself. From my experience learning lwt, I realized that even for an experienced ocaml programmer without any background in monads and such, learning and using lwt monad is a serious investment of time and effort. This made me realize that users of my library would have had to put in perhaps the same amount of learning(for lwt) just to use my library. Perhaps this is not so much of a learning curve, but I couldn’t help thinking if just using plain threads is sufficient performance wise while removing the lwt learning curve for the users of my library.

Is this really a problem in practice? I haven’t used native ocaml threads at all so curious as to what your experience has been with it.

B

[OK, old-skool web-TP thoughts …]

TL;DR why not just do FCGI with a single process (serially-reused) process per request, and see how far you get?

Is this for an FCGI back-end? That is, there’ll be a webserver in front, and will be calling to Ocaml code running behind the FCGI protocol? If so, do you really need LWT? Or even threads? Here’s why I’m asking:

  1. typically a webserver will absorb almost all the concurrency that exists coming from the network. It has to buffer requests and responses anyway (in order to do parsing, routing, etc) and that’s on top of socket-buffers. Except for the largest req/resp, that’s typically sufficient. And it’s rare that such large (e.g. media) requests are handled via FCGI.

  2. the value of LWT goes down if there’s no I/O concurrency to be had.

  3. there is value in the process-isolation that comes with one-process-per-request (of course, that process gets reused serially)

  4. If the intent is to use shared variables in the process as a sort of “database” … well, that can work, but historically it’s been found that ti’s better to put such shared mutable data in an external store (if nothing else, a local memcached) – this is an aid to debugging, as well as making for more robust systems.

If we look back at the history of transaction -processing, we can see this pattern repeat itself:

  1. originally CICS was akin to this LWT approach, but IMS/DL1 more like FCGI. And (it turned out) CICS got used for more-lightweight trans, where IMS/DL1 got used for more heavyweight trans
  2. The web started off with CGI (ugh) and FCGI (as well as variants like mod_perl) and moved toward Java[1] with shared processes and threads. This was … problematic for reasons #3/#4 above, and lots of web-app frameworks continue to use “one request at a time per process” models for application code.
  3. the one place where lightweight concurrency has really stuck, is when dealing with reverse-AJAX and other models (like websockets) that use massive concurrency to allow the server to push content to the client. But this is really different from client->server RPC, and it would be (IMO) a mistake to try to fit them into the same codebase and runtime.

[1] the push for “multiple concurrent requests in a single process” in Java was mostly due to the enormous weight of a Java process, both in memory and startup-time.

This is a problem for all applications in complex transaction-processing systems. Unless your application code is vanishingly simple, eventually somebody’s going to write something that causes a hang.

[the rest is written partially from memory, partially from a quick scan of the Apache mod_fcgid documentation; I could be wrong about this …]
Also though, as I think about it, there’s another problem you might want to consider: FastCGi was designed with the idea that behind front-end webserver, is a pool of processes. It was not originally designed with the intention that there be a pool of -threads- in a single process behind. So for instance in Apache mod_fcgid, there are a bunch of different timeouts, and they apply to each process/connection independently. If Apache times out reading a response back from an FCGI connection, it will terminate that connection, but it won’t know to (for instance) terminate all connections to the corresponding process.

What I’m saying is: when/if there are “faults” (errors of various kinds), the FastCGI protocol is designed so that recovery can occur on a per-connection basis. If you route all connections to a single process, you’re pretty much vitiating that recovery logic. And there isn’t any other recovery logic available for the FastCGI protocol.

I might be wrong about this though – your goals might be different, and the FastCGI protocol guarantees might be different today.