Lwt vs System threads

Cool. Thanks. This is not too bad for the threads version. I had assumed that threads at 10K would be significantly worse, but contrary to my assumption, the thread-based (one thread per connection) model seems quite competitive.

[Full disclosure: I prefer to use threads/mutex/condvar instead of CSP/CML-like concurrency, mostly because my systems projects often involve a ton of C/C++ code, and it’s just a -necessity-. That said, I’ve built code using monads – specifically for a web-stress-tester that needed to execute “click-trail scripts” of some complexity.]

If you’re -actually- wanting to build systems that will service 10k connections, then I think you will have to go with that I/O monad. Sure, you can do it with threads, but you’re going to want epoll() and probably other things. You might want to have more than one native “worker” thread in order to soak up CPU, but you’re not going to want 10k threads. When you use native threads, you give up control over the implementation of “wait for an external event”, and that can be somewhat performance-critical. Also, going with native threads means that you have to worry about all the ways in which threads can hang, GC, etc. For instance, John Reppy went to some length in CML to ensure that “threads that provably could not make progress” would be GCed. As it turns out, at least as of a couple of years ago, Golang’s goroutines had not implemented that sort of functionality.
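To make the “wait for an external event” point concrete, here is a minimal sketch of a readiness wait that the program itself controls. It uses `Unix.select` from OCaml's `unix` library as a stand-in for epoll(), which that library does not expose. The helper name `wait_readable` and the pipe-based demo are assumptions for illustration only; a real 10k-connection server would want an epoll/kqueue binding instead.

```ocaml
(* Build with the unix library, e.g.
   ocamlfind ocamlopt -package unix -linkpkg wait.ml *)

(* Hypothetical helper: block for at most [timeout] seconds until at least
   one descriptor in [fds] is readable, and return the readable ones.
   Unix.select is used here in place of epoll. *)
let wait_readable (fds : Unix.file_descr list) (timeout : float)
    : Unix.file_descr list =
  let readable, _, _ = Unix.select fds [] [] timeout in
  readable

(* Demo: two pipes stand in for two connections; only the second has data. *)
let () =
  let r1, w1 = Unix.pipe () in
  let r2, w2 = Unix.pipe () in
  ignore (Unix.write_substring w2 "x" 0 1);
  let ready = wait_readable [r1; r2] 1.0 in
  Printf.printf "idle ready: %b, busy ready: %b\n"
    (List.mem r1 ready) (List.mem r2 ready);
  List.iter Unix.close [r1; w1; r2; w2]
```

A single worker thread looping around a wait like this can multiplex many descriptors, which is roughly the job an I/O-monad scheduler takes over from the OS thread scheduler.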

It’s actually worse than that. If (again) you’re looking to hit 10k connections, I suspect you’ll find that NO off-the-shelf I/O monad framework is enough – you’ll end up having to hack the internals of whatever you choose – because almost nobody has this sort of problem.

Which brings me to my punchline, which is: I think that it’s not so useful to think about “10k conns” as a way of evaluating concurrency solutions. If you build a system that needs hundreds of threads, you’re probably already succeeding and can afford to revisit its implementation. I’d suggest that it makes much more sense to pick a concurrency framework based on other considerations, like easy access to native libraries, programmer experience with I/O monads (or serious willingness+time to learn), whether there are a lot of libraries that need to be rewritten in monadic style, error-handling, etc.

As I said, I wrote a rather complete I/O monad implementation for this web-stress-tester, and while it was “just fine/fine/fine” for writing code, I never used that framework again – typically I don’t need to support a thousand threads, and at that point, fugeddaboudit, I’m goin’ with thread/mutex/condvar.
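For readers who have not used the thread/mutex/condvar style in OCaml, here is a minimal sketch of a small worker pool built only on `Thread`, `Mutex`, `Condition` and `Queue`. The names `chan`, `push`, `pop` and `run_pool` are made up for this example and are not taken from any library. It also assumes the job function does not raise; a raising job would leave the caller waiting forever.

```ocaml
(* Build with the threads library, e.g.
   ocamlfind ocamlopt -package threads.posix -linkpkg pool.ml *)

(* A blocking queue guarded by a mutex and a condition variable. *)
type 'a chan = {
  q : 'a Queue.t;
  m : Mutex.t;
  nonempty : Condition.t;
}

let make_chan () =
  { q = Queue.create (); m = Mutex.create (); nonempty = Condition.create () }

let push c x =
  Mutex.lock c.m;
  Queue.push x c.q;
  Condition.signal c.nonempty;
  Mutex.unlock c.m

(* Block until an item is available, then remove and return it. *)
let pop c =
  Mutex.lock c.m;
  while Queue.is_empty c.q do
    Condition.wait c.nonempty c.m
  done;
  let x = Queue.pop c.q in
  Mutex.unlock c.m;
  x

(* Apply [f] to every input on [n_workers] system threads.
   [None] on the job queue tells a worker to exit.
   Results are returned in completion order, not input order. *)
let run_pool ~n_workers f inputs =
  let jobs = make_chan () and results = make_chan () in
  let worker () =
    let rec loop () =
      match pop jobs with
      | None -> ()
      | Some x -> push results (f x); loop ()
    in
    loop ()
  in
  let threads = List.init n_workers (fun _ -> Thread.create worker ()) in
  List.iter (fun x -> push jobs (Some x)) inputs;
  List.iter (fun _ -> push jobs None) threads;
  let out = List.map (fun _ -> pop results) inputs in
  List.iter Thread.join threads;
  out
```

Something like `run_pool ~n_workers:4 handle requests` then reads as ordinary direct-style code, which is the main appeal when most of the surrounding code is C/C++ or non-monadic OCaml.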


Thanks @Chet_Murthy for your wonderful answer.

Indeed, I don’t think my current Lwt solution has to support 10k concurrent connections. I am writing a lib to support the FCGI protocol for my web applications. It currently uses Lwt, which I only recently learned myself. From that experience, I realized that even for an experienced OCaml programmer without any background in monads and such, learning and using the Lwt monad is a serious investment of time and effort. This made me realize that users of my library would have to put in perhaps the same amount of learning (for Lwt) just to use my library. Perhaps this is not so much of a learning curve, but I couldn’t help wondering whether plain threads would be sufficient performance-wise while removing the Lwt learning curve for the users of my library.

Is this really a problem in practice? I haven’t used native OCaml threads at all, so I’m curious as to what your experience with them has been.



[OK, old-skool web-TP thoughts …]

TL;DR why not just do FCGI with a single (serially-reused) process per request, and see how far you get?

Is this for an FCGI back-end? That is, there’ll be a webserver in front, and it will be calling into OCaml code running behind the FCGI protocol? If so, do you really need LWT? Or even threads? Here’s why I’m asking:

  1. typically a webserver will absorb almost all the concurrency that exists coming from the network. It has to buffer requests and responses anyway (in order to do parsing, routing, etc) and that’s on top of socket-buffers. Except for the largest req/resp, that’s typically sufficient. And it’s rare that such large (e.g. media) requests are handled via FCGI.

  2. the value of LWT goes down if there’s no I/O concurrency to be had.

  3. there is value in the process-isolation that comes with one-process-per-request (of course, that process gets reused serially)

  4. If the intent is to use shared variables in the process as a sort of “database” … well, that can work, but historically it’s been found that it’s better to put such shared mutable data in an external store (if nothing else, a local memcached) – this is an aid to debugging, as well as making for more robust systems.
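The serially-reused process model in the points above can be sketched as a plain accept loop with no threads and no Lwt. This is only an illustration under loud assumptions: it does not speak the FastCGI record protocol (a request here is simply whatever one `read` returns, up to an arbitrary 4096 bytes), and `handle_request`, `serve_one` and `serve_forever` are hypothetical names.

```ocaml
(* Build with the unix library, e.g.
   ocamlfind ocamlopt -package unix -linkpkg serial.ml *)

(* Hypothetical application handler: request bytes in, response bytes out. *)
let handle_request (req : string) : string =
  "echo:" ^ req

(* Serve one connection: read one request, write one response.
   Placeholder framing only; real FastCGI uses length-prefixed records. *)
let serve_one (conn : Unix.file_descr) : unit =
  let buf = Bytes.create 4096 in
  let n = Unix.read conn buf 0 (Bytes.length buf) in
  let resp = handle_request (Bytes.sub_string buf 0 n) in
  ignore (Unix.write_substring conn resp 0 (String.length resp))

(* One request at a time. A fault on one connection is dropped and
   the process moves on to the next connection. *)
let rec serve_forever (listener : Unix.file_descr) : unit =
  let conn, _peer = Unix.accept listener in
  (try serve_one conn with Unix.Unix_error _ -> ());
  Unix.close conn;
  serve_forever listener
```

The webserver in front would then run several such processes side by side to get concurrency, with each process holding no state shared across requests.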

If we look back at the history of transaction-processing, we can see this pattern repeat itself:

  1. originally CICS was akin to this LWT approach, but IMS/DL1 more like FCGI. And (it turned out) CICS got used for more-lightweight trans, where IMS/DL1 got used for more heavyweight trans
  2. The web started off with CGI (ugh) and FCGI (as well as variants like mod_perl) and moved toward Java[1] with shared processes and threads. This was … problematic for reasons #3/#4 above, and lots of web-app frameworks continue to use “one request at a time per process” models for application code.
  3. the one place where lightweight concurrency has really stuck, is when dealing with reverse-AJAX and other models (like websockets) that use massive concurrency to allow the server to push content to the client. But this is really different from client->server RPC, and it would be (IMO) a mistake to try to fit them into the same codebase and runtime.

[1] the push for “multiple concurrent requests in a single process” in Java was mostly due to the enormous weight of a Java process, both in memory and startup-time.

This is a problem for all applications in complex transaction-processing systems. Unless your application code is vanishingly simple, eventually somebody’s going to write something that causes a hang.

[the rest is written partially from memory, partially from a quick scan of the Apache mod_fcgid documentation; I could be wrong about this …]
Also though, as I think about it, there’s another problem you might want to consider: FastCGI was designed with the idea that behind the front-end webserver is a pool of processes. It was not originally designed with the intention that there be a pool of -threads- in a single process behind. So for instance in Apache mod_fcgid, there are a bunch of different timeouts, and they apply to each process/connection independently. If Apache times out reading a response back from an FCGI connection, it will terminate that connection, but it won’t know to (for instance) terminate all connections to the corresponding process.

What I’m saying is: when/if there are “faults” (errors of various kinds), the FastCGI protocol is designed so that recovery can occur on a per-connection basis. If you route all connections to a single process, you’re pretty much vitiating that recovery logic. And there isn’t any other recovery logic available for the FastCGI protocol.

I might be wrong about this though – your goals might be different, and the FastCGI protocol guarantees might be different today.


Goroutines are implemented more or less as coroutines, but the scheduler multiplexes them onto a thread pool. I’m not sure of the exact details, but goroutines are a bit heavier than coroutines in other languages. Last I heard, each goroutine has an allocation cost of 8KB. Because they are multiplexed on OS threads, they don’t have explicit break points the same way asynchronous coroutines normally do. All of Go’s I/O functions and a few others implicitly yield to the scheduler. If one blocks on something else (like the CPU; Go never blocks on I/O), the other threads in the pool will continue to have work scheduled on them. I’m not sure how Go handles it if there is blocking on all threads in the pool.

From my very brief experience with OCaml, if you’re doing any kind of network I/O with a third-party library, you’re either using Async or Lwt (usually the latter), and both are monadic. I don’t have a ton of experience with monads myself, but the monadic paradigm presented by these libraries doesn’t differ substantially from async and await in languages that have them.

I’m not suggesting that there isn’t a learning curve involved, but it’s something most people working with network I/O are going to have to deal with at some point anyway.


Just discovered this gem which discusses the exact same issue the current thread is trying to address.

TLDR - pthreads/systhreads are quite competitive with poll/epoll techniques. Additionally, it seems this multicore PR (“Reimplementing Systhreads with pthreads (Domain execution contexts)” by Engil, Pull Request #381 in ocaml-multicore/ocaml-multicore) enables true parallelism in addition to concurrency.

To note as well, the changes include the ability to run decs / systhreads across many domains at once.


The article you linked is an interesting read. This talk on youtube also explains the difference in an easily digestible way.

Looking at awesome-ocaml there doesn’t seem to be a web framework that relies on threading. Is there one out there somewhere?

It’s a very far cry from a framework, but my tiny httpd relies on threads and works pretty well for HTTP 1.1.
