Successful experiences using Multicore runtime?

Dear Discuss,

At LexiFi we are evaluating whether to put work towards making our codebase usable with multiple threads (i.e. protecting or reducing global state with locks, making it thread- and domain-local, etc.). If undertaken, this will require considerable engineering investment on our part, so we are interested in experience reports from others who have undertaken similar efforts, particularly on industrial/large codebases. We are especially interested in design choices (e.g. the precise combination of domains/threads/effects), quantitative before/after benchmarks, etc.

For information, we currently have a mature multi-process architecture that uses message-passing for communication. While this works well, each process maintains its own separate heap, and the ensuing RAM usage limits the number of processes that we can run simultaneously on commodity hardware. For us, a main interest of multi-threading lies in the ability to share heap resources among the different threads, thus reducing RAM usage and allowing a greater number of workers to execute simultaneously on a single machine.
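For concreteness, the pattern looks roughly like the following sketch (simplified and illustrative; the real code differs, and run_job is a placeholder for the actual work). The point is that each forked worker inherits a copy-on-write heap that diverges over time, which is where the duplicated RAM comes from.

```ocaml
(* Illustrative sketch: fork a worker and exchange marshalled messages
   over a pair of pipes. *)
let spawn_worker run_job =
  let to_worker_r, to_worker_w = Unix.pipe () in
  let from_worker_r, from_worker_w = Unix.pipe () in
  match Unix.fork () with
  | 0 ->
      (* Child: read jobs, run them, write back results. *)
      Unix.close to_worker_w;
      Unix.close from_worker_r;
      let ic = Unix.in_channel_of_descr to_worker_r in
      let oc = Unix.out_channel_of_descr from_worker_w in
      (try
         while true do
           let job = Marshal.from_channel ic in
           Marshal.to_channel oc (run_job job) [];
           flush oc
         done
       with End_of_file -> ());
      exit 0
  | _pid ->
      (* Parent: keep the channel ends used to talk to the child. *)
      Unix.close to_worker_r;
      Unix.close from_worker_w;
      (Unix.out_channel_of_descr to_worker_w,
       Unix.in_channel_of_descr from_worker_r)
```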

A somewhat similar architecture was recently described in Prohibitive amounts of runtime lock waits in multicore analysis with Infer · Issue #14047 · ocaml/ocaml · GitHub, which mentions a sizeable performance penalty when using multiple threads.

A specific point that we wonder about is the performance cliff-edge when the number of domains exceeds the number of cores, call it N: the official documentation strongly recommends not starting more domains than N. Supposedly this is because minor collections require all domains to synchronize at a barrier, and if any domain's thread is sleeping, this synchronization can be expensive. But what about other processes? Even with fewer than N domains, couldn't one hit the same performance cliff-edge depending on how the OS schedules those domain threads? How should one decide how many domains to start? This seems like a pretty basic question, but I have not been able to find a satisfactory answer.
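For reference, the closest thing to a programmatic answer that I know of is Domain.recommended_domain_count, which asks the runtime for its estimate of usable parallelism. A minimal sketch (requested_workers and worker are placeholders):

```ocaml
(* Cap the number of domains at the runtime's recommendation instead of
   a hard-coded core count; the initial domain participates in the work. *)
let run requested_workers worker =
  let n = min requested_workers (Domain.recommended_domain_count ()) in
  let helpers = List.init (n - 1) (fun _ -> Domain.spawn worker) in
  worker ();
  List.iter Domain.join helpers
```

Of course this only accounts for the core count, not for the other processes sharing the machine.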

Alternatively, if using threads for concurrency (not parallelism) within a single domain (as was already possible in OCaml 4), I guess one is not subject to the above performance cliff-edge. This would seem a natural approach if all one wants is to share heap resources between different threads, but of course we would not be able to take advantage of the performance benefits of using domains.
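For concreteness, the single-domain approach would look something like the following sketch (the cache and its contents are illustrative): all threads interleave inside one domain, so no multi-domain GC barrier is involved, yet they share a heap.

```ocaml
(* Concurrency without parallelism: systhreads in the initial domain
   sharing one cache. Hashtbl is not thread-safe, hence the mutex. *)
let cache : (string, string) Hashtbl.t = Hashtbl.create 1024
let cache_lock = Mutex.create ()

let lookup compute key =
  Mutex.lock cache_lock;
  let hit = Hashtbl.find_opt cache key in
  Mutex.unlock cache_lock;
  match hit with
  | Some v -> v
  | None ->
      let v = compute key in
      Mutex.lock cache_lock;
      Hashtbl.replace cache key v;
      Mutex.unlock cache_lock;
      v

let () =
  let job i = ignore (lookup String.uppercase_ascii (string_of_int i)) in
  let threads = List.init 8 (fun i -> Thread.create job i) in
  List.iter Thread.join threads
```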

I guess the general question is: how should one decide on the exact combination of domains and threads? The absence of a TLS data structure adds to the confusion (there is a PR upstream, Thread-local storage, take 3 by gasche · Pull Request #13355 · ocaml/ocaml · GitHub, but it is somewhat in stasis). TLS is a basic necessity when using threads in non-trivial ways, so its absence implicitly biases the situation towards using domains.
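In the meantime one can emulate TLS by hand; a rough workaround sketch (this is not the API of the upstream PR): key a table by Thread.id and protect it with a mutex. Note that entries are never reclaimed when threads exit, which is one reason proper runtime support is preferable.

```ocaml
(* Poor man's thread-local storage: one slot per thread id. *)
module Tls (V : sig type t val init : unit -> t end) = struct
  let table : (int, V.t) Hashtbl.t = Hashtbl.create 64
  let lock = Mutex.create ()

  let get () =
    let id = Thread.id (Thread.self ()) in
    Mutex.lock lock;
    let v =
      match Hashtbl.find_opt table id with
      | Some v -> v
      | None ->
          let v = V.init () in
          Hashtbl.replace table id v;
          v
    in
    Mutex.unlock lock;
    v
end
```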

My feeling is that I have heard mostly negative reports about porting efforts to Multicore so far, but maybe that is because positive ones are not reported as much. So if you have had a positive experience around the Multicore runtime, please share it. Thanks!

Cheers,
Nicolas


My two cents: better to stick with your multi-process implementation, especially if you already have good parallel performance.
Currently, you can reason about your code, and when some communication happens between processes, it is explicit and should be traceable/debuggable.


See Tarides | We're Moving Ocsigen from Lwt to Eio!

In Owi, we turned a single-core application into a multicore one. This is not exactly what you are looking for, I think, since we can't compare with a multi-process version, as we never had one. But I can't remember any serious trouble with the OCaml 5 runtime for Owi. Note also that this is a project that spends a lot of time in C libraries (an SMT solver), so there is not that much pressure on runtime performance. Anyway, we consider that it was quite successful.


Several thoughts:

(1) Other language runtimes have gone down this road before OCaml, and it might be worth looking into their experiences.

(2) I know that in the case of Java, there was a lot more than just getting the first round of SMP-friendliness in: there were various “hot locks” strewn through the runtime and libraries that needed to be fixed. You should expect that here also.

(3) It was a real problem in Java, when running Java inside of guest VMs. People would “overcommit” CPU, running a collection of guest VMs that collectively needed more cores than the host provided. This resulted in guest VM threads getting starved, and I remember the hilarity as people were running distributed systems protocols that needed a certain amount of liveness (as in: “responses will arrive within N seconds”) and weren’t getting it. Lots of brokenness.

(4) That said, for each workload there is some multiple of the number of cores that you can use as the rough number of active kernel threads to shoot for. It's application-specific, but you should be able to figure it out, I think.

(5) Going back again to the origin story of J2EE: it ended up with many threads in a single JVM for the same reason you cite, only much worse: there was just too much live heap per application, and you couldn't have N copies of that heap; it would have consumed far, far too much memory.

But eventually the galloping increase in memory made this less of a problem, and people got used to running a small number of identical JVMs for any particular application, instead of just ONE. At that point, you can somewhat use standard “at the first hint of trouble, shoot the process” fault tolerance, and things get much better, fault-tolerance-wise. And this was critically enabled by the arrival of much, much bigger memories.

Let me say that again with less story-telling: there’s a joke people make, that instead of spending six months and a ton of programmer time on performance-optimizing your application, if you just wait 12-18 months, the next generation of machines will give you the speedup you need, and you can spend that programmer time on new features, or bugfixes.

There’s a similar story about memory: just buy bigger memories, it might be cheaper than spending the programmer time.


Thanks for the replies!

If I understand correctly, if we already keep all CPUs busy with N processes (under our direct control, or otherwise running on the machine), it would really be a bad idea to use multiple domains, even just 2 for instance, within our processes. Is that right? So the direction might rather be to stick to one domain per process, possibly using multiple threads (as we could have done in OCaml 4).

The rule of thumb is that if your application is embarrassingly partitionable (minimal read/write data-sharing, and well-controlled, e.g. at well-defined points during execution) then processes can be much better than threading. This assumes that memory isn’t a price issue.

Yes, you’re absolutely right: the context here is handling parallel requests, which we initially addressed by using multiple processes. This approach made partitioning relatively straightforward. However, we’ve found that each process tends to consume a significant amount of RAM, largely due to duplicated caches that could otherwise be shared. That’s what has renewed our interest in consolidating multiple “workers” into a single process.

One of the challenges we face is that we can’t reliably know how many CPU cores will be available to our application. If I understand correctly, that makes the multicore approach risky in terms of consistent performance.

We do already use explicit shared memory for some caches (via memory-mapped files), but that forces us to either work directly with binary data or incur overhead deserializing it into OCaml values. Unfortunately, that doesn’t scale well to all use cases.
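For illustration, the mapping side looks roughly like this (the path and the byte layout are assumptions): the mapped data lives outside the OCaml heap and is shared between processes, but every read still means decoding raw bytes into OCaml values by hand, which is exactly the overhead mentioned above.

```ocaml
(* Map a cache file read-only as a Bigarray of bytes. *)
let with_mapped_cache path f =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  Fun.protect ~finally:(fun () -> Unix.close fd) (fun () ->
    let len = (Unix.fstat fd).Unix.st_size in
    let map =
      Unix.map_file fd Bigarray.char Bigarray.c_layout false [| len |]
      |> Bigarray.array1_of_genarray
    in
    f map)
```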

As for the trade-off between engineering a better solution versus just buying more memory: even at our modest scale, memory is not cheap. On AWS, for example, the ratio of RAM to vCPUs is fixed within an instance family, so increasing RAM often means paying for additional vCPUs we don’t actually need. When you’re running hundreds of VMs, reducing per-process memory usage can quickly translate into substantial annual cost savings, more than justifying the engineering effort.

And for our on-prem customers, it’s not really an option to just ask them to upgrade their hardware to accommodate inefficiencies.

Anyway, thanks again to everyone for the insightful discussion!

Regarding using shared memory, I think it may be possible to avoid serialization issues in some cases by placing out-of-heap blocks into shared memory (using Caml_out_of_heap_header).

Since you are already using shared memory, you probably have some way of dealing with the concurrency issues that arise, which I think is the major downside of doing sharing in this manual way.