Successful experiences using Multicore runtime?

Dear Discuss,

At LexiFi we are evaluating whether to put work towards making our codebase usable with multiple threads (i.e. protecting/reducing global state with locks, making it thread- and domain-local, etc.). If undertaken, this will require considerable engineering investment on our part, so we are interested in experience reports from others who have undertaken similar efforts, particularly on industrial/large codebases. We are particularly interested in design choices (e.g. the precise combination of domains/threads/effects), quantitative before/after benchmarks, etc.

For information, we currently have a mature multi-process architecture that uses message passing for communication. While this works well, each process maintains its own separate heap, and the ensuing RAM usage limits the number of processes that we can run simultaneously on commodity hardware. For us, a main interest of multi-threading lies in the ability to share heap resources among the different threads, thus reducing RAM usage and allowing a greater number of workers to execute simultaneously on a single machine.

A somewhat similar architecture was recently reported in Prohibitive amounts of runtime lock waits in multicore analysis with Infer · Issue #14047 · ocaml/ocaml · GitHub, which mentions a sizeable performance penalty when using multiple threads.

A specific point that we wonder about is the performance cliff-edge when the number of domains exceeds the number of cores N: the official documentation strongly recommends not starting more domains than “the number of cores”. Supposedly this is because minor collections require all domains to enter a barrier, and if any domain threads are sleeping, this synchronization can be expensive. But what about other processes? Even with fewer than N domains, couldn’t one hit the same cliff-edge depending on how the OS schedules those domain threads? How should one decide how many domains to start? This seems like a pretty basic question, but I have not been able to find a satisfactory answer.
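
For concreteness, the kind of pattern we have in mind is roughly the following minimal sketch (parallel_init is a made-up helper, not something we ship):

```ocaml
(* Minimal sketch: run [f 0] .. [f (n-1)] in parallel, never keeping
   more worker domains alive than the runtime's recommended count.
   Items beyond the cap are processed in successive batches. *)
let parallel_init (n : int) (f : int -> 'a) : 'a list =
  let cap = Domain.recommended_domain_count () in
  let rec go i acc =
    if i >= n then List.concat (List.rev acc)
    else
      let batch = min cap (n - i) in
      (* Spawn one domain per item in this batch, then join them all. *)
      let ds = List.init batch (fun j -> Domain.spawn (fun () -> f (i + j))) in
      go (i + batch) (List.map Domain.join ds :: acc)
  in
  go 0 []
```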

Alternatively, if using threads for concurrency (not parallelism) within a single domain (as was already possible in OCaml 4), I guess one is not subject to the above cliff-edge. This would seem a natural approach if all one wants is to share heap resources between different threads, but of course we would not be able to take advantage of the performance benefits of using domains.
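
To illustrate, the sharing we are after would look something like this (a sketch only; the cache and workload are invented, and computing under the lock serializes the workers, which a real implementation would avoid):

```ocaml
(* Single-domain alternative: systhreads sharing one heap, with a mutex
   guarding the shared cache. Requires the threads.posix library; the
   threads run concurrently but never in parallel. *)
let cache : (string, string) Hashtbl.t = Hashtbl.create 1024
let cache_lock = Mutex.create ()

let lookup_or_compute key compute =
  Mutex.lock cache_lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock cache_lock) (fun () ->
      match Hashtbl.find_opt cache key with
      | Some v -> v
      | None ->
          let v = compute key in
          Hashtbl.add cache key v;
          v)

let () =
  let workers =
    List.init 8 (fun i ->
        Thread.create
          (fun () ->
            ignore (lookup_or_compute (string_of_int (i mod 4)) (fun k -> k ^ "!")))
          ())
  in
  List.iter Thread.join workers
```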

I guess the general question is: how should one decide on the exact combination of domains and threads? The fact that there is no TLS data structure, which is a basic necessity if using threads in non-trivial ways, adds to the confusion, since it implicitly biases the situation towards using domains (there is an upstream PR, Thread-local storage, take 3 by gasche · Pull Request #13355 · ocaml/ocaml · GitHub, but it is somewhat in stasis).
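
(In the meantime, one can of course emulate TLS by hand, e.g. with a table keyed by Thread.id; a rough sketch of the workaround, with its obvious costs:)

```ocaml
(* TLS emulated in user land: a mutex-protected table keyed by the
   current thread's id. It works, but it adds a lock acquisition on
   every access, which is exactly what real TLS avoids, and entries
   are never cleaned up when threads exit. *)
module Tls (V : sig
  type t
  val init : unit -> t
end) = struct
  let table : (int, V.t) Hashtbl.t = Hashtbl.create 16
  let lock = Mutex.create ()

  let get () : V.t =
    let id = Thread.id (Thread.self ()) in
    Mutex.lock lock;
    Fun.protect ~finally:(fun () -> Mutex.unlock lock) (fun () ->
        match Hashtbl.find_opt table id with
        | Some v -> v
        | None ->
            let v = V.init () in
            Hashtbl.add table id v;
            v)
end
```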

My feeling is that I have heard mostly negative reports about porting efforts to Multicore so far, but maybe that is because positive ones are not reported as much. So if you have had a positive experience around the Multicore runtime, please share it. Thanks!

Cheers,
Nicolas


My two cents: better to stick with your multiprocess implementation, especially if you already have good parallel performance.
Currently, you can reason about your code and when there is some communication happening between processes, it is explicit and should be traceable/debuggable.


See Tarides | We're Moving Ocsigen from Lwt to Eio!

In Owi, we turned a single-core application into a multicore one. This is not exactly what you are looking for, I think, since we can’t compare with a multi-process version, as we never had one. But I can’t remember any serious trouble with the OCaml 5 runtime for Owi. Note also that this is a project that spends a lot of time in C libraries (an SMT solver), so there is not that much pressure on the runtime’s performance. Anyway, we consider that it was quite successful.


Several thoughts:

(1) Other language runtimes have gone down this road before OCaml, and it might be worth looking into their experiences.

(2) I know that in the case of Java, there was a lot more to it than just getting the first round of SMP-friendliness in: there were various “hot locks” strewn throughout the runtime and libraries that needed to be fixed. You should expect that here also.

(3) It was a real problem in Java, when running Java inside of guest VMs. People would “overcommit” CPU, running a collection of guest VMs that collectively needed more cores than the host provided. This resulted in guest VM threads getting starved, and I remember the hilarity as people were running distributed systems protocols that needed a certain amount of liveness (as in: “responses will arrive within N seconds”) and weren’t getting it. Lots of brokenness.

(4) That said, for each workload there is some multiple of the number of cores that you can use as the rough number of active kernel threads you should shoot for. It’s application-specific, but you should be able to figure it out, I think.

(5) Going back again to the origin story of J2EE: the reason that J2EE ended up with many threads in a single JVM is the same one you cite, only much worse: there was just too much live heap per application, and N copies of that heap would have consumed far, far too much memory.

But eventually the galloping increase in memory made this less of a problem, and people got used to running a small number of identical JVMs for any particular application, instead of just ONE. At that point, you can somewhat use standard “at the first hint of trouble, shoot the process” fault tolerance, and things get much better, fault-tolerance-wise. And this was critically enabled by the arrival of much, much bigger memories.

Let me say that again with less story-telling: there’s a joke people make, that instead of spending six months and a ton of programmer time on performance-optimizing your application, if you just wait 12-18 months, the next generation of machines will give you the speedup you need, and you can spend that programmer time on new features, or bugfixes.

There’s a similar story about memory: just buy bigger memories, it might be cheaper than spending the programmer time.


Thanks for the replies!

If I understand correctly, if we already keep all CPUs busy with N processes (under our direct control, or otherwise running on the machine), it would really be a bad idea to use multiple domains, even just 2 for instance, within our processes. Is that right? So the direction might rather be to stick to one domain per process, possibly using multiple threads (as we could have done in OCaml 4).

The rule of thumb is that if your application is embarrassingly partitionable (minimal read/write data-sharing, and well-controlled, e.g. at well-defined points during execution) then processes can be much better than threading. This assumes that memory isn’t a price issue.

Yes, you’re absolutely right—the context here is handling parallel requests, which we initially addressed by using multiple processes. This approach made partitioning relatively straightforward. However, we’ve found that each process tends to consume a significant amount of RAM, largely due to duplicated caches that could otherwise be shared. That’s what has renewed our interest in consolidating multiple “workers” into a single process.

One of the challenges we face is that we can’t reliably know how many CPU cores will be available to our application. If I understand correctly, that makes the multicore approach risky in terms of consistent performance.

We do already use explicit shared memory for some caches (via memory-mapped files), but that forces us to either work directly with binary data or incur overhead deserializing it into OCaml values. Unfortunately, that doesn’t scale well to all use cases.
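
For illustration, the current setup is roughly the following sketch: each worker process maps the same cache file and reads it as raw bytes (map_cache is a name invented for this example):

```ocaml
(* Each worker maps the shared cache file into its address space as a
   Bigarray of bytes. Sharing is cheap, but anything structured has to
   be (de)serialized by hand. *)
let map_cache path =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  Fun.protect
    ~finally:(fun () -> Unix.close fd) (* the mapping outlives the close *)
    (fun () ->
      (* [true] = shared mapping; [| -1 |] = size deduced from the file. *)
      Unix.map_file fd Bigarray.char Bigarray.c_layout true [| -1 |]
      |> Bigarray.array1_of_genarray)
```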

As for the trade-off between engineering a better solution versus just buying more memory: even at our modest scale, memory is not cheap. On AWS, for example, the ratio of RAM to vCPUs is fixed within an instance family—so increasing RAM often means paying for additional vCPUs we don’t actually need. When you’re running hundreds of VMs, reducing per-process memory usage can quickly translate into substantial annual cost savings, more than justifying the engineering effort.

And for our on-prem customers, it’s not really an option to just ask them to upgrade their hardware to accommodate inefficiencies.

Anyway, thanks again to everyone for the insightful discussion!

Regarding using shared memory, I think it may be possible to avoid serialization issues in some cases by placing out-of-heap blocks into shared memory (using Caml_out_of_heap_header).

Since you are already using shared memory, you probably have some way of dealing with the concurrency issues that arise, which I think is the major downside to doing sharing in this manual way.

I have had a great experience using Eio’s Executor_pool module to parallelize tasks with shared memory. In particular, the shared data is a hashtable whose values are (also shared) Bigarrays. This is true parallelism across domains, of course, not just threads within one domain. I think the recommended way to do this is to have one executor pool, with an appropriate number of domains, that is then passed around the program. This way, you don’t use more domains than cores. Admittedly, in my use case so far I create a new Executor_pool whenever I need one, because in the way I’m using it, parallel work for different tasks won’t happen at the same time.

There is a Domain.recommended_domain_count function; it seems like it could be useful for you.
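
For what it’s worth, the shape of the code is roughly as follows, written from memory, so treat the exact signatures as approximate; heavy_work is just a stand-in workload:

```ocaml
(* One pool, sized from the runtime's recommendation, then jobs are
   submitted to it; submit_exn runs the job on some domain in the pool
   and returns its result. *)
let heavy_work () =
  let s = ref 0 in
  for i = 1 to 10_000_000 do s := !s + i done;
  !s

let () =
  Eio_main.run @@ fun env ->
  Eio.Switch.run @@ fun sw ->
  let pool =
    Eio.Executor_pool.create ~sw
      (Eio.Stdenv.domain_mgr env)
      ~domain_count:(Domain.recommended_domain_count ())
  in
  (* ~weight is the fraction (0.0 to 1.0) of one domain the job uses. *)
  let r = Eio.Executor_pool.submit_exn pool ~weight:1.0 heavy_work in
  Printf.printf "result: %d\n" r
```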

I’ve seen significant performance improvements, and no access problems. The only thing I have been careful about is updating key/value pairs of shared hashtables: what I’m doing is first splitting up the input for the tasks by getting the relevant information from the hashtable, and then dispatching tasks with this information to the Executor_pool. I am (perhaps unnecessarily) worried about what could happen if I were updating key/value pairs of a hashtable in a shared environment. The fact that I’m able to use it successfully without understanding this is a credit to Eio. Of course, I don’t know the intricacies of your use case, but it’s pretty easy to use, so I would recommend trying it out to anyone.

That was really the initial question: what is the “recommended domain count”? IIUC, the required property is that all domains should run concurrently, each on its own CPU; otherwise, a request for a minor collection in one domain might need to wait for other domains, possibly killing performance. But in a context where we don’t control all processes running on the machine, there is no way to have such a guarantee.

You’re looking for gang scheduling. Regrettably, no popular OS implements it, AFAIK.

Should be possible using CPU affinity, no? processor 0.2 (latest) · OCaml Package

There are patches flying around for Linux that provide the userspace calls to implement gang scheduling (based on https://cs.stanford.edu/~jhumphri/documents/ghost.pdf), but I lost track of them after they turned into “sched-ext” eBPF-based ones somewhere along the way (GitHub - sched-ext/scx: sched_ext schedulers and tools). Slowly getting there, on Linux only, but no hope of anything portable this century.

Thanks @xavierleroy for the pointer, I learned about the concept.

Could experts in the multicore runtime confirm that, without gang scheduling, there is no hope of getting reliably good performance with multiple domains on a machine where other processes could run in parallel?

There’s no portable option that I know of, but there are plenty of ways to make this work if you pick an OS and decide on how you want to partition your resources. If you let the OS decide on your scheduling for you, then you will get the lowest common denominator that it can guess in the quanta of a scheduling decision.

For example, on Linux you can use cset to isolate process cores (Ubuntu Manpage: cset - manage cpusets functions in the Linux kernel) to shield them from other processes running there (not as efficient as gang scheduling, but you get the perf you need for your OCaml process). You could also use cgroups (which is what Docker uses) to further partition groups of CPUs into management units for different apps. If you have a lot of memory, then you’ll also want to look at numactl (https://linux.die.net/man/8/numactl) to map memory closest to the core to that process. You can see some of the radical performance differences for inter-core communications in this 2012 paper (Fig 2/3) or this FOSDEM talk I gave on the subject back then. Machines have only gotten more non-uniform since then (with efficiency cores and power throttling).

I’ve got no idea about how modern Windows does this, and macOS doesn’t expose anywhere near enough control, due to its use of Grand Central Dispatch, which routes work everywhere to kernel-managed queues.

(Edit: @sadiq’s also interested in supervising OCaml projects involving sched_ext, see his project list at the end of Last three months in OCaml (July 2025) - Sadiq Jaffer)


If the context is the ECS deployment mentioned above, multicore should allow you to reserve a box with enough RAM for a single process while ensuring you can get close to 100% CPU utilization. So you should have less need to pack multiple processes on a single box.

If you need multiple processes for other reasons (e.g. running a sidecar process alongside the regular program), then you probably need to do some manual resource planning and tweak each process accordingly.