OCaml 5 performance

I’ve been trying out some tools to investigate performance problems in my OCaml programs, and I’ve written up my experiences here in case other people find them useful:

The first post examines a case of slow IO in a concurrent Eio program, and the second looks at poor GC performance in a multicore app.

In particular, it seems that minor GC performance is very sensitive to other work running on the machine, since any domain arriving late will cause the others to sleep while they wait for it.

I’d be interested to know if others can shed more light on this, or have other profiling tools they’ve found useful.

36 Likes

Very impressive work. The second post seems like it should have an associated issue on the ocaml repo, no?

This is very interesting.

For the second post, I’d like to point out the theoretical unevenly-paced-domains-cause-lots-of-premature-collections problem, for which concrete benchmarks would be useful (see KC’s comment here).

2 Likes

I agree this is very interesting! I would be curious to know whether the merge of OS-based synchronization (the futex PR) improves things noticeably in your tests.

Can we retrace the thinking that led us to have minor GC require full STW synchronization? Was it really made only for the C interface? Is it possible that the benchmarks leading to this decision weren’t thorough enough? Because the impression I get is that we’ve turned even the most embarrassingly parallel tasks into very serial ones at the micro level, and that’s not going to turn out so well for our performance.

4 Likes

Thanks for looking into OCaml 5 performance.

How many cores did you have available when producing the graph in ‘part 2’? It shows a performance degradation starting at about 10 cores, and mpstat output shows about 8 cores (although the post mentions 160 cores).

If domains > cores then this might be multicore: massive slowdown on spectralnorm when domains > cores (slower than single domain) · Issue #11818 · ocaml/ocaml · GitHub, although 80 domains on 160 cores shouldn’t hit that problem, unless you also run multiple solver processes and don’t have 80 idle cores for each one to use?

1 Like

Some of the tests ran on my machine (8 cores, x86) and some on the server (160 cores, ARM). If a graph goes above 8 cores, it’s from the server.

We have been saying for decades now that multiprocess-based parallelism scales well and is an excellent choice for OCaml when it fits the workflow. (Before we merged the Multicore runtime, it was the only choice.) This remains true today. If your workload is embarrassingly parallel, with little shared data and no communication between sub-computations, multiprocess-based is going to work better than using several Domains.

Domains introduce synchronization costs between sub-computations, with the promise of making shared-memory parallelism and fine-grained synchronization fast. If you want to see what domains are supposedly good at, you should look at other benchmarks. If you don’t use shared-memory communication and/or fine-grained synchronization, you pay all the costs of a multicore runtime without any of the benefits.
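To make the distinction concrete, here is a minimal sketch (mine, not from the posts) of the kind of fine-grained shared-memory communication that domains make cheap, and that separate processes would need shared memory or message passing for:

```ocaml
(* Minimal sketch: two domains incrementing one shared atomic counter.
   With processes, the counter would have to live in shared memory or
   be merged afterwards; with domains it is an ordinary OCaml value. *)
let run () =
  let counter = Atomic.make 0 in
  let work () = for _ = 1 to 100_000 do Atomic.incr counter done in
  let d1 = Domain.spawn work in
  let d2 = Domain.spawn work in
  Domain.join d1;
  Domain.join d2;
  Atomic.get counter

let () = assert (run () = 200_000)
```

If your workload never does this kind of sharing, the multicore runtime’s synchronization costs buy you nothing, which is the point being made above.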

The tests in this post are basically worst-case scenarios for domains. It is interesting to look at the results (what I really like is the demo of instrumentation tools; few people know about how to profile this stuff and it’s really useful when experts explain their tricks to do it), but it is expected to see processes beating domains there.

(It is also not so surprising that the current multicore runtime does not scale very well to large numbers of domains. This is partly due to the stop-the-world nature of the GC, but also, more generally, because I think most of the profiling and optimization effort went into recovering non-multicore-like performance for sequential programs, which describes the entirety of OCaml programs today. So there is probably room for improvement in the multicore workflow.)

10 Likes

This is plausible but not obvious to newcomers. Not knowing the internals, I had been assuming that multicore would be (apart from its core use case) also a replacement for all the pre-existing multiprocessing based parallelism libraries. For instance, multicore could conceivably save the cost of serialization to aggregate results, and/or the complexity of managing shared mmapped arrays.

Anyway, good to know. And I think this caveat (“don’t use multicore for performant embarrassingly parallel workloads”) should be communicated in some easily found location.

4 Likes

Based on @edwin’s questions, I think I don’t want to worry about the number of CPUs available when designing software; I’d rather delegate the scheduling to the runtime system and the OS. So I hope that we get tasks, or some other abstraction than domains, that we create freely, so that it does not matter whether there are 10, 100, or 1000 of them. I hope we can talk less about domains in the future.

1 Like

Note that in the real system there is quite a bit of synchronisation. For example, all the workers read package information from opam-repository using ocaml-git and keep it in an in-memory cache. Using domains lets that cache be shared. When a request comes in using a different commit, we might have to do a git pull and reset ocaml-git’s state, which requires stopping all the workers, etc. The code got a lot simpler and more reliable by switching to domains.

However, I removed that code when making the test-case, which just does a single warm-up run to populate the cache and then spawns workers doing the same solve over and over. That still seemed to show the same poor scaling, but was easier to analyse.

2 Likes

I like your suggestions in the post now that I’ve read them more carefully. It will indeed be interesting to see how 5.3 fares.

Also, in addition to a per-thread runtime flag that opts idle domains out of minor GCs, it would be nice if a domain that only allocates memory that cannot possibly belong to other threads could opt out of minor GC as well. The closures you show in your code seem like good candidates. A flag could be raised once the thread encountered allocations that could have been passed to other threads.

It would be nice to do something about the apparent asymmetry of execution. I imagine that in almost all high core-count CPUs, a couple of cores will be high performance, and the domains will be rotated between them. Maybe an asymmetric division of labor + CPU affinity makes sense here.

1 Like

domainslib provides such an abstraction: domainslib 0.5.1 (latest) · OCaml Package
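For readers who haven’t used it, here is a small sketch of what that abstraction looks like, based on the domainslib 0.5 Task API (treat the details as illustrative): you size the pool once, then create as many tasks as the problem needs.

```ocaml
(* Sketch of domainslib's task abstraction: a fixed pool of domains,
   with lightweight tasks scheduled onto it, so the amount of work
   need not match the number of cores. *)
let () =
  let open Domainslib in
  (* 3 extra domains plus the calling one: 4 workers in total. *)
  let pool = Task.setup_pool ~num_domains:3 () in
  (* Sum 1..1000 in parallel; iterations are split across the pool. *)
  let sum =
    Task.run pool (fun () ->
        Task.parallel_for_reduce pool ( + ) 0 ~start:1 ~finish:1000
          ~body:(fun i -> i))
  in
  Task.teardown_pool pool;
  assert (sum = 500_500)
```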

Moonpool (to pick one of the OCaml 5 effect schedulers, but there are others too) also provides a useful abstraction: GitHub - c-cube/moonpool: Commodity thread pools and concurrency primitives for OCaml 5:

This fixed pool of domains is shared between all the pools in moonpool. The rationale is that we should not have more domains than cores, so it’s easier to pre-allocate exactly that many domains, and run more flexible thread pools on top.

Yes, that is one of the problems that XAPI’s set of daemons has too: multiple daemons all run inside a VM and communicate through JSON/XML-serialized messages. Moving each process to its own domain may avoid some of that overhead but introduce other problems; I haven’t done the measurements yet.

Although, if a design other than STW had been chosen for the GC, I think we might not have been able to move XAPI to OCaml 5 at all. The current design choice in OCaml 5 is good for backward-compat/gradual-upgrade scenarios: you only pay the cost of possible data races or these performance issues if you start using Domains; if not, everything behaves as before with Threads.

IIUC, OCaml 5 multicore performance currently sits in between OCaml 4 threads and OCaml 4 separate processes.

It is much faster than running workloads in separate threads in OCaml 4, because now at least some threads can truly run in parallel, even if they occasionally have to synchronise with each other. That is still a lot better than what we had before, where only one OCaml thread could run at a time and every other thread was blocked.

It is also slower than OCaml 4 with separate processes, when those processes had an efficient way to share work (i.e. with little or no synchronization and little or no data sharing).

If your OCaml 4 processes didn’t have an efficient way of sharing data, then it is a little less clear whether OCaml 5 domains would be beneficial; it depends a lot on the workload. My hope is that it might help the kind of workload we have in XAPI, where delays, protocol and serialization overhead between oxenstored, xenopsd and XAPI can cause minute-long delays in the current threaded model. All of that could be avoided, because within a single process we can give direct read-only access to immutable data structures without any copying.

And it is only a starting point, I hope that the performance of OCaml 5 can be improved in the future!

Maybe we’ll need a collection of real workloads that are currently affected by OCaml 5 STW delays (microbenchmarks are great of course, but they may not capture all aspects of real applications), and @talex5 's is certainly one of those real workloads.

2 Likes

This sounds a bit like a strawman; I think people are discussing the synchronisation costs for parallel tasks, whether embarrassingly parallel or not, and specifically the impact of the STW design. And even for embarrassingly parallel tasks, I don’t see people complaining about a reasonable synchronisation tax, given the other advantages provided by multicore.

I want to add that the backwards-compatibility argument for the C FFI is a sound one (avoiding hitting the wall of reality). Though it’s unclear how much of the backwards-compatibility claim for the STW design will survive the C memory model issues.

Update: The multicore solver is finally faster than the old process-based one!

13 Likes

Fantastic. I’m curious to know whether some of the performance-profiling tools helped you discover the issue, or was it an intuition that turned out to be correct? I ask because I’d like to know whether, and how, the existing tools can be used by non-experts to debug similar problems.

The issue with spawning git slowing down all domains? Yes, that showed up on the eio-trace output quite clearly (from Use OCaml code to find the oldest commit by talex5 · Pull Request #79 · ocurrent/solver-service · GitHub):

The top part shows one of the main domain’s fibers spawning a git subprocess (“with_pipe”). The worker domains below had been performing regular minor GCs, but you can see there is no GC activity during that period, indicating that they’d stopped making progress.

The other two gaps in the GCs are also caused by other fibers spawning subprocesses (not shown here to save space).

The tracing does need improving a bit though. I’ll get Eio to show the spawn call explicitly (so you don’t have to guess what the with_pipe code is doing): Record trace event when spawning processes by talex5 · Pull Request #749 · ocaml-multicore/eio · GitHub

We should also get OCaml to report a trace span when a domain is trying to become STW leader, as that doesn’t show up at the moment.

2 Likes

Thanks for the clarification.

I wondered why the stw pause due to subprocess spawning was not identified as the issue in your original analysis. Was it because it was mistaken for something else (because the trace didn’t have the necessary detail)?

(To avoid derailing the thread) Can you make an issue for this at ocaml/ocaml please? I’ll be happy to implement the necessary span.

When I looked at the problem originally (last year), eio-trace didn’t exist yet and instead I made the simplified version of the solver. That showed that there was still a major scaling problem even without spawning processes. The blog post used that simplified version as the application being examined.

The conclusions are:

  • STW means that one straggler domain slows everything down.
  • With lots of domains, this becomes a major problem. On the ARM server, it can be mostly fixed by using “real-time” scheduling instead of SCHED_OTHER.
  • However, scaling is also limited by spawning subprocesses pausing all domains. That was fixed by doing the work in OCaml instead.

(and I’ll make a PR on ocaml/ocaml with the extra tracing stuff)

4 Likes

If fork() causes noticeable delays like these, it might be useful to emit a runtime event when one is detected (fork in a multithreaded program is usually a bad idea anyway, due to async-signal safety).
I think this can only be done from the C side, by registering a pthread_atfork handler. You get both prepare and parent handlers, and if the manpage is to be believed, these run in the parent just before fork() starts and just after it completes, so capturing them as runtime events would tell you the exact duration. (I think fork’s duration should be proportional to the program’s size in memory: the pages don’t need to be copied, but the page tables do.)
Perhaps this could be added as some optional tracing that could be enabled via an OCAMLRUNPARAM flag or so?
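Until such a runtime event exists, you can already get a rough number from plain OCaml by timing how long fork() takes to return in the parent. This is my own sketch using the Unix library, bracketing roughly the same interval a prepare/parent pthread_atfork pair would observe (a C handler would be more precise):

```ocaml
(* Sketch: measure the wall-clock time Unix.fork takes to return
   in the parent, which approximates the page-table copying cost. *)
let time_fork () =
  let t0 = Unix.gettimeofday () in
  match Unix.fork () with
  | 0 -> exit 0 (* child: exit immediately *)
  | pid ->
      let t1 = Unix.gettimeofday () in
      (* Reap the child so we don't leave a zombie behind. *)
      ignore (Unix.waitpid [] pid);
      t1 -. t0

let () = Printf.printf "fork took %.6f s\n" (time_fork ())
```

Allocating a large heap before calling `time_fork` should make the measured cost grow, matching the page-table explanation above.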