Parallel performance of an OCaml program: ocaml-5.0.0 Vs 4.13.1

I haven’t gathered meaningful statistics unfortunately. I remember this was somewhere between 4.10/4.10+domains and 4.12 for each, respectively. So, by that time, multicore was still WIP but it was nearing the end of its development. I’m curious if 4.14’s prefetching would improve this workload, or if the released multicore fares better, or if using e.g. domainslib can be more optimal.

The files are self-contained low-dependency and near-identical, and the measurement code is in the Makefile, so hopefully it’s easier to reproduce and improve for anyone interested.

I wanted to functorize but that seems to have made the compiler lay out the code slightly worse and created a bit of a slowdown.

Just one thing to note here. There’s a fair bit of work to do but we can probably alleviate the worst of the issues around having more domains than cores, it’s certainly not all set in stone.

One option is to have some number of GC domains (which might be dynamic from collection to collection) that’s smaller than the actual number of domains, and each GC domain does the work for one or more running domains.

The one bit I don’t think we can change (without breaking the C API) is we need to wait for all domains to hit a safe point and that means if you’ve got vastly more domains than cores (or on a heavily loaded system) you might be waiting a while at every minor collection.

It would be interesting to establish what is the contribution of each in the drop in performance from more domains than cores.

1 Like

Note: It took my beginner self a long time to understand that the “waiting a while” occurring here is not just “you have to wait for each domain to reach a poll point, so the more domains the longer maximal wait”, but rather “the OS may deschedule one of the domains, and everyone will be blocked for two/three orders of magnitude longer”.

3 Likes

In the early days of virtualization in enterprise TP, people would run Java apps that used Paxos and Lamport-Liskov distributed systems (hence required periodic heartbeat messages for liveness) in VMs that overcommitted memory on their host. So with more VM-declared memory, than actual memory on the host machine. Boy howdy, so much fun when the VM’s pages got paged-out by the host, and since the host didn’t know any better, it could and would page out the pages associated with the threads that were responsible for the heartbeat.

Even with multi-minute heartbeat timeouts, many significant users of these systems saw all sorts of craaazy breakage as VMs paged-out, were declared dead, then came back and responded, were declared alive, etc, etc, etc.

1 Like

Was it considered to empty just the minor heap which is full?

Can this behavior be changed via an environment variable?

For a parallel program, blocking all threads is a very bad thing (it’s called a barrier).
I guess, if only a single heap is emptied, this barrier could be shorter, which might be a good thing for parallel performance.

I don’t think people are asking to have much more domains than cores.
There is a recommend value, fine.

I’m asking this question b/c I don’t know the answer, and maybe it’ll enlighten for other readers. When you say a “safe point”, I’m reminded of the “safe points” in other multi-thread-friendly GCed runtimes, like for some JVMs. Where “safe point” means that a thread has parked all its out-of-representation pointers in known in-representation locations, so that a GC can move around the stuff those pointers reference. And this happens pretty frequently – it isn’t something that you have to wait forever for. Is this the same meaning of “safe point” that you’re using in the multicore runtime? Or is something different meant ?

I would make a distinction between “safe point”, where we know that we can run a GC, and “poll point”, where we actually check if we need to run a GC (or handle other asynchronous events: signals, monitoring callbacks, etc.). Poll points occur on each allocation which indeed occur very often, and some extra poll points are added on loop backedges to (try to) guarantee that we never wait forever.

Yes. This was our earlier design which allows minor heap arenas of each domain to be independently collected. This breaks the C API. The details are in the ICFP 2020 paper: Retrofitting Parallelism onto OCaml (ICFP 2020 - Program) - ICFP 2020.

It should be clarified that the barrier itself is not a problem. The latest research in Java land (LXR: Low-Latency, High-Throughput Garbage Collection (PLDI 2022 - PLDI Research Papers) - PLDI 2022 – distinguished paper award winner), which has had a long history of parallel GCs, shows that brief stop-the-world sections work surprisingly well for complex tasks (such as evacuation) when compared to the state-of-the-art, industrial strength, pauseless (concurrent) compacting collectors such as Shenandoah and ZGC. From the LXR paper

LXR’s design premise is that regular, brief stop-the-world collections will yield sufficient responsiveness and far greater efficiency than concurrent evacuation

The stop-the-world sections in OCaml 5 do a limited amount of work (the size of the minor heap). The stop-the-world minor GC is parallel and copying – only the live objects are touched (unlike a mark-and-sweep collector which needs to touch all objects whether live or dead). And typically the survival rate in the minor collection is less than 10%.

Overall, the design of the GCs is not the problem here. It is the issue of over-commitment. There are solutions in GHC where the number of threads that participate in the GC are set to a different number than the number of HECs (analogous to domains). These would be useful to explore before attempting to switch to concurrent minor collection designs where each domain can independently collect the minor heaps. This GC breaks an awful lot of libraries in the ecosystem, and they are not easy to fix (no sed scripts will work). This is non-trivial amount of work.

I had a look at your article. Just glanced through it to be honest, this is a very specialized article.
I understand that all the constraints you had to satisfy are very hard to meet and that there were some tough choices and compromises.

Perhaps the important bits are these.

Concurrent minor GC, which permits minor heaps of domains to be independently collected, requires a read barrier.

From section 1.2,

While this allows users to write fast code, the API also brakes in the invariants of the GC. For example, reading any OCaml object, be it mutable or not, using the C API does not involve a read barrier, and is compiled as a plain read of memory. A new GC scheme that adds a read barrier only to reads of mutable fields will need to deprecate the old API or suffer the risk of breaking code silently. Either way, the users will have to modify their code to work correctly under the new GC. Given that the compiler does not check incorrect uses of the C API, it is already difficult to write correct and efficient FFI code. We would like to strike a balance between the added complexity of the C API and the performance impact of the missed opportunities.

Essentially, the Field macro needs to include a read barrier and is also, critically, a GC safe point where the GC can move objects. This is not the case in sequential OCaml. As a result, Field(v,i) cannot be an l-value either. The users will not only need to change their code but also require a careful audit of all the C FFI use in order to ensure that the assumptions about when GCs are run aren’t broken. It is possible that we can design a clever static analysis tool on the C source code to analyse them easy safety cases. But this is an open research question.

OTOH, we experimentally validated that the stop-the-world parallel minor GC performs as well as (and in some cases better than) the concurrent minor GC (see results in Fig 11 and Fig 12). So our initial fears that stop-the-world GC would fare quite poorly against the concurrent collector were not well founded. Of course, the caveat is that we are not testing under over-committed environments.

I should say that the doors are not shut for innovation here. There are a number of things that we can try to improve the performance. Many of the criticisms found in the discuss forum are valid. Given that the problems are challenging and the resources are limited, we will need to prioritise and work on these.

What we’ve done in OCaml 5.0 is to ensure that we haven’t made any choices about the design that changes the semantics – no breakages. This allows the community to move to multicore easily, and take advantage of parallelism with well-understood limitations. We can now start looking at more interesting designs that might potentially break APIs if the performance improvements are worth it. If we had broken the APIs and had limitations, I think our uses would have been more cross than they are now :slight_smile:

10 Likes