Earlier this week (24-25 June 2018) we had a meeting in Paris between several OCaml maintainers and researchers (folks from INRIA, OCaml Labs, Jane Street, and also Frédéric Bour), and one of the topics discussed was the technical state of the Multicore-OCaml project. I thought that people here could be interested in a status update on that. Note that this is not an update on user features (no release date for Multicore-OCaml in this post, sorry!), but an update on technical development plans, so that people who follow compiler development on https://github.com/ocaml/ocaml/ (or https://github.com/ocamllabs/ocaml-multicore/) have some context for what is coming next.
I’m going to concentrate here on the parts that concern the multicore runtime (garbage collection and low-level runtime code), summarizing the main points from my own notes of the meeting. Note that I’m not working on Multicore-OCaml myself; I’m reporting on work done by others. I have certainly made some mistakes below, and may update this post to fix them.
TL;DR: In 2019 we hope to integrate the multicore runtime into the upstream compiler distribution, not enabled by default at first (it would still be experimental), so that it can evolve with the distribution and get the kind of testing and code-porting experience we need to make decisions before a wider release.
Current state of the multicore runtime
KC Sivaramakrishnan (@kayceesrk) gave a presentation on the current state of the Multicore OCaml project. The slides are available. They are a good summary of the history and current state of the multicore runtime.
General milestone: migrate the multicore-ocaml runtime into the main distribution
Stephen Dolan (@stedolan) and KC reported that keeping up with changes in the OCaml compiler implementation is a lot of work for them (they recently rebased their branch onto 4.06; it was based on 4.02 before that). When someone changes something in the OCaml compiler or runtime, they check that the rest of the distribution still works, and fix or adapt whatever the change affects; but those changes are not tested against the multicore runtime, so Stephen and KC have to do all of the fixing and adapting for every change themselves.
To help with this problem, the consensus is that we should upstream the multicore-runtime code into the compiler distribution soon-ish, not enabled by default, so that people who change the compiler codebase immediately see the effect on the multicore runtime and can participate in fixing and adapting the code with respect to their change, with guidance from multicore experts. For a period of time, the multicore runtime would only be available as an experimental option.
We hope that this migration could be done in about a year.
Q3/Q4 2018: build-up PRs
There are some PRs that change parts of the current compiler and runtime to play nicer with the multicore bits, but are not part of the multicore runtime themselves. The plan is to submit these PRs during the next development cycle (for 4.08, although we don’t know whether they will be merged in time for the 4.08 release). This is not a new process: some such PRs have already been integrated over the last few months/years (see for example GPR#1073, GPR#1723), but the hope is to get as many of those preliminary changes merged as possible.
Q3/Q4 2018 (?): forward-compatible C API
We know that Multicore OCaml will require some changes to the way C stubs are written for OCaml. Of course, we cannot expect authors of C bindings to switch overnight, so there will be a transition period during which existing C bindings won’t work with the multicore runtime until they are ported. On the other hand, the current runtime should be able to support the multicore-friendly C API, so that ported code can work on both runtimes.
We agreed that it would be useful to make this “multicore-friendly C API” available as soon as possible (or at least part of it), even before the multicore runtime itself lands, so that people can start making their code forward-compatible right away. Stephen already tried to do this last year in a single giant PR, GPR#1003, but it was too big and was never quite finished.
My understanding is that the Multicore-OCaml people are still not completely sure about what the final API will be (which things definitely have to be broken, and which things might be supported with more work on their side), and have hesitated to push changes that could end up being unnecessary. We agreed to try again with smaller API changes, submitted separately from each other instead of as a single giant change, and agreed in principle that it is OK to propose new APIs that have a small chance of turning out to be unnecessary, as long as the API is sensible and can easily be implemented on top of the current runtime. (Early adopters might face a bit of code churn as things stabilize.)
(One first change that we want to look at is GPR#1798, which implements a notion of C-API versioning.)
2019: multicore runtime features, as an experimental runtime
Then the plan is to start merging the multicore runtime codebase itself, piece by piece, so that it becomes possible to run larger-scale experiments with it. It still wouldn’t be enabled by default at this point, but it would be part of the actively moving compiler distribution, and in particular would stay at feature parity with the rest of the compiler and runtime codebase.
In this phase we’ll need people to review the multicore runtime implementation, if only to help with future upstream maintenance. We have started asking around to find who might be interested, ideally people with a relevant skill set. The project also has a laundry list of pending tasks that people studying the codebase could work on.
Some of the tasks being worked on involve implementing parts of the OCaml runtime system that the multicore runtime does not yet fully support, such as ephemerons (a generalization of weak pointers). My understanding is that the multicore devs would like to reach feature parity before merging the runtime code, but this may be revisited if some parts of the runtime prove too difficult to support.
remark: the runtime/language split
The current multicore-ocaml fork/switch contains both the multicore runtime and an implementation of (untyped) effect handlers in the surface language, which is how users access the concurrency features (to control fibers / green threads). Effect handlers are the subject of evolving proposals of their own (there is a type-and-effect system being worked on by Leo White), and they are being discussed somewhat independently. Bundling the two changes in the same patchset makes reviewing more difficult, and it has also created some silly technical issues: because effect handlers change the language AST, most ppx-extension code is broken on the multicore-OCaml fork, which makes it difficult to use language tooling, test user programs, run interesting benchmarks, etc.
In the short term, the plan for upstreaming the runtime is to separate it from the effect-handler part by exposing an extremely minimal fiber-control API, as compiler primitives or as part of the Obj module. That is not how anybody wants end users to access the multicore runtime, but it would be a minimal device for the first period of upstreaming and reviewing the runtime code, making it easy to compile any codebase against the multicore-aware compiler and to use the standard OCaml packages and tooling in a multicore switch.
remark: performance tuning, not yet
Right now, if I understand correctly, the multicore-OCaml devs have mostly been working with micro-benchmarks, in large part because of the difficulty of using regular OCaml packages and tooling mentioned above. Many opportunities for (and a real need for) performance tuning will appear once macro-benchmarks and realistic workloads become available, and once some of the larger performance-sensitive codebases (which often include C bindings or compiler-sensitive Obj hackery) have been ported. As Anil Madhavapeddy (@avsm) pointed out, once more real-world code can be benchmarked against the multicore runtime, we should start continuously monitoring the performance results.
The general expectation is that the multicore runtime will be slower than the current runtime for purely sequential programs, but the goal is to keep this overhead small (a first target that was mentioned was a 10% overhead, although we really don’t know yet how easy or hard that target is). The two runtimes may both remain available in the distribution for as long as enough users ask for the sequential runtime and the overhead is high enough to justify the maintenance cost of keeping both. (In terms of the multicore runtime’s performance on sequential workloads, some things can be made faster at the cost of being harder to write and possibly more painful to maintain, so there are trade-offs still to be explored there.)
One interesting point that Stephen explained to me: you cannot just take a sequential program (say Coq), compile and run it under a multicore switch, and expect to get a meaningful “overhead number” (as in: “the multicore runtime is X% slower than the sequential one on this program”). The problem is that a GC can be configured to have more or less memory overhead, and asking for less memory overhead results in more GC work, hence a slower overall program. It doesn’t make sense to compare only the time taken by two GCs at their default settings, as they may have very different memory-overhead profiles: the second GC may look faster, but if you adjust its settings to use no more memory than the first, it may actually be slower. What you have to do instead is plot the 2D time/memory trade-off curve and compare the curves of the two GCs, or at least compare the whole curve of the new GC against the current results of your existing GC.
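To make this comparison concrete, here is a minimal OCaml sketch, assuming the Gc module of the current sequential runtime. The setting it varies, space_overhead (the o= parameter of OCAMLRUNPARAM), is the real knob that trades memory for GC work; the names sample, measure, frontier and best_time_within are hypothetical helpers, not part of any existing library or of the multicore project. Each call to measure yields one (time, peak memory) point; frontier turns the points for one GC into its trade-off curve, and best_time_within answers the question that actually matters: at a given memory budget, which GC is faster?

```ocaml
(* Hypothetical helpers for comparing two GCs on a time/memory plot.
   Assumes OCaml's Gc module; space_overhead is the real setting that
   trades memory for GC work (lower = less memory, more GC time). *)

type sample = {
  space_overhead : int; (* the Gc setting used for this run *)
  seconds : float;      (* CPU time spent in the workload *)
  peak_words : int;     (* high-water mark of the major heap, in words *)
}

(* Run [workload] once under the given space_overhead and record one point.
   Assumption: one sample per fresh process, because top_heap_words never
   decreases within a process, so later runs would inherit earlier peaks. *)
let measure ~space_overhead (workload : unit -> unit) : sample =
  Gc.set { (Gc.get ()) with Gc.space_overhead = space_overhead };
  let t0 = Sys.time () in
  workload ();
  let t1 = Sys.time () in
  { space_overhead;
    seconds = t1 -. t0;
    peak_words = (Gc.quick_stat ()).Gc.top_heap_words }

(* The trade-off curve (Pareto frontier) for one GC: samples sorted by
   memory, keeping a sample only if it beats the best time seen at lower
   (or equal) memory. Plotting this curve for both GCs gives the 2D view. *)
let frontier (samples : sample list) : sample list =
  let by_memory =
    List.sort
      (fun a b -> compare (a.peak_words, a.seconds) (b.peak_words, b.seconds))
      samples
  in
  let rec keep best_so_far = function
    | [] -> []
    | s :: rest when s.seconds < best_so_far -> s :: keep s.seconds rest
    | _ :: rest -> keep best_so_far rest
  in
  keep infinity by_memory

(* Fastest time a GC reaches without exceeding [budget] words of heap,
   or None if none of its measured settings fits in that budget. *)
let best_time_within ~budget (samples : sample list) : float option =
  List.fold_left
    (fun best s ->
      if s.peak_words > budget then best
      else
        match best with
        | None -> Some s.seconds
        | Some t -> Some (min t s.seconds))
    None samples
```

In practice you would run the program once per space_overhead value (for each GC), collect the resulting samples, and then compare best_time_within for both GCs at the same budget, rather than their default-setting times: that is the “same memory, who is faster?” comparison described above.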
Summary
- In the next six months, we hope to start merging most of the preparatory work, and a forward-compatible C API, into the upstream compiler distribution.
- Then in 2019 we hope to start merging the multicore runtime itself (independently of effect handlers), as a non-default experimental option. We will need people to review the codebase and become confident in their ability to edit it as well.
- This should allow much more extensive performance testing and the porting of some performance-sensitive codebases, so that we can get a better picture of the performance profile, of the difficulty of porting code, of the parts that may need to be reworked, etc.
We also discussed plenty of interesting applications of a multicore runtime (and of a typed effect system), as well as memory-model questions, formalization questions, etc. This is definitely an exciting time for the OCaml community!