Ocaml-multicore: report on a June 2018 development meeting in Paris

gasche · June 28, 2018, 12:01am

Earlier this week (24-25 June 2018) we had a meeting in Paris between several OCaml maintainers and researchers (folks from INRIA, OCamllabs, Jane Street, and also Frédéric Bour), and one of the things that were discussed is the technical state of the Multicore-OCaml project. I thought that people here could be interested in a status update on that. Note that it’s not an update on user features (no release date for Multicore-OCaml in this post, sorry!), but an update on technical development plans, so that people that follow the compiler distribution development on https://github.com/ocaml/ocaml/ ( or https://github.com/ocamllabs/ocaml-multicore/ ) have some context for what is coming next.

I’m going to concentrate here on the part that concerns the multicore runtime: garbage collection and low-level runtime code, summarizing the main points from my own notes of the meeting. Note that I’m not working on Multicore-OCaml myself, I’m reporting on work done by others. I certainly have made some mistakes below, and may update my post to fix some of those.

TL;DR: In 2019 we hope to integrate the multicore runtime into the upstream compiler distribution, even if it is not enabled by default (still experimental), so that it can evolve with the distribution and get the kind of testing and code porting experience we need to make decisions before a wider release.

Current state of the multicore runtime

KC Sivaramakrishnan ( @kayceesrk ) gave a presentation on the current state of the Multicore OCaml project. The slides are available. They are a good summary of the history and current state of the multicore runtime.

General milestone: migrate the multicore-ocaml runtime into the main distribution

Stephen Dolan ( @stedolan ) and KC reported that it is a lot of work for them to keep up with changes in the OCaml compiler implementation (they just recently rebased their branch on top of 4.06, it was 4.02 before that). When someone changes something in the OCaml compiler or runtime, they check that the rest of the distribution works properly or fix/adapt what is affected by the change; those changes are not tested against the multicore runtime, and Stephen and KC have to do all the fix/adapt work for all changes themselves.

To help with this problem, the consensus is that we should upstream the multicore-runtime code into the compiler distribution soon-ish, not enabled by default, so that people that change the compiler codebase immediately see the effect on the multicore runtime, and can participate in fixing and adapting the code with respect to their change – with guidance from multicore experts. For a period of time, the multicore runtime would only be available as an experimental option.

We hope that this migration could be done in about a year.

Q3/Q4 2018: build-up PRs

There are some PRs that change part of the current compiler and runtime to play nicer with the multicore bits, but are not part of the multicore runtime themselves. The plan is to get these submitted as PRs during the next development cycle (for 4.08, although we don’t know whether they will be merged by the 4.08 release). This is not a new process: some such PRs have already been integrated in the last few months/years (see for example GPR#1073, GPR#1723), but the hope is to get as many of those preliminary changes in as possible.

Q3/Q4 2018 (?): forward-compatible C API

We know that Multicore OCaml will require some changes to the way C stubs are written for OCaml. Of course, we cannot expect authors of C bindings to switch overnight, so there will be some transition period where existing C bindings won’t be able to use the multicore runtime and will have to be ported. On the other hand, the current runtime should be able to support the multicore-friendly C API, so that ported code can work on both runtimes.

We agreed that it would be useful to have this “multicore-friendly C API” be available as soon as possible, or at least part of it, even before the multicore-runtime itself lands, so that people can already start making their code forward-compatible. Stephen already tried to do this in a giant PR GPR#1003 last year, and that was too big and never quite finished.

My understanding is that the multicore-OCaml people are still not completely sure what the final API will be, which things definitely have to be broken and which things might be supported with more work on their side, and hesitated to push changes that could end up unnecessary. We agreed to try again, with smaller API changes submitted separately from each other, instead of a giant single change; and agreed in principle that it was OK to propose new APIs that had a small chance of not being necessary in the end, as long as the API is sensible and can be easily implemented on top of the current runtime. (Early adopters might face a bit of code churn as things stabilize.)

(One first change that we want to look at is GPR#1798, which implements a notion of C-API versioning.)

2019: multicore runtime features, as an experimental runtime

Then the plan is to start merging the multicore runtime codebase itself, piece by piece, so that it becomes possible to perform larger-scale experiments with it. It still wouldn’t be enabled by default at this point, but it would be part of the actively moving compiler distribution, and in particular remain at feature parity with the rest of the compiler and runtime codebase.

In this phase we’ll need people to review the multicore runtime implementation, if only to help future upstream maintenance. We have started to ask around who would be potentially interested – hopefully with a related skill set. The project also has a laundry list of pending tasks that could also be worked on by the people studying the codebase.

Some of the tasks being worked on involve implementing parts of the OCaml runtime systems that are not yet fully supported by the multicore runtime, such as Ephemerons (a generalization of weak pointers). My understanding is that the multicore devs would like to reach feature parity before merging the runtime code, but this may be re-discussed and changed if some parts of the runtime prove too difficult to support.

remark: the runtime/language split

The current multicore-ocaml fork/switch contains both the multicore runtime, and an implementation of (untyped) effect handlers in the surface language, as the way for users to access concurrency features (to control the fibers / green-threads). Effect handlers come in evolving proposals of their own, there is a type-and-effect system under work by Leo White, and they are being discussed as well, in a somewhat independent way. Bundling the two changes in the same patchset makes reviewing more difficult, and it also created some silly technical issues: because effect handlers change the language AST, most ppx-extension code is broken on the multicore-OCaml fork, which makes it difficult to use language tooling, to test user programs, run interesting benchmarks, etc.

In the short term the plan for upstreaming the runtime is to separate it from the effect-handler part, by exposing an extremely minimal fiber-control API, as compiler primitives or as part of the Obj runtime. That is not how anybody wants end-users to access the multicore runtime, but it would be a minimal device for the first period of runtime code upstreaming and reviewing, to make it easy to compile any codebase against the multicore-aware compiler, and use the standard OCaml packages and tooling in a multicore switch.

remark: performance tuning, not yet

Right now the multicore-OCaml devs, if I understand correctly, have been mostly working with micro-benchmarks, in large part because of the difficulty of using regular OCaml packages and tooling previously mentioned. A lot of opportunities (and necessity) for performance tuning will appear once macro-benchmarks and realistic workloads become available, and once some of the larger performance-sensitive codebases (which often include some C bindings or compiler-sensitive Obj hackery) have been ported. As Anil Madhavapeddy (@avsm) pointed out, once more code out there can be benchmarked against the multicore runtime we should start continuously monitoring the performance results.

The general expectation is that the multicore runtime will be slower for purely-sequential programs than the current runtime, but the goal is to keep this overhead small (a first goal that was mentioned was a 10% overhead, although we really don’t know yet how easy/hard that target is). The two distinct runtimes may remain available in the distribution for as long as there are enough users asking for the availability of the sequential runtime, and the overhead is high enough to justify the maintenance costs of keeping both. (In terms of the multicore-runtime performance on sequential workloads, some things can be made faster at the cost of being harder to write and possibly more painful to maintain, so there are tradeoffs still to be explored there.)

One thing I found interesting that Stephen explained to me is: you cannot just take a sequential program (say Coq), compile and run it under a multicore switch, and expect to get a meaningful “overhead number” (as in: “the multicore runtime is X% slower than the sequential one on this program”). The problem is that GCs can be configured to have more or less memory overhead – asking for less memory overhead results in more GC work, so a slower overall program. It doesn’t make sense to only compare the default settings of two GCs for time, as they may have very different memory-overhead profile: maybe the second GC looks faster, but if you adjust its settings to use no more memory than the first it actually is slower. What you have to do instead is to try to plot the 2D time/memory compromise, and compare the graphs for the two GCs, or at least compare the entire plot of the new GC with the current results of your current GC.
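As a minimal sketch of how such a time/memory curve could be sampled for one runtime, assume the knob being varied is `space_overhead` from the standard `Gc` module. The `workload` function, the `point` record and the chosen overhead values are hypothetical placeholders, not a benchmark used by the multicore developers; a serious measurement would run each setting in a fresh process and on a realistic program.

```ocaml
(* Sample (space_overhead, time, heap size) points for a single runtime.
   [workload] is a hypothetical stand-in for a real sequential program. *)

let workload () =
  (* Allocate many short-lived lists and keep a few of them alive. *)
  let keep = ref [] in
  for i = 1 to 200 do
    let l = List.init 1_000 (fun j -> (i, j)) in
    if i mod 10 = 0 then keep := l :: !keep
  done;
  List.length !keep

type point = { overhead : int; seconds : float; heap : int }

let measure overhead =
  let saved = Gc.get () in
  (* Lower space_overhead means a more eager major GC: less memory, more work. *)
  Gc.set { saved with Gc.space_overhead = overhead };
  Gc.full_major ();
  let t0 = Sys.time () in
  ignore (workload ());
  let seconds = Sys.time () -. t0 in
  (* Major heap size in words after the run: a rough proxy for memory use. *)
  let heap = (Gc.quick_stat ()).Gc.heap_words in
  Gc.set saved;
  { overhead; seconds; heap }

(* One point per setting; plotting seconds against heap gives the curve. *)
let curve = List.map measure [ 20; 80; 200; 400 ]

let () =
  List.iter
    (fun p ->
      Printf.printf "space_overhead=%d time=%.4fs heap_words=%d\n"
        p.overhead p.seconds p.heap)
    curve
```

Producing the same list of points under each runtime and comparing the two curves, rather than two single default-setting runs, is the comparison described above.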

Summary

In the next six months, we hope to start merging most of the preparatory work, and a forward-compatible C API, into the upstream compiler distribution.
Then in 2019 we hope to start merging the multicore runtime itself (independently of effect handlers), as a non-default experimental option. We will need people to review the codebase and feel more confident in their ability to also edit it.
This should allow much more extensive performance testing, and the porting of some performance-sensitive codebases, so that we can get a better picture of the performance profile, of the difficulty to port code, potential parts that need to be reworked, etc.

Plenty of interesting applications of a multicore runtime (and of a typed effect system) have also been discussed, interesting memory-model questions, formalization questions, etc. This is definitely an interesting time for the OCaml community!

ejgallego · February 26, 2019, 3:58pm

Dear Gabriel, thank you very much for the great summary.

Do we have any update on how the merging plan is going?

I was able to compile Coq with the multicore branch [tho lack of support for Thread is an issue for running], and indeed we are very interested to see what the roadmap is, as IMHO Coq could greatly benefit from multicore support.

gasche · February 26, 2019, 4:28pm

Things have been progressing, but slower than planned (as always). I haven’t heard anything specific/pointed back from the multicore-ocaml devs myself, but from the upstream perspective it looks like we are still in the phase “buildup work in the compiler before the core runtime changes” – planned Q3/Q4 2018. Recent changes submitted by Stephen include GPR#1725 and GPR#1917.

sid · March 19, 2019, 8:28am

Thanks for the update above @gasche

Are there any risks for the merge of multicore into the main OCaml trunk? I have been generally following the discussions on github and there seem to be interesting discussions on various tickets that are tagged as multicore-prerequisite (and other multicore related PRs). Seeing these discussions makes me wonder if there could be big bumps along the way in actually getting this work merged into trunk.

Or is the multicore technical approach generally accepted by the OCaml compiler maintainers and it is a matter of fleshing out the details? Is there a possibility that the maintainers might just find certain technical decisions made by the multicore team unpalatable?

OCaml with multicore would be a potentially amazing platform to build on. Haskell, though “multicore” for many years, has many weaknesses of its own (laziness, over-abstraction, complexity, etc.). Golang OTOH does not have enough abstraction and is quite low level and imperative. The JVM has its own problems. Here OCaml is likely to hit the sweet spot.

(And yes, I’m aware of LWT in the meanwhile but I find it quite low level compared to similar stuff that can be done in Haskell).

gasche · March 19, 2019, 10:04am

I’ll reply in the best way I can, but please keep in mind that I haven’t been involved personally in the ocaml-multicore work (although I did help review and integrate some of the multicore-prerequisite changes).

Are there any risks for the merge of multicore into the main OCaml trunk?

We don’t yet have conclusive performance numbers on the overhead of the multicore runtime for single-threaded computations. (A lot of progress has been made on a benchmarking infrastructure to measure this.) We also don’t have full visibility on the impact, ecosystem-wide, of the changes to the C FFI. If the overhead is high, or if the low-level changes incur too much breakage, this means that most people, who do not have a strong need for multicore usage in their programs, could keep using the non-multicore runtime for the time being. (Even if the runtime is merged upstream, there is a risk of it remaining a rarely-used option, or even that it would not be maintained on par with the main runtime.)

(Another risk would be that the people working on the multicore runtime today would move to something else before the merge is finished. For now it seems that they will keep getting funded at appropriate levels for this to not happen, which is very fortunate indeed. Jane Street and OCamllabs have been extremely helpful in funding the work so far, even though they may not have had much actual business-based motivation for a multicore runtime when the work started.)

Or is the multicore technical approach generally accepted by the OCaml compiler maintainers and it is a matter of fleshing out the details?

Yes, this is my understanding. The various technical decisions have not been reviewed in depth yet, but there is agreement on the general design. (Earlier work on multicore runtimes for ocaml, such as ocaml4multicore, did not get to that point of general consensus.)

(And yes, I’m aware of LWT in the meanwhile but I find it quite low level compared to similar stuff that can be done in Haskell).

Well it may also be possible to build higher-level abstractions on top of Lwt/Async to express what it is that you want to express – for example transactional operations, if that is your thing. (But that does not replace the interest of parallel computations.)

XVilka · June 17, 2019, 5:22am

Were there any recent developments regarding the project?

sid · June 17, 2019, 7:13am

Multicore seems to be progressing – if only at a pace much slower than what I hoped.

There is a new (important, foundational) PR related to multicore:

github.com/ocaml/ocaml

Move C global variables to a dedicated structure

ocaml:trunk ← kayceesrk:r14-globals

opened 12:30PM - 04 Jun 19 UTC

kayceesrk

+4361 -3360

This PR moves C global variables in the runtime to a dedicated structure, a "domain state" table. This PR is a pre-requisite for the multicore runtime, where each domain (a parallel thread of execution) will need to maintain its own table of domain local variables. This PR does not introduce any API changes.

In order to allow fast access to the domain state, we steal the exception pointer register (r14 on amd64) and make it point to the domain state table. The exception pointer is now a field in the domain state table. Convenience macros are introduced to access the domain state fields in the runtime and in the backend of the compiler.

The PR is syntactically fairly invasive. But the semantics changes are fewer.

## Status

~~Currently, only exception pointer, young_pointer, and young_limits are in the domain state. The only tested and working architecture is linux on amd64. The build is expected to fail on every other configuration.~~

- [x] Introduce domain state and steal exception pointer register
- [x] Move `young_ptr` and `young_limit` to domain state
- [x] Move multicore relevant native runtime C globals to the domain state
- [x] Move multicore relevant bytecode runtime C global to the domain state
- [x] Support Linux x86-64
- [x] Support macOS x86-64
- [x] Support Windows cygwin x86/x86-64
- [x] Support Windows mingw x86-64
- [x] Support FreeBSD x86-64
- [x] Support Linux x86
- [x] Support Windows x86
- [x] Support Linux arm64
- [x] Support Linux arm32
- [x] Support Linux Power64
- [x] Support Linux IBM Z
- [x] Support Windows min x86-64
- [x] Support Windows Native (Microsoft Visual C/C++) x86-64

## Changes

### Exceptions

On amd64, exception pointers are cached in the r14 register. In this code snippet:

```ocaml
exception Foo
let foo () = try () with Foo -> ()
let bar () = raise Foo
```

the diff in the push trap and pop trap compilation is:

```diff
 .L103:
-        .cfi_adjust_cfa_offset 16
-        subq    $16, %rsp
-        movq    %r14, (%rsp)
-        leaq    .L102(%rip), %r14
-        movq    %r14, 8(%rsp)
-        movq    %rsp, %r14
+        leaq    .L102(%rip), %r11
+        .cfi_adjust_cfa_offset 8
+        pushq   %r11
+        .cfi_adjust_cfa_offset 8
+        pushq   8(%r14)
+        movq    %rsp, 8(%r14)
         movq    $1, %rax
-        popq    %r14
+        popq    8(%r14)
         .cfi_adjust_cfa_offset -8
         addq    $8, %rsp
         .cfi_adjust_cfa_offset -8
```

The exception pointer is stored at `8(%r14)`. The diff in the raise code is as expected:

```diff
 .L104:
         movq    camlHello@GOTPCREL(%rip), %rax
         movq    (%rax), %rax
-        movq    %r14, %rsp
-        popq    %r14
+        movq    8(%r14), %rsp
+        popq    8(%r14)
         popq    %r11
         jmp     *%r11
         .cfi_endproc
```

### Allocations

Inlined allocations sequences are shorter since the `young_limit` is now in the domain state. The diff in the allocation sequence in the code:

```ocaml
let baz x = (x,x)
```

is

```diff
         movq    %rax, %rbx
 .L101:
         subq    $24, %r15
-        movq    caml_young_limit@GOTPCREL(%rip), %rax
-        cmpq    (%rax), %r15
+        cmpq    (%r14), %r15
         jb      .L102
         leaq    8(%r15), %rax
         movq    $2048, -8(%rax)
```

`young_limit` is stored at `0(%r14)`.

### FFI

There is no need to cache and flush the exception pointer when calling C functions. The fact that the exception pointer is not available as a global variable means that functions in `amd64.S` such as `caml_start_program`, callbacks, functions for raising exceptions will need the domain state as an extra argument. This change is internal only and the C API remains the same.

## Performance

The goal is to retain the performance with respect to trunk.
Preliminary performance numbers are available for [micro](http://bench.ocamllabs.io/comparison/?exe=14%2BL%2Br14-globals&ben=1%2C2%2C110%2C111%2C5%2C6%2C7%2C9%2C8%2C10%2C12%2C11%2C15%2C16%2C17%2C13%2C14%2C20%2C21%2C22%2C18%2C19%2C25%2C26%2C27%2C23%2C24%2C30%2C31%2C32%2C28%2C29%2C33%2C152%2C151%2C37%2C112%2C113%2C114%2C115%2C116%2C117%2C118%2C119%2C120%2C122%2C121%2C123%2C124%2C125%2C126%2C127%2C54%2C154%2C155%2C153%2C157%2C158%2C156%2C161%2C162%2C163%2C164%2C159%2C160%2C167%2C168%2C169%2C170%2C165%2C166%2C172%2C171%2C174%2C173%2C86&env=2&hor=true&bas=12%2BL%2Btrunk&chart=normal+bars) and [macro](http://bench2.ocamllabs.io/comparison/?exe=10%2BL%2Br14-globals&ben=1%2C2%2C130%2C131%2C3%2C4%2C5%2C6%2C7%2C8%2C9%2C10%2C11%2C12%2C13%2C75%2C76%2C77%2C78%2C79%2C80%2C81%2C82%2C83%2C84%2C132%2C133%2C14%2C15%2C16%2C85%2C86%2C87%2C88%2C17%2C18%2C19%2C20%2C21%2C22%2C23%2C24%2C25%2C26%2C89%2C90%2C91%2C92%2C93%2C94%2C27%2C28%2C29%2C30%2C31%2C32%2C33%2C34%2C35%2C36%2C37%2C38%2C39%2C40%2C41%2C42%2C95%2C43%2C44%2C45%2C46%2C47%2C48%2C49%2C50%2C51%2C52%2C53%2C54%2C55%2C96%2C97%2C98%2C99%2C56%2C100%2C57%2C58%2C59%2C60%2C61%2C101%2C62%2C63%2C64%2C65%2C66%2C102%2C67%2C68%2C69%2C70%2C129%2C103%2C104%2C105%2C106%2C107%2C108%2C71%2C72%2C74%2C109%2C110%2C111%2C112%2C113%2C114%2C115%2C116%2C117%2C118%2C119%2C120%2C121%2C122%2C123%2C124%2C125%2C126%2C127%2C128%2C73&env=3&hor=true&bas=9%2BL%2Btrunk&chart=normal+bars) benchmarks. There are a few outliers (in either direction) which I will investigate. And a few, which look like spurious runs (`sequence_cps` in the macrobenchmark).

## Reviewing

It is instructive to start the review at `domain_state.tbl` which introduces the domain state table. The OCaml and C macros for accessing domain state are generated in `utils/domainstate.ml*` and `runtime/caml/domain_state.h`. The interesting changes are in the files `asmcomp/amd64/emit.mlp` and `runtime/amd64.S`.
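To make the idea of the PR concrete, here is a toy OCaml model (not the runtime's C or assembly code) of replacing process-wide globals by a per-domain state table. Every name in it (`domain_state`, `create_domain_state`, `alloc`, `minor_heap_size`) is hypothetical and chosen for illustration; the real table is the one defined in `domain_state.tbl`, and the "minor collection" here is simulated by resetting the pointer.

```ocaml
(* Toy model: instead of process-wide globals such as
     let young_ptr = ref 0  and  let young_limit = ref 0
   each domain owns one record, and code reaches the fields through it
   (the real runtime keeps a pointer to the table in r14 on amd64). *)

type domain_state = {
  mutable young_limit : int;       (* bound that young_ptr must not cross *)
  mutable young_ptr : int;         (* allocation pointer, moves downwards *)
  mutable minor_collections : int; (* simulated minor GC count *)
}

let minor_heap_size = 256

let create_domain_state () =
  { young_limit = 0; young_ptr = minor_heap_size; minor_collections = 0 }

(* Bump-pointer allocation in the spirit of the inlined sequence shown in
   the diff: check the new pointer against young_limit, and "collect"
   when the minor heap is exhausted. Assumes words <= minor_heap_size. *)
let alloc st words =
  if st.young_ptr - words < st.young_limit then begin
    (* Simulated minor collection: the minor heap becomes empty again. *)
    st.minor_collections <- st.minor_collections + 1;
    st.young_ptr <- minor_heap_size
  end;
  st.young_ptr <- st.young_ptr - words;
  st.young_ptr

let () =
  (* Two domain states evolve independently; no global is shared. *)
  let d0 = create_domain_state () and d1 = create_domain_state () in
  for _ = 1 to 100 do ignore (alloc d0 3) done;
  ignore (alloc d1 3);
  Printf.printf "d0: %d minor collections, d1: %d\n"
    d0.minor_collections d1.minor_collections
```

Because every field lives in the record rather than in a global, a second domain only needs its own `domain_state` value, which is the property the multicore runtime relies on.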

It would be interesting to see how OCaml core devs react to this upcoming PR. If it gets merged quickly it would be a sign that things are moving along well.

I am not an insider to the multicore effort at all. Nor am I familiar with OCaml compiler internals. I can only surmise what is visible externally.

It would be awesome if people “in the know” kept giving us more frequent updates on this. Multicore is something that I believe can further unlock the potential of OCaml. There are a lot of people waiting and hoping for this to happen.

A statement from the OCaml community leaders ( @xavierleroy and others ) that they want to make this happen by date XYZ is something that might rally the community around this goal. Witness the speech by John F. Kennedy who said:

[The US] should commit itself to achieving the goal, before this decade is out, of landing a man on the Moon and returning him safely to the Earth. [1]

[1] We choose to go to the Moon - Wikipedia

This comparison is possibly cringy and multicore is not like going to the moon but goals can have a remarkably beneficial effect.

gasche · June 17, 2019, 10:07am

We had another development meeting at the end of April where Stephen Dolan was invited to give a progress report on Multicore. Things have been progressing, although of course at a slower pace than anticipated (personally I’m not terribly surprised given the complexity of the whole thing, but that’s what you get for announcing more specific time periods :-). In terms of the original document we are still in the “build-up PRs” and “forward-compatible C API” phase, and things are moving along nicely.

One interesting recent development is the buildup of a comprehensive runtime-benchmarking tool (the not-terribly-easy-to-use interface is at http://ocamllabs.io/multicore/), which makes it possible to get concrete numbers on the performance overhead introduced by the runtime changes. The numbers are not final in any way yet (there is a lot of room for tuning), but it helps evaluate design choices and in fact I understand that the multicore authors have started exploring some alternative choices now that they have numbers to compare options concretely. (Takes time, but gives a stronger implementation overall.)

On the social front: the overall consensus that we want to merge the multicore runtime still stands, there are no worries to be had about that. I think it would be rather foolish to make a statement about this “happening by date XYZ” given the low likelihood of getting such a date right. On the other hand, if you want to help, please feel free to help reviewing any of the open PRs, and/or have a look at the Multicore Roadmap which has lists of tasks still to be done.

jhw · June 17, 2019, 2:35pm

Indeed, as an outsider who has tried to offer help, I would add that there are a lot of opportunities for people to help that do not involve hacking on the compiler internals. There are a lot of packages in the community OPAM repository that need quite doable patches to make them provisionally functional on the ocaml-multicore branch so as to enable wider benchmark testing. Additionally, my experience is that doing that kind of work can surface problems that need to be filed as issue reports against ocaml-multicore.

Perhaps the ocaml-multicore team would comment on my observation above with additional details about things people can do to pitch in.

sid · August 27, 2019, 1:20pm

As a small step towards multicore, it’s interesting to note that https://github.com/ocaml/ocaml/pull/8713 just got merged to master!

jhw · August 27, 2019, 5:41pm

Now if only #1128 can be cleared for landing.
