Welcome to the February 2020 news update from the Multicore OCaml team, spread across the UK, India, France and Switzerland! This follows on from last month’s update, and has been put together by @shakthimaan and @kayceesrk.
The release of OCaml 4.10.0 has successfully pushed some prerequisite features into the upstream compiler. Our work in February has focussed on getting the Multicore OCaml branch “feature complete” with respect to the full OCaml language, and on extensive benchmarking and stress testing of our two minor heap implementations.
To this end, a number of significant patches that essentially provide complete coverage of the language features have been merged into the Multicore OCaml trees. We encourage you to test them for regressions, and to report shortcomings or contribute improvements. There are also ongoing OCaml PRs and issues under review, which we hope to complete within the 4.11 release cycle. A new set of parallel benchmarks has been added to our Sandmark benchmarking suite (live instance here), along with enhancements to the build setup.
Multicore OCaml
Completed
The following PRs have been merged into Multicore OCaml:
A Forcing_tag is now used in the implementation of lazy values to fix a concurrency bug. It behaves like a lock bit: if a mutator on another domain attempts to force a lazy value that is already being forced, an exception is raised on that domain. A sketch of this behaviour follows the list below.
A preliminary version of safe points has been merged into the Multicore OCaml trees. ocaml-multicore/ocaml-multicore#187 also contains more discussion and background about how coverage can be improved in future PRs.
This PR implements an “opportunistic work credit”, which forms the basis for performing mark and sweep work while waiting to synchronise with other domains.
This PR closes the regression in the chameneos_redux_lwt benchmark in Sandmark by using intnat to avoid sign extensions, and cleans up write_barrier to improve overall performance.
This PR updates the sweep work units so that they are all expressed in words, to handle the differences between the budgets for setup, for sweeping, and for large block allocations.
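As a minimal sketch of the race that the Forcing_tag above guards against (assuming the spawn/join Domain API on the multicore branch; the exact exception raised by the losing domain is an implementation detail):

```ocaml
(* Two domains race to force the same lazy value. The Forcing_tag
   acts as a lock bit, so a domain that loses the race observes an
   exception rather than a partially-forced value. *)
let l = lazy (List.init 1_000_000 (fun i -> i) |> List.fold_left ( + ) 0)

let force_it () =
  try Printf.printf "forced: %d\n" (Lazy.force l)
  with e -> Printf.printf "raised: %s\n" (Printexc.to_string e)

let () =
  let d = Domain.spawn force_it in
  force_it ();
  Domain.join d
```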
Ongoing
A lot of work is ongoing on the implementation of a synchronised minor garbage collector for Multicore OCaml, including benchmarking of the stop-the-world (stw) branch. We are assembling a comprehensive evaluation of the multicore runtime against the mainline runtime, and will publish the results in a future update.
Benchmarking
Sandmark now has support for running parallel benchmarks. We can also now obtain GC latency measurements for both the stock OCaml and Multicore OCaml compilers.
This PR processes the runtime log and produces a .bench file that captures the GC pause times. It works with both stock OCaml and Multicore OCaml.
A benchmark that measures Irmin’s merge capabilities, with Git as its filesystem backend, is being exercised with different read and write rates.
A number of other parallel benchmarks, such as merge sort, the Floyd-Warshall algorithm, prime number generation, parallel map, and parallel filter, have been added to Sandmark (see the sketch below).
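Below is a minimal sketch (not the actual Sandmark sources) of what a parallel map style benchmark looks like when written against domainslib’s Task API; the exact labels and signatures vary across domainslib versions:

```ocaml
(* Square a large array in parallel using a domainslib task pool. *)
let num_domains = 4
let n = 1_000_000

let () =
  let open Domainslib in
  (* The pool spawns additional domains alongside the current one. *)
  let pool = Task.setup_pool ~num_domains:(num_domains - 1) () in
  let a = Array.init n (fun i -> i) in
  Task.run pool (fun () ->
      Task.parallel_for pool ~start:0 ~finish:(n - 1)
        ~body:(fun i -> a.(i) <- a.(i) * a.(i)));
  Task.teardown_pool pool;
  Printf.printf "a.(7) = %d\n" a.(7)
```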
Documentation
Examples using domainslib and the Domain module are currently being worked on for a chapter on Parallel Programming for Multicore OCaml; the sketch below gives a flavour. We will release an early draft to the community for your feedback.
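For instance, here is a minimal sketch of the basic spawn/join pattern that the chapter covers; signatures on the multicore branch may differ slightly:

```ocaml
(* Run two computations on separate domains and join their results. *)
let fib n =
  let rec go n = if n < 2 then n else go (n - 1) + go (n - 2) in
  go n

let () =
  let d1 = Domain.spawn (fun () -> fib 35) in
  let d2 = Domain.spawn (fun () -> fib 36) in
  Printf.printf "fib 35 = %d, fib 36 = %d\n" (Domain.join d1) (Domain.join d2)
```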
OCaml
One PR was opened against upstream OCaml this month, which fixes up the marshalling scheme to be multicore compatible. The complete set of upstream multicore prerequisites is labelled in the compiler issue tracker.
Since benchmarking is listed as next on the agenda, I just want to make sure you’ve seen “Performance Matters” and that you’re aware of the Coz profiler.
Hi Anil (or anyone!). Is there a place where I can find out more about the breaking changes that might be made to C extensions? As you may know, we have a lot of C code which interfaces with OCaml, both as ordinary extensions written in C and, more rarely, as OCaml embedded in C programs, and I’d like a heads up about what’s likely to change.
Hi @rwmjones! In a nutshell: no breaking C changes. The longer version is that we implemented two different minor collectors in order to evaluate various tradeoffs systematically:
a concurrent minor collector that requires a read barrier and some C API changes in order to create more safe points
a stop-the-world minor collector that requires neither a read barrier nor extra C API changes, but would probably cause longer pauses
The good news is that our STW collector scales up much better than we expected (tested up to 24 cores), and so our first domains patchset will almost certainly use that version now. We expect to shift to a concurrent (and possibly pauseless) collection algorithm at some future point, but in terms of upstreaming, it looks like we should be able to delay any C API changes until after the first version of multicore has landed.
Do you have any nice standalone candidate programs using the C FFI we could add to Sandmark?
Just to clarify, “no breaking C changes” means none other than needing to be compatible with no-naked-pointers, right? In particular, C bindings that return pointers allocated with malloc as values, without wrapping them in a block on the OCaml heap, still need to change before their clients can be used in the multicore world.
Off-topic for the current thread, but I just wanted to mention that there is a simple alternative to this trick which does not involve wrapping the pointers in custom blocks: since malloc returns aligned pointers, one can set the lsb to “1” and pass them to the OCaml side as an OCaml “int”. The original pointer is recovered by setting the lsb back to “0”. The cost is negligible, and the result is fully compatible with the GC and no-naked-pointers. A sketch is given below.
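To illustrate (a sketch only, with hypothetical names): the C stub would return `(value)((uintptr_t)p | 1)` and recover the pointer with `(void *)((uintptr_t)v & ~(uintptr_t)1)`, while the OCaml side treats the handle as an opaque immediate:

```ocaml
(* Hypothetical OCaml-side view of the low-bit tagging trick.
   A [handle] is really a malloc'd pointer with its lsb set to 1,
   so the GC treats it as an ordinary immediate (an OCaml int). *)
type handle (* abstract; an off-heap pointer disguised as an int *)

external buffer_alloc : int -> handle = "buffer_alloc_stub"
external buffer_set : handle -> int -> char -> unit = "buffer_set_stub"
external buffer_free : handle -> unit = "buffer_free_stub"
```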
I think discussing breaking changes brought by multicore and their workarounds is reasonably on-topic.
I believe that the advice to use the lsb to represent off-heap pointers is currently incorrect for some uses: OCaml does not understand how to dereference such pointers, so the scheme is unsuitable for applications in the style of Ancient. One could imagine the OCaml compiler adding an instruction to clear the lsb before every dereference of a value, but it is unclear that the runtime cost would be lower than that of the page table.
I have read that unaligned memory accesses are efficient in modern Intel processors, so maybe one could also try your tip with unaligned pointers instead of encoding aligned pointers.
I am also curious about the answer to @jjb’s question. What are the plans for the page table in multicore? For instance, Coq has been adapted to support the no-naked-pointers mode, but it still relies on the page table in that mode.
We need to finish the transition to making no-naked-pointers mode the default in trunk OCaml. That transition has stalled a little, with no-naked-pointers remaining only a configuration option over the past few releases.
My question is about plans for the page table (or similar devices to recognise off-heap pointers) in multicore, e.g. to support off-heap allocation in the style of Ancient.
We don’t plan to support naked pointers in multicore, so there will be no page table. Every object that is allocated outside of the heap should have a header with colour Black. With multicore, maintaining a correct and efficient mutable page table could add unnecessary overheads in the concurrent GC: looking up the page table while marking would require synchronization, whereas marking currently involves no atomic operations. So we will make no-naked-pointers the default in trunk OCaml to allow users to adapt their code to the new norm.
Thanks - sorry for the delayed reply, I thought I’d set up my account here to send me email notifications, but apparently not.
I believe we have eradicated or mostly eradicated all naked pointers, and if we haven’t it’s an easy job to remove any remaining cases.
While we have quite a lot of C code (a quick count initially showed 13,000 lines, but a recount gives 46,000), it’s not readily available as nice separate self-contained libraries, except perhaps for these:
Note that for libnbd the C bindings are generated, so the only way to get at the bindings is to run ./configure && make.
Also of interest is this, where we embed OCaml code compiled as a shared library into a C program, which I don’t think anyone has addressed (unless, of course, there’s no difference from the other way around):