Multicore OCaml: Feb 2020 update

Welcome to the February 2020 news update from the Multicore OCaml team, spread across the UK, India, France and Switzerland! This follows on from last month’s update, and has been put together by @shakthimaan and @kayceesrk.

The release of OCaml 4.10.0 has successfully pushed out some prerequisite features into the upstream compiler. Our work in February has focussed on getting the multicore OCaml branch “feature complete” with respect to the complete OCaml language, and on extensive benchmarking and stress testing of our two minor heap implementations.

To this end, a number of significant patches have been merged into the Multicore OCaml trees that essentially provide complete coverage of the language features. We encourage you to test these trees for regressions, and to send us improvements or report shortcomings. There are ongoing OCaml PRs and issues that are also under review, and we hope to complete those for the 4.11 release cycle. A new set of parallel benchmarks has been added to our Sandmark benchmarking suite (live instance here), along with enhancements to the build setup.

Multicore OCaml

Completed

The following PRs have been merged into Multicore OCaml:

  • ocaml-multicore/ocaml-multicore#281
    Introduce Forcing_tag to fix concurrency bug with lazy values

    Forcing_tag is introduced in the implementation of lazy values to fix a concurrency bug. It behaves like a lock bit: any concurrent access by another mutator raises an exception on that domain.

  • ocaml-multicore/ocaml-multicore#282
    Safepoints

    A preliminary version of safe points has been merged into the Multicore OCaml trees. ocaml-multicore/ocaml-multicore#187 also contains more discussion and background about how coverage can be improved in future PRs.

  • ocaml-multicore/ocaml-multicore#285
    Introduce an ‘opportunistic’ major collection slice

    An “opportunistic work credit” is implemented in this PR which forms a basis for doing mark and sweep work while waiting to synchronise with other domains.

  • ocaml-multicore/ocaml-multicore#286
    Do fflush and variable args in caml_gc_log

    The caml_gc_log() function has been updated to ensure that fflush is invoked only when GC logging is enabled.

  • ocaml-multicore/ocaml-multicore#287
    Increase EVENT_BUF_SIZE

    When debugging with event trace data it is useful to flush the buffer less often, so EVENT_BUF_SIZE has been increased.

  • ocaml-multicore/ocaml-multicore#288
    Write barrier optimization

    This PR closes the regression in the chameneos_redux_lwt benchmark in Sandmark by using intnat to avoid sign extensions, and cleans up write_barrier to improve overall performance.

  • ocaml-multicore/ocaml-multicore#290
    Unify sweep budget to be in word size

    This PR expresses all sweep work units in words. This removes the differences between the budgets used for setup, for sweeping and for large allocations in blocks.
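
To illustrate the “locked bit” behaviour described for Forcing_tag in #281 above, here is a minimal C sketch. It is not the runtime's code: the state names, the lazy_cell struct, example_thunk and the force function are all hypothetical, and the FORCE_RACE return value stands in for the exception raised on the racing domain.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical states for illustration only; the real runtime
   encodes this information in block tags such as Forcing_tag. */
enum lazy_state { LAZY, FORCING, FORCED };

/* Stand-in for "raise an exception on that domain". */
enum force_status { FORCE_OK, FORCE_RACE };

struct lazy_cell {
    _Atomic int state;      /* one of enum lazy_state */
    int (*thunk)(void);     /* suspended computation */
    int result;             /* valid once state == FORCED */
};

static void lazy_init(struct lazy_cell *c, int (*thunk)(void)) {
    atomic_init(&c->state, LAZY);
    c->thunk = thunk;
    c->result = 0;
}

/* Hypothetical example computation. */
static int example_thunk(void) { return 6 * 7; }

static enum force_status force(struct lazy_cell *c, int *out) {
    int expected = LAZY;
    /* Only one caller can move the cell from LAZY to FORCING. */
    if (atomic_compare_exchange_strong(&c->state, &expected, FORCING)) {
        c->result = c->thunk();
        atomic_store(&c->state, FORCED);
        *out = c->result;
        return FORCE_OK;
    }
    if (expected == FORCED) {
        /* Already evaluated: just read the cached result. */
        *out = c->result;
        return FORCE_OK;
    }
    /* expected == FORCING: someone else holds the "lock bit". */
    return FORCE_RACE;
}
```

A caller that receives FORCE_RACE has observed the cell while another thread was still forcing it; in this sketch it is simply told so rather than blocked.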

Ongoing

  • A lot of work is ongoing on the implementation of a synchronised minor garbage collector for Multicore OCaml, including benchmarking of the stop-the-world (stw) branch. We will publish the results in a future update, as we are currently assembling a comprehensive evaluation of the runtime against the mainstream runtime.

Benchmarking

Sandmark now has support for running parallel benchmarks. We can also now collect GC latency measurements for both the stock OCaml and Multicore OCaml compilers.

  • ocaml-bench/sandmark#73
    More parallel benchmarks

    A number of parallel benchmarks such as N-body, Quick Sort and matrix multiplication have now been added to Sandmark!

  • ocaml-bench/sandmark#76
    Promote packages. Unbreak CI.

    The Continuous Integration build can now execute after updating and promoting packages in Sandmark.

  • ocaml-bench/sandmark#78
    Add support for collecting information about GC pausetimes on trunk

    This PR processes the runtime log and produces a .bench file that captures the GC pause times. It works on both stock OCaml and Multicore OCaml.

  • ocaml-bench/sandmark#86
    Read and write Irmin benchmark

    A benchmark that measures Irmin’s merge capabilities with Git as its filesystem is being tested with different read and write rates.

  • A number of other parallel benchmarks, such as Merge sort, Floyd-Warshall matrix, prime number generation, parallel map and filter, have been added to Sandmark.

Documentation

  • Examples using domainslib and modifying Domains are currently being worked on for a chapter on Parallel Programming for Multicore OCaml. We will release an early draft to the community for your feedback.

OCaml

One PR was opened against OCaml this month, which fixes up the marshalling scheme to be multicore compatible. The complete set of upstream multicore prerequisites is labelled in the compiler issue tracker.

  • ocaml/ocaml#9293 Use addrmap hash table for marshaling

    The hash table (addrmap) implementation from Multicore OCaml has been ported to upstream OCaml to avoid using GC mark bits to represent visitedness.

Acronyms

  • CTF: Common Trace Format
  • CI: Continuous Integration
  • GC: Garbage Collector
  • PR: Pull Request

As always, many thanks to our fellow OCaml developers and users who have reviewed our code, reported bugs or otherwise assisted this month.

Since benchmarking is being listed as next on the agenda I just want to make sure you’ve seen: “Performance Matters” and that you’re aware of the coz-profiler.

Hi Anil (or anyone!). Is there a place I can find more about breaking changes that might be made to C extensions? As you may know we have a lot of C code which interfaces with OCaml, both as ordinary extensions written in C, but also embedding OCaml in C programs (although that’s much more rare), and I’d like a heads up about what’s likely to change.

Hi @rwmjones! In a nutshell: no breaking C changes. The longer version is that we implemented two different minor collectors in order to evaluate various tradeoffs systematically:

  • a concurrent minor collector that requires a read barrier and some C API changes in order to create more safe points
  • a stop-the-world minor collector that requires neither a read barrier nor extra C API changes, but would probably cause longer pauses

The good news is that our STW collector scales up much better than we expected (tested to 24 cores), and so our first domains patchset will almost certainly use that version now. We expect to shift to a concurrent (and possibly pauseless) collection algorithm at some future point, but in terms of upstreaming it looks like we should be able to delay any C API changes until after the first version of multicore has landed.

Do you have any nice standalone candidate programs using the C FFI we could add to Sandmark?

Just to clarify, “no breaking C changes” means none other than needing to be able to use no-naked-pointers, right? In particular, C bindings that return pointers allocated with malloc as values, without wrapping them in a block on the OCaml heap, still need to be changed before their clients can be used in the multicore world.

Off-topic for the current thread, but just wanted to mention there is a simple alternative to this trick which does not involve wrapping the pointers in custom blocks: since malloc returns aligned pointers, one can set the lsb to “1” and send them over to the OCaml side as an OCaml “int”. The original pointer is recovered by setting the lsb back to “0”. The cost is negligible and the scheme is fully compatible with the GC and no-naked-pointers.
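
A minimal C sketch of this tagging trick, under stated assumptions: the value typedef below is a stand-in for the real type from <caml/mlvalues.h> so that the snippet compiles without the OCaml headers, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for OCaml's `value` type (assumption: a pointer-sized
   integer, as in <caml/mlvalues.h>). */
typedef intptr_t value;

/* Hide an aligned C pointer inside something the GC sees as an
   immediate integer: set the least significant bit. */
static value ptr_to_tagged(void *p) {
    /* malloc returns suitably aligned memory, so bit 0 is free. */
    assert(((uintptr_t)p & 1) == 0);
    return (value)((uintptr_t)p | 1);
}

/* Recover the original pointer by clearing the least significant bit. */
static void *tagged_to_ptr(value v) {
    return (void *)((uintptr_t)v & ~(uintptr_t)1);
}

/* The GC treats any word with bit 0 set as an integer and never
   follows it. */
static int looks_like_ocaml_int(value v) {
    return (v & 1) != 0;
}
```

On the OCaml side such a value can only be passed around opaquely and handed back to C; it must not be dereferenced there.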

I think discussing breaking changes brought by multicore and their workarounds is reasonably on-topic.

I believe that the advice to use the lsb to represent off-heap pointers is currently incorrect. OCaml does not understand how to dereference such pointers. It is therefore unsuitable for applications in the style of Ancient. One could imagine the OCaml compiler adding an instruction to clear the lsb before every dereference of a value, but it is unclear whether the runtime cost would be lower than that of the page table.

Indeed, this is only suitable for those cases where the pointers can be treated as opaque objects by the GC.

I have read that unaligned memory accesses are efficient in modern Intel processors, so maybe one could also try your tip with unaligned pointers instead of encoding aligned pointers.

I am also curious about the answer to @jjb’s question. What are the plans for the page table in multicore? For instance Coq has been adapted to support the no-naked-pointers mode, but still relies on the page table in that mode.

We need to finish the transition to no-naked-pointers mode as the default in trunk OCaml. It has stalled a little as a configuration option in the past few releases.

My question is about plans for the page table (or similar devices to recognise off-heap pointers) in multicore, e.g. to support off-heap allocation in the style of Ancient.

We don’t plan to support naked pointers in multicore. So no page tables. Every object that is allocated outside of the heap should have a header with colour Black. With multicore, maintaining a correct and efficient mutable page table may add unnecessary overhead to the concurrent GC (looking up the page table while marking would require synchronization, where no atomic operations are involved today). So we will make no-naked-pointers the default in trunk OCaml to allow users to adapt their code to the new norm.
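
As a rough illustration of “a header with colour Black”, here is a C sketch that builds such a header by hand. It assumes the stock OCaml 4.x header layout (word size in the upper bits, a 2-bit colour at bits 8 and 9, an 8-bit tag in the low bits, with black encoded as 3); the multicore runtime may encode colours differently, and real bindings should use the macros from <caml/mlvalues.h> rather than these hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t header_t;

/* Assumed from stock OCaml 4.x: colour lives in bits 8-9, black is 3. */
#define COLOR_BLACK  ((header_t)3 << 8)
/* Assumed tag for opaque data that the GC must not scan. */
#define ABSTRACT_TAG 251u

/* Build a header word: size in words, black colour, tag. */
static header_t make_black_header(uintptr_t wosize, unsigned tag) {
    return ((header_t)wosize << 10) | COLOR_BLACK | (header_t)(tag & 0xffu);
}

/* Hypothetical off-heap block with two opaque fields. */
struct off_heap_block {
    header_t header;
    intptr_t fields[2];
};

/* Initialise the block and return the address of its first field,
   which is where an OCaml value for this block would point. */
static intptr_t *init_off_heap_block(struct off_heap_block *b) {
    b->header = make_black_header(2, ABSTRACT_TAG);
    b->fields[0] = 0;
    b->fields[1] = 0;
    return b->fields;
}
```

Because the block is already marked black, a marking GC that reaches it sees an object it has nothing left to do with, which is why no page table lookup is needed to recognise it.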

Thanks - sorry for the delayed reply, I thought I’d set up my account here to send me email notifications, but apparently not.

I believe we have eradicated or mostly eradicated all naked pointers, and if we haven’t it’s an easy job to remove any remaining cases.

While we have quite a lot of C code (a quick count shows 13,000 edit: actually 46,000 lines), it’s not easily available in nice separate self-contained libraries, except perhaps for these:



http://git.annexia.org/?p=ocaml-augeas.git
https://libvirt.org/git/?p=libvirt-ocaml.git

Note that for libnbd the C bindings are generated so the only way to get to the bindings is to ./configure && make

Also of interest is this where we embed OCaml code compiled as a shared library into a C program, which I don’t think anyone has addressed (unless there’s no difference from the other way around of course):