Multicore OCaml: February 2021

Welcome to the February 2021 Multicore OCaml monthly report. This update along with the previous update’s have been compiled by me, @kayceesrk and @shakthimaan. February has seen us focus heavily on stability in the multicore trees, as unlocking the ecosystem builds and running bulk CI has given us a wealth of issues to help chase down corner case issues. The work on upstreaming the next hunk of changes to OCaml 4.13 is also making great progress.

Overall, we remain on track to have a parallel-capable multicore runtime (versioned 5.0) after the next release of OCaml (4.13.0), although the exact release details have yet to be ratified in a core OCaml developers meeting. Excitingly, we have also made significant progress on concurrency, and there are details below of a new paper on that topic.

4.12.0: released with multicore-relevant changes

OCaml 4.12.0 has been released with a large number of internal changes required for multicore OCaml such as GC colours handling, the removal of the page table and modifications to the heap representations.

From a developer perspective, there is now a new configure option called the nnpchecker which dynamically instruments the runtime to help you spot the use of unboxed C pointers in your bindings. This was described here earlier against 4.10, but it is now also live on the opam repository CI. From now on, new opam package submissions will alert you with a failing test if naked pointers are detected in your opam package test suite. Please do try to include tests in your opam package to gain the benefits of this!

The screenshot below shows this working on the LLVM package (which is known to have naked pointers at present).

4.13~dev: upstreaming progress

Our PR queue for the 4.13 release is largely centred around the integration of “safe points”, which provide stronger guarantees that the OCaml mutator will poll the garbage collector regularly even when the application logic isn’t allocating regularly. This work began almost three years ago in the multicore OCaml trees, and is now under code review in upstream OCaml – please do chip in with any performance or code size tests on that PR.

Aside from this, the team is working various other pre-requisites such as a multicore-safe Lazy, implementing the memory model (explained in this PLDI 18 paper) and adapting the ephemeron API to be more parallel-friendly. It is not yet clear which of these will get into 4.13, and which will be put straight into the 5.0 trees yet.

post OCaml 5.0: concurrency and fibres

We are very happy to share a new preprint on “Retrofitting Effect Handlers onto OCaml”, which continues our “retrofitting” series to cover the elements of concurrency necessary to express interleavings in OCaml code. This has been conditionally accepted to appear (virtually) at PLDI 2021, and we are currently working on the camera ready version. Any feedback would be most welcome to @kayceesrk or myself. The abstract is below:

Effect handlers have been gathering momentum as a mechanism for modular programming with user-defined effects. Effect handlers allow for non-local control flow mechanisms such as generators, async/await, lightweight threads and coroutines to be composably expressed. We present a design and evaluate a full-fledged efficient implementation of effect handlers for OCaml, an industrial-strength multi-paradigm programming language. Our implementation strives to maintain the backwards compatibility and performance profile of existing OCaml code. Retrofitting effect handlers onto OCaml is challenging since OCaml does not currently have any non-local control flow mechanisms other than exceptions. Our implementation of effect handlers for OCaml: (i) imposes negligible overhead on code that does not use effect handlers; (ii) remains compatible with program analysis tools that inspect the stack; and (iii) is efficient for new code that makes use of effect handlers.

We have a strong focus on making sure that the existing nice properties of OCaml’s native code implementation (and in particular, debugging and backtraces) are maintained in our proposed concurrency extensions. As with any such major change to OCaml, the contents of this paper should be considered research-grade until they have been ratified at a future core OCaml developers meeting. But by all means, please do experiment with fibres and effects and get us feedback! We’re currently working on a high performance direct-style IO stack that has very promising early performance numbers.

If you want to learn more about effects, @kayceesrk gave a talk on Effective Programming at Lambda Days 2021 (presentation slides).

Performance Measurements with Sandmark

@shakthimaan presented the upcoming features of Sandmark 2.0 and its future roadmap in a community talk. The slide deck is published online, and please do send him any feedback to questions you might have about performance benchmarking. A complete regression testing for various targets and build tags for the Sandmark 2.0 -alpha branch was completed, and we continue to work on the new features for a 2.0 release.
Onto the details then! The Multicore OCaml updates are listed first, which are then followed by the various ongoing and completed tasks for the Sandmark benchmarking project. Finally, the ongoing upstream OCaml work is listed for your reference.

Multicore OCaml

Ongoing

Ecosystem

  • ocaml-multicore/multicore-opam#46
    Multicore compatible ocaml-migrate-parsetree.2.1.0

    A patch to make the ocaml-migrate-parsetree sources use the effect
    syntax. This now builds fine with Multicore OCaml parallel_minor_gc.

  • ocaml-multicore/multicore-opam#47
    Multicore compatible ppxlib

    The effect syntax has now been added to ppxlib, and this is now
    compatible with Multicore OCaml.

Improvements

  • ocaml-multicore/ocaml-multicore#474
    Fixing remarking to be safe with parallel domains

    A draft proposal to fix the problem of remarking pools owned by
    another domain. The solution aims to move the remarking a pool to
    the domain that owns the pool.

  • ocaml-multicore/ocaml-multicore#477
    Move TLS areas to a dedicated memory space

    The PR changes the way we allocate an individual domain’s TLS. The
    present implementation is not optimal for Domain Local Allocation
    Buffer, and hence the patch moves the TLS areas to its own memory
    alloted space.

  • ocaml-multicore/ocaml-multicore#480
    Remove leave_when_done and friends from STW API

    The stw_request.leave_when_done is cleaned up by removing the
    barriers from caml_try_run_on_all_domains* and stw_request.

Sundries

  • ocaml-multicore/ocaml-multicore#466
    Fix corruption when remarking a pool in another domain and that
    domain allocates

    An on-going investigation for the bytecode test failure for
    parallel/domain_parallel_spawn_burn. The recommendation is to have
    a remark queue per domain, and a global remark queue to hold work
    for any orphaned pools or work which could not be enqueued onto a
    domain.

  • ocaml-multicore/ocaml-multicore#468
    Finalisers causing segfault with multiple domains

    A test case has been submitted where Finalisers cause segmentation
    faults with multiple domains.

  • ocaml-multicore/ocaml-multicore#471
    Unix.fork fails with “unlock: Operation not permitted”

    The no blocking section on fork implementation is causing a fatal
    error during unlock with an “operation not permitted” message. This
    has been reported by opam-ci.

  • ocaml-multicore/ocaml-multicore#473
    Building an musl requires dynamically linked execinfo

    An attempt by Haz to build Multicore OCaml with musl. It failed
    because of requiring to link with external libexecinfo.

  • ocaml-multicore/ocaml-multicore#475
    Don’t reuse opcode of bytecode instructions

    An issue raised by Hugo Heuzard on extending existing opcodes and
    appending instructions, instead of reusing opcodes and shifting them
    in Multicore OCaml.

  • ocaml-multicore/ocaml-multicore#479
    Continuation_already_taken crashes toplevel

    A continuation already taken segmentation fault crash reported for
    the iterator-to-generator exercise for 4.10.0+multicore on x86-64.

Completed

Global roots

  • ocaml-multicore/ocaml-multicore#472
    Major GC: Scan global roots from one domain

    As a first step towards parallelizing global roots scanning, a patch
    is provided that scans the global roots from only one domain in a
    major cycle. The parallel benchmark results with the patch is shown
    in the illustration below:

  • ocaml-multicore/ocaml-multicore#476
    Global roots parallel tests

    The globroots_parallel_single.ml and
    globroots_parallel_multiple.ml tests are now added to keep a check
    on global roots interaction with domain lifecycle.

CI

Sundries

Benchmarking

Ongoing

Fixes

  • ocaml-bench/sandmark#208
    Fix params for simple-tests/capi

    The arguments to the simple-tests/capi benchmarks are now passed
    correctly, and they build and execute fine. The same can be verified
    using the following commands:

    $ TAG='"lt_1s"' make run_config_filtered.json
    $ RUN_CONFIG_JSON=run_config_filtered.json make ocaml-versions/4.10.0+multicore.bench
    
  • ocaml-bench/sandmark#209
    Use rule target kronecker.txt and remove from macro_bench

    The graph500seq benchmarks have been updated to use a target rule to
    build kronecker.txt prior to running kernel2 and kernel3. These
    set of benchmarks have been removed from the macro_bench tag.

Sundries

  • ocaml-bench/sandmark#205
    [RFC] Categorize and group by benchmarks

    A draft proposal to categorize the Sandmark benchmarks into a family
    of algorithms based on their use and application. A suggested list
    includes library, formal, numerical, graph etc.

  • ocaml/opam-repository#18203
    [new release] orun (0.0.1)

    A work-in-progress to publish the orun package in
    opam.ocaml.org. A new conf-libdw package has also been created to
    handle the dependencies.

  • The Sandmark 2.0 -alpha branch now includes all the bench targets
    from the present Sandmark master branch, and we have been performing
    regression builds for the various tags. The required dependency
    packages have also been added to the respective target benchmarks.

Completed

OCaml

Ongoing

  • ocaml/ocaml#10039
    Safepoints

    The Safepoints PR implements the prologue eliding algorithm and is
    now rebased to trunk. The effect of eliding optimisation and leaf
    function optimisations reduces the number of polls as illustrated
    below:

Our thanks to all the OCaml users and developers in the community for their contribution and support to the project!

Acronyms

  • API: Application Programming Interface
  • CI: Continuous Integration
  • DLAB: Domain Local Allocation Buffer
  • GC: Garbage Collector
  • OPAM: OCaml Package Manager
  • PLDI: Programming Language Design and Implementation
  • PR: Pull Request
  • RFC: Request For Comments
  • STW: Stop The World
  • TLS: Thread Local Storage
38 Likes

Small comment on the presentation of the benchmarks that show runtime vs. cores. In the ideal case, this is shaped like f(n)=1/n but it is difficult to see whether this is the case or not. It would be more instructive to present this in a way that shows how much speedup diverges from the ideal speedup. (Or use a logarithmic scale as a quick fix.)

1 Like

oh i made it in the news. woot!

1 Like

A bit of a naiive question perhaps, coming from a possible lack of familiarity with how the old gc vs the new one + polling would behave:
Is there an allocation threshold where the gc simply won’t fire up? or are pauses short-but-mandatory in such algorithm? I’m thinking of low-latency low-allocation multicore applications. And I’m wondering if OCaml is meant to compete there.

Also this one’s more of a long-term question, mainly coming from your wiki: there will be a period where one would be able to pick between the two gcs in upstream compiler according to the wiki, then multicore will become default. What’s the plan then? Will retaining both and switching between them remain an option?

1 Like