Multicore OCaml: Dec 2020 / Jan 2021

Welcome to a double helping of the multicore monthlies, with December 2020 and January 2021 bundled together (the team collectively collapsed into the end-of-year break for a well-deserved rest). We encourage you to review all the previous monthly updates for 2020, which have been compiled by @shakthimaan, @kayceesrk, and me.

Looking back over 2020, we achieved a number of major milestones towards upstreaming multicore OCaml. The major highlights include the implementation of the eventlog tracing system to make debugging complex parallelism practical, the enormous rebase from OCaml 4.06 to 4.11, a chapter on parallel programming, the publication of “Retrofitting Parallelism onto OCaml” at ICFP 2020, the production use of the Sandmark benchmark, and the implementation of system threading integration. While all this was happening in the multicore code trees, the upstreaming efforts into mainline OCaml also went into full gear, with @xavierleroy leading the efforts from the core team to ensure that the right pieces went into various releases of OCaml with the same extensive code review as any other feature gets.

The end of 2020 saw enhancements and updates to the ecosystem libraries, with more tooling becoming available. In particular, we would like to thank:

  • @mattpallissard for getting merlin and dot-merlin-reader working with Multicore OCaml 4.10. This makes programming using OCaml Platform tools like the VSCode plugin much more pleasant.
  • @eduardorfs for testing the no-effect-syntax Multicore OCaml branch with a ReasonML project.

@kayceesrk also gave a couple of public talks online.

We’re really grateful to the OCaml core developers for giving this effort so much of their time and focus in 2020! We’re working on a broader plan for 2021’s exciting multicore roadmap, which will be included in the next monthly once a core OCaml developers’ meeting ratifies it. The broad strategy remains consistent: putting pieces of functionality steadily into each upcoming OCaml release so that each can be reviewed and tested in isolation, ahead of the OCaml 5.0 release which will include domains parallelism.

With OCaml 4.12 out in beta, our January has mainly been spent tackling some of the big pieces needed for OCaml 4.13. In particular, the safe points PR has seen a big update (and corresponding performance improvements), and we have been working on the design and implementation of Domain-Local Allocation Buffers (DLAB). We’ve also started the process of figuring out how to merge the awesome sequential best-fit allocator with our multicore major GC, to get the best of both worlds in OCaml 5.0. The multicore IO stack has also restarted development, with focus on Linux’s new io_uring kernel interface before retrofitting the old stalwart epoll and kqueue interfaces.

Tooling-wise, the multicore Merlin support begun in December is now merged, thanks to @mattpallissard and @eduardorfs. We continue to work on the enhancements to the Sandmark 2.0 benchmarking suite for an upcoming alpha release. @shakthimaan gave an online seminar about these improvements to the multicore team; it has been recorded and will be available in the next monthly for anyone interested in contributing to our benchmarking efforts.

As with previous reports, the Multicore OCaml updates are listed first, for December 2020 and then January 2021. Ongoing upstream OCaml work is listed last for your reference, after the multicore-tree-specific pieces.

December 2020

Multicore OCaml

Ongoing

Ecosystem

API

  • ocaml-multicore/ocaml-multicore#448
    Reintroduce caml_stat_accessors in the C API

    The caml_stat_minor_words, caml_stat_promoted_words,
    caml_allocated_words, and caml_stat_minor_collections fields are
    not exposed in Multicore OCaml. This issue discusses possible ways
    to expose them again.

  • ocaml-multicore/ocaml-multicore#459
    Replace caml_root API with global roots

    A work-in-progress to convert variables of type caml_root to
    value, and to register them as global roots or generational global
    roots, in order to remove the caml_root API entirely.

Sundries

  • ocaml-multicore/ocaml-multicore#450
    “rogue” systhreads and domain termination

    An RFC to discuss the semantics of domain termination for
    non-empty thread chaining. In Multicore OCaml, a domain termination
    does not mean the end of a program, and slot reuse adds complexity
    to the implementation.

  • ocaml-multicore/ocaml-multicore#451
    Note for OCaml 5.0: Get rid of compatibility.h

    Multicore OCaml removed modify and initialize from
    compatibility.h, and this is a tracking issue to remove
    compatibility.h for OCaml 5.0.

  • ocaml-multicore/ocaml-multicore#458
    no-effect-syntax: Remove effects from typedtree

    The PR removes the effect syntax use from typedtree.ml, and
    enables external applications that use the AST to work with
    domains-only Multicore OCaml.

  • ocaml-multicore/ocaml-multicore#461
    Remove stw/leader_collision events from eventlog

    A patch to make viewing and analyzing the logs better by removing
    the stw/leader_collision log messages.

Completed

  • ocaml-multicore/effects-examples#23
    Migrate to dune

    The build scripts were using OCamlbuild, and they have been ported
    to now use dune.

  • ocaml-multicore/ocaml-multicore#402
    Split handle_gc_interrupt into handling remote and polling sections

    The PR includes the addition of caml_poll_gc_work that contains
    the polling of GC work done in caml_handle_gc_interrupt. This
    facilitates handling of interrupts recursively without introducing
    new state.

  • ocaml-multicore/ocaml-multicore#439
    Systhread lifecycle work

    The improvement fixes a race condition in caml_thread_scan_roots
    when two domains are initializing, and rework has been done for
    improving general resource handling and freeing of descriptors and
    stacks.

  • ocaml-multicore/ocaml-multicore#446
    Collect GC stats at the end of minor collection

    The GC statistics are now collected at the end of a minor
    collection, and the double buffering of GC sampled statistics has
    been removed. The change does not have an impact on the existing
    benchmark runs, as observed against stock OCaml.

  • ocaml-multicore/ocaml-multicore#454
    Respect ASM_CFI_SUPPORTED flag in amd64

    The CFI directives in amd64.S are now guarded by
    ASM_CFI_SUPPORTED, and thus compilation with --disable-cfi will
    now provide a clean build.

  • ocaml-multicore/ocaml-multicore#455
    No blocking section on fork

    A patch to handle the case when a rogue thread attempts to take over
    the thread masterlock and to prevent a child thread from moving to
    an invalid state. Dune can now be used safely with Multicore OCaml.

Benchmarking

Ongoing

  • ocaml-bench/rungen#1
    Fix compiler warnings and errors for clean build

    The patch provides minor fixes for a clean build of rungen with dune
    to be used with Sandmark 2.0.

  • ocaml-bench/orun#2
    Fix compiler warnings and errors for clean build

    The unused variables and functions have been removed to eliminate
    the warnings and errors produced when building orun with dune.

  • ocaml-bench/sandmark#198
    Noise in Sandmark

    An RFC to measure the noise between multiple execution runs of the
    benchmarks to better understand the performance with various
    hardware configuration settings, and with ASLR turned on and off.

  • ocaml-bench/sandmark#200
    Global roots microbenchmark

    The patch includes globroots_seq.ml, globroots_sp.ml, and
    globroots_mp.ml, which add microbenchmarks to measure the
    efficiency of global root scanning.

  • We are continuing to integrate the existing Sandmark benchmark test
    suite with a Sandmark 2.0 native dune build environment for use with
    an opam compiler switch environment. The existing benchmarks have
    been ported to it and now use their respective dune files. The orun
    and rungen packages now live in separate GitHub repositories.
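
For the noise RFC (sandmark#198) above, one common way to quantify run-to-run noise is the coefficient of variation: the sample standard deviation divided by the mean. This is a generic statistic, not necessarily the measure the RFC settles on. The helper name and the timings in this minimal Python sketch are made up for illustration and are not measurements.

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (needs >= 2 samples)."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate noise")
    return stdev(samples) / mean(samples)

# Made-up wall-clock times (seconds) for one benchmark under two
# hypothetical hardware configurations; these are NOT real results.
config_a = [10.0, 10.4, 9.6, 10.2, 9.8]
config_b = [10.0, 10.1, 9.9, 10.0, 10.0]

noise_a = coefficient_of_variation(config_a)
noise_b = coefficient_of_variation(config_b)
print(f"config A: {noise_a:.2%}")
print(f"config B: {noise_b:.2%}")
```

A relative measure like this lets benchmarks with very different running times be compared on the same scale.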

Completed

  • ocaml-bench/sandmark#196
    Filter benchmarks based on tag

    The benchmarks can now be filtered based on tags instead of custom
    target .json files. You can now build the benchmarks using the
    following commands:

    $ TAG='"run_in_ci"' make run_config_filtered.json 
    $ RUN_CONFIG_JSON=run_config_filtered.json make ocaml-versions/4.10.0+multicore.bench
    
  • ocaml-bench/sandmark#201
    Fix compiler version in CI

    A minor update in .drone.yml to use
    ocaml-versions/4.10.0+multicore.bench in the CI for
    4.10.0+multicore+serial.
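
The tag filter from sandmark#196 above can be pictured with a small Python sketch. It is not Sandmark's implementation: the layout of the run configuration (a top-level "benchmarks" list whose entries carry a "tags" list) and the helper name filter_by_tag are assumptions made for illustration.

```python
import json

def filter_by_tag(config, tag):
    """Return a copy of config keeping only the benchmarks that carry tag.

    Assumes (hypothetically) that config looks like
    {"benchmarks": [{"name": ..., "tags": [...]}, ...], ...}.
    """
    kept = [b for b in config.get("benchmarks", []) if tag in b.get("tags", [])]
    filtered = dict(config)      # shallow copy; other top-level keys are kept
    filtered["benchmarks"] = kept
    return filtered

# Hypothetical run configuration with two benchmarks.
example = {
    "wrappers": [],
    "benchmarks": [
        {"name": "bench_a", "tags": ["run_in_ci", "macro_bench"]},
        {"name": "bench_b", "tags": ["macro_bench"]},
    ],
}

filtered = filter_by_tag(example, "run_in_ci")
print(json.dumps(filtered, indent=2))
```

The extra double quotes in TAG='"run_in_ci"' suggest that the make target receives the tag as a JSON string literal; in the sketch a plain Python string plays that role.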

OCaml

Ongoing

  • ocaml/ocaml#9876
    Do not cache young_limit in a processor register

    This PR, which removes the caching of young_limit in a register for
    the ARM64, PowerPC and RISC-V ports, is currently under review.

January 2021

Multicore OCaml

Ongoing

  • ocaml-multicore/ocaml-multicore#464
    Replace Field_imm with Field

    The patch replaces uses of Field_imm, which date from the
    concurrent minor collector, with Field.

  • ocaml-multicore/ocaml-multicore#468
    Finalisers causing segfault with multiple domains

    An ongoing investigation of a test case in which finalisers cause
    segmentation faults with multiple domains.

  • The design and implementation of Domain-Local Allocation Buffers
    (DLAB) is underway, and the relevant notes are available in the
    DLAB Wiki.

Completed

Ecosystem

  • ocaml-bench/rungen#1
    Fix compiler warnings and errors for clean build

    Minor fixes for a clean build of rungen with dune to be used with
    Sandmark 2.0.

  • ocaml-bench/orun#2
    Fix compiler warnings and errors for clean build

    A patch to remove unused variables and functions so that orun
    builds with dune without any warnings or errors.

  • ocaml-bench/rungen#2
    Added meta files for dune-release lint

    The dune-release lint checks for rungen now pass with the
    inclusion of CHANGES, LICENSE and updates to rungen.opam files.

  • ocaml-bench/orun#3
    Add meta files for dune-release lint

    The CHANGES, LICENSE, README.md and orun.opam files have been added
    to prepare the sources for an opam.ocaml.org release.

  • ocaml-multicore/multicore-opam#39
    Multicore Merlin

    Thanks to @mattpallissard (Matt Pallissard) and @eduardorfs (Eduardo
    Rafael) for testing merlin and dot-merlin-reader, and for getting
    them working with Multicore OCaml 4.10! The changes work fine with
    VSCode and Atom. The corresponding PR#40 is now merged.

  • ocaml-multicore/ocaml-multicore#45
    Merlin and OCaml-LSP installation instructions

    The README.md file has been updated to include installation
    instructions to use Merlin and OCaml LSP Server.

Sundries

  • ocaml-multicore/ocaml-multicore#458
    no-effect-syntax: Remove effects from typedtree

    The PR enables external applications that use the AST to work with
    domains-only Multicore OCaml, and removes the effect syntax use from
    typedtree.ml.

  • ocaml-multicore/ocaml-multicore#461
    Remove stw/leader_collision events from eventlog

    The stw/leader_collision log messages have been cleaned up to make
    it easier to view and analyze the logs.

  • ocaml-multicore/ocaml-multicore#462
    Move from Travis to GitHub Actions

    The continuous integration builds now use GitHub Actions instead of
    Travis CI, in order to match the upstream CI.

  • ocaml-multicore/ocaml-multicore#463
    Minor GC: Restrict global roots scanning to one domain

    Previously, all live domains scanned all the global roots during a
    minor collection; the patch restricts the global root scanning to
    just one domain.

    [Figure: sequential and parallel macro benchmark results]

  • ocaml-multicore/ocaml-multicore#467
    Disable the pruning of the mark stack

    A PR to disable the pruning of the mark stack because of a
    concurrency bug that occurs when remarking a pool in another domain
    while that domain is also allocating.

Benchmarking

Ongoing

  • ocaml-bench/sandmark#202
    Add bench clean target in the Makefile

    A benchclean target has been added to the Makefile to only remove
    _build and _results. The _opam folder is retained with the
    required packages and dependencies installed, so that the benchmarks
    can be quickly re-built and executed.

  • ocaml-bench/sandmark#203
    Implement ITER support

    ITER support has been implemented: multiple instances of the
    benchmarks are built and their executions repeated, which makes it
    possible to average metrics over multiple runs. For example, using
    ITER=2 produces two .summary.bench files as shown below:

    $ ls _build/
      4.10.0+multicore_1  4.10.0+multicore_2  log
    
    $ ls _results/
      4.10.0+multicore_1.orun.summary.bench  4.10.0+multicore_2.orun.summary.bench
    
  • ocaml-bench/sandmark#204
    Adding layers.ml as a benchmark to Sandmark

    The inclusion of the Irmin layers benchmark and its dependencies
    into Sandmark. This is a work-in-progress.

  • We are continuing the enhancements for Sandmark 2.0, which uses
    native dune to build and execute the benchmarks, and are porting and
    testing it with the current Sandmark configuration files. The orun
    and rungen packages have been moved to their respective
    repositories. A meta header entry in the .summary.bench file, ITER
    support, and package override features have been implemented.
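
One way to use the per-iteration files from sandmark#203 above is to average a metric across them. The Python sketch below assumes, hypothetically, that each .summary.bench file holds one JSON object per line with "name" and "time_secs" fields; check the actual files before relying on it. The helper name mean_time_per_benchmark and the sample data are also made up.

```python
import json
from collections import defaultdict
from statistics import mean

def mean_time_per_benchmark(summaries):
    """Average time_secs per benchmark name across several summaries.

    Each element of summaries is the text of one .summary.bench file,
    assumed (hypothetically) to contain one JSON object per line.
    """
    times = defaultdict(list)
    for text in summaries:
        for line in text.splitlines():
            if line.strip():                 # skip blank lines
                record = json.loads(line)
                times[record["name"]].append(record["time_secs"])
    return {name: mean(values) for name, values in times.items()}

# Two made-up iterations, standing in for the *_1 and *_2 result files.
iter_1 = '{"name": "bench_a", "time_secs": 2.0}\n{"name": "bench_b", "time_secs": 4.0}\n'
iter_2 = '{"name": "bench_a", "time_secs": 3.0}\n{"name": "bench_b", "time_secs": 6.0}\n'

averages = mean_time_per_benchmark([iter_1, iter_2])
print(averages)
```

With real results, the list of texts could be built by reading each file matched by a glob such as _results/*.summary.bench.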

Completed

  • ocaml-bench/sandmark#200
    Global roots microbenchmark

    The implementation of globroots_seq.ml, globroots_sp.ml, and
    globroots_mp.ml to measure the efficiency of global root scanning
    has been added to the microbenchmarks.

OCaml

Ongoing

Completed

  • ocaml/ocaml#9876
    Do not cache young_limit in a processor register

    The PR removes the caching of young_limit in a register for the
    ARM64, PowerPC and RISC-V ports.

Our thanks to all the OCaml users and developers in the community for their continued support and contribution to the project, and we look forward to working with you in 2021!

Acronyms

  • API: Application Programming Interface
  • ARM: Advanced RISC Machine
  • ASLR: Address Space Layout Randomization
  • AST: Abstract Syntax Tree
  • CFI: Call Frame Information
  • CI: Continuous Integration
  • GC: Garbage Collector
  • ICFP: International Conference on Functional Programming
  • JSON: JavaScript Object Notation
  • OPAM: OCaml Package Manager
  • PR: Pull Request
  • RFC: Request For Comments
  • RISC-V: Reduced Instruction Set Computing - V
  • UI: User Interface

Does this mean OCaml 5.0 will have concurrency baked right into the default OCaml? If so, this is excellent news. Also, how does this fit into the Mirage stack? i.e. will OCaml 5.0 concurrency work with Mirage too?

OCaml 5.0 is slated to have domains-only parallelism and no fibres, but stay tuned for more news on our plans for fibres shortly. The Mirage prototype is looking very good so far, but I’m not quite ready to talk about it in more detail until it’s a bit more baked :slight_smile:

Will there be only one runtime in 5.0? As in, multicore will be the
default, not an opt-in switch as happened with flambda1?

This is exciting to hear. Is the project public? I’ve spent some time over the past couple of months working on wrappers around the various polling interfaces (I started with kqueue and select), and I’ve been working on io-uring bindings as well. After reading this month’s multicore report, I think I’d rather work on testing/contributing to an effort maintained by the multicore team (there are gaps in my ctypes/OCaml runtime knowledge) instead of duplicating some effort :slight_smile:

The intention is to have a single runtime, yes. A huge amount of the work on the multicore project to date (see Retrofitting Parallelism onto OCaml) has been focussed on reducing the performance hit to sequential code when running on a parallel-capable runtime. If we end up with multiple runtimes in OCaml 5.0 after all this work, you’ll find me sobbing quietly in a corner of the developer meeting.

I’ve just pushed the non-multicore bits of the uring bindings over to https://github.com/ocaml-multicore/ocaml-uring. I’m still iterating on them, as there’s a very specific design that minimises heap pressure that we’re aiming towards. So not quite ready for contributions yet, but should be soon. The bindings there should be perfectly usable for pre-OCaml 5.0 use once that design settles (and in fact, I’ll probably add it to Lwt in order to give us experience using io_uring in production ahead of any default use in multicore OCaml).

I’ve got to say though – so far, io_uring has absolutely exceeded my expectations. I never want to look at epoll again now. And a side worry is that the BSDs are steadily getting left behind in terms of post-POSIX IO standards. Between io_uring in Linux, Grand Central Dispatch in macOS, and iocp in Windows, we’re spoilt for async choice in the “mainstream” operating systems.

If anyone is playing with multicore, I made this Lwt engine which is capable of running things like opium partially in parallel. I see 2x gains when running with 4 domains; it doesn’t scale a lot, but 2x is quite nice.

Just drop this in a file and set up Lwt with it.
