Multicore OCaml: May 2021

Welcome to the May 2021 Multicore OCaml monthly report! This month’s update along with the previous updates have been compiled by @avsm, @ctk21, @kayceesrk and @shakthimaan.

Firstly, all of our upstream activity on the OCaml compiler is now reported as part of the shiny new compiler development newsletter #2 that @gasche has started. This represents a small but important shift – domains-only multicore is firmly locked in on the upstream roadmap for OCaml 5.0 and the whole OCaml compiler team has been helping and contributing to it, with the GC safe points feature being one of the last major multicore-prerequisites (and due to be in OCaml 4.13 soon).

This multicore newsletter will now focus on getting our ecosystem ready for domains-only multicore in OCaml 5.0, and on how the (not-yet-official) effect system and multicore IO stack is progressing. It’s a long one this month, so settle in with your favourite beverage and let’s begin :slight_smile:

OCaml Multicore: 4.12.0+domains

The multicore compiler now supports CTF runtime traces of its garbage collector and there are tools to display chrome tracing visualisations of the garbage collector events. A number of performance improvements (see speedup graphs later on) that highlight some ways to make best use of multicore were made to the existing benchmarks in Sandmark. There has also been work on scaling up to 128 cores/domains for task-based parallelism in domainslib using work stealing deques, bringing us closer to Cilk-style task-parallel performance.

As important as the new features is what we have decided not to do. We’ve been working on and evaluating Domain Local Allocation Buffers (DLABs) for some time, with the intention of reducing the cost of minor GCs. We’ve found that the resulting performance didn’t match our expectations (vs the complexity of the change), and so we’ve decided not to proceed with this for OCaml 5.0. The DLAB summary page summarises our experiences. We’ll come back to this post-OCaml 5.0 when there are fewer moving parts.

Ecosystem changes to prepare for 5.0.0 domains-only

As we are preparing 5.0 branches with the multicore branches over the coming months, we are stepping up preparations to ensure the OCaml ecosystem is ready.

Making the multicore compilers available by default in opam-repo

Over the next few weeks, we will be merging the multicore 4.12.0+domains and associated packages from their opam remote over in ocaml-multicore/multicore-opam into the mainline opam-repository. This is to make it more convenient to use the variant compilers to start testing your own packages with Domains.

As part of this change, there are two new base packages that will be available in opam-repository:

  • base-domains: This package indicates that the current compiler has the Domain module.
  • base-effects: This package indicates the current compiler has the experimental effect system.

By adding a dependency on these packages, the only valid solutions will be 4.12.0+domains (until OCaml 5.0 which will have this module) or 4.12.0+effects.

The goal of this is to let community packages more easily release versions of their code using Domains-only parallelism ahead of OCaml 5.0, so that we can start migration and thread-safety early. We do not encourage anyone to take a dependency on base-effects currently, as it is very much a moving target.

This opam-repository change isn’t in yet, but I’ll comment on this post when it is merged.

Adapting the Stdlib for thread-safety

One of the first things we have to do before porting third-party libraries is to get the Stdlib ready for thread-safety. This isn’t quite as simple as it might appear at first glance: if we adopt the naïve approach of simply putting a mutex around every bit of global state, our sequential performance will slow down. Therefore we are performing a more fine-grained analysis and fixes, which can be seen on the multicore stdlib page.

For anyone wishing to contribute: hunt through the Stdlib for global state, categorise it appropriately, then create a test case exercising that module with multiple Domains running, and submit a PR to ocaml-multicore. In general, if you see any build failures or runtime failures now, we’d really appreciate an issue being filed there too. You can see some good examples of such issues here (for mirage-crypto) and here (for Coq).
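
A test of the kind described above might look like the following minimal sketch. It assumes a compiler that provides the Domain module (4.12.0+domains or later). The counter and the names bump, n_domains and iterations are hypothetical stand-ins for a piece of global state found in a library; none of this is code from the Stdlib.

```ocaml
(* Hypothetical global state: a counter shared by every caller.
   Using Atomic.t instead of a plain ref makes increments safe across
   domains without taking a mutex on every access. *)
let counter = Atomic.make 0

let bump () = Atomic.incr counter

let n_domains = 4
let iterations = 10_000

(* Spawn several domains that all update the shared state, then join
   them and return the final value of the counter. *)
let run_test () =
  Atomic.set counter 0;
  let worker () =
    for _ = 1 to iterations do
      bump ()
    done
  in
  let domains = List.init n_domains (fun _ -> Domain.spawn worker) in
  List.iter Domain.join domains;
  Atomic.get counter

let () =
  let total = run_test () in
  (* With a plain [int ref] this check could fail because of lost updates. *)
  assert (total = n_domains * iterations);
  Printf.printf "total = %d\n" total
```

Swapping the Atomic.t for an unprotected ref is a quick way to see whether a test like this is actually sensitive to the race it is meant to catch.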

Porting third-party libraries to Domains

As I mentioned last month, we put a call out for libraries and maintainers who wanted to port their code over. We’re starting with the following libraries and applications this month:

  • Lwt: the famous lightweight-threads library now has a PR to add Lwt_domain. This is the first simple(ish) step to using multiple cores with Lwt: it lets you run a pure (non-Lwt) function in another Domain via detach : ('a -> 'b) -> 'a -> 'b Lwt.t.

  • Mirage-Crypto: the next library we are adapting is the cryptography library, since it is also low-hanging fruit that should be easy to parallelise (since crypto functions do not have much global state). The port is still ongoing, as there are some minor build failures and also Stdlib functions in Format that aren’t yet thread-safe that are causing failures.

  • Tezos-Node: the bigger application we are applying some of the previous dependencies to is Tezos-Node, which makes use of the dependency chain here via Lwt, mirage-crypto, Irmin, Cohttp and many other libraries. We’ve got this compiling under 4.12.0+domains now and mostly passing the test suite, but will only report significant results once the dependencies and Stdlib are passing.

  • Owl: OCaml’s favourite machine learning library works surprisingly well out-of-the-box with 4.12.0+domains. An experiment for a significant machine-learning codebase written using it saw about a 2-4x speedup before some false-sharing bottlenecks kicked in. This is pretty good going given that we made no changes to the codebase itself, but stay tuned for more improvements over the coming months as we analyse the bottleneck.

This is hopefully a signal to all of you to start “having a go” with 4.12.0+domains on your own applications, and particularly with respect to seeing how wrapping it in Domains works out and identifying global state. You can read our handy tutorial on parallel programming with Multicore OCaml.

We are developing some tools to help find global state, but we will all need to work together to identify some of these cases and begin migration. Crucially, we need some diversity in our dependency chains – if you have interesting applications using (e.g.) Async or the vanilla Thread module and have some cycles to work with us, please get in touch with me or @kayceesrk.

4.12.0+effects

The effects-based eio library is coming together nicely, and the interface and design rationales are all up-to-date in the README of the repository. The primary IO backend is ocaml-uring, which we are preparing for a separate release to opam-repository now as it also works fine on the sequential runtime for Linux (as long as you have a fairly recent kernel). We also have a Grand Central Dispatch effect backend to give us a totally different execution model to exercise our effect handler abstractions.

While we won’t publish the performance numbers for the effect-based IO this month, you can get a sense of the sorts of tests we are running by looking at the retro-httpaf-bench repository, which now has various permutations of effects-based, uring-based and select-based webservers. We’ve submitted a talk to the upcoming OCaml Workshop later this summer, which, if accepted, will give you a deep dive into our effect-based IO.

As always, we begin with the Multicore OCaml ongoing and completed tasks. The ecosystem improvements are then listed followed by the updates to the Sandmark benchmarking project. Finally, the upstream OCaml work is mentioned for your reference. For those of you that have read this far and can think of nothing more fun than hacking on multicore programming runtimes, we are hiring in the UK, France and India – please find the job postings at the end!

Multicore OCaml

Ongoing

  • ocaml-multicore/ocaml-multicore#552
    Add a force_instrumented_runtime option to configure

    A new --enable-force-instrumented-runtime option is introduced to
    facilitate use of the instrumented runtime on linker invocations to
    obtain event logs.

  • ocaml-multicore/ocaml-multicore#553
    Testsuite failures with flambda enabled

    A number of tests fail on b23a416 with flambda enabled, and
    they need to be investigated further.

  • ocaml-multicore/ocaml-multicore#555
    runtime: CAML_TRACE_VERSION is now set to a Multicore specific value

    Define a CAML_TRACE_VERSION to distinguish between Multicore OCaml
    and trunk for the runtime.

  • ocaml-multicore/ocaml-multicore#558
    Refactor Domain.{spawn/join} to use no critical sections

    The PR removes the use of Domain.wait and critical sections in
    Domain.{spawn/join}.

  • ocaml-multicore/ocaml-multicore#559
    Improve the Multicore GC Stats

    A draft PR to include more Multicore GC statistics when using
    OCAMLRUNPARAM=v=0x400.

Completed

  • ocaml-multicore/ocaml-multicore#508
    Domain Local Allocation Buffers

    The Domain Local Allocation Buffer implementation for OCaml Multicore has
    been dropped for now. A discussion is on the PR itself and there is a wiki
    page here.

  • ocaml-multicore/ocaml-multicore#527
    Port eventlog to CTF

    The porting of the eventlog implementation to the Common Trace
    Format is now complete.

    For an introduction to producing Chrome trace visualizations of the
    runtime events see eventlog-tools. This postprocessing tool turns the CTF
    trace into the Chrome tracing format that allows interactive visualizations
    like this:

Ecosystem

Ongoing

  • ocaml-multicore/eventlog-tools#2
    Add a pausetimes tool

    The eventlog_pausetimes tool takes a directory of eventlog files
    and computes the mean and maximum pause times, as well as the
    distribution up to the 99.9th percentile. For example:

    ocaml-eventlog-pausetimes /home/engil/dev/ocaml-multicore/trace3/caml-426094-* name
    {
      "name": "name",
      "mean_latency": 718617,
      "max_latency": 33839379,
      "distr_latency": [191,250,707,16886,55829,105386,249272,552640,1325621,13312993,26227671]
    }
    
  • domainslib#29
    Task stealing with CL deques

    This ongoing work to use task-stealing Chase-Lev deques for scheduling
    tasks across domains is looking very promising, particularly for machines
    with 128 cores.

  • ocaml-multicore/retro-httpaf-bench#10
    Add Eio benchmark

    This PR adds an Eio benchmark to retro-httpaf-bench. It is a
    work-in-progress.

  • ocaml-multicore/eio#26
    Grand Central Dispatch Backend

    An early draft PR that implements the Grand Central Dispatch (GCD)
    backend for Eio.

  • ocsigen/lwt#860
    Lwt_domain: An interface to Multicore parallelism

    An on-going effort to introduce Lwt_domain for offloading
    computations to other CPU cores using Multicore OCaml’s Domains.
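
To make the statistics reported by the pausetimes tool above more concrete, here is a small sketch of how a mean, a maximum and a percentile distribution could be computed from a list of pause durations. It is not the tool's implementation: the function names, the choice of percentiles and the nearest-rank method are assumptions made for illustration, and the sample values are made up.

```ocaml
(* Nearest-rank percentile of a sorted, non-empty array: the smallest
   element such that at least p percent of the data is <= it. *)
let percentile sorted p =
  let n = Array.length sorted in
  let rank = int_of_float (ceil (p /. 100.0 *. float_of_int n)) in
  let idx = max 0 (min (n - 1) (rank - 1)) in
  sorted.(idx)

(* Summarise a non-empty list of pause durations (e.g. in nanoseconds)
   as (integer mean, maximum, selected percentiles). *)
let summarise pauses =
  let sorted = Array.of_list pauses in
  Array.sort compare sorted;
  let n = Array.length sorted in
  let sum = Array.fold_left ( + ) 0 sorted in
  let mean = sum / n in
  let max_pause = sorted.(n - 1) in
  let distr = List.map (percentile sorted) [ 50.0; 90.0; 99.0; 99.9 ] in
  (mean, max_pause, distr)

let () =
  let mean, max_pause, distr = summarise [ 120; 300; 80; 4000; 250 ] in
  Printf.printf "mean=%d max=%d distr=[%s]\n" mean max_pause
    (String.concat "," (List.map string_of_int distr))
```

With so few samples the upper percentiles all collapse onto the maximum, which is one reason a tool like this is normally run over a whole directory of traces.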

Completed

retro-httpaf-bench

The retro-httpaf-bench repository contains scripts for running HTTP
server benchmarks.

eio

The eio library provides an effects-based parallel IO stack for
Multicore OCaml.

  • ocaml-multicore/eio#18
    Add fibreslib library

    The promise library has been renamed to fibreslib to avoid
    naming conflict with the existing package in opam, and the API
    (waiters and effects) has been split into its own respective
    modules.

  • ocaml-multicore/eio#19
    Update to latest ocaml-uring

    The code and configuration files have been updated to use the latest
    ocaml-uring.

  • ocaml-multicore/eio#20
    Add Fibreslib.Semaphore

    Implemented the Fibreslib.Semaphore module, which is useful for
    rate-limiting and is based on OCaml’s Semaphore.Counting.

  • ocaml-multicore/eio#21
    Add high-level Eio API

    A new Eio library with interfaces for sources and sinks. The README
    documentation has been updated with motivation and usage.

  • ocaml-multicore/eio#22
    Add switches for structured concurrency

    Implementation of structured concurrency with documentation examples
    for tracing and testing with mocks.

  • ocaml-multicore/eio#23
    Rename repository to eio

    The Effects based parallel IO for OCaml repository has now been
    renamed from eioio to eio.

  • ocaml-multicore/eio#24
    Rename lib_eioio to lib_eunix

    The names have been updated to match the dune file.

  • ocaml-multicore/eio#25
    Detect deadlocks

    An exception is now raised to detect deadlocks if the scheduler
    finishes while the main thread continues to run.

  • ocaml-multicore/eio#27
    Convert expect tests to MDX

    The expect tests have been converted to the MDX format, which
    avoids the need for ppx libraries.

  • ocaml-multicore/eio#28
    Use splice to copy if possible

    The effect Splice has been implemented along with the update to
    ocaml-uring, and necessary documentation.

  • ocaml-multicore/eio#29
    Improve exception handling in switches

    Additional exception checks to handle when multiple threads fail,
    and for Switch.check and Fibre.fork_ignore.

  • ocaml-multicore/eio#30
    Add eio_main library to select backend automatically

    Use eio_main to select the appropriate backend (eunix, for
    example) based on the platform.

  • ocaml-multicore/eio#31
    Add Eio.Flow API

    Implemented a Flow module that allows combinations such as
    bidirectional flows and closable flows.

  • ocaml-multicore/eio#32
    Initial support for networks

    Eio provides a high-level API for networking, and the Network
    module has been added.

  • ocaml-multicore/eio#33
    Add some design rationale notes to the README

    The README has been updated with design notes, and a reference to
    further reading on the principles of the object-capability model.

  • ocaml-multicore/eio#34
    Add shutdown, allow closing listening sockets, add cstruct_source

    Added cstruct_source, shutdown method along with source, sink and
    file descriptor types.

  • ocaml-multicore/eio#35
    Add Switch.on_release to auto-close FDs

    We can now attach resources such as file descriptors to switches,
    and these are freed when the switch is finished.

Sundries

  • ocaml-multicore/domainslib#23
    Running tests: moving to dune runtest from manual commands in
    run_test target

    The dune runtest command is now used to execute the tests.

  • ocaml-multicore/domainslib#24
    Move to Mutex & Condition from Domain.Sync.{notify/wait}

    The channel implementation using Mutex and Condition is now
    complete. The performance results are shown in the following graph:

  • ocaml-multicore/multicore-opam#53
    Add base-domains and base-effects packages

    The base-domains and base-effects opam files have now been added
    to multicore-opam.

  • ocaml-multicore/multicore-opam#54
    Shift all multicore packages to unique versions and base-domains dependencies

    The naming convention is to now use base-effects and
    base-domains everywhere.

Benchmarking

Ongoing

  • ocaml-bench/sandmark#230
    Build for 4.13.0+trunk with dune.2.8.1

    A work-in-progress to upgrade Sandmark to use dune.2.8.1 to build
    4.13.0+trunk and generate the benchmarks. You can test the same
    using:

    TAG='"macro_bench"' make run_config_filtered.json
    RUN_CONFIG_JSON=run_config_filtered.json make ocaml-versions/4.13.0+trunk.bench
    

Completed

Sandmark

Performance
  • ocaml-bench/sandmark#221
    Fix up decompress iterations of work

    The use of parallel_for, the simplification of data_to_compress to
    use String.init, and a fix to correctly count the amount of work
    configured and done produce the following speed improvements:


  • ocaml-bench/sandmark#223
    A better floyd warshall

    An improvement to the Floyd Warshall implementation that fixes the
    random seed so that it is repeatable, and improves the pattern
    matching.



  • ocaml-bench/sandmark#224
    Some improvements for matrix multiplication

    The matrix_multiplication and matrix_multiplication_multicore
    code have been updated for easier maintenance, and results are
    written only after summing the values.


  • ocaml-bench/sandmark#225
    Better Multicore EA Benchmark

    The Evolutionary Algorithm now inserts a poll point into fittest
    to improve the benchmark results.


  • ocaml-bench/sandmark#226
    Better scaling for mandelbrot6_multicore

    The mandelbrot6_multicore scales well now with the use of
    parallel_for as observed in the following graphs:



  • ocaml-bench/sandmark#227
    Improve nbody_multicore benchmark with high core counts

    The energy function is now parallelised with parallel_for_reduce
    for larger core counts.


  • ocaml-bench/sandmark#229
    Improve game_of_life benchmarks

    The hot functions are now inlined to improve the game_of_life
    benchmarks, and we avoid initialising the temporary matrix with
    random numbers.
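
The idea behind the Floyd Warshall change listed above, seeding the random number generator so that every run sees the same input, can be sketched as follows. This is an illustrative sketch and not the Sandmark benchmark code: the graph size, seed, edge probability and function names are all assumptions.

```ocaml
(* Edge weights: [None] means "no edge", [Some w] an edge of weight w. *)
let random_graph ~seed ~n =
  (* A private, explicitly seeded generator makes the input repeatable. *)
  let st = Random.State.make [| seed |] in
  Array.init n (fun i ->
      Array.init n (fun j ->
          if i = j then Some 0
          else if Random.State.int st 100 < 30 then
            Some (1 + Random.State.int st 20)
          else None))

(* Classic O(n^3) all-pairs shortest paths, updating [dist] in place. *)
let floyd_warshall dist =
  let n = Array.length dist in
  for k = 0 to n - 1 do
    for i = 0 to n - 1 do
      for j = 0 to n - 1 do
        match dist.(i).(k), dist.(k).(j), dist.(i).(j) with
        | Some a, Some b, Some c when a + b < c -> dist.(i).(j) <- Some (a + b)
        | Some a, Some b, None -> dist.(i).(j) <- Some (a + b)
        | _ -> ()
      done
    done
  done;
  dist

let () =
  let run () = floyd_warshall (random_graph ~seed:42 ~n:16) in
  (* Same seed, same graph, same shortest paths on every run. *)
  assert (run () = run ());
  print_endline "repeatable"
```

Using an explicit Random.State value rather than the global generator also keeps the input independent of any other code that happens to draw random numbers.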


Sundries
  • ocaml-bench/sandmark#215
    Remove Gc.promote_to from treiber_stack.ml

    The 4.12+domains and 4.12+domains+effects branches have
    Gc.promote_to removed from the runtime.

  • ocaml-bench/sandmark#216
    Add configs for 4.12.0+stock, 4.12.0+domains, 4.12.0+domains+effects

    The ocaml-version configuration files for 4.12.0+stock,
    4.12.0+domains, and 4.12.0+domains+effects have now been added
    to Sandmark.

  • ocaml-bench/sandmark#220
    Attempt to improve the OCAMLRUNPARAM documentation

    The README has been updated with more documentation on the use of
    OCAMLRUNPARAM configuration when running the benchmarks.

  • ocaml-bench/sandmark#222
    Deprecate 4.06.1 and 4.10.0 and upgrade to 4.12.0

    The 4.06.1, 4.10.0 ocaml-versions have been removed and the CI
    has been updated to use 4.12.0 as the default version.

current-bench

  • ocurrent/current-bench#103
    Ability to set scale on UI to start at 0

    The graph origins now start from [0, y_max+delta] for the y-axis
    for better comparison.


  • ocurrent/current-bench#121
    Use string representation for docker cpu setting.

    The OCAML_BENCH_DOCKER_CPU setting now switches from Integer to
    String to support a range of CPUs for parallel execution.

OCaml

Ongoing

  • ocaml/ocaml#10039
    Safepoints

    The Sandmark benchmark runs to obtain the performance numbers for
    the Safepoints PR for 4.13.0+trunk have been published. The PR is
    ready to be merged.

Job Advertisements

Our thanks to all the OCaml users, developers and contributors in the
community for their continued support to the project. Stay safe!

Acronyms

  • AMD: Advanced Micro Devices
  • API: Application Programming Interface
  • CI: Continuous Integration
  • CPU: Central Processing Unit
  • CTF: Common Trace Format
  • DLAB: Domain Local Allocation Buffer
  • EA: Evolutionary Algorithm
  • GC: Garbage Collector
  • GCD: Grand Central Dispatch
  • HTTP: Hypertext Transfer Protocol
  • OPAM: OCaml Package Manager
  • MVP: Minimal Viable Product
  • PR: Pull Request
  • TPS: Transactions Per Second
  • UI: User Interface

There is a lot of interesting action here. The work-stealing support in DomainsLib.Task is impressive, congratulations @ctk21. It’s also nice that you decided to update the benchmarks to follow evolving best performance practices, as it makes them interesting examples to look at. I haven’t had time to look at eio at all, also looks quite interesting!


Just a small heads up for anyone who is subscribed to the thread - the safepoints PR was finally merged, yay!


I am interested in maintaining a git branch of parany that would use OCaml multicore
instead of forked processes (the current backend).

I don’t know when this will happen, but I prefer changing this library
rather than porting all my parallel software (I have quite a few) to multicore-OCaml.
