Multicore OCaml: October 2020

Welcome to the October 2020 Multicore OCaml report, compiled by @shakthimaan, @kayceesrk and of course myself. The [previous monthly](https://discuss.ocaml.org/tag/multicore-monthly) updates are also available for your perusal.

OCaml 4.12.0-dev: The upstream OCaml tree has been branched for the 4.12 release, and the OCaml readiness team is busy stabilising it with the ecosystem. The 4.12.0 development stream has significant progress towards multicore support, especially with the runtime handling of naked pointers. The release will ship with a dynamic checker for naked pointers that you can use to verify that your own codebase is clean of them, as this will be a prerequisite for OCaml 5.0 and multicore compatibility. This is activated via the --enable-naked-pointers-checker configure option.

Convergence with upstream and multicore trees: The multicore OCaml trees have seen significant robustness improvements as we’ve converged our trees with upstream OCaml (possible now that the upstream architectural changes are synched with the requirements of multicore). In particular, the handling of global C roots is much better in multicore now as it uses the upstream OCaml scheme, and the GC colour scheme also exactly matches upstream OCaml’s. This means that community libraries from opam work increasingly well when built with multicore OCaml (using the no-effects-syntax branch).

Features: Multicore OCaml now also uses domain-local allocation buffers to simplify its internals. We are also working on benchmarking the IO subsystem: support for CPU parallelism has been added for the Lwt concurrency library, and the Asynchronous Effect-based IO (aeio) library has been refreshed to work with Multicore OCaml, Lwt, and httpaf in an http-effects library.

Benchmarking: The Sandmark benchmarking test suite has additional configuration options, and there are new proposals in that project to leverage the OCaml tools and ecosystem as much as possible.

As with previous updates, the ongoing and completed Multicore OCaml tasks are listed first, followed by improvements to the Sandmark benchmarking test suite. Finally, the upstream OCaml related work is mentioned for your reference.

Multicore OCaml

Ongoing

  • ocaml-multicore/ocaml-multicore#422
    Simplify minor heaps configuration logic and masking

    The PR is a step towards using domain-local allocation buffers. A
    Minor_heap_max size is used to reserve the minor heaps area, and
    Is_young now relies on a boundary check. The Minor_heap_max can
    be overridden using the OCAMLRUNPARAM environment variable.

  • ocaml-multicore/ocaml-multicore#426
    Replace global roots implementation

    An effort to replace the existing global roots implementation so
    that it is in line with OCaml’s globroots. The objective is to also
    have a per-domain skip list, and a global orphans list for when a
    domain is terminated.

  • ocaml-multicore/ocaml-multicore#427
    Garbage Collector colours change backport

    The Garbage Collector colour scheme changes in the major
    collector have now been backported to Multicore OCaml. The
    mark_entry does not include end, mark_stack_push more closely
    resembles trunk, and caml_shrink_mark_stack has been adapted from
    trunk.

  • ocaml-multicore/ocaml-multicore#429
    Fix a STW interrupt race

    The STW interrupt race in
    caml_try_run_on_all_domains_with_spin_work is fixed in this PR,
    where the enter_spin_callback and enter_spin_data fields of
    stw_request are initialized after we interrupt other domains.
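
The boundary check mentioned for the minor heaps PR (#422) can be pictured as a plain address-range test once all minor heaps live in one reserved area. The following is only an illustrative model written in OCaml with integers standing in for addresses; the real check is part of the C runtime, and the record fields and the `reserve` function here are hypothetical names, not the runtime's definitions.

```ocaml
(* Illustrative model only: addresses are plain ints, and every name
   below is hypothetical rather than taken from the runtime sources. *)
type reserved_area = { base : int; limit : int } (* covers [base, limit) *)

(* Reserve one contiguous area large enough for each domain's minor heap.
   Assumption: the area is minor_heap_max * max_domains in size. *)
let reserve ~base ~minor_heap_max ~max_domains =
  { base; limit = base + (minor_heap_max * max_domains) }

(* With a single reserved area, "is this value young?" is a range check. *)
let is_young area addr = addr >= area.base && addr < area.limit

let () =
  let area = reserve ~base:0x1000 ~minor_heap_max:0x400 ~max_domains:4 in
  assert (is_young area 0x1000);        (* first address of the area *)
  assert (is_young area 0x1fff);        (* last address of the area *)
  assert (not (is_young area 0x2000))   (* one past the end *)
```

The point of the sketch is that no per-domain lookup is needed to answer the question: two comparisons against fixed bounds are enough.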

Completed

Systhreads support

  • ocaml-multicore/ocaml-multicore#381
    Reimplementing Systhreads with pthreads (Domain execution contexts)

    The re-implementation of Systhreads with pthreads has been completed
    for Multicore OCaml. The Domain Execution Context (DEC) is
    introduced which allows multiple threads to run atop a domain.

  • ocaml-multicore/ocaml-multicore#410
    systhreads: caml_c_thread_register and caml_c_thread_unregister

    The caml_c_thread_register and caml_c_thread_unregister
    functions have been reimported to systhreads. In Multicore OCaml,
    threads created by C code will be registered in domain 0's thread
    chain.

Domain Local Storage

  • ocaml-multicore/ocaml-multicore#404
    Domain.DLS.new_key takes an initialiser

    The Domain.DLS.new_key now accepts an initialiser argument to
    assign an associated value to a key, if not initialised
    already. Also, Domain.DLS.get no longer returns an option value.

  • ocaml-multicore/ocaml-multicore#405
    Rework Domain.DLS.get search function such that it no longer allocates

    The Domain.DLS.get has been updated to remove any memory
    allocation, if the key already exists in the domain local
    storage. The PR also changes the search function to accept all
    inputs as variables, instead of a closure from the environment.
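
As a rough illustration of the Domain.DLS changes above, here is a minimal sketch of a per-domain counter. It assumes the `Domain` module as it ships in the OCaml 5.x standard library, which may differ in detail from the Multicore OCaml tree at the time of this report; `counter_key` and `bump` are names made up for the example.

```ocaml
(* Assumes OCaml 5.x (or a Multicore OCaml switch) for the Domain module. *)

(* The initialiser passed to new_key runs lazily in each domain,
   the first time the key is read there. *)
let counter_key : int ref Domain.DLS.key =
  Domain.DLS.new_key (fun () -> ref 0)

(* Domain.DLS.get returns the value directly: there is no option to unwrap. *)
let bump () =
  let c = Domain.DLS.get counter_key in
  incr c;
  !c

let () =
  ignore (bump ());
  ignore (bump ());
  (* A freshly spawned domain gets its own counter, starting from 0. *)
  let in_child = Domain.join (Domain.spawn (fun () -> bump ())) in
  let in_parent = !(Domain.DLS.get counter_key) in
  Printf.printf "child saw %d, parent has %d\n" in_child in_parent
```

Under those assumptions the program prints `child saw 1, parent has 2`, since the parent's two increments are not visible to the child domain.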

Lwt

  • ocaml-multicore/multicore-opam#33
    Add lwt.5.3.0+multicore

    The Lwt.5.3.0 concurrency library has been added to support CPU
    parallelism with Multicore OCaml. A blog post introducing its
    installation and usage has been written by Sudha Parimala.

  • The Asynchronous Effect-based IO builds with a recent
    Lwt, and the HTTP effects demo has been updated to work with
    Multicore OCaml, Lwt, and httpaf. The demo source code is available
    at the http-effects repo.

Sundries

  • ocaml-multicore/ocaml-multicore#406
    Remove ephemeron usage of RPC

    The inter-domain RPC mechanism is not required with the
    stop-the-world minor GC, and hence the same has been removed in the
    ephemeron implementation. The PR also cleans up and simplifies the
    ephemeron data structure and code.

  • ocaml-multicore/ocaml-multicore#411
    Fix typo for presume and presume_arg in internal_variable_names

    A minor typo fix that renames Presume and Presume_arg in
    internal_variable_names.ml.

  • ocaml-multicore/ocaml-multicore#414
    Fix up Ppoll semantics_of_primitives entry

    The semantics_of_primitives entry for Ppoll has been fixed; the
    incorrect entry was causing flambda builds to remove poll points.

  • ocaml-multicore/ocaml-multicore#416
    Fix callback effect bug

    The PR fixes a bug in the way a C-to-OCaml callback prevents effects
    from crossing a C callback boundary. The stack parent is cleared
    before a callback, and restored afterwards. It also makes the stack
    parent a local root, so that the GC can see it inside the callback.

Benchmarking

Ongoing

Configuration

  • ocaml-bench/ocaml-bench-scripts#12
    Add support for parallel multibench targets and JSON input

    The RUN_CONFIG_JSON and BUILD_BENCH_TARGET variables are now
    added and passed during run-time for the execution of parallel
    benchmarks. Default values are specified so that the serial
    benchmarks can still run without explicitly requiring the same.

  • ocaml-bench/sandmark#180
    Notebook Refactoring and User changes

    A refactoring effort is underway to make the parallel benchmark
    interactive. The user accounts on The Littlest JupyterHub
    installation have direct access to the benchmark results produced
    from ocaml-bench-scripts on the system.

  • ocaml-bench/sandmark#189
    Add environment support for wrapper in JSON configuration file

    The OCAMLRUNPARAM is now passed as an environment variable to the
    benchmarks at run time, so that different parameter values can
    be used to obtain multiple results for comparison. The use case and
    the discussion are available in the “Running benchmarks with varying
    OCAMLRUNPARAM” issue. The environment variables can be specified in
    the run_config.json file, as shown below:

      {
        "name": "orun_2M",
        "environment": "OCAMLRUNPARAM='s=2M'",
        "command": "orun -o %{output} -- taskset --cpu-list 5 %{command}"
      }
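
When comparing runs made with different OCAMLRUNPARAM values, it can help to confirm which setting a benchmark actually received. The sketch below is not part of Sandmark; it only uses the standard `Gc` module, relying on the fact that the `s` parameter sets the minor heap size in words and that `Gc.get` reports it in the same unit. The helper name `minor_heap_words` is made up for the example.

```ocaml
(* Report the minor heap size the runtime is using, so that a run started
   with e.g. OCAMLRUNPARAM='s=2M' can be told apart from a default run. *)
let minor_heap_words () = (Gc.get ()).Gc.minor_heap_size

let () =
  let words = minor_heap_words () in
  (* Sys.word_size is in bits, so divide by 8 to get bytes per word. *)
  let bytes = words * (Sys.word_size / 8) in
  Printf.printf "minor heap: %d words (%d bytes)\n" words bytes
```

Running the same executable once without OCAMLRUNPARAM and once with `OCAMLRUNPARAM='s=2M'` should show two different sizes.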
    

Proposals

  • ocaml-bench/sandmark#159
    Implement a better way to describe tasklet cpulist

    The discussion to implement a better way to obtain the taskset list
    of cores for a benchmark run is still in progress. This is required
    to be able to specify hyper-threaded cores, NUMA zones, and the
    specific cores to use for the parallel benchmarks.

  • ocaml-bench/sandmark#179
    [RFC] Classifying benchmarks based on running time

    A proposal to categorize the benchmarks based on their running time
    has been provided. The following classification types have been
    suggested:

    • lt_1s: Benchmarks that run for less than 1 second.
    • lt_10s: Benchmarks that run for at least 1 second, but less than 10 seconds.
    • 10s_100s: Benchmarks that run for at least 10 seconds, but less than 100 seconds.
    • gt_100s: Benchmarks that run for at least 100 seconds.

    The PR for the same is available at “Classification of benchmarks”.

  • We are exploring the use of the opam-compiler switch environment to
    build the Sandmark benchmark test suite. The merge of systhreads
    compatibility support now enables us to install dune natively inside
    the switch environment, along with the other benchmarks. With this
    approach, we hope to modularize our benchmarking test suite, and
    converge to fully using the OCaml tools and ecosystem.
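
The running-time classes proposed in sandmark#179 map naturally onto a small classification function. This is a minimal sketch of that mapping, not the code from the PR: the type, the function names, and the sample benchmark names and timings are all invented for illustration, and only the four labels and their boundaries come from the proposal.

```ocaml
(* Running-time classes, named after the labels in the proposal. *)
type time_class = Lt_1s | Lt_10s | From_10s_to_100s | Gt_100s

(* Lower bounds are inclusive and upper bounds exclusive, as proposed. *)
let classify (seconds : float) : time_class =
  if seconds < 1.0 then Lt_1s
  else if seconds < 10.0 then Lt_10s
  else if seconds < 100.0 then From_10s_to_100s
  else Gt_100s

let label = function
  | Lt_1s -> "lt_1s"
  | Lt_10s -> "lt_10s"
  | From_10s_to_100s -> "10s_100s"
  | Gt_100s -> "gt_100s"

let () =
  (* Hypothetical benchmark names and timings, for illustration only. *)
  List.iter
    (fun (name, secs) -> Printf.printf "%s: %s\n" name (label (classify secs)))
    [ ("quick", 0.4); ("medium", 7.5); ("long", 42.0); ("very_long", 250.0) ]
```

A benchmark that takes exactly 10 seconds lands in 10s_100s, and one that takes exactly 100 seconds lands in gt_100s.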

Sundries

  • ocaml-bench/sandmark#181
    Lock-free map bench

    An implementation of a concurrent hash-array mapped trie that is
    lock-free, and is based on Prokopec, A. et al. (2011). This
    cache-aware implementation benchmark is currently under review.

  • ocaml-bench/sandmark#183
    Use crout_decomposition name for numerical analysis benchmark

    A couple of LU decomposition benchmarks exist in the Sandmark
    repository, and this PR renames the
    numerical-analysis/lu_decomposition.ml benchmark to
    crout_decomposition.ml. This is to address “Rename
    lu_decomposition benchmark in numerical-analysis”, and removes
    any naming confusion between the two benchmarks, as their
    implementations are different.

Completed

  • ocaml-bench/sandmark#177
    Display raw baseline numbers in normalized graphs

    The raw baseline numbers are now included in the normalized graphs
    in the sequential notebook output. The graph for maxrsskb, for
    example, is shown below:

  • ocaml-bench/sandmark#178
    Change to new Domain.DLS API with Initializer

    The multicore-minilight and multicore-numerical benchmarks have
    now been updated to use the new Domain.DLS API with initializer.

  • ocaml-bench/sandmark#185
    Clean up existing effect benchmarks

    The PR ensures that the code compiles without any warnings, and adds
    a multicore_effects_run_config.json configuration file, and a
    run_all_effect.sh script to execute the same.

  • ocaml-bench/sandmark#186
    Very simple effect microbenchmarks to cover code paths

    A set of four microbenchmarks to test the throughput of our effects
    system have now been added to the Sandmark test suite. These include
    effect_throughput_clone, effect_throughput_val,
    effect_throughput_perform, and effect_throughput_perform_drop.

  • ocaml-bench/sandmark#187
    Implementation of ‘recursion’ benchmarks for effects

    A collection of recursion benchmarks to measure the overhead of
    effects is now included in Sandmark. This is inspired by the
    [Manticore benchmarks](https://github.com/ManticoreProject/benchmark/).
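
To give a feel for what an effect throughput microbenchmark measures, here is a minimal sketch that performs an effect in a loop and resumes immediately from the handler. It is not the Sandmark code: it assumes the `Effect` module from the OCaml 5.x standard library rather than the effect syntax used in the Multicore OCaml tree at the time, and the `Tick` effect, `count_ticks` and `perform_n` are names made up for the example.

```ocaml
(* Assumes OCaml 5.x: uses the Effect module from the standard library. *)
open Effect
open Effect.Deep

(* Hypothetical effect used only for this sketch. *)
type _ Effect.t += Tick : unit Effect.t

(* Run [f ()] under a handler that counts each Tick and resumes at once. *)
let count_ticks (f : unit -> unit) : int =
  let count = ref 0 in
  try_with f ()
    { effc =
        (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Tick ->
            Some
              (fun (k : (a, _) continuation) ->
                incr count;
                continue k ())
          | _ -> None) };
  !count

(* Perform the effect [n] times. *)
let perform_n n () =
  for _i = 1 to n do
    perform Tick
  done

let () =
  let n = 1_000_000 in
  let t0 = Sys.time () in
  let handled = count_ticks (perform_n n) in
  let elapsed = Sys.time () -. t0 in
  Printf.printf "%d performs handled in %.3fs\n" handled elapsed
```

Each iteration pays for one perform, one handler invocation and one resumption, which is the code path such a microbenchmark is meant to exercise.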

OCaml

Ongoing

  • ocaml/ocaml#9876
    Do not cache young_limit in a processor register

    The PR removes the caching of young_limit in a register for ARM64,
    PowerPC and RISC-V ports. The Sandmark benchmarks are presently
    being tested on the respective hardware.

  • ocaml/ocaml#9934
    Prefetching optimisations for sweeping

    The Sandmark benchmarking tests were performed for analysing a
    couple of patches that optimise sweep_slice, and for the use of
    prefetching. The objective is to reduce cache misses during GC.

Completed

  • ocaml/ocaml#9947
    Add a naked pointers dynamic checker

    The check for “naked pointers” (dangerous out-of-heap pointers) is
    now done at run time, and tests for the three modes (naked pointers,
    naked pointers and dynamic checker, and no naked pointers) have been
    added in the PR.

  • ocaml/ocaml#9951
    Ensure that the mark stack push optimisation handles naked pointers

    The PR adds a precise check on whether to push an object into the
    mark stack, to handle naked pointers.

We would like to thank all the OCaml users and developers in the community for their continued support, reviews and contribution to the project.

Acronyms

  • AEIO: Asynchronous Effect-based IO
  • API: Application Programming Interface
  • ARM: Advanced RISC Machine
  • CPU: Central Processing Unit
  • DEC: Domain Execution Context
  • DLS: Domain Local Storage
  • GC: Garbage Collector
  • HTTP: Hypertext Transfer Protocol
  • JSON: JavaScript Object Notation
  • NUMA: Non-Uniform Memory Access
  • OPAM: OCaml Package Manager
  • OS: Operating System
  • PR: Pull Request
  • RISC-V: Reduced Instruction Set Computing - V
  • RPC: Remote Procedure Call
  • STW: Stop-The-World

I am not able to understand all the points, or to see how huge the work is behind all the changes mentioned, but I really appreciate the communication effort, and how you summarise this work in a comprehensible way.

Your monthly reports help to understand how the inclusion of such a change works, and the communication to the community is a great plus.

So thank you for both the work, and the reports :slight_smile:
