Multicore OCaml: October 2020

avsm · November 9, 2020, 8:44am

Welcome to the October 2020 multicore OCaml report, compiled by @shakthimaan, @kayceesrk and of course myself. The [previous monthly](https://discuss.ocaml.org/tag/multicore-monthly) updates are also available for your perusal.

OCaml 4.12.0-dev: The upstream OCaml tree has been branched for the 4.12 release, and the OCaml readiness team is busy stabilising it with the ecosystem. The 4.12.0 development stream has seen significant progress towards multicore support, especially in the runtime handling of naked pointers. The release will ship with a dynamic checker for naked pointers that you can use to verify that your own codebase is clean of them, as this will be a prerequisite for OCaml 5.0 and multicore compatibility. The checker is activated via the --enable-naked-pointers-checker configure option.

Convergence with upstream and multicore trees: The multicore OCaml trees have seen significant robustness improvements as we’ve converged our trees with upstream OCaml (possible now that the upstream architectural changes are synched with the requirements of multicore). In particular, the handling of global C roots is much better in multicore now as it uses the upstream OCaml scheme, and the GC colour scheme also exactly matches upstream OCaml’s. This means that community libraries from opam work increasingly well when built with multicore OCaml (using the no-effects-syntax branch).

Features: Multicore OCaml is also using domain local allocation buffers now to simplify its internals. We are also working on benchmarking the IO subsystem. Support for CPU parallelism has been added for the Lwt concurrency library, and the Asynchronous Effect-based IO (aeio) library has been refreshed to work with Multicore OCaml, Lwt, and httpaf in an http-effects library.

Benchmarking: The Sandmark benchmarking test suite has additional configuration options, and there are new proposals in that project to leverage the OCaml tools and ecosystem as much as possible.

As with previous updates, the ongoing and completed Multicore OCaml tasks are listed first, followed by improvements to the Sandmark benchmarking test suite. Finally, the upstream OCaml related work is mentioned for your reference.

Multicore OCaml

Ongoing

ocaml-multicore/ocaml-multicore#422
Simplify minor heaps configuration logic and masking

The PR is a step towards using domain local allocation buffers. A
Minor_heap_max size is used to reserve the minor heaps area, which
lets Is_young rely on a boundary check. Minor_heap_max can be
overridden using the OCAMLRUNPARAM environment variable.
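
The boundary check can be pictured with a small model. The sketch below is not the runtime's C code: it treats addresses as plain integers, and the names `minor_area`, `reserve` and `is_young` are hypothetical. It assumes a single contiguous reserved range sized for an assumed maximum number of domains, so that testing whether a value is young reduces to two comparisons against the ends of that range.

```ocaml
(* Illustrative model only: addresses are ints, not real pointers. *)
type minor_area = {
  start : int;  (* first address of the reserved minor heaps area *)
  stop : int;   (* one past the last address of the reserved area *)
}

(* Reserve room for [max_domains] minor heaps of [minor_heap_max] words,
   assuming [word_size] bytes per word. All parameters are assumptions. *)
let reserve ~base ~minor_heap_max ~max_domains ~word_size =
  { start = base; stop = base + (minor_heap_max * max_domains * word_size) }

(* A value is young if its address falls inside the reserved area. *)
let is_young area addr = addr >= area.start && addr < area.stop

let () =
  let area =
    reserve ~base:0x10000 ~minor_heap_max:1024 ~max_domains:4 ~word_size:8
  in
  Printf.printf "inside: %b, outside: %b\n"
    (is_young area 0x10008)
    (is_young area (area.stop + 8))
```

Because the whole area is reserved up front, the check does not need to consult any per-domain table in this model.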
ocaml-multicore/ocaml-multicore#426
Replace global roots implementation

An effort to replace the existing global roots implementation so that
it is in line with OCaml’s globroots. The objective is also to have a
per-domain skip list, and a global orphans list for roots left behind
when a domain is terminated.
ocaml-multicore/ocaml-multicore#427
Garbage Collector colours change backport

The Garbage Collector colour scheme changes in the major collector
have now been backported to Multicore OCaml. The mark_entry does not
include end, mark_stack_push more closely resembles trunk, and
caml_shrink_mark_stack has been adapted from trunk.
ocaml-multicore/ocaml-multicore#429
Fix a STW interrupt race

The STW interrupt race in
caml_try_run_on_all_domains_with_spin_work is fixed in this PR,
where the enter_spin_callback and enter_spin_data fields of
stw_request are initialized after we interrupt other domains.

Completed

Systhreads support

ocaml-multicore/ocaml-multicore#381
Reimplementing Systhreads with pthreads (Domain execution contexts)

The re-implementation of Systhreads with pthreads has been completed
for Multicore OCaml. The Domain Execution Context (DEC) is
introduced which allows multiple threads to run atop a domain.
ocaml-multicore/ocaml-multicore#410
systhreads: caml_c_thread_register and caml_c_thread_unregister

The caml_c_thread_register and caml_c_thread_unregister
functions have been reimported to systhreads. In Multicore OCaml,
threads created by C code will be registered in the thread chain of
domain 0.

Domain Local Storage

ocaml-multicore/ocaml-multicore#404
Domain.DLS.new_key takes an initialiser

The Domain.DLS.new_key now accepts an initialiser argument to
assign an associated value to a key, if not initialised
already. Also, Domain.DLS.get no longer returns an option value.
ocaml-multicore/ocaml-multicore#405
Rework Domain.DLS.get search function such that it no longer allocates

The Domain.DLS.get has been updated to remove any memory
allocation, if the key already exists in the domain local
storage. The PR also changes the search function to accept all
inputs as variables, instead of a closure from the environment.
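
As a rough illustration of the API shape described in these two PRs, the sketch below uses the `Domain.DLS` module as it appears in the OCaml 5 standard library, on the assumption that it matches the interface introduced here; the multicore branch of October 2020 may differ in detail. The `counter` key and the `bump` and `child_total` functions are hypothetical names.

```ocaml
(* Assumes OCaml 5's Domain.DLS. The key is created with an initialiser,
   which runs the first time a given domain reads the key. *)
let counter : int ref Domain.DLS.key =
  Domain.DLS.new_key (fun () -> ref 0)

(* [get] returns the value directly, with no option to unwrap. *)
let bump () =
  let r = Domain.DLS.get counter in
  incr r;
  !r

(* Bump the counter twice inside a fresh domain and return its value. *)
let child_total () =
  Domain.join
    (Domain.spawn (fun () ->
         ignore (bump ());
         bump ()))

let () =
  (* Each domain has its own counter, so the two totals are independent. *)
  let in_child = child_total () in
  let in_parent = bump () in
  Printf.printf "child: %d, parent: %d\n" in_child in_parent
```

The initialiser is what allows `get` to return a plain value: a key that has not been set yet in a domain is simply initialised on first access.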

Lwt

ocaml-multicore/multicore-opam#33
Add lwt.5.3.0+multicore

The Lwt.5.3.0 concurrency library has been added to support CPU
parallelism with Multicore OCaml. A blog post introducing its
installation and usage has been written by Sudha Parimala.
The Asynchronous Effect-based IO builds with a recent Lwt, and the
HTTP effects demo has been updated to work with Multicore OCaml, Lwt,
and httpaf. The demo source code is available at the http-effects
repo.

Sundries

ocaml-multicore/ocaml-multicore#406
Remove ephemeron usage of RPC

The inter-domain RPC mechanism is not required with the
stop-the-world minor GC, and hence it has been removed from the
ephemeron implementation. The PR also cleans up and simplifies the
ephemeron data structure and code.
ocaml-multicore/ocaml-multicore#411
Fix typo for presume and presume_arg in internal_variable_names

A minor typo fix to rename Presume and Presume_arg in
internal_variable_names.ml.
ocaml-multicore/ocaml-multicore#414
Fix up Ppoll semantics_of_primitives entry

The semantics_of_primitives entry for Ppoll, which was causing
flambda builds to remove poll points, has been fixed.
ocaml-multicore/ocaml-multicore#416
Fix callback effect bug

The PR fixes a bug when the C-to-OCaml callback prevents effects
crossing a C callback boundary. The stack parent is cleared before a
callback, and restored afterwards. It also makes the stack parent a
local root, so that the GC can see it inside the callback.

Benchmarking

Ongoing

Configuration

ocaml-bench/ocaml-bench-scripts#12
Add support for parallel multibench targets and JSON input

The RUN_CONFIG_JSON and BUILD_BENCH_TARGET variables have been
added and are passed at run time for the execution of parallel
benchmarks. Default values are specified so that the serial
benchmarks can still run without setting them explicitly.
ocaml-bench/sandmark#180
Notebook Refactoring and User changes

A refactoring effort is underway to make the parallel benchmark
interactive. The user accounts on The Littlest JupyterHub
installation have direct access to the benchmark results produced
from ocaml-bench-scripts on the system.
ocaml-bench/sandmark#189
Add environment support for wrapper in JSON configuration file

OCAMLRUNPARAM is now passed as an environment variable to the
benchmarks at run time, so that different parameter values can be
used to obtain multiple results for comparison. The use case and the
discussion are available in the Running benchmarks with varying
OCAMLRUNPARAM issue. The environment variables can be specified in
the run_config.json file, as shown below:
```
{
  "name": "orun_2M",
  "environment": "OCAMLRUNPARAM='s=2M'",
  "command": "orun -o %{output} -- taskset --cpu-list 5 %{command}"
}
```
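
Comparing several `OCAMLRUNPARAM` settings needs one wrapper entry per setting. The sketch below is a hypothetical generator, not part of Sandmark: it prints entries in the same shape as the fragment above for a list of minor heap sizes. The function name `wrapper_entry` and the chosen sizes are assumptions made for illustration.

```ocaml
(* Hypothetical helper: builds one wrapper entry for a given value of the
   OCAMLRUNPARAM [s] (minor heap size) parameter. *)
let wrapper_entry size =
  String.concat "\n"
    [ "{";
      Printf.sprintf "  \"name\": \"orun_%s\"," size;
      Printf.sprintf "  \"environment\": \"OCAMLRUNPARAM='s=%s'\"," size;
      "  \"command\": \"orun -o %{output} -- taskset --cpu-list 5 %{command}\"";
      "}" ]

let () =
  (* The sizes are arbitrary examples, not recommended settings. *)
  [ "512k"; "2M"; "8M" ]
  |> List.map wrapper_entry
  |> String.concat ",\n"
  |> print_endline
```

The output is a comma-separated run of objects that could be pasted into the wrappers list of a run configuration, assuming that list accepts entries of this shape.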

Proposals

ocaml-bench/sandmark#159
Implement a better way to describe tasklet cpulist

The discussion to implement a better way to obtain the taskset list
of cores for a benchmark run is still in progress. This is required
to be able to specify hyper-threaded cores, NUMA zones, and the
specific cores to use for the parallel benchmarks.
ocaml-bench/sandmark#179
[RFC] Classifying benchmarks based on running time

A proposal to categorize the benchmarks based on their running time
has been provided. The following classification types have been
suggested:
- lt_1s: Benchmarks that run for less than 1 second.
- lt_10s: Benchmarks that run for at least 1 second, but less than 10 seconds.
- 10s_100s: Benchmarks that run for at least 10 seconds, but less than 100 seconds.
- gt_100s: Benchmarks that run for at least 100 seconds.
The PR for the same is available at Classification of benchmarks.
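
The proposed categories amount to a simple threshold function. The sketch below is a minimal reading of the list above, not the code from the PR; the `classify` function and the benchmark names in the usage example are hypothetical.

```ocaml
(* Classify a benchmark by its running time in seconds, using the
   category names from the proposal. *)
let classify seconds =
  if seconds < 1.0 then "lt_1s"
  else if seconds < 10.0 then "lt_10s"
  else if seconds < 100.0 then "10s_100s"
  else "gt_100s"

let () =
  (* Made-up benchmark names and timings, for illustration only. *)
  List.iter
    (fun (name, t) -> Printf.printf "%s: %s\n" name (classify t))
    [ ("bench_a", 0.4); ("bench_b", 12.5); ("bench_c", 250.0) ]
```

Each lower bound is inclusive, so a benchmark that runs for exactly 10 seconds falls in 10s_100s rather than lt_10s.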
We are exploring the use of an opam-compiler switch environment to
build the Sandmark benchmark test suite. The merge of systhreads
compatibility support now enables us to install dune natively inside
the switch environment, along with the other benchmarks. With this
approach, we hope to modularize our benchmarking test suite, and
converge to fully using the OCaml tools and ecosystem.

Sundries

ocaml-bench/sandmark#181
Lock-free map bench

An implementation of a concurrent hash-array mapped trie that is
lock-free, and is based on Prokopec, A. et al. (2011). This
cache-aware benchmark implementation is currently under review.
ocaml-bench/sandmark#183
Use crout_decomposition name for numerical analysis benchmark

A couple of LU decomposition benchmarks exist in the Sandmark
repository, and this PR renames the
numerical-analysis/lu_decomposition.ml benchmark to
crout_decomposition.ml. This addresses the issue Rename
lu_decomposition benchmark in numerical-analysis, and avoids any
naming confusion between the two benchmarks, as their
implementations are different.

Completed

ocaml-bench/sandmark#177
Display raw baseline numbers in normalized graphs

The raw baseline numbers are now included in the normalized graphs
in the sequential notebook output. The graph for maxrsskb is one
example.

ocaml-bench/sandmark#178
Change to new Domain.DLS API with Initializer

The multicore-minilight and multicore-numerical benchmarks have
now been updated to use the new Domain.DLS API with initializer.
ocaml-bench/sandmark#185
Clean up existing effect benchmarks

The PR ensures that the code compiles without any warnings, and adds
a multicore_effects_run_config.json configuration file, and a
run_all_effect.sh script to execute the same.
ocaml-bench/sandmark#186
Very simple effect microbenchmarks to cover code paths

A set of four microbenchmarks to test the throughput of our effects
system have now been added to the Sandmark test suite. These include
effect_throughput_clone, effect_throughput_val,
effect_throughput_perform, and effect_throughput_perform_drop.
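
To give a flavour of what a perform-throughput measurement looks like, here is a minimal sketch. It is not the Sandmark benchmark source: it assumes the `Effect` module of the OCaml 5 standard library rather than the dedicated effect syntax of the 2020 multicore branch, and the `Tick` effect and `count_performs` function are hypothetical.

```ocaml
open Effect
open Effect.Deep

(* Hypothetical effect used only for this sketch. *)
type _ Effect.t += Tick : unit Effect.t

(* Perform [n] effects under a deep handler that counts each one and
   resumes the computation immediately. *)
let count_performs n =
  let handled = ref 0 in
  try_with
    (fun () ->
      for _ = 1 to n do
        perform Tick
      done)
    ()
    { effc =
        (fun (type a) (eff : a Effect.t) ->
          match eff with
          | Tick ->
              Some
                (fun (k : (a, _) continuation) ->
                  incr handled;
                  continue k ())
          | _ -> None) };
  !handled

let () =
  let n = 1_000_000 in
  let t0 = Sys.time () in
  let handled = count_performs n in
  let dt = Sys.time () -. t0 in
  Printf.printf "%d performs handled in %.3fs\n" handled dt
```

Dividing the count by the elapsed time gives a crude performs-per-second figure; a real benchmark would repeat the run and control for GC and CPU placement.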
ocaml-bench/sandmark#187
Implementation of ‘recursion’ benchmarks for effects

A collection of recursion benchmarks to measure the overhead of
effects is now included in Sandmark. This is inspired by the
[Manticore benchmarks](https://github.com/ManticoreProject/benchmark/).

OCaml

Ongoing

ocaml/ocaml#9876
Do not cache young_limit in a processor register

The PR removes the caching of young_limit in a register for the
ARM64, PowerPC and RISC-V ports. The Sandmark benchmarks are
presently being tested on the respective hardware.
ocaml/ocaml#9934
Prefetching optimisations for sweeping

The Sandmark benchmarking tests were performed for analysing a
couple of patches that optimise sweep_slice, and for the use of
prefetching. The objective is to reduce cache misses during GC.

Completed

ocaml/ocaml#9947
Add a naked pointers dynamic checker

The check for “naked pointers” (dangerous out-of-heap pointers) is
now done at run time. The PR adds tests for three modes: naked
pointers, naked pointers with the dynamic checker, and no naked
pointers.
ocaml/ocaml#9951
Ensure that the mark stack push optimisation handles naked pointers

The PR adds a precise check on whether to push an object into the
mark stack, to handle naked pointers.

We would like to thank all the OCaml users and developers in the community for their continued support, reviews and contributions to the project.

Acronyms

AEIO: Asynchronous Effect-based IO
API: Application Programming Interface
ARM: Advanced RISC Machine
CPU: Central Processing Unit
DEC: Domain Execution Context
DLS: Domain Local Storage
GC: Garbage Collector
HTTP: Hypertext Transfer Protocol
JSON: JavaScript Object Notation
NUMA: Non-Uniform Memory Access
OPAM: OCaml Package Manager
OS: Operating System
PR: Pull Request
RISC-V: Reduced Instruction Set Computing - V
RPC: Remote Procedure Call
STW: Stop-The-World

Chimrod · November 10, 2020, 8:14am

I am not able to understand all the points, or to see how huge the work behind all the changes mentioned is, but I really appreciate the communication effort, and how you summarise this work in a comprehensible way.

Your monthly reports help to understand how the inclusion of such a change works, and the communication to the community is a great plus.

So thank you for both the work, and the reports.
