Welcome to the May 2021 Multicore OCaml monthly report! This month’s update, along with the previous updates, has been compiled by @avsm, @ctk21, @kayceesrk and @shakthimaan.
Firstly, all of our upstream activity on the OCaml compiler is now reported as part of the shiny new compiler development newsletter #2 that @gasche has started. This represents a small but important shift – domains-only multicore is firmly locked in on the upstream roadmap for OCaml 5.0 and the whole OCaml compiler team has been helping and contributing to it, with the GC safe points feature being one of the last major multicore-prerequisites (and due to be in OCaml 4.13 soon).
This multicore newsletter will now focus on getting our ecosystem ready for domains-only multicore in OCaml 5.0, and on how the (not-yet-official) effect system and multicore IO stack are progressing. It’s a long one this month, so settle in with your favourite beverage and let’s begin!
OCaml Multicore: 4.12.0+domains
The multicore compiler now supports CTF runtime traces of its garbage collector, and there are tools to display Chrome tracing visualisations of the garbage collector events. A number of performance improvements (see speedup graphs later on) that highlight some ways to make best use of multicore were made to the existing benchmarks in Sandmark. There has also been work on scaling up to 128 cores/domains for task-based parallelism in domainslib using work stealing deques, bringing us closer to Cilk-style task-parallel performance.
As important as new features are what we have decided not to do. We’ve been working on and evaluating Domain Local Allocation Buffers (DLABs) for some time, with the intention of reducing the cost of minor GCs. We’ve found that the resulting performance didn’t match our expectations (vs the complexity of the change), and so we’ve decided not to proceed with this for OCaml 5.0. The DLAB summary page summarises our experiences. We’ll come back to this post-OCaml 5.0 when there are fewer moving parts.
Ecosystem changes to prepare for 5.0.0 domains-only
As we are preparing 5.0 branches with the multicore branches over the coming months, we are stepping up preparations to ensure the OCaml ecosystem is ready.
Making the multicore compilers available by default in opam-repo
Over the next few weeks, we will be merging the multicore 4.12.0+domains and associated packages from their opam remote over in ocaml-multicore/multicore-opam into the mainline opam-repository. This is to make it more convenient to use the variant compilers to start testing your own packages with Domains.
As part of this change, there are two new base packages that will be available in opam-repository:
- base-domains: This package indicates that the current compiler has the Domain module.
- base-effects: This package indicates that the current compiler has the experimental effect system.
By adding a dependency on these packages, the only valid solutions will be 4.12.0+domains (until OCaml 5.0 which will have this module) or 4.12.0+effects.
The goal of this is to let community packages more easily release versions of their code using Domains-only parallelism ahead of OCaml 5.0, so that we can start migration and thread-safety early. We do not encourage anyone to take a dependency on base-effects currently, as it is very much a moving target.
This opam-repository change isn’t in yet, but I’ll comment on this post when it is merged.
Adapting the Stdlib for thread-safety
One of the first things we have to do before porting third-party libraries is to get the Stdlib ready for thread-safety. This isn’t quite as simple as it might appear at first glance: if we adopt the naïve approach of simply putting a mutex around every bit of global state, our sequential performance will slow down. We are therefore doing a more fine-grained analysis and applying targeted fixes, which can be tracked on the multicore stdlib page.
For anyone wishing to contribute: hunt through the Stdlib for global state, categorise it appropriately, create a test case exercising that module with multiple Domains running, and submit a PR to ocaml-multicore. In general, if you see any build failures or runtime failures now, we’d really appreciate an issue being filed there too. You can see some good examples of such issues here (for mirage-crypto) and here (for Coq).
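As a rough illustration of the kind of test case described above, here is a minimal sketch that hammers a piece of global state from several Domains at once. The `next_id` counter and `fresh_id` function are hypothetical stand-ins for module-level state, not anything taken from the Stdlib, and the sketch assumes `Domain.spawn`/`Domain.join` and `Atomic` as they appear in OCaml 5.0.

```ocaml
(* Hypothetical global state: a counter used to hand out fresh ids.
   Using Atomic keeps updates safe across domains without a mutex. *)
let next_id = Atomic.make 0

(* fetch_and_add returns the value held before the increment. *)
let fresh_id () = Atomic.fetch_and_add next_id 1

(* Run [f] in [n] domains at the same time and collect the results. *)
let run_in_domains n f =
  let domains = List.init n (fun _ -> Domain.spawn f) in
  List.map Domain.join domains

let num_domains = 4
let ids_per_domain = 10_000

(* Every domain requests a batch of ids concurrently. *)
let all_ids =
  run_in_domains num_domains (fun () ->
      List.init ids_per_domain (fun _ -> fresh_id ()))
  |> List.concat

(* If the counter is thread-safe, no id is ever handed out twice. *)
let distinct = List.length (List.sort_uniq compare all_ids)

let () =
  Printf.printf "%d distinct ids out of %d requested\n"
    distinct (num_domains * ids_per_domain)
```

Swapping the `Atomic` for a plain `int ref` in a sketch like this is one way to see a data race show up as duplicate ids, although a racy version may still pass on some runs.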
Porting third-party libraries to Domains
As I mentioned last month, we put a call out for libraries and maintainers who wanted to port their code over. We’re starting with the following libraries and applications this month:
- Lwt: the famous lightweight-threads library now has a PR to add Lwt_domains. This is the first simple(ish) step to using multiple cores with Lwt: it lets you run a pure (non-Lwt) function in another Domain via detach : ('a -> 'b) -> 'a -> 'b Lwt.t.
- Mirage-Crypto: the next library we are adapting is the cryptography library, since it is also low-hanging fruit that should be easy to parallelise (since crypto functions do not have much global state). The port is still ongoing, as there are some minor build failures and also Stdlib functions in Format that aren’t yet thread-safe that are causing failures.
- Tezos-Node: the bigger application we are applying some of the previous dependencies to is Tezos-Node, which makes use of the dependency chain here via Lwt, mirage-crypto, Irmin, Cohttp and many other libraries. We’ve got this compiling under 4.12.0+domains now and mostly passing the test suite, but will only report significant results once the dependencies and Stdlib are passing.
- Owl: OCaml’s favourite machine learning library works surprisingly well out-of-the-box with 4.12.0+domains. An experiment for a significant machine-learning codebase written using it saw about a 2-4x speedup before some false-sharing bottlenecks kicked in. This is pretty good going given that we made no changes to the codebase itself, but stay tuned for more improvements over the coming months as we analyse the bottleneck.
This is hopefully a signal to all of you to start “having a go” with 4.12.0+domains on your own applications, and particularly with respect to seeing how wrapping it in Domains works out and identifying global state. You can read our handy tutorial on parallel programming with Multicore OCaml.
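To give a feel for what "wrapping it in Domains" can look like, here is a small sketch in the spirit of the detach function mentioned for Lwt above. It is a stdlib-only stand-in that returns a Domain handle rather than an Lwt promise, so it is not the Lwt_domain API; `detach` and `fib` here are names chosen for illustration.

```ocaml
(* [detach f x] runs the pure function [f] on [x] in a new domain and
   returns a handle that can be joined later. This is an illustrative
   stand-in, not the Lwt_domain interface. *)
let detach (f : 'a -> 'b) (x : 'a) : 'b Domain.t =
  Domain.spawn (fun () -> f x)

(* A deliberately slow pure function with no global state. *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

let () =
  (* Both computations run in parallel with the main domain. *)
  let a = detach fib 25 in
  let b = detach fib 20 in
  (* The main domain is free to do other work here. *)
  let ra = Domain.join a in
  let rb = Domain.join b in
  Printf.printf "fib 25 = %d, fib 20 = %d\n" ra rb
```

This only works safely because `fib` touches no shared mutable state, which is exactly the property worth checking before moving a function into another Domain.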
We are developing some tools to help find global state, but we will all need to work together to identify some of these cases and begin migration. Crucially, we need some diversity in our dependency chains: if you have interesting applications using (e.g.) Async or the vanilla Thread module and have some cycles to work with us, please get in touch with me or @kayceesrk.
4.12.0+effects
The effects-based eio library is coming together nicely, and the interface and design rationales are all up-to-date in the README of the repository. The primary IO backend is ocaml-uring, which we are preparing for a separate release to opam-repository now as it also works fine on the sequential runtime for Linux (as long as you have a fairly recent kernel). We also have a Grand Central Dispatch effect backend to give us a totally different execution model to exercise our effect handler abstractions.
While we won’t publish the performance numbers for the effect-based IO this month, you can get a sense of the sorts of tests we are running by looking at the retro-httpaf-bench repository, which now has various permutations of effects-based, uring-based and select-based webservers. We’ve submitted a talk to the upcoming OCaml Workshop later this summer, which, if accepted, will give you a deep dive into our effect-based IO.
As always, we begin with the Multicore OCaml ongoing and completed tasks. The ecosystem improvements are then listed followed by the updates to the Sandmark benchmarking project. Finally, the upstream OCaml work is mentioned for your reference. For those of you that have read this far and can think of nothing more fun than hacking on multicore programming runtimes, we are hiring in the UK, France and India – please find the job postings at the end!
Multicore OCaml
Ongoing
- ocaml-multicore/ocaml-multicore#552: Add a force_instrumented_runtime option to configure
  A new --enable-force-instrumented-runtime option is introduced to facilitate use of the instrumented runtime on linker invocations to obtain event logs.
- ocaml-multicore/ocaml-multicore#553: Testsuite failures with flambda enabled
  A number of tests are failing on b23a416 with flambda enabled, and they need to be investigated further.
- ocaml-multicore/ocaml-multicore#555: runtime: CAML_TRACE_VERSION is now set to a Multicore specific value
  Define a CAML_TRACE_VERSION to distinguish between Multicore OCaml and trunk for the runtime.
- ocaml-multicore/ocaml-multicore#558: Refactor Domain.{spawn/join} to use no critical sections
  The PR removes the use of Domain.wait and critical sections in Domain.{spawn/join}.
- ocaml-multicore/ocaml-multicore#559: Improve the Multicore GC Stats
  A draft PR to include more Multicore GC statistics when using OCAMLRUNPARAM=v=0x400.
Completed
- ocaml-multicore/ocaml-multicore#508: Domain Local Allocation Buffers
  The Domain Local Allocation Buffer implementation for OCaml Multicore has been dropped for now. A discussion is on the PR itself and there is a wiki page here.
- ocaml-multicore/ocaml-multicore#527: Port eventlog to CTF
  The porting of the eventlog implementation to the Common Trace Format is now complete.
  For an introduction to producing Chrome trace visualizations of the runtime events see eventlog-tools. This postprocessing tool turns the CTF trace into the Chrome tracing format that allows interactive visualizations.
  [Figure: Chrome tracing visualization of the runtime events]
- ocaml-multicore/ocaml-multicore#543: Parallel version of weaklifetime test
  A parallel version of the weaklifetime.ml test is now added to the test suite.
- ocaml-multicore/ocaml-multicore#546: Coverage of domain life-cycle in domain_dls and ephetest_par tests
  Additional tests to increase test coverage for domain life-cycle for domain_dls.ml and ephetest_par.ml.
- ocaml-multicore/ocaml-multicore#550: Lazy effects test
  Inclusion of a test to address effects with Lazy computations for a number of different use cases.
- ocaml-multicore/ocaml-multicore#557: Remove unused domain functions
  A clean-up to remove unused functions in domain.c and domain.h.
Ecosystem
Ongoing
- ocaml-multicore/eventlog-tools#2: Add a pausetimes tool
  The eventlog_pausetimes tool takes a directory of eventlog files and computes the mean and max pause times, as well as the distribution up to the 99.9th percentile. For example:
    ocaml-eventlog-pausetimes /home/engil/dev/ocaml-multicore/trace3/caml-426094-* name
    {
      "name": "name",
      "mean_latency": 718617,
      "max_latency": 33839379,
      "distr_latency": [191,250,707,16886,55829,105386,249272,552640,1325621,13312993,26227671]
    }
- domainslib#29: Task stealing with CL deques
  This ongoing work to use task-stealing Chase Lev deques for scheduling tasks across domains is looking very promising, particularly for machines with 128 cores.
- ocaml-multicore/retro-httpaf-bench#10: Add Eio benchmark
  The addition of an Eio benchmark for retro-httpaf-bench. This is a work-in-progress.
- ocaml-multicore/eio#26: Grand Central Dispatch Backend
  An early draft PR that implements the Grand Central Dispatch (GCD) backend for Eio.
- ocsigen/lwt#860: Lwt_domain: An interface to Multicore parallelism
  An on-going effort to introduce Lwt_domain for performing computations on CPU cores using Multicore OCaml’s Domains.
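The kind of summary the pausetimes tool above reports (mean, max and a percentile distribution) can be sketched in a few lines. This is not the tool's implementation: the nearest-rank percentile method, the chosen percentiles and the function names are all assumptions made for illustration, and the input is assumed to be a non-empty array of latencies.

```ocaml
(* Mean of a non-empty array of integer latencies. *)
let mean xs =
  float_of_int (Array.fold_left ( + ) 0 xs) /. float_of_int (Array.length xs)

(* Nearest-rank percentile on an already sorted, non-empty array:
   the smallest sample such that at least p% of samples are <= it. *)
let percentile sorted p =
  let n = Array.length sorted in
  let rank = int_of_float (ceil (p /. 100. *. float_of_int n)) in
  sorted.(max 0 (min (n - 1) (rank - 1)))

(* Returns (mean, max, distribution at a hypothetical set of percentiles). *)
let summarise latencies =
  let sorted = Array.copy latencies in
  Array.sort compare sorted;
  let max_latency = sorted.(Array.length sorted - 1) in
  let distr = List.map (percentile sorted) [ 50.; 90.; 99.; 99.9 ] in
  (mean sorted, max_latency, distr)

let () =
  (* Made-up sample pause times. *)
  let m, mx, distr = summarise [| 30; 10; 50; 20; 40; 100; 60; 90; 70; 80 |] in
  Printf.printf "mean=%.1f max=%d distr=[%s]\n" m mx
    (String.concat "," (List.map string_of_int distr))
```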
Completed
retro-httpaf-bench
The retro-httpaf-bench repository contains scripts for running HTTP server benchmarks.
- ocaml-multicore/retro-httpaf-bench#6: Move OCaml to 4.12
  The build scripts have been updated to use 4.12.0.
- ocaml-multicore/retro-httpaf-bench#8: Adds a Rust benchmark using hyper
  The inclusion of the Hyper benchmark limited to a single core to match the other existing benchmarks.
- ocaml-multicore/retro-httpaf-bench#9: Release builds for dune, stretch request volumes, rust fixes and remove mimalloc
  The Dockerfile, README, build_benchmarks.sh and run_benchmarks.sh files have been updated.
- ocaml-multicore/retro-httpaf-bench#15: Make benchmark more realistic
  The PR enhances the implementation to correctly simulate a hypothetical database request, and the effects code has been updated accordingly.
eio
The eio library provides an effects-based parallel IO stack for Multicore OCaml.
- ocaml-multicore/eio#18: Add fibreslib library
  The promise library has been renamed to fibreslib to avoid a naming conflict with the existing package in opam, and the API (waiters and effects) has been split into its own respective modules.
- ocaml-multicore/eio#19: Update to latest ocaml-uring
  The code and configuration files have been updated to use the latest ocaml-uring.
- ocaml-multicore/eio#20: Add Fibreslib.Semaphore
  Implemented the Fibreslib.Semaphore module, which is useful for rate-limiting and is based on OCaml’s Semaphore.Counting.
- ocaml-multicore/eio#21: Add high-level Eio API
  A new Eio library with interfaces for sources and sinks. The README documentation has been updated with motivation and usage.
- ocaml-multicore/eio#22: Add switches for structured concurrency
  Implementation of structured concurrency with documentation examples for tracing and testing with mocks.
- ocaml-multicore/eio#23: Rename repository to eio
  The Effects based parallel IO for OCaml repository has now been renamed from eioio to eio.
- ocaml-multicore/eio#24: Rename lib_eioio to lib_eunix
  The names have been updated to match the dune file.
- ocaml-multicore/eio#25: Detect deadlocks
  An exception is now raised to detect deadlocks if the scheduler finishes while the main thread continues to run.
- ocaml-multicore/eio#27: Convert expect tests to MDX
  The expect tests have been updated to use the MDX format, and this avoids the need for ppx libraries.
- ocaml-multicore/eio#28: Use splice to copy if possible
  The effect Splice has been implemented along with the update to ocaml-uring, and necessary documentation.
- ocaml-multicore/eio#29: Improve exception handling in switches
  Additional exception checks to handle when multiple threads fail, and for Switch.check and Fibre.fork_ignore.
- ocaml-multicore/eio#30: Add eio_main library to select backend automatically
  Use eio_main to select the appropriate backend (eunix, for example) based on the platform.
- ocaml-multicore/eio#31: Add Eio.Flow API
  Implemented a Flow module that allows combinations such as bidirectional flows and closable flows.
- ocaml-multicore/eio#32: Initial support for networks
  Eio provides a high-level API for networking, and the Network module has been added.
- ocaml-multicore/eio#33: Add some design rationale notes to the README
  The README has been updated with design notes, and reference to further reading on the principles of the Object-capability model.
- ocaml-multicore/eio#34: Add shutdown, allow closing listening sockets, add cstruct_source
  Added cstruct_source and a shutdown method, along with source, sink and file descriptor types.
- ocaml-multicore/eio#35: Add Switch.on_release to auto-close FDs
  We can now attach resources such as file descriptors to switches, and these are freed when the switch is finished.
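The rate-limiting use of a counting semaphore mentioned for Fibreslib.Semaphore above can be illustrated with OCaml's own Semaphore.Counting. The sketch below limits Domains rather than fibres, so it does not show the Fibreslib API; it assumes an OCaml 5-style standard library where Semaphore is available without linking the threads library, and `with_slot`, `job` and `max_in_flight` are hypothetical names.

```ocaml
(* At most [max_in_flight] jobs may be inside the critical region at once. *)
let max_in_flight = 2
let slots = Semaphore.Counting.make max_in_flight

(* Bookkeeping used only to observe the limit being respected. *)
let running = Atomic.make 0
let peak = Atomic.make 0

(* Record [n] as the new peak if it is larger, retrying on contention. *)
let rec update_peak n =
  let p = Atomic.get peak in
  if n > p && not (Atomic.compare_and_set peak p n) then update_peak n

(* Acquire a slot, run [f], and always give the slot back. *)
let with_slot f =
  Semaphore.Counting.acquire slots;
  Fun.protect ~finally:(fun () -> Semaphore.Counting.release slots) f

let job i () =
  with_slot (fun () ->
      let n = Atomic.fetch_and_add running 1 + 1 in
      update_peak n;
      (* Simulated work. *)
      let r = ref 0 in
      for k = 1 to 100_000 do r := !r + k done;
      Atomic.decr running;
      i * i)

(* Six jobs compete for two slots. *)
let results =
  List.init 6 (fun i -> Domain.spawn (job i)) |> List.map Domain.join

let () =
  Printf.printf "peak concurrency observed: %d\n" (Atomic.get peak)
```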
Sundries
- ocaml-multicore/domainslib#23: Running tests: moving to dune runtest from manual commands in run_test target
  The dune runtest command is now used to execute the tests.
- ocaml-multicore/domainslib#24: Move to Mutex & Condition from Domain.Sync.{notify/wait}
  The channel implementation using Mutex and Condition is now complete. The performance results are shown in the following graph:
  [Figure: channel performance results graph]
- ocaml-multicore/multicore-opam#53: Add base-domains and base-effects packages
  The base-domains and base-effects opam files have now been added to multicore-opam.
- ocaml-multicore/multicore-opam#54: Shift all multicore packages to unique versions and base-domains dependencies
  The naming convention is to now use base-effects and base-domains everywhere.
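For readers curious what a channel built on Mutex and Condition (as in the domainslib item above) looks like in outline, here is a minimal unbounded channel. It is a sketch under the assumption of an OCaml 5-style standard library with Mutex and Condition built in; it is not domainslib's implementation, and the `chan` type and function names are hypothetical.

```ocaml
(* An unbounded channel: a queue guarded by a mutex, plus a condition
   variable used to wake receivers blocked on an empty queue. *)
type 'a chan = {
  queue : 'a Queue.t;
  lock : Mutex.t;
  nonempty : Condition.t;
}

let make () =
  { queue = Queue.create (); lock = Mutex.create ();
    nonempty = Condition.create () }

let send c v =
  Mutex.lock c.lock;
  Queue.push v c.queue;
  (* Wake one receiver, if any is waiting. *)
  Condition.signal c.nonempty;
  Mutex.unlock c.lock

let recv c =
  Mutex.lock c.lock;
  (* Re-check the predicate in a loop to cope with spurious wakeups. *)
  while Queue.is_empty c.queue do
    Condition.wait c.nonempty c.lock
  done;
  let v = Queue.pop c.queue in
  Mutex.unlock c.lock;
  v

(* A producer domain sends 1..n while the main domain sums what it receives. *)
let sum_via_channel n =
  let c = make () in
  let producer =
    Domain.spawn (fun () -> for i = 1 to n do send c i done)
  in
  let sum = ref 0 in
  for _i = 1 to n do sum := !sum + recv c done;
  Domain.join producer;
  !sum

let () = Printf.printf "sum received = %d\n" (sum_via_channel 100)
```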
Benchmarking
Ongoing
- ocaml-bench/sandmark#230: Build for 4.13.0+trunk with dune.2.8.1
  A work-in-progress to upgrade Sandmark to use dune.2.8.1 to build 4.13.0+trunk and generate the benchmarks. You can test the same using:
    TAG='"macro_bench"' make run_config_filtered.json
    RUN_CONFIG_JSON=run_config_filtered.json make ocaml-versions/4.13.0+trunk.bench
Completed
Sandmark
Performance
- ocaml-bench/sandmark#221: Fix up decompress iterations of work
  The use of parallel_for, simplification of data_to_compress to use String.init, and fix to correctly count the amount of work configured and done produces the following speed improvements:
  [Figure: decompress benchmark speedup graph]
- ocaml-bench/sandmark#223: A better floyd warshall
  An improvement to the Floyd Warshall implementation that fixes the random seed so that it is repeatable, and improves the pattern matching.
- ocaml-bench/sandmark#224: Some improvements for matrix multiplication
  The matrix_multiplication and matrix_multiplication_multicore code have been updated for easier maintenance, and results are written only after summing the values.
- ocaml-bench/sandmark#225: Better Multicore EA Benchmark
  The Evolutionary Algorithm now inserts a poll point into fittest to improve the benchmark results.
- ocaml-bench/sandmark#226: Better scaling for mandelbrot6_multicore
  The mandelbrot6_multicore benchmark scales well now with the use of parallel_for, as observed in the following graphs:
  [Figure: mandelbrot6_multicore scaling graphs]
- ocaml-bench/sandmark#227: Improve nbody_multicore benchmark with high core counts
  The energy function is now parallelised with parallel_for_reduce for larger core counts.
- ocaml-bench/sandmark#229: Improve game_of_life benchmarks
  The hot functions are now inlined to improve the game_of_life benchmarks, and we avoid initialising the temporary matrix with random numbers.
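Several of the benchmark changes above rely on parallel_for and parallel_for_reduce. The idea behind a parallel reduction can be sketched by hand with plain Domains, as below. This is not domainslib's implementation or signature (domainslib schedules tasks on a pool), and it assumes `combine` is associative with `init` as its identity.

```ocaml
(* Split the inclusive range [start..finish] into contiguous chunks,
   fold each chunk in its own domain, then combine the partial results.
   A hand-rolled illustration only; one domain is spawned per chunk. *)
let parallel_for_reduce ~num_domains ~start ~finish ~body combine init =
  let n = finish - start + 1 in
  (* Ceiling division so that every index belongs to some chunk. *)
  let chunk = (n + num_domains - 1) / num_domains in
  let worker d () =
    let lo = start + (d * chunk) in
    let hi = min finish (lo + chunk - 1) in
    let acc = ref init in
    for i = lo to hi do
      acc := combine !acc (body i)
    done;
    !acc
  in
  List.init num_domains (fun d -> Domain.spawn (worker d))
  |> List.map Domain.join
  |> List.fold_left combine init

(* Example: sum of squares of 1..1000 computed across 4 domains. *)
let sum_of_squares =
  parallel_for_reduce ~num_domains:4 ~start:1 ~finish:1000
    ~body:(fun i -> i * i) ( + ) 0

let () = Printf.printf "sum of squares = %d\n" sum_of_squares
```

Spawning a fresh domain per chunk keeps the sketch short, but domains are relatively heavyweight, which is one reason a reusable pool of workers is the more practical design for real benchmarks.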
Sundries
- ocaml-bench/sandmark#215: Remove Gc.promote_to from treiber_stack.ml
  The 4.12+domains and 4.12+domains+effects branches have Gc.promote_to removed from the runtime.
- ocaml-bench/sandmark#216: Add configs for 4.12.0+stock, 4.12.0+domains, 4.12.0+domains+effects
  The ocaml-version configuration files for 4.12.0+stock, 4.12.0+domains, and 4.12.0+domains+effects have now been added to Sandmark.
- ocaml-bench/sandmark#220: Attempt to improve the OCAMLRUNPARAM documentation
  The README has been updated with more documentation on the use of OCAMLRUNPARAM configuration when running the benchmarks.
- ocaml-bench/sandmark#222: Deprecate 4.06.1 and 4.10.0 and upgrade to 4.12.0
  The 4.06.1 and 4.10.0 ocaml-versions have been removed and the CI has been updated to use 4.12.0 as the default version.
current-bench
- ocurrent/current-bench#103: Ability to set scale on UI to start at 0
  The graph origins now start from [0, y_max+delta] for the y-axis for better comparison.
- ocurrent/current-bench#121: Use string representation for docker cpu setting.
  The OCAML_BENCH_DOCKER_CPU setting now switches from Integer to String to support a range of CPUs for parallel execution.
OCaml
Ongoing
- ocaml/ocaml#10039: Safepoints
  The Sandmark benchmark runs to obtain the performance numbers for the Safepoints PR for 4.13.0+trunk have been published. The PR is ready to be merged.
Job Advertisements
- Multicore OCaml Runtime Systems Engineer: OCaml Labs (UK), Tarides (France) and Segfault Systems (India)
- Benchmark Tooling Engineer: Tarides
Our thanks to all the OCaml users, developers and contributors in the
community for their continued support to the project. Stay safe!
Acronyms
- AMD: Advanced Micro Devices
- API: Application Programming Interface
- CI: Continuous Integration
- CPU: Central Processing Unit
- CTF: Common Trace Format
- DLAB: Domain Local Allocation Buffer
- EA: Evolutionary Algorithm
- GC: Garbage Collector
- GCD: Grand Central Dispatch
- HTTP: Hypertext Transfer Protocol
- OPAM: OCaml Package Manager
- MVP: Minimal Viable Product
- PR: Pull Request
- TPS: Transactions Per Second
- UI: User Interface