Multicore OCaml: March 2020
Our work this month focused primarily on performance improvements to the Multicore OCaml compiler and runtime, as part of a comprehensive evaluation exercise. We continue to add benchmarks to the Sandmark test suite. The eventlog tracing system and the use of hash tables for marshaling in upstream OCaml are in progress, and more PRs are being queued up for OCaml 4.11.0-dev as well.
The biggest observable change for users trying the branch is that a new GC (the “parallel minor gc”) has been merged in preference to the previous one (“the concurrent minor gc”). We will publish the details in longer form at a later stage, but the gist is that the parallel minor GC no longer requires a read barrier or changes to the C API. It may have slightly worse scalability properties at a very high number of cores, but is roughly equivalent at up to 24 cores in our evaluations. Given the vast usability improvement from not having to port existing C FFI uses, we have decided to make the parallel minor GC the default one for our first upstream runtime patches. The concurrent minor GC will follow at a later stage when we ramp up testing to 64-core+ machines. The multicore opam remote has been updated to reflect these changes, for those who wish to try it out at home.
We are now at a stage where we are porting larger applications to multicore. Thanks go to:
- @UnixJunkie who helped us integrate the Gram Matrix benchmark in https://github.com/ocaml-bench/sandmark/issues/99
- @jhw has done extensive work towards supporting Systhreads in https://github.com/ocaml-multicore/ocaml-multicore/pull/240. Systhreads is currently disabled in multicore, leading to some popular packages not compiling.
- @antron has been advising us on how best to port Lwt_unix modules to multicore, giving us a widely used IO stack to test more applications against.
If you have other suggestions for applications that you think might provide useful benchmarks, then please do get in touch with me or @kayceesrk.
Onto the details! The various ongoing and completed tasks for Multicore OCaml are listed first, followed by the changes to the Sandmark benchmarking infrastructure and the ongoing PRs to upstream OCaml.
Proposed implementation of threads in terms of Domain and Atomic
A new implementation of the Threads library for use with the new Domain and Atomic modules in Multicore OCaml has been proposed. With it, Dune 2.4.0 builds, which in turn makes it possible to build other packages. This PR is open for review.
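To give a feel for the two modules the proposal builds on, here is a minimal sketch that spawns a few domains and has them update a shared atomic counter. It is not the code from the PR; the function name parallel_count and its parameters are made up for illustration, and it assumes a compiler whose standard library provides Domain.spawn, Domain.join and Atomic (the multicore branch, or OCaml 5).

```ocaml
(* Illustrative only: [parallel_count] is a hypothetical helper, not part of
   the proposed Threads implementation. Assumes Domain and Atomic exist. *)
let parallel_count ~domains ~increments =
  (* One counter shared by every domain. *)
  let counter = Atomic.make 0 in
  let worker () =
    for _ = 1 to increments do
      (* Atomic.incr is an atomic read-modify-write, so no update is lost. *)
      Atomic.incr counter
    done
  in
  (* Spawn the workers, then wait for each of them to finish. *)
  let handles = List.init domains (fun _ -> Domain.spawn worker) in
  List.iter Domain.join handles;
  Atomic.get counter
```

Calling parallel_count ~domains:4 ~increments:1000 should yield 4000, because every increment goes through the atomic operation rather than a plain read followed by a write.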
Better safe points for OCaml
A newer implementation that inserts safe points at the Cmm level is being worked on in this branch.
The following PRs have been merged into Multicore OCaml:
Account correctly for incremental mark budget
The patch correctly measures the incremental mark budget value, and improves the maximum latency for the
Put the phase change event in the actual phase change code.
The PR includes the major_gc/phase_change event in the appropriate context.
Don’t take all the full pools in one go.
The code change selects one of the global_full_pools to try sweeping it later, instead of adopting all of the full ones.
Statistics for the current domain are more recent than other domains
The statistics (minor_collections) for the current domain are more recent, and are used in the right context.
caml_blit_fields should always use
The PR enforces that caml_modify_field() is always used to store
Fix bug with
The ephemerons are allocated as marked, but the keys or data can be unmarked. The blit operations copy weak references from one ephemeron to another without marking them. The patch marks the keys that are blitted in order to keep the unreachable keys alive for another major cycle.
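For readers unfamiliar with blitting weak references, the following sketch shows what a blit does at the API level. It uses the standard Weak module rather than the runtime's internal ephemeron code, so it illustrates only the copy itself and not the marking fix; the names key and blit_demo are hypothetical.

```ocaml
(* Illustrative only: demonstrates Weak.blit, not the runtime patch. *)

(* A top-level strong reference keeps the key alive across collections. *)
let key = ref 42

let blit_demo () =
  let src = Weak.create 2 and dst = Weak.create 2 in
  (* Store a weak pointer to [key] in slot 0 of the source array. *)
  Weak.set src 0 (Some key);
  (* Copy one weak pointer from src.(0) to dst.(0). *)
  Weak.blit src 0 dst 0 1;
  (* The destination now points at the very same value; slot 1 stays empty. *)
  let copied =
    match Weak.get dst 0 with
    | Some k -> k == key
    | None -> false
  in
  copied && not (Weak.check dst 1)
```

After the blit, both arrays refer weakly to the same key, which is why the collector has to treat a blitted key as freshly referenced.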
Return early for 0 length blit
The PR forces a CAMLreturn() call if the blit length is zero in
The caml_domain_alone() invocation needs to be used in the shared heap teardown, and hence the num_domains_running decrement is moved to be the last operation for at least the shared_heap lockfree fast paths.
The Sandmark performance benchmarking test suite has had newer benchmarks added, and work is underway to enhance its functionality.
Add PingPong Multicore benchmark
The PingPong benchmark, which uses producer and consumer queues, has now been included in Sandmark.
Add the read/write Irmin benchmark
A basic read/write file performance benchmark for Irmin has been added to Sandmark. You can vary the following input parameters: number of branches, number of keys, percentage of reads and writes, number of iterations, and the number of write operations.
Add Gram Matrix benchmark
The issue ocaml-bench/sandmark#99 was created to include the Gram Matrix initialization numerical benchmark. This is useful for machine learning applications and is now available in the Sandmark performance benchmark suite. The speedup (sequential_time/multi_threaded_time) versus number of cores for Multicore (Concurrent Minor Collector), Parmap and Parany is quite significant, as the accompanying graph illustrates.
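As a reminder of what is being measured, here is a small sequential sketch of a Gram matrix computation together with the speedup ratio quoted above. It is not the Sandmark benchmark code; the names dot, gram and speedup are chosen for illustration, and samples are assumed to be float arrays of equal length.

```ocaml
(* Illustrative only: a naive sequential Gram matrix, not the benchmark. *)

(* Dot product of two vectors of the same length. *)
let dot x y =
  let acc = ref 0.0 in
  Array.iteri (fun i xi -> acc := !acc +. xi *. y.(i)) x;
  !acc

(* Entry (i, j) of the Gram matrix is the dot product of samples i and j. *)
let gram samples =
  let n = Array.length samples in
  Array.init n (fun i -> Array.init n (fun j -> dot samples.(i) samples.(j)))

(* Speedup as defined in the text: sequential time over parallel time. *)
let speedup ~sequential_time ~multi_threaded_time =
  sequential_time /. multi_threaded_time
```

Each row of the result can be computed independently of the others, which is what makes this initialization a natural candidate for parallel execution.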
Add depend target in Makefile
Sandmark now includes a depend target in the Makefile that checks that the required packages, including libdw-dev, are installed and available on Ubuntu.
More parallel benchmarks
An issue has been created to add more parallel benchmarks. We will use this to keep track of the requests. Please feel free to add your wish list of benchmarks!
ocaml/ocaml#9082 Eventlog tracing system
The configure script has now been updated so that it can build on Windows. Apart from this major change, a number of minor commits have been made for the build and sanity checks. This PR is currently under review.
Reimplement output_value using a hash table to detect sharing.
The ocaml/ocaml#9293 “Use addrmap hash table for marshaling” PR has been re-implemented using a hash table and bit vector, thanks to @xavierleroy. This is a prerequisite for Multicore OCaml, which uses a concurrent garbage collector.
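The sharing that output_value has to detect is observable from ordinary OCaml code. The sketch below uses only the standard Marshal module and says nothing about the internals of the PR; the function name shared_after_roundtrip is hypothetical.

```ocaml
(* Illustrative only: shows the effect of sharing detection, not how the
   hash table inside the runtime is implemented. *)
let shared_after_roundtrip flags =
  let cell = ref 0 in
  (* Both components of the pair are the same heap block. *)
  let pair = (cell, cell) in
  let bytes = Marshal.to_string pair flags in
  let (a, b) : int ref * int ref = Marshal.from_string bytes 0 in
  (* Physical equality tells us whether the sharing survived. *)
  a == b
```

With an empty flag list the two components are still one block after the round trip, whereas passing Marshal.No_sharing turns off the detection and yields two separate copies.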
As always, we thank the OCaml developers and users in the community for their code reviews, support, and contribution to the project. From OCaml Labs, stay safe and healthy out there!