Multicore OCaml: March 2020 update

Multicore OCaml: March 2020

Welcome to the March 2020 news update from the Multicore OCaml team! This update has been assembled with @shakthimaan and @kayceesrk, as with the February and January ones.

Our work this month was primarily focused on performance improvements to the Multicore OCaml compiler and runtime, as part of a comprehensive evaluation exercise. We continue to add additional benchmarks to the Sandmark test suite. The eventlog tracing system and the use of hash tables for marshaling in upstream OCaml are in progress, and more PRs are being queued up for OCaml 4.11.0-dev as well.

The biggest observable change for users trying the branch is that a new GC (the “parallel minor gc”) has been merged in preference to the previous one (“the concurrent minor gc”). We will have the details in longer form at a later stage, but the essential gist is that the parallel minor GC no longer requires a read barrier or changes to the C API. It may have slightly worse scalability properties at a very high number of cores, but is roughly equivalent at up to 24 cores in our evaluations. Given the vast usability improvement from not having to port existing C FFI uses, we have decided to make the parallel minor GC the default one for our first upstream runtime patches. The concurrent minor GC follow at a later stage when we ramp up testing to 64-core+ machines. The multicore opam remote has been updated to reflect these changes, for those who wish to try it out at home.

We are now at a stage where we are porting larger applications to multicore. Thanks go to:

If you do have other suggestions for application that you think might provide useful benchmarks, then please do get in touch with myself or @kayceesrk.

Onto the details! The various ongoing and completed tasks for Multicore OCaml are listed first, which is followed by the changes to the Sandmark benchmarking infrastructure and ongoing PRs to upstream OCaml.

Multicore OCaml

Ongoing

  • ocaml-multicore/ocaml-multicore#240
    Proposed implementation of threads in terms of Domain and Atomic

    A new implementation of the Threads library for use with the new Domain and Atomic modules in Multicore OCaml has been proposed. This builds Dune 2.4.0 which in turn makes it useful to build other packages. This PR is open for review.

  • ocaml-multicore/safepoints-cmm-mach
    Better safe points for OCaml

    A newer implementation to insert safe points at the Cmm level is being worked upon in this branch.

Completed

The following PRs have been merged into Multicore OCaml:

  • ocaml-multicore/ocaml-multicore#303
    Account correctly for incremental mark budget

    The patch correctly measures the incremental mark budget value, and improves the maximum latency for the menhir.ocamly benchmark.

  • ocaml-multicore/ocaml-multicore#307
    Put the phase change event in the actual phase change code. The PR includes the major_gc/phase_change event in the appropriate context.

  • ocaml-multicore/ocaml-multicore#309
    Don’t take all the full pools in one go.

    The code change selects one of the global_full_pools to try sweeping it later, instead of adopting all of the full ones.

  • ocaml-multicore/ocaml-multicore#310
    Statistics for the current domain are more recent than other domains

    The statistics (minor_words, promoted_words, major_words, minor_collections) for the current domain are more recent, and are used in the right context.

  • ocaml-multicore/ocaml-multicore#315
    Writes in caml_blit_fields should always use caml_modify_field to record young_to_young pointers

    The PR enforces that caml_modify_field() is always used to store young_to_young pointers.

  • ocaml-multicore/ocaml-multicore#316
    Fix bug with Weak.blit.

    The ephemerons are allocated as marked, but, the keys or data can be unmarked. The blit operations copy weak references from one ephemeron to another without marking them. The patch marks the keys that are blitted in order to keep the unreachable keys alive for another major cycle.

  • ocaml-multicore/ocaml-multicore#317
    Return early for 0 length blit

    The PR forces a CAMLreturn() call if the blit length is zero in byterun/weak.c.

  • ocaml-multicore/ocaml-multicore#320
    Move num_domains_running decrement

    The caml_domain_alone() invocation needs to be used in the shared heap teardown, and hence the num_domains_running decrement is moved as the last operation for at least the shared_heap lockfree fast paths.

Benchmarking

The Sandmark performance benchmarking test suite has had newer benchmarks added, and work is underway to enhance its functionality.

  • ocaml-bench/sandmark#88
    Add PingPong Multicore benchmark

    The PingPong benchmark that uses producer and consumer queues has now been included into Sandmark.

  • ocaml-bench/sandmark#98
    Add the read/write Irmin benchmark

    A basic read/write file performance benchmark for Irmin has been added to Sandmark. You can vary the following input parameters: number of branches, number of keys, percentage of reads and writes, number of iterations, and the number of write operations.

  • ocaml-bench/sandmark#100
    Add Gram Matrix benchmark

    A request
    ocaml-bench/sandmark#99 to include the Gram Matrix initialization numerical benchmark was created. This is useful for machine learning applications and is now available in the Sandmark performance benchmark suite. The speedup (sequential_time/multi_threaded_time) versus number of cores for Multicore (Concurrent Minor Collector), Parmap and Parany is quite significant and illustrated in the graph:

Gram matrix speedup benchmark

  • ocaml-bench/sandmark#103
    Add depend target in Makefile

    Sandmark now includes a depend target defined in the Makefile to check that both libgmp-dev and libdw-dev packages are installed and available on Ubuntu.

  • ocaml-bench/sandmark#90
    More parallel benchmarks

    An issue has been created to add more parallel benchmarks. We will use this to keep track of the requests. Please feel free to add your wish list of benchmarks!

OCaml

Ongoing

  • ocaml/ocaml#9082 Eventlog tracing system

    The configure script has now been be updated so that it can build on Windows. Apart from this major change, a number of minor commits have been made for the build and sanity checks. This PR is currently under review.

  • ocaml/ocaml#9353
    Reimplement output_value using a hash table to detect sharing.

    The ocaml/ocaml#9293 “Use addrmap hash table for marshaling” PR has been re-implemented using a hash table and bit vector, thanks to @xavierleroy. This is a pre-requisite for Multicore OCaml that uses a concurrent garbage collector.

As always, we thank the OCaml developers and users in the community for their code reviews, support, and contribution to the project. From OCaml Labs, stay safe and healthy out there!

46 Likes

On a Linux computer with 16 cores (and a Mac with 24 cores), I ran the Parmap Vs Parany benchmark for Gram matrix initialization.
bench
c_{opt} is the best chunk-size found when using all cores of the machine.

3 Likes

This suggests that the Gram matrix benchmark should be redone with optimized chunk sizes, yes?

No, I think they optimized it for their machine.
They have a different machine, with many cores.
It is probably very different from the machines I have access to (which are rather standard workstations).

I did do a fair bit of chunk size tuning for the machine that I have. The optimal chunk sizes for parmap and parany vary widely between different machines.

Does this mean that programs that currently use ocaml threads package will work seamlessly with multi-threads once multi-thread ocaml lands in ocaml mainline? i.e. programs using ocaml threads package will utilise multi-threads without (any?) changes?

1 Like

As far as I understand no. Thread will still be restricted to the only one active process semantics in order to remain compatible.

1 Like

I had some fun the other day and ported a parallel raytracer benchmark from F# to multicore OCaml: https://github.com/athas/raytracers

It doesn’t perform well, but the worst cases are on rendering, where I just naively spawned a Domain per pixel (: Chunking per row of pixels worsened performance even more though.

If you know how to show better performance with multicore, then make a PR - he will gladly accept

5 Likes

Each domain will spawn a pthread + allocate new heap structures etc. So spawning a domain per pixel or a row is not a good idea. If you have N cores, you would spawn N-1 workers at the start of the program and distribute tasks among them until the program runs to completion.

Have a look at the implementation of GramMatrix (https://github.com/ocaml-bench/sandmark/blob/master/benchmarks/multicore-grammatrix/grammatrix_multicore.ml) for how to use channels for distributing work from master to worker. There are more benchmarks here: https://github.com/ocaml-bench/sandmark/tree/master/benchmarks/multicore-numerical. We also have a parallel version of a simple ray tracer: https://github.com/ocaml-bench/sandmark/tree/master/benchmarks/multicore-minilight/parallel

4 Likes

I’ll try to rewrite the benchmark, thanks (: The MPL language is very fast in the benchmark (faster than F# and Haskell) - do you have an idea of what the MPL ‘ForkJoin’ model has as advantages/disadvantages relative to multicore OCaml?

As a confined distraction, I wrote a small parallel benchmark program. I now have two questions:

  • The performance as not as good as a version of the program using Unix.fork(), which I would expect that one can get the same performance. What is the right place to post the code and ask for advice on how to profile and tune the performance?
  • I don’t know where to look for parallel-programming libraries on top of multicore, for example a Multicore implementation of parmap : ('a -> 'b) -> 'a array -> 'b array. If I understand correctly, this does not exist yet? What is the right place to contribute a version of it? Is Domainslib the place?

I don’t think there are such libraries for the moment.
Porting Parmap to multicore-OCaml would be useful, in the long run.
Same for Parany.
The performance is supposed to become better, because there
would be less communication overhead (marshalling and unmarshaling of things
between processes).

This assumes the parallel GC overhead, and the overhead of cache synchronization, is lower than the marshalling/unmarshalling overhead. While hopefully true, it’s not necessarily true in all cases.

2 Likes

domainslib would be the right place to add a multicore implementation of parmap. It would be interesting to see the small parallel benchmark; it would help with tuning the parallel GC. If you can post the code to some public location (even a gist perhaps), we can take a look. If it proves to be useful, we could later turn it into a micro benchmark for multicore and add it to Sandmark.

1 Like

I don’t see any disadvantages. You can build fork-join model as a library on top of Multicore OCaml.

One thing I would note is that Maple assumes disentangledness i.e, that the parallel tasks are oblivious to each others memory effects. This is not checked statically. The type safety of Maple is dependent on the programmer correctly writing disentangled programs. Disentangledness makes GC design much simpler; the GC can assume that there will be no cross task pointers and only pointers from ancestor to descendent. The details are a bit more complicated, but it is up to the programmer to write “correct” programs. The compiler doesn’t help you (yet?).

Multicore OCaml cannot make such assumptions. Multicore OCaml programs will not segfault even if you arbitrarily share memory between domains. This dictates that the Multicore OCaml GC handle the possibility of arbitrary pointers between domains / parallel tasks. Much of the complexity in the GC design is ensuring that we can ensure GC safety (reachable objects aren’t collected) and liveness (unreachable objects are eventually collected).

4 Likes

This is correct. Multicore OCaml will aim to preserve single threaded semantics as much as possible. Not everyone may want to use multicore features. It is better not to break their code unless there is a convincing reason to do so.

I was also asking about what is the right place to do this (I’m not convinced that an issue against Sandmark is appropriate, given that I’m just playing and trying to tune performances, not propose a new for Sandmark; I also don’t think that an issue in the ocaml-multicore repository is appropriate, as it should be reserved for runtime development). Unless there is a better suggestion I will just open a Discuss thread.

Rather than a discuss thread, please can we have it under ocaml-multicore, although it might not be a perfect fit. Would be nice to have the discussion about implementation along with the code, and easily findable in the future when we look for it.

1 Like

Had a go at making Multicore both fast at the rendering (where one knows beforehand what work needs to be done) using the sandmark benchmark you referenced as basis. And to have a bit more fun I implemented the ForkJoin library you suggested too (: … it only supports the Par effect, but that is enough to dynamically spawn work for a set of workers inside the BVH recursive function.

Performance wise, now the Multicore code is as fast + faster than both Haskell and F# https://github.com/athas/raytracers

12 Likes

That’s great to see. Thanks for your effort in building useful things on highly experimental software!

We’ve just added support for tasks in domainslib. This should hopefully make it easier to build the raytracer in a more idiomatic fashion. There are several examples that show how to use the task library. I can cut a new release of the library if you are interested in using it in your raytracer.

1 Like