Multicore Update: April 2020, with a preprint paper

Welcome to the April 2020 update from the Multicore OCaml team, across the UK, India, France and Switzerland! Although most of us are in lockdown, we continue to march forward. As with previous updates, thanks to @shakthimaan and @kayceesrk for help assembling it all.

Preprint: Retrofitting Parallelism onto OCaml

We’ve put up a preprint of a paper titled “Retrofitting Parallelism onto OCaml” for which we would be grateful to receive feedback. The paper lays out the problem space for the multicore extension of OCaml and presents the design choices, implementation and evaluation of the concurrent garbage collector (GC).

Note that this is not a final paper, as it is currently under peer review, so any feedback given now can still be incorporated. Please use the e-mail contact details in the PDF for @kayceesrk and myself so we can aggregate (and acknowledge!) any such comments.

Rebasing Progress

The Multicore OCaml rebase from 4.06.1 has gained momentum. We have successfully rebased the parallel-minor-GC all the way onto the 4.09 OCaml trees. We will publish updated opam packages when we get to the recently branched 4.11 in the next couple of weeks.

Rebasing complex features like this is a “slow and steady” process due to the number of intermediate conflicts and bootstrapping, so we will not be publishing opam packages for every intermediate version – instead, the 4.11 trees will form the new “stable base” for any PRs.

Higher-level Domainslib API

A thread from last month’s update on building a parallel raytracer led to some useful advancements in the domainslib library to provide async/await-style task support. See the updates below for more details.

There is also an interesting discussion on ocaml-multicore/ocaml-multicore#324 about how to go about profiling and optimising your own small programs. More experiments with parallel algorithms that have different scheduling properties would be most useful at this time.

Upstreamed features in 4.11

The 4.11 release has recently branched and has the following multicore-relevant changes in it:

  • A concurrency-safe marshalling implementation (originally in ocaml#9293, then implemented again in ocaml#9353). This incurs a slight speed hit for marshalling-heavy programs, so feedback from trying this in your projects with 4.11 would be appreciated on the upstream OCaml issue tracker.

  • A runtime eventlog tracing system using the CTF format is on the verge of being merged in 4.11 over in ocaml#9082. This will also be of interest to those who need sequential program profiling, and is a generalisation of the infrastructure that was essential to our development of the multicore GC. If anyone is interested in helping with hacking on the OCaml side of CTF support to build clients, please get in touch with me or @kayceesrk.

In addition to the above highlights, we have also been making continuous improvements and additions to the Sandmark benchmarking test infrastructure. The various ongoing and completed tasks are provided below for your reference.

Multicore OCaml

Ongoing

  • ocaml-multicore/ocaml-multicore
    Promote Multicore OCaml to trunk

    The rebasing of Multicore OCaml from 4.06 to 4.10 is being worked on, and we are now at 4.09! In a few weeks, we expect to complete the rebase to the latest trunk release.

  • ocaml-multicore/eventlog-tools:
    OCaml Eventlog Tools

    A project that provides a set of tools for runtime tracing for OCaml 4.11.0 and higher has been created. It includes a simple OCaml decoder for eventlog traces and a built-in converter to the Chrome tracing format.

  • ocaml-multicore/domainslib#5
    Add parallel_scan to domainslib

    A parallel_scan implementation that uses the Task API, along with prefix_sum and summed_area_table examples, has now been added to domainslib, the domain-level parallel programming library for Multicore OCaml (see the sketch just below).
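
    As a rough illustration, here is a hedged sketch of an inclusive prefix sum with parallel_scan; the signature parallel_scan : pool -> ('a -> 'a -> 'a) -> 'a array -> 'a array and the pool size are assumptions, not necessarily the PR's exact interface.

    module T = Domainslib.Task

    let () =
      let pool = T.setup_pool ~num_domains:3 in
      let a = Array.init 10 (fun i -> i + 1) in
      (* Inclusive prefix sum of 1..10: [|1; 3; 6; 10; ...|] *)
      let sums = T.parallel_scan pool ( + ) a in
      Array.iter (Printf.printf "%d ") sums;
      T.teardown_pool pool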

Completed

The following PRs have been merged into Multicore OCaml and its ecosystem projects:

  • ocaml-multicore/ocaml-multicore#328
    Multicore compiler with Flambda

    Support for Flambda has been merged into the Multicore OCaml project repository. The clambda conversion is now performed at the cmmgen stage instead of from lambda.

  • ocaml-multicore/ocaml-multicore#324
    Optimizing a Multicore program

    This issue provides a detailed, worked example of performance debugging a Multicore program to improve its runtime performance.

  • ocaml-multicore/ocaml-multicore#325
    Added eventlog_to_latencies.py script

    A script to generate a latency report from an eventlog has now been included in the ocaml-multicore repository.

  • ocaml-multicore/domainslib#4
    Add support for task_pools

    The domainslib library now has support for work-stealing task pools with async/await parallelism. You are encouraged to try the examples.
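
    For a flavour of the async/await style, here is a hedged sketch of a parallel Fibonacci on top of this API; the cut-off, pool size, and printing are illustrative, not code from the PR.

    module T = Domainslib.Task

    let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

    (* Spawn one recursive call as a stealable task, compute the other
       directly, then await. A sequential cut-off bounds task overhead. *)
    let rec fib_par pool n =
      if n <= 20 then fib n
      else
        let a = T.async pool (fun () -> fib_par pool (n - 1)) in
        let b = fib_par pool (n - 2) in
        T.await pool a + b

    let () =
      let pool = T.setup_pool ~num_domains:3 in
      Printf.printf "fib(40) = %d\n" (fib_par pool 40);
      T.teardown_pool pool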

Benchmarking

A number of new benchmarks are being ported to the Sandmark performance benchmarking test suite.

  • ocaml-bench/sandmark#104
    Added python pip3 dependency

    A check_dependency function has now been defined in the Makefile along with a list of dependencies and pip packages for Ubuntu. You can now run make depend prior to building the benchmark suite to ensure that you have the required software. The python3-pip package has been added to the list of dependencies.

  • ocaml-bench/sandmark#96
    Sandmark Analyze notebooks

    The setup, builds and execution scripts for developer branches on bench2.ocamllabs.io have been migrated to winter.ocamllabs.io.

    UI-based and automated script-driven notebooks for analyzing sequential benchmark results are being worked on.

  • ocaml-bench/sandmark#108
    Porting mergesort and matrix multiplication using Task Pool API library

    This is an ongoing PR to implement merge sort and matrix multiplication using parallel_for; a hedged sketch of the parallel_for pattern appears after this list.

  • cubicle

    Cubicle is a model checker and an automatic SMT theorem prover. Porting it to Multicore OCaml is a work in progress.

  • raytracers

    Raytracers is a repository that contains ray tracer implementations in different parallel functional programming languages. The OCaml implementation has now been updated to use the new Domainslib.Task API.

    Also, a few experiments were performed with flambda parameters for the Multicore raytracer, which give around a 25% speedup but do not yet remove the boxing of floats. The experiments are to be repeated against the wip flambda2 trees on 4.11, which remove float boxing.
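
    To illustrate the parallel_for pattern used by the merge sort and matrix multiplication port mentioned above, here is a hedged sketch of a row-parallel matrix multiplication; the chunk size and names are illustrative, not the PR's actual code.

    module T = Domainslib.Task

    (* Multiply two n x n float matrices, parallelising over rows. *)
    let matmul pool a b n =
      let c = Array.make_matrix n n 0.0 in
      T.parallel_for pool ~chunk_size:(max 1 (n / 16)) ~start:0 ~finish:(n - 1)
        ~body:(fun i ->
          for j = 0 to n - 1 do
            let s = ref 0.0 in
            for k = 0 to n - 1 do
              s := !s +. a.(i).(k) *. b.(k).(j)
            done;
            c.(i).(j) <- !s
          done);
      c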

OCaml

Ongoing

  • ocaml/ocaml#9082
    Eventlog tracing system

    A substantial number of commits have gone into this PR based on reviews and feedback. These include updates to the configure script, handling warnings and exceptions, adding build support for Windows, removing unused code and coding style changes. This patch will be cherry-picked for the 4.11 release.

Completed

  • ocaml/ocaml#9353
    Reimplement output_value using a hash table to detect sharing

    This PR, which implements the hash table and bit vector required for Multicore OCaml, has been merged into 4.11.

Our thanks as always go to all the OCaml developers and users in the community for their continued support, and contribution to the project!

Acronyms

  • API: Application Programming Interface
  • GC: Garbage Collector
  • pip: Pip Installs Packages
  • PR: Pull Request
  • SMT: Satisfiability Modulo Theories
  • UI: User Interface

Great work! I hope the multicore compiler will speed up my implementations of evolutionary algorithms. However, the current results are mixed. Would the following be too naive a multicore implementation of Array.init?

module T = Domainslib.Task

(* Naive approach: spawn one task per array element, then await them all. *)
let parallel_init pool n f =
  let promise_array =
    Array.init n (fun i -> T.async pool (fun () -> f i)) in
  Array.map (fun p -> T.await pool p) promise_array

Maybe try making the granularity bigger by having each thread initialize several cells of your array (let’s call it an array chunk).
Usually this adds one parameter (the chunk size) that you need to tune to reach peak performance for your problem.

You want to do recursive subdivision spawning tasks that go onto work stealing deques. That way you get good locality of reference and good work distribution even if tasks have substantially different characteristics.
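
A hedged sketch of that recursive-subdivision idea, specialised to float arrays for simplicity (the grain size and helper names are illustrative):

module T = Domainslib.Task

let grain = 4096  (* below this many cells, fill sequentially *)

(* Fill a.(lo) .. a.(hi - 1): spawn the left half as a task that idle
   domains can steal, recurse into the right half, then await. *)
let rec fill pool a f lo hi =
  if hi - lo <= grain then
    for i = lo to hi - 1 do a.(i) <- f i done
  else begin
    let mid = lo + (hi - lo) / 2 in
    let left = T.async pool (fun () -> fill pool a f lo mid) in
    fill pool a f mid hi;
    T.await pool left
  end

let parallel_init pool n f =
  let a = Array.create_float n in
  fill pool a f 0 n;
  a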

This is an incredible piece of work, BTW. I honestly didn’t think it was possible to get the kinds of numbers given in that paper.


It is better to use parallel_for here with a larger chunk size, as @UnixJunkie has suggested; spawning a task per index would result in excessive overhead on the work-stealing queue. Here is an example:

$ cat parinit.ml
module T = Domainslib.Task

let num_domains = try int_of_string Sys.argv.(1) with _ -> 1
let n = try int_of_string Sys.argv.(2) with _ -> 1024

let a = Array.create_float n

let _ =
  let pool = T.setup_pool ~num_domains:(num_domains - 1) in
  T.parallel_for pool ~chunk_size:(n/num_domains) ~start:0 ~finish:(n-1)
    ~body:(fun i -> a.(i) <- sqrt (float_of_int i *. float_of_int i));
  T.teardown_pool pool
$ cat dune
(executable
  (name parinit)
  (modules parinit)
  (libraries domainslib))
$ for i in 1 2 4 8 16; do time dune exec ./parinit.exe $i 100_000_000; done

real    0m1.553s
user    0m1.078s
sys     0m0.475s

real    0m0.873s
user    0m1.090s
sys     0m0.464s

real    0m0.517s
user    0m1.210s
sys     0m0.500s

real    0m0.301s
user    0m1.256s
sys     0m0.508s

real    0m0.203s
user    0m1.329s
sys     0m0.567s

For comparison, here is the sequential version of the program:

$ cat seqinit.ml
let n = try int_of_string Sys.argv.(1) with _ -> 1024

let a = Array.create_float n

let _ =
  for i = 0 to n-1 do
    a.(i) <- sqrt (float_of_int i *. float_of_int i)
  done
$ time dune exec ./seqinit.exe 100_000_000

real    0m1.451s
user    0m1.011s
sys     0m0.440s

@Jon_Harrop’s suggestion of recursive spawning would also work. parallel_for might be more idiomatic if you are parallelising a sequential for loop / map / iter.


Thank you for the advice. I will experiment with the parallel_for approach above.


You are welcome. If you have more questions, happy to answer them here.


Do you plan to mainline only the parallel GC and omit the concurrent one?

Yes. That’s the plan. To be clear, the terms parallel and concurrent should be qualified as parallel minor and concurrent minor. The major GC is concurrent in both cases.


Thanks a lot for this work! Also thanks for writing and sharing the draft paper, it is an interesting read.

I have a question about how the C API finalizers are handled. Are the finalizers installed as C code pointers in the struct of operations for custom blocks treated differently from the finalizers installed using the Gc module operations? The much stronger constraints on what the C-level finalizers are permitted to do, and the much more relaxed guarantees on the order in which they are executed, make one think that a more efficient implementation is possible.

Another question, about lazy values: is the semantics that raises an exception, when a lazy value that is in the process of being forced is also forced concurrently, obviously the desired or the only feasible semantics? Do I understand correctly that this semantics means that if a lazy value is created, then shared between domains, and then happens to be needed by two of them at almost the same time, one could see an exception? In such cases, forcing the value again might return the value computed by the other domain, not the exception, right?

Regarding the comparison of sequential performance to the stock runtime, would it make sense to use different overhead settings for the new allocators relative to stock, analogously to how this has been suggested in discussions of comparisons of the best-fit allocator?

Just out of curiosity, does your experience make you think that there would be meaningful differences from setting the minor heap size so that they all fit into last-level-cache?

nit: section 2 mentions the old default minor heap size of 256K, but section 6 states the current 2MB.


I have a question about how the C API finalizers are handled. Are the finalizers installed as C code pointers in the struct of operations for custom blocks treated differently from the finalizers installed using the Gc module operations? The much stronger constraints on what the C-level finalizers are permitted to do, and the much more relaxed guarantees on the order in which they are executed, make one think that a more efficient implementation is possible.

Custom block finalizers are indeed treated differently. The finalize function is called when custom blocks are swept, which is different from how the usual finalizers work. This behavior is the same in stock OCaml and multicore.

Another question, about lazy values: is the semantics that raises an exception, when a lazy value that is in the process of being forced is also forced concurrently, obviously the desired or the only feasible semantics? Do I understand correctly that this semantics means that if a lazy value is created, then shared between domains, and then happens to be needed by two of them at almost the same time, one could see an exception? In such cases, forcing the value again might return the value computed by the other domain, not the exception, right?

Not the exception. We will clarify this in the paper.

We’ve recently introduced the ability to distinguish concurrent forcing of a lazy value by a different domain from recursive forcing by the same domain. In the former case, the second domain sees a RacyLazy exception, while the latter raises Undefined. Consequently, one can safely share lazy values between domains like this:

let rec safe_force l =
  try Lazy.force l with
  | Lazy.RacyLazy ->
      Domain.Sync.cpu_relax (); (* emits PAUSE instruction on x86 *)
      safe_force l

Regarding the comparison of sequential performance to the stock runtime, would it make sense to use different overhead settings for the new allocators relative to stock, analogously to how this has been suggested in discussions of comparisons of the best-fit allocator?

Agreed. We shall do this for a future revision of the paper.

Just out of curiosity, does your experience make you think that there would be meaningful differences from setting the minor heap size so that they all fit into last-level-cache?

We haven’t tried this. I’ve seen anecdotal evidence that it might help. I’d be interested in seeing a rigorous experimental result showing that it helps.

nit: section 2 mentions the old default minor heap size of 256K, but section 6 states the current 2MB.

256K words = 2MB on 64-bit. We shall clarify this.

Thanks for the comments!

Concurrent “magic statics” (C++) and lazy statics (Rust) are very useful. Is this code correct, though? It will loop if the Lazy.RacyLazy exception arises as an uncaught exception inside l (raised by another lazy). This is different from Lazy.Undefined, where you do not have much hope that the value will become defined later. One possible solution is to make waiting on a concurrent initialization the default. Do you see drawbacks to it? For instance, do you see other useful ways to handle a Lazy.RacyLazy exception? (If there are useful uses of not locking, it is also possible to make waiting the default and provide a new function Lazy.try_force : 'a t -> 'a option.)
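
For concreteness, here is a hypothetical retry loop written on top of the proposed (non-existent) Lazy.try_force, with its name and signature taken from the suggestion above; it spin-waits in user code rather than blocking in the runtime:

(* Hypothetical: Lazy.try_force : 'a Lazy.t -> 'a option would return
   None while another domain is forcing l, instead of raising RacyLazy. *)
let rec force_waiting l =
  match Lazy.try_force l with
  | Some v -> v
  | None -> Domain.Sync.cpu_relax (); force_waiting l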

Also, I have noticed that the PR you mentioned was not reviewed on GitHub and was self-merged the same day. What I wonder is how all the small design decisions like this are intended to be properly reviewed before merging into trunk. For instance, if they all arrive in bulk and need to be reviewed at once at the last moment, I worry about the increased risk that some might fly under the radar. How can we best help you with such aspects of language design?

I would consider this code to be correct. If you have an uncaught RacyLazy, then that’s a concurrency bug; the code hasn’t taken into account the racy access to the shared resource, i.e., the lazy value.

One possible solution is to make waiting on a concurrent initialization the default.

I am not fond of blocking in the runtime when it is not needed. With Multicore OCaml, we’ve consciously stayed away from introducing blocking queues in the runtime, unlike, say, GHC’s MVars or Go’s channels, where the blocking behaviour is primitive (those runtimes also have a primitive notion of fibers and schedulers, which we’ve consciously avoided). The reason is that we intend to write schedulers for fibers on top, and blocking the domain on RacyLazy would block the entire scheduler on that domain even if there are other fibers that could run.

Regarding other ways of handling RacyLazy, we may build a blocking queue of fibers in user land by performing an effect. See future work here. I’d like the semantics of concurrent lazy access to be as simple as possible so that it doesn’t preclude future extensions.

By actively participating in Multicore OCaml development. :slight_smile: We develop in the open. You are welcome to participate in the discussions.

There have been several PRs related to lazy values, and only the multicore core team has participated in them. It wasn’t clear to me that this particular PR would receive any more comments if I had kept it open for longer. Hence, I merged it myself.

Writing research papers is also a conscious effort by the multicore team to communicate the developments better to the core team and to the wider user base. Given that Multicore is a large change, we hope that writing up papers such as these will help communicate the high-level design decisions that have led to the current state of the code.

We’re consciously upstreaming small bits of multicore into trunk in self-contained pieces precisely to avoid the situation that you mention. Each of these PRs goes through the usual rigorous reviewing process before it gets in.

they need to be reviewed all at once at the last moment

I’m unsure what you mean by “last moment” here. The multicore team has development schedules for features and the releases they may get into. But we understand that good reviewing takes time, and we don’t intend to hurry this process in any fashion.


The API you proposed would, in particular, conflict with the exception-safety model we intend to propose with Stephen. We should continue the discussion elsewhere (I’ll comment on the PR or open an issue when I have more time).

I do not understand this comment, since I have been commenting on the multicore repo since 2018?

(Edit: ah! I asked what could we do, and you were addressing people in general. Yes, I also encourage other people to follow the multicore repo!)

But yes, I appreciate it when discussions happen there and I can contribute! For context, I follow the repo regularly and remember thinking that you were going to make Lazy useful for implementing lazy statics, which is why this discussion caught my attention.

Happy to hear that we are on the same page regarding not operating under time pressure. I had in mind the introduction of Caml_state, which ended up with an issue left open (#9197) that is harder to resolve now that programs may rely on the bug unintentionally. So there can indeed be time-pressure situations we should be wary of.

I haven’t seen the exception safety proposal. If you have a proposal written up somewhere I’ll be happy to take a look. Let’s continue the discussion in an issue.

I was a little taken aback when I read this, since you seem to be suggesting changes to how we work that involve mind reading. If you were following the multicore repository and saw a PR that makes a change relevant to your interests but didn’t post your thoughts, then there’s very little we can do about that.

The underlying reason for your comments seems to be related to this:

It’s worth reiterating how to contribute to the OCaml language more broadly here:

  • Substantive changes in the underlying theory are often best communicated using research papers. These can be long or short, peer-reviewed or on arXiv, and in venues ranging from your homepage to ICFP or the OCaml Workshop. The essential goal here is to lay out a thesis and justify it either experimentally or theoretically.

  • Once you have some consensus on the broader picture, changes that will substantially modify OCaml’s behaviour can be posted to the RFCs repository, where they will be discussed online and subsequently scheduled to be brought up in a core developer meeting.

  • The core developer meetings are held regularly (formerly in person, now online), and aim to come out with a decision about accepting, declining or requesting modifications to a proposed change or RFC.

  • Once a change has been agreed, posting a pull request to the main OCaml repository will have it undergo a series of reviews, after which it will be merged once it has sufficient approval from the OCaml developers.

Not every change has to go through this full process, of course, but the intention is to signpost upcoming changes well ahead of time such that design issues can be worked out before substantial effort has been put into implementation.

So, if you have a proposal for exception-safety that would impact our design of multicore OCaml, we’d like to see it either as an RFC or a short paper draft, as you prefer. Looking at (but not commenting on) our in-development PRs turns the design process into a mystery novel, except with many plot holes and an unexciting reveal.

Thanks also to everyone else on the thread for your constructive and encouraging feedback. We’ll have a new paper draft on arXiv as soon as we finish integrating this round of comments. @Per_Kristian_Lehre, if you have a chance to post your evolutionary algorithm as an issue on Issues · ocaml-multicore/ocaml-multicore · GitHub, we’d like to include a candidate implementation in our parallel benchmarks as well.


@gadmm asked me to intervene as a moderator here, which is of course delicate but worth doing carefully.

This is not what I understand from what @gadmm wrote above. He would like to have more time to comment on Multicore PRs, and gives an explanation (related to an older issue) of why he was interested in Lazy semantics in particular, in addition to his long-time interest in exceptions and resource safety. His comments on Lazy seem interesting and worth discussing, although it is a delicate problem, so it is nice that @gadmm and @kayceesrk are planning to continue this discussion (elsewhere), born from excellent feedback questions from @jjb.

It is important to be kind to each other in these technical discussions. There is a tension for Multicore changes between getting things ready in a reasonable time and taking time for careful design consideration and discussion (with a lot of pressure from the community). @gadmm’s comments felt to me a bit insensitive to those aspects of the discussion. But we also want a community where technical discussions are welcome, including technical criticism and pointing out valid issues. In this regard, the comments about mind reading and mystery novels sound a bit off.


I read the paper with great pleasure—thank you for writing it up in a way that was so accessible.

While reading, there were a few places where I was a bit unclear about the “why” when presented with the what/how of the system design. In case this is helpful to you, I thought I would make a quick note of those here.

When describing the GC of the stock OCaml implementation, the paper says this about mutable fields:

When mutating a field of an object, the program invokes the write barrier. During the mark phase of the GC, the write barrier loads the object pointed to by the old value of the field, and if it is white, grays it and adds it to the mark stack. This preserves the invariant that every object reachable at the start of marking eventually gets marked (the snapshot-at-the-beginning property).

This initially confused me, and I spent a couple of minutes trying to understand it before reading on. This paragraph is the first mention of the “snapshot-at-the-beginning” property, but this property is only motivated at the end of this section (bounding the amount of work). It might have helped to introduce and motivate the property first, then explain how mutation could violate the invariant.

Also, the snapshot property is motivated as a guarantee against non-termination, but it also seems like it is useful for keeping pauses short. This is perhaps too obvious to mention.
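
For readers who stumble in the same place, here is a hedged, self-contained OCaml model of the quoted barrier; all names and types are illustrative (the real barrier lives in the C runtime):

type colour = White | Grey | Black
type phase = Idle | Mark | Sweep

type obj = { mutable colour : colour; mutable field : obj option }

let gc_phase = ref Mark
let mark_stack : obj Stack.t = Stack.create ()

(* Before overwriting a mutable field during marking, grey the old
   referent, so that every object reachable at the start of marking
   (the snapshot) is still eventually marked. *)
let write_field (o : obj) (new_value : obj option) =
  (if !gc_phase = Mark then
     match o.field with
     | Some old when old.colour = White ->
         old.colour <- Grey;
         Stack.push old mark_stack
     | _ -> ());
  o.field <- new_value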

Later, when describing the algorithm for the concurrent major GC’s majorSlice procedure, the concept of a budget is introduced.

The functions sweepSlice and markSlice take in a budget (in words) and return the unspent budget. At the start of the cycle, we record the number of active domains in the global variable gNumDomsToMark. If the call to markSlice returns a non-zero value and the domain had some marking work to do before the call, then it indicates that the mark stack on that domain has now been emptied, which we record by decrementing gNumDomsToMark.

This paragraph tracks the logic of the pseudocode in Fig. 3 very closely, explaining what the code says but without explaining why. Eventually I inferred that the purpose of the budget is to bound the amount of work performed, so the GC can yield control back to the mutator. This might be obvious to someone more familiar with the domain, but it was not immediately obvious to me. I was also curious how the precise value of the budget is determined: is it fixed or dynamic? Is it based on heuristics, and if so, which ones, and why?
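
My eventual reading of the budget logic, as a hedged OCaml-flavoured model (the names follow the quoted paragraph, not the actual runtime; the slice functions take a word budget and return what is left unspent):

let g_num_doms_to_mark = ref 0
(* set to the number of active domains at the start of the major cycle *)

let major_slice ~sweep_slice ~mark_slice ~had_marking_work budget =
  let budget = sweep_slice budget in   (* spend the budget on sweeping *)
  let budget = mark_slice budget in    (* then on marking *)
  (* Unspent budget after marking means this domain's mark stack
     emptied during the slice: one fewer domain left to mark. *)
  if budget > 0 && had_marking_work then
    decr g_num_doms_to_mark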

Lastly, when describing the allocator, the paper draws a comparison with Streamflow.

Streamflow uses BiBoP [Steele Jr 1977] to track which slots in a page are free without requiring object headers. However, since every OCaml object already has a header, we use it instead to encode which slots are free.

It seems like there might be an error in this section, specifically when it says “we use it instead to encode which slots are free” but earlier says Streamflow uses it for the same purpose, “to track which slots in a page are free.”

Maybe it would be clearer to say that you only use it to track which slots are free, without the additional benefit of reduced overhead?

These very minor confusions aside, I found the paper exceptionally clear considering the breadth and complexity of the material. In particular I enjoyed the description of the read barrier. It is truly impressive how you’ve squeezed so much utility from 3 simple instructions.

Cheers!
