Multicore Update: April 2020, with a preprint paper

Do you plan to mainline only the parallel GC and omit the concurrent one?

Yes. That’s the plan. To be clear, the terms parallel and concurrent should be qualified as parallel minor and concurrent minor. The major GC is concurrent in both cases.

2 Likes

Thanks a lot for this work! Also thanks for writing and sharing the draft paper, it is an interesting read.

I have a question about how the C API finalizers are handled. Are the finalizers that are installed as C code pointers in the struct of operations for custom blocks treated differently from the finalizers installed using the Gc module operations? The much stronger constraints on what the C-level finalizers are permitted to do, and the much more relaxed guarantees on the order in which they are executed, suggest that a more efficient implementation is possible.

Another question, about lazy values: is the semantics that raises an exception when a lazy value in the process of being forced is also forced concurrently obviously the desired, or the only feasible, semantics? Do I understand correctly that this semantics means that if a lazy value is created, then shared between domains, and then happens to be needed by two of them at almost the same time, one could see an exception? In such cases, forcing the value again might return the value computed by the other domain, not the exception, right?

Regarding the comparison of sequential performance to the stock runtime, would it make sense to use different overhead settings for the new allocators relative to stock, analogously to how this has been suggested in discussions of comparisons of the best-fit allocator?

Just out of curiosity, does your experience make you think that there would be meaningful differences from setting the minor heap size so that they all fit into last-level-cache?

nit: section 2 mentions the old default minor heap size of 256K, but section 6 states the current 2MB.

1 Like

I have a question about how the C API finalizers are handled. Are the finalizers that are installed as C code pointers in the struct of operations for custom blocks treated differently from the finalizers installed using the Gc module operations? The much stronger constraints on what the C-level finalizers are permitted to do, and the much more relaxed guarantees on the order in which they are executed, suggest that a more efficient implementation is possible.

Custom block finalizers are indeed treated differently. The finalize function is called when custom blocks are swept, which is different from how the usual finalizers work. This behavior is the same in stock OCaml and multicore.

Another question, about lazy values: is the semantics that raises an exception when a lazy value in the process of being forced is also forced concurrently obviously the desired, or the only feasible, semantics? Do I understand correctly that this semantics means that if a lazy value is created, then shared between domains, and then happens to be needed by two of them at almost the same time, one could see an exception? In such cases, forcing the value again might return the value computed by the other domain, not the exception, right?

Not the exception. We will clarify this in the paper.

We’ve recently introduced the ability to distinguish concurrent forcing of a lazy value by a different domain from recursive forcing by the same domain. In the former case, the second domain sees a RacyLazy exception, while the latter raises Undefined. Consequently, one can safely share lazy values between domains with code like:

let rec safe_force l =
  try Lazy.force l with
  | Lazy.RacyLazy ->
      Domain.Sync.cpu_relax (); (* emits PAUSE instruction on x86 *)
      safe_force l

Regarding the comparison of sequential performance to the stock runtime, would it make sense to use different overhead settings for the new allocators relative to stock, analogously to how this has been suggested in discussions of comparisons of the best-fit allocator?

Agreed. We shall do this in a future revision of the paper.

Just out of curiosity, does your experience make you think that there would be meaningful differences from setting the minor heap size so that they all fit into last-level-cache?

We haven’t tried this. I’ve seen anecdotal evidence that it might help. I’d be interested in seeing a rigorous experimental result that shows that it helps.

nit: section 2 mentions the old default minor heap size of 256K by default, but section 6 states the current 2MB.

256k words = 2MB on 64-bit. We shall clarify this.
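To spell out the arithmetic (a trivial sketch, assuming a 64-bit word size of 8 bytes):

```ocaml
(* 256k words of 8 bytes each on 64-bit is exactly 2 MiB,
   so the two sections of the paper agree. *)
let minor_heap_words = 256 * 1024
let bytes_per_word = 8                       (* Sys.word_size / 8 on 64-bit *)
let minor_heap_bytes = minor_heap_words * bytes_per_word
let () = assert (minor_heap_bytes = 2 * 1024 * 1024)
```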

Thanks for the comments!

Concurrent “magic statics” (C++) and lazy statics (Rust) are very useful. Is this code correct, though? It will loop if the Lazy.RacyLazy exception arose as an uncaught exception inside l (raised by another lazy value). This is different from Lazy.Undefined, where you do not have much hope that the value will become defined later. One possible solution is to make waiting on a concurrent initialization the default. Do you see drawbacks to that? For instance, do you see other useful ways to handle a Lazy.RacyLazy exception? (If there are useful uses of not locking, it is also possible to make waiting the default and provide a new function Lazy.try_force : 'a t -> 'a option.)
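The proposed Lazy.try_force could be sketched on top of the existing force (a hypothetical sketch; RacyLazy here is a local stand-in for the multicore Lazy.RacyLazy, which does not exist in stock OCaml):

```ocaml
exception RacyLazy  (* stand-in for the multicore Lazy.RacyLazy *)

(* Hypothetical non-blocking variant: return None instead of raising
   when another domain is currently forcing the same lazy value. *)
let try_force (l : 'a Lazy.t) : 'a option =
  match Lazy.force l with
  | v -> Some v
  | exception RacyLazy -> None
```

With this, try_force (lazy 42) evaluates to Some 42, and a caller that sees None can choose to retry, back off, or do something else entirely.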

Also, I have noticed that the PR you mentioned was not reviewed on GitHub and was self-merged the same day. What I wonder is how all the small design decisions like this are intended to be properly reviewed before merging into trunk. For instance, if they all arrive in bulk and need to be reviewed at once at the last moment, I worry about the increased risk that some might fly under the radar. How can we best help you on such aspects of language design?

I would consider this code to be correct. If you have an uncaught RacyLazy, then that’s a concurrency bug; the code hasn’t taken into account the racy access to the shared resource, i.e., the lazy value.

One possible solution is to make waiting on a concurrent initialization the default.

I am not fond of blocking in the runtime when it is not needed. With Multicore OCaml, we’ve consciously stayed away from introducing blocking queues in the runtime, unlike, say, GHC’s MVars or Go’s channels, where the blocking behaviour is primitive (and whose runtimes also have a primitive notion of fibers and schedulers, which we’ve consciously avoided). The reason is that we intend to write schedulers for fibers on top, and blocking the domain on RacyLazy would block the entire scheduler on that domain even if there are other fibers that could run.

Regarding other ways of handling RacyLazy, we may build a blocking queue of fibers in the user-land by performing an effect. See future work here. I’d like the semantics of concurrent lazy access to be as simple as possible so that it doesn’t preclude future extensions.

By actively participating in Multicore OCaml development. :slight_smile: We develop in the open. You are welcome to participate in the discussions.

There have been several PRs related to lazy values, and only the multicore core team has participated in them. It wasn’t clear to me that this particular PR would receive any more comments if I had kept it open for longer. Hence, I merged it myself.

Also, writing research papers is a conscious effort by the multicore team to communicate the developments better to the core team and to the wider user base. Given that Multicore is a large change, we hope that writing papers such as these will help communicate the high-level design decisions that have led to the current state of the code.

We’re consciously upstreaming small bits of multicore into trunk in self-contained pieces precisely to avoid the situation that you mention. Each of these PRs goes through the usual rigorous reviewing process before it gets in.

they need to be reviewed all at once at the last moment

I’m unsure what you mean by “last moment” here. The multicore team has development schedules for features and the releases they may get into. But we understand that good reviewing takes time, and we don’t intend to hurry this process in any fashion.

1 Like

The API you proposed would in particular be against the exception-safety model we intend to propose with Stephen. We should continue the discussion elsewhere (I’ll comment on the PR or open an issue when I have more time).

I do not understand this comment since I have been commenting since 2018 on the multicore repo?

(Edit: ah! I asked what could we do, and you were addressing people in general. Yes, I also encourage other people to follow the multicore repo!)

But yes, I appreciate it when discussions happen there and I can contribute! For context, I follow the repo regularly and remember thinking that you were going to make Lazy useful to implement lazy statics, this is why this discussion caught my attention.

Happy to hear that we are on the same page regarding not operating under time pressure. I had in mind the introduction of Caml_state which ended up with an issue left open (#9197) that is harder to resolve now that programs may rely on the bug unintentionally. So there can indeed be such time pressure situations we should be wary about.

I haven’t seen the exception safety proposal. If you have a proposal written up somewhere I’ll be happy to take a look. Let’s continue the discussion in an issue.

I was a little taken aback when I read this, since you seem to be suggesting changes to how we work that involve mind reading. If you were following the multicore repository and saw a PR that made a change relevant to your interests but didn’t post your thoughts, then there’s very little we can do about that.

The underlying reason for your comments seems to be related to this:

It’s worth reiterating how to contribute to the OCaml language more broadly here:

  • Substantive changes in the underlying theory are often best communicated using research papers. These can be long or short, peer-reviewed or on arXiv, and in venues ranging from your homepage to ICFP or the OCaml Workshop. The essential goal here is to lay out a thesis and justify it either experimentally or theoretically.

  • Once you have some consensus on the broader picture, changes that will substantially modify OCaml’s behaviour can be posted to the RFCs repository, where they will be discussed online and subsequently scheduled to be brought up at a core developer meeting.

  • The core developer meetings are held regularly (formerly in person, now online), and aim to come out with a decision about accepting, declining or requesting modifications to a proposed change or RFC.

  • Once a change has been agreed, posting a pull request to the main OCaml repository will have it undergo a series of reviews, after which it will be merged once it has sufficient approval from the OCaml developers.

Not every change has to go through this full process, of course, but the intention is to signpost upcoming changes well ahead of time such that design issues can be worked out before substantial effort has been put into implementation.

So, if you have a proposal for exception-safety that would impact our design of multicore OCaml, we’d like to see it either as an RFC or a short paper draft, as you prefer. Looking at (but not commenting on) our in-development PRs turns the design process into a mystery novel, except with many plot holes and an unexciting reveal.

Thanks also to everyone else on the thread for your constructive and encouraging feedback. We’ll have a new paper draft on arXiv as soon as we finish integrating this round of comments. @Per_Kristian_Lehre, if you have a chance to post your evolutionary algorithm as an issue on https://github.com/ocaml-multicore/ocaml-multicore/issues, we’d like to include a candidate implementation in our parallel benchmarks as well.

2 Likes

@gadmm asked me to intervene as a moderator here, which is of course delicate but worth doing carefully.

This is not what I understand from what @gadmm wrote above. He would like to have more time to comment on Multicore PRs, and gives an explanation (related to an older issue) of why he was interested in the Lazy semantics in particular, in addition to his long-time interest in exceptions and resource safety. His comments on Lazy seem interesting and worth discussing, although it is a delicate problem, so it is nice that @gadmm and @kayceesrk are planning to continue this discussion (elsewhere), born from excellent feedback questions from @jjb.

It is important to be kind to each other in these technical discussions. For Multicore changes there is a tension between getting things ready in a reasonable time and taking time for careful design consideration and discussion (and a lot of pressure from the community). @gadmm came across to me as a bit insensitive to those aspects in the discussion. But we also want a community where technical discussions are welcome, including technical criticism and the pointing out of valid issues. In this regard the comments on mind reading or mystery novels sound a bit off.

5 Likes

I read the paper with great pleasure—thank you for writing it up in a way that was so accessible.

While reading, there were a few places where I was a bit unclear about the “why” when presented with the what/how of the system design. In case this is helpful to you, I thought I would make a quick note of those here.

When describing the GC of the stock OCaml implementation, the paper says this about mutable fields:

When mutating a field of an object, the program invokes the write barrier. During the mark phase of the GC, the write barrier loads the object pointed to by the old value of the field, and if it is white, grays it and adds it to the mark stack. This preserves the invariant that every object reachable at the start of marking eventually gets marked (the snapshot-at-the-beginning property).

This initially confused me, and I spent a couple of minutes trying to understand it before reading on. This paragraph is the first mention of the “snapshot-at-the-beginning” property, but the property is only motivated at the end of the section (bounding the amount of marking work). It might have helped to introduce and motivate the property first, then explain how mutation could violate the invariant.
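For what it’s worth, writing the barrier down as toy code is what eventually made it click for me (invented types and names, not the actual runtime code):

```ocaml
(* Toy model of the snapshot-at-the-beginning write barrier: before a
   mutable field is overwritten during marking, the *old* value is shaded
   gray, so everything reachable at the start of marking still gets marked. *)
type color = White | Gray | Black
type obj = { mutable color : color; mutable fields : obj array }

let mark_stack : obj Stack.t = Stack.create ()

let write_barrier ~marking (o : obj) (i : int) (new_value : obj) =
  if marking then begin
    let old = o.fields.(i) in
    if old.color = White then begin
      old.color <- Gray;              (* gray the snapshot-reachable object *)
      Stack.push old mark_stack       (* and queue it for marking *)
    end
  end;
  o.fields.(i) <- new_value
```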

Also, the snapshot property is motivated as a guarantee against non-termination, but it also seems like it is useful for keeping pauses short. This is perhaps too obvious to mention.

Later, when describing the algorithm for the concurrent major GC’s majorSlice procedure, the concept of a budget is introduced.

The functions sweepSlice and markSlice take in a budget (in words) and return the unspent budget. At the start of the cycle, we record the number of active domains in the global variable gNumDomsToMark. If the call to markSlice returns a non-zero value and the domain had some marking work to do before the call, then it indicates that the mark stack on that domain has now been emptied, which we record by decrementing gNumDomsToMark.

This paragraph reads very closely to the logic of the pseudocode in Fig. 3, explaining what the code says but not why. Eventually I inferred that the purpose of the budget is to bound the amount of work performed, so the GC can yield control back to the mutator. This might be obvious to someone more familiar with the domain, but it was not immediately obvious to me. I was also curious how the precise value of the budget is determined: is it fixed or dynamic? Is it based on heuristics, and if so, which ones, and why?
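For concreteness, here is my toy reading of the budget mechanism (invented code, not the paper’s pseudocode): a slice consumes work only while budget remains and hands the unspent budget back, so the mutator regains control promptly.

```ocaml
(* Toy model: [work] holds per-chunk costs in words; the slice consumes
   chunks until the budget runs out, then returns the unspent budget. *)
let sweep_slice (work : int Queue.t) (budget : int) : int =
  let remaining = ref budget in
  (try
     while not (Queue.is_empty work) do
       let cost = Queue.peek work in
       if cost > !remaining then raise Exit;  (* out of budget: yield *)
       ignore (Queue.pop work);               (* "sweep" this chunk *)
       remaining := !remaining - cost
     done
   with Exit -> ());
  !remaining
```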

Lastly, when describing the allocator, the paper draws a comparison with Streamflow.

Streamflow uses BiBoP [Steele Jr 1977] to track which slots in a page are free without requiring object headers. However, since every OCaml object already has a header, we use it instead to encode which slots are free.

It seems like there might be an error in this section, specifically when it says “we use it instead to encode which slots are free” but earlier says Streamflow uses it for the same purpose, “to track which slots in a page are free.”

Maybe it would be clearer to say that you only use the header to track which slots are free, without the additional benefit of reduced overhead?

These very minor confusions aside, I found the paper exceptionally clear considering the breadth and complexity of the material. In particular I enjoyed the description of the read barrier. It is truly impressive how you’ve squeezed so much utility from 3 simple instructions.

Cheers!

2 Likes

Thanks @samwgoldman for the useful comments. Will incorporate the suggestions in the next draft.

Regarding Streamflow: Streamflow uses BiBoP, a separate data structure sitting next to the heap, to encode which slots in a page are free. Streamflow does not use a header for objects in the heap. Since OCaml has object headers already, we store the free/used information directly in the header, without the need for a separate BiBoP data structure. Will clarify this in the text.

3 Likes

@jjb, as @kayceesrk notes, we haven’t done any rigorous experiments towards this, but I was curious if you had any insight into doing this in other systems (i.e. what prompted the question)? At first glance, the L3 cache sizes are roughly the right order of magnitude to have a go, and we may even be able to fit into lower caches. But then we have contrasting experiences from the Coq team, who seem to recommend enormous minor heap sizes in the current sequential runtime, so cache locality seems less of a concern there vs allocation throughput.

2 Likes

(Keep in mind that I’m not an expert in this territory.)

My experience with sequential OCaml systems is that some seem to be fastest with last-level-cache sized (or slightly smaller) minor heaps, and some do much better with larger ones. IME it seems to come down to the promotion rate. Some systems don’t need to promote much to the major heap even with a cache-sized minor heap, and if so the cache effects of keeping the minor heap in cache seem to be meaningful. But if there is a lot of promotion with a small minor heap, then that cost obliterates the cache effects and you lose.
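(As a rough way to observe this promotion effect in stock OCaml, one can diff Gc.stat around a workload; a small sketch:)

```ocaml
(* Rough sketch: measure minor-to-major promotion caused by a workload
   by comparing Gc.stat's promoted_words counter before and after. *)
let promoted_during (f : unit -> unit) : float =
  let before = (Gc.stat ()).Gc.promoted_words in
  f ();
  let after = (Gc.stat ()).Gc.promoted_words in
  after -. before
```

Data that stays live across a minor collection shows up in this counter, so running the same workload under different minor heap sizes (OCAMLRUNPARAM’s s option) gives a crude picture of the promotion rate.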

In the multicore world, I suspect that the situation is more nuanced, as the hardware cache coherence implementation comes into play more. I really don’t know, but maybe there are significantly different effects on that performance when the minor heap stays in cache. I’m thinking that there might be something there based on very rough extrapolation from things I see some of my colleagues writing C++ backend services doing. So I’m just saying that I think there is a question with no obvious answer, which isn’t very much really, but I’d be curious about the answer, and I guess others in the ICFP community would too.

3 Likes

Thanks for the paper and the work. It’s always nice to have digestible high-level design descriptions.

One thing that I didn’t get from the paper is how exactly ConcurMinor breaks the current FFI, and the impact it would have on the existing ecosystem, on a scale from “it affects all projects” to “only people doing that fancy thing” :–) ?

At the end of the paper you seem to make the point that ParMinor is the solution to go with for the time being. Does this mean you are going to leave behind the work done on ConcurMinor, or do you intend to continue to maintain it?

From that perspective, can the two minor schemes be reasonably expressed as pluggable modules, or are the concrete differences in the interplay with the major GC and the domains too intricate for that?

1 Like

Intuitively, it seems that aging should alleviate some of this problem. If you completely run out of space in the minor heap because it’s full of live values, there’s no solution: you have to push data to the major heap. But if you have available space (i.e. enough non-live values to collect), aging should help keep the live data around until it is no longer live. In the meantime, you benefit from the caching effects.

Since most values are not shared, I’m not sure this is an issue. It seems like fitting the minor heaps in the architecture’s various cache sizes is even more important, since some caches are usually shared between cores. An over-allocating domain can therefore ruin the cache locality of other cores unless its minor heap is restricted.

One thing that I didn’t get from the paper is how exactly ConcurMinor breaks the current FFI, and the impact it would have on the existing ecosystem, on a scale from “it affects all projects” to “only people doing that fancy thing” :–) ?

All the projects that use the C API. The details are here: https://github.com/ocaml-multicore/ocaml-multicore/wiki/C-API-changes

At the end of the paper you seem to make the point that ParMinor is the solution to go with for the time being. Does this mean you are going to leave behind the work done on ConcurMinor, or do you intend to continue to maintain it?

We don’t intend to maintain it. It is quite a bit of work to maintain and port the changes across two different GCs. The ParMinor GC is now at the 4.11 branch point (the default multicore compiler is 4.10 + ParMinor now). ConcMinor is at 4.06.1.

Given that ConcMinor breaks the C API, the ecosystem would have to be fixed for ConcMinor to be useful. The code changes are indeed intricate; the differences are not just in the minor GC, but in the compiler’s internal use of the C API. It would be quite a bit of work to keep both GCs in the same source distribution.

1 Like

Now that we have Sandmark, one interesting MSc/MTech project may be to study the effect of cache sizes on performance. This is fairly easy to do with the OCAMLRUNPARAM option s, and we may see some interesting results.
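For anyone wanting to try: the minor heap size can be set either via OCAMLRUNPARAM='s=...' at launch or programmatically through the Gc module (a minimal sketch; 256k words is an arbitrary cache-sized choice here):

```ocaml
(* Set the minor heap to 256k words (2 MiB on 64-bit), roughly
   last-level-cache sized; equivalent to OCAMLRUNPARAM='s=256k'. *)
let () =
  Gc.set { (Gc.get ()) with Gc.minor_heap_size = 256 * 1024 }
```

Changing minor_heap_size triggers a minor collection and reallocates the minor heap, so it is best done once at startup before timing anything.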

1 Like

I do not think this is necessarily true.

Here is why I think so, but be warned that this is preliminary as I do not have time to explore this idea further on my own at the moment.

State in Rust

Breaking the C API is a consequence of deciding that all single-threaded shared mutable state must be assumed to also be shared between threads. So a new read barrier is used to promote values when they are read from another thread. But for data types that were correct up to now, users must also be careful to avoid races from now on… for instance by avoiding sharing values of such types between domains.

One lesson of Rust is that there are different kinds of mutable state, for different usages, with different means to achieve thread-safety.

The closest there is to current OCaml’s mutable is the notion of single-threaded multiple-writers mutable state (&Cell). It is made thread-safe in Rust by statically preventing values containing &Cell from crossing thread boundaries (by virtue of not having the Send trait). The same restriction is used to make some data structures more efficient by avoiding the cost of synchronisation (cf. the reference-counting pointer Rc vs. the atomic reference-counting pointer Arc).

This is not enough by itself, and Rust offers other kinds of state for communicating and sharing values between threads.

UnsafeCell has similarities with OCaml multicore’s mutable (though yours is safe thanks to the work on the memory model): it is used to get raw pointers to the data, which have almost no restrictions and can be sent across domains, but the user is likewise told to “avoid data races”. It is rarely used alone; combined with type abstraction, it can be used to program safe concurrent data structures.

Lastly, the default notion of state in Rust is linear state, which can be sent freely across threads. Thread-safety is ensured by restricting aliasing using the ownership and borrowing discipline.

A backwards-compatible concurrent collector?

If I had to imagine a backwards-compatible OCaml with static control of interference à la Rust based on ConcMinor, it would distinguish the three kinds of state (concretely with other keywords in addition to mutable). mutable would keep its current meaning of single-domain, multiple-writers state and not require a read barrier, and in particular preserve the API. (I count systhreads as single-threaded for this purpose, since here it means “sharing the same minor heap”.)

Programs could progressively transition to other kinds of state when parallelising the program. Concretely, a data structure like Stack.t, instead of becoming racy, would keep its current meaning, but users could replace it with a linear stack or a concurrent stack, two data structures distinct from the first one, when parallelizing their programs.
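To make the distinction concrete, here is a hypothetical sketch of the kinds of signatures I have in mind (invented names; nothing of the sort exists today):

```ocaml
(* Today's Stack.t would keep its single-domain meaning... *)
module type SINGLE_DOMAIN_STACK = sig
  type 'a t                        (* must stay within one domain *)
  val create : unit -> 'a t
  val push : 'a -> 'a t -> unit
  val pop : 'a t -> 'a option
end

(* ...while a separately-typed concurrent stack is what one would
   reach for when parallelizing a program. *)
module type CONCURRENT_STACK = sig
  type 'a t                        (* safe to share between domains *)
  val create : unit -> 'a t
  val push : 'a -> 'a t -> unit    (* internally synchronised *)
  val pop : 'a t -> 'a option
end

(* The single-domain one is just the stdlib stack under a narrower
   contract; only the concurrent one would pay for synchronisation. *)
module SingleDomainStack : SINGLE_DOMAIN_STACK = struct
  type 'a t = 'a Stack.t
  let create = Stack.create
  let push = Stack.push
  let pop = Stack.pop_opt
end
```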

So how could this fit with the current plans? It is not entirely clear to me. If people start to rely on parallelism in an unstructured way (e.g. with no clear distinction between the kinds of data types arising from different ways of ensuring thread-safety), then one will also lose the ability to retrofit ConcMinor in a backwards-compatible manner (by losing the information that the current mutable API is single-threaded). The API breakage of ConcMinor, which might only be virtual right now (if I trust this preliminary, not fully explored idea), will become real. (Further difficulties arise with the emulation of the Thread library with domains, but this could be changed later.)

But if users are provided in advance with a general direction for a model of control of interference, this might happen differently. And eventually having such a model is desirable in any case, as it helps with parallelizing programs (for instance, the Firefox people reported that they had attempted and failed twice to parallelise the CSS engine in C++ before succeeding with Rust). Furthermore, in an imaginary retrofitting of ConcMinor, one could enforce something like the Send trait at the level of the read barrier until there is a better way (there would be two kinds of barriers, one of which would raise an exception if a value happened to be incorrectly shared across domains, and which would not be required in the FFI).

I find ConcMinor interesting from a systems programming perspective compared to the stop-the-world collector because it could (I hope) offer possibilities such as having a low-latency domain communicating with a higher-latency domain. Moreover the performance cost of the read barrier might be lower in this scheme if it could be removed for all but the concurrent data structures.

TL;DR

The current API and programs that use it remain unchanged, mutable does not require a read barrier. Single-threading is ensured by other means (for instance with single-threaded multiple-writers state as it exists in Rust, though static control of interference arrives later). This has the added benefit that data structures keep their current semantics. To share state between threads, a new kind of state is introduced, with a read barrier and the more complicated API.

Edit (12/06)

The previous version stated that Cell does not have the Send trait, but this is incorrect. Thank you to @stedolan for pointing this out. The analogy I wanted to make is between OCaml’s mutable and Rust’s &Cell, which does not have the Send trait (i.e. Cell is not Sync). More generally, the situation in Rust is more complicated than I sketched. The ownership of Cell itself is tracked, so it can be sent to other threads (but not shared between threads). With a true analogue of Cell under ConcMinor, there would be new opportunities for promoting lazily if necessary, such as when the program tries to obtain a &Cell from a Cell, which is necessary before any access to its contents. So a read barrier on &Cell could still be avoided with this more expressive variant (again, according to this preliminary, not fully explored idea).

3 Likes