Multicore Update: April 2020, with a preprint paper

Thanks @samwgoldman for the useful comments. Will incorporate the suggestions in the next draft.

Regarding Streamflow: Streamflow uses BiBoP, a separate data structure sitting next to the heap that encodes which slots in a page are free; Streamflow does not use a header for objects in the heap. Since OCaml values already have headers, we store free/used information directly in the header, without the need for a separate BiBoP data structure. Will clarify this in the text.


@jjb, as @kayceesrk notes, we haven't done any rigorous experiments towards this, but I was curious if you had any insight into doing this in other systems (i.e. what prompted the question)? At first glance, the L3 cache sizes are roughly the right order of magnitude to have a go, and we may even be able to fit into lower caches. But then we have contrasting experiences from the Coq team, who seem to recommend enormous minor heap sizes in the current sequential runtime, so cache locality seems less of a concern there vs allocation throughput.


(Keep in mind that I'm not an expert in this territory.)

My experience with sequential OCaml systems is that some seem to be fastest with last-level-cache sized (or slightly smaller) minor heaps, and some do much better with larger ones. IME it seems to come down to the promotion rate. Some systems don't need to promote much to the major heap even with a cache-sized minor heap, and if so the cache effects of keeping the minor heap in cache seem to be meaningful. But if there is a lot of promotion with a small minor heap, then that cost obliterates the cache effects and you lose.

In the multicore world, I suspect that the situation is more nuanced, as the hardware cache coherence implementation comes into play more. I really don't know, but maybe there are significantly different effects on that performance when the minor heap stays in cache. I'm thinking that there might be something there based on very rough extrapolation from what I see some of my colleagues writing C++ backend services doing. So I'm just saying that I think there is a question with no obvious answer, which isn't saying very much really, but I'd be curious about the answer, and I guess others in the ICFP community would too.


Thanks for the paper and the work. It's always nice to have digestible high-level design descriptions.

One thing that I didn't get from the paper is how exactly ConcurMinor breaks the current FFI, and the impact it would have on the existing ecosystem, on a scale from "it affects all projects" to "only people doing that fancy thing" :-)?

At the end of the paper it seems you make the point that ParMinor is the solution to go with for the time being. Does this mean you are going to leave behind the work done on ConcurMinor, or do you intend to continue to maintain it?

From that perspective, can the two minor schemes reasonably be expressed as pluggable modules, or are the concrete differences in their interplay with the major heap and the domains too intricate for that?


Intuitively, it seems that aging should alleviate some of this problem. If you completely run out of space in the minor heap because it's full of live values, there's no solution: you have to push data to the major heap. But if you have available space (i.e. enough non-live values to collect), aging the live data should help keep it around long enough until it is no longer live. In the meantime, you benefit from the caching effects.

Since most values are not shared, I'm not sure this is an issue. It seems like fitting the minor heaps in the architecture's various cache sizes is even more important, since some caches are usually shared between cores. An over-allocating domain can therefore ruin the cache locality of other cores unless its minor heap is restricted.

One thing that I didn't get from the paper is how exactly ConcurMinor breaks the current FFI, and the impact it would have on the existing ecosystem, on a scale from "it affects all projects" to "only people doing that fancy thing" :-)?

All the projects that use the C API. The details are here: C API changes · ocaml-multicore/ocaml-multicore Wiki · GitHub

At the end of the paper it seems you make the point that ParMinor is the solution to go with for the time being. Does this mean you are going to leave behind the work done on ConcurMinor, or do you intend to continue to maintain it?

We don't intend to maintain it. It is quite a bit of work to maintain and port the changes across two different GCs. ParMinor is now at the 4.11 branch point (the default multicore compiler is now 4.10 + ParMinor). ConcMinor is at 4.06.1.

Given that ConcMinor breaks the C API, the ecosystem would have to be fixed for ConcMinor to be useful. The code changes are indeed intricate; the differences are not just in the minor GC but also in the compiler's internal use of the C API. It would be quite a bit of work to keep both GCs in the same source distribution.


Now that we have Sandmark, one interesting MSc/MTech project may be to study the effect of cache sizes on performance. This is fairly easy to do with the s option of OCAMLRUNPARAM, and we may see some interesting results.
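As a rough sketch of how such a sweep could start (the benchmark binary name is hypothetical; OCAMLRUNPARAM's s option sets the minor heap size):

```shell
# Sweep a few minor-heap sizes; `s` takes a size in words, with k/M suffixes.
for size in 64k 256k 1M 8M; do
  export OCAMLRUNPARAM="s=$size"
  echo "running with OCAMLRUNPARAM=$OCAMLRUNPARAM"
  # time ./bench   # hypothetical Sandmark-style benchmark binary
done
```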


I do not think this is necessarily true.

Here is why I think so, but be warned that this is preliminary as I do not have time to explore this idea further on my own at the moment.

State in Rust

Breaking the C API is a consequence of deciding that all single-threaded shared mutable state must be assumed to also be shared between threads. So a new read barrier is used to promote values when they are read from another thread. But for data types that were correct up to now, users must from now on also be careful to avoid races, for instance by avoiding sharing values of such types between domains.

One lesson of Rust is that there are different kinds of mutable state, for different usages, with different means to achieve thread-safety.

The closest there is to current OCaml's mutable is the notion of single-threaded multiple-writers mutable state (&Cell). It is made thread-safe in Rust by statically preventing values containing &Cell from crossing thread boundaries (by virtue of not having the Send trait). The same restriction is used to make some data structures more efficient by avoiding the cost of synchronisation (cf. the reference-counting pointer Rc vs. the atomic reference-counting pointer Arc).

This is not enough by itself, and Rust offers other kinds of state for communicating and sharing values between threads.

UnsafeCell has similarities with OCaml multicore's mutable (though yours is safe thanks to the work on the memory model): it is used to get raw pointers to the data, which have almost no restrictions and can be sent across domains, but the user is likewise told to "avoid data races". It is rarely used alone; together with type abstraction it can be used to program safe concurrent data structures.

Lastly, the default notion of state in Rust is linear state, which can be sent freely across threads. Thread-safety is ensured by restricting aliasing using the ownership and borrowing discipline.

A backwards-compatible concurrent collector?

If I had to imagine a backwards-compatible OCaml with static control of interference à la Rust based on ConcMinor, it would distinguish the three kinds of state (concretely, with other keywords in addition to mutable). mutable would keep its current meaning of single-domain, multiple-writers state and not require a read barrier, and in particular preserve the API. (I count systhreads as single-threaded for this purpose, since here it means "sharing the same minor heap".)

Programs could progressively transition to other kinds of state when parallelising the program. Concretely, a data structure like Stack.t, instead of becoming racy, would keep its current meaning, but users could replace it with a linear stack or a concurrent stack, two data structures distinct from the first one, when parallelising their programs.

So how could this fit with the current plans? It is not entirely clear to me. If people start to rely on parallelism in an unstructured way (e.g. with no clear distinction between the different kinds of data types arising from different ways of ensuring thread-safety), then one will also lose the ability to retrofit ConcMinor in a backwards-compatible manner (by losing the information that the current mutable API is single-threaded). The API breakage of ConcMinor, which might only be virtual right now (if I trust this preliminary, not fully-explored idea), will become real. (Further difficulties arise with the emulation of the Thread library with domains, but this could be changed later.)

But if users are provided in advance with a general direction for a model of control of interference, this might happen differently. And eventually having such a model is desirable in any case, as it helps with parallelising programs (for instance, the Firefox people reported that they had attempted and failed twice to parallelise the CSS engine in C++ before succeeding with Rust). Furthermore, in an imaginary retrofitting of ConcMinor, one could imagine enforcing something like the Send trait at the level of the read barrier until there is a better way (there would be two kinds of barriers, one of which would raise an exception if a state happened to be incorrectly shared across domains, and would not be required in the FFI).

I find ConcMinor interesting from a systems programming perspective compared to the stop-the-world collector because it could (I hope) offer possibilities such as having a low-latency domain communicating with a higher-latency domain. Moreover the performance cost of the read barrier might be lower in this scheme if it could be removed for all but the concurrent data structures.

TL;DR

The current API and programs that use it remain unchanged, mutable does not require a read barrier. Single-threading is ensured by other means (for instance with single-threaded multiple-writers state as it exists in Rust, though static control of interference arrives later). This has the added benefit that data structures keep their current semantics. To share state between threads, a new kind of state is introduced, with a read barrier and the more complicated API.

Edit (12/06)

The previous version stated that Cell does not have the Send trait, but this is incorrect. Thank you to @stedolan for pointing this out. The analogy I wanted to make is between OCaml's mutable and Rust's &Cell, which does not have the Send trait (i.e. Cell is not Sync). More generally, the situation in Rust is more complicated than I sketched. The ownership of Cell itself is tracked, and so it can be sent to other threads (but not shared between threads). With a true analogue of Cell under ConcMinor, there would be new opportunities for promoting lazily if necessary, such as when the program tries to obtain a &Cell from a Cell, which is necessary before any access to its contents. So a read barrier on &Cell could still be avoided with this more expressive variant (again, according to this preliminary, not fully-explored idea).
