Will OCaml 5+ multicore be fragile?

I’ve been following the progress of OCaml multicore – more from a community/end user perspective rather than an internal technical perspective.

I’ve been noticing that a lot of multicore developers seem have a lot of technical discussions around the memory model, various load/store issues (acquire, relaxed, sequential etc.), interaction with C volatile etc. on GitHub.

I don’t pretend to understand most of the discussions. However what strikes me is that there are a lot of places in the PR discussions where experts seems to not be sure about the most optimal and correct approach.

I am worried about what this means for the overall robustness of the OCaml 5+ runtime. From an outside perspective, to me at least, OCaml 5 multicore runtime seems to be a tight rope walk. There seem to be a lot of places where subtle bugs could lurk.

Are there possibilities of making the OCaml multicore runtime a bit slower but more easier to reason about? Or is the complexity simply not reducible? Alternatively is the complexity “worth it” ?

TL;DR Have we exceeded the complexity budget? Are multithreaded garbage collectors in 2022 that also work on RISC-V/AAarch64/Power always going to be so subtle?

I guess I’d like to know from people who know more about this issue cause I don’t know enough about this.

This question is intended to spark a discussion and help me understand in a qualitative sense if my concerns are valid or not.

6 Likes

I am not knowledgeable anough to answer your questions, but just to clarify a point that took me a while to fully grasp: many of the difficulties are not about understanding one memory model (which is challenging enough), but about the interactions between different memory models. In fact, within the OCaml toolchain there are three memory models that interact: the memory model of the underlying architecture (when using the native-code backend), the C memory model (since the runtime is written in C) and the OCaml memory model as specified in https://kcsrk.info/papers/pldi18-memory.pdf).

Cheers,
Nicolas

4 Likes

I am also not knowledgeable enough to give you a definitive answer, but I’ll add that memory model and multi-core in general is hard stuff.

I’ve followed similar threads in other communities, mainly Linux/FreeBSD/OpenBSD kernel and saw similar patterns, it’s hard work and the level of uncertainty on ocaml-multicore is on paar with those, nothing significantly different.

I believe what you observe is more an artifact of the thing in itself (multicore is complex stuff) than the level of resources of the ocaml-multicore people, which seems way more than sufficient, both in time and competence.

5 Likes

Another important point to keep in mind is that a disproportional amount of care is spent on the pathological situations where things went wrong or very close to wrong.

The runtime needs to handle racy accesses on mutable data structures to provide some guarantees for the programmers in those problematic situations. This is needed at the very least for debugging for people that accidentally felt in those corners cases. And in those situations weaving a coherent and actionable view of the state between the processor cores, the C memory model and the OCaml memory model is a subtle and difficult endeavor. In particular, the interaction with the “catch fire on data race” semantics of C seems hard to tame within the FFI (in other words, data races in C bindings will not be supported).

However, the guideline for ordinary programs is that they should steer far away from racy programs and the subtleties of the memory model whenever possible. And without data races, all this complexity is not a concern.

The memory model is here if you absolutely need to squeeze the last drop of performance, and you have enough time to prove your program correct, or are ready to debug complex and non-deterministic bugs that only appear on some CPU architectures.

10 Likes

Just to provide an “on the other hand”, on the other hand, once you do write multithreaded programs, you tend to find that all the possible problems that could happen, do happen. Multithreaded code is extremely subtle, even in relatively tame circumstances. Separate threads have a habit of messing each other up, getting deadlocked, starved and what-not in the most unusual ways.

4 Likes

From the users perspective, OCaml memory model provides a simple guarantee:

DRF-SC: Data race free programs remain sequentially consistent.

The users should be able to use sequential reasoning to understand parallel programs’ behaviour. That is, the users may assume that the execution of parallel programs proceeds one step at a time where at each step, one of the available threads is non-deterministically chosen and one action from that thread is executed. More details are in the WIP manual page. In many cases, higher-level libraries such as domainslib further shield the programmer from the usual pitfalls of parallel programming such as deadlocks.

The technical discussions that you mention on OCaml GitHub around various relaxed memory access, interaction with C, volatile, etc, are all towards ensuring that DRF-SC can be guaranteed to the user while also being efficient. As a user, I wouldn’t worry about the complexity of this discussion.

As an analogy, you may compare this with the OCaml type checker. As someone who is far from knowledgeable about the OCaml type system implementation, the guts of the discussion on the corner cases may seem complex. But from the user’s perspective, OCaml provides a simple guarantee:

Type safety: Well-typed programs do not go wrong.

As a user, I only rely on the above guarantee and not the guts of how the implementation ensures that guarantee.

13 Likes

This is a great analogy and helped me think more about this issue.

On the whole I agree with this analogy but I have a point of discussion: Type checking while very complex to the outsider has some simplifying properties:

If a Type checker gives an error on one compile, it will (almost always) give the same error on another compile.

In other words Type checking is usually quite deterministic. This means that the barrier to entry to a bug in the type checker to an outsider reasonably versed in type checker theory, while high, is not impossibly high.

And if any type checker is not deterministic, I will posit that compiler for that language can always be modified to make type checking deterministic without significantly affecting the expressiveness of that computer language.

This property of repeatability that makes bug fixing in a type checker much easier is not present in multi-threaded programs that go wrong. A multicore runtime may fail very rarely under circumstances that are difficult or impossible to repeat or understand. This actually gets even worse:

To recover some aspect of repeatability you could record the run of the OCaml multicore under rr on AMD64/Aarch64 – this again would catch some but not all possible bugs mainly because rr serializes all threads.

These “heisenbugs” could be due to a quirk or bug in the huge software and hardware stack that the OCaml runtime depends on.

So I guess what my complaint is that that OCaml (and other similar multicore runtimes) will suffer from heisenbugs. It seems while we are gaining parallelism, we are losing (some) debuggability and reliability.

I really don’t know what the solution is other than constantly adding test suites, using simulators and things like TLA+ and thinking about the problem more and more deeply.

That looks good in theory but in practice as a user I write a fair amount of C FFI code and from that perspective I do in fact worry about the complexity of these discussions – or rather by the fact that, for now, I understand very little of it – since it feels they become relevant in that setting.

OCaml has always provided foolproof rules to follow on the C FFI side. When I started programming in OCaml, I simply followed the rules without understanding them and that worked mostly well.

But with the time I got a fair understanding of the runtime system and thus understood the rules which is much better when things go wrong. For that reason that’s certainly an understanding I would like to regain at some point. At this point it’s unclear to me how much of these technical discussions I need to understand in order to do so.

4 Likes

Unrefined thoughts about this:

  • The 5.0 runtime was actually simplified quite a bit compared to the first iterations of the Multicore OCaml design; the process of trading generailty for simplicity (and less bugs) has been going on over the many years the multicore runtime evolved.
  • Yes, certainly many bugs are lurking! It’s a complete rewrite of the runtime, and while a lot of testing has been going on, it is far from being battle-tested as the previous runtime was, so clearly the 5.0 release is going to be rough and unearth some issues we haven’t found about. Note that this is a comment about the current state of the implementation, not about systemic fragility.
  • Yes, multicore runtimes are very complex beasts, especially if you try to (1) be generalist, not make language-level simplification that rule out certain usage patterns, and (2) remain performance-competitive on sequential programs. (2) is a very hard requirement to impose on a parallel runtime, and it is admirable that Multicore OCaml is managing so well. At this point (pre-first-release) we very rarely find notable performance regressions compared to the sequential runtime.
6 Likes

This is not exactly your question, but I think that we also have foolproof rules for Multicore FFI. Let me give it a try:

  • when you can, operate on shared immutable values or on non-shared (uniquely-owned) mutable values
  • when you need to share mutable state, you need synchronization, and all sequentially-consistent synchronization mechanisms are safe

Dragons lies when you try to use weaker consistency modes from C to operate on GC-managed values. It is probably not a good idea to try to do this.

1 Like

Thanks for trying to help but I don’t understand your rules (which values are we talking, OCaml values, C values ? which synchronization mechanisms are we talking about C or OCaml ones ?).

Let me give you a simple example. Here are short and trivial bindings to sqlite3 aswell as trivial instructions on how you should behave assuming you have a basic understanding of sqlite3 threading modes which you specify when you open your database connection.

At the moment I have absolutely no idea how to update these instructions in a meaningful way for multicore users and if my bindings need any change or became unsafe to use.

(That being said it may become clear once I finally get to read all the new manual sections which I never got to so far).

1 Like

There are two issues arising in this discussion. The original posting seemed to be concerned with the quality of ocaml-5’s implementation of parallelism. I know very little about that but I am glad to hear that the ocaml developers do talk about the difficulties of dealing with data races. (As an aside, the expression “data race” is itself slippery: in C it does not mean what the expression is generally taken to mean in other contexts, namely “incorrectly synchronized”. Operations on atomics with relaxed memory ordering technically do not suffer from a data race (ie undefined behavior) according to the C standard but operations on scalars with relaxed memory ordering are generally unsynchronized save at the hardware level and will, unless you really know what you are doing, cause your program to run incorrectly for lack of synchronization.)

The other issue is your one, namely the ease (or otherwise) of writing parallelized programs in the ocaml language. Here it is difficult to know what further can be done. As I understand it, sequential consistency is offered, so all threads when they do get to see a particular atomic operation will see all preceding atomic operations in the same order, which can make atomics easier to reason about[1], but any programming with atomics still requires some programmer knowledge. Beyond that, if you have a parallelizable computation you can get a long way by using higher order wrappers such as domainslib - just using Lwt_domain.detach will often be enough to do what you want.

[1] When programming in C I rarely use sequentially consistent memory ordering for atomics. Generally acquire/release semantics are enough and come practically for free on x86.

I was more concerned about the inherent dragons that lie in multicore approach that has been chosen by OCaml. In other words, is the complexity worth it? Are the bugs too subtle in the presence of so many complexities in the hardware/software memory models?

I’m pretty sure that the inherent quality of OCaml’s implementation will be good and only keep improving. But I was wondering if we had exceeded some kind of Goldilocks complexity budget.

The responses have been reassuring. In fact @gasche mentioned that older implementations of the multicore runtime were even more complex !

However there are requests for more features like terminating domains etc. in the future. Without any prejudice to this specific feature request, in general, OCaml should continue it’s powerful approach of saying no to features when the gain in complexity is much more than the gain in functionality.

It may not be useful to extend the analogy to a non-determinism. Non-determinism is a concern for parallelism but not for type systems.

But you are right in that non-determinism is a challenge for parallel programs in general. FWIW you may have non-determinism even in Rust which guarantees the absence of data races and in GHC Haskell software transactional memory (STM) which guarantees serializability. Deterministic parallel programming has had mixed success [1,2,3] due to the cost of enforcement, the challenges of underlying systems being non-deterministic, and the changes to the programming model. It is a nice research problem :slight_smile:

[1] Deterministic Parallel Java University of Illinois at Urbana-Champaign
[2] https://people.cs.umass.edu/~emery/pubs/dthreads-sosp11.pdf
[3] https://dl.acm.org/doi/10.1145/2034675.2034685

That’s an interesting example. The way I would hope it would work is that, assuming users (and your bindings) do not use Thread module threads but may use Domain module threads, then the C functions to be wrapped which are advertised as thread-safe in their documentation can be regarded as domain-safe when writing bindings for ocaml. Anything else would seem to make writing ocaml bindings for potentially multi-threaded C libraries somewhat intractable.

1 Like

I think I may have been misunderstood here. Your essential point, as I understood, was (paraphrasing) – just like we don’t need to understand the complex implementation of the type checker we don’t need to understand the subtle and complex internals of the multicore runtime in order to enjoys the guarantees it gives you. Treat it like a black box. Agree to some extent.

However, a multicore runtime developer can give less strong assurances to the end user about the robustness of runtime “black box” compared to, say, the type checker “black box”.

Let me explain: If there is a problem with type checker, I as an end user can simply provide the offending program to someone who knows the OCaml type checker and they can try to fix it. I will say “Hey this program should have type checked and it didn’t or I will say this program should have failed type check but it succeeded”.

OTOH if I experience a problem in the multicore runtime, it may be very difficult for me to explain and for you to repeat the circumstances in which something “bad” happened.

So when someone says “trust me – the implementation of the type checker is fine !” – my confidence high because I always know that even if something ever went wrong in the type checker, the issue is easier to repeat and hence easier for (someone) to fix.

When someone says “trust me – the implementation of the runtime is fine !” – my confidence is a bit lower. My confidence is lower simply as a result of the non-repeatability of any problem/bug that I may encounter in the field.

Of course, this is something inherent in the domain, nothing special about OCaml. It’s just that its a pain. Thanks for the engagement ! I am excited about multicore – thanks for all your hard work on it !

You asked about correctness rules in a FFI context, so I’m thinking about writing C code that manipulates OCaml values (and interacts with OCaml code manipulating those values). For C codes that manipulates C values, use the C rules. For OCaml code, you can use OCaml locks/mutex/whatever to provide mutual exclusion, and use Atomic.t for any shared mutable value whose access is not otherwise synchronized.

I think that the instructions are perfectly fine with the multicore runtime: if something is not thread-safe, then in particular it is not domain-safe, so users are not going to do anything unsafe if you tell them it is not thread-safe.

My understanding of the Sqlite documentation is that:

  • if users ask for mutex:Full (serialized mode), they can use their connections from multiple threads (and multiple domains) safely
  • if users ask for mutex:No (multi-threaded mode), they have to avoid sharing connections between threads (and between domains) or protect them with a lock.

(Your instructions just say “connections are not thread-safe”. Are they not thread-safe even if you pass mutex:Full? It looks your instructions are more restrictive than what the SQLite documentation suggest, so I guess I’m a bit confused / missing something here.)

More details: before OCaml applications using threads were “not quite multi-threaded” because of the runtime lock, a single mutator (OCaml code) was running at a time. The Multicore runtime does not have that restriction, and that is basically it. (OCaml threads running on the same domain share a runtime lock, but each domain has an independent runtime lock.) In particular, the threading/concurrency recommendations that do not assume a single runtime lock, that do not depend on the existence of a single runtime lock for correctness, they remain perfectly valid with Multicore OCaml. I believe that your documentation is an instance of this.

3 Likes

Yes they remain unsafe because rather than having to manually manage the life time of prepared statements which are, from SQLite’s point of view ressources[1], this OCaml connection abstracts that for you via a mutable LRU cache which, for now, is not protected against concurrent accesses.

It would likely be better to protect the abstraction along the No|Full instruction you actually pass to the underlying SQLite connection, but it was not useful to me at the time and I was in a hurry when I was writing this :–)

That was my intuition and I suspect most FFI code does – it would be interesting to find an occurence where it does not.

Btw. I just want to mention this. It was answered to me by @sadiq on a private channel this winter but I think it is useful for people to know.

At some point I started worrying a lot about code I have that lacks CAML{local,param} declarations for example say something like:

CAMLprim value pair (value x, value y)
{
   CAMLparam2 (x, y);
   value t = caml_alloc (2, 0);
   Store_field (t, x);
   Store_field (t, y);
   return t;
}

Pre-multicore the reasoning behind this kind of code was always of the form:

  1. A single thread is running OCaml code and I’m this thread at the moment.
  2. I know which operations can trigger a gc. If no such operation happens the pointers to the values I have cannot move so I don’t need to keep roots on them.

In the case above I started to worry that t could actually start moving under my feets since 1. is obviousy no longer true with multicore. @sadiq answered me that this was not the case because:

GC work can only happen at poll points or allocations […] the only thing that can be happening concurrently is marking and sweeping, neither of which could cause a problem there. […] Minor collections require all domains to be stopped (and that can only happen at poll and allocation points)


  1. Note that while you can dispose them explicitly the sqlite3-ocaml package treats them as finalized values. I think this is a bad idea as it means you may not be able to close your db (database busy error) before the gc has been triggered. But sqlite3-ocaml has a very long history and I’m pretty sure I would have made exactly the same design error at the time this was first written. ↩︎

2 Likes

The same point arose on this forum here: A tutorial on parallel programming in OCaml 5 - #6 by kayceesrk

2 Likes

And interesting update relating to Golang 1.19.

Go only provides sequentially consistent atomics, not any of the more relaxed forms found in other languages.

(See Go 1.19 Release Notes - The Go Programming Language )

We will all agree that Golang is a pretty performant runtime. The fact that they have made a decision to only provide sequentially consistent atomics is quite telling.

Reasoning about a relaxed memory model seems to be quite difficult in general. I would be curious – what if we simply converted everything to sequential consistency in OCaml multicore – how much of performance hit would we get for various architectures? I wonder if my thought experiment makes sense…