Parallel performance of an OCaml program: ocaml-5.0.0 Vs 4.13.1

I am not particularly interested in people trying to optimize the program.

Hot take: the problem with your benchmark is that it is not clear whether you used Domains correctly, or whether you made programming mistakes that limit the scalability of the Domains version. For example, you mention that the benchmark displays on the x axis the number of compute domains, but that there are also 2 mostly-idle domains used for control. This is a performance antipattern for Domains programming, and it is hard to know whether this invalidates the performance results or not. I already made this point last week: No Domain.maximum_domain_count() in the stdlib - #28 by gasche.
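To make the antipattern concrete, here is a minimal sketch (not your code, and not a claim about how Parany works) of one way to avoid idle control domains: the main domain takes a share of the work itself, so the total number of domains never exceeds `Domain.recommended_domain_count ()`. The name `parallel_sum` and the chunking scheme are hypothetical, chosen only for illustration, and the sketch assumes OCaml 5.

```ocaml
(* Sketch, assuming OCaml 5: sum [f k] for k in [lo, hi) using at most
   [Domain.recommended_domain_count ()] domains in total, counting the
   main domain. [parallel_sum] is a hypothetical helper name. *)
let parallel_sum (f : int -> int) (lo : int) (hi : int) : int =
  let n = max 1 (Domain.recommended_domain_count ()) in
  let len = hi - lo in
  let chunk i =
    (* the i-th chunk covers the half-open range [a, b) *)
    let a = lo + i * len / n and b = lo + (i + 1) * len / n in
    let acc = ref 0 in
    for k = a to b - 1 do acc := !acc + f k done;
    !acc
  in
  (* spawn n-1 helper domains; the main domain computes chunk 0 itself
     instead of sitting idle while it waits for the others *)
  let helpers =
    List.init (n - 1) (fun i -> Domain.spawn (fun () -> chunk (i + 1)))
  in
  let mine = chunk 0 in
  List.fold_left (fun s d -> s + Domain.join d) mine helpers
```

With this shape, "number of compute domains" and "number of domains" are the same quantity, which makes a scalability plot easier to interpret.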

This benchmark is a comparison of a finely-tuned multi-process version developed over time by a domain expert (you) and a hastily-, naively-written multi-domain version developed by a newbie (you). Good luck interpreting that.

What I gather from this interaction and your experiment is that:

  • It may very well be the case that, for many sorts of embarrassingly-parallel problems, multiprocess concurrency works just as well as or better than shared-memory concurrency. Others have said this in the thread before. In fact, the whole OCaml community has been saying this for the last 20 years: before the Multicore OCaml effort, we had clear explanations of the fact that the absence of a shared-memory concurrent runtime was not an issue for good parallel performance for some common categories of programs. These explanations remain valid even now that people have poured considerable work into moving to a concurrent runtime.
    It is still interesting to get yet another confirmation of this (modulo the doubts about your Domains-using code which may be too naive), and maybe it can help recalibrate people’s expectations about the performance benefits of Multicore.
    This also explains why upstream-OCaml insisted so much on compatibility of sequential performance, imposing many difficult/painful performance constraints onto the Multicore OCaml developers. If multiple sequential processes are faster for your use-case, then just keep using that with OCaml 5.

  • The programming model for Domains is not as easy as it looks from a distance. People like @UnixJunkie, with previous experience in parallel programming (much more than I have), seem to get it wrong. There are two things to distinguish here:

    • The current libraries exposed for using Domains in the stdlib and elsewhere are young and barebones, so it is to be expected that they are not so easy to use; for now you need Domains expertise to use them well. We will develop better libraries/abstractions over time, broadening the share of OCaml programmers able to use them easily/confidently. (Parany is a step in the right direction from this perspective, assuming you can grow Multicore expertise or get help from another expert.)
    • The programming model is harder than it looks, due in part to design choices made mid-way through Multicore development for retro-compatibility reasons (which has its own strong benefits). This is something that personally I only realized somewhat late in the release process for OCaml 5.0 (I used to assume that “don’t use many domains” meant “100 is wasteful but okay in practice”), and I think that there are aspects of this that we are not yet collectively aware of.
      In particular, the performance on contended systems (such as my laptop devoting a lot of compute time to Firefox tabs) is not very well understood, and may disappoint or thwart “simple” approaches to domain management and efficient concurrent code. (I will not be surprised when I meet the first heavily-optimized parallel program full of busy-waiting loops that does splendidly on controlled benchmarks and erratically becomes dog slow on my actual machine.) We will probably have more slightly-disappointing surprises down the road; let’s manage our expectations accordingly.
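To illustrate the kind of busy-waiting loop I have in mind, here is a small sketch contrasting a spinning wait with a blocking one. It assumes OCaml 5, where `Mutex` and `Condition` are available from the standard library; the `flag` type and the `make`, `signal`, `wait` and `spin_wait` names are hypothetical, made up for this example.

```ocaml
(* Sketch, assuming OCaml 5. All names here are hypothetical. *)

(* Busy-waiting: the waiter keeps a core busy until the flag is set.
   Cheap on an idle machine; on a contended one it burns CPU time that
   the domain meant to set the flag may need. *)
let spin_wait (a : bool Atomic.t) : unit =
  while not (Atomic.get a) do Domain.cpu_relax () done

(* Blocking: the waiter sleeps on a condition variable until signalled. *)
type flag = { m : Mutex.t; c : Condition.t; mutable set : bool }

let make () = { m = Mutex.create (); c = Condition.create (); set = false }

let signal (f : flag) : unit =
  Mutex.lock f.m;
  f.set <- true;
  Condition.broadcast f.c;   (* wake every waiter *)
  Mutex.unlock f.m

let wait (f : flag) : unit =
  Mutex.lock f.m;
  (* re-check the predicate in a loop: wake-ups may be spurious *)
  while not f.set do Condition.wait f.c f.m done;
  Mutex.unlock f.m

let () =
  let f = make () in
  let d = Domain.spawn (fun () -> wait f; 42) in
  signal f;
  assert (Domain.join d = 42)
```

The blocking version pays a system-call cost on each wake-up, which is why the spinning version tends to look better on a quiet benchmark machine; which one wins on a loaded laptop is exactly the part that is not yet well understood.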