Parallel performance of an OCaml program: ocaml-5.0.0 Vs 4.13.1

I was always told “fork is faster than smp”, almost as conventional wisdom. It’s nice to see it visualized!

1 Like

BTW I just remembered something I wrote two years ago: it’s for the “programming language drag race” and I wrote it because OCaml’s score felt way too low. I never got around to actually submitting it, though. It further confirms this observation (but not visually):

(feel free to tear apart :grin:)

2 Likes

I do not read too much into the chart. There could be improvements to make to the program, or to the multicore runtime implementation, or both.

However, I often see the inherent synchronization costs of SMP runtimes proposed as the explanation. This should not be used to dismiss legitimate questions and explorations about the consequences of the design choices of the multicore runtime (for instance, a stop-the-world collector that empties all minor heaps whenever one minor heap is full).

3 Likes

I think it is a valid point. I very much do not embrace the « choose the right architecture for the job » argument here. If one has to resort to forks for a simple, performant map/reduce, it means the OCaml multicore abstraction leaks (time-wise).

It would be interesting to understand the Intrinsic Parallel Efficiency of an OCaml program with respect to, at least, the GC (minor and major) and the major allocator.

As far as I understand, the minor GCs 1) work in lock-step and are, as a consequence, 2) tail-sensitive (to the most-allocating domain among N). Also, the major allocator is a global parallel allocator which has to cope with generational copying.

Intrinsic Parallel Efficiency: maybe one way to characterize it would be the following (a rough code sketch follows the formulas below):

  • an embarrassingly parallel program (map only, N partitions), no sync, with a sequential version that is as identical as possible (easy for a map)
  • some functions whose computation is designed to generate, per run, an increasing number of minor GCs/sec (no major)
  • a variant that increasingly spills over to the major heap
  • a variant with unbalanced allocation (one allocation-heavy tail among N)
  • run on the single-threaded runtime (4.x) vs the multicore runtime (5.x)

PE (Parallel Efficiency): sequential-time(workload of size N) / (parallel-time(one size-1 partition) * N)
SR (Sequential Regression): sequential-time on 5.x / sequential-time on 4.x (assuming no other unrelated optimization)
IPE of the OCaml multicore runtime: PE * SR
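
For concreteness, here is a rough sketch of what such a kernel and its bookkeeping could look like (my own illustration, not anyone’s actual benchmark; it assumes domainslib, all names and timings are made up, and the PE computation is one reading of the formula above):

```ocaml
(* Embarrassingly parallel map with a tunable minor-GC pressure knob:
   [allocs] controls how many short-lived lists each element allocates. *)
let kernel ~allocs x =
  let acc = ref x in
  for _ = 1 to allocs do
    acc := List.fold_left ( + ) 0 [ !acc; 1; 2; 3 ]  (* dies young *)
  done;
  !acc

let sequential ~allocs a = Array.map (kernel ~allocs) a

let parallel ~num_domains ~allocs a =
  let open Domainslib in
  (* the pool also uses the calling domain, hence num_domains - 1 *)
  let pool = Task.setup_pool ~num_domains:(num_domains - 1) () in
  let b = Array.make (Array.length a) 0 in
  Task.run pool (fun () ->
      Task.parallel_for pool ~start:0 ~finish:(Array.length a - 1)
        ~body:(fun i -> b.(i) <- kernel ~allocs a.(i)));
  Task.teardown_pool pool;
  b

(* Worked example of the metrics, with made-up timings (seconds):
   sequential 5.x = 120, sequential 4.x = 110, parallel 5.x on 8 domains = 18. *)
let n = 8.0
let pe = 120.0 /. (18.0 *. n)  (* PE  ~ 0.83 *)
let sr = 120.0 /. 110.0        (* SR  ~ 1.09 *)
let ipe = pe *. sr             (* IPE ~ 0.91 *)
```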

Also mentioned in another discussion about domains: the tail latency (w.r.t. lock-steps) induced by thread scheduling, which may suggest a sensitivity to noisy neighbors / too many domains. What is not part of IPE: synchronisation costs, cache costs, NUMA, and parallel-algorithm efficiency, as those are concerns regardless of the language.

And to end on a definitely positive note: effects and local DRF will allow for a great parallel programming model (green threads, sync primitives, performant and bug-free lock-free algorithms, actor/CSP models?…)

It is still early days; the OCaml team has done an incredible job with multicore, and we will all benefit from gaining an understanding of the characteristics and the inevitable trade-offs made during this journey.

3 Likes

If you have proper statistics of a parallel implementation comparing ocaml-5 to ocaml<5, it would be interesting to share them.

The program is parallelized using parany, so in the 4.13.1 case, the parallelism is fork-based.
In 5.0.0, it uses Domains and Domainslib.Chan.
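
For readers who have not used them yet, a minimal sketch of that Domains + Domainslib.Chan shape (not the actual benchmark code; all names here are made up) could look like this:

```ocaml
(* A fixed set of worker domains pull jobs from one unbounded channel and
   push results to another; [None] is the end-of-input marker.
   Assumes nworkers >= 1; results may come back out of order. *)
let run_parallel ~nworkers ~work items =
  let open Domainslib in
  let jobs = Chan.make_unbounded () in
  let results = Chan.make_unbounded () in
  let worker () =
    let rec loop () =
      match Chan.recv jobs with
      | None -> ()
      | Some x -> Chan.send results (work x); loop ()
    in
    loop ()
  in
  let domains = List.init nworkers (fun _ -> Domain.spawn worker) in
  List.iter (fun x -> Chan.send jobs (Some x)) items;
  List.iter (fun _ -> Chan.send jobs None) domains;
  let out = List.map (fun _ -> Chan.recv results) items in
  List.iter Domain.join domains;
  out
```

The fork-based setup has the same overall shape, with worker processes instead of domains.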

Sorry, I didn’t spend much time digging, but is there a simple way to replicate your test?

Ideally some git repo that one could clone and then run some shell script(s) to run it within a switch that has OCaml 4 or OCaml 5.

My particular interest is to take a quick look at the OCaml 5 version and see if there is potentially some simple way to make it faster or to identify particular things to improve in OCaml 5. I make no promises, however. :smiley:

I need to share my test data file, that’s the only obvious thing missing.
This is a regular opam package, so installing it should be a matter of opam pin
to the git repository.
I am not particularly interested in people trying to optimize the program.
Rather, I would be more interested in stories of other parallel programs which have an ocaml<5 and an ocaml-5 version.

I am not particularly interested in people trying to optimize the program.

Hot take: the problem with your benchmark is that it is not clear that you used Domains correctly, or whether you made programming mistakes that limit the scalability of the Domains version. For example, you mention that the benchmark displays on the x axis the number of compute domains, but that there are also 2 mostly-idle domains used for control. This is a performance antipattern for Domains programming, and it is hard to know whether this invalidates the performance results or not. I have already made this point last week: No Domain.maximum_domain_count() in the stdlib - #28 by gasche.
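
To make this concrete, here is roughly the shape I mean, as a sketch rather than a fix to your program: size the worker pool with Domain.recommended_domain_count () and let the main domain take a full share of the work, instead of keeping mostly-idle control domains around.

```ocaml
(* Hypothetical example: summing an array in parallel without any extra
   control domains. The total number of domains (including the main one)
   never exceeds Domain.recommended_domain_count (). *)
let process_chunk a lo hi =
  let s = ref 0 in
  for i = lo to hi - 1 do s := !s + a.(i) done;
  !s

let parallel_sum a =
  let n = Array.length a in
  let d = min (Domain.recommended_domain_count ()) (max 1 n) in
  let chunk = (n + d - 1) / d in
  let bounds i = (i * chunk, min n ((i + 1) * chunk)) in
  (* spawn d - 1 workers; chunk 0 is processed by the main domain itself *)
  let workers =
    List.init (d - 1) (fun i ->
        let lo, hi = bounds (i + 1) in
        Domain.spawn (fun () -> process_chunk a lo hi))
  in
  let lo0, hi0 = bounds 0 in
  let here = process_chunk a lo0 hi0 in
  List.fold_left (fun acc w -> acc + Domain.join w) here workers
```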

This benchmark is a comparison of a finely-tuned multi-process version developed over time by a domain expert (you) and a hastily-, naively-written multi-domain version developed by a newbie (you). Good luck interpreting that.

What I gather from this interaction and your experiment is that:

  • It may very well be the case that, for many sorts of embarrassingly-parallel problems, multiprocess concurrency works just as well as, or better than, shared-memory concurrency. Others have said this earlier in the thread. In fact, the whole OCaml community has been saying this for the last 20 years: before the Multicore OCaml effort, we had clear explanations of the fact that the absence of a shared-memory concurrent runtime was not an obstacle to good parallel performance for some common categories of programs. These explanations remain valid now that considerable work has been poured into moving to a concurrent runtime.
    It is still interesting to get yet another confirmation of this (modulo the doubts about your Domains-using code which may be too naive), and maybe it can help recalibrate people’s expectations about the performance benefits of Multicore.
    This also explains why upstream-OCaml insisted so much on compatibility of sequential performance, imposing many difficult/painful performance constraints onto the Multicore OCaml developers. If multiple sequential processes are faster for your use-case, then just keep using that with OCaml 5.

  • The programming model for Domains is not as easy as it looks from a distance. People like @UnixJunkie with previous experience in parallel programming (much more than I have) seem to get it wrong. There are two things to distinguish here:

    • The current libraries exposed for using Domains in the stdlib and elsewhere are young and barebones, so it is to be expected that they are not so easy to use; for now you need Domains expertise to use them well. We will develop better libraries/abstractions over time, broadening the share of OCaml programmers able to use them easily/confidently. (Parany is a step in the right direction from this perspective, assuming you can grow Multicore expertise or get help from another expert.)
    • The programming model is harder than it looks, due in part to design choices made mid-way through Multicore development for retro-compatibility reasons (which have their own strong benefits). This is something that I personally only realized somewhat late in the release process for OCaml 5.0 (I used to assume that “don’t use many domains” meant “100 is wasteful but okay in practice”), and I think that there are aspects of this that we are not yet collectively aware of.
      In particular, the performance on contended systems (such as my laptop devoting a lot of compute time to Firefox tabs) is not very well understood, and may disappoint or thwart “simple” approaches to domain management and efficient concurrent code. (I will not be surprised when I meet the first heavily-optimized parallel program full of busy-waiting loops that does splendidly on controlled benchmarks and erratically becomes dog slow on my actual machine.) We will probably have more slightly-disappointing surprises down the road; let’s manage our expectations accordingly.
5 Likes

Note: we know that multi-processing can be as fast as or faster than multi-domain programming, but ideally multi-domain programming should not be too much slower. Given good benchmarks that show a larger-than-expected performance difference, we could study the multi-domain performance and try to improve the Multicore runtime – surely there are many areas for improvement left.

It is not clear to me that it’s a valuable use of our time to discuss bad benchmarks such as yours. On one hand, maybe we should try to focus first on code that follows the recommended Domain programming advice. On the other hand, people in the wild write wild code, and it’s less hassle for everyone if that also runs fast enough. My personal intuition is that we should still focus on “good” benchmarks first, and figure out library designs that make it easy to write “good” programs.

I haven’t gathered meaningful statistics, unfortunately. I remember this was somewhere between 4.10/4.10+domains and 4.12, respectively. So, by that time, multicore was still a WIP, but it was nearing the end of its development. I’m curious whether 4.14’s prefetching would improve this workload, whether the released multicore fares better, or whether using e.g. domainslib can be more optimal.

The files are self-contained, low-dependency, and near-identical, and the measurement code is in the Makefile, so hopefully it’s easier to reproduce and improve for anyone interested.

I wanted to functorize but that seems to have made the compiler lay out the code slightly worse and created a bit of a slowdown.

Just one thing to note here. There’s a fair bit of work to do, but we can probably alleviate the worst of the issues around having more domains than cores; it’s certainly not all set in stone.

One option is to have some number of GC domains (which might be dynamic from collection to collection) that’s smaller than the actual number of domains, and each GC domain does the work for one or more running domains.

The one bit I don’t think we can change (without breaking the C API) is that we need to wait for all domains to hit a safe point, which means that if you’ve got vastly more domains than cores (or are on a heavily loaded system) you might be waiting a while at every minor collection.

It would be interesting to establish the contribution of each factor to the drop in performance when there are more domains than cores.

1 Like

Note: it took my beginner self a long time to understand that the “waiting a while” occurring here is not just “you have to wait for each domain to reach a poll point, so the more domains, the longer the maximal wait”, but rather “the OS may deschedule one of the domains, and everyone will be blocked for two or three orders of magnitude longer”.

3 Likes

In the early days of virtualization in enterprise TP, people would run Java apps that used Paxos and Lamport-Liskov distributed systems (hence required periodic heartbeat messages for liveness) in VMs that overcommitted memory on their host, that is, with more VM-declared memory than actual memory on the host machine. Boy howdy, so much fun when the VM’s pages got paged out by the host: since the host didn’t know any better, it could and would page out the pages associated with the threads that were responsible for the heartbeat.

Even with multi-minute heartbeat timeouts, many significant users of these systems saw all sorts of craaazy breakage as VMs paged-out, were declared dead, then came back and responded, were declared alive, etc, etc, etc.

1 Like

Was it considered to empty just the minor heap which is full?

Can this behavior be changed via an environment variable?

For a parallel program, blocking all threads is a very bad thing (it’s called a barrier).
I guess, if only a single heap is emptied, this barrier could be shorter, which might be a good thing for parallel performance.

I don’t think people are asking to have many more domains than cores.
There is a recommended value, fine.

I’m asking this question b/c I don’t know the answer, and maybe it’ll be enlightening for other readers. When you say a “safe point”, I’m reminded of the “safe points” in other multi-thread-friendly GCed runtimes, like for some JVMs, where “safe point” means that a thread has parked all its out-of-representation pointers in known in-representation locations, so that a GC can move around the stuff those pointers reference. And this happens pretty frequently – it isn’t something that you have to wait forever for. Is this the same meaning of “safe point” that you’re using in the multicore runtime? Or is something different meant?

I would make a distinction between a “safe point”, where we know that we can run a GC, and a “poll point”, where we actually check whether we need to run a GC (or handle other asynchronous events: signals, monitoring callbacks, etc.). Poll points occur at each allocation, which happens very often indeed, and some extra poll points are added on loop back-edges to (try to) guarantee that we never wait forever.
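
A toy illustration of the difference (not taken from the runtime or from any benchmark):

```ocaml
(* [spin] never allocates, so the only poll points it reaches are the ones
   the compiler inserts for it (on the recursive call / loop back-edge);
   [churn] allocates on every iteration, so it polls at each allocation.
   Either way, each domain can respond promptly when the other one
   requests a stop-the-world minor collection. *)
let rec spin n = if n = 0 then () else spin (n - 1)

let churn n =
  let r = ref [] in
  for i = 1 to n do r := i :: !r done;
  List.length !r

let () =
  let d = Domain.spawn (fun () -> spin 200_000_000) in
  ignore (churn 1_000_000);
  Domain.join d
```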

Yes. This was our earlier design, which allows the minor heap arenas of each domain to be independently collected. It breaks the C API. The details are in the ICFP 2020 paper, Retrofitting Parallelism onto OCaml (ICFP 2020).

It should be clarified that the barrier itself is not a problem. The latest research in Java land, which has a long history of parallel GCs (LXR: Low-Latency, High-Throughput Garbage Collection, PLDI 2022, distinguished paper award winner), shows that brief stop-the-world sections work surprisingly well for complex tasks (such as evacuation) when compared to state-of-the-art, industrial-strength, pauseless (concurrent) compacting collectors such as Shenandoah and ZGC. From the LXR paper:

LXR’s design premise is that regular, brief stop-the-world collections will yield sufficient responsiveness and far greater efficiency than concurrent evacuation

The stop-the-world sections in OCaml 5 do a limited amount of work (bounded by the size of the minor heap). The stop-the-world minor GC is parallel and copying – only the live objects are touched (unlike a mark-and-sweep collector, which needs to touch all objects whether live or dead). And typically the survival rate in the minor collection is less than 10%.
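
If you want to eyeball that survival rate on your own workload, here is a rough single-domain sketch (an illustration only; the <10% figure above is not derived from it):

```ocaml
(* promoted_words / minor_words over a run of [f] approximates the fraction
   of minor-heap allocation that survived into the major heap. *)
let survival_rate f =
  let before = Gc.quick_stat () in
  f ();
  let after = Gc.quick_stat () in
  let minor = after.Gc.minor_words -. before.Gc.minor_words in
  let promoted = after.Gc.promoted_words -. before.Gc.promoted_words in
  if minor = 0.0 then 0.0 else promoted /. minor

let () =
  let rate =
    survival_rate (fun () ->
        (* mostly short-lived allocations, so a low survival rate is expected *)
        for _ = 1 to 100_000 do
          ignore (Sys.opaque_identity (List.init 10 Fun.id))
        done)
  in
  Printf.printf "survival rate: %.1f%%\n" (100.0 *. rate)
```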

Overall, the design of the GCs is not the problem here. It is the issue of over-commitment. There are solutions in GHC where the number of threads that participate in the GC is set to a number different from the number of HECs (analogous to domains). These would be useful to explore before attempting to switch to concurrent minor collection designs where each domain can independently collect its minor heap. Such a GC breaks an awful lot of libraries in the ecosystem, and they are not easy to fix (no sed script will work). This is a non-trivial amount of work.

I had a look at your article. I just glanced through it, to be honest; this is a very specialized article.
I understand that all the constraints you had to satisfy are very hard to meet and that there were some tough choices and compromises.

Perhaps the important bits are these.

Concurrent minor GC, which permits minor heaps of domains to be independently collected, requires a read barrier.

From section 1.2,

While this allows users to write fast code, the API also bakes in the invariants of the GC. For example, reading any OCaml object, be it mutable or not, using the C API does not involve a read barrier, and is compiled as a plain read of memory. A new GC scheme that adds a read barrier only to reads of mutable fields will need to deprecate the old API or suffer the risk of breaking code silently. Either way, the users will have to modify their code to work correctly under the new GC. Given that the compiler does not check incorrect uses of the C API, it is already difficult to write correct and efficient FFI code. We would like to strike a balance between the added complexity of the C API and the performance impact of the missed opportunities.

Essentially, the Field macro needs to include a read barrier and also becomes, critically, a GC safe point where the GC can move objects. This is not the case in sequential OCaml. As a result, Field(v,i) cannot be an l-value either. Users will not only need to change their code but will also need a careful audit of all C FFI uses in order to ensure that the assumptions about when GCs run aren’t broken. It is possible that we can design a clever static analysis tool over the C source code to analyse the easy safety cases. But this is an open research question.

OTOH, we experimentally validated that the stop-the-world parallel minor GC performs as well as (and in some cases better than) the concurrent minor GC (see results in Fig 11 and Fig 12). So our initial fears that stop-the-world GC would fare quite poorly against the concurrent collector were not well founded. Of course, the caveat is that we are not testing under over-committed environments.

I should say that the doors are not shut for innovation here. There are a number of things that we can try to improve the performance. Many of the criticisms found in the discuss forum are valid. Given that the problems are challenging and the resources are limited, we will need to prioritise and work on these.

What we’ve done in OCaml 5.0 is to ensure that we haven’t made any design choices that change the semantics – no breakages. This allows the community to move to multicore easily, and to take advantage of parallelism with well-understood limitations. We can now start looking at more interesting designs that might potentially break APIs if the performance improvements are worth it. If we had broken the APIs and imposed limitations, I think our users would have been more cross than they are now :slight_smile:

10 Likes