The peak performance on this computer is reached by ocaml-4.13.1.
The program is parallelized using parany, so in the 4.13.1 case, the parallelism is fork-based
and values are serialized by Marshal.
In 5.0.0, it uses Domains and Domainslib.Chan (Marshal not needed).
I don’t claim this is the best way to write this program in parallel.
I don’t claim this is reflective of other users use-cases.
What the program does, in essence:
read some file
apply an autocorrelation function to each element read from that file
write the resulting vector for each element into an output file
I had to do some work so that the program parallelizes better.
In essence, I make the parallel work function do more things (coarser grain), the input function
should do the least work; same for the output function.
I might test later this week on another computer, with more computing threads.
For my personal records, so that I can rerun this later:
for j in `seq 1 5`; do for i in `(echo 1; seq 2 2 24) | sort -n -r`; do echo "NCORES: "$i; (acp4 -i data/IDH1_1conf.ph4 -o idh1_par.txt -np $i -csize 50 2>&1) | grep Hz; done; done
Is this workload intrinsically partitionable ? If so, then it shouldn’t be surprising that 4.13.1 beats 5.0, since partitioning will always incur less overhead than multithreading for truly partitionable workloads, right?
You cannot use 5 sample points and expect to measure a meaningful standard deviation even if the distribution of your sampled data is exactly a gaussian. More generally, standard deviation is not that meaningful without the hypothesis that the underlying distribution is a gaussian, and in presence of scheduling I expect this hypothesis to be false.
But yes, overall if you don’t need synchronisation between processes, process parallelism should be faster than shared memory parallelism since it means that you don’t have any synchronisation overhead.
Also, when reporting parallel results, it’s worth reporting the CPU type, # cores, hyperthreads-per-core, etc. Basically, the contents of /proc/cpuinfo , or output of lscpu .
Also, to follow along the path @octachron started down, you might want to construct a series of test inputs, each one doubling in size, and then run each scenario a large # of times on the smaller inputs (which will not take very long) and then as the inputs get larger, you will know whether you can reduce the # of runs, by watching the spread (variance? Not a prob/stats guy) in results.
I was not expecting the variance to be so huge and I wanted to minimize the number of experiments, so that the bench runs reasonably fast; hence the “only” 5 experiments per data point.
But, we are not here to discuss the fine points of statistics, what I see is what I call a “performance regression bug”. And, this is not good news.
If other people have parallel programs that they can run end-to-end, and test against ocaml-5 and before, this might be useful data.
You are not reporting a performance regression since you are comparing different parallelization methods? Process parallelism is expected to be faster in absence of communications and synchronization between processes and it is still available in OCaml 5.
TL;DR if your workload is “embarrassingly partitionable”, it is highly unlikely you will be able to do better using multithreading, than you can do using process parallelism. This is known, and is the case for basically every language runtime in the universe: OCaml isn’t special here.
Can you describe the pattern of communication ? From your description, it seems like it might be something like “scatter-gather” – that is, a main thread scatters work units to workers, who process and return the results to the main thread, who assembles them? If
the work required to scatter, and then to reassemble, is small compared to the work to process each work-unit
if there is no other mutable shared state
there isn’t some massive data-set that needs to be shared among all workers (even this can be accommodated, if that data-set can be preloaded before workers are forked)
then it can often be more-efficient (and more-performant) to use process-parallelism.
These sorts of workloads have a name: “embarrassingly partitionable”, and it is because it doesn’t take much to convert them to run well using multiple processes. And if you can do that, it will always, always be faster than using multithreading, and for many, many reasons.
Commercial transaction-processing software vendors have explored many of these issues, as have HPC code implementors. Multithreading isn’t actually the best way to get peak performance: it’s just often a good way to get decent performance out of an SMP without working insanely hard.
Even here, I would note that there’s a difference between SMP and “multicore”. The latter term (in the commercial world) is synonymous with “NUMA” (non-uniform memory access) and naive multithreading (any form of multithreading that doesn’t take into account which memory banks are used to store various data, and their relation to the processor sockets where threads run to access that data) is known to perform suboptimally on NUMA machines. And these days, there are a whole lot more NUMA machines that you might think, b/c Moore’s Law for frequency scaling has stopped.
BTW I just remembered something I wrote two years ago: it’s for the “programming language drag race” and I wrote it because OCaml’s score felt way too low. Never got to actually submitting it though. It further confirms this observation (but not visually):
I do not read too much into the chart. There could be improvements to make to the program, or to the multicore runtime implementation, or both.
However, I see proposed around a lot the explanation of inherent synchronization costs of SMP runtimes. This should not be used to discard legitimate questions and explorations about consequences of the design choices of the multicore runtime (for instance, a stop-the-world collector that empties all minor heaps whenever one minor heap is full).
I think it is a valid point. I very much do not embrace the argument saying « choose the right architecture for the job » here. If one has to resort to forks for a simple performant map/reduce it means Ocaml multicore abstraction leaks (in time).
It would be interesting to understand the Intrinsic Parallel Efficiency of an Ocaml program with respect to at least the GC (minor and major) and the major allocator.
As far as I understand the minor GCs work in 1) lock-step and are as a consequence is 2) tail sensitive (to the most allocating among N). Also the major allocator is a global parallel allocator which has to cope with generational copy.
Intrinsic Parallel Efficiency: Maybe one way to characterize would be:
an embarrassing parallel program (map only, N partitions) , no sync, with an as much as possible identical sequential version (easy for map)
some functions with computation designed to generate, per run, an increasingly number of minor GC/sec (no major)
a variant with spill out to major, increasingly
a variant with unbalanced allocation (1 tail among N)
using single threaded (4.x) vs multi runtime (5.x)
PE (Parallel Efficiency): sequential-time-workload-sizeN / (parallel-time-workload-size1 * N)
SR (Sequential Regression): sequential-time-workload-5.x / sequential-time-workload-4.x (assuming no other unrelated optimization) IPE of the Ocaml Multicore runtime: PE * SR
Also mentioned in another discussion about domains: the tail latency (wrt lock steps) induced by thread scheduling, which may suggest a sensitivity to noisy neighbors / too many domains. What is not part of IPE is synchronisation nor cache costs nor NUMA, nor // algorithm efficiency, as they are concerns regardless of the language.
And to end on a definitely positive note: effects and local DRF will allow for a great parallel programing model (green threads, sync primitives, performant and bug-free lock-free algorithms, actor/CSP model?…)
It is still early days, Ocaml team has done an incredible job with multicore, we will all benefit from gaining an understanding in the characteristics and inevitable trade-offs made during this journey.
I need to share my test data file, that’s the only obvious thing missing.
This is a regular opam package, so installing it should be a matter of opam pin
to the git repository.
I am not particularly interested in people trying to optimize the program.
Rather, I would be more interested by stories of other parallel programs which have an ocaml<5 and an ocaml-5 version.
I am not particularly interested in people trying to optimize the program.
Hot take: the problem with your benchmark is that it is not clear that you used Domains correctly, or whether you made programming mistakes that limit the scalability of the Domains version. For example, you mention that the benchmark displays on the x axis the number of compute domains, but that there are also 2 mostly-idle domains used for control. This is a performance antipattern for Domains programming, and it is hard to know whether this invalidates the performance results or not. I have already made this point last week: No Domain.maximum_domain_count() in the stdlib - #28 by gasche.
This benchmark is a comparison of a finely-tuned multi-process version developed over time by a domain expert (you) and a hastily-, naively-written multi-domain version developed by a newbie (you). Good luck interpreting that.
What I gather from this interaction and your experiment is that:
It may very well be the case that, for many sorts of embarrassingly-parallel problems, multiprocess concurrency works just as well or better as shared-memory concurrency. Others have said this in the thread before. In fact, the whole OCaml community has been saying this for the last 20 years: before the Multicore OCaml effort, we had clear explanations on the fact that the absence of a shared-memory concurrent runtime was not an issue for good parallel performance for some common categories of programs. These explanations remain valid even now that people have poured the considerable work of moving to a concurrent runtime.
It is still interesting to get yet another confirmation of this (modulo the doubts about your Domains-using code which may be too naive), and maybe it can help recalibrate people’s expectations about the performance benefits of Multicore.
This also explains why upstream-OCaml insisted so much on compatibility of sequential performance, imposing many difficult/painful performance constraints onto the Multicore OCaml developers. If multiple sequential processes are faster for your use-case, then just keep using that with OCaml 5.
The programming model for Domains is not as easy as it looks from a distance. People like @UnixJunkie with previous experience in parallel programming (much more than I do) seem to get it wrong. There are two things to distinguish here:
The current libraries exposed for using Domains in the stdlib and elsewhere are young and barebones, so it is to be expected that they are not so easy to use; for now you need Domains expertise to use them well. We will develop better libraries/abstractions over time, broadening the share of OCaml programmers able to use them easily/confidently. (Parany is a step in the right direction from this perspective, assuming you can grow Multicore expertise or get help from another expert.)
The programming model is harder than it looks, due in part to design choices made mid-way through Multicore development for retro-compatibility reasons (which has its own strong benefits). This is something that personally I only realized somewhat late in release process for OCaml 5.0 (I used to assume that “don’t use many domains” meant “100 is wasteful but okay in practice”), and I think that there are aspects of this that we are not yet collectively aware of.
In particular the performance on contended systems (such as my laptop devoting a lot of compute time to Firefox tabs) is not very well understood, and may disappoint or thwart “simple” approaches for domain management and efficient concurrent code. (I will not be surprised when I meet the first heavily-optimized parallel program full of busy-waiting loops that does splendid on controlled benchmarks and erratically becomes dog slow on my actual machine). We will probably have more slightly-disappointing surprises down the road; let’s manage our expectations accordingly.
Note: we know that multi-processing can be as fast or faster as multi-domain programming, but ideally multi-domain programming should not be too much slower. Provided good benchmarks that show a larger-than-expected performance difference, we could study the multi-domain performance and try to improve the Multicore runtime – surely there are many areas for improvement left.
It is not clear to me that it’s a valuable use of our time to discuss bad benchmarks such as yours. On one hand, maybe we should try to focus first on code that follows the recommended Domain programming advice. On the other hand, people in the wild write wild code, and it’s less hassle for everyone if that also runs fast enough. My personal intuition is that we should still focus on “good” benchmarks first, and figure out library designs that make it easy to write “good” programs.
I haven’t gathered meaningful statistics unfortunately. I remember this was somewhere between 4.10/4.10+domains and 4.12 for each, respectively. So, by that time, multicore was still WIP but it was nearing the end of its development. I’m curious if 4.14’s prefetching would improve this workload, or if the released multicore fares better, or if using e.g. domainslib can be more optimal.
The files are self-contained low-dependency and near-identical, and the measurement code is in the Makefile, so hopefully it’s easier to reproduce and improve for anyone interested.
I wanted to functorize but that seems to have made the compiler lay out the code slightly worse and created a bit of a slowdown.