[ANN] Moonpool 0.3

:wave: dear OCaml aficionados,

Moonpool 0.3 was just released on opam. Moonpool is a new concurrency library for OCaml >= 4.08, with support for OCaml 5 from the get-go. It started out with a thread pool (which can be spread over multiple domains to make use of multiple cores) along with a future/promise module.

This release comes with a set of new features on top of pool+futures:

  • a small 'a Lock.t abstraction to protect a resource with a lock in RAII-style
  • a type of unbounded channels (which are fairly naive in implementation)
  • improvements to Pool such as Pool.run_wait_block: (unit -> 'a) -> 'a that runs a whole computation on the pool, and waits for its result (or re-raises)
  • Fut.await (OCaml 5 only)
  • support for domain-local-await, if it is installed
  • a Fork_join module for, well, fork-join parallelism, including parallel for and parallel List.map/Array.map. These computations can be nested and “feel” like writing code in a direct style. This relies on effects and is only available on OCaml 5.
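As a rough illustration of the 'a Lock.t abstraction, here is how it might protect a shared accumulator. The operations Lock.create, Lock.update, and Lock.get are the ones appearing in the π example further down; treat the exact signatures as a sketch rather than the definitive API.

```ocaml
open Moonpool

(* A counter protected by a lock; concurrent tasks can update it
   without a data race. Signatures sketched from the examples below. *)
let counter : int Lock.t = Lock.create 0

(* Each call takes the lock, applies the function, then releases the lock. *)
let add n = Lock.update counter (fun c -> c + n)

let current () : int = Lock.get counter
```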

Examples for fork-join

The (too) classic parallel fibonacci function:

open Moonpool
let (let@) = (@@)

let rec fib_direct x =
  if x <= 1 then
    1
  else
    fib_direct (x - 1) + fib_direct (x - 2)

let rec fib x : int =
  (* some cutoff for sequential computation *)
  if x <= 18 then
    fib_direct x
  else (
    let n1, n2 =
      Fork_join.both
        (fun () -> fib (x - 1))
        (fun () -> fib (x - 2))
    in
    n1 + n2
  )

let fib_40 : int =
  let@ pool = Pool.with_ ~min:8 () in
  Pool.run_wait_block pool (fun () -> fib 40)

A parallel sum, from a test case:

let () =
  let total_sum = Atomic.make 0 in
  Pool.run_wait_block pool (fun () ->
      Fork_join.for_ ~chunk_size:5 100 (fun low high ->
          (* iterate on the range sequentially. The range should have 5 items or less. *)
          let local_sum = ref 0 in
          for i = low to high do
            local_sum := !local_sum + i
          done;
          ignore (Atomic.fetch_and_add total_sum !local_sum : int)));
  assert (Atomic.get total_sum = 4950)

Note that Fork_join.for_ gives its functional argument a range to process, whose size is controlled by the optional chunk_size. This allows large iteration counts to be passed to for_ without spawning as many tasks, as demonstrated below:

Computing digits of π:

let my_pi : float =
  let@ pool = with_pool () in

  let num_steps = 100_000_000 in
  let num_tasks = Pool.size pool in

  let step = 1. /. float num_steps in
  let global_sum = Lock.create 0. in

  Pool.run_wait_block pool (fun () ->
      Fork_join.for_
        ~chunk_size:(3 + (num_steps / num_tasks))
        num_steps
        (fun low high ->
          let sum = ref 0. in
          for i = low to high do
            let x = (float i +. 0.5) *. step in
            sum := !sum +. (4. /. (1. +. (x *. x)))
          done;
          let sum = !sum in
          Lock.update global_sum (fun n -> n +. sum)));

  let pi = step *. Lock.get global_sum in
  pi

Here the Lock is not a performance issue: there are only num_tasks chunks (i.e. roughly as many as your CPU has cores) processed by the for_, so there are only around 8 updates to the lock in total, not 100_000_000 updates, which would create a lot of contention.


Random thoughts from reading the release note:

  • It’s a bit weird that you designed earlier versions of the library to work for both OCaml 4 and OCaml 5, but now you are adding 5-only features. Why can’t you provide an implementation of await : 'a t -> 'a on OCaml 4? Naively, I would guess that you could use a condition variable to signal that a future is resolved.

  • I think that if you start relying more and more on 5-only primitives, you should probably drop the idea that the library is compatible with OCaml 4. Either you really want to play with algebraic effects in a way that cannot be emulated with green threads, or you don’t. Your current approach places the burden of this choice onto your users, who get none of the fun and all of the decision fatigue.

  • It looks like you are reimplementing parts of the domainslib API. Do you think that your library should subsume Domainslib? Why would one rather use Domainslib? Relatedly:

    • Domainslib has a find version of for_, namely parallel_find : ... -> (int -> 'a option) -> 'a option; see the issue “support enough primitives to implement exists and for_all in parallel” (ocaml-multicore/domainslib#89) on GitHub. Have you considered adding something like this? More generally, maybe you should try to cover the Domainslib API?
    • Now that you also support fork-join parallelism, it is easy to write benchmarks comparing the overhead of your library and Domainslib. Have you done this? Can you share the results? (I would be curious to see scaling plots for an increasing number of domains, for each library, using different sequential-cutoff values.)
  • It’s a bit weird that you designed earlier versions of the library to work for both OCaml 4 and OCaml 5, but now you are adding 5-only features.
    Why can’t you provide an implementation of await : 'a t -> 'a on OCaml 4?
    Naively, I would guess that you could use a condition variable to signal that a future is resolved.

There’s Fut.wait_block since 0.1, with the big caveat that it blocks
the caller thread. It’s not going away, and you can use pools and
futures on OCaml 4.xx without issues (you just won’t get parallelism).
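The condition-variable approach suggested above is indeed roughly what such a blocking wait looks like. Here is a self-contained sketch (it needs only the threads library); all names here are hypothetical and are not Moonpool's actual internals:

```ocaml
(* A blocking future for OCaml 4: a condition variable signals
   resolution. Names are hypothetical, not Moonpool's internals. *)
type 'a fut = {
  mutex : Mutex.t;
  cond : Condition.t;
  mutable state : 'a option;
}

let create () =
  { mutex = Mutex.create (); cond = Condition.create (); state = None }

(* Resolve the future and wake up all waiters. *)
let fulfill fut x =
  Mutex.lock fut.mutex;
  fut.state <- Some x;
  Condition.broadcast fut.cond;
  Mutex.unlock fut.mutex

(* Block the calling thread until the future is resolved. *)
let wait_block fut =
  Mutex.lock fut.mutex;
  while fut.state = None do
    Condition.wait fut.cond fut.mutex
  done;
  let x = match fut.state with Some x -> x | None -> assert false in
  Mutex.unlock fut.mutex;
  x
```

The effect-based Fut.await, by contrast, can suspend a task without blocking an OS thread, which is what the OCaml 5-only variant buys you.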

New things like Fut.await and Fork_join provide more convenient
primitives (or, in the case of fork-join, a brand new API), but they
rely on effect handlers and are therefore only available on OCaml 5.

  • I think that if you start relying more and more on 5-only primitives, you should probably drop the idea that the library is compatible with OCaml 4.
    Either you really want to play with algebraic effects in a way that cannot be emulated with green threads, or you don’t. Your current approach places the burden of this choice onto your users, who get none of the fun and all of the decision fatigue.

Why? I can see two different use cases (for myself, even — I try to
design libraries that I can use). One is to provide a decent thread pool
that works on OCaml 4.xx and 5.xx, thus providing a migration path in
the future; the other is to provide composable parallelism constructs
for OCaml 5.xx users (a bit like OpenMP for C/C++/Fortran), say.

The decision fatigue is only: if you use OCaml 4.xx, you have to ignore
parts of the API that are clearly marked with “this is OCaml 5 only”. I
feel that it’s similar in spirit to things like “let operators are
available on OCaml >= 4.08 only” in, for example, containers.

If you use OCaml 5, you can use fork-join for direct style parallelism,
or futures for starting computations in the background and getting
results later. That’s it, I don’t see where the decision fatigue would
be.

  • It looks like you are reimplementing parts of the domainslib API. Do you think that your library should subsume Domainslib?
    Why would one rather use Domainslib? Relatedly:

Subsuming would be very ambitious here :-). Domainslib is written by
domain experts (pun intended) and is also evolving. It might be better
for very small-grained parallelism because the scheduler uses more
advanced data structures for work-stealing.

I have not, but it’s a good idea, I think. I suppose find is a bit
special because of its short-circuiting property.

Alternatively, a library of such constructs could be written in a way
that doesn’t depend on the actual scheduler; I think DLA + a “run_async”
function would be enough.
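To make the short-circuiting point concrete, here is a sketch of such a find built on top of a Fork_join.for_-style primitive, using the interface from the examples in the announcement; treat the exact signatures as assumptions. An Atomic publishes the first result found, and chunks check it so they can stop scanning early:

```ocaml
open Moonpool

(* Sketch of a short-circuiting parallel find on top of Fork_join.for_,
   assuming the interface shown in the announcement's examples. *)
let parallel_find ~chunk_size n (f : int -> 'a option) : 'a option =
  let result = Atomic.make None in
  let found () =
    match Atomic.get result with Some _ -> true | None -> false
  in
  Fork_join.for_ ~chunk_size n (fun low high ->
      let i = ref low in
      (* scan this chunk, bailing out as soon as any chunk has found a result *)
      while !i <= high && not (found ()) do
        (match f !i with
        | Some _ as r ->
          (* publish the first result; losers of the race simply stop *)
          ignore (Atomic.compare_and_set result None r : bool)
        | None -> ());
        incr i
      done);
  Atomic.get result
```

exists and for_all then fall out as parallel_find returning an option on the predicate.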

  • Now that you also support fork-join parallelism, it is easy to write benchmark to compare the overhead of your library and Domainslib.
    Have you done this? Can you share the results? (I would be curious to see scaling plots for an increasing number of domains,
    for each library, using different sequential-cutoff values.)

I have not benchmarked against other libraries yet. My primary goal is to
make something useful for me, and flexible enough to cover a set of use
cases (wider, I think, than Domainslib’s, since Domainslib is really not
designed for IO thread pools or multiple pools).

Last week I tried to port a benchmark @dinosaure was working on, in
which a tool would hash all files inside a directory (transitively). It
didn’t take long and yielded fairly nice speedups (the switch is 1.8GiB):

$ hyperfine --warmup=1 './hash.sh ~/.opam/4.14.0/ -seq' './hash.sh ~/.opam/4.14.0 -j 4' './hash.sh ~/.opam/4.14.0 -j 8' './hash.sh ~/.opam/4.14.0 -j 12'
Benchmark 1: ./hash.sh ~/.opam/4.14.0/ -seq
  Time (mean ± σ):      2.176 s ±  0.010 s    [User: 1.915 s, System: 0.255 s]
  Range (min … max):    2.157 s …  2.193 s    10 runs
 
Benchmark 2: ./hash.sh ~/.opam/4.14.0 -j 4
  Time (mean ± σ):     646.8 ms ±   9.3 ms    [User: 2271.8 ms, System: 408.1 ms]
  Range (min … max):   636.9 ms … 664.3 ms    10 runs
 
Benchmark 3: ./hash.sh ~/.opam/4.14.0 -j 8
  Time (mean ± σ):     400.1 ms ±   9.9 ms    [User: 2379.6 ms, System: 536.5 ms]
  Range (min … max):   385.0 ms … 418.0 ms    10 runs
 
Benchmark 4: ./hash.sh ~/.opam/4.14.0 -j 12
  Time (mean ± σ):     363.0 ms ±  14.9 ms    [User: 3061.4 ms, System: 630.4 ms]
  Range (min … max):   343.2 ms … 391.0 ms    10 runs
 
Summary
  ./hash.sh ~/.opam/4.14.0 -j 12 ran
    1.10 ± 0.05 times faster than ./hash.sh ~/.opam/4.14.0 -j 8
    1.78 ± 0.08 times faster than ./hash.sh ~/.opam/4.14.0 -j 4
    6.00 ± 0.25 times faster than ./hash.sh ~/.opam/4.14.0/ -seq

(It’s not purely CPU-bound, since there is also IO, so I’d have trouble
commenting on the impact of filesystem access vs. time spent hashing,
etc., but these are good results for what amounts to creating a pool and
turning a List.map into a Fork_join.list_map.)
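For reference, the core of such a port is roughly the following sketch; Fork_join.list_map's argument order and the use of Digest for hashing are assumptions on my part, not Moonpool's confirmed API:

```ocaml
open Moonpool

(* Hash each file in parallel: the sequential version would be a
   List.map; here it becomes a Fork_join.list_map run on the pool.
   Argument order and helper names are assumptions. *)
let hash_files pool (files : string list) : (string * Digest.t) list =
  Pool.run_wait_block pool (fun () ->
      Fork_join.list_map (fun file -> (file, Digest.file file)) files)
```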


Actually, I think there should be a separate repo/project to benchmark the various parallelism libraries. There are already plenty of contenders: domainslib, parany, miou, moonpool, maybe Eio’s worker pool, etc.