[ANN] moonpool 0.1

Most sincere salutations,

I’m happy to announce the release of moonpool 0.1. Moonpool, so far, is mostly an experiment, but is in a usable state already.

So what is moonpool? It’s my go at starting to leverage OCaml 5 for multicore computations. Unlike other approaches, it relies heavily on classic Thread.t, because unlike domains it’s ok to create many of them and have some block on IO or long-running C calls. A pool provides a run: (unit -> unit) -> unit function that runs the task (its argument) on one of the pool’s workers at some later point.

Moonpool works by allocating, at startup, a fixed pool of domains, of the recommended size.[1] From there, the user can create a number of regular thread pools, each of which will be distributed over the pool of domains so that the threads can run in parallel. It’s perfectly possible to have, on a 16-core machine, a pool of 50 IO threads for some server, along with a pool of 16 compute threads.
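Setting up the pools from that scenario could look like this (a rough sketch; pool names are illustrative):

open Moonpool

let io_pool = Pool.create ~min:50 ()      (* e.g. IO threads for a server *)
let compute_pool = Pool.create ~min:16 () (* one compute thread per core *)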

Moonpool also provides a Future abstraction. These futures are thread-safe; the combinators such as map, bind, etc. can themselves run on pools. Futures are quite lightweight and use an Atomic.t variable for storage; no lock is needed.[2]
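A quick sketch of what that looks like (illustrative values; exact signatures are in the docs):

open Moonpool

let pool = Pool.create ~min:4 ()

(* spawn a computation; [sum] will resolve on one of the pool's workers *)
let sum : int Fut.t = Fut.spawn ~on:pool (fun () -> 1 + 2 + 3)

(* combinators can themselves run on a pool *)
let doubled : int Fut.t = Fut.map ~on:pool (fun x -> x * 2) sum

(* blocking wait, called "from the outside" *)
let () = Printf.printf "%d\n" (Fut.wait_block_exn doubled)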

Lastly: moonpool also works on OCaml >= 4.08, by simply reducing to regular thread pools running on a single domain. This should allow users to use moonpool on 4.xx, before migrating to 5.xx on their own time.

Documentation is here. It’s released under the MIT license.

Contributions or discussions are very welcome. This is very early days for this project, and I have lots to learn. The task scheduler is quite simple and will probably not compete with domainslib on very fine-grained tasks; but for use cases where tasks are not that tiny I think it works perfectly fine already.


  1. basically Domain.recommended_domain_count()-1 on OCaml 5. ↩︎

  2. except for Fut.wait_block, which is like an entrypoint and should be called “from the outside”. More in the documentation. ↩︎

28 Likes

After discussing with some people, it looks like I failed to describe the mental model for moonpool :slight_smile:

The model is: you can create one or more pools of workers. These workers are preemptive, not cooperative, based on Thread.t[1]. The pool, unlike in domainslib, can contain hundreds of workers even on an 8-core machine; you can also have multiple pools without having to coordinate so that their sizes add up to 8.

Once you have a pool, you can use Pool.run pool (fun () -> do_some_work); … to run a task on the pool. You can also use Fut.spawn ~on:pool (fun () -> compute_something) to get a 'a Fut.t that will resolve later into the result of compute_something. There’s a bunch of future combinators, but overall, that’s it.

Pools are regular resources and can be shut down, which prevents any further task from being scheduled on the pool and waits until all pending tasks are done.
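Putting the model together, a minimal sketch (task bodies are placeholders):

open Moonpool

let () =
  let pool = Pool.create ~min:4 () in
  (* fire-and-forget task *)
  Pool.run pool (fun () -> print_endline "hello from a worker");
  (* a task whose result we want back *)
  let fut = Fut.spawn ~on:pool (fun () -> 21 * 2) in
  Printf.printf "result: %d\n" (Fut.wait_block_exn fut);
  (* no new tasks accepted; waits for pending tasks to finish *)
  Pool.shutdown pool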

My personal intuition is that moonpool has an advantage over, say, domainslib, when there’s a mix of blocking IOs and computations in a program. If you’re writing a brand new program with Eio, you might not need it (or possibly just for the computation part); if you’re trying to parallelize some existing code that touches Unix or the standard channels, or some C bindings that might block (e.g. sha’s file hashing) or might release the local runtime lock, then moonpool is simple enough to use.


  1. Threads are mostly preemptive: they can be preempted in most loops, allocation points, and function calls. I might be wrong on the details but I know that in practice they’ll be preempted if need be. ↩︎

4 Likes

Interesting approach! I’m trying to understand what happens to a Thread.t when it gets blocked waiting for some I/O. Does it get pre-empted and let some other Thread.t run in the domain? Or does a blocked thread block the domain?

The thread is pre-empted normally, just as on OCaml 4.xx. This means you can have a (moon)pool of hundreds of threads performing blocking IO, or just regular compute threads with the occasional blocking IO (i.e. a lot of regular, direct-style OCaml code).

One of my goals was to be able to have a pool for compute (n threads for n cores, spread across the fixed pool of n domains) along with some IO pools (httpd, lsp, etc.) that have their own threads on these same domains.

3 Likes

It’s really great (and refreshing) to see an API for parallelizing things that you can wrap your head around in just a few minutes! Thanks for this beautiful, simple library :slightly_smiling_face:

3 Likes

My personal intuition is that moonpool has an advantage over, say, domainslib, when there’s a mix of blocking IOs and computations in a program.

This is quite reasonable. Domainslib is really for nested parallel computations that are not expected to perform serious blocking IO all the time. Glad to see libraries that address the different execution models.

Have you considered implementing DLA support in moonpool? See ocaml-multicore/domain-local-await on GitHub, a scheduler-independent blocking mechanism. This would allow moonpool to be used alongside other libraries that implement DLA, such as domainslib and Eio. CC @polytypic.

2 Likes

Yes, it seems like an interesting idea. Having await on futures would also be quite convenient. The per-thread API seems to be a good fit for moonpool.

To do this while preserving compatibility with 4.xx I need to use some sort of preprocessor, but that’s doable.

Indeed, yes. I forgot that moonpool also works on the 4.xx compiler.

The interface of Domain_local_await does not explicitly expose a dependency on effect handlers or anything of the sort, so I would expect that it is possible to implement a 4.x version of it.

(I found that looking at the domainslib PR to implement DLA support made it easier to get an idea of how to implement such support in a scheduler.)


@c-cube: from your description of the library, it is not immediately obvious to me why there are separate pools. If the idea is to fix the number of worker domains once and for all, why use several pools? My understanding so far is that the reason to create several pools is to be able to easily “wait on a pool” – wait for all the work items of the pool to be completed. You mention having separate pools for IO threads and computation threads, but what are the benefits of this if they run on the same underlying domains anyway?

If the main feature of pools is indeed to wait/join on all workers at once, I would consider calling them “worksets” rather than “pools”. In any case, documenting the reason for using several pools would be helpful.

1 Like

If the idea is to fix the number of worker domains once and for all, why use several pools?

The number of domains is fixed, but pools are pools of threads (Thread.t), not domains. In fact, given the current Domain API, all these domains do is wait for instructions to start new worker threads :sweat_smile:.

The reason you may want to have multiple pools is because they might block/wait on different things. Some pools might block on IO (which takes a thread out until it gets a result); some might blockingly wait for a future to resolve; some might do some specific kind of C call that releases the(ir) runtime lock (calling a solver, hashing a file, doing some expensive cryptography, etc.).

Having multiple threads on a single domain is, I think, actually a potential benefit, because it means there’s a better chance that the domain won’t be stuck in a hard blocking syscall (an unlucky Mutex.lock, an IO read on the network, etc.) and thus unable to participate in minor GC synchronizations.

In particular, for blocking until a future is resolved, there’s a warning in the docs: waiting in a blocking way for a future to resolve is dangerous if you do it on the same pool the future runs on. Blocking this way might create a deadlock (all workers are waiting for futures that can only be resolved by these very workers). The easiest solution is to never block in the same pool, and use the monadic future combinators instead (or await, when it’s conditionally added[1]).
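To make the hazard concrete, here’s a sketch (the pool and future arguments are illustrative):

open Moonpool

(* Dangerous: a task on [pool] blocking on a future that also runs on
   [pool]. With enough such tasks, every worker is blocked waiting and
   no worker is left to resolve the futures: deadlock. *)
let risky pool (fut : int Fut.t) =
  Pool.run pool (fun () -> ignore (Fut.wait_block_exn fut))

(* Safe: transform the future with a combinator instead of blocking. *)
let safe pool (fut : int Fut.t) : int Fut.t =
  Fut.map ~on:pool (fun x -> x + 1) fut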

Something I’ve done in the past with thread pools is to make sure that if A blockingly waits on the future B, then A must run on a pool that is “downstream” from the pool where B runs. This means pools (or threads) have a partial order, or form a DAG, where the arrows mean “can block on”.
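Sketched out (hypothetical pool names; [expensive] is a placeholder computation):

open Moonpool

(* [downstream_pool] may block on futures from [upstream_pool],
   never the other way around. *)
let run ~upstream_pool ~downstream_pool (expensive : unit -> int) : int Fut.t =
  let fut_b = Fut.spawn ~on:upstream_pool expensive in
  Fut.spawn ~on:downstream_pool (fun () ->
      (* ok: [fut_b] resolves on a different, upstream pool *)
      1 + Fut.wait_block_exn fut_b)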

Thank you for the link on DLA in domainslib, I’ll look into it.


  1. indeed, the DLA support on 4.xx might still be tricky for this reason. You can’t just park the “await”-er thread, I think, it needs to do its share of work. ↩︎

Hello Simon,

Interesting library!

It might be nice to have some compatibility layer/module:

  • e.g. a function a la Parany.run
  • functions a la Parmap.parmap, Parmap.array_parmap, etc.

So that it is easy to port some code to use your library and be able to do some performance comparisons with other such libraries.

Regards,
F.

I’m not too familiar with parany. What are the type signatures of these functions? Do they block until the results are available?

I don’t understand. An individual thread in a pool may block on things, but you suggest that entire pools get blocked on IO or a blocking syscall or a synchronization operation? How does that happen?

My naive understanding is that all your pools share the same “global, fixed set of domains”, and any of those pools may get blocked at any point, leaving a domain available to work on some other thread of some pool (not necessarily the same pool). In this setting, I don’t see a clear difference between having two pools {T1, T2} and {T3, T4}, for example, instead of a single pool {T1, T2, T3, T4}: the utilization of compute resources should be the same in both cases. (Or am I missing something?) The only difference that I see is that having two pools gives a convenient way for a user to say “I want to wait/block until both T1 and T2 are done, but I don’t care about T3 and T4”.

I don’t understand. An individual thread in a pool may block on things, but you suggest that entire pools get blocked on IO or a blocking syscall or a synchronization operation? How does that happen?

I took a shortcut and conflated whole pools with their workers. If enough tasks are blocking on IO, all workers block on IO. For example, if the pool contains 20 threads and you schedule 5,000 tasks of the form “open an http connection, write this text, close the connection”, all 20 workers will end up blocked on IO.

My naive understanding is that all your pools share the same “global, fixed set of domains”, and any of those pools may get blocked at any point, leaving a domain available to work on some other thread of some pool (not necessarily the same pool). In this setting, I don’t see a clear difference between having two pools {T1, T2} and {T3, T4}, for example, instead of a single pool {T1, T2, T3, T4}: the utilization of compute resources should be the same in both cases. (Or am I missing something?) The only difference that I see is that having two pools gives a convenient way for a user to say “I want to wait/block until both T1 and T2 are done, but I don’t care about T3 and T4”.

Pools share the same underlying set of domains, which to me are mentally similar to CPU cores, indeed.

However, each pool has not only its threads, but also its own job queue. To expand on the example above: if you schedule 5,000 http requests on a pool, a little compute task like “hash this string” submitted to the same pool will only start after the 5,000 requests have been processed.

More realistically, hashing stuff could be something you start doing after the first queries come back. Say you want to verify checksums: you can have an IO pool that takes {url_file: string; url_checksum: string} and returns both the content of the file and its checksum; and a separate pool that takes (checksum * file_content) Fut.t and re-computes a checksum from the content.[1]
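A sketch of that two-pool pipeline (download and sha256 are hypothetical helpers, passed as arguments to keep the sketch self-contained):

open Moonpool

type job = { url_file: string; url_checksum: string }

let verify ~download ~sha256 ~io_pool ~compute_pool (j : job) : bool Fut.t =
  (* IO pool: fetch the expected checksum and the file content *)
  Fut.spawn ~on:io_pool (fun () ->
      (download j.url_checksum, download j.url_file))
  (* compute pool: re-hash the content and compare *)
  |> Fut.map ~on:compute_pool (fun (expected, content) ->
      String.equal expected (sha256 content))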

Another case is to have a “pool” with only one worker. I call this a “background thread” in some places. It’s typically a useful way to serialize operations on a hidden piece of data (in my case, a sqlite handle). You can submit operations db -> 'a and get back a 'a Fut.t; here having a separate queue and a single worker is important.
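A minimal sketch of such a background thread (the db value stands in for e.g. a sqlite handle):

open Moonpool

(* one worker => operations on [db] never overlap *)
let db_pool = Pool.create ~min:1 ()

let on_db (db : 'db) (op : 'db -> 'a) : 'a Fut.t =
  Fut.spawn ~on:db_pool (fun () -> op db)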

  1. this can be done via future combinators, because they take an optional pool to specify where the function for map or bind runs. So you could write something like download_file_and_checksum pair |> Fut.map ~on:compute_pool verify_checksum. ↩︎

2 Likes

You are assuming that, if pool P1 has 5,000 tasks and pool P2 has 10 other tasks, then these 10 tasks will get to run sooner than if we just added them at the end of pool P1. This sounds like a “fairness” assumption: separate pools will get comparable shares of domain compute resources, or at least no pool will be delayed too much from running its first tasks.

This “fairness” assumption is not mentioned in your documentation. If it is important to understand how to use your pools, it would be worth clarifying.

After looking at the implementation, here is what I understand:

  • each pool uses a fixed number of threads, all running simultaneously; if there are more tasks sent to the pool, they are delayed and will only get one of the pool threads when previous tasks have finished
  • separate pools run their separate threads simultaneously, so they compete for compute resources on their domain using OCaml’s systhreads scheduler – which does provide fairness in practice
  • as a result, running in a new pool enables quicker completion than adding to an existing pool (as we will be scheduled right away instead of waiting for previous tasks in our pool to free some threads)
  • the ratio of compute resources that each pool gets should be roughly proportional to its number of worker threads
3 Likes

That’s fair. It seemed obvious to me since different pools use different threads, but it might not be to someone who hasn’t looked at the implementation.

Your assessment of the code is correct, I think. Pools have a fixed size (for now, anyway), so worker threads compete for resources with threads from the same pool+domain, or from other pools. Bigger pools that are fed enough tasks (i.e. are under sustained load) will get more CPU time overall; the OS scheduler is responsible for ensuring (some level of) fairness; OCaml also influences fairness, as it’s responsible for the preemption points.[1]

This is also what happens on OCaml 4.xx, except it’s all on a single “domain”. Multiple pools can still exist, and will generally not block one another[2] because the threads will be preempted.

I wouldn’t necessarily suggest creating a new temporary pool just to do things faster (although why not, if it’s released properly), but at least having different pools for different purposes is easy.


  1. not sure of the name, but the points where code will check if it’s time to release the lock and yield to another thread. ↩︎

  2. unless a thread calls some C stub that blocks for a long time without releasing the runtime lock, that is. ↩︎

2 Likes

The whole semantics is explained in the interface file:

Oh, that seems reasonably easy then:

open Moonpool

(* map [f] over [l] in parallel, on a temporary pool of [j] workers *)
let parmap j f (l: _ list) : _ list =
  let pool = Pool.create ~min:j () in
  let l2 = List.map (fun x -> Fut.spawn ~on:pool (fun () -> f x)) l in
  (* block the calling thread until every future has resolved *)
  let res = Fut.join_list l2 |> Fut.wait_block_exn in
  Pool.shutdown pool;
  res

(* same, but for side effects only *)
let pariter j f (l: _ list) : unit =
  let pool = Pool.create ~min:j () in
  Fut.for_list ~on:pool l f |> Fut.wait_block_exn;
  Pool.shutdown pool

and similar for arrays with Fut.join_array and Fut.for_array. That’s all on the main branch.
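For example (a hypothetical use; results should come back in input order, since join_list joins the futures positionally):

let squares = parmap 4 (fun x -> x * x) [1; 2; 3; 4]
(* squares = [1; 4; 9; 16] *)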

1 Like

Thanks, I will play with it.