What's the best way to parallelize this spatial simulation?

I’ve been wanting to port a spatial simulation from OCaml 4 to parallel OCaml 5, and I’m looking for some advice on how to do it best.

From a high-level perspective, the simulation loop updates the positions of N agents (walkers) in a 2D rectangular space (arena). Position updates are one major computational cost. Each walker may collide with another, and these collisions are handled by the code.

The naive N^2 scaling of collision detection is avoided by using neighbor lists. That is, the domain is divided into K x K subarenas, and each subarena has an associated list of the walkers it contains. On each step of the loop, a walker checks for collisions only with walkers in its own or neighboring subarenas; in addition, each subarena’s member list must be updated when a walker crosses into a neighboring subarena. These steps scale as N. Still, collision detection remains the other significant part of the cost.
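For concreteness, a minimal sketch of such a subarena grid in OCaml might look as follows. All names (`walker`, `grid`, `cell_of`, `neighbors`) and the arena dimensions are illustrative assumptions, not taken from the actual code:

```ocaml
(* Hypothetical sketch of a K x K subarena grid with per-cell member
   lists, as described above. *)
type walker = { id : int; mutable x : float; mutable y : float }

let k = 10                              (* K x K subarenas *)
let arena_w = 100.0 and arena_h = 100.0 (* illustrative arena size *)

(* Map a walker's position to its subarena index. *)
let cell_of w =
  let cx = int_of_float (w.x /. arena_w *. float_of_int k) in
  let cy = int_of_float (w.y /. arena_h *. float_of_int k) in
  (min cx (k - 1), min cy (k - 1))

(* One member list (hash table keyed by walker ID) per subarena. *)
let grid : (int, walker) Hashtbl.t array array =
  Array.init k (fun _ -> Array.init k (fun _ -> Hashtbl.create 16))

(* Collision checks consider only the walker's own subarena and the
   (up to 8) neighboring ones. *)
let neighbors (cx, cy) =
  List.concat_map
    (fun dx ->
       List.filter_map
         (fun dy ->
            let x = cx + dx and y = cy + dy in
            if x >= 0 && x < k && y >= 0 && y < k then Some grid.(x).(y)
            else None)
         [-1; 0; 1])
    [-1; 0; 1]
```

With this layout, a walker transfer amounts to removing the ID from one cell’s hash table and inserting it into another’s when `cell_of` changes.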

There is a natural unit of granularity: the subarena. I thought I’d assign one Domain.t to each subarena and compute the position updates of its walkers in that worker domain, in parallel. Walker transfers and collision detection will require communication between domains.

However, the number of subarenas is a tuning parameter of the algorithm, and I can’t simply fix it to the number of cores. For a large system, the current sequential version of the code can easily run fastest with 100 subarenas, while my machine has far fewer cores. So maybe it’s wrong to try to assign each subarena its own Domain.t?

Another question is inter-subarena communication. Is it best to keep each subarena’s member list of walkers in the global heap, or should those lists live in thread-local storage, with walker transfers passed as messages through appropriate domainslib channels? The walker positions are the most frequently updated data. Currently each walker is represented by a record with a unique ID and a mutable float subrecord for its position, and member lists are hash tables keyed by ID.
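To make the two options concrete, here is a hedged sketch assuming the `domainslib` package; the types and helper names are illustrative placeholders, not the actual code:

```ocaml
type walker = { id : int; mutable x : float; mutable y : float }

(* Option A: the member list lives in the shared heap; any domain may
   update it, guarded here by a plain Mutex. *)
let members : (int, walker) Hashtbl.t = Hashtbl.create 64
let lock = Mutex.create ()

let transfer_in w =
  Mutex.lock lock;
  Hashtbl.replace members w.id w;
  Mutex.unlock lock

(* Option B: each subarena's worker owns its member list privately,
   and incoming transfers arrive as messages on a Domainslib.Chan. *)
let inbox : walker Domainslib.Chan.t = Domainslib.Chan.make_unbounded ()

let send_transfer w = Domainslib.Chan.send inbox w

(* Non-blocking drain of pending transfers into the worker's own list. *)
let rec drain_transfers local =
  match Domainslib.Chan.recv_poll inbox with
  | Some w -> Hashtbl.replace local w.id w; drain_transfers local
  | None -> ()
```

Note that OCaml 5 has a single shared major heap either way; "thread-local" here only means that one domain owns the table by convention, so no locking is needed on it.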

If possible with such a high-level description, I would appreciate good performance / program structuring advice!

Cheers


I think that in general it’s wrong to think in terms of domains. As @gasche recently benchmarked, it’s not really a good idea to have more domains than Domain.recommended_domain_count.

You should rather think in terms of parallelizable work items, and then have an abstraction at which you throw work without having to think about how it’s going to be scheduled (basically a domain pool).
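As a minimal sketch of that pattern using Domainslib.Task (one existing library that provides such a pool), each loop iteration below is an independent work item, and the pool rather than the caller decides how items map onto domains:

```ocaml
(* Sketch: a pool sized from Domain.recommended_domain_count, fed
   with independent work items via parallel_for. The summation is
   just a stand-in workload. *)
let () =
  let num_domains = max 1 (Domain.recommended_domain_count () - 1) in
  let pool = Domainslib.Task.setup_pool ~num_domains () in
  let result =
    Domainslib.Task.run pool (fun () ->
        let acc = Atomic.make 0 in
        Domainslib.Task.parallel_for pool ~start:1 ~finish:100
          ~body:(fun i -> ignore (Atomic.fetch_and_add acc i));
        Atomic.get acc)
  in
  Domainslib.Task.teardown_pool pool;
  assert (result = 5050)
```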

I hope the Stdlib eventually provides us with such abstractions for ambient parallelism, otherwise it will not be possible to write libraries that use parallelism underneath in a compositional way.

For example, I just wrote an API to perceptually compare the pixels of two images on the CPU. This is eminently parallelizable, and I would certainly like to be able to do it unbeknownst to the client. It would be nice if I could just query the number of processors, divide the work accordingly, and submit it to an abstraction without having to take scheduling decisions or perform any setup; those decisions should be left to the application.
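A hedged sketch of that use case, again assuming Domainslib.Task is the available pool abstraction; the flat int-array pixel representation and the function name are illustrative:

```ocaml
(* Pixels are compared independently and the per-pixel results are
   summed by the pool via parallel_for_reduce. *)
let count_differing_pixels pool (a : int array) (b : int array) =
  Domainslib.Task.parallel_for_reduce pool ( + ) 0
    ~start:0 ~finish:(Array.length a - 1)
    ~body:(fun i -> if a.(i) <> b.(i) then 1 else 0)

let () =
  let pool = Domainslib.Task.setup_pool ~num_domains:3 () in
  let n =
    Domainslib.Task.run pool (fun () ->
        count_differing_pixels pool [| 1; 2; 3; 4 |] [| 1; 0; 3; 5 |])
  in
  Domainslib.Task.teardown_pool pool;
  assert (n = 2)
```

The point of the complaint above is precisely that the library author currently has to create and own this pool, rather than borrow an ambient one from the application.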


Ok, so thinking of Domain.t as the work package with associated processing thread is the wrong way, understood.

Part of my confusion comes from the fact that the work items come with mutable data that persists over many iterations. I had in mind a vague idea of ‘data locality’, and of ‘thread-local storage’ being efficient because global GC cycles could be avoided. (Is that even correct?) This seems to suggest that it would be efficient to keep the data associated with one subarena, such as the particle positions, local to some ‘thread per subarena’ that lives for many time steps. I’m not sure how to combine this structure with the idea of ‘work items plus an opaque domain pool’.

IIUC, dispatching ‘work items’ to a domain pool would mean defining a work item as a one-time update of particle positions in one subarena, and then throwing such items at a pool of domains. That would seem to require keeping the particle positions in global memory.
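That reading can be sketched as follows, assuming Domainslib.Task; the `float array array` layout and the drift update are placeholders for the real per-subarena data and physics:

```ocaml
(* Work item = one-step position update of one subarena. The position
   arrays stay in the shared heap; only scheduling goes to the pool. *)
let step pool (subarenas : float array array) =
  Domainslib.Task.run pool (fun () ->
      subarenas
      |> Array.map (fun positions ->
             Domainslib.Task.async pool (fun () ->
                 (* placeholder update: drift every walker by 0.1 *)
                 Array.iteri
                   (fun i x -> positions.(i) <- x +. 0.1)
                   positions))
      |> Array.iter (Domainslib.Task.await pool))

let () =
  let pool = Domainslib.Task.setup_pool ~num_domains:2 () in
  let subarenas = [| [| 0.0; 1.0 |]; [| 2.0 |] |] in
  step pool subarenas;
  Domainslib.Task.teardown_pool pool;
  assert (abs_float (subarenas.(0).(0) -. 0.1) < 1e-9)
```

Since each item mutates only its own subarena’s array, the items are independent within a step; the cross-subarena interactions (collisions, transfers) would need a separate phase or synchronization.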

Does that make any sense? Am I overthinking?

I can’t really offer advice, but I think this tutorial on Domainslib (well, multicore OCaml more generally) may help: GitHub - ocaml-multicore/parallel-programming-in-multicore-ocaml: Tutorial on Multicore OCaml parallel programming with domainslib