Dear Discuss,
At LexiFi we are evaluating whether to put work towards making our codebase usable with multiple threads (i.e. protecting or reducing global state with locks, making it thread- or domain-local, etc.). If undertaken, this will require considerable engineering investment on our part, so we are interested in experience reports from others who have undertaken similar efforts, particularly on large industrial codebases. We are especially interested in design choices (e.g. the precise combination of domains, threads and effects) and in quantitative before/after benchmarks.
For information, we currently have a mature multi-process architecture that uses message-passing for communication. While this works well, each process maintains its own separate heap, and the ensuing RAM usage limits the number of processes that we can run simultaneously on commodity hardware. For us, a main interest of multi-threading lies in the ability to share heap resources among the different threads, thus reducing RAM usage and allowing a greater number of workers to execute simultaneously on a single machine.
A somewhat similar architecture was recently reported in "Prohibitive amounts of runtime lock waits in multicore analysis with Infer" (ocaml/ocaml issue #14047 on GitHub), which mentions a sizeable performance penalty when using multiple threads.
A specific point that we wonder about is the performance cliff-edge when the number of domains exceeds N, the number of cores: the official documentation strongly recommends not starting more domains than “the number of cores”. Supposedly this is because minor collections require all domains to enter a barrier, and if any domain threads are sleeping, this synchronization can be expensive. But what about other processes? Even with fewer than N domains, couldn’t one hit the same performance cliff-edge depending on how the OS schedules those domain threads? How should one decide how many domains to start? This seems like a pretty basic question, but I have not been able to find a satisfactory answer.
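For concreteness, here is a minimal sketch of the usual way to cap the number of domains, using `Domain.recommended_domain_count` from the OCaml 5 standard library. The names `worker_domain_count`, `parallel_sum` and the `requested` parameter are hypothetical and only serve the illustration. Note that this does not answer the question above: as far as I understand, the recommended count does not take the current load from other processes into account.

```ocaml
(* Sketch: never spawn more worker domains than the runtime recommends.
   [requested] is a hypothetical application-level setting. *)
let worker_domain_count ~requested =
  (* The recommended count includes the domain that is already running,
     so one slot is reserved for the calling (main) domain. *)
  let recommended = Domain.recommended_domain_count () in
  max 0 (min requested (recommended - 1))

(* Sum an array using the main domain plus the capped number of workers. *)
let parallel_sum ~requested (data : int array) =
  let workers = worker_domain_count ~requested in
  let slices = workers + 1 in
  let len = Array.length data in
  let sum_slice i =
    (* Slice i covers the half-open index range [lo, hi). *)
    let lo = i * len / slices and hi = (i + 1) * len / slices in
    let acc = ref 0 in
    for k = lo to hi - 1 do
      acc := !acc + data.(k)
    done;
    !acc
  in
  (* Workers take slices 1..workers; the main domain takes slice 0. *)
  let spawned =
    List.init workers (fun i -> Domain.spawn (fun () -> sum_slice (i + 1)))
  in
  let mine = sum_slice 0 in
  List.fold_left (fun acc d -> acc + Domain.join d) mine spawned
```

With `requested` set very high, the number of spawned domains still stays below the recommended count; on a single-core machine the sketch degrades to a purely sequential sum.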
Alternatively, if one uses threads for concurrency (not parallelism) within a single domain (as was already possible in OCaml 4), I guess one is not subject to the above performance cliff-edge. This would seem a natural approach if all one wants is to share heap resources between different threads, but of course we would not be able to take advantage of the performance benefits of using domains.
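To make the single-domain option concrete, here is a minimal sketch using the `Thread` and `Mutex` modules (it must be built against the threads library, e.g. `threads.posix`). The names `shared_table`, `with_lock` and `run` are hypothetical. The point is only that a large read-only value is allocated once and read by every thread, while mutable shared state is guarded by a lock.

```ocaml
(* Large read-only table, allocated once on the shared heap. *)
let shared_table : string array = Array.init 1000 string_of_int

(* Run [f] while holding [m], releasing the lock even if [f] raises. *)
let with_lock m f =
  Mutex.lock m;
  Fun.protect ~finally:(fun () -> Mutex.unlock m) f

(* Compute the total length of all strings using [threads] system threads
   that all run in the current domain (concurrency, not parallelism). *)
let run ~threads =
  let total = ref 0 in
  let lock = Mutex.create () in
  let n = Array.length shared_table in
  let worker i () =
    let lo = i * n / threads and hi = (i + 1) * n / threads in
    let local = ref 0 in
    for k = lo to hi - 1 do
      (* Reads of the shared table need no lock: it is never mutated. *)
      local := !local + String.length shared_table.(k)
    done;
    (* The shared accumulator is mutable, so updates take the lock. *)
    with_lock lock (fun () -> total := !total + !local)
  in
  let spawned = List.init threads (fun i -> Thread.create (worker i) ()) in
  List.iter Thread.join spawned;
  !total
```

Only one of these threads runs OCaml code at any given time, so this shares memory without gaining any parallel speed-up.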
I guess the general question is: how should one decide on the exact combination of domains and threads? The lack of a thread-local storage (TLS) data structure, which is a basic necessity when using threads in non-trivial ways, adds to the confusion, since it implicitly biases the situation towards using domains. (There is an upstream PR, "Thread-local storage, take 3" by gasche, ocaml/ocaml pull request #13355 on GitHub, but it is somewhat in stasis.)
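For comparison, this is roughly what the domain-local counterpart looks like with `Domain.DLS` from the OCaml 5 standard library. It is only a sketch: `scratch_key` and `render` are hypothetical names. As far as I understand, all threads running within one domain see the same DLS value, so this does not replace TLS when several threads share a domain.

```ocaml
(* One scratch buffer per domain, created lazily on first access. *)
let scratch_key : Buffer.t Domain.DLS.key =
  Domain.DLS.new_key (fun () -> Buffer.create 256)

(* Format a value using the calling domain's private buffer.
   Safe across domains; NOT safe if several threads of the same domain
   call it concurrently, since they would share the buffer. *)
let render (x : int) : string =
  let buf = Domain.DLS.get scratch_key in
  Buffer.clear buf;
  Buffer.add_string buf "item-";
  Buffer.add_string buf (string_of_int x);
  Buffer.contents buf
```

Each domain that calls `render` gets its own buffer, so what used to be a global scratch buffer no longer needs a lock as long as there is one thread per domain.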
My impression is that most of the reports I have heard so far about porting efforts to Multicore have been negative, but maybe that is because positive ones are not reported as much. So if you have had a positive experience with the Multicore runtime, please share it. Thanks!
Cheers,
Nicolas