Multicore OCaml: March 2020 update

This suggests that the Gram matrix benchmark should be redone with optimized chunk sizes, yes?

No, I think they optimized it for their machine.
They have a different machine, with many cores.
It is probably very different from the machines I have access to (which are rather standard workstations).

I did do a fair bit of chunk size tuning for the machine that I have. The optimal chunk sizes for parmap and parany vary widely between different machines.

Does this mean that programs that currently use ocaml threads package will work seamlessly with multi-threads once multi-thread ocaml lands in ocaml mainline? i.e. programs using ocaml threads package will utilise multi-threads without (any?) changes?

1 Like

As far as I understand no. Thread will still be restricted to the only one active process semantics in order to remain compatible.

1 Like

I had some fun the other day and ported a parallel raytracer benchmark from F# to multicore OCaml: https://github.com/athas/raytracers

It doesn’t perform well, but the worst cases are on rendering, where I just naively spawned a Domain per pixel (: Chunking per row of pixels worsened performance even more though.

If you know how to show better performance with multicore, then make a PR - he will gladly accept

5 Likes

Each domain will spawn a pthread + allocate new heap structures etc. So spawning a domain per pixel or a row is not a good idea. If you have N cores, you would spawn N-1 workers at the start of the program and distribute tasks among them until the program runs to completion.

Have a look at the implementation of GramMatrix (https://github.com/ocaml-bench/sandmark/blob/master/benchmarks/multicore-grammatrix/grammatrix_multicore.ml) for how to use channels for distributing work from master to worker. There are more benchmarks here: https://github.com/ocaml-bench/sandmark/tree/master/benchmarks/multicore-numerical. We also have a parallel version of a simple ray tracer: https://github.com/ocaml-bench/sandmark/tree/master/benchmarks/multicore-minilight/parallel

4 Likes

I’ll try to rewrite the benchmark, thanks (: The MPL language is very fast in the benchmark (faster than F# and Haskell) - do you have an idea of what the MPL ‘ForkJoin’ model has as advantages/disadvantages relative to multicore OCaml?

As a confined distraction, I wrote a small parallel benchmark program. I now have two questions:

  • The performance as not as good as a version of the program using Unix.fork(), which I would expect that one can get the same performance. What is the right place to post the code and ask for advice on how to profile and tune the performance?
  • I don’t know where to look for parallel-programming libraries on top of multicore, for example a Multicore implementation of parmap : ('a -> 'b) -> 'a array -> 'b array. If I understand correctly, this does not exist yet? What is the right place to contribute a version of it? Is Domainslib the place?

I don’t think there are such libraries for the moment.
Porting Parmap to multicore-OCaml would be useful, in the long run.
Same for Parany.
The performance is supposed to become better, because there
would be less communication overhead (marshalling and unmarshaling of things
between processes).

This assumes the parallel GC overhead, and the overhead of cache synchronization, is lower than the marshalling/unmarshalling overhead. While hopefully true, it’s not necessarily true in all cases.

2 Likes

domainslib would be the right place to add a multicore implementation of parmap. It would be interesting to see the small parallel benchmark; it would help with tuning the parallel GC. If you can post the code to some public location (even a gist perhaps), we can take a look. If it proves to be useful, we could later turn it into a micro benchmark for multicore and add it to Sandmark.

1 Like

I don’t see any disadvantages. You can build fork-join model as a library on top of Multicore OCaml.

One thing I would note is that Maple assumes disentangledness i.e, that the parallel tasks are oblivious to each others memory effects. This is not checked statically. The type safety of Maple is dependent on the programmer correctly writing disentangled programs. Disentangledness makes GC design much simpler; the GC can assume that there will be no cross task pointers and only pointers from ancestor to descendent. The details are a bit more complicated, but it is up to the programmer to write “correct” programs. The compiler doesn’t help you (yet?).

Multicore OCaml cannot make such assumptions. Multicore OCaml programs will not segfault even if you arbitrarily share memory between domains. This dictates that the Multicore OCaml GC handle the possibility of arbitrary pointers between domains / parallel tasks. Much of the complexity in the GC design is ensuring that we can ensure GC safety (reachable objects aren’t collected) and liveness (unreachable objects are eventually collected).

4 Likes

This is correct. Multicore OCaml will aim to preserve single threaded semantics as much as possible. Not everyone may want to use multicore features. It is better not to break their code unless there is a convincing reason to do so.

I was also asking about what is the right place to do this (I’m not convinced that an issue against Sandmark is appropriate, given that I’m just playing and trying to tune performances, not propose a new for Sandmark; I also don’t think that an issue in the ocaml-multicore repository is appropriate, as it should be reserved for runtime development). Unless there is a better suggestion I will just open a Discuss thread.

Rather than a discuss thread, please can we have it under ocaml-multicore, although it might not be a perfect fit. Would be nice to have the discussion about implementation along with the code, and easily findable in the future when we look for it.

1 Like

Had a go at making Multicore both fast at the rendering (where one knows beforehand what work needs to be done) using the sandmark benchmark you referenced as basis. And to have a bit more fun I implemented the ForkJoin library you suggested too (: … it only supports the Par effect, but that is enough to dynamically spawn work for a set of workers inside the BVH recursive function.

Performance wise, now the Multicore code is as fast + faster than both Haskell and F# https://github.com/athas/raytracers

12 Likes

That’s great to see. Thanks for your effort in building useful things on highly experimental software!

We’ve just added support for tasks in domainslib. This should hopefully make it easier to build the raytracer in a more idiomatic fashion. There are several examples that show how to use the task library. I can cut a new release of the library if you are interested in using it in your raytracer.

1 Like

I’ve released domainslib.0.2.0 which supports async/await and parallel for loops.

7 Likes

@rand I’ve made a PR to the raytracer, which uses both kinds parallelism primitives added by Domainslib.Task. It was a nice coincidence that the new API fits so nicely. Removes all of the custom parallelisation code \o/. The performance should be around the same ballpark.

6 Likes