OCaml 5 performance

If fork causes such delays even when immediately followed by some form of exec, that is indeed worrisome. The stdlib doesn’t currently provide good enough abstractions over fork and exec, but I think it would fall to the stdlib to provide a workaround.

Stdlib’s fork doesn’t work when you have multiple domains, so Eio has its own version. There, the forked child process only runs C code, but it uses values from the OCaml heap. Currently, this requires holding the GC lock during the call to fork.

That could be improved by copying to the C heap before forking. As a quick test, I did try removing the lock (and doing a minor GC first instead so things probably wouldn’t move), but it looked like it was still causing pauses in other domains so I suspect Linux does some pausing of its own. I didn’t investigate much though and could be wrong about that.

Another possible solution for the solver would be to have the main program spawn a subprocess at the start and have that spawn git processes on demand. But I think replacing C git with OCaml code was a better fix (it’s only slightly faster, but less total CPU use is good to reduce energy usage, even if we do have spare CPUs for it).

Unix.fork cannot be used in multi-domain programs, but the Unix.exec* or Unix.{create,open}_process_* functions should work fine, and I don’t have reasons to believe that they cause trouble for multi-domain scheduling (but I haven’t tested either).

The Unix.exec* functions can allocate memory with malloc, so they are not async-signal-safe and therefore not thread/domain safe. Unix.create_process has been thread safe for quite some time (since OCaml 4.12, I believe), but Unix.create_process_env only became so more recently (see “Unix.create_process_env might not be multi-thread safe”, issue #12395 in ocaml/ocaml on GitHub). I have not looked into the situation for Unix.open_process_*.

Without fork, the exec* functions aren’t very useful. That said, create_process should be enough for someone who wants to implement their own overlay for subprocess handling (I find the open_process API to be somewhat weird, and insufficient overall). So it’s enough if create_process is safe.
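
For illustration, here is a minimal sketch of what such an overlay could look like on top of Unix.create_process, assuming the unix library is linked. The name run_capture is hypothetical (it is not a stdlib function), and error handling is deliberately thin: if create_process raises, the pipe descriptors leak.

```ocaml
(* Sketch only: run a program, capture its stdout, return exit status and output.
   [run_capture] is a hypothetical helper, not part of the Unix module. *)
let run_capture prog args =
  (* cloexec keeps the parent's copies of the pipe ends out of the child *)
  let r, w = Unix.pipe ~cloexec:true () in
  let pid =
    Unix.create_process prog (Array.of_list (prog :: args))
      Unix.stdin w Unix.stderr
  in
  (* Close our write end so that reading sees EOF once the child exits *)
  Unix.close w;
  let ic = Unix.in_channel_of_descr r in
  let buf = Buffer.create 256 in
  let chunk = Bytes.create 4096 in
  let rec loop () =
    let n = input ic chunk 0 (Bytes.length chunk) in
    if n > 0 then begin
      Buffer.add_subbytes buf chunk 0 n;
      loop ()
    end
  in
  loop ();
  close_in ic;
  let _, status = Unix.waitpid [] pid in
  (status, Buffer.contents buf)
```

Something like run_capture "git" ["rev-parse"; "HEAD"] would then return the process status together with whatever the command printed on stdout.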

What machine with 160 cores are you using?
I’d be interested in buying one.

Then I was right to move Parany away from the multicore runtime and back to the fork-based method.

Maybe rewrite your program using parmap or parany and see how the parallel performance fares.

That said, I am not sure that doing asynchronous I/O is a very reasonable design choice.

Simplify your algorithm enough so that it becomes just a giant List.map or List.iter (or Array.{iter|map}), then use one of the two parallelization libraries I mentioned.
I bet it will kick a** in terms of parallelization performance.
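
To make the fork-based approach concrete, here is a rough sketch of the idea using only the unix library and Marshal. It is not the Parmap or Parany API: fork_map is a hypothetical name, it naively forks one child per element instead of using a worker pool, and it assumes a single-domain program (Unix.fork cannot be used once other domains exist) and results that Marshal can serialize.

```ocaml
(* Sketch only: a naive fork-based parallel map.
   [fork_map] is a hypothetical helper, not the Parmap or Parany API. *)
let fork_map (f : 'a -> 'b) (xs : 'a list) : 'b list =
  (* Start one child per element; the child marshals [f x] into a pipe. *)
  let start x =
    let r, w = Unix.pipe () in
    match Unix.fork () with
    | 0 ->
      Unix.close r;
      let code =
        try
          let oc = Unix.out_channel_of_descr w in
          Marshal.to_channel oc (f x) [];
          close_out oc;
          0
        with _ -> 1
      in
      (* _exit skips at_exit handlers and inherited channel buffers *)
      Unix._exit code
    | pid ->
      Unix.close w;
      (pid, Unix.in_channel_of_descr r)
  in
  let children = List.map start xs in
  (* Collect results in input order; raises End_of_file if a child failed. *)
  List.map
    (fun (pid, ic) ->
      let (y : 'b) = Marshal.from_channel ic in
      close_in ic;
      ignore (Unix.waitpid [] pid);
      y)
    children
```

A call such as fork_map expensive_step inputs (with expensive_step and inputs being your own function and list) has the same shape as List.map, which is why reducing the algorithm to a plain map makes this kind of library easy to drop in.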

I am pretty shocked to hear that minor collection requires universal synchronisation. That any part of multicore has a global lock independent of domain count is bad enough, but that a frequent, per-domain operation has one is just absurd.

I’ll admit to not knowing much about parallel GC, but I know that what Janet does is to have an entirely disjoint heap per thread, and when an object is determined to have been shared with another thread, it becomes reference-counted instead. How feasible would that be in OCaml? That is, would it be possible to determine which specific objects might have been shared across domains, allowing false positives but not false negatives?

If so, I envision the following division, bearing in mind I am no expert:

  • Per-domain, a minor and a major heap, managed by a single-threaded generational collector.
  • Globally, a single heap, managed by an incremental collector. (I checked just now, a parallel GC with incremental marking and collection does indeed exist.)
  • Objects move from the local heaps to the global heap when they might have been shared with another domain.

I imagine this would wreak havoc upon the C API, but I don’t see any major structural issues with the idea. I bet some people here could find some though; any takers?

I think it’s an Ampere Altra Mt. Jade machine (see “The Ampere Altra Review: 2x 80 Cores Arm Server Performance Monster”).

I’d strongly recommend reading “Retrofitting Parallelism onto OCaml” (arXiv:2004.11663) before using the word “absurd”!
