Strangely high memory consumption with OCaml multicore

To test OCaml multicore, I installed the 4.12.0+domains+effects compiler switch via opam and implemented quicksort using tasks.

To my surprise, even the purely sequential version has very high memory consumption when run from within a thread-pool (Domainslib) context: I see OCaml using 800 megabytes of RAM to sort an integer array of size 1 million when run within the following context:

let pool = Tsk.setup_pool ~num_additional_domains:3 () in
let run f = Tsk.run pool f in

The problem does not occur without this pool context.

Another surprising aspect: 256 gigabytes of virtual memory are allocated for running anything inside OCaml.

If I run the computation over 4 threads, it dies before the end (but somehow the process returns 0).

So I wonder about the current status of OCaml multicore, because what I’ve seen does not seem quite functional. Have I installed an incorrect version? Which one should I pick? (I’m using opam.)

What about trying trunk, i.e. what will become OCaml 5.0 in a few months’ time? 4.12+domains+effects is supposed to be quite stable, but it is probably not getting many fixes and updates now.

Are you using int array or Bigarray?

I wouldn’t worry too much about this aspect. Many runtimes allocate huge amounts of virtual memory for various reasons. My guess is that each thread reserves its own pre-defined region of virtual address space, so that threads avoid stomping on one another when sharing the same memory is unnecessary. There could be other reasons.

As a point of comparison, my Haskell language server shows that it is occupying 1.0 TB (!) of virtual memory.

int array, not Bigarray. An int array of length 1 million should take eight megabytes.
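As a quick sanity check of that estimate (not part of the original benchmark), one can measure the array's footprint from the stdlib with `Obj.reachable_words`, which counts heap words including block headers:

```ocaml
(* On a 64-bit system, an int array holds one word (8 bytes) per
   element plus a one-word block header, so 1 million elements is
   about 8 MB. [Obj.reachable_words] counts words including headers. *)
let () =
  let a = Array.init 1_000_000 Fun.id in
  let words = Obj.reachable_words (Obj.repr a) in
  let bytes = words * (Sys.word_size / 8) in
  Printf.printf "%d words, %d bytes (~%.1f MB)\n"
    words bytes (float_of_int bytes /. 1e6)
```

So the 800 MB figure reported above is about two orders of magnitude more than the data itself.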

Trying on trunk? I tried installing it through opam but failed. Would you have a working command line for this?

Try this:

$ mkdir ocaml-500
$ cd ocaml-500
$ git clone https://github.com/ocaml/ocaml.git
$ opam switch create . --empty
$ cd ocaml
$ opam install .
$ ocaml --version
The OCaml toplevel, version 5.0.0+dev0-2021-11-05

Elaborated from HACKING.adoc at trunk in the ocaml/ocaml repository on GitHub; there are other approaches described at that link as well.

The situation is better with the 5.0.0 trunk, but the process still dies early when 4 cores are used, and RAM consumption is just unbelievably high.

With a recent version of opam

opam switch create 5.0.0+trunk

works.
Earlier versions need to enable access to the beta versions of OCaml by adding the ocaml-beta repository:

opam switch create 5.0.0+trunk --repo=default,ocaml-beta=git+https://github.com/ocaml/ocaml-beta-repository.git

Running

let a = Array.init  1_000_000 Fun.id
let f () = Array.sort Stdlib.compare a
open Domainslib
let pool = Task.setup_pool ~num_additional_domains:3 ()
let () = Task.run pool f

gives me a peak memory use of 14 MB. It seems that the issue does not come from the code that you shared.
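For anyone wanting to reproduce this kind of measurement without external tooling, here is a stdlib-only sketch that samples the major heap size with `Gc.quick_stat` before and after a run. Note this tracks only the OCaml heap, not the process's resident set, so it is a rough proxy for the figures quoted in this thread:

```ocaml
(* Sample the major heap size (in words) around a run. This measures
   the OCaml heap only, not the OS-level memory of the process. *)
let heap_mb () =
  let s = Gc.quick_stat () in
  float_of_int (s.Gc.heap_words * (Sys.word_size / 8)) /. 1e6

let () =
  let before = heap_mb () in
  let a = Array.init 1_000_000 Fun.id in
  Array.sort compare a;
  Printf.printf "major heap: %.1f MB -> %.1f MB\n" before (heap_mb ());
  ignore (Sys.opaque_identity a)
```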


See the source code. When the array size reaches about 1 million, RAM consumption is about 1 gigabyte. This is just a vanilla handwritten sequential quicksort procedure called from the main domain in a pool with 4 domains.
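For readers following along, the poster's source is only linked, not reproduced here; a plain in-place quicksort of the kind being described might look like the following (a hypothetical reconstruction using Lomuto partitioning, not the thread's actual code):

```ocaml
(* Hypothetical reconstruction: a vanilla sequential in-place
   quicksort over an int array, with a last-element pivot. *)
let swap a i j =
  let t = a.(i) in
  a.(i) <- a.(j);
  a.(j) <- t

let rec quicksort a lo hi =
  if lo < hi then begin
    let pivot = a.(hi) in
    let i = ref lo in
    (* Lomuto partition: move elements below the pivot to the front *)
    for j = lo to hi - 1 do
      if a.(j) < pivot then begin
        swap a !i j;
        incr i
      end
    done;
    swap a !i hi;
    quicksort a lo (!i - 1);
    quicksort a (!i + 1) hi
  end

let () =
  let a = Array.init 100_000 (fun _ -> Random.int 1_000_000) in
  quicksort a 0 (Array.length a - 1);
  (* verify sortedness *)
  for i = 1 to Array.length a - 1 do
    assert (a.(i - 1) <= a.(i))
  done
```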

The same procedure works perfectly if the pool has only 1 domain, and memory consumption is just sixty megabytes.

Thanks for the code!

The GC behavior does seem interesting: with 4.14, memory use peaks around 100 MB; it goes down to 50 MB with 5.0 without domains, and goes up to 900 MB once a domain is created.

It looks like the GC is struggling to scan all the integer arrays allocated in the major heap fast enough. Typically, either switching to float arrays or adding a Gc.full_major () before running the test for another array size fixes the memory consumption issue.
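As a sketch of that workaround (the benchmark driver and sizes here are placeholders, not the thread's actual code), the idea is simply to force a full major collection between array sizes so garbage from the previous run is reclaimed before the next large allocation:

```ocaml
(* Placeholder benchmark loop: sort arrays of increasing sizes,
   forcing a full major collection between runs. *)
let bench sort sizes =
  List.iter
    (fun n ->
       let a = Array.init n (fun _ -> Random.int (max n 1)) in
       sort a;
       Gc.full_major ()   (* reclaim the previous run's garbage *)
    )
    sizes

let () = bench (fun a -> Array.sort compare a) [10_000; 100_000; 1_000_000]
```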

Sounds like a good argument for a ‘no scan’ bit in the header.

But you said ‘4.14’ which confuses me. Isn’t 4.14 supposed to not include multicore?

4.14 does not support multicore, but we are discussing the behavior of a sequential test of a sequential sort for increasing array lengths, which does not require parallelism to run.

Oh sorry I misread your post.

In any case, should parallel GC scanning be impacted so heavily? Surely there must be a benchmark comparing single-domain to multi-domain GC scanning?

Indeed, the problem disappears if the main loop calls the GC. In contrast, the huge memory growth still occurs when using Bigarrays, so it is not an issue of scanning a huge array of plain integers to check whether they are pointers.
