For testing OCaml multicore, I installed the 4.12.0+domains+effects compiler switch via opam and implemented quicksort using tasks.
To my surprise, even the purely sequential version has very high memory consumption when run from within a thread-pool (Domainslib) context: I see OCaml using 800 megabytes of RAM to sort an integer array of size 1 million when run within the following context:
module Tsk = Domainslib.Task
let pool = Tsk.setup_pool ~num_additional_domains:3 ()
let run f = Tsk.run pool f
The problem does not occur without this pool context.
Another surprising aspect: 256 gigabytes of virtual memory are allocated for running anything inside OCaml.
If I run the computation over 4 domains, it dies before the end (but somehow the process still returns 0).
So I wonder what the current status of OCaml multicore is, because what I’ve seen does not seem quite functional. Have I installed an incorrect version? Which one should I pick? (I’m using opam.)
What about trying trunk, i.e. what will become OCaml 5.0 in a few months’ time? 4.12+domains+effects is supposed to be quite stable, but it is probably not getting many fixes and updates anymore.
Are you using an int array or a Bigarray?
I wouldn’t worry about this aspect too much. Many runtimes allocate huge amounts of virtual memory for various reasons. My guess is that each thread is given a pre-reserved region of virtual address space so that it cannot stomp on another thread when sharing the same memory space is not necessary. There could be other reasons.
As a point of comparison, my Haskell language server is showing that it occupies 1.0 TB (!) of virtual memory.
let a = Array.init 1_000_000 Fun.id
let f () = Array.sort Stdlib.compare a
open Domainslib
let pool = Task.setup_pool ~num_additional_domains:3 ()
let () = Task.run pool f
let () = Task.teardown_pool pool  (* release the worker domains *)
gives me a peak memory use of 14 Mbytes. It seems that the issue does not come from the code that you shared.
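As an aside, if one wants to observe heap growth from inside the program rather than from the OS, a small helper along these lines could be used. This is my own sketch; the report helper and its output format are assumptions, not anything used in the measurement above.

let report label =
  (* Print the current major-heap size as seen by the runtime itself. *)
  let st = Gc.quick_stat () in
  Printf.printf "%s: major heap = %d words (%d bytes)\n"
    label st.Gc.heap_words (st.Gc.heap_words * (Sys.word_size / 8))

Calling report before and after Task.run pool f would show how much of the OS-reported usage corresponds to the runtime’s own major heap.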
See source code. When the array size reaches about 1 million, RAM consumption is about 1 gigabyte. This is just a vanilla handwritten sequential quicksort procedure called from the main domain in a pool with 4 domains.
The same procedure works perfectly if one sets the pool to 1 domain only, and memory consumption is then just 60 megabytes.
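For reference, here is a minimal sketch of the kind of vanilla sequential quicksort under discussion; it is my reconstruction from the description above, not the linked source code.

let swap a i j =
  let t = a.(i) in
  a.(i) <- a.(j);
  a.(j) <- t

(* Hypothetical in-place quicksort with a Lomuto partition around the
   last element; any textbook variant would do. *)
let rec quicksort a lo hi =
  if lo < hi then begin
    let pivot = a.(hi) in
    let i = ref lo in
    for j = lo to hi - 1 do
      if a.(j) <= pivot then begin
        swap a !i j;
        incr i
      end
    done;
    swap a !i hi;
    quicksort a lo (!i - 1);
    quicksort a (!i + 1) hi
  end

let () =
  let a = Array.init 1_000_000 (fun _ -> Random.int 1_000_000) in
  quicksort a 0 (Array.length a - 1)

The sort itself allocates essentially nothing beyond the array, which is what makes the memory growth surprising.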
The GC behavior does seem interesting: with 4.14 the memory use peaks around 100 MB; it goes down to 50 MB with 5.0 without domains, and it goes up to 900 MB once a domain is created.
It looks like the GC is struggling to scan all the integer arrays allocated on the major heap quickly enough. Typically, either switching to float arrays or adding a Gc.full_major () before running the test for the next array size fixes the memory consumption issue.
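As an illustration, here is a hedged sketch of that workaround in a size-sweeping test loop; the sizes and the harness itself are my assumptions, not the original test.

let () =
  List.iter
    (fun n ->
      let a = Array.init n (fun i -> n - i) in
      Array.sort compare a;
      (* Force a full major collection between sizes, so that arrays
         from previous iterations are reclaimed before the next one
         is allocated. *)
      Gc.full_major ())
    [10_000; 100_000; 1_000_000]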
4.14 does not support multicore, but we are speaking of the behavior of a sequential test of a sequential sort for increasing array lengths, which does not require parallelism to run.
In any case, should parallel GC scanning be impacted so heavily? Surely there must be a benchmark comparing GC scanning with a single domain against multiple domains?
Indeed, the problem disappears if the main loop calls the GC. In contrast, the huge memory growth still occurs when using Bigarrays, so it is not an issue of scanning a huge array of ordinary integers to check whether they are pointers.
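For context, here is a sketch of what a Bigarray variant of the test might look like (my assumption, not the poster’s code). A Bigarray’s payload lives outside the OCaml heap and is never scanned by the GC, which is why the growth persisting with Bigarrays rules out scanning cost as the explanation.

let () =
  let open Bigarray in
  (* The payload of this Bigarray is allocated outside the OCaml heap,
     so the GC never scans its million elements; only the small header
     is a heap object. *)
  let a = Array1.create int c_layout 1_000_000 in
  for i = 0 to Array1.dim a - 1 do
    a.{i} <- Array1.dim a - i
  done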