I’m looking into moving infer to OCaml 5. Infer uses multi-processing for parallelism. Each worker can sometimes accumulate a significant amount of state while processing a task. Some of this state may be relevant to subsequent tasks and some not, but a situation where a worker allocates, say, 5GB while processing a task and then collects 4.5GB of it as garbage is pretty common.
With OCaml 4 we make workers regularly run GC compaction to release memory back to the OS. This allows us to keep the worker processes lean and manage the overall memory consumption. This is also where we run into trouble upgrading to 5.0.
GC in OCaml 5
In 4.x, memory pages are allocated from the OS via mmap and are released back to the OS after compaction via munmap.
I spent some time browsing the GC code in trunk, and it seems that with OCaml 5:
the minor heap is mmap’ed as one big chunk but is never released to the OS (unless the minor heap gets resized to a larger one). This is not terribly important because the minor heap is 16Mb (?) per domain by default.
the major heap has two different allocation strategies:
a. a size-segmented free list for smaller allocations. These are mmap’ed but don’t seem to be released back to the OS at any point. I found a pointer in the code to this issue, but it hasn’t been updated since 2021.
b. large allocations (>128 words?) via malloc. These are released via free, which may or may not return the pages to the OS.
The memory allocation strategies we know from 4.x (first/next/best-fit) are not relevant in 5.0, as there’s only one allocation strategy.
Moving to OCaml 5, we see a significant increase in memory pressure for our workload due to the lack of compaction (i.e. not releasing memory back to the OS).
Doing a full major GC instead of compaction had only a marginal effect (somewhat expectedly). Another idea is to tune the minor heap’s size to avoid some promotions, but I doubt it will make a real difference.
Obviously, the right thing to do would be migrating from multi-processing to multi-domain parallelism, but that’s a different story.
Am I missing or misunderstanding something? Perhaps there are some other tuning knobs that might help?
There is nothing in releasing memory back to the OS that essentially requires compaction, but compaction used to have this visible effect. If compaction is not re-implemented, or until it is, its release-of-memory effect should be implemented separately.
You’re mainly looking at solving problem 2. To do so, look into having Gc.compact complete a full major cycle and then, in a stop-the-world section, free the mappings associated with the free pools, stored in the free list pool_freelist.free. The duration of Gc.compact is the opposite of performance-critical, so there is no need to be too efficient here.
Now there is an issue: pools are allocated in batches of 16, so you want to release only whole batches of 16 contiguous free pools. (On POSIX systems you can release part of a mapping, but this fragments your virtual address space, which is better avoided, and Windows does not support it.) There are various ways to go about it; let me know if you are interested.
at least until you start talking about using OS huge pages
Thanks for your response @gadmm! To make sure I got this right:
it was a correct assessment that 5.0 doesn’t release memory to the OS.
release-of-memory functionality needs to be reimplemented in 5.0.
currently, caml_gc_compaction just runs a major cycle, but we could add some extra bits to find and release contiguous batches of 16 free pools.
I guess one concern I have is that without actual compaction the heap will be fragmented, so there will be fewer runs of 16 contiguous free pools, and hence the effect of release-of-memory will be less pronounced.
Re: #3: currently there is a TODO in pool_release to give pools back to the OS, but that is called on every sweep, so it seems more performance-sensitive. I guess releasing pools only during compaction might be a better starting point.
Yes, all this sounds correct; including, I agree, the fact that less memory might be released, but the free-but-unreleased pools will be the first ones used for new pool allocations. So the end result depends on how fragmented the live data is. If you find that it works well for infer, it would be an argument in favour of accepting a PR.
If we kept batch allocation of pools, I was pondering whether there are simple things we could do to make it much more likely that all the pools from a batch are on the free list when there’s a substantial reduction in major heap usage, e.g. could we change the free list so that batches clump at the end?
If we could do that, we could release whole batches that are entirely on the free list at the end of a major cycle.
Would an allocator like jemalloc help efficiently manage a memory pool in a multithreaded OCaml allocator? Background · jemalloc/jemalloc Wiki · GitHub
It provides various knobs to tune it according to your requirements (balance between lower memory usage, or higher multi-threaded performance): TUNING.md
It also supports “muzzy pages”: marking pages as freeable with ‘madvise’ but not actually immediately unmapping them (which is expensive), while the OS can still reclaim them if needed.
Requiring an extra OS dependency for memory allocation may not be ideal, and since OCaml already does its own memory allocation with ‘mmap’, it is unlikely to benefit from the user switching out ‘malloc’ for ‘jemalloc’.
Jemalloc has done quite a lot of research on reducing and avoiding fragmentation, so reusing the code, or ideas might be beneficial.
@sadiq Releasing memory is expensive (inter-processor interrupt to flush the TLB on all cores), so you might want to keep the free list and free it on Gc.compact instead.
@edwin Not uninteresting questions. Allocators are usually very bad at managing large aligned blocks with large alignment; their aligned alloc function is usually tuned for small alignments. Using madvise(MADV_FREE) is specific to overcommitting Linux systems. There are libraries that provide building blocks to build allocators and GCs developed in academia, but those are written in C++ and Rust due to the modularity requirements of such a thing.
Thanks so much for running these and it’s good to see it had an impact.
It took a little while to get these branches running in the benchmarking suite (and there’s still an abort in one of the parallel benchmarks I need to investigate), but there are some preliminary sequential numbers here:
It seems the performance impact of not batching pool allocations is fairly small. The only difference between pool_release and pool_release_cycle is when pools are released. The former does so immediately, the latter only at the end of a major cycle.
I think there’s probably a good argument for releasing pools when done with them. I’m also pondering whether we need to mmap the pools or whether malloc might be sufficient.
This statement about Windows comes as a surprise to me. The Ravenbrook MPS has always done this on Posix and Windows platforms, and this was one of the reasons that a major commercial customer (on Windows) selected the MPS, back in 2001 - they measured the behaviour in various circumstances and particularly liked that the overall memory usage declined after a GC. See (for Windows) mps/vmw3.c at master · Ravenbrook/mps · GitHub
The caveat about fragmentation is true and potentially important, but over the years I’ve seldom seen it cause problems (as long as one builds in some hysteresis).
Right, I forgot that on Windows it is also possible to decommit (rather than release) part of the mapping, which is enough here.
Another issue with putting holes in the VAS is reaching the limit on the number of mappings, which is fairly low by default on Linux. jemalloc has a bug whereby it creates too many mappings when overcommitting is turned off. When overcommitting is enabled, the glibc malloc and jemalloc use madvise to decommit memory, as mentioned by @edwin. This affects the case where pools are mapped in a batch; it is unclear to me how it affects the case without batch allocation.
What I know on the subject comes from reading the source code of various allocators when working on ocaml-boxroot and the huge page allocator for OCaml. My advice would be to do just that if you want to release on-the-fly. In particular it seems to me that the best way to release memory is platform-specific (e.g. IIRC OSX has no overcommitting but has a commit/decommit mechanism similar to Windows).
(If I have to review @sadiq’s patch later on, I’d rather the simplest approach of releasing on Gc.compact be taken at first, given how tricky properly implementing on-the-fly releasing looks.)