OCaml 5 GC: releasing memory back to the OS

Context
I’m looking into moving Infer to OCaml 5. Infer uses multi-processing for parallelism. Each worker can sometimes accumulate a significant amount of state while processing a task. Some of this state may be relevant to subsequent tasks and some not, but a situation where a worker allocates e.g. 5 GB while processing a task and then collects 4.5 GB of it as garbage is pretty common.

With OCaml 4 we make workers regularly run GC compaction to release memory back to the OS. This lets us keep the worker processes lean and manage overall memory consumption. It is also where we run into trouble upgrading to 5.0.

GC in OCaml 5
In 4.x, memory pages are allocated from the OS via mmap and are released back to the OS after compaction via munmap.

I spent some time browsing GC code in trunk and it seems that with OCaml 5:

  1. the minor heap is mmapped as one big chunk, but is never released to the OS (unless the minor heap gets resized to a larger one). This is not terribly important, because the minor heap is 16 MB (?) per domain by default.
  2. the major heap has two different allocation strategies:
    a. a size-segmented free list for smaller allocations. These pools are mmapped but don’t seem to be released back to the OS at any point. I found a pointer in the code to this issue, but it hasn’t been updated since 2021.
    b. large allocations (>128 words?) via malloc, which are released via free. This may or may not release the pages back to the OS (see the note after this list).
  3. the allocation policies we know from 4.x (first/next/best-fit) are not relevant in 5.0, as there is only one allocation strategy.
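
To illustrate 2b: whether free actually returns pages to the OS depends on the C allocator. With glibc, for instance, requests above the mmap threshold (128 KiB by default, adjusted dynamically) get a dedicated mapping that free then munmaps, so on such systems the large allocations should indeed go back to the OS:

```c
#include <stdlib.h>

/* glibc-specific behaviour, shown for illustration: an allocation above
   the mmap threshold gets its own mapping, and free() returns it to the
   OS immediately via munmap, so RSS drops right away. */
int main(void) {
  void *big = malloc(1 << 24); /* 16 MiB, well above the default threshold */
  free(big);                   /* munmap'd: pages go back to the OS */
  return 0;
}
```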

Problem
Moving to OCaml 5, we see a significant increase in memory pressure for our workload due to the lack of compaction (≈ not releasing memory back to the OS).

Doing a full major GC instead of compaction had a marginal effect (somewhat expectedly). Another idea is to tune the minor heap’s size (e.g. via the s parameter of OCAMLRUNPARAM) to avoid some promotions, but I doubt it will really make a difference.

Obviously, the right thing to do would be migrating from multi-processing to multi-domain parallelism, but that’s a different story.

Am I missing or misunderstanding something? Perhaps there are some other tuning knobs that might help?

11 Likes

There is nothing essential about releasing memory back to the OS that requires compaction[1], but compaction used to have this visible effect. If compaction is not re-implemented, or until it is, its release-of-memory effect should be implemented separately[2].

You’re mainly looking at solving problem 2. To do so, look into having Gc.compact complete a full major cycle and then, in a stop-the-world section, free the mappings associated with the free pools, stored in the free list pool_freelist.free. The duration of Gc.compact is the opposite of performance-critical, so there is no need to be too efficient here.
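
A minimal sketch of what I mean, with invented names and sizes (the runtime’s actual structures differ):

```c
#include <stddef.h>
#include <sys/mman.h>

/* Invented stand-ins for the runtime's pool and free list; only the
   shape of the idea matters here. */
#define POOL_BYTES (16 * 4096)      /* assumed pool size */

struct pool { struct pool *next; };
static struct pool *pool_free_list; /* stand-in for pool_freelist.free */

/* To run inside a stop-the-world section at the end of Gc.compact:
   hand every free pool's pages back to the OS. Ignores the batching
   constraint discussed below. */
static void release_free_pools_at_compact(void) {
  struct pool *p = pool_free_list;
  pool_free_list = NULL;
  while (p != NULL) {
    struct pool *next = p->next;    /* read the link before unmapping */
    munmap(p, POOL_BYTES);
    p = next;
  }
}
```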

Now there is an issue: pools are allocated in batches of 16, so you want to release only whole batches of 16 contiguous free pools. (On POSIX systems you can release part of a mapping, but this fragments your virtual address space, which is better avoided, and Windows does not support it.) There are various ways to go about it, let me know if you are interested.
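
One possible shape, as a sketch (invented names and sizes again; the real code would need to track where each batch starts):

```c
#include <stddef.h>
#include <sys/mman.h>

#define POOL_BYTES (16 * 4096)  /* assumed pool size */
#define BATCH 16                /* pools per mmap'd batch, as above */

struct pool { struct pool *next; };

static int in_batch(struct pool *p, char *base) {
  return (char *)p >= base && (char *)p < base + BATCH * POOL_BYTES;
}

/* Unmap the batch starting at `base` only if all 16 of its pools sit on
   the free list, so that munmap always covers one whole original mapping. */
static int try_release_batch(struct pool **free_list, char *base) {
  int free_pools = 0;
  for (struct pool *p = *free_list; p != NULL; p = p->next)
    if (in_batch(p, base)) free_pools++;
  if (free_pools < BATCH) return 0;      /* some pool is still in use */
  for (struct pool **pp = free_list; *pp != NULL; )
    if (in_batch(*pp, base)) *pp = (*pp)->next;  /* unlink before unmapping */
    else pp = &(*pp)->next;
  return munmap(base, (size_t)BATCH * POOL_BYTES) == 0;
}
```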


  1. at least until you start talking about using OS huge pages ↩︎

  2. previously argued here: Update Gc module to reflect multicore changes · Issue #11812 · ocaml/ocaml · GitHub ↩︎

1 Like

Thanks for your response @gadmm! To make sure I got this right:

  1. it is a correct assessment that 5.0 doesn’t release memory back to the OS.
  2. the release-of-memory functionality needs to be reimplemented in 5.0.
  3. currently, caml_gc_compaction just runs a major cycle, but we could add some extra bits to find and release contiguous batches of 16 free pools.

Sounds correct?

I guess one concern I have is that without actual compaction the heap will be fragmented, so there will be fewer batches of 16 contiguous free pools, and hence the effect of release-of-memory will be less pronounced.


Re #3: currently there is a TODO in pool_release about giving pools back to the OS, but that function is called on every sweep, so it seems more performance-sensitive. Having release of pools only in compaction might be a better starting point.

2 Likes

Yes, all of this sounds correct; including, I agree, the fact that less memory might be released, but the free-but-not-released pools will be the first ones used for new pool allocations. So the end result depends on how fragmented the live data is. If you find that it works well for Infer, that would be an argument in favour of accepting a PR.

3 Likes

As @gadmm points out, it’s the batch allocation of pools that complicates releasing memory back to the OS.

I knocked up GitHub - sadiqj/ocaml at pool_release (for trunk) and GitHub - sadiqj/ocaml at pool_release_5.0 (for 5.0, if you want to test), which don’t batch-allocate pools and unmap them when done. They seem to pass the testsuite, and I’ll try to schedule a benchmarking run to see what kind of performance and major-heap impact this has.
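
The gist of the change, as a sketch (illustrative names, not the actual code in the branches):

```c
#include <stddef.h>
#include <sys/mman.h>

#define POOL_BYTES (16 * 4096)  /* assumed pool size */

/* Without batching, every pool is its own mapping... */
static void *pool_acquire(void) {
  void *p = mmap(NULL, POOL_BYTES, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? NULL : p;
}

/* ...so a free pool can be handed straight back to the OS, with no
   batch bookkeeping needed. */
static void pool_release(void *pool) {
  munmap(pool, POOL_BYTES);
}
```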

If we kept batch allocation of pools, I was pondering whether there are simple things we could do to make it much more likely that all the pools from a batch are on the free list when there’s a substantial reduction in major heap usage, e.g. could we change the free list so that batches clump at the end?

If we could do that, we could then release any batches that are entirely on the free list at the end of a major cycle.

2 Likes

Would an allocator like jemalloc help efficiently manage a memory pool in a multithreaded OCaml allocator? Background · jemalloc/jemalloc Wiki · GitHub
It provides various knobs to tune it according to your requirements (balancing lower memory usage against higher multi-threaded performance):
TUNING.md

It also supports “muzzy pages”: marking pages as freeable with madvise without actually unmapping them immediately (which is expensive), while still letting the OS reclaim them if needed.
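
Roughly how that mechanism looks at the syscall level (a hypothetical helper, not jemalloc’s actual API):

```c
#include <stddef.h>
#include <sys/mman.h>

/* Advise the kernel that a range is reclaimable without paying for an
   unmap. MADV_FREE is Linux-specific (4.5+); MADV_DONTNEED is the more
   widely available, eager fallback. */
static int mark_muzzy(void *addr, size_t len) {
#ifdef MADV_FREE
  return madvise(addr, len, MADV_FREE);     /* reclaimed lazily, under memory pressure */
#else
  return madvise(addr, len, MADV_DONTNEED); /* pages dropped immediately */
#endif
}
```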

Requiring an extra OS dependency for memory allocation may not be ideal, and since OCaml already does its own memory allocation with mmap, it is unlikely to benefit from the user simply swapping malloc out for jemalloc.

Jemalloc has done quite a lot of research on reducing and avoiding fragmentation, so reusing the code, or ideas might be beneficial.

4 Likes

There are jemalloc bindings on opam, by the way. You could try them; I did in a small webserver exactly for memory-reclamation reasons. I don’t know if they work on OCaml 5, though.

1 Like

This is great! Looking forward to perf results.

Assuming we need the batched allocation, what are the constraints on the structure of the free list? E.g.

  1. are we allowed to make it a doubly-linked list?
  2. are we allowed to allocate a side-car object describing a batch that’d be referenced by all pools from the batch?
  3. or even make it consist of batches, and then a pool would be described by a batch pointer + a slot inside the batch?

This could make it easier to check on pool release whether the whole batch could be released too.
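
For instance, a sketch of option 2 (hypothetical layout, invented names):

```c
#define BATCH 16  /* pools per batch, as discussed above */

/* Side-car descriptor shared by the 16 pools of one batch. */
struct batch {
  char *base;      /* start of the batch's mapping */
  int free_count;  /* how many of its pools are currently free */
};

struct pool {
  struct pool *next;    /* free-list link */
  struct batch *owner;  /* back-pointer to the side-car descriptor */
};

/* Called when a pool goes onto the free list; returns nonzero once all
   16 siblings are free and the whole mapping can be released. */
static int batch_now_releasable(struct pool *p) {
  return ++p->owner->free_count == BATCH;
}
```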

Looking at the header, it doesn’t expose the structure of the pool, but perhaps there are other limitations?

I would love to read that story, along with benchmark results of one parallel strategy versus the other. This would be real-world data from a program making heavy use of parallelism.

1 Like

It is nicer if one can get rid of batch allocation without cost. Otherwise, it is possible to traverse a list of allocated chunks and do a simple mark-and-sweep (similar function here: boxroot/boxroot.c · 4a1f980062e23b4dd3a977f510116bbe4ab46874 · ocaml-rust / ocaml-boxroot · GitLab). In the absence of a list of allocated chunks, you might be able to recognize the first pool in a batch by overaligning them.

@sadiq Releasing memory is expensive (inter-processor interrupt to flush the TLB on all cores), so you might want to keep the free list and free it on Gc.compact instead.

@edwin Not uninteresting questions. Allocators are usually very bad at managing blocks with large alignment; their aligned-alloc function is usually tuned for small alignments. Using madvise(MADV_FREE) is specific to overcommitting Linux systems. There are libraries developed in academia that provide building blocks for allocators and GCs, but those are written in C++ and Rust due to the modularity requirements of such a thing.

Good point. I’ll knock up another version that frees the free list at the end of a major cycle and benchmark that too.

I came across GitHub - microsoft/mimalloc: mimalloc is a compact general purpose allocator with excellent performance. recently and wondered if something like it could potentially be adopted or used as inspiration. Glad to see I’m not the only one 🙂

1 Like

@sadiq I compared your pool_release_5.0 patch vs stock OCaml 5.0 on one of our workloads and the difference is pretty staggering.

With stock OCaml 5.0:

  • RSS shoots to 29G at the very beginning;
  • steadily grows to 48G;
  • peaks at 63G;
  • wall time 15m35s

With pool_release patch:

  • RSS stays around or under 18G for the vast majority of the workload;
  • peaks at 40G;
  • wall time 15m58s (+2.5%)

Of course, this is specific to Infer. Did you manage to run your benchmarks?

3 Likes

For completeness and out of curiosity, what are the numbers for the same workload with OCaml 4.x?

Closer to OCaml 5 with pool_release than to stock OCaml 5:

  1. RSS hovers around 15G for the majority of the workload;
  2. peaks at 43G;
  3. wall time 17m16s (perhaps it takes longer due to the extra work done during compaction).

2 Likes

Thanks so much for running these and it’s good to see it had an impact.

It took a little while to get these branches running in the benchmarking suite (and there’s still an abort in one of the parallel benchmarks I need to investigate), but there are some preliminary sequential numbers here:

https://sandmark.ocamllabs.io/?app=Sequential+Benchmarks&sequential_10=5.0.0%2Btrunk%2Bsequential&sequential_01=20230209&sequential_B2=navajo_5.0.1%2Btrunk%2Bsadiqj%2Bpool_release%2Bsequential_20230209_pool_re&sequential_11=20220322&sequential_02=navajo_5.0.1%2Btrunk%2Bsadiqj%2Bpool_release%2Bsequential_20230209_pool_re&sequential_00=5.0.1%2Btrunk%2Bsadiqj%2Bpool_release%2Bsequential&sequential_12=navajo_5.0.0%2Btrunk%2Bsequential_20220322_7130374&sequential_B0=navajo&sequential_B1=20230209&sequential_num_variants=2&sequential_find_by=variant

It seems the performance impact of not batching pool allocations is fairly small. The only difference between pool_release and pool_release_cycle is when pools are released. The former does so immediately, the latter only at the end of a major cycle.

I think there’s probably a good argument for releasing pools when done with them. I’m also pondering whether we need to mmap the pools or whether malloc might be sufficient.

2 Likes

This statement about Windows comes as a surprise to me. The Ravenbrook MPS has always done this on POSIX and Windows platforms, and it was one of the reasons a major commercial customer (on Windows) selected the MPS back in 2001: they measured the behaviour in various circumstances and particularly liked that overall memory usage declined after a GC. See (for Windows) mps/code/vmw3.c at master · Ravenbrook/mps · GitHub
The caveat about fragmentation is true and potentially important, but over the years I’ve seldom seen it cause problems (as long as one builds in some hysteresis).

1 Like

Right, I forgot that on Windows it is also possible to decommit (rather than release) part of the mapping, which is enough here.
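
A sketch of what that looks like (assuming the batch was reserved and committed with VirtualAlloc):

```c
#ifdef _WIN32
#include <stddef.h>
#include <windows.h>

/* MEM_DECOMMIT frees the physical pages and commit charge but keeps the
   address range reserved, so no hole appears in the VAS and the range
   can later be recommitted with VirtualAlloc(..., MEM_COMMIT, ...). */
static int decommit_range(void *addr, size_t len) {
  return VirtualFree(addr, len, MEM_DECOMMIT) != 0;
}
#endif
```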

Another issue with putting holes in the VAS is reaching the limit on the number of mappings, which is fairly low by default on Linux. jemalloc has a bug whereby it creates too many mappings when overcommitting is turned off. When overcommitting is enabled, the glibc malloc and jemalloc use madvise to decommit memory, as mentioned by @edwin. This affects the case where pools are mapped in a batch; it is unclear to me how it affects the case without batch allocation.

What I know on the subject comes from reading the source code of various allocators when working on ocaml-boxroot and the huge page allocator for OCaml. My advice would be to do just that if you want to release on-the-fly. In particular it seems to me that the best way to release memory is platform-specific (e.g. IIRC OSX has no overcommitting but has a commit/decommit mechanism similar to Windows).

(If I have to review @sadiq’s patch later on, I’d rather the simplest approach of releasing on Gc.compact be taken at first, given how tricky properly implementing on-the-fly releasing looks.)