Naive? suggestion for the minor GC

Hello,

I have a (may be naive?) proposition to improve minor GC, especially with 0Caml 5. Looking at minor_gc.c, which is
a quite complex/long piece of C code, my proposition may be unfeasible because I missed some points :wink:

I propose that the minor heap is splitted in two zones: (1) and (2) (beware this is not the usual algorithm for copying generational GC).

  • The minor GC does nothing but scanning for (1) and a copying in major/promotion for (2).
  • zone (1) and (2) are swapped as soon as zone (1) is full, so allocation is always in zone (1). swap = pointer swap or modification of memory mapping if possible, not copying.
  • stop the world minor GC are started when zone (1) and (2) of some domain is full.
  • invariant : after a minor GC, zone (2) is always empty and therefore zone (2) is always empty or full.

This seems to have the following advantages:

  1. the collecting of zone (1) is ultra fast (no copying, just scanning)
  2. objects are promoted only after one swap meaning much less promotion of short living objects
  3. domain with slow allocation rate will not promote nothing when zone (2) is empty.
  4. more object will remain in the minor heap private to each domain (good for immutable object, may be not so
    good for mutable ones ?)
  5. as a consequence, the GC should resist well to domain having different allocation pattern

Disadvantage: slow allocating domain will have zone (2) almost never used, wasting half of the minor heap (but they will only promote really long living object). May be adjusting the size of the minor heap dynamically for each domain could be a good idea if one want to save some memory space in that case ?

My 2 scents,
Cheers,
Christophe

1 Like

Hi Christophe, it is not a (totally*) naive proposal. I had the occasion to discuss a similar one with Stephen and KC which went as follows:

  • Implement minor heaps as circular buffers (the start and end of the minor heap no longer need to coincide with the start and end of the memory);
  • If, at moment of minor collection, one has used ≤ free, then one can do a copy collection into the remaining free area.

A copy collection is simply the current minor collection algorithm but with an allocation function that does a bump pointer allocation into the free area (no need to track a generation number for each value).

One can generalize it by, instead of being conditional on used ≤ free, having minor collection always start allocating into the free area of the minor heap, and then allocating in the major pools if full. Both I think avoid the wasted space from your approach.

This is better than the current implementation as you avoid lots of false promotions, as you mention (which cost the allocation in the major pool + the added GC pressure to work to free an equivalent amount).

(*) But I now think that this approach might still be a bit naive in the sense that it is unlikely to be as good as trying to have the GC work as much on each domain, e.g. having all minor heaps be full at the same time (in this case any space left in the minor heaps to perform some copying is negligible). Indeed, if the copying strategy worked well, then you likely end up with domains that have nothing useful to do while waiting for the minor collection to finish, not even opportunistic major slices. I think the GC Handbook mentions sizing strategies for the minor heaps, in addition @kayceesrk has already experimented with “local allocation buffers” and might have more ideas in this area.

At least I think it has the merit of illustrating the false promotion problem, which is good to keep in mind in order to see any multicore benchmark (good or bad) with the appropriate amount of grains of salt while it exists.

By the way I do not think that this issue was ever seen as a fundamental one, more of a technical one. The more fundamental consequence for OCaml concern programs that relied on strategies to optimize for throughput or latency (e.g. having huge minor heaps to drastically reduce GC costs in Coq; or on the contrary having small minor heaps and writing in a style that avoids allocating long-lived values for low latency). The consequence is that (AFAIU) it will not be possible to have such strategies that work at a domain level, only at a global level. (For instance, no low-latency domain interacting with a higher-latency domain.)

My proposal is indeed similar to your circular buffers except no copy is performed if the zone (2) is still empty,
But as you said your proposal probably waste less space. Both are worth a try, at least to see if it decreases the pressure on major gc for instance in Coq :wink:

Keeping all core busy could be solved by having the number of domain = twice the number of cores in some cases :wink:

And for the waste of space, it only happens when the minor heap of some domain fails to get filled and I think my approach and the current approach was the same things.

The real problem of my approach is that up to half of the minor heap may be unused in the case where there is very little live object in the zone (1) of the minor heap. I think wasting at most half of each minor heap could be OK.

And domain that have finished their minor GC, could help other domains to finish their own minor GC ? If it can be parallelised ?

The proposals are not naive.

We had tried domain-local allocation buffers (DLABs) to better utilise the minor heap area. In our experiments, we found that the memory hierarchy effects due to domains allocating into regions mmap’ed by other domains caused significant negative performance impact compared to the improvements gained by minimising false promotions. This effort did not make it into Multicore OCaml. We wrote up a retrospective here: Domain Local Allocation Buffers Addendum · ocaml-multicore/ocaml-multicore Wiki · GitHub

Given the experience with DLABs my suggestion for any GC experiments is to not start with ideas for a new algorithm. Instead

  1. Build a benchmark suite of representative benchmarks that you care about. Sandmark parallel benchmarks is just a tiny sliver in the very wide space. We don’t yet know how parallel programs in OCaml will be written. At the end of the day the benchmark suite is the one on which you will be measuring performance improvements.
  2. Collect metrics. If you think false promotions are a problem, let’s collect experimental evidence for this. At this point, I don’t know whether false promotions are a problem in parallel OCaml programs because I haven’t measured it. So I’m reluctant to comment on the details of the algorithms proposing to improve false promotions.

We can reuse some of these efforts across the different experiments that we want to do. We’re open to accepting contributions to Sandmark for example.

I should also say that any serious change in the GC algorithm will require significant software engineering effort if the plan is to upstream it. Memory corruptions in a parallel GC are a pain to debug.

3 Likes

Relatedly: on OCaml 4.x, I believe that @damiendoligez experimented quite a bit with having several minor arenas to reduce promotion rate, but somehow never found something that worked clearly better than the current 4.x approach. (The current proposal interacts with the multi-domain runtime is a different way, so it may be worth revisiting in any case.)

(If I remember correctly, GHC uses a configurable number of minor arenas/nurseries.)

1 Like

Reading the discussion, I though my proposal can be easily ameliorated with a dynamically adjustable ratio:

Before a minor collection, the heap would have a shape like

OOOXXXXXXXXXXXXXXXXXXXOOOOOOOOO

X: occupied: for two consecutive X, the leftest is the oldest allocated.
O: free (where allocation occurs)

The domain that trigerred the minor GC is full (no O)

after the collection only the oldest X that represent p% of the heap are promoted resulting in a minor heap of the shape (the X remaining are not copied or moved, they just may have pointer to update inside):

OOOOOOOOOXXXXXXXXXXXXXOOOOOOOOO

A domain that allocates a lot may have p near to 100% (almost promote everything) and a
domain that do not allocate may have a smaller p, near 10% meaning it uses its spare time to delay promotion.

I think (but may be plain wrong) that it is not very different from the actual minor GC: we promote blocks in the minor
heap based to a test on their position that should not be too hard to add? Then we scan the block where it is (or scan it before copying).

There is a natural formula for p : if we allocated v% of the heap since the previous minor GC, we want to make sure
there is this much free space at the next minor GC if we assume the same allocation rate.

The minor heap is composed of 100% = v% + q% + f% :

  • v: what was allocated since the preivous minor

  • q: what was not promoted previously,

  • f: free space
    We want f to become v to make room for just what we need so we promote max(0, v - f). We do not promote anything if v < f. We probably want a maximum for p like 80% to always avoid promotion of very recent object so this gives

  • p = min(80,max(0,v - f)) % of the minor heap to be promoted.

Remark: this formula gives a constant p for a full minor heap (as expected).

If allocation rate of each domain is relatively constant, all minor heaps will be full at the same time as constant allocation rate is the hypothesis of the above. May be we should limit the variation of p in case of highly variable allocation rate, like

p = u p + v min(80,max(0,v - f)) with u + v = 1

Remark:

  • allocation is a bit more expensive: we need two tests instead of 1 to know where to allocate
  • we may waste almost 2 blocks of the max size allocated in the minor heap instead of 1: one when we have to allocate back at the beginning of the minor heap and another one when we trigger the GC.

It is hard to measure false promotions per se, but a relevant metric would be to look at how full the minor heaps tend to be when they are collected.

This is different from copying collection in a single-threaded setting, notably the argument for why this is better than the current implementation does not apply when there is only one domain.

In other aspects, it is closer to the OCaml 4 GC. For instance, it lets you trigger the major slice in a domain precisely when the minor heap is half full, independently in each domain.