Just came across this interesting paper: https://www.microsoft.com/en-us/research/uploads/prod/2023/09/ocamlrc.pdf
Looks like they replaced OCaml 4.14's garbage collector with the Perceus reference-counting strategy, then benchmarked it and got some promising initial results. The quote that stands out most to me:
> Since Perceus is garbage free, the Perceus backend uses less memory than the OCaml garbage collector on all benchmarks, with a 40%+ reduction in the cfold and deriv programs. However, for the other three benchmarks the working set is surprisingly close, which is a testament to how well the OCaml GC is able to manage its memory.
Clearly, OCaml gives almost any memory management strategy a run for its money!
I’m guessing you’re aware of Bacon et al.’s similar experiment with the JikesRVM Java VM, swapping in an SMP-friendly RCGC for the generational GC (IIRC) that JikesRVM was originally implemented with.
I don’t remember the details, but some time ago I saw a Twitter discussion about a clever way someone had found to analyze the overhead of garbage collection by injecting some kind of GC instructions at certain points in the generated Java bytecode (IIRC). But I’m guessing that was different from what you’re mentioning.
I think it’s this paper, but I’m not sure (didn’t read it, since I could talk with him directly): David F. Bacon - Wikipedia
Found the paper: https://people.cs.umass.edu/~emery/pubs/gcvsmalloc.pdf
They use this cool technique where they do extensive tracing of a Java program to figure out the optimal points to insert free() calls, i.e. they have what they call an ‘oracular GC’. It’s not super practical, but it’s probably the best possible ‘ideal’ comparison because it really approaches ceteris paribus.
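A toy model of the oracular idea might look like this (my own illustrative sketch, not the paper’s actual machinery, which instruments real Java programs): given a complete trace, you know each object’s last use, so you can "free" it immediately afterwards and measure the resulting peak footprint against a lazier policy.

```python
# Toy model of "oracular" free insertion: given a full allocation/use trace,
# free each object right after its last use and measure peak live bytes.
# (Hypothetical simplification; the real system works from traces of actual
# Java programs, not a synthetic event list.)

def last_uses(trace):
    """Map each object to the index of its final appearance in the trace."""
    last = {}
    for i, (_op, obj, _size) in enumerate(trace):
        last[obj] = i
    return last

def peak_live_bytes(trace, oracle=True):
    last = last_uses(trace)
    live = {}                       # obj -> size currently allocated
    peak = 0
    for i, (op, obj, size) in enumerate(trace):
        if op == "alloc":
            live[obj] = size
        peak = max(peak, sum(live.values()))
        if oracle and last[obj] == i:
            live.pop(obj, None)     # oracular free: right at the last use
    return peak

trace = [
    ("alloc", "a", 100), ("alloc", "b", 50),
    ("use",   "a", 0),   ("alloc", "c", 200),
    ("use",   "c", 0),
]
print(peak_live_bytes(trace, oracle=True))   # 200: "a" and "b" gone before "c" arrives
print(peak_live_bytes(trace, oracle=False))  # 350: everything stays live
```

The gap between the two numbers is exactly the kind of headroom the paper uses as its "ideal" baseline for comparing real collectors.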
I don’t remember the details anymore, but at least in RCGC systems, various things are sometimes done to increase concurrency or decrease pause times, and some of those techniques defer the decrements (and sometimes the increments). So something might reach refcount zero, but because the decrement is deferred, it can be a little while before it is applied; then whatever that block points at gets decremented (again, perhaps via a queue), and so on down the chain. The collective effect of all this deferral is to increase the memory footprint of the program, since a sizable amount of memory can be tied up in objects that are morally free but not yet back on the free lists for re-allocation.
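To make the effect concrete, here is a minimal sketch of deferred-decrement reference counting (my own illustration, not the scheme from any of the papers above): `dec()` only enqueues, so objects whose count has morally hit zero remain resident until the next flush, and their children’s decrements are deferred again in turn.

```python
# Sketch of deferred-decrement RC: dec() merely enqueues a pending decrement,
# so "morally free" objects stay resident until the queue is flushed.
# (Illustrative only; real collectors batch flushes per epoch/collection.)

class Obj:
    def __init__(self, name, children=()):
        self.name, self.children, self.rc = name, list(children), 1

class DeferredRC:
    def __init__(self):
        self.heap = set()
        self.pending = []              # queue of deferred decrements

    def alloc(self, name, children=()):
        o = Obj(name, children)
        for c in o.children:
            c.rc += 1                  # new object holds refs to children
        self.heap.add(o)
        return o

    def dec(self, obj):
        self.pending.append(obj)       # deferred: nothing is freed yet

    def flush(self):
        while self.pending:
            o = self.pending.pop()
            o.rc -= 1
            if o.rc == 0:
                self.heap.discard(o)
                # children's decrements are themselves deferred
                self.pending.extend(o.children)

rc = DeferredRC()
leaf = rc.alloc("leaf")
root = rc.alloc("root", [leaf])
rc.dec(leaf); rc.dec(root)             # drop our references to both
print(len(rc.heap))                    # 2: both still resident, "morally free"
rc.flush()
print(len(rc.heap))                    # 0: reclaimed only after the flush
```

Between `dec()` and `flush()` the working set is inflated by exactly the objects sitting in the queue, which is the footprint cost of the deferral described above.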
My guess (based on your description, and a vague memory of looking at the paper when it was first mentioned here) is that the paper with this tracing-based technique didn’t do that: it actually immediately freed whatever could be freed, so it will yield a memory footprint better than anything achievable by any practical GC (whether RC, copying, or whatever).
I have a distinct memory that Bacon’s RCGC indeed deferred both decrements and increments, in a very clever scheme that allowed for better SMP and memory-bus utilization. Specifically, it was designed to reduce locked bus cycles (e.g. those caused by compare-and-swap operations), and hence to reduce the need for them.
Just to advertise, the Perceus thing is going to be presented at the ML workshop next Friday. There will be a lot of talks about memory management there!