No more memory leaks after Linux kernel version and settings update (!?)

TL;DR: a Linux kernel update (possibly with changed settings) fixed memory leaks.

Background:
I used to run my workloads in containers on a Flatcar Linux host with kernel version 5.15.48-flatcar.

After upgrading to OCaml 5+, my services started exhibiting large memory consumption. To give you some idea, the software basically does some audio processing and sends data over the network: a couple of MB/s between 8 replicas, no big deal. It wasn’t heavily optimized, but on OCaml 4.14.x the memory usage stayed constant at around 200 MB. I understand the changes around heap compaction in 5+ etc., but the problem here was that there were real leaks. We managed to solve them in an upcoming beta version by removing extra allocations of Bytes and Cstructs. Empirically my understanding was that the GC was lazy to collect and free the refs to cstructs.

Surprise:
I upgraded the hosts to NixOS with the 6.6.56 kernel and I am completely dumbfounded: the memory leaks stopped. The memory consumption in the stable version is minimal. The OCaml I used to know, lol. Any guesses what’s going on? The hardware DID NOT change.

Caveat:
We are early post-migration, and it might be an operator error of some kind, but that seems super unlikely.

Edit: We had some other unrelated (?) problems around memory management in other software, and I talked about them with some members of the community, but those involved OCaml 5, io_uring (via Eio) etc. This post is about OCaml 4-era software based on Lwt and libev.


This is pleasantly surprising. If you have any more insights into this, I’d love to know. In particular, other users may also be affected in a similar way, and I’d like to understand what fixed the issue.

Thanks for sharing.

There are a number of improvements around Garbage Collection pacing that might help with OCaml 5 memory consumption.

Could you open an issue on ocaml with as much detail as you have so we can look into it?
Have a look at ocaml/ocaml issue #13123, "Regression with default GC settings between `4.14.2` and `5.1.1`", where I did some testing of these changes against liquidsoap. If you have an application or smaller reproduction example I would be happy to have a look at it. Ping me on Discord if you want.
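Since that issue is about default GC settings, a quick experiment is to print the settings the running program actually uses and override `space_overhead` at startup. The following is a minimal sketch using only the standard `Gc` module. The helper names are hypothetical, and the value 80 is purely illustrative, not a recommendation.

```ocaml
(* Sketch: inspect and override the major GC's space_overhead.
   Helper names are hypothetical; 80 is an illustrative value only. *)

(* Print the GC parameters currently in effect. *)
let print_gc_params () =
  let c = Gc.get () in
  Printf.printf "space_overhead=%d minor_heap_size=%d words\n%!"
    c.Gc.space_overhead c.Gc.minor_heap_size

(* Change only space_overhead, keeping every other parameter as it is. *)
let set_space_overhead n =
  Gc.set { (Gc.get ()) with Gc.space_overhead = n }

let () =
  print_gc_params ();
  set_space_overhead 80;
  print_gc_params ()
```

The same parameter can be set without rebuilding through the environment, e.g. `OCAMLRUNPARAM='o=80'`, which makes it easy to compare runs of the same binary.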

Are you using ephemerons / weak references anywhere in your code or libraries? There are some other reported issues in that area, such as ocaml/ocaml pull request #13643 by stedolan, "Allow values reachable from ephemeron keys to be collected by minor GC".

> Empirically my understanding was that the GC was lazy to collect and free the refs to cstructs.

I wonder how we could track this, either with statmemprof or with other GC instrumentation such as Olly / runtime_events_tools.

I have a near-minimal repro for other GC-related issues in OCaml 5 and will publish it some time next week.

In this thread I am mostly curious about the uncanny impact of the host update (the kernel, from the container’s point of view) on memory consumption. The software I described here was originally built with OCaml 4.14, and the current OCaml 5 build does not contain any changes to the root logic or libraries. It’s still Lwt+libev. It used to leak (evidently) on the older kernel and now it doesn’t, which blew my mind.

I would like to get to the bottom of that. It would be good to set up a reproduction with better instrumentation to see what is going on. Was the application leaking, or was it just slow in getting rid of garbage? Could you open an issue for that and dump as much detail as you can about packages / kernel version etc.?
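One rough way to tell a real leak from garbage that simply has not been collected yet is to force a full major collection and look at what is still live afterwards. The sketch below uses only the standard `Gc` module. The function names are hypothetical, and it only sees the OCaml heap, so memory held outside it (for example by C libraries) will not show up.

```ocaml
(* Sketch: compare heap size with what survives a forced full major GC.
   If the live size stays small while the heap keeps growing, the program
   is more likely slow to collect than truly leaking. Names are hypothetical. *)

(* Words still reachable after a forced full major collection. *)
let live_words_after_full_major () =
  Gc.full_major ();
  (Gc.stat ()).Gc.live_words

(* Convert a count of machine words to megabytes. *)
let words_to_mb w =
  float_of_int w *. float_of_int (Sys.word_size / 8) /. 1e6

(* Print the heap size before the collection and the live size after it. *)
let report label =
  let heap_before = (Gc.quick_stat ()).Gc.heap_words in
  let live = live_words_after_full_major () in
  Printf.printf "%s: heap before=%.1f MB, live after full major=%.1f MB\n%!"
    label (words_to_mb heap_before) (words_to_mb live)

let () = report "startup"
```

Calling `report` at a few points during a busy period (it is expensive, so not in a hot path) gives a series of live sizes to attach to the issue.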

It would grow indefinitely until the activity of the program stopped almost completely. In our case the traffic drops significantly after 8 PM, and at that time the memory would be reclaimed, when there’s virtually nothing happening in the program. I don’t know what the actual upper limit is (if any), but the same software that ran today at a maximum of 115 MB of RAM used to grow to 2 GB in around an hour and a half, at which point it was restarted.
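To see whether growth like this happens inside the OCaml heap or outside it, it can help to log the process RSS as the kernel reports it next to the heap size the runtime reports. This is a Linux-only sketch that reads `/proc/self/status`; all function names are hypothetical. As far as I know, Bigarray-backed buffers such as those behind Cstruct are allocated outside the OCaml heap, so they would show up in RSS but not in `heap_words`.

```ocaml
(* Sketch: log kernel-reported RSS next to the OCaml heap size.
   Linux-only (reads /proc/self/status); names are hypothetical. *)

(* Parse a line such as "VmRSS:    117760 kB"; returns the value in kB. *)
let parse_vmrss line =
  try Scanf.sscanf line "VmRSS: %d kB" (fun kb -> Some kb)
  with Scanf.Scan_failure _ | End_of_file | Failure _ -> None

(* Resident set size of the current process in kB, if available. *)
let rss_kb () =
  match open_in "/proc/self/status" with
  | exception Sys_error _ -> None
  | ic ->
    let rec loop () =
      match input_line ic with
      | exception End_of_file -> None
      | line ->
        (match parse_vmrss line with
         | Some _ as r -> r
         | None -> loop ())
    in
    Fun.protect ~finally:(fun () -> close_in_noerr ic) loop

(* Size of the OCaml major heap in kB, as reported by the runtime. *)
let heap_kb () =
  (Gc.quick_stat ()).Gc.heap_words * (Sys.word_size / 8) / 1024

let log_memory () =
  match rss_kb () with
  | Some rss ->
    Printf.printf "rss=%d kB ocaml_heap=%d kB\n%!" rss (heap_kb ())
  | None ->
    Printf.printf "rss=unavailable ocaml_heap=%d kB\n%!" (heap_kb ())

let () = log_memory ()
```

Calling `log_memory` periodically from the application's own loop on both kernels would show whether the gap between RSS and the OCaml heap is what changed after the host upgrade.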

I will create an issue with the libraries used and the kernel version.
