OCaml runtime leak when creating/destroying many threads in 4.10+

Hi,

We’ve been trying to track down a memory leak that seems to have been introduced by a change between OCaml 4.08 and 4.10, and I think we’ve narrowed it (or at least one of them) down to https://github.com/ocaml/ocaml/commit/8b20b69a16bc11f277c514e81de8c92fe83293f5

There is a fix for this queued up in 4.1{2,3,4} and trunk: https://github.com/ocaml/ocaml/commit/04118b05 https://github.com/ocaml/ocaml/commit/1eeb0e7fe595f5f9e1ea1edbdf785ff3b49feeeb
https://github.com/ocaml/ocaml/commit/862ad80a2538433d90d878893bb0d75e50cb8dcb

However, I was confused about why I couldn’t reproduce the leak locally when using a compiler built from OPAM.
It turns out OPAM already applies the patch for <=4.12 (I wasn’t aware that a compiler from OPAM would have additional patches compared to upstream), due to: https://github.com/ocaml/opam-repository/commit/05035330325eb132d95446aecab07eddd88f110a
(if you’re on 4.13.x it doesn’t apply the patch though)

When building OCaml from upstream tarballs, for example with the Fedora .spec file, we don’t get the patch, which leads to a difference between dev and production:
https://src.fedoraproject.org/rpms/ocaml/tree/rawhide

It looks like this will sort itself out when new 4.1x.y upstream compiler packages are released, but meanwhile I thought it’d be useful to let people know in case they run into similar problems.
(it is particularly problematic for long-running daemons that spawn one thread per request)
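
For anyone who wants to check their own build, here is a minimal reproduction sketch. It only creates and joins a lot of short-lived threads and prints the process RSS between rounds. The names (`parse_vmrss_kb`, `rss_kb`, `churn`) and the round sizes are made up for illustration, reading `/proc/self/status` is Linux-only, and it assumes the leak is visible in the resident set size of the process.

```ocaml
(* One way to build (needs the threads library):
     ocamlfind ocamlopt -package threads.posix -linkpkg leak.ml -o leak *)

(* Extract the kB value from a "VmRSS:    1234 kB" line, if it is one. *)
let parse_vmrss_kb line =
  match String.split_on_char ':' line with
  | ["VmRSS"; rest] ->
    (match String.split_on_char ' ' (String.trim rest) with
     | kb :: _ -> int_of_string_opt kb
     | [] -> None)
  | _ -> None

(* Resident set size of this process in kB. Linux-only: returns None
   when /proc/self/status cannot be opened or has no VmRSS line. *)
let rss_kb () =
  match open_in "/proc/self/status" with
  | exception Sys_error _ -> None
  | ic ->
    let rec scan () =
      match input_line ic with
      | exception End_of_file -> None
      | line ->
        (match parse_vmrss_kb line with
         | Some _ as r -> r
         | None -> scan ())
    in
    let r = scan () in
    close_in ic;
    r

(* Create and join [n] short-lived threads, one at a time. *)
let churn n =
  for _i = 1 to n do
    Thread.join (Thread.create ignore ())
  done

let () =
  let report label =
    match rss_kb () with
    | Some kb -> Printf.printf "%s: %d kB\n%!" label kb
    | None -> Printf.printf "%s: RSS unavailable\n%!" label
  in
  report "start";
  (* Arbitrary round size, chosen only to keep the run short. *)
  for round = 1 to 5 do
    churn 5_000;
    report (Printf.sprintf "after %d threads" (round * 5_000))
  done
```

If the RSS keeps climbing round after round on one compiler but stays roughly flat on another, that would be consistent with the leak described above. Comparing a compiler installed from OPAM with one built from the upstream tarball should show the difference.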


opam-repository carries patches which are necessary to maintain the build on newer systems. In this case, that patch was back-ported as part of the malloc’ing of the sigaltstack to deal with glibc 2.34+ on F35 and Ubuntu 21.10+… it turns out we accidentally fixed this bug at the same time (I say accidentally, because I don’t think that memory leak was a known issue of the 4.10+ implementation)

Thanks, I understand now why you had to apply the patch in opam: on <= 4.12 the patch fixes both the build issue on newer glibcs and the memory leak, whereas 4.13.1 no longer has the build issue on newer glibcs, so no patch was applied there, which means it still has the memory leak.
That is what confused me: why would 4.13 have a memory leak when 4.12 doesn’t, given that none of the changes between 4.12 and 4.13 in the GitHub history introduced one? Once the opam patches are taken into account it all makes sense.
