We have a server that OOMs every 30 minutes, exhausting 16G of RAM. Thinking we were leaking some global state, we memprof’d it and were surprised by these results:
A legend, if you're not used to this output: in black, the total memory consumption at time t; in blue, the memory allocated at time t that is still uncollected within the highlighted (red) zone.
If you compare the blue curves, they "start" later and later as we progress through the pictures. So it seems that we are not actually leaking anything, since everything is eventually collected. However, this "backlog" of collectable memory grows steadily until system memory is exhausted. We can sort of see the GC triggering at regular intervals, so it's as if the amount of memory freed during each interval were always less than the amount allocated in that interval, so the garbage keeps piling up. As if the major slice size were too low? It could also be that the GC triggers less and less often, but that seems unlikely.
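To check whether the major GC really does complete at regular intervals (rather than less and less often), one option is to hook a GC alarm that logs heap statistics at the end of every major cycle. This is only a minimal sketch using the standard `Gc.create_alarm` and `Gc.quick_stat` APIs; which fields are worth logging is our own guess:

```ocaml
(* Register a GC alarm: the callback runs at the end of every major
   collection cycle, so the log shows how often major cycles finish
   and how large the heap is at that point. *)
let () =
  let _alarm =
    Gc.create_alarm (fun () ->
        let s = Gc.quick_stat () in
        Printf.eprintf "[gc] major cycle %d finished, heap = %d MiB\n%!"
          s.Gc.major_collections
          (s.Gc.heap_words * (Sys.word_size / 8) / (1024 * 1024)))
  in
  ()
```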
In any case, the smoking gun proving that we're not leaking is that firing a Stdlib.Gc.full_major ()
every minute solves the problem entirely, keeping RAM consumption around 1G even after an hour. (Each call usually completes in ~100ms, so ironically, for our use case it is a pretty good garbage collection algorithm.)
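For reference, a minimal sketch of the kind of periodic full_major workaround we mean (the dedicated thread and the 60 s interval are illustrative only, not our exact production code):

```ocaml
(* Illustrative workaround: force a full major collection every 60 s
   from a background thread. Requires linking the threads library. *)
let start_periodic_full_major () =
  ignore
    (Thread.create
       (fun () ->
         while true do
           Thread.delay 60.;
           (* Walks the whole major heap; ~100ms in our case. *)
           Gc.full_major ()
         done)
       ())
```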
Are we correct in concluding that this is a bug? Tweaking the GC settings to make it more aggressive could also solve our issue, but I would expect it never to diverge like this in the first place?
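For completeness, the kind of tuning we have in mind is lowering space_overhead (default 120), either via Gc.set at startup or via OCAMLRUNPARAM; the value 40 below is only an example we have not validated:

```ocaml
(* Example of more aggressive GC tuning: shrink space_overhead so the
   major GC works harder relative to allocation. The value 40 is only
   an illustration; roughly equivalent to OCAMLRUNPARAM='o=40'. *)
let () =
  let params = Gc.get () in
  Gc.set { params with Gc.space_overhead = 40 }
```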
Thanks for your time!