I would have expected an Out_of_memory exception. According to the documentation, this exception is not raised during a minor GC, so I can understand that a failed allocation ends up this way. But my server does not seem to consume a lot of memory and never terminates with an Out_of_memory exception. So I see two possibilities: either my server consumes more memory than I thought and I’m really unlucky, since the allocation failure always occurs during a minor GC, or there is a problem in the minor GC. This PR mentions the same error message, and it was not related (as far as I understand) to the program consuming too much memory.
Actually, this does not have anything to do with luck. What happens is that the runtime first allocates objects linearly in the minor heap, and once the minor heap is full, it copies the surviving content into the major heap. To be able to do so, the runtime extends the major heap sufficiently so that the whole content of the minor heap (2MB) fits. Therefore, if the runtime can never free any memory from the major heap, it will hammer the system with more and more memory requests, and none of them will raise an exception, because they all happen during a minor collection.
The only situation where you can get an exception is when the runtime has decided that an object is too large to fit on the minor heap and has thus chosen to directly allocate it outside of the minor heap. But if your program never creates very large arrays (for example), you will not get any exception.
So, while it would not be the first time there is a bug in the OCaml runtime, the little information you have given up to now is just as likely to point to a bug in your program.
@Zoggy which version of OCaml are you using? The fine details have changed recently (particularly between 4.x and 5.x), and as far as I know you should not see this message on 5.x unless you have some kind of memory corruption (typically an unsafe out-of-bounds write in an array or bytes, although a bug in the runtime cannot be totally excluded).
If you have a reliable way to reproduce the error (even if it takes a while) I would be interested in investigating.
Thanks @silene and @vlaviron for the explanations. My program is compiled with OCaml 5.1.0. It contains a binding to a C function that may perform large allocations, so that seems like a good place to start investigating.
$ ocamlopt --version
$ ulimit -S -v 1000000
$ cat foo.ml
let () =
  let r = ref [] in
  for i = 1 to 1_000_000_000 do
    r := Array.make 10 "" :: !r
  done
$ ocamlopt foo.ml
$ ./a.out
Fatal error: allocation failure during minor GC
It looks like memory corruption, as the minor GC is moving a block where the header claims a size of more than 1TB.
I’ll need a bit more time to find where the corruption occurs.
If you want to look at it yourself too, I used rr record ./_build/default/getdents.exe foo/ until it failed, and then rr replay to get into the debugger, cont (in the debugger) to let the program go until the failure, and then a mix of reverse-* instructions to go back to the failing allocation.
res = caml_alloc_2(Tag_cons, caml_copy_string(r->head), result);
The line above looks dubious. Consider the following scenario. C does not specify the order in which function arguments are evaluated, so the compiler may read the value of result first and set it aside for the call to caml_alloc_2, and only then call caml_copy_string, which can trigger a garbage collection. The collector updates the variable result, but not the copy that the compiler has already set aside. Therefore, the second field of the block allocated by caml_alloc_2 ends up holding a stale pointer, that is, garbage.
I suggest calling caml_copy_string separately, so that result is properly updated by the garbage collector, e.g.,
res = caml_copy_string(r->head);
result = caml_alloc_2(Tag_cons, res, result);
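To make the hazard concrete, here is a toy model in plain C. It is not the OCaml runtime: the cell type, toy_gc, toy_allocating_call and both demo functions are made-up names for illustration only. The sketch assumes a collector that moves the rooted object and then reuses the old location, which is roughly what a minor collection does from the point of view of C code.

```c
/* Toy model only: this is NOT the OCaml runtime. A "value" is a pointer
   to a cell, and toy_gc() plays the role of a moving minor collection. */
typedef struct { long payload; } cell;

static cell young[1];        /* stand-in for the minor heap */
static cell old[1];          /* stand-in for the major heap */
static cell **root = 0;      /* the single registered root, if any */

/* Move the rooted cell out of the young area, update the root, then
   scribble over the old location as a later allocation would. */
static void toy_gc(void) {
  if (root != 0 && *root == &young[0]) {
    old[0] = young[0];
    *root = &old[0];
  }
  young[0].payload = -1;     /* the space is reused: old contents are gone */
}

/* Stand-in for an allocating call (think caml_copy_string): in this
   model it always runs the collector before returning. */
static long toy_allocating_call(void) {
  toy_gc();
  return 7;
}

/* Hazard: the rooted variable is read BEFORE the allocating call. */
long read_then_allocate(void) {
  cell *result = &young[0];
  result->payload = 42;
  root = &result;
  cell *stale = result;            /* value set aside too early */
  (void)toy_allocating_call();     /* the collector moves the cell */
  long seen = stale->payload;      /* reads the reused location */
  root = 0;
  return seen;
}

/* Fix: finish the allocating call first, then read the rooted variable. */
long allocate_then_read(void) {
  cell *result = &young[0];
  result->payload = 42;
  root = &result;
  (void)toy_allocating_call();
  long seen = result->payload;     /* the root was updated by toy_gc */
  root = 0;
  return seen;
}
```

In this model read_then_allocate() returns -1 (the overwritten slot) while allocate_then_read() returns 42. The real runtime differs in every detail, but the ordering problem is the same one.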
I came to a similar conclusion, but for a slightly different reason.
The result of caml_copy_string is not registered with the GC (in the original code), so if caml_alloc_2 triggers a GC and the string was on the minor heap, it will be collected, with the temporary register holding the result now pointing to arbitrary memory in the minor heap. This arbitrary memory is later overwritten by another allocation, leading to a garbage header. Following this garbage header (typically when the list itself is promoted) triggers the error.
The code suggested by @silene above should fix the bug, as it stores the result of caml_copy_string into a GC-registered variable.
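For reference, a minimal sketch of what the whole conversion loop could look like with both temporaries registered. The struct entry layout and the function name entries_to_ocaml_list are assumptions for illustration (only the head field and the caml_* calls appear in the snippets above), so adapt them to the actual binding.

```c
#include <stddef.h>
#include <caml/mlvalues.h>
#include <caml/memory.h>
#include <caml/alloc.h>

/* Hypothetical C-side list node: only the field name `head` comes from
   the line quoted in this thread. */
struct entry {
  const char *head;
  struct entry *next;
};

/* Build an OCaml `string list` from a C linked list. Elements come out
   in reverse order because each new cell is consed onto the front. */
value entries_to_ocaml_list(const struct entry *r)
{
  CAMLparam0();
  CAMLlocal2(result, str);  /* both are registered roots until CAMLreturn */

  result = Val_emptylist;
  for (; r != NULL; r = r->next) {
    /* May trigger a GC; `result` is a root, so it is updated in place. */
    str = caml_copy_string(r->head);
    /* Both arguments are read from registered variables, and only after
       caml_copy_string has returned. */
    result = caml_alloc_2(Tag_cons, str, result);
  }
  CAMLreturn(result);
}
```

This needs to be compiled against the OCaml runtime headers and linked into the OCaml program, so it cannot be run on its own. The point of the design is simply that no OCaml value lives in an unregistered C temporary across a call that may allocate.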