Fatal error: allocation failure during minor GC


I’m developing a server and it sometimes crashes with the following error message:

Fatal error: allocation failure during minor GC
Aborted (core dumped)

I have the core dump file, but how can I use it to find out what happened? Running gdb gives me:

Reading symbols from devel/activitypub/_build/default/bin/server.exe...
[New LWP 14228]
[New LWP 14232]
[New LWP 14311]
[New LWP 14637]
[New LWP 14446]
[New LWP 14243]
[New LWP 14339]
[New LWP 14310]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `../_build/default/bin/server.exe -c taps.conf'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140267302415424) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f9286bbec40 (LWP 14228))]

This error means that your program ran out of memory (malloc failed)…



I would have expected an Out_of_memory exception. According to the documentation, this exception is not raised by the minor GC, so I can understand that a failed allocation ends this way. But my server does not seem to consume a lot of memory and never terminates with an Out_of_memory exception. So I see two possibilities: either my server consumes more memory than I thought and I’m really unlucky that the allocation failure always occurs during a minor GC, or there is a problem in the minor GC. This PR mentions the same error message and, as far as I understand, it was not related to the program consuming too much memory.

Actually, this has nothing to do with luck. The runtime first allocates objects linearly in the minor heap; once the minor heap is full, it copies its content into the major heap. To do so, the runtime extends the major heap enough for the whole content of the minor heap (2MB) to fit. Therefore, if the runtime can never free any memory from the major heap, it will hammer the system with larger and larger memory allocations, and none of them will raise an exception, because they all happen during a minor collection.
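To make the mechanism above concrete, here is a toy model in C. It is only a sketch and not the OCaml runtime's code: the names toy_heap, toy_alloc_small and toy_minor_collect are hypothetical, the minor area is 64 bytes instead of 2MB, and every object is assumed to stay live (as in a program that never drops a reference). The point it illustrates is that the growth of the major heap happens inside the collection, where the only possible reaction to a failed allocation is to abort.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MINOR_BYTES 64  /* toy size; the thread quotes 2MB for the real minor heap */

typedef struct {
    unsigned char minor[MINOR_BYTES];
    size_t minor_used;      /* bump pointer into the minor area */
    unsigned char *major;   /* grown on every promotion */
    size_t major_used;
    size_t promotions;      /* number of minor collections so far */
} toy_heap;

/* Copy the whole minor area into the major heap, growing the major heap first.
   Assumption of this toy model: every object is live, so nothing is freed. */
static void toy_minor_collect(toy_heap *h) {
    if (h->minor_used == 0) return;
    unsigned char *grown = realloc(h->major, h->major_used + h->minor_used);
    if (grown == NULL) {
        /* We are in the middle of a collection: no exception can be raised. */
        fprintf(stderr, "Fatal error: allocation failure during minor GC\n");
        abort();
    }
    memcpy(grown + h->major_used, h->minor, h->minor_used);
    h->major = grown;
    h->major_used += h->minor_used;
    h->minor_used = 0;
    h->promotions++;
}

/* Bump-allocate a small object; collect first when the minor area is full. */
static void *toy_alloc_small(toy_heap *h, size_t bytes) {
    assert(bytes <= MINOR_BYTES);
    if (h->minor_used + bytes > MINOR_BYTES) toy_minor_collect(h);
    void *p = h->minor + h->minor_used;
    h->minor_used += bytes;
    return p;
}
```

In this model, a caller of toy_alloc_small never sees a failure it could handle: the small allocation itself always succeeds, and only the promotion can run out of memory.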

The only situation where you can get an exception is when the runtime has decided that an object is too large to fit on the minor heap and has thus chosen to directly allocate it outside of the minor heap. But if your program never creates very large arrays (for example), you will not get any exception.

So, while it would not be the first time there is a bug in the OCaml runtime, the little information you have given so far is just as likely to point to a bug in your program.

@Zoggy which version of OCaml are you using? The fine details have changed recently (particularly between 4.x and 5.x), and as far as I know you should not see this message on 5.x unless you have some kind of memory corruption (typically an unsafe out-of-bounds write into an array or bytes, although a bug in the runtime cannot be totally excluded).
If you have a reliable way to reproduce the error (even if it takes a while) I would be interested in investigating.

Thanks @silene and @vlaviron for the explanations. My program is compiled with OCaml 5.1.0. It contains a binding to a C function that may perform a large allocation, so that seems a good place to start investigating.

You will get this failure if you exhaust memory:

$ ocamlopt --version
$ ulimit -S -v 1000000
$ cat foo.ml
let () =
  let r = ref [] in
  for i = 1 to 1_000_000_000 do
    r := Array.make 10 "" :: !r
  done
$ ocamlopt foo.ml
$ ./a.out 
Fatal error: allocation failure during minor GC

@vlaviron I created a small project which embeds the faulty (C) code. It requires lwt.unix and lwt_ppx and uses some Lwt macros.

Compile with dune build.

Just populate a directory foo with a few thousand files (you can use the included populate.bash script) and run ./_build/default/getdents.exe foo:

$ ./_build/default/getdents.exe foo/
Fatal error: allocation failure during minor GC
Aborted (core dumped)

(Sometimes the program terminates normally.)

I must have done something wrong…

It looks like memory corruption, as the minor GC is moving a block where the header claims a size of more than 1TB.
I’ll need a bit more time to find where the corruption occurs.
If you want to look at it yourself too: I ran rr record ./_build/default/getdents.exe foo/ until it failed, then rr replay to get into the debugger, cont (in the debugger) to let the program run until the failure, and then a mix of reverse-* commands to go back to the failing allocation.
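For readers wondering how a header can "claim" more than 1TB, here is a small C sketch. It assumes the usual 64-bit header layout with no reserved bits (tag in bits 0-7, GC color in bits 8-9, size in words in bits 10-63); the helper names are hypothetical and are not the runtime's own macros. When eight bytes of unrelated data, for example part of a file name, end up where a header is expected, the decoded size is enormous.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout: bits 0-7 tag, bits 8-9 GC color, bits 10-63 size in words. */
static uint64_t header_wosize(uint64_t hd) { return hd >> 10; }
static unsigned header_tag(uint64_t hd) { return (unsigned)(hd & 0xFF); }

/* Size of the block's fields in bytes, for 8-byte words. */
static uint64_t header_bytes(uint64_t hd) { return header_wosize(hd) * 8; }

/* Reinterpret 8 bytes of text as a little-endian header word,
   which is what happens when an allocation overwrites a dangling block. */
static uint64_t header_from_text(const char text[8]) {
    uint64_t hd = 0;
    for (int i = 7; i >= 0; i--)
        hd = (hd << 8) | (unsigned char)text[i];
    return hd;
}

/* Print a decoded header, e.g. describe_header(header_from_text("file0001")). */
static void describe_header(uint64_t hd) {
    printf("tag=%u words=%llu bytes=%llu\n",
           header_tag(hd),
           (unsigned long long)header_wosize(hd),
           (unsigned long long)header_bytes(hd));
}
```

A well-formed header for a two-field block with tag 0, such as a list cell, decodes to 2 words; the text "file0001" decodes to a size far beyond 1TB, which the minor GC then tries to copy.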

res = caml_alloc_2(Tag_cons, caml_copy_string(r->head), result);

The line above looks dubious. C does not specify the order in which function arguments are evaluated, so consider the following scenario. The value of result is read first, as the third argument of caml_alloc_2; then caml_copy_string is called, which potentially triggers a garbage collection. The collector updates the content of the variable result, but the compiler has already set the old value aside, so that copy is not updated. Therefore, the second component of the block allocated by caml_alloc_2 gets filled with garbage.

I suggest calling caml_copy_string separately, so that result is properly updated by the garbage collector, e.g.,

res = caml_copy_string(r->head);
result = caml_alloc_2(Tag_cons, res, result);
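The hazard can be reproduced without the OCaml runtime. The following C sketch is a simulation under stated assumptions, not the real stub: toy_collect stands in for a moving collection, toy_copy for caml_copy_string, toy_root for a root registered with the GC, and all names are hypothetical. The two cons functions spell out the two evaluation orders explicitly, since the order chosen for the one-line version is up to the compiler.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy list cell. `stale` marks memory the toy collector has moved away from;
   a real collector would reuse that memory for other objects. */
typedef struct cell {
    int head;
    int stale;
    struct cell *tail;
} cell;

/* Address of the single registered root (stand-in for a GC-registered local). */
static cell **toy_root = NULL;

static cell *toy_new_cell(int head, cell *tail) {
    cell *c = malloc(sizeof *c);
    if (c == NULL) abort();
    c->head = head;
    c->stale = 0;
    c->tail = tail;
    return c;
}

/* Toy moving collection: relocate the cell the root points to, update the root. */
static void toy_collect(void) {
    if (toy_root == NULL || *toy_root == NULL) return;
    cell *old = *toy_root;
    assert(!old->stale);    /* a registered root always points to a live cell */
    *toy_root = toy_new_cell(old->head, old->tail);
    old->stale = 1;
}

/* Stand-in for caml_copy_string: an allocation that triggers a collection. */
static int toy_copy(int v) {
    toy_collect();
    return v;
}

/* One order the compiler may pick for the one-line version. */
static cell *cons_read_then_copy(int v, cell **result) {
    cell *saved = *result;              /* result is read first */
    int head = toy_copy(v);             /* the collector then moves *result */
    return toy_new_cell(head, saved);   /* tail is the stale address */
}

/* Order enforced by the two-statement fix. */
static cell *cons_copy_then_read(int v, cell **result) {
    int head = toy_copy(v);             /* collect first */
    return toy_new_cell(head, *result); /* then read the updated root */
}
```

With a root pointing at a one-cell list, cons_read_then_copy returns a cell whose tail is marked stale, while cons_copy_then_read returns a cell whose tail is the relocated, live cell. The memory is deliberately never freed here, so that the stale cell can be inspected safely.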

I came to a similar conclusion, but for a slightly different reason.
The result of caml_copy_string is not registered with the GC (in the original code), so if caml_alloc_2 triggers a GC and the string was on the minor heap, it will be collected, with the temporary register holding the result now pointing to arbitrary memory in the minor heap. This arbitrary memory is later overwritten by another allocation, leading to a garbage header. Following this garbage header (typically when the list itself is promoted) triggers the error.
The code suggested by @silene above should fix the bug, as it stores the result of caml_copy_string into a GC-registered variable.

No, this cannot happen, since caml_alloc_2 properly registers its arguments in case of a garbage collection. So, the result of caml_copy_string is fine.

With your change it now works without any problem. Thanks a lot! I will try to remember to allocate and assign step by step to avoid this in the future.
