Need help understanding a segmentation fault with OCaml 5.0

I have had a hard time trying to reduce a segmentation fault observed with OCaml 5.0. I would prefer not to file an issue before having something smaller to reproduce and before making sure that the fault is not in my code, but since I have made no progress for three weeks, I dare to ask the community for some help!

Here are some steps that should reproduce the segmentation fault:

$ docker run -it --rm ocaml/opam:debian-11-ocaml-5.0
opam@...:~$ opam pin add https://github.com/thierry-martinez/stdcompat.git#disable-magic
opam@...:~$ opam pin add https://github.com/thierry-martinez/metapp.git
opam@...:~$ opam pin add https://github.com/thierry-martinez/metaquot.git#ocaml-5.0-segfault
opam@...:~$ opam install refl

The segmentation fault occurs while metaquot.ppx preprocesses the source file ppx_refl.ml, and disappears when Gc.minor () is called before preprocessing each expression (see Solve segfault with OCaml 5.0 · thierry-martinez/metaquot@e19ef99 · GitHub). The segmentation fault does not occur with OCaml 4.14 and below, and it disappears with small variations in the code: when I try to reduce the size of ppx_refl.ml, when I change the code of metaquot.ml, or even when I try to embed the relevant parts of ppxlib in metaquot to make the example more standalone.
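
For reference, the workaround amounts to something like the following minimal sketch. The helper name map_with_minor_gc and the use of List.map are assumptions made for illustration only; the actual change is the one in the linked metaquot commit.

```ocaml
(* Hypothetical helper, for illustration only: force a minor collection
   before each item is processed. In metaquot the "items" would be the
   expressions being preprocessed. *)
let map_with_minor_gc (f : 'a -> 'b) (items : 'a list) : 'b list =
  List.map
    (fun item ->
      Gc.minor ();  (* empty the minor heap before running f *)
      f item)
    items
```

This only hides the crash by changing when collections happen; it does not explain the underlying fault.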

There is no use of Obj.magic and no FFI call in the code of stdcompat (as pinned), metaquot, or metapp, and I believe there is none in ppxlib either.

valgrind gives the following backtrace:

==1294611== Process terminating with default action of signal 11 (SIGSEGV)
==1294611==  Bad permissions for mapped region at address 0x56D9E8
==1294611==    at 0xA5C849: atomic_store_relaxed (platform.h:68)
==1294611==    by 0xA5C849: mark_slice_darken (major_gc.c:690)
==1294611==    by 0xA5C849: do_some_marking (major_gc.c:720)
==1294611==    by 0xA5C849: mark (major_gc.c:730)
==1294611==    by 0xA5CD46: major_collection_slice (major_gc.c:1241)
==1294611==    by 0xA5D704: caml_major_collection_slice (major_gc.c:1365)
==1294611==    by 0xA4CEFC: caml_poll_gc_work (domain.c:1523)
==1294611==    by 0xA60E4C: caml_check_urgent_gc (minor_gc.c:867)
==1294611==    by 0xA6BC6E: caml_c_call (in /home/tmartine/tmp/ppx_segfault/_build/default/ppx_segfault.exe)
==1294611==    by 0x4DB151F: ???
==1294611==    by 0x4DB7AC7: ???
==1294611== 

I wish I could provide a smaller, more standalone example, but I am stuck on how to reduce it without making the segmentation fault disappear. Any help understanding the problem will be appreciated! (Even knowing whether the problem can be reproduced in various settings would be useful!) Thank you very much!


I can reproduce the problem on my own laptop, running Debian 10.
I ran the installation steps you described, after creating an opam switch with 5.0.0~alpha1.

I’ve started debugging the issue, and it looks like there is a big closure allocation (size 552, start of environment at 527) that is being initialized, and the GC is running in the middle of the initialization. Because the block is not initialized yet, the start_of_env field is not yet set properly, so the block is pushed on the mark stack with offset 0 (i.e. without skipping the code pointers). After a bit of marking, the block is put back on the mark stack with a non-zero offset, and execution resumes. The initialization code finishes running, and when marking resumes it starts trying to mark code pointers. This is what triggers the segfault. I haven’t yet found why the GC manages to run between the initial allocation and its initialization, though: the window is very short and doesn’t contain any code that looks likely to trigger a GC. Hopefully someone more familiar with the GC than I am will find the answer.
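
To make the notion of a closure block a bit more concrete, here is a minimal sketch (not taken from the runtime or from the failing code) that uses the Obj module to look at a small closure. make_adder is a hypothetical example function, and the size printed depends on the compiler and backend, so the code only checks the block's tag.

```ocaml
(* Illustration only: a closure is a heap block tagged Obj.closure_tag.
   Its first fields hold code pointers and closure info; the captured
   values (the environment) come after them. *)
let make_adder a b = fun x -> x + a + b

let () =
  (* Sys.opaque_identity keeps the optimizer from folding the arguments. *)
  let f = make_adder (Sys.opaque_identity 1) (Sys.opaque_identity 2) in
  let block = Obj.repr f in
  assert (Obj.tag block = Obj.closure_tag);
  Printf.printf "closure block size in words: %d\n" (Obj.size block)
```

The closure in the crash is the same kind of block, only much larger (552 words, with the environment starting at word 527).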

I’m quite confident that it’s a bug in the compiler, and not in your code, so I would suggest creating an issue. You can quote my previous paragraph to give the other maintainers more information.


Thank you very much for your debugging, @vlaviron! Issue created: Segmentation fault linked to a big closure allocation with OCaml 5.0 · Issue #11482 · ocaml/ocaml · GitHub.
