Segfault in `caml_shared_try_alloc` in multicore GC


I’m trying out OCaml multicore (4.12.0+domains) as a backend for our compiler, which generates OCaml code. Roughly every tenth bootstrap triggers a segfault in `caml_shared_try_alloc`. This is the top of the backtrace from lldb:

```
thread #1, queue = '', stop reason = EXC_BAD_ACCESS (code=1, address=0x14ff)
* frame #0: 0x00000001001f084a mii`caml_shared_try_alloc [inlined] pool_allocate(local=0x0000000101808800, sz=<unavailable>) at shared_heap.c:353:18 [opt]
frame #1: 0x00000001001f044e mii`caml_shared_try_alloc(local=0x0000000101808800, wosize=6, tag=3, pinned=0) at shared_heap.c:392 [opt]
frame #2: 0x00000001001d4161 mii`oldify_one [inlined] alloc_shared(wosize=6, tag=3) at minor_gc.c:151:15 [opt]
frame #3: 0x00000001001d4145 mii`oldify_one(st_v=<unavailable>, v=68719644640, p=0x000000010a01d280) at minor_gc.c:313 [opt]
frame #4: 0x00000001001d47b7 mii`oldify_mopup(st=0x00007ffeefbff100, do_ephemerons=0) at minor_gc.c:449:9 [opt]
frame #5: 0x00000001001d3f90 mii`caml_empty_minor_heap_promote(domain=0x00000001004537c0, participating_count=<unavailable>, participating=0x00000001004603f8, not_alone=1) at minor_gc.c:676:3 [opt]
frame #6: 0x00000001001d4aa6 mii`caml_stw_empty_minor_heap_no_major_slice(domain=0x00000001004537c0, unused=<unavailable>, participating_count=9, participating=0x00000001004603f8) at minor_gc.c:740:3 [opt]
frame #7: 0x00000001001d4bf3 mii`caml_stw_empty_minor_heap(domain=0x00000001004537c0, unused=<unavailable>, participating_count=<unavailable>, participating=<unavailable>) at minor_gc.c:768:3 [opt]
frame #8: 0x00000001001f6efa mii`caml_try_run_on_all_domains_with_spin_work(handler=(mii`caml_stw_empty_minor_heap at minor_gc.c:767), data=0x0000000000000000, leader_setup=<unavailable>, enter_spin_callback=<unavailable>, enter_spin_data=<unavailable>) at domain.c:895:3 [opt]
frame #9: 0x00000001001d4c7d mii`caml_empty_minor_heaps_once [inlined] caml_try_stw_empty_minor_heap_on_all_domains at minor_gc.c:799:10 [opt]
frame #10: 0x00000001001d4c51 mii`caml_empty_minor_heaps_once at minor_gc.c:817 [opt]
frame #11: 0x00000001001f8193 mii`caml_poll_gc_work at domain.c:942:5 [opt]
frame #12: 0x00000001001d0735 mii`caml_garbage_collection at signals_nat.c:110:5 [opt]
frame #13: 0x00000001001f91e3 mii`caml_call_gc + 231
```

I haven’t been able to reproduce the error in a small program. I’ve only seen it while bootstrapping our compiler, which is about 250,000 lines of generated OCaml code. I understand this makes it hard to reason about what the error could be, but I wanted to ask here anyway, in case I’m missing something obvious or if someone recognises a known error.

I discovered the error while implementing a parallel task pool. Reducing the program a little, I found that spawning some domains that wait on a channel while the compiler is running triggers the error. In essence, this is what the generated code does:

```ocaml
let chan = Domainslib.Chan.make_unbounded () in
let tids =
  List.init 10 (fun _ -> Domain.spawn (fun _ -> Domainslib.Chan.recv chan))
in

(* Do compiler stuff here ... *)
(* segfaults while compiler stuff is running, if ever *)

List.iter (fun _ -> Domainslib.Chan.send chan 1) tids;
List.iter (fun tid -> ignore (Domain.join tid)) tids
```

Note that everything else in the compiler is sequential; these are the only domains that are spawned. I also found that the error appears when the spawned domains call Thread.delay instead of waiting on a channel. However, I did not observe the error (in about 100 runs) when the domains computed something heavy (such as fibonacci 48) instead of waiting.
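For reference, the Thread.delay variant was shaped roughly like this (a sketch, not the exact generated code; the delay length is an assumption, and the threads library must be linked):

```ocaml
(* Same shape as the channel example, but each domain sleeps
   instead of blocking on a Domainslib channel. *)
let tids =
  List.init 10 (fun _ ->
    Domain.spawn (fun _ ->
      (* hypothetical delay; any blocking sleep showed the crash *)
      Thread.delay 1.0))

(* Do compiler stuff here ... *)

let () = List.iter Domain.join tids
```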

I’d be thankful for any thoughts you might have on this!

Hi @linnea! Thanks for the error report. Could you please create an issue here with a minimal example and steps to reproduce?

The problem is that I don’t have a minimal example; I haven’t been able to reproduce it in a small program. The only thing I have is 250,000 lines of generated OCaml code, which I could in principle share, but I don’t know whether that would be helpful.

For this kind of GC issue, even the large example might be fine. As long as we’ve got a clear way to build a binary that exhibits the problem, we can probably work backwards from there.

Just created an issue. I actually found a slightly smaller example that segfaults in a different place in the GC. Both examples are attached in the issue.
