Test caml_state and conditionally caml_acquire_runtime_system - good or bad?

I’m adding OCaml 5 support to an existing project. The project is just some OCaml bindings and therefore fairly easy to deal with, but it also has callbacks. In other words:

OCaml -> C binding -> callback wrapper -> OCaml

Because most of the C code is long-running, we release the runtime lock around all C calls. When we reach a callback wrapper, we reacquire the lock around the OCaml code. This part all works fine.

However there is a C atexit function which does some cleanup on process exit. Unfortunately during the exit path, callbacks can also be invoked. The path is:

atexit (runtime lock held) -> callback wrapper -> OCaml

In the callback wrapper we’re attempting to acquire the runtime lock again, resulting in the infamous error:

Fatal error: Fatal error during lock: Resource deadlock avoided

A way to fix this (for OCaml 5 only) is to change the callback wrapper to do:

bool not_acquired = (caml_state == NULL);
if (not_acquired) caml_acquire_runtime_system ();
caml_callback (...);
if (not_acquired) caml_release_runtime_system ();

This works, but the question is, is it a good idea? Is there a better way to deal with this case?

I don’t know the answer to your question, but am I right that this used to work in OCaml 4.x? (That is, it was possible to acquire the runtime lock more than once.)

If yes, shouldn’t this be considered a bug in OCaml 5, given that one of its guiding principles is that FFI code that used to work in OCaml 4.x should continue to work in OCaml 5?

Cheers,
Nicolas

Acquiring a lock twice was undefined behavior for POSIX threads in OCaml 4. OCaml 5 merely switched the mutex to error-checking mode.
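To illustrate the difference, here is a minimal standalone sketch using plain POSIX threads, not the OCaml runtime itself. It assumes the runtime lock behaves like a PTHREAD_MUTEX_ERRORCHECK mutex, as described above; the function name second_lock_result is made up for this example. With an error-checking mutex, the second lock attempt from the same thread returns EDEADLK instead of invoking undefined behavior, and glibc renders that error as "Resource deadlock avoided".

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: lock an error-checking mutex twice from the calling
   thread and return the error code of the second attempt. */
static int second_lock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&m);   /* first acquisition succeeds */
    assert(rc == 0);
    rc = pthread_mutex_lock(&m);   /* relock by the owner is detected */

    /* The mutex is still held exactly once, so one unlock releases it. */
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

With a PTHREAD_MUTEX_DEFAULT mutex the same relock is undefined, which is why code that appeared to work in OCaml 4 was never guaranteed to.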

Is it documented that it is safe to acquire the runtime system lock more than once in OCaml 4.x?

I don’t know of any documentation regarding this point, but I remember seeing something very similar to what is reported by @rwmjones in production code.

Cheers,
Nicolas

I have implemented Caml_state_opt for this purpose in OCaml 5.0, due to a similar problem. Per the manual:

The macro Caml_state evaluates to the domain state variable, and checks in debug mode that the domain lock is held. Such a check is also placed in normal mode at key entry points of the C API; this is why calling some of the runtime functions and macros without correctly owning the domain lock can result in a fatal error: no domain lock held. The variant Caml_state_opt does not perform any check but evaluates to NULL when the domain lock is not held. This lets you determine whether a thread belonging to a domain currently holds its domain lock, for various purposes.

For OCaml 4 I use runtime hooks to implement a similar check: see boxroot/ocaml_hooks.c · main · ocaml-rust / ocaml-boxroot · GitLab.

It was definitely unsafe, and it was in fact part of the motivation for letting caml_state be NULL when the domain lock is not held, and for adding Caml_state_opt so that users can check for this.

Lastly,

bool not_acquired = (caml_state == NULL);

works, but only on platforms where caml_state can be accessed directly (in which case it is equal to Caml_state_opt). Prefer Caml_state_opt for portability.
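The conditional-acquire pattern can be sketched as a self-contained model. Everything prefixed with model_ is a stand-in invented for this example and is not part of the OCaml C API: model_state_opt plays the role of Caml_state_opt (NULL unless the current thread holds the lock), and model_acquire / model_release play the roles of caml_acquire_runtime_system / caml_release_runtime_system.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the runtime lock (hypothetical, not the OCaml runtime). */
static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;
static _Thread_local int model_state_storage;
/* Stand-in for Caml_state_opt: NULL unless this thread holds the lock. */
static _Thread_local int *model_state_opt = NULL;

static void model_acquire(void)
{
    pthread_mutex_lock(&model_lock);
    model_state_opt = &model_state_storage;
}

static void model_release(void)
{
    assert(model_state_opt != NULL);   /* releasing requires ownership */
    model_state_opt = NULL;
    pthread_mutex_unlock(&model_lock);
}

/* Example callback: doubles its argument, or returns -1 if it is run
   without the lock held. */
static int model_double(int x)
{
    return model_state_opt != NULL ? 2 * x : -1;
}

/* Callback wrapper: take the lock only if this thread does not already
   hold it, and release it only if it was taken here. */
static int model_callback_wrapper(int (*cb)(int), int arg)
{
    bool acquired_here = (model_state_opt == NULL);
    int result;

    if (acquired_here) model_acquire();
    result = cb(arg);
    if (acquired_here) model_release();
    return result;
}
```

The wrapper leaves the lock in the state it found it, so it behaves the same whether it is entered from long-running C code (lock released) or from an atexit handler (lock already held).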

Yes, it “worked” in OCaml 4 but, as several have pointed out, the code was likely wrong.

I will try out Caml_state_opt now.

Yes, that works great. Upstream fix: ocaml: Use Caml_state_opt in preference to caml_state · libguestfs/libguestfs@cade0b1 · GitHub