C bindings under OCaml 5: protecting data structures

I was just reading the thread on the OCaml LLVM bindings and it reminded me of a question that I had for the Sundials/ML library, which provides an OCaml interface to the Sundials C library of numeric solvers.

Like many other libraries, this one is built around a “session” value, which is created, used in calls to the library, and then finally freed. The C session object is wrapped by an OCaml session value. In C, the programmer is responsible for ensuring that the session object is not manipulated concurrently. Since OCaml is a managed language, one might expect it to protect irresponsible programmers from themselves. This would mean that an OCaml/C binding should ensure atomic access to the session value and underlying object (maybe a bit like the synchronized methods of Java…).
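
For concreteness, here is a rough sketch of what I mean; the names (c_call, solve) are made up for illustration and are not the actual Sundials/ML API:

(* Sketch only: serialize every call on a session behind a mutex,
   roughly like a Java synchronized method. *)

(* Placeholder standing in for the external call into the C library. *)
let c_call (_raw : nativeint) (_t : float) = ()

type session = {
  raw : nativeint;  (* handle to the underlying C session object *)
  lock : Mutex.t;   (* serializes all accesses to [raw] *)
}

let solve s t =
  Mutex.lock s.lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock s.lock)
    (fun () -> c_call s.raw t)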

How are other library authors addressing this question?

Tim.


We need more info for more specific advice (e.g. can the session value be updated by foreign or OCaml calls?). But indeed you need to make sure that an OCaml data race cannot create memory-unsafety on the C side.

One option would be to use a per-domain ‘session’ object, e.g. with Domain.DLS. Then OCaml’s runtime lock should protect against concurrent accesses from different threads that are part of the same domain, as long as your C binding doesn’t release the runtime lock.
However that’ll also prevent other threads (in the same domain) from running OCaml code, so it isn’t very nice.
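
For concreteness, a rough sketch of what the per-domain option could look like; session and create_session are placeholders for whatever your binding actually provides:

(* Sketch of the per-domain option.  [session] and [create_session] are
   placeholders for the binding's real session type and constructor. *)
type session = unit
let create_session () : session = ()

(* Each domain lazily gets its own session, so two domains never share the
   same C object; the runtime lock then protects it, as long as the C stub
   keeps holding that lock. *)
let session_key : session Domain.DLS.key =
  Domain.DLS.new_key create_session

let with_session f = f (Domain.DLS.get session_key)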

Another option might be to use per-thread state, but there isn’t anything built-in to help with that.
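
A hand-rolled sketch of what per-thread state could look like, reusing the session/create_session placeholders from the sketch above (note that nothing here cleans up an entry when its thread exits):

(* One session per system thread: a table from thread id to session,
   with only the table lookup itself synchronized. *)
let sessions : (int, session) Hashtbl.t = Hashtbl.create 16
let sessions_lock = Mutex.create ()

let my_session () =
  let id = Thread.id (Thread.self ()) in
  Mutex.lock sessions_lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock sessions_lock) (fun () ->
    match Hashtbl.find_opt sessions id with
    | Some s -> s
    | None ->
      let s = create_session () in
      Hashtbl.add sessions id s;
      s)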

However both of these could cause some confusion on the OCaml side: what if you create the session in one thread and want to use it in another (as long as you’re only using it from one thread at a time)?

If you consider accessing a session object from multiple threads a programming error, then just detecting it might suffice. A mutex (on either the OCaml or the C side) can do that if you use a try_lock: when the program behaves correctly the mutex is uncontended and fast to acquire and release (in theory entirely in userspace), and the only time it would get contended is exactly when you’d raise an exception on a try_lock failure.
If the (tiny) potential overhead of that is a worry, then this can even be done in a way that a user can turn off, by wrapping it in an assert (compile with -noassert and both the try_lock and the unlock are gone; the only thing that remains is the mutex allocation):

let session_check_mutex = Mutex.create ()

let foo () =
  assert (Mutex.try_lock session_check_mutex);
  (* ... call external ... *)
  assert (Mutex.unlock session_check_mutex; true)

Here is a program to measure the overhead of the 2 approaches (mutex vs atomic) using N domains each with its own uncontended mutex/atomic:

(* Stand-in for the real work done while "holding" the session. *)
let perform _ = ()

let session_mutex_create () = Mutex.create ()
let session_mutex_trylock m = Mutex.try_lock m
let session_mutex_unlock m = Mutex.unlock m; true

let loop_mutex n =
  let t = session_mutex_create () in
  for i = 1 to n do
    assert (session_mutex_trylock t);
    Sys.opaque_identity (perform t);
    assert (session_mutex_unlock t)
  done

let session_atomic_create () = Atomic.make false
let session_atomic_trylock a = Atomic.compare_and_set a false true
let session_atomic_unlock a = Atomic.compare_and_set a true false

let loop_atomic n =
  let t = session_atomic_create () in
  for i = 1 to n do
    assert (session_atomic_trylock t);
    Sys.opaque_identity (perform t);
    assert (session_atomic_unlock t)
  done

(* Baseline: the same loop with no checking at all. *)
let loop_noop n =
  let t = () in
  for i = 1 to n do
    Sys.opaque_identity (perform t)
  done

(* Run [f n] on one domain per recommended core, each with its own state. *)
let run_domains f n =
  let domains =
    Array.init (Domain.recommended_domain_count ())
      (fun _ -> Domain.spawn @@ fun () -> f n)
  in
  Array.iter Domain.join domains

let () =
  Arg.parse
    [ ("-a", Arg.Int (run_domains loop_atomic), "loop atomic")
    ; ("-m", Arg.Int (run_domains loop_mutex), "loop mutex")
    ; ("-n", Arg.Int (run_domains loop_noop), "loop noop")
    ] ignore
  @@ Printf.sprintf "%s [-a <n> | -m <n> | -n <n>]" Sys.executable_name

ocamlfind ocamlopt -package threads.posix -I +threads -linkpkg asserts.ml -o asserts -thread
ocamlfind ocamlopt -package threads.posix -I +threads -linkpkg asserts.ml -o noasserts -thread -noassert
hyperfine './asserts -m 100000000' './asserts -a 100000000' './asserts -n 100000000' './noasserts -m 100000000' './noasserts -a 100000000'

Measured on an AMD Ryzen 9 7950X (16 cores / 32 threads) after disabling turbo boost and idle states and selecting the performance frequency governor:

sudo sh -c 'echo 0 >/sys/devices/system/cpu/cpufreq/boost'
sudo cpupower idle-set -D 0
sudo cpupower frequency-set -g performance

Results (where noop is not doing any of this):

Benchmark 1: ./asserts -m 100000000
  Time (mean ± σ):      1.622 s ±  0.034 s    [User: 50.053 s, System: 0.016 s]
  Range (min … max):    1.589 s …  1.705 s    10 runs
 
Benchmark 2: ./asserts -a 100000000
  Time (mean ± σ):     789.4 ms ±   5.4 ms    [User: 24494.0 ms, System: 10.8 ms]
  Range (min … max):   781.8 ms … 797.6 ms    10 runs
 
Benchmark 3: ./asserts -n 100000000
  Time (mean ± σ):      55.4 ms ±   7.6 ms    [User: 1547.7 ms, System: 9.5 ms]
  Range (min … max):    48.6 ms …  75.2 ms    42 runs
 
Benchmark 4: ./noasserts -m 100000000
  Time (mean ± σ):      53.7 ms ±   5.3 ms    [User: 1532.4 ms, System: 8.1 ms]
  Range (min … max):    48.1 ms …  66.0 ms    52 runs
 
Benchmark 5: ./noasserts -a 100000000
  Time (mean ± σ):      54.0 ms ±   5.9 ms    [User: 1543.1 ms, System: 9.7 ms]
  Range (min … max):    48.3 ms …  71.5 ms    44 runs
 
Summary
  './noasserts -m 100000000' ran
    1.01 ± 0.15 times faster than './noasserts -a 100000000'
    1.03 ± 0.18 times faster than './asserts -n 100000000'
   14.71 ± 1.46 times faster than './asserts -a 100000000'
   30.23 ± 3.06 times faster than './asserts -m 100000000'

As you can see, with noasserts you get the “same” results as not having any of this checking code at all, and atomics run about 2x as fast as mutexes (both result in function calls and are not inlined).

On the other hand, the overhead is quite small: ~8 ns per check with the atomics, which might be worth accepting if it saves you hours of debugging a race condition or memory corruption.
