Hello everyone!
TLDR: Recursive locking should throw a Sys.error but it doesn’t
I encountered this subtle concurrency bug and am struggling to identify if my reasoning about the programs execution is incorrect or there is something more devious happening under the hood. Apologies if I miss anything strikingly obvious, this is my first issue!
I came across this unexpected behavior when working through @kayceesrk GitHub - kayceesrk/ocaml5-tutorial: A hands-on tutorial on the new parallelism features in OCaml 5 wonderful Multicore tutorial. In particular, there is a task in src/prod_cons_b.ml
to implement the pop
function for an atomic stack using condition variables.
The concurrent stack implementation along with the test is as follows (The sections with (**) indicate my code):
let n = try int_of_string Sys.argv.(1) with _ -> 10
module Atomic_stack : sig
type 'a t
val make : unit -> 'a t
val push : 'a t -> 'a -> unit
val pop : 'a t -> 'a
end = struct
type 'a t = {
mutable contents: 'a list;
mutex : Mutex.t;
condition : Condition.t
}
let make () = {
contents = [];
mutex = Mutex.create ();
condition = Condition.create ()
}
let push r v =
Mutex.lock r.mutex;
r.contents <- v::r.contents;
Condition.signal r.condition;
Mutex.unlock r.mutex
let rec pop r =
Mutex.lock r.mutex; (**)
match r.contents with (**)
| [] -> (**)
Condition.wait r.condition r.mutex; (**)
pop r (**)
| h :: t -> (**)
r.contents <- t; (**)
Mutex.unlock r.mutex; (**)
h (**)
end
let s = Atomic_stack.make ()
let rec producer n =
Unix.sleep 1; (*To ensure we enter the Condition.wait branch*)
if n = 0 then ()
else begin
Atomic_stack.push s n;
Format.printf "Produced %d\n%!" n;
producer (n-1)
end
let rec consumer n acc =
if n = 0 then acc
else begin
let v = Atomic_stack.pop s in
Format.printf "Consumed %d\n%!" v;
consumer (n-1) (n + acc)
end
let main () =
let c = Domain.spawn (fun _ -> consumer n 0) in
let p = Domain.spawn (fun _ -> producer n) in
Domain.join p;
assert (Domain.join c = n * (n+1) / 2)
let _ = main ()
Paying attention to the pop
function, I realized it’s incorrect because the thread woken up by the condition variable recursively tries to attain the lock which it already holds it. The expected behavior when this happens is that OCaml raises a Sys.error. (ocaml/mutex.mli at aec63fc6f22e5ea95c24f7ee2bd5716532406fa2 · ocaml/ocaml · GitHub). But this doesn’t always happen.
The test case is structured with a producer and consumer that will make (n) calls to pop
and push
depending on what is provided at the command line. When the program is run with 1
as the argument, we get the expected behaviour
$ dune exec src/prod_cons_b.exe 1
Produced 1
Fatal error: exception Sys_error("Mutex.lock: Resource deadlock avoided")
However when I run the program with any value greater than 1, instead of getting the error, my program deadlocks! Spinning forever after printing the following output
$ dune exec src/prod_cons_b.exe 2
Produced 2
The OCaml compiler that I’m using is 5.0.0+trunk and set-up according to KC’s tutorial.
I’ve been told that this is potentially platform dependent. I tested this out on two of my machines and get the same result:
M1 Mac, arm64
Darwin Kernel Version 21.4.0
Fedora 36 x86_64
Kernel version 5.17.6-300
I wonder if others also experience this behavior. And if anyone has insight on why, I’d really appreciate it!