Unexpected behaviour with recusive locking (5.0.0+trunk)

Hello everyone!
TLDR: Recursive locking should throw a Sys.error but it doesn’t

I encountered this subtle concurrency bug and am struggling to identify if my reasoning about the programs execution is incorrect or there is something more devious happening under the hood. Apologies if I miss anything strikingly obvious, this is my first issue!

I came across this unexpected behavior when working through @kayceesrk GitHub - kayceesrk/ocaml5-tutorial: A hands-on tutorial on the new parallelism features in OCaml 5 wonderful Multicore tutorial. In particular, there is a task in src/prod_cons_b.ml to implement the pop function for an atomic stack using condition variables.

The concurrent stack implementation along with the test is as follows (The sections with (**) indicate my code):

let n = try int_of_string Sys.argv.(1) with _ -> 10

module Atomic_stack : sig
  type 'a t
  val make : unit -> 'a t
  val push : 'a t -> 'a -> unit
  val pop  : 'a t -> 'a
end = struct
  type 'a t = {
    mutable contents: 'a list;
    mutex : Mutex.t;
    condition : Condition.t
  }

  let make () = {
    contents = [];
    mutex = Mutex.create ();
    condition = Condition.create ()
  }

  let push r v =
    Mutex.lock r.mutex;
    r.contents <- v::r.contents;
    Condition.signal r.condition;
    Mutex.unlock r.mutex

  let rec pop r = 
    Mutex.lock r.mutex;                     (**)
    match r.contents with                   (**)
    | [] ->                                 (**)
      Condition.wait r.condition r.mutex;   (**)
      pop r                                 (**)
    | h :: t ->                             (**)
      r.contents <- t;                      (**)
      Mutex.unlock r.mutex;                 (**)
      h                                     (**)
end
let s = Atomic_stack.make ()

let rec producer n =
  Unix.sleep 1;           (*To ensure we enter the Condition.wait branch*)
  if n = 0 then ()
  else begin
    Atomic_stack.push s n;
    Format.printf "Produced %d\n%!" n;
    producer (n-1)
  end
  
let rec consumer n acc =
  if n = 0 then acc
  else begin
    let v = Atomic_stack.pop s in
    Format.printf "Consumed %d\n%!" v;
    consumer (n-1) (n + acc)
  end
    
let main () =
  let c = Domain.spawn (fun _ -> consumer n 0) in
  let p = Domain.spawn (fun _ -> producer n) in
  Domain.join p;
  assert (Domain.join c = n * (n+1) / 2)

let _ = main ()

Paying attention to the pop function, I realized it’s incorrect because the thread woken up by the condition variable recursively tries to attain the lock which it already holds it. The expected behavior when this happens is that OCaml raises a Sys.error. (ocaml/mutex.mli at aec63fc6f22e5ea95c24f7ee2bd5716532406fa2 · ocaml/ocaml · GitHub). But this doesn’t always happen.

The test case is structured with a producer and consumer that will make (n) calls to pop and push depending on what is provided at the command line. When the program is run with 1 as the argument, we get the expected behaviour

$ dune exec src/prod_cons_b.exe 1
Produced 1                         
Fatal error: exception Sys_error("Mutex.lock: Resource deadlock avoided")

However when I run the program with any value greater than 1, instead of getting the error, my program deadlocks! Spinning forever after printing the following output

$ dune exec src/prod_cons_b.exe 2
Produced 2

The OCaml compiler that I’m using is 5.0.0+trunk and set-up according to KC’s tutorial.

I’ve been told that this is potentially platform dependent. I tested this out on two of my machines and get the same result:
M1 Mac, arm64
Darwin Kernel Version 21.4.0

Fedora 36 x86_64
Kernel version 5.17.6-300

I wonder if others also experience this behavior. And if anyone has insight on why, I’d really appreciate it!

I think the problem might be with your implementation of push.

In particular:

    Condition.signal r.condition;

might probably need to be Condition.broadcast; the reason being that when you push a new value onto the stack, you want to wake up all threads that are waiting on the lock (broadcast), not just one (signal).

I think this could be the cause of the behaviour you see if the Condition.signals from the producers all happen to wake up the same consumer thread maybe?

Hey Kiran!

Thanks so much for taking the time to respond! I tried out your suggestion but still no luck :frowning: I don’t think Condition.broadcast is necessary because we only spawn two threads and only one of them (consumer) is ever waiting on the lock.

Oh, I see; my bad - I didn’t read your question carefully enough. Yes, I guess if there’s only two domains then the suggestion I made wouldn’t apply.

Looking a little closer, I think the reason for the deadlock in the second case is because of how exceptions work across domains.

In particular, looking at the documentation of Domain.join:

In particular, I think you will only see the Sys_error on your main thread when you actually run Domain.join on the domain in which the exception was raised.

So, I think when n > 1, the Sys_error still gets raised in the consumer domain, halting its exception, and preventing it from releasing its lock. This means that Domain.join p can never complete, because the producer is left forever waiting for the mutex to be released, and thus, Domain.join c is never run either, and so the code just hangs without showing the exception.

This issue might be relevant: OCaml 5.0 & unhandled exceptions · Issue #11074 · ocaml/ocaml · GitHub

1 Like

Looking at your code, it seems to me that the behaviour with n = 2 can be explained in that the exception is never raised because you end up in the following situation: the consumer is blocked on Condition.wait in pop, and the producer is blocked on Mutex.lock in push—as the lock is held by the consumer.

Edit: scratch that, the lock is no longer held once the consumer blocks on wait. Never mind me.

1 Like

You were right @Gopiandcode, Great catch!

I tested out your suspicion by changing the implementation of push to eventually exit instead of hanging on the lock if it can’t acquire it and I get the expected error.

let push r v =
  if not @@ Mutex.try_lock r.mutex then () 
  else 
  (r.contents <- v::r.contents;
  Condition.broadcast r.condition;
  Mutex.unlock r.mutex)

and the result is

$ dune exec src/prod_cons_b.exe 2
Produced 2                         
Produced 1
Fatal error: exception Sys_error("Mutex.lock: Resource deadlock avoided")

Thank you so much for your input!

Out of curiosity, I’m wondering what’s the benefit of waiting for Domain.join to raise an error instead of immediately terminating all domains program. Or is this just a limitation of the design?

1 Like