I’m seeing a deadlock in an Async app reported here that, so far, appears to happen only on Broadwell architectures. I think the problem is in the management of the OCaml runtime lock.
A few years ago there was a well-publicized kernel bug that affected Haswell, but it was patched and shouldn’t be relevant anymore.
I’m seeing this bug on Broadwell platforms with libc 2.27, on kernel 4.15.0-55 and also kernel 5.4.0-48 (Ubuntu 18.04), using OCaml 4.08.1 flambda.
Not seeing it on Skylake platforms.
My app doesn’t have any C code or use locks directly, so I’m wondering whether the problem is somewhere in the ecosystem or somewhere in libc/Linux.
This is hard to iterate on because the deadlock is infrequent (once every few days) and attaching gdb only yields limited information.
Shot in the dark, but has anyone run into something similar?
EDIT1: I see signal handling was very recently redone due to deadlock concerns. Trying
EDIT2: scratch that, happens on a Skylake box too