To close out some loose threads from earlier:
- Running OCaml 4.11.1 with the signal and mutex invariant-checking patches cherry-picked from the OCaml 4.12 beta didn’t solve my deadlock (or throw any new exceptions).
- My app running on Ubuntu 18.04 with the libc libraries loaded from Ubuntu 16.04 (libc 2.23) has not deadlocked yet, after about 2 days.
I can confirm that the repro in that glibc bug report deadlocks for me in all of the places where my OCaml application was deadlocking.
The repro deadlocks on:
- Ubuntu 20.04 (libc 2.31); took only a few seconds on one run, then 10 minutes on another
- Ubuntu 18.04 (libc 2.27); took about 20 minutes
- Debian 10/buster (libc 2.28); took a few hours (slower box, fewer cores)
I’ve now patched glibc on some boxes with the one-line fix.
Repro has not deadlocked on:
- Ubuntu 20.04 (libc 2.31+patch): about a day
- Ubuntu 18.04 (libc 2.27+patch): about a day
There is also a stock Ubuntu 16.04 (libc 2.23) box that has been running the repro for 2 days without deadlocking, which is expected since the pthread bug was likely introduced in 2.27.
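For anyone who wants to check which glibc a given process is actually running against (useful when the libc has been swapped out, as in my 18.04-with-2.23 setup), here is a minimal Python sketch. The helper names and the `FIRST_SUSPECT` constant are my own, and the 2.27 cutoff is only the "likely introduced in" guess from above, not something confirmed. Note that a patched glibc still reports the same version number, so this cannot tell a patched 2.27 or 2.31 from a stock one.

```python
import os

# Assumption from this thread: the bug was "likely introduced in 2.27".
FIRST_SUSPECT = (2, 27)


def parse_glibc_version(text):
    """Turn a string such as 'glibc 2.31' into a tuple such as (2, 31)."""
    number = text.split()[-1]
    major, minor = number.split(".")[:2]
    return int(major), int(minor)


def running_glibc_version():
    """Return the glibc version of this process, or None if it is unavailable."""
    try:
        text = os.confstr("CS_GNU_LIBC_VERSION")
    except (AttributeError, ValueError, OSError):
        # Not a glibc system, or confstr is not supported on this platform.
        return None
    return parse_glibc_version(text) if text else None


def in_suspect_range(version):
    """True if the version is at or above the assumed first affected release."""
    return version >= FIRST_SUSPECT


if __name__ == "__main__":
    version = running_glibc_version()
    if version is None:
        print("could not determine glibc version")
    else:
        label = "suspect range" if in_suspect_range(version) else "older than suspect range"
        print("glibc %d.%d: %s" % (version[0], version[1], label))
```

This reports the libc of the Python process itself, so it only reflects the app's libc if Python is started with the same library setup as the app.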
Still testing my OCaml apps.
I’m not sure what they would do, though, looking around the source .deb, there is some precedent for Ubuntu stacking their own patches atop the official glibc distribution. Ubuntu does not seem to list this bug in their glibc bug tracker. Let me prepare a very informative report for them and see if they’ll consider slipping the patch in earlier.