Is there a known recent Linux locking bug that affects the OCaml runtime?

I’m seeing a deadlock in an Async app (reported here) that so far appears to happen only on Broadwell architectures. I think the problem is in the management of the OCaml runtime lock.

A few years ago there was a well-publicized kernel bug affecting Haswell, but it was patched and shouldn’t be relevant anymore.

I’m seeing this bug on Broadwell platforms with libc 2.27, on kernel 4.15.0-55 and also kernel 5.4.0-48 (Ubuntu 18.04), using OCaml 4.08.1 flambda.

Not seeing it on Skylake platforms.

My app doesn’t have any C code or use locks directly so I’m wondering if this is either somewhere in the ecosystem or somewhere in libc/Linux.

This is hard to iterate on because the deadlock is infrequent (once every few days) and attaching gdb only yields limited information.

Shot in the dark, but has anyone run into something similar?

EDIT1: I see signal handling was very recently redone due to deadlock concerns. Trying 4.11.1+flambda
EDIT2: scratch that, happens on a Skylake box too


I see signal handling was very recently redone due to deadlock concerns. Trying 4.11.1+flambda

The new signal handling code (https://github.com/ocaml/ocaml/pull/9722) is so new it hasn’t been released yet! It will be in release 4.12.

There’s also a PR in progress (ocaml/ocaml Pull Request #9846 by xavierleroy, “Use ‘error checking’ mutexes in the threads library”) to add additional checks on the way mutexes are used.

During the discussion of that PR, it was mentioned that Glibc / NPTL uses (or tried to use) Hardware Lock Elision on Intel processors. HLE has a checkered history of not working as intended and being disabled a posteriori by BIOS updates. Could your Broadwell processor be affected?


Whoops!

Well, I can say that updating to 4.11.1 didn’t magically cure this problem.

Interesting, though glibc 2.27 changes HLE to be disabled by default. (You need to start the program with a GLIBC_TUNABLES environment variable configured to opt in, which isn’t being done here.)

:thinking:

So, when I attach to it in gdb the presentation is always the same.

A number of threads are waiting to acquire the master lock. One of them is the Async scheduler, so the whole program hangs (aside from the pure-C OCaml tick thread). One thread, however, is blocked in libc read on stdin. When I hit Enter, the read finishes and the whole program unhangs.

In gdb, if you step through the thread blocked on libc read, the read completes normally and then it appears to acquire the master lock with no trouble to leave the blocking section, even though all of the other threads are hung waiting to acquire it. Once the C stub completes, all of the other threads can run.

This… seems like master lock corruption? I’m not sure how you can get into this state through simple C-stub errors in the ecosystem. Calling enter blocking section twice from the same thread appears to just hang forever. A second release runtime lock appears to simply be ignored?

Weirder still, it has so far presented only on this one Intel architecture, Broadwell, though this isn’t a rigorous finding (a handful of other Skylake and Cascade Lake machines running this same program don’t exhibit the deadlock).

I’d obviously like to just blame the Broadwell architecture and replace those machines and say mission accomplished, but I suspect there may be a software bug somewhere and that architecture is just re-arranging computations a bit differently so it trips the bug.

Is there a smart way to debug this? I’ll next try 4.12 from trunk to rule out this being fixed already by the revamped signal handling. Failing that, all I can think of is adding to the master lock code a bunch of debugging fprintfs to stderr and more invariant checking in the hope it signals where in the software stack the bug may lie.

Actually, this deadlock finally happened on a Skylake box. Removing Broadwell from the topic.

Now, the biggest difference between where it deadlocks and where it doesn’t is the Ubuntu release. The deadlocks happen on Ubuntu 18.04, and have never been seen on Ubuntu 16.04. It’s the same binary run on either platform. Building the binary on either platform doesn’t change the outcome.

Example of a box where it doesn’t deadlock is Ubuntu 16.04 running kernel 4.4.0 and libc-2.23.

Next things to try

  • Ubuntu 20.04
  • Binary built with an OCaml compiler patched with the new signal handling and mutex invariant checking, on 18.04
  • Ubuntu 18.04 libc with the binary on Ubuntu 16.04 (if possible)
  • Vice versa on the above (if possible)

Instructions for forcing libc from another location are here.

EDIT: I would post this as a bug report somewhere, though I’m not sure what layer of the stack the bug is in so I’m wondering publicly here for now. LMK if you think it should be somewhere else :smile:


Yes, calling enter_blocking_section twice in the same thread is a deadlock and calling leave_blocking_section twice is like calling it once.

One implausible explanation is that a pthread_cond_signal was ignored. Assume that when your thread that does read() did enter_blocking_section, the masterlock was released (busy is set to 0) but the is_free condition was not signaled properly. Then, all other threads that are waiting on the master lock keep waiting on the is_free condition. The read blocks, then unblocks when you press “return”, then does leave_blocking_section, reacquiring the master lock. The next time it does enter_blocking_section, the is_free condition is properly signaled and other threads get to run.

That’s the least bizarre scenario I can come up with.
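
To make the scenario concrete, here is a minimal sketch of a master lock built the way the paragraph above describes it: a `busy` flag protected by a mutex, plus an `is_free` condition variable. The type and function names (`masterlock`, `masterlock_acquire`, ...) are illustrative assumptions; this is modelled on the description in this post, not copied from the runtime’s source.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical master lock: `busy` says whether some thread holds it,
   `is_free` is what the waiting threads block on. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t is_free;
  int busy;
} masterlock;

static void masterlock_init(masterlock *m)
{
  int rc = pthread_mutex_init(&m->lock, NULL);
  assert(rc == 0);
  rc = pthread_cond_init(&m->is_free, NULL);
  assert(rc == 0);
  (void) rc;
  m->busy = 0;
}

static void masterlock_acquire(masterlock *m)
{
  pthread_mutex_lock(&m->lock);
  /* Re-check `busy` after every wake-up: a spurious wake-up is harmless,
     and only a thread that sees busy == 0 gets to proceed. */
  while (m->busy)
    pthread_cond_wait(&m->is_free, &m->lock);
  m->busy = 1;
  pthread_mutex_unlock(&m->lock);
}

static void masterlock_release(masterlock *m)
{
  pthread_mutex_lock(&m->lock);
  m->busy = 0;
  pthread_mutex_unlock(&m->lock);
  /* If this signal is lost, waiters stay asleep even though busy == 0. */
  pthread_cond_signal(&m->is_free);
}
```

In this sketch a waiter only re-examines `busy` when it is woken, so a dropped `pthread_cond_signal` leaves every waiter parked in `pthread_cond_wait` while the lock is in fact free. The thread returning from `read()` finds `busy == 0` and takes the lock without waiting, and the next release wakes a waiter normally, which matches what was seen in gdb.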


Dude.

https://sourceware.org/bugzilla/show_bug.cgi?id=25847


Sometimes, implausible explanations are just right :slight_smile:

I see OCaml is in good company here, with C#, Python, and who knows how many other runtime systems with master locks.


It’s never the hardware, or the OS, or the library. Except when it is. :man_facepalming:

So, what does this mean for the OCaml community? The patch hasn’t even been accepted into glibc yet but even after it is, it’s going to take a long time to trickle down to users.

Is the solution to fix the compiler to detect affected glibcs and use a different master locking mechanism (:scream:)?

Push OCaml users to install a glibc fork? SIOUs[1] will probably do it, but not very user friendly for the mainstream.

Maybe this conversation is better off moved to a GitHub issue?

  1. Serious Industrial OCaml Users

Isn’t this a major issue for just about every language, except those that don’t rely on pthread_cond_signal? Subtle, weird lock bugs that occur randomly? Of course, it’ll only affect high performance code, but it seems like a huge issue.


What would Fedora and Ubuntu do? Do they know about the issue? Is there a way to raise awareness there?

Is the solution to fix the compiler to detect affected glibcs and use a different master locking mechanism (:scream:)?

It is possible to use a plain POSIX mutex as master lock, but fairness is awful.
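
As a rough illustration of the plain-mutex alternative (a sketch only, with made-up names such as `plain_masterlock`; not what the runtime does), the master lock collapses to a single mutex with no condition variable involved:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical master lock that is nothing but a POSIX mutex. */
typedef struct {
  pthread_mutex_t lock;
} plain_masterlock;

static void plain_masterlock_init(plain_masterlock *m)
{
  int rc = pthread_mutex_init(&m->lock, NULL);
  assert(rc == 0);
  (void) rc;
}

static void plain_masterlock_acquire(plain_masterlock *m)
{
  pthread_mutex_lock(&m->lock);
}

/* Non-blocking variant: returns 1 if the lock was taken, 0 if it is held. */
static int plain_masterlock_try_acquire(plain_masterlock *m)
{
  return pthread_mutex_trylock(&m->lock) == 0;
}

static void plain_masterlock_release(plain_masterlock *m)
{
  pthread_mutex_unlock(&m->lock);
}
```

The fairness problem comes from the fact that nothing hands the mutex over to a waiter: a thread that releases and immediately re-acquires will often win the race against threads that still have to be woken up, so the waiters can be starved for long stretches.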

Push OCaml users to install a glibc fork? SIOUs[1] will probably do it, but not very user friendly for the mainstream.

Some SIOUs are fond of Musl instead of Glibc.


It is a major issue for any code that uses POSIX threads, as pthread_cond_signal is about the 4th most used function in that API. The bug happens to show up in the run-time systems of several high-level languages.

Subtle, weird lock bugs that occur randomly?

In low-level programming with shared-memory concurrency, most bugs are like that (subtle, weird, and occurring randomly)…


To close out some loose threads from earlier:

  • Running OCaml 4.11.1 with the signal handling and mutex invariant checking patches cherry-picked from the OCaml 4.12 beta didn’t solve my deadlock (or throw new exceptions)
  • My app running on Ubuntu 18.04 with libc loaded from Ubuntu 16.04 (libc 2.23) has not deadlocked yet, after about 2 days now

I confirm the repro in that glibc bug report deadlocks for me in all of the places my OCaml application was deadlocking.

The repro deadlocks on:

  • Ubuntu 20.04 (libc 2.31); only took a few seconds, and then 10 minutes
  • Ubuntu 18.04 (libc 2.27); took about 20 minutes
  • Debian 10/buster (libc 2.28); took a few hours (slower box, fewer cores)

I’ve now patched glibc on some boxes with the one-line fix.

Repro has not deadlocked on:

  • Ubuntu 20.04 (libc 2.31+patch): about a day
  • Ubuntu 18.04 (libc 2.27+patch): about a day

There is also a stock Ubuntu 16.04 (libc 2.23) box that has been running the repro 2 days without deadlocking, which is expected since the pthread bug was likely introduced in 2.27.
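
Since the same binary behaves differently depending on which libc it is loaded against, it can help to have the program report the glibc it is actually running with. The sketch below uses glibc’s `gnu_get_libc_version()`; the helper names are hypothetical, and the 2.27 threshold in the usage note is only the estimate given above, not a confirmed boundary. A distribution-patched glibc keeps the same version number, so this can flag "possibly affected" at best.

```c
#include <assert.h>
#include <stdio.h>
#ifdef __GLIBC__
#include <gnu/libc-version.h>  /* gnu_get_libc_version() */
#endif

/* Parse a "major.minor" version string. Returns 1 on success, 0 otherwise. */
static int parse_version(const char *s, int *major, int *minor)
{
  assert(s != NULL && major != NULL && minor != NULL);
  return sscanf(s, "%d.%d", major, minor) == 2;
}

static int version_at_least(int major, int minor,
                            int want_major, int want_minor)
{
  return major > want_major ||
         (major == want_major && minor >= want_minor);
}

/* 1 if the process is running against glibc >= want_major.want_minor;
   0 if it is older, unparsable, or not glibc at all (e.g. musl). */
static int running_glibc_at_least(int want_major, int want_minor)
{
#ifdef __GLIBC__
  int major, minor;
  if (!parse_version(gnu_get_libc_version(), &major, &minor)) return 0;
  return version_at_least(major, minor, want_major, want_minor);
#else
  (void) want_major;
  (void) want_minor;
  return 0;
#endif
}
```

Calling `running_glibc_at_least(2, 27)` at startup and logging the result is enough to tell the 16.04 and 18.04 cases above apart. The check happens at run time on purpose, because the libc a binary was built against need not be the one it ends up running with.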

Still testing my OCaml apps.

I’m not sure what they would do, though looking around the source .deb, there is some precedent for Ubuntu stacking their own patches atop the official glibc distribution. Ubuntu does not seem to list this bug in their glibc bug tracker. Let me prepare a very informative report for them and see if they’ll consider slipping the patch in earlier.


Thanks for the update and the extensive testing.

The “one-line fix” is a nice example of why spurious wake-ups are useful!

The bug report at sourceware.org is sadly and surprisingly inactive (I’ve seen much quicker reactions from the Glibc/NPTL people in the past), so going through Debian, Ubuntu or Fedora might be a way to speed up the handling of this awful bug.

I confirm that there is a way to reimplement OCaml’s masterlock without using condition variables. However, other uses of condition variables (through the Condition OCaml library module) will still be broken. Plus, it’s hard to detect at configuration time which implementation of the masterlock to use.


I’ve reported this issue, and the results of substantial testing, to the Ubuntu project here: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1899800

Red Hat/CentOS users may wish to try the glibc patch with repro and provide a similar report to the Fedora project.

I’m sure @rwmjones would be interested (but I’m unsure how pings work for him on this forum).

I’ve reported this issue to the Fedora project as well.


There is a nice analysis of the bug here: https://probablydance.com/2020/10/31/using-tla-in-the-real-world-to-understand-a-glibc-bug/


I was thinking about writing some code that uses condition variables, and checked the status of this bug again. There has been some (slow) progress:

And of course OCaml 5 has been released, which reduces the reliance on the runtime lock, but the underlying problem still needs to be fixed for code that uses condition variables (a common synchronisation primitive).

(we at XenServer are not yet affected by this, because we are still on the old glibc 2.17, but we won’t be on that version forever)


Short answer: yes, still not fixed upstream. If you’re affected, then you have to patch locally.