Dune 3.15.3 + OCaml 5.2 data race = false positive?

I was trying to build dune-configurator-3.15.3 using an OCaml 5.2+tsan switch.
After working around a TSAN bug (c - FATAL: ThreadSanitizer: unexpected memory mapping when running on Linux Kernels 6.6+ - Stack Overflow had to set ASLR bits to 28) I got this error report from inside the OCaml runtime:

==================
WARNING: ThreadSanitizer: data race (pid=1422866)
  Atomic write of size 8 at 0x7f248fdff518 by main thread (mutexes: write M0):
    #0 oldify_one runtime/minor_gc.c:250 (dune+0xf72325) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #1 caml_do_local_roots runtime/roots.c:64 (dune+0xf77476) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #2 caml_thread_scan_roots /home/edvint/.opam/5.2+tsan/.opam-switch/build/ocaml-variants.5.2.0+options/otherlibs/systhreads/st_stubs.c:179 (dune+0xf2370d) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #3 caml_empty_minor_heap_promote runtime/minor_gc.c:615 (dune+0xf735c0) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #4 caml_stw_empty_minor_heap_no_major_slice runtime/minor_gc.c:746 (dune+0xf739e5) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #5 caml_stw_empty_minor_heap runtime/minor_gc.c:777 (dune+0xf73c92) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #6 caml_try_run_on_all_domains_with_spin_work runtime/domain.c:1576 (dune+0xf4f533) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #7 caml_try_empty_minor_heap_on_all_domains runtime/minor_gc.c:808 (dune+0xf73dca) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #8 caml_empty_minor_heaps_once runtime/minor_gc.c:829 (dune+0xf73dca)
    #9 caml_poll_gc_work runtime/domain.c:1745 (dune+0xf4ee10) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #10 caml_handle_gc_interrupt runtime/domain.c:1778 (dune+0xf4fba5) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #11 caml_do_pending_actions_exn runtime/signals.c:341 (dune+0xf7cc34) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #12 caml_alloc_small_dispatch runtime/minor_gc.c:851 (dune+0xf73e4c) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #13 caml_garbage_collection runtime/signals_nat.c:84 (dune+0xf89e10) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #14 caml_call_gc <null> (dune+0xf87ad1) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #15 camlDune_rules__Scope.fun_5068 src/dune_rules/scope.ml:105 (dune+0x898582) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #16 camlDune_rules__Lib.resolve_name_6800 src/dune_rules/lib.ml:1135 (dune+0x8b0873) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #17 camlDune_rules__Lib.find_even_when_hidden_7883 src/dune_rules/lib.ml:1871 (dune+0x8bb2d6) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #18 camlDune_rules__Lib.get_compile_info_7913 src/dune_rules/lib.ml:1901 (dune+0x8bbd70) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #19 camlDune_rules__Lib_rules.rules_5492 src/dune_rules/lib_rules.ml:648 (dune+0x7c23b8) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #20 camlDune_rules__Gen_rules.if_available_buildable_2319 src/dune_rules/gen_rules.ml:102 (dune+0x76bdd7) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #21 caml_apply2 <null> (dune+0x6e7227) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #22 camlFiber__Scheduler.advance_1192 vendor/fiber/src/scheduler.ml:194 (dune+0xd6d2be) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #23 camlFiber.loop_391 vendor/fiber/src/fiber.ml:15 (dune+0xd62aeb) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #24 camlDune_engine__Scheduler.run_3186 vendor/fiber/src/fiber.ml:17 (dune+0xcc9d26) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #25 camlDune_engine__Scheduler.run_and_cleanup_3379 src/dune_engine/scheduler.ml:1141 (dune+0xcca4be) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #26 camlDune_engine__Scheduler.go_inner_5501 src/dune_engine/scheduler.ml:1363 (dune+0xcccb5f) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #27 camlCmdliner_term.fun_639 vendor/cmdliner/src/cmdliner_term.ml:24 (dune+0xe32da2) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #28 camlCmdliner_eval.run_parser_576 vendor/cmdliner/src/cmdliner_eval.ml:34 (dune+0xe2a865) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #29 camlCmdliner_eval.eval_value_inner_1700 vendor/cmdliner/src/cmdliner_eval.ml:202 (dune+0xe2c0e4) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #30 camlMain.entry bin/main.ml:106 (dune+0x6e7a39) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #31 caml_program <null> (dune+0x6df539) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #32 caml_start_program <null> (dune+0xf87e43) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #33 caml_startup_common runtime/startup_nat.c:132 (dune+0xf875ec) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #34 caml_startup_common runtime/startup_nat.c:88 (dune+0xf875ec)
    #35 caml_startup_exn runtime/startup_nat.c:139 (dune+0xf87697) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #36 caml_startup runtime/startup_nat.c:144 (dune+0xf87697)
    #37 caml_main runtime/startup_nat.c:151 (dune+0xf87697)
    #38 main runtime/main.c:37 (dune+0x6dbbd5) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)

  Previous read of size 8 at 0x7f248fdff518 by thread T4:
    #0 caml_unix_open /home/edvint/.opam/5.2+tsan/.opam-switch/build/ocaml-variants.5.2.0+options/otherlibs/unix/open_unix.c:79 (dune+0xf29dfb) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #1 caml_c_call <null> (dune+0xf87d27) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #2 camlDune_digest.file_787 src/dune_digest/dune_digest.ml:24 (dune+0xda6d3c) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #3 camlDune_shared_cache.digest_1966 src/dune_digest/dune_digest.ml:87 (dune+0x760c8b) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #4 camlStdune__Exn_with_backtrace.try_with_430 otherlibs/stdune/src/exn_with_backtrace.ml:9 (dune+0xdc94d8) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #5 camlDune_engine__Scheduler.f_3423 src/dune_engine/scheduler.ml:1153 (dune+0xcca806) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #6 camlDune_thread_pool__Thread_pool.loop_527 src/dune_thread_pool/thread_pool.ml:39 (dune+0xcd0c59) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #7 camlDune_engine__Scheduler.f_1701 src/dune_engine/scheduler.ml:102 (dune+0xcc413d) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #8 camlThread.fun_768 /home/edvint/.opam/5.2+tsan/.opam-switch/build/ocaml-variants.5.2.0+options/otherlibs/systhreads/thread.ml:48 (dune+0xe4f360) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #9 caml_start_program <null> (dune+0xf87e43) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #10 caml_callback_exn runtime/callback.c:201 (dune+0xf4a493) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)
    #11 caml_thread_start /home/edvint/.opam/5.2+tsan/.opam-switch/build/ocaml-variants.5.2.0+options/otherlibs/systhreads/st_stubs.c:634 (dune+0xf250a7) (BuildId: f5f81ea033312aa0209de3bf7f7d92b167faded7)

  Location is stack of thread T4.

The code:

  if (!(Is_block(v) && Is_young(v))) {
    /* not a minor block */
   (* ---> *) *p = v;
    return;
  }
  caml_enter_blocking_section();
  (* ---> *) fd = open(p, cv_flags, Int_val(perm));
  caml_leave_blocking_section();

And as the message says the race is between the main thread writing to a location that was previously part of thread 4’s stack (last read by this read function).
Now this either refers to some kind of memory corruption where some object in the main thread’s heap is part of thread 4’s stack (I don’t really see how, thread 4 mostly has integers on its stack, unless the main thread got a stale/dangling pointer), or memory got reused, and what used to be thread 4’s stack is now in fact the main thread’s heap, but thread sanitizer doesn’t know or didn’t get told about that, so it doesn’t see the main thread’s write as initializing write.

Has someone seen this before? Could this be a side-effect of OCaml switching stacks and tsan getting confused about it?

FWIW rerunning the build succeeds, so it doesn’t always reproduce, although if I build enough packages with dune, it’ll eventually fail like this again, interestingly the race is always between oldify_one + unix_open.

I think it’s unix: move Int_val outside blocking section by emillon · Pull Request #13188 · ocaml/ocaml · GitHub (caused by Int_val in a critical section)

1 Like

Thanks, I also reasoned the same way that “accessing integers is always safe because they cannot be moved”, but I agree that unix: move Int_val outside blocking section by emillon · Pull Request #13188 · ocaml/ocaml · GitHub is true and the GC does indeed write to integer values that are registered as GC roots (on the stack), and overwrites the value with itself.

I agree that the manual should be updated, because there will be a lot of C code that needs to be updated (at least to avoid the tsan warning, even if in practice writing the same value is probably safe). I assume the behaviour itself is not new in OCaml 5, it is just that tsan is a new feature that wasn’t present on OCaml 4!

Would there be a way for the GC to avoid this write? It seems redundant, and it should be able to determine whether something is an integer or not.

OTOH the function in the runtime is already annotated with CAMLno_tsan Disable TSan reporting on parts of the runtime ¡ ocaml/ocaml@288ce32 ¡ GitHub and that change is part of 5.2, so why does TSAN still warn about it, does the annotation not work properly?

Looks like GCC has recently learned about has_feature, so this ‘if’ is the wrong way around, probably the GCC version should be first:

#define CAMLno_tsan
/* __has_feature is Clang-specific, but GCC defines __SANITIZE_ADDRESS__ and
 * __SANITIZE_THREAD__. */
#if defined(__has_feature)
#  if __has_feature(thread_sanitizer)
#    undef CAMLno_tsan
#    if defined(__has_attribute)
#      if __has_attribute(disable_sanitizer_instrumentation)
#        define CAMLno_tsan \
           __attribute__((disable_sanitizer_instrumentation))
#      else
#        define CAMLno_tsan __attribute__((no_sanitize("thread")))
#      endif
#    else
#      define CAMLno_tsan __attribute__((no_sanitize("thread")))
#    endif
#  endif
#else
#  if defined(__SANITIZE_THREAD__)
#    undef CAMLno_tsan
#    define CAMLno_tsan __attribute__((no_sanitize_thread))
#  endif
#endif

This patch seems to fix the problem, and at least on some quick testing with GCC-14 and clang-18 it seems to turn TSAN on/off correctly for a function, but I’ll need to do more testing before I open a PR (e.g. figure a way to add a test to the testsuite run in the CI that checks whether the attribute works or not. I can determine that manually by inspecting the assembly, but for the testsuite it probably needs a small test that triggers a race on purpose and checks that it is correctly present/absent based on this flag):

diff --git a/runtime/caml/misc.h b/runtime/caml/misc.h
index ddccdc585d..b373c7f525 100644
--- a/runtime/caml/misc.h
+++ b/runtime/caml/misc.h
@@ -562,18 +562,20 @@ CAMLextern int caml_snwprintf(wchar_t * buf,
 
 /* Macro used to deactivate address sanitizer on some functions. */
 #define CAMLno_asan
-/* __has_feature is Clang-specific, but GCC defines __SANITIZE_ADDRESS__ and
- * __SANITIZE_THREAD__. */
-#if defined(__has_feature)
+#if defined(__SANITIZE_ADDRESS__)
+#    undef CAMLno_asan
+#    define CAMLno_asan __attribute__((no_sanitize_address))
+#elif defined(__has_feature)
+/* __has_feature used to be Clang-specific, and GCC defines __SANITIZE_ADDRESS__ and
+ * __SANITIZE_THREAD__ instead.
+ * Newer versions of GCC support __has_feature, but do not support all the attributes
+ * the Clang does, so we must check for GCC first, because `no_sanitize("thread")`
+ * doesn't work on GCC.
+ */
 #  if __has_feature(address_sanitizer)
 #    undef CAMLno_asan
 #    define CAMLno_asan __attribute__((no_sanitize("address")))
 #  endif
-#else
-#  if defined(__SANITIZE_ADDRESS__)
-#    undef CAMLno_asan
-#    define CAMLno_asan __attribute__((no_sanitize_address))
-#  endif
 #endif
 
 #endif /* CAML_INTERNALS */
diff --git a/runtime/caml/tsan.h b/runtime/caml/tsan.h
index 2f1df2f4fa..38e0c1dabd 100644
--- a/runtime/caml/tsan.h
+++ b/runtime/caml/tsan.h
@@ -17,9 +17,16 @@
 
 /* Macro used to deactivate thread sanitizer on some functions. */
 #define CAMLno_tsan
-/* __has_feature is Clang-specific, but GCC defines __SANITIZE_ADDRESS__ and
- * __SANITIZE_THREAD__. */
-#if defined(__has_feature)
+#if defined(__SANITIZE_THREAD__)
+#    undef CAMLno_tsan
+#    define CAMLno_tsan __attribute__((no_sanitize_thread))
+#elif defined(__has_feature)
+/* __has_feature used to be Clang-specific, and GCC defines __SANITIZE_ADDRESS__ and
+ * __SANITIZE_THREAD__ instead.
+ * Newer versions of GCC support __has_feature, but do not support all the attributes
+ * the Clang does, so we must check for GCC first, because `no_sanitize("thread")`
+ * doesn't work on GCC.
+ */
 #  if __has_feature(thread_sanitizer)
 #    undef CAMLno_tsan
 #    if defined(__has_attribute)
@@ -33,11 +40,6 @@
 #      define CAMLno_tsan __attribute__((no_sanitize("thread")))
 #    endif
 #  endif
-#else
-#  if defined(__SANITIZE_THREAD__)
-#    undef CAMLno_tsan
-#    define CAMLno_tsan __attribute__((no_sanitize_thread))
-#  endif
 #endif
 
 /* TSan records a release operation on encountering ANNOTATE_HAPPENS_BEFORE

Yeah, sorry about that, there is this weird transition period where the compilers with the bug fixed are not yet in widespread use, but the temporary character of the bug does not make it worth documenting in the manual of the next OCaml release.

It is indeed a bit awkward, but it was decided that it is easier to avoid this insiders’ trick of “the rule actually doesn’t apply to integers” rather than to change the performance-critical GC code. The new insiders’ trick is to not register immediate as GC roots. I can’t encourage it, however; the code that has the highest probability of not causing spurious reports in the future is the code strictly adhering to the FFI rules.

Thanks for spotting the issue and the prompt fix! I will open a PR for it early next week.

1 Like

This does actually simplify the rules a bit (even if there is a lot of code that doesn’t strictly adhere to it).
Previously if I wanted to warn about race conditions with the runtime lock released I had to check whether the parameter was a (non-integer) OCaml value or not.
Now all I’d need to check is whether the value is registered as a GC root or not, if it is then it is not safe to use it in any form with the runtime lock released.
(of course there is still a distinction between OCaml values and integers: with OCaml values you definitely have a race condition that will cause problems in practice now, whereas with integers it is technically undefined behaviour, so anything could happen, but in practice it has worked so far).
I’ll try to update my static analyzer with these new rules in mind.

Looks like I was wrong, GCC does seem to accept both attributes after all.
The reason the suppression wasn’t working is because it got removed in this commit (still in 5.2): Fix data race in minor GC · ocaml/ocaml@42a6339 · GitHub