Hi, out of curiosity I wanted to see the effect of an extreme data race in OCaml 5.0.0, and indeed it is quite extreme. In this test, one domain increments a global variable 100000 times while another decrements it 100000 times. Amongst the various random results (see below), I regularly get the answer r=100000, as if only one domain had been executed. Does this depend on the caching strategy of the CPU, or is there another explanation?
let main d1 d2 reset get =
  reset ();
  let h1 = Domain.spawn d1 in
  let h2 = Domain.spawn d2 in
  Domain.join h1;
  Domain.join h2;
  Printf.printf "r=%i\n" (get ());;

let test d1 d2 reset get s =
  print_endline s;
  for _ = 0 to 100 do main d1 d2 reset get done;;
(* Has data race *)
let r = ref 0
let n = 100000
let d1 () = for _ = 1 to n do r := !r + 1 done
let d2 () = for _ = 1 to n do r := !r - 1 done;;
test d1 d2 (fun () -> r := 0) (fun () -> !r) "Data race";;
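For comparison, here is a sketch of a race-free version of the same counter using the standard Atomic module: Atomic.incr and Atomic.decr are atomic read-modify-write operations, so no update is lost and the result is always 0.

```ocaml
(* Race-free counterpart: Atomic.incr/Atomic.decr are atomic
   read-modify-write operations, so no updates are lost. *)
let r = Atomic.make 0
let n = 100_000
let d1 () = for _ = 1 to n do Atomic.incr r done
let d2 () = for _ = 1 to n do Atomic.decr r done

let () =
  let h1 = Domain.spawn d1 in
  let h2 = Domain.spawn d2 in
  Domain.join h1;
  Domain.join h2;
  Printf.printf "r=%i\n" (Atomic.get r)  (* always prints r=0 *)
```

The cost is that every iteration now performs an atomic operation, which is much slower than the plain `r := !r + 1` but actually correct under concurrency.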
The behaviour depends on many factors: the number of cores on your machine, when and where your OS schedules the domains, and the relaxed memory model.
Thanks to the precise relaxed memory model that we have for OCaml, you do not need to reason at the level of caches, non-sequentially-consistent hardware behaviours, or compiler optimisations. If you are keen to learn about the details of the memory model, we have a manual chapter that explains it in detail: OCaml - Memory model: The hard bits.
Suppose that the reset of r happens on core 0, and that d2 gets scheduled on core 1. Core 1 requests the value of r, and either main memory (or a shared cache) or core 0 answers; so core 1 gets the value 0 and starts its computation. Except for the final value of r, which is propagated thanks to a memory barrier, all the intermediate changes to r might go unnoticed by core 0. Now suppose that d1 gets scheduled on core 0. The value 0 of r is still in core 0's cache, since core 1 has not yet had the chance to advertise a new value. Eventually, both core 0 and core 1 finish their computation and race to publish their final value of r. Core 0 wins, and the final value of r is 100000.
Are you asking for a possible execution that could result in your observation? A plausible one: d1 executes first up to the point where it evaluates !r ==> 0 but before it writes the incremented value; then d2 executes to completion, setting r to -100_000; finally d1 is resumed, at which point it writes 0 + 1 to r and continues its execution, leaving r with the value 100_000.
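That lost-update interleaving can be replayed deterministically in a single domain by making the steps explicit (a sketch purely for illustration; the comments map each step to the scenario above):

```ocaml
(* Deterministic replay of the lost-update interleaving:
   d1 reads 0, d2 runs to completion, then d1 writes back 0 + 1. *)
let r = ref 0
let n = 100_000

let () =
  let v = !r in                        (* d1 reads !r ==> 0             *)
  for _ = 1 to n do r := !r - 1 done;  (* d2 runs fully: r = -100_000   *)
  r := v + 1;                          (* d1 writes 0 + 1, clobbering
                                          d2's -100_000                 *)
  for _ = 2 to n do r := !r + 1 done;  (* d1 finishes its remaining
                                          99_999 increments             *)
  Printf.printf "r=%i\n" !r            (* prints r=100000               *)
```

Every increment of d1 after the clobbering write only sees d1's own values, so d2's entire contribution vanishes.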
It depends a lot on the architecture. If no other core touches the same cache line (which is not the case here), publishing a value can take an arbitrarily long time. Here, another core reading the same memory location (to pull it into its own cache) should cause the writing core to publish its latest value, though that value might still take some time to reach the other caches.
Even on an architecture like x86 with its coherent caches (which in theory means that writes from one core are instantly seen by any other core hitting the same cache line), invalidating lines can take a long time, during which cores do not stop executing code, unless an atomic operation or a memory barrier is used.
Moreover, your example writes to the same memory location over and over, so the (speculative) writes may not even leave the store buffer. The core's own cache is then not even aware of the writes, and as a consequence neither are the caches of the other cores.
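A rough way to observe this propagation delay from OCaml is to have one domain hammer a plain ref while another samples it and counts how many distinct values it sees (a sketch; the loop counts are arbitrary assumptions, the result is entirely machine-dependent, and the sampling read is itself racy, which OCaml's memory model permits without undefined behaviour):

```ocaml
(* One domain writes a plain ref in a tight loop while another samples
   it. How many distinct values the reader observes depends on when
   stores leave the store buffer and propagate between cores. *)
let r = ref 0

let writer () = for i = 1 to 10_000_000 do r := i done

let reader () =
  let distinct = ref 0 and last = ref (-1) in
  for _ = 1 to 1_000_000 do
    let v = !r in
    if v <> !last then (incr distinct; last := v)
  done;
  !distinct

let run () =
  let rd = Domain.spawn reader in
  let w = Domain.spawn writer in
  let seen = Domain.join rd in
  Domain.join w;
  seen

let () = Printf.printf "distinct values observed: %d\n" (run ())
```

The reader performs a million reads, yet typically reports far fewer distinct values than the writer produced in the same time, because most of the writer's stores are coalesced or simply never observed before being overwritten.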