Insane behaviour from C++-written tool interracting through FFI

Hello,

The tool I’m working on, Gillian, has been facing a very weird issue recently. We use Z3 as a solver and send queries through its OCaml API, and two independent changes each caused the same weird behaviour. Both have been fixed in the same way, and it makes absolutely no sense to me.

In both cases:

  • the changes were in entirely unrelated parts of the code-base and were just there to improve the logging behaviour of the tool. They consisted in at most 10 lines of code.
  • the changes introduced references to keep track of some information (one locally inside a function, one globally at the toplevel of a module).
  • Z3 would start timing out on some queries where it would not time before.
  • it was fixed by rewriting the same code to not use references (again in completely unrelated parts of the code base)

I am going to go into details about one case: I wanted to make some quick and dirty measures for the behaviour of one specific function that was called ~1200 times during execution. In order to do so, I introduced the following at the beginning of a module:

type measures = { total_time: float; ... }
let measures = ref { total_time = 0.; ... } 
let () =
  at_exit @@ fun () ->
  let measures = !measures in
  Fmt.pr "MEASURES: ..."

let record_behaviour computation_time result =
  measures := (operation on result, computation_time and !measures)

and later in the relevant function

let relevant_function ... = 
   let res = ... in
   record_behaviour res;
   res

Importantly, the relevant function (and in fact this entire module) is unrelated to Z3 and does not make any call to it.
More surprisingly, this would cause Z3 to time out, and it was fixed by removing the measures type and variable, and removing the at_exit function, and instead by making record_behaviour write a json of data directly to stderr and then aggregating the measures through a Python script.

My only guess is that references may be messing with the Gc or something in this case? I have no idea how to debug that.
The second case is similar and took us a week to debug as the new timing out would only happen in CI the github CI…

OCaml version: 4.14.1
Z3 version: 4.8.12 (not the opam version, our own build that is more straightforward)

1 Like

In the two versions of your code, is the link order of the OCaml modules the same (the order in which the modules appear in the final ocamlopt invocation)? If not, is the order of the modules whose initialization code might call Z3 the same? It is just a shot in the dark, but I have observed surprising changes sometimes when changed linking order of modules led to different Z3 internal ids, leading to different choices of case splits, etc. deep in the solvers.

To dig deeper, I would enable the Z3 log with Z3_open_log to capture a log for both versions, and then replay them using z3 -log to see if the differences persist, or to compare z3 running each of them to see if that gives any clues. Sometimes manually inspecting the logs can give ideas. It is a slog though.

The change described above happens in a single file and doesn’t require anything, so I do not believe the order of modules is changing at all :confused:

I didn’t know logging and replaying was a thing, I’ll try that thank you!

Another point to have in mind is that changes in when garbage collections occur can change the internal ids of Z3 values. Values of Z3 types such as ast are ref-counted and hash-consed, and when passed to OCaml the ref counts are decremented by GC finalizers. So when the last reference to a Z3 value is held by a dead but not yet collected OCaml value, and an equivalent Z3 value is constructed, then there is a form of race where if the GC runs and finalizes the value first, a new Z3 value will be created, while if the creation happens first, then the existing one will be reused. These two situations will result in different internal ids, and the solver heuristics can be sensitive to them.

2 Likes

Update: I am not convinced that this is anything to do with the GC anymore.
In particular because this change triggers timeouts in z3 when, importantly, this script file is entirely and completely unrelated to the failing test. It is not even a script that calls the same executable as the failing test. This script is not even called or touched by the container that starts failing when that change is introduced.

That being said, I do think your insight, @jjb, is very important: we must have somewhere in that case study an extremely unstable z3 query that randomly starts failing on any minor change. We will explore further

Sometimes when code is very sensitive to when GCs occur, it can help to reduce noise to perform a full GC after the program initialization steps are complete. There are cases where even things like the values of environment variables can be different lengths, resulting in some different sized allocations in the init routines, that can lead to differences in exactly when GCs are triggered. Inspecting the value of Gc.minor_words() at such just-post-init points across different runs can give some indication of whether this is stable or has some per-execution sensitivity. Enabling logging by setting OCAMLRUNPARAM v to appropriate values to see if GCs are being triggered at the same points across runs can also sometimes be helpful.

Have you tried linking and running with the debug runtime? (-runtime-variant d in ocamlopt_flags).
Also you seem to be using an older version of Z3, looking at upstream history: tweak GC in OCaml bindings by c-cube · Pull Request #5600 · Z3Prover/z3 · GitHub (tweak GC in OCaml bindings (#5600) · Z3Prover/z3@6302b86 · GitHub in 4.8.13, you reference 4.8.12).

Thank you very much for your answers!

It’s difficult for us to investigate further given the Unbearable Weight of Academic Deadlines :tm: coming up. We did try to perform a full GC and that didn’t help, but updating z3 seems to have fixed this particular occurence of the issue (thank you @edwin!).
In any case, I keep this thread close in case it ever happens again and we have to investigate. In any case, it helped me understand more how Z3 works and interacts with OCaml. Thank both of you for your help!