I don’t have much time to investigate today, but here are the things that seemed to improve the running time (correctness unchecked!):
Making dist a local function that expects 6 floats (and dist2 too). I’m not sure whether adding [@inline] also gets rid of the floating-point boxing, but I’d be very happy to learn that this is the case.
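For reference, a minimal sketch of what I mean (the body of dist/dist2 is my assumption of a plain Euclidean distance; the point is the six unboxed float arguments):

let[@inline] dist2 x1 y1 z1 x2 y2 z2 =
  let dx = x1 -. x2 and dy = y1 -. y2 and dz = z1 -. z2 in
  (dx *. dx) +. (dy *. dy) +. (dz *. dz)

let[@inline] dist x1 y1 z1 x2 y2 z2 =
  sqrt (dist2 x1 y1 z1 x2 y2 z2)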
Don’t recompute x/y/z from scratch. Create three floating-point references, initialize them at the beginning of each loop to x_min, y_min and z_min, and at the end of each loop just do x := !x +. dx (and likewise for y and z). That should give the same result, no? (modulo floating-point differences…). It removes an int-to-float conversion and a floating-point multiplication per iteration, so that should help, I think. Declare the references outside the three loops; floating-point references are not allocated if they are local (in native code). Something like:
let vdW_grid g prot_atoms l_a =
  let x_min, y_min, z_min = V3.to_triplet g.low in
  let dx = g.step in
  let x = ref x_min in
  let y = ref 0.0 in
  let z = ref 0.0 in
  let n = A.length prot_atoms in
  for i = 0 to g.nx - 1 do
    y := y_min;
    for j = 0 to g.ny - 1 do
      z := z_min;
      for k = 0 to g.nz - 1 do
        for m = 0 to n - 1 do (* over all protein atoms *)
          let p_p, p_a = A.unsafe_get prot_atoms m in
          let r_ij =
            non_zero_dist
              (dist !x !y !z p_p.Vector3.x p_p.Vector3.y p_p.Vector3.z) in
          let vdw = UFF.vdW_xiDi l_a p_a in
          let p6 = pow6 (vdw.x_ij /. r_ij) in
          BA3.unsafe_set g.grid i j k
            (BA3.unsafe_get g.grid i j k +.
             (vdw.d_ij *. ((-2.0 *. p6) +. (p6 *. p6))))
        done;
        z := !z +. dx
      done;
      y := !y +. dx
    done;
    x := !x +. dx
  done;
  g
I have a branch (here) with individual commits that should show interesting things to do.
The first commit just uses perf stat instead of time, as it gives much more reliable numbers. The following commits all reduce the number of executed instructions on my computer, but not all of them reduce the overall time (the number of cycles in the perf output). I encourage you to run the test script yourself on individual commits to see how each one impacts performance. For instance, I observed that one commit that removes duplicate instructions doesn’t actually decrease the number of cycles, even though strictly fewer instructions were executed.
I have the impression that you are using a Bigarray to store the data, and it would be faster to get and set the data in a native OCaml data structure. Maybe you should accumulate into a temporary variable that you initialize before the loop over m, and write to the Bigarray only once, after the loop over m (note that vdw.d_ij *. ((-2.0 *. p6) +. (p6 *. p6)) is the same as p6 *. vdw.d_ij *. (p6 -. 2.0), as used below):
for k = 0 to g.nz - 1 do
  let z = z_min +. (float k) *. dx in
  let l_p = V3.make x y z in (* ligand atom position *)
  let tmp = ref (BA3.unsafe_get g.grid i j k) in (* <- HERE *)
  for m = 0 to n - 1 do (* over all protein atoms *)
    let p_p, p_a = A.unsafe_get prot_atoms m in
    let r_ij = Math.non_zero_dist (V3.dist l_p p_p) in
    let vdw = UFF.vdW_xiDi l_a p_a in
    let p6 = FF.pow6 (vdw.x_ij /. r_ij) in
    tmp := !tmp +. p6 *. vdw.d_ij *. (p6 -. 2.0) (* <- HERE *)
  done;
  BA3.unsafe_set g.grid i j k !tmp (* <- HERE *)
done
Also, in my experience, fetching a value from a 2D array is more costly from a Bigarray than from an OCaml array. Maybe you should follow the idea of @geoffder and consider copying vdw once, before the loops, into a temporary OCaml array to have quicker read access.
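A sketch of that idea, assuming UFF.vdW_xiDi only depends on l_a and the protein atom (the array names are mine):

let n = A.length prot_atoms in
let x_ijs = Array.make n 0.0 in
let d_ijs = Array.make n 0.0 in
for m = 0 to n - 1 do
  let _p_p, p_a = A.unsafe_get prot_atoms m in
  let vdw = UFF.vdW_xiDi l_a p_a in
  x_ijs.(m) <- vdw.x_ij;
  d_ijs.(m) <- vdw.d_ij
done;
(* the loop over m then reads Array.unsafe_get x_ijs m and
   Array.unsafe_get d_ijs m instead of calling UFF.vdW_xiDi every time *)

As a bonus, a plain float array is unboxed in OCaml, so these reads don’t allocate, and vdW_xiDi is no longer recomputed nx * ny * nz times per protein atom.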
Finally, you should check, from the layout of your Bigarray, that elements (i,j,k) and (i,j,k+1) are contiguous in memory. If you use the Fortran layout, this will not be the case and you will have very bad memory access patterns, which will be very costly. In that case, switch to the C layout or change the order in which the loops are nested so that you iterate over contiguous elements.
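For instance, with the C layout the last index varies fastest, so keeping k in the innermost loop walks memory linearly (a minimal sketch; how your grid is actually created is an assumption on my part):

(* C layout: elements (i, j, k) and (i, j, k+1) are adjacent in memory,
   so an innermost loop over k is cache-friendly *)
let make_grid nx ny nz =
  Bigarray.Array3.create Bigarray.float64 Bigarray.c_layout nx ny nz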