Any suggestions on making this piece of OCaml numerical code more efficient?

I don’t have much time to investigate today but the things that seemed like it improved the running time (correctness unchecked !) :

  • making dist a local function that expect 6 floats (and dist2 too). I’m not sure putting [@inline] also gets read of the floating point boxing, but I’d be very happy to learn that this is the case
  • dont recompute x/y/z from scratch. Create three floaring point references, and initialize them at the begining of each loop to x_min, y_min, z_min and at the end of the loop just do x := !x +. dx (and like wise for y and z). That should be the same no (modulo floating point differences…). It removes a conversion and an fp multiplication for each loop so that helps I think. Declare the references outside the three loops, floating point references are not allocated if they are local (in native code).
  let vdW_grid g prot_atoms l_a =
    let x_min, y_min, z_min = V3.to_triplet g.low in
    let dx = g.step in
    let x = ref x_min in
    let y = ref 0.0 in
    let z = ref 0.0 in
    let n = A.length prot_atoms in
    for i = 0 to g.nx - 1 do
      y := y_min;
      for j = 0 to g.ny - 1 do
        z := z_min;
        for k = 0 to g.nz - 1 do
          for m = 0 to n - 1 do (* over all protein atoms *)
            let p_p, p_a = A.unsafe_get prot_atoms m in
            let r_ij = non_zero_dist (dist !x !y !z p_p.Vector3.x p_p.Vector3.y p_p.Vector3.z) in
            let vdw = UFF.vdW_xiDi l_a p_a in
            let p6 = pow6 (vdw.x_ij /. r_ij) in
            BA3.unsafe_set g.grid i j k
              (BA3.unsafe_get g.grid i j k +.
               (vdw.d_ij *. ((-2.0 *. p6) +. (p6 *. p6))))
          done;
          z := !z +. dx;
        done;
        y := !y +. dx;
      done;
      x := !x +. dx;
    done;
    g 

I have a branch (here) with individual commits that should show interesting things to do.
The first commit just uses perf stat instead of time, as it gives much more reliable numbers. The next commits all reduce the number of executed instructions on my computer, but not all reduce the overall time (number of cycles in the perf output). I encourage you to run the test script yourself on individual commits to see how it impacts performance. For instance, I observed that one commit that removes duplicate instructions doesn’t actually decrease the number of cycles, even though strictly less instructions were run.

1 Like

Very interesting, thanks a lot! :grinning:
I will look into your branch.

Nice catch for this one!
It shaves about 8s from the previous 49s calculation.
The fastest version of the program incorporates it.

1 Like

Thanks to all those who participated, the current version is about two times faster
than the initial one!

2 Likes

another idea: if you don’t use interfacing with C too often, you could use regular Arrays instead of Bigarrays, they are usually faster

Hi,

I have the impression that you are using a Bigarray to store the data, and it would be faster to get and set the data in a native OCaml data structure. Maybe you should increment the data in a temporary variable that you initialize before the loop over m, and write only once in the Bigarray after the loop over m:

        for k = 0 to g.nz - 1 do
          let z = z_min +. (float k) *. dx in
          let l_p = V3.make x y z in (* ligand atom position *)
          let tmp = ref (BA3.unsafe_get g.grid i j k) in  (* <- HERE *)
          for m = 0 to n - 1 do (* over all protein atoms *)
            let p_p, p_a = A.unsafe_get prot_atoms m in
            let r_ij = Math.non_zero_dist (V3.dist l_p p_p) in
            let vdw = UFF.vdW_xiDi l_a p_a in
            let p6 = FF.pow6 (vdw.x_ij /. r_ij) in
            tmp := !tmp +. p6 *. vdw.d_ij *. (p6 -. 2.0))   (* <- HERE *)
          done;
          BA3.unsafe_set g.grid i j k !tmp   (* <- HERE *)
        done

Also, in my experience, fetching a value from a 2D array is more costly from a Bigarray than from an OCaml array. Maybe you should follow the idea of @geoffder consider copying vdw once before the loop in a temporary OCaml array to have a quicker read access.

Finally, you should check from the layout of your Bigarray that elements (i,j,k) and (i,j,k+1) are contiguous in memory. If you use Fortran layout, it will not be the case and you will have very bad memory access patterns which will be very costly. In that case, change into C layout or change the order in which the loops are nested to iterate over contiguous elements.

I use a bigarray because:

  • I might have to interface with C or another language
  • I want 32bits floats so that the array takes less memory

I use the C layout for the bigarray.
Array indexes starting at 1 (Fortran layout) bothers me a lot.