I am trying to get the best possible performance out of a particle simulation.
TL; DR: How can I speed up small vector operations as much as possible?
Particle positions are represented as vectors with a signature whose boiled down version is
module type VEC = sig
type t
val dim: int
val ( - ) : t -> t -> t
val dot: t -> t -> float
val d2: t -> t -> float (* squared distance *)
end
The space dimension dim
may by 1, 2 or 3. The function d2
is called in a tight loop, and I noticed that my whole program’s performance is affected by how I implement it. I started with:
module V2 : VEC = struct
type t = {x:float; y:float}
let dim = 2
let ( - ) {x;y} {x=x';y=y'} = {x=x-.x'; y=y-.y'}
let dot {x; y} {x=x'; y=y'} = x*.x' +. y*.y'
let d2 v v' = let d = v' - v in dot d d
end
but then noticed that dot
allocates a lot, by profiling with landmarks
. As an experiment I then tried to switch VEC.t
to a tuple. I now have
module V2_tuple : VEC = struct
type t = float * float
let dim = 2
let ( - ) (x, y) (x', y') = (x -. x', y -. y')
let dot (x, y) (x', y') = x*.x' +. y*.y'
let _d2' v v' = let d = v' - v in dot d d
let _d2'' v v' =
let dx, dy = v - v' in
dx *. dx +. dy *. dy
let _d2''' (x, y) (x', y') =
let dx, dy = x -. x', y -.y' in
dx *. dx +. dy *. dy
let d2 = _d2' (* or _d2'' or _d2''' *)
end
I was a bit shocked that the performance of the three versions of d2
is not at all the same, _d2''
is faster than _d2'
by around 40% and _d2'''
is faster by another 60%. The allocations due to these functions also go down by similar factors. I am sure more seasoned OCamlers with not find this surprising, but to me it was. I think this is due to dot
and (-)
having to allocate tuples for their results, and (possibly) floats remaining unboxed when assigning them directly in tuple syntax let (x, y) = ...
.
So, I have a load of questions:
- Is the allocation story true?
- Is there another data structure that would work faster?
- Or even better, is there a way to write clean factored code and have it be fast? For example, I tried putting
let[@inline] dot
etc. but that did not seem to have an effect. - Could it be that profiling with landmarks is preventing optimizations/inlining? I did check that the non-instrumented program shows a similar order of performance between the different implementations; here the improvement in run-time for the whole program is about 10% each.
Any guidance would be appreciated!
PS. I am using ocaml-variants.4.11.1+flambda+no-flat-float-array
with -O3
optimization level. I’m not comfortable with reading assembly.
[edit for posterity]: As reported below, the differences between the d2
versions were likely because instrumentation with landmarks prevented proper inlining. Code compiled fully without it does not show them!