I am trying to get the best possible performance out of a particle simulation.

TL; DR: How can I speed up small vector operations as much as possible?

Particle positions are represented as vectors with a signature whose boiled down version is

```
module type VEC = sig
type t
val dim: int
val ( - ) : t -> t -> t
val dot: t -> t -> float
val d2: t -> t -> float (* squared distance *)
end
```

The space dimension `dim`

may by 1, 2 or 3. The function `d2`

is called in a tight loop, and I noticed that my whole programâ€™s performance is affected by how I implement it. I started with:

```
module V2 : VEC = struct
type t = {x:float; y:float}
let dim = 2
let ( - ) {x;y} {x=x';y=y'} = {x=x-.x'; y=y-.y'}
let dot {x; y} {x=x'; y=y'} = x*.x' +. y*.y'
let d2 v v' = let d = v' - v in dot d d
end
```

but then noticed that `dot`

allocates a lot, by profiling with `landmarks`

. As an experiment I then tried to switch `VEC.t`

to a tuple. I now have

```
module V2_tuple : VEC = struct
type t = float * float
let dim = 2
let ( - ) (x, y) (x', y') = (x -. x', y -. y')
let dot (x, y) (x', y') = x*.x' +. y*.y'
let _d2' v v' = let d = v' - v in dot d d
let _d2'' v v' =
let dx, dy = v - v' in
dx *. dx +. dy *. dy
let _d2''' (x, y) (x', y') =
let dx, dy = x -. x', y -.y' in
dx *. dx +. dy *. dy
let d2 = _d2' (* or _d2'' or _d2''' *)
end
```

I was a bit shocked that the performance of the three versions of `d2`

is not at all the same, `_d2''`

is faster than `_d2'`

by around 40% and `_d2'''`

is faster by another 60%. The allocations due to these functions also go down by similar factors. I am sure more seasoned OCamlers with not find this surprising, but to me it was. I think this is due to `dot`

and `(-)`

having to allocate tuples for their results, and (possibly) floats remaining unboxed when assigning them directly in tuple syntax `let (x, y) = ... `

.

So, I have a load of questions:

- Is the allocation story true?
- Is there another data structure that would work faster?
- Or even better, is there a way to write clean factored code and have it be fast? For example, I tried putting
`let[@inline] dot`

etc. but that did not seem to have an effect. - Could it be that profiling with landmarks is preventing optimizations/inlining? I did check that the non-instrumented program shows a similar order of performance between the different implementations; here the improvement in run-time for the whole program is about 10% each.

Any guidance would be appreciated!

PS. I am using `ocaml-variants.4.11.1+flambda+no-flat-float-array`

with `-O3`

optimization level. Iâ€™m not comfortable with reading assembly.

[edit for posterity]: As reported below, the differences between the `d2`

versions were likely because instrumentation with landmarks prevented proper inlining. Code compiled fully without it does not show them!