Optimizing small vector operations

I am trying to get the best possible performance out of a particle simulation.
TL; DR: How can I speed up small vector operations as much as possible?

Particle positions are represented as vectors with a signature whose boiled down version is

module type VEC = sig
  type t
  val dim: int
  val ( - ) : t -> t -> t
  val dot: t -> t -> float
  val d2: t -> t -> float                                (* squared distance *)
end

The space dimension dim may by 1, 2 or 3. The function d2 is called in a tight loop, and I noticed that my whole program’s performance is affected by how I implement it. I started with:

module V2 : VEC = struct
  type t = {x:float; y:float}
  let dim = 2
  let ( - ) {x;y} {x=x';y=y'} = {x=x-.x'; y=y-.y'}
  let dot {x; y} {x=x'; y=y'} = x*.x' +. y*.y'
  let d2 v v' = let d = v' - v in dot d d
end

but then noticed that dot allocates a lot, by profiling with landmarks. As an experiment I then tried to switch VEC.t to a tuple. I now have


module V2_tuple : VEC = struct
  type t = float * float
  let dim = 2
  let ( - ) (x, y) (x', y') = (x -. x', y -. y')
  let dot (x, y) (x', y') = x*.x' +. y*.y'
  let _d2' v v' = let d = v' - v in dot d d
  let _d2'' v v' =
    let dx, dy = v - v' in
    dx *. dx +. dy *. dy
  let _d2''' (x, y) (x', y') =
    let dx, dy = x -. x', y -.y' in
    dx *. dx +. dy *. dy
  let d2 = _d2' (* or _d2'' or _d2''' *)
end

I was a bit shocked that the performance of the three versions of d2 is not at all the same, _d2'' is faster than _d2' by around 40% and _d2''' is faster by another 60%. The allocations due to these functions also go down by similar factors. I am sure more seasoned OCamlers with not find this surprising, but to me it was. I think this is due to dot and (-) having to allocate tuples for their results, and (possibly) floats remaining unboxed when assigning them directly in tuple syntax let (x, y) = ... .

So, I have a load of questions:

  • Is the allocation story true?
  • Is there another data structure that would work faster?
  • Or even better, is there a way to write clean factored code and have it be fast? For example, I tried putting let[@inline] dot etc. but that did not seem to have an effect.
  • Could it be that profiling with landmarks is preventing optimizations/inlining? I did check that the non-instrumented program shows a similar order of performance between the different implementations; here the improvement in run-time for the whole program is about 10% each.

Any guidance would be appreciated!

PS. I am using ocaml-variants.4.11.1+flambda+no-flat-float-array with -O3 optimization level. I’m not comfortable with reading assembly.

[edit for posterity]: As reported below, the differences between the d2 versions were likely because instrumentation with landmarks prevented proper inlining. Code compiled fully without it does not show them!

4 Likes
  • Yes. At function boundaries you need the arguments (for function calls) and return value (for returns) to be strictly compliant OCaml values, so floats must be boxed, and tuples allocated. So your ( - ) function does three allocations, and your dot function one. Your _d2'' function should be allocating as much as your _d2' function, but since it only does a single access for each of the fields instead of two for the dot function (which has to read separately from its two arguments) it’s expected to be faster. You _d2''' function will not need to allocate anything for the subtraction as the compiler translates code of the form let x1, x2 = a1, a2 in ... into let x1 = a1 and x2 = a2 in ..., so the only allocation is at the end (for the return value).
  • You could probably gain a bit of performance by using records instead of tuples. Records containing only floats are compiled to a single block with the float values embedded directly, while tuples need an extra indirection (a block holding pointers to boxed floats).
  • I would have put [@inline always] annotations on all three functions (( - ), dot, and d2) and expect that with these annotations all variants have the same performance. It’s unfortunate but expected that even with aggressive inlining options these small functions are not inlined unless forced by an annotation (the gains from unboxing floats cannot currently be correctly estimated by flambda).
  • I don’t know how landmarks work exactly, but I doubt it has any impact on compiler optimisations.

Because of the issues with flambda not correctly estimating gains from float unboxing, I would advise to use the compilation flags -Oclassic when compiling code with heavy use of floating point numbers, or to systematically annotate small functions with floats as either arguments or return values with [@inline always]. If you use the -Oclassic flag, you may want to also tweak the -inline parameter: the default is -inline 10, but -inline 40 seems reasonable for heavier optimisations.

7 Likes

Just so I understand: The three allocations for (-) are all for the output? x-x', y-y' and the tuple? What happens if I have floats x,x',y,y' in the current environment and then call (x,y) - (x',y')? Do the input tuples have to be allocated then? (In that case there would be no gain in keeping the floats around individually)

According to the landmarks profile, _d2'' allocates about half as much. This is to be taken with caution since I seem to be missing a seed for the random number generator so that my runs are only almost the same (and not precisely). I’m working on fixing that.

Ah, so the RHS has to be an explicit pair for this to avoid allocation, and let x, y = f z will not avoid allocating a tuple for the ouput of f, yes?

Ok, so if the previous point is correct then indeed tuples cannot save allocation. I’ll see how a version of _d2'' with records fares.

I tried. For instance, let[@inline] ( - ) (x,y) (x',y') = (x-.x', y-.y') but it did not seem to make a difference. I concluded that -O3 must be doing all of that already, probably incorrectly by your assessment. If a function gets inlined, the allocation at the boundary should also be no more necessary and disappear in flambda, I reckoned. Is that correct in principle?

I’ll try some of these options and report back if I manage to find something better. Thanks!

Yes.

Yes, the input tuples have to be allocated. Unless the call to ( - ) is inlined, in which case the compiler can then optimise the code from ( - ) in this particular context, and eventually remove the allocation of the tuples.

That’s a bit surprising. I would be interested in understanding what happens here, if you have time to spare.

Yes, though if f is inlined the compiler should be able to remove the allocation anyway.

I’m very surprised that the [@inline] annotation doesn’t make a difference. The fact that you’ve been able to recover that much performance by manually duplicating the code should be proof that the functions are not inlined, and that inlining them would make a noticeable difference.

When you inline a function, its code is inserted at the call point, but that in itself doesn’t remove any allocations; however, it can often result in code that allocates a value (before the function call) and immediately accesses the fields (at the beginning of the function). In this case, the compiler can track the values of the fields directly, and this can make the allocation unused and removed.
As an example:

let f (x, y) = x + y
let g a b = f (a, b)

(* After initial compilation, in pseudo-code *)
let f = fun arg ->
  let x = arg.0 in
  let y = arg.1 in
  x + y
let g = fun a b ->
  let tuple = alloc [a; b] in
  f tuple

(* After inlining *)
let f = (* unchanged *)
let g = fun a b ->
  let tuple = alloc [a; b] in
  let arg = tuple in
  let x = arg.0 in
  let y = arg.1 in
  x + y

(* After simplification of field accesses *)
let f = (* unchanged *)
let g = fun a b ->
  let tuple = alloc [a; b] in
  let arg = tuple in
  let x = a in
  let y = b in
  a + b

(* After removal of unused variables *)
let f = (* unchanged *)
let g = fun a b ->
  a + b

In your case you have one tuple created by the ( - ) function that will be read by the dot function, so you need to inline both of them if you want to remove the intermediate allocation(s).

2 Likes

That was what I was expecting too. But just to be sure, I looked at the assembly output. If the functions are inlined by the compiler, the allocations are not removed and 72 bytes are allocated on the heap at once. (And the fact that there is a single massive allocation is actually proof that the functions were inlined.)

Quick update: for the same random seeds, I confirm that _d2'' allocates about 2/3 of what _d2' does according to landmarks. This is with [@inline] on all of ( - ), dot and _d2? and -O3.
Potential caveat: I am not running the exact code snippets above but the full simulation, where vectors are in fact 3D not 2D currently (but if needed I can make a 2D example as well)

I’d like to see the code in question, if possible. Ideally I’d like to see the source code for the function, and the relevant parts of the compiler’s intermediate representations (the parts I’m interested in can be printed by using the -dlambda -drawflambda -dflambda -dcmm flags). If the intermediate representations are too cumbersome, the raw assembly code might be helpful too.

The code is just as given by the original poster, except that it was trimmed from the less relevant parts and (-) was marked as inline (dot already is):

let [@inline] (-) (x, y) (x', y') = (x -. x', y -. y')
let dot (x, y) (x', y') = x*.x' +. y*.y'
let foo v v' = let d = v' - v in dot d d

The lambda and rawlambda passes are before inlining, so not interesting. The flambda pass is not relevant when not compiling with Flambda. Here is the result of the cmm pass:

(function{a.ml:3,8-40} camlA__foo_51 (v/117: val v'/116: val)
 (let
   (d/118
      (let
        (Psubfloat_arg/126 (load_mut val (+a v/117 8))
         Psubfloat/144
           (-f (load float64u (load_mut val (+a v'/116 8)))
             (load float64u Psubfloat_arg/126))
         Psubfloat_arg/129 (load_mut val v/117))
        (alloc{a.ml:3,23-29;a.ml:1,36-54}
          block-hdr(2048){a.ml:3,23-29;a.ml:1,36-54}
          (alloc{a.ml:3,23-29;a.ml:1,37-44}
            block-hdr(1277){a.ml:3,23-29;a.ml:1,37-44}
            (-f (load float64u (load_mut val v'/116))
              (load float64u Psubfloat_arg/129)))
          (alloc{a.ml:3,23-29;a.ml:1,46-53}
            block-hdr(1277){a.ml:3,23-29;a.ml:1,46-53} Psubfloat/144)))
    Pmulfloat_arg/119 (load_mut val (+a d/118 8))
    Pmulfloat/145
      (*f (load float64u (load_mut val (+a d/118 8)))
        (load float64u Pmulfloat_arg/119))
    Pmulfloat_arg/122 (load_mut val d/118))
   (alloc{a.ml:3,33-40;a.ml:2,26-40}
     block-hdr(1277){a.ml:3,33-40;a.ml:2,26-40}
     (+f
       (*f (load float64u (load_mut val d/118))
         (load float64u Pmulfloat_arg/122))
       Pmulfloat/145))))

As for the assembly, as I said, it merges all the allocations into a single one, and thus start with:

	subq	$72, %r15
	cmpq	8(%r14), %r15
	jb	.L110
1 Like

Thanks, I understand better your case now. Only flambda knows how to remove tuple allocations after inlining, so if you’re not using flambda the output you get is the one expected. The original poster was using flambda, so I assumed you were as well. Many of the comments I’ve made about simplification are flambda-specific.

@n4323 I’m curious about your numbers. Is there a way for me to try to reproduce them ? That way I could try to look at the generated code too.

Sure, but you advised the original poster to use -Oclassic. So, I thought that was the use case you were interested in. If I remove -Oclassic, then only the 16-byte allocation is left.

I’ll see if I can make a version of my example available tomorrow. Hopefully a minimal example will still show the same behavior…

Did you check that the tuple is optimized away? If so, you should leave a comment indicating that you did. That’s why I would instead use the following syntax, to make it clear that a block won’t be allocated to hold the pair:

let dx = x -. x'
and dy = y -. y' in

Personally, I usually just prefer the default style:

let dx = x -. x' in
let dy = y -. y' in

and reserve let ... and ... in and let rec ... and ... in for situations where they are necessary.

I did compare the let .. and ... version with the pseudo-tuple at some point. But I’ll re-check to be sure.

[edit] so i checked. According to landmarks, let dx = .. in let dy = ... and assignment in tuple syntax let dx, dy = ... allocates the same exact amount; for some reason the former consumes 10.57G cycles and the latter 11.0G cycles in my otherwise identical test runs.
[edit again] forget the differences in cycles, they are not reproducible. also a let ... and version gives indistinguishable results.

To avoid a potential misunderstanding: the speedup by 40% and another 60% in the OP were just for the cumulative time spent in the d2 functions, not for the whole program. For the whole program, it’s more like 10% and another 10%.

-Oclassic is a flambda-specific option that mimics the inlining heuristics from Closure (the non-flambda mode) but keeps the additional simplifications. It should be better than the non-flambda compiler in this case precisely because it can still remove the tuple allocation, which in turn allows the floats to be unboxed.

EDIT: Actually I just checked and you’re right, -Oclassic seems to disable simplification too so the extra allocations remain. I’ll investigate this a bit more.

EDIT2: So it appears that the -Oclassic option is a bit strange, as it messes with other optimisation flags. In particular it is impossible to have two rounds of simplification in -Oclassic mode, which means that the simplifications after inlining cannot be performed in this mode. I don’t think we’re going to fix this, so I’m going to have to advise against using -Oclassic for this case.

The “tight” version of this code would be:

let _d2''' { x = x1; y = y1 } { x = x2; y = y2 } =
  let dx = x1 -. x2 in
  let dy = y1 -. y2 in
  dx *. dx +. dy *. dy

To perform well (eliminate all unnecessary allocations) we need the compiler to unbox floats correctly, but I think that both flambda and non-flambda optimizers should do a good job here (but @vlaviron is orders-of-magnitude more knowledgeable on flambda).

The remaining performance wins would come from cross-function optimizations or aggressive inlining to eliminate float boxing at the callsite (of the arguments and/or the result, if applicable), and we would expect a well-behaving flambda to have the upper hand there, but it is unclear how well-behaving the current implementation is.

I tried the record version and compared it with the tuple version of _d2'''. The two functions allocate the same. Interestingly, however, the record version of the code runs slower overall. I noticed that ( - ) in the record version allocates ten times more than in the tuple version, although I have annotated both with let [@inline] ( - ) = ....
(( - ) is still used in other places than _d2'''.) The mystery deepens.

The record version does not avoid any allocation in the function taking the record as argument, but it avoids allocations in the functions creating the record. (It can cause allocations in functions that consume its fields only to put them in an boxing context, but this is not the case here.) So overall you should see a decrease in allocations when using records (unless you are writing strange code with a lot of boxing contexts, or you are splitting into non-inlined functions in a fine-grained ways that inserts spurious boxing contexts), but not in the function itself.

Note: it’s pretty easy to understand where OCaml needs to box/unbox floats, and where we hope that it will not:

  • floats are stored boxed in tuples, unboxed in records with all-float fields (but not other records) or arrays (unless you are in a -no-flat-float-array switch) or Floatarray.t (robustly)
  • floating-point computations are expressed as taking boxed arguments and returning boxed results, but in practice the optimizer will eliminate intermediary boxing

For example dx *. dx +. dy *. dy will never box the intermediate results of *. (but it will probably box the result of +. as a function boundary, and whether dx and dy will be passed boxed or unboxed depends on the optimizer).

Some stuff about floating-point performance is tricky and hard to predict, so you can find surprises if you micro-optimize code, but the broad principles above should give you a simple, clear mental model to write efficient floating-point code in most cases.

3 Likes

I believe most of Writing efficient numerical code in Objective Caml from the old caml site is still true (for non flambda compilers), which is always what I reach for (in particular when I wrote gg which may be of interest to the OP).

Somehow this information should be, if needed, updated to match reality and integrated somewhere (manual or ocaml.org). It’s very useful basic information.

6 Likes

I think the numbers you get are consistent with a situation where your [@inline] annotation is ignored for some reason, but with the tuple representation flambda still finds many cases where it makes sense to inline the function (because it can remove an allocation) while in the record case it will probably never do it. I assume that landmarks will not reports allocations from ( - ) at the places it has been inlined, so more inlining means less allocations from this particular function.

The reason it can remove an allocation for the tuple version not for the record one is that flambda distinguishes tuple-like structures (tuples, non-constant variant constructors, non-float records), for which it can track approximations, and array-like structures (arrays and float records) for which it doesn’t track approximations. So basically { x = expr }.x can be simplified to expr by flambda only if the record type is not a float record (and the record type doesn’t define x as mutable).

You shouldn’t see this as an argument against float records though; most of the time the ability to avoid the allocation of the underlying floats is much more important than the ability to remove the record allocation itself.

The bottom line is I’m really curious to see why it seems that the [@inline] annotations do not seem to work. With a working example (even a big one) I should be able to track down the cause fairly easily.