Optimizing small vector operations

Since @vlaviron was interested, I prepared a version that should be easy to clone and run locally (hopefully). If anyone would like to try it out, please PM me for a link.

Are you using landmarks via its ppx? If so, how sure are you that it does not interact with the [@inline] annotations? I would dig the ppx.exe command out of the dune log and rerun it without -dump-ast to see the OCaml code post-preprocessing.

I'm not very sure. This page describes the rewriting landmarks does in some detail. As I wrote in the OP, I did find consistent run-time differences even with landmarks disabled, though.

[edit]: Ok, I generated the preprocessed output as you described; here is an example for the record version of ( - ):

    let (-) { x; y } { x = x'; y = y' } = { x = (x -. x'); y = (y -. y') }
      [@@inline ]
    include
      struct
        [@@@ocaml.warning "-32"]
        let (-) __x0 __x1 =
          Landmark.enter
            __generated_landmark_7d2008dabe15716a397ffdfecca952f2_25;
          (let r =
             try ((-) __x0) __x1
             with
             | e ->
                 (Landmark.exit
                    __generated_landmark_7d2008dabe15716a397ffdfecca952f2_25;
                  Landmark.raise e) in
           Landmark.exit
             __generated_landmark_7d2008dabe15716a397ffdfecca952f2_25;
           r)
      end

– I'm not quite sure how to interpret it. Does the included generated struct override the definition in the first line? If so, does that mean the [@@inline] annotation is lost?

[edit 2]
I just noticed that preprocessing with landmarks has an effect even if landmarks is not activated when running the program. Roughly, preprocessing by landmarks increases run-time by 30%. (Activating it at runtime increases it by a factor of 10 or so.)

In my current example, when landmarks preprocessing is on but instrumentation is off at runtime, the fastest tuple version is around 10% faster than the record version, and this appears to be reproducible. In contrast, without preprocessing, both are the same up to noise.

This looks very similar to the record-based version. When records are returned from ( + ) and the like, they are constructed by an auxiliary function v defined as let v x y = {x; y}. Is that more than a notational convenience?

[edit]

This is an interesting read. I think I had read it at some point. My expectation was that nowadays, with flambda, I could rely on small functions all being inlined away, so that I would not need to stick to the C-like style with big functions advocated there to get good performance.

It is the record-based version. At the time I chose records over tuples and arrays of floats, as they provide unboxed floats and avoid bounds checks on access.

No, nothing more. It's the constructor (assumed to be inlined in the implementation – nowadays a few [@inline] should likely be added). The M.v is my general convention for creating values of an immutable type M.t – it's as close as you can get to a literal notation.
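As a concrete sketch of that convention (the module name, field names and operations here are my assumptions, not the actual library code):

```ocaml
(* Hypothetical module illustrating the M.v convention: v is the
   constructor for the immutable type V2.t, about as close to a
   literal notation as OCaml allows. *)
module V2 = struct
  type t = { x : float; y : float }
  let[@inline] v x y = { x; y }
  let[@inline] ( - ) a b = v (a.x -. b.x) (a.y -. b.y)
end

(* Local open makes V2.v and V2.( - ) available: *)
let d = V2.(v 3.0 4.0 - v 1.0 1.0)
```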

Yes, the [@inline] annotation on the original function is kept, but the include ... overrides the definition of ( - ) with a function that is not annotated with [@inline]. This means that if the ppx is applied, it interferes strongly with the compiler's ability to optimise.
If it's possible to switch to the "Benchmarking manually" mode of landmarks, I would advise you to do so (it will still have an impact on performance, but smaller, and it should interfere less with the compiler's optimisations).
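To see the shadowing mechanism in isolation, here is a minimal sketch (not the actual ppx output): the binding introduced by include struct ... end shadows the annotated one, and callers below it resolve to the unannotated wrapper.

```ocaml
(* The first binding carries [@@inline]... *)
let f x = x + 1 [@@inline]

(* ...but this later binding shadows it for all subsequent uses,
   and the wrapper itself carries no inlining annotation. *)
include struct
  [@@@ocaml.warning "-32"]
  (* stands in for the instrumented wrapper generated by the ppx;
     the f on the right-hand side is the original, annotated f *)
  let f x = f x
end
```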


In addition to the effect on inlining, the addition of opaque side effects in each subexpression would also severely limit the compiler's ability to reorder or deduplicate computations. (I don't know how much of that is done by the optimizers today; there are few other valid simplifications of floating-point operations.)

I tried a version with the manual landmarks mode. For instance:

  let[@inline] ( - ) {x;y} {x=x';y=y'} =
    L.enter minus;
    let r = {x=x-.x'; y=y-.y'} in
    L.exit minus; r

This also seems problematic: it looks like r will get allocated even if ( - ) is inlined as a whole and the fields of the record are used immediately afterwards.
My current timings:

  • Completely without landmarks: tuple and record versions are equally fast.
  • Manual-mode annotated code without runtime instrumentation: the tuple version is faster than the record version by around 10%.
  • Manual-mode annotated code with runtime instrumentation: the tuple-based version is faster than the record-based version by a factor of two. This may be explained in part by allocation: e.g. ( + ) allocates ten times more in the record-based version (!). The speed differences are probably due to how record/tuple construction, inlining and landmarks interact.
  • Automatic-mode preprocessed code without runtime instrumentation: same speed.
  • Automatic-mode with runtime instrumentation: tuple is 15% faster.

Overall, and unexpectedly, records are never better. Unfortunately, landmarks does not appear to be the tool of choice for micro-optimizing small functions. It seems to make sense only around chunks of code that run somewhat longer, so that messing with inlining isn't as bad.

Yes, for micro-benchmarking I would use core_bench.

    let r = {x=x-.x'; y=y-.y'} in

Indeed, this will result in an allocation (of the record itself, not float boxing) in the naive execution of the OCaml program. Either you have faith in your optimizer (I would expect flambda to do fine, at least with the version that returns a record directly instead of let-binding it first), or you write slightly less abstract code where these pointwise operations are done by hand at the call site (~ manual inlining). I would probably choose the latter in a hot loop, to get robustness/predictability.
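A sketch of the two styles (the type and function names are mine, chosen for illustration):

```ocaml
type v2 = { x : float; y : float }

let[@inline] ( - ) a b = { x = a.x -. b.x; y = a.y -. b.y }

(* Abstract style: trusts the optimizer to inline ( - ) and remove
   the intermediate record d. *)
let dist2_abstract a b =
  let d = a - b in
  (d.x *. d.x) +. (d.y *. d.y)

(* "Manual inlining": the pointwise subtraction is written out at
   the call site, so no intermediate record is constructed even
   without optimization. *)
let dist2_manual a b =
  let dx = a.x -. b.x and dy = a.y -. b.y in
  (dx *. dx) +. (dy *. dy)
```

Both compute the same value; the second trades abstraction for predictable allocation behavior in a hot loop.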

Unfortunately Flambda cannot remove allocations of float records because they are compiled as arrays very early. I guess some work could be done to try to track immutable, known-length arrays, but that is unlikely to happen in the near future.

Another timing update:

  • Fully without landmarks, the full-program run-time differences reported earlier between the _d2', _d2'', _d2''' implementations seem to disappear (tuple version).
  • Fully without landmarks, the full-program run-time for the record version is marginally below that in the bullet point above. This is the first time that records appear to be a little faster, as expected.

So part of the surprising results I posted above now looks like user error: preprocessing with landmarks interfered with inlining (even though it's not activated at runtime). Sorry for the noise regarding that!

[edit]
For those using Landmarks, I found that a convenient way to use it in manual mode is to insert manual L.enter and L.exit points at a few, not-too-small functions, then define a do-nothing replacement module:

module Landmark_off = struct
  let register _ = ()
  let enter _ = ()
  let exit _ = ()
end

and switch profiling on/off like so:

module L =
  Landmark
  (*Landmark_off*)

As far as I can tell, this completely eliminates calls to Landmark after optimization, including potential inlining obstructions.


@vlaviron I don't know about the compilation scheme for float-only records. Are they always compiled to generic arrays, or are they compiled as FloatArray when they are known to be float-only? The former would explain why @n4323 is not seeing any benefit to records, if testing on a -no-flat-float-array switch!

Unless I'm misremembering, records statically known to have only float fields are always compiled as "flat" blocks, even in no-flat-float-array mode.

Cheers,
Nicolas

As far as I can tell they're compiled as FloatArray (Pmakearray (Pfloatarray, mut)).
It seems curious, but both versions could end up having roughly the same run-time because the extra allocations from tuples + floats are matched by the extra record allocations when deconstructing a vector that has just been built (but that's just a random guess).

So I actually checked the code, and it turns out that flambda actually does track the contents of immutable float arrays, so the record allocations can actually disappear. Sorry for wrongly claiming the opposite earlier. (But in my defense, from looking at the code it's a hack specific to float arrays, which was probably added only because float records ended up as arrays. For regular arrays nothing is tracked at all, even in the immutable case.)
So the record version should always be faster than the tuple version, but with aggressive inlining it might be the case that we end up with so few allocations in both cases that the difference becomes lost in the noise. I'll keep investigating…


If you send me a link, I'll have a look at the code.