Optimizing small vector operations

Another timing update:

  • With Landmarks removed entirely, the full-program run-time differences I reported earlier between the _d2', _d2'', and _d2''' implementations (tuple version) seem to disappear.
  • With Landmarks removed entirely, the full-program run-time of the record version is marginally below that of the tuple version in the bullet point above. This is the first time that records appear to be a little faster, as expected.

So part of the surprising results I posted above now looks like user error: preprocessing with Landmarks interfered with inlining, even though it was not activated at runtime. Sorry for the noise regarding that!

[edit]
For those using Landmarks, I found a convenient way to use it in manual mode: insert manual L.enter and L.exit points in a few, not-too-small functions, then define a do-nothing replacement module:

module Landmark_off = struct
  let register _ = ()
  let enter _ = ()
  let exit _ = ()
end

and switch profiling on/off like so:

module L =
  Landmark
  (*Landmark_off*)

As far as I can tell, this completely eliminates the calls to Landmarks after optimization, along with any inlining obstructions they might cause.
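For illustration, here is a rough sketch of what an instrumented function might look like with this setup. The function `sum_dots` and the landmark name are made up for the example, and the snippet is bound to the do-nothing module so that it compiles without the landmarks library. Switching to the real `Landmark` module should only require changing the `module L = ...` line, assuming you only call `register` with a plain string and no optional arguments.

```ocaml
(* Do-nothing stand-in with the same three entry points as used above. *)
module Landmark_off = struct
  let register _ = ()
  let enter _ = ()
  let exit _ = ()
end

(* Swap in [Landmark] here to turn profiling on (needs the landmarks library). *)
module L = Landmark_off

(* Hypothetical landmark for a hypothetical hot function. *)
let lm_sum_dots = L.register "sum_dots"

(* Sum of 2D dot products over an array of vector pairs (tuple version).
   The landmark brackets the whole loop, not each tiny dot product. *)
let sum_dots (pairs : ((float * float) * (float * float)) array) : float =
  L.enter lm_sum_dots;
  let acc = ref 0.0 in
  Array.iter
    (fun ((x1, y1), (x2, y2)) -> acc := !acc +. x1 *. x2 +. y1 *. y2)
    pairs;
  L.exit lm_sum_dots;
  !acc
```

Note that in this sketch `L.exit` is skipped if the body raises an exception, so with the real module the enter/exit pair is best placed around code that does not raise.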
