Another timing update:
- Entirely without Landmarks, the full-program run-time differences reported earlier between the _d2', _d2'', and _d2''' implementations seem to disappear (tuple version).
- Entirely without Landmarks, the full-program run-time for the record version is marginally below that of the tuple version in the bullet point above. This is the first time that records appear to be a little faster, as expected.
So part of the surprising results I posted above now looks like user error: preprocessing with Landmarks interfered with inlining, even though profiling was not activated at runtime. Sorry for the noise regarding that!
[edit]
For those using Landmarks, I found a convenient way to use it in manual mode: insert manual L.enter and L.exit calls in a few functions that are not too small, then define a do-nothing replacement module:
module Landmark_off = struct
  let register _ = ()
  let enter _ = ()
  let exit _ = ()
end
and switch profiling on/off like so:
module L =
  Landmark
  (*Landmark_off*)
As far as I can tell, this completely eliminates the calls to Landmarks after optimization, including any obstruction of inlining they might cause.
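To make the switch concrete, here is a minimal self-contained sketch of the pattern. It assumes nothing about the real library: Landmark_stub is a hypothetical stand-in that only counts calls, so the example runs without Landmarks installed, and sum_squares is an invented function used purely for illustration. In real code the alias would point at Landmark instead of the stub.

```ocaml
(* Hypothetical stand-in for the real Landmark module, so this sketch
   runs without the library. It only counts enter/exit calls. *)
module Landmark_stub = struct
  type landmark = string
  let enters = ref 0
  let exits = ref 0
  let register (name : string) : landmark = name
  let enter (_ : landmark) = incr enters
  let exit (_ : landmark) = incr exits
end

(* The do-nothing replacement from the post. *)
module Landmark_off = struct
  let register _ = ()
  let enter _ = ()
  let exit _ = ()
end

(* Switch profiling on/off by changing this one alias. *)
module L = Landmark_stub
(* module L = Landmark_off *)

(* Register the landmark once, at module initialisation. *)
let lm_sum = L.register "sum_squares"

(* Illustrative function: the landmark brackets the whole body. *)
let sum_squares n =
  L.enter lm_sum;
  let acc = ref 0 in
  for i = 1 to n do
    acc := !acc + (i * i)
  done;
  L.exit lm_sum;
  !acc

let result = sum_squares 10
```

Both modules accept the same three calls, so the rest of the program compiles unchanged whichever one L points to. One caveat with this manual style: if the bracketed body raises an exception, L.exit is skipped, so it may be worth restricting the landmarks to functions that do not raise.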