Local functions and performance

Very basic question: is there a performance cost to the idiom below, where a non-public recursive function implementing a loop / traversal of some kind takes some of its arguments from the enclosing context rather than explicitly? In other words, is the closure optimized away?

let my_public_function some_argument =
  let rec my_private_function accumulator =
    (* I use some_argument in there *)
  in
  my_private_function []

This is as opposed to:

let my_public_function some_argument =
  let rec my_private_function some_argument accumulator =
    ...
  in
  my_private_function some_argument []

or even:

let rec my_private_function some_argument accumulator =
    ...
let my_public_function some_argument =
  my_private_function some_argument []
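For concreteness, here is a compilable sketch of that last shape, with made-up names (collecting 0..n into a list via the accumulator):

```ocaml
(* Third shape: the helper lives outside the public function and takes
   everything it needs as explicit arguments, so no closure over the
   public function's parameters is involved. *)
let rec range_loop n i acc =
  if i > n then List.rev acc
  else range_loop n (i + 1) (i :: acc)

let range_to n = range_loop n 0 []

(* range_to 3 = [0; 1; 2; 3] *)
```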

Answering my own question: comparing the two functions below on godbolt.org (ocamlopt 4.12, with or without flambda), it does seem that the first one allocates a closure, since it contains a call to caml_call_gc that the second one doesn’t have. Is that right?

let count_to n =
    let rec inner i =
        if i <= n then begin
            print_endline (string_of_int i);
            inner (i + 1)
        end in
    inner 0

vs.

let count_to n =
    let rec inner i n =
        if i <= n then begin
            print_endline (string_of_int i);
            inner (i + 1) n
        end in
    inner 0 n
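Another way to check, without reading assembly, is to count minor-heap allocation with Gc.minor_words. A sketch with made-up names, using ignore instead of printing so the measurement isn’t dominated by I/O (note that string_of_int allocates in both variants, and in bytecode both versions allocate their closures anyway, so the numbers only differ slightly in native code):

```ocaml
(* Measure minor-heap words allocated while running a thunk. *)
let allocated f =
  let before = Gc.minor_words () in
  f ();
  Gc.minor_words () -. before

(* Variant 1: [inner] captures [n] in a closure. *)
let count_to_closure n =
  let rec inner i =
    if i <= n then begin ignore (string_of_int i); inner (i + 1) end
  in
  inner 0

(* Variant 2: [n] is threaded through explicitly. *)
let count_to_param n =
  let rec inner i n =
    if i <= n then begin ignore (string_of_int i); inner (i + 1) n end
  in
  inner 0 n

let () =
  Printf.printf "closure variant: %.0f words\nparam variant: %.0f words\n"
    (allocated (fun () -> count_to_closure 1000))
    (allocated (fun () -> count_to_param 1000))
```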

In bytecode, all versions should be more or less equivalent (closures are never optimised away).

In native code, the first version allocates a closure every time you call my_public_function, while the second will have its closure statically allocated, provided it doesn’t depend on other non-static variables.

The performance of my_private_function itself could be impacted too: the first version has one fewer parameter, so it should be slightly faster to call, but that is offset by an optimisation on the second version: the extra parameter passed to each function to store its closure can be removed for functions that do not need it, so in the end both will take two arguments. The first version also requires an extra load from the closure to get some_argument (the load occurs once with Flambda, at each use otherwise).

Finally, a non-trivial question is how all of this interacts with the compiler’s optimisations. The function my_public_function is rather small, so it could be considered for inlining, but without Flambda only the third version (with my_private_function out of the body of my_public_function) can be inlined. With Flambda, the second and third version are equivalent, and the first one could be actually easier to optimise, in particular if you call my_public_function with a constant or statically-allocated argument.
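If you want to nudge the compiler, the [@inline] attribute can be put on the small public wrapper of the third version. A sketch with made-up names (the recursive helper itself still won’t be inlined, and whether the hint actually fires depends on the backend):

```ocaml
(* Recursive helper with explicit arguments; not an inlining candidate. *)
let rec sum_loop n i acc =
  if i > n then acc else sum_loop n (i + 1) (acc + i)

(* The thin wrapper is the natural inlining candidate. *)
let[@inline] sum_to n = sum_loop n 0 0

(* sum_to 100 = 5050 *)
```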


Thanks a lot for the detailed answer, it’s very helpful.

I’m not sure I get your first sentence: if closures are never optimized away, wouldn’t the third version be more efficient even in bytecode, as there is no closure at all? Or is every function a closure in bytecode?

Yes. While in native code the compiler can generate direct calls to statically known functions, in bytecode every function call is generic (read the function pointer from the closure, then jump to it).


I’m surprised flambda doesn’t optimize the closure away. I don’t know if that is expected.

You can get it optimised away by passing the -unbox-closures flag. But it’s expected that it would not be optimised away by default.
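For reference, an invocation might look something like this (assuming a flambda-enabled switch and a file loop.ml containing the first variant; flag names are from the flambda section of the ocamlopt manual):

```shell
# -unbox-closures is only meaningful on a flambda-enabled ocamlopt;
# -S keeps the generated assembly so the allocation can be inspected.
ocamlopt -O3 -unbox-closures -S loop.ml
grep caml_call_gc loop.s || echo "no allocation point found in loop.s"
```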