Local functions and performance

Very basic question: is there a performance cost to the idiom below, where a non-public recursive function implementing a loop / traversal of some kind takes some of its arguments from the enclosing context rather than explicitly? In other words, is the closure optimized away?

let my_public_function some_argument =
  let rec my_private_function accumulator =
    (* I use some_argument in there *)
  in
  my_private_function []

This is as opposed to:

let my_public_function some_argument =
  let rec my_private_function some_argument accumulator =
    ...
  in
  my_private_function some_argument []

or even:

let rec my_private_function some_argument accumulator =
    ...
let my_public_function some_argument =
  my_private_function some_argument []
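For concreteness, here is a compilable sketch of that last shape, with made-up names (collecting 0..n into a list via the accumulator):

```ocaml
(* Third shape: the helper lives outside the public function and takes
   everything it needs as explicit arguments, so no closure over the
   public function's parameters is involved. *)
let rec range_loop n i acc =
  if i > n then List.rev acc
  else range_loop n (i + 1) (i :: acc)

let range_to n = range_loop n 0 []

(* range_to 3 = [0; 1; 2; 3] *)
```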

Answering my own question: comparing the two functions below on godbolt.org (ocamlopt 4.12, with or without flambda), it does seem that the first one allocates a closure, since it contains a call to caml_call_gc that the second one doesn’t have. Is that right?

let count_to n =
    let rec inner i =
        if i <= n then begin
            print_endline (string_of_int i);
            inner (i + 1)
        end in
    inner 0

vs.

let count_to n =
    let rec inner i n =
        if i <= n then begin
            print_endline (string_of_int i);
            inner (i + 1) n
        end in
    inner 0 n
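Another way to check, without reading assembly, is to count minor-heap allocation with Gc.minor_words. A sketch with made-up names, using ignore instead of printing so the measurement isn’t dominated by I/O (note that string_of_int allocates in both variants, and in bytecode both versions allocate their closures anyway, so the numbers only differ slightly in native code):

```ocaml
(* Measure minor-heap words allocated while running a thunk. *)
let allocated f =
  let before = Gc.minor_words () in
  f ();
  Gc.minor_words () -. before

(* Variant 1: [inner] captures [n] in a closure. *)
let count_to_closure n =
  let rec inner i =
    if i <= n then begin ignore (string_of_int i); inner (i + 1) end
  in
  inner 0

(* Variant 2: [n] is threaded through explicitly. *)
let count_to_param n =
  let rec inner i n =
    if i <= n then begin ignore (string_of_int i); inner (i + 1) n end
  in
  inner 0 n

let () =
  Printf.printf "closure variant: %.0f words\nparam variant: %.0f words\n"
    (allocated (fun () -> count_to_closure 1000))
    (allocated (fun () -> count_to_param 1000))
```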

In bytecode, all versions should be more or less equivalent (closures are never optimised away).

In native code, the first version allocates a closure every time you call my_public_function, while the second will have its closure statically allocated, provided it doesn’t depend on other non-static variables.

The performance of my_private_function itself could be impacted too: the first version has one fewer parameter, so it should be slightly faster to call, but that is offset by an optimisation on the second version: the extra parameter passed to each function to store its closure can be removed for functions that do not need it, so in the end both will take two arguments. The first version also requires an extra load from the closure to get some_argument (the load occurs once with Flambda, at each use otherwise).

Finally, a non-trivial question is how all of this interacts with the compiler’s optimisations. The function my_public_function is rather small, so it could be considered for inlining, but without Flambda only the third version (with my_private_function out of the body of my_public_function) can be inlined. With Flambda, the second and third version are equivalent, and the first one could be actually easier to optimise, in particular if you call my_public_function with a constant or statically-allocated argument.
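If you want to nudge the compiler, the [@inline] attribute can be put on the small public wrapper of the third version. A sketch with made-up names (the recursive helper itself still won’t be inlined, and whether the hint actually fires depends on the backend):

```ocaml
(* Recursive helper with explicit arguments; not an inlining candidate. *)
let rec sum_loop n i acc =
  if i > n then acc else sum_loop n (i + 1) (acc + i)

(* The thin wrapper is the natural inlining candidate. *)
let[@inline] sum_to n = sum_loop n 0 0

(* sum_to 100 = 5050 *)
```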


Thanks a lot for the detailed answer, it’s very helpful.

I’m not sure I get your first sentence: if closures are never optimized away, wouldn’t the third version be more efficient even in bytecode, as there is no closure at all? Or is every function a closure in bytecode?

Yes. While in native code the compiler can generate direct calls to statically known functions, in bytecode every function call is generic (read the function pointer from the closure, then jump to it).


I’m surprised flambda doesn’t optimize the closure away. I don’t know if that is expected.

You can get it optimised away by passing the -unbox-closures flag. But it’s expected that it would not be optimised away by default.
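For reference, an invocation might look something like this (assuming a flambda-enabled switch and a file loop.ml containing the first variant; flag names are from the flambda section of the ocamlopt manual):

```shell
# -unbox-closures is only meaningful on a flambda-enabled ocamlopt;
# -S keeps the generated assembly so the allocation can be inspected.
ocamlopt -O3 -unbox-closures -S loop.ml
grep caml_call_gc loop.s || echo "no allocation point found in loop.s"
```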