Tail recursive loops and closures

When you are writing inner loops as tail-recursive functions I’d like to know if there’s a performance advantage in passing the arguments you are accessing to the loop function itself.

For example write:

let find_char c s = 
  let rec loop s max k = 
   if k > max then None else
   if s.[k] = c then Some k else loop s max (k + 1)
  in 
  loop s (String.length s - 1) 0

instead of

let find_char c s = 
  let max = String.length s - 1 in 
  let rec loop k = 
   if k > max then None else
   if s.[k] = c then Some k else loop (k + 1)
  in 
  loop 0

It seems at some point I started having the first style which is more burden for readers and easier to shoot yourself in the foot by mixing your arguments in the recursive calls.

There was likely a reason – maybe for fear of float boxing when looking up the closure environment, but which happens anyways in this case. I can’t remember exactly and I’m cargo culting it at that point.

Measurements are noise, -dlambda is too high-level to say and I’m too lazy for deep dive into assembly – but a cursory look doesn’t seem to show much difference.

I’m sure someone who knows the ins and outs of the backend has more sophisticated answers to that question.

3 Likes

I don’t know much (anything) about compiler internals/inlining, but after reading the responses for a similar question Local functions and performance - #3 by vlaviron I had started using the first style in your post for writing my inner loop functions, and if the task at hand was otherwise something where I cared enough to write a microbenchmark for that could show a difference in allocations. :sweat_smile:

1 Like

In the standard compilation pipeline (I don’t know about Flambda) having free variables in a local function means: 1) an extra allocation (to construct the closure containing the value of the free variables), and 2) an extra indirection when accessing the free variables (as they have to go through the closure).

Cheers,
Nicolas

1 Like

With the current compilers, the second version is more or less equivalent (in terms of performance) to the following:

let find_char c s = 
  let max = String.length s - 1 in 
  let rec loop k env = 
   if k > env.max then None else
   if env.s.[k] = c then Some k else loop (k + 1) env
  in 
  loop 0 {s; max}

So the unboxed version is very slightly faster, avoiding the initial allocation and extra indirections.
But it comes with one potential inconvenient: more variables in scope means more pressure on registers, so when the number of environment entries gets close enough to the number of available registers the closure-based version will likely become faster.

My suggestion would be to stick to the regular closure-based workflow by default, and when you want a little extra performance (and the code is small enough) switch to the argument passing version, with a comment describing that you’ve done that.

Finally, some remarks on other backends:

  • I don’t know exactly what the performance of the bytecode and Jsoo backends look like. I suspect that for small functions it’s mostly the same, but for big environments the closure version is a bit more efficient.
  • The Flambda 2 backend should be able to transform your tail rec functions back into loops, and perform optimisations (including unboxing) along the way. So you will get the same result either way.
6 Likes