Stdlib.List loop unrolling

I can see the loop unrolling technique applied for List.map (ocaml/ at trunk · ocaml/ocaml · GitHub). However, I don’t see the same technique being applied for Stdlib.List.find (ocaml/ at trunk · ocaml/ocaml · GitHub). Can find not benefit from loop unrolling in the same way as map?

There is a tension between speed and code size (typically, smaller code positively affects performance), as well as clarity (unrolling harms readability and makes the code less maintainable). List.map was made tail-recursive recently, and, after benchmarking, its implementation was hand-unrolled so that the new version would perform no worse than the previous one in most or all cases.

Manual unrolling was justified by the central place List.map occupies in practically any OCaml program, but it would not be justified in most other cases.
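For illustration, a two-way hand-unrolled map might look like the following sketch (illustrative only, not the actual Stdlib code; `map_unrolled` is a made-up name):

```ocaml
(* Sketch only: a map hand-unrolled to process two list cells per
   recursive call, halving the per-iteration loop overhead. *)
let rec map_unrolled f = function
  | [] -> []
  | [x] -> [f x]
  | x :: y :: tl ->
      (* Bind the results in order so f is applied left to right. *)
      let fx = f x in
      let fy = f y in
      fx :: fy :: map_unrolled f tl
```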


Thanks. The justification makes sense. If, in some other use case, find were the predominant operation, do you think it would make sense to manually unroll find’s implementation, as was done for map?

I think it could make sense. However, List.map is used to create a new, fresh collection. On the other hand, List.find is used to return a single element, so maybe there are other directions to explore, e.g. using an auxiliary data structure where lookup for your particular kind of predicate is fast. This comes with its own problems, of course, like maintenance of the indexes, etc.

If List.find dominates the execution time, it might be worth considering a better data structure than list.
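For instance (a hypothetical sketch, assuming your elements can serve as hash-table keys; `index` and `mem` are made-up names), a Hashtbl turns the O(n) scan of List.find into an expected O(1) lookup:

```ocaml
(* Hypothetical sketch: index the elements once, then test membership
   in expected O(1) instead of scanning the list in O(n). *)
let index l =
  let tbl = Hashtbl.create 16 in
  List.iter (fun x -> Hashtbl.replace tbl x ()) l;
  tbl

let mem tbl x = Hashtbl.mem tbl x
```

This only pays off when lookups are frequent enough to amortize building and maintaining the index.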

And just to reiterate what @nojb said, List.map was not unrolled to make it faster. It was unrolled so that it would not become slower due to the tail-call-modulo-cons transformation.


If List.find dominates the execution time, it might be worth considering a better data structure than list.

Yes, I did consider an AVL-based data structure like Map, but didn’t like the insert performance (O(log n)), so I settled for a list, since adding an element at the head of a list is O(1). find is a secondary consideration to add in my use case. However, I was wondering whether re-implementing find with loop unrolling would make the find use case not too bad either, i.e. O(n/2) or O(n/3) rather than O(n).

Interesting. I am not too sure how loop unrolling helps with the tail-call-modulo-cons transformation. find is already tail-recursive; shouldn’t loop unrolling make it a tad faster than not unrolling it?
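For reference, List.find is essentially the following (paraphrasing the Stdlib source), and every branch is indeed a tail call:

```ocaml
(* List.find, paraphrased: the recursive call is in tail position. *)
let rec find p = function
  | [] -> raise Not_found
  | x :: l -> if p x then x else find p l
```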

O(n/2) and O(n/3) are still O(n), by definition of O.

It does not help the transformation itself. It just helps with performance. A modulo-cons tail call is slow (slower than a tail call or a standard call); by unrolling the loop, you do only half of them.

Sure. But you still do the same amount of (predicted) branches. So, the gain might be negligible if the test function itself is not inlined. (Moreover, unrolling might actually block inlining, since the function now needs to be inlined twice.)
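To make the trade-off concrete, an unrolled find might look like the following sketch (`find_unrolled` is a made-up name; as noted, the predicate call and branch per element remain, so the gain is limited to loop overhead):

```ocaml
(* Sketch of an unrolled find: two list cells per recursive call, but
   still one predicate call and one branch per element. *)
let rec find_unrolled p = function
  | [] -> raise Not_found
  | [x] -> if p x then x else raise Not_found
  | x :: y :: tl ->
      if p x then x
      else if p y then y
      else find_unrolled p tl
```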


O(n/2) and O(n/3) are still O(n), by definition of O.

I should have clarified: I meant the number of loop iterations, since we are discussing loop unrolling.

A modulo-cons tail call is slow (slower than a tail call or a standard call); by unrolling the loop, you do only half of them.

So halving the number of loop iterations does positively affect performance, right?

Since a modulo-cons tail call is slower than a regular recursive call, wouldn’t it make sense to write a plain tail-recursive version instead? What actually is the benefit of TMC? Is it mostly saving a list allocation per call compared to a tail-recursive version?

The benefit of TRMC is that your function is tail-recursive, so it won’t blow up the stack for large lists. Writing a tail-recursive map that retains the performance of the non-tail-recursive one is quite the challenge (the following thread has all the info). A very crude summary:

  • Usual map: fast calls (which can further be unrolled) but not tail-recursive; you may blow up the stack for long lists.
  • Simple tail-recursive version, reversing at the end: the cost of reversing hurts quite a bit in practice and makes this non-competitive for medium-sized lists.
  • Unsafe tail-recursive version using Obj.magic, casting a mutable list that is allocated and modified in place. This relies on assumptions about unsafe features, which may be invalidated if you change the backend or runtime (flambda, multicore, …).
  • Clever safe tail-recursive version (as explained in the linked discussion). Very clever, but it needs to be hand-tuned (to recover the performance of the non-tail-recursive version), and the code is quite complex, which becomes a maintenance problem, especially if you want several versions hand-tuned for several architectures.
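The “reversing at the end” variant from the list above is, for reference, the classic one-liner:

```ocaml
(* Tail-recursive map via reversal: two traversals instead of one. *)
let map_rev f l = List.rev (List.rev_map f l)
```

Both List.rev_map and List.rev are tail-recursive, so this is stack-safe for arbitrarily long lists; the price is the extra traversal.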

TRMC, to simplify, translates the simple natural map (and other similar functions) into the version that creates a mutable structure. But the translation is done by the compiler. There is still a maintenance burden (the TRMC transform itself within the compiler), but it is more general, so the cost is deemed worth it (since now you don’t have to maintain a gazillion unsafe tail-recursive functions doing Obj.magic all over the place for map, concat, etc.).


You missed the point. It is not the tail call that is costly. It is the “modulo cons” part.

In a nutshell, a map function over lists looks as follows:

let rec map f (h::t) = (f h) :: (map f t)

First, there is a recursive call to map then a memory block is initialized with the result of map. Obviously, this function is not tail recursive, since the block is (allocated and) initialized after the call.

The tail recursive version is as follows, assuming a variant of OCaml where you can mutate things arbitrarily:

let rec map_trmc f (h::t) (_::t') =
  let res = f h :: dummy in
  t' <- res;
  map_trmc f t res

Now the function is tail recursive. The compiler performs this transformation whenever you use the attribute [@tail_mod_cons]. But notice a few things. First, the new function now receives one more argument. But more importantly, while the assignment t' <- res might look like a cheap memory write, it is actually a costly call to the function caml_initialize.

To summarize, the purpose of unrolling is not to halve the number of recursive calls. It is to halve the number of calls to caml_initialize.
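Concretely, the unrolled version builds two cells per recursive call, so only one delayed write is needed per pair of elements (a sketch of the idea, not the actual Stdlib code; `map_tmc` is a made-up name):

```ocaml
(* Sketch: with [@tail_mod_cons], the compiler turns the recursive call
   into a destination-passing loop. Unrolling means two cells are built
   per iteration, halving the number of delayed writes. *)
let[@tail_mod_cons] rec map_tmc f = function
  | [] -> []
  | [x] -> [f x]
  | x :: y :: tl ->
      let fx = f x in
      let fy = f y in
      fx :: fy :: map_tmc f tl
```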

And in case it was not clear, neither the non-tail-recursive List.map nor List.find needs to call caml_initialize.


I’m not sure the cost of calling caml_initialize is so high: it’s a “noalloc” call and the code is pretty simple (an assignment and a well-predicted branch). It is probably noticeably less costly than the indirect call to f itself (even without taking the computation time of f into account, just the caller-side cost of organizing the call).

Let us try it:

let[@tail_mod_cons] rec map1 f = function
  | [] -> []
  | a::l ->
      let r = f a in
      r::map1 f l

let () =
  let l = List.init 1000 (fun i -> i) in
  for j = 0 to 100000 do
    ignore (map1 (fun v -> v + 1) l)
  done
With a minor heap of 1G words (just to avoid any garbage collection), perf reports that 20% of the time is spent in caml_initialize.

By the way, even perfectly predicted branches have to be considered with a grain of salt. Once the branch buffer is full (e.g., 48 entries on recent x86 processors), the processor will just stop. Calling caml_initialize incurs three speculative branches per loop iteration. By comparison, the call to the anonymous function incurs only two speculative branches. And map1 has two more. In other words, caml_initialize accounts for 40% of all the speculative branches of this code.