Stdlib.List loop unrolling

I can see the loop unrolling technique applied for List.map (ocaml/ at trunk · ocaml/ocaml · GitHub). However, I don’t see the same technique being applied for Stdlib.List.find (ocaml/ at trunk · ocaml/ocaml · GitHub). Can find not benefit from loop unrolling in the same way as map?

There is a tension between speed and code size (typically, smaller code positively affects performance), as well as clarity (unrolling harms readability and makes the code less maintainable). List.map was made tail-recursive recently, and, after benchmarking, its implementation was hand-unrolled so that the new version would perform no worse than the previous one in most or all cases.

Manual unrolling was justified by the central place List.map occupies in practically any OCaml program, but it would not be justified in most other cases.
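For illustration, a two-way hand-unrolled map might look like the following sketch (illustrative only, not the actual Stdlib code; `map_unrolled` is a made-up name):

```ocaml
(* Sketch only: a map hand-unrolled to process two list cells per
   recursive call, halving the per-iteration loop overhead. *)
let rec map_unrolled f = function
  | [] -> []
  | [x] -> [f x]
  | x :: y :: tl ->
      (* Bind the results in order so f is applied left to right. *)
      let fx = f x in
      let fy = f y in
      fx :: fy :: map_unrolled f tl
```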


Thanks. The justification makes sense. If, in some other use case, find were the predominant operation, do you think it would make sense to manually unroll find’s implementation, as was done for map?

I think it could make sense. However, List.map is used to create a new, fresh collection. On the other hand, List.find is used to return a single element, so maybe there are other directions to explore, e.g. using an auxiliary data structure where lookup for your particular kind of predicate is fast. This comes with its own problems, of course, like maintenance of the indexes, etc.

If List.find dominates the execution time, it might be worth considering a better data structure than list.
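For instance (a hypothetical sketch, assuming your elements can serve as hash-table keys; `index` and `mem` are made-up names), a Hashtbl turns the O(n) scan of List.find into an expected O(1) lookup:

```ocaml
(* Hypothetical sketch: index the elements once, then test membership
   in expected O(1) instead of scanning the list in O(n). *)
let index l =
  let tbl = Hashtbl.create 16 in
  List.iter (fun x -> Hashtbl.replace tbl x ()) l;
  tbl

let mem tbl x = Hashtbl.mem tbl x
```

This only pays off when lookups are frequent enough to amortize building and maintaining the index.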

And just to reiterate what @nojb said, List.map was not unrolled to make it faster. It was unrolled so that it would not become slower due to the tail-call-modulo-cons transformation.


If List.find dominates the execution time, it might be worth considering a better data structure than list.

Yes, I did consider an AVL-based data structure like Map, but didn’t like the insert performance (O(log n)), so I settled for a list, since adding an element at the head of a list is O(1). find is a secondary consideration to add in my use case. However, I was wondering whether re-implementing find with loop unrolling would make the find use case not too bad either, i.e. O(n/2) or O(n/3) rather than O(n).

Interesting. I am not too sure how loop unrolling helps with the tail-call-modulo-cons transformation. find is already tail-recursive; shouldn’t loop unrolling make it a tad faster than not unrolling it?
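For reference, List.find is essentially the following (paraphrasing the Stdlib source), and every branch is indeed a tail call:

```ocaml
(* List.find, paraphrased: the recursive call is in tail position. *)
let rec find p = function
  | [] -> raise Not_found
  | x :: l -> if p x then x else find p l
```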

O(n/2) and O(n/3) are still O(n), by definition of O.

It does not help the transformation itself. It just helps with performance. A modulo-cons tail call is slow (slower than a tail call or a standard call); by unrolling the loop, you do only half of them.

Sure. But you still do the same amount of (predicted) branches. So, the gain might be negligible if the test function itself is not inlined. (Moreover, unrolling might actually block inlining, since the function now needs to be inlined twice.)
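To make the trade-off concrete, an unrolled find might look like the following sketch (`find_unrolled` is a made-up name; as noted, the predicate call and branch per element remain, so the gain is limited to loop overhead):

```ocaml
(* Sketch of an unrolled find: two list cells per recursive call, but
   still one predicate call and one branch per element. *)
let rec find_unrolled p = function
  | [] -> raise Not_found
  | [x] -> if p x then x else raise Not_found
  | x :: y :: tl ->
      if p x then x
      else if p y then y
      else find_unrolled p tl
```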


O(n/2) and O(n/3) are still O(n), by definition of O.

I should have clarified: I meant the number of loop iterations, since we are discussing loop unrolling.

A modulo-cons tail call is slow (slower than a tail call or a standard call); by unrolling the loop, you do only half of them.

So halving the number of loop iterations does positively affect performance, right?

Since a modulo-cons tail call is slower than a regular recursive call, wouldn’t it make sense to write a plain tail-recursive version instead? What actually is the benefit of TMC? Is it mostly saving a list allocation per call compared to a tail-recursive version?

The benefit of TRMC is that your function is tail-recursive, so it won’t blow up the stack for large lists. Writing a tail-recursive map that retains the performance of the non-tail-recursive one is quite the challenge (the following thread has all the info). A very crude summary:

  • Usual map: fast calls (which can further be unrolled) but not tail-recursive; you may blow up the stack for long lists.
  • Simple tail-recursive version, reversing at the end: the cost of reversing hurts quite a bit in practice and makes this non-competitive for medium-sized lists.
  • Unsafe tail-recursive version using Obj.magic, casting a mutable list that is allocated and modified in place. This relies on assumptions about unsafe features, which may be invalidated if you change the backend or runtime (flambda, multicore, …).
  • Clever safe tail-recursive version (as explained in the linked discussion). Very clever, but it needs to be hand-tuned (to recover the performance of the non-tail-recursive version), and the code is quite complex, which becomes a maintenance problem, especially if you want several versions hand-tuned for several architectures.
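The “reversing at the end” variant from the list above is, for reference, the classic one-liner:

```ocaml
(* Tail-recursive map via reversal: two traversals instead of one. *)
let map_rev f l = List.rev (List.rev_map f l)
```

Both List.rev_map and List.rev are tail-recursive, so this is stack-safe for arbitrarily long lists; the price is the extra traversal.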

TRMC, to simplify, translates the simple natural map (and other similar functions) into the version that creates a mutable structure. But the translation is done by the compiler. There is still a maintenance burden (the TRMC transform itself within the compiler), but it is more general, so the cost is deemed worth it (since now you don’t have to maintain a gazillion unsafe tail-recursive functions doing Obj.magic all over the place for map, concat, etc.).


You missed the point. It is not the tail call that is costly. It is the “modulo cons” part.

In a nutshell, a map function over lists looks as follows:

let rec map f (h::t) = (f h) :: (map f t)

First, there is a recursive call to map then a memory block is initialized with the result of map. Obviously, this function is not tail recursive, since the block is (allocated and) initialized after the call.

The tail recursive version is as follows, assuming a variant of OCaml where you can mutate things arbitrarily:

let rec map_trmc f (h::t) (_::t') =
  let res = f h :: dummy in
  t' <- res;
  map_trmc f t res

Now the function is tail recursive. The compiler performs this transformation whenever you use the attribute [@tail_mod_cons]. But notice a few things. First, the new function now receives one more argument. But more importantly, while the assignment t' <- res might look like a cheap memory write, it is actually a costly call to the function caml_initialize.

To summarize, the purpose of unrolling is not to halve the number of recursive calls. It is to halve the number of calls to caml_initialize.
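Concretely, the unrolled version builds two cells per recursive call, so only one delayed write is needed per pair of elements (a sketch of the idea, not the actual Stdlib code; `map_tmc` is a made-up name):

```ocaml
(* Sketch: with [@tail_mod_cons], the compiler turns the recursive call
   into a destination-passing loop. Unrolling means two cells are built
   per iteration, halving the number of delayed writes. *)
let[@tail_mod_cons] rec map_tmc f = function
  | [] -> []
  | [x] -> [f x]
  | x :: y :: tl ->
      let fx = f x in
      let fy = f y in
      fx :: fy :: map_tmc f tl
```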

And in case it was not clear, neither the non-tail-recursive List.map nor List.find needs to call caml_initialize.


I’m not sure the cost of calling caml_initialize is so high: it’s a “noalloc” call and the code is pretty simple (an assignment and a well-predicted branch). It is probably noticeably less costly than the indirect call to f itself (even without taking the computation time of f into account, just the caller-side cost of organizing the call).

Let us try it:

let[@tail_mod_cons] rec map1 f = function
  | [] -> []
  | a::l ->
      let r = f a in
      r::map1 f l

let () =
  let l = List.init 1000 (fun i -> i) in
  for j = 0 to 100000 do
    ignore (map1 (fun v -> v + 1) l)
  done
With a minor heap of 1G words (just to avoid any garbage collection), perf reports that 20% of the time is spent in caml_initialize.

By the way, even perfectly predicted branches have to be considered with a grain of salt. Once the branch buffer is full (e.g., 48 entries on recent x86 processors), the processor will just stop. Calling caml_initialize incurs three speculative branches per loop iteration. By comparison, the call to the anonymous function incurs only two speculative branches. And map1 has two more. In other words, caml_initialize accounts for 40% of all the speculative branches of this code.