I offer for discussion a version of List.map that outperforms stdlib, Base, Batteries, and Containers, and is tail-recursive. It uses much less memory than all of these implementations except Batteries. Indeed, low memory usage is directly related to speed here.
However, the new map applies f to elements starting from the end of the list. So, the ordering of side effects is reversed from what one may be used to.
Benchmarking results for various list sizes:
As can be seen, for small lists, the new version is as fast as any other. It remains fast for large lists – up to 2× faster than popular implementations. This is a significant advantage: on my machine, for lists with 10⁶ elements, being 2× faster saves about 70ms per call.
The code is pretty ugly. It’s a slightly tricky unrolling of the naive tail-recursive map, which is probably over-tuned for x86_64 at this point.
However, I want to state clearly that the new map doesn't use any mutation, unlike Batteries.
The plan for the rest of this post is:
- Briefly describe the existing implementations graphed above.
- Explain how the new implementation works.
- Offer some other thoughts.
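First, for reference, the non-tail-recursive map of the standard library can be sketched as follows (a close paraphrase, not the verbatim stdlib source; the intermediate `let r = f x` forces applications to happen front-to-back, since OCaml does not specify evaluation order inside `r :: ...`):

```ocaml
(* Sketch of a non-tail-recursive map in the style of the stdlib. *)
let rec map f = function
  | [] -> []
  | x :: rest ->
    let r = f x in
    r :: map f rest
```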
This is the
map from the standard library, graphed above as “stdlib.” It doesn’t create any intermediate list, so it doesn’t use heap space beyond what is needed anyway for the input list and the result list. It is, however, not tail-recursive. As a result:
- It uses stack space proportional to the length of the input list. In fact, since
List.map stack frames are larger than list cells, it uses more stack space than the heap space an intermediate list would use.
- As the list size increases,
List.map eventually exhausts the available stack space. This usually terminates your program with the exception Stack_overflow.
For this second reason,
List.map is often considered “unsafe,” and users often look for a replacement.
The usual approach to making
List.map safe is to make it tail-recursive by constructing an intermediate list of results, then reversing it. You can see an example here.
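This approach can be sketched like so (a minimal version of the technique, not any particular library's code):

```ocaml
(* Naive tail-recursive map: accumulate results in reverse order, then
   reverse the intermediate list. Stack-safe, but the intermediate list
   must later be garbage-collected. *)
let map f l =
  let rec go acc = function
    | [] -> List.rev acc
    | x :: rest -> go (f x :: acc) rest
  in
  go [] l
```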
This approach is graphed as “naive tail-rec” above. It’s the line that is always at 100%. We are measuring every other implementation relative to it.
It can be up to twice as slow as List.map. The reason is memory management of the intermediate list. Allocating the intermediate list is actually faster than allocating List.map stack frames, but the intermediate list has to be garbage-collected later, and that is relatively expensive.
List.map, unrolled 5×
It’s also possible to make
List.map use less stack space by having it work on up to, say, five elements of the input list on each iteration. It then performs the overhead of allocating a stack frame only once every five elements of the input list, instead of once for every single element. So, it is faster, and uses less stack space. It is graphed as “unrolled 5×” above.
It’s still not stack-safe, but using less stack space per element postpones
Stack_overflow to slightly larger input list sizes.
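Such an unrolling can be sketched as follows (illustrative only; the measured "unrolled 5×" code may differ in details):

```ocaml
(* map unrolled 5x: still not tail-recursive, but it allocates one stack
   frame per five input elements rather than one per element. The lets
   keep applications of f in front-to-back order. *)
let rec map f = function
  | x1 :: x2 :: x3 :: x4 :: x5 :: rest ->
    let y1 = f x1 in
    let y2 = f x2 in
    let y3 = f x3 in
    let y4 = f x4 in
    let y5 = f x5 in
    y1 :: y2 :: y3 :: y4 :: y5 :: map f rest
  | x :: rest ->
    let y = f x in
    y :: map f rest
  | [] -> []
```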
Containers and Base
- They begin processing the input list using an unrolled map, also keeping track of how many elements have been processed.
- If the number of elements exceeds 4000 (Containers) or 5000 (Base), they process the rest of the list using a naive tail-recursive map.
So, both implementations are stack-safe.
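In outline, the hybrid looks like this (a simplified sketch, not the actual Containers or Base source; a cutoff of 1000 stands in for their 4000/5000, and the prefix is not unrolled, for brevity):

```ocaml
(* Hybrid map: non-tail-recursive for a bounded prefix of the list, then
   fall back to the naive tail-recursive map for the rest. Stack usage is
   bounded by the cutoff, so the whole function is stack-safe. *)
let map f l =
  let rec naive acc = function
    | [] -> List.rev acc
    | x :: rest -> naive (f x :: acc) rest
  in
  let rec prefix count = function
    | x :: rest when count > 0 ->
      let y = f x in
      y :: prefix (count - 1) rest
    | rest -> naive [] rest
  in
  prefix 1000 l
```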
Using unrolled map for a prefix has two effects:
- It makes these implementations faster than
List.map for small lists.
- It delays the point at which these implementations become slower than
List.map, for lists large enough that naive tail-recursive map is used for the suffix.
However, eventually, for large enough lists, these implementations spend an overwhelming proportion of their time doing naive tail-recursion.
The combination explains their graphed performance: as list size increases, they start out doing as well as unrolled
map, and end up doing as poorly as naive tail-recursion.
Batteries (ab?)uses mutation to build the result list front-to-back, as one would with a mutable list. With a mutable list, there is direct access to the list tail pointer and the ability to mutate it. Basic OCaml lists are not mutable, but Batteries internally casts them to a mutable list type that has the same memory representation as OCaml lists.
Batteries does a single pass down the input list. For each input list element, a new result list cell is created, and the preceding result list cell, allocated in the previous iteration, is mutated to point to the new one.
So, the Batteries implementation is tail-recursive, and does not allocate any intermediate list. This probably explains its excellent performance on large lists.
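The trick can be sketched roughly like this (my reconstruction of the idea, not Batteries' actual code; it relies on unspecified representation details, and is shown only for illustration):

```ocaml
(* A record with the same memory layout as a cons cell (a two-field block),
   but with a mutable tail. [inj] unsafely reinterprets it as a list. *)
type 'a mut_cell = { hd : 'a; mutable tl : 'a list }
external inj : 'a mut_cell -> 'a list = "%identity"

let map f = function
  | [] -> []
  | x :: rest ->
    (* Allocate the first result cell, then walk down the input,
       mutating tail pointers to build the result front-to-back. *)
    let first = { hd = f x; tl = [] } in
    let rec loop prev = function
      | [] -> ()
      | x :: rest ->
        let cell = { hd = f x; tl = [] } in
        prev.tl <- inj cell;
        loop cell rest
    in
    loop first rest;
    inj first
```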
I’m not sure why it’s slow for small lists – perhaps it would benefit from unrolling, or there are complications with executing the mutation.
I did not consider it further, because I wanted to avoid mutation and making assumptions about the memory representation of list cells.
The new implementation
Like Containers and Base, the new implementation processes the first 5000 elements of its input using an ordinary unrolled
map. The difference is that for the remainder of the list, it uses a much faster and more memory-efficient tail-recursive
map than naive tail-recursion.
This new tail-recursive
map works by, roughly speaking, splitting the input list into chunks, and allocating an intermediate list whose length is equal to the number of chunks, not the number of elements of the input list. By choosing a large enough chunk size, memory management of the intermediate list becomes insignificant compared to the other costs of performing the map.
The exact choice of chunk size is dictated by the number of available registers in the machine’s architecture. Basically, the chunks have to be large enough so that iterating over them uses all the available registers, but small enough that spilling on the stack (i.e. enlarging stack frames) is not required. This is classic unrolling, to get efficient usage of stack space.
Certainly, the chunk size has to be small enough that no recursion or arbitrary iteration needs to be done within one chunk. Doing any operation on one chunk should consist of straight-line code, with, at worst, a fixed number of hardcoded
if-conditions to check (these are patterns in the source code).
On my x86_64 machine, with OCaml 4.05, the happy chunk size turned out to be 12.
“Splitting the input list into chunks” is actually more tricky than it may sound, because naively creating a list of N/12 new lists is already more expensive than naive tail-recursion. So, the new implementation does something different.
To keep things presentable, I’ll use a chunk size of 4, and motivate the implementation with an example and a brief, instructive false start.
Suppose the input list is
A -> B -> C -> D -> E -> F -> G -> H -> I -> J -> K -> L -> nil
Then, the chunks, of size 4, are
1st: A, B, C, D
2nd: E, F, G, H
3rd: I, J, K, L
We could try to represent this list of chunks using a list of lists in the obvious way:
1st: A -> B -> C -> D -> nil
2nd: E -> F -> G -> H -> nil
3rd: I -> J -> K -> L -> nil
Looking at the third list, its structure is identical to the 4-element tail of the input list. That means we can simply use a pointer to the
I cell of the input list, and call that the third chunk.
However, consider the first chunk. In it,
D points to
nil, but in the input list,
D points to
E. That means, in this representation, the first chunk will have to be allocated and built up entirely from scratch. Indeed, every chunk except the last will have to be rebuilt from scratch. That’s too expensive to make this kind of chunking faster than naive tail-recursion.
What the new implementation does instead is notional chunking, inspired by the idea that the last chunk could be represented by a pointer directly into the input list.
Each chunk is represented as a pointer directly into the input list, to the list cell that begins the chunk. In other words, each chunk is represented as a list, a different tail of the input list. The convention is that only the first 4 elements are actually considered to be part of the chunk:
     (* chunk *)           (* dangling stuff to be ignored *)
1st: A -> B -> C -> D  ->  E -> F -> G -> H -> I -> J -> K -> L -> nil
2nd: E -> F -> G -> H  ->  I -> J -> K -> L -> nil
3rd: I -> J -> K -> L  ->  nil
To create this representation, we only need to walk down the input list, and, every 4 elements, add the current tail to the chunk list. So, if the input list has length N, we allocate N/4 cons cells for the chunk list. Ignoring corner cases,
let rec split (chunks_acc : 'a list list) (input_tail : 'a list) =
  match input_tail with
  | (_::_::_::_::tail) as chunk_start ->
    split (chunk_start::chunks_acc) tail
  | _ ->
    chunks_acc
in
let chunks = split [] input in
(* ... *)
Happily, the chunk list (in
chunks_acc) gets built in reverse order by this process, because the next thing we do is walk through the chunk list, match on the elements of each chunk using a pattern, and build up
map's final result, back-to-front. Again, ignoring corner cases:
let rec map_chunks (chunks : 'a list list) (final_acc : 'a list) =
  match chunks with
  | (x1::x2::x3::x4::_)::preceding_chunks ->
    let y4 = f x4 in
    let y3 = f x3 in
    let y2 = f x2 in
    let y1 = f x1 in
    map_chunks preceding_chunks (y1::y2::y3::y4::final_acc)
  | _ ->
    final_acc
in
map_chunks chunks []
So, the first chunk we see is
I -> J -> K -> L -> nil
which gets accumulated into the result list as
f I -> f J -> f K -> f L -> nil
Then E -> F -> G -> H -> ... gets accumulated on top, to make
f E -> f F -> f G -> f H -> f I -> f J -> f K -> f L -> nil
and so on.
The actual implementation measured has a chunk size of 12, and takes care of corner cases – specifically, the last chunk may have size anywhere from 1 to 12.
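Putting the two passes together with chunk size 4, a complete runnable sketch might look as follows (the name `chunked_map` is mine; the measured implementation uses chunk size 12, adds an unrolled prefix, and is tuned far more heavily):

```ocaml
(* Chunked tail-recursive map, chunk size 4. The short remainder (0-3
   elements) is mapped naively and becomes the back of the result; whole
   chunks are then consumed back-to-front. *)
let chunked_map f input =
  (* Pass 1: record a pointer into the input every 4 elements. *)
  let rec split chunks_acc = function
    | (_ :: _ :: _ :: _ :: tail) as chunk_start ->
      split (chunk_start :: chunks_acc) tail
    | remainder ->
      chunks_acc, remainder
  in
  let chunks, remainder = split [] input in
  let rec map_remainder acc = function
    | [] -> List.rev acc
    | x :: rest -> map_remainder (f x :: acc) rest
  in
  (* Pass 2: the chunk list is already in reverse order, so prepending
     each mapped chunk builds the final result front-to-back. *)
  let rec map_chunks chunks final_acc =
    match chunks with
    | (x1 :: x2 :: x3 :: x4 :: _) :: preceding_chunks ->
      let y4 = f x4 in
      let y3 = f x3 in
      let y2 = f x2 in
      let y1 = f x1 in
      map_chunks preceding_chunks (y1 :: y2 :: y3 :: y4 :: final_acc)
    | _ -> final_acc
  in
  map_chunks chunks (map_remainder [] remainder)
```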
By comparison, the naive tail-recursive version can be seen, in a rough way, as a degenerate chunked implementation with chunk size 1. It allocates an intermediate list of length N/1 = N.
For large lists (10⁶ elements), the new implementation allocates considerably less memory as overhead, about 7MB, compared to 24MB for naive tail-recursion, Containers, and Base. Only Batteries allocates less (none at all). The most memory-hungry implementation is, surprisingly, stdlib’s List.map, but its overhead (30MB) consists of stack space, which is easy to deallocate.
I haven’t looked at it deeply enough, but this kind of optimization should apply to at least a subset of the other non-tail-recursive functions in module List.