I offer for discussion a version of List.map that outperforms stdlib, Base, Batteries, and Containers, and is tail-recursive. It uses much less memory than all of these implementations, except Batteries. Indeed, low memory usage is directly related to speed here.
However, the new map applies f to elements starting from the end of the list, so the ordering of side effects is reversed from what one may be used to.
Benchmarking results for various list sizes:

[graph: run time of each implementation, relative to the naive tail-recursive map, across list sizes]
As can be seen, for small lists, the new version is as fast as any other. It remains fast for large lists, up to 2× faster than popular implementations. This is a significant advantage: on my machine, for lists with 10⁶ elements, being 2× faster saves about 70ms per map.
The code is pretty ugly. It’s a slightly tricky unrolling of the naive tail-recursive map, which is probably over-tuned for x86_64 at this point.
However, I want to clearly state that the new map doesn’t use any mutation, unlike Batteries.
The plan for the rest of this post is:
- Briefly describe the existing implementations graphed above.
- Explain how the new implementation works.
- Offer some other thoughts.
Existing implementations
List.map
This is the map from the standard library, graphed above as “stdlib.” It doesn’t create any intermediate list, so it doesn’t use heap space beyond what is needed anyway for the input list and the result list. It is, however, not tail-recursive. As a result:
- It uses stack space proportional to the length of the input list. In fact, since List.map stack frames are larger than list cells, it uses more stack space than an intermediate list would have used heap space.
- As the list size increases, List.map eventually exhausts the available stack space. This usually terminates your program with the Stack_overflow exception.
For this second reason, List.map is often considered “unsafe,” and users often look for a replacement.
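For reference, here is a sketch of the classic non-tail-recursive definition (essentially what the standard library does, modulo details):

let rec map f = function
  | [] -> []
  | x :: rest ->
    let y = f x in     (* apply f front to back *)
    y :: map f rest    (* this cons happens only after the recursive call
                          returns, so a stack frame is live per element *)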
“Naive” tail-recursive map
The usual approach to making List.map safe is to make it tail-recursive by constructing an intermediate list of results, then reversing it. You can see an example here.
This approach is graphed as “naive tail-rec” above. It’s the line that is always at 100%: we measure every other implementation relative to it.
It can be up to twice as slow as List.map.
The reason is memory management of the intermediate list. Allocation of the intermediate list is actually faster than allocation of List.map stack frames, but the intermediate list has to be garbage-collected later, and that is relatively expensive.
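As a sketch (names mine):

(* Build the results in reverse on an accumulator, then reverse the
   accumulator. Both loops are tail-recursive, but the intermediate
   reversed list must be allocated, and collected later. *)
let map f l =
  let rec go acc = function
    | [] -> List.rev acc
    | x :: rest -> go (f x :: acc) rest
  in
  go [] l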
List.map, unrolled 5×
It’s also possible to make List.map use less stack space by having it work on up to, say, five elements of the input list on each iteration. It then performs the overhead of allocating a stack frame only once for every five elements of the input list, instead of once for every single element. So, it is faster, and uses less stack space. It is graphed as “unrolled 5×” above.
It’s still not stack-safe, but using less stack space per element postpones Stack_overflow to slightly larger input list sizes.
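A sketch of the idea (details may differ from the implementation measured):

(* 5× unrolling: one stack frame now covers five elements, so stack use
   drops to roughly a fifth, but the function is still not tail-recursive. *)
let rec map f = function
  | [] -> []
  | [x1] -> [f x1]
  | [x1; x2] ->
    let y1 = f x1 in
    let y2 = f x2 in
    [y1; y2]
  | [x1; x2; x3] ->
    let y1 = f x1 in
    let y2 = f x2 in
    let y3 = f x3 in
    [y1; y2; y3]
  | [x1; x2; x3; x4] ->
    let y1 = f x1 in
    let y2 = f x2 in
    let y3 = f x3 in
    let y4 = f x4 in
    [y1; y2; y3; y4]
  | x1 :: x2 :: x3 :: x4 :: x5 :: rest ->
    let y1 = f x1 in
    let y2 = f x2 in
    let y3 = f x3 in
    let y4 = f x4 in
    let y5 = f x5 in
    y1 :: y2 :: y3 :: y4 :: y5 :: map f rest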
Containers and Base
Containers and Base combine the last two approaches, unrolled map and naive tail-recursive map, as follows:
- They begin processing the input list using an unrolled map, also keeping track of how many elements have been processed.
- If the number of elements exceeds 4000 (Containers) or 5000 (Base), they process the rest of the list using a naive tail-recursive map.
So, both implementations are stack-safe.
Using unrolled map for a prefix has two effects:
- It makes these implementations faster than List.map for small lists.
- It delays the point at which these implementations become slower than List.map, for lists large enough that the naive tail-recursive map is used for the suffix.
However, eventually, for large enough lists, these implementations spend an overwhelming proportion of their time doing naive tail-recursion.
The combination explains their graphed performance: as list size increases, they start out doing as well as unrolled map, and end up doing as poorly as naive tail-recursion.
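A sketch of the hybrid strategy (the counting here is simplified, and the prefix is not unrolled, for brevity):

(* Process a bounded prefix directly while counting elements; past the
   threshold, map the remaining suffix with the stack-safe naive approach. *)
let threshold = 5000

let map f l =
  let rec direct count = function
    | [] -> []
    | x :: rest ->
      if count >= threshold then
        List.rev (List.rev_map f (x :: rest))
      else
        let y = f x in
        y :: direct (count + 1) rest
  in
  direct 0 l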
Batteries
Batteries (ab?)uses mutation to build the result list front-to-back, as one would with a mutable list. With a mutable list, there is direct access to the list tail pointer and the ability to mutate it. Basic OCaml lists are not mutable, but Batteries internally casts them to a mutable list type that has the same memory representation as OCaml lists.
Batteries does a single pass down the input list. For each input list element, a new result list cell is created, and the preceding result list cell, allocated in the previous iteration, is mutated to point to the new one.
So, the Batteries implementation is tail-recursive, and does not allocate any intermediate list. This probably explains its excellent performance on large lists.
I’m not sure why it’s slow for small lists; perhaps it would benefit from unrolling, or there are complications with executing Obj.magic.
I did not consider it further, because I wanted to avoid mutation and making assumptions about the memory representation of list cells.
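To illustrate the technique with an honest (if less efficient) mutable type, here is a sketch; Batteries instead defines a record type with the same memory layout as 'a list and casts with Obj.magic, so it returns a plain list and allocates nothing extra:

(* Illustration only: a list type with a mutable tail. *)
type 'a mut_list =
  | Nil
  | Cons of { hd : 'a; mutable tl : 'a mut_list }

let map (f : 'a -> 'b) (l : 'a list) : 'b mut_list =
  match l with
  | [] -> Nil
  | x :: rest ->
    let head = Cons { hd = f x; tl = Nil } in
    (* Single tail-recursive pass: allocate each result cell, and mutate
       the previous cell to point at it. *)
    let rec loop prev remaining =
      match prev, remaining with
      | Cons cell, x :: rest ->
        let next = Cons { hd = f x; tl = Nil } in
        cell.tl <- next;
        loop next rest
      | _ -> ()
    in
    loop head rest;
    head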
The new implementation
Overview
Like Containers and Base, the new implementation processes the first 5000 elements of its input using an ordinary unrolled map. The difference is that, for the remainder of the list, it uses a tail-recursive map that is much faster and more memory-efficient than naive tail-recursion.
This new tail-recursive map works by, roughly speaking, splitting the input list into chunks, and allocating an intermediate list whose length equals the number of chunks, not the number of elements of the input list. By choosing a large enough chunk size, memory management of the intermediate list becomes insignificant compared to the other costs of doing map.
The exact choice of chunk size is dictated by the number of available registers in the machine’s architecture. Basically, the chunks have to be large enough that iterating over them uses all the available registers, but small enough that spilling on the stack (i.e. enlarging stack frames) is not required. This is classic unrolling, to get efficient usage of stack space.
Certainly, the chunk size has to be small enough that no recursion or arbitrary iteration needs to be done within one chunk. Doing any operation on one chunk should consist of straight-line code, with, at worst, a fixed number of hardcoded if-conditions to check (these are patterns in the source code).
On my x86_64 machine, with OCaml 4.05, the happy chunk size turned out to be 12.
“Splitting the input list into chunks” is actually trickier than it may sound, because naively creating a list of N/12 new lists is already more expensive than naive tail-recursion. So, the new implementation does something different.
Walkthrough
To keep things presentable, I’ll use a chunk size of 4, and motivate the implementation with an example and a brief, instructive false start.
Suppose the input list is
A -> B -> C -> D -> E -> F -> G -> H -> I -> J -> K -> L -> []
Then, the chunks, of size 4, are
1st: A, B, C, D
2nd: E, F, G, H
3rd: I, J, K, L
We could try to represent this list of chunks using a list of lists in the obvious way:
1st: A -> B -> C -> D -> []
2nd: E -> F -> G -> H -> []
3rd: I -> J -> K -> L -> []
Looking at the third list, its structure is identical to the 4-element tail of the input list. That means we can simply use a pointer to the I cell of the input list, and call that the third chunk.
However, consider the first chunk. In it, D points to nil, but in the input list, D points to E. That means, in this representation, the first chunk has to be allocated and built up entirely from scratch. Indeed, every chunk except the last has to be rebuilt from scratch. That’s too expensive for this kind of chunking to be faster than naive tail-recursion.
What the new implementation does instead is notional chunking, inspired by the idea that the last chunk could be represented by a pointer directly into the input list.
Each chunk is represented as a pointer directly into the input list, to the list cell that begins the chunk. In other words, each chunk is represented as a list, a different tail of the input list. The convention is that only the first 4 elements are actually considered to be part of the chunk:
(* chunk *) (* dangling stuff to be ignored *)
1st: A -> B -> C -> D -> E -> F -> G -> H -> I -> J -> K -> L -> []
2nd: E -> F -> G -> H -> I -> J -> K -> L -> []
3rd: I -> J -> K -> L -> []
To create this representation, we only need to walk down the input list, and, every 4 elements, add the current tail to the chunk list. So, if the input list has length N, we allocate N/4 cons cells for the chunk list. Ignoring corner cases,
(* Walk down the input; every 4 elements, record the current tail as a
   chunk. *)
let rec split (chunks_acc : 'a list list) (input_tail : 'a list) =
  match input_tail with
  | (_ :: _ :: _ :: _ :: tail) as chunk_start ->
    split (chunk_start :: chunks_acc) tail
  | [] ->
    chunks_acc
in
let chunks = split [] input in
(* ... *)
Happily, the chunk list (in chunks_acc) gets built in reverse order by this process, because the next thing we do is walk through the chunk list, match on the elements of each chunk using a pattern, and build up map’s final result back-to-front. Again, ignoring corner cases:
(* Consume the chunk list, last chunk of the input first, taking exactly
   4 elements from each chunk and prepending their images. *)
let rec map_chunks (chunks : 'a list list) (final_acc : 'b list) =
  match chunks with
  | (x1 :: x2 :: x3 :: x4 :: _) :: preceding_chunks ->
    let y4 = f x4 in
    let y3 = f x3 in
    let y2 = f x2 in
    let y1 = f x1 in
    map_chunks preceding_chunks (y1 :: y2 :: y3 :: y4 :: final_acc)
  | [] ->
    final_acc
in
map_chunks chunks []
So, the first chunk we see is
I -> J -> K -> L -> []
which gets accumulated into the result list as
f I -> f J -> f K -> f L -> []
then E -> F -> G -> H -> ... gets accumulated on top of it to make
f E -> f F -> f G -> f H -> f I -> f J -> f K -> f L -> []
and so on.
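Putting the pieces together, here is a runnable sketch of the whole scheme at chunk size 4, with the corner cases handled; the helper names and the handling of the short tail are mine:

let chunked_map (f : 'a -> 'b) (input : 'a list) : 'b list =
  (* Phase 1: record a pointer to every 4th cell. When fewer than 4
     elements remain, return them separately as the short tail. *)
  let rec split chunks_acc input_tail =
    match input_tail with
    | (_ :: _ :: _ :: _ :: tail) as chunk_start ->
      split (chunk_start :: chunks_acc) tail
    | short_tail ->
      (chunks_acc, short_tail)
  in
  let chunks, short_tail = split [] input in
  (* Map the short tail (0 to 3 elements) non-tail-recursively; the depth
     is bounded, and f is applied back-to-front, like everywhere else. *)
  let rec map_short = function
    | [] -> []
    | x :: rest ->
      let mapped_rest = map_short rest in
      f x :: mapped_rest
  in
  let seed = map_short short_tail in
  (* Phase 2: walk the chunk list (already reversed), taking exactly 4
     elements from each chunk and prepending their images. *)
  let rec map_chunks chunks final_acc =
    match chunks with
    | (x1 :: x2 :: x3 :: x4 :: _) :: preceding_chunks ->
      let y4 = f x4 in
      let y3 = f x3 in
      let y2 = f x2 in
      let y1 = f x1 in
      map_chunks preceding_chunks (y1 :: y2 :: y3 :: y4 :: final_acc)
    | _ ->
      final_acc
  in
  map_chunks chunks seed

For example, chunked_map (fun x -> x * 2) [1; 2; 3; 4; 5; 6] evaluates to [2; 4; 6; 8; 10; 12].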
The actual implementation measured has a chunk size of 12, and takes care of corner cases; specifically, the last chunk may have any size from 1 to 12.
Various notes
By comparison, the naive tail-recursive version can be seen, in a rough way, as a degenerate chunked implementation with chunk size 1. It allocates an intermediate list of length N/1 = N.
All the code used for the tests is available in a GitHub repo. Included is a directory with raw results.
For large lists (10⁶ elements), the new implementation allocates considerably less memory as overhead: about 7MB, compared to 24MB for naive tail-recursion, Containers, and Base. Only Batteries allocates less (zero). The most memory-hungry implementation is, surprisingly, stdlib’s List.map, but its overhead (30MB) consists of stack space, which is easy to deallocate.
I haven’t looked at it deeply enough, but this kind of optimization should apply to at least a subset of the other non-tail-recursive functions in module List.
I must acknowledge the work of @jsthomas, who started measuring the performance of maps in response to an issue in Lwt. That prompted me to experiment with List.map a bit more.