A new List.map that is both stack-safe and fast

I offer for discussion a version of List.map that outperforms stdlib, Base, Batteries, and Containers, and is tail-recursive. It uses much less memory than all these implementations, except Batteries. Indeed, low memory usage is directly related to speed here.

However, the new map applies f to elements starting from the end of the list. So, the ordering of side effects is reversed from what one may be used to.


Benchmarking results for various list sizes:


(benchmark graph: performance of each implementation, relative to the naive tail-recursive map, across list sizes)


As can be seen, for small lists, the new version is as fast as any other. It remains fast for large lists, up to 2× faster than popular implementations. This is a significant advantage: on my machine, for lists with 10^6 elements, being 2× faster saves about 70ms per map.


The code is pretty ugly. It's a slightly tricky unrolling of the naive tail-recursive map, which is probably over-tuned for x86_64 at this point.

However, I want to clearly state that the new map doesn't use any mutation, unlike Batteries.


The plan for the rest of this post is:

  • Briefly describe the existing implementations graphed above.
  • Explain how the new implementation works.
  • Offer some other thoughts.

Existing implementations

List.map

This is the map from the standard library, graphed above as "stdlib." It doesn't create any intermediate list, so it doesn't use heap space beyond what is needed anyway for the input list and the result list. It is, however, not tail-recursive. As a result:

  1. It uses stack space proportional to the length of the input list. In fact, since List.map stack frames are larger than list cells, it uses more stack space than the heap space an intermediate list would occupy.
  2. As the list size increases, List.map eventually exhausts the available stack space. This usually terminates your program with exception Stack_overflow.

For this second reason, List.map is often considered "unsafe," and users often look for a replacement.
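
For reference, the overall shape is roughly the following (a simplified sketch, not the exact stdlib source):

(* Simplified sketch of a non-tail-recursive map in the style of stdlib's
   List.map: the recursive call happens before the cons, so every element
   of the input costs one stack frame. *)
let rec map f = function
  | [] -> []
  | x :: rest ->
    let y = f x in
    y :: map f rest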


ā€œNaiveā€ tail-recursive map

The usual approach to making List.map safe is to make it tail-recursive by constructing an intermediate list of results, then reversing it. You can see an example here.
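
A minimal sketch of this approach (illustrative, not the exact code that was benchmarked):

(* Naive tail-recursive map: accumulate the mapped elements in reverse,
   then reverse the intermediate list to restore the original order. *)
let map f list =
  let rec loop acc = function
    | [] -> List.rev acc
    | x :: rest -> loop (f x :: acc) rest
  in
  loop [] list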

This approach is graphed as "naive tail-rec" above. It's the line that is always at 100%. We are measuring every other implementation relative to it.

It can be up to twice as slow as List.map.

The reason is memory management of the intermediate list. Allocation of the intermediate list is actually faster than allocation of List.map stack frames, but the intermediate list has to be garbage-collected later, and that is relatively expensive.


List.map, unrolled 5×

It's also possible to make List.map use less stack space by having it work on up to, say, five elements of the input list on each iteration. It then performs the overhead of allocating a stack frame only once every five elements of the input list, instead of once for every single element. So, it is faster, and uses less stack space. It is graphed as "unrolled 5×" above.
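
Roughly (a sketch under the same simplifications as above, not the benchmarked code):

(* Map unrolled 5x: the recursive call, and therefore the stack frame, is
   only needed once per five input elements. Tails shorter than five
   elements are handled one element at a time. *)
let rec map f = function
  | x1 :: x2 :: x3 :: x4 :: x5 :: rest ->
    let y1 = f x1 in
    let y2 = f x2 in
    let y3 = f x3 in
    let y4 = f x4 in
    let y5 = f x5 in
    y1 :: y2 :: y3 :: y4 :: y5 :: map f rest
  | x :: rest ->
    let y = f x in
    y :: map f rest
  | [] -> []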

It's still not stack-safe, but using less stack space per element postpones Stack_overflow to slightly larger input list sizes.


Containers and Base

Containers and Base combine the last two approaches, unrolled map and naive tail-recursive map, as follows:

  1. They begin processing the input list using an unrolled map, also keeping track of how many elements have been processed.
  2. If the number of elements exceeds 4000 (Containers) or 5000 (Base), they process the rest of the list using a naive tail-recursive map.

So, both implementations are stack-safe.
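
In outline, the combination looks something like this (a sketch with an illustrative iteration threshold, not the actual Containers or Base source):

(* Hybrid scheme: an unrolled, non-tail-recursive map counts how many
   chunks it has processed, and past a threshold it hands the remaining
   suffix to a naive tail-recursive map, so the stack stays bounded. *)
let map f list =
  (* stack-safe fallback: naive tail-recursive map *)
  let rec map_tailrec acc = function
    | [] -> List.rev acc
    | x :: rest -> map_tailrec (f x :: acc) rest
  in
  (* unrolled, non-tail-recursive map for the prefix; [count] tracks how
     many 5-element chunks have been processed so far *)
  let rec map_unrolled count list =
    if count > 1000 then
      (* threshold reached: finish the suffix with the fallback *)
      map_tailrec [] list
    else
      match list with
      | x1 :: x2 :: x3 :: x4 :: x5 :: rest ->
        let y1 = f x1 in
        let y2 = f x2 in
        let y3 = f x3 in
        let y4 = f x4 in
        let y5 = f x5 in
        y1 :: y2 :: y3 :: y4 :: y5 :: map_unrolled (count + 1) rest
      | x :: rest ->
        let y = f x in
        y :: map_unrolled count rest
      | [] -> []
  in
  map_unrolled 0 list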

Using unrolled map for a prefix has two effects:

  • It makes these implementations faster than List.map for small lists.
  • It delays the point at which these implementations become slower than List.map, for lists large enough that naive tail-recursive map is used for the suffix.

However, eventually, for large enough lists, these implementations spend an overwhelming proportion of their time doing naive tail-recursion.

The combination explains their graphed performance: as list size increases, they start out doing as well as unrolled map, and end up doing as poorly as naive tail-recursion.


Batteries

Batteries (ab?)uses mutation to build the result list front-to-back, as one would with a mutable list. With a mutable list, there is direct access to the list tail pointer and the ability to mutate it. Basic OCaml lists are not mutable, but Batteries internally casts them to a mutable list type that has the same memory representation as OCaml lists.

Batteries does a single pass down the input list. For each input list element, a new result list cell is created, and the preceding result list cell, allocated in the previous iteration, is mutated to point to the new one.

So, the Batteries implementation is tail-recursive, and does not allocate any intermediate list. This probably explains its excellent performance on large lists.
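
The idea can be sketched as follows (this is only an illustration of the technique, not the Batteries source; it relies on unspecified details of the value representation):

(* A record with the same runtime layout as a cons cell, but with a mutable
   tail field. Each new result cell is spliced onto the previous one by
   mutation, so the result is built front-to-back with no intermediate list.
   WARNING: the Obj.magic casts assume representation details the language
   does not guarantee; shown only to illustrate the idea. *)
type 'a mut_cons = { hd : 'a; mutable tl : 'a list }

let map : ('a -> 'b) -> 'a list -> 'b list = fun f list ->
  match list with
  | [] -> []
  | x :: rest ->
    let first = { hd = f x; tl = [] } in
    let rec loop prev = function
      | [] -> ()
      | x :: rest ->
        let cell = { hd = f x; tl = [] } in
        prev.tl <- Obj.magic cell;  (* mutate the previous cell's tail *)
        loop cell rest
    in
    loop first rest;
    Obj.magic first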

I'm not sure why it's slow for small lists; perhaps it would benefit from unrolling, or there are complications with executing Obj.magic.

I did not consider it further, because I wanted to avoid mutation and making assumptions about the memory representation of list cells.


The new implementation

Overview

Like Containers and Base, the new implementation processes the first 5000 elements of its input using an ordinary unrolled map. The difference is that for the remainder of the list, it uses a much faster and more memory-efficient tail-recursive map than naive tail-recursion.

This new tail-recursive map works by, roughly speaking, splitting the input list into chunks, and allocating an intermediate list whose length is equal to the number of chunks, not the number of elements of the input list. By choosing a large enough chunk size, memory management of the intermediate list becomes insignificant, compared to the other costs of doing map.

The exact choice of chunk size is dictated by the number of available registers in the machine's architecture. Basically, the chunks have to be large enough so that iterating over them uses all the available registers, but small enough that spilling on the stack (i.e. enlarging stack frames) is not required. This is classic unrolling, to get efficient usage of stack space.

Certainly, the chunk size has to be small enough that no recursion or arbitrary iteration needs to be done within one chunk. Doing any operation on one chunk should consist of straight-line code, with, at worst, a fixed number of hardcoded if-conditions to check (these are patterns in the source code).

On my x86_64 machine, with OCaml 4.05, the happy chunk size turned out to be 12.

"Splitting the input list into chunks" is actually more tricky than it may sound, because naively creating a list of N/12 new lists is already more expensive than naive tail-recursion. So, the new implementation does something different.


Walkthrough

To keep things presentable, I'll use a chunk size of 4, and motivate the implementation with an example and a brief, instructive false start.

Suppose the input list is

A -> B -> C -> D -> E -> F -> G -> H -> I -> J -> K -> L -> []

Then, the chunks, of size 4, are

1st: A, B, C, D
2nd: E, F, G, H
3rd: I, J, K, L

We could try to represent this list of chunks using a list of lists in the obvious way:

1st: A -> B -> C -> D -> []
2nd: E -> F -> G -> H -> []
3rd: I -> J -> K -> L -> []

Looking at the third list, its structure is identical to the 4-element tail of the input list. That means we can simply use a pointer to the I cell of the input list, and call that the third chunk.

However, consider the first chunk. In it, D points to nil, but in the input list, D points to E. That means, in this representation, the first chunk will have to be allocated and built up entirely from scratch. Indeed, every chunk except the last will have to be rebuilt from scratch. That's too expensive to make this kind of chunking faster than naive tail-recursion.

What the new implementation does instead is notional chunking, inspired by the idea that the last chunk could be represented by a pointer directly into the input list.

Each chunk is represented as a pointer directly into the input list, to the list cell that begins the chunk. In other words, each chunk is represented as a list, a different tail of the input list. The convention is that only the first 4 elements are actually considered to be part of the chunk:

     (*    chunk   *)      (* dangling stuff to be ignored *)
1st: A -> B -> C -> D  ->  E -> F -> G -> H -> I -> J -> K -> L -> []
2nd: E -> F -> G -> H  ->  I -> J -> K -> L -> []
3rd: I -> J -> K -> L  ->  []

To create this representation, we only need to walk down the input list, and, every 4 elements, add the current tail to the chunk list. So, if the input list has length N, we allocate N/4 cons cells for the chunk list. Ignoring corner cases,

let rec split (chunks_acc : 'a list list) (input_tail : 'a list) =
  match input_tail with
  | (_::_::_::_::tail) as chunk_start ->
    split (chunk_start::chunks_acc) tail
  | [] ->
    chunks_acc
in
let chunks = split [] input in
(* ... *)

Happily, the chunk list (in chunks_acc) gets built in reverse order by this process. That is exactly what we need, because the next thing we do is walk through the chunk list, match on the elements of each chunk using a pattern, and build up map's final result, back-to-front. Again, ignoring corner cases:

let rec map_chunks (chunks : 'a list list) (final_acc : 'a list) =
  match chunks with
  | (x1::x2::x3::x4::_)::preceding_chunks ->
    let y4 = f x4 in
    let y3 = f x3 in
    let y2 = f x2 in
    let y1 = f x1 in
    map_chunks preceding_chunks (y1::y2::y3::y4::final_acc)
  | [] ->
    final_acc
in
map_chunks chunks []

So, the first chunk we see is

I -> J -> K -> L -> []

which gets accumulated into result list as

f I -> f J -> f K -> f L -> []

then, E -> F -> G -> H -> ... gets accumulated on to make

f E -> f F -> f G -> f H -> f I -> f J -> f K -> f L -> []

and so on.


The actual implementation measured has a chunk size of 12, and takes care of corner cases; specifically, the last chunk may have size anywhere from 1 to 12.


Various notes

By comparison, the naive tail-recursive version can be seen, in a rough way, as a degenerate chunked implementation with chunk size 1. It allocates an intermediate list of length N/1 = N.

All the code used for the tests is available in a GitHub repo. Included is a directory with raw results.

For large lists (10^6 elements), the new implementation allocates considerably less memory as overhead, about 7MB, compared to 24MB for naive tail-recursion, Containers, and Base. Only Batteries allocates less (0). The most memory-hungry implementation is, surprisingly, stdlib's List.map, but the overhead (30MB) consists of stack space, which is easy to deallocate.

I haven't looked at it deeply enough, but this kind of optimization should apply to at least a subset of the other non-tail-recursive functions in module List.

I must acknowledge the work of @jsthomas, who started measuring the performance of maps in response to an issue in Lwt. That prompted me to experiment with List.map a bit more.

36 Likes

The same strategy would work on split & combine as well.
Now, I have a pull request that's rotting here:

and there's a micro bench with unrolled versions too here:

The problem doesn't seem to be implementing these things, but getting them accepted.

1 Like

Hi Anton,

Nice work!

Still quite some complexity for what should've been simple to optimize?

I'm wondering if the Batteries approach can be made as fast as yours (with the unrolling as you suggested), and be made safe (guaranteed) as well (perhaps with some cooperation from the compiler/stdlib team).
If that's the case then this (and other similar optimization problems!) get(s) much simpler...

2 Likes

@toolslive I wouldn't propose to merge this map into the stdlib, as it will likely break all code that depends on the order in which f is called.

However, yes, for other, non-higher-order functions, or where the evaluation order would be the same, there may be no harm in unrolling the tail-recursive portion in some crazy way, like done here. I haven't looked at them in detail, though, just done some mental sketches of whether it would work, and what the performance effect is likely to be. Some of them have pretty different allocation needs.

Even with that, I think an obstacle would be justifying architecture-specific unrolling in what is meant to be a portable stdlib. I optimized and measured this map for x86_64, but there are other platforms, bytecode, and all the JavaScript VMs. I don't know what the best choice for those platforms is.

It seems like these kinds of unrolled functions are much more likely to go into community libraries or applications, which have an easier time supporting a much more restricted set of targets.

@domsj Thanks :slight_smile: Yeah, the actual code is a bit complex. I'm not sure how easy this would be to generate by automatic optimization, though.

There is (was?) a proposal to have the compiler automatically generate what is effectively the Batteries approach. This is tail recursion modulo constructors by @let-def: https://github.com/ocaml/ocaml/pull/181. I'm not sure what the exact performance implications would be, or the status of the pull request. I actually wanted to install a compiler with the patch applied, and include TRMC-d List.map in the benchmarking, but eventually ran out of time and didn't do it.

It's a great job, though it is kind of disappointing that in the 21st century we need to do this manually, like in old Fortran times, rather than relying on a compiler to perform the unrolling for us. Or do we? This brings us to the first question: did you try benchmarking with flambda on? The interesting thing is that a compiler should be capable of producing code that is much more efficient than your implementation (and Batteries). Probably we are not there yet, but maybe it would be a good idea now to move this optimization into the compiler? Or, at least, to derive a flambda-friendly implementation.

Also, I have a strong opinion that in performance-critical code, one shouldn't use List.map at all, but rather List.rev_map, possibly doing the final reverse operation at the end of a program. Basically, the reversing can always be factored out of the tight loop. So it would be interesting how your implementation compares with just List.rev_map without any reversing.
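
For instance, something along these lines (an illustrative sketch; step is a placeholder for the per-element work):

(* Illustrative only: do all the mapping in the hot path with rev_map, and
   pay for a single List.rev at the end, only if the final order matters. *)
let process step inputs =
  inputs
  |> List.rev_map step
  |> List.rev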

Another issue is that in real-world applications, the computation kernel usually takes more time to compute than the overhead of the list processing/generation. I would believe that your fast implementation will prevent flambda from inlining the kernel and performing the unrolling. This should be especially true for floating point, since inlining and unrolling a kernel in the float domain may significantly reduce the number of allocations.

But all this criticism is not to say that the work is not great. Again, that's really cool. And at least we now have a base implementation for comparison with what a compiler can do :slight_smile:

7 Likes

@ivg Very good points; indeed, I need to acknowledge that you raised the inlining one with me earlier. I just don't have the right time/expertise combination to fully address it :confused: However, I saw no evidence of automatic unrolling by the compiler in List.map so far, which makes me suspect that suppressing benefits from automatic unrolling followed by inlining is not yet an issue with current OCaml.

The benchmarking was done with Flambda on and -O3. I did it with both Flambda on and off to compare, and there was no apparent difference in performance of any function. Actually, now that I think of it, maybe it would have been best to copy/paste all the code out, and make sure to recompile it with -O3. Future work...

The point about measuring List.rev_map is a good one that I missed, and is being suggested by multiple people. I think I need to take a break from working on this in the immediate hours or days, but it should be simple enough to add as a PR to the repo. Otherwise, I may do it later.

However, using rev_map in most cases also seems like manual optimization, just on a much higher "algorithmic" level, that we also shouldn't have to do manually at this point either :slight_smile:

1 Like

However, using rev_map in most cases also seems like manual optimization, just on a much higher "algorithmic" level, that we also shouldn't have to do manually at this point either :slight_smile:

I wouldn't agree with this. A programmer should be aware of data structures and basic algorithmic complexities. This is the job of a programmer. The compiler's job is to ensure that there are no unnecessary allocations, the number of stack frames is minimized, and all known equalities are applied. The compiler should never fix the programmer's code on the algorithmic level. And stupid code should still remain stupid. For example, the List.t data type is a singly linked list, with the well-known property of being a LIFO data structure. Thus forcing LIFO to behave as FIFO is a problem on the programmer side, and should be fixed there :smiley:

2 Likes

However, I saw no evidence of automatic unrolling by the compiler in List.map so far, which makes me suspect that suppressing benefits from automatic unrolling followed by inlining is not yet an issue with current OCaml.

Yeah, as we discussed previously, this is a matter of the computation kernel. If the kernel is small, then its inlining would be negligible with respect to the list processing overhead. And since you're actually measuring this overhead, it is right to have a small kernel. However, what I'm saying is that in real life, the kernels tend to be bigger, so the performance impact of the list processing is smaller. So they are like competing goals: your implementation reduces the list overhead, while preventing flambda from reducing the kernel overhead.

Well, I had to rewrite several functions in List as tail-recursive functions; some of them may be viewed here:

Of course it's better with tail rec. No idea why the standard library ones aren't. I think I had written even more of these; of course you can't use the stdlib versions in an efficient algorithm.

Here is one which is tail recursive, is among the fastest, is simple, evaluates in the same order as stdlib, but allocates more memory than some of the others:

let tupled_map f xs =
  let rec rise ys = function
   | [] -> ys
   | (y0, y1, y2, y3, y4, y5, y6, y7) :: bs ->
      rise (y0 :: y1 :: y2 :: y3 :: y4 :: y5 :: y6 :: y7 :: ys) bs in
  let rec dive bs = function
   | x0 :: x1 :: x2 :: x3 :: x4 :: x5 :: x6 :: x7 :: xs ->
      dive ((f x0, f x1, f x2, f x3, f x4, f x5, f x6, f x7) :: bs) xs
   | xs -> rise (List.map f xs) bs in
  dive [] xs

My own measurement ("tupled"):

With lists of size 1000
Estimated testing time 1.66667m (10 benchmarks x 10s). Change using -quota SECS.
                                                                                        
  Name                           Time/Run   mWd/Run   mjWd/Run   Prom/Run   Percentage  
 ------------------------------ ---------- --------- ---------- ---------- ------------ 
  warmup                           6.92us    6.01kw     51.62w     51.62w       96.69%  
  naive-tail-recursive             7.15us    6.01kw     51.64w     51.64w      100.00%  
  base                             3.62us    3.01kw     17.04w     17.04w       50.55%  
  batteries                        4.97us    3.01kw     34.46w     34.46w       69.42%  
  containers                       4.03us    3.01kw     17.06w     17.06w       56.27%  
  chunked-tail-recursive-12        4.24us    3.27kw     18.94w     18.94w       59.34%  
  unrolled-5-chunked-12-hybrid     3.60us    3.01kw     17.09w     17.09w       50.34%  
  tupled                           3.65us    4.51kw     30.28w     30.28w       50.98%  
  stdlib                           4.99us    3.01kw     17.14w     17.14w       69.69%  
  unrolled-5                       3.44us    3.01kw     17.03w     17.03w       48.04%  
8 Likes

Nice. Could you post the results for large lists as well, like 10^6 elements?

I just got this far before I hit stack overflow:

With lists of size 177827
Estimated testing time 1.66667m (10 benchmarks x 10s). Change using -quota SECS.
                                                                                           
  Name                           Time/Run      mWd/Run   mjWd/Run   Prom/Run   Percentage  
 ------------------------------ ---------- ------------ ---------- ---------- ------------ 
  warmup                          11.97ms   1_067.00kw   861.91kw   861.91kw       99.56%  
  naive-tail-recursive            11.97ms   1_067.00kw   858.83kw   858.83kw       99.59%  
  base                            11.81ms   1_051.97kw   846.11kw   846.11kw       98.24%  
  batteries                        7.39ms     533.50kw   533.48kw   533.48kw       61.45%  
  containers                      12.02ms   1_055.00kw   855.50kw   855.50kw      100.00%  
  chunked-tail-recursive-12        6.34ms     578.11kw   439.60kw   439.60kw       52.72%  
  unrolled-5-chunked-12-hybrid     6.34ms     576.85kw   435.26kw   435.26kw       52.72%  
  tupled                           7.92ms     800.35kw   619.37kw   619.37kw       65.93%  
  stdlib                           7.12ms     533.50kw   385.42kw   385.42kw       59.21%  
  unrolled-5                       5.93ms     533.55kw   385.09kw   385.09kw       49.35%  

I'll check what I'm doing wrong with the invocation. And here is the 10^6 result:

With lists of size 1000000
Estimated testing time 1.33333m (8 benchmarks x 10s). Change using -quota SECS.
                                                                                        
  Name                           Time/Run   mWd/Run   mjWd/Run   Prom/Run   Percentage  
 ------------------------------ ---------- --------- ---------- ---------- ------------ 
  warmup                          84.39ms    6.00Mw     5.81Mw     5.81Mw       99.49%  
  naive-tail-recursive            84.82ms    6.00Mw     5.81Mw     5.81Mw      100.00%  
  base                            83.19ms    5.99Mw     5.79Mw     5.79Mw       98.07%  
  batteries                       43.03ms    3.00Mw     3.00Mw     3.00Mw       50.73%  
  containers                      83.97ms    5.99Mw     5.79Mw     5.79Mw       98.99%  
  chunked-tail-recursive-12       45.07ms    3.25Mw     3.08Mw     3.08Mw       53.13%  
  unrolled-5-chunked-12-hybrid    45.25ms    3.25Mw     3.09Mw     3.09Mw       53.34%  
  tupled                          54.77ms    4.50Mw     4.31Mw     4.31Mw       64.57%  
2 Likes

This tupled_map is incredibly nice. Did you try it with the happy chunk size 12 (instead of 8)? (As @antron found out, this chunk size is optimal on x86.)

1 Like

Well, the "happy" chunk size might be different for this function. In particular, it allocates two memory blocks per chunk during dive and not one, though it might not matter, and there may be other differences due to how it's compiled. I'd be curious to know what it is, however :slight_smile:

EDIT: Also, sorry, on my machine List.map was working up to 10^6. I should probably change the upper bound so others don't have to remove it manually.

ulimit -s 1073741824 and no more stackoverflows :slight_smile:

1 Like

Seems "tupled" didn't like 12 at all, as it timed 4.24us for 1000, and was also worse up to 6523 where I hit CTRL-C.

2 Likes

@antron, if I'm following your description correctly, you are evaluating the f function in reverse, from the last to the first element. For me this is a deal-breaker: List.map must perform its effects left-to-right; even if that is not specified by the implementation, user code does rely on this. I think you may be able to change your code to apply f as you build the chunk, rather than afterwards.

3 Likes

@gasche, that's right. I don't think my code can be easily changed to apply f when building the chunk, because, indeed, the trick is that chunks don't have to be built at all. My code reuses the existing input list structure to represent the chunks.

@paurkedal has suggested an alternative, in which he does build chunks, but using a very inexpensive representation, and he does apply f while building them. This gives the typical front-to-back order (although, @paurkedal, I just remembered that you would have to rewrite dive to use explicit lets, in order to ensure the order).

@paurkedal's alternative is a bit slower on large lists (20% or so slower than my non-building variant), but it still has quite impressive performance relative to the other existing implementations. And I thought of some likely optimizations that could be done to it to close or narrow this gap, but I don't have time to explore them immediately.

4 Likes

The compiler should never fix the programmer's code on the algorithmic level. And stupid code should still remain stupid.

Quite frankly, I'm not quite sure about this: perhaps you should really define what stupid code is? If a programmer does two List.rev in sequence on the same list, shouldn't the compiler detect that it's a no-op (perhaps by having some algebraic notion of operations on lists), and optimize them away?

Great work! Is there any way this sort of optimization could be generalized, applied in other situations, and in the long run integrated as an optimization step in the OCaml compiler?

1 Like