A new List.map that is both stack-safe and fast

ivg · September 22, 2017, 2:35pm

If a programmer does two List.rev in sequence on the same list, shouldn’t the compiler detect that it’s a no-op

Definitely it shouldn’t. Once compilers will become that clever, there will be no need in programmers any more.

Putting it more seriously, it all depends on a paricular model and language. A very high-level declarative (logical) language may operate on this level, and treat an even number of reverses as a nop. Though OCaml falls in a completely different niche, it is a rather low-level language that should do what a programmer asked him to do, without too much cleverness, letting the programmer to choose which algorithm or data structure is more efficient. It was always a feature of OCaml - that it is has predictive performance (no longer that true with the flambda).

antron · September 22, 2017, 2:56pm

i’d like to defer to experts on this. My guess is this kind of optimization is not so easy to generate automatically, as it involves deliberately introducing “from nowhere” a different representation for an intermediate list. I would be happy to be wrong here, though

It might strongly depend on what the source form is that is being optimized. Is it stdlib’s simple map automatically being made tail-recursive? Is it a naive tail-recursive map? Is it a version manually prepared for optimization, so the compiler only has to choose the unrolling factor most suited for its target platform?

Actually, given OCaml is a strict and impure language, we can probably reason that no compiler should ever optimize a naive map with an explicit let into my version, because that changes the ordering of calls to f. @paurkedal’s version doesn’t have this issue, so emitting it may be a valid optimization goal

What I mean by explicit let above is this. From the stdlib, slightly edited:

let rec map f = function
    [] -> []
  | a::l ->
    let r = f a in   (* Guarantees front-to-back ordering. *)
    r :: map f l

If it had been written in an even more naive, textbook style

let rec map f = function
    [] -> []
  | a::l -> (f a) :: map f l

Then I guess optimizations could change the ordering, though it still might be surprising if they did so, and might make code that makes too many assumptions break upon changing optimization settings.

ivg · September 22, 2017, 3:26pm

I’m not an expert, but I know one, his name is GCC, he is a rather old guy, but is well-respected. So let’s ask him, here is a rough translation of OCaml’s List.rev_map into C:

char **map(char *xs, char **(f (char *))) {
    char **ys = 0;

    while (xs) {
        char **y = cons();
        y[0] = f(xs[0]);
        y[1] = (char *)y;
        ys = y;
        xs = (char *)y[1];
    }
    return ys;
}

(sorry I was forced to put this bogus types, otherwise GCC won’t speak to me).

If we will compile it as it is with default optimizations, we will roughly get what OCaml is giving to us. But if we will be more aggressive (by using -O3 -funroll-all-loops), then here is what we got:

map:
.LFB0:
	pushq	%r12
	testq	%rdi, %rdi
	pushq	%rbp
	movq	%rdi, %rbp
	pushq	%rbx
	movq	%rsi, %rbx
	je	.L2
.L3:
	xorl	%eax, %eax
	call	cons
	movsbq	0(%rbp), %rdi
	movslq	%eax, %r12
	call	*%rbx
	movq	%r12, 8(%r12)
	movq	%rax, (%r12)
	xorl	%eax, %eax
	call	cons
	movsbq	(%r12), %rdi
	movslq	%eax, %rbp
	call	*%rbx
	movq	%rbp, 8(%rbp)
	movq	%rax, 0(%rbp)
	xorl	%eax, %eax
	call	cons
	movsbq	0(%rbp), %rdi
	movslq	%eax, %r12
	call	*%rbx
	movq	%r12, 8(%r12)
	movq	%rax, (%r12)
	xorl	%eax, %eax
	call	cons
	movsbq	(%r12), %rdi
	movslq	%eax, %rbp
	call	*%rbx
	movq	%rbp, 8(%rbp)
	movq	%rax, 0(%rbp)
	xorl	%eax, %eax
	call	cons
	movsbq	0(%rbp), %rdi
	movslq	%eax, %r12
	call	*%rbx
	movq	%r12, 8(%r12)
	movq	%rax, (%r12)
	xorl	%eax, %eax
	call	cons
	movsbq	(%r12), %rdi
	movslq	%eax, %rbp
	call	*%rbx
	movq	%rbp, 8(%rbp)
	movq	%rax, 0(%rbp)
	xorl	%eax, %eax
	call	cons
	movsbq	0(%rbp), %rdi
	movslq	%eax, %r12
	call	*%rbx
	movq	%r12, 8(%r12)
	movq	%rax, (%r12)
	xorl	%eax, %eax
	call	cons
	movsbq	(%r12), %rdi
	movslq	%eax, %rbp
	call	*%rbx
	movq	%rbp, 8(%rbp)
	movq	%rax, 0(%rbp)
	jmp	.L3

.L2:
	popq	%rbx
	popq	%rbp
	xorl	%eax, %eax
	popq	%r12
	ret

As you may see, cons is now unrolled 8 times, and we basically have the tupled implementation, though a little bit more efficient (no tuples are created). I don’t know whether we have such an optimization in flambda already, but if not, it would be very easy to implement. Possibly a good idea for the upcomming OCaml Hacking Session.

antron · September 22, 2017, 3:53pm

But List.rev_map is the “natural” map on lists, that requires no intermediate data structure. It’s a really simple loop. I’d expect any good compiler to be able to unroll it.

On the other hand, all implementations of regular map I’m aware of require either mutation or an intermediate data structure (in case of List.map, that intermediate data structure is an implicit stack of stack frames on the function call stack). Optimizing map should then involve converting between different such intermediate data structures (or choosing to introduce mutability, as with TRMC). That seems much more difficult than just unrolling a loop.

Though I have no real experience here, I’d actually expect the naive List.map to have the greatest potential for optimization, just because the implicit stack is natively “understood” by the compiler, being part of the execution model.

ivg · September 22, 2017, 4:14pm

Sure, I wasn’t trying to say that a compiler should be able to optimize a non-tail recursive code into the tail-recursive. The problem is that so far OCaml didn’t provide the loop unrolling. And yes, the loop unrolling may be even applied to the List.map, here is the C version of the map function:

char **map(char *xs, char **(f (char *))) {
    if (xs) {
        char **y = cons();
        y[0] = f(xs[0]);
        char ** ys = map(xs[1], f);
        y[1] = ys;
        return y;
    }
    return 0;
}

and here is the gcc output, with cons inlined 8 times:

map:
.LFB0:
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	movq	%rdi, %r12
	pushq	%rbp
	pushq	%rbx
	subq	$56, %rsp
	testq	%rdi, %rdi
	je	.L11
	xorl	%eax, %eax
	movq	%rsi, %rbx
	call	cons
	movsbq	(%r12), %rdi
	movslq	%eax, %rbp
	call	*%rbx
	movq	%rax, 0(%rbp)
	movsbq	1(%r12), %r13
	testq	%r13, %r13
	je	.L12
	xorl	%eax, %eax
	call	cons
	movsbq	0(%r13), %rdi
	movslq	%eax, %r12
	call	*%rbx
	movq	%rax, (%r12)
	movsbq	1(%r13), %r14
	testq	%r14, %r14
	je	.L13
	xorl	%eax, %eax
	call	cons
	movsbq	(%r14), %rdi
	movslq	%eax, %r13
	call	*%rbx
	movq	%rax, 0(%r13)
	movsbq	1(%r14), %r15
	testq	%r15, %r15
	je	.L14
	xorl	%eax, %eax
	call	cons
	movsbq	(%r15), %rdi
	movslq	%eax, %r14
	call	*%rbx
	movq	%rax, (%r14)
	movsbq	1(%r15), %r15
	testq	%r15, %r15
	je	.L15
	xorl	%eax, %eax
	call	cons
	movsbq	(%r15), %rdi
	cltq
	movq	%rax, 8(%rsp)
	call	*%rbx
	movq	8(%rsp), %rsi
	movq	%rax, (%rsi)
	movsbq	1(%r15), %r15
	testq	%r15, %r15
	je	.L16
	xorl	%eax, %eax
	call	cons
	movsbq	(%r15), %rdi
	cltq
	movq	%rax, 16(%rsp)
	call	*%rbx
	movq	16(%rsp), %rcx
	movq	%rax, (%rcx)
	movsbq	1(%r15), %rdi
	testq	%rdi, %rdi
	movq	%rdi, 24(%rsp)
	je	.L17
	xorl	%eax, %eax
	call	cons
	movq	24(%rsp), %r9
	cltq
	movq	%rax, %r15
	movq	%rax, 32(%rsp)
	movsbq	(%r9), %rdi
	call	*%rbx
	movq	24(%rsp), %r11
	movq	%rax, (%r15)
	movsbq	1(%r11), %rax
	testq	%rax, %rax
	movq	%rax, 24(%rsp)
	je	.L18
	xorl	%eax, %eax
	call	cons
	movq	24(%rsp), %rsi
	cltq
	movq	%rax, %r15
	movq	%rax, 40(%rsp)
	movsbq	(%rsi), %rdi
	call	*%rbx
	movq	24(%rsp), %rcx
	movq	%rax, (%r15)
	movsbq	1(%rcx), %rdi
	testq	%rdi, %rdi
	movq	%rdi, 24(%rsp)
	je	.L19
	xorl	%eax, %eax
	call	cons
	movq	24(%rsp), %r9
	cltq
	movq	%rax, %r15
	movsbq	(%r9), %rdi
	call	*%rbx
	movq	24(%rsp), %r11
	movq	%rax, (%r15)
	movq	%rbx, %rsi
	movsbq	1(%r11), %rdi
	call	map
	movq	%r15, %r8
	movq	%rax, 8(%r15)
.L10:
	movq	40(%rsp), %rax
	movq	%r8, 8(%rax)
	movq	%rax, %rbx
.L9:
	movq	32(%rsp), %rsi
	movq	%rbx, 8(%rsi)
	movq	%rsi, %r8
.L8:
	movq	16(%rsp), %rdx
	movq	%r8, 8(%rdx)
.L7:
	movq	8(%rsp), %rcx
	movq	%rdx, 8(%rcx)
	movq	%rcx, %rax
.L6:
	movq	%rax, 8(%r14)
.L5:
	movq	%r14, 8(%r13)
.L4:
	movq	%r13, 8(%r12)
.L3:
	movq	%r12, 8(%rbp)
	movq	%rbp, %rax
.L2:
	addq	$56, %rsp
	popq	%rbx
	popq	%rbp
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	ret

DidC · September 25, 2017, 8:47am

Good point! I totally agree with you that it’s important to be able to predict the performances of the compiled code.
Although the ocaml compiler already does certain optimizations of that sort, albeit simpler (basically, for integers and floats, product by one and addition with zero, or a value twice negated are optimized away). I guess it’s just noise though wrt a non trivial algorithm.

How about this: if the compiler had a switch to enable more advanced optimization (based on the algebraic properties of lists, or any other higher level), wouldn’t that be an interesting feature? Or perhaps it belongs to a separate library?

Furthermore, shouldn’t such optimizations be performed when opportunities are uncovered after inlining?

DidC · September 25, 2017, 9:40am

Perhaps it would be interesting to define a second function in the List module, which would perform the map starting from the end (I believe Haskell may have something similar for (applicative) functors) ? Thus Antron’s function could be used, without impacting the existing users, and it would more clearly denote the expected behaviour of both functions?

backtracking · September 26, 2017, 9:30am

Note that this idea of “notional chunking”, as you say, is not new: it is used for instance in List.sort to split the list in two halves to perform a mergesort. Instead of building two new lists, two pointers are used (together with numbers of significant elements in those lists). See function “chop” and its use in function “stable_sort”.

antron · September 26, 2017, 9:54am

Certainly the “notional chunking” itself, alone, is not new. It occurs any time you build a “view” into an existing sequence data structure without copying data, i.e. all the time in practical programming. Many people have written exactly that kind of merge sort, as well as used views in countless other situations.

bluddy · September 27, 2017, 3:39pm

Yeah probably a result of running out of registers.

I think the practical impact of this discussion should be Containers and Base switching implementations to the tupled version. It’s the same algorithm but with less allocation. Ping @c-cube, @Yaron_Minsky

Yaron_Minsky · September 28, 2017, 5:14am

A PR with an updated implementation wild be most welcome!

paurkedal · September 28, 2017, 4:13pm

Updated graph including “tupled” and tweaking the colour of “containers”:
map

spop · February 23, 2018, 4:39pm

When trying the program (with OCaml v4.06.1+flambda & core_bench v0.10.0) I get:

...
Uncaught exception:
  Stack overflow

Raised by primitive operation at file "src/benchmark.ml", line 71, characters 16-32

  (this line being: let gc1 = Gc.quick_stat () in )

…

Pb with OCaml or core_bench or…?

paurkedal · February 24, 2018, 8:35am

What is your stack limit (ulimit -s)? I increased it to 131072 kB due to stack overflows when running this. If that doesn’t help which of the tests (as defined in tester/tester.ml) is failing?

spop · February 24, 2018, 3:18pm

Thanks for ulimit. I forgot to check it (8192 by default). My bad !

So, with ulimit -s 131072 here is the result
(with the improved version of Containers )
map_s131072

Topic		Replies	Views
Fastest implementation of stack-safe List.map, featuring Containers, Batteries, and Base Community	4	1395	June 22, 2017
Stack overflow during evaluation (looping recursion?) Learning	16	1789	April 17, 2023
Stdlib.List loop unrolling Learning	11	973	January 28, 2023
Array.map vs List.map performance Learning performance	6	1292	May 14, 2024
A tail-recursive List.flatten in Stdlib? Ecosystem stdlib , list	7	467	October 6, 2024

A new List.map that is both stack-safe and fast

Related topics