This function performs two allocations per element (each element gives rise to the two allocations x :: and xs ::), hence twice as many allocations as the (non-tail-recursive) version found in the standard library. It is better than the naïve tail-recursive function using an accumulator, which performs three allocations per element, but it will still lose against the non-tail-recursive version on most practical inputs.
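For concreteness, here is the definition being discussed, reconstructed from the description above (the two allocations per element are the x :: in the result and the xs :: in the recursive call):

```ocaml
(* Reconstructed from the description above: two allocations per element,
   one for the [x :: ...] cell in the result and one for the [xs :: ys]
   cell passed to the recursive call. *)
let[@tail_mod_cons] rec flatten = function
  | [] -> []
  | [] :: xs -> flatten xs
  | (x :: xs) :: ys -> x :: flatten (xs :: ys)
```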
That definition can be transformed to avoid the extra allocation by specializing the call flatten (xs :: ys) via a function flatten_cons that takes xs and ys separately, like this:
```ocaml
let[@tail_mod_cons] rec flatten = function
  | [] -> []
  | [] :: xs -> flatten xs
  | (x :: xs) :: ys -> x :: flatten_cons xs ys

and[@tail_mod_cons] flatten_cons xs ys =
  match xs with
  | [] -> flatten ys
  | x :: xs -> x :: flatten_cons xs ys
```
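As a quick sanity check that the specialized version behaves like the original (it needs OCaml 4.14 or later, which introduced [@tail_mod_cons]):

```ocaml
(* Simple correctness check; empty inner lists are skipped as expected. *)
let () = assert (flatten [[1; 2]; []; [3]; [4; 5]] = [1; 2; 3; 4; 5])
```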
The paper Tail Modulo Cons (by @let-def, @Elarnon and @gasche) gives a similar implementation, but obtained by inlining append in the non-tail-recursive version, as does the manual.
Note that, if you do not also annotate flatten with [@tail_mod_cons], then the flatten ys call in flatten_cons (the [] case) goes to a function with no TMC-specialized version, so the optimization cannot trigger. Both functions need to be annotated.
The rule we have followed so far in the stdlib is to convert functions to TMC on demand (rather than a big change in which we review all TMC-able functions), and to run a micro-benchmark to check that the TMC definition is as fast as the non-tail-recursive one, including on small lists. In our experience this generally requires some manual unrolling of the TMC version (one or two steps of unrolling is often enough). See for example TRMC implementation of @ by yallop · Pull Request #11859 · ocaml/ocaml · GitHub, which made @ tail-recursive.
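To make "manual unrolling" concrete, here is a hedged sketch of what one unrolling step can look like for a TMC append; the shape is illustrative only and not necessarily what the stdlib PR does:

```ocaml
(* Illustrative only: the recursive loop handles two elements per call and
   one-element lists never enter the destination-passing-style loop, which
   helps on very small inputs. The actual stdlib code may differ. *)
let[@tail_mod_cons] rec append l1 l2 =
  match l1 with
  | [] -> l2
  | [x1] -> x1 :: l2
  | x1 :: x2 :: tl -> x1 :: x2 :: append tl l2
```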
Anyone should feel free to propose a TMC version of List.flatten, as long as they follow the same policy: write a micro-benchmark to check the impact on small inputs.
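For instance, a minimal micro-benchmark sketch along those lines (my own throwaway harness, assuming the flatten/flatten_cons definitions above are in scope; a real evaluation would use a proper benchmarking library and vary the list shapes):

```ocaml
(* Throwaway harness: compares stdlib List.concat against the TMC flatten
   above on many small inner lists, the case where TMC versions tend to
   lose without unrolling. *)
let time label f =
  let t0 = Sys.time () in
  f ();
  Printf.printf "%-20s %.3fs\n" label (Sys.time () -. t0)

let () =
  let small = List.init 1_000 (fun i -> [i; i + 1; i + 2]) in
  time "List.concat" (fun () ->
    for _ = 1 to 100_000 do ignore (List.concat small) done);
  time "TMC flatten" (fun () ->
    for _ = 1 to 100_000 do ignore (flatten small) done)
```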
Having done some speed testing, the impact seems to range from 8-20% slower with TMC using ocamlc. Results with ocamlopt were worse, showing roughly 40-50% or more longer execution times.
I cheerfully withdraw my suggestion in the face of pragmatism.