Inlining functors with recursive module arguments (using flambda)

After finding _caml_apply2 high in the perf profile where I don’t expect while tuning some code, I have not been able to figure out how to convince ocamlopt/flambda to inline and resolve functions as I would like.

It seems that when a functor is passed a recursive module as argument, even if the functor gets inlined, calls to the functions in the argument module are still not resolved, and get called using e.g. _caml_apply2. Is there any way around that (well, other than rewriting the arguments to be non-recursive modules)?

As a concrete example, consider:

❯ cat -n rec_mod.ml
     1	module Key : sig
     2	  type t
     3
     4	  val compare : t -> t -> int
     5	  val of_string : string -> t
     6	end = struct
     7	  include String
     8
     9	  let of_string s = s
    10	end
    11
    12	module rec Key_rec : sig
    13	  type t
    14
    15	  val compare : t -> t -> int
    16	  val of_string : string -> t
    17	end = struct
    18	  include String
    19
    20	  let of_string s = s
    21	end
    22
    23	module KeyMap = Map.Make (Key)
    24	module KeyMap_rec = Map.Make (Key_rec)
    25
    26	let key_test () = Key.of_string "one"
    27	let key_test_rec () = Key_rec.of_string "one"
    28
    29	let test () =
    30	  let open KeyMap in
    31	  let k = Key.of_string "one" in
    32	  find k (add k 1 empty)
    33
    34	let test_rec () =
    35	  let open KeyMap_rec in
    36	  let k = Key_rec.of_string "one" in
    37	  find k (add k 1 empty)

After compiling with:

❯ ocamlopt -g -O3 -S rec_mod.ml

we can see the difference between KeyMap and KeyMap_rec in the generated assembly code. For the non-recursive functor argument Key, _caml_string_compare is called directly:

_camlRec_mod__test_126:
        .loc    5       29      9
        .cfi_startproc
        subq    $8, %rsp
        .cfi_adjust_cfa_offset 8
L668:
        movq    _camlRec_mod__const_immstring_156@GOTPCREL(%rip), %rsi
        movq    %rsi, %rdi
        .loc    3       202     28
        call    _caml_string_compare
        cmpq    $1, %rax
        jne     L667
        movl    $3, %eax
        addq    $8, %rsp
        .cfi_adjust_cfa_offset -8
        ret
        .cfi_adjust_cfa_offset 8
        .align  2
L667:
        cmpq    $1, %rax
        jge     L666
        movl    $1, %ebx
        jmp     L665
        .align  2
L666:
        movl    $1, %ebx
L665:
        movq    _camlRec_mod__const_immstring_156@GOTPCREL(%rip), %rax
        .loc    1       140     15
        addq    $8, %rsp
        .cfi_adjust_cfa_offset -8
        jmp     _camlRec_mod__find_397
        .cfi_adjust_cfa_offset 8
        .cfi_adjust_cfa_offset -8
        .cfi_endproc

But for the recursive argument module Key_rec, the call is indirect using GOTPCREL and _caml_apply2:

_camlRec_mod__test_rec_149:
        .loc    5       34      13
        .cfi_startproc
        subq    $8, %rsp
        .cfi_adjust_cfa_offset 8
L675:
        movq    _camlRec_mod__Key_rec_223@GOTPCREL(%rip), %rax
        movq    (%rax), %rax
        .loc    5       36      10
        movq    8(%rax), %rbx
        movq    _camlRec_mod__const_immstring_156@GOTPCREL(%rip), %rax
        .loc    5       36      10
        movq    (%rbx), %rdi
        .loc    5       36      10
        call    *%rdi
L669:
        movq    %rax, (%rsp)
        movq    _camlRec_mod__Pmakeblock_arg_213@GOTPCREL(%rip), %rbx
        movq    (%rbx), %rdi
        movq    %rax, %rbx
        .loc    1       138     18
        call    _caml_apply2
L670:
        cmpq    $1, %rax
        jne     L674
        movl    $3, %eax
        addq    $8, %rsp
        .cfi_adjust_cfa_offset -8
        ret
        .cfi_adjust_cfa_offset 8
        .align  2
L674:
        cmpq    $1, %rax
        jge     L673
        movl    $1, %ebx
        jmp     L672
        .align  2
L673:
        movl    $1, %ebx
L672:
        movq    (%rsp), %rax
        .loc    1       140     15
        addq    $8, %rsp
        .cfi_adjust_cfa_offset -8
        jmp     _camlRec_mod__find_1037
        .cfi_adjust_cfa_offset 8
        .cfi_adjust_cfa_offset -8
        .cfi_endproc

(Just for context, I wouldn’t be asking this is the profiler wasn’t telling me to look hard at code that is analogous to this.)

Tagging some flambda folks who might be interested: @mshinwell, @lpw25, @vlaviron.

1 Like

Recursive modules are weird. They’re completely absent from the compiler’s intermediate representations after type-checking, having been compiled to code manipulating values with Obj (the functions in question are in CamlinternalMod, if you’re curious). The later parts of the compiler (including flambda) will only see the recursive module as a value created by a C call (caml_obj_block, bound to Obj.new_block) so there is no way to know what any of its fields resolve to from the compiler’s side.

2 Likes

Thanks Vincent, that explanation is very helpful! It does explain what I see. Unfortunately it seems that the solution is to find a way to avoid the recursive modules for super-hot code, but I guess that’s the nature of the beast.

1 Like

It can be hard to avoid recursive modules, so if you end up keeping them here are a few tips that might help:

  • The chapter in the manual about recursive modules mentions “safe” and “unsafe” modules. Only the safe ones are actually problematic (which is kind of counter-intuitive), as the unsafe ones end up being evaluated as regular modules. So if some modules in your recursive definition are more critical than others, and you can control which ones are safe, you may want to explicitly make the critical modules unsafe (by adding a dummy value of non-functional type to their signature) so that they will be compiled as regular modules.
  • The constraint for safe modules in dependency cycles only applies to implementations, not signatures, so you can have recursive definitions that actually compile like regular modules as long as the recursion is only on types (but you have to make them unsafe, otherwise the compiler will compile them as safe even if it doesn’t have to).

As an example, if you add a field val empty : t to the Key_rec module in your example, it becomes unsafe but since it doesn’t depend on itself it will compile just fine.

2 Likes

Thank you very much for this explanation and the tip to arrange for the hottest modules in a cycle to be “unsafe”! This is much easier to do than to eliminate recursive modules. I’ve modified some code to change which modules are safe vs unsafe in the cycles, and the improvement can be significant. For a case where a recursive module is the key argument of a container that is the bottleneck, I have seen improvements ~2x faster in some pathological cases, and generally closer to a ~2-5% speedup.

While this is great, this whole situation does have me thinking about ways to define data structures like Stdlib.Map in a way that they can appear recursively in their own keys without needing recursive modules (or other mechanisms that lead to indirect calls to e.g. compare or hash functions).