Why isn't `caml_alloc*` inlined to allocation sites?

StrongerXi · May 2, 2021, 5:01am

I’ve been trying to study the runtime system of OCaml by digging through the codebase, so please bear with my limited understanding.

I came across this question because I saw gc_regs and its use in roots_nat.c. It seemed a bit redundant: each “OCaml frame chunk” has only a few gc_regs roots, yet we check for them when scanning every frame (distinguishing stack roots from register roots by the lowest bit of each frame_descr::live_ofs entry). So I looked into this further and think I understand why we need it, but on second thought it still feels somewhat unnecessary.

For instance, instead of

...
call caml_alloc1
...

Why not

...
subq    $16, %r15
cmpq    %r15, Caml_state(young_limit)
jbe      cont
... # push live regs onto stack, in contrast with aggressive spilling in `caml_call_gc`
mov caml_garbage_collection, %rax
call caml_c_call
movq    Caml_state(young_ptr), %r15
cont:
...

This way, live regs are always saved onto the stack, and we can analyze where they are since we are doing a normal function call to caml_garbage_collection, whereas the current approach seems to:

1. assume no reg is touched when first calling caml_alloc*;
2. aggressively save all regs when we have to call caml_garbage_collection;
3. record live regs, instead of live stack slots, at the call to caml_alloc*, and use that information to trace the live roots out of all the aggressively spilled regs.

To me, inlining seems to be faster, and allows us to get rid of gc_regs and the bit hack on live_ofs.

One drawback I see with this inlining is code bloat, but is that actually the reason why it’s not implemented?
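To make the inlined sequence concrete, here is a minimal C model of the bump-pointer fast path: move young_ptr down, compare against young_limit, and only take the slow path when the minor heap is exhausted. This is a sketch under assumptions, not the runtime's code: the `minor_heap` struct, `simulated_minor_gc`, and `alloc_small` are hypothetical names, the "GC" simply empties the heap, and the check is done before the subtraction (the assembly above subtracts first) to keep the pointer arithmetic well defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-domain allocation state. Only the
   names young_ptr and young_limit come from the thread. */
typedef struct {
    uintptr_t *young_limit;  /* allocation must not go below this address */
    uintptr_t *young_ptr;    /* start of the last allocated block; grows downward */
    uintptr_t *young_end;    /* one past the highest word of the minor heap */
    int gc_count;            /* number of simulated minor collections */
} minor_heap;

static void minor_heap_init(minor_heap *h, uintptr_t *buf, size_t nwords) {
    h->young_limit = buf;
    h->young_end = buf + nwords;
    h->young_ptr = h->young_end;
    h->gc_count = 0;
}

/* Slow path: a toy "collection" that just empties the minor heap. */
static void simulated_minor_gc(minor_heap *h) {
    h->young_ptr = h->young_end;
    h->gc_count++;
}

/* Allocate a block with `wosize` fields plus one header word and return
   a pointer to the first field. */
static uintptr_t *alloc_small(minor_heap *h, size_t wosize) {
    size_t whsize = wosize + 1;
    /* Assumption: the request always fits in an empty minor heap. */
    assert(whsize <= (size_t)(h->young_end - h->young_limit));
    if ((size_t)(h->young_ptr - h->young_limit) < whsize) {
        simulated_minor_gc(h);           /* rare slow path */
    }
    h->young_ptr -= whsize;              /* fast path: bump downward */
    h->young_ptr[0] = (uintptr_t)wosize; /* toy header, not OCaml's real layout */
    return h->young_ptr + 1;
}
```

With a four-word buffer, two one-field allocations fit and the third triggers the simulated collection. The point of the model is that the common case is a subtraction and a compare, so the cost difference between the two forms is mostly about what happens around the rare slow path.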

nojb · May 2, 2021, 6:02am

Which command-line options are you using? Typically the compiler should use the second form for speed, but it is possible to use the first form by passing -compact. In any case, the choice to use the first form (caml_allocN) is made to reduce code size.

The choice is made here

https://github.com/ocaml/ocaml/blob/b720b583a1d8eeeb295adbf597c6d2ddf994c1cd/asmcomp/cmmgen.ml#L1368-L1373


let fun_codegen_options =
  if !Clflags.optimize_for_speed then
    []
  else
    [ Reduce_code_size ]
in

and used here

https://github.com/ocaml/ocaml/blob/6275c0ccda38a4bcc64689bace8b3a2873294c74/asmcomp/amd64/emit.mlp#L589


    | Thirtytwo_signed | Thirtytwo_unsigned ->
        I.mov (arg32 i 0) (addressing addr DWORD i 1)
    | Single ->
        I.cvtsd2ss (arg i 0) xmm15;
        I.movss xmm15 (addressing addr REAL4 i 1)
    | Double | Double_u ->
        I.movsd (arg i 0) (addressing addr REAL8 i 1)
    end
| Lop(Ialloc { bytes = n; dbginfo }) ->
    assert (n <= (Config.max_young_wosize + 1) * Arch.size_addr);
    if env.f.fun_fast then begin
      I.sub (int n) r15;
      I.cmp (domain_field Domainstate.Domain_young_limit) r15;
      let lbl_call_gc = new_label() in
      let lbl_frame =
        record_frame_label env i.live (Dbg_alloc dbginfo)
      in
      I.jb (label lbl_call_gc);
      let lbl_after_alloc = new_label() in
      def_label lbl_after_alloc;
      I.lea (mem64 NONE 8 R15) (res i 0);

Cheers,
Nicolás

StrongerXi · May 2, 2021, 12:48pm

Thanks for your explanation! I should’ve looked more carefully at emit.mlp; I noticed the env.f.fun_fast branch but skipped it…

So it’s for reducing code size indeed; it makes sense. Still, I was hoping we could get rid of the gc_regs thing, but that’s going to put lots of pressure on code size and probably forfeit the Reduce_code_size option.

silene · May 2, 2021, 3:11pm

Since the garbage collector is about to scan hundreds of thousands of words, removing gc_regs will not bring much of a benefit. Nonetheless, I wonder why so many registers are stored into it. In particular, floating-point registers are pointless for the garbage collector as they cannot be roots. So, only callee-saved ones really need to be saved, as the callee did not get the chance to save them.

StrongerXi · May 2, 2021, 3:28pm

Since the garbage collector is about to scan hundreds of thousands of words

Just to check my understanding: I initially thought the following conditional, executed for every local root on the stack, was expensive (in caml_oldify_local_roots in roots_nat.c):

        /* Scan the roots in this frame */
        for (p = d->live_ofs, n = d->num_live; n > 0; n--, p++) {
          ofs = *p;
          if (ofs & 1) {
            root = regs + (ofs >> 1);
          } else {
            root = (value *)(sp + ofs);
          }
          Oldify (root);
        }

Are you suggesting that, since the majority of the scanning will happen in the minor heap, it’s acceptable to do a bit of extra checking for roots on the stack?
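For readers following along, the encoding that the loop decodes can be sketched in isolation: an odd live_ofs entry means "register root, index = ofs >> 1 into the saved-register array", and an even entry means "stack root at byte offset ofs from sp". The code below is a self-contained approximation, assuming a trimmed-down descriptor; `toy_frame_descr`, `encode_reg_root`, `encode_stack_root`, `locate_root`, and `collect_roots` are hypothetical names, and only the decoding rule is taken from the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef intptr_t value;

/* Hypothetical, trimmed-down frame descriptor: just the two fields the
   quoted loop reads. */
typedef struct {
    unsigned short num_live;
    unsigned short live_ofs[4];
} toy_frame_descr;

/* Register root: index shifted left, lowest bit set. */
static unsigned short encode_reg_root(unsigned reg_index) {
    return (unsigned short)((reg_index << 1) | 1);
}

/* Stack root: plain byte offset from sp. Word alignment keeps it even,
   which is what frees the lowest bit for the tag. */
static unsigned short encode_stack_root(unsigned byte_offset) {
    assert((byte_offset & 1) == 0);
    return (unsigned short)byte_offset;
}

/* Same test as in the quoted loop: one bit check per live slot. */
static value *locate_root(unsigned short ofs, char *sp, value *regs) {
    if (ofs & 1) {
        return regs + (ofs >> 1);
    }
    return (value *)(sp + ofs);
}

/* Collect the address of every root described by `d` into `out`,
   returning how many were found. */
static size_t collect_roots(const toy_frame_descr *d, char *sp,
                            value *regs, value **out) {
    size_t n = 0;
    for (unsigned i = 0; i < d->num_live; i++) {
        out[n++] = locate_root(d->live_ofs[i], sp, regs);
    }
    return n;
}
```

Calling `collect_roots` on a descriptor with one stack entry and one register entry yields one pointer into the stack array and one into the register array. The per-root overhead being asked about is the single `ofs & 1` branch in `locate_root`.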

floating-point registers are pointless for the garbage collector as they cannot be roots. So, only callee-saved ones really need to be saved

Indeed, I think only some of the fp regs are callee-saved. But again, this extra work is trivial compared to the GC work, so it probably doesn’t hurt to save them all…? Just speculating.

silene · May 2, 2021, 4:08pm

A large part of the scanning will happen on the minor heap and a small part will happen on the stack, of which a negligible part will happen for registers of the current function stored on the stack. Removing gc_regs only removes that very last part. Scanning the minor heap and the stack will still happen anyway.

Sure, but this one is not even motivated by reducing code size and all the related benefits. So, I am wondering if it is just an oversight, some laziness/simplification (not wanting to bother with precisely characterizing registers), or some other reason.
