Register allocation super-linear timing

jeffsco · January 15, 2018, 8:12pm

I’m working on a project that generates and then compiles large OCaml source files. As we work on larger problems, the compile times are starting to be a concern.

Right now I’m looking at one generated file with 39,563 functions of 5 or 10 lines each. Compiling with ocamlopt.opt -dtimings I see that register allocation takes around 3/4 of the time. I realize that there is a faster register allocator (-linscan) in OCaml 4.06.0 but I’d like to be compatible with older OCaml releases if possible.

Looking at the source for the traditional register allocator it seems to me that it works on one function definition at a time. This suggests that it should take 1/N of the time to do register allocation for 1/N of the functions. However if I actually do this experiment, I find that the allocator takes around 1/20 of the time to compile 1/4 of the functions. Or looking at it in the other direction, it takes 20 times as long to process 4 times as many functions.

In concrete terms, register allocation time for 39,563 functions is around 300 seconds, while register allocation time for each of the 4 groups of 9,890 functions is around 14 seconds.

When people say the allocator has quadratic complexity, I assume it’s quadratic in the number of instructions (or register slots) of a single function. But this measurement suggests it’s quadratic in the total number of instructions of the whole module (say). Since it seems to process only one function definition at a time, I don’t see how this would happen.

Possibly there is some accumulating state for the module as a whole that is slowing down the register allocator. There is global state for registers, for example, but it seems to be reinitialized at the beginning of each new function definition.

I’m hoping somebody familiar with the OCaml compiler internals can explain what I’m seeing.

gasche · January 18, 2018, 8:02pm

Yes, -linscan should solve your issue, and in trunk (so the future 4.07) there is a change that affects the compilation of modules with a lot of values even when -linscan is not set, and which should noticeably improve things in your case.

I would guess that the problem is not with each function individually, but with compiling the module-initialization code that puts all functions (all structure items) into a big record. This code can be large if you have a lot of functions at the toplevel, and that can hit register-allocation worst-cases.

I don’t remember the details on these module-compilation strategies; someone that knows better ( @chambart for example ) would know which code pattern to use that could help avoid the worst case. I am not sure whether enclosing your code into a include struct ... end block would help, for example. I think that using several of those (say, have only 1000 functions per include struct .. end) is likely to help.

Finally, the behavior is likely to depend on whether or not you use -flambda.

jeffsco · January 18, 2018, 8:27pm

This is really useful, thanks! All of our functions are at top level, yes. I experimented with various restructurings but didn’t find anything that helped. We will switch to 4.06.0 and -linscan for these gigantic generated files. Maybe in the future we can go back to the graph coloring allocator.

chambart · January 19, 2018, 2:14pm

I can confirm @gasche assumption. However, turning it into include struct ... end isn’t going to help.

In general a large function (in this case module initialization function) does not necessarily take long for the register allocator. There need to be a lot of interactions between the registers to trigger quadratic behaviors. To be more precise, it’s not the solving part that takes a long time, but building the problem itself: the register interaction graph. And to have a large interaction graph you need to have variables that are live across a large part of the file. By default this cannot really occur because the live range of variables is split into lots of small ranges by storing it into a global variable and reloading it for each use. But this can be broken by some later reorganizations, especially Common Subexpression Elimination. On 4.07 this transformation is going to be disabled for -flambda because this tends to occur more often in that case. But it is not going to be disabled without -flambda because we don’t have a real world example that demonstrate the problem.

Could you provide your example (on mantis preferably) to examine it ? There might be something possible for you that doesn’t necessarily involves using -linscan

jeffsco · January 19, 2018, 9:01pm

The code is gigantic and also mildly proprietary. But I’ll see if I can strip out dependencies and get something that shows our problem.

Thanks very much to both of you for your help and advice.

Topic		Replies	Views
Performance regression Ecosystem compiler	10	1346	October 12, 2023
Why isn't `caml_alloc*` inlined to allocation sites? Learning	5	648	May 2, 2021
Optimizing small vector operations Learning	35	1936	April 4, 2023
Using Ocaml to write network interface drivers Community	24	4895	February 14, 2020
Janestreet OCaml compiler extensions Ecosystem compiler	19	2161	August 10, 2024

Register allocation super-linear timing

Related topics