For an OCaml application that makes heavy use of recursive functions, is there any value to having a maximum stack size?
The problem for us is that typically a stack overflow results in a segfault, with no indication that it’s a stack overflow. We have these occasionally on large, legitimate input. I don’t see a reason to fight over those, so the solution is to increase the maximum stack size which by default is 8 MiB on Linux. In bash for example, this can be done with
ulimit -s 1000000000 (~10^12 bytes) and the ocaml process would inherit this value.
Since having a large stack is expected and is generally not a bug, I suspect a maximum stack size is useless. I’d rather set it to a value greater than any physical memory and forget about it. Is there an OS on which we should expect a problem if we do that?
Update: [FYI] Linux allows unlimited stack size (e.g.
ulimit -s unlimited from bash, which uses the
setrlimit system call). Based on what I read, Darwin appears to have a hard limit of 64 MiB and the
setrlimit call does nothing there. On Windows, the maximum stack size is hardcoded in the binary and I don’t know how large it can get.
I’ve had experience with this in Java commercial deployments. The value of a “realistic” limit on stack-size, is two-fold:
- You have to think about whether a recursive function should be tail-recursive. It’s valuable to think about.
- When a bona fide infinite stack recursion happens, you want it to be detected early, not after you’ve consumed all physical memory and swap space. This eventuality can be both unstable for the machine, and often means that the recursion runs for a while before it hits limits and terminates the process.
Now, it’s a fair cop to point out that it sure would be nice if, when a process exhausts the stack segment, some “supervisor” process could be triggered, to scan the stack and print out a stack traceback. I think [geez, it’s awful getting old] I worked for a company where that was pretty much automatic [somebody had written that code, just to be clear] for all C++ processes.
If you are running your application/service on Linux, systemd can now do both:
- Max stack size:
LimitSTACK= (equivalent to
- Max memory use before systemd’s OOM killer kicks in and kills the process:
It is the de facto service supervisor on Linux.
Your reply is pretty much what I’ve had in mind so far but I’m challenging assumption (1), and (2) is something that’s unrelated to stack usage (unless stack frames waste more memory than alternative implementations using the heap, see below). Note that we don’t have surprise infinite loops, possibly because OCaml makes it hard. Also possibly because infinite loops don’t correlate with large input data, i.e. tests will catch infinite loops.
OK, but what’s the benefit of making a function tail-recursive? Does it typically consume a lot less memory?
let rec map f acc = function
|  -> List.rev acc
| x :: xs -> map f (f x :: acc) xs
vs. the more natural
let rec map f = function
|  -> 
| x :: xs -> f x :: map f xs
If the difference in maximum memory usage is under say 4x, then I wouldn’t bother with limiting stack usage.
List.map (native code, x86_64) uses 32 bytes per stack frame according to my calculations:
ocamlopt -o test_map test_map.ml
ulimit -s 8192 # linux default
let rec map f = function
|  ->
| x :: xs ->
let y = f x in
y :: map f xs
let try_length len =
let input = List.init len (fun i -> i) in
printf "try %i\n%!" len;
ignore (map succ input);
printf "ok %i\n%!" len
let () =
$ echo "max stack size: $(ulimit -s) KB" && ocamlopt -o test_map test_map.ml && ./test_map
max stack size: 8192 KB
Fatal error: exception Stack_overflow
This comes to about 32 bytes per stack frame, which seems perfectly acceptable. I don’t know what this becomes in bigger functions.
I think the value of making a function tail-recursive (or having tail-recursive versions of functions) is that it makes us think about which data we expect to be unbounded in size. That said, clearly tail-recursive functions are more expensive than stack-recursive equivalents: you’re replacing stack-activation frames (which are costlessly reclaimed) with GC, and notwithstanding all claims that GC is free, it just isn’t. Allocation-avoidance is a performance-optimization strategy. So I sort of agree with you, in the sense that you shouldn’t make everything tail-recursive – just the stuff dealing with unbounded-size data.
I was partially responding to:
I’d rather set it to a value greater than any physical memory and forget about it.
If you do that, and encounter an unbounded stack-recursion, you’d end up consuming all physical memory, and then failing, right? And that’s both more unstable (for the machine, with, e.g. overcommitting memory) and takes longer, than having a more constrained limit. Also, sure, tests should catch infinite stack-recursion (since no data is literally infinite in size). But as a systems jock, I try to design multiple layers of safety into systems. And behaviour by a process that threatens the stability of the operating system is always a bad thing. Always.
As a compromise, instead of “any physical memory”, maybe it would be good to set the limit at “the amount of physical memory that you expect the process to consume, worst-case”.
Great advice, it’s much appreciated.
When we’re running the process in a container with a memory limit, this is a little better because at least it doesn’t kill the host. But of course we have multiple processes in that container, so it’s best to not cause collateral damage to those other processes. For this, we use a GC alarm (
Gc.create_alarm) that checks the
heap_words field and raises
Out_of_memory before it’s too late. We could do the same thing to monitor the stack growth, but I don’t know how well it would work in practice because the GC alarm is a hook at the end of major GC cycles. This is where I thought we might as well not have a system limit on the stack size and have a single GC alarm monitor both heap size and stack size (
Gc.stack_size field). And we could have a somewhat lower limit on the maximum stack size (e.g 50% of the maximum memory allowed for the heap) that would work independently of the system limit.
In the end, I’m considering the following:
- Get the system’s maximum stack size out of the way by setting it to “unlimited”.
- Use a single GC alarm to trigger
Stack_overflow exceptions when heap or stack get too big, respectively. This may not work in theory (recursive function calls that pile up without allocating at all?) but I suspect it works in practice and I’d like to try it out.
If you are (1) setting the system stack size to ‘unlimited’ and (2) monitoring the stack size directly in the OCaml process; why not have the system monitor the process stack size directly using systemd on Linux or launchd on Mac. It should simplify your code.
To riff from: https://www.fastonline.it/sites/default/files/2019-06/RobustProgramming.pdf
- memory over-allocation is a fault
- It’s important to contain such a fault at a reasonable boundary
- and it’s also important that that fault not percolate up past that boundary to endanger the operating system
- The correct boundary is probably the process
- At the process boundary, we don’t assume that any code in the process will catch the fault: indeed, our assumption is that the code of the process is faulty and has failed to catch the fault
I don’t know the details of how systemd does its job. But monitoring stack size with code -within- the OCaml process isn’t a substitute for a reliable out-of-process method of ensuring that the OCaml process never gets to the point where it can allocate memory that needs to be reserved for vital OS/system-services level operation.
At least as I understand resource limits in UNIX, the advantage of setting a resource limit on stack size (or memory use) is that the process literally cannot breach that limit. There’s no window between when such a breach occurs and when it’s detected; there’s no exception for native code, there’s no exceptions whatsoever.
Slide #31 (“Error Containment Region”) is probably relevant:
- Errors should not propagate past error containment region
- Errors which do so can result in system fail
Note that since 4.10, stack overflows should reliably result in the
Stack_overflow exceptions being raised (see #8670). If not, it could be worth to open an issue with a reproducer.
Ah! That’s fantastic! I hadn’t realized that.
The crash was with our semgrep-core program on a colleague’s Mac, native code compilation with ocaml 4.12. We’ll work on making a usable report.