Should out-of-memory ever be caught in libraries/frameworks?

What use cases are there for catching Out_of_memory? What about Stack_overflow?

Should a library ever catch those exceptions?
Should a framework ever catch those exceptions?

As a more concrete question, should Lwt ever catch Out_of_memory?

Currently Lwt catches all exceptions and wraps them in rejected promises. In my opinion, the exceptions raised by the OCaml runtime should be treated differently and should just escape: they happen out of the blue, and out of different blues on different platforms.
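To make the current behaviour concrete, here is a small sketch (assuming the lwt library is linked; this is an illustration, not code from the PR):

```ocaml
(* Lwt.catch runs the first function and turns any exception it
   raises into a rejected promise; Out_of_memory is no exception. *)
let _p : unit Lwt.t =
  Lwt.catch
    (fun () -> raise Out_of_memory)
    (fun exn ->
       (* Out_of_memory reaches the handler like any other exception. *)
       prerr_endline "caught a runtime exception inside Lwt";
       Lwt.fail exn)
```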

I’m proposing to change this (see Dont catch ocaml runtime exceptions by raphael-proust · Pull Request #964 · ocsigen/lwt · GitHub) and let Stack_overflow/Out_of_memory traverse the whole Lwt abstraction. The current proposal (still WIP) also prevents the user from catching those exceptions through Lwt’s exception management mechanism.

So, are there reasons to catch those exceptions? In this context?

My understanding is that these exceptions should never be caught (which is one reason why “catch-all” clauses like try ... with _ -> ... are a bad idea), as they cannot generally be recovered from.
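As an illustration of why catch-all clauses are dangerous (parse_opt is a made-up helper):

```ocaml
(* A catch-all handler conflates expected failures with fatal
   runtime conditions. *)
let parse_opt s =
  try Some (int_of_string s)
  with _ -> None
  (* This also catches Out_of_memory and Stack_overflow, so a fatal
     resource failure is reported as the same [None] as a malformed
     string, and the program keeps running in a broken state. *)
```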

Cheers,
Nicolas


They are tough ones. Out_of_memory works similarly to an asynchronous exception when it denotes the machine running out of resources (do not catch), but this is not reliable (you can get a fatal error on young allocation instead); however, it works like a synchronous exception/error when it corresponds to an unreasonable allocation request (e.g. giving a silly value as an argument to Array.make). The latter is the only reliable use case, and it is not meaningless to catch and wrap it; the question you have to ask is whether people rely on this exception in this way.
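A hedged sketch of that synchronous case (assuming a 64-bit system, where Sys.max_array_length is far larger than any real machine’s memory):

```ocaml
(* The length is within Sys.max_array_length, so Array.make does not
   raise Invalid_argument; the allocation itself is absurdly large
   and should fail deterministically with Out_of_memory. *)
let () =
  match Array.make Sys.max_array_length 0 with
  | _ -> print_endline "allocated (unlikely)"
  | exception Out_of_memory ->
      (* Catching is meaningful here: only this one unreasonable
         request failed; plenty of memory remains for normal work. *)
      print_endline "unreasonable allocation rejected"
```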

Another thing to consider is that it is not reliable to pattern-match on them. There is an idiom of catching and wrapping exceptions (see e.g. Dynlink and Fun.protect in the standard library), in which case you can miss some of them. To fully solve these problems you need support from the language for a new kind of exception, or a community-wide convention to mimic it (e.g. a predicate to tell which exceptions are “critical” and should not normally be caught, together with some discipline of using try...with e when not (critical e) -> instead of try...with e ->; this is what they do in Coq).
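A minimal sketch of that convention (critical and with_fallback are illustrative names, not an existing API):

```ocaml
(* A hypothetical predicate marking exceptions that should escape. *)
let critical = function
  | Out_of_memory | Stack_overflow | Assert_failure _ -> true
  | _ -> false

(* Coq-style discipline: handle only non-critical exceptions. *)
let with_fallback f fallback =
  try f ()
  with e when not (critical e) -> fallback
```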

At the very least, if you do so, you should do the same for Fun.Finally_raised, and thus maybe also for other exceptions denoting programming errors, such as Assert_failure. It would be nice to do the same for exceptions raised from asynchronous callbacks too, since Lwt does not support them (e.g. Sys.Break).
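To see why Fun.Finally_raised matters here, a small sketch of the wrapping it performs:

```ocaml
(* If the [finally] callback raises, Fun.protect re-raises the
   exception wrapped in Fun.Finally_raised, so a plain
   [with Out_of_memory ->] clause would not match it. *)
let () =
  try
    Fun.protect ~finally:(fun () -> raise Out_of_memory) (fun () -> ())
  with Fun.Finally_raised Out_of_memory ->
    prerr_endline "Out_of_memory arrived wrapped"
```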


If the goal is to build fault-tolerant systems, then one should never catch an exception that one cannot deal with completely. And since OOM is almost by definition impossible to deal with completely, one should not catch it. Unless, of course, one is going to suicide the process.

This is a specific instance of the general rule about building fault-tolerant systems: when a fault occurs, it should be allowed to propagate upward/outward to the boundary of the region of the system that can conclusively and fully deal with the fault, repairing it completely.

Rick Harper’s notes on fault-tolerance explain in detail: https://www.fastonline.it/sites/default/files/2019-06/RobustProgramming.pdf


Well, Stack_overflow should be easier to deal with: by the time the exception has been raised and propagated, the stack has already been unwound, so the immediate problem is gone; you just have to avoid making any more deep stack calls.
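For instance, a hedged sketch of catching it at a top-level loop (handle_request is a hypothetical per-request function):

```ocaml
(* By the time Stack_overflow propagates up here, the deep recursion
   has been unwound, so the loop itself has stack left to continue. *)
let rec serve handle_request =
  (match handle_request () with
   | () -> ()
   | exception Stack_overflow ->
       prerr_endline "request aborted: stack overflow");
  serve handle_request
```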

Out_of_memory is the more problematic one, because you’d still want to perform cleanup in finally branches, otherwise you’d leak resources or memory, and when you’re out of memory you want to recover as much memory as you can. OTOH performing the recovery may itself require allocating again, so you may not get the chance to do so.
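One workaround sometimes used for exactly this problem (a sketch only; reserve and with_oom_headroom are illustrative names): pre-allocate some headroom that the recovery path can release.

```ocaml
(* Keep a spare block around so that, when Out_of_memory strikes,
   dropping it frees memory for the cleanup code to run. *)
let reserve = ref (Some (Bytes.create (1 lsl 20))) (* 1 MiB spare *)

let with_oom_headroom f =
  try f ()
  with Out_of_memory ->
    reserve := None;     (* release the spare block *)
    Gc.compact ();       (* encourage the GC to return the memory *)
    raise Out_of_memory  (* re-raise; the caller decides what next *)
```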

OTOH I’ve seen an old program of mine (a web service) survive a few instances of Out_of_memory surprisingly well: requests that didn’t hit out of memory still worked. I don’t remember all the details now, but I think the out of memory was actually caused by a bug in the program that made it try to allocate too much, so it was just those unreasonably large or repeated allocations that failed; on the recovery paths there was actually plenty of memory left to do “normal” operations.
OTOH the OS might decide to kill your process at any time if you are out of memory or close to it (in fact, I’m having a hard time getting Linux to reliably give me Out_of_memory, due to vm overcommit, which is the default; but even without it, unless the allocation is unreasonably large, it just won’t give me a NULL and would rather kill the process later), so you can’t rely on actually being given the chance to recover at all.

It really depends on whether the out of memory is a bug in the program (or occurs only for really large allocations), or whether the system as a whole is OOM and even tiny allocations would fail. There might also be a difference between hitting an OOM due to a ulimit or cgroup limit and the entire OS being OOM and having to kill something to regain memory.

Thanks for the fault-tolerance notes; the concepts are somewhat similar to Erlang supervisor trees, where each supervisor either handles a fault or propagates it upward.
