Should out-of-memory ever be caught in libraries/framework?

What use cases are there for catching Out_of_memory? What about Stack_overflow?

Should a library ever catch those exceptions?
Should a framework ever catch those exceptions?

As a more concrete question, should Lwt ever catch Out_of_memory?

Currently Lwt catches all exceptions and wraps them in rejected promises. In my opinion the exceptions raised by the OCaml runtime should be treated differently and should just escape. These exceptions happen out of the blue, out of different blues on different platforms.

I’m proposing to change this (see Dont catch ocaml runtime exceptions by raphael-proust · Pull Request #964 · ocsigen/lwt · GitHub) and let SO/OOM traverse the whole Lwt abstraction. The current proposal (still WIP) also prevents the user from catching those exceptions using the Lwt exception management mechanism.

So, are there reasons to catch those exceptions? In this context?

My understanding is that these exceptions should never be caught (which is one reason why “catch-all” clauses try .. with _ -> ... are a bad idea) as they cannot generally be recovered from.

Cheers,
Nicolas


They are tough ones. Out_of_memory behaves like an asynchronous exception when it denotes the machine running out of resources (do not catch it), but even that is not reliable (you can get a fatal error on a young-generation allocation instead); however, it behaves like a synchronous exception/error when it corresponds to an unreasonable allocation request (e.g. passing a silly value as an argument to Array.make). The latter is the only reliable use case, and it is not meaningless to catch and wrap it; the question you have to ask is whether people rely on this exception in this way.
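To make the synchronous case concrete, here is a minimal sketch (it assumes a 64-bit machine, where an array of Sys.max_array_length elements cannot actually be allocated, so the request fails immediately):

```ocaml
(* The size is within Sys.max_array_length, so no Invalid_argument is
   raised, but the request is far larger than any real machine can
   satisfy, so the allocation fails right away with Out_of_memory.
   Catching it here is meaningful: the heap is not actually exhausted. *)
let () =
  match Array.make Sys.max_array_length 0 with
  | _ -> print_endline "allocated (unexpectedly)"
  | exception Out_of_memory -> print_endline "unreasonable allocation request"
```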

Another thing to consider is that it is not reliable to pattern-match on them. There is an idiom of catching and wrapping exceptions (see e.g. Dynlink and Fun.protect in the standard library), in which case you can miss some of them. To fully solve these problems you need support from the language for a new kind of exception, or a community-wide convention to mimic it (e.g. a predicate telling which exceptions are “critical” and should not normally be caught, together with some discipline of using try ... with e when not (critical e) -> instead of try ... with e ->; this is what they do in Coq).
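A sketch of that convention (the predicate and wrapper names are made up here, not an existing stdlib or Coq API):

```ocaml
(* Hypothetical "critical exception" predicate: generic handlers are
   expected to let these escape rather than swallow them. *)
let is_critical = function
  | Out_of_memory | Stack_overflow | Assert_failure _ -> true
  | _ -> false

(* Instead of a blanket [try ... with _ ->], only non-critical
   exceptions are caught, so runtime errors keep propagating. *)
let with_default ~default f =
  try f () with e when not (is_critical e) -> default
```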

At the very least, if you do so, you should do the same for Fun.Finally_raised, and thus maybe also for other exceptions denoting programming errors such as Assert_failure. It would be nice to do the same with exceptions raised from asynchronous callbacks too, since Lwt does not support them (e.g. Sys.Break).


If the goal is to build fault-tolerant systems, then one should never catch an exception that one cannot deal with completely. And since OOM is almost by definition impossible to deal with completely, one should not catch it. Unless, of course, one is going to suicide the process.

This is a specific instance of the general rule about building fault-tolerant systems: when a fault occurs, it should be allowed to propagate upward/outward to the boundary of the region of the system that can conclusively and fully deal with the fault, repairing it completely.

Rick Harper’s notes on fault-tolerance explain in detail: https://www.fastonline.it/sites/default/files/2019-06/RobustProgramming.pdf


Well, Stack_overflow should be easier to deal with: just avoid any further deep stack calls, and by the time the exception is raised the problem has already been avoided, since the stack has been unwound.
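For example, a sketch of that recovery pattern (the function names are illustrative only):

```ocaml
(* If the naive recursion overflows the stack, the raise has already
   unwound it, so the caller can retry with a tail-recursive fallback. *)
let rec naive_length = function
  | [] -> 0
  | _ :: tl -> 1 + naive_length tl

let length xs =
  try naive_length xs
  with Stack_overflow -> List.fold_left (fun n _ -> n + 1) 0 xs
```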

Out_of_memory is the more problematic one, because you'd still want to perform cleanup in finally branches, otherwise you'd leak resources or memory, and when you're out of memory you may want to recover as much memory as you can. On the other hand, performing the recovery may require allocating again, so you may not get the chance to do so.

OTOH I've seen an old program of mine survive a few instances of Out_of_memory surprisingly well (it was a web service), and requests that didn't hit out-of-memory still worked. I don't remember all the details now, but I think the out-of-memory was actually a bug in the program where it tried allocating too much, so it was just those unreasonably large or repeated allocations that failed; on the recovery paths there was actually plenty of memory left for "normal" operations.
OTOH the OS might decide to kill your process at any time if you are out of memory or close to it (in fact I'm having a hard time getting Linux to reliably give me Out_of_memory, due to VM overcommit, which is the default; but even without it, unless the allocation is unreasonably large it just won't give me a NULL and would rather kill the process later), so you can't rely on actually being given the chance to recover.

It really depends on whether the out-of-memory is a bug in the program (or occurs only due to really large allocations), or whether the system as a whole is OOM and even allocating tiny amounts would fail. There might also be a difference between getting an OOM due to a ulimit or cgroup limit, and the entire OS being OOM and having to kill something to regain memory.

Thanks for the fault-tolerance notes; the concepts are somewhat similar to Erlang supervisor trees, where each supervisor either handles the fault or propagates it upwards.


Doesn't fault-tolerant mean that you should be able to recover from an OOM? If you want to run a service efficiently, you want to use all resources and go as close to the limits as possible. So it makes total sense to me to recover dynamically from resource shortages. You can heuristically limit the service earlier, by number of requests and by request size, trying to avoid the OOM, but especially with a service where the amount of memory needed depends on the request, it makes sense to cancel that specific request when it uses more memory than is currently available. Sure, "available" should probably be a soft limit lower than the hard limit, like disk quotas, to reserve some memory for the runtime.

Actually, no, and this is a well-understood issue. Long answer: https://www.fastonline.it/sites/default/files/2019-06/RobustProgramming.pdf

Shorter answer: you should only catch and deal with an error if you can do so completely. Otherwise, you should allow it to pass so that some further-out recovery mechanism can do so. Typically an OOM is not something you can reliably recover from completely. So you should let it pass. Typically, the only way to recover from an OOM is to kill the process and restart it. Or switch to a backup process (e.g. process pairs).

That PDF addresses this issue more completely, describing what he refers to as “fault containment regions” and such. It’s worth reading, b/c it gives a conceptual framework to understand many of the errors we see in software.

I’ll just indulge my personal bugaboo: in J2EE/Tomcat/etc software, it’s almost always impossible to reliably catch system-level errors (as opposed to “application”/“business logic” errors) in the thread where the exception is raised. The only reliable way to deal with these errors is to kill the J2EE/servlet process and restart it. But the imbeciles who came up with all that shit still talk about “lifecycle” and shit, as if it were possible to “stop” a servlet (or “bean”) and start it again. As if that were a way to address errors.

It’s not. There is only one lifecycle, and it starts with fork/exec and it ends with exit(2).

OK, I’m done with the rant. Ugh. Wasted a decade of my career cleaning these single-neuron-disease victims’ shit. For which they got paid handsomely.


[By “a service” I’m assuming you mean a transaction-processing service. E.g. any sort of web-app, but also pretty much every Internet application in existence.]

A little more detail. In fact, in transaction-processing, you DO NOT want to use all resources. Instead, you want to preserve a large and comfortable buffer of unused resources, so that you never hit hard limits. The way this is done is thru what’s called “admission control”, and every good TP system uses this technique. Every good TP system also limits the size/number of various resources in-use. It is not an exaggeration to say that

“capacity limitation is the quintessence of transaction processing”.

If you read a book on TP (e.g. Jim Gray's Transaction Processing, still a classic), this jumps out at you.

For instance, my memory is that IBM mainframes would run at about 65% of full CPU utilization. They would use that 65% completely, but the rest was reserved for the OS, recovery and cleanup routines, and extra idle capacity for dealing with unforeseen events.

Imagine a unikernel: there are no processes, no OS, only the OCaml runtime. Lwt promises are my "processes" that I want to kill, but not the whole "system". I would prefer to have a way to reject the promise with an exception if memory gets too full (as you said, it shouldn't be the hard limit, but some sensible threshold). So I'm fine with a 65% threshold, like the one you mentioned, to keep the OCaml runtime and GC operable. But to me, killing the whole OCaml runtime in a concurrent context is like shutting down the whole OS because one process is using too many resources.

So what I am saying is probably: it should be possible to deal with the OOM error completely in the Lwt context.
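Roughly what I imagine, as a sketch (the threshold, the names, and the idea of polling Gc.quick_stat are just for illustration, not a worked-out design):

```ocaml
(* Hypothetical soft-limit check: reject a request once the major heap
   grows past a threshold, well before the runtime actually hits OOM. *)
let soft_limit_words = 200_000_000  (* arbitrary: ~1.6 GB of 8-byte words *)

let check_memory () =
  let stat = Gc.quick_stat () in
  if stat.Gc.heap_words > soft_limit_words then
    Lwt.fail_with "low on memory, rejecting this request"
  else
    Lwt.return_unit
```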

First, an aside:

Are you familiar with Erlang? It has the property that all Erlang “processes” run in the same UNIX process. And Erlang processes can fail independently of each other. And Erlang applications can run for a long, long time. How is this? Well, two things:

  1. Erlang processes share no memory (except immutable (refcounted, IIRC) strings); data is copied between processes.

  2. Erlang has something called “process supervision” where, when a process takes a fault, a supervisor process can wake up and cleanup after it. And since Erlang processes share no memory, it’s possible to recover completely.

But even with this, Erlang processes can fail: the runtime can take errors, foreign C modules can fail, etc. It’s just not common, IIUC.

Now, to your question:

No, this would not be a good design for a "service" that is designed to have high reliability. What you describe is where J2EE started, right? And it's fair to say that design failed.

There’s a saying in systems work:

Six weeks in the lab can save you half an hour in the library.

The history and development of fault-tolerant systems shows pretty clearly that the design you describe is not going to yield high-reliability systems. Now, the J2EE designers ought to have known this too (since all the books about it were nearing being out-of-print by the time J2EE came along) but they didn’t spend that half-hour in the library.

P.S. There's nothing inherently wrong with a multithreaded runtime. Look at Apache 2.0. The thing is, you don't run one such runtime; instead, you run N of them (N >= 2), and when one takes a fault, it crashes completely. [And again, I'm talking about non-business-logic faults.]

BTW, if you’re having enough faults that the startup time of these runtimes is getting prohibitively expensive, that means the faults are common enough that you can go debug them. Jim Gray called these “Bohrbugs”.

PPS. If you’re interested in learning the history of TP, his book Transaction Processing is good. But also, reading “the Tandem papers” is absolutely essential. Tandem was probably the apex of “designed to be reliable” software and hardware, and reading about them is eye-opening. A small example: modern Infiniband networks are the direct descendant of Tandem’s networks.


I fortunately have no idea what J2EE is (besides it has something to do with Java). :sweat_smile:

Why do you assume OOM conditions must be due to a bug? An OS kernel could run out of memory with completely "bug-free" programs, and would kill them. And again, I'm working in a context where there are no processes. But I get it, I simply expect too much from Lwt; I would need something more like Erlang processes, with better separation.

An OOM always means that a memory allocation operation was aborted because there was no memory to fulfill it, right? So the invoking code is, a priori, traveling along an error path. There's a rule from systems programming: check your error codes, don't raise exceptions. For instance, you'll find that in the Google C++ Style Guide. This is good advice for low-level operations in general, but memory allocation is one of those operations whose success nobody ever checks; there are too many places it can happen. Which is why programs typically leave handling it to the OS kernel, which terminates the process.

Re: Erlang processes, I suspect you’d find that if an Erlang process suffered an OOM, Erlang’s process supervision could deal with it. If the Erlang runtime suffered an OOM, process supervision could not deal with it, and the entire UNIX process would need to crash.


Exactly. But “error path” and “bug” are really two different things.

Imagine a service that receives some data (let's say a file), does some processing on that data, and returns a result. How much memory is required for that processing is not known in advance, because it depends on the data in the request. It's also a concurrent service, so several requests may be processed at the same time. How would you possibly implement such a service without gracefully handling OOM conditions and rejecting requests in those cases? Limiting the number of parallel requests doesn't necessarily help, depending on the granularity of the requests compared to the available memory, since you don't know in advance how much memory a request will need. It would not be a bug, but a programmatic failure, when the service rejects a request with "sorry, I currently don't have enough resources to fulfill your request".

Mmm … an “error path” refers to the fact that control follows a path unexpected by the programmer, that is all. In that sense, it’s a bug. But even if we don’t call it a bug, it is still an unforeseen erroneous execution. These are called “faults” in fault-tolerant systems design.

Let's take your example. Let's modify it so that when you send the file, you send it wrapped in some marshalling format, like a protobuf, so that it's sent all in memory and is not streamed (since if it were streamed, it could be handled memory-efficiently by the server).

  1. At the moment the OOM occurs, it might happen on the thread processing the file message. But it might also happen on another thread (say there is barely enough memory to hold the message, but not enough for other threads to continue their processing, completely unrelated to the file message). If you're going to go with this idea that we need to handle OOMs, then you'll have to figure out how to handle them all, on all threads, performing all activities, right? On the other hand, if you're going to say "well, for some OOMs we'll just abort the UNIX process", what's the point of treating some of them as special?

  2. Your scenario is actually a really instructive one. There is a very well-known DoS attack on internet services: identify RPCs that carry payloads for which the server doesn't check data sizes and just naively demarshals into memory and processes, then send gigantic messages and blow up the server. Done and dusted. So if you're serious about accepting large messages, you need to have provision for capping the size ("capacity limitation is the quintessence of transaction processing"). Also, you'll need to ensure that requests for this sort of RPC are always covered by authentication, so you can log the user-id of the requestor before starting demarshalling. That way, if the request blows up the server, you have it in your log and can trace it back to the requestor.

ETA: And yes, this is somewhat like the scenario you're describing, where programmatically the server code decides that a request is too large and rejects it with an error response. This is what is called a "business logic error", and that is not the same as a fault. Sure, the server should return a comprehensible error code/message. The difference is that you don't reject the request after incurring an OOM, because that could cause faults all over the process as other code gets OOMs too. Instead, you detect that the message is too large and reject it before processing (see the sketch after this list).

  3. But another thing that you might (and in reality, you should) do is to move those (big-message) RPCs to a separate server. You structure your service so that those RPCs happen on a simple server that can restart fast, with no other sorts of RPCs on it. Then you can limit the number of concurrent requests on that "big message server" while allowing the rest of your service to remain high-concurrency.
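A sketch of the "cap and reject before processing" idea (the names and the limit are made up for illustration):

```ocaml
(* Hypothetical request guard: check the declared payload size against a
   fixed cap before demarshalling anything, and return a business-logic
   error instead of risking an OOM mid-processing. *)
let max_payload_bytes = 16 * 1024 * 1024  (* arbitrary 16 MB cap *)

let handle_request ~process ~declared_size body =
  if declared_size > max_payload_bytes then
    Error "payload too large"
  else
    Ok (process body)
```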

But let's zoom out a few levels. First, I strongly urge you to read those notes from Rick Harper's Stratus presentation. It's nearly 40 years old at this point, and still very, very relevant. The relevant analysis goes like this:

  1. an OOM can affect a thread that is consuming a ton of memory. In that case the “fault” (excessive memory consumption) is “detected” (OOM raised) in the same “smallest fault-containment region” (the thread) as where the fault was incurred. So we can imagine we can “contain” and “recover”/“repair” the fault right there.

  2. But that fault could just as easily be detected in a different “smallest fault–containment region”. If we attempt to “repair” the fault there, we will FAIL to do so. B/c we aren’t actually going to fix it, since the thread with the excessive memory use is not the one on which the OOM happened.

  3. So the only way to properly (that is to say, completely) deal with the fault, is to escalate it out to an enclosing region where it can be dealt with completely. Typically, that region will be the UNIX process-boundary.

Look: transaction processing and fault-tolerant systems design isn't obvious stuff. There is a long history of why and how fault-tolerant systems get designed the way they do. I'll tell you right now that when I entered industrial software (1995), I didn't know any of it. I learned it all on the job. That is to say, I spent ten years in the field instead of six months in the library. Don't be like me. If you're going to design transaction-processing services and systems, it pays to learn from history how to design them for availability and fault tolerance.


Anyway, as explained, Out_of_memory is not reliable for OOM situations: if the GC itself runs out of memory in the middle of a minor collection, then you get a fatal error. Out_of_memory is only reliable as a synchronous error denoting an allocation that is too large, e.g. when passing a nonsensical size to Array.make.

Chet has very good points about the difficulty of recovering from exceptions and in particular OOM conditions. You might be interested in how memprof-limits gets around these issues.

  1. The exception happens when a limit set by the user is reached, not when memory is full, so there is memory left for recovering.
  2. Tasks that may be interrupted are explicitly marked; interruption does not happen in unrelated threads.
  3. The task that allocates the most is the most likely to be interrupted; tasks keep being interrupted until memory is freed.

By the way, the literature on fault tolerance has been very helpful indeed. (This stuff seems to keep getting rediscovered.)


Yeah, I looked into it before, but it requires threads. As I mentioned above, I work with a Mirage unikernel. There are no processes, there are no threads. There's just the OCaml runtime, which owns the whole memory. So I need "business logic" errors per Lwt-promise scope when free memory becomes low. :man_shrugging:

Sorry I don’t know much about how Mirage is deployed, so this may be a stupid question. But isn’t a Mirage application typically compiled into a VM and run on a Xen hypervisor as a guest OS? In that case, if it OOMs, wouldn’t the Xen server OOM-kill it and restart it?

Boy howdy, so true. When I entered “the business” in 1995, I knew nothing about this stuff: I was a type theorist with a sideline in hardcore hacking. I’d bluffed my way thru my systems qualifiers, and had a single course in operating systems as an undergrad. So I learned on-the-job, and at first, I was just proud of being such a good hacker and debugger. But after 3-4 years of debugging an unending series of problems, a friend suggested I should start reading the classics of the field: the Tandem papers, the classic CACM papers by Spector&Gifford from the mid-80s (TWA Airline Control Program, Cirrus Banking Network, Space Shuttle, Bridge Building, etc) and also that Rick Harper presentation. A bunch of others. All old, many out-of-print. And it was a revelation.

It became clear to me that almost all of the modern web-app runtimes were written by people who had forgotten or never learned anything about transaction-processing, even though these were classic TP systems. Heck, Weblogic was owned by the same company that had built Tuxedo; the J2EE side of Websphere was written by the same guys who’d worked on Encina for Transarc, and on and on.

It was a gigantic farce: learning that all the people who were supposedly the experts in TP had in fact forgotten all the most basic things about the subject and were breaking the rules right-and-left. The example of “trying to recover from errors that you can never recover from completely except at the process-boundary” is one, but the one that really, really, really opened my eyes was seeing the massive number of variations on the theme of “flagrant violations of the two-phase locking/resource-acquisition discipline”: from

  1. acquiring database connections to multiple backend databases, in different orders in different transactions

  2. acquiring multiple database connections to the same database in the same transaction, in the same process [yes, madness – madness – b/c they then proceeded to use two-phase commit on those multiple conns … again, in the same process on the same tran on the same thread]

  3. an RPC inbound on an app-server would result in outbound RPC from that same app-server … [wait for it] back to the same app-server. [this is another version of “acquire multiple DB conns to the same database” – because threads are a capped resource]

I’m sure I’m forgetting other flagrant examples, but … when I started pointing these out, with the reference to chapter and verse from BHG (Bernstein, Hadzilacos, & Goodman), I’d get blank stares or “but it needs to be this way [because reasons]” from people who were ostensibly the senior architects in charge of designing this stuff.

That’s when I realized that basically our entire field (modern industrial TP) had forgotten these lessons completely.


I remember the Tandem papers vividly, indeed. Do you perchance have links to or PDFs of the others?

In our case it is deployed on Muen, a separation kernel with static configuration. You have to imagine it more like embedded software engineering. If the unikernel crashes, the system halts. You might configure it to restart automatically, but the system would lose its runtime state (it must be manually "unlocked" with a passphrase, because all storage is encrypted).

So I would basically end up implementing something like a "process" within OCaml that has its own memory space, but I don't believe that's necessary to gracefully handle OOM, or rather "low-memory", conditions, because Lwt is cooperatively scheduled, and it should be safe to just stop the Lwt task that experienced the condition and free all its resources. Actually, I am already using the Out_of_memory exception to handle these cases in the absence of alternatives, and it seems to work reliably in my tests. But I would prefer to be able to "reserve" some memory for the runtime/GC, exactly like you do for the filesystem of the root partition, so that the system is still operational even if the disk has filled up. (The GC alarms are only triggered on major GC cycles, so I don't believe that is guaranteed to happen often enough.)
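What I do is roughly this (a simplified sketch; serve_request stands in for my handler, and it relies on Lwt's current behaviour of turning raised exceptions into rejected promises):

```ocaml
(* Per-task handling: if this particular task runs into Out_of_memory,
   reject just this promise and let the rest of the unikernel carry on. *)
let guarded serve_request req =
  Lwt.catch
    (fun () -> serve_request req)
    (function
      | Out_of_memory ->
          Lwt.fail_with "not enough memory to serve this request"
      | e -> Lwt.fail e)
```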
