Understanding cancellation (in Eio)

Another naive question about cancellation: in Eio, there is a single cancellation function per fibre context, and “setting” the cancellation function would implicitly erase a previously-set function. (Cancellation functions are only set for “short” interactions, basically until an asynchronous operation completes, and then cleared.) Why is this enough? Why don’t we sometimes want to wait on two things together and have two different cancellation actions to run?

Intuitively this seems related to the fact that Eio is a concurrency manager in, morally, a sequential system (a single domain); but is it the whole story, or is it related to the (in)expressiveness of the ways we request asynchronous results? Is it possible that we need multi-cancellation in the future for expressivity? In this case could we still provide transactional guarantees? (If cancelling one operation succeeds but the other has completed in the meantime…)

Isn’t that the other way around? As far as I understand, it is not an asynchronous operation. It is a synchronous one. And the cancellation function is there to cause this synchronous operation to fail earlier, so that the fiber can resume its execution as soon as possible (if only to throw the Cancelled exception).

Asynchronous exceptions are generally quite tricky to use. As Simon mentions in the article,

Basically it comes down to this: if we want to be able to interrupt purely functional code, asynchronous exceptions are the only way, because polling would be a side-effect.

OCaml doesn’t have this problem, since polling is not an issue in an impure language. Integrating asynchronous exceptions into the language would also mean that existing resource-safe code breaks.

The rest of the article is about how asynchronous exceptions are made safe for impure code. We wrote a paper in 2017 where we discussed similar solutions: https://kcsrk.info/papers/system_effects_feb_18.pdf.

2 Likes

Eio deliberately uses a different cancellation system to Lwt. It’s a good idea to check that it avoids these problems, indeed. The main improvement is that in Eio we cancel fibres rather than promises, which avoids most of the problems, and cancellation is sticky (so cancellations don’t get lost just because the current operation happened to be going through an uncancellable section at the point when the cancellation was performed).

  • what if you defined some generic retry function that retries on all exceptions? (It’ll need to ignore the canceled exception, otherwise your computation won’t actually finish when the timeout is reached.)

A similar problem also exists with stack-overflow, out-of-memory and assertion failures, at least. Java has a nice hierarchy of exceptions, where Exception and Error both extend Throwable, and there the retry loop could just catch Exception:

An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch.

OCaml doesn’t seem to have anything like this. Eio recommends you do Fibre.check () before handling an unknown exception. This will raise immediately if the fibre is cancelled.
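To make the pattern concrete, here is a minimal stdlib-only sketch of such a retry loop. The `check` function is a stand-in for Eio’s `Fibre.check` (which raises if the current fibre is cancelled); in this sketch it just consults a flag:

```ocaml
(* Stand-in for Eio's Fibre.check: raises if the current fibre has been
   cancelled. In this stdlib-only sketch it just consults a flag. *)
exception Cancelled

let cancelled = ref false
let check () = if !cancelled then raise Cancelled

(* A retry loop that swallows ordinary failures, but never swallows
   cancellation, and re-checks for it before each retry. *)
let rec retry f =
  match f () with
  | v -> v
  | exception Cancelled -> raise Cancelled
  | exception _ ->
      check ();  (* raises Cancelled if a cancellation is pending *)
      retry f
```

Without the `check ()` call, a pending cancellation raised from inside `f` could be caught and retried forever, which is exactly the timeout-breaking behaviour described above.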

  • what if your cleanup function takes a long time? (if your cleanup function is just a file descriptor close, fine, but it could be an API call to a remote server, on which you may have forgotten to put a timeout thinking you have the timeout on the outer call, etc.)

Eio says:

Quick clean-up actions (such as releasing a mutex or deleting a temporary file) are OK, but operations that may block should be avoided. For example, a network connection should simply be closed, without attempting to send a goodbye message.

  • what if your cleanup functions raise exceptions, where do those go?

An Eio switch collects all exceptions and combines them using Eio.Exn.combine. If the cleanup succeeds, it will re-raise the Cancelled exception, which is ignored when combining exceptions. If it raises something else (because cancelling failed), you’ll end up with an Eio.Exn.Multiple containing all the exceptions.
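As an illustration of those combining rules (a sketch only, not Eio’s actual `Exn.combine` implementation): `Cancelled` loses against real errors, and several real errors are collected into a `Multiple`:

```ocaml
(* Sketch of the combining rules: Cancelled is dropped in favour of
   real errors; distinct real errors accumulate into Multiple. *)
exception Cancelled
exception Multiple of exn list

let combine a b =
  match a, b with
  | Cancelled, e | e, Cancelled -> e
  | Multiple xs, e -> Multiple (xs @ [e])
  | e, Multiple xs -> Multiple (e :: xs)
  | x, y -> Multiple [x; y]
```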

  • there is no good way to cancel some operations: you may be past a “point of no return” where you’ve already committed to finishing the operation and cancelling at that point will leave the system in a somewhat broken state (e.g. maybe the reason you’ve hit a timeout is that you were running the cleanup function of some other computation, and canceling that means you’d leak file descriptors. Not canceling it means you’ll exceed your timeout). This needs some way to mark regions of code as not cancelable, or take extra care that you really perform the most critical function of the cleanup handler even if the handler itself is canceled. (E.g. if you tried to log something in the cancel handler, cancel the logging but still close the file descriptor)

Cancel.protect can be used to make something non-cancellable. Switch.on_release uses this automatically when running release hooks. This does mean that in Eio, cleanup operations cannot be cancelled by default.
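The shape of the idea can be sketched with the stdlib alone. (Eio’s real `Cancel.protect` works on the fibre’s cancellation context, not a global flag as here; `masked`, `pending` and `cancel` are invented for illustration.) While the protected section runs, a cancellation request is only recorded, and is raised once the section ends:

```ocaml
(* Stdlib-only sketch of masking: a cancel request during [protect]
   is only recorded, and is raised when the protected section ends. *)
exception Cancelled

let masked = ref false
let pending = ref false

let cancel () = if !masked then pending := true else raise Cancelled

let protect f =
  masked := true;
  let result = try Ok (f ()) with e -> Error e in
  masked := false;
  if !pending then (pending := false; raise Cancelled);
  match result with Ok v -> v | Error e -> raise e
```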

One might want to consider canceling resources instead of (just) computations. Have a flag on resources that their use has been canceled. E.g. close the file descriptor and mark its wrapper type as canceled (requires the wrapper to be more than a mere int, but I’d happily pay that price to avoid double closes).

Eio and Lwt both wrap Unix.file_descr to do this. Both refer to such FDs as “closed” rather than “cancelled” however.
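The wrapping idea can be sketched as follows (the names `fd`, `use` and `Closed_fd` are invented; the real wrappers in Lwt and Eio hold a `Unix.file_descr` and do considerably more):

```ocaml
(* Sketch of a descriptor wrapper with a "closed" flag: use-after-close
   is a clear error, and double close is a harmless no-op. *)
type fd = { raw : int; mutable closed : bool }

exception Closed_fd

let close fd =
  if not fd.closed then begin
    fd.closed <- true
    (* the real Unix.close of the underlying descriptor would go here *)
  end

let use fd f =
  if fd.closed then raise Closed_fd else f fd.raw
```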

any further attempts to use the resource will run into errors, and a fairly obvious one: resource canceled, instead of EBADF.

Surprisingly, the EBADF is actually generated by Lwt, not the OS, so Lwt could easily give a better error here.

1 Like

Yes. A cancel function will only be present while the fibre is suspended, so it can’t start another operation. If you want to do another operation at the same time, you need another fibre, which will have its own cancel function.

@c-cube also pointed me at Lwt_switch which implements a different, more explicit approach to resource cleanup, where users attach their cleanup code to an explicit “switch” instead.

The documentation of Lwt cancellation is also worth a read. Are people using Lwt_switch because they find Lwt.cancel too complex?

Might be useful to distinguish these in Eio if possible. Accessing a closed FD would make me hunt for a double close in my code, whereas accessing a cancelled FD is something “normal” during cancelation, and one should probably not even log such an exception, just reraise it.

Another naive question about cancellation. I wonder if cancellation is:

  1. The “natural” phenomenon arising out of exceptions in cooperative-concurrency systems: if a fiber fails with an exception, we want to do something to the other fibers whose result we know is unused (for performance reasons).
  2. Or a “separate” use-case than exceptional failure, that only makes sense in concurrent systems, and just happens to be “presented as exceptions” to direct-style code to reuse Fun.protect cleanup logic.

This might be related to the question of whether the only way to cancel a fiber is from the fiber code itself (in this case we could consider that any raise is the cancelling action, this fits well with scenario (1)), or if the standard way to cancel is to pass a handle to some fiber to cancel it (this feels more like scenario (2)).

It’s easy to imagine examples of cancellation use-cases coming from (1) – just think of any failure in concurrent code, when we know that we are waiting on a conjunction of several fibers. I lack the domain-expertise to know examples of code (or to make them up) where “cancellation” is its own thing not related to failure. (Maybe Ctrl-C is the right thing to be thinking of? So would (2) be equivalent to (1) + asynchronous exceptions?)

@lpw25 mentions that cancellation introduces a lot of complexity in cooperative-concurrency designs. But in situation (1), cancellation is just an emergent property of the system (if we take exceptions for granted), not something new introduced by the API designers. Is there a way to “not do (1)” or, alternatively, do people think that handling (1)-style cancellation is not too hard, it’s reasoning about “explicit cancellation from outside the fibers” that adds complexity?

1 Like

Note that pthread_cancel has existed for a long time (and is standardized since 2001). So, cancellation is unrelated to either cooperative concurrency or the existence of exceptions.

Failure is not the only important use case for cancellation; success is too. Indeed, an algorithm might be interested in getting a single success. So, once one of the jobs finds a solution to a given problem, there is no point in keeping all the other ones alive until they either succeed or fail. The archetypal example would be a SAT solver.

Right! When you build a conjunction of computations, and one fails, you want to abort the others. But if you build a disjunction of computations, and one succeeds, you want to abort the others as well. Excellent, thanks.

(But this is still not an example where one would think of non-timely aborts as a correctness issue; it is more like a missed optimization.)

I wrote a few design notes for what I understand to be desirable properties of “minimal cancellation” – a design that merely tries to optimize concurrent computations on failure (or success!) of a related computation, rather than expose “cancellation” as an explicit user action.

I wonder if existing implementations offer those guarantees (my understanding from @talex5 replies is that Eio does, and in fact some points here have been suggested by him directly), what additional aspects they consider as important, and whether this contradicts some of the principles below.

(For example: Lwt.cancel gives users the ability to explicitly cancel another computation. I would not consider this a “core” requirement for cancellation, but maybe other people do? Is this in conflict or tension with some of the good properties?)

When cancellation implementations are judged “too complex”, this is because they fail to meet user expectations. Are we talking about some of the expectations listed below, or some other expectations?

(One aspect I had in mind but didn’t write down as a core requirement is a form of causality/ordering, as in distributed systems: maybe there are certain events that logically happen “after” termination of a computation A, and we would want to guarantee that if a computation is going to be cancelled due to the termination of A, then that must happen before it observes this event.)

Minimal specification of (cooperative) cancellation

  • Optimization: Cancellation is an optimization, where a computation whose result is not needed anymore may be cancelled by the scheduler, instead of running to completion.

  • Cooperative: cancellation is observed on scheduler join points (“bind” in monadic libraries, “perform” in effectful libraries): the computation is cancelled by receiving a specific exception from such a join point. (Receiving an exception allows computations to clean up their resources, but – just like other asynchronous exceptions – they should be careful to not just catch and ignore the cancellation exception, or the whole system will be less reliable.)

  • Asynchrony: cancellation manifests itself as an asynchronous exception, that is distinct from the other failure modes at the join point. (A computation cannot “cancel itself”, it can only terminate with a failure/exception or a value; cancellation comes from the termination of other computations running concurrently.)

  • Protection: A computation can disable cancellation by masking/protecting itself; then it will be run to completion.

  • Transactionality: effectful operations at join points either run to completion (and resume the computation, possibly with an exception), or they cancel and have no other observable effect.

    In particular, in cases where “cancelling the operation in-flight” is not possible, the implementation should always resume the operation once it has started (as if it were always protected/masked). It can check whether cancellation happened before starting the operation (and cancel in that case), but not after.

2 Likes

It can be an optimisation, but often it is essential. For example, if you’re waiting for a message that might get lost, you would add a timeout. If the cancel doesn’t do anything and the system keeps waiting, it will just hang forever.

It’s particularly important with structured concurrency, since there we’re not allowed to detach the fiber and let it keep running. And if we did let the fiber keep going, it might have observable side-effects later.

So standard that OCaml threads don’t actually provide this feature, which apparently is very tricky to use well! There is currently no way to kill a thread in OCaml. Quoting:

val kill : t -> unit

This function was supposed to terminate prematurely the thread whose handle is given. It is not currently implemented due to problems with cleanup handlers on many POSIX 1003.1c implementations. It always raises the Invalid_argument exception.

SAT solvers would typically be subprocesses running for several seconds, a case where cancellation is relatively doable (you can kill the process). Beware of race conditions, etc. However it’s interesting to note that even within a SAT solver, cancellation (timeout, memory limit) is typically done by polling: regularly check that you haven’t run out of resources. I’m not aware of a better mechanism for CPU-bound computations anyway (except for setting an alarm, but it’s very blunt).
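The polling scheme for a CPU-bound loop can be sketched like this (all names are invented for illustration); checking the deadline only every `poll_every` steps keeps the check cheap:

```ocaml
(* Sketch of polling-based cancellation for a CPU-bound search:
   check the deadline only every [poll_every] iterations. *)
exception Timeout

let search ~deadline ~poll_every ~max_steps step =
  let result = ref None in
  let i = ref 0 in
  while !result = None && !i < max_steps do
    if !i mod poll_every = 0 && Sys.time () > deadline then raise Timeout;
    result := step !i;
    incr i
  done;
  !result
```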

1 Like

But if a single operation offered by the system blocks in a non-cancellable way, you have to choose between “timeout works” and the transactional guarantee. Above I specified the transactional guarantee. I think that “timely response to cancels” is an individual property of each asynchronous operation exposed, not of the global system. (Well: let’s say that both aspects are in tension, and I had the impression that transactionality was a hard rule to keep the system reasonable/programmable, while timeliness is a soft / per-operation rule. But I guess there are other choices!)

I’m confused @c-cube. Is cancellation important then? Are you saying it isn’t? Do people get along without it?

I think the point is that SAT solvers will explicitly check cancellation in user code (so the framework doesn’t need to deal with it), rather than hoping that “good cancellation” will be an emergent design of the concurrency library used.

1 Like

it’s reasoning about “explicit cancellation from outside the fibers” that adds complexity?

FWIW “explicit cancellation from outside the fibers” is the only thing I would call cancellation. A fiber being able to raise its own exceptions is just a “normal” way for a fiber to finish running.

In general, cancellation is very risky in the presence of resources, so the only simple way to achieve it is to implement it at the ownership boundary for resources. In most cases that means implementing cancellation at the process level. (In a language like Rust, where ownership of resources is being tracked, you can create resource boundaries within a single process – and this is roughly what their poisoning stuff does). This also holds for other forms of asynchronous termination as well – I’m looking at you Out_of_memory exception.

3 Likes

@bluddy it’s useful but very hard to do. Afaik people do without it, in general, or rather they use explicit cancellation contexts/tokens.

1 Like

Three examples from XAPI (which is a long-running daemon using just threads, and not Lwt):

  • canceling a task sets a flag. At various points in the task this flag is checked and an exception is raised (e.g. before starting a potentially long-running task, in various periodic progress callbacks, etc.). This is very coarse but works when you want to cancel a task that can take tens of minutes (e.g. exporting a VM to disk), even if the task is not cancelled immediately it will terminate in a useful amount of time.
  • forcing a remote client to cancel what it is doing and reconnect is done in a very crude way: immediately mark its session id as expired, causing all further API (and sub-task) requests to fail, close the connection and force it to acquire a new session id. Again, the cancellation is not immediate, but gets noticed at the next API call.
  • timeout on remote RPC calls is done by running it through an external stunnel process, and sending that process a SIGKILL when a timeout is reached.

All of these are rather “coarse”, and except for the SIGKILL do not take effect immediately, yet they are still useful. Even if Thread.kill was available and implemented via pthread_cancel I’d probably not use it due to the many bugs and corner cases that could arise from trying to terminate an operation that hasn’t implemented a proper cleanup (including C code).
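The flag-based scheme from the first bullet can be sketched as follows (all names invented; XAPI’s real code is of course more involved):

```ocaml
(* Sketch of coarse, flag-based cancellation: cancelling a task sets a
   flag, and the long-running task checks it at convenient points. *)
exception Task_cancelled

type task = { mutable cancelled : bool }

let cancel t = t.cancelled <- true
let checkpoint t = if t.cancelled then raise Task_cancelled

(* A task of 1000 steps that is cancelled after step 10; the
   cancellation is only noticed at the next checkpoint. *)
let run_task t =
  let progress = ref 0 in
  (try
     for step = 1 to 1000 do
       checkpoint t;
       progress := step;
       if step = 10 then cancel t  (* simulate an external cancel *)
     done
   with Task_cancelled -> ());
  !progress
```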

However if a mechanism was available at the language (or concurrency runtime) level that would provide cancelation at the next “point” then it’d probably be interesting to think about slowly converting XAPI to using it (e.g. next point where the need for GC-ing is checked, or even just pre-defined points like spawning a new fiber, or making any Fiber API calls).

There would need to be a way to thoroughly test this: cancellation is quite a rare event, and unpredictable by nature, and even just doing N cancellations in a loop won’t guarantee you’ve covered the corner case your users are going to hit when they try it out on their machines.
If a program has a finite, and enumerable amount of cancellation points then one could think about providing a test framework that explores each cancellation point at least once (e.g. by interposing between the OCaml Fiber API and application-under-test and telling it to cancel when invoked by the n-th location). Or using a fuzzer to try to explore as many of these states as possible if you could direct the fuzzer to care about the cancelation points mainly.

The cancelation operation itself can be quite costly in some cases, and you may want to consider different levels of cancelation:

  • graceful cancelation - try to terminate the task cleanly, by cleaning up and rolling back where possible (i.e. think of it like “SIGTERM for the fiber”)
  • hard cancelation - if the rollback process gets stuck or takes too long, it should be possible to cancel it in a way that still avoids leaking resources but could lead to data loss (think of it like “SIGKILL for the fiber”). Even SIGKILL is not guaranteed to be immediate if e.g. a kernel resource is stuck, but in the absence of bugs it is expected to be quite quick.

This could be implemented by providing 2 different Cancelation exceptions (or a boolean argument to the cancelation exception, letting exception handlers decide whether to do a quick or more thorough cleanup). E.g.: closing a file descriptor for the “quick” cleanup, and unlinking temp files for the “thorough” cleanup (but the “thorough” cleanup could get stuck if you are on NFS and the NFS server went away, or simply take a long time if your disk is very busy).
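The two-level idea with a single exception carrying the requested level could be sketched like this (all names are invented for illustration):

```ocaml
(* Sketch: one Cancelled exception carrying the requested level, so
   handlers can choose between a quick and a thorough cleanup. *)
type level = Graceful | Hard
exception Cancelled of level

let cleanup ~close_fd ~remove_temp_files = function
  | Cancelled Hard ->
      close_fd ()  (* quick: just drop the resource *)
  | Cancelled Graceful ->
      close_fd ();
      remove_temp_files ()  (* thorough: may block, e.g. on NFS *)
  | e -> raise e
```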

4 Likes

IME, cancellation is an inherent aspect of doing asynchronous programming. Many systems start out just assuming operations will be handled in a timely fashion, but once you go to production you have to start ensuring computations terminate in a timely manner, and it becomes pretty pervasive. I think offloading this to users:

  1. Will make it harder to interoperate between libraries because they have implemented their own ad-hoc cancellation.
  2. Will be more error prone because the patterns for doing safe cancellation will not be centralized.