Understanding cancellation (in eio)

Continuing a discussion offshoot from How to “block” in an agnostic way?, I have some naive questions about cancellation, in Eio and I guess possibly in general. (I don’t have much concurrent programming experience.)

Discontinuing Multicore continuations

With effect handlers, a nice realization of the Multicore OCaml folks is that the operation of “dropping” a fiber (delimited continuation), what is sometimes written discontinue k, in fact corresponds to resuming the continuation with an exception instead of a normal value: you write, say, discontinue k Exit, providing an explicit exception. This is nice (much better than just invalidating the continuation and moving on) because it gives a fighting chance to resource-cleanup code:

let input = open_in path in
let@ () = Fun.protect ~finally:(fun _ -> close_in input) in
let v = perform Some_effect in
...

If perform Some_effect discontinues the continuation with some exception exn, the code behaves as if perform Some_effect was turned into raise exn, and in particular the input channel will be closed.

(Aside: I previously thought of exceptions as a special case of input handler whose continuations are never resumed. This idea from the Multicore folk was surprising to me because it gives a special role to exceptions, in a way that does not feel like a hack but rather a graceful interaction between the two features. I’m still not fully sure whether my initial intuition was wrong, there is a deep reason to give a special status to 0-ary effects, or I’m missing something else – probably the latter.)

Cancellation?

Could we use the same approach for cancellation? Do we still need the set_cancel_fn business?

The Eio README explains that if an Eio fibre runs Fibre.yield (), and gets cancelled while it is waiting, this Fibre.yield () expression will raise a Cancelled exception. This sounds like the same idea as discontinuation above, great!

But then @talex5 explains that there is an imperative protocol (set_cancel_fn, clear_cancel_fn) to register cancellation-time logic into a “fibre context”. Why do we need this additional complexity? Naively I would expect

Fibre.set_cancel_fn my_context (fun exn -> <cleanup logic>);
<rest of the code>

to be expressible as

try <rest of the code>
with exn ->
  <cleanup logic>;
  reraise exn

What am I missing?

1 Like

Sorry, I didn’t explain that well. The set_cancel_fn is used when implementing a primitive operation (normal user code does not call it directly). Its purpose is to make whatever operation is blocking the fibre stop early.

e.g. say we have:

Fibre.both
  (fun () -> Net.accept ...)
  (fun () -> failwith "crash!")

We want to cancel the low-level accept operation quickly and get on with reporting the exception, rather than waiting until the next incoming connection arrives.

Internally, Net.accept will be something like:

let accept ... =
  let fn ctx enqueue =
    (* [fn] runs in the scheduler's context.
       The calling fibre is now suspended. *)
    let op = Low_level.start_accept ... in
    Fibre_context.set_cancel_fn ctx
      (fun ex -> Low_level.cancel op ex);
    Low_level.when_done op @@ fun v ->
    ignore (Fibre_context.clear_cancel_fn ctx : bool);
    enqueue v
  in
 perform (Suspend fn)

So what happens is:

  1. We call Net.accept, which performs Suspend.
  2. This switches to the scheduler’s context, giving it a continuation for the user’s fibre.
  3. The scheduler calls fn with the fibre context and enqueue (which doesn’t actually resume the continuation immediately, but just adds it to a queue).
  4. fn somehow starts the low-level accept operation, using a non-effect API. It registers callbacks to handle both cancellation and success.
  5. Fibre.both decides to cancel the first fibre. It removes the cancel function from its context and calls it.
  6. That calls Low_level.cancel op ex to ask the operation to stop. In this case, we assume that will call the when_done callback later with Error ex.
  7. When the OS has finished cancelling the accept, it calls the when_done callback, which ensures the cancel function is removed (in our case it already is) and then enqueues the result (a Cancelled exception).
  8. When that gets to the head of the run queue, the original Net.accept now raises Cancelled, as desired.

The above isn’t the only way you could do it, though. Something like this should work too:

let accept ... =
  let p, r = Promise.create () in
  let op = Low_level.start_accept ... in
  Low_level.when_done op (Promise.resolve r);
  try
    Promise.await p
  with Cancelled _ as ex ->
    Low_level.cancel op ex;
    Cancel.protect (fun () -> Promise.await p)

But of course Promise is keeping its own somewhat complex data-structures, so this wouldn’t be as efficient, and internally Promise.await would still be using set_cancel_fn.

BTW, thanks for investigating all these aspects @gasche - I think these threads are really useful!

Thanks for the extra explanations! After thinking for a bit, I had the following suggestion:

let accept ... =
  let op = Low_level.start_accept ... in
  match perform Suspend (fun cb -> Low_level.when_done op cb) with
  | v -> v
  | exception Cancelled ->
    Low_level.cancel op;
    raise Cancelled

I’m not sure whether you consider that this has the same performance downside as your Promise version; the code structure is similar, but I don’t see an obvious overhead here.

What causes the perform Suspend to raise Cancelled in the first place? It looks like op is the only thing with access to the continuation, and it doesn’t know it’s being cancelled yet.

You could have the scheduler register a cancellation handler automatically that just resumes with Cancelled in all cases, but:

  • Not all operations can be cancelled.
  • You probably want to wait for the cancellation to complete.
  • There is potentially a long time after the Cancelled result is enqueued and before the fibre is resumed during which the original operation might complete.

e.g. with Stream.take, either:

  • the operation succeeds, and an item is removed from the stream, or
  • it is cancelled, and no item is removed.

To do that, we need to clear the cancel function before actually removing the item. Whereas here, after catching Cancelled we’d have to somehow discover that the operation had already succeeded and return the success result instead.

The scheduler: when cancellation happens, the scheduler traverses the cancelled fibers and discontinues them with discontinue fibre Cancelled.

Not all operations can be cancelled.

I would expect the Cancelled exception to traverse resource handlers in the fiber stack until it reaches the top, and then give back control to the scheduler.

You probably want to wait for the cancellation to complete.

I’m not sure what you mean. If Low_level.cancel op is blocking, then the whole domain waits. If it’s asynchronous, the code above could be tuned to Suspend again into the scheduler (and resume the discontinuation of the fiber later).

There is potentially a long time after the Cancelled result is enqueued and before the fibre is resumed during which the original operation might complete.

If the point of Cancellation is to save work, this isn’t an issue: if you had enough work to do in the first place that the operation completes, cancelling wouldn’t have reduced the critical path. I guess this can also happen if many fibers are cancelled as a group: the scheduler could have enough work to do that it doesn’t get to cancel some of them.

Are the other benefits to cancellation than saving time? You seem to have a mental model where “cancelling the current async call as quickly as possible” is higher-priority than “resource cleanup for the fiber”. Why? If we want to model that in the code, we could do as follows:

let accept ... =
  let op = Low_level.start_accept ... in
  match perform Suspend (fun cb -> Low_level.when_done op cb) with
  | v -> v
  | exception Must_cancel ->
    Low_level.cancel op;
    perform Has_cancelled

where Has_cancelled parks the fiber to quickly go cancel the other fibers, and then later discontinue it with a different Cancelled exception for the “normal resource cleanup” part.

So you have a transactional specification in mind: either the operation succeeds (and the user code runs with its result), or it gets cancelled and it’s as if it was never started? This makes sense, and I guess it explains some of the design.

I must say that I’m still unconvinced by the set/clear_cancel_fn interface (I don’t understand why something so imperative is required). I would think of two options:

  1. Cancellation logic is handled by the “user”, along the lines sketched above. Upside: the Suspend interface does not care about cancellation. Downside: each user has to get it right, and in particular the “transactional” aspect is not easy.
  2. Cancellation logic is handled by the scheduler. Could the user pass a cancellation function of type (unit -> (unit, 'a) result) to the scheduler as an extra Suspend argument? Either cancellation succeeds, the operation was correctly aborted before being committed, or cancellation failed because a result is available, and the fiber must continue.
1 Like

I know this thread is Eio specific but I’m wondering if there is a preference for one form over the other based on a few things:

  1. What about JS support?
  2. @dbuenzli brought up that a supposed reason to go effects is being more composable.

With that, maybe keeping it out of the scheduler is better for both points?

I use my own concurrency monad, in large part because I was dissatisfied with how Lwt and Async did it, and my framework has a layered approach with a Future’s library an independent library, and my Future’s library supports cancellation. This is nothing original to me. But the nice result is the scheduler’s I use don’t have to care about cancellation which also means going to JS is pretty straight forward as the cancellation comes with the Future’s library and Just Works.

This is starting to sound like the need to program transactionally in the presence of cancellation with async exceptions (masking & co.). A priori, everything said about it transposes from polling (checking for cancellation status) at allocation points to polling at perform points. (One difference is that perform is more explicit and controllable than allocations, which let one hope that it could have been possible to program with a very explicit control flow, C-like style, but apparently this is not seen as realistic here.)

In the memprof-limits model, tasks that have to be cancelled together are given a common cancellation token which is a boolean flag which is checked periodically by the task (the name is taken from .NET’s cancellation tokens, but the polling part is done automatically, every 10k allocated words on average). Masking lets you program transactionally by turning off the polling in some critical sections.

The problem with IO exists in both cases.

  • With OCaml’s signals, you can get EINTR as a result of system calls, and it is up to the task to poll or restart the operation.
  • With memprof-limits, IO operations would run to completion and polling is performed independently from IO, so you do not need to wrap IO operations in EINTR-style loops. (The use-case is more CPU-bound computation.)

Both approaches avoid the sharing of state mentioned by Gabriel by putting the task in charge of polling. However I wonder if a sensible design might get the best of both worlds by sharing a minimal amount of state between the task and the scheduler. Essentially take the memprof-limits model (with polling at perform point instead of allocation point), have the scheduler in charge of the cancellation token and the masking state, and let the task access masking with high-level combinators.

@orbitz that sounds interesting, can you include a link to your work?

Intuitively cancellation (as I understand it from the Eio docs and the discussion here) is related to the scheduler, not completely independent: it is the scheduler that keeps track of the structural relations between fibres (which ones must be cancelled together), so cancellation information comes from a fibre to another fibres through the scheduler. From a fibre perspective, “I want to suspend” is information it gives to the scheduler, but “I’m being cancelled” is information coming from the scheduler.

Maybe it’s possible to keep track of the “fibre structure” (switches, structured concurrency) outside the scheduler, to handle cancellation separately? Is this what your approach does?

Note: I’m asking about cancellation from the perspective of an agnostic Suspend primitive. I wonder if its interface must take cancellation into account, or whether it can be handled separately. (In Eio Suspend talks about cancellation, but in a rather roundabout way in my opinion). Looking at Eio code, there seems to be two different use-cases for Suspend:

  • High-level, synchronous usage: when you wait on a programmatic event that will occur in another domain or on another fibre; for example “I wait until someone writes a value to this channel”. You suspend to wait on “something else”, and there isn’t anything special cancellation-wise.
  • Low-level, asynchronous usage: when you create an asynchronous operation by interacting with a low-level interface / the operating system (or libuv, etc.). This is the flavor of the code examples that were shared above in this thread. There we want to Suspend ourselves to give control back to the scheduler, but we also want to handle cancellation specifically (not just to cleanup the resources in our continuation) to cancel this asynchronous operation.

It’s not clear to me whether this low-level, asynchronous usage should be implemented “inside the scheduler” or “in user code”. In any case, this requires communication with the scheduler. In the Eio codebase, some operations are implemented in the same effect handler as the scheduler, while some operations are implemented “outside” the scheduler, but using low-level / private APIs that break the abstraction barrier.

I think it should be possible to implement “low-level asynchronous cancellation” purely on the user side but, depending on the requirements, this probably requires a more complex interaction protocol (with the scheduler) than just Suspend. But maybe it’s better to leave it to the scheduler to avoid having to expose this protocol. Or maybe a nice design would be a two-layer system, with a “high-level scheduling interface” (with built-in cancellation) implemented in terms of a “low-level scheduling interface” that implements cancellation outside the (low-level) scheduler.

Finally: I’m not sure if the Eio code right now corresponds to a two-layer system where the two layers where manually smashed/inlined for performance reasons (having a single handler), or whether the boundaries are unclear (to me at least) because the library grew organically and some things should be structured better.

@gadmm I wondered of course about the relation to your work on asynchronous exceptions and matching. But here note that we are in a cooperative setting where “asynchronous failure” (cancellation in our discussion) only happens on well-identified yield points. If we assume that the perform-er is always eventually given control back to cleanup itself (otherwise resource-safety code is broken, right?), then it can sort of implement user-level masking for a statically-delimited region of code by just being careful about collecting exceptions on each yield point and continuing with its non-interruptible logic before handling them.

This is not a realistic idea for the lower-level problem of asynchronous exception in OCaml code (which can occur at any allocation point, and now at poll points), so there we need masking in general. But maybe it is realistic when failures only occur on perform?

(I wonder if you have an example in mind where masking would be important for a user-level cooperative concurrency library.)

I took this discussion (esp. the Stream.take argument) as saying that being careful is not enough, hence the need for some mechanism which starts to look similar to what you need for truly asynchronous exceptions.

I may have misunderstood the example but I took Stream.take as such an example.

Also, for context (I mentioned this in my OCaml workshop talk), my work on asynchronous exceptions always took them as a metaphor for the need to avoid defensive programming, which in my mind would be already needed with the cancellation-at-perform-point model.

I made a dump of the repos I use for this in the attached link (no guarantee it will compile)

Future library. This is based on the Fut library created by @dbuenzli ages ago. I took what he did and kept going.

https://hg.sr.ht/~mmatalka/abb-dump/browse/src/abb_fut?rev=tip

An example scheduler:

https://hg.sr.ht/~mmatalka/abb-dump/browse/src/abb_scheduler_kqueue?rev=tip

I don’t know if there is much to learn from what I have, I’m a complete dilettante, hacking things together to get what I wanted working and I’m sure there is a lot to cringe at in here.

But the end result is what was important to me is I really wanted to be able to write the timeout function. You just have to us the equivalent of Fun.protect to handle resource cleanup.

I think that with direct-style there is probably something quite different going on in that there is not this Future value that can carry around code for what to do on a cancel.

Here is something I had missed previously. In the synchronous usage, the suspended computation may be resumed from two different places:

  • once the synchronous operation completes succesfully, the scheduler’s callback will get called and the scheduler will eventually resume the computation
  • or if the computation gets cancelled while it is suspended, it will resume with a Cancelled exception before the synchronous operation completes.

In the second case, we want to invalidate the scheduler callback, to avoid having the callback resume a computation that was already cancelled.

This can, again, be done purely on the user side, but it’s again somewhat heavy.

let active = ref true in
match
  perform Suspend (fun cb ->
    let cb v = if !active then cb v else () in
    ...)
with
| Cancelled -> active := false
| () -> ...

On the other hand, it’s not too hard for the scheduler to take care of this: the scheduler can pass a callback to the suspension that is already cancellation-aware along these lines.

Yes. This often isn’t needed, but I don’t want the interface to rule it out.

It depends what you mean by “scheduler” here. Individual Eio backends don’t know much about cancellation. Fibre.fork gives the backend a context for the new fibre, and Cancel.cancel cancels it, invoking the cancellation functions. All the backend needs to do is associate the context with the fibre (via the Get_context effect). Of course, any IO operations the backend provides should support cancellation.

The Eio_null backend (which provides no IO, for teaching purposes) demonstrates this; it doesn’t mention anything about cancellation, but cancellation does work with it.

This still needs to support cancellation, as in the Stream.take example. We don’t want anything getting removed from the stream after the taking fibre has been cancelled.

  • Low-level, asynchronous usage: when you create an asynchronous operation by interacting with a low-level interface / the operating system (or libuv , etc.). This is the flavor of the code examples that were shared above in this thread. There we want to Suspend ourselves to give control back to the scheduler, but we also want to handle cancellation specifically (not just to cleanup the resources in our continuation) to cancel this asynchronous operation.

This one is less important for a shared API. Operations like Net.accept are implemented by the backend itself, so it can use whatever effect it likes.

Yes, this is key.

I remember your talk, and Eio’s cancellation is design based in part on your recovering (memprof-limits.recovering) guidelines. In particular, Eio remembers whether it has been cancelled rather than relying on the exception being preserved, as you suggested.

1 Like

I wonder if I’m slow to understand what cancellation is about, or if this is explained well somewhere and I missed this part of the documentation when quickly browsing around Eio docs. As I keep asking naive question in this thread or the other I discover new things, and I wish there was a way to learn them that was less time-consuming for @talex5 :slight_smile:

Over in the other thread, @kayceesrk wondered if cancellation could be expressed “minimally” by turning

type 'a resumer = 'a -> unit
effect Suspend : ('a resumer -> unit) -> 'a eff

into

type 'a resumer = 'a -> bool

where false is returned when the task was cancelled.

I find it strange with this API that the producer learns whether cancellation happened or not only after trying to resume the suspension with the result. I would expect an API where the producer checks that the suspension is still active first. But I’m not sure which is more correct.

Sometimes there is a race between a producer producing a resource, and a suspended consumer being cancelled. If the producer commits to giving the resource to the suspender, then the suspender must be resumed for the “transactional” guarantees to hold. I guess the following scenarios could happen:

  1. The producer learns that the suspender was cancelled, but “too late” after it committed: it ignores cancellation information and resumes the suspender anyway; or
  2. The producer needs to be careful to not commit until it learns for sure that the suspender will resume. It may need to explicitly keep the resource around for another consumer if it learns that the supended consumer was cancelled.

(Above I have the “simple” cases of cancellation in mind, that I called “suspending on other computations”. In the “low-level case” of asynchronous operation, a symmetric situation arises. If the suspender decides to cancel the asynchronous operation, but cancellation fails because the operation has in fact been committeed, then the computation must continue by resuming the suspension as if cancellation hadn’t happened.)

Note: I believe that with the ('a -> bool) API we don’t have a choice between (1) and (2) anymore, we have to go with (2).

1 Like

Currently the Waiters API of Eio, which represents a set of consumed suspenders, use approach (2): for (non-shareable) resources, the producer must use wake_one waiters result, and gets back `Ok if a waiter was successfully resumed, or `Queue_empty otherwise, and then it needs to keep the resource in a private queue for future consumers.

This corresponds to the 'a -> bool interface suggested by KC, at least from the perspective of the producer. Internally Waiters has enough information to tell whether cancellation happened without calling the scheduler callback.

Having people tell me which bits are confusing is probably the least time-consuming way to get this all written down :slight_smile: And I see @patricoferris has already linked it in Add links to interesting OCaml 5 threads by patricoferris · Pull Request #7 · patricoferris/awesome-multicore-ocaml · GitHub.

It’s also quite possible that, as you say, “the library grew organically and some things should be structured better”, and these discussions will reveal how to do that.

2 Likes

Lwt already implements cancelation, so it might be worth looking at various problems and corner cases it brings up.
The latest documentation suggests to not use Lwt.cancel: it is complicated to guarantee code still works correctly in the presence of it.
Nevertheless the current Lwt cancelation algorithm and implementation are documented, and you can read more about its problems here, and corner cases.

I think I only ever used Lwt cancelation indirectly through Lwt.pick, (and Lwt_unix.with_timeout) which use cancellation internally to cancel either the timeout or the user provided function (where the idea is to not waste any more time calculation something we’re not going to need). However it is not as easy as one might think:

  • what if you defined some generic retry function that retries on all exceptions (it’ll need to ignore the canceled exception otherwise your computation won’t actually finish when the timeout is reached).
  • what if your cleanup function takes a long time? (if your cleanup function is just a file descriptor close, fine, but it could be an API call to a remote server, on which you may have forgotten to put a timeout thinking you have the timeout on the outer call, etc.)
  • what if your cleanup functions raise exceptions, where do those go?
  • there is no good to way to cancel some operations: you may be past a “point of no return” where you’ve already committed to finishing the operation and cancelling at that point will leave the system in a somewhat broken state (e.g. maybe the reason you’ve hit a timeout is that you were running the cleanup function of some other computation, and canceling that means you’d leak file descriptors. Not canceling it means you’ll exceed your timeout). This needs some way to mark regions of code as not cancelable, or take extra care that you really perform the most critical function of the cleanup handler even if the handler itself is canceled. (E.g. if you tried to log something in the cancel handler, cancel the logging but still close the file descriptor)

One might want to consider canceling resources instead of (just) computations. Have a flag on resources that their use has been canceled. E.g. close the file descriptor and mark its wrapper type as canceled (requires the wrapper to be more than a mere int, but I’d happily pay that price to avoid double closes).
Then if you got a generic retry function it can check the cancelation state of the resources it uses: if any of them got canceled then do not retry, and raise the canceled exception.
And anything that tries to use the canceled resource will run into exceptions and continuously raise exceptions until it eventually finishes (unless the various cleanup and retry logic has managed to create an infinite loop in the user code).
This avoids the need to raise asynchronous exceptions, and ensures the resource really is not used anymore after its been canceled (even if some user code has an overly broad try/catch which ignores the Canceled exception: any further attempts to use the resource will run into errors, and a fairly obvious one: resource canceled, instead of EBADF).
Of course this can be generalized to resources other than file descriptors.
This is a bit similar to Eio’s Switch module, but at a somewhat lower level: each resource operation would perform the “I am not canceled” check, instead of the outer Switch module performing that check. In fact that check could be performed by the scheduler itself.

It might be interesting to provide cancelation as a layer on top of backends, that can be interposed between any backend and the core Eio: although Eio could provide a default implementation here, allowing the user to interpose and implement their own cancelation semantics might be better, at least for canceling resources.

1 Like

This blog post from Simon Marlow about Asynchronous Exceptions showed up in my feed recently, despite being from 2017, and seems relevant to this cancellation discussion (mostly about pros/cons and getting code right):

https://simonmar.github.io/posts/2017-01-24-asynchronous-exceptions.html