Understanding cancellation (in eio)

@bluddy it’s useful but very hard to do. Afaik people do without it, in general, or rather they use explicit cancellation contexts/tokens.

1 Like

Three examples from XAPI (which is a long-running daemon using just threads, and not Lwt):

  • canceling a task sets a flag. At various points in the task this flag is checked and an exception is raised (e.g. before starting a potentially long-running task, in various periodic progress callbacks, etc.). This is very coarse but works when you want to cancel a task that can take tens of minutes (e.g. exporting a VM to disk), even if the task is not cancelled immediately it will terminate in a useful amount of time.
  • forcing a remote client to cancel what it is doing and reconnect is done in a very crude way: immediately mark its session id as expired, causing all further API (and sub-task) requests to fail, close the connection and force it to acquire a new session id. Again, the cancellation is not immediate, but gets noticed at the next API call.
  • timeout on remote RPC calls is done by running it through an external stunnel process, and sending that process a SIGKILL when a timeout is reached.

All of these are rather “coarse”, and except for the SIGKILL do not take effect immediately, yet they are still useful. Even if Thread.kill was available and implemented via pthread_cancel I’d probably not use it due to the many bugs and corner cases that could arise from trying to terminate an operation that hasn’t implemented a proper cleanup (including C code).

However if a mechanism was available at the language (or concurrency runtime) level that would provide cancelation at the next “point” then it’d probably be interesting to think about slowly converting XAPI to using it (e.g. next point where the need for GC-ing is checked, or even just pre-defined points like spawning a new fiber, or making any Fiber API calls).

There would need to be a way to thoroughly test this: cancellation is quite a rare event, and unpredictable by nature, and even just doing N cancellations in a loop won’t guarantee you’ve covered the corner case your users are going to hit when they try it out on their machines.
If a program has a finite, and enumerable amount of cancellation points then one could think about providing a test framework that explores each cancellation point at least once (e.g. by interposing between the OCaml Fiber API and application-under-test and telling it to cancel when invoked by the n-th location). Or using a fuzzer to try to explore as many of these states as possible if you could direct the fuzzer to care about the cancelation points mainly.

The cancelation operation itself can be quite costly in some cases, and you may want to consider different levels of cancelation:

  • graceful cancelation - try to terminate the task cleanly, by cleaning up and rolling back where possible (i.e. think of it like “SIGTERM for the fiber”)
  • hard cancelation - if the rollback process gets stuck / takes too long it should be possible to cancel it in a way that still avoids leaking resources, but could e.g. lead to data loss: e.g.think of it like “SIGKILL for the fiber”. Even SIGKILL is not guaranteed to be immediate if e.g. a kernel resource is stuck, but in the absence of bugs it is expected to be quite quick.

This could be implemented by providing 2 different Cancelation exceptions (or a boolean as argument to the canceled expression and exception handlers can decide whether to do a quick or more thorough cleanup). E.g.: closing a file descriptor on the “quick” cleanup, and unlinking files, deleting temp files for the “thorough” cleanup (but the “thorough” cleanup could get stuck if you are on NFS and the NFS server went away, or simply take a long time if your disk is very busy).

4 Likes

IME, cancellation is an inherent aspect of doing asynchronous programming. Many systems startout with just assuming operations will be handled in a timely fashion but once you go to production you have to start ensuring computations terminate in a timely process and it becomes pretty pervasive. I think offloading this to users:

  1. Will make it harder to interoperate between libraries because they have implemented their own ad-hoc cancellation.
  2. Will be more error prone because the patterns for doing safe cancellation will not be centralized.

Thank you for your great example, @edwin. I think this makes me think even more that this level of complexity needs to be handled (somehow) only in a cross-library manner. Imagine the type of lock-in that comes from tailoring your system’s behavior to a particular library’s cancellation semantics, or the level of pain should those semantics be changed between versions.

Indeed. But it’s almost as bad if different libraries implement different cancellation semantics.

It seems like cancellation itself is more complex than any other feature of concurrency libraries. That’s pretty scary. It really raises a question mark over the notion of doing this stuff in a decentralized manner (i.e. not in the stdlib) if we’re ever going to have interoperability.

@gasche Thanks for your thoughtful questions to help us understand cancellation better.

  • “Optimization”: To keep your abstract example, when you build a disjunction of computations and one succeeds, couldn’t some others keep searching indefinitely (e.g. recessive properties) and thus need to be aborted for correctness?
  • “Transactionality”: I suspect that you make this requirement too strong. Transactionality is an abstraction that you can implement on top of non-transactional operations as long as you have “protection”.

In general, cancellation is very risky in the presence of resources, so the only simple way to achieve it is to implement it at the ownership boundary for resources. In most cases that means implementing cancellation at the process level.

@lpw25 Does it mean that you consider that a bracket combinator is not simple, or not enough to achieve resource-safety (e.g. because its limitations would be apparent too quickly)? I always considered it a good first step to promote ideas about resource management.

A bit of both. I think that bracket is not simple: it requires you to consider where you need masking and if you get that wrong you are left with a subtle race condition bug. I also think its limitations are genuine and that you will find yourself needing something more complex quite quickly.

I think the question of how to make it simple to write cancellation-safe code is interesting and worth exploring, but at the moment I remain mostly of the opinion that the cost of writing cancellation-safe code is not worth the benefits of cancellation.

1 Like

I have an example of cancelling of disjunctions of sideeffecting computations not just being an optimization. In the example, I depend on Lwt.cancel to cancel a looping thread that renders intermediate state to the screen, before new state is available for rendering - i.e. a loading-UI.

It would be an error if the loading-UI was rendered right after the UI resulting from the new state. Lwt.pick could have been used instead, also calling Lwt.cancel internally.

Cancelling could have been avoided if rendering everything in the same loop, therefore the same function - though with cancelling I can separate concerns, and I avoid polling the next_niseq_state_t promise in the render-loop. The frame_timings logic is also simplified this way.

let loop_receive_state ~term ~niseq_state_gen =
  let open Lwt.Infix in
  let rec aux prev_state frame_timings =
    let next_niseq_state_t = niseq_state_gen () in
    let timeout_t = Lwt_unix.sleep 2. in
    let intermediate_output_t = match prev_state with
      | None ->
        show_init_loop ~term 
      | Some niseq_state ->
        redraw_loop ~term ~niseq_state ~timeout:timeout_t
    in
    next_niseq_state_t >>= fun niseq_state ->
    let now = Unix.gettimeofday () in
    let seconds = truncate N_constants.fps * 1 in
    let frame_timings = (now :: frame_timings) |> CCList.take seconds in
    let fps_stats = Some (calc_fps_stats frame_timings) in
    Lwt.cancel timeout_t;
    Lwt.cancel intermediate_output_t;
    niseq_state |> render_state ~term ~fps_stats >>= fun () -> 
    aux (Some niseq_state) frame_timings
  in
  aux None []


I still have reservations about the capabilities aspect of Eio, but the structured concurrency part looks very nice.
Just a few notes, for future reference to readers of this thread (if I haven’t missed them being posted above already):

Another interesting post about structured concurrency and cancellation: 250bpm

A structured concurrency library in python: trio, which might be relatively similar to Eio’s switches in concept (esp since @talex linked this)?

Companion post to the trio blogpost: Timeouts and cancellation for humans — njs blog which is directly relevant to the current topic.

3 Likes

Concerning Trio - Nathaniels example for comparing this with other libraries, which apparently should be simple and in few lines of code, is implementing the happy-eyeballs algorithm: Nathaniel J Smith - Python Concurrency for Mere Mortals - Pyninsula #10 - YouTube

This algorithm can be written using Lwt in a more readable style, and in fewer lines of code - which uses the cancellation semantics of Lwt via Lwt.pick. Though the solution is a bit errorprone because of the never thread (this problem is present with nurseries too): Structured-concurrency libraries - #17 by rand

Nurseries seem equivalent to a concurrency monad - but where there is no way to start a thread except to combine it into the monad using a concurrency operator. Concretely, the nursery in Trio is just Lwt.choose.

So to make Lwt into structured concurrency: disallow Lwt.async and let running_thread = run () in <code hopefully using running_thread>. In this alternative semantics of Lwt - run () would never run if not combined into the returned monadic value, and would just be GC’d.

As I understand it - monads in Haskell work exactly like this - where a monadic value has no identity, and cannot be run right away, but is instead a recipe for a computation. This would be a safer solution than Lwt, and I guess, safer than nurseries too.


Edit: Would be interesting to have equivalent implementations of happy eyeballs in Eio, Lwt and Haskell to compare!

Not what you wish I’m afraid, but here’s how I believe you would go to implement it with affect (announced here).

IIUC, this happy eyeball algorithm just starts the tasks of a list every delay seconds and returns the first one that succeeds (if any) and aborts the others it already started at that point.

Something like this would do I think. I didn’t test it though, it may be wrong :–)

let happy_eyeballs :
    delay_s:float -> tries:(unit -> 'a option) list -> discard:('a -> unit) ->
    'a option
=
fun ~delay_s ~tries ~discard ->
  let either fst snd = match Option.join fst with
  | None -> Option.join (Fiber.join snd)
  | Some _ as v ->
      Fiber.abort snd; Option.iter discard (Option.join (Fiber.join snd)); v
  in
  let rec loop = function
  | [] -> None
  | t :: ts ->
      let t = Fiber.spawn t in
      let ts = Fiber.spawn (fun () -> Funix.sleep_s delay_s; loop ts) in
      match Fiber.first t ts with
      | Either.Left t -> either t ts
      | Either.Right ts -> either ts t
  in
  loop tries

Note that you can’t use Fiber.either here, because it may drop a value without letting you discard it.

1 Like

Yes - interesting to see an implementation using Affect too (:

So, I havn’t used Affect, but it looks like you try to discard on None all the time - as you call Fiber.abort right before.

Concerning the interface (and related to the previous) - I’m not too fond of options for this kind of thing - as programmers tend to just use e.g. Option.join to get rid of options - instead of handling their unique semantics explicitly. In my Lwt happy eyeballs implementation, I also made a bug because of options getting merged - where I missed a case to handle. This got fixed right away when the result-categories got explicit using polymorphic variants.

The implementation is also not quite right concerning the delays - you need to continue with the next request right away if the first one fails - where you always delay all subsequent requests.

Fiber.abort will trigger cleanup code in the fiber if it’s still running. However it has no effect if the fiber is already terminated.

In this case, assuming the values we try to determine are ressources, we have two values here, so we need to discard one (btw. it’s unclear to me whether the Lwt code you linked to really handles this, but I don’t remember how lwt cancellation works, except I found it quite broken when I looked into it a few years ago).

Okay will try to adjust that when I get some time (I just based my implementation on that fancy picture :–)

I don’t see from Affects documentation how Fiber.abort triggers cleanup? If the thread was wrapped in Fun.protect?

Yes - but in Affects documentation it states that:

val join : 'a t -> 'a option
(** join f waits for f to terminate and returns its value. 
    This is None in case f aborted and Some v in case 
    f returned with value v. *)

… so as you abort first, and then do Option.iter discard - you never run discard?

No, I don’t handle resources in the Lwt code. Will try to code up a version that does

More precisely it raise Abort in the fiber function, that is in your ~tries functions (see the documentation preamble, the discussion here initiated by @gadmm may also be of intereset).

What you do with that is up to you. You can use Fun.protect but in that particular case you shouldn’t discard when you succeed :–) So it’s likely better do directly handle the exception with a try with for cleanup.

Again, you cannot abort what is already terminated:

val abort : 'a t -> unit
(** [abort f] aborts the fiber [f] (and its spawns). If [f]
    is already terminated this has no effect. *)

you can call Fiber.abort as much as you want on a fiber, if it already terminated with a value you do get that value on Fiber.join (i.e. there is no such axiom: Fiber.abort f => Fiber.join f = None)

Btw. I’m a bit unsure what you meant by that exactly. Should any failure entail directly poping a new connection attempt without delay ? I couldn’t find what you described in a quick read of the spec here. Can you point where this behaviour is described ?

Yes, that was at least what was done in Trio at line 10: https://youtu.be/i-R704I8ySE?t=2005
… and that is the primary thing that makes the algorithm interesting - else it seems too easy. According to the spec you linked, one should also put both lower and upper bounds on delays - but that feels like unnecessary complexity in this context.

Okay so it almost changes nothing in fact.

Basically wrap the trying fiber with a function that check for the result and try the next on failure (see attempt). You also need to use a queue rather than an immutable list. This leads to:

let happy_eyeballs' :
    delay_s:float -> tries:(unit -> 'a option) list -> discard:('a -> unit) ->
    'a option
=
fun ~delay_s ~tries ~discard ->
  let either fst snd = match Option.join fst with
  | None -> Option.join (Fiber.join snd)
  | Some _ as v ->
      Fiber.abort snd; Option.iter discard (Option.join (Fiber.join snd)); v
  in
  let tries = Queue.of_seq (List.to_seq tries) in
  let rec next () = match Queue.take_opt tries with
  | None -> None
  | Some t ->
      let attempt () = match t () with None -> next () | Some _ as v -> v in
      let t = Fiber.spawn attempt in
      let ts = Fiber.spawn (fun () -> Funix.sleep_s delay_s; next ()) in
      match Fiber.first t ts with
      | Either.Left t -> either t ts
      | Either.Right ts -> either ts t
  in
  next ()

So you continue with next right away if the current thread fails - but it seems like this leads to several interlaced next stacks; each using the delay_s but concurrently. So from the outside, you can end up with delays that are much shorter than delay_s if threads fail.

Your analysis is correct but I fail to understand the behaviour you want exactly (I just implemented “any failure entails directly poping a new connection attempt without delay”). Can you describe it in formal terms ?

Should attempt simply be a let rec that pops until it succeeds ?

So this version here implements sequences of attempts.

A sequence simply pops tries until it succeeds or gets aborted. In seqs, we spawn a fiber for a sequence and another fiber that delay_s before spawning a sequence and recursively.

Is that maybe what you had in mind ?

let happy_eyeballs''' :
    delay_s:float -> tries:(unit -> 'a option) list -> discard:('a -> unit) ->
    'a option
=
fun ~delay_s ~tries ~discard ->
  let either fst snd = match Option.join fst with
  | None -> Option.join (Fiber.join snd)
  | Some _ as v ->
      Fiber.abort snd; Option.iter discard (Option.join (Fiber.join snd)); v
  in
  let tries = Queue.of_seq (List.to_seq tries) in
  let rec seq () = match Queue.take_opt tries with
  | None -> None
  | Some t -> match t () with None -> seq () | Some _ as v -> v
  in
  let rec seqs () =
    if Queue.is_empty tries then None else
    let t = Fiber.spawn seq in
    let ts = Fiber.spawn (fun () -> Funix.sleep_s delay_s; seqs ()) in
    match Fiber.first t ts with
    | Either.Left t -> either t ts
    | Either.Right ts -> either ts t
  in
  seqs ()

A graphical depiction would be this:

1 Like

Cool - yes this fits the specs I was after. Though, there could be some trouble again if there was a lower bound on how often one can initiate a request, because of the parallel threads making requests. It seems straight forward to add to the Lwt version.


Edit: My friend who’s made a capabilities based language, where cancellation is also inspired partly by Trio, has also made a version here: firefly-boot/HappyEyeballs.ff at master · Ahnfelt/firefly-boot · GitHub
Edit: Link was changed