Lwt already implements cancelation, so it is worth looking at the various problems and corner cases it brings up.
The latest documentation advises against using Lwt.cancel: it is difficult to guarantee that code still works correctly in its presence.
Nevertheless, the current Lwt cancelation algorithm and implementation are documented, and you can read more about its problems and corner cases here.
I think I have only ever used Lwt cancelation indirectly, through Lwt.pick (and Lwt_unix.with_timeout), which use cancelation internally to cancel either the timeout or the user-provided function (the idea being not to waste any more time computing something we are not going to need). However, it is not as easy as one might think:
- what if you defined some generic retry function that retries on all exceptions? (it will need to let the Canceled exception through instead of retrying on it, otherwise your computation won’t actually finish when the timeout is reached)
- what if your cleanup function takes a long time? (if it just closes a file descriptor, fine, but it could be an API call to a remote server, on which you may have forgotten to put a timeout, thinking the timeout on the outer call was enough, etc.)
- what if your cleanup functions themselves raise exceptions? where do those go?
- there is no good way to cancel some operations: you may be past a “point of no return” where you have already committed to finishing the operation, and canceling at that point will leave the system in a somewhat broken state. (E.g. maybe the reason you hit the timeout is that you were running the cleanup function of some other computation; canceling that means you leak file descriptors, while not canceling it means you exceed your timeout.) This needs some way to mark regions of code as not cancelable, or extra care that the most critical part of the cleanup handler still runs even if the handler itself is canceled (e.g. if you tried to log something in the cancel handler, cancel the logging but still close the file descriptor).
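The first bullet can be sketched in plain OCaml. This is a hypothetical illustration, not Lwt’s API: `Canceled` here stands in for `Lwt.Canceled`, and `retry` is an invented combinator. The key point is the first match case, which re-raises the cancelation exception instead of treating it as a retryable failure.

```ocaml
(* Hypothetical sketch of a generic retry combinator. [Canceled] stands
   in for Lwt.Canceled; the names are illustrative, not Lwt's API. *)
exception Canceled

let rec retry ?(attempts = 3) f =
  try f () with
  | Canceled -> raise Canceled  (* never retry a canceled computation *)
  | _ when attempts > 1 -> retry ~attempts:(attempts - 1) f
  (* any other exception on the last attempt propagates unchanged *)

let () =
  (* a transient failure is retried until it succeeds *)
  let tries = ref 0 in
  let flaky () = incr tries; if !tries < 3 then failwith "transient" in
  retry flaky;
  assert (!tries = 3);
  (* a canceled computation is NOT retried: it runs exactly once *)
  let ran = ref 0 in
  (try retry (fun () -> incr ran; raise Canceled) with Canceled -> ());
  assert (!ran = 1)
```

Without the explicit `Canceled` case, the wildcard branch would swallow the cancelation and keep retrying, which is exactly the failure mode described above.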
One might want to consider canceling resources instead of (just) computations: give each resource a flag recording that its use has been canceled. E.g. close the file descriptor and mark its wrapper type as canceled (this requires the wrapper to be more than a mere int, but I’d happily pay that price to avoid double closes).
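A minimal sketch of such a wrapper, assuming a hypothetical `Fd` module (the names and the `Resource_canceled` exception are invented for illustration). The flag makes `cancel` idempotent, which is what rules out double closes:

```ocaml
(* Hypothetical resource wrapper carrying a cancelation flag. *)
exception Resource_canceled

module Fd : sig
  type t
  val make : int -> t
  val cancel : t -> unit            (* close + mark canceled, idempotent *)
  val use : t -> (int -> 'a) -> 'a  (* fails loudly once canceled *)
end = struct
  type t = { fd : int; mutable canceled : bool }
  let make fd = { fd; canceled = false }
  let cancel t =
    if not t.canceled then
      (* real code would also close the underlying descriptor here,
         exactly once, which is what avoids double closes *)
      t.canceled <- true
  let use t f =
    if t.canceled then raise Resource_canceled
    else f t.fd
end

let () =
  let fd = Fd.make 4 in
  assert (Fd.use fd (fun n -> n + 1) = 5);
  Fd.cancel fd;
  Fd.cancel fd;  (* second cancel is a no-op, not a double close *)
  match Fd.use fd (fun n -> n) with
  | _ -> assert false
  | exception Resource_canceled -> ()
</imports>
```

Because `use` is the only way to reach the underlying descriptor, every operation goes through the “am I canceled?” check.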
Then a generic retry function can check the cancelation state of the resources it uses: if any of them has been canceled, it does not retry, and raises the Canceled exception instead.
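Combining the two ideas, a retry loop over a flagged resource might look like the following sketch (again with invented names; the `resource` record is a stand-in for a real wrapper type):

```ocaml
(* Hypothetical retry loop that consults the resource's cancelation
   flag: a canceled resource turns any failure into Canceled and stops
   the retries. *)
exception Canceled

type resource = { mutable canceled : bool }

let rec retry_on ?(attempts = 3) r f =
  if r.canceled then raise Canceled;
  try f r with
  | _ when r.canceled -> raise Canceled   (* canceled mid-attempt: stop *)
  | _ when attempts > 1 -> retry_on ~attempts:(attempts - 1) r f

let () =
  let r = { canceled = false } in
  let n = ref 0 in
  let job r' =
    incr n;
    if !n = 2 then r'.canceled <- true;  (* cancelation arrives during attempt 2 *)
    failwith "transient"
  in
  (match retry_on r job with
   | _ -> assert false
   | exception Canceled -> ());
  assert (!n = 2)  (* no third attempt was made *)
```

The failure that would normally trigger a retry is converted into `Canceled` as soon as the resource flag is set, so the computation winds down promptly without any asynchronous exception being injected.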
And anything that tries to use a canceled resource will run into exceptions, and keep running into them, until it eventually finishes (unless the various cleanup and retry logic in the user code has managed to create an infinite loop).
This avoids the need to raise asynchronous exceptions, and ensures the resource really is not used anymore after it has been canceled: even if some user code has an overly broad try/catch that swallows the Canceled exception, any further attempt to use the resource will run into an error, and a fairly obvious one (“resource canceled” instead of EBADF).
Of course this can be generalized to resources other than file descriptors.
This is a bit similar to Eio’s Switch module, but at a somewhat lower level: each resource operation would perform the “am I canceled?” check itself, instead of the outer Switch module performing it. In fact, that check could be performed by the scheduler itself.
It might be interesting to provide cancelation as a layer that can be interposed between any backend and the Eio core: Eio could supply a default implementation, but allowing the user to interpose their own cancelation semantics might be better, at least for canceling resources.