[question] Interrupting a process in a portable way

multicore
windows

#1

Dear all, excuse me if this is a very basic question.

I’d like to have two processes [note, full-blown processes not threads] such that one process is a “control” one, whereas the other is the “worker” one, doing potentially heavy computations.

At some point, the control process may decide that the worker is taking too long and would like to send it a signal so it cancels the running computation; ideally it would cancel the computation but get back to listening so it can try to preserve previous cached state.

What is the best way to do this in a portable way? I could of course send a signal to the worker, but I am not sure what the current constraints are on Windows.

Note: I am OK using only 4.08 and any external lib that would help.

Thanks! Emilio


#2

You can use a signal via the Unix.kill which is portable, although might have slightly different behavior depending on your system. For example, in native Windows installation it will call the TerminateProcess function, which to my understanding, will do what you want. It won’t allow you to use the Sys.sigkill, though.

However, on the unix system, like Linux, MacOS X, or Cygwin, it will end up using the the kill function. However, signal receiving in OCaml is pseudo asynchronous. Signals are checked every time a minor gc is done (or of course when a process is blocked), so if your process is in a computation loop, that doesn’t have any allocations it might end up not seeing the signal (unless it is unmaskable signal like sigkill). Note, it is a very rare condition in OCaml to have a loop that never allocates, most likely everything will work fine for you.


#3

If you just want to terminate the worker process, sending a signal through Unix.kill should work. Unlike what @ivg wrote, if the signal is not handled and its default behavior is to terminate the process, the process will be terminated immediately on receipt of the signal.

If you want the worker process to handle the signal so as to e.g. raise an exception, it will work only under Unix, and with the caveat that @ivg mentions, namely that the signal will be detected at some program points only, typically at allocation points (not just GC time).

Another portable solution is to do the polling yourself. Set up a pipe between the controller and the worker, and have the worker periodically check for available data in the pipe (e.g. a nonblocking read) and act accordingly.


#4

and one more correction to myself,

It’s the opposite, the only signal that Unix.kill accepts on Windows is Sys.sigkill. So basically, the most portable way would be to use Unix.kill Sys.sigkill and it wall equally well on any system.


#5

If you’re willing to use lwt, some of the work can be done for you. For example, lwt_process has a ‘spawn with timeout’ option, and @ivg’s own parallel library can set up a pipe between processes.


#6

Dear Ivan, Xavier, Yotam,

thank you very much for your answers, help is really appreciated!

I think I have a better understanding now on what the situation is regarding process interruption in different platforms.

Maybe I can add a bit more the concrete use case I have.

We are developing a user interface for “reactive” proof checking. In such model, the IDE forwards changes to a daemon which will then update the document and initiate checking of proofs.

Such checking is by nature speculative; when the user updates the document, ongoing proof checking has to be cancelled and new one has to be launched, in an incremental fashion.

In this case, the controller is a “Language Server Protocol” daemon for Coq. The controller never does any expensive computation and it is basically a polling loop, so it works fine.

On the worker end, we have full-fledged Coq processes, and they have a pretty fat state and like to do long, costly computations.

The main example of reactive proof checking is Isabelle/PIDE, which TTBOMK is implemented over a preemptive threading model. In this setting, speculative execution is carried out by futures, however these are cheap to cancel.

Indeed I think I’ve heard the PIDE developers say that it took a long time to fine-tune the runtime so thread cancellation worked properly among platforms.

Going back to the Coq case, it seems to me indeed that we are facing a hard problem, as these kind of reactive UI interfaces seem to be one of the few use cases where a multi-threaded implementation may work better than a message-passing model.

Even so, my guess is that at least for Linux, using signals to interrupt should work well enough: it is unlikely that a Coq process is stuck on a true non-allocating loop. But of course, as we are bootstrapping a new language server, we’d like to provide a good Windows experience as much as possible.

Unfortunately adding polling to Coq is not a good option; in fact, there is already a “polling” check in Coq (check_for_interrupt), and even if it is just a few instructions, it has an observable impact on performance; having to do a system call would likely be bad given the nature of Coq hot paths such as reduction. It also wouldn’t work in all cases, as not every hot path has been instrumented to call the poll function.

I’d guess that the way to proceed would be to use some heuristics so users have to opt-in for expensive computation to take place [at the risk of having to use the KILL hammer) and indeed, try to thin state and process startup time. That should provide a decent experience on Windows for a wide range of users.

With all of this into account, IMHO the following questions do remain:

  • are we missing some better possibility than what is described in the summary above [killing processes on windows, trying to signal, if still stuck kill on Unix] ?

  • is there some chance we could improve the Windows behavior in this aspect? Is there some existing Mantis issue / roadmap on this aspect?

    We would be happy to add some platform-specific code to enable a better interruption model on Windows, would this be doable in some principled way?

    We could even contribute some manpower to help with that if needed.

Best regards,
Emilio


#7

Thanks for the excellent thread.

I care less about portability than the original poster. But I’m wondering if any fans of the Jane Street libraries have any advice about how to do this in Async? For example, creating a process with Async.Process.create and then tearing it down with Core_extended.Process.kill does not work nicely – broken pipes triggering exceptions, etc.

I often work around this by doing all of the process management using non-Async bindings for the underlying system calls in a separate binary. But it’d be nice to pull it all into Async.


#8

If the process that you’re trying to stop doesn’t cooperate, then you have to rely on a higher authority and ask the operating system to intervene. In other words, you have to use signals. All other methods of implementing inter-process synchronization rely on some sort of cooperation from the victim process.

So we have two venues, (a) enable some sort of cooperation and try to die gracefully, (b) just do the heavy killing, but try to minimize process bootup time. In the first case, you should be able to interrupt the process without killing it, so that you can resume from the previous consistent state. In the latter case, you kill the process altogether but you may try to minimize the warm-up period.

Minimal cooperation

We will be running a watchdog thread that monitors some interprocess lock (it could be file lock, call to select, or just a read from a socket) and the worker thread. We will rely on the OCaml asynchronous events to deliver the notification from the watchdog.

OCaml uses signals to deliver asynchronous events. OCaml signals are represented as a volatile flag which is set whenever a real signal has occurred and checked every time an allocation is made. If signals are pending then the signal handler is called, which then calls arbitrary OCaml code. In the code, you can raise an exception, which could be caught somewhere upper in the stack so that you can resume to the last consistent state. (That basically means, that let x = Int 42 may raise an exception).

We will use system threads to implement the watchdog. Since the watchdog thread will release the master lock, it won’t interfere with the worker thread until it is woken up. The operating system will provide the preemption (I hope, at least in Linux, it will). Once, the watchdog thread is ready, it will eventually get the CPU time, leave the blocking section, and set the notification that a signal is pending. The OS then will pass the control to the worker thread that will notice the signal on the next allocation, call the handler which will raise an exception, et voila.

The tricky part here, is that on Windows we can’t really send signals through OCaml, as the signal function only allows for sigkill. So we should go to C and implement our watchdog as a C program using the caml_set_signal function directly.

Minimizing the warm-up time

In this setup, we will have a manager process that will maintain a pool of worker processes and will dispatch the requests from the master process and kill bad workers, if the master so desires. The warm up minimization could be achieved by preallocating the workers (so that when a new worker is required, we won’t wait for it), and by pushing some state (that is never retracted) to the manager process, so that every time a manager spawns a new worker the latter will inherit the whole state from the manager, so that the state will be copied to the clone. You can extend this scheme adding an extra layer of indirection. The manager process creates queens, each queen acts as a manager in the previous scheme, but different queens can have different states. Using this scheme, you can create new queens at some backtracking positions, so that new states created from this position will startup faster.

Implementation-wise, we will be using Unix.fork on Unix-like systems. On Windows, the story is not that easy. Windows doesn’t have fork, so you have to emulate it somehow. It won’t be as effective as Unix of course. The easiest way would be just to use Cygwin. Or you can use Unix.create_process and somehow pass the state to the child (using mmap or similiar mechanism).