Recommendations for a general purpose orchestrator

I’m looking for recommendations for a good general-purpose orchestration system.

Jobs are submitted and need to be distributed to workers. We may want to check on a job’s status, get an error back if something failed, get results back, and so forth. These are general-purpose jobs and don’t involve Kubernetes/Docker. Some jobs run periodically while others are user-initiated.

  • I’m looking for something featureful and robust
  • Something built in OCaml would obviously be great, but I’m open to something in other languages. There is so much similar software available that I’m overwhelmed by the options

Here is what I found for OCaml

I’m sure there are others. Please tell me if you have any favorite orchestrators!

I used to use PBS (Portable Batch System) for this, typically combined with NFS for a shared filesystem. On “doorstops”, as we used to call ’em back in the day.


I have only used Nomad for personal toy projects but it has been a pleasant experience. It does seem to support your requirement of running non-containerized workloads: Nomad by HashiCorp

I have a somewhat complete example setup here: GitHub - rbjorklin/nomad-demo: Playing around with Hashicorp Nomad
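For a concrete flavour of what a non-containerized Nomad job looks like, here is a minimal sketch using the raw_exec driver with a periodic schedule. The job name, datacenter, and command path are placeholders, not taken from the demo repo:

```hcl
# Minimal batch job sketch: runs a plain executable (no container) on a
# cron-style schedule. All names and paths here are placeholders.
job "nightly-report" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "0 3 * * *"   # every day at 03:00
    prohibit_overlap = true
  }

  group "report" {
    task "generate" {
      driver = "raw_exec"            # run directly on the host, no container
      config {
        command = "/usr/local/bin/generate-report"
      }
    }
  }
}
```

Drop the periodic block for user-initiated jobs and submit them on demand with `nomad job run`.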


Rework is my favorite (a little biased since I wrote it) 🙂
It is strongly Python-centric though. I wonder what the minimum requirements/features would be for something “generic” yet “not Kubernetes” (more like cron, but distributed)…


Note: from your description, you probably want OCluster rather than OCurrent:

OCurrent pipelines can submit jobs to OCluster (ocaml-ci does this, for example), but you can also just submit jobs directly using the CLI tool, OCaml API, or Cap’n Proto endpoint. The OCluster worker agent has built-in support for Docker and OBuilder job types, but you could add other types easily enough with a bit of OCaml coding.
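To make the submit-and-monitor flow concrete, here is a toy model (in Python for brevity; the names `submit` and `run` are illustrative, not OCluster’s actual OCaml API): a client submits a job, a worker runs it, and log lines are streamed back while the connection is up.

```python
# Toy model of the submit/stream flow: NOT the real OCluster API.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Job:
    script: str
    status: str = "pending"           # pending -> running -> finished
    log: List[str] = field(default_factory=list)
    result: Optional[str] = None

def submit(script: str) -> Job:
    """Hypothetical client call: hand a job description to the scheduler."""
    return Job(script=script)

def run(job: Job, on_log=print) -> None:
    """Stand-in for the worker: run the job, streaming each log line back
    over the live connection via the on_log callback."""
    job.status = "running"
    for line in ("checking out sources", "building", "done"):
        job.log.append(line)
        on_log(line)
    job.status = "finished"
    job.result = "success"

job = submit("opam install .")
run(job)
print(job.status, job.result)
```

The key point from the post above is that `run` only happens while the client’s connection is open; dropping it mid-run kills the job.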

The main thing to note is that you need to keep a TCP connection open to the scheduler while your job is running. Closing the connection kills the job, so it’s not useful for jobs that run for days. But you do get logs streamed back to you in real time.


This feels like a limitation in some ways. What if the program that is submitting jobs crashes? When that program comes back, we want it to be able to check the status of the jobs it previously submitted. With this design, the jobs die if the submitter crashes…

Related question: let’s say the job completes and then the program crashes. Will it be able to check the status of a previously submitted job when it comes back online? Or are the jobs associated with the TCP connection?

P.S. On the other hand, I can see some situations where killing the job when the TCP connection breaks is advantageous. What if there were two kinds of jobs – persistent (not killed if the TCP connection drops) and ephemeral (the behavior you explained)?

Related question: let’s say the job completes and then the program crashes. Will it be able to check the status of a previously submitted job when it comes back online? Or are the jobs associated with the TCP connection?

The jobs are part of the TCP connection (managed by capnp) and currently jobs don’t have IDs you could use to refer to them. They’re more like references in a programming language.

The way we handle this in the CI is just to resubmit the job after a crash. Job results are cached, so you get the previous cached result back immediately.

For OBuilder, it also caches the steps individually: if the connection is lost while building, say, step 6, the builder kills the step-6 processes and aborts, but when you resubmit the same job description it resumes from the checkpoint at the end of step 5. We need this caching anyway, because lots of jobs start with the same steps (e.g. every PR that doesn’t change a project’s opam files will have the same “opam install …” step).
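The per-step caching can be modelled the same way (again an illustrative Python sketch, not OBuilder’s implementation): each step is keyed by the full prefix of steps before it, so a resubmitted job skips every prefix already built and resumes at the first uncached step.

```python
# Toy model of prefix-keyed step caching: NOT OBuilder's implementation.
step_cache: set = set()   # keys: tuple of steps up to and including this one
executed: list = []       # steps that actually ran (for demonstration)

def run_job(steps: list) -> None:
    prefix = ()
    for step in steps:
        prefix = prefix + (step,)
        if prefix not in step_cache:
            executed.append(step)     # actually run the step
            step_cache.add(prefix)    # checkpoint it

run_job(["opam update", "opam install .", "dune build"])
executed.clear()
# Connection lost before the last step finished; resubmit a longer job:
run_job(["opam update", "opam install .", "dune build", "dune test"])
print(executed)   # only the uncached step runs: ['dune test']
```

This also explains why many different jobs share work: any two jobs with the same opening steps hit the same cached prefixes.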

I should also mention that our cluster only has 30 workers, and we haven’t tested with more. If you have significantly more machines than that then it might not be suitable. On the other hand, it’s very easy to set up for a small cluster.


Indeed, we (@reynir and myself) have been developing this as a system with a small dependency cone and the ability to preserve artifacts. Our use case is a server with local workers (currently FreeBSD jails / Docker containers) connected over TCP: the worker registers with the server, requests a job (a shell script), and sends the log (stdout/stderr) and the build result (success/failure, including the leftover files) back to the server. The server then stores that locally or uploads it to a database (in our case, to GitHub - roburio/builder-web: Mirror of https://git.robur.io/robur/builder-web/).
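Roughly, the worker loop described above looks like this. It is a toy Python model of the flow only; the names are assumptions, not builder’s actual wire protocol:

```python
# Toy model of the register/request/report worker loop: NOT builder's protocol.
from collections import deque

queue = deque()   # server-side job queue (job = a shell script path)
results = []      # server-side store of reported results

def register_worker() -> None:
    pass          # the real worker does this over a TCP connection

def request_job():
    return queue.popleft() if queue else None

def report(job, log: str, success: bool) -> None:
    results.append({"job": job, "log": log, "success": success})

def worker_loop() -> None:
    register_worker()
    while (job := request_job()) is not None:
        # The real worker runs the script in a jail/container and captures
        # stdout/stderr; here we fake a successful run.
        report(job, log=f"ran {job}", success=True)

queue.extend(["./build.sh", "./test.sh"])
worker_loop()
print([r["success"] for r in results])   # → [True, True]
```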

A client is available to schedule jobs (regularly or manually). We use it heavily for our reproducible-builds infrastructure at https://builds.robur.coop – which is itself reproducible 🙂

It is still under development (see the issue tracker, e.g. add platform to jobs · Issue #15 · roburio/builder · GitHub, to support different workers on the same server). The worker–server and client–server connections are neither authenticated nor integrity-protected (since our setup is local, the server is not exposed on the network) – but that can easily be changed. If you happen to experiment with it, we’re eager to hear your feedback – open an issue if you have questions/remarks 😃
