DataKit: Orchestrate applications using a Git-like dataflow
DataKit is a tool to orchestrate applications using a Git-like dataflow. It revisits the UNIX pipeline concept, with a modern twist: streams of tree-structured data instead of raw text. DataKit allows you to define complex build pipelines over version-controlled data.
DataKit is currently used as the coordination layer for HyperKit, the hypervisor component of Docker for Mac and Windows, and for the DataKitCI continuous integration system.
TL;DR: The effort this week has gone into preparing for a release of the development
trees next week, and for the renaming of the project repositories to the Moby Project (see week17 for more on this topic).
The use of multi-stage builds vastly decreased the size of the published containers to
just 21MB, which makes deploying DataKit much more efficient!
Since the Docker Hub doesn’t support multi-stage container builds yet, the Dockerfile
used for autobuilds was inlined to ensure that the published images stay in sync (moby/datakit#521@talex5).
The root opam file is now called datakit.opam to make it fit with the other sub-packages (moby/datakit#525@talex5).
The GitHub bridge can now read its private key using Docker Swarm secret management, which avoids the need to spread the secret authentication token any further than necessary (moby/datakit#519@talex5@avsm).
The DataKit server no longer exposes an HTTP server. It was only used for debugging before, and Irmin 1.0 no longer supports it (moby/datakit#524@samoht@talex5).
As DataKit is used more in production, @talex5 has been steadily improving error handling to ensure that callers can handle failures more gracefully, either via retrying or logging exceptions (moby/datakit#526@talex5).
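To illustrate the Swarm secrets item above: Docker Swarm exposes each secret to a service as a file under `/run/secrets/<name>`, so a process can read its credential from disk rather than receiving it through the environment or command line. The following is a minimal Python sketch of that pattern, not the GitHub bridge's actual code; the `read_secret` helper and the secret name used in the usage note are hypothetical.

```python
from pathlib import Path

# Docker Swarm mounts each secret granted to a service as a file here.
DEFAULT_SECRETS_DIR = Path("/run/secrets")


def read_secret(name, secrets_dir=DEFAULT_SECRETS_DIR):
    """Return the contents of a mounted secret without trailing newlines.

    `name` is whatever the secret was called when it was created; the
    names used with this helper are hypothetical examples.
    """
    path = Path(secrets_dir) / name
    try:
        return path.read_text(encoding="utf-8").rstrip("\n")
    except FileNotFoundError:
        # Fail loudly rather than continuing with an empty credential.
        raise RuntimeError(f"secret {name!r} is not mounted at {path}") from None
```

A service would call something like `read_secret("datakit-github-token")` once at startup (the name is an assumption for this sketch), so the token only ever lives in the secret store and in the memory of the process that needs it.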
PRs with activity
@djs55 uses DataKit in Docker for Mac for configuration management, and so he proposed a scheme to make branch handling more robust for real-world use of this feature. His PR covers the case of software upgrades and user-supplied overrides for particular configuration keys (moby/datakit#523).
@avsm requested that DataKit CI be able to monitor a complete GitHub organisation and add monitoring hooks by watching the right events. @samoht proposed a fix for this in moby/datakit#419, but it has become outdated due to upstream changes, so he is rebasing it.
@samoht is also working towards making DataKit log less verbose commit messages, to reduce the size of the state repository (moby/datakit#476@samoht).
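The configuration scheme mentioned above for Docker for Mac has to survive both software upgrades and user-supplied overrides. One way to picture the general idea is to keep shipped defaults and user overrides in separate layers and compute the effective value by overlaying one on the other. The sketch below is a plain Python model of that idea only; it is not the design in moby/datakit#523, and the function names and configuration keys are hypothetical.

```python
def effective_config(defaults, user_overrides):
    """Overlay user-supplied values on top of the shipped defaults."""
    merged = dict(defaults)
    merged.update(user_overrides)
    return merged


def upgrade(new_defaults, user_overrides):
    """Model an upgrade: defaults are replaced wholesale, while the
    user's overrides are carried over untouched."""
    return effective_config(new_defaults, user_overrides)


# Hypothetical keys, for illustration only.
defaults_v1 = {"cpus": "2", "memory": "2048"}
defaults_v2 = {"cpus": "4", "memory": "2048", "swap": "1024"}
user = {"memory": "4096"}  # the user changed a single key

before = effective_config(defaults_v1, user)
after = upgrade(defaults_v2, user)
```

Because the two layers are never mixed, the upgrade picks up the new default for `cpus` and the new `swap` key, while the user's `memory` setting is preserved.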
External Links or Blogs
“Optional Dependencies considered harmful” by Rudi Grinberg explains very well why DataKit and Irmin now have explicit OPAM packages rather than depopts for optional functionality.
TL;DR: The project now has a new home in the Moby Project and the main DataKit repository was renamed. There were three releases this week, of moby/datakit:0.10.0, mirage/irmin:1.1.0 and mirage/ocaml-9p:v0.10.0, with mega build speedups and improvements to the DataKit Continuous Integration engine.
The GitHub bridge was simplified by removing the VFS layer, which was intended to be useful for debugging but rarely used in practice (moby/datakit#535@samoht).
The Continuous Integration subsystem was improved to make the required GitHub scopes configurable to make it easier to deploy on public GitHub repositories with less privilege (moby/datakit#534@talex5@avsm) and to accept SSH keys automatically when autocloning state repositories (moby/datakit#536@talex5).
A regression in Irmin 1.0 was fixed: the Irmin.Tree.diff function reported nested diffs with the wrong path (mirage/irmin#438@samoht).
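To show why paths in nested diffs are easy to get wrong, here is a small Python model of a recursive tree diff over nested dictionaries. It is not Irmin's implementation or API; `diff_trees` and its output format are assumptions made for this sketch. The point is that the accumulated path must be passed down at every level of recursion so each change is reported relative to the root.

```python
_MISSING = object()  # distinguishes "key absent" from a stored None


def _leaves(node, path, kind):
    """Yield every leaf under `node`, tagged with `kind`."""
    if isinstance(node, dict):
        for key in sorted(node):
            yield from _leaves(node[key], path + (key,), kind)
    else:
        yield (path, kind, node)


def diff_trees(old, new, prefix=()):
    """Yield (path, kind, value) for each leaf that differs.

    Trees are nested dicts; any non-dict value is a leaf. `path` is the
    full tuple of keys from the root.
    """
    for key in sorted(set(old) | set(new)):
        path = prefix + (key,)
        a, b = old.get(key, _MISSING), new.get(key, _MISSING)
        if a is _MISSING:
            yield from _leaves(b, path, "added")
        elif b is _MISSING:
            yield from _leaves(a, path, "removed")
        elif isinstance(a, dict) and isinstance(b, dict):
            # Recurse with the accumulated path, not just the local key.
            yield from diff_trees(a, b, path)
        elif isinstance(a, dict) or isinstance(b, dict):
            # A subtree was replaced by a leaf, or the other way round.
            yield from _leaves(a, path, "removed")
            yield from _leaves(b, path, "added")
        elif a != b:
            yield (path, "updated", (a, b))
```

Calling `list(diff_trees(old, new))` on two trees that differ in `etc/hosts` reports the change under the path `("etc", "hosts")` rather than just `("hosts",)`, which is the kind of detail a nested-path regression would get wrong.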
@kayceesrk and @samoht have been discussing the semantics of merging option types in mirage/irmin#421, and concluded that Irmin 1.0.1 has the right semantics. mirage/irmin#422 was closed as a result.
@yomimono and @samoht did a lot of work in the run-up to the MirageOS 3.0 release a few months ago to prepare DataKit for support, and all of that functionality is now present in DataKit 0.10, so moby/datakit#433 and moby/datakit#459 were closed.
While building the mirage-ci, @avsm had reported that invalid job names from the CI resulted in inscrutable errors in the web interface. After some discussion, @talex5 decided that moby/datakit#492 wasn’t worth the complexity, and that it should be addressed elsewhere.
@samoht had put together a design for a dkt CLI tool in moby/datakit#146. While there were many good ideas in the design, it has never been implemented fully, so the PR is closed. Get in touch with @samoht if you’d like to build it!
The C# bindings to DataKit in moby/datakit#334 are also awaiting review from @dgageot and @simonferquel. If you are interested in other language bindings, please create an issue.
@talex5 has opened a WIP PR in mirage/ocaml-9p#118 to upstream some of the 9P patches used by DataKit in production, such as logging exceptions and keep-alive pings.
The documentation is being improved by fixing the build instructions in moby/datakit#537 (@samoht@talex5). More patches to improve the instructions are welcome!
TL;DR: It’s been a quiet week after the previous week’s slew of releases, with mainly packaging fixes and debugging deployments of DataKit CI in LinuxKit.
PRs merged
Build and Packaging:
The automated release infrastructure was improved to support Jbuilder (moby/datakit#543@samoht).
The GitHub bridge version constraints were fixed to improve OPAM installation (moby/datakit#544@samoht).
The test harnesses were also extended to test on OCaml 4.03.0, which is the new minimum supported version of the compiler as of MirageOS 3.0 (moby/datakit#546@samoht).
Functionality improvements:
The example ‘self-ci’ was updated to use the latest release of the GitHub bridge (moby/datakit#545@samoht).
Ongoing activity
@samoht updated the development tree to support the latest versions of the 9P and Lwt libraries (moby/datakit#547@samoht).
@djs55 has fixed the linking of the protocol-9p-unix subpackage to correctly link against io-page.unix (mirage/ocaml-9p#121@djs55).
@talex5 has been debugging a possible regression in the latest releases by investigating why Irmin is reporting all remote branches as having invalid names (mirage/irmin#440@talex5).
Digestif is a standalone cryptographic hashing library.
Decompress is a zlib implementation in pure OCaml.
Both are now in the Mirage GitHub organisation and will be integrated into Irmin and DataKit over the next few months. The first step is to replace the use of camlzip fully, and then switch ocaml-git from using Nocrypto and GMP to the pure OCaml digestif instead. Irmin and DataKit support for both will follow after that.
There is also an experimental new Git packfile encoder and decoder called Sirodepac to let ocaml-git perform compression of repositories more easily, also by @dinosaure.
There has also been significant progress this week on having a filesystem that Irmin and DataKit can use when compiled as unikernels (and hence only have access to a raw block device):
Tom Ridge announced the initial release of a formally verified btree filesystem. It still needs to be integrated with the MirageOS filesystem interfaces, so volunteers who want to glue things together are welcome to participate.
Gabriel de Perthuis also responded to note that he is continuing to work on a flash-optimised pure OCaml filesystem (using hitchhiker trees). It is not quite ready to open-source yet, but he anticipates doing so in the next month or so.
Between both of these efforts, it looks like a fully unikernel-aware, type-safe DataKit that persists onto a block storage device is not too far away.
Finally, we welcome David Udelson, a junior from Cornell who has been selected as a Google Summer of Code student to work on a REST API for Irmin!
Releases
This week saw a minor moby/datakit:0.10.1 release of DataKit to support the latest versions of the 9P and Lwt libraries. Both of these have some backwards-incompatible changes, so the 0.10.1 release of DataKit lets us use the latest features.
The ogit binary was temporarily missing; this is now fixed (git#213@samoht).
Ongoing activity
Work continues on switching to Digestif instead of Nocrypto (which has C stubs) (git#214@dinosaure@samoht).
The 9P interfaces are being made safer against leaking exceptions on read/write (9p#126, 9p#125).
The tests, client and server code have been refactored to make the transport layer abstract. The default is still to use 9P, but this is the first step towards replacing it with gRPC (datakit#551@samoht).
Other reports in this series can be browsed directly in the repository at moby/datakit:/reports.
@samoht this is really excellent, thanks for posting. Regarding the format of these reports, perhaps a good structure would be for the main post topic to briefly explain what DataKit and the associated repositories are (from the README), and then you can just add a weekly comment with the latest dev report?
I’ll also look into generating MirageOS dev reports in the same style, and perhaps we can add a category for updates so that other projects can also participate with their own.
This week saw DataKit add direct client bindings that use Git directly, and
thus do not require a 9P server to be running. This improves performance and
deployment simplicity of DataKit, but removes the intermediate flexibility that
a 9P filesystem offers.
DataKit also features a local filesystem bridge, in addition to the remote
GitHub one, to facilitate desktop development with local repositories. This
filesystem bridge was improved this week by using the latest Irmin release,
with better watch support moby/datakit#577.
Most of these improvements are still in the various master branches of
Irmin and DataKit, and will appear in a release near you soon.
If you are interested in contributing to any of these repositories, then mirage/irmin#415 has a thread about coding standards and getting started
with the Irmin REST API.
PRs merged
Client API:
Golang: separate user config from defaults in the database (moby/datakit#523@djs55@talex5).
Golang: add List func in the Snapshot module to get files in a directory (moby/datakit#578@ebriney).
OCaml/9P: Remove rename API calls in the client, as there were no users and it doesn’t work with 9P directories (moby/datakit#563@talex5@samoht)
OCaml: Renamed Datakit_path into Datakit_client.Path and Datakit_S into Datakit_client.S, to create a Datakit_client namespace (moby/datakit#558@samoht)
OCaml: Added client bindings that use Git directly without a 9P server process (moby/datakit#559@samoht)
Weekly DataKit dev report for 2017-05-29 to 2017-06-04 (week 22)
This report covers weekly developments in the datakit, irmin, git and 9p repositories.
This week saw a significant simplification in how the GitHub bridge is deployed, as it can now run without a server and use a local Git repository directly (datakit#577@talex5@samoht). This in turn simplifies the deployment of DataKit in the various CI services that now use it, such as the LinuxKit Moby project.
The GitHub bridge also adds an owner file to identify the PR creator, which can be used by DataKitCI plugins (datakit#587@samoht).
For those getting started with DataKit and Irmin, @nickbetteridge started an issue on how to get started (irmin#450).
Our GSoC intern @dudelson is also active on irmin#415 about the HTTP REST API (@dbuenzli@dudelson@samoht), and has started
a tracking issue on contributing documentation improvements (irmin#451). This is massively appreciated!
As part of the move to standardise interfaces across Moby components, work is underway on building gRPC and Cap'n Proto RPC libraries so that DataKit can make use of these. Check in with @talex5 if you are interested in contributing.
Releases are now being cut with this functionality, starting with git:1.11.0.